Process Management
task_struct, fork/clone/exec, and the Lifecycle of a Linux Process
On Linux, every running entity — user processes, kernel threads, the idle task, even
the init process — is a task_struct. The kernel doesn't really distinguish
between processes and threads: both are tasks. The "process" abstraction is a side
effect of which resources they share. Two tasks that share an address space are
threads; two that don't are separate processes. The unifying primitive is
clone(); fork() is just clone() with a particular
flag set.
Process lifecycle on Linux is a state machine driven by syscalls and signals: a task is created, runs, blocks for I/O, gets scheduled out, eventually exits, becomes a zombie until reaped, and finally vanishes from the process table. Understanding each transition — and the bookkeeping the kernel does at each one — is the foundation for everything else: containers, debuggers, supervisors, and crash recovery.
Lifecycle Diagram
From clone() to reaped, with every state transition the scheduler may make.
task_struct: The Heart of It
Every task in the kernel is one giant struct. It's defined in include/linux/sched.h
and contains hundreds of fields. The major ones:
struct task_struct {
unsigned int __state; /* TASK_RUNNING, TASK_INTERRUPTIBLE, ... */
void *stack; /* kernel stack pointer */
refcount_t usage;
int prio, static_prio, normal_prio;
struct sched_entity se; /* CFS scheduling entity */
struct mm_struct *mm; /* address space (NULL for kernel threads) */
pid_t pid; /* the task's PID (really TID) */
pid_t tgid; /* thread group leader's PID (the "process" PID) */
struct task_struct __rcu *real_parent; /* real parent at clone() time */
struct task_struct __rcu *parent; /* recipient of SIGCHLD (may differ after PTRACE) */
struct list_head children; /* list of children */
struct list_head sibling; /* link in parent's children list */
struct task_struct *group_leader; /* thread group leader */
struct files_struct *files; /* open file descriptor table */
struct fs_struct *fs; /* CWD, root, umask */
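struct signal_struct *signal; /* process-wide signal state shared by the thread group */
struct sighand_struct *sighand; /* signal handler table (shared under CLONE_SIGHAND) */
struct nsproxy *nsproxy; /* PID/MNT/NET/IPC/UTS namespaces (the user namespace is reached via cred) */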
const struct cred __rcu *cred; /* UID, GID, capabilities */
char comm[TASK_COMM_LEN]; /* 16-byte name */
/* ... hundreds more fields ... */
};
Notice pid vs tgid. The kernel's pid is what userspace
calls a thread ID (gettid()). The kernel's tgid is what userspace calls
the process ID (getpid()). All threads in a process share a tgid; their pid differs.
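The pid/tgid split can be observed from userspace. The following is a minimal sketch, assuming a glibc system where pthread_create() is available; the struct task_ids type and the ids_from_new_thread() helper are hypothetical names introduced only for this illustration. It records both identifiers from inside a freshly created thread.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Identifiers as seen from one thread (hypothetical type for this sketch). */
struct task_ids {
    pid_t pid;  /* getpid(): what the kernel calls tgid */
    pid_t tid;  /* gettid(): what the kernel calls pid  */
};

/* Thread body: record this thread's identifiers into the caller's struct. */
static void *record_ids(void *out)
{
    struct task_ids *ids = out;
    ids->pid = getpid();
    /* Raw syscall so the sketch also builds on glibc without a gettid() wrapper. */
    ids->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Hypothetical helper: run record_ids() on a new thread; returns 0 on success. */
int ids_from_new_thread(struct task_ids *ids)
{
    pthread_t thread;
    if (pthread_create(&thread, NULL, record_ids, ids) != 0)
        return -1;
    return pthread_join(thread, NULL) == 0 ? 0 : -1;
}
```

Called from the main thread, the recorded pid should equal the caller's getpid(), while the recorded tid should differ from it, because the new thread is a separate task in the same thread group.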
fork, clone, and clone3
Three syscalls, one mechanism. All three end up in the kernel's kernel_clone()
function. The differences are in how arguments are passed.
/* fork() — POSIX heritage, no flags */
pid_t pid = fork();
if (pid == 0) { /* child */ }
else if (pid > 0) { /* parent, pid is child's PID */ }
/* clone() — full control over what's shared */
int flags = CLONE_VM | CLONE_FS | CLONE_FILES |
CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
/* exactly the flags glibc uses for pthread_create() */
/* clone3(): newer struct-based interface (Linux 5.3+), shown here with CLONE_PIDFD */
int pidfd = -1; /* the kernel stores the new pidfd here */
struct clone_args args = {
.flags = CLONE_PIDFD,
.pidfd = (uint64_t)(uintptr_t)&pidfd,
.exit_signal = SIGCHLD,
};
syscall(SYS_clone3, &args, sizeof(args));
Important CLONE_* flags:
- CLONE_VM: share the address space (this makes it a thread)
- CLONE_FILES: share the file descriptor table
- CLONE_FS: share CWD, umask, root
- CLONE_SIGHAND: share signal handlers (must combine with CLONE_VM)
- CLONE_THREAD: make it part of the same thread group (same tgid, share signals)
- CLONE_NEWNS/NEWPID/NEWNET/NEWUSER: new namespace (containers)
- CLONE_PIDFD: allocate a pidfd referring to the child
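To make the effect of CLONE_VM concrete, here is a minimal sketch using the glibc clone() wrapper. The value_after_child() helper, the shared_value variable, and the 64 KiB stack size are assumptions made for this illustration, not anything the kernel prescribes. The child writes to a global; whether the parent sees the write depends only on whether the address space is shared.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Lives in the parent's data segment; volatile so the parent re-reads it. */
static volatile int shared_value;

/* Child entry point: overwrite the value, then exit with status 0. */
static int child_main(void *arg)
{
    (void)arg;
    shared_value = 42;
    return 0;
}

/*
 * Hypothetical helper: set shared_value to 1, run child_main() in a task
 * created with clone(extra_flags | SIGCHLD), and return the value the
 * parent observes after the child has exited (-1 on error).
 */
int value_after_child(int extra_flags)
{
    enum { STACK_SIZE = 64 * 1024 };
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    shared_value = 1;
    /* The stack grows down on common architectures, so pass its top. */
    pid_t child = clone(child_main, stack + STACK_SIZE,
                        extra_flags | SIGCHLD, NULL);
    if (child < 0) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(child, &status, 0);
    free(stack);
    return reaped == child ? shared_value : -1;
}
```

With CLONE_VM the parent should observe the child's write; with no extra flags the child works on a copy-on-write duplicate, as after fork(), and the parent's value stays at 1.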
The Seven States You'll Actually See
| State | Letter | Meaning | Kill -9? |
|---|---|---|---|
| TASK_RUNNING | R | Currently running on CPU or on a runqueue waiting for CPU | Yes |
| TASK_INTERRUPTIBLE | S | Sleeping waiting for event (most idle processes) | Yes |
| TASK_UNINTERRUPTIBLE | D | Blocked in kernel I/O, signals deferred | No (delivery is deferred until the wait ends) |
| TASK_STOPPED | T | SIGSTOP / SIGTSTP received | Yes (SIGKILL acts even while stopped) |
| TASK_TRACED | t | Stopped by ptrace (debugger attached) | Yes |
| EXIT_ZOMBIE | Z | Exited, awaiting wait() from parent | Already dead |
| EXIT_DEAD | X | Reaped, about to disappear (rarely seen) | — |
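The state letter in the table is the third field of /proc/<pid>/stat. The sketch below reads it; task_state() is a hypothetical helper name, and the code assumes procfs is mounted at /proc.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return the state letter of a task, or '?' on error. */
char task_state(pid_t pid)
{
    char path[64], line[1024];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '?';
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL)
        return '?';

    /*
     * Format: "pid (comm) state ...". comm may itself contain spaces or
     * ')', so locate the last ')' instead of splitting on spaces.
     */
    char *close_paren = strrchr(line, ')');
    if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}
```

A process that inspects itself this way is on a CPU while it reads the file, so it should see R for its own PID.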
Zombies, Orphans, and Reaping
When a process exits, its task_struct sticks around in EXIT_ZOMBIE until the
parent calls one of the wait() family of syscalls to read the exit code. Unless the
parent has set SIGCHLD to SIG_IGN, reaping is the only way to free the slot.
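The window between exit and reaping can be observed directly. The following is a sketch under stated assumptions: observe_zombie() and struct zombie_report are hypothetical names, and the code relies on waitid() with WNOWAIT to wait for the exit without reaping, and on kill(pid, 0) as an existence check.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* What the parent observed (hypothetical type for this sketch). */
struct zombie_report {
    int alive_before_reap;  /* 1 if the PID still existed while a zombie */
    int gone_after_reap;    /* 1 if the PID was gone after waitpid()     */
    int exit_code;          /* exit status collected by the parent       */
};

/* Hypothetical helper: returns 0 on success, -1 on error. */
int observe_zombie(struct zombie_report *out)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(7);  /* child: exit immediately with a recognizable code */

    /* Block until the child has exited, but leave it unreaped (WNOWAIT). */
    siginfo_t info;
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) != 0)
        return -1;

    /* Signal 0 only checks existence: the zombie still owns its PID. */
    out->alive_before_reap = (kill(child, 0) == 0);

    /* Reap: this collects the status and releases the process-table slot. */
    int status;
    if (waitpid(child, &status, 0) != child)
        return -1;
    out->exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;

    /* Once reaped, the PID no longer names any task. */
    out->gone_after_reap = (kill(child, 0) == -1 && errno == ESRCH);
    return 0;
}
```

The sketch assumes the caller has not set SIGCHLD to SIG_IGN, since in that case the kernel would reap the child on its own and waitid() would fail.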
# See zombies on a system
$ ps -eo pid,ppid,state,comm | awk '$3=="Z"'
3142 1742 Z [chrome] <defunct>
# Reaping in C
int status;
pid_t kid = waitpid(-1, &status, WNOHANG); /* non-blocking */
if (kid > 0) {
if (WIFEXITED(status)) printf("exit %d\n", WEXITSTATUS(status));
if (WIFSIGNALED(status)) printf("signal %d\n", WTERMSIG(status));
}
/* Or just ignore SIGCHLD: the kernel auto-reaps */
signal(SIGCHLD, SIG_IGN); /* portable but loses exit status */
Orphans are children whose parent died first. They get re-parented to
either (a) the nearest ancestor that called prctl(PR_SET_CHILD_SUBREAPER, 1),
or (b) PID 1 (init/systemd) if no subreaper exists. Subreapers are how systemd's
service supervisors keep tabs on their entire process tree without making everyone in
the tree explicitly aware of them.
/* Become a subreaper for your descendants */
prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);
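Re-parenting to a subreaper can be checked with a three-generation experiment. This is a minimal sketch, not how any particular supervisor is implemented; parent_seen_by_orphan() is a hypothetical helper, and the calling process stays marked as a subreaper after it returns.

```c
#define _GNU_SOURCE
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the parent PID that an orphaned grandchild
 * sees after its original parent has exited, or -1 on error.
 */
pid_t parent_seen_by_orphan(void)
{
    int report[2];
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0 || pipe(report) != 0)
        return -1;

    pid_t middle = fork();
    if (middle < 0)
        return -1;
    if (middle == 0) {
        pid_t middle_pid = getpid();
        pid_t grandchild = fork();
        if (grandchild == 0) {
            /* Grandchild: wait until the middle process is gone and the
             * kernel has re-parented us, then report the new parent. */
            while (getppid() == middle_pid)
                usleep(1000);
            pid_t new_parent = getppid();
            ssize_t n = write(report[1], &new_parent, sizeof new_parent);
            _exit(n == (ssize_t)sizeof new_parent ? 0 : 1);
        }
        _exit(0);  /* middle process exits at once, orphaning the grandchild */
    }

    close(report[1]);
    pid_t reported = -1;
    if (read(report[0], &reported, sizeof reported) != (ssize_t)sizeof reported)
        reported = -1;
    close(report[0]);

    /* Reap the middle process and the adopted grandchild. */
    while (wait(NULL) > 0)
        ;
    return reported;
}
```

If the subreaper flag took effect, the value returned should be the caller's own PID rather than 1, and the final wait() loop reaps the adopted grandchild like any direct child.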
/* Auto-die when parent dies (per-thread; cleared in fork() children, kept across execve() unless the binary is setuid/setgid or has file capabilities) */
prctl(PR_SET_PDEATHSIG, SIGTERM, 0, 0, 0);
pidfd: PIDs Without the Race
Traditional kill(pid, sig) suffers from a race: between you reading the pid and calling kill, the original process can die and the PID can be reused. pidfd fixes this.
/* Get a pidfd for an existing process */
int pidfd = syscall(SYS_pidfd_open, target_pid, 0);
/* Or get one at clone time (race-free) */
int pidfd;
struct clone_args args = { .flags = CLONE_PIDFD, .pidfd = (uint64_t)&pidfd,
.exit_signal = SIGCHLD };
pid_t kid = syscall(SYS_clone3, &args, sizeof(args));
/* Send signal — guaranteed to go to the original process or fail */
syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);
/* Wait for it to die, with poll/epoll */
struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
poll(&pfd, 1, -1); /* returns when child exits */
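Put together, the open, signal, and poll steps look roughly like the sketch below. It assumes a kernel and headers recent enough to define SYS_pidfd_open and SYS_pidfd_send_signal; kill_child_via_pidfd() is a hypothetical helper, and SIGKILL is used instead of SIGTERM only so the outcome does not depend on inherited signal dispositions.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start a child, kill it through a pidfd, and return
 * the signal number that terminated it (-1 on error).
 */
int kill_child_via_pidfd(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        for (;;)
            pause();  /* child: sleep until a signal arrives */
    }

    /* The child is still unreaped, so its PID cannot have been recycled. */
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    if (pidfd < 0) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return -1;
    }

    int sent = syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) == 0;
    if (!sent)
        kill(child, SIGKILL);  /* fall back so the child never lingers */

    /* A pidfd becomes readable once the process has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int ready = poll(&pfd, 1, 5000);

    int status = 0;
    pid_t reaped = waitpid(child, &status, 0);
    close(pidfd);

    if (!sent || ready != 1 || reaped != child || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```

Because the pidfd is an ordinary file descriptor, the same poll() call could sit in an existing epoll loop next to sockets and timers.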
/* Duplicate FD 5 from another process (needs ptrace-attach permission over the target) */
int their_fd = syscall(SYS_pidfd_getfd, pidfd, 5, 0);
The /proc/<pid>/ Directory
Linux exposes per-process kernel state through procfs. Every running task has a directory
under /proc/<pid>/. The same data is available under
/proc/<pid>/task/<tid>/ for individual threads.
| Path | What it tells you |
|---|---|
| cmdline | argv[], NUL-separated |
| comm | 16-byte process name (writable) |
| status | Human-readable summary: state, UID, threads, capabilities |
| stat | Single line, ~50 numeric fields. Source for ps and top |
| maps | Memory mappings (VMAs). Key for understanding memory layout |
| smaps | Per-VMA detailed memory stats (RSS, PSS, swap) |
| fd/ | Symlinks for each open file descriptor |
| fdinfo/ | Per-FD position, flags, mount ID |
| environ | Initial environment, NUL-separated |
| cgroup | Path within cgroup hierarchy (controller v2) |
| ns/ | Symlinks identifying each namespace |
| oom_score / oom_score_adj | OOM killer's view; tunable |
| limits | RLIMIT_* values |
| sched | Scheduler stats (vruntime, runtime, wait time) |
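Since these files are plain text, reading them needs nothing beyond stdio. As a minimal sketch, the hypothetical status_field() helper below extracts one numeric field from /proc/<pid>/status, assuming the usual "Name:<tab>value" line layout.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return the numeric value of a "Name:\tvalue" line
 * in /proc/<pid>/status (for example "Tgid", "PPid", "Threads"), or -1
 * if the file or the field cannot be read.
 */
long status_field(pid_t pid, const char *name)
{
    char path[64], line[512];
    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long value = -1;
    size_t name_len = strlen(name);
    while (fgets(line, sizeof line, f) != NULL) {
        /* Match "<name>:" exactly at the start of the line. */
        if (strncmp(line, name, name_len) == 0 && line[name_len] == ':') {
            if (sscanf(line + name_len + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

For example, status_field(getpid(), "Threads") reports how many tasks share the caller's thread group, and the "Tgid" field matches what getpid() returns.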
Tradeoffs
Advantages:
- One kernel mechanism (clone) covers processes, threads, and containers
- Procfs makes runtime introspection trivial: no special libraries needed
- pidfd eliminates a 30-year-old class of PID-reuse races
- fork() with copy-on-write is fast; only modified pages are duplicated
Drawbacks:
- D state is genuinely uninterruptible: a stuck NFS mount can wedge processes indefinitely
- Thread groups make signal semantics subtle (signal masks are per-thread, signal handlers are shared)
- fork() in multi-threaded programs is dangerous: the child may call only async-signal-safe functions before exec()
- PID exhaustion is a real production failure mode for fork-bomb-shaped bugs
Frequently Asked Questions
What's the difference between fork() and clone()?
fork() is the historical UNIX call that duplicates the calling process. On modern Linux, glibc's fork() is implemented as a thin wrapper over clone() with a fixed flag set (no shared address space, no shared file descriptors, no shared signal handlers). clone() exposes the full power: you can decide which resources are shared between parent and child by passing flags like CLONE_VM, CLONE_FILES, CLONE_FS, CLONE_SIGHAND, CLONE_THREAD, CLONE_NEWPID. A POSIX thread is just a clone() call with most CLONE_* flags set. Containers use clone() with CLONE_NEW* flags to create new namespaces.
Why does a zombie process exist?
A zombie is a process that has finished executing (it called exit() or was killed) but whose entry in the kernel process table has not been reclaimed. The kernel keeps the entry around so the parent can read the exit status via wait()/waitpid()/waitid(). Until the parent reaps the child, the PID and the task_struct remain, although the address space, open files, and most other resources have already been released. Zombies consume almost no memory, but they consume PIDs, and PIDs are a finite resource: kernel.pid_max is 32768 on legacy configurations and can be raised to at most 4194304 (about 4 million) on 64-bit systems, which many modern distributions do. A bug that forks without waiting can exhaust PIDs and prevent new processes from starting.
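The PID ceiling on a given machine is exposed through procfs. A small sketch, assuming /proc/sys/kernel/pid_max is readable; read_pid_max() is a hypothetical helper name.

```c
#include <stdio.h>

/* Hypothetical helper: return kernel.pid_max, or -1 if it cannot be read. */
long read_pid_max(void)
{
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if (f == NULL)
        return -1;

    long pid_max = -1;
    if (fscanf(f, "%ld", &pid_max) != 1)
        pid_max = -1;
    fclose(f);
    return pid_max;
}
```

Comparing this value against the number of entries under /proc gives a rough idea of how much headroom remains before fork() starts failing with EAGAIN.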
How does init reap orphans?
When a process dies before its children, those children are re-parented. Historically they all went to PID 1 (init). Since Linux 3.4, any process can register itself as a 'subreaper' with prctl(PR_SET_CHILD_SUBREAPER, 1, ...); orphans then go to the nearest ancestor subreaper instead of PID 1. systemd uses this so that service supervisors reap their own descendants. Without a subreaper, services that fork-and-forget leave orphans parented to init, which reaps them in its main loop once they exit.
What is pidfd and why does it exist?
A pidfd is a file descriptor that refers to a process. It is created via pidfd_open(pid, 0) or returned from clone3() with CLONE_PIDFD. Unlike a raw PID, a pidfd doesn't suffer from PID reuse races: if process 1234 dies and PID 1234 gets reused for an unrelated process, your pidfd still refers to the original (now dead) one. You can wait on it with poll/epoll, send signals with pidfd_send_signal(), or duplicate one of the target's file descriptors into your own process with pidfd_getfd(). Modern container runtimes use pidfds extensively.
What's the D state and why can't I kill it?
D is 'uninterruptible sleep': the process is blocked in the kernel waiting for something (usually I/O) that the kernel expects to complete shortly. Signals are not delivered until the wait finishes, which is why kill -9 doesn't work. Common causes: NFS server unreachable, broken disk, stuck device driver. If a process is stuck in D for minutes, the I/O subsystem is wedged and the only fix is usually fixing whatever the kernel is waiting on (or rebooting). Linux added a killable variant (TASK_KILLABLE) in 2.6.25 for some paths so that SIGKILL, and only a fatal signal, can interrupt the wait.
What does PR_SET_PDEATHSIG do?
It tells the kernel to send a specific signal to the calling process when its parent dies. Useful for child processes that should not outlive their parent, e.g., a worker that should exit if the supervisor crashes. Without it, a worker becomes orphaned and re-parented to init/subreaper, potentially running forever. Caveat: the signal is delivered when the parent thread that called clone() dies, not necessarily the entire parent process, so multi-threaded parents have edge cases. The setting is per-thread, is cleared in the child of fork(), and survives execve() unless the new binary is set-user-ID, set-group-ID, or has file capabilities.
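The supervisor/worker case can be reproduced in a few dozen lines. This is a sketch under stated assumptions: worker_death_signal() is a hypothetical helper, the caller makes itself a subreaper purely so that it can wait for the orphaned worker, and two pipes are used so the supervisor exits only after the worker has armed its parent-death signal.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the signal that killed a worker after its
 * supervisor (its parent) exited, or -1 on error.
 */
int worker_death_signal(void)
{
    int armed[2], report[2];
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0 ||
        pipe(armed) != 0 || pipe(report) != 0)
        return -1;

    pid_t supervisor = fork();
    if (supervisor < 0)
        return -1;
    if (supervisor == 0) {
        close(report[0]);
        pid_t worker = fork();
        if (worker == 0) {
            close(armed[0]);
            close(report[1]);
            signal(SIGTERM, SIG_DFL);
            /* Arm the parent-death signal, then tell the supervisor. */
            if (prctl(PR_SET_PDEATHSIG, SIGTERM, 0, 0, 0) != 0 ||
                write(armed[1], "x", 1) != 1)
                _exit(1);
            for (;;)
                pause();  /* worker: run until signalled */
        }
        /* Supervisor: wait for arming, report the worker's PID, then die. */
        close(armed[1]);
        char byte;
        int ok = worker > 0 && read(armed[0], &byte, 1) == 1 &&
                 write(report[1], &worker, sizeof worker) ==
                     (ssize_t)sizeof worker;
        _exit(ok ? 0 : 1);
    }

    close(armed[0]);
    close(armed[1]);
    close(report[1]);
    pid_t worker = -1;
    if (read(report[0], &worker, sizeof worker) != (ssize_t)sizeof worker)
        worker = -1;
    close(report[0]);

    /* Once the supervisor is reaped, the worker has been re-parented to us. */
    waitpid(supervisor, NULL, 0);
    if (worker <= 0)
        return -1;
    int status;
    if (waitpid(worker, &status, 0) != worker)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The handshake matters: if the supervisor exited before the worker called prctl(), no signal would be sent, which is why real workers usually also compare getppid() against the expected parent right after arming.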