Process Management

task_struct, fork/clone/exec, and the Lifecycle of a Linux Process

On Linux, every running entity — user processes, kernel threads, the idle task, even the init process — is a task_struct. The kernel doesn't really distinguish between processes and threads: both are tasks. The "process" abstraction is a side effect of which resources they share. Two tasks that share an address space are threads; two that don't are separate processes. The unifying primitive is clone(); fork() is just clone() with a particular flag set.

Process lifecycle on Linux is a state machine driven by syscalls and signals: a task is created, runs, blocks for I/O, gets scheduled out, eventually exits, becomes a zombie until reaped, and finally vanishes from the process table. Understanding each transition — and the bookkeeping the kernel does at each one — is the foundation for everything else: containers, debuggers, supervisors, and crash recovery.

Lifecycle Diagram

From clone() to reaped, with every state transition the scheduler may make.

Key Numbers

~9.7 KB

size of one task_struct on x86_64 (varies)

4,194,304

kernel.pid_max ceiling on 64-bit, and the value recent systemd sets (the kernel's built-in default is 32,768)

CLONE_* flags exposed to userspace

8 KB / 16 KB

kernel stack per task (THREAD_SIZE)

7

documented process states (R/S/D/T/t/Z/X)

~3 µs

cost of a fork() of a small process

PID 1

init/systemd: ultimate orphan reaper

task_struct: The Heart of It

Every task in the kernel is one giant struct. It's defined in include/linux/sched.h and contains hundreds of fields. The major ones:

struct task_struct {
    unsigned int            __state;        /* TASK_RUNNING, TASK_INTERRUPTIBLE, ... */
    void                    *stack;         /* kernel stack pointer */
    refcount_t              usage;
    int                     prio, static_prio, normal_prio;
    struct sched_entity     se;             /* CFS scheduling entity */
    struct mm_struct        *mm;            /* address space (NULL for kernel threads) */
    pid_t                   pid;            /* the task's PID (really TID) */
    pid_t                   tgid;           /* thread group leader's PID (the "process" PID) */
    struct task_struct __rcu *real_parent;  /* real parent at clone() time */
    struct task_struct __rcu *parent;       /* recipient of SIGCHLD (may differ after PTRACE) */
    struct list_head        children;       /* list of children */
    struct list_head        sibling;        /* link in parent's children list */
    struct task_struct      *group_leader;  /* thread group leader */
    struct files_struct     *files;         /* open file descriptor table */
    struct fs_struct        *fs;            /* CWD, root, umask */
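    struct signal_struct    *signal;        /* thread-group-wide signal state (shared pending signals, rlimits) */
    struct sighand_struct   *sighand;       /* signal handler table (shared under CLONE_SIGHAND) */
    struct nsproxy          *nsproxy;       /* MNT/UTS/IPC/NET/PID/cgroup/time namespaces (user ns lives in cred) */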
    const struct cred __rcu *cred;          /* UID, GID, capabilities */
    char                    comm[TASK_COMM_LEN];  /* 16-byte name */
    /* ... hundreds more fields ... */
};

Notice pid vs tgid. The kernel's pid is what userspace calls a thread ID (gettid()). The kernel's tgid is what userspace calls the process ID (getpid()). All threads in a process share a tgid; their pid differs.
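The pid/tgid split can be observed from userspace. The following is a minimal sketch, assuming a glibc system with pthreads (older toolchains may need `-pthread` at link time); `struct task_ids`, `record_ids()` and `compare_thread_ids()` are hypothetical names introduced only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 /* for syscall() under strict -std modes */
#endif
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* IDs observed from inside one thread (hypothetical helper type). */
struct task_ids {
    pid_t tgid;   /* what getpid() returns: the kernel's tgid */
    pid_t tid;    /* what gettid() returns: the kernel's pid  */
};

static void *record_ids(void *arg)
{
    struct task_ids *ids = arg;
    ids->tgid = getpid();
    /* Raw syscall so this also builds on glibc without a gettid() wrapper. */
    ids->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Fill `caller_ids` from the calling thread and `worker_ids` from a new
 * thread. Returns 0 on success, -1 if the thread could not be created. */
int compare_thread_ids(struct task_ids *caller_ids, struct task_ids *worker_ids)
{
    pthread_t worker;

    record_ids(caller_ids);
    if (pthread_create(&worker, NULL, record_ids, worker_ids) != 0)
        return -1;
    pthread_join(worker, NULL);
    return 0;
}
```

Called from the main thread, both structs should carry the same `tgid`, the two `tid` values should differ, and the main thread's `tid` should equal its `tgid` because it is the thread group leader.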

fork, clone, and clone3

Three syscalls, one mechanism. All three end up in the kernel's kernel_clone() function. The differences are in how arguments are passed.

/* fork(): POSIX heritage, no flags */
pid_t pid = fork();
if (pid == 0) { /* child */ }
else if (pid > 0) { /* parent, pid is child's PID */ }
else { /* pid == -1: fork failed, errno says why */ }

/* clone() — full control over what's shared */
int flags = CLONE_VM | CLONE_FS | CLONE_FILES |
            CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
            CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
/* exactly the flags glibc uses for pthread_create() */

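/* clone3(): newer struct-based interface (CLONE_PIDFD also works with clone() since Linux 5.2) */
int pidfd = -1;                              /* filled in by the kernel, in the parent */
struct clone_args args = {
    .flags = CLONE_PIDFD,
    .pidfd = (uint64_t)(uintptr_t)&pidfd,    /* pointer passed as a 64-bit integer */
    .exit_signal = SIGCHLD,
};
syscall(SYS_clone3, &args, sizeof(args));    /* returns 0 in the child, the child's PID in the parent */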

Important CLONE_* flags:

CLONE_VM — share the address space (this makes it a thread)
CLONE_FILES — share the file descriptor table
CLONE_FS — share CWD, umask, root
CLONE_SIGHAND — share signal handlers (must combine with CLONE_VM)
CLONE_THREAD — make it part of the same thread group (same tgid, share signals)
CLONE_NEWNS / NEWPID / NEWNET / NEWUSER — new namespace (containers)
CLONE_PIDFD — allocate a pidfd referring to the child
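To make the effect of a sharing flag concrete, here is a small sketch that runs the same child function once without CLONE_VM (fork-like, copy-on-write) and once with it. It assumes glibc's `clone()` wrapper; `CHILD_STACK_SIZE`, `shared_counter`, `bump_counter()` and `counter_after_child()` are hypothetical names made up for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                    /* for clone() and the CLONE_* constants */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (256 * 1024)  /* arbitrary size chosen for the sketch */

static volatile int shared_counter;    /* written by the child, read by the parent */

static int bump_counter(void *arg)
{
    (void)arg;
    shared_counter = 42;
    return 0;                          /* becomes the child's exit status */
}

/* Run bump_counter() in a child created with `extra_flags | SIGCHLD` and
 * return the value of shared_counter the parent sees afterwards (-1 on error). */
int counter_after_child(int extra_flags)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    shared_counter = 0;
    /* The stack grows down on the common architectures, so pass its top. */
    pid_t kid = clone(bump_counter, stack + CHILD_STACK_SIZE,
                      extra_flags | SIGCHLD, NULL);
    if (kid == -1) {
        free(stack);
        return -1;
    }
    waitpid(kid, NULL, 0);             /* the child has fully exited after this */
    free(stack);
    return shared_counter;
}
```

With `extra_flags` set to 0 the child writes to its own copy of the page and the parent should still read 0; with `CLONE_VM` both tasks use the same memory and the parent should read 42. The child function deliberately does nothing but a store, since a CLONE_VM child without CLONE_SETTLS shares the parent's thread-local storage.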

The Seven Task States

State	Letter	Meaning	Kill -9?
TASK_RUNNING	R	Currently running on CPU or on a runqueue waiting for CPU	Yes
TASK_INTERRUPTIBLE	S	Sleeping waiting for event (most idle processes)	Yes
TASK_UNINTERRUPTIBLE	D	Blocked in kernel I/O, signals deferred	No (not delivered until the wait ends)
TASK_STOPPED	T	SIGSTOP / SIGTSTP received	Yes (SIGKILL needs no SIGCONT first)
TASK_TRACED	t	Stopped by ptrace (debugger attached)	Yes
EXIT_ZOMBIE	Z	Exited, awaiting wait() from parent	Already dead
EXIT_DEAD	X	Reaped, about to disappear (rarely seen)	—
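The state letters in the table are what the kernel prints as the third field of /proc/<pid>/stat. A minimal sketch of reading that letter follows; `task_state_letter()` is a hypothetical helper name, and the code assumes procfs is mounted at /proc.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Return the one-letter state of `pid` as reported in /proc/<pid>/stat,
 * or '?' if the file cannot be read or parsed. */
char task_state_letter(pid_t pid)
{
    char path[64], line[1024];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '?';
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL)
        return '?';

    /* Format: "pid (comm) S ...". comm may itself contain spaces or ')',
     * so search for the LAST ')' instead of splitting on whitespace. */
    char *close_paren = strrchr(line, ')');
    if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}
```

A process that inspects itself this way is on a CPU while it reads the file, so it reports `R`; pointing the helper at a sleeping daemon would typically give `S`.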

Zombies, Orphans, and Reaping

When a process exits, its task_struct sticks around in EXIT_ZOMBIE until the parent calls one of the wait() family of syscalls to read the exit code. Short of the parent ignoring SIGCHLD, or exiting itself so that a new parent does the reaping, that's the only way to free the slot.

# See zombies on a system
$ ps -eo pid,ppid,state,comm | awk '$3=="Z"'
  3142  1742 Z [chrome] <defunct>

/* Reaping in C: loop, because one SIGCHLD can stand for several exited children */
int status;
pid_t kid;
while ((kid = waitpid(-1, &status, WNOHANG)) > 0) {   /* non-blocking */
    if (WIFEXITED(status))   printf("exit %d\n", WEXITSTATUS(status));
    if (WIFSIGNALED(status)) printf("signal %d\n", WTERMSIG(status));
}

/* Or just ignore SIGCHLD: the kernel auto-reaps */
signal(SIGCHLD, SIG_IGN);   /* portable but loses exit status */

Orphans are children whose parent died first. They get re-parented to either (a) the nearest ancestor that called prctl(PR_SET_CHILD_SUBREAPER, 1), or (b) PID 1 (init/systemd) if no subreaper exists. Subreapers are how systemd's service supervisors keep tabs on their entire process tree without making everyone in the tree explicitly aware of them.

/* Become a subreaper for your descendants */
prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);

/* Auto-die when parent dies (per-thread; cleared in a fork() child, kept across a normal execve()) */
prctl(PR_SET_PDEATHSIG, SIGTERM, 0, 0, 0);
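Both prctl() calls can be exercised together. In this sketch the caller becomes a subreaper, an intermediate process forks a worker and then dies, and the worker (which asked for SIGTERM on parent death) is re-parented to the caller and reaped there. `run_worker()` and `orphan_worker_fate()` are hypothetical names, and the pipe handshake is just one assumed way to make sure the worker is armed before its parent exits.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker: ask for SIGTERM when the parent dies, then idle until it arrives. */
static void run_worker(pid_t expected_parent, int ready_fd)
{
    signal(SIGTERM, SIG_DFL);              /* make sure SIGTERM is fatal here */
    if (prctl(PR_SET_PDEATHSIG, SIGTERM, 0, 0, 0) != 0)
        _exit(2);
    /* Close the race: the parent may have died before prctl() took effect. */
    if (getppid() != expected_parent)
        _exit(3);
    if (write(ready_fd, "r", 1) != 1)      /* tell the parent we are armed */
        _exit(4);
    for (;;)
        pause();
}

/* Returns the signal that terminated the orphaned worker (expected: SIGTERM),
 * 0 if no child was seen to die from a signal, or -1 on setup failure. */
int orphan_worker_fate(void)
{
    int ready[2];
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0 || pipe(ready) != 0)
        return -1;

    pid_t middle = fork();
    if (middle < 0)
        return -1;
    if (middle == 0) {                     /* short-lived intermediate parent */
        pid_t me = getpid();
        pid_t worker = fork();
        if (worker < 0)
            _exit(1);
        if (worker == 0)
            run_worker(me, ready[1]);      /* never returns */
        close(ready[1]);
        char byte;
        /* Wait until the worker is armed, then die and orphan it. */
        _exit(read(ready[0], &byte, 1) == 1 ? 0 : 1);
    }
    close(ready[0]);
    close(ready[1]);

    /* As subreaper we inherit the worker, so wait() sees both descendants. */
    int status, fatal_signal = 0;
    while (wait(&status) > 0)
        if (WIFSIGNALED(status))
            fatal_signal = WTERMSIG(status);

    prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);   /* restore the default */
    return fatal_signal;
}
```

Without the PR_SET_CHILD_SUBREAPER call the worker would be handed to PID 1 instead, and the caller would never learn how it ended.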

pidfd: PIDs Without the Race

Traditional kill(pid, sig) suffers from a race: between you reading the pid and calling kill, the original process can die and the PID can be reused. pidfd fixes this.

/* Get a pidfd for an existing process */
int pidfd = syscall(SYS_pidfd_open, target_pid, 0);

/* Or get one at clone time (race-free) */
int child_pidfd = -1;
struct clone_args args = { .flags = CLONE_PIDFD, .pidfd = (uint64_t)(uintptr_t)&child_pidfd,
                            .exit_signal = SIGCHLD };
pid_t kid = syscall(SYS_clone3, &args, sizeof(args));

/* Send signal — guaranteed to go to the original process or fail */
syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);

/* Wait for it to die, with poll/epoll */
struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
poll(&pfd, 1, -1);   /* returns when child exits */

/* Duplicate FD 5 from another process (needs ptrace-attach permission over it, e.g. CAP_SYS_PTRACE) */
int their_fd = syscall(SYS_pidfd_getfd, pidfd, 5, 0);
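The fragments above can be combined into one end-to-end sketch: open a pidfd for a child, signal it through the pidfd, wait for readability with poll(), and reap. This assumes Linux 5.3 or later with matching kernel headers (so that `SYS_pidfd_open` and `SYS_pidfd_send_signal` are defined) and an environment that does not filter these syscalls; `terminate_via_pidfd()` is a hypothetical name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                         /* for syscall() under strict -std modes */
#endif
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that blocks, terminate it through a pidfd, and return the
 * signal that killed it (expected: SIGTERM). Returns -1 on failure, for
 * example on kernels that lack pidfd_open(). */
int terminate_via_pidfd(void)
{
    pid_t kid = fork();
    if (kid < 0)
        return -1;
    if (kid == 0) {
        signal(SIGTERM, SIG_DFL);           /* make sure SIGTERM is fatal here */
        for (;;)
            pause();
    }

    /* No race here: the PID of an unreaped child cannot be recycled. */
    int pidfd = (int)syscall(SYS_pidfd_open, kid, 0);
    int result = -1;
    if (pidfd >= 0 &&
        syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0) == 0) {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        if (poll(&pfd, 1, 5000) == 1 && (pfd.revents & POLLIN))
            result = 0;                     /* readable pidfd: the child is dead */
    }
    if (pidfd >= 0)
        close(pidfd);
    if (result != 0)
        kill(kid, SIGKILL);                 /* fallback so waitpid() cannot hang */

    int status;
    if (waitpid(kid, &status, 0) != kid)
        return -1;
    return (result == 0 && WIFSIGNALED(status)) ? WTERMSIG(status) : -1;
}
```

The poll() step is what makes pidfds convenient in event loops: child exit becomes one more readable descriptor next to sockets and timers, instead of a SIGCHLD handler.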

The /proc/<pid>/ Directory

Linux exposes per-process kernel state through procfs. Every running task has a directory under /proc/<pid>/. The same data is available under /proc/<pid>/task/<tid>/ for individual threads.

Path	What it tells you
`cmdline`	argv[], NUL-separated
`comm`	16-byte process name (writable)
`status`	Human-readable summary: state, UID, threads, capabilities
`stat`	Single line, about 50 space-separated fields, mostly numeric. Source for ps and top
`maps`	Memory mappings (VMAs). Key for understanding memory layout
`smaps`	Per-VMA detailed memory stats (RSS, PSS, swap)
`fd/`	Symlinks for each open file descriptor
`fdinfo/`	Per-FD position, flags, mount ID
`environ`	Initial environment, NUL-separated
`cgroup`	Path within the cgroup hierarchy (a single `0::/path` line on cgroup v2)
`ns/`	Symlinks identifying each namespace
`oom_score` / `oom_score_adj`	OOM killer's view; `oom_score` is read-only, `oom_score_adj` is tunable (-1000 to 1000)
`limits`	RLIMIT_* values
`sched`	Scheduler stats (vruntime, runtime, wait time)
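Because these files are plain text, ordinary file I/O is enough to consume them. As a minimal sketch, the helper below pulls one numeric field out of /proc/<pid>/status; `proc_status_number()` is a hypothetical name, and the code assumes procfs is mounted at /proc.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Look up a numeric field such as "Tgid", "PPid" or "Threads" in
 * /proc/<pid>/status. Returns its first number, or -1 if not found. */
long proc_status_number(pid_t pid, const char *field)
{
    char path[64], line[256];
    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long value = -1;
    size_t len = strlen(field);
    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "Tgid:\t1234"; match "<field>:" exactly. */
        if (strncmp(line, field, len) == 0 && line[len] == ':') {
            value = strtol(line + len + 1, NULL, 10);
            break;
        }
    }
    fclose(f);
    return value;
}
```

For the calling process, the `Tgid` field should match getpid() and `PPid` should match getppid(); reading the same file under /proc/<pid>/task/<tid>/ gives the per-thread view.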

Tradeoffs

Pros

One kernel mechanism (clone) covers processes, threads, and containers
Procfs makes runtime introspection trivial — no special libraries needed
pidfd eliminates a 30-year-old class of races
fork() with copy-on-write is fast; only pages that are later modified get duplicated

Cons

D state is genuinely uninterruptible — a stuck NFS mount can wedge processes forever
Thread groups make signal semantics subtle (sigprocmask is per-thread, signal handlers are shared)
fork() in multi-threaded programs is dangerous: the child may call only async-signal-safe functions before exec()
PID exhaustion is a real production failure mode for fork-bomb-shaped bugs

Frequently Asked Questions

What's the difference between fork() and clone()?

fork() is the historical UNIX call that duplicates the calling process. On modern Linux, glibc's fork() is implemented as a thin wrapper over clone() with a fixed flag set (no shared address space, no shared file descriptors, no shared signal handlers). clone() exposes the full power: you decide which resources are shared between parent and child by passing flags like CLONE_VM, CLONE_FILES, CLONE_FS, CLONE_SIGHAND, and CLONE_THREAD. A POSIX thread is just a clone() call with most of the sharing flags set. Containers use clone() with CLONE_NEW* flags such as CLONE_NEWPID to create new namespaces.

Why does a zombie process exist?

A zombie is a process that has finished executing (it called exit() or was killed) but whose entry in the kernel process table has not been reclaimed. The kernel keeps the entry around so the parent can read the exit status via wait()/waitpid()/waitid(). Until the parent reaps the child, the PID and the task_struct remain, although the address space, open files, and most other resources have already been released. Zombies consume little memory but they consume PIDs, and PIDs are a finite resource (typically about 4 million on a 64-bit systemd distribution, but only 32,768 with the kernel's built-in default). A bug that forks without waiting can exhaust PIDs and prevent new processes from starting.

How does init reap orphans?

When a process dies before its children, those children are re-parented. Historically they all went to PID 1 (init). Since Linux 3.4, any process can register itself as a 'subreaper' with prctl(PR_SET_CHILD_SUBREAPER, 1, ...); orphans then go to the nearest ancestor subreaper instead of PID 1. systemd uses this so that service supervisors reap their own descendants. Without a subreaper, services that fork-and-forget leave orphans that are re-parented to init, which reaps them in its main loop as they exit.

What is pidfd and why does it exist?

A pidfd is a file descriptor that refers to a process. It is created via pidfd_open(pid, 0) or returned from clone3() with CLONE_PIDFD. Unlike a raw PID, a pidfd doesn't suffer from PID reuse races: if process 1234 dies and PID 1234 gets reused for an unrelated process, your pidfd still refers to the original (now dead) one. You can wait on it with poll/epoll, send signals with pidfd_send_signal(), or duplicate one of the target's open file descriptors with pidfd_getfd(). Modern container runtimes use pidfds extensively.

What's the D state and why can't I kill it?

D is 'uninterruptible sleep': the process is blocked in the kernel waiting for something (usually I/O) that the kernel expects to complete shortly. Signals are not delivered until the wait finishes, which is why kill -9 doesn't work. Common causes: NFS server unreachable, broken disk, stuck device driver. If a process is stuck in D for minutes, the I/O subsystem is wedged and the only fix is usually fixing whatever the kernel is waiting on (or rebooting). Linux added a 'D-killable' variant (TASK_KILLABLE) in 2.6.25 for some paths so SIGKILL can interrupt the wait.

What does PR_SET_PDEATHSIG do?

It tells the kernel to send a specific signal to the calling process when its parent dies. Useful for child processes that should not outlive their parent, e.g., a worker that should exit if the supervisor crashes. Without it, a worker becomes orphaned and re-parented to init/subreaper, potentially running forever. Caveat: the signal is delivered when the parent thread that called clone() dies, not necessarily the entire parent process, so multi-threaded parents have edge cases. The setting is per-thread, is cleared in the child of a fork(), and is preserved across execve() unless the new program is set-user-ID, set-group-ID, or has file capabilities.