Container Internals

Namespaces, cgroups, capabilities, seccomp, OverlayFS, runc

A "container" isn't a kernel object. It's a process — usually multiple processes — wrapped in a particular configuration of namespaces (for isolation), cgroups (for resource limits), capabilities (for permission attenuation), seccomp filters (for syscall whitelisting), and a layered filesystem mounted via overlayfs. Docker's contribution was packaging these primitives behind a single command and shipping the layered images that go with it.

The runtime stack is layered too: docker CLI talks to dockerd; dockerd delegates lifecycle to containerd; containerd spawns runc (or crun, youki) per container. Each component does one thing. Kubernetes skips dockerd entirely now and talks to containerd via CRI.

Architecture

Key Numbers

7

namespace types: pid, mnt, net, ipc, user, uts, cgroup (Linux 5.6 added an eighth, time)

~40

Linux capabilities (CAP_*)

~370

Linux syscalls; seccomp default blocks ~50

128

overlay2 max layers per image

2008 / 2013

cgroups merged (Linux 2.6.24) / Docker first released (1.0 followed in 2014)

~30 ms

runc cold-start to user process exec

The Seven Namespaces

Namespace	Flag	What it virtualizes
PID	CLONE_NEWPID	Process IDs — your container's first process is PID 1
Mount	CLONE_NEWNS	Mount table — see your own root, no host filesystem
Network	CLONE_NEWNET	Interfaces, routing tables, iptables rules, sockets
IPC	CLONE_NEWIPC	System V IPC, POSIX message queues, /dev/shm
UTS	CLONE_NEWUTS	Hostname, domain name (gethostname returns the container's name)
User	CLONE_NEWUSER	UID/GID mapping — be root inside, unprivileged outside
Cgroup	CLONE_NEWCGROUP	Hide the host's cgroup hierarchy from /proc

# Inspect a container's namespaces
$ docker inspect <container> --format '{{.State.Pid}}'
1234
$ ls -l /proc/1234/ns/
lrwxrwxrwx 1 root root 0 May 3 12:34 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 3 12:34 ipc    -> 'ipc:[4026532789]'
lrwxrwxrwx 1 root root 0 May 3 12:34 mnt    -> 'mnt:[4026532787]'
lrwxrwxrwx 1 root root 0 May 3 12:34 net    -> 'net:[4026532790]'
lrwxrwxrwx 1 root root 0 May 3 12:34 pid    -> 'pid:[4026532791]'
lrwxrwxrwx 1 root root 0 May 3 12:34 user   -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 3 12:34 uts    -> 'uts:[4026532788]'

# Enter a container's namespaces from the host
$ nsenter -t 1234 -a /bin/bash
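
To check whether two processes share a namespace, it is enough to compare the inode numbers in those symlink targets. A minimal sketch: ns_inode and same_ns are hypothetical helper names, and reading another process's /proc/<pid>/ns entries generally needs root.

```shell
# Extract the inode number from a namespace link target such as 'pid:[4026532791]'.
ns_inode() {
  id=${1#*\[}                 # drop everything up to and including '['
  printf '%s\n' "${id%\]}"    # drop the trailing ']'
}

# Succeed if two PIDs are in the same namespace of the given type (pid, net, mnt, ...).
# Usage: same_ns <pid-a> <pid-b> <type>
same_ns() {
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

# Demo on one of the link targets from the listing above.
ns_inode 'pid:[4026532791]'
```

For example, same_ns 1 1234 net && echo "shares the host network" would report a container started with --network=host, because it stays in the host's network namespace.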

cgroups v2: One Tree, All Controllers

# Each container gets its own cgroup
$ ls /sys/fs/cgroup/system.slice/docker-<cid>.scope/
cgroup.controllers   memory.current      cpu.weight       io.max
cgroup.events        memory.max          cpu.max          io.stat
cgroup.procs         memory.high         cpu.stat         pids.current
cpu.weight.nice      memory.events       cpuset.cpus      pids.max

# Set CPU and memory at run time
$ docker run --cpus=2 --memory=4g --memory-swap=4g nginx
# Maps to: cpu.max = "200000 100000", memory.max = "4294967296"

# Inspect live usage
$ cat /sys/fs/cgroup/system.slice/docker-<cid>.scope/memory.current
1845231616
$ cat /sys/fs/cgroup/system.slice/docker-<cid>.scope/cpu.stat
usage_usec 12345678
user_usec 8901234
system_usec 3444444
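
The mapping between the docker run flags and these files is plain arithmetic. A minimal sketch, assuming the default 100 ms CPU period; the helper names are hypothetical and nothing here touches a real cgroup.

```shell
# Translate a Docker-style --cpus value into the cgroup v2 cpu.max string
# ("<quota_usec> <period_usec>"), assuming the default 100000 us period.
cpus_to_cpu_max() {
  awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

# Average CPU utilization, as a percentage of one core, between two
# usage_usec samples from cpu.stat taken <seconds> apart.
# Usage: cpu_percent <usage_before> <usage_after> <seconds>
cpu_percent() {
  awk -v a="$1" -v b="$2" -v t="$3" \
    'BEGIN { printf "%.1f\n", (b - a) / (t * 1000000) * 100 }'
}

# Bytes left before memory.max is hit; memory.max reads "max" when unlimited.
# Usage: mem_headroom <memory.current> <memory.max>
mem_headroom() {
  if [ "$2" = "max" ]; then
    echo unlimited
  else
    echo $(( $2 - $1 ))
  fi
}

cpus_to_cpu_max 2                      # prints: 200000 100000
mem_headroom 1845231616 4294967296     # prints: 2449735680
```

Sampling usage_usec twice and feeding both values to cpu_percent gives the same kind of figure that docker stats reports, without going through the daemon.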

Capabilities and seccomp

# Default Docker capabilities (allowed inside container)
chown, dac_override, fowner, fsetid, kill, setgid, setuid,
setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

# Drop everything, add only what you need
$ docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE my-image

# seccomp profile (default blocks ~50 syscalls)
$ docker run --security-opt seccomp=/path/to/profile.json my-image

# A minimal seccomp.json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mount", "umount2", "reboot", "clone3", "kexec_load",
                "ptrace", "bpf"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

# AppArmor / SELinux profile is also applied by default
$ aa-status | grep docker
docker-default ()
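
The kernel reports a process's effective capabilities as a hex bitmask on the CapEff line of /proc/<pid>/status. A minimal sketch of decoding it, assuming the bit order from <linux/capability.h>; decode_caps is a hypothetical helper name.

```shell
# Decode a capability bitmask (hex, as shown on the CapEff line) into names.
# The position of each name in the list is its bit number.
decode_caps() (
  mask=$(( 0x$1 ))
  bit=0
  out=""
  for name in chown dac_override dac_read_search fowner fsetid kill setgid \
      setuid setpcap linux_immutable net_bind_service net_broadcast net_admin \
      net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace \
      sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time \
      sys_tty_config mknod lease audit_write audit_control setfcap \
      mac_override mac_admin syslog wake_alarm block_suspend audit_read \
      perfmon bpf checkpoint_restore
  do
    # Keep the name if its bit is set in the mask.
    if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
      out="${out:+$out }$name"
    fi
    bit=$(( $bit + 1 ))
  done
  printf '%s\n' "$out"
)

# 00000000a80425fb is the mask that corresponds to the 14 default capabilities.
decode_caps 00000000a80425fb
```

Feed it the value from grep CapEff /proc/1234/status. Where libcap's tools are installed, capsh --decode=<mask> gives the same information.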

OverlayFS: How Layers Work

# An image is a stack of read-only layers.
# A running container adds one writable upper layer.

/var/lib/docker/overlay2/
  <layer1>/diff/    # base OS (e.g., debian)
  <layer2>/diff/    # apt installed packages
  <layer3>/diff/    # COPY app/
  <container>/diff/   # writable layer (the overlay upperdir)
  <container>/work/   # overlayfs internal
  <container>/merged/ # what the container sees as /

# Mounted as (lowerdir lists layers top-most first, so the base layer is rightmost)
$ mount | grep overlay
overlay on /var/lib/docker/overlay2/.../merged type overlay
  (rw,relatime,lowerdir=L3:L2:L1,upperdir=upper,workdir=work)

# Whiteout: deleting a file that exists in a lower layer leaves a marker in upper that hides it
# Implemented as a character device with major/minor 0/0:
$ ls -la upper/
c--------- 1 root root 0, 0 May 3 12:34 deleted-file
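
overlayfs stacks the lowerdir list from right to left, so the top-most image layer comes first and the base layer last. A minimal sketch of building the option string from layers given in build order (base first); overlay_opts is a hypothetical helper name and the directory names are placeholders.

```shell
# Build overlayfs mount options from layer directories listed bottom-to-top.
# Usage: overlay_opts <upperdir> <workdir> <base-layer> [<next-layer> ...]
overlay_opts() {
  upper=$1
  work=$2
  shift 2
  lower=""
  for layer in "$@"; do
    # Prepend each layer so the last (top-most) one ends up leftmost.
    lower="$layer${lower:+:$lower}"
  done
  printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
}

overlay_opts upper work L1 L2 L3
# prints: lowerdir=L3:L2:L1,upperdir=upper,workdir=work
```

As root, the result can be passed straight to mount: mount -t overlay overlay -o "$(overlay_opts upper work L1 L2 L3)" merged.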

What runc Actually Does

# containerd hands runc a config.json (OCI runtime spec) and a rootfs path
$ runc create --bundle /run/containerd/.../bundle/ my-container
$ runc start my-container

# Roughly, runc does this:
1. clone3() with CLONE_NEWPID|NEWNS|NEWUTS|NEWIPC|NEWNET|NEWUSER|NEWCGROUP
2. Set up the cgroup (write the child's PID to /sys/fs/cgroup/.../<cid>.scope/cgroup.procs)
3. Make the mount tree private and bind-mount the rootfs onto itself (the overlayfs mount was already assembled by containerd)
4. Mount procfs at /proc, sysfs at /sys, and a tmpfs at /dev inside the new root
5. pivot_root() into the new root, then mask or remount read-only the sensitive /proc and /sys paths
6. Apply seccomp filter (seccomp(2), or prctl PR_SET_SECCOMP on older kernels)
7. Drop capabilities: prctl(PR_CAPBSET_DROP) for the bounding set, capset() for the other sets
8. Apply AppArmor/SELinux label
9. setuid/setgid to the requested user
10. execve(args[0], args, env)
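
The flags in step 1 are plain bit values OR-ed together. A minimal sketch that computes the mask from the namespace type names used in an OCI config.json, with the CLONE_* values from <linux/sched.h>; ns_flag and clone_flags are hypothetical helper names.

```shell
# Map an OCI namespace type name to its CLONE_* flag value.
ns_flag() {
  case "$1" in
    mount)   echo $(( 0x00020000 )) ;;  # CLONE_NEWNS
    cgroup)  echo $(( 0x02000000 )) ;;  # CLONE_NEWCGROUP
    uts)     echo $(( 0x04000000 )) ;;  # CLONE_NEWUTS
    ipc)     echo $(( 0x08000000 )) ;;  # CLONE_NEWIPC
    user)    echo $(( 0x10000000 )) ;;  # CLONE_NEWUSER
    pid)     echo $(( 0x20000000 )) ;;  # CLONE_NEWPID
    network) echo $(( 0x40000000 )) ;;  # CLONE_NEWNET
    *) echo "unknown namespace type: $1" >&2; return 1 ;;
  esac
}

# OR together the flags for every namespace type given as an argument.
clone_flags() {
  flags=0
  for ns in "$@"; do
    f=$(ns_flag "$ns") || return 1
    flags=$(( $flags | $f ))
  done
  printf '0x%08x\n' "$flags"
}

clone_flags pid mount uts ipc network user cgroup   # prints: 0x7e020000
```

Leaving a type out of the list is how a runtime shares that namespace with the host, which is what --network=host or --pid=host amounts to.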

runc vs crun vs youki

Aspect	runc	crun	youki
Language	Go	C	Rust
Cold start	~30 ms	~5 ms	~10 ms
Memory footprint	~14 MB	~2 MB	~3 MB
Default in	Docker, containerd	Podman, RHEL CRI-O	(experimental)
cgroup v2 support	Yes	Yes (first to ship)	Yes

Tradeoffs

Strengths

No hypervisor — containers share the host kernel; near-zero overhead
Layered images give content-addressable, deduplicated storage
OCI standards mean interchangeable images and runtimes
Capabilities + seccomp + namespaces compose into a tight sandbox

Sharp edges

Shared kernel: a single kernel vulnerability can compromise every container on the host
--privileged disables almost all isolation and is easy to misuse
OverlayFS is fast for reads, but the first write to a file from a lower layer copies the whole file up
User namespaces interact poorly with NFS, FUSE, and some legacy software

Frequently Asked Questions

What's the difference between cgroup v1 and v2?

cgroup v1 has a separate hierarchy per controller (cpu, memory, blkio, ...), each mounted under /sys/fs/cgroup/<controller>/. A process can sit in a different cgroup for each controller, which makes consistent resource accounting hard. cgroup v2 uses a single unified hierarchy under /sys/fs/cgroup, with all controllers attached to the same tree. v2 also adds per-cgroup PSI (Pressure Stall Information), better memory accounting, and the io controller (replacing blkio). Modern distros (Fedora 31+, Ubuntu 22.04+, RHEL 9) default to cgroup v2 only. Kubernetes support for v2 became generally available in 1.25.
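
The two layouts are easy to tell apart from /proc/<pid>/cgroup: a v2 entry has hierarchy ID 0 and an empty controller field, while v1 entries name their controllers. A minimal sketch, with cgroup_mode as a hypothetical helper name.

```shell
# Classify the cgroup setup from the contents of /proc/<pid>/cgroup on stdin.
#   v2 line: 0::/system.slice/docker-<cid>.scope
#   v1 line: 5:memory:/docker/<cid>
cgroup_mode() {
  awk -F: '
    $1 == "0" && $2 == "" { v2 = 1; next }   # unified hierarchy entry
    $2 != ""              { v1 = 1 }         # per-controller (or named) hierarchy
    END {
      if (v1 && v2) print "hybrid"
      else if (v2)  print "v2"
      else if (v1)  print "v1"
      else          print "unknown"
    }'
}

printf '0::/system.slice/docker-abc.scope\n' | cgroup_mode   # prints: v2
```

On a live host, run cgroup_mode < /proc/self/cgroup. With GNU coreutils, stat -fc %T /sys/fs/cgroup printing cgroup2fs is another common check for a v2-only system.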

What does runc actually do?

runc is the OCI reference low-level runtime. Given a config.json (the OCI runtime spec) and a root filesystem, it sets up the namespaces, cgroups, capabilities, seccomp filter, AppArmor/SELinux labels, and mounts, calls pivot_root, then execs the user's process with all of that in place. It is not a daemon: it creates the container and exits, leaving the contained process running. containerd (through a shim process) supervises the lifecycle from there, on its own or on behalf of the Docker daemon. Alternatives include crun (faster, written in C, used by Podman) and youki (written in Rust).

Why are namespaces and cgroups separate?

They solve different problems. Namespaces virtualize what a process sees (its own PID 1, its own /, its own network interfaces); they are about isolation and naming. Cgroups limit what a process can use (CPU time, memory, I/O bandwidth); they are about resource accounting and control. A process in its own namespaces but without cgroup limits lives in a private world yet can still consume all of the host's resources; a process under cgroup limits but without namespaces is metered yet can still see everything else on the host. Containers want both, plus capabilities and seccomp on top.

What is overlayfs and how does it implement layers?

overlayfs is a Linux union filesystem. You give it one or more 'lower' directories (read-only, stacked) and an 'upper' directory (read-write); the mount presents a merged view. Reads check upper first, then each lower from top to bottom; writes go to upper (with a copy-up first if the file existed only in a lower). Docker maps each image layer to one of the lowers and creates a fresh upper for the container's writable layer. Whiteouts (deletions in upper that hide files in lower) are stored as character device files with device number 0/0. Modern Docker uses the overlay2 driver, which can stack up to 128 lowers.
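
The lookup order can be sketched with ordinary directories. This is an illustration, not how the kernel implements it: overlay_lookup is a hypothetical helper, opaque directories are not modeled, and whiteouts are represented as '.wh.<name>' marker files (the convention used inside OCI image layer tarballs) because creating real 0/0 character devices requires CAP_MKNOD.

```shell
# Resolve a path the way a merged view would: upper first, then each lower
# from top to bottom. ASSUMPTION: a '.wh.<name>' file stands in for a whiteout.
# Usage: overlay_lookup <relative-path> <upperdir> <top-lower> ... <bottom-lower>
overlay_lookup() {
  rel=$1
  shift
  dir=$(dirname "$rel")
  base=$(basename "$rel")
  for layer in "$@"; do
    if [ -e "$layer/$dir/.wh.$base" ]; then
      return 1                       # deleted here: hide anything further down
    fi
    if [ -e "$layer/$rel" ]; then
      printf '%s\n' "$layer/$rel"    # first hit wins
      return 0
    fi
  done
  return 1                           # not present in any layer
}

# Demo with throwaway directories: base layer, app layer, container upper.
demo=$(mktemp -d)
mkdir -p "$demo/base/etc" "$demo/app/etc" "$demo/upper/etc"
echo v1 > "$demo/base/etc/motd"
echo v1 > "$demo/base/etc/old.conf"
echo v2 > "$demo/app/etc/motd"           # app layer overrides the base file
: > "$demo/upper/etc/.wh.old.conf"       # container deleted old.conf
overlay_lookup etc/motd "$demo/upper" "$demo/app" "$demo/base"
overlay_lookup etc/old.conf "$demo/upper" "$demo/app" "$demo/base" \
  || echo "etc/old.conf: hidden by whiteout"
rm -rf "$demo"
```

The first lookup resolves to the copy in the app layer; the second fails even though the base layer still holds the file, which is exactly why deleting files in a later layer does not shrink an image.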

What's the difference between containerd and the Docker daemon?

Historically the Docker daemon (dockerd) did everything: image management, build, networking, registry access, container lifecycle. Docker split the lifecycle parts out as containerd, a daemon that just manages containers; it shipped inside Docker 1.11 in 2016 and was donated to the CNCF in 2017. Now containerd talks to runc to start containers, while dockerd talks to containerd for lifecycle and still handles build, swarm, and the Docker CLI's higher-level features. Kubernetes reached dockerd through the dockershim until dockershim was removed in 1.24; since then the kubelet talks directly to a CRI runtime such as containerd. Today most Kubernetes clusters run containerd or CRI-O, with no Docker daemon.

What capabilities does Docker drop by default?

Docker starts containers with a curated subset of Linux capabilities: SETPCAP, MKNOD, AUDIT_WRITE, CHOWN, NET_RAW, DAC_OVERRIDE, FOWNER, FSETID, KILL, SETGID, SETUID, NET_BIND_SERVICE, SYS_CHROOT, SETFCAP. Everything else is dropped, including dangerous ones like SYS_ADMIN, NET_ADMIN, SYS_PTRACE, SYS_MODULE. Drop more with --cap-drop=ALL --cap-add=NET_BIND_SERVICE. Combined with the default seccomp profile (which blocks ~50 of ~370 syscalls outright) and, if you opt in, --security-opt no-new-privileges, the attack surface is much smaller than running as root on the host. A privileged container (--privileged) bypasses all of this.
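
Whether these protections are actually in force for a given process can be read from /proc/<pid>/status, which carries CapEff, NoNewPrivs, and Seccomp fields on reasonably recent kernels. A minimal sketch, with hardening_summary as a hypothetical helper name.

```shell
# Summarize the hardening-related fields of /proc/<pid>/status (read on stdin).
# Seccomp: 0 = disabled, 1 = strict, 2 = filter. NoNewPrivs: 0 = off, 1 = on.
hardening_summary() {
  awk '
    $1 == "CapEff:"     { cap = $2 }
    $1 == "NoNewPrivs:" { nnp = ($2 == 1) ? "on" : "off" }
    $1 == "Seccomp:"    { sec = ($2 == 2) ? "filter" : (($2 == 1) ? "strict" : "disabled") }
    END { printf "cap_eff=%s no_new_privs=%s seccomp=%s\n", cap, nnp, sec }'
}

# Sample input shaped like the status file of a default (non-privileged) container.
printf 'CapEff:\t00000000a80425fb\nNoNewPrivs:\t0\nSeccomp:\t2\n' | hardening_summary
# prints: cap_eff=00000000a80425fb no_new_privs=off seccomp=filter
```

Run it against a container's main process with hardening_summary < /proc/1234/status. A --privileged container typically shows a much larger CapEff mask and seccomp=disabled.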