Container Internals
Namespaces, cgroups, capabilities, seccomp, OverlayFS, runc
A "container" isn't a kernel object. It's a process — usually multiple processes — wrapped in a particular configuration of namespaces (for isolation), cgroups (for resource limits), capabilities (for permission attenuation), seccomp filters (for syscall whitelisting), and a layered filesystem mounted via overlayfs. Docker's contribution was packaging these primitives behind a single command and shipping the layered images that go with it.
The runtime stack is layered too: the docker CLI talks to dockerd; dockerd delegates lifecycle to containerd; containerd spawns runc (or crun, youki) per container. Each component does one thing. Kubernetes skips dockerd entirely now and talks to containerd via CRI.
The Seven Namespaces
| Namespace | Flag | What it virtualizes |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs — your container's first process is PID 1 |
| Mount | CLONE_NEWNS | Mount table — see your own root, no host filesystem |
| Network | CLONE_NEWNET | Interfaces, routing tables, iptables rules, sockets |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues, /dev/shm |
| UTS | CLONE_NEWUTS | Hostname, domain name (gethostname returns the container's name) |
| User | CLONE_NEWUSER | UID/GID mapping — be root inside, unprivileged outside |
| Cgroup | CLONE_NEWCGROUP | Hide the host's cgroup hierarchy from /proc |
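Two processes share a namespace exactly when their /proc/<pid>/ns links carry the same inode number. The sketch below compares the container values from the /proc/1234/ns listing in this section against a set of host values. The host numbers and the helper names (`parse_ns_link`, `shared_namespaces`) are assumptions made up for illustration, not output from a real machine.

```python
import re

# Symlink targets under /proc/<pid>/ns look like "pid:[4026532791]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace symlink target into (kind, inode number)."""
    match = NS_LINK.match(target.strip().strip("'"))
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def shared_namespaces(a: dict[str, int], b: dict[str, int]) -> list[str]:
    """Namespace kinds whose inode numbers are identical in both processes."""
    return sorted(kind for kind in a.keys() & b.keys() if a[kind] == b[kind])


# Container values, copied from the /proc/1234/ns listing.
container = dict(parse_ns_link(t) for t in [
    "cgroup:[4026531835]", "ipc:[4026532789]", "mnt:[4026532787]",
    "net:[4026532790]", "pid:[4026532791]", "user:[4026531837]",
    "uts:[4026532788]",
])

# ASSUMPTION: hypothetical host-side values (what /proc/1/ns might show).
host = {
    "cgroup": 4026531835, "ipc": 4026531839, "mnt": 4026531840,
    "net": 4026531992, "pid": 4026531836, "user": 4026531837,
    "uts": 4026531838,
}

print(shared_namespaces(container, host))  # ['cgroup', 'user']
```

On a live system the link targets could be read with `os.readlink(f"/proc/{pid}/ns/{name}")`. Under the assumed host values, this container shares its user and cgroup namespaces with the host, which is what you would expect when user-namespace remapping is not enabled.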
# Inspect a container's namespaces
$ docker inspect <container> --format '{{.State.Pid}}'
1234
$ ls -l /proc/1234/ns/
lrwxrwxrwx 1 root root 0 May 3 12:34 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 3 12:34 ipc -> 'ipc:[4026532789]'
lrwxrwxrwx 1 root root 0 May 3 12:34 mnt -> 'mnt:[4026532787]'
lrwxrwxrwx 1 root root 0 May 3 12:34 net -> 'net:[4026532790]'
lrwxrwxrwx 1 root root 0 May 3 12:34 pid -> 'pid:[4026532791]'
lrwxrwxrwx 1 root root 0 May 3 12:34 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 3 12:34 uts -> 'uts:[4026532788]'
# Enter a container's namespaces from the host
$ nsenter -t 1234 -a /bin/bash
cgroups v2: One Tree, All Controllers
# Each container gets its own cgroup
$ ls /sys/fs/cgroup/system.slice/docker-<cid>.scope/
cgroup.controllers memory.current cpu.weight io.max
cgroup.events memory.max cpu.max io.stat
cgroup.procs memory.high cpu.stat pids.current
cpu.weight.nice memory.events cpuset.cpus pids.max
# Set CPU and memory at run time
$ docker run --cpus=2 --memory=4g --memory-swap=4g nginx
# Maps to: cpu.max = "200000 100000", memory.max = "4294967296"
# Inspect live usage
$ cat /sys/fs/cgroup/system.slice/docker-<cid>.scope/memory.current
1845231616
$ cat /sys/fs/cgroup/system.slice/docker-<cid>.scope/cpu.stat
usage_usec 12345678
user_usec 8901234
system_usec 3444444
Capabilities and seccomp
# Default Docker capabilities (allowed inside container)
chown, dac_override, fowner, fsetid, kill, setgid, setuid,
setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
# Drop everything, add only what you need
$ docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE my-image
# seccomp profile (default blocks ~50 syscalls)
$ docker run --security-opt seccomp=/path/to/profile.json my-image
# A minimal seccomp.json
{
"defaultAction": "SCMP_ACT_ALLOW",
"syscalls": [
{ "names": ["mount", "umount2", "reboot", "clone3", "kexec_load",
"ptrace", "bpf"],
"action": "SCMP_ACT_ERRNO" }
]
}
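To make the shape of that profile concrete, here is a rough sketch of how a rule list resolves to an action for one syscall name. It is a deliberate simplification: real profiles can also match on syscall arguments and architectures, and the kernel evaluates a compiled BPF filter, not JSON. The function name `action_for` is hypothetical.

```python
import json

# The minimal profile from above, embedded as a string for the example.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["mount", "umount2", "reboot", "clone3", "kexec_load",
                "ptrace", "bpf"],
      "action": "SCMP_ACT_ERRNO" }
  ]
}
"""


def action_for(profile: dict, syscall: str) -> str:
    """Return the action of the first rule naming `syscall`, else the default.

    ASSUMPTION: first-match semantics, ignoring argument and arch filters.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)
for name in ("read", "mount", "ptrace"):
    print(name, "->", action_for(profile, name))
```

Because the default action is SCMP_ACT_ALLOW, this profile is a denylist: anything not named passes through. Docker's own default profile takes the opposite approach, with a restrictive default and an explicit list of allowed syscalls.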
# AppArmor / SELinux profile is also applied by default
$ aa-status | grep docker
docker-default ()
OverlayFS: How Layers Work
# An image is a stack of read-only layers.
# A running container adds one writable upper layer.
/var/lib/docker/overlay2/
<layer1>/diff/ # base OS (e.g., debian)
<layer2>/diff/ # apt installed packages
<layer3>/diff/ # COPY app/
<container>/upper/ # writable layer
<container>/work/ # overlayfs internal
<container>/merged/ # what the container sees as /
# Mounted as
$ mount | grep overlay
overlay on /var/lib/docker/overlay2/.../merged type overlay
(rw,relatime,lowerdir=L3:L2:L1,upperdir=upper,workdir=work)
# Whiteout: deleting a file in upper shadows it in lower
# Implemented as a character device with major/minor 0/0:
$ ls -la upper/
crw-r--r-- 1 root root 0, 0 May 3 12:34 deleted-file
What runc Actually Does
# containerd hands runc a config.json (OCI runtime spec) and a rootfs path
$ runc create --bundle /run/containerd/.../bundle/ my-container
$ runc start my-container
# Roughly, runc does this:
1. clone3() with CLONE_NEWPID|NEWNS|NEWUTS|NEWIPC|NEWNET|NEWUSER|NEWCGROUP
2. Set up the cgroup (write to /sys/fs/cgroup/.../<cid>.scope/cgroup.procs)
3. Make the mount tree private and bind-mount the rootfs (containerd has already mounted overlayfs there)
4. pivot_root() into the new root
5. mount proc at /proc, sysfs at /sys, and a tmpfs at /dev, then mask the sensitive paths
6. Apply seccomp filter (prctl PR_SET_SECCOMP)
7. Drop capabilities (prctl PR_CAPBSET_DROP for the bounding set, capset() for the other sets)
8. Apply AppArmor/SELinux label
9. setuid/setgid to the requested user
10. execve(args[0], args)
runc vs crun vs youki
| Aspect | runc | crun | youki |
|---|---|---|---|
| Language | Go | C | Rust |
| Cold start | ~30 ms | ~5 ms | ~10 ms |
| Memory footprint | ~14 MB | ~2 MB | ~3 MB |
| Default in | Docker, containerd | Podman, RHEL CRI-O | (experimental) |
| cgroup v2 support | Yes | Yes (first to ship) | Yes |
Tradeoffs
Advantages
- No hypervisor: containers share the host kernel, so overhead is near zero
- Layered images give content-addressable, deduplicated storage
- OCI standards mean interchangeable images and runtimes
- Capabilities + seccomp + namespaces compose into a tight sandbox
Drawbacks
- Shared kernel: a kernel vulnerability can compromise every container on the host
- --privileged disables almost all isolation; easy to footgun
- OverlayFS is fast for reads, but the first write to a lower-layer file triggers a full copy-up
- User namespaces interact poorly with NFS, FUSE, and some legacy software
Frequently Asked Questions
What's the difference between cgroup v1 and v2?
cgroup v1 has a separate hierarchy per controller (cpu, memory, blkio, ...), each mounted under /sys/fs/cgroup/<controller>/. A process can be in different cgroups across controllers, which makes consistent resource accounting hard. cgroup v2 uses a single unified hierarchy under /sys/fs/cgroup, with all controllers available in the same tree. v2 also adds per-cgroup PSI (Pressure Stall Information), better memory accounting, and the io controller (replacing blkio). Modern distros (Fedora 31+, Ubuntu 22.04+, RHEL 9) default to cgroup v2 only. Kubernetes support for v2 has been stable since 1.25.
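In the unified v2 tree, limits and statistics are plain text files, so the mapping from run-time flags to file contents is simple arithmetic. The sketch below reproduces the values quoted earlier for --cpus=2 and --memory=4g. The helper names are hypothetical, and the 100000 microsecond period is taken from the cpu.max value shown in this article, not from any guarantee about every runtime.

```python
def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Contents of cpu.max: '<quota> <period>', both in microseconds."""
    return f"{int(round(cpus * period_us))} {period_us}"


# Binary multiples, matching 4g -> 4294967296 in the example above.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def memory_max(size: str) -> str:
    """Convert a size such as '4g' into the byte count written to memory.max."""
    size = size.strip().lower()
    if size[-1] in _UNITS:
        return str(int(size[:-1]) * _UNITS[size[-1]])
    return str(int(size))


def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse 'key value' lines, the layout used by files such as cpu.stat."""
    return {key: int(value) for key, value in
            (line.split() for line in text.splitlines() if line.strip())}


print(cpu_max(2))        # 200000 100000
print(memory_max("4g"))  # 4294967296

stat = parse_flat_keyed(
    "usage_usec 12345678\nuser_usec 8901234\nsystem_usec 3444444\n"
)
# User plus system time accounts for the total CPU time consumed.
print(stat["user_usec"] + stat["system_usec"] == stat["usage_usec"])  # True
```

Reading it the other way round, a cpu.max of "200000 100000" means the cgroup may consume 200 ms of CPU time in every 100 ms window, which is two full cores.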
What does runc actually do?
runc is the reference OCI low-level runtime. Given a config.json (the OCI runtime spec) and a root filesystem, it sets up the namespaces, cgroups, capabilities, seccomp filter, AppArmor/SELinux labels, mounts, and pivot_root, then execs the user's process with all that in place. It's not a daemon: it creates the container and exits, leaving the contained process running. containerd, through its per-container shim process, supervises the lifecycle from then on. Alternatives to runc include crun (faster, written in C, used by Podman) and youki (in Rust).
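As a rough illustration of what that config.json carries, the sketch below assembles a fragment of an OCI runtime spec as a Python dictionary. It is not a complete bundle: a real config also needs mounts, and a user namespace would need UID/GID mappings, so the user namespace is left out here. The function name `minimal_spec` and the chosen values are assumptions for the example.

```python
import json

# ASSUMPTION: no "user" namespace, since that would also require ID mappings.
NAMESPACES = ["pid", "mount", "network", "ipc", "uts", "cgroup"]


def minimal_spec(args, caps, memory_bytes, cpu_quota, cpu_period=100_000):
    """Build a partial OCI runtime spec (illustrative, not a full config)."""
    cap_names = [f"CAP_{c.upper()}" for c in caps]
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
            "noNewPrivileges": True,
            "capabilities": {
                "bounding": cap_names,
                "effective": cap_names,
                "permitted": cap_names,
            },
        },
        "root": {"path": "rootfs", "readonly": False},
        "linux": {
            "namespaces": [{"type": t} for t in NAMESPACES],
            "resources": {
                "memory": {"limit": memory_bytes},
                "cpu": {"quota": cpu_quota, "period": cpu_period},
            },
        },
    }


spec = minimal_spec(["/bin/sh"], ["net_bind_service"], 4 * 1024**3, 200_000)
print(json.dumps(spec, indent=2))
```

Each top-level key lines up with one piece of runc's work: `linux.namespaces` with the clone flags, `linux.resources` with the cgroup files, `process.capabilities` with the capability sets, and `root.path` with the directory it pivots into.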
Why are namespaces and cgroups separate?
They solve different problems. Namespaces virtualize what a process sees (its own PID 1, its own /, its own network interfaces) — they're about isolation and naming. Cgroups limit what a process can use (CPU shares, memory cap, I/O bandwidth) — they're about resource accounting. A namespace without cgroups gets full access to the host's resources but in its own little world; cgroups without namespaces meter resource use but the process can see everything else. Containers want both, plus capabilities and seccomp on top.
What is overlayfs and how does it implement layers?
overlayfs is a Linux union filesystem. You give it a 'lower' directory (read-only, possibly a stack of several) and an 'upper' directory (read-write); the mount presents a merged view. Reads check upper first, then the lowers from top to bottom; writes go to upper (with a copy-up first if the file existed only in lower). Docker maps each image layer to one of the lowers and adds a fresh upper for the container's writable layer. Whiteouts (deletions in upper that shadow files in lower) are stored as character device files with device number 0/0. Modern Docker uses the overlay2 driver, which can stack up to 128 lowers.
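The lookup and whiteout rules can be modeled in a few lines, with dictionaries standing in for directories. This is a toy model of the semantics only, not how the kernel implements overlayfs, and the `WHITEOUT` marker, function names, and file paths are invented for the example.

```python
WHITEOUT = object()  # stands in for the 0/0 character device


def lookup(path, upper, lowers):
    """Resolve `path` in upper first, then in lowers ordered top to bottom."""
    for layer in [upper, *lowers]:
        if path in layer:
            entry = layer[path]
            return None if entry is WHITEOUT else entry
    return None


def write(path, data, upper):
    """Writes always land in upper; lower layers are never modified."""
    upper[path] = data


def delete(path, upper, lowers):
    """Remove from upper and leave a whiteout if a lower still has the path."""
    upper.pop(path, None)
    if any(path in layer for layer in lowers):
        upper[path] = WHITEOUT


# Hypothetical layers: application layer on top of a base OS layer.
lowers = [{"/app/main.py": "v1"}, {"/etc/motd": "hello"}]
upper = {}

print(lookup("/etc/motd", upper, lowers))  # hello
write("/etc/motd", "changed", upper)
print(lookup("/etc/motd", upper, lowers))  # changed
delete("/etc/motd", upper, lowers)
print(lookup("/etc/motd", upper, lowers))  # None
print(lowers[1]["/etc/motd"])              # hello
```

The last line shows why images stay shareable: the base layer still holds the original file, and only this container's upper layer records that it was deleted.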
What's the difference between containerd and the Docker daemon?
Historically the Docker daemon (dockerd) did everything: image management, build, networking, registry, container lifecycle. Docker then carved out the lifecycle parts as containerd, a daemon that just manages containers; it first shipped inside Docker 1.11 in 2016 and was donated to the CNCF in 2017. Now containerd talks to runc to start containers, while dockerd talks to containerd for lifecycle and still handles build, swarm, and the Docker CLI's higher-level features. Kubernetes reached dockerd through the dockershim until that shim was removed in 1.24; since then the kubelet talks only to CRI runtimes such as containerd. Today most Kubernetes clusters run containerd or CRI-O, with no Docker daemon.
What capabilities does Docker drop by default?
Docker starts containers with a curated subset of Linux capabilities: CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, SETGID, SETUID, SETPCAP, NET_BIND_SERVICE, NET_RAW, SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP. Everything else is dropped, including dangerous ones like SYS_ADMIN, NET_ADMIN, SYS_PTRACE, SYS_MODULE. Drop more with --cap-drop=ALL --cap-add=NET_BIND_SERVICE. Combined with the default seccomp profile (which blocks ~50 of ~370 syscalls outright) and, if you opt in, the no-new-privileges flag (--security-opt no-new-privileges), the attack surface is smaller than running as root on the host. A privileged container (--privileged) bypasses almost all of this.
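Inside the kernel each capability is one bit in a 64-bit mask, which is what the CapEff line in /proc/<pid>/status reports in hex. The sketch below builds the mask for the default set listed above and decodes it again. Bit numbers follow the kernel's capability numbering; the helper names are hypothetical, and the table covers only the capabilities mentioned in this article.

```python
# Capability bit numbers (subset: Docker defaults plus a few dropped ones).
CAP_BITS = {
    "CHOWN": 0, "DAC_OVERRIDE": 1, "FOWNER": 3, "FSETID": 4, "KILL": 5,
    "SETGID": 6, "SETUID": 7, "SETPCAP": 8, "NET_BIND_SERVICE": 10,
    "NET_ADMIN": 12, "NET_RAW": 13, "SYS_MODULE": 16, "SYS_CHROOT": 18,
    "SYS_PTRACE": 19, "SYS_ADMIN": 21, "MKNOD": 27, "AUDIT_WRITE": 29,
    "SETFCAP": 31,
}

DOCKER_DEFAULT = [
    "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL", "SETGID", "SETUID",
    "SETPCAP", "NET_BIND_SERVICE", "NET_RAW", "SYS_CHROOT", "MKNOD",
    "AUDIT_WRITE", "SETFCAP",
]


def mask_for(names):
    """OR together the bits for the given capability names."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_BITS[name]
    return mask


def decode(mask):
    """Names from CAP_BITS whose bit is set in `mask`, in bit order."""
    return [name for name, bit in sorted(CAP_BITS.items(), key=lambda kv: kv[1])
            if mask & (1 << bit)]


default_mask = mask_for(DOCKER_DEFAULT)
print(f"{default_mask:016x}")               # 00000000a80425fb
print("SYS_ADMIN" in decode(default_mask))  # False
print(decode(mask_for(["NET_BIND_SERVICE"])))  # ['NET_BIND_SERVICE']
```

A quick check on a running container is to compare its CapEff value against this mask: any extra set bit means a capability was added beyond the default set.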