Container Runtime
CRI · containerd · runc · cgroups v2: The Full Path from Pod Spec to Running Containers
When you apply a Pod spec, Kubernetes doesn't run containers itself; it delegates. The kubelet speaks the Container Runtime Interface (CRI) to containerd, which starts a shim process, and the shim invokes runc to launch each container inside isolated namespaces with resource limits enforced by cgroups v2. Understanding this stack is how you get from "kubectl apply" to "ps aux shows nginx running in its own PID namespace."
Pod Creation Stepper
Pod Spec
When you run kubectl apply -f pod.yaml, the API server authenticates and authorizes the request,
runs it through admission controllers (including any admission webhooks), and stores the Pod object in etcd.
The scheduler notices the unassigned Pod and binds it to a node, which sets the spec.nodeName field.
The Pod stays in the Pending phase, now with its PodScheduled condition set to True. The kubelet on the target node detects
this change and begins its sync loop.
# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        memory: 128Mi
Container Runtime Interface (CRI)
The plugin interface between kubelet and container runtimes. gRPC over a Unix socket.
gRPC API
The CRI is a gRPC API defined in api.proto (api.pb.go is the Go code generated from it). The kubelet acts
as the gRPC client; the runtime (containerd's CRI plugin) is the server. Messages are
Protocol Buffers: no JSON, no REST. The socket lives at
/var/run/dockershim.sock (dockershim, removed in Kubernetes 1.24) or
/var/run/containerd/containerd.sock (containerd CRI).
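As a rough illustration of how a client gets from a CRI endpoint string to something it can dial, the sketch below splits a unix:// endpoint into the network and address that Go's net.Dial expects. The helper name dialTarget is hypothetical, and a real CRI client would hand the result to a gRPC library rather than stop here.

```go
package main

import (
	"fmt"
	"net/url"
)

// dialTarget (hypothetical helper) splits a CRI endpoint such as
// "unix:///var/run/containerd/containerd.sock" into the network and
// address that net.Dial expects. Only the unix scheme is handled.
func dialTarget(endpoint string) (network, address string, err error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "unix" {
		return "", "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	return "unix", u.Path, nil
}

func main() {
	network, address, err := dialTarget("unix:///var/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	fmt.Println(network, address)
}
```

Keeping the endpoint as a URI leaves room for other transports without changing the caller.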
ImageService
ListImages, PullImage, RemoveImage,
ImageStatus, ImageFsInfo. Before a container can run, its image must be pulled.
containerd downloads layers from a registry into its content store, unpacks them into snapshots,
and records metadata in bolt DB.
RuntimeService
The core: RunPodSandbox, CreateContainer, StartContainer,
StopContainer, RemoveContainer, ContainerStatus,
ExecSync (run a command in a container and wait for its result). With containerd and runc, the PodSandbox is backed by the pause container,
which holds the pod's network namespace.
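The ordering of these calls matters: the sandbox must exist before any container is created in it. The sketch below models that order with a hypothetical, heavily reduced interface and a fake runtime that only records calls. The real CRI RPCs exchange protobuf request and response messages, so the signatures here are illustrative assumptions only.

```go
package main

import "fmt"

// runtimeService is a hypothetical, reduced stand-in for the CRI
// RuntimeService; the real RPCs take and return protobuf messages.
type runtimeService interface {
	RunPodSandbox(podName string) (sandboxID string, err error)
	CreateContainer(sandboxID, name, image string) (containerID string, err error)
	StartContainer(containerID string) error
}

// fakeRuntime records the calls it receives instead of running anything.
type fakeRuntime struct{ calls []string }

func (f *fakeRuntime) RunPodSandbox(podName string) (string, error) {
	f.calls = append(f.calls, "RunPodSandbox")
	return "sandbox-" + podName, nil
}

func (f *fakeRuntime) CreateContainer(sandboxID, name, image string) (string, error) {
	f.calls = append(f.calls, "CreateContainer")
	return sandboxID + "/" + name, nil
}

func (f *fakeRuntime) StartContainer(containerID string) error {
	f.calls = append(f.calls, "StartContainer")
	return nil
}

// startPod issues the calls in dependency order: sandbox first, then
// create and start a single container inside it.
func startPod(rt runtimeService, podName, containerName, image string) error {
	sandboxID, err := rt.RunPodSandbox(podName)
	if err != nil {
		return err
	}
	containerID, err := rt.CreateContainer(sandboxID, containerName, image)
	if err != nil {
		return err
	}
	return rt.StartContainer(containerID)
}

func main() {
	rt := &fakeRuntime{}
	if err := startPod(rt, "nginx", "nginx", "nginx:1.27"); err != nil {
		panic(err)
	}
	fmt.Println(rt.calls)
}
```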
Streaming API
Exec, Attach, and PortForward are long-lived streams, so
they don't run over the main CRI gRPC channel. Instead, containerd runs a
separate streaming server and answers each of these calls with a URL
pointing at it. The API server proxies the kubectl exec connection to the kubelet,
which forwards it to that URL.
containerd Architecture
containerd is not just a daemon; it's a layered system with distinct components.
Shim Model: Why It Matters
containerd does not stay in the process tree above each container. It starts containerd-shim,
which invokes runc and then sits between the container and containerd.
runc exits after container setup; the shim remains as the container's parent, holding its stdio and exit status.
This means containerd can restart without killing containers.
Without shims, a containerd upgrade would kill every running container.
bolt DB: Metadata Store
containerd uses bolt DB (an embedded key-value store written in Go) for all metadata: container config, image references, snapshot records. It's the source of truth. The content store holds the actual blobs, such as layer tarballs and manifests, and bolt DB entries reference that content by digest. On restart, containerd reads bolt DB and reconnects to the running shims, so it knows which containers exist.
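The split between metadata and content-addressed blobs can be sketched with two in-memory maps. This is a toy model under stated assumptions: the contentStore type and the image reference string are hypothetical, and containerd's real stores live on disk behind their own APIs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentStore (toy model) maps a digest to the bytes it identifies.
type contentStore map[string][]byte

// put stores blob under its SHA-256 digest and returns that digest.
func (s contentStore) put(blob []byte) string {
	sum := sha256.Sum256(blob)
	digest := "sha256:" + hex.EncodeToString(sum[:])
	s[digest] = blob
	return digest
}

func main() {
	content := contentStore{}
	metadata := map[string]string{} // image reference -> layer digest

	digest := content.put([]byte("layer bytes"))
	metadata["docker.io/library/nginx:1.27"] = digest

	// Resolving a reference goes through metadata first, then to the blob.
	blob := content[metadata["docker.io/library/nginx:1.27"]]
	fmt.Println(digest, len(blob))
}
```

Because the key is derived from the bytes, storing the same layer twice costs nothing extra, and a corrupted blob can be detected by rehashing it.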
cgroups v2 Hierarchy
Every pod gets its own cgroup subtree. See how the hierarchy nests, and what OOM really means.
cgroups v2 Overview
cgroups v2 (unified cgroup hierarchy) organizes all processes into a single tree.
Each controller (cpu, io, memory, pids) constrains one kind of resource.
Kubernetes creates a cgroup per Pod under /sys/fs/cgroup/kubepods.slice (with the systemd cgroup driver).
cpu: CFS bandwidth (cpu.max), relative weight (cpu.weight)
io: IO weight (io.weight), throttling (io.max), statistics (io.stat)
memory: memory.high, memory.max, memory.current
pids: pids.max, pids.current
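To make the cpu controller files concrete, here is a minimal sketch of how a CPU request and limit in millicores could be turned into cpu.weight and cpu.max values. It assumes the linear shares-to-weight mapping that Kubernetes tooling has commonly used and a 100 ms CFS period; newer runtime versions may compute the weight differently, and the function names are hypothetical.

```go
package main

import "fmt"

const cfsPeriodUS = 100000 // assumed CFS period in microseconds

// cpuWeight converts a CPU request in millicores to a cgroup v2 cpu.weight.
// It goes through the cgroup v1 "shares" value (1024 per core) and maps
// the v1 range [2, 262144] linearly onto the v2 range [1, 10000].
func cpuWeight(milliCPU int64) int64 {
	shares := milliCPU * 1024 / 1000
	if shares < 2 {
		shares = 2
	}
	if shares > 262144 {
		shares = 262144
	}
	return 1 + (shares-2)*9999/262142
}

// cpuMax renders the cpu.max line for a CPU limit in millicores.
// A zero limit means "no limit", written as "max <period>".
func cpuMax(milliCPU int64) string {
	if milliCPU == 0 {
		return fmt.Sprintf("max %d", cfsPeriodUS)
	}
	quota := milliCPU * cfsPeriodUS / 1000
	return fmt.Sprintf("%d %d", quota, cfsPeriodUS)
}

func main() {
	// The nginx Pod requests 100m CPU and sets no CPU limit.
	fmt.Println("cpu.weight:", cpuWeight(100))
	fmt.Println("cpu.max:", cpuMax(0))
}
```

Under these assumptions the 100m request becomes a weight of 4, and the missing CPU limit leaves cpu.max unbounded.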
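OOM in cgroups v2
When memory.current reaches memory.max, the kernel first tries to
reclaim memory from the cgroup. If reclaim cannot free enough, the OOM killer runs inside that cgroup and sends SIGKILL
to the process with the highest badness score, usually the one using the most memory.
(If memory.oom.group is set to 1, every process in the cgroup is killed together.) The victim comes from the
container that went over its limit, not from the whole pod.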
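One way to confirm that a container was OOM-killed is to read the counters in its cgroup's memory.events file. The sketch below parses the "key value" lines of that file from a string; the sample contents are made up for illustration, and on a node the data would come from the container's cgroup directory.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents parses the "key value" lines of a cgroup v2
// memory.events file into a map of counters.
func parseMemoryEvents(data string) (map[string]uint64, error) {
	events := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, err
		}
		events[fields[0]] = n
	}
	return events, sc.Err()
}

func main() {
	// Made-up sample; real data would be read with os.ReadFile.
	sample := "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
	events, err := parseMemoryEvents(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("oom_kill:", events["oom_kill"])
}
```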
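memory.high vs memory.max
memory.high = soft limit. When exceeded, the kernel throttles allocations
and reclaims aggressively, but does not invoke the OOM killer. memory.max = hard limit. When it is hit and reclaim
cannot free enough, the OOM killer runs. The kubelet sets memory.max to the container's memory limit. It sets memory.high only when the
MemoryQoS feature gate is enabled, to a value below the limit derived from a configurable throttling factor, giving headroom for graceful reclaim before SIGKILL.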
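The relationship between the two files can be sketched numerically. This is a simplified model, not the kubelet's actual formula: it scales the limit by a caller-supplied percentage and rounds down to a page boundary. The 90 percent factor in main, the 4096-byte page size, and the function names are illustrative assumptions.

```go
package main

import "fmt"

const pageSize = 4096 // assumed page size in bytes

// memoryMax returns the hard limit: the container's memory limit as is.
func memoryMax(limitBytes int64) int64 {
	return limitBytes
}

// memoryHigh returns a soft limit below the hard limit: the limit scaled
// by factorPercent and rounded down to a multiple of the page size.
func memoryHigh(limitBytes, factorPercent int64) int64 {
	scaled := limitBytes * factorPercent / 100
	return scaled / pageSize * pageSize
}

func main() {
	limit := int64(128 << 20) // the Pod's 128Mi memory limit
	fmt.Println("memory.max: ", memoryMax(limit))
	fmt.Println("memory.high:", memoryHigh(limit, 90))
}
```

The gap between the two values is the room the kernel has to throttle and reclaim before the OOM killer becomes the only option.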
runc & OCI Runtime Spec
runc is the OCI reference implementation. It takes an OCI bundle and runs a container.
OCI Bundle
An OCI bundle is a directory containing config.json (the runtime spec) and
a rootfs/ (the container filesystem). containerd generates the spec from
the Pod and container configuration, then calls runc create followed by runc start.
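Namespaces
pid: each container sees only its own processes (its init is PID 1).
net: own network stack (eth0, lo, routes), shared by all containers in the pod.
ipc: isolated SysV IPC objects and POSIX message queues.
mnt: own mount table, so the container sees its own root filesystem rather than the host's.
uts: own hostname and domain name.
user: UID/GID remapping, so root inside the container ≠ root outside (only when user namespaces are enabled for the pod).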
cgroups v2 Enforcement
runc writes cgroup files under /sys/fs/cgroup/ to constrain the container's
processes. Through libcontainer (its Go library), runc
configures all controllers before it execve()s the container process.
The kernel enforces the limits; runc just sets them up.
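Capabilities & seccomp
Linux capabilities (CAP_DAC_READ_SEARCH, CAP_NET_ADMIN, etc.) can be
granted or dropped per container. seccomp (secure computing mode) filters syscalls, so
containers can be blocked from calling mount, ptrace,
or reboot. Kubernetes applies the runtime's default seccomp profile, which blocks roughly 44 syscalls, only when the Pod requests RuntimeDefault or the kubelet is configured to default to it; otherwise containers run unconfined.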
Runtime Comparison: runc vs Kata vs gVisor
Choose the right isolation level for your threat model.
runc (default): runtimeClassName: runc
Kata (microVM): runtimeClassName: kata
gVisor (user-space kernel): runtimeClassName: gvisor
Using RuntimeClass
# 1. Create RuntimeClass
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
# 2. Use it in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx
spec:
  runtimeClassName: kata
  containers:
  - name: nginx
    image: nginx:1.27
Full Creation Flow (Code Walkthrough)
Annotated code showing what each component actually does.
// Simplified sketch of the flow in pkg/kubelet/kuberuntime/kuberuntime_manager.go.
// The real SyncPod has a different signature and also computes pod actions,
// tears down stale sandboxes, and runs init containers. The helper methods
// and their signatures below are illustrative, not the kubelet's own.
func (m *kubeGenericRuntimeManager) SyncPod(
	ctx context.Context,
	pod *v1.Pod,
	pullSecrets []v1.Secret,
) error {
	// Step 1: Pull images
	if err := m.pullImages(ctx, pod, pullSecrets); err != nil {
		return err
	}
	// Step 2: Create pod sandbox (pause container)
	podSandboxID, err := m.createPodSandbox(ctx, pod)
	if err != nil {
		return err
	}
	// Step 3: Create containers, keeping each ID for the start step
	containerIDs := make([]string, 0, len(pod.Spec.Containers))
	for i := range pod.Spec.Containers {
		id, err := m.createContainer(ctx, pod, &pod.Spec.Containers[i], podSandboxID)
		if err != nil {
			return err
		}
		containerIDs = append(containerIDs, id)
	}
	// Step 4: Start containers
	for _, id := range containerIDs {
		if err := m.startContainer(ctx, id); err != nil {
			return err
		}
	}
	return nil
}