πŸƒ Container Runtime

CRI · containerd · runc · cgroups v2: The Full Path from Pod Spec to Running Containers

When you apply a Pod spec, Kubernetes doesn't run containers itself; it delegates. The kubelet calls the Container Runtime Interface (CRI), a gRPC API served by containerd's built-in CRI plugin. containerd then starts a shim, which invokes runc to launch each container in its own set of namespaces, with resource limits enforced through cgroups v2. Understanding this stack is how you get from "kubectl apply" to "ps aux shows nginx running in its own pid namespace."

Pod Creation Stepper

Every hop from Pod spec to running container processes:

1. Pod Spec: kubectl apply sends the spec to the API server, which stores it in etcd
2. kubelet: Syncs Pod status, calls the CRI API
3. CRI (gRPC): ImageService + RuntimeService via Unix socket
4. containerd: Pulls image, creates snapshot, writes the OCI bundle, spawns shim
5. runc: Reads the OCI bundle, configures namespaces + cgroups
6. Container: Processes running in isolated namespaces

Pod Spec

When you run kubectl apply -f pod.yaml, the API server authenticates and authorizes the request, runs it through admission control (including any admission webhooks), validates the object, and stores the Pod in etcd. The scheduler notices a Pod with no node assigned and binds it by setting spec.nodeName; the Pod's phase remains Pending until its containers start. The kubelet on the target node, which watches for Pods bound to it, picks up the change and begins its sync loop.

# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        memory: 128Mi

Container Runtime Interface (CRI)

The plugin interface between kubelet and container runtimes. gRPC over a Unix socket.

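gRPC API

The CRI is a gRPC API defined in api.proto in the k8s.io/cri-api module; api.pb.go is the Go code generated from it. The kubelet acts as the gRPC client, and the runtime (for containerd, its built-in CRI plugin) is the server. Messages are Protocol Buffers: no JSON, no REST. For containerd the socket lives at /run/containerd/containerd.sock (also reachable as /var/run/containerd/containerd.sock). The older /var/run/dockershim.sock belonged to dockershim, which was removed in Kubernetes 1.24.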
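
As a small illustration of the socket addressing described above, the sketch below splits a CRI endpoint string into the network and address that Go's net.Dial expects. It assumes the endpoint is written in the URL form the kubelet accepts for --container-runtime-endpoint; parseEndpoint is a hypothetical helper written for this example, not kubelet code.

```go
package main

import (
	"fmt"
	"net/url"
)

// parseEndpoint splits a CRI endpoint such as
// "unix:///run/containerd/containerd.sock" into the network and address
// arguments expected by net.Dial. Hypothetical helper for illustration.
func parseEndpoint(endpoint string) (string, string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", "", err
	}
	switch u.Scheme {
	case "unix":
		// For unix sockets the address is the filesystem path.
		return "unix", u.Path, nil
	case "tcp":
		// For TCP the address is host:port.
		return "tcp", u.Host, nil
	default:
		return "", "", fmt.Errorf("unsupported CRI endpoint scheme %q", u.Scheme)
	}
}

func main() {
	network, addr, err := parseEndpoint("unix:///run/containerd/containerd.sock")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(network, addr)
}
```

A real client would pass the result to a gRPC dialer and then use the generated CRI stubs; that part is omitted because it needs modules outside the standard library.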

ImageService

ListImages, PullImage, RemoveImage, ImageStatus. Before a container can run, its image must be present on the node. containerd downloads the layers from a registry into its content store, unpacks them into snapshots through a snapshotter, and records the metadata in BoltDB.

🎬 RuntimeService

The core: RunPodSandbox, CreateContainer, StartContainer, StopContainer, RemoveContainer, ContainerStatus, ExecSync (sync exec in a container). PodSandbox is the pause container β€” the network namespace holder.
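
To make the call order concrete, here is a minimal sketch that drives a fake runtime through the sandbox, image, create, and start steps. The runtimeService interface is a hypothetical, heavily trimmed stand-in for the CRI services (the real RPCs take protobuf request messages), and the recorder type, the syncPod function, and the ID formats are assumptions made only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// runtimeService is a hypothetical, trimmed stand-in for the CRI
// RuntimeService and ImageService. It is not the real API.
type runtimeService interface {
	RunPodSandbox(podName string) (string, error)
	PullImage(image string) error
	CreateContainer(sandboxID, name, image string) (string, error)
	StartContainer(containerID string) error
}

// recorder implements runtimeService by logging each call instead of
// talking to a real runtime.
type recorder struct{ calls []string }

func (r *recorder) RunPodSandbox(podName string) (string, error) {
	r.calls = append(r.calls, "RunPodSandbox("+podName+")")
	return "sb-" + podName, nil
}

func (r *recorder) PullImage(image string) error {
	r.calls = append(r.calls, "PullImage("+image+")")
	return nil
}

func (r *recorder) CreateContainer(sandboxID, name, image string) (string, error) {
	r.calls = append(r.calls, "CreateContainer("+sandboxID+", "+name+")")
	return sandboxID + "/" + name, nil
}

func (r *recorder) StartContainer(containerID string) error {
	r.calls = append(r.calls, "StartContainer("+containerID+")")
	return nil
}

type containerSpec struct{ Name, Image string }

// syncPod creates the sandbox once, then pulls, creates, and starts each
// container inside it.
func syncPod(rt runtimeService, podName string, containers []containerSpec) error {
	sandboxID, err := rt.RunPodSandbox(podName)
	if err != nil {
		return err
	}
	for _, c := range containers {
		if err := rt.PullImage(c.Image); err != nil {
			return err
		}
		id, err := rt.CreateContainer(sandboxID, c.Name, c.Image)
		if err != nil {
			return err
		}
		if err := rt.StartContainer(id); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	r := &recorder{}
	if err := syncPod(r, "nginx", []containerSpec{{"nginx", "nginx:1.27"}}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(r.calls, "\n"))
}
```

Hiding the runtime behind an interface is also roughly why the kubelet can swap containerd for another CRI implementation without code changes.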

Streaming API

Exec, Attach, and PortForward carry long-lived, bidirectional streams, so they do not go over the main CRI gRPC channel. Instead, the CRI call returns the URL of a streaming server that containerd runs as a separate HTTP endpoint. For kubectl exec, the request travels from kubectl to the API server to the kubelet, and the stream is then connected to that endpoint.

containerd Architecture

containerd is not just a daemon; it's a layered system with distinct components.

Client Layer: ctr CLI, gRPC API clients
↓
Metadata (BoltDB): namespaces, containers, snapshots, tasks
↓
Content Store: image layers, content-addressable by digest (sha256)
↓
Snapshotter: overlayfs, device mapper, Btrfs / ZFS
↓
Runtime (launched through containerd-shim): runc, Kata / gVisor, custom OCI runtimes

Shim Model: Why It Matters

containerd does not run runc directly. It starts a containerd-shim process, and the shim invokes runc. runc exits once the container has been created and started; the shim stays behind as the parent of the container process, holding its stdio and collecting its exit status. This means containerd can restart or be upgraded without killing containers. Without shims, a containerd upgrade would kill every running container.

BoltDB: Metadata Store

containerd uses BoltDB (bbolt, an embedded key-value store written in Go) for all metadata: container config, image references, snapshot records. It is the source of truth for what exists. The content store holds the actual blobs, such as layer tarballs, and BoltDB entries reference that content by digest. On restart, containerd reads BoltDB and reconnects to the shims that are still running, so it knows which containers exist and what state they are in.
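
Content addressing is easy to sketch: the key for a blob is the hash of its bytes. The example below computes a sha256 digest string and derives a storage path from it. The blobs/<algorithm>/<hex> layout and the /store root are assumptions for illustration, and digest and blobPath are hypothetical helpers rather than containerd functions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
	"strings"
)

// digest returns the content address of a blob in "sha256:<hex>" form.
func digest(blob []byte) string {
	sum := sha256.Sum256(blob)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// blobPath maps a digest to a file path under root, assuming a
// blobs/<algorithm>/<hex> layout (an assumption made for this sketch).
func blobPath(root, dgst string) (string, error) {
	algo, hexPart, ok := strings.Cut(dgst, ":")
	if !ok || algo == "" || hexPart == "" {
		return "", fmt.Errorf("malformed digest %q", dgst)
	}
	return path.Join(root, "blobs", algo, hexPart), nil
}

func main() {
	d := digest([]byte("hello"))
	p, err := blobPath("/store", d)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d)
	fmt.Println(p)
}
```

Because the name is derived from the bytes, two images that share a layer store it once, and a corrupted blob can be detected by rehashing it.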

cgroups v2 Hierarchy

Every pod gets its own cgroup subtree. See how the hierarchy nests, and what OOM really means.

cgroups v2 Overview

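cgroups v2 (the unified cgroup hierarchy) organizes all processes into a single tree. Each controller (cpu, io, memory, pids) constrains one kind of resource. The kubelet creates a cgroup per Pod under /sys/fs/cgroup/kubepods.slice when it uses the systemd cgroup driver, or under /sys/fs/cgroup/kubepods with the cgroupfs driver, with Burstable and BestEffort Pods nested under their own QoS-class subgroups.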

cpu: CFS bandwidth (cpu.max), relative share (cpu.weight)
io: proportional weight (io.weight), throttling (io.max), statistics (io.stat)
memory: memory.high, memory.max, memory.current
pids: pids.max, pids.current
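
A rough sketch of how Pod resource fields could turn into values for these files. It assumes the default 100 ms CFS period and the linear shares-to-weight mapping that Kubernetes documented for its cgroup v2 support; newer runtime versions may use a different conversion, so treat the exact numbers as illustrative. All function names are hypothetical.

```go
package main

import "fmt"

// cpuPeriodUsec is the default CFS period (100ms), assumed here.
const cpuPeriodUsec = 100000

// cpuMax renders a cpu.max value for a CPU limit given in millicores.
// A limit of zero or less is treated as "no limit".
func cpuMax(limitMilli int64) string {
	if limitMilli <= 0 {
		return fmt.Sprintf("max %d", cpuPeriodUsec)
	}
	quota := limitMilli * cpuPeriodUsec / 1000
	return fmt.Sprintf("%d %d", quota, cpuPeriodUsec)
}

// cpuShares converts a CPU request in millicores to cgroup v1 style
// shares (1024 per full CPU, minimum 2).
func cpuShares(requestMilli int64) int64 {
	shares := requestMilli * 1024 / 1000
	if shares < 2 {
		shares = 2
	}
	return shares
}

// cpuWeight maps shares in [2, 262144] linearly onto the cgroup v2
// cpu.weight range [1, 10000]. Assumed conversion, see the note above.
func cpuWeight(shares int64) int64 {
	return 1 + (shares-2)*9999/262142
}

func main() {
	// Values from the example Pod: cpu request 100m, no cpu limit,
	// memory limit 128Mi.
	fmt.Println("cpu.weight =", cpuWeight(cpuShares(100)))
	fmt.Println("cpu.max    =", cpuMax(0))
	fmt.Println("memory.max =", int64(128)<<20)
}
```

Note that the example Pod sets no CPU limit, so under these assumptions cpu.max stays at "max" and only the relative weight applies.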

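OOM in cgroups v2

When memory.current reaches memory.max, the kernel first tries to reclaim memory from the cgroup. If reclaim cannot free enough, the OOM killer runs inside that cgroup and sends SIGKILL to the process with the highest badness score, usually the largest memory consumer. If memory.oom.group is set on the cgroup, every process in it is killed together. Either way the damage is confined to the container that went over its limit, not the whole pod.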
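
One way to confirm that a cgroup has seen OOM kills is to read its memory.events file, which holds one "key value" counter per line. The sketch below parses that format from a string; the sample content is made up for illustration, and parseMemoryEvents is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents parses the "key value" lines of a cgroup v2
// memory.events file into a map of counters.
func parseMemoryEvents(content string) (map[string]uint64, error) {
	events := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		events[fields[0]] = n
	}
	return events, sc.Err()
}

func main() {
	// Made-up sample; on a node this would come from
	// <cgroup directory>/memory.events.
	sample := "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
	events, err := parseMemoryEvents(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("oom_kill =", events["oom_kill"])
}
```

A non-zero oom_kill counter means at least one process in that cgroup was killed by the OOM killer.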

memory.high vs memory.max

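memory.high is a soft limit: when usage exceeds it, the kernel throttles allocations and reclaims aggressively, but never invokes the OOM killer. memory.max is the hard limit: when it is hit and reclaim fails, the OOM killer runs. By default Kubernetes sets only memory.max, to the container's memory limit. With the alpha MemoryQoS feature gate enabled, the kubelet also sets memory.high to a value between the request and the limit, controlled by a throttling factor, which leaves headroom for reclaim before SIGKILL.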
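
How far memory.high sits below memory.max depends on the kubelet version and its feature gates. The sketch below assumes the formula published for the alpha MemoryQoS feature in Kubernetes 1.27: the request plus a throttling factor times the gap between limit and request, rounded down to a page boundary. The 0.9 factor and 4096-byte page size are assumptions, and memoryHigh is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// pageSize is assumed to be 4 KiB; it differs on some architectures.
const pageSize = 4096

// memoryHigh computes a memory.high value in bytes from a container's
// memory request and limit, using the assumed MemoryQoS formula:
// floor((request + factor*(limit-request)) / pageSize) * pageSize.
func memoryHigh(request, limit int64, throttlingFactor float64) int64 {
	raw := float64(request) + throttlingFactor*float64(limit-request)
	return int64(math.Floor(raw/pageSize)) * pageSize
}

func main() {
	request := int64(64) << 20 // 64Mi, from the example Pod
	limit := int64(128) << 20  // 128Mi, from the example Pod
	fmt.Println("memory.high =", memoryHigh(request, limit, 0.9))
	fmt.Println("memory.max  =", limit)
}
```

With the example Pod's 64Mi request and 128Mi limit, this places memory.high a few mebibytes below memory.max, so throttling starts before the hard limit is reached.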

runc & OCI Runtime Spec

runc is the OCI reference implementation. It takes an OCI bundle and runs a container.

OCI Bundle

An OCI bundle is a directory containing config.json (the runtime spec) and rootfs/ (the container's root filesystem). containerd generates the spec from the Pod and container configuration it receives over CRI; the shim then invokes runc create followed by runc start.

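Namespaces

pid: each container sees only its own processes (its init is PID 1).
net: own network stack (eth0, lo, routes), shared by all containers in the Pod.
ipc: isolated SysV IPC objects and POSIX message queues.
mnt: own mount table, so the container sees its rootfs rather than the host filesystem.
uts: own hostname and domain name.
user: UID/GID remapping, so root inside the container is not root on the host. Kubernetes only uses it when the Pod sets hostUsers: false.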
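
On Linux, each process exposes its namespaces as symlinks under /proc/<pid>/ns, with targets such as "pid:[4026531836]". Two processes share a namespace exactly when the type and inode number match. The sketch below parses such link targets from strings; the inode numbers are made-up samples, and parseNSLink and sameNamespace are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNSLink parses the target of a /proc/<pid>/ns/<type> symlink,
// e.g. "pid:[4026531836]", into the namespace type and inode number.
func parseNSLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

// sameNamespace reports whether two link targets refer to the same
// namespace (same type and same inode).
func sameNamespace(a, b string) (bool, error) {
	kindA, inodeA, err := parseNSLink(a)
	if err != nil {
		return false, err
	}
	kindB, inodeB, err := parseNSLink(b)
	if err != nil {
		return false, err
	}
	return kindA == kindB && inodeA == inodeB, nil
}

func main() {
	// Made-up samples; on a node these would come from
	// os.Readlink("/proc/<pid>/ns/pid").
	host := "pid:[4026531836]"
	container := "pid:[4026532711]"
	same, err := sameNamespace(host, container)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("shares pid namespace with host:", same)
}
```

Comparing the net links of two containers in the same Pod should show a match, since they share the sandbox's network namespace.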

cgroups v2 Enforcement

runc writes cgroup files under /sys/fs/cgroup/ to constrain the container's processes. It uses its libcontainer library (written in Go) to configure all controllers and place the container's init process in the cgroup before that process calls execve() on the container's entrypoint. The kernel enforces the limits; runc just sets them up.

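Capabilities & seccomp

Linux capabilities (CAP_DAC_READ_SEARCH, CAP_NET_ADMIN, etc.) can be granted or dropped per container. seccomp (secure computing mode) filters syscalls, so containers can be blocked from calling mount, ptrace, or reboot. The container runtime's default seccomp profile blocks roughly 44 syscalls. Kubernetes applies it when a Pod sets seccompProfile.type: RuntimeDefault or when the kubelet runs with --seccomp-default; otherwise containers run unconfined.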
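
A process's effective capabilities appear as a hex bitmask on the CapEff line of /proc/<pid>/status. The sketch below decodes such a mask and tests individual bits, using bit numbers from linux/capability.h. The sample mask is one commonly seen for containers running with a default capability set and is used here only as an illustration; the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// Capability bit numbers, as defined in linux/capability.h.
const (
	capDacReadSearch = 2
	capNetAdmin      = 12
	capNetRaw        = 13
	capSysAdmin      = 21
)

// parseCapMask parses a hex capability mask such as the CapEff value
// in /proc/<pid>/status.
func parseCapMask(hexMask string) (uint64, error) {
	return strconv.ParseUint(hexMask, 16, 64)
}

// hasCap reports whether the capability with the given bit number is
// present in the mask.
func hasCap(mask uint64, bit uint) bool {
	return mask&(1<<bit) != 0
}

func main() {
	// Sample value for illustration only.
	mask, err := parseCapMask("00000000a80425fb")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("CAP_NET_RAW:        ", hasCap(mask, capNetRaw))
	fmt.Println("CAP_NET_ADMIN:      ", hasCap(mask, capNetAdmin))
	fmt.Println("CAP_DAC_READ_SEARCH:", hasCap(mask, capDacReadSearch))
	fmt.Println("CAP_SYS_ADMIN:      ", hasCap(mask, capSysAdmin))
}
```

In the sample mask, CAP_NET_RAW is set while CAP_NET_ADMIN, CAP_DAC_READ_SEARCH, and CAP_SYS_ADMIN are not, which is the kind of reduced set an unprivileged container typically runs with.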

Runtime Comparison: runc vs Kata vs gVisor

Choose the right isolation level for your threat model.

| Feature | runc (default) | Kata Containers (microVM) | gVisor (user-space kernel) |
|---|---|---|---|
| Isolation | Linux namespaces + cgroups | Minimal VM with its own guest kernel | Sentry kernel emulation in user space |
| Performance | ✅ Near-native | ⚡ ~95% of native (VM overhead; workload-dependent) | ⚡ ~85% of native (intercepted I/O; workload-dependent) |
| Boot Time | ~100ms | ~1–2s (VM boot) | ~200ms (Sentry init) |
| Kernel Sharing | ✅ Host kernel | ❌ Minimal Linux guest kernel | ❌ Sentry (emulated) |
| Host Vulnerability Risk | ⚠️ Shared kernel, so a kernel bug can lead to container escape | ✅ VM boundary | ✅ User-space kernel |
| Use Case | Trusted workloads | Multi-tenant untrusted | Secure sandboxing |
| RuntimeClass | runtimeClassName: runc | runtimeClassName: kata | runtimeClassName: gvisor |

Using RuntimeClass

# 1. Create RuntimeClass
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
# 2. Use it in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx
spec:
  runtimeClassName: kata
  containers:
  - name: nginx
    image: nginx:1.27

Full Creation Flow (Code Walkthrough)

Annotated code showing what each component actually does.

// Simplified sketch of SyncPod in
// pkg/kubelet/kuberuntime/kuberuntime_manager.go.
// Not verbatim upstream code: the signature and the helper methods
// (pullImage, createContainer, startContainer) are condensed for illustration.

// SyncPod creates the pod sandbox if necessary, and then
// creates and starts the containers in the pod.
func (m *kubeGenericRuntimeManager) SyncPod(
    ctx context.Context,
    pod *v1.Pod,
    pullSecrets []v1.Secret,
) error {
    // Step 1: Create pod sandbox (pause container) via RunPodSandbox
    podSandboxID, err := m.createPodSandbox(ctx, pod)
    if err != nil {
        return err
    }

    // Steps 2-4 run once per container, in spec order.
    for i := range pod.Spec.Containers {
        container := &pod.Spec.Containers[i]

        // Step 2: Pull the image if it is not already present (ImageService)
        if err := m.pullImage(ctx, pod, container, pullSecrets); err != nil {
            return err
        }

        // Step 3: Create the container inside the sandbox (CreateContainer)
        containerID, err := m.createContainer(ctx, pod, container, podSandboxID)
        if err != nil {
            return err
        }

        // Step 4: Start the container (StartContainer)
        if err := m.startContainer(ctx, containerID); err != nil {
            return err
        }
    }
    return nil
}