🏃 Container Runtime

CRI · containerd · runc · cgroups v2 — The Full Path from Pod Spec to Running Containers

When you apply a Pod spec, Kubernetes doesn't run containers itself; it delegates. The kubelet calls the Container Runtime Interface (CRI), which containerd implements through its built-in CRI plugin. containerd then starts a shim process, and the shim invokes runc to launch each container inside isolated namespaces, with resource limits enforced by cgroups v2. Understanding this stack is how you go from "kubectl apply" to "ps aux shows nginx running in its own pid namespace."

🔱 Pod Creation Stepper


1. 📄 Pod Spec: kubectl apply → API server stores in etcd
2. 🟢 kubelet: Syncs Pod status, calls CRI API
3. 🔌 CRI (gRPC): ImageService + RuntimeService via Unix socket
4. 📦 containerd: Pulls image, creates snapshot, writes OCI bundle, spawns shim
5. 🚀 runc: Reads OCI bundle, configures namespaces + cgroups
6. 🐳 Container: Processes running in isolated namespaces

📄

Pod Spec

When you run kubectl apply -f pod.yaml, the API server authenticates and authorizes the request, runs admission control (including any admission webhooks), validates the object, and stores the Pod in etcd. The scheduler notices the unassigned Pod and binds it to a node, which sets spec.nodeName. The Pod stays in the Pending phase, now with the PodScheduled condition set to True. The kubelet on the target node watches for Pods bound to it and picks the new Pod up in its sync loop.

# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.27
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        memory: 128Mi

🔌 Container Runtime Interface (CRI)

The plugin interface between kubelet and container runtimes. gRPC over a Unix socket.

📡 gRPC API

The CRI is a gRPC API defined in api.proto in the k8s.io/cri-api module (api.pb.go is the Go code generated from it). The kubelet acts as the gRPC client; the runtime (the CRI plugin inside containerd) is the server. Messages are Protocol Buffers: no JSON, no REST. For containerd the socket lives at /var/run/containerd/containerd.sock. The old /var/run/dockershim.sock belonged to dockershim, which was removed in Kubernetes 1.24.

🖼️ ImageService

ListImages, PullImage, RemoveImage, ImageStatus. Before a container can run, its image must be pulled. containerd downloads the layers from a registry into its content store, unpacks them into snapshots, and records metadata in bolt DB.

🎬 RuntimeService

The core: RunPodSandbox, CreateContainer, StartContainer, StopContainer, RemoveContainer, ContainerStatus, ExecSync (sync exec in a container). PodSandbox is the pause container — the network namespace holder.
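
The call order matters: the sandbox must exist before any container is created in it. The sketch below models that ordering with a hypothetical, heavily trimmed RuntimeService interface and a fake runtime that only records calls. The type and method signatures are assumptions made for illustration; the real CRI uses gRPC request and response messages.

```go
package main

import "fmt"

// containerSpec is a hypothetical stand-in for the CRI ContainerConfig.
type containerSpec struct{ Name, Image string }

// RuntimeService is a trimmed, illustrative version of the CRI service.
type RuntimeService interface {
	RunPodSandbox(podName string) (string, error)
	CreateContainer(sandboxID string, c containerSpec) (string, error)
	StartContainer(containerID string) error
}

// fakeRuntime records every call so the ordering can be inspected.
type fakeRuntime struct{ calls []string }

func (f *fakeRuntime) RunPodSandbox(podName string) (string, error) {
	f.calls = append(f.calls, "RunPodSandbox")
	return "sandbox-" + podName, nil
}

func (f *fakeRuntime) CreateContainer(sandboxID string, c containerSpec) (string, error) {
	f.calls = append(f.calls, "CreateContainer:"+c.Name)
	return sandboxID + "/" + c.Name, nil
}

func (f *fakeRuntime) StartContainer(containerID string) error {
	f.calls = append(f.calls, "StartContainer:"+containerID)
	return nil
}

// startPod creates one sandbox, then creates and starts each container in it.
func startPod(rt RuntimeService, podName string, containers []containerSpec) error {
	sandboxID, err := rt.RunPodSandbox(podName)
	if err != nil {
		return fmt.Errorf("run sandbox: %w", err)
	}
	for _, c := range containers {
		id, err := rt.CreateContainer(sandboxID, c)
		if err != nil {
			return fmt.Errorf("create %s: %w", c.Name, err)
		}
		if err := rt.StartContainer(id); err != nil {
			return fmt.Errorf("start %s: %w", c.Name, err)
		}
	}
	return nil
}

func main() {
	rt := &fakeRuntime{}
	pod := []containerSpec{{Name: "nginx", Image: "nginx:1.27"}}
	if err := startPod(rt, "nginx", pod); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, call := range rt.calls {
		fmt.Println(call)
	}
}
```

Swapping fakeRuntime for a real gRPC client would keep startPod unchanged, which is the point of having the CRI as an interface.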

📺 Streaming API

Exec, Attach, and PortForward are long-lived streams, so they are not carried on the main CRI gRPC channel. Instead, the CRI call returns the URL of a streaming server run by containerd. The API server proxies the kubectl exec connection to the kubelet, which forwards it to that streaming endpoint.

📦 containerd Architecture

containerd is not just a daemon — it's a layered system with distinct components.

Client Layer: ctr CLI, containerd-shim, gRPC API client
↓
Metadata (bolt DB): Namespaces, Containers, Snapshots, Tasks
↓
Content Store: Image layers, Content addressable, Digest (sha256)
↓
Snapshotter: Overlayfs snap, Device mapper, Btrfs / ZFS
↓
Runtime: runc, Kata / gVisor, Custom OCI

🔀 Shim Model — Why It Matters

containerd does not run runc directly. It starts a containerd-shim process, and the shim invokes runc. runc exits once the container is set up; the shim remains as the parent of the container process, holds its stdio, and reports its exit status. Because containers are children of the shim rather than of containerd, containerd can restart or be upgraded without killing running containers.

💾 bolt DB — Metadata Store

containerd uses bolt DB (a Go embedded key-value store) for all metadata: container config, image references, snapshot records. It's the source of truth for that metadata. The content store holds the actual layer tarballs, and bolt DB entries reference content by digest. On restart, containerd reads bolt DB to recover its state and reconnects to the shims that are still running.
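
To make "reference content by digest" concrete, here is a minimal in-memory sketch of a content-addressable store. The contentStore type and its methods are hypothetical and are not containerd's API; only the sha256 digest format mirrors what image tooling uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentStore is a toy content-addressable store: the key is derived
// from the bytes themselves, so identical blobs are stored only once.
type contentStore struct{ blobs map[string][]byte }

func newContentStore() *contentStore {
	return &contentStore{blobs: make(map[string][]byte)}
}

// digestOf returns the digest string in "sha256:<hex>" form.
func digestOf(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// Put stores data under its digest and returns that digest.
func (s *contentStore) Put(data []byte) string {
	d := digestOf(data)
	if _, ok := s.blobs[d]; !ok {
		s.blobs[d] = append([]byte(nil), data...) // defensive copy
	}
	return d
}

// Get looks a blob up by digest.
func (s *contentStore) Get(digest string) ([]byte, bool) {
	b, ok := s.blobs[digest]
	return b, ok
}

func main() {
	store := newContentStore()
	layer := []byte("pretend this is a layer tarball")
	d1 := store.Put(layer)
	d2 := store.Put(layer) // same bytes, same digest, no second copy
	fmt.Println(d1 == d2, len(store.blobs))
}
```

A metadata record then only needs to keep the digest string; the bytes can be fetched, verified, or garbage-collected independently.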

🧬 cgroups v2 Hierarchy

Every pod gets its own cgroup subtree. See how the hierarchy nests — and what OOM really means.

cgroups v2 Overview

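cgroups v2 (unified cgroup hierarchy) organizes all processes into a single tree. Each controller (cpu, io, memory, pids) constrains one kind of resource. With the systemd cgroup driver, the kubelet creates a cgroup per Pod under /sys/fs/cgroup/kubepods.slice (under /sys/fs/cgroup/kubepods with the cgroupfs driver), grouped by QoS class.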

⚙️ cpu: CFS bandwidth, cpu.max, cpu.weight
💾 io: io.weight, io.max (throttling), io.stat
🧠 memory: memory.high, memory.max, memory.current
🔢 pids: pids.max, pids.current
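
Where a pod lands in the tree depends on its QoS class. The sketch below derives the class from requests and limits and builds the pod cgroup path, assuming the systemd cgroup driver and its kubepods.slice naming. The resources type is hypothetical, and quantities are compared as plain strings for brevity, whereas the kubelet compares parsed quantities.

```go
package main

import (
	"fmt"
	"strings"
)

// resources holds one container's requests and limits ("" means unset).
type resources struct{ CPURequest, CPULimit, MemRequest, MemLimit string }

// qosClass applies the usual rules: Guaranteed when every container has
// CPU and memory limits equal to its requests, BestEffort when nothing
// is set at all, Burstable otherwise.
func qosClass(containers []resources) string {
	anySet, guaranteed := false, true
	for _, c := range containers {
		if c.CPURequest == "" {
			c.CPURequest = c.CPULimit // requests default to limits
		}
		if c.MemRequest == "" {
			c.MemRequest = c.MemLimit
		}
		if c.CPURequest != "" || c.MemRequest != "" {
			anySet = true
		}
		if c.CPULimit == "" || c.MemLimit == "" ||
			c.CPURequest != c.CPULimit || c.MemRequest != c.MemLimit {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

// podCgroupPath builds the pod-level cgroup path (systemd driver layout).
func podCgroupPath(qos, podUID string) string {
	uid := strings.ReplaceAll(podUID, "-", "_")
	const root = "/sys/fs/cgroup/kubepods.slice"
	if qos == "Guaranteed" {
		return fmt.Sprintf("%s/kubepods-pod%s.slice", root, uid)
	}
	q := strings.ToLower(qos)
	return fmt.Sprintf("%s/kubepods-%s.slice/kubepods-%s-pod%s.slice", root, q, q, uid)
}

func main() {
	// The nginx pod above: cpu request only, memory request below its limit.
	nginx := []resources{{CPURequest: "100m", MemRequest: "64Mi", MemLimit: "128Mi"}}
	qos := qosClass(nginx)
	fmt.Println(qos)
	fmt.Println(podCgroupPath(qos, "1234-abcd")) // made-up pod UID
}
```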

💀 OOM in cgroups v2

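When memory.current reaches memory.max, the kernel first tries to reclaim memory from the cgroup. If reclaim cannot free enough, the cgroup OOM killer picks a victim process inside that cgroup, based on its OOM score, and sends it SIGKILL. The kill is scoped to the cgroup that hit its limit: not the whole pod, just the container that went over. If memory.oom.group is set to 1 on the cgroup, which recent Kubernetes versions do for containers on cgroups v2, every process in that container is killed together.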
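
One way to confirm that a container was OOM-killed is the memory.events file in its cgroup, which keeps counters such as oom and oom_kill. Below is a small parser sketch; it takes the file content as a string so it can be tried without a real cgroup, and the sample values are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents parses "key value" lines as found in memory.events.
func parseMemoryEvents(content string) (map[string]uint64, error) {
	events := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		events[fields[0]] = n
	}
	return events, sc.Err()
}

func main() {
	// Illustrative content; on a node this would be read from
	// <container cgroup>/memory.events.
	sample := "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
	events, err := parseMemoryEvents(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if events["oom_kill"] > 0 {
		fmt.Printf("cgroup saw %d OOM kill(s)\n", events["oom_kill"])
	}
}
```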

memory.high vs memory.max

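memory.high = soft limit. When exceeded, the kernel throttles allocations and reclaims aggressively, but does not kill. memory.max = hard limit. When it is hit and reclaim fails, the OOM killer runs. By default Kubernetes sets only memory.max, to the container's memory limit. With the alpha MemoryQoS feature gate enabled, the kubelet also sets memory.high to a value below the limit, derived from the memory request, the limit, and a configurable throttling factor, which gives the kernel room to reclaim before SIGKILL.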
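
As a worked example of how a soft limit below the hard limit can be derived, the sketch below follows the formula described for the alpha MemoryQoS feature: the request plus a fraction of the gap between request and limit, rounded down to a page boundary. Treat the formula and the 0.9 factor as assumptions to verify against the Kubernetes version in use; the function name is made up.

```go
package main

import (
	"fmt"
	"math"
)

// memoryHigh returns request + factor*(limit-request), rounded down to a
// multiple of pageSize. All sizes are in bytes.
func memoryHigh(request, limit int64, factor float64, pageSize int64) int64 {
	raw := float64(request) + factor*float64(limit-request)
	return int64(math.Floor(raw/float64(pageSize))) * pageSize
}

func main() {
	const (
		request  = 64 << 20  // 64Mi, from the nginx pod spec
		limit    = 128 << 20 // 128Mi
		pageSize = 4096
	)
	high := memoryHigh(request, limit, 0.9, pageSize) // 0.9 is an assumed factor
	fmt.Printf("memory.high=%d memory.max=%d\n", high, limit)
}
```

With these inputs memory.high lands a little under the 128Mi limit, so throttling and reclaim start before the hard limit is reached.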

🚀 runc & OCI Runtime Spec

runc is the OCI reference implementation. It takes an OCI bundle and runs a container.

📁 OCI Bundle

An OCI bundle is a directory containing config.json (the runtime spec) and a rootfs/ (the container filesystem). containerd generates the spec from Pod/Container info, then calls runc create → runc start.
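
For a feel of what config.json contains, here is a sketch that builds a very small subset of the OCI runtime spec with encoding/json. The Go type names and the buildSpec helper are invented for this example; the JSON field names (ociVersion, process.args, root.path, linux.namespaces, linux.resources.memory.limit) follow the runtime spec. A real spec generated by containerd carries many more fields (mounts, capabilities, seccomp, and so on).

```go
package main

import (
	"encoding/json"
	"fmt"
)

type process struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

type root struct {
	Path string `json:"path"`
}

type namespace struct {
	Type string `json:"type"`
}

type memory struct {
	Limit int64 `json:"limit"`
}

type linuxResources struct {
	Memory memory `json:"memory"`
}

type linux struct {
	Namespaces []namespace    `json:"namespaces"`
	Resources  linuxResources `json:"resources"`
}

type spec struct {
	Version string  `json:"ociVersion"`
	Process process `json:"process"`
	Root    root    `json:"root"`
	Linux   linux   `json:"linux"`
}

// buildSpec fills in a minimal spec for one container.
func buildSpec(args []string, memLimit int64) spec {
	var ns []namespace
	for _, t := range []string{"pid", "network", "ipc", "mount", "uts"} {
		ns = append(ns, namespace{Type: t})
	}
	return spec{
		Version: "1.0.2",
		Process: process{Args: args, Cwd: "/"},
		Root:    root{Path: "rootfs"}, // relative to the bundle directory
		Linux: linux{
			Namespaces: ns,
			Resources:  linuxResources{Memory: memory{Limit: memLimit}},
		},
	}
}

func main() {
	s := buildSpec([]string{"nginx", "-g", "daemon off;"}, 128<<20)
	out, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```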

🧿 Namespaces

pid: each container sees only its own processes (init is PID 1).
net: own network stack (eth0, lo, routes).
ipc: isolated SysV IPC objects.
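mnt: own mount table; the host filesystem is not visible unless it is explicitly mounted in.
uts: own hostname & domain name.
user: UID/GID remapping, so root inside the container ≠ root outside. This applies only when user namespaces are enabled for the pod, which is not the default.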
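
On Linux, each namespace a process belongs to shows up as a symlink under /proc/<pid>/ns, with a target such as pid:[4026531836]. Two processes share a namespace exactly when the inode numbers match. The sketch below parses that link format and, where /proc is available, prints the current process's namespaces. The parseNsLink helper is a made-up name.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a link target like "pid:[4026531836]" into the
// namespace kind and its inode number.
func parseNsLink(link string) (string, uint64, error) {
	i := strings.Index(link, ":[")
	if i < 0 || !strings.HasSuffix(link, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", link)
	}
	inode, err := strconv.ParseUint(link[i+2:len(link)-1], 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("parse inode in %q: %w", link, err)
	}
	return link[:i], inode, nil
}

func main() {
	for _, name := range []string{"pid", "net", "ipc", "mnt", "uts", "user"} {
		link, err := os.Readlink("/proc/self/ns/" + name)
		if err != nil {
			fmt.Printf("%s: unavailable (%v)\n", name, err)
			continue
		}
		kind, inode, err := parseNsLink(link)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s namespace inode %d\n", kind, inode)
	}
}
```

Running it on the host and again inside a container should show different inode numbers for every namespace the container does not share with the host.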

⚖️ cgroups v2 Enforcement

runc writes cgroup files under /sys/fs/cgroup/ to constrain the container's process. runc run creates a libcontainer (Go) instance that configures all controllers before execve()-ing the container process. The kernel enforces limits — runc just sets them up.
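
The values written to those files come from the Pod's requests and limits. The sketch below shows the arithmetic for the nginx pod: CPU millicores become cgroup v1 style shares, shares are mapped to cpu.weight, a CPU limit becomes a cpu.max quota, and the memory limit goes to memory.max in bytes. The shares-to-weight mapping is the linear one long used by runc's libcontainer; newer releases may use a different curve, so treat it as an assumption. Function names are made up.

```go
package main

import "fmt"

// milliCPUToShares converts millicores to cgroup v1 style CPU shares.
func milliCPUToShares(milliCPU int64) uint64 {
	shares := milliCPU * 1024 / 1000
	if shares < 2 {
		return 2 // minimum shares value
	}
	return uint64(shares)
}

// sharesToWeight maps shares [2, 262144] onto cpu.weight [1, 10000].
func sharesToWeight(shares uint64) uint64 {
	return 1 + ((shares-2)*9999)/262142
}

// cpuMax renders the cpu.max content "<quota> <period>" in microseconds.
// A limit of 0 means no CPU limit, which cgroups v2 spells "max".
func cpuMax(milliLimit int64) string {
	const period = 100000
	if milliLimit <= 0 {
		return fmt.Sprintf("max %d", period)
	}
	return fmt.Sprintf("%d %d", milliLimit*period/1000, period)
}

func main() {
	const memLimit = 128 << 20 // 128Mi from the pod spec

	weight := sharesToWeight(milliCPUToShares(100)) // cpu request 100m
	fmt.Println("cpu.weight:", weight)
	fmt.Println("cpu.max:", cpuMax(0)) // the pod sets no cpu limit
	fmt.Println("memory.max:", memLimit)
}
```

Note that a CPU request only influences cpu.weight, which is a relative share under contention, while a CPU limit would become a hard cpu.max quota.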

🔒 Capabilities & seccomp

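Linux capabilities (CAP_DAC_READ_SEARCH, CAP_NET_ADMIN, etc.) can be granted or dropped per container. seccomp (secure computing) filters syscalls, so containers can be blocked from calling mount, ptrace, or reboot. The runtime's default seccomp profile blocks roughly 44 syscalls, but Kubernetes applies it only when the Pod sets seccompProfile type RuntimeDefault or the kubelet is configured to default to it; otherwise containers run unconfined.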
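
A process's effective capabilities are visible as a hex bitmask in the CapEff line of /proc/<pid>/status, where bit N corresponds to capability number N. The sketch below decodes such a mask. The bit numbers come from the Linux capability list; the sample mask is one commonly seen for a default, non-privileged container and is used here only as illustrative input.

```go
package main

import (
	"fmt"
	"strconv"
)

// Capability bit numbers as defined by Linux.
const (
	capChown          = 0
	capDacReadSearch  = 2
	capNetBindService = 10
	capNetAdmin       = 12
	capSysAdmin       = 21
)

// parseCapMask parses a CapEff-style hex string such as "00000000a80425fb".
func parseCapMask(hexMask string) (uint64, error) {
	return strconv.ParseUint(hexMask, 16, 64)
}

// hasCap reports whether the given capability bit is set in the mask.
func hasCap(mask uint64, bit uint) bool {
	return mask&(uint64(1)<<bit) != 0
}

func main() {
	mask, err := parseCapMask("00000000a80425fb")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("CAP_CHOWN:", hasCap(mask, capChown))
	fmt.Println("CAP_NET_BIND_SERVICE:", hasCap(mask, capNetBindService))
	fmt.Println("CAP_NET_ADMIN:", hasCap(mask, capNetAdmin))
	fmt.Println("CAP_DAC_READ_SEARCH:", hasCap(mask, capDacReadSearch))
	fmt.Println("CAP_SYS_ADMIN:", hasCap(mask, capSysAdmin))
}
```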

⚔️ Runtime Comparison: runc vs Kata vs gVisor

Choose the right isolation level for your threat model.

| Feature | runc (default) | Kata Containers (microVM) | gVisor (user-space kernel) |
|---|---|---|---|
| Isolation | Linux namespaces + cgroups | Lightweight VM with its own guest kernel | Sentry kernel emulation |
| Performance (rough, workload-dependent) | ✅ Near-native | ⚡ ~95% of native (VM overhead) | ⚡ ~85% of native (intercepted I/O) |
| Boot Time (approximate) | ~100ms | ~1–2s (VM boot) | ~200ms (Sentry init) |
| Kernel Sharing | ✅ Host kernel | ❌ Minimal Linux guest kernel | ❌ Sentry (emulated) |
| Host Vulnerability Risk | ⚠️ Shared kernel → container escape | ✅ VM boundary | ✅ User-space kernel |
| Use Case | Trusted workloads | Multi-tenant untrusted | Secure sandboxing |
| RuntimeClass | runtimeClassName: runc | runtimeClassName: kata | runtimeClassName: gvisor |

💡 Using RuntimeClass

# 1. Create RuntimeClass
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
# 2. Use it in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx
spec:
  runtimeClassName: kata
  containers:
  - name: nginx
    image: nginx:1.27

🔄 Full Creation Flow (Code Walkthrough)

Annotated code showing what each component actually does.

// pkg/kubelet/kuberuntime/kuberuntime_manager.go
// Simplified sketch of the control flow, not the actual source. The real
// SyncPod has a different signature and also handles init and ephemeral
// containers, restarts, and back-off. Helper names are abbreviated.

// SyncPod creates the pod sandbox if necessary, and then
// creates containers in the pod.
func (m *kubeGenericRuntimeManager) SyncPod(
    ctx context.Context,
    pod *v1.Pod,
    pullSecrets []v1.Secret,
) error {

    // Step 1: Create pod sandbox (pause container)
    podSandboxID, err := m.createPodSandbox(ctx, pod)
    if err != nil { return err }

    for _, container := range pod.Spec.Containers {
        // Step 2: Pull the image if it is not already present
        if err := m.pullImage(ctx, pod, container, pullSecrets); err != nil { return err }

        // Step 3: Create the container inside the sandbox
        containerID, err := m.createContainer(ctx, pod, container, podSandboxID)
        if err != nil { return err }

        // Step 4: Start the container
        if err := m.startContainer(ctx, containerID); err != nil { return err }
    }
    return nil
}

// CRI v1 API (runtime.v1.RuntimeService, served by the containerd CRI plugin)

rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);
rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);

// RunPodSandbox → creates the pause container that holds the
// network namespace (net ns) for all containers in the pod.
// Containers created later in the sandbox join that net ns.

message RunPodSandboxRequest {
    PodSandboxConfig config = 1;   // metadata, hostname, labels, annotations, Linux options
    string runtime_handler = 2;    // selected via RuntimeClass; empty means the default runtime
}

message CreateContainerRequest {
    string pod_sandbox_id = 1;             // from RunPodSandbox
    ContainerConfig config = 2;            // image, command, env, mounts, resources
    PodSandboxConfig sandbox_config = 3;   // same config that was passed to RunPodSandbox
}

// containerd/services/containers/service.go
// Simplified sketch, not the actual source; marshal is a placeholder helper.

// NewContainer creates a new container metadata record in bolt DB
func (s *Service) NewContainer(ctx context.Context, req *api.CreateContainerRequest) error {
    container := containers.Container{
        ID:        req.ID,
        Runtime:   req.Runtime.Name,
        RootFS:    req.RootFS,  // snapshotter reference
        Labels:    req.Labels,
        CreatedAt: time.Now(),
    }
    data, err := marshal(container)
    if err != nil { return err }

    // Write to bolt DB (only one read-write transaction runs at a time)
    return s.db.Update(func(tx *bolt.Tx) error {
        bkt, err := tx.CreateBucketIfNotExists([]byte("containers"))
        if err != nil { return err }
        return bkt.Put([]byte(req.ID), data)
    })
}

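// containerd/snapshots/overlay/overlay.go
// Simplified sketch, not the actual source.

// Prepare creates a new writable (active) snapshot on top of parent.
// It does not unpack anything itself: the image unpacker applies the
// layer diff into the returned mounts and then commits the snapshot.
func (sn *snapshotter) Prepare(ctx context.Context, key string, parent string) ([]mount.Mount, error) {
    // 1. Record the new active snapshot and look up its parent chain
    // 2. Create upperdir and workdir for this snapshot
    // 3. Return an overlay mount: upperdir=this snapshot, lowerdir=parent layers
    s, err := sn.createSnapshot(ctx, key, parent)
    if err != nil { return nil, err }
    return sn.mounts(s), nil
}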
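
The overlay mount that a snapshot resolves to is, in the end, a set of overlayfs mount options. As a rough sketch, the helper below (a hypothetical function, not containerd's) assembles the option string. In overlayfs the leftmost lowerdir is the topmost layer, and a read-only view simply omits upperdir and workdir.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayOptions builds the data string for an overlayfs mount.
// lowerdirs must be ordered from topmost layer to base layer.
func overlayOptions(lowerdirs []string, upperdir, workdir string) string {
	opts := []string{"lowerdir=" + strings.Join(lowerdirs, ":")}
	if upperdir != "" {
		// A writable mount needs both an upper and a work directory.
		opts = append(opts, "upperdir="+upperdir, "workdir="+workdir)
	}
	return strings.Join(opts, ",")
}

func main() {
	// Made-up snapshot directories for a two-layer image.
	lower := []string{"/snapshots/2/fs", "/snapshots/1/fs"}
	fmt.Println(overlayOptions(lower, "/snapshots/3/fs", "/snapshots/3/work"))
	fmt.Println(overlayOptions(lower, "", "")) // read-only view
}
```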

// runc create → runc start

// Step 1: runc create reads config.json (OCI runtime spec) from the bundle
// and sets up cgroups, namespaces, and mounts.
$ runc create --pid-file /tmp/nginx.pid \
              --bundle /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id> \
              nginx

// Inside runc create:
// - root.path in config.json = rootfs mount
// - linux.namespaces = [pid, network, ipc, mount, uts] (plus user if enabled)
// - linux.resources = { devices: [deny], memory: { limit: 134217728 }}
// - fork the "runc init" process into the new namespaces and cgroup
// - pivot_root() into rootfs/
// - write the init pid to /tmp/nginx.pid
// - init then blocks on a FIFO (exec.fifo); no user code has run yet

// Step 2: runc start releases the container (already created)
// $ runc start nginx
//   → opens the exec FIFO that the init process is blocked on
//   → init is unblocked and execve()s the container entrypoint
//   → the container process now runs as PID 1 in its pid namespace
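
The split between create and start is a two-phase handshake: everything is prepared first, and the user process only begins once start releases it. runc implements the wait with a FIFO; the toy model below uses a Go channel instead, purely to illustrate the ordering. All names are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// container models the created-but-not-started state.
type container struct {
	release chan struct{} // closed by start to unblock init
	done    chan struct{} // closed once the entrypoint has "exec'd"
	mu      sync.Mutex
	events  []string
}

func (c *container) record(e string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.events = append(c.events, e)
}

// log returns a copy of the events recorded so far.
func (c *container) log() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.events...)
}

// create prepares the environment and leaves init blocked.
func create() *container {
	c := &container{release: make(chan struct{}), done: make(chan struct{})}
	c.record("created: namespaces, cgroups, rootfs ready")
	go func() {
		<-c.release // init waits here, like a blocking FIFO open
		c.record("running: entrypoint exec'd")
		close(c.done)
	}()
	return c
}

// start releases init and waits until the entrypoint is running.
func (c *container) start() {
	close(c.release)
	<-c.done
}

func main() {
	c := create()
	fmt.Println(c.log()) // only the "created" event so far
	c.start()
	fmt.Println(c.log()) // now includes the "running" event
}
```

Separating the phases lets the caller attach stdio or run hooks after the environment exists but before any user code executes.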
