Pod Lifecycle

Phases, Probes, Sidecars, and Graceful Shutdown

A Pod is the smallest schedulable unit in Kubernetes: one or more containers sharing a network namespace and, optionally, volumes. Its lifecycle is a well-defined state machine: scheduling, image pull, init containers and sidecars, main containers running while probes evaluate health, and eventually termination, graceful or otherwise. Each transition emits events you can watch.

The lifecycle has gotten more nuanced over time: startup probes (introduced in 1.16) decoupled slow initialization from liveness, native sidecars (enabled by default since 1.29) fixed shutdown ordering, and topology spread constraints (stable in 1.19) made HA placement declarative. Knowing where each tunable belongs is the difference between resilient services and 3 AM pages.

Lifecycle State Machine

  • Pending: scheduling
  • Pending: init containers
  • Running: probes evaluating
  • Ready: in Service endpoints
  • Terminating: SIGTERM, drain
  • Succeeded: all containers exit 0
  • Failed: non-zero exit / OOM
  • Unknown: node lost contact with API server

Key Numbers

  • 5 phases: Pending, Running, Succeeded, Failed, Unknown
  • 3 probe types: liveness, readiness, startup
  • 30 s: default terminationGracePeriodSeconds
  • 1.29: Kubernetes version for native sidecar containers
  • 1.27: Kubernetes version where gRPC probes went GA
  • 137: exit code reported on SIGKILL (128+9)
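
The 128+9 arithmetic generalizes: a process killed by signal N is reported with exit code 128+N. A minimal Python sketch of the decoding follows; `decode_exit_code` is a hypothetical helper name, and the signal names come from the local `signal` module, so Linux numbering is assumed.

```python
import signal


def decode_exit_code(code):
    """Return the signal name behind a 128+N exit code, or None for a plain exit."""
    if code <= 128:
        return None  # the process exited on its own with this status
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


print(decode_exit_code(137))  # SIGKILL
print(decode_exit_code(143))  # SIGTERM
```

An exit code of 143 usually means the process was ended by SIGTERM without handling it, while 137 points at a SIGKILL such as an expired grace period or an OOM kill.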

Pod Spec with Every Lifecycle Knob

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  terminationGracePeriodSeconds: 60

  initContainers:
    - name: wait-for-db
      image: busybox
      command: [sh, -c, "until nc -z db 5432; do sleep 1; done"]

    - name: envoy             # 1.29+ native sidecar
      image: envoyproxy/envoy
      restartPolicy: Always   # makes it a sidecar (started in init order, runs alongside main)
      lifecycle:
        preStop:
          exec:
            command: [sh, -c, "curl -X POST localhost:9090/healthcheck/fail; sleep 25"]

  containers:
    - name: app
      image: myapp:1.0
      ports: [ { containerPort: 8080 } ]

      startupProbe:
        httpGet: { path: /healthz, port: 8080 }
        failureThreshold: 30      # 30 * 10s = 5 min to start
        periodSeconds: 10

      readinessProbe:
        httpGet: { path: /ready, port: 8080 }
        periodSeconds: 5
        failureThreshold: 3       # 15s before traffic stops

      livenessProbe:
        httpGet: { path: /alive, port: 8080 }
        periodSeconds: 10
        failureThreshold: 3       # 30s before container restart

      lifecycle:
        preStop:
          exec:
            command: [sh, -c, "sleep 5"]
            # gives the endpoints controller time to remove the Pod from Service
            # endpoints; kubelet sends SIGTERM itself once the hook returns

      resources:
        requests: { cpu: 100m, memory: 128Mi }
        limits:   { cpu: 500m, memory: 512Mi }
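
The comments in the spec derive each reaction time as periodSeconds × failureThreshold. The same arithmetic as a small Python sketch; `failure_window` is a hypothetical helper, and it ignores timeoutSeconds and kubelet scheduling jitter, so treat the results as approximations.

```python
def failure_window(period_seconds, failure_threshold):
    """Approximate seconds of consecutive probe failures before kubelet acts."""
    return period_seconds * failure_threshold


# Values copied from the Pod spec above.
probes = {
    "startup": failure_window(10, 30),   # budget for the app to finish starting
    "readiness": failure_window(5, 3),   # until the Pod leaves Service endpoints
    "liveness": failure_window(10, 3),   # until the container is restarted
}
for name, seconds in probes.items():
    print(f"{name}: {seconds}s")
```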

Init Containers vs Sidecars

# Init container — runs to completion, in order, BEFORE main containers
# Use for: schema migrations, fetching configs, waiting for dependencies

# Sidecar (1.29+ native) — initContainer with restartPolicy: Always
# Use for: service mesh proxy (Istio envoy), log shipper (Fluent Bit),
#          metrics scraper, secret rotator
#
# Critical difference vs old "just put it as a regular container":
# - Starts BEFORE main containers (in init order)
# - Restarts independently if it crashes
# - Receives SIGTERM ONLY AFTER all main containers have exited
# - Lets the proxy drain after the app is gone — finally fixes Istio's
#   "envoy died before app drained connections" race

Probes: HTTP, TCP, exec, gRPC

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
    httpHeaders:
      - name: X-Health-Check
        value: kubelet
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
  successThreshold: 1
  failureThreshold: 3

# TCP probe (just opens a socket)
livenessProbe:
  tcpSocket: { port: 5432 }

# exec probe (any command exit-zero is success)
livenessProbe:
  exec:
    command: [pg_isready, -h, localhost]

# gRPC probe (1.27+)
readinessProbe:
  grpc:
    port: 9000
    service: my.health.v1.HealthService

Graceful Shutdown

# Timeline of a deletion:
T=0:   API server marks Pod for deletion (deletionTimestamp set); the grace period countdown starts
T=0:   Endpoints controller removes Pod from Service (kube-proxy/eBPF rules update asynchronously)
T=0:   kubelet runs preStop hooks
T=preStop_done: kubelet sends SIGTERM to each container's main process (PID 1)
T=grace_period: kubelet sends SIGKILL to anything still alive (preStop time counts against the grace period)

# Common pattern for HTTP servers:
lifecycle:
  preStop:
    exec:
      command: [sh, -c, "sleep 10"]   # let endpoints propagate
# Then SIGTERM → app drains in-flight requests → exits.

# Apps that don't handle SIGTERM end up SIGKILL'd when the grace period expires.
# Handle it explicitly: Java's Runtime.addShutdownHook, Go's signal.Notify,
# Node's process.on('SIGTERM').
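
For comparison, the same SIGTERM handling pattern as a minimal Python sketch. The names `install_handlers`, `run`, `serve_once`, and `drain` are hypothetical; a real server would plug in its own accept loop and connection draining.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the main loop does the actual draining."""
    shutdown_requested.set()


def install_handlers():
    # signal.signal may only be called from the main thread in CPython.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(serve_once, drain, poll_seconds=0.1):
    """Serve until SIGTERM arrives, then drain and return an exit status."""
    while not shutdown_requested.is_set():
        serve_once()
        shutdown_requested.wait(poll_seconds)
    drain()  # finish in-flight work before exiting
    return 0
```

An entrypoint would call `install_handlers()` and then `sys.exit(run(...))`. The drain step needs to finish within whatever is left of terminationGracePeriodSeconds, or the container is SIGKILL'd mid-drain.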

PodDisruptionBudget and Topology Spread

# PDB: keep at least 2 pods available during voluntary disruptions (can also be a %)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: api }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: api } }

# OR: allow at most 1 pod to be down at a time
spec:
  maxUnavailable: 1
  selector: { matchLabels: { app: api } }

# Topology Spread — spread across zones AND across nodes
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector: { matchLabels: { app: api } }
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector: { matchLabels: { app: api } }

Tradeoffs

Strengths
  • Three probe types let you express slow start, transient unavailability, deadlock detection
  • Native sidecars finally make service mesh shutdown ordering correct
  • Topology spread provides declarative HA placement
  • Graceful termination is the default; raw kill -9 is opt-in
Sharp edges
  • Endpoint removal isn't atomic — you usually need a preStop sleep
  • PDB doesn't protect against involuntary disruptions (node crash)
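  • Liveness probes that time out under overload can restart healthy containers (keep the check cheap and give it generous timeoutSeconds and failureThreshold)
  • Init container failures are retried indefinitely, with kubelet backoff, unless restartPolicy is Never; add timeouts or fail-fast logic to the init script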

Frequently Asked Questions

What's the difference between liveness, readiness, and startup probes?

  • Liveness: 'is this container still alive?' If it fails, kubelet kills the container and restarts it per the Pod's restartPolicy. Use for processes that can hang (deadlocks).
  • Readiness: 'is this container ready to serve traffic?' If it fails, the Pod is removed from Service endpoints: traffic stops, but the container keeps running. Use during startup or temporary unavailability.
  • Startup: 'has the container finished initializing?' Liveness and readiness probes are held back until startup succeeds. Use for slow-starting apps so liveness doesn't kill them mid-init.

Each probe can use any of four mechanisms: HTTP GET, TCP socket open, exec command, or gRPC (1.27+).
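
On the application side, the three probes typically map to three cheap endpoints backed by different flags. A minimal Python sketch using only the standard library; the paths match the Pod spec earlier in this page, while `HealthState`, `make_handler`, and `serve` are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthState:
    """Flags the application flips as it moves through its own lifecycle."""

    def __init__(self):
        self.started = threading.Event()  # initialization finished
        self.ready = threading.Event()    # able to serve traffic right now


def make_handler(state):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":    # startup probe
                ok = state.started.is_set()
            elif self.path == "/ready":    # readiness probe
                ok = state.started.is_set() and state.ready.is_set()
            elif self.path == "/alive":    # liveness probe: can we answer at all?
                ok = True
            else:
                self.send_error(404)
                return
            self.send_response(200 if ok else 503)
            self.end_headers()

        def log_message(self, *args):      # keep probe traffic out of the logs
            pass

    return Handler


def serve(state, host="0.0.0.0", port=8080):
    """Start the health server on a background thread and return it."""
    server = HTTPServer((host, port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The liveness endpoint deliberately checks nothing beyond the process being able to answer. Tying it to downstream dependencies would restart the container whenever a dependency blips, which is a job for readiness instead.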

Why are sidecar containers special since 1.29?

Before native sidecars, a sidecar was just another regular container, with two consequences: it started in parallel with the app (no ordering), and it received SIGTERM at the same time as the app, so it could exit before the app had finished shutting down, leading to lost logs and unflushed metrics. Native sidecars (beta and enabled by default in 1.29) let you mark an initContainer with restartPolicy: Always. Kubernetes then treats it as a sidecar: it starts before main containers (in init order), runs alongside them, and gets a SIGTERM only after all main containers have exited. This is what lets Istio's envoy proxy keep serving until the app has shut down and only then drain.
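
The ordering rules can be made concrete with a small Python model. This is a sketch of the documented behavior and not kubelet code; `startup_order` and `shutdown_order` are hypothetical helpers operating on a Pod spec loaded as a plain dict.

```python
def startup_order(pod_spec):
    """Init containers and sidecars start in list order, then main containers."""
    init = [c["name"] for c in pod_spec.get("initContainers", [])]
    main = [c["name"] for c in pod_spec.get("containers", [])]
    return init + main


def shutdown_order(pod_spec):
    """Stages of SIGTERM delivery: main containers first, sidecars in reverse."""
    main = [c["name"] for c in pod_spec.get("containers", [])]
    sidecars = [
        c["name"]
        for c in pod_spec.get("initContainers", [])
        if c.get("restartPolicy") == "Always"  # this marks a native sidecar
    ]
    stages = [main]  # all main containers are signaled together
    stages.extend([name] for name in reversed(sidecars))
    return stages


spec = {
    "initContainers": [
        {"name": "wait-for-db"},
        {"name": "envoy", "restartPolicy": "Always"},
    ],
    "containers": [{"name": "app"}],
}
print(startup_order(spec))   # ['wait-for-db', 'envoy', 'app']
print(shutdown_order(spec))  # [['app'], ['envoy']]
```

Plain init containers do not appear in the shutdown stages because they have already run to completion by the time the main containers start.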

What happens during terminationGracePeriodSeconds?

When a Pod is deleted (kubectl delete or eviction), the terminationGracePeriodSeconds countdown (default 30) starts immediately. kubelet first runs any preStop hooks, then sends SIGTERM to the main process of every container; time spent in preStop counts against the grace period. Concurrently, the Pod's IP is removed from Service endpoints, but propagation isn't instant, which is why a short preStop sleep helps. When the grace period expires, kubelet sends SIGKILL to whatever is still running. Set the grace period to your longest reasonable shutdown time, preStop included (e.g., 60s for a web server that drains connections, longer for stateful apps).
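
A rough way to check whether a grace period is large enough is to add up the preStop and drain times. The Python sketch below assumes the countdown starts at deletion and that preStop time counts against it; `termination_timeline` is a hypothetical helper, and it ignores the short extension kubelet may grant when a preStop hook overruns.

```python
def termination_timeline(grace_period, pre_stop, drain):
    """Model one container's shutdown; all arguments and results are in seconds."""
    sigterm_at = min(pre_stop, grace_period)
    wants_exit_at = pre_stop + drain
    sigkilled = wants_exit_at > grace_period
    return {
        "sigterm_at": sigterm_at,
        "exit_at": grace_period if sigkilled else wants_exit_at,
        "sigkilled": sigkilled,
    }


# 60s grace, 5s preStop sleep, 20s to drain: exits cleanly at t=25.
print(termination_timeline(60, 5, 20))
# Default 30s grace, 25s preStop sleep, 10s to drain: SIGKILL at t=30.
print(termination_timeline(30, 25, 10))
```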

What does PodDisruptionBudget actually prevent?

PDB caps voluntary disruptions: drains, evictions during cluster autoscaler scale-down, rolling node replacements. It says 'at most N pods of this group can be unavailable at once' or 'at least N must remain available.' It does NOT protect against involuntary disruptions: node crash, kernel panic, AZ failure, OOM kill. PDBs are enforced only through the Eviction API, which kubectl drain and the cluster autoscaler use; a direct kubectl delete pod bypasses them. Pair with multi-replica Deployments and topology spread for true HA.
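
The budget boils down to simple arithmetic on the number of healthy pods. A simplified Python sketch that handles integer values only (percentages are left out); `disruptions_allowed` is a hypothetical helper and not part of any Kubernetes client.

```python
def disruptions_allowed(healthy, expected, min_available=None, max_unavailable=None):
    """Voluntary evictions the budget permits right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required = min_available
    else:
        required = expected - max_unavailable
    return max(0, healthy - required)


# 3 replicas, all healthy, minAvailable: 2 -> one eviction may proceed.
print(disruptions_allowed(healthy=3, expected=3, min_available=2))
# One pod is already down -> further evictions are refused.
print(disruptions_allowed(healthy=2, expected=3, min_available=2))
```

Note that a pod lost to a node crash still counts against the budget: it lowers the healthy count, so subsequent voluntary evictions are blocked until it recovers.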

How do Pod Topology Spread Constraints differ from podAntiAffinity?

podAntiAffinity is coarse: a required rule is all-or-nothing (for example, never two matching pods in the same zone), and a preferred rule is only a weighted hint. Topology Spread Constraints (1.19+) work on counts: 'spread these pods across zones, allowing at most maxSkew more in any one zone than in the emptiest one'. You can combine multiple constraints (spread across zones AND across nodes). Each constraint is either a hard requirement (whenUnsatisfiable: DoNotSchedule) or a soft preference (whenUnsatisfiable: ScheduleAnyway). Spread is what you want for HA workloads where you care about distribution, not exact placement.
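
The skew check for a single constraint can be sketched in a few lines of Python. This models only the DoNotSchedule case for one topology key and assumes every domain is eligible; `schedulable_domains` is a hypothetical helper, not scheduler code.

```python
def schedulable_domains(pods_per_domain, max_skew):
    """Domains where placing one more pod keeps skew within max_skew."""
    fewest = min(pods_per_domain.values())
    return sorted(
        domain
        for domain, count in pods_per_domain.items()
        if count + 1 - fewest <= max_skew
    )


# zone-a already holds one more pod than the others, so with maxSkew: 1
# the next pod may only land in zone-b or zone-c.
print(schedulable_domains({"zone-a": 2, "zone-b": 1, "zone-c": 1}, max_skew=1))
```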

Why do my Pods stay Pending?

Pending means the Pod has been accepted by the API server but at least one container hasn't started yet. Common causes: (1) no node has enough resources (CPU, memory): check 'kubectl describe pod' for FailedScheduling events; (2) image pull failure: networking, registry auth, or an image that doesn't exist; (3) PVC not bound: no matching StorageClass or volume; (4) a nodeSelector or affinity rule that no node satisfies; (5) taints on all nodes that the Pod doesn't tolerate. An admission webhook denial is different: the Pod is rejected outright and never reaches Pending, so look at the owning ReplicaSet's events instead. The 'Events' section of describe is almost always the answer.