Docker Security

Defense in depth: namespaces, capabilities, seccomp, scanning, signing, runtime detection

A Docker container is a process with a different view of the kernel: separate mount, network, PID, UTS, IPC, user, and cgroup namespaces, plus capability and seccomp filters limiting which syscalls it can make. None of this is a security boundary by default — it's a set of dials. With everything turned to the most permissive setting (the historical default), a container escape is one kernel bug or one mounted Docker socket away. With everything turned up, container breakout requires defeating multiple independent layers.

Production Docker security is a checklist: rootless daemon, drop ALL capabilities and add back only what's needed, a custom seccomp profile (or Docker's default minus what you don't use), AppArmor or SELinux confinement, read-only root filesystem, images signed and scanned in CI, and Falco watching for anomalies at runtime. Each layer is cheap; the combination is what makes containers a meaningful boundary.

The Layers of Defense

Each layer answers one question. None alone is sufficient; the combination is the thing that works.

Key Numbers

14: capabilities granted by default to a container

~40: capabilities total in modern Linux

~300: syscalls allowed in Docker's default seccomp profile

~440: total Linux syscalls (x86_64)

namespaces unshared per container: mount, network, PID, UTS, IPC, user, cgroup

UID 0 in container = UID >0 outside, with userns-remap

RFC 9162: Sigstore / cosign for image signing

Rootless Docker

The classic dockerd runs as root and any user in the docker group has effectively root on the host (because they can mount / into a container). Rootless mode runs the daemon as an unprivileged user via user namespaces and slirp4netns for networking.

{`# Install rootless dockerd (per-user)
dockerd-rootless-setuptool.sh install

# Verify
docker info | grep -i rootless
# Expect "rootless" in the matched output (it is listed under Security Options)

# What changed:
# - dockerd runs as your user (not root)
# - Container UID 0 maps to your host UID
# - Container UID 1000 maps to host UID 100999 (subuid range)
# - No need to add users to "docker" group
# - Network uses slirp4netns/rootlesskit (slower than vanilla bridge)

# Caveats:
# - Cannot use ports < 1024 without setcap on rootlesskit
# - Performance overhead for networking (~5-10% on small packets)
# - Some volume mount patterns need different ownership`}

User Namespaces

With user namespace remapping, a container's UID 0 (root) is a non-privileged UID on the host. A breakout that gets root inside the container only gets a normal user outside.

{`# /etc/docker/daemon.json
{
  "userns-remap": "default"
}

# /etc/subuid
dockremap:100000:65536    # container UID 0..65535 -> host 100000..165535

# Or per-user remap (more isolation, more complexity):
{
  "userns-remap": "myuser"
}

# Verification
docker run --rm alpine id
# uid=0(root) gid=0(root) groups=0(root),1(bin)...
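# But on the host, for a container that stays up long enough to inspect:
docker run -d --rm alpine sleep 300
ps -eo uid,pid,args | grep "[s]leep 300"
# 100000   12345  sleep 300   # actually running as host UID 100000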

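# Tradeoff: enabling remap moves dockerd to a separate data dir
# (/var/lib/docker/<uid>.<gid>), so existing images must be pulled again.
# Volumes need careful ownership planning - bind mounts keep host ownership,
# so files owned by host root show up as nobody inside the container`}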
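The UID arithmetic in the comments above can be checked with a small sketch. It models the `(inside, outside, length)` triples that the kernel exposes in `/proc/<pid>/uid_map`; the helper name `to_host_uid` and the two example maps are illustrative assumptions, with the rootless map assuming a host user with UID 1000 who owns the subuid range 100000:65536.

```python
from typing import List, Tuple

# One entry per line of /proc/<pid>/uid_map:
# (first UID inside the namespace, first UID outside, range length)
UidMap = List[Tuple[int, int, int]]


def to_host_uid(container_uid: int, uid_map: UidMap) -> int:
    """Translate a UID seen inside the container to the UID the host sees."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    raise ValueError(f"UID {container_uid} is not mapped")


# userns-remap with the subuid entry dockremap:100000:65536
REMAP: UidMap = [(0, 100000, 65536)]

# Rootless layout (assumed): host user 1000 plus subuid range 100000:65536
ROOTLESS: UidMap = [(0, 1000, 1), (1, 100000, 65536)]

if __name__ == "__main__":
    print(to_host_uid(0, REMAP))        # 100000
    print(to_host_uid(65535, REMAP))    # 165535
    print(to_host_uid(0, ROOTLESS))     # 1000
    print(to_host_uid(1000, ROOTLESS))  # 100999
```

An unmapped UID raises instead of returning a guess, which mirrors the kernel's behavior of showing such IDs as the overflow user rather than granting them an identity.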

Linux Capabilities

Capabilities split root's powers into ~40 distinct privileges. Containers should drop ALL and add back only what's needed.

{`# Default (DON'T DO THIS)
docker run alpine
# Has: AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL,
#      MKNOD, NET_BIND_SERVICE, NET_RAW, SETFCAP, SETGID, SETPCAP,
#      SETUID, SYS_CHROOT (14 caps)

# Production pattern - drop ALL, add only what's strictly needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE alpine
# Only allows binding to ports < 1024

# Common needs:
# NET_BIND_SERVICE  bind to ports < 1024
# CHOWN             change file ownership (only useful in init scripts)
# DAC_OVERRIDE      bypass file permission checks
# (typically a web service needs nothing - drop all)

# Check what your app actually needs at runtime:
docker run --cap-drop=ALL --rm myapp
# If it crashes: re-add the missing one. Iterate.

# In Compose:
services:
  api:
    cap_drop: [ALL]
    cap_add: [NET_BIND_SERVICE]
    security_opt:
      - no-new-privileges:true

# In Kubernetes:
securityContext:
  capabilities:
    drop: ["ALL"]
    add:  ["NET_BIND_SERVICE"]`}
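To see which capabilities a running container actually holds, the `CapEff` bitmask in `/proc/<pid>/status` can be decoded by bit position. The sketch below is one way to do that; the function names are illustrative, and the name table follows the bit order in `linux/capability.h` as of recent kernels (41 entries), which is an assumption to re-check on much older or newer kernels.

```python
# Capability names indexed by bit number (linux/capability.h order).
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> list:
    """Turn a CapEff/CapPrm/CapBnd hex mask into capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def cap_eff_from_status(status_text: str) -> str:
    """Extract the CapEff hex value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


if __name__ == "__main__":
    # 00000000a80425fb encodes exactly the 14 default capabilities listed above.
    print(decode_caps("00000000a80425fb"))
    # A container started with --cap-drop=ALL --cap-add=NET_BIND_SERVICE:
    print(decode_caps("0000000000000400"))
```

A typical use is to run `grep CapEff /proc/self/status` inside the container and feed the hex value to `decode_caps`, then compare the result against what the service was supposed to keep.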

Seccomp Profiles

Seccomp filters which syscalls a process can make. Docker's default profile blocks ~50 syscalls, including kexec_load, perf_event_open, add_key, and namespace-manipulation calls such as unshare and setns. You can tighten further with a custom profile.

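{`# profile.json. Keep comments out of the real file: it is parsed as strict JSON.
# defaultAction denies every syscall that is not listed below.
# The list is illustrative. execve is included because the runtime applies the
# filter before exec'ing the entrypoint; most real programs need more (futex, clone, ...).
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "open", "openat", "close", "stat",
                "fstat", "mmap", "mprotect", "munmap", "brk",
                "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
                "ioctl", "pread64", "pwrite64", "readv", "writev",
                "access", "pipe", "select", "sched_yield", "mremap",
                "msync", "mincore", "madvise", "shmget", "shmat",
                "exit", "exit_group", "wait4", "kill", "uname", "execve"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}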

# Apply
docker run --security-opt seccomp=./profile.json myapp

# Generate from observation - run app under strace, derive needed syscalls,
# put exactly those in the profile, deny everything else.

# Tools:
# - github.com/genuinetools/bane          - AppArmor profile generator
# - github.com/jessfraz/amicontained      - inspect what's allowed
# - oci-seccomp-bpf-hook                  - record syscalls then build profile`}
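The "generate from observation" step can be sketched as a small script that reads strace output and emits a deny-by-default profile. This is a minimal illustration, not a replacement for the tools listed above: the function names and the sample log are made up, and the regex only handles the common `name(`, `[pid N] name(` and `N name(` line shapes.

```python
import json
import re

# Syscall name at the start of an strace line, with an optional
# "[pid 123] " (strace -f) or "123 " (strace -f -o file) prefix.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(log: str) -> list:
    """Collect the distinct syscall names that appear in an strace log."""
    names = set()
    for line in log.splitlines():
        match = SYSCALL_RE.match(line)
        if match:  # signal, exit and "<... resumed>" lines do not match
            names.add(match.group(1))
    return sorted(names)


def build_profile(names: list) -> dict:
    """Build a deny-by-default seccomp profile allowing only `names`."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": list(names), "action": "SCMP_ACT_ALLOW"}],
    }


# Hypothetical log excerpt used only to exercise the parser.
SAMPLE_LOG = """\
execve("/usr/bin/myapp", ["myapp"], 0x7ffd0 /* 12 vars */) = 0
[pid  4242] openat(AT_FDCWD, "/etc/myapp/config", O_RDONLY) = 3
4243 read(3, "", 4096) = 0
--- SIGCHLD {si_signo=SIGCHLD} ---
<... read resumed>) = 0
+++ exited with 0 +++
"""

if __name__ == "__main__":
    print(json.dumps(build_profile(syscalls_from_strace(SAMPLE_LOG)), indent=2))
```

A profile derived this way only covers the code paths exercised during tracing, so it is worth recording under a realistic workload (including error handling and shutdown) before enforcing it.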

AppArmor / SELinux

Mandatory Access Control on top of capabilities and seccomp. AppArmor uses path-based profiles (Ubuntu/Debian default); SELinux uses label-based policy (RHEL/Fedora default). Both can confine which files a process reads, which network operations it performs, and which capabilities it may use at all.

{`# AppArmor profile snippet for a web app container
profile docker-myapp flags=(attach_disconnected,mediate_deleted) {
  # Explicit denies (these win over any allow rule; unlisted paths are denied anyway)
  deny /etc/shadow r,
  deny /proc/sys/kernel/** w,
  deny mount,

  # Allowed paths
  /usr/bin/myapp ix,
  /etc/myapp/** r,
  /var/log/myapp/** rw,
  /tmp/** rw,

  # Network
  network tcp,
  deny network raw,

  # Caps
  capability net_bind_service,
}

# Load and apply
sudo apparmor_parser -r ./profile
docker run --security-opt apparmor=docker-myapp myapp

# SELinux equivalent (RHEL/CentOS):
docker run --security-opt label=type:container_t myapp
# container_t is the standard confined type from container-selinux package`}
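How the file rules in the AppArmor snippet combine can be modeled in a few lines: `*` matches within one path component, `**` matches across `/`, an explicit deny always wins, and a path with no matching allow rule is denied. The sketch below is a simplified model under those assumptions, not AppArmor's matcher; the `RULES` table and function names are illustrative, and the `ix` rule is reduced to a plain execute permission.

```python
import re


def glob_to_regex(glob: str):
    """Compile an AppArmor-style path glob: '*' stays within a path component, '**' crosses '/'."""
    pattern = []
    for part in re.split(r"(\*\*|\*)", glob):
        if part == "**":
            pattern.append(".*")
        elif part == "*":
            pattern.append("[^/]*")
        else:
            pattern.append(re.escape(part))
    return re.compile("".join(pattern))


# (path glob, permissions, is_deny), mirroring the file rules in the profile above
RULES = [
    ("/etc/shadow", "r", True),
    ("/proc/sys/kernel/**", "w", True),
    ("/usr/bin/myapp", "x", False),
    ("/etc/myapp/**", "r", False),
    ("/var/log/myapp/**", "rw", False),
    ("/tmp/**", "rw", False),
]


def allowed(path: str, perm: str, rules=RULES) -> bool:
    """Return True if `perm` ('r', 'w' or 'x') on `path` is permitted."""
    granted = False
    for glob, perms, is_deny in rules:
        if perm in perms and glob_to_regex(glob).fullmatch(path):
            if is_deny:
                return False  # an explicit deny overrides every allow
            granted = True
    return granted  # nothing matched: denied by default


if __name__ == "__main__":
    print(allowed("/etc/myapp/conf.d/app.yaml", "r"))  # True
    print(allowed("/etc/myapp/app.yaml", "w"))         # False
    print(allowed("/etc/shadow", "r"))                 # False
```

The second result is the useful one in practice: a rule that grants `r` on the config directory says nothing about `w`, so the write is refused without any deny rule being needed.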

Read-Only Filesystem & no-new-privileges

Two cheap, high-value flags that prevent whole classes of attack.

{`# read-only root filesystem - immune to most code drops, web shells
docker run --read-only \\
           --tmpfs /tmp \\
           --tmpfs /var/run \\
           myapp

# no-new-privileges - prevents setuid binaries from elevating
docker run --security-opt=no-new-privileges:true myapp
# Even if /usr/bin/sudo is in the image with the setuid bit, exec ignores the bit:
# sudo starts as the calling user and cannot elevate

# Combine with read-only and you cripple most exploit chains:
# - Can't write a payload to the root filesystem (read-only); keep the
#   writable tmpfs mounts noexec so dropped files can't be run from there
# - Can't exec setuid to elevate (no-new-privs)
# - Can't kexec another kernel (seccomp blocks it)
# - Can't load kernel modules (caps don't include CAP_SYS_MODULE)

# Compose:
services:
  api:
    read_only: true
    tmpfs:
      - /tmp
      - /var/run
    security_opt:
      - no-new-privileges:true`}
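Whether a running container actually received these flags can be checked from the `HostConfig` section of `docker inspect`. The sketch below is one possible audit helper; the function name, the wording of the findings, and the two sample dictionaries are assumptions for illustration, while the field names (`Privileged`, `ReadonlyRootfs`, `CapDrop`, `SecurityOpt`) are the ones `docker inspect` reports.

```python
# Spellings of the no-new-privileges option that count as "enabled".
NO_NEW_PRIVS = {
    "no-new-privileges",
    "no-new-privileges:true",
    "no-new-privileges=true",
}


def audit_host_config(host_config: dict) -> list:
    """Return a list of hardening gaps found in a container's HostConfig."""
    findings = []
    if host_config.get("Privileged"):
        findings.append("privileged mode is enabled")
    if not host_config.get("ReadonlyRootfs"):
        findings.append("root filesystem is writable")
    cap_drop = [cap.upper() for cap in host_config.get("CapDrop") or []]
    if "ALL" not in cap_drop:
        findings.append("capabilities are not dropped with ALL")
    sec_opts = host_config.get("SecurityOpt") or []
    if not any(opt in NO_NEW_PRIVS for opt in sec_opts):
        findings.append("no-new-privileges is not set")
    return findings


# Sample inputs (hypothetical): a hardened container and a plain `docker run`.
HARDENED = {
    "Privileged": False,
    "ReadonlyRootfs": True,
    "CapDrop": ["ALL"],
    "CapAdd": ["NET_BIND_SERVICE"],
    "SecurityOpt": ["no-new-privileges:true"],
}
PLAIN = {"Privileged": False, "ReadonlyRootfs": False, "CapDrop": None, "SecurityOpt": None}

if __name__ == "__main__":
    print(audit_host_config(HARDENED))  # []
    print(audit_host_config(PLAIN))
```

The input would typically come from `docker inspect --format '{{json .HostConfig}}' <container>` parsed with `json.loads`; running the check in CI or a cron job catches containers that were started by hand without the flags.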

Image Scanning: Trivy / Grype

CVE scanning of installed packages. Run in CI on every image; block deploys with critical/high CVEs. Both tools have similar coverage; Trivy is more popular, Grype integrates with Syft for SBOM generation.

{`# Trivy
trivy image --severity HIGH,CRITICAL myapp:v1.42.0
# 2024-08-12T14:23:01.123Z  INFO  Vulnerability scanning is enabled
# myapp:v1.42.0 (alpine 3.18)
# =============================
# Total: 3 (HIGH: 2, CRITICAL: 1)
#
# +---------+------------+----------+-----------+--------+
# | LIBRARY | VULN ID    | SEVERITY | INSTALLED | FIXED  |
# +---------+------------+----------+-----------+--------+
# | openssl | CVE-2023-X | CRITICAL | 3.0.10    | 3.0.11 |

# In CI - fail the build on critical
trivy image --exit-code 1 --severity CRITICAL myapp:$TAG

# Grype with SBOM
syft myapp:v1.42.0 -o spdx-json > sbom.json
grype sbom:./sbom.json --fail-on high

# Caveats:
# - Scanners report what's INSTALLED. They don't check if the vulnerable
#   code path is reachable. Distroless images report fewer CVEs but only
#   because they have fewer packages, not necessarily fewer real risks.
# - First-party application code is NOT scanned by these tools - that's
#   SAST/DAST territory.`}
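When the plain exit code is too coarse, the JSON report (`trivy image --format json`) can be post-processed to gate only on findings that are both severe and fixable. The sketch below assumes the report layout of `Results[].Vulnerabilities[]` with `PkgName`, `VulnerabilityID`, `Severity` and `FixedVersion` fields; the function name and the sample report, including its placeholder vulnerability IDs, are invented for illustration.

```python
def actionable(report: dict, severities=("HIGH", "CRITICAL"), only_fixable=True) -> list:
    """Return (package, vuln id, severity) for findings worth failing a build on."""
    findings = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") not in severities:
                continue
            if only_fixable and not vuln.get("FixedVersion"):
                continue  # no patched version yet: track it, but don't block
            findings.append((vuln["PkgName"], vuln["VulnerabilityID"], vuln["Severity"]))
    return findings


# Hypothetical report with placeholder IDs, shaped like Trivy's JSON output.
SAMPLE_REPORT = {
    "Results": [
        {
            "Target": "myapp:v1.42.0 (alpine 3.18)",
            "Vulnerabilities": [
                {"PkgName": "openssl", "VulnerabilityID": "CVE-EXAMPLE-1",
                 "Severity": "CRITICAL", "InstalledVersion": "3.0.10", "FixedVersion": "3.0.11"},
                {"PkgName": "busybox", "VulnerabilityID": "CVE-EXAMPLE-2",
                 "Severity": "HIGH", "InstalledVersion": "1.36.1", "FixedVersion": ""},
                {"PkgName": "zlib", "VulnerabilityID": "CVE-EXAMPLE-3",
                 "Severity": "LOW", "InstalledVersion": "1.2.13", "FixedVersion": "1.3"},
            ],
        },
        {"Target": "app/requirements.txt", "Vulnerabilities": None},
    ]
}

if __name__ == "__main__":
    blocking = actionable(SAMPLE_REPORT)
    for pkg, vuln_id, severity in blocking:
        print(f"{severity}: {pkg} {vuln_id}")
    raise SystemExit(1 if blocking else 0)
```

Trivy's own `--ignore-unfixed` flag covers the fixable-only filter; a script like this earns its keep once the policy needs more, such as per-package exceptions or different thresholds per environment.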

Supply Chain: cosign & Sigstore

Signing images cryptographically lets the runtime verify "this image was built by our CI from this commit." Sigstore makes this keyless via OIDC: the signing identity is the GitHub Actions workflow that produced it.

{`# Sign at build time (in CI)
cosign sign \\
  --identity-token $ACTIONS_ID_TOKEN \\
  ghcr.io/me/myapp:v1.42.0

# Verify at deploy time
cosign verify \\
  --certificate-identity-regexp 'https://github.com/me/myapp/.*' \\
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \\
  ghcr.io/me/myapp:v1.42.0
# Verification for ghcr.io/me/myapp:v1.42.0 --
# The following checks were performed on each of these signatures:
#   - The cosign claims were validated
#   - Existence of the claims in the transparency log
#   - The code-signing certificate was verified using trusted certificate authority certificates

# In Kubernetes, the policy-controller admission webhook enforces
# verification at apply time:
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata: { name: signed-images-only }
spec:
  images:
  - glob: "ghcr.io/me/**"
  authorities:
  - keyless:
      identities:
      - issuer: https://token.actions.githubusercontent.com
        subjectRegExp: 'https://github.com/me/.*'

# SBOMs and provenance attestations attach to images via cosign attest
cosign attest --predicate sbom.json --type spdxjson ghcr.io/me/myapp:v1.42.0`}

Runtime Security: Falco

Falco runs as a daemon (or eBPF program) watching syscalls in real time and matching them against rules. Detects "shell spawned in a container", "sensitive file opened", "outbound connection to non-whitelisted IP" — the kind of post-compromise activity static scans miss.

{`# Falco rule example
- rule: Shell in Container
  desc: Notice when a shell is spawned in a container
  condition: >
    spawned_process and
    container and
    proc.name in (bash, sh, zsh, ksh) and
    not proc.pname in (bash, sh, zsh, sshd)
  output: >
    Shell spawned in container
    (user=%user.name container_id=%container.id image=%container.image.repository
     command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, mitre_execution]

- rule: Write below /etc
  desc: Detect writes below /etc, often a sign of compromise
  condition: >
    open_write and
    fd.name startswith /etc and
    not proc.name in (apt, dpkg, yum, dnf, ...)
  output: >
    File below /etc opened for writing
    (user=%user.name command=%proc.cmdline file=%fd.name)
  priority: ERROR

# Run
docker run --privileged \\
           -v /var/run/docker.sock:/host/var/run/docker.sock \\
           -v /dev:/host/dev \\
           -v /proc:/host/proc:ro \\
           falcosecurity/falco`}
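The logic of the "Shell in Container" rule can be restated as a plain predicate over one event, which is handy for reasoning about false positives before touching the real rule file. This is a simplified model, not Falco's rule engine: the event is assumed to be a flat dictionary keyed by Falco field names, and `container.id == "host"` is taken to mean "not in a container".

```python
SHELLS = {"bash", "sh", "zsh", "ksh"}
ALLOWED_PARENTS = {"bash", "sh", "zsh", "sshd"}


def shell_in_container(event: dict) -> bool:
    """Mirror the rule condition: a shell in a container whose parent is not expected."""
    in_container = event.get("container.id", "host") != "host"
    return (
        in_container
        and event.get("proc.name") in SHELLS
        and event.get("proc.pname") not in ALLOWED_PARENTS
    )


# Hypothetical events used to exercise the predicate.
EVENTS = [
    # Web server spawning a shell inside a container: should alert.
    {"container.id": "3f2a9c1b7d10", "proc.name": "sh", "proc.pname": "nginx"},
    # Shell started from another shell (e.g. an entrypoint script): ignored.
    {"container.id": "3f2a9c1b7d10", "proc.name": "bash", "proc.pname": "sh"},
    # Shell on the host itself: out of scope for this rule.
    {"container.id": "host", "proc.name": "bash", "proc.pname": "nginx"},
]

if __name__ == "__main__":
    for event in EVENTS:
        print(shell_in_container(event), event["proc.name"], "<-", event["proc.pname"])
```

The second event shows where tuning happens: every parent added to the allowed set silences a class of legitimate activity, and also a class of attacks that happen to share that parent.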

The dockerd Socket Attack Surface

Mounting /var/run/docker.sock into a container is the most common Docker security mistake. The container can launch peer containers with any privilege, mount any host path, and effectively become root on the host.

{`# DON'T DO THIS
docker run -v /var/run/docker.sock:/var/run/docker.sock myapp
# Inside myapp, anyone with the docker CLI can:
docker run --privileged -v /:/host alpine chroot /host bash
# -> root shell on the host

# Better alternatives:
# 1. Don't share the socket. Use the API over TLS with a scoped client cert.
# 2. Use a socket proxy (e.g., tecnativa/docker-socket-proxy) that filters
#    which API endpoints are allowed:
docker run --name docker-proxy \\
  -v /var/run/docker.sock:/var/run/docker.sock \\
  -e CONTAINERS=1 -e POST=0 \\
  -p 127.0.0.1:2375:2375 \\
  tecnativa/docker-socket-proxy

# Then the consumer connects to the proxy, getting only read access.

# 3. For build use cases, point buildx at a remote BuildKit builder (the buildx remote driver) rather than mounting the socket.
# 4. For docker-in-docker (CI), use sysbox or kaniko instead of bind-mounting.`}
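What the socket proxy's `CONTAINERS=1` and `POST=0` switches amount to can be sketched as a request filter: a request passes only if its API section is enabled, and anything other than a read passes only if `POST` is enabled. This is a deliberately simplified model under those assumptions, not the proxy's real rule set (which has more sections and its own defaults); the function name and config dictionary are illustrative.

```python
import re

# Optional API version prefix such as /v1.43
VERSION_PREFIX = re.compile(r"^/v\d+(\.\d+)*")


def request_allowed(method: str, path: str, config: dict) -> bool:
    """Decide whether a Docker API request would pass the simplified filter."""
    if method not in ("GET", "HEAD") and not config.get("POST"):
        return False  # write-style requests need POST=1
    path = VERSION_PREFIX.sub("", path)
    section = path.strip("/").split("/")[0].upper()  # e.g. "CONTAINERS"
    return bool(config.get(section))


# Mirrors the flags passed to the proxy above: -e CONTAINERS=1 -e POST=0
CONFIG = {"CONTAINERS": 1, "POST": 0}

if __name__ == "__main__":
    print(request_allowed("GET", "/v1.43/containers/json", CONFIG))     # True
    print(request_allowed("POST", "/v1.43/containers/create", CONFIG))  # False
    print(request_allowed("GET", "/images/json", CONFIG))               # False
```

The second call is the one that matters: creating a container is exactly the request an attacker needs for the `--privileged -v /:/host` escape shown above, and it never reaches the daemon.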

Tradeoffs

Security vs developer ergonomics

Every restriction adds friction. Plain docker run with its stock defaults is so easy that the temptation is to ship that to production. The fix: enforce hardening at the orchestrator (Kubernetes Pod Security admission, OPA), not in developer muscle memory.

Image scanning is noisy

Most CVEs reported by Trivy are not exploitable in your context. Scanner output without prioritization is alert fatigue. Tag, suppress, retest weekly.

Rootless cost

slirp4netns adds 5-10% network overhead. For most workloads this is invisible; for high-throughput data-plane services it matters. Depending on the perf budget, run rootless in dev and a rootful daemon with userns-remap in prod.

Falco false positives

Default rules trigger on legitimate operations (CI containers spawning shells, init scripts writing /etc). Tune ruthlessly or you'll page on noise.

FAQ

Is Docker a security boundary?

By default, no. With rootless + user namespaces + dropped caps + seccomp + AppArmor + read-only fs + no-new-privs, yes — you've stacked enough layers that breakout requires multiple kernel exploits. For multi-tenant workloads needing strong isolation, consider gVisor or Firecracker microVMs as additional boundaries.

Should I use distroless or Alpine?

Distroless images ship without a shell or package manager, which reduces attack surface and CVE count. Alpine has musl quirks (DNS resolution, threading) that can break Java/Go apps. Prefer distroless when it works; use Alpine when you need a shell for debugging or a package manager for runtime configuration.

How do I scan my own application code?

Trivy/Grype scan installed packages. They don't analyze your application source. For that you need SAST tools (Semgrep, Snyk, GitHub CodeQL) in CI and DAST scanners hitting the running app. Both are separate from container image scanning.

Why not just use VMs?

Containers boot in 50ms; VMs in 5-30s. Containers share the host kernel; VMs each carry a full kernel. Density and startup time are the wins; isolation is what you trade. Modern microVMs (Firecracker, Kata) try to combine both at the cost of complexity.

What's the difference between Falco and a regular IDS?

Falco hooks into kernel syscalls (via a kernel module or eBPF probe, with libsinsp processing the events in user space) and has container-native context: container_id, image, K8s namespace. A traditional HIDS like Wazuh works on host events without container awareness. Falco is purpose-built for the Docker/K8s threat model.

How do I rotate secrets in containers?

Don't bake secrets into images. Mount them at runtime via Docker secrets, K8s secrets (preferably with external-secrets-operator), or short-lived tokens from Vault. Image-baked secrets show up in registries forever.