Docker Security
Defense in depth: namespaces, capabilities, seccomp, scanning, signing, runtime detection
A Docker container is a process with a different view of the kernel: separate mount, network, PID, UTS, IPC, and cgroup namespaces (plus a user namespace when remapping or rootless mode is enabled), with a capability set and a seccomp filter limiting which privileged operations and syscalls it can perform. None of this is a security boundary by default; it's a set of dials. With everything turned to the most permissive setting (the historical default), a container escape is one kernel bug or one mounted Docker socket away. With everything turned up, container breakout requires defeating multiple independent layers.
Production Docker security is a checklist: rootless daemon, drop ALL caps and add back only what's needed, a custom seccomp profile (or Docker's default minus what you don't use), AppArmor or SELinux confinement, a read-only root filesystem, images signed and scanned in CI, and Falco watching for anomalies at runtime. Each layer is cheap; the combination is what makes containers a meaningful boundary.
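Part of this checklist can be checked mechanically. As a rough sketch, the function below audits the HostConfig section of `docker inspect` output against a few of the items. The function name and finding strings are made up for illustration, and the field names (Privileged, CapDrop, ReadonlyRootfs, SecurityOpt, Binds) are assumed to match what your Docker version reports.

```python
def audit_host_config(host_config: dict) -> list:
    """Return human-readable findings for one container's HostConfig."""
    # docker inspect reports unset list fields as null, hence the "or []".
    cap_drop = {cap.upper() for cap in (host_config.get("CapDrop") or [])}
    security_opt = host_config.get("SecurityOpt") or []
    binds = host_config.get("Binds") or []

    findings = []
    if host_config.get("Privileged"):
        findings.append("runs with --privileged")
    if "ALL" not in cap_drop:
        findings.append("capabilities not dropped with --cap-drop=ALL")
    if not host_config.get("ReadonlyRootfs"):
        findings.append("root filesystem is writable")
    nnp = {"no-new-privileges", "no-new-privileges:true", "no-new-privileges=true"}
    if not any(opt in nnp for opt in security_opt):
        findings.append("no-new-privileges not set")
    if any(opt in ("seccomp=unconfined", "seccomp:unconfined") for opt in security_opt):
        findings.append("seccomp disabled (unconfined)")
    # A bind is "host_path:container_path[:mode]"; check the host side.
    if any(bind.split(":")[0] == "/var/run/docker.sock" for bind in binds):
        findings.append("Docker socket bind-mounted")
    return findings
```

To use it, parse the JSON array printed by `docker inspect <container>` and pass the first element's `HostConfig` object. An empty list means none of these particular checks fired, not that the container is secure.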
The Layers of Defense
Each layer answers one question. None alone is sufficient; the combination is the thing that works.
Key Numbers
Rootless Docker
The classic dockerd runs as root and any user in the docker group has
effectively root on the host (because they can mount / into a
container). Rootless mode runs the daemon as an unprivileged user via user
namespaces and slirp4netns for networking.
{`# Install rootless dockerd (per-user)
dockerd-rootless-setuptool.sh install
# Verify
docker info | grep -i rootless
# Server: ... rootless: true
# What changed:
# - dockerd runs as your user (not root)
# - Container UID 0 maps to your host UID
# - Container UID 1000 maps to host UID 100999 (subuid range)
# - No need to add users to "docker" group
# - Network uses slirp4netns/rootlesskit (slower than vanilla bridge)
# Caveats:
# - Cannot use ports < 1024 without setcap on rootlesskit
# - Performance overhead for networking (~5-10% on small packets)
# - Some volume mount patterns need different ownership`}
User Namespaces
With user namespace remapping, a container's UID 0 (root) is a non-privileged UID on the host. A breakout that gets root inside the container only gets a normal user outside.
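The remapping itself is simple range arithmetic. The sketch below translates a container UID to a host UID using entries in the same (inside start, outside start, length) form the kernel exposes in /proc/<pid>/uid_map. The two example maps are assumptions for illustration: one for a `dockremap:100000:65536` subuid entry, and one for rootless mode run by a hypothetical host user with UID 1000.

```python
from typing import List, Optional, Tuple

UidMap = List[Tuple[int, int, int]]  # (inside_start, outside_start, length)

def map_uid(container_uid: int, uid_map: UidMap) -> Optional[int]:
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for inside_start, outside_start, length in uid_map:
        if inside_start <= container_uid < inside_start + length:
            return outside_start + (container_uid - inside_start)
    return None

# userns-remap with dockremap:100000:65536 (assumed subuid entry)
USERNS_REMAP: UidMap = [(0, 100000, 65536)]

# Rootless mode for a hypothetical host user 1000: container root is the
# user itself, and container UIDs 1.. come from the user's subuid range.
ROOTLESS: UidMap = [(0, 1000, 1), (1, 100000, 65536)]

print(map_uid(0, USERNS_REMAP))   # container root under userns-remap
print(map_uid(1000, ROOTLESS))    # an app user inside a rootless container
```

A UID outside every range has no host identity at all, which is why files owned by unmapped UIDs on a bind mount typically show up as owned by the overflow user (`nobody`) inside the container.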
{`# /etc/docker/daemon.json
{
"userns-remap": "default"
}
# /etc/subuid
dockremap:100000:65536 # container UID 0..65535 -> host 100000..165535
# Or per-user remap (more isolation, more complexity):
{
"userns-remap": "myuser"
}
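# Verification
docker run --rm alpine id
# uid=0(root) gid=0(root) groups=0(root),1(bin)...
# But on the host (use a long-running container so the process still exists):
docker run --rm -d alpine sleep 300
ps -eo uid,pid,args | grep "[s]leep 300"
# 100000 12345 sleep 300 # actually running as host UID 100000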
# Tradeoff: shared images but per-user runtime data dirs
# Volumes need careful ownership planning - bind mounts lose UID mapping`}
Linux Capabilities
Capabilities split root's powers into ~40 distinct privileges. Containers should drop ALL and add back only what's needed.
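To see which capabilities a process actually holds, read the `CapEff` line of /proc/<pid>/status and decode the hex bitmask. The sketch below does the decoding; the table follows the capability numbering in the Linux headers (bit 0 is CAP_CHOWN), and the sample mask is simply the one you get by setting the bits for the 14 defaults listed in this section.

```python
# Capability names indexed by bit number, per linux/capability.h.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]

def decode_caps(mask_hex: str) -> list:
    """Turn a CapEff/CapPrm/CapBnd hex string into capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]

# Mask built from the 14 default capabilities listed in this section.
DEFAULT_MASK = "00000000a80425fb"
print(decode_caps(DEFAULT_MASK))
```

Running `grep CapEff /proc/1/status` inside a container and decoding the result is a quick way to confirm that `--cap-drop=ALL` took effect: the mask should contain only the bits you added back.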
{`# Default (DON'T DO THIS)
docker run alpine
# Has: AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL,
# MKNOD, NET_BIND_SERVICE, NET_RAW, SETFCAP, SETGID, SETPCAP,
# SETUID, SYS_CHROOT (14 caps)
# Production pattern - drop ALL, add only what's strictly needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE alpine
# Only allows binding to ports < 1024
# Common needs:
# NET_BIND_SERVICE bind to ports < 1024
# CHOWN change file ownership (only useful in init scripts)
# DAC_OVERRIDE bypass file permission checks
# (typically a web service needs nothing - drop all)
# Check what your app actually needs at runtime:
docker run --cap-drop=ALL --rm myapp
# If it crashes: re-add the missing one. Iterate.
# In Compose:
services:
  api:
    cap_drop: [ALL]
    cap_add: [NET_BIND_SERVICE]
    security_opt:
      - no-new-privileges:true
# In Kubernetes:
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]`}
Seccomp Profiles
Seccomp filters which syscalls a process can make. Docker's default profile blocks roughly 44 of the 300+ syscalls, including kexec_load, perf_event_open, add_key, and the namespace-manipulation syscalls (unshare, setns, and clone with namespace flags). You can tighten further with a custom profile.
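A custom profile is usually derived from observation rather than written by hand. As a minimal sketch, assuming you have captured a log with something like `strace -f -o trace.log ./myapp`, the helpers below pull syscall names out of the log and wrap them in a deny-by-default profile. The function names and the regular expression are illustrative; strace output varies with flags and versions, so treat the parsing as a starting point.

```python
import json
import re

# Matches "read(", "[pid  42] read(" and "42 read(" at the start of a line.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")

def syscalls_from_strace(log: str) -> list:
    """Return the sorted, de-duplicated syscall names seen in an strace log."""
    names = set()
    for line in log.splitlines():
        match = SYSCALL_RE.match(line.strip())
        if match:  # signal lines ("--- SIG... ---") and exit lines are skipped
            names.add(match.group(1))
    return sorted(names)

def build_profile(names: list) -> dict:
    """Wrap an allowlist in a deny-by-default seccomp profile."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }

if __name__ == "__main__":
    sample = 'openat(AT_FDCWD, "/etc/myapp/conf", O_RDONLY) = 3\nread(3, "x", 1) = 1\n'
    print(json.dumps(build_profile(syscalls_from_strace(sample)), indent=2))
```

A trace only covers the code paths you exercised, so run the app through its real workload (including error paths and shutdown) before trusting the resulting allowlist.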
{`{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "open", "openat", "close", "stat",
                "fstat", "mmap", "mprotect", "munmap", "brk",
                "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
                "ioctl", "pread64", "pwrite64", "readv", "writev",
                "access", "pipe", "select", "sched_yield", "mremap",
                "msync", "mincore", "madvise", "shmget", "shmat",
                "exit", "exit_group", "wait4", "kill", "uname"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
# Apply
docker run --security-opt seccomp=./profile.json myapp
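# The allowlist above is illustrative: a real profile also needs execve and
# whatever else the runtime and your app call, or the container won't start.
# Generate from observation - run app under strace, derive needed syscalls,
# put exactly those in the profile, deny everything else.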
# Tools:
# - github.com/genuinetools/bane - AppArmor profile generator
# - github.com/jessfraz/amicontained - inspect what's allowed
# - oci-seccomp-bpf-hook - record syscalls then build profile`}
AppArmor / SELinux
Mandatory Access Control on top of capabilities and seccomp. AppArmor uses path-based profiles (Ubuntu/Debian default); SELinux uses label-based (RHEL/Fedora default). Both can confine what files a process reads, what network ops it does, what capabilities are even attemptable.
{`# AppArmor profile snippet for a web app container
profile docker-myapp flags=(attach_disconnected,mediate_deleted) {
# Default deny
deny /etc/shadow r,
deny /proc/sys/kernel/** w,
deny mount,
# Allowed paths
/usr/bin/myapp ix,
/etc/myapp/** r,
/var/log/myapp/** rw,
/tmp/** rw,
# Network
network tcp,
deny network raw,
# Caps
capability net_bind_service,
}
# Load and apply
sudo apparmor_parser -r ./profile
docker run --security-opt apparmor=docker-myapp myapp
# SELinux equivalent (RHEL/CentOS):
docker run --security-opt label=type:container_t myapp
# container_t is the standard confined type from container-selinux package`}
Read-Only Filesystem & no-new-privileges
Two cheap, high-value flags that prevent whole classes of attack.
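The no-new-privileges flag can be verified from inside the container, because the kernel reports it (together with the seccomp mode) in /proc/self/status. The sketch below parses that file's text; the function name and the returned keys are invented for illustration, and the `NoNewPrivs` and `Seccomp` fields are assumed to be present, which requires a reasonably recent kernel.

```python
def hardening_from_status(status_text: str) -> dict:
    """Extract no-new-privileges and seccomp mode from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # Kernel encoding of the Seccomp field: 0 = off, 1 = strict, 2 = filter.
    seccomp_modes = {"0": "disabled", "1": "strict", "2": "filter"}
    return {
        "no_new_privs": fields.get("NoNewPrivs") == "1",
        "seccomp": seccomp_modes.get(fields.get("Seccomp"), "unknown"),
    }

if __name__ == "__main__":
    try:
        with open("/proc/self/status", encoding="utf-8") as handle:
            print(hardening_from_status(handle.read()))
    except FileNotFoundError:
        print("no /proc/self/status on this platform")
```

In a container started with `--security-opt=no-new-privileges:true` and the default seccomp profile, you would expect `no_new_privs` to be true and the seccomp mode to be `filter`.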
{`# read-only root filesystem - immune to most code drops, web shells
docker run --read-only \\
--tmpfs /tmp \\
--tmpfs /var/run \\
myapp
# no-new-privileges - prevents setuid binaries from elevating
docker run --security-opt=no-new-privileges:true myapp
# Even if /usr/bin/sudo is in the image with the setuid bit, exec ignores the
# bit: the binary runs, but with the caller's UID, so it cannot elevate
# Combine with read-only and you cripple most exploit chains:
# - Can't write a payload to disk (read-only)
# - Can't exec setuid to elevate (no-new-privs)
# - Can't kexec another kernel (seccomp blocks it)
# - Can't load kernel modules (caps don't include CAP_SYS_MODULE)
# Compose:
services:
  api:
    read_only: true
    tmpfs:
      - /tmp
      - /var/run
    security_opt:
      - no-new-privileges:true`}
Image Scanning: Trivy / Grype
CVE scanning of installed packages. Run in CI on every image; block deploys with critical/high CVEs. Both tools have similar coverage; Trivy is more popular, Grype integrates with Syft for SBOM generation.
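When the built-in `--exit-code` gate is too blunt, a CI step can make the decision from the scanner's JSON report instead. The sketch below assumes a report produced with `trivy image --format json` and the field names Trivy uses there (Results, Vulnerabilities, VulnerabilityID, PkgName, Severity). The function name and the `ignored` set are illustrative, and the CVE IDs in any test data are placeholders.

```python
def blocking_vulns(report: dict,
                   severities=("CRITICAL", "HIGH"),
                   ignored=frozenset()) -> list:
    """Return (target, package, vuln_id, severity) tuples that should block a deploy."""
    hits = []
    for result in report.get("Results") or []:
        # Targets with no findings may omit the key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            vuln_id = vuln.get("VulnerabilityID")
            if vuln.get("Severity") in severities and vuln_id not in ignored:
                hits.append((result.get("Target"), vuln.get("PkgName"),
                             vuln_id, vuln.get("Severity")))
    return hits

if __name__ == "__main__":
    import json
    import sys
    # Usage: python gate.py report.json  (exits 1 if anything blocks)
    with open(sys.argv[1], encoding="utf-8") as handle:
        found = blocking_vulns(json.load(handle))
    for hit in found:
        print(*hit)
    sys.exit(1 if found else 0)
```

Keeping the ignore list in version control, with a reason next to each entry, makes suppressions reviewable instead of silent.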
{`# Trivy
trivy image --severity HIGH,CRITICAL myapp:v1.42.0
# 2024-08-12T14:23:01.123Z INFO Vulnerability scanning is enabled
# myapp:v1.42.0 (alpine 3.18)
# =============================
# Total: 3 (HIGH: 2, CRITICAL: 1)
#
# +---------+------------+----------+-----------+--------+
# | LIBRARY | VULN ID    | SEVERITY | INSTALLED | FIXED  |
# +---------+------------+----------+-----------+--------+
# | openssl | CVE-2023-X | CRITICAL | 3.0.10    | 3.0.11 |
# In CI - fail the build on critical
trivy image --exit-code 1 --severity CRITICAL myapp:$TAG
# Grype with SBOM
syft myapp:v1.42.0 -o spdx-json > sbom.json
grype sbom:./sbom.json --fail-on high
# Caveats:
# - Scanners report what's INSTALLED. They don't check if the vulnerable
# code path is reachable. Distroless images report fewer CVEs but only
# because they have fewer packages, not necessarily fewer real risks.
# - First-party application code is NOT scanned by these tools - that's
# SAST/DAST territory.`}
Supply Chain: cosign & Sigstore
Signing images cryptographically lets the runtime verify "this image was built by our CI from this commit." Sigstore makes this keyless via OIDC: the signing identity is the GitHub Actions workflow that produced it.
{`# Sign at build time (in CI)
cosign sign \\
--identity-token $ACTIONS_ID_TOKEN \\
ghcr.io/me/myapp:v1.42.0
# Verify at deploy time
cosign verify \\
--certificate-identity-regexp 'https://github.com/me/myapp/.*' \\
--certificate-oidc-issuer https://token.actions.githubusercontent.com \\
ghcr.io/me/myapp:v1.42.0
# Verification for ghcr.io/me/myapp:v1.42.0 --
# The following checks were performed on each of these signatures:
# - The cosign claims were validated
# - Existence of the claims in the transparency log
# - The code-signing certificate was verified using trusted certificate authority certificates
# In Kubernetes, the policy-controller admission webhook enforces
# verification at apply time:
apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata: { name: signed-images-only }
spec:
  images:
    - glob: "ghcr.io/me/**"
  authorities:
    - keyless:
        identities:
          - issuer: https://token.actions.githubusercontent.com
            subjectRegExp: 'https://github.com/me/.*'
# SBOMs and provenance attestations attach to images via cosign attest
cosign attest --predicate sbom.json --type spdxjson ghcr.io/me/myapp:v1.42.0`}
Runtime Security: Falco
Falco runs as a daemon that watches syscalls in real time (through a kernel module or an eBPF probe) and matches them against rules. It detects "shell spawned in a container", "sensitive file opened", and "outbound connection to a non-allowlisted IP": the kind of post-compromise activity static scans miss.
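A Falco condition is a boolean predicate over fields of each event. To make that concrete, here is the "shell spawned in a container" check from this section re-expressed as a plain Python function over a dictionary of event fields. This is not Falco's engine, only the same three clauses; it assumes that `container` expands to `container.id != host`, as in Falco's stock macros.

```python
SHELLS = {"bash", "sh", "zsh", "ksh"}
ALLOWED_PARENTS = {"bash", "sh", "zsh", "sshd"}

def shell_in_container(event: dict) -> bool:
    """Mirror of: container and proc.name in (...) and not proc.pname in (...)."""
    in_container = event.get("container.id", "host") != "host"
    is_shell = event.get("proc.name") in SHELLS
    parent_expected = event.get("proc.pname") in ALLOWED_PARENTS
    return in_container and is_shell and not parent_expected

# Hypothetical events: a web server spawning a shell, and a shell on the host.
suspicious = {"container.id": "3f2a9c1b7d10", "proc.name": "sh", "proc.pname": "nginx"}
benign = {"container.id": "host", "proc.name": "bash", "proc.pname": "sshd"}
print(shell_in_container(suspicious), shell_in_container(benign))
```

Writing the predicate out this way also shows where false positives come from: any legitimate parent process missing from the allowed set (an entrypoint script, a CI runner) will trip the rule until you add it.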
{`# Falco rule example
- rule: Shell in Container
  desc: Notice when a shell is spawned in a container
  condition: >
    container and
    proc.name in (bash, sh, zsh, ksh) and
    not proc.pname in (bash, sh, zsh, sshd)
  output: >
    Shell spawned in container
    (user=%user.name container_id=%container.id image=%container.image.repository
    command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, mitre_execution]
- rule: Write below /etc
  desc: Detect writes below /etc, often a sign of compromise
  condition: >
    open_write and
    fd.name startswith /etc and
    not proc.name in (apt, dpkg, yum, dnf, ...)
  output: >
    File below /etc opened for writing
    (user=%user.name command=%proc.cmdline file=%fd.name)
  priority: ERROR
# Run
docker run --privileged \\
-v /var/run/docker.sock:/host/var/run/docker.sock \\
-v /dev:/host/dev \\
-v /proc:/host/proc:ro \\
falcosecurity/falco`}
The dockerd Socket Attack Surface
Mounting /var/run/docker.sock into a container is the most common
Docker security mistake. The container can launch peer containers with any privilege,
mount any host path, and effectively become root on the host.
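The usual mitigation is to put a filter between the consumer and the socket that only forwards harmless API calls. The sketch below shows the kind of decision such a filter makes: read-only methods only, and only for enabled API sections. It is a simplified illustration with invented names, not the implementation of any particular proxy.

```python
import re

def is_allowed(method: str, path: str, sections: set, allow_writes: bool = False) -> bool:
    """Decide whether a Docker API request should be forwarded to the socket."""
    if method.upper() not in ("GET", "HEAD") and not allow_writes:
        return False
    # Drop the query string, then an optional version prefix such as /v1.43.
    parts = [part for part in path.split("?")[0].split("/") if part]
    if parts and re.fullmatch(r"v\d+\.\d+", parts[0]):
        parts = parts[1:]
    # The first remaining segment names the API section (containers, images, ...).
    return bool(parts) and parts[0] in sections

# A monitoring agent that may list containers but nothing else.
READ_ONLY_CONTAINERS = {"containers"}
print(is_allowed("GET", "/v1.43/containers/json", READ_ONLY_CONTAINERS))
print(is_allowed("POST", "/containers/create", READ_ONLY_CONTAINERS))
```

The important property is that container creation, which is what turns socket access into host root, is never forwarded unless writes are explicitly enabled.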
{`# DON'T DO THIS
docker run -v /var/run/docker.sock:/var/run/docker.sock myapp
# Inside myapp, anyone with the docker CLI can:
docker run --privileged -v /:/host alpine chroot /host bash
# -> root shell on the host
# Better alternatives:
# 1. Don't share the socket. Use the API over TLS with a scoped client cert.
# 2. Use a socket proxy (e.g., tecnativa/docker-socket-proxy) that filters
# which API endpoints are allowed:
docker run --name docker-proxy \\
-v /var/run/docker.sock:/var/run/docker.sock \\
-e CONTAINERS=1 -e POST=0 \\
-p 127.0.0.1:2375:2375 \\
tecnativa/docker-socket-proxy
# Then the consumer connects to the proxy, getting only read access.
# 3. For build use cases, use a remote BuildKit builder (for example, buildx's
# remote driver) rather than mounting the socket.
# 4. For docker-in-docker (CI), use sysbox or kaniko instead of bind-mounting.`}
Tradeoffs
Security vs developer ergonomics
Every restriction adds friction. docker run with its permissive defaults is so easy that the temptation is to ship that to production. The fix: enforce hardening at the orchestrator (Kubernetes Pod Security admission, OPA Gatekeeper), not in developer muscle memory.
Image scanning is noisy
Most CVEs reported by Trivy are not exploitable in your context. Scanner output without prioritization is alert fatigue. Tag, suppress, retest weekly.
Rootless cost
Slirp4netns adds 5-10% network overhead. For most workloads invisible; for high-throughput data-plane services it matters. Run rootless on dev, root-in-userns in prod, depending on perf budget.
Falco false positives
Default rules trigger on legitimate operations (CI containers spawning shells, init scripts writing /etc). Tune ruthlessly or you'll page on noise.
FAQ
Is Docker a security boundary?
By default, no. With rootless + user namespaces + dropped caps + seccomp + AppArmor + read-only fs + no-new-privs, yes — you've stacked enough layers that breakout requires multiple kernel exploits. For multi-tenant workloads needing strong isolation, consider gVisor or Firecracker microVMs as additional boundaries.
Should I use distroless or Alpine?
Distroless images ship without a shell or package manager, which reduces attack surface and CVE count. Alpine has musl quirks (DNS resolution, threading) that can break Java/Go apps. Prefer distroless when it works; use Alpine when you need a shell for debugging or a package manager for runtime configuration.
How do I scan my own application code?
Trivy/Grype scan installed packages. They don't analyze your application source. For that you need SAST tools (Semgrep, Snyk, GitHub CodeQL) in CI and DAST scanners hitting the running app. Both are separate from container image scanning.
Why not just use VMs?
Containers boot in 50ms; VMs in 5-30s. Containers share the host kernel; VMs each carry a full kernel. Density and startup time are the wins; isolation is what you trade. Modern microVMs (Firecracker, Kata) try to combine both at the cost of complexity.
What's the difference between Falco and a regular IDS?
Falco taps kernel syscalls (via a kernel module or an eBPF probe) and has container-native context: container_id, image, K8s namespace. A traditional HIDS like Wazuh works on host events without container awareness. Falco is purpose-built for the Docker/K8s threat model.
How do I rotate secrets in containers?
Don't bake secrets into images. Mount them at runtime via Docker secrets, K8s secrets (preferably with external-secrets-operator), or short-lived tokens from Vault. Image-baked secrets show up in registries forever.
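On the application side, runtime-mounted secrets are just files. A minimal sketch, assuming the secret is mounted under /run/secrets/<name> (the location Docker secrets use) and that the helper name and the example secret name are hypothetical:

```python
from pathlib import Path

def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Read a mounted secret by name, stripping the trailing newline."""
    if Path(name).name != name:
        # Reject names like "../etc/passwd" that would escape the directory.
        raise ValueError(f"invalid secret name: {name!r}")
    return (Path(secrets_dir) / name).read_text(encoding="utf-8").rstrip("\n")

# Example (hypothetical secret name): read at connection time, not at import time.
# password = read_secret("db_password")
```

Reading the file each time a connection is opened, rather than caching the value at startup, means a rotated secret is picked up without rebuilding the image, though whether a running container sees the new value depends on how your platform mounts and updates secrets.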