Networking

CNI, Services, kube-proxy, and the eBPF Dataplane

Kubernetes' networking model is deceptively simple: every Pod gets a routable IP, and every Pod can reach every other Pod without NAT. How that is implemented is delegated entirely to the CNI plugin you install: Calico uses BGP, Flannel uses a VXLAN overlay, Cilium uses eBPF, and the AWS VPC CNI assigns Pods VPC addresses from ENIs attached to the node. The cluster operator picks a CNI and trades off cost, performance, and policy features.

On top of the Pod network sit Services (virtual IPs that load-balance to a set of Pods), implemented by kube-proxy via iptables, IPVS, or nftables, or replaced entirely by eBPF in modern stacks. NetworkPolicy adds firewall semantics; the Gateway API adds rich L7 routing. Each layer is independently swappable.

The Stack

Key Numbers

Service types: ClusterIP, NodePort, LoadBalancer, ExternalName, headless

30000-32767: default NodePort range

~110: recommended max Pods/node (default kubelet)

/24: typical per-node Pod CIDR

1.29: Kubernetes version that added the nftables proxy mode as alpha (beta in 1.31, GA in 1.33)

2023: Gateway API GA

Container Networking Fundamentals

Before diving into Kubernetes, understand what the Linux kernel provides. Docker and container runtimes use four network modes — the same primitives K8s CNI builds on top of.

The Four Container Network Modes

Docker's --network flag maps directly to Linux network namespace plumbing. Kubernetes always uses the container mode at runtime (the container runtime creates one namespace per Pod at kubelet's request), but understanding the modes clarifies what a CNI plugin actually does.

bridge

Creates a docker0 bridge on the host. Containers get IPs from Docker's built-in IPAM (a private subnet such as 172.17.0.0/16) or from a static assignment. They can reach the host and external network via NAT (MASQUERADE). This is Docker's default.
Use case: Dev/test on single hosts, legacy setups.

✅ Familiar networking (like VMs) ✅ Containers reachable from host ❌ NAT adds latency ❌ Port conflicts between containers ❌ Not directly reachable from outside host without -p publishing

host

Container shares the host's network namespace — no isolation. The container's processes bind directly to the host's interfaces. If nginx in the container binds to port 80, it's port 80 on the host.
Use case: System daemons, performance-critical networking where you can't afford NAT overhead.

✅ Zero NAT overhead — bare-metal speed ✅ Full host network access ❌ No network isolation between containers ❌ Port conflicts are your problem ❌ Security boundary removed

none

Container gets the lo interface only — fully isolated. Useful for batch jobs that do no networking, or as the starting point for custom CNI wiring.
Use case: Offline batch workloads, air-gapped workloads.

✅ Completely isolated ✅ No unexpected network exposure ❌ Must be manually configured if networking needed

container

Container reuses another container's network namespace. Two containers share the same lo, same interfaces, same port bindings. Kubernetes uses this — kubelet creates a network namespace per Pod, then runs containers inside it.
Use case: Kubernetes Pods (all containers in a Pod share netns), logging sidecars that need to intercept the main container's traffic.

✅ Containers in Pod communicate via localhost ✅ Shared network namespace — easy service discovery ❌ Tight coupling between containers in Pod

veth Pair — The Fundamental Pipe

Virtual Ethernet (veth) pairs are the building block for container networking. A veth pair is a point-to-point link, like a pipe with two ends: whatever goes in one end comes out the other. When a CNI plugin "connects" a container to the host network, it creates a veth pair, moves one end into the container's network namespace and names it eth0, and leaves the other end on the host (often named vethXXXX). The host end gets plugged into a bridge or routed directly.

# Inspect a running container's network namespace
$ docker inspect my-container --format '{{.NetworkSettings.SandboxKey}}'
/var/run/docker/netns/default

# Enter the container's namespace to see eth0
$ nsenter --net=/var/run/docker/netns/default ip addr
1: lo: <LOOPBACK,UP> mtu 65536 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
8: eth0@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
    inet6 fe80::42:acff:fe11:2/64 link tentative

# On the host, the other end of the veth pair shows up as vethXXX
$ ip addr | grep veth
26: vethb1e1e86@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420
    link/ether 3a:92:13:21:3e:9f brd ff:ff:ff:ff:ff:ff
    master docker0
    inet6 fe80::3892:13ff:fe21:3e9f/64

# The bridge connects them
$ brctl show docker0
bridge name  bridge id       STP enabled   interfaces
docker0      8000.0242ac110002  no           vethb1e1e86
                                                        veth2a3f4c91
                                                        veth9d1e7f32

# MTU consideration: the veth MTU is set by the runtime or CNI plugin, not negotiated.
# On an overlay (e.g., Flannel's VXLAN backend) it must be smaller than the underlay MTU
# to leave room for the encapsulation header: VXLAN adds 50 bytes, so 1500 becomes 1450
# (this host uses 1420). A mismatch is a common source of mysterious connectivity issues.
$ ip link show docker0
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP

Network Namespaces (netns)

Network namespaces are the Linux kernel's fundamental network isolation primitive. They partition the networking stack: interfaces, routes, iptables rules, ARP tables, and sockets are all per-namespace. When Kubernetes creates a Pod, the container runtime creates a new network namespace for it, the CNI plugin wires that namespace into the Pod network, and the Pod's containers are started inside it.

# Create a network namespace manually (like kubelet does)
ip netns add myns

# Create a veth pair
ip link add veth-host type veth peer name veth-pod

# Move one end into the namespace
ip link set veth-pod netns myns

# Configure the namespace's end
ip netns exec myns ip addr add 10.244.1.10/24 dev veth-pod
ip netns exec myns ip link set veth-pod up
ip netns exec myns ip link set lo up

# Configure the host end
ip addr add 10.244.1.1/24 dev veth-host
ip link set veth-host up

# From the namespace, you can now reach the host at 10.244.1.1
ip netns exec myns ping -c 1 10.244.1.1
PING 10.244.1.1 (10.244.1.1) 56(84) bytes of data.
64 bytes from 10.244.1.1: icmp_seq=1 ttl=64 time=0.031 ms

# List all network namespaces on the host
ip netns list
ls /var/run/netns/   # alternative view

# Delete when done
ip netns delete myns

⚠️ Gotcha: Network namespaces are reference-counted. ip netns delete only removes the name (the bind mount under /var/run/netns); if a process is still running inside the namespace, the namespace itself, along with its interfaces and addresses, lives on until that process exits. Always stop containers before manually deleting netns entries. Kubernetes handles this correctly via the CNI DEL command.

The Kubernetes Networking Model

Kubernetes codifies four strict rules that every CNI plugin must satisfy. These rules are why pod-to-pod communication "just works" across nodes — the model removes ambiguity.

Every Pod has its own IP. No container-level port sharing on the same IP. If two containers on the same node both want port 80, they must be in separate Pods with separate IPs. This eliminates the port-mapping complexity that Docker's -p publish model creates.

Containers in the same Pod communicate via the loopback interface. They share the Pod's network namespace, so they reach each other on localhost. Containers in different Pods on the same node go through the host's bridge or routing stack, never localhost.

Pods on different nodes communicate without NAT. This is the critical one. A Pod with IP 10.244.1.5 on node A must be reachable as 10.244.1.5 from node B. The CNI plugin is responsible for making this routing work: via native L3 routing (Calico in pure BGP mode), an overlay (Flannel VXLAN), or an eBPF-programmed datapath (Cilium). No NAT at the pod IP layer.

The IP a Pod sees for itself is the same IP other pods see. A Pod's eth0 reports 10.244.1.5, and that same address is what node B uses to reach it. There is no "internal NAT" hiding the Pod's true IP. This matters for anything that advertises or verifies its own address, for example service registration or certificates that carry the Pod IP as a SAN.

Why IP-Per-Pod Matters

The IP-per-Pod model is what enables Kubernetes' compositionality. Because every Pod has a unique IP, you can place any workload on any node without coordinating port numbers. A Service wrapping N Pods doesn't care if those Pods are on 1 node or 50 nodes — the Service IP routes to them identically. Compare this to the old model where you'd publish port 8080 on every node and hope containers didn't collide.

# Confirm the IP-per-Pod model with a concrete example
$ kubectl get pod -o wide
NAME          READY   STATUS    IP           NODE      NOMINATED NODE
web-0         1/1     Running   10.244.1.37  node-1    <none>
web-1         1/1     Running   10.244.2.19  node-2    <none>
api-0         1/1     Running   10.244.3.8   node-3    <none>

# web-0 on node-1 is reachable from api-0 on node-3 as 10.244.1.37
# (no NAT, no port mapping, just routing)
$ kubectl exec api-0 -- curl -s --connect-timeout 2 http://10.244.1.37:8080/health
{"status":"ok"}

# The service IP (10.0.5.6) is a virtual IP that kube-proxy rewrites
# to one of those backend Pod IPs
$ kubectl get svc web
NAME   TYPE        CLUSTER-IP    PORT(S)   SELECTOR
web    ClusterIP   10.0.5.6     8080/TCP  app=web

# DNS maps the service name to the ClusterIP, not the Pod IPs
$ kubectl exec api-0 -- nslookup web.default.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53
Name:      web.default.svc.cluster.local
Address:   10.0.5.6
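The addressing in the listing above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the 10.244.0.0/16 cluster CIDR and /24 per-node blocks used in this article's examples; the node-to-CIDR mapping is hypothetical and only mirrors the sample output.

```python
import ipaddress

# Assumed cluster Pod CIDR and per-node prefix length (the values used in this article).
cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
node_cidrs = list(cluster_cidr.subnets(new_prefix=24))

# Hypothetical mapping that mirrors the kubectl output above.
nodes = {"node-1": node_cidrs[1], "node-2": node_cidrs[2], "node-3": node_cidrs[3]}


def node_for_pod(pod_ip: str) -> str:
    """Return the node whose Pod CIDR contains pod_ip."""
    addr = ipaddress.ip_address(pod_ip)
    for name, cidr in nodes.items():
        if addr in cidr:
            return name
    raise LookupError(f"{pod_ip} is not in any known node CIDR")


print(len(node_cidrs))                    # how many /24 node blocks fit in the /16
print(node_cidrs[1].num_addresses - 2)    # usable addresses in one /24
print(node_for_pod("10.244.1.37"))        # which node owns web-0's IP
```

Because a Pod IP identifies its node by prefix alone, routing between nodes only needs one route per node CIDR, not one per Pod.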

CNI Plugins

The Container Network Interface (CNI) is a spec from the Cloud Native Computing Foundation. It defines how a container runtime (Docker, containerd, cri-o) talks to a network plugin when containers are created and deleted. The spec is intentionally minimal — it specifies only the interface, not the implementation.

The CNI Specification

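CNI defines a small set of operations: ADD (wire a container into a network), DEL (tear it down), CHECK (validate an existing setup, added in spec 0.4.0), and VERSION. There is no "GET": state is not queryable through the spec. Plugins are binaries invoked by the container runtime; the operation and its parameters arrive as environment variables (CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME), and the network configuration arrives as JSON on stdin.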

# The CNI ADD call: the container runtime (on kubelet's request) runs the plugin for every Pod sandbox
# Parameters are environment variables; the network config is JSON on stdin; the result is JSON on stdout

# Environment for an ADD call (illustrative values):
CNI_COMMAND=ADD
CNI_CONTAINERID=abc123def456
CNI_NETNS=/var/run/netns/cni-abc123def456
CNI_IFNAME=eth0               # what to name the interface inside the container
CNI_PATH=/opt/cni/bin         # where to find plugin binaries

# Network config on stdin (taken from /etc/cni/net.d/):
{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "type": "calico",
  "ipam": { "type": "calico-ipam" }
}

# The plugin returns on stdout (illustrative addresses):
{
  "cniVersion": "1.0.0",
  "interfaces": [
    { "name": "eth0", "sandbox": "/var/run/netns/cni-abc123def456" }
  ],
  "ips": [
    {
      "address": "10.244.1.37/24",
      "gateway": "10.244.1.1",
      "interface": 0
    }
  ],
  "routes": [
    { "dst": "0.0.0.0/0", "gw": "10.244.1.1" }
  ],
  "dns": {
    "nameservers": ["10.96.0.10"],
    "search": ["default.svc.cluster.local"]
  }
}

# Plugin binaries live in /opt/cni/bin/ (the runtime looks them up by "type"):
$ ls /opt/cni/bin/
bandwidth  calico  cilium-cni  flannel  host-local  loopback  portmap  tuning

# Plugins register by dropping config files into /etc/cni/net.d/
# The first file lexically (after sorting) is the active plugin
$ cat /etc/cni/net.d/10-calico.conflist
{
  "name": "k8s-pod-network",
  "cniVersion": "1.0.0",
  "plugins": [
    {
      "type": "calico",
      "ipam": { "type": "calico-ipam" },
      "policy": { "type": "k8s" }
    },
    {
      "type": "portmap",     # CNI chaining: portmap for HostPort
      "capabilities": { "portMappings": true }
    },
    {
      "type": "bandwidth",   # CNI chaining: bandwidth limiting
      "capabilities": { "bandwidth": true }
    }
  ]
}
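To make the calling convention concrete, here is a toy dispatcher written the way a CNI plugin reads its inputs: the command from the environment, the network config from stdin. It is a sketch only; handle_cni is a hypothetical name, the returned addresses are hard-coded sample values, and no interfaces are actually created.

```python
import json


def handle_cni(env: dict, stdin_text: str) -> dict:
    """Toy CNI dispatcher: shows the input/output shape, performs no real networking."""
    conf = json.loads(stdin_text)
    command = env["CNI_COMMAND"]
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": ["0.4.0", "1.0.0"]}
    if command == "ADD":
        # A real plugin would create the veth pair and call its IPAM plugin here.
        return {
            "cniVersion": conf["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.244.1.37/24", "gateway": "10.244.1.1", "interface": 0}],
            "routes": [{"dst": "0.0.0.0/0", "gw": "10.244.1.1"}],
        }
    if command in ("DEL", "CHECK"):
        return {}  # success is signalled by exit status, not by a result body
    raise ValueError(f"unsupported CNI_COMMAND {command!r}")


env = {
    "CNI_COMMAND": "ADD",
    "CNI_CONTAINERID": "abc123def456",
    "CNI_NETNS": "/var/run/netns/cni-abc123def456",
    "CNI_IFNAME": "eth0",
}
conf = '{"cniVersion": "1.0.0", "name": "k8s-pod-network", "type": "toy"}'
result = handle_cni(env, conf)
print(json.dumps(result, indent=2))
```

In a chained conflist, the runtime would pass this result to the next plugin as prevResult inside that plugin's stdin config.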

IP Address Management (IPAM)

CNI separates the what (IP allocation) from the how (interface creation). IPAM plugins handle just the IP/subnet allocation. The CNI reference plugins include host-local (allocates from a configured subnet per node) and dhcp (obtains a DHCP lease for the interface). Calico, Cilium, and AWS VPC CNI ship their own IPAM plugins that integrate with their control plane.

# host-local IPAM: simplest, per-node subnet allocation
# The main plugin calls host-local during CNI ADD. host-local records each
# allocation as a file under /var/lib/cni/networks/<network>.

# IPAM config for host-local:
{
  "type": "host-local",
  "subnet": "10.244.1.0/24",   # range to allocate from
  "rangeStart": "10.244.1.10", # skip first few (gateway, DNS, etc.)
  "rangeEnd": "10.244.1.250",
  "gateway": "10.244.1.1",
  "routes": [ { "dst": "0.0.0.0/0", "gw": "10.244.1.1" } ]
}

# Calico IPAM: stores allocations in etcd or the Kubernetes API
# Nodes claim /26 blocks from the IP pool on demand. Calico tracks pools and
# blocks in its own resources (IPPool, IPAMBlock).
$ kubectl get ippool
NAME           CIDR           NAT   DISABLED
default-pool   10.244.0.0/16   true   false

$ kubectl get workloadendpoints -A
NAMESPACE   NAME      NETWORK      INTERFACE
default     web-0     default      eth0
default     web-1     default      eth0

# Cilium IPAM — per-node CIDRs managed by Cilium Operator
$ kubectl get ciliumippool
NAME         VERSION   CIDR
default      4         10.244.0.0/16
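The core job of a per-node IPAM plugin is small enough to sketch in memory. The class below is an illustration under stated assumptions, not host-local's implementation: RangeAllocator is a hypothetical name, it always hands out the lowest free address, and it keeps state in a dict where the real plugin persists it on disk.

```python
import ipaddress


class RangeAllocator:
    """In-memory stand-in for a per-node IPAM plugin (lowest-free-address policy)."""

    def __init__(self, range_start: str, range_end: str):
        self.start = ipaddress.ip_address(range_start)
        self.end = ipaddress.ip_address(range_end)
        self.by_container = {}  # container ID -> allocated address

    def allocate(self, container_id: str):
        if container_id in self.by_container:
            return self.by_container[container_id]  # repeated ADD is idempotent
        used = set(self.by_container.values())
        ip = self.start
        while ip <= self.end:
            if ip not in used:
                self.by_container[container_id] = ip
                return ip
            ip += 1
        raise RuntimeError("address range exhausted")

    def release(self, container_id: str) -> None:
        self.by_container.pop(container_id, None)  # DEL must tolerate unknown IDs


# Range taken from the host-local config shown above.
ipam = RangeAllocator("10.244.1.10", "10.244.1.250")
first = ipam.allocate("pod-a")
second = ipam.allocate("pod-b")
ipam.release("pod-a")
third = ipam.allocate("pod-c")  # reuses the address pod-a gave back
print(first, second, third)
```

Making release a no-op for unknown IDs matters in practice, because runtimes may call DEL more than once for the same container.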

Popular CNI Plugins

Calico (L3 + BGP)

Calico routes Pod IPs directly using BGP — nodes exchange routes just like an internet router. No encapsulation overhead. Can run without an overlay (flat L3) if nodes are on the same layer-2 network (e.g., in the same VPC subnet). The Felix agent programs iptables for NetworkPolicy.

Control plane: BIRD BGP daemon per node, orchestrated by calico/kube-controllers. For large clusters, route reflectors reduce the full-mesh BGP N×(N-1)/2 sessions problem.

Best for: Bare-metal clusters, high-throughput workloads, environments where you want route-level visibility (standard tcptraceroute works).

# Install Calico
$ kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# BGP peer status — confirms BGP sessions are up
$ calicoctl node status
Calico process is running.
IPv4 BGP peering status
Node      Peer Address   BGP State   ... 
node-1    192.168.1.102   established
node-2    192.168.1.103   established

# On each node, you see kernel routes for every pod CIDR
$ ip route | grep 10.244
10.244.1.0/26 via 192.168.1.101 dev eth0  proto bird
10.244.2.0/26 via 192.168.1.102 dev eth0  proto bird

# No encapsulation — plain IP routing
$ tcpdump -i eth0 -c 5 icmp &
$ kubectl exec web-0 -- ping -c 1 10.244.2.19
# You'll see plain ICMP packets with pod IPs as src/dst (no VXLAN wrapper)
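The full-mesh session count mentioned for Calico's control plane grows quadratically, which is easy to see with a little arithmetic. The sketch below assumes a simple topology in which every ordinary node peers with every route reflector and the reflectors peer with each other; real deployments may differ.

```python
def full_mesh_sessions(nodes: int) -> int:
    """BGP sessions when every node peers with every other node: N*(N-1)/2."""
    return nodes * (nodes - 1) // 2


def route_reflector_sessions(nodes: int, reflectors: int) -> int:
    """Sessions when clients peer only with reflectors (assumed topology)."""
    clients = nodes - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


for n in (10, 100, 1000):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n, reflectors=3))
```

At 100 nodes the full mesh already needs 4950 sessions, while three reflectors bring that down to a few hundred.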

Flannel (Overlay / VXLAN)

Flannel creates an overlay network. Each node gets a /24 (configurable) from a shared cluster CIDR. Cross-node traffic is encapsulated in VXLAN, which wraps each frame in a UDP packet. Flannel uses the Linux kernel's default VXLAN port, 8472, rather than the IANA-assigned 4789. The flanneld daemon on each node manages the VXLAN interface (flannel.1).

Backends: udp (legacy), vxlan (recommended), host-gw (direct routing, no encapsulation, requires L2 connectivity), wireguard (encrypted).

Best for: Quick setup, homelabs, environments where simplicity matters more than performance. Not recommended for production at large scale due to VXLAN overhead.

# Install Flannel (the cluster's Pod CIDR must match Flannel's Network setting, 10.244.0.0/16 by default)
$ kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

# flanneld writes this to /run/flannel/subnet.env
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

# VXLAN backend creates flannel.1 interface
$ ip -d link show flannel.1
61: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450
    vxlan id 1 local 192.168.1.101 dev eth0 dstport 8472
    # dstport 8472 is the Linux kernel's original VXLAN default; the IANA-assigned port is 4789

# Cross-node packet path: pod → flannel.1 (VXLAN encap) → eth0 (UDP 8472)
# The receiving node's kernel decapsulates and delivers to the pod's veth
$ ip route show | grep -A1 "10.244.2"
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
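The FLANNEL_MTU=1450 value above follows directly from the VXLAN header sizes. A short worked calculation, assuming an IPv4 underlay without VLAN tags or IP options:

```python
# Bytes that VXLAN adds in front of the Pod's IP packet (IPv4 underlay assumed).
VXLAN_OVERHEAD = {
    "outer IPv4 header": 20,
    "outer UDP header": 8,
    "VXLAN header": 8,
    "inner Ethernet header": 14,
}


def overlay_mtu(underlay_mtu: int, overhead: dict = VXLAN_OVERHEAD) -> int:
    """Largest Pod IP packet that still fits in one underlay packet."""
    return underlay_mtu - sum(overhead.values())


print(sum(VXLAN_OVERHEAD.values()))  # total encapsulation overhead in bytes
print(overlay_mtu(1500))             # standard Ethernet underlay
print(overlay_mtu(9000))             # jumbo-frame underlay
```

If Pod interfaces are left at 1500 on a 1500-byte underlay, full-size packets no longer fit after encapsulation, which typically shows up as connections that open fine and then stall on large transfers.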

Cilium (eBPF)

Cilium implements the datapath with eBPF programs attached to TC (traffic control) hooks and socket operations, largely bypassing iptables. Service load balancing, NetworkPolicy, and observability are all handled in-kernel. There are no long iptables chains to traverse, and in native-routing mode there is no encapsulation either (the default tunnel mode still uses VXLAN).

Control plane: Cilium Agent (DaemonSet) on every node, Cilium Operator for CRD management (IPAM, identities). Hubble is the built-in observability layer (per-flow visibility without a service mesh).

Best for: Large-scale production clusters, environments needing NetworkPolicy + observability without Istio, bare-metal with high throughput requirements.

# Install Cilium (kubeProxyReplacement=true on Cilium 1.14+; older releases used "strict")
$ helm install cilium cilium/cilium \
    --namespace kube-system \
    --set kubeProxyReplacement=true \
    --set k8sServiceHost=<control-plane> \
    --set k8sServicePort=6443

# Cilium endpoints — every pod gets a CiliumEndpoint CRD
$ kubectl get ciliumendpoints -A
NAMESPACE   NAME      ENDPOINT ID   IDENTITY   POLICY LEVEL
default     web-0     384           17334     none
default     api-0     522           19582     default

# eBPF maps backing the load balancer
$ cilium bpf lb list
FRONTEND           SERVICE ID   BACKEND               ACTIVE
10.96.0.10:53      1            10.244.1.8:53         yes
10.0.5.6:80        2            10.244.1.37:8080      yes
10.0.5.6:80        2            10.244.2.19:8080      yes

# With full replacement, kube-proxy is not needed and is normally removed
$ kubectl get daemonset -n kube-system kube-proxy
NAME         DESIRED   CURRENT   READY
kube-proxy   3         3         3   # if still present, it is still programming iptables; delete it

Weave Net (Overlay / Sleeve)

Weave creates its own overlay network using a "Sleeve" mode (userspace UDP encapsulation for cross-node traffic) or "Fast DP" mode (VXLAN through the Open vSwitch kernel datapath for a faster data path). Weave's highlights are optional encryption between peers and NetworkPolicy enforcement through its weave-npc component.

Note: Weave Net is no longer actively maintained (Weaveworks shut down in 2024) and has fallen behind on Kubernetes compatibility. Calico and Cilium are generally preferred for new deployments.

# Install Weave
$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version -o json 2>/dev/null | jq -r '.serverVersion.minor')"

# Weave router status
$ kubectl exec -n kube-system weave-net-xxxxx -c weave -- /home/weave/weave --local status
Version: 2.8.1
        Peer name: node-1-1921681101
        Encryption: enabled (gcm)
        Target mesh: one-shot
        Name: 52:f2:ca:3e:39:4d

# Weave's fastdp mode uses a kernel module; sleeve falls back to userspace UDP
$ ip link | grep weave
77: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376

CNI Chaining

CNI supports chaining — multiple plugins execute in sequence for a single ADD call. The output of plugin N becomes the input to plugin N+1. This is how you get capabilities the primary CNI doesn't support natively. For example, Calico handles IP allocation and routing, but the portmap plugin handles HostPort translation, and bandwidth handles traffic shaping.

# Multus is a "meta-plugin": it is itself the entry in /etc/cni/net.d/ and delegates
# to other plugins. Chaining still happens inside the delegate's "plugins" list.
# (Simplified sketch; the file name varies by installation and the kubeconfig field is omitted.)
{
  "name": "multus-net",
  "cniVersion": "1.0.0",
  "type": "multus",
  "delegates": [
    {
      "name": "k8s-pod-network",
      "cniVersion": "1.0.0",
      "plugins": [
        { "type": "calico" },
        { "type": "portmap", "capabilities": { "portMappings": true } },
        { "type": "bandwidth", "capabilities": { "bandwidth": true } }
      ]
    }
  ]
}

# When a pod requests additional interfaces via annotation:
# annotations:
#   k8s.v1.cni.cncf.io/networks: |
#     [{"name": "podnic", "interface": "pod1"}]
# "podnic" names a NetworkAttachmentDefinition; Multus runs the CNI config stored in it
# to create the second interface (pod1), for example macvlan with dhcp IPAM

Kubernetes Services — Deep Dive

A Service is a stable virtual IP (ClusterIP) that load-balances traffic to a set of Pods selected by label. The ClusterIP never changes (until the Service is deleted), decoupling consumers from the transient Pod IPs underneath. kube-proxy implements this virtual IP by programming kernel NAT rules.

ClusterIP Internals

The ClusterIP is allocated from the service CIDR (configured via --service-cluster-ip-range on kube-apiserver; kubeadm defaults to 10.96.0.0/12). In iptables mode it exists only in the kernel's NAT rules: there is no interface that carries 10.0.5.6 (IPVS mode does bind ClusterIPs to a dummy kube-ipvs0 interface). Packets to the ClusterIP are DNATted to a backend Pod IP by the rules kube-proxy installs.

DNS resolves api.default.svc.cluster.local to 10.0.5.6 (the ClusterIP)

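Client sends TCP SYN to 10.0.5.6:80. The packet leaves the client Pod's veth and enters the node's netfilter hooks in the nat PREROUTING chain (nat OUTPUT for traffic that originates on the node itself).

The rules installed by kube-proxy match there. They replace 10.0.5.6:80 with 10.244.1.5:8080 (a randomly chosen backend Pod). This is DNAT, and kube-proxy itself is never in the packet path.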

Packet is routed to the destination Pod (same node or remote). On the receiving end, the destination IP is the Pod IP — no NAT needed.

Response packet has src=PodIP, dst=client PodIP. The connection tracking (conntrack) entry rewrites src back to ClusterIP on the way back, so the client sees 10.0.5.6 as the response address.
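The steps above can be sketched as a tiny simulation of DNAT plus connection tracking. This is a teaching model only: the function and variable names are hypothetical, the addresses are the sample values from this section, and real conntrack keys on the full 5-tuple including protocol.

```python
import random

CLUSTER_IP = ("10.0.5.6", 80)
BACKENDS = [("10.244.1.5", 8080), ("10.244.2.6", 8080), ("10.244.3.7", 8080)]

# (client endpoint, chosen backend) -> original destination, as conntrack would remember it.
conntrack = {}


def outbound(src, dst, rng=random):
    """DNAT a new connection to the ClusterIP and record the translation."""
    if dst != CLUSTER_IP:
        return src, dst  # not Service traffic, pass through untouched
    backend = rng.choice(BACKENDS)
    conntrack[(src, backend)] = dst
    return src, backend


def inbound(src, dst):
    """Reverse translation for a reply: the backend's address becomes the ClusterIP again."""
    original = conntrack.get((dst, src))
    return (original or src), dst


client = ("10.244.3.8", 41000)
fwd_src, fwd_dst = outbound(client, CLUSTER_IP)
reply_src, reply_dst = inbound(fwd_dst, fwd_src)
print(fwd_dst)     # one of the backend Pod endpoints
print(reply_src)   # the ClusterIP, so the client never sees the Pod IP
```

The backend is chosen once per connection; every later packet of that connection follows the stored entry, which is why load balancing here is per connection rather than per request.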

Service Types Compared

# ======================
# ClusterIP — default
# ======================
# Virtual IP reachable ONLY from within the cluster.
# Internal tooling, sidecar proxies, other Pods all use this.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP              # this is the default, omit type: to get it
  selector:
    app: api
  ports:
    - name: http
      port: 80                # the Service port (what you hit)
      targetPort: 8080         # the container port the app listens on
      protocol: TCP
    - name: grpc
      port: 50051
      targetPort: 50051
      protocol: TCP

# ======================
# NodePort — exposes on all node IPs at 30000-32767
# ======================
# Adds a static host-port binding. kube-proxy adds an iptables rule:
#   iptables -t nat -A KUBE-NODEPORTS -p tcp --dport 30080 -j KUBE-SVC-XXX
spec:
  type: NodePort
  ports:
    - name: http
      port: 80
      targetPort: 8080
      nodePort: 30080          # explicit; omit for random (30000-32767)
    # NodePort services also get a ClusterIP automatically
---
# Equivalent iptables (simplified):
$ iptables -t nat -L KUBE-NODEPORTS -n -v
target     prot opt source       destination
KUBE-SVC-XXXX  tcp  --  0.0.0.0/0  0.0.0.0/0  tcp dpt:30080

# ======================
# LoadBalancer — provisions a cloud LB (AWS ELB, GCP LB, Azure LB)
# ======================
# The cloud controller manager watches for LoadBalancer services and
# provisions an external LB pointed at the NodePort. kube-proxy is still
# involved (NodePort is the LB's backend).
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
  # loadBalancerIP requests a pre-allocated cloud IP (deprecated since v1.24;
  # prefer the provider's annotation where one exists)
  loadBalancerIP: "203.0.113.10"
  loadBalancerSourceRanges:           # restrict access to the LB
    - "10.0.0.0/8"
  externalTrafficPolicy: Cluster      # default: route to any node
# externalTrafficPolicy: Local       # only route to nodes with running pods
#    Preserves client IP (no intermediate node NAT)
#    May cause uneven load distribution

# ======================
# ExternalName — CNAME, no proxying
# ======================
# Maps the service DNS name to an external DNS name. No ClusterIP,
# no kube-proxy involvement. Useful for pointing to an external db.
spec:
  type: ExternalName
  externalName: db.us-east-1.rds.amazonaws.com
# api.default.svc.cluster.local → db.us-east-1.rds.amazonaws.com

# ======================
# Headless — no ClusterIP, DNS returns Pod IPs directly
# ======================
# Useful for StatefulSets (Cassandra, ZooKeeper) where clients need to
# discover all pod IPs and connect directly (no intermediate proxy).
spec:
  clusterIP: None              # this is the headless signal
  selector:
    app: cassandra
  ports:
    - port: 9042
# DNS query for cassandra.default.svc.cluster.local returns:
#   A  10.244.1.12
#   A  10.244.2.8
#   A  10.244.3.21
# CoreDNS builds these records from the Service's EndpointSlices:
# one A/AAAA record per ready Pod, and no record for a virtual IP.
# With headless, clients can do SRV record discovery for all replicas.

Session Affinity

By default, kube-proxy distributes connections with iptables --probability (random distribution). For stateful protocols, you may want sticky sessions.

spec:
  sessionAffinity: ClientIP    # affinity based on client IP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800    # 3 hours (default); max 86400 (1 day)

# Balancing is per connection, not per request: a long-lived HTTP/2 or gRPC
# connection stays on one backend until the client reconnects.
# sessionAffinity: None → random choice per new connection (iptables statistic match), not strict round-robin
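The ClientIP affinity behaviour can be modelled in a few lines. This sketch assumes the affinity timer is refreshed on every new connection from the same client; the class name is hypothetical and time is passed in explicitly so the logic stays deterministic.

```python
import random


class ClientIPAffinity:
    """Model of sessionAffinity: ClientIP with a sliding timeout (assumed semantics)."""

    def __init__(self, backends, timeout_seconds=10800, rng=None):
        self.backends = list(backends)
        self.timeout = timeout_seconds
        self.rng = rng or random.Random()
        self.sticky = {}  # client IP -> (backend, time of last new connection)

    def pick(self, client_ip: str, now: float) -> str:
        entry = self.sticky.get(client_ip)
        if entry and now - entry[1] <= self.timeout:
            backend = entry[0]          # still within the affinity window
        else:
            backend = self.rng.choice(self.backends)
        self.sticky[client_ip] = (backend, now)
        return backend


lb = ClientIPAffinity(["10.244.1.5", "10.244.2.6", "10.244.3.7"])
first = lb.pick("192.0.2.10", now=0)
later = [lb.pick("192.0.2.10", now=t) for t in (60, 3600, 10000)]
print(first, later)  # the same backend every time within the window
```

Note that affinity keys on the source IP the node sees, so clients behind one NAT gateway all land on the same backend.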

kube-proxy Modes

kube-proxy usually runs as a DaemonSet. It watches Service and EndpointSlice changes and programs the kernel's forwarding rules accordingly. It runs on every node and implements the Service virtual IP → backend Pod IP translation.

iptables Mode

The historical default. kube-proxy creates chains in the nat table named KUBE-SVC-<hash> (one per Service port) and KUBE-SEP-<hash> (one per backend endpoint). The Service chain uses statistic mode random probability to distribute load. The problem: rules are evaluated linearly. For a Service with 100 Pods, a new connection may traverse up to 100 probability rules, and the KUBE-SERVICES chain itself grows with the number of Services. At 10,000 Services, linear scans and full-table rule rewrites dominate CPU.

# Inspect kube-proxy iptables rules for a service
$ iptables -t nat -L KUBE-SERVICES -n --line-numbers | grep api
1  KUBE-SVC-XXXXXXXX  tcp  --  0.0.0.0/0  10.0.5.6  /* default/api:http */ tcp dpt:80

$ iptables -t nat -L KUBE-SVC-XXXXXXXX -n
target         prot opt source     destination
KUBE-SEP-AAA   all  --  0.0.0.0/0  0.0.0.0/0  /* default/api:http */ statistic mode random probability 0.33333333349
KUBE-SEP-BBB   all  --  0.0.0.0/0  0.0.0.0/0  /* default/api:http */ statistic mode random probability 0.50000000000
KUBE-SEP-CCC   all  --  0.0.0.0/0  0.0.0.0/0  /* default/api:http */

$ iptables -t nat -L KUBE-SEP-AAA -n
target          prot opt source      destination
KUBE-MARK-MASQ  all  --  10.244.1.5  0.0.0.0/0  /* default/api:http */
DNAT            tcp  --  0.0.0.0/0   0.0.0.0/0  /* default/api:http */ tcp to:10.244.1.5:8080

# On a busy node, the nat table can get very large
$ iptables -t nat -L KUBE-SERVICES -n | wc -l
4821   # rules — still manageable at thousands of services
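The probabilities in a KUBE-SVC chain look uneven (0.33, 0.5, then an unconditional rule), yet they produce an even split because each rule only sees the traffic the earlier rules passed over. A short check with exact fractions, using hypothetical helper names:

```python
from fractions import Fraction


def rule_probabilities(n: int):
    """Probability on rule i (0-based) of an n-endpoint chain: 1/(n - i)."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n: int):
    """Overall fraction of new connections that each endpoint receives."""
    shares, remaining = [], Fraction(1)
    for p in rule_probabilities(n):
        shares.append(remaining * p)   # traffic that reaches this rule and matches
        remaining *= 1 - p             # traffic that falls through to the next rule
    return shares


print(rule_probabilities(3))  # 1/3, 1/2, 1
print(effective_share(3))     # every endpoint ends up with the same share
```

The last rule has probability 1, which is why it appears in the chain without a statistic match at all.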

IPVS Mode

IPVS (IP Virtual Server) is a kernel module in the net/netfilter subsystem designed specifically for load balancing. It uses a hash table internally, giving O(1) lookup regardless of the number of services or endpoints. It's the right choice for clusters with thousands of Services.

# Switch kube-proxy to IPVS mode via configmap
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: rr          # round-robin; also: lc (least connections), sh (source hash)
  syncPeriod: 30s       # how often to sync IPVS rules with endpoints
  minSyncPeriod: 10s   # minimum interval between syncs

# Inspect IPVS table — hash table lookup is O(1)
$ ipvsadm -L -n
TCP  10.0.5.6:80 rr
  -> 10.244.1.5:8080            Masq    1      0          0
  -> 10.244.2.6:8080            Masq    1      0          0
  -> 10.244.3.7:8080            Masq    1      0          0

# Masq = masquerading (NAT) forwarding, the method kube-proxy uses for every backend.
# externalTrafficPolicy: Local does not change the forwarding method; it limits the
# NodePort/LoadBalancer virtual servers on each node to that node's own endpoints.
# Illustrative output for the NodePort 30080 service on node-1:
$ ipvsadm -L -n | grep -A1 ':30080'
TCP  192.168.1.101:30080 rr
  -> 10.244.1.5:8080        Masq    1      0          0

# IPVS also supports session persistence (same as sessionAffinity: ClientIP)
$ ipvsadm -L -n --persistent-conn
Prot LocalAddress:Port
  -> RemoteAddress:Port           Weight   Active   InActve   Persist
TCP  10.0.5.6:80 persistent 10800
  -> 10.244.1.5:8080            1        10       0         0
  -> 10.244.2.6:8080            1        8        0         0
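The difference between the rr and lc schedulers named in the config above is easiest to see side by side. This is a simplified model: real IPVS also weights backends and counts inactive connections when computing least-connections, which the sketch ignores.

```python
import itertools


def round_robin(backends):
    """rr: hand out backends in a fixed rotation, ignoring load."""
    return itertools.cycle(backends)


def least_connections(active: dict) -> str:
    """lc (simplified): pick the backend with the fewest open connections."""
    return min(active, key=active.get)


rr = round_robin(["10.244.1.5", "10.244.2.6", "10.244.3.7"])
rotation = [next(rr) for _ in range(4)]
print(rotation)  # wraps around to the first backend on the fourth pick

open_conns = {"10.244.1.5": 10, "10.244.2.6": 8, "10.244.3.7": 9}
print(least_connections(open_conns))
```

With short, uniform requests the two behave almost identically; lc only pulls ahead when connection lifetimes vary widely.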

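nftables Mode (alpha in Kubernetes 1.29, beta in 1.31, GA in 1.33)

nftables is the modern successor to iptables. kube-proxy in nftables mode uses the nftables API instead of iptables. Rather than one rule per Service in a linear chain, it can dispatch through sets and verdict maps, which makes Service lookup far cheaper than an iptables chain scan and allows incremental ruleset updates. The mode was alpha in 1.29, became beta in 1.31, and is GA as of 1.33; it requires kernel 5.13 or newer.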

# Enable nftables mode (on 1.29-1.30 this also needs the NFTablesProxyMode feature gate)
# In KubeProxyConfiguration:
kind: KubeProxyConfiguration
mode: "nftables"
nftables:
  syncPeriod: 30s        # periodic full resync
  minSyncPeriod: 1s      # minimum interval between rule updates

# Inspect the nftables rules (simplified sketch; real chain and map names carry hashes)
$ nft list table ip kube-proxy
table ip kube-proxy {         # ip = IPv4; ip6 = IPv6
  map service-ips {           # one lookup replaces a linear chain of per-Service rules
    type ipv4_addr . inet_proto . inet_service : verdict
    elements = { 10.0.5.6 . tcp . 80 : goto service-api }
  }

  chain services {
    ip daddr . meta l4proto . th dport vmap @service-ips
  }

  chain service-api {         # pick one endpoint chain at random
    numgen random mod 3 vmap { 0 : goto endpoint-1, 1 : goto endpoint-2, 2 : goto endpoint-3 }
  }

  chain endpoint-1 {
    meta l4proto tcp dnat to 10.244.1.5:8080
  }

  chain masquerading {        # SNAT only packets that were marked for masquerade
    meta mark & 0x4000 == 0 return
    masquerade
  }
}

# Compare rule count: nftables uses sets and maps for efficiency
# One map holds { 10.0.5.6 . tcp . 80, 10.0.6.12 . tcp . 80, 10.0.7.3 . tcp . 80 }
# rather than N individual iptables rules

Performance Comparison

iptables: O(N) chain scan; ~0.1ms, ~1ms, ~10ms+ latency spike

IPVS: O(1) hash; ~0.05ms

nftables: O(N) with sets; ~0.1ms, ~0.5ms, ~5ms

eBPF (Cilium): O(1) hash in kernel; ~0.01ms

DNS for Services

Kubernetes DNS (CoreDNS) is the cluster's DNS server. It answers DNS queries for Service names, Pod names, and external names. All Pods are configured to use kube-dns (the Service for coredns) as their nameserver via /etc/resolv.conf injected by kubelet at pod creation.

DNS Record Types

# ======================
# A/AAAA records for Services
# ======================
# For a ClusterIP service, CoreDNS creates an A record (or AAAA for IPv6):
# api.default.svc.cluster.local → 10.0.5.6  (IPv4)
# api.default.svc.cluster.local → 2001:db8::6 (IPv6 dual-stack)

$ kubectl run dnsutils --image=tutum/dnsutils --restart=Never -- sleep 3600
$ kubectl exec dnsutils -- nslookup api.default.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      api.default.svc.cluster.local
Address:   10.0.5.6

# ======================
# A records for headless StatefulSet Pods
# ======================
# For headless services with StatefulSet pods:
# cassandra-0.cassandra.default.svc.cluster.local → 10.244.1.12
# cassandra-1.cassandra.default.svc.cluster.local → 10.244.2.8
$ kubectl exec dnsutils -- nslookup cassandra-0.cassandra.default.svc.cluster.local
Name:      cassandra-0.cassandra.default.svc.cluster.local
Address:   10.244.1.12

# ======================
# SRV records
# ======================
# _http._tcp.api.default.svc.cluster.local → 0 100 80 api.default.svc.cluster.local
# _https._tcp.api.default.svc.cluster.local → 0 100 443 api.default.svc.cluster.local
# Format: _port-name._protocol.service-name.namespace.svc.cluster.local
# Priority=0 (unused for Services, used for headless), Weight=100, Port=80

$ kubectl exec dnsutils -- dig +short SRV _http._tcp.api.default.svc.cluster.local
0 100 80 api.default.svc.cluster.local.

# SRV records for headless services point to pod A records:
# _http._tcp.cassandra.default.svc.cluster.local → 0 100 9042 cassandra-0.cassandra.default.svc.cluster.local
#                                               → 0 100 9042 cassandra-1.cassandra.default.svc.cluster.local

# ======================
# ExternalName — CNAME
# ======================
# <service>.<namespace>.svc.cluster.local (Service with externalName: db.us-east-1.rds.amazonaws.com)
# → CNAME db.us-east-1.rds.amazonaws.com (no A record synthesized)

# ======================
# Search path — /etc/resolv.conf on every Pod
# ======================
# Kubelet injects this into every pod's /etc/resolv.conf:
$ kubectl exec dnsutils -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

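# With ndots:5, a name with fewer than 5 dots is tried against the search path first;
# a name with 5 or more dots (or a trailing dot) is tried as-is first.
# Query "mysql" → "mysql.default.svc.cluster.local" (0 dots, search path applied)
# Query "example.com" → "example.com.default.svc.cluster.local", ... then "example.com"
#   (1 dot, so all three search domains are tried and fail before the real lookup)
# This multiplies DNS traffic for external names. Append a trailing dot ("example.com.")
# or lower ndots via the pod's dnsConfig to skip the search path.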
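The search-path behaviour is easier to see as code. The following is a minimal sketch of how a glibc-style stub resolver orders its candidate queries. The function name `candidate_queries` is hypothetical, and other resolvers (musl, Go's built-in resolver) differ in details, so treat it as an illustration rather than a reference implementation.

```python
def candidate_queries(name, search, ndots=5):
    """Return the FQDNs a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, the literal name last.
    return expanded + [absolute]


# Search list copied from the resolv.conf shown above.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("example.com", SEARCH):
    print(query)
```

Under these assumptions, an external name such as example.com costs four lookups instead of one, which is why a trailing dot or a lower ndots value is often suggested for pods that mostly call external hosts.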

Headless Services and DNS Discovery

When a Service has clusterIP: None, kube-apiserver does not allocate a ClusterIP. Instead, the endpoint controller creates endpoint records that point directly to Pod IPs. CoreDNS receives these EndpointSlice events and creates individual A records — one per pod. The client gets all of them and can choose how to balance (random, round-robin via DNS TTL, or connect to all).

⚠️ Gotcha: DNS-based load balancing is not load balancing. Clients typically cache the first A record they receive and use it. Short TTLs (e.g., 5 seconds) give a rough round-robin effect at DNS level, but this is not real load balancing. Use a Service (with a ClusterIP) for proper load balancing; use headless Services only when clients need direct pod discovery (e.g., StatefulSets, Elasticsearch replicas, ZooKeeper ensembles).

# For headless services, CoreDNS creates multiple A records:
$ kubectl exec dnsutils -- nslookup web.default.svc.cluster.local
Name:      web.default.svc.cluster.local
Address:   10.244.1.37
Address:   10.244.2.19
Address:   10.244.3.8

# Java clients (Elasticsearch, etc.) use this to discover all replica IPs
# Python clients may only use the first record
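Because a headless Service returns every Pod IP, a client that wants to spread load has to do so itself. The sketch below shows one way to collect all addresses and rotate through them. `resolve_all` and `round_robin` are hypothetical helper names, and the resolver is injectable so the logic can be exercised without a cluster.

```python
import itertools
import socket


def resolve_all(host, port, resolver=socket.getaddrinfo):
    """Return every distinct IP the resolver reports for host, in answer order."""
    infos = resolver(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:  # getaddrinfo may repeat an address
            addresses.append(ip)
    return addresses


def round_robin(addresses):
    """Cycle through the resolved addresses forever (client-side balancing)."""
    return itertools.cycle(addresses)
```

Inside a Pod, `resolve_all("web.default.svc.cluster.local", 80)` would return the Pod IPs behind the headless Service. A real client should re-resolve periodically, since Pod IPs change on every reschedule.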

Ingress Controllers

Ingress provides HTTP/HTTPS routing into the cluster — host-based routing, path-based routing, TLS termination. The Ingress API has been in Kubernetes since 1.1, but the controller is NOT included in a standard cluster — you must install one.

ingress-nginx

The Kubernetes community's NGINX-based Ingress controller (different from the "nginx-ingress" maintained by NGINX Inc). Runs a custom NGINX build that watches Ingress resources and regenerates nginx.conf on changes.

# Install ingress-nginx
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.9.4/deploy/static/provider/cloud/deploy.yaml

# Basic Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    # ^ Rewrites every matched request path to "/". To strip only the /v1 prefix,
    #   use a regex path with a capture group and rewrite-target: /$2 instead.
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
---
# TLS termination
spec:
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls        # must be a TLS Secret in the same namespace
# ingress-nginx fetches the cert/key from the Secret and terminates TLS

Gateway API (Recommended)

Gateway API is the successor to Ingress. It decouples the role of "who manages the infrastructure" (GatewayClass/Gateway) from "who writes routing rules" (HTTPRoute). This enables multi-team setups where an infrastructure team owns the Gateway and application teams own their HTTPRoutes.

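# Gateway API requires a controller (Cilium, Envoy Gateway, Traefik, NGINX Gateway Fabric)
# Install Cilium with its Gateway API controller enabled
# (the Gateway API CRDs must already be installed in the cluster):
$ cilium install --set kubeProxyReplacement=true --set gatewayAPI.enabled=true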

# Step 1: GatewayClass — declares which controller is available
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: cilium
spec:
  controllerName: io.cilium/gateway-controller

# Step 2: Gateway — the actual load balancer / listener
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: web
  namespace: networking
spec:
  gatewayClassName: cilium
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All      # HTTPRoutes in other namespaces may attach (Same would reject the route in Step 3)
    - name: https
      port: 443
      protocol: HTTPS
      allowedRoutes:
        namespaces:
          from: All
      tls:
        mode: Terminate
        certificateRefs:
          - name: web-tls
            kind: Secret
            group: ""
            namespace: networking

# Step 3: HTTPRoute — routing rules (cross-namespace reference)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api
  namespace: default        # different from the Gateway's namespace!
spec:
  parentRefs:
    - kind: Gateway
      name: web
      namespace: networking
      sectionName: http
  hostnames: [ "api.example.com" ]
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
          headers:              # same match entry, so path AND header must both match
            - type: Exact
              name: x-canary
              value: "true"
      backendRefs:
        - name: api-canary
          port: 80
          weight: 10          # 10% canary
        - name: api-stable
          port: 80
          weight: 90
    - matches:
        - path:
            type: PathPrefix
            value: /v2
      backendRefs:
        - name: api-v2
          port: 80

Other Ingress Controllers

Traefik

Written in Go. Native Ingress (v1) and Gateway API support. Has a neat IngressRoute CRD (Traefik-specific) with more features than core Ingress. Good for simple deployments, less suited for ultra-high-throughput than NGINX. Supports ACME auto-certificates.

# Traefik IngressRoute (their custom CRD)
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: api
spec:
  entryPoints: [web, websecure]
  routes:
    - match: Host(`api.example.com`) && PathPrefix(`/v1`)
      kind: Rule
      services:
        - name: api
          port: 80

Contour + Envoy

Envoy is the data plane (high-performance C++ proxy). Contour is the Kubernetes controller that configures it. Supports Gateway API. Originally created at Heptio (later acquired by VMware, which is now part of Broadcom). Good for enterprises wanting a battle-tested Envoy setup.

# Contour's HTTPProxy (similar to IngressRoute)
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: api
spec:
  virtualhost:
    fqdn: api.example.com
  routes:
    - conditions: [prefix: /v1]
      services:
        - name: api
          port: 80

NetworkPolicy

NetworkPolicy is a Kubernetes object for firewalling at the Pod level. It selects Pods with a label selector and defines what inbound (ingress) and outbound (egress) traffic is allowed. The CNI plugin is responsible for enforcing these rules — the Kubernetes API just stores them.

policyTypes and Implicit Default-Deny

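If a policy lists Ingress in policyTypes, every Pod it selects becomes isolated for ingress: only traffic explicitly allowed by some policy gets in. Listing Egress does the same for outbound traffic. Isolation applies to the selected Pods, not to the whole namespace, unless the podSelector is empty. A common source of surprise is omitting policyTypes: it then defaults to Ingress, plus Egress only if the policy contains egress rules, so a policy meant as an egress default-deny silently isolates ingress instead.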

# Default-deny both ingress and egress for a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: production
spec:
  podSelector: {}           # empty selector = all pods in this namespace
  policyTypes:
    - Ingress
    - Egress

---
# Allow DNS egress (required for any pod to resolve names)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

---
# Allow web → api on TCP 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-api
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080
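How policyTypes is defaulted decides which directions a policy isolates, so it helps to spell the rule out. The sketch below mirrors the documented behaviour: Ingress is always implied, and Egress is implied only when egress rules are present. `effective_policy_types` is a hypothetical helper operating on a policy spec given as a plain dict.

```python
def effective_policy_types(spec):
    """Return the policyTypes the API server would apply to this spec."""
    declared = spec.get("policyTypes")
    if declared:
        return list(declared)
    # No policyTypes given: Ingress is always set,
    # Egress only when the policy carries at least one egress rule.
    types = ["Ingress"]
    if spec.get("egress"):
        types.append("Egress")
    return types


deny_all = {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}
no_types = {"podSelector": {}}
egress_only_rules = {"podSelector": {}, "egress": [{"ports": [{"port": 53}]}]}

for spec in (deny_all, no_types, egress_only_rules):
    print(effective_policy_types(spec))
```

The second case is the trap: an empty policy without policyTypes isolates ingress only and leaves egress wide open.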

Egress Policies and External Destinations

# Allow api pod to reach external Prometheus (monitoring)
# Requires specifying ipBlock with CIDR of the Prometheus server
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-prometheus
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24   # external Prometheus CIDR
      ports:
        - protocol: TCP
          port: 9090

# Allow api pod to reach the internet (0.0.0.0/0 except cluster CIDR)
# ipBlock with except: lets you carve out the cluster CIDR
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-internet
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8        # exclude RFC 1918 private ranges (cluster-internal traffic)
              - 172.16.0.0/12
              - 192.168.0.0/16
              - 127.0.0.0/8       # loopback
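An ipBlock rule matches a destination when the address falls inside cidr and outside every except entry. The sketch below evaluates that with Python's standard ipaddress module. `ipblock_allows` is a hypothetical helper; the actual enforcement happens in the CNI's dataplane.

```python
import ipaddress


def ipblock_allows(dest, cidr, excepts=()):
    """True if dest is inside cidr and not inside any except range."""
    addr = ipaddress.ip_address(dest)
    if addr not in ipaddress.ip_network(cidr):
        return False
    return not any(addr in ipaddress.ip_network(e) for e in excepts)


# Except list from the internet-egress policy, with the loopback range in canonical form.
EXCEPT = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8"]

print(ipblock_allows("8.8.8.8", "0.0.0.0/0", EXCEPT))      # public address
print(ipblock_allows("10.244.1.5", "0.0.0.0/0", EXCEPT))   # pod address
```

Note that `ipaddress.ip_network("127.0.0.1/8")` raises ValueError because host bits are set, which is a quick way to catch non-canonical CIDRs before they reach a manifest.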

⚠️ Gotcha: NetworkPolicy is CNI-implementation-dependent. Not all CNIs implement NetworkPolicy. Calico and Cilium do fully. Flannel does NOT (by default). Weave does partially. Before designing NetworkPolicy-based security, verify your CNI supports it:

# Check whether your CNI enforces NetworkPolicy
# Listing policies only shows what is stored in the API, not what is enforced:
$ kubectl get networkpolicy --all-namespaces
# With Calico, the Felix agent on each node programs these policies into
# iptables rules (the default dataplane) or eBPF programs (when the eBPF dataplane is enabled).

# For Cilium:
$ kubectl get ciliumnetworkpolicy -A
# Cilium's eBPF datapath evaluates policies at line rate.

# Verify enforcement by testing denied traffic:
$ kubectl exec web-0 -- curl -s --connect-timeout 2 http://10.244.3.8:8080
# Should timeout if policy denies it.

Service Mesh — Sidecar Injection, mTLS, and Traffic Interception

A service mesh (Istio, Linkerd, Kuma) adds per-pod proxy sidecars that intercept all inbound and outbound traffic. The sidecar handles mTLS, retries, circuit breaking, and observability without the application code knowing.

Sidecar Injection

A sidecar is an extra container in the same Pod as your application, sharing the network namespace. The sidecar's lo and eth0 are the app's lo and eth0, so rules installed in that namespace can redirect the app's connections to the proxy transparently: the app believes it is talking directly to the destination, when it is actually connecting to the sidecar, which then forwards the (possibly mTLS-wrapped) traffic.

# Istio automatic sidecar injection
# Label the namespace to inject Envoy sidecars into all pods
$ kubectl label namespace default istio-injection=enabled
$ kubectl apply -f deployment.yaml
# Istio's mutating admission webhook injects the Envoy sidecar container into new pods

# Pod with sidecar: note the extra container
$ kubectl get pod api-0 -o jsonpath='{.spec.containers[*].name}'
api istio-proxy

# The sidecar container in the pod spec:
$ kubectl get pod api-0 -o jsonpath='{.spec.containers[1]}'
{
  "name": "istio-proxy",
  "image": "docker.io/istio/proxyv2:1.19",
  "args": ["proxy", "sidecar", ...]
}
# iptables rules in the pod redirect traffic to Envoy's listeners
# (15001 for outbound, 15006 for inbound); Envoy then routes it based on its config.
# The application container's ports are still bound in the same netns.

mTLS in the Mesh

In a service mesh, every pod gets an identity (SPIFFE URI) stored in a certificate. The sidecar proxy terminates and re-encrypts mTLS for every connection. Pod A connecting to Pod B uses a certificate signed by the mesh CA. The destination verifies the certificate before accepting the connection.

# SPIFFE identity for a pod (Istio format)
# spiffe://cluster.local/ns/default/sa/api
#   cluster.local = trust domain
#   ns/default    = namespace
#   sa/api        = service account

# mTLS in STRICT mode — no plaintext, no unknown identities
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT       # all pods must use mTLS

# DestinationRule controls mTLS and traffic policies per service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api
spec:
  host: api
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # Istio provisions certs automatically

# Without mesh: Pod A → Pod B uses plain TCP
# With mesh mTLS:  Pod A's envoy → Pod B's envoy (mTLS) → Pod B
# The app on Pod B sees plaintext; the envoy handles crypto
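Authorization decisions in the mesh ultimately compare the components of the peer's SPIFFE ID. The sketch below splits an Istio-format ID into its parts using only the standard library. `parse_istio_spiffe_id` is a hypothetical helper, and the path layout it expects (ns/<namespace>/sa/<service-account>) is the Istio convention shown above, not a requirement of SPIFFE itself.

```python
from urllib.parse import urlparse


def parse_istio_spiffe_id(uri):
    """Split spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(uri)
    parts = parsed.path.strip("/").split("/")
    if (
        parsed.scheme != "spiffe"
        or not parsed.netloc
        or len(parts) != 4
        or parts[0] != "ns"
        or parts[2] != "sa"
    ):
        raise ValueError(f"not an Istio-style SPIFFE ID: {uri!r}")
    return {
        "trust_domain": parsed.netloc,
        "namespace": parts[1],
        "service_account": parts[3],
    }


print(parse_istio_spiffe_id("spiffe://cluster.local/ns/default/sa/api"))
```

The identity is tied to the service account rather than to the Pod IP, which is why it survives rescheduling.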

How Sidecars Intercept Traffic (iptables)

Istio uses iptables redirection to intercept all inbound and outbound TCP traffic in the Pod. The istio-init init container (or the Istio CNI plugin) installs rules in the nat table's OUTPUT and PREROUTING chains that redirect traffic to the Envoy process running in the istio-proxy sidecar. Envoy then decides what to do with it (forward, mTLS, reject).

# The istio-init init container sets up iptables in the pod's network namespace.
# It runs BEFORE the application container and configures (simplified):
#   iptables -t nat -A OUTPUT     -p tcp -j ISTIO_REDIRECT  # app → proxy
#   iptables -t nat -A PREROUTING -p tcp -j ISTIO_REDIRECT  # ingress → proxy

# The ISTIO_REDIRECT chain redirects to Envoy's outbound listener:
#   iptables -t nat -A ISTIO_REDIRECT -p tcp \
#     -j REDIRECT --to-ports 15001
# (Real Istio rule sets send inbound traffic through a separate chain to port 15006.)

# Result: every TCP connection from the app goes through Envoy first.
# Envoy applies policies (mTLS verification, RBAC, retries) then
# re-encrypts and sends to the destination.

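# Check the rules in the pod's network namespace. This needs NET_ADMIN; if the
# istio-proxy container lacks it, run iptables via nsenter on the node instead:
$ kubectl exec -it api-0 -c istio-proxy -- iptables -t nat -L -n -v
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)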
 pkts bytes target     prot opt in     out     source       destination
    0     0 ISTIO_INBOUND  tcp  --  *      *       0.0.0.0/0    0.0.0.0/0

Chain ISTIO_INBOUND (1 references)
 pkts bytes target     prot opt in     out     source       destination
    0     0 RETURN     tcp  --  *      *       0.0.0.0/0    0.0.0.0/0    tcp dpt:8080
    0     0 ISTIO_REDIRECT tcp  --  *      *       0.0.0.0/0    0.0.0.0/0

Chain ISTIO_REDIRECT (1 references)
 pkts bytes target     prot opt in     out     source       destination
    0     0 REDIRECT   tcp  --  *      *       0.0.0.0/0    0.0.0.0/0    redir_ports 15001

Service Mesh vs CNI Policy — When to Use Each

Use CNI NetworkPolicy for cluster-wide default-deny, L3/L4 firewalling, and cross-namespace isolation. It works without any application code change. Calico and Cilium enforce it at the kernel level.

Use a Service Mesh when you need mTLS between specific services, L7 traffic management (retries, circuit breaking, %-based traffic splitting), fine-grained RBAC at the HTTP method level, or distributed tracing headers propagation.

⚠️ Gotcha: Sidecar overhead. Every packet goes through the sidecar proxy. For high-throughput, latency-sensitive workloads, this adds measurable CPU and memory overhead. Cilium's eBPF approach provides much of the observability without the proxy overhead. Consider whether you need full sidecar-based Istio, or whether Cilium's NetworkPolicy + Hubble, or Istio's sidecar-less ambient mode, suffices.

CNI Chaining in Practice

CNI chaining lets you compose plugins so each handles a specific capability. The canonical use case: Calico handles pod IP networking, but you need HostPort support, which Calico doesn't do. The portmap plugin sits in the chain after Calico and handles the HostPort translation via iptables.

# How the chain executes for ADD:
# 1. the container runtime (containerd/CRI-O, at kubelet's request) calls calico (first plugin in conflist)
# 2. calico sets up eth0 with pod IP
# 3. calico returns with ips/routes
# 4. portmap receives calico's output as input (prevResult)
# 5. portmap adds HostPort iptables rules
# 6. portmap returns the result for the next plugin
# 7. bandwidth receives portmap's output
# 8. bandwidth adds tc (traffic control) qdisc rules
# 9. bandwidth returns the final result to the runtime

# Resulting iptables rule for a HostPort pod (portmap keeps its rules in the
# CNI-HOSTPORT-DNAT chain and per-pod CNI-DN-<hash> chains):
$ iptables -t nat -L -n | grep 'dpt:30080'
DNAT    tcp  --  0.0.0.0/0  0.0.0.0/0  tcp dpt:30080 to:10.244.1.5:8080
# 10.244.1.5:8080 is the pod's actual IP:port

# HostPort: containerPort → hostPort mapping:
#   - containerPort: 8080 (what the container binds to)
#   - hostPort: 30080   (what the host exposes)
# The portmap plugin translates host:30080 → podIP:8080 via DNAT
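The key idea in chaining is that each plugin receives the previous plugin's result and hands a result to the next one. The toy model below mimics that flow. The three functions are hypothetical stand-ins: real CNI plugins are separate binaries that exchange JSON over stdin/stdout, and the side effects here are only recorded as strings.

```python
def calico(prev_result, effects):
    """First plugin: creates the interface and produces the initial result."""
    effects.append("calico: create veth pair and assign 10.244.1.5/24 to eth0")
    return {"interfaces": [{"name": "eth0"}], "ips": [{"address": "10.244.1.5/24"}]}


def portmap(prev_result, effects):
    """Reads the pod IP from prevResult to build the HostPort DNAT rule."""
    pod_ip = prev_result["ips"][0]["address"].split("/")[0]
    effects.append(f"portmap: DNAT host:30080 -> {pod_ip}:8080")
    return prev_result  # passes the result through unchanged


def bandwidth(prev_result, effects):
    """Adds traffic shaping; also passes the result through."""
    effects.append("bandwidth: attach tc qdisc to the pod's host-side veth")
    return prev_result


def run_add(chain):
    """Run an ADD through the chain, feeding each result to the next plugin."""
    effects, result = [], None
    for plugin in chain:
        result = plugin(result, effects)
    return result, effects


result, effects = run_add([calico, portmap, bandwidth])
for line in effects:
    print(line)
```

The ordering matters: portmap can only build its DNAT rule because calico ran first and reported the pod IP.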

Multus — Meta Plugin for Multiple Networks

Multus is a meta-CNI that calls other CNIs based on per-pod annotations. It solves the problem of pods that need multiple network interfaces: for example, a pod that needs a management interface (Calico) AND a data-plane interface (for example macvlan with host-local IPAM, for storage) AND a DPDK interface (for NFV workloads).

# A pod with two network interfaces via Multus annotation.
# eth0 comes from the default cluster network (Calico here) and is not listed;
# each entry names a NetworkAttachmentDefinition (data-network is assumed to be
# a macvlan attachment that uses host-local IPAM).
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {
          "name": "data-network",
          "interface": "eth1"
        }
      ]

# Result inside the pod:
$ ip addr
1: lo: <LOOPBACK> mtu 65536
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500
    inet 10.244.1.37/24    # Calico IP (pod network)
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500
    inet 10.200.1.5/24    # host-local IP (data network)

IP Address Management (IPAM) — CIDRs and Allocations

Kubernetes needs to allocate three separate CIDR blocks when it starts. Getting these right before cluster bootstrap avoids painful renumbering later.

Pod CIDR

--cluster-cidr on kube-controller-manager (or in CNI config). The entire address space for all Pod IPs. Typically a /16. Each node is assigned a /24 (or /26) sub-range from this.

e.g., 10.244.0.0/16 — supports ~65K pods total

Service CIDR

--service-cluster-ip-range on kube-apiserver. The address space for Service ClusterIPs. kubeadm defaults to 10.96.0.0/12. Service IPs are virtual: they don't correspond to any interface.

e.g., 10.96.0.0/12, which holds about 1M addresses (2^20 = 1,048,576)

Node CIDR

--node-cidr-mask-size on kube-controller-manager. Determines the per-node Pod CIDR size. Default /24 for IPv4, /64 for IPv6. A /24 gives 254 usable IPs on each node; at 110 max pods/node, that's fine.

/24 per node → 254 usable IPs, 110 max pods — plenty

# kubeadm cluster bootstrap with explicit CIDRs
$ kubeadm init --pod-network-cidr=10.244.0.0/16 \
                --service-cidr=10.96.0.0/12 \
                --service-dns-domain=cluster.local

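# Verify the settings (kubeadm stores them in the kubeadm-config ConfigMap):
$ kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -E 'podSubnet|serviceSubnet'

# Check the per-node Pod CIDRs assigned by kube-controller-manager: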
$ kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.244.1.0/24 10.244.2.0/24 10.244.3.0/24

# kube-controller-manager assigns /24 to each node (node-cidr-mask-size=24)
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
NAME      PODCIDR
node-1    10.244.1.0/24
node-2    10.244.2.0/24
node-3    10.244.3.0/24

# In AWS with VPC CNI, pods use the host's ENI secondary IPs:
# Each EC2 instance type has a limit on ENIs and secondary IPs per ENI.
# The VPC CNI plugin allocates from the VPC subnet, so Pod CIDR == VPC subnet.

⚠️ Gotcha: CIDR exhaustion. The per-node mask trades pods per node against the number of nodes. With a /16 cluster CIDR and a /24 per node, the cluster can hold at most 256 nodes, and each node reserves 254 usable IPs even though --max-pods=110 caps it at 110 pods: a 50-node cluster reserves 50×254 = 12,700 IPs for at most 5,500 pods. If you have many nodes with low pod counts, allocate a /26 per node (62 usable IPs, up to 1,024 nodes from a /16) and lower --max-pods to fit. If you raise --max-pods above 254, a /24 per node is too small and you need a larger per-node block.
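The sizing trade-off is plain arithmetic, so it is worth checking before bootstrap. The sketch below computes the node limit and per-node capacity for an IPv4 cluster CIDR using the standard ipaddress module. `node_capacity` is a hypothetical helper, and it follows this document's convention of counting 2 addresses per block as unusable.

```python
import ipaddress


def node_capacity(cluster_cidr, node_mask):
    """Return (max_nodes, usable_ips_per_node) for an IPv4 cluster CIDR."""
    cluster = ipaddress.ip_network(cluster_cidr)
    if node_mask < cluster.prefixlen:
        raise ValueError("per-node mask must not be shorter than the cluster prefix")
    max_nodes = 2 ** (node_mask - cluster.prefixlen)
    usable_per_node = 2 ** (32 - node_mask) - 2
    return max_nodes, usable_per_node


for mask in (24, 26):
    nodes, usable = node_capacity("10.244.0.0/16", mask)
    print(f"/{mask} per node: up to {nodes} nodes, {usable} usable IPs each")

# Size of the Service range used in the kubeadm example.
print(ipaddress.ip_network("10.96.0.0/12").num_addresses)
```

A /24 per node gives 256 nodes with 254 addresses each, while a /26 gives 1,024 nodes with 62 addresses each, so --max-pods has to drop below 62 in the second layout.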

Connectivity Troubleshooting

Networking issues in Kubernetes typically fall into a few buckets: DNS failures, firewall/NetworkPolicy blocks, CNI misconfiguration, or MTU-related packet drops. Systematic debugging is essential.

DNS Debugging

# Step 1: Is kube-dns responding?
$ kubectl run dnsutils --image=tutum/dnsutils --restart=Never -- sleep 3600
$ kubectl exec dnsutils -- nslookup kubernetes.default.svc.cluster.local
# If this fails → kube-dns is not reachable from this namespace

# Step 2: Check /etc/resolv.conf is correct
$ kubectl exec dnsutils -- cat /etc/resolv.conf
nameserver 10.96.0.10        # must be kube-dns service IP
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# If nameserver is wrong → kubelet is not injecting the cluster DNS address
# Check kubelet's clusterDNS setting (--cluster-dns) and the pod's dnsPolicy

# Step 3: Dig directly with TCP (more reliable than UDP)
$ kubectl exec dnsutils -- dig @10.96.0.10 api.default.svc.cluster.local +tcp +noad
;; QUESTION SECTION:
;api.default.svc.cluster.local.  IN  A

;; ANSWER SECTION:
api.default.svc.cluster.local.  30  IN  A  10.0.5.6

# Step 4: Check CoreDNS pod logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns -c coredns
# Look for: formerr, SERVFAIL, refused, timeout entries
# CoreDNS log format: [INFO] plugin/kubernetes ...

Pod-to-Pod Connectivity

# Step 1: Ping the gateway (first hop)
$ kubectl exec web-0 -- ping -c 1 10.244.1.1   # gateway for node-1's subnet (bridge-based CNIs; Calico pods use 169.254.1.1)
# If ping fails → host routing not set up, check CNI

# Step 2: Ping the destination pod
$ kubectl exec web-0 -- ping -c 1 10.244.2.19   # pod on another node
# If ping fails:
#   - Cross-node routing: check CNI (Calico BGP sessions, Flannel routes)
#   - Security group/firewall: cloud security group blocking pod CIDR
#   - MTU: VXLAN overhead may fragment or drop at path MTU

# Step 3: TCP connect to the actual port
$ kubectl exec web-0 -- nc -zv 10.244.2.19 8080
Ncat: Connected to 10.244.2.19:8080.

# Step 4: Check iptables/eBPF for drops
# For iptables-based CNI (Calico in iptables mode):
$ iptables -t filter -L FORWARD -n -v | grep DROP
# Look for drops in the forward chain

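# For Cilium (run inside the agent pod; the binary is named cilium-dbg in newer releases):
$ kubectl -n kube-system exec ds/cilium -- cilium bpf lb list
# Verify the service exists and has backends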

tcpdump Patterns

# Capture on the host's veth interface for a specific pod
# First find the node that hosts pod web-0:
$ kubectl get pod web-0 -o jsonpath='{.status.hostIP}'
192.168.1.101
# On that node, list the veth interfaces:
$ ip link show | grep veth
$ bridge fdb show dev flannel.1   # if using Flannel VXLAN: list forwarding entries, then capture on flannel.1

# Capture VXLAN traffic (Flannel):
$ tcpdump -i eth0 -n port 8472 -c 10
# 8472 is Flannel's VXLAN port. You should see encapsulated UDP.
# If you see nothing → Flannel daemon not running on one of the nodes

# Capture plain IP traffic between pods on same node (Calico):
$ tcpdump -i caliXXXX -n -c 10   # caliXXX = veth interface name
# You'll see plaintext ICMP/TCP between pods (no encapsulation)

# Capture in a pod's network namespace directly:
$ nsenter --net=/var/run/netns/cni-abc123 -- tcpdump -i eth0 -n -c 5
# Useful when you want to see what the pod actually receives
# (not what's on the host veth before iptables processing)

MTU Pitfalls

# Check MTU on the pod interface:
$ kubectl exec web-0 -- ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420

# The gap: eth0 MTU 1420 vs host eth0 MTU 1500
# 1500 - 1420 = 80 bytes of overhead budget
# Flannel VXLAN: 50 bytes overhead (20 outer IPv4 + 8 UDP + 8 VXLAN + 14 inner Ethernet)
# 1420 + 50 = 1470, which still fits in 1500
# But if the physical network has MTU 1500 AND additional tunnel overhead,
# you may get fragmentation or ICMP "packet too big"

# Common cause: AWS VPC MTU is 1500, but some EC2 instance types have
# reduced MTU on secondary ENIs. Flannel's MTU detection may not pick this up.
# Solution: set explicit MTU in Flannel config:
net-conf.json: |
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "vxlan",
      "MTU": 1400
    }
  }

# Cilium auto-detects MTU but can be overridden with the MTU Helm value:
$ helm install cilium cilium/cilium \
    --set MTU=1400
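The MTU budget above can be checked with a few lines of arithmetic. The sketch below assumes an IPv4 underlay and standard VXLAN encapsulation. The helper names are hypothetical, and other tunnels (Geneve, WireGuard, IPIP) have different overheads.

```python
# VXLAN over IPv4: outer IP header + UDP header + VXLAN header + inner Ethernet header.
VXLAN_OVERHEAD = 20 + 8 + 8 + 14


def max_pod_mtu(underlay_mtu, overhead=VXLAN_OVERHEAD):
    """Largest pod-interface MTU that avoids fragmentation on the underlay."""
    return underlay_mtu - overhead


def fits(pod_mtu, underlay_mtu, overhead=VXLAN_OVERHEAD):
    """True if a full-size pod packet plus encapsulation fits the underlay MTU."""
    return pod_mtu + overhead <= underlay_mtu


print(VXLAN_OVERHEAD)
print(max_pod_mtu(1500))
print(fits(1420, 1500))
print(fits(1420, 1450))
```

With a 1500-byte underlay the pod MTU can be at most 1450. A pod MTU of 1420 is safe there, but not on a path whose real MTU is only 1450, which is the situation that produces silent drops of large packets.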

CoreDNS Internal Architecture

CoreDNS replaced kube-dns (which was built on SkyDNS): it reached GA as a cluster DNS add-on in Kubernetes 1.11 and became the default in 1.13. It's a modular DNS server written in Go, where each capability is a plugin. The Corefile configures which plugins are loaded and how they are set up.

History: SkyDNS to CoreDNS

SkyDNS (the original) used etcd for storing DNS records and answered queries by walking the etcd tree. It was slow for large clusters and not very extensible. CoreDNS took a different approach: a minimal DNS server with a plugin chain, where Kubernetes integration is just one of many plugins. The kubernetes plugin watches the API server for Services and their endpoints (EndpointSlice objects in current releases) and synthesizes DNS records on the fly. No etcd dependency.

Corefile Configuration

# Corefile in the coredns ConfigMap (kubeadm's default uses "pods insecure"; "pods verified" is shown here)
$ kubectl get configmap coredns -n kube-system -o yaml
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods verified
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf  # forward unknown queries to upstream DNS
        cache 30
        loop
        reload
        loadbalance
    }

# What each plugin does (CoreDNS runs plugins in a fixed compile-time order,
# not in the order they appear in the Corefile):
#  1. errors      → log DNS errors
#  2. health      → liveness endpoint at :8080/health (lameduck delays shutdown by 5s)
#  3. ready       → readiness endpoint at :8181/ready
#  4. kubernetes  → synthesize records for Services + headless Pods
#  5. prometheus  → metrics at :9153/metrics
#  6. forward     → forward unknowns to the resolvers in /etc/resolv.conf (node's DNS)
#  7. cache       → cache responses, capping TTLs at 30s
#  8. loop        → detect forwarding loops and halt CoreDNS if one is found
#  9. reload      → watch for Corefile changes and reload
# 10. loadbalance → shuffle the order of A/AAAA records in each response

# Understanding the "pods" option (controls names like 10-244-1-37.default.pod.cluster.local):
# "pods verified" answers only if a pod with that IP exists in that namespace.
# "pods insecure" answers for any IP encoded in the name without checking (kube-dns compatibility).
# "pods disabled" does not answer pod queries at all (the plugin's own default).
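The three pods modes differ only in whether the IP encoded in the query name is checked against real Pods. The sketch below models that decision for IPv4 names. `answer_pod_query` and the `pods_by_namespace` mapping are hypothetical, and the real plugin works from its API-server watch cache rather than a dict.

```python
def answer_pod_query(qname, mode, pods_by_namespace, zone="cluster.local"):
    """Return the A-record IP for <a-b-c-d>.<ns>.pod.<zone>, or None for no answer."""
    suffix = ".pod." + zone
    name = qname.rstrip(".")
    if mode == "disabled" or not name.endswith(suffix):
        return None
    dashed_ip, _, namespace = name[: -len(suffix)].partition(".")
    ip = dashed_ip.replace("-", ".")  # IPv4 only in this sketch
    if mode == "insecure":
        return ip  # echo the IP without checking it exists
    if mode == "verified":
        return ip if ip in pods_by_namespace.get(namespace, set()) else None
    raise ValueError(f"unknown pods mode: {mode}")


PODS = {"default": {"10.244.1.37"}}
for mode in ("verified", "insecure", "disabled"):
    print(mode, answer_pod_query("10-244-9-9.default.pod.cluster.local", mode, PODS))
```

In insecure mode any well-formed name resolves, so a client can obtain an answer for an address that no Pod owns. Verified mode closes that gap at the cost of watching all Pods.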

Scaling CoreDNS

# CoreDNS is deployed as a Deployment with 2+ replicas (by default)
$ kubectl get deployment -n kube-system coredns
NAME      READY   UP-TO-DATE   AVAILABLE
coredns   2/2     2            2

# Each CoreDNS instance can handle ~20K queries/second.
# For large clusters, increase replicas and/or resources.
# The Service for kube-dns is a ClusterIP with 2 endpoints (round-robin).

# HPA for CoreDNS based on metrics (requires metrics-server):
$ kubectl autoscale deployment coredns -n kube-system \
    --cpu-percent=70 --min=2 --max=10

# Memory-based HPA (for cache pressure):
# CoreDNS uses a fixed-size cache. High memory usage indicates cache churn.
# Set CoreDNS resource limits explicitly:
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi

IPv6 and Dual-Stack

Kubernetes supports IPv6-only and IPv4/IPv6 dual-stack clusters. Dual-stack (beta and enabled by default in 1.21, GA in 1.23) is increasingly common as organizations plan for IPv4 exhaustion.

Dual-Stack Configuration

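# kube-apiserver flags for dual-stack
kube-apiserver \
    --service-cluster-ip-range=10.96.0.0/12,2001:db8::/108
# kube-controller-manager additionally needs --cluster-cidr with one IPv4 and one IPv6 range

# kubelet for dual-stack pod allocation
# (--node-ip may be omitted to let kubelet auto-detect; --cluster-dns lists the kube-dns service IPs)
kubelet \
    --node-ip=NODE_IPV4,NODE_IPV6 \
    --cluster-dns=10.96.0.10,2001:db8::a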

# Check that a node has both IPs assigned:
$ kubectl get node node-1 -o custom-columns='NAME:.metadata.name,INTERNAL_IPS:.status.addresses[?(@.type=="InternalIP")].address'
NAME    INTERNAL_IPS
node-1  192.168.1.101,2001:db8::c0a8:165

# Check a pod with dual-stack (status.podIP holds only the primary IP; podIPs holds both):
$ kubectl get pod web-0 -o custom-columns='NAME:.metadata.name,POD_IPS:.status.podIPs[*].ip'
NAME    POD_IPS
web-0   10.244.1.37,2001:db8::a:f444:125

Dual-Stack Services

# Service with dual-stack ClusterIP
spec:
  ipFamilies:
    - IPv4
    - IPv6
  ipFamilyPolicy: RequireDualStack   # or PreferDualStack, SingleStack
# With RequireDualStack, the Service gets both a v4 and v6 ClusterIP.
# DNS returns both A and AAAA records.

$ kubectl get svc api -o custom-columns='NAME:.metadata.name,CLUSTER_IPS:.spec.clusterIPs[*]'
NAME   CLUSTER_IPS
api    10.0.5.6,2001:db8::6

$ kubectl exec dnsutils -- nslookup api.default.svc.cluster.local
Name: api.default.svc.cluster.local
Address: 10.0.5.6
Address: 2001:db8::6

# A client picks which IP family to use based on its own configuration.
# A dual-stack client typically prefers IPv6 (RFC 6724).
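When a name resolves to both families, tooling often needs to separate them, and it can help to see why a /108 is a typical IPv6 Service range. The sketch below uses the standard ipaddress module. `split_families` and `range_size` are hypothetical helper names.

```python
import ipaddress


def split_families(addresses):
    """Partition textual IPs into (ipv4_list, ipv6_list), keeping input order."""
    v4, v6 = [], []
    for text in addresses:
        target = v4 if ipaddress.ip_address(text).version == 4 else v6
        target.append(text)
    return v4, v6


def range_size(cidr):
    """Number of addresses in a CIDR block."""
    return ipaddress.ip_network(cidr).num_addresses


print(split_families(["10.0.5.6", "2001:db8::6"]))
print(range_size("10.96.0.0/12"), range_size("2001:db8::/108"))
```

Both Service ranges from the flags above hold 2^20 addresses: a /108 leaves the same 20 host bits in IPv6 that a /12 leaves in IPv4.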

hostPort and hostNetwork

Kubernetes provides two ways to expose a pod directly on the host's network: hostPort (a container-level port mapping on the host interface) and hostNetwork: true (the pod joins the host's network namespace directly). Both sacrifice network isolation and should be used sparingly.

hostPort

hostPort maps a container port to a host port. The pod is still in its own network namespace, but traffic to hostIP:hostPort on the host is DNATted to the pod's IP. This requires the CNI portmap plugin to be in the CNI chain (or Multus for delegation). Only one pod per host can use a given hostPort.

# hostPort pod spec:
spec:
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80
          hostPort: 80        # host IP:80 → podIP:80
          protocol: TCP       # default

# Roughly equivalent iptables rule (portmap adds it to a per-pod CNI-DN-<hash> chain
# that is reached from CNI-HOSTPORT-DNAT):
$ iptables -t nat -A CNI-DN-<hash> \
    -p tcp --dport 80 -j DNAT --to-destination POD_IP:80

# Limitations:
# - Port conflicts: another pod on the same host can't also use hostPort 80
# - Only supports TCP/UDP/SCTP (not arbitrary protocols)
# - hostPort in a Deployment means only one replica per node (use DaemonSet)
# - Doesn't work with Kind (Kubernetes in Docker) without special config

hostNetwork

hostNetwork: true puts the pod in the host's network namespace directly. The pod sees all host interfaces, binds to host ports directly, and uses the host's routing table. This is the "host" network mode from Docker. Useful for system daemons that need to bind to specific host interfaces (notably kube-proxy itself).

# kube-proxy uses hostNetwork (not hostPort): it programs the node's own packet-filtering
# rules and serves healthz on port 10256 and metrics on port 10249 directly on the host
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # hostNetwork pods default to host DNS
  containers:
    - name: kube-proxy
      image: registry.k8s.io/kube-proxy:v1.28.0
      # The process binds directly to eth0's IP

# A pod using hostNetwork sees the host's routing table:
$ kubectl run test --rm -it --image=busybox --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"hostNetwork":true}}' \
    -- /bin/sh -c 'ip route'
default via 192.168.1.1 dev eth0     # host's default route
10.244.0.0/16 via 192.168.1.1 dev eth0  # cluster pod network (via CNI)
# These are host routes, not pod-specific routes

# ⚠️ hostNetwork pods share the host's DNS resolver (/etc/resolv.conf)
# With default dnsPolicy: ClusterFirst, hostNetwork pods DON'T use cluster DNS.
# They use whatever DNS the host uses (typically from /etc/resolv.conf on the node).
# Fix: set dnsPolicy: ClusterFirstWithHostNet to force cluster DNS.

# Security implication: hostNetwork pods can bind to any host port,
# including privileged ports (<1024) if the container runs as root.

⚠️ Gotcha: dnsPolicy with hostNetwork. The default dnsPolicy for every pod is ClusterFirst, but for a pod with hostNetwork: true that setting silently falls back to Default (the node's resolv.conf), so the pod cannot resolve cluster Services by name. Kubernetes does not switch the policy for you: set dnsPolicy: ClusterFirstWithHostNet explicitly on hostNetwork pods that need cluster DNS.
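The interaction between dnsPolicy and hostNetwork is a small decision table. The sketch below encodes it. `effective_dns` is a hypothetical helper whose return values are just labels ("cluster", "node", "custom") for which resolver configuration the pod ends up with.

```python
def effective_dns(dns_policy="ClusterFirst", host_network=False):
    """Return which resolver config a pod gets: 'cluster', 'node', or 'custom'."""
    if dns_policy == "None":
        return "custom"  # resolv.conf comes entirely from the pod's dnsConfig
    if dns_policy == "Default":
        return "node"  # inherit the node's resolv.conf
    if dns_policy == "ClusterFirstWithHostNet":
        return "cluster"
    if dns_policy == "ClusterFirst":
        # On hostNetwork pods, ClusterFirst falls back to the node's resolver.
        return "node" if host_network else "cluster"
    raise ValueError(f"unknown dnsPolicy: {dns_policy}")


for policy, host_net in [
    ("ClusterFirst", False),
    ("ClusterFirst", True),
    ("ClusterFirstWithHostNet", True),
]:
    print(policy, host_net, "->", effective_dns(policy, host_net))
```

The middle case is the one that surprises people: the manifest says nothing unusual, yet Service names stop resolving once hostNetwork is turned on.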

Egress Policies and Egress Proxies

By default, pods can egress to any destination (0.0.0.0/0) unless a NetworkPolicy restricts them. For many organizations, this is insufficient — they need centralized egress control (all outbound traffic through a proxy or firewall). There are several patterns for this.

Egress Gateway Pattern

An egress gateway is a pod (or set of pods) that all egress traffic must route through. It can be an explicit route in the CNI (Calico's EgressGateway) or a kube-proxy + iptables rule that forces all egress to a specific pod.

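# Illustrative egress gateway resource. The schema is simplified: the real
# kind and fields depend on the CNI and its version (Calico's egress gateway
# ships with Calico Enterprise/Cloud), so check your CNI's reference docs.
apiVersion: crd.projectcalico.org/v1
kind: EgressGateway
metadata:
  name: egress-gw
  namespace: kube-system
spec:
  nodeSelector: egress-node == "true"
  destination:
    CIDR: 0.0.0.0/0   # intercept all egress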

---
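# Restrict egress with a NetworkPolicy: pods in namespace "production" may only
# open TCP 443 connections to the egress gateway pods (label app: egress-gw).
# NetworkPolicy only filters; it performs no NAT or redirection. This policy
# also blocks DNS unless you add a rule allowing port 53 to the cluster DNS pods.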
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: force-egress-via-gateway
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: egress-gw
      ports:
        - protocol: TCP
          port: 443

# Limitation: NetworkPolicy does not steer traffic. Applications must be
# configured to send requests to the gateway themselves (for example via
# HTTPS_PROXY); transparently redirecting all egress requires CNI support.
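
One example of CNI-level support is Cilium's CiliumEgressGatewayPolicy resource. The sketch below assumes Cilium with its egress gateway feature enabled; the policy name, the node label egress-node, and the egress IP are hypothetical, and field names should be checked against the Cilium version you run.

```yaml
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: production-egress            # hypothetical name
spec:
  # Which pods are subject to the policy: here, everything in "production".
  selectors:
    - podSelector:
        matchLabels:
          io.kubernetes.pod.namespace: production
  # Which destinations are steered through the gateway.
  destinationCIDRs:
    - 0.0.0.0/0
  # Which node acts as the gateway and which source IP it uses.
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-node: "true"          # assumption: label on the gateway node
    egressIP: 192.0.2.50             # hypothetical address on that node
```

Unlike the NetworkPolicy approach, the redirection and source NAT happen in the dataplane, so applications need no proxy configuration.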

Egress Proxy via Sidecar

A more common pattern: route all external traffic through a transparent proxy (Envoy, Squid, etc.) running as a sidecar or per-namespace proxy. Applications don't know their traffic is being proxied (unless the proxy does TLS interception).

# Per-namespace egress proxy using an egress-sidecar pattern:
# 1. An "egress-proxy" Deployment runs behind a normal ClusterIP Service.
# 2. All pods in the namespace have an init container that sets up
#    iptables to redirect external traffic to the proxy.
# 3. The proxy forwards to the actual destination.

# egress-proxy deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: egress-proxy
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: egress-proxy
  template:
    metadata:
      labels:
        app: egress-proxy
    spec:
      containers:
        - name: squid
          image: sameersbn/squid:3.5.27
          ports:
            - containerPort: 3128

---
# The init container (runs before app container) sets up redirect:
# NOTE: Programming iptables requires the NET_ADMIN capability on the init
# container, which typically also has to run as root.
initContainers:
  - name: redirect-egress
    image: gcr.io/google-containers/iptables-whitelist:latest   # placeholder: any image that ships iptables works
    securityContext:
      capabilities:
        add:
          - NET_ADMIN
    command:
      - sh
      - -c
      - |
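        # Runs in the pod's netns: OUTPUT covers locally generated traffic,
        # so no PREROUTING rule is needed. Skip loopback first.
        iptables -t nat -A OUTPUT -o lo -j RETURN
        # 10.0.5.100 must be the egress-proxy Service's ClusterIP. The proxy
        # never sees the original destination IP and must recover the target
        # from the HTTP Host header or TLS SNI.
        iptables -t nat -A OUTPUT -p tcp ! -d 10.0.0.0/8 \
          -j DNAT --to-destination 10.0.5.100:3128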
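
Step 1 of the pattern relies on a ClusterIP Service in front of the proxy pods, which the manifests above do not show. A minimal sketch follows; pinning clusterIP to 10.0.5.100 is an assumption made so that it matches the DNAT target used by the init container, and the address must fall inside your cluster's Service CIDR.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: egress-proxy
  namespace: production
spec:
  type: ClusterIP
  clusterIP: 10.0.5.100    # assumption: matches the DNAT target; must be in the Service CIDR
  selector:
    app: egress-proxy      # matches the Deployment's pod labels
  ports:
    - name: proxy
      protocol: TCP
      port: 3128
      targetPort: 3128
```

If the Service is left to pick its own ClusterIP instead, the init container has to look that address up rather than hard-code it.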

Cloud Provider Egress Controls

# AWS: NAT Gateway for controlled egress
# All private subnets route 0.0.0.0/0 through a NAT Gateway (managed by AWS).
# The NAT Gateway has an Elastic IP and is the only way out.
# NAT Gateways do not support security groups: restrict destinations with
# security groups on the nodes/pods, subnet network ACLs, or AWS Network Firewall.

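# AWS VPC endpoints for AWS service access (no internet egress needed)
# For pods that need to access S3, DynamoDB, etc. without going to the internet:
# create VPC endpoints. S3 and DynamoDB offer gateway endpoints (route-table
# entries); most other services use PrivateLink interface endpoints (ENIs in your subnets).
# With private DNS enabled on an interface endpoint, the regional hostname
# (<service>.<region>.amazonaws.com) resolves to the endpoint ENI IP.
# Traffic stays on the AWS network. This works with any CNI, not only VPC CNI.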

# GCP: Cloud NAT
# Cloud Router + Cloud NAT provides egress for GKE nodes that have no external
# IP (private clusters). Their outbound traffic leaves through Cloud NAT from
# one IP or a reserved range.
# Useful when you need a predictable source IP for firewall rules.

# Azure: NAT Gateway
# Similar to AWS/GCP. Associate a NAT Gateway with the subnet AKS uses.
# All egress from pods will use the NAT Gateway's public IP.

Tradeoffs

Strengths

Pluggable — swap CNI, swap kube-proxy, swap Ingress controller independently
eBPF dataplanes resolve Services with hash-map lookups, so per-packet cost stays roughly flat as Services and rules grow
Gateway API finally gives portable L7 routing without annotation soup
NetworkPolicy is declarative, GitOps-friendly firewall
CoreDNS's plugin architecture is extensible (whoami, geoip, etc.)
Headless Services enable stateful workload discovery without service mesh
Dual-stack IPv4/IPv6 lets you adopt IPv6 gradually: existing single-stack Services and pods keep working unchanged

Sharp Edges

iptables kube-proxy is O(N) — large clusters need IPVS or eBPF
NetworkPolicy is only enforced if the CNI implements it — Flannel users get no firewall
Overlay networks add encapsulation overhead and MTU pitfalls
Multi-cluster networking (flat IP across clusters) is hard, not standardized
hostPort conflicts prevent multiple replicas on the same node
hostNetwork pods bypass NetworkPolicy and share the host's DNS context
Sidecar-based service meshes add per-request latency and memory overhead
Dual-stack on CNIs that lack IPv6 support (older Flannel releases, for example) falls back to v4-only, silently dropping v6
Service IP exhaustion is possible when the Service CIDR is small: a /20 holds only 4,096 ClusterIPs, while kubeadm's default /12 holds about a million
Cluster DNS (CoreDNS by default) is scoped to a single cluster: there's no built-in multi-cluster federation

Frequently Asked Questions

What does CNI actually do?

CNI (Container Network Interface) is a tiny spec: when a Pod is created, the container runtime (containerd or CRI-O, acting on kubelet's request) calls a CNI plugin (a binary in /opt/cni/bin) with a JSON config and the path to the Pod's network namespace. The plugin's job is to assign an IP, create the network interface inside the namespace (typically one end of a veth pair), and program any routing/firewall rules needed. CNI doesn't say HOW you do it: Calico uses BGP, Flannel uses VXLAN, Cilium uses eBPF. The interface is just: 'set up the network for this namespace, and tear it down when the pod dies.'

What's the difference between iptables, IPVS, and nftables modes for kube-proxy?

kube-proxy implements Service load balancing by programming kernel rules. iptables mode (the default on Linux) creates a chain per Service plus rules per endpoint; packets scan the Service rules linearly, which is fine up to a few thousand Services, after which the scan and rule-sync time dominate. IPVS mode uses the kernel's IPVS subsystem (designed for load balancing) with hash-table lookups, so it is O(1) and scales to tens of thousands of Services. nftables mode (alpha in 1.29, beta in 1.31, GA in 1.33) is the modern replacement for iptables mode: it dispatches Services through verdict maps, giving roughly O(1) lookups and cheaper incremental updates. Large production clusters typically run IPVS or nftables, or have switched to Cilium's eBPF dataplane (which replaces kube-proxy entirely).
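
The mode is selected in kube-proxy's configuration file. A minimal sketch, assuming the v1alpha1 KubeProxyConfiguration format; the scheduler value is just one illustrative choice.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"          # alternatives: "iptables", "nftables"
ipvs:
  scheduler: "rr"     # round-robin; only read when mode is "ipvs"
```

On kubeadm clusters this configuration typically lives in the kube-proxy ConfigMap in kube-system, and the kube-proxy pods must be restarted to pick up a change.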

How does Cilium's eBPF dataplane skip kube-proxy?

Cilium attaches eBPF programs to socket creation, network interfaces, and tc hooks. Service translation happens at the socket level: when a Pod connects to a Service ClusterIP, eBPF rewrites the destination to a backend Pod IP before the packet ever leaves the Pod's stack. No iptables, no kube-proxy daemon, no DNAT chains. NetworkPolicy enforcement also moves to eBPF, evaluated per-packet at line rate. The result: lower latency, much better scaling on large clusters (rule count doesn't matter), and observability (Hubble) for free.
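
Running without kube-proxy is switched on through Cilium's Helm values. The fragment below is a sketch, not a complete values file: the API server address is hypothetical, and the accepted values for kubeProxyReplacement have changed between Cilium releases, so check the chart for your version.

```yaml
# values.yaml fragment for the Cilium Helm chart
kubeProxyReplacement: true      # older releases used the string "strict"
# With no kube-proxy, Cilium cannot reach the API server via its ClusterIP,
# so it needs the real address.
k8sServiceHost: 192.0.2.10      # hypothetical API server address
k8sServicePort: 6443
hubble:
  enabled: true                 # flow observability
  relay:
    enabled: true
```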

Ingress vs Gateway API — which to use?

Ingress is the original (2015) API for HTTP routing — limited to host/path rules, every controller adds non-portable annotations for everything else. Gateway API (GA in 2023) replaces it with a richer, role-oriented design: GatewayClass (the controller offering), Gateway (the listener), HTTPRoute/TLSRoute/TCPRoute (the rules). Gateway API supports header-based matching, traffic splitting, request modification, cross-namespace routing — all without annotations. New deployments should use Gateway API; existing Ingress configs still work but won't see new features.
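
To make the role split and traffic splitting concrete, here is a small sketch of a Gateway owned by a platform team and an HTTPRoute owned by an application team in another namespace. All names (example-class, web-gateway, infra, shop, web-v1, web-v2) are hypothetical; the GatewayClass comes from whichever controller you install.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: web-gateway
  namespace: infra
spec:
  gatewayClassName: example-class   # hypothetical; provided by your controller
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All                 # permit routes from other namespaces
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-canary
  namespace: shop
spec:
  parentRefs:
    - name: web-gateway
      namespace: infra              # cross-namespace attachment
  hostnames:
    - shop.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:                  # 90/10 traffic split, no annotations needed
        - name: web-v1
          port: 8080
          weight: 90
        - name: web-v2
          port: 8080
          weight: 10
```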

What does NetworkPolicy enforce?

NetworkPolicy is a Pod-selector-based firewall expressed in Kubernetes objects. You select Pods (matchLabels) and declare ingress/egress rules: which other Pods or CIDR blocks can talk to them on which ports. Critical caveat: NetworkPolicy is only enforced if your CNI implements it. Calico, Cilium, and Antrea do; Flannel on its own does not. Without CNI enforcement the API server still accepts NetworkPolicy objects, but they have no effect and nothing reports that they are being ignored. The reliable check is empirical: apply a deny policy and try a connection that should be blocked.
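
A quick way to test enforcement is a default-deny ingress policy in a scratch namespace. This is a minimal sketch; the namespace policy-test is hypothetical and should contain at least one pod listening on a port.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: policy-test     # hypothetical scratch namespace
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress                # no ingress rules listed, so all ingress is denied
```

With this applied, a request from any other pod to a pod in policy-test should time out. If it still succeeds, the CNI is not enforcing NetworkPolicy.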

How does dual-stack work?

Dual-stack (beta in 1.21, GA in 1.23) runs IPv4 and IPv6 simultaneously. Each Pod gets one IP from each family; Services set ipFamilyPolicy to SingleStack (one family), PreferDualStack (both families when the cluster supports them, otherwise one), or RequireDualStack (both, or the Service is rejected). The Pod's network namespace has both addresses on its interface, and the kernel routes by family. The ipFamilies field picks the families and their order; the first entry determines the primary ClusterIP. Most cloud CNIs and Cilium handle this transparently; some legacy CNIs don't, in which case Services fall back to v4-only.
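
As an illustration of the two Service fields, here is a sketch of a Service that asks for both families with IPv6 first. It assumes a cluster already configured for dual-stack; the Service name, selector, and ports are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                  # hypothetical
spec:
  ipFamilyPolicy: PreferDualStack
  ipFamilies:                # order matters: the first family supplies .spec.clusterIP
    - IPv6
    - IPv4
  selector:
    app: web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
```

After creation, .spec.clusterIPs lists the addresses that were actually allocated, which is the easiest way to see whether the Service ended up dual-stack.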