Networking
CNI, Services, kube-proxy, and the eBPF Dataplane
Kubernetes' networking model is deceptively simple: every Pod gets a routable IP, and every Pod can reach every other Pod without NAT. How that is implemented is delegated entirely to the CNI plugin you install: Calico typically uses BGP routing, Flannel a VXLAN overlay, Cilium eBPF, and the AWS VPC CNI assigns Pods native VPC addresses from ENIs attached to the node. The cluster operator picks a CNI and trades off cost, performance, and policy features.
On top of the Pod network sit Services (virtual IPs that load-balance to a set of Pods), implemented by kube-proxy via iptables, IPVS, or nftables, or replaced entirely by eBPF in modern stacks. NetworkPolicy adds firewall semantics; the Gateway API adds rich L7 routing. Each layer is independently swappable.
The Stack
Key Numbers
Container Networking Fundamentals
Before diving into Kubernetes, understand what the Linux kernel provides. Docker and container runtimes use four network modes — the same primitives K8s CNI builds on top of.
The Four Container Network Modes
Docker's --network flag maps directly to Linux network namespace plumbing.
Kubernetes effectively uses the container mode for every Pod (the container runtime
creates the namespace on kubelet's behalf), but understanding all four modes clarifies what a CNI plugin actually does.
bridge: each container gets its own network namespace, connected through a veth pair to the
docker0 bridge on the host. Containers get IPs from Docker's built-in IPAM on the
bridge subnet. They can reach the host and external
network via NAT (MASQUERADE). This is Docker's default.
Use case: Dev/test on single hosts, legacy setups.
host: the container shares the host's network namespace, so there is no isolation and no NAT.
Use case: System daemons, performance-critical networking where you can't afford NAT overhead.
none: the container gets a network namespace with only a loopback interface and no external connectivity.
Use case: Offline batch workloads, air-gapped workloads.
container: the container joins another container's network namespace and shares the same
lo, same interfaces, same port bindings. Kubernetes uses
this: the runtime creates a network namespace per Pod, then runs the Pod's containers inside it.
Use case: Kubernetes Pods (all containers in a Pod share netns), logging sidecars that need to intercept the main container's traffic.
veth Pair — The Fundamental Pipe
Virtual Ethernet (veth) pairs are the building block for container networking. A veth
pair is a point-to-point link — like a pipe with two ends. Whatever goes in one end
comes out the other. When a CNI plugin "connects" a container to the host network,
it creates a veth pair: one end is moved into the container's
network namespace and named eth0, the other end stays on the host (often named vethXXXX).
The host end gets plugged into a bridge or routed directly.
# Inspect a running container's network namespace
$ docker inspect my-container --format '{{.NetworkSettings.SandboxKey}}'
/var/run/docker/netns/default
# Enter the container's namespace to see eth0
$ nsenter --net=/var/run/docker/netns/default ip addr
1: lo: <LOOPBACK,UP> mtu 65536 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
8: eth0@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
inet6 fe80::42:acff:fe11:2/64 scope link tentative
# On the host, the other end of the veth pair shows up as vethXXX
$ ip addr | grep veth
26: vethb1e1e86@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420
link/ether 3a:92:13:21:3e:9f brd ff:ff:ff:ff:ff:ff
master docker0
inet6 fe80::3892:13ff:fe21:3e9f/64
# The bridge connects them
$ brctl show docker0
bridge name bridge id STP enabled interfaces
docker0 8000.0242ac110002 no vethb1e1e86
veth2a3f4c91
veth9d1e7f32
# MTU consideration: veth pairs do not negotiate their MTU; the runtime or CNI
# plugin sets it. Flannel's VXLAN overlay adds 50 bytes of overhead, so on a
# 1500-byte underlay the container eth0 MTU must be 1450 or lower (the 1420
# shown above leaves room for a larger tunnel header). A mismatched MTU
# is a common source of mysterious connectivity issues.
$ ip link show docker0
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
Network Namespaces (netns)
Network namespaces are the Linux kernel's fundamental network isolation primitive. They partition the networking stack: interfaces, routes, iptables rules, ARP tables, and sockets are all per-namespace. When Kubernetes creates a Pod, the container runtime creates a new network namespace for it, the CNI plugin wires that namespace into the Pod network, and the Pod's containers all run inside it.
# Create a network namespace manually (like kubelet does)
ip netns add myns
# Create a veth pair
ip link add veth-host type veth peer name veth-pod
# Move one end into the namespace
ip link set veth-pod netns myns
# Configure the namespace's end
ip netns exec myns ip addr add 10.244.1.10/24 dev veth-pod
ip netns exec myns ip link set veth-pod up
ip netns exec myns ip link set lo up
# Configure the host end
ip addr add 10.244.1.1/24 dev veth-host
ip link set veth-host up
# From the namespace, you can now reach the host at 10.244.1.1
ip netns exec myns ping -c 1 10.244.1.1
PING 10.244.1.1 (10.244.1.1) 56(84) bytes of data.
64 bytes from 10.244.1.1: icmp_seq=1 ttl=64 time=0.031 ms
# List all network namespaces on the host
ip netns list
ls /var/run/netns/ # alternative view
# Delete when done
ip netns delete myns
Always stop containers before manually deleting netns entries. Kubernetes handles this
correctly via the CNI DEL command.
The Kubernetes Networking Model
Kubernetes codifies four strict rules that every CNI plugin must satisfy. These rules are why pod-to-pod communication "just works" across nodes — the model removes ambiguity.
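Every Pod gets its own IP address. Ports never have to be remapped on the host, which avoids the collisions that Docker's -p publish model creates.
Containers in the same Pod share a network namespace and talk to each other over
localhost. Containers in different Pods on the same node go through
the host's bridge or routing stack, never localhost.
Every Pod can reach every other Pod on any node without NAT. A Pod at
10.244.1.5 on node A must be reachable
as 10.244.1.5 from node B. The CNI plugin is responsible for
making this routing work: via native L3 routing (Calico in pure BGP mode),
an overlay (Flannel VXLAN), or eBPF-programmed routing (Cilium). No NAT at the pod IP layer.
The IP a Pod sees for itself is the IP everyone else sees. The Pod's
eth0 reports 10.244.1.5, and that same address
is what node B uses to reach it. There is no "internal NAT" hiding the Pod's
true IP. This matters for mTLS (the certificate must match the IP DNS resolves to).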
Why IP-Per-Pod Matters
The IP-per-Pod model is what enables Kubernetes' compositionality. Because every Pod has a unique IP, you can place any workload on any node without coordinating port numbers. A Service wrapping N Pods doesn't care if those Pods are on 1 node or 50 nodes — the Service IP routes to them identically. Compare this to the old model where you'd publish port 8080 on every node and hope containers didn't collide.
# Confirm the IP-per-Pod model with a concrete example
$ kubectl get pod -o wide
NAME READY STATUS IP NODE NOMINATED NODE
web-0 1/1 Running 10.244.1.37 node-1 <none>
web-1 1/1 Running 10.244.2.19 node-2 <none>
api-0 1/1 Running 10.244.3.8 node-3 <none>
# web-0 on node-1 is reachable from api-0 on node-3 as 10.244.1.37
# (no NAT, no port mapping, just routing)
$ kubectl exec api-0 -- curl -s --connect-timeout 2 http://10.244.1.37:8080/health
{"status":"ok"}
# The service IP (10.0.5.6) is a virtual IP that kube-proxy rewrites
# to one of those backend Pod IPs
$ kubectl get svc web
NAME TYPE CLUSTER-IP PORT(S) SELECTOR
web ClusterIP 10.0.5.6 8080/TCP app=web
# DNS maps the service name to the ClusterIP, not the Pod IPs
$ kubectl exec api-0 -- nslookup web.default.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: web.default.svc.cluster.local
Address: 10.0.5.6
CNI Plugins
The Container Network Interface (CNI) is a specification maintained as a Cloud Native Computing Foundation project. It defines how a container runtime (containerd, CRI-O) talks to a network plugin when containers are created and deleted. The spec is intentionally minimal: it specifies only the interface, not the implementation.
The CNI Specification
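CNI's core operations are ADD (wire a container into a network) and
DEL (tear it down). An optional CHECK operation
(validate setup) was added later, along with VERSION for capability discovery. There is no "GET": state is not queryable through
the spec. Plugins are binaries executed by the container runtime: the operation and its parameters arrive in environment variables (CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME), and the network configuration arrives as JSON on stdin.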
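To make the calling convention concrete, here is a minimal sketch of a plugin's dispatch logic in Python. It assumes the operation arrives in CNI_COMMAND and the network config on stdin, as the spec describes; the `handle` function, the hard-coded address, and the version list are illustrative placeholders, not how any real plugin allocates IPs.

```python
import json
import os
import sys

SUPPORTED_VERSIONS = ["0.4.0", "1.0.0"]  # illustrative list


def handle(command, config, env):
    """Dispatch one CNI operation and return a JSON-serializable result."""
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": SUPPORTED_VERSIONS}
    if command == "ADD":
        # A real plugin would create the interface and call its IPAM plugin here.
        # The address below is a hard-coded placeholder, not a real allocation.
        return {
            "cniVersion": config["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.244.1.37/24", "gateway": "10.244.1.1", "interface": 0}],
        }
    if command in ("DEL", "CHECK"):
        return {}  # nothing to report; exit code 0 signals success
    raise ValueError(f"unsupported CNI_COMMAND: {command!r}")


def main():
    """Entry point when a runtime executes this file as a plugin binary."""
    raw = sys.stdin.read()
    config = json.loads(raw) if raw.strip() else {}
    result = handle(os.environ["CNI_COMMAND"], config, os.environ)
    if result:
        json.dump(result, sys.stdout)
    return 0


# Only act as a plugin when a runtime has actually set CNI_COMMAND.
if __name__ == "__main__" and "CNI_COMMAND" in os.environ:
    sys.exit(main())
```

Keeping `handle` free of I/O makes the dispatch testable without a container runtime; a production plugin would also report failures as a structured JSON error object.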
# The CNI ADD call: the container runtime makes this call for every Pod
# Parameters are passed as environment variables; the network config JSON is
# written to the plugin's stdin; the result comes back on stdout
# Example ADD invocation (environment):
CNI_COMMAND=ADD
CNI_CONTAINERID=abc123def456
CNI_NETNS=/var/run/netns/cni-abc123def456
CNI_IFNAME=eth0          # what to name the interface inside the container
CNI_PATH=/opt/cni/bin    # where plugin binaries are looked up
# Network config on stdin:
{
  "cniVersion": "1.0.0",
  "name": "k8s-pod-network",
  "type": "calico",      # which plugin binary to call
  "ipam": { "type": "calico-ipam" },
  "dns": {
    "nameservers": ["10.96.0.10"],
    "search": ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
  }
}
# The plugin returns:
{
  "cniVersion": "1.0.0",
  "interfaces": [
    { "name": "eth0", "sandbox": "/var/run/netns/cni-abc123def456" }
  ],
  "ips": [
    {
      "address": "10.244.1.37/24",
      "gateway": "10.244.1.1",
      "interface": 0
    }
  ],
  "routes": [
    { "dst": "0.0.0.0/0", "gw": "10.244.1.1" }
  ],
  "dns": {
    "nameservers": ["10.96.0.10"],
    "search": ["default.svc.cluster.local"]
  }
}
# The container runtime looks up plugin binaries in /opt/cni/bin/:
$ ls /opt/cni/bin/
bandwidth calico cilium-cni flannel host-local loopback portmap tuning
# Plugins register by dropping config files into /etc/cni/net.d/
# The first file lexically (after sorting) is the active plugin
$ cat /etc/cni/net.d/10-calico.conflist
{
"name": "k8s-pod-network",
"cniVersion": "1.0.0",
"plugins": [
{
"type": "calico",
"ipam": { "type": "calico-ipam" },
"policy": { "type": "k8s" }
},
{
"type": "portmap", # CNI chaining: portmap for HostPort
"capabilities": { "portMappings": true }
},
{
"type": "bandwidth", # CNI chaining: bandwidth limiting
"capabilities": { "bandwidth": true }
}
]
}
IP Address Management (IPAM)
CNI separates the what (IP allocation) from the how (interface creation).
IPAM plugins handle just the IP/subnet allocation. The CNI reference plugins include
host-local (allocates from a configured subnet per node) and
dhcp (obtains a DHCP lease for the interface). Calico, Cilium, and AWS VPC CNI
ship their own IPAM plugins that integrate with their control plane.
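The per-node allocation idea can be sketched in a few lines of Python. This is a simplified model in the spirit of host-local, not its actual algorithm: the class name, the in-memory dictionary (host-local persists allocations on disk), and the lowest-free-address strategy are all assumptions made for illustration.

```python
import ipaddress


class HostLocalAllocator:
    """Toy per-node IPAM: hands out addresses from a fixed range."""

    def __init__(self, subnet, range_start, range_end):
        self.network = ipaddress.ip_network(subnet)
        self.start = ipaddress.ip_address(range_start)
        self.end = ipaddress.ip_address(range_end)
        self.allocated = {}  # ip -> container id (in memory only)

    def allocate(self, container_id):
        """Return the lowest free address in CIDR notation."""
        ip = self.start
        while ip <= self.end:
            if ip not in self.allocated:
                self.allocated[ip] = container_id
                return f"{ip}/{self.network.prefixlen}"
            ip += 1
        raise RuntimeError("address range exhausted")

    def release(self, container_id):
        """Free every address held by the container (what DEL would trigger)."""
        for ip, owner in list(self.allocated.items()):
            if owner == container_id:
                del self.allocated[ip]


ipam = HostLocalAllocator("10.244.1.0/24", "10.244.1.10", "10.244.1.250")
first = ipam.allocate("abc123def456")
```

Because each node allocates only from its own subnet, no cross-node coordination is needed, which is the main appeal of this style of IPAM.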
# host-local IPAM: simplest, per-node subnet allocation
# The runtime creates the Pod sandbox and calls CNI ADD; the main plugin
# then invokes host-local, which records each allocation as a file under
# /var/lib/cni/networks/<network>.
# IPAM config for host-local:
{
"type": "host-local",
"subnet": "10.244.1.0/24", # range to allocate from
"rangeStart": "10.244.1.10", # skip first few (gateway, DNS, etc.)
"rangeEnd": "10.244.1.250",
"gateway": "10.244.1.1",
"routes": [ { "dst": "0.0.0.0/0", "gw": "10.244.1.1" } ]
}
# Calico IPAM: stores allocations in etcd or through the Kubernetes API
# Each node claims /26 blocks from the IP pool's CIDR. Calico tracks pools and
# blocks in its own resources (IPPool, IPAMBlock); WorkloadEndpoint describes each Pod interface.
$ kubectl get ippool
NAME CIDR NAT DISABLED
default-pool 10.244.0.0/16 true false
$ kubectl get workloadendpoints -A
NAMESPACE NAME NETWORK INTERFACE
default web-0 default eth0
default web-1 default eth0
# Cilium IPAM — per-node CIDRs managed by Cilium Operator
$ kubectl get ciliumippool
NAME VERSION CIDR
default 4 10.244.0.0/16
Popular CNI Plugins
Calico routes Pod IPs directly using BGP — nodes exchange routes just like an internet router. No encapsulation overhead. Can run without an overlay (flat L3) if nodes are on the same layer-2 network (e.g., in the same VPC subnet). The Felix agent programs iptables for NetworkPolicy.
Control plane: BIRD BGP daemon per node, orchestrated by calico/kube-controllers. For large clusters, route reflectors reduce the full-mesh BGP N×(N-1)/2 sessions problem.
Best for: Bare-metal clusters, high-throughput workloads, environments where you want route-level visibility (standard tcptraceroute works).
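The full-mesh scaling problem is easy to quantify. The sketch below compares session counts, assuming a simple topology in which every non-reflector node peers with every route reflector and the reflectors peer with each other; the function names and that topology are illustrative choices, not Calico defaults.

```python
def full_mesh_sessions(nodes):
    """BGP sessions when every node peers with every other node."""
    return nodes * (nodes - 1) // 2


def route_reflector_sessions(nodes, reflectors):
    """Sessions when clients peer only with reflectors (reflectors full-meshed)."""
    clients = nodes - reflectors
    return clients * reflectors + reflectors * (reflectors - 1) // 2


for n in (10, 100, 500):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n, 3))
```

Full mesh grows quadratically with node count, while the reflector topology grows linearly, which is why route reflectors are the usual answer once a cluster reaches a few hundred nodes.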
# Install Calico
$ kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
# BGP peer status — confirms BGP sessions are up
$ calicoctl node status
Calico process is running.
IPv4 BGP peering status
Node Peer Address BGP State ...
node-1 192.168.1.102 established
node-2 192.168.1.103 established
# On each node, you see kernel routes for every pod CIDR
$ ip route | grep 10.244
10.244.1.0/26 via 192.168.1.101 dev eth0 proto bird
10.244.2.0/26 via 192.168.1.102 dev eth0 proto bird
# No encapsulation — plain IP routing
$ tcpdump -i eth0 -c 5 icmp &
$ kubectl exec web-0 -- ping -c 1 10.244.2.19
# You'll see plain ICMP packets with pod IPs as src/dst (no VXLAN wrapper)
Flannel creates an overlay network. Each node gets a /24 (configurable) from
the cluster-wide Pod CIDR. Cross-node traffic is encapsulated in VXLAN packets,
which are UDP packets sent to port 8472 (the Linux kernel's default VXLAN port; the IANA-assigned port is 4789). The flanneld daemon on each node
manages the VXLAN interface (flannel.1).
Backends: udp (legacy), vxlan (recommended), host-gw (direct routing, no encapsulation, requires L2 connectivity), wireguard (encrypted).
Best for: Quick setup, homelabs, environments where simplicity matters more than performance. Not recommended for production at large scale due to VXLAN overhead.
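The VXLAN overhead mentioned above comes from the extra headers wrapped around each Pod packet. A small sketch of the arithmetic, assuming an IPv4 underlay without VLAN tags (the constant and function names are illustrative):

```python
# Bytes added in front of the original IP packet by VXLAN encapsulation.
OUTER_IPV4 = 20      # outer IPv4 header
OUTER_IPV6 = 40      # outer IPv6 header, if the underlay is IPv6
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header carrying the VNI
INNER_ETHERNET = 14  # the encapsulated frame's Ethernet header


def vxlan_overhead(ipv6_underlay=False):
    outer_ip = OUTER_IPV6 if ipv6_underlay else OUTER_IPV4
    return outer_ip + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def pod_mtu(underlay_mtu, ipv6_underlay=False):
    """Largest MTU a Pod interface can use without fragmenting on the underlay."""
    return underlay_mtu - vxlan_overhead(ipv6_underlay)


print(pod_mtu(1500))  # 1450 on a standard Ethernet underlay
```

On networks with jumbo frames the same subtraction applies, so a 9000-byte underlay leaves 8950 bytes for Pods.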
# Install Flannel (nodes stay NotReady until a CNI plugin is installed)
$ kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
# flanneld writes this to /run/flannel/subnet.env
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
# VXLAN backend creates flannel.1 interface
$ ip -d link show flannel.1
61: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450
vxlan id 1 local 192.168.1.101 dev eth0 srcport 0 0 dstport 8472 nolearning
# dstport 8472 is the Linux kernel's default VXLAN port, which Flannel uses (4789 is the IANA-assigned port)
# Cross-node packet path: pod → flannel.1 (VXLAN encap) → eth0 (UDP 8472)
# The receiving node's kernel decapsulates and delivers to the pod's veth
$ ip route show | grep -A1 "10.244.2"
10.244.2.0/24 via 192.168.1.102 dev flannel.1 onlink
Cilium replaces most of the kernel's iptables-based packet filtering with eBPF programs attached to TC (traffic control) hooks and socket operations. Service load balancing, NetworkPolicy, and observability are all handled in-kernel. There are no iptables chains to traverse, and in native-routing mode there is no overlay encapsulation either (the default tunnel mode uses VXLAN).
Control plane: Cilium Agent (DaemonSet) on every node, Cilium Operator for CRD management (IPAM, identities). Hubble is the built-in observability layer (per-flow visibility without a service mesh).
Best for: Large-scale production clusters, environments needing NetworkPolicy + observability without Istio, bare-metal with high throughput requirements.
# Install Cilium (Cilium 1.14+ expects kubeProxyReplacement=true; "strict" is the older spelling)
$ helm install cilium cilium/cilium \
--namespace kube-system \
--set kubeProxyReplacement=strict \
--set k8sServiceHost=<control-plane> \
--set k8sServicePort=6443
# Cilium endpoints — every pod gets a CiliumEndpoint CRD
$ kubectl get ciliumendpoints -A
NAMESPACE NAME ENDPOINT ID IDENTITY POLICY LEVEL
default web-0 384 17334 none
default api-0 522 19582 default
# eBPF maps backing the load balancer
$ cilium bpf lb list
FRONTEND SERVICE ID BACKEND ACTIVE
10.96.0.10:53 1 10.244.1.8:53 yes
10.0.5.6:80 2 10.244.1.37:8080 yes
10.0.5.6:80 2 10.244.2.19:8080 yes
# kube-proxy is no longer needed once Cilium handles Services
$ kubectl get daemonset -n kube-system kube-proxy
NAME DESIRED CURRENT READY
kube-proxy 3 3 3 # if still present, remove it: a running kube-proxy keeps programming its own rules
Weave creates its own overlay network using "sleeve" mode (userspace UDP encapsulation
for cross-node traffic) or "fast datapath" mode (in-kernel forwarding via the Open vSwitch datapath module).
Weave's highlights are optional encryption between peers and simple firewall policy (weave
deny/weave allow).
Note: Weave has fallen behind on Kubernetes compatibility and CNCF ecosystem engagement. Calico and Cilium are generally preferred for new deployments.
# Install Weave
$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version -o json 2>/dev/null | jq -r '.serverVersion.minor')"
# Weave router status
$ kubectl exec -n kube-system weave-net-xxxxx -c weave -- /home/weave/weave --local status
Version: 2.8.1
Peer name: node-1-1921681101
Encryption: enabled (gcm)
Target mesh: one-shot
Name: 52:f2:ca:3e:39:4d
# Weave's fastdp mode uses a kernel module; sleeve falls back to userspace UDP
$ ip link | grep weave
77: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376
CNI Chaining
CNI supports chaining — multiple plugins execute in sequence for a single ADD call.
The output of plugin N becomes the input to plugin N+1. This is how you get
capabilities the primary CNI doesn't support natively. For example, Calico handles
IP allocation and routing, but the portmap plugin handles HostPort
translation, and bandwidth handles traffic shaping.
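The hand-off between chained plugins can be sketched as a loop. In the CNI spec the previous plugin's output is passed to the next one under the prevResult key; everything else here (the `run_chain` helper and the two fake plugins) is a hypothetical stand-in for what a runtime library does.

```python
def run_chain(network_name, cni_version, plugins):
    """Invoke plugins in order, feeding each one the previous result."""
    prev_result = None
    for plugin_conf, invoke in plugins:
        conf = {"cniVersion": cni_version, "name": network_name, **plugin_conf}
        if prev_result is not None:
            conf["prevResult"] = prev_result
        prev_result = invoke(conf)
    return prev_result


def fake_main_plugin(conf):
    # Stand-in for the primary plugin: pretends to allocate an address.
    return {"cniVersion": conf["cniVersion"],
            "ips": [{"address": "10.244.1.37/24"}]}


def fake_portmap(conf):
    # Stand-in for a chained plugin: it must pass prevResult through.
    return conf["prevResult"]


result = run_chain("k8s-pod-network", "1.0.0",
                   [({"type": "calico"}, fake_main_plugin),
                    ({"type": "portmap"}, fake_portmap)])
print(result["ips"][0]["address"])
```

The final result is whatever the last plugin returns, which is why chained plugins are expected to echo the previous result rather than drop it.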
# Multus entry point in /etc/cni/net.d/ (typically 00-multus.conf)
# Multus is a "meta-plugin": it delegates to a default network for eth0 and
# attaches extra interfaces based on Pod annotations. The chain itself is the
# "plugins" list inside the delegate.
{
  "name": "multus-net",
  "cniVersion": "1.0.0",
  "type": "multus",
  "delegates": [
    {
      "name": "k8s-pod-network",
      "cniVersion": "1.0.0",
      "plugins": [
        { "type": "calico" },          # Primary: does pod IP and policy
        {
          "type": "portmap",           # Chained: HostPort mapping
          "capabilities": { "portMappings": true }
        },
        {
          "type": "bandwidth",         # Chained: ingress/egress rate limiting
          "capabilities": { "bandwidth": true }
        }
      ]
    }
  ]
}
# When a pod requests additional interfaces via annotation:
# annotations:
#   k8s.v1.cni.cncf.io/networks: |
#     [{"name": "podnic", "interface": "pod1"}]
# Multus looks up the NetworkAttachmentDefinition named podnic and delegates
# the ADD call to the plugin it configures (for example one using dhcp IPAM,
# which sends a DHCP request on the second interface)
Kubernetes Services: Deep Dive
A Service is a stable virtual IP (ClusterIP) that load-balances traffic to a set of Pods selected by label. The ClusterIP never changes (until the Service is deleted), decoupling consumers from the transient Pod IPs underneath. kube-proxy implements this virtual IP by programming kernel NAT rules.
ClusterIP Internals
The ClusterIP is allocated from the service CIDR (configured via
--service-cluster-ip-range on kube-apiserver; kubeadm defaults to 10.96.0.0/12).
In iptables mode it exists only in the kernel's NAT rules: there is no actual interface
called svc-10.0.5.6 (IPVS mode binds Service IPs to a dummy interface, kube-ipvs0). Packets to the ClusterIP are DNATted
to a backend Pod IP by the rules kube-proxy programs, before forwarding.
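The client Pod resolves api.default.svc.cluster.local to
10.0.5.6 (the ClusterIP).
The client sends a packet to 10.0.5.6:80. On the node, the packet hits the kernel's
netfilter NAT hooks before the routing decision.
The rules kube-proxy installed replace the destination
10.0.5.6:80 with 10.244.1.5:8080 (a random backend
Pod). This is DNAT.
The connection-tracking (conntrack) entry rewrites src back to ClusterIP on the
way back, so the client sees 10.0.5.6 as the response address.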
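The translation steps above can be modelled as a tiny NAT table. This is a toy simulation rather than kernel behavior: the `ServiceNat` class, the tuple-based packets, and keying conntrack on the client address and port alone are simplifying assumptions.

```python
import random


class ServiceNat:
    """Toy model of ClusterIP DNAT plus conntrack-based reverse translation."""

    def __init__(self, cluster_ip, port, backends, rng=None):
        self.vip = (cluster_ip, port)
        self.backends = list(backends)   # [(pod_ip, pod_port), ...]
        self.rng = rng or random.Random()
        self.conntrack = {}              # (client_ip, client_port) -> backend

    def outbound(self, src, sport, dst, dport):
        """Rewrite a client packet's destination if it targets the ClusterIP."""
        if (dst, dport) != self.vip:
            return (src, sport, dst, dport)
        key = (src, sport)
        if key not in self.conntrack:
            # First packet of a connection: pick a backend at random.
            self.conntrack[key] = self.rng.choice(self.backends)
        backend_ip, backend_port = self.conntrack[key]
        return (src, sport, backend_ip, backend_port)

    def reply(self, src, sport, dst, dport):
        """Rewrite a backend's reply so it appears to come from the ClusterIP."""
        if self.conntrack.get((dst, dport)) == (src, sport):
            return (self.vip[0], self.vip[1], dst, dport)
        return (src, sport, dst, dport)


nat = ServiceNat("10.0.5.6", 80, [("10.244.1.5", 8080), ("10.244.2.6", 8080)])
forward = nat.outbound("10.244.3.8", 40000, "10.0.5.6", 80)
back = nat.reply(forward[2], forward[3], "10.244.3.8", 40000)
```

Later packets of the same connection reuse the stored entry, so the backend is chosen once per connection and never per packet.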
Service Types Compared
# ======================
# ClusterIP — default
# ======================
# Virtual IP reachable ONLY from within the cluster.
# Internal tooling, sidecar proxies, other Pods all use this.
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP       # this is the default, omit type: to get it
  selector:
    app: api
  ports:
  - name: http
    port: 80            # the Service port (what you hit)
    targetPort: 8080    # the container port the app listens on
    protocol: TCP
  - name: grpc
    port: 50051
    targetPort: 50051
    protocol: TCP
# ======================
# NodePort — exposes on all node IPs at 30000-32767
# ======================
# Adds a static host-port binding. kube-proxy adds an iptables rule:
# iptables -t nat -A KUBE-NODEPORTS -p tcp --dport 30080 -j KUBE-SVC-XXX
spec:
  type: NodePort
  ports:
  - name: http
    port: 80
    targetPort: 8080
    nodePort: 30080     # explicit; omit for random (30000-32767)
# NodePort services also get a ClusterIP automatically
---
# Equivalent iptables (simplified):
$ iptables -t nat -L KUBE-NODEPORTS -n -v
target prot opt source destination
KUBE-SVC-XXXX tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:30080
# ======================
# LoadBalancer — provisions a cloud LB (AWS ELB, GCP LB, Azure LB)
# ======================
# The cloud controller manager watches for LoadBalancer services and
# provisions an external LB pointed at the NodePort. kube-proxy is still
# involved (NodePort is the LB's backend).
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  # The loadBalancerIP field requests a pre-allocated cloud IP
  # (deprecated since Kubernetes 1.24 in favor of provider-specific annotations)
  loadBalancerIP: "203.0.113.10"
  loadBalancerSourceRanges:        # restrict access to the LB
  - "10.0.0.0/8"
  externalTrafficPolicy: Cluster   # default: route to any node
  # externalTrafficPolicy: Local   # only route to nodes with running pods
  #   Preserves client IP (no intermediate node NAT)
  #   May cause uneven load distribution
# ======================
# ExternalName — CNAME, no proxying
# ======================
# Maps the service DNS name to an external DNS name. No ClusterIP,
# no kube-proxy involvement. Useful for pointing to an external db.
spec:
  type: ExternalName
  externalName: db.us-east-1.rds.amazonaws.com
# api.default.svc.cluster.local → db.us-east-1.rds.amazonaws.com
# ======================
# Headless — no ClusterIP, DNS returns Pod IPs directly
# ======================
# Useful for StatefulSets (Cassandra, ZooKeeper) where clients need to
# discover all pod IPs and connect directly (no intermediate proxy).
spec:
  clusterIP: None   # this is the headless signal
  selector:
    app: cassandra
  ports:
  - port: 9042
# DNS query for cassandra.default.svc.cluster.local returns:
# A 10.244.1.12
# A 10.244.2.8
# A 10.244.3.21
# CoreDNS builds these records from the Service's EndpointSlices; no single
# virtual-IP A record is synthesized for a headless Service.
# With headless, clients can also use SRV records to discover all replicas.
Session Affinity
By default, kube-proxy distributes connections with iptables --probability
(random distribution). For stateful protocols, you may want sticky sessions.
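Sticky sessions amount to remembering which backend a client was given and reusing it until the entry goes stale. A minimal sketch, assuming the timeout is refreshed on every new connection; the `ClientIPAffinity` class and its explicit `now` argument are illustrative, and the real mechanism lives in kernel rules rather than a Python dictionary.

```python
import random


class ClientIPAffinity:
    """Toy ClientIP affinity table with an idle timeout."""

    def __init__(self, backends, timeout_seconds=10800, rng=None):
        self.backends = list(backends)
        self.timeout = timeout_seconds
        self.rng = rng or random.Random()
        self.table = {}  # client_ip -> (backend, last_seen)

    def pick(self, client_ip, now):
        """Choose a backend for a new connection arriving at time `now`."""
        entry = self.table.get(client_ip)
        if entry is not None and now - entry[1] <= self.timeout:
            backend = entry[0]                         # still sticky
        else:
            backend = self.rng.choice(self.backends)   # new or expired client
        self.table[client_ip] = (backend, now)         # refresh the timer
        return backend


lb = ClientIPAffinity(["10.244.1.5", "10.244.2.6", "10.244.3.7"], timeout_seconds=10)
chosen = lb.pick("203.0.113.9", now=0)
```

Note that affinity keys on the source IP the node sees, so clients behind a shared NAT gateway all land on the same backend.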
spec:
  sessionAffinity: ClientIP   # affinity based on client IP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # 3 hours (default); max 86400 (1 day)
# Balancing is per connection: a long-lived HTTP/2 or gRPC connection stays on one
# backend, and a client that reconnects may land on a different one.
# sessionAffinity: None → random selection per connection (iptables statistic rules, not strict round-robin)
kube-proxy Modes
kube-proxy is a DaemonSet that watches for Service and Endpoint changes and programs the kernel's forwarding rules accordingly. It runs on every node and implements the Service virtual IP → backend Pod IP translation.
iptables Mode
The historical default. kube-proxy creates chains in the nat table
named KUBE-SVC-<hash> (one per Service port) and
KUBE-SEP-<hash> (one per backend endpoint). The Service chain
uses statistic mode random probability to distribute load.
The problem: for a Service with 100 Pods, the kernel may evaluate up to
100 probability rules on every connection setup, and the KUBE-SERVICES chain itself is matched rule by rule. At 10,000 Services, those
linear scans, plus reloading the whole ruleset on each change, dominate CPU.
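The per-rule probabilities look uneven (1/3, then 1/2, then a catch-all) but yield a uniform choice, because each rule is only reached when the earlier ones did not match. A short sketch with exact fractions shows this; the function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Probability on the i-th rule of an n-backend chain: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def selection_distribution(n):
    """Overall chance that each backend is chosen, walking the rules in order."""
    remaining = Fraction(1)   # probability of reaching the current rule
    distribution = []
    for p in rule_probabilities(n):
        distribution.append(remaining * p)
        remaining *= 1 - p
    return distribution


print(rule_probabilities(3))       # 1/3, 1/2, 1
print(selection_distribution(3))   # each backend ends up at 1/3
```

The last rule needs no probability match at all, which is why the final entry in a Service chain is an unconditional jump.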
# Inspect kube-proxy iptables rules for a service
$ iptables -t nat -L KUBE-SERVICES -n --line-numbers | grep api
1 KUBE-SVC-XXXXXXXX tcp -- 0.0.0.0/0 10.0.5.6 /* default/api:http */ tcp dpt:80
$ iptables -t nat -L KUBE-SVC-XXXXXXXX -n
target prot opt source destination
KUBE-SEP-AAA all -- 0.0.0.0/0 0.0.0.0/0 /* default/api:http */ statistic mode random probability 0.33333333333
KUBE-SEP-BBB all -- 0.0.0.0/0 0.0.0.0/0 /* default/api:http */ statistic mode random probability 0.50000000000
KUBE-SEP-CCC all -- 0.0.0.0/0 0.0.0.0/0 /* default/api:http */
$ iptables -t nat -L KUBE-SEP-AAA -n
target prot opt source destination
KUBE-MARK-MASQ all -- 10.244.1.5 0.0.0.0/0 /* default/api:http */
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 /* default/api:http */ tcp to:10.244.1.5:8080
# On a busy node, the nat table can get very large
$ iptables -t nat -L KUBE-SERVICES -n | wc -l
4821 # rules: still manageable at thousands of services
IPVS Mode
IPVS (IP Virtual Server) is a kernel module in the net/netfilter
subsystem designed specifically for load balancing. It uses a hash table internally,
giving O(1) lookup regardless of the number of services or endpoints. It's the
right choice for clusters with thousands of Services.
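The difference between a rule list and a hash table can be shown by counting comparisons. The sketch below is a conceptual model only: the helper names are hypothetical, and Python's dict merely stands in for the kernel's hash table.

```python
def linear_lookup(rules, dst):
    """Scan rules in order, as a chain of match rules does.

    Returns (target, comparisons made).
    """
    for count, (match, target) in enumerate(rules, start=1):
        if match == dst:
            return target, count
    return None, len(rules)


def hashed_lookup(table, dst):
    """Single hash probe, independent of how many services exist."""
    return table.get(dst)


# 200 hypothetical services on port 80.
rules = [((f"10.0.0.{i}", 80), f"svc-{i}") for i in range(1, 201)]
table = dict(rules)

print(linear_lookup(rules, ("10.0.0.200", 80)))   # found after 200 comparisons
print(hashed_lookup(table, ("10.0.0.200", 80)))   # found in one probe
```

The cost of the linear scan depends on where the Service sits in the list, so worst-case latency grows with cluster size while the hashed lookup stays flat.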
# Switch kube-proxy to IPVS mode via configmap
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: rr        # round-robin; also: lc (least connections), sh (source hash)
  syncPeriod: 30s      # how often to sync IPVS rules with endpoints
  minSyncPeriod: 10s   # minimum interval between syncs
# Inspect IPVS table — hash table lookup is O(1)
$ ipvsadm -L -n
TCP 10.0.5.6:80 rr
-> 10.244.1.5:8080 Masq 1 0 0
-> 10.244.2.6:8080 Masq 1 0 0
-> 10.244.3.7:8080 Masq 1 0 0
# Masq = IPVS NAT (masquerading) forwarding, which kube-proxy uses for every backend.
# externalTrafficPolicy: Local does not change the forwarding method; the
# NodePort / load-balancer virtual server simply lists only node-local endpoints:
$ ipvsadm -L -n | grep -A1 30080
TCP 192.168.1.101:30080 rr
-> 10.244.1.5:8080 Masq 1 0 0
# IPVS also supports session persistence (same as sessionAffinity: ClientIP)
$ ipvsadm -L -n --persistent-conn
Prot LocalAddress:Port Weight PersistConn ActiveConn InActConn
-> RemoteAddress:Port
TCP 10.0.5.6:80 rr persistent 10800
-> 10.244.1.5:8080 1 0 10 0
-> 10.244.2.6:8080 1 0 8 0
nftables Mode (Kubernetes 1.29+)
nftables is the modern successor to iptables. kube-proxy in nftables mode programs rules through the nftables API instead of iptables. Because it dispatches Services through sets and verdict maps rather than one rule per Service, lookups scale better than iptables mode's linear chains. The mode was alpha in 1.29 and beta in 1.31, so check its status for your release before relying on it in production.
# Enable nftables mode (while the feature is alpha, the NFTablesProxyMode feature gate must be enabled)
# In KubeProxyConfiguration:
kind: KubeProxyConfiguration
mode: nftables
nftables:
  syncPeriod: 30s      # how often to resync rules
  minSyncPeriod: 1s    # minimum interval between syncs
# Inspect the nftables rules (output simplified for illustration; real chain names are generated)
$ nft list table ip kube-proxy
table ip kube-proxy { # ip = IPv4; ip6 = IPv6
chain prerouting { # nat prerouting hook
type nat hook prerouting priority dstnat; policy accept;
ip daddr 10.0.5.6 tcp dport 80 goto service-api
}
chain service-api {
ip protocol tcp ip daddr 10.0.5.6 tcp dport 80 ct state new \
counter packets 4821 bytes 12049023 accept
}
chain postrouting { # for masquerade
type nat hook postrouting priority srcnat; policy accept;
ip saddr 10.0.0.0/8 masquerade
}
}
# Compare rule count: nftables uses sets for efficiency
# nftables "sets" are like: { 10.0.5.6, 10.0.6.12, 10.0.7.3 }:80
# rather than N individual iptables rules
Performance Comparison
DNS for Services
Kubernetes DNS (CoreDNS) is the cluster's DNS server. It answers DNS queries for
Service names, Pod names, and external names. All Pods are configured to use
kube-dns (the Service for coredns) as their nameserver via
/etc/resolv.conf injected by kubelet at pod creation.
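The resolv.conf that kubelet injects typically carries a search list and an ndots option, and together they decide which names the resolver tries and in what order. The sketch below models the usual stub-resolver behavior (search list first when the name has fewer dots than ndots); the `candidate_names` helper is hypothetical, and real resolvers differ in details.

```python
def candidate_names(name, search, ndots=5):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]                      # already absolute: no search list
    searched = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + searched       # enough dots: try as-is first
    return searched + [absolute]           # otherwise walk the search list first


SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in ("mysql", "example.com", "example.com."):
    print(query, "->", candidate_names(query, SEARCH))
```

An external name such as example.com therefore costs three failed cluster lookups before the real query is sent, which is why a trailing dot or a lower ndots value is a common tuning step for Pods that mostly call external services.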
DNS Record Types
# ======================
# A/AAAA records for Services
# ======================
# For a ClusterIP service, CoreDNS creates an A record (or AAAA for IPv6):
# api.default.svc.cluster.local → 10.0.5.6 (IPv4)
# api.default.svc.cluster.local → 2001:db8::6 (IPv6 dual-stack)
$ kubectl run dnsutils --image=tutum/dnsutils --restart=Never -- sleep 3600
$ kubectl exec dnsutils -- nslookup api.default.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: api.default.svc.cluster.local
Address: 10.0.5.6
# ======================
# A records for headless StatefulSet Pods
# ======================
# For headless services with StatefulSet pods:
# cassandra-0.cassandra.default.svc.cluster.local → 10.244.1.12
# cassandra-1.cassandra.default.svc.cluster.local → 10.244.2.8
$ kubectl exec dnsutils -- nslookup cassandra-0.cassandra.default.svc.cluster.local
Name: cassandra-0.cassandra.default.svc.cluster.local
Address: 10.244.1.12
# ======================
# SRV records
# ======================
# _http._tcp.api.default.svc.cluster.local → 0 100 80 api.default.svc.cluster.local
# _https._tcp.api.default.svc.cluster.local → 0 100 443 api.default.svc.cluster.local
# Format: _port-name._protocol.service-name.namespace.svc.cluster.local
# Priority=0 (unused for Services, used for headless), Weight=100, Port=80
$ kubectl exec dnsutils -- dig +short SRV _http._tcp.api.default.svc.cluster.local
0 100 80 api.default.svc.cluster.local.
# SRV records for headless services point to pod A records:
# _http._tcp.cassandra.default.svc.cluster.local → 0 100 9042 cassandra-0.cassandra.default.svc.cluster.local
# → 0 100 9042 cassandra-1.cassandra.default.svc.cluster.local
# ======================
# ExternalName — CNAME
# ======================
# api.default.svc.cluster.local
# → CNAME db.us-east-1.rds.amazonaws.com (no A record synthesized)
# ======================
# Search path — /etc/resolv.conf on every Pod
# ======================
# Kubelet injects this into every pod's /etc/resolv.conf:
$ kubectl exec dnsutils -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
# With ndots:5, a name with fewer than 5 dots is tried against the search path first.
# Query "mysql" → "mysql.default.svc.cluster.local" (0 dots, search path applied)
# Query "example.com" → tried as "example.com.default.svc.cluster.local", then the other
# search domains, and only then as "example.com" (1 dot < 5)
# This costs extra lookups for external names; add a trailing dot ("example.com.") to skip the search path.
Headless Services and DNS Discovery
When a Service has clusterIP: None, kube-apiserver does not allocate
a ClusterIP. Instead, the endpoint controller creates endpoint records that point
directly to Pod IPs. CoreDNS receives these EndpointSlice events and creates
individual A records — one per pod. The client gets all of them and can choose how
to balance (random, round-robin via DNS TTL, or connect to all).
# For headless services, CoreDNS creates multiple A records:
$ kubectl exec dnsutils -- nslookup web.default.svc.cluster.local
Name: web.default.svc.cluster.local
Address: 10.244.1.37
Address: 10.244.2.19
Address: 10.244.3.8
# Java clients (Elasticsearch, etc.) use this to discover all replica IPs
# Python clients may only use the first record
Ingress Controllers
Ingress provides HTTP/HTTPS routing into the cluster — host-based routing, path-based routing, TLS termination. The Ingress API has been in Kubernetes since 1.1, but the controller is NOT included in a standard cluster — you must install one.
ingress-nginx
The Kubernetes community's NGINX-based Ingress controller (different from the
"nginx-ingress" maintained by NGINX Inc). Runs a customized NGINX that watches Ingress
resources and regenerates nginx.conf on changes.
# Install ingress-nginx
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.9.4/deploy/static/provider/cloud/deploy.yaml
# Basic Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    # ^ Rewrites every matched path to /; to strip only the prefix, use a regex
    #   path with a capture group and a target such as /$2
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api
            port:
              number: 80
---
# TLS termination
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls   # must be a TLS Secret in the same namespace
# ingress-nginx fetches the cert/key from the Secret and terminates TLS
Gateway API (Recommended)
Gateway API is the successor to Ingress. It decouples the role of "who manages the infrastructure" (GatewayClass/Gateway) from "who writes routing rules" (HTTPRoute). This enables multi-team setups where an infrastructure team owns the Gateway and application teams own their HTTPRoutes.
# Gateway API requires a controller (Cilium, Envoy Gateway, Traefik, NGINX Gateway Fabric)
# Enable Cilium's Gateway API controller (the Gateway API CRDs must be installed first):
$ cilium install --set gatewayAPI.enabled=true
# Step 1: GatewayClass — declares which controller is available
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: cilium
spec:
controllerName: io.cilium/gateway-controller
# Step 2: Gateway — the actual load balancer / listener
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: web
namespace: networking
spec:
gatewayClassName: cilium
listeners:
- name: http
port: 80
protocol: HTTP
allowedRoutes:
namespaces:
from: Same # HTTPRoutes must be in same namespace
- name: https
port: 443
protocol: HTTPS
allowedRoutes:
namespaces:
from: Same
tls:
mode: Terminate
certificateRefs:
- name: web-tls
kind: Secret
group: ""
namespace: networking
# Step 3: HTTPRoute — routing rules (cross-namespace reference)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: api
namespace: default # different from the Gateway's namespace!
spec:
parentRefs:
- kind: Gateway
name: web
namespace: networking
sectionName: http
hostnames: [ "api.example.com" ]
rules:
- matches:
- path:
type: PathPrefix
value: /v1
- headers:
type: Exact
name: x-canary
value: "true"
backendRefs:
- name: api-canary
port: 80
weight: 10 # 10% canary
- name: api-stable
port: 80
weight: 90
- matches:
- path:
type: PathPrefix
value: /v2
backendRefs:
- name: api-v2
port: 80
Other Ingress Controllers
Traefik is written in Go and supports both Ingress (v1) and Gateway API natively. Its Traefik-specific IngressRoute CRD offers more features than core Ingress. It is a good fit for simple deployments but less suited to ultra-high-throughput workloads than NGINX, and it supports automatic ACME certificates.
# Traefik IngressRoute (their custom CRD)
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: api
spec:
entryPoints: [web, websecure]
routes:
- match: Host(`api.example.com`) && PathPrefix(`/v1`)
kind: Rule
services:
- name: api
port: 80
Contour pairs Envoy as the data plane (a high-performance C++ proxy) with Contour itself as the Kubernetes controller, and it supports Gateway API. It was created at Heptio (later acquired by VMware, now part of Broadcom). Good for enterprises wanting a battle-tested Envoy setup.
# Contour's HTTPProxy (similar to IngressRoute)
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
name: api
spec:
virtualhost:
fqdn: api.example.com
routes:
- conditions: [prefix: /v1]
services:
- name: api
port: 80
NetworkPolicy
NetworkPolicy is a Kubernetes object for firewalling at the Pod level. It selects Pods with a label selector and defines what inbound (ingress) and outbound (egress) traffic is allowed. The CNI plugin is responsible for enforcing these rules — the Kubernetes API just stores them.
policyTypes and Implicit Default-Deny
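If a policy that selects a Pod lists Ingress in policyTypes, that Pod
becomes isolated for ingress: only traffic allowed by some policy's ingress rules gets in.
Listing Egress does the same for outbound traffic. Pods that no policy selects stay fully open.
A common source of surprise is the default: if you omit policyTypes, Kubernetes assumes
Ingress, and adds Egress only when the policy contains an egress section. A policy with
egress rules but no policyTypes therefore also isolates the selected Pods for ingress.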
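The isolation rule documented for NetworkPolicy is easier to reason about as code: a Pod is isolated for ingress as soon as any policy that selects it lists Ingress in policyTypes, and from then on traffic is allowed only if some selecting policy has a matching rule. The following is a minimal Python sketch under simplifying assumptions: the `Policy` class and both functions are hypothetical, and only label selectors within one namespace are modeled (no ports, namespaces, or CIDRs).

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """A NetworkPolicy reduced to label selectors."""
    pod_selector: dict   # {} selects every Pod in the namespace
    policy_types: list   # e.g. ["Ingress"]
    ingress_from: list = field(default_factory=list)  # allowed peer label sets


def matches(selector: dict, labels: dict) -> bool:
    """An empty selector matches everything; otherwise all pairs must match."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies: list, dst_labels: dict, src_labels: dict) -> bool:
    """Decide whether traffic from src to dst passes the ingress policies."""
    selecting = [
        p for p in policies
        if "Ingress" in p.policy_types and matches(p.pod_selector, dst_labels)
    ]
    if not selecting:
        return True  # not isolated for ingress: everything is allowed
    # Isolated: allowed only if some selecting policy has a matching peer.
    return any(
        matches(peer, src_labels) for p in selecting for peer in p.ingress_from
    )


deny_all = Policy(pod_selector={}, policy_types=["Ingress"])
allow_web = Policy({"app": "api"}, ["Ingress"], ingress_from=[{"app": "web"}])
api, web, batch = {"app": "api"}, {"app": "web"}, {"app": "batch"}

print(ingress_allowed([], api, batch))                     # True: no policy, open
print(ingress_allowed([deny_all], api, web))               # False: isolated, no rule
print(ingress_allowed([deny_all, allow_web], api, web))    # True: rule matches
print(ingress_allowed([deny_all, allow_web], api, batch))  # False: no matching rule
```

Policies are additive: there is no "deny" rule, only isolation plus a union of allow rules, which is why a default-deny policy and a narrower allow policy compose cleanly.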
# Default-deny both ingress and egress for a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-all
namespace: production
spec:
podSelector: {} # empty selector = all pods in this namespace
policyTypes:
- Ingress
- Egress
---
# Allow DNS egress (required for any pod to resolve names)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
---
# Allow web → api on TCP 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-web-to-api
namespace: default
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: web
ports:
- protocol: TCP
port: 8080
Egress Policies and External Destinations
# Allow api pod to reach external Prometheus (monitoring)
# Requires specifying ipBlock with CIDR of the Prometheus server
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-api-to-prometheus
namespace: default
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Egress
egress:
- to:
- ipBlock:
cidr: 203.0.113.0/24 # external Prometheus CIDR
ports:
- protocol: TCP
port: 9090
# Allow api pod to reach the internet (0.0.0.0/0 except cluster CIDR)
# ipBlock with except: lets you carve out the cluster CIDR
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-api-internet
namespace: default
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Egress
egress:
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.0.0.0/8 # exclude cluster internal CIDR
- 172.16.0.0/12
- 192.168.0.0/16
- 127.0.0.0/8
# Check if your CNI enforces NetworkPolicy
# For Calico:
$ kubectl get networkpolicy --all-namespaces
# If policies exist and traffic is being blocked, Calico's Felix agent
# is programming the policy via eBPF (modern) or iptables (legacy).
# For Cilium:
$ kubectl get ciliumnetworkpolicy -A
# Cilium's eBPF datapath evaluates policies at line rate.
# Verify enforcement by testing denied traffic:
$ kubectl exec web-0 -- curl -s --connect-timeout 2 http://10.244.3.8:8080
# Should timeout if policy denies it.
Service Mesh — Sidecar Injection, mTLS, and Traffic Interception
A service mesh (Istio, Linkerd, Kuma) adds per-pod proxy sidecars that intercept all inbound and outbound traffic. The sidecar handles mTLS, retries, circuit breaking, and observability without the application code knowing.
Sidecar Injection
A sidecar is an extra container in the same Pod as your application, sharing the
network namespace. The sidecar's lo is the app's lo, and
iptables rules inside that namespace apply to both containers. The app believes it is
connecting directly to its destination when its connection is actually redirected to the
sidecar proxy, which then forwards the (possibly mTLS'd) traffic to the destination.
# Istio automatic sidecar injection
# Label the namespace to inject Envoy sidecars into all pods
$ kubectl label namespace default istio-injection=enabled
$ kubectl apply -f deployment.yaml
# Istio's mutating admission webhook (istiod) injects the Envoy sidecar into new pods
# Pod with sidecar: note the extra container
$ kubectl get pod api-0 -o jsonpath='{.spec.containers[*].name}'
api istio-proxy
# The sidecar container in the pod spec:
$ kubectl get pod api-0 -o jsonpath='{.spec.containers[1]}'
{
"name": "istio-proxy",
"image": "docker.io/istio/proxyv2:1.19",
"args": ["proxy", "sidecar", ...]
}
# Istio's Envoy sidecar listens on 15001 (outbound) and 15006 (inbound);
# iptables redirects the Pod's other TCP traffic to those ports.
# The application container's ports are still bound in the same netns.
mTLS in the Mesh
In a service mesh, every pod gets an identity (SPIFFE URI) stored in a certificate. The sidecar proxy terminates and re-encrypts mTLS for every connection. Pod A connecting to Pod B uses a certificate signed by the mesh CA. The destination verifies the certificate before accepting the connection.
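To make the identity format concrete, here is a minimal Python sketch that splits an Istio-style SPIFFE ID into its parts. The function name `parse_istio_spiffe_id` is hypothetical, and the sketch assumes the `spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>` layout; it performs no certificate validation.

```python
from urllib.parse import urlparse


def parse_istio_spiffe_id(spiffe_id: str) -> dict:
    """Split an Istio-style SPIFFE ID into trust domain, namespace, and service account."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID")
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa":
        raise ValueError("expected /ns/<namespace>/sa/<service-account>")
    return {
        "trust_domain": parsed.netloc,
        "namespace": parts[1],
        "service_account": parts[3],
    }


identity = parse_istio_spiffe_id("spiffe://cluster.local/ns/default/sa/api")
print(identity)
```

Because the identity is derived from the namespace and service account rather than the Pod IP, authorization rules written against it survive rescheduling and IP churn.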
# SPIFFE identity for a pod (Istio format)
# spiffe://cluster.local/ns/default/sa/api
# cluster.local = trust domain
# ns = namespace
# sa = service account
# mTLS in STRICT mode — no plaintext, no unknown identities
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
mtls:
mode: STRICT # all pods must use mTLS
# DestinationRule controls mTLS and traffic policies per service
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: api
spec:
host: api
trafficPolicy:
tls:
mode: ISTIO_MUTUAL # Istio provisions certs automatically
# Without mesh: Pod A → Pod B uses plain TCP
# With mesh mTLS: Pod A's envoy → Pod B's envoy (mTLS) → Pod B
# The app on Pod B sees plaintext; the envoy handles crypto
How Sidecars Intercept Traffic (iptables)
Istio uses iptables redirection to intercept all inbound and outbound
TCP traffic in the Pod. The istio-init init container (or the Istio CNI plugin)
sets up nat-table rules in the OUTPUT and PREROUTING chains that
redirect traffic to the Envoy process in the istio-proxy sidecar.
Envoy then decides what to do with it (forward, mTLS, reject).
# The istio-init init container sets up iptables.
# It runs BEFORE the application container and configures:
# iptables -t nat -A OUTPUT -p tcp -j ISTIO_OUTPUT # app → proxy (outbound)
# iptables -t nat -A PREROUTING -p tcp -j ISTIO_INBOUND # ingress → proxy (inbound)
# ISTIO_OUTPUT ends in ISTIO_REDIRECT, which sends traffic to Envoy's outbound port:
# iptables -t nat -A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001
# ISTIO_INBOUND ends in ISTIO_IN_REDIRECT, which sends traffic to the inbound port:
# iptables -t nat -A ISTIO_IN_REDIRECT -p tcp -j REDIRECT --to-ports 15006
# Result: every TCP connection from the app goes through Envoy first.
# Envoy applies policies (mTLS verification, RBAC, retries) then
# re-encrypts and sends to the destination.
# Check the rules inside a pod with the istio-proxy sidecar:
$ kubectl exec -it api-0 -c istio-proxy -- iptables -t nat -L -n -v
Chain PREROUTING (1 references)
pkts bytes target prot opt in out source destination
0 0 ISTIO_INBOUND tcp -- * * 0.0.0.0/0 0.0.0.0/0
Chain ISTIO_INBOUND (1 references)
pkts bytes target prot opt in out source destination
0 0 RETURN tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080
0 0 ISTIO_IN_REDIRECT tcp -- * * 0.0.0.0/0 0.0.0.0/0
Chain ISTIO_IN_REDIRECT (1 references)
pkts bytes target prot opt in out source destination
0 0 REDIRECT tcp -- * * 0.0.0.0/0 0.0.0.0/0 redir ports 15006
Service Mesh vs CNI Policy — When to Use Each
CNI Chaining in Practice
CNI chaining lets you compose plugins so each handles a specific capability.
The canonical use case: Calico handles pod IP networking, but you need HostPort
support, which Calico doesn't do. The portmap plugin sits in the
chain after Calico and handles the HostPort translation via iptables.
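The hand-off between chained plugins can be sketched in a few lines. This is a toy Python model, not the CNI reference implementation: the `calico` and `portmap` functions are hypothetical stand-ins for the real binaries, and the `actions` list merely records the side effects a real plugin would apply on the host. What it illustrates is that each plugin receives the previous plugin's result (prevResult in the CNI spec) and returns a result for the next one.

```python
import copy

actions = []  # side effects a real plugin would apply on the host


def calico(prev_result):
    """Stand-in for the main plugin: creates eth0 and assigns the Pod IP."""
    actions.append("calico: create veth, assign 10.244.1.5")
    return {"interfaces": [{"name": "eth0"}], "ips": [{"address": "10.244.1.5/24"}]}


def portmap(prev_result):
    """Stand-in for portmap: reads the Pod IP from the previous result."""
    pod_ip = prev_result["ips"][0]["address"].split("/")[0]
    actions.append(f"portmap: DNAT host:30080 -> {pod_ip}:8080")
    return copy.deepcopy(prev_result)  # passes the result through unchanged


def run_add(chain):
    """Invoke each plugin in conflist order, threading the result through."""
    result = None
    for plugin in chain:
        result = plugin(result)
    return result


final = run_add([calico, portmap])
print(final["ips"][0]["address"])
for line in actions:
    print(line)
```

The ordering matters: portmap cannot write its DNAT rule until the first plugin has reported which IP the Pod received.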
# How the chain executes for ADD:
# 1. the container runtime (on kubelet's behalf) calls calico (first plugin in conflist)
# 2. calico sets up eth0 with pod IP
# 3. calico returns with ips/routes
# 4. portmap receives calico's output as its prevResult
# 5. portmap adds HostPort iptables rules
# 6. portmap returns its own augmented result
# 7. bandwidth receives portmap's output
# 8. bandwidth adds tc (traffic control) qdisc rules
# 9. bandwidth returns the final result to the runtime
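# Resulting iptables rules for a HostPort pod (chain name simplified here; the
# portmap plugin's actual chains are CNI-HOSTPORT-DNAT and per-pod CNI-DN-* chains):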
$ iptables -t nat -L PORTMAP-INGRESS -n -v
target prot opt source destination
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:30080 to:10.244.1.5:8080
# 10.244.1.5:8080 is the pod's actual IP:port
# HostPort: containerPort → hostPort mapping:
# - containerPort: 8080 (what the container binds to)
# - hostPort: 30080 (what the host exposes)
# The portmap plugin translates host:30080 → podIP:8080 via DNAT
Multus — Meta Plugin for Multiple Networks
Multus is a meta-CNI that calls other CNIs based on per-pod annotations. It solves the problem of pods that need multiple network interfaces: for example, a pod that needs a management interface (Calico), a data-plane interface for storage (for example macvlan with host-local IPAM), and a DPDK interface (for NFV workloads).
# A pod with a second network interface via Multus annotation.
# eth0 always comes from the cluster's default CNI (Calico here); the annotation
# lists extra networks by NetworkAttachmentDefinition name.
# data-network is a NetworkAttachmentDefinition (for example macvlan with host-local IPAM).
metadata:
annotations:
k8s.v1.cni.cncf.io/networks: |
[
{
"name": "data-network",
"interface": "eth1"
}
]
# Result inside the pod:
$ ip addr
1: lo: <LOOPBACK> mtu 65536
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500
inet 10.244.1.37/24 # Calico IP (pod network)
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500
inet 10.200.1.5/24 # host-local IP (data network)
IP Address Management (IPAM) — CIDRs and Allocations
Kubernetes needs to allocate three separate CIDR blocks when it starts. Getting these right before cluster bootstrap avoids painful renumbering later.
# kubeadm cluster bootstrap with explicit CIDRs
$ kubeadm init --pod-network-cidr=10.244.0.0/16 \
--service-cidr=10.96.0.0/12 \
--service-dns-domain=cluster.local
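# Verify the settings (kubeadm stores them in the kubeadm-config ConfigMap):
$ kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -E 'podSubnet|serviceSubnet'
# Check the per-node pod CIDRs allocated by kube-controller-manager: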
$ kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.244.1.0/24 10.244.2.0/24 10.244.3.0/24
# kube-controller-manager assigns /24 to each node (node-cidr-mask-size=24)
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
NAME PODCIDR
node-1 10.244.1.0/24
node-2 10.244.2.0/24
node-3 10.244.3.0/24
# In AWS with VPC CNI, pods use the host's ENI secondary IPs:
# Each EC2 instance type has a limit on ENIs and secondary IPs per ENI.
# The VPC CNI plugin allocates from the VPC subnet, so Pod CIDR == VPC subnet.
With the default --max-pods=110 and a /24 per node, a 50-node cluster
reserves 50 × 254 = 12,700 pod IPs but can run at most 50 × 110 = 5,500 pods, so more
than half of the address space sits idle. Going the other way, if you raise --max-pods
above 254 you exhaust the node's /24 before the node itself is saturated. If you have
many nodes with low pod counts, keep a /16 at the cluster level and allocate a /26
(62 usable IPs) per node, lowering --max-pods to match.
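The arithmetic is easy to check for your own numbers. The snippet below is a small Python sketch using the standard `ipaddress` module; `node_subnet_plan` is a hypothetical helper, it assumes IPv4 only, and it treats the network and broadcast addresses of each node subnet as unusable.

```python
import ipaddress


def node_subnet_plan(cluster_cidr: str, node_prefix: int, nodes: int, max_pods: int) -> dict:
    """Estimate pod IP usage for a cluster CIDR split into per-node subnets."""
    cluster = ipaddress.ip_network(cluster_cidr)
    if cluster.version != 4 or node_prefix < cluster.prefixlen:
        raise ValueError("expected an IPv4 CIDR and a node prefix inside it")
    usable_per_node = 2 ** (32 - node_prefix) - 2  # minus network and broadcast
    return {
        "available_node_subnets": 2 ** (node_prefix - cluster.prefixlen),
        "usable_ips_per_node": usable_per_node,
        "reserved_ips": nodes * usable_per_node,
        "max_schedulable_pods": nodes * min(max_pods, usable_per_node),
    }


print(node_subnet_plan("10.244.0.0/16", node_prefix=24, nodes=50, max_pods=110))
print(node_subnet_plan("10.244.0.0/16", node_prefix=26, nodes=50, max_pods=110))
```

With a /24 per node the plan reserves 12,700 addresses for at most 5,500 pods; with a /26 it reserves 3,100 addresses, but the subnet size (62), not --max-pods, becomes the per-node ceiling.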
Connectivity Troubleshooting
Networking issues in Kubernetes typically fall into a few buckets: DNS failures, firewall/NetworkPolicy blocks, CNI misconfiguration, or MTU-related packet drops. Systematic debugging is essential.
DNS Debugging
# Step 1: Is kube-dns responding?
$ kubectl run dnsutils --image=tutum/dnsutils --restart=Never -- sleep 3600
$ kubectl exec dnsutils -- nslookup kubernetes.default.svc.cluster.local
# If this fails → kube-dns is not reachable from this namespace
# Step 2: Check /etc/resolv.conf is correct
$ kubectl exec dnsutils -- cat /etc/resolv.conf
nameserver 10.96.0.10 # must be kube-dns service IP
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
# If nameserver is wrong → kubelet is injecting the wrong cluster DNS IP
# Check kubelet's clusterDNS setting (--cluster-dns flag)
# Step 3: Dig directly with TCP (more reliable than UDP)
$ kubectl exec dnsutils -- dig @10.96.0.10 api.default.svc.cluster.local +tcp +noad
;; QUESTION SECTION:
;api.default.svc.cluster.local. IN A
;; ANSWER SECTION:
api.default.svc.cluster.local. 30 IN A 10.0.5.6
# Step 4: Check CoreDNS pod logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns -c coredns
# Look for: formerr, SERVFAIL, refused, timeout entries
# CoreDNS log format: [INFO] plugin/kubernetes ...
Pod-to-Pod Connectivity
# Step 1: Ping the gateway (first hop)
$ kubectl exec web-0 -- ping -c 1 10.244.1.1 # gateway for node-1's subnet
# If ping fails → host routing not set up, check CNI
# Step 2: Ping the destination pod
$ kubectl exec web-0 -- ping -c 1 10.244.2.19 # pod on another node
# If ping fails:
# - Cross-node routing: check CNI (Calico BGP sessions, Flannel routes)
# - Security group/firewall: cloud security group blocking pod CIDR
# - MTU: VXLAN overhead may fragment or drop at path MTU
# Step 3: TCP connect to the actual port
$ kubectl exec web-0 -- nc -zv 10.244.2.19 8080
Ncat: Connected to 10.244.2.19:8080.
# Step 4: Check iptables/eBPF for drops
# For iptables-based CNI (Calico in iptables mode):
$ iptables -t filter -L FORWARD -n -v | grep DROP
# Look for drops in the forward chain
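# For Cilium (run inside a cilium agent pod):
$ cilium monitor --type drop
# Prints dropped packets together with the drop reason
$ cilium bpf lb list
# Verify the service exists and has backends
tcpdump Patterns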
# Capture on the host's veth interface for a specific pod
# Find the veth pair for pod web-0:
$ kubectl get pod web-0 -o jsonpath='{.status.hostIP}'
192.168.1.101
# On node-1, find the veth:
$ ip link show | grep veth
$ bridge fdb show # if using flannel, capture on flannel.1
# Capture VXLAN traffic (Flannel):
$ tcpdump -i eth0 -n port 8472 -c 10
# 8472 is Flannel's VXLAN port. You should see encapsulated UDP.
# If you see nothing → Flannel daemon not running on one of the nodes
# Capture plain IP traffic between pods on same node (Calico):
$ tcpdump -i caliXXXX -n -c 10 # caliXXX = veth interface name
# You'll see plaintext ICMP/TCP between pods (no encapsulation)
# Capture in a pod's network namespace directly:
$ nsenter --net=/var/run/netns/cni-abc123 -- tcpdump -i eth0 -n -c 5
# Useful when you want to see what the pod actually receives
# (not what's on the host veth before iptables processing)
MTU Pitfalls
# Check MTU on the pod interface:
$ kubectl exec web-0 -- ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1420
# The gap: eth0 MTU 1420 vs host eth0 MTU 1500
# 1500 - 1420 = 80 bytes of overhead budget
# Flannel VXLAN: 50 bytes overhead (20-byte outer IP + 8-byte UDP + 8-byte VXLAN header + 14-byte inner Ethernet)
# 1420 + 50 = 1470 — still fits in 1500
# But if the physical network has MTU 1500 AND additional tunnel overhead,
# you may get fragmentation or ICMP "packet too big"
# Common cause: AWS VPC MTU is 1500, but some EC2 instance types have
# reduced MTU on secondary ENIs. Flannel's MTU detection may not pick this up.
# Solution: set explicit MTU in Flannel config:
net-conf.json: |
{
"Network": "10.244.0.0/16",
"MTU": 1400
}
# Cilium auto-detects MTU but can be overridden:
# (the Helm value MTU sets the underlying network MTU; Cilium subtracts tunnel overhead itself)
$ helm install cilium cilium/cilium \
--set MTU=1400
CoreDNS Internal Architecture
CoreDNS replaced kube-dns (SkyDNS): it reached GA in Kubernetes 1.11 and became the default cluster DNS in 1.13. It's a modular DNS server written in Go, where each capability is a plugin. The Corefile configures which plugins are loaded; the order in which they run is fixed at build time (plugin.cfg), not by their order in the Corefile.
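A plugin chain of this kind can be modeled as nested handlers, each of which either answers a query or hands it to the next plugin. The Python sketch below is a toy model, not CoreDNS code (CoreDNS is written in Go): the `cache`, `kubernetes`, and `forward` functions are hypothetical stand-ins named after the real plugins, TTLs are ignored, and the upstream address 192.0.2.10 is a documentation placeholder.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], Optional[str]]


def forward(upstream: Dict[str, str]) -> Handler:
    """Last plugin: resolve anything that reaches it via the upstream map."""
    def handle(qname: str) -> Optional[str]:
        return upstream.get(qname)
    return handle


def kubernetes(zone: str, records: Dict[str, str], next_handler: Handler) -> Handler:
    """Answer names inside the cluster zone; pass everything else on."""
    def handle(qname: str) -> Optional[str]:
        if qname.endswith(zone):
            return records.get(qname)  # None models NXDOMAIN
        return next_handler(qname)
    return handle


def cache(store: Dict[str, Optional[str]], next_handler: Handler) -> Handler:
    """Serve repeated queries from memory (TTL handling omitted)."""
    def handle(qname: str) -> Optional[str]:
        if qname not in store:
            store[qname] = next_handler(qname)
        return store[qname]
    return handle


cache_store: Dict[str, Optional[str]] = {}
chain = cache(
    cache_store,
    kubernetes(
        "cluster.local.",
        {"api.default.svc.cluster.local.": "10.0.5.6"},
        forward({"example.com.": "192.0.2.10"}),
    ),
)

print(chain("api.default.svc.cluster.local."))      # 10.0.5.6
print(chain("example.com."))                        # 192.0.2.10
print(chain("missing.default.svc.cluster.local."))  # None
```

The cache wraps the other two handlers, so both cluster-internal and forwarded answers are cached, and a name inside the cluster zone never leaks to the upstream resolver.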
History: SkyDNS to CoreDNS
SkyDNS (the original) used etcd for storing DNS records and answered queries by walking the etcd tree. It was slow for large clusters and not very extensible. CoreDNS took a different approach: a minimal DNS server with a plugin chain, where Kubernetes integration is just one of many plugins. The kubernetes plugin watches the API server for Services and EndpointSlices (Endpoints on older versions) and synthesizes DNS records on the fly. No etcd dependency.
Corefile Configuration
# Default Corefile in the coredns ConfigMap
$ kubectl get configmap coredns -n kube-system -o yaml
data:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods verified
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . /etc/resolv.conf # forward unknown queries to upstream DNS
cache 30
loop
reload
loadbalance
}
# The plugins in this Corefile. Execution order is fixed at build time
# (plugin.cfg), not by the order of lines here:
# 1. errors → log DNS errors
# 2. health → /health endpoint on :8080 (liveness), with a 5s lameduck period
# 3. ready → /ready endpoint on :8181 (readiness)
# 4. kubernetes → synthesize records for Services and Pods
# 5. prometheus → metrics at :9153/metrics
# 6. forward → forward unknowns to /etc/resolv.conf (node's DNS)
# 7. cache → cache responses, capping TTLs at 30s
# 8. loop → detect forwarding loops and halt
# 9. reload → watch Corefile changes
# 10. loadbalance → shuffle the order of A/AAAA records in each answer
# Understanding "pods verified" mode:
# kubernetes plugin can verify that a pod IP belongs to a pod in the cluster.
# "pods verified" only returns a DNS name for pod IPs that are verified.
# "pods insecure" returns names for all pod IPs (legacy behavior).
# "pods disabled" disables pod DNS entirely.
Scaling CoreDNS
# CoreDNS is deployed as a Deployment with 2+ replicas (by default)
$ kubectl get deployment -n kube-system coredns
NAME READY UP-TO-DATE AVAILABLE
coredns 2/2 2 2
# Each CoreDNS instance can handle ~20K queries/second.
# For large clusters, increase replicas and/or resources.
# The Service for kube-dns is a ClusterIP with 2 endpoints (round-robin).
# HPA for CoreDNS based on metrics (requires metrics-server):
$ kubectl autoscale deployment coredns -n kube-system \
--cpu-percent=70 --min=2 --max=10
# Memory-based HPA (for cache pressure):
# CoreDNS uses a fixed-size cache. High memory usage indicates cache churn.
# Set CoreDNS resource limits explicitly:
resources:
limits:
cpu: 100m
memory: 128Mi
requests:
cpu: 100m
memory: 128Mi
IPv6 and Dual-Stack
Kubernetes supports IPv6-only and IPv4/IPv6 dual-stack clusters. Dual-stack (beta in 1.21, GA in 1.23) is increasingly common as organizations plan for IPv4 exhaustion.
Dual-Stack Configuration
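# kube-apiserver flags for dual-stack
kube-apiserver \
--service-cluster-ip-range=10.96.0.0/12,2001:db8::/108
# kube-controller-manager needs both pod CIDRs (POD_CIDR_V6 is a placeholder)
kube-controller-manager \
--cluster-cidr=10.244.0.0/16,POD_CIDR_V6 \
--service-cluster-ip-range=10.96.0.0/12,2001:db8::/108
# kubelet for dual-stack pod allocation; --cluster-dns is a kubelet flag
# and takes the kube-dns service IPs
kubelet \
--cluster-dns=10.96.0.10,2001:db8::a \
--node-ip=NODE_IPV4,NODE_IPV6 # or let it auto-detect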
# Check that a node has both IPs assigned:
$ kubectl get node node-1 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
192.168.1.101 2001:db8::c0a8:165
# Check a pod with dual-stack (status.podIPs lists one IP per family):
$ kubectl get pod web-0 -o jsonpath='{.status.podIPs[*].ip}'
10.244.1.37 2001:db8::a:f444:125
Dual-Stack Services
# Service with dual-stack ClusterIP
spec:
ipFamilies:
- IPv4
- IPv6
ipFamilyPolicy: RequireDualStack # or PreferDualStack, SingleStack
# With RequireDualStack, the Service gets both a v4 and v6 ClusterIP.
# DNS returns both A and AAAA records.
$ kubectl get svc api -o jsonpath='{.spec.clusterIPs[*]}'
10.0.5.6 2001:db8::6
$ kubectl exec dnsutils -- nslookup api.default.svc.cluster.local
Name: api.default.svc.cluster.local
Address: 10.0.5.6
Address: 2001:db8::6
# A client picks which IP family to use based on its own configuration.
# A dual-stack client typically prefers IPv6 (RFC 6724).
hostPort and hostNetwork
Kubernetes provides two ways to expose a pod directly on the host's network:
hostPort (a container-level port mapping on the host interface) and
hostNetwork: true (the pod joins the host's network namespace directly).
Both sacrifice network isolation and should be used sparingly.
hostPort
hostPort maps a container port to a host port. The pod is still in
its own network namespace, but traffic to hostIP:hostPort on the host
is DNATted to the pod's IP. This requires the CNI portmap plugin to be in the
CNI chain (or Multus for delegation). Only one pod per host can use a given hostPort.
# hostPort pod spec:
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
hostPort: 80 # host IP:80 → podIP:80
protocol: TCP # default
# Equivalent iptables rules (added by portmap CNI):
$ iptables -t nat -A PORTMAP-INGRESS \
-p tcp --dport 80 -j DNAT --to-destination POD_IP:80
# Limitations:
# - Port conflicts: another pod on the same host can't also use hostPort 80
# - Only supports TCP/UDP/SCTP (not arbitrary protocols)
# - hostPort in a Deployment means only one replica per node (use DaemonSet)
# - Doesn't work with Kind (Kubernetes in Docker) without special config
hostNetwork
hostNetwork: true puts the pod in the host's network namespace directly.
The pod sees all host interfaces, binds to host ports directly, and uses the host's
routing table. This is the "host" network mode from Docker. Useful for system daemons
that need to bind to specific host interfaces (notably kube-proxy itself).
# kube-proxy uses hostNetwork (not hostPort): it binds 10256 (healthz) and
# 10249 (metrics) directly in the host's network namespace
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet # hostNetwork pods default to host DNS
containers:
- name: kube-proxy
image: registry.k8s.io/kube-proxy:v1.28.0
# The process binds directly to eth0's IP
# A pod using hostNetwork sees the host's routing table:
$ kubectl run test --rm -it --image=busybox --restart=Never \
--overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true}}' \
-- /bin/sh -c 'ip route'
default via 192.168.1.1 dev eth0 # host's default route
10.244.0.0/16 via 192.168.1.1 dev eth0 # cluster pod network (via CNI)
# These are host routes, not pod-specific routes
# ⚠️ hostNetwork pods share the host's DNS resolver (/etc/resolv.conf)
# With default dnsPolicy: ClusterFirst, hostNetwork pods DON'T use cluster DNS.
# They use whatever DNS the host uses (typically from /etc/resolv.conf on the node).
# Fix: set dnsPolicy: ClusterFirstWithHostNet to force cluster DNS.
# Security implication: hostNetwork pods can bind to any host port,
# including privileged ports (<1024) if the container runs as root.
The default dnsPolicy is ClusterFirst for every pod, but for a pod with
hostNetwork: true it falls back to the behavior of Default
(use the node's resolv.conf directly), which causes
DNS resolution failures for hostNetwork pods accessing cluster Services by name.
Set dnsPolicy: ClusterFirstWithHostNet explicitly on hostNetwork pods that need cluster DNS.
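The fallback rule documented for hostNetwork pods is small enough to state as a function. This is a minimal Python sketch of that rule only; `effective_dns_policy` is a hypothetical helper, not a Kubernetes API, and it returns the policy whose behavior the pod actually gets.

```python
def effective_dns_policy(dns_policy: str = "ClusterFirst", host_network: bool = False) -> str:
    """Return the DNS behavior a pod ends up with for a given spec."""
    if dns_policy == "ClusterFirst" and host_network:
        # hostNetwork pods with ClusterFirst behave as if Default were set
        return "Default"
    return dns_policy


print(effective_dns_policy())                                              # ClusterFirst
print(effective_dns_policy(host_network=True))                             # Default
print(effective_dns_policy("ClusterFirstWithHostNet", host_network=True))  # ClusterFirstWithHostNet
```

Only the second case is surprising: the pod spec still says ClusterFirst, yet Service names do not resolve because the node's resolver is used.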
Egress Policies and Egress Proxies
By default, pods can egress to any destination (0.0.0.0/0) unless a NetworkPolicy restricts them. For many organizations, this is insufficient — they need centralized egress control (all outbound traffic through a proxy or firewall). There are several patterns for this.
Egress Gateway Pattern
An egress gateway is a pod (or set of pods) that all egress traffic must route
through. It is usually implemented by the CNI (for example Calico Enterprise egress
gateways or Cilium's CiliumEgressGatewayPolicy), or by policy-routing and iptables
rules on the nodes that steer egress to a specific pod.
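# Calico egress gateway (illustrative only: egress gateways are a Calico Enterprise/Cloud
# feature, and the exact resource schema depends on the product version)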
apiVersion: crd.projectcalico.org/v1
kind: EgressGateway
metadata:
name: egress-gw
namespace: kube-system
spec:
nodeSelector: egress-node == "true"
destination: {
CIDR: 0.0.0.0/0 # intercept all egress
}
---
# Restrict egress with a NetworkPolicy: pods in namespace "production" may only
# connect to the egress gateway pods (app: egress-gw) on TCP 443. NetworkPolicy
# only allows or denies traffic; it does not reroute or NAT it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: force-egress-via-gateway
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- podSelector:
matchLabels:
app: egress-gw
ports:
- protocol: TCP
port: 443
# Limitation: this only works for pods the gateway can route to.
# A true egress gateway requires CNI support to redirect all egress.
Egress Proxy via Sidecar
A more common pattern: route all external traffic through a transparent proxy (Envoy, Squid, etc.) running as a sidecar or per-namespace proxy. Applications don't know their traffic is being proxied (unless the proxy does TLS interception).
# Per-namespace egress proxy using an egress-sidecar pattern:
# 1. A "egress-proxy" deployment runs as a normal ClusterIP service.
# 2. All pods in the namespace have an init container that sets up
# iptables to redirect external traffic to the proxy.
# 3. The proxy forwards to the actual destination.
# egress-proxy deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: egress-proxy
namespace: production
spec:
replicas: 2
selector:
matchLabels:
app: egress-proxy
template:
metadata:
labels:
app: egress-proxy
spec:
containers:
- name: squid
image: sameersbn/squid:3.5.27
ports:
- containerPort: 3128
---
# The init container (runs before app container) sets up redirect:
# NOTE: running iptables inside a container requires the NET_ADMIN capability
# on the init container (or a privileged container).
initContainers:
- name: redirect-egress
image: gcr.io/google-containers/iptables-whitelist:latest
securityContext:
capabilities:
add:
- NET_ADMIN
command:
- sh
- -c
- |
iptables -t nat -A OUTPUT -p tcp ! -d 10.0.0.0/8 \
-j DNAT --to-destination 10.0.5.100:3128
iptables -t nat -A PREROUTING -p tcp ! -d 10.0.0.0/8 \
-j DNAT --to-destination 10.0.5.100:3128
Cloud Provider Egress Controls
# AWS: NAT Gateway for controlled egress
# All private subnets route 0.0.0.0/0 through a NAT Gateway (managed by AWS).
# The NAT Gateway has an Elastic IP and is the only way out.
# NAT Gateways cannot have security groups attached; restrict reachable destinations
# with subnet network ACLs or AWS Network Firewall.
# AWS PrivateLink for AWS service access (no internet egress needed)
# For pods that need to access S3, DynamoDB, etc. without going to the internet:
# Use VPC endpoints: gateway endpoints for S3 and DynamoDB, interface endpoints
# (PrivateLink) for most other services.
# Pods access s3.<region>.amazonaws.com → traffic is routed or resolved to the endpoint.
# No packet leaves the VPC.
# GCP: Cloud NAT
# Cloud Router + Cloud NAT distributes egress for all GKE node IPs.
# All outbound traffic from GKE nodes goes through Cloud NAT → one IP or range.
# Useful when you need a predictable source IP for firewall rules.
# Azure: NAT Gateway
# Similar to AWS/GCP. Associate a NAT Gateway with the subnet AKS uses.
# All egress from pods will use the NAT Gateway's public IP.
Tradeoffs
- Pluggable — swap CNI, swap kube-proxy, swap Ingress controller independently
- eBPF dataplanes use hash-map lookups, so per-packet cost stays roughly constant as Services and policy rules grow
- Gateway API finally gives portable L7 routing without annotation soup
- NetworkPolicy is declarative, GitOps-friendly firewall
- CoreDNS's plugin architecture is extensible (whoami, geoip, etc.)
- Headless Services enable stateful workload discovery without service mesh
- Dual-stack IPv4/IPv6 lets you adopt IPv6 gradually while existing IPv4 pods and Services keep working unchanged
- iptables kube-proxy is O(N) — large clusters need IPVS or eBPF
- NetworkPolicy is only enforced if the CNI implements it — Flannel users get no firewall
- Overlay networks add encapsulation overhead and MTU pitfalls
- Multi-cluster networking (flat IP across clusters) is hard, not standardized
- hostPort conflicts prevent multiple replicas on the same node
- hostNetwork pods bypass NetworkPolicy and share the host's DNS context
- Sidecar-based service meshes add per-request latency and memory overhead
- Dual-stack on some CNIs (Flannel) falls back to v4-only, silently dropping v6
- Service IP exhaustion is possible in very large clusters if the Service CIDR is small (a /20 holds about 4,000 ClusterIPs; a /12 holds over a million)
- CoreDNS is the only DNS backend — there's no built-in multi-cluster federation
Frequently Asked Questions
What does CNI actually do?
CNI (Container Network Interface) is a tiny spec: when a Pod is created, the container runtime (on kubelet's behalf) calls a CNI plugin (a binary in /opt/cni/bin) with a JSON config and the path to the Pod's network namespace. The plugin's job is to assign an IP, create the network interface inside the namespace (typically a veth pair), and program any routing/firewall needed. CNI doesn't say HOW you do it: Calico uses BGP, Flannel uses VXLAN, Cilium uses eBPF. The interface is just: 'set up the network for this namespace, and tear it down when the pod dies.'
What's the difference between iptables, IPVS, and nftables modes for kube-proxy?
kube-proxy implements Service load balancing by programming kernel rules. iptables mode (the default) creates one chain per Service plus one rule per endpoint; this is fine up to a few thousand Services, after which the linear rule scan dominates. IPVS mode uses the kernel's IPVS subsystem (designed for load balancing), whose hash-table lookups scale to tens of thousands of Services. nftables mode (alpha in 1.29, beta in 1.31, GA in 1.33) is the modern replacement for iptables; it uses verdict maps instead of long rule chains, so Service lookup no longer scans linearly. Many large production clusters use IPVS or have switched to Cilium's eBPF dataplane (which replaces kube-proxy entirely).
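The scaling difference comes down to how a packet's destination is matched against the Service table. The Python sketch below is a toy comparison, not kernel code: `linear_lookup` models an ordered rule list that is tested one rule at a time, `hash_lookup` models a single hash-table probe, and the 5,000 generated ClusterIPs and backend names are made up for illustration.

```python
def linear_lookup(rules, vip):
    """Ordered rule list: test rules in sequence until one matches."""
    for checked, (rule_vip, backend) in enumerate(rules, start=1):
        if rule_vip == vip:
            return backend, checked
    return None, len(rules)


def hash_lookup(table, vip):
    """Hash table keyed by virtual IP: one probe regardless of table size."""
    return table.get(vip), 1


# 5,000 hypothetical Services, each with one backend
rules = [(f"10.96.{i // 256}.{i % 256}", f"backend-{i}") for i in range(5000)]
table = dict(rules)
last_vip = rules[-1][0]

print(linear_lookup(rules, last_vip))  # ('backend-4999', 5000)
print(hash_lookup(table, last_vip))    # ('backend-4999', 1)
```

Both functions return the same backend; the second element of the tuple is the number of comparisons, which grows with the rule count only in the linear case.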
How does Cilium's eBPF dataplane skip kube-proxy?
Cilium attaches eBPF programs to socket creation, network interfaces, and tc hooks. Service translation happens at the socket level: when a Pod connects to a Service ClusterIP, eBPF rewrites the destination to a backend Pod IP before the packet ever leaves the Pod's stack. No iptables, no kube-proxy daemon, no DNAT chains. NetworkPolicy enforcement also moves to eBPF, evaluated per-packet at line rate. The result: lower latency, much better scaling on large clusters (rule count doesn't matter), and observability (Hubble) for free.
Ingress vs Gateway API — which to use?
Ingress is the original (2015) API for HTTP routing — limited to host/path rules, every controller adds non-portable annotations for everything else. Gateway API (GA in 2023) replaces it with a richer, role-oriented design: GatewayClass (the controller offering), Gateway (the listener), HTTPRoute/TLSRoute/TCPRoute (the rules). Gateway API supports header-based matching, traffic splitting, request modification, cross-namespace routing — all without annotations. New deployments should use Gateway API; existing Ingress configs still work but won't see new features.
What does NetworkPolicy enforce?
NetworkPolicy is a Pod-selector-based firewall expressed in Kubernetes objects. You select Pods (matchLabels) and declare ingress/egress rules: what other Pods or CIDR blocks can talk to them on which ports. Critical caveat: NetworkPolicy is only enforced if your CNI implements it. Calico, Cilium, and Antrea do; Flannel on its own does not. With no CNI policy enforcement, NetworkPolicy objects are stored by the API server but have no effect. Kubernetes does not report whether a policy is being enforced, so the reliable check is to try a connection that should be denied.
How does dual-stack work?
Dual-stack (GA in 1.23) runs IPv4 and IPv6 simultaneously. Each Pod gets one IP from each family; Services can be SingleStack (one family), PreferDualStack (both families if the cluster supports them, otherwise one), or RequireDualStack (both, or the Service is rejected). The Pod's network namespace has both addresses on its interface. The kernel routes by family. ClusterIP Services have ipFamilyPolicy and ipFamilies fields. Most cloud CNIs and Cilium handle this transparently; some legacy CNIs don't, in which case Services fall back to v4-only.
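The three ipFamilyPolicy values can be summarized as a small decision function. This is a simplified Python sketch of the documented behavior, not the API server's allocator: `assign_ip_families` is a hypothetical name, and `cluster_families` stands for the families the cluster is configured with, primary family first.

```python
def assign_ip_families(policy, cluster_families, requested=None):
    """Return the IP families a Service ends up with under a given policy."""
    order = list(requested) if requested else list(cluster_families[:1])
    unsupported = [family for family in order if family not in cluster_families]
    if unsupported:
        raise ValueError(f"cluster does not support: {unsupported}")
    # Families the cluster offers beyond the ones already chosen
    extra = [family for family in cluster_families if family not in order]
    if policy == "SingleStack":
        return order[:1]
    if policy == "PreferDualStack":
        return order + extra  # second family only if the cluster has one
    if policy == "RequireDualStack":
        families = order + extra
        if len(families) < 2:
            raise ValueError("RequireDualStack needs a dual-stack cluster")
        return families
    raise ValueError(f"unknown ipFamilyPolicy: {policy}")


dual = ["IPv4", "IPv6"]
print(assign_ip_families("SingleStack", dual))                  # ['IPv4']
print(assign_ip_families("PreferDualStack", ["IPv4"]))          # ['IPv4']
print(assign_ip_families("PreferDualStack", dual, ["IPv6"]))    # ['IPv6', 'IPv4']
print(assign_ip_families("RequireDualStack", dual))             # ['IPv4', 'IPv6']
```

The first entry of the result corresponds to the Service's primary ClusterIP, which is why requesting IPv6 first changes the order rather than just the set.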