Networking Stack

Packet Path from NIC to Socket — NAPI, XDP, qdiscs, eBPF

Overview

Every packet that enters or leaves a Linux machine passes through layers of interrupt handling, memory allocation, queueing disciplines, network filters, routing lookups, and protocol state machines. Understanding this stack is not academic: it is the difference between a system that saturates a 100GbE link and one that peaks at 20 Gb/s because of a single misconfigured sysctl or a missing RPS mask. The Linux network stack sits at the intersection of kernel architecture, hardware capabilities, and distributed systems. Knowing where a packet is at any point in time helps you debug latency spikes, DDoS amplification, NAT table exhaustion, or mysterious 5% packet loss at line rate.

This page covers the full ingress and egress paths, from the moment a NIC DMA ring fills with a received frame to the moment a process calls recv() and the data is copied into its buffer. In between, there's NAPI polling, Generic Receive Offload (GRO), XDP hooks, netfilter PREROUTING/INPUT/FORWARD/OUTPUT/POSTROUTING chains, the conntrack state table, IP routing, TCP congestion control and rate limiting, socket backlog queueing, and the qdisc that shapes what actually hits the wire. Each of these stages has observable metrics, tunable parameters, and failure modes. The tools in the Tools & Metrics section let you observe and adjust all of them.

The modern stack also includes eBPF-based dataplanes (Cilium, Meta's Katran, Cloudflare's Unimog) that largely bypass netfilter for performance. We'll cover both the classic iptables/nftables path and the eBPF alternative, because real production systems use both simultaneously and you need to understand their interaction. The packet flow diagram below is the reference for the rest of the page.

A quick note on terminology: a softirq is a deferred interrupt handler that runs in softirq (bottom-half) context, outside any process, with hardware interrupts enabled. The network stack's packet processing runs almost entirely in softirq context: net_rx_action and net_tx_action are the two softirq handlers. This means packet processing happens at high priority but cannot sleep, allocate with GFP_KERNEL, or take sleeping locks. The fast path therefore relies on lockless ring buffers, per-CPU variables, and atomic (GFP_ATOMIC) allocation from slab caches and page pools.

Full Packet Flow: Ingress and Egress

╔══════════════════════════════════════════════════════════════════╗
║                    INGRESS (RECEIVE) PATH                       ║
╚══════════════════════════════════════════════════════════════════╝

  NIC hardware
  ┌─────────────────────────────────────────────────────────────┐
  │  DMA ring: [desc0][desc1][desc2]...[descN]                  │
  │  Packet arrives → DMA to host memory → NIC raises IRQ        │
  └────────────────────┬────────────────────────────────────────┘
                       │
                 Hard IRQ (IRQLine)
                       │
         ┌─────────────▼──────────────┐
         │   Driver IRQ handler        │
         │   (one per RX interrupt)    │
         │   - disable further IRQs    │
         │   - napi_schedule(&napi)    │
         │   - returns IRQ_HANDLED    │
         └─────────────┬──────────────┘
                       │
         ┌─────────────▼──────────────┐
         │  NET_RX softirq            │
         │  net_rx_action()           │
         │  - walks NAPI poll list    │
         │  - calls driver poll()     │
         │  - stops at netdev_budget  │
         └─────────────┬──────────────┘
                       │
         ┌─────────────▼──────────────┐
         │  NAPI poll() callback      │
         │  - while (work < budget)   │
         │  - read RX descriptor      │
         │  - refill DMA ring         │
         └─────────────┬──────────────┘
                       │
         ┌─────────────▼──────────────┐
         │  XDP (optional)            │
         │  - runs inside driver poll │
         │  - before skb allocation   │
         │  - verdict: DROP / TX /    │
         │    REDIRECT / PASS         │
         └─────────────┬──────────────┘
                       │
         ┌─────────────▼────────────────────────────────────────┐
         │  GRO (Generic Receive Offload)                       │
         │  - driver builds sk_buff, calls napi_gro_receive()   │
         │  - merges packets of same flow (TCP)                 │
         │  - builds larger sk_buff from multiple frames        │
         │  - saves per-packet overhead                         │
         └─────────────┬────────────────────────────────────────┘
                       │
         ┌─────────────▼────────────────────────────────────────┐
         │  netif_receive_skb()                                 │
         │  - RPS may steer to another CPU's backlog            │
         │  - otherwise delivers directly to protocol handler   │
         └─────────────┬────────────────────────────────────────┘
                       │
         ┌─────────────▼────────────────────────────────────────┐
         │  netfilter: PREROUTING                               │
         │  (raw → conntrack → mangle → nat)                    │
         │  - DNAT can happen here                              │
         │  - NOTRACK (raw table) bypasses conntrack            │
         └─────────────┬────────────────────────────────────────┘
                       │
         ┌─────────────▼────────────────────────────────────────┐
         │  Routing decision                                    │
         │  ip_route_input() → fib_lookup()                     │
         │  - LOCAL   → deliver to a local socket               │
         │  - FORWARD → send on toward the next hop             │
         └─────────────┬────────────────────────────────────────┘
                       │
             ┌─────────┴───────────┐
             │                     │
      LOCAL DELIVERY            FORWARD
             │                     │
   ┌─────────▼─────────┐  ┌────────▼──────────┐
   │  IP layer         │  │  IP forward       │
   │  ip_local_deliver │  │  ip_forward()     │
   │  - defragment     │  │  - TTL decrement  │
   │                   │  │  - MTU check      │
   └─────────┬─────────┘  └────────┬──────────┘
             │                     │
   ┌─────────▼─────────┐  ┌────────▼──────────┐
   │  netfilter INPUT  │  │ netfilter FORWARD │
   │  (mangle, filter, │  │ (mangle, filter)  │
   │   nat)            │  │                   │
   └─────────┬─────────┘  └────────┬──────────┘
             │                     │
   ┌─────────▼─────────┐  ┌────────▼──────────┐
   │  TCP / UDP        │  │  IP output        │
   │  (upper layer)    │  │  ip_output()      │
   └─────────┬─────────┘  └────────┬──────────┘
             │                     │
   ┌─────────▼─────────┐  ┌────────▼──────────┐
   │  socket rx queue  │  │  netfilter        │
   │  sk_receive_queue │  │  POSTROUTING      │
   │  → recv() syscall │  │  (mangle, nat)    │
   └───────────────────┘  └────────┬──────────┘
                                   │
         ┌─────────────────────────▼────────────────┐
         │  ip_finish_output()                      │
         │  - fragments if packet exceeds MTU       │
         │  - neighbour (ARP) lookup for next hop   │
         └─────────────────────────┬────────────────┘
                                   │
         ┌─────────────────────────▼────────────────┐
         │  qdisc enqueue                           │
         │  - fq_codel / cake / pfifo_fast          │
         │  - sk_buff queued, TX pending            │
         └─────────────────────────┬────────────────┘
                                   │
         ┌─────────────────────────▼────────────────┐
         │  qdisc dequeue                           │
         │  - pulls from queue                      │
         │  - hands to NIC driver                   │
         └─────────────────────────┬────────────────┘
                                   │
         ┌─────────────────────────▼────────────────┐
         │  NIC DMA ring TX                         │
         │  NIC raises IRQ on TX complete           │
         └──────────────────────────────────────────┘


╔══════════════════════════════════════════════════════════════════╗
║                    EGRESS (TRANSMIT) PATH                        ║
╚══════════════════════════════════════════════════════════════════╝

  Application
  ┌────────────────────┐
  │  send() / write()  │
  └─────────┬──────────┘
            │
  ┌─────────▼──────────┐
  │  socket send queue │
  │  sk_write_queue    │
  │  - copies from user│
  │  - queues for TX   │
  └─────────┬──────────┘
            │
  ┌─────────▼──────────────────────────────────────────┐
  │  tcp_transmit_skb() / udp_sendmsg()                │
  │  - builds full packet (headers, options)           │
  │  - calls ip_queue_xmit() for TCP                   │
  └─────────┬──────────────────────────────────────────┘
            │
  ┌─────────▼──────────────────────────────────────────┐
  │  Routing decision (ip_route_output_*, fib_lookup)  │
  │  - determines outgoing interface                   │
  │  - selects source IP, next hop                     │
  │  - sets the skb's dst entry; cached on the socket  │
  └─────────┬──────────────────────────────────────────┘
            │
  ┌─────────▼──────────────────────────────────────────┐
  │  netfilter OUTPUT chain                            │
  │  (raw → conntrack → mangle → nat → filter)         │
  │  - firewalls locally generated traffic             │
  │  - may trigger a re-route if NAT or marks change   │
  └─────────┬──────────────────────────────────────────┘
            │
  ┌─────────▼──────────────────────────────────────────┐
  │  netfilter POSTROUTING chain                       │
  │  (mangle → nat)                                    │
  │  - used by NAT to rewrite source address           │
  │  - mangle can classify for tc (e.g., an htb class) │
  └─────────┬──────────────────────────────────────────┘
            │
  ┌─────────▼──────────────────────────────────────────┐
  │  qdisc enqueue (dev_queue_xmit)                    │
  │  - enqueues skb to qdisc root                      │
  │  - may drop (e.g., fq_codel on backlog)            │
  │  - raises NET_TX softirq if dequeue is deferred    │
  └─────────┬──────────────────────────────────────────┘
            │
  ┌─────────▼──────────────────────────────────────────┐
  │  qdisc dequeue (sender context or NET_TX softirq)  │
  │  - pulls skb from qdisc structure                  │
  │  - dev_hard_start_xmit(skb, dev)                   │
  │  - calls driver's ndo_start_xmit()                 │
  └─────────┬──────────────────────────────────────────┘
            │
  ┌─────────▼──────────────────────────────────────────┐
  │  Driver / NIC TX                                   │
  │  - writes to DMA TX ring                           │
  │  - NIC raises TX completion IRQ when done          │
  │  - driver's NAPI poll frees completed skbs         │
  └────────────────────────────────────────────────────┘

The two paths are asymmetric. Ingress is driven by interrupts and softirq polling, so the kernel does not control when work arrives; egress usually starts in the sending process's context and is paced by socket buffers and the qdisc. Each stage in both paths is instrumentable; see Tools & Metrics below.

Key Numbers

~14.88 Mpps

10 GbE line rate at 64-byte packets

~148.8 Mpps

100 GbE line rate at 64-byte packets

~6.7 ns

budget per packet at 100 GbE (~67 ns at 10 GbE)

64 KB

GRO max coalesced segment

~300 B

conntrack entry size

2003 / 2014 / 2016

NAPI / nftables / XDP merge dates
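
The packet-rate and per-packet figures above can be rederived from the frame size. The following is a minimal sketch, assuming standard Ethernet framing where every frame costs an extra 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap). The function names are illustrative and not part of any kernel or library API.

```c
/* Bytes on the wire per frame that are not part of the frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20

/* Packets per second at line rate for a link speed (bits/s) and frame size. */
double line_rate_pps(double link_bps, unsigned frame_bytes)
{
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0);
}

/* Time available to process one packet, in nanoseconds, at that rate. */
double per_packet_budget_ns(double link_bps, unsigned frame_bytes)
{
    return 1e9 / line_rate_pps(link_bps, frame_bytes);
}
```

For 64-byte frames, line_rate_pps(10e9, 64) is about 14.88 million and per_packet_budget_ns(10e9, 64) is about 67.2; at 100e9 the budget shrinks to about 6.72 ns, which is why work at that speed has to be spread across several RX queues and CPUs.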

NAPI Internals: Budgets, Weights, and Softirq

NAPI (New API) was merged in kernel 2.6 and fundamentally changed how the receive interrupt path works. Before NAPI, every packet caused an IRQ; at 10Gbps line rate that was ~15 million IRQs per second, each stealing CPU time from user processes. NAPI replaces this with a hybrid interrupt/polling model that automatically adapts to load — at low traffic you get interrupts (low latency), and at high traffic you switch to pure polling (zero IRQ overhead).

The key data structure is the napi_struct, registered by each driver. It contains a poll function pointer, a weight (the maximum number of packets one poll() call may process), and state flags. Most drivers use the kernel default NAPI_POLL_WEIGHT of 64, so each poll call handles up to 64 packets before yielding, and the core warns about weights above that. A large ring (e.g., 4096 descriptors) is therefore drained over several poll calls; how many of them fit into one softirq run is governed by net.core.netdev_budget rather than by the weight. A larger budget drains bursts faster, at the cost of making other NAPI instances and tasks on the same CPU wait.
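
The interplay between the per-instance weight and the global budget is easier to see in a toy model. The sketch below is a userspace simulation, not kernel code: simulate_rx_run and struct rx_run are hypothetical names, the weight is fixed at 64, and instances are polled round-robin. Like the kernel loop, it checks the budget only after a poll, so the last poll can overshoot it.

```c
#include <stddef.h>

#define NAPI_WEIGHT 64            /* per-poll cap, mirroring NAPI_POLL_WEIGHT */

struct rx_run {
    int processed;                /* packets handled during this run */
    int squeezed;                 /* 1 if the run ended with work left over */
};

/* Simulate one softirq run over n NAPI instances.
 * pending[i] is the number of packets waiting in instance i's ring and is
 * decremented in place. Each poll handles at most NAPI_WEIGHT packets; the
 * run stops once the global budget has been used up. */
struct rx_run simulate_rx_run(int *pending, size_t n, int budget)
{
    struct rx_run r = { 0, 0 };
    int more = 1;

    while (more && budget > 0) {
        more = 0;
        for (size_t i = 0; i < n && budget > 0; i++) {
            int work = pending[i] < NAPI_WEIGHT ? pending[i] : NAPI_WEIGHT;

            pending[i]  -= work;
            r.processed += work;
            budget      -= work;  /* checked only after the poll, so it may go negative */
            if (pending[i] > 0)
                more = 1;         /* this instance would stay on the poll list */
        }
    }

    /* Work left behind corresponds to a time_squeeze event in the kernel. */
    for (size_t i = 0; i < n; i++)
        if (pending[i] > 0)
            r.squeezed = 1;
    return r;
}
```

With two rings holding 1000 and 10 packets and a budget of 300, the run handles 330 packets (five polls of 64 on the busy ring plus the 10 from the quiet one), leaves 680 behind, and reports a squeeze. With rings of 50 and 20 packets it finishes all 70 with no squeeze.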

// Simplified NAPI lifecycle — what actually happens per-packet

// 1. HARD IRQ (one per RX interrupt, not per packet)
// Each interrupt fires once, then is disabled until poll completes.
irqreturn_t e1000_intr(int irq, void *data)
{
    struct adapter *adapter = data;

    // Mask further RX interrupts; the poll function re-enables them once the
    // ring is drained. (Real drivers mask at the device, not with disable_irq.)
    disable_irq_nosync(irq);

    // Schedule the softirq that will do the actual polling.
    // This is the key: ONE interrupt -> many packets processed in softirq.
    napi_schedule(&adapter->napi);
    return IRQ_HANDLED;
}

// 2. SOFTIRQ: net_rx_action() runs in softirq context (BH disabled)
// net_rx_action walks the NAPI instances scheduled on this CPU, calling
// each one's poll() until the global budget or the time limit runs out.
// Simplified from net/core/dev.c; details vary by kernel version.

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = this_cpu_ptr(&softnet_data);
    unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs);
    int budget = netdev_budget;          // sysctl net.core.netdev_budget, default 300
    LIST_HEAD(list);                     // instances to poll in this run
    LIST_HEAD(repoll);                   // instances that still have work

    // Hard IRQs are off only while we take a private copy of the poll list.
    local_irq_disable();
    list_splice_init(&sd->poll_list, &list);
    local_irq_enable();

    while (!list_empty(&list)) {
        struct napi_struct *n = list_first_entry(&list,
                                                 struct napi_struct, poll_list);

        // napi_poll() calls n->poll(n, n->weight) and moves n to 'repoll'
        // if it used its whole weight. budget is the global limit;
        // weight (from netif_napi_add()) is the per-instance limit.
        budget -= napi_poll(n, &repoll);

        if (budget <= 0 || time_after_eq(jiffies, time_limit)) {
            sd->time_squeeze++;          // column 3 of /proc/net/softnet_stat
            break;
        }
    }

    // Unfinished instances go back on sd->poll_list and NET_RX_SOFTIRQ is
    // raised again (omitted here).
}

// 3. DRIVER POLL callback — called from net_rx_action
// This is where packets are drained from the DMA ring.
static int e1000_poll(struct napi_struct *napi, int budget)
{
    struct adapter *adapter = container_of(napi, struct adapter, napi);
    int work = 0;

    // Process up to 'budget' packets
    while (work < budget) {
        struct sk_buff *skb;
        union e1000_rx_desc *rx_desc;
        unsigned int len;

        rx_desc = get_next_rx_desc(adapter, &adapter->rx_ring);
        if (!rx_desc->wb.upper.length)   // no more packets
            break;

        len = le16_to_cpu(rx_desc->wb.upper.length);
        skb = build_skb(rx_desc->buffer_info->data, adapter->rx_buffer_len);

        // Hand it up the stack
        napi_gro_receive(napi, skb);     // includes GRO merge step
        work++;
    }

    // If we drained the ring (work < budget), leave polling mode and
    // re-enable the interrupt. If the ring still has packets, we stay
    // scheduled and are polled again from softirq context with no new IRQ.
    // adapter->irq is an assumed field holding the IRQ number; enable_irq()
    // undoes the disable_irq_nosync() done in the IRQ handler.
    if (work < budget && napi_complete_done(napi, work))
        enable_irq(adapter->irq);
    return work;
}

// 4. NAPI weight vs. budget: which one to tune
//
// weight: per-poll cap chosen by the driver, normally NAPI_POLL_WEIGHT (64).
//         Not a sysctl; the core warns if a driver registers more than 64.
// budget: per-softirq-run cap across all NAPI instances on the CPU.
//
// sysctl tuning:
//   net.core.netdev_budget        global max packets per net_rx_action run (default 300)
//   net.core.netdev_budget_usecs  time limit per net_rx_action run
//                                 (default 2 jiffies: 2000 µs at HZ=1000)
//
// Raising netdev_budget helps drain backlogs faster at the cost of longer
// softirq runs on that CPU. It does not spread work to idle cores; use IRQ
// affinity, more RX queues, or RPS for that.

// 5. Per-CPU softnet_data
// Each CPU has its own struct softnet_data:
//   /proc/net/softnet_stat: cumulative stats, one row per CPU, hex values
//     column 1:  packets processed
//     column 2:  packets dropped because the backlog queue was full
//     column 3:  time_squeeze (runs that ended with budget or time exhausted)
//     column 9:  cpu_collision (TX lock contention; 0 on recent kernels)
//     column 10: packets steered to this CPU by RPS

$ cat /proc/net/softnet_stat
00000000 0000003c 00000000 00000000 00000000 00000000 ...  # CPU0: 60 packets dropped
00000000 00000152 00000000 00000000 ...                      # CPU1: 338 packets dropped

The net_rx_action softirq runs on the CPU that took the NIC's interrupt, which is determined by that IRQ's affinity mask. If the IRQ is pinned to CPU 0, all packet processing happens on CPU 0 unless RPS redirects work. The /proc/net/softnet_stat file is the primary view into per-CPU softirq saturation: if column 3 (time_squeeze) is non-zero and growing, raise netdev_budget, spread the NIC's queues across more CPUs via IRQ affinity, or enable RPS to move work to other CPUs.
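
Reading those counters programmatically is mostly a matter of parsing hexadecimal columns. The following is a minimal sketch, assuming the first three columns are processed, dropped, and time_squeeze as on current kernels; struct softnet_row and both function names are hypothetical helpers, not an existing API.

```c
#include <stdio.h>

/* First three columns of one /proc/net/softnet_stat row (one row per CPU). */
struct softnet_row {
    unsigned long processed;      /* column 1 */
    unsigned long dropped;        /* column 2 */
    unsigned long time_squeeze;   /* column 3 */
};

/* Parse one line of space-separated hex words.
 * Returns 0 on success, -1 if fewer than three columns could be read. */
int parse_softnet_line(const char *line, struct softnet_row *row)
{
    unsigned long processed, dropped, squeezed;

    if (sscanf(line, "%lx %lx %lx", &processed, &dropped, &squeezed) != 3)
        return -1;

    row->processed    = processed;
    row->dropped      = dropped;
    row->time_squeeze = squeezed;
    return 0;
}

/* The counters are cumulative, so compare two samples of the same CPU.
 * Returns 1 if drops or squeezes grew between them. */
int softnet_needs_attention(const struct softnet_row *prev,
                            const struct softnet_row *cur)
{
    return cur->dropped > prev->dropped ||
           cur->time_squeeze > prev->time_squeeze;
}
```

A monitoring loop would read the file line by line with fgets(), parse each row, and compare it against the previous sample for the same CPU a few seconds earlier.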

Socket Layer: How connect(), accept(), recv(), send() Work

The socket API (Berkeley sockets) is the boundary between userspace and the kernel's network stack. Every send(), recv(), connect(), and accept() is a system call that crosses into kernel space. The kernel's socket implementation lives in net/socket.c, net/ipv4/af_inet.c, and the protocol handlers (tcp.c, udp.c). Understanding this layer helps you reason about backlog queue overflows, connection establishment latency, and why a process might block despite data being available.

// TCP client: connect() — the full path
//
// Application:
//   int fd = socket(AF_INET, SOCK_STREAM, 0);
//   connect(fd, (struct sockaddr *)&addr, sizeof(addr));
//
// Inside the kernel:

1. sock_create()        // at socket(): creates struct socket + struct sock
                        // allocates inet_sock, sets sk_family = AF_INET
2. tcp_v4_connect()     // calls ip_route_connect() to find route
                        // picks source IP based on route
                        // sets sk_daddr / sk_rcv_saddr
                        // sets state = TCP_SYN_SENT
3. inet_hash_connect()  // called from tcp_v4_connect()
                        // picks an ephemeral port (net.ipv4.ip_local_port_range)
                        // inserts the socket into the established hash (ehash)
4. tcp_connect()        // allocates an sk_buff for the SYN
                        // calls tcp_transmit_skb() which calls ip_queue_xmit()
                        // sends SYN (seq = iss, flags = SYN)
                        // starts retransmission timer (RTO)

5. A blocking connect() now sleeps in TCP_SYN_SENT until the SYN+ACK arrives,
   then returns 0 with the socket in TCP_ESTABLISHED.
   A non-blocking connect() returns -1 with errno = EINPROGRESS right away;
   poll() for writability, then read SO_ERROR to learn the outcome.


// TCP server: bind() + listen() + accept()
//
// Application:
//   int fd = socket(AF_INET, SOCK_STREAM, 0);
//   bind(fd, (struct sockaddr *)&addr, sizeof(addr));
//   listen(fd, 128);    // backlog = 128
//   int client = accept(fd, NULL, NULL);

1. sock_create()         // creates struct socket for the listening socket
                         // allocates inet_sock, sets sk_family = AF_INET

2. inet_bind()          // binds to specific port and optionally specific IP
                         // checks for port conflicts (SO_REUSEADDR, SO_REUSEPORT)
                         // registers in the bind hash table (bhash)

3. inet_listen()        // sets sk->sk_max_ack_backlog from the backlog argument
                         // calls inet_csk_listen_start()
                         // initializes the accept queue (icsk->icsk_accept_queue)
                         // sk->sk_prot->hash() adds to the listening hash table
                         // socket is now in state TCP_LISTEN

4. accept()              // inet_accept() calls sk->sk_prot->accept(), which for
                         // TCP is inet_csk_accept()

inet_csk_accept():
    // Takes the first connection off the accept queue.
    // If the queue is empty: a non-blocking socket gets -EAGAIN,
    // a blocking socket sleeps in inet_csk_wait_for_connect().

    // Two queues matter for a listener:
    //   - accept queue: fully established, ready for accept() [accept() drains this]
    //                   (icsk->icsk_accept_queue, bounded by sk->sk_max_ack_backlog)
    //   - SYN queue:    connections in SYN_RECV state (3-way handshake in progress)
    //                   (request_socks; since Linux 4.4 they live in ehash,
    //                    not in a per-listener list)
    // sk->sk_error_queue is unrelated to connection setup: it carries
    // asynchronous errors read with MSG_ERRQUEUE.

    inet_csk_reqsk_queue_add():
        // Called from the handshake path when the final ACK arrives.
        // Adds the new child socket to the accept queue, where it waits
        // until accept() picks it up.

// Connection backlog: what happens when it overflows
//
// The accept queue is bounded by sk->sk_max_ack_backlog, which listen(backlog)
// sets (capped by net.core.somaxconn). When it is full, the final ACK of a
// new handshake is dropped (or answered with RST if tcp_abort_on_overflow=1)
// and the ListenOverflows counter increments.
// The tcp_max_syn_backlog sysctl controls the size of the SYN_RECV queue
// (separate from the accept queue). For servers handling bursty connections,
// raise the listen() backlog, somaxconn, and tcp_max_syn_backlog together.
$ sysctl net.ipv4.tcp_max_syn_backlog    # default scales with memory (at least 128); e.g. 4096 for busy servers
$ sysctl net.core.somaxconn               # 128 before Linux 5.4, 4096 since; caps the listen() backlog


// TCP recv() — data path from socket to userspace
//
// Application:
//   n = recv(fd, buf, 1024, 0);

1. inet_recvmsg()        // recvfrom syscall → sock_recvmsg() → inet_recvmsg()
2. tcp_recvmsg()         // the heavy lifter
                         // Copies from socket receive queue (sk->sk_receive_queue)
                         // Segments that arrive while the reader holds the socket
                         // lock go to sk->sk_backlog and are processed via
                         // sk->sk_backlog_rcv when the lock is released
                         //
                         // Flow:
                         //   - if sk->sk_receive_queue has data: copy directly
                         //   - data past a sequence gap (packet loss) waits in the
                         //     out-of-order queue and is not readable yet
                         //   - if MSG_WAITALL flag: block until full request satisfied
                         //
                         // Special cases:
                         //   URG/OOB data: tracked via tp->urg_data and tp->urg_seq
                         //
                         // Locks held: socket lock (lock_sock). It may sleep, and
                         // sk_wait_data() releases it while waiting for data

tcp_recvmsg():                                  // simplified pseudocode
    lock_sock(sk);
    do {
        // 1. Try to copy from sk->sk_receive_queue (already in order)
        skb = first skb in sk->sk_receive_queue covering tp->copied_seq;
        if (skb) {
            used = min(bytes left in skb, len);
            err = skb_copy_datagram_msg(skb, offset, msg, used);
            copied += used;
            len    -= used;
            continue;
        }

        // 2. No data queued: return what we have, or block unless MSG_DONTWAIT
        if (copied >= target)            // target is 1, or len with MSG_WAITALL
            break;
        if (!timeo) {                    // non-blocking and nothing more to read
            if (!copied)
                copied = -EAGAIN;
            break;
        }
        sk_wait_data(sk, &timeo, NULL);  // sleeps with the socket lock dropped
    } while (len > 0);

    // 3. tcp_cleanup_rbuf(): called after data has been copied out
    //    Sends a window-update ACK if the read freed enough buffer space to
    //    open the advertised window substantially.
    //    Receive buffer auto-tuning itself happens in tcp_rcv_space_adjust().
    release_sock(sk);

// TCP send() — data path from userspace to wire
//
// Application:
//   n = send(fd, buf, 1024, 0);

1. inet_sendmsg()        // sendto syscall → sock_sendmsg() → inet_sendmsg()
2. tcp_sendmsg()         // the producer

tcp_sendmsg():                                  // simplified pseudocode
    // Copies from user buffer to kernel sk_buff chain
    // Manages the write queue (sk->sk_write_queue)
    // Handles:
    //   - Nagle algorithm (coalescing small writes)
    //   - urgent data (MSG_OOB)
    //   - partial copies (when socket buffer full)

    while (len > 0) {
        skb = tail of sk->sk_write_queue;
        if (!skb || skb is full) {
            // Out of send buffer space: block (or return a partial count)
            // until ACKs free memory
            skb = sk_stream_alloc_skb();    // new sk_buff, sized toward size_goal
                                            // (a multiple of MSS when TSO is on)
            append skb to sk->sk_write_queue;
        }

        // Copy data into the skb's linear area or page fragments
        // (scatter-gather); one send() may span several sk_buffs
        copy = min(len, room left in skb);
        err  = copy from msg into skb;      // e.g. skb_add_data_nocache()

        sk->sk_wmem_queued   += copy;       // accounted against SO_SNDBUF
        sk->sk_forward_alloc -= copy;
        len -= copy;
    }

    // Transmit: tcp_push() decides what can go out now
    tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
    // If Nagle says wait (small segment, unacked data in flight, no
    // TCP_NODELAY), the tail skb stays queued to coalesce with the next send()

// TCP three-way handshake states visualized:
//
// Client                              Server
//   │                                   │
//   │─── SYN (seq=iss) ────────────────▶│
//   │                                   │
//   │      (state: TCP_SYN_SENT)        │ state: TCP_SYN_RECV
//   │                                   │
//   │◀── SYN+ACK (seq=k, ack=iss+1) ────│
//   │                                   │
//   │─── ACK (ack=k+1) ────────────────▶│
//   │                                   │
//   │      (state: TCP_ESTABLISHED)     │ state: TCP_ESTABLISHED
//   │                                   │
//   │◀────── data flows both ways ─────▶│


// Socket options that affect performance
//
// SO_SNDBUF / SO_RCVBUF — socket send/receive buffer sizes
//   Default: auto-tuned (net.ipv4.tcp_rmem / net.ipv4.tcp_wmem)
//   Setting either option by hand turns auto-tuning off for that buffer
//   Raising can help with high-throughput, high-latency links
//   Check actual usage: ss -m output, against net.core.rmem_max / wmem_max
//
// SO_REUSEADDR — rebind to TIME_WAIT port
//   Set before bind() — critical for restarting servers quickly
//
// SO_REUSEPORT — multi-process load balancing (Linux 3.9+)
//   Multiple processes bind() same port, kernel hashes 4-tuple to dest process
//   Reduces accept-queue contention and thundering herd on multi-core servers
//
// SO_KEEPALIVE — detects dead connections (2h idle → probes → 75s later → kill)
//   Defaults: tcp_keepalive_time 7200 s, tcp_keepalive_intvl 75 s,
//   tcp_keepalive_probes 9 (about 11 more minutes of probing before the reset)
//
// TCP_NODELAY — disables Nagle (send immediately, no coalescing)
//   Critical for low-latency RPC: Redis, gRPC, trading systems
//   Default: Nagle ON (coalescing small writes into full segments)

The backlog queues are often the source of connection failures under load. When accept() isn't called fast enough, the kernel holds established connections in the accept queue (bounded by listen(backlog), capped by somaxconn). SYN floods fill the SYN_RECV queue (bounded by tcp_max_syn_backlog); with SYN cookies enabled (net.ipv4.tcp_syncookies=1) the kernel keeps answering SYNs without storing per-connection state. When the accept queue is full, completed handshakes are dropped before they ever reach your application.
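
The socket options listed above are set with setsockopt() before bind() or connect(). The sketch below is one way to configure a socket for a low-latency server and read the settings back; make_low_latency_socket and get_int_opt are hypothetical helper names, and error handling is kept to the minimum needed for the example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read an integer-valued socket option; returns -1 on error. */
int get_int_opt(int fd, int level, int name)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, level, name, &val, &len) < 0)
        return -1;
    return val;
}

/* Create a TCP socket with SO_REUSEADDR (so a restarted server can rebind
 * while old connections sit in TIME_WAIT) and TCP_NODELAY (Nagle off).
 * Returns the file descriptor, or -1 on failure. */
int make_low_latency_socket(void)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

SO_SNDBUF and SO_RCVBUF are deliberately left alone here so that the kernel's buffer auto-tuning stays active; set them only when measurements show the auto-tuned sizes are too small for the link.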

Routing: FIB Lookups, Policy Routing, and Tables

Every packet that leaves the kernel — locally generated or forwarded — goes through a routing decision. The kernel's Forwarding Information Base (FIB) is consulted in ip_route_input() (ingress) and ip_route_output_slow() (egress). The result is a fib_result containing the output interface, next hop gateway, and source address to use. Linux's routing is far more powerful than a single table — policy routing (via ip rule) lets you select different routing tables based on packet properties (fwmark, src, tos, uid). This enables complex setups like split-access (different uplinks for different traffic classes).

// Viewing the routing table
$ ip route show
default via 192.168.1.1 dev eth0 proto dhcp src 192.168.1.100 metric 600
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 600

$ ip route show table local
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 192.168.1.100 dev eth0 proto kernel scope host src 192.168.1.100

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref  Use Iface
0.0.0.0         192.168.1.1     0.0.0.0         UG    600    0        0 eth0
192.168.1.0     0.0.0.0         255.255.255.0   U     600    0        0 eth0

// Policy routing — multiple tables
//
// Default rules (priority 0, 32766, 32767):
//   priority 0:   lookup local  (local address delivery)
//   priority 32766: lookup main  (normal routing)
//   priority 32767: lookup default (if main has no match)

$ ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

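// Example: mark traffic with fwmark for a separate routing table
// (the names "internet" and "internal" must first be mapped to table
//  numbers in /etc/iproute2/rt_tables)
$ ip rule add fwmark 1 table internet
$ ip rule add from 10.0.0.0/8 table internal
$ ip rule show
0:      from all lookup local
32764:  from 10.0.0.0/8 lookup internal
32765:  from all fwmark 0x1 lookup internet
32766:  from all lookup main
32767:  from all lookup default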

// Blackhole routing: null route for DDoS mitigation
$ ip route add blackhole 203.0.113.0/24
// Packets destined to this subnet are dropped silently (no ICMP)
// Related route types: unreachable and prohibit also drop, but send an ICMP error
// In iptables: -d 203.0.113.0/24 -j DROP (but the packet still traverses the rule set first)
// Blackhole route is cheaper: the kernel drops at the routing decision, no iptables cost

// Multipath routing (ECMP)
$ ip route add default proto static \
    nexthop via 192.168.1.1 dev eth0 weight 1 \
    nexthop via 192.168.2.1 dev eth1 weight 1
// By default the kernel hashes on source and destination IP only (L3)
// All packets of a flow hash to the same nexthop, so there is no reordering within a flow

$ sysctl net.ipv4.fib_multipath_hash_policy
// Controls which fields are hashed for ECMP selection:
//   0 = L3: src/dst IP only (default)
//   1 = L4: src/dst IP, src/dst port, and protocol (5-tuple)
//   2 = L3, or the inner L3 header for encapsulated packets
//   3 = custom: fields selected by the net.ipv4.fib_multipath_hash_fields bitmask


// Kernel FIB internals
//
// The FIB is implemented in net/ipv4/fib_trie.c.
// The trie structure (LC-trie, a level- and path-compressed trie) stores the routing
// table in a tree that supports longest-prefix-match in O(key_length) time.
//
// fib_lookup() result and signature (abridged; see include/net/ip_fib.h):
//   struct fib_result {
//       __be32                prefix;
//       unsigned char         prefixlen;
//       unsigned char         type;      // RTN_UNICAST, RTN_LOCAL, ...
//       struct fib_nh_common *nhc;
//       struct fib_info      *fi;
//   };
//
//   int fib_lookup(struct net *net, const struct flowi4 *flp,
//                  struct fib_result *res, unsigned int flags);
//
// Key fields in flowi4:
//   fl4->daddr        destination address (primary lookup key)
//   fl4->saddr        source address (for policy routing if configured)
//   fl4->flowi4_mark  fwmark (set by iptables -j MARK, used for policy routing)
//   fl4->flowi4_tos   type of service
//   fl4->flowi4_uid   user ID (for per-UID routing)
//
// Example: routing a forwarded packet
//   ip_route_input_slow(skb, daddr, saddr, tos, dev, ...)
//     → fib_lookup(net, &fl4, &res, 0)     // main lookup
//     → if (res.type == RTN_LOCAL)   { ... deliver locally ... }
//     → if (res.type == RTN_UNICAST) { ... forward via res.nhc ... }

// Routing cache (historical note)
//
// Before 3.6 the kernel maintained a routing cache (RCU-protected hash table).
// Each lookup cached the result keyed by flow. Problems: it could grow
// large, took CPU to expire entries, and its hash chains could be attacked by DoS traffic.
// Removed in Linux 3.6. The FIB (trie) is always consulted directly now.
// Security: no more routing cache exhaustion attacks.


// Monitoring routing decisions
//
// /proc/net/fib_trie: internal FIB trie state, one section per table
// (abridged, illustrative output; the node counters vary by system)
$ cat /proc/net/fib_trie
Main:
  +-- 0.0.0.0/0 3 0 5
     |-- 0.0.0.0
        /0 universe UNICAST          # default route
     |-- 192.168.1.0
        /24 link UNICAST             # connected subnet
Local:
  +-- 0.0.0.0/0 2 0 2
     ...
        |-- 127.0.0.1
           /32 host LOCAL            # loopback address
     ...
        |-- 192.168.1.100
           /32 host LOCAL            # local address

$ ip route get 8.8.8.8
8.8.8.8 via 192.168.1.1 dev eth0 src 192.168.1.100 uid 0
    cache expires 4294967sec

// Route metrics — preference when multiple routes exist
$ ip route add default via 192.168.1.1 dev eth0 metric 100
$ ip route add default via 10.0.0.1 dev eth1 metric 200
// Lower metric wins. You can fail over by changing metric on the primary.

Policy routing with fwmark is the foundation of many advanced networking setups: iptables marks steer traffic into a separate routing table, which in turn can force it out a specific interface or through a specific gateway. Combined with conntrack (for example, restoring the mark per connection with CONNMARK), this enables stateful failover and multi-path setups. Rules are evaluated in ascending priority order, and the first rule whose table yields a matching route wins; if the selected table has no route for the destination, evaluation continues with the next rule.
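
Within whichever table a rule selects, the lookup itself is a longest-prefix match. The sketch below shows that matching rule with a plain linear scan over a small array. It is not how the kernel does it (the FIB uses the LC-trie described above), and struct route, mask_for, and lpm_lookup are illustrative names. Addresses are in host byte order for readability.

```c
#include <stddef.h>
#include <stdint.h>

struct route {
    uint32_t prefix;      /* network address, host byte order */
    uint8_t  prefixlen;   /* 0..32 */
    int      oif;         /* illustrative output interface index */
};

/* Netmask for a prefix length. /0 is handled separately because shifting
 * a 32-bit value by 32 bits is undefined behaviour in C. */
static uint32_t mask_for(uint8_t prefixlen)
{
    return prefixlen == 0 ? 0 : 0xFFFFFFFFu << (32 - prefixlen);
}

/* Return the most specific route that matches daddr, or NULL if none does. */
const struct route *lpm_lookup(const struct route *tbl, size_t n, uint32_t daddr)
{
    const struct route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if ((daddr & mask_for(tbl[i].prefixlen)) != tbl[i].prefix)
            continue;                       /* prefix does not cover daddr */
        if (!best || tbl[i].prefixlen > best->prefixlen)
            best = &tbl[i];                 /* longer prefix wins */
    }
    return best;
}
```

With a table holding 0.0.0.0/0, 192.168.1.0/24, and 192.168.1.100/32, a lookup for 8.8.8.8 falls through to the default route, 192.168.1.5 matches the /24, and 192.168.1.100 matches the /32 even though all three entries cover it.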

XDP: eBPF at the Driver

eXpress Data Path (XDP) runs eBPF programs at the earliest point in the receive path: inside the NIC driver, before the kernel allocates a sk_buff. At 10Gbps line rate with minimum-size frames you have ~67ns per packet (about 6.7ns at 100Gbps, which is why the work has to be spread across queues and cores), and a simple XDP program can reach a verdict (DROP, PASS, TX, REDIRECT) in a few tens of nanoseconds on modern hardware. The key advantage over iptables rules: XDP decides before the sk_buff allocator is called and reads the raw DMA buffer directly, so a dropped packet costs almost nothing. This makes it ideal for DDoS mitigation, load balancing, and packet filtering at scale.
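The per-packet time budget follows from Ethernet framing arithmetic. A minimal sketch, assuming minimum-size 64-byte frames plus the 20 bytes of preamble and inter-frame gap that every frame occupies on the wire (function names are illustrative):

```python
def ns_per_packet(link_bps, frame_bytes=64):
    """Time one frame occupies the wire, in nanoseconds."""
    wire_bits = (frame_bytes + 20) * 8   # +8 preamble/SFD, +12 inter-frame gap
    return wire_bits / link_bps * 1e9

def max_pps(link_bps, frame_bytes=64):
    """Maximum packets per second at line rate."""
    return link_bps / ((frame_bytes + 20) * 8)

for gbps in (10, 25, 100):
    bps = gbps * 1e9
    print(f"{gbps:>3} Gbps: {ns_per_packet(bps):6.2f} ns/pkt, "
          f"{max_pps(bps) / 1e6:7.2f} Mpps")
```

With full-size 1500-byte frames the budget is roughly 18 times larger, so the worst case only applies to small-packet floods, which is exactly the DDoS scenario XDP is typically deployed for.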

XDP programs are attached to network interfaces and run on every received packet. They receive a pointer to the raw packet data (the DMA buffer) and can inspect, modify, or redirect it. The verifier guarantees the program terminates and won't crash the kernel: it rejects unbounded loops (bounded loops are accepted since kernel 5.3), enforces bounds checking on all memory accesses, and limits program complexity. Programs can be attached in three modes: native (in the driver, fastest), generic (after sk_buff allocation, for drivers without native support), and offload (on the NIC itself, e.g., Netronome NFP).

// XDP program: drop DNS queries to port 53 (DDoS mitigation pattern)
//
// Written in C, compiled with clang -target bpf -O2, loaded with ip or bpftool

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_UDP
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons()

SEC("xdp")
int drop_dns_queries(struct xdp_md *ctx) {
    // ctx->data and ctx->data_end are the bounds of the packet buffer.
    // The verifier enforces that we check against data_end before dereferencing.

    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Parse Ethernet header (14 bytes)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                   // can't parse — let kernel handle

    // Only interested in IPv4 (EtherType = 0x0800)
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

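    // Parse IP header. iphdr is variable length (ihl counts 32-bit words),
    // so bounds-check the fixed 20-byte part first, then validate ihl.
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->ihl < 5)
        return XDP_PASS;                   // malformed header, let the kernel decide

    // Only interested in UDP traffic
    if (ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    // Non-first fragments carry no UDP header, so a port check would read
    // payload bytes. Hand them to the normal stack instead.
    if (ip->frag_off & bpf_htons(0x1FFF))
        return XDP_PASS;

    // Parse UDP header (8 bytes). It starts ihl*4 bytes after the start of
    // the IP header, which skips any IP options.
    struct udphdr *udp = (void *)ip + (ip->ihl * 4);
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;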

    // Check destination port
    if (udp->dest == bpf_htons(53))
        return XDP_DROP;                   // silent drop — no reply sent

    return XDP_PASS;                       // pass to normal stack
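}

// License declaration read by the loader; "GPL" also permits GPL-only helpers
char LICENSE[] SEC("license") = "GPL";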

// Compiling and loading:
$ clang -target bpf -O2 -Wall -c drop_dns.c -o drop_dns.o
$ ip link set dev eth0 xdpgeneric obj drop_dns.o sec xdp
// Or for native (driver must support it):
$ ip link set dev eth0 xdp obj drop_dns.o sec xdp
// Check:
$ ip link show eth0
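2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc fq_codel state UP ...
    prog/xdp id 336 tag 3b185c09c9f8ae2e jited
// "xdpgeneric" means generic mode; a native attach shows "xdp" and an offloaded one "xdpoffload".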
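Before loading a filter like drop_dns_queries on a live interface, it can help to replay its header walk against hand-built frames. The sketch below is a userspace mirror of that walk, not the eBPF program itself; it also includes an IHL sanity check and a fragment check as defensive steps. The helper make_udp_frame and its field values are made up for illustration.

```python
import struct

XDP_DROP, XDP_PASS = 1, 2          # values of enum xdp_action in linux/bpf.h
ETH_P_IP, IPPROTO_UDP = 0x0800, 17
ETH_HLEN = 14

def verdict(frame: bytes) -> int:
    """Same decision sequence as the XDP program, with len() as data_end."""
    if len(frame) < ETH_HLEN:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    if len(frame) < ETH_HLEN + 20:             # fixed part of the IP header
        return XDP_PASS
    ihl = frame[ETH_HLEN] & 0x0F
    if ihl < 5 or frame[ETH_HLEN + 9] != IPPROTO_UDP:
        return XDP_PASS
    (frag,) = struct.unpack_from("!H", frame, ETH_HLEN + 6)
    if frag & 0x1FFF:                          # non-first fragment: no UDP header
        return XDP_PASS
    udp_off = ETH_HLEN + ihl * 4               # skips IP options
    if len(frame) < udp_off + 8:
        return XDP_PASS
    (dport,) = struct.unpack_from("!H", frame, udp_off + 2)
    return XDP_DROP if dport == 53 else XDP_PASS

def make_udp_frame(dport, ip_options=b""):
    """Build a minimal Ethernet/IPv4/UDP frame (zeroed addresses, no payload)."""
    ihl = 5 + len(ip_options) // 4
    eth = b"\x00" * 12 + struct.pack("!H", ETH_P_IP)
    ip = struct.pack("!BBHHHBBH4s4s", 0x40 | ihl, 0, ihl * 4 + 8, 0, 0,
                     64, IPPROTO_UDP, 0, bytes(4), bytes(4)) + ip_options
    udp = struct.pack("!HHHH", 40000, dport, 8, 0)
    return eth + ip + udp
```

Feeding it a frame with four bytes of IP options confirms that the port is read at the right offset, which is the case a fixed 20-byte assumption would get wrong.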


// More complex XDP program: redirect traffic to another interface
SEC("xdp")
int redirect_to_veth(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

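    // Redirect to a veth (e.g., for container traffic).
    // 4 is a placeholder: look up the real ifindex with `ip link`
    // (ifindex 1 is normally lo). In native mode the target driver
    // must support XDP transmit.
    return bpf_redirect(4, 0);  // ifindex, flags (must be 0 for XDP)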
}

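// AF_XDP: zero-copy from NIC directly to userspace ring buffer
//
// An AF_XDP socket bypasses the kernel network stack on the RX path.
// It binds to one queue of an interface and uses a userspace memory
// area (UMEM) that the NIC DMAs into directly when the driver supports
// zero-copy; otherwise the kernel falls back to copy mode.
//
// Typical setup:
//   $ ip link set dev eth0 xdp obj prog_xsk.o sec xdp
//   (attach an XDP program that redirects into an XSKMAP with bpf_redirect_map)
//
// In userspace (outline only; the libxdp/libbpf xsk helpers wrap these steps):
//   int fd = socket(AF_XDP, SOCK_RAW, 0);
//   // Register the UMEM and size the four rings:
//   setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &umem_reg, sizeof(umem_reg));
//   setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &ring_size, sizeof(ring_size));
//   setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_size, sizeof(ring_size));
//   setsockopt(fd, SOL_XDP, XDP_RX_RING, &ring_size, sizeof(ring_size));
//   setsockopt(fd, SOL_XDP, XDP_TX_RING, &ring_size, sizeof(ring_size));
//   struct xdp_mmap_offsets off;
//   socklen_t optlen = sizeof(off);
//   getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
//   // mmap() each ring at the returned offsets, then post buffers to the fill ring
//   struct sockaddr_xdp sxdp = { .sxdp_family = AF_XDP,
//                                .sxdp_ifindex = ifindex, .sxdp_queue_id = queue_id };
//   bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
//   // RX: read descriptors from the mmap'd RX ring (poll() to wait for packets)
//   // TX: write descriptors to the TX ring, then sendto() to kick the kernel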


// XDP return codes — what each one does:
//
// XDP_DROP         — drop packet (no further processing)
//                     Used for: DDoS filtering, rate limiting, sampling
// XDP_PASS         — pass to normal kernel stack (allocates sk_buff, runs netfilter)
//                     Normal processing continues
// XDP_TX           — transmit this packet out the same interface (echo server, NAT)
//                     Saves: no DMA unmapping and re-mapping needed
// XDP_REDIRECT     — redirect to another interface or AF_XDP socket
//                     Used for: load balancing, packet mirroring, forwarding
// XDP_ABORTED      - drop the packet and fire the xdp:xdp_exception tracepoint
//                     Signals a program error; should not be used for normal drops


// eBPF with tc (traffic control): attaching eBPF as a classifier
//
// tc can attach eBPF programs at the ingress and egress hooks of a device, giving
// them access to the sk_buff (unlike XDP, which operates on raw packet data):
$ tc qdisc add dev eth0 clsact
$ tc filter add dev eth0 ingress bpf da obj policy.o sec ingress
// "da" (direct-action) makes the program's return code the verdict (TC_ACT_OK, TC_ACT_SHOT, ...).
// The ingress hook runs after sk_buff allocation and GRO, before netfilter and the IP layer.
// Useful when you need to mark packets (fwmark) for policy routing.


// Debugging XDP programs
//
// perf top -g: shows CPU cycles spent in XDP (JITed programs appear as bpf_prog_<tag>)
// bpftool prog list: shows loaded programs
// bpftool prog show id <id>: metadata for one program
// bpftool prog dump xlated id <id> / dump jited id <id>: verifier-rewritten bytecode / JIT output
// bpftool net show: which programs are attached to which interfaces
// cat /sys/kernel/debug/tracing/trace_pipe: bpf_printk() output and enabled xdp:* tracepoints
//
// $ bpftool prog list
// 336: xdp  name drop_dns  tag 3b185c09c9f8ae2e  gpl
//       loaded_at 2024-01-15T10:23:11  uid 0
//       xlated 128B  jited 96B  map_ids 0
//       btf_id 44
//
// $ ethtool -S eth0 | grep -i xdp
// (driver-specific XDP counters such as drops and redirects; names vary by driver)

tc qdiscs: Egress Shaping and Queue Management

qdiscs (queueing disciplines) sit between the IP layer and the NIC driver on the egress path. Every packet that goes out an interface is first enqueued on the root qdisc, which decides which packet to dequeue next whenever the driver can accept more; this is where traffic shaping, prioritization, and fairness algorithms apply. The kernel's built-in default is still pfifo_fast (net.core.default_qdisc), but systemd has set fq_codel (Fair Queue + Controlled Delay) as the default since systemd 217, so most current distributions run it out of the box. fq_codel controls bufferbloat without configuration: CoDel starts dropping (or ECN-marking) packets once queueing delay has stayed above a target (default 5ms) for a full interval (default 100ms), so a burst on your 1Gbps link doesn't turn into hundreds of milliseconds of standing queue.
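The numbers behind target and interval are easy to work out. The sketch below is only arithmetic under stated assumptions, not the qdisc implementation: it converts a backlog into queueing delay, and lists the spacing between successive CoDel drops, which shrinks as interval / sqrt(count) while delay stays above target. Function names are illustrative.

```python
import math

def queue_delay_ms(backlog_bytes, link_bps):
    """Delay a packet sees behind backlog_bytes on a link of link_bps."""
    return backlog_bytes * 8 / link_bps * 1000

def bytes_at_target(link_bps, target_ms=5):
    """Backlog that corresponds to the target delay at a given link rate."""
    return link_bps * target_ms / 1000 / 8

def codel_drop_gaps_ms(interval_ms=100, drops=5):
    """Gap before the n-th drop while delay remains above target."""
    return [interval_ms / math.sqrt(n) for n in range(1, drops + 1)]

print(queue_delay_ms(62_500_000, 1e9))   # 62.5 MB standing queue on 1 Gbps
print(bytes_at_target(1e9))              # backlog equal to 5 ms at 1 Gbps
print([round(g, 1) for g in codel_drop_gaps_ms()])
```

The shrinking gaps are what lets CoDel start gently and then ramp up pressure on a sender that does not back off.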

# Default qdisc on most systemd-based distributions: fq_codel
# Fair Queue (per-flow queuing) + CoDel (Controlled Delay)
# This replaces the old pfifo_fast, which had no delay management
$ tc qdisc show dev eth0
qdisc fq_codel 0: root refcnt 2 limit 10240p flows 1024 quantum 1514 \
    target 5ms interval 100ms memory_limit 32Mb ecn

# fq_codel parameters explained:
#   limit        - max packets queued across all flows (10240)
#   flows        - number of hash buckets (1024); flows that hash to the same bucket share a queue
#   quantum      - bytes a flow may dequeue per round (1514 = 1500 MTU + 14-byte Ethernet header)
#   target       - acceptable standing queue delay (5ms)
#   interval     - window over which delay must stay above target before CoDel acts;
#                  should be on the order of the worst-case RTT (100ms)
#   memory_limit - total memory all flows can consume (32MB)
#   ecn          - mark ECN-capable packets instead of dropping them

# Cake qdisc: more sophisticated, handles diffserv (VoIP priority)
# Recommended for: home routers, any link with bufferbloat issues
$ modprobe sch_cake
$ tc qdisc add dev eth0 root cake bandwidth 1gbit triple-isolate nat wash

# Cake parameters:
#   bandwidth      - shape to this rate (1gbit); set it slightly below the real bottleneck
#                    so the queue builds here, where cake controls it
#   triple-isolate - fairness across source hosts, destination hosts, and individual flows
#   nat            - look up conntrack so host fairness sees pre-NAT addresses
#                    (cake does not perform NAT itself)
#   wash           - clear DSCP markings after classification (ECN bits are kept)

# HTB (Hierarchical Token Bucket): for multi-tenant bandwidth allocation
$ tc qdisc add dev eth0 root handle 1: htb default 30

# Create classes (HTB allows hierarchical class structure)
$ tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit burst 1mbit
$ tc class add dev eth0 parent 1:1 classid 1:10 htb rate 800mbit ceil 1gbit prio 1
$ tc class add dev eth0 parent 1:1 classid 1:20 htb rate 100mbit ceil 1gbit prio 2
$ tc class add dev eth0 parent 1:1 classid 1:30 htb rate 100mbit ceil 200mbit prio 5

# Attach qdisc to classes (1:10 and 1:20 use fq_codel, 1:30 gets pfifo)
$ tc qdisc add dev eth0 parent 1:10 fq_codel
$ tc qdisc add dev eth0 parent 1:20 fq_codel
$ tc qdisc add dev eth0 parent 1:30 pfifo

# Filter: assign traffic to classes based on port, DSCP, fwmark
$ tc filter add dev eth0 protocol ip parent 1: prio 1 \
    u32 match ip dport 443 0xffff flowid 1:10
$ tc filter add dev eth0 protocol ip parent 1: prio 1 \
    u32 match ip dport 80 0xffff flowid 1:20
# 0xffff is the mask applied to the 16-bit port before comparing.
# Unmatched traffic falls into the default class 1:30.

# TBF (Token Bucket Filter): simple rate limiting
$ tc qdisc add dev eth0 root tbf rate 100mbit burst 128kb latency 50ms
#   rate 100mbit  - maximum sustained rate
#   burst 128kb   - size of the token bucket in bytes (absorbs short bursts);
#                   too small a bucket caps throughput below the configured rate
#   latency 50ms  - max time a packet can wait in the qdisc
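The token bucket idea behind TBF fits in a few lines. This is a simplified model with an assumed class name and explicit timestamps: it only answers whether a packet conforms right now, whereas the real qdisc queues non-conforming packets (up to the latency bound) instead of rejecting them.

```python
class TokenBucket:
    """Tokens are bytes; they refill at `rate` and are capped at `burst`."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8           # refill rate in bytes per second
        self.burst = burst_bytes           # bucket capacity
        self.tokens = burst_bytes          # start with a full bucket
        self.last = 0.0                    # time of the last refill

    def allow(self, now, pkt_bytes):
        # Refill for the time elapsed since the previous call, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if pkt_bytes <= self.tokens:
            self.tokens -= pkt_bytes       # conforming: spend tokens
            return True
        return False                       # not enough tokens yet

# 8 kbit/s (1000 bytes/s) with room for two 1500-byte packets
tb = TokenBucket(rate_bps=8000, burst_bytes=3000)
print([tb.allow(0.0, 1500) for _ in range(3)], tb.allow(1.5, 1500))
```

The burst size sets how much back-to-back traffic passes at line rate before the long-term rate takes over, which is why an undersized bucket throttles more than the configured rate suggests.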

# SFQ (Stochastic Fairness Queueing): simple per-flow queuing
# (less accurate than fq_codel but lighter)
$ tc qdisc add dev eth0 root sfq perturb 10
#   perturb 10 — rehash every 10 seconds (prevents hash collisions becoming permanent)

# Monitoring qdisc drops and queue depths
$ tc -s qdisc show dev eth0
qdisc fq_codel 8002: root refcnt 2 limit 10240p flows 1024 ...
 Sent 123456789 bytes 98765 pkt (dropped 234, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 1514 drop_overlimit 0 new_flow_count 47 ecn_mark 12
  new_flows_len 0 old_flows_len 0

# Key metrics:
#   dropped        - packets dropped by this qdisc (CoDel drops plus overflow)
#   overlimits     - times a shaping qdisc (HTB, TBF) held a packet back because the rate was exceeded
#   requeues       - packets put back on the queue because the driver could not take them
#   backlog        - bytes and packets currently queued
#   drop_overlimit - packets dropped because the packet limit was hit
#   ecn_mark       - packets marked with ECN instead of dropped (good if ecn is on)

The fq_codel defaults are sane for most workloads. The most important knobs are target and interval: interval should roughly match the worst-case RTT of your flows and target is usually 5-10% of it, so on long-RTT paths (e.g., satellite or cellular) you'll want higher values. For low-latency gaming or financial trading, consider cake with the interval tuned down (cake exposes it as the rtt parameter). The ecn flag is worth enabling if your traffic crosses ECN-capable networks: the qdisc then signals congestion by marking packets early instead of dropping them.

conntrack and netfilter

conntrack (Connection Tracking) is netfilter's stateful tracking infrastructure. It maintains a hash table of every TCP, UDP, ICMP, and SCTP flow the kernel has seen. Each entry records source and destination addresses and ports, protocol state, NAT translations, and metadata (mark, timeout, status). With conntrack active, firewall rules can match on connection state, e.g., "only allow incoming packets that are part of an established connection" (one where conntrack has already seen packets in both directions). This dramatically simplifies firewall rules and is the foundation of NAT (both SNAT and DNAT).

# Watch live connection table — see every tracked flow
$ conntrack -L
tcp 6 119 ESTABLISHED src=10.0.0.5 dst=10.0.0.10 sport=42312 dport=443 \
    src=10.0.0.10 dst=10.0.0.5 sport=443 dport=42312 [ASSURED] mark=0 use=1
udp 17 22 src=10.0.0.5 dst=8.8.8.8 sport=53432 dport=53 src=8.8.8.8 dst=10.0.0.5 sport=53 dport=53432
# Format: protocol name, protocol number, seconds until expiry, TCP state,
#         original-direction tuple, reply-direction tuple, [status flags], mark, use

# Flush all conntrack entries (useful after firewall config change)
$ conntrack -F

# conntrack tool also supports -E for live event stream:
$ conntrack -E -p tcp
    [NEW] tcp      6 120 SYN_SENT src=10.0.0.5 dst=10.0.0.10 sport=42312 dport=443 [UNREPLIED]
 [UPDATE] tcp      6 60 SYN_RECV src=10.0.0.5 dst=10.0.0.10 sport=42312 dport=443
 [UPDATE] tcp      6 432000 ESTABLISHED src=10.0.0.5 dst=10.0.0.10 sport=42312 dport=443


# iptables rules using conntrack state
$ iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
$ iptables -A INPUT -m conntrack --ctstate NEW -p tcp --dport 22 -j ACCEPT
$ iptables -A INPUT -m conntrack --ctstate INVALID -j DROP

# DNAT: forward external traffic to internal server
$ iptables -t nat -A PREROUTING -p tcp -d 203.0.113.1 --dport 80 \
    -j DNAT --to-destination 10.0.0.5:8080

# SNAT / masquerade: give internal network outbound NAT
$ iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE
# MASQUERADE picks the outbound IP automatically (good for dynamic DHCP)

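# Bypass conntrack for high-churn traffic (reduces table pressure).
# -j CT --notrack replaces the older -j NOTRACK target.
$ iptables -t raw -A PREROUTING -p udp --dport 53 -j CT --notrack   # incoming queries
$ iptables -t raw -A OUTPUT -p udp --sport 53 -j CT --notrack       # locally generated replies
# DNS servers with millions of queries/sec should NOT track UDP DNS at all.
# Untracked packets never match ESTABLISHED, so accept them explicitly in the filter table.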

# nftables equivalent for stateful firewalling
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;

        # established/related connections
        ct state established,related counter accept

        # new SSH connections
        ct state new tcp dport 22 counter accept

        # invalid packets
        ct state invalid counter drop
    }

    chain forward {
        type filter hook forward priority 0; policy drop;
        ct state established,related counter accept
    }
}

# nftables NAT table
table ip nat {
    chain prerouting {
        type nat hook prerouting priority -100;
        ip daddr 203.0.113.1 tcp dport 80 dnat to 10.0.0.5:8080
    }
    chain postrouting {
        type nat hook postrouting priority 100;
        ip saddr 10.0.0.0/24 oifname "eth0" masquerade
    }
}


# conntrack table sizing: critical for high-connection hosts
$ sysctl -w net.netfilter.nf_conntrack_max=2097152      # 2M connections
$ sysctl -w net.netfilter.nf_conntrack_buckets=524288  # hash table buckets

# How buckets and max connections relate:
# conntrack uses a single hash table; each bucket holds a linked list of entries.
# buckets = /proc/sys/net/netfilter/nf_conntrack_buckets
# max connections = /proc/sys/net/netfilter/nf_conntrack_max
# Defaults for both are derived from RAM at module load and vary by kernel version.
# More buckets means shorter chains per bucket and cheaper lookups.
# Rule of thumb: buckets ≈ max_connections / 4 (as above) keeps chains short.
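To see what a given nf_conntrack_max / nf_conntrack_buckets pair implies, a rough estimate is enough. The sketch below rests on two assumptions that should be checked on the target kernel: about 320 bytes per entry (the real object size is listed in /proc/slabinfo) and 8 bytes per bucket head. Each connection is hashed twice, once per direction, which is why the chain length uses 2 x entries.

```python
def conntrack_sizing(max_entries, buckets, entry_bytes=320, bucket_bytes=8):
    """Rough worst-case figures for a full conntrack table.

    entry_bytes and bucket_bytes are assumptions, not kernel constants.
    """
    return {
        # original and reply tuples each occupy a hash slot
        "avg_chain_len_full": 2 * max_entries / buckets,
        "entries_mib_full": max_entries * entry_bytes / 2**20,
        "bucket_array_mib": buckets * bucket_bytes / 2**20,
    }

for name, value in conntrack_sizing(2_097_152, 524_288).items():
    print(f"{name}: {value:.1f}")
```

The entry memory dominates, so nf_conntrack_max is the number to weigh against available RAM; the bucket array is comparatively cheap.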

# conntrack timeout tuning (critical for different workload types)
$ sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=7200
$ sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=5
$ sysctl -w net.netfilter.nf_conntrack_tcp_timeout_close_wait=5
$ sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=120
$ sysctl -w net.netfilter.nf_conntrack_udp_timeout=30

# Why timeout tuning matters:
# - Web servers with many short-lived connections: lower time_wait/close_wait
# - Long-lived SSH tunnels: higher established timeout
# - Kubernetes nodes: many connection setup/teardowns → shorter time_wait helps


# Viewing conntrack memory usage (slab cache that holds the entries; needs root)
$ grep nf_conntrack /proc/slabinfo

# Check table utilization:
$ conntrack -C                                    # current entry count
$ cat /proc/sys/net/netfilter/nf_conntrack_count  # same number, straight from the kernel
$ cat /proc/sys/net/netfilter/nf_conntrack_max
$ cat /proc/sys/net/netfilter/nf_conntrack_buckets

conntrack is often the first bottleneck on container hosts or DDoS targets. With millions of short-lived connections, the table reaches nf_conntrack_max and packets that would create new entries are dropped; this looks like a connectivity issue even though the network itself is fine, and the kernel logs "nf_conntrack: table full, dropping packet". Monitoring /proc/net/stat/nf_conntrack (particularly the drop, early_drop, and insert_failed columns) and setting nf_conntrack_max appropriately for your RAM is essential for any host running many concurrent connections.
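A utilization check is simple enough to run from cron or a metrics agent. This is a minimal sketch: the 0.8 warning threshold and the function names are arbitrary choices, and read_proc assumes the nf_conntrack module is loaded so that the sysctl files exist.

```python
def utilization(count_text, max_text):
    """Fraction of the conntrack table in use, from the raw file contents."""
    return int(count_text) / int(max_text)

def status(count_text, max_text, warn_at=0.8):
    """'WARN' once utilization reaches warn_at (an arbitrary threshold)."""
    return "WARN" if utilization(count_text, max_text) >= warn_at else "OK"

def read_proc(name, base="/proc/sys/net/netfilter"):
    """Read one conntrack sysctl file, e.g. nf_conntrack_count."""
    with open(f"{base}/{name}") as f:
        return f.read()

# Example with literal values instead of live files:
print(status("900000\n", "1048576\n"), f"{utilization('900000', '1048576'):.0%}")
```

On a live host, pass read_proc("nf_conntrack_count") and read_proc("nf_conntrack_max") as the two arguments.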

RPS/RFS and Multi-queue NIC Steering

Receive Packet Steering (RPS) and Receive Flow Steering (RFS) are the kernel's software mechanisms for distributing received packet processing across multiple CPU cores. Modern NICs have multiple hardware queues (RX/TX), each with its own IRQ, and each IRQ is serviced by a specific CPU. When a NIC has fewer queues than CPUs (or only one queue), RPS/RFS move packets in software to prevent a single CPU from becoming the bottleneck. The distinction: RPS hashes the packet and steers it to one of the CPUs in the queue's mask; RFS also considers which CPU the consuming application last ran on, trying to keep the flow there for cache warmth.

# Check how many RX queues your NIC has
$ ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Combined:       32     # 32 total queues (combined mode)

Current hardware settings:
RX:             0
TX:             0
Combined:       16     # 16 active queue pairs (each handles RX and TX)

# Modern NICs typically have one queue per CPU core (up to a limit).
# Multi-queue means: different flows hit different queues -> different CPUs.
# The NIC's RSS (Receive Side Scaling) hashes the 4-tuple to pick a queue.

# Check RSS configuration
$ ethtool -n eth0 rx-flow-hash tcp4
TCP over IPV4 flows use these fields for computing Hash flow key:
IP SA / IP DA / L4 src port / L4 dst port

# See which CPUs each queue's IRQ is allowed to run on
$ for irq in $(grep eth0 /proc/interrupts | cut -d: -f1); do
    echo "IRQ $irq: $(cat /proc/irq/$irq/smp_affinity_list)"
  done
IRQ 88: 0-7
IRQ 89: 8-15
# Queue 0's IRQ may be delivered to CPUs 0-7, queue 1's to CPUs 8-15

# RPS (Receive Packet Steering): software distribution when RSS isn't enough
# RPS uses the flow hash (addresses and ports; the NIC-supplied hash if present)
# to pick a CPU from the queue's rps_cpus mask
#
# For each RX queue, configure which CPUs can process its packets:
$ cat /sys/class/net/eth0/queues/rx-0/rps_cpus
000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff
# Bits set = CPUs allowed to process this queue's packets (8 CPUs here)

# Enable RPS for a queue on CPUs 0-7 (hex bitmask, bit N = CPU N):
$ echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Or restrict to specific CPUs:
$ echo f0 > /sys/class/net/eth0/queues/rx-0/rps_cpus  # CPUs 4-7
# Writing 0 disables RPS for the queue (the default).

# When does RPS kick in? When the NIC's RSS can't distribute enough.
# If the NIC has at least one queue per CPU, RSS handles everything and RPS is unnecessary.
# If NIC has 1 queue and 4 CPUs, all packets hit CPU0 unless RPS moves them.
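Hand-writing rps_cpus masks gets error-prone beyond a few CPUs. A small helper, sketched here with an illustrative name, turns a list of CPU numbers into the hex format the sysfs file accepts, with comma-separated 32-bit groups for machines with more than 32 CPUs.

```python
def rps_mask(cpus):
    """Hex CPU bitmask for rps_cpus: bit N set means CPU N may process packets."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    hexstr = f"{bits:x}"
    # Split into groups of 8 hex digits (32 bits), starting from the right
    groups = []
    while hexstr:
        groups.append(hexstr[-8:])
        hexstr = hexstr[:-8]
    groups.reverse()
    # Only the leftmost group may be shorter than 8 digits
    return ",".join([groups[0]] + [g.zfill(8) for g in groups[1:]])

print(rps_mask(range(8)))        # CPUs 0-7
print(rps_mask([4, 5, 6, 7]))    # CPUs 4-7
print(rps_mask(range(32, 40)))   # CPUs 32-39, needs a second 32-bit group
```

The result can be written directly, for example `echo "$(python3 mask.py)" > /sys/class/net/eth0/queues/rx-0/rps_cpus`, where mask.py is a hypothetical wrapper script.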

# rps_sock_flow_entries: global flow table size for RFS
$ cat /proc/sys/net/core/rps_sock_flow_entries
32768  # 32k tracked flows across all interfaces (the default is 0, which disables RFS)

# rps_flow_cnt: per-queue flow table size
# Each RX queue has its own table mapping flow hash -> CPU currently handling that flow
$ cat /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
2048  # each queue tracks up to 2048 flows (the default is 0)

# When to tune RPS:
#   - Single-queue NIC (no RSS): enable RPS to distribute across all CPUs
#   - CPU-heavy packet processing (firewall, encryption): move to multiple CPUs
#   - Cache-sensitive apps: use RFS instead (keeps flow on socket-owning CPU)

# RFS (Receive Flow Steering) — RPS + locality awareness
# RFS tries to place a flow on the same CPU as the application consuming it.
# This improves L1/L2 cache hit rates for socket receive buffers.

# Enable RFS (RPS must also be enabled through rps_cpus):
# 1. Set rps_sock_flow_entries (global)
$ echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

# 2. Set rps_flow_cnt per queue to rps_sock_flow_entries / number of RX queues
#    (32768 / 16 queues = 2048)
$ for q in /sys/class/net/eth0/queues/rx-*; do
    echo 2048 > $q/rps_flow_cnt
  done

# 3. Confirm the values took effect
$ grep . /sys/class/net/eth0/queues/rx-*/rps_flow_cnt
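The two RFS table sizes are related by simple division. The sketch below follows the guideline from the kernel's scaling documentation (per-queue size = global size / number of RX queues) and rounds up to powers of two, which is what the kernel does with the values it is given. The expected flow count is an input you have to estimate yourself.

```python
def next_pow2(n):
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def rfs_sizing(expected_flows, rx_queues):
    """Return (rps_sock_flow_entries, rps_flow_cnt per queue)."""
    total = next_pow2(expected_flows)
    per_queue = next_pow2(max(1, total // rx_queues))
    return total, per_queue

print(rfs_sizing(32768, 16))   # the 16-queue NIC used in this section
print(rfs_sizing(20000, 8))    # a non-power-of-two estimate gets rounded up
```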

# Multi-queue RSS: check and control IRQ affinity
$ cat /proc/interrupts | grep eth0 | head -10
  88:   124533   0   0   0   PCI-MSI  eth0-rx-0   # all on CPU0: bottleneck!
  89:   124521   0   0   0   PCI-MSI  eth0-rx-1   # also CPU0: still bottlenecked!
...

# Spread IRQs across cores (one CPU per queue keeps each queue's cache hot):
$ cat /proc/irq/88/smp_affinity_list
0
$ echo 0 > /proc/irq/88/smp_affinity_list  # queue 0 -> CPU 0
$ echo 1 > /proc/irq/89/smp_affinity_list  # queue 1 -> CPU 1
# Stop irqbalance first (or exclude these IRQs), otherwise it will overwrite manual settings.

# Alternative: irqbalance daemon (systemd service)
# irqbalance monitors interrupt rates and dynamically reassigns IRQ affinity
# to balance load. For most production servers, running irqbalance is sufficient.

# Check current interrupt rate per CPU:
$ cat /proc/interrupts | grep -E 'CPU|eth0-rx'
          CPU0   CPU1   CPU2   CPU3
  88:  123456  23456     0     0  eth0-rx-0   # CPU0 heavily loaded
  89:  120000     0  23456     0  eth0-rx-1   # CPU2 catching up
...

The key insight with RPS/RFS is that they're usually not needed on modern multi-queue NICs with good IRQ affinity. The RSS hash already distributes flows across queues, and each queue's IRQ is pinned to a specific CPU. If you see one CPU dominating softirq time while others are idle, check column 3 of /proc/net/softnet_stat (time_squeeze): if it keeps growing, net_rx_action is running out of budget before it can drain its queues, and raising netdev_budget or reducing per-CPU load is the fix, not more RPS.

Performance Tuning: sysctls for Buffer Sizes, Congestion Control

The Linux network stack has dozens of sysctls that control buffer sizes, timeouts, congestion algorithms, and offload features. Getting these wrong — or leaving them at defaults designed for a 1990s server — is the most common cause of "why is my 10Gbps NIC only doing 5Gbps?" The key axes are: buffer sizes (socket, interface, conntrack, device DMA rings), TCP tuning (congestion control, window sizes, offload), and interrupt coalescing. This section covers the most impactful ones with explanations of why they'd need changing.

# ============================================================
# BUFFER SIZES — kernel memory for network data structures
# ============================================================

# Core socket buffer sizes (auto-tuned within these bounds)
# tcp_rmem: min / default / max for receive buffer
$ sysctl net.ipv4.tcp_rmem
4096    131072    6291456   # 4KB / 128KB / 6MB (auto-tuned; max scales with RAM)

# tcp_wmem: min / default / max for send buffer
$ sysctl net.ipv4.tcp_wmem
4096    16384     4194304   # 4KB / 16KB / 4MB (auto-tuned; max scales with RAM)

# When to change:
#   - High-BDP links (e.g., 10Gbps + 10ms RTT): raise max to 16MB+
#     Bandwidth-delay product = 10Gbps * 0.01s = 12.5MB
#     If max receive buffer is 6MB, you can't fill the pipe.
#   - Financial trading (ultra-low latency): lower default to avoid
#     large buffers causing variable scheduling delays.
#   - Container hosts with thousands of small sockets: lower memory footprint.

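# Setting for 10Gbps + 10ms RTT:
$ sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"
$ sysctl -w net.ipv4.tcp_wmem="4096 262144 16777216"
# rmem_max/wmem_max cap explicit setsockopt(SO_RCVBUF/SO_SNDBUF) requests;
# TCP auto-tuning is bounded by the third tcp_rmem/tcp_wmem value instead.
$ sysctl -w net.core.rmem_max=16777216
$ sysctl -w net.core.wmem_max=16777216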
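The buffer sizes above come straight from the bandwidth-delay product. A short sketch of the arithmetic (function names are illustrative): one function gives the bytes that must be in flight to fill a link, the other the throughput ceiling a given window imposes. It treats the whole buffer as usable window, which is optimistic, since the kernel reserves part of the socket buffer for bookkeeping overhead.

```python
def bdp_bytes(link_bps, rtt_ms):
    """Bytes in flight needed to keep the link full."""
    return link_bps * rtt_ms / 1000 / 8

def throughput_cap_bps(window_bytes, rtt_ms):
    """Upper bound on throughput with at most window_bytes in flight."""
    return window_bytes * 8 * 1000 / rtt_ms

print(bdp_bytes(10_000_000_000, 10) / 1e6, "MB needed for 10 Gbps at 10 ms")
print(throughput_cap_bps(6_291_456, 10) / 1e9, "Gbps ceiling with a 6 MB window")
```

The second figure shows why a 6MB maximum caps a single flow at roughly half of a 10Gbps link at 10ms RTT.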

# Interface-level ring buffer (NIC DMA descriptor rings, not socket buffers)
# Controlled via ethtool -G (--set-ring), not sysctl.
$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:        8192  # max receive ring descriptors
RX Mini:   0
RX Jumbo:  0
TX:        8192  # max transmit ring descriptors
Current hardware settings:
RX:        4096
RX Mini:   0
RX Jumbo:  0
TX:        4096

# Increase RX ring to handle bursts (up to the pre-set maximum):
$ ethtool -G eth0 rx 8192 tx 4096

# netdev_budget: max packets processed per net_rx_action run (per CPU)
$ sysctl net.core.netdev_budget
300    # default 300 packets, then yields
# Raise to 600-1000 if softnet_stat column 3 (time_squeeze) is growing.
# Watch: watch -n1 'head -1 /proc/net/softnet_stat'

# netdev_budget_usecs: time budget per net_rx_action run
$ sysctl net.core.netdev_budget_usecs
2000   # 2ms of CPU time per softirq run (default on HZ=1000 kernels; 8000 on HZ=250)
# Combined with netdev_budget, this prevents softirq from monopolizing CPU.
# Lower to 1000 for lower latency (at cost of lower throughput under load).


# ============================================================
# TCP CONGESTION CONTROL — how fast the sender pumps data
# ============================================================

# Current congestion control algorithm:
$ sysctl net.ipv4.tcp_congestion_control
cubic   # default since 2.6.19, good general-purpose algorithm

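# Available algorithms (built in or loaded as modules, e.g. modprobe tcp_bbr):
$ sysctl net.ipv4.tcp_available_congestion_control
cubic reno bbr

# Enable BBR (Bottleneck Bandwidth and RTT), Google's algorithm
# BBR often outperforms CUBIC on high-BDP and lossy networks
# (the name is case-sensitive: "BBR" is rejected)
$ sysctl -w net.ipv4.tcp_congestion_control=bbr

# BBR has no sysctl tunables of its own; verify the selection:
$ sysctl net.ipv4.tcp_congestion_control
# BBR relies on pacing: the fq qdisc provides it (net.core.default_qdisc=fq);
# since kernel 4.13 TCP can also pace internally without fq.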

# tcp_slow_start_after_idle: shrink CWND after idle period
# Disable for persistent long connections (keep CWND high):
$ sysctl -w net.ipv4.tcp_slow_start_after_idle=0

# tcp_syncookies: protect against SYN flood
# Enabled by default (1): cookies are sent only once the SYN queue overflows.
# Does not require conntrack:
$ sysctl -w net.ipv4.tcp_syncookies=1
$ sysctl net.ipv4.tcp_syncookies
1

# tcp_fastopen: send data in SYN (eliminates one RTT for repeat connections)
$ sysctl net.ipv4.tcp_fastopen
1  # client side enabled by default (bitmask: client=1, server=2, both=3)

$ sysctl -w net.ipv4.tcp_fastopen=3  # enable both client and server TFO
# Requires application support (TCP_FASTOPEN setsockopt)

# tcp_tw_reuse: allow TIME_WAIT socket reuse for new outgoing connections
# Only affects the connecting side and requires TCP timestamps:
$ sysctl -w net.ipv4.tcp_tw_reuse=1

# tcp_max_syn_backlog: SYN_RECV queue size (per listener)
# Raise for high-connection-rate servers (the accept queue is capped
# separately by the listen() backlog and net.core.somaxconn):
$ sysctl -w net.ipv4.tcp_max_syn_backlog=4096

# tcp_fin_timeout: how long an orphaned socket stays in FIN_WAIT_2 (default 60s)
# This is not the TIME_WAIT duration, which is fixed at 60s in the kernel.
# Lower for high-churn servers:
$ sysctl -w net.ipv4.tcp_fin_timeout=15


# ============================================================
# OTHER PERFORMANCE TUNING
# ============================================================

# Enable GRO (Generic Receive Offload) — reduces per-packet overhead
# (usually on by default, check):
$ ethtool -k eth0 | grep generic-receive-offload
generic-receive-offload: on

# Enable TSO (TCP Segmentation Offload) — reduces CPU work on TX
$ ethtool -k eth0 | grep tcp-segmentation-offload
tcp-segmentation-offload: on

# Disable if debugging packet-level issues (TSO can obscure what NIC actually sends)

# Enable n-tuple filters (Flow Director on Intel NICs) and RX hashing, if the NIC supports them:
$ ethtool -K eth0 ntuple on
$ ethtool -K eth0 rxhash on
# Note: these program the NIC's hardware filters and hash, not software steering

# ip local port range — ephemeral port allocation range
$ sysctl net.ipv4.ip_local_port_range
32768   60999

# When running out of ephemeral ports (many outbound connections):
$ sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# tcp_keepalive: detect dead connections on sockets that set SO_KEEPALIVE
# (2h idle, then 9 probes 75s apart, then the connection is reset)
# Tune for long-lived connections:
$ sysctl net.ipv4.tcp_keepalive_time
7200    # 2 hours before first probe (default)
$ sysctl net.ipv4.tcp_keepalive_probes
9       # number of probes before killing connection
$ sysctl net.ipv4.tcp_keepalive_intvl
75      # seconds between probes

# Reduce for highly stateful services (k8s nodes, proxies):
$ sysctl -w net.ipv4.tcp_keepalive_time=600   # 10 minutes
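The three keepalive sysctls combine into one worst-case detection time, and an application can override them per socket instead of changing the system-wide defaults. A minimal sketch: the helper names are illustrative, and the TCP_KEEPIDLE, TCP_KEEPCNT, and TCP_KEEPINTVL options are Linux-specific, so the code skips any that the platform does not expose.

```python
import socket

def dead_peer_detection_s(idle_s=7200, probes=9, interval_s=75):
    """Worst-case seconds from last traffic until a dead peer is detected."""
    return idle_s + probes * interval_s

def enable_keepalive(sock, idle_s=600, probes=9, interval_s=75):
    """Turn on keepalive for one socket and override the sysctl defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    overrides = (("TCP_KEEPIDLE", idle_s),
                 ("TCP_KEEPCNT", probes),
                 ("TCP_KEEPINTVL", interval_s))
    for name, value in overrides:
        opt = getattr(socket, name, None)   # not defined on every platform
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)

print(dead_peer_detection_s())       # defaults: a little over two hours
print(dead_peer_detection_s(600))    # with tcp_keepalive_time=600
```

Without SO_KEEPALIVE on the socket, none of the sysctls have any effect, which is a common reason idle connections linger despite tuned values.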


# ============================================================
# COMPLETE SERVER TUNING EXAMPLE (10Gbps, low-latency)
# ============================================================

# Add to /etc/sysctl.d/99-tuning.conf:
#
# TCP buffers
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216

# conntrack sizing (8GB RAM host, expect 500k concurrent)
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_buckets = 262144

# Softirq tuning
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 2000

# TCP performance
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_tw_reuse = 1

# conntrack timeouts (shorter for web servers)
net.netfilter.nf_conntrack_tcp_timeout_established = 3600
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 5
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 5
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 120

BBR and CUBIC are the two congestion control algorithms most worth understanding. CUBIC (the default) is loss-based: it grows the congestion window along a cubic curve until it sees packet loss, so it tends to fill the bottleneck buffer before backing off, which contributes to bufferbloat. BBR (Google, since kernel 4.9) models the bottleneck bandwidth and minimum RTT and paces to that estimate, so it avoids overfilling buffers. On lossy high-BDP links (satellite, cellular, or cross-continental fiber with 50ms+ RTT), BBR can deliver substantially higher throughput than CUBIC, because random loss does not make it back off. For low-latency financial trading where latency variance matters more than throughput, CUBIC with small buffers is often better.
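The cubic curve itself is compact. The sketch below uses the window function and constants from RFC 8312 (C = 0.4, beta = 0.7); the Linux implementation works in fixed-point arithmetic with additional heuristics, so treat this as the textbook form only. Window sizes are in segments and time is in seconds since the last loss.

```python
C = 0.4      # scaling constant (RFC 8312)
BETA = 0.7   # window is cut to BETA * w_max after a loss (RFC 8312)

def cubic_k(w_max):
    """Seconds needed to grow back to w_max after a loss."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)

def cubic_window(t, w_max):
    """Congestion window t seconds after the last loss event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max

w_max = 100
print(f"K = {cubic_k(w_max):.2f} s")
for t in (0, 2, 4, 6, 8):
    print(f"t={t}s  cwnd={cubic_window(t, w_max):6.1f}")
```

The curve is concave up to K (fast recovery toward the old maximum, then a plateau) and convex afterwards (accelerating probing), which is the behavior that keeps pushing the bottleneck buffer toward full until the next loss.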

Tools and Metrics

The Linux network stack is one of the most instrumented subsystems in the kernel. Every stage has a corresponding metric or counter. This section catalogs the key observability points and the tools that expose them.

# ============================================================
# THE METRIC FILES
# ============================================================

# /proc/net/softnet_stat: per-CPU softirq stats
# One line per CPU, hexadecimal columns:
#   col 1:   packets processed
#   col 2:   packets dropped because the per-CPU backlog (netdev_max_backlog) was full
#   col 3:   time_squeeze: net_rx_action stopped with work remaining
#            (netdev_budget or netdev_budget_usecs exhausted)
#   col 4-8: unused (always 0)
#   col 9:   cpu_collision (legacy TX lock contention counter)
#   col 10:  received_rps: times this CPU was woken by RPS to process packets
$ awk '{print $1, $2, $3, $10}' /proc/net/softnet_stat
0000003c 00000000 00000000 00000000   # CPU0: 60 packets processed, no drops
00000152 00000000 00000004 00000000   # CPU1: 338 packets processed, 4 time squeezes
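Because the file is hexadecimal and unlabeled, a small parser makes it easier to read. A minimal sketch under two assumptions: the column positions are the commonly documented ones (processed, dropped, time_squeeze, and received_rps in column 10), and the row index is used as the CPU number, which can be off when CPUs are offline.

```python
# Zero-based column index -> name (assumed layout, see lead-in)
FIELDS = {0: "processed", 1: "dropped", 2: "time_squeeze", 9: "received_rps"}

def parse_softnet(text):
    """Parse the contents of /proc/net/softnet_stat into one dict per row."""
    rows = []
    for cpu, line in enumerate(text.strip().splitlines()):
        cols = [int(col, 16) for col in line.split()]
        row = {"cpu": cpu}
        for idx, name in FIELDS.items():
            if idx < len(cols):            # older kernels print fewer columns
                row[name] = cols[idx]
        rows.append(row)
    return rows

sample = (
    "0000003c 00000000 00000000 00000000 00000000 00000000 00000000 "
    "00000000 00000000 00000000 00000000 00000000 00000000\n"
    "00000152 00000000 00000004 00000000 00000000 00000000 00000000 "
    "00000000 00000000 0000000a 00000000 00000000 00000001\n"
)
for row in parse_softnet(sample):
    print(row)
```

On a live system, pass `open("/proc/net/softnet_stat").read()` and compare two samples taken a few seconds apart, since the counters only ever increase.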

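# /proc/net/stat/nf_conntrack: conntrack table stats (one row per CPU, values in hex)
$ cat /proc/net/stat/nf_conntrack
entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart
00000200  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000  00000000 00000000 00000000 00000000
00000200  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000  00000000 00000000 00000000 00000000

# Key columns:
#   entries       - current conntrack entries (table-wide; repeated on every row; 0x200 = 512)
#   searched/found/new - lookup counters; no longer updated on recent kernels (always 0)
#   invalid       - packets that could not be tracked (malformed or not matching any valid state)
#   insert_failed - insertions that failed, e.g. because of a racing duplicate entry
#   drop          - packets dropped because a new entry could not be allocated (table full)
#   early_drop    - unreplied entries evicted to make room when the table is full
#   expect_new/create/delete - expectation tracking for helpers (e.g., FTP)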


# /sys/class/net/<dev>/statistics/ — per-interface packet/byte counters
$ cat /sys/class/net/eth0/statistics/rx_packets
1234567890
$ cat /sys/class/net/eth0/statistics/tx_packets
987654321
$ cat /sys/class/net/eth0/statistics/rx_dropped
42
$ cat /sys/class/net/eth0/statistics/tx_dropped
0
$ cat /sys/class/net/eth0/statistics/rx_fifo_errors
0

# /proc/net/dev: same kind of counters in table format
$ cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1234567890 12345678    0    0    0     0          0         0 9876543210 98765432    0    0    0     0       0          0


# ============================================================
# SS (socket statistics) — current connections
# ============================================================

# Show all TCP connections with details:
$ ss -ti
State       Recv-Q Send-Q Local Address:Port               Peer Address:Port
ESTAB       0      0      10.0.0.5:443                    10.0.0.10:42312
         cubic rtt:0.123/0.05 cwnd:10 pacing_rate 1.2Mbps     # real-time TCP info (rtt is average/variance in ms)

# ss filters:
$ ss -tlnp              # listening TCP sockets with process name
$ ss -tn state established '( dport = :443 )'  # connections to port 443
$ ss -ti sport = :443   # connection stats from source port 443

# Detailed socket info:
$ ss -m -p              # show memory usage and process owning socket
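#   skmem:(r0,rb131072,t0,tb262144,f0,w0,o0,bl0)
#   r0       = receive memory currently allocated (rmem_alloc)
#   rb131072 = receive buffer limit (rcv_buf)
#   t0       = transmit memory currently allocated (wmem_alloc)
#   tb262144 = send buffer limit (snd_buf)
#   f/w/o/bl = forward-alloc cache, memory queued for sending, option memory, backlog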


# ============================================================
# NETSTAT / IP ROUTE / IP LINK
# ============================================================

$ ip -s link show eth0              # interface info with stats
$ ip -s addr show eth0              # addresses and broadcast
$ ip -s route show                  # routing table (the IPv4 route cache was removed in Linux 3.6)
$ ip route get 8.8.8.8             # show how kernel routes a specific dest

$ netstat -s                       # per-protocol counters (TCP, UDP, ICMP)
$ netstat -i                        # interface stats table
$ netstat -anp | grep :443         # connections on port 443 with PID


# ============================================================
# PACKET CAPTURE AND TRACING
# ============================================================

# tcpdump — classic packet capture
$ tcpdump -i eth0 -nn host 10.0.0.5 and port 443
$ tcpdump -i eth0 -nn -c 1000 'tcp[tcpflags] & tcp-syn != 0'  # SYN flood
$ tcpdump -i eth0 -nn 'udp dst port 53'                       # DNS queries

# tcpdump with BPF filters (compiled to kernel BPF):
#   port 80 or port 443
#   tcp[tcpflags] & tcp-push != 0
#   ip[2:2] < 100  (small packets, possible traceroute)

# nstat — kernel counter polling (for trending):
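$ nstat -az                        # absolute values, including counters that are zero
$ nstat -z                         # increments since the previous run (the default mode), zeros included
$ nstat -r                         # reset the saved history that increments are computed against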

# /proc/net/snmp — classic SNMP counters by protocol
$ cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ...
Ip: 1 64 123456789 0 0 ...

# Useful for:
#   Tcp: RetransSegs, OutSegs, EstabResets, ActiveOpens, PassiveOpens
#   Tcp: CurrEstab (currently established — key metric!)
#   Udp: InDatagrams, OutDatagrams, RcvbufErrors


# ============================================================
# ETHTOOL - NIC configuration and statistics
# ============================================================

$ ethtool eth0                              # interface settings
$ ethtool -i eth0                           # driver info
$ ethtool -S eth0                           # NIC-specific stats (from driver)
$ ethtool -g eth0                           # RX/TX ring sizes
$ ethtool -k eth0                           # offload features (GRO, TSO, etc.)
$ ethtool -K eth0 gro on tso on             # enable offloads
$ ethtool -C eth0 rx-usecs 50 tx-usecs 50   # interrupt coalescing

# Interrupt coalescing:
#   rx-usecs 50: delay the RX interrupt by up to 50µs after a packet arrives,
#   so packets that land within that window share one interrupt
#   Higher value = fewer interrupts (better for throughput)
#   Lower value = lower latency (better for interactive traffic)
#
# tx-usecs works the same way for TX completion interrupts.
# For 10Gbps+ links with bursty traffic, rx-usecs 50-100 is typical.
# For latency-sensitive workloads, use rx-usecs 1 (minimal coalescing).


# ============================================================
# CONNTRACK TOOLS
# ============================================================

$ conntrack -L                    # list all tracked connections
$ conntrack -L -p tcp --state ESTABLISHED | wc -l   # count ESTABLISHED
$ conntrack -E                    # live event stream
$ conntrack -C                    # current number of tracked entries
$ conntrack -F                    # flush table (debug only)
$ conntrack -D -p tcp --orig-src 10.0.0.5  # delete by filter (-G only looks an entry up)

# Reading /proc/net/nf_conntrack directly (raw entries):
$ head -5 /proc/net/nf_conntrack
ipv4     2 tcp      6 431913 ESTABLISHED src=10.0.0.5 dst=10.0.0.10 sport=42312 dport=443 ...
# Format: l3proto l3num l4proto l4num timeout(s) state original-tuple [reply-tuple] [flags] [mark]


# ============================================================
# TC (traffic control) — qdisc observability
# ============================================================

$ tc qdisc show dev eth0                       # show qdisc
$ tc -s qdisc show dev eth0                    # show with stats
$ tc -s -d qdisc show dev eth0                # show with detailed stats
$ tc class show dev eth0                       # show HTB classes
$ tc filter show dev eth0 parent 1:            # show filters


# ============================================================
# XDP and eBPF observability
# ============================================================

$ ip link show eth0                            # shows xdp state
$ bpftool prog list                            # all loaded eBPF programs
$ bpftool prog show id <id>                     # detailed program info
$ bpftool map list                             # all eBPF maps
$ ls /sys/kernel/debug/tracing/events/xdp/     # XDP tracepoints
$ ethtool -S eth0 | grep -i xdp                # driver-specific XDP counters (names vary by driver)

# perf for XDP overhead measurement:
$ perf top -g                                  # find hot spots in XDP path
$ perf record -a -g -e cycles:k sleep 10        # record cycles spent in network path

Tradeoffs

Strengths

NAPI scales to 100 Gbps NICs without IRQ saturation
XDP drops packets before an skb is allocated, so for drop-heavy workloads it can match or beat userspace bypass without dedicating cores to busy-polling
qdiscs (fq_codel, cake) keep bufferbloat in check with little or no tuning
eBPF replaces brittle kernel patches and out-of-tree modules

Sharp edges

iptables evaluates rules linearly, O(N) in rule count, and the cost becomes painful above ~10k entries; kube-proxy in iptables mode hits this
conntrack table fills under SYN floods, dropping legitimate connections
RPS and RFS are off by default; single-queue NICs without RPS waste cores
Each layer has its own metrics — debugging spans tools

Frequently Asked Questions

What is NAPI and why does it exist?

Before NAPI, every received packet caused a hardware interrupt. At line rate on 10Gbps+ NICs that's millions of interrupts per second, and each interrupt costs microseconds, so the system would spend more time servicing IRQs than processing packets (receive livelock). NAPI flips the model: the NIC raises one interrupt, the driver masks further RX interrupts for that queue and schedules its NAPI instance, and net_rx_action then polls the RX ring in softirq context. Each poll handles at most the driver's weight (64 packets by default), and one softirq run stops after net.core.netdev_budget packets (300 by default) so other work can run. When a poll drains the ring, the driver re-enables interrupts; while packets keep arriving, the queue stays in polling mode and the IRQ overhead disappears.

How does XDP differ from a kernel module?

XDP runs eBPF programs at the earliest possible point in the network stack: in native mode, inside the NIC driver, before the skb is even allocated. The program gets raw access to the packet buffer and returns a verdict: XDP_DROP, XDP_TX (send back out the same interface), XDP_REDIRECT (to another interface, CPU, or AF_XDP socket), or XDP_PASS (continue into the regular stack). Kernel modules can do similar things but require kernel-version-specific code, can crash the kernel, and are not verified for safety. eBPF programs are checked by the in-kernel verifier before they load, can be attached and replaced at runtime, and are far more portable across kernel versions.

What's the order of qdisc, netfilter, and routing?

Roughly (egress): socket -> tcp_transmit_skb -> ip_queue_xmit (route lookup, usually served from the socket's cached route) -> netfilter OUTPUT -> netfilter POSTROUTING -> dev_queue_xmit -> qdisc enqueue -> qdisc dequeue (in the sender's context or the NET_TX softirq, whenever the TX ring has room) -> NIC. Ingress is roughly the mirror image: NIC -> driver (XDP, GRO) -> netif_receive_skb -> ip_rcv -> netfilter PREROUTING -> routing decision (local or forward) -> netfilter INPUT or FORWARD -> socket lookup -> socket receive queue -> copy to userspace on recv().

iptables vs nftables vs eBPF — which is in use?

iptables is the long-standing interface, still ubiquitous for compatibility and what kube-proxy speaks by default. nftables is the successor with a unified syntax across IPv4/IPv6/ARP/bridge; it's what RHEL 8+ and Debian 11+ use under the hood (their iptables command is a front end to nftables). eBPF (via tc-bpf or XDP) runs custom programs at packet ingress/egress, before or instead of the netfilter hooks. Stacks like Cilium can replace kube-proxy's iptables rules with eBPF map lookups, so per-packet cost stays roughly constant as the number of services grows instead of growing linearly with rule count.

What is RPS and when do you need it?

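Receive Packet Steering distributes received packets across multiple CPU cores in software. Without RPS, a single-queue NIC delivers all packets to one CPU, which becomes a bottleneck. RPS computes a flow hash over the packet's addresses and ports (or reuses the hash the NIC already computed) and queues the packet onto the backlog of a CPU chosen from the queue's rps_cpus mask; the mask is empty by default, so RPS is off until you set it. RFS (Receive Flow Steering) is RPS plus locality awareness: it tries to place a flow on the same CPU as the userspace process that owns the socket. Multi-queue NICs with hardware RSS make RPS unnecessary in most cases; it still helps when the NIC has fewer queues than cores, or on virtual devices without RSS.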
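
RPS is configured per RX queue by writing a CPU bitmap to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus. A minimal sketch of building that bitmap from a list of CPU numbers follows; the helper name rps_mask is made up for illustration, and the comma-separated 32-bit grouping is the format the kernel prints for masks wider than 32 CPUs.

```python
def rps_mask(cpus):
    """Build the hex CPU bitmap string for an rps_cpus file from CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu          # one bit per CPU allowed to receive steered packets
    digits = f"{mask:x}"
    groups = []
    while digits:                 # split into 32-bit (8 hex digit) groups, lowest last
        groups.insert(0, digits[-8:])
        digits = digits[:-8]
    return ",".join(groups)

print(rps_mask([0, 1, 2, 3]))     # CPUs 0-3
print(rps_mask(range(4, 8)))      # CPUs 4-7
print(rps_mask([32]))             # a CPU beyond the first 32-bit word
```

The resulting string is what you would echo into the rps_cpus file of each RX queue; writing 0 turns RPS back off for that queue.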

What is conntrack and what's its cost?

Conntrack is the netfilter subsystem that records state for every flow (source/dest/port/protocol/state) so stateful firewalls can match 'established connections only' and so NAT can rewrite packets consistently. Each entry is ~300 bytes. The table limit is set in /proc/sys/net/netfilter/nf_conntrack_max (typically 65536 to 1M, scaled from RAM at boot), and nf_conntrack_count in the same directory shows current usage. Costs: every packet does a hash lookup, and the table can fill up under DDoS or on container hosts with millions of short-lived connections. Once it is full, new flows are dropped and the kernel logs 'nf_conntrack: table full, dropping packet'.
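
To put the numbers above together, here is a rough sizing sketch. It takes the ~300 bytes per entry figure from the text as an assumption (the real size depends on kernel version and enabled extensions), and the helper names are illustrative. The utilization check reads nf_conntrack_count and nf_conntrack_max and returns None if the conntrack module is not loaded.

```python
from pathlib import Path

NF_DIR = Path("/proc/sys/net/netfilter")

def table_memory_bytes(max_entries, bytes_per_entry=300):
    """Approximate memory used by a full conntrack table (entry size is an assumption)."""
    return max_entries * bytes_per_entry

def read_counter(name):
    """Read one integer sysctl from the netfilter directory, or None if unavailable."""
    try:
        return int((NF_DIR / name).read_text())
    except (OSError, ValueError):
        return None

def utilization():
    """Fraction of the conntrack table in use, or None when conntrack is not loaded."""
    count = read_counter("nf_conntrack_count")
    limit = read_counter("nf_conntrack_max")
    if count is None or not limit:
        return None
    return count / limit

for limit in (65536, 1_048_576):
    print(limit, "entries ->", table_memory_bytes(limit) // 2**20, "MiB (approx.)")
print("utilization:", utilization())
```

A monitoring check built on this idea would alert well before utilization reaches 1.0, since new flows start being dropped as soon as the table is full.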

What happens at the IP layer for forwarded packets?

When a packet arrives at the routing code, the kernel looks up the destination address in the routing table (managed with iproute2; route -n is the legacy viewer). If the destination is local, the packet goes to INPUT and socket delivery. If it is reachable via another interface and net.ipv4.ip_forward is enabled, the kernel calls ip_forward, which decrements the TTL (answering with ICMP Time Exceeded if the TTL has run out) and runs the netfilter FORWARD hook. The packet then goes to ip_output, which reuses the route attached to the skb by the first lookup rather than consulting the table again, runs POSTROUTING, resolves the next hop's link-layer address, and passes the packet to the qdisc for egress. Forwarding skips the socket lookup and the copy to userspace, so it is usually cheaper per packet than local delivery.

What is GRO and how does it relate to TSO?

Generic Receive Offload (GRO) is the software, receive-side counterpart to TSO (TCP Segmentation Offload) and UFO (UDP Fragmentation Offload). With TSO, the stack hands the NIC a large segment (up to 64KB) and the hardware splits it into MTU-sized frames. GRO does the reverse on receive: it merges consecutive packets of the same flow into one large skb before passing it up the stack, so the cost of netfilter, routing, and TCP processing is paid once per merged skb instead of once per frame. GRO runs inside the NAPI poll loop (napi_gro_receive), just before netif_receive_skb, and it also applies to bridged and tunneled traffic. ethtool -k shows whether GRO and TSO are enabled, and ethtool -K toggles them.
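
The merging idea can be illustrated with a toy coalescer. This is a sketch under simplifying assumptions, not the kernel's algorithm: a segment is a (flow, sequence number, length) tuple, a segment is merged only when it directly continues the data already held for its flow, and the 65535-byte cap stands in for the maximum size of a merged packet.

```python
def gro_coalesce(segments, max_bytes=65535):
    """Merge in-order segments of the same flow; return (flow, start_seq, length) tuples."""
    held = {}   # flow -> [start_seq, length] of the super-packet being built
    out = []
    for flow, seq, length in segments:
        cur = held.get(flow)
        if cur and cur[0] + cur[1] == seq and cur[1] + length <= max_bytes:
            cur[1] += length                      # contiguous: grow the held packet
            continue
        if cur:
            out.append((flow, cur[0], cur[1]))    # gap or size cap: flush what we held
        held[flow] = [seq, length]
    # end of the poll batch: everything still held goes up the stack
    out.extend((flow, start, length) for flow, (start, length) in held.items())
    return out

batch = [("A", 0, 1448), ("A", 1448, 1448), ("B", 0, 100), ("A", 2896, 1448)]
print(gro_coalesce(batch))
```

Four received frames become two packets for the upper layers here; an out-of-order segment would break the chain and be delivered separately.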

How does tcp_congestion_control affect networking performance?

TCP congestion control is the algorithm that governs how fast a sender can transmit before the network signals congestion. CUBIC (default since 2.6.19) is loss-based: it backs off whenever it sees packet loss. BBR (Google's Bottleneck Bandwidth and RTT, available since Linux 4.9) instead models the path's bandwidth and round-trip time, which targets higher throughput on high-BDP links. You can see what's active with sysctl net.ipv4.tcp_congestion_control, list the loaded algorithms with sysctl net.ipv4.tcp_available_congestion_control, and switch with sysctl -w net.ipv4.tcp_congestion_control=bbr. BBR can significantly outperform CUBIC on long or lossy paths, where CUBIC misreads random loss as congestion; on clean, low-latency datacenter links the difference is usually small, so measure before switching. The related sysctl net.ipv4.tcp_slow_start_after_idle controls whether the congestion window is reset after an idle period; disabling it (setting it to 0) helps keep throughput stable on persistent connections.
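
Besides the system-wide sysctl, Linux also lets a single socket pick its algorithm through the TCP_CONGESTION socket option. The sketch below assumes a Linux host; the two helper names are illustrative, and the request for bbr simply fails if that algorithm is not available to the process.

```python
import socket

# Linux value of TCP_CONGESTION, used if this Python build does not export the constant.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)

def get_congestion_control(sock):
    """Return the congestion control algorithm of one TCP socket, or None if unsupported."""
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
    except OSError:
        return None
    return raw.split(b"\x00", 1)[0].decode(errors="replace")

def try_set_congestion_control(sock, name):
    """Request `name` for this socket only; False if the kernel refuses."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, name.encode())
        return True
    except OSError:
        return False

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print("default:", get_congestion_control(s))
    if try_set_congestion_control(s, "bbr"):
        print("after request:", get_congestion_control(s))
    else:
        print("bbr not available on this host")
```

Setting the option per socket is a low-risk way to compare algorithms for one service without changing the default for everything else on the machine.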

What's the difference between SO_REUSEPORT and SO_REUSEADDR?

SO_REUSEADDR lets a server bind() to a port that still has connections in TIME_WAIT, which is critical for restarting a server without waiting for TIME_WAIT to expire; it does not allow two live listeners on the same address and port. SO_REUSEPORT (Linux 3.9+) goes further: multiple processes or threads can each bind() and listen() on the same port, as long as every socket sets the option before bind() and they share the same effective UID, and the kernel load-balances incoming connections across them using a hash of the 4-tuple. This enables true multi-process parallelism, unlike the older model where forked workers all accept() on one inherited listen socket: there, every worker contends on a single accept queue and connections can be spread unevenly. On machines with many cores and high connection counts (e.g., a proxy or API gateway), SO_REUSEPORT can eliminate the thundering-herd bottleneck of a single listen socket.
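
A minimal sketch of the SO_REUSEPORT pattern, assuming a Linux host: two listening sockets share one port, while a third socket that does not set the option is refused. The helper name reuseport_listener is illustrative; in a real server each listener would live in its own worker process or thread.

```python
import socket

def reuseport_listener(port, host="127.0.0.1"):
    """Create a listening TCP socket with SO_REUSEPORT set before bind()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(128)
    return s

first = reuseport_listener(0)              # port 0: let the kernel pick a free port
port = first.getsockname()[1]
second = reuseport_listener(port)          # second listener on the same port succeeds

plain = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    plain.bind(("127.0.0.1", port))        # no SO_REUSEPORT: expected to fail
    refused = False
except OSError:
    refused = True

print("shared port:", port, "| plain bind refused:", refused)
for sock in (first, second, plain):
    sock.close()
```

With both listeners open, the kernel assigns each new connection to one of them by hashing the connection's addresses and ports, so each worker gets its own accept queue.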

What does the routing table look like and how does the kernel decide?

The kernel FIB (Forwarding Information Base) is what the routing decision consults: on egress to choose the outgoing interface and next hop, and on ingress to decide whether a packet is for the local machine or should be forwarded (the result is often cached on the socket or skb, so not every packet pays for a full lookup). The lookup key is the destination address; the result is a fib_result containing the outgoing interface, next hop gateway, and route type (RTN_UNICAST, RTN_BLACKHOLE, RTN_THROW for policy routing, etc.). Multiple routing tables are supported via policy routing (ip rule). By default there are three tables (local, main, default), consulted in that order by the default rules. Within a table the kernel uses a longest-prefix match: 10.0.0.0/24 beats 10.0.0.0/16. If multiple routes match at the same prefix length, the one with the lowest metric wins. Blackhole routes (for a single /32 or a whole subnet) are useful for DDoS mitigation: you null-route the target, for example with ip route add blackhole 203.0.113.0/24, and the kernel drops matching packets without processing them further.
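
Longest-prefix matching with a metric tie-break can be sketched in a few lines. This is only an illustration of the selection rule: the kernel uses a trie rather than a linear scan, and the route table, interface names, and addresses below are made up for the example.

```python
import ipaddress

# (prefix, metric, action) - hypothetical routes for illustration only
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), 100, "via 192.0.2.1 dev eth0"),
    (ipaddress.ip_network("10.0.0.0/16"), 0, "dev eth1"),
    (ipaddress.ip_network("10.0.0.0/24"), 50, "dev eth3"),
    (ipaddress.ip_network("10.0.0.0/24"), 0, "dev eth2"),
    (ipaddress.ip_network("203.0.113.0/24"), 0, "blackhole"),
]

def lookup(routes, dst):
    """Pick the matching route with the longest prefix; the lowest metric breaks ties."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in routes if addr in r[0]]
    if not matches:
        return None
    prefix, metric, action = max(matches, key=lambda r: (r[0].prefixlen, -r[1]))
    return action

for dst in ("10.0.0.7", "10.0.5.1", "8.8.8.8", "203.0.113.9"):
    print(dst, "->", lookup(ROUTES, dst))
```

The /24 wins over the /16 for 10.0.0.7, the metric decides between the two /24 routes, and the blackhole entry catches the null-routed subnet before the default route can.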

What is the cost of a system call vs. a kernel bypass mechanism?

A typical recv()/send() system call involves: userspace -> syscall entry -> check socket state -> copy from socket receive queue to user buffer (or vice versa) -> possibly wake another thread -> syscall exit. The entry/exit transition alone costs on the order of 100-200ns on a fast system (more with speculative-execution mitigations enabled), before any data is copied. io_uring with registered (fixed) buffers and SQPOLL cuts most of that per-operation overhead: the buffers are pinned and mapped once up front, and a kernel thread polls the submission queue, so a busy application can submit I/O without entering the kernel at all. AF_XDP takes it further: an XDP program redirects packets directly into a userspace ring buffer (the UMEM), bypassing sk_buff allocation and the rest of the kernel's network stack. In a packet-processing benchmark, this can mean the difference between 2M and 14M packets per second on the same hardware.
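
For context on those packet rates, a short calculation shows the per-packet time budget they imply. The 10GbE comparison is added here as an assumption for illustration; the source does not say which link speed the benchmark used. The 20 bytes of wire overhead are the Ethernet preamble plus inter-frame gap.

```python
def line_rate_pps(link_bps, frame_bytes=64):
    """Maximum packets per second on an Ethernet link for a given frame size."""
    wire_bytes = frame_bytes + 20   # 8 bytes preamble/SFD + 12 bytes inter-frame gap
    return link_bps / (wire_bytes * 8)

def ns_per_packet(pps):
    """Time available to process one packet at a given packet rate."""
    return 1e9 / pps

print("10GbE, 64-byte frames:", round(line_rate_pps(10e9)), "pps")
for rate in (2e6, 14e6):
    print(f"{rate / 1e6:.0f} Mpps leaves {ns_per_packet(rate):.0f} ns per packet")
```

At around 14 Mpps the whole per-packet budget is smaller than the 100-200ns syscall transition mentioned above, which is why batching and bypass mechanisms matter at those rates.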

How does netfilter interact with routing decisions?

Netfilter hooks are interleaved with routing decisions, not separate from them. In the ingress path: PREROUTING runs before routing (so DNAT can change the destination before the routing lookup), then routing decides LOCAL vs FORWARD, then the INPUT or FORWARD chains fire. In the egress path, locally generated traffic is routed first, then OUTPUT runs (the kernel re-routes the packet if OUTPUT rewrote its destination or mark), then POSTROUTING. The mangle table lets you modify packet headers (TTL, TOS, marking) at various points. Conntrack hooks in at PREROUTING and OUTPUT, right after the raw table, so NAT and every later chain can match on connection state; new entries are confirmed at INPUT or POSTROUTING. A key practical note: if you mark a packet with nfmark in mangle PREROUTING, the routing lookup that follows can use that mark for policy routing (ip rule fwmark). This is how stateful firewalls can force forwarded traffic through specific routing paths based on connection state.
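
The ingress ordering matters most for DNAT: because PREROUTING runs before the routing lookup, a rewritten destination changes which path the packet takes. The toy trace below illustrates this under made-up assumptions; the local address, the DNAT rule, and the function name are hypothetical, and real netfilter does far more at each hook.

```python
LOCAL_ADDRS = {"192.0.2.10"}                           # addresses owned by this host
DNAT_RULES = {("192.0.2.10", 8080): ("10.0.0.5", 80)}  # hypothetical port-forward rule

def ingress_path(dst_ip, dst_port):
    """Return the hooks an incoming packet crosses and its final destination."""
    trace = ["PREROUTING"]
    # DNAT happens in PREROUTING, so the rewrite is visible to the routing lookup
    dst_ip, dst_port = DNAT_RULES.get((dst_ip, dst_port), (dst_ip, dst_port))
    trace.append("routing")
    if dst_ip in LOCAL_ADDRS:
        trace.append("INPUT")                          # local delivery
    else:
        trace += ["FORWARD", "POSTROUTING"]            # forwarded out another interface
    return trace, (dst_ip, dst_port)

print(ingress_path("192.0.2.10", 22))     # stays local
print(ingress_path("192.0.2.10", 8080))   # DNAT turns it into forwarded traffic
```

Both packets arrive addressed to the host itself, but only the one without a DNAT match reaches INPUT; the other is routed on its rewritten destination and crosses FORWARD and POSTROUTING instead.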