eBPF

Sandboxed kernel programs for observability without instrumentation

eBPF (extended Berkeley Packet Filter) lets you attach small, verified programs to kernel hook points — system call entry/exit, network packet RX, scheduler events, TCP retransmits, perf counters — without recompiling the kernel or loading a kernel module. The kernel runs your program in a sandbox: it cannot crash the host, cannot loop forever, cannot access arbitrary memory. It can only do what the verifier proves safe.

For observability, this means you can trace every TCP connection, profile every function call, or build a service map of every gRPC call from outside the application, typically with overhead in the low single-digit percent. Cilium uses it as a service mesh dataplane. Pixie uses it for zero-config auto-instrumentation. Parca, Pyroscope, and Profefe use it for continuous CPU profiling.

Architecture

An eBPF program lives in two places: a small in-kernel routine attached to a hook, and a userspace controller that loads it, reads its output, and exposes it as metrics, logs, or traces.

Key Numbers

512

stack bytes available to a single program

1M

verifier instruction limit (was 4096 pre-5.2)

~10 ns

overhead of a JIT'd kprobe trampoline

~99 Hz

typical sampling rate for CPU profilers

11

general-purpose registers (r0-r10) in the eBPF VM

4.x+

kernel needed for kprobes; 5.x+ for most BTF-driven tools

CO-RE

Compile Once, Run Everywhere via BTF type info

eBPF Program Types

Every eBPF program has a specific type that determines its allowed hook points, the context struct handed to it at invocation, and which helper functions it may call. Choosing the right program type is the first architectural decision in any eBPF project.

BPF_PROG_TYPE_KPROBE

Attached to arbitrary kernel function entries. The context is the saved register state at the probe site (r1 = struct pt_regs *). Fragile across kernel versions because function names and signatures change, but works on any kernel 4.x+. Where available (kernel 5.5+ with BTF), prefer fentry/fexit programs (BPF_PROG_TYPE_TRACING): they attach by function name and let the verifier type-check the arguments against BTF. A kprobe fires on every invocation of the target function, so its cost scales with how hot that function is.

BPF_PROG_TYPE_TRACEPOINT

Attached to named, stable kernel tracepoints. These are part of the kernel ABI and rarely change between versions. Tracepoint arguments are typed: the kernel generates a struct for each tracepoint (e.g. struct trace_event_raw_sys_enter) with named fields. This is the preferred attach point for observability — stable API, low overhead, no dependency on internal kernel layout. The tradeoff: not every interesting kernel event has a tracepoint.

BPF_PROG_TYPE_PERF_EVENT

Attached to perf hardware or software events via perf_event_open(). The context is struct bpf_perf_event_data, which carries the sampled register state (regs) along with sample_period and addr. Used by continuous CPU profilers: the program calls bpf_get_stackid() to store the current stack in a BPF_MAP_TYPE_STACK_TRACE map and increments a counter keyed by the returned stack ID. You control the sampling period from userspace; the program itself is just the per-sample handler.

BPF_PROG_TYPE_XDP

Runs at the earliest possible point in the NIC driver's receive path, before the packet enters the kernel's network stack. The context is struct xdp_md, which exposes the packet through its data and data_end fields. XDP programs can redirect packets to other interfaces, drop them, or modify headers and pass them up the stack. At 10 Gbps+ line rates, per-packet eBPF execution is measurable but usually acceptable. Below 1 Gbps it is essentially free. Native XDP requires driver support; most modern drivers (mlx5, i40e, ixgbe, virtio-net) provide it.

BPF_PROG_TYPE_SOCKET_FILTER

The oldest program type, inherited from classic BPF. Attaches to a socket via SO_ATTACH_BPF (classic filters use SO_ATTACH_FILTER) and sees each packet through a struct __sk_buff context. This is the mechanism behind tcpdump-style filtering: the return value is the number of bytes of the packet to keep, with 0 meaning discard. In modern observability stacks, this has largely been superseded by tc and XDP, which offer richer context. Socket filters still see use for per-application packet counting.

BPF_PROG_TYPE_SCHED_CLS

Attached to traffic control (qdisc) hooks on network devices, both ingress and egress. The context is struct __sk_buff, which gives access to packet headers, mark, cgroup, and socket state. Cilium uses cls programs extensively for policy enforcement: the program inspects the connection, looks up the security policy in a map, and either allows or drops the packet. Slower than XDP, but it has full skb metadata, and on egress it runs after the routing decision.

BPF_PROG_TYPE_CGROUP_SOCK_ADDR

Attaches to cgroup hooks for socket-level operations: bind, connect, sendmsg, recvmsg. The context is struct bpf_sock_addr, which exposes the IPv4 or IPv6 address and port the call is about to use and lets the program rewrite them. (The TCP state machine hooks use a different program type, BPF_PROG_TYPE_SOCK_OPS, described next.) This is how Cilium implements socket-level load balancing: it intercepts a connect() call to a service address and rewrites the destination to a backend, without application knowledge. Also used by some network namespace isolation tools.

BPF_PROG_TYPE_SOCK_OPS

Attaches to the TCP state machine via a struct bpf_sock_ops context. The program receives events for connection establishment, close, retransmit, and congestion window changes. Cilium uses sockops to maintain connection tracking maps and to implement the sockhash maps that correlate packets to connections for visibility. Combined with BPF_MAP_TYPE_SOCKHASH, this gives a per-connection view of throughput, retransmits, and latency — entirely from the kernel's TCP state.

The Verifier

The eBPF verifier is the gatekeeper that makes eBPF programs safe to run in kernel context. Before any program is executed, the kernel performs static analysis across all possible control flow paths and rejects programs that cannot be proven safe. Understanding what the verifier allows and what it rejects is essential for writing non-trivial eBPF programs.

Control Flow Graph Analysis

The verifier builds a control flow graph (CFG) from the bytecode and walks every reachable path from the entry instruction. It tracks the contents of all 11 registers (r0-r10) and the stack at each instruction boundary. A register may be invalid (never initialized), a scalar (a known constant or unknown value), or a pointer to a region of memory with known size and read/write permissions.

The key invariant the verifier enforces: every path through the program must pass the same safety checks. If there is a conditional branch, both paths must be safe. If the program reads from a pointer, all paths that reach that instruction must have already validated the pointer. This is why branches that narrow the state space matter — if (offset < len) { read(offset); } works because the verifier knows that inside the block, offset is bounded.
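The narrowing described above can be modeled outside the kernel. The following is a minimal Python sketch, not the verifier's actual algorithm: `Range`, `narrow_less_than`, and `access_is_safe` are hypothetical names, and the model tracks only a single unsigned interval per scalar.

```python
from dataclasses import dataclass
from typing import Optional

U32_MAX = 2**32 - 1


@dataclass(frozen=True)
class Range:
    """Closed interval [lo, hi] that a scalar register is known to lie in."""
    lo: int
    hi: int


def narrow_less_than(r: Range, bound: int) -> Optional[Range]:
    """Range on the taken branch of `if (x < bound)`; None if that branch is unreachable."""
    hi = min(r.hi, bound - 1)
    return Range(r.lo, hi) if r.lo <= hi else None


def access_is_safe(offset: Range, size: int, region_len: int) -> bool:
    """True only if a `size`-byte read is in bounds for every offset in the range."""
    return offset.lo >= 0 and offset.hi + size <= region_len


unknown = Range(0, U32_MAX)              # u32 loaded from a packet: no useful bound
checked = narrow_less_than(unknown, 64)  # state inside `if (offset < 64) { ... }`
```

The same read is rejected with `unknown` and accepted with `checked`, which is the sense in which a branch "narrows the state space".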

Pointer Arithmetic Rules

Pointers in eBPF are typed: ptr + scalar is allowed but the verifier tracks the arithmetic. The result pointer is only valid if the operation stays within the original allocation's bounds. The verifier rejects operations where the bound is unknown:

{`// REJECTED - offset is a u32 from the packet, no upper bound
u32 offset = load_u32(pkt, off);
volatile char *p = pkt_start + offset;  // can't add untrusted scalar to ptr
return *p;  // verifier error

// ACCEPTED - offset is a constant
volatile char *p = pkt_start + 4;
return *p;  // fine, known offset

// ACCEPTED - offset is range-checked
if (offset >= 0 && offset < pkt_len) {
    volatile char *p = pkt_start + offset;
    return *p;  // verifier tracks offset as [0, pkt_len)
}

// ACCEPTED - go through the bpf_skb_load_bytes helper,
// which performs the bounds check internally
u8 val;
if (bpf_skb_load_bytes(ctx, offset, &val, 1) < 0) return 0;  // offset out of range
return val;`}

Stack Access Rules

Each program has a 512-byte private stack. It is not zero-initialized; the verifier tracks which slots are initialized and rejects reads from uninitialized slots. Pointers to the stack are allowed as long as they stay within the 512-byte region. Map values follow a related rule: the pointer returned by bpf_map_lookup_elem may be NULL and must be checked before use, and it is only meaningful during the current invocation, because the element can be deleted or reused afterwards. Forgetting the NULL check is a common mistake.

{`// REJECTED - map value pointer dereferenced without a NULL check
u64 *count = bpf_map_lookup_elem(&map, &key);
(*count)++;  // verifier error: count may be NULL

// ACCEPTED - look up on every invocation and NULL-check before use
u64 *count = bpf_map_lookup_elem(&map, &key);
if (!count) return 0;
(*count)++;  // verifier now knows count points at a valid map value`}
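The stack rules can be illustrated with a small model. This is a simplified Python sketch under stated assumptions: `StackModel` is a hypothetical name, it tracks initialization per byte rather than per slot, and it ignores the per-path state the real verifier keeps.

```python
STACK_SIZE = 512  # bytes, addressed as negative offsets from the frame pointer (r10)


class StackModel:
    """Tracks which stack bytes have been written before they are read."""

    def __init__(self) -> None:
        self.initialized = set()

    @staticmethod
    def _in_bounds(off: int, size: int) -> bool:
        # A valid access covers [off, off + size) inside [-512, 0).
        return size > 0 and -STACK_SIZE <= off and off + size <= 0

    def write(self, off: int, size: int) -> bool:
        """Accept an in-bounds write and mark its bytes initialized."""
        if not self._in_bounds(off, size):
            return False
        self.initialized.update(range(off, off + size))
        return True

    def read(self, off: int, size: int) -> bool:
        """Accept a read only if it is in bounds and every byte was written first."""
        if not self._in_bounds(off, size):
            return False
        return all(b in self.initialized for b in range(off, off + size))
```

A read at `fp-8` fails on a fresh `StackModel` and succeeds after a matching write, while any access below `fp-512` is refused regardless of initialization.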

Loop Detection

The verifier rejects programs containing loops that cannot be proven to terminate. Any back-edge in the CFG (a jump from a later instruction to an earlier one) marks a loop. Kernels before 5.3 reject every back-edge outright; from 5.3 on, the verifier simulates the loop and accepts it only if it can prove the loop exits within the instruction budget. This means:

#pragma unroll tells clang to unroll a loop at compile time, producing straight-line bytecode with no back-edge. The verifier sees it as a sequence of instructions, not a loop.
Using a break condition that depends on a non-constant requires an additional bound the verifier can see, such as a constant maximum iteration count. To walk a map from inside a program, use the bpf_for_each_map_elem helper (kernel 5.13+); bpf_map_get_next_key is only available to userspace.
Some patterns that look iterative (processing a variable-length linked list) can be converted to a chain of tail calls if each step makes forward progress, subject to the kernel's cap on tail-call depth.

{`// REJECTED - verifier sees back-edge, unbounded trip count
int i = 0;
while (1) {
    bpf_printk("iteration %d", i);
    i++;
}

// ACCEPTED - #pragma unroll forces full unroll at compile time
#pragma unroll
for (int i = 0; i < 64; i++) {
    bpf_printk("iteration %d", i);  // 64 copies of the body in bytecode
}

// ACCEPTED on 5.3+ - the back-edge is still there, but the verifier can
// prove the loop runs at most 128 times
for (int i = 0; i < 128; i++) {
    if (i >= n) break;  // n is a runtime value; the constant bound still holds
    // process element i
}`}
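To make the back-edge idea concrete, here is a minimal Python sketch. It is an illustration only: `find_back_edges` and `unrolled_size` are hypothetical helpers, the program is reduced to a map of jump source to jump target, and the real verifier walks the full CFG rather than comparing indices.

```python
from typing import Dict, List, Tuple


def find_back_edges(jumps: Dict[int, int]) -> List[Tuple[int, int]]:
    """Return (source, target) pairs where a jump lands on the same or an earlier instruction."""
    return sorted((src, dst) for src, dst in jumps.items() if dst <= src)


def unrolled_size(body_insns: int, trip_count: int) -> int:
    """Instruction count of a fully unrolled loop: one copy of the body per iteration."""
    return body_insns * trip_count


# `while (1) { ... }`: instruction 3 jumps back to instruction 1.
endless_loop = {3: 1}
# A fully unrolled loop has no jumps at all, hence no back-edges.
unrolled_loop: Dict[int, int] = {}
```

The second helper shows the cost of avoiding the back-edge: a 30-instruction body unrolled 64 times occupies 1,920 instructions of the budget.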

The 1M Instruction Limit

Since kernel 5.2, the instruction limit for privileged programs is 1,048,576 (1M), up from 4096 in earlier kernels. The verifier counts the instructions it processes across all explored paths, so a heavily branching program can hit the limit with far fewer instructions in its bytecode. This is not a performance budget: programs approaching the limit load slowly, because the verifier's path exploration is exponential in the worst case. It does mean that programs with large data tables encoded as instruction sequences (eBPF's early map alternative) are now viable. In practice, most production programs are a few hundred to a few thousand instructions.

The limit applies to the bytecode after clang compilation, not the source. A single bpf_printk() call generates dozens of instructions (string encoding, format setup). Large #pragma unroll loops multiply quickly. If you approach the limit, the first fix is usually to move data lookup into a map and use a few instructions to retrieve it, rather than encoding it as immediate values in the instruction stream.

Common Verifier Errors

{`// error: R1 invalid mem access
// kprobe dereferencing a pointer taken from pt_regs directly
// fix: copy the data out with bpf_probe_read_kernel(dst, size, src),
// or bpf_probe_read_user(dst, size, src) for userspace pointers

// error: infinite loop detected (older kernels: back-edge from insn X to Y)
// looping without a trip-count bound the verifier can prove
while (node) { ... }

// error: stack must be within ...
// writing beyond 512-byte stack limit
volatile char buf[600];  // 600 > 512

// error: dereferencing of uninit addr
// using a variable as an index into a stack array without bounds check
int idx = r6;  // r6 is a scalar from packet
volatile int val = stack[idx];  // rejected unless idx is range-checked first`}

BPF Maps Deep Dive

Maps are the only stateful primitive in the eBPF environment. They are created by userspace via the bpf() syscall, shared with programs via file descriptors, and used by programs for counters, histograms, scratch space, and inter-program communication. The map type determines its access semantics, concurrency guarantees, and memory model.

BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY

The two fundamental map types. A hash map (BPF_MAP_TYPE_HASH) stores arbitrary key/value pairs in a hash table whose capacity (max_entries) is fixed at creation. Key and value sizes are also defined at map creation time. Operations available to programs: bpf_map_lookup_elem, bpf_map_update_elem, bpf_map_delete_elem; userspace additionally iterates keys with bpf_map_get_next_key. Updates take a per-bucket spinlock, so high-frequency updates to the same hot keys from multiple CPUs contend with each other.

An array map (BPF_MAP_TYPE_ARRAY) is a dense indexed array. Keys are always integers (0..max_entries-1). Lookup is O(1) with no hashing. For fixed-size buckets in a histogram, array maps are faster than hashes. The common pattern: histogram[index] to increment a latency bucket or syscall counter.
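A common way to pick the array index for a latency histogram is a power-of-two bucket. The Python sketch below shows that indexing under stated assumptions: `log2_bucket` and `record` are hypothetical names, and the 32-bucket size is an arbitrary choice, not something the text above prescribes.

```python
from typing import List


def log2_bucket(value: int, num_buckets: int = 32) -> int:
    """Index of the power-of-two bucket for `value`, clamped to the last bucket."""
    if value <= 0:
        return 0
    return min(value.bit_length() - 1, num_buckets - 1)


def record(hist: List[int], latency_ns: int) -> None:
    """Increment the bucket for one observation, like histogram[index]++ on an array map."""
    hist[log2_bucket(latency_ns, len(hist))] += 1


hist = [0] * 32
for sample in (1, 2, 3, 1024):
    record(hist, sample)
```

Bucket k covers values from 2^k up to 2^(k+1) - 1, so a fixed-size array spans many orders of magnitude with a constant-time index computation.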

PERCPU Variants

BPF_MAP_TYPE_PERCPU_HASH and BPF_MAP_TYPE_PERCPU_ARRAY maintain a separate copy of every slot for each CPU. When the program increments a counter, it only touches its own CPU's copy — no cross-CPU synchronization whatsoever. Userspace reads the map by iterating over all CPUs and summing the values. This is the right choice for any counter touched more than a few thousand times per second per CPU. The memory cost is multiplied by the number of CPUs (e.g. 128 CPUs × 1024 entries).

{`// Wrong: shared hash map, CPUs contend on hot buckets at 100K syscalls/s
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u64);
} counts SEC(".maps");
// every CPU updates the same value slot for a hot key

// Right: per-CPU hash, each CPU touches only its own copy
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u64);
} counts SEC(".maps");
// No cross-CPU atomics needed: plain per-CPU increments.

// Userspace aggregation (libbpf): a lookup on a per-CPU map
// fills one value per possible CPU
int nrcpus = libbpf_num_possible_cpus();
__u32 *prev = NULL, key, next_key;
long long total = 0;
while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
    __u64 vals[nrcpus];
    if (bpf_map_lookup_elem(map_fd, &next_key, vals) == 0)
        for (int cpu = 0; cpu < nrcpus; cpu++) total += vals[cpu];
    key = next_key;
    prev = &key;
}`}
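The memory cost and the userspace summation can be sketched in a few lines of Python. This is a model, not kernel code: `PerCpuCounter` and the helper names are hypothetical, the byte count covers value storage only (no keys or per-entry overhead), and it assumes per-CPU values are padded to a multiple of 8 bytes.

```python
from typing import Dict, List


def percpu_value_bytes(value_size: int) -> int:
    """Per-CPU value size rounded up to a multiple of 8 bytes (assumed padding rule)."""
    return (value_size + 7) // 8 * 8


def percpu_map_value_bytes(num_cpus: int, max_entries: int, value_size: int) -> int:
    """Value storage for a full per-CPU map: one padded value per entry per CPU."""
    return num_cpus * max_entries * percpu_value_bytes(value_size)


class PerCpuCounter:
    """One private dict per CPU; increments never touch another CPU's copy."""

    def __init__(self, num_cpus: int) -> None:
        self.slots: List[Dict[int, int]] = [dict() for _ in range(num_cpus)]

    def increment(self, cpu: int, key: int) -> None:
        self.slots[cpu][key] = self.slots[cpu].get(key, 0) + 1

    def values(self, key: int) -> List[int]:
        """What a userspace lookup sees: one value per CPU."""
        return [slot.get(key, 0) for slot in self.slots]

    def total(self, key: int) -> int:
        return sum(self.values(key))
```

For the 128 CPU, 1024 entry example with 8-byte counters, `percpu_map_value_bytes(128, 1024, 8)` comes to 1 MiB of value storage.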

BPF_MAP_TYPE_STACK_TRACE

A specialized map that stores stack traces. The key is a 32-bit stack ID and the value is an array of instruction pointers (by default up to 127 frames). The program does not fill it with bpf_map_update_elem; instead it calls bpf_get_stackid(ctx, &stackmap, flags), which makes the kernel walk the current kernel or user stack, store the instruction pointers in the map, and return the stack ID. Identical stacks hash to the same ID, so the map deduplicates them. User stacks are walked via frame pointers, so binaries built without them yield truncated traces. Userspace later reads the map and resolves the IPs to symbol names (kernel addresses via /proc/kallsyms, user addresses via the binary's symbol tables). This is how continuous CPU profilers reconstruct call stacks: the perf event fires, the program gets the pid/tid and the stack ID, and increments a counter keyed by them.
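The profiler flow can be modeled in Python, assuming the usual arrangement where the stack map is keyed by a stack ID and a second map counts samples per ID. `StackTable` and its methods are hypothetical stand-ins; the real kernel derives the ID from a hash of the frames and can report collisions, which this sketch ignores.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


class StackTable:
    """Deduplicating store: identical instruction-pointer lists share one ID."""

    def __init__(self) -> None:
        self.ids: Dict[Tuple[int, ...], int] = {}
        self.stacks: Dict[int, Tuple[int, ...]] = {}

    def get_stackid(self, ips: Sequence[int]) -> int:
        """Store the stack if it is new and return its ID (stand-in for bpf_get_stackid)."""
        key = tuple(ips)
        if key not in self.ids:
            new_id = len(self.ids)
            self.ids[key] = new_id
            self.stacks[new_id] = key
        return self.ids[key]

    def lookup(self, stack_id: int) -> List[int]:
        """What userspace reads back when it resolves a stack ID to frames."""
        return list(self.stacks[stack_id])


table = StackTable()
counts: Counter = Counter()
samples = [(100, [0xA, 0xB]), (100, [0xA, 0xB]), (200, [0xA, 0xC])]  # (pid, frames)
for pid, frames in samples:
    counts[(pid, table.get_stackid(frames))] += 1
```

Because the per-sample work is one ID lookup and one counter increment, the stack itself is stored once no matter how often it is sampled.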

BPF_MAP_TYPE_LRU_HASH

An LRU (Least Recently Used) hash map evicts entries automatically when an insert arrives and the map is already at capacity. The BPF_MAP_TYPE_LRU_PERCPU_HASH variant combines LRU eviction with per-CPU storage. LRU maps are ideal for caches where you want the kernel to manage capacity: a connection tracking table, a DNS response cache, or a rate limiter window. The kernel's LRU is approximate: it keeps per-CPU free lists in front of a shared LRU list (or fully per-CPU lists with BPF_F_NO_COMMON_LRU), so the evicted entry is not always the globally least recently used one.
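The eviction behavior can be sketched with an exact LRU in Python. This is a model under stated assumptions: `LruMap` is a hypothetical class built on `OrderedDict`, and it evicts the strictly least recently used key, which is simpler than what the kernel does.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class LruMap:
    """Fixed-capacity map that evicts the least recently used key on insert."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries: "OrderedDict[Hashable, object]" = OrderedDict()

    def lookup(self, key: Hashable) -> Optional[object]:
        """Return the value and mark the key as recently used; None on a miss."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

    def update(self, key: Hashable, value: object) -> None:
        """Insert or overwrite; a full map evicts its oldest entry instead of failing."""
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)
        self.entries[key] = value
```

The point of the design is that `update` never fails on a full map, which is what makes LRU maps convenient for connection tables that must keep accepting new flows.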

BPF_MAP_TYPE_RINGBUF vs BPF_MAP_TYPE_PERF_EVENT_ARRAY

The two mechanisms for streaming events from kernel to userspace. The perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) was the original mechanism: the program calls bpf_perf_event_output() to copy a custom struct into a per-CPU mmap'd ring, and userspace polls the per-CPU file descriptors obtained from perf_event_open(). Because each CPU has its own ring, memory is reserved per CPU and events from different CPUs can reach userspace out of order.

The BPF ring buffer (BPF_MAP_TYPE_RINGBUF, introduced in kernel 5.8) is a simpler, higher-performance circular buffer shared by all CPUs. Producers (eBPF programs) reserve a slot, write their data, and submit. The buffer is multi-producer, single-consumer, and userspace waits with epoll_wait(). It preserves event order across CPUs, uses memory-mapped I/O, and avoids the extra copy that perf buffers require. When the ring is full, a reservation fails and the event is dropped, so size it for bursts. Ring buffers are the right default for new observability pipelines.

{`// Ring buffer - reserve/submit delivery
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // 256KB ring, power of 2
} events SEC(".maps");

struct event { u32 pid; u64 ts; char comm[16]; };
SEC("tracepoint/syscalls/sys_enter_openat")
int handle_openat(struct trace_event_raw_sys_enter *ctx) {
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(struct event), 0);
    if (!e) return 0;  // ring full, drop and continue
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);  // commit to ring, userspace receives it
    return 0;
}

// Userspace poll loop (libbpf); skel is the generated skeleton object
int rb_fd = bpf_map__fd(skel->maps.events);
struct ring_buffer *rb = ring_buffer__new(rb_fd, callback, NULL, NULL);
while (running) {
    ring_buffer__poll(rb, 500 /* ms timeout */);
}`}
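The reserve/submit protocol, including the drop-on-full path, can be modeled in Python. This is a sketch under stated assumptions: `RingBuf` is a hypothetical class, it charges an 8-byte header per record and rounds payloads up to 8 bytes, and it ignores concurrency entirely.

```python
from collections import deque
from typing import Deque, Optional

HEADER_BYTES = 8  # assumed per-record header size


class RingBuf:
    """Single shared buffer: reserve space, fill it, submit; consumer frees space."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.used = 0
        self.dropped = 0
        self.queue: Deque[dict] = deque()

    def reserve(self, length: int) -> Optional[dict]:
        """Claim space for one record, or return None (and count a drop) if full."""
        need = HEADER_BYTES + (length + 7) // 8 * 8
        if self.used + need > self.size:
            self.dropped += 1
            return None
        self.used += need
        return {"need": need, "data": None}

    def submit(self, record: dict, data: object) -> None:
        """Publish a reserved record to the consumer."""
        record["data"] = data
        self.queue.append(record)

    def consume(self) -> Optional[object]:
        """Pop the oldest submitted record and release its space."""
        if not self.queue:
            return None
        record = self.queue.popleft()
        self.used -= record["need"]
        return record["data"]
```

The `dropped` counter is worth mirroring in a real pipeline: a failed reservation is the only signal that the consumer is falling behind.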

BPF_MAP_TYPE_PROG_ARRAY (Tail Calls)

A tail call chains eBPF programs together: program A calls bpf_tail_call(ctx, &prog_array, index) and, if the slot is populated, execution jumps to the program at index and never returns to A. The callee receives the same context struct and reuses A's stack frame, but it cannot rely on anything A left on the stack; pass state through a map instead. Tail calls are the classic mechanism for dynamic program composition: a dispatcher program does a bounds check, then tail-calls into a specific handler. Each program in the chain is verified on its own against the instruction limit, and the kernel caps the length of a chain (33 tail calls on recent kernels) so that chains always terminate.
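A dispatcher built on tail calls can be modeled in Python. This is a sketch, not kernel behavior in detail: `run` and the tuple-based program protocol are hypothetical, and the default chain limit of 33 is an assumption about recent kernels.

```python
from typing import Callable, Dict, Tuple

MAX_TAIL_CALLS = 33  # assumed cap on chained tail calls

Program = Callable[[dict], Tuple]


def run(prog_array: Dict[int, Program], entry: Program, ctx: dict,
        limit: int = MAX_TAIL_CALLS):
    """Run `entry`; a program returns ("ret", value) or ("tail", index, fallthrough)."""
    prog, calls = entry, 0
    while True:
        action = prog(ctx)
        if action[0] == "ret":
            return action[1]
        _, index, fallthrough = action
        target = prog_array.get(index)
        if target is None or calls >= limit:
            # A failed tail call falls through to the caller's remaining code,
            # modeled here as returning the caller's fallthrough value.
            return fallthrough
        calls += 1
        prog = target  # control never returns to the previous program


def dispatcher(ctx: dict) -> Tuple:
    return ("tail", ctx["proto"], "PASS")


handlers: Dict[int, Program] = {
    6: lambda ctx: ("ret", "tcp"),
    17: lambda ctx: ("ret", "udp"),
}
```

Calling `run(handlers, dispatcher, {"proto": 6})` reaches the TCP handler, while an unpopulated slot makes the dispatcher fall through to its own default.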

BPF_MAP_TYPE_HASH_OF_MAPS and BPF_MAP_TYPE_PROG_ARRAY

These two types enable runtime indirection. BPF_MAP_TYPE_HASH_OF_MAPS holds inner maps as values (userspace inserts them by file descriptor), letting a program look up a sub-map by key at runtime. This is used by Cilium's per-connection-policy lookups: the connection 5-tuple is the key, and the value is a pointer to the policy map for that pod. BPF_MAP_TYPE_PROG_ARRAY stores program file descriptors for tail-call targets.

BTF and CO-RE

BTF (BPF Type Format) and CO-RE (Compile Once, Run Everywhere) are the twin pillars that make eBPF programs portable across kernel versions without recompilation. Understanding them is essential for any production eBPF deployment.

BTF Type Descriptions

BTF encodes the types of all C structs, unions, enums, and typedefs in the kernel (and in eBPF programs) into a compact binary format embedded in the kernel image. The kernel exposes BTF data via the /sys/kernel/btf/vmlinux virtual file. When a kernel is built with CONFIG_DEBUG_INFO_BTF=y, the BTF info is emitted as a .BTF section and retained in the vmlinux ELF file. Tools like bpftool can dump the entire type tree:

{`# Dump the running kernel's types and look at struct sk_buff
# (illustrative, abbreviated output)
$ bpftool btf dump file /sys/kernel/btf/vmlinux format c

struct sk_buff {
    union {
        struct {...}            /* bit:  0 ~ 31 */
        struct {...}            /* bit: 32 ~ 39 */
        __u32                  /* bit:  0 ~ 31 */
    }                             /* offset:  0 (datalen_skb_cb) */
    struct sock *                /* offset:  8 */
    struct net_device *          /* offset: 16 */
    union {...}                  /* offset: 24 */
    ...
    __u16                         /* offset: 202 len */
    __u16                         /* offset: 204 data_len */
    ...
}

# List all structs containing "sock"
$ bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep "struct.*sock"
struct sock { ... }
struct socket { ... }
struct msghdr { struct sock * sock; ... }`}

Each type has a BTF ID. When you write ctx->sk in a program, the verifier looks up the BTF ID for struct sk_buff, confirms that sk is a field of type struct sock*, and records the offset. This is how the verifier knows field offsets — it reads them from BTF, not from compiled headers.

CO-RE: Relocating Struct Fields at Load Time

The kernel's internal structs change between versions. A field that was at offset 24 in struct sk_buff in kernel 5.4 might be at offset 32 in 5.8 due to new fields being inserted. CO-RE solves this by:

Compiling the eBPF program with BTF information embedded in the .o file
Recording, for each struct field access, the BTF type ID and field index rather than a hard-coded byte offset
At load time, the libbpf loader reads the target kernel's BTF (via /sys/kernel/btf/vmlinux), looks up the current offset of that field, and patches the eBPF bytecode to use the correct offset

{`// source: reading sk_buff->len using CO-RE
// (needs vmlinux.h, bpf_helpers.h, bpf_tracing.h, bpf_core_read.h)
SEC("kprobe/tcp_v4_rcv")
int BPF_KPROBE(handle_skb, struct sk_buff *skb) {
    // "len" sits at an offset that varies by kernel version.
    // BPF_CORE_READ records a relocation instead of a hard-coded offset,
    // and libbpf patches the instruction at load time.
    __u32 len = BPF_CORE_READ(skb, len);
    // ...
    return 0;
}

// Compiled .o file contains CO-RE relocation records (numbers illustrative):
// relocation[0]: type_id=15 (sk_buff), access "len", insn_offset=4
// At load time, the loader reads /sys/kernel/btf/vmlinux,
// finds struct sk_buff, field "len" at offset 202 on this kernel,
// and patches the instruction at offset 4 with offset 202.
// The same .o loads on any BTF-enabled kernel where the field still exists.`}
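The load-time patching step can be sketched as a small Python model. Everything here is illustrative: `relocate`, the dictionary-based instruction format, and the two "kernels" with their offsets are hypothetical, and real relocations carry more than a struct and field name.

```python
from typing import Dict, List


def relocate(insns: List[dict], relocs: List[dict],
             target_btf: Dict[str, Dict[str, int]]) -> List[dict]:
    """Return a copy of `insns` with each relocated load patched to the target's field offset."""
    patched = [dict(insn) for insn in insns]
    for reloc in relocs:
        fields = target_btf.get(reloc["struct"], {})
        if reloc["field"] not in fields:
            raise LookupError(f'{reloc["struct"]}.{reloc["field"]} not found in target BTF')
        patched[reloc["insn"]]["off"] = fields[reloc["field"]]
    return patched


# The compiled object: one load whose offset is a placeholder, plus a record
# saying "instruction 1 reads sk_buff.len".
program = [{"op": "mov"}, {"op": "ldx_w", "off": 0}]
relocs = [{"insn": 1, "struct": "sk_buff", "field": "len"}]

# Two hypothetical kernels that place the field at different offsets.
kernel_a = {"sk_buff": {"len": 104}}
kernel_b = {"sk_buff": {"len": 112}}
```

The same `program` yields a different patched offset on each kernel, and a kernel that lacks the field fails at load time instead of reading the wrong bytes.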

BTF-typed Function Arguments

Raw kprobes hand the program a struct pt_regs and leave argument extraction to you. libbpf's BPF_KPROBE and BPF_KRETPROBE macros wrap that extraction in a typed function signature, and fentry/fexit programs (kernel 5.5+) go further by having the verifier check the arguments against the function's BTF signature:

{`// Instead of the raw kprobe approach:
SEC("kprobe/tcp_sendmsg")
int kprobe__tcp_sendmsg(struct pt_regs *ctx) {
    // Must manually parse pt_regs to get the arguments
    long addr = PT_REGS_PARM1(ctx);  // architecture-specific, ugly
    // ...
}

// libbpf's BPF_KPROBE macro unpacks pt_regs into named, typed arguments:
SEC("kprobe/tcp_sendmsg")
int BPF_KPROBE(tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size) {
    // sk, msg, and size are filled from PT_REGS_PARM1..3 by the macro
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    // ...
    return 0;
}
// BPF_KPROBE is a convenience macro, so the types are not checked against the kernel.
// For arguments the verifier validates against the function's BTF signature,
// use SEC("fentry/tcp_sendmsg") with the BPF_PROG macro instead.`}

Programs and Hook Points

Where you attach the program determines what events trigger it and what context the kernel hands you.

kprobe / kretprobe

Attach to the entry or return of any non-inlined kernel function. kprobe/tcp_sendmsg fires every time the kernel sends TCP data. Works on any kernel from 4.x but the function name is unstable across versions. The kernel plants a breakpoint at the probe address (or uses the ftrace hook at function entry when available) to hand control to the eBPF program; fentry/fexit programs instead use a JIT'd BPF trampoline. A kretprobe fires after the function returns, with the return value read via PT_REGS_RC(ctx). The overhead is roughly 10 ns per fire for JIT'd trampolines, and higher for breakpoint-based probes.

tracepoint

Stable named hook points the kernel exposes intentionally. tracepoint/syscalls/sys_enter_openat is treated as stable and generally survives version bumps. Always prefer tracepoints over kprobes when one exists. Each tracepoint has a generated struct trace_event_raw_* with named, typed fields. The kernel rarely changes tracepoint fields; when it does, the format file under /sys/kernel/debug/tracing/events/ reflects the new layout. Search that directory to see available tracepoints on your kernel.

uprobe / uretprobe

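Userspace function entry/return. Hook SSL_write in libssl to capture HTTPS payloads in plaintext, before they are encrypted. Pixie's auto-instrumentation is built almost entirely on uprobes into TLS, gRPC, and language runtimes. Uprobe attachment uses the binary's ELF symbol table to find the function entry point. The kernel registers the probe by file (inode) and file offset, so the loader translates the symbol's virtual address into a file offset using the ELF section or program headers; this works the same for position-independent (PIE) binaries and is unaffected by ASLR. A uretprobe runs when the function returns: at entry the kernel replaces the return address on the stack with a trampoline, runs the probe when the function returns into it, and then resumes at the original return address.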
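Uprobes are registered against a file offset rather than a runtime address, so loaders translate the symbol's address first. The Python sketch below shows that arithmetic under stated assumptions: `uprobe_file_offset` is a hypothetical helper, the inputs are values you would read from the ELF symbol and section tables (for example with readelf), and the sample numbers are made up.

```python
def uprobe_file_offset(sym_value: int, sh_addr: int, sh_offset: int) -> int:
    """Translate a symbol's virtual address into an offset within the ELF file.

    sym_value: st_value of the function symbol
    sh_addr:   virtual address at which the containing section is mapped
    sh_offset: offset of that section within the file
    """
    if sym_value < sh_addr:
        raise ValueError("symbol lies before the start of the section")
    return sym_value - sh_addr + sh_offset


# Non-PIE binary: .text linked at 0x401040 and stored at file offset 0x1040.
non_pie = uprobe_file_offset(0x401136, 0x401040, 0x1040)
# PIE binary: addresses are already relative, so the result equals st_value here.
pie = uprobe_file_offset(0x1139, 0x1040, 0x1040)
```

Because the result depends only on the file layout, one attachment covers every process that maps the binary, wherever ASLR places it.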

XDP

Runs at NIC driver level, before the kernel's network stack sees the packet. The XDP context gives access to the packet buffer directly. XDP return codes: XDP_PASS (send up the stack), XDP_DROP (discard), XDP_REDIRECT (send to another interface or to userspace via a socket), XDP_TX (bounce out the same interface). Used for DDoS scrubbing, load balancing (Cilium, Katran), and packet capture without context-switching to userspace. Some drivers support only a subset of modes.

tc (traffic control)

Attach to the ingress or egress hook on a network device. Slower than XDP but with access to skb metadata including cgroup info, connection state, and (on egress) routing decisions. Cilium uses tc for policy enforcement and service mesh. tc eBPF programs are attached by adding a clsact qdisc (tc qdisc add dev eth0 clsact) and then a bpf filter on one of its hooks (for example tc filter add dev eth0 ingress bpf da obj prog.o sec tc, where prog.o is your compiled object). The hooks are traversed on both RX and TX paths, making tc a natural place for per-connection accounting.

perf event

Fires on hardware perf counters (cycles, instructions, cache-misses, branch mispredictions) or software events (cpu-clock, task-clock). Used for sampling profilers: N times per second, the perf interrupt fires and the eBPF program is called with the interrupted registers in its struct bpf_perf_event_data context. It captures the stack with bpf_get_stackid() into a STACK_TRACE map and records a sample. The sampling rate is set from userspace when the event is opened, via sample_freq (with the freq flag set) or sample_period in struct perf_event_attr, and can be changed later with ioctl(perf_fd, PERF_EVENT_IOC_PERIOD, &period).

BPF Maps

Maps are the only persistent state available to a program and the only way to communicate with userspace. The map type determines its semantics — hash, array, LRU, perf ring buffer, stack trace, per-CPU.

{`// libbpf C - per-CPU hash map for syscall counts
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);     // syscall number
    __type(value, u64);   // count
} syscall_counts SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int count_syscall(struct trace_event_raw_sys_enter *ctx) {
    u32 nr = ctx->id;
    u64 *count = bpf_map_lookup_elem(&syscall_counts, &nr);
    if (count) (*count)++;
    else { u64 init = 1; bpf_map_update_elem(&syscall_counts, &nr, &init, 0); }
    return 0;
}

// Userspace reads the map periodically; a lookup on a per-CPU map
// fills one value per possible CPU (num_cpus = libbpf_num_possible_cpus())
__u32 cur, next, *prev = NULL;
while (bpf_map_get_next_key(map_fd, prev, &next) == 0) {
    __u64 vals[num_cpus];
    __u64 total = 0;
    if (bpf_map_lookup_elem(map_fd, &next, vals) == 0)
        for (int i = 0; i < num_cpus; i++) total += vals[i];
    printf("syscall %u: %llu\\n", next, total);
    cur = next;
    prev = &cur;
}`}

Per-CPU maps avoid lock contention: each CPU has its own copy and userspace sums them. To stream high-volume events to userspace, use the perf ring buffer or its successor, the BPF ring buffer; both use memory-mapped circular buffers, and the BPF ring buffer additionally preserves event order across CPUs.

bpftrace

bpftrace is a high-level DSL for eBPF that lets you write one-liners and short scripts that compile and run at runtime. It sits on top of the kernel's eBPF infrastructure and exposes an awk-like language with built-in variables for common kernel state. It is the fastest way to go from "what does this function do" to "I have data" for ad-hoc investigation.

One-liners

{`# Count file opens by process name
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'

# Count TCP retransmits by the kernel stack that triggered them
bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits[kstack] = count(); }'

# Histogram of read() syscall latency via vfs_read
bpftrace -e '
  kprobe:vfs_read { @start[tid] = nsecs; }
  kretprobe:vfs_read /@start[tid]/ {
    @ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
  }'

# Count lock contention events by lock address (kernel 5.19+)
bpftrace -e 'tracepoint:lock:contention_begin { @locks[args->lock_addr] = count(); }'

# Print every exec with the calling process name and the executable path
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s(%s)\\n", comm, str(args->filename)); }'

# Bytes read per process (sys_exit_read carries the return value but not the fd)
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes[pid, comm] = sum(args->ret); }'`}

USDT Probes (Statically Defined Tracepoints)

User-level Statically Defined Tracepoints (USDT) are compile-time instrumentation inserted into application code. Runtimes and servers such as MySQL, PostgreSQL, Node.js, and CPython can be built with them, often behind a build flag. bpftrace attaches to these probes through the kernel's uprobe mechanism, with no changes to the running application:

{`# List available USDT probes in a binary
bpftrace -l 'usdt:/path/to/binary:*'
# or via:
# readelf -n /path/to/binary | grep -A2 stapsdt

# Trace gRPC server messages (illustrative: the provider and probe names
# depend on how the binary was instrumented)
bpftrace -e '
  usdt:/path/to/myservice:grpc:server_received_message {
    printf("pid=%d method=%s size=%d\\n", pid, str(arg0), arg1);
  }
  usdt:/path/to/myservice:grpc:server_sent_message {
    printf("pid=%d method=%s size=%d\\n", pid, str(arg0), arg1);
  }' -p $(pgrep -x myservice)`}

bpftrace Scripts and Filters

A bpftrace script file (.bt) has the same language as one-liners but organized into multiple probes with shared state. Filtering in bpftrace uses an expression after the probe specifier that acts as a predicate:

{`#!/usr/bin/env bpftrace

// Only fire when the file being opened is in /tmp (filter expression)
tracepoint:syscalls:sys_enter_openat
/strcontains(str(args->filename), "/tmp")/ {
    @tmp_opens[comm] = count();
}

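// Count context switches where the outgoing task blocked (prev_state != 0),
// a common approximation of voluntary switches
tracepoint:sched:sched_switch
/args->prev_state != 0/ {
    @switches[args->prev_comm] = count();
}

// Count TCP state transitions by new state
tracepoint:sock:inet_sock_set_state {
    @tcp_states[args->newstate] = count();
}

// Print and reset the TCP state counts every 10 seconds
interval:s:10 {
    print(@tcp_states);
    clear(@tcp_states);
}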

END {
    print(@tmp_opens);
    print(@switches);
}`}

Output Formats

bpftrace supports two output formats. The default is human-readable text with auto-scaling histograms. For machine consumption, pass -f json to get structured output. The -o filename option writes output to a file instead of stdout, and the JSON stream can be piped into a log aggregation pipeline.

The Toolchain: BCC vs bpftrace vs libbpf

Three layers, each with a different tradeoff between ease of use and runtime cost.

Tool	Style	Compile	Best for
bpftrace	awk-like DSL	at runtime	ad-hoc one-liners, troubleshooting
BCC	Python + embedded C	at runtime via clang	full programs with userspace UI; clang dependency in production
libbpf + CO-RE	C, compiled to BPF bytecode with BTF	ahead of time	production deployments; small, fast, portable
cilium/ebpf (Go)	pure-Go loader library (no libbpf or cgo)	ahead of time	Go-native exporters, agents
aya (Rust)	pure-Rust library (no libbpf)	ahead of time	memory-safe userspace controllers

BCC: C Embedded in Python

The BPF Compiler Collection (BCC) embeds eBPF C code as a multi-line string inside a Python (or Lua) program. At runtime, BCC compiles the C with clang/LLVM (which must be installed on the target machine), creates maps, and attaches the programs via the bpf syscall. The Python side provides the userspace logic: reading maps, aggregating data, printing tables, serving HTTP. The tradeoff is the clang dependency — compiling eBPF at runtime takes seconds, and shipping clang in a production container image is heavyweight. BCC tools are widely used for ad-hoc debugging scripts.

{`#!/usr/bin/env python3
# from BCC examples/openaat
from time import sleep

from bcc import BPF

program = r"""
#include <uapi/linux/ptrace.h>

// Per-process counter map
BPF_HASH(counts, u32, u64);

// uprobe handlers receive the saved registers as struct pt_regs
int hello(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *cnt = counts.lookup(&pid);
    if (cnt)
        (*cnt)++;
    else {
        u64 init = 1;
        counts.update(&pid, &init);
    }
    return 0;
}
"""

b = BPF(text=program)
# Attach to openat() in libc (name="c" resolves to the C library)
b.attach_uprobe(name="c", sym="openat", fn_name="hello")

print("Tracing openat()... Hit Ctrl-C to end.")
try:
    while True:
        sleep(1)
        for pid, cnt in b["counts"].items():
            print(f"PID {pid.value}: openat called {cnt.value} times")
except KeyboardInterrupt:
    pass`}

Cilium: eBPF as Service Mesh Dataplane

Cilium is one of the most sophisticated production uses of eBPF in the cloud-native ecosystem. It can replace kube-proxy and much of the Linux networking stack with eBPF programs that handle load balancing, routing, encryption (WireGuard), and policy enforcement in-kernel.

How Cilium Replaces iptables

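Before eBPF dataplanes, Kubernetes NetworkPolicy and Service routing were typically implemented as iptables rules that are evaluated linearly. With tens of thousands of rules, matching a packet is O(n) in the number of rules, which becomes a significant CPU cost at scale. Cilium replaces the rule walk with hash-map lookups in eBPF programs attached at the tc and socket layers: connection state is kept in a BPF connection-tracking map keyed by the flow tuple (src IP, src port, dst IP, dst port, protocol), and policy verdicts are looked up in per-endpoint BPF maps, so packets bypass the iptables chains. For traffic between sockets on the same node, Cilium has additionally offered socket-level acceleration built on BPF_MAP_TYPE_SOCKHASH and BPF_PROG_TYPE_SOCK_OPS: when a connection is established, a sockops program stores a reference to the socket in the map, keyed by the connection tuple, and later messages are redirected directly to the peer socket.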

Service load balancing, which maps a Kubernetes Service cluster IP to backend pods, is handled by dedicated BPF hash maps (a service map and a backend map) and by programs at the tc (cls_bpf) and socket hooks that rewrite destination addresses. This is both faster than kube-proxy's iptables rules and more predictable: a hash-map lookup is O(1) on average, regardless of the number of Services or backends.
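To make the complexity difference concrete, here is a toy Python model, not Cilium's code. It compares a linear rule walk, as in an iptables chain, with a single hash lookup keyed by (cluster IP, port). The names RULES, SERVICES, and pick_backend and all addresses are invented for illustration.

```python
# Toy model of Service-to-backend translation; names and addresses are illustrative.

# iptables-style: an ordered rule list that is walked until a rule matches.
RULES = [
    (("10.96.0.%d" % i, 80), ["10.0.%d.1:8080" % i, "10.0.%d.2:8080" % i])
    for i in range(1, 201)
]

# eBPF-style: the same data in a hash map keyed by (cluster IP, port).
SERVICES = dict(RULES)


def linear_lookup(vip, port):
    """Walk the rules in order; return (backends, number of rules examined)."""
    for examined, (key, backends) in enumerate(RULES, start=1):
        if key == (vip, port):
            return backends, examined
    return None, len(RULES)


def hash_lookup(vip, port):
    """One hash lookup, independent of how many services exist."""
    return SERVICES.get((vip, port))


def pick_backend(backends, flow_hash):
    """Choose a backend deterministically from a per-flow hash."""
    return backends[flow_hash % len(backends)]


backends, examined = linear_lookup("10.96.0.200", 80)
print(examined)  # 200: the last service costs a full walk of the list
print(pick_backend(hash_lookup("10.96.0.200", 80), flow_hash=7))
```

The linear cost grows with the position of the matching rule, while the dictionary lookup does the same amount of work for the first and the last service.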

Cilium Hubble: Observability via eBPF

Hubble is Cilium's observability layer. Rather than adding separate probes, it consumes the flow events (packet forwarded, packet dropped, policy verdict) that Cilium's datapath programs already emit to a per-CPU perf event buffer. The Hubble server embedded in each node's Cilium agent decodes these events and enriches them with pod and identity metadata. A separate component (hubble-relay) aggregates the per-node flows and exposes them via a gRPC API consumed by the Hubble UI and CLI. The result: a service map and connection timeline for every pod in the cluster, with no application instrumentation required.
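The aggregation step can be sketched in a few lines of Python. This is an illustration of turning per-flow events into service-map edges, not Hubble's implementation; the record fields src, dst, and verdict are a simplified stand-in for the real flow schema.

```python
from collections import Counter

# Hypothetical, simplified flow records as a node-local agent might emit them.
flows = [
    {"src": "frontend", "dst": "cart", "verdict": "FORWARDED"},
    {"src": "frontend", "dst": "cart", "verdict": "FORWARDED"},
    {"src": "cart", "dst": "redis", "verdict": "FORWARDED"},
    {"src": "frontend", "dst": "payments", "verdict": "DROPPED"},
]


def service_map(events):
    """Count flows per (source, destination, verdict) edge."""
    edges = Counter()
    for ev in events:
        edges[(ev["src"], ev["dst"], ev["verdict"])] += 1
    return edges


edges = service_map(flows)
for (src, dst, verdict), n in sorted(edges.items()):
    print(f"{src} -> {dst} [{verdict}] x{n}")
```

Each distinct edge becomes one line in the map, and the count gives a rough traffic weight for it.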

Pixie: Auto-Instrumentation via uprobes

Pixie is an open-source observability platform for Kubernetes that uses eBPF kprobes and uprobes to capture telemetry from applications automatically, without code changes, sidecars, or configuration. It is one of the most ambitious uses of eBPF auto-instrumentation.

What Pixie Hooks

Pixie organizes its capture points by protocol. Most traffic is captured with kprobes on the network syscalls (connect, accept, read/write, send/recv and their variants), and the protocol is inferred and parsed from the captured bytes in userspace. Uprobes on library functions are added where the syscall payload is not directly usable.

TLS (OpenSSL / BoringSSL): SSL_read, SSL_write. The probe on SSL_write entry sees the plaintext before the TLS record layer encrypts it; the return probe on SSL_read sees the plaintext after decryption. The uprobe receives pointers to the application buffer, which eBPF copies into the ring buffer for userspace to decode as an HTTP request or response.
gRPC (HTTP/2): uprobes on the HTTP/2 and gRPC library functions of Go applications (golang.org/x/net/http2 and grpc-go). HTTP/2 header compression (HPACK) is stateful, so headers cannot be decoded reliably from syscall data when tracing starts mid-connection. Hooking the library gives access to the method name and payload size before compression.
DNS: parsed from the payloads captured at the send/recv syscalls. Pixie correlates DNS queries with application connections.
MySQL, Postgres, Redis, Kafka: parsed from the bytes captured at the socket syscalls. Pixie decodes each wire protocol to extract query text and response metadata.
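Once a probe has copied a plaintext buffer to userspace, decoding it is ordinary parsing. The following is a minimal sketch, assuming an HTTP/1.x request captured from an SSL_write buffer. The function name parse_http_request and the sample bytes are illustrative and are not taken from Pixie.

```python
def parse_http_request(buf: bytes):
    """Return (method, path, headers) for a plaintext HTTP/1.x request, or None."""
    head, sep, _body = buf.partition(b"\r\n\r\n")
    if not sep:
        return None  # headers incomplete; a real tracer would wait for more data
    lines = head.decode("latin-1").split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None  # not an HTTP request line
    method, path, _version = parts
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, headers


# Bytes as they might appear in the buffer passed to SSL_write (made-up sample).
captured = b"GET /api/cart?id=7 HTTP/1.1\r\nHost: shop.local\r\nAccept: */*\r\n\r\n"
print(parse_http_request(captured))
```

A production tracer also has to reassemble requests split across several calls and pair each request with its response, which this sketch leaves out.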

The Coupling Problem

The fundamental challenge with Pixie's uprobe-based tracing is library version coupling. A uprobe targets a specific function with a specific signature at a specific offset. When a library ships a new version that changes the function layout, adds new inlined helpers, or modifies the wire protocol parser, the probes break silently or produce incorrect data. Pixie maintains a table of library versions and their corresponding probe specs, updating it when distributions ship new library versions. For long-lived clusters where base images change slowly, this works well. For clusters with rolling updates and heterogeneous images, the maintenance burden is significant.

Parca: Continuous CPU Profiling with eBPF

Parca is a continuous CPU profiler that uses eBPF to sample stack traces across an entire fleet with low overhead, storing aggregated per-stack sample counts rather than raw samples.

A traditional profiler using perf record writes gigabytes of raw perf.data per host per day, because every sample includes the full call chain. Parca's eBPF program instead stores each distinct call chain once in a BPF_MAP_TYPE_STACK_TRACE map, indexed by the stack ID that bpf_get_stackid() returns, and keeps a separate hash map of sample counts keyed by the process ID and stack IDs. On every tick of a fixed-frequency perf event on each CPU, the program captures the current stack, obtains its stack ID, and increments the counter for that call chain. The agent periodically reads the aggregated counts and ships only what accumulated since the last read to the server. The data rate is orders of magnitude smaller than raw perf data.
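The aggregate-then-ship-deltas idea can be modeled in plain Python. This is a simplified sketch, not Parca's code: it keys the counts by a single (pid, stack_id) pair, and the helper names record_sample and delta are invented for the example.

```python
from collections import Counter


def record_sample(counts, pid, stack_id):
    """What the in-kernel program does on each tick: bump one counter."""
    counts[(pid, stack_id)] += 1


def delta(current, previous):
    """Counts accumulated since the previous snapshot; unchanged keys are dropped."""
    return {
        key: value - previous.get(key, 0)
        for key, value in current.items()
        if value != previous.get(key, 0)
    }


counts = Counter()
for pid, stack_id in [(100, 1), (100, 1), (200, 5)]:
    record_sample(counts, pid, stack_id)
snapshot = dict(counts)  # state at the last upload

for pid, stack_id in [(100, 1), (100, 2)]:
    record_sample(counts, pid, stack_id)

print(delta(counts, snapshot))  # {(100, 1): 1, (100, 2): 1}
```

Five samples were taken, but only two small counters need to leave the host, and a stack that did not run in the interval costs nothing.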

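For unwinding, Parca supports both frame-pointer-based unwinding (fast, but only works for binaries compiled with frame pointers, e.g. with -fno-omit-frame-pointer) and DWARF-based unwinding (works for optimized binaries that omit frame pointers). The kernel has no DWARF unwinder for user stacks, so the agent builds compact unwind tables in userspace from each binary's .eh_frame section and loads them into BPF maps, and the eBPF program walks the stack using those tables. This produces accurate stacks for applications compiled with -O2 or -O3 without frame pointers. Inlined frames are recovered later, during symbolization.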
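Frame-pointer unwinding is simple enough to simulate. The sketch below assumes the x86_64 convention where the saved caller frame pointer sits at [rbp] and the return address at [rbp + 8]. The memory dictionary, the addresses, and the function name are made up for illustration.

```python
def unwind_frame_pointers(memory, rip, rbp, max_depth=127):
    """Follow the chain of saved frame pointers and collect return addresses."""
    frames = [rip]  # the sampled instruction pointer is the innermost frame
    while rbp != 0 and len(frames) < max_depth:
        return_address = memory.get(rbp + 8)
        caller_rbp = memory.get(rbp)
        if return_address is None or caller_rbp is None:
            break  # unreadable memory: stop rather than guess
        frames.append(return_address)
        rbp = caller_rbp
    return frames


# Simulated stack memory: address -> 8-byte value.
memory = {
    0x7000: 0x7100, 0x7008: 0x401200,  # innermost frame: saved rbp, return address
    0x7100: 0x0,    0x7108: 0x401100,  # outer frame: saved rbp of 0 ends the chain
}
stack = unwind_frame_pointers(memory, rip=0x401300, rbp=0x7000)
print([hex(addr) for addr in stack])  # ['0x401300', '0x401200', '0x401100']
```

Each step is two memory reads, which is why this method is cheap. When a binary omits frame pointers, rbp holds arbitrary data and the walk produces garbage or stops early, which is the case table-based unwinding addresses.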

Tetragon: Runtime Security Enforcement via eBPF

Tetragon (by Isovalent, part of the Cilium project) is a runtime security tool that uses eBPF kprobes, tracepoints, and uprobes to observe selected kernel functions and syscalls with argument and return-value context, and to enforce policy-based access control at the kernel level.

Unlike Cilium, which focuses on network policy, Tetragon enforces process-level and file-level security policies by attaching to the hook points named in a policy: typically kprobes on syscalls or internal kernel functions, or tracepoints such as raw_syscalls:sys_enter and raw_syscalls:sys_exit. When a call matches a policy rule, e.g. "process /usr/bin/curl may not execve anything except /lib/*/curl", the eBPF program can send a signal to kill the process (SIGKILL), override the return value with an error code (where the kernel permits overriding that function), or send an event to userspace for alerting. Because the decision is made synchronously in-kernel, there is no window between detection and a userspace agent reacting.
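The match-then-act logic can be sketched in Python. The rule format below is hypothetical and is not Tetragon's TracingPolicy schema; the field names and the action strings are invented to mirror the example rule in the text.

```python
from fnmatch import fnmatchcase

# Hypothetical rules, evaluated in order. Not Tetragon's policy format.
RULES = [
    {"binary": "/usr/bin/curl", "syscall": "execve",
     "allow_arg": "/lib/*/curl", "action": "sigkill"},
    {"binary": "*", "syscall": "openat",
     "deny_arg": "/etc/shadow", "action": "override"},
]


def decide(event):
    """Return the action for an event: 'sigkill', 'override', or 'allow'."""
    for rule in RULES:
        if not fnmatchcase(event["binary"], rule["binary"]):
            continue
        if event["syscall"] != rule["syscall"]:
            continue
        # allow_arg: anything outside the pattern triggers the action
        if "allow_arg" in rule and not fnmatchcase(event["arg"], rule["allow_arg"]):
            return rule["action"]
        # deny_arg: anything matching the pattern triggers the action
        if "deny_arg" in rule and fnmatchcase(event["arg"], rule["deny_arg"]):
            return rule["action"]
    return "allow"


print(decide({"binary": "/usr/bin/curl", "syscall": "execve", "arg": "/bin/sh"}))
```

In the real system this decision runs inside the eBPF program at the hook, so the action takes effect before the call completes.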

For observability, Tetragon exports traced events as structured JSON, delivered from the kernel to the agent via a ring buffer, with fields for the process identity (pid, tid, uid, gid), the syscall arguments, the return value, and the stack trace. This gives detailed syscall-level visibility without a kernel module.

Performance Considerations

eBPF programs run in performance-critical paths. Design decisions that seem minor at development time can become measurable overhead at production traffic rates.

Per-CPU Maps Are Non-Negotiable for Hot Counters

A counter in a shared hash map that is updated 50,000 times per second per CPU (for example, a connection tracker on a busy node) makes every CPU write to the same memory. On a 128-CPU machine, 128 CPUs contend for the same bucket lock and bounce the same cache line, and a plain (non-atomic) increment also loses updates under concurrency. This can show up as measurable CPU overhead in the probe itself. BPF_MAP_TYPE_PERCPU_HASH avoids the contention on the counter: each CPU touches only its own copy of the value. Userspace aggregates at read time.
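The read-side aggregation is a plain sum over CPUs. A minimal sketch in Python, modeling the per-CPU map as one dictionary per CPU; the names update and read_total are illustrative.

```python
def update(percpu_counts, cpu, key):
    """Kernel side: each CPU increments only its own slot, so nothing is shared."""
    slot = percpu_counts[cpu]
    slot[key] = slot.get(key, 0) + 1


def read_total(percpu_counts, key):
    """Userspace side: sum the per-CPU values when the counter is read."""
    return sum(slot.get(key, 0) for slot in percpu_counts)


NUM_CPUS = 4
percpu_counts = [{} for _ in range(NUM_CPUS)]
for cpu, key in [(0, "tcp"), (1, "tcp"), (1, "tcp"), (3, "udp")]:
    update(percpu_counts, cpu, key)

print(read_total(percpu_counts, "tcp"))  # 3
```

The cost moves from the hot path (one uncontended write per event) to the reader (one pass over the CPUs per read), which is usually the right trade for counters.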

Batching with Ringbuf for High-Volume Events

For streaming events (every HTTP request, every syscall, every packet), the choice between perf ring buffers and the BPF ring buffer matters. The BPF ring buffer (BPF_MAP_TYPE_RINGBUF, kernel 5.8+) is a single buffer shared by all CPUs, so events keep their order across CPUs and memory does not have to be over-provisioned per CPU. Its bpf_ringbuf_reserve() / bpf_ringbuf_submit() API lets the program write an event directly into the buffer without an extra copy, and consumers can retrieve multiple events per epoll_wait() wakeup. For workloads with millions of events per second, ringbuf's efficiency advantage over perf buffers is significant.
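The consumer side of batch delivery can be sketched as decoding every complete record available after one wakeup. This Python example is a simplification: it assumes fixed-size records with a made-up layout and ignores the per-record headers a real ring buffer carries.

```python
import struct

# Hypothetical event layout: pid (u32), cpu (u32), timestamp_ns (u64).
EVENT = struct.Struct("<IIQ")
FIELDS = ("pid", "cpu", "ts")


def drain(buffer: bytes):
    """Decode every complete record currently in the buffer in one pass."""
    usable = len(buffer) - len(buffer) % EVENT.size  # ignore a trailing partial record
    return [dict(zip(FIELDS, rec)) for rec in EVENT.iter_unpack(buffer[:usable])]


# Three events that accumulated between two wakeups of the consumer.
pending = b"".join(EVENT.pack(pid, 0, 1000 + pid) for pid in (11, 12, 13))
events = drain(pending)
print(len(events), events[0])
```

Handling many records per wakeup amortizes the cost of the syscall and the context switch over the whole batch.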

High-Frequency Tracepoint Overhead

A tracepoint attached to a syscall like sys_enter_read fires on every read syscall across all processes. On a database server doing 200K reads/second, the probe fires 200K times/second. At ~10 ns per fire for a simple counter increment, that's 2 ms of CPU time per second, or 0.2% of a single core. However, the combined overhead of many probes on hot tracepoints can become measurable. Run perf stat -e cycles -e instructions with and without the probes attached to measure the difference.
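The estimate is a single multiplication, which makes it easy to redo for your own rates. A small helper, with a function name chosen for this example:

```python
def probe_cpu_fraction(fires_per_second, ns_per_fire):
    """Fraction of one core spent in the probe (1.0 means a full core)."""
    return fires_per_second * ns_per_fire / 1e9


frac = probe_cpu_fraction(200_000, 10)
print(f"{frac:.2%} of one core")  # 0.20% of one core

# Several probes on the same hot path add up linearly.
total = sum(probe_cpu_fraction(200_000, ns) for ns in (10, 50, 120))
print(f"{total:.2%} of one core")
```

The per-fire cost is the input to measure rather than assume, since map lookups, stack captures, and event submission each add to it.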

XDP Per-Packet Cost

XDP programs process every packet that arrives at the NIC. At 10 Gbps line rate with 1500-byte packets, that's ~833K packets/second; with minimum-size 64-byte frames it rises to ~14.9M packets/second. A simple XDP program (drop, pass, or redirect) takes 50-100 ns on modern x86, which is 5-10% of a single core per million packets/second. Programs that do map lookups, parse headers, or make policy decisions cost more. For line-rate packet processing at 40+ Gbps, the program must be kept minimal and data-dependent branching should be predictable.
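The packet-rate arithmetic can be checked with a short Python sketch. The function names are chosen for this example; the 20 bytes of wire overhead are the Ethernet preamble and inter-frame gap, which matter mostly for small frames.

```python
def packets_per_second(link_bps, packet_bytes, wire_overhead_bytes=0):
    """Packets per second at line rate for a given packet size."""
    return link_bps / ((packet_bytes + wire_overhead_bytes) * 8)


def core_fraction(pps, ns_per_packet):
    """Fraction of one core consumed at a given per-packet cost."""
    return pps * ns_per_packet / 1e9


print(round(packets_per_second(10e9, 1500)))    # 833333
print(round(packets_per_second(10e9, 64, 20)))  # 14880952 (minimum-size frames)
print(core_fraction(1_000_000, 50), core_fraction(1_000_000, 100))  # 0.05 0.1
```

At the minimum-frame rate, a 100 ns program would need about one and a half cores for a single 10 Gbps port, which is why per-packet cost is counted in cycles.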

Common Mistakes

Map lookup in hot path without per-CPU

Using BPF_MAP_TYPE_HASH for a high-frequency counter makes all CPUs contend on the same map entry. Prefer BPF_MAP_TYPE_PERCPU_HASH for counters updated more than a few thousand times per second.

Sleeping in non-preemptible context

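Most eBPF program types run in a context that cannot sleep (historically with preemption disabled, on newer kernels with migration disabled). Helpers that may sleep or fault, such as bpf_copy_from_user, are only allowed in programs loaded as sleepable; elsewhere the verifier rejects them. Separately, bpf_printk is slow and meant for debugging: calling it inside a tight loop or a hot path adds latency to whatever the probe interrupted. Send data to userspace with bpf_ringbuf_reserve / bpf_ringbuf_submit instead.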

Exceeding instruction limits with unroll

#pragma unroll emits one copy of the loop body per iteration. A 10,000-iteration loop with a body of B instructions produces roughly 10,000×B instructions, and the program fails to load if it exceeds the verifier's 1M instruction limit. Use a bounded loop (supported since kernel 5.3) with a realistic maximum (e.g. 256) that your use case actually needs.

Storing map pointers across hook invocations

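A pointer returned by a map lookup is only guaranteed to be valid for the duration of the current program invocation. BPF hash maps are not rehashed, but the element can be deleted or replaced concurrently and its memory reused. Do not try to keep such a pointer (for example in a global variable) for use on the next hook invocation, because it may be stale. Always re-look up the entry.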

Assuming kprobe function names are stable

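A function such as tcp_sendmsg can be renamed, inlined, or replaced by a different function between kernel versions. Prefer tracepoints where one exists, since they are maintained as a more stable interface. BTF-typed attach points (fentry/fexit) are more resilient than plain kprobes because argument types are checked against the running kernel and a missing function fails at load time rather than silently. For kprobes without BTF, test on every target kernel version.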

Libbpf skeleton mismatch

bpftool gen skeleton generates a header that embeds the compiled object along with accessors named after its maps and programs. If the .o file is rebuilt and the skeleton is not regenerated, userspace keeps loading the old object, or fails to build or load when it references names that no longer match. Automate skeleton generation in the build pipeline.

perf events vs eBPF

Both can sample the kernel, but with very different cost models.

Aspect	perf	eBPF
Where data lives	perf.data file (gigabytes)	BPF map (kilobytes-megabytes)
In-kernel filtering	limited (perf-filter)	arbitrary verified C
Continuous use	impractical (storage)	designed for it
Aggregation	userspace post-processing	in-kernel via maps
Stack traces	frame pointer / DWARF	BPF_MAP_TYPE_STACK_TRACE

A continuous profiler using perf would write multi-GB perf.data files per host per day. The eBPF equivalent (Parca, Pyroscope) keeps a per-CPU stack-trace histogram in a BPF map and ships only deltas to the backend — orders of magnitude less data.

Tradeoffs

Zero instrumentation

No code changes needed. Works for closed-source binaries, third-party services, and languages that resist instrumentation.

Kernel coupling

Programs target specific kernel data structures. CO-RE + BTF mitigates this but doesn't eliminate it — some attach points change between major versions.

Privileged install

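Loading eBPF requires CAP_BPF (kernel 5.8+, usually together with CAP_PERFMON for tracing or CAP_NET_ADMIN for networking programs) or CAP_SYS_ADMIN. In multi-tenant clusters that's a security policy decision, not just a technical one.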

Verifier rejection is opaque

"invalid access to packet" with no line number is a common debugging experience. The verifier log is verbose; learning to read it is a skill.

FAQ

Can eBPF crash the kernel?

Not by design. The verifier checks termination and memory safety before the program ever runs. In practice, the most that usually goes wrong is the program returning the wrong value, or a buggy probe attaching to a hot function and slowing the system. Verifier and helper bugs have occasionally allowed crashes; they are treated as kernel bugs and fixed.

What is BTF and why does it matter?

BTF (BPF Type Format) is type information embedded in the kernel that lets eBPF programs reference kernel struct fields by name. CO-RE (Compile Once, Run Everywhere) uses BTF to relocate field offsets at load time, so one compiled .o works across kernel versions. Without BTF, you'd recompile for every kernel.
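The relocation step can be illustrated with a toy model in Python. It only covers field offsets, and the offsets and the names KERNEL_A_BTF, KERNEL_B_BTF, and relocate are made up; real CO-RE relocations are resolved by the loader against the running kernel's BTF and cover more cases.

```python
# Hypothetical per-kernel type info: struct -> field -> byte offset (made-up numbers).
KERNEL_A_BTF = {"task_struct": {"pid": 1256, "comm": 1704}}
KERNEL_B_BTF = {"task_struct": {"pid": 1320, "comm": 1776}}

# The compiled object records field accesses by name, not by offset.
RELOCATIONS = [("task_struct", "pid"), ("task_struct", "comm")]


def relocate(relocations, target_btf):
    """Resolve each recorded access to the offset used by the target kernel."""
    resolved = {}
    for struct_name, field in relocations:
        try:
            resolved[(struct_name, field)] = target_btf[struct_name][field]
        except KeyError:
            raise LookupError(f"{struct_name}.{field} not found in target BTF")
    return resolved


print(relocate(RELOCATIONS, KERNEL_A_BTF)[("task_struct", "pid")])  # 1256
print(relocate(RELOCATIONS, KERNEL_B_BTF)[("task_struct", "pid")])  # 1320
```

The same list of named accesses yields different offsets on each kernel, and a field that no longer exists is reported at load time instead of reading the wrong memory.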

How much overhead does an attached probe add?

A kprobe is roughly 10-50 ns per fire on modern x86. A tracepoint is faster (no breakpoint trap). For a syscall-rate workload (~100K/s) this is well under 1% CPU. For a packet-rate workload (millions/s at XDP), the program itself becomes the bottleneck and you measure in cycles, not nanoseconds.

Why use uprobes for HTTPS instead of MITM?

Because uprobes see plaintext after TLS decryption inside the application's own libssl. No certificate manipulation, no proxy, no traffic redirection. Pixie reads the buffer that SSL_read just decrypted, copies it to a ring buffer, and the application is unaware.

Can I write eBPF in Rust or Go?

Userspace controllers, yes: cilium/ebpf (Go) and aya (Rust) are mature. The kernel-side program is still C (compiled to eBPF bytecode) for almost all production tools. Aya also lets you write the kernel-side program in Rust, but this is less widely used, and the verifier can reject code patterns that a non-C compiler front end generates.

What's the difference between BPF and eBPF?

"Classic" BPF (cBPF) was the original 1992 packet filter language for tcpdump — tiny, restricted to packet inspection. eBPF (extended BPF) is the modern in-kernel VM: 64-bit registers, maps, helper functions, multiple program types. cBPF programs are auto-translated to eBPF at load time today.