eBPF
Sandboxed kernel programs for observability without instrumentation
eBPF (extended Berkeley Packet Filter) lets you attach small, verified programs to kernel hook points — system call entry/exit, network packet RX, scheduler events, TCP retransmits, perf counters — without recompiling the kernel or loading a kernel module. The kernel runs your program in a sandbox: it cannot crash the host, cannot loop forever, cannot access arbitrary memory. It can only do what the verifier proves safe.
For observability, this means you can trace every TCP connection, profile every function call, or build a service map of every gRPC call from outside the application, with overhead typically in the low single-digit percent. Cilium uses it for its service mesh dataplane. Pixie uses it for zero-config auto-instrumentation. Parca and Pyroscope use it for continuous CPU profiling.
Architecture
An eBPF program lives in two places: a small in-kernel routine attached to a hook, and a userspace controller that loads it, reads its output, and exposes it as metrics, logs, or traces.
eBPF Program Types
Every eBPF program has a specific type that determines its allowed hook points, the context struct handed to it at invocation, and which helper functions it may call. Choosing the right program type is the first architectural decision in any eBPF project.
BPF_PROG_TYPE_KPROBE
Attached to arbitrary kernel function entries. The context is the function's register state (r1 = pt_regs*). Fragile across kernel versions because function names and signatures change, but works on any kernel 4.x+. Prefer BTF-typed fentry/fexit programs (BPF_PROG_TYPE_TRACING, kernel 5.5+) when available: they name the function and let the verifier type-check the arguments against kernel BTF. A kprobe fires on every invocation of the target function, so attaching to a hot function adds overhead to every call.
BPF_PROG_TYPE_TRACEPOINT
Attached to named, stable kernel tracepoints. These are maintained as a stable
interface and rarely change between versions. Tracepoint arguments are typed: the kernel
generates a struct for each tracepoint (e.g. struct trace_event_raw_sys_enter)
with named fields. This is the preferred attach point for observability:
stable API, low overhead, no dependency on internal kernel layout. The tradeoff:
not every interesting kernel event has a tracepoint.
BPF_PROG_TYPE_PERF_EVENT
Attached to perf hardware or software events via perf_event_open().
The context is struct bpf_perf_event_data, which embeds a
perf_event_header plus the_regs. Used by continuous CPU profilers:
the program reads the instruction pointer, looks it up in a
BPF_MAP_TYPE_STACK_TRACE map, and increments a histogram bucket.
You control the sampling period from userspace; the program itself is just
the per-sample handler.
BPF_PROG_TYPE_XDP
Runs at the earliest possible point in the NIC driver's receive path, before
the packet enters the kernel's network stack. The context is
struct xdp_buff or struct skb (depending on driver mode).
XDP programs can redirect packets to other interfaces, drop them, or modify
headers and pass them up the stack. At 10 Gbps+ line rates, per-packet eBPF
execution is measurable but usually acceptable. Below 1 Gbps it is essentially
free. XDP requires driver support; most modern drivers (mlx5, i40e, ixgbe,
virtio-net) support it.
BPF_PROG_TYPE_SOCKET_FILTER
The oldest eBPF program type, dating to classic BPF. Attaches to a socket via
SO_ATTACH_BPF and receives a pointer to the raw packet buffer.
Used in tcpdump: the filter returns a 0/1 decision (keep or discard the packet).
In modern observability stack, this has largely been superseded by tc and XDP,
which offer richer context. Socket filters still see use for per-application
packet counting.
BPF_PROG_TYPE_SCHED_CLS
Attached to traffic control (qdisc) hooks on network devices — both ingress
and egress. The context is struct __sk_buff, which gives access to
packet headers, mark, cgroup, and socket state. Cilium uses cls programs
extensively for network policy enforcement: the program inspects the connection,
looks up the security policy in a map, and either allows or drops the packet.
Slower than XDP but runs after routing decisions and has full skb metadata.
BPF_PROG_TYPE_CGROUP_SOCK_ADDR
Attaches to cgroup hooks for socket-level operations: bind, connect, sendmsg,
recvmsg. The context is either struct bpf_sock_addr (IPv4/IPv6)
or struct bpf_sock_ops (for the full TCP state machine). This is
how Cilium implements transparent proxying and L7 load balancing — intercepting
a connect() call and redirecting it to an envoy sidecar without application
knowledge. Also used by some network namespace isolation tools.
BPF_PROG_TYPE_SOCK_OPS
Attaches to the TCP state machine via a struct bpf_sock_ops context.
The program receives events for connection establishment, close, retransmit,
and congestion window changes. Cilium uses sockops to maintain connection tracking
maps and to implement the sockhash maps that correlate packets to connections for
visibility. Combined with BPF_MAP_TYPE_SOCKHASH, this gives a
per-connection view of throughput, retransmits, and latency — entirely
from the kernel's TCP state.
The Verifier
The eBPF verifier is the gatekeeper that makes eBPF programs safe to run in kernel context. Before any program is executed, the kernel performs static analysis across all possible control flow paths and rejects programs that cannot be proven safe. Understanding what the verifier allows and what it rejects is essential for writing non-trivial eBPF programs.
Control Flow Graph Analysis
The verifier builds a control flow graph (CFG) from the bytecode and walks every reachable path from the entry instruction. It tracks the contents of all 11 registers (r0-r10) and the stack at each instruction boundary. A register may be invalid (never initialized), a scalar (a known constant or unknown value), or a pointer to a region of memory with known size and read/write permissions.
The key invariant the verifier enforces: every path through the program must pass
the same safety checks. If there is a conditional branch, both paths must be safe.
If the program reads from a pointer, all paths that reach that instruction must have
already validated the pointer. This is why branches that narrow the state space matter
— if (offset < len) { read(offset); } works because the verifier
knows that inside the block, offset is bounded.
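The narrowing the verifier performs on a branch can be modeled in a few lines of ordinary userspace C. This is a toy sketch rather than the kernel's algorithm: the range struct and the narrow_lt and access_ok helpers are hypothetical names, and real verifier state also tracks signedness and known bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the verifier's view of a scalar: an inclusive [min, max] range. */
struct range {
    uint64_t min;
    uint64_t max;
};

/* State inside the "true" arm of `if (x < limit)`: the upper bound shrinks.
 * limit must be non-zero, otherwise the true arm is unreachable. */
static struct range narrow_lt(struct range x, uint64_t limit)
{
    assert(limit > 0);
    if (x.max > limit - 1)
        x.max = limit - 1;
    return x;
}

/* A `size`-byte read at base + x is provably safe only if even the largest
 * possible x keeps the access inside the `region_len`-byte region. */
static bool access_ok(struct range x, uint64_t size, uint64_t region_len)
{
    if (size > region_len)
        return false;
    return x.max <= region_len - size;
}
```

With an offset that starts as an unbounded u32, access_ok fails. After narrow_lt(offset, 64) models the true arm of if (offset < 64), a 1-byte read from a 64-byte region passes, while a 4-byte read is still refused because offset 63 would run past the end.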
Pointer Arithmetic Rules
Pointers in eBPF are typed: ptr + scalar is allowed but the verifier
tracks the arithmetic. The result pointer is only valid if the operation stays
within the original allocation's bounds. The verifier rejects operations where the
bound is unknown:
{`// REJECTED - offset comes from the packet and has no proven upper bound
u32 offset = load_u32(pkt, off);
volatile char *p = pkt_start + offset; // the arithmetic is tracked, but the range is unknown
return *p; // verifier error: access not proven to be inside the packet
// ACCEPTED - constant offset, checked against the end of the packet
if (pkt_start + 4 + 1 > pkt_end) return 0;
volatile char *q = pkt_start + 4;
return *q; // fine, known offset inside the packet
// ACCEPTED - variable offset, bounded and then checked against the packet end
if (offset > 1024) return 0; // give the scalar a known upper bound
volatile char *r = pkt_start + offset;
if ((void *)(r + 1) > pkt_end) return 0;
return *r; // verifier knows r points at one readable byte
// ACCEPTED - copy through the bpf_skb_load_bytes helper,
// which handles the bounds check internally
u8 val;
if (bpf_skb_load_bytes(ctx, offset, &val, 1) < 0) return 0;
return val;`}
Stack Access Rules
Each program has a 512-byte private stack. It is not zero-initialized; the verifier tracks which slots are initialized and rejects reads from uninitialized slots. Pointers to the stack are allowed as long as they stay within the 512-byte region. A pointer returned by bpf_map_lookup_elem may be NULL and must be checked before it is dereferenced. It is also only valid for the current invocation: you cannot stash it in a global and reuse it on a later run, because the element may have been deleted in the meantime. Both are common mistakes.
{`// REJECTED - map value pointer dereferenced without a NULL check
u64 *cnt = bpf_map_lookup_elem(&map, &key);
(*cnt)++; // the lookup may have returned NULL
// ACCEPTED - look up on every invocation and check the result before use
u64 *cnt = bpf_map_lookup_elem(&map, &key);
if (!cnt) return 0;
(*cnt)++; // safe: fresh, non-NULL pointer into the map value`}
Loop Detection
The verifier rejects programs containing loops that cannot be proven to terminate. Any back-edge in the CFG (a jump from a later instruction to an earlier one) triggers a check; kernels before 5.3 rejected every back-edge outright. If the trip count cannot be bounded by a constant at verification time, the program is rejected. This means:
- #pragma unroll tells clang to unroll a loop at compile time, producing straight-line bytecode with no back-edge. The verifier sees it as a sequence of instructions, not a loop.
- Using a break condition that depends on a non-constant (e.g. scanning packet options until a terminator) requires the loop to also be bounded by a constant maximum number of iterations.
- Some patterns that look iterative (processing a variable-length linked list) can be converted to a chain of tail calls if each step makes forward progress.
{`// REJECTED - verifier sees back-edge, unbounded trip count
int i = 0;
while (1) {
    bpf_printk("iteration %d", i);
    i++;
}
// ACCEPTED - #pragma unroll forces full unroll at compile time
#pragma unroll
for (int i = 0; i < 64; i++) {
    bpf_printk("iteration %d", i); // 64 copies of the body in bytecode
}
// ACCEPTED (kernel 5.3+) - still uses a back-edge, but the verifier can prove
// the loop exits after at most 128 iterations
for (int i = 0; i < 128; i++) {
    if (i >= n) break; // n is only known at runtime; the constant bound still holds
    // process element i
}`}
The 1M Instruction Limit
Since kernel 5.2, the verifier will analyze up to 1,048,576 (1M) instructions per program, up from the original 4096-instruction limit (unprivileged programs are still capped at 4096). This is a verification budget, not a performance budget: programs approaching the limit load slowly because the verifier's path exploration is exponential in the worst case. It does make large, heavily unrolled programs viable. In practice, most production programs are a few hundred to a few thousand instructions.
The limit applies to the bytecode after clang compilation, not the source. A single
bpf_printk() call generates dozens of instructions (string encoding, format
setup). Large #pragma unroll loops multiply quickly. If you approach the
limit, the first fix is usually to move data lookup into a map and use a few instructions
to retrieve it, rather than encoding it as immediate values in the instruction stream.
Common Verifier Errors
{`// error: invalid context access
// kprobe reading from userspace memory directly
bpf_probe_read_user(dst, size, src); // wrong helper for kernel ptr
// error: unbounded iteration
// looping through a map without a trip-count bound
while (bpf_map_get_next_key(fd, &key, &next) == 0) { ... }
// error: stack must be within ...
// writing beyond 512-byte stack limit
volatile char buf[600]; // 600 > 512
// error: dereferencing of uninit addr
// using a variable as an index into a stack array without bounds check
int idx = r6; // r6 is a scalar from packet
volatile int val = stack[idx]; // rejected unless idx is range-checked first`}
BPF Maps Deep Dive
Maps are the only stateful primitive in the eBPF environment. They are created by
userspace via the bpf() syscall, shared with programs via file descriptors,
and used by programs for counters, histograms, scratch space, and inter-program
communication. The map type determines its access semantics, concurrency guarantees,
and memory model.
BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY
The two fundamental map types. A hash map (BPF_MAP_TYPE_HASH) stores
arbitrary key/value pairs in a hash table whose capacity (max_entries) is fixed
when the map is created. Key and value types are also defined
at map creation time. Operations: bpf_map_lookup_elem,
bpf_map_update_elem, bpf_map_delete_elem, and (from userspace)
bpf_map_get_next_key. Updates take a per-bucket spinlock, so high-frequency
updates to the same keys from multiple CPUs cause contention.
An array map (BPF_MAP_TYPE_ARRAY) is a dense indexed array. Keys are
always integers (0..max_entries-1). Lookup is O(1) with no hashing. For fixed-size
buckets in a histogram, array maps are faster than hashes. The common pattern:
histogram[index] to increment a latency bucket or syscall counter.
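A latency histogram over an array map usually uses power-of-two buckets, so the index is floor(log2(value)). The sketch below models that in plain userspace C; the static histogram array stands in for a BPF_MAP_TYPE_ARRAY, and log2_bucket, record_latency, and NUM_BUCKETS are illustrative names, not part of any BPF API.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BUCKETS 32

/* Stand-in for a BPF_MAP_TYPE_ARRAY with NUM_BUCKETS u64 slots. */
static uint64_t histogram[NUM_BUCKETS];

/* Bucket index = floor(log2(v)). Zero maps to bucket 0, and very large
 * values are clamped into the last bucket so the index is always in range. */
static uint32_t log2_bucket(uint64_t v)
{
    uint32_t idx = 0;
    while (v > 1) {
        v >>= 1;
        idx++;
    }
    if (idx >= NUM_BUCKETS)
        idx = NUM_BUCKETS - 1;
    return idx;
}

/* Increment the bucket that covers this latency. */
static void record_latency(uint64_t latency_ns)
{
    uint32_t idx = log2_bucket(latency_ns);
    assert(idx < NUM_BUCKETS); /* the bound a verifier would insist on */
    histogram[idx]++;
}
```

A 1000 ns and a 1023 ns sample both land in bucket 9, which covers 512 to 1023 ns. In a real program the loop would be written with a constant bound (or as a chain of comparisons) so that the verifier can prove it terminates.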
PERCPU Variants
BPF_MAP_TYPE_PERCPU_HASH and BPF_MAP_TYPE_PERCPU_ARRAY
maintain a separate copy of every slot for each CPU. When the program increments a
counter, it only touches its own CPU's copy — no cross-CPU synchronization
whatsoever. Userspace reads the map by iterating over all CPUs and summing the values.
This is the right choice for any counter touched more than a few thousand times per
second per CPU. The memory cost is multiplied by the number of CPUs (e.g. 128 CPUs
× 1024 entries).
{`// Wrong: shared hash map, cross-CPU contention at 100K syscalls/s
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u64);
} counts SEC(".maps");
// a frequently hit key becomes a hot, contested slot under heavy syscall load
// Right: per-CPU hash, each CPU touches only its own copy
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u64);
} counts SEC(".maps");
// No cross-CPU atomics needed, just per-CPU updates.
// Userspace aggregation (libbpf)
int nrcpus = libbpf_num_possible_cpus();
__u32 *prev = NULL, key;
long long total = 0;
while (bpf_map_get_next_key(map_fd, prev, &key) == 0) {
    __u64 vals[nrcpus]; // one value per possible CPU
    if (bpf_map_lookup_elem(map_fd, &key, vals) == 0)
        for (int cpu = 0; cpu < nrcpus; cpu++) total += vals[cpu];
    prev = &key;
}`}
BPF_MAP_TYPE_STACK_TRACE
A specialized map that stores stack traces. The key is a 32-bit stack ID and the
value is an array of instruction pointers. The program does not fill it with
bpf_map_update_elem; instead it calls bpf_get_stackid(ctx, &stackmap, flags),
which walks the frame pointer chain, stores the trace in the map (identical
stacks hash to the same ID), and returns the ID. Userspace later reads the
instruction pointers by ID and resolves them to symbol names (kernel addresses
via /proc/kallsyms, user addresses via the binary's symbols). This is how
continuous CPU profilers reconstruct call stacks: the perf event fires, the
program gets the pid/tid and the stack ID, and increments a counter keyed
by that pair.
BPF_MAP_TYPE_LRU_HASH
An LRU (Least Recently Used) hash map evicts entries automatically when the map
reaches capacity and all buckets are full. The BPF_MAP_TYPE_LRU_PERCPU_HASH
variant combines LRU eviction with per-CPU storage. LRU maps are ideal for caches
where you want the kernel to manage capacity: a connection tracking table, a DNS
response cache, or a rate limiter window. The kernel's LRU implementation is
node-affine, meaning entries are preferentially kept on the NUMA node where
they were last accessed.
BPF_MAP_TYPE_RINGBUF vs BPF_MAP_TYPE_PERF_EVENT_ARRAY
The two mechanisms for streaming events from kernel to userspace. Perf buffers
(BPF_MAP_TYPE_PERF_EVENT_ARRAY) were the original mechanism: the program writes
each event with bpf_perf_event_output() into a per-CPU buffer that userspace
opens with perf_event_open(), maps with mmap(), and polls. Because every CPU
has its own buffer, memory is over-provisioned and events from different CPUs
can arrive out of order.
The BPF ring buffer (BPF_MAP_TYPE_RINGBUF, introduced in kernel 5.8)
is a simpler, higher-performance circular buffer. All producers (eBPF programs)
reserve a slot, write their data, and submit. The buffer is multi-producer,
single-consumer and shared across all CPUs, so events keep their order, and
userspace polls with epoll_wait(). It offers batched wakeups, memory-mapped I/O,
and a reserve/submit protocol that avoids an extra copy. When the buffer is full
the reservation fails and the event is dropped. Ring buffers
are the right default for new observability pipelines.
{`// Ring buffer - reserve/submit delivery
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256KB ring, power of 2
} events SEC(".maps");
struct event { u32 pid; u64 ts; char comm[16]; };
SEC("tracepoint/raw_syscalls/sys_enter")
int handle_sys_enter(struct trace_event_raw_sys_enter *ctx) {
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(struct event), 0);
    if (!e) return 0; // ring full, drop and continue
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0); // commit to ring, userspace receives it
    return 0;
}
// Userspace poll loop (libbpf); obj is the loaded struct bpf_object *,
// callback has type int (*)(void *ctx, void *data, size_t size)
int rb_fd = bpf_object__find_map_fd_by_name(obj, "events");
struct ring_buffer *rb = ring_buffer__new(rb_fd, callback, NULL, NULL);
while (running) {
    ring_buffer__poll(rb, 500 /* ms timeout */);
}
ring_buffer__free(rb);`}
BPF_MAP_TYPE_PROG_ARRAY (Tail Calls)
A tail call chains eBPF programs together: program A calls
bpf_tail_call(ctx, &prog_array, index) and the kernel jumps to the program
stored at index, which takes over the hook. The callee receives the same
context struct and reuses A's stack frame, but control never returns to A.
If the slot is empty the call fails and A simply continues. This is the classic
mechanism for dynamic program composition: a dispatcher program does a bounds
check then tail-calls into a specific handler. Each program in the chain is
verified on its own against the instruction limit, and the kernel caps a chain
at 33 tail calls so it cannot run forever.
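The dispatcher pattern can be modeled in userspace C with a table of function pointers. This is a sketch under stated assumptions, not kernel code: prog_array, run_chain, and the handlers are hypothetical, a handler's return value stands in for its own bpf_tail_call, and the cap of 33 mirrors the kernel's limit on tail calls per chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROG_ARRAY_SIZE 4
#define MAX_TAIL_CALLS  33 /* the kernel's per-chain tail-call limit */

struct ctx {
    uint32_t proto;
    uint32_t handled_by;
};

/* Each "program" returns the slot to tail-call next, or -1 to finish. */
typedef int (*prog_fn)(struct ctx *);

static int handle_tcp(struct ctx *c) { c->handled_by = 1; return -1; }
static int handle_udp(struct ctx *c) { c->handled_by = 2; return -1; }
static int loop_forever(struct ctx *c) { (void)c; return 3; } /* calls itself */

/* Stand-in for the program array; slot 0 is left empty on purpose. */
static prog_fn prog_array[PROG_ARRAY_SIZE] = {
    NULL, handle_tcp, handle_udp, loop_forever
};

/* Runs a chain starting at `index`; returns how many tail calls succeeded. */
static int run_chain(struct ctx *c, int index)
{
    int calls = 0;

    assert(c != NULL);
    while (index >= 0 && index < PROG_ARRAY_SIZE && /* bounds check        */
           prog_array[index] != NULL &&             /* empty slot: no jump */
           calls < MAX_TAIL_CALLS) {                /* chain length cap    */
        calls++;
        index = prog_array[index](c);
    }
    return calls;
}
```

An out-of-range or empty slot results in zero calls, which corresponds to the caller falling through after a failed bpf_tail_call. The self-calling handler stops after exactly 33 hops instead of spinning forever.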
BPF_MAP_TYPE_HASH_OF_MAPS and BPF_MAP_TYPE_PROG_ARRAY
These two types enable runtime indirection. BPF_MAP_TYPE_HASH_OF_MAPS
holds references to other maps as values (userspace inserts them by file
descriptor), letting a program look up a sub-map by key
at runtime. This is used by Cilium's per-connection-policy lookups: the connection
5-tuple is the key, and the value is a pointer to the policy map for that pod.
BPF_MAP_TYPE_PROG_ARRAY stores program file descriptors for
tail-call targets.
BTF and CO-RE
BTF (BPF Type Format) and CO-RE (Compile Once, Run Everywhere) are the twin pillars that make eBPF programs portable across kernel versions without recompilation. Understanding them is essential for any production eBPF deployment.
BTF Type Descriptions
BTF encodes the types of all C structs, unions, enums, and typedefs in the kernel
(and in eBPF programs) into a compact binary format embedded in the kernel image.
The kernel exposes BTF data via the /sys/kernel/btf/vmlinux virtual
file. When a kernel is built with CONFIG_DEBUG_INFO_BTF=y, the BTF
info is emitted as a .BTF section and retained in the vmlinux ELF file.
Tools like bpftool can dump the entire type tree:
{`# Dump every type in the running kernel as C definitions
$ bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# Abridged excerpt of struct sk_buff from vmlinux.h (layout varies by kernel)
struct sk_buff {
    ...
    unsigned int len;
    unsigned int data_len;
    ...
};
# Find a specific struct definition
$ grep '^struct sock {' vmlinux.h
struct sock {`}
Each type has a BTF ID. When you write ctx->sk in a program, the
verifier looks up the BTF ID for struct sk_buff, confirms that
sk is a field of type struct sock*, and records the
offset. This is how the verifier knows field offsets — it reads them from
BTF, not from compiled headers.
CO-RE: Relocating Struct Fields at Load Time
The kernel's internal structs change between versions. A field that was at offset
24 in struct sk_buff in kernel 5.4 might be at offset 32 in 5.8
due to new fields being inserted. CO-RE solves this by:
- Compiling the eBPF program with BTF information embedded in the .o file
- Recording, for each struct field access, the BTF type ID and field index rather than a hard-coded byte offset
- At load time, the libbpf loader reads the target kernel's BTF (via /sys/kernel/btf/vmlinux), looks up the current offset of that field, and patches the eBPF bytecode to use the correct offset
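The three steps above can be sketched as a userspace C model. It is deliberately simplified and is not libbpf's implementation: struct btf_field, struct insn, struct reloc, and apply_relocs are hypothetical names, and real relocations identify fields by BTF type ID and access path rather than by a bare string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One field of the target kernel's struct layout, as BTF would describe it. */
struct btf_field {
    const char *name;
    uint32_t offset;
};

/* A load instruction whose offset was left as a placeholder at compile time. */
struct insn {
    uint32_t offset;
};

/* Relocation record: "instruction insn_idx accesses the field called field". */
struct reloc {
    size_t insn_idx;
    const char *field;
};

/* Patch every relocated instruction with the target kernel's offset.
 * Returns 0 on success, -1 if a field does not exist on this kernel. */
static int apply_relocs(struct insn *insns, size_t n_insns,
                        const struct reloc *relocs, size_t n_relocs,
                        const struct btf_field *fields, size_t n_fields)
{
    for (size_t r = 0; r < n_relocs; r++) {
        int found = 0;

        assert(relocs[r].insn_idx < n_insns);
        for (size_t f = 0; f < n_fields; f++) {
            if (strcmp(fields[f].name, relocs[r].field) == 0) {
                insns[relocs[r].insn_idx].offset = fields[f].offset;
                found = 1;
                break;
            }
        }
        if (!found)
            return -1;
    }
    return 0;
}
```

Applying the same relocation table against a layout where len sits at offset 24 and against one where it sits at offset 32 patches the same instruction with a different value each time, which is the whole point of CO-RE: one compiled object, one load-time fixup per kernel.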
{`// source: reading sk_buff->len using CO-RE
SEC("kprobe/tcp_v4_rcv")
int BPF_KPROBE(handle_skb, struct sk_buff *skb) {
    // BTF knows that "len" is at some offset that varies by kernel version.
    // BPF_CORE_READ records a relocation instead of a hard-coded offset,
    // and libbpf patches the load instruction at load time.
    unsigned int len = BPF_CORE_READ(skb, len);
    // ...
    return 0;
}
// The compiled .o file contains a CO-RE relocation record per access, roughly:
//   type = struct sk_buff, access = field "len", instruction = <index of the load>
// At load time, the loader reads /sys/kernel/btf/vmlinux,
// finds struct sk_buff, looks up the offset of "len" on this kernel,
// and patches that instruction with it.
// The program is now valid on any BTF-enabled kernel that still has the field.`}
BTF-Assisted Kprobes
Kernel 5.5 introduced fentry and fexit programs (BPF_PROG_TYPE_TRACING), which
use BTF to expose function arguments by name and type rather than
through the raw pt_regs structure. For plain kprobes, libbpf's BPF_KPROBE macro gives a similar typed function signature by unpacking pt_regs for you:
{`// Instead of the raw kprobe approach:
SEC("kprobe/tcp_sendmsg")
int kprobe__tcp_sendmsg(struct pt_regs *ctx) {
    // Must manually parse pt_regs to get the arguments
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx); // architecture-specific, ugly
    // ...
    return 0;
}
// The BPF_KPROBE macro unpacks pt_regs into named, typed arguments:
SEC("kprobe/tcp_sendmsg")
int BPF_KPROBE(trace_tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size) {
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    // ...
    return 0;
}
// BPF_KPROBE expands to a kprobe program with a typed signature.
// The loader resolves "tcp_sendmsg" from the SEC() name; the types are a
// convenience and are not checked against the kernel.
// With SEC("fentry/tcp_sendmsg") and BPF_PROG(...) instead, the verifier
// validates the arguments against the BTF signature of the function.`}
Programs and Hook Points
Where you attach the program determines what events trigger it and what context the kernel hands you.
kprobe / kretprobe
Attach to the entry or return of any non-inlined kernel function. kprobe/tcp_sendmsg
fires every time the kernel sends TCP data. Works on any kernel from 4.x but the
function name is unstable across versions. The kernel uses a breakpoint trap
(or an ftrace-based jump when the probe sits at the function entry) to hand control to the
eBPF program. A kretprobe fires after the function returns, with the return
value available through PT_REGS_RC(ctx). On kernels with BTF, fentry/fexit
trampolines do the same job with noticeably lower per-call overhead.
tracepoint
Stable named hook points the kernel exposes intentionally. tracepoint/syscalls/sys_enter_openat
is maintained as a stable interface and survives version bumps. Always prefer tracepoints
over kprobes when one exists. Each tracepoint has a generated
struct trace_event_raw_* with named, typed fields. The kernel
rarely changes tracepoint signatures. List /sys/kernel/debug/tracing/events/
(or run bpftrace -l 'tracepoint:*') to see the tracepoints available on your kernel.
uprobe / uretprobe
Userspace function entry/return. Hook SSL_write in libssl to capture
HTTPS payloads in plaintext, before they are encrypted. Pixie's auto-instrumentation is built almost entirely
on uprobes into TLS, gRPC, and language runtimes. Uprobe attachment uses the
binary's ELF symbol table to find the function entry point, then converts the
symbol's virtual address to a file offset, which is what the kernel actually
instruments. This is why it works for position-independent (PIE) binaries and
shared libraries. A uretprobe runs after the function returns: the kernel
replaces the return address on the stack with a trampoline that fires the probe
and then resumes the original caller.
XDP
Runs at NIC driver level, before the kernel's network stack sees the packet. The
XDP context gives access to the packet buffer directly. XDP return codes:
XDP_PASS (send up the stack), XDP_DROP (discard),
XDP_REDIRECT (send to another interface, another CPU, or to userspace via an
AF_XDP socket), XDP_TX (bounce out the same interface). Used for DDoS
scrubbing, load balancing (Cilium, Katran), and packet capture without
context-switching to userspace. Some drivers support only a subset of modes.
tc (traffic control)
Attach to ingress or egress qdisc on a network device. Slower than XDP but with
access to skb metadata including routing decisions, cgroup info, and connection
state. Cilium uses tc for network policy enforcement and service mesh. tc eBPF programs
are attached by creating a clsact qdisc (tc qdisc add dev eth0 clsact) and adding a
bpf filter to its ingress or egress hook, and are part of the Linux
traffic control pipeline. The qdisc layer is traversed on both RX and TX paths,
making tc a natural place for per-connection accounting.
perf event
Fires on hardware perf counters (cycles, instructions, cache-misses, branch
mispredictions) or software events (cpu-clock, task-clock). Used for sampling
profilers: N times per second, the perf interrupt fires and the eBPF program is called
with the interrupted registers. It captures the current stack with
bpf_get_stackid() into a STACK_TRACE map and
records a sample. The sampling rate is set from userspace in the
perf_event_attr passed to perf_event_open() (sample_freq with freq = 1, or sample_period).
BPF Maps
Maps are the only persistent state available to a program and the only way to communicate with userspace. The map type determines its semantics — hash, array, LRU, perf ring buffer, stack trace, per-CPU.
{`// libbpf C - per-CPU hash map for syscall counts
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
__uint(max_entries, 1024);
__type(key, u32); // syscall number
__type(value, u64); // count
} syscall_counts SEC(".maps");
SEC("tracepoint/raw_syscalls/sys_enter")
int count_syscall(struct trace_event_raw_sys_enter *ctx) {
u32 nr = ctx->id;
u64 *count = bpf_map_lookup_elem(&syscall_counts, &nr);
if (count) (*count)++;
else { u64 init = 1; bpf_map_update_elem(&syscall_counts, &nr, &init, 0); }
return 0;
}
// Userspace reads the map periodically
__u32 key, *prev = NULL;
while (bpf_map_get_next_key(map_fd, prev, &key) == 0) {
__u64 vals[num_cpus];
bpf_map_lookup_elem(map_fd, &key, vals);
__u64 total = 0;
for (int i = 0; i < num_cpus; i++) total += vals[i];
printf("syscall %u: %llu\\n", key, total);
prev = &key;
}`}
Per-CPU maps avoid lock contention: each CPU has its own copy and userspace sums them. For streaming high-volume events to userspace, the perf buffer and its successor, the BPF ring buffer, are the preferred mechanisms. Both use memory-mapped circular buffers, and the BPF ring buffer additionally preserves event ordering across CPUs.
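The circular-buffer idea behind both streaming mechanisms can be shown with a single-threaded userspace model. This is a minimal sketch, assuming fixed-size records and no concurrency; struct ring, ring_reserve, and ring_consume are hypothetical names, and the real BPF ring buffer additionally has a separate submit step and per-record headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 64u /* bytes; a power of two so masking wraps the index */

struct ring {
    uint8_t  data[RING_SIZE];
    uint64_t producer; /* total bytes ever reserved */
    uint64_t consumer; /* total bytes ever consumed */
};

/* Reserve `len` bytes, or return NULL when the ring is full (the caller then
 * drops the event). To keep the sketch short, every record must have the same
 * size and that size must divide RING_SIZE, so records never straddle the
 * wrap point. */
static void *ring_reserve(struct ring *r, uint32_t len)
{
    assert(len > 0 && RING_SIZE % len == 0);
    if (r->producer - r->consumer + len > RING_SIZE)
        return NULL;
    void *slot = &r->data[r->producer & (RING_SIZE - 1)];
    r->producer += len;
    return slot;
}

/* Copy out the oldest record; returns 1 on success, 0 if the ring is empty. */
static int ring_consume(struct ring *r, void *out, uint32_t len)
{
    if (r->consumer == r->producer)
        return 0;
    memcpy(out, &r->data[r->consumer & (RING_SIZE - 1)], len);
    r->consumer += len;
    return 1;
}
```

The producer and consumer positions only ever grow, and the mask turns them into offsets inside the buffer. Four 16-byte records fill this ring, a fifth reservation fails, and consuming one record frees exactly one slot for the next producer.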
bpftrace
bpftrace is a high-level DSL for eBPF that lets you write one-liners and short scripts that compile and run at runtime. It sits on top of the kernel's eBPF infrastructure and exposes an awk-like language with built-in variables for common kernel state. It is the fastest way to go from "what does this function do" to "I have data" for ad-hoc investigation.
One-liners
{`# Count file opens by process name
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'
# Count TCP retransmits by the kernel stack that triggered them
bpftrace -e 'kprobe:tcp_retransmit_skb { @[kstack] = count(); }'
# Histogram of read() syscall latency via vfs_read
bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
    @ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'
# Count lock contention events by lock address (kernel 5.19+)
bpftrace -e 'tracepoint:lock:contention_begin { @locks[args->lock_addr] = count(); }'
# Print every new process with its arguments
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'
# Bytes read per process (the exit tracepoint carries only the return value)
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes[pid, comm] = sum(args->ret); }'`}
USDT Probes (Statically Defined Tracepoints)
User-level Statically Defined Tracepoints (USDT) are compile-time instrumentation inserted into application code. They are used by gRPC, MongoDB, MySQL, Node.js, and many other runtimes. bpftrace attaches to them through the kernel's uprobe mechanism, without any changes to the application:
{`# List available USDT probes in a binary
bpftrace -l 'usdt:/path/to/binary:*'
# or via:
# readelf -n /path/to/binary | grep -A2 stapsdt
# Trace gRPC server received messages (illustrative probe names; check the listing first)
bpftrace -e '
usdt:/path/to/myservice:grpc:server_received_message {
    printf("pid=%d method=%s size=%d\\n", pid, str(arg0), arg1);
}
usdt:/path/to/myservice:grpc:server_sent_message {
    printf("pid=%d method=%s size=%d\\n", pid, str(arg0), arg1);
}' -p $(pgrep -x myservice)`}
bpftrace Scripts and Filters
A bpftrace script file (.bt) has the same language as one-liners but organized into multiple probes with shared state. Filtering in bpftrace uses an expression after the probe specifier that acts as a predicate:
{`#!/usr/bin/env bpftrace
// Only fire when the file being opened is in /tmp (filter expression)
tracepoint:syscalls:sys_enter_openat
/strcontains(str(args->filename), "/tmp")/ {
    @tmp_opens[comm] = count();
}
// Attach to the scheduler to count context switches per outgoing task
tracepoint:sched:sched_switch {
    @switches[args->prev_comm] = count();
}
// Count TCP connection state changes by new state
tracepoint:sock:inet_sock_set_state {
    @tcp_states[args->newstate] = count();
}
// Print and reset the TCP state counts every 10 seconds
interval:s:10 {
    print(@tcp_states);
    clear(@tcp_states);
}
END {
    print(@tmp_opens);
    print(@switches);
}`}
Output Formats
bpftrace supports several output modes. The default is human-readable terminal
output with auto-scaling histograms. For machine consumption, pass
-f json to get structured events.
bpftrace also supports -o filename to write to a file, and its output can be
piped into a log aggregation pipeline.
The Toolchain: BCC vs bpftrace vs libbpf
Several layers, each with a different tradeoff between ease of use and runtime cost.
| Tool | Style | Compile | Best for |
|---|---|---|---|
| bpftrace | awk-like DSL | at runtime | ad-hoc one-liners, troubleshooting |
| BCC | Python + embedded C | at runtime via clang | full programs with userspace UI; clang dependency in production |
| libbpf + CO-RE | C, compiled to BPF bytecode with BTF | ahead of time | production deployments; small, fast, portable |
| cilium/ebpf (Go) | pure-Go loader library | ahead of time | Go-native exporters, agents |
| aya (Rust) | pure-Rust library | ahead of time | memory-safe userspace controllers |
BCC: C Embedded in Python
The BPF Compiler Collection (BCC) embeds eBPF C code as a multi-line string inside a Python (or Lua) program. At runtime, BCC compiles the C with clang/LLVM (which must be installed on the target machine), creates maps, and attaches the programs via the bpf syscall. The Python side provides the userspace logic: reading maps, aggregating data, printing tables, serving HTTP. The tradeoff is the clang dependency — compiling eBPF at runtime takes seconds, and shipping clang in a production container image is heavyweight. BCC tools are widely used for ad-hoc debugging scripts.
{`#!/usr/bin/env python3
# adapted from the BCC examples: count openat() calls per process
from time import sleep
from bcc import BPF
program = r"""
#include <uapi/linux/ptrace.h>
// Per-process counter map
BPF_HASH(counts, u32, u64);
// uprobe handlers receive the register state, not a tracepoint struct
int hello(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *cnt = counts.lookup(&pid);
    if (cnt)
        (*cnt)++;
    else {
        u64 init = 1;
        counts.update(&pid, &init);
    }
    return 0;
}
"""
b = BPF(text=program)
b.attach_uprobe(name="c", sym="openat", fn_name="hello")
print("Tracing openat()... Hit Ctrl-C to end.")
try:
    while True:
        sleep(1)
        for pid, cnt in b["counts"].items():
            print(f"PID {pid.value}: openat called {cnt.value} times")
except KeyboardInterrupt:
    pass`}
Cilium: eBPF as Service Mesh Dataplane
Cilium is the most sophisticated production use of eBPF in the cloud-native ecosystem. It replaces kube-proxy and much of the Linux networking stack with eBPF programs that handle load balancing, routing, encryption (WireGuard), and L7 policy enforcement entirely in-kernel.
How Cilium Replaces iptables
Before Cilium, Kubernetes NetworkPolicy enforcement used iptables rules that were evaluated
linearly for every packet. With thousands of rules, this is O(n) work per packet,
a significant CPU cost at scale. Cilium replaces this with BPF_MAP_TYPE_SOCKHASH
and BPF_PROG_TYPE_SOCK_OPS: when a connection is established, a sockops
program stores a reference to the socket in a hash map keyed by the 4-tuple (src IP,
src port, dst IP, dst port). Subsequent packets are matched against this map via a
socket lookup in the tc path, bypassing the iptables chain entirely.
Service load balancing (mapping a Kubernetes Service cluster IP to backend pods)
is handled by dedicated BPF maps (the service and backend maps) and
a cls program that rewrites destination addresses. This is both faster than kube-proxy's
iptables rules and more predictable: a map lookup takes constant time regardless
of the number of services and backends.
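The difference between a linear rule walk and a map lookup can be made concrete with a small userspace C comparison. This is an illustration only, not Cilium's data structure: struct backend, lookup_linear, table_insert, and table_lookup are hypothetical, and a toy open-addressing table stands in for a BPF hash map.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 1024u /* power of two */

struct backend {
    uint32_t service_ip;
    uint32_t backend_ip;
    int used;
};

/* iptables-style: walk the rules in order until one matches. */
static uint32_t lookup_linear(const struct backend *rules, uint32_t n,
                              uint32_t service_ip, uint32_t *steps)
{
    *steps = 0;
    for (uint32_t i = 0; i < n; i++) {
        (*steps)++;
        if (rules[i].service_ip == service_ip)
            return rules[i].backend_ip;
    }
    return 0;
}

static uint32_t hash_ip(uint32_t ip)
{
    return (ip * 2654435761u) & (TABLE_SIZE - 1);
}

/* Map-style insert: hash to a slot, probe forward on collision. */
static void table_insert(struct backend *table, uint32_t service_ip,
                         uint32_t backend_ip)
{
    uint32_t i = hash_ip(service_ip), probes = 0;
    while (table[i].used && table[i].service_ip != service_ip) {
        i = (i + 1) & (TABLE_SIZE - 1);
        assert(++probes < TABLE_SIZE); /* table must not be full */
    }
    table[i] = (struct backend){ service_ip, backend_ip, 1 };
}

/* Map-style lookup: cost depends on collisions, not on the entry count. */
static uint32_t table_lookup(const struct backend *table, uint32_t service_ip,
                             uint32_t *steps)
{
    uint32_t i = hash_ip(service_ip);
    *steps = 1;
    while (table[i].used) {
        if (table[i].service_ip == service_ip)
            return table[i].backend_ip;
        i = (i + 1) & (TABLE_SIZE - 1);
        (*steps)++;
    }
    return 0;
}
```

With 500 entries, finding the last rule in the linear list takes 500 comparisons, while the table lookup for the same address touches a single slot. Doubling the rule count doubles the first number and leaves the second unchanged as long as the table stays sparse.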
Cilium Hubble: Observability via eBPF
Hubble is Cilium's observability layer. It attaches a tc classifier program at the egress hook that captures every connection event (connection opened, closed, data transferred) and writes it to a ring buffer. A separate userspace daemon (hubble-relay) aggregates the per-node flows and exposes them via a gRPC API consumed by the Hubble UI and CLI. The result: a full service map and connection timeline for every pod in the cluster, with no application instrumentation required.
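The aggregation step, turning per-connection flow events into service-map edges, might look like the sketch below. The `Flow` record and its fields are assumptions made for illustration; Hubble's real flow schema carries far more detail.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    # Hypothetical, simplified flow event
    src_pod: str
    dst_pod: str
    dst_port: int
    num_bytes: int


def build_service_map(flows):
    """Collapse flow events into edges: (src, dst, port) -> totals."""
    edges = defaultdict(lambda: {"connections": 0, "bytes": 0})
    for f in flows:
        edge = edges[(f.src_pod, f.dst_pod, f.dst_port)]
        edge["connections"] += 1
        edge["bytes"] += f.num_bytes
    return dict(edges)


flows = [
    Flow("frontend", "cart", 8080, 512),
    Flow("frontend", "cart", 8080, 256),
    Flow("cart", "redis", 6379, 64),
]
service_map = build_service_map(flows)
for (src, dst, port), stats in sorted(service_map.items()):
    print(f"{src} -> {dst}:{port}  {stats['connections']} conns, {stats['bytes']} bytes")
```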
Pixie: Auto-Instrumentation via uprobes
Pixie is an open-source observability platform for Kubernetes that uses eBPF uprobes to automatically capture telemetry data from applications without requiring code changes, sidecars, or configuration. It is among the most ambitious uses of eBPF auto-instrumentation.
What Pixie Hooks
Pixie's auto-instrumentation scripts define a set of well-known function targets organized by protocol. For each protocol, a matching script attaches uprobes to the relevant library functions and decodes the protocol messages from the function arguments.
- TLS (OpenSSL / BoringSSL): SSL_read, SSL_write. These fire after the TLS record layer has decrypted or before it encrypts the payload. The uprobe receives pointers to the application buffer, which eBPF copies into the ring buffer for userspace to decode as an HTTP request or response.
- gRPC (HTTP/2 framer): nghttp2_session_mem_send, nghttp2_session_mem_recv. Pixie's gRPC scripts parse the HTTP/2 frames that wrap gRPC messages and extract the method name and payload size.
- DNS: getaddrinfo, libc_sendto with a filter for port 53. Pixie correlates DNS queries with application connections.
- MySQL, Postgres, Redis, Kafka: Entry and return probes on the protocol framing functions in each library (libmysqlclient, libpq, hiredis, librdkafka). Pixie parses the wire protocol to extract query text and response rows.
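Once the plaintext buffer reaches userspace, decoding it is ordinary protocol parsing. The sketch below shows that step for an HTTP/1.x request, assuming the eBPF side has already delivered the bytes that were passed to SSL_write. The function name `parse_http_request` is hypothetical and this is not Pixie's parser.

```python
def parse_http_request(buf: bytes):
    """Parse the request line and headers from a captured plaintext buffer.

    Returns None if the buffer is not a complete HTTP/1.x request head.
    """
    head, sep, _body = buf.partition(b"\r\n\r\n")
    if not sep:
        return None  # headers incomplete; a real tracer would wait for more data
    lines = head.decode("latin-1").split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/1."):
        return None
    method, path, version = parts
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:
            headers[name.strip().lower()] = value.strip()
    return {"method": method, "path": path, "version": version, "headers": headers}


captured = b"GET /api/cart?id=7 HTTP/1.1\r\nHost: shop.internal\r\nAccept: */*\r\n\r\n"
req = parse_http_request(captured)
print(req["method"], req["path"], req["headers"]["host"])
# GET /api/cart?id=7 shop.internal
```

Because the probe sits above the TLS record layer, the parser never has to deal with ciphertext, only with whatever the application wrote.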
The Coupling Problem
The fundamental challenge with Pixie's approach is library version coupling. A uprobe targets a specific function with a specific signature at a specific offset. When a library ships a new version that changes the function layout, inlines helpers, or modifies the wire protocol parser, the probes break silently or produce incorrect data. Pixie maintains a table of library versions and their corresponding probe specs, updating it when distributions ship new library versions. For long-lived clusters where base images change slowly, this works well. For clusters with rolling updates and heterogeneous images, the maintenance burden is significant.
Parca: Continuous CPU Profiling with eBPF
Parca is a continuous CPU profiler that uses eBPF to sample CPU utilization across an entire fleet with minimal overhead, storing only the aggregated stack/histogram rather than raw samples.
A traditional profiler using perf record writes gigabytes of raw
perf.data per host per day, because every sample includes the full
call chain. Parca's eBPF program instead maintains a BPF_MAP_TYPE_STACK_TRACE
map that stores each unique call chain (the list of instruction pointers) under a
stack ID, and a separate count map keyed by (pid, stack_id). At each sampling
interval, the perf event fires, the program obtains the stack ID for the current
call chain, and increments a single counter for it. Parca ships only the deltas of
those counts to the server. The data rate is orders of magnitude smaller than
raw perf data.
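The count-then-ship-deltas idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not Parca's code: call chains are shown as tuples of function names rather than instruction pointers, and `record_sample` and `delta` are hypothetical helper names.

```python
from collections import Counter


def record_sample(counts: Counter, pid: int, stack: tuple) -> None:
    """Per sample: bump one counter for this unique call chain."""
    counts[(pid, stack)] += 1


def delta(current: Counter, previous: Counter) -> dict:
    """Per export: keep only the counters that changed since the last export."""
    return {k: v - previous[k] for k, v in current.items() if v != previous[k]}


counts = Counter()
for _ in range(3):
    record_sample(counts, 42, ("main", "handle_request", "parse_json"))
record_sample(counts, 42, ("main", "gc"))
snapshot = counts.copy()  # state at the previous export

record_sample(counts, 42, ("main", "gc"))
print(delta(counts, snapshot))  # {(42, ('main', 'gc')): 1}
```

Three identical samples cost one map entry instead of three full call chains, and an export interval in which a stack was not sampled contributes nothing to the payload.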
For unwinding, Parca supports both frame-pointer-based unwinding (fast, works on
most x86_64 binaries compiled with -fno-omit-frame-pointer) and
DWARF-based unwinding (accurate, works for optimized binaries). The kernel has no
DWARF unwinder, so the DWARF path runs inside the eBPF program: the agent builds
compact unwind tables from each binary's .eh_frame section in userspace and loads
them into BPF maps, which the program walks to produce accurate stacks for
applications compiled with -O2 or -O3 and no frame pointers.
Tetragon: Runtime Security Enforcement via eBPF
Tetragon (by Isovalent, part of the Cilium project) is a runtime security tool that uses eBPF kprobes, tracepoints, and uprobes to capture syscall execution with full argument and return value context, then enforces policy-based access control at the kernel level.
Unlike Cilium, which focuses on network policy, Tetragon enforces process-level
and file-level security policies by attaching to raw syscalls (via tracepoint
raw_syscalls:sys_enter and raw_syscalls:sys_exit). When
a syscall matches a policy rule, e.g. "process /usr/bin/curl may
not execve anything except /lib/*/curl", the eBPF program can
send a signal to kill the process (SIGKILL), return an error code,
or send an event to the tracing infrastructure for alerting. The enforcement happens
in-kernel, synchronously with the syscall, which makes it much harder to race or
evade than a userspace agent that reacts after the fact.
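The match-then-act logic can be modeled in userspace to make the rule semantics concrete. The `Rule` format and action names below are hypothetical and do not follow Tetragon's policy schema; the sketch only mirrors the curl example above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    binary: str       # process the rule applies to
    syscall: str
    allowed_arg: str  # glob for the only permitted argument
    action: str       # e.g. "SIGKILL", "OVERRIDE" (return an error), "POST" (event only)


def evaluate(rules, binary, syscall, arg):
    """Return the action of the first violated rule, or "ALLOW"."""
    for r in rules:
        if r.binary == binary and r.syscall == syscall and not fnmatchcase(arg, r.allowed_arg):
            return r.action
    return "ALLOW"


rules = [Rule("/usr/bin/curl", "execve", "/lib/*/curl", "SIGKILL")]
print(evaluate(rules, "/usr/bin/curl", "execve", "/bin/sh"))           # SIGKILL
print(evaluate(rules, "/usr/bin/curl", "execve", "/lib/x86_64/curl"))  # ALLOW
print(evaluate(rules, "/usr/bin/wget", "execve", "/bin/sh"))           # ALLOW
```

In the real system this decision is taken inside the eBPF program at the hook, so the action is applied before the syscall completes.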
For observability, Tetragon streams traced events from the kernel through a ring buffer, and its userspace agent exports them as structured JSON with fields for the process identity (pid, tid, uid, gid), the syscall arguments, the return value, and the stack trace. This gives detailed syscall-level visibility without a kernel module.
Performance Considerations
eBPF programs run in performance-critical paths. Design decisions that seem minor at development time can become measurable overhead at production traffic rates.
Per-CPU Maps Are Non-Negotiable for Hot Counters
A counter in a shared hash map that is updated 50,000 times per second per CPU
(for example, a connection tracker on a busy node) forces every CPU to write the
same map element. On a 128-CPU machine, that element's cache line bounces between
all 128 CPUs: atomic increments serialize on it, and plain increments silently lose
updates. The contention can manifest as 10-30% CPU overhead visible in profiling data.
Using BPF_MAP_TYPE_PERCPU_HASH eliminates this entirely: each CPU
touches only its own copy. Userspace aggregates at read time.
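The write-locally, sum-at-read pattern is easy to model. The sketch below is a plain-Python simulation of the data layout, with a hypothetical `NUM_CPUS` and helper names; it is not a BPF map.

```python
NUM_CPUS = 4  # assumed CPU count for the simulation


def percpu_increment(percpu_map: dict, key: int, cpu: int) -> None:
    """Kernel side: each CPU updates only its own slot, so no slot is shared."""
    slots = percpu_map.setdefault(key, [0] * NUM_CPUS)
    slots[cpu] += 1


def read_totals(percpu_map: dict) -> dict:
    """Userspace side: sum the per-CPU slots at read time."""
    return {key: sum(slots) for key, slots in percpu_map.items()}


conn_counts = {}
for cpu, pid in [(0, 1234), (1, 1234), (1, 1234), (3, 5678)]:
    percpu_increment(conn_counts, pid, cpu)

print(conn_counts[1234])         # [1, 2, 0, 0]
print(read_totals(conn_counts))  # {1234: 3, 5678: 1}
```

The cost moves from the hot path to the reader: memory grows with the CPU count, and each read sums one value per CPU, which is usually a good trade for counters that are written far more often than they are read.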
Batching with Ringbuf for High-Volume Events
For streaming events (every HTTP request, every syscall, every packet), the choice
between perf ring buffers and the BPF ring buffer matters. Perf buffers are allocated
per CPU, so events from different CPUs can reach userspace out of order. The BPF ring
buffer is a single buffer shared by all CPUs: bpf_ringbuf_reserve() /
bpf_ringbuf_submit() write the event in place without an intermediate copy, and
consumers can retrieve multiple events per epoll_wait() wake. For workloads with
millions of events per second, ringbuf's efficiency advantage over perf buffers is
significant.
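The consumer-side batching can be sketched with a small simulation. `RingBufSim` and the event layout (a u32 pid followed by a u64 latency) are assumptions made for illustration; a real consumer would use a library's ring buffer reader rather than a Python deque.

```python
import struct
from collections import deque

EVENT = struct.Struct("<IQ")  # hypothetical event layout: u32 pid, u64 latency_ns


class RingBufSim:
    def __init__(self):
        self._records = deque()

    def submit(self, pid: int, latency_ns: int) -> None:
        """Producer side: stands in for bpf_ringbuf_reserve() + bpf_ringbuf_submit()."""
        self._records.append(EVENT.pack(pid, latency_ns))

    def poll(self, callback) -> int:
        """Consumer side: one wakeup drains every record that is ready."""
        drained = 0
        while self._records:
            callback(*EVENT.unpack(self._records.popleft()))
            drained += 1
        return drained


rb = RingBufSim()
for i in range(5):
    rb.submit(1000 + i, 250 * (i + 1))

seen = []
n = rb.poll(lambda pid, lat: seen.append((pid, lat)))
print(n, seen[0], seen[-1])  # 5 (1000, 250) (1004, 1250)
```

Five events cost one wakeup here; the per-event work in the consumer is a fixed-size unpack and a callback.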
High-Frequency Tracepoint Overhead
A tracepoint attached to a syscall like sys_enter_read fires on every
read syscall across all processes. On a database server doing 200K reads/second, the
probe fires 200K times/second. At ~10 ns per fire for a simple counter increment,
that's 2 ms/second of CPU — well under 1% of a single core. However, the
combined overhead of many probes on hot tracepoints can become measurable. Profile
with perf stat -e cycles -e instructions with and without the probes
attached to establish a baseline.
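The arithmetic generalizes to any probe: multiply the event rate by the per-event cost. A minimal helper, with a hypothetical name, makes it easy to check a planned set of probes before deploying them.

```python
def probe_cpu_fraction(events_per_sec: float, ns_per_event: float) -> float:
    """Fraction of one core spent in the probe: events/s * ns/event / 1e9 ns/s."""
    return events_per_sec * ns_per_event / 1e9


frac = probe_cpu_fraction(200_000, 10)
print(f"{frac * 1e3:.1f} ms of CPU per second = {frac:.2%} of one core")
# 2.0 ms of CPU per second = 0.20% of one core

# Twenty probes of the same cost on equally hot tracepoints add up:
print(f"{20 * frac:.0%} of one core")  # 4% of one core
```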
XDP Per-Packet Cost
XDP programs process every packet that arrives at the NIC. At 10 Gbps line rate with 1500-byte packets, that's ~833K packets/second. A simple XDP program (drop, pass, or redirect) takes 50-100 ns on modern x86, so 833K packets/second costs roughly 42-83 ms of CPU per second, or about 4-8% of a single core. Programs that do map lookups, parse headers, or make policy decisions cost more. For line-rate packet processing at 40+ Gbps, the program must be kept minimal and data-dependent branching should be predictable.
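A useful way to reason about XDP cost is the per-packet time budget: how many nanoseconds a core can spend on each packet before it falls behind line rate. The helpers below are a sketch with hypothetical names, and they ignore Ethernet framing overhead, as the 833K figure above does.

```python
def packets_per_sec(link_gbps: float, packet_bytes: int) -> float:
    """Packets per second at line rate, ignoring Ethernet framing overhead."""
    return link_gbps * 1e9 / (packet_bytes * 8)


def ns_budget_per_packet(link_gbps: float, packet_bytes: int, cores: int = 1) -> float:
    """Nanoseconds each core may spend per packet without falling behind."""
    return cores * 1e9 / packets_per_sec(link_gbps, packet_bytes)


for gbps in (10, 40):
    pps = packets_per_sec(gbps, 1500)
    budget = ns_budget_per_packet(gbps, 1500)
    print(f"{gbps} Gbps: {pps / 1e3:.0f}K pps, {budget:.0f} ns/packet on one core")
# 10 Gbps: 833K pps, 1200 ns/packet on one core
# 40 Gbps: 3333K pps, 300 ns/packet on one core
```

With 64-byte packets the same formula gives a budget of about 51 ns per packet at 10 Gbps on one core, which is why small-packet workloads are the hard case.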
Common Mistakes
Map lookup in hot path without per-CPU
Using BPF_MAP_TYPE_HASH for a high-frequency counter makes every CPU
contend on the same map element. Prefer PERCPU_HASH for counters updated
more than a few thousand times per second.
Sleeping in non-preemptible context
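Most eBPF programs run in a context that cannot sleep. Helpers that may sleep,
such as bpf_copy_from_user, are only available to sleepable program types, and the
verifier rejects them everywhere else. bpf_printk does not sleep, but it is slow and
writes to a shared trace buffer, so calling it inside a tight loop adds latency to
the hooked path. Keep bpf_printk for debugging and send production events through
bpf_ringbuf_reserve.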
Exceeding instruction limits with unroll
#pragma unroll copies the loop body once per iteration, so a loop with an
N-instruction body and 10,000 iterations becomes roughly 10,000×N instructions
of bytecode. The program fails to load if the verifier has to process more than
1M instructions. Use a bounded loop (supported since kernel 5.3) with a realistic
maximum (e.g. 256) that your use case actually needs.
Storing map pointers across hook invocations
A pointer returned by a map lookup is only valid for the current invocation. Once the program returns, the element can be deleted or replaced by another CPU or by userspace, and the verifier will not let you dereference a pointer that was saved in a map or global during an earlier run. Always re-look up the entry.
Assuming kprobe function names are stable
A function name like tcp_sendmsg can be renamed, inlined, or
replaced by a different function between kernel versions. Programs that attach
via BTF names (BTF_KPROBE) are more resilient. For kprobes without BTF, test
on every target kernel version.
Libbpf skeleton mismatch
bpftool gen skeleton embeds the compiled object, along with its map and program
names, into a header. If the .o file is rebuilt and the skeleton is not regenerated,
userspace keeps loading the old bytecode and referencing the old names, and the
build or load fails once the two diverge. Automate skeleton generation
in the build pipeline.
perf events vs eBPF
Both can sample the kernel, but with very different cost models.
| Aspect | perf | eBPF |
|---|---|---|
| Where data lives | perf.data file (gigabytes) | BPF map (kilobytes-megabytes) |
| In-kernel filtering | limited (perf-filter) | arbitrary verified C |
| Continuous use | impractical (storage) | designed for it |
| Aggregation | userspace post-processing | in-kernel via maps |
| Stack traces | frame pointer / DWARF | BPF_MAP_TYPE_STACK_TRACE |
A continuous profiler using perf would write multi-GB perf.data files per host per day. The eBPF equivalent (Parca, Pyroscope) keeps a per-CPU stack-trace histogram in a BPF map and ships only deltas to the backend — orders of magnitude less data.
Tradeoffs
Zero instrumentation
No code changes needed. Works for closed-source binaries, third-party services, and languages that resist instrumentation.
Kernel coupling
Programs target specific kernel data structures. CO-RE + BTF mitigates this but doesn't eliminate it — some attach points change between major versions.
Privileged install
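Loading eBPF requires CAP_BPF (kernel 5.8+, usually together with CAP_PERFMON for tracing or CAP_NET_ADMIN for networking programs) or CAP_SYS_ADMIN. In multi-tenant clusters that's a security policy decision, not just a technical one.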
Verifier rejection is opaque
"invalid access to packet" with no line number is a common debugging experience. The verifier log is verbose; learning to read it is a skill.
FAQ
Can eBPF crash the kernel?
Not by design. The verifier proves termination and memory safety before the program ever runs. Barring a bug in the verifier or in a helper, the most that can go wrong is the program returning the wrong value or a buggy probe attaching to a hot function and slowing the system, not a panic.
What is BTF and why does it matter?
BTF (BPF Type Format) is type information embedded in the kernel that lets eBPF programs reference kernel struct fields by name. CO-RE (Compile Once, Run Everywhere) uses BTF to relocate field offsets at load time, so one compiled .o works across kernel versions. Without BTF, you'd recompile for every kernel.
How much overhead does an attached probe add?
A kprobe is roughly 10-50 ns per fire on modern x86. A tracepoint is faster (no breakpoint trap). For a syscall-rate workload (~100K/s) this is well under 1% CPU. For a packet-rate workload (millions/s at XDP), the program itself becomes the bottleneck and you measure in cycles, not nanoseconds.
Why use uprobes for HTTPS instead of MITM?
Because uprobes see plaintext after TLS decryption inside the application's own libssl. No certificate manipulation, no proxy, no traffic redirection. Pixie reads the buffer that SSL_read just decrypted, copies it to a ring buffer, and the application is unaware.
Can I write eBPF in Rust or Go?
Userspace controllers, yes: cilium/ebpf (Go) and aya (Rust) are mature. The kernel-side program is still C (compiled to eBPF bytecode) for almost all production tools. Aya can also compile kernel-side programs written in Rust, but the verifier expects C-like control flow patterns, so that path is less widely used.
What's the difference between BPF and eBPF?
"Classic" BPF (cBPF) was the original 1992 packet filter language for tcpdump — tiny, restricted to packet inspection. eBPF (extended BPF) is the modern in-kernel VM: 64-bit registers, maps, helper functions, multiple program types. cBPF programs are auto-translated to eBPF at load time today.