eBPF
Programmable kernel without kernel modules
eBPF (extended Berkeley Packet Filter) is an in-kernel virtual machine that lets you run sandboxed programs at kernel hook points — tracepoints, kprobes, network packets, syscalls — without writing a kernel module. The verifier proves your program terminates and is memory-safe before the JIT compiles it to native machine code. It runs in the kernel, close to hardware, at the speed of native code.
eBPF is the foundation of a new generation of observability (Pixie, Parca), networking (Cilium, Katran, Cloudflare Magic Firewall), and security (Falco, Tetragon, Aqua Tracee) tooling. The programs are small, fast, and composable, and they run in production at companies processing billions of packets per second.
Architecture
Key Numbers
Verifier: the safety contract
Before any eBPF program runs, the verifier walks every possible execution path and proves a set of safety properties. The verifier is your contract with the kernel: if it accepts your program, the kernel guarantees it won't crash, hang, or corrupt memory.
// What the verifier checks, in order:
1. TERMINATION — program must reach a return instruction on all paths
2. BOUNDS — all memory accesses provably within valid regions
3. TYPE SAFETY — helper arguments match expected types
4. STACK DEPTH — no more than 512 bytes of stack used
5. NO UNBOUNDED LOOPS — loops are rejected unless the verifier can bound them (bounded loops: 5.3+; bpf_loop() helper: 5.17+)
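/* REJECTED: unbounded loop. The verifier cannot prove that the loop exits. */
int loop(struct pt_regs *ctx) {
    int *p = (int *)ctx->di; // pointer from ctx, unknown value
    while (*p > 0) { // no provable bound on the iteration count
        (*p)--;
    }
    return 0;
}
// The load is rejected. The exact message depends on the kernel version: older
// kernels report a back-edge, newer ones report an infinite loop or that the
// program is too large once the 1M processed-instruction budget runs out.
// (Dereferencing an arbitrary pointer such as *p is itself rejected; real code
// would copy the value out with bpf_probe_read_kernel().)
// The verifier simulates every iteration of a loop, so nested loops multiply
// the work. Complex programs with deep nesting can take seconds to verify.
// On 5.3+ kernels, write loops with a bound the verifier can prove;
// on 5.17+, the bpf_loop() helper runs a callback a fixed number of times.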
/* ACCEPTED: bounded loop. The verifier can walk all 8 iterations. */
int bounded_loop(struct pt_regs *ctx) {
    int sum = 0;
    for (int i = 0; i < 8; i++) { // static bound: provably terminates (clang may unroll it)
        sum += i;
    }
    return sum;
}
What "memory-safe" means in practice
// Stack: max 512 bytes, accessed with offsets the verifier can bound
SEC("tracepoint/syscalls/sys_enter_read")
int trace_read(struct trace_event_raw_sys_enter *ctx)
{
    char buf[64] = {}; // fine: stack is bounded
    // read(fd, buf, count): args[1] is the user buffer. At sys_enter it has not
    // been filled yet, so this shows the access pattern rather than useful data.
    bpf_probe_read_user(buf, sizeof(buf) - 1, (const void *)ctx->args[1]);
    bpf_printk("buf: %s", buf); // goes to the trace pipe (read with bpftool prog tracelog)
    return 0;
}
// Maps: accessed through helpers, never by computing an address yourself.
// bpf_map_lookup_elem() returns either NULL or a pointer to the value.
// The verifier tracks provenance: because the pointer came from a map lookup,
// it knows the value size and rejects any access outside it.
__u32 key = 0;
struct stats *value;
value = bpf_map_lookup_elem(&my_hash, &key); // my_hash: a map declared with SEC(".maps")
if (!value)
    return 0; // the NULL check is mandatory; dereferencing before it is rejected
// After the check, value->field is fine. Reading past the value,
// e.g. *(value + 1), is rejected.
// Packets: the context exposes data and data_end pointers.
// Every packet access must follow a bounds check against data_end;
// skb-based programs can also use bpf_skb_load_bytes() or bpf_skb_pull_data().
// Out-of-bounds packet access → verifier rejection.
State explosion and practical limits
The verifier uses a worst-case exponential algorithm in theory. In practice, it uses symbolic execution with pruning: it merges states when two paths reach the same instruction with equivalent register values. Programs with 10k+ instructions verify in under a second. Programs with pathological branching patterns (e.g., a chain of 2000 branches each with slightly different bounds) can take minutes and may be rejected as "too complex" regardless of their actual runtime cost.
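To make the effect of pruning concrete, here is a toy model in plain C. It is not verifier code, and both function names are hypothetical. It counts how many branch evaluations a path-walking checker performs for a run of sequential two-way branches, first when every path stays distinct, then when both arms of each branch rejoin with equivalent state and the second arrival is pruned.

```c
#include <stdint.h>

/* Toy model, not kernel code. No two paths are ever considered equivalent,
 * so every branch doubles the number of live paths.
 * Valid for branches < 64 (the counter is 64 bits wide). */
uint64_t states_without_pruning(unsigned branches)
{
    uint64_t live = 1, visited = 0;
    for (unsigned i = 0; i < branches; i++) {
        visited += live;   /* each live path evaluates this branch */
        live *= 2;         /* and forks into taken / not-taken */
    }
    return visited;
}

/* Same walk, but both arms of every branch reach the next branch with
 * equivalent state, so the second arrival is pruned and one path stays live. */
uint64_t states_with_pruning(unsigned branches)
{
    uint64_t visited = 0;
    for (unsigned i = 0; i < branches; i++)
        visited += 1;      /* exactly one evaluation per branch */
    return visited;
}
```

For ten branches the first function returns 1023 and the second returns 10. Real programs fall somewhere in between, depending on how often register state actually matches at a join point.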
// What triggers verifier explosion:
// deeply nested branches with different state
int bad_pattern(struct pt_regs *ctx) {
int r0 = ctx->di;
int r1 = ctx->si;
int r2 = ctx->dx;
// Each branch forks the verifier state. Ten independent branches one after
// another can yield up to 2^10 paths; the three levels below only show the shape.
// If registers hold different bounds on each path, no pruning happens.
if (r0 > 0) {
if (r1 > 0) {
if (r2 > 0) { /* ... */ }
}
}
return 0;
}
// Writing verifier-friendly code:
// Keep branches shallow. Use bounded loops. Avoid complex pointer chains.
// When you need a lookup table, use an array map instead of if/else chains.
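/* The verifier log shows register state at the rejected instruction */
$ bpftool -d prog load prog.bpf.o /sys/fs/bpf/prog   # -d prints libbpf debug output and the verifier log
# inspect a program that did load:
$ bpftool prog dump xlated id PROG_ID
# runtime statistics (run_time_ns, run_cnt):
$ echo 1 > /proc/sys/kernel/bpf_stats_enabled
$ bpftool prog show id PROG_ID
Program types and hooks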
Each program type attaches to a specific kernel hook point. The program type determines
which context struct is available (e.g., struct pt_regs * for kprobes,
struct xdp_md * for XDP), which helpers are legal, and what the return
value means. Picking the wrong program type for your goal is a common mistake.
kprobe / kretprobe — kernel function entry/exit
// Attach to __netif_receive_skb_core entry (packets entering the networking stack)
$ bpftrace -e 'kprobe:__netif_receive_skb_core { @[comm] = count(); }'
// BCC Python equivalent
from bcc import BPF
program = r"""
#include <linux/sched.h>
int trace_rx(struct pt_regs *ctx) {
    char comm[TASK_COMM_LEN];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("packet from %s\n", comm);
    return 0;
}
"""
b = BPF(text=program)
b.attach_kprobe(event="__netif_receive_skb_core", fn_name="trace_rx")
b.trace_print()
// Cost: ~1 µs per invocation. A kprobe can also attach at an offset inside
// a function, not only at its entry.
// Caveat: kprobes attach to symbols, so an inlined or renamed function cannot be probed.
tracepoint — stable kernel ABI events
// tracepoints live at /sys/kernel/debug/tracing/events/
// They are a stable interface: event names and fields rarely change between releases.
// kprobes can break across kernel versions; tracepoints are maintained to avoid that.
$ ls /sys/kernel/debug/tracing/events/
block/ ext4/ jbd2/ sched/ syscalls/ tcp/ ...
// Find sys_enter_write tracepoint:
$ ls /sys/kernel/debug/tracing/events/syscalls/sys_enter_write/
format id filter enable
// Attach with bpftrace:
$ bpftrace -e 'tracepoint:syscalls:sys_enter_write {
printf("PID %d (%s) write(%d, %p, %d)\n",
pid, comm, args->fd, args->buf, args->count);
}'
// Lower overhead than kprobe: tracepoints are statically placed
// in the kernel at safe points. ~500 ns vs ~1000 ns per invocation.
XDP — network driver, before sk_buff allocation
XDP (eXpress Data Path) runs at the earliest possible point in the networking stack:
when the NIC driver receives a packet, before the kernel allocates an sk_buff.
Your program receives a struct xdp_md * context that exposes the raw packet buffer,
and it can redirect, drop, or pass the packet. The performance difference is dramatic:
at 10 Gbps line rate with minimum-size 64-byte frames, you need to process
14.88M packets/second, so any per-packet overhead matters.
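The packet rate quoted above follows from Ethernet framing arithmetic. A small sketch in plain C, assuming minimum-size 64-byte frames and the standard 20 bytes of per-frame wire overhead (preamble, start delimiter, inter-frame gap); the function names are illustrative only:

```c
#include <stdint.h>

/* Per-frame wire overhead on Ethernet: 7-byte preamble + 1-byte start
 * delimiter + 12-byte inter-frame gap = 20 bytes on top of the frame. */
#define WIRE_OVERHEAD_BYTES 20u

/* Maximum frames per second a link can carry at a given frame size. */
uint64_t packets_per_second(uint64_t link_bps, uint32_t frame_bytes)
{
    uint64_t bits_per_frame = (uint64_t)(frame_bytes + WIRE_OVERHEAD_BYTES) * 8;
    return link_bps / bits_per_frame;
}

/* Time available to handle one packet if a single core takes the full rate. */
double ns_per_packet(uint64_t pps)
{
    return 1e9 / (double)pps;
}
```

packets_per_second(10000000000ULL, 64) returns 14880952, which leaves roughly 67 ns per packet on a single core. That budget is why avoiding the sk_buff allocation matters.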
// XDP program attached to a netdev
SEC("xdp")
int xdp_drop_tcp(struct xdp_md *ctx)
{
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end) // bounds check first
return XDP_PASS;
if (eth->h_proto == bpf_htons(ETH_P_IP)) { // bpf_htons() comes from <bpf/bpf_endian.h>
struct iphdr *ip = data + sizeof(struct ethhdr);
if ((void *)(ip + 1) > data_end)
return XDP_PASS;
if (ip->protocol == IPPROTO_TCP)
return XDP_DROP; // drop TCP packets
}
return XDP_PASS;
}
// Return actions:
// XDP_PASS — hand to normal kernel networking stack
// XDP_DROP — discard the packet (DDoS scrubbing, firewall)
// XDP_REDIRECT — send to another netdev or cpumap
// XDP_TX — transmit back out the interface it arrived on (load balancing, reflection)
// tc (traffic control) hooks run later, after the sk_buff is allocated,
// so you also get skb metadata (mark, priority, ifindex) and an egress hook.
SEC("tc")
int tc_skb(struct __sk_buff *skb)
{
    // Redirect or mirror here with bpf_redirect() / bpf_clone_redirect().
    return TC_ACT_OK; // let the packet continue
}
uprobe / uretprobe — userspace function hooks
// Attach to a userspace function in a binary or shared library, by file path
$ bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:close {
printf("close() called\n");
}'
// uretprobe gets the return value:
$ bpftrace -e 'uretprobe:/bin/bash:readline {
printf("user typed: %s\n", str(retval));
}'
// Common use: instrument Redis, Postgres, or any app without
// touching its code. Works for any function whose symbol is present in the binary or library.
// An offset lets you hook a specific instruction inside a function.
sock_ops — socket-level optimization
// sock_ops programs attach to cgroups and run on TCP socket events.
// Cilium uses this together with sockmap to accelerate pod-to-pod traffic,
// bypassing the iptables/IPVS pipeline entirely.
SEC("sockops")
int sock_ops_policy(struct bpf_sock_ops *ctx)
{
    // ctx->op: BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, etc.
    // ctx->family: AF_INET / AF_INET6
    // ctx->remote_port, remote_ip4, local_ip4
    // Can call bpf_setsockopt() to tune TCP params (buf sizes, congestion algo)
    if (ctx->family == AF_INET6) {
        // example tuning for IPv6 flows: enlarge the send buffer
        int sndbuf = 1 << 20;
        bpf_setsockopt(ctx, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    }
    return 1; // 1 = success (kernel samples return 1); this is not a packet verdict
}
Maps: the data plane
eBPF programs can't allocate memory dynamically (no malloc, no
kmalloc). State persists between invocations through maps — keyed
data structures managed by the kernel and accessed via helpers. Maps are the
plumbing between your eBPF programs and user-space control plane.
Creating maps with libbpf
#include <bpf/bpf_helpers.h>
// Declare a map. libbpf reads this definition from the .maps section and creates the map at load time
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1024);
__type(key, __u32);
__type(value, struct stats);
} stats_map SEC(".maps");
// Create from the command line:
$ bpftool map create /sys/fs/bpf/stats_map type hash key 4 value 48 entries 1024 name stats_map
// Or via the bpf() syscall directly (glibc provides no wrapper):
int map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); // attr specifies type, sizes, flags
BPF_MAP_TYPE_HASH — general key/value
// Key: process PID. Value: count of syscalls.
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10000);
__type(key, __u32);
__type(value, __u64);
} pid_count SEC(".maps");
SEC("tracepoint/raw_syscalls/sys_enter")
int count_syscalls(struct trace_event_raw_sys_enter *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32; // upper 32 bits: tgid (the user-space PID)
    __u64 *cnt = bpf_map_lookup_elem(&pid_count, &pid);
    if (cnt) {
        __sync_fetch_and_add(cnt, 1); // atomic increment
    } else {
        __u64 one = 1;
        // first sighting; a racing CPU may overwrite this and lose one count
        bpf_map_update_elem(&pid_count, &pid, &one, BPF_ANY);
    }
    return 0;
}
BPF_MAP_TYPE_ARRAY — indexed, efficient for ring buffers
// Arrays are indexed by integer, faster than hash for sequential access.
// Good for per-CPU counters, event ring buffers, jump tables.
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 256);
__type(key, __u32);
__type(value, struct event);
} event_array SEC(".maps");
// Batch operations (user space, libbpf, Linux 5.6+) are faster for bulk processing:
bpf_map_update_batch(map_fd, keys, values, &count, &opts); // opts: struct bpf_map_batch_opts
BPF_MAP_TYPE_PERCPU_HASH / PERCPU_ARRAY — avoid atomic contention
// Per-CPU maps: each CPU has its own copy of every entry.
// No locking needed — each CPU writes to its own copy.
// At aggregation time (user space), sum across CPUs.
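The user-space aggregation step can be sketched in plain C. This assumes the per-CPU values have already been read into an array with one slot per possible CPU, which is the shape libbpf hands back for a per-CPU map lookup (sized by libbpf_num_possible_cpus()). The function name sum_percpu is hypothetical, and a real value struct would be summed field by field.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum one per-CPU map entry. `per_cpu` stands in for the buffer that a
 * user-space lookup fills: one counter slot per possible CPU. */
uint64_t sum_percpu(const uint64_t *per_cpu, size_t ncpus)
{
    uint64_t total = 0;
    for (size_t cpu = 0; cpu < ncpus; cpu++)
        total += per_cpu[cpu];   /* CPUs that never ran the program contribute 0 */
    return total;
}
```

The sum is a snapshot: CPUs keep updating their own copies while user space reads, so totals are approximate at any instant but never lose increments.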
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
__uint(max_entries, 1024);
__type(key, __u64);
__type(value, struct percpu_counter);
} percpu_stats SEC(".maps");
// In the eBPF program: a plain lookup returns the current CPU's copy, no atomics needed
struct percpu_counter *c = bpf_map_lookup_elem(&percpu_stats, &key);
// bpf_map_lookup_percpu_elem(map, key, cpu), on newer kernels, reads another CPU's copy.
// Aggregation in user space:
// for each cpu: total += cpu_value[cpu]
BPF_MAP_TYPE_LRU_HASH — automatic eviction under pressure
// LRU map evicts least-recently-used entries when full.
// No external garbage collection needed — good for tracking
// short-lived connections, recent IPs, sliding windows.
// Without LRU, updates to a full hash map fail with -E2BIG until entries are deleted.
struct {
__uint(type, BPF_MAP_TYPE_LRU_HASH);
__uint(max_entries, 65536);
__type(key, struct flow_key); // src_ip, dst_ip, src_port, dst_port, proto
__type(value, struct flow_stats);
} active_flows SEC(".maps");
// If you're in an environment without explicit eviction (no user space
// polling), LRU prevents the map from ever filling and blocking inserts.
BPF_MAP_TYPE_RINGBUF — high-throughput event streaming
// Ring buffer (5.8+) is preferred over perf buffer for most use cases.
// Advantages: single ring (vs per-CPU perf buffer), variable-length records,
// more efficient memory use, memory-mapped for consumer.
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // size in bytes, power of 2
} events SEC(".maps");
SEC("tracepoint/syscalls/sys_enter_write")
int record_write(struct trace_event_raw_sys_enter *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0; // ring full: drop this event
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->fd = ctx->args[0];
    e->count = ctx->args[2];
    bpf_ringbuf_submit(e, 0); // publish the reserved record to the consumer
    return 0;
}
// Consumer in user space (libbpf): ring_buffer__new() mmaps the ring,
// ring_buffer__poll() waits on the fd and invokes a callback per record.
int fd = bpf_map_get_fd_by_id(map_id);
BPF_MAP_TYPE_PROG_ARRAY — tail calls
// Prog array holds references to other eBPF programs.
// bpf_tail_call(ctx, &prog_array, index) jumps to that program.
// The new program replaces the current one: same context and maps, but none of
// the caller's stack or register state, and control never returns to the caller.
struct {
__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
__uint(max_entries, 8);
__type(key, __u32);
__type(value, __u32); // file descriptors of other programs
} tail_calls SEC(".maps");
// In program A:
SEC("xdp")
int handle_packet(struct xdp_md *ctx)
{
// classify by size
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
__u64 pkt_size = data_end - data;
__u32 index = pkt_size > 1024 ? 1 : 0; // select sub-handler
bpf_tail_call(ctx, &tail_calls, index);
return XDP_PASS; // reached only if the tail call fails (empty slot or chain limit)
}
// User space loads the two sub-handlers into the map at indices 0 and 1 beforehand.
// Chain depth limit: 33 (MAX_TAIL_CALL_CNT)
// Each tail call: ~50-100 ns overhead
The toolchains
Three generations of eBPF tooling exist today, each suited to different use cases. The ecosystem has consolidated around libbpf + CO-RE for production; bpftrace for ad-hoc debugging; BCC for transitional tooling that needs to run on older kernels.
BCC — embedded LLVM, C in Python/Lua templates
# BCC: compile C code at runtime via embedded LLVM, call from Python/Lua
from time import sleep
from bcc import BPF
program = r"""
#include <uapi/linux/ptrace.h>
struct key_t {
    u32 pid;
    char comm[16];
};
BPF_HASH(counts, struct key_t);
int do_entry(struct pt_regs *ctx) {
    struct key_t key = {};
    key.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    counts.increment(key);
    return 0;
}
"""
b = BPF(text=program)
b.attach_uprobe(name="c", sym="malloc", fn_name="do_entry")
print("Tracing malloc() calls... Ctrl-C to print counts")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
# Dump the per-process counters collected in the hash map
for k, v in b["counts"].items():
    print(k.pid, k.comm.decode(), v.value)
BCC embeds a full LLVM/Clang in the process that compiles the C program at runtime. This makes distribution painful (the target machine needs LLVM libraries and kernel headers) but is very flexible. BCC is being gradually replaced by libbpf + CO-RE in production.
bpftrace — one-liners for ad-hoc investigation
# bpftrace: D-language-like syntax, no compilation needed
# Great for live debugging on a production machine
# Count system calls by process name:
$ bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[comm]++; }'
# Watch all file opens by a specific UID:
$ bpftrace -e 'tracepoint:syscalls:sys_enter_openat /uid == 1000/ {
printf("%s opened %s\n", comm, str(args->filename));
}'
# Profile CPU by user-space stack trace (needs frame pointers):
$ bpftrace -e 'profile:hz:99 /pid == 1234/ { @[ustack] = count(); }'
# Measure TCP retransmit frequency:
$ bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @retransmits = count(); }'
# What processes are calling execve most (maps print on Ctrl-C):
$ bpftrace -e 'tracepoint:syscalls:sys_enter_execve { @[comm]++; }'
libbpf + CO-RE — production deployment
// libbpf + CO-RE: compile once, run everywhere
// clang -target bpf -O2 -g -c prog.bpf.c
// Generates .o with BTF debug info embedded.
// prog.bpf.c — the skeleton approach (preferred):
#include "vmlinux.h" // kernel types: bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
struct conn_stats {
    __u64 retransmits;
};
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, __u64);
    __type(value, struct conn_stats);
} conn_stats SEC(".maps");
SEC("tracepoint/tcp/tcp_retransmit_skb")
int on_retransmit(struct trace_event_raw_tcp_event_sk_skb *ctx)
{
    __u64 sock = (__u64)ctx->skaddr; // socket address as key; the field layout comes from BTF
    struct conn_stats *s = bpf_map_lookup_elem(&conn_stats, &sock);
    if (s) {
        __sync_fetch_and_add(&s->retransmits, 1);
    }
    return 0;
}
char LICENSE[] SEC("license") = "GPL";
// In user-space loader (loader.c):
#include <unistd.h>
#include "prog.skel.h" // generated by: bpftool gen skeleton prog.bpf.o > prog.skel.h
int main() {
    struct prog_bpf *skel = prog_bpf__open();
    if (!skel) return 1;
    if (prog_bpf__load(skel)) return 1; // CO-RE patches struct offsets here
    if (prog_bpf__attach(skel)) return 1;
    while (1) { sleep(1); } // stay attached; read results through skel->maps.conn_stats
    return 0;
}
// Build and run:
// $ clang -target bpf -O2 -g -c prog.bpf.c -o prog.bpf.o
// $ bpftool gen skeleton prog.bpf.o > prog.skel.h
// $ clang -o loader loader.c -lbpf -lelf -lz
// $ sudo ./loader
// The compiled loader is architecture-specific; the eBPF object is portable across kernels.
Real-world deployments
Cilium — kube-proxy replacement, pod networking
Cilium replaces the iptables-based Kubernetes kube-proxy with eBPF
programs that redirect pod-to-pod traffic at the socket layer, bypassing the
netfilter/iptables pipeline entirely. At 10,000 services, iptables lookup time
grows as O(n). Cilium's eBPF hash-map lookups are O(1). Pod-to-pod latency drops
from ~100 µs to ~20 µs. Cilium's datapath also uses the sockmap to redirect
connections through a side-car proxy without changing application code.
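The O(n) versus O(1) contrast can be illustrated with a toy model in plain C. This is neither iptables nor Cilium code: the rule list, the open-addressing table, the hash function, and every name below are assumptions made for the sketch. Each lookup reports how many entries it examined.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SERVICES 10000u
#define TABLE_SLOTS  16384u   /* power of two, larger than MAX_SERVICES */

struct service { uint32_t vip; uint16_t port; uint32_t backend; int used; };

static struct service rules[MAX_SERVICES];   /* ordered rule list, scanned linearly */
static size_t rule_count;
static struct service table[TABLE_SLOTS];    /* hash table keyed by (vip, port) */

static uint32_t hash_key(uint32_t vip, uint16_t port)
{
    uint32_t h = (vip * 2654435761u) ^ port;  /* illustrative multiplicative hash */
    return h & (TABLE_SLOTS - 1);
}

int add_service(uint32_t vip, uint16_t port, uint32_t backend)
{
    if (rule_count >= MAX_SERVICES)
        return -1;
    struct service s = { vip, port, backend, 1 };
    rules[rule_count++] = s;
    uint32_t slot = hash_key(vip, port);
    while (table[slot].used)                  /* linear probing on collision */
        slot = (slot + 1) & (TABLE_SLOTS - 1);
    table[slot] = s;
    return 0;
}

/* Fill both structures with n sequential services (backend ids start at 1). */
void load_services(size_t n)
{
    memset(rules, 0, sizeof rules);
    memset(table, 0, sizeof table);
    rule_count = 0;
    for (size_t i = 0; i < n; i++)
        add_service(0x0A000000u + (uint32_t)i, 80, (uint32_t)i + 1);
}

/* Both lookups return the backend id (0 if absent); *steps counts entries examined. */
uint32_t lookup_linear(uint32_t vip, uint16_t port, size_t *steps)
{
    *steps = 0;
    for (size_t i = 0; i < rule_count; i++) {
        (*steps)++;
        if (rules[i].vip == vip && rules[i].port == port)
            return rules[i].backend;
    }
    return 0;
}

uint32_t lookup_hash(uint32_t vip, uint16_t port, size_t *steps)
{
    *steps = 0;
    uint32_t slot = hash_key(vip, port);
    while (table[slot].used) {
        (*steps)++;
        if (table[slot].vip == vip && table[slot].port == port)
            return table[slot].backend;
        slot = (slot + 1) & (TABLE_SLOTS - 1);
    }
    return 0;
}
```

After load_services(10000), finding the last service takes 10,000 comparisons in the linear list and a handful at most in the hash table. The gap widens with every service added.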
Meta Katran — XDP load balancer
Katran is Meta's open-source L4 load balancer, deployed at facebook.com to distribute traffic across thousands of servers. Katran uses XDP to process packets at the NIC driver level, before the kernel's networking stack even sees them. It uses RSS (Receive Side Scaling) to spread traffic across CPU cores and a BPF program for per-flow consistent hashing, so packets from the same flow always go to the same backend. Achieves ~100 Gbps per server with sub-microsecond per-packet overhead.
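The per-flow property can be sketched in plain C. The text does not say which hashing scheme Katran uses, so this sketch swaps in rendezvous (highest-random-weight) hashing with FNV-1a as a simple stand-in: every backend gets a score for the flow and the highest score wins. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a over a byte range, continuing from hash state h. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/* Score of one backend for one flow. Fields are hashed one by one
 * so struct padding bytes never enter the hash. */
static uint64_t score(const struct flow *f, uint32_t backend_id)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    h = fnv1a(h, &f->src_ip, sizeof f->src_ip);
    h = fnv1a(h, &f->dst_ip, sizeof f->dst_ip);
    h = fnv1a(h, &f->src_port, sizeof f->src_port);
    h = fnv1a(h, &f->dst_port, sizeof f->dst_port);
    h = fnv1a(h, &f->proto, sizeof f->proto);
    h = fnv1a(h, &backend_id, sizeof backend_id);
    return h;
}

/* Rendezvous hashing: the backend with the highest score wins.
 * Returns 0 when the backend list is empty. */
uint32_t pick_backend(const struct flow *f, const uint32_t *backends, size_t n)
{
    uint32_t best = 0;
    uint64_t best_score = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = score(f, backends[i]);
        if (i == 0 || s > best_score) {
            best = backends[i];
            best_score = s;
        }
    }
    return best;
}
```

The same 5-tuple always maps to the same backend, and removing a backend only remaps the flows that were assigned to it. This version is O(n) per packet; production load balancers typically precompute a lookup table so the per-packet cost stays constant.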
Pixie — auto-instrumentation for Kubernetes
Pixie instruments Kubernetes applications without requiring code changes or sidecars. It uses uprobes to attach to the Go runtime's HTTP handling functions and read request/response headers directly off the socket. This gives you distributed tracing, application metrics, and profiling data from every pod without any agent in the application container. Pixie's data collector, Stirling, reads HTTP/gRPC/AMQP payloads at uprobe points in the application's protocol handlers.
Tetragon / Falco — runtime security via syscall tracing
Falco, a CNCF project originally created by Sysdig, traditionally used a kernel module to hook syscalls and now also ships eBPF probes. Tetragon, from Isovalent and the Cilium project, was built on eBPF from the start: its programs attach to kernel hook points such as raw_syscalls:sys_enter and raw_syscalls:sys_exit and evaluate a rules engine to detect suspicious behavior: crypto miners, container escape attempts, privilege escalation. The eBPF program runs continuously and writes security events to a ring buffer; user space receives them and enforces policy. Because it's eBPF, the detection runs in the kernel and is fast enough to not miss syscalls even under heavy load.
Tradeoffs
- Kernel-level speed — programs run at line rate with no userspace round-trips
- Verifier guarantees no kernel crashes (unlike kernel modules)
- Dynamic load/unload — no reboot required, no module signing headaches
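- CO-RE makes programs portable across kernels that expose BTF (recent mainline kernels built with CONFIG_DEBUG_INFO_BTF, plus distribution backports as far back as 4.18)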
- Low overhead: XDP processes packets before the kernel even allocates an sk_buff
- Verifier complexity: complex pointer chasing or deep loops are rejected; writing "verifier-friendly" code requires learning idioms
- The 1M instruction limit (instructions processed by the verifier across all paths) sounds large but is limiting for complex algorithms (sorting, complex parsing)
- Loading requires CAP_BPF + CAP_PERFMON (or CAP_SYS_ADMIN) — not available to unprivileged users
- Kernel struct layout changes between versions — CO-RE helps but not every program works on every kernel
- Debugging is hard — verifier rejection messages are cryptic, tracing JIT-compiled programs needs bpftool
Frequently Asked Questions
How much overhead does an eBPF program add?
Depends on hook and program. A simple kprobe adds ~1 µs per hit, paid on every call to the probed function. XDP programs can sustain 10-100 Gbps when RSS (Receive Side Scaling) spreads packets across cores. Sampling profilers at 99 Hz add <1% CPU overhead. High-frequency tracepoints (hundreds of thousands of events per second) will show up as measurable cost. The key is choosing the right hook: tracepoints are lower overhead than kprobes; socket filters in the cgroup layer are cheaper still.
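A back-of-the-envelope estimate helps decide whether a hook is affordable: multiply the event rate by the per-event cost. The sketch below uses the ~1 µs kprobe figure from the answer above; the event rates are made-up inputs and the function name is illustrative.

```c
/* Fraction of one CPU core spent on probe overhead alone.
 * events_per_sec: how often the hook fires; cost_ns: cost of one invocation. */
double probe_cpu_fraction(double events_per_sec, double cost_ns)
{
    return events_per_sec * cost_ns / 1e9;
}
```

At 200,000 events per second and 1,000 ns each, probe_cpu_fraction returns 0.2, i.e. 20% of a core before the program does any real work. At 1,000 events per second the same probe costs 0.1% of a core, which is why hook choice matters more than micro-optimizing the program body.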
Can eBPF programs crash the kernel?
By design, no: the verifier rejects any program that could crash the kernel before it runs. It enforces no NULL dereferences, bounded loops, valid memory accesses within stack/map/packet bounds, and correct helper usage. But the verifier is software and has had bugs (e.g., CVE-2021-3490, an ALU32 bounds-tracking flaw that allowed out-of-bounds access to kernel memory). eBPF's attack surface is drastically smaller than kernel modules, but 'no kernel modules required' doesn't mean 'no kernel bugs'. Production deployments keep kernels patched and follow the bpf mailing list and their distribution's security advisories for verifier CVEs.
What's the difference between eBPF and DTrace?
DTrace originated on Solaris (2004) and was later ported to macOS and FreeBSD. It has a high-level D language for asking arbitrary questions about kernel state. Linux's eBPF is more general-purpose: it covers networking, security, and observability, not just tracing. Linux also has perf, ftrace, and tracepoints as lower-level primitives that eBPF builds on top of. bpftrace explicitly models DTrace's D language for ad-hoc one-liners. The key practical difference: DTrace is a kernel-level facility with a script front-end; eBPF is a programmable hook system that production tools (Cilium, Katran, Pixie, Falco) build real systems on.
Why not just write a kernel module?
Kernel modules are unsafe: one bug can panic the kernel. They must be compiled against the exact kernel version (symbol addresses change between releases). Distribution is painful: a .ko built on Ubuntu likely doesn't load on Fedora. Loading requires root, and an out-of-tree module marks the kernel as 'tainted', which can void vendor support. eBPF programs are sandboxed by the verifier (it rejects programs that could crash the kernel), dynamically loaded and unloaded without reboot, and CO-RE lets a single binary run across kernel versions because BTF describes the running kernel's struct layout and the loader patches the program at load time.
What is CO-RE and why does it matter?
CO-RE (Compile Once, Run Everywhere) solves the kernel version problem. Before CO-RE, you'd compile your eBPF program against one kernel's headers, and if a struct layout differed on the target (e.g., a field had been added to struct sk_buff), the program would read the wrong offsets or fail to load. CO-RE relies on BTF (BPF Type Format) in both the kernel and the compiled object: BTF describes every struct field and its offset. At load time, libbpf reads the running kernel's BTF, compares it to what the program was compiled against, and patches the bytecode to correct for any offset differences. You ship one binary and it works across kernels that expose BTF.
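The load-time patching step can be modeled in a few lines of plain C. This is a toy, not libbpf's implementation: a name-to-offset table stands in for the kernel's BTF, a small record stands in for a relocation, and all type and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for BTF: (field name, byte offset) pairs for one struct
 * as laid out on the target kernel. */
struct field_info { const char *name; size_t offset; };

/* Toy stand-in for a CO-RE relocation: which field an instruction reads,
 * and the offset that was baked in at compile time. */
struct reloc { const char *field; size_t offset; };

/* Rewrite each relocation to the target kernel's offset.
 * Returns 0 on success, -1 if a field does not exist on the target. */
int relocate(struct reloc *relocs, size_t nrelocs,
             const struct field_info *target, size_t nfields)
{
    for (size_t i = 0; i < nrelocs; i++) {
        size_t j;
        for (j = 0; j < nfields; j++) {
            if (strcmp(relocs[i].field, target[j].name) == 0) {
                relocs[i].offset = target[j].offset;  /* patch in the real offset */
                break;
            }
        }
        if (j == nfields)
            return -1;   /* field missing on this kernel: refuse to load */
    }
    return 0;
}
```

If the program was compiled with pid at offset 8 and the target kernel has it at offset 16, relocate() rewrites the record to 16. A field that no longer exists produces an error at load time rather than a silent wrong read at run time.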
How do tail calls differ from function calls?
A normal BPF-to-BPF function call (BPF_CALL) behaves like a regular C call: control transfers within the same program, the callee gets its own stack frame (all frames share the 512-byte stack budget), and execution returns to the caller. A tail call (bpf_tail_call()) replaces the current program entirely with a different one loaded into a prog_array map. The new program starts fresh: it receives the same context and can use the same maps, but sees none of the caller's stack or registers, and control never returns. Tail calls let you chain programs to exceed the 1M instruction limit or compose programs written by different teams. The chain depth is limited to 33 (MAX_TAIL_CALL_CNT). Each hop in the chain costs an extra ~50-100 ns. Tail calls cannot pass values on the stack between them; if you need data flow, use maps.
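The chain-limit behavior can be modeled with function pointers in plain C. This is a toy runner, not how the kernel dispatches programs: each "program" sets a fall-through verdict and optionally requests a tail call, and the runner refuses the request once 33 tail calls have been made. All names are hypothetical.

```c
#include <stddef.h>

#define MAX_TAIL_CALLS 33            /* chain limit quoted in the text */
#define NO_TAIL_CALL ((unsigned)-1)

struct ctx { unsigned next; int verdict; };
typedef void (*prog_fn)(struct ctx *);

/* Demo programs: set the fall-through verdict, then optionally request a jump. */
static void spin(struct ctx *c)     { c->verdict = 1; c->next = 0; } /* tail-calls itself */
static void classify(struct ctx *c) { c->verdict = 0; c->next = 2; } /* hands off to leaf */
static void leaf(struct ctx *c)     { c->verdict = 2; }              /* final handler */

prog_fn demo_progs[3] = { spin, classify, leaf };

/* Run a chain starting at `entry` (must be a valid, non-empty slot).
 * A tail call replaces the running program; when the slot is invalid or
 * the limit is reached, the requesting program's own verdict stands. */
int run_chain(const prog_fn *progs, size_t nprogs, unsigned entry, int *tail_calls_made)
{
    struct ctx c = { NO_TAIL_CALL, 0 };
    unsigned cur = entry;
    int made = 0;
    for (;;) {
        c.next = NO_TAIL_CALL;
        progs[cur](&c);
        if (c.next == NO_TAIL_CALL || c.next >= nprogs ||
            progs[c.next] == NULL || made >= MAX_TAIL_CALLS)
            break;                   /* no jump: fall through with current verdict */
        cur = c.next;
        made++;
    }
    *tail_calls_made = made;
    return c.verdict;
}
```

Starting at classify, the chain makes one tail call and returns leaf's verdict (2). Starting at spin, the runner stops after 33 tail calls and returns spin's fall-through verdict (1), mirroring how a failed bpf_tail_call() simply continues with the next instruction.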