Performance Profiling
Finding where your code burns time and memory — from ad-hoc debugging to always-on fleet-wide observability
What Is Profiling?
Profiling is systematic measurement of a program's resource consumption — where does CPU time go? Where does memory get allocated? What does a cache miss cost? Unlike logging, which tells you what happened, profiling tells you where the bottleneck is. It is the foundation of data-driven performance optimization.
The classic profiling workflow: observe a performance problem (high P99 latency, excessive CPU, OOM crashes), collect a profile, find the hot code path, understand why it's slow, fix it, verify with a follow-up profile. The last step is critical — without a before/after profile, you don't know if the fix actually helped.
Profiling is fundamentally a sampling problem. Every N milliseconds, the profiler pauses the program and records the current stack trace. Do this 10,000 times and you have a statistically meaningful picture of where time or memory was consumed. The key insight is that you don't need to capture every single function call — a representative sample is enough, because the expensive code paths appear proportionally more often.
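The sampling argument can be checked with a small simulation. This is a minimal sketch, not a real profiler: the function names and their time shares are made up, and each "sample" simply lands in a function with probability equal to that function's share of run time.

```python
import random
from collections import Counter

def sample_profile(time_shares, n_samples, seed=0):
    """Simulate a sampling profiler: each sample lands in a function with
    probability equal to that function's true share of run time."""
    rng = random.Random(seed)
    names = list(time_shares)
    weights = [time_shares[name] for name in names]
    hits = Counter(rng.choices(names, weights=weights, k=n_samples))
    return {name: hits[name] / n_samples for name in names}

# Hypothetical ground truth: fraction of run time spent in each function.
true_shares = {"parse_json": 0.60, "db_query": 0.30, "log_write": 0.10}

estimate = sample_profile(true_shares, n_samples=10_000)
for name, share in estimate.items():
    print(f"{name:12s} true={true_shares[name]:.2f} sampled={share:.3f}")
```

With 10,000 samples the estimated shares should sit within a percentage point or two of the true ones, which is why a profiler can afford to ignore the vast majority of function calls.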
There are two modes: ad-hoc profiling (SSH in, run perf record,
analyze the result) and continuous profiling (always-on agents on every host,
shipping aggregated pprof profiles to a query backend). The former is better for deep dives;
the latter is better for catching regressions across the fleet and for investigating incidents after the fact.
Types of Profiling
Different profile types answer different questions. Choosing the wrong type means you miss the actual bottleneck.
CPU Profiling
Who is burning CPU cycles? Samples are collected on-CPU only — threads that are waiting on I/O or sleeping don't appear. This is the most common profile type. The key metric is self CPU time — how much time a function spent executing its own code, excluding time spent in callees.
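To make the self-versus-total distinction concrete, here is a minimal sketch that computes both from aggregated stack samples. The input format (semicolon-joined frames from root to leaf, mapped to a sample count) and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict

def self_and_total(folded):
    """folded maps 'root;...;leaf' -> sample count.
    Self time counts samples where the function is the leaf;
    total time counts samples where it appears anywhere in the stack."""
    self_samples = defaultdict(int)
    total_samples = defaultdict(int)
    for stack, count in folded.items():
        frames = stack.split(";")
        self_samples[frames[-1]] += count
        for fn in set(frames):  # set() so recursive frames count once per sample
            total_samples[fn] += count
    return dict(self_samples), dict(total_samples)

# Hypothetical samples for a request handler.
folded = {
    "main;handle_request;json_marshal": 50,
    "main;handle_request;db_query": 30,
    "main;handle_request": 20,
}
self_samples, total_samples = self_and_total(folded)
print(self_samples)
print(total_samples)
```

Here handle_request appears in every sample but is the leaf in only 20 of 100, so its total time is 100% while its self time is 20%.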
{`# CPU profile: what is executing right now?
perf record -F 99 -a -g -- sleep 30
# At 99 Hz (roughly every 10 ms), sample every CPU and capture its stack`}
Wall-Clock Profiling
Samples every thread regardless of state: on-CPU, blocked on I/O, sleeping, or waiting on a lock. Shows the full picture of where time goes, including I/O wait and scheduler delays. Produces wider, messier flamegraphs but answers "why is this request slow?" questions that CPU profiles miss.
{`# Wall-clock: on-CPU + off-CPU time
# perf's timer sampling only sees on-CPU stacks; for the off-CPU half,
# record scheduler switches (the stack at the moment a thread blocks)
perf record -e sched:sched_switch -a -g -- sleep 30
# or via async-profiler (JVM), which samples all threads in wall mode
profiler.sh -d 30 -e wall <pid>`}
Memory / Heap Profiling
Tracks heap allocations — either live bytes (objects still in memory) or allocation rate (bytes allocated per second). Live-byte profiles reveal memory leaks and high-water marks. Allocation profiles reveal GC pressure: which call paths allocate the most, triggering more frequent garbage collection.
{`# Java allocation profile: sample roughly every 512 KB of heap allocation
profiler.sh -d 30 -e alloc --alloc 512k <pid>
# Go live heap (current in-use allocations)
curl http://localhost:6060/debug/pprof/heap`}
Mutex / Lock Profiling
Tracks lock contention: how much time threads spend blocked waiting on a mutex or RWLock. The profile shows which call paths are waiting and which hold the locks. High contention often points to overly coarse locking, or to a hot shared structure that would benefit from sharding or a lock-free design.
{`# Java lock contention
profiler.sh -d 30 -e lock <pid>
# Go mutex profile (enable sampling first with runtime.SetMutexProfileFraction)
curl http://localhost:6060/debug/pprof/mutex`}
Block / I/O Profiling
Tracks time threads spend blocked on I/O operations — disk, network, or inter-process communication. Useful for understanding when your program is waiting on external services or filesystem operations. Often paired with wall-clock profiling.
{`# Go block profile (goroutines blocking on sync primitives; enable with runtime.SetBlockProfileRate)
curl http://localhost:6060/debug/pprof/block
# Linux block I/O via perf
perf record -e block:block_rq_insert -a -g -- sleep 30`}
Goroutine Profiling
Not a performance profile per se — a snapshot of all goroutines and their stack traces. Shows what every goroutine is doing right now. Great for diagnosing goroutine leaks, runaway goroutine counts, and deadlock-adjacent situations. Think of it as a process dump that you can diff over time.
{`# Dump every goroutine's stack individually (Go)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
# Goroutine profile (binary pprof, aggregated by identical stack)
curl http://localhost:6060/debug/pprof/goroutine`}
Wall-Clock vs CPU Time
The difference between wall-clock and CPU time is fundamental. A thread that spends 900ms of its 1000ms wall-clock time blocked on a network call accrues 900ms of off-CPU time and only 100ms of CPU time. A CPU profile never sees the blocked 900ms. Consider an HTTP handler calling a downstream service. It has two distinct cost centers: the CPU needed to serialize the request and parse the response, and the wall-clock time the downstream call takes. A CPU profile only sees the first. A wall-clock profile sees both, letting you attribute the total latency correctly.
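The gap is easy to observe from inside a program. The sketch below uses Python's standard clocks, with time.sleep standing in for a blocked downstream call; the helper names are hypothetical.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) consumed by fn()."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time used by this process
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

def blocked_call():
    time.sleep(0.2)  # stands in for waiting on a downstream RPC

def busy_call():
    sum(i * i for i in range(200_000))  # pure computation

blocked_wall, blocked_cpu = measure(blocked_call)
busy_wall, busy_cpu = measure(busy_call)
print(f"blocked: wall={blocked_wall:.3f}s cpu={blocked_cpu:.3f}s")
print(f"busy:    wall={busy_wall:.3f}s cpu={busy_cpu:.3f}s")
```

The blocked call shows about 0.2 s of wall time and almost no CPU time, which is exactly the portion a CPU profile cannot see.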
{`# Example: CPU vs wall-clock attribution for an HTTP handler
# CPU profile shows: json.Marshal = 5ms, db.Query = 3ms, response.Write = 2ms
# Wall-clock profile shows: json.Marshal = 5ms, db.Query = 3ms, response.Write = 2ms, downstream RPC = 990ms
# → The real problem is the downstream call, not your code`}
Linux Profiling Tools
Linux has a rich set of profiling tools, from lightweight timer-based sampling (perf) to kernel-level stack collection (eBPF) to full dynamic binary instrumentation (valgrind). Each serves a different niche.
perf
The standard Linux profiler. Uses hardware performance counters (PMCs) and the kernel's perf_event subsystem to sample CPU stacks. Low overhead, works for any language, requires frame pointers or DWARF debug info for useful stacks.
- perf stat: count events (cycles, instructions, cache misses)
- perf record: record samples for later analysis
- perf report: text-based profile browser
- perf annotate: per-instruction assembly with source interleaved
- perf script: raw sample output for flamegraph tools
eBPF / stap
SystemTap (stap) and raw eBPF programs attach to kernel probes and can collect stack traces
from kernel context. The bcc toolkit provides ready-made tools such as profile.py
that sample stacks with eBPF at minimal overhead. eBPF-based CPU profiling requires Linux 4.9 or later.
- profile.py: bcc tool, timed CPU stack sampling
- offcputime.py: bcc tool for off-CPU time by stack
- memleak.py: bcc tool tracking outstanding allocations by call path
- biolatency.py: bcc tool summarizing block I/O latency as a histogram
gperftools (Google Performance Tools)
Google's heap and CPU profiler for C++. Works by patching malloc with an interceptor that records allocation call stacks. CPU profiler uses timer-based sampling with low overhead. Outputs pprof-compatible profiles. Good for native services where perf can't get DWARF stacks.
- CPUPROFILE: env var that enables CPU profiling
- HEAPPROFILE: env var that enables heap profiling
- pprof --text: text profile output
- pprof --gif: call graph visualization
valgrind / callgrind
Valgrind runs your program on a synthetic CPU, translating and instrumenting every instruction. callgrind is its call-graph profiling tool: deterministic, with instruction-level accuracy. It is extremely slow (often 10-100x) but complete: no sampling bias, every call counted. Best for understanding algorithmic complexity and cache behavior in small workloads.
- valgrind --tool=callgrind ./mybinary: run the program under callgrind
- callgrind_annotate: per-function instruction counts
- kcachegrind: GUI call graph explorer
perf: The Linux Profiler
perf is the built-in Linux profiler, backed by hardware performance monitoring
counters (PMCs) and the kernel perf_event API. It can sample any event the kernel knows about —
CPU cycles, instructions retired, cache misses, branch mispredictions — and combine that with
stack traces.
perf stat — Counting Events
Before profiling, start with perf stat to get an overview of what hardware events
your workload exhibits. This is non-intrusive and uses PMC hardware counters directly.
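Raw counts are hard to compare across runs, so the first step is usually to turn them into ratios. The sketch below does that for a set of hypothetical counter values, as if copied by hand from a perf stat run; the dictionary keys mirror perf's event names.

```python
def derived_metrics(counters):
    """Turn raw perf stat counts into the ratios people usually compare."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "cache_miss_rate": counters["cache-misses"] / counters["cache-references"],
        "branch_miss_rate": counters["branch-misses"] / counters["branch-instructions"],
    }

# Hypothetical counts, as if copied from a `perf stat` run.
counters = {
    "cycles": 1_234_567_890,
    "instructions": 567_890_123,
    "cache-references": 12_345_678,
    "cache-misses": 123_456,
    "branch-instructions": 98_765_432,
    "branch-misses": 1_234,
}
metrics = derived_metrics(counters)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

An IPC well below 1 on a modern core, as in this made-up run, suggests the CPU is mostly stalled (often on memory) rather than short of work, which tells you what kind of event to sample next.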
{`# Run with full event set
sudo perf stat -e cycles,instructions,cache-references,cache-misses,\
branch-instructions,branch-misses ./my_program
# Output:
# 1,234,567,890 cycles # ~1.2 GHz * 1s
# 567,890,123 instructions # ~0.46 IPC
# 12,345,678 cache-references
# 123,456 cache-misses # ~1% miss rate
# 98,765,432 branch-instructions
# 1,234 branch-misses # excellent prediction rate`}
perf record — Collecting Samples
perf record either programs a PMC to overflow after N events or, with -F, asks the
kernel to sample at a fixed frequency. On each sample the kernel interrupts the running code,
captures the instruction pointer and stack, then resumes. Stack
walking requires either frame pointers (compile with -fno-omit-frame-pointer) or
DWARF debug info.
{`# Record 60 seconds of CPU profiles at 99 Hz on all CPUs
sudo perf record -F 99 -a -g -- sleep 60
# Record specific events instead of timer ticks
sudo perf record -e cycles:u -a -g -- ./my_program
# Record with DWARF-based unwinding (for binaries built without frame pointers)
sudo perf record -F 99 -a --call-graph dwarf -- sleep 30
# Check the resulting data
perf report # interactive TUI browser
perf report --stdio # text output
perf annotate # per-instruction annotation`}
perf annotate — Reading Assembly
perf annotate takes a symbol from the profile and shows each assembly instruction
with a percentage of samples. Combined with DWARF debug info it interleaves source lines.
This is how you find the exact instruction inside a hot function.
{`perf annotate --stdio --symbol=hot_function
# │
# 0.00 │ mov %rax,%rdx
# 0.00 │ test %rax,%rax
# 78.34 │ imul %rax,%rax ← hot multiply inside loop
# 12.51 │ add $0x1,%eax
# 8.65 │ cmp $0x64,%eax
# 0.50 │ jne ..`}
perf + DWARF: When Binaries Are Stripped
If your binary was compiled without frame pointers (common with -O2) and lacks
DWARF debug info, perf record will produce stacks that end in
[unknown]. Three solutions:
- Compile with -fno-omit-frame-pointer -g in debug builds
- Install debuginfo packages on the host (debuginfo-install on RHEL/Fedora)
- Use --call-graph dwarf, which uses DWARF stack unwinding instead of frame pointers
{`# Check what your perf data has
perf report --stdio --sort symbol | head -40
# If you see [unknown], check for missing debug info
objdump -h ./mybinary | grep -i debug # does it have a .debug_info section?
readelf -S ./mybinary | grep debug # list debug sections
# Fix: install debuginfo on RHEL/Fedora
sudo debuginfo-install mypackage
# Or rebuild with frame pointers
CFLAGS="-fno-omit-frame-pointer -g" ./configure && make`}
Flame Graphs
A flame graph is a stacked bar chart where each bar represents a stack frame, the width represents the proportion of samples containing that frame, and the stacking shows the full call path. The visual encoding is deliberate: the hottest code paths produce the widest bars, immediately drawing the eye. Introduced by Brendan Gregg in 2011, flame graphs have become the de facto standard for visualizing CPU and memory profiles.
How to Generate a Flame Graph
The canonical tool is Brendan Gregg's FlameGraph suite. The workflow: collect raw stack samples, fold them into a single-line-per-stack format, render to SVG.
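The folding step is just counting identical stacks. Below is a minimal sketch of that idea, assuming the samples have already been parsed into lists of frame names ordered from root to leaf; it illustrates the format only and does not replace the FlameGraph scripts.

```python
from collections import Counter

def fold(samples):
    """Collapse raw stack samples into 'frame;frame;frame count' lines,
    one line per unique stack."""
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

# Hypothetical parsed samples (root first, leaf last).
samples = [
    ["main", "handle_request", "json_marshal"],
    ["main", "handle_request", "db_query"],
    ["main", "handle_request", "json_marshal"],
    ["main", "gc_run"],
]
folded_lines = fold(samples)
for line in folded_lines:
    print(line)
```

Because every unique stack becomes a single line, millions of samples typically fold down to a few thousand lines, which is what makes the rendering step cheap.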
{`# 1. Collect stacks with perf
sudo perf record -F 99 -a -g -- sleep 60
perf script --header > /tmp/out.stacks
# 2. Clone the FlameGraph toolkit
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
# 3. Fold stacks: many-sample-lines → one-line-per-unique-stack
./stackcollapse-perf.pl /tmp/out.stacks > /tmp/out.folded
# out.folded looks like:
# main;handle_request;db.Query;index_scan 12345
# main;handle_request;json.Marshal 7890
# main;gc_run;gcDrain 4567
# 4. Render to interactive SVG
./flamegraph.pl /tmp/out.folded > /tmp/flame.svg
# Open in a browser and click to zoom into any frame`}
Reading a Flame Graph
The horizontal axis is the sample population, not the passage of time: stacks are sorted alphabetically so that identical frames merge, and left-to-right order carries no meaning. The width of each bar is proportional to how many samples included that frame. The vertical axis is stack depth: the top edge is the leaf (the function executing when the sample was taken).
Finding the bottleneck: look for wide bars along the top edge (plateaus). A wide frame with nothing above it is a function that was itself on-CPU for many samples. A wide mid-level bar means a lot of time passes through that function, mostly in its callees, so it is not necessarily hot itself.
Colors usually carry little meaning: in Brendan Gregg's default color scheme, hues are picked at random from a warm palette to tell adjacent frames apart. In differential flame graphs, red means "got worse" and blue means "got better." Some color schemes encode the profile type (CPU = red/orange, alloc = blue, I/O = green).
Differential Flame Graphs
The most powerful flamegraph technique for performance work: compare a baseline profile to a post-fix profile. The output shows the delta: frames that grew are colored red (regression) and frames that shrank are colored blue (improvement). This avoids the trap of celebrating one hot spot that shrank while missing a bigger one that appeared.
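The comparison itself is a per-stack subtraction. The sketch below works on two folded profiles represented as dictionaries (stack string to sample count); the stacks are hypothetical, and it assumes both profiles were recorded for the same duration so raw counts are comparable.

```python
def diff_folded(before, after):
    """Return (stack, delta) pairs for stacks whose sample count changed,
    largest absolute change first. Positive delta = more samples after."""
    deltas = {}
    for stack in set(before) | set(after):
        delta = after.get(stack, 0) - before.get(stack, 0)
        if delta:
            deltas[stack] = delta
    return sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))

# Hypothetical profiles: the fix shrank json_marshal but introduced regex_compile.
before = {"main;handle;json_marshal": 700, "main;handle;db_query": 300}
after = {
    "main;handle;json_marshal": 200,
    "main;handle;db_query": 300,
    "main;handle;regex_compile": 500,
}
changes = diff_folded(before, after)
for stack, delta in changes:
    print(f"{delta:+6d} {stack}")
```

In this made-up case the optimization removed 500 samples from json_marshal, but a new 500-sample stack appeared, so the total did not move. Looking only at the "after" flame graph would make that easy to miss.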
{`# After optimizing, record a new profile
sudo perf record -F 99 -a -g -o /tmp/after.perf -- sleep 60
# Fold both (stackcollapse-perf.pl reads perf script output, not perf.data;
# the baseline /tmp/before.perf is assumed to have been recorded the same way)
perf script -i /tmp/before.perf | ./stackcollapse-perf.pl > /tmp/before.folded
perf script -i /tmp/after.perf | ./stackcollapse-perf.pl > /tmp/after.folded
# Generate differential SVG
./difffolded.pl /tmp/before.folded /tmp/after.folded \
| ./flamegraph.pl > /tmp/diff.svg
# Red = frame grew after the change, blue = frame shrank`}
Continuous Profiling: Architecture
Ad-hoc profiling means SSH'ing to a sick box, running perf record for
sixty seconds, scp'ing the trace home, and squinting at a flamegraph. By the time
you have data, the incident has moved on. Continuous profiling instead samples
every host all the time at low overhead (typically <1% CPU) and ships the
aggregated stacks to a query backend. You can ask "show me the top CPU consumer
across the fleet over the last hour, broken down by service" without ever logging
into a host.
The data model is uniform: a flamegraph is a tree of stack frames with a count (CPU samples, allocated bytes, blocked time). The wire format is Google's pprof protobuf. The collection mechanism is per-language: async-profiler for Java, perf+pprof for Linux native, runtime/pprof for Go, and increasingly eBPF for everything else.
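As a rough illustration of that data model, the sketch below merges folded stacks into a tree where every node carries the number of samples that passed through it. The node layout (name, value, children) is an assumption chosen for readability, not the schema of any particular backend.

```python
def build_tree(folded):
    """Merge 'root;...;leaf' -> count entries into a tree of frames.
    Each node's value is the number of samples that passed through it."""
    root = {"name": "all", "value": 0, "children": {}}
    for stack, count in folded.items():
        root["value"] += count
        node = root
        for frame in stack.split(";"):
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}}
            )
            child["value"] += count
            node = child
    return root

# Hypothetical folded profile.
folded = {
    "main;handle_request;json_marshal": 50,
    "main;handle_request;db_query": 30,
    "main;gc_run": 20,
}
tree = build_tree(folded)
handler = tree["children"]["main"]["children"]["handle_request"]
print(tree["value"], handler["value"], sorted(handler["children"]))
```

The same tree works whether the counts are CPU samples, allocated bytes, or blocked nanoseconds, which is why one UI can render every profile type.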
Continuous Profiling Systems
The continuous profiling ecosystem has converged on a few major systems. All speak pprof and expose flamegraph UIs; they differ in agent strategy, storage engine, and operational model.
| System | Agent Strategy | Storage | Notes |
|---|---|---|---|
| Grafana Pyroscope | Per-language SDKs + eBPF | Parquet on object store | formerly Pyroscope + Phlare merged; part of Grafana LGTM stack |
| Parca | eBPF-only (Parca Agent) | FrostDB (columnar) | Zero instrumentation required; kernel 5.x required |
| Polar Signals Cloud | Same eBPF agent as Parca | Hosted FrostDB | Founded by original Parca team; fully managed SaaS |
| Datadog Profiler | Per-language SDK in Datadog Agent | Datadog SaaS | Strong APM integration (links profiles to traces); per-host pricing |
| Profefe | Pull-based scraping of /debug/pprof | BadgerDB / S3 | Open source; self-hosted; Kubernetes-native |
| Elastic APM Profiling | eBPF + per-lang SDK | Elasticsearch | Part of Elastic APM; integrated in Kibana |
The pprof Wire Format
pprof is a Protocol Buffer format that encodes a sample-typed profile. Go's runtime/pprof package emits it natively, profilers such as async-profiler can export or be converted to it, and the continuous profiling backends above ingest it. Understanding the schema helps when debugging symbolization issues and writing custom exporters.
{`message Profile {
repeated ValueType sample_type = 1; // e.g. [["samples","count"],["cpu","nanoseconds"]]
repeated Sample sample = 2; // the actual samples
repeated Mapping mapping = 3; // /lib/libc.so loaded at 0x7f...
repeated Location location = 4; // address → function + line number
repeated Function function = 5; // name, system_name, filename
repeated string string_table= 6; // deduplicated strings
int64 time_nanos = 9; // collection start time (UNIX nanoseconds)
int64 duration_nanos = 10; // how long the profile covers
ValueType period_type = 11; // e.g. ["cpu","nanoseconds"]
int64 period = 12; // events between samples, in period_type units
}
message Sample {
repeated uint64 location_id = 1; // stack trace (leaf-first order)
repeated int64 value = 2; // one value per sample_type
repeated Label label = 3; // span_id, trace_id, pod, service_name
}
// Sample types for a Go CPU profile:
// samples/count, cpu/nanoseconds
// Sample types for a Go heap profile:
// alloc_objects/count, alloc_space/bytes (cumulative since program start)
// inuse_objects/count, inuse_space/bytes (currently live)`}
Java and .NET Profilers
Managed runtimes present unique profiling challenges: JIT-compiled code hides call frames, garbage collection introduces safepoint bias, and large heap sizes make heap profiling expensive. Specialized profilers handle these.
async-profiler (Java)
The JVM has JFR (Java Flight Recorder) built in, and JVMTI-based sampling profilers are plentiful,
but the latter suffer from safepoint bias: samples land only
at JVM safepoints where all threads are paused, systematically missing code that runs between
safepoints (tight loops, certain GC phases). async-profiler uses AsyncGetCallTrace,
an unofficial HotSpot API that walks the stack from a signal handler without waiting for a safepoint.
{`# Attach to a running JVM, sample CPU for 30 seconds
./profiler.sh -d 30 -e cpu -f /tmp/cpu.html <pid>
# Allocation profile: sample roughly every 512 KB of heap allocation
./profiler.sh -d 30 -e alloc --alloc 512k -f /tmp/alloc.html <pid>
# Lock contention: who is blocking on monitors?
./profiler.sh -d 30 -e lock -f /tmp/lock.html <pid>
# Wall-clock: where does time actually go (on + off CPU)?
./profiler.sh -d 30 -e wall -f /tmp/wall.html <pid>
# Output as pprof for upload to Pyroscope/Parca
./profiler.sh -d 30 -e cpu -o pprof -f /tmp/cpu.pb.gz <pid>
# Stack depth: cap the number of Java frames recorded per sample
./profiler.sh -d 30 -e cpu -j 64 -f /tmp/cpu.html <pid>`}
JDK Flight Recorder (JFR)
JFR is the JVM's built-in profiler. Unlike async-profiler, it runs inside the JVM and has full access to internal events (GC, class loading, JIT compilation). Its CPU samples can be less precise than async-profiler's, but it produces richer diagnostic data than CPU samples alone. For continuous profiling in production, async-profiler + Pyroscope is often the better choice. For JVM diagnostics (GC tuning, class loading storms), JFR is irreplaceable.
{`# Start a JVM with JFR enabled (continuous recording)
java -XX:StartFlightRecording=filename=recording.jfr,dumponexit=true,maxsize=256M \
-XX:FlightRecorderOptions=stackdepth=256 \
-jar myapp.jar
# Dump JFR to file via jcmd
jcmd <pid> JFR.dump filename=/tmp/recording.jfr
# Analyze with JDK Mission Control (JMC)
jmc -open recording.jfr
# Or via CLI with jfr command (JDK 14+)
jfr summary recording.jfr
jfr metadata recording.jfr`}
dotnet-trace / dotnet-counters (.NET)
.NET's diagnostic tooling is available through the dotnet-trace,
dotnet-counters, and dotnet-dump CLI utilities, which are installed as .NET global tools.
{`# Collect a CPU sampling trace for 30 seconds (duration format is dd:hh:mm:ss;
# profile names can vary between dotnet-trace versions)
dotnet-trace collect --process-id <pid> --profile cpu-sampling --duration 00:00:00:30
# Monitor runtime counters live
dotnet-counters monitor --process-id <pid> --counters 'System.Runtime[cpu-usage,gen-0-size]'
# Collect a process dump for dotnet-dump analysis
dotnet-dump collect -p <pid> -o /tmp/core_dump
# Analyze in dotnet-dump interactive CLI
dotnet-dump analyze /tmp/core_dump
# > dumpheap -stat ← object counts and total sizes by type
# > gcroot <object-address> ← find the roots keeping one object alive
# > clrthreads ← list managed threads`}
Java Profiling Labels: Linking Profiles to Requests
A key feature of continuous profiling is correlating a profile with a specific request or trace.
Agents built on async-profiler, such as the Pyroscope Java agent, can attach labels (key-value pairs)
to samples, and Pyroscope uses them to filter profiles by trace_id, endpoint, or any custom label.
{`// Java: tag samples with a label for the duration of a request.
// Sketch assuming the Pyroscope Java agent is on the classpath;
// Request, Response, and process() are placeholders for your own code.
import io.pyroscope.labels.LabelsSet;
import io.pyroscope.labels.Pyroscope;
import java.util.concurrent.atomic.AtomicReference;
public Response handleRequest(Request req) {
    AtomicReference<Response> result = new AtomicReference<>();
    // Samples taken on this thread inside the wrapper carry the label
    Pyroscope.LabelsWrapper.run(
        new LabelsSet("trace_id", req.getTraceId()),
        () -> result.set(process(req)));
    return result.get();
}
// Now in Pyroscope: filter by trace_id="abc-123"
// to see the flamegraph for exactly that slow request`}
Database Profiling: PostgreSQL and ClickHouse
Database performance problems are often invisible from the application side. A query that takes 500ms doesn't show up as a hot CPU flamegraph — it shows up as wall-clock time blocked on the network. Database-level profiling gives you the query plan and execution statistics.
PostgreSQL: pg_stat_statements and auto_explain
pg_stat_statements tracks query-level statistics across all queries, enabling
identification of the top consumers by calls, total time, and I/O. auto_explain
logs query plans for slow queries automatically.
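{`-- Both modules must be preloaded; changing shared_preload_libraries requires a restart
-- (merge with any libraries already listed there)
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements,auto_explain';
-- After the restart, create the extension (auto_explain is a module, not an extension)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Log plans for queries slower than 100 ms; this setting only needs a reload
ALTER SYSTEM SET auto_explain.log_min_duration = '100ms';
SELECT pg_reload_conf();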
-- View the top queries by total execution time
SELECT query,
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
shared_blks_hit,
shared_blks_read,
temp_blks_read,
temp_blks_written
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- Find queries that do a lot of temp disk spills
SELECT query, temp_blks_read, temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_read + temp_blks_written > 0
ORDER BY temp_blks_read + temp_blks_written DESC
LIMIT 10;`}
PostgreSQL: EXPLAIN ANALYZE
EXPLAIN ANALYZE executes the query and returns the actual execution plan with
per-node timing and row counts. Prefer it over plain EXPLAIN when diagnosing a slow query:
plain EXPLAIN shows only the planner's estimates, which can be far off when statistics are stale
or predicates are correlated. Because ANALYZE really runs the statement, wrap INSERT, UPDATE, or
DELETE in a transaction and roll it back.
{`EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, count(o.id) AS order_count, sum(o.total) AS revenue
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name
ORDER BY revenue DESC
LIMIT 20;
-- Key things to look for:
-- 1. "actual rows" vs "rows" — large discrepancy = stale statistics → ANALYZE
-- 2. "Seq Scan" on large tables — missing index
-- 3. "Sort" spilling to disk ("Sort Method: external merge  Disk: 12345kB") → raise work_mem or sort fewer rows
-- 4. "Hash Join" with "Batches" greater than 1 → the hash table did not fit in work_mem and spilled
-- 5. "Nested Loop" with a high "loops" count on the inner side → potentially O(n²) disaster`}
PostgreSQL: pg_stat_activity — Live Query Inspection
pg_stat_activity shows every active query and what it's doing right now. Use it
to find queries that are blocking, long-running, or waiting on locks.
{`-- Who is running what, right now?
SELECT pid, usename, application_name, state,
query_start, state_change,
round(extract(epoch FROM now() - query_start)::numeric, 1) AS duration_s,
wait_event_type, wait_event,
left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
AND pid != pg_backend_pid()
ORDER BY query_start;
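-- Find lock requests that are still waiting
SELECT pid, relation::regclass, mode, granted, fastpath
FROM pg_locks
WHERE NOT granted;
-- Find the sessions doing the blocking, and what they are running
SELECT blocked.pid AS blocked_pid,
       blocker.pid AS blocking_pid,
       left(blocker.query, 200) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocker
  ON blocker.pid = ANY (pg_blocking_pids(blocked.pid));`}
ClickHouse: system.query_log and query_thread_log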
ClickHouse logs every query to system.query_log with per-query metrics including
read rows, read bytes, memory usage, and query duration. It's a time-series table you can
query with SQL. The query_thread_log additionally logs per-thread statistics for
multi-threaded queries.
{`-- Top queries by peak memory (last 1 hour)
SELECT query, query_kind, length(thread_ids) AS threads,
round(memory_usage/1024/1024, 2) AS mem_mb,
round(read_rows/1e6, 2) AS rows_m,
round(read_bytes/1024/1024/1024, 2) AS gb_read,
formatDateTime(event_time, '%F %T') AS ts,
round(query_duration_ms/1000, 2) AS dur_s
FROM system.query_log
WHERE type IN ('QueryFinish', 'ExceptionWhileProcessing')
AND event_time > now() - INTERVAL 1 HOUR
AND memory_usage > 0
ORDER BY memory_usage DESC
LIMIT 20;
-- Query profile: per-thread breakdown for a specific query
SELECT thread_id, read_rows, read_bytes,
peak_memory_usage, query_duration_ms
FROM system.query_thread_log
WHERE event_date = today()
AND query LIKE '%my_slow_query_pattern%'
ORDER BY read_bytes DESC;
-- Slow queries with their ProfileEvents counters (queries over 100 ms)
SELECT query, ProfileEvents,
round(query_duration_ms/1000, 2) AS dur_s,
read_rows, result_rows
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_duration_ms > 100 -- > 100 ms
ORDER BY query_duration_ms DESC
LIMIT 10;`}
Profiling in Kubernetes
Kubernetes environments add a layer of indirection: the process you want to profile is inside a container, inside a pod, on a node you may not have direct access to. There are three main approaches.
Approach 1: kubectl exec into the Pod
The most straightforward approach: exec into the container and run profiling tools directly. Works when the container image includes the profiler binaries or you can install them.
{`# Find the pod
kubectl get pods -n production | grep my-service
# NAME READY STATUS RESTARTS AGE
# my-service-5f4b8c9-x2k7 2/2 Running 0 5d
# Exec in and profile
kubectl exec -it -n production my-service-5f4b8c9-x2k7 -- sh
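# Inside the container (sampling usually needs CAP_PERFMON or CAP_SYS_ADMIN,
# or a relaxed kernel.perf_event_paranoid setting on the node):
apk add perf # if Alpine-based
apt-get update && apt-get install -y linux-perf # if Debian-based
# Record CPU for 30 seconds
perf record -F 99 -a -g -o /tmp/perf.data -- sleep 30
# Exit and copy out the perf.data
kubectl cp production/my-service-5f4b8c9-x2k7:/tmp/perf.data ./perf.data`}
Approach 2: Profile Sidecar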
Deploy a profiling sidecar alongside your application container. The sidecar runs the profiler and ships the resulting data to a storage backend. This approach doesn't require modifying your application container and works for any language.
{`# Example: Pyroscope sidecar in Kubernetes
# (illustrative; the agent image and command depend on your Pyroscope version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      shareProcessNamespace: true   # lets the sidecar see the app's processes
      containers:
        - name: app
          image: myapp:latest
          env:
            - name: PYROSCOPE_SERVER_ADDRESS
              value: "http://pyroscope:4040"
            - name: PYROSCOPE_APPLICATION_NAME
              value: "my-service"
            - name: PYROSCOPE_PROFILER_ALLOC
              value: "true"
            - name: PYROSCOPE_PROFILER_CPU
              value: "true"
        - name: pyroscope-agent   # sidecar: eBPF profiler
          image: pyroscope/pyroscope:latest
          args:
            - /bin/pyroscope
            - exec
            - python3,/myapp/main.py   # profile the main process
          env:
            - name: PYROSCOPE_SERVER_ADDRESS
              value: "http://pyroscope:4040"`}
Approach 3: Host PID Namespace Access
For node-level profiling (e.g., profiling a process that runs as a Kubernetes system service), use a privileged DaemonSet with access to the host's PID namespace.
{`# Privileged profiling DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: perf-agent
spec:
  selector:
    matchLabels:
      app: perf-agent
  template:
    metadata:
      labels:
        app: perf-agent
    spec:
      hostPID: true
      containers:
        - name: perf
          image: alpine:latest
          command: ["sh", "-c", "while true; do sleep 3600; done"]
          securityContext:
            privileged: true   # implies all capabilities, including SYS_ADMIN
          volumeMounts:
            - name: src
              mountPath: /debug
      volumes:
        - name: src
          hostPath:
            path: /sys/kernel/debug`}
Allocation Profiling
Allocation profiling tells you which call paths allocate the most memory. This is distinct from a heap dump: it measures allocation rate, not the current live set. The key question it answers is "where is GC pressure coming from?" Reducing allocation rate is usually one of the most effective ways to lower GC overhead in managed languages.
tcmalloc Heap Checker (C++, Go)
gperftools' tcmalloc includes a heap checker that records live allocations at a point in time, and a heap profiler that tracks allocation rates. Go's runtime uses a tcmalloc-inspired allocator, and Go's pprof heap endpoint exposes similar data.
{`# Go: heap profile (current live allocations)
curl -s http://localhost:6060/debug/pprof/heap > heap.pb
# Go: delta heap profile over a 30-second window
# (the handler takes two snapshots 30 s apart and returns the difference)
curl -s 'http://localhost:6060/debug/pprof/heap?seconds=30' > heap_delta.pb
# pprof text output
go tool pprof -text heap.pb
# Showing nodes accounting for 100% of 45.28MB of allocation space
# File: heap.go
# flat flat% sum% cum cum%
# 23.50MB 51.9% 51.9% 23.50MB 51.9% main.makeLargeAlloc
# 12.30MB 27.2% 79.1% 12.30MB 27.2% main.processOrders
# 5.20MB 11.5% 90.6% 8.10MB 17.9% main.serializeResponse
# ...
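# Go: allocation profile (cumulative bytes allocated since program start)
# Sampling is governed by runtime.MemProfileRate (default: one sample per 512 KB allocated)
curl -s http://localhost:6060/debug/pprof/allocs > alloc.pb
go tool pprof -sample_index=alloc_space -text alloc.pb`}
jemalloc Stats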
jemalloc (used by Firefox, Dropbox, and many high-throughput services) exposes detailed
allocation statistics via its stats API. The mallctl interface lets you query
cumulative allocation counters broken down by size class and call site.
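{`# jemalloc: dump stats to a file
# (assumes your service exposes malloc_stats_print() output on a debug endpoint;
# this URL is hypothetical, jemalloc itself has no HTTP interface)
curl -s 'http://localhost:6060/debug/malloc?stats=1' > jemalloc_stats.txt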
# jemalloc_stats.txt excerpt:
#
# Background threads: 4
# Config: opt_prof, opt_stats_print
# Allocated: 2,345,678,912 bytes
# Active: 2,567,890,123 bytes
# Metadata: 45,678,901 bytes
# Resident: 3,456,789,012 bytes
#
# Allocation counters:
# small: 2,123,456,789 ops
# large: 456,789 ops
# huge: 1,234 ops
#
# By size class:
# 256B: 234,567 allocations, 60,123,456 bytes
# 512B: 123,456 allocations, 63,234,567 bytes
# 4096B: 34,567 allocations, 141,234,567 bytes
# large: 456,789 allocations, 987,654,321 bytes
# huge: 1,234 allocations, 234,567,890 bytes
# Enable heap profiling with jemalloc
MALLOC_CONF="prof:true,prof_active:true" ./my_service
# Analyze a heap dump with jeprof (ships with jemalloc)
jeprof --text ./my_service jemalloc_heap.1.heap`}
Reading Allocation Flame Graphs
Allocation flame graphs look like CPU flame graphs but encode a different quantity. A bar
representing 10MB of allocation at the json.Marshal leaf means "10MB was allocated
by calls to json.Marshal across all sampled stacks." This is not memory in use — it's the
cumulative amount allocated over the profile window. If you see 1GB allocated in 30 seconds
and your heap is 512MB, the GC had to run at least twice to keep up.
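The arithmetic behind that estimate is short enough to write down. This sketch treats the heap size as the amount that can be allocated between collections, which is a simplifying assumption (real collectors trigger earlier and keep live data), so the result is a lower bound.

```python
def allocation_pressure(allocated_bytes, window_seconds, heap_bytes):
    """Return (allocation rate in bytes/s, lower bound on GC cycles).
    Assumes at most one full heap's worth can be allocated between collections."""
    rate = allocated_bytes / window_seconds
    min_gc_cycles = allocated_bytes // heap_bytes
    return rate, min_gc_cycles

MIB = 1024 ** 2
GIB = 1024 ** 3

# The example from the text: 1 GB allocated in 30 seconds with a 512 MB heap.
rate, min_cycles = allocation_pressure(1 * GIB, 30, 512 * MIB)
print(f"allocation rate: {rate / MIB:.1f} MiB/s, at least {min_cycles} GC cycles")
```

Halving the allocation rate on the widest leaf in the flame graph halves this lower bound, which is the reason allocation profiles are read before tuning the collector itself.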
{`# async-profiler allocation output (HTML flamegraph)
./profiler.sh -d 30 -e alloc -f /tmp/alloc.html <pid>
# The flamegraph shows:
# - Width = cumulative bytes allocated (not current heap usage)
# - Colors depend on the tool and palette; do not read meaning into them
# - Look for: wide bars at a leaf function = high allocation rate
# - Target: find the call sites that allocate the most
# and check if those allocations can be cached/reused`}
Latency Profiling: Histograms and Percentiles
A latency profile isn't a stack trace — it's a histogram of request durations. The goal is to understand the full distribution, not just the average. Optimizing for average latency while ignoring P99 is a common mistake.
p50 / p95 / p99 / p999 — What Each Percentile Tells You
Percentiles are not equally useful. p50 (median) tells you about the typical request; if it spikes when p99 spikes, the problem is systemic (for example load-dependent) rather than confined to the tail. p95 captures the "long tail" of user experience and is a common SLO threshold (stricter SLOs use p99). p99 catches rare but real problems. p999 (the slowest 0.1%) is where you find the rarest worst-case events, often correlated with GC pauses, OS scheduler preemption, or external dependency timeouts.
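A small worked example shows why the average hides the tail. The sketch uses the nearest-rank definition of a percentile and a made-up workload in which 2% of requests are slow.

```python
import math

def percentile(sorted_values, q):
    """Nearest-rank percentile: the smallest value such that at least
    a fraction q of the samples are less than or equal to it."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]

# Hypothetical workload: 980 requests at 10 ms, 20 requests at 2000 ms.
latencies_ms = sorted([10] * 980 + [2000] * 20)
mean_ms = sum(latencies_ms) / len(latencies_ms)
summary = {
    name: percentile(latencies_ms, q)
    for name, q in [("p50", 0.50), ("p95", 0.95), ("p99", 0.99), ("p999", 0.999)]
}
print(f"mean={mean_ms:.1f}ms", summary)
```

The mean of about 50 ms describes no actual request, and p50 and p95 both report 10 ms. Only p99 and above reveal that some requests take two seconds.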
{`// Go: latency histogram with the Prometheus client library
histogram := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request latency in seconds",
    Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
})
prometheus.MustRegister(histogram)
# Python: use prometheus_client
from prometheus_client import Histogram
h = Histogram('request_latency_seconds', 'HTTP request latency',
              buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5])
# What the percentiles mean:
# p50 = 10ms → half of all requests finish within this
# p95 = 250ms → 19 in 20 requests finish within this (a common SLO threshold)
# p99 = 800ms → 1 in 100 requests takes longer (used for stricter SLOs)
# p999 = 2.5s → 1 in 1000 requests hits a rare pathological case
# → check this for GC pauses, major page faults`}
Latency Flame Graphs
Standard flame graphs show CPU time distribution. Latency-attributed flame graphs are different: each sample is a stack trace annotated with the latency of the request that produced it. A request that took 5 seconds contributes a 5-second weight to every stack frame in its trace. This makes the slowest requests visually dominate the flame graph, immediately revealing which call path is responsible for the tail.
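The weighting rule can be sketched in a few lines. The input here is a hypothetical list of (stack, request latency) pairs, one per request; a real system would derive these from traces joined with profile samples.

```python
from collections import defaultdict

def weigh_frames(requests):
    """For each frame, compute how many requests passed through it and how
    much total request latency it was part of."""
    by_count = defaultdict(int)
    by_latency = defaultdict(float)
    for stack, latency_s in requests:
        for frame in set(stack.split(";")):
            by_count[frame] += 1
            by_latency[frame] += latency_s
    return dict(by_count), dict(by_latency)

# Hypothetical traffic: 100 fast cache hits, 2 slow database requests.
requests = [("handler;cache_lookup", 0.01)] * 100 + [("handler;db_query", 5.0)] * 2
by_count, by_latency = weigh_frames(requests)
print("by count:  ", by_count)
print("by latency:", {k: round(v, 2) for k, v in by_latency.items()})
```

Counted by requests, cache_lookup is 50 times wider than db_query. Weighted by latency, db_query is ten times wider, which matches what users waiting on the slow requests actually experience.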
{`# Pyroscope: latency-weighted flamegraph
# NOTE: the queries below are illustrative pseudo-syntax; check your backend's query language.
# Idea: select the wall-clock profile type and filter to high-latency requests,
# for example through a label that your instrumentation attaches to slow requests.
/my-service:wall{profile_type="wall",quantile="p99"}
# → Shows the flamegraph weighted by p99-latency requests
# Conceptually:
# For each request with latency > p99 threshold:
# Record a wall-clock sample for each stack frame
# Weight the sample by (request_latency / sample_interval)
# Stack frames from slow requests become wider → immediately visible
# Differential latency flamegraph: which path regressed?
/my-service:wall{compare="1h"}
# → Compare current hour vs previous hour, widths are deltas`} Building a Latency Histogram from Scratch
When you don't have a metrics library, you can build a simple histogram using a logarithmic bucketing strategy. The key design choice is the bucketing scheme: linear buckets (0-1ms, 1-2ms, ...) work for fine-grained latency; logarithmic buckets (0-1ms, 1-10ms, 10-100ms, 100ms-1s) work for capturing the full tail.
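{`// Simple latency histogram with logarithmic buckets
import java.util.Map;
class LatencyHistogram {
  // Bucket upper bounds in nanoseconds: 1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s
  private static final long[] THRESHOLDS =
      {1_000_000, 5_000_000, 10_000_000, 50_000_000, 100_000_000, 500_000_000, 1_000_000_000};
  // Buckets: <1ms, 1-5ms, 5-10ms, 10-50ms, 50-100ms, 100ms-500ms, 500ms-1s, 1s+
  private final long[] buckets = new long[THRESHOLDS.length + 1];
  private long sum = 0;
  private long count = 0;
  public void record(long durationNanos) {
    int i = 0;
    while (i < THRESHOLDS.length && durationNanos >= THRESHOLDS[i]) {
      i++;
    }
    buckets[i]++; // i == THRESHOLDS.length is the 1s+ overflow bucket
    sum += durationNanos;
    count++;
  }
  // Upper bound (ns) of the bucket that contains quantile q. Bucket counts can only
  // bound a percentile; exact values need raw samples or a sketch such as t-digest.
  public long percentileUpperBound(double q) {
    long rank = (long) Math.ceil(q * count);
    long seen = 0;
    for (int i = 0; i < THRESHOLDS.length; i++) {
      seen += buckets[i];
      if (seen >= rank) {
        return THRESHOLDS[i];
      }
    }
    return Long.MAX_VALUE; // quantile falls in the unbounded 1s+ bucket
  }
  public Map<String, Long> percentiles() {
    return Map.of(
        "p50", percentileUpperBound(0.50),
        "p95", percentileUpperBound(0.95),
        "p99", percentileUpperBound(0.99),
        "p999", percentileUpperBound(0.999)
    );
  }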
}`} Practical Example: Finding a CPU Bottleneck
Here's a step-by-step walkthrough of finding and fixing a real CPU bottleneck. We start with a slow endpoint and end with a verified improvement via before/after profiles.
Step 1: Observe the Problem
Your SLO is p99 latency < 500ms, but the /api/reports endpoint is hitting
1.2s at p99. You suspect it's a database query but you're not sure. You pull up your
continuous profiling dashboard and filter to this endpoint.
{`# Query your profiling backend for this endpoint
# In Grafana Phlare / Pyroscope:
# Query: my-service {endpoint="/api/reports", quantile="p99"}
# The flamegraph shows a dominant bar: sql.query("SELECT ...") = 68% of wall-clock time`} Step 2: Get an Ad-Hoc CPU Profile
Since the continuous profile shows a hot path, confirm it with an ad-hoc profile on the live system to get higher resolution data and verify the signal.
{`# SSH to a production host running the service
ssh prod-host-03
# Find the process
ps aux | grep my-service
# USER PID %CPU COMMAND
# app 12345 85% ./my-service --config=/etc/myapp/prod.yaml
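# Record CPU profile for 60 seconds during traffic
# (-p limits sampling to this process; do not combine it with -a, which means system-wide)
sudo perf record -F 99 -g -p 12345 -- sleep 60
# Recording stops when sleep exits and writes perf.data to the current directory
# Convert the binary perf.data into text stacks
sudo perf script --header > /tmp/perf.stacks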
# Now generate the flamegraph
./FlameGraph/stackcollapse-perf.pl /tmp/perf.stacks > /tmp/perf.folded
./FlameGraph/flamegraph.pl /tmp/perf.folded > /tmp/flame.svg
# scp to your laptop and open in browser`} Step 3: Read the Flame Graph
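Open the flame graph. In the default layout the root is at the bottom, and the widest frames at the top of each tower are your hot leaves. Read downward from a hot leaf to see its callers: each frame sits on top of the function that called it. In this example, the flamegraph shows: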
{`# Folded output (simplified)
main.handleRequest;reportController.Generate;sql.Query;indexScan 4523
main.handleRequest;reportController.Generate;sql.Query;hashAgg 1234
main.handleRequest;reportController.Generate;serializeJSON 890
main.handleRequest;reportController.Generate;db.Exec 234
# → indexScan is the bottleneck: 4523 samples / 7200 total (including stacks omitted above) = 62.8% of CPU time`}
The flamegraph shows a very wide bar for indexScan. Despite the frame name,
the time under it goes to scanning the whole reports table: no index covers the
user_id filter and the created_at sort, so Postgres reads every row.
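The percentage can also be computed directly from the folded file instead of being read off the picture. The sketch below parses the four folded lines shown above and reports each leaf frame's share. The total of 7200 samples is taken from the example and passed in explicitly, since the simplified listing omits some stacks.

```python
from collections import Counter

def leaf_sample_counts(folded_text):
    """Sum sample counts per leaf frame from folded stacks ("a;b;c count" per line)."""
    counts = Counter()
    for line in folded_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        stack, _, samples = line.rpartition(" ")
        counts[stack.split(";")[-1]] += int(samples)
    return counts

folded = """
main.handleRequest;reportController.Generate;sql.Query;indexScan 4523
main.handleRequest;reportController.Generate;sql.Query;hashAgg 1234
main.handleRequest;reportController.Generate;serializeJSON 890
main.handleRequest;reportController.Generate;db.Exec 234
"""

TOTAL_SAMPLES = 7200  # from the example; includes stacks not listed above

for frame, samples in leaf_sample_counts(folded).most_common():
    print(f"{frame:15s} {samples:5d} {100 * samples / TOTAL_SAMPLES:5.1f}%")
```

Ranking by leaf frame gives self time, which is what the top edge of a flame graph shows. Ranking by any frame in the stack would give inclusive time instead.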
Step 4: Verify with EXPLAIN ANALYZE
A flame graph shows frames, not SQL text, so take the query issued on this code path (from the application code or from pg_stat_statements) and run EXPLAIN ANALYZE on it to see
the execution plan.
{`EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, created_at, data
FROM reports
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 100;
-- Output (illustrative):
-- Sort (cost=12345.67..12348.90 rows=100 width=45)
-- (actual time=1234.5..1235.1 rows=99 loops=1)
-- Sort Key: created_at DESC
-- Sort Method: quicksort  Memory: 12345kB
-- -> Seq Scan on reports (cost=0..12345.67 rows=99 width=45)
-- (actual time=0.1..1233.0 rows=99 loops=1)
-- Filter: (user_id = 42)
-- Rows Removed by Filter: 123456789
-- Buffers: shared hit=12 read=1572864 ← 1,572,864 pages × 8kB ≈ 12GB read!`}
The Seq Scan with Buffers: read=1572864 confirms it: roughly 12GB of
data (buffer counts are 8kB pages) read from disk to return 99 rows. A composite index on (user_id, created_at)
lets Postgres fetch the matching rows already in sorted order, eliminating both the scan and the sort.
Step 5: Fix and Verify
{`-- Add the missing index
CREATE INDEX CONCURRENTLY idx_reports_user_created
ON reports (user_id, created_at DESC);
-- Verify with EXPLAIN ANALYZE (should now show Index Scan, not Seq Scan)
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, created_at, data
FROM reports
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 100;
-- The plan now shows: Index Scan using idx_reports_user_created ...
-- Buffers: shared hit=103 read=0
-- actual time=0.05..0.45 rows=99 loops=1
-- → 12GB read from disk → 103 cached pages (~824kB), latency 1234ms → 0.45ms`} Step 6: Profile the Fixed Code
After deploying the fix, collect another profile to verify the improvement and catch any new bottlenecks that were previously hidden behind this dominant one (this is a common pattern — fixing the #1 bottleneck reveals #2).
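{`# Collect post-fix profile (same rate and duration as Step 2; use the new PID after the deploy)
sudo perf record -F 99 -g -p 12345 -- sleep 60
sudo perf script --header > /tmp/after.stacks
# /tmp/before.stacks is the pre-fix capture from Step 2 (/tmp/perf.stacks), kept under a new name
# Fold and diff
./FlameGraph/stackcollapse-perf.pl /tmp/before.stacks > /tmp/before.folded
./FlameGraph/stackcollapse-perf.pl /tmp/after.stacks > /tmp/after.folded
# Differential flamegraph: widths come from the "after" profile,
# red = grew since "before", blue = shrank
./FlameGraph/difffolded.pl /tmp/before.folded /tmp/after.folded \\
| ./FlameGraph/flamegraph.pl > /tmp/diff.svg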
# The diff shows:
# indexScan: -62.8% (fixed!)
# hashAgg: +22.1% (previously hidden, now visible)
# serializeJSON: +15.3% (previously hidden)
# → Secondary hot spots are now visible — proceed to Step 3 for hashAgg`} Overhead Considerations
Profiling has overhead. Understanding it helps you choose safe sample rates and avoid introducing latency regressions through the act of measuring.
The Sampling Overhead Math
Profiler overhead ≈ sample rate × (stack walk cost + per-sample bookkeeping). The sample rate and the unwinding method are both tunable.
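As a quick calculator, the estimate can be written as a one-line function. This is a back-of-the-envelope sketch: the function name is hypothetical, and the per-sample costs are the rough figures used in this section, not measurements.

```python
def sampling_overhead_fraction(sample_rate_hz, per_sample_cost_ns):
    """Fraction of one CPU's time spent taking samples.

    per_sample_cost_ns is the stack walk plus bookkeeping cost of one sample.
    """
    return sample_rate_hz * per_sample_cost_ns / 1_000_000_000

# Rough per-sample costs in nanoseconds (assumed, order-of-magnitude only).
for label, cost_ns in [("frame pointers", 500), ("eBPF", 2_000), ("DWARF", 5_000)]:
    fraction = sampling_overhead_fraction(99, cost_ns)
    print(f"{label:15s} {fraction * 100:.4f}% of each sampled CPU")
```

Because every CPU is sampled at the same rate, the fraction is per CPU. Adding cores increases the total profiler work but not the percentage of capacity it consumes.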
{`overhead_per_sample = stack_walk_cost + bookkeeping_overhead
overhead_total = sample_rate_hz * overhead_per_sample / 1_000_000_000 # ns → fraction
# Frame pointer walk: ~500 ns per stack
99 Hz × 500 ns = 49,500 ns/sec = 0.005% of each sampled CPU
# On a 4-CPU host: 4 × 49,500 ns/sec of profiler work, still 0.005% of total capacity
# DWARF unwinding: ~5 μs per stack
99 Hz × 5,000 ns = 495,000 ns/sec = 0.05% of each sampled CPU
# On a 4-CPU host: 4 × 495,000 ns/sec of profiler work, still 0.05% of total capacity
# eBPF stack walk + map insert: ~2 μs
99 Hz × 2,000 ns = 198,000 ns/sec = 0.02% of each sampled CPU
# On a 4-CPU host: 4 × 198,000 ns/sec of profiler work, still 0.02% of total capacity
# At 999 Hz on a latency-sensitive workload (1ms target):
# Each sample adds ~2-5 μs overhead → you start to see tail latency spikes`} The 99 Hz Convention
The convention of sampling at 99 Hz (not 100 Hz) exists because many periodic activities, including the kernel timer tick on HZ=100 kernels, run at round frequencies. A profiler that samples at exactly 100 Hz stays in lockstep with a 100 Hz timer: every sample lands at the same point in the tick, so periodic work is either sampled every time or never sampled at all. At 99 Hz the sample point drifts through the tick, so periodic work shows up in proportion to the time it actually uses.
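A small simulation makes the lockstep effect visible. It assumes a 10 ms tick and a periodic task that runs during the first 1 ms of every tick (so its true share is 10%), then counts how many samples land inside the task at each sampling rate. The workload is hypothetical; exact fractions are used to avoid floating-point drift.

```python
from fractions import Fraction

TICK = Fraction(1, 100)    # assumed 10 ms timer tick (HZ=100)
BUSY = Fraction(1, 1000)   # assumed: periodic work runs in the first 1 ms of each tick

def hit_fraction(sample_hz, offset=Fraction(0), duration_s=10):
    """Fraction of samples that land while the periodic work is running."""
    n = sample_hz * duration_s
    hits = 0
    for k in range(n):
        t = offset + Fraction(k, sample_hz)   # time of the k-th sample in seconds
        if t % TICK < BUSY:                   # position inside the current tick
            hits += 1
    return Fraction(hits, n)

print("100 Hz, aligned with tick:", float(hit_fraction(100)))
print("100 Hz, 5 ms after tick:  ", float(hit_fraction(100, Fraction(5, 1000))))
print(" 99 Hz:                   ", float(hit_fraction(99)))
```

At 100 Hz the answer depends entirely on where the first sample happens to fall: the task looks like 100% or 0% of the time. At 99 Hz the estimate comes out at 10/99, close to the true 10%.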
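{`# Check your kernel tick rate (CONFIG_HZ), on distributions that ship /boot/config-*
grep 'CONFIG_HZ=' /boot/config-$(uname -r)
# Note: getconf CLK_TCK reports USER_HZ (100 on Linux), not the kernel tick rate
# Typical values are 100, 250, or 1000 depending on the distribution and kernel flavor
# The math: at 99 Hz samples are 10.101 ms apart, so against a 10 ms tick
# each sample lands ~0.1 ms later in the tick than the previous one,
# and the sampling phase sweeps the whole tick about once per second.
# The cost is 1% fewer samples than at 100 Hz.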
# This is the standard trade-off and is fine for most workloads`} Safe Rates by Profile Type
| Profile Type | Safe Rate | Overhead Estimate | Avoid When |
|---|---|---|---|
| CPU (frame pointer) | 99 Hz | <0.01% | Very latency-sensitive (<1ms target) |
| CPU (DWARF) | 99 Hz | <0.05% | Production latency-critical paths |
| Wall-clock | 99 Hz | <0.02% | Same cases as CPU profiling |
| Heap allocation | 512KB-1MB interval | 1-5% | High-throughput (>100k req/s) services |
| Java async-profiler | 99 Hz CPU / 512KB alloc | <1% | Ultra-low-latency trading systems |
| eBPF continuous | 99 Hz | <0.02% | Kernel 4.x without BTF |
Tradeoffs
Sample Rate vs Overhead
Doubling the sample rate (99 Hz → 198 Hz) doubles the overhead but barely improves the flame graph for steady-state workloads. 99 Hz produces ~5,940 samples per minute per CPU, which is enough to resolve frames that account for roughly 1% of CPU; smaller regressions need longer windows or samples aggregated across hosts. Higher rates matter when profiling short-lived processes or burst workloads.
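A rough way to judge whether a profile has enough samples is to treat each sample as an independent coin flip that lands in a given frame with probability equal to that frame's CPU share. Under that simplified binomial model (an assumption; real samples are correlated), the relative error of the estimate follows directly from the sample count:

```python
import math

def relative_std_error(cpu_share, n_samples):
    """Relative standard error of a frame's estimated CPU share (binomial model)."""
    return math.sqrt((1 - cpu_share) / (cpu_share * n_samples))

one_minute = 99 * 60  # samples per CPU in one minute at 99 Hz

for share in (0.10, 0.01, 0.001):
    for label, n in [("1 min", one_minute), ("1 hour", one_minute * 60)]:
        err = relative_std_error(share, n)
        print(f"share {share:6.1%}  window {label:6s}  relative error ±{err:.0%}")
```

The error shrinks with the square root of the sample count, so doubling the rate buys about a 1.4x improvement while a longer window or more hosts gives the same benefit without extra per-host overhead.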
Wall-Clock vs CPU Profile
A CPU profile samples only on-CPU threads, so it misses time spent blocked on I/O, waiting on locks, or descheduled. A wall-clock profile samples all threads and answers "why is this RPC slow" questions that CPU profiles miss completely. For web services with downstream dependencies, wall-clock is usually the right starting point.
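The difference is easy to see with two clocks from the standard library. In this sketch, time.sleep stands in for a blocking downstream call (an assumption for illustration); perf_counter measures wall-clock time and process_time measures CPU time consumed by the process.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

def blocked():
    time.sleep(0.2)  # stands in for waiting on a network call or a lock

def busy():
    sum(i * i for i in range(500_000))  # pure computation

for fn in (blocked, busy):
    wall, cpu = measure(fn)
    print(f"{fn.__name__:8s} wall={wall * 1000:7.1f} ms  cpu={cpu * 1000:7.1f} ms")
```

The blocked function takes about 200 ms of wall-clock time but almost no CPU time, so a CPU profiler would collect close to zero samples for it even though it dominates request latency.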
SDK vs eBPF Agent
Per-language SDKs (Pyroscope Java agent, Go's runtime/pprof) get language-aware stacks: interpreted Python frames, JIT-inlined Java methods, Go's goroutine stacks. eBPF works for any binary but struggles with managed runtimes that hide their stacks inside the interpreter or JIT. For Go and Java services in Kubernetes, start with SDK profiling. For mixed binary + container environments, eBPF is more cost-effective.
Push vs Pull Collection
Push (Pyroscope SDK, gperftools) sends profiles on a schedule from inside the process.
It works for short-lived processes and serverless functions that exit before a pull
scrape would fire. Pull (Phlare scraping /debug/pprof) integrates with
existing Prometheus-style service discovery and is simpler operationally. Short-lived
services (<10s lifetime) almost always need a push-based approach.
Heap Profile vs Heap Dump
A heap profile is statistical (sampled allocations, low overhead). A heap dump is complete (every live object, potentially hundreds of gigabytes). For routine optimization work, heap profiles are usually sufficient. Heap dumps are for debugging specific OOM situations or investigating leaks that sampling can't isolate. Never take a heap dump in production without understanding the pause time.
Symbol Storage vs On-Demand Symbolization
eBPF profilers like Parca need symbols for user-space stacks. Without symbols, stacks
show as [unknown]. Options: (1) upload symbol tables to the backend per
binary version, (2) run a local symbolization service on each host, (3) require
debuginfo packages on hosts. Option 1 is most common for continuous profiling;
options 2-3 are for ad-hoc perf work.
Key Numbers Reference
Frequently Asked Questions
Is continuous profiling safe in production?
Yes, at standard sample rates (99 Hz CPU, 512 KB allocation intervals). Large operators such as Google (with Google-Wide Profiling), Meta, Netflix, and Uber run continuous profiling across their production fleets. The risk is mainly bugs in the profiler itself (segfaults, deadlocks in signal handlers). Pin profilers to released versions, roll out gradually, and monitor for anomalous restarts. For Java, async-profiler is battle-tested. For Go, runtime/pprof is in the standard library and is extremely stable.
Why are flame graphs better than top?
top tells you which process is eating CPU. Flame graphs tell you which function inside which call path is eating CPU. When multiple functions in different call paths all contribute to CPU usage, top collapses them into a single per-process percentage and gives you no way to reconstruct the call hierarchy. Flame graphs visualize the tree directly: the width of each bar encodes the time contribution, and the stack hierarchy is preserved. Hot leaves are immediately obvious; the caller relationship is unambiguous.
How do I diff two flame graphs?
Both the FlameGraph difffolded.pl tool and continuous profiling backends
(Pyroscope, Parca) support differential flame graphs. The process: collect a baseline
profile (before), collect a comparison profile (after), compute the per-stack delta.
Differential flame graphs color widened bars red (regression) and narrowed bars blue
(improvement). With difffolded.pl before after, frame widths come from the after profile.
To draw widths from the before profile instead, swap the arguments and pass --negate to
flamegraph.pl, which flips the colors back so that red still means growth. Pyroscope and Parca have first-class diff UIs with
baseline/comparison time range pickers.
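The per-stack delta step is simple enough to sketch by hand. The code below is an illustration of the idea, not a reimplementation of difffolded.pl; the two profiles are hypothetical folded-stack snippets.

```python
from collections import Counter

def parse_folded(text):
    """Parse folded stacks ("frame;frame;frame count" per line) into a Counter."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        stack, _, samples = line.rpartition(" ")
        counts[stack] += int(samples)
    return counts

def diff_folded(before, after):
    """Per-stack delta (after minus before) for every stack seen in either profile."""
    return {s: after[s] - before[s] for s in sorted(set(before) | set(after))}

# Hypothetical baseline and comparison profiles.
before = parse_folded("""
main;handle;query;scan 600
main;handle;serialize 100
""")
after = parse_folded("""
main;handle;query;lookup 20
main;handle;serialize 100
main;handle;aggregate 80
""")

for stack, delta in diff_folded(before, after).items():
    label = "grew" if delta > 0 else "shrank" if delta < 0 else "same"
    print(f"{delta:+6d}  {label:6s}  {stack}")
```

Stacks that exist in only one profile still appear in the result, which is how code paths that vanished or newly appeared show up in a differential view. When the two captures differ in length, compare shares of the total rather than raw counts.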
Why does my flame graph look like a pyramid of [unknown]?
Stack unwinding or symbolization failed. The usual cause is a binary compiled without frame pointers
(-fomit-frame-pointer, the default for optimized builds such as -O2 or -O3)
and without debug symbols. Three fixes: (1) rebuild with
-fno-omit-frame-pointer -g, (2) install matching debuginfo packages
(debuginfo-install on RHEL, apt-get install foo-dbgsym on
Debian), or (3) use --call-graph dwarf in perf, which unwinds with DWARF
call-frame information instead of frame pointers, at a higher per-sample cost.
Can I link a profile to a specific request?
Yes, via pprof labels. Each sample in a pprof profile carries a list of label key-value
pairs (span_id, trace_id, route, pod, etc.). In Go, wrap the request handler in
runtime/pprof.Do, or call pprof.WithLabels followed by pprof.SetGoroutineLabels, so that samples taken while the request runs carry its labels. In Java
with the Pyroscope agent, use its labels API (Pyroscope.LabelsWrapper). In Python, the
Pyroscope SDK provides pyroscope.tag_wrapper for the same purpose. Filter by
trace_id="abc-123" in the Pyroscope UI to see exactly which call path that specific
slow request took.
What's the difference between a heap profile and a heap dump?
A heap profile is a statistical sample of allocations (typically 1 in 512KB or 1 in 1MB). Low overhead, suitable for always-on production use. A heap dump captures every live object in the process — potentially hundreds of gigabytes for large Java heaps or C++ processes. Taking a heap dump causes a stop-the-world pause while the GC walks the entire heap. Use heap dumps for debugging specific OOM crashes or isolated memory leak investigations. Use heap profiles for ongoing allocation rate monitoring and GC tuning.
What's the difference between wall-clock and CPU profiling for finding latency issues?
A CPU profile only samples threads that are actively executing on a CPU. A thread that is waiting on a network call, a lock, or disk I/O produces zero CPU samples — even though it is consuming user-facing latency. For RPC-heavy services, a CPU profile will tell you "your code is fast" while the actual bottleneck is a slow downstream service. Wall-clock profiling captures both on-CPU time (serialization, computation) and off-CPU time (I/O, waiting). Start with wall-clock; use CPU as a second step when you know the problem is in your own compute.
How do I profile a short-lived process (Lambda, job, cron)?
Push-based collection is required — a pull scraper that runs every 10 seconds will
miss a Lambda that runs for 3 seconds. Options: (1) profiling built into the process
startup/shutdown path (Go: runtime/pprof.StartCPUProfile and StopCPUProfile, Java: async-profiler
API), (2) Pyroscope's Lambda layer or sidecar, (3) for cron jobs, record a local pprof
profile on exit and upload it to the backend asynchronously. Some backends (Profefe)
have a /profile?seconds=N endpoint that starts a timer-based profile on the
next push.
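For a batch job or cron script, option (3) can be sketched with the standard library alone. This example uses cProfile, which is a deterministic tracer rather than a sampling profiler and writes pstats files rather than pprof, so treat it as a stand-in for the pattern (start at startup, flush at exit). The output path is hypothetical and the upload step is left as a comment.

```python
import atexit
import cProfile
import os
import pstats
import tempfile

PROFILE_PATH = os.path.join(tempfile.gettempdir(), "job-profile.pstats")  # hypothetical path
profiler = cProfile.Profile()

def flush_profile(path=PROFILE_PATH):
    """Stop profiling and write the result to disk before the process exits."""
    profiler.disable()
    profiler.dump_stats(path)
    # A real job would hand the file to an uploader here (not shown).
    return path

def job():
    """Stand-in for the short-lived workload."""
    return sum(i * i for i in range(100_000))

profiler.enable()
atexit.register(flush_profile)  # runs on normal interpreter exit, not on SIGKILL
job()
```

Writing to local disk first keeps the exit path fast and independent of backend availability. A separate process can then ship the file asynchronously.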
Why does perf report show [kernel] but no user-space stacks?
This typically means the CPU time is being spent in kernel code (system time), not in the user
process. This can be real (a kernel bottleneck) or a profiling setup problem. Check:
(1) Are you using -g for stack collection? (2) Is the binary compiled with
frame pointers or DWARF? (3) Are debuginfo packages installed? (4) Try perf
report --no-branch-stack to see if there's a data collection issue. If the
kernel stacks are real, use perf sched to analyze scheduler behavior,
bcc-tools/offcputime.py to find off-CPU kernel time, and
bcc-tools/biotop.py for block I/O.
When should I use valgrind instead of perf?
valgrind (specifically callgrind) is for deterministic, instruction-level accurate profiling of small, bounded workloads. perf uses statistical sampling — it's approximate but very fast. valgrind runs your program 10-100x slower in a synthetic CPU and counts every instruction. Use it when: (1) your workload is short (completes in seconds), (2) you need instruction-level accuracy (not just function-level), (3) you want to analyze cache behavior in detail, or (4) you need to understand algorithmic complexity (counting actual function call frequencies). Never use valgrind for production profiling — the slowdown will change the behavior of caches, locks, and I/O-bound code.