Performance Profiling

Finding where your code burns time and memory — from ad-hoc debugging to always-on fleet-wide observability

What Is Profiling?

Profiling is systematic measurement of a program's resource consumption — where does CPU time go? Where does memory get allocated? What does a cache miss cost? Unlike logging, which tells you what happened, profiling tells you where the bottleneck is. It is the foundation of data-driven performance optimization.

The classic profiling workflow: observe a performance problem (high P99 latency, excessive CPU, OOM crashes), collect a profile, find the hot code path, understand why it's slow, fix it, verify with a follow-up profile. The last step is critical — without a before/after profile, you don't know if the fix actually helped.

Profiling is fundamentally a sampling problem. Every N milliseconds, the profiler pauses the program and records the current stack trace. Do this 10,000 times and you have a statistically meaningful picture of where time or memory was consumed. The key insight is that you don't need to capture every single function call — a representative sample is enough, because the expensive code paths appear proportionally more often.
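The claim that a sample is enough can be checked with a small simulation. The sketch below is illustrative only: the function names and time shares in `TRUE_SHARE` are made up, and a real profiler samples stacks on a timer rather than drawing random numbers.

```python
import random
from collections import Counter

# Hypothetical workload: fraction of run time spent in each leaf function.
TRUE_SHARE = {"parse_json": 0.60, "db_query": 0.30, "render": 0.10}

def sample_profile(n_samples, seed=0):
    """Simulate a sampling profiler: each tick lands in a function
    with probability equal to that function's share of run time."""
    rng = random.Random(seed)
    names = list(TRUE_SHARE)
    weights = [TRUE_SHARE[n] for n in names]
    hits = Counter(rng.choices(names, weights=weights, k=n_samples))
    return {name: hits[name] / n_samples for name in names}

estimate = sample_profile(10_000)
for name, share in estimate.items():
    print(f"{name:12s} true={TRUE_SHARE[name]:.2f} sampled={share:.3f}")
```

With 10,000 samples the estimated shares should land within a percentage point or two of the true ones, which is why a profiler can afford to look at the program only about a hundred times per second.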

There are two modes: ad-hoc profiling (SSH in, run perf record, analyze the result) and continuous profiling (always-on agents on every host, shipping aggregated pprof profiles to a query backend). The former is better for deep dives on a single host; the latter is better for catching regressions as soon as they roll out and for investigating incidents after the fact.

Types of Profiling

Different profile types answer different questions. Choosing the wrong type means you miss the actual bottleneck.

CPU Profiling

Who is burning CPU cycles? Samples are collected on-CPU only — threads that are waiting on I/O or sleeping don't appear. This is the most common profile type. The key metric is self CPU time — how much time a function spent executing its own code, excluding time spent in callees.

{`# CPU profile: what is executing right now?
perf record -F 99 -a -g -- sleep 30
# 99 Hz: roughly every 10 ms, interrupt each CPU and capture its current stack`}

Best for: compute-bound workloads, hot loops, algorithmic inefficiencies

Wall-Clock Profiling

Samples every thread regardless of state: on-CPU, blocked on I/O, sleeping, or waiting on a lock. Shows the full picture of where time goes, including I/O wait and scheduler delays. Produces wider, messier flamegraphs but answers "why is this request slow" questions that CPU profiles miss.

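{`# Wall-clock: on-CPU + off-CPU time
# perf's timer sampling is on-CPU only; one way to see off-CPU time is to
# trace scheduler switches with stacks
perf record -e sched:sched_switch -a -g -- sleep 30
# or via async-profiler's wall-clock mode ($PID = target JVM process id)
profiler.sh -d 30 -e wall $PID`}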

Best for: I/O-bound workloads, off-CPU time analysis, RPC latency investigation

Memory / Heap Profiling

Tracks heap allocations — either live bytes (objects still in memory) or allocation rate (bytes allocated per second). Live-byte profiles reveal memory leaks and high-water marks. Allocation profiles reveal GC pressure: which call paths allocate the most, triggering more frequent garbage collection.

{`# Java allocation profile: every 512KB of heap allocation
profiler.sh -d 30 -e alloc $PID   # $PID = target JVM process id

# Go live heap (current in-use allocations)
curl http://localhost:6060/debug/pprof/heap`}

Best for: memory leaks, GC tuning, reducing allocation rate

Mutex / Lock Profiling

Tracks lock contention: how much time threads spend blocked waiting on a mutex or RWLock. The profile shows which call paths are waiting and, depending on the tool, which ones hold the contended lock. High contention often points to overly coarse locking, or to a shared structure that would benefit from sharding or a lock-free design.

{`# Java lock contention ($PID = target JVM process id)
profiler.sh -d 30 -e lock $PID

# Go mutex profile (empty unless runtime.SetMutexProfileFraction is set > 0)
curl http://localhost:6060/debug/pprof/mutex`}

Best for: concurrency bottlenecks, high-contention hot locks

Block / I/O Profiling

Tracks time threads spend blocked on I/O operations — disk, network, or inter-process communication. Useful for understanding when your program is waiting on external services or filesystem operations. Often paired with wall-clock profiling.

{`# Go block profile: blocking on channels and sync primitives (needs runtime.SetBlockProfileRate)
curl http://localhost:6060/debug/pprof/block

# Linux block I/O via perf
perf record -e block:block_rq_insert -a -g -- sleep 30`}

Best for: I/O-bound services, slow disk analysis, network wait debugging

Goroutine Profiling

Not a performance profile per se — a snapshot of all goroutines and their stack traces. Shows what every goroutine is doing right now. Great for diagnosing goroutine leaks, runaway goroutine counts, and deadlock-adjacent situations. Think of it as a process dump that you can diff over time.

{`# Dump every goroutine's stack individually (Go)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'

# Goroutine profile aggregated by identical stack (text form)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'`}

Best for: goroutine leaks, deadlock investigation, runaway concurrency

Wall-Clock vs CPU Time

The difference between wall-clock and CPU time is fundamental. A thread that spends 900ms of its 1000ms wall-clock time blocked on a network call shows up as 900ms of off-CPU wall-clock time but 0ms of CPU time. A CPU profile misses it entirely. Consider: an HTTP handler calling a downstream service has two distinct cost centers — the CPU to serialize the request and parse the response, and the wall-clock time the downstream call takes. A CPU profile only sees the first. A wall-clock profile sees both, letting you attribute the total latency correctly.

{`# Example: CPU vs wall-clock attribution for an HTTP handler
# CPU profile shows:        json.Marshal = 5ms, db.Query = 3ms, response.Write = 2ms
# Wall-clock profile shows: json.Marshal = 5ms, db.Query = 3ms, response.Write = 2ms, downstream RPC = 990ms
# → The real problem is the downstream call, not your code`}
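The same split can be observed from inside a process by reading two different clocks. This is a minimal sketch: `handler` is a hypothetical stand-in, and `time.sleep` plays the role of the downstream RPC.

```python
import time

def handler():
    """Hypothetical request handler: a little CPU work, then a blocking wait."""
    total = sum(i * i for i in range(200_000))  # on-CPU work
    time.sleep(0.2)                              # stands in for a downstream RPC
    return total

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time consumed by this process
handler()
wall_s = time.perf_counter() - wall_start
cpu_s = time.process_time() - cpu_start
off_cpu_s = wall_s - cpu_s

print(f"wall={wall_s:.3f}s cpu={cpu_s:.3f}s off-cpu={off_cpu_s:.3f}s")
```

The CPU figure covers only the arithmetic; the sleep shows up solely in the wall-clock figure, which is the portion a CPU profile would never attribute to anything.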

Linux Profiling Tools

Linux has a rich set of profiling tooling, from lightweight timer-based sampling (perf) to full binary instrumentation (valgrind) to kernel-level stack collection (eBPF). Each serves a different niche.

perf

The standard Linux profiler. Uses hardware performance counters (PMCs) and the kernel's perf_event subsystem to sample CPU stacks. Low overhead, works for any language, requires frame pointers or DWARF debug info for useful stacks.

perf stat — count events (cycles, instructions, cache misses)
perf record — record samples for later analysis
perf report — text-based profile browser
perf annotate — per-instruction assembly with source interleaved
perf script — raw sample output for flamegraph tools

eBPF / stap

SystemTap (stap) and raw eBPF programs attach to kernel probes and can collect stack traces from kernel context. The bcc toolkit provides ready-made tools such as profile.py that sample stacks with eBPF and aggregate them in the kernel, which keeps overhead low. The sampling tools need Linux 4.9 or newer.

profile.py — bcc tool, timed CPU stack sampler
offcputime.py — bcc tool for off-CPU time by stack
memleak.py — bcc tool, outstanding allocations by call path
biolatency.py — bcc tool, block I/O latency histogram

gperftools (Google Performance Tools)

Google's heap and CPU profiler for C++. Works by patching malloc with an interceptor that records allocation call stacks. CPU profiler uses timer-based sampling with low overhead. Outputs pprof-compatible profiles. Good for native services where perf can't get DWARF stacks.

CPUPROFILE env var enables CPU profiling
HEAPPROFILE env var enables heap profiling
pprof --text — text profile output
pprof --gif — call graph visualization

valgrind / callgrind

Valgrind runs your program on a synthetic CPU using dynamic binary translation, so it can observe every instruction and memory access. callgrind is its call-graph profiling tool: deterministic, with instruction-level accuracy. Extremely slow (10-100x) but complete: no sampling bias, every call counted. Best for understanding algorithmic complexity and cache behavior in small workloads.

valgrind --tool=callgrind ./mybinary
callgrind_annotate — per-function instruction counts
kcachegrind — GUI call graph explorer

perf: The Linux Profiler

perf is the built-in Linux profiler, backed by hardware performance monitoring counters (PMCs) and the kernel perf_event API. It can sample any event the kernel knows about — CPU cycles, instructions retired, cache misses, branch mispredictions — and combine that with stack traces.

perf stat — Counting Events

Before profiling, start with perf stat to get an overview of what hardware events your workload exhibits. This is non-intrusive and uses PMC hardware counters directly.

{`# Run with full event set
sudo perf stat -e cycles,instructions,cache-references,cache-misses,\
    branch-instructions,branch-misses ./my_program

# Output:
#   1,234,567,890  cycles               # ~1.2 GHz * 1s
#     567,890,123  instructions         # ~0.46 IPC
#      12,345,678  cache-references
#         123,456  cache-misses        # ~1% miss rate
#      98,765,432  branch-instructions
#           1,234  branch-misses       # excellent prediction rate`}

perf record — Collecting Samples

perf record programs a PMC to overflow after N events (or, with -F, adjusts the period to hit a target sampling frequency). Each overflow raises an interrupt; the kernel captures the instruction pointer and stack, then resumes the program. Stack walking requires either frame pointers (compile with -fno-omit-frame-pointer) or DWARF debug info.

{`# Record 60 seconds of CPU profiles at 99 Hz on all CPUs
sudo perf record -F 99 -a -g -- sleep 60

# Record specific events instead of timer ticks
sudo perf record -e cycles:u -a -g -- ./my_program

# Record with DWARF unwinding (for binaries built without frame pointers)
sudo perf record -F 99 -a --call-graph dwarf -- sleep 30

# Check the resulting data
perf report           # interactive TUI browser
perf report --stdio  # text output
perf annotate         # per-instruction annotation`}

perf annotate — Reading Assembly

perf annotate takes a symbol from the profile and shows each assembly instruction with a percentage of samples. Combined with DWARF debug info it interleaves source lines. This is how you find the exact instruction inside a hot function.

{`perf annotate --stdio --symbol=hot_function
#      │
#  0.00 │  mov    %rax,%rdx
#  0.00 │  test   %rax,%rax
# 78.34 │  imul   %rax,%rax               ← hot multiply inside loop
# 12.51 │  add    $0x1,%eax
#  8.65 │  cmp    $0x64,%eax
#  0.50 │  jne    ..`}

perf + DWARF: When Binaries Are Stripped

If your binary was compiled without frame pointers (common with -O2) and lacks DWARF debug info, perf record will produce stacks that end in [unknown]. Three solutions:

Compile with -fno-omit-frame-pointer -g in the builds you intend to profile
Install debuginfo packages on the host (debuginfo-install on RHEL/Fedora)
Use --call-graph dwarf which uses DWARF stack unwinding instead of frame pointers

{`# Check what your perf data has
perf report --stdio --no-children | head -40

# If you see [unknown], check for missing debug info
objdump -h ./mybinary | grep -i debug    # does it have .debug_info?
readelf -S ./mybinary | grep debug       # list debug sections

# Fix: install debuginfo on RHEL/Fedora
sudo debuginfo-install mypackage

# Or rebuild with frame pointers
CFLAGS="-fno-omit-frame-pointer -g" ./configure && make`}

Flame Graphs

A flame graph is a stacked bar chart where each bar represents a stack frame, the width represents the proportion of samples that include that frame, and the stacking shows the full call path. The visual encoding is deliberate: the hottest code paths produce the widest bars, and wide bars at the top edge draw the eye to the functions that were actually on-CPU. Introduced by Brendan Gregg in 2011, flame graphs have become the de facto standard for visualizing CPU and memory profiles.

How to Generate a Flame Graph

The canonical tool is Brendan Gregg's FlameGraph suite. The workflow: collect raw stack samples, fold them into a single-line-per-stack format, render to SVG.

{`# 1. Collect stacks with perf
sudo perf record -F 99 -a -g -- sleep 60
perf script --header > /tmp/out.stacks

# 2. Clone the FlameGraph toolkit
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# 3. Fold stacks: many-sample-lines → one-line-per-unique-stack
./stackcollapse-perf.pl /tmp/out.stacks > /tmp/out.folded

# out.folded looks like:
# main;handle_request;db.Query;index_scan 12345
# main;handle_request;json.Marshal 7890
# main;gc_run;gcDrain 4567

# 4. Render to interactive SVG
./flamegraph.pl /tmp/out.folded > /tmp/flame.svg
# Open in a browser — click to zoom into any frame`}
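The folding step is simple enough to sketch by hand. The example below is not the FlameGraph implementation, only an illustration of the idea: `samples` is a made-up list of stacks (root first), and `fold` collapses identical stacks into the one-line-per-stack format shown above.

```python
from collections import Counter

# Hypothetical raw samples, one per profiler tick, root first.
samples = [
    ["main", "handle_request", "db.Query", "index_scan"],
    ["main", "handle_request", "json.Marshal"],
    ["main", "handle_request", "db.Query", "index_scan"],
    ["main", "gc_run", "gcDrain"],
]

def fold(stacks):
    """Collapse identical stacks into 'frame;frame;... count' lines."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

folded = fold(samples)
print("\n".join(folded))
```

Sorting the folded lines is what lets the renderer merge adjacent stacks that share a prefix into a single wide bar.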

Reading a Flame Graph

The horizontal axis is sample population, not the passage of time: stacks are merged and sorted alphabetically, so left-to-right order carries no temporal meaning. Each column is one unique stack. The width of each bar is proportional to how many samples included that frame. The vertical axis is stack depth; the top of each column is the leaf (the function executing when the sample hit).

Finding the bottleneck: Look for wide bars at the top edge (plateaus). A wide top-edge bar means a specific leaf function was on-CPU in a large share of samples. Self time always sits at the top of a stack, never buried in the middle. Wide mid-level bars indicate a function whose callees account for a lot of time; the function itself is not necessarily hot.

Colors usually carry no meaning: In Brendan Gregg's default color scheme, hues are chosen randomly from a warm palette just to tell adjacent frames apart. In differential flame graphs, red means "got worse" and blue means "got better." Some tools instead encode the profile type (CPU = red/orange, alloc = blue, I/O = green) or the code type (kernel, user, JIT).
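The distinction between a frame's own time and the time of everything beneath it can be made concrete with a few lines of code. This is a sketch over made-up folded stacks; `self_and_total` is a hypothetical helper, not part of any profiling tool.

```python
from collections import defaultdict

# Folded stacks (root first) with sample counts; values are illustrative.
folded = {
    "main;handle_request;db.Query;index_scan": 50,
    "main;handle_request;json.Marshal": 30,
    "main;handle_request": 5,
    "main;gc_run;gcDrain": 15,
}

def self_and_total(folded):
    """Return ({frame: self_samples}, {frame: total_samples})."""
    self_c, total_c = defaultdict(int), defaultdict(int)
    for stack, n in folded.items():
        frames = stack.split(";")
        self_c[frames[-1]] += n          # the leaf frame was executing
        for frame in set(frames):        # set(): count recursive frames once
            total_c[frame] += n
    return dict(self_c), dict(total_c)

self_c, total_c = self_and_total(folded)
for frame in sorted(total_c, key=total_c.get, reverse=True):
    print(f"{frame:16s} total={total_c[frame]:3d} self={self_c.get(frame, 0):3d}")
```

Here `handle_request` is wide (85 of 100 samples) but has almost no self time, so the work to optimize lives in its callees.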

Differential Flame Graphs

The most powerful flamegraph technique for performance work: compare a baseline profile to a post-fix profile. The output shows the delta: bars that grew are drawn in red (regression), bars that shrank in blue (improvement). This eliminates the psychological trap of seeing a small hot spot and missing that a bigger one appeared.

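{`# After optimizing, record a new profile (-o must come before the -- separator)
sudo perf record -F 99 -a -g -o /tmp/after.perf -- sleep 60

# Fold both (stackcollapse-perf.pl reads perf script output, not perf.data)
perf script -i /tmp/before.perf | ./stackcollapse-perf.pl > /tmp/before.folded
perf script -i /tmp/after.perf  | ./stackcollapse-perf.pl > /tmp/after.folded

# Generate differential SVG (bar widths come from the "after" profile)
./difffolded.pl /tmp/before.folded /tmp/after.folded \
    | ./flamegraph.pl > /tmp/diff.svg

# Colors: red = more samples than before, blue = fewer.
# Swap the inputs and add --negate to draw widths from the "before" profile instead.`}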
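Conceptually, the differential step is a per-stack subtraction of two folded profiles. The sketch below illustrates that idea with made-up counts; `parse_folded` and `diff_profiles` are hypothetical helpers and do not reproduce difffolded.pl's output format.

```python
def parse_folded(lines):
    """Parse 'stack count' lines into {stack: count}."""
    out = {}
    for line in lines:
        stack, _, count = line.rpartition(" ")
        out[stack] = out.get(stack, 0) + int(count)
    return out

def diff_profiles(before, after):
    """Return {stack: after - before}; positive = grew, negative = shrank."""
    stacks = set(before) | set(after)
    return {s: after.get(s, 0) - before.get(s, 0) for s in stacks}

before = parse_folded(["main;db.Query 700", "main;json.Marshal 300"])
after = parse_folded(["main;db.Query 200", "main;json.Marshal 310", "main;cache.Get 40"])

delta = diff_profiles(before, after)
for stack, d in sorted(delta.items(), key=lambda kv: kv[1]):
    print(f"{d:+5d}  {stack}")
```

If the two recordings contain very different total sample counts, normalize them first; otherwise every stack appears to grow or shrink simply because one run was longer or busier.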

Continuous Profiling: Architecture

Ad-hoc profiling means SSH'ing to a sick box, running perf record for sixty seconds, scp'ing the trace home, and squinting at a flamegraph. By the time you have data, the incident has moved on. Continuous profiling instead samples every host all the time at low overhead (typically <1% CPU) and ships the aggregated stacks to a query backend. You can ask "show me the top CPU consumer across the fleet over the last hour, broken down by service" without ever logging into a host.

The data model is uniform: a flamegraph is a tree of stack frames with a count (CPU samples, allocated bytes, blocked time). The wire format is Google's pprof protobuf. The collection mechanism is per-language: async-profiler for Java, perf+pprof for Linux native, runtime/pprof for Go, and increasingly eBPF for everything else.
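Because every profile reduces to stacks with counts plus a set of labels, fleet-wide questions become simple aggregations. The sketch below assumes a simplified in-memory shape for agent reports (a labels dict and a folded-stack dict); real backends store and merge pprof data, and the service and host names are invented.

```python
from collections import Counter, defaultdict

# Hypothetical per-host reports: (labels, {folded_stack: samples}).
reports = [
    ({"service": "checkout", "host": "a"}, {"main;pay;sign": 40, "main;pay;log": 10}),
    ({"service": "checkout", "host": "b"}, {"main;pay;sign": 35}),
    ({"service": "search",   "host": "c"}, {"main;rank;score": 90}),
]

def aggregate_by(reports, label):
    """Sum stack counts across all reports sharing the same label value."""
    merged = defaultdict(Counter)
    for labels, stacks in reports:
        merged[labels[label]].update(stacks)
    return merged

by_service = aggregate_by(reports, "service")
for service, stacks in by_service.items():
    top_stack, top_count = stacks.most_common(1)[0]
    print(f"{service}: {sum(stacks.values())} samples, hottest = {top_stack} ({top_count})")
```

Grouping by a different label (host, pod, version) is the same operation with a different key, which is what makes "broken down by service" a query rather than a login.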

Continuous Profiling Systems

The continuous profiling ecosystem has converged on a few major systems. All speak pprof and expose flamegraph UIs; they differ in agent strategy, storage engine, and operational model.

System	Agent Strategy	Storage	Notes
Grafana Pyroscope	Per-language SDKs + eBPF	Parquet on object store	formerly Pyroscope + Phlare merged; part of Grafana LGTM stack
Parca	eBPF-only (Parca Agent)	FrostDB (columnar)	Zero instrumentation required; kernel 5.x required
Polar Signals Cloud	Same eBPF agent as Parca	Hosted FrostDB	Founded by original Parca team; fully managed SaaS
Datadog Profiler	Per-language SDK in Datadog Agent	Datadog SaaS	Strong APM integration (links profiles to traces); per-host pricing
Profefe	Pull-based scraping of /debug/pprof	BadgerDB / S3	Open source; self-hosted; Kubernetes-native
Elastic APM Profiling	eBPF + per-lang SDK	Elasticsearch	Part of Elastic APM; integrated in Kibana

The pprof Wire Format

pprof is a Protocol Buffer format that encodes a sample-typed profile. Go's runtime/pprof produces it natively, many other profilers can export or be converted to it, and the continuous profiling backends ingest it. Understanding the schema helps when debugging symbolization issues and writing custom exporters.

{`message Profile {
  repeated ValueType sample_type = 1;     // e.g. [["samples","count"],["cpu","nanoseconds"]]
  repeated Sample    sample      = 2;     // the actual samples
  repeated Mapping   mapping     = 3;     // /lib/libc.so loaded at 0x7f...
  repeated Location  location    = 4;     // address → function + line number
  repeated Function  function    = 5;     // name, system_name, filename
  repeated string    string_table= 6;     // deduplicated strings; index 0 is ""
  ValueType          period_type = 11;    // unit of the sampling period
  int64              period      = 12;    // events between samples, in period_type units
  repeated int64     comment     = 13;    // free-form comments (string_table indices)
}

message Sample {
  repeated uint64 location_id = 1;        // stack trace (leaf-first order)
  repeated int64  value       = 2;        // one value per sample_type
  repeated Label  label       = 3;        // span_id, trace_id, pod, service_name
}

// Sample types for a Go CPU profile:
//   samples/count, cpu/nanoseconds
// Sample types for a Go heap profile:
//   alloc_objects/count, alloc_space/bytes   (cumulative since program start)
//   inuse_objects/count, inuse_space/bytes   (currently live)`}
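The indirection through ids and the string table is easier to follow with a toy decoder. The sketch below uses plain dicts as stand-ins for decoded messages, not the real protobuf API, and simplifies Location to a single function_id (the actual schema nests a list of Line entries). All ids and names are invented.

```python
# Plain-Python stand-ins for decoded pprof messages (not the real protobuf API).
string_table = ["", "main", "handle_request", "json.Marshal"]
functions = {1: {"name": 1}, 2: {"name": 2}, 3: {"name": 3}}  # name = string_table index
locations = {10: {"function_id": 1}, 20: {"function_id": 2}, 30: {"function_id": 3}}
samples = [
    {"location_id": [30, 20, 10], "value": [7]},  # leaf first
    {"location_id": [20, 10], "value": [2]},
]

def symbolize(sample):
    """Resolve location ids to function names and return the stack root first."""
    names = [
        string_table[functions[locations[loc]["function_id"]]["name"]]
        for loc in sample["location_id"]
    ]
    return list(reversed(names))

folded = {";".join(symbolize(s)): s["value"][0] for s in samples}
print(folded)
```

When a backend shows hex addresses instead of names, it is usually this chain (location to function to string) that is incomplete, typically because the Function entries were never populated.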

Java and .NET Profilers

Managed runtimes present unique profiling challenges: JIT-compiled code hides call frames, safepoint-based sampling introduces bias, and large heap sizes make heap profiling expensive. Specialized profilers handle these.

async-profiler (Java)

Traditional JVMTI-based samplers suffer from safepoint bias: samples land only at JVM safepoints, systematically missing code that runs between safepoints (tight loops, certain GC phases). JFR (Java Flight Recorder) samples asynchronously, but unless -XX:+DebugNonSafepoints is set it can still attribute samples to the nearest location that has safepoint debug info. async-profiler uses AsyncGetCallTrace, an unofficial HotSpot API that walks the stack from a signal handler without bringing the JVM to a safepoint.

{`# $PID below is the target JVM's process id
# Attach to a running JVM, sample CPU for 30 seconds
./profiler.sh -d 30 -e cpu -f /tmp/cpu.html $PID

# Allocation profile: sample every 512KB of heap allocation
./profiler.sh -d 30 -e alloc -f /tmp/alloc.html $PID

# Lock contention: who is blocking on monitors?
./profiler.sh -d 30 -e lock -f /tmp/lock.html $PID

# Wall-clock: where does time actually go (on + off CPU)?
./profiler.sh -d 30 -e wall -f /tmp/wall.html $PID

# Output as pprof for upload to Pyroscope/Parca
./profiler.sh -d 30 -e cpu -o pprof -f /tmp/cpu.pb.gz $PID

# Cap the recorded Java stack depth at 64 frames
./profiler.sh -d 30 -e cpu -j 64 -f /tmp/cpu.html $PID`}

JDK Flight Recorder (JFR)

JFR is the JVM's built-in profiler. Unlike async-profiler, it runs inside the JVM and has full access to internal events (GC, class loading, JIT compilation). Its method sampling is less precise than async-profiler's, but it produces richer diagnostic data than CPU samples alone. For continuous profiling in production, async-profiler + Pyroscope is the better choice. For JVM diagnostics (GC tuning, class loading storms), JFR is irreplaceable.

{`# Start a JVM with JFR enabled (continuous recording)
java -XX:StartFlightRecording=filename=recording.jfr,dumponexit=true,maxsize=256M \
     -XX:FlightRecorderOptions=stackdepth=256 \
     -jar myapp.jar

# Dump JFR to file via jcmd ($PID = JVM process id)
jcmd $PID JFR.dump filename=/tmp/recording.jfr

# Analyze with JDK Mission Control (JMC)
jmc -open recording.jfr

# Or via CLI with jfr command (JDK 14+)
jfr summary recording.jfr
jfr metadata recording.jfr`}

dotnet-trace / dotnet-counters (.NET)

The .NET runtime exposes diagnostics through EventPipe. The dotnet-trace, dotnet-counters, and dotnet-dump CLI utilities consume them; they are installed separately as global tools (dotnet tool install --global).

{`# Collect a CPU trace for 30 seconds ($PID = target process id)
# Profile names vary by tool version; run: dotnet-trace list-profiles
dotnet-trace collect --process-id $PID \
    --duration 00:00:00:30 --profile cpu-sampling

# Monitor selected System.Runtime counters live (add more as needed)
dotnet-counters monitor --process-id $PID \
    --counters 'System.Runtime[cpu-usage,gen-0-size]'

# Collect a heap dump for dotnet-dump analysis
dotnet-dump collect -p $PID

# Analyze in dotnet-dump interactive CLI
dotnet-dump analyze /tmp/core_dump
# > dumpheap -stat          ← show live object sizes
# > gcroot ADDRESS          ← find roots keeping the object at ADDRESS alive
# > clrthreads              ← list managed threads`}

Java Profiling Labels: Linking Profiles to Requests

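A key feature of continuous profiling is correlating a profile to a specific request or trace. The Pyroscope Java agent, which is built on async-profiler, supports labels (key-value pairs attached to samples), and Pyroscope uses them to filter profiles by trace_id, endpoint, or any custom label. Keep label cardinality in mind: a label with one value per request makes storage and queries more expensive.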

{`// Java: attach a profiling label to everything run inside the wrapper.
// Sketch assuming the Pyroscope Java agent; the labels package name
// differs between agent versions, so check the one you ship.
import io.pyroscope.labels.LabelsSet;
import io.pyroscope.labels.Pyroscope;

public Response handleRequest(Request req) {
    Response[] result = new Response[1];
    // Samples taken while the lambda runs carry the trace_id label
    Pyroscope.LabelsWrapper.run(
        new LabelsSet("trace_id", req.getTraceId()),
        () -> { result[0] = process(req); });
    return result[0];
}

// Now in Pyroscope: filter by trace_id="abc-123"
// to see the flamegraph for exactly that slow request`}

Database Profiling: PostgreSQL and ClickHouse

Database performance problems are often invisible from the application side. A query that takes 500ms doesn't show up as a hot CPU flamegraph — it shows up as wall-clock time blocked on the network. Database-level profiling gives you the query plan and execution statistics.

PostgreSQL: pg_stat_statements and auto_explain

pg_stat_statements tracks query-level statistics across all queries, enabling identification of the top consumers by calls, total time, and I/O. auto_explain logs query plans for slow queries automatically.

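{`-- Both modules must be preloaded, which requires a server restart.
-- Note: this overwrites any existing shared_preload_libraries value.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements', 'auto_explain';

-- After the restart: pg_stat_statements needs its extension objects;
-- auto_explain is a plain loadable module and has no CREATE EXTENSION step
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Log plans for queries slower than 100ms
ALTER SYSTEM SET auto_explain.log_min_duration = '100ms';
SELECT pg_reload_conf();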

-- View the top queries by total execution time
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
       shared_blks_hit,
       shared_blks_read,
       temp_blks_read,
       temp_blks_written
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  20;

-- Find queries that do a lot of temp disk spills
SELECT query, temp_blks_read, temp_blks_written
FROM   pg_stat_statements
WHERE  temp_blks_read + temp_blks_written > 0
ORDER  BY temp_blks_read + temp_blks_written DESC
LIMIT  10;`}

PostgreSQL: EXPLAIN ANALYZE

EXPLAIN ANALYZE executes the query and returns the actual execution plan with per-node timing and row counts. Prefer ANALYZE over plain EXPLAIN when diagnosing a slow query: plain EXPLAIN shows only planner estimates, which can be wildly wrong when statistics are stale or predicates are correlated. Because ANALYZE really runs the statement, wrap INSERT, UPDATE, or DELETE in a transaction and roll it back.

{`EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, count(o.id) AS order_count, sum(o.total) AS revenue
FROM   users u
JOIN   orders o ON o.user_id = u.id
WHERE  u.created_at > '2024-01-01'
GROUP  BY u.id, u.name
ORDER  BY revenue DESC
LIMIT  20;

-- Key things to look for:
-- 1. "actual rows" vs estimated "rows": large discrepancy = stale statistics → run ANALYZE
-- 2. "Seq Scan" on large tables: possibly a missing index
-- 3. "Sort Method: external merge  Disk: ...kB": the sort spilled to disk
-- 4. "Hash Join" with "Batches" greater than 1: the hash table spilled to disk
-- 5. "Nested Loop" with a high "loops" count on the inner side → potentially O(n²) disaster`}

PostgreSQL: pg_stat_activity — Live Query Inspection

pg_stat_activity shows every active query and what it's doing right now. Use it to find queries that are blocking, long-running, or waiting on locks.

{`-- Who is running what, right now?
SELECT pid, usename, application_name, state,
       query_start, state_change,
       round(extract(epoch FROM now() - query_start)::numeric, 1) AS duration_s,
       wait_event_type, wait_event,
       left(query, 200) AS query_preview
FROM   pg_stat_activity
WHERE  state != 'idle'
  AND  pid != pg_backend_pid()
ORDER  BY query_start;

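-- Find lock requests that are still waiting
SELECT pid, relation::regclass, mode, granted, fastpath
FROM   pg_locks
WHERE  NOT granted;

-- Find the sessions doing the blocking, and what they are running
SELECT blocked.pid  AS blocked_pid,
       blocking.pid AS blocking_pid,
       left(blocking.query, 200) AS blocking_query
FROM   pg_stat_activity AS blocked
JOIN   pg_stat_activity AS blocking
  ON   blocking.pid = ANY (pg_blocking_pids(blocked.pid));`}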

ClickHouse: system.query_log and system.query_thread_log

ClickHouse logs every query to system.query_log with per-query metrics including read rows, read bytes, memory usage, and query duration. It's a time-series table you can query with SQL. The query_thread_log additionally logs per-thread statistics for multi-threaded queries.

{`-- Top queries by peak memory (last 1 hour)
SELECT query, query_kind, length(thread_ids) AS threads,
       round(memory_usage/1024/1024, 2) AS mem_mb,
       round(read_rows/1e6, 2)          AS rows_m,
       round(read_bytes/1024/1024/1024, 2) AS gb_read,
       formatDateTime(event_time, '%Y-%m-%d %H:%i:%s') AS ts,
       round(query_duration_ms/1000, 2) AS dur_s
FROM   system.query_log
WHERE  type IN ('QueryFinish', 'ExceptionWhileProcessing')
  AND  event_time > now() - INTERVAL 1 HOUR
  AND  memory_usage > 0
ORDER  BY memory_usage DESC
LIMIT  20;

-- Query profile: per-thread breakdown for a specific query
SELECT thread_id, read_rows, read_bytes,
       peak_memory_usage, query_duration_ms
FROM   system.query_thread_log
WHERE  event_date = today()
  AND  query LIKE '%my_slow_query_pattern%'
ORDER  BY read_bytes DESC;

-- Slowest finished queries with their ProfileEvents counters (100ms+ queries)
SELECT query, ProfileEvents,
       round(query_duration_ms/1000, 2) AS dur_s,
       read_rows, result_rows
FROM   system.query_log
WHERE  type = 'QueryFinish'
  AND  query_duration_ms > 100      -- > 100 ms
ORDER  BY query_duration_ms DESC
LIMIT  10;`}

Profiling in Kubernetes

Kubernetes environments add a layer of indirection: the process you want to profile is inside a container, inside a pod, on a node you may not have direct access to. There are three main approaches.

Approach 1: kubectl exec into the Pod

The most straightforward approach: exec into the container and run profiling tools directly. Works when the container image includes the profiler binaries or you can install them.

{`# Find the pod
kubectl get pods -n production | grep my-service
# NAME                     READY   STATUS    RESTARTS   AGE
# my-service-5f4b8c9-x2k7  2/2     Running   0          5d

# Exec in and profile
kubectl exec -it -n production my-service-5f4b8c9-x2k7 -- sh
# Inside container:
apk add perf                    # if Alpine-based
apt-get install -y linux-perf   # if Debian-based (Ubuntu: linux-tools-generic)

# Record CPU for 30 seconds
perf record -F 99 -a -g -- sleep 30
# Exit and copy out the perf.data
kubectl cp production/my-service-5f4b8c9-x2k7:/root/perf.data ./perf.data`}

Approach 2: Profile Sidecar

Deploy a profiling sidecar alongside your application container. The sidecar runs the profiler and ships the resulting data to a storage backend. This approach doesn't require modifying your application container and works for any language.

{`# Example: Pyroscope sidecar in Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        env:
        - name: PYROSCOPE_SERVER_ADDRESS
          value: "http://pyroscope:4040"
        - name: PYROSCOPE_APPLICATION_NAME
          value: "my-service"
        - name: PYROSCOPE_PROFILER_ALLOC
          value: "true"
        - name: PYROSCOPE_PROFILER_CPU
          value: "true"
      - name: pyroscope-agent          # sidecar: legacy "pyroscope exec" wrapper
        image: pyroscope/pyroscope:latest
        args:
        - /bin/pyroscope
        - exec
        - python3                      # each argv element is its own list item
        - /myapp/main.py               # profile the main process
        env:
        - name: PYROSCOPE_SERVER_ADDRESS
          value: "http://pyroscope:4040"`}

Approach 3: Host PID Namespace Access

For node-level profiling (e.g., profiling a process that runs as a Kubernetes system service), use a privileged DaemonSet with access to the host's PID namespace.

{`# Privileged profiling DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: perf-agent
spec:
  selector:
    matchLabels:
      app: perf-agent
  template:
    metadata:
      labels:
        app: perf-agent
    spec:
      hostPID: true                    # see every process on the node
      containers:
      - name: perf
        image: alpine:latest
        command: ["sh", "-c", "while true; do sleep 3600; done"]
        securityContext:
          privileged: true             # privileged is a container-level field
        volumeMounts:
        - name: src
          mountPath: /debug
      volumes:
      - name: src
        hostPath:
          path: /sys/kernel/debug`}

Allocation Profiling

Allocation profiling tells you which call paths allocate the most memory. This is distinct from a heap dump — it measures allocation rate, not current live set. The key question it answers is "where is GC pressure coming from?" Reducing allocation rate is the most effective way to lower GC overhead in managed languages.

tcmalloc Heap Checker (C++, Go)

gperftools' tcmalloc includes a heap checker that records live allocations at a point in time, and a heap profiler that tracks allocation rates. Go's runtime uses a tcmalloc-inspired allocator, and Go's pprof heap endpoint exposes similar data.

{`# Go: heap profile (current live allocations)
curl -s http://localhost:6060/debug/pprof/heap > heap.pb

# Go: delta heap profile over a 30-second window
# (difference between snapshots taken at the start and end of the interval)
curl -s 'http://localhost:6060/debug/pprof/heap?seconds=30' > heap_delta.pb

# pprof text output
go tool pprof -text heap.pb
# Showing nodes accounting for 100% of 45.28MB of allocation space
# File: heap.go
#     flat  flat%   sum%   cum   cum%
#   23.50MB 51.9%  51.9% 23.50MB 51.9%  main.makeLargeAlloc
#   12.30MB 27.2%  79.1% 12.30MB 27.2%  main.processOrders
#    5.20MB 11.5%  90.6%  8.10MB 17.9%  main.serializeResponse
#    ...

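# Go: allocation profile (everything allocated since program start)
# Sampling is controlled by runtime.MemProfileRate (default: one sample per 512KB allocated)
curl -s http://localhost:6060/debug/pprof/allocs > alloc.pb
go tool pprof -sample_index=alloc_space -text alloc.pb`}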

jemalloc Stats

jemalloc (used by Firefox, Dropbox, and many high-throughput services) exposes detailed allocation statistics via its stats API. The mallctl interface lets you query cumulative allocation counters broken down by size class; attributing allocations to call sites requires jemalloc's heap profiler.

{`# jemalloc: print stats to stderr when the process exits
MALLOC_CONF=stats_print:true ./my_service 2> jemalloc_stats.txt

# jemalloc_stats.txt excerpt (abridged and reformatted for illustration):
# 
# Background threads: 4
# Config:  opt_prof, opt_stats_print
# Allocated:   2,345,678,912 bytes
# Active:      2,567,890,123 bytes
# Metadata:       45,678,901 bytes
# Resident:    3,456,789,012 bytes
# 
# Allocation counters:
#  small:       2,123,456,789 ops
#  large:             456,789 ops
#  huge:                1,234 ops
# 
# By size class:
#  256B:       234,567 allocations,  60,123,456 bytes
#  512B:       123,456 allocations,  63,234,567 bytes
#  4096B:        34,567 allocations, 141,234,567 bytes
#  large:       456,789 allocations, 987,654,321 bytes
#  huge:         1,234 allocations, 234,567,890 bytes

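# Enable heap profiling (jemalloc must be built with --enable-prof);
# prof_final:true writes a .heap dump when the process exits
MALLOC_CONF="prof:true,prof_active:true,prof_final:true" ./my_service
# Analyze the dump with jeprof (ships with jemalloc)
jeprof --text ./my_service jemalloc_heap.1.heap`}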

Reading Allocation Flame Graphs

Allocation flame graphs look like CPU flame graphs but encode a different quantity. A bar representing 10MB of allocation at the json.Marshal leaf means "10MB was allocated by calls to json.Marshal across all sampled stacks." This is not memory in use; it is the cumulative amount allocated over the profile window. If you see 1GB allocated in 30 seconds and your heap is 512MB, the collector must have reclaimed at least 512MB during that window, and with a non-trivial live set that takes several GC cycles.

{`# async-profiler allocation output (HTML flamegraph)
./profiler.sh -d 30 -e alloc -f /tmp/alloc.html $PID   # $PID = target JVM process id

# The flamegraph shows:
# - Width = cumulative bytes allocated (not current heap usage)
# - Colors are typically blue-to-purple gradient
# - Look for: many wide bars at a leaf function = high allocation rate
# - Target: find the call sites that allocate the most
#   and check if those allocations can be cached/reused`}

Latency Profiling: Histograms and Percentiles

A latency profile isn't a stack trace — it's a histogram of request durations. The goal is to understand the full distribution, not just the average. Optimizing for average latency while ignoring P99 is a common mistake.

p50 / p95 / p99 / p999 — What Each Percentile Tells You

Percentiles are not equally useful. p50 (the median) describes the typical request; if it rises together with p99, the whole distribution has shifted, which usually points to a load-dependent problem rather than a few outliers. p95 captures the start of the long tail that a noticeable share of users experience. p99 catches rare but real problems and is a common SLO threshold. p999 (the slowest 0.1%) is where the rarest worst-case events show up, often correlated with GC pauses, OS scheduler preemption, or external dependency timeouts.

{`# Go: latency histogram with prometheus.NewHistogram from the Prometheus client library
histogram := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request latency in seconds",
    Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
})

# Python: use prometheus_client
from prometheus_client import Histogram
h = Histogram('request_latency_seconds', 'HTTP request latency',
             buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5])

# What the percentiles mean:
# p50  = 10ms   → median user sees this latency
# p95  = 250ms  → 19 in 20 users see this or less
# p99  = 800ms  → 1 in 100 users wait longer (often the SLO threshold)
# p999 = 2.5s   → 1 in 1000 users sees a rare pathological case
#            → check this for GC pauses, major page faults`}
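
To make the percentile definitions concrete, here is a minimal sketch that computes them from raw samples with the nearest-rank method. The latency values are made up for illustration, and production systems usually estimate percentiles from histogram buckets or sketches instead of keeping every sample.

```python
import math


def percentile(sorted_samples, q):
    """Nearest-rank percentile: smallest value with at least a fraction q of samples at or below it."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q * len(sorted_samples)))
    return sorted_samples[rank - 1]


# Hypothetical latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = sorted([10] * 900 + [250] * 85 + [800] * 13 + [2500] * 2)

summary = {name: percentile(latencies_ms, q)
           for name, q in [("p50", 0.50), ("p95", 0.95), ("p99", 0.99), ("p999", 0.999)]}
mean = sum(latencies_ms) / len(latencies_ms)
print(summary, f"mean={mean:.2f} ms")
```

In this data set the mean sits far from both the median and the tail, which is why an average alone says little about what users experience.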

Latency Flame Graphs

Standard flame graphs show CPU time distribution. Latency-attributed flame graphs are different: each sample is a stack trace annotated with the latency of the request that produced it. A request that took 5 seconds contributes a 5-second weight to every stack frame in its trace. This makes the slowest requests visually dominate the flame graph, immediately revealing which call path is responsible for the tail.

{`# Pyroscope: latency-weighted flamegraph
# Set the profile type to "wall" and filter by high-latency requests
# Illustrative query (exact selector syntax and available labels depend on the backend version):
/my-service:wall{profile_type="wall",quantile="p99"}
# → Shows the flamegraph weighted by p99-latency requests

# This is equivalent to:
# For each request with latency > p99 threshold:
#   Record a wall-clock sample for each stack frame
#   Weight the sample by (request_latency / sample_interval)
# Stack frames from slow requests become wider → immediately visible

# Differential latency flamegraph: which path regressed?
/my-service:wall{compare="1h"}
# → Compare current hour vs previous hour, widths are deltas`}
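
The weighting idea can be sketched independently of any backend. The snippet below assumes you already have one stack per request together with that request's latency (the sample data and frame names are hypothetical) and emits folded-stack lines, the "stack weight" text format that flamegraph.pl consumes.

```python
from collections import defaultdict


def latency_weighted_folded(samples):
    """Sum request latency (ms) per stack and return folded-stack lines.

    samples: iterable of (stack_frames, request_latency_ms) pairs, where
    stack_frames is a list ordered from root to leaf.
    """
    weights = defaultdict(int)
    for frames, latency_ms in samples:
        weights[";".join(frames)] += latency_ms
    # Heaviest stacks first, so the tail-dominating path is at the top.
    ordered = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{stack} {weight}" for stack, weight in ordered]


# Hypothetical samples: many fast requests on one path, one slow request on another.
samples = [(["handler", "cache.Get"], 5)] * 100 + [(["handler", "db.Query"], 5000)]
for line in latency_weighted_folded(samples):
    print(line)
```

One slow request outweighs a hundred fast ones here, which is the intended effect: width follows user-visible waiting time, not request count.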

Building a Latency Histogram from Scratch

When you don't have a metrics library, you can build a simple histogram using a logarithmic bucketing strategy. The key design choice is the bucketing scheme: linear buckets (0-1ms, 1-2ms, ...) work for fine-grained latency; logarithmic buckets (0-1ms, 1-10ms, 10-100ms, 100ms-1s) work for capturing the full tail.

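{`import java.util.LinkedHashMap;
import java.util.Map;

// Simple latency histogram with roughly logarithmic buckets
class LatencyHistogram {
    // Upper bounds in ns: <1ms, 1-5ms, 5-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms-1s.
    // One extra bucket at the end holds everything at 1s or above.
    private static final long[] THRESHOLDS =
        {1_000_000, 5_000_000, 10_000_000, 50_000_000, 100_000_000, 500_000_000, 1_000_000_000};
    private final long[] buckets = new long[THRESHOLDS.length + 1];
    private long sum = 0;
    private long count = 0;

    public void record(long durationNanos) {
        int i = 0;
        while (i < THRESHOLDS.length && durationNanos >= THRESHOLDS[i]) {
            i++;
        }
        buckets[i]++;   // i == THRESHOLDS.length is the 1s+ overflow bucket
        sum += durationNanos;
        count++;
    }

    // Estimates a percentile as the upper bound of the bucket containing the target rank.
    // Exact percentiles would require storing raw samples or using a sketch such as t-digest.
    public long percentileUpperBoundNanos(double q) {
        if (count == 0) {
            return 0;
        }
        long target = (long) Math.ceil(q * count);
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i];
            if (seen >= target) {
                return i < THRESHOLDS.length ? THRESHOLDS[i] : Long.MAX_VALUE;
            }
        }
        return Long.MAX_VALUE;
    }

    public Map<String, Long> percentiles() {
        Map<String, Long> result = new LinkedHashMap<>();
        result.put("p50", percentileUpperBoundNanos(0.50));
        result.put("p95", percentileUpperBoundNanos(0.95));
        result.put("p99", percentileUpperBoundNanos(0.99));
        result.put("p999", percentileUpperBoundNanos(0.999));
        return result;
    }
}`}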
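
As a complement, here is a minimal sketch of the logarithmic scheme with generated bucket bounds instead of a hand-written threshold table. The class name, the growth factor of 2, and the sample values are assumptions chosen for illustration; quantiles are reported as bucket upper bounds, so resolution is limited by bucket width.

```python
import bisect


class LogHistogram:
    """Fixed-memory latency histogram with exponentially growing buckets."""

    def __init__(self, first_bound_ms=1.0, factor=2.0, buckets=12):
        # Upper bounds: 1, 2, 4, ... ms, plus one overflow bucket at the end.
        self.bounds = [first_bound_ms * factor ** i for i in range(buckets)]
        self.counts = [0] * (buckets + 1)
        self.total = 0

    def record(self, latency_ms):
        # bisect_right returns the first bucket whose upper bound exceeds the value.
        self.counts[bisect.bisect_right(self.bounds, latency_ms)] += 1
        self.total += 1

    def quantile_upper_bound(self, q):
        """Upper bound (ms) of the bucket that holds the q-th quantile."""
        target = q * self.total
        seen = 0
        for i, c in enumerate(self.counts):
            seen += c
            if seen >= target and c > 0:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


h = LogHistogram()
for value in [0.5] * 90 + [3] * 8 + [700] * 2:
    h.record(value)
print(h.quantile_upper_bound(0.50), h.quantile_upper_bound(0.95), h.quantile_upper_bound(0.99))
```

Twelve doubling buckets cover 1 ms to about 2 s with constant memory, at the price of reporting a 700 ms sample as "at most 1024 ms".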

Practical Example: Finding a CPU Bottleneck

Here's a step-by-step walkthrough of finding and fixing a real CPU bottleneck. We start with a slow endpoint and end with a verified improvement via before/after profiles.

Step 1: Observe the Problem

Your SLO is p99 latency < 500ms, but the /api/reports endpoint is hitting 1.2s at p99. You suspect it's a database query but you're not sure. You pull up your continuous profiling dashboard and filter to this endpoint.

{`# Query your profiling backend for this endpoint
# In Grafana Phlare / Pyroscope:
# Query: my-service {endpoint="/api/reports", quantile="p99"}
# The flamegraph shows a dominant bar: sql.query("SELECT ...") = 68% of wall-clock time`}

Step 2: Get an Ad-Hoc CPU Profile

Since the continuous profile shows a hot path, confirm it with an ad-hoc profile on the live system to get higher resolution data and verify the signal.

{`# SSH to a production host running the service
ssh prod-host-03

# Find the process
ps aux | grep my-service
# USER  PID  %CPU  COMMAND
# app  12345  85%  ./my-service --config=/etc/myapp/prod.yaml

# Record CPU profile for 60 seconds during traffic
# (-a is dropped: it requests system-wide sampling and is overridden by -p)
sudo perf record -F 99 -g -p 12345 -- sleep 60
# Recording stops on its own after 60 s and writes perf.data

# Convert perf.data to text stacks
sudo perf script --header > /tmp/perf.stacks
# Now generate the flamegraph
./FlameGraph/stackcollapse-perf.pl /tmp/perf.stacks > /tmp/perf.folded
./FlameGraph/flamegraph.pl /tmp/perf.folded > /tmp/flame.svg
# scp to your laptop and open in browser`}

Step 3: Read the Flame Graph

Open the flame graph. The widest bar at the top is your hot leaf. Work your way down the stack — each frame is a caller. In this example, the flamegraph shows:

{`# Folded output (simplified)
main.handleRequest;reportController.Generate;sql.Query;indexScan  4523
main.handleRequest;reportController.Generate;sql.Query;hashAgg   1234
main.handleRequest;reportController.Generate;serializeJSON      890
main.handleRequest;reportController.Generate;db.Exec             234

# → indexScan is the bottleneck: 4523 samples / 7200 total = 62.8% of CPU time`}

The flame graph shows a very wide bar for indexScan under sql.Query. Despite the frame name, the time under it goes to a query that ends up scanning the whole reports table: no index covers the user_id filter and the created_at ordering, so Postgres reads every row.
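
The percentage quoted in the folded output can be reproduced with a short script. The sketch below parses folded stacks and reports self time (samples where a frame is the leaf) and inclusive time (samples where it appears anywhere in the stack). The last input line, main.handleRequest;other, is a hypothetical stand-in for the stacks omitted from the simplified listing, so that the total matches 7200.

```python
from collections import Counter

FOLDED = """\
main.handleRequest;reportController.Generate;sql.Query;indexScan 4523
main.handleRequest;reportController.Generate;sql.Query;hashAgg 1234
main.handleRequest;reportController.Generate;serializeJSON 890
main.handleRequest;reportController.Generate;db.Exec 234
main.handleRequest;other 319
"""


def self_and_inclusive(folded_text):
    """Return (self_counts, inclusive_counts, total_samples) from folded stacks."""
    self_counts, inclusive = Counter(), Counter()
    total = 0
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        n = int(count)
        total += n
        self_counts[frames[-1]] += n      # the leaf frame was executing
        for frame in set(frames):         # count each frame once per stack
            inclusive[frame] += n
    return self_counts, inclusive, total


self_counts, inclusive, total = self_and_inclusive(FOLDED)
for frame, n in self_counts.most_common(3):
    print(f"{frame:15s} self      {100 * n / total:5.1f}%")
print(f"{'sql.Query':15s} inclusive {100 * inclusive['sql.Query'] / total:5.1f}%")
```

Self time identifies the hot leaf; inclusive time shows how much of the request sits under a caller such as sql.Query as a whole.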

Step 4: Verify with EXPLAIN ANALYZE

Pull the exact query from the flamegraph and run EXPLAIN ANALYZE on it to see the execution plan.

{`EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, created_at, data
FROM   reports
WHERE  user_id = 42
ORDER  BY created_at DESC
LIMIT  100;

-- Output:
-- Sort  (cost=12345.67..12348.90 rows=100 width=45)
--        (actual time=1234.5..1235.1 rows=99 loops=1)
--   Sort Key: created_at DESC
--   Sort Method: quicksort  Sort Space Used: 12345kB
--   ->  Seq Scan on reports  (cost=0..12345.67 rows=99 width=45)
--         (actual time=0.1..1233.0 rows=99 loops=1)
--         Filter: (user_id = 42)
--         Rows Removed by Filter: 0
--         Buffers: shared hit=12 read=12345678  ← 12.3M pages × 8 kB ≈ 94 GiB read from disk`}

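The Seq Scan with Buffers: read=12345678 confirms it: roughly 12.3 million pages (about 94 GiB at the default 8 kB block size) read from disk to return 99 rows. A composite index on (user_id, created_at) lets Postgres locate the matching rows directly and return them already sorted, which removes both the scan and the sort.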

Step 5: Fix and Verify

{`-- Add the missing index
CREATE INDEX CONCURRENTLY idx_reports_user_created
ON reports (user_id, created_at DESC);

-- Verify with EXPLAIN ANALYZE (should now show Index Scan, not Seq Scan)
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, user_id, created_at, data
FROM   reports
WHERE  user_id = 42
ORDER  BY created_at DESC
LIMIT  100;
-- The plan now shows: Index Scan using idx_reports_user_created ...
-- Buffers: shared hit=103 read=0
-- actual time=0.05..0.45 rows=99 loops=1
-- → 12.3M pages read from disk → 103 pages (~824 kB) served from cache, latency 1234ms → 0.45ms`}

Step 6: Profile the Fixed Code

After deploying the fix, collect another profile to verify the improvement and catch any new bottlenecks that were previously hidden behind this dominant one (this is a common pattern — fixing the #1 bottleneck reveals #2).

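{`# Collect post-fix profile (look up the PID again; it changes after a redeploy)
sudo perf record -F 99 -g -p 12345 -- sleep 60
sudo perf script --header > /tmp/after.stacks
# /tmp/before.stacks is the Step 2 capture (/tmp/perf.stacks), kept under a new name

# Fold and diff
./FlameGraph/stackcollapse-perf.pl /tmp/before.stacks > /tmp/before.folded
./FlameGraph/stackcollapse-perf.pl /tmp/after.stacks  > /tmp/after.folded

# Differential flamegraph (red = grew, blue = shrank; widths come from the "after" profile)
./FlameGraph/difffolded.pl /tmp/before.folded /tmp/after.folded \\
    | ./FlameGraph/flamegraph.pl > /tmp/diff.svg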

# The diff shows:
# indexScan:   -62.8% (fixed!)
# hashAgg:      +22.1% (previously hidden, now visible)
# serializeJSON: +15.3% (previously hidden)
# → Secondary hot spots are now visible — proceed to Step 3 for hashAgg`}
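
The per-stack comparison behind a differential flame graph can be sketched as follows. This is not how difffolded.pl is implemented; it is a simplified version that compares each stack's share of total samples, and the before/after sample counts are made up.

```python
def load_folded(text):
    """Parse folded-stack text into {stack: sample_count}."""
    counts = {}
    for line in text.splitlines():
        if line.strip():
            stack, n = line.rsplit(" ", 1)
            counts[stack] = counts.get(stack, 0) + int(n)
    return counts


def share_deltas(before, after):
    """Change in each stack's share of total samples, in percentage points."""
    total_b, total_a = sum(before.values()), sum(after.values())
    deltas = {}
    for stack in set(before) | set(after):
        share_b = 100 * before.get(stack, 0) / total_b
        share_a = 100 * after.get(stack, 0) / total_a
        deltas[stack] = share_a - share_b
    return sorted(deltas.items(), key=lambda kv: kv[1])


# Hypothetical before/after profiles.
before = load_folded("app;sql.Query;indexScan 600\napp;sql.Query;hashAgg 200\napp;serializeJSON 200\n")
after = load_folded("app;sql.Query;indexScan 10\napp;sql.Query;hashAgg 250\napp;serializeJSON 240\n")
for stack, delta in share_deltas(before, after):
    print(f"{delta:+6.1f} pp  {stack}")
```

Note that hashAgg's share rises by 30 points although its absolute sample count barely changes: when the dominant stack shrinks, everything else grows in relative terms. Compare absolute counts as well before concluding that a secondary path got slower.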

Overhead Considerations

Profiling has overhead. Understanding it helps you choose safe sample rates and avoid introducing latency regressions through the act of measuring.

The Sampling Overhead Math

Profiler overhead = (sample rate × stack walk cost) + bookkeeping. Both are tunable.

{`overhead_per_sample = stack_walk_cost + bookkeeping_overhead
overhead_total      = sample_rate_hz * overhead_per_sample / 1_000_000_000   # ns → fraction

# Frame pointer walk: ~500 ns per stack
99 Hz × 500 ns  = 49,500 ns/sec = 0.005% of each CPU
# On a 4-CPU host: 198,000 ns/sec in total, still 0.005% of capacity

# DWARF unwinding: ~5 μs per stack
99 Hz × 5,000 ns = 495,000 ns/sec = 0.05% of each CPU
# On a 4-CPU host: 1,980,000 ns/sec in total, still 0.05% of capacity

# eBPF stack walk + map insert: ~2 μs
99 Hz × 2,000 ns = 198,000 ns/sec = 0.02% of each CPU
# On a 4-CPU host: 792,000 ns/sec in total, still 0.02% of capacity

# At 999 Hz on a latency-sensitive workload (1ms target):
# Each sample adds ~2-5 μs overhead → you start to see tail latency spikes`}
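
The same formula as a small calculator, useful for checking a proposed sample rate before rolling it out. The per-sample costs are the rough figures quoted above, not measurements; real costs depend on stack depth and hardware.

```python
def sampling_overhead(sample_rate_hz: float, cost_per_sample_ns: float) -> float:
    """Fraction of one CPU's time spent taking samples."""
    return sample_rate_hz * cost_per_sample_ns / 1_000_000_000


# Approximate per-sample costs (stack walk plus bookkeeping), in nanoseconds.
costs_ns = {"frame pointer": 500, "eBPF": 2_000, "DWARF": 5_000}

for method, cost in costs_ns.items():
    for rate in (99, 999):
        pct = 100 * sampling_overhead(rate, cost)
        print(f"{method:13s} @ {rate:3d} Hz: {pct:.4f}% of each CPU")
```

Average overhead stays small even at 999 Hz; the concern at high rates is the per-sample interruption landing inside a latency-critical request, which an average does not show.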

The 99 Hz Convention

The convention of sampling at 99 Hz (not 100 Hz) exists to avoid sampling in lockstep with periodic activity. Kernel timer ticks and many application timers fire at round frequencies such as 100 Hz. A profiler sampling at exactly 100 Hz would hit the same phase of that periodic work on every sample, so it would either see the work every time or never see it, and the profile would be biased. At 99 Hz the sample point drifts through the period and the bias averages out.

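{`# Check your kernel's timer frequency (CONFIG_HZ); the config path varies by distribution
grep 'CONFIG_HZ=' /boot/config-$(uname -r)
# Commonly 100, 250, 300, or 1000 depending on the distribution
# Note: getconf CLK_TCK reports USER_HZ, which is 100 on Linux regardless of CONFIG_HZ

# The math: at 99 Hz the sample period is ~10.1 ms, so against a 10 ms periodic
# tick each sample lands ~0.1 ms later in the cycle than the previous one.
# Over one second the samples sweep through the whole 10 ms period.`}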
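
The lockstep effect is easy to reproduce in a toy simulation. The sketch below assumes a hypothetical periodic task that is busy for the first 1 ms of every 10 ms period (10% of the time) and compares what a 100 Hz sampler and a 99 Hz sampler observe. Timestamps are integer microseconds to keep the arithmetic exact.

```python
def observed_busy_fraction(sample_hz: int, task_hz: int = 100,
                           busy_us: int = 1_000, seconds: int = 10) -> float:
    """Fraction of samples that land inside a periodic task's busy window.

    The task runs for busy_us microseconds at the start of every 1/task_hz
    period. Sample k is taken at k / sample_hz seconds, rounded down to a
    whole microsecond.
    """
    period_us = 1_000_000 // task_hz
    n_samples = sample_hz * seconds
    hits = 0
    for k in range(n_samples):
        t_us = k * 1_000_000 // sample_hz
        if t_us % period_us < busy_us:
            hits += 1
    return hits / n_samples


print("true busy fraction:", 1_000 / 10_000)
print("seen at 100 Hz:    ", observed_busy_fraction(100))
print("seen at  99 Hz:    ", round(observed_busy_fraction(99), 3))
```

The 100 Hz sampler reports the task as busy on every sample; with a different starting offset it would report it as never busy. The 99 Hz sampler lands close to the true 10%.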

Safe Rates by Profile Type

Profile Type	Safe Rate	Overhead Estimate	Avoid When
CPU (frame pointer)	99 Hz	<0.01%	Very latency-sensitive (<1ms target)
CPU (DWARF)	99 Hz	<0.05%	Production latency-critical paths
Wall-clock	99 Hz	<0.02%	Same cases as CPU profiling; cost grows with thread count
Heap allocation	512KB-1MB interval	1-5%	High-throughput (>100k req/s) services
Java async-profiler	99 Hz CPU / 512KB alloc	<1%	Ultra-low-latency trading systems
eBPF continuous	99 Hz	<0.02%	Kernel 4.x without BTF

Tradeoffs

Sample Rate vs Overhead

Doubling the sample rate (99 Hz → 198 Hz) doubles the overhead but barely improves the flame graph for steady-state workloads. 99 Hz produces 5,940 samples per minute per CPU, which is enough to resolve functions that account for roughly 1% of CPU; smaller regressions need a longer window or aggregation across hosts. Higher rates matter when profiling short-lived processes or burst workloads.
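
A rough way to quantify "barely improves" is the standard error of a sampled proportion. The sketch below treats samples as independent draws, which is an approximation, and reports the relative error for functions of different sizes over a one-minute window (99 × 60 = 5,940 samples per CPU).

```python
import math


def relative_error(cpu_share: float, n_samples: int) -> float:
    """Standard error of an estimated CPU share, relative to the share itself.

    Treats samples as independent Bernoulli draws, which is an approximation.
    """
    std_err = math.sqrt(cpu_share * (1 - cpu_share) / n_samples)
    return std_err / cpu_share


for rate_hz in (99, 198):
    n = rate_hz * 60                      # samples per CPU in one minute
    for share in (0.10, 0.01, 0.001):
        print(f"{rate_hz:3d} Hz, {share:6.1%} function: "
              f"±{relative_error(share, n):.0%} relative error")
```

Doubling the rate shrinks the error only by a factor of √2, while overhead doubles. Profiling for longer, or merging profiles from more hosts, buys the same precision without raising the per-host cost.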

Wall-Clock vs CPU Profile

A CPU profile samples only on-CPU threads, so it misses time spent blocked on I/O, waiting on locks, or descheduled. A wall-clock profile samples all threads regardless of state and answers "why is this RPC slow" questions that CPU profiles miss completely. For web services with downstream dependencies, wall-clock is usually the right starting point.
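
The gap between the two views shows up even with plain timers. This minimal sketch measures wall-clock and CPU time around a blocking call and a compute loop, using time.sleep as a stand-in for a slow downstream dependency; it illustrates the distinction and is not a profiler.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) consumed by calling fn()."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def blocked():
    # Stands in for a slow downstream call: the thread is off-CPU the whole time.
    time.sleep(0.2)


def compute():
    # Pure on-CPU work.
    sum(i * i for i in range(300_000))


for fn in (blocked, compute):
    wall, cpu = measure(fn)
    print(f"{fn.__name__:8s} wall={wall * 1000:6.1f} ms  cpu={cpu * 1000:6.1f} ms")
```

The blocked call costs about 200 ms of wall time and almost no CPU time, so a CPU profiler would never attribute samples to it.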

SDK vs eBPF Agent

Per-language SDKs (Pyroscope Java agent, Go's runtime/pprof) get language-aware stacks: interpreted Python frames, JIT-inlined Java methods, Go's goroutine stacks. eBPF works for any binary but struggles with managed runtimes that hide their stacks inside the interpreter or JIT. For Go and Java services in Kubernetes, start with SDK profiling. For mixed binary + container environments, eBPF is more cost-effective.

Push vs Pull Collection

Push (for example, the Pyroscope SDKs) sends profiles on a schedule from inside the process. It works for short-lived processes and serverless functions that exit before a pull scrape would fire. Pull (Phlare scraping /debug/pprof) integrates with existing Prometheus-style service discovery and is simpler operationally. Short-lived services (<10s lifetime) almost always need a push-based approach.

Heap Profile vs Heap Dump

A heap profile is statistical (sampled allocations, low overhead). A heap dump is complete (every live object, potentially hundreds of gigabytes). For routine optimization work, heap profiles are usually sufficient. Heap dumps are for debugging specific OOM situations or investigating leaks that sampling can't isolate. Never take a heap dump in production without understanding the pause time.

Symbol Storage vs On-Demand Symbolization

eBPF profilers like Parca need symbols for user-space stacks. Without symbols, stacks show as [unknown]. Options: (1) upload symbol tables to the backend per binary version, (2) run a local symbolization service on each host, (3) require debuginfo packages on hosts. Option 1 is most common for continuous profiling; options 2-3 are for ad-hoc perf work.

Key Numbers Reference

99 Hz

standard sampling rate (avoids HZ=100 timer beat)

<1%

typical CPU overhead at 99 Hz

~20 KB

size of a 10-second pprof profile (pre-compress)

10s

default push/scrape interval for Pyroscope/Parca

127

default kernel stack depth limit (PERF_MAX_STACK_DEPTH) for eBPF collectors that use the kernel's stack helpers

DWARF

debug info needed for frame-pointer-less binaries

4 weeks

typical retention for fleet-wide profile data

512 KB

default allocation sampling interval for async-profiler

~500 ns

perf frame-pointer stack walk cost per sample

~5 μs

perf DWARF stack walk cost per sample

~2 μs

eBPF stack walk + map insert cost per sample

10-100x

slowdown from valgrind/callgrind (instruction-level emulation)

Frequently Asked Questions

Is continuous profiling safe in production?

Yes, at standard sample rates (99 Hz CPU, 512 KB allocation intervals). Major hyperscalers, including Google (with Google-Wide Profiling), Meta, Netflix, and Uber, run continuous profiling on production hosts 24/7. The risk is mainly bugs in the profiler itself (segfaults, deadlocks in signal handlers). Pin profilers to released versions, roll out gradually, and monitor for anomalous restarts. For Java, async-profiler is battle-tested. For Go, runtime/pprof is in the standard library and is extremely stable.

Why are flame graphs better than top?

top tells you which process is eating CPU. Flame graphs tell you which function inside which call path is eating CPU. When several functions in different call paths all contribute to CPU usage, top shows only a per-process (or per-thread) total and nothing about the call hierarchy. Flame graphs visualize the tree directly: the width of each bar encodes the time contribution, and the stack hierarchy is preserved. Hot leaves are immediately obvious; the caller relationship is unambiguous.

How do I diff two flame graphs?

Both the FlameGraph difffolded.pl tool and continuous profiling backends (Pyroscope, Parca) support differential flame graphs. The process: collect a baseline profile (before), collect a comparison profile (after), compute the per-stack delta. Differential flame graphs color widened bars red (regression) and narrowed bars blue (improvement). By default the widths come from the second profile, so code that disappeared entirely is not drawn; to see it, pass the profiles to difffolded.pl in reverse order and add flamegraph.pl's --negate flag, which swaps the hues so that red still means growth. Pyroscope and Parca have first-class diff UIs with baseline/comparison time range pickers.

Why does my flame graph look like a pyramid of [unknown]?

Either stack unwinding or symbolization failed. The usual cause is a binary compiled without frame pointers (-fomit-frame-pointer, which GCC enables at most optimization levels) or stripped of symbols and DWARF debug info. Three fixes: (1) compile with -fno-omit-frame-pointer -g, (2) install matching debuginfo packages (debuginfo-install on RHEL, apt-get install foo-dbgsym on Debian), or (3) use --call-graph dwarf in perf, which unwinds using DWARF information instead of frame pointers.

Can I link a profile to a specific request?

Yes, via pprof labels. Each sample in a pprof profile carries a list of label key-value pairs (span_id, trace_id, route, pod, etc.). In Go, set them via runtime/pprof.SetGoroutineLabels before entering a request context. In Java with async-profiler/Pyroscope, use Profiler.setLabel(). In Python, the Pyroscope SDK auto-sets labels for Django and Flask request metadata. Filter by trace_id="abc-123" in the Pyroscope UI to see exactly which call path that specific slow request took.

What's the difference between a heap profile and a heap dump?

A heap profile is a statistical sample of allocations (typically 1 in 512KB or 1 in 1MB). Low overhead, suitable for always-on production use. A heap dump captures every live object in the process — potentially hundreds of gigabytes for large Java heaps or C++ processes. Taking a heap dump causes a stop-the-world pause while the GC walks the entire heap. Use heap dumps for debugging specific OOM crashes or isolated memory leak investigations. Use heap profiles for ongoing allocation rate monitoring and GC tuning.

What's the difference between wall-clock and CPU profiling for finding latency issues?

A CPU profile only samples threads that are actively executing on a CPU. A thread that is waiting on a network call, a lock, or disk I/O produces zero CPU samples — even though it is consuming user-facing latency. For RPC-heavy services, a CPU profile will tell you "your code is fast" while the actual bottleneck is a slow downstream service. Wall-clock profiling captures both on-CPU time (serialization, computation) and off-CPU time (I/O, waiting). Start with wall-clock; use CPU as a second step when you know the problem is in your own compute.

How do I profile a short-lived process (Lambda, job, cron)?

Push-based collection is required: a pull scraper that runs every 10 seconds will miss a Lambda that runs for 3 seconds. Options: (1) build profiling into the process startup/shutdown path (Go: runtime/pprof's StartCPUProfile and StopCPUProfile; Java: the async-profiler API), (2) Pyroscope's Lambda layer or sidecar, (3) for cron jobs, record a local pprof profile on exit and upload it to the backend asynchronously. Some backends (Profefe) have a /profile?seconds=N endpoint that starts a timer-based profile on the next push.

Why does perf report show [kernel] but no user-space stacks?

This typically means the CPU time is being spent in the kernel (system time), not in user code. This can be real (a kernel bottleneck) or a profiling setup problem. Check: (1) Are you using -g for stack collection? (2) Is the binary compiled with frame pointers or DWARF? (3) Are debuginfo packages installed? (4) Try perf report --no-branch-stack to see if there's a data collection issue. If the kernel stacks are real, use perf sched to analyze scheduler behavior, bcc-tools/offcputime.py to find off-CPU kernel time, and bcc-tools/biotop.py for block I/O.

When should I use valgrind instead of perf?

valgrind (specifically callgrind) is for deterministic, instruction-level accurate profiling of small, bounded workloads. perf uses statistical sampling — it's approximate but very fast. valgrind runs your program 10-100x slower in a synthetic CPU and counts every instruction. Use it when: (1) your workload is short (completes in seconds), (2) you need instruction-level accuracy (not just function-level), (3) you want to analyze cache behavior in detail, or (4) you need to understand algorithmic complexity (counting actual function call frequencies). Never use valgrind for production profiling — the slowdown will change the behavior of caches, locks, and I/O-bound code.