Three Pillars (Plus One)

Metrics, logs, traces — and continuous profiles as the fourth signal

The "three pillars of observability" — metrics, logs, and traces — are not three views of the same data. They are three different storage models with three different cost curves and three different question shapes. Metrics answer "is the system healthy right now?" Logs answer "what happened on this specific event?" Traces answer "where did this single request spend its time?" Continuous profiles increasingly fill a fourth role: "where is CPU and memory actually going across the fleet?"

The mature observability stack treats them as linked signals. A metric spike points to a trace exemplar. A trace span points to log lines from that span. A log line carries a trace_id you can pivot on. The "unified observability ideal" is having one query language and one identifier (typically trace_id) that flows through all four stores so that pivoting takes a click, not a context switch.
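
The pivot chain described above can be sketched with in-memory stand-ins for the stores. This is only an illustration: the `EXEMPLARS`, `SPANS`, and `LOGS` containers and the `pivot` function are hypothetical names, and real backends expose their own query APIs. The point is that trace_id is the single join key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    name: str


@dataclass(frozen=True)
class LogLine:
    trace_id: str
    span_id: str
    msg: str


# Hypothetical stand-ins for a metrics store (exemplars), a trace store, and a log store.
EXEMPLARS = {"http_request_duration_seconds": "t1"}  # metric name -> trace_id
SPANS = [
    Span("t1", "s1", "POST /charge"),
    Span("t1", "s2", "db.query"),
    Span("t2", "s3", "GET /health"),
]
LOGS = [LogLine("t1", "s2", "slow query"), LogLine("t2", "s3", "ok")]


def pivot(metric_name: str) -> dict:
    """Follow metric exemplar -> trace spans -> log lines, joining on trace_id."""
    trace_id = EXEMPLARS[metric_name]
    spans = [s for s in SPANS if s.trace_id == trace_id]
    logs = [line for line in LOGS if line.trace_id == trace_id]
    return {"trace_id": trace_id, "spans": spans, "logs": logs}


result = pivot("http_request_duration_seconds")
print(result["trace_id"])
print([s.name for s in result["spans"]])
print([line.msg for line in result["logs"]])
```

If any one store drops the trace_id, the chain breaks at that hop, which is why the identifier has to flow through every signal.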

Three Different Shapes

Each pillar has a fundamentally different cost-per-question and shape of answer.

Signal | Data shape | Key property | Query latency | Question | Tradeoff
Metrics | aggregated over time | low cardinality | ~10 ms | "Is it broken?" | cheap, lossy
Logs | per-event, structured | high cardinality OK | seconds to minutes | "What happened?" | expensive, exact
Traces | per-request span tree | causal structure | 100 ms to 1 s | "Where slow?" | sampled, structural
Profiles | stack frequency | all hosts, 99 Hz | milliseconds | "Why hot?" | ~1% overhead

The links between them are the value: metric exemplar → trace_id → trace span → log lines → profile by trace. A single click pivots through all four. That's "unified observability". OTel + Tempo + Loki + Pyroscope + Grafana is one vendor-neutral stack that does it.

Key Numbers

  • 1.6 bytes: compressed size per metric sample (Gorilla)
  • 200 bytes: structured log line (compressed)
  • 1 KB: average span on the wire (OTLP, gzip)
  • ~10x: cost ratio, traces per RPS vs metrics
  • ~100x: cost ratio, logs vs metrics for the same workload
  • ~1%: CPU overhead for continuous profiling at 99 Hz
  • 3+1: pillars (metrics, logs, traces, profiles)
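
To see how these per-unit sizes turn into daily volume, here is a rough back-of-the-envelope sketch. The workload figures (40,000 series, 15-second scrapes, 1,000 requests per second, one log line per request, five spans per trace, 1% trace sampling) are assumptions chosen for illustration. The sketch also treats 1 KB as 1,000 bytes and uses wire size as a proxy for stored size.

```python
# Per-unit sizes taken from the key numbers above.
BYTES_PER_SAMPLE = 1.6     # compressed metric sample
BYTES_PER_LOG_LINE = 200   # compressed structured log line
BYTES_PER_SPAN = 1000      # ~1 KB span on the wire (used as a storage proxy)

SECONDS_PER_DAY = 86_400


def metrics_bytes_per_day(series: int, scrape_interval_s: int) -> float:
    samples_per_series = SECONDS_PER_DAY // scrape_interval_s
    return series * samples_per_series * BYTES_PER_SAMPLE


def logs_bytes_per_day(rps: int, lines_per_request: int) -> float:
    return rps * SECONDS_PER_DAY * lines_per_request * BYTES_PER_LOG_LINE


def traces_bytes_per_day(rps: int, spans_per_trace: int, sample_rate: float) -> float:
    return rps * SECONDS_PER_DAY * sample_rate * spans_per_trace * BYTES_PER_SPAN


# Assumed (hypothetical) workload.
metrics = metrics_bytes_per_day(series=40_000, scrape_interval_s=15)
logs = logs_bytes_per_day(rps=1_000, lines_per_request=1)
traces = traces_bytes_per_day(rps=1_000, spans_per_trace=5, sample_rate=0.01)

for name, value in [("metrics", metrics), ("logs", logs), ("traces", traces)]:
    print(f"{name:8s} {value / 1e9:6.2f} GB/day")
print(f"logs / metrics ratio: {logs / metrics:.1f}")
```

Under these assumed inputs, logs come out at roughly 47 times the metrics volume and 1%-sampled traces at roughly 12 times. The ratios move a lot with series count, log verbosity, and sampling rate, so treat the output as an order-of-magnitude check rather than a forecast.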

Metrics: When to Use

Metrics answer aggregate questions cheaply. They lose information about individual events but compress beautifully and stay queryable for years.

  • Health questions. "Is the error rate elevated right now?" needs counters and gauges, not log search.
  • SLI computation. SLOs are defined as ratios of metric counts. Compute them from logs and you'll never afford the query.
  • Long retention. Per-second metrics for a year cost gigabytes. The same fidelity in logs would cost terabytes.
  • Alerts. Thresholds and burn rates evaluate cheaply against time series. Querying logs every 30 seconds for an alert is a money pit.
  • Dashboards. Painting a graph from a metric is microseconds. Painting one from logs is seconds — users notice.
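
The SLI and burn-rate bullets above reduce to simple arithmetic on counter values. A minimal sketch, assuming you already have "good" and "total" request counts for a window; the function names and the sample numbers are hypothetical.

```python
def sli(good: int, total: int) -> float:
    """Fraction of good events; an empty window is treated as fully healthy."""
    return 1.0 if total == 0 else good / total


def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    return (1.0 - sli(good, total)) / error_budget


# Hypothetical counter deltas over a one-hour window.
total_requests = 120_000
good_requests = 119_400

current_sli = sli(good_requests, total_requests)
current_burn = burn_rate(good_requests, total_requests, slo_target=0.999)
print(f"SLI: {current_sli:.4f}, burn rate: {current_burn:.1f}x")
```

With a 99.9% target, a 0.5% error rate spends the budget about five times faster than allowed. An alert rule would compare this value against a threshold chosen for the window length.
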
```
# Good metrics: low-cardinality, aggregable
http_requests_total{method,status,route}        # 5 x 40 x 200 = 40K series
http_request_duration_seconds_bucket{route,le}  # 200 x 10 buckets = 2K series

# Bad metrics: identifiers as labels
http_requests_total{user_id, request_id, ...}   # unbounded, kills TSDB
```
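
The series counts in the comments above are products of per-label cardinalities. A small sketch of that arithmetic follows; the `series_count` helper and the one-million-user figure are assumptions for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


requests_total = series_count({"method": 5, "status": 40, "route": 200})
duration_bucket = series_count({"route": 200, "le": 10})

# Adding an identifier label multiplies the total by the number of distinct IDs.
with_user_id = series_count(
    {"method": 5, "status": 40, "route": 200, "user_id": 1_000_000}
)

print(requests_total, duration_bucket, with_user_id)
```

This is an upper bound, since not every label combination occurs in practice, but it is a useful check before adding a label to a widely used metric.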

Logs: When to Use

Logs preserve event-level information. Anything you want to remember about a single request, transaction, or operation goes in a log line. Modern logs are structured: not free-form text, but JSON or logfmt with named fields that a query engine can filter on.

```
# Bad: unstructured log
[2024-08-12 14:23:11] ERROR: Failed to charge user user@example.com $42.99 (insufficient_funds)

# Good: structured log
{
  "ts":         "2024-08-12T14:23:11Z",
  "level":      "ERROR",
  "msg":        "charge failed",
  "user_id":    "u_abc123",
  "user_email": "user@example.com",
  "amount_usd": 42.99,
  "reason":     "insufficient_funds",
  "trace_id":   "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id":    "00f067aa0ba902b7",
  "service":    "billing",
  "version":    "v1.42.0"
}
```

The trace_id is the magic field. With it, every log line can be pivoted to its trace, and every trace span can pull the log lines emitted during it. Without it, logs are a search index that can only answer "show me logs matching X" — a much weaker question.
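
One way to get trace_id onto every log line is to inject it in the log formatter rather than at each call site. The sketch below uses only the Python standard library. The context variables and the `JsonFormatter` class are hypothetical stand-ins, since a tracing SDK would normally own the current trace context.

```python
import contextvars
import io
import json
import logging
import time

# Hypothetical context holders; a tracing SDK would normally manage these.
current_trace_id = contextvars.ContextVar("trace_id", default=None)
current_span_id = contextvars.ContextVar("span_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object and attach the active trace context."""

    converter = time.gmtime  # format timestamps in UTC so the trailing "Z" is accurate

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": current_trace_id.get(),
            "span_id": current_span_id.get(),
        }
        # Extra structured fields passed via logger.error(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("billing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
logger.error(
    "charge failed",
    extra={"fields": {"user_id": "u_abc123", "reason": "insufficient_funds"}},
)
print(stream.getvalue().strip())
```

Because the formatter reads the context at emit time, application code does not need to pass trace_id around explicitly.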

Traces: When to Use

Traces preserve causal structure: which call invoked which, with what parameters, taking how long. They're the only signal that captures the shape of a single request through a distributed system.

  • "Why is this request slow?" The span tree shows exactly which child span ate the latency.
  • "What does this request actually do?" The trace is documentation of the runtime behavior.
  • "Where did this exception come from?" Spans carry exception events; the trace shows the call chain.
  • "What's the dependency graph?" Aggregating traces produces a service-to-service map.
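
The "which child span ate the latency" question can be answered by computing self time per span: its duration minus the time spent in its direct children. A minimal sketch, assuming sibling spans do not overlap; the `Span` fields and the example trace are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans: list[Span]) -> dict[str, float]:
    """Map span_id -> duration minus the summed duration of direct children."""
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace for one request.
trace = [
    Span("a", None, "GET /checkout", 0, 500),
    Span("b", "a", "auth", 10, 40),
    Span("c", "a", "charge", 50, 480),
    Span("d", "c", "db.query", 60, 460),
]

self_ms = self_times(trace)
slowest = max(trace, key=lambda s: self_ms[s.span_id])
print(slowest.name, self_ms[slowest.span_id])
```

Here the root span lasts 500 ms but almost all of it is attributable to `db.query`, which is the kind of answer a latency histogram alone cannot give.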

Traces are sampled because at full fidelity they cost more than logs. Unlike a metric, which collapses requests into aggregate counts, each sampled trace keeps its full span tree, so the question shape survives: with 1% sampling you see 1% of the requests, each in complete detail.
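
A common way to sample is to derive the keep/drop decision from the trace_id itself, so every service reaches the same verdict for the same request without coordinating. The sketch below is in the spirit of ratio-based head samplers, not a copy of any particular SDK's implementation; `should_sample` is a hypothetical name.

```python
import random


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its id fall below ratio * 2**64."""
    low64 = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low64 < int(ratio * (1 << 64))


# The decision is a pure function of the id, so it is stable across services.
example_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(example_id, 0.5), should_sample(example_id, 0.7))

# Over many random ids, the kept fraction approaches the configured ratio.
rng = random.Random(42)
ids = [f"{rng.getrandbits(128):032x}" for _ in range(100_000)]
kept = sum(should_sample(trace_id, 0.01) for trace_id in ids)
print(f"kept {kept} of {len(ids)} traces")
```

This is head sampling: the decision is made before the request finishes, so slow or failing requests are kept only by chance. Tail sampling, which decides after the trace completes, trades extra buffering for better coverage of the interesting cases.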

Profiles: The Fourth Pillar

Continuous CPU and memory profiling captures stack-frame frequencies across the fleet at low cost. It answers questions metrics, logs, and traces can't: "which line of code is consuming all this CPU?"

  • Cross-cutting performance. A function used in 50 places shows up in profiles by total cost, not 50 separate spans.
  • Hot leaves. JSON parsing eating 30% CPU? Profile sees it. Trace doesn't — it's not a span.
  • Memory leaks. Heap profile shows live allocations by stack — the leaking call site is right there.
  • Diff before/after. Compare last week's flamegraph to this week's. The diff highlights regressions a metric never would.
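
The "hot leaves" and "diff before/after" ideas above can be sketched on folded stacks, a common text form where each line is a semicolon-joined stack plus a sample count. The profiles and function names below are made up for illustration.

```python
from collections import Counter


def self_cost(samples: dict[str, int]) -> Counter:
    """Attribute each stack's samples to its leaf frame, summed across all callers."""
    out: Counter = Counter()
    for stack, count in samples.items():
        out[stack.split(";")[-1]] += count
    return out


def share(costs: Counter) -> dict[str, float]:
    """Convert raw sample counts to a fraction of all samples."""
    total = sum(costs.values())
    return {fn: n / total for fn, n in costs.items()}


def diff(before: dict[str, int], after: dict[str, int]) -> dict[str, float]:
    """Change in each function's share of samples between two profiles."""
    b, a = share(self_cost(before)), share(self_cost(after))
    return {fn: a.get(fn, 0.0) - b.get(fn, 0.0) for fn in set(a) | set(b)}


# Hypothetical folded-stack profiles (stack -> sample count).
last_week = {
    "main;handle;json_parse": 10,
    "main;worker;json_parse": 5,
    "main;handle;db_query": 60,
    "main;handle": 25,
}
this_week = {
    "main;handle;json_parse": 25,
    "main;worker;json_parse": 5,
    "main;handle;db_query": 55,
    "main;handle": 15,
}

delta = diff(last_week, this_week)
regression = max(delta, key=delta.get)
print(regression, f"{delta[regression]:+.0%}")
```

`json_parse` is called from two places, yet it shows up as a single entry whose share doubled from 15% to 30%. Spotting that from per-span timings would mean correlating every call site by hand.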

Exemplars: Linking Metrics to Traces

An exemplar is a trace_id attached to a metric sample, typically a histogram bucket. Click the spike on the latency dashboard, drill into a trace from that exact moment.

```
# OpenMetrics exemplar format: trace_id + span_id attached to a histogram bucket
http_request_duration_seconds_bucket{le="0.5",route="/api"} 12492 # {trace_id="4bf92f...",span_id="00f067..."} 0.473 1660354527.123

# In Prometheus + Grafana + Tempo:
#   - Prometheus stores the exemplar alongside the bucket sample
#   - Grafana surfaces "view trace" links on the histogram
#   - Clicking a link opens Tempo with the trace_id pre-filled

# OTel SDKs typically record an exemplar on a histogram measurement when:
#   - the measurement is made inside a span that is currently sampled
#   - the metric is OTel-emitted (not Prometheus-direct)
```

Exemplars are the cheapest way to bridge the metrics-traces gap. Each histogram bucket holds at most one exemplar at a time (the latest), so the storage cost is tiny — but the navigation value is huge.
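
The "one exemplar per bucket, latest wins" behavior can be sketched as a histogram with a single exemplar slot per bucket. This is a simplified model under assumed names (`Histogram`, `Exemplar`), not the data structure of any specific client library. It stores per-bucket counts, whereas Prometheus exposes cumulative ones.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exemplar:
    trace_id: str
    value: float
    timestamp: float


class Histogram:
    def __init__(self, upper_bounds: list[float]):
        # A final +Inf bucket catches everything above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.counts = [0] * len(self.upper_bounds)  # per-bucket, not cumulative
        self.exemplars: list[Optional[Exemplar]] = [None] * len(self.upper_bounds)

    def observe(self, value: float, timestamp: float, trace_id: Optional[str] = None) -> None:
        # First bucket whose upper bound is >= value ("le" semantics).
        i = bisect.bisect_left(self.upper_bounds, value)
        self.counts[i] += 1
        if trace_id is not None:
            # Overwrite the slot: storage stays constant however many requests arrive.
            self.exemplars[i] = Exemplar(trace_id, value, timestamp)


h = Histogram([0.1, 0.5, 1.0])
h.observe(0.473, timestamp=1660354527.123, trace_id="trace-a")
h.observe(0.310, timestamp=1660354528.500, trace_id="trace-b")
h.observe(0.050, timestamp=1660354529.000)  # unsampled request: no exemplar

for bound, count, ex in zip(h.upper_bounds, h.counts, h.exemplars):
    print(bound, count, ex.trace_id if ex else None)
```

The `le=0.5` bucket counts both requests but remembers only `trace-b`, so exemplar storage is bounded by the number of buckets rather than by traffic.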

Vendor-Neutral Stack

The full unified-observability stack today, all vendor-neutral and OTel-friendly.

Pillar | Storage | Query UI | Wire format
Metrics | Prometheus / Mimir / VictoriaMetrics | Grafana | OTLP, remote_write
Logs | Loki / Elasticsearch / OpenSearch | Grafana / Kibana | OTLP, syslog, fluentbit
Traces | Tempo / Jaeger | Grafana / Jaeger UI | OTLP
Profiles | Pyroscope / Phlare / Parca | Grafana / Polar Signals | pprof

Tradeoffs

Metrics-only is blind

You see "errors are up" but can't answer "for which user, doing what?" Without logs/traces, every incident is a pivot to SSH.

Logs-only is expensive

Computing dashboards from logs is technically possible but operationally infeasible. Both query latency and storage cost grow with every event logged, so they climb in step with traffic instead of staying flat the way pre-aggregated metrics do.

Traces-only is sampled

Sampling means traces don't answer "is the rate elevated" reliably. They answer "what does a slow request look like" instead. Different question.

All four = pivot ability

The investment that pays off is wiring trace_id through every signal. It's not three or four pillars; it's one investigation experience that flows across them.

FAQ

Are the three pillars outdated?

The phrase is, but the underlying model isn't. Modern critique is "stop thinking of them as separate; treat observability as one connected experience." That's about UX integration, not abandoning the three storage shapes.

Should logs replace metrics?

No. The cost ratio is roughly 50-100x. You can derive metrics from logs in low-volume systems, but at scale the storage and query cost becomes prohibitive. Use both, with logs sampled or filtered to keep the volume reasonable.

What about events and incidents as a fifth pillar?

Some frameworks add "events" (deploys, config changes, incidents) as a separate signal. They're useful for correlating "we deployed v1.42 at 14:00, errors spiked at 14:02" but they're really just sparse, important logs. Treat them as such.

How do I unify across vendors?

OpenTelemetry. OTLP is the wire format that most modern backends accept, either natively or through the OTel Collector. Instrument once with OTel and export to any combination of Prometheus, Tempo, Loki, Datadog, Honeycomb, New Relic. Vendor lock-in moves from instrumentation to backend.

Where does RUM (Real User Monitoring) fit?

RUM is browser-side telemetry: Core Web Vitals metrics, session logs, frontend traces. Same three pillars, different vantage point. OTel has a JS SDK for it; the data flows into the same backend.

Should I store logs in my metrics database?

Generally no. ClickHouse and similar can do both, but the two workloads favor different storage layouts. Metrics want fast aggregation over many series; logs want fast text search and high-cardinality filtering. Use the right tool for each.