Observability Internals
Observability is the practice of inferring a system's internal state from the data it emits — metrics, logs, traces, and profiles. It has become a 30-billion-dollar market because modern distributed systems are too complex to debug by reading code or stepping through a debugger. The interesting engineering is not in any single signal but in the protocols that move billions of telemetry events per second cheaply, the sampling strategies that keep storage costs sane, and the correlation primitives (W3C trace context, exemplars) that let you pivot from a metric anomaly to the exact failed request in two clicks. This hub covers what actually changed since the move to OpenTelemetry.
Grounded in the OpenTelemetry specification, the SRE workbook, and operational lessons from running this stack at scale.
Pipeline Architecture
Key Numbers
Why Modern Observability Exists
OpenTelemetry
The vendor-neutral standard for traces, metrics, and logs — collectors, SDKs, semantic conventions, and how OTel finally won the instrumentation war
The Three Pillars
Metrics, logs, and traces — what each one is good at, where they overlap, and why exemplars finally let you jump between them in one click
Distributed Tracing
W3C trace context, B3 headers, span propagation, and what actually flows over the wire when a request crosses 14 microservices
Sampling Strategies
Head-based vs tail-based, probabilistic vs deterministic, the tail sampler's accuracy/cost tradeoff, and how to keep traces cheap without missing the rare bad ones
Cardinality Explosions
The single most expensive mistake in metrics — when your user_id label turns 50 time-series into 50 million, and how to claw back without losing fidelity
eBPF Observability
Cilium Tetragon, Pixie, Parca, and how kernel-side hooks generate metrics, traces, and continuous profiles with zero application instrumentation
Continuous Profiling
Pyroscope, Parca, Polar Signals — flame graphs sampled continuously in production at <1% overhead, and what they catch that traces don't
SLI / SLO / SLA
Service Level Indicator, Objective, Agreement — error budgets, the burn-rate alerts that actually page humans, and how Google's SRE book translates to your team
The Three Pillars (And the Fourth)
Observability has historically rested on three signal types. Metrics are pre-aggregated numerical time-series — counters, gauges, histograms. They are cheap to store (one number plus a few label values per scrape interval) and great for alerting and dashboards, but they are dimensional and lose individual events. Logs are structured records of discrete events. Expensive to store at high cardinality but irreplaceable for "what exactly happened to this request" questions. Traces are causally-linked records of work spanning processes — span trees that show how a request flowed through services.
A fourth pillar has emerged: continuous profiling. Sampled CPU/memory profiles from production with negligible overhead, stored as flame graphs over time. Pyroscope, Parca, and Polar Signals proved you can run pprof-style profilers continuously and answer questions metrics and traces can't ("which line of code is using my CPU"). OpenTelemetry's profiling signal is the formalisation.
The interesting question isn't "which pillar?" but how they correlate. Exemplars
attach a sample trace_id to a histogram bucket, so a high-latency p99 metric links directly to
the slow trace. Logs include trace_id in their structured fields so a single trace ID jumps
from "slow request" to "the exact log line where it stalled." This is the wholesale shift from "three
pillars" to "one signal graph with multiple projections."
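The exemplar idea can be sketched in a few lines. This is a minimal illustration, not the Prometheus or OTel data model: the class name ExemplarHistogram, the bucket bounds, and the "keep the most recent trace per bucket" policy are all assumptions made for the example.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExemplarHistogram:
    """Latency histogram that remembers one example trace_id per bucket."""
    bounds: List[float]  # bucket upper bounds in seconds, ascending
    counts: List[int] = field(init=False)
    exemplars: List[Optional[str]] = field(init=False)

    def __post_init__(self) -> None:
        # One extra bucket catches everything above the last bound (+Inf).
        self.counts = [0] * (len(self.bounds) + 1)
        self.exemplars = [None] * (len(self.bounds) + 1)

    def observe(self, seconds: float, trace_id: str) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.exemplars[idx] = trace_id  # assumption: keep the most recent trace

    def slowest_exemplar(self) -> Optional[str]:
        """Trace ID attached to the highest non-empty bucket, if any."""
        for idx in range(len(self.counts) - 1, -1, -1):
            if self.counts[idx]:
                return self.exemplars[idx]
        return None
```

A dashboard that sees the top bucket fill up can call slowest_exemplar() and open that trace directly; the metric stays cheap because only one ID per bucket is stored.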
OpenTelemetry: The Standard
OTel is three things: a specification (data model + semantic conventions), a set of SDKs in every major language, and a Collector binary that receives, processes, and exports telemetry. The wire protocol, OTLP (OpenTelemetry Protocol, gRPC or HTTP), is the canonical format. Every vendor accepts OTLP at the front door now.
The semantic conventions are arguably more important than the protocol. They define
standardised attribute names: http.request.method, db.system.name,
messaging.kafka.partition. When everyone agrees on these names, dashboards built for one
service work for all of them; correlation across services becomes trivial. Vendor SDKs that
stick OTel-incompatible attribute names on spans break this for everyone.
Auto-instrumentation libraries (Java agent, Python opentelemetry-instrument, .NET, Node)
attach to popular frameworks and emit OTLP without code changes. Manual instrumentation fills gaps the
auto layer can't see. The combination usually gives you 80% coverage in an afternoon.
Distributed Tracing and Context Propagation
A trace is a directed graph of spans, each representing a unit of work (an HTTP
request, a DB query, an internal function). Spans have a start, an end, attributes, events, and a
parent span ID. The trace is identified by a 128-bit trace_id; spans by a 64-bit
span_id. The whole graph is reconstructed by joining child spans to their parents.
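Joining children to parents is simple enough to sketch. The Span fields below mirror the ones named above; the function names build_tree and render are hypothetical, and a real backend would also sort by start time and handle spans arriving late.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks a root span
    name: str


def build_tree(spans):
    """Return (roots, children, orphans) for the spans of one trace."""
    known = {s.span_id for s in spans}
    children = defaultdict(list)
    roots, orphans = [], []
    for s in spans:
        if s.parent_id is None:
            roots.append(s)
        elif s.parent_id in known:
            children[s.parent_id].append(s)
        else:
            orphans.append(s)  # the parent span was never received
    return roots, children, orphans


def render(span, children, depth=0):
    """Indented list of span names, depth-first from the given span."""
    lines = ["  " * depth + span.name]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines
```

Spans that land in orphans are the visible symptom of a broken propagation hop: their parent ID points at a span that no collector ever saw.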
For a trace to span processes, the trace context must travel with the request. The
W3C Trace Context standard defines two headers: traceparent (version, trace ID, parent
span ID, and trace flags, hex-encoded in a fixed format) and tracestate (vendor-specific data). All OTel-compliant
libraries inject and extract these headers automatically across HTTP, gRPC, and most messaging clients.
The older B3 headers (Zipkin) are still common in legacy environments.
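What flows over the wire is small. The sketch below parses and re-emits a traceparent header for a custom client that has no auto-instrumentation. It handles only the version-00 layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags); the names TraceContext, parse_traceparent, and child_traceparent are hypothetical helpers, not part of any OTel SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID, same sampled bit."""
    new_span_id = secrets.token_hex(8)  # 64 random bits as 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"
```

The important property is that the trace ID and the sampled flag survive the hop unchanged; only the span ID is replaced.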
A subtle correctness pitfall: if your trace context propagation is incomplete (you call a service that doesn't know how to forward the headers), spans become orphans. They show up as separate small traces. The fix is auto-instrumentation everywhere or explicit propagation in any custom client. Treat the first lost trace boundary as a P0 — the entire blast radius of "we don't know what's happening" lives downstream of it.
Sampling: Head-Based vs Tail-Based
Recording every span at scale is impossibly expensive. Sampling is mandatory. The two strategies have very different costs and properties.
Head-based sampling: decide at trace start whether to keep the trace, before any work happens. Cheap (no buffering), simple (the decision propagates through the parent context so all spans in a trace are kept or dropped together), but blind: you can't make the decision based on whether the trace turned out interesting. A 1% sample rate drops, on average, 99% of the traces that hit a rare bug.
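A deterministic head sampler can be sketched as a pure function of the trace ID, which is what lets every service reach the same keep/drop answer without coordination. The choice of the low 64 bits and the threshold comparison are assumptions of this sketch, not a claim about any SDK's exact algorithm; head_sample is a hypothetical name.

```python
def head_sample(trace_id_hex: str, rate: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below rate * 2**64."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Trace IDs are assumed to be uniformly random, so this keeps
    # roughly `rate` of all traces.
    return low_bits < int(rate * (1 << 64))
```

Because the decision depends only on the ID, a service that receives a request mid-trace makes the same call as the service that started it, even if the sampled flag was lost along the way.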
Tail-based sampling: buffer all spans for a trace, decide at the end whether to keep. Lets you sample 100% of errors and slow traces while keeping baseline at 1%. Expensive: the sampler must hold every span in memory until the trace completes (or a timeout fires). Requires a sampling decision service that sees all spans for a trace ID — usually a sticky-routed Collector with a tail-sampling processor.
Hybrid: head-sample the boring 99%, tail-sample errors and outliers. This is the dominant production
pattern. Honeycomb's "refinery" and the OTel Collector's tail_sampling processor implement
it.
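The tail-sampling decision described above can be sketched as a buffer plus three policies. This is an in-memory toy under stated assumptions: spans are plain dicts with trace_id, duration_ms, and error keys, the class name TailSampler is hypothetical, and there is no timeout or memory cap, which a production sampler would need.

```python
import random
from collections import defaultdict


class TailSampler:
    def __init__(self, baseline_rate, slow_ms, rng=None):
        self.baseline_rate = baseline_rate  # fraction of boring traces to keep
        self.slow_ms = slow_ms              # latency threshold for "slow"
        self.rng = rng or random.Random()
        self.buffers = defaultdict(list)    # trace_id -> buffered spans

    def add_span(self, span):
        """Buffer a span until its trace is finished."""
        self.buffers[span["trace_id"]].append(span)

    def finish(self, trace_id):
        """Call when the root span ends or a timeout fires; returns kept spans."""
        spans = self.buffers.pop(trace_id, [])
        if any(s["error"] for s in spans):
            return spans  # policy 1: always keep errors
        if any(s["duration_ms"] >= self.slow_ms for s in spans):
            return spans  # policy 2: always keep slow traces
        if self.rng.random() < self.baseline_rate:
            return spans  # policy 3: probabilistic baseline
        return []
```

Everything in self.buffers is memory the sampler holds until finish() is called, which is why buffer size and timeout dominate the cost of this approach.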
The Cardinality Problem
A metric series is identified by metric name plus the set of label values. Each unique combination is a
new time-series. A counter http_requests_total{method, path, status, user_id} with even
modest cardinality on each label can produce billions of series. Prometheus, Mimir, VictoriaMetrics —
all hit a wall around 10–50M active series before query latency, ingest, and storage costs explode.
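The multiplication is worth doing once by hand. The sketch below computes the worst-case series count as the product of per-label cardinalities; the specific numbers (5 methods, 50 paths, 10 status codes, 100,000 users) are made up for illustration.

```python
from math import prod


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: every label combination occurs."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for http_requests_total.
base = {"method": 5, "path": 50, "status": 10}
with_user = {**base, "user_id": 100_000}

print(worst_case_series(base))       # 2500
print(worst_case_series(with_user))  # 250000000
```

Real workloads rarely hit the full product because not every combination occurs, but a single unbounded label is enough to move a metric from thousands of series to hundreds of millions.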
Cardinality discipline: never put unbounded values in labels (user_id, request_id, trace_id, customer email). Aggregate at ingest where possible. If you need user-level breakdowns, do it on logs or traces with exemplars, not metrics. The number-one cause of "our observability bill is bigger than our compute bill" is undisciplined labels.
Modern systems (Mimir, VictoriaMetrics, ClickHouse-backed metrics stores) handle higher cardinality than classic Prometheus because they use better storage engines (label-indexed columnar) — but no one handles unbounded cardinality. Pre-aggregation and recording rules are still the answer.
Logs: Structured, Correlated, Aggregated
Modern logging is structured: JSON or logfmt with fixed fields. Free-text logs are searchable but unqueryable; structured logs are queryable like a database. The OTel logs SDK ships log records with span context attached, so every log line in a request handler automatically carries the trace ID.
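The "every log line carries the trace ID" behaviour can be approximated with the standard library alone. This sketch is not the OTel logs SDK: the context variable current_trace_id, the JsonFormatter class, and the make_logger helper are hypothetical names, and the field set is an assumption.

```python
import contextvars
import json
import logging

# Assumption: request middleware sets this once per request.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the active trace ID."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        }
        return json.dumps(payload)


def make_logger(name, stream):
    """Logger that writes structured JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Handlers call the logger as usual; the trace ID is picked up from the context variable, so application code never has to pass it around.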
The aggregation pipeline is canonical: app → stdout → log shipper (Vector, Fluent Bit, OTel Collector with logs receiver) → storage (Loki, Elasticsearch, ClickHouse, Splunk, S3+Athena). Loki's design point is "index labels only, store raw log content compressed": dramatically cheaper than Elasticsearch for log volume but slower for full-text search. Columnar and object-storage log stores (ClickHouse-backed SigNoz, Quickwit, vendor stacks) split the difference.
eBPF: Observability Without Instrumentation
eBPF programs run in the kernel, hooked to events: syscalls, network packets, function entries. They can produce metrics, traces, and profiles without touching application code. Cilium Tetragon, Pixie, Parca, and Hubble all leverage eBPF for kernel-level visibility into syscalls, network flows, and CPU stacks.
The win is huge for closed-source binaries, polyglot environments, and "drop-in observability" claims. The limit is that the kernel sees syscalls, not application semantics — eBPF can tell you a syscall took 5 ms but not which user-level operation that syscall was serving. Combined with OTel auto-instrumentation, the two cover complementary blind spots.
SLI, SLO, SLA, Error Budgets
The Google SRE framework that became universal. SLI: Service Level Indicator — a measurable metric like "p99 latency of the checkout endpoint" or "error rate of writes to user table." SLO: Service Level Objective — a target ("p99 below 300 ms 99.9% of the time over 28 days"). SLA: Service Level Agreement — a contractual SLO with consequences for violation.
An error budget is the complement of an SLO: if you commit to 99.9%, you have about 43 minutes of allowed downtime per 30-day month. Burn-rate alerts (Google's multi-window multi-burn-rate pattern, MWMBR) page humans only when you're about to blow the budget, not on every blip. A 14.4× burn rate over a 1-hour window means 2% of the month's budget went in that hour and "at this rate you'll exhaust the month's budget in about 2 days": alert. A 6× burn rate over 6 hours means 5% is gone and "you'll exhaust the budget in 5 days at this rate": also worth paging because the issue is sustained.
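The arithmetic behind error budgets and burn rates fits in three one-line functions. This is a sketch assuming a 30-day window and a burn rate defined as "budget consumed per unit time, relative to the rate that would exactly exhaust it at the end of the window"; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed bad minutes in the window, e.g. slo=0.999 over 30 days."""
    return (1.0 - slo) * window_days * 24 * 60


def days_to_exhaustion(burn_rate: float, window_days: float) -> float:
    """How long a full budget lasts if the current burn rate is sustained."""
    return window_days / burn_rate


def budget_consumed(burn_rate: float, alert_hours: float, window_days: float) -> float:
    """Fraction of the whole budget spent during one alert window."""
    return burn_rate * alert_hours / (window_days * 24)


print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
print(round(days_to_exhaustion(14.4, 30), 2))     # 2.08
print(round(budget_consumed(14.4, 1, 30), 3))     # 0.02
```

Reading the outputs: a 99.9% SLO over 30 days allows 43.2 bad minutes, and a sustained 14.4× burn spends 2% of that in a single hour and all of it in roughly two days.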
The three classic monitoring frameworks: RED (Rate, Errors, Duration; for request-driven services), USE (Utilization, Saturation, Errors; for resources), and the four golden signals (latency, traffic, errors, saturation; Google's mash-up). They all describe the same idea: instrument both the request flow and the resources serving it.
Tradeoffs and When Less is More
Modern observability is expensive. A typical bill at scale: 30–60% of compute cost. Most of it is metric cardinality and log volume that nobody queries. The discipline is to keep alerting metrics surgical, sample traces aggressively, ship logs only when you need them, and use exemplars to bridge between cheap signals and expensive ones.
The opposite mistake is also common: instrument too lightly, ship to a single tool, then realise after an incident that you have no idea what happened. The right floor for a production service is: RED metrics on every endpoint, structured logs at info+, distributed tracing at 1% head sampling + 100% errors, and continuous profiling. Everything beyond that is workload-specific.
Vendor Stacks vs OpenTelemetry-Native
| | Datadog | Honeycomb | Grafana Stack | OTel + OSS |
|---|---|---|---|---|
| Instrumentation | Datadog SDK + dd-trace | OpenTelemetry-native | OTel + Loki/Tempo SDKs | OpenTelemetry exclusively |
| Metrics backend | DogStatsD + custom TSDB | Computed from events | Mimir / Prometheus | Prometheus / Mimir / VM |
| Logs backend | Datadog Logs (custom) | Events (high-cardinality) | Loki | Loki / ClickHouse / OpenSearch |
| Traces backend | APM (custom storage) | Native event-based | Tempo | Jaeger / Tempo / SigNoz |
| Continuous profiling | Datadog Profiler | (via OTel) | Pyroscope (acquired) | Pyroscope / Parca |
| Lock-in | High (custom SDKs) | Low (OTel-native) | Low (OSS, swappable) | None |
| Cost shape | Per host + per ingested GB | Per event | Self-hosted or Cloud per signal | Self-hosted: infra cost only |
FAQ
Should I use OpenTelemetry or my vendor's SDK?
OpenTelemetry, almost always. Even if you're a Datadog/New Relic shop today, instrumenting with OTel + their OTLP receiver gives you the same data with zero lock-in — switch vendors by changing an exporter config. The vendor SDKs are mostly there for legacy reasons or to expose proprietary features (RUM, security signals); use them only when OTel's coverage is insufficient.
Why are my metrics so expensive?
Cardinality. Run a query for series count by metric name. The top 10 metrics are usually 90% of the bill, and they almost always have a label that should never have been one (user_id, request_id, status_message). Drop the label or move that data to logs/traces. A second class of cost: scrape interval. 15 s is fine for almost everything; 1 s is rarely justified.
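In PromQL the series-count query is commonly written as topk(10, count by (__name__)({__name__=~".+"})). The same ranking can be sketched offline over an exported list of active series; the input shape (metric name plus a label dict per series) and the function name top_metrics_by_series are assumptions of this example.

```python
from collections import Counter


def top_metrics_by_series(series, n=10):
    """Rank metric names by number of distinct label sets.

    series: iterable of (metric_name, labels_dict), one entry per sample or series.
    """
    seen = set()
    for name, labels in series:
        # A series is identified by its name plus its full label set.
        seen.add((name, frozenset(labels.items())))
    counts = Counter(name for name, _ in seen)
    return counts.most_common(n)
```

The metric at the top of this list is usually the one carrying a label that belongs in logs or traces instead.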
What's the difference between OpenTelemetry and OpenTracing?
OpenTracing was a tracing-only API that became hard to evolve. OpenCensus was Google's competing project that included metrics. They merged in 2019 to form OpenTelemetry, which now covers traces (stable), metrics (stable), logs (stable), and profiles (in development). OpenTracing is deprecated; bridges exist for migrating instrumentation.
How does tail-based sampling actually work in production?
The OTel Collector or a sampling proxy (Honeycomb Refinery, Datadog's tail sampler) keeps a buffer of all spans for each in-flight trace, hashed by trace ID and routed consistently across collector instances. When the root span ends or a timeout fires, the sampler evaluates configured policies (always sample errors, sample slow latency, head-sample rate for the rest) and emits or drops accordingly. Memory pressure is the failure mode — you must tune the buffer size and timeout aggressively.
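The "routed consistently" part means every span of a trace must land on the same sampler instance. One way to get that is rendezvous (highest-random-weight) hashing, sketched below. The technique is swapped in for illustration and is not a description of how any particular collector or proxy routes; route_collector and the collector names are hypothetical.

```python
import hashlib


def route_collector(trace_id_hex: str, collectors: list) -> str:
    """Pick the collector that owns a trace via rendezvous hashing."""
    def weight(collector: str) -> int:
        # Score each (collector, trace) pair; the highest score wins.
        digest = hashlib.sha256(f"{collector}:{trace_id_hex}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(collectors, key=weight)
```

The useful property is stability: when a collector that does not own a trace is removed from the list, the owner of that trace does not change, so only the traces buffered on the departed instance are disrupted.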
Do I really need traces if I have detailed logs?
Yes — logs tell you what happened in one process; traces tell you the causal chain across processes. In a 14-microservice architecture, a 95th-percentile latency outlier is impossible to debug from logs alone because you can't tell which downstream call dominated the time. Traces show the waterfall directly. Conversely, traces alone aren't enough either — exact error messages, stack traces, and rare events live in logs.
What are "events" in the Honeycomb sense?
Honeycomb pioneered the wide-event model: instead of separate metrics + logs + traces, every operation emits a single high-cardinality, high-dimensional event with hundreds of attributes. Aggregations are computed on the fly. The benefit is unmatched flexibility — you can group by any attribute after the fact. The cost is that the storage system must handle billions of events with arbitrary group-bys cheaply, which is why Honeycomb built a custom column store. OTel's wide-spans-with-attributes is a partial convergence on this model.
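Grouping by any attribute after the fact looks like this in miniature. The sketch scans plain dicts in memory, which says nothing about how a real column store executes the query; the event fields and the group_by helper are hypothetical.

```python
from collections import defaultdict


def group_by(events, key, value="duration_ms"):
    """Return {key_value: (count, max_value)}, computed at query time."""
    out = defaultdict(lambda: [0, 0.0])
    for e in events:
        if key in e:  # wide events are sparse: not every event has every attribute
            agg = out[e[key]]
            agg[0] += 1
            agg[1] = max(agg[1], e[value])
    return {k: tuple(v) for k, v in out.items()}


events = [
    {"endpoint": "/checkout", "customer_tier": "free", "region": "eu", "duration_ms": 120.0},
    {"endpoint": "/checkout", "customer_tier": "paid", "region": "us", "duration_ms": 950.0},
    {"endpoint": "/search", "customer_tier": "paid", "region": "us", "duration_ms": 40.0},
]
```

The same events answer group_by(events, "endpoint"), group_by(events, "region"), or group_by(events, "customer_tier") with no pre-declared metric, which is the flexibility the wide-event model pays for in storage.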
How does eBPF observability compare to traditional instrumentation?
eBPF gives you visibility "for free" — no code changes, no SDK upgrades. It's strongest at network, syscall, and kernel-resource levels. It's weakest at application semantics: it can tell you a process called read() on socket FD 14 for 5 ms; it can't tell you that was a database query for the user's profile. Combine: eBPF for system-level coverage of services you can't instrument, OTel SDKs for everything you own.