Observability Internals

Observability is the practice of inferring a system's internal state from the data it emits — metrics, logs, traces, and profiles. It has become a 30-billion-dollar market because modern distributed systems are too complex to debug by reading code or stepping through a debugger. The interesting engineering is not in any single signal but in the protocols that move billions of telemetry events per second cheaply, the sampling strategies that keep storage costs sane, and the correlation primitives (W3C trace context, exemplars) that let you pivot from a metric anomaly to the exact failed request in two clicks. This hub covers what actually changed since the move to OpenTelemetry.

Grounded in the OpenTelemetry specification, the SRE workbook, and operational lessons from running this stack at scale.

Pipeline Architecture

Key Numbers

OTel spec status	stable (1.0+)
Trace context standard	W3C Trace Context
Default head sampling	parent-based
Profiling overhead	< 1%
Common Prometheus scrape interval	15 s
Cardinality budget	~10M series/TSDB
RED / USE / 4 golden signals	3 frameworks

Why Modern Observability Exists

The Gap

By 2018, every vendor had its own SDK. Switching from Datadog to New Relic meant re-instrumenting your entire codebase. Tracing, metrics, and logs lived in three different products with three different correlation models. The cost of vendor lock-in was the cost of every line of instrumentation.

The Insight

If you standardise the data format (signals, attributes, semantic conventions) and the wire protocol (OTLP), the SDK becomes a commodity. Vendors compete on backend storage, query, and UX — not on which tracer you have to import. Decoupling instrumentation from analysis frees the entire ecosystem.

The Result

OpenTelemetry merged OpenTracing + OpenCensus in 2019, hit GA for traces in 2021, metrics in 2023, and logs in 2024. It is now the de facto instrumentation API across cloud-native and is supported as a first-class ingest path by every major vendor — including the ones it threatened.

✦ Live

OpenTelemetry

The vendor-neutral standard for traces, metrics, and logs — collectors, SDKs, semantic conventions, and how OTel finally won the instrumentation war

Coming soon

The Three Pillars

Metrics, logs, and traces — what each one is good at, where they overlap, and why exemplars finally let you jump between them in one click

Coming soon

Distributed Tracing

W3C trace context, B3 headers, span propagation, and what actually flows over the wire when a request crosses 14 microservices

Coming soon

Sampling Strategies

Head-based vs tail-based, probabilistic vs deterministic, the tail sampler's accuracy/cost tradeoff, and how to keep traces cheap without missing the rare bad ones

Coming soon

Cardinality Explosions

The single most expensive mistake in metrics — when your user_id label turns 50 time-series into 50 million, and how to claw back without losing fidelity

Coming soon

eBPF Observability

Cilium Tetragon, Pixie, Parca, and how kernel-side hooks generate metrics, traces, and continuous profiles with zero application instrumentation

Coming soon

Continuous Profiling

Pyroscope, Parca, Polar Signals — flame graphs sampled continuously in production at <1% overhead, and what they catch that traces don't

Coming soon

SLI / SLO / SLA

Service Level Indicator, Objective, Agreement — error budgets, the burn-rate alerts that actually page humans, and how Google's SRE book translates to your team

The Three Pillars (And the Fourth)

Observability has historically rested on three signal types. Metrics are pre-aggregated numerical time-series: counters, gauges, histograms. They are cheap to store (one number plus a few label values per scrape interval) and great for alerting and dashboards, but because they are aggregated they lose individual events. Logs are structured records of discrete events. They are expensive to store at high volume but irreplaceable for "what exactly happened to this request" questions. Traces are causally-linked records of work spanning processes: span trees that show how a request flowed through services.

A fourth pillar has emerged: continuous profiling. Sampled CPU/memory profiles from production with negligible overhead, stored as flame graphs over time. Pyroscope, Parca, and Polar Signals proved you can run pprof-style profilers continuously and answer questions metrics and traces can't ("which line of code is using my CPU"). OpenTelemetry's profiling signal is the formalisation.

The interesting question isn't "which pillar?" but how they correlate. Exemplars attach a sample trace_id to a histogram bucket, so a high-latency p99 metric links directly to the slow trace. Logs include trace_id in their structured fields so a single trace ID jumps from "slow request" to "the exact log line where it stalled." This is the wholesale shift from "three pillars" to "one signal graph with multiple projections."
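To make the exemplar idea concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry or Prometheus API: the `LatencyHistogram` class, the bucket bounds, and the trace IDs are all hypothetical, chosen only to show a histogram bucket remembering one sample trace_id.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical bucket upper bounds in seconds; real histograms pick their own.
BOUNDS = [0.1, 0.5, 1.0, 5.0]


@dataclass
class Bucket:
    count: int = 0
    exemplar_trace_id: Optional[str] = None  # one sample trace per bucket
    exemplar_value: Optional[float] = None


@dataclass
class LatencyHistogram:
    # One bucket per bound plus a final overflow (+Inf) bucket.
    buckets: list = field(
        default_factory=lambda: [Bucket() for _ in range(len(BOUNDS) + 1)]
    )

    def observe(self, seconds: float, trace_id: str) -> None:
        idx = bisect_left(BOUNDS, seconds)  # first bound >= seconds
        bucket = self.buckets[idx]
        bucket.count += 1
        # Keep the most recent observation as this bucket's exemplar.
        bucket.exemplar_trace_id = trace_id
        bucket.exemplar_value = seconds

    def slowest_exemplar(self) -> Optional[str]:
        # Walk from the highest bucket down; return the first exemplar found.
        for bucket in reversed(self.buckets):
            if bucket.exemplar_trace_id is not None:
                return bucket.exemplar_trace_id
        return None


hist = LatencyHistogram()
hist.observe(0.05, "trace-a")
hist.observe(0.30, "trace-b")
hist.observe(7.20, "trace-c")
slow_trace = hist.slowest_exemplar()  # the trace to open when p99 spikes
```

The metric stays cheap (a handful of counters), yet the top bucket still points at one concrete slow request.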

OpenTelemetry: The Standard

OTel is three things: a specification (data model + semantic conventions), a set of SDKs in every major language, and a Collector binary that receives, processes, and exports telemetry. The wire protocol, OTLP (OpenTelemetry Protocol, gRPC or HTTP), is the canonical format. Every vendor accepts OTLP at the front door now.

The semantic conventions are arguably more important than the protocol. They define standardised attribute names: http.request.method, db.system.name, messaging.destination.partition.id. When everyone agrees on these names, dashboards built for one service work for all of them; correlation across services becomes trivial. Vendor SDKs that stick OTel-incompatible attribute names on spans break this for everyone.

Auto-instrumentation libraries (Java agent, Python opentelemetry-instrument, .NET, Node) attach to popular frameworks and emit OTLP without code changes. Manual instrumentation fills gaps the auto layer can't see. The combination usually gives you 80% coverage in an afternoon.

Distributed Tracing and Context Propagation

A trace is a directed graph of spans, each representing a unit of work (an HTTP request, a DB query, an internal function). Spans have a start, an end, attributes, events, and a parent span ID. The trace is identified by a 128-bit trace_id; spans by a 64-bit span_id. The whole graph is reconstructed by joining child spans to their parents.
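The parent-join reconstruction can be sketched in a few lines. This is an illustrative model, not a backend's actual storage format: the `Span` dataclass, the span names, and the timings are invented, and the IDs are shortened for readability.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int


def build_tree(spans):
    """Group spans by parent; spans with no known parent become roots."""
    known = {s.span_id for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        if s.parent_id is None or s.parent_id not in known:
            roots.append(s)  # true root, or an orphan whose parent is missing
        else:
            children[s.parent_id].append(s)
    return roots, children


def render(span, children, depth=0):
    """Return an indented waterfall, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} ({span.end_ms - span.start_ms} ms)"]
    for child in sorted(children[span.span_id], key=lambda c: c.start_ms):
        lines.extend(render(child, children, depth + 1))
    return lines


spans = [
    Span("c", "a", "POST /payments", 45, 110),
    Span("a", None, "GET /checkout", 0, 120),
    Span("d", "c", "charge card", 50, 100),
    Span("b", "a", "SELECT cart", 10, 40),
]
roots, children = build_tree(spans)
lines = render(roots[0], children)
```

Spans can arrive in any order; only the parent pointers matter. A span whose parent never arrives ends up in `roots`, which is how a broken propagation boundary shows up as an extra small trace.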

For a trace to span processes, the trace context must travel with the request. The W3C Trace Context standard defines two headers: traceparent (version, trace ID, parent span ID, and trace flags, hex-encoded in a fixed format) and tracestate (vendor-specific key-value data). All OTel-compliant libraries inject and extract these headers automatically across HTTP, gRPC, and most messaging clients. The older B3 headers (Zipkin) are still common in legacy environments.
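As a rough illustration of what travels in the traceparent header, the sketch below parses and formats the version-00 layout (`version-traceid-parentid-flags`, lowercase hex). The helper names are made up for this example, and it deliberately handles only the version-00 shape; OTel propagators do this for you in practice.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str   # 128-bit ID as 32 hex characters
    parent_id: str  # 64-bit span ID as 16 hex characters
    sampled: bool   # lowest bit of the trace-flags byte


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff":  # reserved as invalid by the standard
        return None
    if trace_id == "0" * 32 or parent_id == "0" * 16:  # all-zero IDs are invalid
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header an outgoing request would carry."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
# A downstream call keeps the trace ID but substitutes this service's span ID.
outgoing = format_traceparent(ctx.trace_id, "b7ad6b7169203331", ctx.sampled)
```

The trace ID is the only part that stays constant across hops; each service replaces the parent ID with its own current span before forwarding.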

A subtle correctness pitfall: if your trace context propagation is incomplete (you call a service that doesn't know how to forward the headers), spans become orphans. They show up as separate small traces. The fix is auto-instrumentation everywhere or explicit propagation in any custom client. Treat the first lost trace boundary as a P0 — the entire blast radius of "we don't know what's happening" lives downstream of it.

Sampling: Head-Based vs Tail-Based

Recording every span at scale is impossibly expensive. Sampling is mandatory. The two strategies have very different costs and properties.

Head-based sampling: decide at trace start whether to keep the trace, before any work happens. Cheap (no buffering), simple (the decision propagates through the parent context so all spans in a trace are kept or dropped together), but blind: you can't make the decision based on whether the trace turned out to be interesting. A 1% sample rate drops 99% of all traces, including 99% of the ones that hit a rare bug.
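A head sampler can be sketched as a pure function of the trace ID, which is what lets every service reach the same decision without coordination. The function names and the use of the low 64 bits are assumptions for this example; it mirrors the general idea behind ratio-based and parent-based samplers, not any SDK's exact algorithm.

```python
import random
from typing import Optional

MASK_64 = (1 << 64) - 1


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision: treat the low 64 bits of the ID as a uniform draw."""
    threshold = int(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & MASK_64) < threshold


def child_decision(parent_sampled: Optional[bool], trace_id_hex: str, ratio: float) -> bool:
    """Parent-based: follow the caller's decision; only roots decide for themselves."""
    if parent_sampled is not None:
        return parent_sampled
    return should_sample(trace_id_hex, ratio)


# Simulate 10,000 random 128-bit trace IDs at a 1% sample rate.
rng = random.Random(0)
ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.01) for t in ids)
```

Because the decision depends only on the ID and the ratio, a trace is kept or dropped as a whole, but nothing in the function can see whether the request later failed.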

Tail-based sampling: buffer all spans for a trace, decide at the end whether to keep. Lets you sample 100% of errors and slow traces while keeping baseline at 1%. Expensive: the sampler must hold every span in memory until the trace completes (or a timeout fires). Requires a sampling decision service that sees all spans for a trace ID — usually a sticky-routed Collector with a tail-sampling processor.
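The buffering cost is easier to see in code. The following is a toy in-memory tail sampler, assuming spans arrive as plain dicts with hypothetical `trace_id`, `start_ms`, `end_ms`, and `error` keys; a production sampler would also need timeouts and memory limits, which are omitted here.

```python
import random
from collections import defaultdict


class TailSampler:
    def __init__(self, slow_ms=1000, baseline=0.01, rng=None):
        self.slow_ms = slow_ms    # keep traces at least this slow
        self.baseline = baseline  # keep this fraction of everything else
        self.rng = rng if rng is not None else random.Random()
        self.buffer = defaultdict(list)  # trace_id -> spans held in memory

    def add_span(self, span: dict) -> None:
        self.buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id: str) -> list:
        """Decide once the trace is complete; return kept spans or an empty list."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s.get("error") for s in spans)
        duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
        if has_error or duration >= self.slow_ms or self.rng.random() < self.baseline:
            return spans
        return []


# baseline=0.0 makes the example deterministic: only errors and slow traces survive.
sampler = TailSampler(slow_ms=1000, baseline=0.0)
sampler.add_span({"trace_id": "t1", "start_ms": 0, "end_ms": 40, "error": True})
sampler.add_span({"trace_id": "t2", "start_ms": 0, "end_ms": 40})
sampler.add_span({"trace_id": "t3", "start_ms": 0, "end_ms": 300})
sampler.add_span({"trace_id": "t3", "start_ms": 300, "end_ms": 1500})
kept_error = sampler.finish_trace("t1")
kept_fast = sampler.finish_trace("t2")
kept_slow = sampler.finish_trace("t3")
```

Every span sits in `self.buffer` until its trace finishes, so memory grows with the number of in-flight traces times spans per trace.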

Hybrid: head-sample the boring 99%, tail-sample errors and outliers. This is the dominant production pattern. Honeycomb's Refinery and the OTel Collector's tail_sampling processor implement it.

The Cardinality Problem

A metric series is identified by metric name plus the set of label values. Each unique combination is a new time-series. A counter http_requests_total{method, path, status, user_id} with even modest cardinality on each label can produce billions of series. Prometheus, Mimir, VictoriaMetrics — all hit a wall around 10–50M active series before query latency, ingest, and storage costs explode.
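The arithmetic behind the explosion is a simple product. The per-label cardinalities below are invented for illustration, and the result is an upper bound, since not every combination of label values occurs in practice.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for http_requests_total.
bounded = {"method": 5, "path": 50, "status": 10}
with_user = {**bounded, "user_id": 1_000_000}

series_bounded = max_series(bounded)      # 5 * 50 * 10
series_with_user = max_series(with_user)  # the same, times one million users
```

Adding one unbounded label multiplies the whole product, which is why a single user_id label can turn a few thousand series into billions.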

Cardinality discipline: never put unbounded values in labels (user_id, request_id, trace_id, customer email). Aggregate at ingest where possible. If you need user-level breakdowns, do it on logs or traces with exemplars, not metrics. The number-one cause of "our observability bill is bigger than our compute bill" is undisciplined labels.
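Aggregating at ingest amounts to summing counter samples after removing the offending labels. This sketch assumes samples arrive as hypothetical `(labels, value)` pairs; real pipelines do the same job with relabelling or recording rules rather than application code.

```python
from collections import Counter


def drop_labels(samples, drop):
    """Sum counter values after removing the given label names."""
    out = Counter()
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[kept] += value
    return out


samples = [
    ({"method": "GET", "status": "200", "user_id": "u1"}, 3),
    ({"method": "GET", "status": "200", "user_id": "u2"}, 5),
    ({"method": "GET", "status": "500", "user_id": "u1"}, 1),
]
aggregated = drop_labels(samples, drop={"user_id"})
ok_key = (("method", "GET"), ("status", "200"))
```

Three per-user series collapse into two, and the totals per method and status are preserved. The per-user detail is gone from the metric, which is the point: that breakdown belongs in logs or traces.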

Modern systems (Mimir, VictoriaMetrics, ClickHouse-backed metrics stores) handle higher cardinality than classic Prometheus because they use better storage engines (label-indexed columnar) — but no one handles unbounded cardinality. Pre-aggregation and recording rules are still the answer.

Logs: Structured, Correlated, Aggregated

Modern logging is structured: JSON or logfmt with fixed fields. Free-text logs are searchable but unqueryable; structured logs are queryable like a database. The OTel logs SDK ships log records with span context attached, so every log line in a request handler automatically carries the trace ID.
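The effect of attaching span context to log records can be imitated with the standard library alone. This is a sketch of the idea, not the OTel logs SDK: the `current_trace_id` context variable and the `JsonFormatter` class are stand-ins invented for this example.

```python
import contextvars
import io
import json
import logging

# Stand-in for the active span context; an OTel SDK would supply this itself.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, with the active trace ID attached."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "trace_id": current_trace_id.get(),
            }
        )


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

# Inside a request handler, the tracing layer would set this on span start.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment authorised for order %s", "o-123")
entry = json.loads(stream.getvalue())
```

The handler code never mentions the trace ID, yet every line it logs can be joined to its trace with a single field lookup.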

The aggregation pipeline is canonical: app → stdout → log shipper (Vector, Fluent Bit, OTel Collector with logs receiver) → storage (Loki, Elasticsearch, ClickHouse, Splunk, S3+Athena). Loki's design point is "index labels only, store raw log content compressed", which is dramatically cheaper than Elasticsearch for log volume but slower for full-text search. ClickHouse-backed log stores (SigNoz, vendor stacks) and search engines built on object storage (Quickwit) split the difference.

eBPF: Observability Without Instrumentation

eBPF programs run in the kernel, hooked to events: syscalls, network packets, function entries. They can produce metrics, traces, and profiles without touching application code. Cilium Tetragon, Pixie, Parca, and Hubble all leverage eBPF for system-call-level visibility.

The win is huge for closed-source binaries, polyglot environments, and "drop-in observability" claims. The limit is that the kernel sees syscalls, not application semantics — eBPF can tell you a syscall took 5 ms but not which user-level operation that syscall was serving. Combined with OTel auto-instrumentation, the two cover complementary blind spots.

SLI, SLO, SLA, Error Budgets

The Google SRE framework that became universal. SLI: Service Level Indicator — a measurable metric like "p99 latency of the checkout endpoint" or "error rate of writes to user table." SLO: Service Level Objective — a target ("p99 below 300 ms 99.9% of the time over 28 days"). SLA: Service Level Agreement — a contractual SLO with consequences for violation.

An error budget is the complement of an SLO: if you commit to 99.9%, you have about 43 minutes of allowed downtime per 30-day month (0.1% of 43,200 minutes). Burn-rate alerts (Google's multi-window, multi-burn-rate pattern, MWMBR) page humans only when you're about to blow the budget, not on every blip. Burn rate is how fast you are consuming the budget relative to the pace that would use it up exactly at the end of the window. A 14.4× burn rate over a 1-hour window means "at this rate you'll exhaust the month's budget in about 2 days" (30 / 14.4 ≈ 2.1), and that one hour has already consumed 2% of it: alert. A 1.5× burn rate over 6 hours means "you'll exhaust the budget in 20 days at this rate" (30 / 1.5), which is also worth alerting on because the issue is sustained.
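The budget and burn-rate arithmetic fits in three small functions. The function names are made up for this sketch, and the 30-day window is an assumption matching the "per month" framing; substitute 28 days if that is your SLO window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed bad minutes in the window for an availability SLO such as 0.999."""
    return (1 - slo) * window_days * 24 * 60


def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """How long a full budget lasts if the current burn rate continues."""
    return window_days / burn_rate


def budget_consumed(burn_rate: float, alert_window_hours: float, window_days: int = 30) -> float:
    """Fraction of the whole budget spent during one alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


budget = error_budget_minutes(0.999)  # roughly 43.2 minutes
fast_days = days_to_exhaustion(14.4)  # 30 / 14.4
fast_spent = budget_consumed(14.4, 1)  # share of budget used in that 1-hour window
slow_days = days_to_exhaustion(1.5)   # 30 / 1.5
slow_spent = budget_consumed(1.5, 6)  # share of budget used in that 6-hour window
```

Pairing each burn rate with its window is what keeps alerts proportional: a fast burn is caught within the hour, while a slow burn has to persist for hours before it registers.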

The three classic monitoring frameworks: RED (Rate, Errors, Duration; for request-driven services), USE (Utilization, Saturation, Errors; for resources), and the four golden signals (latency, traffic, errors, saturation; Google's mash-up). They all describe the same idea: instrument both the request flow and the resources serving it.
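As a small worked example of RED, the sketch below derives rate, error ratio, and a p99 duration from raw request records. The record shape (`status`, `duration_ms`), the choice of 5xx as "error", and the nearest-rank percentile are assumptions for illustration; in production these numbers come from counters and histograms, not from lists of requests.

```python
import math


def red_metrics(requests, window_s: float) -> dict:
    """Compute Rate, Errors, Duration for one endpoint over one time window."""
    n = len(requests)
    if n == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": None}
    errors = sum(1 for r in requests if r["status"] >= 500)
    durations = sorted(r["duration_ms"] for r in requests)
    # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
    p99 = durations[math.ceil(0.99 * n) - 1]
    return {"rate_per_s": n / window_s, "error_ratio": errors / n, "p99_ms": p99}


requests = [
    {"status": 200, "duration_ms": 10},
    {"status": 200, "duration_ms": 20},
    {"status": 503, "duration_ms": 400},
    {"status": 200, "duration_ms": 30},
]
red = red_metrics(requests, window_s=2.0)
```

With only four samples the p99 is simply the slowest request; the value becomes meaningful once a window holds hundreds of requests.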

Tradeoffs and When Less is More

Modern observability is expensive. A typical bill at scale: 30–60% of compute cost. Most of it is metric cardinality and log volume that nobody queries. The discipline is to keep alerting metrics surgical, sample traces aggressively, ship logs only when you need them, and use exemplars to bridge between cheap signals and expensive ones.

The opposite mistake is also common: instrument too lightly, ship to a single tool, then realise after an incident that you have no idea what happened. The right floor for a production service is: RED metrics on every endpoint, structured logs at info+, distributed tracing at 1% head sampling + 100% errors, and continuous profiling. Everything beyond that is workload-specific.

Vendor Stacks vs OpenTelemetry-Native

	Datadog	Honeycomb	Grafana Stack	OTel + OSS
Instrumentation	Datadog SDK + dd-trace	OpenTelemetry-native	OTel + Loki/Tempo SDKs	OpenTelemetry exclusively
Metrics backend	DogStatsD + custom TSDB	Computed from events	Mimir / Prometheus	Prometheus / Mimir / VM
Logs backend	Datadog Logs (custom)	Events (high-cardinality)	Loki	Loki / ClickHouse / OpenSearch
Traces backend	APM (custom storage)	Native event-based	Tempo	Jaeger / Tempo / SigNoz
Continuous profiling	Datadog Profiler	(via OTel)	Pyroscope (acquired)	Pyroscope / Parca
Lock-in	High (custom SDKs)	Low (OTel-native)	Low (OSS, swappable)	None
Cost shape	Per host + per ingested GB	Per event	Self-hosted or Cloud per signal	Self-hosted: infra cost only

FAQ

Should I use OpenTelemetry or my vendor's SDK?

OpenTelemetry, almost always. Even if you're a Datadog/New Relic shop today, instrumenting with OTel + their OTLP receiver gives you the same data with zero lock-in — switch vendors by changing an exporter config. The vendor SDKs are mostly there for legacy reasons or to expose proprietary features (RUM, security signals); use them only when OTel's coverage is insufficient.

Why are my metrics so expensive?

Cardinality. Run a query for series count by metric name. The top 10 metrics are usually 90% of the bill, and they almost always have a label that should never have been one (user_id, request_id, status_message). Drop the label or move that data to logs/traces. A second class of cost: scrape interval. 15 s is fine for almost everything; 1 s is rarely justified.

What's the difference between OpenTelemetry and OpenTracing?

OpenTracing was a tracing-only API that became hard to evolve. OpenCensus was Google's competing project that included metrics. They merged in 2019 to form OpenTelemetry, which now covers traces (stable), metrics (stable), logs (stable), and profiles (in development). OpenTracing is deprecated; bridges exist for migrating instrumentation.

How does tail-based sampling actually work in production?

The OTel Collector or a sampling proxy (Honeycomb Refinery, Datadog's tail sampler) keeps a buffer of all spans for each in-flight trace, hashed by trace ID and routed consistently across collector instances. When the root span ends or a timeout fires, the sampler evaluates configured policies (always sample errors, sample slow latency, head-sample rate for the rest) and emits or drops accordingly. Memory pressure is the failure mode — you must tune the buffer size and timeout aggressively.
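"Routed consistently" means every span of a trace must land on the same sampler instance. One way to get that property is rendezvous (highest-random-weight) hashing, sketched below. This is a swapped-in illustration, not a description of how any particular collector or proxy routes; the collector names are hypothetical.

```python
import hashlib


def pick_collector(trace_id: str, collectors: list) -> str:
    """Rendezvous hashing: each collector scores the trace; the highest score wins."""

    def weight(collector: str) -> bytes:
        return hashlib.sha256(f"{collector}|{trace_id}".encode()).digest()

    return max(collectors, key=weight)


collectors = ["collector-0", "collector-1", "collector-2"]
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
owner = pick_collector(trace_id, collectors)

# Removing an instance that did not own this trace leaves its owner unchanged.
others = [c for c in collectors if c != owner]
remaining = [owner, others[0]]
owner_after_scale_down = pick_collector(trace_id, remaining)
```

Any sender that knows the same instance list picks the same owner without coordination, and scaling the pool only moves the traces that belonged to the instance that changed.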

Do I really need traces if I have detailed logs?

Yes — logs tell you what happened in one process; traces tell you the causal chain across processes. In a 14-microservice architecture, a 95th-percentile latency outlier is impossible to debug from logs alone because you can't tell which downstream call dominated the time. Traces show the waterfall directly. Conversely, traces alone aren't enough either — exact error messages, stack traces, and rare events live in logs.

What are "events" in the Honeycomb sense?

Honeycomb pioneered the wide-event model: instead of separate metrics + logs + traces, every operation emits a single high-cardinality, high-dimensional event with hundreds of attributes. Aggregations are computed on the fly. The benefit is unmatched flexibility — you can group by any attribute after the fact. The cost is that the storage system must handle billions of events with arbitrary group-bys cheaply, which is why Honeycomb built a custom column store. OTel's wide-spans-with-attributes is a partial convergence on this model.

How does eBPF observability compare to traditional instrumentation?

eBPF gives you visibility "for free" — no code changes, no SDK upgrades. It's strongest at network, syscall, and kernel-resource levels. It's weakest at application semantics: it can tell you a process called read() on socket FD 14 for 5 ms; it can't tell you that was a database query for the user's profile. Combine: eBPF for system-level coverage of services you can't instrument, OTel SDKs for everything you own.
