OpenTelemetry
The vendor-neutral pipeline for metrics, logs, traces (and profiles)
OpenTelemetry (OTel) is the CNCF project that won the observability instrumentation wars. It defines a single API and SDK in every major language, a wire format (OTLP), and a vendor-neutral collector that can receive data from anything and export it to any backend. The promise: instrument your code once, ship to whatever backend you want now or later.
Before OTel, switching from Datadog to Honeycomb meant reinstrumenting every service. Now it's a collector config change. The architecture is three layers: SDKs in your apps, a Collector you run yourself, and exporters from the collector to one or many backends. OTLP carries the data between them. OpAMP increasingly manages the collector fleet centrally.
The Pipeline
Every OTel deployment looks like the same five-stage pipeline: receive, process, route, export, ingest. The shape doesn't change; only the components plugged into each stage do.
The OTLP Protocol
OTLP (OpenTelemetry Protocol) is the wire format. Protobuf schemas define metrics, logs, traces, and (in development) profiles. It runs over gRPC (default port 4317) and over HTTP with either protobuf or JSON payloads (default port 4318).
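To make the HTTP/JSON encoding concrete, here is a minimal sketch that builds one trace export request as plain JSON using only the Python standard library. The helper name `build_trace_request` and the scope name `manual-example` are illustrative assumptions, and a real SDK exporter fills in many more fields.

```python
import json
import secrets
import time


def build_trace_request(service_name: str, span_name: str, duration_ms: int) -> dict:
    """Build a minimal trace export request in OTLP/JSON form (hypothetical helper)."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        # OTLP/JSON writes IDs as hex strings: 16 bytes for traceId, 8 for spanId.
        "traceId": secrets.token_hex(16),
        "spanId": secrets.token_hex(8),
        "name": span_name,
        "kind": 2,  # SPAN_KIND_SERVER
        # 64-bit integers are written as decimal strings in protobuf JSON.
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
    }
    return {
        "resourceSpans": [{
            # Resource: attributes shared by everything this process emits.
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            # Spans are grouped by the instrumentation scope that produced them.
            "scopeSpans": [{
                "scope": {"name": "manual-example"},
                "spans": [span],
            }],
        }]
    }


payload = build_trace_request("billing-api", "GET /invoices", duration_ms=12)
body = json.dumps(payload).encode("utf-8")
print(len(body), "bytes, ready to POST to /v1/traces as application/json")
```

Sending `body` to a collector's OTLP/HTTP receiver is then an ordinary HTTP POST; the nesting mirrors the protobuf messages, with field names in camelCase.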
{`# Top-level OTLP messages
message ExportTraceServiceRequest { repeated ResourceSpans resource_spans = 1; }
message ExportMetricsServiceRequest { repeated ResourceMetrics resource_metrics = 1; }
message ExportLogsServiceRequest { repeated ResourceLogs resource_logs = 1; }

# Each ResourceSpans groups spans by Resource (service.name, k8s.pod.name, ...)
message ResourceSpans {
  Resource resource = 1;                // shared service-level attributes
  repeated ScopeSpans scope_spans = 2;  // grouped by instrumentation library
}

# gRPC service definition
service TraceService {
  rpc Export(ExportTraceServiceRequest) returns (ExportTraceServiceResponse);
}

# HTTP/protobuf
POST /v1/traces HTTP/1.1
Content-Type: application/x-protobuf

# HTTP/JSON (less efficient but easier for browser RUM)
POST /v1/traces HTTP/1.1
Content-Type: application/json

{"resourceSpans": [...]}`}
Resource Detection
Every signal carries a Resource: a set of attributes that describe the producer, such as service.name, service.version, k8s.pod.name, cloud.region, and host.id. The SDK auto-detects these from the environment.
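As a rough illustration of environment-based detection, the sketch below parses the standard `OTEL_SERVICE_NAME` and `OTEL_RESOURCE_ATTRIBUTES` variables into a flat attribute dictionary. The function name `resource_from_env` is an assumption made for this example; real SDKs ship their own detectors and also merge in cloud and Kubernetes attributes.

```python
import os
from typing import Mapping
from urllib.parse import unquote


def resource_from_env(environ: Mapping[str, str] = os.environ) -> dict:
    """Sketch of env-based resource detection (not an SDK API)."""
    attrs = {}
    raw = environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries
        # Values may be percent-encoded, so decode them.
        attrs[key.strip()] = unquote(value.strip())
    service_name = environ.get("OTEL_SERVICE_NAME")
    if service_name:
        # OTEL_SERVICE_NAME wins over service.name in OTEL_RESOURCE_ATTRIBUTES.
        attrs["service.name"] = service_name
    attrs.setdefault("service.name", "unknown_service")
    return attrs


example_env = {
    "OTEL_SERVICE_NAME": "billing-api",
    "OTEL_RESOURCE_ATTRIBUTES": "service.namespace=payments,service.version=v1.42.0",
}
print(resource_from_env(example_env))
```

Passing a dictionary instead of reading `os.environ` directly keeps the function easy to test.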
{`# Environment-based detection
OTEL_SERVICE_NAME=billing-api
OTEL_RESOURCE_ATTRIBUTES=service.namespace=payments,service.version=v1.42.0,deployment.environment=prod
# K8s resource detector populates automatically:
# k8s.pod.name from $POD_NAME
# k8s.namespace.name from $POD_NAMESPACE
# k8s.node.name from $NODE_NAME
# k8s.container.name from runtime API
# Cloud detectors hit instance metadata services:
# cloud.provider = aws | gcp | azure
# cloud.region = us-east-1
# cloud.account.id
# host.id = i-0abcdef...
# Resource attributes are global - applied to every span/metric/log from this process`}
Processors: Where the Pipeline Logic Lives
Processors transform telemetry between receive and export. They run in pipeline order and can drop, modify, enrich, batch, or sample.
batch
Almost always required. Aggregates incoming items into batches before export to reduce network round-trips. By default it flushes at 8192 items or after a 200ms timeout, whichever comes first.
attributes
Insert, update, delete, or hash specific attributes. Useful for redacting PII (delete the user.email attribute) or normalizing values across services.
filter
Drop signals that match an OTTL expression. Reject health-check spans, drop metric series for blocklisted services, suppress noisy log levels.
tail_sampling
Trace-aware sampling. Buffers spans by trace_id and decides keep/drop once decision_wait has elapsed, by which point the trace is assumed complete. Required for "keep all errors + 1% of healthy" policies.
resourcedetection
Adds resource attributes the SDK couldn't detect (e.g., when the SDK runs in a sidecar that lacks cloud metadata access). Run at the collector, applied to all incoming signals.
transform
Run OTTL programs on incoming data: rename attributes, compute new ones, restructure fields. The escape hatch when no other processor fits.
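The tail-sampling idea is easier to see in code. The following is a simplified sketch, not the collector's implementation: it groups already-received spans by trace_id and keeps a trace if any policy matches (error, slow, or a small percentage of the rest). The `Span` class, the use of the longest span as the latency measure, and the hash-based percentage check are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


def in_percentage(trace_id: str, percentage: float) -> bool:
    """Deterministic stand-in for probabilistic sampling: hash trace_id into 10,000 buckets."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100


def keep_trace(spans: list, threshold_ms: float, percentage: float) -> bool:
    """Policies are OR-ed: any match keeps the whole trace."""
    if any(s.is_error for s in spans):
        return True  # keep all errors
    if max(s.duration_ms for s in spans) > threshold_ms:
        return True  # keep slow traces (longest span used as a simplification)
    return in_percentage(spans[0].trace_id, percentage)


def sample_traces(spans: list, threshold_ms: float = 500, percentage: float = 1) -> dict:
    """Buffer spans by trace_id, then decide once per trace."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)
    return {tid: keep_trace(group, threshold_ms, percentage)
            for tid, group in by_trace.items()}


received = [
    Span("t1", "GET /pay", 40, is_error=True),
    Span("t1", "SELECT", 12),
    Span("t2", "GET /report", 900),
    Span("t3", "GET /home", 15),
]
print(sample_traces(received))
```

The decision is made per trace, never per span, which is why every span of a trace has to reach the same collector instance before the decision is taken.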
Collector Pipeline Config
The full mental model: receivers feed processors feed exporters, all stitched together by named pipelines.
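One way to picture that wiring is as plain function composition. The sketch below is a toy model under stated assumptions, not how the collector is built: records are flat dictionaries, a processor returns either a (possibly modified) record or `None` to drop it, and exporters are simple callables. All names here are hypothetical.

```python
from typing import Callable, Iterable, Optional

Record = dict  # one span or log record as a flat attribute dict (illustrative)
Processor = Callable[[Record], Optional[Record]]  # return None to drop the record
Exporter = Callable[[Record], None]


def delete_attribute(key: str) -> Processor:
    """Processor that removes one attribute, e.g. for PII redaction."""
    def process(record: Record) -> Optional[Record]:
        return {k: v for k, v in record.items() if k != key}
    return process


def drop_if(predicate: Callable[[Record], bool]) -> Processor:
    """Processor that drops records matching a predicate."""
    def process(record: Record) -> Optional[Record]:
        return None if predicate(record) else record
    return process


def run_pipeline(records: Iterable[Record],
                 processors: list,
                 exporters: list) -> None:
    """Receivers feed processors (in order), which feed every exporter."""
    for record in records:
        for processor in processors:
            record = processor(record)
            if record is None:
                break  # dropped; later processors and exporters never see it
        else:
            for export in exporters:
                export(record)


exported = []
run_pipeline(
    records=[
        {"url.path": "/healthz"},
        {"url.path": "/pay", "user.email": "a@example.com"},
    ],
    processors=[
        drop_if(lambda r: r.get("url.path") == "/healthz"),
        delete_attribute("user.email"),
    ],
    exporters=[exported.append],
)
print(exported)  # [{'url.path': '/pay'}]
```

Processor order matters in the same way it does in the collector config: dropping records early means later stages do less work.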
{`# otelcol.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  prometheus:
    config:
      scrape_configs:
        - job_name: 'k8s-pods'
          kubernetes_sd_configs: [{ role: pod }]
processors:
  memory_limiter: { check_interval: 1s, limit_mib: 4000 }
  batch: { send_batch_size: 8192, timeout: 1s }
  attributes:
    actions:
      - { key: user.email, action: delete } # PII redaction
      - { key: deployment.environment, value: prod, action: insert }
  tail_sampling:
    decision_wait: 30s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 500 } }
      - { name: rest, type: probabilistic, probabilistic: { sampling_percentage: 1 } }
exporters:
  otlphttp/tempo: { endpoint: http://tempo:4318 }
  prometheusremotewrite: { endpoint: http://mimir:9090/api/v1/push }
  otlphttp/loki: { endpoint: http://loki:3100/otlp }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, tail_sampling, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlphttp/loki]`}
OpAMP: Fleet Management
OpAMP (Open Agent Management Protocol) is the answer to "how do I manage 10,000 collectors without SSH?". A central control plane pushes config, certificates, and binary updates to every agent. Each agent reports back its health, version, and currently active config hash.
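The push-and-report loop can be sketched in a few lines. This is a deliberately simplified model, not the OpAMP wire protocol: the `Agent` class, the dictionary-shaped messages, the `validate` callback, and the choice of SHA-256 for the config hash are all assumptions made for illustration.

```python
import hashlib


def config_hash(config_text: str) -> str:
    """Stable fingerprint of a config document (SHA-256 is an assumption here)."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


class Agent:
    """Hypothetical stand-in for a collector managed by a central control plane."""

    def __init__(self, config_text: str, version: str):
        self.config_text = config_text
        self.version = version

    def status_report(self) -> dict:
        # What the agent reports back: health, version, active config hash.
        return {
            "healthy": True,
            "version": self.version,
            "effective_config_hash": config_hash(self.config_text),
        }

    def handle_remote_config(self, offered_text: str, validate) -> dict:
        """Apply a pushed config only if it differs and passes validation."""
        if config_hash(offered_text) == config_hash(self.config_text):
            outcome = "unchanged"
        elif not validate(offered_text):
            outcome = "rejected"  # keep running the previous config
        else:
            self.config_text = offered_text
            outcome = "applied"
        return {"outcome": outcome, **self.status_report()}


agent = Agent("receivers: {otlp: {}}", version="0.0.0-example")
reply = agent.handle_remote_config(
    "receivers: {otlp: {}, prometheus: {}}",
    validate=lambda text: "receivers" in text,  # placeholder validation
)
print(reply["outcome"], reply["effective_config_hash"][:12])
```

Reporting the hash of the config that is actually running, rather than the one last pushed, is what lets the server detect agents that rejected or never received an update.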
{`# Collector OpAMP extension (subset)
extensions:
  opamp:
    server:
      ws: { endpoint: wss://opamp.example.com/v1/opamp }
    capabilities:
      reports_effective_config: true
      accepts_remote_config: true
      reports_status: true
      accepts_packages: true # binary updates
      reports_package_statuses: true

# Server pushes a new config; collector validates and reloads
# Server pushes a new collector binary (signed); collector swaps and restarts
# Collector reports back its actual running config hash for audit

# Vendor support:
# - BindPlane (observIQ) - OpAMP server with config UI
# - Grafana Agent Cloud
# - Various commercial SaaS`}
Semantic Conventions
The semconv spec standardizes attribute names and values: http.request.method, not method; db.system=postgresql, not db_kind. Backends rely on these names for built-in dashboards.
| Domain | Examples |
|---|---|
| HTTP | http.request.method, http.response.status_code, url.path |
| Database | db.system, db.statement, db.operation |
| Messaging | messaging.system, messaging.destination.name |
| RPC | rpc.system=grpc, rpc.service, rpc.method |
| Resource | service.name, k8s.pod.name, cloud.region |
Semconv has had breaking changes (http.method → http.request.method). Check the maturity tier (Stable / Experimental / Deprecated) before relying on a name.
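A rename like this can be absorbed with a small translation step during a migration window. The sketch below is one possible approach, not an official tool: the `RENAMES` table holds the rename mentioned above plus two further HTTP renames included as examples, and it should be checked against the semconv version you actually target.

```python
# Old name -> new name. Only the first entry comes from the text above;
# the other two are further examples and should be verified for your semconv version.
RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
}


def migrate_attributes(attrs: dict) -> dict:
    """Return a copy of attrs with deprecated names replaced by current ones."""
    migrated = {}
    for key, value in attrs.items():
        new_key = RENAMES.get(key, key)
        if new_key != key and new_key in attrs:
            continue  # the current name is already present; let it win
        migrated[new_key] = value
    return migrated


old_span_attrs = {"http.method": "GET", "http.status_code": 200, "url.path": "/pay"}
print(migrate_attributes(old_span_attrs))
# {'http.request.method': 'GET', 'http.response.status_code': 200, 'url.path': '/pay'}
```

In a collector deployment the same mapping would more naturally live in a transform or attributes processor, so that services still emitting old names keep working with dashboards built on the new ones.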
Tradeoffs & Gotchas
Vendor neutrality is real
Switching from Datadog to Honeycomb is genuinely a config change. The flip side: vendor-specific features (Datadog Watchdog, Honeycomb BubbleUp) require some vendor-specific instrumentation.
Semantic conventions still churn
The spec evolves. Don't pin to attribute names without checking maturity tier. Plan for migration windows when stable conventions change.
Collector is mandatory at scale
SDKs can export directly to vendors, which is fine for small deployments, but at scale you want the collector for batching, tail sampling, redaction, and vendor multiplexing.
Profiles signal is in-flight
Metrics and then logs followed traces to stability, with logs arriving in 2023. The profiles signal is still in development: the protocol exists, but SDK and backend support is uneven. Watch the spec.
FAQ
Why not just use vendor SDKs directly?
Lock-in. Once 200 services have Datadog tracer hard-coded, switching costs a year of engineering. OTel is a one-time investment that buys future flexibility for any backend.
How is the Collector different from Grafana Agent or Vector?
Functionally similar: all three are telemetry pipelines. OTel Collector is vendor-neutral (CNCF). Grafana Agent (since succeeded by Grafana Alloy) is OTel-Collector-derived but tuned for the Grafana stack. Vector is Datadog's logs-first daemon. They overlap heavily; OTel Collector has the broadest format support.
Should I run the collector as a sidecar or daemonset?
Daemonset (one per node) for most cases — simpler ops, lower resource overhead. Sidecar for tight isolation or per-app config. Many large deployments add a second tier of "gateway" collectors that sidecars/daemonsets fan into.
What's auto-instrumentation vs manual?
Auto attaches at runtime to known frameworks (HTTP servers, DB clients) and creates spans without code changes. Manual is when you call tracer.start_span() yourself for business logic. Use both: auto for plumbing, manual for domain.
How big is the SDK overhead?
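Roughly ~5MB of binary size and ~2-5% CPU at typical sampling rates, though both vary with language, instrumentation coverage, and traffic, so measure in your own services. The Java agent attaches via the instrumentation API (no code changes); the Go SDK is compile-time. eBPF-based auto-instrumentation (in development) has near-zero in-process overhead.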
What about logs — does OTel replace Loki/Elasticsearch?
No. OTel is the pipeline; Loki/Elasticsearch is the backend. OTel's logs SDK and OTLP/logs receiver get your structured logs from app to backend with trace_id correlation. The query side is still Loki, Elastic, or whatever.