OpenTelemetry

The vendor-neutral pipeline for metrics, logs, traces (and profiles)

OpenTelemetry (OTel) is the CNCF project that won the observability instrumentation wars. It defines a single API and SDK in every major language, a wire format (OTLP), and a vendor-neutral collector that can receive data from anything and export it to any backend. The promise: instrument your code once, ship to whatever backend you want now or later.

Before OTel, switching from Datadog to Honeycomb meant reinstrumenting every service. Now it's a collector config change. The architecture is three layers: SDKs in your apps, a Collector you run yourself, and exporters from the collector to one or many backends. OTLP carries the data between them. OpAMP increasingly manages the collector fleet centrally.

The Pipeline

Every OTel deployment looks like the same five-stage pipeline: receive, process, route, export, ingest. The shape doesn't change; only the components plugged into each stage do.

Key Numbers

officially supported language SDKs

~80

collector receivers in contrib distro

~70

collector exporters in contrib

4317

default OTLP gRPC port

4318

default OTLP HTTP port

512

default max export batch size in BatchSpanProcessor

CNCF

incubating, second most active project after Kubernetes

The OTLP Protocol

OTLP (OpenTelemetry Protocol) is the wire format. Protobuf schemas define metrics, logs, traces, and (in development) profiles. Available over gRPC (port 4317) and HTTP/protobuf or HTTP/JSON (port 4318).

{`# Top-level OTLP messages
message ExportTraceServiceRequest    { repeated ResourceSpans   resource_spans = 1; }
message ExportMetricsServiceRequest  { repeated ResourceMetrics resource_metrics = 1; }
message ExportLogsServiceRequest     { repeated ResourceLogs    resource_logs = 1; }

# Each ResourceSpans groups spans by Resource (service.name, k8s.pod.name, ...)
message ResourceSpans {
  Resource   resource    = 1;       // shared service-level attributes
  repeated ScopeSpans scope_spans = 2;  // grouped by instrumentation library
}

# gRPC service definition
service TraceService {
  rpc Export(ExportTraceServiceRequest) returns (ExportTraceServiceResponse);
}

# HTTP/protobuf (body is the binary-encoded ExportTraceServiceRequest)
POST /v1/traces HTTP/1.1
Content-Type: application/x-protobuf

# HTTP/JSON (less efficient but easier for browser RUM)
POST /v1/traces HTTP/1.1
Content-Type: application/json

{"resourceSpans": [...]}`}
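To make the HTTP/JSON shape concrete, here is a minimal sketch that builds an ExportTraceServiceRequest by hand using only the Python standard library. The endpoint `http://localhost:4318`, the scope name `example.manual`, and the helper names are assumptions for illustration; a real service would normally let an OTel SDK exporter produce this payload.

```python
import json
import secrets
import time
import urllib.request


def build_trace_request(service_name: str, span_name: str) -> dict:
    """Build a minimal ExportTraceServiceRequest in OTLP/JSON form."""
    end_ns = time.time_ns()
    start_ns = end_ns - 25_000_000  # pretend the span took 25 ms
    span = {
        "traceId": secrets.token_hex(16),  # 16 bytes -> 32 hex characters
        "spanId": secrets.token_hex(8),    # 8 bytes  -> 16 hex characters
        "name": span_name,
        "kind": 2,  # SPAN_KIND_SERVER
        # 64-bit integers are written as decimal strings in OTLP/JSON
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [
            {"key": "http.request.method", "value": {"stringValue": "GET"}},
        ],
    }
    return {
        "resourceSpans": [{
            # Resource attributes are shared by every span in this group
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            # "example.manual" is a hypothetical instrumentation scope name
            "scopeSpans": [{"scope": {"name": "example.manual"}, "spans": [span]}],
        }]
    }


def make_http_request(payload: dict,
                      endpoint: str = "http://localhost:4318") -> urllib.request.Request:
    """Prepare (but do not send) the POST /v1/traces request."""
    return urllib.request.Request(
        url=endpoint + "/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would deliver it to a collector listening on the OTLP HTTP port.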

Resource Detection

Every signal carries a Resource: a set of attributes that describe the producer, such as service.name, service.version, k8s.pod.name, cloud.region, and host.id. Resource detectors in the SDK fill in most of these from the environment.

{`# Environment-based detection
OTEL_SERVICE_NAME=billing-api
OTEL_RESOURCE_ATTRIBUTES=service.namespace=payments,service.version=v1.42.0,deployment.environment=prod

# K8s resource detector populates automatically:
#   k8s.pod.name        from $POD_NAME
#   k8s.namespace.name  from $POD_NAMESPACE
#   k8s.node.name       from $NODE_NAME
#   k8s.container.name  from runtime API

# Cloud detectors hit instance metadata services:
#   cloud.provider     = aws | gcp | azure
#   cloud.region       = us-east-1
#   cloud.account.id
#   host.id            = i-0abcdef...

# Resource attributes are global - applied to every span/metric/log from this process`}
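The environment-based part of detection can be sketched in a few lines. This is a simplified stand-in for what an SDK does, not the SDK's actual code: the function name `resource_from_env` is hypothetical, and the sketch assumes comma-separated `key=value` pairs with percent-encoded values, with `OTEL_SERVICE_NAME` taking precedence over a `service.name` entry in `OTEL_RESOURCE_ATTRIBUTES`.

```python
import os
from urllib.parse import unquote


def resource_from_env(environ=None) -> dict:
    """Merge OTEL_RESOURCE_ATTRIBUTES and OTEL_SERVICE_NAME into one dict."""
    environ = os.environ if environ is None else environ
    attrs = {}
    raw = environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        attrs[key.strip()] = unquote(value.strip())
    service_name = environ.get("OTEL_SERVICE_NAME")
    if service_name:
        attrs["service.name"] = service_name  # explicit name wins
    # Fallback used when nothing names the service
    attrs.setdefault("service.name", "unknown_service")
    return attrs
```

Calling `resource_from_env()` with no argument reads the real process environment; passing a plain dict makes the behavior easy to test.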

Processors: Where the Pipeline Logic Lives

Processors transform telemetry between receive and export. They run in pipeline order and can drop, modify, enrich, batch, or sample.

batch

Almost always required. Aggregates incoming items into batches before export to reduce network round-trips. Defaults: a send_batch_size of 8192 items, with a 200ms timeout that flushes a batch that has not filled.

attributes

Insert, update, delete, or hash specific attributes. Useful for redacting PII (delete the user.email attribute) or normalizing values across services.

filter

Drop signals that match an OTTL expression. Reject health-check spans, drop metric series for blocklisted services, suppress noisy log levels.

tail_sampling

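Trace-aware sampling. Buffers spans by trace_id and decides keep/drop once decision_wait has elapsed, when the trace is assumed complete. All spans of a trace must reach the same collector instance. Required for "keep all errors + 1% of healthy" policies.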

resourcedetection

Adds resource attributes the SDK couldn't detect (e.g., when the SDK runs in a sidecar that lacks cloud metadata access). Run at the collector, applied to all incoming signals.

transform

Run OTTL programs on incoming data: rename attributes, compute new ones, restructure fields. The escape hatch when no other processor fits.
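The "run in pipeline order" idea can be illustrated with a toy model in which each processor is a function that returns a (possibly modified) span or `None` to drop it. This is only a sketch of the concept, not collector code: the span dictionaries, the `/healthz` path, and all function names are assumptions, and the batching step flushes on size only (the real batch processor also flushes on a timeout).

```python
from typing import Callable, List, Optional

Span = dict
Processor = Callable[[Span], Optional[Span]]


def drop_health_checks(span: Span) -> Optional[Span]:
    """filter-style step: drop spans for a hypothetical health endpoint."""
    if span.get("attributes", {}).get("url.path") == "/healthz":
        return None
    return span


def redact_email(span: Span) -> Optional[Span]:
    """attributes-style step: delete user.email without mutating the input."""
    attrs = {k: v for k, v in span.get("attributes", {}).items() if k != "user.email"}
    return {**span, "attributes": attrs}


def run_pipeline(spans: List[Span], processors: List[Processor],
                 batch_size: int) -> List[List[Span]]:
    """Apply processors in order, then group the survivors into batches."""
    batches: List[List[Span]] = []
    current: List[Span] = []
    for span in spans:
        result: Optional[Span] = span
        for proc in processors:
            result = proc(result)
            if result is None:
                break  # dropped: later processors never see it
        if result is None:
            continue
        current.append(result)
        if len(current) >= batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush the partial batch at the end
    return batches
```

Putting the drop step first means later processors never spend work on spans that will be discarded, which is the main reason ordering matters.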

Collector Pipeline Config

The full mental model: receivers feed processors feed exporters, all stitched together by named pipelines.

{`# otelcol.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  prometheus:
    config:
      scrape_configs:
        - job_name: 'k8s-pods'
          kubernetes_sd_configs: [{ role: pod }]

processors:
  memory_limiter: { check_interval: 1s, limit_mib: 4000 }
  batch: { send_batch_size: 8192, timeout: 1s }
  attributes:
    actions:
      - { key: user.email, action: delete }    # PII redaction
      - { key: deployment.environment, value: prod, action: insert }
  tail_sampling:
    decision_wait: 30s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow,   type: latency,     latency: { threshold_ms: 500 } }
      - { name: rest,   type: probabilistic, probabilistic: { sampling_percentage: 1 } }

exporters:
  otlphttp/tempo:    { endpoint: http://tempo:4318 }
  prometheusremotewrite: { endpoint: http://mimir:9090/api/v1/push }
  otlphttp/loki:     { endpoint: http://loki:3100/otlp }

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, attributes, tail_sampling, batch]
      exporters:  [otlphttp/tempo]
    metrics:
      receivers:  [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters:  [prometheusremotewrite]
    logs:
      receivers:  [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters:  [otlphttp/loki]`}
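The three tail_sampling policies in the config above amount to a per-trace decision made once the spans have been buffered. The sketch below mirrors that logic in plain Python as an illustration only; the span fields (`status`, `start_ms`, `end_ms`), the function name, and the hash-based 1% bucket are assumptions, not the processor's implementation.

```python
import hashlib
from typing import List, Tuple


def keep_trace(trace_id: str, spans: List[dict],
               latency_threshold_ms: float = 500,
               sampling_percentage: float = 1.0) -> Tuple[bool, str]:
    """Return (keep, policy_name) for one fully buffered trace."""
    # Policy "errors": keep any trace that contains an error span
    if any(s.get("status") == "ERROR" for s in spans):
        return True, "errors"
    # Policy "slow": trace duration from earliest start to latest end
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= latency_threshold_ms:
        return True, "slow"
    # Policy "rest": deterministic bucket in [0, 10000) derived from trace_id,
    # so the same trace always gets the same answer
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
    if bucket < sampling_percentage * 100:
        return True, "rest"
    return False, "dropped"
```

Because the decision needs every span of a trace, it can only run after the spans have been grouped by trace_id, which is why the processor has to hold them for decision_wait.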

OpAMP: Fleet Management

OpAMP (Open Agent Management Protocol) is the answer to "how do I manage 10,000 collectors without SSH?". A central control plane pushes config, certificates, and binary updates to every agent. Each agent reports back its health, version, and currently active config hash.

{`# Collector OpAMP extension (subset)
extensions:
  opamp:
    server:
      ws: { endpoint: wss://opamp.example.com/v1/opamp }
    capabilities:
      reports_effective_config:    true
      accepts_remote_config:       true
      reports_status:              true
      accepts_packages:            true       # binary updates
      reports_package_statuses:    true

# Server pushes a new config; collector validates and reloads
# Server pushes a new collector binary (signed); collector swaps and restarts
# Collector reports back its actual running config hash for audit

# Vendor support:
#   - BindPlane (observIQ) - OpAMP server with config UI
#   - Grafana Agent Cloud
#   - Various commercial SaaS`}
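The push-validate-report loop described in the comments can be modeled in miniature. This is a hypothetical sketch of the idea, not the OpAMP message format or any real client library: the `Agent` class, its fields, and the `validate` callback are all invented for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def config_hash(config_text: str) -> str:
    """Stable fingerprint of a config, used for audit and change detection."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


@dataclass
class Agent:
    effective_config: str
    version: str = "0.0.0-example"  # placeholder version string

    def status_report(self) -> dict:
        """What the agent reports back to the control plane."""
        return {
            "version": self.version,
            "healthy": True,
            "config_hash": config_hash(self.effective_config),
        }

    def apply_remote_config(self, new_config: str,
                            validate: Callable[[str], bool]) -> bool:
        """Apply a pushed config; return True only if it was accepted."""
        if config_hash(new_config) == config_hash(self.effective_config):
            return False  # already running this config
        if not validate(new_config):
            return False  # reject and keep the last known-good config
        self.effective_config = new_config
        return True
```

The reported hash is what lets a server confirm that the config it pushed is the one actually running, rather than trusting that the push succeeded.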

Semantic Conventions

The semconv spec standardizes attribute names and values: http.request.method, not method; db.system=postgresql, not db_kind. Backends rely on these names for built-in dashboards.

Domain	Examples
HTTP	`http.request.method`, `http.response.status_code`, `url.path`
Database	`db.system.name`, `db.query.text`, `db.operation.name` (formerly `db.system`, `db.statement`, `db.operation`)
Messaging	`messaging.system`, `messaging.destination.name`
RPC	`rpc.system=grpc`, `rpc.service`, `rpc.method`
Resource	`service.name`, `k8s.pod.name`, `cloud.region`

Semconv has had breaking changes (http.method → http.request.method). Check the maturity tier (Stable / Experimental / Deprecated) before relying on a name.
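During a migration window, a small rename table is one way to bring old attribute names up to date before they reach a backend. The sketch below is an illustration under assumptions: the function name is hypothetical, and the table lists only a few HTTP renames (`http.method`, `http.status_code`, `http.url`), not the full set.

```python
# Old name -> current name for a few HTTP attributes
RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
}


def migrate_attributes(attrs: dict) -> dict:
    """Return a copy of attrs with old names replaced by their current names."""
    migrated = {}
    for key, value in attrs.items():
        new_key = RENAMES.get(key, key)
        # If both the old and new names are present, the value already
        # stored under the new name wins regardless of iteration order.
        if key != new_key and new_key in migrated:
            continue
        migrated[new_key] = value
    return migrated
```

Attributes that are not in the table pass through unchanged, so the function is safe to run on data that has already been migrated.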

Tradeoffs & Gotchas

Vendor neutrality is real

Switching from Datadog to Honeycomb is genuinely a config change. The flip side: vendor-specific features (Datadog Watchdog, Honeycomb BubbleUp) require some vendor-specific instrumentation.

Semantic conventions still churn

The spec evolves. Don't pin to attribute names without checking their maturity tier. Plan for migration windows when stable conventions change.

Collector is mandatory at scale

SDKs can export directly to a vendor, which works fine for small deployments, but at scale you want the collector for batching, tail sampling, redaction, and vendor multiplexing.

Profiles signal is in-flight

Traces stabilized first; metrics and logs followed, with logs reaching stability in 2023. The profiles signal is in development: the protocol exists, but SDK and backend support is uneven. Watch the spec.

FAQ

Why not just use vendor SDKs directly?

Lock-in. Once 200 services have the Datadog tracer hard-coded, switching can cost a year of engineering. OTel is a one-time investment that buys future flexibility for any backend.

How is the Collector different from Grafana Agent or Vector?

Functionally similar: all three are telemetry pipelines. The OTel Collector is vendor-neutral (CNCF). Grafana Agent is OTel-Collector-derived but tuned for the Grafana stack (Grafana has since named Grafana Alloy as its successor). Vector is Datadog's logs-first daemon. They overlap heavily; the OTel Collector has the broadest format support.

Should I run the collector as a sidecar or daemonset?

DaemonSet (one per node) for most cases: simpler ops, lower resource overhead. Sidecar for tight isolation or per-app config. Many large deployments add a second tier of "gateway" collectors that sidecars and DaemonSet collectors fan into.

What's auto-instrumentation vs manual?

Auto attaches at runtime to known frameworks (HTTP servers, DB clients) and creates spans without code changes. Manual is when you call tracer.start_span() yourself for business logic. Use both: auto for plumbing, manual for domain.
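The difference can be shown with a toy tracer. Everything here is a stand-in invented for illustration (the `start_span` helper, the `FINISHED` list, and the `auto:`/`manual:` prefixes); it is not the OTel API, only a sketch of how wrapping existing functions differs from writing a span by hand.

```python
import functools
from contextlib import contextmanager

FINISHED = []  # names of spans, in the order they finished


@contextmanager
def start_span(name):
    """Toy span: records its name when the block exits."""
    try:
        yield
    finally:
        FINISHED.append(name)


def auto_instrument(func):
    """Wrap an existing function in a span without editing its body."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with start_span(f"auto:{func.__name__}"):
            return func(*args, **kwargs)
    return wrapper


def apply_discount(total):
    # Manual instrumentation: the span is written into the business logic
    with start_span("manual:apply_discount"):
        return round(total * 0.9, 2)


def handle_request(order_total):
    # "Framework" code: contains no tracing calls of its own
    return apply_discount(order_total)


# Auto-instrumentation: patch the function at startup, source unchanged
handle_request = auto_instrument(handle_request)
```

After one call to `handle_request`, `FINISHED` holds the manual span followed by the auto span, since the inner span closes first.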

How big is the SDK overhead?

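Overhead varies by language and by how much is instrumented. Figures around 5MB of added binary size and 2-5% CPU at typical sampling rates are a rough guide, so measure in your own services. The Java agent attaches through the JVM instrumentation API (no code changes); the Go SDK is wired in at compile time. eBPF-based auto-instrumentation (in development) aims for near-zero in-process overhead.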

What about logs — does OTel replace Loki/Elasticsearch?

No. OTel is the pipeline; Loki/Elasticsearch is the backend. OTel's logs SDK and OTLP/logs receiver get your structured logs from app to backend with trace_id correlation. The query side is still Loki, Elastic, or whatever.