Design a Metrics System — build your own Prometheus

Architecture

Capacity Estimation

Metric	Value	Notes
Active series	~50 M	medium-large org
Samples/s ingested	~5 M	15 s scrape interval
Bytes/sample on disk	~1.3 B	Gorilla / XOR
Daily ingest	~50 GB compressed	500 GB raw
Long-term retention	13 mo	downsampled
Alert rules	~5 K	15–60 s evaluation
Query p99	~1 s	typical dashboard panel

Ingestion: Push vs Pull

Pull (Prometheus) scrapes /metrics on a schedule. Service discovery (Consul, Kubernetes) feeds the target list. Pros: target up/down is implicit (failed scrape = down); back-pressure flows naturally. Cons: targets must be reachable from ingester; short-lived jobs need a Pushgateway.

Push (StatsD, OTLP) is sender-driven. Pros: serverless / batch jobs work natively; targets behind NAT need no inbound port. Cons: target reliability has to be inferred from absence of data; back-pressure is awkward.

Pick pull for long-lived service fleets; push for ephemeral and edge. Most production systems support both: a Prometheus-compatible scrape API plus an OTLP push endpoint.

On-disk TSDB Format

The Prometheus TSDB design (Fabian Reinartz):

WAL — new samples append to a write-ahead log; flushed to disk every batch. Replayed on restart.
Head block — in-memory representation of the current 2-hour window. All recent reads served from here.
Persistent blocks — immutable directories named by their time range, each containing chunks (compressed sample arrays), index (postings list), and meta.json.
Compactor — periodically merges small blocks into larger ones (2h → 6h → 24h), saving file-system overhead.

Compression: Gorilla's XOR encoding for floats (typical metric stays close to neighbors; XOR gives mostly-zero deltas). Timestamps: delta-of-delta encoding (most timestamps are exactly the scrape interval apart; the second derivative is usually zero). Combined: ~1.3 bytes per sample, 100× better than raw float64.

Query Engine: PromQL

PromQL operates on instant vectors (one value per series at one timestamp) and range vectors (a window of values per series). Operators: arithmetic, aggregation (sum, avg, by), rate() over counters, histogram_quantile() on histograms.

The execution model: parse the expression, resolve label matchers via the index (postings list intersection), fetch sample chunks, compute. Hot path optimizations:

Index caching — postings lists are the hot read; cache aggressively in RAM.
Subquery rewriting — rate(...)[5m] at 1m step is much cheaper than evaluated sample-by-sample.
Step alignment — queries align to step boundaries for cache reuse across dashboard panels.

Retention and Downsampling

Storage cost is linear in retention. Solve with tiers:

Raw 15 s samples, 14 d.
5 m rollup — min/max/avg/sum/count per 5 min, 90 d.
1 h rollup — same aggregates per hour, 13 mo.

Compactor builds rollups offline; query engine routes by query window: a panel showing "last 7 days" uses raw, "last quarter" uses 5 m, "last year" uses 1 h.

Downsampling is lossy by design. Document which aggregates exist; rate-of-change queries on hourly rollups are fine, percentile queries cannot be reconstructed from coarse aggregates (need t-digest or similar).

Alerting

Alerting is rule evaluation + notification routing.

Rule evaluator: runs PromQL expressions on a schedule (15 s default). Each result with a label set fires an alert. for: 5m requires the condition to hold for 5 minutes — reduces flap.
Alertmanager: deduplicates (one open alert from many rule firings), groups (one notification per service), routes (DB team gets DB alerts), silences (planned maintenance), inhibits (ignore symptom alerts when the cause alert is firing).

SLO-style alerts: alert when error budget burn rate is too high, multi-window (1h fast burn + 6h slow burn) to balance fast detection with low false-positive rate.

Federation and Horizontal Scale

One Prometheus per cluster works to ~5 M series. Past that:

Hierarchical federation — upper-tier scrapes summarized metrics from lower-tier. Cheap, lossy. Use only for aggregated rollups.
Remote write to Cortex / Mimir / VictoriaMetrics — ingester → distributor (hash by series) → ingester (in-memory) → S3 blocks (long-term). Querier federates across in-memory + S3.

Mimir / Cortex shard by series ID hash; each ingester owns ~3% of the series and replicates to two peers. On crash, the WAL is replayed by the new owner.

Cardinality Limits

The single most common failure: someone adds a high-cardinality label, series count explodes, ingester OOMs. Defenses:

Per-tenant series limits — reject new series past a cap; alert on approach.
Label allow-list / block-list — drop labels matching unsafe patterns at ingest.
Cardinality alerts — alert on rate of new series per scrape; new bad labels are caught before OOM.

Conventions matter: never use unique IDs (user_id, request_id) as labels. Those go in traces and logs.

Failure Modes

Cardinality OOM (above).
Lost samples on crash — raw Prometheus loses 1–2 minutes WAL on hard crash. Remote-write to a durable store handles this.
Scrape duration overrun — target's /metrics takes 30 s to render; scrape times out. Alert on scrape_duration_seconds > threshold.
Alert flap — threshold near steady state; pages every minute. Hysteresis (different open/close thresholds) or long for: window.

FAQ

Why not use a generic time-series DB like InfluxDB?

You can. Prometheus's integrated alerting + service discovery + PromQL is a tight ecosystem; Influx is more general but the Prometheus stack is the de-facto standard for K8s observability.

How do you handle high-cardinality use cases (per-user metrics)?

You don't. Per-user metrics belong in an analytics database (ClickHouse, Druid). Metrics are for fleet-level signals.

Bedrock semantics — what is a counter vs gauge vs histogram?

Counter: monotonic; you query rate(). Gauge: instantaneous value. Histogram: pre-bucketed observations; supports histogram_quantile(). Summary: client-side quantile; cheap but cannot be aggregated across instances.

How do you handle multi-region?

Per-region Prometheus / ingester; remote-write all to a global Cortex/Mimir for cross-region queries. The local store gives 2-hour resilience to global outages.