Prometheus Internals

Prometheus is the de-facto standard time-series database for cloud-native infrastructure. Its core insight is the inversion of the old monitoring model: instead of agents pushing metrics to a central collector, Prometheus pulls — it discovers targets via service discovery, scrapes them on a schedule, and stores everything in a local, single-binary TSDB. The trade-offs in that one decision propagate through every layer of the system, from the on-disk chunk format to the way you write PromQL alerts.

Prometheus was originally built at SoundCloud starting in 2012, modeled after Google's Borgmon, and was donated to the CNCF in 2016. It is now widely used for monitoring Kubernetes environments.

Prometheus Architecture Overview

[Figure: Prometheus architecture. Targets (node_exporter :9100, app /metrics :8080, kube-state-metrics, cAdvisor :4194, blackbox_exporter) expose HTTP /metrics. Inside Prometheus: service discovery (k8s/Consul/file_sd) → scrape manager (every 15-60s) → append to WAL and head chunks → TSDB blocks (2h compaction) → PromQL engine + rule manager, all on local disk under data/. Outputs: Grafana (PromQL queries), Alertmanager (dedup + routing to PagerDuty / Slack), remote write to Thanos / Mimir / Cortex, and federation via /federate scrape. PULL: Prometheus scrapes targets · PUSH: Alertmanager fans out incidents · STREAM: remote_write to long-term store]

Key Numbers

Default scrape interval: 1m (15s is a common setting in practice)
Sample size on disk: ~1.3 bytes
Block duration: 2h initial
Default retention: 15 days
Single-instance ceiling: ~10M active series
WAL truncation: 3h
Compaction levels: 2h → 6h → 18h → 54h

Why Prometheus Exists

Push Was Failing
StatsD-style push monitoring made every application responsible for knowing where the metrics server lived, what the schema was, and what happened when the server was overwhelmed. Auto-scaling fleets made target lists ephemeral. The push agent often dropped data silently.
Pull Inverts Ownership
In a pull model, monitoring owns the truth about what is monitored. Service discovery feeds Prometheus the current target list; Prometheus decides scrape cadence; targets just expose /metrics. A target that can't be scraped shows up as up == 0 — which is itself a metric you can alert on.
Labels Replace Hierarchies
Graphite's dot-separated metric names couldn't express that a counter belonged to multiple dimensions (host, region, version, customer). Prometheus's label model gives every sample an arbitrary set of key=value pairs, and PromQL turns aggregation across labels into one-line expressions: sum by (region) (rate(http_requests_total[5m])).

The Label Model and the Cardinality Cliff

Every Prometheus sample is identified by a metric name plus a set of label key=value pairs. The pair (http_requests_total, {method="GET", handler="/api/orders", status="200"}) is a distinct time series; change any label value and you get a different series. The TSDB indexes by series ID, not by row, so the unit of work is a series — and the unit of cost is the count of distinct label combinations, called cardinality.

Cardinality is the load-bearing concept of Prometheus operations. The TSDB head holds one in-memory chunk per active series. At ~1 KiB per chunk plus index entries, a million active series cost roughly 2-4 GiB of RAM. Ten million is the practical ceiling for a single instance. Past that you either shard, federate, or move to a horizontally-scaled remote-write target like Mimir or VictoriaMetrics.

The cardinality cliff is rarely caused by adding a new metric. It is caused by adding a label with many distinct values: a customer ID, a request URL with embedded IDs, a Kubernetes pod name (which churns every deploy), an email address. The classic footgun is something like http_requests_total{url="/orders/12345"}: every order ID becomes a new series, and each one stays in the head until it stops receiving samples and the head is truncated. On a busy service, head memory can balloon within days.
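To make the multiplication concrete, here is a minimal Python sketch. The label names and value lists are hypothetical, and the product is an upper bound, since not every combination necessarily occurs in practice.

```python
def series_count(label_values: dict[str, list[str]]) -> int:
    """Upper bound on distinct series for one metric:
    the product of the number of distinct values per label."""
    count = 1
    for values in label_values.values():
        count *= len(values)
    return count


# Hypothetical, bounded label sets for http_requests_total.
bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "handler": ["/api/orders", "/api/users", "/healthz"],
    "status": ["200", "400", "404", "500"],
}

# Same metric, but the raw URL (with an embedded order ID) is added as a label.
unbounded = dict(bounded, url=[f"/orders/{i}" for i in range(10_000)])

print(series_count(bounded))    # 4 * 3 * 4 = 48
print(series_count(unbounded))  # 48 * 10_000 = 480000
```

One unbounded label multiplies everything already there, which is why dropping it at scrape time is usually the first fix.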

Operational hygiene: set sample_limit on scrape jobs, cap query cost with --query.max-samples, watch prometheus_tsdb_head_series, alert when scrape_samples_post_metric_relabeling grows unexpectedly relative to scrape_samples_scraped, drop high-cardinality labels at scrape time with metric_relabel_configs, and bucket continuous values (latency in histogram buckets, not raw microseconds).

The TSDB On-Disk Format

A Prometheus data directory contains:

data/
├── 01HQXZ.../              # immutable block (2h or longer)
│   ├── chunks/
│   │   ├── 000001          # 512 MiB max, append-only
│   │   └── 000002
│   ├── index               # postings + label index + series table
│   ├── meta.json           # block ID, time range, stats
│   └── tombstones          # logical deletes (rare)
├── 01HQ.../                # next block
├── chunks_head/            # mmap'd head chunks (active series)
└── wal/                    # write-ahead log
    ├── 00000123
    ├── 00000124
    └── checkpoint.000122/

A block is the unit of immutable storage. Each block covers a fixed time range: initially 2 hours, then compacted into 6h, 18h, and finally 54h blocks. Inside a block, chunks hold the actual samples for one series. Each chunk uses Gorilla-style compression: timestamps are stored as delta-of-delta, values as the XOR with the previous value, and only the meaningful bits between the leading and trailing zeros are written. This achieves ~1.3 bytes per sample on typical telemetry. A naive (timestamp, value) pair would cost 16 bytes, so the encoding is roughly a 12x improvement and drives the storage economics of the whole system.
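The intuition behind Gorilla-style encoding can be sketched in Python. This is not the actual chunk encoder (it writes no bit stream); it only computes the quantities the encoder works with: delta-of-delta for timestamps, and the leading/trailing zero counts of the XOR between consecutive float values. The function names are illustrative.

```python
import struct


def float_bits(x: float) -> int:
    """IEEE 754 double reinterpreted as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def delta_of_deltas(timestamps: list[int]) -> list[int]:
    """Second-order differences; zero whenever the scrape interval is steady."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def xor_stats(prev: float, cur: float) -> tuple[int, int, int]:
    """(leading zeros, trailing zeros, meaningful bits) of prev XOR cur."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return 64, 0, 0  # identical value: nothing meaningful to store
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return leading, trailing, 64 - leading - trailing


# Millisecond timestamps from a 15s scrape with 1 ms of jitter on one sample.
print(delta_of_deltas([1000, 16000, 31000, 46001, 61001]))  # [0, 1, -1]
print(xor_stats(42.0, 42.0))  # (64, 0, 0)
print(xor_stats(1.0, 2.0))    # (1, 52, 11)
```

Small delta-of-delta values and short runs of meaningful bits are what a variable-length bit encoding can store in far fewer than 16 bytes.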

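The index is more interesting. It contains the postings lists (for each label name/value pair, the sorted IDs of the series that carry it), the label index, the series table, and a symbol table of the strings they share.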

A query like {job="api", region="us-east"} intersects two postings lists by series ID. Multi-label equality queries are cheap because every label name/value pair has its own sorted postings list, so they reduce to sorted-set intersection.
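A minimal Python sketch of that intersection follows. The postings map and series IDs are made up for illustration, and the real index stores these lists in a binary format rather than a dictionary.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted lists of series IDs in O(len(a) + len(b))."""
    i = j = 0
    out: list[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical postings: (label name, label value) -> sorted series IDs.
postings = {
    ("job", "api"): [1, 4, 7, 9, 12],
    ("region", "us-east"): [2, 4, 9, 10, 12, 15],
}

matching = intersect(postings[("job", "api")], postings[("region", "us-east")])
print(matching)  # [4, 9, 12]
```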

The WAL is the durability layer. Every appended sample, series creation, and deletion tombstone is written to the WAL before the append is acknowledged. On startup, Prometheus replays the WAL into the head, reconstructing in-memory state. WAL segments are 128 MiB each; checkpointing compacts old segments whose data has already been flushed into a block, keeping only the records that are still needed.

Compaction: From Head to 54-Hour Blocks

The head is the in-memory representation of the most recent ~3 hours of data. Once a 2-hour window closes, the head's chunks for that window are flushed to a new persistent block. The compactor then runs in the background, merging adjacent blocks into progressively larger ones (2h into 6h, then 18h, then 54h).

Compaction merges chunks per-series, removes tombstoned samples, rebuilds the index with the union of label sets, and rewrites the symbol table. The result is fewer, larger, better-compressed blocks — and dramatically cheaper queries, because a 30-day query no longer touches 360 small blocks but ~14 large ones.
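The block-count arithmetic behind that claim can be checked with a short Python sketch. It ignores block alignment and any retention-based cap on block size, so treat the results as estimates.

```python
import math


def blocks_touched(query_hours: float, block_hours: float) -> int:
    """Approximate number of blocks a query spanning query_hours must open."""
    return math.ceil(query_hours / block_hours)


query_hours = 30 * 24  # a 30-day query

print(blocks_touched(query_hours, 2))   # 360 uncompacted 2h blocks
print(blocks_touched(query_hours, 54))  # 14 fully compacted 54h blocks
```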

Compaction is I/O-heavy and CPU-heavy. Production deployments size disks for ~3x peak block size (write amplification during merge) and watch prometheus_tsdb_compactions_total and prometheus_tsdb_compaction_duration_seconds for drift.

PromQL Execution: Instant Vectors, Range Vectors, rate()

PromQL has two core type concepts — the instant vector (one sample per series at one timestamp) and the range vector (a window of samples per series). You cannot graph a range vector directly; you must collapse it with a function like rate(), increase(), or avg_over_time().

The single most-used function is rate(metric[5m]). It computes the per-second slope of a counter over the last 5 minutes, accounting for counter resets (a counter that goes backward is treated as a reset to zero, not a negative slope). Internally, rate() uses extrapolation: it computes (last - first) / (range - first_offset - last_offset) and extrapolates to the full window if the first/last samples don't quite cover it. This makes rate() robust to scrape jitter but means rate(x[1m]) on a 15s scrape is much noisier than rate(x[5m]).
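The reset handling and extrapolation can be sketched in Python. This is a simplified model, not the engine's code: the 1.1x threshold and the half-interval fallback reflect my understanding of how Prometheus has implemented extrapolation and may differ between versions, and the clamp that stops a counter from extrapolating below zero is omitted. Samples are (timestamp_seconds, value) pairs inside the window.

```python
def counter_increase(samples: list[tuple[float, float]]) -> float:
    """Total increase across samples; a drop is treated as a reset to zero."""
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        total += value - prev if value >= prev else value
        prev = value
    return total


def simple_rate(samples, range_start: float, range_end: float):
    """Per-second rate over [range_start, range_end], with edge extrapolation."""
    if len(samples) < 2:
        return None  # rate() needs at least two points
    first_t, last_t = samples[0][0], samples[-1][0]
    sampled = last_t - first_t
    avg_gap = sampled / (len(samples) - 1)
    threshold = avg_gap * 1.1  # assumed threshold, see lead-in

    to_start = first_t - range_start
    to_end = range_end - last_t
    extrapolated = sampled
    # Extend to the window edge if the gap is small, else by half an interval.
    extrapolated += to_start if to_start < threshold else avg_gap / 2
    extrapolated += to_end if to_end < threshold else avg_gap / 2

    increase = counter_increase(samples) * (extrapolated / sampled)
    return increase / (range_end - range_start)


# A counter rising by 1/s, scraped every 15s, queried over a 60s window.
steady = [(2, 100.0), (17, 115.0), (32, 130.0), (47, 145.0)]
print(simple_rate(steady, 0, 60))  # approximately 1.0

# A counter that restarts between the second and third scrape.
print(counter_increase([(0, 10.0), (15, 20.0), (30, 5.0), (45, 15.0)]))  # 25.0
```

With only two or three samples in the window, a single jittery scrape moves avg_gap and the extrapolated span noticeably, which is one way to see why short ranges are noisier.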

The cousin irate() uses only the last two samples, giving a much more reactive (and noisier) signal. Use irate for short-term graphs, rate for alerting and dashboards.

Histograms get their own machinery. A histogram exposes a set of bucketed counters (http_request_duration_seconds_bucket{le="0.1"}, le="0.5", etc.), and histogram_quantile(0.99, sum by (le) (rate(...[5m]))) linearly interpolates within the appropriate bucket to estimate the 99th percentile latency. Prometheus 2.40 introduced native histograms as an experimental feature; they store an exponential bucket schema in a single sample, replacing the multi-series classical histogram at roughly 10x less storage and with far better quantile fidelity.
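The interpolation step for classical histograms can be sketched in Python. This is a simplified model with a few assumptions: buckets are given as sorted (upper_bound, cumulative_count) pairs ending in +Inf, the lowest bucket is assumed to start at 0, and a quantile that lands in the +Inf bucket returns the highest finite bound.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative le-buckets by linear interpolation."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper):
                return lower_bound  # cannot interpolate into +Inf
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return lower_bound


# Hypothetical per-bucket rates for a latency histogram (seconds).
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]

print(histogram_quantile(0.5, buckets))   # about 0.1
print(histogram_quantile(0.75, buckets))  # about 0.35
```

The estimate can never be more precise than the bucket it falls in, so bucket boundaries should bracket the latencies you actually alert on.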

Recording Rules and Alerting Rules

Querying histogram_quantile(0.99, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))) on every dashboard refresh is expensive. Recording rules evaluate that expression every evaluation_interval (1m by default, often set to 15s) and write the result back to the TSDB as a new series, typically named job:http_p99:5m by convention. Dashboards then query the cheap pre-computed series, not the raw histogram.

Alerting rules have the same shape but trigger an alert when their expression returns a result continuously for at least the duration given in the for: field (for example, for: 5m). The alert is sent to Alertmanager, which handles deduplication (multiple Prometheus replicas firing the same alert), grouping (collapse 100 pod alerts into one cluster alert), routing (which Slack channel, which PagerDuty service), inhibition (suppress downstream alerts when an upstream is firing), and silencing (a human-set time-bound mute).

groups:
- name: api_slo
  rules:
  - record: job:http_request_rate:5m
    expr: sum by (job, route) (rate(http_requests_total[5m]))
  - alert: HighErrorRate
    expr: |
      sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
        /
      sum by (job) (rate(http_requests_total[5m])) > 0.05
    for: 10m
    labels: {severity: page}
    annotations:
      summary: "{{ $labels.job }} 5xx rate above 5%"
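How the for: clause gates an alert can be sketched as a small state machine in Python. This is an illustrative model for a single alert instance, not the rule manager's code; the state names match the ones Prometheus displays (inactive, pending, firing).

```python
def alert_states(evaluations: list[tuple[int, bool]], for_seconds: int) -> list[str]:
    """evaluations: (timestamp_seconds, expression_returned_a_result) per rule run."""
    active_since = None
    states = []
    for ts, is_active in evaluations:
        if not is_active:
            active_since = None  # any gap resets the clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Rule evaluated every 60s with for: 2m.
runs = [(0, True), (60, True), (120, True), (180, False), (240, True)]
print(alert_states(runs, for_seconds=120))
# ['pending', 'pending', 'firing', 'inactive', 'pending']
```

A single evaluation where the expression returns nothing resets the timer, which is why flapping expressions with a long for: may never page.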

Service Discovery: Who Do I Scrape?

Prometheus does not maintain a static target list. Service discovery integrations (Kubernetes, Consul, and file-based file_sd among them) watch external systems and emit a stream of "here is the current target set with these labels".

Every discovered target carries a set of "meta labels" (prefixed __meta_) describing the source. Relabel rules transform those into final scrape labels: extracting a Pod's app label, joining IP and port into the scrape address, dropping targets entirely (action: drop), or hashmod-sharding across Prometheus replicas. Relabeling is effectively a small rule language embedded in YAML, and complex deployments often devote a substantial share of their config to relabel rules.
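The idea behind hashmod sharding can be sketched in Python: hash a target's address, take it modulo the number of replicas, and let each replica keep only the targets whose result matches its own shard number. The hash below is MD5-based, and the choice of the last 8 bytes of the digest is an assumption about how Prometheus derives the number, so shard assignments here may not match a real deployment. The target addresses are hypothetical.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Stable shard number for a label value (byte selection is an assumption)."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(targets: list[str], shard: int, shards: int) -> list[str]:
    """Targets a given replica keeps; every other target is dropped."""
    return [t for t in targets if hashmod(t, shards) == shard]


targets = [f"10.0.0.{i}:9100" for i in range(1, 21)]
for shard in range(3):
    print(shard, targets_for_shard(targets, shard, 3))
```

Because the hash depends only on the target address, every replica computes the same assignment independently, with no coordination between them.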

Beyond a Single Instance: Federation, Remote Write, Thanos, Mimir

A single Prometheus is bounded by RAM (active series), disk (retention), and one machine's query throughput. Three patterns push past the boundary:

Federation — a "global" Prometheus scrapes the /federate endpoint of "leaf" Prometheis, pulling pre-aggregated metrics. Hierarchical federation groups data center → region → global. Cross-service federation aggregates a small shared set of SLI metrics across services. Federation works best for low-cardinality summaries; it is the wrong answer for full historical query.

Remote write — Prometheus streams every sample to an external endpoint via the remote_write protocol (Snappy-compressed protobuf over HTTP). The receiver becomes the long-term store. The consumer side is a separate ecosystem that includes Thanos, Mimir, Cortex, and VictoriaMetrics.

Remote read is the symmetric API for query-time access to remote stores, but in practice most production stacks use Thanos/Mimir's native PromQL endpoint instead of remote_read because it can push down the query closer to the storage.

Prometheus vs Alternatives

Dimension | Prometheus | InfluxDB 2.x | OpenTelemetry Collector | Datadog
Collection model | Pull (scrape) | Push (line protocol) | Push or pull (configurable) | Push (agent)
Storage | Local TSDB; remote_write for HA | TSI + TSM files | Pass-through; needs backend | Proprietary cloud
Query language | PromQL | Flux (functional) or InfluxQL | None (forwarder) | Datadog query DSL
Cardinality model | Label-set, single series | Tag-set, single series | OTLP attributes | Tag-set
Native histograms | Yes (since 2.40) | No (separate measurements) | OTLP exponential histograms | Yes (distributions)
HA model | Two replicas + Alertmanager dedup | Cluster (commercial) | Stateless replicas | SaaS
Best at | Cloud-native infra metrics, alerting | IoT and event data; SQL-like queries | Vendor-neutral pipeline | Turn-key managed monitoring

Tradeoffs and Honest Weaknesses

Frequently Asked Questions

How does Prometheus achieve ~1.3 bytes per sample on disk?
Two encodings combine. Timestamps are stored as delta-of-delta: with a regular scrape interval most of those values are zero and cost a single bit. Values are stored as the XOR of each float with its predecessor; for typical monotonically increasing counters and slowly changing gauges, the XOR has many leading and trailing zeros, so a variable-length encoding writes only a few meaningful bits. Filesystem-level compression (ZFS or btrfs) can be layered on top, but the chunk format itself is already that dense.
Why does my recording rule produce no output even though the underlying query works?
Recording rules evaluate at evaluation_interval boundaries. If your expression uses rate(...[1m]) and the scrape interval is 30s, the range holds at most two samples, so a single late or missed scrape leaves fewer than the two points rate() needs. Use a wider window ([5m] is the safe default) and check that evaluation_interval is not shorter than scrape_interval.
What does up == 0 mean and why is it the most important alert?
up is a synthetic gauge Prometheus writes itself for every scrape: 1 if the scrape succeeded, 0 otherwise. up == 0 means a target was supposed to be scraped but couldn't be reached (TCP connection failure, HTTP 5xx, scrape timeout). It is the canary metric — alert on this before anything else, because if scraping is broken your other alerts are silently quiet, not actually fine.
What's the difference between rate() and increase()?
rate(x[5m]) returns per-second slope. increase(x[5m]) returns the absolute count, equivalent to rate(x[5m]) * 300. Both handle counter resets the same way. increase is friendlier for "how many requests in the last 5 minutes" dashboards; rate is the canonical form for alert thresholds and aggregation across windows.
How do native histograms compare to classical histograms?
A classical histogram exposes one counter per bucket boundary (_bucket{le="0.1"}, le="0.5", ...). For a 30-bucket histogram with 5 label combinations that's 150 series. A native histogram packs the entire distribution into a single sample using exponential bucket schemas, so the same data needs one series per label combination (5 here), with sub-percent quantile error at high resolution. They were introduced in Prometheus 2.40 as an experimental feature, use a bucket layout designed to map onto OTLP exponential histograms, and are worth evaluating for new deployments.
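The exponential bucket layout can be sketched in Python. As I understand the scheme shared with OTLP exponential histograms, a schema number n gives a growth factor of 2^(2^-n) between consecutive bucket boundaries; treat the exact indexing convention here as an assumption.

```python
def bucket_upper_bound(schema: int, index: int) -> float:
    """Upper bound of bucket `index` when boundaries grow by 2**(2**-schema)."""
    return 2.0 ** (index * 2.0 ** -schema)


def relative_bucket_width(schema: int) -> float:
    """How much wider each bucket's upper bound is than its lower bound."""
    return 2.0 ** (2.0 ** -schema) - 1.0


print(bucket_upper_bound(0, 3))  # 8.0 (schema 0: boundaries double each step)
print(bucket_upper_bound(3, 8))  # 2.0 (schema 3: 8 buckets per power of two)

for schema in (0, 3, 8):
    print(schema, f"{relative_bucket_width(schema):.4f}")
```

Higher schemas give narrower buckets, which is where the tighter quantile error comes from, at the cost of more populated buckets per sample.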
Why do I sometimes see two replicas of Prometheus disagree slightly?
Each replica scrapes the same targets independently, but on slightly offset schedules. Counter values converge but per-second rates over short windows can differ by a few percent. Alertmanager dedups by alert fingerprint (the label set) so duplicate fires collapse, but graphs from each replica will not be byte-identical. This is fundamental to a pull model with no consensus protocol — and it's an acceptable trade-off because the alternative (a clustered storage) trades operational simplicity for a class of split-brain bugs.
When should I shard Prometheus vs adopt Thanos/Mimir?
If you are bounded by series count on one host, shard first: run two Prometheis with hashmod relabel rules splitting targets, and have Grafana query both. Cheaper and simpler than a long-term-store deploy. Adopt Thanos/Mimir when you also need (a) cross-shard queries, (b) longer retention than you can fit on local disk, (c) downsampling for multi-month dashboards, or (d) a global rule evaluator. The split is roughly: shard for scale, Thanos/Mimir for unification.