Prometheus Internals

Prometheus is the de-facto standard time-series database for cloud-native infrastructure. Its core insight is the inversion of the old monitoring model: instead of agents pushing metrics to a central collector, Prometheus pulls — it discovers targets via service discovery, scrapes them on a schedule, and stores everything in a local, single-binary TSDB. The trade-offs in that one decision propagate through every layer of the system, from the on-disk chunk format to the way you write PromQL alerts.

Prometheus was originally built at SoundCloud in 2012, modeled on Google's Borgmon, and donated to the CNCF in 2016. It is now the default monitoring choice across much of the Kubernetes ecosystem.

Prometheus Architecture Overview

Key Numbers

Default scrape interval: 1m built in (15s in the shipped example config)
Sample size on disk: ~1.3 bytes
Block duration: 2h initial
Default retention: 15 days
Single-instance ceiling: ~10M active series
WAL truncation
Compaction levels: 2h → 6h → 18h → 54h

Why Prometheus Exists

Push Was Failing

StatsD-style push monitoring made every application responsible for knowing where the metrics server lived, what the schema was, and what happened when the server was overwhelmed. Auto-scaling fleets made target lists ephemeral. The push agent often dropped data silently.

Pull Inverts Ownership

In a pull model, monitoring owns the truth about what is monitored. Service discovery feeds Prometheus the current target list; Prometheus decides scrape cadence; targets just expose /metrics. A target that can't be scraped shows up as up == 0 — which is itself a metric you can alert on.

Labels Replace Hierarchies

Graphite's dot-separated metric names couldn't express that a counter belonged to multiple dimensions (host, region, version, customer). Prometheus's label model gives every sample an arbitrary set of key=value pairs, and PromQL turns aggregation across labels into one-line expressions: sum by (region) (rate(http_requests_total[5m])).
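
As a rough illustration of what sum by (region) does, the sketch below groups already-computed per-series rates by one label and sums them. The sample data and the name sum_by are made up for this example; real PromQL evaluation runs against the TSDB, not Python dictionaries.

```python
from collections import defaultdict

def sum_by(label, series):
    """Sum the values of all series that share the same value for `label`.

    `series` is a list of (labels_dict, value) pairs, standing in for an
    instant vector such as the output of rate(http_requests_total[5m]).
    """
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)

# Hypothetical per-series request rates (requests/second).
rates = [
    ({"region": "us-east", "host": "a"}, 12.0),
    ({"region": "us-east", "host": "b"}, 8.0),
    ({"region": "eu-west", "host": "c"}, 5.0),
]
by_region = sum_by("region", rates)
```

The host label disappears from the result because it is not in the grouping, which is the same thing that happens to labels left out of a by (...) clause.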

The Label Model and the Cardinality Cliff

Every Prometheus sample is identified by a metric name plus a set of label key=value pairs. The pair (http_requests_total, {method="GET", handler="/api/orders", status="200"}) is a distinct time series; change any label value and you get a different series. The TSDB indexes by series ID, not by row, so the unit of work is a series — and the unit of cost is the count of distinct label combinations, called cardinality.

Cardinality is the load-bearing concept of Prometheus operations. The TSDB head holds one in-memory chunk per active series. At ~1 KiB per chunk plus index entries, a million active series cost roughly 2-4 GiB of RAM. Ten million is the practical ceiling for a single instance. Past that you either shard, federate, or move to a horizontally-scaled remote-write target like Mimir or VictoriaMetrics.
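
The arithmetic behind those numbers can be sketched as follows. The label values are hypothetical, and the 3 KiB per-series figure is only an assumption taken from the middle of the 2-4 GiB per million series range quoted above; real memory use depends on label lengths, churn, and Prometheus version.

```python
from math import prod

def series_count(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())

def head_memory_gib(active_series, bytes_per_series=3 * 1024):
    """Rough head memory estimate; bytes_per_series is an assumed average."""
    return active_series * bytes_per_series / 2**30

# Hypothetical label sets for a single metric.
labels = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "handler": [f"/api/route{i}" for i in range(50)],
    "status": ["200", "400", "404", "500", "503"],
}
n = series_count(labels)          # 4 * 50 * 5 = 1000 possible series
estimate = head_memory_gib(1_000_000)
```

The product is an upper bound: not every combination occurs in practice, but adding one more label with many values multiplies the bound rather than adding to it.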

The cardinality cliff is rarely caused by adding a new metric. It is caused by adding a label with many distinct values: a customer ID, a request URL with embedded IDs, a Kubernetes pod name (which churns every deploy), an email address. The classic footgun is something like http_requests_total{url="/orders/12345"}: every order ID becomes a new series, which stays in the head until it has gone long enough without samples to be dropped at head truncation, and stays in the index of every persisted block for the full retention period. On a busy service the head can outgrow available memory within days.
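
One common mitigation is to normalize paths before they become label values. The sketch below is a minimal example under the assumption that IDs are purely numeric path segments; the regex, the :id placeholder, and the function name are illustrative, not part of any Prometheus client library.

```python
import re

# Assumed rule: a path segment made only of digits is an ID.
_ID_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_path(path):
    """Replace numeric path segments with a fixed placeholder."""
    return _ID_SEGMENT.sub("/:id", path)

# 10,000 distinct order URLs, as an instrumented handler might see them.
raw_paths = [f"/orders/{order_id}" for order_id in range(10_000)]
raw_cardinality = len(set(raw_paths))                               # 10000
normalized_cardinality = len({normalize_path(p) for p in raw_paths})  # 1
```

Using the route template from the web framework, where one is available, is usually more robust than pattern matching on the raw URL.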

Operational hygiene: set sample_limit per scrape job so a misbehaving target fails its scrape instead of flooding the head, cap query cost with --query.max-samples, watch prometheus_tsdb_head_series, alert on sudden growth in scrape_samples_post_metric_relabeling, drop high-cardinality labels at scrape time with metric_relabel_configs, and bucket continuous values (latency in histogram buckets, not raw microseconds).

The TSDB On-Disk Format

A Prometheus data directory contains:

data/
├── 01HQXZ.../              # immutable block (2h or longer)
│   ├── chunks/
│   │   ├── 000001          # 512 MiB max, append-only
│   │   └── 000002
│   ├── index               # postings + label index + series table
│   ├── meta.json           # block ID, time range, stats
│   └── tombstones          # logical deletes (rare)
├── 01HQ.../                # next block
├── chunks_head/            # mmap'd head chunks (active series)
└── wal/                    # write-ahead log
    ├── 00000123
    ├── 00000124
    └── checkpoint.00000122/

A block is the unit of immutable storage. Each block covers a fixed time range: initially 2 hours, then compacted into 6h, 18h, and finally 54h blocks. Inside a block, chunks hold the actual samples for one series. Each chunk uses Gorilla-style compression: timestamps are stored as delta-of-delta (the change in the interval between samples, which is near zero for a regular scrape), and values are stored as the XOR of consecutive float64 bit patterns, keeping only the bits between the leading and trailing zeros. This achieves ~1.3 bytes per sample on typical telemetry. A naive (timestamp, value) pair costs 16 bytes, so the encoding is roughly a 12x improvement and drives the storage economics of the whole system.
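
To see why this compresses so well, the sketch below computes the two quantities the encoder works with: the delta-of-delta of the timestamps and the XOR of consecutive value bit patterns. It stops short of the variable-length bit packing, and the sample data is invented for illustration.

```python
import struct

def float_bits(x):
    """Return the IEEE 754 bit pattern of a float64 as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def delta_of_deltas(timestamps):
    """Second-order differences; all zeros for a perfectly regular scrape."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_stream(values):
    """XOR each value's bit pattern with its predecessor's."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]

def meaningful_bits(x):
    """Number of bits between the leading and trailing zeros of a 64-bit word."""
    if x == 0:
        return 0
    trailing = (x & -x).bit_length() - 1
    return x.bit_length() - trailing

timestamps = [1000, 16000, 31000, 46001, 61000]  # ms, 15s scrape with 1ms jitter
values = [42.0, 42.0, 42.5, 43.0, 43.0]

dod = delta_of_deltas(timestamps)                        # [0, 1, -2]
widths = [meaningful_bits(x) for x in xor_stream(values)]  # [0, 1, 2, 0]
```

Small delta-of-delta values and XOR results with only one or two meaningful bits are what allow most samples to be written in a handful of bits instead of 128.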

The index is more interesting. It contains:

Symbol table — every distinct string that appears in any label, deduplicated and sorted. Series records reference symbol IDs, not raw strings.
Series records — one per series, listing label refs and chunk references for this block.
Postings lists — for each (label_name, label_value) pair, a sorted list of series IDs that have it. Querying {job="api"} just dereferences a postings list.
Postings index — a top-level index mapping (label_name, label_value) to a file offset where the postings list lives.

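A query like {job="api", region="us-east"} intersects two postings lists by series ID. Because every label pair has its own postings list, adding equality matchers stays cheap: the work reduces to intersecting sorted lists, and the cost is driven by the length of those lists rather than by the number of labels.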
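
The intersection step can be sketched with a two-pointer merge over sorted ID lists. The postings dictionary below is a hypothetical in-memory stand-in for the on-disk index, and select is an illustrative name, not a Prometheus API.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of series IDs with a two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings: (label_name, label_value) -> sorted series IDs.
postings = {
    ("job", "api"): [1, 4, 7, 9, 12],
    ("region", "us-east"): [2, 4, 9, 10, 12, 15],
}

def select(*matchers):
    """Series IDs matching every (name, value) pair; needs at least one matcher."""
    lists = sorted((postings.get(m, []) for m in matchers), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result

matched = select(("job", "api"), ("region", "us-east"))
```

Starting from the shortest list keeps the intermediate result small, which is a common choice for this kind of intersection.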

The WAL is the durability layer. Every series creation, appended sample, and tombstone is written to the WAL before the append is committed to the head. On startup, Prometheus replays the WAL into the head, reconstructing in-memory state. WAL segments are 128 MiB each; once head data has been flushed into a block, a checkpoint rewrites the oldest segments, keeping only the records still needed, and the original segments are deleted.
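
The write-then-apply ordering and the replay on startup can be shown with a toy log. This is only a sketch of the general write-ahead pattern: the JSON-lines format and the TinyWAL class are invented here and bear no relation to Prometheus's binary record format, segmenting, or checkpointing.

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy write-ahead log: one JSON record per line, fsynced before apply."""

    def __init__(self, path):
        self.path = path
        self.head = {}  # series name -> list of (timestamp, value)
        if os.path.exists(path):
            self._replay()

    def _apply(self, rec):
        self.head.setdefault(rec["series"], []).append((rec["ts"], rec["value"]))

    def _replay(self):
        # Rebuild in-memory state by re-applying every logged record in order.
        with open(self.path) as f:
            for line in f:
                self._apply(json.loads(line))

    def append(self, series, ts, value):
        rec = {"series": series, "ts": ts, "value": value}
        with open(self.path, "a") as f:
            f.write(json.dumps(rec) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(rec)  # only after the record is on disk

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(wal_path)
wal.append("up", 1000, 1.0)
wal.append("up", 16000, 0.0)
restarted = TinyWAL(wal_path)  # simulates a process restart
```

After the simulated restart, restarted.head matches the state the first instance held in memory, which is the property replay is meant to guarantee.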

Compaction: From Head to 54-Hour Blocks

The head is the in-memory representation of the most recent ~3 hours of data. Once a 2-hour window closes, the head's chunks for that window are flushed to a new persistent block. The compactor then runs in the background:

Three contiguous 2h blocks → one 6h block
Three contiguous 6h blocks → one 18h block
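Three contiguous 18h blocks → one 54h block (the largest in this example; Prometheus caps block duration at 10% of the retention period, so the top level depends on your retention setting)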

Compaction merges chunks per-series, removes tombstoned samples, rebuilds the index with the union of label sets, and rewrites the symbol table. The result is fewer, larger, better-compressed blocks — and dramatically cheaper queries, because a 30-day query no longer touches 360 small blocks but ~14 large ones.
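
The block counts in that comparison follow from simple division, sketched here. The function name is illustrative, and the calculation assumes every block in the query range has already reached the given size.

```python
import math

def blocks_touched(query_hours, block_hours):
    """Number of fixed-size blocks needed to cover a query range."""
    return math.ceil(query_hours / block_hours)

# Each compaction level triples the previous one, starting from 2h.
levels = [2 * 3**i for i in range(4)]       # [2, 6, 18, 54]

query_hours = 30 * 24
small = blocks_touched(query_hours, 2)      # 360
large = blocks_touched(query_hours, 54)     # 14
```

In practice the most recent data is still in small blocks or the head, so a real 30-day query touches slightly more than the ideal figure.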

Compaction is I/O-heavy and CPU-heavy. Production deployments size disks for ~3x peak block size (write amplification during merge) and watch prometheus_tsdb_compactions_total and prometheus_tsdb_compaction_duration_seconds for drift.

PromQL Execution: Instant Vectors, Range Vectors, rate()

PromQL has two core type concepts — the instant vector (one sample per series at one timestamp) and the range vector (a window of samples per series). You cannot graph a range vector directly; you must collapse it with a function like rate(), increase(), or avg_over_time().

The single most-used function is rate(metric[5m]). It computes the per-second slope of a counter over the last 5 minutes, accounting for counter resets (a counter that goes backward is treated as a reset to zero, not a negative slope). Internally, rate() uses extrapolation: it computes (last - first) / (range - first_offset - last_offset) and extrapolates to the full window if the first/last samples don't quite cover it. This makes rate() robust to scrape jitter but means rate(x[1m]) on a 15s scrape is much noisier than rate(x[5m]).
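
A simplified version of this calculation is sketched below. It is modeled loosely on the behavior described above, not copied from the Prometheus source: it handles counter resets and extends the sampled interval toward the window edges, but it omits details such as the clamp that prevents extrapolating a counter below zero. The 1.1 threshold factor and the half-interval fallback are assumptions of this sketch, and timestamps are assumed to be distinct.

```python
def simple_rate(samples, range_start, range_end):
    """Per-second rate of a counter over [range_start, range_end].

    `samples` is a time-ordered list of (timestamp_seconds, value) pairs
    inside the window. Returns None with fewer than two samples.
    """
    if len(samples) < 2:
        return None
    (first_t, first_v), (last_t, last_v) = samples[0], samples[-1]

    # Counter-reset handling: a drop means the counter restarted from zero,
    # so add back the value it had just before the drop.
    increase = last_v - first_v
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            increase += prev

    sampled = last_t - first_t
    avg_gap = sampled / (len(samples) - 1)
    threshold = avg_gap * 1.1

    # Extend the covered interval toward each window edge: by the full gap
    # if it is small, otherwise by half an average sample gap.
    covered = sampled
    for gap in (first_t - range_start, range_end - last_t):
        covered += gap if gap < threshold else avg_gap / 2

    return increase * (covered / sampled) / (range_end - range_start)

steady = [(5, 100), (20, 115), (35, 130), (50, 145)]
with_reset = [(5, 100), (20, 115), (35, 5), (50, 20)]
steady_rate = simple_rate(steady, 0, 60)      # about 1.0 per second
reset_rate = simple_rate(with_reset, 0, 60)   # about 0.78 per second
```

The reset case shows the correction at work: the raw difference between last and first is negative, but adding back the pre-reset value yields a positive increase of 35 over the sampled 45 seconds.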

The cousin irate() uses only the last two samples, giving a much more reactive (and noisier) signal. Use irate for short-term graphs, rate for alerting and dashboards.

Histograms get their own machinery. A histogram exposes a set of bucketed counters (http_request_duration_seconds_bucket{le="0.1"}, le="0.5", etc.), and histogram_quantile(0.99, sum by (le) (rate(...[5m]))) linearly interpolates within the appropriate bucket to estimate the 99th percentile latency. Prometheus 2.40 introduced native histograms as an experimental feature (enabled with --enable-feature=native-histograms). They store an exponential bucket schema in a single sample, replacing the multi-series classical histogram with far fewer series and better quantile fidelity.
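
The interpolation for classical histograms can be sketched as follows. This is a minimal version that assumes cumulative bucket counts ending in a +Inf bucket and a lower bound of zero for the first bucket; the function name and the sample counts are made up for the example.

```python
import math

def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical cumulative counts for le = 0.1, 0.5, 1.0, +Inf.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = bucket_quantile(0.95, buckets)    # between 0.5 and 1.0
p999 = bucket_quantile(0.999, buckets)  # capped at 1.0
```

The p999 case illustrates why bucket layout matters: once the rank lands in the +Inf bucket, the estimate cannot exceed the largest finite boundary, no matter how slow the real requests were.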

Recording Rules and Alerting Rules

Querying histogram_quantile(0.99, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))) on every dashboard refresh is expensive. Recording rules evaluate that expression every evaluation_interval (1m by default; the shipped example config sets 15s) and write the result back to the TSDB as a new series, typically named along the lines of job:http_p99:5m by convention. Dashboards then query the cheap pre-computed series, not the raw histogram.

Alerting rules have the same shape but trigger an alert when their expression returns a non-empty result continuously for the duration given in the for: field (for example, for: 5m). The alert is sent to Alertmanager, which handles deduplication (multiple Prometheus replicas firing the same alert), grouping (collapse 100 pod alerts into one cluster alert), routing (which Slack channel, which PagerDuty service), inhibition (suppress downstream alerts when an upstream is firing), and silencing (a human-set time-bound mute).

groups:
- name: api_slo
  rules:
  - record: job:http_request_rate:5m
    expr: sum by (job, route) (rate(http_requests_total[5m]))
  - alert: HighErrorRate
    expr: |
      sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
        /
      sum by (job) (rate(http_requests_total[5m])) > 0.05
    for: 10m
    labels: {severity: page}
    annotations:
      summary: "{{ $labels.job }} 5xx rate above 5%"
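
The effect of the for: clause on an alert like HighErrorRate can be sketched as a small state machine. This is a simplified model: it expresses the for: duration as a number of evaluation intervals and ignores details such as resolved notifications; the function name is illustrative.

```python
def alert_states(condition_true, for_steps):
    """Return "inactive", "pending", or "firing" for each rule evaluation.

    `condition_true` holds one boolean per evaluation (True when the alert
    expression returned a result); `for_steps` is the `for:` duration
    expressed as a number of evaluation intervals.
    """
    active_since = None
    states = []
    for step, is_true in enumerate(condition_true):
        if not is_true:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = step
        held = step - active_since
        states.append("firing" if held >= for_steps else "pending")
    return states

# One dip at step 2 restarts the pending period.
history = [True, True, False, True, True, True, True]
states = alert_states(history, for_steps=2)
```

With for: 10m and a 1m evaluation interval, for_steps would be 10, so a single evaluation where the error ratio dips below the threshold restarts the ten-minute wait.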

Service Discovery: Who Do I Scrape?

Prometheus does not maintain a static target list. Service discovery integrations watch external systems and emit a stream of "here is the current target set with these labels":

kubernetes_sd — watches the Kubernetes API for Pods, Services, Endpoints, Nodes, Ingresses. Annotations like prometheus.io/scrape: "true" and prometheus.io/port: "9090" drive selection through relabel rules.
consul_sd — long-polls Consul's catalog API.
file_sd — watches a directory of YAML/JSON files that an external system writes; the lowest-friction way to integrate a custom inventory.
ec2_sd, azure_sd, gce_sd, digitalocean_sd — direct cloud provider integration.
http_sd — Prometheus calls a configurable URL on an interval; the URL returns JSON with the target list. Good escape hatch for custom topologies.

Every discovered target carries a set of "meta labels" (prefixed __meta_) describing the source. Relabel rules transform those into final scrape labels: extracting a Pod's app label, joining IP+port into the address, dropping targets entirely (action: drop), or hashmod-sharding across Prometheus replicas. Relabeling is effectively a small rule language embedded in YAML, and in complex deployments relabel rules often make up a large share of the configuration.
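
The hashmod sharding idea can be sketched in a few lines. The version below hashes the joined source label values with MD5 and reduces the result modulo the shard count; taking the last 8 bytes of the digest as a big-endian integer is an assumption about how Prometheus does it, so treat the exact shard numbers as illustrative rather than guaranteed to match a real server.

```python
import hashlib

def hashmod(source_values, modulus, separator=";"):
    """Shard number for a target, modeled on the relabel `hashmod` action."""
    joined = separator.join(source_values)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    # Assumption: the last 8 bytes of the digest, read as a big-endian integer.
    return int.from_bytes(digest[8:], "big") % modulus

def keep_for_shard(address, shard, total_shards):
    """True if this Prometheus replica (numbered `shard`) should scrape the target."""
    return hashmod([address], total_shards) == shard

# Hypothetical target addresses split across three replicas.
targets = [f"10.0.0.{i}:9100" for i in range(50)]
assignment = {t: hashmod([t], 3) for t in targets}
```

Because the hash depends only on the label values, every replica computes the same assignment independently, and each target is kept by exactly one shard without any coordination.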

Beyond a Single Instance: Federation, Remote Write, Thanos, Mimir

A single Prometheus is bounded by RAM (active series), disk (retention), and one machine's query throughput. Three patterns push past the boundary:

Federation — a "global" Prometheus scrapes the /federate endpoint of "leaf" Prometheis, pulling pre-aggregated metrics. Hierarchical federation groups data center → region → global. Cross-service federation aggregates a small shared set of SLI metrics across services. Federation works best for low-cardinality summaries; it is the wrong answer for full historical query.

Remote write — Prometheus streams every sample to an external endpoint via the remote_write protocol (Snappy-compressed protobuf over HTTPS). The receiver becomes the long-term store. The consumer side is a separate ecosystem:

Thanos — sidecars upload Prometheus blocks to S3/GCS; a global Querier sees all blocks via the StoreAPI; Compactor downsamples to 5m and 1h resolution; Ruler runs cross-cluster recording rules.
Cortex / Grafana Mimir — horizontally scalable receive path with separate ingester/querier/compactor microservices; uses ring-based sharding and consistent hashing for series ownership.
VictoriaMetrics — different on-disk format optimized for high cardinality, supports both push and pull, generally faster on the same hardware than vanilla Prometheus but with a slightly different PromQL dialect (MetricsQL).

Remote read is the symmetric API for query-time access to remote stores, but in practice most production stacks use Thanos/Mimir's native PromQL endpoint instead of remote_read because it can push down the query closer to the storage.

Prometheus vs Alternatives

	Prometheus	InfluxDB 2.x	OpenTelemetry Collector	Datadog
Collection model	Pull (scrape)	Push (line protocol)	Push or pull (configurable)	Push (agent)
Storage	Local TSDB; remote_write for HA	TSI + TSM files	Pass-through; needs backend	Proprietary cloud
Query language	PromQL	Flux (functional) or InfluxQL	None (forwarder)	Datadog query DSL
Cardinality model	Label-set, single series	Tag-set, single series	OTLP attributes	Tag-set
Native histograms	Yes (experimental since 2.40)	No (separate measurements)	OTLP exponential histograms	Yes (distributions)
HA model	Two replicas + Alertmanager dedup	Cluster (commercial)	Stateless replicas	SaaS
Best at	Cloud-native infra metrics, alerting	IoT and event data; SQL-like queries	Vendor-neutral pipeline	Turn-key managed monitoring

Tradeoffs and Honest Weaknesses

Cardinality is fragile — a single careless label can blow up the head. Vanilla Prometheus has no global series limiter (Mimir adds per-tenant limits); the built-in guard is the per-job sample_limit, and the standard mitigation is metric_relabel_configs and operator vigilance.
Single-node — a pair of Prometheis is "HA" only in the sense that Alertmanager dedups duplicate alerts and queries can fail over. There is no native clustering for the storage. For real horizontal scale you adopt Thanos, Mimir, or VictoriaMetrics.
Pull doesn't fit short-lived jobs — a CronJob that runs for 30s may not even live long enough to be scraped. The Pushgateway exists as an explicit hack: a long-lived process that batch jobs push to, which Prometheus then scrapes. It complicates the mental model and can keep stale metrics around forever.
No event/log/trace storage — Prometheus is metrics-only. Logs go to Loki, ELK, or a vendor; traces go to Tempo, Jaeger, or a vendor. A complete observability stack is at least two and usually three separate systems.
15-day default retention — long-term capacity planning generally requires remote_write to a long-term store. Local-only Prometheus is a real-time tool, not a historical archive.
PromQL has sharp edges — counter resets, rate() extrapolation, instant-vs-range type errors, and the "many-to-many matching not allowed" error confuse newcomers. The error messages are getting better, but the semantics still require time to internalize.

Frequently Asked Questions

How does Prometheus achieve ~1.3 bytes per sample on disk?

Two encodings stack inside each chunk. Timestamps are stored as delta-of-delta: with a regular scrape interval the second-order difference is almost always zero and costs a single bit. Values are stored as the XOR of consecutive float64 bit patterns: for steadily increasing counters and slowly changing gauges, the XOR has long runs of leading and trailing zeros, so only a few meaningful bits are written. A compressing filesystem such as ZFS or btrfs can shave off a little more, but the chunk format itself is already that dense.

Why does my recording rule produce no output even though the underlying query works?

Recording rules evaluate at evaluation_interval boundaries. If your expression uses rate(...[1m]) and the scrape interval is 30s, the window holds at most two samples, and scrape jitter or a single failed scrape leaves only one. That is below the two-sample minimum rate() needs, so the rule writes nothing for that evaluation. Use a wider window (at least four times the scrape interval; [5m] is a safe default) and ensure evaluation_interval ≥ scrape_interval.

What does up == 0 mean and why is it the most important alert?

up is a synthetic gauge Prometheus writes itself for every scrape: 1 if the scrape succeeded, 0 otherwise. up == 0 means a target was supposed to be scraped but couldn't be reached (TCP connection failure, HTTP 5xx, scrape timeout). It is the canary metric — alert on this before anything else, because if scraping is broken your other alerts are silently quiet, not actually fine.

What's the difference between rate() and increase()?

rate(x[5m]) returns per-second slope. increase(x[5m]) returns the absolute count, equivalent to rate(x[5m]) * 300. Both handle counter resets the same way. increase is friendlier for "how many requests in the last 5 minutes" dashboards; rate is the canonical form for alert thresholds and aggregation across windows.

How do native histograms compare to classical histograms?

A classical histogram exposes one counter per bucket boundary (_bucket{le="0.1"}, le="0.5", ...). For a 30-bucket histogram with 5 label combinations that's 150 series, before counting _sum and _count. A native histogram packs the entire distribution into a single sample using an exponential bucket schema, so the same data needs one series per label combination (5 here), with quantile error bounded by the bucket width of the chosen schema. Native histograms were introduced as an experimental feature in Prometheus 2.40, use the same exponential bucketing idea as OTLP exponential histograms, and are worth evaluating for new deployments.
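
The exponential bucket layout can be sketched with a little arithmetic. The formulas below assume the usual definition in which consecutive bucket boundaries grow by a factor of 2^(2^-schema) and bucket i covers (base^(i-1), base^i]; the function names are illustrative, and values that sit exactly on a boundary may land one bucket off because of floating-point rounding.

```python
import math

def bucket_base(schema):
    """Growth factor between consecutive bucket boundaries: 2^(2^-schema)."""
    return 2 ** (2 ** -schema)

def bucket_index(value, schema):
    """Index i such that base^(i-1) < value <= base^i, for value > 0."""
    return math.ceil(math.log2(value) * 2 ** schema)

def bucket_bounds(index, schema):
    """(exclusive lower bound, inclusive upper bound) of bucket `index`."""
    base = bucket_base(schema)
    return base ** (index - 1), base ** index

# Schema 3 splits every power of two into 8 buckets.
idx = bucket_index(0.3, schema=3)
lo, hi = bucket_bounds(idx, schema=3)
relative_width = bucket_base(3) - 1   # roughly 9% per bucket
```

Higher schema numbers give narrower buckets and tighter quantile estimates at the cost of more populated buckets per sample.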

Why do I sometimes see two replicas of Prometheus disagree slightly?

Each replica scrapes the same targets independently, but on slightly offset schedules. Counter values converge but per-second rates over short windows can differ by a few percent. Alertmanager dedups by alert fingerprint (the label set) so duplicate fires collapse, but graphs from each replica will not be byte-identical. This is fundamental to a pull model with no consensus protocol — and it's an acceptable trade-off because the alternative (a clustered storage) trades operational simplicity for a class of split-brain bugs.

When should I shard Prometheus vs adopt Thanos/Mimir?

If you are bounded by series count on one host, shard first: run two Prometheis with hashmod relabel rules splitting targets, and have Grafana query both. Cheaper and simpler than a long-term-store deploy. Adopt Thanos/Mimir when you also need (a) cross-shard queries, (b) longer retention than you can fit on local disk, (c) downsampling for multi-month dashboards, or (d) a global rule evaluator. The split is roughly: shard for scale, Thanos/Mimir for unification.