Prometheus Internals
Prometheus is the de-facto standard time-series database for cloud-native infrastructure. Its core insight is the inversion of the old monitoring model: instead of agents pushing metrics to a central collector, Prometheus pulls — it discovers targets via service discovery, scrapes them on a schedule, and stores everything in a local, single-binary TSDB. The trade-offs in that one decision propagate through every layer of the system, from the on-disk chunk format to the way you write PromQL alerts.
Prometheus was originally built at SoundCloud in 2012, modeled after Google's Borgmon, and donated to the CNCF in 2016. It is now one of the most widely deployed monitoring systems in Kubernetes environments.
Prometheus Architecture Overview
Why Prometheus Exists
… /metrics. A target that can't be scraped shows up as up == 0, which is itself a metric you can alert on.
… sum by (region) (rate(http_requests_total[5m])).
The Label Model and the Cardinality Cliff
Every Prometheus sample is identified by a metric name plus a set of label key=value
pairs. The pair (http_requests_total, {method="GET",
handler="/api/orders", status="200"}) is a distinct time series;
change any label value and you get a different series. The TSDB indexes by series ID, not
by row, so the unit of work is a series — and the unit of cost is the count of distinct
label combinations, called cardinality.
Cardinality is the load-bearing concept of Prometheus operations. The TSDB head holds one in-memory chunk per active series. At ~1 KiB per chunk plus index entries, a million active series cost roughly 2-4 GiB of RAM. As a rule of thumb, around ten million active series is the practical ceiling for a single instance; the exact limit depends on available memory and series churn. Past that you shard, federate, or move to a horizontally scaled remote-write target such as Mimir or VictoriaMetrics.
The cardinality cliff is rarely caused by adding a new metric; it's caused by adding a
label with many distinct values: a customer ID, a request URL with embedded
IDs, a Kubernetes pod name (which churns every deploy), an email address. The classic
footgun is something like http_requests_total{url="/orders/12345"}:
every order ID becomes a new series, and each one stays in the head until head compaction
garbage-collects it, then persists in block indexes for the whole retention period.
Within a week the TSDB head explodes.
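The arithmetic behind the cliff can be sketched in a few lines. This is a rough illustration only: the label counts are made up, and the 2-4 KiB per series range is simply the rule of thumb quoted above, not a measured figure.

```python
from math import prod

# Assumption: per-series head cost taken from the "2-4 GiB per million series"
# rule of thumb in the text, i.e. roughly 2-4 KiB per active series.
BYTES_PER_SERIES_LOW = 2 * 1024
BYTES_PER_SERIES_HIGH = 4 * 1024
GIB = 1024 ** 3


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())


def head_memory_gib(series: int) -> tuple:
    """Rough (low, high) head memory estimate in GiB for `series` active series."""
    return (series * BYTES_PER_SERIES_LOW / GIB, series * BYTES_PER_SERIES_HIGH / GIB)


# Hypothetical label sets for http_requests_total.
safe = {"method": 5, "handler": 40, "status": 8}
risky = {**safe, "order_id": 100_000}  # the footgun: an unbounded ID as a label

if __name__ == "__main__":
    for name, labels in (("safe", safe), ("risky", risky)):
        n = worst_case_series(labels)
        low, high = head_memory_gib(n)
        print(f"{name}: {n:,} series, roughly {low:.2f}-{high:.2f} GiB of head memory")
```

The product is a worst case, since not every label combination occurs in practice, but it shows why one unbounded label dominates everything else.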
Operational hygiene: set a per-job sample_limit in the scrape config, cap query cost
with --query.max-samples, watch prometheus_tsdb_head_series, alert on
scrape_samples_post_metric_relabeling ratios, drop high-cardinality labels
at scrape time with metric_relabel_configs, and bucket continuous values
(latency in histogram buckets, not raw microseconds).
The TSDB On-Disk Format
A Prometheus data directory contains:
data/
├── 01HQXZ.../              # immutable block (2h or longer)
│   ├── chunks/
│   │   ├── 000001          # 512 MiB max, append-only
│   │   └── 000002
│   ├── index               # postings + label index + series table
│   ├── meta.json           # block ID, time range, stats
│   └── tombstones          # logical deletes (rare)
├── 01HQ.../                # next block
├── chunks_head/            # mmap'd head chunks (active series)
└── wal/                    # write-ahead log
    ├── 00000123
    ├── 00000124
    └── checkpoint.00000122/
A block is the unit of immutable storage. Each block covers a fixed time range: initially 2 hours, then compacted into 6h, 18h, and finally 54h blocks. Inside a block, chunks hold the actual samples for one series. Each chunk uses Gorilla-style compression: timestamps are stored as delta-of-delta values, and each sample value is stored as the XOR with the previous value, encoded through its leading and trailing zero counts. On typical telemetry this achieves ~1.3 bytes per sample. A naive (timestamp, value) pair would cost 16 bytes, so the encoding is roughly a 12x improvement and drives the storage economics of the whole system.
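To see why this compresses so well, the two transforms can be sketched without the bit-packing. The code below is an illustration of the idea, not the Prometheus chunk encoder: it only computes the delta-of-delta values and the XOR words with their zero counts, and the function names are hypothetical.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit unsigned integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def delta_of_deltas(timestamps: list) -> list:
    """Second-order differences; a steady scrape interval yields values near zero."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def xor_stream(values: list) -> list:
    """For each consecutive pair, return (xor, leading_zeros, trailing_zeros)."""
    out = []
    for prev, cur in zip(values, values[1:]):
        x = float_bits(prev) ^ float_bits(cur)
        if x == 0:
            # Unchanged value: a real encoder can signal this with a single bit.
            out.append((0, 64, 0))
            continue
        leading = 64 - x.bit_length()
        trailing = (x & -x).bit_length() - 1
        out.append((x, leading, trailing))
    return out


if __name__ == "__main__":
    # Millisecond timestamps at a 15s interval with a little jitter.
    print(delta_of_deltas([0, 15000, 30000, 45001, 60000]))
    print(xor_stream([12.0, 12.0, 24.0]))
```

Only the meaningful bits between the leading and trailing zeros need to be written, which is where most of the savings come from when values change slowly.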
The index is more interesting. It contains:
- Symbol table: every distinct string that appears in any label, deduplicated and sorted. Series records reference symbol IDs, not raw strings.
- Series records: one per series, listing label refs and chunk references for this block.
- Postings lists: for each (label_name, label_value) pair, a sorted list of series IDs that have it. Querying {job="api"} just dereferences a postings list.
- Postings index: a top-level index mapping (label_name, label_value) to a file offset where the postings list lives.
A query like {job="api", region="us-east"} intersects two postings
lists by ID. Multi-label equality queries are cheap because every (label_name, label_value)
pair has its own postings list: they reduce to sorted-set intersection.
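The intersection step can be sketched as a two-pointer merge over sorted ID lists. The in-memory `postings` dictionary and the `select` helper are hypothetical stand-ins for the on-disk index, used here only to show the shape of the operation.

```python
def intersect(a: list, b: list) -> list:
    """Intersect two sorted lists of series IDs with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical postings: (label_name, label_value) -> sorted series IDs.
postings = {
    ("job", "api"): [1, 4, 7, 9, 12],
    ("region", "us-east"): [2, 4, 9, 10],
    ("status", "200"): [4, 5, 9, 12],
}


def select(matchers: list) -> list:
    """Resolve equality matchers by intersecting their postings lists."""
    lists = [postings.get(m, []) for m in matchers]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


if __name__ == "__main__":
    print(select([("job", "api"), ("region", "us-east")]))
```

Each additional matcher costs one more linear pass over already-sorted lists, and the result can only shrink.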
The WAL is the durability layer. Every appended sample, every series creation, and every tombstone is written to the WAL before the append is considered committed. On startup, Prometheus replays the WAL into the head, reconstructing in-memory state. WAL segments are 128 MiB each by default; checkpointing collapses old segments whose data has been fully flushed into a block.
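The log-then-apply discipline and the replay on startup can be illustrated with a toy. This sketch assumes a JSON-lines file and an fsync on every append purely for clarity; the real WAL uses binary records in segment files, and the `TinyWAL` class is hypothetical.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy write-ahead log: persist a record first, then apply it in memory."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.head = {}  # series name -> list of (timestamp, value)
        if os.path.exists(path):
            self._replay()
        self._file = open(path, "a", encoding="utf-8")

    def _apply(self, rec: dict) -> None:
        self.head.setdefault(rec["series"], []).append((rec["t"], rec["v"]))

    def _replay(self) -> None:
        # Startup path: rebuild in-memory state from the log, record by record.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                self._apply(json.loads(line))

    def append(self, series: str, t: int, v: float) -> None:
        rec = {"series": series, "t": t, "v": v}
        self._file.write(json.dumps(rec) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # log first ...
        self._apply(rec)               # ... then mutate in-memory state

    def close(self) -> None:
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    wal = TinyWAL(path)
    wal.append("up", 1, 1.0)
    wal.close()
    print(TinyWAL(path).head)  # state recovered purely from the log
```

Because the head is only ever mutated after the record is on disk, a crash at any point leaves a log from which the same state can be rebuilt.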
Compaction: From Head to 54-Hour Blocks
The head is the in-memory representation of the most recent ~3 hours of data. Once a 2-hour window closes, the head's chunks for that window are flushed to a new persistent block. The compactor then runs in the background:
- Three contiguous 2h blocks → one 6h block
- Three contiguous 6h blocks → one 18h block
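- Three contiguous 18h blocks → one 54h block (the largest in this example; the actual maximum is capped at 10% of the retention time or 31 days, whichever is smaller)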
Compaction merges chunks per-series, removes tombstoned samples, rebuilds the index with the union of label sets, and rewrites the symbol table. The result is fewer, larger, better-compressed blocks — and dramatically cheaper queries, because a 30-day query no longer touches 360 small blocks but ~14 large ones.
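The block-count arithmetic can be checked with a small planning sketch. It models only the grouping of time ranges in hours, under the assumption that runs of three are aligned to the larger range; it does not merge chunks or rebuild an index, and the `Block` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    min_t: int  # start hour, inclusive
    max_t: int  # end hour, exclusive

    @property
    def span(self) -> int:
        return self.max_t - self.min_t


def compact_once(blocks: list, span: int) -> list:
    """Merge each aligned run of three contiguous `span`-hour blocks into one."""
    blocks = sorted(blocks, key=lambda b: b.min_t)
    out = []
    i = 0
    while i < len(blocks):
        group = blocks[i:i + 3]
        mergeable = (
            len(group) == 3
            and all(b.span == span for b in group)
            and group[0].min_t % (3 * span) == 0
            and group[0].max_t == group[1].min_t
            and group[1].max_t == group[2].min_t
        )
        if mergeable:
            out.append(Block(group[0].min_t, group[2].max_t))
            i += 3
        else:
            out.append(blocks[i])
            i += 1
    return out


def compact_all(blocks: list, levels: tuple = (2, 6, 18)) -> list:
    """Apply the 2h -> 6h -> 18h -> 54h progression described in the text."""
    for span in levels:
        blocks = compact_once(blocks, span)
    return blocks


if __name__ == "__main__":
    thirty_days = [Block(h, h + 2) for h in range(0, 30 * 24, 2)]
    result = compact_all(thirty_days)
    print(len(thirty_days), "->", len(result))
```

Thirty days of 2h blocks collapse to thirteen 54h blocks plus one leftover 18h block, which is where the figure of about 14 comes from.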
Compaction is I/O-heavy and CPU-heavy. Production deployments size disks for ~3x peak
block size (write amplification during merge) and watch
prometheus_tsdb_compactions_total and
prometheus_tsdb_compaction_duration_seconds for drift.
PromQL Execution: Instant Vectors, Range Vectors, rate()
PromQL has two core type concepts — the instant vector (one sample per
series at one timestamp) and the range vector (a window of samples per
series). You cannot graph a range vector directly; you must collapse it with a function
like rate(), increase(), or avg_over_time().
The single most-used function is rate(metric[5m]). It computes the per-second
slope of a counter over the last 5 minutes, accounting for counter resets (a counter
that goes backward is treated as a reset to zero, not a negative slope). Internally,
rate() uses extrapolation: it computes
(last - first) / (range - first_offset - last_offset) and extrapolates to
the full window if the first/last samples don't quite cover it. This makes
rate() robust to scrape jitter but means rate(x[1m]) on a 15s
scrape is much noisier than rate(x[5m]).
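A simplified version of the computation makes the reset handling concrete. This sketch deliberately omits the window-edge extrapolation, so it returns the slope over the sampled interval only; `simple_rate` is a hypothetical name and not the PromQL implementation.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second rate of a counter from (timestamp_seconds, value) samples.

    Simplified: corrects for counter resets and divides by the sampled
    interval. The extrapolation to the window edges is not modeled.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            increase += value          # reset: the counter restarted from zero
        else:
            increase += value - prev
        prev = value
    return increase / duration


if __name__ == "__main__":
    # 15s scrapes; the process restarts between t=30 and t=45.
    window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
    print(simple_rate(window))
```

With fewer samples in the window, a single jittered or missing scrape shifts the result much more, which is the noise difference between a 1m and a 5m range.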
The cousin irate() uses only the last two samples, giving a much more
reactive (and noisier) signal. Use irate for short-term graphs, rate
for alerting and dashboards.
Histograms get their own machinery. A histogram exposes a set of bucketed counters
(http_request_duration_seconds_bucket{le="0.1"}, le="0.5", etc.),
and histogram_quantile(0.99, sum by (le) (rate(...[5m]))) linearly
interpolates within the appropriate bucket to estimate the 99th percentile latency.
Prometheus 2.40+ adds native histograms (experimental, enabled with
--enable-feature=native-histograms), which store an exponential
bucket schema in a single sample, replacing the multi-series classical histogram with
~10x less storage and far better quantile fidelity.
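For classical histograms, the interpolation step can be sketched as follows. This is a simplified reading of the approach described above, restricted to 0 < q <= 1 and assuming the lowest bucket starts at zero; it is not the PromQL source, and the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical per-second bucket rates after sum by (le) (rate(...[5m])).
    buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]
    print(histogram_quantile(0.7, buckets))
```

The estimate can never be more precise than the bucket boundaries allow, so bucket layout matters as much as the query.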
Recording Rules and Alerting Rules
Querying histogram_quantile(0.99, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))
on every dashboard refresh is expensive. Recording rules evaluate that
expression every evaluation_interval (default 1m) and write the result back
to the TSDB as a new series, typically named along the lines of job:http_p99:5m by convention.
Dashboards then query the cheap pre-computed series, not the raw histogram.
Alerting rules have the same shape but fire an alert when their
expression returns a non-empty result continuously for at least the duration set in
for: (for example, for: 5m). The alert is sent to
Alertmanager, which handles deduplication (multiple Prometheus replicas firing the same
alert), grouping (collapse 100 pod alerts into one cluster alert), routing (which Slack
channel, which PagerDuty service), inhibition (suppress downstream alerts when an
upstream is firing), and silencing (a human-set, time-bound mute).
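The effect of the for: clause can be pictured as a small state machine per alert. The sketch below is a simplified model with hypothetical names; it tracks only the inactive, pending, and firing states for one label set and ignores everything Alertmanager does afterwards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    for_seconds: float
    active_since: Optional[float] = None
    state: str = "inactive"

    def evaluate(self, now: float, expr_has_result: bool) -> str:
        """Advance the state at one rule evaluation."""
        if not expr_has_result:
            # Any empty evaluation resets the clock.
            self.active_since = None
            self.state = "inactive"
        else:
            if self.active_since is None:
                self.active_since = now
            held = now - self.active_since
            self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


if __name__ == "__main__":
    alert = AlertState(for_seconds=600)
    for t, active in [(0, True), (300, True), (600, True), (660, False), (720, True)]:
        print(t, alert.evaluate(t, active))
```

A single empty evaluation sends the alert back to inactive, which is why flapping expressions may never fire when for: is long.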
groups:
  - name: api_slo
    rules:
      # Pre-compute the per-route request rate for dashboards.
      - record: job:http_request_rate:5m
        expr: sum by (job, route) (rate(http_requests_total[5m]))
      # Page when more than 5% of requests return 5xx for 10 minutes.
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} 5xx rate above 5%"
Service Discovery: Who Do I Scrape?
Prometheus does not maintain a static target list. Service discovery integrations watch external systems and emit a stream of "here is the current target set with these labels":
- kubernetes_sd: watches the Kubernetes API for Pods, Services, Endpoints, Nodes, Ingresses. Annotations like prometheus.io/scrape: "true" and prometheus.io/port: "9090" drive selection through relabel rules.
- consul_sd: long-polls Consul's catalog API.
- file_sd: watches a directory of YAML/JSON files that an external system writes; the lowest-friction way to integrate a custom inventory.
- ec2_sd, azure_sd, gce_sd, digitalocean_sd: direct cloud provider integration.
- http_sd: Prometheus calls a configurable URL on an interval; the URL returns JSON with the target list. Good escape hatch for custom topologies.
Every discovered target carries a set of "meta labels" (prefixed __meta_)
describing the source. Relabel rules transform those into final scrape
labels: extracting a Pod's app label, joining IP+port into the address,
dropping targets entirely (action: drop), or hashmod-sharding across
Prometheus replicas. Relabeling is effectively a small rule language embedded in YAML
(regex matching, replacement, hashing, keep/drop), and complex Prometheus deployments
often devote a large share of their config to relabel rules.
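Hashmod-style sharding is the least obvious of these, so here is a minimal sketch of the idea: hash the target address, take it modulo the number of replicas, and keep the target only on the matching replica. The MD5-based hash is an assumption chosen for illustration, and the function names and addresses are hypothetical; any stable hash gives the same partitioning property.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Map a label value to one of `modulus` shards.

    Assumption: MD5 with the low 8 bytes read as a big-endian integer.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def keep_target(address: str, shard: int, total_shards: int) -> bool:
    """True if the replica numbered `shard` should scrape this target."""
    return hashmod(address, total_shards) == shard


if __name__ == "__main__":
    targets = [f"10.0.0.{i}:9100" for i in range(1, 11)]
    for shard in range(3):
        owned = [t for t in targets if keep_target(t, shard, 3)]
        print(f"replica {shard}: {len(owned)} targets")
```

Each replica evaluates the same rule with a different shard number, so every target is scraped by exactly one of them without any coordination.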
Beyond a Single Instance: Federation, Remote Write, Thanos, Mimir
A single Prometheus is bounded by RAM (active series), disk (retention), and one machine's query throughput. Three patterns push past the boundary:
Federation — a "global" Prometheus scrapes the /federate
endpoint of "leaf" Prometheis, pulling pre-aggregated metrics. Hierarchical federation
groups data center → region → global. Cross-service federation aggregates a small
shared set of SLI metrics across services. Federation works best for low-cardinality
summaries; it is the wrong answer for full historical query.
Remote write — Prometheus streams every sample to an external endpoint
via the remote_write protocol (Snappy-compressed protobuf over HTTPS). The
receiver becomes the long-term store. The consumer side is a separate ecosystem:
- Thanos: sidecars upload Prometheus blocks to S3/GCS (Thanos Receive is the component that accepts remote write instead); a global Querier sees all blocks via the StoreAPI; Compactor downsamples to 5m and 1h resolution; Ruler runs cross-cluster recording rules.
- Cortex / Grafana Mimir: horizontally scalable receive path with separate ingester/querier/compactor microservices; uses ring-based sharding and consistent hashing for series ownership.
- VictoriaMetrics: different on-disk format optimized for high cardinality, supports both push and pull, and is often reported to be faster on the same hardware than vanilla Prometheus, but with a slightly different PromQL dialect (MetricsQL).
Remote read is the symmetric API for query-time access to remote stores, but in practice most production stacks use Thanos/Mimir's native PromQL endpoint instead of remote_read because it can push down the query closer to the storage.
Prometheus vs Alternatives
| | Prometheus | InfluxDB 2.x | OpenTelemetry Collector | Datadog |
|---|---|---|---|---|
| Collection model | Pull (scrape) | Push (line protocol) | Push or pull (configurable) | Push (agent) |
| Storage | Local TSDB; remote_write for long-term storage | TSI + TSM files | Pass-through; needs backend | Proprietary cloud |
| Query language | PromQL | Flux (functional) or InfluxQL | None (forwarder) | Datadog query DSL |
| Cardinality model | Label-set, single series | Tag-set, single series | OTLP attributes | Tag-set |
| Native histograms | Yes (experimental since 2.40) | No (separate measurements) | OTLP exponential histograms | Yes (distributions) |
| HA model | Two replicas + Alertmanager dedup | Cluster (commercial) | Stateless replicas | SaaS |
| Best at | Cloud-native infra metrics, alerting | IoT and event data; SQL-like queries | Vendor-neutral pipeline | Turn-key managed monitoring |
Tradeoffs and Honest Weaknesses
- Cardinality is fragile: a single careless label can blow up the head. Vanilla Prometheus has no global series limiter (Mimir adds one); the built-in guards are per-scrape settings such as sample_limit, and the standard mitigation remains metric_relabel_configs and operator vigilance.
- Single-node: a pair of Prometheis is "HA" only in the sense that Alertmanager dedups duplicate alerts and queries can fail over. There is no native clustering for the storage. For real horizontal scale you adopt Thanos, Mimir, or VictoriaMetrics.
- Pull doesn't fit short-lived jobs: a CronJob that runs for 30s may not even live long enough to be scraped. The Pushgateway exists as an explicit hack: a long-lived process that batch jobs push to, which Prometheus then scrapes. It complicates the mental model and can keep stale metrics around forever.
- No event/log/trace storage: Prometheus is metrics-only. Logs go to Loki, ELK, or a vendor; traces go to Tempo, Jaeger, or a vendor. A complete observability stack is at least two and usually three separate systems.
- 15-day default retention: long-term capacity planning generally requires remote_write to a long-term store. Local-only Prometheus is a real-time tool, not a historical archive.
- PromQL has sharp edges: counter resets, rate() extrapolation, instant-vs-range type errors, and the "many-to-many matching not allowed" error confuse newcomers. The error messages are getting better, but the semantics still require time to internalize.
Frequently Asked Questions
How does Prometheus achieve ~1.3 bytes per sample on disk?
Why does my recording rule produce no output even though the underlying query works?
… evaluation_interval boundaries. If your expression uses rate(...[1m]) and the scrape interval is 30s, you get exactly two samples in the range, which is borderline enough to fail the "need at least two points" rule. Use a wider window ([5m] is the safe default) and ensure evaluation_interval ≥ scrape_interval.
What does up == 0 mean and why is it the most important alert?
up is a synthetic gauge Prometheus writes itself for every scrape: 1 if the scrape succeeded, 0 otherwise. up == 0 means a target was supposed to be scraped but couldn't be reached (TCP connection failure, HTTP 5xx, scrape timeout). It is the canary metric. Alert on this before anything else, because if scraping is broken your other alerts are silently quiet, not actually fine.
What's the difference between rate() and increase()?
rate(x[5m]) returns per-second slope. increase(x[5m]) returns the absolute count, equivalent to rate(x[5m]) * 300. Both handle counter resets the same way. increase is friendlier for "how many requests in the last 5 minutes" dashboards; rate is the canonical form for alert thresholds and aggregation across windows.
How do native histograms compare to classical histograms?
… _bucket{le="0.1"}, le="0.5", ...). For a 30-bucket histogram with 5 label combinations that's 150 series. A native histogram packs the entire distribution into a single sample using exponential bucket schemas, so it needs one series per label combination, with sub-percent quantile error. They were introduced as an experimental feature in Prometheus 2.40, use a bucket scheme compatible with OTLP exponential histograms, and are positioned as the path forward for new deployments.