Architecture

Components (from the architecture diagram):

  • App: /metrics endpoint
  • Prometheus: scrape, TSDB, rule eval
  • Remote write: Cortex / Mimir / VictoriaMetrics
  • Object store: S3 blocks (TSDB), 2 h chunks
  • Alertmanager: dedup, group, route, silence
  • Query engine: PromQL, downsampled rollups
  • Grafana: dashboards
  • PagerDuty / Slack / OpsGenie

Capacity Estimation

Metric                 Value                  Notes
Active series          ~50 M                  medium-large org
Samples/s              ~3.3 M                 50 M series ÷ 15 s scrape interval
Bytes/sample on disk   ~1.3 B                 Gorilla compression
Daily ingestion        ~4.6 TB raw            16 B/sample uncompressed; ~375 GB compressed at 1.3 B/sample
Retention (raw)        15 d local, 13 mo S3   downsampled in S3
Alert rules            ~5 K                   evaluated every 15–60 s
Prometheus RAM         ~32 GB                 WAL + active series
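
The throughput and storage rows follow from three inputs: series count, scrape interval, and bytes per sample. The sketch below recomputes them so the estimate can be redone for other workloads. The `Workload` class and the 16-byte uncompressed sample size (8-byte timestamp plus 8-byte float) are assumptions made for illustration, not part of any Prometheus API.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Workload:
    active_series: int        # distinct time series being scraped
    scrape_interval_s: float  # seconds between scrapes of each target
    bytes_per_sample: float   # on-disk bytes per sample after compression

    @property
    def samples_per_second(self) -> float:
        # Every active series produces one sample per scrape interval.
        return self.active_series / self.scrape_interval_s

    @property
    def samples_per_day(self) -> float:
        return self.samples_per_second * SECONDS_PER_DAY

    def disk_bytes_per_day(self) -> float:
        return self.samples_per_day * self.bytes_per_sample

    def raw_bytes_per_day(self, uncompressed_bytes_per_sample: int = 16) -> float:
        # Assumption: 8-byte timestamp + 8-byte float64 value per sample.
        return self.samples_per_day * uncompressed_bytes_per_sample


w = Workload(active_series=50_000_000, scrape_interval_s=15, bytes_per_sample=1.3)
print(f"{w.samples_per_second / 1e6:.1f} M samples/s")
print(f"{w.disk_bytes_per_day() / 1e9:.0f} GB/day on disk")
print(f"{w.raw_bytes_per_day() / 1e12:.1f} TB/day uncompressed")
```

Halving the scrape interval doubles every derived number, which is why the interval is usually the first knob to revisit when the estimate does not fit the budget.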

Push vs Pull

  • Pull (Prometheus, Datadog Agent in pull mode): the metrics platform discovers targets and HTTP-GETs /metrics on a schedule. Pros: the target list is the source of truth (a failed scrape means the target is down, so alerting on it is automatic); back-pressure flows naturally because the platform sets the scrape rate; targets can sit behind a firewall that only the platform needs to cross.
  • Push (StatsD, Graphite, OpenTelemetry OTLP in push mode): targets send to a collector. Pros: works for short-lived jobs (cron, batch) and serverless; no service discovery needed.

Pull is the default for long-lived processes; push suits short-lived ones. Prometheus offers the Pushgateway for short-lived jobs only, explicitly not as a general push interface.
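
On the pull side, what a target actually serves at /metrics is plain text in the Prometheus exposition format. The sketch below renders one metric family in that format using only the standard library. `render_metric` and `escape_label_value` are hypothetical helper names; a real service would normally rely on a client library rather than hand-rolling this.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort label names so the output is stable between scrapes.
            body = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027),
     ({"method": "POST", "code": "500"}, 3)],
))
```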

Prometheus Storage Model

The TSDB stores time series as (label set, timestamp, value). Internally:

  • WAL: write-ahead log for recent samples; replayed on restart.
  • 2-hour blocks: immutable on-disk blocks of compressed samples (Gorilla / XOR encoding compresses samples to ~1.3 bytes each).
  • Index: postings lists mapping each label pair to the IDs of the series that carry it; the query engine resolves http_requests_total{method="GET"} by intersecting the postings lists for the metric name and for method="GET".
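
To make the postings idea concrete, here is a minimal in-memory sketch. It is a toy model, not the TSDB's on-disk index format; `PostingsIndex` and `intersect_sorted` are illustrative names. The metric name is treated as the label `__name__`, which is how Prometheus models it.

```python
from collections import defaultdict


def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted lists of series IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


class PostingsIndex:
    def __init__(self):
        self._postings = defaultdict(list)  # (label, value) -> sorted series IDs
        self._next_id = 0

    def add_series(self, labels: dict) -> int:
        series_id = self._next_id
        self._next_id += 1
        for pair in labels.items():
            # IDs only ever increase, so each postings list stays sorted.
            self._postings[pair].append(series_id)
        return series_id

    def select(self, matchers: dict) -> list:
        """Return IDs of series matching every label=value pair."""
        lists = [self._postings.get(pair, []) for pair in matchers.items()]
        if not lists:
            return []
        result = lists[0]
        for other in lists[1:]:
            result = intersect_sorted(result, other)
        return result


index = PostingsIndex()
index.add_series({"__name__": "http_requests_total", "method": "GET"})
index.add_series({"__name__": "http_requests_total", "method": "POST"})
index.add_series({"__name__": "process_cpu_seconds_total"})
print(index.select({"__name__": "http_requests_total", "method": "GET"}))
```

Only equality matchers are handled here; regex and negative matchers need unions and set differences over the same lists.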

Memory dominates: each active series consumes ~3 KB in RAM. 10 M series → 30 GB RAM minimum. Alerting rules and recording rules add their own overhead.

Federation, Cortex, Mimir, VictoriaMetrics

One Prometheus per cluster works up to ~5 M series. Past that:

  • Federation: an upper-tier Prometheus scrapes summarized metrics from lower-tier Prometheus servers. Cheap, but lossy; only suited for hierarchical aggregates.
  • Cortex / Mimir: horizontally scalable, Prometheus-compatible long-term storage. Distributors fan out to ingesters; ingesters flush 2-hour blocks to S3; queriers merge recent data held in ingester memory with older blocks in S3. Mimir is Grafana Labs' fork; Cortex is the original CNCF project. Either gives multi-PB retention with PromQL on top.
  • VictoriaMetrics: alternative implementation with a smaller operational footprint and reportedly better compression. Its single-binary setup beats Cortex's 6-component stack for small-to-medium teams.
  • Thanos: a sidecar next to each Prometheus uploads its blocks to object storage, and a query layer provides a global view; older approach, still common.
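
The "distributors fan out to ingesters" step can be pictured as a hash ring with replication. The sketch below is a deliberately simplified model: the `Ring` class, the SHA-256 hash, the token count, and the replication factor of 3 are all assumptions for illustration and do not reproduce the actual Cortex or Mimir ring implementation.

```python
import bisect
import hashlib


def _token(key: str) -> int:
    """Map a string to a 64-bit ring position (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def series_key(tenant: str, labels: dict) -> str:
    """Stable key for a series: tenant plus its sorted label pairs."""
    return tenant + "/" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


class Ring:
    def __init__(self, ingesters, tokens_per_ingester: int = 64):
        # Each ingester owns many tokens so load spreads evenly.
        self._ring = sorted(
            (_token(f"{name}#{i}"), name)
            for name in ingesters
            for i in range(tokens_per_ingester)
        )
        self._tokens = [token for token, _ in self._ring]

    def owners(self, key: str, replication_factor: int = 3) -> list:
        """Walk clockwise from the key's position, collecting distinct ingesters."""
        start = bisect.bisect_right(self._tokens, _token(key))
        found = []
        for step in range(len(self._ring)):
            name = self._ring[(start + step) % len(self._ring)][1]
            if name not in found:
                found.append(name)
                if len(found) == replication_factor:
                    break
        return found


ring = Ring([f"ingester-{n}" for n in range(1, 6)])
key = series_key("team-a", {"__name__": "http_requests_total", "method": "GET"})
print(ring.owners(key))
```

Because the key is derived from the tenant and the label set, every sample of a given series lands on the same ingesters, which is what lets them build contiguous chunks per series.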

Alerting: Rules + Alertmanager

Rules are PromQL expressions evaluated periodically, for example up{job="api"} == 0 with for: 5m, which fires only after the expression has held for five minutes. Each rule yields one alert per series the expression returns. Best practices:

  • Use the four golden signals: latency, traffic, errors, and saturation, per service.
  • Burn-rate alerts on SLOs: alert when the error rate is consuming the error budget faster than allowed; pairing a long window with a short one at each burn rate (for example 1 h with 5 m for fast burn, 6 h with 30 m for slower burn) reduces false positives.
  • Avoid noisy alerts: require the condition to hold for at least 5 minutes before firing, and alert on user-impacting symptoms rather than on causes.
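
A burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below shows the multi-window check under stated assumptions: the window pairs and the thresholds 14.4 and 6 are the commonly cited values from Google's SRE Workbook for a 30-day SLO period, and `should_page` is a hypothetical function name. In practice the per-window error ratios would come from recording rules.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


# (long window, short window, burn-rate threshold); assumed values, see lead-in.
POLICIES = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def should_page(error_ratio_by_window: dict, slo_target: float) -> bool:
    """Page only if both windows of some policy exceed that policy's threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so an already recovered incident stops paging.
    """
    for long_w, short_w, threshold in POLICIES:
        long_burn = burn_rate(error_ratio_by_window[long_w], slo_target)
        short_burn = burn_rate(error_ratio_by_window[short_w], slo_target)
        if long_burn > threshold and short_burn > threshold:
            return True
    return False


ongoing = {"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.02}
recovered = {"1h": 0.02, "5m": 0.0005, "6h": 0.004, "30m": 0.003}
print(should_page(ongoing, slo_target=0.999))
print(should_page(recovered, slo_target=0.999))
```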

Alertmanager handles routing: dedup (one open incident, not 1000 alerts), group (one notification per service per 5 min), route (DB team gets DB alerts), silence (planned maintenance window). A naive integration without Alertmanager pages 50 humans on every regional outage.
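
The dedup, silence, and group steps can be sketched in a few lines. This is a conceptual model only: `group_alerts` is a hypothetical function, silences are reduced to exact label matches, and timing behaviour (group wait, repeat interval) is left out.

```python
from collections import defaultdict


def fingerprint(labels: dict) -> tuple:
    """Identity of an alert: its full label set, order-independent."""
    return tuple(sorted(labels.items()))


def group_alerts(alerts, group_by, silences=()):
    """Return {group key: [alerts]}; one notification would be sent per key.

    alerts   -- iterable of label dicts
    group_by -- label names that define a notification group
    silences -- iterable of label dicts; an alert matching all pairs is muted
    """
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fp = fingerprint(labels)
        if fp in seen:  # dedup: same alert reported more than once
            continue
        seen.add(fp)
        if any(all(labels.get(k) == v for k, v in s.items()) for s in silences):
            continue  # silence: planned maintenance and similar
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "service": "api", "instance": f"api-{i}"}
    for i in range(3)
]
alerts.append(dict(alerts[0]))  # duplicate from a second Prometheus replica
alerts.append({"alertname": "DiskFull", "service": "db", "instance": "db-0"})

groups = group_alerts(alerts, ["alertname", "service"], silences=[{"service": "db"}])
for key, members in groups.items():
    print(key, len(members))
```

Five incoming alerts collapse into a single notification for the api service, which is the effect that keeps a regional outage from paging everyone at once.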

The Cardinality Cliff

The most common Prometheus disaster: someone adds a label like user_id or request_id; series count explodes from 10 K to 10 M; RAM doubles every hour; OOM at 02:00. Cardinality is the product of unique values across labels:

cardinality = product(distinct values per label)

10 jobs × 5 instances × 3 statuses × 100 endpoints = 15 K series. Add a 10 K-distinct-value label and you are at 150 M.
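
The same arithmetic as a small helper, useful for checking a proposed label before it ships. `max_series` is a hypothetical name, and the result is an upper bound: real series counts are lower when not every label combination occurs.

```python
from math import prod


def max_series(distinct_values: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values.values())


base = {"job": 10, "instance": 5, "status": 3, "endpoint": 100}
print(max_series(base))                         # 15000
print(max_series({**base, "user_id": 10_000}))  # 150000000
```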

Mitigations:

  • Label discipline: never use unbounded values (user_id, IP, URLs with query parameters, request_id) as labels. Send those to traces/logs, not metrics.
  • Drop relabeling: metric_relabel_configs at scrape time can drop high-cardinality series or labels before they hit the TSDB.
  • Cardinality alerts: alert when scrape_samples_post_metric_relabeling spikes; this catches new bad labels before OOM.
  • Cortex / Mimir tenant limits: set a maximum number of series per tenant and reject ingestion past the cap.
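
The effect of drop relabeling can be simulated in a few lines. The sketch models two Prometheus relabel actions, `drop` and `labeldrop`, with their documented semantics (source label values joined with ";" and a fully anchored regex). The `apply_relabel` function and the rule dictionaries are illustrative; Python's `re` module stands in for the RE2 syntax Prometheus uses, and the metric and label names are made up.

```python
import re


def apply_relabel(labels: dict, rules: list):
    """Apply drop/labeldrop rules in order; return None if the series is dropped."""
    labels = dict(labels)
    for rule in rules:
        pattern = re.compile(rule["regex"])
        if rule["action"] == "drop":
            # Join the selected label values and match the whole string.
            joined = ";".join(labels.get(n, "") for n in rule["source_labels"])
            if pattern.fullmatch(joined):
                return None
        elif rule["action"] == "labeldrop":
            # Remove every label whose *name* matches the regex.
            labels = {k: v for k, v in labels.items() if not pattern.fullmatch(k)}
    return labels


rules = [
    {"action": "drop", "source_labels": ["__name__"], "regex": "debug_.*"},
    {"action": "labeldrop", "regex": "user_id|request_id"},
]

print(apply_relabel({"__name__": "debug_cache_entries", "shard": "3"}, rules))
print(apply_relabel(
    {"__name__": "http_requests_total", "method": "GET", "user_id": "8841"}, rules
))
```

Removing a label makes previously distinct series identical, so `labeldrop` is only safe when the remaining labels still identify each series uniquely; otherwise fixing the instrumentation is the better route.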

Downsampling and Long-term Storage

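Raw 15 s samples for 13 months at 50 M series come to roughly 150 TB even at ~1.3 bytes/sample (50 M ÷ 15 s ≈ 3.3 M samples/s, × 86,400 s/day × ~395 days × 1.3 B). You do not need 15 s resolution for queries on last quarter. Downsample: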

  • Raw 15 s → 14 d retention.
  • 5 m rollup → 90 d.
  • 1 h rollup → 13 mo.

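An offline compactor (the Thanos compactor, for example) rebuilds blocks at coarser resolution; the query engine routes each query to the appropriate level by query range. Storage for the oldest tier drops by up to roughly 100× versus uniform raw retention, since a 1 h rollup holds 240× fewer points than 15 s raw data but keeps several aggregates per point.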
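
A rollup keeps a few aggregates per bucket instead of every raw point. The sketch below shows both halves of the idea: building buckets and choosing a resolution from the query range. The aggregate set (count, sum, min, max), the function names, and the rule of mapping query range directly onto the retention tiers listed above are assumptions for illustration, not the behaviour of any specific compactor or querier.

```python
DAY = 86_400


def downsample(samples, bucket_s: int) -> list:
    """Aggregate (timestamp_s, value) pairs into fixed-width buckets."""
    buckets = {}
    for ts, value in samples:
        start = int(ts // bucket_s) * bucket_s
        b = buckets.get(start)
        if b is None:
            buckets[start] = {"start": start, "count": 1, "sum": value,
                              "min": value, "max": value}
        else:
            b["count"] += 1
            b["sum"] += value
            b["min"] = min(b["min"], value)
            b["max"] = max(b["max"], value)
    return [buckets[k] for k in sorted(buckets)]


def pick_resolution(query_range_s: float) -> str:
    """Choose the finest tier whose retention covers the whole query range."""
    if query_range_s <= 14 * DAY:
        return "raw"
    if query_range_s <= 90 * DAY:
        return "5m"
    return "1h"


raw = [(t, float(i)) for i, t in enumerate(range(0, 300, 15))]  # 20 samples
print(downsample(raw, bucket_s=300))
print(pick_resolution(6 * 3600), pick_resolution(30 * DAY), pick_resolution(200 * DAY))
```

Keeping count and sum lets the query engine still compute averages over a bucket, while min and max preserve spikes that a plain average would hide.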

ServiceMonitor + PodMonitor in K8s

The Prometheus Operator turns Kubernetes resources into scrape configs. ServiceMonitor selects services (label-matched) and tells Prometheus how to scrape their endpoints (port, path, interval). PodMonitor targets pods directly when there is no service. Probe uses Blackbox Exporter for synthetic checks.

This is the modern way: developers ship a ServiceMonitor with their service's Helm chart; the platform Prometheus picks it up automatically. Without the operator, you maintain scrape configs in a static file and reload Prometheus on every change.
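
For orientation, here is roughly what such a ServiceMonitor contains, built as a Python dict and printed as JSON (which kubectl accepts alongside YAML). The service name, label, and port name are hypothetical; the field names follow the monitoring.coreos.com/v1 API. Whether the platform Prometheus picks the object up also depends on that Prometheus's own selector settings, which are not shown.

```python
import json


def service_monitor(name: str, match_labels: dict, port: str,
                    path: str = "/metrics", interval: str = "30s") -> dict:
    """Build a minimal ServiceMonitor manifest as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name},
        "spec": {
            # Which Services to scrape, matched by label.
            "selector": {"matchLabels": match_labels},
            # How to scrape them: named Service port, path, and interval.
            "endpoints": [{"port": port, "path": path, "interval": interval}],
        },
    }


manifest = service_monitor(
    name="checkout-api",                 # hypothetical service
    match_labels={"app": "checkout-api"},
    port="http-metrics",                 # must be a named port on the Service
)
print(json.dumps(manifest, indent=2))
```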

Failure Modes

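  • Cardinality OOM: covered above.
  • Synchronized scrapes: if all targets are scraped at the same instant, a single collector can saturate its network link. Spread scrapes across the interval with a per-target hash offset; Prometheus does this on its own, so the concern applies mainly to custom collectors.
  • Alert flapping: threshold near steady-state; alerts open/close every minute. Use a for: 5m window or hysteresis (different open/close thresholds).
  • Lost samples on restart: a crashed Prometheus replays its WAL on startup but scrapes nothing while it is down, typically leaving a gap of 1–2 minutes. A second replica remote-writing to a durable store covers the gap.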
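
Staggering by target hash is simple to sketch: derive a stable offset within the scrape interval from the target's identity. This illustrates the idea only; `scrape_offset`, the SHA-256 hash, and the target names are assumptions, not Prometheus's actual hashing scheme.

```python
import hashlib
from collections import Counter


def scrape_offset(target: str, interval_s: float) -> float:
    """Stable offset in [0, interval_s) derived from the target's identity."""
    digest = hashlib.sha256(target.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % int(interval_s * 1000)
    return bucket / 1000.0  # millisecond granularity


interval = 15
targets = [f"10.0.{i // 250}.{i % 250}:9100" for i in range(1000)]

# Count how many scrapes start in each one-second slot of the interval.
per_second = Counter(int(scrape_offset(t, interval)) for t in targets)
for second in sorted(per_second):
    print(second, per_second[second])
```

Because the offset depends only on the target, each target keeps the same slot across restarts, so sample spacing per series stays regular.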

FAQ

Prometheus or a hosted service (Datadog, New Relic)?

Hosted is faster to start, more expensive at scale. Prometheus is the default for K8s-native shops; hosted shines when you need APM + RUM + traces in one pane.

How do I instrument my app?

Use the OpenTelemetry SDK with the Prometheus exporter for metrics; choose what to measure with the RED or USE method and follow standard naming conventions; avoid sneaky high-cardinality labels.

What about logs and traces?

Different stack: Loki/Elasticsearch for logs, Jaeger/Tempo for traces. Metrics, logs, and traces are the three pillars of observability; you need all three. The Observability page covers the cross-cutting story.

Should I run my own Prometheus or use Mimir/Cortex?

Single Prometheus to ~5M series; Mimir/Cortex past that or when you need cross-team multi-tenancy. VictoriaMetrics if you want a simpler operational model.