High-Cardinality Metrics
Why labels are the silent TSDB killer
Every label you add to a Prometheus metric multiplies the number of time series the
TSDB has to track. A single counter http_requests_total with no labels
is one series. Add method (5 values) and status (40 values)
and you have 200 series. Add user_id (10 million values) and you have
2 billion series — enough to OOM a 256 GB Prometheus inside an hour. Cardinality
is the single most common reason teams replatform their metrics stack.
The fix is rarely "buy more RAM". It is to recognize that identifiers — user IDs, request IDs, full URL paths, customer emails — do not belong on metrics at all. They belong on logs and traces, where the cost model is per-event, not per-unique-combination-forever.
Cardinality Explosion (Visualized)
Each label dimension multiplies. The same metric name can produce one series or one billion, depending only on what you label.
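The multiplication can be checked with a few lines of Python. This is a minimal sketch of the upper bound only; the function name `series_count` is made up for illustration, and real series counts are lower whenever some label combinations never occur.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    # prod() of an empty sequence is 1, which matches "no labels = one series".
    return prod(label_cardinalities.values())

no_labels = series_count({})
with_method_status = series_count({"method": 5, "status": 40})
with_user_id = series_count({"method": 5, "status": 40, "user_id": 10_000_000})

print(no_labels, with_method_status, with_user_id)  # 1 200 2000000000
```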
What Counts as a Time Series
A time series in Prometheus is uniquely identified by the metric name plus the full set of label key/value pairs. If two samples differ in any label value, they belong to different series, each with its own chunks, its own index entry, and its own RAM footprint.
{`# These are FOUR distinct time series:
http_requests_total{method="GET", status="200", route="/api/v1/orders"}
http_requests_total{method="GET", status="200", route="/api/v1/users"}
http_requests_total{method="POST", status="201", route="/api/v1/orders"}
http_requests_total{method="GET", status="500", route="/api/v1/orders"}
# Adding user_id="12345" creates a series for EACH user EACH route EACH status.
# A 10M-user app with 200 routes and 40 statuses => 80 BILLION potential series.`}
Prometheus hashes the label set (xxHash in current releases) to look the series up in its in-memory head index; the series ID itself is a separately assigned reference. Even a series that receives only one sample per day costs an index entry: in the head while it is active, and in the index of every persisted block that contains it until retention deletes that block.
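The identity rule can be sketched in Python. This is an illustration of the rule, not Prometheus's internal representation; `series_key` is a hypothetical helper.

```python
def series_key(name: str, labels: dict[str, str]) -> tuple:
    """Identity of a series: metric name plus the sorted label pairs."""
    return (name, tuple(sorted(labels.items())))

samples = [
    ("http_requests_total", {"method": "GET", "status": "200", "route": "/api/v1/orders"}),
    ("http_requests_total", {"method": "GET", "status": "200", "route": "/api/v1/users"}),
    ("http_requests_total", {"method": "POST", "status": "201", "route": "/api/v1/orders"}),
    ("http_requests_total", {"method": "GET", "status": "500", "route": "/api/v1/orders"}),
    # Same labels as the first sample in a different order: still the same series.
    ("http_requests_total", {"route": "/api/v1/orders", "status": "200", "method": "GET"}),
]

distinct_series = {series_key(name, labels) for name, labels in samples}
print(len(distinct_series))  # 4
```

Label order does not matter, but any change to a single value does, which is why one identifier label is enough to fan a metric out into millions of series.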
The Cost Inside Prometheus
Cardinality cost is not just RAM. It hits four subsystems simultaneously, each with a different threshold.
Head Block (RAM)
The active 2-hour window keeps every series resident in memory. ~3 KB per series for the chunk header plus encoded samples. 1M series ≈ 3 GB RAM just for head data, before posting lists or query workspace.
Posting Lists (RAM + Disk)
For every label key/value pair, Prometheus maintains an inverted index of series
IDs. Querying {job="api"} intersects posting lists. With 100M
series, posting lists alone consume gigabytes and intersect operations dominate
query CPU.
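A toy inverted index shows why selector cost follows cardinality. This sketch uses Python sets and equality matchers only; Prometheus itself stores sorted lists of series references, and the names `build_postings` and `select` are invented for this example.

```python
from collections import defaultdict

def build_postings(series: dict[int, dict[str, str]]) -> dict[tuple[str, str], set[int]]:
    """Map every (label, value) pair to the set of series IDs that carry it."""
    postings: dict[tuple[str, str], set[int]] = defaultdict(set)
    for series_id, labels in series.items():
        for pair in labels.items():
            postings[pair].add(series_id)
    return postings

def select(postings: dict[tuple[str, str], set[int]], matchers: dict[str, str]) -> set[int]:
    """Resolve equality matchers by intersecting one posting list per matcher."""
    lists = [postings.get(pair, set()) for pair in matchers.items()]
    return set.intersection(*lists) if lists else set()

series = {
    1: {"job": "api", "method": "GET"},
    2: {"job": "api", "method": "POST"},
    3: {"job": "db", "method": "GET"},
}
postings = build_postings(series)

print(select(postings, {"job": "api"}))                   # {1, 2}
print(select(postings, {"job": "api", "method": "GET"}))  # {1}
```

Every new label value adds a posting list, and every new series lengthens the lists it appears in, so both memory and intersection work grow with cardinality.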
WAL (Disk + I/O)
Every sample is appended to the write-ahead log before being added to the head block. High cardinality means many small writes scattered across many series — poor sequential I/O patterns and slow WAL replay on restart.
Compaction (CPU + Disk)
Every two hours, head data is flushed to a persistent block. The compactor merges adjacent blocks into larger ones (2h → 6h → 18h → ..., tripling at each level by default). High-cardinality blocks have huge index files and slow compactions, sometimes consuming an entire CPU core for hours.
Detecting the Worst Offenders
Prometheus exposes its own internals as metrics. A handful of PromQL queries find the labels destroying your TSDB before it falls over.
{`# Top metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))
# Distinct values of one suspect label on one metric (PromQL cannot rank label keys
# across the whole TSDB; the /api/v1/status/tsdb endpoint reports that)
count(count by (user_id) (http_requests_total))
# How many series each scrape target is producing
topk(10, count by (job, instance) ({__name__=~".+"}))
# Series churn rate (new series being created per second)
rate(prometheus_tsdb_head_series_created_total[5m])
# Active head series (the live cardinality)
prometheus_tsdb_head_series
# Failed WAL truncations - any increase means the WAL is not being trimmed
increase(prometheus_tsdb_wal_truncations_failed_total[1h])`}
The promtool tsdb analyze command does the same offline against a data directory or snapshot,
printing the highest-cardinality metric names, the highest-cardinality labels, and the most common label pairs.
Run it weekly; cardinality drift is gradual and easy to miss until the OOM.
Label Dropping & Recording Rules
The cheapest fix is to never ingest the offending label. The next cheapest is to pre-aggregate with recording rules, so dashboards and alerts read the small rollup series and the high-cardinality originals only need to be kept briefly.
{`# prometheus.yml - drop identifier labels at scrape time
scrape_configs:
  - job_name: 'api'
    metric_relabel_configs:
      # labeldrop matches label NAMES against regex and takes no source_labels,
      # so it applies to every series in this job. The labels that remain must
      # still identify each series uniquely.
      - action: labeldrop
        regex: 'user_id|request_id|trace_id'
      # Or replace high-cardinality URL paths with route templates
      - source_labels: [path]
        regex: '/users/[0-9]+/orders/[a-f0-9-]+'
        target_label: path
        replacement: '/users/:id/orders/:uuid'`}
Recording rules pre-compute aggregations at evaluation time and store the result as a new (lower-cardinality) series. Prometheus has no per-series retention, so the original high-cardinality series are typically held on a short-retention instance, or filtered out of remote write, while the rollups are kept long term.
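What a `sum by (...)` rollup does to the series count can be sketched in Python. This is a simplified model, not PromQL's evaluator: the helper `sum_by` is hypothetical, the input rates are made up, and a label missing from a series is treated as an empty string.

```python
from collections import defaultdict

def sum_by(series_rates: list[tuple[dict[str, str], float]], keep: list[str]) -> dict[tuple, float]:
    """Add up the values of all series that agree on the kept labels."""
    out: dict[tuple, float] = defaultdict(float)
    for labels, value in series_rates:
        # Every label outside `keep` (here: instance) is discarded from the key.
        key = tuple((k, labels.get(k, "")) for k in keep)
        out[key] += value
    return dict(out)

# Hypothetical per-instance request rates, in requests per second.
raw = [
    ({"route": "/orders", "status": "200", "instance": "a"}, 10.0),
    ({"route": "/orders", "status": "200", "instance": "b"}, 5.0),
    ({"route": "/orders", "status": "500", "instance": "a"}, 1.0),
]

rollup = sum_by(raw, keep=["route", "status"])
print(len(raw), "->", len(rollup))  # 3 -> 2
```

The rollup keeps one series per (route, status) pair regardless of how many instances exist, which is exactly the information it can no longer give back.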
{`# rules.yml
groups:
  - name: api_rollups
    interval: 30s
    rules:
      - record: http_requests:rate5m_by_route
        expr: sum by (route, status) (rate(http_requests_total[5m]))
      - record: http_request_duration:p99_by_route
        expr: |
          histogram_quantile(0.99,
            sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))`}
When to Use Logs or Traces Instead
The strongest cardinality discipline is "if it identifies a single request or user, it goes on a log line or a trace span, never on a metric label."
| Question | Right tool | Why |
|---|---|---|
| How many 500s did user 12345 see today? | Logs | Per-user cardinality is unbounded; logs are per-event |
| What's the p99 latency by route? | Metrics | Bounded label set, aggregable |
| Why did this specific request take 4 seconds? | Traces | Span tree shows the slow span; metrics can't |
| How many unique users called /checkout this hour? | Logs (HLL) | Cardinality is the answer; metrics can't store it |
| Is the error rate elevated right now? | Metrics | Cheap to alert on, low cardinality |
Cortex / Mimir Cardinality Limits
Multi-tenant Prometheus systems impose hard limits per tenant. These prevent one noisy tenant from taking down the cluster.
{`# mimir.yaml - per-tenant runtime overrides
overrides:
  tenant-a:
    max_global_series_per_user: 1000000     # hard ceiling
    max_global_series_per_metric: 100000    # per-metric ceiling
    max_label_names_per_series: 30          # reject samples with too many labels
    max_label_value_length: 2048
    ingestion_rate: 100000                  # samples/sec
    ingestion_burst_size: 200000
  tenant-b:
    max_global_series_per_user: 100000
    max_global_series_per_metric: 10000
    out_of_order_time_window: 10m`}
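When a tenant exceeds max_global_series_per_user, writes that would create new
series are rejected by the ingesters and surface to the client as a 4xx error, while
samples for existing series keep ingesting. (HTTP 429 is what the distributor returns
when a tenant exceeds ingestion_rate.) This is a hard wall designed to protect the
cluster, but it produces angry pages from the tenant who hit it, so tenants need
pre-warning via cardinality alerts.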
Tradeoffs
More labels = better debuggability
Labels make ad-hoc PromQL slicing easy. The first instinct after an incident is "I wish I had a label for X." Adding one is cheap; removing one means losing history.
Fewer labels = stable TSDB
Every label you don't add is one you'll never have to drop later. Aggressive discipline at instrumentation time prevents painful migrations.
Recording rules = cheaper queries, lossy
Pre-aggregated series are 10-100x cheaper to query but throw away the per-instance breakdown. Useful for dashboards, dangerous for incident debugging.
Exemplars = traces from metrics
Exemplars attach a trace ID to a histogram bucket. You get aggregate metrics and a way to drill into a specific slow request — without paying the cardinality cost of a per-request label.
FAQ
What is the practical series limit for a single Prometheus?
Around 10 million active series on a 32 GB box, 50 million on 256 GB. Beyond that, you want a horizontally-scaled system like Mimir, Thanos, or VictoriaMetrics. The bottleneck is usually RAM for the head block plus query workspace, not disk.
Why does adding a single user_id label kill performance?
Because user_id is unbounded and grows with traffic. Every other label has a small fixed value set: 5 HTTP methods, ~40 status codes, hundreds of routes. user_id has as many values as you have users, and each one creates a new series for every other label combination.
Are histograms more cardinality-expensive than counters?
Yes. A histogram with N buckets (counting the implicit +Inf bucket) creates N+2 series per label combination: one per bucket, plus _count and _sum. A 10-bucket histogram with 1000 label combinations is 12,000 series. Native histograms (sparse, exponential) reduce this dramatically, because one series carries all buckets.
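The arithmetic as a short Python sketch. The function names are illustrative, and the bucket count is assumed to include the +Inf bucket.

```python
def classic_histogram_series(buckets: int, label_combinations: int) -> int:
    """One series per bucket (including +Inf), plus _sum and _count, per label combination."""
    return (buckets + 2) * label_combinations

def native_histogram_series(label_combinations: int) -> int:
    """A native histogram carries all of its buckets in a single series."""
    return label_combinations

print(classic_histogram_series(10, 1000))  # 12000
print(native_histogram_series(1000))       # 1000
```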
How do I find a runaway label without taking down Prometheus?
Use the /api/v1/status/tsdb endpoint, which returns the top label names and values by series count without running a heavy PromQL query. Mimir offers comparable per-tenant data through its cardinality API endpoints. promtool tsdb analyze works offline on a block snapshot.
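A small Python sketch of reading that endpoint's response. The field names `seriesCountByMetricName` and `labelValueCountByLabelName` come from the Prometheus HTTP API; the helper names and the sample payload are hypothetical, and the payload is inlined so the sketch runs without a live server.

```python
import json
import urllib.request

def fetch_tsdb_status(base_url: str) -> str:
    """Fetch the raw status document, e.g. base_url='http://localhost:9090'."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return resp.read().decode("utf-8")

def top_offenders(status_json: str, n: int = 3) -> dict:
    """Rank metric names by series count and label names by distinct value count."""
    data = json.loads(status_json)["data"]

    def top(field: str) -> list:
        ranked = sorted(data.get(field, []), key=lambda e: e["value"], reverse=True)
        return [(e["name"], e["value"]) for e in ranked[:n]]

    return {
        "metrics": top("seriesCountByMetricName"),
        "label_names": top("labelValueCountByLabelName"),
    }

# Hypothetical payload shaped like the endpoint's response.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "up", "value": 40},
            {"name": "http_requests_total", "value": 812000},
        ],
        "labelValueCountByLabelName": [
            {"name": "route", "value": 200},
            {"name": "user_id", "value": 790000},
        ],
    },
})

report = top_offenders(SAMPLE, n=1)
print(report)
```

Against a real server, `top_offenders(fetch_tsdb_status("http://localhost:9090"))` would produce the same report from live data; the URL is a placeholder.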
Should I drop labels at scrape time or at query time?
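At scrape time. Query-time aggregation does not reduce storage cost, because the high-cardinality series still exist in the TSDB. metric_relabel_configs with action: labeldrop removes the label before ingestion, so the per-identifier series never gets created. The caveat is that the labels left over must still identify each series uniquely; if they do not, fix the instrumentation instead.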
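A dry run of a labeldrop rule can be sketched in Python before touching the scrape config. This is a simplified model of the relabel step, not Prometheus code: `labeldrop` and `collisions` are hypothetical helpers, and the scraped series are made up. It fully anchors the regex, as Prometheus does for relabel rules, and reports label sets that stop being unique after the drop.

```python
import re
from collections import Counter

def labeldrop(series: list[dict[str, str]], regex: str) -> list[dict[str, str]]:
    """Remove every label whose NAME fully matches the regex."""
    pattern = re.compile(regex)
    return [{k: v for k, v in labels.items() if not pattern.fullmatch(k)} for labels in series]

def collisions(series: list[dict[str, str]]) -> list[dict[str, str]]:
    """Label sets that occur more than once, i.e. series that are no longer unique."""
    counts = Counter(tuple(sorted(labels.items())) for labels in series)
    return [dict(key) for key, count in counts.items() if count > 1]

# Hypothetical series from one scrape.
scraped = [
    {"__name__": "http_requests_total", "route": "/orders", "user_id": "1"},
    {"__name__": "http_requests_total", "route": "/orders", "user_id": "2"},
    {"__name__": "http_requests_total", "route": "/users", "user_id": "1"},
]

dropped = labeldrop(scraped, "user_id|request_id|trace_id")
print(collisions(dropped))
```

Here the two `/orders` series collapse onto the same label set once `user_id` is gone, which is the situation where the label has to be removed or aggregated in the application rather than at scrape time.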
What's the difference between cardinality and churn?
Cardinality is "how many distinct series exist right now". Churn is "how fast new series are created and old ones go stale". Kubernetes rollouts produce churn (every replacement pod brings a new pod name label value) without growing the live series count, but each short-lived series still occupies head-block memory until the head is truncated. High churn is its own performance problem.
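The distinction can be made concrete with a sketch over successive snapshots of the live series set. The snapshots are invented (three rollouts of a two-pod deployment), and `cardinality_and_churn` is a hypothetical helper.

```python
def cardinality_and_churn(snapshots: list[set[str]]) -> list[tuple[int, int]]:
    """For each snapshot after the first: (live series, series absent from the previous snapshot)."""
    results = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        results.append((len(cur), len(cur - prev)))
    return results

# Each rollout replaces both pods, so every series identity is new.
snapshots = [
    {"pod=api-1", "pod=api-2"},
    {"pod=api-3", "pod=api-4"},
    {"pod=api-5", "pod=api-6"},
]

stats = cardinality_and_churn(snapshots)
ever_seen = len(set().union(*snapshots))
print(stats, ever_seen)  # [(2, 2), (2, 2)] 6
```

Live cardinality stays at 2 throughout, yet 6 distinct series have passed through the TSDB, and each of them needed an index entry.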