Primary Keys in ClickHouse
ClickHouse "primary keys" are not what you think they are. They do not enforce uniqueness — duplicates are
legal. They are not B-trees — they are sparse indexes over the sorted data, one entry per
granule (8192 rows by default). They live in tiny .idx files that are loaded into RAM at startup
and used to skip granules during scans. The ORDER BY clause is what physically sorts the data;
PRIMARY KEY, when omitted, defaults to ORDER BY; when set explicitly it must be a prefix of
ORDER BY. Choosing a good key — a leading low-cardinality column you filter on, then time — is the single
biggest performance lever you have on a MergeTree table.
Sparse index, one mark per granule
A 1B-row table at 8192-row granularity has only ~122k marks — a few MB total, kept in RAM permanently.
Key Numbers
index_granularityPRIMARY KEY vs ORDER BY
CREATE TABLE events (
ts DateTime,
user_id UInt64,
event LowCardinality(String),
payload String
)
ENGINE = MergeTree
ORDER BY (event, ts, user_id) -- physical sort order
PRIMARY KEY (event, ts); -- index columns (prefix of ORDER BY)
The data on disk is sorted by (event, ts, user_id). The primary index records the values of
(event, ts) at the start of every granule. Why use a shorter PRIMARY KEY than ORDER BY? To shrink the
in-memory mark file when the trailing column is high-cardinality and you don't filter on it directly — useful
for very wide tables.
If you omit PRIMARY KEY, ClickHouse uses ORDER BY for both. That's the right default.
Why "not unique"
ClickHouse never enforces a uniqueness check. Inserting two rows with the same (event, ts, user_id) just
gives you two rows. This is a deliberate tradeoff: enforcing uniqueness on a column-store with append-only parts
would require either a global secondary index (expensive on writes) or a read-time dedup (the
ReplacingMergeTree approach).
If you need uniqueness:
- Use
ReplacingMergeTreewith the dedup key as ORDER BY. - Or guarantee uniqueness in the producer (idempotent INSERTs by version).
- Or check at read time:
SELECT … FROM t FINALorargMax.
Granules and marks on disk
Each column has a .bin (compressed data) and a .mrk (mark file). The mark file maps granule index →
(compressed_offset, decompressed_offset). To read granule G42 of column event, ClickHouse reads
event.mrk[42], seeks event.bin to that offset, decompresses one block, and returns rows
42 * 8192 through 43 * 8192 - 1.
-- See on-disk size of an index
SELECT
table,
formatReadableSize(primary_key_bytes_in_memory) AS pk_in_mem,
formatReadableSize(primary_key_bytes_in_memory_allocated) AS pk_allocated,
marks_bytes
FROM system.parts
WHERE table = 'events' AND active
LIMIT 5; Multi-column keys: leading column matters
ClickHouse can use any prefix of the primary key for granule pruning. WHERE event = 'click' works.
WHERE ts = '2024-03-01' alone is useless because ts is the second column — granules are sorted by
event first, so any granule could contain that timestamp.
The rule of thumb:
- Leading column = low-cardinality, frequently filtered (often partition-aligned).
- Trailing column = the one you sort/filter after the leading filter.
- Avoid putting high-cardinality columns at the front.
-- Better for "events of type X over time"
ORDER BY (event, ts)
-- Better for "all events of one user over time"
ORDER BY (user_id, ts)
-- Worst: leading high-cardinality column
ORDER BY (uuid, ts) -- index has 122k unique uuids → no granule pruning index_granularity tradeoffs
Default is 8192 rows per granule. Smaller granularity = more precise pruning but bigger index. Larger granularity = smaller index but more rows scanned per match.
| granularity | marks for 1B rows | scan per match | RAM |
|---|---|---|---|
| 1024 | ~977k | ~1k rows | 8× larger |
| 8192 (default) | ~122k | ~8k rows | baseline |
| 65536 | ~15k | ~64k rows | 8× smaller |
For point-lookup workloads on huge tables, smaller granularity wins. For full-scan analytics, default is fine.
Adaptive granularity (index_granularity_bytes) keeps each granule near a target byte size regardless of
row width.
Tradeoffs
- + Index is a few MB even for trillion-row tables.
- + Cache-friendly, RAM-resident, no random IO during planning.
- + ORDER BY also enables run-length and delta compression.
- − Point lookups still scan ≥ one granule (~8k rows).
- − No uniqueness, no FK, no tree-balanced index.
- − Trailing-column filters do nothing without a leading-column predicate.
FAQ
Can I add a primary key after CREATE TABLE?
You can add data-skipping indexes (ALTER TABLE … ADD INDEX), but the primary key / ORDER BY is fixed at creation. Recreate via INSERT…SELECT to change it.
Why is the PRIMARY KEY a prefix of ORDER BY?
Because the index records column values at granule boundaries. If PK columns weren't a prefix of the sort, those values wouldn't be in sorted order across granules, and the binary search couldn't work.
Does WHERE user_id IN (1,2,3) use the primary key?
Only if user_id is the leading column. With (event, user_id), you'd need to also filter event for granule pruning to fire.
What if I really need point lookups by uuid?
Add a bloom_filter data-skipping index on uuid. Granules whose bloom rejects the value are skipped. Costs extra disk and a small bloom check per granule.
Are duplicates a bug or feature?
Feature for event tables, bug for dimension tables. ClickHouse doesn't differentiate — pick the engine that matches your dedup expectation (MergeTree for events, ReplacingMergeTree for dimensions).