Primary Keys in ClickHouse

ClickHouse "primary keys" are not what you think they are. They do not enforce uniqueness — duplicates are legal. They are not B-trees — they are sparse indexes over the sorted data, one entry per granule (8192 rows by default). They live in tiny .idx files that are loaded into RAM at startup and used to skip granules during scans. The ORDER BY clause is what physically sorts the data; PRIMARY KEY, when omitted, defaults to ORDER BY; when set explicitly it must be a prefix of ORDER BY. Choosing a good key — a leading low-cardinality column you filter on, then time — is the single biggest performance lever you have on a MergeTree table.

Sparse index, one mark per granule

A 1B-row table at 8192-row granularity has only ~122k marks — a few MB total, kept in RAM permanently.

Key Numbers

8192

default index_granularity

~122k

marks for a 1B-row table

RAM

primary index lives there permanently

uniqueness constraint

prefix

PK must prefix ORDER BY

.idx

file containing the marks

.mrk

file mapping mark → file offset

PRIMARY KEY vs ORDER BY

CREATE TABLE events (
    ts        DateTime,
    user_id   UInt64,
    event     LowCardinality(String),
    payload   String
)
ENGINE = MergeTree
ORDER BY    (event, ts, user_id)   -- physical sort order
PRIMARY KEY (event, ts);           -- index columns (prefix of ORDER BY)

The data on disk is sorted by (event, ts, user_id). The primary index records the values of (event, ts) at the start of every granule. Why use a shorter PRIMARY KEY than ORDER BY? To shrink the in-memory mark file when the trailing column is high-cardinality and you don't filter on it directly — useful for very wide tables.

If you omit PRIMARY KEY, ClickHouse uses ORDER BY for both. That's the right default.

Why "not unique"

ClickHouse never enforces a uniqueness check. Inserting two rows with the same (event, ts, user_id) just gives you two rows. This is a deliberate tradeoff: enforcing uniqueness on a column-store with append-only parts would require either a global secondary index (expensive on writes) or a read-time dedup (the ReplacingMergeTree approach).

If you need uniqueness:

Use ReplacingMergeTree with the dedup key as ORDER BY.
Or guarantee uniqueness in the producer (idempotent INSERTs by version).
Or check at read time: SELECT … FROM t FINAL or argMax.

Granules and marks on disk

Each column has a .bin (compressed data) and a .mrk (mark file). The mark file maps granule index → (compressed_offset, decompressed_offset). To read granule G42 of column event, ClickHouse reads event.mrk[42], seeks event.bin to that offset, decompresses one block, and returns rows 42 * 8192 through 43 * 8192 - 1.

-- See on-disk size of an index
SELECT
    table,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_in_mem,
    formatReadableSize(primary_key_bytes_in_memory_allocated) AS pk_allocated,
    marks_bytes
FROM system.parts
WHERE table = 'events' AND active
LIMIT 5;

Multi-column keys: leading column matters

ClickHouse can use any prefix of the primary key for granule pruning. WHERE event = 'click' works. WHERE ts = '2024-03-01' alone is useless because ts is the second column — granules are sorted by event first, so any granule could contain that timestamp.

The rule of thumb:

Leading column = low-cardinality, frequently filtered (often partition-aligned).
Trailing column = the one you sort/filter after the leading filter.
Avoid putting high-cardinality columns at the front.

-- Better for "events of type X over time"
ORDER BY (event, ts)

-- Better for "all events of one user over time"
ORDER BY (user_id, ts)

-- Worst: leading high-cardinality column
ORDER BY (uuid, ts)  -- index has 122k unique uuids → no granule pruning

index_granularity tradeoffs

Default is 8192 rows per granule. Smaller granularity = more precise pruning but bigger index. Larger granularity = smaller index but more rows scanned per match.

granularity	marks for 1B rows	scan per match	RAM
1024	~977k	~1k rows	8× larger
8192 (default)	~122k	~8k rows	baseline
65536	~15k	~64k rows	8× smaller

For point-lookup workloads on huge tables, smaller granularity wins. For full-scan analytics, default is fine. Adaptive granularity (index_granularity_bytes) keeps each granule near a target byte size regardless of row width.

Tradeoffs

+ Index is a few MB even for trillion-row tables.
+ Cache-friendly, RAM-resident, no random IO during planning.
+ ORDER BY also enables run-length and delta compression.
− Point lookups still scan ≥ one granule (~8k rows).
− No uniqueness, no FK, no tree-balanced index.
− Trailing-column filters do nothing without a leading-column predicate.

FAQ

Can I add a primary key after CREATE TABLE?

You can add data-skipping indexes (ALTER TABLE … ADD INDEX), but the primary key / ORDER BY is fixed at creation. Recreate via INSERT…SELECT to change it.

Why is the PRIMARY KEY a prefix of ORDER BY?

Because the index records column values at granule boundaries. If PK columns weren't a prefix of the sort, those values wouldn't be in sorted order across granules, and the binary search couldn't work.

Does WHERE user_id IN (1,2,3) use the primary key?

Only if user_id is the leading column. With (event, user_id), you'd need to also filter event for granule pruning to fire.

What if I really need point lookups by uuid?

Add a bloom_filter data-skipping index on uuid. Granules whose bloom rejects the value are skipped. Costs extra disk and a small bloom check per granule.

Are duplicates a bug or feature?

Feature for event tables, bug for dimension tables. ClickHouse doesn't differentiate — pick the engine that matches your dedup expectation (MergeTree for events, ReplacingMergeTree for dimensions).