Sparse index, one mark per granule

Granules (8192 rows each), sorted by primary key G0 G1 G2 G3 G4 G5 G6 primary.idx (one mark per granule) 'a' 'd' 'g' 'k' 'p' 't' 'y' WHERE key = 'h' → binary-search idx → only G2 is scanned G2

A 1B-row table at 8192-row granularity has only ~122k marks — a few MB total, kept in RAM permanently.

Key Numbers

8192
default index_granularity
~122k
marks for a 1B-row table
RAM
primary index lives there permanently
no
uniqueness constraint
prefix
PK must prefix ORDER BY
.idx
file containing the marks
.mrk
file mapping mark → file offset

PRIMARY KEY vs ORDER BY

CREATE TABLE events (
    ts        DateTime,
    user_id   UInt64,
    event     LowCardinality(String),
    payload   String
)
ENGINE = MergeTree
ORDER BY    (event, ts, user_id)   -- physical sort order
PRIMARY KEY (event, ts);           -- index columns (prefix of ORDER BY)

The data on disk is sorted by (event, ts, user_id). The primary index records the values of (event, ts) at the start of every granule. Why use a shorter PRIMARY KEY than ORDER BY? To shrink the in-memory mark file when the trailing column is high-cardinality and you don't filter on it directly — useful for very wide tables.

If you omit PRIMARY KEY, ClickHouse uses ORDER BY for both. That's the right default.

Why "not unique"

ClickHouse never enforces a uniqueness check. Inserting two rows with the same (event, ts, user_id) just gives you two rows. This is a deliberate tradeoff: enforcing uniqueness on a column-store with append-only parts would require either a global secondary index (expensive on writes) or a read-time dedup (the ReplacingMergeTree approach).

If you need uniqueness:

  • Use ReplacingMergeTree with the dedup key as ORDER BY.
  • Or guarantee uniqueness in the producer (idempotent INSERTs by version).
  • Or check at read time: SELECT … FROM t FINAL or argMax.

Granules and marks on disk

Each column has a .bin (compressed data) and a .mrk (mark file). The mark file maps granule index → (compressed_offset, decompressed_offset). To read granule G42 of column event, ClickHouse reads event.mrk[42], seeks event.bin to that offset, decompresses one block, and returns rows 42 * 8192 through 43 * 8192 - 1.

-- See on-disk size of an index
SELECT
    table,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_in_mem,
    formatReadableSize(primary_key_bytes_in_memory_allocated) AS pk_allocated,
    marks_bytes
FROM system.parts
WHERE table = 'events' AND active
LIMIT 5;

Multi-column keys: leading column matters

ClickHouse can use any prefix of the primary key for granule pruning. WHERE event = 'click' works. WHERE ts = '2024-03-01' alone is useless because ts is the second column — granules are sorted by event first, so any granule could contain that timestamp.

The rule of thumb:

  1. Leading column = low-cardinality, frequently filtered (often partition-aligned).
  2. Trailing column = the one you sort/filter after the leading filter.
  3. Avoid putting high-cardinality columns at the front.
-- Better for "events of type X over time"
ORDER BY (event, ts)

-- Better for "all events of one user over time"
ORDER BY (user_id, ts)

-- Worst: leading high-cardinality column
ORDER BY (uuid, ts)  -- index has 122k unique uuids → no granule pruning

index_granularity tradeoffs

Default is 8192 rows per granule. Smaller granularity = more precise pruning but bigger index. Larger granularity = smaller index but more rows scanned per match.

granularitymarks for 1B rowsscan per matchRAM
1024~977k~1k rows8× larger
8192 (default)~122k~8k rowsbaseline
65536~15k~64k rows8× smaller

For point-lookup workloads on huge tables, smaller granularity wins. For full-scan analytics, default is fine. Adaptive granularity (index_granularity_bytes) keeps each granule near a target byte size regardless of row width.

Tradeoffs

  • + Index is a few MB even for trillion-row tables.
  • + Cache-friendly, RAM-resident, no random IO during planning.
  • + ORDER BY also enables run-length and delta compression.
  • Point lookups still scan ≥ one granule (~8k rows).
  • No uniqueness, no FK, no tree-balanced index.
  • Trailing-column filters do nothing without a leading-column predicate.

FAQ

Can I add a primary key after CREATE TABLE?

You can add data-skipping indexes (ALTER TABLE … ADD INDEX), but the primary key / ORDER BY is fixed at creation. Recreate via INSERT…SELECT to change it.

Why is the PRIMARY KEY a prefix of ORDER BY?

Because the index records column values at granule boundaries. If PK columns weren't a prefix of the sort, those values wouldn't be in sorted order across granules, and the binary search couldn't work.

Does WHERE user_id IN (1,2,3) use the primary key?

Only if user_id is the leading column. With (event, user_id), you'd need to also filter event for granule pruning to fire.

What if I really need point lookups by uuid?

Add a bloom_filter data-skipping index on uuid. Granules whose bloom rejects the value are skipped. Costs extra disk and a small bloom check per granule.

Are duplicates a bug or feature?

Feature for event tables, bug for dimension tables. ClickHouse doesn't differentiate — pick the engine that matches your dedup expectation (MergeTree for events, ReplacingMergeTree for dimensions).