Partitioning in ClickHouse

A partition is a directory of parts. PARTITION BY tells ClickHouse to keep rows from different partition keys in physically separate parts, which means the query planner can prune entire partitions at WHERE time, you can DROP PARTITION in a single rename, and you can MOVE PARTITION between tables almost instantly. The convention is PARTITION BY toYYYYMM(ts) — one partition per month — but the engine is general-purpose, and choosing wrong is one of the most common mistakes new ClickHouse users make: too many partitions blows up the system.parts count, too few blocks operational levers like rolling drops.

Parts vs partitions

A partition is the unit of operational manipulation; a part is the unit of merging and IO.

Key Numbers

recommended ceiling on partitions per table

300

default parts_to_throw_insert

monthly

canonical partition granularity

rename

cost of DROP / DETACH / ATTACH

main partition operations

never

merges cross partition boundaries

prune

at planning time, before scan

PARTITION BY semantics

CREATE TABLE events (
    ts        DateTime,
    user_id   UInt64,
    event     LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, ts);

The expression is evaluated per row at INSERT time. Rows with different partition values cannot share a part, so an insert that touches three months produces at least three parts.

Partition keys can use any deterministic expression:

toYYYYMM(ts) — month — the most common choice.
toDate(ts) — daily — only for short-retention tables.
(toYYYYMM(ts), tenant_id) — tuple — multi-tenant rolling drops.
cityHash64(user_id) % 100 — bucketed — rarely a good idea.

Partition pruning

At plan time the optimizer evaluates the partition expression against the query's WHERE. Parts whose partition value cannot match are removed from the scan list before any IO. Inspect the result with EXPLAIN ESTIMATE or via system.query_log:

EXPLAIN ESTIMATE
SELECT count() FROM events WHERE ts BETWEEN '2024-03-01' AND '2024-03-31';
-- ┌─database─┬─table──┬─parts─┬───rows─┐
-- │ default  │ events │     3 │ 412881 │
-- └──────────┴────────┴───────┴────────┘

Pruning fires only when the WHERE expression matches the partition expression structurally. WHERE ts >= '...' prunes toYYYYMM(ts); WHERE substring(date_str, 1, 7) = '2024-03' does not.

Partition operations

-- Drop one month — instant; just removes the directory
ALTER TABLE events DROP PARTITION 202401;

-- Detach (move under detached/, recoverable)
ALTER TABLE events DETACH PARTITION 202401;
ALTER TABLE events ATTACH PARTITION 202401;

-- Move data between tables (must share schema/storage policy)
ALTER TABLE events_archive ATTACH PARTITION 202401 FROM events;

-- Move to a different storage tier
ALTER TABLE events MOVE PARTITION 202401 TO DISK 'cold';
ALTER TABLE events MOVE PARTITION 202401 TO VOLUME 's3';

-- Replace one partition's data with another's
ALTER TABLE events REPLACE PARTITION 202401 FROM events_staging;

These are O(rename) — they manipulate file system metadata, not data. Combined with multi-disk storage policies, partition moves implement automatic hot/cold tiering.

system.parts

SELECT
    partition,
    count()                         AS parts,
    sum(rows)                       AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    max(modification_time)          AS last_change
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY partition DESC;

This is the operational dashboard for a MergeTree table. Watch for:

Partition with hundreds of parts → merges falling behind.
Partition with size 0 → orphaned, candidate for DETACH.
Newest partition has many tiny parts → INSERT batch size too small.

Custom partition keys

Beyond time, partition keys are useful for tenant isolation, retention tiers, or segregating reprocessable from immutable data:

-- Multi-tenant: drop one tenant's data without scanning others
PARTITION BY tenant_id

-- Retention tiers: hot = day, warm = month, cold = year
PARTITION BY (tier, toStartOfInterval(ts,
    if(tier = 'hot',  toIntervalDay(1),
       if(tier = 'warm', toIntervalMonth(1), toIntervalYear(1)))))

The cardinality must stay sane: a partition per user_id will produce millions of partitions and ClickHouse will refuse to start.

Limits and pitfalls

Settings that govern partition health:

parts_to_throw_insert = 300 — INSERTs fail when a partition has too many active parts.
parts_to_delay_insert = 150 — INSERTs are slowed before the hard limit.
max_parts_in_total = 100000 — table-wide ceiling.
max_partitions_per_insert_block = 100 — guard against accidental partition explosion.

The "too many parts" error means merges can't keep up. Causes: tiny INSERTs, too-fine partitioning, mutations starving the merge scheduler, slow disk. Fix: batch larger, partition coarser, throttle ALTER.

Tradeoffs

+ O(1) DROP / MOVE / DETACH for entire time ranges.
+ Free pruning for time-bounded queries.
+ Per-partition tiering with multi-disk storage policies.
− Too-fine partitioning blows up part counts and merge work.
− Merges never cross partitions, so partition skew = part skew.
− Predicate must structurally match the partition expression to prune.

FAQ

How many partitions is too many?

Beyond ~1k–10k per table you start seeing slow plan time and slow restarts (parts are scanned at startup). Daily partitioning over a 5-year retention is already 1825 partitions; monthly is 60.

Can I change the partition expression after creation?

No. The partition expression is fixed at table creation. To change it: create a new table with the new key, INSERT … SELECT in the background, swap with RENAME TABLE.

Should partition key match the primary key?

Often yes for time data: PARTITION BY toYYYYMM(ts) + ORDER BY (user_id, ts). The partition prunes the time range and the primary key prunes within the partition.

What does the part name part_1_3_2 mean?

{block_number_min}_{block_number_max}_{merge_level}. A part with min == max is freshly inserted; merge level grows each time it is merged.

Can I DROP a partition that's currently being merged?

Yes. The merge cancels and the resulting parts are removed atomically. There is no risk of partial drop.