Partitioning in ClickHouse
A partition is a directory of parts. PARTITION BY tells ClickHouse to keep rows from different partition
keys in physically separate parts, which means the query planner can prune entire partitions at WHERE
time, you can DROP PARTITION in a single rename, and you can MOVE PARTITION between tables almost
instantly. The convention is PARTITION BY toYYYYMM(ts) — one partition per month — but the engine is
general-purpose, and choosing wrong is one of the most common mistakes new ClickHouse users make: too many
partitions blows up the system.parts count, too few blocks operational levers like rolling drops.
Parts vs partitions
A partition is the unit of operational manipulation; a part is the unit of merging and IO.
Key Numbers
parts_to_throw_insertPARTITION BY semantics
CREATE TABLE events (
ts DateTime,
user_id UInt64,
event LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, ts); The expression is evaluated per row at INSERT time. Rows with different partition values cannot share a part, so an insert that touches three months produces at least three parts.
Partition keys can use any deterministic expression:
toYYYYMM(ts)— month — the most common choice.toDate(ts)— daily — only for short-retention tables.(toYYYYMM(ts), tenant_id)— tuple — multi-tenant rolling drops.cityHash64(user_id) % 100— bucketed — rarely a good idea.
Partition pruning
At plan time the optimizer evaluates the partition expression against the query's WHERE. Parts whose
partition value cannot match are removed from the scan list before any IO. Inspect the result with
EXPLAIN ESTIMATE or via system.query_log:
EXPLAIN ESTIMATE
SELECT count() FROM events WHERE ts BETWEEN '2024-03-01' AND '2024-03-31';
-- ┌─database─┬─table──┬─parts─┬───rows─┐
-- │ default │ events │ 3 │ 412881 │
-- └──────────┴────────┴───────┴────────┘
Pruning fires only when the WHERE expression matches the partition expression structurally. WHERE ts >= '...'
prunes toYYYYMM(ts); WHERE substring(date_str, 1, 7) = '2024-03' does not.
Partition operations
-- Drop one month — instant; just removes the directory
ALTER TABLE events DROP PARTITION 202401;
-- Detach (move under detached/, recoverable)
ALTER TABLE events DETACH PARTITION 202401;
ALTER TABLE events ATTACH PARTITION 202401;
-- Move data between tables (must share schema/storage policy)
ALTER TABLE events_archive ATTACH PARTITION 202401 FROM events;
-- Move to a different storage tier
ALTER TABLE events MOVE PARTITION 202401 TO DISK 'cold';
ALTER TABLE events MOVE PARTITION 202401 TO VOLUME 's3';
-- Replace one partition's data with another's
ALTER TABLE events REPLACE PARTITION 202401 FROM events_staging; These are O(rename) — they manipulate file system metadata, not data. Combined with multi-disk storage policies, partition moves implement automatic hot/cold tiering.
system.parts
SELECT
partition,
count() AS parts,
sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS size,
max(modification_time) AS last_change
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY partition DESC; This is the operational dashboard for a MergeTree table. Watch for:
- Partition with hundreds of parts → merges falling behind.
- Partition with size 0 → orphaned, candidate for DETACH.
- Newest partition has many tiny parts → INSERT batch size too small.
Custom partition keys
Beyond time, partition keys are useful for tenant isolation, retention tiers, or segregating reprocessable from immutable data:
-- Multi-tenant: drop one tenant's data without scanning others
PARTITION BY tenant_id
-- Retention tiers: hot = day, warm = month, cold = year
PARTITION BY (tier, toStartOfInterval(ts,
if(tier = 'hot', toIntervalDay(1),
if(tier = 'warm', toIntervalMonth(1), toIntervalYear(1))))) The cardinality must stay sane: a partition per user_id will produce millions of partitions and ClickHouse will refuse to start.
Limits and pitfalls
Settings that govern partition health:
parts_to_throw_insert = 300— INSERTs fail when a partition has too many active parts.parts_to_delay_insert = 150— INSERTs are slowed before the hard limit.max_parts_in_total = 100000— table-wide ceiling.max_partitions_per_insert_block = 100— guard against accidental partition explosion.
The "too many parts" error means merges can't keep up. Causes: tiny INSERTs, too-fine partitioning, mutations starving the merge scheduler, slow disk. Fix: batch larger, partition coarser, throttle ALTER.
Tradeoffs
- + O(1) DROP / MOVE / DETACH for entire time ranges.
- + Free pruning for time-bounded queries.
- + Per-partition tiering with multi-disk storage policies.
- − Too-fine partitioning blows up part counts and merge work.
- − Merges never cross partitions, so partition skew = part skew.
- − Predicate must structurally match the partition expression to prune.
FAQ
How many partitions is too many?
Beyond ~1k–10k per table you start seeing slow plan time and slow restarts (parts are scanned at startup). Daily partitioning over a 5-year retention is already 1825 partitions; monthly is 60.
Can I change the partition expression after creation?
No. The partition expression is fixed at table creation. To change it: create a new table with the new key, INSERT … SELECT in the background, swap with RENAME TABLE.
Should partition key match the primary key?
Often yes for time data: PARTITION BY toYYYYMM(ts) + ORDER BY (user_id, ts). The partition prunes the time range and the primary key prunes within the partition.
What does the part name part_1_3_2 mean?
{block_number_min}_{block_number_max}_{merge_level}. A part with min == max is freshly inserted; merge level grows each time it is merged.
Can I DROP a partition that's currently being merged?
Yes. The merge cancels and the resulting parts are removed atomically. There is no risk of partial drop.