ReplacingMergeTree
ReplacingMergeTree is the answer to "I want updateable rows in a column store." Inserts pile up,
background merges deduplicate by the sort key keeping the row with the highest version, and reads either
wait for those merges or pay the FINAL tax to deduplicate at query time. The 24.x release added an
is_deleted column so deletions can be encoded as INSERTs as well, completing the upsert/delete model.
It is the engine behind virtually every "current state" table in production ClickHouse — user profiles,
order status, account balances — but the FINAL tax means you have to think about the read pattern up front.
How dedup happens during merges
Merges within a partition collapse rows with the same sort key, retaining the one with max version. Cross-partition rows are not deduped.
Key Numbers
Schema
CREATE TABLE users (
id UInt64,
name String,
email String,
updated_at DateTime,
is_deleted UInt8 DEFAULT 0 -- optional, 24+
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
ORDER BY id; Two arguments to the engine: the version column and (optionally) the deletion column. Without a version column, ClickHouse keeps an arbitrary row of any duplicate group — useful for "I just want one of these," dangerous if you care which one.
Reads: wait, FINAL, or argMax
Three options for reading "current state" from a ReplacingMergeTree:
-- 1. Plain SELECT: fast, but may include not-yet-merged duplicates
SELECT * FROM users WHERE id = 42;
-- 2. FINAL: dedup at query time
SELECT * FROM users FINAL WHERE id = 42;
-- 3. argMax: explicit, predictable
SELECT id, argMax(name, updated_at) AS name, max(updated_at) AS v
FROM users GROUP BY id;
FINAL forces a merge of overlapping ranges at query time. Cost: 2–10× slower for typical workloads, more for
skewed tables. Since 23.x, SELECT … FINAL can run in parallel across non-overlapping primary-key ranges,
recovering most of the cost on well-partitioned tables. Set do_not_merge_across_partitions_select_final = 1
on tables you've designed for partition-aligned reads.
OPTIMIZE FINAL
Forces a merge of all parts in a partition now:
OPTIMIZE TABLE users FINAL;
OPTIMIZE TABLE users PARTITION 202401 FINAL; Useful for "I want clean data right now," dangerous as a regular workload — it triggers full-partition rewrites which compete with the merge scheduler's normal IO budget. Run during off-peak windows; never run on every query.
is_deleted: encoding deletes as inserts
With the second engine argument:
-- "Delete" user 42 by inserting a tombstone
INSERT INTO users VALUES (42, '', '', now(), 1);
-- FINAL drops rows where the winner has is_deleted = 1
SELECT * FROM users FINAL WHERE id = 42;
-- (no rows)
For this to work in plain SELECTs you need SETTINGS clean_deleted_rows = 'Always' on the table, which makes
merges drop tombstoned rows entirely. Without that setting, FINAL is required to make tombstones effective.
Family comparison
| ReplacingMergeTree | CollapsingMergeTree | VersionedCollapsingMT | |
|---|---|---|---|
| Update model | INSERT new version | INSERT +1 new, −1 old | +1/−1 with version |
| Out-of-order | OK (max version wins) | Breaks if reordered | OK (version) |
| Producer state | Just write current row | Must remember old row | Must remember old row |
| Best for | Upsert / current-state | Counters, deltas | Counters with retries |
ReplacingMergeTree is simpler because the producer doesn't need to know the previous row.
Operational considerations
- Never expect dedup across partitions. If
id=42lands in two partitions, both copies survive merges. - Sort key cardinality determines merge cost: more unique keys = bigger output parts.
- For high-write hot keys, the version column should be monotonic (timestamp + sequence), not just
now(). - Background dedup is eventual — between an INSERT and the next merge, both versions are visible to plain SELECTs.
- Replicated variant exists:
ReplicatedReplacingMergeTreewraps the same dedup logic.
Read-time strategies in detail
Three patterns dominate production use of ReplacingMergeTree, each with a different cost profile:
| Pattern | Latency | Correctness | When to use |
|---|---|---|---|
| Plain SELECT | Fastest | Eventual (sees duplicates pre-merge) | Approximate analytics, dashboards |
| SELECT … FINAL | 2–10× slower | Strong | Correctness-critical reads |
| argMax(col, version) | 1.5–3× slower | Strong, explicit | Predictable plans, BI tools |
| Pre-merged via OPTIMIZE | Fastest | Strong (until next INSERT) | Snapshot tables, daily batches |
Settings that shape FINAL behavior:
do_not_merge_across_partitions_select_final = 1— parallelize FINAL across partitions.final = 1session-wide — apply FINAL to every read; convenient for BI tools.parallel_replicas_count— combined with FINAL, splits the dedup work across replicas.
Tradeoffs
- + Producer-friendly upsert model: just write the latest row.
- + No mutation cost — dedup happens during merges that would run anyway.
- + Tombstone deletes via
is_deletedmatch the same model. - − Reads see duplicates until the next merge unless you use FINAL.
- − FINAL is slow on skewed tables and cross-partition queries.
- − Dedup never crosses partition boundaries.
FAQ
What if two rows have the same version?
Tied rows resolve in an arbitrary but stable order. Use a (timestamp, sequence) tuple for the version to avoid ties.
Can I omit the version column?
Yes — then ClickHouse keeps an arbitrary row from each duplicate group. Fine for "we just need one." Risky if you care which one.
How do I read consistently without FINAL?
Use argMax(col, version) in your SELECT. It does the dedup explicitly and parallelizes naturally.
Should I OPTIMIZE FINAL nightly?
Only if normal merges aren't keeping up. Most tables don't need it — let the merge scheduler do its work.
What if a row's is_deleted arrives before its insert?
If the tombstone has a higher version than any insert, it wins forever — even when the insert eventually arrives with a lower version. Make sure version is sourced from the producer's monotonic timestamp.