ReplacingMergeTree

ReplacingMergeTree is the answer to "I want updateable rows in a column store." Inserts pile up, background merges deduplicate by the sort key keeping the row with the highest version, and reads either wait for those merges or pay the FINAL tax to deduplicate at query time. The 24.x release added an is_deleted column so deletions can be encoded as INSERTs as well, completing the upsert/delete model. It is the engine behind virtually every "current state" table in production ClickHouse — user profiles, order status, account balances — but the FINAL tax means you have to think about the read pattern up front.

How dedup happens during merges

Merges within a partition collapse rows with the same sort key, retaining the one with max version. Cross-partition rows are not deduped.

Key Numbers

sort key

defines uniqueness

version

column picks the winner

eventual

consistency model

2–10×

FINAL slowdown vs plain SELECT

24+

version with is_deleted support

partition

scope of dedup (never crosses)

argMax

read-time alternative to FINAL

Schema

CREATE TABLE users (
    id          UInt64,
    name        String,
    email       String,
    updated_at  DateTime,
    is_deleted  UInt8 DEFAULT 0   -- optional, 24+
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
ORDER BY id;

Two arguments to the engine: the version column and (optionally) the deletion column. Without a version column, ClickHouse keeps an arbitrary row of any duplicate group — useful for "I just want one of these," dangerous if you care which one.

Reads: wait, FINAL, or argMax

Three options for reading "current state" from a ReplacingMergeTree:

-- 1. Plain SELECT: fast, but may include not-yet-merged duplicates
SELECT * FROM users WHERE id = 42;

-- 2. FINAL: dedup at query time
SELECT * FROM users FINAL WHERE id = 42;

-- 3. argMax: explicit, predictable
SELECT id, argMax(name, updated_at) AS name, max(updated_at) AS v
FROM users GROUP BY id;

FINAL forces a merge of overlapping ranges at query time. Cost: 2–10× slower for typical workloads, more for skewed tables. Since 23.x, SELECT … FINAL can run in parallel across non-overlapping primary-key ranges, recovering most of the cost on well-partitioned tables. Set do_not_merge_across_partitions_select_final = 1 on tables you've designed for partition-aligned reads.

OPTIMIZE FINAL

Forces a merge of all parts in a partition now:

OPTIMIZE TABLE users FINAL;
OPTIMIZE TABLE users PARTITION 202401 FINAL;

Useful for "I want clean data right now," dangerous as a regular workload — it triggers full-partition rewrites which compete with the merge scheduler's normal IO budget. Run during off-peak windows; never run on every query.

is_deleted: encoding deletes as inserts

With the second engine argument:

-- "Delete" user 42 by inserting a tombstone
INSERT INTO users VALUES (42, '', '', now(), 1);

-- FINAL drops rows where the winner has is_deleted = 1
SELECT * FROM users FINAL WHERE id = 42;
-- (no rows)

For this to work in plain SELECTs you need SETTINGS clean_deleted_rows = 'Always' on the table, which makes merges drop tombstoned rows entirely. Without that setting, FINAL is required to make tombstones effective.

Family comparison

	ReplacingMergeTree	CollapsingMergeTree	VersionedCollapsingMT
Update model	INSERT new version	INSERT +1 new, −1 old	+1/−1 with version
Out-of-order	OK (max version wins)	Breaks if reordered	OK (version)
Producer state	Just write current row	Must remember old row	Must remember old row
Best for	Upsert / current-state	Counters, deltas	Counters with retries

ReplacingMergeTree is simpler because the producer doesn't need to know the previous row.

Operational considerations

Never expect dedup across partitions. If id=42 lands in two partitions, both copies survive merges.
Sort key cardinality determines merge cost: more unique keys = bigger output parts.
For high-write hot keys, the version column should be monotonic (timestamp + sequence), not just now().
Background dedup is eventual — between an INSERT and the next merge, both versions are visible to plain SELECTs.
Replicated variant exists: ReplicatedReplacingMergeTree wraps the same dedup logic.

Read-time strategies in detail

Three patterns dominate production use of ReplacingMergeTree, each with a different cost profile:

Pattern	Latency	Correctness	When to use
Plain SELECT	Fastest	Eventual (sees duplicates pre-merge)	Approximate analytics, dashboards
SELECT … FINAL	2–10× slower	Strong	Correctness-critical reads
argMax(col, version)	1.5–3× slower	Strong, explicit	Predictable plans, BI tools
Pre-merged via OPTIMIZE	Fastest	Strong (until next INSERT)	Snapshot tables, daily batches

Settings that shape FINAL behavior:

do_not_merge_across_partitions_select_final = 1 — parallelize FINAL across partitions.
final = 1 session-wide — apply FINAL to every read; convenient for BI tools.
parallel_replicas_count — combined with FINAL, splits the dedup work across replicas.

Tradeoffs

+ Producer-friendly upsert model: just write the latest row.
+ No mutation cost — dedup happens during merges that would run anyway.
+ Tombstone deletes via is_deleted match the same model.
− Reads see duplicates until the next merge unless you use FINAL.
− FINAL is slow on skewed tables and cross-partition queries.
− Dedup never crosses partition boundaries.

FAQ

What if two rows have the same version?

Tied rows resolve in an arbitrary but stable order. Use a (timestamp, sequence) tuple for the version to avoid ties.

Can I omit the version column?

Yes — then ClickHouse keeps an arbitrary row from each duplicate group. Fine for "we just need one." Risky if you care which one.

How do I read consistently without FINAL?

Use argMax(col, version) in your SELECT. It does the dedup explicitly and parallelizes naturally.

Should I OPTIMIZE FINAL nightly?

Only if normal merges aren't keeping up. Most tables don't need it — let the merge scheduler do its work.

What if a row's is_deleted arrives before its insert?

If the tombstone has a higher version than any insert, it wins forever — even when the insert eventually arrives with a lower version. Make sure version is sourced from the producer's monotonic timestamp.