How dedup happens during merges

Two parts, sorted by (id): part_5_5_0 id=42 v=1 name=Alice id=42 v=3 name=Alicia id=99 v=1 name=Bob part_6_6_0 id=42 v=2 name=Ali id=99 v=2 name=Bobby ↓ background merge by sort key (id), keep MAX(version) part_5_6_1 id=42 v=3 name=Alicia ✓ id=99 v=2 name=Bobby ✓

Merges within a partition collapse rows with the same sort key, retaining the one with max version. Cross-partition rows are not deduped.

Key Numbers

sort key
defines uniqueness
version
column picks the winner
eventual
consistency model
2–10×
FINAL slowdown vs plain SELECT
24+
version with is_deleted support
partition
scope of dedup (never crosses)
argMax
read-time alternative to FINAL

Schema

CREATE TABLE users (
    id          UInt64,
    name        String,
    email       String,
    updated_at  DateTime,
    is_deleted  UInt8 DEFAULT 0   -- optional, 24+
)
ENGINE = ReplacingMergeTree(updated_at, is_deleted)
ORDER BY id;

Two arguments to the engine: the version column and (optionally) the deletion column. Without a version column, ClickHouse keeps an arbitrary row of any duplicate group — useful for "I just want one of these," dangerous if you care which one.

Reads: wait, FINAL, or argMax

Three options for reading "current state" from a ReplacingMergeTree:

-- 1. Plain SELECT: fast, but may include not-yet-merged duplicates
SELECT * FROM users WHERE id = 42;

-- 2. FINAL: dedup at query time
SELECT * FROM users FINAL WHERE id = 42;

-- 3. argMax: explicit, predictable
SELECT id, argMax(name, updated_at) AS name, max(updated_at) AS v
FROM users GROUP BY id;

FINAL forces a merge of overlapping ranges at query time. Cost: 2–10× slower for typical workloads, more for skewed tables. Since 23.x, SELECT … FINAL can run in parallel across non-overlapping primary-key ranges, recovering most of the cost on well-partitioned tables. Set do_not_merge_across_partitions_select_final = 1 on tables you've designed for partition-aligned reads.

OPTIMIZE FINAL

Forces a merge of all parts in a partition now:

OPTIMIZE TABLE users FINAL;
OPTIMIZE TABLE users PARTITION 202401 FINAL;

Useful for "I want clean data right now," dangerous as a regular workload — it triggers full-partition rewrites which compete with the merge scheduler's normal IO budget. Run during off-peak windows; never run on every query.

is_deleted: encoding deletes as inserts

With the second engine argument:

-- "Delete" user 42 by inserting a tombstone
INSERT INTO users VALUES (42, '', '', now(), 1);

-- FINAL drops rows where the winner has is_deleted = 1
SELECT * FROM users FINAL WHERE id = 42;
-- (no rows)

For this to work in plain SELECTs you need SETTINGS clean_deleted_rows = 'Always' on the table, which makes merges drop tombstoned rows entirely. Without that setting, FINAL is required to make tombstones effective.

Family comparison

ReplacingMergeTreeCollapsingMergeTreeVersionedCollapsingMT
Update modelINSERT new versionINSERT +1 new, −1 old+1/−1 with version
Out-of-orderOK (max version wins)Breaks if reorderedOK (version)
Producer stateJust write current rowMust remember old rowMust remember old row
Best forUpsert / current-stateCounters, deltasCounters with retries

ReplacingMergeTree is simpler because the producer doesn't need to know the previous row.

Operational considerations

  • Never expect dedup across partitions. If id=42 lands in two partitions, both copies survive merges.
  • Sort key cardinality determines merge cost: more unique keys = bigger output parts.
  • For high-write hot keys, the version column should be monotonic (timestamp + sequence), not just now().
  • Background dedup is eventual — between an INSERT and the next merge, both versions are visible to plain SELECTs.
  • Replicated variant exists: ReplicatedReplacingMergeTree wraps the same dedup logic.

Read-time strategies in detail

Three patterns dominate production use of ReplacingMergeTree, each with a different cost profile:

PatternLatencyCorrectnessWhen to use
Plain SELECTFastestEventual (sees duplicates pre-merge)Approximate analytics, dashboards
SELECT … FINAL2–10× slowerStrongCorrectness-critical reads
argMax(col, version)1.5–3× slowerStrong, explicitPredictable plans, BI tools
Pre-merged via OPTIMIZEFastestStrong (until next INSERT)Snapshot tables, daily batches

Settings that shape FINAL behavior:

  • do_not_merge_across_partitions_select_final = 1 — parallelize FINAL across partitions.
  • final = 1 session-wide — apply FINAL to every read; convenient for BI tools.
  • parallel_replicas_count — combined with FINAL, splits the dedup work across replicas.

Tradeoffs

  • + Producer-friendly upsert model: just write the latest row.
  • + No mutation cost — dedup happens during merges that would run anyway.
  • + Tombstone deletes via is_deleted match the same model.
  • Reads see duplicates until the next merge unless you use FINAL.
  • FINAL is slow on skewed tables and cross-partition queries.
  • Dedup never crosses partition boundaries.

FAQ

What if two rows have the same version?

Tied rows resolve in an arbitrary but stable order. Use a (timestamp, sequence) tuple for the version to avoid ties.

Can I omit the version column?

Yes — then ClickHouse keeps an arbitrary row from each duplicate group. Fine for "we just need one." Risky if you care which one.

How do I read consistently without FINAL?

Use argMax(col, version) in your SELECT. It does the dedup explicitly and parallelizes naturally.

Should I OPTIMIZE FINAL nightly?

Only if normal merges aren't keeping up. Most tables don't need it — let the merge scheduler do its work.

What if a row's is_deleted arrives before its insert?

If the tombstone has a higher version than any insert, it wins forever — even when the insert eventually arrives with a lower version. Make sure version is sourced from the producer's monotonic timestamp.