Mutations in ClickHouse

ClickHouse is an immutable column store, so "UPDATE" and "DELETE" don't mutate rows in place — they rewrite entire parts. ALTER TABLE … UPDATE is async, expensive, and observable through system.mutations. DELETE FROM (the "lightweight" variant) marks rows with a hidden _row_exists mask and lets compaction clean up later. Both are valid for occasional GDPR-style operations and dangerous as a regular workload. The right tool for "mutable" data is usually ReplacingMergeTree or CollapsingMergeTree, where mutations are encoded as new INSERTs and resolved by the merge process.

What an ALTER UPDATE actually does

A mutation that changes one row in one part rewrites a whole 100 GB part. Use partition-key WHERE clauses to limit the blast radius.

Key Numbers

async

default mutation submission mode

whole

part is rewritten — even for 1 row

_row_exists

virtual mask for lightweight DELETE

mutation kinds: ALTER UPDATE / DELETE

0–2

values for mutations_sync

N×

disk write ≈ size of touched parts

part-id

granularity of mutation skipping

ALTER UPDATE / ALTER DELETE

Submitted to a queue, applied asynchronously by background worker threads. The query returns immediately.

ALTER TABLE events
UPDATE state = 'archived'
WHERE ts < '2024-01-01';

ALTER TABLE events
DELETE WHERE user_id = 12345;

To wait synchronously:

SET mutations_sync = 1;  -- wait for current replica
SET mutations_sync = 2;  -- wait for ALL replicas
ALTER TABLE events DELETE WHERE …;

Lightweight DELETE FROM

Introduced in 22.x; the standard SQL form. Instead of rewriting parts, it writes a _row_exists mask. Reads filter by the mask. Background merges drop masked rows when they next rewrite the part anyway.

DELETE FROM events WHERE user_id = 12345;

	ALTER DELETE	DELETE FROM
Style	Mutation (rewrite)	Mask (lightweight)
Submission	Async by default	Sync, reflected in next read
Disk write	Whole parts	Mask file only
Reads after	See deleted gone	See `_row_exists=0` filtered
Best for	Bulk archival	Targeted GDPR-style

system.mutations

SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    is_done,
    parts_to_do,
    latest_failed_part,
    latest_fail_reason
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time DESC;

Cancel a stuck mutation with KILL MUTATION:

KILL MUTATION WHERE database = 'default' AND mutation_id = '0000000123';

Why mutations are expensive

A mutation is implemented by enumerating every part of the table, evaluating the WHERE on each part's primary index, and rewriting any part that could contain a matching row. That means:

One row's UPDATE may rewrite a 100 GB part.
Replicated mutations run on every replica independently — N× the IO.
Mutations bypass the merge scheduler's part-size budgets, so they can stall regular merges.
If the WHERE includes the partition key, only matching partitions are touched. Always include partition predicates.

-- Good: only one partition is rewritten
ALTER TABLE events
UPDATE state = 'archived'
WHERE toYYYYMM(ts) = 202401 AND state = 'pending';

-- Bad: every partition is scanned
ALTER TABLE events
UPDATE state = 'archived'
WHERE state = 'pending';

The right way: mutation-by-INSERT

For high-rate "updates" use a MergeTree variant that resolves duplicates during merging:

ReplacingMergeTree(version) — last-write-wins by sorting key.
CollapsingMergeTree(sign) — +1/−1 rows cancel during merges.
VersionedCollapsingMergeTree(sign, version) — same, but tolerates out-of-order inserts.

-- Conceptual "UPDATE" via INSERT
INSERT INTO users_replacing
VALUES (42, 'new_name', now());  -- version > previous → wins on merge

-- Read-time dedup if not yet merged:
SELECT * FROM users_replacing FINAL WHERE id = 42;

These engines push the cost from "mutation time" (rewrite parts) to "merge time" (already happening) and "read time" (FINAL, optional). For workloads with many small updates this is orders of magnitude cheaper.

Tuning knobs

Setting	Default	Effect
`mutations_sync`	0	0=async, 1=wait for this replica, 2=wait for all
`background_pool_size`	16	Worker threads for merges + mutations
`number_of_free_entries_in_pool_to_execute_mutation`	10	Reserves merge capacity
`max_part_loading_threads`	auto	Parallelism when scanning parts to mutate
`lightweight_deletes_sync`	2	Wait policy for DELETE FROM
`apply_mutations_on_fly`	0	If 1, queries see mutation effects before parts rewritten

apply_mutations_on_fly = 1 (24+) is the modern story for "I want my UPDATE visible immediately." It applies the mutation expression at read time until the background rewrite catches up. Good for low-rate mutations; expensive for high-rate ones because every read pays the filter cost.

Mutation IO budget

Mutations and merges share the background_pool_size thread pool. A long mutation can starve regular merges, leading to "too many parts" errors. Production rules:

Run heavy mutations during off-peak hours.
Always include partition predicates to bound work.
Watch system.merge_tree_settings.background_pool_size and increase if mutations + merges saturate.
Prefer lightweight DELETE for "delete N rows now"; reserve ALTER UPDATE for bulk archival.

Tradeoffs

+ Standard SQL semantics; familiar to PostgreSQL/MySQL users.
+ Lightweight DELETE is fast and visible immediately.
+ KILL MUTATION lets you back out if you set the WHERE wrong.
− ALTER UPDATE/DELETE are not a transactional row update.
− One bad mutation can rewrite the entire table.
− Replication multiplies the IO cost by N replicas.
− Mutations queue serially per table; back-pressure can be invisible.

FAQ

How do I tell if a mutation is "almost done"?

parts_to_do in system.mutations. It decrements as parts are rewritten. If it stalls, check latest_fail_reason.

Can I run two mutations in parallel?

The mutation queue per table is serial. If you submit two, they run in order. Mutations on different tables run in parallel up to the worker pool size.

What is _row_exists?

A virtual UInt8 column added by lightweight DELETE. WHERE _row_exists filters out logically-deleted rows automatically; you rarely need to reference it directly.

Will a SELECT see the result of an ALTER UPDATE immediately?

No. Mutations are async by default; reads still see old data until the relevant part is rewritten. Use SET mutations_sync=1 for "wait for me" semantics, or use lightweight DELETE/ReplacingMergeTree.

Should I run ALTER UPDATE during ingestion?

Avoid it. Mutations compete with merges for IO; running them during peak ingestion exacerbates the "too many parts" risk. Schedule for off-peak or use ReplacingMergeTree.