JOINs in ClickHouse

ClickHouse joins look like SQL but behave like a column store: the right-hand table is built into a hash table in memory, the left-hand table streams through, and there is no query optimizer that will quietly swap join order for you. The default hash algorithm is fast for small right sides; once that table outgrows RAM you have a menu — partial_merge, grace_hash, parallel_hash, direct — each with different memory and latency profiles. Then there are distributed joins, where the join key, sharding key, and GLOBAL JOIN keyword decide whether you get the right answer or a quietly wrong one.

Default join shape

The right side is fully materialized. Memory grows linearly with right-table cardinality; spills are not automatic with hash.

Key Numbers

join algorithms in join_algorithm

RIGHT

side is always built first

RAM

limit applies to build side

~1B

rows where partial_merge wins

grace_hash buckets spill to disk

O(1)

direct join via dictionary

GLOBAL

required for distributed correctness

The `join_algorithm` menu

ClickHouse exposes a comma-separated list; the planner picks the first algorithm whose preconditions are met. Default since 24.x is direct,parallel_hash,hash.

Algorithm	How it works	Best for	Memory
`hash`	Build hash on right, probe with left	Small right side	O(right)
`parallel_hash`	Multi-threaded build, sharded by key hash	Medium right, multi-core	O(right)
`partial_merge`	Sort both sides on key, merge	Right doesn't fit RAM	O(sort buffer)
`full_sorting_merge`	Always sort both sides	Skewed inputs	O(sort buffer)
`grace_hash`	Hash with disk spill into N buckets	Right is huge but bounded	Bounded
`direct`	Right is a dictionary; lookup per row	Right is a Dictionary table	O(1) per probe

-- Force a specific algorithm for one query
SELECT u.name, e.event
FROM events AS e
JOIN users AS u ON u.id = e.user_id
SETTINGS join_algorithm = 'partial_merge';

When hash join falls over

hash stores the entire right side in RAM. If you join a 10B-row event table to a 200M-row user table on user_id, ClickHouse will try to put 200M rows in memory. With ~120 bytes/row that's 24 GB before the probe even starts — and max_bytes_in_join typically kills it well before that.

-- Memory ceiling for the build side
SET max_bytes_in_join = 10_000_000_000;  -- 10 GB

-- Spill to disk when over threshold (24+)
SET join_algorithm = 'grace_hash';
SET grace_hash_join_initial_buckets = 8;

grace_hash partitions both sides into buckets that fit in RAM, processes them one at a time, and writes intermediate state to disk. It trades CPU and IO for bounded memory. partial_merge sorts both sides and sweeps a merge pointer — slower for small joins but indifferent to right-side size.

Direct join: the dictionary trick

If the right side is a Dictionary, ClickHouse can skip the build phase entirely and probe with dictGet() per row. This is the direct algorithm. It is O(1) per row and uses constant memory because the dictionary is already in RAM.

CREATE DICTIONARY users_dict (
    id   UInt64,
    name String,
    plan LowCardinality(String)
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(table 'users'))
LAYOUT(HASHED())
LIFETIME(3600);

-- Direct join: no build phase
SELECT e.event, u.plan
FROM events AS e
JOIN users_dict AS u ON u.id = e.user_id;
-- ↑ ClickHouse rewrites this to dictGet('users_dict', 'plan', e.user_id)

For small dimension tables, dictionary + direct join beats every other algorithm. It also dodges the distributed-join correctness footgun (next section).

Distributed JOINs & `GLOBAL`

With Distributed tables, a plain JOIN runs locally on each shard. If the right side is also Distributed, each shard joins against its own slice of the right table — which is silently wrong unless both tables are co-sharded on the join key.

-- WRONG (each shard sees only local right-side rows)
SELECT e.event, u.name
FROM events_dist AS e
JOIN users_dist  AS u ON u.id = e.user_id;

-- CORRECT (right side broadcast to all shards)
SELECT e.event, u.name
FROM events_dist AS e
GLOBAL JOIN users_dist AS u ON u.id = e.user_id;

GLOBAL JOIN evaluates the right side once on the initiator and ships the result to every shard as a temporary table. Cost: the right side is sent over the network N times (once per shard). For a small right side this is fine; for a large one, co-shard the tables on the join key and use a plain JOIN.

Inspecting joins in `system.*`

-- Which algorithm was actually used?
SELECT query, ProfileEvents['JoinBuildTableRowCount'] AS build_rows,
       ProfileEvents['JoinProbeTableRowCount']  AS probe_rows,
       Settings['join_algorithm']               AS algo,
       memory_usage
FROM system.query_log
WHERE has(tables, 'users') AND event_time > now() - INTERVAL 1 HOUR
ORDER BY memory_usage DESC LIMIT 10;

-- EXPLAIN PIPELINE shows the join node
EXPLAIN PIPELINE SELECT e.event, u.name
FROM events e JOIN users u ON u.id = e.user_id;
-- (Aggregating)
--   (JoiningTransform)            ← here
--     (ExpressionTransform) × 2

When ClickHouse joins disappoint

Predicate pushdown is partial. A WHERE on the joined right column may not push into the right scan; rewrite as a derived table when it matters.
No automatic join reordering. ClickHouse joins in the order you wrote. Put the smaller table on the right.
NULL semantics. JOIN uses strict equality; asof_join exists for time-aligned joins; for NULL-safe equality use JOIN STRICTNESS = ANY with explicit handling.
Multiple JOINs. Each adds another build phase and another peak-memory point. Consider denormalizing or pre-joining via materialized views.
Skew. One key with millions of matches (a "celebrity row") expands the join output dramatically; use SAMPLE for diagnostics.

Tradeoffs

+ Hash join is microsecond-fast for small right sides.
+ Dictionary + direct join scales to billions of probes with constant memory.
+ parallel_hash uses every core for the build phase.
− Right side must fit in RAM unless you opt into grace_hash or partial_merge.
− No automatic reordering; the planner does what you wrote.
− Distributed joins require GLOBAL or co-sharding; otherwise wrong answers, not errors.

FAQ

Why is the right side always built first?

Hash join requires a probe-able lookup structure. ClickHouse always treats the right table as build, left as probe. If your right is bigger, swap them — even on inner joins where it would be semantically equivalent.

How is parallel_hash different from hash?

It splits the right side into N hash partitions by key hash, builds each in a separate thread, and probes left blocks against the matching partition. Same memory footprint, much higher build throughput.

Should I use asof_join?

For time-series alignment ("for each event, find the latest user record at-or-before the event time"), yes. It avoids manually-written argMax tricks and runs as a sort-merge.

What about any_join / semi_join?

These are STRICTNESS modifiers — ANY returns one match per left row, SEMI only emits left rows that have a match. They reduce join cardinality and memory; use them when you only need existence, not full join output.

Can ClickHouse use indexes on the join key?

Only the right side benefits — its primary key can prune the scan that feeds the build. The probe never uses the left's index because it scans top-to-bottom by definition.

Default join shape

Key Numbers

The join_algorithm menu