Default join shape

RIGHT build hash table (materialized in RAM) HASH key → row LEFT stream blocks (probe one-by-one) JOINED probe ⨝ build

The right side is fully materialized. Memory grows linearly with right-table cardinality; spills are not automatic with hash.

Key Numbers

6
join algorithms in join_algorithm
RIGHT
side is always built first
RAM
limit applies to build side
~1B
rows where partial_merge wins
N
grace_hash buckets spill to disk
O(1)
direct join via dictionary
GLOBAL
required for distributed correctness

The join_algorithm menu

ClickHouse exposes a comma-separated list; the planner picks the first algorithm whose preconditions are met. Default since 24.x is direct,parallel_hash,hash.

AlgorithmHow it worksBest forMemory
hashBuild hash on right, probe with leftSmall right sideO(right)
parallel_hashMulti-threaded build, sharded by key hashMedium right, multi-coreO(right)
partial_mergeSort both sides on key, mergeRight doesn't fit RAMO(sort buffer)
full_sorting_mergeAlways sort both sidesSkewed inputsO(sort buffer)
grace_hashHash with disk spill into N bucketsRight is huge but boundedBounded
directRight is a dictionary; lookup per rowRight is a Dictionary tableO(1) per probe
-- Force a specific algorithm for one query
SELECT u.name, e.event
FROM events AS e
JOIN users AS u ON u.id = e.user_id
SETTINGS join_algorithm = 'partial_merge';

When hash join falls over

hash stores the entire right side in RAM. If you join a 10B-row event table to a 200M-row user table on user_id, ClickHouse will try to put 200M rows in memory. With ~120 bytes/row that's 24 GB before the probe even starts — and max_bytes_in_join typically kills it well before that.

-- Memory ceiling for the build side
SET max_bytes_in_join = 10_000_000_000;  -- 10 GB

-- Spill to disk when over threshold (24+)
SET join_algorithm = 'grace_hash';
SET grace_hash_join_initial_buckets = 8;

grace_hash partitions both sides into buckets that fit in RAM, processes them one at a time, and writes intermediate state to disk. It trades CPU and IO for bounded memory. partial_merge sorts both sides and sweeps a merge pointer — slower for small joins but indifferent to right-side size.

Direct join: the dictionary trick

If the right side is a Dictionary, ClickHouse can skip the build phase entirely and probe with dictGet() per row. This is the direct algorithm. It is O(1) per row and uses constant memory because the dictionary is already in RAM.

CREATE DICTIONARY users_dict (
    id   UInt64,
    name String,
    plan LowCardinality(String)
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(table 'users'))
LAYOUT(HASHED())
LIFETIME(3600);

-- Direct join: no build phase
SELECT e.event, u.plan
FROM events AS e
JOIN users_dict AS u ON u.id = e.user_id;
-- ↑ ClickHouse rewrites this to dictGet('users_dict', 'plan', e.user_id)

For small dimension tables, dictionary + direct join beats every other algorithm. It also dodges the distributed-join correctness footgun (next section).

Distributed JOINs & GLOBAL

With Distributed tables, a plain JOIN runs locally on each shard. If the right side is also Distributed, each shard joins against its own slice of the right table — which is silently wrong unless both tables are co-sharded on the join key.

-- WRONG (each shard sees only local right-side rows)
SELECT e.event, u.name
FROM events_dist AS e
JOIN users_dist  AS u ON u.id = e.user_id;

-- CORRECT (right side broadcast to all shards)
SELECT e.event, u.name
FROM events_dist AS e
GLOBAL JOIN users_dist AS u ON u.id = e.user_id;

GLOBAL JOIN evaluates the right side once on the initiator and ships the result to every shard as a temporary table. Cost: the right side is sent over the network N times (once per shard). For a small right side this is fine; for a large one, co-shard the tables on the join key and use a plain JOIN.

Inspecting joins in system.*

-- Which algorithm was actually used?
SELECT query, ProfileEvents['JoinBuildTableRowCount'] AS build_rows,
       ProfileEvents['JoinProbeTableRowCount']  AS probe_rows,
       Settings['join_algorithm']               AS algo,
       memory_usage
FROM system.query_log
WHERE has(tables, 'users') AND event_time > now() - INTERVAL 1 HOUR
ORDER BY memory_usage DESC LIMIT 10;

-- EXPLAIN PIPELINE shows the join node
EXPLAIN PIPELINE SELECT e.event, u.name
FROM events e JOIN users u ON u.id = e.user_id;
-- (Aggregating)
--   (JoiningTransform)            ← here
--     (ExpressionTransform) × 2

When ClickHouse joins disappoint

  • Predicate pushdown is partial. A WHERE on the joined right column may not push into the right scan; rewrite as a derived table when it matters.
  • No automatic join reordering. ClickHouse joins in the order you wrote. Put the smaller table on the right.
  • NULL semantics. JOIN uses strict equality; asof_join exists for time-aligned joins; for NULL-safe equality use JOIN STRICTNESS = ANY with explicit handling.
  • Multiple JOINs. Each adds another build phase and another peak-memory point. Consider denormalizing or pre-joining via materialized views.
  • Skew. One key with millions of matches (a "celebrity row") expands the join output dramatically; use SAMPLE for diagnostics.

Tradeoffs

  • + Hash join is microsecond-fast for small right sides.
  • + Dictionary + direct join scales to billions of probes with constant memory.
  • + parallel_hash uses every core for the build phase.
  • Right side must fit in RAM unless you opt into grace_hash or partial_merge.
  • No automatic reordering; the planner does what you wrote.
  • Distributed joins require GLOBAL or co-sharding; otherwise wrong answers, not errors.

FAQ

Why is the right side always built first?

Hash join requires a probe-able lookup structure. ClickHouse always treats the right table as build, left as probe. If your right is bigger, swap them — even on inner joins where it would be semantically equivalent.

How is parallel_hash different from hash?

It splits the right side into N hash partitions by key hash, builds each in a separate thread, and probes left blocks against the matching partition. Same memory footprint, much higher build throughput.

Should I use asof_join?

For time-series alignment ("for each event, find the latest user record at-or-before the event time"), yes. It avoids manually-written argMax tricks and runs as a sort-merge.

What about any_join / semi_join?

These are STRICTNESS modifiers — ANY returns one match per left row, SEMI only emits left rows that have a match. They reduce join cardinality and memory; use them when you only need existence, not full join output.

Can ClickHouse use indexes on the join key?

Only the right side benefits — its primary key can prune the scan that feeds the build. The probe never uses the left's index because it scans top-to-bottom by definition.