Sorting Tables Larger Than Memory
How DuckDB Uses External Merge Sort to Handle Massive Datasets
Sorting is one of the most fundamental database operations — ORDER BY, GROUP BY, merge joins, and window functions all depend on it. But what happens when you need to sort 10 GB of data with only 4 GB of RAM? You cannot load everything at once. DuckDB solves this with external merge sort: sort memory-sized chunks independently, write sorted runs to disk, then merge them together using a k-way merge with a tournament tree. Combined with columnar optimizations that avoid moving actual row data, DuckDB sorts datasets many times larger than available memory without falling off a performance cliff.
The Problem: Memory vs Data Size
When the dataset exceeds available memory, loading and sorting it in a single pass is impossible. The sorter needs a strategy that processes the data in memory-sized pieces while still producing one globally sorted result.
DuckDB's Three-Phase Approach
Chunk and Sort
Read data in chunks that fit in memory. Sort each chunk independently using an optimized in-memory algorithm — radix sort for fixed-width integer keys, comparison sort for variable-length strings. Each sorted chunk is called a run.
Spill to Disk
Write each sorted run to a temporary file. The buffer manager handles the I/O, writing data sequentially for maximum throughput. Each run on disk is already sorted, so reading it back only needs sequential scan — no random access.
K-Way Merge
Open all sorted runs simultaneously and merge them. A loser tree (tournament tree) efficiently picks the smallest element across all run heads. If too many runs exist, DuckDB merges in cascading passes — merge groups of runs into larger intermediate runs, then merge those.
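The three phases above can be sketched in a few dozen lines of Python. This is an illustrative toy, not DuckDB's implementation: it uses `pickle` files and `heapq.merge` where DuckDB uses its buffer manager, radix sort, and a loser tree.

```python
import heapq
import os
import pickle
import tempfile

def external_sort(records, chunk_size):
    """External merge sort: sort memory-sized chunks, spill them as
    sorted runs to temporary files, then k-way merge the runs."""
    run_paths = []

    def spill(chunk):
        chunk.sort()                          # Phase 1: in-memory sort of one run
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:        # Phase 2: sequential write to disk
            for rec in chunk:
                pickle.dump(rec, f)
        run_paths.append(path)

    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    def read_run(path):
        # Each run is already sorted, so reading it back is a pure
        # sequential scan -- no random access needed.
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        os.remove(path)

    # Phase 3: k-way merge across the heads of all runs.
    yield from heapq.merge(*(read_run(p) for p in run_paths))
```

With a real cascading merge, too many runs would first be merged in groups into larger intermediate runs; this sketch merges all runs in one pass.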
Columnar Sort Optimization
DuckDB does not sort the actual rows. Instead, it builds lightweight sort keys and reorders indices.
Row Layout for Fast Comparison
During sorting, DuckDB temporarily converts the sort key columns from columnar to row layout. This makes each comparison a single memcmp() call instead of column-by-column comparison.
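The idea can be demonstrated with a toy byte-normalized encoding in Python. This is a sketch only: a real encoding also handles NULLs, DESC order, floats, and strings longer than the fixed prefix, and the function name and widths here are invented for illustration.

```python
import struct

def normalize_key(a: int, b: str, str_width: int = 16) -> bytes:
    """Encode a two-column sort key (BIGINT, VARCHAR) as one byte string
    so that plain byte-wise comparison matches (a, b) tuple ordering."""
    # Bias the signed 64-bit integer into the unsigned range: big-endian
    # unsigned byte order then equals signed numeric order.
    int_bytes = struct.pack(">Q", a + (1 << 63))
    # Pad the string to a fixed width with zero bytes so shorter
    # strings sort first.
    str_bytes = b.encode()[:str_width].ljust(str_width, b"\x00")
    return int_bytes + str_bytes

rows = [(3, "apple"), (-1, "zebra"), (3, "ant"), (0, "mango")]
by_tuple = sorted(rows)                                  # column-by-column compare
by_bytes = sorted(rows, key=lambda r: normalize_key(*r)) # single byte-string compare
```

Because the encoded key is one contiguous byte string, comparing two rows is a single `memcmp()`-style operation over adjacent memory rather than several pointer-chasing column lookups.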
K-Way Merge with Loser Tree
Merging k sorted runs requires repeatedly finding the minimum across k run heads. A loser tree makes this O(log k) per element by maintaining a tournament bracket.
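A minimal loser tree can be written in Python as follows. This sketch assumes numeric elements (the `INF` sentinel marks an exhausted run); DuckDB's tree compares byte-encoded sort keys instead.

```python
INF = float("inf")

def loser_tree_merge(runs):
    """Merge k sorted runs with a loser tree: O(log k) per output element."""
    k = 1
    while k < len(runs):
        k *= 2                                    # pad k up to a power of two
    iters = [iter(r) for r in runs] + [iter(()) for _ in range(k - len(runs))]
    heads = [next(it, INF) for it in iters]       # current head of each run

    # tree[1..k-1] store the LOSING run index of each match;
    # tree[0] ends up holding the overall winner. Leaves sit implicitly
    # at positions k..2k-1 and map to run indices 0..k-1.
    tree = [0] * k

    def play(node):
        """Play all matches below `node`, store losers, return the winner."""
        if node >= k:
            return node - k                       # leaf: run index
        left, right = play(2 * node), play(2 * node + 1)
        if heads[left] <= heads[right]:
            tree[node] = right                    # right loses this match
            return left
        tree[node] = left                         # left loses this match
        return right

    winner = play(1)
    while heads[winner] is not INF:
        yield heads[winner]
        heads[winner] = next(iters[winner], INF)  # refill from the same run
        # Replay only the leaf-to-root path: O(log k) comparisons,
        # because every other match result is still valid.
        node = (winner + k) // 2
        while node >= 1:
            if heads[tree[node]] < heads[winner]:
                tree[node], winner = winner, tree[node]
            node //= 2
        tree[0] = winner
```

Storing losers rather than winners is what makes the replay cheap: the new element only has to play against the recorded loser at each level on its way up, and whoever survives at the root is the next output.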
Morsel-Driven Parallel Sort
DuckDB uses multiple threads throughout the sort pipeline. The sort phase is embarrassingly parallel; the merge phase partitions the key space across threads.
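The shape of this pipeline can be sketched with a thread pool: workers sort morsels independently, then the sorted runs are merged. This is a structural sketch only; in CPython the GIL limits real speedup for pure-Python comparisons, and DuckDB schedules morsels on its own task system and can additionally partition the merge by key range across threads.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(data, morsel_size=1024, workers=4):
    """Phase 1: each worker sorts one morsel independently
    (embarrassingly parallel). Phase 2: k-way merge of the runs."""
    morsels = [data[i:i + morsel_size] for i in range(0, len(data), morsel_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, morsels))    # one sorted run per morsel
    return list(heapq.merge(*runs))
```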
Frequently Asked Questions
How does DuckDB sort a dataset that is larger than available memory?
DuckDB uses external merge sort. It reads the data in memory-sized chunks, sorts each chunk in memory using radix sort or comparison sort, and writes the sorted chunks — now called runs — to temporary files on disk. Then it performs a multi-way merge: it reads the front of each sorted run simultaneously and repeatedly picks the smallest element across all runs using a tournament tree (loser tree). The merged output is either the final result or an intermediate run that gets merged again in the next pass.
Why does DuckDB convert columnar data to a row layout for sorting?
Columnar storage is efficient for scanning and filtering, but sorting requires comparing entire rows across multiple columns. Comparing column by column would require jumping between different column arrays for each comparison, causing cache misses. Instead, DuckDB creates a temporary row layout where all sort key columns are concatenated into a single byte-normalized representation. This lets the sort algorithm compare two rows with a single memcmp() call, which is extremely cache-friendly since all the comparison data is contiguous in memory.
What is a loser tree and why does DuckDB use it for merging?
A loser tree is a variant of a tournament tree used for k-way merging. In a normal tournament tree, each internal node stores the winner of its comparison. A loser tree instead stores the loser, which means the overall winner automatically propagates to the root. When the winner is consumed and replaced by the next element from its run, only the path from that leaf to the root needs updating — O(log k) comparisons per output element. This is more efficient than a naive approach of scanning all k run heads (O(k) per element) or using a heap (same O(log k) but with more pointer-chasing overhead).
How does DuckDB parallelize sorting?
DuckDB uses morsel-driven parallelism. Multiple worker threads each grab a morsel (chunk) of the input data, sort their morsel independently using the in-memory sort algorithm, and produce sorted runs. This first phase is embarrassingly parallel — no coordination needed. The merge phase then combines these sorted runs. DuckDB can also parallelize the merge by partitioning the key space and assigning different partitions to different threads, so multiple merge operations proceed simultaneously.
Does DuckDB physically move row data during sorting?
No. DuckDB avoids moving the actual row data during the sort. Instead, it creates a sort key array (binary-normalized representations of the key columns) paired with row indices. The sort algorithm rearranges these lightweight key-index pairs. Once sorted, DuckDB uses the sorted indices to permute (reorder) the original columns. This is critical because rows can contain large variable-length data like strings or blobs — moving those during each swap would be extremely expensive.
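This index-based approach can be shown with plain Python lists standing in for columns; the column names and payloads below are made up for illustration.

```python
# Columnar layout: each column is its own array.
ids   = [3, 1, 2]
names = ["carol", "alice", "bob"]
blobs = [b"large payload C", b"large payload A", b"large payload B"]

# Sort lightweight (key, row-index) pairs -- the blobs never move
# while the sort itself is running.
order = sorted(range(len(ids)), key=lambda i: ids[i])

# One final permutation pass reorders each column by the sorted indices.
ids   = [ids[i] for i in order]
names = [names[i] for i in order]
blobs = [blobs[i] for i in order]
```

Swapping 8-byte indices during the sort is far cheaper than swapping entire rows containing strings or blobs, and each column is then reordered in a single cache-friendly pass.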