Sorting Tables Larger Than Memory
How DuckDB Uses External Merge Sort to Handle Massive Datasets
Sorting is one of the most fundamental database operations — ORDER BY, GROUP BY, merge joins, and window functions all depend on it. But what happens when you need to sort 10 GB of data with only 4 GB of RAM? You cannot load everything at once. DuckDB solves this with external merge sort: sort memory-sized chunks independently, write sorted runs to disk, then merge them together using a k-way merge with a tournament tree. Combined with columnar optimizations that avoid moving actual row data, DuckDB sorts datasets many times larger than available memory without falling off a performance cliff.
The Problem: Memory vs Data Size
When the dataset exceeds available memory, loading and sorting it in a single pass is impossible. The sorter needs a strategy that processes the data in memory-sized pieces while still producing one globally sorted result.
DuckDB's Three-Phase Approach
Chunk and Sort
Read data in chunks that fit in memory. Sort each chunk independently using an optimized in-memory algorithm — radix sort for fixed-width integer keys, comparison sort for variable-length strings. Each sorted chunk is called a run.
Spill to Disk
Write each sorted run to a temporary file. The buffer manager handles the I/O, writing data sequentially for maximum throughput. Each run on disk is already sorted, so reading it back only needs sequential scan — no random access.
K-Way Merge
Open all sorted runs simultaneously and merge them. A loser tree (tournament tree) efficiently picks the smallest element across all run heads. If too many runs exist, DuckDB merges in cascading passes — merge groups of runs into larger intermediate runs, then merge those.
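The three phases above can be sketched in a few dozen lines of Python. This is an illustrative toy, not DuckDB's implementation: it uses `pickle` files and `heapq.merge` where DuckDB uses its buffer manager, radix sort, and a loser tree.

```python
import heapq
import os
import pickle
import tempfile

def external_sort(records, chunk_size):
    """External merge sort: sort memory-sized chunks, spill them as
    sorted runs to temporary files, then k-way merge the runs."""
    run_paths = []

    def spill(chunk):
        chunk.sort()                          # Phase 1: in-memory sort of one run
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:        # Phase 2: sequential write to disk
            for rec in chunk:
                pickle.dump(rec, f)
        run_paths.append(path)

    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    def read_run(path):
        # Each run is already sorted, so reading it back is a pure
        # sequential scan -- no random access needed.
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        os.remove(path)

    # Phase 3: k-way merge across the heads of all runs.
    yield from heapq.merge(*(read_run(p) for p in run_paths))
```

With a real cascading merge, too many runs would first be merged in groups into larger intermediate runs; this sketch merges all runs in one pass.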
Columnar Sort Optimization
DuckDB does not sort the actual rows. Instead, it builds lightweight sort keys and reorders indices.
Row Layout for Fast Comparison
During sorting, DuckDB temporarily converts the sort key columns from columnar to row layout. This makes each comparison a single memcmp() call instead of column-by-column comparison.
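The idea can be demonstrated with a toy byte-normalized encoding in Python. This is a sketch only: a real encoding also handles NULLs, DESC order, floats, and strings longer than the fixed prefix, and the function name and widths here are invented for illustration.

```python
import struct

def normalize_key(a: int, b: str, str_width: int = 16) -> bytes:
    """Encode a two-column sort key (BIGINT, VARCHAR) as one byte string
    so that plain byte-wise comparison matches (a, b) tuple ordering."""
    # Bias the signed 64-bit integer into the unsigned range: big-endian
    # unsigned byte order then equals signed numeric order.
    int_bytes = struct.pack(">Q", a + (1 << 63))
    # Pad the string to a fixed width with zero bytes so shorter
    # strings sort first.
    str_bytes = b.encode()[:str_width].ljust(str_width, b"\x00")
    return int_bytes + str_bytes

rows = [(3, "apple"), (-1, "zebra"), (3, "ant"), (0, "mango")]
by_tuple = sorted(rows)                                  # column-by-column compare
by_bytes = sorted(rows, key=lambda r: normalize_key(*r)) # single byte-string compare
```

Because the encoded key is one contiguous byte string, comparing two rows is a single `memcmp()`-style operation over adjacent memory rather than several pointer-chasing column lookups.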
K-Way Merge with Loser Tree
Merging k sorted runs requires repeatedly finding the minimum across k run heads. A loser tree makes this O(log k) per element by maintaining a tournament bracket.
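A minimal loser tree can be written in Python as follows. This sketch assumes numeric elements (the `INF` sentinel marks an exhausted run); DuckDB's tree compares byte-encoded sort keys instead.

```python
INF = float("inf")

def loser_tree_merge(runs):
    """Merge k sorted runs with a loser tree: O(log k) per output element."""
    k = 1
    while k < len(runs):
        k *= 2                                    # pad k up to a power of two
    iters = [iter(r) for r in runs] + [iter(()) for _ in range(k - len(runs))]
    heads = [next(it, INF) for it in iters]       # current head of each run

    # tree[1..k-1] store the LOSING run index of each match;
    # tree[0] ends up holding the overall winner. Leaves sit implicitly
    # at positions k..2k-1 and map to run indices 0..k-1.
    tree = [0] * k

    def play(node):
        """Play all matches below `node`, store losers, return the winner."""
        if node >= k:
            return node - k                       # leaf: run index
        left, right = play(2 * node), play(2 * node + 1)
        if heads[left] <= heads[right]:
            tree[node] = right                    # right loses this match
            return left
        tree[node] = left                         # left loses this match
        return right

    winner = play(1)
    while heads[winner] is not INF:
        yield heads[winner]
        heads[winner] = next(iters[winner], INF)  # refill from the same run
        # Replay only the leaf-to-root path: O(log k) comparisons,
        # because every other match result is still valid.
        node = (winner + k) // 2
        while node >= 1:
            if heads[tree[node]] < heads[winner]:
                tree[node], winner = winner, tree[node]
            node //= 2
        tree[0] = winner
```

Storing losers rather than winners is what makes the replay cheap: the new element only has to play against the recorded loser at each level on its way up, and whoever survives at the root is the next output.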
Morsel-Driven Parallel Sort
DuckDB uses multiple threads throughout the sort pipeline. The sort phase is embarrassingly parallel; the merge phase partitions the key space across threads.
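The shape of this pipeline can be sketched with a thread pool: workers sort morsels independently, then the sorted runs are merged. This is a structural sketch only; in CPython the GIL limits real speedup for pure-Python comparisons, and DuckDB schedules morsels on its own task system and can additionally partition the merge by key range across threads.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(data, morsel_size=1024, workers=4):
    """Phase 1: each worker sorts one morsel independently
    (embarrassingly parallel). Phase 2: k-way merge of the runs."""
    morsels = [data[i:i + morsel_size] for i in range(0, len(data), morsel_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, morsels))    # one sorted run per morsel
    return list(heapq.merge(*runs))
```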
Frequently Asked Questions
How does DuckDB sort a dataset that is larger than available memory?
DuckDB uses external merge sort. It reads the data in memory-sized chunks, sorts each chunk in memory using radix sort or comparison sort, and writes the sorted chunks — now called runs — to temporary files on disk. Then it performs a multi-way merge: it reads the front of each sorted run simultaneously and repeatedly picks the smallest element across all runs using a tournament tree (loser tree). The merged output is either the final result or an intermediate run that gets merged again in the next pass.
Why does DuckDB convert columnar data to a row layout for sorting?
Columnar storage is efficient for scanning and filtering, but sorting requires comparing entire rows across multiple columns. Comparing column by column would require jumping between different column arrays for each comparison, causing cache misses. Instead, DuckDB creates a temporary row layout where all sort key columns are concatenated into a single byte-normalized representation. This lets the sort algorithm compare two rows with a single memcmp() call, which is extremely cache-friendly since all the comparison data is contiguous in memory.
What is a loser tree and why does DuckDB use it for merging?
A loser tree is a variant of a tournament tree used for k-way merging. In a normal tournament tree, each internal node stores the winner of its comparison. A loser tree instead stores the loser, which means the overall winner automatically propagates to the root. When the winner is consumed and replaced by the next element from its run, only the path from that leaf to the root needs updating — O(log k) comparisons per output element. This is more efficient than a naive approach of scanning all k run heads (O(k) per element) or using a heap (same O(log k) but with more pointer-chasing overhead).
How does DuckDB parallelize sorting?
DuckDB uses morsel-driven parallelism. Multiple worker threads each grab a morsel (chunk) of the input data, sort their morsel independently using the in-memory sort algorithm, and produce sorted runs. This first phase is embarrassingly parallel — no coordination needed. The merge phase then combines these sorted runs. DuckDB can also parallelize the merge by partitioning the key space and assigning different partitions to different threads, so multiple merge operations proceed simultaneously.
Does DuckDB physically move row data during sorting?
No. DuckDB avoids moving the actual row data during the sort. Instead, it creates a sort key array (binary-normalized representations of the key columns) paired with row indices. The sort algorithm rearranges these lightweight key-index pairs. Once sorted, DuckDB uses the sorted indices to permute (reorder) the original columns. This is critical because rows can contain large variable-length data like strings or blobs — moving those during each swap would be extremely expensive.
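This index-based approach can be shown with plain Python lists standing in for columns; the column names and payloads below are made up for illustration.

```python
# Columnar layout: each column is its own array.
ids   = [3, 1, 2]
names = ["carol", "alice", "bob"]
blobs = [b"large payload C", b"large payload A", b"large payload B"]

# Sort lightweight (key, row-index) pairs -- the blobs never move
# while the sort itself is running.
order = sorted(range(len(ids)), key=lambda i: ids[i])

# One final permutation pass reorders each column by the sorted indices.
ids   = [ids[i] for i in order]
names = [names[i] for i in order]
blobs = [blobs[i] for i in order]
```

Swapping 8-byte indices during the sort is far cheaper than swapping entire rows containing strings or blobs, and each column is then reordered in a single cache-friendly pass.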