Push-Based Query Pipelines

Morsel-Driven Parallelism — How DuckDB Saturates Every Core

Pull vs Push: Two Ways to Move Data

In the Volcano (pull) model, the top operator in the query plan drives execution. It calls next() on its child, which calls next() on its child, all the way down to the table scan. Data flows upward, pulled through the tree by the consumer. Each call is a virtual function dispatch — the caller does not know the concrete type of the child operator, so the CPU cannot inline or optimize across operator boundaries.

DuckDB uses a push-based model instead. The source operator (table scan) drives execution: it reads a chunk of data and pushes it into the next operator, which processes it and pushes the result onward until the data reaches a sink. Because the full pipeline is known before execution begins, the engine can schedule it explicitly and amortize dispatch overhead over vector-sized chunks instead of paying a virtual call per row.
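The push model can be contrasted with pull in a few lines of Python. This is a toy sketch, not DuckDB's actual API: chunks are plain lists and operators are simple callables. The point is the control flow: the source loop drives execution and each operator hands its output directly to the next.

```python
# Minimal push-based pipeline sketch (illustrative, not DuckDB's API).
# The source drives execution: each chunk is pushed through every
# operator in turn, and the final output lands in the sink.

def make_pipeline(source_chunks, operators, sink):
    """Push every chunk from the source through the operator chain."""
    for chunk in source_chunks:
        for op in operators:
            chunk = op(chunk)          # push into the next operator
        sink.extend(chunk)             # pipeline ends at the sink
    return sink

# Example: scan -> filter(x > 10) -> project(x * 2) -> sink
chunks = [[5, 12, 7], [20, 3, 15]]
result = make_pipeline(
    chunks,
    [lambda c: [x for x in c if x > 10],   # filter
     lambda c: [x * 2 for x in c]],        # project
    [],
)
print(result)  # [24, 40, 30]
```

In a pull model the same query would invert this loop: the sink would call `next()` on the projection, which calls `next()` on the filter, which calls `next()` on the scan.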

The push model also enables a cleaner parallelism story. In pull-based systems, adding parallelism requires an exchange operator that redistributes rows between threads. In push-based systems, you simply have multiple threads each pushing their own chunk of data through the same pipeline independently.

Query Plan to Pipelines

A query plan is split into pipelines at pipeline breakers. Each pipeline is a chain of operators that can execute without materializing intermediate results.

A join query, for example, decomposes into three pipelines: Pipeline 1 (Build), Pipeline 2 (Probe), and Pipeline 3 (Output), separated by pipeline breakers.

Morsel-Driven Parallelism

The table is divided into morsels: contiguous chunks of roughly 100,000 rows. Each worker thread grabs a morsel and pushes it through the pipeline independently. No thread coordination or locking is needed during processing, because each thread owns its morsel completely.
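A sketch of the division step, assuming a thread pool of four workers and the ~100,000-row morsel size from the text. The chunking scheme and `process_morsel` body are illustrative stand-ins; DuckDB's scheduler is more sophisticated.

```python
# Divide a table into morsels and hand them to worker threads.
from concurrent.futures import ThreadPoolExecutor

MORSEL_SIZE = 100_000          # approximate size from the text

def morsels(num_rows, size=MORSEL_SIZE):
    """Yield (start, end) row ranges covering the table."""
    for start in range(0, num_rows, size):
        yield (start, min(start + size, num_rows))

def process_morsel(rng):
    start, end = rng
    # Each thread owns its morsel completely: no locks, no shared state.
    return end - start         # stand-in for "rows processed"

with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(process_morsel, morsels(350_000)))

print(processed)  # [100000, 100000, 100000, 50000]
```

A 350,000-row table yields three full morsels and one 50,000-row remainder; each is processed independently.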


Work Stealing: Handling Skew

What happens when some morsels take longer than others? If one thread finishes its morsel early, it does not sit idle. It steals an unprocessed morsel from the shared queue. This naturally balances the workload — threads that finish fast do more work, threads on expensive morsels keep going without interruption.
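The shared-queue mechanism can be sketched directly: a thread that finishes early immediately pulls the next unprocessed morsel rather than idling. The morsel "costs", the sleep-based simulation, and the two-thread setup are illustrative.

```python
# Work stealing from a shared morsel queue (illustrative sketch).
import queue
import threading
import time

morsel_queue = queue.Queue()
for cost in [5, 1, 1, 1, 1, 1]:    # one expensive morsel, five cheap ones
    morsel_queue.put(cost)

done_by = {0: [], 1: []}           # which thread processed which morsels

def worker(tid):
    while True:
        try:
            cost = morsel_queue.get_nowait()   # grab the next morsel
        except queue.Empty:
            return                             # no work left: thread exits
        time.sleep(cost * 0.005)               # simulate processing time
        done_by[tid].append(cost)

threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All six morsels are processed exactly once, split across both threads:
# whichever thread draws the expensive morsel keeps going while the other
# absorbs the cheap ones.
print(sorted(done_by[0] + done_by[1]))  # [1, 1, 1, 1, 1, 5]
```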


Sink Operators

Every pipeline ends at a sink — the operator that consumes processed data. Sinks are the only point where threads may need to coordinate (e.g., merging partial hash tables). Understanding sinks is key to understanding where parallelism overhead actually occurs.

Hash Table Build
Each thread builds a thread-local hash table from its morsels. After all morsels are consumed, the thread-local tables are merged into a single global hash table. The merge phase is the only sequential part — the build itself is embarrassingly parallel.
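A minimal sketch of this build-then-merge shape, using Python dicts as hash tables. The table contents and two-morsel split are made up for illustration; DuckDB's actual merge is partitioned rather than a sequential `update` loop.

```python
# Build sink sketch: thread-local hash tables, merged once at the end.
from concurrent.futures import ThreadPoolExecutor

customers = [(i, f"name{i}") for i in range(8)]
morsels = [customers[0:4], customers[4:8]]   # two morsels, two threads

def build_local(morsel):
    local = {}
    for cid, name in morsel:     # no locking: the table is thread-local
        local[cid] = name
    return local

with ThreadPoolExecutor(max_workers=2) as pool:
    local_tables = list(pool.map(build_local, morsels))

global_ht = {}
for local in local_tables:       # merge phase: the only sequential part
    global_ht.update(local)

print(len(global_ht))  # 8
```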
Sort Accumulator
Each thread sorts its morsels locally. The final merge step combines sorted runs into a single sorted output. DuckDB uses a merge-sort approach where the number of sorted runs equals the number of worker threads, making the merge efficient.
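The same shape in miniature, with one sorted run per morsel and the standard library's k-way merge standing in for the final merge step. Data and run counts are illustrative.

```python
# Sort sink sketch: local sorts in parallel, then a k-way merge of runs.
import heapq
from concurrent.futures import ThreadPoolExecutor

morsels = [[9, 2, 7], [4, 8, 1], [6, 3, 5]]

with ThreadPoolExecutor(max_workers=3) as pool:
    runs = list(pool.map(sorted, morsels))   # one sorted run per thread

merged = list(heapq.merge(*runs))            # k-way merge of the runs
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```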
Aggregate Combiner
Thread-local partial aggregates (per-thread hash tables with running SUM, COUNT, etc.) are combined at pipeline completion. For commutative aggregates, this is a simple merge. Each thread's partial state is independent, so no locking occurs during the aggregation phase.
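For a commutative aggregate like SUM, combining partial states is just adding them together. The sketch below uses `Counter` as a stand-in for a per-thread aggregate hash table; the groups and values are illustrative.

```python
# Aggregate combiner sketch: per-thread partial sums, merged at the end.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

morsels = [[("a", 10), ("b", 5)], [("a", 3), ("c", 7)]]

def partial_sum(morsel):
    part = Counter()
    for key, val in morsel:      # thread-local running sums: no locks
        part[key] += val
    return part

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_sum, morsels))

total = Counter()
for p in partials:               # combine at pipeline completion
    total.update(p)              # Counter.update adds counts together

print(dict(total))  # {'a': 13, 'b': 5, 'c': 7}
```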
Result Materializer
The final pipeline's sink writes results into a result collection that the client reads. For simple queries without ORDER BY, result chunks can be emitted as soon as each thread completes a morsel — true streaming output with no buffering.

End-to-End: A Join Query

Trace how SELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.id WHERE o.amount > 100 executes as two pipelines with morsel-driven parallelism.

1. Pipeline 1: Build the hash table
The customers table is divided into morsels. Each worker thread scans its morsel, hashes the c.id column, and inserts into a thread-local hash table. After all morsels are consumed, thread-local tables merge into one global hash table. Pipeline breaker: the hash table must be complete before probing can begin.
2. Pipeline 2: Scan, filter, probe, project
The orders table is divided into morsels. Each worker thread: (a) scans a morsel of orders, (b) filters rows where amount > 100 using a selection vector, (c) probes the hash table with o.cid to find matching customers, (d) projects the final columns o.id and c.name. All four operators run in a single push — no materialization between them.
3. Result streaming
As each thread completes a morsel through Pipeline 2, the result chunk is appended to the output collection. The client can start reading results before the entire query completes. No global sort means no final pipeline breaker — results stream out as fast as threads can produce them.
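The two pipelines above condense into a short single-threaded sketch. The tables and values are made up for illustration; in the real engine both pipelines run morsel-parallel and the filter uses a selection vector rather than a Python `if`.

```python
# Two-pipeline hash join sketch for:
#   SELECT o.id, c.name FROM orders o JOIN customers c
#   ON o.cid = c.id WHERE o.amount > 100
customers = [(1, "Ada"), (2, "Bob"), (3, "Cyd")]
orders = [(101, 1, 250), (102, 2, 50), (103, 3, 400)]  # (id, cid, amount)

# Pipeline 1 (build): hash customers on c.id. This is the pipeline
# breaker: the table must be complete before probing starts.
ht = {cid: name for cid, name in customers}

# Pipeline 2 (probe): scan -> filter -> probe -> project in one pass.
result = []
for oid, cid, amount in orders:
    if amount > 100:                     # filter: amount > 100
        name = ht.get(cid)               # probe the hash table on o.cid
        if name is not None:
            result.append((oid, name))   # project o.id, c.name

print(result)  # [(101, 'Ada'), (103, 'Cyd')]
```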

Frequently Asked Questions

What is the difference between a morsel and a vector?

A morsel is a scheduling unit — roughly 100,000 rows assigned to a worker thread. A vector is a processing unit — 2048 values processed by a single operator call. Within one morsel, a thread processes many vectors (about 50 vectors per morsel). The morsel determines how work is divided between threads; the vector determines how work is executed within a thread. Think of morsels as the parallelism granularity and vectors as the execution granularity.
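Using the approximate sizes from this answer, the two granularities relate by a simple ratio:

```python
# Morsels (scheduling granularity) vs vectors (execution granularity),
# using the approximate sizes quoted in the text.
MORSEL_SIZE = 100_000
VECTOR_SIZE = 2048

# Ceiling division: a thread processes this many vectors per morsel.
vectors_per_morsel = -(-MORSEL_SIZE // VECTOR_SIZE)
print(vectors_per_morsel)  # 49 (roughly 50, as the text says)
```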

Does DuckDB actually avoid all locks during pipeline execution?

During the main pipeline execution phase, yes — each thread operates on its own morsel and its own thread-local state. No shared data structures are accessed. Coordination only happens at pipeline boundaries: when merging thread-local hash tables, combining partial aggregates, or merging sorted runs. Even this coordination is minimized — for example, hash table merge uses partitioned combining where each thread owns a slice of the final table, avoiding contention.

How does DuckDB decide the morsel size?

The morsel size is not a single fixed constant; it is tuned to balance two concerns. Too small (say, 1,000 rows) and thread-scheduling overhead dominates: grabbing the next morsel from the queue, initializing thread-local state, and combining results happen too often relative to actual work. Too large (say, 10 million rows) and load balancing suffers: a slow morsel holds up one thread while others sit idle. Around 100,000 rows provides enough work per morsel (~50 vector iterations) to amortize scheduling costs while maintaining fine-grained load balancing.

Can push-based execution handle queries with multiple joins?

Yes. Each join adds one pipeline breaker (the build side) and one additional pipeline. A query with two joins creates three pipelines: build hash table A, build hash table B, then scan the driving table and probe both hash tables. The optimizer determines the join order to minimize the cost of pipeline breakers (smaller tables are built first). Each pipeline independently uses morsel-driven parallelism, so the cost of additional pipelines is primarily the time to build each hash table — the probe pipeline processes all joins in a single pass.
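The "probe both tables in a single pass" point can be made concrete with a toy two-join probe. The table names and data are invented for illustration; both hash tables are assumed already built by their own pipelines.

```python
# Probe pipeline for a two-join query: both hash tables were built first
# (two pipeline breakers), then one pass over the driving table probes
# both of them.
parts = {1: "bolt", 2: "nut"}                # hash table A (already built)
suppliers = {10: "Acme", 11: "Zenith"}       # hash table B (already built)
lineitems = [(1, 10), (2, 11), (3, 10)]      # (part_id, supplier_id)

result = []
for pid, sid in lineitems:                   # single pass probes both
    pname = parts.get(pid)
    sname = suppliers.get(sid)
    if pname is not None and sname is not None:
        result.append((pname, sname))

print(result)  # [('bolt', 'Acme'), ('nut', 'Zenith')]
```

Part 3 has no match in `parts`, so it drops out of the result, just as an inner join would discard it.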

How does this compare to ClickHouse's execution model?

ClickHouse also uses push-based execution (via its Processor-based pipeline engine) with vectorized processing. The key difference is parallelism granularity. ClickHouse parallelizes across data parts (MergeTree table segments) and uses a DAG of processors with port-based communication. DuckDB parallelizes across morsels within a single table and uses a simpler linear pipeline model. ClickHouse's model is designed for distributed multi-server deployment; DuckDB's morsel model is optimized for single-machine multi-core execution, where work stealing provides dynamic load balancing without network overhead.

Original educational content. Inspired by database systems research — no slides or course material reproduced.