Push-Based Query Pipelines
Morsel-Driven Parallelism — How DuckDB Saturates Every Core
Pull vs Push: Two Ways to Move Data
In the Volcano (pull) model, the top operator in the query plan drives execution. It calls next() on its child, which calls next() on its child, all the way down to the table scan. Data flows upward, pulled through the tree by the consumer. In the classic tuple-at-a-time form, every next() call returns a single row and goes through a virtual function dispatch: the caller does not know the concrete type of its child operator, so the compiler cannot inline or optimize across operator boundaries, and the per-row call overhead adds up.
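A minimal sketch of the pull interface in C++; the class names and the Row layout are illustrative, not any real engine's types:

#include <cstddef>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Pull model: every operator exposes Next(), and the consumer drives execution.
struct Row { int id; double amount; };

struct Operator {
    virtual ~Operator() = default;
    // Returns the next row, or nothing once the input is exhausted.
    virtual std::optional<Row> Next() = 0;
};

struct Scan : Operator {
    std::vector<Row> rows;
    std::size_t pos = 0;
    explicit Scan(std::vector<Row> r) : rows(std::move(r)) {}
    std::optional<Row> Next() override {
        if (pos >= rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Filter : Operator {
    std::unique_ptr<Operator> child;
    explicit Filter(std::unique_ptr<Operator> c) : child(std::move(c)) {}
    std::optional<Row> Next() override {
        // One virtual call into the child for every row produced.
        while (auto row = child->Next()) {
            if (row->amount > 100.0) return row;
        }
        return std::nullopt;
    }
};

int main() {
    Filter root(std::make_unique<Scan>(std::vector<Row>{{1, 50.0}, {2, 150.0}}));
    while (auto row = root.Next()) {
        // consume each row pulled through the tree
    }
}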
DuckDB uses a push-based model instead. The source operator (table scan) drives execution. It reads a chunk of data and pushes it into the next operator, which processes it and pushes the result into the next, and so on until the data reaches a sink. Each push hands over a vector of roughly 2048 values rather than a single row, so per-call overhead (including virtual dispatch) is amortized across thousands of values, and the inner loop of each operator runs over contiguous data that the compiler can optimize as a tight loop.
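The same filter expressed in a push style, again with illustrative names rather than DuckDB's real interfaces; the source drives the loop and each operator hands a whole chunk to its successor:

#include <vector>

// Push model: the source drives execution and hands each chunk to the next
// operator in the pipeline.
using Chunk = std::vector<double>;   // stands in for a vector of ~2048 values

struct PushOperator {
    virtual ~PushOperator() = default;
    virtual void Push(const Chunk &chunk) = 0;
};

struct Sink : PushOperator {
    double total = 0;
    void Push(const Chunk &chunk) override {
        for (double v : chunk) total += v;   // e.g. a running SUM aggregate
    }
};

struct FilterOp : PushOperator {
    PushOperator &next;
    explicit FilterOp(PushOperator &n) : next(n) {}
    void Push(const Chunk &chunk) override {
        Chunk out;
        for (double v : chunk)
            if (v > 100.0) out.push_back(v);
        if (!out.empty()) next.Push(out);    // one call downstream per chunk, not per row
    }
};

int main() {
    Sink sink;
    FilterOp filter(sink);
    // The "table scan": the source reads chunks and pushes them downstream.
    for (const Chunk &chunk : {Chunk{50.0, 150.0, 300.0}, Chunk{10.0, 120.0}}) {
        filter.Push(chunk);
    }
}

Note that the filter makes one downstream call per chunk rather than per row, which is where the amortization described above comes from.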
The push model also enables a cleaner parallelism story. In pull-based systems, adding parallelism typically requires an exchange operator that redistributes rows between threads. In push-based systems, multiple threads can each push their own chunks of data through the same pipeline independently.
Query Plan to Pipelines
A query plan is split into pipelines at pipeline breakers. Each pipeline is a chain of operators that can execute without materializing intermediate results. The end-to-end example later in this article traces how a join query decomposes into separate pipelines.
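One way to picture the result of that split, using the join query traced later in this article; this is an illustrative representation, not DuckDB's internal data structures:

#include <string>
#include <vector>

// What splitting a plan at breakers produces: each pipeline is a
// source-to-sink chain of streaming operators plus the pipelines it must
// wait for.
struct Pipeline {
    std::vector<std::string> operators;   // source first, sink last
    std::vector<int> depends_on;          // pipelines that must finish before this one starts
};

// Decomposition of:
// SELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.id WHERE o.amount > 100
std::vector<Pipeline> pipelines = {
    // Pipeline 0: the build side of the hash join; its sink is the breaker.
    {{"SCAN customers", "HASH JOIN (build)"}, {}},
    // Pipeline 1: the probe side; it can only start once pipeline 0 has finished.
    {{"SCAN orders", "FILTER amount > 100", "HASH JOIN (probe)", "PROJECT o.id, c.name"}, {0}},
};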
Morsel-Driven Parallelism
The table is divided into morsels — contiguous chunks of roughly 100,000 rows. Each worker thread grabs a morsel and pushes it through the pipeline independently. No thread coordination or locking is needed during processing — each thread owns its morsel completely.
Work Stealing: Handling Skew
What happens when some morsels take longer than others? If one thread finishes its morsel early, it does not sit idle. It steals an unprocessed morsel from the shared queue. This naturally balances the workload — threads that finish fast do more work, threads on expensive morsels keep going without interruption.
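A minimal sketch of this scheduling loop, assuming a fixed 100,000-row morsel size and illustrative names rather than DuckDB's actual scheduler: each worker claims the next unprocessed morsel with a single atomic increment, so a thread that finishes early simply comes back and claims another, and skew balances itself.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Morsel-driven scheduling sketch: workers claim morsels from a shared counter.
constexpr std::uint64_t MORSEL_SIZE = 100000;   // ~100,000 rows per morsel

void RunPipeline(std::uint64_t table_rows, unsigned num_threads) {
    const std::uint64_t morsel_count = (table_rows + MORSEL_SIZE - 1) / MORSEL_SIZE;
    std::atomic<std::uint64_t> next_morsel{0};

    auto worker = [&] {
        while (true) {
            std::uint64_t m = next_morsel.fetch_add(1);   // claim the next unprocessed morsel
            if (m >= morsel_count) break;                 // nothing left to steal
            std::uint64_t begin = m * MORSEL_SIZE;
            std::uint64_t end = std::min(begin + MORSEL_SIZE, table_rows);
            // Push rows [begin, end) through the pipeline using only
            // thread-local state; no locks are taken in this loop.
            // A thread whose morsels are cheap simply returns sooner and
            // claims more of them, which is how skew evens out.
            (void)begin; (void)end;
        }
    };

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < num_threads; i++) threads.emplace_back(worker);
    for (auto &t : threads) t.join();
}

int main() { RunPipeline(1000000, 4); }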
Sink Operators
Every pipeline ends at a sink — the operator that consumes processed data. Sinks are the only point where threads may need to coordinate (e.g., merging partial hash tables). Understanding sinks is key to understanding where parallelism overhead actually occurs.
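A sketch of the sink pattern, assuming a simple GROUP BY COUNT(*) sink. The three-phase shape (per-thread sink, a combine step, then a finalize step) shows where coordination actually happens; the types and method signatures here are illustrative, not DuckDB's real sink interface.

#include <mutex>
#include <unordered_map>
#include <vector>

// The sink pattern: each thread feeds its own local state without locking,
// and coordination is confined to the combine step at the end of the pipeline.
struct LocalCountState {
    std::unordered_map<int, long> counts;   // per-thread partial GROUP BY COUNT(*)
};

struct CountSink {
    std::unordered_map<int, long> global_counts;
    std::mutex merge_lock;

    // Called many times per thread, once per chunk; touches only local state.
    void Sink(LocalCountState &local, const std::vector<int> &keys) {
        for (int k : keys) local.counts[k]++;
    }

    // Called once per thread when its morsels are exhausted; this is the only
    // point in the pipeline where threads synchronize.
    void Combine(LocalCountState &local) {
        std::lock_guard<std::mutex> guard(merge_lock);
        for (auto &[key, count] : local.counts) global_counts[key] += count;
    }

    // Called once after every thread has combined; results are now complete
    // and can feed the next pipeline or the client.
    void Finalize() {}
};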
End-to-End: A Join Query
Trace how SELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.id WHERE o.amount > 100 executes as two pipelines with morsel-driven parallelism.
Pipeline 1 (build): the customers table is divided into morsels. Each worker thread scans its morsel, hashes the c.id column, and inserts into a thread-local hash table. After all morsels are consumed, the thread-local tables merge into one global hash table. This is the pipeline breaker: the hash table must be complete before probing can begin.
Pipeline 2 (probe): the orders table is divided into morsels. Each worker thread (a) scans a morsel of orders, (b) filters rows where amount > 100 using a selection vector, (c) probes the hash table with o.cid to find matching customers, and (d) projects the final columns o.id and c.name. All four operators run in a single push, with no materialization between them.
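A condensed, single-threaded sketch of those two pipelines; the table layouts and data are made up for illustration, and in the real engine each scan loop runs morsel by morsel across worker threads:

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Single-threaded sketch of the two pipelines for the join query above.
struct Customer { int id; std::string name; };
struct Order { int id; int cid; double amount; };

int main() {
    std::vector<Customer> customers = {{1, "alice"}, {2, "bob"}};
    std::vector<Order> orders = {{10, 1, 250.0}, {11, 2, 40.0}, {12, 1, 130.0}};

    // Pipeline 1 (build): scan customers and hash c.id -> c.name.
    // The finished hash table is the pipeline breaker.
    std::unordered_map<int, std::string> ht;
    for (const Customer &c : customers) {
        ht.emplace(c.id, c.name);
    }

    // Pipeline 2 (probe): scan -> filter -> probe -> project in one pass,
    // with no intermediate materialization.
    std::vector<std::pair<int, std::string>> result;
    for (const Order &o : orders) {
        if (!(o.amount > 100.0)) continue;       // FILTER amount > 100
        auto hit = ht.find(o.cid);               // PROBE the hash table on o.cid
        if (hit == ht.end()) continue;
        result.emplace_back(o.id, hit->second);  // PROJECT o.id, c.name
    }
    // result now holds {10, "alice"} and {12, "alice"}.
}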
Frequently Asked Questions
What is the difference between a morsel and a vector?
A morsel is a scheduling unit — roughly 100,000 rows assigned to a worker thread. A vector is a processing unit — 2048 values processed by a single operator call. Within one morsel, a thread processes many vectors (about 50 vectors per morsel). The morsel determines how work is divided between threads; the vector determines how work is executed within a thread. Think of morsels as the parallelism granularity and vectors as the execution granularity.
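The arithmetic behind those numbers, as a tiny sketch using the approximate 100,000-row morsel size assumed throughout this article:

#include <cstdio>

// "About 50 vectors per morsel": a ~100,000-row morsel is processed in
// vectors of 2048 values.
int main() {
    const int morsel_rows = 100000;    // scheduling unit, assigned to one thread
    const int vector_size = 2048;      // execution unit, one operator call
    const int vectors_per_morsel = (morsel_rows + vector_size - 1) / vector_size;
    std::printf("%d vectors per morsel\n", vectors_per_morsel);   // prints 49
}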
Does DuckDB actually avoid all locks during pipeline execution?
During the main pipeline execution phase, yes — each thread operates on its own morsel and its own thread-local state. No shared data structures are accessed. Coordination only happens at pipeline boundaries: when merging thread-local hash tables, combining partial aggregates, or merging sorted runs. Even this coordination is minimized — for example, hash table merge uses partitioned combining where each thread owns a slice of the final table, avoiding contention.
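A sketch of the partitioned-combining idea, assuming a simple count aggregate: each thread's local table is pre-split into hash partitions, and during the merge each thread owns exactly one partition of the global table. This is illustrative code for the pattern, not DuckDB's actual radix-partitioned implementation.

#include <cstddef>
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

// Partitioned combining: partitions are disjoint by construction, so no two
// merging threads ever touch the same entry and no locks are needed.
constexpr std::size_t PARTITIONS = 4;

// One hash table per partition; index p holds keys with PartitionOf(key) == p.
using Partitioned = std::vector<std::unordered_map<int, long>>;

std::size_t PartitionOf(int key) { return std::hash<int>{}(key) % PARTITIONS; }

void Combine(std::vector<Partitioned> &locals, Partitioned &global) {
    global = Partitioned(PARTITIONS);
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < PARTITIONS; p++) {
        // Thread p merges partition p from every thread-local table.
        workers.emplace_back([&locals, &global, p] {
            for (auto &local : locals)
                for (auto &[key, count] : local[p])
                    global[p][key] += count;
        });
    }
    for (auto &w : workers) w.join();
}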
How does DuckDB decide the morsel size?
The morsel size is not a single fixed constant; it is tuned to balance two concerns. Too small (say, 1,000 rows) and scheduling overhead dominates: grabbing the next morsel from the queue, initializing thread-local state, and combining results happen too often relative to the actual work. Too large (say, 10 million rows) and load balancing suffers: one slow morsel holds up a thread while the others sit idle. Around 100,000 rows provides enough work per morsel (~50 vector iterations) to amortize scheduling costs while keeping load balancing fine-grained.
Can push-based execution handle queries with multiple joins?
Yes. Each join adds one pipeline breaker (the build side) and one additional pipeline. A query with two joins creates three pipelines: build hash table A, build hash table B, then scan the driving table and probe both hash tables. The optimizer determines the join order to minimize the cost of pipeline breakers (smaller tables are built first). Each pipeline independently uses morsel-driven parallelism, so the cost of additional pipelines is primarily the time to build each hash table — the probe pipeline processes all joins in a single pass.
How does this compare to ClickHouse's execution model?
ClickHouse also uses push-based execution through its Processor framework, combined with vectorized processing. The key difference is parallelism granularity. ClickHouse parallelizes across data parts (MergeTree table segments) and uses a DAG of processors with port-based communication. DuckDB parallelizes across morsels within a single table and uses a simpler linear pipeline model. ClickHouse's model also targets distributed multi-server deployment; DuckDB's morsel model is optimized for single-machine multi-core execution, where work stealing provides dynamic load balancing without network overhead.