Vectorized Execution Engine
Why Processing a Column of 2048 Values Crushes Row-at-a-Time Execution
The Problem with One Row at a Time
Traditional database engines use the Volcano model (also called the iterator model).
Every operator in the query plan has a next() method that returns exactly one row.
A filter calls next() on its child (a table scan), checks a predicate on that single row,
and if it passes, returns it to the parent operator. This design is elegant and composable, but
it has a devastating performance flaw: every single row incurs a virtual function call
at every operator boundary.
Scan 100 million rows through a plan with 5 operators, and you have executed 500 million function calls — most of which do almost no actual computation. The CPU spends more time dispatching calls and chasing pointers than doing useful work: branch prediction fails, instruction caches thrash, and data caches are used poorly because each row interleaves values of different types.
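The iterator interface described above can be sketched in a few lines of Python. This is a toy model, not DuckDB's actual API; the operator classes and the predicate are illustrative:

```python
class Scan:
    """Leaf operator: yields exactly one row per next() call."""
    def __init__(self, rows):
        self.rows = iter(rows)

    def next(self):
        return next(self.rows, None)  # None signals end of input


class Filter:
    """Pulls one row at a time from its child: one call per row."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None


# Every row incurs a next() call at every operator boundary.
plan = Filter(Scan([{"price": 50}, {"price": 150}, {"price": 200}]),
              lambda r: r["price"] > 100)
out = []
while (row := plan.next()) is not None:
    out.append(row)
# out == [{"price": 150}, {"price": 200}]
```

With N rows and K operators, the driver loop above triggers on the order of N × K calls, which is exactly the overhead the next section eliminates.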
Vectorized execution changes the unit of work from a single row to a vector
of values. Instead of calling next() 100 million times, you call next_chunk()
about 50,000 times — each call processes a vector of 2048 values in a tight loop that the CPU
can optimize with SIMD instructions and prefetching.
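By contrast, a vectorized scan's next_chunk() hands back a whole batch, and the caller runs one tight loop per call. A minimal sketch (names are illustrative; the per-call work shown as a Python sum would be a compiled SIMD kernel in a real engine):

```python
VECTOR_SIZE = 2048

class ChunkScan:
    """Leaf operator: returns up to VECTOR_SIZE values per call."""
    def __init__(self, values):
        self.values = values
        self.pos = 0

    def next_chunk(self):
        if self.pos >= len(self.values):
            return None  # end of input
        chunk = self.values[self.pos:self.pos + VECTOR_SIZE]
        self.pos += VECTOR_SIZE
        return chunk


scan = ChunkScan(list(range(100_000)))
calls = 0
total = 0
while (chunk := scan.next_chunk()) is not None:
    calls += 1
    # One tight loop per call: this is where SIMD applies in a real engine.
    total += sum(chunk)
# 100,000 values at 2048 per chunk -> 49 calls instead of 100,000
```

The call count scales with ceil(rows / 2048) rather than with rows, which is the entire point of the batching.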
Volcano vs Vectorized: Side by Side
Both models process the same rows: in the Volcano model, each row travels through every operator one at a time; in vectorized execution, a batch of 2048 values moves through each operator in a single call.
Why Exactly 2048 Values?
A vector of 2048 int64 values is 2048 x 8 = 16 KB.
Most modern CPUs have 32-64 KB of L1 data cache per core. A 16 KB vector fits comfortably,
leaving room for the operator's own state and temporary buffers. This means the entire
input vector stays in the fastest memory the CPU has — no cache misses during processing.
Each next_chunk() call has fixed overhead: function dispatch, null-check setup,
selection vector initialization. With 2048 values per call, this overhead is amortized across
the entire batch. Processing 100M rows requires only ~48,828 next_chunk() calls per
operator, compared to 100M next() calls per operator in Volcano.
Columnar Vectors and SIMD
Each vector is a contiguous array of a single data type. Not rows of mixed types — a flat
int64[], a flat varchar[], a flat double[]. This columnar layout
is critical because SIMD instructions require all elements to be the same type and contiguous in memory.
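The layout difference can be illustrated with Python's standard array module, which stores same-type values contiguously the way a column vector does. The SIMD itself happens in compiled kernels, not in Python; this sketch only shows the shape of the data:

```python
from array import array

# Row layout: heterogeneous tuples; the values of any one column
# are scattered across memory, interleaved with other types.
rows = [(1, "a", 1.5), (2, "b", 2.5), (3, "c", 3.5)]

# Columnar layout: one flat, contiguous, same-type array per column.
ids = array("q", [1, 2, 3])            # int64[] (8 bytes per element)
prices = array("d", [1.5, 2.5, 3.5])   # double[]

# A kernel over a flat array is a uniform, branch-free loop over
# same-size elements: exactly the shape a compiler maps to SIMD.
doubled = array("d", (p * 2.0 for p in prices))
```

In C++, the equivalent loop over a flat `int64_t*` or `double*` is what the compiler auto-vectorizes into instructions that process 4-8 elements at once.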
Selection Vectors: Filtering Without Copying
When a filter operator evaluates WHERE price > 100 on a vector, it does not copy
the matching rows into a new vector. Instead, it produces a selection vector — an
array of indices pointing to the rows that passed the predicate. Downstream operators read only
those indices, skipping filtered rows without any data movement.
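A selection vector can be modeled directly. This is a sketch of the principle, not DuckDB's implementation; the function names and the downstream projection are illustrative:

```python
def filter_select(values, predicate):
    """Return the indices of passing rows instead of copying the rows."""
    return [i for i, v in enumerate(values) if predicate(v)]


prices = [50, 150, 80, 200, 120]

# WHERE price > 100 produces indices, not a new data vector.
sel = filter_select(prices, lambda p: p > 100)   # [1, 3, 4]

# A downstream projection reads through the selection vector:
# the original vector is untouched and no data was moved.
projected = [prices[i] * 2 for i in sel]
```

Stacked filters can compose by filtering the selection vector itself, so even multi-predicate pipelines move no row data until a materializing operator needs it.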
Pipeline Breakers
Not every operator can produce output immediately after seeing one vector. Some (hash aggregates, sorts, the build side of a hash join) must consume all input before producing any output. These are pipeline breakers, and they create boundaries that split the query plan into separate execution pipelines.
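A hash aggregate is the canonical pipeline breaker: it sinks every input chunk before it can emit a single group. A toy model (the sink/finalize method names mirror the general sink-source pattern but are illustrative, not an exact API):

```python
class HashAggregate:
    """Pipeline breaker: consume all input, then produce output."""
    def __init__(self):
        self.sums = {}

    def sink(self, keys, values):
        # Called once per input chunk; produces no output yet.
        for k, v in zip(keys, values):
            self.sums[k] = self.sums.get(k, 0) + v

    def finalize(self):
        # Only after the last chunk arrives can results be emitted,
        # starting a new pipeline downstream of the breaker.
        return sorted(self.sums.items())


agg = HashAggregate()
agg.sink(["a", "b", "a"], [1, 2, 3])
agg.sink(["b", "c"], [4, 5])
result = agg.finalize()  # [("a", 4), ("b", 6), ("c", 5)]
```

Everything upstream of sink() forms one pipeline that streams vector by vector; everything downstream of finalize() forms another.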
Putting Numbers on It
A concrete example of the overhead difference when scanning 100 million rows through a 3-operator pipeline (Scan, Filter, Project).
| | Volcano | Vectorized (2048) |
|---|---|---|
| Rows to process | 100,000,000 | 100,000,000 |
| Calls per operator | 100,000,000 next() | 48,828 next_chunk() |
| Total function calls (3 ops) | 300,000,000 | 146,484 |
| Cache behavior | Random — each row may be a different layout | Sequential — flat array of same type in L1 |
| SIMD possible? | No — single value, scalar operations only | Yes — 4-8 values processed per instruction |
| Branch prediction | Per-row type checks and null checks | One check per vector, then tight branchless loop |
| Typical throughput | ~50-200M rows/sec | ~1-4B rows/sec |
Frequently Asked Questions
Does vectorized execution only help with numeric data?
No. While SIMD benefits are most obvious for fixed-width numeric types (int32, int64, double), vectorized execution also improves string processing. String vectors store offsets and lengths in contiguous arrays, so operations like prefix matching or length computation still benefit from sequential memory access and reduced function call overhead. The batching advantage (amortizing per-call costs across 2048 rows) applies regardless of data type.
What happens when a vector has NULL values?
Each vector in DuckDB carries a validity mask — a bitmask where each bit indicates whether the corresponding value is NULL. Operations check this bitmask before processing. For many operations (arithmetic, comparisons), the engine uses bitwise AND on validity masks to propagate NULLs without branching on individual values. This keeps the tight processing loop intact while correctly handling NULL semantics.
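NULL propagation via a bitwise AND of validity masks can be sketched with Python integers as bitmasks (bit i set means position i is non-NULL). This is an illustrative model of the idea, not DuckDB's ValidityMask API:

```python
def add_with_validity(a, a_valid, b, b_valid):
    """Add two vectors; a result is valid only where both inputs are.

    The sum loop never branches on NULLs: invalid slots simply hold
    values that downstream operators will ignore via the mask.
    """
    out_valid = a_valid & b_valid          # one AND, no per-row branching
    out = [x + y for x, y in zip(a, b)]    # tight branchless loop
    return out, out_valid


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
a_valid = 0b1011   # position 2 of a is NULL
b_valid = 0b1110   # position 0 of b is NULL
out, valid = add_with_validity(a, a_valid, b, b_valid)
# valid == 0b1010: only positions 1 and 3 are non-NULL
```

Computing into the NULL slots and masking afterward is cheaper than branching per value, because the arithmetic loop stays predictable and vectorizable.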
Is the vector size always exactly 2048?
The standard vector size is 2048, defined as a compile-time constant STANDARD_VECTOR_SIZE.
The last vector in a scan may contain fewer values (whatever remains). The value 2048 is a carefully
chosen default that balances L1 cache utilization, SIMD efficiency, and amortization of per-call overhead.
DuckDB can be compiled with a different vector size for experimentation, but 2048 has been found
to be near-optimal across a wide range of hardware and query patterns.
How does vectorized execution relate to columnar storage?
They are complementary but independent concepts. Columnar storage means data is stored on disk with all values of a single column together. Vectorized execution means the query engine processes batches of values from one column at a time. Columnar storage makes it efficient to read a single column from disk; vectorized execution makes it efficient to process that column in memory. DuckDB uses both, but you could have columnar storage with a Volcano-style engine (wasting the cache benefits) or vectorized execution over row-stored data (though you would pay to extract columns first).
Can vectorized engines handle complex expressions like CASE WHEN?
Yes. Complex expressions are decomposed into primitive vectorized operations. A CASE WHEN
is evaluated by first computing the condition for all 2048 values (producing a boolean vector),
then computing the THEN and ELSE branches, and finally using the boolean vector to select
the correct result for each position. This avoids per-row branching and keeps the processing
vectorized throughout. In practice the selection vector mechanism refines this further: each
branch is evaluated only for the rows where its condition holds, rather than for the whole vector.
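The evaluation strategy above can be modeled in a few lines. This toy version computes both branches for every position and then blends; a real engine would narrow each branch with selection vectors. The function and its arguments are illustrative:

```python
def case_when(values, cond, then_fn, else_fn):
    """Vectorized CASE WHEN: no per-row branching on the hot path."""
    mask = [cond(v) for v in values]        # condition for all positions
    then_v = [then_fn(v) for v in values]   # THEN branch for all positions
    else_v = [else_fn(v) for v in values]   # ELSE branch for all positions
    # Blend: pick per position using the boolean vector.
    return [t if m else e for m, t, e in zip(mask, then_v, else_v)]


# CASE WHEN x > 2 THEN x * 10 ELSE -x END
result = case_when([1, 2, 3, 4], lambda x: x > 2,
                   lambda x: x * 10, lambda x: -x)
# result == [-1, -2, 30, 40]
```

The blend step is itself a uniform per-element loop, so the whole expression stays vectorized from condition to final result.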