Vectorized Execution Engine

Why Processing a Column of 2048 Values Crushes Row-at-a-Time

The Problem with One Row at a Time

Traditional database engines use the Volcano model (also called the iterator model). Every operator in the query plan has a next() method that returns exactly one row. A filter calls next() on its child (a table scan), checks a predicate on that single row, and if it passes, returns it to the parent operator. This design is elegant and composable, but it has a devastating performance flaw: every single row incurs a virtual function call at every operator boundary.

Scan 100 million rows through a plan with 5 operators, and you have executed 500 million function calls — most of which do almost no actual computation. The CPU spends more time dispatching calls and chasing pointers than doing useful work. Branch prediction fails, instruction caches thrash, and data caches are used poorly because each row interleaves values of different types.

Vectorized execution changes the unit of work from a single row to a vector of values. Instead of calling next() 100 million times, you call next_chunk() about 50,000 times — each call processes a vector of 2048 values in a tight loop that the CPU can optimize with SIMD instructions and prefetching.
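
The contrast can be sketched in a few lines of Python. This is an illustrative model, not DuckDB's actual API: the operator classes and the VECTOR_SIZE constant are stand-ins, and a method call here stands in for the virtual dispatch a real engine pays.

```python
VECTOR_SIZE = 2048

class VolcanoScan:
    """Volcano model: one row per next() call."""
    def __init__(self, rows):
        self.rows = iter(rows)
    def next(self):
        # one call (one virtual dispatch in a real engine) per row
        return next(self.rows, None)

class VolcanoFilter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def next(self):
        # keep pulling single rows until one passes the predicate
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

class VectorizedScan:
    """Vectorized model: up to VECTOR_SIZE values per next_chunk() call."""
    def __init__(self, rows):
        self.rows, self.pos = rows, 0
    def next_chunk(self):
        chunk = self.rows[self.pos:self.pos + VECTOR_SIZE]
        self.pos += len(chunk)
        return chunk or None

class VectorizedFilter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def next_chunk(self):
        # evaluate the predicate over a whole chunk in one tight loop
        while (chunk := self.child.next_chunk()) is not None:
            passed = [v for v in chunk if self.predicate(v)]
            if passed:
                return passed
        return None

rows = list(range(10_000))

vec_op = VectorizedFilter(VectorizedScan(rows), lambda v: v % 2 == 0)
chunk_calls, vec_total = 0, 0
while (chunk := vec_op.next_chunk()) is not None:
    chunk_calls += 1
    vec_total += len(chunk)

vol_op = VolcanoFilter(VolcanoScan(rows), lambda v: v % 2 == 0)
row_calls, vol_total = 0, 0
while vol_op.next() is not None:
    row_calls += 1
    vol_total += 1
# same 5,000 output rows, but 5 top-level next_chunk() calls
# versus 5,000 top-level next() calls
```

Scaling this toy example from 10,000 rows to 100 million is what turns the call-count gap into the throughput gap described below.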

Volcano vs Vectorized: Side by Side

Both models process the same rows. Volcano moves one row at a time through each operator; vectorized execution moves a batch of 2048 through each operator in one call.

Why Exactly 2048 Values?

L1 Cache Fit
A vector of 2048 int64 values is 2048 x 8 = 16 KB. Most modern CPUs have 32-64 KB of L1 data cache per core. A 16 KB vector fits comfortably, leaving room for the operator's own state and temporary buffers. This means the entire input vector stays in the fastest memory the CPU has — no cache misses during processing.
SIMD Sweet Spot
With 2048 values and AVX-512 (which processes 8 int64s per instruction), the CPU executes only 256 SIMD instructions to process the entire vector. That is enough iterations for the CPU's out-of-order execution to stay busy, but few enough to avoid L2 cache spills. Smaller batches (64, 128) leave SIMD pipelines starved; larger batches (8192+) evict the vector from L1.
Amortized Overhead
Each next_chunk() call has fixed overhead: function dispatch, null-check setup, selection vector initialization. With 2048 values per call, this overhead is amortized across the entire batch. Processing 100M rows requires only ~48,828 next_chunk() calls per operator — compared to 100 million next() calls per operator in Volcano.
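
The three claims above reduce to simple arithmetic. A quick sanity check (plain math; only the name STANDARD_VECTOR_SIZE comes from DuckDB, via its compile-time constant):

```python
VECTOR_SIZE = 2048   # DuckDB's STANDARD_VECTOR_SIZE
INT64_BYTES = 8

# L1 cache fit: one int64 vector is 16 KB, well under a 32-64 KB L1d
vector_bytes = VECTOR_SIZE * INT64_BYTES            # 16,384 bytes = 16 KB

# SIMD sweet spot: a 512-bit AVX-512 register holds 8 int64s
lanes = 64 // INT64_BYTES                           # 8 values per instruction
simd_instructions = VECTOR_SIZE // lanes            # 256 per vector

# Amortized overhead: full vectors per operator for a 100M-row scan
rows = 100_000_000
full_vectors = rows // VECTOR_SIZE                  # 48,828 (plus one partial)
```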

Columnar Vectors and SIMD

Each vector is a contiguous array of a single data type. Not rows of mixed types — a flat int64[], a flat varchar[], a flat double[]. This columnar layout is critical because SIMD instructions require all elements to be the same type and contiguous in memory.

id: int64[2048]
price: double[2048]
qty: int32[2048]
Each column is a separate flat array in memory. With 256-bit AVX2 registers, the CPU can load 4 doubles or 8 int32s in a single instruction, then apply the same operation (compare, add, multiply) to all of them simultaneously.
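
A minimal way to see the layout difference, using Python's array module as a stand-in for an engine's typed column vectors (the table and type codes here are purely illustrative):

```python
from array import array  # typed, contiguous buffers, like a column vector

# Row layout: a list of heterogeneous tuples (id, price, qty)
rows = [(1, 9.99, 3), (2, 150.0, 1), (3, 42.5, 7)]

# Columnar layout: one flat, homogeneous, contiguous array per column
ids    = array('q', (r[0] for r in rows))   # int64[]
prices = array('d', (r[1] for r in rows))   # double[]
qtys   = array('i', (r[2] for r in rows))   # int32[]

# Every element is fixed-width and adjacent in memory; this is the
# property that lets a real engine apply one SIMD instruction to
# many values at once.
```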

Selection Vectors: Filtering Without Copying

When a filter operator evaluates WHERE price > 100 on a vector, it does not copy the matching rows into a new vector. Instead, it produces a selection vector — an array of indices pointing to the rows that passed the predicate. Downstream operators read only those indices, skipping filtered rows without any data movement.
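
A sketch of the mechanism with a hypothetical price vector (not DuckDB code): the filter emits indices, and a downstream operator reads through them.

```python
# Hypothetical input vector; values chosen for illustration
price = [12.0, 250.0, 99.0, 101.5, 300.0, 47.0]

# Filter WHERE price > 100: emit indices of passing rows, copy nothing
sel = [i for i, p in enumerate(price) if p > 100.0]

# A downstream projection (here a 10% discount) reads only through
# the selection vector; filtered-out rows are never touched or moved
discounted = [price[i] * 0.9 for i in sel]
```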

Pipeline Breakers

Not every operator can produce output immediately after seeing one vector. Some need to consume all input before producing any output. These are pipeline breakers and they create boundaries that split the query plan into separate execution pipelines.

Filter (streaming, non-blocking): checks the predicate on each vector and passes matching rows immediately. Never buffers.
Projection (streaming, non-blocking): evaluates expressions (column + 1, UPPER(name)) on each vector and returns the result instantly.
Hash Join, build side (pipeline breaker): must consume the entire build side to construct the hash table before the probe side can start.
Sort (pipeline breaker): needs all input rows to determine the correct ordering; cannot emit any output early.
Hash Aggregate (pipeline breaker): must see every group's rows before producing final aggregate values (SUM, COUNT, AVG).
Hash Join, probe side (streaming, non-blocking): once the build side's hash table is ready, probe rows are looked up and joined immediately, one vector at a time.
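
The build/probe asymmetry can be sketched as follows. The hash_join function, its chunk format, and the key names are illustrative, not DuckDB's implementation:

```python
def hash_join(build_chunks, probe_chunks, build_key, probe_key):
    # Build phase: a pipeline breaker — the ENTIRE build side must be
    # drained into the hash table before any output can be produced
    table = {}
    for chunk in build_chunks:
        for row in chunk:
            table.setdefault(row[build_key], []).append(row)

    # Probe phase: streaming — each probe chunk is looked up, joined,
    # and emitted immediately, one vector at a time
    for chunk in probe_chunks:
        out = [(left, right)
               for left in chunk
               for right in table.get(left[probe_key], [])]
        if out:
            yield out

build = [[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]]
probe = [[{"id": 2, "qty": 5}], [{"id": 1, "qty": 9}]]
results = list(hash_join(build, probe, "id", "id"))
# one output chunk per probe chunk, produced incrementally
```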

Putting Numbers on It

A concrete example of the overhead difference when scanning 100 million rows through a 3-operator pipeline (Scan, Filter, Project).

|                               | Volcano                                        | Vectorized (2048)                                    |
|-------------------------------|------------------------------------------------|------------------------------------------------------|
| Rows to process               | 100,000,000                                    | 100,000,000                                          |
| Calls per operator            | 100,000,000 next()                             | 48,828 next_chunk()                                  |
| Total function calls (3 ops)  | 300,000,000                                    | 146,484                                              |
| Cache behavior                | Random — each row may be a different layout    | Sequential — flat array of same type in L1           |
| SIMD possible?                | No — single value, scalar operations only      | Yes — 4-8 values processed per instruction           |
| Branch prediction             | Per-row type checks and null checks            | One check per vector, then tight branchless loop     |
| Typical throughput            | ~50-200M rows/sec                              | ~1-4B rows/sec                                       |

Frequently Asked Questions

Does vectorized execution only help with numeric data?

No. While SIMD benefits are most obvious for fixed-width numeric types (int32, int64, double), vectorized execution also improves string processing. String vectors store offsets and lengths in contiguous arrays, so operations like prefix matching or length computation still benefit from sequential memory access and reduced function call overhead. The batching advantage (amortizing per-call costs across 2048 rows) applies regardless of data type.
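
A sketch of one plausible string-vector layout (the single data buffer plus parallel offset and length arrays here are an assumption for illustration; real formats vary):

```python
# Three strings packed into one contiguous buffer
data    = b"applebananacherry"
offsets = [0, 5, 11]   # where each string starts in data
lengths = [5, 6, 6]    # how long each string is

# LENGTH(s) for every row touches only the contiguous lengths
# array, never the character data itself
lens = list(lengths)

# Prefix matching walks the flat offset/length arrays sequentially,
# keeping memory access predictable even for variable-width data
starts_with_b = [data[o:o + n].startswith(b"b")
                 for o, n in zip(offsets, lengths)]
```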

What happens when a vector has NULL values?

Each vector in DuckDB carries a validity mask — a bitmask where each bit indicates whether the corresponding value is NULL. Operations check this bitmask before processing. For many operations (arithmetic, comparisons), the engine uses bitwise AND on validity masks to propagate NULLs without branching on individual values. This keeps the tight processing loop intact while correctly handling NULL semantics.
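
A simplified sketch of the idea, using a plain Python int as the bitmask (DuckDB's actual validity masks are per-vector bit arrays; the data values here are illustrative):

```python
a = [10, 20, 30, 40]
b = [ 1,  2,  3,  4]
a_valid = 0b1011   # bit i set => a[i] is non-NULL; here a[2] is NULL
b_valid = 0b1110   # b[0] is NULL

# Compute the sum unconditionally in a tight, branchless loop...
out = [x + y for x, y in zip(a, b)]

# ...then propagate NULLs with a single bitwise AND on the masks:
# a position is valid only if it was valid in BOTH inputs
out_valid = a_valid & b_valid   # 0b1010

result = [out[i] if (out_valid >> i) & 1 else None for i in range(4)]
# result == [None, 22, None, 44]
```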

Is the vector size always exactly 2048?

The standard vector size is 2048, defined as a compile-time constant STANDARD_VECTOR_SIZE. The last vector in a scan may contain fewer values (whatever remains). The value 2048 is a carefully chosen default that balances L1 cache utilization, SIMD efficiency, and amortization of per-call overhead. DuckDB can be compiled with a different vector size for experimentation, but 2048 has been found to be near-optimal across a wide range of hardware and query patterns.

How does vectorized execution relate to columnar storage?

They are complementary but independent concepts. Columnar storage means data is stored on disk with all values of a single column together. Vectorized execution means the query engine processes batches of values from one column at a time. Columnar storage makes it efficient to read a single column from disk; vectorized execution makes it efficient to process that column in memory. DuckDB uses both, but you could have columnar storage with a Volcano-style engine (wasting the cache benefits) or vectorized execution over row-stored data (though you would pay to extract columns first).

Can vectorized engines handle complex expressions like CASE WHEN?

Yes. Complex expressions are decomposed into primitive vectorized operations. A CASE WHEN is evaluated by first computing the condition for all 2048 values (producing a boolean vector), then computing the THEN branch for all values, computing the ELSE branch for all values, and finally using the boolean vector to select the correct result for each position. This avoids per-row branching and keeps the processing vectorized throughout. The selection vector mechanism ensures that only the relevant rows are actually computed for each branch.
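
The decomposition can be sketched over a small hypothetical price vector, for a query like CASE WHEN price > 100 THEN price * 0.5 ELSE price END (values and expression chosen for illustration):

```python
price = [50.0, 200.0, 120.0, 80.0]

# Step 1: evaluate the condition for every position (boolean vector)
cond = [p > 100.0 for p in price]

# Steps 2-3: evaluate THEN and ELSE branches for every position,
# each in its own tight loop with no per-row branching
then_branch = [p * 0.5 for p in price]
else_branch = price

# Step 4: use the boolean vector to pick the result per position
result = [t if c else e
          for c, t, e in zip(cond, then_branch, else_branch)]
# result == [50.0, 100.0, 60.0, 80.0]
```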

Original educational content. Inspired by database systems research — no slides or course material reproduced.