DuckDB Internals
DuckDB is an in-process OLAP database that crunches billions of rows on a laptop. No server, no config, no dependencies — just embed it and query. These pages dissect the engineering that makes this possible: vectorized execution, push-based pipelines, adaptive radix trees, and a query optimizer that rivals production systems ten times its size.
Inspired by the DiDi curriculum from the University of Tübingen. All content is original.
DuckDB Architecture Overview
Key Numbers
Why DuckDB Exists
Vectorized Execution
Process 2048 values at once instead of row-by-row — how DuckDB's columnar engine exploits CPU caches, SIMD, and pipeline parallelism
Query Pipeline
Push-based execution with morsel-driven parallelism — how DuckDB builds and runs query plans without materializing intermediate results
ART Indexing
Adaptive Radix Trees — the cache-friendly, space-efficient index structure DuckDB uses instead of B-trees
Sorting Large Tables
External merge sort with columnar layouts — how DuckDB sorts datasets bigger than memory without falling off a performance cliff
Memory Management
Buffer manager, memory limits, and spill-to-disk — how an in-process database handles datasets larger than available RAM
Query Optimizer
Join ordering, filter pushdown, common subexpression elimination — the rewrite rules that make DuckDB queries fast
Storage Format
Row groups, column segments, min/max indexes, and Parquet integration — how DuckDB persists and scans data on disk
DuckDB vs SQLite
Column-store vs row-store, OLAP vs OLTP, embedded vs embedded — when to use which and why the architecture matters
DuckDB vs Other Databases
| DuckDB | PostgreSQL | SQLite | ClickHouse | |
|---|---|---|---|---|
| Deployment | Embedded (in-process) | Client-server | Embedded (in-process) | Client-server / cluster |
| Storage layout | Columnar | Row-based (heap) | Row-based (B-tree) | Columnar (MergeTree) |
| Best for | Local analytics, data science | OLTP + moderate OLAP | OLTP, mobile, config | Production OLAP at scale |
| Execution model | Vectorized, push-based | Volcano (pull, row) | Virtual machine (row) | Vectorized, push-based |
| Parallelism | Morsel-driven (auto) | Parallel query (limited) | Single-threaded | Multi-threaded + distributed |
| Concurrency | Single writer, multi reader | Full MVCC | WAL mode multi-reader | Append-oriented |
| Index type | ART (Adaptive Radix Tree) | B-tree, GIN, GiST | B-tree | Sparse (MergeTree primary) |
| Dependencies | Zero | Server + extensions | Zero | Server + ZooKeeper/Keeper |