Query Optimizer
How DuckDB Rewrites Queries Before They Run
Before DuckDB reads a single row, the optimizer transforms the logical plan through a sequence of rewrite passes. Each pass applies a specific strategy — pushing filters closer to the data source, pruning unnecessary columns, reordering joins to minimize intermediate result sizes, eliminating redundant computations. The result is often a plan that processes orders of magnitude less data than a naive interpretation of the SQL.
DuckDB's optimizer is deliberately simpler than PostgreSQL's or Oracle's. It does not need to reason about dozens of index types, tablespace placement, or replication topology. This simplicity is a feature: fewer moving parts means faster optimization (microseconds, not milliseconds) and more predictable behavior.
Optimizer Pipeline
The passes below are applied one at a time, starting from a naive plan for the following query and improving it step by step:

```sql
SELECT o.total, c.name
FROM orders o
JOIN customers c ON o.cust_id = c.id
JOIN products p ON o.prod_id = p.id
WHERE c.country = 'DE'
  AND p.category = 'electronics'
  AND o.total > 100
ORDER BY o.total DESC
LIMIT 50;
```

Filter Pushdown
Move predicates as close to the data source as possible. The earlier you filter, the less data flows through the rest of the plan.
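To make the effect concrete, here is a hand-written equivalent of the running example after pushdown: each predicate now sits in a subquery directly over its table. This is a sketch of the rewrite's effect on semantics, not DuckDB's actual plan representation.

```sql
-- Each filter is applied at its own table scan, before any join runs.
SELECT o.total, c.name
FROM (SELECT * FROM orders    WHERE total > 100)              AS o
JOIN (SELECT * FROM customers WHERE country = 'DE')           AS c ON o.cust_id = c.id
JOIN (SELECT * FROM products  WHERE category = 'electronics') AS p ON o.prod_id = p.id
ORDER BY o.total DESC
LIMIT 50;
```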
Parquet-Aware Filter Pushdown
When the data source is a Parquet file, filter pushdown becomes even more powerful. Parquet stores min/max statistics per row group (typically 100K-1M rows). The filter is evaluated against these statistics before reading any data pages. If a row group's max value for the filtered column is below the filter threshold, the entire row group is skipped. On a well-sorted Parquet file, this can eliminate 95%+ of I/O.
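As an illustration, a filtered Parquet scan looks like any other query; the skipping happens inside the reader. The file name and column below are hypothetical.

```sql
-- 'events.parquet' and event_date are hypothetical. The predicate is pushed
-- into the Parquet reader, which checks each row group's min/max statistics
-- for event_date and skips groups that cannot contain a match.
SELECT count(*)
FROM read_parquet('events.parquet')
WHERE event_date > DATE '2024-01-01';
```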
Projection Pushdown
Only read the columns the query actually uses. In a columnar storage engine, this translates directly to less I/O.
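Written out by hand against a simplified version of the running example (a sketch; the optimizer performs this on the plan, not on the SQL text):

```sql
-- Only the referenced columns survive at each scan. In columnar storage,
-- every other column of orders, customers, and products is never read.
SELECT o.total, c.name
FROM (SELECT cust_id, prod_id, total FROM orders) AS o
JOIN (SELECT id, name FROM customers)             AS c ON o.cust_id = c.id
JOIN (SELECT id FROM products)                    AS p ON o.prod_id = p.id;
```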
Join Ordering
For N tables, there are N! possible join orders. DuckDB uses dynamic programming to find the cheapest order based on estimated intermediate cardinalities.
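You can inspect the order DuckDB actually picked with EXPLAIN: the scans sit at the leaves of the printed operator tree, so the shape of the tree shows which tables are joined first.

```sql
-- Inspect the chosen join order for the running example.
EXPLAIN
SELECT o.total, c.name
FROM orders o
JOIN customers c ON o.cust_id = c.id
JOIN products p ON o.prod_id = p.id
WHERE c.country = 'DE'
  AND p.category = 'electronics'
  AND o.total > 100;
```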
Common Subexpression Elimination
When the same expression appears multiple times in a query, compute it once and reuse the result.
Before the rewrite, `price * (1 + tax_rate)` is computed three times:

```sql
SELECT price * (1 + tax_rate) AS total,
       price * (1 + tax_rate) * quantity AS line_total
WHERE  price * (1 + tax_rate) > 50
```

After elimination, the shared subexpression is evaluated once and reused (pseudocode):

```sql
LET _cse1 = price * (1 + tax_rate);
SELECT _cse1 AS total,
       _cse1 * quantity AS line_total
WHERE  _cse1 > 50
```

Statistics Propagation
DuckDB tracks per-column statistics and propagates them through operators to estimate intermediate cardinalities. These estimates drive the cost model.
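A small runnable sketch: EXPLAIN prints an estimated cardinality on each operator, so you can watch the filter estimate that propagated statistics produce (the numbers below assume the uniform-distribution model described above).

```sql
-- Build a 1M-row table with total uniform in [0, 999999].
CREATE TABLE t AS SELECT range AS total FROM range(1000000);
-- Under the uniform assumption, total > 100 keeps roughly
-- 1000000 * (999999 - 100) / 999999 ≈ 999899 rows; EXPLAIN shows
-- the estimate the optimizer actually derived.
EXPLAIN SELECT * FROM t WHERE total > 100;
```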
DuckDB vs PostgreSQL Optimizer
Both use cost-based optimization, but with different priorities and constraints.
| Aspect | DuckDB | PostgreSQL |
|---|---|---|
| Optimization target | Analytical queries (full scans, aggregates, joins over large datasets) | Mixed OLTP/OLAP (point lookups, range scans, joins, updates) |
| Index awareness | Minimal (ART for PKs only). Optimizer rarely reasons about indexes. | Extensive (B-tree, GIN, GiST, BRIN, hash). Index selection is a major optimizer component. |
| Statistics source | In-memory column statistics. Updated during data loading. | pg_statistic table, populated by ANALYZE. Can go stale. |
| Join ordering | Dynamic programming for small join graphs, greedy for large ones. | Dynamic programming up to join_collapse_limit (default 8), then GEQO (genetic algorithm). |
| Optimization speed | Microseconds (fewer rules, simpler plan space) | Milliseconds (more rules, larger plan space) |
| Parquet pushdown | Native: pushes filters into Parquet row group metadata | Requires FDW extensions (e.g., parquet_fdw). No native support. |
Frequently Asked Questions
How many optimization passes does DuckDB's optimizer run?
DuckDB runs roughly a dozen optimization passes on each query's logical plan. The exact number varies by version, but the major ones include: filter pushdown, projection pushdown, join ordering (using dynamic programming), common subexpression elimination, constant folding, limit pushdown, and statistics propagation. Each pass is incremental — it takes the plan produced by the previous pass and attempts to improve it further. The passes are ordered so that earlier optimizations expose opportunities for later ones.
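One way to see what an individual pass contributes is to turn it off and compare plans. Recent DuckDB versions expose a disabled_optimizers setting for this; treat the exact pass names here as examples.

```sql
-- Compare plans with and without filter pushdown (sketch; setting and pass
-- names as documented for recent DuckDB versions).
PRAGMA disabled_optimizers = 'filter_pushdown';
EXPLAIN SELECT o.total FROM orders o WHERE o.total > 100;

PRAGMA disabled_optimizers = '';  -- re-enable all passes
EXPLAIN SELECT o.total FROM orders o WHERE o.total > 100;
```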
Does DuckDB support index-based optimization?
DuckDB has minimal index-based optimization compared to traditional databases. It supports ART (Adaptive Radix Tree) indexes for primary key lookups and constraint enforcement, but the optimizer does not have a complex index selection component. This is intentional: DuckDB is a columnar analytical engine where full column scans with vectorized execution are often faster than index-based random access. The optimizer focuses instead on reducing the amount of data scanned through filter and projection pushdown.
How does filter pushdown work with Parquet files?
When DuckDB scans a Parquet file with a WHERE clause, the optimizer pushes the filter predicate into the Parquet reader. Parquet stores min/max statistics for each row group and each column chunk. The reader uses these statistics to skip entire row groups that cannot contain matching rows. For example, if you filter WHERE date > '2024-01-01' and a row group's max date is '2023-06-15', that entire row group (potentially millions of rows) is skipped without reading any of its data pages.
What is the difference between a logical plan and a physical plan?
A logical plan describes what the query does: which tables are joined, what filters are applied, which columns are projected. It uses abstract operators like LogicalGet, LogicalFilter, LogicalJoin. A physical plan describes how the query executes: which algorithm to use for each join (hash join vs merge join vs nested loop), how to parallelize the scan, how to partition the hash table. The optimizer works on the logical plan. After all logical optimizations, a separate planner converts the logical plan to a physical plan by choosing concrete implementations for each operator.
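DuckDB can print both levels side by side. Recent versions expose an explain_output setting that controls which plans EXPLAIN shows; treat this as a sketch.

```sql
-- 'all' shows the unoptimized logical, optimized logical, and physical plans.
PRAGMA explain_output = 'all';
EXPLAIN SELECT o.total FROM orders o WHERE o.total > 100;
```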
How does DuckDB estimate cardinality for join ordering?
DuckDB maintains column-level statistics including min value, max value, approximate distinct count (using HyperLogLog), and null count. When estimating the output cardinality of a join, the optimizer uses these statistics combined with assumptions about data distribution (e.g., uniform distribution within the min/max range). For equality joins, the estimated output size is roughly (left_rows × right_rows) / max(distinct_left, distinct_right). These estimates can be inaccurate for skewed data, but they are sufficient for choosing good join orders in most practical cases.
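Plugging hypothetical numbers into that formula: joining orders (1,000,000 rows, ~50,000 distinct cust_id values) to customers (50,000 rows, all id values distinct) gives an estimate of (1,000,000 × 50,000) / max(50,000, 50,000) = 1,000,000 rows, i.e. roughly one customer per order, which is exactly what a key/foreign-key join should produce.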