SQLite B-Tree

Every persistent thing in a SQLite database — every table, every index — is a b-tree. The schema itself is a b-tree (the special sqlite_master at page 1). Tables with INTEGER PRIMARY KEY are stored as table btrees keyed on the 64-bit rowid; secondary indexes and WITHOUT ROWID tables are index btrees keyed on the user's columns. Both share the same on-disk page format: a slotted page with a header, a cell pointer array, and variable-sized cells. This page walks the page format in detail, distinguishes table from index btrees, explains overflow pages for large payloads, and traces the page-split algorithm that keeps the tree balanced as data grows.

Slotted Page Layout

Key Numbers

Default page size

4096 B

Page size range

512 B – 65 KB

Cell pointer

2 B/cell

Local payload max

~1023 B (4K)

Rowid

64-bit signed

Overflow chain

4-byte ptr

Tree height typical

3-4 levels

Why a B-Tree, Not Just a Sorted File

Random access by key

A sorted file gives O(log N) lookup but every insert is O(N) — must shift all later bytes. B-trees give O(log N) for both, with localized writes (a single page changes per insert in the average case).

Page-aligned with disk

Each b-tree node is exactly one page. Reading a node is one disk seek. The branching factor (cells per page) is naturally optimized for SSD/HDD I/O units.

In-place updates

Updates that fit in the existing cell don't require a split. Page locality is preserved. SQLite's b-tree even reuses freed cell space within a page (via the fragment list) before allocating new cell space.

Table Btree: Rowid-Keyed

The default storage for tables with INTEGER PRIMARY KEY (or no explicit PK).

Every "ordinary" SQLite table is a b-tree where:

key   = 64-bit signed integer rowid (auto-assigned or your INTEGER PRIMARY KEY)
value = the row's column values, serialized via the SQLite "record format"

cell layout in a leaf:
  varint payload-length
  varint rowid
  payload bytes:
    varint header-length
    varint type-code per column (NULL, int, real, text, blob, …)
    raw column values

The varints (variable-length integers) make small numbers cheap: a column type code for NULL is one byte; an INT row payload header is ~5 bytes total. The record format also encodes column types alongside data, letting SQLite be schemaless internally — type affinity is enforced at insert time but storage is type-aware.

Rowid lookup: descend the b-tree from root to leaf. Interior cells contain (rowid, child_page); leaf cells contain (rowid, payload). Standard b-tree algorithm. For a 1M-row table at 4 KB pages, the tree is ~3 levels deep: root, interior, leaf. Three reads to find any row.

Index Btree: Composite-Keyed

Secondary indexes and WITHOUT ROWID tables.

Index btrees use a different cell format:

key   = the indexed column(s) value, then the rowid
        (concatenation makes keys unique even for non-unique indexes)
value = empty (the rowid is part of the key, suffices to find the row)

CREATE INDEX idx_name ON users(name);
→ index btree keyed on (name, rowid)
→ leaf cell payload is empty
→ to read the user's other columns, follow the rowid into the table btree

The (column, rowid) composite key serves two purposes: (1) it makes index entries unique even when many rows share the same column value, (2) it provides a stable secondary order so that a SELECT WHERE name='alice' ORDER BY id returns rows in rowid order without a sort.

A query that uses an index to find rows pays for the extra hop: index btree descent (to find the rowid), then table btree descent (to read the row). Two trees touched, ~6 reads for a 1M-row table. Covered indexes (where every column the query needs is in the index key) avoid the second descent.

WITHOUT ROWID Tables

A table whose primary key is also its index — no separate rowid.

The default table format wastes space when the user has a natural primary key (a UUID, a string), because the table btree's rowid is auto-assigned and you typically also need an index on the natural key — paying for two btrees. WITHOUT ROWID collapses these: the natural primary key becomes the table btree's key, and the row data is the cell payload.

CREATE TABLE users (
  email TEXT PRIMARY KEY,
  name TEXT,
  created_at INTEGER
) WITHOUT ROWID;

→ table is a b-tree keyed on email (string compare)
→ cell payload is (name, created_at)
→ no separate index on email needed (the table IS the index)

Tradeoffs: lookups by primary key are faster (one btree instead of two). Lookups by other columns still need separate secondary indexes that point to the primary key (no rowid to use as the lightweight pointer). And the primary key gets repeated in every secondary index. For tables with many secondary indexes and a non-trivial primary key, WITHOUT ROWID can be a loss; for primary-key-dominated lookup workloads, it's a clear win.

Overflow Pages

Storing payloads larger than a page.

A 4 KB page leaves room for cells whose payload is up to ~1023 bytes locally (after the header, cell pointer array, and other cell overhead). For larger payloads (a 10 KB BLOB, a long TEXT), SQLite splits:

cell on btree leaf:
  varint payload-length
  varint rowid
  first 1023 bytes of payload locally
  4-byte: page number of first overflow page

overflow page 1:
  4-byte: page number of next overflow page (or 0 if last)
  next chunk of payload (4092 bytes)

overflow page 2: ...
...

Reading a row with overflow: read the leaf cell, follow the chain of overflow pages until you've read the full payload. For a 10 KB BLOB on 4 KB pages, that's 1 leaf read + 3 overflow reads = 4 page reads. If the workload often reads big BLOBs, raising PRAGMA page_size to 16K or 32K (before any tables are created) reduces overflow chain depth.

Page Splits

When inserts exceed page capacity.

Insert into a full page → split:

1. allocate a new page (from freelist or by growing the file)
2. distribute cells between the two pages, ~50/50 by size
3. insert a divider cell into the parent page, pointing to the new page
4. if parent is also full, recursively split

cost: typically 1 new page write + 1 modified parent page
worst case: split cascades to root, root splits, tree grows by 1 level

Splits are uncommon in steady state — most pages have free space left over from earlier operations, and inserts fit. But during bulk inserts in random order, splits are frequent and concentrate writes. Bulk-insert workloads benefit hugely from sorted insertion (rowid order, or insert into a BEGIN; INSERT; COMMIT batch) so each new page fills up in one stretch instead of getting split repeatedly.

The Freelist

Recycling pages from deletions.

Deletes free pages — sometimes empty leaves, sometimes whole subtrees. SQLite doesn't truncate the file on every delete (too expensive). Instead, freed pages are linked into a freelist whose head is stored in the database header (offset 32, 4 bytes). The next allocation pops from this list before extending the file.

Without auto-vacuum, a database file never shrinks — only grows or stays the same size. Freelist pages are reused for new writes but if you've deleted 90% of your data, the file is still its peak size. VACUUM rewrites the entire database into a new file with no freelist, then renames atomically. PRAGMA auto_vacuum=FULL truncates after every commit at the cost of more I/O; INCREMENTAL shrinks lazily on PRAGMA incremental_vacuum(N).

FAQ

What's the difference between table btree and index btree?

Table btrees are keyed on a 64-bit signed integer rowid; the cell payload is the row data. Index btrees are keyed on the indexed column value(s) followed by the rowid; the cell payload is empty (the rowid is sufficient to look up the row in the table btree). Both use the same b-tree code; the difference is in the comparison function and cell layout. WITHOUT ROWID tables turn the table btree into a regular index btree where the primary key is the search key — the row data goes into the leaf cells of the same btree.

What are overflow pages?

Cells whose payload exceeds local payload limit (~1/4 of usable page size) spill onto chained overflow pages. The first part of the cell stays on the b-tree leaf; a 4-byte page number points to the next chunk on a separate page; that page's first 4 bytes point to the next, and so on. Overflow lets SQLite store BLOBs and large strings without inflating page sizes, but it costs extra disk reads to fetch a single row. For BLOB-heavy schemas, raise the page size to reduce overflow frequency.

How does a page split work?

Insert into a full page → SQLite allocates a new page from the freelist (or extends the file), distributes cells between the original and new pages so each is roughly half full, and inserts a divider key (usually a copy of the smallest key on the new page) into the parent. If the parent is also full, that split cascades up. Worst case: split propagates to the root, which itself splits and adds a level. Tree height grows by at most 1 per insert, but in practice splits are rare — most pages have headroom from the initial sizing.

What's the freelist?

Pages that were once allocated but are now unused (a row was deleted, a page emptied) are kept on a singly-linked list rooted at offset 32 in the database header. New page allocations come from this list before extending the file. Deletes don't shrink the file by default; they release pages to the freelist for reuse. PRAGMA auto_vacuum=FULL or auto_vacuum=INCREMENTAL truncates the file when the freelist gets large; without it, you need VACUUM to reclaim disk space.

Why do cells live at the end of the page?

SQLite uses a 'slotted page' design. The page begins with a small header, then a 'cell pointer array' that grows downward toward the cells. Cells themselves grow upward from the bottom of the page. Both ends grow toward the middle. This lets cells be variable-sized while still allowing O(1) random access via the pointer array — critical for cells of wildly different sizes (a row with one int and a row with a 1KB string can coexist on the same page).

How is a row located by rowid?

Standard b-tree descent. The root page of the table btree is found via the schema btree (page 1, sqlite_master). At each interior page, scan/binary-search the cells to find the child whose range contains the target rowid; descend. At the leaf, scan/binary-search to find the cell with the matching rowid; return its payload (or follow overflow chain if the row is large). For a 1M-row table, that's ~3 page reads — typically all in the page cache, so submillisecond.