SQLite Internals

SQLite is a full SQL database compiled into a library. It runs without a server, stores a database in a single file, and offers ACID transactions with crash recovery. The library is around 200 KB of C, has more test code than every other open-source database combined, and ships in roughly every phone, every browser, every airplane, and most cars on the planet. These pages explain why and how — page format, B-trees, the WAL, the virtual table mechanism, and the testing culture that makes SQLite the world's most-deployed database.

Based on the SQLite source, the official SQLite documentation, and Richard Hipp's published talks. Original content.

SQLite Architecture

Application linked against libsqlite3 (no IPC, no daemon) Python · Go · iOS · Android · Browsers · Embedded systems Tokenizer SQL → tokens Parser (Lemon) tokens → AST Code Generator AST → bytecode VDBE opcode VM B-Tree (table btrees + index btrees) cursor open/seek/insert/delete · page splits · overflow chains · freelist Pager (page cache + journal/WAL) in-memory page cache · rollback journal OR write-ahead log · transaction control OS Interface (VFS) — open / read / write / fsync / lock — pluggable: unix, win32, memory, custom

Key Numbers

Library Size
~200 KB
Default Page
4096 B
Max Pages
2^32
Max DB Size
281 TB
Test Code Ratio
~100×
First Released
2000
Public Domain
Yes

Why SQLite Exists

The Gap
In 2000 Richard Hipp needed a database for a Navy missile damage-control system. PostgreSQL required a server. Berkeley DB had license issues. He wanted something that would link into a single executable, run anywhere, never need an admin, and survive abrupt power loss — submarines lose power.
The Insight
A database does not need to be a server. The hard parts — atomic commit, B-tree storage, recovery — are library problems. The OS already has files and locks. If you treat the OS as your VFS abstraction, you get cross-platform durability for free, and the DB becomes a single .o file you compile in.
The Result
SQLite ended up in places nobody planned: every iPhone (Core Data), every Android phone, every Firefox/Chrome (IndexedDB, history, cookies), Skype, Dropbox, every Airbus A350, every Tesla. Apple alone ships hundreds of SQLite databases per device. There is no other database with this footprint.
✦ Live

Write-Ahead Logging (WAL)

How SQLite's WAL mode works in detail — file format, wal-index, checkpoints, reader/writer concurrency, and the wal2 extension

Coming soon

B-Tree Storage

Coming soon: how SQLite lays out table and index B-trees, page splits, overflow chains, and the freelist

Coming soon

File Format

Coming soon: the 100-byte database header, page types, the lock-byte page, and how file size is bounded

Coming soon

Virtual Tables

Coming soon: the xBestIndex / xFilter API that lets FTS5, R*Tree, and the json_each extension look like normal tables

Coming soon

Atomic Commit Protocol

Coming soon: how SQLite guarantees ACID through the rollback journal — the original durability protocol before WAL

The Serverless Model

SQLite is not a server. It is a library that touches a file.

A traditional database is a process. The application speaks a wire protocol over TCP or a Unix socket; the server has its own user, its own port, its own buffer pool, its own logs. SQLite removes the process. libsqlite3 compiles into your binary; a sqlite3_open() call returns a handle that reads and writes a database file directly through normal POSIX (or Windows, or any custom VFS) calls.

Concurrency is coordinated by file locks: SHARED, RESERVED, PENDING, EXCLUSIVE. The kernel arbitrates which process holds which lock; SQLite reads its own pager state from the file to decide what is consistent. In WAL mode there is also a small -shm shared-memory file for the wal-index.

The serverless model has three consequences. First, no operational presence: there is nothing running for an attacker to find and nothing for an SRE to monitor. Second, no network attack surface — but also no network access; the database is reachable only by processes that can open(2) the file. Third, transactions are durable to whatever fsync the host file system provides, and SQLite is meticulous about choosing sync points correctly.

The File Format

Everything SQLite stores lives in fixed-size pages inside one file.

A SQLite file begins with a 100-byte database header. It contains the magic string "SQLite format 3\0", the page size (default 4096, range 512–65536, must be a power of two), the file format version, the user-visible schema cookie, the change counter, the freelist trunk page number, the database size in pages, and a few flags for application-level compatibility.

The rest of the file is a sequence of pages of identical size. Page 1 contains the database header in its first 100 bytes and the start of the sqlite_schema table after that — which is itself a B-tree storing the SQL CREATE statements for every other table and index. Every other page is one of: a table B-tree page (interior or leaf), an index B-tree page (interior or leaf), an overflow page (holds the tail of a row whose payload exceeds the per-page budget), a freelist trunk page, a freelist leaf page, or a pointer-map page (in auto-vacuum mode).

The freelist is SQLite's deferred-deallocation mechanism. When a row is deleted and a page becomes empty, the page is added to the freelist instead of being released to the file system. Future inserts pull from the freelist before extending the file. The freelist is a singly-linked list of trunk pages, each of which holds page numbers of leaf pages available for reuse — a two-level structure to keep the linked-list traversal short.

B-Tree Storage

SQLite is a B-tree database. Tables are B-trees; indexes are B-trees.

Every table in SQLite is stored as a B-tree keyed on the row's ROWID (a 64-bit integer). If you declare INTEGER PRIMARY KEY, that column is the rowid; otherwise SQLite assigns one. The B-tree is what SQLite calls a "table b-tree": leaf pages store the row's payload (the serialized column values) in the leaves, with interior pages containing only rowid keys + child pointers.

Every index (other than the implicit rowid index of a table) is stored as a separate "index b-tree". Leaf pages here store the indexed column values plus the rowid as a pointer back to the table. Looking up a row by an indexed column means: walk the index b-tree to find the rowid, then walk the table b-tree by rowid to fetch the payload — two B-tree descents, two cache-line-friendly walks of the height-3-or-4 trees typical at SQLite's scale.

A WITHOUT ROWID table is different: the table's primary key columns are the B-tree key directly, no rowid layer in between. This saves the second B-tree descent for primary-key lookups and lets the leaf store the row in the table's own key-order. The cost is that secondary indexes must store the full primary key (which may be wider than 8 bytes) instead of a compact rowid.

Page splits happen when an insert into a leaf would exceed the page size. SQLite uses a simple split-down-the-middle heuristic plus a "balance" routine that may redistribute cells across siblings to defer splits. Long values (BLOBs, big strings) that don't fit in a leaf are spilled to an overflow chain: a singly-linked list of pages holding the tail of the value. The leaf cell stores the prefix and the first overflow page number.

The Atomic Commit Protocol (Rollback Journal)

The original durability mechanism — superseded by WAL for most workloads, still the default in some embeddings.

In rollback-journal mode, the protocol for a write transaction is:

  1. Acquire a RESERVED lock on the database (one writer allowed).
  2. Open the journal file (filename-journal). Write its header. fsync.
  3. For each page about to be modified, copy the original page into the journal. fsync the journal.
  4. Modify the database pages in memory and write them to the database file.
  5. Acquire EXCLUSIVE lock. fsync the database file.
  6. Delete (or truncate, or zero-the-header on, depending on journal_mode) the journal. fsync the directory.
  7. Release the lock.

If the process crashes after step 3 but before step 6, the journal exists with the original pages. Recovery: any reader noticing a non-empty journal copies the original pages back into the database — undoing the partial transaction. If the crash is after step 6, the journal is gone and the new database state is committed.

This protocol is bulletproof but writes pages twice (once to journal, once to the file) and forces multiple fsyncs per transaction. Hence WAL.

Write-Ahead Logging (WAL)

Modern default: a separate log file, readers and writers don't block each other.

In WAL mode, modifications go to a separate filename-wal file, not into the main database. Readers continue to read from the main database (and any frames in the WAL that precede a snapshot point they captured at transaction start). Writers append new frames to the WAL. A checkpoint periodically copies WAL frames back into the main database so the WAL doesn't grow forever.

The big consequences: readers and writers no longer fight for the same pages. Multi-reader / single-writer concurrency becomes the natural default. fsync overhead per commit drops because only the WAL needs to sync, not the main database. And since the WAL is append-only during normal operation, sequential I/O dominates — much friendlier to SSDs and rotational disks alike.

The full mechanics — frame format, the shared-memory wal-index, checkpoint modes, the reader/writer dance — are covered in the dedicated WAL deep dive: SQLite WAL Internals.

Virtual Tables

A pluggable table API that turned SQLite into an extension platform.

A virtual table looks and behaves like a normal table to SQL — you can SELECT from it, JOIN with it, filter it — but its storage is whatever the extension author wants: a CSV file, a remote API, an in-memory data structure, an inverted index. The mechanism is a C struct of function pointers (sqlite3_module) that SQLite calls during query execution.

The most important callback is xBestIndex. The optimizer asks the virtual table: "given these WHERE clauses and ORDER BY columns, what does each constraint cost?" The vtab responds with a plan. xFilter then opens a cursor configured by that plan, and xNext / xColumn stream rows back. This is the protocol that lets FTS5 turn its inverted index into a virtual table, lets R*Tree expose spatial range queries, and lets json_each / json_tree stream the elements of a JSON document as rows.

The virtual table mechanism turned SQLite into a query engine over arbitrary data. CSV files, sequence generators (generate_series), foreign databases via dbstat, and external schemas all become first-class SQL citizens. The idea predates "polyglot persistence" by a decade.

FTS5 — Full-Text Search

An inverted index implemented as a virtual table.

FTS5 is SQLite's full-text search extension. You declare a virtual table with the fts5 module and one or more text columns; FTS5 maintains an inverted index from terms to (rowid, position) tuples. A MATCH query parses an FTS query syntax, walks the index, scores hits with BM25, and returns ranked rowids. Token streams support ASCII, Unicode61 (Unicode-normalizing), Porter stemming, and trigrams.

The implementation is interesting because of contentless tables. By default FTS5 stores both the inverted index and a copy of the original text. Setting content='' tells FTS5 to keep only the index — useful when your original text already lives in another table or another system, and you want only the searchable bits in SQLite. The trade-off is that snippets and highlights need access to the source text, which a contentless table cannot provide.

R*Tree and JSON1

Two more virtual-table-driven extensions.

R*Tree is SQLite's spatial index. It stores N-dimensional bounding boxes and supports range queries ("find all rectangles intersecting this rectangle"). The implementation is a virtual table backed by hidden shadow tables; you can use R*Tree to index latitude/longitude bounding boxes for geographic search, time intervals for calendar overlap detection, or even N-dimensional feature vectors for crude nearest-box queries.

JSON1 ships built into modern SQLite. It does not store JSON in a special type — JSON is text in TEXT columns. What JSON1 adds is a set of functions (json_extract, json_set, json_array_length) plus the json_each and json_tree table-valued functions. These let you SELECT against JSON content as if it were structured: SELECT value FROM json_each(meta, '$.tags'). Combined with generated columns and indexes on JSON paths, you get something close to JSONB performance with much less storage complexity.

Transactions: Deferred, Immediate, Exclusive

SQLite transactions don't all behave the same. Pick the right kind.

BEGIN alone is the same as BEGIN DEFERRED. No lock is taken until the transaction reads or writes. The first read takes a SHARED lock; the first write takes a RESERVED lock. The risk: if two connections both start a deferred transaction, both read, and then both attempt to upgrade to RESERVED, one will get SQLITE_BUSY.

BEGIN IMMEDIATE takes a RESERVED lock immediately. No other connection can write while you hold it, but other readers proceed. This is the right choice when you know you will write — it converts late SQLITE_BUSY failures into early ones, and avoids the deadlock window of two upgraders.

BEGIN EXCLUSIVE takes an EXCLUSIVE lock immediately, blocking everything including readers. In WAL mode this is rarely needed; in rollback journal mode it can be used for catalog-style operations that must be globally serialized.

Memory Mode and Mmap Mode

Two ways to make SQLite go faster, both with caveats.

Open :memory: and SQLite never opens a file. Pages live in heap; the whole database vanishes when the connection closes. Useful for unit tests, caches, and ephemeral computations. Cross-connection sharing within a process: open the URI file::memory:?cache=shared with SQLITE_OPEN_URI.

PRAGMA mmap_size turns on memory-mapped I/O. Instead of read()-ing pages into the page cache, SQLite mmaps the database file and accesses pages directly from the page cache the kernel maintains. For read-heavy workloads on large databases this can halve I/O latency. The risks: any I/O error becomes a SIGBUS instead of a readable error code, and writes still go through the regular write path (mmap is read-only from SQLite's perspective).

The combination of WAL mode, a generous cache_size (negative values = KiB), and a sane synchronous setting (NORMAL is safe; FULL is paranoid) is the configuration that wins almost every benchmark.

SQLite vs DuckDB vs Postgres

SQLiteDuckDBPostgreSQL
DeploymentEmbedded libraryEmbedded libraryClient-server
Storage layoutRow (B-tree)Columnar segmentsRow (heap)
Optimized forOLTP, small transactionsOLAP, analytical scansMixed OLTP/OLAP
ConcurrencyWAL: multi-reader, 1 writer1 writer, multi-readerFull MVCC
Index typesB-tree, FTS5, R*Tree (vtab)ART (Adaptive Radix Tree)B-tree, GIN, GiST, BRIN, hash
Wire protocolNone (in-process)None (in-process)libpq / wire protocol
Best forMobile, edge, app stateLocal analytics, data scienceMulti-user transactional apps

Tradeoffs and When to Use SQLite

SQLite is not always the answer. Be honest about what it can't do.

Use SQLite when: the database is bound to one application, the working set fits on local disk, write concurrency is moderate, and you value zero operational surface. iPhone apps, browser storage, IoT logs, single-tenant SaaS where each tenant gets a file (LiteFS, Cloudflare D1, Turso), embedded analytics, on-device ML feature stores, configuration stores.

Don't use SQLite when: many independent processes need to write concurrently to the same database. The single-writer model is fundamental — even WAL mode allows only one writer at a time. Don't use it for high-throughput multi-tenant OLTP where dozens of writers are common; use Postgres or MySQL. Don't use it as a shared store across a fleet of stateless services unless you're willing to put a proxy in front (LiteFS Cloud, rqlite).

Honest weaknesses: no native MVCC (transactions serialize on write, not on snapshot). No bitmap indexes, no GIN, no expression indexes on multiple expressions (single-expression indexes are fine). The query planner is good but simpler than Postgres's — you sometimes need to use INDEXED BY to nudge it. Replication is the user's problem (LiteFS, Litestream, rqlite are third-party). Network access requires a wrapper.

Frequently Asked Questions

Why is SQLite called 'serverless'?

There is no SQLite process. The library is linked directly into your application — Python, mobile app, browser — and reads and writes a single file on disk. Concurrency is coordinated with OS file locks. There is no daemon to start, no port to open, no user to provision. This is why SQLite ships in roughly every phone and browser on the planet: nothing operationally exists to fail.

When should I use SQLite vs Postgres?

Use SQLite when the database lives next to one application instance, the working set fits comfortably on local disk, and write concurrency is moderate. Examples: mobile apps, browser local storage, on-device caches, embedded analytics, single-tenant SaaS, edge deployments (LiteFS, Cloudflare D1). Use Postgres when many independent processes write to the same database, when you need rich types and indexes (GIN, GiST, BRIN), or when network access from arbitrary clients is required.

Can SQLite handle concurrent writes?

In default rollback-journal mode, no — writes serialize behind a database-level write lock. In WAL mode (the modern default for most embeddings), readers do not block writers and writers do not block readers, but only one writer is active at a time. The WAL design is described in detail in the WAL deep dive.

What is the 'in-memory' mode?

Open the filename `:memory:` and SQLite never touches disk — pages live in heap memory. It is a full SQL engine that costs only RAM, ideal for tests, scratch databases inside a process, or temporary tables. Memory databases vanish when the connection closes; for shared in-memory use within a process, use the URI form `file::memory:?cache=shared`.

How big can a SQLite database get?

Theoretical limit: 281 TB (2^48 bytes) at the default 4 KB page size, or 4 PB at 64 KB pages. Practical limits are much lower — the file system, the device, and the reasonable working-set assumptions of SQLite's design make multi-hundred-GB databases the upper end of what's commonly run. SQLite is happiest at sizes from a few KB to ~100 GB.

Is SQLite ACID?

Yes, fully. The atomic commit protocol (rollback journal mode) and the WAL protocol both implement crash-safe ACID with serializable isolation. Durability is the part most often tuned: setting `PRAGMA synchronous=NORMAL` (the default in WAL) reduces fsync frequency in exchange for a tiny window where the last few committed transactions could be lost on power failure — but the database itself remains uncorrupted.

What's special about SQLite's testing?

SQLite has roughly 100x more test code than library code (millions of lines of TCL/SQL test scripts plus aviation-grade DO-178B testing for safety-critical use). Every supported branch and merge of internal C code is verified by 100% MC/DC coverage. This is why SQLite is the only database that ships in the avionics of every Airbus A350.

How does SQLite compare to DuckDB?

Both are embedded — same dependency model, same single-file philosophy. SQLite is row-oriented, optimized for OLTP and small reads; DuckDB is column-oriented, optimized for OLAP scans and aggregates. Use SQLite for transactional workloads (writing log rows, app state, mobile data). Use DuckDB for analytical workloads (scanning a Parquet file, computing aggregates over a few GB of CSVs).