SQLite Internals
SQLite is a full SQL database compiled into a library. It runs without a server, stores a database in a single file, and offers ACID transactions with crash recovery. The library is around 200 KB of C, has more test code than every other open-source database combined, and ships in roughly every phone, every browser, every airplane, and most cars on the planet. These pages explain why and how — page format, B-trees, the WAL, the virtual table mechanism, and the testing culture that makes SQLite the world's most-deployed database.
Based on the SQLite source, the official SQLite documentation, and Richard Hipp's published talks. Original content.
SQLite Architecture
Key Numbers
Why SQLite Exists
Write-Ahead Logging (WAL)
How SQLite's WAL mode works in detail — file format, wal-index, checkpoints, reader/writer concurrency, and the wal2 extension
B-Tree Storage
Coming soon: how SQLite lays out table and index B-trees, page splits, overflow chains, and the freelist
File Format
Coming soon: the 100-byte database header, page types, the lock-byte page, and how file size is bounded
Virtual Tables
Coming soon: the xBestIndex / xFilter API that lets FTS5, R*Tree, and the json_each extension look like normal tables
Atomic Commit Protocol
Coming soon: how SQLite guarantees ACID through the rollback journal — the original durability protocol before WAL
The Serverless Model
SQLite is not a server. It is a library that touches a file.
A traditional database is a process. The application speaks a wire protocol over TCP or
a Unix socket; the server has its own user, its own port, its own buffer pool, its own
logs. SQLite removes the process. libsqlite3 compiles into your binary; a
sqlite3_open() call returns a handle that reads and writes a database file
directly through normal POSIX (or Windows, or any custom VFS) calls.
Concurrency is coordinated by file locks: SHARED, RESERVED,
PENDING, EXCLUSIVE. The kernel arbitrates which process holds
which lock; SQLite reads its own pager state from the file to decide what is consistent.
In WAL mode there is also a small -shm shared-memory file for the wal-index.
The serverless model has three consequences. First, no operational presence: there is
nothing running for an attacker to find and nothing for an SRE to monitor. Second, no
network attack surface — but also no network access; the database is reachable only by
processes that can open(2) the file. Third, transactions are durable to
whatever fsync the host file system provides, and SQLite is meticulous about choosing
sync points correctly.
The File Format
Everything SQLite stores lives in fixed-size pages inside one file.
A SQLite file begins with a 100-byte database header. It contains the
magic string "SQLite format 3\0", the page size (default 4096, range
512–65536, must be a power of two), the file format version, the user-visible schema
cookie, the change counter, the freelist trunk page number, the database size in pages,
and a few flags for application-level compatibility.
The rest of the file is a sequence of pages of identical size. Page 1 contains the
database header in its first 100 bytes and the start of the sqlite_schema
table after that — which is itself a B-tree storing the SQL CREATE statements for every
other table and index. Every other page is one of: a table B-tree page
(interior or leaf), an index B-tree page (interior or leaf), an
overflow page (holds the tail of a row whose payload exceeds the
per-page budget), a freelist trunk page, a freelist leaf page,
or a pointer-map page (in auto-vacuum mode).
The freelist is SQLite's deferred-deallocation mechanism. When a row is deleted and a page becomes empty, the page is added to the freelist instead of being released to the file system. Future inserts pull from the freelist before extending the file. The freelist is a singly-linked list of trunk pages, each of which holds page numbers of leaf pages available for reuse — a two-level structure to keep the linked-list traversal short.
B-Tree Storage
SQLite is a B-tree database. Tables are B-trees; indexes are B-trees.
Every table in SQLite is stored as a B-tree keyed on the row's
ROWID (a 64-bit integer). If you declare INTEGER PRIMARY KEY,
that column is the rowid; otherwise SQLite assigns one. The B-tree is what
SQLite calls a "table b-tree": leaf pages store the row's payload (the serialized
column values) in the leaves, with interior pages containing only rowid keys + child
pointers.
Every index (other than the implicit rowid index of a table) is stored as a separate "index b-tree". Leaf pages here store the indexed column values plus the rowid as a pointer back to the table. Looking up a row by an indexed column means: walk the index b-tree to find the rowid, then walk the table b-tree by rowid to fetch the payload — two B-tree descents, two cache-line-friendly walks of the height-3-or-4 trees typical at SQLite's scale.
A WITHOUT ROWID table is different: the table's primary key columns are the B-tree key directly, no rowid layer in between. This saves the second B-tree descent for primary-key lookups and lets the leaf store the row in the table's own key-order. The cost is that secondary indexes must store the full primary key (which may be wider than 8 bytes) instead of a compact rowid.
Page splits happen when an insert into a leaf would exceed the page size. SQLite uses a simple split-down-the-middle heuristic plus a "balance" routine that may redistribute cells across siblings to defer splits. Long values (BLOBs, big strings) that don't fit in a leaf are spilled to an overflow chain: a singly-linked list of pages holding the tail of the value. The leaf cell stores the prefix and the first overflow page number.
The Atomic Commit Protocol (Rollback Journal)
The original durability mechanism — superseded by WAL for most workloads, still the default in some embeddings.
In rollback-journal mode, the protocol for a write transaction is:
- Acquire a RESERVED lock on the database (one writer allowed).
- Open the journal file (
filename-journal). Write its header. fsync. - For each page about to be modified, copy the original page into the journal. fsync the journal.
- Modify the database pages in memory and write them to the database file.
- Acquire EXCLUSIVE lock. fsync the database file.
- Delete (or truncate, or zero-the-header on, depending on
journal_mode) the journal. fsync the directory. - Release the lock.
If the process crashes after step 3 but before step 6, the journal exists with the original pages. Recovery: any reader noticing a non-empty journal copies the original pages back into the database — undoing the partial transaction. If the crash is after step 6, the journal is gone and the new database state is committed.
This protocol is bulletproof but writes pages twice (once to journal, once to the file) and forces multiple fsyncs per transaction. Hence WAL.
Write-Ahead Logging (WAL)
Modern default: a separate log file, readers and writers don't block each other.
In WAL mode, modifications go to a separate filename-wal file, not into the
main database. Readers continue to read from the main database (and any frames in the
WAL that precede a snapshot point they captured at transaction start). Writers append
new frames to the WAL. A checkpoint periodically copies WAL frames back
into the main database so the WAL doesn't grow forever.
The big consequences: readers and writers no longer fight for the same pages. Multi-reader / single-writer concurrency becomes the natural default. fsync overhead per commit drops because only the WAL needs to sync, not the main database. And since the WAL is append-only during normal operation, sequential I/O dominates — much friendlier to SSDs and rotational disks alike.
The full mechanics — frame format, the shared-memory wal-index, checkpoint modes, the reader/writer dance — are covered in the dedicated WAL deep dive: SQLite WAL Internals.
Virtual Tables
A pluggable table API that turned SQLite into an extension platform.
A virtual table looks and behaves like a normal table to SQL — you can
SELECT from it, JOIN with it, filter it — but its storage is whatever the extension
author wants: a CSV file, a remote API, an in-memory data structure, an inverted index.
The mechanism is a C struct of function pointers (sqlite3_module) that
SQLite calls during query execution.
The most important callback is xBestIndex. The optimizer asks the virtual
table: "given these WHERE clauses and ORDER BY columns, what does each constraint cost?"
The vtab responds with a plan. xFilter then opens a cursor configured by
that plan, and xNext / xColumn stream rows back. This is the
protocol that lets FTS5 turn its inverted index into a virtual table,
lets R*Tree expose spatial range queries, and lets
json_each / json_tree stream the elements of a JSON document as rows.
The virtual table mechanism turned SQLite into a query engine over arbitrary data. CSV
files, sequence generators (generate_series), foreign databases via
dbstat, and external schemas all become first-class SQL citizens. The idea
predates "polyglot persistence" by a decade.
FTS5 — Full-Text Search
An inverted index implemented as a virtual table.
FTS5 is SQLite's full-text search extension. You declare a virtual table with the
fts5 module and one or more text columns; FTS5 maintains an inverted index
from terms to (rowid, position) tuples. A MATCH query parses an FTS query syntax,
walks the index, scores hits with BM25, and returns ranked rowids. Token streams support
ASCII, Unicode61 (Unicode-normalizing), Porter stemming, and trigrams.
The implementation is interesting because of contentless tables. By
default FTS5 stores both the inverted index and a copy of the original text.
Setting content='' tells FTS5 to keep only the index — useful when your
original text already lives in another table or another system, and you want only the
searchable bits in SQLite. The trade-off is that snippets and highlights need access to
the source text, which a contentless table cannot provide.
R*Tree and JSON1
Two more virtual-table-driven extensions.
R*Tree is SQLite's spatial index. It stores N-dimensional bounding boxes and supports range queries ("find all rectangles intersecting this rectangle"). The implementation is a virtual table backed by hidden shadow tables; you can use R*Tree to index latitude/longitude bounding boxes for geographic search, time intervals for calendar overlap detection, or even N-dimensional feature vectors for crude nearest-box queries.
JSON1 ships built into modern SQLite. It does not store JSON in a
special type — JSON is text in TEXT columns. What JSON1 adds is a set of functions
(json_extract, json_set, json_array_length) plus
the json_each and json_tree table-valued functions. These let
you SELECT against JSON content as if it were structured: SELECT value FROM
json_each(meta, '$.tags'). Combined with generated columns and indexes on JSON
paths, you get something close to JSONB performance with much less storage complexity.
Transactions: Deferred, Immediate, Exclusive
SQLite transactions don't all behave the same. Pick the right kind.
BEGIN alone is the same as BEGIN DEFERRED. No lock is taken
until the transaction reads or writes. The first read takes a SHARED lock; the first
write takes a RESERVED lock. The risk: if two connections both start a deferred
transaction, both read, and then both attempt to upgrade to RESERVED, one will get
SQLITE_BUSY.
BEGIN IMMEDIATE takes a RESERVED lock immediately. No other connection can
write while you hold it, but other readers proceed. This is the right choice when you
know you will write — it converts late SQLITE_BUSY failures into early
ones, and avoids the deadlock window of two upgraders.
BEGIN EXCLUSIVE takes an EXCLUSIVE lock immediately, blocking everything
including readers. In WAL mode this is rarely needed; in rollback journal mode it can
be used for catalog-style operations that must be globally serialized.
Memory Mode and Mmap Mode
Two ways to make SQLite go faster, both with caveats.
Open :memory: and SQLite never opens a file. Pages live in heap; the whole
database vanishes when the connection closes. Useful for unit tests, caches, and
ephemeral computations. Cross-connection sharing within a process: open the URI
file::memory:?cache=shared with SQLITE_OPEN_URI.
PRAGMA mmap_size turns on memory-mapped I/O. Instead of read()-ing
pages into the page cache, SQLite mmaps the database file and accesses pages directly
from the page cache the kernel maintains. For read-heavy workloads on large databases
this can halve I/O latency. The risks: any I/O error becomes a SIGBUS instead of a
readable error code, and writes still go through the regular write path (mmap is read-only
from SQLite's perspective).
The combination of WAL mode, a generous cache_size (negative values = KiB),
and a sane synchronous setting (NORMAL is safe; FULL is paranoid) is the
configuration that wins almost every benchmark.
SQLite vs DuckDB vs Postgres
| SQLite | DuckDB | PostgreSQL | |
|---|---|---|---|
| Deployment | Embedded library | Embedded library | Client-server |
| Storage layout | Row (B-tree) | Columnar segments | Row (heap) |
| Optimized for | OLTP, small transactions | OLAP, analytical scans | Mixed OLTP/OLAP |
| Concurrency | WAL: multi-reader, 1 writer | 1 writer, multi-reader | Full MVCC |
| Index types | B-tree, FTS5, R*Tree (vtab) | ART (Adaptive Radix Tree) | B-tree, GIN, GiST, BRIN, hash |
| Wire protocol | None (in-process) | None (in-process) | libpq / wire protocol |
| Best for | Mobile, edge, app state | Local analytics, data science | Multi-user transactional apps |
Tradeoffs and When to Use SQLite
SQLite is not always the answer. Be honest about what it can't do.
Use SQLite when: the database is bound to one application, the working set fits on local disk, write concurrency is moderate, and you value zero operational surface. iPhone apps, browser storage, IoT logs, single-tenant SaaS where each tenant gets a file (LiteFS, Cloudflare D1, Turso), embedded analytics, on-device ML feature stores, configuration stores.
Don't use SQLite when: many independent processes need to write concurrently to the same database. The single-writer model is fundamental — even WAL mode allows only one writer at a time. Don't use it for high-throughput multi-tenant OLTP where dozens of writers are common; use Postgres or MySQL. Don't use it as a shared store across a fleet of stateless services unless you're willing to put a proxy in front (LiteFS Cloud, rqlite).
Honest weaknesses: no native MVCC (transactions serialize on write,
not on snapshot). No bitmap indexes, no GIN, no expression indexes on multiple
expressions (single-expression indexes are fine). The query planner is good but simpler
than Postgres's — you sometimes need to use INDEXED BY to nudge it.
Replication is the user's problem (LiteFS, Litestream, rqlite are third-party). Network
access requires a wrapper.
Frequently Asked Questions
Why is SQLite called 'serverless'?
There is no SQLite process. The library is linked directly into your application — Python, mobile app, browser — and reads and writes a single file on disk. Concurrency is coordinated with OS file locks. There is no daemon to start, no port to open, no user to provision. This is why SQLite ships in roughly every phone and browser on the planet: nothing operationally exists to fail.
When should I use SQLite vs Postgres?
Use SQLite when the database lives next to one application instance, the working set fits comfortably on local disk, and write concurrency is moderate. Examples: mobile apps, browser local storage, on-device caches, embedded analytics, single-tenant SaaS, edge deployments (LiteFS, Cloudflare D1). Use Postgres when many independent processes write to the same database, when you need rich types and indexes (GIN, GiST, BRIN), or when network access from arbitrary clients is required.
Can SQLite handle concurrent writes?
In default rollback-journal mode, no — writes serialize behind a database-level write lock. In WAL mode (the modern default for most embeddings), readers do not block writers and writers do not block readers, but only one writer is active at a time. The WAL design is described in detail in the WAL deep dive.
What is the 'in-memory' mode?
Open the filename `:memory:` and SQLite never touches disk — pages live in heap memory. It is a full SQL engine that costs only RAM, ideal for tests, scratch databases inside a process, or temporary tables. Memory databases vanish when the connection closes; for shared in-memory use within a process, use the URI form `file::memory:?cache=shared`.
How big can a SQLite database get?
Theoretical limit: 281 TB (2^48 bytes) at the default 4 KB page size, or 4 PB at 64 KB pages. Practical limits are much lower — the file system, the device, and the reasonable working-set assumptions of SQLite's design make multi-hundred-GB databases the upper end of what's commonly run. SQLite is happiest at sizes from a few KB to ~100 GB.
Is SQLite ACID?
Yes, fully. The atomic commit protocol (rollback journal mode) and the WAL protocol both implement crash-safe ACID with serializable isolation. Durability is the part most often tuned: setting `PRAGMA synchronous=NORMAL` (the default in WAL) reduces fsync frequency in exchange for a tiny window where the last few committed transactions could be lost on power failure — but the database itself remains uncorrupted.
What's special about SQLite's testing?
SQLite has roughly 100x more test code than library code (millions of lines of TCL/SQL test scripts plus aviation-grade DO-178B testing for safety-critical use). Every supported branch and merge of internal C code is verified by 100% MC/DC coverage. This is why SQLite is the only database that ships in the avionics of every Airbus A350.
How does SQLite compare to DuckDB?
Both are embedded — same dependency model, same single-file philosophy. SQLite is row-oriented, optimized for OLTP and small reads; DuckDB is column-oriented, optimized for OLAP scans and aggregates. Use SQLite for transactional workloads (writing log rows, app state, mobile data). Use DuckDB for analytical workloads (scanning a Parquet file, computing aggregates over a few GB of CSVs).