PostgreSQL Write-Ahead Log
Every change to a PostgreSQL database — every INSERT, UPDATE, DELETE, every page split,
every catalog change — is first written to the Write-Ahead Log before the corresponding
data page is modified. The WAL is the source of truth for crash recovery, the transport
for streaming replication, the basis for physical backups via pg_basebackup,
and the feed for logical decoding (CDC). It lives as a stream of 16 MB files in
pg_wal/, organized by 64-bit Log Sequence Numbers (LSNs). This page traces
the WAL from a single backend's XLogInsert through the wal_writer, the
checkpointer, archiving, streaming to replicas, and logical decoding for change data
capture.
WAL Records and the LSN
Every change is a record; every record has a position.
A WAL record is a self-contained unit describing one logical change. The header includes:
XLogRecord (24 bytes):
  uint32         xl_tot_len  - total record length
  TransactionId  xl_xid      - transaction this record belongs to
  XLogRecPtr     xl_prev     - LSN of the previous record (chain)
  uint8          xl_info     - flags (commit, abort, info bits)
  RmgrId         xl_rmid     - resource manager (heap, btree, hash, …)
  pg_crc32c      xl_crc      - CRC of the record
Each resource manager defines its own record types. The heap manager emits HEAP_INSERT, HEAP_UPDATE, HEAP_DELETE, and HEAP_LOCK; the btree manager emits BTREE_INSERT_LEAF, BTREE_SPLIT, and BTREE_DELETE; transaction commit and abort are XLOG_XACT_COMMIT and XLOG_XACT_ABORT. Each record carries the type-specific data needed to redo the change during replay.
LSN (Log Sequence Number) is a 64-bit byte offset into the conceptually infinite WAL
stream, written in hex as XX/YYYYYYYY, e.g. 0/3000148. LSNs are
monotonically increasing: every new record gets a strictly greater LSN than the one
before it. PostgreSQL exposes the current write position via pg_current_wal_lsn().
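As a sketch, the textual form maps to the 64-bit offset by simple hex arithmetic. The helper names here are illustrative, not PostgreSQL APIs:

```python
def lsn_to_int(lsn: str) -> int:
    """Parse the textual XX/YYYYYYYY form into a 64-bit byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def int_to_lsn(pos: int) -> str:
    """Format a byte offset back into PostgreSQL's XX/YYYYYYYY notation."""
    return f"{pos >> 32:X}/{pos & 0xFFFFFFFF:X}"

# Distance between two LSNs is plain subtraction; this is what
# pg_wal_lsn_diff() computes server-side.
print(lsn_to_int("0/3000148") - lsn_to_int("0/3000000"))  # 328 bytes
```

The high 32 bits change so rarely in practice that the low half alone often tells you how far apart two positions are.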
WAL Levels: minimal, replica, logical
How much information goes into the log.
wal_level controls verbosity:
minimal — only enough for crash recovery on this server.
Bulk operations (CREATE TABLE AS, COPY into a freshly created
table) can skip WAL entirely and write straight to disk, with an
fsync at the end. Cheapest, but supports neither streaming
replication nor WAL archiving.
replica — adds enough to drive streaming replicas and PITR.
Bulk operations are fully WAL-logged. Default since PG 10.
logical — adds the row-level detail (full rows or modified columns)
required for logical decoding, at ~10-20% more WAL volume.
Required for logical replication slots, Debezium, etc.
Most installations use replica. Set logical proactively if you
ever expect to use logical replication, CDC, or AWS DMS: the storage cost is small,
and changing it later requires a restart.
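Checking and changing the level is a one-liner; the new value takes effect only after a restart:

```sql
SHOW wal_level;                          -- current value
ALTER SYSTEM SET wal_level = 'logical';  -- persisted to postgresql.auto.conf
-- wal_level is a postmaster parameter: the change applies only after restart
```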
WAL Buffers and the wal_writer
Decoupling the backend's commit from the disk fsync.
Backends call XLogInsert to add a record. The record is copied into the WAL
buffers in shared memory — fast, no I/O. The function returns the new LSN. The backend
continues with its work; if it's a transaction commit, the backend now waits for the WAL
up to that LSN to be durably written.
The wal_writer process writes WAL buffers to pg_wal/ files.
It wakes every wal_writer_delay (default 200 ms), or when buffers fill, and
flushes everything pending. On COMMIT with synchronous_commit=on (the
default), the committing backend itself flushes and fsyncs the WAL up to its
commit record; with synchronous_commit=off, the flush is left to a later
wal_writer cycle.
The win: group commit. Multiple concurrent commits arriving in the same window all get included in one fsync. With 1000 concurrent commits each generating 1 KB of WAL, the total cost is roughly one fsync of 1 MB — instead of 1000 separate fsyncs. This is what gives PostgreSQL its surprisingly good throughput on commit-heavy workloads.
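A toy model makes the batching effect concrete. This is invented arithmetic, not PostgreSQL code: any commit arriving while an fsync is in flight simply rides along with it.

```python
def flushes_needed(commit_times, fsync_cost=50):
    """Count fsyncs when every commit arriving during an in-flight fsync
    piggybacks on it. Times are in 0.1 ms ticks; fsync_cost=50 is 5 ms."""
    flushes, covered = 0, -1
    for t in sorted(commit_times):
        if t > covered:               # no fsync in flight covers this commit
            flushes += 1
            covered = t + fsync_cost  # everything arriving before completion rides along
        # else: absorbed into the in-flight fsync (group commit)
    return flushes

# 1000 commits, one every 0.1 ms: 20 fsyncs instead of 1000
print(flushes_needed(range(1000)))
```

The faster commits arrive relative to fsync latency, the bigger the batches get, which is why throughput degrades gracefully rather than collapsing under commit pressure.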
The Checkpointer
Periodic flush of dirty pages so old WAL can be discarded.
The checkpointer runs every checkpoint_timeout (default 5 min) or when WAL
crosses max_wal_size. Its job:
1. write a CHECKPOINT_BEGIN record to the WAL
2. iterate every dirty page in shared_buffers
3. write each dirty page to its data file (with checksum)
4. fsync all touched data files
5. write a CHECKPOINT_END record with the LSN where this all finished
6. update the control file with the new redo point
7. delete or recycle WAL segments older than the new redo point
After a checkpoint, the WAL is no longer needed for crash recovery up to the new redo
point — every change before it is already on disk in the data files. WAL segments before
that LSN are recycled (renamed for future writes) up to max_wal_size, then
deleted.
Checkpoints are disruptive: they generate a lot of disk I/O in a short window. PostgreSQL
smooths them via checkpoint_completion_target (default 0.9), which spreads
the I/O over 90% of the inter-checkpoint interval rather than as fast as possible.
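Back-of-envelope arithmetic shows what the spreading buys. The buffer size and dirty fraction below are assumptions for illustration:

```python
# Hypothetical instance: 16 GB shared_buffers, a quarter of it dirty.
shared_buffers_mb = 16 * 1024
dirty_mb = shared_buffers_mb * 0.25   # 4096 MB to write per checkpoint
timeout_s = 300                       # checkpoint_timeout = 5 min
target = 0.9                          # checkpoint_completion_target

window_s = timeout_s * target         # 270 s to get all the writes out
rate = dirty_mb / window_s
print(f"{rate:.1f} MB/s")             # ~15.2 MB/s, instead of one big burst
```

A steady ~15 MB/s is barely noticeable on modern storage; the same 4 GB written as fast as possible can starve foreground queries for seconds.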
Archiving and PITR
Continuous WAL backup for point-in-time recovery.
Set archive_mode=on and archive_command='cp %p /backup/%f'
(or anything that copies the file to durable storage). PostgreSQL invokes archive_command
for each completed WAL segment; on success, the segment can be removed locally.
To do a Point-In-Time Recovery:
1. take base backup with pg_basebackup → tarball of the data dir
2. continuously archive WAL → durable storage
When you need to recover to time T:
3. restore the base backup to a fresh data directory
4. set restore_command in postgresql.conf:
restore_command = 'cp /backup/%f %p'
5. set recovery_target_time = 'YYYY-MM-DD HH:MM:SS+TZ'
6. start the server — it replays WAL from the base backup forward and
stops at the first transaction commit at or after the target time.
The result is the database state as it was at time T. PITR is the recommended
foundation for a production backup strategy.
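Steps 3 through 6 can be sketched as shell commands. The data directory and the /backup path are placeholders for your environment:

```shell
PGDATA=./restored_data           # the freshly restored base backup goes here
mkdir -p "$PGDATA"

# point recovery at the WAL archive and set the target time
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'cp /backup/%f %p'
recovery_target_time = '2024-05-01 12:00:00+00'
EOF

# an empty recovery.signal file tells the server to start in recovery mode
touch "$PGDATA/recovery.signal"
# then: pg_ctl -D "$PGDATA" start
```

Since PostgreSQL 12 the recovery settings live in the regular configuration files plus recovery.signal; the old recovery.conf file is gone.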
Streaming Replication
A replica subscribes to the WAL stream.
A replica connects to the primary using libpq, sends a START_REPLICATION
command with its current LSN, and the primary's walsender process
(one per replica) streams every WAL byte from that point forward. The replica's
walreceiver writes the bytes to its own pg_wal/; the startup
process then replays them into the data files.
The replica also sends back flush LSN and apply LSN
acknowledgments so the primary knows progress. synchronous_standby_names
makes commits wait for replica acks; synchronous_commit=remote_apply waits
until the replica has applied (visible to its readers), giving a strict read-your-writes
guarantee.
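The acknowledged LSNs are visible on the primary, and byte lag at each stage is just LSN subtraction:

```sql
-- on the primary: one row per connected walsender
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)  AS flush_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;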
Replication Slots
Persistent state tracking what each consumer needs.
A replication slot is a persistent record on the primary saying "consumer X has consumed WAL up to LSN Y." The primary won't delete WAL with LSN > Y as long as the slot exists. This solves the catch-up problem: a replica that disconnects briefly can resume without a full base backup.
SELECT pg_create_physical_replication_slot('my_replica');
SELECT pg_create_logical_replication_slot('my_cdc', 'pgoutput');
SELECT * FROM pg_replication_slots;
Caveat: an abandoned slot is dangerous. The primary keeps WAL forever for the dead
consumer, eventually filling the disk. Always monitor and drop slots from departed
replicas.
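A monitoring query makes that caveat actionable: how much WAL is each slot pinning?

```sql
SELECT slot_name, slot_type, active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- drop a slot whose consumer is gone for good
SELECT pg_drop_replication_slot('my_replica');
```

A large retained_wal on an inactive slot is the disk-filling scenario in progress.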
Logical Decoding
Extracting row-level changes from the WAL.
With wal_level=logical, the WAL contains enough information to reconstruct
pre- and post-images of changed rows. A logical decoding consumer attaches
to a logical replication slot and receives a stream of (table, op, before, after) events.
Built-in output plugin pgoutput drives PostgreSQL's own logical replication
(publication/subscription). External tools use it too: Debezium produces Kafka events;
AWS DMS copies to other databases; pglogical does cross-version replication.
CREATE PUBLICATION mypub FOR TABLE users, orders;
-- on the subscriber:
CREATE SUBSCRIPTION mysub
  CONNECTION 'host=primary dbname=app'
  PUBLICATION mypub;
-- the subscriber now receives every INSERT/UPDATE/DELETE on users and orders
Limitations: logical decoding only sees committed transactions. It doesn't see DDL by default (table schema changes have to be replicated separately). Large transactions are staged in memory or temp files until commit, then emitted.
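For ad-hoc inspection, the built-in test_decoding plugin renders the change stream as text, and pg_logical_slot_peek_changes reads it without consuming it (peek_demo is a throwaway slot name used here for illustration):

```sql
SELECT pg_create_logical_replication_slot('peek_demo', 'test_decoding');
-- ... run some INSERT/UPDATE/DELETE statements ...
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('peek_demo', NULL, NULL);
-- data lines look like:
--   table public.users: INSERT: id[integer]:1 name[text]:'alice'
SELECT pg_drop_replication_slot('peek_demo');
```

Remember to drop the slot afterwards; like any slot, it pins WAL while it exists.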
Operational Tradeoffs
| Setting | Tradeoff |
|---|---|
| synchronous_commit=on | safe; commit waits for fsync (default) |
| synchronous_commit=off | commit returns before fsync; may lose up to ~3x wal_writer_delay of recent commits on a crash (no corruption); ~30% faster |
| checkpoint_timeout=5min | balanced; default |
| checkpoint_timeout=30min | fewer checkpoints, more WAL retained, longer recovery |
| max_wal_size=10GB | more headroom for write bursts; longer recovery |
| wal_compression=on | ~5-10x smaller full-page images at cost of CPU |
| wal_level=logical | ~15% more WAL; required for CDC |
FAQ
What is wal_level and which should I pick?
wal_level controls how much information is logged. 'minimal' (rare these days) only logs what's needed for crash recovery; can't drive replication or PITR. 'replica' (default) logs enough for streaming replicas and physical backups. 'logical' adds enough for logical decoding (per-row change events with column values), feeding tools like pglogical, Debezium, AWS DMS. logical is required for logical replication slots; the cost is ~10-20% more WAL volume due to including changed columns. Set logical if you might ever need it — the cost is small and changing later requires a restart.
Why are WAL segments exactly 16 MB?
Historical default — a balance between rotation overhead (creating and fsyncing new files too often is expensive) and archive granularity (archive_command runs once per completed segment). The size became configurable at initdb time in PostgreSQL 11; high-throughput systems sometimes raise it, e.g. initdb --wal-segsize=1024 for 1 GB segments. Bigger segments mean fewer file rotations and fewer archive_command invocations, at the cost of larger archive files and more data exposed if a single segment is damaged.
What's the difference between checkpointer and bgwriter?
Both flush dirty buffers from shared_buffers to disk, but for different purposes. The bgwriter runs continuously, lazily flushing pages to keep some clean buffers available for new reads — reduces the number of dirty pages a backend has to flush itself. The checkpointer runs at scheduled intervals (checkpoint_timeout, default 5min) or when WAL crosses max_wal_size. It flushes ALL dirty buffers and writes a checkpoint record to WAL. After a successful checkpoint, all WAL before that record can be discarded. The checkpointer is the durability fence; bgwriter is performance smoothing.
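The split is visible in the statistics views. Through PG 16 both processes report into pg_stat_bgwriter; PG 17 moves the checkpointer counters into a separate pg_stat_checkpointer view:

```sql
SELECT checkpoints_timed,    -- triggered by checkpoint_timeout
       checkpoints_req,      -- triggered by max_wal_size; worth alerting on
       buffers_checkpoint,   -- pages written by the checkpointer
       buffers_clean,        -- pages written by the bgwriter
       buffers_backend       -- pages backends had to write themselves
FROM pg_stat_bgwriter;
```

A high checkpoints_req count means WAL volume is forcing checkpoints early; raising max_wal_size restores the smoother timed cadence.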
How does PITR (Point-In-Time Recovery) work?
Take a base backup (pg_basebackup or filesystem snapshot of the data directory). Continuously archive WAL segments to safe storage (archive_command). To recover to time T: restore the base backup, set recovery target time T in postgresql.conf, start the server. It replays WAL from the base backup's start_lsn forward, applying every record up to a transaction commit at or after time T, then stops. The result is the database state as it was at time T. Critical: archive must keep ALL segments since the base backup; missing one breaks the chain.
How do replication slots avoid losing WAL?
A physical replication slot tracks the LSN that a specific replica has consumed. The primary's WAL retention is bounded below by the minimum slot LSN — the primary refuses to delete WAL that any active slot still needs. This protects against the catch-up problem where a brief replica disconnect would otherwise force a full base backup. Caveat: an abandoned slot prevents WAL deletion forever, filling the disk. Always monitor pg_replication_slots and drop slots from disconnected replicas.
What is the wal_writer process?
A dedicated background process whose only job is to write WAL buffers to disk and fsync them. Backends fill the WAL buffers in shared memory; the wal_writer drains them. wal_writer_delay (default 200ms) is the minimum interval between writes when there's no commit pressure. On commit, the committing backend itself fsyncs (or waits for wal_writer to do so under synchronous_commit=off). Decoupling write-from-buffer from fsync is what lets PostgreSQL batch fsyncs across multiple commits — the famous 'group commit' optimization.