PostgreSQL Write-Ahead Log
Every change to a PostgreSQL database — every INSERT, UPDATE, DELETE, every page split,
every catalog change — is first written to the Write-Ahead Log before the corresponding
data page is modified. The WAL is the source of truth for crash recovery, the transport
for streaming replication, the basis for physical backups via pg_basebackup,
and the feed for logical decoding (CDC). It lives as a stream of 16 MB files in
pg_wal/, organized by 64-bit Log Sequence Numbers (LSNs). This page traces
the WAL from a single backend's XLogInsert through the wal_writer, the
checkpointer, archiving, streaming to replicas, and logical decoding for change data
capture.
WAL Records and the LSN
Every change is a record; every record has a position.
A WAL record is a self-contained unit describing one logical change. The header includes:
XLogRecord (24 bytes):
  uint32         xl_tot_len  - total record length
  TransactionId  xl_xid      - transaction this record belongs to
  XLogRecPtr     xl_prev     - LSN of the previous record (chain)
  uint8          xl_info     - flags (commit, abort, info bits)
  RmgrId         xl_rmid     - resource manager (heap, btree, hash, …)
  pg_crc32c      xl_crc      - CRC of the record
Each resource manager defines its own record types. The heap manager emits HEAP_INSERT, HEAP_UPDATE, HEAP_DELETE, and HEAP_LOCK; the btree manager emits BTREE_INSERT_LEAF, BTREE_SPLIT, and BTREE_DELETE; transaction commit and abort are XLOG_XACT_COMMIT and XLOG_XACT_ABORT. Each record carries the type-specific data needed to redo the change during replay.
LSN (Log Sequence Number) is a 64-bit byte offset into the conceptually infinite WAL
stream, written in hex as XX/YYYYYYYY, e.g. 0/3000148. LSNs are
monotonically increasing: every new record gets a strictly greater LSN than the one
before it. PostgreSQL exposes the current write position via pg_current_wal_lsn().
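As a sketch, the textual form maps to the 64-bit offset by simple hex arithmetic. The helper names here are illustrative, not PostgreSQL APIs:

```python
def lsn_to_int(lsn: str) -> int:
    """Parse the textual XX/YYYYYYYY form into a 64-bit byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def int_to_lsn(pos: int) -> str:
    """Format a byte offset back into PostgreSQL's XX/YYYYYYYY notation."""
    return f"{pos >> 32:X}/{pos & 0xFFFFFFFF:X}"

# Distance between two LSNs is plain subtraction; this is what
# pg_wal_lsn_diff() computes server-side.
print(lsn_to_int("0/3000148") - lsn_to_int("0/3000000"))  # 328 bytes
```

The high 32 bits change so rarely in practice that the low half alone often tells you how far apart two positions are.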
WAL Levels: minimal, replica, logical
How much information goes into the log.
wal_level controls verbosity:
minimal — only enough for crash recovery on this server.
Bulk operations (CREATE TABLE AS, COPY into a freshly created
table) can skip WAL entirely and write straight to disk, with an
fsync at the end. Cheapest, but supports neither streaming
replication nor WAL archiving.
replica — adds enough to drive streaming replicas and PITR.
Bulk operations are fully WAL-logged. Default since PG 10.
logical — adds the row-level detail (full rows or modified columns)
required for logical decoding, at ~10-20% more WAL volume.
Required for logical replication slots, Debezium, etc.
Most installations use replica. Set logical proactively if you
ever expect to use logical replication, CDC, or AWS DMS: the storage cost is small,
and changing it later requires a restart.
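Checking and changing the level is a one-liner; the new value takes effect only after a restart:

```sql
SHOW wal_level;                          -- current value
ALTER SYSTEM SET wal_level = 'logical';  -- persisted to postgresql.auto.conf
-- wal_level is a postmaster parameter: the change applies only after restart
```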
WAL Buffers and the wal_writer
Decoupling the backend's commit from the disk fsync.
Backends call XLogInsert to add a record. The record is copied into the WAL
buffers in shared memory — fast, no I/O. The function returns the new LSN. The backend
continues with its work; if it's a transaction commit, the backend now waits for the WAL
up to that LSN to be durably written.
The wal_writer process writes WAL buffers to pg_wal/ files.
It wakes every wal_writer_delay (default 200 ms), or when buffers fill, and
flushes everything pending. On COMMIT with synchronous_commit=on (the
default), the committing backend itself flushes and fsyncs the WAL up to its
commit record; with synchronous_commit=off, the flush is left to a later
wal_writer cycle.
The win: group commit. Multiple concurrent commits arriving in the same window all get included in one fsync. With 1000 concurrent commits each generating 1 KB of WAL, the total cost is roughly one fsync of 1 MB — instead of 1000 separate fsyncs. This is what gives PostgreSQL its surprisingly good throughput on commit-heavy workloads.
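A toy model makes the batching effect concrete. This is invented arithmetic, not PostgreSQL code: any commit arriving while an fsync is in flight simply rides along with it.

```python
def flushes_needed(commit_times, fsync_cost=50):
    """Count fsyncs when every commit arriving during an in-flight fsync
    piggybacks on it. Times are in 0.1 ms ticks; fsync_cost=50 is 5 ms."""
    flushes, covered = 0, -1
    for t in sorted(commit_times):
        if t > covered:               # no fsync in flight covers this commit
            flushes += 1
            covered = t + fsync_cost  # everything arriving before completion rides along
        # else: absorbed into the in-flight fsync (group commit)
    return flushes

# 1000 commits, one every 0.1 ms: 20 fsyncs instead of 1000
print(flushes_needed(range(1000)))
```

The faster commits arrive relative to fsync latency, the bigger the batches get, which is why throughput degrades gracefully rather than collapsing under commit pressure.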
The Checkpointer
Periodic flush of dirty pages so old WAL can be discarded.
The checkpointer runs every checkpoint_timeout (default 5 min) or when WAL
crosses max_wal_size. Its job:
1. write a CHECKPOINT_BEGIN record to the WAL
2. iterate every dirty page in shared_buffers
3. write each dirty page to its data file (with checksum)
4. fsync all touched data files
5. write a CHECKPOINT_END record with the LSN where this all finished
6. update the control file with the new redo point
7. delete or recycle WAL segments older than the new redo point
After a checkpoint, the WAL is no longer needed for crash recovery up to the new redo
point — every change before it is already on disk in the data files. WAL segments before
that LSN are recycled (renamed for future writes) up to max_wal_size, then
deleted.
Checkpoints are disruptive: they generate a lot of disk I/O in a short window. PostgreSQL
smooths them via checkpoint_completion_target (default 0.9), which spreads
the I/O over 90% of the inter-checkpoint interval rather than as fast as possible.
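Back-of-envelope arithmetic shows what the spreading buys. The buffer size and dirty fraction below are assumptions for illustration:

```python
# Hypothetical instance: 16 GB shared_buffers, a quarter of it dirty.
shared_buffers_mb = 16 * 1024
dirty_mb = shared_buffers_mb * 0.25   # 4096 MB to write per checkpoint
timeout_s = 300                       # checkpoint_timeout = 5 min
target = 0.9                          # checkpoint_completion_target

window_s = timeout_s * target         # 270 s to get all the writes out
rate = dirty_mb / window_s
print(f"{rate:.1f} MB/s")             # ~15.2 MB/s, instead of one big burst
```

A steady ~15 MB/s is barely noticeable on modern storage; the same 4 GB written as fast as possible can starve foreground queries for seconds.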
Archiving and PITR
Continuous WAL backup for point-in-time recovery.
Set archive_mode=on and archive_command='cp %p /backup/%f'
(or anything that copies the file to durable storage). PostgreSQL invokes archive_command
for each completed WAL segment; on success, the segment can be removed locally.
To do a Point-In-Time Recovery:
1. take base backup with pg_basebackup → tarball of the data dir
2. continuously archive WAL → durable storage
When you need to recover to time T:
3. restore the base backup to a fresh data directory
4. set restore_command in postgresql.conf:
restore_command = 'cp /backup/%f %p'
5. set recovery_target_time = 'YYYY-MM-DD HH:MM:SS+TZ'
6. start the server — it replays WAL from the base backup forward and
stops at the first transaction commit at or after the target time.
The result is the database state as it was at time T. PITR is the recommended
foundation for a production backup strategy.
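Steps 3 through 6 can be sketched as shell commands. The data directory and the /backup path are placeholders for your environment:

```shell
PGDATA=./restored_data           # the freshly restored base backup goes here
mkdir -p "$PGDATA"

# point recovery at the WAL archive and set the target time
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'cp /backup/%f %p'
recovery_target_time = '2024-05-01 12:00:00+00'
EOF

# an empty recovery.signal file tells the server to start in recovery mode
touch "$PGDATA/recovery.signal"
# then: pg_ctl -D "$PGDATA" start
```

Since PostgreSQL 12 the recovery settings live in the regular configuration files plus recovery.signal; the old recovery.conf file is gone.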
Streaming Replication
A replica subscribes to the WAL stream.
A replica connects to the primary using libpq, sends a START_REPLICATION
command with its current LSN, and the primary's walsender process
(one per replica) streams every WAL byte from that point forward. The replica's
walreceiver writes the bytes to its own pg_wal/; the startup
process then replays them into the data files.
The replica also sends back flush LSN and apply LSN
acknowledgments so the primary knows progress. synchronous_standby_names
makes commits wait for replica acks; synchronous_commit=remote_apply waits
until the replica has applied (visible to its readers), giving a strict read-your-writes
guarantee.
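The acknowledged LSNs are visible on the primary, and byte lag at each stage is just LSN subtraction:

```sql
-- on the primary: one row per connected walsender
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)  AS flush_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;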
Replication Slots
Persistent state tracking what each consumer needs.
A replication slot is a persistent record on the primary saying "consumer X has consumed WAL up to LSN Y." The primary won't delete WAL with LSN > Y as long as the slot exists. This solves the catch-up problem: a replica that disconnects briefly can resume without a full base backup.
SELECT pg_create_physical_replication_slot('my_replica');
SELECT pg_create_logical_replication_slot('my_cdc', 'pgoutput');
SELECT * FROM pg_replication_slots;
Caveat: an abandoned slot is dangerous. The primary keeps WAL forever for the dead
consumer, eventually filling the disk. Always monitor and drop slots from departed
replicas.
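A monitoring query makes that caveat actionable: how much WAL is each slot pinning?

```sql
SELECT slot_name, slot_type, active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- drop a slot whose consumer is gone for good
SELECT pg_drop_replication_slot('my_replica');
```

A large retained_wal on an inactive slot is the disk-filling scenario in progress.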
Logical Decoding
Extracting row-level changes from the WAL.
With wal_level=logical, the WAL contains enough information to reconstruct
pre- and post-images of changed rows. A logical decoding consumer attaches
to a logical replication slot and receives a stream of (table, op, before, after) events.
Built-in output plugin pgoutput drives PostgreSQL's own logical replication
(publication/subscription). External tools use it too: Debezium produces Kafka events;
AWS DMS copies to other databases; pglogical does cross-version replication.
CREATE PUBLICATION mypub FOR TABLE users, orders;
-- on the subscriber:
CREATE SUBSCRIPTION mysub
  CONNECTION 'host=primary dbname=app'
  PUBLICATION mypub;
-- the subscriber now receives every INSERT/UPDATE/DELETE on users and orders
Limitations: logical decoding only sees committed transactions. It doesn't see DDL by default (table schema changes have to be replicated separately). Large transactions are staged in memory or temp files until commit, then emitted.
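For ad-hoc inspection, the built-in test_decoding plugin renders the change stream as text, and pg_logical_slot_peek_changes reads it without consuming it (peek_demo is a throwaway slot name used here for illustration):

```sql
SELECT pg_create_logical_replication_slot('peek_demo', 'test_decoding');
-- ... run some INSERT/UPDATE/DELETE statements ...
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('peek_demo', NULL, NULL);
-- data lines look like:
--   table public.users: INSERT: id[integer]:1 name[text]:'alice'
SELECT pg_drop_replication_slot('peek_demo');
```

Remember to drop the slot afterwards; like any slot, it pins WAL while it exists.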
Operational Tradeoffs
| Setting | Tradeoff |
|---|---|
| synchronous_commit=on | safe; commit waits for fsync (default) |
| synchronous_commit=off | commit returns before fsync; may lose up to ~3x wal_writer_delay of recent commits on a crash (no corruption); ~30% faster |
| checkpoint_timeout=5min | balanced; default |
| checkpoint_timeout=30min | fewer checkpoints, more WAL retained, longer recovery |
| max_wal_size=10GB | more headroom for write bursts; longer recovery |
| wal_compression=on | ~5-10x smaller full-page images at cost of CPU |
| wal_level=logical | ~15% more WAL; required for CDC |
FAQ
What is wal_level and which should I pick?
wal_level controls how much information is logged. 'minimal' (rare these days) only logs what's needed for crash recovery; can't drive replication or PITR. 'replica' (default) logs enough for streaming replicas and physical backups. 'logical' adds enough for logical decoding (per-row change events with column values), feeding tools like pglogical, Debezium, AWS DMS. logical is required for logical replication slots; the cost is ~10-20% more WAL volume due to including changed columns. Set logical if you might ever need it — the cost is small and changing later requires a restart.
Why are WAL segments exactly 16 MB?
Historical default — a balance between rotation overhead (creating and fsyncing new files too often is expensive) and archive granularity (archive_command runs once per completed segment). The size became configurable at initdb time in PostgreSQL 11; high-throughput systems sometimes raise it, e.g. initdb --wal-segsize=1024 for 1 GB segments. Bigger segments mean fewer file rotations and fewer archive_command invocations, at the cost of larger archive files and more data exposed if a single segment is damaged.
What's the difference between checkpointer and bgwriter?
Both flush dirty buffers from shared_buffers to disk, but for different purposes. The bgwriter runs continuously, lazily flushing pages to keep some clean buffers available for new reads — reduces the number of dirty pages a backend has to flush itself. The checkpointer runs at scheduled intervals (checkpoint_timeout, default 5min) or when WAL crosses max_wal_size. It flushes ALL dirty buffers and writes a checkpoint record to WAL. After a successful checkpoint, all WAL before that record can be discarded. The checkpointer is the durability fence; bgwriter is performance smoothing.
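The split is visible in the statistics views. Through PG 16 both processes report into pg_stat_bgwriter; PG 17 moves the checkpointer counters into a separate pg_stat_checkpointer view:

```sql
SELECT checkpoints_timed,    -- triggered by checkpoint_timeout
       checkpoints_req,      -- triggered by max_wal_size; worth alerting on
       buffers_checkpoint,   -- pages written by the checkpointer
       buffers_clean,        -- pages written by the bgwriter
       buffers_backend       -- pages backends had to write themselves
FROM pg_stat_bgwriter;
```

A high checkpoints_req count means WAL volume is forcing checkpoints early; raising max_wal_size restores the smoother timed cadence.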
How does PITR (Point-In-Time Recovery) work?
Take a base backup (pg_basebackup or filesystem snapshot of the data directory). Continuously archive WAL segments to safe storage (archive_command). To recover to time T: restore the base backup, set recovery target time T in postgresql.conf, start the server. It replays WAL from the base backup's start_lsn forward, applying every record up to a transaction commit at or after time T, then stops. The result is the database state as it was at time T. Critical: archive must keep ALL segments since the base backup; missing one breaks the chain.
How do replication slots avoid losing WAL?
A physical replication slot tracks the LSN that a specific replica has consumed. The primary's WAL retention is bounded below by the minimum slot LSN — the primary refuses to delete WAL that any active slot still needs. This protects against the catch-up problem where a brief replica disconnect would otherwise force a full base backup. Caveat: an abandoned slot prevents WAL deletion forever, filling the disk. Always monitor pg_replication_slots and drop slots from disconnected replicas.
What is the wal_writer process?
A dedicated background process whose only job is to write WAL buffers to disk and fsync them. Backends fill the WAL buffers in shared memory; the wal_writer drains them. wal_writer_delay (default 200ms) is the minimum interval between writes when there's no commit pressure. On commit, the committing backend itself fsyncs (or waits for wal_writer to do so under synchronous_commit=off). Decoupling write-from-buffer from fsync is what lets PostgreSQL batch fsyncs across multiple commits — the famous 'group commit' optimization.