Redis Persistence
Redis is a memory-resident database that pretends to be durable. The pretense is held up by two
independent mechanisms: RDB, a periodic binary snapshot of the entire keyspace produced by a
forked child, and AOF, an append-only log of every write command since the last rewrite. Each
has its own failure mode and recovery profile, and modern deployments typically use both
simultaneously through the hybrid aof-use-rdb-preamble format. This page walks
through how each format is structured on disk, what the fsync knobs actually buy you, how AOF
rewrite avoids unbounded log growth, and how persistence interacts with replication.
RDB: The Snapshot Format
A point-in-time binary dump of every database, written by a forked child.
RDB is triggered by the SAVE command (synchronous, blocks the main thread —
almost never used in production), BGSAVE (asynchronous, forks a child), the
save directive in redis.conf, or replication. The default save policy is the
famous three-rule cascade:
    save 3600 1      # after 3600s if at least 1 key changed
    save 300 100     # after 300s if at least 100 keys changed
    save 60 10000    # after 60s if at least 10000 keys changed
The semantics: a BGSAVE is triggered when any rule's time window has elapsed and that rule's threshold of changes has been met. So a busy server saves every minute; a quiet one saves once an hour. Each successful save resets the change counter.
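The rule evaluation described above can be sketched in a few lines of Python. This is an illustration rather than Redis's own code: `SaveRule` and `should_bgsave` are hypothetical names, and the exact comparison at the boundary (`>=` versus `>`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaveRule:
    seconds: int  # minimum time since the last successful save
    changes: int  # minimum number of changes since that save


# The three default rules quoted above.
DEFAULT_RULES = (SaveRule(3600, 1), SaveRule(300, 100), SaveRule(60, 10000))


def should_bgsave(elapsed_seconds, dirty, rules=DEFAULT_RULES):
    """True if any single rule has both its window elapsed and its change count met.

    Boundary handling (>= here) is an assumption made for this sketch.
    """
    return any(elapsed_seconds >= r.seconds and dirty >= r.changes for r in rules)


print(should_bgsave(61, 10_000))  # True: the 60s/10000 rule fires
print(should_bgsave(61, 500))     # False: too few changes for 60s, too early for 300s
print(should_bgsave(3601, 1))     # True: the hourly rule fires on a quiet server
```

Because the rules are OR-ed, adding a rule can only make saves more frequent, never less.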
The on-disk format starts with the magic REDIS followed by a 4-byte ASCII
version number (0011 at the time of writing; the version is incremented when
new value encodings or metadata are added, and recent Redis can read older RDBs but not
vice versa). Then comes a stream of records, each introduced by a one-byte opcode or value-type tag:
    REDIS0011            magic + version
    FA redis-ver 7.2.4   AUX field: server version
    FA redis-bits 64     AUX field: pointer width
    FA ctime             AUX field: time of dump
    FA used-mem          AUX field: dataset size hint
    FE 00                SELECTDB: database 0
    FB                   RESIZEDB: pre-size the dict
    00                   STRING (type 0)
    01                   LIST (encoded variant)
    04                   HASH
    FD                   key with EXPIRE
    FF                   end of file + checksum
Notice that string values that parse as integers are stored in a compact binary integer
encoding (8, 16, or 32 bits) rather than ASCII, small list and hash values use the same
listpack encoding they have in memory (Redis can memcpy the in-memory representation
directly into the RDB for many encodings), and the file ends with a CRC64 of all preceding
bytes. On load, a non-matching CRC aborts with a clear error unless rdbchecksum no was set.
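As a small illustration of the header layout, the sketch below checks the 5-byte magic and decodes the 4-digit ASCII version. `parse_rdb_header` is a hypothetical helper name; it reads only the first nine bytes and does not attempt to parse records or verify the checksum.

```python
def parse_rdb_header(data: bytes) -> int:
    """Return the RDB format version from the 9-byte header, or raise ValueError."""
    if len(data) < 9:
        raise ValueError("too short to be an RDB file")
    magic, version = data[:5], data[5:9]
    if magic != b"REDIS":
        raise ValueError(f"bad magic: {magic!r}")
    if not version.isdigit():
        raise ValueError(f"bad version field: {version!r}")
    return int(version)


print(parse_rdb_header(b"REDIS0011\xfa"))  # 11
```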
Because the dump is produced by a forked child walking the parent's memory, every page touched during the walk that the parent later writes to gets COW-duplicated. A 50 GB Redis taking 30s to dump under a 10 MB/s write rate adds about 300 MB of COW pages. The fork(2) latency itself is the bigger concern: on Linux without huge pages, kernels copy 8 bytes of page table per 4 KB of RSS, so a 100 GB instance is around 200 MB of page table — fork takes 100-300 ms. That latency lands on the main thread as a blip in p99.
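The two estimates in this paragraph are simple arithmetic, reproduced here as a sketch. The function names are hypothetical, and the copy-on-write figure is only a rough estimate: real COW cost is page-granular, so it can be higher or lower depending on how scattered the writes are.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def page_table_bytes(rss_bytes, page_size=4096, pte_size=8):
    """Bytes of leaf page-table entries needed to map rss_bytes of resident memory."""
    return (rss_bytes // page_size) * pte_size


def cow_estimate_bytes(write_rate_bytes_per_s, dump_seconds):
    """Bytes written by the parent while the child dumps (a rough COW estimate)."""
    return write_rate_bytes_per_s * dump_seconds


print(page_table_bytes(100 * GIB) / MIB)   # 200.0 (MiB of page table for 100 GiB RSS)
print(cow_estimate_bytes(10e6, 30) / 1e6)  # 300.0 (MB written during a 30s dump)
```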
AOF: The Append-Only File
Every write command, serialized in the RESP protocol, appended to a file.
AOF logs every command that mutates the dataset in the same RESP wire protocol used between
client and server. Reading an AOF file is identical to replaying a session of writes. A single
SET foo bar appends:
    *3\r\n
    $3\r\nSET\r\n
    $3\r\nfoo\r\n
    $3\r\nbar\r\n
That's 31 bytes for a 6-byte logical write (3 bytes of key plus 3 bytes of value), so AOF is verbose. For pure-numeric or counter workloads it's far less efficient than RDB, but the upside is that the file is plain RESP: anything that speaks the protocol can parse it, and recovery is just a replay of the commands in order. This makes ad-hoc debugging easy and forms the basis of cross-version migrations (dump AOF on old Redis, replay on new).
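To make the framing overhead concrete, here is a minimal RESP encoder for a command given as a list of string arguments. `encode_resp_command` is a hypothetical helper written for this page, not a function from any Redis client library.

```python
def encode_resp_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings, as it appears in the AOF."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        raw = arg.encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(raw), raw))
    return b"".join(parts)


entry = encode_resp_command("SET", "foo", "bar")
print(entry)       # b'*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n'
print(len(entry))  # 31
```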
The fsync policy is set with appendfsync:
    appendfsync always      # fsync after every command; ~5-50x throughput cut
    appendfsync everysec    # fsync once per second from a bg thread (default)
    appendfsync no          # never fsync, leave to kernel writeback (~30s)
Under everysec, the main thread writes commands into the kernel page cache via
write(2), which is fast: microseconds of latency. A separate
background I/O thread (bio_aof_fsync) wakes once a second and runs fsync(2)
on the AOF file descriptor, flushing the page cache to disk. If the previous fsync is still
running when the main thread needs to write again, Redis postpones the write for up to two
seconds and then issues it anyway; on a saturated disk that write(2) can block until the
flush completes. This is why slow disks under heavy write load cause Redis latency spikes
despite the "async" promise of everysec: the kernel eventually serializes the write behind the fsync.
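The split between a cheap write on the caller's thread and a timed fsync on a background thread can be sketched as follows. This is a toy model under stated assumptions, not how Redis structures its code: `EverysecLog` is a hypothetical class, and it omits the delayed-write logic and error handling a real server needs.

```python
import os
import tempfile
import threading


class EverysecLog:
    """Append on the caller's thread; fsync from a background thread on a timer."""

    def __init__(self, path, interval=1.0):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._interval = interval
        self._stop = threading.Event()
        self.fsync_count = 0
        self._thread = threading.Thread(target=self._fsync_loop, daemon=True)
        self._thread.start()

    def append(self, payload: bytes) -> None:
        # write(2) normally just copies into the page cache and returns.
        os.write(self._fd, payload)

    def _fsync_loop(self) -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not self._stop.wait(self._interval):
            os.fsync(self._fd)
            self.fsync_count += 1

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        os.fsync(self._fd)  # final flush so nothing is left only in the page cache
        os.close(self._fd)


path = os.path.join(tempfile.mkdtemp(), "appendonly.aof")
log = EverysecLog(path, interval=0.05)  # short interval only to keep the demo quick
log.append(b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n")
log.close()
print(os.path.getsize(path))  # 31
```

Anything appended after the most recent fsync exists only in the page cache, which is exactly the window that a host crash can lose.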
appendfsync always is rarely justified. It cuts write throughput to whatever your
disk's sync IOPS can sustain — on cloud network-attached disks, often 1000-5000 fsyncs per
second. Workloads that genuinely need it usually want a synchronous-replica setup instead,
where the durability point is "two replicas have it" rather than "one disk fsynced it".
AOF Rewrite (Compaction)
Without rewrite, AOF grows unbounded. Rewrite forks and emits a fresh, compacted file.
A million SET commands on the same key produce a million entries in the AOF, even though only
the last one matters. Rewrite collapses this. It is triggered automatically under the defaults
auto-aof-rewrite-percentage 100 and auto-aof-rewrite-min-size 64mb,
meaning when the AOF has doubled in size since the last rewrite and exceeds 64 MB. It can also be
triggered manually with BGREWRITEAOF.
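The automatic trigger combines the two settings as sketched below. `should_rewrite_aof` is a hypothetical function modeled on the description above; integer percentage arithmetic and the treatment of a zero base size are assumptions of this sketch.

```python
MB = 1024 * 1024


def should_rewrite_aof(current_size, base_size, percentage=100, min_size=64 * MB):
    """True when the AOF exceeds min_size and has grown by at least `percentage`
    percent over its size after the last rewrite (base_size)."""
    if percentage == 0 or current_size <= min_size:
        return False
    base = base_size if base_size > 0 else 1  # avoid dividing by zero on a fresh AOF
    growth_pct = (current_size * 100) // base - 100
    return growth_pct >= percentage


print(should_rewrite_aof(140 * MB, 70 * MB))  # True: doubled and above 64 MB
print(should_rewrite_aof(100 * MB, 70 * MB))  # False: only about 42% growth
print(should_rewrite_aof(40 * MB, 10 * MB))   # False: quadrupled but below 64 MB
```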
In the single-file AOF design used before Redis 7.0, rewrite is implemented much like BGSAVE: fork a child, walk the in-memory dataset, emit one minimal command sequence per key that reproduces its current state, and write that to a temp file. The child does not read the existing AOF; that would be much slower and would require conflict resolution with concurrent writes. While the child writes, the parent keeps appending real-time mutations to two places: the original AOF (for crash safety in case the rewrite fails) and a separate aof-rewrite-buf (an in-memory buffer of post-fork commands). Redis 7.0 and later replace the buffer with a multi-part AOF: a base file plus incremental files tracked by a manifest, with post-fork commands going to a new incremental file.
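The key idea, that the compacted log is generated from current state rather than by filtering the old log, can be shown with a toy model. Everything here is hypothetical and simplified: the log holds only SET and DEL tuples, and `replay` and `rewrite_from_memory` are names invented for this sketch.

```python
def replay(commands):
    """Apply a log of ('SET', key, value) and ('DEL', key) tuples to a dict."""
    state = {}
    for cmd in commands:
        if cmd[0] == "SET":
            state[cmd[1]] = cmd[2]
        elif cmd[0] == "DEL":
            state.pop(cmd[1], None)
        else:
            raise ValueError(f"unsupported command: {cmd[0]}")
    return state


def rewrite_from_memory(state):
    """Emit one command per live key, reading only the in-memory state."""
    return [("SET", key, value) for key, value in state.items()]


# 100,000 overwrites of one key, plus a key that is created and then deleted.
log = [("SET", "counter", str(i)) for i in range(100_000)]
log += [("SET", "tmp", "x"), ("DEL", "tmp")]

compacted = rewrite_from_memory(replay(log))
print(len(log), "->", len(compacted))  # 100002 -> 1
print(compacted)                       # [('SET', 'counter', '99999')]
```

Replaying the compacted log yields the same state as replaying the original, which is the only property the rewrite has to preserve.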
When the child finishes, it signals the parent. The parent appends the rewrite-buf contents
to the temp file, fsyncs it, and renames it onto appendonly.aof. The atomic
rename via rename(2) is what makes the swap safe: the path switches from the old file to
the new one in a single step, so a crash leaves either the complete old AOF or the complete
new one. The old file loses its directory entry in the same step, and its disk space is
freed once it is closed.
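The write-temp, fsync, rename sequence is a general pattern, sketched here in Python. `atomic_replace` and the temp-file prefix are hypothetical; `os.replace` maps to rename(2) on POSIX systems.

```python
import os
import tempfile


def atomic_replace(final_path: str, new_contents: bytes) -> None:
    """Write new_contents to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="temp-rewrite-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())        # data must be durable before the name switches
        os.replace(tmp_path, final_path)  # atomic path swap on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
aof = os.path.join(workdir, "appendonly.aof")
with open(aof, "wb") as f:
    f.write(b"old, bloated log")

atomic_replace(aof, b"compacted log")
with open(aof, "rb") as f:
    print(f.read())  # b'compacted log'
```

The temp file is created in the same directory as the target because rename(2) is only atomic within a single filesystem.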
Common failure mode: the rewrite buffer grows faster than the child can finish. On a server
under sustained write load with a slow disk, the buffer can balloon to gigabytes before the
child finishes writing the snapshot. Watch aof_rewrite_buffer_length in INFO Persistence;
if it's growing toward your free RAM, you may need to slow the workload, move to faster disks, or
set no-appendfsync-on-rewrite yes (which disables fsync on the live AOF during
rewrite, sacrificing durability for throughput just during that window).
The Hybrid Format (RDB Preamble + AOF Tail)
Introduced in Redis 4.0, this is what production should use.
With aof-use-rdb-preamble yes (the default since Redis 5.0), the AOF rewrite child produces
a binary RDB chunk as the start of the new AOF file, then appends commands collected during
the rewrite. The format on disk is:
    REDIS0011 ...rdb bytes... FF                      ← rewrite output (binary)
    *3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n     ← new commands (text)
    *3\r\n$5\r\nLPUSH\r\n$1\r\nq\r\n$1\r\nx\r\n       ← appended forever after
On load, Redis detects the REDIS magic at byte zero, loads the RDB section
(fast — same as a normal RDB load), then switches to RESP-stream mode for the rest. This is
the best of both: the bulk of the dataset is in compact binary (loads in seconds, not
minutes), and only the post-rewrite tail is verbose RESP (small, by definition — it's the
writes that have happened since the last rewrite).
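The format detection at load time comes down to a check on the first five bytes. The sketch below shows only that check; `aof_layout` is a hypothetical name, and finding where the RDB section ends would require actually parsing it, which this example does not attempt.

```python
def aof_layout(data: bytes) -> str:
    """Classify an AOF by its first bytes: RDB preamble or plain RESP."""
    return "rdb-preamble" if data[:5] == b"REDIS" else "resp-only"


print(aof_layout(b"REDIS0011\xfa..."))          # rdb-preamble
print(aof_layout(b"*3\r\n$3\r\nSET\r\n"))       # resp-only
```

The check is unambiguous because a RESP command in the AOF always starts with `*`, never with `R`.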
Backward compatibility: a Redis 3.x cannot load a hybrid AOF — it sees the RDB magic and
bails. If you need to downgrade, set aof-use-rdb-preamble no, run
BGREWRITEAOF, and the resulting file is pure RESP and loadable anywhere.
Persistence and Replication
A surprising amount of replication machinery is just persistence in motion.
When a replica connects to a master for full sync (PSYNC ? -1), the master
generates an RDB snapshot and streams it to the replica. There are two variants. The classic
flow writes RDB to the master's local disk first, then sends it; this works on any kernel but
is twice the I/O. Diskless replication (repl-diskless-sync yes)
streams RDB bytes directly from the forked child's pipe over the socket to the replica,
skipping the master's disk entirely.
Diskless is usually the better choice for masters with slow disks but fast networks (cloud
ephemeral disk vs 10 Gbps NIC), and the default became yes in Redis 7.0. The
catch: once the stream has started, a replica that arrives late cannot join it and has to
wait for the next fork. repl-diskless-sync-delay mitigates this by waiting a configurable
number of seconds for additional replicas to connect before kicking off the fork, so that
replicas syncing at the same time share one RDB stream.
Replicas are often run without persistence: save "" in the replica config
disables RDB, and appendonly no disables AOF. The reasoning: the master is the
durability point; if the master fails, you fail over to a replica that already has the data
in RAM, and you take a fresh snapshot then. Persisting on every replica multiplies disk
cost without proportional benefit. However, if you want a replica to survive
a host reboot without re-syncing GBs from the master, enable persistence on it; recovery
from local AOF is much faster than network sync.
Operational Tradeoffs
| Aspect | RDB only | AOF only | Hybrid (recommended) |
|---|---|---|---|
| Restart speed (100 GB) | 30-90 s | 10-30 min | 30-90 s + small tail |
| Worst-case data loss | save interval (minutes) | ~1 s (everysec) | ~1 s |
| File size | compact (binary) | verbose (RESP × N) | compact + small tail |
| Fork frequency | per save trigger | per rewrite trigger | per rewrite trigger |
| Disk write rate | spiky (per save) | continuous + spike | continuous + spike |
| Cross-version portable | major-version only | any version | any version (after rewrite) |
FAQ
RDB vs AOF — which one should I pick?
Run both. RDB gives you cheap, point-in-time snapshots that compress well and load fast on restart; AOF gives you near-zero-data-loss durability if you set appendfsync=everysec. The mixed RDB+AOF format introduced in Redis 4.0 (aof-use-rdb-preamble yes) writes a binary RDB chunk at the start of the AOF file followed by an incremental log of subsequent commands — best of both. The only reason to pick one is operational simplicity: if you genuinely don't need durability (a cache fronting a database) RDB alone is fine; if you cannot tolerate even a few seconds of data loss and don't care about restart speed, pure AOF works.
What does appendfsync=everysec actually risk?
Up to one second of acknowledged writes lost on a kernel/host crash. The Redis main thread appends each command to the AOF buffer synchronously, and a background thread calls fsync(2) once per second. If the host loses power between fsyncs, the buffered writes — already acknowledged to clients — are gone. appendfsync=always issues an fsync after every write command, which is durable but cuts throughput by 5-50x depending on disk; appendfsync=no leaves fsync entirely to the kernel (typically 30s on Linux), which is fast but loses far more on crash.
When does AOF rewrite trigger and what does it cost?
Automatically when the AOF file has grown by auto-aof-rewrite-percentage (default 100%) over its size at the last rewrite, and exceeds auto-aof-rewrite-min-size (default 64 MB). The rewrite forks a child that snapshots the in-memory dataset to a new AOF file. While the child runs, the parent keeps appending new commands to a buffer; when the child finishes, the buffered commands are flushed to the new file and it atomically replaces the old. The fork uses copy-on-write, so memory overhead is bounded by the write rate during the rewrite window.
Does RDB block the main thread?
Only for the fork(2) call. After fork, the child is a separate process that walks the in-memory dataset and writes RDB to disk while the parent continues serving requests. The fork itself is the expensive part: on a 100 GB instance, fork can take 100-300 ms depending on kernel version and page-table size. Use the latency monitor, or check rdb_last_bgsave_time_sec in INFO persistence and latest_fork_usec in INFO stats. Disable Linux transparent huge pages (echo never > /sys/kernel/mm/transparent_hugepage/enabled): with them enabled, each copy-on-write fault after the fork copies a 2 MB page instead of a 4 KB one, which inflates both latency and memory use while the child runs.
What happens to persistence on a replica?
Replicas are often configured not to persist. They receive an RDB snapshot during initial sync, apply the replication stream in memory, and serve reads, but after a restart they must re-sync from the master. Set a save rule (RDB) and appendonly yes on replicas if you want them to come back from a restart with their own copy of the data instead of a full resync. Important: if you're using diskless replication (repl-diskless-sync yes), the master streams the RDB over the socket without writing it to its own disk, but the replica still has to load it.
How do I recover from a partially-written AOF?
Redis ships with redis-check-aof. Running redis-check-aof --fix /path/to/appendonly.aof scans the file, finds the last fully-formed command, and truncates the trailing partial after asking for confirmation. The truncation happens in place, so copy the file first if you want to keep the damaged tail. Most production failures look like this: the host crashed mid-write, so the last few KB are garbage. The fix recovers everything before the corruption. If aof-load-truncated yes is set (the default), Redis handles a truncated tail automatically on startup: it logs a warning and continues.
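The scan for "the last fully-formed command" can be sketched for a pure-RESP file as follows. This is an illustration of the idea, not the logic of redis-check-aof itself: `last_complete_command_end` is a hypothetical function, and it assumes the file has no RDB preamble.

```python
def last_complete_command_end(data: bytes) -> int:
    """Return the byte offset just past the last fully-formed RESP command."""
    n = len(data)

    def read_line(start):
        # Return (line_without_crlf, offset_after_crlf), or (None, start) if incomplete.
        end = data.find(b"\r\n", start)
        if end == -1:
            return None, start
        return data[start:end], end + 2

    pos = good = 0
    while pos < n:
        header, pos = read_line(pos)
        if header is None or not header.startswith(b"*") or not header[1:].isdigit():
            break
        complete = True
        for _ in range(int(header[1:])):
            length_line, pos = read_line(pos)
            if (length_line is None or not length_line.startswith(b"$")
                    or not length_line[1:].isdigit()):
                complete = False
                break
            arg_len = int(length_line[1:])
            if data[pos + arg_len:pos + arg_len + 2] != b"\r\n":
                complete = False  # argument body or its trailing CRLF is cut off
                break
            pos += arg_len + 2
        if not complete:
            break
        good = pos  # everything up to here parsed cleanly
    return good


set_foo = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n"
damaged = set_foo * 2 + b"*3\r\n$3\r\nSET\r\n$3\r\nfo"
print(last_complete_command_end(damaged))  # 62
```

Truncating the file at the returned offset keeps the two complete commands and discards the partial third.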