Redis Internals
Redis is a single-threaded, in-memory data structure server written in roughly 130k lines of C. The headline number, over a million operations per second on a single core once clients pipeline requests (and well over 100k without pipelining), falls out of an extremely tight architecture: one thread, one event loop, hand-written data structures, and a network protocol designed to be parsed in nanoseconds. This hub dissects the trade-offs that make that number real, and the ones (replication lag, fork-and-snapshot stalls, slot migration, cluster split-brain) that bite you in production.
Grounded in the Redis 7.x source tree, the official RESP3 spec, and Salvatore Sanfilippo's design notes.
Persistence (RDB + AOF)
Snapshots, append-only files, and the hybrid format — how an in-memory database survives crashes without paying the latency cost on every write
Single-Threaded Event Loop
Why one thread handling 100k+ requests/sec beats a thread pool — epoll/kqueue, the ae library, and the cost of a single slow command
Data Structure Encodings
intset, ziplist, listpack, quicklist, skiplist — Redis silently swaps representations to keep small collections cache-resident
Cluster Mode (16384 Slots)
Hash slots, gossip protocol, MOVED/ASK redirects, and why Redis Cluster picks AP over CP — the sharding model that scales horizontally
Replication & Sentinel
Async replication, PSYNC2, the replication backlog, and how Sentinel orchestrates failover through quorum agreement and a majority-elected leader Sentinel
Streams & Consumer Groups
XADD, XREADGROUP, pending entries lists — the log-structured data type that turns Redis into a lightweight Kafka
Expiration & Eviction
Lazy + active expiration, the eight maxmemory policies, and how approximate-LRU/LFU keep eviction cheap
Redis vs Dragonfly vs KeyDB
Multi-threaded, Redis-protocol-compatible alternatives (KeyDB is a fork of Redis; Dragonfly is a from-scratch reimplementation): what changed, what stayed the same, and which workloads actually benefit
The Single-Threaded Event Loop
Redis runs every command on one thread. The aeEventLoop in src/ae.c is a thin
portability wrapper over epoll on Linux, kqueue on BSD/macOS, and evport
on Solaris. Each tick the loop calls aeApiPoll(), processes ready file events, then runs any
time events whose deadline has passed. There are no mutexes on the keyspace because there is nothing to
contend with.
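The loop structure above can be sketched with Python's selectors module, which wraps the same kernel APIs. This is a toy under stated assumptions, not ae.c. MiniEventLoop and handle_client are hypothetical names, timers are one-shot, and only an inline PING is understood. Replies are queued and flushed at the start of the next tick, which mirrors how Redis defers socket writes to beforeSleep().

```python
import selectors
import socket
import time


class MiniEventLoop:
    """Toy, single-threaded analogue of aeEventLoop: file events plus time events."""

    def __init__(self):
        # DefaultSelector picks epoll on Linux and kqueue on BSD/macOS.
        self.selector = selectors.DefaultSelector()
        self.timers = []          # [deadline, callback] pairs, one-shot only
        self.pending_writes = []  # (socket, reply bytes) queued by handlers

    def add_reader(self, sock, handler):
        self.selector.register(sock, selectors.EVENT_READ, handler)

    def add_timer(self, delay_s, callback):
        self.timers.append([time.monotonic() + delay_s, callback])

    def before_sleep(self):
        # Flush replies queued during the previous tick before polling again.
        for sock, reply in self.pending_writes:
            sock.sendall(reply)
        self.pending_writes.clear()

    def tick(self, timeout=0.1):
        self.before_sleep()
        # 1. Wait for readable descriptors (the aeApiPoll() step).
        for key, _ in self.selector.select(timeout):
            key.data(key.fileobj, self)
        # 2. Run time events whose deadline has passed.
        now = time.monotonic()
        due = [t for t in self.timers if t[0] <= now]
        self.timers = [t for t in self.timers if t[0] > now]
        for _, callback in due:
            callback()


def handle_client(sock, loop):
    """Read one inline command and queue the reply instead of writing it inline."""
    data = sock.recv(4096)
    if data.strip().upper() == b"PING":
        loop.pending_writes.append((sock, b"+PONG\r\n"))
```

A real loop would also shrink the poll timeout to the nearest timer deadline, as aeProcessEvents() does, instead of using a fixed timeout.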
Reading from a client socket calls readQueryFromClient(), which appends bytes to a per-client
input buffer. processInputBuffer() parses RESP commands directly from that buffer. Once a complete command is present
it calls processCommand(), which looks up the command table entry, performs auth/ACL checks,
and dispatches through call() to the command's C function, which enqueues its reply via addReply(). The reply lands in
a small per-client buffer first (16 KB, PROTO_REPLY_CHUNK_BYTES) and overflows into a list of clientReplyBlock
chunks for large responses. Replies are never written while a command executes. Clients with pending output are flushed by
handleClientsWithPendingWrites() in beforeSleep(), just before the loop polls again. A writable-event handler is installed
only when the socket cannot take the whole reply at once.
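Detecting whether the input buffer already holds a complete command can be shown with a small parser for RESP multibulk requests. This is only a sketch. parse_multibulk and drain are hypothetical names, inline commands are not handled, and Redis itself parses in place over the client's query buffer rather than copying it.

```python
def parse_multibulk(buf: bytes):
    """Parse one RESP multibulk request from the front of buf.

    Returns (args, bytes_consumed), or None when the buffer does not yet hold a
    complete command; the caller then waits for more socket data.
    """
    if not buf:
        return None
    if buf[:1] != b"*":
        raise ValueError("this sketch only handles multibulk requests")
    end = buf.find(b"\r\n")
    if end == -1:
        return None
    argc = int(buf[1:end])
    pos = end + 2
    args = []
    for _ in range(argc):
        end = buf.find(b"\r\n", pos)
        if end == -1:
            return None
        if buf[pos:pos + 1] != b"$":
            raise ValueError("expected a bulk string header")
        length = int(buf[pos + 1:end])
        start = end + 2
        if len(buf) < start + length + 2:
            return None  # payload or its trailing CRLF has not arrived yet
        if buf[start + length:start + length + 2] != b"\r\n":
            raise ValueError("bulk string not terminated by CRLF")
        args.append(buf[start:start + length])
        pos = start + length + 2
    return args, pos


def drain(querybuf: bytearray):
    """Execute-ready commands leave the buffer; a partial tail stays for later."""
    commands = []
    while (parsed := parse_multibulk(bytes(querybuf))) is not None:
        args, used = parsed
        commands.append(args)
        del querybuf[:used]
    return commands
```

Because incomplete input returns None instead of blocking, a pipelined burst and a command split across TCP segments are handled by the same loop.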
Since 6.0, Redis ships I/O threads (io-threads config). The name is accurate but often misread: they only parallelise
socket reads and writes (reads also require io-threads-do-reads), never command execution. The main thread still owns the
keyspace alone. This can buy roughly 2–3x throughput on workloads dominated by network I/O (large GETs, big pipelines), but does
nothing for CPU-heavy commands like SUNIONSTORE over millions of elements.
The cost is sharp: any single command that takes 50 ms blocks every other client for 50 ms. KEYS *
on a million-key database, a synchronous FLUSHDB, a large LRANGE, even Lua scripts are all head-of-line
blockers. SCAN and HSCAN exist precisely to give cooperative, cursor-based versions of full iteration.
UNLINK (and FLUSHDB ASYNC) hand the memory reclamation of big deletions to a background thread.
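SCAN stays cooperative because each call visits a bounded number of hash buckets and returns a cursor. Redis's dictScan() advances that cursor with a reverse-binary increment, bumping the highest bit first. This lets SCAN return every key that is present for the whole iteration at least once, even if the table is resized between calls, although duplicates are possible. The sketch below assumes a single, fixed power-of-two table; real Redis also handles a table that is mid-rehash. reverse_bits, next_cursor and scan are hypothetical names.

```python
def reverse_bits(v: int, nbits: int) -> int:
    out = 0
    for _ in range(nbits):
        out = (out << 1) | (v & 1)
        v >>= 1
    return out


def next_cursor(cursor: int, table_size: int) -> int:
    """Reverse-binary increment: add one to the bit-reversed cursor."""
    nbits = table_size.bit_length() - 1  # table_size must be a power of two
    v = reverse_bits(cursor, nbits)
    v = (v + 1) % table_size
    return reverse_bits(v, nbits)


def scan(buckets, cursor, count=10):
    """One SCAN call: visit buckets until ~count keys are found or the cursor wraps to 0."""
    found = []
    while True:
        found.extend(buckets[cursor])
        cursor = next_cursor(cursor, len(buckets))
        if cursor == 0 or len(found) >= count:
            return cursor, found
```

For a table of 8 buckets the visit order is 0, 4, 2, 6, 1, 5, 3, 7. As with the real command, COUNT is a hint about work per call, not an exact result size.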
Native Data Structures and Their Encodings
Redis exposes nine high-level types: string, list, hash, set, zset, stream, bitmap, hyperloglog, geospatial (bitmaps and HyperLogLogs are stored as strings, geospatial indexes as sorted sets). Each has multiple internal encodings that the engine swaps based on size and content. The encoding swap is invisible to clients (OBJECT ENCODING reveals it) but dominates memory cost and per-command constant factors.
Strings are stored as sds (Simple Dynamic String): a length-prefixed buffer with
O(1) length. Strings that hold an integer fitting in a long are stored as the integer itself when possible (OBJ_ENCODING_INT),
saving the SDS header entirely. Strings ≤44 bytes use EMBSTR: the SDS lives in the same allocation
as the robj header, so one allocation replaces two.
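The encoding choice for a freshly written string can be approximated in a few lines. This sketch assumes a 64-bit long and follows the checks in tryObjectEncoding() only loosely. string_encoding is a hypothetical helper, and it ignores later modifications (such as APPEND or SETRANGE) that can change the encoding.

```python
EMBSTR_SIZE_LIMIT = 44                     # OBJ_ENCODING_EMBSTR_SIZE_LIMIT
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1   # assumes a 64-bit long


def string_encoding(value: bytes) -> str:
    """Approximate what OBJECT ENCODING would report for a freshly SET value."""
    if len(value) <= 20:  # the longest 64-bit long, "-9223372036854775808", is 20 chars
        try:
            text = value.decode("ascii")
            n = int(text)
        except (UnicodeDecodeError, ValueError):
            n = None
        # Only canonical decimal forms qualify: "007", "+7" and " 7" stay strings.
        if n is not None and str(n) == text and LONG_MIN <= n <= LONG_MAX:
            return "int"
    return "embstr" if len(value) <= EMBSTR_SIZE_LIMIT else "raw"
```

The canonical-form check matters because Redis must return exactly the bytes the client wrote. "007" cannot be stored as the integer 7 without losing its leading zeros.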
Lists use quicklist: a linked list of listpack nodes. A listpack is a
compact serialised array (variable-width entries, no pointers) tuned to fit in a few cache lines.
list-max-listpack-size bounds each node (default -2, i.e. 8 KB). This approximates the iteration cost of an array with the
insert cost of a linked list: an insert rewrites one bounded node instead of shifting the whole list.
Hashes and sorted sets use a listpack when small (default ≤128 entries with
values ≤64 bytes). Past those limits a hash upgrades to a hashtable, and a sorted set to a skiplist paired with a
hashtable (member to score, for O(1) ZSCORE). The skiplist is the classic Pugh-style randomised structure extended with
a backward pointer per node (for reverse iteration) and a span per level (so ZRANK is O(log N)), giving O(log N) range
queries with simple code. Small sets whose members are all integers stay in an intset: a sorted integer
array searched by binary search.
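The span bookkeeping is the part that is easy to get wrong. Below is a condensed Python transcription of the insert and rank logic in zslInsert() and zslGetRank() from t_zset.c, assuming ZSKIPLIST_MAXLEVEL = 32 and ZSKIPLIST_P = 0.25. Deletion, the tail pointer and the companion hashtable are omitted, Python string comparison stands in for sdscmp(), and Node/SkipList are hypothetical names.

```python
import random

MAX_LEVEL = 32  # ZSKIPLIST_MAXLEVEL
P = 0.25        # ZSKIPLIST_P: chance of promoting a node one more level


class Node:
    def __init__(self, level, score=0.0, member=None):
        self.score, self.member = score, member
        self.forward = [None] * level  # next node at each level
        self.span = [0] * level        # how many rank positions each forward link skips
        self.backward = None           # previous node at level 0, for reverse ranges


class SkipList:
    def __init__(self, seed=None):
        self.head = Node(MAX_LEVEL)
        self.level, self.length = 1, 0
        self.rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while self.rng.random() < P and level < MAX_LEVEL:
            level += 1
        return level

    @staticmethod
    def _before(node, score, member):
        # Order by score, breaking ties by member, as sorted sets do.
        return node.score < score or (node.score == score and node.member < member)

    def insert(self, score, member):
        update, rank = [None] * MAX_LEVEL, [0] * MAX_LEVEL
        x = self.head
        for i in range(self.level - 1, -1, -1):
            rank[i] = 0 if i == self.level - 1 else rank[i + 1]
            while x.forward[i] and self._before(x.forward[i], score, member):
                rank[i] += x.span[i]
                x = x.forward[i]
            update[i] = x
        level = self._random_level()
        if level > self.level:
            for i in range(self.level, level):
                rank[i], update[i] = 0, self.head
                self.head.span[i] = self.length
            self.level = level
        x = Node(level, score, member)
        for i in range(level):
            x.forward[i] = update[i].forward[i]
            update[i].forward[i] = x
            x.span[i] = update[i].span[i] - (rank[0] - rank[i])
            update[i].span[i] = (rank[0] - rank[i]) + 1
        for i in range(level, self.level):
            update[i].span[i] += 1  # links that now jump over the new node
        x.backward = None if update[0] is self.head else update[0]
        if x.forward[0]:
            x.forward[0].backward = x
        self.length += 1

    def rank(self, score, member):
        """1-based rank of (score, member), or 0 if absent; O(log N) via spans."""
        r, x = 0, self.head
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and (
                x.forward[i].score < score
                or (x.forward[i].score == score and x.forward[i].member <= member)
            ):
                r += x.span[i]
                x = x.forward[i]
            if x is not self.head and x.score == score and x.member == member:
                return r
        return 0

    def range_by_score(self, lo, hi):
        """Members with lo <= score <= hi, in order, like ZRANGEBYSCORE."""
        x = self.head
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and x.forward[i].score < lo:
                x = x.forward[i]
        x, out = x.forward[0], []
        while x and x.score <= hi:
            out.append(x.member)
            x = x.forward[0]
        return out
```

Summing spans while descending is what turns ZRANK into a logarithmic walk instead of a linear count along level 0.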
Streams use a radix tree (rax) keyed by the stream ID, with each node holding a
listpack of entries. The radix tree gives prefix iteration; the listpacks amortise the per-entry overhead.
Consumer group state (PEL — pending entries list) lives in a separate rax per group.
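A radix tree orders keys bytewise. Redis therefore stores each stream ID as two big-endian 64-bit integers (streamEncodeID() in t_stream.c), so that byte order in the rax matches numeric ID order. Each rax key is the ID of the first entry in that node's listpack. Below is a sketch of the encoding; encode_stream_id and decode_stream_id are hypothetical names.

```python
import struct


def encode_stream_id(ms: int, seq: int) -> bytes:
    """128-bit big-endian key: byte order equals numeric (ms, seq) order."""
    return struct.pack(">QQ", ms, seq)


def decode_stream_id(key: bytes) -> str:
    ms, seq = struct.unpack(">QQ", key)
    return f"{ms}-{seq}"
```

Comparing the textual IDs would sort "…-12" before "…-5". The fixed-width big-endian form avoids that, so a range query becomes a plain ordered walk of the tree.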
Persistence: RDB, AOF, and the Hybrid Format
Redis offers two on-disk formats and most production setups use both. RDB is a point-in-time
binary snapshot. BGSAVE (also what the configured save thresholds trigger) fork()s. The child inherits a
copy-on-write view of the dataset and writes it to a single file using a compact custom encoding
(variable-length integers, listpack inlining, LZF compression on by default), while the parent keeps serving
traffic. SAVE, by contrast, writes the snapshot synchronously on the main thread and blocks every client until
it finishes. RDB files load quickly because the on-disk format is close to the in-memory layout.
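The point-in-time property of BGSAVE comes straight from fork(): the child's memory is frozen at the moment of the fork, however the parent mutates its own copy afterwards. The sketch below is Unix-only. JSON stands in for the RDB format, and bgsave/demo are hypothetical names. Like Redis, it writes a temporary file and renames it into place, so a crash never leaves a half-written snapshot.

```python
import json
import os
import tempfile


def bgsave(dataset: dict, path: str) -> int:
    """Fork; the child serialises its copy-on-write view of `dataset` to `path`."""
    pid = os.fork()
    if pid == 0:  # child: sees memory exactly as it was at fork time
        status = 1
        try:
            tmp = f"{path}.tmp-{os.getpid()}"
            with open(tmp, "w") as f:
                json.dump(dataset, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic rename: readers never see a torn file
            status = 0
        finally:
            os._exit(status)  # leave without running the parent's cleanup handlers
    return pid  # parent: returns immediately and keeps serving writes


def demo():
    data = {"counter": 1}
    target = os.path.join(tempfile.mkdtemp(), "dump.json")
    child = bgsave(data, target)
    data["counter"] = 2  # parent keeps mutating after the fork
    _, status = os.waitpid(child, 0)
    assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    with open(target) as f:
        return json.load(f), data  # (snapshot, live dataset)
```

The snapshot holds the value from the moment of the fork, while the live dataset already reflects the later write.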
AOF is an append-only log of every write command. Its durability depends on the appendfsync
policy you choose. always fsyncs before acknowledging (safest, but disk-bound). everysec, the default, has a background
thread fsync once per second, so power loss costs about a second of writes. no lets the kernel decide. AOF files grow
and are compacted by BGREWRITEAOF, which forks and writes a compact command stream that
reconstructs the current state.
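The three policies differ only in when fsync() follows the append. This sketch uses a hypothetical AppendOnlyFile class that logs commands in RESP form, and a replay helper that rebuilds SET/DEL state. Unlike Redis, everysec here fsyncs inline on the first append after a second has passed, not in a background thread.

```python
import os
import tempfile
import time


class AppendOnlyFile:
    def __init__(self, path, appendfsync="everysec"):
        if appendfsync not in ("always", "everysec", "no"):
            raise ValueError(f"unknown appendfsync policy: {appendfsync}")
        self.f = open(path, "ab")
        self.policy = appendfsync
        self.last_fsync = time.monotonic()

    def append(self, *args: bytes):
        # Log the command in the same RESP multibulk form clients send.
        entry = b"*%d\r\n" % len(args)
        entry += b"".join(b"$%d\r\n%s\r\n" % (len(a), a) for a in args)
        self.f.write(entry)
        self.f.flush()  # write(2): data is in the page cache, not yet on disk
        now = time.monotonic()
        if self.policy == "always" or (
            self.policy == "everysec" and now - self.last_fsync >= 1.0
        ):
            os.fsync(self.f.fileno())  # now it survives power loss
            self.last_fsync = now
        # "no": leave flushing to the kernel

    def close(self):
        self.f.close()


def replay(path):
    """Rebuild the keyspace by re-executing logged SET/DEL commands.

    A truncated final entry would raise here; Redis's aof-load-truncated
    option instead discards it.
    """
    with open(path, "rb") as f:
        data = f.read()
    db, pos = {}, 0
    while pos < len(data):
        end = data.index(b"\r\n", pos)
        argc, pos = int(data[pos + 1:end]), end + 2
        args = []
        for _ in range(argc):
            end = data.index(b"\r\n", pos)
            n, start = int(data[pos + 1:end]), end + 2
            args.append(data[start:start + n])
            pos = start + n + 2
        if args[0].upper() == b"SET":
            db[args[1]] = args[2]
        elif args[0].upper() == b"DEL":
            for key in args[1:]:
                db.pop(key, None)
    return db


def demo():
    path = os.path.join(tempfile.mkdtemp(), "appendonly.aof")
    aof = AppendOnlyFile(path, appendfsync="always")
    aof.append(b"SET", b"a", b"1")
    aof.append(b"SET", b"b", b"2")
    aof.append(b"DEL", b"a")
    aof.close()
    return replay(path)
```

The log contains every intermediate write, including the now-deleted key a. That growth is what a rewrite compacts away.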
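Modern Redis uses aof-use-rdb-preamble yes by default. A rewrite starts the new AOF with an RDB
snapshot of the keyspace at fork time, followed by the write commands that arrive afterwards. Since 7.0 this is stored as a
multi-part AOF: an RDB-format base file plus incremental AOF files, tracked by a manifest in the appenddirname directory.
This combines RDB's compactness with AOF's durability.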
The hidden cost of fork() on a large dataset is COW page faults. Every page the parent writes after
the fork must be duplicated. On a 100 GB instance with high write churn, those copies can add tens
of gigabytes of resident memory, and Linux's transparent huge pages turn each 4 KB copy into a 2 MB copy.
This is a frequent cause of latency spikes in production, and disabling THP is in the Redis admin checklist for a reason.
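The page-size effect can be quantified with a small expected-value calculation. The workload is a hypothetical assumption: a 100 GiB dataset receiving 1 million writes during the snapshot, each write dirtying exactly one page chosen uniformly at random. With n pages and w writes, the expected number of distinct pages touched is n(1 − (1 − 1/n)^w). expected_cow_bytes is a hypothetical helper.

```python
import math

GiB = 1024 ** 3


def expected_cow_bytes(dataset_bytes, page_bytes, random_writes):
    """Expected bytes duplicated by copy-on-write during one snapshot.

    Assumes every write dirties exactly one uniformly random page.
    """
    n = dataset_bytes // page_bytes
    # n * (1 - (1 - 1/n) ** w), computed stably with expm1/log1p
    touched = -n * math.expm1(random_writes * math.log1p(-1 / n))
    return touched * page_bytes
```

Under these assumptions, 4 KB pages duplicate roughly 3.7 GiB, while 2 MB huge pages duplicate essentially the whole 100 GiB. Each scattered write drags a full huge page along with it.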
Replication: Async, Backlog-Driven, Eventually Consistent
Redis replication is asynchronous. The primary acknowledges the client as soon as the command runs locally,
then propagates to replicas. Each replica connects, issues PSYNC with its replication ID and
offset, and either resumes from the in-memory replication backlog (a circular buffer, default 1 MB) or
receives a full sync — an RDB snapshot followed by the buffered commands accumulated during snapshotting.
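The partial-versus-full decision depends only on whether the replica's replication ID matches and whether its offset still falls inside the circular buffer. The sketch below uses a hypothetical ReplicationBacklog class. Offsets are simplified to "bytes already applied by the replica", whereas real PSYNC sends its offset plus one.

```python
import secrets


class ReplicationBacklog:
    """Fixed-size circular buffer over the primary's replication stream."""

    def __init__(self, size=1024 * 1024):  # repl-backlog-size defaults to 1mb
        self.replid = secrets.token_hex(20)  # 40 hex chars, like a real replication ID
        self.buf = bytearray(size)
        self.size = size
        self.master_offset = 0  # total bytes ever propagated
        self.histlen = 0        # how many of the most recent bytes are still buffered

    def feed(self, data: bytes):
        for byte in data:
            self.buf[self.master_offset % self.size] = byte
            self.master_offset += 1
        self.histlen = min(self.histlen + len(data), self.size)

    def psync(self, replid: str, offset: int):
        """Return the bytes a replica needs to catch up, or None for a full resync."""
        oldest = self.master_offset - self.histlen
        if replid != self.replid or not (oldest <= offset <= self.master_offset):
            return None  # unknown history, or already overwritten: FULLRESYNC
        return bytes(self.buf[i % self.size] for i in range(offset, self.master_offset))
```

Sizing the backlog is therefore a question of how many bytes of writes accumulate during the longest replica disconnection you want to survive without a full sync.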
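PSYNC2 (Redis 4.0+) made replicas survive primary failover. A promoted replica switches to a new replication ID but
remembers the old one (plus the offset at which it switched), so the other replicas can partial-resync against it instead of
pulling a full snapshot. WAIT numreplicas timeout blocks the caller until that many replicas have acknowledged its earlier
writes, or the timeout expires. It narrows the loss window for critical writes such as a credit transfer, but does not make
Redis strongly consistent: a failover can still promote a replica that missed the write.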
The headline correctness pitfall: if a partition isolates the primary from its replicas, the primary keeps
acknowledging writes that are lost when a replica is promoted on the other side. min-replicas-to-write +
min-replicas-max-lag let the primary reject writes whenever fewer than N replicas have acknowledged within M seconds,
bounding that loss window. This is the closest Redis gets to a synchronous replication knob.
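Replicas acknowledge roughly once per second (REPLCONF ACK), so the check reduces to counting replicas whose last acknowledgement is recent enough. Rejected writes get a NOREPLICAS error. write_allowed is a hypothetical helper for the rule.

```python
def write_allowed(replica_lags_s, min_replicas_to_write, min_replicas_max_lag):
    """Would the primary accept a write, given each online replica's ack lag in seconds?"""
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True  # setting either option to 0 disables the check
    good = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write
```

An isolated primary sees its replicas' lag grow past the limit and starts refusing writes. This caps what a partition can lose at roughly min-replicas-max-lag seconds of writes.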
Sentinel and Cluster: Two Different HA Models
Sentinel monitors a single primary + N replicas. Sentinels discover each other through a hello Pub/Sub channel on the monitored instances. A Sentinel that stops getting valid replies from the primary marks it subjectively down (SDOWN). Once the configured quorum agrees, the primary is objectively down (ODOWN), and a leader Sentinel elected by majority vote orchestrates failover. The dataset stays a single shard. Clients query Sentinels for the current primary's address, then connect directly. Sentinel solves "who is master?", not horizontal scaling.
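The two thresholds are easy to conflate. quorum only decides when the primary counts as objectively down, while the failover itself needs a leader Sentinel that wins a majority of votes. The sketch below models sentinelGetLeader(), assuming the winner needs at least max(majority, quorum) votes. objectively_down and failover_leader are hypothetical names.

```python
from collections import Counter


def objectively_down(sdown_flags, quorum):
    """ODOWN: at least `quorum` Sentinels report the primary as SDOWN."""
    return sum(sdown_flags) >= quorum


def failover_leader(votes, total_sentinels, quorum):
    """votes maps voter -> candidate it voted for this epoch; return winner or None."""
    if not votes:
        return None
    winner, n = Counter(votes.values()).most_common(1)[0]
    needed = max(total_sentinels // 2 + 1, quorum)
    return winner if n >= needed else None
```

With five Sentinels and quorum 2, two Sentinels can declare ODOWN, but no failover starts unless three of them agree on a leader. A minority partition can therefore detect the failure but cannot act on it.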
Redis Cluster is a sharded, peer-to-peer multi-primary system with no external coordinator.
The keyspace is partitioned into 16384 hash slots: HASH_SLOT(key) = CRC16(key) % 16384. Each
primary owns a subset of slots; replicas follow specific primaries. Nodes gossip (binary protocol on a
separate port) about slot assignments, node liveness, and configuration epoch.
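The slot function is small enough to reproduce. The sketch below is a bitwise CRC16-CCITT (XMODEM variant: polynomial 0x1021, initial value 0), which computes the same checksum as the table-driven crc16.c in Redis. It also applies the hash-tag rule: if a key contains a non-empty {…}, only that part is hashed. crc16_xmodem and key_hash_slot are hypothetical names.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def key_hash_slot(key: bytes) -> int:
    """Hash only the first non-empty {tag} if present, then keep the low 14 bits."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # a closing brace exists and the tag is non-empty
            key = key[start + 1:end]
    return crc16_xmodem(key) & 0x3FFF  # same as % 16384
```

For example, foo maps to slot 12182. Keys sharing a hash tag such as {user:42} always land on the same slot, which is what makes multi-key commands on them legal in Cluster.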
A client connects to any node. If it issues a command for a slot owned elsewhere, it gets a MOVED
reply with the correct address, and the client should update its slot map and retry. During slot migration,
an ASK redirect sends the client to the migration target for that one command only, which it must precede with
ASKING. The client does not update its slot map, and later commands for the slot still go to the source node first.
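On the client side, the two redirects differ only in whether the cached slot map changes and whether the retry must be prefixed with ASKING. The sketch below assumes the error text has the documented "-MOVED <slot> <host>:<port>" shape. follow_redirect is a hypothetical helper, not part of any client library.

```python
def follow_redirect(error: str, slot_map: dict):
    """Decide where to resend a command after a -MOVED or -ASK error.

    Returns (host, port, send_asking_first). Only MOVED updates the slot map.
    """
    kind, slot, addr = error.lstrip("-").split()
    host, port = addr.rsplit(":", 1)
    target = (host, int(port))
    if kind == "MOVED":
        slot_map[int(slot)] = target  # permanent: the slot now lives there
        return (*target, False)
    if kind == "ASK":
        return (*target, True)  # one-off: send ASKING, then the command
    raise ValueError(f"not a redirect: {error}")
```

Keeping ASK out of the slot map is what lets a migration finish without every client re-learning ownership halfway through.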
Cluster makes a deliberate AP choice: it favours availability over consistency. With async replication and a
gossip-driven failover, a network partition can briefly leave two nodes acting as primary for the same slots, leading to
diverged writes. Tuning cluster-node-timeout bounds the damage, since a primary that cannot reach the majority of primaries
stops accepting writes once that timeout elapses. It does not eliminate the problem, and Sentinel shares the same
async-replication exposure. For workloads that can't tolerate lost writes, use a CP system (TiDB, CockroachDB).
Expiration and Eviction
Each database keeps a parallel dict mapping keys to expiration timestamps. Expiration is enforced
two ways. Lazy expiration: every command lookup checks the expire dict and deletes if past
due, before serving. Active expiration: a periodic task (default 10 Hz, controlled by hz)
samples 20 keys with TTLs from each db, deletes expired ones, and repeats while more than 25% of the sample had expired,
within a per-cycle time budget. Sampling keeps the share of expired-but-unreclaimed keys low, and the time limit keeps the
total expiration cost bounded.
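The sample-and-repeat loop can be sketched as follows. active_expire_cycle is a hypothetical function, max_rounds is a hypothetical stand-in for Redis's per-cycle CPU time limit, and stale_threshold uses the 25% figure described above. Copying the keys into a list to sample them is O(N); Redis samples from the hash table directly.

```python
import random


def active_expire_cycle(expires, now_ms, rng, sample_size=20,
                        stale_threshold=0.25, max_rounds=16):
    """One simplified active-expiration pass over a single db.

    `expires` maps key -> absolute expiry time in ms.
    """
    deleted = 0
    for _ in range(max_rounds):
        if not expires:
            break
        sample = rng.sample(list(expires), min(sample_size, len(expires)))
        expired = [k for k in sample if expires[k] <= now_ms]
        for k in expired:
            del expires[k]
        deleted += len(expired)
        # Stop once the sample suggests that few expired keys remain.
        if len(expired) <= stale_threshold * len(sample):
            break
    return deleted
```

When every key has expired, the pass keeps going until its budget (here 16 rounds of 20 keys) runs out. When almost nothing has expired, it stops after a single sample.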
Eviction kicks in when memory exceeds maxmemory. Eight policies exist:
noeviction (reject writes), allkeys-lru, allkeys-lfu, allkeys-random,
volatile-lru, volatile-lfu, volatile-ttl, volatile-random.
The "volatile-" variants only evict keys with a TTL set. LRU/LFU are approximate. Redis samples
maxmemory-samples keys (default 5), merges them into a small pool of the best candidates seen so far, and evicts the worst.
Raising maxmemory-samples to 10 brings it very close to true LRU at modest CPU cost.
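The pool is what makes five samples go a long way. Redis's eviction code keeps up to 16 candidates (EVPOOL_SIZE in evict.c) across rounds, so each new sample competes with earlier leftovers. The sketch below assumes idle time is simply now minus last access. evict_one is a hypothetical name, and a plain dict stands in for the pool.

```python
import random


def evict_one(last_access, now, rng, pool, samples=5, pool_size=16):
    """Approximate LRU: sample a few keys, keep the idlest candidates in a pool.

    `last_access` maps key -> last access time; `pool` maps key -> idle time and
    persists between calls.
    """
    if not last_access:
        return None
    for key in rng.sample(sorted(last_access), min(samples, len(last_access))):
        pool[key] = now - last_access[key]
    # Keep only the pool_size idlest candidates.
    best = sorted(pool.items(), key=lambda kv: kv[1], reverse=True)[:pool_size]
    pool.clear()
    pool.update(best)
    for key, _ in best:
        del pool[key]
        if key in last_access:  # the candidate may have been deleted meanwhile
            del last_access[key]
            return key
    return None
```

With enough samples to see every key, this degenerates to exact LRU. With fewer, the pool keeps strong candidates from earlier rounds, so one unlucky sample does not force a poor eviction.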
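LFU reuses the object header's 24-bit lru field: 16 bits hold the last decrement time (in minutes) and 8 bits a
logarithmic access counter. The counter grows ever more slowly as it rises (tuned by lfu-log-factor) and decays by one every
lfu-decay-time minutes. This makes LFU genuinely usable as a long-term policy without counter saturation.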
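Both halves of the scheme fit in a few lines. The sketch below follows the logic of LFULogIncr() and LFUDecrAndReturn(), assuming the default starting value LFU_INIT_VAL = 5. lfu_log_incr, lfu_decay and pack_lru_field are hypothetical names.

```python
import random

LFU_INIT_VAL = 5  # counter value given to new keys


def lfu_log_incr(counter, lfu_log_factor, rng):
    """Probabilistic increment: the higher the counter, the less likely it grows."""
    if counter == 255:
        return 255  # the 8-bit counter saturates
    baseval = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (baseval * lfu_log_factor + 1)
    return counter + 1 if rng.random() < p else counter


def lfu_decay(counter, elapsed_minutes, lfu_decay_time):
    """Subtract one for every lfu_decay_time minutes since the last decrement."""
    if lfu_decay_time == 0:
        return counter
    periods = elapsed_minutes // lfu_decay_time
    return 0 if periods > counter else counter - periods


def pack_lru_field(last_decr_minutes, counter):
    """24-bit field: 16 bits of minutes (wrapping) above an 8-bit counter."""
    return ((last_decr_minutes & 0xFFFF) << 8) | (counter & 0xFF)
```

With the default lfu-log-factor of 10, moving from 5 to 6 takes one hit, from 6 to 7 about 11 more, and from 7 to 8 about 21 more. The counter therefore tracks orders of magnitude of access rather than raw counts.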
Transactions, Scripting, and Functions
MULTI / EXEC queue commands and execute them atomically, but Redis transactions
don't roll back on error. If the third queued command fails at runtime (say, INCR on a non-numeric value), the others
still run. There are no savepoints. Errors detected while queueing, such as an unknown command, make EXEC refuse the whole
transaction. WATCH adds optimistic concurrency control: EXEC aborts and returns null if any WATCHed key changed in the meantime.
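The WATCH mechanism can be modelled without a server. In this toy, Store and Session are hypothetical classes. A write flags every session watching the key, much as Redis's touchWatchedKey() sets CLIENT_DIRTY_CAS on watching clients, and EXEC aborts if the flag is set.

```python
class Store:
    """Keyspace plus the watched-keys registry (key -> sessions watching it)."""

    def __init__(self):
        self.data = {}
        self.watched_by = {}

    def write(self, key, value):
        self.data[key] = value
        for session in self.watched_by.pop(key, set()):
            session.dirty_cas = True  # any later EXEC by that session will abort


class Session:
    def __init__(self, store):
        self.store = store
        self.watched = set()
        self.queue = []
        self.dirty_cas = False

    def watch(self, key):
        self.store.watched_by.setdefault(key, set()).add(self)
        self.watched.add(key)

    def multi(self):
        self.queue = []

    def set(self, key, value):
        self.queue.append((key, value))  # queued, not executed

    def exec(self):
        """Run the queue atomically, or return None if a watched key changed."""
        for key in self.watched:  # EXEC always unwatches everything
            self.store.watched_by.get(key, set()).discard(self)
        queued, aborted = self.queue, self.dirty_cas
        self.watched, self.queue, self.dirty_cas = set(), [], False
        if aborted:
            return None
        for key, value in queued:
            self.store.write(key, value)
        return ["OK"] * len(queued)
```

The usual client pattern is a retry loop: WATCH, read, MULTI, queue writes, EXEC, and start over whenever EXEC returns null.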
Lua scripts (EVAL, EVALSHA) run atomically inside the single
thread, which is exactly the property that makes them powerful and dangerous: a 200 ms script halts every
other client. Scripts cannot call blocking commands. Since Redis 5, scripts are replicated by their effects (the write
commands they issue) rather than by re-running the script, which sidesteps non-determinism. Redis 7 dropped verbatim script
replication entirely, so redis.replicate_commands() is now a no-op.
Redis 7 introduced Functions: a persistent, replicated form of Lua libraries with explicit
naming and lifecycle (FUNCTION LOAD, FCALL). The EVALSHA script cache is lost on restart, so clients get NOSCRIPT
and must resend the source. Functions, by contrast, are stored in RDB/AOF and replicate cleanly to replicas.
Streams: A Log-Structured Type
Streams (XADD, XRANGE, XREAD, XREADGROUP) are an
append-only log whose entry IDs have the form <milliseconds>-<sequence>. They support consumer groups with at-least-once
semantics: each consumer in a group sees a disjoint subset of messages and must XACK them once
processed. Unacked messages stay in the Pending Entries List (PEL), tracked per group and per consumer. Other consumers can
XCLAIM them (or, since 6.2, XAUTOCLAIM them) once they have been idle past a chosen threshold, providing failover.
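The group bookkeeping reduces to a last-delivered ID plus the PEL. The sketch below rests on several assumptions. next_stream_id models auto-generated IDs, which never go backwards even if the clock does, and ignores sequence overflow. ConsumerGroup is a hypothetical class whose read_new, ack and claim_idle roughly correspond to XREADGROUP with ">", XACK and XAUTOCLAIM.

```python
def next_stream_id(last, now_ms):
    """Auto-generated XADD ID (<ms>-<seq>) that never goes backwards."""
    last_ms, last_seq = last
    if now_ms > last_ms:
        return (now_ms, 0)
    return (last_ms, last_seq + 1)  # same millisecond, or the clock went backwards


class ConsumerGroup:
    def __init__(self, entries):
        self.entries = entries        # list of (id, fields) in ascending ID order
        self.last_delivered = (0, 0)
        self.pel = {}                 # id -> [consumer, delivery_time_ms, delivery_count]

    def read_new(self, consumer, now_ms, count=10):
        """Hand out never-delivered entries and record them as pending."""
        out = []
        for eid, fields in self.entries:
            if len(out) == count:
                break
            if eid > self.last_delivered:
                self.pel[eid] = [consumer, now_ms, 1]
                self.last_delivered = eid
                out.append((eid, fields))
        return out

    def ack(self, *ids):
        """Drop entries from the PEL; return how many were actually pending."""
        return sum(1 for eid in ids if self.pel.pop(eid, None) is not None)

    def claim_idle(self, consumer, min_idle_ms, now_ms):
        """Take over entries pending longer than min_idle_ms, bumping their delivery count."""
        claimed = []
        for eid in sorted(self.pel):
            owner, delivered_at, count = self.pel[eid]
            if now_ms - delivered_at >= min_idle_ms:
                self.pel[eid] = [consumer, now_ms, count + 1]
                claimed.append(eid)
        return claimed
```

The delivery count is the signal a consumer can use to divert a message that keeps failing, for example to a dead-letter stream, instead of claiming it forever.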
Streams aren't Kafka. There is no partitioning within a stream. Throughput per stream is bounded by the single Redis thread's command rate. They shine for queue-shaped workloads in the 10k–100k msg/sec range where you already run Redis and don't want to introduce another moving part. Beyond that, run Kafka.
Tradeoffs and When Not To Use Redis
Redis is not a database in the durability sense most engineers expect. With AOF enabled at its default everysec policy, you
can lose about one second of writes on power loss. With the out-of-the-box RDB-only configuration, you can lose everything
since the last snapshot. Synchronous replication doesn't exist as a primary-side guarantee: the best you have is WAIT per-call.
Cluster mode is AP. Memory is your storage limit (Redis on Flash via Enterprise/KeyDB exists but has different ergonomics).
And the single thread is a hard ceiling: a workload that genuinely needs 4 cores worth of CPU on commands cannot scale up,
only out, via cluster.
If you need: durable transactions, complex queries, joins, multi-key consistency across shards — use a real database. If you need: a multi-million-ops/sec cache, a fast counter store, a leaderboard, a cooperative job queue, a pub/sub bus, or a session store — Redis is hard to beat.
Redis vs Other Caches & In-Memory Stores
| | Redis | Dragonfly | KeyDB | Memcached |
|---|---|---|---|---|
| Threading | Single (+ I/O threads) | Shared-nothing multi-thread | Multi-thread w/ locks | Multi-thread, slabs |
| Data structures | 9+ rich types | Redis-compatible | Redis-compatible | String only |
| Persistence | RDB + AOF | DflySnapshot (point-in-time) | FLASH + RDB/AOF | None (volatile) |
| Replication | Async + Sentinel/Cluster | Async | Active-active multi-master | None (clients shard via hashing) |
| Cluster mode | 16384 slots, gossip | Single-node (sharding planned) | Active replication | None native |
| Best for | The default in-memory choice | Vertical scale on big boxes | Active-active geo | Pure cache, rarely changing topology |
| Wire compatibility | RESP2/3 (canonical) | RESP-compatible | RESP-compatible | memcached protocol |
FAQ
Why is single-threaded Redis faster than multi-threaded competitors for many workloads?
Because most Redis commands are O(1) memory reads/writes. For small GET/SET, most of the time goes to the kernel network stack (read/write syscalls, packet processing), not to executing the command. That is why I/O threads help there while extra command-execution threads mostly don't. Threads that run commands only pay off if the commands themselves are CPU-bound, which is why Dragonfly and KeyDB win on workloads dominated by big set operations or many concurrent slow commands.
How durable is Redis really?
With AOF at the default everysec policy and async replication, you can lose up to about one second of writes on a single-node power loss. On a primary failure, you lose whatever the promoted replica had not yet received, which is bounded only by replication lag. appendfsync always + WAIT tighten this dramatically but cost throughput. Redis is not a substitute for a transactional database when durability is non-negotiable.
When should I use Redis Cluster vs Sentinel?
Sentinel if your dataset fits on one node and you want HA without changing the data model. Cluster if you need to shard. Cluster requires that multi-key operations stay within one slot (use hash tags: {user:42}:profile, {user:42}:cart live on the same slot), and Lua/transactions can't span slots. That's a real client-side burden — don't adopt Cluster unless you've actually exhausted vertical scaling.
Why does my Redis spike to high latency during BGSAVE?
Almost always copy-on-write fork costs, made worse by transparent huge pages (THP). When the parent process modifies pages after the fork, Linux must duplicate them. With THP enabled, each fault duplicates 2 MB instead of 4 KB. Disable THP at the OS level. Also check vm.overcommit_memory=1 so fork doesn't fail on memory-tight nodes.
Should I use Redis Streams or Kafka?
Streams if you already run Redis, your throughput is below ~100k msg/sec per stream, and you want at-least-once consumer groups with minimal ops. Kafka if you need partitioning within a topic, retention measured in days/weeks rather than memory, exactly-once semantics across producers/consumers, or compaction.
Is Redis a good database (not just a cache)?
For specific shapes, yes: leaderboards, rate limiters, session stores, real-time counters, geo lookups. The data model maps directly to native types and a single instance handles enormous load. For anything needing joins, multi-row transactions, secondary indexes, or complex queries, use a real database and put Redis in front of it.
What's the deal with Valkey?
After Redis Inc. switched the upstream license to RSALv2/SSPL in March 2024, the Linux Foundation forked the last BSD-licensed Redis 7.2 as Valkey. Major cloud vendors (AWS, GCP, Oracle) and the Redis community now back Valkey as the open-source continuation. The protocol and data structures are identical; if you're starting fresh with no enterprise contract, Valkey is the open-source default.