Redis Internals

Redis is a single-threaded, in-memory data structure server written in roughly 130k lines of C. The headline number, over a million operations per second on a single core with pipelined requests, is not a benchmark trick. It is what falls out of an extremely tight architecture: one thread, one event loop, hand-written data structures, and a network protocol designed to be parsed in nanoseconds. This hub dissects the trade-offs that make that number real, and the ones (replication lag, fork-and-snapshot stalls, slot migration, cluster split-brain) that bite you in production.

Grounded in the Redis 7.x source tree, the official RESP3 spec, and Salvatore Sanfilippo's design notes.

Architecture at a Glance

Diagram: a single Redis process. Clients (TCP or Unix sockets, speaking RESP2/RESP3) connect to the ae event loop (epoll/kqueue/evport), which runs readQueryFromClient(), processCommand(), and addReply() into the client's output buffer. Commands operate on the keyspace (db[0..15], each a dict plus an expire dict) holding strings, lists, hashes, sets, sorted sets, streams, bitmaps, HyperLogLogs, geo indexes, and module types (JSON, search…). Persistence: RDB snapshots (fork) and the AOF append-only log. Replication: async PSYNC2 with a backlog, via replicaof, Sentinel, or Cluster. One thread executes commands; I/O threads (since 6.0) only parse and write the socket.

Key Numbers

Threads (data path): 1
Cluster slots: 16,384
Max key/value size: 512 MB
Active expire frequency (hz): 10 per second
Listpack threshold: 128 entries
Replication backlog: 1 MB (default)
RESP protocol: RESP2 by default, RESP3 via HELLO 3

Why Redis Exists

The Gap
In 2009, web apps needed an out-of-process cache and a queue. Memcached was a flat key-value store with no data structures. SQL databases were hundreds of microseconds per round-trip. There was no in-memory server that spoke lists, sets, and counters natively.
The Insight
If everything fits in RAM, you don't need locks, B-trees, or query planning. A single thread serving a 10 Gbit NIC will saturate it before CPU becomes the bottleneck. You can therefore expose data structures (not just values) as first-class primitives without compromising throughput.
The Result
Redis ships rich, atomic operations on lists, sorted sets, hashes, and streams at sub-millisecond latency, while staying simple enough that a competent engineer can read the entire source over a long weekend. Operations that used to take several round-trips (read, modify, write back) become single atomic commands, most of them O(1) or O(log N).

Persistence (RDB + AOF)

Snapshots, append-only files, and the hybrid format — how an in-memory database survives crashes without paying the latency cost on every write

Single-Threaded Event Loop

Why one thread handling 100k+ requests/sec beats a thread pool — epoll/kqueue, the ae library, and the cost of a single slow command

Data Structure Encodings

intset, ziplist, listpack, quicklist, skiplist — Redis silently swaps representations to keep small collections cache-resident

Cluster Mode (16384 Slots)

Hash slots, gossip protocol, MOVED/ASK redirects, and why Redis Cluster picks AP over CP — the sharding model that scales horizontally

Replication & Sentinel

Async replication, PSYNC2, the replication backlog, and how Sentinel orchestrates failover through quorum agreement and a leader election rather than a replicated-log consensus protocol

Streams & Consumer Groups

XADD, XREADGROUP, pending entries lists — the log-structured data type that turns Redis into a lightweight Kafka

Expiration & Eviction

Lazy + active expiration, the eight maxmemory policies, and how approximate-LRU/LFU keep eviction cheap

Redis vs Dragonfly vs KeyDB

Multi-threaded servers that speak the Redis protocol (KeyDB is a fork of Redis; Dragonfly is a from-scratch reimplementation): what changed, what stayed the same, and which workloads actually benefit

The Single-Threaded Event Loop

Redis runs every command on one thread. The aeEventLoop in src/ae.c is a thin portability wrapper over epoll on Linux, kqueue on BSD/macOS, and evport on Solaris. Each tick the loop calls aeApiPoll(), processes ready file events, then runs any time events whose deadline has passed. There are no mutexes on the keyspace because there is nothing to contend with.
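To make the tick concrete, here is a minimal Python sketch of an ae-style loop. It is an analogy, not a port of src/ae.c: selectors.DefaultSelector stands in for aeApiPoll() (it picks epoll or kqueue much as ae does), the method names create_file_event and create_time_event only echo ae's API, and timers are one-shot here, whereas ae's time events can reschedule themselves.

```python
import heapq
import itertools
import selectors
import socket
import time


class EventLoop:
    """Toy analogue of aeEventLoop: readable-socket handlers plus one-shot timers."""

    def __init__(self):
        # DefaultSelector picks epoll on Linux and kqueue on BSD/macOS.
        self.selector = selectors.DefaultSelector()
        self.timers = []               # min-heap of (deadline, seq, callback)
        self.seq = itertools.count()   # tie-breaker so callbacks are never compared

    def create_file_event(self, sock, callback):
        # Call callback(sock) whenever the socket becomes readable.
        self.selector.register(sock, selectors.EVENT_READ, callback)

    def create_time_event(self, delay_ms, callback):
        deadline = time.monotonic() + delay_ms / 1000.0
        heapq.heappush(self.timers, (deadline, next(self.seq), callback))

    def process_events(self):
        # Block in the poller only until the nearest timer is due.
        timeout = None
        if self.timers:
            timeout = max(0.0, self.timers[0][0] - time.monotonic())
        for key, _mask in self.selector.select(timeout):
            key.data(key.fileobj)      # each handler runs to completion
        # Then fire every timer whose deadline has passed.
        now = time.monotonic()
        while self.timers and self.timers[0][0] <= now:
            _, _, callback = heapq.heappop(self.timers)
            callback()


if __name__ == "__main__":
    a, b = socket.socketpair()
    loop = EventLoop()
    loop.create_file_event(a, lambda s: print("read", s.recv(1024)))
    loop.create_time_event(0, lambda: print("timer fired"))
    b.sendall(b"PING")
    loop.process_events()
```

Because each handler runs to completion before the next poll, a slow callback delays every other socket and timer: the single-slow-command problem in miniature.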

Reading from a client socket calls readQueryFromClient(), which appends bytes to a per-client query buffer. processInputBuffer() then parses as many complete RESP commands as the buffer holds (both the multibulk format and the legacy inline format); for each complete command it calls processCommand(), which performs auth checks, looks up the command table entry, dispatches to the C function, and finally enqueues the reply via addReply(). The reply lands first in a small per-client buffer (16 KB, PROTO_REPLY_CHUNK_BYTES) and overflows into a list of clientReplyBlock chunks for large responses. Nothing is written to the socket while a command runs: clients with pending replies are queued and flushed in beforeSleep() just before the loop polls again, and a writable-event handler is installed only if the kernel cannot take the whole reply at once.
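The framing that processInputBuffer() consumes in the common case is RESP's multibulk form: *&lt;count&gt;, then one $&lt;length&gt;-prefixed payload per argument, all CRLF-terminated. A minimal Python sketch of that parse, assuming only the multibulk form (not the inline protocol) and a hypothetical parse_multibulk() helper. It returns None on a partial command, the situation in which Redis leaves the bytes in the query buffer and waits for the next read.

```python
def parse_multibulk(buf: bytes):
    """Parse one RESP multibulk command (*<n> then n $<len> payloads) from buf.

    Returns (args, bytes_consumed), or None when the command is still incomplete.
    """
    if not buf:
        return None
    if buf[:1] != b"*":
        raise ValueError("expected a RESP array ('*')")
    end = buf.find(b"\r\n")
    if end == -1:
        return None                              # count line not complete yet
    count = int(buf[1:end])
    pos = end + 2
    args = []
    for _ in range(count):
        end = buf.find(b"\r\n", pos)
        if end == -1:
            return None                          # length line not complete yet
        if buf[pos:pos + 1] != b"$":
            raise ValueError("expected a bulk string ('$')")
        length = int(buf[pos + 1:end])
        start = end + 2
        if len(buf) < start + length + 2:
            return None                          # payload or its CRLF still missing
        args.append(buf[start:start + length])
        pos = start + length + 2                 # skip payload and trailing CRLF
    return args, pos


if __name__ == "__main__":
    cmd = b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"
    print(parse_multibulk(cmd))        # ([b'GET', b'foo'], 22)
    print(parse_multibulk(cmd[:-3]))   # None: wait for more bytes
```

Returning the consumed length lets the caller trim the buffer and loop, which is how several pipelined commands are served from a single read.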

Since 6.0, Redis ships I/O threads (the io-threads config). They are misnamed: they only parallelise socket reads and writes (plus protocol parsing), never command execution, and by default they handle only writes; reads move to them only with io-threads-do-reads yes. The main thread still owns the keyspace alone. This buys you 2–3x throughput on workloads dominated by network I/O (large GETs, big pipelines), but does nothing for CPU-heavy commands like SUNIONSTORE over millions of elements.

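The cost is sharp: any single command that takes 50 ms blocks every other client for 50 ms. KEYS * on a million-key database, a synchronous FLUSHDB, a large LRANGE, even Lua scripts are all head-of-line blockers. The remedies come in two forms: SCAN and its HSCAN/SSCAN/ZSCAN siblings iterate incrementally with a cursor, a few elements per call, while UNLINK and FLUSHDB ASYNC remove keys from the keyspace immediately and hand the memory reclamation to a background thread.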
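SCAN's cursor is not an offset into the keyspace. Redis's dictScan() advances it by incrementing the bit-reversed bucket index, so buckets that split when the table doubles are visited consecutively and none present for the whole scan is skipped. A Python sketch of that cursor arithmetic, assuming 64-bit cursors, a power-of-two table whose size is mask + 1, keys whose hash equals the key itself, and no incremental rehash in progress (real Redis also walks the second table during a rehash). The helper names are illustrative.

```python
MASK64 = (1 << 64) - 1


def rev64(v: int) -> int:
    """Reverse the bit order of a 64-bit unsigned integer."""
    return int(format(v, "064b")[::-1], 2)


def next_cursor(cursor: int, mask: int) -> int:
    """Advance a SCAN cursor by incrementing its bit-reversed value."""
    v = cursor | (~mask & MASK64)   # set every bit above the table mask
    v = rev64(v)
    v = (v + 1) & MASK64            # increment in reversed bit order
    return rev64(v)


def scan_order(mask: int) -> list:
    """Bucket indexes visited by a full scan of a table with mask + 1 buckets."""
    order, cursor = [], 0
    while True:
        order.append(cursor & mask)
        cursor = next_cursor(cursor, mask)
        if cursor == 0:             # cursor 0 again means the scan is complete
            return order


def scan_keys_with_resize(n_keys, small_mask, big_mask, grow_after):
    """Scan keys 0..n_keys-1 (hash = key) while the table grows from
    small_mask + 1 to big_mask + 1 buckets after `grow_after` calls."""
    returned, cursor, calls = set(), 0, 0
    while True:
        mask = small_mask if calls < grow_after else big_mask
        returned |= {k for k in range(n_keys) if (k & mask) == (cursor & mask)}
        cursor = next_cursor(cursor, mask)
        calls += 1
        if cursor == 0:
            return returned


if __name__ == "__main__":
    print(scan_order(7))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Growing the table from 8 to 16 buckets partway through still returns every key, because each already-visited small bucket maps onto big buckets the reversed cursor has already passed. The price is that some elements can be returned twice (for example when a table shrinks), which is why SCAN only guarantees at-least-once delivery for keys present throughout.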

Native Data Structures and Their Encodings

Redis exposes nine high-level types (string, list, hash, set, zset, stream, bitmap, hyperloglog, geospatial), although bitmaps and HyperLogLogs are stored as strings and geospatial indexes as sorted sets. Each storage type has multiple internal encodings that the engine swaps based on size and content. The swap is invisible to clients (OBJECT ENCODING key reveals it) but dominates memory cost and command complexity.

Strings are stored as sds (Simple Dynamic String): a length-prefixed buffer with O(1) length lookup. A string whose value is the canonical decimal form of a signed 64-bit integer (a C long) is stored as that integer (OBJ_ENCODING_INT), saving the SDS allocation entirely. Other strings of at most 44 bytes use EMBSTR: the SDS lives in the same allocation as the robj header, so creating one takes one allocator call instead of two.
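The encoding rules above can be approximated in a few lines. This Python sketch predicts what OBJECT ENCODING reports for a value just written with SET. string_encoding() is a hypothetical helper, and it ignores details such as shared integer objects and the switch to raw after APPEND or SETRANGE.

```python
EMBSTR_SIZE_LIMIT = 44  # Redis's EMBSTR cutoff in bytes


def string_encoding(value: bytes) -> str:
    """Approximate OBJECT ENCODING for a value freshly written with SET."""
    if len(value) <= 20:  # longest decimal form of a signed 64-bit integer
        try:
            n = int(value.decode("ascii"))
        except (UnicodeDecodeError, ValueError):
            n = None
        # Only canonical decimal forms qualify: no '+', leading zeros, spaces, or '_'.
        if n is not None and str(n).encode() == value and -2**63 <= n < 2**63:
            return "int"
    return "embstr" if len(value) <= EMBSTR_SIZE_LIMIT else "raw"
```

For example, b"12345" maps to "int", b"007" to "embstr" (leading zeros are not canonical), and a 45-byte value to "raw".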

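Lists use quicklist: a doubly linked list of listpack nodes (since 7.2, a small list stays a single plain listpack until it outgrows list-max-listpack-size). A listpack is a compact serialised array (variable-width entries, no pointers) tuned to fit in a few cache lines. list-max-listpack-size bounds each node, either by entry count or, with the default of -2, by size (8 KB). The result is close to array-like iteration locality with cheap pushes and pops at both ends, as in a linked list; an insert in the middle only shifts bytes within one node.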
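A quicklist's shape is easy to model. This Python sketch uses plain Python lists as stand-ins for listpacks and caps each node by entry count only (the positive form of list-max-listpack-size); the class and method names are illustrative, not Redis's.

```python
class QuickList:
    """Toy quicklist: a doubly linked list whose nodes are small bounded arrays."""

    class Node:
        __slots__ = ("entries", "prev", "next")

        def __init__(self):
            self.entries, self.prev, self.next = [], None, None

    def __init__(self, fill: int = 128):
        self.fill = fill              # max entries per node
        self.head = self.tail = None

    def push_tail(self, value):
        # Append into the tail node while it has room; otherwise link a new node.
        if self.tail is None or len(self.tail.entries) >= self.fill:
            node = QuickList.Node()
            node.prev = self.tail
            if self.tail:
                self.tail.next = node
            else:
                self.head = node
            self.tail = node
        self.tail.entries.append(value)

    def __iter__(self):
        # Iteration walks contiguous arrays, following one pointer per node.
        node = self.head
        while node:
            yield from node.entries
            node = node.next

    def node_sizes(self):
        sizes, node = [], self.head
        while node:
            sizes.append(len(node.entries))
            node = node.next
        return sizes
```

Pushing ten values with fill=4 yields nodes of 4, 4, and 2 entries: three pointer hops to iterate instead of ten.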

Hashes and sorted sets use a listpack when small (by default at most 128 entries, each at most 64 bytes), then convert: hashes to a hashtable, sorted sets to a skiplist paired with a dict that maps members to scores. The skiplist is a classic Pugh-style randomised structure with two Redis additions: a backward pointer on the bottom level for reverse traversal, and a span on each forward link so that ZRANK and rank-based ranges run in O(log N). Sets of small integers stay in an intset, a sorted integer array searchable by binary search.
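For the skiplist, a stripped-down Python sketch shows the randomised levels and the score-ordered walk behind ZRANGEBYSCORE. It assumes Redis's ordering (score first, then member) and promotion probability p = 0.25, but omits the backward pointers, spans, and the companion dict; SkipList and range_by_score are illustrative names.

```python
import random

MAX_LEVEL = 32   # level cap (ZSKIPLIST_MAXLEVEL in recent Redis)
P = 0.25         # promotion probability (ZSKIPLIST_P)


class Node:
    __slots__ = ("score", "member", "forward")

    def __init__(self, score, member, level):
        self.score, self.member = score, member
        self.forward = [None] * level    # one next-pointer per level


class SkipList:
    def __init__(self, rng=None):
        self.head = Node(float("-inf"), None, MAX_LEVEL)
        self.level = 1
        self.rng = rng or random.Random()

    def _random_level(self):
        level = 1
        while self.rng.random() < P and level < MAX_LEVEL:
            level += 1
        return level

    def insert(self, score, member):
        # Record, per level, the last node that sorts before (score, member).
        update = [self.head] * MAX_LEVEL
        x = self.head
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] and (x.forward[i].score, x.forward[i].member) < (score, member):
                x = x.forward[i]
            update[i] = x
        level = self._random_level()
        self.level = max(self.level, level)
        node = Node(score, member, level)
        for i in range(level):           # splice the node in at each of its levels
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def range_by_score(self, lo, hi):
        """Members with lo <= score <= hi, in order (like ZRANGEBYSCORE)."""
        x = self.head
        for i in range(self.level - 1, -1, -1):   # descend to the first score >= lo
            while x.forward[i] and x.forward[i].score < lo:
                x = x.forward[i]
        x = x.forward[0]
        out = []
        while x and x.score <= hi:
            out.append(x.member)
            x = x.forward[0]
        return out
```

The random levels make the upper lists act as express lanes, so the descent to the first matching score touches O(log N) nodes on average; the range itself is then a plain walk along the bottom level.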

Streams use a radix tree (rax) keyed by the stream ID, with each node holding a listpack of entries. The radix tree gives prefix iteration; the listpacks amortise the per-entry overhead. Consumer group state lives in separate radix trees: each group keeps a pending entries list (PEL), and each consumer in the group keeps its own PEL of the entries it currently holds.
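Prefix iteration in the rax works because stream IDs are written into the tree as 128-bit big-endian keys (milliseconds, then sequence number), so byte order equals numeric order. A small Python sketch of that encoding, with hypothetical helper names:

```python
import struct


def encode_stream_id(ms: int, seq: int) -> bytes:
    """16-byte big-endian key for stream ID <ms>-<seq> (both unsigned 64-bit)."""
    return struct.pack(">QQ", ms, seq)


def decode_stream_id(key: bytes) -> str:
    ms, seq = struct.unpack(">QQ", key)
    return f"{ms}-{seq}"
```

Sorting the raw keys as bytes gives the same order as comparing (ms, seq) numerically, which a little-endian layout would not.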

Persistence: RDB, AOF, and the Hybrid Format

Redis offers two on-disk formats and most production setups use both. RDB is a point-in-time binary snapshot. BGSAVE and the configured save thresholds fork(); the child inherits a copy-on-write view of the dataset and writes it to a single file using a compact custom encoding (varints, listpack inlining, optional LZF compression) while the parent keeps serving traffic. SAVE, by contrast, writes the snapshot on the main thread and blocks every client until it finishes. RDB files load far faster than replaying an equivalent AOF because the on-disk format is close to the in-memory layout.
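The save thresholds are pairs of seconds and changes, checked periodically from serverCron. A Python sketch of that check, assuming the Redis 7 default save points (3600 1, 300 100, 60 10000) and leaving out the real retry and running-child guards; should_bgsave() is a hypothetical name.

```python
DEFAULT_SAVE_POINTS = [(3600, 1), (300, 100), (60, 10000)]  # (seconds, changes)


def should_bgsave(now, last_save, dirty, save_points=DEFAULT_SAVE_POINTS):
    """True if any save point is met: at least `changes` writes since the last
    successful save and more than `seconds` elapsed since it."""
    elapsed = now - last_save
    return any(dirty >= changes and elapsed > seconds
               for seconds, changes in save_points)
```

So 150 writes trigger a snapshot once 300 seconds have passed, while a burst of 5,000 writes within 100 seconds does not trigger one until either 60-second/10,000-write or 300-second/100-write condition holds.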

AOF is an append-only log of every write command. It is durable to whatever fsync policy you choose: always (every write — disk-bound), everysec (a background thread flushes once per second — lose ≤1 s on power loss), or no (let the kernel decide). AOF files grow and are compacted by BGREWRITEAOF, which forks and writes the minimum command stream that reconstructs the current state.
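The rewrite's effect can be shown on a toy log. In this Python sketch replay() stands in for the dataset the forked child serialises (the real BGREWRITEAOF reads memory, not the old AOF, and emits variadic commands such as RPUSH with many elements); only SET, DEL, and INCR are modelled, and both function names are illustrative.

```python
def replay(log):
    """Apply a command log to an empty keyspace (SET, DEL, INCR only)."""
    db = {}
    for cmd, key, *args in log:
        if cmd == "SET":
            db[key] = args[0]
        elif cmd == "DEL":
            db.pop(key, None)
        elif cmd == "INCR":
            db[key] = str(int(db.get(key, "0")) + 1)
        else:
            raise ValueError(f"unsupported command {cmd}")
    return db


def rewrite(log):
    """Emit the shortest SET-only log that rebuilds the same keyspace."""
    return [("SET", key, value) for key, value in replay(log).items()]
```

Six logged commands (a SET and two INCRs on one key, a SET then DEL on another, an INCR on a third) collapse to two SETs; the deleted key leaves no trace.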

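Modern Redis uses aof-use-rdb-preamble yes by default. A rewrite starts the AOF with an RDB snapshot of the keyspace at fork time, followed by the commands that arrive afterwards. Since 7.0 these live in separate files: a base file in RDB format plus incremental AOF files, tracked by a manifest in the appenddirname directory. This combines RDB's compactness with AOF's durability.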

The hidden cost of fork() on a large dataset is COW page faults. Every page the parent writes after the fork must be duplicated, because the child still needs the original. On a 100 GB instance with high write churn, total memory use can grow by tens of gigabytes while the snapshot runs, and Linux's transparent huge pages turn 4 KB copies into 2 MB copies, a frequent cause of latency spikes in production. Disabling THP is in the Redis admin checklist for a reason.
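The THP penalty is simple arithmetic: each distinct page the parent touches is copied once, so the page size multiplies the cost of scattered writes. A worst-case Python model, assuming every write dirties one not-yet-copied page (real numbers depend on allocator layout and write locality); cow_copy_bytes() is a hypothetical name.

```python
def cow_copy_bytes(write_offsets, page_size):
    """Upper bound on memory copied after fork(): each distinct page the parent
    writes to is duplicated once, whatever the size of the write."""
    return len({off // page_size for off in write_offsets}) * page_size


KB4 = 4 * 1024
MB2 = 2 * 1024 * 1024
writes = [i * 3 * 1024 * 1024 for i in range(1000)]   # 1,000 writes, 3 MB apart
```

For those 1,000 scattered writes the model gives about 4 MB of copies with 4 KB pages and about 2 GB with 2 MB huge pages: a 512x amplification from the page size alone.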

Replication: Async, Backlog-Driven, Eventually Consistent

Redis replication is asynchronous. The primary acknowledges the client as soon as the command runs locally, then propagates to replicas. Each replica connects, issues PSYNC with its replication ID and offset, and either resumes from the in-memory replication backlog (a circular buffer, default 1 MB) or receives a full sync — an RDB snapshot followed by the buffered commands accumulated during snapshotting.
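The backlog is a ring buffer addressed by the global replication offset. This Python sketch, with simplified offset bookkeeping and illustrative names, shows the partial-versus-full decision: a replica can resume only if its offset still falls inside the retained window.

```python
class ReplicationBacklog:
    """Fixed-size circular buffer of the replication stream, indexed by the
    global replication offset (the idea behind repl-backlog-size)."""

    def __init__(self, size: int):
        self.size = size
        self.buf = bytearray(size)
        self.master_offset = 0          # total bytes ever written to the stream

    def feed(self, data: bytes):
        for b in data:
            self.buf[self.master_offset % self.size] = b
            self.master_offset += 1

    @property
    def start_offset(self):
        # Oldest offset still held in the buffer.
        return max(0, self.master_offset - self.size)

    def read_from(self, offset: int):
        """Bytes from `offset` to the end of the stream, or None when the
        offset has fallen out of the window and a full sync is required."""
        if offset < self.start_offset or offset > self.master_offset:
            return None
        return bytes(self.buf[i % self.size] for i in range(offset, self.master_offset))
```

With an 8-byte backlog that has seen 10 bytes, a replica at offset 4 can catch up from memory, but one at offset 1 needs a full sync: the same reason a 1 MB default backlog is often too small for replicas that disconnect for more than a moment under heavy write load.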

PSYNC2 (Redis 4.0+) lets replicas survive primary failover: a promoted replica switches to a new replication ID but remembers the old one (replid2) along with the offset where it took over, so the other replicas of the old primary can partial-resync against it instead of doing a full sync. WAIT numreplicas timeout lets you opt into synchronous-style acknowledgement on a per-call basis, useful for "credit transfer" style operations.
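The acceptance test behind PSYNC2 can be sketched in Python, following the checks in masterTryPartialResynchronization(): the requested ID must be the node's current replication ID, or its previous one (replid2) at an offset no later than where the history forked, and the offset must still be in the backlog. psync_reply() is a hypothetical helper, and the real replies also carry an ID and offset.

```python
def psync_reply(req_replid, req_offset, replid, replid2, second_replid_offset,
                backlog_start, backlog_end):
    """Return '+CONTINUE' if a partial resync is possible, else '+FULLRESYNC'."""
    same_history = req_replid == replid or (
        req_replid == replid2 and req_offset <= second_replid_offset)
    in_backlog = backlog_start <= req_offset <= backlog_end
    return "+CONTINUE" if same_history and in_backlog else "+FULLRESYNC"
```

After a failover where the new primary's ID is "B" and it remembers the old ID "A" up to offset 1000, a replica asking for "A" at offset 900 continues incrementally, while one asking for "A" at 1100 (history the new primary never saw) gets a full sync.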

The headline correctness pitfall: during a network partition, a primary cut off from its replicas keeps acknowledging writes, and those writes are lost when a replica on the other side is promoted. min-replicas-to-write + min-replicas-max-lag make the primary reject writes whenever fewer than N replicas have acknowledged within the last M seconds, which bounds the loss window. This is the closest Redis gets to a synchronous replication knob.
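The write guard reduces to a count. A Python sketch of the rule, with a hypothetical helper name; lags are seconds since each replica's last acknowledgement, and the parameter defaults mirror the config defaults (0, which disables the check, and 10 seconds).

```python
def write_allowed(replica_lags, min_replicas_to_write=0, min_replicas_max_lag=10):
    """Would the primary accept a write under min-replicas-to-write /
    min-replicas-max-lag?"""
    if min_replicas_to_write == 0:
        return True                  # feature disabled
    good = sum(1 for lag in replica_lags if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write
```

A primary isolated from all replicas sees every lag grow past the limit, so after at most min-replicas-max-lag seconds it starts refusing writes instead of silently accepting ones that a failover would discard.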

Sentinel and Cluster: Two Different HA Models

Sentinel monitors a single primary + N replicas. Sentinels gossip with one another; a Sentinel that cannot reach the primary marks it subjectively down (SDOWN), and once a quorum of Sentinels agree it becomes objectively down (ODOWN). The Sentinels then elect a leader, which needs votes from a majority of them, to orchestrate the failover. The dataset stays a single shard. Clients query Sentinels for the current primary's address, then connect directly. Sentinel solves "who is master?", not horizontal scaling.

Redis Cluster is a sharded, peer-to-peer multi-primary system with no external coordinator. The keyspace is partitioned into 16384 hash slots: HASH_SLOT(key) = CRC16(key) mod 16384 (computed as & 16383), where a key containing a non-empty {...} hash tag hashes only the text inside the first tag. Each primary owns a subset of slots; replicas follow specific primaries. Nodes gossip over the cluster bus (a binary protocol on a separate port, by default the client port + 10000) about slot assignments, node liveness, and configuration epochs.
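The slot function is short enough to reproduce. This Python sketch uses the CRC16-CCITT (XMODEM) variant given in the Redis Cluster specification, whose check value for "123456789" is 0x31C3, and applies the hash-tag rule; the function names are illustrative.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def key_hash_slot(key: bytes) -> int:
    """Slot for a key; if it has a '{...}' with at least one byte inside,
    only that substring is hashed."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:              # non-empty tag
            key = key[start + 1:end]
    return crc16_xmodem(key) & 16383     # same as % 16384
```

With this rule {user:42}:profile, {user:42}:cart, and plain user:42 all land in one slot, while foo{}{bar} is hashed whole because its first tag is empty.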

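A client connects to any node. If it issues a command for a slot owned elsewhere, it gets a MOVED reply with the correct address; the client should update its slot map and retry. During slot migration, a source node that no longer holds the requested key answers with ASK instead: the client retries that one command on the migration target, preceded by ASKING, and does not update its slot map, so later commands for the slot still go to the source node until the migration completes and MOVED takes over.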
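Client-side, the two redirections differ only in whether the cached slot map changes. A Python sketch, assuming the error text arrives without the leading '-' and a plain dict as the slot map; handle_redirect() is hypothetical, not a client-library API.

```python
def handle_redirect(error: str, slot_map: dict):
    """Interpret a cluster redirection such as 'MOVED 3999 127.0.0.1:6381'.

    Returns (address, send_asking). MOVED updates the cached slot map; ASK is a
    one-shot hop: send ASKING, then the command, and leave the map unchanged.
    """
    parts = error.split()
    if len(parts) == 3 and parts[0] == "MOVED":
        slot_map[int(parts[1])] = parts[2]
        return parts[2], False
    if len(parts) == 3 and parts[0] == "ASK":
        return parts[2], True
    raise ValueError(f"not a cluster redirection: {error!r}")
```

Keeping ASK out of the slot map matters: if the client cached the target, it would send commands for keys not yet migrated to a node that does not have them.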

Cluster makes a deliberate AP choice: it favours availability over consistency. With async replication and gossip-driven failover, a network partition can leave a primary on the minority side accepting writes for slots that a replica on the majority side has already taken over; those writes are lost when the partition heals. A minority-side primary stops accepting writes once cluster-node-timeout elapses, so tuning that timeout (together with min-replicas-to-write) narrows the window but does not eliminate it. For workloads that can't tolerate it, use a CP system (TiDB, CockroachDB).

Expiration and Eviction

Each database keeps a parallel dict mapping keys to expiration timestamps. Expiration is enforced two ways. Lazy expiration: every key lookup checks the expire dict and deletes the key if it is past due, before serving. Active expiration: a periodic task (driven by hz, default 10 times per second) samples 20 keys with TTLs from each db, deletes the expired ones, and samples again while more than 10% of the sample was expired (25% before Redis 6.0), within a per-cycle CPU time budget. The probabilistic sampling keeps total expiration cost bounded.
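A simplified Python model of one active-expire pass over a single db, assuming the Redis 6.0+ threshold (continue while more than 10% of the sample was expired) and substituting a loop cap for the real CPU time budget; the function name and parameters are illustrative.

```python
import random


def active_expire_cycle(expires, now, rng, keys_per_loop=20,
                        acceptable_stale=0.10, max_loops=1000):
    """One simplified active-expire pass over a single db.

    expires maps key -> absolute expiry time. Deletes sampled keys whose time
    has passed and repeats while the expired share of the sample exceeds
    acceptable_stale. max_loops stands in for the CPU time budget.
    Returns the number of keys deleted.
    """
    deleted = 0
    for _ in range(max_loops):
        if not expires:
            break
        sample = rng.sample(list(expires), min(keys_per_loop, len(expires)))
        expired = [k for k in sample if now > expires[k]]
        for k in expired:
            del expires[k]   # Redis would also delete the key from the main dict
        deleted += len(expired)
        if len(expired) / len(sample) <= acceptable_stale:
            break            # few stale keys left: stop until the next cycle
    return deleted
```

When most TTL keys are stale the pass keeps going; when stale keys are rare it stops after one sample and leaves the remainder to lazy expiration or later cycles, which is the trade that bounds CPU cost.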

Eviction kicks in when memory exceeds maxmemory. Eight policies exist: noeviction (reject writes), allkeys-lru, allkeys-lfu, allkeys-random, volatile-lru, volatile-lfu, volatile-ttl, volatile-random. The "volatile-" variants only evict keys with a TTL set. LRU/LFU are approximate: Redis samples maxmemory-samples keys (default 5) and evicts the worst. Tuning maxmemory-samples to 10 makes it nearly indistinguishable from true LRU at modest CPU cost.
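The sampling approach, plus the small pool of candidates Redis carries between rounds (16 entries in evict.c), can be sketched in Python. Idle time is computed from a hypothetical last_access map rather than the 24-bit LRU clock, and evict_one() is an illustrative name.

```python
import random

EVPOOL_SIZE = 16   # size of the eviction candidate pool


def evict_one(last_access, now, pool, rng, samples=5):
    """Approximate LRU eviction of one key.

    last_access maps key -> last access time; pool is a list carried between
    calls. Samples `samples` keys, merges them into the pool, keeps the
    EVPOOL_SIZE idlest live candidates, and evicts the idlest of them.
    """
    for k in rng.sample(list(last_access), min(samples, len(last_access))):
        if k not in pool:
            pool.append(k)
    # Drop candidates deleted meanwhile; sort so the idlest key is last.
    pool[:] = sorted((k for k in pool if k in last_access),
                     key=lambda k: now - last_access[k])[-EVPOOL_SIZE:]
    victim = pool.pop()
    del last_access[victim]
    return victim
```

When samples covers every key this degenerates to exact LRU; with the default of 5, good candidates found in earlier rounds survive in the pool, which is why accuracy improves faster than the raw sample size suggests.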

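LFU reuses the 24-bit LRU field of the object header: 16 bits hold the last decrement time in minutes and 8 bits hold a logarithmic access counter. The counter grows probabilistically (lfu-log-factor makes each increment less likely as it rises) and is decremented by one for every lfu-decay-time minutes since its last decrement. Because growth is logarithmic, the 8-bit counter only saturates at 255 for extremely hot keys, which makes LFU genuinely usable as a long-term policy.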
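The counter arithmetic, sketched in Python after Redis's LFULogIncr() and decay logic: new keys start at 5 (LFU_INIT_VAL), increments become rarer as the counter rises, and decay counts elapsed minutes on a wrapping 16-bit clock. Function names are illustrative, and the defaults follow lfu-log-factor 10 and lfu-decay-time 1.

```python
import random

LFU_INIT_VAL = 5  # new keys start here so they are not evicted immediately


def lfu_log_incr(counter, lfu_log_factor=10, rng=random):
    """Probabilistic 8-bit counter: the higher it is, the less likely a hit
    increments it."""
    if counter == 255:
        return 255
    baseval = max(0, counter - LFU_INIT_VAL)
    p = 1.0 / (baseval * lfu_log_factor + 1)
    return counter + 1 if rng.random() < p else counter


def lfu_time_elapsed(last_minutes, now_minutes):
    """Minutes since the last decrement on a 16-bit clock that wraps at 65535."""
    now = now_minutes & 65535
    if now >= last_minutes:
        return now - last_minutes
    return 65535 - last_minutes + now


def lfu_decay(counter, last_minutes, now_minutes, lfu_decay_time=1):
    """Subtract one per elapsed lfu-decay-time period, floored at zero."""
    if lfu_decay_time == 0:
        return counter
    periods = lfu_time_elapsed(last_minutes, now_minutes) // lfu_decay_time
    return max(0, counter - periods)
```

At counter 105 with factor 10 a hit has roughly a 1-in-1,000 chance of incrementing, so reaching 255 takes on the order of a million accesses, while a key idle for an hour loses 60 points under the default decay time.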

Transactions, Scripting, and Functions

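MULTI / EXEC queue commands and execute them atomically, but Redis transactions don't roll back on error. If a queued command fails at execution time (say, INCR on a non-numeric string), the commands before and after it still run. A command rejected while queuing, such as one with a syntax error, makes EXEC abort the whole transaction instead. There are no savepoints. WATCH adds optimistic concurrency control: the transaction aborts if any WATCHed key changed before EXEC.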
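Both behaviours, no rollback and WATCH's check-and-abort, fit in a small Python model. TinyStore is hypothetical: it tracks a version per key to stand in for Redis's modified-key tracking and returns None where EXEC would return a null reply.

```python
class TinyStore:
    """Keyspace with per-key versions, enough to model MULTI/EXEC and WATCH."""

    def __init__(self):
        self.data, self.version = {}, {}

    def set(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1

    def incr(self, key):
        value = int(self.data.get(key, 0))   # raises ValueError on non-integers
        self.set(key, value + 1)
        return value + 1

    def watch(self, *keys):
        return {k: self.version.get(k, 0) for k in keys}

    def exec(self, watched, queued):
        """Run queued (method_name, args) pairs. Returns None if a watched key
        changed; otherwise per-command results, where a failing command yields
        its error and later commands still run (no rollback)."""
        if any(self.version.get(k, 0) != v for k, v in watched.items()):
            return None
        results = []
        for name, args in queued:
            try:
                results.append(getattr(self, name)(*args))
            except ValueError as err:
                results.append(err)
        return results
```

The usual pattern is a retry loop: WATCH, read, queue the writes, EXEC, and start over whenever EXEC comes back null.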

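Lua scripts (EVAL, EVALSHA) run atomically inside the single thread, which is exactly the property that makes them powerful and dangerous: a 200 ms script halts every other client. Scripts cannot call blocking commands. Instead of re-running the script, replicas receive its effects (the write commands it issued), which avoids non-determinism issues; this has been the default since Redis 5.0, and in 7.0 it became the only mode, so redis.replicate_commands() is now a no-op kept for compatibility.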

Redis 7 introduced Functions: a persistent, replicated form of Lua libraries with explicit naming and lifecycle (FUNCTION LOAD, FCALL). Unlike EVAL's fragile "load on first call" model, Functions survive restarts and replicate cleanly to replicas.

Streams: A Log-Structured Type

Streams (XADD, XRANGE, XREAD, XREADGROUP) are an append-only log identified by millisecond-precision IDs. They support consumer groups with at-least-once semantics: each consumer in a group sees a disjoint subset of messages and must XACK them once processed. Unacked messages stay in the per-consumer Pending Entries List (PEL); other consumers can XCLAIM them after a configurable idle threshold, providing failover.
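A Python sketch of consumer-group bookkeeping: reads hand out never-delivered entries and record them in the PEL, XACK removes them, and an XAUTOCLAIM-style pass (XCLAIM does the same for explicit IDs) reassigns entries idle past a threshold. Class and method names are illustrative, time is an explicit millisecond argument, and per-consumer PELs are folded into one map.

```python
class ConsumerGroup:
    """At-least-once delivery with a pending entries list (PEL).

    entries: list of (entry_id, payload) in stream order.
    """

    def __init__(self, entries):
        self.entries = entries
        self.last_delivered = -1          # index of the last entry handed out
        self.pel = {}                     # entry_id -> [consumer, delivered_at, deliveries]

    def read(self, consumer, count, now):
        """Like XREADGROUP with '>': hand out never-delivered entries."""
        start = self.last_delivered + 1
        batch = self.entries[start:start + count]
        self.last_delivered += len(batch)
        for entry_id, _ in batch:
            self.pel[entry_id] = [consumer, now, 1]
        return batch

    def ack(self, entry_id):
        """Like XACK: drop the entry from the PEL; returns 1 if it was pending."""
        return 1 if self.pel.pop(entry_id, None) else 0

    def claim(self, consumer, min_idle, now):
        """Take over pending entries idle for at least min_idle ms."""
        claimed = []
        for entry_id, info in self.pel.items():
            if now - info[1] >= min_idle:
                info[0], info[1], info[2] = consumer, now, info[2] + 1
                claimed.append(entry_id)
        return claimed
```

The delivery counter is what lets an application spot poison messages: an entry claimed over and over can be moved aside instead of being retried forever.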

Streams aren't Kafka. There is no partitioning within a stream. Throughput per stream is bounded by the single Redis thread's command rate. They shine for queue-shaped workloads in the 10k–100k msg/sec range where you already run Redis and don't want to introduce another moving part. Beyond that, run Kafka.

Tradeoffs and When Not To Use Redis

Redis is not a database in the durability sense most engineers expect. Out of the box only RDB snapshots are enabled, so a crash loses everything written since the last snapshot; even with AOF on its default everysec policy, you can lose up to about one second of writes on power loss. Synchronous replication doesn't exist as a primary-side guarantee; the best you have is WAIT per-call. Cluster mode is AP. Memory is your storage limit (Redis on Flash via Enterprise/KeyDB exists but has different ergonomics). And the single thread is a hard ceiling: a workload that genuinely needs 4 cores' worth of CPU on commands cannot scale up, only out, via cluster.

If you need: durable transactions, complex queries, joins, multi-key consistency across shards — use a real database. If you need: a multi-million-ops/sec cache, a fast counter store, a leaderboard, a cooperative job queue, a pub/sub bus, or a session store — Redis is hard to beat.

Redis vs Other Caches & In-Memory Stores

Feature | Redis | Dragonfly | KeyDB | Memcached
Threading | Single (+ I/O threads) | Shared-nothing multi-thread | Multi-thread with locks | Multi-thread, slabs
Data structures | 9+ rich types | Redis-compatible | Redis-compatible | String only
Persistence | RDB + AOF | DflySnapshot (point-in-time) | FLASH + RDB/AOF | None (volatile)
Replication | Async + Sentinel/Cluster | Async | Active-active multi-master | None native
Cluster mode | 16384 slots, gossip | Single-node (sharding planned) | Active replication | Client-side hashing
Best for | The default in-memory choice | Vertical scale on big boxes | Active-active geo | Pure cache, rarely changing topology
Wire compatibility | RESP2/3 (canonical) | RESP-compatible | RESP-compatible | memcached protocol

FAQ

Why is single-threaded Redis faster than multi-threaded competitors for many workloads?

Because most Redis commands are O(1) memory reads/writes. The bottleneck is network I/O, not CPU. A single thread saturates a 10 Gbit NIC long before it runs out of CPU on simple GET/SET. Adding threads only helps if commands themselves are CPU-bound — which is why Dragonfly and KeyDB win on workloads dominated by big set operations or many concurrent slow commands.

How durable is Redis really?

With AOF enabled at its default everysec policy and async replication, you can lose up to one second of writes on a single-node power loss, and an unbounded number of writes on a primary failure with replica lag. appendfsync always + WAIT tighten this dramatically but cost throughput. Redis is not a substitute for a transactional database when durability is non-negotiable.

When should I use Redis Cluster vs Sentinel?

Sentinel if your dataset fits on one node and you want HA without changing the data model. Cluster if you need to shard. Cluster requires that multi-key operations stay within one slot (use hash tags: {user:42}:profile, {user:42}:cart live on the same slot), and Lua/transactions can't span slots. That's a real client-side burden — don't adopt Cluster unless you've actually exhausted vertical scaling.

Why does my Redis spike to high latency during BGSAVE?

Almost always copy-on-write fork costs, made worse by transparent huge pages (THP). When the parent process modifies pages after the fork, Linux must duplicate them. With THP enabled, each fault duplicates 2 MB instead of 4 KB. Disable THP at the OS level. Also set vm.overcommit_memory = 1 so fork doesn't fail on memory-tight nodes.

Should I use Redis Streams or Kafka?

Streams if you already run Redis, your throughput is below ~100k msg/sec per stream, and you want at-least-once consumer groups with minimal ops. Kafka if you need partitioning within a topic, retention measured in days/weeks rather than memory, exactly-once semantics across producers/consumers, or compaction.

Is Redis a good database (not just a cache)?

For specific shapes, yes: leaderboards, rate limiters, session stores, real-time counters, geo lookups. The data model maps directly to native types and a single instance handles enormous load. For anything needing joins, multi-row transactions, secondary indexes, or complex queries, use a real database and put Redis in front of it.

What's the deal with Valkey?

After Redis Inc. switched the upstream license to RSALv2/SSPL in March 2024, the Linux Foundation forked the last BSD-licensed Redis 7.2 as Valkey. Major cloud vendors (AWS, GCP, Oracle) and the Redis community now back Valkey as the open-source continuation. The protocol and data structures are identical; if you're starting fresh with no enterprise contract, Valkey is the open-source default.