Redis Cluster
Redis Cluster is the native sharded deployment mode for Redis. It splits the keyspace into
exactly 16384 hash slots distributed across master nodes, each backed by zero or more
replicas, with a gossip protocol on a separate TCP port (the cluster bus) for
membership and failure detection. Clients are slot-aware: they hash keys with CRC16, look up
the owning node, and on a stale routing table the destination node replies with
MOVED or ASK to push the client toward the right shard. There is no
central coordinator and no consensus log — all decisions are made by quorum among masters.
Cluster Topology
Key Numbers
Why Redis Cluster Is Shaped This Way
The 16384 Slots Decision
A peculiar number with practical reasoning.
Every key in Redis Cluster is mapped to one of 16384 hash slots:
slot = CRC16("user:42") & 16383
The bitwise AND is exactly equivalent to mod 16384 because 16384 is a power of
two. Each master owns a contiguous (or scattered, after resharding) range of slots; the union
of all masters' slots covers 0..16383 with no gaps and no overlaps, an invariant
enforced by the cluster bus.
Why exactly 16384? Two reasons. (1) Every cluster bus heartbeat carries a bitmap of the slots the sender claims. 16384 bits = 2048 bytes, so each PING spends 2 KB on slot ownership; 65536 slots would need 8 KB per heartbeat, which adds up when every node pings its peers continuously. (2) A larger slot count makes resharding finer-grained (more useful) but also makes per-slot bookkeeping more expensive. Each master tracks which keys live in each slot it owns, and 16384 slots is plenty for hot-spot avoidance without bloating that bookkeeping.
The slot count is hard-coded into the protocol; it cannot be changed without breaking every client and forking the protocol version. Dragonfly, KeyDB, and Valkey all preserve 16384 for compatibility.
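As a rough illustration of the slot mapping described above, here is a minimal Python sketch. It assumes the CRC16 variant is CRC-16/XMODEM (polynomial 0x1021, initial value 0), which is what the Redis Cluster specification documents. The names crc16_xmodem, key_slot, and BITMAP_BYTES are illustrative and not part of any client library, and hash tags are deliberately ignored here.

```python
SLOT_COUNT = 16384  # fixed by the cluster protocol


def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC-16/XMODEM: polynomial 0x1021, initial value 0x0000."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to its hash slot (hash tags are not handled in this sketch)."""
    # Masking with SLOT_COUNT - 1 equals "mod 16384" because 16384 is a power of two.
    return crc16_xmodem(key.encode()) & (SLOT_COUNT - 1)


# Size in bytes of the slot bitmap carried in each heartbeat.
BITMAP_BYTES = SLOT_COUNT // 8

print(key_slot("foo"))  # 12182
print(BITMAP_BYTES)     # 2048
```

The value 12182 for foo is the same slot that appears in the redirect examples later in this article.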
Hash Tags and Multi-Key Locality
How to keep related keys on the same shard.
Redis Cluster forbids most multi-key commands from spanning shards — MGET,
MSET, SUNIONSTORE, EVAL with multiple KEYS,
and MULTI/EXEC all fail with CROSSSLOT if their keys hash to
different slots. The escape hatch is hash tags:
SET {user:42}:profile ... ← hashes on "user:42"
SET {user:42}:feed ... ← hashes on "user:42"
SET {user:42}:friends ... ← hashes on "user:42"
MGET {user:42}:profile {user:42}:feed ← OK, all on same slot
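The rule: if a key contains a { followed later by a }, with at least one character between them, only the
substring between the first { and the first } after it is hashed; otherwise the whole key is hashed. This lets schema designers explicitly co-locate keys. Common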
patterns include user-scoped data (everything for user N tagged with {N}),
tenant-scoped data, and Lua scripts that need atomic multi-key operations.
The tradeoff: a hot tag becomes a hot shard. If {hot_user} generates 10x
traffic, that shard sees 10x load with no way to spread it. Pick tags at a granularity small
enough to load-balance.
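The hash tag rule is easy to get subtly wrong, so here is a small Python sketch of the extraction step alone. It follows the rule as the Redis Cluster specification states it (first { and the first } after it, non-empty content); the function name hash_tag is hypothetical, and the returned string is what would then be fed to CRC16.

```python
def hash_tag(key: str) -> str:
    """Return the part of the key that is hashed for slot assignment."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        # An empty tag such as "{}" does not count; the whole key is hashed.
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


print(hash_tag("{user:42}:profile"))  # user:42
print(hash_tag("{user:42}:feed"))     # user:42
print(hash_tag("foo{}bar"))           # foo{}bar
print(hash_tag("foo{bar}{zap}"))      # bar
```

Note that nested braces are not balanced: for a key like {{foo}}bar, the hashed substring is {foo.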
MOVED vs ASK Redirects
Two redirect codes with very different semantics.
A client connects to any cluster node and issues SET foo bar. The receiving
node hashes foo, checks its own slot ownership table, and:
if I own this slot → process the command
if the slot is migrating and the key is no longer here → reply -ASK redirect (this query only)
if I don't own it → reply -MOVED redirect (update your map)
-MOVED 12182 10.0.0.5:6379 tells the client "slot 12182 belongs to that node;
update your routing table permanently." The client then retries on the new node. Smart
clients (jedis-cluster, redis-py, lettuce, ioredis) cache the slot map and only encounter
MOVED on stale entries — typically right after a topology change.
-ASK 12182 10.0.0.5:6379 means "I'm currently migrating this slot — the key
may already be on the destination node. Try there, but only for this query." The client
sends an ASKING command to the destination first, which gives it permission to
serve queries for slots in IMPORTING state. This dance ensures correctness during
live migration: the source node serves keys that haven't moved, the destination serves keys
that have, and ASK redirects keep the client switching as needed.
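Operational note: a client that treats ASK like MOVED rewrites its slot map too early. The destination does not own the slot yet, so queries sent there without ASKING are answered with MOVED back to the source, and the client bounces between the two nodes until migration completes. Always use a real cluster client; never paper over redirects with a generic Redis client.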
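To make the difference between the two redirects concrete, here is a minimal Python sketch of how a client might react to each. It is not taken from any real client library: Redirect, parse_redirect, and handle_redirect are hypothetical names, the slot map is a plain dict, and no network I/O is performed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Redirect:
    kind: str  # "MOVED" or "ASK"
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse '-MOVED 12182 10.0.0.5:6379' or '-ASK ...'; None for other errors."""
    parts = error.lstrip("-").split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


def handle_redirect(slot_map: Dict[int, Tuple[str, int]], redirect: Redirect) -> List[str]:
    """Return the commands to send to the target node before retrying the query.

    MOVED rewrites the cached slot map; ASK leaves the map untouched and
    prefixes the single retry with ASKING.
    """
    if redirect.kind == "MOVED":
        slot_map[redirect.slot] = (redirect.host, redirect.port)
        return []
    return ["ASKING"]


slot_map = {12182: ("10.0.0.4", 6379)}
ask = parse_redirect("-ASK 12182 10.0.0.5:6379")
print(handle_redirect(slot_map, ask), slot_map[12182])
moved = parse_redirect("-MOVED 12182 10.0.0.5:6379")
print(handle_redirect(slot_map, moved), slot_map[12182])
```

The key design point is that only the MOVED branch mutates the routing table; the ASK branch affects one retry and nothing else.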
Live Resharding
Migrating slots between masters without downtime.
Resharding moves slots one at a time. The orchestrator (redis-cli --cluster reshard,
or the older redis-trib.rb script it replaced in Redis 5) drives this sequence:
1. CLUSTER SETSLOT 12182 IMPORTING <source-node-id> ← on destination
2. CLUSTER SETSLOT 12182 MIGRATING <destination-node-id> ← on source
3. CLUSTER GETKEYSINSLOT 12182 100 ← on source, 100 at a time
4. MIGRATE <destination-host> <destination-port> "" 0 5000 KEYS k1 k2 … ← move them
5. (loop 3-4 until empty)
6. CLUSTER SETSLOT 12182 NODE <destination-node-id> ← on every master
During steps 2-5, the source serves keys not yet migrated and replies ASK for keys it has already moved. The destination serves keys it has received (for clients that arrive via ASKING). Step 6 finalizes ownership and gossips the new map cluster-wide; from then on, MOVED is the response from any other node and the client routing table updates.
The MIGRATE command is interesting: it serializes each key in the same format DUMP produces, sends the batch to the destination over a single connection, and deletes the keys from the source only after the destination acknowledges them. If the network drops mid-MIGRATE, the command fails and the source still has the key; the worst case is a key present on both nodes, never a lost one. The 5000 in the example is a millisecond timeout for each MIGRATE call.
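The six-step sequence above can be sketched as a function that only builds the list of commands an orchestrator would issue, without talking to any server. This is a simplification under stated assumptions: reshard_plan and its parameters are hypothetical names, the key list is passed in up front rather than discovered by repeated CLUSTER GETKEYSINSLOT calls, and error handling is omitted.

```python
from typing import List, Tuple


def reshard_plan(slot: int, src_id: str, dst_id: str, dst_host: str, dst_port: int,
                 keys: List[str], batch: int = 100,
                 timeout_ms: int = 5000) -> List[Tuple[str, str]]:
    """Build (target, command) pairs for moving one slot from source to destination."""
    plan = [
        ("destination", f"CLUSTER SETSLOT {slot} IMPORTING {src_id}"),
        ("source", f"CLUSTER SETSLOT {slot} MIGRATING {dst_id}"),
    ]
    # Steps 3-4: fetch a batch of keys, move it, repeat until the slot is empty.
    for i in range(0, len(keys), batch):
        chunk = keys[i:i + batch]
        plan.append(("source", f"CLUSTER GETKEYSINSLOT {slot} {batch}"))
        plan.append(("source",
                     f'MIGRATE {dst_host} {dst_port} "" 0 {timeout_ms} KEYS ' + " ".join(chunk)))
    # Step 6: finalize ownership everywhere.
    plan.append(("every master", f"CLUSTER SETSLOT {slot} NODE {dst_id}"))
    return plan


for target, command in reshard_plan(12182, "src-node-id", "dst-node-id",
                                    "10.0.0.5", 6379, ["k1", "k2", "k3"], batch=2):
    print(f"{target:>12}: {command}")
```

Keeping the plan as data makes the ordering easy to inspect: the destination is marked IMPORTING before the source is marked MIGRATING, and ownership is finalized only after the last batch.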
Gossip and Failure Detection
The cluster bus protocol on port 16379 (data port + 10000).
Every node maintains TCP connections to every other node on the cluster bus port. They exchange compact binary messages: PING, PONG (which includes a snapshot of state), MEET (to add a new node), FAIL (broadcast on confirmed failure), PUBLISH (cluster-wide pub/sub), and a few others. Each PING carries the sender's own slot bitmap plus gossip entries (address, flags, last-seen times) for a random sample of other nodes, so knowledge propagates by gossip rather than broadcast.
Each node pings a few random peers every second and makes sure no peer goes unpinged for longer than half of
cluster-node-timeout (default 15 s, so 7.5 s). If a node hasn't responded within
cluster-node-timeout, the local node marks it PFAIL (possibly
failed). PFAIL is gossiped. When a majority of masters have independently observed PFAIL on
the same node, it becomes FAIL, and that's broadcast.
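The PFAIL-to-FAIL escalation described above can be expressed as a small decision function. This is a sketch of the rule only, from one observer's point of view: classify and its parameters are hypothetical names, and the real implementation also expires old failure reports, which is not modelled here.

```python
def classify(seconds_since_pong: float, masters_reporting_pfail: int,
             total_masters: int, node_timeout: float = 15.0) -> str:
    """Return 'OK', 'PFAIL', or 'FAIL' for one observed node.

    masters_reporting_pfail counts the masters currently flagging the node,
    including the observer itself when the observer is a master.
    """
    if seconds_since_pong <= node_timeout:
        return "OK"
    # A majority of masters must agree before PFAIL is promoted to FAIL.
    if masters_reporting_pfail >= total_masters // 2 + 1:
        return "FAIL"
    return "PFAIL"


print(classify(5.0, 0, 5))   # OK
print(classify(16.0, 1, 5))  # PFAIL
print(classify(16.0, 3, 5))  # FAIL
```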
Once a master is FAIL, its replicas race to be promoted. A replica needs votes from a majority
of masters; it requests them via FAILOVER_AUTH_REQUEST. The replica with the
highest replication offset (most up-to-date) is favored because each replica delays its
request in proportion to its rank, so the freshest one asks first. Once N/2+1 masters approve, the winning replica takes over the slot
ownership and gossips the new ownership.
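Why the most up-to-date replica tends to ask first can be shown with the delay formula given in the Redis Cluster specification: a fixed 500 ms, plus a random 0 to 500 ms, plus 1000 ms per rank, where rank 0 is the replica with the highest replication offset. The function name election_delay_ms is hypothetical, and this sketch covers the timing only, not the vote exchange.

```python
import random


def election_delay_ms(rank: int, rng: random.Random) -> int:
    """Delay before a replica sends FAILOVER_AUTH_REQUEST, in milliseconds.

    rank 0 is the replica with the highest replication offset.
    """
    return 500 + rng.randint(0, 500) + rank * 1000


rng = random.Random()
for rank in range(3):
    print(rank, election_delay_ms(rank, rng))
```

Because the random jitter is at most 500 ms and each rank adds 1000 ms, a rank 0 replica always starts before a rank 1 replica under this formula.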
Tuning cluster-node-timeout is a tradeoff. Lower values (5-10 s) detect failures
faster but false-positive on transient network issues, causing unnecessary failovers. Higher
values (30 s) are more stable but admit longer write outages on real failures. Production
defaults of 15 s are reasonable for most workloads.
Tradeoffs vs Alternatives
| Aspect | Redis Cluster | Sentinel | Twemproxy |
|---|---|---|---|
| Sharding | built-in (16384 slots) | none (single shard) | consistent hash |
| Failover | cluster bus quorum | Sentinel quorum | none |
| Multi-key ops | same-slot only | full | same-shard only |
| Routing | client-aware | client-aware | proxy hop |
| Live resharding | yes | n/a | requires restart |
| Operational complexity | medium | low | medium |
FAQ
Why exactly 16384 hash slots?
Salvatore Sanfilippo (antirez) chose 16384 because the cluster bus protocol gossips a bitmap of slots each node owns in every heartbeat. 16384 bits is 2 KB, which keeps every heartbeat message small. A larger slot count (e.g. 65536) would inflate that bitmap to 8 KB per message; a smaller one (e.g. 4096) would limit cluster size and resharding granularity, since at 4096 slots you can't have more than 4096 masters. 16384 was judged enough for 'practical' cluster sizes (~1000 masters) while keeping gossip cheap.
What's the difference between MOVED and ASK?
MOVED is permanent, ASK is temporary. MOVED means 'this slot lives on node X; update your client routing table and never come back to me for this slot.' ASK means 'this slot is currently being migrated; for this one query, redirect to node X, but the slot still officially belongs to me.' During resharding, the source node serves keys that haven't migrated yet and replies ASK for keys that have. Once migration completes, the orchestrator issues CLUSTER SETSLOT with the NODE option to assign the slot to its new owner, and from then on it's MOVED.
Can I do MULTI/EXEC across slots?
Only if all keys in the transaction hash to the same slot. Redis Cluster checks at command parse time and replies CROSSSLOT if you mix. The standard workaround is hash tags: keys like {user:42}.profile and {user:42}.feed both hash on user:42 and land on the same slot. Plan your key schema for hash-tag locality if you want multi-key operations. Lua scripts have the same constraint: every key a script touches must hash to one slot. Plain SUBSCRIBE and PSUBSCRIBE are not slot-bound, because published messages are forwarded to every node over the cluster bus.
How does failure detection actually work?
Each node, master or replica, makes sure every other node gets a PING at least every cluster-node-timeout/2 (7.5 s for the 15 s default timeout). If a node fails to reply within cluster-node-timeout, the local node marks it PFAIL (possibly failed). Once a majority of masters concur via gossip, it becomes FAIL, at which point its replicas race to be elected master, requiring votes from a majority of masters. The smaller the cluster-node-timeout, the faster the failover but the higher the false-positive rate under transient network blips.
Do I need an odd number of masters?
Not strictly, but an odd count is the sensible choice, and the reason is failover quorum, not slot ownership. Failover requires a majority of masters to vote for a replica's promotion. With 4 masters the majority is 3, so a 2-2 partition deadlocks (neither side can elect) and the cluster tolerates only one master failure, the same as with 3; with 5, the larger partition (3 vs 2) wins. Use 3, 5, or 7 masters and add replicas as needed for HA. The replica count doesn't affect quorum.
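A quick arithmetic check of the majority rule shows why an even master count buys nothing. The sketch assumes only what the answer above states, that promotion needs votes from more than half of the masters; tolerated_master_failures is an illustrative name.

```python
def tolerated_master_failures(masters: int) -> int:
    """How many masters can be lost while a voting majority still exists."""
    majority = masters // 2 + 1
    return masters - majority


print("masters  majority  tolerated failures")
for n in range(3, 8):
    print(f"{n:>7}  {n // 2 + 1:>8}  {tolerated_master_failures(n):>18}")
```

Three and four masters both tolerate a single failure, and five and six both tolerate two, so the extra even-numbered master adds cost without adding failover headroom.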
What happens to writes during a partition?
On the minority side, masters that find themselves separated from a majority of peers stop accepting writes after cluster-node-timeout (by default they stop serving reads as well, because the cluster state on that side is marked as failed). On the majority side, replicas of the unreachable masters get promoted once those masters are flagged FAIL; writes resume against the new masters. The window of accepted-then-lost writes is roughly bounded by cluster-node-timeout, typically 15 seconds. Set min-replicas-to-write 1 and min-replicas-max-lag 10 for stricter durability at the cost of write availability.