Redis Cluster

Redis Cluster is the native sharded deployment mode for Redis. It splits the keyspace into exactly 16384 hash slots distributed across master nodes, each backed by zero or more replicas, with a gossip protocol on a separate TCP port (the cluster bus) for membership and failure detection. Clients are slot-aware: they hash keys with CRC16 and look up the owning node. If the client's routing table is stale, the contacted node replies with MOVED; if the slot is mid-migration, it replies with ASK. Either way the client is pushed toward the right shard. There is no central coordinator and no consensus log: failover decisions are made by a majority vote among masters.

Cluster Topology

Key Numbers

Hash slots	16384
Slot algorithm	CRC16 mod 16384
Default node timeout	15 s
PING interval	timeout/2
Min masters	3 (for quorum)
Cluster bus port	data + 10000
Max nodes (recommended)	~1000

Why Redis Cluster Is Shaped This Way

No coordinator

Avoids the operational burden of a ZooKeeper/etcd dependency. The tradeoff is weaker consistency guarantees: Redis Cluster is asynchronously replicated and gossip-coordinated, not a Raft cluster, so a write acknowledged by a master can be lost if that master fails before replicating it. For caches and counters this is fine; for ledgers, look elsewhere.

Client-side routing

Clients learn the slot map and hash keys themselves; only stale clients pay the round-trip-with-redirect cost. This shifts work off the server (no proxy hop) and keeps latency at ~RTT for steady state.

Live resharding

Slots can be migrated between masters one at a time without downtime. The combination of MIGRATE, IMPORTING/MIGRATING flags, and ASK redirects lets clients keep reading and writing while the slot is in flight.

The 16384 Slots Decision

A peculiar number with practical reasoning.

Every key in Redis Cluster is mapped to one of 16384 hash slots:

slot = CRC16("user:42") & 16383

The bitwise AND is exactly equivalent to mod 16384 because 16384 is a power of two. Each master owns a contiguous (or scattered, after resharding) range of slots; the union of all masters' slots covers 0..16383 with no gaps and no overlaps, an invariant enforced by the cluster bus.
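The slot computation can be sketched in a few lines. This is a minimal illustration, assuming the CRC16 variant is CRC16-CCITT/XMODEM (polynomial 0x1021, initial value 0) as described in the Redis Cluster specification; the function names are illustrative, and hash-tag handling is deliberately left out.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Map a key to a hash slot. `& 16383` equals `% 16384` (power of two)."""
    return crc16_xmodem(key) & 16383


if __name__ == "__main__":
    # "foo" lands on slot 12182, the slot used in the redirect examples.
    print(key_slot(b"foo"))
```

A table-driven CRC would be faster; the bitwise loop is used here only because it is easy to check by eye.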

Why exactly 16384? Two reasons. (1) Each cluster bus heartbeat carries a raw bitmap of the slots the sender claims. 16384 bits = 2048 bytes, so every PING and PONG pays a fixed 2 KB for slot ownership. With 65536 slots the bitmap would be 8 KB per message, a heavy cost when every node heartbeats every other node. (The bus runs over TCP, so this is about bandwidth, not about fitting a single packet.) (2) A larger slot count makes resharding finer-grained (more useful) but also makes per-slot bookkeeping more expensive. Each master tracks which keys belong to each slot it owns, and 16384 slots is plenty for hot-spot avoidance at the recommended ceiling of about 1000 masters without bloating that bookkeeping.
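The bitmap arithmetic is easy to check. The sketch below models a slot-ownership bitmap as a plain bytearray with one bit per slot; the helper names and the bit ordering within each byte are assumptions made for illustration, not the wire format.

```python
SLOTS = 16384


def bitmap_bytes(slot_count: int) -> int:
    """Bytes needed for a one-bit-per-slot ownership bitmap."""
    return slot_count // 8


def new_bitmap() -> bytearray:
    """An empty ownership bitmap: no slots claimed."""
    return bytearray(bitmap_bytes(SLOTS))


def claim(bitmap: bytearray, slot: int) -> None:
    """Mark a slot as owned by setting its bit."""
    bitmap[slot >> 3] |= 1 << (slot & 7)


def owns(bitmap: bytearray, slot: int) -> bool:
    """Check whether a slot's bit is set."""
    return bool(bitmap[slot >> 3] & (1 << (slot & 7)))


if __name__ == "__main__":
    bm = new_bitmap()
    claim(bm, 12182)
    print(len(bm), owns(bm, 12182), owns(bm, 0))
    print(bitmap_bytes(65536))  # size of the bitmap with 65536 slots
```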

The slot count is hard-coded into the protocol; it cannot be changed without breaking every client and forking the protocol version. Dragonfly, KeyDB, and Valkey all preserve 16384 for compatibility.

Hash Tags and Multi-Key Locality

How to keep related keys on the same shard.

Redis Cluster forbids most multi-key commands from spanning shards — MGET, MSET, SUNIONSTORE, EVAL with multiple KEYS, and MULTI/EXEC all fail with CROSSSLOT if their keys hash to different slots. The escape hatch is hash tags:

SET {user:42}:profile  ...     ← hashes on "user:42"
SET {user:42}:feed     ...     ← hashes on "user:42"
SET {user:42}:friends  ...     ← hashes on "user:42"

MGET {user:42}:profile {user:42}:feed   ← OK, all on same slot

The rule: if a key contains a { and a } somewhere after it, only the substring between the first { and the first } that follows it is hashed. If that substring is empty (as in {}), the whole key is hashed as usual. This lets schema designers explicitly co-locate keys. Common patterns include user-scoped data (everything for user N tagged with {N}), tenant-scoped data, and Lua scripts that need atomic multi-key operations.
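The extraction rule is small enough to write out. This is a sketch of the rule as the Redis Cluster specification states it (first opening brace, first closing brace after it, non-empty content); the function name is illustrative. The returned bytes are what would be fed to CRC16.

```python
def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is actually hashed."""
    start = key.find(b"{")
    if start == -1:
        return key                      # no opening brace: hash the whole key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        return key                      # no closing brace, or empty {}: whole key
    return key[start + 1:end]


if __name__ == "__main__":
    for k in (b"{user:42}:profile", b"{user:42}:feed", b"foo{}{bar}", b"foo{{bar}}zap"):
        print(k, b"->", hash_tag(k))
```

Note the edge cases: foo{}{bar} hashes the whole key because the first pair of braces is empty, and foo{{bar}}zap hashes {bar because matching stops at the first closing brace.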

The tradeoff: a hot tag becomes a hot shard. If {hot_user} generates 10x traffic, that shard sees 10x load with no way to spread it. Pick tags at a granularity small enough to load-balance.

MOVED vs ASK Redirects

Two redirect codes with very different semantics.

A client connects to any cluster node and issues SET foo bar. The receiving node hashes foo, checks its own slot ownership table, and:

if I own this slot and have the key           → process the command
if the slot is migrating and the key is gone  → reply -ASK redirect (this query only)
if I don't own the slot                       → reply -MOVED redirect (update your map)

-MOVED 12182 10.0.0.5:6379 tells the client "slot 12182 belongs to that node; update your routing table permanently." The client then retries on the new node. Smart clients (jedis-cluster, redis-py, lettuce, ioredis) cache the slot map and only encounter MOVED on stale entries — typically right after a topology change.

-ASK 12182 10.0.0.5:6379 means "I'm currently migrating this slot — the key may already be on the destination node. Try there, but only for this query." The client sends an ASKING command to the destination first, which gives it permission to serve queries for slots in IMPORTING state. This dance ensures correctness during live migration: the source node serves keys that haven't moved, the destination serves keys that have, and ASK redirects keep the client switching as needed.

Operational note: a client that aggressively interprets ASK as MOVED will black-hole queries until migration completes. Always use a real cluster client; never paper over redirects with a generic Redis client.
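The client-side difference between the two redirects can be sketched as follows. This is a simplified model, not the code of any real client library: the Redirect type, parse_redirect, and next_hop are hypothetical names, and the slot map is a plain dict from slot number to (host, port).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]


@dataclass(frozen=True)
class Redirect:
    kind: str          # "MOVED" or "ASK"
    slot: int
    address: Address


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse '-MOVED 12182 10.0.0.5:6379' style errors; None for anything else."""
    parts = error.lstrip("-").split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), (host, int(port)))


def next_hop(slot_map: Dict[int, Address], redirect: Redirect) -> Tuple[Address, bool]:
    """Return (node to retry on, whether to send ASKING first)."""
    if redirect.kind == "MOVED":
        slot_map[redirect.slot] = redirect.address   # permanent: update the map
        return redirect.address, False
    return redirect.address, True                    # ASK: one query, map untouched


if __name__ == "__main__":
    slots = {12182: ("10.0.0.1", 6379)}
    print(next_hop(slots, parse_redirect("-ASK 12182 10.0.0.5:6379")), slots)
    print(next_hop(slots, parse_redirect("-MOVED 12182 10.0.0.5:6379")), slots)
```

The key design point is that only MOVED mutates the cached slot map; an ASK leaves it alone, so the next query for that slot still goes to the source node.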

Live Resharding

Migrating slots between masters without downtime.

Resharding moves slots one at a time. The orchestrator (redis-cli --cluster reshard, the older redis-trib.rb script it replaced, or a custom control plane) drives this sequence:

1. CLUSTER SETSLOT 12182 IMPORTING <source-id>              ← on destination
2. CLUSTER SETSLOT 12182 MIGRATING <destination-id>         ← on source
3. CLUSTER GETKEYSINSLOT 12182 100                          ← on source, 100 at a time
4. MIGRATE <dest-host> <dest-port> "" 0 5000 KEYS k1 k2 …   ← move them
5. (loop 3-4 until empty)
6. CLUSTER SETSLOT 12182 NODE <destination-id>              ← on every master

During steps 2-5, the source serves keys not yet migrated and replies ASK for keys it has already moved. The destination serves keys it has received (for clients that arrive via ASKING). Step 6 finalizes ownership and gossips the new map cluster-wide; from then on, MOVED is the response from any other node and the client routing table updates.
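The per-node decision during a migration can be modeled as a small state machine. The sketch below is a toy in-memory model, assuming each node knows its owned slots, its MIGRATING and IMPORTING flags, and a view of who owns each slot; the Node class and decide function are hypothetical, and real nodes make this decision on actual key lookups.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Node:
    name: str
    owned: Set[int] = field(default_factory=set)
    migrating: Dict[int, str] = field(default_factory=dict)  # slot -> destination
    importing: Set[int] = field(default_factory=set)
    keys: Set[str] = field(default_factory=set)


def decide(node: Node, slot: int, key: str, owner_of: Dict[int, str],
           asking: bool = False) -> str:
    """What a node replies for a single-key command on `slot`."""
    if slot in node.owned:
        if slot in node.migrating and key not in node.keys:
            return f"ASK {slot} {node.migrating[slot]}"   # key may already have moved
        return "SERVE"
    if slot in node.importing and asking:
        return "SERVE"                                    # client arrived via ASKING
    return f"MOVED {slot} {owner_of[slot]}"


if __name__ == "__main__":
    view = {12182: "src"}
    src = Node("src", owned={12182}, migrating={12182: "dst"}, keys={"k1"})
    dst = Node("dst", importing={12182}, keys={"k2"})
    print(decide(src, 12182, "k1", view))                 # still on the source
    print(decide(src, 12182, "k2", view))                 # already moved
    print(decide(dst, 12182, "k2", view))                 # no ASKING sent
    print(decide(dst, 12182, "k2", view, asking=True))
```

Without ASKING, the destination still answers MOVED back to the source, which is why a client that skips the ASKING step bounces between the two nodes.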

The MIGRATE command is interesting: it serializes the keys in the same format DUMP produces, sends them to the destination over a single connection where they are recreated with RESTORE, and deletes them from the source only after the destination replies OK. Both instances are blocked while the transfer runs. If the network drops or the timeout fires mid-MIGRATE, the command fails with an IOERR and the key is either still only on the source or present on both nodes; it is never lost. The 5000 in the example is a timeout in milliseconds.

Gossip and Failure Detection

The cluster bus protocol on port 16379 (data port + 10000, assuming the default data port 6379).

Every node maintains TCP connections to every other node on the cluster bus port. They exchange compact binary messages: PING, PONG (which includes a snapshot of state), MEET (to add a new node), FAIL (broadcast on confirmed failure), PUBLISH (cluster-wide pub/sub), and a few others. Each PING and PONG header carries the sender's own slot bitmap, followed by a gossip section describing a random sample of other nodes (address, flags, and when they were last heard from), so knowledge propagates by gossip rather than broadcast.

Each node pings a few randomly chosen nodes every second, and additionally pings any node it has not exchanged a PING or PONG with for half the cluster-node-timeout (default 15 s, so 7.5 s). If a node hasn't responded within cluster-node-timeout, the local node marks it PFAIL (possibly failed). PFAIL is gossiped. When a majority of masters have independently observed PFAIL on the same node, it becomes FAIL, and that's broadcast.
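The PFAIL-to-FAIL promotion reduces to a majority count. The sketch below is a simplified model, assuming failure reports are available as a mapping from reporting master to the set of nodes it currently flags PFAIL; the names are hypothetical, and the expiry of old reports that a real node applies is omitted.

```python
from typing import Dict, Set


def majority(master_count: int) -> int:
    """Smallest number of masters that forms a majority."""
    return master_count // 2 + 1


def is_fail(suspect: str, masters: Set[str],
            pfail_reports: Dict[str, Set[str]]) -> bool:
    """True once a majority of all masters flag `suspect` as PFAIL."""
    reporters = {
        m for m in masters
        if m != suspect and suspect in pfail_reports.get(m, set())
    }
    return len(reporters) >= majority(len(masters))


if __name__ == "__main__":
    masters = {"a", "b", "c"}
    print(is_fail("c", masters, {"a": {"c"}}))              # one report of three
    print(is_fail("c", masters, {"a": {"c"}, "b": {"c"}}))  # two reports of three
```

The suspect still counts toward the total number of masters, so in a three-master cluster both surviving masters must agree before the node is marked FAIL.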

Once a master is FAIL, its replicas race to be promoted. A replica needs votes from a majority of masters; it requests them by broadcasting FAILOVER_AUTH_REQUEST. The most up-to-date replica (the one with the highest replication offset) is favored because each replica delays its request according to its rank among its siblings, so the best candidate asks first. Once N/2+1 masters approve, the winning replica takes over the slot ownership and gossips the new ownership.
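The rank-based head start can be sketched as below. The delay formula (a fixed 500 ms, up to 500 ms of random jitter, plus 1000 ms per rank) follows the Redis Cluster specification rather than anything stated in this article, and the function names and the offsets mapping are hypothetical.

```python
import random
from typing import Dict, Optional


def replica_rank(name: str, offsets: Dict[str, int]) -> int:
    """Rank = number of sibling replicas with a strictly higher replication offset."""
    mine = offsets[name]
    return sum(1 for other, off in offsets.items() if other != name and off > mine)


def election_delay_ms(rank: int, rng: Optional[random.Random] = None) -> int:
    """Delay before a replica asks for votes: fixed part + jitter + rank penalty."""
    rng = rng or random.Random()
    return 500 + rng.randint(0, 500) + rank * 1000


if __name__ == "__main__":
    offsets = {"r1": 1000, "r2": 900, "r3": 1000}
    for name in sorted(offsets):
        rank = replica_rank(name, offsets)
        print(name, rank, election_delay_ms(rank))
```

Because the jitter is at most 500 ms and each rank adds a full second, a rank-0 replica always requests votes before a rank-1 replica in this model; replicas with equal offsets share a rank and are separated only by jitter.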

Tuning cluster-node-timeout is a tradeoff. Lower values (5-10 s) detect failures faster but false-positive on transient network issues, causing unnecessary failovers. Higher values (30 s) are more stable but admit longer write outages on real failures. Production defaults of 15 s are reasonable for most workloads.

Tradeoffs vs Alternatives

Aspect	Redis Cluster	Sentinel	Twemproxy
Sharding	built-in (16384 slots)	none (single shard)	consistent hash
Failover	cluster bus quorum	Sentinel quorum	none
Multi-key ops	same-slot only	full	same-shard only
Routing	client-aware	client-aware	proxy hop
Live resharding	yes	n/a	requires restart
Operational complexity	medium	low	medium

FAQ

Why exactly 16384 hash slots?

Salvatore Sanfilippo (antirez) chose 16384 because the cluster bus protocol gossips a bitmap of the slots each node owns in every heartbeat. 16384 bits is 2 KB, a tolerable fixed cost per message; 65536 slots would make that bitmap 8 KB, which he considered prohibitive. The bus runs over TCP, so the concern is bandwidth rather than fitting a single packet. Smaller counts (e.g. 4096) would leave too few slots per master to rebalance smoothly, and a cluster can never have more masters than slots. 16384 gives enough slots per master at 'practical' cluster sizes (up to ~1000 masters) while keeping gossip cheap.

What's the difference between MOVED and ASK?

MOVED is permanent, ASK is temporary. MOVED means 'this slot lives on node X, so update your client routing table and never come back to me for this slot.' ASK means 'this slot is currently being migrated, so for this one query redirect to node X, but the slot still officially belongs to me.' During resharding, the source node serves keys that haven't migrated yet and replies ASK for keys that have. Once migration completes, the resharding tool finalizes ownership with CLUSTER SETSLOT NODE, and from then on the old owner replies MOVED.

Can I do MULTI/EXEC across slots?

Only if all keys in the transaction hash to the same slot. Redis Cluster checks each command's keys and replies CROSSSLOT if you mix. The standard workaround is hash tags: keys like {user:42}.profile and {user:42}.feed both hash on user:42 and land on the same slot. Plan your key schema for hash-tag locality if you want multi-key operations. Lua scripts have the same constraint: every key a script touches must hash to one slot. Regular SUBSCRIBE and PSUBSCRIBE are not slot-bound, because published messages are forwarded cluster-wide over the bus.

How does failure detection actually work?

Every node (master or replica) pings a few random peers each second and makes sure no peer goes unpinged for longer than cluster-node-timeout/2 (7.5 s with the 15 s default timeout). If a node fails to answer for cluster-node-timeout, the local node marks it PFAIL (possibly failed). Once a majority of masters concur via gossip, it becomes FAIL, at which point its replicas race to be elected master, requiring votes from a majority of masters. The smaller the cluster-node-timeout, the faster the failover but the higher the false-positive rate under transient network blips.

Do I need an odd number of masters?

Not strictly, but odd counts are the sensible choice, and the reason is failover quorum, not slot ownership. Failover requires a majority of all masters to vote for a replica's promotion. With 4 masters, a 2-2 partition deadlocks because neither side holds a majority; with 5, the larger side of a 3-2 split can still elect. A fourth master also tolerates no more failures than three do. Use 3, 5, or 7 masters and add replicas as needed for HA. The replica count doesn't affect quorum.
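The quorum arithmetic behind that advice is short enough to check directly. A minimal sketch with illustrative function names:

```python
def majority(masters: int) -> int:
    """Votes needed to promote a replica."""
    return masters // 2 + 1


def tolerated_failures(masters: int) -> int:
    """How many masters can be unreachable while a majority still exists."""
    return masters - majority(masters)


def can_elect(partition_size: int, total_masters: int) -> bool:
    """Can the masters on one side of a partition authorize a failover?"""
    return partition_size >= majority(total_masters)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7):
        print(n, majority(n), tolerated_failures(n))
```

Going from 3 to 4 masters raises the majority from 2 to 3 without raising the number of tolerated failures, which is why even counts buy capacity but no extra failover safety.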

What happens to writes during a partition?

On the minority side, masters that find themselves separated from a majority of peers stop accepting writes after cluster-node-timeout (this is the 'minority partition stays read-only' rule). On the majority side, replicas of the masters now marked FAIL get promoted, and writes resume against the new masters. The window of accepted-then-lost writes is bounded by cluster-node-timeout, typically 15 seconds. Set min-replicas-to-write 1 and min-replicas-max-lag 10 for stricter durability at the cost of write availability.