Distributed Cache — Write patterns, consistent hashing, hot keys, stampedes

Architecture

Capacity Estimation

For a service at 200 K reads/s with target 95% hit rate and average value size 2 KB:

Metric	Value	Notes
Cache reads/s	200 K	peak
DB reads/s on miss	10 K	5% miss
Working set	~200 GB	100 M items × 2 KB
Cluster RAM	500 GB	2.5× for replicas + headroom
Nodes (32 GB each)	~16	+ replicas
p99 GET latency	~1 ms	same-AZ Redis

Write Patterns

Cache-aside (lazy-loading): app reads cache; on miss, reads DB and populates cache. Writes update DB and invalidate (or update) the cache. Default choice. Hazard: stale read if a concurrent reader reloads after a writer's invalidation but before the new write commits. Mitigate by setting a short post-write TTL (cache-aside double-delete).

Write-through: app writes to cache; cache writes synchronously to DB before ack. Strong consistency between cache and DB at the cost of write latency. Common for in-process caches and CDN edge writes.

Write-back (write-behind): app writes to cache; cache asynchronously flushes to DB. Lowest latency, highest data-loss risk on cache crash. Used for counters and metrics where some loss is tolerable, or paired with durable WAL.

Refresh-ahead: cache proactively reloads soon-to-expire hot keys. Reduces miss latency on predictable hotspots; wasted work on cold ones.

Consistent Hashing for Routing

The naive scheme hash(key) % N moves nearly every key when N changes. Consistent hashing places nodes on a 0..2³² ring; each key hashes to a point and is owned by the next node clockwise. Adding or removing a node moves only ~1/N of keys.

Vanilla consistent hashing has load imbalance; use virtual nodes (each physical node placed at 100–200 ring positions) to smooth it. Even better: jump consistent hash (Lamping & Veach) gives near-perfect balance with no per-node state, used by Vitess and Discord.

Client-side routing (Redis Cluster) puts the responsibility in the SDK. Proxy-side routing (Twemproxy, Envoy) centralizes it but adds a hop. See Redis Cluster for the slot-based variant.

Replication

Replication serves two goals: availability (failover) and read scaling. Master-replica with async replication (Redis primary/replica) is standard; expect ~1 ms replica lag in-AZ, 50–200 ms cross-region.

Choices: read-from-replica for hot keys (load split), with the caveat of stale reads. Quorum reads across replicas for stronger consistency at the cost of latency. Active-active with CRDT semantics for multi-region writes — powerful but complex; only the most demanding cases need it.

Hot Key Problem

One key for "Taylor Swift's profile" gets 50 K req/s — one node melts, the rest are bored. Three mitigations:

Replicate hot keys across N nodes; the client picks one randomly. Detect heat via top-K counters (count-min sketch + min-heap) and promote keys dynamically.
Local in-process L1 cache with short TTL (1–10 s). Caffeine on the JVM, in-process LRU in Go. The cache cluster becomes L2.
Probabilistic sharding — store the hot key under N synthetic keys (profile:taylor:0..profile:taylor:N); client picks one. Writes fan out to all replicas.

Cache Stampede

A popular key expires; 10 K concurrent readers all miss simultaneously and hammer the database. Fixes:

Request coalescing (singleflight): the first miss takes a per-key lock; concurrent missers wait on the result. Trivially implemented in Go via singleflight.Group; see also Memcached's add-as-lock pattern.
Probabilistic early expiration (XFetch): each reader recomputes a randomized refresh probability that increases as the key approaches TTL; one reader proactively refreshes ahead of the herd. The classic Vattani & Chierichetti paper.
Soft TTL + hard TTL — serve the stale value past the soft TTL while one async worker refreshes; only block on a true miss past the hard TTL.

Eviction Policies

LRU — default, good general-purpose. Approximate LRU (Redis) tracks last-access via a sample of K keys, not full history.
LFU — biases toward long-popular keys; resists scan pollution but slow to adapt to traffic shifts.
TinyLFU / W-TinyLFU (Caffeine) — admission filter combines frequency and recency; near-optimal hit rate for most workloads.
FIFO — cheapest, used for streaming windows.

Set maxmemory-policy explicitly. The default of noeviction turns the cache into a single point of failure when full.

Redis vs Memcached vs DragonflyDB

Memcached — simplest, multi-threaded, only string KV with byte values. Best for pure caching where you do not need data structures.
Redis — rich data structures (lists, sets, sorted sets, streams, HyperLogLog), single-threaded core (multi-threaded I/O since 6.0), pub/sub, Lua scripting, cluster mode with hash slots. Default for most teams.
DragonflyDB — Redis-compatible, multi-threaded shared-nothing per core, claims 25× throughput on a single instance. Worth evaluating when you would otherwise scale a Redis Cluster horizontally for CPU. See Redis vs Dragonfly for the deep comparison.
Hazelcast / Infinispan — in-JVM grid caches; the right answer if your stack is heavily Java and you want compute-near-data.

Failure Modes

Cache outage => DB meltdown — the cache hides 95% of the load; when it dies, the DB sees 20× traffic. Ensure DB has at least 30% headroom and the app has client-side per-key request coalescing.
Split brain — partitioned replicas accept writes; on heal, last-writer-wins eats data. Use a single master with manual failover or a consensus-backed coordinator.
Memory pressure — eviction storms when working set exceeds memory; p99 spikes to 50 ms. Alert on used_memory / maxmemory > 0.9.
Slow keys — KEYS *, SMEMBERS on million-element sets, blocking the single-threaded core. Disable dangerous commands; use SCAN patterns.

FAQ

What is a reasonable cache hit rate target?

Depends on workload. 90–99% for read-heavy product catalogs; 60–80% for personalized feeds. Below 50% the cache is hurting more than helping (still pays the network round trip on miss).

Cache invalidation or short TTL?

Both. TTL is your safety net — if invalidation logic has a bug, data eventually self-corrects. Invalidation cuts staleness window to seconds; TTL bounds it to minutes.

One big cache or per-service caches?

Per-service for isolation and independent scaling. Shared only for cross-cutting data like session tokens. Shared caches become political hot potatoes during incidents.

How do you migrate cache topology?

Run new cluster in parallel, dual-write from the app, switch reads when warmed, drop old cluster. Never live-resize a hash ring under traffic without dual-write.