Design a Key-Value Store

Consistent Hashing, Replication, Conflict Resolution, and LSM Storage

A distributed key-value store maps keys to values across a cluster of nodes (e.g., DynamoDB, etcd, Redis Cluster). The core challenges: partitioning data across nodes, replicating for fault tolerance, and resolving conflicts when concurrent writes happen. At millions of ops/sec, you need consistent hashing for balanced distribution, tunable quorum for consistency vs availability trade-offs, and an LSM-tree storage engine for fast writes.

Consistent Hashing Ring

Add or remove nodes and keys to see how consistent hashing distributes data. Each node can have virtual nodes (vnodes) for better balance. Keys are assigned to the first node clockwise on the ring.

Virtual nodes per node: 3

Capacity Estimation

Estimate the resources needed for your key-value store cluster.

Operations per second: 100K

Avg value size (KB): 1

Total keys (millions): 500

Replication factor (N): 3

Raw storage—

With replication—

Bandwidth—

Nodes needed—

Architecture

Client

→

Coordinator

→

Consistent Hash

↓

Replica 1

Primary

↓

Replica 2

Secondary

↓

Replica 3

Secondary

↓

WAL

↓

Memtable

↓

SSTable

Key Design Decisions

Consistent Hashing vs Range Partitioning Consistent Hashing Minimal reshuffling on node change
Even distribution with vnodes
No range queries
 
vs
 Range Partitioning Efficient range scans
Hotspot risk on popular ranges
Needs rebalancing logic
 

Conflict Resolution Vector Clocks Detects true conflicts
Client resolves divergence
Metadata grows over time
 
vs
 Last-Writer-Wins (LWW) Simple timestamp comparison
May silently lose writes
Clock skew risk
 

Replication — Quorum (R + W > N)

With N=3 replicas, set W=2, R=2 for strong consistency, or W=1, R=1 for high availability. The rule R+W>N guarantees overlap between read and write sets, ensuring you always read the latest write. DynamoDB and Cassandra let you tune this per request.

Compaction Strategies

Size-tiered: merge similarly-sized SSTables — good for write-heavy. Leveled: fixed-size levels with no overlap — better read amplification, used by LevelDB/RocksDB. Trade-off: write amplification vs read amplification vs space amplification.

Consistent Hashing Ring

Capacity Estimation

Architecture

Key Design Decisions

Consistent Hashing vs Range Partitioning

Conflict Resolution

Replication — Quorum (R + W > N)

Compaction Strategies

🔗 Related Topics