Design a Key-Value Store

Consistent Hashing, Replication, Conflict Resolution, and LSM Storage

A distributed key-value store maps keys to values across a cluster of nodes (e.g., DynamoDB, etcd, Redis Cluster). The core challenges: partitioning data across nodes, replicating for fault tolerance, and resolving conflicts when concurrent writes happen. At millions of ops/sec, you need consistent hashing for balanced distribution, tunable quorum for consistency vs availability trade-offs, and an LSM-tree storage engine for fast writes.

Consistent Hashing Ring

Add or remove nodes and keys to see how consistent hashing distributes data. Each node can have virtual nodes (vnodes) for better balance. Keys are assigned to the first node clockwise on the ring.

Capacity Estimation

Estimate the resources needed for your key-value store cluster.

Raw storageβ€”
With replicationβ€”
Bandwidthβ€”
Nodes neededβ€”

Architecture

Client
β†’
Coordinator
β†’
Consistent Hash
↓
Replica 1
Primary
↓
Replica 2
Secondary
↓
Replica 3
Secondary
↓
WAL
↓
Memtable
↓
SSTable

Key Design Decisions

Consistent Hashing vs Range Partitioning

Consistent Hashing
  • Minimal reshuffling on node change
  • Even distribution with vnodes
  • No range queries
vs
Range Partitioning
  • Efficient range scans
  • Hotspot risk on popular ranges
  • Needs rebalancing logic

Conflict Resolution

Vector Clocks
  • Detects true conflicts
  • Client resolves divergence
  • Metadata grows over time
vs
Last-Writer-Wins (LWW)
  • Simple timestamp comparison
  • May silently lose writes
  • Clock skew risk

Replication β€” Quorum (R + W > N)

With N=3 replicas, set W=2, R=2 for strong consistency, or W=1, R=1 for high availability. The rule R+W>N guarantees overlap between read and write sets, ensuring you always read the latest write. DynamoDB and Cassandra let you tune this per request.

Compaction Strategies

Size-tiered: merge similarly-sized SSTables β€” good for write-heavy. Leveled: fixed-size levels with no overlap β€” better read amplification, used by LevelDB/RocksDB. Trade-off: write amplification vs read amplification vs space amplification.

A foundational system design topic β€” master partitioning, replication, and consistency trade-offs.