Database Sharding
Horizontal Partitioning Strategies, Shard Key Selection, and Scalable Data Distribution
Sharding splits a database into horizontal partitions called shards, each stored on a separate node. The goal: scale writes, storage, and reads beyond what a single database can handle, typically past 1M rows/sec or 10TB. The core challenge: choosing a shard key that evenly distributes data, minimizes cross-shard operations, and survives node additions without massive reshuffling. Get it wrong and you'll create hotspots, rebalancing storms, or impossible JOINs. Get it right and you can scale linearly.
When Do You Need Sharding?
Run your own numbers to see whether sharding is warranted, and what capacity each shard needs.
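The capacity check is simple arithmetic: size the cluster by whichever dimension (write throughput or storage) demands more shards. The per-node limits below (50k writes/sec, 2 TB per node, 50% headroom) are illustrative assumptions, not benchmarks; substitute your own measured ceilings.

```python
import math

def shards_needed(peak_writes_per_sec: float,
                  hot_data_tb: float,
                  node_write_limit: float = 50_000,  # assumed single-node write ceiling
                  node_storage_tb: float = 2.0,      # assumed usable storage per node
                  headroom: float = 0.5) -> int:
    """Estimate shard count so each shard runs at (1 - headroom) of capacity."""
    usable_writes = node_write_limit * (1 - headroom)
    usable_storage = node_storage_tb * (1 - headroom)
    by_writes = math.ceil(peak_writes_per_sec / usable_writes)
    by_storage = math.ceil(hot_data_tb / usable_storage)
    return max(by_writes, by_storage)

print(shards_needed(200_000, 10))  # writes need 8 shards, storage needs 10 -> 10
```

If the answer is 1, you don't need sharding yet; add read replicas instead.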
Shard Key Strategies
The shard key determines how data is distributed. Pick the wrong one and you'll create bottlenecks.
Consistent Hashing Ring
Adding or removing nodes shows how consistent hashing minimizes key remapping compared to naive modulo.
Naive Modulo (hash % N)
Consistent Hashing
Resharding: Adding a Node
When you add a shard, naive modulo reassigns most keys. Consistent hashing reassigns only about 1/(N+1) of them when going from N to N+1 shards.
Before (4 shards)
After adding shard 5
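The difference is easy to measure with a minimal consistent-hash ring. The vnode count, node names, and MD5 hash below are arbitrary choices for illustration; production rings typically use more virtual nodes and a faster hash.

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # place `vnodes` virtual points per node around the ring
        self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node(self, key: str) -> str:
        # the first virtual point clockwise of the key's hash owns it
        i = bisect.bisect(self.points, h(key)) % len(self.points)
        return self.ring[i][1]

keys = [f"user:{i}" for i in range(10_000)]
before = Ring(["s1", "s2", "s3", "s4"])
after = Ring(["s1", "s2", "s3", "s4", "s5"])
moved_ring = sum(before.node(k) != after.node(k) for k in keys) / len(keys)
moved_mod = sum(h(k) % 4 != h(k) % 5 for k in keys) / len(keys)
print(f"ring moved {moved_ring:.0%} of keys, modulo moved {moved_mod:.0%}")
```

Expect the ring to move roughly 20% of keys (the new node's share) while modulo moves roughly 80%.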
Cross-Shard Query Patterns
Some queries must touch multiple shards. Choose the right strategy based on your access pattern.
Hot Key Mitigation
A single popular shard key (e.g., celebrity user) can overwhelm one shard. Here's how to handle it.
Key Splitting
user:1234 → user:1234:0, user:1234:1, user:1234:2
When: one key has >20% of traffic
Local Cache
Redis L1 cache on each shard node for hot keys
When: read-heavy hot keys
Replica Read
Route hot key reads to read replicas, writes to primary
Caveat: replication lag may serve stale reads
Salting
Add random suffix to distribute load: user:1234:salt → shard N
When: hot key pattern is predictable
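Key splitting and salting can be sketched in a few lines. The `SPLITS` count and key format below are assumptions; the important trade-off is that writes get cheaper while every read now fans out to all sub-keys and must merge the results.

```python
import random

SPLITS = 3  # sub-keys per hot key; pick based on observed traffic share

def salted_write_key(user_id: str) -> str:
    """Writes scatter across SPLITS sub-keys, so no single shard absorbs them all."""
    return f"user:{user_id}:{random.randrange(SPLITS)}"

def read_fanout_keys(user_id: str) -> list[str]:
    """Reads must fan out to every sub-key and merge the results."""
    return [f"user:{user_id}:{i}" for i in range(SPLITS)]

print(read_fanout_keys("1234"))  # ['user:1234:0', 'user:1234:1', 'user:1234:2']
```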
Shard Key Decision Framework
Query Pattern First
Identify your top 3 most common queries. Can they be answered by a single shard? If yes, use a shard key aligned to that query. If all queries are cross-shard, lean toward denormalization.
Write Distribution
Will writes hit a few keys heavily (time-series, social media)? Use hash-based for even write distribution. Writes uniformly distributed? Range-based works fine.
Cardinality
The shard key must have high cardinality: thousands or more distinct values. User ID is good; country code (200 values) is not. Low cardinality means a few large shards.
Resharding Cost
Plan for 2× expected growth to avoid resharding in <12 months. If resharding is likely, use directory-based or consistent hashing from day one.
Golden Rule
If you can't answer "which shard does this query hit?" before executing, your shard key needs work. Every unnecessary scatter-gather query multiplies latency by the number of shards.
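The golden rule can be enforced mechanically with a tiny routing check. Using `user_id` as the shard key and an 8-shard cluster are assumptions for illustration:

```python
import hashlib

NUM_SHARDS = 8  # assumed cluster size

def shard_for(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_SHARDS

def shards_hit(filters: dict) -> list[int]:
    """Answer 'which shard does this query hit?' before executing it."""
    if "user_id" in filters:                # shard key pinned: exactly one shard
        return [shard_for(str(filters["user_id"]))]
    return list(range(NUM_SHARDS))          # scatter-gather: latency scales with shard count
```

A query filter without the shard key always degrades to scatter-gather; logging those cases in development is a cheap way to catch a bad shard key early.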
Frequently Asked Questions
When should I shard vs. just adding read replicas?
Add replicas first. Replicas handle read scaling, improve availability, and require zero application changes. Shard when: (1) you hit write throughput limits, (2) storage exceeds a single node, or (3) p99 latency spikes under write load. Typically this is 5–50M+ rows/day or 1TB+ of hot data.
What happens to foreign keys and JOINs?
Foreign keys don't automatically span shards. You have three options: denormalize (duplicate data for co-location), scatter-gather queries (run the sub-query on each shard and merge the results), or application-level joins (fetch from each shard and join in your service layer). Denormalization is most common in sharded systems.
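An application-level join might look like the following sketch, using in-memory dicts as stand-ins for shard stores. The key optimization is batching: group the needed IDs by owning shard so each shard is fetched once, not once per row.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size

def shard_for(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_SHARDS

# stand-in shard stores: one dict of user rows per shard
shards = [dict() for _ in range(NUM_SHARDS)]
for uid, name in [("u1", "Ada"), ("u2", "Lin")]:
    shards[shard_for(uid)][uid] = {"id": uid, "name": name}

def join_orders_with_users(orders: list[dict]) -> list[dict]:
    """Application-level join: group user ids by shard, fetch once per shard, merge."""
    wanted: dict[int, set] = {}
    for o in orders:
        wanted.setdefault(shard_for(o["user_id"]), set()).add(o["user_id"])
    users = {}
    for s, ids in wanted.items():           # one batched fetch per shard
        users.update({uid: shards[s][uid] for uid in ids})
    return [{**o, "user": users[o["user_id"]]} for o in orders]

orders = [{"order_id": 1, "user_id": "u1"}, {"order_id": 2, "user_id": "u2"}]
result = join_orders_with_users(orders)
```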
Can I shard without a middleware proxy?
Yes. Many databases (MongoDB, Cassandra, CockroachDB) handle shard routing natively. Proxy-based sharding (e.g., Vitess) adds a routing layer that your app talks to like a single database, while natively distributed databases (CockroachDB, YugabyteDB, Spanner) handle routing transparently. A proxy adds latency but gives you full control; native sharding is simpler but less flexible.
How do I choose between hash and range sharding?
Hash sharding when: queries are random-access by ID, writes are uniformly distributed, you want even load. Range sharding when: queries naturally cluster (time ranges, alphabetical ranges, geographic regions), you frequently scan contiguous ranges, and data distribution is naturally chunkable.
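The two strategies reduce to different routing functions. The quarterly split points below are assumed for illustration; note how range sharding keeps contiguous dates on one shard, which is exactly what makes range scans cheap and hot recent-data shards possible.

```python
import bisect
import hashlib

def hash_shard(key: str, n: int = 8) -> int:
    """Hash sharding: even spread, but contiguous keys scatter across shards."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % n

RANGE_BOUNDS = ["2024-04-01", "2024-07-01", "2024-10-01"]  # assumed quarterly split points

def range_shard(day: str) -> int:
    """Range sharding: contiguous dates land on the same shard, enabling range scans."""
    return bisect.bisect(RANGE_BOUNDS, day)

# contiguous dates share a shard under range sharding
assert range_shard("2024-05-01") == range_shard("2024-06-30")
```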
What's the ideal number of shards?
Design for 2× your expected 18-month growth. Start with enough shards that each can handle your peak load with 50% headroom. For most mid-size systems, 4–16 shards is a good range. Too few shards = insufficient scale. Too many = coordination overhead, more cross-shard queries, harder operations. You can always add shards with consistent hashing.
How do I do cross-shard transactions?
Distributed transactions (two-phase commit, the Saga pattern) are expensive and slow. The sharding philosophy prefers eventual consistency: writes go to one shard, and async events propagate to others. For true ACID across shards, use a distributed transaction coordinator (Vitess, CockroachDB, Google Spanner). Prefer single-shard transactions plus async reconciliation wherever possible.