Sharding

Range, Hash, and Geo Sharding — Consistent Hashing, Cross-Shard Queries, and How CockroachDB / Vitess / TiDB Pick Differently

Beyond the limit where one machine handles all traffic — both writes and reads — you have to split the data across multiple machines. That is sharding (the term comes from "shattering" a single logical database into pieces). The choices are: how to map rows to shards (range / hash / geo), how to handle queries that touch multiple shards, how to add and remove shards without downtime, and how to balance load when some keys are hotter than others. Every distributed database — CockroachDB, Vitess, TiDB, Spanner, DynamoDB, MongoDB — picks differently, and the choice cascades into transaction support, query latency, and operational complexity.

Three Sharding Strategies

Range, hash, and geo. Each has a clean use case and dangerous edge cases.

Range Sharding A–F G–M N–S T–Z range scans cheap Hash Sharding hash%4=0 hash%4=1 hash%4=2 hash%4=3 uniform load Geo Sharding us-east eu-west ap-south low local latency Use range for ordered scans, hash for uniform load, geo for low-latency reads near users.

Key Numbers

512 MB
CockroachDB default range size · split when exceeded
96 MB
TiKV default region size
10 GB
DynamoDB partition cap (auto-splits beyond)
128–512
Virtual nodes per physical node (consistent hashing)
2-3×
Cross-shard transaction overhead vs single-shard
~50ms
Online resharding lag for healthy clusters
3
Replicas per shard (typical · majority quorum=2)

1. Range Sharding

Sort the keyspace, slice it into contiguous ranges, assign each range to a shard.

Range sharding preserves order, so range scans (WHERE k BETWEEN ...) read from a small contiguous set of shards. The challenge: monotonic key insertion (timestamps, auto-increment IDs) creates a hotspot at the rightmost shard.

# CockroachDB: every range is a 512 MB chunk of the sorted keyspace
# When a range exceeds 512 MB, it splits at the midpoint
# When two adjacent ranges are both small, they may merge

# Hot range mitigation:
#   - Hash a prefix into the key (loses some range-scan locality)
#   - Use UUIDs or random prefixes for high-write tables
#   - Manual SPLIT AT for known hot patterns

# CockroachDB range descriptor (simplified):
RangeDescriptor {
    StartKey: "user/A",
    EndKey:   "user/F",
    Replicas: ["node1", "node2", "node3"],   // Raft group
}

Used by: CockroachDB, TiDB/TiKV, HBase, Bigtable, Spanner.

2. Hash Sharding

Run a hash function over the key, modulo the number of shards. Uniform load by construction.

# Simple hash sharding (the "modulo" approach)
shard_id = hash(key) % num_shards

# Problem: adding a shard reshuffles ~all keys
# (every key whose hash%new_count != hash%old_count moves)

# Solution: CONSISTENT HASHING
# Map both keys and shards onto a ring (0 to 2^32-1)
# Each key belongs to the next shard clockwise on the ring
#
# Adding a shard only moves the keys that fall between
# the new shard's position and the previous one
#
# Removing a shard reassigns its keys to the next clockwise shard

Pure consistent hashing has uneven load (some arcs are bigger than others). The fix is virtual nodes: each physical node owns 100–500 positions on the ring. Average over many positions → uniform load.

Used by: DynamoDB, Cassandra, Riak, Memcached. Excellent for KV workloads with point lookups; bad for range scans (a contiguous key range maps to random shards).

3. Geo Sharding

Place each shard in the region closest to its users. The key contains a region prefix (or table-level region annotation):

# CockroachDB regional-by-row: each row stamped with crdb_region
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email STRING,
  crdb_region crdb_internal_region NOT VISIBLE NOT NULL
    DEFAULT default_to_database_primary_region(gateway_region())
) LOCALITY REGIONAL BY ROW;

# Now each row's leaseholder lives in its crdb_region.
# Reads from the same region are local (~5 ms).
# Cross-region reads cost a WAN round-trip (~80 ms).

Spanner pioneered this with placement directives. CockroachDB exposes it as LOCALITY REGIONAL BY ROW. The hard part: cross-region updates and consistency. Spanner uses TrueTime + 2PC across regions; CockroachDB uses bounded staleness reads to avoid the WAN cost.

4. Vertical vs Horizontal Partitioning

Horizontal partitioning = sharding (split rows). Vertical partitioning = split columns into separate tables (or even separate databases). Vertical is sometimes useful when one column is huge or accessed differently:

# Vertical partition: separate hot and cold columns
users_main:    (id, email, name, last_login)            -- read often
users_profile: (id, bio, avatar_url, preferences_json)  -- read on profile
users_audit:   (id, login_history, action_log)          -- write-heavy

# Joins across vertical partitions cost extra. Use only when:
#   - One column dominates row size
#   - Different access patterns benefit from different physical layout
#   - Different durability requirements (hot=SSD, cold=S3)

Modern OLTP databases rarely need explicit vertical partitioning — column store engines (Parquet, ORC) and TOAST (PostgreSQL) handle the same need without breaking the schema.

5. The Cross-Shard Query Problem

A query that touches multiple shards needs scatter-gather:

SELECT count(*) FROM orders WHERE created_at > '2026-01-01';

-- If "orders" is hash-sharded by order_id:
--   1. Send query to ALL N shards (scatter)
--   2. Each shard returns its count
--   3. Coordinator sums (gather)
-- Latency = max(per-shard latency); cost = N * per-shard cost.

-- If range-sharded by created_at:
--   1. Coordinator reads only the shards in range
--   2. Latency = O(matching shards)

This is why range sharding wins for time-series and log workloads, even with the hotspot risk. Hash sharding is great for "look up user by id" but kills "show me yesterday's signups."

Modern distributed databases (CockroachDB, TiDB, Vitess) push predicates down to each shard so the coordinator only assembles the final result. They also use distributed SQL planners that decide whether to broadcast a small table to all shards or pull all data to one node for a join.

6. Distributed Transactions

If a transaction touches multiple shards, you need coordination. Two-phase commit (2PC) is the textbook answer; modern systems wrap it with timestamp-ordered concurrency control:

SystemApproach
Spanner2PC + Paxos per shard + TrueTime for ordering
CockroachDBParallel commits + Raft per range + HLC timestamps
TiDBPercolator-style 2PC + timestamp oracle (PD)
Vitess2PC available, but most apps stick to single-shard for speed
DynamoDBTransactWriteItems for up to 100 items via 2PC

The cost: cross-shard transactions take 2-3× the latency of single-shard ones. Most schemas are designed to keep transactions on a single shard via colocation (e.g., putting all rows for one user on the same shard).

7. Online Resharding

Adding capacity without downtime. The basic flow:

  • Provision new shard(s) with empty state.
  • Pick rows to move — split a range or rehash a region.
  • Copy and apply ongoing changes — bulk copy + tail the change log.
  • Cut over atomically — flip the routing layer in one consistent step (commit a metadata change via consensus).

CockroachDB does this transparently via range splits and rebalancing. Vitess does it via VReplication — a tail-copy of the source binlog into the destination shard, with a final cutover when caught up. TiDB uses PD (Placement Driver) to schedule region splits and rebalancing.

8. Approach Comparison

SystemStrategyNotable
CockroachDBRange, 512 MBAuto-split, per-range Raft, regional-by-row
Vitess (MySQL)Hash by vindexPowers YouTube, Slack; explicit resharding workflow
TiDB / TiKVRange, ~96 MB regionsPD orchestrates splits, Raft per region
SpannerRange + localityTrueTime makes 2PC across regions practical
DynamoDBHash partition + composite range key10 GB partition cap, auto-splits
MongoDBRange or hash, configurableChunk size 64 MB default; balancer redistributes
CassandraConsistent hash with vnodesToken ring, 256 vnodes/host default

9. Hot Shards and the Celebrity Problem

The most common production failure mode: one key (or one range of keys) gets disproportionate traffic. Justin Bieber tweets, the World Cup, a viral video — these create skewed access patterns that any shard map cannot handle naturally.

# Strategies for hot keys:
# 1. Caching layer in front (Redis, CDN)
#    - Reads: easy
#    - Writes: hard (cache invalidation)
#
# 2. Read replicas for the hot shard
#    - Add multiple followers, scatter reads
#    - Stale reads acceptable
#
# 3. Sub-sharding the hot key
#    - Split "celebrity_user/posts" into "celebrity_user/posts_shard_N"
#    - Application aggregates on read
#
# 4. Last-write-wins with reduced consistency
#    - Convert to AP, accept eventual consistency for the hot data

The celebrity problem is one reason Twitter/X invested heavily in fanout-on-write architectures. CockroachDB has built-in load-based splitting — a range exceeding QPS threshold gets split even if it's small.

10. Routing Layer

Every sharded system needs a routing component that maps a query to its target shard(s). Three architectures:

  • Smart client: client library knows the shard map, routes directly to the right shard. Saves a hop but couples clients to topology. Used by Cassandra (token-aware drivers), MongoDB (mongos optional), Redis Cluster.
  • Proxy layer: a stateless proxy sits between clients and shards. Easier to deploy and version, costs an extra RTT. Used by Vitess (vtgate), Envoy, Twemproxy.
  • Built into the database: the database itself is distributed. Any node can accept any query and route internally. CockroachDB and TiDB work this way.

The shard map (which shard owns which keys) is stored in metadata that itself must be highly available. CockroachDB stores it in a top-level Raft group; Vitess uses a dedicated topology service (etcd or ZooKeeper); MongoDB uses a config server replica set.

11. Worked Example: Designing User-Centric Sharding

You're building a chat application. Schema: users, conversations, messages. How to shard?

# Most queries: "fetch all messages in conversation C"
# So: shard messages by conversation_id
#
# users: hash by user_id (point lookups, even load)
# conversations: hash by conversation_id
# messages: hash by conversation_id (colocated with conversation)
#
# Query "all messages from user X across all conversations":
#   Two-step: first GET conversations WHERE participant=X
#             then SCATTER GET messages WHERE conversation IN (...)
#
# Cross-shard transactions only when:
#   - User joins/leaves a conversation (rare)
#   - Conversation moved or merged (very rare)
# Otherwise: all reads/writes on a single conversation are single-shard.

The general pattern: pick a "tenant" or "aggregate root" (user, conversation, account, customer) and colocate all that aggregate's data on one shard. 90%+ of operations stay single-shard. This is what every production system does, even ones that nominally support cross-shard transactions.

Tradeoffs

ApproachStrengthWeakness
RangeRange scans cheap; matches SQL workloadsHotspot on monotonic insert
HashUniform load; no hotspotRange scans are scatter-gather
GeoLow local latency for region-bound usersCross-region transactions slow; complex routing
Consistent hashingAdding/removing shards moves few keysSkew without virtual nodes
Single-shard transactions onlyFast, simpleSchema design must keep related data together

FAQ

How do I pick the shard key?

Three constraints: high cardinality (otherwise some shards stay tiny), even access frequency (no celebrity user), and colocation of rows in the same transaction (so transactions stay single-shard). Most violations show up as production hotspots.

Can I shard a relational database without losing transactions?

Yes — that's what distributed SQL (CockroachDB, Spanner, TiDB) provides. Cost: 2-3× latency for cross-shard transactions vs single-shard. Most schemas can be designed to keep 90%+ of transactions single-shard.

What's the difference between Vitess vindexes and CockroachDB indexes?

A Vitess vindex is the rule mapping a logical row to a shard — application-level routing on top of MySQL. CockroachDB's per-range structure is internal and transparent — the database decides which range a row lives in. Vitess is more explicit; CockroachDB is more autonomous.

When should I avoid sharding entirely?

Single-node databases scale further than people think. PostgreSQL with a 64-core box and NVMe routinely handles 50K+ writes/sec and TBs of data. Shard only when you've genuinely outgrown vertical scaling, because operational complexity jumps a lot.

What's "shuffle" in scatter-gather?

For multi-shard joins or aggregations, intermediate results need to be redistributed across shards based on the join key. This is the shuffle step — borrowed from MapReduce. Slow because it crosses the network for every row touched.

How does etcd shard?

It doesn't. etcd is intentionally a single Raft group with one consistent KV store — designed for ~few-GB metadata, not large-scale data. For sharded KV at huge scale you want CockroachDB, Spanner, or DynamoDB.