Sharding

Range, Hash, and Geo Sharding — Consistent Hashing, Cross-Shard Queries, and How CockroachDB / Vitess / TiDB Pick Differently

Beyond the limit where one machine handles all traffic — both writes and reads — you have to split the data across multiple machines. That is sharding (the term comes from "shattering" a single logical database into pieces). The choices are: how to map rows to shards (range / hash / geo), how to handle queries that touch multiple shards, how to add and remove shards without downtime, and how to balance load when some keys are hotter than others. Every distributed database — CockroachDB, Vitess, TiDB, Spanner, DynamoDB, MongoDB — picks differently, and the choice cascades into transaction support, query latency, and operational complexity.

Three Sharding Strategies

Range, hash, and geo. Each has a clean use case and dangerous edge cases.

Key Numbers

512 MB

CockroachDB default range size · split when exceeded

96 MB

TiKV default region size

10 GB

DynamoDB partition cap (auto-splits beyond)

128–512

Virtual nodes per physical node (consistent hashing)

2-3×

Cross-shard transaction overhead vs single-shard

~50ms

Online resharding lag for healthy clusters

Replicas per shard (typical · majority quorum=2)

1. Range Sharding

Sort the keyspace, slice it into contiguous ranges, assign each range to a shard.

Range sharding preserves order, so range scans (WHERE k BETWEEN ...) read from a small contiguous set of shards. The challenge: monotonic key insertion (timestamps, auto-increment IDs) creates a hotspot at the rightmost shard.

# CockroachDB: every range is a 512 MB chunk of the sorted keyspace
# When a range exceeds 512 MB, it splits at the midpoint
# When two adjacent ranges are both small, they may merge

# Hot range mitigation:
#   - Hash a prefix into the key (loses some range-scan locality)
#   - Use UUIDs or random prefixes for high-write tables
#   - Manual SPLIT AT for known hot patterns

# CockroachDB range descriptor (simplified):
RangeDescriptor {
    StartKey: "user/A",
    EndKey:   "user/F",
    Replicas: ["node1", "node2", "node3"],   // Raft group
}

Used by: CockroachDB, TiDB/TiKV, HBase, Bigtable, Spanner.

2. Hash Sharding

Run a hash function over the key, modulo the number of shards. Uniform load by construction.

# Simple hash sharding (the "modulo" approach)
shard_id = hash(key) % num_shards

# Problem: adding a shard reshuffles ~all keys
# (every key whose hash%new_count != hash%old_count moves)

# Solution: CONSISTENT HASHING
# Map both keys and shards onto a ring (0 to 2^32-1)
# Each key belongs to the next shard clockwise on the ring
#
# Adding a shard only moves the keys that fall between
# the new shard's position and the previous one
#
# Removing a shard reassigns its keys to the next clockwise shard

Pure consistent hashing has uneven load (some arcs are bigger than others). The fix is virtual nodes: each physical node owns 100–500 positions on the ring. Average over many positions → uniform load.

Used by: DynamoDB, Cassandra, Riak, Memcached. Excellent for KV workloads with point lookups; bad for range scans (a contiguous key range maps to random shards).

3. Geo Sharding

Place each shard in the region closest to its users. The key contains a region prefix (or table-level region annotation):

# CockroachDB regional-by-row: each row stamped with crdb_region
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email STRING,
  crdb_region crdb_internal_region NOT VISIBLE NOT NULL
    DEFAULT default_to_database_primary_region(gateway_region())
) LOCALITY REGIONAL BY ROW;

# Now each row's leaseholder lives in its crdb_region.
# Reads from the same region are local (~5 ms).
# Cross-region reads cost a WAN round-trip (~80 ms).

Spanner pioneered this with placement directives. CockroachDB exposes it as LOCALITY REGIONAL BY ROW. The hard part: cross-region updates and consistency. Spanner uses TrueTime + 2PC across regions; CockroachDB uses bounded staleness reads to avoid the WAN cost.

4. Vertical vs Horizontal Partitioning

Horizontal partitioning = sharding (split rows). Vertical partitioning = split columns into separate tables (or even separate databases). Vertical is sometimes useful when one column is huge or accessed differently:

# Vertical partition: separate hot and cold columns
users_main:    (id, email, name, last_login)            -- read often
users_profile: (id, bio, avatar_url, preferences_json)  -- read on profile
users_audit:   (id, login_history, action_log)          -- write-heavy

# Joins across vertical partitions cost extra. Use only when:
#   - One column dominates row size
#   - Different access patterns benefit from different physical layout
#   - Different durability requirements (hot=SSD, cold=S3)

Modern OLTP databases rarely need explicit vertical partitioning — column store engines (Parquet, ORC) and TOAST (PostgreSQL) handle the same need without breaking the schema.

5. The Cross-Shard Query Problem

A query that touches multiple shards needs scatter-gather:

SELECT count(*) FROM orders WHERE created_at > '2026-01-01';

-- If "orders" is hash-sharded by order_id:
--   1. Send query to ALL N shards (scatter)
--   2. Each shard returns its count
--   3. Coordinator sums (gather)
-- Latency = max(per-shard latency); cost = N * per-shard cost.

-- If range-sharded by created_at:
--   1. Coordinator reads only the shards in range
--   2. Latency = O(matching shards)

This is why range sharding wins for time-series and log workloads, even with the hotspot risk. Hash sharding is great for "look up user by id" but kills "show me yesterday's signups."

Modern distributed databases (CockroachDB, TiDB, Vitess) push predicates down to each shard so the coordinator only assembles the final result. They also use distributed SQL planners that decide whether to broadcast a small table to all shards or pull all data to one node for a join.

6. Distributed Transactions

If a transaction touches multiple shards, you need coordination. Two-phase commit (2PC) is the textbook answer; modern systems wrap it with timestamp-ordered concurrency control:

System	Approach
Spanner	2PC + Paxos per shard + TrueTime for ordering
CockroachDB	Parallel commits + Raft per range + HLC timestamps
TiDB	Percolator-style 2PC + timestamp oracle (PD)
Vitess	2PC available, but most apps stick to single-shard for speed
DynamoDB	TransactWriteItems for up to 100 items via 2PC

The cost: cross-shard transactions take 2-3× the latency of single-shard ones. Most schemas are designed to keep transactions on a single shard via colocation (e.g., putting all rows for one user on the same shard).

7. Online Resharding

Adding capacity without downtime. The basic flow:

Provision new shard(s) with empty state.
Pick rows to move — split a range or rehash a region.
Copy and apply ongoing changes — bulk copy + tail the change log.
Cut over atomically — flip the routing layer in one consistent step (commit a metadata change via consensus).

CockroachDB does this transparently via range splits and rebalancing. Vitess does it via VReplication — a tail-copy of the source binlog into the destination shard, with a final cutover when caught up. TiDB uses PD (Placement Driver) to schedule region splits and rebalancing.

8. Approach Comparison

System	Strategy	Notable
CockroachDB	Range, 512 MB	Auto-split, per-range Raft, regional-by-row
Vitess (MySQL)	Hash by vindex	Powers YouTube, Slack; explicit resharding workflow
TiDB / TiKV	Range, ~96 MB regions	PD orchestrates splits, Raft per region
Spanner	Range + locality	TrueTime makes 2PC across regions practical
DynamoDB	Hash partition + composite range key	10 GB partition cap, auto-splits
MongoDB	Range or hash, configurable	Chunk size 64 MB default; balancer redistributes
Cassandra	Consistent hash with vnodes	Token ring, 256 vnodes/host default

9. Hot Shards and the Celebrity Problem

The most common production failure mode: one key (or one range of keys) gets disproportionate traffic. Justin Bieber tweets, the World Cup, a viral video — these create skewed access patterns that any shard map cannot handle naturally.

# Strategies for hot keys:
# 1. Caching layer in front (Redis, CDN)
#    - Reads: easy
#    - Writes: hard (cache invalidation)
#
# 2. Read replicas for the hot shard
#    - Add multiple followers, scatter reads
#    - Stale reads acceptable
#
# 3. Sub-sharding the hot key
#    - Split "celebrity_user/posts" into "celebrity_user/posts_shard_N"
#    - Application aggregates on read
#
# 4. Last-write-wins with reduced consistency
#    - Convert to AP, accept eventual consistency for the hot data

The celebrity problem is one reason Twitter/X invested heavily in fanout-on-write architectures. CockroachDB has built-in load-based splitting — a range exceeding QPS threshold gets split even if it's small.

10. Routing Layer

Every sharded system needs a routing component that maps a query to its target shard(s). Three architectures:

Smart client: client library knows the shard map, routes directly to the right shard. Saves a hop but couples clients to topology. Used by Cassandra (token-aware drivers), MongoDB (mongos optional), Redis Cluster.
Proxy layer: a stateless proxy sits between clients and shards. Easier to deploy and version, costs an extra RTT. Used by Vitess (vtgate), Envoy, Twemproxy.
Built into the database: the database itself is distributed. Any node can accept any query and route internally. CockroachDB and TiDB work this way.

The shard map (which shard owns which keys) is stored in metadata that itself must be highly available. CockroachDB stores it in a top-level Raft group; Vitess uses a dedicated topology service (etcd or ZooKeeper); MongoDB uses a config server replica set.

11. Worked Example: Designing User-Centric Sharding

You're building a chat application. Schema: users, conversations, messages. How to shard?

# Most queries: "fetch all messages in conversation C"
# So: shard messages by conversation_id
#
# users: hash by user_id (point lookups, even load)
# conversations: hash by conversation_id
# messages: hash by conversation_id (colocated with conversation)
#
# Query "all messages from user X across all conversations":
#   Two-step: first GET conversations WHERE participant=X
#             then SCATTER GET messages WHERE conversation IN (...)
#
# Cross-shard transactions only when:
#   - User joins/leaves a conversation (rare)
#   - Conversation moved or merged (very rare)
# Otherwise: all reads/writes on a single conversation are single-shard.

The general pattern: pick a "tenant" or "aggregate root" (user, conversation, account, customer) and colocate all that aggregate's data on one shard. 90%+ of operations stay single-shard. This is what every production system does, even ones that nominally support cross-shard transactions.

Tradeoffs

Approach	Strength	Weakness
Range	Range scans cheap; matches SQL workloads	Hotspot on monotonic insert
Hash	Uniform load; no hotspot	Range scans are scatter-gather
Geo	Low local latency for region-bound users	Cross-region transactions slow; complex routing
Consistent hashing	Adding/removing shards moves few keys	Skew without virtual nodes
Single-shard transactions only	Fast, simple	Schema design must keep related data together

FAQ

How do I pick the shard key?

Three constraints: high cardinality (otherwise some shards stay tiny), even access frequency (no celebrity user), and colocation of rows in the same transaction (so transactions stay single-shard). Most violations show up as production hotspots.

Can I shard a relational database without losing transactions?

Yes — that's what distributed SQL (CockroachDB, Spanner, TiDB) provides. Cost: 2-3× latency for cross-shard transactions vs single-shard. Most schemas can be designed to keep 90%+ of transactions single-shard.

What's the difference between Vitess vindexes and CockroachDB indexes?

A Vitess vindex is the rule mapping a logical row to a shard — application-level routing on top of MySQL. CockroachDB's per-range structure is internal and transparent — the database decides which range a row lives in. Vitess is more explicit; CockroachDB is more autonomous.

When should I avoid sharding entirely?

Single-node databases scale further than people think. PostgreSQL with a 64-core box and NVMe routinely handles 50K+ writes/sec and TBs of data. Shard only when you've genuinely outgrown vertical scaling, because operational complexity jumps a lot.

What's "shuffle" in scatter-gather?

For multi-shard joins or aggregations, intermediate results need to be redistributed across shards based on the join key. This is the shuffle step — borrowed from MapReduce. Slow because it crosses the network for every row touched.

How does etcd shard?

It doesn't. etcd is intentionally a single Raft group with one consistent KV store — designed for ~few-GB metadata, not large-scale data. For sharded KV at huge scale you want CockroachDB, Spanner, or DynamoDB.