CockroachDB Internals
CockroachDB takes the architecture from Google's Spanner paper, drops the requirement for atomic clocks, and ships it as a single Go binary that speaks PostgreSQL on the wire. Every node is symmetric β there is no coordinator, no separate SQL/storage tier, no leader for the cluster as a whole. Data is range-sharded into 512 MiB chunks, each one its own Raft group; transactions are serializable by default, multi-key, and use a Hybrid Logical Clock to order events across the cluster without needing special hardware. This page dissects how those primitives fit together β the storage engine (Pebble), the consensus layer (Raft per range), the transaction model (MVCC + intents + parallel commits), the distributed SQL executor (DistSQL), and the multi-region placement system.
Architecture Overview
Every node runs the full stack: SQL layer, transaction coordinator, KV router, and Pebble storage. There is no role distinction. Ranges are replicated 3-way by default; the leader serves writes and (by default) consistent reads. Gossip discovers the cluster topology.
Key Numbers
Why CockroachDB Exists
Range-Based Sharding
The unit of distribution is the range: a contiguous sorted slice of the global key space, default target 512 MiB, replicated to 3 nodes via Raft. Every range has a Raft leader (which serves writes and consistent reads) and 2-4 followers.
-- Inspect range distribution for a table
SHOW RANGES FROM TABLE rides;
start_key | end_key | range_id | range_size_mb | lease_holder | replicas
----------+----------+----------+---------------+--------------+----------
NULL | /1042 | 117 | 482.3 | 4 | {2,4,7}
/1042 | /2189 | 118 | 502.1 | 7 | {2,4,7}
/2189 | /4317 | 119 | 511.8 | 2 | {1,2,4}
/4317 | NULL | 120 | 96.4 | 4 | {2,4,7}
-- Force a split (rare, mostly automatic)
ALTER TABLE rides SPLIT AT VALUES (5000), (10000);
-- Pin replicas geographically
ALTER TABLE rides CONFIGURE ZONE USING
num_replicas = 5,
constraints = '{+region=us-east: 2, +region=eu-west: 2, +region=ap-south: 1}',
lease_preferences = '[[+region=us-east]]'; Splits happen on size threshold or on demand (e.g., load-based splitting when a hot range is detected). Merges happen when adjacent ranges shrink below half the target. The split decision uses approximate quantiles built from periodic samples of the range's keys.
Pebble: The Storage Engine
Pebble is a Go-native LSM storage engine, written by the Cockroach team as a drop-in replacement for RocksDB (which they used until 20.1). It keeps the same SST file format and compaction strategies, but eliminates the cgo overhead of the C++ RocksDB bindings β Pebble runs in the same Go runtime as the rest of cockroach.
Cockroach stores everything as MVCC keys: every logical key
k becomes k/<timestamp> on disk. The
timestamp is the HLC commit timestamp, encoded so newer versions sort
first. A point read is "find the first version <= read_ts that isn't
a tombstone."
// Logical write
INSERT INTO rides VALUES (42, 'alice', '2025-04-01');
// Physical KV in Pebble (simplified)
/Table/52/1/42/0/1714568400123456789.0 β alice, 2025-04-01 // committed value
/Table/52/1/42/0/0 β intent reference // intent record
// points back to txn
// On read at ts=1714568400999999999:
// scan /Table/52/1/42/0/ in descending ts order
// first non-intent version <= read_ts wins
// if intent's txn is committed and its ts <= read_ts, it wins
// if intent's txn is pending, reader either waits or pushes (forces resolution)
Garbage collection deletes versions older than gc.ttlseconds
(default 25 hours, hence the famous "cockroachdb time travel queries"
window) by writing range tombstones during compaction.
Hybrid Logical Clocks (HLC)
Every node maintains an HLC: a (physical_time,
logical_count) pair. Physical_time tracks the OS clock;
logical_count breaks ties when wall-clock granularity isn't enough. On
every event:
// Event happens at this node
fn local_event() -> HLC {
let pt = max(self.physical, system_clock());
if pt == self.physical { self.logical += 1; } else { self.logical = 0; }
self.physical = pt;
HLC { physical: pt, logical: self.logical }
}
// Receive an HLC from another node
fn merge(remote: HLC) -> HLC {
let pt = max(self.physical, remote.physical, system_clock());
// ... bump logical to maintain monotonicity
// FAIL if pt - system_clock() > max_offset (clock skew limit)
}
The max_offset setting (default 500ms) bounds how far HLCs
can diverge. If the OS clock jumps backward by more than max_offset,
cockroach commits suicide rather than risk inconsistency. NTP must keep
clocks within max_offset; in cloud environments this usually means
running chrony with PPS-disciplined upstream.
Read uncertainty. When a transaction reads a value whose timestamp is within max_offset of its start time, cockroach can't tell if the value was written before or after the read began. It restarts the read at a higher timestamp (cheap) or, in rare cases for writes, waits out the uncertainty window (commit wait β typically a few ms). This is the latency tax that buys you serializability without atomic clocks.
Transactions: Serializable by Default
Cockroach transactions are serializable, multi-key, and span any number of ranges. The protocol uses MVCC + write intents + a transaction record:
- Begin. Coordinator picks a node (any node). Allocates a transaction ID + start_ts from local HLC.
- Reads. Each read finds the latest version <= start_ts. Updates a per-range "timestamp cache" tracking the read's timestamp on every key β this prevents future writes from "going back in time."
- Writes. Each write places an intent on the key (an MVCC version pointing back to the txn record). Intents block readers and other writers until resolved.
- Commit. Coordinator writes the transaction record's status as COMMITTED. In parallel commits (default since 19.2), the intent writes and the txn record write are issued concurrently; commit completes when both succeed. Intents are then "resolved" (rewritten as plain MVCC values) lazily.
Conflicts are detected via the timestamp cache. If T1 wrote at ts=10 and
T2 (start_ts=5) tries to read the same key, T2 must restart at ts > 10
to maintain serializability. The retry is automatic but visible to the
client as 40001 serialization failure β apps must handle
retry-on-serialization-error.
-- Recommended app pattern: retry loop
DECLARE
retries INT := 0;
BEGIN
LOOP
BEGIN
-- ... transaction body ...
COMMIT;
EXIT;
EXCEPTION WHEN sqlstate '40001' THEN
retries := retries + 1;
IF retries > 3 THEN RAISE; END IF;
PERFORM pg_sleep(0.01 * power(2, retries));
END;
END LOOP;
END; Distributed SQL (DistSQL)
The SQL planner produces a physical plan as a graph of processors that can run on any node. Filters, aggregates, and joins execute close to the data β for an aggregate over a 100 GB table, each range's partial aggregate runs on the node holding that range, and the coordinator merges the partial results.
EXPLAIN ANALYZE (DISTSQL) SELECT region, COUNT(*)
FROM rides WHERE start_time > '2025-01-01'
GROUP BY region;
β’ group (streaming) β coordinator merges
βββ β’ render
βββ β’ table reader β runs on every range's leaseholder
filter: start_time > '2025-01-01'
ranges: 117, 118, 119, 120
parallel: 4 nodes Hash joins distribute by join key β both inputs are reshuffled across the cluster so each node handles a hash partition of the keyspace. Look up joins (the inverse of hash joins, used when one side is small or indexed) avoid the shuffle. The planner picks based on estimated cardinalities from the optimizer's stats.
Multi-Region & Follow-the-Workload
Cockroach's killer feature is multi-region: declare regions and a survival goal, and the cluster places replicas accordingly.
-- Set up a multi-region database
ALTER DATABASE shop PRIMARY REGION "us-east-1";
ALTER DATABASE shop ADD REGION "eu-west-1";
ALTER DATABASE shop ADD REGION "ap-south-1";
ALTER DATABASE shop SURVIVE REGION FAILURE; -- needs >=3 regions
-- Per-table placement
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;
-- Each row's home region is set by the gateway that wrote it,
-- replicas held in nearby regions
ALTER TABLE products SET LOCALITY GLOBAL;
-- Replicas in every region; reads are fast everywhere via
-- "non-blocking transactions" (writes wait one max_offset interval)
ALTER TABLE config SET LOCALITY REGIONAL BY TABLE IN "us-east-1";
-- Whole table pinned to one region, fast in-region writes Follow-the-workload. When a range's reads are dominated
by one locality, the cluster transfers the Raft lease (the leaseholder)
there so reads don't need a cross-region hop. This happens
automatically based on per-range read counters, but can be guided with
lease_preferences.
Follower reads. Reads at a slightly stale timestamp (default 4.8 seconds) can be served from any replica, not just the leaseholder. The cluster maintains a "closed timestamp" β a watermark below which all writes are guaranteed durable β and follower reads use timestamps below that watermark. This makes "eventually consistent" reads single-digit milliseconds even across regions.
Change Data Capture (CDC)
Change feeds emit row-level changes to Kafka, Pulsar, or webhooks. They're built on the same closed-timestamp mechanism as follower reads β every range has a "rangefeed" that emits each MVCC value as it becomes durable, with periodic resolved timestamps so consumers know when they've seen everything up to time T.
CREATE CHANGEFEED FOR TABLE orders, payments
INTO 'kafka://broker:9092'
WITH updated, resolved = '10s', envelope = wrapped;
-- Resolved messages let consumers do exact-once-effective processing:
-- once you see resolved=T, all rows with mvcc_ts <= T have been emitted
{ "resolved": "1714568500123456789.0" }
{ "after": { "id": 42, "total": 99 }, "updated": "1714568501000000000.0" } Tradeoffs & When To Use
CockroachDB vs Spanner
| CockroachDB | Spanner | |
|---|---|---|
| Hardware | Commodity (any cloud, on-prem) | Google's TrueTime infra (atomic clocks + GPS) |
| Clock | HLC, configurable max_offset (default 500ms) | TrueTime, ~7ms uncertainty bound |
| Wire protocol | PostgreSQL | Spanner gRPC (and Cloud Spanner SQL) |
| Range size | 512 MiB | ~10 GB (Spanner "split") |
| Storage | Pebble (Go LSM) | Colossus + custom |
| Default isolation | Serializable (read committed since 23.1) | Strong / external consistency |
| Read latency, in-region | 1-3ms | ~3-5ms typical |
| Operating model | Self-hosted (and Cloud) | Managed only (Google Cloud) |
FAQ
Is CockroachDB really compatible with PostgreSQL?
It speaks the PostgreSQL wire protocol and reuses pgwire client libraries unchanged. The SQL grammar covers most of standard PostgreSQL plus a lot of pg-specific syntax (RETURNING, ON CONFLICT, generated columns, JSONB, common table expressions). What's missing or different: stored procedures (limited support since 22.1), custom types, listen/notify, certain pg_catalog views, and some isolation level behaviors. The honest answer is 'enough that ORMs work; not enough that every advanced query you copy-paste from a Postgres blog will work.' For application code, it's largely transparent. For DBAs, the toolchain is its own thing.
How does CockroachDB compare to Spanner?
Spanner is the conceptual ancestor β Cockroach explicitly took the multi-active-replica + serializable + range-based architecture from the Spanner paper. The single huge difference is Spanner uses TrueTime (atomic clocks + GPS in Google datacenters) to bound clock uncertainty to ~7ms, which lets it return results immediately for most reads. Cockroach can't assume special hardware, so it uses Hybrid Logical Clocks (HLC) with a configurable max_offset (default 500ms) and pays a 'commit wait' or 'read uncertainty restart' penalty when transactions land within that window. In practice Cockroach commits fast (1-3ms in-DC) but tail latency under high contention is worse than Spanner's.
What is a Range and how big is it?
A Range is the unit of replication and rebalancing. It owns a contiguous span of the key space (e.g., from /Table/53/1/100 to /Table/53/1/4742) and is replicated to 3 (default) or 5 nodes via Raft. Default target size is 512 MiB; when a range exceeds that it splits at the median key. When two adjacent small ranges combine to under half the target, they may merge. A 100 TB cluster might hold 200,000+ ranges, each running its own Raft group. The range size is bigger than TiDB's 96 MB on purpose: fewer ranges means less Raft overhead per node, at the cost of coarser load balancing.
How does serializable isolation work without locks?
Cockroach uses MVCC + a timestamp cache + write intents. Every transaction starts with a timestamp from HLC. Reads check whether any concurrent transaction has 'intent' (uncommitted writes) on the keys they touch β if so they may need to wait or restart. Writes leave intents on every key they touch. On commit, intents are resolved to committed values atomically. The serializable property is enforced by detecting cycles in the transaction dependency graph: if T1 reads what T2 writes and T2 reads what T1 writes (with conflicting timestamps), one of them aborts and retries. The abort + retry is why pessimistic-locking workloads need the SELECT FOR UPDATE escape hatch.
What's the Hybrid Logical Clock for?
HLC gives every event a (physical_time, logical_count) pair that totally orders events across the cluster while staying close to wall-clock time. When a node receives a higher timestamp, it bumps its physical to match (within max_offset) and increments logical. This means a transaction that starts at HLC time 100 and reads a value written at HLC time 105 (because the writer's clock was ahead) will see uncertainty: did this write happen before or after I started? Cockroach handles this by either restarting the read with a higher timestamp or, if the timestamp falls within max_offset, waiting it out (commit wait). HLC is what Cockroach uses instead of TrueTime β same semantics, no atomic clocks required.
Can CockroachDB do multi-region?
Yes β multi-region is one of its main selling points. You declare a database's regions and survival goal (REGION or ZONE failure), then per-table place rows with REGIONAL BY ROW (rows live in the region of the gateway that wrote them, replicas in nearby regions for survivability) or REGIONAL BY TABLE (whole table pinned to one region) or GLOBAL (replicas everywhere, fast reads anywhere, slow writes). Under the hood this is configured via zone configs that constrain replica placement. The latency math is honest: fast writes to your home region, slow writes to remote regions, and follower reads (at a slightly stale timestamp) for fast reads from anywhere.
When does CockroachDB struggle?
Three patterns: (1) High-contention workloads on hot keys β the abort-restart loop in serializable can stampede. SELECT FOR UPDATE or explicit row-level locks help, but Cockroach is honest that this is a weakness vs. lock-based systems. (2) Bulk inserts into a monotonic primary key β every write hits the same range until it splits, and the SQL layer has hash-sharded indexes to mitigate. (3) Single-row OLTP at extreme throughput β Postgres on local NVMe will commit faster than any distributed transaction can. Cockroach pays a 1-3ms latency tax per commit; for workloads where that matters, it's the wrong tool.