CockroachDB Internals

CockroachDB takes the architecture from Google's Spanner paper, drops the requirement for atomic clocks, and ships it as a single Go binary that speaks PostgreSQL on the wire. Every node is symmetric β€” there is no coordinator, no separate SQL/storage tier, no leader for the cluster as a whole. Data is range-sharded into 512 MiB chunks, each one its own Raft group; transactions are serializable by default, multi-key, and use a Hybrid Logical Clock to order events across the cluster without needing special hardware. This page dissects how those primitives fit together β€” the storage engine (Pebble), the consensus layer (Raft per range), the transaction model (MVCC + intents + parallel commits), the distributed SQL executor (DistSQL), and the multi-region placement system.

Architecture Overview

CLIENT pgwire driver CLUSTER (every node identical) Node A (us-east) SQL layer (parser β€’ optimizer β€’ DistSQL) Transaction coordinator (HLC) KV layer (range routing β€’ Raft client) Pebble (LSM, MVCC keys) Range 1L Range 5F Range 8F Range 12L Range 15F L=Raft leader F=follower Node B (us-east) SQL layer Transaction coordinator KV layer Pebble Range 1F Range 5L Range 8F Range 12F Range 15L Node C (us-west) SQL layer Transaction coordinator KV layer Pebble Range 1F Range 5F Range 8L Range 12F Range 15F

Every node runs the full stack: SQL layer, transaction coordinator, KV router, and Pebble storage. There is no role distinction. Ranges are replicated 3-way by default; the leader serves writes and (by default) consistent reads. Gossip discovers the cluster topology.

Key Numbers

Range Size
512 MiB
Default Replication
3
Storage Engine
Pebble (Go)
Wire Protocol
PostgreSQL
Default Iso
Serializable
Max Clock Offset
500 ms
Clock Source
HLC

Why CockroachDB Exists

The Gap
Spanner showed that distributed serializable SQL was possible β€” but only inside Google. Outside, you picked between PostgreSQL (single primary, sharding bolt-ons), eventually consistent NoSQL (lose ACID), or expensive proprietary systems. There was no open-source horizontally scalable SQL DB that survived a region failure without losing committed data.
The Insight
You don't actually need TrueTime β€” you can substitute Hybrid Logical Clocks with a configurable max_offset and pay a small latency tax (commit wait or restart) when transactions land within that window. Combine HLC with range-based sharding, Raft per range, MVCC + intents, and you get serializable distributed SQL on commodity hardware. Make every node identical so ops is symmetric.
The Result
Drop-in PostgreSQL replacement (mostly) with three-region survivability, automatic rebalancing, no operational asymmetry, and serializable transactions by default. The cost is a small latency tax on every write (HLC commit wait + Raft round trip) and a serious tail-latency penalty under high contention.

Range-Based Sharding

The unit of distribution is the range: a contiguous sorted slice of the global key space, default target 512 MiB, replicated to 3 nodes via Raft. Every range has a Raft leader (which serves writes and consistent reads) and 2-4 followers.

-- Inspect range distribution for a table
SHOW RANGES FROM TABLE rides;

  start_key | end_key  | range_id | range_size_mb | lease_holder | replicas
  ----------+----------+----------+---------------+--------------+----------
  NULL      | /1042    | 117      | 482.3         | 4            | {2,4,7}
  /1042     | /2189    | 118      | 502.1         | 7            | {2,4,7}
  /2189     | /4317    | 119      | 511.8         | 2            | {1,2,4}
  /4317     | NULL     | 120      | 96.4          | 4            | {2,4,7}

-- Force a split (rare, mostly automatic)
ALTER TABLE rides SPLIT AT VALUES (5000), (10000);

-- Pin replicas geographically
ALTER TABLE rides CONFIGURE ZONE USING
  num_replicas = 5,
  constraints = '{+region=us-east: 2, +region=eu-west: 2, +region=ap-south: 1}',
  lease_preferences = '[[+region=us-east]]';

Splits happen on size threshold or on demand (e.g., load-based splitting when a hot range is detected). Merges happen when adjacent ranges shrink below half the target. The split decision uses approximate quantiles built from periodic samples of the range's keys.

Pebble: The Storage Engine

Pebble is a Go-native LSM storage engine, written by the Cockroach team as a drop-in replacement for RocksDB (which they used until 20.1). It keeps the same SST file format and compaction strategies, but eliminates the cgo overhead of the C++ RocksDB bindings β€” Pebble runs in the same Go runtime as the rest of cockroach.

Cockroach stores everything as MVCC keys: every logical key k becomes k/<timestamp> on disk. The timestamp is the HLC commit timestamp, encoded so newer versions sort first. A point read is "find the first version <= read_ts that isn't a tombstone."

// Logical write
INSERT INTO rides VALUES (42, 'alice', '2025-04-01');

// Physical KV in Pebble (simplified)
/Table/52/1/42/0/1714568400123456789.0   β†’  alice, 2025-04-01    // committed value
/Table/52/1/42/0/0                       β†’  intent reference     // intent record
                                                                  // points back to txn

// On read at ts=1714568400999999999:
//   scan /Table/52/1/42/0/ in descending ts order
//   first non-intent version <= read_ts wins
//   if intent's txn is committed and its ts <= read_ts, it wins
//   if intent's txn is pending, reader either waits or pushes (forces resolution)

Garbage collection deletes versions older than gc.ttlseconds (default 25 hours, hence the famous "cockroachdb time travel queries" window) by writing range tombstones during compaction.

Hybrid Logical Clocks (HLC)

Every node maintains an HLC: a (physical_time, logical_count) pair. Physical_time tracks the OS clock; logical_count breaks ties when wall-clock granularity isn't enough. On every event:

// Event happens at this node
fn local_event() -> HLC {
  let pt = max(self.physical, system_clock());
  if pt == self.physical { self.logical += 1; } else { self.logical = 0; }
  self.physical = pt;
  HLC { physical: pt, logical: self.logical }
}

// Receive an HLC from another node
fn merge(remote: HLC) -> HLC {
  let pt = max(self.physical, remote.physical, system_clock());
  // ... bump logical to maintain monotonicity
  // FAIL if pt - system_clock() > max_offset (clock skew limit)
}

The max_offset setting (default 500ms) bounds how far HLCs can diverge. If the OS clock jumps backward by more than max_offset, cockroach commits suicide rather than risk inconsistency. NTP must keep clocks within max_offset; in cloud environments this usually means running chrony with PPS-disciplined upstream.

Read uncertainty. When a transaction reads a value whose timestamp is within max_offset of its start time, cockroach can't tell if the value was written before or after the read began. It restarts the read at a higher timestamp (cheap) or, in rare cases for writes, waits out the uncertainty window (commit wait β€” typically a few ms). This is the latency tax that buys you serializability without atomic clocks.

Transactions: Serializable by Default

Cockroach transactions are serializable, multi-key, and span any number of ranges. The protocol uses MVCC + write intents + a transaction record:

Conflicts are detected via the timestamp cache. If T1 wrote at ts=10 and T2 (start_ts=5) tries to read the same key, T2 must restart at ts > 10 to maintain serializability. The retry is automatic but visible to the client as 40001 serialization failure β€” apps must handle retry-on-serialization-error.

-- Recommended app pattern: retry loop
DECLARE
  retries INT := 0;
BEGIN
  LOOP
    BEGIN
      -- ... transaction body ...
      COMMIT;
      EXIT;
    EXCEPTION WHEN sqlstate '40001' THEN
      retries := retries + 1;
      IF retries > 3 THEN RAISE; END IF;
      PERFORM pg_sleep(0.01 * power(2, retries));
    END;
  END LOOP;
END;

Distributed SQL (DistSQL)

The SQL planner produces a physical plan as a graph of processors that can run on any node. Filters, aggregates, and joins execute close to the data β€” for an aggregate over a 100 GB table, each range's partial aggregate runs on the node holding that range, and the coordinator merges the partial results.

EXPLAIN ANALYZE (DISTSQL) SELECT region, COUNT(*)
FROM rides WHERE start_time > '2025-01-01'
GROUP BY region;

  β€’ group (streaming)        β—€ coordinator merges
    └── β€’ render
         └── β€’ table reader  β—€ runs on every range's leaseholder
              filter:  start_time > '2025-01-01'
              ranges:  117, 118, 119, 120
              parallel: 4 nodes

Hash joins distribute by join key β€” both inputs are reshuffled across the cluster so each node handles a hash partition of the keyspace. Look up joins (the inverse of hash joins, used when one side is small or indexed) avoid the shuffle. The planner picks based on estimated cardinalities from the optimizer's stats.

Multi-Region & Follow-the-Workload

Cockroach's killer feature is multi-region: declare regions and a survival goal, and the cluster places replicas accordingly.

-- Set up a multi-region database
ALTER DATABASE shop PRIMARY REGION "us-east-1";
ALTER DATABASE shop ADD REGION "eu-west-1";
ALTER DATABASE shop ADD REGION "ap-south-1";
ALTER DATABASE shop SURVIVE REGION FAILURE;   -- needs >=3 regions

-- Per-table placement
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;
-- Each row's home region is set by the gateway that wrote it,
-- replicas held in nearby regions

ALTER TABLE products SET LOCALITY GLOBAL;
-- Replicas in every region; reads are fast everywhere via
-- "non-blocking transactions" (writes wait one max_offset interval)

ALTER TABLE config SET LOCALITY REGIONAL BY TABLE IN "us-east-1";
-- Whole table pinned to one region, fast in-region writes

Follow-the-workload. When a range's reads are dominated by one locality, the cluster transfers the Raft lease (the leaseholder) there so reads don't need a cross-region hop. This happens automatically based on per-range read counters, but can be guided with lease_preferences.

Follower reads. Reads at a slightly stale timestamp (default 4.8 seconds) can be served from any replica, not just the leaseholder. The cluster maintains a "closed timestamp" β€” a watermark below which all writes are guaranteed durable β€” and follower reads use timestamps below that watermark. This makes "eventually consistent" reads single-digit milliseconds even across regions.

Change Data Capture (CDC)

Change feeds emit row-level changes to Kafka, Pulsar, or webhooks. They're built on the same closed-timestamp mechanism as follower reads β€” every range has a "rangefeed" that emits each MVCC value as it becomes durable, with periodic resolved timestamps so consumers know when they've seen everything up to time T.

CREATE CHANGEFEED FOR TABLE orders, payments
INTO 'kafka://broker:9092'
WITH updated, resolved = '10s', envelope = wrapped;

-- Resolved messages let consumers do exact-once-effective processing:
-- once you see resolved=T, all rows with mvcc_ts <= T have been emitted
{ "resolved": "1714568500123456789.0" }
{ "after": { "id": 42, "total": 99 }, "updated": "1714568501000000000.0" }

Tradeoffs & When To Use

Use Cockroach when
You need PostgreSQL semantics with horizontal scale, region survivability, and serializable correctness. SaaS multi-tenant where REGIONAL BY ROW pins customer data. Financial ledgers where serializable isolation is non-negotiable. Apps that have outgrown Postgres but can't move to a NoSQL data model.
Avoid Cockroach when
Your workload fits on one beefy Postgres node β€” the latency tax is real. High-contention OLTP on hot keys (the abort/restart loop hurts). Workloads that need stored procedures, triggers, custom types, or heavy use of pg_catalog. Pure analytics β€” there's no columnar engine; use a warehouse instead.
Operational gotchas
Clock skew > max_offset causes node suicide. Hot key writes serialize through one Raft leader. SELECT FOR UPDATE is needed for pessimistic patterns. Bulk loads need IMPORT INTO, not row-by-row INSERT. License: BSL since 21.2 (free for production up to a point, paid for cross-region commercial use).

CockroachDB vs Spanner

CockroachDBSpanner
HardwareCommodity (any cloud, on-prem)Google's TrueTime infra (atomic clocks + GPS)
ClockHLC, configurable max_offset (default 500ms)TrueTime, ~7ms uncertainty bound
Wire protocolPostgreSQLSpanner gRPC (and Cloud Spanner SQL)
Range size512 MiB~10 GB (Spanner "split")
StoragePebble (Go LSM)Colossus + custom
Default isolationSerializable (read committed since 23.1)Strong / external consistency
Read latency, in-region1-3ms~3-5ms typical
Operating modelSelf-hosted (and Cloud)Managed only (Google Cloud)

FAQ

Is CockroachDB really compatible with PostgreSQL?

It speaks the PostgreSQL wire protocol and reuses pgwire client libraries unchanged. The SQL grammar covers most of standard PostgreSQL plus a lot of pg-specific syntax (RETURNING, ON CONFLICT, generated columns, JSONB, common table expressions). What's missing or different: stored procedures (limited support since 22.1), custom types, listen/notify, certain pg_catalog views, and some isolation level behaviors. The honest answer is 'enough that ORMs work; not enough that every advanced query you copy-paste from a Postgres blog will work.' For application code, it's largely transparent. For DBAs, the toolchain is its own thing.

How does CockroachDB compare to Spanner?

Spanner is the conceptual ancestor β€” Cockroach explicitly took the multi-active-replica + serializable + range-based architecture from the Spanner paper. The single huge difference is Spanner uses TrueTime (atomic clocks + GPS in Google datacenters) to bound clock uncertainty to ~7ms, which lets it return results immediately for most reads. Cockroach can't assume special hardware, so it uses Hybrid Logical Clocks (HLC) with a configurable max_offset (default 500ms) and pays a 'commit wait' or 'read uncertainty restart' penalty when transactions land within that window. In practice Cockroach commits fast (1-3ms in-DC) but tail latency under high contention is worse than Spanner's.

What is a Range and how big is it?

A Range is the unit of replication and rebalancing. It owns a contiguous span of the key space (e.g., from /Table/53/1/100 to /Table/53/1/4742) and is replicated to 3 (default) or 5 nodes via Raft. Default target size is 512 MiB; when a range exceeds that it splits at the median key. When two adjacent small ranges combine to under half the target, they may merge. A 100 TB cluster might hold 200,000+ ranges, each running its own Raft group. The range size is bigger than TiDB's 96 MB on purpose: fewer ranges means less Raft overhead per node, at the cost of coarser load balancing.

How does serializable isolation work without locks?

Cockroach uses MVCC + a timestamp cache + write intents. Every transaction starts with a timestamp from HLC. Reads check whether any concurrent transaction has 'intent' (uncommitted writes) on the keys they touch β€” if so they may need to wait or restart. Writes leave intents on every key they touch. On commit, intents are resolved to committed values atomically. The serializable property is enforced by detecting cycles in the transaction dependency graph: if T1 reads what T2 writes and T2 reads what T1 writes (with conflicting timestamps), one of them aborts and retries. The abort + retry is why pessimistic-locking workloads need the SELECT FOR UPDATE escape hatch.

What's the Hybrid Logical Clock for?

HLC gives every event a (physical_time, logical_count) pair that totally orders events across the cluster while staying close to wall-clock time. When a node receives a higher timestamp, it bumps its physical to match (within max_offset) and increments logical. This means a transaction that starts at HLC time 100 and reads a value written at HLC time 105 (because the writer's clock was ahead) will see uncertainty: did this write happen before or after I started? Cockroach handles this by either restarting the read with a higher timestamp or, if the timestamp falls within max_offset, waiting it out (commit wait). HLC is what Cockroach uses instead of TrueTime β€” same semantics, no atomic clocks required.

Can CockroachDB do multi-region?

Yes β€” multi-region is one of its main selling points. You declare a database's regions and survival goal (REGION or ZONE failure), then per-table place rows with REGIONAL BY ROW (rows live in the region of the gateway that wrote them, replicas in nearby regions for survivability) or REGIONAL BY TABLE (whole table pinned to one region) or GLOBAL (replicas everywhere, fast reads anywhere, slow writes). Under the hood this is configured via zone configs that constrain replica placement. The latency math is honest: fast writes to your home region, slow writes to remote regions, and follower reads (at a slightly stale timestamp) for fast reads from anywhere.

When does CockroachDB struggle?

Three patterns: (1) High-contention workloads on hot keys β€” the abort-restart loop in serializable can stampede. SELECT FOR UPDATE or explicit row-level locks help, but Cockroach is honest that this is a weakness vs. lock-based systems. (2) Bulk inserts into a monotonic primary key β€” every write hits the same range until it splits, and the SQL layer has hash-sharded indexes to mitigate. (3) Single-row OLTP at extreme throughput β€” Postgres on local NVMe will commit faster than any distributed transaction can. Cockroach pays a 1-3ms latency tax per commit; for workloads where that matters, it's the wrong tool.