Spanner Internals

Spanner is Google's globally distributed SQL database — the system that underpins AdWords, Gmail metadata, and Cloud Spanner. It is the rare distributed database that offers serializable transactions across continents while still serving reads in single-digit milliseconds locally. The two ideas that make this possible are TrueTime (a clock API that exposes uncertainty rather than hiding it) and a hierarchy of Paxos groups, each replicating a slice of data across data centers. Above that sits an SQL frontend descended from F1, the engine Google built to retire its sharded MySQL fleet.

Based on the 2012 Spanner OSDI paper, the 2017 SIGMOD update, the 2013 F1 paper, and the public Cloud Spanner documentation.

Why Spanner Exists

The Gap
Google ran AdWords on a sharded MySQL fleet that had grown ungovernable: schema migrations took weeks, resharding was a multi-quarter project, and cross-shard transactions were impossible. The classic CAP-tax forced the rest of the industry into eventual consistency, but Google's accounting and ad-billing systems could not compromise on serializability.
The Insight
If you treat clock uncertainty as a first-class concept — a tight bound on the gap between true physical time and what a server thinks the time is — you can assign globally consistent timestamps to transactions. With those timestamps, you get serializability across continents using only commit-wait, not extra rounds of consensus.
The Result
A SQL database that scales to exabytes across hundreds of data centers, runs schema changes online with no locking, supports transactions that span the planet, and gives reads serializable views as of any timestamp. Cloud Spanner exposes this directly to GCP customers since 2017.

Spanner Architecture

Client / SQL frontend (F1 query compiler, transaction coordinator) Pushdowns • DML • online schema change manager Placement Driver — splits, moves, balances tablets across zones Paxos Group A tablets [users 0-N] Replica Zone 1 (Leader) Replica Zone 2 Replica Zone 3 Paxos quorum log Paxos Group B tablets [orders 0-M] Replica Zone 1 Replica Zone 2 (Leader) Replica Zone 3 Paxos quorum log Paxos Group C tablets [products] Replica Zone 1 Replica Zone 2 Replica Zone 3 (Leader) Paxos quorum log 2PC across leaders for cross-group transactions • TrueTime supplies the commit timestamp

Key Numbers

TrueTime ε bound
~7 ms
Replication
Paxos quorum
Read replicas
3+ (typical)
Cross-zone p99 commit
~50-100 ms
Local-zone read
~5 ms
Schema change
Online (no locks)
Max DB size
Petabytes+

TrueTime — Time as a Bounded Interval

Every distributed database has to deal with clock skew. Most pretend it doesn't exist, then patch over the symptoms with vector clocks, hybrid logical clocks, or last-write-wins. Spanner took the opposite path: expose the uncertainty.

The TrueTime API has three calls:

TT.now()    → returns interval [earliest, latest]   // truth lies somewhere inside
TT.after(t) → true if t is definitely in the past
TT.before(t)→ true if t is definitely in the future

Under the hood, TrueTime is implemented by a fleet of GPS receivers and atomic clocks in every data center, with a daemon on every machine that polls them and computes a tight bound on its own clock's drift. The half-width of the interval, ε ("epsilon"), is typically around 4 ms on average and bounded under 7 ms even in worst case. By making clock uncertainty explicit and small, Spanner can use timestamps as the global ordering mechanism.

The trick is the commit-wait protocol: when a transaction commits at timestamp s, the leader waits until TT.after(s) returns true before acknowledging the commit to the client. That guarantees no concurrent transaction can be assigned a later timestamp than s while still appearing earlier in real time. Total wait is roughly 2ε ≈ 7 ms — the cost of external consistency.

Paxos Groups and the Replication Layer

Data in Spanner is partitioned into tablets (key ranges within a table). Each tablet is owned by a Paxos group — a set of replicas in different zones running Multi-Paxos with a stable leader. The leader serializes writes, the replicas vote, and a write commits once a majority quorum has logged it. Stable leadership (vs naive single-decree Paxos) means the leader can pipeline writes without re-electing every round.

Reads have three flavors that reflect the consistency dial:

Because every replica's log advances monotonically and TrueTime gives each tablet's leader a safe time (a timestamp below which all transactions have committed), any replica can answer a read at timestamp s as long as s ≤ safe_time. This is what makes Spanner reads cheap in the common case despite being a globally consistent system.

Two-Phase Commit Across Paxos Groups

A transaction that touches a single Paxos group commits in one Paxos round. Cross-group transactions (the harder case) use two-phase commit on top of Paxos:

  1. The client picks one participant leader as the coordinator.
  2. Each participant leader acquires read/write locks on the keys it touches and replicates a prepare record through its Paxos group. Once the prepare is durable, it sends a prepare-ack to the coordinator with a prepare timestamp.
  3. The coordinator picks a commit timestamp ≥ all prepare timestamps and ≥ TT.now().latest, then replicates the commit decision through its own Paxos group.
  4. The coordinator waits out commit-wait, then notifies participants. Each participant replicates the commit through its Paxos group and releases locks.

The key correctness property: because each participant chose a prepare timestamp ≥ its safe time and the coordinator chose a commit timestamp ≥ all prepare timestamps, the commit timestamp lies in the future of every event the transaction observed. Combined with commit-wait, this gives external consistency: if transaction T1 commits before T2 starts in real time, T1's commit timestamp is strictly less than T2's. No external database achieves this property without atomic clocks or some equivalent.

Interleaved Tables and Locality

A naive sharded SQL system pays for joins between parent and child rows that live on different shards. Spanner exposes interleaved tables as a first-class concept: child rows are stored physically adjacent to their parent.

CREATE TABLE Customers (
  customer_id  INT64 NOT NULL,
  name         STRING(MAX),
  email        STRING(MAX),
) PRIMARY KEY (customer_id);

CREATE TABLE Orders (
  customer_id  INT64 NOT NULL,
  order_id     INT64 NOT NULL,
  total        NUMERIC,
  status       STRING(20),
) PRIMARY KEY (customer_id, order_id),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE;

The Orders rows are physically clustered by customer_id with the matching Customer row. A query like "all orders for customer 42" reads contiguous bytes from one tablet. A transaction that updates a customer and inserts an order touches a single Paxos group, avoiding 2PC entirely. Cascading deletes are equally local. This is the SQL analog of DynamoDB's PK/SK colocation, but expressed as a foreign-key relationship rather than as a key prefix convention.

The placement driver respects this: tablets for an interleaved hierarchy stay together as the key range splits. A "directory" — Spanner's name for an interleaved hierarchy under one root key — is the unit of placement and movement.

Schema Changes Without Locks

Most distributed databases lock or rebuild tables for non-trivial schema changes. Spanner's schema evolution protocol uses a future-dated DDL timestamp:

  1. The DDL operation is assigned a timestamp t safely in the future (e.g. now + 30 s).
  2. The schema change is broadcast to every Paxos group leader. Each acknowledges that it has received the new schema definition.
  3. Once all leaders have ack'd, the change becomes effective at t: any transaction with timestamp ≥ t sees the new schema, anything earlier sees the old.
  4. For schema changes that need data movement (column type widening with a default, index creation), a background populator works in the new schema's view while the old schema continues to serve reads.

The result is online ALTER for everything: add columns, drop columns, create indexes, change types, all with no impact on running queries. A new index becomes visible to queries only after the populator is fully caught up; until then, planner statistics ignore it.

The SQL Frontend — F1's Legacy

F1, described in the 2013 paper, is the SQL layer Google built on top of Spanner to retire its sharded MySQL fleet for AdWords. Modern Cloud Spanner inherits much of F1's design:

-- Cross-table aggregate that runs locally on each tablet
SELECT
  c.country,
  COUNT(*) AS order_count,
  SUM(o.total) AS revenue
FROM Customers c
JOIN Orders o USING (customer_id)
WHERE o.created_at > '2025-04-01'
GROUP BY c.country;

Because Customers and Orders are interleaved, the join executes on each tablet that holds a directory; each tablet emits a partial aggregate, the coordinator merges them. The plan is what you'd expect from any decent MPP planner, but the storage layer makes it cheap.

Strengths and Tradeoffs

Spanner buys you serializability + global distribution + SQL — a combination almost no other system offers. The cost is non-trivial:

Spanner vs CockroachDB vs TiDB

SpannerCockroachDBTiDB
OperatorManaged (GCP only)Self-host or CloudSelf-host or PingCAP Cloud
Time sourceTrueTime (GPS + atomic)HLC (hybrid logical clock)TSO (single oracle service)
ConsistencyExternal / strict serializableSerializable (no real-time)Snapshot isolation
ReplicationMulti-Paxos per groupRaft per rangeRaft per region (TiKV)
SQL dialectGoogleSQL + PostgreSQL dialectPostgreSQL wire protocolMySQL wire protocol
Schema changesOnline, no locksOnline, mostly no locksOnline, mostly no locks
Open sourceNoBSL → Apache after delayApache 2.0
Best forGlobal, mission-critical OLTPSelf-hosted geo-distributed SQLMySQL drop-in at scale

The clock is what separates them. Without atomic clocks, CockroachDB and TiDB cannot offer external consistency without extra coordination. CockroachDB compensates with retry-based conflict detection; TiDB uses a placement driver as a global timestamp oracle. Both work, but their consistency stories are weaker than Spanner's by definition.

Frequently Asked Questions

Do I really need atomic clocks to run something like Spanner?

To get external consistency as Spanner defines it (real-time linearizability across all reads and writes globally), yes — or some equivalent of bounded clock uncertainty. CockroachDB uses HLCs and gets serializability without the real-time guarantee, which is sufficient for most applications. AWS's Time Sync service has narrowed the gap by exposing bounded-error PTP clocks, and there's research and prototypes around using these for Spanner-like semantics outside Google. For most workloads, plain serializability is enough.

How does Spanner handle a leader failure?

Each Paxos group runs lease-based leader election; if the leader's lease expires (because it crashed or was partitioned away), a follower campaigns and the quorum elects a new leader. The interruption is typically a few seconds — long enough to be noticed by latency-sensitive clients but short enough that connections retry through it. In-flight transactions on the old leader are aborted and must be retried.

What's the practical write throughput per Spanner instance?

Each "processing unit" or node is rated for around 10K QPS of writes, but the real number depends heavily on transaction shape (single-group vs multi-group), key distribution, and replication topology (regional vs multi-regional). Plan with conservative numbers and load test. Hot rows are the most common cause of disappointing write performance.

How do I choose between regional and multi-regional Spanner?

Regional instances place all replicas in one region (three zones), giving low write latency (~10-30 ms) but no protection against regional outages. Multi-regional instances replicate across at least three regions for read-heavy global workloads, with write latency in the 100+ ms range. Read-only replicas in additional regions cut read latency without paying the full multi-regional write cost.

Can Spanner do JSON / document-style data?

Yes — Cloud Spanner has a JSON column type with indexing on extracted paths, similar to PostgreSQL's JSONB. It's suitable for sparse attributes attached to relational rows, but it's not a replacement for a true document database. For document workloads on GCP, Firestore is the dedicated answer.

What are change streams and how are they different from triggers?

Change streams are an asynchronous CDC API that emits committed mutations in commit-timestamp order. They scale by partition (one stream per tablet group). Unlike SQL triggers, they fire outside the committing transaction and can drop or delay messages without blocking writes. The standard pattern is to consume them with Dataflow or a Cloud Function and propagate changes to BigQuery, Pub/Sub, or another system.

Why are my Spanner queries returning ABORTED?

Spanner uses pessimistic locks plus deadlock detection in read-write transactions. When two transactions conflict, one is aborted and must be retried by the client. The Cloud Spanner client libraries do this automatically with exponential backoff. If you see frequent aborts, the cause is usually contention on a hot row (which you can shard at the application level) or transactions that hold locks too long because they make external RPCs while holding them.