Spanner Internals
Spanner is Google's globally distributed SQL database — the system that underpins AdWords, Gmail metadata, and Cloud Spanner. It is the rare distributed database that offers serializable transactions across continents while still serving reads in single-digit milliseconds locally. The two ideas that make this possible are TrueTime (a clock API that exposes uncertainty rather than hiding it) and a hierarchy of Paxos groups, each replicating a slice of data across data centers. Above that sits an SQL frontend descended from F1, the engine Google built to retire its sharded MySQL fleet.
Based on the 2012 Spanner OSDI paper, the 2017 SIGMOD update, the 2013 F1 paper, and the public Cloud Spanner documentation.
Why Spanner Exists
Spanner Architecture
Key Numbers
TrueTime — Time as a Bounded Interval
Every distributed database has to deal with clock skew. Most pretend it doesn't exist, then patch over the symptoms with vector clocks, hybrid logical clocks, or last-write-wins. Spanner took the opposite path: expose the uncertainty.
The TrueTime API has three calls:
TT.now() → returns interval [earliest, latest] // truth lies somewhere inside
TT.after(t) → true if t is definitely in the past
TT.before(t)→ true if t is definitely in the future Under the hood, TrueTime is implemented by a fleet of GPS receivers and atomic clocks in every data center, with a daemon on every machine that polls them and computes a tight bound on its own clock's drift. The half-width of the interval, ε ("epsilon"), is typically around 4 ms on average and bounded under 7 ms even in worst case. By making clock uncertainty explicit and small, Spanner can use timestamps as the global ordering mechanism.
The trick is the commit-wait protocol: when a transaction commits at timestamp
s, the leader waits until TT.after(s) returns true before acknowledging
the commit to the client. That guarantees no concurrent transaction can be assigned a later
timestamp than s while still appearing earlier in real time. Total wait is roughly
2ε ≈ 7 ms — the cost of external consistency.
Paxos Groups and the Replication Layer
Data in Spanner is partitioned into tablets (key ranges within a table). Each tablet is owned by a Paxos group — a set of replicas in different zones running Multi-Paxos with a stable leader. The leader serializes writes, the replicas vote, and a write commits once a majority quorum has logged it. Stable leadership (vs naive single-decree Paxos) means the leader can pipeline writes without re-electing every round.
Reads have three flavors that reflect the consistency dial:
- Strong reads — Routed to the leader; serve the latest committed state. Cost: cross-zone RTT to the leader.
- Read-only transactions — Pick a timestamp s in the recent past and serve from any replica caught up to s. No locks, no leader contention.
- Stale reads — Explicitly bounded staleness (e.g. "as of 5 seconds ago"), always servable locally with no coordination. The cheapest read mode.
Because every replica's log advances monotonically and TrueTime gives each tablet's leader a safe time (a timestamp below which all transactions have committed), any replica can answer a read at timestamp s as long as s ≤ safe_time. This is what makes Spanner reads cheap in the common case despite being a globally consistent system.
Two-Phase Commit Across Paxos Groups
A transaction that touches a single Paxos group commits in one Paxos round. Cross-group transactions (the harder case) use two-phase commit on top of Paxos:
- The client picks one participant leader as the coordinator.
- Each participant leader acquires read/write locks on the keys it touches and replicates a prepare record through its Paxos group. Once the prepare is durable, it sends a prepare-ack to the coordinator with a prepare timestamp.
- The coordinator picks a commit timestamp ≥ all prepare timestamps and ≥
TT.now().latest, then replicates the commit decision through its own Paxos group. - The coordinator waits out commit-wait, then notifies participants. Each participant replicates the commit through its Paxos group and releases locks.
The key correctness property: because each participant chose a prepare timestamp ≥ its safe time and the coordinator chose a commit timestamp ≥ all prepare timestamps, the commit timestamp lies in the future of every event the transaction observed. Combined with commit-wait, this gives external consistency: if transaction T1 commits before T2 starts in real time, T1's commit timestamp is strictly less than T2's. No external database achieves this property without atomic clocks or some equivalent.
Interleaved Tables and Locality
A naive sharded SQL system pays for joins between parent and child rows that live on different shards. Spanner exposes interleaved tables as a first-class concept: child rows are stored physically adjacent to their parent.
CREATE TABLE Customers (
customer_id INT64 NOT NULL,
name STRING(MAX),
email STRING(MAX),
) PRIMARY KEY (customer_id);
CREATE TABLE Orders (
customer_id INT64 NOT NULL,
order_id INT64 NOT NULL,
total NUMERIC,
status STRING(20),
) PRIMARY KEY (customer_id, order_id),
INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
The Orders rows are physically clustered by customer_id with the matching Customer
row. A query like "all orders for customer 42" reads contiguous bytes from one tablet. A
transaction that updates a customer and inserts an order touches a single Paxos group, avoiding
2PC entirely. Cascading deletes are equally local. This is the SQL analog of DynamoDB's PK/SK
colocation, but expressed as a foreign-key relationship rather than as a key prefix convention.
The placement driver respects this: tablets for an interleaved hierarchy stay together as the key range splits. A "directory" — Spanner's name for an interleaved hierarchy under one root key — is the unit of placement and movement.
Schema Changes Without Locks
Most distributed databases lock or rebuild tables for non-trivial schema changes. Spanner's schema evolution protocol uses a future-dated DDL timestamp:
- The DDL operation is assigned a timestamp t safely in the future (e.g. now + 30 s).
- The schema change is broadcast to every Paxos group leader. Each acknowledges that it has received the new schema definition.
- Once all leaders have ack'd, the change becomes effective at t: any transaction with timestamp ≥ t sees the new schema, anything earlier sees the old.
- For schema changes that need data movement (column type widening with a default, index creation), a background populator works in the new schema's view while the old schema continues to serve reads.
The result is online ALTER for everything: add columns, drop columns, create indexes, change types, all with no impact on running queries. A new index becomes visible to queries only after the populator is fully caught up; until then, planner statistics ignore it.
The SQL Frontend — F1's Legacy
F1, described in the 2013 paper, is the SQL layer Google built on top of Spanner to retire its sharded MySQL fleet for AdWords. Modern Cloud Spanner inherits much of F1's design:
- Pushdowns — Predicates, projections, and even partial aggregations are pushed to tablet servers so the SQL coordinator only sees pre-filtered, pre-aggregated data.
- Distributed query execution — Joins are planned with locality in mind (interleaved tables → local hash-merge), shuffles use streaming RPCs across tablet leaders.
- Change streams — Cloud Spanner exposes a CDC API (Spanner Change Streams) that streams committed mutations in commit-timestamp order, similar in spirit to DynamoDB Streams or PostgreSQL logical replication.
- Optimistic + pessimistic concurrency — Read-write transactions take pessimistic locks; read-only transactions never block writers because they read at a chosen past timestamp.
-- Cross-table aggregate that runs locally on each tablet
SELECT
c.country,
COUNT(*) AS order_count,
SUM(o.total) AS revenue
FROM Customers c
JOIN Orders o USING (customer_id)
WHERE o.created_at > '2025-04-01'
GROUP BY c.country; Because Customers and Orders are interleaved, the join executes on each tablet that holds a directory; each tablet emits a partial aggregate, the coordinator merges them. The plan is what you'd expect from any decent MPP planner, but the storage layer makes it cheap.
Strengths and Tradeoffs
Spanner buys you serializability + global distribution + SQL — a combination almost no other system offers. The cost is non-trivial:
- Write latency floor — Cross-zone Paxos commit is ~30-50 ms. Cross-region commits are 100-200 ms. You cannot beat the speed of light, and Spanner does not pretend to. For sub-millisecond write workloads, use a single-region OLTP database.
- Cost — Cloud Spanner pricing scales with provisioned compute capacity (in "processing units" or "nodes"). Even a small instance is hundreds of dollars per month; production instances run into thousands.
- Vendor lock-in — Spanner is GCP-only. The closest open-source analogs are CockroachDB and TiDB, but neither replicates TrueTime's external-consistency guarantees without atomic clocks. They use HLCs and slightly weaker consistency.
- Schema rigidity — Despite online schema changes, you still have a relational schema. Document-shaped data fits awkwardly. Use Firestore or Spanner's JSON column type for flexible attributes.
- Hot rows — A single key under heavy contention bottlenecks on the leader replica. Application-level batching, sharding row counters, or interleaved tables can help.
Spanner vs CockroachDB vs TiDB
| Spanner | CockroachDB | TiDB | |
|---|---|---|---|
| Operator | Managed (GCP only) | Self-host or Cloud | Self-host or PingCAP Cloud |
| Time source | TrueTime (GPS + atomic) | HLC (hybrid logical clock) | TSO (single oracle service) |
| Consistency | External / strict serializable | Serializable (no real-time) | Snapshot isolation |
| Replication | Multi-Paxos per group | Raft per range | Raft per region (TiKV) |
| SQL dialect | GoogleSQL + PostgreSQL dialect | PostgreSQL wire protocol | MySQL wire protocol |
| Schema changes | Online, no locks | Online, mostly no locks | Online, mostly no locks |
| Open source | No | BSL → Apache after delay | Apache 2.0 |
| Best for | Global, mission-critical OLTP | Self-hosted geo-distributed SQL | MySQL drop-in at scale |
The clock is what separates them. Without atomic clocks, CockroachDB and TiDB cannot offer external consistency without extra coordination. CockroachDB compensates with retry-based conflict detection; TiDB uses a placement driver as a global timestamp oracle. Both work, but their consistency stories are weaker than Spanner's by definition.
Frequently Asked Questions
Do I really need atomic clocks to run something like Spanner?
To get external consistency as Spanner defines it (real-time linearizability across all reads and writes globally), yes — or some equivalent of bounded clock uncertainty. CockroachDB uses HLCs and gets serializability without the real-time guarantee, which is sufficient for most applications. AWS's Time Sync service has narrowed the gap by exposing bounded-error PTP clocks, and there's research and prototypes around using these for Spanner-like semantics outside Google. For most workloads, plain serializability is enough.
How does Spanner handle a leader failure?
Each Paxos group runs lease-based leader election; if the leader's lease expires (because it crashed or was partitioned away), a follower campaigns and the quorum elects a new leader. The interruption is typically a few seconds — long enough to be noticed by latency-sensitive clients but short enough that connections retry through it. In-flight transactions on the old leader are aborted and must be retried.
What's the practical write throughput per Spanner instance?
Each "processing unit" or node is rated for around 10K QPS of writes, but the real number depends heavily on transaction shape (single-group vs multi-group), key distribution, and replication topology (regional vs multi-regional). Plan with conservative numbers and load test. Hot rows are the most common cause of disappointing write performance.
How do I choose between regional and multi-regional Spanner?
Regional instances place all replicas in one region (three zones), giving low write latency (~10-30 ms) but no protection against regional outages. Multi-regional instances replicate across at least three regions for read-heavy global workloads, with write latency in the 100+ ms range. Read-only replicas in additional regions cut read latency without paying the full multi-regional write cost.
Can Spanner do JSON / document-style data?
Yes — Cloud Spanner has a JSON column type with indexing on extracted paths, similar to PostgreSQL's JSONB. It's suitable for sparse attributes attached to relational rows, but it's not a replacement for a true document database. For document workloads on GCP, Firestore is the dedicated answer.
What are change streams and how are they different from triggers?
Change streams are an asynchronous CDC API that emits committed mutations in commit-timestamp order. They scale by partition (one stream per tablet group). Unlike SQL triggers, they fire outside the committing transaction and can drop or delay messages without blocking writes. The standard pattern is to consume them with Dataflow or a Cloud Function and propagate changes to BigQuery, Pub/Sub, or another system.
Why are my Spanner queries returning ABORTED?
Spanner uses pessimistic locks plus deadlock detection in read-write transactions. When two transactions conflict, one is aborted and must be retried by the client. The Cloud Spanner client libraries do this automatically with exponential backoff. If you see frequent aborts, the cause is usually contention on a hot row (which you can shard at the application level) or transactions that hold locks too long because they make external RPCs while holding them.