Paxos

Basic Paxos, Multi-Paxos, and Why Lamport's 1989 Paper Is Still the Reference

Paxos is the original solution to the distributed consensus problem: getting a set of unreliable nodes to agree on a single value despite crashes, message loss, and arbitrary delays. Leslie Lamport described it in 1989 (in a parable about a fictional Greek island), then re-described it more carefully in 1998 ("The Part-Time Parliament"), and once more in 2001 ("Paxos Made Simple") because no one understood the first two. It became the foundation of Google's Chubby, Spanner, and arguably every subsequent consensus protocol — Raft, Zab, EPaxos — is a reaction to Paxos's incomprehensibility, not a fundamental departure.

The Problem and the Roles

Paxos has three roles. In practice, every node plays all three.

Key Numbers

1989

Lamport's original "Part-Time Parliament" tech report (rejected as too whimsical)

2 RTTs

Basic Paxos latency for one decision (Phase 1 + Phase 2)

1 RTT

Multi-Paxos with stable leader, after the first decision

⌊N/2⌋+1

Quorum size for N acceptors (majority)

2N-1

Messages per Paxos round (Phase 1 broadcast + Phase 2 broadcast)

2001

"Paxos Made Simple" — the third explanation that finally stuck

2014

Raft published — the "more understandable" alternative

1. Basic Paxos: The Two-Phase Algorithm

Decide on one value. Phase 1 establishes leadership. Phase 2 commits the value.

Each round has a unique proposal number (a monotonically increasing integer, often (round, server_id) tuple to break ties). The proposer drives the round through two phases:

# PHASE 1: Prepare
proposer → ALL acceptors: PREPARE(n)         # "I want to lead with proposal n"

# Each acceptor:
if n > max_seen_n:
    max_seen_n = n
    return PROMISE(n, accepted_n, accepted_v)
    # "OK, I won't accept anything < n.
    #  Here's the highest-numbered value I've already accepted (if any)."
else:
    return REJECT(max_seen_n)

# Once proposer has PROMISE from majority:

# PHASE 2: Accept
v = highest accepted_v among the PROMISES, or your own desired value
proposer → ALL acceptors: ACCEPT(n, v)       # "Vote to accept (n, v)"

# Each acceptor:
if n >= max_seen_n:
    accepted_n = n
    accepted_v = v
    return ACCEPTED(n, v)

# When proposer has ACCEPTED from majority: value is CHOSEN.
# Proposer broadcasts CHOSEN(v) to learners.

The two key invariants:

P1: An acceptor accepts proposal (n, v) only if it hasn't promised a higher number.
P2: If a proposal (n, v) has been chosen, every higher-numbered proposal must propose the same v. The proposer enforces this by reading the highest accepted value from its prepare quorum.

2. Why Two Phases? The Subtle Reason

One phase isn't enough because two competing proposers might each get majority acceptance for different values. Paxos prevents this with the prepare phase: before proposing v, you must learn what (if anything) was already chosen, and propose that value if it exists.

The brilliant part is what happens during contention. Suppose proposer A has its (n=5, v=X) accepted by acceptors 1, 2, 3 (a majority). Then proposer B starts a new round with n=6:

Acceptor 1: PROMISE(6, accepted=(5, X))
Acceptor 2: PROMISE(6, accepted=(5, X))
Acceptor 4: PROMISE(6, accepted=None)         # B's prepare quorum

# B's quorum overlaps A's commit quorum by majority intersection.
# Therefore at least one acceptor in B's quorum already accepted (5, X).
# B sees (5, X) as the highest accepted, MUST propose v=X.
# Result: even if B "wins" the round, X is still chosen.

proposer B → ACCEPT(6, X)        # forced to keep X

The majority intersection guarantee is the entire safety proof: any two majorities of N nodes must overlap by at least one node, so a learner with a fresher round can always see what was chosen previously.

3. Multi-Paxos: One Decision Wasn't Enough

Real systems decide a sequence of values. Run a separate Paxos instance per slot? Way too slow.

Multi-Paxos optimizes for the common case of a stable leader. Instead of running Phase 1 every decision, the leader runs Phase 1 once (covering an entire range of slots), then Phase 2 alone for each subsequent value:

# Multi-Paxos with stable leader L:

# At leader election (rare):
#   L → ALL: PREPARE(n)           # n covers ALL future slots
#   L receives PROMISE from majority, learns highest accepted per slot
#   L "completes" any previously chosen-but-not-fully-replicated values

# Per request (common case):
#   client → L: REQUEST(value)
#   L picks next free slot s
#   L → ALL: ACCEPT(n, s, value)   # phase 2 only — 1 RTT
#   L receives ACCEPTED from majority
#   L → ALL: CHOSEN(s, value)
#   L → client: REPLY(success)

This is why Multi-Paxos is "the same speed as Raft" — the steady-state cost is one RTT plus disk fsyncs, identical to Raft's AppendEntries. The difference is conceptual cleanliness, not performance.

4. The 1989 Insight: Asynchronous Quorums

Before Paxos, the standard approach to consensus was 2-phase commit (2PC) or 3-phase commit (3PC). These require all nodes to participate. If any node is down or slow, the whole system blocks.

Lamport's insight: you don't need everyone — you need a majority. Any two majorities overlap, which is enough to preserve safety across leadership changes. This makes Paxos:

Asynchronous-safe: arbitrary message delays don't cause incorrect decisions.
Live with f=⌊(N-1)/2⌋ failures: as long as a majority stays up, progress is possible.
No coordinator dependency: any node can become a proposer; rounds with higher numbers preempt lower ones.

This was a fundamental shift. Every consensus protocol since (Raft, Zab, View-Stamped Replication, EPaxos, Flexible Paxos) is a refinement of "majority quorum + monotonic round numbers."

5. Comparison with Raft

	Paxos	Raft
Year published	1989/1998/2001	2014
Stated goal	Mathematical correctness	Understandability
Leader election	Implicit (highest proposal number wins)	Explicit RequestVote phase
Log replication	One Paxos instance per slot, can have gaps	Strict prefix matching, no gaps
Membership change	Reconfiguration via consensus on config itself	Joint consensus or single-server change
Steady-state RTT	1 (Multi-Paxos)	1
Implementations	Chubby, Spanner, Megastore	etcd, Consul, CockroachDB, TiKV
Typical reaction	"I think I get it"	"OK, I get it"

The Raft paper (Ongaro & Ousterhout 2014) explicitly framed itself as "Paxos but understandable." It's not faster or more correct — it's just easier to specify, implement, and verify. The cost: Raft requires a strict log without gaps, so a single straggler can stall the entire log.

6. Production Use: Chubby, Spanner, Megastore

Google's Chubby (Burrows 2006) was the first major production Paxos system — a coarse-grained lock service used by GFS, BigTable, and others for leader election and metadata storage. Chubby exposed a Unix-like file API, with each file being a tiny replicated state machine over Paxos.

Spanner (Corbett 2012) uses Paxos per shard ("tablet") to replicate writes across data centers. Spanner's twist is TrueTime — atomic-clock-derived bounded clock uncertainty — used together with Paxos to provide externally-consistent transactions.

Megastore (Baker 2011) ran a Paxos round per write across geographically distributed data centers, sacrificing latency for strong consistency. Largely superseded by Spanner internally at Google.

Outside Google: Apache ZooKeeper uses Zab (a Paxos-like protocol). Apache Cassandra has lightweight transactions (LWT) implemented via Paxos for compare-and-set semantics.

7. View-Stamped Replication: Paxos's Forgotten Cousin

Liskov & Oki published View-Stamped Replication in 1988 — a year before Lamport's Paxos paper, but framed as a database replication protocol rather than consensus. VSR uses identical ideas: monotonic view numbers (~ proposal numbers), majority quorums, and a "view change" protocol equivalent to Paxos's Phase 1.

The 2012 update "VR Revisited" (Liskov & Cowling) is now the cleanest reference for the VSR-style consensus pattern — and Raft's design borrows directly from it.

8. Flexible Paxos and Heidi Howard's Insight

Heidi Howard's 2016 paper "Flexible Paxos" relaxed Paxos's quorum requirement. The key observation: safety only requires that Phase 1 quorums and Phase 2 quorums intersect — not that each is a majority.

# Standard Paxos: |Q1| > N/2 AND |Q2| > N/2
# Flexible Paxos: |Q1| + |Q2| > N

# Concrete example with N=5:
#   Standard: Q1=3, Q2=3
#   Flexible: Q1=4, Q2=2  → committed values need only 2 acceptors!
#
# Trade off availability: now 2 unavailable acceptors block leader election,
# but writes only need 2 acceptors.

This is useful when reads dominate and you can tolerate slower leader changes. WAN-deployed systems sometimes use it: small fast write quorum across nearby data centers, large recovery quorum once during election.

9. Implementation Pitfalls

Going from "Paxos Made Simple" to a working implementation is where most bugs hide:

Persistence: acceptors must persist (maxSeenN, acceptedN, acceptedV) to disk before responding. Otherwise a reboot can violate safety.
Slot completion: a new leader must run Phase 1 then Phase 2 for any "holes" left by previous leaders — even slots where no one has accepted anything yet.
Duplicate filtering: clients must tag requests with monotonic IDs so a retried request doesn't get applied twice in the state machine.
Read-only optimization: reading at the leader still costs an RTT to confirm leadership. Lease-based reads (where the leader holds a time-bounded lease and can serve reads locally) are common and require careful clock-skew accounting.
Membership change: not part of basic Paxos. The standard approach is to make the membership configuration itself a value Paxos decides.

Lamport's "Paxos Made Live" (Chandra et al. 2007) — the Chubby team's experience report — is the best read on what's actually hard. Twenty pages of "we did this, it broke, here's the fix." Required reading before implementing.

10. Single-Decree Paxos vs Replicated Log

Almost no production system uses single-decree Paxos directly. Real systems are replicated state machines: sequence of decisions, each producing a state transition.

# Replicated state machine (RSM) over Paxos:
# - Slot 0: client A wants "set x = 1"  → Paxos decides "set x = 1"
# - Slot 1: client B wants "set y = 2"  → Paxos decides "set y = 2"
# - Slot 2: client A wants "x++"        → Paxos decides "x++"
#
# Each slot is a separate Paxos instance.
# Multi-Paxos pipelines them: leader runs Phase 1 once, then Phase 2 per slot.
#
# State machine applies decisions in slot order:
#   apply slot 0: x = 1
#   apply slot 1: y = 2
#   apply slot 2: x = 2
# Every replica applies in the same order → same final state.

This is the model that Raft inherited and made explicit. Paxos's flexibility (allowing slots to commit out of order) is largely unused in practice because state machines need ordered application anyway.

Tradeoffs

Pro	Con
Mathematically minimal — provably correct	Notoriously hard to specify in detail
Tolerates ⌊(N-1)/2⌋ crash failures	Doesn't tolerate Byzantine faults (BFT-Paxos exists separately)
Asynchronous-safe (timing-agnostic for safety)	Liveness depends on eventual stable leader
Multi-Paxos: one RTT per decision in steady state	Edge cases (leader changes, recovery) hide complexity
Allows out-of-order slot decisions (no log gap blocking)	Out-of-order makes log apply / replay harder

FAQ

Why is Paxos considered hard?

Two reasons: (1) Lamport's original paper used a Greek-island parable that obscured the algorithm, and (2) the canonical statement is minimal — every detail matters but isn't motivated. Going from "Paxos Made Simple" to a working implementation requires re-deriving things the paper leaves implicit (acceptors must persist their state, leaders must replay un-chosen slots, etc.).

Is Multi-Paxos the same as Paxos?

Multi-Paxos is the protocol you actually use. Basic Paxos solves a single decision; Multi-Paxos extends it with a stable leader and per-slot Phase 2-only operations. Most people who say "Paxos" mean "Multi-Paxos."

What's Fast Paxos? Generalized Paxos?

Fast Paxos (Lamport 2006) saves one RTT in the common case by allowing clients to propose directly to acceptors. Generalized Paxos lets non-conflicting commands commit out of order. Both are theoretically interesting but rarely implemented because Multi-Paxos is already fast enough.

How does Paxos compare to 2PC?

2PC blocks on any failure — if the coordinator crashes after PREPARE but before COMMIT, participants don't know what to do. Paxos uses majority quorums, so it tolerates failure of a minority of nodes. 2PC is for atomic commit across services that don't speak Paxos; Paxos is for replicated state machines.

Does Spanner really run Paxos for every write?

Per Paxos group (one per tablet replica set), yes — but the latency is hidden by TrueTime's bounded clock uncertainty. Reads at a known snapshot timestamp don't need consensus; writes do.

Can I replace Paxos with Raft in an existing system?

Mostly yes — Raft and Multi-Paxos solve the same problem with the same complexity class. Migration is hard because of subtle differences in log handling (Raft requires strict prefix matching; Paxos allows gaps). But greenfield, almost everyone picks Raft.