Distributed Consensus
FLP, CAP, Paxos, Raft, Zab, EPaxos — How Unreliable Nodes Agree on a Single Truth
Consensus is the problem of getting a set of nodes — connected by an unreliable network, subject to crashes — to agree on a single value. It is the foundation of every distributed database, every leader-elected service, every replicated state machine. The 1985 FLP impossibility result proved that perfect consensus is impossible in a fully asynchronous system; every real protocol relaxes one of FLP's assumptions to make progress in practice. The two protocols that won — Paxos (1989) and Raft (2014) — solve the same problem, with the same complexity, and differ mainly in how easy they are to specify and verify.
The Consensus Family Tree
Every modern protocol descends from Lamport's Paxos or Liskov's VSR.
Key Numbers
1. The Consensus Problem
Formally, distributed consensus requires four properties:
- Termination: every non-faulty node eventually decides.
- Validity: the decided value was proposed by some node.
- Integrity: each node decides at most once.
- Agreement: no two non-faulty nodes decide differently.
FLP showed that in a purely asynchronous network with even one possible crash failure, no deterministic algorithm can satisfy all four. Every real protocol works around this — Paxos and Raft sacrifice termination during pathological scenarios (split votes), guaranteeing termination only under partial synchrony (timing eventually behaves).
2. CAP Theorem
Eric Brewer's 2000 conjecture, proved by Gilbert & Lynch in 2002: in the presence of a network Partition, you must choose between Consistency (linearizability) and Availability.
| System | Choice | Behavior under partition |
|---|---|---|
| etcd, ZooKeeper, Spanner | CP | Minority partition rejects writes; majority continues |
| Cassandra (default), DynamoDB | AP | Both partitions accept writes; reconcile later |
| CockroachDB, TiDB | CP per range | Inactive ranges unavailable; others continue |
| MongoDB (replica set) | CP | Minority steps down, only primary writes |
The PACELC extension: when there's no partition, you still trade Latency vs Consistency. Strong consistency requires consensus rounds, which always cost RTTs.
3. Paxos vs Raft (the only two you actually use)
| Paxos / Multi-Paxos | Raft | |
|---|---|---|
| Stable-state RTT | 1 | 1 |
| Quorum size | ⌊N/2⌋+1 (per phase, can be relaxed in Flexible Paxos) | ⌊N/2⌋+1 |
| Log structure | Allows gaps in slot order | Strict prefix matching, no gaps |
| Leader election | Implicit (highest proposal number) | Explicit term + RequestVote |
| Membership change | Reconfiguration via consensus | Joint consensus or single-server change |
| Implementation difficulty | Hard — many edge cases | Easier — explicit FSM |
| Production examples | Chubby, Spanner, Megastore | etcd, Consul, CockroachDB, TiKV |
Read more: Paxos deep dive · Raft deep dive · Leader election internals.
4. Zab (ZooKeeper Atomic Broadcast)
Zab is the protocol behind Apache ZooKeeper. Like Multi-Paxos, it has a leader and a log; like Raft, it requires strict prefix matching. Zab is older than Raft and has a similar shape, with one specialty: FIFO client order preservation within a session, even across leader changes.
# Zab phases:
1. Leader election (FastLeaderElection algorithm)
2. Discovery — new leader collects highest accepted state from quorum
3. Synchronization — leader pushes its log to followers
4. Broadcast — steady-state log replication (one RTT per write)
# Each transaction has a globally ordered "zxid" (epoch + counter).
# This zxid maps directly to Raft's (term, index) tuple. Apache Kafka used ZooKeeper for cluster metadata until KRaft (Kafka Raft) replaced it in 2022. Most other Zab usage is legacy (Hadoop, HBase).
5. EPaxos (Leaderless / Egalitarian)
EPaxos (Moraru et al., SOSP 2013) breaks the leader bottleneck: any replica can commit non-conflicting commands in 1 RTT. Conflicting commands need an extra round to resolve dependencies.
# EPaxos commit flow:
1. Client sends command to nearest replica R
2. R sends PreAccept to all peers, including its set of "interfering" past commands
3. If a fast quorum (~3N/4) all return identical interference set:
commit in 1 RTT (fast path)
Else: do an Accept round to settle dependencies (slow path, 2 RTTs)
# Benefits: no leader bottleneck; reads scale linearly with replicas
# Costs: complex dependency tracking; slower under contention EPaxos has theoretical advantages (better tail latency, no leader hotspot) but few production users — the implementation complexity is significant. CockroachDB explored it; ended up sticking with Raft.
6. Leader-Based vs Leaderless Tradeoff
| Leader-based (Raft, Multi-Paxos) | Leaderless (EPaxos) | |
|---|---|---|
| Throughput | Bounded by leader CPU/IO | Scales with cluster size (under low contention) |
| Latency from far client | Always WAN to leader | Local replica → 1 RTT in fast path |
| Implementation | Simpler; well-understood | Complex dependency tracking |
| Behavior under contention | Predictable (serialized at leader) | Slow path more common; latency suffers |
| Production use | Dominant | Rare |
7. Consensus vs 2PC vs 3PC
People conflate these. They solve different problems:
- 2PC (two-phase commit): atomic commit across N participants. Blocking under coordinator failure. Used for distributed transactions.
- 3PC (three-phase commit): 2PC + a pre-commit phase to avoid coordinator-failure blocking. Vulnerable to network partitions; rarely used.
- Consensus (Paxos, Raft): replicated state machine. Tolerates failures via majority quorum. Used to replicate the same state to multiple nodes.
Modern distributed databases combine them: per-shard consensus (Raft) for replication, plus 2PC across shards for transactions that span them. Spanner's "Paxos + 2PC" is the canonical example.
8. When to Use What
| Need | Use |
|---|---|
| Strongly consistent metadata store | etcd, ZooKeeper, Consul |
| Replicated state machine | Raft library (etcd-io/raft, hashicorp/raft) |
| Distributed SQL with transactions | CockroachDB, Spanner, TiDB, YugabyteDB |
| High-throughput KV with eventual consistency | Cassandra, DynamoDB, Riak |
| Leader election only | etcd lease, ZooKeeper ephemeral nodes |
| Atomic commit across services | 2PC (Saga pattern for retries) |
| BFT (untrusted participants) | Tendermint, HotStuff, BFT-SMaRt |
9. Byzantine Fault Tolerance (BFT)
Crash-fault consensus assumes nodes either work or stop. BFT consensus assumes some nodes can be malicious: lie, equivocate, send conflicting messages to different peers. This requires 3f+1 nodes to tolerate f Byzantine failures — a 50% jump from the 2f+1 of crash-fault systems.
| Protocol | Year | Notable |
|---|---|---|
| PBFT (Castro & Liskov) | 1999 | The reference BFT protocol; foundation for everything since |
| Tendermint | 2014 | BFT for blockchains; powers Cosmos |
| HotStuff | 2018 | Three-phase variant; powers Diem (formerly Libra) |
| Honeybadger BFT | 2016 | Fully asynchronous BFT (no timing assumptions) |
| Algorand | 2017 | Random committee selection for scale |
BFT is essential for permissionless blockchains where any node could be hostile. Inside a single organization with trusted operators, crash-fault Raft is enough — and 50% cheaper.
10. Hybrid Logical Clocks and Timestamps
Consensus orders writes within one group. To order events globally — across many Raft groups, or across regions — you need timestamps. The standard tools:
- Lamport timestamps: simple counter incremented on send/receive. Provides causal order, not real time.
- Vector clocks: per-node counters. Detects concurrent events but grows with nodes.
- Hybrid Logical Clocks (HLC): physical time + logical counter. Bounded skew; works without atomic clocks. Used by CockroachDB, MongoDB.
- TrueTime: bounded clock uncertainty from atomic clocks + GPS. Used by Spanner. Allows external consistency without coordination.
Combining timestamps with consensus is what makes distributed transactions tractable. Spanner = Paxos + 2PC + TrueTime. CockroachDB = Raft + parallel commits + HLC. Both achieve serializable transactions across thousands of shards.
11. The Practitioner's Decision Tree
Need to replicate state across nodes?
├─ No → just use a database, you don't need consensus directly
├─ Yes →
│ ├─ Within a single organization (trusted)?
│ │ ├─ Yes → Raft (etcd-io/raft, hashicorp/raft)
│ │ └─ No → BFT (Tendermint, HotStuff)
│ └─ Cross-shard transactions?
│ ├─ No → per-shard Raft is enough
│ └─ Yes → 2PC + per-shard Raft (CockroachDB pattern)
│ OR Spanner-style 2PC + TrueTime
└─ Just need leader election (no log)?
└─ Use etcd or ZooKeeper as a primitive The right answer 95% of the time is "use a Raft library; don't write consensus from scratch." It is genuinely hard to get right; even simple bugs can violate safety in subtle, partition-dependent ways.
12. Reading List
Foundational papers and textbooks for going deeper:
- Lamport, "Paxos Made Simple" (2001) — the readable Paxos explanation. Read after the 1998 paper if you want the full proof.
- Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (2014) — the Raft paper. Plus Ongaro's PhD thesis for the long-form version including snapshot, membership change, and TLA+ spec.
- Chandra, Griesemer, Redstone, "Paxos Made Live" (2007) — Google's experience implementing Chubby. The single best read on what's actually hard.
- Liskov & Cowling, "Viewstamped Replication Revisited" (2012) — a clearer presentation of VSR than the 1988 original.
- Howard, "Distributed Consensus Revised" (2019, PhD thesis) — Flexible Paxos and a unifying view of the family.
- Castro & Liskov, "Practical Byzantine Fault Tolerance" (1999) — PBFT, the foundation of every BFT protocol since.
- Kleppmann, "Designing Data-Intensive Applications" (2017) — chapters 8 and 9 are the best textbook treatment.
Tradeoffs
| Choice | Pro | Con |
|---|---|---|
| Raft over Paxos | Faster to implement and debug | Strict log prefix limits flexibility |
| Larger cluster (5 → 7) | Tolerates more failures (2 → 3) | Higher latency, more network traffic |
| Synchronous replication | Strong consistency | Blocked by slowest replica |
| Leaderless (EPaxos) | No leader bottleneck | Complex; bad under contention |
| Quorum reads | Linearizable | Cost RTT and serialization |
FAQ
Is consensus the same as agreement?
Effectively yes. "Consensus" and "agreement" are interchangeable in distributed systems literature. Both mean: every non-faulty node decides the same value, and that value was actually proposed.
Why do we need consensus if we have linearizable reads?
Linearizable reads are a consequence of consensus, not a substitute. The system has to internally agree (via consensus) on the order of writes; reads then return values consistent with that order. No consensus → no agreed write order → no linearizability.
Can blockchain protocols replace Paxos/Raft?
No, different problems. Blockchain consensus (Nakamoto, Tendermint, HotStuff) is BFT — it tolerates malicious actors, but pays 3f+1 quorum overhead. Paxos/Raft are crash-fault only, much cheaper at 2f+1. Use BFT if and only if you don't trust the participants.
What's the difference between safety and liveness?
Safety: nothing bad happens (no two nodes ever decide different values). Liveness: something good eventually happens (a decision is reached). FLP says you can have safety always, but liveness only under partial synchrony.
How are timestamps used in consensus?
Hybrid logical clocks (HLCs) and TrueTime are not consensus but composable with it. They give consistent global ordering of events without coordination. Spanner uses TrueTime to make 2PC + Paxos provide externally consistent transactions.
Is consensus the same as atomic broadcast?
Equivalent in expressive power, different framings. Consensus = decide one value. Atomic broadcast = deliver messages in the same order to all nodes. Either can implement the other.