Distributed Consensus

FLP, CAP, Paxos, Raft, Zab, EPaxos — How Unreliable Nodes Agree on a Single Truth

Consensus is the problem of getting a set of nodes — connected by an unreliable network, subject to crashes — to agree on a single value. It is the foundation of every distributed database, every leader-elected service, every replicated state machine. The 1985 FLP impossibility result proved that perfect consensus is impossible in a fully asynchronous system; every real protocol relaxes one of FLP's assumptions to make progress in practice. The two protocols that won — Paxos (1989) and Raft (2014) — solve the same problem, with the same complexity, and differ mainly in how easy they are to specify and verify.

The Consensus Family Tree

Every modern protocol descends from Lamport's Paxos or Liskov's VSR.

FLP Impossibility (1985) No deterministic async consensus protocol can guarantee both safety and liveness Paxos (Lamport 1989) VSR (Liskov 1988) Multi-Paxos Zab (ZooKeeper) Raft (Ongaro 2014) Flexible Paxos EPaxos (leaderless) All variants relax timing, fault, or coordination assumptions of pure async consensus.

Key Numbers

1985
FLP impossibility (Fischer, Lynch, Paterson)
⌊N/2⌋+1
Crash-fault quorum size for N nodes
3f+1
Byzantine-fault quorum (PBFT, BFT-Raft)
1 RTT
Steady-state Multi-Paxos / Raft commit latency
2 RTTs
Basic Paxos: prepare + accept
1989
Lamport's original Paxos paper
~1ms
Datacenter consensus latency · ~80ms cross-region

1. The Consensus Problem

Formally, distributed consensus requires four properties:

  • Termination: every non-faulty node eventually decides.
  • Validity: the decided value was proposed by some node.
  • Integrity: each node decides at most once.
  • Agreement: no two non-faulty nodes decide differently.

FLP showed that in a purely asynchronous network with even one possible crash failure, no deterministic algorithm can satisfy all four. Every real protocol works around this — Paxos and Raft sacrifice termination during pathological scenarios (split votes), guaranteeing termination only under partial synchrony (timing eventually behaves).

2. CAP Theorem

Eric Brewer's 2000 conjecture, proved by Gilbert & Lynch in 2002: in the presence of a network Partition, you must choose between Consistency (linearizability) and Availability.

SystemChoiceBehavior under partition
etcd, ZooKeeper, SpannerCPMinority partition rejects writes; majority continues
Cassandra (default), DynamoDBAPBoth partitions accept writes; reconcile later
CockroachDB, TiDBCP per rangeInactive ranges unavailable; others continue
MongoDB (replica set)CPMinority steps down, only primary writes

The PACELC extension: when there's no partition, you still trade Latency vs Consistency. Strong consistency requires consensus rounds, which always cost RTTs.

3. Paxos vs Raft (the only two you actually use)

Paxos / Multi-PaxosRaft
Stable-state RTT11
Quorum size⌊N/2⌋+1 (per phase, can be relaxed in Flexible Paxos)⌊N/2⌋+1
Log structureAllows gaps in slot orderStrict prefix matching, no gaps
Leader electionImplicit (highest proposal number)Explicit term + RequestVote
Membership changeReconfiguration via consensusJoint consensus or single-server change
Implementation difficultyHard — many edge casesEasier — explicit FSM
Production examplesChubby, Spanner, Megastoreetcd, Consul, CockroachDB, TiKV

Read more: Paxos deep dive · Raft deep dive · Leader election internals.

4. Zab (ZooKeeper Atomic Broadcast)

Zab is the protocol behind Apache ZooKeeper. Like Multi-Paxos, it has a leader and a log; like Raft, it requires strict prefix matching. Zab is older than Raft and has a similar shape, with one specialty: FIFO client order preservation within a session, even across leader changes.

# Zab phases:
1. Leader election (FastLeaderElection algorithm)
2. Discovery — new leader collects highest accepted state from quorum
3. Synchronization — leader pushes its log to followers
4. Broadcast — steady-state log replication (one RTT per write)

# Each transaction has a globally ordered "zxid" (epoch + counter).
# This zxid maps directly to Raft's (term, index) tuple.

Apache Kafka used ZooKeeper for cluster metadata until KRaft (Kafka Raft) replaced it in 2022. Most other Zab usage is legacy (Hadoop, HBase).

5. EPaxos (Leaderless / Egalitarian)

EPaxos (Moraru et al., SOSP 2013) breaks the leader bottleneck: any replica can commit non-conflicting commands in 1 RTT. Conflicting commands need an extra round to resolve dependencies.

# EPaxos commit flow:
1. Client sends command to nearest replica R
2. R sends PreAccept to all peers, including its set of "interfering" past commands
3. If a fast quorum (~3N/4) all return identical interference set:
   commit in 1 RTT (fast path)
   Else: do an Accept round to settle dependencies (slow path, 2 RTTs)

# Benefits: no leader bottleneck; reads scale linearly with replicas
# Costs: complex dependency tracking; slower under contention

EPaxos has theoretical advantages (better tail latency, no leader hotspot) but few production users — the implementation complexity is significant. CockroachDB explored it; ended up sticking with Raft.

6. Leader-Based vs Leaderless Tradeoff

Leader-based (Raft, Multi-Paxos)Leaderless (EPaxos)
ThroughputBounded by leader CPU/IOScales with cluster size (under low contention)
Latency from far clientAlways WAN to leaderLocal replica → 1 RTT in fast path
ImplementationSimpler; well-understoodComplex dependency tracking
Behavior under contentionPredictable (serialized at leader)Slow path more common; latency suffers
Production useDominantRare

7. Consensus vs 2PC vs 3PC

People conflate these. They solve different problems:

  • 2PC (two-phase commit): atomic commit across N participants. Blocking under coordinator failure. Used for distributed transactions.
  • 3PC (three-phase commit): 2PC + a pre-commit phase to avoid coordinator-failure blocking. Vulnerable to network partitions; rarely used.
  • Consensus (Paxos, Raft): replicated state machine. Tolerates failures via majority quorum. Used to replicate the same state to multiple nodes.

Modern distributed databases combine them: per-shard consensus (Raft) for replication, plus 2PC across shards for transactions that span them. Spanner's "Paxos + 2PC" is the canonical example.

8. When to Use What

NeedUse
Strongly consistent metadata storeetcd, ZooKeeper, Consul
Replicated state machineRaft library (etcd-io/raft, hashicorp/raft)
Distributed SQL with transactionsCockroachDB, Spanner, TiDB, YugabyteDB
High-throughput KV with eventual consistencyCassandra, DynamoDB, Riak
Leader election onlyetcd lease, ZooKeeper ephemeral nodes
Atomic commit across services2PC (Saga pattern for retries)
BFT (untrusted participants)Tendermint, HotStuff, BFT-SMaRt

9. Byzantine Fault Tolerance (BFT)

Crash-fault consensus assumes nodes either work or stop. BFT consensus assumes some nodes can be malicious: lie, equivocate, send conflicting messages to different peers. This requires 3f+1 nodes to tolerate f Byzantine failures — a 50% jump from the 2f+1 of crash-fault systems.

ProtocolYearNotable
PBFT (Castro & Liskov)1999The reference BFT protocol; foundation for everything since
Tendermint2014BFT for blockchains; powers Cosmos
HotStuff2018Three-phase variant; powers Diem (formerly Libra)
Honeybadger BFT2016Fully asynchronous BFT (no timing assumptions)
Algorand2017Random committee selection for scale

BFT is essential for permissionless blockchains where any node could be hostile. Inside a single organization with trusted operators, crash-fault Raft is enough — and 50% cheaper.

10. Hybrid Logical Clocks and Timestamps

Consensus orders writes within one group. To order events globally — across many Raft groups, or across regions — you need timestamps. The standard tools:

  • Lamport timestamps: simple counter incremented on send/receive. Provides causal order, not real time.
  • Vector clocks: per-node counters. Detects concurrent events but grows with nodes.
  • Hybrid Logical Clocks (HLC): physical time + logical counter. Bounded skew; works without atomic clocks. Used by CockroachDB, MongoDB.
  • TrueTime: bounded clock uncertainty from atomic clocks + GPS. Used by Spanner. Allows external consistency without coordination.

Combining timestamps with consensus is what makes distributed transactions tractable. Spanner = Paxos + 2PC + TrueTime. CockroachDB = Raft + parallel commits + HLC. Both achieve serializable transactions across thousands of shards.

11. The Practitioner's Decision Tree

Need to replicate state across nodes?
├─ No → just use a database, you don't need consensus directly
├─ Yes →
│  ├─ Within a single organization (trusted)?
│  │  ├─ Yes → Raft (etcd-io/raft, hashicorp/raft)
│  │  └─ No  → BFT (Tendermint, HotStuff)
│  └─ Cross-shard transactions?
│     ├─ No  → per-shard Raft is enough
│     └─ Yes → 2PC + per-shard Raft (CockroachDB pattern)
│              OR Spanner-style 2PC + TrueTime
└─ Just need leader election (no log)?
   └─ Use etcd or ZooKeeper as a primitive

The right answer 95% of the time is "use a Raft library; don't write consensus from scratch." It is genuinely hard to get right; even simple bugs can violate safety in subtle, partition-dependent ways.

12. Reading List

Foundational papers and textbooks for going deeper:

  • Lamport, "Paxos Made Simple" (2001) — the readable Paxos explanation. Read after the 1998 paper if you want the full proof.
  • Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (2014) — the Raft paper. Plus Ongaro's PhD thesis for the long-form version including snapshot, membership change, and TLA+ spec.
  • Chandra, Griesemer, Redstone, "Paxos Made Live" (2007) — Google's experience implementing Chubby. The single best read on what's actually hard.
  • Liskov & Cowling, "Viewstamped Replication Revisited" (2012) — a clearer presentation of VSR than the 1988 original.
  • Howard, "Distributed Consensus Revised" (2019, PhD thesis) — Flexible Paxos and a unifying view of the family.
  • Castro & Liskov, "Practical Byzantine Fault Tolerance" (1999) — PBFT, the foundation of every BFT protocol since.
  • Kleppmann, "Designing Data-Intensive Applications" (2017) — chapters 8 and 9 are the best textbook treatment.

Tradeoffs

ChoiceProCon
Raft over PaxosFaster to implement and debugStrict log prefix limits flexibility
Larger cluster (5 → 7)Tolerates more failures (2 → 3)Higher latency, more network traffic
Synchronous replicationStrong consistencyBlocked by slowest replica
Leaderless (EPaxos)No leader bottleneckComplex; bad under contention
Quorum readsLinearizableCost RTT and serialization

FAQ

Is consensus the same as agreement?

Effectively yes. "Consensus" and "agreement" are interchangeable in distributed systems literature. Both mean: every non-faulty node decides the same value, and that value was actually proposed.

Why do we need consensus if we have linearizable reads?

Linearizable reads are a consequence of consensus, not a substitute. The system has to internally agree (via consensus) on the order of writes; reads then return values consistent with that order. No consensus → no agreed write order → no linearizability.

Can blockchain protocols replace Paxos/Raft?

No, different problems. Blockchain consensus (Nakamoto, Tendermint, HotStuff) is BFT — it tolerates malicious actors, but pays 3f+1 quorum overhead. Paxos/Raft are crash-fault only, much cheaper at 2f+1. Use BFT if and only if you don't trust the participants.

What's the difference between safety and liveness?

Safety: nothing bad happens (no two nodes ever decide different values). Liveness: something good eventually happens (a decision is reached). FLP says you can have safety always, but liveness only under partial synchrony.

How are timestamps used in consensus?

Hybrid logical clocks (HLCs) and TrueTime are not consensus but composable with it. They give consistent global ordering of events without coordination. Spanner uses TrueTime to make 2PC + Paxos provide externally consistent transactions.

Is consensus the same as atomic broadcast?

Equivalent in expressive power, different framings. Consensus = decide one value. Atomic broadcast = deliver messages in the same order to all nodes. Either can implement the other.