Distributed Consensus

FLP, CAP, Paxos, Raft, Zab, EPaxos — How Unreliable Nodes Agree on a Single Truth

Consensus is the problem of getting a set of nodes — connected by an unreliable network, subject to crashes — to agree on a single value. It is the foundation of every distributed database, every leader-elected service, every replicated state machine. The 1985 FLP impossibility result proved that perfect consensus is impossible in a fully asynchronous system; every real protocol relaxes one of FLP's assumptions to make progress in practice. The two protocols that won — Paxos (1989) and Raft (2014) — solve the same problem, with the same complexity, and differ mainly in how easy they are to specify and verify.

The Consensus Family Tree

Every modern protocol descends from Lamport's Paxos or Liskov's VSR.

Key Numbers

1985

FLP impossibility (Fischer, Lynch, Paterson)

⌊N/2⌋+1

Crash-fault quorum size for N nodes

3f+1

Byzantine-fault quorum (PBFT, BFT-Raft)

1 RTT

Steady-state Multi-Paxos / Raft commit latency

2 RTTs

Basic Paxos: prepare + accept

1989

Lamport's original Paxos paper

~1ms

Datacenter consensus latency · ~80ms cross-region

1. The Consensus Problem

Formally, distributed consensus requires four properties:

Termination: every non-faulty node eventually decides.
Validity: the decided value was proposed by some node.
Integrity: each node decides at most once.
Agreement: no two non-faulty nodes decide differently.

FLP showed that in a purely asynchronous network with even one possible crash failure, no deterministic algorithm can satisfy all four. Every real protocol works around this — Paxos and Raft sacrifice termination during pathological scenarios (split votes), guaranteeing termination only under partial synchrony (timing eventually behaves).

2. CAP Theorem

Eric Brewer's 2000 conjecture, proved by Gilbert & Lynch in 2002: in the presence of a network Partition, you must choose between Consistency (linearizability) and Availability.

System	Choice	Behavior under partition
etcd, ZooKeeper, Spanner	CP	Minority partition rejects writes; majority continues
Cassandra (default), DynamoDB	AP	Both partitions accept writes; reconcile later
CockroachDB, TiDB	CP per range	Inactive ranges unavailable; others continue
MongoDB (replica set)	CP	Minority steps down, only primary writes

The PACELC extension: when there's no partition, you still trade Latency vs Consistency. Strong consistency requires consensus rounds, which always cost RTTs.

3. Paxos vs Raft (the only two you actually use)

	Paxos / Multi-Paxos	Raft
Stable-state RTT	1	1
Quorum size	⌊N/2⌋+1 (per phase, can be relaxed in Flexible Paxos)	⌊N/2⌋+1
Log structure	Allows gaps in slot order	Strict prefix matching, no gaps
Leader election	Implicit (highest proposal number)	Explicit term + RequestVote
Membership change	Reconfiguration via consensus	Joint consensus or single-server change
Implementation difficulty	Hard — many edge cases	Easier — explicit FSM
Production examples	Chubby, Spanner, Megastore	etcd, Consul, CockroachDB, TiKV

4. Zab (ZooKeeper Atomic Broadcast)

Zab is the protocol behind Apache ZooKeeper. Like Multi-Paxos, it has a leader and a log; like Raft, it requires strict prefix matching. Zab is older than Raft and has a similar shape, with one specialty: FIFO client order preservation within a session, even across leader changes.

# Zab phases:
1. Leader election (FastLeaderElection algorithm)
2. Discovery — new leader collects highest accepted state from quorum
3. Synchronization — leader pushes its log to followers
4. Broadcast — steady-state log replication (one RTT per write)

# Each transaction has a globally ordered "zxid" (epoch + counter).
# This zxid maps directly to Raft's (term, index) tuple.

Apache Kafka used ZooKeeper for cluster metadata until KRaft (Kafka Raft) replaced it in 2022. Most other Zab usage is legacy (Hadoop, HBase).

5. EPaxos (Leaderless / Egalitarian)

EPaxos (Moraru et al., SOSP 2013) breaks the leader bottleneck: any replica can commit non-conflicting commands in 1 RTT. Conflicting commands need an extra round to resolve dependencies.

# EPaxos commit flow:
1. Client sends command to nearest replica R
2. R sends PreAccept to all peers, including its set of "interfering" past commands
3. If a fast quorum (~3N/4) all return identical interference set:
   commit in 1 RTT (fast path)
   Else: do an Accept round to settle dependencies (slow path, 2 RTTs)

# Benefits: no leader bottleneck; reads scale linearly with replicas
# Costs: complex dependency tracking; slower under contention

EPaxos has theoretical advantages (better tail latency, no leader hotspot) but few production users — the implementation complexity is significant. CockroachDB explored it; ended up sticking with Raft.

6. Leader-Based vs Leaderless Tradeoff

	Leader-based (Raft, Multi-Paxos)	Leaderless (EPaxos)
Throughput	Bounded by leader CPU/IO	Scales with cluster size (under low contention)
Latency from far client	Always WAN to leader	Local replica → 1 RTT in fast path
Implementation	Simpler; well-understood	Complex dependency tracking
Behavior under contention	Predictable (serialized at leader)	Slow path more common; latency suffers
Production use	Dominant	Rare

7. Consensus vs 2PC vs 3PC

People conflate these. They solve different problems:

2PC (two-phase commit): atomic commit across N participants. Blocking under coordinator failure. Used for distributed transactions.
3PC (three-phase commit): 2PC + a pre-commit phase to avoid coordinator-failure blocking. Vulnerable to network partitions; rarely used.
Consensus (Paxos, Raft): replicated state machine. Tolerates failures via majority quorum. Used to replicate the same state to multiple nodes.

Modern distributed databases combine them: per-shard consensus (Raft) for replication, plus 2PC across shards for transactions that span them. Spanner's "Paxos + 2PC" is the canonical example.

8. When to Use What

Need	Use
Strongly consistent metadata store	etcd, ZooKeeper, Consul
Replicated state machine	Raft library (etcd-io/raft, hashicorp/raft)
Distributed SQL with transactions	CockroachDB, Spanner, TiDB, YugabyteDB
High-throughput KV with eventual consistency	Cassandra, DynamoDB, Riak
Leader election only	etcd lease, ZooKeeper ephemeral nodes
Atomic commit across services	2PC (Saga pattern for retries)
BFT (untrusted participants)	Tendermint, HotStuff, BFT-SMaRt

9. Byzantine Fault Tolerance (BFT)

Crash-fault consensus assumes nodes either work or stop. BFT consensus assumes some nodes can be malicious: lie, equivocate, send conflicting messages to different peers. This requires 3f+1 nodes to tolerate f Byzantine failures — a 50% jump from the 2f+1 of crash-fault systems.

Protocol	Year	Notable
PBFT (Castro & Liskov)	1999	The reference BFT protocol; foundation for everything since
Tendermint	2014	BFT for blockchains; powers Cosmos
HotStuff	2018	Three-phase variant; powers Diem (formerly Libra)
Honeybadger BFT	2016	Fully asynchronous BFT (no timing assumptions)
Algorand	2017	Random committee selection for scale

BFT is essential for permissionless blockchains where any node could be hostile. Inside a single organization with trusted operators, crash-fault Raft is enough — and 50% cheaper.

10. Hybrid Logical Clocks and Timestamps

Consensus orders writes within one group. To order events globally — across many Raft groups, or across regions — you need timestamps. The standard tools:

Lamport timestamps: simple counter incremented on send/receive. Provides causal order, not real time.
Vector clocks: per-node counters. Detects concurrent events but grows with nodes.
Hybrid Logical Clocks (HLC): physical time + logical counter. Bounded skew; works without atomic clocks. Used by CockroachDB, MongoDB.
TrueTime: bounded clock uncertainty from atomic clocks + GPS. Used by Spanner. Allows external consistency without coordination.

Combining timestamps with consensus is what makes distributed transactions tractable. Spanner = Paxos + 2PC + TrueTime. CockroachDB = Raft + parallel commits + HLC. Both achieve serializable transactions across thousands of shards.

11. The Practitioner's Decision Tree

Need to replicate state across nodes?
├─ No → just use a database, you don't need consensus directly
├─ Yes →
│  ├─ Within a single organization (trusted)?
│  │  ├─ Yes → Raft (etcd-io/raft, hashicorp/raft)
│  │  └─ No  → BFT (Tendermint, HotStuff)
│  └─ Cross-shard transactions?
│     ├─ No  → per-shard Raft is enough
│     └─ Yes → 2PC + per-shard Raft (CockroachDB pattern)
│              OR Spanner-style 2PC + TrueTime
└─ Just need leader election (no log)?
   └─ Use etcd or ZooKeeper as a primitive

The right answer 95% of the time is "use a Raft library; don't write consensus from scratch." It is genuinely hard to get right; even simple bugs can violate safety in subtle, partition-dependent ways.

12. Reading List

Foundational papers and textbooks for going deeper:

Lamport, "Paxos Made Simple" (2001) — the readable Paxos explanation. Read after the 1998 paper if you want the full proof.
Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (2014) — the Raft paper. Plus Ongaro's PhD thesis for the long-form version including snapshot, membership change, and TLA+ spec.
Chandra, Griesemer, Redstone, "Paxos Made Live" (2007) — Google's experience implementing Chubby. The single best read on what's actually hard.
Liskov & Cowling, "Viewstamped Replication Revisited" (2012) — a clearer presentation of VSR than the 1988 original.
Howard, "Distributed Consensus Revised" (2019, PhD thesis) — Flexible Paxos and a unifying view of the family.
Castro & Liskov, "Practical Byzantine Fault Tolerance" (1999) — PBFT, the foundation of every BFT protocol since.
Kleppmann, "Designing Data-Intensive Applications" (2017) — chapters 8 and 9 are the best textbook treatment.

Tradeoffs

Choice	Pro	Con
Raft over Paxos	Faster to implement and debug	Strict log prefix limits flexibility
Larger cluster (5 → 7)	Tolerates more failures (2 → 3)	Higher latency, more network traffic
Synchronous replication	Strong consistency	Blocked by slowest replica
Leaderless (EPaxos)	No leader bottleneck	Complex; bad under contention
Quorum reads	Linearizable	Cost RTT and serialization

FAQ

Is consensus the same as agreement?

Effectively yes. "Consensus" and "agreement" are interchangeable in distributed systems literature. Both mean: every non-faulty node decides the same value, and that value was actually proposed.

Why do we need consensus if we have linearizable reads?

Linearizable reads are a consequence of consensus, not a substitute. The system has to internally agree (via consensus) on the order of writes; reads then return values consistent with that order. No consensus → no agreed write order → no linearizability.

Can blockchain protocols replace Paxos/Raft?

No, different problems. Blockchain consensus (Nakamoto, Tendermint, HotStuff) is BFT — it tolerates malicious actors, but pays 3f+1 quorum overhead. Paxos/Raft are crash-fault only, much cheaper at 2f+1. Use BFT if and only if you don't trust the participants.

What's the difference between safety and liveness?

Safety: nothing bad happens (no two nodes ever decide different values). Liveness: something good eventually happens (a decision is reached). FLP says you can have safety always, but liveness only under partial synchrony.

How are timestamps used in consensus?

Hybrid logical clocks (HLCs) and TrueTime are not consensus but composable with it. They give consistent global ordering of events without coordination. Spanner uses TrueTime to make 2PC + Paxos provide externally consistent transactions.

Is consensus the same as atomic broadcast?

Equivalent in expressive power, different framings. Consensus = decide one value. Atomic broadcast = deliver messages in the same order to all nodes. Either can implement the other.