Raft Consensus

Leader Election, Log Replication, Safety, Snapshots — and Why etcd, CockroachDB, and TiKV All Picked the Same Protocol

Raft (Diego Ongaro & John Ousterhout, USENIX ATC 2014) is the consensus protocol that won the implementation race. Its explicit goal was understandability: any competent engineer should be able to read the paper and write a correct implementation. Raft achieves this by decomposing consensus into three sub-problems with crisp boundaries — leader election, log replication, and safety — and adding two operational features that production systems actually need: log compaction (snapshots) and safe membership change. The result: etcd, Consul, CockroachDB, TiKV, RethinkDB, and dozens of other systems all chose Raft over Paxos because writing a correct implementation is simply faster.

The Three Sub-Problems

Raft says: solve these independently, then compose. The composition is the protocol.

Leader Election Randomized timeouts Term + RequestVote One leader per term Log Replication AppendEntries RPC Commit by majority Strict prefix matching Safety Up-to-date check Term comparison No log overwrite Log Compaction Snapshots truncate prefix InstallSnapshot RPC for slow followers Membership Change Joint consensus (C_old,new) or single-server change Three core sub-problems + two operational extensions = production-ready Each independently understandable. Composition is the safety proof.

Key Numbers

2014
"In Search of an Understandable Consensus Algorithm" published
150–300ms
Typical randomized election timeout
~50ms
Typical AppendEntries heartbeat interval
⌊N/2⌋+1
Quorum size for N voting members
1 RTT
Steady-state commit latency (leader → followers → ack)
5
Typical Raft cluster size · tolerates 2 failures
10MB
Default snapshot threshold in etcd · log compacted past this

1. Leader Election

Three states (follower, candidate, leader). Randomized timeouts. Term numbers as logical clock.

Each node tracks currentTerm, increments it whenever it becomes a candidate. Higher term wins; equal term goes to whoever asks first. Stale leaders that wake up will discover their term is old and step down.

# Follower election timeout fires (~200ms with no heartbeat)
state = CANDIDATE
currentTerm += 1
votedFor = self
votes = 1                                  # vote for self

# Send RequestVote to all peers
for peer in cluster:
    peer.send(RequestVote(
        term = currentTerm,
        candidateId = self,
        lastLogIndex = log.length,
        lastLogTerm = log.lastTerm,
    ))

# On receiving votes:
if votes > cluster_size / 2:
    state = LEADER
    send_heartbeats()                      # immediately, to suppress others

# Or if we see a higher term:
if rpc.term > currentTerm:
    currentTerm = rpc.term
    state = FOLLOWER
    votedFor = None

The randomization (each follower picks a timeout uniformly in [150, 300]ms) prevents simultaneous candidacies. With 5 nodes, the probability of split vote is <5% per round, and it recovers in another ~150ms.

See: deep-dive on leader election.

2. Log Replication via AppendEntries

The leader is the sole writer. Clients send commands to the leader, which appends to its log and replicates via AppendEntries RPCs to all followers:

// AppendEntries RPC (also serves as heartbeat with empty entries)
type AppendEntriesArgs struct {
    Term         int        // leader's term
    LeaderId     int        // for client redirection
    PrevLogIndex int        // index immediately preceding new entries
    PrevLogTerm  int        // term of PrevLogIndex
    Entries      []LogEntry // empty for heartbeat
    LeaderCommit int        // leader's commitIndex
}

// Follower behavior:
if args.Term < currentTerm:        return false
if args.Term > currentTerm:        currentTerm = args.Term; state = FOLLOWER
if log[args.PrevLogIndex].Term != args.PrevLogTerm:
    return false                   // log inconsistency, leader will retry
                                   // with PrevLogIndex - 1

// Append entries (overwriting any conflicts past PrevLogIndex)
log = log[:PrevLogIndex+1] + args.Entries
commitIndex = min(args.LeaderCommit, log.length - 1)
return true

The PrevLogIndex/PrevLogTerm consistency check is what enforces strict prefix matching. If a follower disagrees, the leader decrements PrevLogIndex and retries — eventually finding the longest matching prefix and overwriting the rest.

3. The Commitment Rule

An entry is committed when it has been replicated to a majority. Once committed, it can be applied to the state machine and the result returned to the client. Critically:

  • Only entries from the leader's current term can be committed by majority replication.
  • Entries from previous terms are committed indirectly when a current-term entry above them is committed.

Why? Without this rule, you can construct scenarios where a previously-committed entry gets overwritten — the famous "Figure 8" of the Raft paper. The fix: a leader must commit at least one entry of its own term before earlier entries are safe.

// On the leader, after receiving successful AppendEntries replies:
N = highest index such that majority of matchIndex[i] >= N
    AND log[N].Term == currentTerm           // KEY constraint

if N > commitIndex:
    commitIndex = N
    apply(log[oldCommit+1 .. N]) to state machine

4. Safety: The Up-to-Date Check

When a candidate asks for a vote, the voter only grants it if the candidate's log is at least as up-to-date as its own:

// "Up-to-date" comparison:
if candidate.lastLogTerm > voter.lastLogTerm:
    grant_vote = true
elif candidate.lastLogTerm == voter.lastLogTerm
        and candidate.lastLogIndex >= voter.lastLogIndex:
    grant_vote = true
else:
    grant_vote = false   // candidate is missing committed entries

This is the leader-completeness property: any newly-elected leader must have all entries committed in previous terms. Combined with the commitment rule, this gives the safety guarantee that once an entry is committed, it persists in all future leaders' logs.

5. Log Compaction (Snapshots)

Logs grow forever. At some point, the prefix is no longer needed because the state machine has applied it. Raft's solution: snapshot the state machine, then truncate the log up to the snapshot point.

// Periodic snapshot (when log size exceeds threshold)
func takeSnapshot():
    state := stateMachine.serialize()       // app-specific
    snapshot := Snapshot{
        LastIncludedIndex: lastApplied,
        LastIncludedTerm:  log[lastApplied].Term,
        State:             state,
    }
    persist(snapshot)
    log = log[lastApplied+1:]               // truncate prefix

// For followers far behind the leader's current log:
//   leader sends InstallSnapshot RPC instead of AppendEntries
type InstallSnapshotArgs struct {
    Term              int
    LastIncludedIndex int
    LastIncludedTerm  int
    Data              []byte    // serialized state, possibly chunked
}

etcd takes a snapshot every ~10K entries by default. Snapshots are sent in chunks for large states (Kubernetes etcd installs can have multi-GB snapshots).

6. Membership Changes

Adding or removing voting members safely is non-trivial because old and new majorities might disagree. Raft offers two solutions:

Joint consensus

Switch through a transitional configuration that requires majorities of both the old and new sets. Once committed, transition to the new-only configuration.

// Phase 1: replicate config "C_old,new" — majority requires
// agreement from both C_old majority AND C_new majority

// Phase 2: once C_old,new is committed, replicate "C_new"
// New configuration becomes effective for all subsequent operations.

Single-server change

Restrict each membership change to add or remove exactly one server at a time. Then any two majorities of consecutive configurations always overlap, so no joint phase is needed. This is what etcd uses — simpler, safer, just slower if you need to swap many nodes.

7. Production Implementations

SystemRaft libraryUse
etcdetcd-io/raft (Go)Kubernetes control plane, service discovery
Consulhashicorp/raft (Go)Service mesh, KV store
CockroachDBetcd-io/raft, forkPer-range Raft for distributed SQL
TiKV / TiDBtikv/raft-rs (Rust)Per-region Raft, MVCC backend
RethinkDBCustom C++Cluster metadata
Apache RatisJavaOzone, Atomix, etc.
Kafka KRaftCustom JavaKafka without ZooKeeper (KIP-500)

The etcd-io/raft library is the de-facto reference implementation — most other Go projects either use it directly or fork it. CockroachDB runs thousands of independent Raft groups (one per range of keys), demonstrating how well Raft scales horizontally.

8. Read Optimizations: ReadIndex and Leases

A naive linearizable read in Raft costs an AppendEntries round to confirm leadership. Two optimizations are standard in production:

// ReadIndex (used by etcd, TiKV)
// 1. Leader records current commitIndex as readIndex
// 2. Leader sends a heartbeat round to confirm it's still leader
// 3. Once heartbeat succeeds, wait until applied >= readIndex
// 4. Read from local state machine (now safe to return)

// Lease read (faster but requires bounded clock skew)
// 1. Leader holds a time-bounded "lease" granted via heartbeats
// 2. Within the lease, leader serves reads locally without confirmation
// 3. Lease is shorter than election timeout, so safe

// Lease wins on latency but requires you to trust your clocks
// to within (lease - max_clock_skew). Most production etcd setups
// use ReadIndex for safety.

Follower reads (CockroachDB-style) are also possible: the follower contacts the leader once to learn the current commitIndex, then can serve reads at any committed index. Useful for geo-distributed reads where WAN to leader is too slow.

9. Operational Concerns

  • Disk persistence is mandatory: every AppendEntries reply must come after the entry hits the WAL with fsync. SSDs with battery-backed cache are essential; spinning disks make Raft unusable.
  • Snapshot frequency tuning: too rare and follower catch-up is slow; too frequent and snapshot writes themselves cost. etcd's default of "every 10K entries" is a good starting point.
  • Cluster-wide upgrades: rolling upgrades work because Raft tolerates one node down. Mixing protocol versions requires backwards-compatible AppendEntries — most libraries handle this with explicit version negotiation.
  • Voter vs learner nodes: learners replicate the log but don't count toward quorum. Used to bring up new replicas without temporarily lowering availability — promote to voter only after caught up.
  • Pre-vote: prevents a partitioned node from term-bumping when it rejoins, which would force a needless re-election. Most production Raft libraries enable it by default.

10. Common Implementation Bugs Found by TLA+

The Raft paper came with a TLA+ specification. Production implementations have been formally verified against it; the bugs found include:

  • etcd missed a case where a partial write to the WAL could leave the log inconsistent on restart.
  • Multiple implementations forgot the rule that a leader can only commit entries from its current term — a violation produces the "Figure 8" anomaly.
  • HashiCorp's Raft library had a membership-change race where a node demoted from voter to learner could still cast a vote in flight.
  • CockroachDB's per-range Raft caught a bug where a range split could lose entries if both halves crashed simultaneously.

Lesson: even a "simple" protocol like Raft has many edge cases. TLA+ specifications are the only practical way to verify them all.

Tradeoffs

ProCon
Easy to specify and verify (TLA+ specs exist)Strict prefix matching: one slow follower stalls a slot
Single leader simplifies reasoning about readsLeader is throughput bottleneck; can't scale writes horizontally
Heartbeats double as failure detectionHigher heartbeat rate = more network traffic at scale
Membership change is safe by constructionSingle-server change is slow for cluster-wide operations
Massive ecosystem of mature librariesLots of subtle implementation bugs (TLA+ found many)

FAQ

How is Raft different from Multi-Paxos?

Mostly conceptual. Both pick a stable leader and run one RTT per decision in steady state. Raft adds: strict log prefix matching, explicit term-based leader election, and clean membership-change semantics. Paxos allows out-of-order commits and is more flexible, at the cost of being harder to reason about.

What's the largest practical Raft cluster size?

5 voters. 7 is acceptable. Beyond that, election convergence and AppendEntries fanout dominate latency. For larger groups, use sharding (CockroachDB's per-range Raft) or learner nodes (non-voting replicas that don't count toward quorum).

What are pre-vote and check-quorum?

Optional safety extensions. Pre-vote prevents a partitioned node from term-bumping back to a higher term and disrupting the cluster. Check-quorum forces a leader to step down if it can't reach a majority. Most production Raft libraries enable both by default.

How does Raft handle network partitions?

The minority partition cannot commit (no majority). The majority partition continues. When the partition heals, the minority's leader sees a higher term, steps down, and catches up via AppendEntries.

Can I read from a follower in Raft?

Yes, with caveats. Raft itself only guarantees linearizable reads from the leader (after a heartbeat round). Follower reads can be stale. Stale reads are still useful for monitoring or eventual-consistency workloads. CockroachDB and TiDB support follower reads with explicit timestamp constraints.

Is Raft Byzantine-fault-tolerant?

No. Raft assumes crash-stop failures only. A malicious node could violate safety. BFT-Raft variants exist (Tangaroa, others) but pay the same 3f+1 overhead as PBFT.