Raft Leader Election

How Distributed Nodes Agree on a Single Leader, and What Happens When It Crashes

In Raft, all writes flow through a single leader. If the leader crashes, the cluster must elect a new one before it can accept writes again. Leader election is the mechanism that keeps Raft available, and it is also the source of its most common failure modes. The key insight: randomized election timeouts make split votes rare and let the cluster converge quickly even under failure.

Leader Election Visualizer

The election cycle: follower timeout → candidate → vote request → leader elected.

Stable state: Leader sends periodic heartbeats to followers.

Node States

Node  State     Term  Votes
A     Leader    T2    -
B     Follower  T2    -
C     Follower  T2    -

Election Timeout: The Randomization Trick

Raft makes split votes unlikely by using randomized timeouts, the key design choice that makes leader election both simple and robust.

1. Heartbeat Interval

The leader sends AppendEntries heartbeats at a fixed interval that is well below the election timeout, for example around 50ms when election timeouts are 150–300ms. As long as followers receive heartbeats within their election timeout, they stay followers.

2. Election Timeout

Each follower picks a random timeout in the range 150–300ms (or 150–500ms in some configs). When the timeout fires with no heartbeat, the follower becomes a candidate.

import random
# Typical Raft election timeout
election_timeout = random.uniform(150, 300)  # ms
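
The timeout above only matters in combination with a timer that is reset on every heartbeat. A minimal sketch of that timer follows; the class name `ElectionTimer` and its methods are hypothetical and not taken from any Raft library.

```python
import random


class ElectionTimer:
    """Tracks when a follower should give up on the leader (times in ms)."""

    def __init__(self, low_ms=150, high_ms=300, rng=None):
        self.low_ms = low_ms
        self.high_ms = high_ms
        self.rng = rng or random.Random()
        self.deadline_ms = 0.0
        self.reset(now_ms=0.0)

    def reset(self, now_ms):
        # Called on every valid heartbeat: pick a fresh random timeout.
        self.deadline_ms = now_ms + self.rng.uniform(self.low_ms, self.high_ms)

    def expired(self, now_ms):
        # True once no heartbeat has arrived for a full timeout.
        return now_ms >= self.deadline_ms


timer = ElectionTimer(rng=random.Random(1))
timer.reset(now_ms=1000.0)            # heartbeat received at t=1000ms
print(timer.expired(now_ms=1100.0))   # False: 100ms is below the 150ms minimum
print(timer.expired(now_ms=1300.0))   # True: 300ms reaches the 300ms maximum
```

Drawing a new random value on every reset, rather than once at startup, keeps two nodes from staying in lockstep across repeated elections.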

3. Why Randomization Works

If all followers had the same timeout, they'd all become candidates simultaneously, which produces a split vote. Randomization ensures that, with high probability, one node's timeout fires first, giving it a head start to collect a majority before the others wake up.
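
The effect can be estimated with a rough simulation. The sketch below is a simplified model, not a Raft implementation: it counts an election as contested when the second follower times out within one assumed round trip (`rtt_ms`, a made-up figure) of the first, before the first candidate's vote request could arrive. A contested election does not always end in a split vote, so this is an upper bound on the problem.

```python
import random


def contested_rate(draw_timeouts, trials=10_000, rtt_ms=10.0):
    """Fraction of trials in which two nodes become candidates
    within rtt_ms of each other."""
    contested = 0
    for _ in range(trials):
        first, second = sorted(draw_timeouts())[:2]
        if second - first < rtt_ms:
            contested += 1
    return contested / trials


rng = random.Random(42)
# Two surviving followers in a three-node cluster.
fixed = contested_rate(lambda: [150.0, 150.0])
randomized = contested_rate(lambda: [rng.uniform(150, 300) for _ in range(2)])
print(f"fixed: {fixed:.2f}, randomized: {randomized:.2f}")
```

With identical timeouts every election is contested; with a 150–300ms spread only a small fraction are, and widening the range relative to the round-trip time shrinks that fraction further.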

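4. Recovery After Partition

When a partition heals, previously isolated nodes rejoin. The new leader's higher term causes nodes still on an older term, including the former leader, to update their term and revert to followers. Uncommitted entries on a former leader that conflict with the new leader's log are overwritten during log replication. Committed entries are never lost, because a node can only win an election if its log already contains them.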
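
In Raft, a rejoining node's log is repaired by the consistency check in AppendEntries: conflicting uncommitted entries are truncated and replaced with the leader's. The following is a minimal sketch of the follower side, assuming a log stored as a list of `(term, command)` tuples with 1-based indexes as in the Raft paper; the function name and layout are illustrative only.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Apply a leader's entries to a follower log in place.
    Returns True if the entries were accepted."""
    # Reject if our log has no entry at prev_index with a matching term.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset          # 0-based slot for this entry
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]                  # conflict: drop it and everything after
        if pos >= len(log):
            log.append(entry)
    return True


# Old leader A wrote an uncommitted entry in term 2; new leader B wrote in term 3.
old_leader_log = [(1, "x=1"), (2, "x=2")]
ok = append_entries(old_leader_log, prev_index=1, prev_term=1, entries=[(3, "x=9")])
print(ok, old_leader_log)   # True [(1, 'x=1'), (3, 'x=9')]
```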

Fixed vs Randomized Timeout

                        Fixed Timeout       Randomized (Raft)
Split vote probability  High                Very Low
Convergence time        O(n²) worst case    O(1) expected
Implementation          Simple but fragile  Simple and robust
Works under load        No                  Yes

RequestVote RPC: The Vote Request

When a candidate starts an election, it sends RequestVote RPCs to all peers. A receiver grants its vote only if it has not already voted for someone else in this term and the candidate's log is at least as up-to-date as its own.

RequestVote Request

// Go struct mirroring the RequestVote arguments in the Raft paper
type RequestVoteRequest struct {
    Term         int  // candidate's current term
    CandidateId  int  // who wants votes
    LastLogIndex int  // index of candidate's last log entry
    LastLogTerm  int  // term of candidate's last entry
}

The candidate increments its term and votes for itself before sending.

RequestVote Response

type RequestVoteResponse struct {
    Term        int  // current term (for candidate to update itself)
    VoteGranted bool // true means "I voted for you"
}

If the response's term is greater than the candidate's term, the candidate immediately reverts to follower.
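
How a candidate might process the responses it collects can be sketched as follows. The function `tally_votes` and its return values are hypothetical; a real implementation handles responses as they arrive rather than in a batch.

```python
def tally_votes(current_term, cluster_size, responses):
    """responses: iterable of (term, vote_granted) pairs from peers.
    Returns 'leader', 'follower', or 'candidate' (no majority yet)."""
    votes = 1                                  # the candidate's own vote
    majority = cluster_size // 2 + 1
    for term, granted in responses:
        if term > current_term:
            return "follower"                  # a peer is in a newer term
        if granted:
            votes += 1
    return "leader" if votes >= majority else "candidate"


print(tally_votes(3, 3, [(3, True), (3, False)]))   # leader
print(tally_votes(3, 5, [(3, True), (3, False)]))   # candidate
print(tally_votes(3, 3, [(4, False)]))              # follower
```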

When does a peer grant a vote?

✓ Term ≥ current term (candidate is not stale)
✓ Haven't voted for a different candidate this term
✓ Candidate's log is at least as up-to-date as ours: its last entry term is greater than ours, or the terms are equal and its last entry index ≥ ours

The log up-to-date check ensures the new leader has all committed entries, a safety invariant that prevents data loss during elections.
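
The voting rules can be written as a single handler. This is a minimal sketch following the RequestVote rules in the Raft paper; `VoterState` and `handle_request_vote` are illustrative names, and persistence of the term and vote to stable storage is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoterState:
    current_term: int
    voted_for: Optional[int]
    last_log_index: int
    last_log_term: int


def handle_request_vote(state, term, candidate_id, last_log_index, last_log_term):
    """Returns (term, vote_granted) and updates state in place."""
    if term < state.current_term:
        return state.current_term, False       # stale candidate
    if term > state.current_term:
        state.current_term = term              # newer term: forget the old vote
        state.voted_for = None
    # Compare (term, index) pairs: a higher last term wins outright,
    # and the index only breaks ties between equal terms.
    log_ok = (last_log_term, last_log_index) >= (
        state.last_log_term, state.last_log_index)
    can_vote = state.voted_for in (None, candidate_id)
    if log_ok and can_vote:
        state.voted_for = candidate_id
        return state.current_term, True
    return state.current_term, False


voter = VoterState(current_term=2, voted_for=None, last_log_index=5, last_log_term=2)
print(handle_request_vote(voter, 3, 1, 4, 3))  # (3, True): higher last term beats a longer log
print(handle_request_vote(voter, 3, 2, 9, 3))  # (3, False): already voted for candidate 1
```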

Failure Scenarios & Edge Cases

Elections don't always produce a winner on the first try. Raft handles these cases gracefully.

Leader Crash

All followers miss heartbeats. The one whose timer fires first becomes a candidate, wins a majority, and becomes the new leader. Writes resume once the new leader is established.

t+0ms    Leader (A) crashes
t+200ms  B's timer fires first, becomes candidate
t+250ms  B collects 2 of 3 votes (B and C)
t+260ms  B becomes leader, sends first heartbeat

Split Vote (No Majority)

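Two candidates time out at nearly the same moment, and each votes for itself. If the remaining votes are divided or lost so that neither reaches a majority, both candidates time out again, increment their terms, and start a new election with fresh random timeouts. In a three-node cluster the third node's vote usually breaks the tie, as in this timeline:

t+200ms  A and B become candidates simultaneously
t+220ms  A→C: vote granted; B→C: vote denied (C already voted)
t+250ms  A has 2 votes, B has 1: A wins
t+270ms  B reverts to follower, resets timer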

Network Partition

A partition separates the leader from the followers. The isolated leader cannot reach a majority, so it cannot commit new entries, and it steps down when it sees a higher term. The majority partition elects its own leader.

Partition  A (leader, T2) isolated from B and C
t+250ms    B becomes candidate, wins the election in the majority partition
Rejoin     A receives AppendEntries(T3) → reverts to follower
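
Which side of a partition can make progress comes down to a majority check. A minimal sketch, with `has_quorum` as a hypothetical helper name:

```python
def has_quorum(reachable_nodes, cluster_size):
    """reachable_nodes counts the node itself plus the peers it can reach."""
    return reachable_nodes >= cluster_size // 2 + 1


print(has_quorum(1, 3))  # False: isolated leader A cannot commit
print(has_quorum(2, 3))  # True: B and C can elect a leader and commit
```

Because two disjoint groups cannot both hold a majority, at most one side of any partition can elect a leader or commit entries.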

Stale Term (Old Leader)

The former leader has not yet heard from the new leader and still believes it is in charge. When it contacts another node, the reply carries the higher term, and the old leader immediately steps down.

New leader B is elected with term 5
Old leader A tries AppendEntries(term=3)
A receives a response with term=5 → reverts to follower
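
The step-down rule applies to every message a node sees, requests and responses alike. A minimal sketch, assuming node state is kept in a plain dict with `term`, `role`, and `voted_for` keys (a hypothetical layout chosen for brevity):

```python
def observe_term(node, rpc_term):
    """Apply to every incoming request or response.
    Returns True if the node stepped down."""
    if rpc_term > node["term"]:
        node["term"] = rpc_term
        node["role"] = "follower"
        node["voted_for"] = None       # no vote cast yet in the new term
        return True
    return False


old_leader = {"term": 3, "role": "leader", "voted_for": 1}
observe_term(old_leader, 5)
print(old_leader)   # {'term': 5, 'role': 'follower', 'voted_for': None}
```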

Pre-Vote: Preventing Unnecessary Disruptions

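Standard Raft has one edge case: a follower that is cut off from the leader keeps timing out and incrementing its term. When it reconnects, its higher term forces the healthy leader to step down and triggers a disruptive election that the node would probably lose anyway. The Pre-Vote extension fixes this.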

// Pre-Vote: ask "would you vote for me?" before changing term
// If majority responds YES → proceed with real RequestVote
// This prevents term inflation when the node would lose anyway

type PreVoteRequest struct {
    Term          int  // term the candidate would use (not yet adopted or persisted)
    CandidateId   int
    LastLogIndex  int
    LastLogTerm   int
}

// A follower responds YES to PreVote if:
// - It would grant a vote in a real election
// - It has not received a heartbeat from a valid leader recently

Candidate → all peers: send PreVote
→ If a majority says YES: increment term, send RequestVote
→ If a majority grants the vote: become leader

Pre-Vote is not part of the original Raft paper (it is described in Diego Ongaro's Raft dissertation) but is implemented in etcd, CockroachDB, and many production Raft libraries. It reduces election disruptions caused by isolated nodes.
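
The two conditions for answering YES can be sketched as a pure function. This is a simplified illustration, not etcd's or any other library's implementation: the dict layout, the parameter names, and the 150ms threshold are assumptions, and the check on an existing vote is left out.

```python
def handle_pre_vote(state, req_term, last_log_index, last_log_term,
                    ms_since_heartbeat, min_election_timeout_ms=150):
    """state: dict with 'term', 'last_log_index', 'last_log_term'.
    Never mutates state: a pre-vote must not change any term or vote."""
    if req_term < state["term"]:
        return False                   # proposed term is already stale
    if ms_since_heartbeat < min_election_timeout_ms:
        return False                   # we still hear from a live leader
    return (last_log_term, last_log_index) >= (
        state["last_log_term"], state["last_log_index"])


follower = {"term": 4, "last_log_index": 10, "last_log_term": 4}
# An isolated node rejoins and asks while the leader is still heartbeating:
print(handle_pre_vote(follower, 5, 7, 3, ms_since_heartbeat=40))    # False
# The leader has been silent past the timeout and the candidate's log is current:
print(handle_pre_vote(follower, 5, 10, 4, ms_since_heartbeat=400))  # True
```

Because the handler leaves the receiver's term untouched, a rejected pre-vote has no effect on the healthy part of the cluster.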

Leader Election in Production: etcd

etcd uses Raft to maintain consistent state for Kubernetes. Every kube-apiserver write goes through the etcd leader, and linearizable reads are confirmed with it as well.

etcd-0 (LEADER): WAL writer · MVCC store
→ etcd-1 (FOLLOWER): applies committed entries
→ etcd-2 (FOLLOWER): applies committed entries

etcd Parameter          Value                                        Effect
Heartbeat interval      100ms                                        Leader-to-follower heartbeat frequency
Election timeout        1000ms                                       Follower waits at least this long before starting an election
Election timeout range  Randomized between 1x and 2x the base value  Makes split votes unlikely
Snapshot interval       Every 5 minutes or 10K ops                   Compacts Raft log
Learner nodes           Supported since etcd 3.4                     Non-voting members that catch up on the log before promotion

In Kubernetes, the kube-apiserver is stateless, so any apiserver instance can serve requests; the cluster state lives in etcd. Raft guarantees at most one etcd leader per term. If the etcd leader fails, the Kubernetes control plane is briefly unable to process writes until a new election completes (~1-2 seconds).
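
The 1-2 second figure can be roughly derived from the timing parameters. The sketch below is a back-of-the-envelope estimate only: it assumes each follower's timeout is randomized between T and 2T, that a single election round succeeds, and a made-up 10ms round trip. The function name is hypothetical.

```python
def failover_window_ms(election_timeout_ms=1000, heartbeat_ms=100, rtt_ms=10):
    """Rough best and worst case from leader crash to a new leader."""
    # Best case: the crash happens just before the next heartbeat was due
    # and a follower drew the shortest possible timeout.
    best = election_timeout_ms - heartbeat_ms + rtt_ms
    # Worst case: the crash happens right after a heartbeat and the first
    # follower to time out drew nearly the longest timeout.
    worst = 2 * election_timeout_ms + rtt_ms
    return best, worst


print(failover_window_ms())  # (910, 2010)
```

A split vote or a slow disk adds further rounds, so real failovers can exceed the upper figure.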