Raft Leader Election
How Distributed Nodes Agree on a Single Leader, and What Happens When It Crashes
In Raft, all writes flow through a single leader. If the leader crashes, the cluster must elect a new one before it can accept writes. Leader election is the mechanism that keeps Raft available, and it is also the source of its most common failure modes. The key insight: randomized election timeouts make split votes rare and let the cluster converge quickly even under failure.
Election cycle: follower timeout → candidate → vote request → leader elected.
Election Timeout: The Randomization Trick
Raft avoids most split votes using randomized timeouts, the key design choice that makes leader election both simple and robust.
Heartbeat Interval
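The leader sends AppendEntries heartbeats at a fixed interval that must be well below the shortest election timeout (for example, around 50ms when election timeouts are 150-300ms). As long as followers receive heartbeats within their election timeout, they stay followers.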
Election Timeout
Each follower picks a random timeout in the range 150-300ms (or 150-500ms in some configs). When the timeout fires with no heartbeat, the follower becomes a candidate.
import random

# Typical Raft election timeout, in milliseconds
election_timeout = random.uniform(150, 300)
Why Randomization Works
If all followers had the same timeout, they'd all become candidates simultaneously, each would vote for itself, and the vote would split. Randomization ensures that with high probability, one node's timeout fires first, giving it a head start to collect a majority before others wake up.
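To make the head-start argument concrete, here is a minimal simulation sketch. It is not part of Raft itself: `contention_rate`, the 2ms message delay, and the four-follower cluster are illustrative assumptions. It counts how often a second follower times out before the first candidate's vote request could plausibly have arrived.

```python
import random


def first_timeout_gap(num_followers, low_ms, high_ms, rng):
    """Gap in ms between the earliest and second-earliest election timeouts."""
    timeouts = sorted(rng.uniform(low_ms, high_ms) for _ in range(num_followers))
    return timeouts[1] - timeouts[0]


def contention_rate(num_followers, low_ms, high_ms, delay_ms, trials=10_000, seed=0):
    """Fraction of trials where a second follower becomes a candidate
    before the first candidate's RequestVote could have reached it."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    contended = sum(
        first_timeout_gap(num_followers, low_ms, high_ms, rng) < delay_ms
        for _ in range(trials)
    )
    return contended / trials


# Hypothetical setup: 4 followers, 2 ms for a vote request to arrive.
fixed_rate = contention_rate(4, 150, 150, delay_ms=2)       # identical timeouts
randomized_rate = contention_rate(4, 150, 300, delay_ms=2)  # Raft-style range

print(f"fixed: {fixed_rate:.3f}  randomized: {randomized_rate:.3f}")
```

With identical timeouts every trial is contended, while the randomized range leaves only a small fraction of trials with two near-simultaneous candidates. A contended start does not always end in a split vote, so this is an upper bound on the problem rather than a precise split-vote probability.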
Recovery After Partition
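When a partition heals, previously isolated nodes rejoin. The new leader's higher term causes any node still on an older term, including a former leader, to update its term and revert to follower. Uncommitted entries on a former leader that conflict with the new leader's log are overwritten during replication. Committed entries are never lost, because a candidate can only win an election if its log already contains them.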
Fixed vs Randomized Timeout
| | Fixed Timeout | Randomized (Raft) |
|---|---|---|
| Split vote probability | High | Very Low |
| Convergence time | O(n²) worst case | O(1) expected |
| Implementation | Simple but fragile | Simple and robust |
| Works under load | No | Yes |
RequestVote RPC: The Vote Request
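When a candidate starts an election, it sends RequestVote RPCs to all peers. A receiver grants its vote only if the candidate's term is at least its own, it has not already voted for a different candidate in that term, and the candidate's log is at least as up-to-date as its own.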
RequestVote Request
// Go struct mirroring the RequestVote arguments in the Raft paper
type RequestVoteRequest struct {
    Term         int // candidate's current term
    CandidateId  int // who wants votes
    LastLogIndex int // index of candidate's last log entry
    LastLogTerm  int // term of candidate's last entry
}
The candidate increments its term and votes for itself before sending.
RequestVote Response
type RequestVoteResponse struct {
    Term        int  // responder's current term (for candidate to update itself)
    VoteGranted bool // true means "I voted for you"
}
If the response's Term is greater than the candidate's term, the candidate adopts that term and immediately reverts to follower.
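The candidate's side of this exchange can be sketched as follows. This is a minimal illustration, not a full Raft node: the `Candidate` class and its method names are hypothetical, and networking, timers, and persistence are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    node_id: int
    current_term: int
    cluster_size: int
    role: str = "follower"
    votes: set = field(default_factory=set)

    def start_election(self):
        """Increment the term and vote for ourselves before asking peers."""
        self.current_term += 1
        self.role = "candidate"
        self.votes = {self.node_id}

    def on_vote_response(self, peer_id, term, vote_granted):
        """Process one RequestVote response."""
        if term > self.current_term:
            # A peer is in a newer term: adopt it and step down.
            self.current_term = term
            self.role = "follower"
            self.votes = set()
            return
        if self.role != "candidate" or term < self.current_term:
            return  # stale response, or the election is already decided
        if vote_granted:
            self.votes.add(peer_id)
        if len(self.votes) > self.cluster_size // 2:
            self.role = "leader"  # strict majority reached


node = Candidate(node_id=1, current_term=4, cluster_size=5)
node.start_election()                  # term 5, one vote (its own)
node.on_vote_response(2, 5, True)      # two votes, still short of 3
node.on_vote_response(3, 5, True)      # three of five: majority
print(node.role, node.current_term)
```

Tracking votes as a set of peer ids means a duplicated response from the same peer cannot be counted twice.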
When does a peer grant a vote?
The log up-to-date check ensures the new leader has all committed entries. This is a safety invariant that prevents data loss during elections.
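The receiver's decision can be sketched as below, following the rules in the Raft paper: reject stale terms, grant at most one vote per term, and require the candidate's log to be at least as up-to-date. `VoterState` and `handle_request_vote` are hypothetical names chosen for this example, and durable storage of the vote is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoterState:
    current_term: int
    voted_for: Optional[int]  # candidate id voted for in current_term, if any
    last_log_index: int
    last_log_term: int


def log_is_up_to_date(cand_last_term, cand_last_index, voter):
    """Higher last term wins; with equal terms, the longer log wins."""
    if cand_last_term != voter.last_log_term:
        return cand_last_term > voter.last_log_term
    return cand_last_index >= voter.last_log_index


def handle_request_vote(voter, term, candidate_id, last_log_index, last_log_term):
    """Return (term, vote_granted), mirroring RequestVoteResponse."""
    if term < voter.current_term:
        return voter.current_term, False      # candidate is behind
    if term > voter.current_term:
        voter.current_term = term             # newer term: forget old vote
        voter.voted_for = None
    free_to_vote = voter.voted_for in (None, candidate_id)
    if free_to_vote and log_is_up_to_date(last_log_term, last_log_index, voter):
        voter.voted_for = candidate_id
        return voter.current_term, True
    return voter.current_term, False


voter = VoterState(current_term=3, voted_for=None, last_log_index=10, last_log_term=3)
first = handle_request_vote(voter, term=4, candidate_id=1, last_log_index=10, last_log_term=3)
second = handle_request_vote(voter, term=4, candidate_id=2, last_log_index=12, last_log_term=3)
print(first, second)
```

The second request is refused even though that candidate has a longer log, because the voter has already used its single vote for term 4.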
Failure Scenarios & Edge Cases
Elections don't always produce a winner on the first try. Raft handles these cases gracefully.
Leader Crash
All followers miss heartbeats. The first follower whose election timeout fires becomes a candidate, wins a majority, and becomes the new leader. Writes resume once the new leader is established.
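Split Vote (No Majority)
Two candidates time out at nearly the same moment and divide the votes, so neither reaches a majority. Each candidate's election timer then expires again, it increments its term, and a new election starts with freshly randomized timeouts.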
Network Partition
A partition separates the leader from most of its followers. The leader in the minority partition can't commit new entries, and it steps down when it sees a higher term. The majority partition elects its own leader.
Stale Term (Old Leader)
A former leader hasn't yet heard from the new leader. As soon as it exchanges any RPC with a node in a higher term, it sees that its own term is lower and immediately steps down.
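The partition and split-vote cases both come down to majority arithmetic. A small sketch, with `quorum` and `can_elect_leader` as hypothetical helper names:

```python
def quorum(cluster_size):
    """Smallest number of votes that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def can_elect_leader(partition_size, cluster_size):
    """A partition can elect a leader only if it holds a full quorum."""
    return partition_size >= quorum(cluster_size)


# A 5-node cluster split 2 / 3: only the 3-node side can elect a leader.
print(can_elect_leader(2, 5), can_elect_leader(3, 5))

# A 4-node cluster split 2 / 2: neither side reaches the quorum of 3.
print(can_elect_leader(2, 4))
```

Because the quorum is counted against the full cluster size rather than the reachable nodes, two partitions can never both elect a leader in the same term.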
Pre-Vote: Preventing Unnecessary Disruptions
Standard Raft has one edge case: a follower that is cut off from the leader keeps timing out and incrementing its term. When it reconnects, its inflated term forces a healthy leader to step down and triggers a disruptive election. The Pre-Vote extension fixes this.
// Pre-Vote: ask "would you vote for me?" before changing term.
// If a majority responds YES → proceed with a real RequestVote.
// This prevents term inflation when the node would lose anyway.
type PreVoteRequest struct {
    Term         int // term the candidate would campaign in; its own term is not yet incremented
    CandidateId  int
    LastLogIndex int
    LastLogTerm  int
}

// A follower responds YES to PreVote if:
//   - It would grant a vote in a real election
//   - It has not received a heartbeat from a valid leader recently
Pre-Vote is not part of the original Raft paper (it is described in Diego Ongaro's dissertation) but is implemented in etcd, CockroachDB, and many production Raft libraries. It greatly reduces election disruptions caused by isolated nodes.
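The two conditions above can be sketched as a pure function that never changes the responder's state. This is a simplified illustration, not the logic of any particular library: `PeerView`, `grants_pre_vote`, and the 150ms threshold are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: answering a pre-vote must not mutate state
class PeerView:
    current_term: int
    last_log_index: int
    last_log_term: int
    ms_since_leader_contact: float


def grants_pre_vote(peer, proposed_term, last_log_index, last_log_term,
                    min_election_timeout_ms=150.0):
    """Would this peer vote for the candidate if it really campaigned?"""
    if peer.ms_since_leader_contact < min_election_timeout_ms:
        return False  # peer still hears from a live leader
    if proposed_term <= peer.current_term:
        return False  # the real election would not be in a newer term
    # Up-to-date check: compare (last term, last index) lexicographically.
    return (last_log_term, last_log_index) >= (peer.last_log_term, peer.last_log_index)


def should_start_real_election(peers, proposed_term, last_log_index, last_log_term):
    """Proceed only if a majority (counting ourselves) would vote yes."""
    cluster_size = len(peers) + 1
    yes = 1 + sum(
        grants_pre_vote(p, proposed_term, last_log_index, last_log_term)
        for p in peers
    )
    return yes > cluster_size // 2


healthy = [PeerView(5, 10, 5, ms_since_leader_contact=20.0)] * 4
leaderless = [PeerView(5, 10, 5, ms_since_leader_contact=400.0)] * 4

print(should_start_real_election(healthy, 6, 10, 5))     # peers still have a leader
print(should_start_real_election(leaderless, 6, 10, 5))  # leader is gone
```

A node that was merely isolated gets refused by peers that still hear heartbeats, so it never increments its real term and cannot unseat the leader when it reconnects.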
Leader Election in Production: etcd
etcd uses Raft to maintain consistent state for Kubernetes. Every kube-apiserver write, and every linearizable read, is coordinated through the etcd leader.
| etcd Parameter | Value | Effect |
|---|---|---|
| Heartbeat interval | 100ms | Leader-to-follower heartbeat frequency |
| Election timeout | 1000ms | Follower waits this long before starting election |
| Election timeout range | Randomized per node, between 1x and 2x the election timeout | Prevents split votes |
| Snapshot interval | Every 5 minutes or 10K ops | Compacts Raft log |
| Learner nodes | Supported since etcd 3.4 | Non-voting members that catch up on the log before being promoted to voters |
In Kubernetes, the kube-apiserver is stateless, so any apiserver instance can serve requests. Leadership lives in etcd, which ensures a single leader at any time. If the etcd leader fails, control plane writes are briefly unavailable until a new election completes (roughly 1-2 seconds with the defaults above).