Raft Leader Election

How Distributed Nodes Agree on a Single Leader — and What Happens When It Crashes

In Raft, all writes flow through a single leader. If the leader crashes, the cluster must elect a new one before it can accept writes again. Leader election is the mechanism that keeps Raft available, and it is also the source of its most common failure modes. The key insight: randomized election timeouts make split votes rare and let the cluster converge quickly even under failure.

Leader Election Visualizer

The election cycle: follower timeout → candidate → vote request → leader elected.

Stable state: the leader sends periodic heartbeats to followers.

Node States

Node	State	Term
A	Leader	T2
B	Follower	T2
C	Follower	T2

Election Timeout: The Randomization Trick

Raft avoids split votes using randomized timeouts — the key design choice that makes leader election both simple and robust.

Heartbeat Interval

The leader sends AppendEntries heartbeats at a fixed interval that must be well below the election timeout, for example 50ms against the 150–300ms timeout range below. As long as followers receive heartbeats within their election timeout, they stay followers.

Election Timeout

Each follower picks a random timeout in the range 150–300ms (or 150–500ms in some configs). When the timeout fires with no heartbeat, the follower becomes a candidate.

import random
# Typical Raft election timeout
election_timeout = random.uniform(150, 300)  # ms
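A follower's timer can be sketched as follows. The `ElectionTimer` class, its method names, and the injected `now_ms` argument are illustrative assumptions rather than part of any Raft library; the 150–300ms range is the one quoted above.

```python
import random


class ElectionTimer:
    """Hypothetical follower timer; all times are in milliseconds."""

    def __init__(self, low_ms=150, high_ms=300, rng=None):
        self.low_ms = low_ms
        self.high_ms = high_ms
        self.rng = rng if rng is not None else random.Random()
        self.deadline_ms = None

    def reset(self, now_ms):
        # Called at startup and whenever a valid heartbeat arrives.
        # A fresh random timeout is drawn on every reset.
        self.deadline_ms = now_ms + self.rng.uniform(self.low_ms, self.high_ms)

    def expired(self, now_ms):
        # True once the deadline passes with no heartbeat: time to become a candidate.
        return self.deadline_ms is not None and now_ms >= self.deadline_ms


timer = ElectionTimer(rng=random.Random(1))
timer.reset(now_ms=0)
still_follower = not timer.expired(now_ms=100)     # 100ms is below the minimum timeout
timer.reset(now_ms=100)                            # heartbeat received at t=100
starts_election = timer.expired(now_ms=100 + 300)  # silence for the maximum timeout
```

Passing the clock in as an argument keeps the sketch deterministic; a real node would read a monotonic clock instead.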

Why Randomization Works

If all followers had the same timeout, they'd all become candidates simultaneously → split vote. Randomization ensures that with high probability, one node's timeout fires first, giving it a head start to collect a majority before others wake up.
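The head-start argument can be checked with a small Monte Carlo sketch. It assumes four followers, a 10ms vote round trip, and calls an election "contested" when the two earliest timers fire within one round trip of each other; these numbers and the function name `contested_fraction` are illustrative, not measurements.

```python
import random


def contested_fraction(n_followers, low_ms, high_ms, rtt_ms, trials=10_000, seed=0):
    """Fraction of trials in which a second follower times out before the
    first candidate has had one round trip to collect votes."""
    rng = random.Random(seed)
    contested = 0
    for _ in range(trials):
        timeouts = sorted(rng.uniform(low_ms, high_ms) for _ in range(n_followers))
        if timeouts[1] - timeouts[0] < rtt_ms:
            contested += 1
    return contested / trials


fixed = contested_fraction(4, 150, 150, rtt_ms=10)       # every timer fires at 150ms
randomized = contested_fraction(4, 150, 300, rtt_ms=10)  # timers spread over 150ms
print(f"fixed: {fixed:.2f}, randomized: {randomized:.2f}")
```

Under these assumptions every fixed-timeout election is contested, while the randomized case is contested in only a minority of trials. Widening the timeout range relative to the round-trip time lowers that fraction further.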

Recovery After Partition

When a partition heals, previously isolated nodes rejoin. Any node that sees the new leader's higher term adopts that term and, if it was a leader or candidate, reverts to follower. Uncommitted entries on a former leader that conflict with the new leader's log are overwritten during AppendEntries; only committed entries are guaranteed to survive.
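The way a rejoining node's log is repaired can be sketched with a simplified follower-side AppendEntries handler. The function name, the tuple-based log, and the example commands are assumptions made for illustration; commit-index handling and persistence are omitted.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Simplified follower-side log reconciliation.

    log and entries are lists of (term, command) tuples.
    prev_index is the 1-based index of the entry just before `entries` (0 = none).
    Returns (success, new_log).
    """
    # Consistency check: the follower must hold the entry the leader expects.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False, log
    new_log = list(log)
    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based slot for this entry
        if pos < len(new_log) and new_log[pos][0] != entry[0]:
            # Conflict: drop this entry and everything after it.
            del new_log[pos:]
        if pos >= len(new_log):
            new_log.append(entry)
    return True, new_log


# Old leader A appended "set z" in term 2 but never replicated it to a majority.
old_leader_log = [(1, "set x"), (2, "set y"), (2, "set z")]
# New leader B (term 3) sends its own entry for index 3.
ok, repaired = append_entries(old_leader_log, prev_index=2, prev_term=2,
                              entries=[(3, "set w")])
```

After the call, A's uncommitted term-2 entry at index 3 has been replaced by the new leader's term-3 entry.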

Fixed vs Randomized Timeout

Property	Fixed Timeout	Randomized (Raft)
Split vote probability	High	Very low
Convergence time	Unbounded (repeated split votes can livelock)	Typically one or two election rounds
Implementation	Simple but fragile	Simple and robust
Works under load	No	Yes

RequestVote RPC: The Vote Request

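When a candidate starts an election, it sends RequestVote RPCs to all peers. A receiver grants its vote only if the candidate's term is current, the receiver has not already voted for someone else in that term, and the candidate's log is at least as up-to-date as its own.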

RequestVote Request

// Go struct mirroring the RequestVote arguments in the Raft paper
type RequestVoteRequest struct {
    Term         int  // candidate's current term
    CandidateId  int  // who wants votes
    LastLogIndex int  // index of candidate's last log entry
    LastLogTerm  int  // term of candidate's last entry
}

The candidate increments its term and votes for itself before sending.
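The candidate-side steps can be sketched in Python as below. The `Node` and `RequestVote` dataclasses and the `start_election` helper are hypothetical names that mirror the fields of the struct above; sending the messages and persisting state are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RequestVote:
    term: int
    candidate_id: int
    last_log_index: int
    last_log_term: int


@dataclass
class Node:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    state: str = "follower"
    log: list = field(default_factory=list)  # (term, command) tuples


def start_election(node, peers):
    """Turn `node` into a candidate and build one request per peer."""
    node.current_term += 1            # new term for this election
    node.state = "candidate"
    node.voted_for = node.node_id     # vote for itself before asking others
    last_index = len(node.log)        # 1-based index of the last entry, 0 if empty
    last_term = node.log[-1][0] if node.log else 0
    request = RequestVote(node.current_term, node.node_id, last_index, last_term)
    return {peer: request for peer in peers}


b = Node(node_id=2, current_term=2, log=[(1, "a"), (2, "b")])
outgoing = start_election(b, peers=[1, 3])
```

The same request goes to every peer, so one message object can be shared.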

RequestVote Response

type RequestVoteResponse struct {
    Term        int  // current term (for candidate to update itself)
    VoteGranted bool // true means "I voted for you"
}

If the Term in the response is greater than the candidate's own term, the candidate adopts that term and immediately reverts to follower.

When does a peer grant a vote?

✓ Candidate's term ≥ our current term (candidate is not stale)

✓ We have not voted for a different candidate in this term

✓ Candidate's log is at least as up-to-date as ours: its last entry has a higher term than our last entry, or the same term and an index ≥ our last index

The log up-to-date check ensures the new leader has all committed entries — a safety invariant that prevents data loss during elections.
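The voting rules can be sketched as a single handler. The `Voter` class and `handle_request_vote` function are illustrative names; the sketch assumes 1-based log indices, uses term 0 for an empty log, and omits persistence and election-timer resets.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[int] = None
    log: list = field(default_factory=list)  # (term, command) tuples


def handle_request_vote(voter, term, candidate_id, last_log_index, last_log_term):
    """Return (term, vote_granted), mirroring RequestVoteResponse."""
    if term < voter.current_term:
        return voter.current_term, False   # stale candidate
    if term > voter.current_term:
        voter.current_term = term          # newer term: any earlier vote no longer counts
        voter.voted_for = None
    my_last_index = len(voter.log)
    my_last_term = voter.log[-1][0] if voter.log else 0
    # Compare (last term, last index) pairs: a later last term wins outright,
    # and equal last terms fall back to log length.
    up_to_date = (last_log_term, last_log_index) >= (my_last_term, my_last_index)
    can_vote = voter.voted_for in (None, candidate_id)
    if up_to_date and can_vote:
        voter.voted_for = candidate_id
        return voter.current_term, True
    return voter.current_term, False


v = Voter(current_term=2, log=[(1, "a"), (2, "b")])
first = handle_request_vote(v, term=3, candidate_id=1, last_log_index=1, last_log_term=3)
second = handle_request_vote(v, term=3, candidate_id=5, last_log_index=2, last_log_term=3)
```

The first request is granted even though the candidate's log is shorter, because its last entry has a higher term. The second is denied because the vote for term 3 is already taken.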

Failure Scenarios & Edge Cases

Elections don't always produce a winner on the first try. Raft handles these cases gracefully.

🔨 Leader Crash

All followers miss heartbeats. One becomes a candidate, wins a majority, and becomes the new leader. Writes resume once the new leader is elected.

t+0ms	Leader crashes
t+200ms	B's timer fires first; B becomes candidate
t+250ms	B collects 2 of 3 votes (its own and the other surviving follower's)
t+260ms	B becomes leader, sends first heartbeat

🔀 Split Vote (No Majority)

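Two candidates with overlapping timers each vote for themselves. If neither reaches a majority (for example, in a four-node cluster where each candidate collects two votes), both time out, increment their term, and start a new election with fresh random timeouts. In a three-node cluster the third node's vote usually breaks the tie, as in this timeline:

t+200ms	A and B become candidates simultaneously
t+220ms	A→C: vote granted; B→C: vote denied (C already voted)
t+250ms	A has 2 votes, B has 1; A wins
t+270ms	B reverts to follower, resets timer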
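Counting votes makes the difference between a contested election and a true split vote concrete. The `election_result` helper and the four-node example are assumptions added for illustration.

```python
def election_result(votes, cluster_size):
    """votes maps voter id -> the candidate it voted for (absent voters abstained).
    Returns the winning candidate, or None if nobody reached a majority."""
    majority = cluster_size // 2 + 1
    counts = {}
    for candidate in votes.values():
        counts[candidate] = counts.get(candidate, 0) + 1
    for candidate, n in counts.items():
        if n >= majority:
            return candidate
    return None


# Three nodes: C's single vote breaks the tie between A and B.
three_node = election_result({"A": "A", "B": "B", "C": "A"}, cluster_size=3)
# Four nodes: a 2-2 split leaves nobody with the 3 votes needed.
four_node = election_result({"A": "A", "B": "B", "C": "A", "D": "B"}, cluster_size=4)
```

When the result is `None`, both candidates wait out a new random timeout and retry in a higher term.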

📡 Network Partition

A partition separates the leader from its followers. The leader, now in the minority, can't commit new entries and steps down when it sees a higher term. The majority partition elects its own leader.

Partition	A (leader, T2) isolated from B, C
t+250ms	B becomes candidate, wins the election in the majority partition (T3)
Rejoin	A receives AppendEntries(T3) → reverts to follower

🔃 Stale Term (Old Leader)

A former leader has not yet heard from the new leader. When it sends an RPC with its stale term, the receiver rejects it and replies with the current term, which causes the old leader to step down immediately.

New leader B elected with term 5
Old leader A tries AppendEntries(term=3)
A receives response with term=5 → reverts to follower
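The step-down behaviour follows from one rule applied to every incoming request and response. The sketch below assumes a hypothetical `observe_term` helper that returns the state and term a node should hold after seeing a message.

```python
def observe_term(state, current_term, seen_term):
    """Apply the term rule to any incoming request or response.
    Returns the (state, term) the node should hold afterwards."""
    if seen_term > current_term:
        # Another node is in a newer term: adopt it and step down.
        return "follower", seen_term
    return state, current_term


# Old leader A (term 3) receives a reply carrying term 5.
a_state, a_term = observe_term("leader", 3, 5)
# New leader B (term 5) sees A's stale AppendEntries(term=3) and is unaffected.
b_state, b_term = observe_term("leader", 5, 3)
```

A stale message never changes the receiver's state; the receiver simply rejects it and reports its own term back.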

Pre-Vote: Preventing Unnecessary Disruptions

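Standard Raft has one edge case: a follower that is cut off from the leader keeps timing out and incrementing its term. When it reconnects, its inflated term forces the healthy leader to step down and triggers a disruptive election, even if the returning node cannot win because its log is behind. The Pre-Vote extension fixes this.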

// Pre-Vote: ask "would you vote for me?" before changing term
// If majority responds YES → proceed with real RequestVote
// This prevents term inflation when the node would lose anyway

type PreVoteRequest struct {
    Term          int  // term the candidate would campaign in (not yet adopted)
    CandidateId   int
    LastLogIndex  int
    LastLogTerm   int
}

// A follower responds YES to PreVote if:
// - It would grant a vote in a real election
// - It has not received a heartbeat from a valid leader recently

Candidate → all peers: send PreVote

→ If a majority says YES: increment term, send RequestVote

→ If a majority grants the vote: become leader

Pre-Vote is not part of the original Raft paper; it is described in Diego Ongaro's dissertation and is implemented in etcd, CockroachDB, and other production Raft libraries. It greatly reduces election disruptions caused by isolated nodes.
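The receiver's side of Pre-Vote can be sketched as a pure function that never modifies local state. The function name, its parameters, and the use of the minimum election timeout as the "leader recently heard from" threshold are assumptions for illustration, not the behaviour of a specific library.

```python
def handle_pre_vote(current_term, my_last_log, ms_since_heartbeat,
                    min_election_timeout_ms,
                    req_term, req_last_log_index, req_last_log_term):
    """Answer "would you vote for me?" without changing term or voted_for.
    my_last_log is (last_log_term, last_log_index);
    req_term is the term the candidate would campaign in."""
    if req_term < current_term:
        return False
    # Same up-to-date comparison as a real RequestVote.
    log_ok = (req_last_log_term, req_last_log_index) >= my_last_log
    # Refuse while a leader still appears to be alive.
    leader_alive = ms_since_heartbeat < min_election_timeout_ms
    return log_ok and not leader_alive


# A rejoining node probes a healthy follower that heard a heartbeat 40ms ago.
healthy_cluster = handle_pre_vote(5, (5, 10), 40, 150, 6, 10, 5)
# The same probe after the leader has been silent for 400ms.
leader_gone = handle_pre_vote(5, (5, 10), 400, 150, 6, 10, 5)
```

Because the rejoining node receives no YES answers while the leader is healthy, it never increments its term and the leader stays in place.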

Leader Election in Production: etcd

etcd uses Raft to maintain consistent state for Kubernetes. Every kube-apiserver write goes through the etcd leader, and linearizable reads are confirmed with it.

Member	Role	Responsibility
etcd-0	LEADER	WAL writer · MVCC store; replicates entries to followers
etcd-1	FOLLOWER	Applies committed entries
etcd-2	FOLLOWER	Applies committed entries

etcd Parameter	Value	Effect
Heartbeat interval	100ms	Leader-to-follower heartbeat frequency
Election timeout	1000ms	Follower waits this long before starting election
Election timeout randomization	Each node picks a value between 1× and 2× the election timeout	Makes split votes unlikely
Snapshot trigger	After --snapshot-count applied entries (default depends on the etcd version)	Lets the Raft log be compacted
Learner nodes	Supported since etcd 3.4	Non-voting members that catch up on the log before being promoted to voters

In Kubernetes, the kube-apiserver is stateless, so any apiserver instance can serve requests; consistency comes from etcd, which has at most one leader per term. If the etcd leader fails, control plane writes are briefly unavailable until a new election completes (~1-2 seconds).
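The 1-2 second figure can be reproduced with back-of-the-envelope arithmetic. The sketch assumes each follower waits a randomized timeout between 1× and 2× the configured election timeout, measured from the last heartbeat it received, plus one round trip for voting. The function name and the 10ms round trip are illustrative, and split votes are ignored.

```python
def failover_window_ms(election_timeout_ms, heartbeat_ms, rtt_ms):
    """Rough bounds on the leaderless time after a leader crash."""
    # Best case: the crash came just before the next heartbeat was due, and
    # some follower drew the shortest possible timeout.
    earliest = election_timeout_ms - heartbeat_ms + rtt_ms
    # Worst case: the crash came right after a heartbeat, and every follower
    # drew a timeout close to twice the base value.
    latest = 2 * election_timeout_ms + rtt_ms
    return earliest, latest


# Values from the table above: 1000ms election timeout, 100ms heartbeat.
earliest, latest = failover_window_ms(1000, 100, rtt_ms=10)
```

With these inputs the window runs from roughly 0.9 to 2.0 seconds, which is consistent with the estimate in the text.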
