Replication
Leaders, Followers, and the In-Sync Replica Set
Every Kafka partition has one leader and zero or more followers. By default, producers and consumers talk only to the leader. Followers replicate by fetching from the leader using the same protocol as a consumer — there is no separate replication network.
The set of replicas that are caught up to the leader is the In-Sync Replica set (ISR).
ISR membership is the basis for durability: with acks=all, the leader waits for every
ISR member to acknowledge before responding. The trick is that ISR is dynamic — followers can fall
out and rejoin — so the meaning of "all" varies in real time.
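On the producer side, that contract is a single config. A minimal sketch in Java (broker address and topic name are placeholders):

// Minimal producer sketch for the acks=all contract described above.
// Broker address and topic name are illustrative, not prescriptive.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all" means every *current* ISR member must ack, however many that is.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-123", "created"));
        }
    }
}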
The High Watermark
The HW is the offset up to which all ISR members have replicated. Consumers can only read records below the HW.
# Each replica tracks two offsets per partition:
LEO = Log End Offset — next offset to write
HW = High Watermark — last offset that is "committed" (replicated to all ISR)
# Leader after each follower fetch:
HW = min(LEO across ISR members, including the leader)
# Example progression:
t=0 leader: LEO=100, HW=100, ISR=[1,2,3]
follower 2: LEO=100, follower 3: LEO=100
t=1 producer sends 5 records
leader: LEO=105, HW=100 (followers haven't fetched yet)
t=2 follower 2 fetches up to 105
leader: LEO=105, HW=100 (still waiting on follower 3)
follower 2: LEO=105
t=3 follower 3 fetches up to 105
leader: LEO=105, HW=105 <-- now committed
consumers can read records 100..104
ISR Membership: replica.lag.time.max.ms
A follower is removed from the ISR if it has not fetched from the leader within this window, or if it has gone the entire window without catching up to the leader's LEO.
# Check ISR for a topic
$ kafka-topics.sh --bootstrap-server kafka-1:9092 --describe --topic orders
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: orders Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3 <-- broker 1 lagging
Topic: orders Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
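The same check can be done programmatically; a sketch with the Java AdminClient, mirroring the shell example above:

// Programmatic version of the --describe check above; a sketch.
// allTopicNames() requires a 3.1+ client; older clients expose the
// same data via all(). Assumes each partition currently has a leader.
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrDescribe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            // Flag any partition whose ISR has shrunk below the replica count.
            desc.partitions().forEach(p -> System.out.printf(
                    "partition %d leader=%d replicas=%d isr=%d%s%n",
                    p.partition(), p.leader().id(),
                    p.replicas().size(), p.isr().size(),
                    p.isr().size() < p.replicas().size() ? "  <-- lagging" : ""));
        }
    }
}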
# Two ways a follower exits ISR:
# 1. Has not sent a FetchRequest in >30s (broker is slow or down)
# 2. Has been continuously behind leader's LEO for >30s (broker is catching up but slowly)
# ISR change is recorded in __cluster_metadata (KRaft) or via controller (ZK)
# Leader sends ISR shrink request when condition is met
min.insync.replicas: The Durability Floor
# Set per topic at creation
$ kafka-topics.sh --bootstrap-server kafka-1:9092 --create \
--topic orders \
--replication-factor 3 \
--partitions 12 \
--config min.insync.replicas=2
# Or alter existing topic
$ kafka-configs.sh --bootstrap-server kafka-1:9092 \
--entity-type topics --entity-name orders \
--alter --add-config min.insync.replicas=2
# Behavior with min.insync.replicas=2 + acks=all:
# ISR=3: writes succeed, wait for 3 acks
# ISR=2: writes succeed, wait for 2 acks
# ISR=1: writes FAIL with NotEnoughReplicasException
The standard production setup is replication.factor=3, min.insync.replicas=2, acks=all. This tolerates one broker failure with full durability and surfaces a problem when the cluster degrades to one replica. The application must be prepared to handle NotEnoughReplicasException — typically by retrying or by surfacing the failure rather than silently losing durability.
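What "prepared to handle" can look like in practice, as a sketch (the handling strategy here is illustrative, not prescriptive):

// Sketch of surfacing NotEnoughReplicasException instead of losing data
// silently; class name and handling strategy are illustrative.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;

public class GuardedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // The error is retriable: the client retries internally and only fails
        // the send once delivery.timeout.ms (default 2 min) expires.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "k", "v"), (metadata, e) -> {
                if (e instanceof NotEnoughReplicasException) {
                    // ISR stayed below min.insync.replicas for the whole retry
                    // window: alert loudly rather than dropping the record.
                    System.err.println("Durability degraded, write rejected: " + e);
                } else if (e != null) {
                    System.err.println("Send failed: " + e);
                }
            });
        }
    }
}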
Unclean Leader Election
# Disabled by default (correct choice for almost all topics)
unclean.leader.election.enable=false
# Scenario: ISR=[1,2,3], all three brokers crash
# Brokers 2 and 3 come back. Broker 1 has lost its disk.
# Broker 2's last LEO is 50000. Broker 3's last LEO is 49000.
# Broker 1 (when it had data) was at LEO=50500.
# With unclean.leader.election.enable=false:
# Partition stays offline (NoLeader). You must wait for broker 1
# or manually intervene. No data is silently lost.
# With unclean.leader.election.enable=true:
# Broker 2 is elected leader (highest available LEO).
# 500 messages (from offset 50000 to 50499) are silently lost.
# Producers and consumers continue without error.
Use unclean leader election only for topics where you genuinely prefer some loss to a stalled partition (metric streams, non-critical telemetry). Never enable it cluster-wide for a financial system.
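Scoping it to a single topic can be done with kafka-configs.sh or, as sketched here, with the Java AdminClient (the topic name metrics is a placeholder):

// Sketch: enable unclean leader election for one loss-tolerant topic only.
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class AllowUnclean {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "metrics");
            Map<ConfigResource, Collection<AlterConfigOp>> change = Map.of(topic, List.of(
                    new AlterConfigOp(
                            new ConfigEntry("unclean.leader.election.enable", "true"),
                            AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(change).all().get();
        }
    }
}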
Controlled Shutdown
controlled.shutdown.enable=true # default
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
# When the broker is asked to shut down:
# 1. Broker sends ControlledShutdownRequest to controller
# 2. Controller selects new leaders for partitions where this broker leads
# (preferring in-sync followers)
# 3. Controller notifies the broker that all leadership has migrated
# 4. Broker stops processing client requests and exits
# Without controlled shutdown:
# - Broker dies abruptly (kill -9)
# - Controller detects via session timeout (zookeeper.session.timeout.ms in ZK mode, broker.session.timeout.ms in KRaft; ~30s)
# - Forced leader election runs after timeout
# - Clients see >30s of partition unavailability
Replica Placement
# Rack-aware placement (recommended for multi-AZ deployments)
broker.rack=us-east-1a # set per broker
# When creating a topic, Kafka spreads replicas across racks
$ kafka-topics.sh --bootstrap-server kafka-1:9092 --create --topic orders --replication-factor 3 --partitions 12
Topic: orders Partition: 0 Leader: 1 (rack 1a) Replicas: 1,2,3 (1a,1b,1c)
Topic: orders Partition: 1 Leader: 4 (rack 1b) Replicas: 4,5,6 (1b,1c,1a)
# Result: losing one entire rack still leaves 2/3 replicas alive per partition
# Manual reassignment if rack-awareness drifts (broker added/removed)
$ kafka-reassign-partitions.sh --bootstrap-server kafka-1:9092 \
--reassignment-json-file reassign.json --execute
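The file passed via --reassignment-json-file lists the target replica set per partition; a minimal sketch with illustrative broker IDs:

{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [1, 2, 3] },
    { "topic": "orders", "partition": 1, "replicas": [2, 3, 1] }
  ]
}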
Tradeoffs and When Not to Use
Standard durable config
- replication.factor=3, min.insync.replicas=2
- acks=all on producer
- unclean.leader.election.enable=false
- Rack-aware placement across 3 AZs
- Controlled shutdown enabled
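Expressed as code, creating a topic with this setup might look like the following sketch (names and partition count are placeholders):

// The standard durable setup from the list above, expressed as code.
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("orders", 12, (short) 3) // RF=3
                    .configs(Map.of(
                            "min.insync.replicas", "2",
                            "unclean.leader.election.enable", "false"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}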
Anti-patterns
- RF=2 (one failure = unavailable; cannot rebalance)
- RF=3, min.insync.replicas=1 (silent durability loss)
- Cross-region replicas (huge ack latency; use MirrorMaker instead)
- RF=3 on a 3-broker cluster (every broker hosts every partition)
Frequently Asked Questions
What does it mean for a follower to be in-sync?
A replica is in-sync if it has fetched from the leader within replica.lag.time.max.ms (default 30 seconds). The metric is time-based, not offset-based: a follower can be 1000 messages behind and still be in-sync if it last fetched 5 seconds ago. The older offset-based check (replica.lag.max.messages) was removed because a temporary spike in producer rate would unfairly evict followers. Time-based lag is robust to bursty workloads.
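A simplified model of that check (illustrative only, not Kafka's actual broker code):

// Simplified model of the time-based ISR check. Field and method names
// are illustrative; the real broker tracks this per follower replica.
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class IsrCheck {
    static final long REPLICA_LAG_TIME_MAX_MS = 30_000; // replica.lag.time.max.ms

    // For each follower: the last wall-clock time it was fully caught up
    // to the leader's LEO (updated when a fetch reaches the end of the log).
    public static Set<Integer> inSync(Map<Integer, Long> lastCaughtUpTimeMs, long nowMs) {
        return lastCaughtUpTimeMs.entrySet().stream()
                .filter(e -> nowMs - e.getValue() <= REPLICA_LAG_TIME_MAX_MS)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Follower 2 caught up 5s ago (in sync even if offsets lag right now);
        // follower 3 last caught up 45s ago (evicted). Prints: [2]
        System.out.println(inSync(Map.of(2, now - 5_000, 3, now - 45_000), now));
    }
}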
What happens if the ISR shrinks to just the leader?
With min.insync.replicas=1, writes continue with acks=all behaving like acks=1 — durability is silently reduced. With min.insync.replicas=2, the leader rejects writes (NotEnoughReplicasException) until at least one follower catches up. Setting min.insync.replicas=replication.factor-1 is the standard production pattern: you tolerate one failure but refuse to operate with no redundancy.
What is unclean leader election?
If all in-sync replicas die and only out-of-date followers remain, you have a choice: refuse writes (default, unclean.leader.election.enable=false) or elect a stale replica as leader, accepting some message loss. The lost messages are those that were committed (acked) by the original leader but never reached the surviving replica. Unclean election trades durability for availability — only enable it for topics where you genuinely prefer some data loss to a stalled partition.
How does controlled shutdown work?
When a broker shuts down cleanly (kafka-server-stop.sh or SIGTERM), it asks the controller to migrate any leadership it holds to other replicas before stopping. This avoids forced leader elections on shutdown — followers that were already up-to-date take over with zero message loss and minimal client disruption. Without controlled shutdown, the cluster pays the full session timeout before new leaders are elected for every partition the dead broker was leading.
Are there 'observer' replicas like in some other systems?
Vanilla Kafka does not have observers. Confluent's Multi-Region Clusters (MRC) added observer replicas — async followers that do not count toward ISR. Observers are useful for cross-region replication where you want a copy in DC2 without forcing acks=all to wait for cross-region latency. In open-source Kafka, MirrorMaker 2 is the common alternative.