etcd

Raft-Based KV Store, MVCC bbolt Backend, Watch API — The Database Underneath Kubernetes

etcd is a strongly consistent, distributed key-value store designed for configuration and metadata, not user data. Born at CoreOS in 2013, used by Kubernetes since its first release in 2014, and later donated to the CNCF, etcd is now the database underneath nearly every Kubernetes cluster: every Pod, Service, ConfigMap, and Secret lives there. The architecture is intentionally narrow: one Raft group, one MVCC store on a bbolt B+tree, a watch stream, and gRPC. No sharding, no SQL, no complex queries; just a watchable, transactional key range that can survive the loss of a minority of nodes.

Architecture

A single Raft group writes to a B+tree backing store. Watches stream changes to clients.

Key Numbers

8 GB

Suggested maximum DB size · default quota is 2 GB (--quota-backend-bytes)

10K writes/s

Single-cluster ceiling on NVMe SSD with 3 nodes

~10ms

Median write latency (consensus + fsync)

3 / 5

Typical voting cluster size

1.5 MB

Default max request size, which caps a single value (--max-request-bytes; raise carefully)

100K-200K

Watch streams supported per node

~3×

Write amplification: WAL + B+tree + snapshot

1. Raft, but Concrete

etcd's Raft library is one of the most widely used Raft implementations. It powers etcd itself and CockroachDB, and TiKV uses a Rust port of it.

Each etcd write goes through the Raft pipeline:

# Put("foo", "bar") on the leader:
1. Append to in-memory Raft log
2. Persist log entry to WAL on disk (fsync)
3. Send AppendEntries to all followers
4. Wait for majority to ack (ack means: persisted to follower's WAL)
5. Update commitIndex
6. Apply to MVCC store (bbolt write transaction)
7. Reply to client

# Read modes:
# - Linearizable (default): leader confirms it's still leader via heartbeat round
# - Serializable: read directly from local node, may be stale

The fsync on the WAL is the single biggest latency contributor. An SSD (ideally NVMe, or one with a battery-backed write cache) is essential; on spinning disks, fsync latency is high enough to cause missed heartbeats and leader elections.
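Steps 4 and 5 of the pipeline above can be sketched as a small function. This is a minimal illustration, not etcd's code: `commitIndex` and the `matchIndex` slice are hypothetical names, and the sketch leaves out the Raft rule that a leader only commits entries from its own term.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index that is durable on a majority
// of voters. matchIndex holds, per voter (leader included), the last log
// index known to be persisted in that voter's WAL.
func commitIndex(matchIndex []uint64) uint64 {
	if len(matchIndex) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), matchIndex...)
	// Sort descending so that sorted[k] is durable on at least k+1 voters.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// Leader and one follower at index 7, one follower lagging at 4:
	// index 7 is on 2 of 3 voters, so it is committed.
	fmt.Println(commitIndex([]uint64{7, 7, 4})) // 7
	// Five voters, only two have index 9: the majority point is 6.
	fmt.Println(commitIndex([]uint64{9, 9, 6, 5, 3})) // 6
}
```

The slowest minority never appears in the result, which is why one slow follower does not slow down writes but a slow leader disk does.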

Deeper dive: Raft and leader election.

2. MVCC and bbolt

etcd's MVCC store keeps every revision of every key until compaction, indexed by a monotonically increasing revision integer. The backing store is bbolt (a fork of BoltDB): a single-file B+tree with copy-on-write semantics.

# Internal layout (simplified):
# bbolt has multiple "buckets" (one B+tree per bucket name)
#
# bucket "key":   (revision, sub_revision) → KeyValue
# bucket "meta":  consistent index, compaction revisions
# bucket "lease": lease id → lease metadata
# bucket "auth":  auth state (users and roles have their own buckets)
#
# in-memory index (rebuilt on startup): user key → sorted revisions

# Each KV record in the "key" bucket:
#   key:   (revision, sub_revision)        # globally ordered
#   value: KeyValue{key, value, lease, version, mod_revision, create_revision}

# Reading "foo" at revision N:
#   walk the in-memory revisions index for "foo"
#   find the largest revision <= N
#   fetch that KeyValue from the "key" bucket

This retained history is the foundation of the watch API. Compaction (manual via etcdctl compact, or automatic when --auto-compaction-retention is set; the Kubernetes apiserver also requests one every 5 minutes by default) drops revisions older than a watermark. A watch or read that asks for a revision below the watermark fails with ErrCompacted.
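The read path and the effect of compaction can be sketched with a toy in-memory store. Everything here (`store`, `getAt`, `compact`, `errCompacted`) is a hypothetical illustration rather than etcd's implementation, and deletes (tombstones) are left out to keep it short.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

var errCompacted = errors.New("required revision has been compacted")

// store is a toy MVCC store: index maps a user key to its ascending list of
// revisions, and records maps a revision to the value written at it.
type store struct {
	index     map[string][]int64
	records   map[int64]string
	rev       int64 // latest revision
	compacted int64 // reads below this revision are rejected
}

func newStore() *store {
	return &store{index: map[string][]int64{}, records: map[int64]string{}}
}

// put writes a new revision of key and returns its revision number.
func (s *store) put(key, value string) int64 {
	s.rev++
	s.index[key] = append(s.index[key], s.rev)
	s.records[s.rev] = value
	return s.rev
}

// getAt returns the value of key as of revision at.
func (s *store) getAt(key string, at int64) (string, error) {
	if at < s.compacted {
		return "", errCompacted
	}
	revs := s.index[key]
	// i is the first position whose revision is greater than at.
	i := sort.Search(len(revs), func(i int) bool { return revs[i] > at })
	if i == 0 {
		return "", fmt.Errorf("key %q not found at revision %d", key, at)
	}
	return s.records[revs[i-1]], nil
}

// compact keeps, for each key, only the newest revision at or below rev
// plus everything after it, and records rev as the watermark.
func (s *store) compact(rev int64) {
	for key, revs := range s.index {
		i := sort.Search(len(revs), func(i int) bool { return revs[i] > rev })
		if i <= 1 {
			continue // nothing superseded at or below rev
		}
		for _, old := range revs[:i-1] {
			delete(s.records, old)
		}
		s.index[key] = append([]int64(nil), revs[i-1:]...)
	}
	s.compacted = rev
}

func main() {
	s := newStore()
	s.put("foo", "v1") // revision 1
	s.put("bar", "x")  // revision 2
	s.put("foo", "v2") // revision 3

	v, _ := s.getAt("foo", 2)
	fmt.Println(v) // v1

	s.compact(3)
	_, err := s.getAt("foo", 2)
	fmt.Println(err) // required revision has been compacted
	v, _ = s.getAt("foo", 3)
	fmt.Println(v) // v2
}
```

A watcher that was last caught up at revision 2 is in the same position as the failed read: it has to re-list at a current revision and start a new watch from there.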

3. The Watch API

Clients can subscribe to a key range and receive every change in order. This is the core feature that makes etcd a control plane database — Kubernetes apiserver uses it to stream resource changes to controllers.

// gRPC Watch API
service Watch {
  rpc Watch(stream WatchRequest) returns (stream WatchResponse);
}

// WatchRequest:
message WatchCreateRequest {
  bytes key = 1;
  bytes range_end = 2;          // for prefix watch
  int64 start_revision = 3;     // resume from here
  bool prev_kv = 6;             // include previous value
}

// WatchResponse:
message Event {
  enum EventType { PUT = 0; DELETE = 1; }
  EventType type = 1;
  KeyValue kv = 2;              // current state
  KeyValue prev_kv = 3;         // before the change
}

Watches are implemented as in-memory subscribers fed from the MVCC store as writes are applied. A watch on a single key costs a few KB of memory, so thousands of watches per node is normal. Watches also survive the loss of one server: the client reconnects to another member and re-creates the watch with start_revision set just past the last event it received.

4. Leases

A lease is a time-to-live (TTL) grant that can be attached to one or more keys. When the lease expires (because the client stopped sending keepalives) or is revoked, all attached keys are deleted.

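// Common pattern: leader election
// (fragment: assumes an enclosing function that returns error)
lease, err := client.Grant(ctx, 10) // 10-second TTL
if err != nil {
    return err
}
if _, err := client.Put(ctx, "/leader", "node-A", clientv3.WithLease(lease.ID)); err != nil {
    return err
}

// Heartbeat in background:
go func() {
    for {
        if _, err := client.KeepAliveOnce(ctx, lease.ID); err != nil {
            return // lease lost or ctx cancelled: stop acting as leader
        }
        time.Sleep(3 * time.Second)
    }
}()

// If node-A crashes, the lease expires after 10s and /leader is auto-deleted.
// Other candidates can detect this via a watch and try to acquire.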

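This is the primitive behind etcd's own election and mutex helpers (the clientv3 concurrency package). Kubernetes leader election follows the same TTL-and-renew pattern, but through Lease objects in the Kubernetes API rather than etcd leases directly. The cost: grant and revoke are writes through Raft, and every keepalive is a round trip to the leader, so very tight TTLs (sub-second) are expensive and expire on any brief stall.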

5. Transactions (Compare-and-Swap)

etcd's transaction API is intentionally minimal: a list of comparisons + then-actions + else-actions. No nested transactions, no joins, no SQL.

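// Atomic compare-and-swap: create /lock only if it does not exist yet.
// Version is 0 for a key that was never written or has been deleted.
// (fragment: assumes an enclosing function that returns error)
resp, err := client.Txn(ctx).
    If(clientv3.Compare(clientv3.Version("/lock"), "=", 0)).
    Then(clientv3.OpPut("/lock", "owner", clientv3.WithLease(lease.ID))).
    Else(clientv3.OpGet("/lock")).
    Commit()
if err != nil {
    return err // resp is nil on error
}

if resp.Succeeded {
    // Got the lock
} else {
    // Read the current owner from the Else result
    owner := resp.Responses[0].GetResponseRange().Kvs[0].Value
    _ = owner
}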

This is enough to implement distributed locking, leader election, semaphores, and barriers. Concurrency is optimistic — clients re-read on contention rather than blocking. The transaction is one Raft round, even with many comparisons and actions.
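The optimistic behaviour is easy to see in a toy model. The sketch below is a hypothetical in-memory stand-in (the `kv` type and `putIfAbsent` are names invented for this example) in which a mutex plays the role of the single Raft apply step: eight clients race for the same key and exactly one wins.

```go
package main

import (
	"fmt"
	"sync"
)

// entry holds a value and its version; version 0 means "does not exist".
type entry struct {
	value   string
	version int64
}

// kv is a toy store whose mutex stands in for etcd's serialized apply step.
type kv struct {
	mu   sync.Mutex
	data map[string]entry
}

// putIfAbsent mirrors If(Version(key) == 0).Then(Put).Else(Get): it either
// creates the key or reports the value that is already there.
func (s *kv) putIfAbsent(key, value string) (succeeded bool, current string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e := s.data[key]; e.version != 0 {
		return false, e.value
	}
	s.data[key] = entry{value: value, version: 1}
	return true, value
}

func main() {
	s := &kv{data: map[string]entry{}}
	results := make([]bool, 8)

	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i], _ = s.putIfAbsent("/lock", fmt.Sprintf("client-%d", i))
		}(i)
	}
	wg.Wait()

	winners := 0
	for _, ok := range results {
		if ok {
			winners++
		}
	}
	fmt.Println("winners:", winners) // winners: 1
}
```

The losers are not blocked; they get the current owner back immediately and decide for themselves whether to retry, watch the key, or give up.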

6. Performance Characteristics

Workload	NVMe, 3 nodes (baseline)	With 1 ms network latency	With 10 ms network latency
Sequential 256B Put	~10K/sec	~5K/sec	~500/sec
Single-key Get (linearizable)	~30K/sec	~15K/sec	~3K/sec
Single-key Get (serializable)	~80K/sec	~80K/sec	~80K/sec
Range Get (1000 keys)	~500/sec	~300/sec	~50/sec
Watch creation	~5K/sec	~5K/sec	~1K/sec

The dominant factor is fsync latency on the leader, which is bounded by the storage device. NVMe gives ~100µs per fsync; SATA SSD ~1-2ms; spinning disk >5ms (untenable).

Write amplification is roughly 3×: WAL append + bbolt B+tree update + periodic snapshot. A sustained 1 MB/s of user data therefore becomes roughly 3 MB/s written to disk.

7. Comparison with Consul and ZooKeeper

	etcd	Consul	ZooKeeper
Consensus	Raft	Raft	Zab
API	gRPC	HTTP + DNS	Custom binary protocol
Storage backend	bbolt B+tree	BoltDB	Custom (in-memory + WAL)
Data model	KV with revisions	KV + service catalog	Hierarchical znodes
Watches	Streaming watches	Long-poll blocking queries	One-shot triggers
Leases	First-class	Sessions	Ephemeral nodes
Heaviest user	Kubernetes	HashiCorp stack, service mesh	Hadoop, Kafka (legacy)

etcd won the Kubernetes ecosystem because of three concrete advantages: streaming watches (vs ZooKeeper's one-shot triggers), an HTTP and later gRPC API that needs no custom wire protocol, and tighter integration with Go. Consul has more service-discovery features built in; ZooKeeper is older and slower-evolving.

8. Compaction and Defragmentation

etcd's MVCC store keeps every revision until compacted. There are two distinct operations that get confused:

Compaction (logical): drop old revisions from the MVCC index. After compaction, you can no longer query at those revisions or watch from before them. Frees pages inside the file for reuse; doesn't shrink the file.
Defragmentation (physical): reclaim free pages in the bbolt B+tree by rewriting the file. Required to actually shrink disk usage after compaction. Runs on a live member but blocks that member's reads and writes while the file is rebuilt, so defragment one member at a time.

# Compaction: discard history older than <rev> (for example, current revision - 100000)
etcdctl compact <rev>

# Defrag: rewrite the bbolt file (per node, not cluster-wide)
etcdctl defrag --endpoints=<node1>
etcdctl defrag --endpoints=<node2>
etcdctl defrag --endpoints=<node3>

# Common pattern: periodic auto-compact + nightly defrag

Without compaction, a Kubernetes cluster's etcd grows unbounded. Without defragmentation, even a compacted etcd can stay close to its quota and trigger alarms. Both are required for stable operation.
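The difference between the two operations is easiest to see as page accounting. The `backend` type below is a hypothetical model written for this example, not bbolt's API: compaction moves pages to a freelist, and only defragmentation gives them back to the filesystem.

```go
package main

import "fmt"

// backend models a bbolt file as fixed-size pages that are either in use
// or on the freelist. Free pages still count toward the file size.
type backend struct {
	pageSize  int64
	usedPages int64
	freePages int64
}

// fileSize is what the filesystem (and the quota check) sees.
func (b *backend) fileSize() int64 { return (b.usedPages + b.freePages) * b.pageSize }

// inUse is the space occupied by live data.
func (b *backend) inUse() int64 { return b.usedPages * b.pageSize }

// compact releases pages that held old revisions onto the freelist.
func (b *backend) compact(pages int64) {
	if pages > b.usedPages {
		pages = b.usedPages
	}
	b.usedPages -= pages
	b.freePages += pages
}

// defrag rewrites the file with only the live pages.
func (b *backend) defrag() { b.freePages = 0 }

func main() {
	b := &backend{pageSize: 4096, usedPages: 1000}
	fmt.Println(b.fileSize(), b.inUse()) // 4096000 4096000

	b.compact(600)
	fmt.Println(b.fileSize(), b.inUse()) // 4096000 1638400

	b.defrag()
	fmt.Println(b.fileSize(), b.inUse()) // 1638400 1638400
}
```

The gap between the two numbers after compaction is the space a defrag would reclaim. etcd exposes the same pair as the metrics etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes, so a growing gap is a reasonable defrag trigger.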

9. Why Kubernetes Picked etcd

The Kubernetes design notes (mid-2014) considered ZooKeeper, Consul, and etcd for the central state store. etcd won because:

Streaming watches: ZooKeeper's one-shot triggers required a round trip to re-register after every change; etcd's watch (a gRPC stream since the v3 API) pushes events in revision order.
HTTP / gRPC API: ZooKeeper used a custom binary protocol that needed a per-language client. etcd spoke plain HTTP and JSON at the time, and the v3 gRPC API that replaced it works from Go, Python, Java, and Rust through standard gRPC tooling.
MVCC and revision-based reads: lets controllers do "list at revision N, then watch from N+1" — the cornerstone of the apiserver's reflector cache.
Go ecosystem: kubelet, kube-controller-manager, and etcd all in Go made integration easier and dependency closure simpler.
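The "list at revision N, then watch from N+1" pattern from the list above can be sketched with a toy event log. The `history` type and its methods are hypothetical and hold everything in memory with no compaction; the point is only that the list revision is the hand-off between the snapshot and the stream, so no change is missed or applied twice.

```go
package main

import "fmt"

// event is one change to one key at a given revision.
type event struct {
	rev     int64
	key     string
	value   string
	deleted bool
}

// history is a toy, uncompacted event log; revision r is events[r-1].
type history struct{ events []event }

func (h *history) put(key, value string) {
	h.events = append(h.events, event{rev: int64(len(h.events)) + 1, key: key, value: value})
}

func (h *history) del(key string) {
	h.events = append(h.events, event{rev: int64(len(h.events)) + 1, key: key, deleted: true})
}

// apply folds events into a key-to-value cache.
func apply(cache map[string]string, evs []event) {
	for _, e := range evs {
		if e.deleted {
			delete(cache, e.key)
		} else {
			cache[e.key] = e.value
		}
	}
}

// list returns the full state and the revision it reflects.
func (h *history) list() (map[string]string, int64) {
	state := map[string]string{}
	apply(state, h.events)
	return state, int64(len(h.events))
}

// watchFrom returns every event with revision >= from, in order.
func (h *history) watchFrom(from int64) []event {
	if from < 1 {
		from = 1
	}
	if from > int64(len(h.events)) {
		return nil
	}
	return h.events[from-1:]
}

func main() {
	h := &history{}
	h.put("pod/a", "Pending")
	h.put("pod/b", "Running")

	cache, rev := h.list() // snapshot at revision 2

	// Changes that land after the list call.
	h.put("pod/a", "Running")
	h.del("pod/b")

	apply(cache, h.watchFrom(rev+1)) // replay revisions 3 and 4
	fmt.Println(cache)               // map[pod/a:Running]
}
```

If revision N+1 had already been compacted away, the watch would fail instead of silently skipping events, and the client would start over with a fresh list.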

The cost: every Kubernetes scaling improvement eventually hits etcd's single-Raft-group ceiling. Sharding etcd has been discussed at SIG Scalability but never landed — the consensus is that "split apiserver into multiple stores" is the better path forward.

10. Failure Modes in the Wild

Disk slow → leader stalls → election storm: a slow disk on the leader causes fsync to block; followers time out and start elections. New leader has the same disk issue. Fix: alarm on disk latency, not just availability.
Quota exceeded → cluster is read-only: etcd raises a NOSPACE alarm and writes fail with "mvcc: database space exceeded" until you compact, defragment, and clear the alarm with etcdctl alarm disarm. Raising the quota first is optional, and only needed when the live data itself no longer fits.
Watch starvation: thousands of watches with slow consumers can pile up unsent events. The server eventually evicts them. Fix: each watcher must drain its stream promptly.
Network partition causing minority leader: rare, but a leader stranded on the minority side may briefly serve stale serializable reads until it notices the lost quorum and steps down. Linearizable reads (the default, implemented with ReadIndex) need a quorum round and fail instead of returning stale data.
Snapshot transfer bottleneck: a follower far behind needs an InstallSnapshot, which can be many GBs in Kubernetes. While transferring, the follower is unavailable.

11. Operational Best Practices

NVMe storage on every voting node. Spinning disks or network-attached storage cause unpredictable fsync latency and election storms.
Dedicated network between cluster members where possible. Cross-AZ etcd is fine; cross-region is risky for write latency.
Auto-compact every 1-5 hours via --auto-compaction-retention. Pair with scheduled defragmentation (nightly or weekly, one member at a time).
Back up at least daily. etcdctl snapshot save produces a consistent point-in-time snapshot from any node; prefer it to copying the data directory of a running member.
Monitor: etcd_disk_wal_fsync_duration_seconds (p99 should stay below 10 ms, and well under 1 ms on NVMe), etcd_server_leader_changes_seen_total (a counter that should stay flat in steady state), etcd_mvcc_db_total_size_in_bytes (track for capacity planning).
Separate API and peer ports: client requests on 2379, peer-to-peer Raft traffic on 2380. Firewall accordingly.

The etcd team publishes a sizing guide that maps cluster size to recommended hardware. Most Kubernetes-attached etcd clusters need only modest hardware (4 cores, 8 GB RAM, 100 GB NVMe per node) until the cluster gets very large.

Tradeoffs

Pro	Con
Strong consistency by default	Single Raft group caps throughput
Mature, used in nearly every Kubernetes cluster	Tight DB size limits (2 GB default quota, ~8 GB suggested max)
Watches are the killer feature	Watch resumption can fail after compaction
Simple data model — easy to reason about	No SQL or secondary indexes
Predictable latency under stable cluster	fsync-bound: storage choice dominates

FAQ

How big can an etcd cluster get?

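By data: the default quota is 2 GB, and the etcd documentation suggests about 8 GB as the practical maximum; larger quotas are possible but make snapshots and follower catch-up slow. By node count: 5 voting members is the sweet spot; 7 is acceptable; beyond that, election convergence and AppendEntries fanout become problematic.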

Why does Kubernetes hammer etcd so hard?

Every Pod, Service, Endpoints object, ConfigMap, and Secret is a key in etcd. The apiserver streams watches to dozens of controllers, each running list-watch loops. A single 5K-pod cluster can produce thousands of etcd writes/sec under churn, a large share of the ~10K writes/sec ceiling.

What's the deal with the 8 GB limit?

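8 GB is the suggested maximum, not a hard limit; the default quota is 2 GB. The quota exists to keep the database from growing without bound when compaction falls behind. You can raise it via --quota-backend-bytes, but at some point Raft snapshots take minutes and follower catch-up after restart suffers.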

Can I use etcd as a general-purpose KV store?

Don't. It's optimized for control-plane workloads (low write rate, high read concurrency, mandatory consistency). For high-throughput KV, use DynamoDB, Cassandra, or RocksDB-based stores.

What happens when the leader's disk fills up?

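The usual trigger is the backend quota rather than the physical disk: etcd raises a NOSPACE alarm, writes fail with "mvcc: database space exceeded", and the cluster is effectively read-only. Manual compaction + defrag, followed by etcdctl alarm disarm, recovers it. Production setups alarm on bbolt size growth and rotate snapshots more aggressively.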

Why three nodes, not one?

One node has no fault tolerance. Three tolerates one failure. Five tolerates two. The math: f failures need 2f+1 nodes. Even-numbered clusters have the same tolerance as the odd one below them, so 4 ≡ 3 in usefulness — odd is canonical.
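The 2f+1 rule can be checked directly. A minimal sketch, with `quorum` and `tolerated` as hypothetical helper names:

```go
package main

import "fmt"

// quorum is the number of voters that must acknowledge a write.
func quorum(n int) int { return n/2 + 1 }

// tolerated is how many voters can fail while a quorum is still reachable.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	fmt.Println("nodes quorum tolerated")
	for n := 1; n <= 7; n++ {
		fmt.Printf("%5d %6d %9d\n", n, quorum(n), tolerated(n))
	}
	// 3 and 4 nodes both tolerate 1 failure; 5 and 6 both tolerate 2.
}
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without raising the number of tolerated failures, so the fourth node adds replication cost and one more thing that can break, with no availability gain.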