Distributed Lock — Redlock, ZooKeeper, etcd, fencing tokens

Architecture

Capacity Estimation

Metric	Value	Notes
Lock acquisitions/s	10 K–100 K	Redis SETNX ceiling
etcd writes/s	~10 K	3-node Raft
ZooKeeper writes/s	~10 K	same scale
Lease duration	10–30 s	3× expected work
Heartbeat interval	1/3 lease	renew cushion
Acquire latency (Redis)	~1 ms	same-AZ
Acquire latency (etcd)	~5–10 ms	quorum write

Redlock and Kleppmann's Critique

The Redlock algorithm (Antirez): acquire the lock on a majority of N independent Redis masters by SETNX with TTL; if you got > N/2 within a clock-budget, you hold the lock. Release by deleting on all.

Martin Kleppmann argued this is unsafe: Redlock relies on bounded clock drift and bounded GC pauses for correctness. A long GC pause on the holder lets the lease expire on the servers; meanwhile the holder is still about to do its critical-section write — and a second client has now acquired. Without fencing, both writers commit; data is corrupted.

Antirez's response: bounded clock drift is realistic in monitored deployments. Kleppmann's rebuttal: any algorithm requiring assumptions outside the consistency model is asking for production bugs.

Practical takeaway: Redlock works in practice if (a) you fence (see below), (b) your critical section is short relative to lease duration, and (c) your hosts are time-synced. Without fencing, you are taking a bet that the unsafe case will not bite you.

ZooKeeper Ephemeral Nodes

ZooKeeper's primitive: clients create an ephemeral sequential node under /locks/myresource/. The node with the smallest sequence number holds the lock. Watchers fire on the predecessor node's deletion, so each client sleeps until its turn.

Ephemeral means the node disappears when the client's session expires. Sessions are kept alive by heartbeats (default 4 s, max 40 s). On crash, the session times out and the node vanishes — the next waiter is promoted.

This is correct because ZooKeeper's consensus (Zab, similar to Raft) gives a total order on session events. The lock holder sees a monotonic session ID + sequence, which is its fencing token in disguise.

etcd Leases

etcd's lease primitive: client creates a lease (TTL N seconds), then attaches keys to the lease. The lease is renewed via gRPC heartbeat. On lease expiry, all attached keys are atomically deleted.

Lock = put key with lease + revision-number guard. The guard is a transaction: If create_revision == 0, Then Put with lease, Else fail. The successful put returns a revision — etcd's monotonic counter, perfect as a fencing token.

etcd is the modern choice when you have it (it's already deployed for Kubernetes). Less ceremony than ZooKeeper, similar correctness story.

Fencing Tokens

The single most important practice for safe distributed locks. On every lock acquisition, the lock service issues a monotonically increasing token. The client passes this token with every write to the protected resource. The resource rejects writes with stale tokens.

Example: Alice holds lock with token 42, gets GC-paused. Bob takes over with token 43, writes to the database. Alice resumes, attempts to write with token 42; the database sees 42 < 43 (the latest committed token) and rejects.

Without fencing, the safest lock service in the world cannot prevent corruption from a stale holder. The lock service is not the lock; the resource's rejection of stale tokens is the lock.

Expiration vs Heartbeat

Two failure-detection strategies:

Lease expiration — lock has a TTL; client must renew before TTL. If client crashes, TTL expires, lock released. Pros: simple, no false negatives from network blips (the lease covers them). Cons: a hung client holds the lock until TTL, blocking everyone.
Heartbeat with session — client maintains a session via continuous heartbeats; missed heartbeats invalidate the session, releasing locks attached to it. Pros: faster recovery on crash. Cons: network blips trigger false expiry (mitigated with longer thresholds).

Production systems combine: short heartbeat (2–5 s) detects fast crashes; longer lease (30 s) buffers network glitches. Always pair with fencing.

Clock Drift

Algorithms relying on synchronized clocks (Redlock, naive TTL locks) break when:

Hosts have unsynced clocks — one server believes the lock has 5 s left; the other believes it expired 2 s ago.
VM pauses — a VM live-migrated for hardware maintenance pauses for 10 s, which the in-process lease counter does not see; the lease was already long expired when the process resumes.
NTP step — NTP adjusts the clock backward to sync with a peer; the lease "un-expires" or worse.

Mitigations: use monotonic clocks (CLOCK_MONOTONIC) for all lease comparisons. Stay below a clock-skew threshold via NTP / chrony with monitoring. Better: use a lock service whose correctness does not depend on local clocks (etcd / ZK / Spanner). Best: fence.

Single-purpose vs General-purpose Locks

Most lock use cases are singleflight-shaped: prevent two workers from doing the same idempotent work. For these, a general-purpose distributed lock is overkill. Alternatives:

Database row lock — SELECT FOR UPDATE on a job row. Lock is bound to the work; on crash, the DB connection drops and the lock releases.
Idempotency keys — the work itself is idempotent; concurrent runs converge to the same result. No lock needed.
Compare-and-swap on a status field — UPDATE jobs SET state='running' WHERE id=? AND state='pending'. The first update wins; others see 0 rows affected.
Kafka consumer-group rebalance — Kafka itself ensures exactly-one consumer per partition.

Reach for a general-purpose lock service only when you have cross-system coordination (a lock that spans the DB and an external API), per-resource semaphores, or leader election orthogonal to the work being done.

Failure Modes

GC pause exceeds lease — client pauses 12 s on a 10 s lease; lease expires; another client takes over; original resumes and writes. Fence or die.
Network partition — client thinks it holds, lock service thinks not. The transport layer hides the partition until a write fails. Token check at the resource is the saving grace.
Thundering herd on lease expiry — lease expires; 100 waiters race to acquire. Use queue-based wait (ZK ephemeral sequential) instead of polling.
Lock leaks — client crashes without releasing, but the TTL is too long and resource is blocked. Tune TTL to p99 work duration; alert on long-held locks.

FAQ

Should I use Redlock?

Only with fencing and only when occasional false acquisition is tolerable. If your critical section is short and idempotent, Redis SETNX-with-TTL on a single Redis is simpler than Redlock and equivalent in practice. For correctness-critical locks, prefer etcd/ZooKeeper.

What if my resource cannot check tokens?

Then your "lock" is best-effort. Make the work idempotent so dual execution is harmless, or accept the risk of occasional double-execution and design downstream cleanup.

etcd vs ZooKeeper?

etcd if you're in Kubernetes-land already; gRPC API and lease primitives are clean. ZooKeeper if you have legacy code or want the well-trodden ephemeral-sequential pattern.

How long should a lease be?

3× the expected work duration, with renewal at 1/3 of the lease. Too short = false expiry; too long = slow recovery from crash.