Rate Limiter — Token bucket, sliding window, distributed Redis-backed

Architecture

Capacity Estimation

Metric	Value	Notes
Decisions/s peak	~500 K	per-edge cluster
Decision latency target	< 1 ms	p99 budget
Distinct counter keys	~10 M	per-user × per-route
Redis ops/s	~200 K	with batching
Memory per key	~80 B	token-bucket state
Total memory	~1 GB	10 M keys

Algorithms

Token bucket — bucket holds N tokens, refills at rate R/s. Each request consumes 1 token; if empty, deny. Bursty-friendly: a quiet client accumulates tokens up to N. The default.
Leaky bucket — requests enter a queue, drain at fixed rate R/s. Smoothes traffic to a constant rate; queue full = deny. Good for protecting downstream; harsh on bursts.
Fixed window — count requests in each calendar minute. Simple but allows 2× the limit at the boundary (49 requests at 11:59:59 + 50 at 12:00:00 = 99 in 2 seconds while limit is 50/min).
Sliding window log — store every request timestamp; count those within the last 60 s. Exact, but memory-heavy (bounded by request count).
Sliding window counter — weighted blend of current and previous fixed window: count = curr_count + prev_count * (1 - elapsed_ratio). Approximates the sliding log with O(1) memory; the working compromise.

Token bucket and sliding window counter are the production picks. Token bucket when you want to allow bursts; sliding window counter when you want predictable per-window enforcement.

Distributed Implementation

One process can do per-instance limits with an in-memory counter. For cluster-wide limits, you need shared state:

Redis with atomic Lua — read tokens, refill by elapsed time, decrement, write back, all in one round-trip. The standard pattern (see redis-cell module). ~1 ms per decision in-AZ.
Redis sorted set for sliding window log — ZADD the timestamp, ZREMRANGEBYSCORE drops old entries, ZCARD counts. Memory grows with rate but exact.
Redis HINCR for fixed/sliding counters — one HINCR per request; key per (subject, window). Cheapest, lowest fidelity.
Local + sync — each gateway maintains a local approximation; periodically syncs to a global counter. Sub-millisecond decisions, eventual consistency on the limit. Used by Cloudflare for global limits.

Local Cache + Global Sync

Pure-Redis is fine to ~50 K decisions/s; past that, the round-trip dominates. The two-tier pattern:

L1 (in-process) — each gateway holds its allotment for a (subject, route). On request, check L1 first; if budget remains, accept locally.
Sync — every K requests or T ms, push consumption to Redis; refresh L1 budget from Redis (subtract what others consumed).
Drift — brief over-allocation possible during sync windows. Usually acceptable; lower the L1 cache size to tighten if not.

This is how production rate limiters at scale work: Cloudflare's global rate limiter, Stripe's, GitHub's. The exact-counts purity is sacrificed for sub-millisecond decisions on every request.

Per-user vs Per-route

Real systems stack multiple limits:

Global ceiling — protect the cluster; e.g., 1 M req/s total.
Per-route — expensive endpoints get tighter budgets; e.g., search 50 K/s, login 10 K/s.
Per-tenant / per-user — free tier 100/min, paid tier 10K/min.
Per-IP — bot defense; 1000/min from one IP.

Evaluate cheapest-to-strictest in order; reject early. Compose by AND: a request must pass every applicable limit. Track each limit separately so the response can include X-RateLimit-Remaining for the binding limit.

Soft vs Hard Limits

Hard — deny immediately on overflow with 429. Predictable, harsh.
Soft / queue — over-budget requests wait in a small queue; admit when bucket has room. Smooths traffic; requires bounded queue or you reinvent leaky bucket badly.
Burst credit — allow short bursts above the limit, debit against future budget. Friendly for human-driven UIs (a fast typist hits the autocomplete API quickly).
Tarpit — deliberately slow response (sleep before reply) for abusive clients. Burns their connection slot without obvious denial; effective against scrapers.

When to Return 429

HTTP semantics matter:

429 Too Many Requests — client should retry later. Always include Retry-After (seconds or HTTP date) and X-RateLimit-* headers (limit, remaining, reset).
503 Service Unavailable — system overload, not client's fault. Use for global ceilings the client cannot fix by waiting.
403 Forbidden — permanent; not a rate limit. Use for "your tier doesn't allow this," not "you've made too many calls."

Without Retry-After, well-meaning clients spin retry storms that cement the outage.

Failure Modes

Redis outage — without state, the limiter must fail open (no rate limit) or fail closed (deny all). Both are bad. Mitigate: in-process fallback that uses a degraded local-only limit.
Hot key — one user gets all traffic; one Redis shard saturates. Replicate hot keys; or hash subject+route across shards instead of subject only.
Clock skew — gateways disagree on the window boundary. Use Redis time as the source; or accept drift on the order of clock-skew vs window size (usually negligible).
Misconfigured key — rule says "per X-Forwarded-For" but proxy strips header; everyone shares the limit. Validate at config-time.

FAQ

Where should rate limiting live?

At the edge (CDN / API gateway) for cheap pre-filtering. Per-service for backend-specific limits. Per-database for the deepest defense. Layer all three for resilience.

What about non-uniform request weights?

Charge the bucket variable amounts: search costs 5 tokens, GET costs 1. Token bucket handles this naturally; window counters require a weight argument.

How do I prevent abuse from rotating IPs?

Combine IP, ASN, JA3 fingerprint, and login state. A single signal is gameable; the conjunction is not. Behavioral models on top (human-likeness scoring, CAPTCHA challenges) for the truly determined.

Should I share the limiter across teams?

The infrastructure (Redis), yes. The policies and keys, no — each team owns its limits. Shared policies become political battles in incidents.