Rate Limiter
Rate limiting is how a service refuses bad traffic without falling over: protecting upstream from abuse, expensive endpoints from runaway clients, and the system as a whole from coordinated attacks. The interesting questions are which algorithm (token bucket, leaky bucket, sliding window), where to enforce (edge gateway, application, datastore), distributed coordination (one Redis vs sharded), and policy semantics (per-user, per-route, soft vs hard, when to return 429 vs 503).
Architecture
Capacity Estimation
| Metric | Value | Notes |
|---|---|---|
| Decisions/s peak | ~500 K | per-edge cluster |
| Decision latency target | < 1 ms | p99 budget |
| Distinct counter keys | ~10 M | per-user × per-route |
| Redis ops/s | ~200 K | with batching |
| Memory per key | ~80 B | token-bucket state |
| Total memory | ~1 GB | 10 M keys |
Algorithms
- Token bucket — bucket holds N tokens, refills at rate R/s. Each request consumes 1 token; if empty, deny. Bursty-friendly: a quiet client accumulates tokens up to N. The default.
- Leaky bucket — requests enter a queue, drain at fixed rate R/s. Smoothes traffic to a constant rate; queue full = deny. Good for protecting downstream; harsh on bursts.
- Fixed window — count requests in each calendar minute. Simple but allows 2× the limit at the boundary (49 requests at 11:59:59 + 50 at 12:00:00 = 99 in 2 seconds while limit is 50/min).
- Sliding window log — store every request timestamp; count those within the last 60 s. Exact, but memory-heavy (bounded by request count).
- Sliding window counter — weighted blend of current and previous fixed window:
count = curr_count + prev_count * (1 - elapsed_ratio). Approximates the sliding log with O(1) memory; the working compromise.
Token bucket and sliding window counter are the production picks. Token bucket when you want to allow bursts; sliding window counter when you want predictable per-window enforcement.
Distributed Implementation
One process can do per-instance limits with an in-memory counter. For cluster-wide limits, you need shared state:
- Redis with atomic Lua — read tokens, refill by elapsed time, decrement, write back, all in one round-trip. The standard pattern (see
redis-cellmodule). ~1 ms per decision in-AZ. - Redis sorted set for sliding window log —
ZADDthe timestamp,ZREMRANGEBYSCOREdrops old entries,ZCARDcounts. Memory grows with rate but exact. - Redis HINCR for fixed/sliding counters — one HINCR per request; key per (subject, window). Cheapest, lowest fidelity.
- Local + sync — each gateway maintains a local approximation; periodically syncs to a global counter. Sub-millisecond decisions, eventual consistency on the limit. Used by Cloudflare for global limits.
Local Cache + Global Sync
Pure-Redis is fine to ~50 K decisions/s; past that, the round-trip dominates. The two-tier pattern:
- L1 (in-process) — each gateway holds its allotment for a (subject, route). On request, check L1 first; if budget remains, accept locally.
- Sync — every K requests or T ms, push consumption to Redis; refresh L1 budget from Redis (subtract what others consumed).
- Drift — brief over-allocation possible during sync windows. Usually acceptable; lower the L1 cache size to tighten if not.
This is how production rate limiters at scale work: Cloudflare's global rate limiter, Stripe's, GitHub's. The exact-counts purity is sacrificed for sub-millisecond decisions on every request.
Per-user vs Per-route
Real systems stack multiple limits:
- Global ceiling — protect the cluster; e.g., 1 M req/s total.
- Per-route — expensive endpoints get tighter budgets; e.g., search 50 K/s, login 10 K/s.
- Per-tenant / per-user — free tier 100/min, paid tier 10K/min.
- Per-IP — bot defense; 1000/min from one IP.
Evaluate cheapest-to-strictest in order; reject early. Compose by AND: a request must pass every applicable limit. Track each limit separately so the response can include X-RateLimit-Remaining for the binding limit.
Soft vs Hard Limits
- Hard — deny immediately on overflow with 429. Predictable, harsh.
- Soft / queue — over-budget requests wait in a small queue; admit when bucket has room. Smooths traffic; requires bounded queue or you reinvent leaky bucket badly.
- Burst credit — allow short bursts above the limit, debit against future budget. Friendly for human-driven UIs (a fast typist hits the autocomplete API quickly).
- Tarpit — deliberately slow response (sleep before reply) for abusive clients. Burns their connection slot without obvious denial; effective against scrapers.
When to Return 429
HTTP semantics matter:
- 429 Too Many Requests — client should retry later. Always include
Retry-After(seconds or HTTP date) andX-RateLimit-*headers (limit, remaining, reset). - 503 Service Unavailable — system overload, not client's fault. Use for global ceilings the client cannot fix by waiting.
- 403 Forbidden — permanent; not a rate limit. Use for "your tier doesn't allow this," not "you've made too many calls."
Without Retry-After, well-meaning clients spin retry storms that cement the outage.
Failure Modes
- Redis outage — without state, the limiter must fail open (no rate limit) or fail closed (deny all). Both are bad. Mitigate: in-process fallback that uses a degraded local-only limit.
- Hot key — one user gets all traffic; one Redis shard saturates. Replicate hot keys; or hash subject+route across shards instead of subject only.
- Clock skew — gateways disagree on the window boundary. Use Redis time as the source; or accept drift on the order of clock-skew vs window size (usually negligible).
- Misconfigured key — rule says "per X-Forwarded-For" but proxy strips header; everyone shares the limit. Validate at config-time.
FAQ
Where should rate limiting live?
At the edge (CDN / API gateway) for cheap pre-filtering. Per-service for backend-specific limits. Per-database for the deepest defense. Layer all three for resilience.
What about non-uniform request weights?
Charge the bucket variable amounts: search costs 5 tokens, GET costs 1. Token bucket handles this naturally; window counters require a weight argument.
How do I prevent abuse from rotating IPs?
Combine IP, ASN, JA3 fingerprint, and login state. A single signal is gameable; the conjunction is not. Behavioral models on top (human-likeness scoring, CAPTCHA challenges) for the truly determined.
Should I share the limiter across teams?
The infrastructure (Redis), yes. The policies and keys, no — each team owns its limits. Shared policies become political battles in incidents.