URL Shortener

A URL shortener trades a long URL for a short alias and a redirect. The system is read-heavy (many redirects per write), latency-sensitive (the user is waiting for the redirect to resolve), and deceptively simple. The interesting decisions are how to mint short keys (base62, hash, sequential ID), how to scale the redirect path (Redis cache, KV store), what to track per click (HyperLogLog for unique counts), and how to keep abuse off your domain (everyone tries to shorten phishing links through you).

Architecture

Capacity Estimation

Metric	Value	Notes
New URLs/day	~10 M	Bitly-scale
Redirects/day	~10 B	1000:1 read/write
Peak read QPS	~250 K	~2× the ~116 K daily average
Storage / 5 yr	~3 TB	~200 B/row × 15 B URLs
Cache (20% hot)	~600 GB	memory across cluster
Key length	7 chars	62⁷ ≈ 3.5 T URLs
Redirect p99	< 20 ms	edge cache + DB
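The table's figures can be sanity-checked with a few multiplications. The sketch below uses only inputs stated above; the 365-day year and the 86,400-second day are the only added assumptions.

```python
# Back-of-envelope check of the capacity table.
NEW_URLS_PER_DAY = 10_000_000          # from the table
REDIRECTS_PER_DAY = 10_000_000_000     # from the table
SECONDS_PER_DAY = 86_400
KEY_LENGTH = 7

read_write_ratio = REDIRECTS_PER_DAY // NEW_URLS_PER_DAY
avg_read_qps = REDIRECTS_PER_DAY / SECONDS_PER_DAY
urls_in_5_years = NEW_URLS_PER_DAY * 365 * 5
keyspace = 62 ** KEY_LENGTH
years_to_exhaust = keyspace / (NEW_URLS_PER_DAY * 365)

print(f"read/write ratio : {read_write_ratio}:1")
print(f"average read QPS : {avg_read_qps:,.0f}")
print(f"URLs in 5 years  : {urls_in_5_years:,}")
print(f"7-char keyspace  : {keyspace:,}")
print(f"years to exhaust : {years_to_exhaust:,.0f}")
```

At 10 M new URLs per day, five years produces about 18 B rows (the table rounds this down), and the 7-character keyspace would last for centuries at that write rate.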

Key Generation: Base62 vs Hash vs Sequential

Base62 of an auto-increment ID — allocate ID from a counter (or pre-allocated batches via ZooKeeper / Redis), encode in [0-9a-zA-Z]. Pros: collision-free, compact, predictable length. Cons: enumerable (an attacker scrapes consecutive IDs); reveals creation order.
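Hash of URL — SHA-256 truncated to 8 chars in base62. Pros: idempotent (same URL = same key, dedup for free). Cons: the 8-char space holds 62⁸ ≈ 2.2 × 10¹⁴ keys, so by the birthday bound a collision becomes likely (about 50%) after roughly 17 million URLs; requires collision handling.
Random key — generate a random 7-char base62 string; insert-if-absent in the DB (a conditional put, or SETNX if the key index lives in Redis); retry on collision. Pros: not enumerable, no central counter. Cons: collision probability grows as the keyspace fills, so every write pays for a conditional insert and an occasional retry.
Snowflake-style — (timestamp || machine || sequence), encoded base62. Roughly time-ordered yet unique per machine, so it works for distributed allocation without coordination. Cons: a full 64-bit ID encodes to up to 11 base62 chars, longer than the 7-char target.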
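Two of the options above can be sketched in a few lines: base62 encoding of a numeric ID, and a random key with retry on collision. This is a minimal illustration, not any particular service's implementation; the function names are hypothetical, and the in-memory set stands in for an atomic insert-if-absent against the real store.

```python
import secrets
import string

# Alphabet order [0-9a-zA-Z], as in the text.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(key: str) -> int:
    """Inverse of encode_base62."""
    n = 0
    for ch in key:
        n = n * BASE + ALPHABET.index(ch)
    return n


def mint_random_key(taken: set, length: int = 7, max_attempts: int = 5) -> str:
    """Random key with retry; `taken` stands in for the durable store."""
    for _ in range(max_attempts):
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in taken:  # in production: one atomic conditional insert
            taken.add(key)
            return key
    raise RuntimeError("too many collisions; keyspace may be filling up")
```

The sequential path never collides but exposes creation order; the random path hides it at the cost of a conditional write per key.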

Bitly uses base62 of a sequential ID with allocation via a coordination service. Hash-based dedup is added on the application side: the same target URL submitted by the same user returns the same short key. Different users shortening the same URL get different short keys (so each user sees their own analytics).
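Per-user dedup on top of a sequential allocator might look like the sketch below. The class, its fields, and the choice of SHA-256 over (owner, URL) as the dedup fingerprint are assumptions for illustration; the dictionary stands in for a secondary index in the database.

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def to_base62(n: int) -> str:
    """Compact base62 encoder for non-negative integers."""
    out = ""
    while True:
        n, rem = divmod(n, 62)
        out = ALPHABET[rem] + out
        if n == 0:
            return out


class Shortener:
    """Sequential IDs plus a per-owner dedup index (hypothetical sketch)."""

    def __init__(self) -> None:
        self._next_id = 1
        self._by_owner_and_url = {}  # fingerprint -> short key

    def shorten(self, owner: str, long_url: str) -> str:
        # Fingerprint covers owner AND URL, so dedup is scoped per user.
        fingerprint = hashlib.sha256(
            f"{owner}\n{long_url}".encode("utf-8")
        ).hexdigest()
        short = self._by_owner_and_url.get(fingerprint)
        if short is None:
            short = to_base62(self._next_id)
            self._next_id += 1
            self._by_owner_and_url[fingerprint] = short
        return short
```

Calling shorten twice with the same owner and URL returns the same key, while a second owner gets a fresh one and therefore separate click statistics.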

Storage and Cache Layer

Working set is the small fraction of URLs that are actively trafficked. For redirects:

Redis as L1: keyed by short, value = long_url. ~600 GB cluster covers the hot 20% of all-time URLs. ~1 ms reads.
DynamoDB / Cassandra as the durable store: partition key = short, attributes long_url, created_at, owner, expires_at. KV access pattern is the dominant query.
CDN edge for the truly hot ones (a viral marketing campaign): cache the 301 response at Cloudflare; redirect resolves at the PoP without hitting your origin.
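The cache-then-store lookup on the redirect path follows the usual cache-aside pattern. In the sketch below, plain dictionaries stand in for Redis and the durable KV store, and the class name is hypothetical.

```python
from typing import Optional


class RedirectResolver:
    """Cache-aside lookup: try the cache, fall back to the store, backfill."""

    def __init__(self, cache: dict, store: dict) -> None:
        self.cache = cache        # stands in for Redis
        self.store = store        # stands in for DynamoDB / Cassandra
        self.store_reads = 0      # counts how often the slow path is taken

    def resolve(self, short: str) -> Optional[str]:
        long_url = self.cache.get(short)
        if long_url is not None:
            return long_url                  # hot path: cache hit
        self.store_reads += 1
        long_url = self.store.get(short)
        if long_url is not None:
            self.cache[short] = long_url     # backfill for the next reader
        return long_url
```

A real deployment would also set a TTL on the cached entry and might cache negative results briefly so that scans of nonexistent keys do not reach the store.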

301 vs 302 Redirects and Analytics

The redirect HTTP code matters for analytics:

301 Moved Permanently — browsers cache the redirect; subsequent clicks bypass your server. Faster UX; loses analytics on repeat visits.
302 Found (or 307) — not cached by default; every click hits your server unless you add explicit caching headers. Slower repeat visits; full click visibility.

Most analytics-driven shorteners (Bitly, lnk.in) use 301 with a short cache (Cache-Control: max-age=120) or 302 outright. Pick by whether analytics or latency is the primary product metric.
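The choice can be reduced to one flag in the handler. The following is a minimal sketch; the function name and the "no-store" directive on the 302 branch are assumptions, while max-age=120 comes from the example above.

```python
from typing import Dict, Tuple


def redirect_response(
    long_url: str, track_every_click: bool, max_age: int = 120
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a redirect."""
    if track_every_click:
        # 302 with caching disabled: every click reaches the server.
        return 302, {"Location": long_url, "Cache-Control": "no-store"}
    # 301 with a short lifetime: repeat clicks within max_age skip the server.
    return 301, {"Location": long_url, "Cache-Control": f"max-age={max_age}"}
```

With the 301 branch, analytics undercount repeat clicks from the same browser for at most max_age seconds, which bounds the loss rather than eliminating it.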

Click Analytics with HyperLogLog

Per-URL click tracking quickly explodes: 10 B clicks/day across billions of URLs. Two separate problems:

Total clicks — trivial counter; INCR per redirect. Sample if precise count is unnecessary.
Unique visitors — deduplicating IPs naively requires a per-URL set with potentially millions of entries. HyperLogLog approximates the cardinality of a set in a fixed ~12 KB per key with a standard error of about 0.81% (the figures for Redis's implementation). Redis has PFADD / PFCOUNT built in.
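To show why a fixed-size sketch can count uniques, here is a simplified pure-Python HyperLogLog. It is not Redis's implementation: it uses one byte per register (16 KB for 2¹⁴ registers rather than Redis's packed 12 KB), SHA-256 as the hash, and only the basic small-range correction. In production you would call PFADD and PFCOUNT instead.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 14) -> None:
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = bytearray(self.m)   # one byte each, all zero

    def add(self, item: str) -> None:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> int:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, m >= 128
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros:
            estimate = m * math.log(m / zeros)         # linear counting
        return round(estimate)
```

Adding the same visitor twice never changes a register, so duplicates are free; memory stays constant whether a link has a thousand visitors or a hundred million.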

For temporal aggregates (hourly clicks, geographic distribution), pipe redirects to Kafka, do streaming aggregation in Flink / Spark Streaming, write hourly rollups to ClickHouse. The redirect path itself stays sub-10 ms; analytics is async.
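The rollup itself is a group-by on (short key, hour). The sketch below does it in memory with a Counter as a stand-in for the streaming job; the event shape of (short key, Unix timestamp in seconds) is an assumption.

```python
from collections import Counter
from typing import Iterable, Tuple

SECONDS_PER_HOUR = 3600


def hourly_rollup(events: Iterable[Tuple[str, int]]) -> Counter:
    """Count clicks per (short_key, hour_start) bucket.

    events: (short_key, unix_timestamp_seconds) pairs, in any order.
    """
    counts = Counter()
    for short, ts in events:
        hour_start = ts - ts % SECONDS_PER_HOUR   # floor to the hour
        counts[(short, hour_start)] += 1
    return counts
```

A streaming engine adds what this sketch lacks: windows that close, handling of late events, and exactly-once writes to the rollup table.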

Custom Vanity Slugs

Premium feature: sho.rt/my-cool-link. Implementation:

Reserve a slug namespace separate from the auto-generated keys (e.g., custom slugs are 4–30 chars; auto are exactly 7).
Atomic claim: INSERT ... ON CONFLICT DO NOTHING; check rowcount (1 = slug claimed, 0 = already taken).
Reserved-word blocklist: do not let users register admin, api, login.
Profanity filter (multilingual) at registration time.
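The validation and atomic claim can be sketched with SQLite standing in for the production database. The table layout, the allowed character set, and the function names are assumptions; the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import re
import sqlite3

RESERVED = {"admin", "api", "login"}   # from the list above; extend as needed
# 4-30 chars; lowercase letters, digits, hyphens (charset is an assumption).
SLUG_RE = re.compile(r"[a-z0-9][a-z0-9-]{3,29}")


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE links ("
        "short TEXT PRIMARY KEY, long_url TEXT NOT NULL, owner TEXT NOT NULL)"
    )
    return db


def claim_slug(db: sqlite3.Connection, slug: str, long_url: str, owner: str) -> bool:
    """Return True if the slug was claimed, False if invalid or already taken."""
    slug = slug.lower()
    if slug in RESERVED or not SLUG_RE.fullmatch(slug):
        return False
    cur = db.execute(
        "INSERT INTO links (short, long_url, owner) VALUES (?, ?, ?) "
        "ON CONFLICT(short) DO NOTHING",
        (slug, long_url, owner),
    )
    db.commit()
    return cur.rowcount == 1   # 0 means another row already owns the slug
```

Because the uniqueness check and the insert are a single statement, two concurrent requests for the same slug cannot both succeed. A profanity filter would slot in next to the reserved-word check.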

Abuse Prevention

Every URL shortener becomes a phishing vector. Defenses:

URL scanning at submit time — check Google Safe Browsing, PhishTank, internal blocklists. Reject or quarantine known-malicious targets.
Rate limiting and submitter reputation — rate-limit by submitter IP / account; throttle anonymous submissions hard.
Click-time interstitial — for new / unverified URLs, show "you are about to visit example.com, continue?" Loses some UX, blocks one-click drive-by.
Takedown pipeline — abuse@ inbox → ticket → flip the URL to a warning page within hours, not days.
SOC2 / abuse reports — without an active abuse program, your own domain ends up on Safe Browsing, breaking every legitimate user.
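The internal-blocklist part of the submit-time check can be sketched as a hostname suffix match. The blocklist entries and function name are hypothetical, and calls to external services such as Safe Browsing are deliberately left out.

```python
from urllib.parse import urlsplit

# Hypothetical internal blocklist of registrable domains.
BLOCKED_DOMAINS = {"phish.example", "malware.example"}


def is_blocked(long_url: str) -> bool:
    """Reject non-HTTP(S) targets and hosts on or under a blocked domain."""
    parts = urlsplit(long_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return True
    labels = parts.hostname.rstrip(".").split(".")
    # Check the host and every parent domain: a.b.c, b.c, c.
    return any(
        ".".join(labels[i:]) in BLOCKED_DOMAINS for i in range(len(labels))
    )
```

Matching parent domains matters because attackers rotate subdomains freely; the scheme check also keeps javascript: and data: targets off your domain.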

Link Expiration

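Some use cases need a TTL: a marketing campaign for 30 days, a password reset for 1 hour. Implementation: an expires_at column, with a matching TTL on the cache entry and the DB row. The redirect path serves the link only while expires_at > now() and otherwise returns 410 Gone; a background sweep deletes expired rows.

DynamoDB's native TTL works, but it deletes lazily (typically within a few days of expiry), so keep the expires_at check on the redirect path; for Cassandra, set a TTL on the row at write time. Avoid scanning the whole table to find expired keys.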
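The expiry check on the redirect path is a small decision function. In this sketch the row is a plain dict with the long_url and expires_at fields named above; returning 302 for live links and 404 for unknown keys are assumptions.

```python
import time
from typing import Optional, Tuple


def resolve(row: Optional[dict], now: Optional[float] = None) -> Tuple[int, Optional[str]]:
    """Map a stored row to (HTTP status, redirect target)."""
    now = time.time() if now is None else now
    if row is None:
        return 404, None                       # key was never issued
    expires_at = row.get("expires_at")         # None means "never expires"
    if expires_at is not None and expires_at <= now:
        return 410, None                       # issued, but past its TTL
    return 302, row["long_url"]
```

Checking at read time means correctness does not depend on how promptly the store's TTL mechanism or the background sweep removes the row.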

The 6→8 Character Migration

Bitly's historical pain: original keys were 6 characters. 62⁶ ≈ 56.8 B URLs, not enough at long-term scale. They migrated to 7–8 character keys for new URLs while keeping old 6-char URLs forever. Lessons:

Variable-length keys are required from day one — never assume fixed length.
Length- or prefix-based dispatching — keys of different lengths already occupy disjoint namespaces; if two generations ever share a length, give them different starting characters or an explicit prefix so old and new coexist without ambiguity.
Cache invalidation is irrelevant — old keys stay valid; you do not migrate values, only the allocator.

Failure Modes

Hot key — viral link gets 1 M req/s on one cache shard. Replicate hot key to N shards; CDN-cache at the edge; rate-limit per source.
ID allocator outage — cannot mint new keys. Pre-allocate batches per writer process so the allocator can be down for hours without affecting writes.
Database unreachable on redirect path — cache miss + DB down = 5xx. Serve stale-if-error from cache (keep serving an entry past its TTL while the DB is failing); never block the redirect on the DB if the cache had it recently.
SEO / link rot — old shortened URLs across the web are invaluable; one bad migration kills tens of millions of inbound links. Treat short codes as a permanent commitment.
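The batch pre-allocation that protects writers from an allocator outage can be sketched as follows. Both classes are hypothetical; CentralCounter stands in for the ZooKeeper- or Redis-backed counter, and the batch size is arbitrary.

```python
import threading


class CentralCounter:
    """Stand-in for the coordination service: leases disjoint ID ranges."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()

    def lease(self, size: int) -> range:
        with self._lock:
            first = self._next
            self._next += size
            return range(first, first + size)


class BatchAllocator:
    """Per-writer allocator that only contacts the counter once per batch."""

    def __init__(self, counter: CentralCounter, batch_size: int = 1000) -> None:
        self._counter = counter
        self._batch_size = batch_size
        self._ids = iter(())   # empty until the first lease

    def next_id(self) -> int:
        try:
            return next(self._ids)
        except StopIteration:
            # Local batch exhausted: lease a fresh, non-overlapping range.
            self._ids = iter(self._counter.lease(self._batch_size))
            return next(self._ids)
```

This version refills only when the local batch is empty; to ride out an outage, a production allocator would lease the next batch in the background well before the current one runs out, and size batches to cover the outage window it wants to tolerate. IDs left unused when a writer restarts are simply skipped.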

FAQ

Why not just use UUIDs?

UUIDs are 36 characters in canonical form, and still 22 when the 128 bits are re-encoded in base62. The whole point is short. Base62 of a numeric ID covers ~3.5 trillion URLs in 7 chars, which is all the uniqueness this system needs.
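The length comparison is easy to verify; the snippet below uses only the standard library.

```python
import math
import uuid

canonical_len = len(str(uuid.uuid4()))            # hex digits plus 4 hyphens
base62_len = math.ceil(128 / math.log2(62))       # chars needed for 128 bits
keys_in_7_chars = 62 ** 7

print(canonical_len, base62_len, f"{keys_in_7_chars:,}")
```

A UUID buys global uniqueness without coordination, which a shortener does not need because it already controls key allocation.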

How do you handle deletion?

Soft delete (set deleted_at) so you can recover from accidental purges; return a 410 Gone page for the deleted key. Never reuse the short key for a different long URL: browsers and CDNs that cached the old redirect would keep sending visitors to the old target.

Should I store the entire long URL?

Yes; do not normalize or prettify. Users paste full URLs and expect them back unchanged. The DB row is small even at 2 KB long URLs.

Geo-aware redirects?

Optional premium feature: route to different long URLs by visitor country / language. Requires a more complex storage row and edge-aware redirect.