Design a Ticket Booking System
Virtual Queues, Seat-Level Locks, and the Taylor Swift Problem
Ticketmaster's 2022 Taylor Swift Eras Tour on-sale put 14 million users in the queue simultaneously. That single event — one tour, one platform, one button — is a cleaner formulation of the capacity problem than any abstract system-design hypothetical. The core insight: most users will not get a ticket, and the system's job is to disappoint them gracefully without crashing. This drives the architecture: a virtual waiting room in front of the actual booking flow, aggressive bot prevention, distributed locks at the per-seat granularity with TTL-based recovery, atomic payment + ticket-issue transactions, and a queueing-theoretic capacity model that admits new users only as fast as the booking system can serve them. Hotel reservations and ticket booking look similar; they are not.
Why Ticket Booking Is Not Hotel Reservation
A hotel sees a few hundred bookings/second at peak across millions of properties — demand is spread out. A marquee concert on-sale sees millions of bookings attempted in the first minute, all for the same 60K-seat venue. The system serves a 1000x spike, then drops to zero for the same SKU.
"1 deluxe room" is fungible; "Section 102, Row K, Seat 14" is not. Inventory is identified at the leaf, not aggregated. The data model is millions of (event, seat) tuples per major venue. Concurrent seat selection requires per-seat atomicity.
Resale value of premium tickets is 5–20x face. Bots will spend hours of compute on CAPTCHA-solving, residential-proxy rotation, and inventory polling for a single high-value event. Being performant is not enough; the system has to be hostile to bots at every layer.
High-Level Architecture
Key Numbers
The Virtual Waiting Room
The fundamental scaling insight: the booking system can serve N seats/second. If you let 10N users in, N succeed and 9N see errors. So the front door of the system becomes a queue that admits users at exactly the rate the booking system can serve them.
# Queue token issuance (when user opens the onsale page)
POST /api/queue/<event_id>/enter
→ server issues a signed token containing:
{ user_id, event_id, position, issued_at, signature }
# The queue page polls every 5-15 seconds:
GET /api/queue/<event_id>/status?token=...
→ { status: "waiting", position: 184227, eta_seconds: 1800 }
or
{ status: "admitted", access_token: "JWT-with-expiry" }
# Admission rate is a function of how fast the booking system processes bookings.
# Typical Ticketmaster: 200-500 admissions/second per event.
# Redis structure:
# ZADD queue:event:<event_id> <timestamp> <token_id> -- queue ordering
# Position = ZRANK queue:event:<event_id> <token_id>
# Admission = ZPOPMIN queue:event:<event_id> <rate*delta> -- pop the front
# Move admitted tokens to: SET admitted:<event_id>:<token_id> ... EX 600
The queue serves multiple purposes: (1) it absorbs the load spike behind a thin layer; (2) it gives users a UX of "you're in line, wait" instead of "site is broken"; (3) it lets the system measure admission rate vs sell-through rate and back-pressure dynamically. Modern queue services (Queue-it, Cloudflare Waiting Room, Akamai Cloudlets) implement this as an edge-side product.
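The admission loop described above can be sketched without Redis. This is a minimal in-memory model, not the Ticketmaster implementation: the `WaitingRoom` class, its `max_in_flight` back-pressure cap, and the ETA formula are all assumptions made for illustration. It shows the two ideas that matter: admit at most `rate * elapsed` users per tick, and admit fewer when the booking flow is already full.

```python
from collections import OrderedDict
from typing import List, Optional


class WaitingRoom:
    """In-memory stand-in for the sorted-set queue (hypothetical helper)."""

    def __init__(self, admissions_per_second: float, max_in_flight: int) -> None:
        self.admissions_per_second = admissions_per_second
        self.max_in_flight = max_in_flight  # sessions the booking flow can hold at once
        self.waiting = OrderedDict()        # token -> enqueue time, in arrival order
        self.admitted = set()

    def enter(self, token: str, now: float) -> None:
        # Re-entering with the same token must not move the user in line.
        if token not in self.waiting and token not in self.admitted:
            self.waiting[token] = now

    def position(self, token: str) -> Optional[int]:
        # O(n) here; a sorted-set rank lookup would be O(log n).
        for index, queued in enumerate(self.waiting):
            if queued == token:
                return index
        return None

    def eta_seconds(self, token: str) -> Optional[float]:
        pos = self.position(token)
        if pos is None:
            return None
        return (pos + 1) / self.admissions_per_second

    def admit(self, elapsed_seconds: float) -> List[str]:
        # Back-pressure: never admit more than the booking flow has room for.
        budget = int(self.admissions_per_second * elapsed_seconds)
        headroom = self.max_in_flight - len(self.admitted)
        count = max(0, min(budget, headroom, len(self.waiting)))
        batch = [self.waiting.popitem(last=False)[0] for _ in range(count)]
        self.admitted.update(batch)
        return batch

    def finish(self, token: str) -> None:
        # Called when an admitted user checks out or times out.
        self.admitted.discard(token)
```

Calling `admit(1.0)` once per second with `admissions_per_second=2` lets two users through per tick until `max_in_flight` is reached; further admissions wait for `finish()` calls.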
Seat-Level Distributed Locks
Once admitted, the user picks seats. Each seat selection is a distributed lock with TTL:
# Hold a seat (SET with NX and EX is the simplest correct primitive)
# `redis` is assumed to be a connected redis-py client
def hold_seat(event_id, seat_id, user_id):
    key = f"hold:{event_id}:{seat_id}"
    success = redis.set(key, user_id, nx=True, ex=300)  # 5-minute hold
    return bool(success)
# Release a seat (only if owned by us -- Lua script for atomicity)
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""
def release_seat(event_id, seat_id, user_id):
    key = f"hold:{event_id}:{seat_id}"
    return redis.eval(RELEASE_SCRIPT, 1, key, user_id)
# The seat-map service serves "available" view:
# Available = (all_seats) - (sold_in_DB) - (held_in_redis)
# Why TTL not manual release: if the user closes browser, hold expires
# automatically. No cleanup job needed for the happy path; sweeper for
# edge cases (zombies, partial failures).
For events with adjacent-seat requirements ("4 seats together"), the hold becomes a multi-key operation. Redis supports this via Lua scripting:
# Hold 4 adjacent seats atomically; either all or none
# (in Redis Cluster, all keys of one script call must hash to the same slot)
HOLD_GROUP_SCRIPT = """
-- Phase 1: refuse if any requested seat is already held (nothing written yet)
for i, seat in ipairs(KEYS) do
    if redis.call('exists', seat) == 1 then
        return 0
    end
end
-- Phase 2: every seat is free, so take them all with the same owner and TTL
for i, seat in ipairs(KEYS) do
    redis.call('set', seat, ARGV[1], 'EX', tonumber(ARGV[2]))
end
return 1
"""
keys = [f"hold:{evt}:S102-K-14",
        f"hold:{evt}:S102-K-15",
        f"hold:{evt}:S102-K-16",
        f"hold:{evt}:S102-K-17"]
ok = redis.eval(HOLD_GROUP_SCRIPT, len(keys), *keys, user_id, 300)
The trade-off vs Redlock: a single Redis primary with a replica is good enough for most ticketing systems. Redis replication is asynchronous, so a failover can drop holds that were acknowledged but not yet replicated. Redlock (locks acquired on a majority of independent Redis nodes) adds complexity to cover that case, but a "lost hold" is an acceptable failure in ticketing: the user gets bumped out of checkout and re-queued, not double-charged, because the final sale is still guarded by the conditional update in the database.
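The hold semantics in this section (set-if-absent with expiry, owner-checked release, and the `available = all - sold - held` view) can be exercised without a Redis server. The sketch below is an in-memory stand-in with an injected clock; `HoldStore` and `available_seats` are hypothetical names, and a real deployment would keep this state in Redis as shown above.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class HoldStore:
    """Mimics `SET key value NX EX ttl` using an explicit `now` argument."""

    def __init__(self) -> None:
        # seat_id -> (user_id, expires_at)
        self._holds: Dict[str, Tuple[str, float]] = {}

    def _owner(self, seat_id: str, now: float) -> Optional[str]:
        entry = self._holds.get(seat_id)
        if entry is None:
            return None
        if entry[1] <= now:
            # Lazily drop expired holds, like a key whose TTL has elapsed.
            del self._holds[seat_id]
            return None
        return entry[0]

    def hold(self, seat_id: str, user_id: str, now: float, ttl: float = 300.0) -> bool:
        if self._owner(seat_id, now) is not None:
            return False  # someone else (or we) already hold it
        self._holds[seat_id] = (user_id, now + ttl)
        return True

    def release(self, seat_id: str, user_id: str, now: float) -> bool:
        # Only the current owner may release, matching the Lua release script.
        if self._owner(seat_id, now) != user_id:
            return False
        del self._holds[seat_id]
        return True

    def held(self, now: float) -> Set[str]:
        return {seat for seat in list(self._holds) if self._owner(seat, now) is not None}


def available_seats(all_seats: Iterable[str], sold: Iterable[str],
                    store: HoldStore, now: float) -> Set[str]:
    """Available = all seats, minus sold seats, minus currently held seats."""
    return set(all_seats) - set(sold) - store.held(now)
```

Passing `now` explicitly keeps the expiry behaviour deterministic in tests; production code would rely on the store's own TTL handling instead.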
Bot Prevention: A Layered Defense
| Layer | Defense | What it stops |
|---|---|---|
| Edge (CDN/WAF) | IP rate-limit, geo-blocks, ASN reputation | Crude scrapers from datacenter IPs |
| Bot management | Cloudflare Bot Management, Akamai Bot Manager: ML on browser fingerprints, mouse jitter, JS execution timing | Headless Chrome, Playwright bots |
| JS challenge | Cryptographic proof-of-work, Turnstile invisible CAPTCHA | Bots without full JS engine |
| Account requirements | Verified email, phone (SMS), prior account history | Burner accounts |
| Verified Fan | Pre-registration; SMS/email verification; ML scoring of likelihood | Bulk script-driven account creation |
| Card velocity rules | Same card across N attempts in M minutes → flag | Bots reusing cards |
| Device fingerprinting | FingerprintJS, ThreatMetrix — canvas, audio, webgl signatures | Single-machine bot farms |
| Behavioral analytics | Time-on-page, scroll patterns, click positions on seat map | Bots with mouse-replay scripts |
Ticketmaster's "Verified Fan" program is a quasi-lottery: users pre-register with verified contact info; the system selects who gets queue codes based on heuristics (account age, prior purchase history, ML-predicted humanness). The selected users get codes via email; only with a code can you enter the queue. This shifts the bot battle from "minute zero" to "registration period," which spans days and gives more signal time.
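As one concrete instance of the layers in the table above, the card velocity rule ("same card across N attempts in M minutes") can be sketched as a sliding-window counter. The class name, the thresholds, and the use of an opaque card fingerprint string are assumptions for illustration; real systems tune these per event and combine them with the other signals.

```python
from collections import defaultdict, deque
from typing import Deque, DefaultDict


class CardVelocityRule:
    """Flags a card once it exceeds `max_attempts` within `window_seconds`."""

    def __init__(self, max_attempts: int, window_seconds: float) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def record_attempt(self, card_fingerprint: str, now: float) -> bool:
        """Record one payment attempt; return True if the card should be flagged."""
        attempts = self._attempts[card_fingerprint]
        attempts.append(now)
        # Drop attempts that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while attempts and attempts[0] <= cutoff:
            attempts.popleft()
        return len(attempts) > self.max_attempts
```

A flag here would typically feed a review or step-up challenge rather than an outright block, since shared family cards can look like bots.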
Atomic Checkout
At checkout, multiple operations must succeed or all fail:
# Pseudo-code for atomic ticket checkout
# Assumes `redis` was created with decode_responses=True, so GET returns str
def checkout(user_id, event_id, held_seat_ids, total, payment_token, idempotency_key):
    # 1. Idempotency check first (24h TTL): a retry after success must return
    #    the cached result, not fail because the holds were already deleted
    cached = redis.get(f"idem:{idempotency_key}")
    if cached:
        return json.loads(cached)
    # 2. Verify all holds are still ours
    for seat_id in held_seat_ids:
        owner = redis.get(f"hold:{event_id}:{seat_id}")
        if owner != str(user_id):
            raise SeatNoLongerHeldError(seat_id)
    # 3. Charge payment (synchronous, ~3s)
    charge = stripe.charge(payment_token, amount=total, idempotency_key=idempotency_key)
    if charge.status != "succeeded":
        raise PaymentFailedError(charge.failure_reason)
    # 4. Atomically: mark seats sold + write booking
    try:
        with db.transaction():
            for seat_id in held_seat_ids:
                affected = db.execute("""
                    UPDATE seat SET sold = TRUE, sold_to = %s, sold_at = NOW()
                    WHERE event_id = %s AND seat_id = %s AND sold = FALSE
                """, (user_id, event_id, seat_id))
                if affected != 1:
                    raise SeatRaceError(seat_id)  # extremely rare; sweeper bug?
            booking_id = db.execute("""
                INSERT INTO booking (user_id, event_id, seat_ids, total, payment_id)
                VALUES (...) RETURNING booking_id
            """).scalar()
    except Exception:
        # Compensation: the transaction rolled back, so refund the payment
        stripe.refund(charge.id, idempotency_key=f"refund:{idempotency_key}")
        raise
    # 5. The booking is committed. Cache the result, then release the Redis
    #    holds; a failure here must not trigger a refund (holds expire via TTL)
    result = {"booking_id": str(booking_id), "status": "confirmed"}
    redis.set(f"idem:{idempotency_key}", json.dumps(result), ex=86400)
    for seat_id in held_seat_ids:
        redis.delete(f"hold:{event_id}:{seat_id}")
    return result
The compensation step (refund on DB-write failure) is rarely exercised but mandatory. It's the failure mode behind TicketSwap's 2018 outage and the public-relations disaster that followed: payments were captured but the DB layer failed; the system had no compensation logic; thousands of users were charged for tickets they never received.
Inventory Partitioning
At the data layer, the seat table is partitioned by event_id:
-- Postgres / CockroachDB schema
CREATE TABLE event (
event_id BIGINT PRIMARY KEY,
venue_id BIGINT,
starts_at TIMESTAMPTZ,
total_seats INT,
onsale_at TIMESTAMPTZ
);
CREATE TABLE seat (
event_id BIGINT NOT NULL,
seat_id TEXT NOT NULL, -- "S102-K-14"
section TEXT,
row_num TEXT,
seat_num TEXT,
price_tier SMALLINT,
sold BOOLEAN NOT NULL DEFAULT FALSE,
sold_to BIGINT,
sold_at TIMESTAMPTZ,
PRIMARY KEY (event_id, seat_id)
) PARTITION BY HASH (event_id); -- distribute hot events across shards
CREATE TABLE booking (
booking_id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
event_id BIGINT NOT NULL,
seat_ids TEXT[] NOT NULL,
total NUMERIC(10,2),
payment_id TEXT,
status TEXT NOT NULL,
idempotency_key TEXT NOT NULL UNIQUE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON booking (user_id, created_at DESC);
-- For a 60K-seat event:
-- 60K rows in seat table, all hot during onsale
-- < 1 minute sell-through means ~60K updates in ~60s = 1000 UPS
-- Single-shard Postgres handles this; CockroachDB if you need geo-replication
Dynamic Pricing (StubHub / Vivid Seats Model)
On the resale marketplace, prices float — sellers set asking prices, buyers bid. On the primary market, dynamic pricing has become controversial (Bruce Springsteen's 2022 prices spiked to $5K+ via Ticketmaster's "Platinum" dynamic pricing), but the algorithm is similar to airline yield management:
# Simplified dynamic pricing logic
def price_seat(event_id, tier, base_price, total_at_tier):
    # Read demand signals; the counters are incremented elsewhere (on seat-map
    # views and hold attempts), not by the pricing call itself
    views_per_min = int(redis.get(f"views:{event_id}:{tier}") or 0)
    holds_per_min = int(redis.get(f"holds:{event_id}:{tier}") or 0)
    sold_so_far = int(redis.get(f"sold:{event_id}:{tier}") or 0)
    inventory_left = total_at_tier - sold_so_far
    # Demand pressure (views vs inventory)
    pressure = (views_per_min + holds_per_min * 5) / max(inventory_left, 1)
    # Multiplier (clamped 0.7 - 4.0)
    multiplier = max(0.7, min(4.0, 1.0 + pressure / 50))
    return base_price * multiplier
# Triggered re-price every ~5 seconds based on rolling window stats.
# Some venues opt out of dynamic pricing; the platform makes it configurable.
Resale (StubHub, Vivid Seats, SeatGeek) is a different system: a marketplace where individual sellers list tickets at set prices. The platform takes a transaction fee (~25% combined buyer + seller fee). The architecture overlap is significant (the same seat-hold + payment + atomic-transfer primitives apply), but the inventory model is different: each ticket is owned by a seller_id, and ownership transfers at sale.
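To make the demand-pressure formula in the pricing logic above concrete, here is the same arithmetic as a pure function with a few sample inputs. The function name and the sample numbers are assumptions chosen for illustration; only the formula and the clamp bounds come from the text.

```python
def demand_multiplier(views_per_min: int, holds_per_min: int, inventory_left: int) -> float:
    """Price multiplier from demand pressure, clamped to the range [0.7, 4.0]."""
    # Holds are weighted 5x views because a hold signals stronger intent.
    pressure = (views_per_min + holds_per_min * 5) / max(inventory_left, 1)
    return max(0.7, min(4.0, 1.0 + pressure / 50))


# Illustrative scenarios (hypothetical numbers):
scenarios = {
    "quiet tier": demand_multiplier(100, 10, 5000),       # pressure 0.03
    "busy tier": demand_multiplier(20000, 2000, 500),     # pressure 60
    "nearly sold out": demand_multiplier(50000, 5000, 10),  # pressure 7500, hits the cap
}
```

One observation worth checking against any real implementation: because pressure is never negative, this formula as written never produces a multiplier below 1.0, so the 0.7 floor would only matter if a discount term were added for slow-selling inventory.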
Capacity Planning for the Eras Tour
# The Taylor Swift Eras Tour onsale (November 2022)
Concurrent users at 10am ET = ~14,000,000
Total tickets across all dates = ~2,400,000
Tickets per minute the booking can issue = ~3,000-5,000 (Ticketmaster claim)
Time to sell out at that rate = 8-13 hours total across all events
# The mismatch:
14M users / (4K issuance/min * 60min) = 58 hours of queue time at peak
Reality: queues took 2-8 hours; many users got error pages
Why: the queue layer ITSELF buckled (a different problem from
the booking layer)
# Lessons (after-action analysis):
1. Queue must be horizontally scaled separately from booking.
Ticketmaster's queue layer was undersized.
2. Pre-event load testing must use realistic geo-distribution and
user-agent diversity, not synthetic traffic from one datacenter.
3. The "1B+ bot requests" claimed by Ticketmaster suggests bot
detection was overwhelmed -- humans got stuck behind bot floods.
4. Verified Fan reduced bot share but didn't eliminate it; Ticketmaster
reported sending about 1.5M codes, yet roughly 14M users and bots
arrived, far more than the queue capacity.
Hotel vs Ticket Booking: Side-by-Side
| | Hotel Reservation | Ticket Booking |
|---|---|---|
| Inventory unit | Room-night (date-bound, fungible) | Specific seat at event (event-bound, identified) |
| Concurrency profile | Smooth; tens of thousands of properties | Cliff-shaped; one event at a time |
| Search vs book ratio | ~100:1 search-heavy | ~10:1 if not queued; users come ready to buy |
| Cancellation policy | Often free 24h before | Usually non-refundable; resale marketplace |
| Hold TTL | 5–15 min | 5–10 min (forced shorter due to demand) |
| Bot threat | Low (no resale arbitrage on rooms) | Critical (10x face-value resale) |
| Queueing | Rarely; only flash-sales | Always for marquee events |
| Inventory sharing | GDS, OTAs, channel manager | Mostly platform-exclusive |
Tradeoffs & Failure Modes
- Queue layer collapse. If the queue service itself fails under load, the system is worse than no queue (users get raw rejections instead of orderly waits). Run the queue layer at 10x peak capacity; use Cloudflare Waiting Room or similar edge product, not in-region Redis alone.
- Hold leakage. A user grabs 4 seats and disappears; the holds expire eventually, but for 5 minutes those seats are unavailable. In a sold-out event, this means 4 lost sales per abandonment. Solution: aggressive heartbeats (require client to ping every 30s; release on missed heartbeats faster than full TTL).
- Browser back button. User completes payment, hits back, retries. Without idempotency, double-charge. Idempotency key in client state (sessionStorage) survives back/forward navigation if the SPA is built right; URL-based state survives even reloads.
- Refund storms. Cancelled event → millions of refunds. Stripe and Braintree have rate limits on refund volumes. Process in batches over hours/days; communicate timeline to users.
- Dynamic pricing PR backlash. The Springsteen / Bad Bunny / Beyonce Platinum pricing controversies show the customer-experience cost of unfettered yield-management. Most platforms now expose price-floor controls per artist/event.
- Wallet integration race. Apple Wallet / Google Wallet ticket issuance is async; the user receives the confirmation email before the wallet object exists. If they scan the email QR at the gate, the system can report "ticket already used" because the wallet ticket was treated as the canonical version. Use a unified ticket_id; treat the email QR and the wallet QR as views of the same record.
- Resale fraud. Tickets resold via screenshots that turn out invalid at gate. Mitigation: rotating QR codes (refreshed every minute via app), official transfer-only systems (Ticketmaster's SafeTix), barcode delivery only at T-2 hours.
- Database hot row. A single popular event's seat table becomes the hottest partition in the cluster. If Postgres, vertical-scale that node; if CockroachDB, consider per-event physical sharding via separate databases for marquee events.
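The heartbeat mitigation for hold leakage in the list above can be sketched as follows. This is an in-memory model under stated assumptions: the 45-second timeout (one and a half missed 30-second pings) and the `HeartbeatHolds` class are illustrative choices, not values from any real platform. In Redis terms, one way to get a similar effect would be a short expiry that each heartbeat extends.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hold:
    user_id: str
    created_at: float
    last_seen: float


class HeartbeatHolds:
    """Releases a hold on a missed heartbeat or at the full TTL, whichever is first."""

    def __init__(self, ttl_seconds: float = 300.0, heartbeat_timeout: float = 45.0) -> None:
        self.ttl_seconds = ttl_seconds
        self.heartbeat_timeout = heartbeat_timeout
        self._holds: Dict[str, Hold] = {}

    def hold(self, seat_id: str, user_id: str, now: float) -> bool:
        # Stale holds are only cleared by sweep(), so run it regularly.
        if seat_id in self._holds:
            return False
        self._holds[seat_id] = Hold(user_id, created_at=now, last_seen=now)
        return True

    def heartbeat(self, user_id: str, now: float) -> int:
        """Refresh every hold owned by this user; return how many were refreshed."""
        refreshed = 0
        for hold in self._holds.values():
            if hold.user_id == user_id:
                hold.last_seen = now
                refreshed += 1
        return refreshed

    def sweep(self, now: float) -> List[str]:
        """Release holds whose client went quiet or whose hard TTL has elapsed."""
        released = [
            seat_id for seat_id, hold in self._holds.items()
            if now - hold.last_seen > self.heartbeat_timeout
            or now - hold.created_at >= self.ttl_seconds
        ]
        for seat_id in released:
            del self._holds[seat_id]
        return sorted(released)
```

The hard TTL still applies to clients that keep pinging, so a tab left open cannot hold seats indefinitely.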
FAQ
Why use Redis for holds instead of just locking rows in the booking database?
Two reasons. (1) Hold TTL: Redis EX gives free expiry; in Postgres, a hold column needs a sweeper job to clear stale holds, and a real row lock would pin a transaction open for the whole hold. (2) Latency: a held seat is read on every seat-map render (potentially 10K+ users staring at the seat map of an active event), and Redis serves these reads at sub-ms, whereas Postgres reads on a hot table with active locks can spike. The DB row stays untouched until checkout, where it gets a single update.
How do you keep the seat-map view fresh for thousands of concurrent viewers?
WebSocket or Server-Sent Events from a seat-state service. When a hold or sale event happens, publish the seat_id state change to a Pub/Sub channel; clients viewing that event subscribe to the channel and update the seat-map UI in real time. Without this, you get the "I clicked the seat someone else just bought" experience — the hold attempt fails server-side, but the UX is bad.
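The publish/subscribe flow in this answer can be sketched in-process. The `SeatStateBroker` class and the update shape (`seat_id`, `state`) are assumptions for illustration; in production the broker role would be played by a Pub/Sub system, and the callbacks by WebSocket or SSE connections.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

SeatUpdate = Dict[str, str]
Subscriber = Callable[[SeatUpdate], None]


class SeatStateBroker:
    """One channel per event; every viewer of that event gets each seat change."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Subscriber]] = defaultdict(list)

    def subscribe(self, event_id: str, callback: Subscriber) -> Callable[[], None]:
        self._subscribers[event_id].append(callback)

        def unsubscribe() -> None:
            self._subscribers[event_id].remove(callback)

        return unsubscribe

    def publish(self, event_id: str, seat_id: str, state: str) -> int:
        """Deliver a state change ("available", "held", "sold"); return listener count."""
        update = {"seat_id": seat_id, "state": state}
        # Copy the list so a callback that unsubscribes does not disturb iteration.
        listeners = list(self._subscribers.get(event_id, []))
        for callback in listeners:
            callback(update)
        return len(listeners)
```

A client-side seat map is then just a dictionary that applies each update as it arrives; the server-side hold attempt remains the source of truth when two clicks race.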
What stops a user from holding 100 seats and never buying?
Per-user hold limits enforced server-side (e.g., max 8 seats per event). The UI also enforces it, but the server is the source of truth. Combined with the TTL, even an attacker who maxes out their limit only locks 8 seats for 5 minutes. Fraud-detection flags accounts that exhibit "hold and abandon" patterns; repeat offenders are throttled.
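A server-side limit like the one described here can be sketched as a small bookkeeping class. `HoldLimiter` is a hypothetical name, and the all-or-nothing rejection of an over-limit request is a design assumption (a platform could also grant a partial hold).

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Set, Tuple


class HoldLimiter:
    """Tracks seats held per (user, event) and rejects requests over the cap."""

    def __init__(self, max_seats_per_event: int = 8) -> None:
        self.max_seats_per_event = max_seats_per_event
        self._held: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)

    def try_hold(self, user_id: str, event_id: str, seat_ids: Iterable[str]) -> bool:
        current = self._held[(user_id, event_id)]
        # Seats the user already holds do not count twice.
        requested = set(seat_ids) - current
        if len(current) + len(requested) > self.max_seats_per_event:
            return False  # reject the whole request; nothing is recorded
        current |= requested
        return True

    def release(self, user_id: str, event_id: str, seat_ids: Iterable[str]) -> None:
        self._held[(user_id, event_id)] -= set(seat_ids)

    def held_count(self, user_id: str, event_id: str) -> int:
        return len(self._held[(user_id, event_id)])
```

The limiter only counts; the actual seat lock still comes from the hold primitive, so the two checks would run together in the hold endpoint.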
Why do queues sometimes show 100,000 ahead but admit you in 5 minutes?
Two reasons. (1) Bot tokens get invalidated as bot detection catches up — the queue pops them without admitting them, so the line appears to move faster than the admission rate. (2) Many users abandon (close tab) without releasing their queue position; the queue service times them out after a heartbeat interval and pops them.
How does Verified Fan work technically?
Pre-registration period (often 1–2 weeks): users register with email + phone, optionally link a fan account or social media. Server collects signals: account age, prior Ticketmaster purchase history, IP geo (must match the registration country), device fingerprint. ML scores each registration as "likely human fan" vs "likely bot/reseller." Selected users receive a one-time code (by SMS or email) ahead of the onsale; only that code unlocks the queue.
What does "atomic" mean across Redis, Stripe, and Postgres — you can't have a single transaction.
You can't. The system instead uses a saga: each step has a compensating action. If payment succeeds but DB write fails, refund the payment. If the saga itself crashes mid-step, the orchestrator (or replay log) re-runs the step on recovery, using idempotency keys to make re-runs safe. "Atomic" here means "from the user's perspective, either the booking happened completely or didn't happen at all" — with the caveat that in transient failure modes, the system may need to issue refunds.
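The saga idea in this answer, where each step carries a compensating action, can be sketched in a few lines. This is a minimal in-process version under stated assumptions: `run_saga` and the step tuples are hypothetical, and it omits the durable log an orchestrator would need to resume after a crash.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False, log
        completed.append(step)
        log.append(f"done:{name}")
    return True, log


# Example: the payment succeeds, the booking write fails, the payment is refunded.
ledger = {"charged": False, "refunded": False}


def charge() -> None:
    ledger["charged"] = True


def refund() -> None:
    ledger["refunded"] = True


def write_booking() -> None:
    raise RuntimeError("db unavailable")


ok, saga_log = run_saga([
    ("charge", charge, refund),
    ("write_booking", write_booking, lambda: None),
])
```

In a real system each action and compensation would carry an idempotency key, so that replaying the log after a crash does not charge or refund twice.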
How would you load-test a system like this before an actual onsale?
Distributed load gen from many regions (Locust + EC2 fleet across 10+ regions). Realistic user behavior: queue entry, polling, seat-map browse, hold attempts, abandons, checkouts. Mix in synthetic bot traffic at 30% to test bot defense. Run the test against a production-shaped staging environment, not a scaled-down one. Measure not just success rate but tail latency — p99.9 matters more than mean for "users seeing site is broken."
How do you handle a venue layout change (a row was removed, seats renumbered)?
Inventory is per-event, not per-venue. When the event is created, the venue's current seat map is snapshotted into the seat table. If the venue layout changes after onsale, you accept that the booked seat IDs no longer match physical reality and reconcile manually at the gate. For changes detected before tickets are sold, just re-snapshot. Most platforms version the venue layout and bind the version into the event row.
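The snapshot-and-version approach can be sketched with plain data classes. The `VenueLayout` type, the `layout_version` field, and both helper functions are assumptions for illustration; the schema earlier in the article does not include a version column.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class VenueLayout:
    venue_id: int
    version: int
    seat_ids: Tuple[str, ...]


def snapshot_seats(event_id: int, layout: VenueLayout) -> List[Dict[str, object]]:
    """Copy the venue's current seats into per-event inventory rows."""
    return [
        {
            "event_id": event_id,
            "seat_id": seat_id,
            "layout_version": layout.version,  # hypothetical column binding the version
            "sold": False,
        }
        for seat_id in layout.seat_ids
    ]


def layout_diff(old: VenueLayout, new: VenueLayout) -> Tuple[Set[str], Set[str]]:
    """Return (removed, added) seat IDs between two layout versions."""
    old_ids, new_ids = set(old.seat_ids), set(new.seat_ids)
    return old_ids - new_ids, new_ids - old_ids
```

Before tickets are sold, a non-empty diff means the event can simply be re-snapshotted; after onsale, the removed set is the list of bookings that need manual reconciliation.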