Design a Hotel Reservation System
Inventory, Double-Booking Prevention, the Saga, and Why Search Is Half the Problem
Hotel reservation systems sit at the intersection of two unforgiving constraints: inventory cannot be sold twice, and users will abandon search results that take more than 200 ms. The naive "one row per room" data model breaks under both. The right shape is room-night inventory — treating each (hotel, room-type, date) tuple as the unit of sale — combined with a careful locking strategy, idempotent bookings, and a saga pattern that ties booking, payment, and confirmation together with compensating actions when any step fails. On top of that, the search layer must serve sub-200ms autocomplete and faceted filtering across millions of properties, which is a different system entirely (Elasticsearch, geospatial indexing, dynamic pricing). Booking.com and Expedia are 20-year case studies in solving these together.
Why This Problem Is Different from E-Commerce
A widget on Amazon is fungible: ship one, decrement count. A hotel room on Tuesday Mar 4 cannot be substituted with the same room on Mar 5. Inventory is a 2D matrix of (room, date), and the booking spans a contiguous range. Searching for "3 nights Mar 4–7" requires checking that the room is available for all three dates, atomically.
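As a minimal sketch, the stay-to-nights expansion uses a half-open interval (checkout day is not a night sold); `nights_for_stay` is an illustrative helper name:

```python
from datetime import date, timedelta

def nights_for_stay(check_in: date, check_out: date) -> list[date]:
    """Expand a stay into the per-night inventory rows it must check.

    Half-open interval: a Mar 4 -> Mar 7 stay occupies the nights of
    Mar 4, 5, and 6 -- the checkout date itself is not sold.
    """
    if check_out <= check_in:
        raise ValueError("check_out must be after check_in")
    num_nights = (check_out - check_in).days
    return [check_in + timedelta(days=i) for i in range(num_nights)]

# "3 nights Mar 4-7" touches exactly three room-night rows:
print(nights_for_stay(date(2026, 3, 4), date(2026, 3, 7)))
```

The half-open convention also makes back-to-back stays compose cleanly: one guest's checkout date can be the next guest's check-in date with no shared night.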
Two users on different devices can simultaneously click "Book" for the last room. Without locking, both succeed; the hotel discovers the double-booking at check-in. The cost of double-booking is high (refunds, walk fees, brand damage), so the system errs hard on the side of preventing ambiguity even at the cost of false rejections.
Most queries are searches; only a small fraction become bookings. Search must be fast (sub-200ms), faceted (price, stars, amenities, distance from POI), and personalized (user's currency, preferred chains). Search infrastructure is decoupled from the booking-of-record system: an Elasticsearch index updated asynchronously from the source-of-truth inventory store.
High-Level Architecture
Key Numbers
The Inventory Data Model: Room-Nights as the Unit
The naive model is "rooms with availability flags." It's wrong — you can't represent "available Mar 4 but booked Mar 5" without per-date state. The right model is a 2D inventory grid:
-- Postgres: source-of-truth for bookings
CREATE TABLE booking (
  booking_id      UUID PRIMARY KEY,
  user_id         BIGINT NOT NULL,
  hotel_id        BIGINT NOT NULL,
  room_type_id    BIGINT NOT NULL,
  check_in        DATE NOT NULL,
  check_out       DATE NOT NULL,
  num_rooms       SMALLINT NOT NULL,
  total_price     NUMERIC(10,2) NOT NULL,
  currency        CHAR(3) NOT NULL,
  status          TEXT NOT NULL,        -- pending | confirmed | cancelled | failed
  idempotency_key TEXT NOT NULL UNIQUE,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON booking (user_id, created_at DESC);
CREATE INDEX ON booking (hotel_id, check_in);
-- DynamoDB: room-night inventory (hot path for availability checks)
-- PK = hotel#<hotel_id>#room#<room_type_id>
-- SK = night#YYYY-MM-DD
-- Attributes: total_inventory, sold, held, version (for OCC)
{
  "PK": "hotel#H123#room#deluxe",
  "SK": "night#2026-03-04",
  "total_inventory": 50,
  "sold": 42,
  "held": 3,
  "version": 8194
}
# Available = total_inventory - sold - held
# A 3-night booking checks 3 rows: 2026-03-04, 2026-03-05, 2026-03-06
# All must have available >= num_rooms_requested

DynamoDB is a great fit for the inventory grid: keys are deterministic, all reads are point lookups or sorted-by-date range scans within a hotel, and the partition key (hotel#H123#room#deluxe) distributes evenly. The trade-off is that DynamoDB transactions are limited to 100 items, which caps one booking at 100 nights — fine for hotels, would be a problem for vacation rentals with multi-month stays.
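To make the availability rule concrete, here is a small dependency-free sketch (function and field names are illustrative, mirroring the grid above, not taken from a real codebase) that applies the `total_inventory - sold - held` check across every night of a stay:

```python
def stay_available(rows: list[dict], nights: list[str], num_rooms: int) -> bool:
    """Check a fetched slice of the inventory grid for a whole stay.

    `rows` are the items for one (hotel, room-type) partition, e.g.
    {"SK": "night#2026-03-04", "total_inventory": 50, "sold": 42, "held": 3}.
    Every requested night must exist AND have enough unsold, unheld rooms;
    a single missing or oversold night fails the entire stay.
    """
    by_night = {r["SK"]: r for r in rows}
    for night in nights:
        r = by_night.get(f"night#{night}")
        if r is None:  # no inventory row loaded for that date
            return False
        if r["total_inventory"] - r["sold"] - r["held"] < num_rooms:
            return False
    return True
```

In production the `rows` would come from a single DynamoDB range query (`SK BETWEEN night#<first> AND night#<last>`) with consistent reads, since the whole point is not trusting a stale replica.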
Preventing Double-Booking: Optimistic vs Pessimistic vs Distributed Lock
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Optimistic concurrency (OCC) | Read row + version, increment sold, write with version-check | No lock contention; great for low-conflict cases | Retries on conflict; bad for last-room scenarios |
| Pessimistic (SELECT ... FOR UPDATE) | Take row lock at the start of transaction | Simple, no retries | Holds lock for entire txn duration; queues form |
| Distributed lock (Redis Redlock) | Acquire named lock: SET hotel:H123:room:deluxe <owner-token> NX EX 10 | Survives DB connection issues; cross-service | Lock service is a SPOF; clock skew can cause split-brain |
| Atomic counter (Redis or DDB) | DECR available if > 0 | Sub-ms, scales to 100K+ ops/s | Can over-decrement on contention without conditional check |
The pragmatic choice in production is OCC with a hold step. The booking flow becomes:
# Step 1: HOLD (5-minute TTL) -- user clicks "Book Now"
# Atomic DynamoDB conditional update:
UPDATE inventory
SET held = held + 1, version = version + 1
WHERE PK = "hotel#H123#room#deluxe" AND SK = "night#2026-03-04"
AND (total_inventory - sold - held) >= 1
AND version = :prev_version
# Repeat for each night in the stay.
# If ANY night fails the conditional check -- release prior holds + return 409.
# Hold is recorded in Redis with TTL: SET hold:<booking_id> ... EX 300
# Step 2: PAYMENT -- user enters card, presses Confirm
# Stripe charge -- ~3 seconds typical
# Step 3: CONFIRM -- inside DB transaction:
# sold += 1, held -= 1, write booking row
# delete Redis hold
# Step 4: If user abandons (timeout) -- background job:
# For all expired holds: held -= 1 (compensate)
# Inventory becomes available again

The 5-minute hold TTL is the system's cheapest defense against double-booking. It gives the user time to enter payment without tying up the row in a database transaction. If the user abandons, the hold expires and inventory returns — no manual cleanup needed. The TTL is set per-platform: Booking.com uses ~10 minutes for the booking funnel; airlines often use 20 minutes because seat selection takes longer.
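A toy in-memory version of the hold step makes the mechanics visible. The real implementation is a DynamoDB conditional update; `try_hold` and `hold_stay` are illustrative names, and in real use `expected_version` comes from the earlier availability read, not from the same row object:

```python
def try_hold(row: dict, expected_version: int, num_rooms: int = 1) -> bool:
    """Stand-in for the conditional update: succeeds only if the caller's
    read is still current (OCC version check) and enough rooms remain,
    mirroring `version = :prev AND total - sold - held >= :n`."""
    if row["version"] != expected_version:
        return False                      # someone wrote since we read; retry
    if row["total_inventory"] - row["sold"] - row["held"] < num_rooms:
        return False                      # not enough rooms left
    row["held"] += num_rooms
    row["version"] += 1
    return True

def hold_stay(grid: dict, nights: list[str], num_rooms: int = 1) -> bool:
    """All-or-nothing: if any night fails, release the holds already taken."""
    taken = []
    for night in nights:
        row = grid[night]
        if try_hold(row, row["version"], num_rooms):
            taken.append(night)
        else:
            for t in taken:               # compensate: roll back prior holds
                grid[t]["held"] -= num_rooms
                grid[t]["version"] += 1
            return False
    return True
```

The per-night loop is exactly why the compensation path matters: a 3-night hold that fails on night 2 must not leave night 1 silently held until the TTL fires.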
The Booking Saga: Coordinating Inventory + Payment + Confirmation
Booking is a multi-step transaction across services that don't share a database. Two-phase commit across DynamoDB, Stripe, and Postgres is impossible. The saga pattern is the canonical answer: each step has a forward action and a compensating action; if any forward step fails, run the compensations in reverse order.
Hold inventory in DynamoDB for each night. Compensate: release the holds.
Stripe payment intent with capture_method=manual. Compensate: cancel the payment intent.
Atomically: convert holds to sold (DDB), capture payment (Stripe), write booking row (Postgres). Compensate: if Stripe capture succeeds but DB write fails — issue refund.
Send confirmation email/SMS, push to GDS/PMS. Failure here doesn't roll back the booking; it's retried separately. The customer has a confirmation number.
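The forward/compensation pairing above can be sketched with a minimal orchestrator. This is illustrative only: a production orchestrator persists saga state (Temporal, Step Functions) so a crashed coordinator can resume, which a simple in-process loop cannot:

```python
def run_saga(steps):
    """Run (name, forward, compensate) steps in order; on any failure,
    run the compensations of already-completed steps in reverse."""
    done = []
    for name, forward, compensate in steps:
        try:
            forward()
            done.append((name, compensate))
        except Exception:
            for _, comp in reversed(done):
                comp()                    # undo in reverse order
            raise

log = []

def decline_card():
    raise RuntimeError("card declined")   # simulated payment failure

steps = [
    ("hold_inventory", lambda: log.append("held"),
     lambda: log.append("released")),
    ("authorize_payment", decline_card,
     lambda: log.append("auth_cancelled")),
]
try:
    run_saga(steps)
except RuntimeError:
    pass
print(log)   # ['held', 'released'] -- only completed steps are compensated
```

Note that the failed step's own compensation never runs: compensations undo *committed* work, and a step that raised is assumed to have committed nothing.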
# Choreographed saga via Kafka (alternative to orchestrated):
booking_requested -> inventory_held -> payment_authorized -> booking_confirmed -> notification_sent
# Each step is its own consumer; on failure, emits compensation event:
inventory_hold_failed -> (no compensation needed, nothing yet committed)
payment_auth_failed -> release_inventory_hold
booking_confirm_failed -> release_inventory_hold + cancel_payment_authorization
# Orchestrated alternative (AWS Step Functions, Temporal):
# Single workflow definition with try/catch per step + retry policies.
# Easier to reason about; harder to scale across teams.

Idempotency: The Single Most Important API Contract
Mobile networks drop. Users tap "Book" twice. Retries happen. Without idempotency, a flaky network
becomes a double-charged customer. The contract: every booking request includes an
Idempotency-Key header (UUID generated by client). The server records this key with
the booking outcome; future requests with the same key return the same response without re-executing.
# Server-side handler skeleton
def create_booking(request, idempotency_key):
    # Look up in Redis cache (24h TTL, matching the Postgres record) for the hot path
    cached = redis.get(f"idem:{idempotency_key}")
    if cached:
        return json.loads(cached)
    try:
        # Atomic claim in Postgres: exactly one request wins the key
        with db.transaction():
            db.execute("""
                INSERT INTO booking_idempotency (key, status, created_at)
                VALUES (%s, 'in_progress', NOW())
                ON CONFLICT (key) DO NOTHING RETURNING key
            """, (idempotency_key,))
            row = db.fetchone()
            if row is None:
                # Another request claimed this key. Wait + return its result.
                return wait_for_idempotency_result(idempotency_key)
        # We own the key -- run the saga OUTSIDE the claim transaction,
        # so a multi-second payment call never holds a DB transaction open.
        result = run_booking_saga(request)
        db.execute("""
            UPDATE booking_idempotency SET status = 'completed', response = %s
            WHERE key = %s
        """, (json.dumps(result), idempotency_key))
        redis.set(f"idem:{idempotency_key}", json.dumps(result), ex=86400)
        return result
    except Exception as e:
        db.execute("""
            UPDATE booking_idempotency SET status = 'failed', error = %s
            WHERE key = %s
        """, (str(e), idempotency_key))
        raise

Stripe's idempotency model is the canonical one: keys are valid for 24 hours, and each key permanently maps to one outcome. Any retry returns the cached response, even on failure — if the original request charged the card and returned 500, the retry returns the same 500 (with the charge having succeeded). This is correct: the system has no way to "redo" the operation safely.
Search: Sub-200ms Across Millions of Properties
The search problem decomposes:
- Autocomplete (50ms budget): "Par…" → "Paris, France." Trie-backed prefix index in Redis or a tuned Elasticsearch completion suggester. Personalized by user history.
- Geo + filter search (150ms budget): Find hotels within 5 km of (48.8566, 2.3522), 4+ stars, $100–$300/night, with pool. Elasticsearch geo_distance query + filter clauses + facet aggregations.
- Availability filter (joined at runtime): Search returns candidates; per-hotel availability is checked against DynamoDB inventory in parallel (~10 ms per check, 100 candidates, bounded by service mesh fanout).
- Pricing (per-result): Dynamic price = base × demand_multiplier × seasonality × user_currency_FX × (1 + tax_rate). Computed at query time, not stored.
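The runtime availability join can be sketched as a bounded fan-out. This is a simplified model: `check_fn` stands in for the per-hotel DynamoDB lookup, and a real service would short-circuit once enough hits are found rather than checking every candidate:

```python
from concurrent.futures import ThreadPoolExecutor

def filter_by_availability(candidate_ids, check_fn, limit=25, workers=32):
    """Fan out per-hotel availability checks; keep the first `limit` passes.

    With ~10 ms per check and ~100 candidates, a 32-wide pool keeps the
    whole join comfortably inside the 150 ms search budget.
    `pool.map` preserves input order, so ranking survives the fan-out.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(check_fn, candidate_ids))
    available = [h for h, ok in zip(candidate_ids, flags) if ok]
    return available[:limit]
```

Preserving the search ranking through the join matters: the availability check filters, it must never re-order.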
# Elasticsearch query for "hotels in Paris, 4+ stars, March 4-7, 2 guests"
POST /hotels/_search
{
  "query": {
    "bool": {
      "must": [
        { "geo_distance": { "distance": "5km", "location": { "lat": 48.8566, "lon": 2.3522 } } },
        { "range": { "stars": { "gte": 4 } } }
      ],
      "filter": [
        { "term": { "amenities": "pool" } },
        { "term": { "active": true } }
      ]
    }
  },
  "sort": [{ "_score": "desc" }, { "rating": "desc" }],
  "size": 50,
  "aggs": {
    "price_buckets": { "histogram": { "field": "min_price", "interval": 50 } },
    "amenity_facets": { "terms": { "field": "amenities", "size": 30 } },
    "stars_facets": { "terms": { "field": "stars" } }
  }
}
# Return: 50 candidate hotel IDs.
# Then: parallel-fetch availability for each candidate against DynamoDB.
# Discard those without inventory for Mar 4-7. Return first 25 that pass.

Dynamic Pricing
Hotel rates are not static. They depend on demand, lead time, day-of-week, length-of-stay, channel (direct vs OTA vs corporate), loyalty status, and the hotel's revenue management strategy. The pricing service typically:
# Pseudo-pricing pipeline
def price_room(hotel_id, room_type, dates, user_context):
    base_rate = rate_card[hotel_id][room_type]               # baseline
    occupancy = forecast_occupancy(hotel_id, dates)          # ML model
    demand_mult = 1 + 0.5 * (occupancy - 0.7)                # surge if >70% booked
    lead_time = (dates.check_in - today).days
    lead_mult = 1.0 if lead_time > 30 else (0.9 if lead_time > 7 else 1.1)
    los_discount = 0.95 if (dates.check_out - dates.check_in).days >= 5 else 1.0
    channel_mult = 1.0 if user_context.channel == "direct" else 1.18   # OTA commission
    loyalty_disc = 0.93 if user_context.loyalty_tier == "gold" else 1.0
    raw = base_rate * demand_mult * lead_mult * los_discount * channel_mult * loyalty_disc
    tax = raw * tax_rate(hotel_country(hotel_id))
    fx = fx_rate(hotel_currency, user_context.currency)      # hotel -> user currency
    return {"raw": raw, "tax": tax, "total_user_currency": (raw + tax) * fx}

# Cached per (hotel, room_type, date, channel) for ~5 minutes
# Invalidated by inventory writes

The price returned in search results is a forecast. The price quoted at booking step is locked for the duration of the hold. If 5 minutes pass and the user re-attempts, the price may have changed — the system must show the new price and ask for confirmation before charging.
GDS & PMS Integration: The Other Half of the System
For real-world hotel reservations, the booking system rarely owns inventory exclusively. Inventory is shared with:
- GDS (Global Distribution Systems): Sabre, Amadeus, Galileo/Travelport. Travel agents and airlines query these. Originally telex-based, modernized to SOAP/XML and now REST.
- OTAs (Online Travel Agents): Booking.com, Expedia, Hotels.com pull inventory and push reservations.
- PMS (Property Management Systems): Opera, Sihot, Mews — the on-premise software the hotel front desk runs. Source of truth for the hotel itself.
The integration shape is "channel manager": a service that synchronizes inventory and rates across all distribution channels. When a booking happens on Booking.com, the channel manager:
- Receives the booking via Booking.com webhook.
- Decrements local inventory.
- Pushes the inventory update to all other channels (Expedia, Sabre, hotel's own website) so the room can't be double-booked there.
- Pushes the reservation into the hotel's PMS so the front desk sees it.
The latency tolerance is loose — channel synchronization is typically eventually consistent within 10–60 seconds. The double-booking risk during this window is mitigated by overbooking policies (selling 105% of capacity, walking the unlucky 5% to a sister property) and pull-based rate-limiting.
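A minimal sketch of the webhook path, under the assumption of simple push clients (all names here are illustrative; a real channel manager also verifies webhook signatures and queues pushes for retry):

```python
def on_external_booking(event, inventory, channels, pms):
    """React to an already-verified OTA booking webhook.

    `channels` are push clients for every distribution channel;
    the sender is skipped to avoid the update-loop failure mode.
    """
    key = (event["hotel_id"], event["room_type"], event["night"])
    inventory[key] -= event["num_rooms"]          # local decrement first
    remaining = inventory[key]
    for ch in channels:
        if ch.name != event["source"]:            # never echo back to the sender
            ch.push_availability(key, remaining)
    pms.push_reservation(event)                   # front desk sees the booking
```

Skipping the source channel is the single-hop version of the loop-prevention rule described later: every update passes through the manager exactly once.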
Comparison with Airline Reservation Systems
| | Hotel Reservation | Airline Reservation |
|---|---|---|
| Inventory unit | Room-night | Seat on a flight leg |
| Booking duration | 1–30 nights typical | Single transaction (round-trip) |
| Cancellation policy | Often free <24h before | Strict; many fares non-refundable |
| Overbooking | Common (~5%) | Common (~5–10%) |
| Hold duration | 5–15 min | 15–30 min (seat map UX) |
| Seat selection | Room-type only (no specific room) | Specific seat assigned at booking |
| Pricing volatility | Hourly | Minutely (Sabre updates) |
| Standard protocol | OTA XML, GDS, OpenTravel | EDIFACT, NDC (newer), GDS |
Airlines have a harder concurrency problem because seats are individually identified. A hotel booking for "1 deluxe room" doesn't need to specify which deluxe room until check-in; the front desk assigns at arrival. An airline booking specifies seat 14A, and 14A cannot be sold to two passengers. The data model becomes (flight_segment, seat_id) with strict per-seat atomicity. SABRE was effectively the first computer reservation system in the 1960s precisely because airlines needed this consistency at scale.
Tradeoffs & Failure Modes
- Search-vs-availability skew. Elasticsearch is updated asynchronously. A search result may show a room that was just sold seconds ago. Mitigation: re-check availability at booking step; show "no longer available" if the row is gone. Cost: a small percentage of failed booking attempts after click-through.
- Hold leakage. A worker dies after taking a hold but before recording the booking_id mapping. The hold expires correctly, but the user thinks they have a confirmation. Mitigation: write the hold record + booking_id atomically (DDB conditional write with both keys).
- Payment retry under partial failure. Payment succeeded; booking confirmation write failed. The user sees an error; the card has been charged. Mitigation: refund through compensation; alert customer service. This is the worst case for customer experience.
- Time zone bugs. "Mar 4 night" means different real moments in different time zones. Inventory keys must use the hotel's local timezone, not UTC. Off-by-one errors here become actual double-bookings.
- Cancellation and refund races. A user cancels at minute 5; the system processes cancellation. At minute 5.5, a delayed event from the original booking arrives. Without idempotency, double-cancellation. Use event versioning + idempotent cancellation keys.
- Hot hotel partition. Times Square hotels for New Year's Eve get pummeled. DynamoDB hot-partition warnings; latency spikes. Mitigation: write-shard popular hotels (split into N virtual partitions, fan-in on read) or use adaptive capacity.
- Channel manager loops. If channel A pushes an update to channel B, which pushes back, you can get oscillation. Mitigation: channel-manager-as-source-of-truth; channels only emit events that go through the manager once.
- Currency rounding. Quote 89.99 EUR, charge in USD via Stripe FX, exchange moves between quote and charge: customer disputes "you charged $0.10 more than the quote." Lock FX rate at quote time; store both currencies on the booking record.
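The time-zone bullet above is easy to get wrong in code. A minimal sketch using Python's `zoneinfo` (the function name is illustrative) shows why the night key must come from the hotel's local clock, never from UTC:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def inventory_night_key(utc_instant: datetime, hotel_tz: str) -> str:
    """Derive the inventory night from an instant, in the HOTEL's local time.

    At 2026-03-05 01:00 UTC it is still the evening of Mar 4 in New York,
    so a same-evening booking there must hit night#2026-03-04, not Mar 5.
    """
    local = utc_instant.astimezone(ZoneInfo(hotel_tz))
    return f"night#{local.date().isoformat()}"

t = datetime(2026, 3, 5, 1, 0, tzinfo=ZoneInfo("UTC"))
print(inventory_night_key(t, "America/New_York"))   # night#2026-03-04
print(inventory_night_key(t, "Asia/Tokyo"))         # night#2026-03-05
```

The same UTC instant maps to different night keys at different properties, which is exactly the off-by-one that turns into a real double-booking if keys are computed in UTC.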
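The write-sharding mitigation for hot partitions amounts to a key scheme like the following sketch (shard count and key format are illustrative, not a real system's):

```python
import random

NUM_SHARDS = 8  # sized to the hotel's peak write rate; illustrative

def sharded_pk(hotel_id: str, room_type: str) -> str:
    """Writes scatter across N virtual partitions of one hot room-type."""
    shard = random.randrange(NUM_SHARDS)
    return f"hotel#{hotel_id}#room#{room_type}#s{shard}"

def all_shard_pks(hotel_id: str, room_type: str) -> list[str]:
    """Reads fan in: availability = sum of every shard's counters."""
    return [f"hotel#{hotel_id}#room#{room_type}#s{i}"
            for i in range(NUM_SHARDS)]
```

The catch is that conditional decrements become per-shard: one shard can be empty while others have stock, so the hold logic may need to retry on a different shard before concluding the hotel is sold out.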
FAQ
Why not use a relational DB for inventory? It supports SELECT FOR UPDATE.
You can — many hotel systems do exactly this. The trade-off is write throughput and horizontal scaling. SELECT FOR UPDATE on Postgres holds row locks until commit; under burst load, waits queue up and connection pools exhaust. DynamoDB's atomic conditional updates give you equivalent semantics with O(1) per-item latency at any scale, at the cost of a less expressive query language. For systems doing >1K bookings/sec at peak, DDB's predictable latency wins.
How do you handle a hotel that wants to sell different room types from the same physical pool?
Hotels often have "1 king bed" and "1 king bed with view" sharing the same physical inventory. Model with a "physical pool" key in the inventory grid (PK = pool_id) and "virtual products" that map to one pool with constraints. Booking either type decrements the same pool. This is where the data model gets messy and most off-the-shelf engines either require scripting or fall back to overbooking + manual reconciliation.
What about long stays that span months? DynamoDB's 100-item transaction limit?
Two options: (1) Move inventory to a relational DB for that hotel and lose horizontal scale. (2) Use a "stay-level" lock instead of per-night locks: a single row representing the whole reservation window, updated atomically. (2) is cleaner but requires reasoning about overlapping ranges; the query "is room R available Mar 4 - May 4?" becomes "no overlapping booking exists in (R, [Mar 4, May 4))," which needs a range index. Postgres GIST exclusion constraints solve this natively.
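The predicate behind the stay-level approach is the standard half-open interval test, which is also what a GIST exclusion constraint enforces for you atomically:

```python
from datetime import date

def overlaps(a_in: date, a_out: date, b_in: date, b_out: date) -> bool:
    """Two half-open stays [check_in, check_out) collide iff each
    starts before the other ends."""
    return a_in < b_out and b_in < a_out

# Back-to-back stays do NOT overlap: checkout day equals next check-in.
print(overlaps(date(2026, 3, 4), date(2026, 3, 7),
               date(2026, 3, 7), date(2026, 3, 10)))   # False
```

The half-open convention does the heavy lifting here: with closed intervals, back-to-back stays would falsely collide on the shared date.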
How does cancellation actually flow through the system?
User clicks Cancel. Booking service: (a) update booking.status = 'cancelled' in Postgres
atomically with idempotency key; (b) release inventory in DDB (sold -= 1 for each night); (c)
fire booking_cancelled event to Kafka; (d) downstream consumers: refund payment via
Stripe, send cancellation email, notify GDS/PMS. The order matters: increment inventory only
after the booking row is marked cancelled, so a concurrent rebook race uses the new inventory.
How do you handle group bookings (10 rooms in one transaction)?
One booking row with num_rooms=10. Inventory decrement is "sold += 10" rather than +1. The DDB conditional update fails atomically if there aren't 10 available. UX-wise, you typically degrade gracefully: if 10 aren't available but 8 are, offer the 8. This is application logic on top of the same primitives.
What's the right hold TTL? Why 5 minutes specifically?
A trade-off between user experience (long enough to enter card details, double-check dates) and inventory utilization (short enough that abandoned carts free up rooms quickly). 5 minutes is empirically a sweet spot for hotel booking on mobile/web; airlines use longer (15–30 min) because seat selection takes more clicks. Booking.com extends the hold if the user is actively interacting with the page (heartbeat refresh).
How does the search index stay in sync with inventory?
Inventory writes emit Kafka events. A search-index-updater consumes these events and updates the corresponding Elasticsearch documents. Lag is typically <30 seconds. The search index stores aggregate availability ("at least one room available somewhere in this 30-day window") rather than per-night state, because per-night-per-hotel updates would be too high a write rate for ES.
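A sketch of the updater's core fold, assuming events of the form `{"type": ..., "night": ...}` (illustrative; a real consumer reads from Kafka, keeps per-hotel state externally, and bulk-updates ES rather than mutating a dict):

```python
def apply_inventory_event(es_doc: dict, event: dict) -> dict:
    """Fold one inventory event into a hotel's aggregate search document.

    Instead of per-night state, the doc carries one coarse flag:
    "any room-night open in the tracked window". A set stands in for
    whatever compact representation (bitmap, date range) is stored.
    """
    nights = es_doc.setdefault("open_nights", set())
    if event["type"] == "sold_out":
        nights.discard(event["night"])
    elif event["type"] == "reopened":
        nights.add(event["night"])
    es_doc["has_availability"] = len(nights) > 0
    return es_doc
```

Because only transitions (sold out, reopened) are indexed rather than every counter change, the ES write rate stays proportional to sell-outs, not to bookings.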
How would you scale to 100M concurrent users searching during a sale event?
(1) CDN-cache the search results page heavily; even 30 seconds of caching shields ES from huge traffic. (2) Pre-compute popular searches (top destinations + dates) and serve from Redis. (3) Add ES read replicas (search is read-only). (4) Move autocomplete to a per-region edge service. (5) Lottery-based queueing for the booking step (similar to Ticketmaster). The booking path is actually easy to scale — bookings/sec scales linearly with inventory shards. The hard part is search.