Design a Notification System

A notification platform fans out a single business event (an order shipping, a friend posting, a security alert) into millions of per-user deliveries across push (APNs/FCM), email (SES/SendGrid), SMS (Twilio), and in-app channels. The system must respect per-user preferences, deduplicate retries, batch noisy events, throttle to protect downstream providers, and absorb traffic spikes of up to 10× the daily peak. Done well, it is invisible; done poorly, it spams users into uninstalling the app and gets your sending domain blocklisted.

Architecture

The platform sits between producer services (the e-commerce app, the social feed, the security pipeline) and delivery providers (APNs, FCM, SES, Twilio). Producers publish a logical notification; the system is responsible for resolving recipients, splitting per channel, applying preferences and quiet hours, and ensuring at-least-once delivery with deduplication.

Capacity Estimation

Take a consumer app with 100 M MAU, 20 M DAU, average 10 notifications per active user per day across all channels. That gives 200 M notifications/day ≈ 2,300 notif/s average. With a 5× peak factor (event-driven spikes — product launches, breaking news), provision for ~12 K notif/s.

Metric	Value	Notes
Average write throughput	2.3 K events/s	200 M / 86,400 s
Peak write throughput	12 K events/s	5× spike
Push payload size	~256 B	title + body + deep link
Email payload	10–50 KB	HTML + tracked links
Daily egress	~5 TB	weighted across channels
Token storage	~50 GB	500 B/device × 100 M devices
Inbox storage (90 d)	~18 TB	1 KB row × 200 M/day × 90
Provider rate ceiling	~9 K/s per APNs HTTP/2 conn	need 2–5 conns per worker

Rule of thumb: budget 10× daily peak for retries and provider backpressure, and assume the 99th-percentile user (push-spam celebrity, ops alerting on incident) drives 1000× the median.
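
The arithmetic behind these estimates can be reproduced in a few lines. This is a sketch using only the inputs quoted in this section (20 M DAU, 10 notifications per active user per day, a 5× peak factor, 500 B per device token, 1 KB per inbox row, 90-day retention); the variable names are illustrative, and the inbox figure follows directly from the row-size and retention inputs.

```python
# Back-of-envelope capacity numbers; every input is a figure quoted in the text.
DAU = 20_000_000
NOTIFS_PER_ACTIVE_USER_PER_DAY = 10
PEAK_FACTOR = 5
SECONDS_PER_DAY = 86_400

daily_notifications = DAU * NOTIFS_PER_ACTIVE_USER_PER_DAY   # 200 M per day
avg_per_second = daily_notifications / SECONDS_PER_DAY       # about 2.3 K/s
peak_per_second = avg_per_second * PEAK_FACTOR               # about 12 K/s

# Token storage: 500 B per device across 100 M devices (decimal GB).
token_storage_gb = 500 * 100_000_000 / 1e9

# Inbox storage: one 1 KB row per notification, kept for 90 days (decimal TB).
inbox_storage_tb = 1_000 * daily_notifications * 90 / 1e12

print(f"daily notifications: {daily_notifications:,}")
print(f"average rate:        {avg_per_second:,.0f}/s")
print(f"peak rate:           {peak_per_second:,.0f}/s")
print(f"token storage:       {token_storage_gb:.0f} GB")
print(f"inbox storage:       {inbox_storage_tb:.0f} TB")
```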

Logical Model: Event → Notification → Delivery

Three nouns, in this order:

Event — what happened in the business: order.shipped(order_id=42). One per real-world occurrence; producers may retry a publish, but every retry must carry the same idempotency key so the platform records the event once.
Notification — the user-facing message generated from an event after preference resolution: "Your order is on its way." There can be 0..N per event (silenced, batched, or fanned out to followers).
Delivery — one attempt to ship a notification through one channel: a push to APNs, an email through SES, an SMS via Twilio. A single notification can become 1–3 deliveries depending on user channel preferences.

Splitting these layers makes preferences, dedup, and analytics tractable: you measure event-to-notification fan-out, notification-to-delivery success, and delivery-to-engagement (open rate). Fold the layers together and you cannot tell whether a quiet user is opted out, throttled, or seeing failures.
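
One way to keep the three layers separate in code is to give each its own record type. The sketch below is illustrative only: the class and field names (Event, Notification, Delivery, plan_deliveries) are assumptions made for this example, not a schema from a specific system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Channel(Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"
    IN_APP = "in_app"


@dataclass
class Event:
    """What happened in the business; one per real-world occurrence."""
    event_type: str        # e.g. "order.shipped"
    payload: dict
    idempotency_key: str   # identical on every producer retry


@dataclass
class Notification:
    """User-facing message produced from an event after preference resolution."""
    notification_id: str
    event_key: str         # idempotency_key of the originating event
    user_id: str
    message: str


@dataclass
class Delivery:
    """One attempt to ship one notification through one channel."""
    notification_id: str
    channel: Channel
    attempt: int = 1
    status: str = "pending"


def plan_deliveries(notification: Notification, enabled: List[Channel]) -> List[Delivery]:
    """Expand a notification into one first-attempt delivery per enabled channel."""
    return [Delivery(notification.notification_id, ch) for ch in enabled]


event = Event("order.shipped", {"order_id": 42}, idempotency_key="evt-42")
note = Notification("n-1", event.idempotency_key, "user-7", "Your order is on its way.")
deliveries = plan_deliveries(note, [Channel.PUSH, Channel.EMAIL])
```

Because each Delivery points back to a Notification, and each Notification back to an Event, the three funnel ratios described above can be computed with simple joins.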

Push: APNs and FCM

Apple Push Notification service (APNs) and Firebase Cloud Messaging (FCM) are the only routes to a phone's lock screen. Both are HTTP/2 services with strict semantics:

Tokens are device-scoped, not user-scoped. A user with three iPhones has three APNs tokens. The token table is (user_id, device_id, platform, token, app_version, last_seen). Tokens expire silently — you only learn when APNs returns 410 Unregistered after a send.
One persistent HTTP/2 connection per worker, with multiplexed streams. APNs caps a single connection at ~9 K req/s; scale by maintaining a pool of connections per (provider, environment) and load-balancing across workers.
Priority and TTL. APNs apns-priority 5 vs 10 controls power-aware delivery; apns-expiration drops stale notifications instead of waking a phone after the user has already moved on.
Token feedback loop. Both services tell you which tokens are dead. Garbage-collect or you pay forever to send to invalid tokens and risk getting throttled.
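
The priority, TTL, and token-feedback points can be sketched as two small helpers. The header names (apns-topic, apns-priority, apns-expiration) and the response reasons (BadDeviceToken, Unregistered) are APNs' own; the Action enum, function names, and the choice to treat every other 4xx as non-retryable are assumptions for illustration.

```python
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    DONE = "done"
    DELETE_TOKEN = "delete_token"
    RETRY_LATER = "retry_later"
    DROP = "drop"


def build_apns_headers(topic: str, time_critical: bool, ttl_seconds: int, now: float) -> Dict[str, str]:
    """Build request headers: priority 10 delivers immediately, 5 is power-aware."""
    return {
        "apns-topic": topic,
        "apns-priority": "10" if time_critical else "5",
        # apns-expiration is a UNIX timestamp; APNs stops trying after it passes.
        "apns-expiration": str(int(now) + ttl_seconds),
    }


def classify_apns_response(status: int, reason: Optional[str] = None) -> Action:
    """Map an APNs HTTP response to what the worker should do next."""
    if status == 200:
        return Action.DONE
    if status == 410 or reason in ("BadDeviceToken", "Unregistered"):
        return Action.DELETE_TOKEN      # dead token: garbage-collect, never retry
    if status == 429 or status >= 500:
        return Action.RETRY_LATER       # throttled or provider-side failure
    return Action.DROP                  # other 4xx: the request itself is wrong
```

A worker would call classify_apns_response on every response and delete the (user_id, device_id) row from the token table whenever it returns DELETE_TOKEN.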

For Android, FCM offers data messages (silent, app-handled) vs notification messages (system tray). Use data messages for collapsible updates — chat, sports scores — with a collapse_key so only the latest survives in the OS queue. See the WhatsApp design for an example of using data messages to drive in-app sync.

Email: SES, SendGrid, and Sender Reputation

Email is the most regulated channel. Three considerations dominate:

Sender reputation — mailbox providers (Gmail, Outlook, Yahoo) score your sending IP and domain. On AWS SES, a complaint rate above 0.1% or a bounce rate above 5% puts the account under review, and sustained higher rates can pause sending. Warm up new IPs gradually (1 K/day → 10 K → 100 K over weeks). Use dedicated IPs for transactional traffic and shared IPs for low-volume traffic.
DKIM, SPF, and DMARC are required to land in the inbox. Since 2024, Gmail and Yahoo require bulk senders to pass DMARC alignment: the From: domain must match the DKIM signing domain or the SPF (envelope sender) domain.
Bounce and complaint handling. SES emits SNS notifications for bounces (hard: address invalid — suppress permanently; soft: mailbox full — retry up to 24 h) and complaints (user clicked "spam"). The notification system must consume these and update the suppression list before the next send to that address.
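
A minimal sketch of the bounce and complaint consumer follows. It assumes the function receives the SES notification JSON that SNS carries in its Message field; the field names (notificationType, bounceType, bouncedRecipients, complainedRecipients, emailAddress) follow the SES notification format, while the function name and the in-memory return value are illustrative.

```python
import json
from typing import Set


def addresses_to_suppress(ses_message: str) -> Set[str]:
    """Return the addresses that must be added to the suppression list."""
    msg = json.loads(ses_message)
    kind = msg.get("notificationType")

    if kind == "Bounce":
        bounce = msg["bounce"]
        # Only hard ("Permanent") bounces are suppressed; soft bounces are retried.
        if bounce.get("bounceType") != "Permanent":
            return set()
        recipients = bounce.get("bouncedRecipients", [])
    elif kind == "Complaint":
        recipients = msg["complaint"].get("complainedRecipients", [])
    else:
        return set()   # deliveries and unknown types need no suppression

    return {r["emailAddress"].lower() for r in recipients}


hard_bounce = json.dumps({
    "notificationType": "Bounce",
    "bounce": {
        "bounceType": "Permanent",
        "bouncedRecipients": [{"emailAddress": "Gone@example.com"}],
    },
})
print(addresses_to_suppress(hard_bounce))
```

In production the returned set would be written to the suppression store before the next send to those addresses is dequeued.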

SES enforces a per-second rate limit (MaxSendRate); SendGrid throughput scales with parallel API calls. Either way, the email worker enforces a global token bucket per (provider, region) and diverts overflow to a holding queue when saturated.
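
The token bucket with a holding queue can be sketched as below. This is a single-process illustration; a real deployment would keep the bucket state somewhere shared (for example Redis) so that all workers for one (provider, region) draw from the same budget. The class and function names and the injectable clock are assumptions for the example.

```python
import time
from typing import Callable, List


class TokenBucket:
    """Refills at rate_per_sec up to capacity; each send consumes one token."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = self.clock()
        # Credit the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def dispatch(message: str, bucket: TokenBucket,
             send: Callable[[str], None], holding_queue: List[str]) -> bool:
    """Send if the budget allows; otherwise park the message, never drop it."""
    if bucket.try_acquire():
        send(message)
        return True
    holding_queue.append(message)
    return False
```

A separate drain loop would re-dispatch from the holding queue as tokens become available.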

SMS: Twilio, the cost gradient, and OTP fraud

SMS is an order of magnitude or more expensive per message than push or email (US: ~$0.008; international: $0.05–$0.20). Three concerns:

Number pool routing. Twilio Messaging Services rotate sender numbers (long codes, short codes, toll-free) to avoid carrier filtering. For OTP, prefer short codes; for marketing, use 10DLC registered campaigns.
OTP fraud / pumping — bot farms trigger OTP sends to number ranges they control and collect a share of the carrier termination fees. Defenses: per-IP rate limit on the OTP endpoint, country allow-list, reCAPTCHA on signup, and an absolute spend cap per route per day.
Status callbacks. Twilio webhooks report queued → sent → delivered. delivered is best-effort: carriers in many countries do not return delivery receipts (DLRs). Treat sent as the durable state and delivered as observational.
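
Three of the OTP-pumping defenses listed above (per-IP rate limit, country allow-list, daily spend cap) fit in one small gate in front of the SMS send. The sketch is in-memory and single-process; the class name, the one-hour window, and the use of the destination country as the "route" are assumptions for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Tuple


class OtpGuard:
    def __init__(self, allowed_countries: Iterable[str],
                 max_per_ip_per_hour: int, daily_spend_cap_usd: float):
        self.allowed = set(allowed_countries)
        self.max_per_ip = max_per_ip_per_hour
        self.cap = daily_spend_cap_usd
        self.ip_hits: Dict[str, Deque[float]] = defaultdict(deque)
        self.spend: Dict[Tuple[str, int], float] = defaultdict(float)

    def allow(self, ip: str, country: str, cost_usd: float, now: float) -> bool:
        """Return True if this OTP SMS may be sent; record it if so."""
        if country not in self.allowed:
            return False                        # country allow-list

        window = self.ip_hits[ip]
        while window and now - window[0] >= 3600:
            window.popleft()                    # expire hits older than one hour
        if len(window) >= self.max_per_ip:
            return False                        # per-IP rate limit

        day = int(now // 86_400)
        if self.spend[(country, day)] + cost_usd > self.cap:
            return False                        # absolute spend cap per route per day

        window.append(now)
        self.spend[(country, day)] += cost_usd
        return True
```

The spend cap is the backstop: even if an attacker rotates IPs, the route stops sending once the day's budget is exhausted.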

SMS belongs only on the highest-priority queue: 2FA, password reset, ride confirmation. Marketing over SMS without explicit opt-in invites legal trouble (TCPA in the US) and carrier filtering everywhere.

Priority Queues and Multi-channel Orchestration

A single ingest topic is insufficient: a marketing blast must not delay a 2FA code. Partition the pipeline:

P0 — transactional: OTP, password reset, payment confirmation. Target end-to-end p99 < 5 s. Bypass batching. Dedicated workers, dedicated provider connections.
P1 — user-facing async: chat message, comment reply. Target p99 < 60 s. Standard pipeline, dedup window 10 s.
P2 — bulk/marketing: weekly digest, promotional. Soft deadline measured in hours. Aggressive batching, schedule against quiet hours, send-rate smoothing across the day.

Implement these as separate Kafka topics (or RabbitMQ queues), each with its own consumer group and rate budget — not as priority within a single topic, because head-of-line blocking will starve the high-priority work when the bulk producer floods. See the message queue page for the broader tradeoff space.
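
Routing by priority then reduces to a lookup at the ingest API. The topic names and event classes below are hypothetical; the one deliberate design choice is that unknown event classes default to the bulk tier, so an unregistered producer cannot crowd the transactional lane.

```python
# Hypothetical topic names, one per priority tier.
TOPIC_BY_PRIORITY = {
    "P0": "notif.p0.transactional",
    "P1": "notif.p1.async",
    "P2": "notif.p2.bulk",
}

# Hypothetical registry mapping event classes to tiers.
PRIORITY_BY_EVENT_CLASS = {
    "auth.otp": "P0",
    "auth.password_reset": "P0",
    "payment.confirmed": "P0",
    "chat.message": "P1",
    "comment.reply": "P1",
    "marketing.weekly": "P2",
}


def topic_for(event_class: str) -> str:
    """Pick the topic for an event class; unknown classes go to the bulk tier."""
    priority = PRIORITY_BY_EVENT_CLASS.get(event_class, "P2")
    return TOPIC_BY_PRIORITY[priority]


print(topic_for("auth.otp"))
print(topic_for("marketing.weekly"))
```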

Multi-channel orchestration is harder than it looks. The product question is: "Given that I will notify this user about event E, which channel(s) and in what order?" A common policy:

Try push (cheapest, lowest friction). If the device acknowledges within 60 s, stop.
Else try in-app inbox + email after a delay.
Escalate to SMS only for time-critical events the user has not seen on any channel within the SLA.

This requires state per notification, not per delivery: you cannot fire-and-forget. Store notification_id, state, last_attempt, next_action_at; a scheduler scans for due transitions. Task scheduling covers the durable-timer machinery.
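
The escalation policy can be sketched as a small state machine that a scheduler calls for each due row. The field names match the ones listed above; the state names, the 300-second SMS escalation delay, and the send callback signature are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SMS_ESCALATION_DELAY_S = 300   # assumed SLA before escalating to SMS


@dataclass
class NotificationState:
    notification_id: str
    state: str                       # "push_sent", "email_sent", "sms_sent", "seen"
    last_attempt: float
    next_action_at: Optional[float]  # None means no further transition is scheduled
    time_critical: bool = False


def mark_seen(n: NotificationState) -> None:
    """Called when any channel reports the user saw the notification."""
    n.state = "seen"
    n.next_action_at = None


def advance(n: NotificationState, now: float,
            send: Callable[[str, str], None]) -> NotificationState:
    """Apply the next due transition, if any; safe to call repeatedly."""
    if n.next_action_at is None or now < n.next_action_at:
        return n

    if n.state == "push_sent":
        # Push was not acknowledged in time: fall back to inbox plus email.
        send(n.notification_id, "in_app")
        send(n.notification_id, "email")
        n.state = "email_sent"
        n.next_action_at = (now + SMS_ESCALATION_DELAY_S) if n.time_critical else None
    elif n.state == "email_sent":
        # Still unseen and time-critical: escalate to SMS as the last step.
        send(n.notification_id, "sms")
        n.state = "sms_sent"
        n.next_action_at = None

    n.last_attempt = now
    return n
```

A row would be created in state push_sent with next_action_at set 60 s after the push, matching the acknowledgement window in the policy above.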

Deduplication and Idempotency

Two distinct deduplication problems:

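Producer-side idempotency — the producer retries publishing the same event after a network glitch. Guard at the API boundary with an Idempotency-Key (event-scoped UUID): in Redis, SET the key to the notification_id with the NX and EX 86400 options, which is set-if-absent with a 24-h TTL in one atomic command (plain SETNX cannot attach the TTL atomically). If the key already exists, return the prior notification_id. Without this, a flaky producer multiplies user pain by N.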
Per-user noise dedup — a user gets 50 likes in 3 minutes; you want one notification, not 50. Implement as a coalescing window: hash (user_id, event_class, target_object) into a Redis sorted set; the first event schedules a flush 30 s out; subsequent events extend the count and update the message ("Alex and 49 others liked your post").
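
The coalescing window can be sketched with an in-memory dictionary standing in for the Redis structure. The first event for a key fixes the flush time; later events only add actors. The class name, the render function, and the "liked your post" copy are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]   # (user_id, event_class, target_object)


def render(actors: List[str]) -> str:
    """Build the collapsed message from the distinct actors seen in the window."""
    first, others = actors[0], len(actors) - 1
    if others == 0:
        return f"{first} liked your post"
    if others == 1:
        return f"{first} and 1 other liked your post"
    return f"{first} and {others} others liked your post"


class Coalescer:
    def __init__(self, window_s: float = 30.0):
        self.window_s = window_s
        self.pending: Dict[Key, dict] = {}

    def add(self, user_id: str, event_class: str, target: str,
            actor: str, now: float) -> None:
        key = (user_id, event_class, target)
        bucket = self.pending.get(key)
        if bucket is None:
            # First event in the window schedules the flush.
            self.pending[key] = {"actors": [actor], "flush_at": now + self.window_s}
        elif actor not in bucket["actors"]:
            bucket["actors"].append(actor)

    def flush_due(self, now: float) -> List[Tuple[str, str]]:
        """Return (user_id, message) for every window whose flush time has passed."""
        due = [k for k, b in self.pending.items() if b["flush_at"] <= now]
        return [(k[0], render(self.pending.pop(k)["actors"])) for k in due]
```

Because the flush time is never extended, a steady stream of likes still produces a notification every 30 s rather than being delayed indefinitely.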

For exactly-once delivery to a provider: the provider call is not idempotent (APNs has no idempotency key). The best you can do is at-least-once with a delivery-id ledger: before sending, INSERT into delivery_attempts(notification_id, channel, attempt); on retry, check whether a prior attempt has a terminal status. Duplicates can still reach the user during pathological partitions; the application copy should be written so a duplicate is harmless.
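
A delivery-attempt ledger along these lines can be sketched with SQLite standing in for the production store. The table and column names follow the text; the status values, the choice of which statuses count as terminal, and the helper names are assumptions.

```python
import sqlite3
from typing import Optional

TERMINAL_STATUSES = ("delivered", "rejected")   # assumed terminal states


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS delivery_attempts (
               notification_id TEXT NOT NULL,
               channel         TEXT NOT NULL,
               attempt         INTEGER NOT NULL,
               status          TEXT NOT NULL DEFAULT 'pending',
               PRIMARY KEY (notification_id, channel, attempt))"""
    )
    return db


def begin_attempt(db: sqlite3.Connection, notification_id: str, channel: str) -> Optional[int]:
    """Record a new attempt, or return None if a prior attempt already finished."""
    rows = db.execute(
        "SELECT status FROM delivery_attempts WHERE notification_id = ? AND channel = ?",
        (notification_id, channel),
    ).fetchall()
    if any(status in TERMINAL_STATUSES for (status,) in rows):
        return None
    attempt = len(rows) + 1
    db.execute(
        "INSERT INTO delivery_attempts (notification_id, channel, attempt) VALUES (?, ?, ?)",
        (notification_id, channel, attempt),
    )
    db.commit()
    return attempt


def finish_attempt(db: sqlite3.Connection, notification_id: str,
                   channel: str, attempt: int, status: str) -> None:
    db.execute(
        "UPDATE delivery_attempts SET status = ? "
        "WHERE notification_id = ? AND channel = ? AND attempt = ?",
        (status, notification_id, channel, attempt),
    )
    db.commit()
```

The ledger narrows the duplicate window but does not close it: a worker can crash after the provider accepts the send and before finish_attempt runs, which is why the copy itself should tolerate a repeat.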

Throttling, Backoff, and the Provider Cliff

Every external provider has a rate cliff that is rarely documented precisely. Push SES past your sending quota and it starts returning throttling errors; APNs can reset streams; Twilio can queue SMS for hours. The notification system must:

Maintain a token bucket per (provider, region) sized at 80% of the official quota. Reject overflow into a delay queue, not a drop.
Implement exponential backoff with full jitter on retry — never synchronized retries, or the next minute will hit 2× the failed minute. See rate-limiter design for token-bucket internals.
Observe provider error codes as a signal: SES Throttling means slow down globally; APNs BadDeviceToken means delete the token, do not retry; Twilio 30007 means the carrier filtered — route around with a different sender number.
Maintain a circuit breaker per provider with a half-open probe. When SES is fully out, mark the breaker open for 30 s, send through SendGrid as a backup, and probe on a single shadow request.
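
Two of the items above, full-jitter backoff and a circuit breaker with a half-open probe, can be sketched as follows. The base delay, cap, failure threshold, and class shape are assumptions; only the 30 s open period comes from the text.

```python
import random
from typing import Callable, Optional


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                      rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a uniform random delay between 0 and min(cap, base * 2^attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.probe_in_flight = False

    def allow(self, now: float) -> bool:
        """Closed: allow everything. Open: allow one probe after the cool-down."""
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.open_seconds and not self.probe_in_flight:
            self.probe_in_flight = True    # half-open: exactly one probe request
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self, now: float) -> None:
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.probe_in_flight or self.failures >= self.failure_threshold:
            self.opened_at = now
            self.probe_in_flight = False
```

While allow returns False for the primary provider, the worker would route transactional traffic to the backup provider and leave bulk traffic in the delay queue.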

User Preferences and Quiet Hours

The preference service stores, for each user: per-event-class opt-in matrix (e.g., marketing.weekly: email yes, push no, sms no), quiet hours in the user's local timezone, and channel-specific settings (push device tokens, verified phone, email). The preference resolver is on the hot path — every notification queries it — so it must be cached aggressively. Two-tier cache: in-process LRU for the worker (5 min TTL) backed by Redis (1 h TTL) backed by the source-of-truth Postgres row.

Quiet hours hide a subtle bug. Store the schedule in user-local time together with the user's IANA timezone, then compute "is now in quiet hours?" against the user's wall clock at delivery time. A stored UTC range is off by an hour after every DST shift.
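
A minimal sketch of the wall-clock check, using Python's standard zoneinfo module (Python 3.9+, with system timezone data available). The function name and the 22:00 to 07:00 window are illustrative.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def in_quiet_hours(now_utc: datetime, tz_name: str, start: time, end: time) -> bool:
    """Check the user's local wall clock against a quiet window stored in local time."""
    local = now_utc.astimezone(ZoneInfo(tz_name)).time()
    if start <= end:
        return start <= local < end
    # Window wraps midnight, e.g. 22:00 to 07:00.
    return local >= start or local < end


quiet_start, quiet_end = time(22, 0), time(7, 0)

# The same UTC instant of day lands on different sides of the window by season:
winter = datetime(2024, 1, 15, 2, 30, tzinfo=timezone.utc)   # 21:30 in New York (EST)
summer = datetime(2024, 7, 15, 2, 30, tzinfo=timezone.utc)   # 22:30 in New York (EDT)
print(in_quiet_hours(winter, "America/New_York", quiet_start, quiet_end))
print(in_quiet_hours(summer, "America/New_York", quiet_start, quiet_end))
```

The two calls differ only in the date, which is exactly the case a frozen UTC range gets wrong.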

In-app Inbox

The badge count and the inbox screen are different from push: they are a read model over the user's notification history. Two storage choices:

Wide-row Cassandra / DynamoDB — partition by user_id, sort by created_at desc. Reads are a single query; writes are a single PUT. Used by Twitter, Instagram. See Cassandra for partition-design tradeoffs.
Postgres with archived partitions — simpler for < 1 M users; partition by (user_id % 64, created_at) and prune monthly.

The unread count is a fast-changing aggregate: do not COUNT(*) on every screen open. Maintain a counter on a separate row (or a Redis HASH) with atomic INCR on insert and DECR on read. Reconcile via a daily background job.
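
The counter-plus-reconcile pattern can be sketched in memory as below; in production the counter would be the Redis HASH field or counter row described above. The class and method names are illustrative. Note that mark_read only decrements when the row was actually unread, so a retried "mark read" request does not push the counter negative.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List


class Inbox:
    def __init__(self) -> None:
        self.rows: DefaultDict[str, List[Dict]] = defaultdict(list)
        self.unread: DefaultDict[str, int] = defaultdict(int)   # the fast counter

    def insert(self, user_id: str, notification_id: str) -> None:
        self.rows[user_id].append({"id": notification_id, "read": False})
        self.unread[user_id] += 1           # INCR on insert

    def mark_read(self, user_id: str, notification_id: str) -> None:
        for row in self.rows[user_id]:
            if row["id"] == notification_id and not row["read"]:
                row["read"] = True
                self.unread[user_id] -= 1   # DECR only on a real state change

    def reconcile(self, user_id: str) -> int:
        """Recount from the rows, fix the counter, and return the drift found."""
        actual = sum(1 for row in self.rows[user_id] if not row["read"])
        drift = self.unread[user_id] - actual
        self.unread[user_id] = actual
        return drift
```

The value returned by reconcile is worth exporting as a metric: persistent non-zero drift usually points at a lost increment or a double decrement somewhere in the write path.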

Failure Modes and Tradeoffs

Provider outage — APNs has had multi-hour outages. Fan out to a secondary provider only for transactional traffic; let bulk wait. Do not flap between providers within the same retry chain or you double-deliver.
Thundering herd from a marketing blast — a single SQL query "all users in country X" emits 50 M events at once. Defense: the producer-side API has a per-tenant rate limit on event volume.
Loop bug — a notification triggers an event that triggers a notification. Defenses: producers must not subscribe to delivery-status events; the API rejects events with caused_by chains longer than 3.
Spam-flag cascade — users mark a marketing email as spam, mailbox provider drops your reputation, transactional emails start landing in spam, users mark them spam too — your domain is now poisoned. Mitigation: separate sending domains for transactional vs marketing.
Cold-start preference miss — the cache is cold and the preference DB is overloaded. Default to conservative: do not send. Better a missed notification than spamming an opted-out user.

Observability

Track per-channel and per-priority funnels:

Ingest: events received vs accepted vs rejected (preference, dedup, malformed).
Channel funnel: notifications → deliveries → provider 2xx → engagement (open, click).
Latency: event-receive to provider-ack p50/p95/p99, broken down by priority.
Provider health: per-provider error rate, token-bucket utilization, circuit-breaker state.

The single most useful chart is delivery rate by channel by hour; outages and silent regressions show up as a step change. Wire this into Prometheus + Alertmanager, with a 15-minute window and a 20% deviation threshold.

FAQ

Why not just use AWS SNS for everything?

SNS multi-channel is fine for low-volume, single-tenant use cases. It does not solve preferences, dedup, batching, or in-app inbox — you still need most of this system. SNS is one valid backend for the push channel; treat it as such.

Push or email first?

Push, if the user has the app installed and has granted notification permission: it is cheaper and faster. Fall back to email when push fails or for users who never granted push permission. Use SMS only as a last resort or for OTP.

How do you handle 100 M-follower celebrities?

Same celebrity problem as social graph fan-out: do not fan out at write time. Either pre-aggregate (digest) or pull-model the notification at user open. At the ~12 K notif/s peak provisioned above, a pure push fan-out to 100 M users would occupy the entire pipeline for more than two hours.

Why partition Kafka by user_id rather than event_id?

So a single user's deliveries process in order on one consumer — you preserve "Alex liked your post" before "Alex commented on your post". Across users, parallelism is fine.
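
The ordering guarantee comes from the key-to-partition mapping being deterministic. The sketch below uses CRC32 as a stand-in hash (Kafka's default partitioner hashes the key bytes with murmur2, but the principle is the same); the function name and the partition count are illustrative.

```python
import zlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a fixed partition so that user's messages stay in order."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


# Every message keyed by the same user lands on the same partition,
# so one consumer sees that user's notifications in publish order.
liked = partition_for("user-7", 64)
commented = partition_for("user-7", 64)
print(liked == commented)
```

Changing the partition count remaps users, so ordering is only guaranteed while the count stays fixed.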

Can the notification system be exactly-once?

End-to-end exactly-once across third-party providers is impossible — APNs gives you no idempotency key. Aim for at-least-once with delivery-id deduplication and write user-facing copy so duplicates are tolerable ("Your order is on its way" is fine to send twice).

Where do A/B tests fit?

Variant assignment happens in the API layer before enqueue; the variant is part of the notification payload. Engagement events flow back into a separate analytics pipeline (not the delivery store) so experiment metrics do not bloat the hot path.