Design WhatsApp
End-to-End Encryption, WebSocket Messaging, Delivery Receipts, Group Fan-out, and Presence at Scale
WhatsApp handles 100B+ messages per day across 2B+ monthly active users. The core challenges are: implementing end-to-end encryption using the Signal Protocol so that even the server cannot read messages; maintaining persistent WebSocket connections for real-time delivery; tracking message delivery status (sent, delivered, read) with receipts; supporting group messaging with fan-out to up to 1024 members; building a presence system that tracks online/offline status for billions of users; and handling media transfer (images, video, documents) with separate upload/download paths. At this scale, each server holds millions of concurrent WebSocket connections, and combined message and media traffic reaches petabytes per day.
End-to-End Encryption Visualizer
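WhatsApp uses the Signal Protocol for E2E encryption. Each device generates public/private key pairs and publishes only the public keys. Sender and recipient each combine their own private key with the other's public key in a Diffie-Hellman key exchange, arriving at the same shared secret without ever transmitting it. The server relays only encrypted ciphertext, so it cannot read the plaintext.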
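The key agreement step can be sketched in a few lines. This is a toy illustration, not the Signal Protocol itself: it uses finite-field Diffie-Hellman over a small Mersenne prime with only the standard library, whereas Signal uses elliptic-curve DH (X25519) plus further key ratcheting. The names `generate_keypair` and `shared_key` and the parameters `P` and `G` are assumptions made for this example.

```python
import hashlib
import secrets

# Toy parameters, for illustration only. A 127-bit modulus is far too small
# for real use; production systems use elliptic-curve DH instead.
P = 2**127 - 1  # a Mersenne prime
G = 3           # public base shared by both parties


def generate_keypair():
    """Return (private, public) for the toy DH group."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_key(my_private, their_public):
    """Derive a 32-byte symmetric key from the DH shared secret."""
    secret = pow(their_public, my_private, P)
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()


# Only the public halves ever travel through the server.
alice_private, alice_public = generate_keypair()
bob_private, bob_public = generate_keypair()

alice_key = shared_key(alice_private, bob_public)
bob_key = shared_key(bob_private, alice_public)
print(alice_key == bob_key)
```

Both sides end up with the same key because (G^a)^b and (G^b)^a are equal modulo P, while an observer who sees only the two public values does not learn either private exponent.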
Message Delivery Status Simulator
WhatsApp uses a three-stage delivery receipt system: a single check mark means the server received the message, double check marks mean the recipient's device received it, and blue double check marks mean the recipient opened and read the message.
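The three stages map naturally onto an ordered state that only moves forward. The sketch below assumes receipts can arrive duplicated or out of order, which the text does not state; `Status` and `apply_receipt` are hypothetical names.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1       # single check mark: server received the message
    DELIVERED = 2  # double check marks: recipient's device received it
    READ = 3       # blue double check marks: recipient opened it


def apply_receipt(current, receipt):
    """Return the new status. A late or duplicate receipt never downgrades it."""
    return max(current, receipt)


status = Status.SENT
status = apply_receipt(status, Status.READ)       # read receipt arrives first
status = apply_receipt(status, Status.DELIVERED)  # late delivery receipt is ignored
print(status.name)
```

Treating the status as a monotonic maximum keeps the sender's view consistent even when the network reorders receipts.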
Capacity Estimation Calculator
Back-of-the-envelope math for a WhatsApp-scale messaging system, covering throughput, storage, bandwidth, and infrastructure requirements.
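A minimal version of that arithmetic, using the 100B messages per day figure from the introduction and the ~200-byte text message size given under Media Handling. The peak factor of 3 is an assumption chosen for illustration, and `estimate` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def estimate(messages_per_day=100e9, avg_message_bytes=200, peak_factor=3):
    """Rough throughput and text-storage numbers for a messaging system."""
    avg_per_sec = messages_per_day / SECONDS_PER_DAY
    return {
        "avg_msgs_per_sec": avg_per_sec,
        "peak_msgs_per_sec": avg_per_sec * peak_factor,  # assumed peak multiplier
        "text_tb_per_day": messages_per_day * avg_message_bytes / 1e12,
    }


for name, value in estimate().items():
    print(f"{name}: {value:,.0f}")
```

With these inputs the average rate is roughly 1.16 million messages per second and text payloads alone come to about 20 TB per day; media dominates the remaining storage and bandwidth.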
Group Messaging Fan-out
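When a message is sent to a group, the server must deliver it to every member. There are two strategies: fan-out on write (copy the message to each member's queue) and group-level storage with pointers (store the message once and let each member track a position in the group log). The trade-off grows with group size: fan-out on write costs one write per member for every message, while group storage costs a single write but moves the work to read time.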
Fan-out on Write
Copy each message to every member's inbox queue
Group Storage + Pointers
Store once, each member holds a pointer to group log
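The two strategies can be compared with a small in-memory sketch. The class names and the dict/list storage are assumptions standing in for real queues and a real group log.

```python
from collections import defaultdict


class FanOutOnWrite:
    """Copy each message into every member's inbox queue."""

    def __init__(self):
        self.inboxes = defaultdict(list)

    def send(self, members, message):
        for member in members:
            self.inboxes[member].append(message)
        return len(members)  # number of writes performed


class GroupLogWithPointers:
    """Store each message once; every member holds a pointer into the log."""

    def __init__(self):
        self.log = []
        self.pointer = defaultdict(int)  # member -> index of next unread entry

    def send(self, members, message):
        self.log.append(message)
        return 1  # a single write regardless of group size

    def fetch(self, member):
        start = self.pointer[member]
        self.pointer[member] = len(self.log)
        return self.log[start:]


members = [f"user{i}" for i in range(1024)]
print(FanOutOnWrite().send(members, "hello"))         # writes for one message
print(GroupLogWithPointers().send(members, "hello"))  # writes for one message
```

For a full 1024-member group, fan-out on write performs 1024 writes per message while the group log performs one, at the cost of each member reading from the shared log on fetch.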
Presence System (Online/Offline)
Tracking who is online is expensive at scale. Each user sends periodic heartbeats; the server marks them offline after a timeout. Optimizations: only track presence for contacts, batch updates, and use a pub/sub model to push status changes to interested subscribers.
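The heartbeat, timeout, and pub/sub pieces fit together roughly as follows. This is a single-process sketch under stated assumptions: the 30-second timeout is illustrative, timestamps are passed in explicitly so the logic is easy to follow, and `PresenceTracker` with its `events` list stands in for a real pub/sub channel.

```python
from collections import defaultdict


class PresenceTracker:
    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds
        self.last_seen = {}                  # user -> time of last heartbeat
        self.online = set()
        self.subscribers = defaultdict(set)  # user -> contacts watching that user
        self.events = []                     # (watcher, user, status) pushes

    def subscribe(self, watcher, user):
        """Only contacts who subscribed receive this user's status changes."""
        self.subscribers[user].add(watcher)

    def heartbeat(self, user, now):
        self.last_seen[user] = now
        if user not in self.online:
            self.online.add(user)
            self._publish(user, "online")

    def sweep(self, now):
        """Mark users offline whose last heartbeat is older than the timeout."""
        for user in list(self.online):
            if now - self.last_seen[user] > self.timeout:
                self.online.discard(user)
                self._publish(user, "offline")

    def _publish(self, user, status):
        for watcher in sorted(self.subscribers[user]):
            self.events.append((watcher, user, status))


tracker = PresenceTracker(timeout_seconds=30)
tracker.subscribe("bob", "alice")
tracker.heartbeat("alice", now=0)
tracker.heartbeat("alice", now=10)  # no event: status did not change
tracker.sweep(now=45)               # 35s since last heartbeat, so alice goes offline
print(tracker.events)
```

Publishing only on status transitions, and only to subscribed contacts, is what keeps the update volume far below the raw heartbeat volume.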
High-Level Architecture
WhatsApp's architecture is built around persistent WebSocket connections for real-time delivery, with a message queue for reliable async processing and separate services for users, presence, and media.
Key Design Decisions
WebSocket vs Long Polling
WebSocket
- Full-duplex, persistent connection
- Sub-100ms message delivery
- Server can push without client request
- Efficient for high-frequency messaging
- Connection state must be tracked
Long Polling
- HTTP-based, simpler infra
- Higher latency (seconds)
- Client initiates every request
- Better for low-frequency updates
- Stateless, easier load balancing
Message Storage: Cassandra vs MySQL
Cassandra
- Write-optimized, append-only
- Linear horizontal scaling
- Tunable consistency (ONE for writes)
- Time-series data model fits messages
- No single point of failure
MySQL
- ACID transactions for critical ops
- Rich query capabilities
- Complex sharding logic needed
- Resharding is painful
- Better for user metadata
Media Handling
Media is handled separately from text messages. The sender uploads the encrypted media file to object storage (S3) via a dedicated media service, receives a URL, and sends the URL as part of the message. The recipient downloads the media independently. This keeps the message pipeline lightweight -- text messages are ~200 bytes while media files average 200KB+. Media is encrypted client-side with a per-file key included in the message metadata.
Connection Routing
Each user's WebSocket connects to one server. Redis maps user_id → server_id for routing. When Alice sends to Bob, Alice's server looks up Bob's server in Redis and forwards the message to it. If Bob is connected in a different datacenter, the message routes via an internal message bus. During deploys, connections are drained: new connections go to new servers while existing ones finish gracefully with a 30s timeout.
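The routing lookup can be sketched with a plain dict standing in for the Redis user_id → server_id map. The `Router` class, the returned action labels, and the offline queue for recipients with no live connection are assumptions made for this example.

```python
from collections import defaultdict


class Router:
    def __init__(self):
        self.user_to_server = {}                # stand-in for the Redis mapping
        self.offline_queue = defaultdict(list)  # assumed store-and-forward queue

    def connect(self, user_id, server_id):
        self.user_to_server[user_id] = server_id

    def disconnect(self, user_id):
        self.user_to_server.pop(user_id, None)

    def route(self, sender_server, recipient, message):
        """Decide where a message goes based on the recipient's connection."""
        target = self.user_to_server.get(recipient)
        if target is None:
            self.offline_queue[recipient].append(message)
            return ("queued", None)
        if target == sender_server:
            return ("local", target)    # recipient is on the same server
        return ("forward", target)      # hand off to the recipient's server


router = Router()
router.connect("alice", "server-1")
router.connect("bob", "server-2")
print(router.route("server-1", "bob", "hi Bob"))
router.disconnect("bob")
print(router.route("server-1", "bob", "are you there?"))
```

Removing the mapping on disconnect is what lets a draining server hand its users over: once a client reconnects elsewhere, `connect` overwrites the entry and new messages follow it.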