Design a Chat System
WebSocket, Message Ordering, Presence Detection, and Fan-out Strategies
A chat system (e.g., WhatsApp, Slack, Discord) delivers messages in real time between users, either 1:1 or in groups. The core challenges are maintaining persistent connections at scale (millions of concurrent WebSockets), ensuring message ordering and delivery guarantees, tracking online presence, and handling media attachments. At billions of messages per day, you need stateful chat servers, a message queue for fan-out, and a combination of MySQL (user data) and a KV store (message history) for storage.
WebSocket vs Polling Comparison
WebSocket, long polling, and short polling differ in connection overhead, latency, and bandwidth, and the differences depend on how many messages per second each client receives.
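The comparison can be made concrete with a rough per-client model. This is a minimal sketch, not a benchmark: `HTTP_OVERHEAD_BYTES` and `WS_FRAME_OVERHEAD_BYTES` are assumed round numbers, the function names are hypothetical, and long polling is approximated as one request per message (or one per timeout when idle).

```python
from dataclasses import dataclass

# Assumed protocol overheads per exchange (illustrative values, not measurements).
HTTP_OVERHEAD_BYTES = 700      # request + response headers for one HTTP round trip
WS_FRAME_OVERHEAD_BYTES = 6    # header bytes for one small WebSocket frame


@dataclass
class Overhead:
    requests_per_sec: float        # HTTP requests (or WebSocket frames) per client
    overhead_bytes_per_sec: float  # protocol bytes, excluding message payload
    avg_polling_delay_s: float     # extra wait introduced by the polling interval


def short_polling(msgs_per_sec: float, poll_interval_s: float) -> Overhead:
    # The client polls at a fixed rate whether or not there is anything new.
    rps = 1.0 / poll_interval_s
    return Overhead(rps, rps * HTTP_OVERHEAD_BYTES, poll_interval_s / 2)


def long_polling(msgs_per_sec: float, timeout_s: float) -> Overhead:
    # A held request completes when a message arrives or when the timeout fires.
    rps = max(msgs_per_sec, 1.0 / timeout_s)
    return Overhead(rps, rps * HTTP_OVERHEAD_BYTES, 0.0)


def websocket(msgs_per_sec: float) -> Overhead:
    # One persistent connection; each message costs only a small frame header.
    return Overhead(msgs_per_sec, msgs_per_sec * WS_FRAME_OVERHEAD_BYTES, 0.0)


if __name__ == "__main__":
    for name, result in [
        ("short polling", short_polling(2.0, poll_interval_s=2.0)),
        ("long polling", long_polling(2.0, timeout_s=30.0)),
        ("websocket", websocket(2.0)),
    ]:
        print(name, result)
```

Under these assumptions, short polling pays a fixed header cost even when the chat is idle, while the WebSocket cost scales only with actual traffic.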
Capacity Estimation
Estimate resources needed for a large-scale chat system.
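A back-of-the-envelope estimate can be sketched as a small function. Every input below (daily message count, average message size, concurrent connections, connections per server, peak factor) is an illustrative assumption chosen to give round numbers, not a figure for any real service.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_capacity(daily_messages, avg_message_bytes,
                      concurrent_connections, connections_per_server,
                      peak_factor=2.0):
    """Return rough throughput, storage, and server-count estimates."""
    avg_msgs_per_sec = daily_messages / SECONDS_PER_DAY
    return {
        "avg_msgs_per_sec": avg_msgs_per_sec,
        # Assume peak traffic is a constant multiple of the daily average.
        "peak_msgs_per_sec": avg_msgs_per_sec * peak_factor,
        "storage_bytes_per_day": daily_messages * avg_message_bytes,
        # Each stateful chat server holds a bounded number of WebSockets.
        "chat_servers": math.ceil(concurrent_connections / connections_per_server),
    }


if __name__ == "__main__":
    estimate = estimate_capacity(
        daily_messages=8_640_000_000,       # assumed: 8.64 billion messages/day
        avg_message_bytes=100,              # assumed average payload size
        concurrent_connections=10_000_000,  # assumed concurrent WebSockets
        connections_per_server=50_000,      # assumed per-server connection limit
    )
    for key, value in estimate.items():
        print(f"{key}: {value:,.0f}")
```

With these inputs the sketch yields 100,000 messages/s on average, 864 GB of new message data per day, and 200 chat servers before adding headroom for redundancy.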
Architecture
Key Design Decisions
1:1 Chat vs Group Chat Fan-out
1:1 Chat
- Direct delivery via WebSocket
- Simple: one sender, one receiver
- Store in per-conversation KV partition
Group Chat
- Fan-out write to each member's inbox
- Message queue for async delivery
- Limit group size (e.g. 500) to bound fan-out
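The group path (queue, then fan-out on write into per-member inboxes) can be sketched in memory as follows. `FanOutService` and its methods are hypothetical names; a `deque` stands in for the message queue and a dict of lists stands in for the inbox store.

```python
from collections import defaultdict, deque

MAX_GROUP_SIZE = 500  # the example cap mentioned above


class FanOutService:
    def __init__(self):
        self.groups = {}                  # group_id -> set of member ids
        self.inboxes = defaultdict(list)  # user_id -> list of delivered messages
        self.queue = deque()              # stand-in for an async message queue

    def create_group(self, group_id, members):
        members = set(members)
        if len(members) > MAX_GROUP_SIZE:
            raise ValueError(f"group exceeds {MAX_GROUP_SIZE} members")
        self.groups[group_id] = members

    def send(self, group_id, sender, text):
        # The sender only enqueues; delivery happens asynchronously.
        if sender not in self.groups[group_id]:
            raise PermissionError("sender is not a group member")
        self.queue.append((group_id, sender, text))

    def drain(self):
        """Worker loop body: write each queued message to every other member's inbox."""
        delivered = 0
        while self.queue:
            group_id, sender, text = self.queue.popleft()
            for member in self.groups[group_id]:
                if member != sender:
                    self.inboxes[member].append((group_id, sender, text))
                    delivered += 1
        return delivered


if __name__ == "__main__":
    service = FanOutService()
    service.create_group("team", ["alice", "bob", "carol"])
    service.send("team", "alice", "hello")
    print(service.drain(), dict(service.inboxes))
```

The write cost per message is proportional to group size, which is why capping membership bounds the work each send can trigger.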
Message Storage: MySQL vs KV Store
MySQL
- User profiles, contacts, groups
- Strong consistency, ACID
- Moderate read/write volume
KV Store
- Append-heavy, sequential reads
- Keyed by (chat_id, timestamp): partition on chat_id, sort by timestamp
- Billions of rows, which suits a wide-column store such as Cassandra or HBase
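The access pattern for message history (append, then read a chat's messages in time order) can be illustrated with an in-memory stand-in. `MessageStore` is a hypothetical class, not a Cassandra or HBase client; it only mimics partitioning by `chat_id` with rows kept sorted by an integer millisecond timestamp.

```python
import bisect
from collections import defaultdict


class MessageStore:
    def __init__(self):
        # chat_id -> list of (timestamp_ms, body), kept sorted by timestamp
        self._partitions = defaultdict(list)

    def append(self, chat_id, timestamp_ms, body):
        # insort keeps the partition ordered even if writes arrive out of order.
        bisect.insort(self._partitions[chat_id], (timestamp_ms, body))

    def read_since(self, chat_id, after_ms, limit=50):
        """Return up to `limit` messages with timestamp > after_ms, oldest first."""
        rows = self._partitions.get(chat_id, [])
        # (after_ms + 1,) sorts before any (after_ms + 1, body) tuple and after
        # every (after_ms, body) tuple, so this finds the first newer row.
        start = bisect.bisect_left(rows, (after_ms + 1,))
        return rows[start:start + limit]


if __name__ == "__main__":
    store = MessageStore()
    store.append("chat-1", 30, "third")
    store.append("chat-1", 10, "first")
    store.append("chat-1", 20, "second")
    print(store.read_since("chat-1", after_ms=10))
```

Because every read touches a single partition and scans a contiguous range, the query maps directly onto how wide-column stores lay data out on disk.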
Online Presence
Each client sends a heartbeat every 5 s over its WebSocket. If no heartbeat arrives for 30 s, the server marks the user offline. For groups, fan out presence updates only to online members. Use a pub/sub channel per user so that friends subscribe to each other's status changes, which avoids polling.
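The heartbeat timeout and per-user pub/sub channel can be sketched as below. `PresenceTracker` is a hypothetical in-memory class; callbacks stand in for pub/sub subscribers, and the current time is passed in explicitly so the behavior is deterministic.

```python
from collections import defaultdict

HEARTBEAT_INTERVAL_S = 5  # how often clients are expected to send a heartbeat
OFFLINE_AFTER_S = 30      # silence threshold before a user is marked offline


class PresenceTracker:
    def __init__(self):
        self.last_seen = {}                   # user_id -> time of last heartbeat
        self.online = set()                   # users currently considered online
        self.subscribers = defaultdict(list)  # user_id -> callbacks (the "channel")

    def subscribe(self, user_id, callback):
        """A friend subscribes to user_id's status changes."""
        self.subscribers[user_id].append(callback)

    def _publish(self, user_id, status):
        for callback in self.subscribers.get(user_id, ()):
            callback(user_id, status)

    def heartbeat(self, user_id, now):
        self.last_seen[user_id] = now
        if user_id not in self.online:
            # Only transitions are published, not every heartbeat.
            self.online.add(user_id)
            self._publish(user_id, "online")

    def sweep(self, now):
        """Run periodically: mark users offline after too long without a heartbeat."""
        for user_id in list(self.online):
            if now - self.last_seen[user_id] >= OFFLINE_AFTER_S:
                self.online.discard(user_id)
                self._publish(user_id, "offline")


if __name__ == "__main__":
    tracker = PresenceTracker()
    tracker.subscribe("alice", lambda user, status: print(user, "is", status))
    tracker.heartbeat("alice", now=0)
    tracker.heartbeat("alice", now=HEARTBEAT_INTERVAL_S)
    tracker.sweep(now=HEARTBEAT_INTERVAL_S + OFFLINE_AFTER_S)
```

Publishing only on state transitions keeps presence traffic proportional to status changes rather than to the heartbeat rate.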
Message Sync & Ordering
Each device tracks a max_message_id. On reconnect, it fetches messages where id > max_message_id. Use a Snowflake-like ID generator (timestamp + sequence) so that IDs are unique and time-sortable, which gives a total order within a chat. For cross-chat ordering, rely on client-side timestamps.
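A minimal sketch of the timestamp + sequence scheme and the reconnect query follows. It assumes a single generator per chat and omits the worker-ID bits that production Snowflake-style IDs usually include; `SnowflakeLikeIds`, `messages_since`, and the 12-bit sequence width are illustrative choices.

```python
import threading
import time


class SnowflakeLikeIds:
    SEQUENCE_BITS = 12  # assumed width: up to 4096 IDs per millisecond

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock          # injectable millisecond clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock moved backwards: keep using the last timestamp so IDs stay ordered.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence += 1
                if self._sequence >= (1 << self.SEQUENCE_BITS):
                    # Sequence exhausted within this millisecond: borrow the next one.
                    now += 1
                    self._sequence = 0
            else:
                self._sequence = 0
            self._last_ms = now
            return (now << self.SEQUENCE_BITS) | self._sequence


def messages_since(messages, max_message_id):
    """Reconnect sync: return (id, body) pairs the device has not seen yet."""
    return [m for m in messages if m[0] > max_message_id]


if __name__ == "__main__":
    ids = SnowflakeLikeIds()
    log = [(ids.next_id(), text) for text in ("one", "two", "three")]
    device_max_id = log[0][0]  # the device last saw the first message
    print(messages_since(log, device_max_id))
```

Because the timestamp occupies the high bits, comparing two IDs as integers compares their creation order, with the sequence breaking ties inside one millisecond.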
End-to-End Encryption
Use the Signal Protocol (Double Ratchet + X3DH key exchange). The server stores only ciphertext and cannot read messages. Key challenge: multi-device support requires syncing pre-keys across devices. Group chat E2E uses sender keys distributed via pairwise channels.
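To give a feel for why each message gets a fresh key, here is a sketch of only the symmetric-key chain step of a Double Ratchet, using HMAC-SHA256 with distinct single-byte constants for the message key and the next chain key. It is not the Signal Protocol: X3DH, the Diffie-Hellman ratchet, header handling, and the actual encryption are all omitted, and the starting chain key is a placeholder rather than a negotiated secret.

```python
import hashlib
import hmac


def kdf_chain_step(chain_key):
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key, count):
    """Advance the chain `count` times and collect the per-message keys."""
    chain_key = initial_chain_key
    keys = []
    for _ in range(count):
        chain_key, message_key = kdf_chain_step(chain_key)
        keys.append(message_key)
    return keys


if __name__ == "__main__":
    # Placeholder secret: in a real session both sides obtain this from the key exchange.
    shared_chain_key = hashlib.sha256(b"placeholder shared secret").digest()
    sender_keys = derive_message_keys(shared_chain_key, 3)
    receiver_keys = derive_message_keys(shared_chain_key, 3)
    print(sender_keys == receiver_keys)
    print(len(set(sender_keys)))
```

Both sides derive the same sequence of keys from the same starting point, and since HMAC is one-way, learning one message key does not reveal earlier chain keys.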
Media Handling
Upload media to object storage (S3) via a dedicated upload service, which returns a media URL/ID. The message contains the reference, not the blob. Use a CDN for downloads. Compress images server-side and generate thumbnails. For E2E, encrypt the media with a random key and share that key in the message.
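The "reference, not the blob" idea can be sketched as follows. `ObjectStore` is a hypothetical in-memory stand-in for S3, and using a SHA-256 content hash as the media ID is an assumption made for the example; encryption, compression, thumbnails, and the CDN are left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


class ObjectStore:
    """In-memory stand-in for an object storage bucket."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # Assumed ID scheme: content hash, so identical uploads share one object.
        media_id = hashlib.sha256(data).hexdigest()
        self._blobs[media_id] = data
        return media_id

    def get(self, media_id: str) -> bytes:
        return self._blobs[media_id]


@dataclass
class ChatMessage:
    chat_id: str
    sender: str
    text: str
    media_id: Optional[str] = None  # reference to the uploaded blob, never the bytes


def send_with_attachment(store, chat_id, sender, text, attachment):
    """Upload first, then build a small message that only carries the media ID."""
    media_id = store.put(attachment)
    return ChatMessage(chat_id, sender, text, media_id)


if __name__ == "__main__":
    store = ObjectStore()
    message = send_with_attachment(store, "chat-1", "alice", "photo", b"\x89PNG...")
    print(message)
    print(len(store.get(message.media_id)))
```

Keeping the message small means the chat path and the message store never handle large payloads; only the download path does.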