Design a Chat System

WebSocket, Message Ordering, Presence Detection, and Fan-out Strategies

A chat system (e.g., WhatsApp, Slack, Discord) delivers messages in real time between users, either 1:1 or in groups. The core challenges: maintaining persistent connections at scale (millions of concurrent WebSockets), ensuring message ordering and delivery guarantees, tracking online presence, and handling media attachments. At billions of messages/day, you need stateful chat servers, a message queue for fan-out, and a combination of MySQL (user data) + KV store (message history) for storage.

WebSocket vs Polling Comparison

Connection overhead and latency for WebSocket, Long Polling, and Short Polling. Bandwidth and connection counts depend on the message rate.

WebSocket
  • Latency: ~5 ms
  • HTTP overhead/msg: 2 bytes

Long Polling
  • Latency: ~100 ms
  • HTTP overhead/msg: ~800 bytes

Short Polling
  • Latency: ~1-5 s
  • HTTP overhead/msg: ~800 bytes
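To make the overhead figures above concrete, here is a minimal sketch that turns them into bytes of protocol overhead per second for one client. The 2-byte and ~800-byte constants come from the comparison; the function name, the one-request-per-message model for long polling, and the default 1-second poll interval are illustrative assumptions.

```python
# Per-message overhead figures taken from the comparison above.
WEBSOCKET_OVERHEAD_BYTES = 2
HTTP_OVERHEAD_BYTES = 800


def overhead_bytes_per_sec(msgs_per_sec: float, poll_interval_s: float = 1.0) -> dict:
    """Return approximate protocol overhead (bytes/second) per transport.

    Assumptions (not from the source): long polling costs one HTTP
    round trip per delivered message, and short polling issues one
    request every poll_interval_s seconds whether or not data is waiting.
    """
    return {
        "websocket": msgs_per_sec * WEBSOCKET_OVERHEAD_BYTES,
        "long_polling": msgs_per_sec * HTTP_OVERHEAD_BYTES,
        "short_polling": HTTP_OVERHEAD_BYTES / poll_interval_s,
    }


if __name__ == "__main__":
    for rate in (1, 10, 100):
        print(rate, overhead_bytes_per_sec(rate))
```

Under these assumptions, short polling pays a fixed cost even when the chat is idle, while WebSocket overhead stays small at any message rate.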

Capacity Estimation

Estimate resources needed for a large-scale chat system.

  • Messages/day
  • Text storage/day
  • Media storage/day
  • Peak QPS
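The four quantities above follow from a handful of workload inputs. The sketch below shows the arithmetic; every input value (50M daily active users, 40 messages per user, 100-byte texts, 5% of messages carrying 200 KB of media, a 3x peak factor) is a hypothetical assumption chosen only to land in the "billions of messages/day" range, not a figure from any real service.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    # All fields are illustrative assumptions, not measured values.
    daily_active_users: int
    messages_per_user_per_day: int
    avg_text_bytes: int
    media_fraction: float      # share of messages with an attachment
    avg_media_bytes: int
    peak_to_average: float     # peak QPS divided by average QPS


def estimate(w: Workload) -> dict:
    """Back-of-the-envelope daily volume, storage, and peak QPS."""
    messages_per_day = w.daily_active_users * w.messages_per_user_per_day
    avg_qps = messages_per_day / SECONDS_PER_DAY
    return {
        "messages_per_day": messages_per_day,
        "text_bytes_per_day": messages_per_day * w.avg_text_bytes,
        "media_bytes_per_day": messages_per_day * w.media_fraction * w.avg_media_bytes,
        "peak_qps": avg_qps * w.peak_to_average,
    }


if __name__ == "__main__":
    example = Workload(50_000_000, 40, 100, 0.05, 200_000, 3.0)
    for name, value in estimate(example).items():
        print(f"{name}: {value:,.0f}")
```

With these inputs, media storage is about a hundred times larger than text storage, which is one reason media is kept in object storage rather than in the message store.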

Architecture

Message path:

Sender → API Gateway → Chat Server → Message Queue → Chat Server → Receiver

Supporting components:
  • Message Store: KV Store (Cassandra)
  • Presence Server: Heartbeat / Status
  • Push Notification: Offline Users
  • Media Storage: S3 / CDN
  • User DB: MySQL (profiles, groups)

Key Design Decisions

1:1 Chat vs Group Chat Fan-out

1:1 Chat
  • Direct delivery via WebSocket
  • Simple: one sender, one receiver
  • Store in per-conversation KV partition
vs
Group Chat
  • Fan-out write to each member's inbox
  • Message queue for async delivery
  • Limit group size (e.g. 500) to bound fan-out
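The difference between the two delivery paths can be sketched in a few lines. This is an in-memory illustration only: the class and method names are hypothetical, a deque stands in for the message queue, and per-user lists stand in for inboxes. The 500-member cap is the example limit from the list above.

```python
from collections import defaultdict, deque

MAX_GROUP_SIZE = 500  # example cap from the text; bounds fan-out cost


class FanOutService:
    """Toy model of direct delivery vs. fan-out on write."""

    def __init__(self):
        self.groups = {}                   # group_id -> set of member ids
        self.inboxes = defaultdict(list)   # user_id -> delivered messages
        self.queue = deque()               # stand-in for a real message queue

    def create_group(self, group_id, members):
        members = set(members)
        if len(members) > MAX_GROUP_SIZE:
            raise ValueError(f"group exceeds {MAX_GROUP_SIZE} members")
        self.groups[group_id] = members

    def send_direct(self, sender, receiver, text):
        # 1:1 chat: one sender, one receiver, delivered immediately.
        self.inboxes[receiver].append((sender, text))

    def send_group(self, sender, group_id, text):
        # Group chat: enqueue one delivery task per member (except the sender).
        for member in self.groups[group_id]:
            if member != sender:
                self.queue.append((member, sender, text))

    def drain(self):
        # A worker would do this asynchronously; here we drain in one call.
        delivered = 0
        while self.queue:
            member, sender, text = self.queue.popleft()
            self.inboxes[member].append((sender, text))
            delivered += 1
        return delivered
```

The write cost of `send_group` grows linearly with group size, which is why capping membership keeps fan-out predictable.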

Message Storage: MySQL vs KV Store

MySQL (user data)
  • User profiles, contacts, groups
  • Strong consistency, ACID
  • Moderate read/write volume
vs
KV Store (messages)
  • Append-heavy, sequential reads
  • Partitioned by (chat_id, timestamp)
  • Billions of rows: Cassandra/HBase
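The access pattern for messages (append-heavy writes, sequential reads within one chat) can be illustrated with an in-memory stand-in for the KV store. This is a sketch, not Cassandra's API: `MessageStore`, `append`, and `read_range` are hypothetical names, with `chat_id` playing the role of the partition key and `timestamp_ms` the sort key inside the partition.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """In-memory model of a store partitioned by chat_id, sorted by timestamp."""

    def __init__(self):
        # chat_id -> list of (timestamp_ms, message_id, body), kept sorted
        self._partitions = defaultdict(list)

    def append(self, chat_id, timestamp_ms, message_id, body):
        # insort keeps the partition ordered even if writes arrive out of order.
        bisect.insort(self._partitions[chat_id], (timestamp_ms, message_id, body))

    def read_range(self, chat_id, start_ms, end_ms):
        """Return messages with start_ms <= timestamp < end_ms, oldest first."""
        rows = self._partitions.get(chat_id, [])
        lo = bisect.bisect_left(rows, (start_ms,))
        hi = bisect.bisect_left(rows, (end_ms,))
        return rows[lo:hi]


if __name__ == "__main__":
    store = MessageStore()
    store.append("chat-1", 2000, 2, "second")
    store.append("chat-1", 1000, 1, "first")
    print(store.read_range("chat-1", 0, 3000))
```

Because every read touches a single partition and scans a contiguous range, loading recent history for one chat never has to consult other chats' data.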

Online Presence

Users send a heartbeat every 5s via WebSocket. If no heartbeat arrives for 30s, mark the user offline. For groups, fan out presence updates only to online members. Use a pub/sub channel per user so friends subscribe to each other's status changes, which avoids polling.
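A minimal sketch of the heartbeat and pub/sub logic described above, using the 5s and 30s values from the text. The `PresenceServer` class, its method names, and the callback-based subscription are illustrative assumptions; a real deployment would use a shared store and a pub/sub system instead of in-process state. Timestamps are passed in explicitly so the behaviour is easy to test.

```python
from collections import defaultdict

HEARTBEAT_INTERVAL_S = 5   # clients are expected to ping this often
OFFLINE_AFTER_S = 30       # no heartbeat for this long means offline


class PresenceServer:
    def __init__(self):
        self.last_seen = {}                   # user_id -> last heartbeat time
        self.online = set()
        self.subscribers = defaultdict(list)  # user_id -> status callbacks

    def subscribe(self, user_id, callback):
        """Register callback(user_id, status) on this user's channel."""
        self.subscribers[user_id].append(callback)

    def heartbeat(self, user_id, now):
        self.last_seen[user_id] = now
        if user_id not in self.online:
            self.online.add(user_id)
            self._publish(user_id, "online")

    def sweep(self, now):
        """Mark users offline whose last heartbeat is too old."""
        for user_id in list(self.online):
            if now - self.last_seen[user_id] >= OFFLINE_AFTER_S:
                self.online.discard(user_id)
                self._publish(user_id, "offline")

    def _publish(self, user_id, status):
        # Only status changes are published, never every heartbeat.
        for callback in self.subscribers[user_id]:
            callback(user_id, status)
```

Publishing only on transitions keeps the update volume proportional to status changes rather than to heartbeat traffic.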

Message Sync & Ordering

Each device tracks a max_message_id. On reconnect, it fetches messages where id > max_message_id. Use a Snowflake-like ID generator (timestamp + sequence) so that every message in a chat gets a unique, time-sortable ID, which yields a consistent order within that chat. For cross-chat ordering, rely on client-side timestamps.
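The ID scheme and the reconnect query can be sketched as follows. This is a simplified, single-generator illustration: the 12-bit sequence width, the `IdGenerator` and `messages_since` names, and the handling of clock regressions are assumptions, and the machine-ID bits that a multi-node Snowflake-style generator would include are omitted.

```python
import threading
import time

SEQUENCE_BITS = 12                       # assumed width of the per-ms counter
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    """Produces strictly increasing IDs of the form (timestamp_ms << 12) | sequence."""

    def __init__(self, clock_ms=lambda: int(time.time() * 1000)):
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp so IDs stay ordered.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence += 1
                if self._sequence > MAX_SEQUENCE:
                    # Counter exhausted for this millisecond: move to the next one.
                    now += 1
                    self._sequence = 0
            else:
                self._sequence = 0
            self._last_ms = now
            return (now << SEQUENCE_BITS) | self._sequence


def messages_since(messages, max_message_id):
    """Reconnect sync: return (id, body) pairs with id > max_message_id, in ID order."""
    return sorted(m for m in messages if m[0] > max_message_id)
```

A device that stored `max_message_id` before disconnecting calls `messages_since` with that value and receives exactly the messages it missed, in order.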

End-to-End Encryption

Use the Signal Protocol (Double Ratchet + X3DH key exchange). The server stores only ciphertext and cannot read messages. Key challenge: multi-device support requires syncing pre-keys across devices. Group chat E2E uses sender keys distributed via pairwise channels.

Media Handling

Upload media to object storage (S3) via a dedicated upload service. Return a media URL/ID. Message contains the reference, not the blob. Use a CDN for download. Compress images server-side, generate thumbnails. For E2E: encrypt media with a random key, share key in the message.
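The "reference, not the blob" idea can be sketched with an in-memory stand-in for object storage. All names here (`ObjectStore`, `ChatMessage`, `send_with_attachment`) are hypothetical, and deriving the media ID from a SHA-256 hash of the content is an illustrative choice, not something the design above prescribes. Compression, thumbnails, CDN delivery, and encryption are left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


class ObjectStore:
    """In-memory stand-in for an object store such as S3."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob: bytes) -> str:
        # Assumption: content-addressed IDs, so identical uploads share one object.
        media_id = hashlib.sha256(blob).hexdigest()
        self._blobs[media_id] = blob
        return media_id

    def get(self, media_id: str) -> bytes:
        return self._blobs[media_id]


@dataclass
class ChatMessage:
    sender: str
    chat_id: str
    text: str
    media_id: Optional[str] = None  # reference to the upload, never the bytes


def send_with_attachment(store, sender, chat_id, text, blob):
    """Upload first, then build a small message that only carries the media ID."""
    media_id = store.put(blob)
    return ChatMessage(sender=sender, chat_id=chat_id, text=text, media_id=media_id)
```

The message stays a few hundred bytes regardless of attachment size, so the chat servers and message store never handle the media payload itself.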

A classic system design interview question: master real-time communication, presence, and message fan-out.