
Design a Chat System (WhatsApp)

End-to-end design of a real-time encrypted messaging platform serving 2 billion users and 100 billion messages per day. Covers WebSocket connection management, the Signal Protocol for E2EE, store-and-forward offline delivery, and group message fan-out.


Why Chat Systems Are Hard

WhatsApp's scale (2 billion users and 100 billion messages/day, which is roughly 1.16M messages/sec on average, with peaks higher) is extreme, but the hard problems exist even at moderate scale. Three of them make chat uniquely challenging:

  1. Persistent bidirectional connections — unlike HTTP request-response, chat requires a live socket per connected user, tying up server resources indefinitely.
  2. Online/offline state — delivery semantics must change based on whether the recipient is reachable right now.
  3. End-to-end encryption — the server must route ciphertext it cannot read, requiring careful key exchange infrastructure.
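
The throughput figure quoted above follows from simple arithmetic. A minimal sketch, assuming traffic spread evenly over a day; the `peak_to_average` multiplier is a hypothetical parameter for illustration, not a number from this article:

```python
MESSAGES_PER_DAY = 100_000_000_000  # figure quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(messages_per_day: int) -> float:
    """Messages per second if traffic were spread evenly over one day."""
    return messages_per_day / SECONDS_PER_DAY


def peak_rate(messages_per_day: int, peak_to_average: float) -> float:
    """Scale the average by an assumed peak-to-average ratio."""
    return average_rate(messages_per_day) * peak_to_average


if __name__ == "__main__":
    print(f"average: {average_rate(MESSAGES_PER_DAY):,.0f} msg/s")
    # The 2.0 multiplier below is an assumption, chosen only to show the shape
    # of the calculation.
    print(f"assumed 2x peak: {peak_rate(MESSAGES_PER_DAY, 2.0):,.0f} msg/s")
```

Capacity planning would normally start from a measured peak-to-average ratio rather than a guessed one.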

Functional Requirements

01

1-to-1 Messaging

Send text, images, videos, and documents between two users. Messages must be delivered in order. Delivery and read receipts must flow back to the sender.

02

Group Messaging

Groups up to 1,024 members. Every message must be fanned out to all online members immediately, and queued for offline members.

03

Online Presence

Show 'online', 'last seen at', and typing indicators. Going online should be reflected within about a second; going offline may lag by up to the presence TTL (30 seconds in this design).

04

End-to-End Encryption

All messages encrypted client-side. Server routes ciphertext only — never has access to plaintext. Uses Signal Protocol (X3DH key exchange + Double Ratchet Algorithm).

05

Offline Delivery

Messages queued on server when recipient is offline. Delivered with push notification (APNs/FCM) to wake the app. Deleted from server after client acknowledgment.
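
The store-and-forward behaviour described here can be sketched with an in-memory queue. This is an illustration only: `Envelope`, `OfflineInbox`, and `push_log` are hypothetical names, the dictionary stands in for the persistent inbox, and the push log stands in for an APNs/FCM call.

```python
from dataclasses import dataclass


@dataclass
class Envelope:
    message_id: str
    sender_id: str
    ciphertext: bytes  # opaque to the server; it is never decrypted here


class OfflineInbox:
    """Holds undelivered envelopes per recipient until the client ACKs them."""

    def __init__(self):
        self._queues = {}   # recipient_id -> {message_id: Envelope}
        self.push_log = []  # stand-in for push notifications sent

    def enqueue(self, recipient_id, envelope):
        # Keyed by message_id so a retried enqueue does not duplicate the message.
        self._queues.setdefault(recipient_id, {})[envelope.message_id] = envelope
        self.push_log.append(recipient_id)  # wake the recipient's app

    def pending(self, recipient_id):
        # Dicts preserve insertion order, so messages come back oldest first.
        return list(self._queues.get(recipient_id, {}).values())

    def ack(self, recipient_id, message_id):
        # Delete only after the client confirms receipt; True if something was removed.
        return self._queues.get(recipient_id, {}).pop(message_id, None) is not None
```

Deleting on ACK rather than on send is what gives at-least-once delivery: a message lost in transit is simply returned again by the next `pending` call.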

06

Media Sharing

Images and videos are encrypted on the client and then uploaded to S3/Blob storage. Only the download URL and the decryption key travel through the message path, inside the encrypted message.

High-Level Architecture

[Diagram: high-level architecture]

WebSocket Gateway & Erlang/OTP

WhatsApp's gateway is built on Erlang/OTP, chosen for its actor-based concurrency model: each connection is a lightweight Erlang process that starts at roughly 2KB of memory. WhatsApp has publicly reported more than 2 million concurrent connections on a single node. Crash isolation means a failure in one connection's process cannot take down the others (the 'let it crash' philosophy), and hot code loading allows deployments without disconnecting users. The gateway speaks a custom binary protocol derived from XMPP and optimized for mobile bandwidth. Incoming messages are already-encrypted payloads, so the gateway acts as a forwarder and never decrypts content.
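
A quick back-of-the-envelope check shows why the per-process footprint matters. The sketch below multiplies the two figures from the paragraph above; it counts only the baseline process memory and deliberately ignores socket buffers, per-connection state, and runtime overhead, which are not quantified in this article.

```python
CONNECTIONS_PER_NODE = 2_000_000  # figure quoted in the text
BYTES_PER_PROCESS = 2 * 1024      # ~2KB per Erlang process, as quoted in the text


def process_memory_gib(connections: int, bytes_per_process: int) -> float:
    """Baseline memory for one process per connection, in GiB."""
    return connections * bytes_per_process / 1024**3


if __name__ == "__main__":
    gib = process_memory_gib(CONNECTIONS_PER_NODE, BYTES_PER_PROCESS)
    print(f"baseline process memory: {gib:.2f} GiB")
```

At under 4 GiB for the processes themselves, the connection count is more likely to be bounded by buffers and kernel limits than by the process model.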

Message Delivery Flow

[Diagram: message delivery flow]

End-to-End Encryption (Signal Protocol)

WhatsApp uses the Signal Protocol for end-to-end encryption.

Key Exchange: X3DH (Extended Triple Diffie-Hellman). During registration, each user uploads the public halves of an identity key, a signed pre-key, and a set of one-time pre-keys to the key server. When Alice wants to message Bob, she fetches Bob's pre-key bundle and derives a shared secret without Bob being online.

Message Encryption: The Double Ratchet Algorithm derives a new encryption key for every message. This provides forward secrecy (compromising current keys does not expose earlier messages) and break-in recovery (after a compromise, later messages become secure again once the ratchet has mixed in fresh Diffie-Hellman output).

Group Encryption: The Sender Keys protocol. Each member generates a sender key, distributes it to all other group members over pairwise encrypted channels, then encrypts each group message once with it for efficient broadcast. The server stores only public keys and relays ciphertext; private keys never leave the clients.
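
To make the "new key for every message" idea concrete, here is a minimal sketch of the symmetric-key half of a ratchet, built from HMAC-SHA256 in the Python standard library. It is an illustration under stated assumptions, not WhatsApp's implementation: it omits the Diffie-Hellman ratchet, X3DH, and the actual message encryption, and the function names are hypothetical. The single-byte constants follow the pattern suggested in the public Double Ratchet specification.

```python
import hashlib
import hmac


def kdf_chain_step(chain_key: bytes):
    """Advance the chain by one step; returns (next_chain_key, message_key)."""
    # Distinct constants keep the two outputs independent of each other.
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key: bytes, count: int):
    """Derive `count` successive per-message keys from one chain key."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        # The old chain key is overwritten, so earlier keys cannot be
        # recomputed from the current state (the forward-secrecy property).
        chain_key, message_key = kdf_chain_step(chain_key)
        keys.append(message_key)
    return keys


if __name__ == "__main__":
    demo_chain_key = bytes(32)  # all-zero key, for demonstration only
    for i, key in enumerate(derive_message_keys(demo_chain_key, 3)):
        print(i, key.hex()[:16])
```

Because HMAC is one-way, someone who steals the current chain key can derive later keys in that chain but not earlier ones; recovering from that theft is the job of the Diffie-Hellman ratchet, which this sketch leaves out.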

Data Model

| Table | Store | Key Design | Notes |
|---|---|---|---|
| users | PostgreSQL | user_id UUID PK | Phone, profile, registration keys |
| pre_keys | PostgreSQL | (user_id, key_id) PK | One-time pre-keys for X3DH. Deleted after use. |
| messages (inbox) | Cassandra | PARTITION KEY (recipient_id), CLUSTERING (created_at DESC) | Store-and-forward queue. Deleted after ACK. |
| group_metadata | PostgreSQL | group_id UUID PK | Name, avatar, member list up to 1,024 |
| presence | Redis | Key: user:{user_id}:online, TTL: 30s | Heartbeat refreshes TTL. Expire = offline. |
| media_refs | Cassandra | (message_id) PK | S3 object key + encrypted AES key |

Group Message Fan-Out

For groups of up to 1,024 members, the Group Service performs fan-out on write: it fetches the member list, pushes the message to online members through their gateway sockets, and writes it to each offline member's Cassandra inbox individually. That is O(members) work per group message. In the worst case, where every member is offline, a 1,024-member group receiving 100 messages/sec generates 1,024 × 100 = 102,400 inbox writes/sec, which Cassandra's append-oriented LSM-tree write path is well suited to absorb when the load is spread across a cluster. The alternative, fan-out at read time, would store each message once per group and make every client query the log of every group it belongs to on reconnect, which slows reads and complicates per-recipient delivery and ACK tracking.
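
The fan-out decision and the write-rate arithmetic above can be sketched as follows. The function names are hypothetical, and the sketch only classifies recipients; actually pushing over sockets or writing to the inbox is left out.

```python
def fan_out(member_ids, online_ids, sender_id):
    """Split a group's recipients into (push_now, queue_for_later)."""
    push_now, queue_for_later = [], []
    for member in member_ids:
        if member == sender_id:
            continue  # the sender already has the message
        if member in online_ids:
            push_now.append(member)         # deliver via gateway socket
        else:
            queue_for_later.append(member)  # one inbox write per offline member
    return push_now, queue_for_later


def worst_case_inbox_writes_per_sec(group_size: int, messages_per_sec: int) -> int:
    """Upper bound used in the text: every member gets one inbox write per message."""
    return group_size * messages_per_sec


if __name__ == "__main__":
    online, offline = fan_out(["a", "b", "c", "d"], {"b", "d"}, sender_id="a")
    print("push:", online, "queue:", offline)
    print("worst case:", worst_case_inbox_writes_per_sec(1024, 100), "writes/sec")
```

The real write rate depends on how many members are offline at the moment of sending, so the worst-case figure is a sizing bound rather than a typical load.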

TIP

Presence Service Design

Presence is stored in Redis with a TTL of 30 seconds. Every connected client sends a heartbeat every 15 seconds, and each heartbeat resets the TTL, so a single missed heartbeat does not flip the user to offline. If the TTL expires (the client disconnected or crashed), the key disappears and the user is treated as offline. 'Last seen' timestamps are written to PostgreSQL when the gateway observes a disconnect. Presence itself does not depend on that event: the TTL expires regardless, so the mechanism is self-healing.
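
The TTL-plus-heartbeat mechanism can be sketched without Redis. The class below is a hypothetical in-memory stand-in that mimics key expiry with an injectable clock; in Redis, each heartbeat would correspond to setting the key again with a 30-second expiry.

```python
import time


class PresenceStore:
    """In-memory imitation of a key-with-TTL presence table."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so tests can control time
        self._expires_at = {}  # user_id -> expiry deadline

    def heartbeat(self, user_id):
        # Each heartbeat pushes the deadline out by a full TTL.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        deadline = self._expires_at.get(user_id)
        if deadline is None:
            return False
        if self._clock() >= deadline:
            # No explicit disconnect was needed: the entry simply expired.
            del self._expires_at[user_id]
            return False
        return True


if __name__ == "__main__":
    now = [0.0]
    store = PresenceStore(ttl_seconds=30.0, clock=lambda: now[0])
    store.heartbeat("alice")
    now[0] = 29.0
    print(store.is_online("alice"))
    now[0] = 30.0
    print(store.is_online("alice"))
```

With a 15-second heartbeat and a 30-second TTL, a client has to miss two consecutive heartbeats before it is shown as offline, which trades detection speed for tolerance of brief network stalls.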

IMPORTANT

Key Architectural Trade-offs

  • Erlang over Java/Go: Unmatched concurrency for connection-per-process workloads. Trade-off: smaller talent pool, limited ecosystem, performance ceiling for CPU-bound tasks.
  • Transient message storage: Deleting messages after delivery saves storage and enhances privacy. Trade-off: no server-side history, and multi-device sync is complex (each message must be delivered to every registered device).
  • Custom binary protocol over standard XMPP: Efficient mobile bandwidth usage. Trade-off: not interoperable, requires maintaining custom client libraries.
  • Cassandra for message queues: Excellent write throughput, and partitioning by user scales horizontally. Trade-off: under eventual consistency a read may briefly miss a just-written message or a just-applied delete, so delivery and ACK handling must be idempotent and retried.
  • E2EE with Signal Protocol: Server is fully blind to message content, which maximizes privacy. Trade-off: content-based server-side spam filtering, content moderation, and message search are not possible; the server can act only on metadata and user reports.
