Design Facebook Messenger (Chat)

[!IMPORTANT] In this lesson, you will master:

  1. System Architecture: Designing a real-time message routing system.
  2. Connection Protocols: Understanding WebSockets vs. Long Polling.
  3. Data Model: Handling high-throughput message storage using wide-column stores.
  4. Reliability: Message synchronization and End-to-End Encryption (E2EE).

1. What is a Chat System?

A chat system allows users to send text, images, and videos to each other in real-time. It supports 1-on-1 conversations and Group chats.

Real-World Examples

  • WhatsApp / Signal: Store messages on device (mostly). End-to-end encrypted. Architecture is “Store-and-Forward”.
  • FB Messenger / Slack: Store messages on server. Multi-device sync is critical. We will focus on this Cloud-Based architecture.

Try it yourself

[!TIP] Try it yourself: Open WhatsApp Web and your phone. Send a message from your phone. It appears on the Web instantly. Turn off your phone’s internet. The Web version says “Phone not connected” (for WhatsApp) or continues working (for Messenger). Why?


2. Process Requirements

Functional Requirements

  • 1-on-1 Chat: Low latency delivery.
  • Group Chat: Up to 500 members.
  • Presence: “Online”, “Last Seen”, “Typing…”.
  • Receipts: Sent, Delivered, Read.
  • Multi-Device: Sync messages across phone and desktop.

Non-Functional Requirements

  • Low Latency: Real-time experience is key (< 100ms).
  • Consistency: Messages must appear in order (FIFO).
  • Availability: High.
  • Security:
      • TLS: Encryption in transit.
      • E2EE (End-to-End Encryption): Using the Signal Protocol (Double Ratchet Algorithm). Keys are stored on user devices, not servers. The server only sees encrypted blobs.

3. Estimate

  • DAU: 2 Billion.
  • Messages: 100 Billion per day.
  • Storage:
      • Avg message size = 100 bytes.
      • 100 billion msgs/day × 100 bytes = 10 TB/day.
      • 5 years ≈ 18 PB.
      • We need a massive, write-heavy database.
  • Bandwidth: 10TB / 86400 seconds ≈ 115 MB/s on average. Peak traffic might be 2x or 3x this, reaching ~345 MB/s.
  • Hardware Reality (Concurrent Connections): This is the real bottleneck of real-time systems. If 10% of our 2B DAU are online simultaneously, that is 200 Million concurrent WebSockets. A heavily optimized Linux server (with tuned ulimit and ephemeral ports) can hold around 500k connections. We will need at least 200M / 500k = 400 massive, memory-heavy connection servers just to maintain these open sockets!
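The envelope math above fits in a few lines. This sketch just reproduces the arithmetic; the 10% concurrency ratio and the 500k-sockets-per-server figure are the assumptions stated in the bullets, not measured numbers:

```python
# Back-of-envelope estimates for the chat system (assumptions from the text).
DAU = 2_000_000_000                  # 2 billion daily active users
MSGS_PER_DAY = 100_000_000_000       # 100 billion messages/day
AVG_MSG_BYTES = 100

storage_per_day = MSGS_PER_DAY * AVG_MSG_BYTES      # bytes/day -> 10 TB
storage_5y_pb = storage_per_day * 365 * 5 / 1e15    # petabytes over 5 years

bandwidth_mb_s = storage_per_day / 86_400 / 1e6     # average MB/s

concurrent = int(DAU * 0.10)                        # assume 10% online at once
servers = concurrent // 500_000                     # ~500k sockets per server

print(f"{storage_per_day / 1e12:.0f} TB/day, {storage_5y_pb:.2f} PB over 5y")
print(f"~{bandwidth_mb_s:.0f} MB/s average, {servers} connection servers")
```

Changing any single assumption (message size, concurrency ratio, sockets per box) shifts the fleet size linearly, which is why interviewers care about stating the assumptions out loud.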

4. Data Model

System APIs

We need a mix of REST (for profile/auth) and WebSocket (for chat).

REST API (HTTP)

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /v1/chat/create | Start a new chat (returns chat_id). |
| POST | /v1/group/add_member | Add user to group. |

WebSocket Events (Bidirectional)

  • sendMessage(chat_id, content)
  • receiveMessage(sender_id, content)
  • userStatusChanged(user_id, status)
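These events travel as frames over the socket, so each one needs a serializable envelope. A minimal sketch, assuming a JSON wire format; the field names (including the client-proposed `temp_id` used later for server-assigned IDs) are illustrative, not a real Messenger protocol:

```python
import json

# Hypothetical wire format for the WebSocket events above. A real protocol
# would also carry auth tokens, a protocol version, and delivery acks.
def send_message_event(chat_id: str, content: str, client_temp_id: int) -> str:
    """Serialize a sendMessage event as one JSON frame."""
    return json.dumps({
        "type": "sendMessage",
        "chat_id": chat_id,
        "content": content,
        "temp_id": client_temp_id,  # client-proposed ID; server assigns the final one
    })

event = send_message_event("chat_123", "hello", 1)
decoded = json.loads(event)
print(decoded["type"])   # sendMessage
```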

Database Design

We need a database that handles extremely high write throughput and efficient range queries (to load chat history).

MySQL?

  • Possible, but index maintenance becomes heavy at 100B writes/day.

HBase / Cassandra (Wide Column Store)

  • Winner: the best fit for this workload.
  • Key: chat_id (Partition Key).
  • Clustering Key: message_id (Sort Key, Snowflake).
  • Values: sender_id, content_text, media_url, created_at.
  • Allows very fast “Get me the last 50 messages for Chat 123” queries by scanning the sequential clustering keys within a single partition.
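The access pattern above can be simulated in memory to make it concrete: one partition per `chat_id`, rows kept sorted by the clustering key, so "last 50 messages" is a tail slice of a single partition. This is a toy stand-in for a wide-column store, not a real HBase/Cassandra client:

```python
from bisect import insort
from collections import defaultdict

# Toy wide-column table: partition key = chat_id,
# clustering key = message_id (kept sorted within the partition).
table: dict[str, list[tuple[int, dict]]] = defaultdict(list)

def write(chat_id: str, message_id: int, row: dict) -> None:
    insort(table[chat_id], (message_id, row))   # sorted insert by clustering key

def last_n(chat_id: str, n: int) -> list[dict]:
    """'Get me the last N messages for Chat X' = tail of one partition."""
    return [row for _, row in table[chat_id][-n:]]

for i in range(1, 101):
    write("chat_123", i, {"sender_id": "alice", "content_text": f"msg {i}"})

history = last_n("chat_123", 50)
print(history[0]["content_text"], "...", history[-1]["content_text"])
# msg 51 ... msg 100
```

Because all rows for a chat are colocated and sorted, the real database serves this query with one sequential read instead of a scatter-gather across nodes.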

5. Architecture

Real-Time Message Routing Architecture.

Analogy: The Post Office vs. The Dedicated Courier
Traditional HTTP requests are like sending a letter through the post office—you drop it off, wait days, and periodically check your mailbox (polling). WebSockets, however, are like having a dedicated courier standing at your door with a direct radio link to the sender. The moment a message is dispatched, the courier hands it to you instantly. The challenge in system design is managing millions of these dedicated couriers without running out of "radio frequencies" (server ports and memory).
Diagram: Real-Time Chat System. Sender and Receiver clients connect over WebSocket through a Load Balancer to stateful Chat Servers (Chat Server 1 holds User A's socket, Chat Server 2 holds User B's). Zookeeper handles service discovery (which server holds which user's connection) and HBase handles message storage. A message is routed: Sender → Chat Server 1 → Zookeeper lookup → Chat Server 2 → Receiver.

6. Localized Details

Connection Protocols

How do we keep the connection open? There are three options: Polling, Long Polling, and WebSockets.

A. Polling (The Old Way)

  • Client asks “Any new messages?” every 1 second.
  • Pros: Simple HTTP.
  • Cons: Wasted resources. Server load is high even if no one is talking. High latency.

B. Long Polling

  • Client asks “Any new messages?”. Server holds the connection open until a message arrives (or timeout).
  • Pros: Less load than polling.
  • Cons: Connection setup overhead is still there.

C. WebSockets (The Modern Way)

  • Bidirectional, persistent TCP connection.
  • Server can push to Client instantly.
  • Pros: Lowest latency, lowest overhead.
  • Cons: Needs stateful servers (Server must know who is connected to it).
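That "stateful" con is worth making concrete: each chat server must hold a map from user to open socket, and a push only works if the user is connected to this server. A minimal sketch, with sockets simulated as lists of delivered frames:

```python
# Minimal sketch of the state a chat server must keep: which users have
# open sockets HERE. Sockets are simulated as lists of delivered frames.
class ChatServer:
    def __init__(self) -> None:
        self.sockets: dict[str, list[str]] = {}   # user_id -> fake socket

    def connect(self, user_id: str) -> None:
        self.sockets[user_id] = []

    def disconnect(self, user_id: str) -> None:
        self.sockets.pop(user_id, None)

    def push(self, user_id: str, frame: str) -> bool:
        """Instant server->client push; fails if the user is not connected here."""
        if user_id not in self.sockets:
            return False      # user lives on another server: consult discovery
        self.sockets[user_id].append(frame)
        return True

server = ChatServer()
server.connect("user_b")
print(server.push("user_b", "hi"))   # True
print(server.push("user_c", "hi"))   # False -> route via Zookeeper lookup
```

The `False` branch is exactly where service discovery (Zookeeper in this design) comes in: the sender's server looks up which server holds the recipient's socket and forwards the message there.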

Group Chat Optimization

Group chats introduce a “Fanout” problem.

Small Groups (< 500 members)

  • Push Model: When User A sends a message, the server loops through all 500 members, finds their WebSocket connections, and pushes the message.
  • Why: 500 lookups is fast.

Large Groups / Channels (> 5000 members)

  • Pull Model (or Hybrid): We don’t push to everyone.
  • Online users might be listening to a Pub/Sub channel (e.g., Kafka topic).
  • Inactive users will just fetch the history when they open the app next time.
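The push-vs-pull decision above boils down to a size check at send time. A hedged sketch, with the delivery functions injected so the routing logic stays visible; the 500-member threshold comes from the requirements, and the real system would batch the per-member lookups:

```python
# Hybrid fanout decision: push directly for small groups,
# publish once to a channel for large ones.
PUSH_THRESHOLD = 500

def fanout(group_members: list[str], online: set[str], message: str,
           push_fn, publish_fn) -> str:
    if len(group_members) <= PUSH_THRESHOLD:
        # Push model: look up each member's socket and push directly.
        for member in group_members:
            if member in online:
                push_fn(member, message)
        return "push"
    # Pull/hybrid model: publish once; online users subscribe to the
    # channel, offline users fetch history on next app open.
    publish_fn(message)
    return "pub-sub"

delivered = []
strategy = fanout(["a", "b", "c"], {"a", "c"}, "hi",
                  push_fn=lambda user, msg: delivered.append(user),
                  publish_fn=lambda msg: None)
print(strategy, delivered)   # push ['a', 'c']
```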

Message Synchronization & Reliability

How do we ensure messages are delivered in order and synced across devices?

Sequence IDs & Multi-Device Sync

We cannot rely on timestamp (clock skew). We use Sequence IDs per chat.

The Sync Protocol:

  1. State: Each device (Phone, Laptop) maintains a local LastReadID.
  2. Reconnect: When a device connects, it sends SYNC(LastReadID).
  3. Fetch: Server queries HBase/Cassandra: SELECT * FROM msgs WHERE chat_id=123 AND msg_id > LastReadID.
  4. Forward: Server pushes missing messages to that specific socket.
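The four steps above can be sketched end to end, with the server's store simulated as a per-chat list of `(msg_id, content)` pairs instead of a real HBase/Cassandra table:

```python
# Sketch of the sync protocol. The server store is a stand-in for
# the wide-column table, sorted by msg_id within each chat.
server_store = {
    "chat_123": [(1, "hey"), (2, "hi"), (3, "how are you"), (4, "good!")],
}

def sync(chat_id: str, last_read_id: int) -> list[tuple[int, str]]:
    """Step 3: SELECT * FROM msgs WHERE chat_id=? AND msg_id > last_read_id."""
    return [m for m in server_store[chat_id] if m[0] > last_read_id]

# Steps 1-2: a reconnecting laptop holds LastReadID = 2 and sends SYNC(2).
missing = sync("chat_123", 2)
# Step 4: the server pushes only the missing messages to that socket.
print(missing)   # [(3, 'how are you'), (4, 'good!')]
```

Because each device tracks its own `LastReadID`, the same protocol handles both a flaky mobile reconnect and a brand-new desktop login (which simply syncs from 0).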

[!TIP] War Story: The New Year's Eve Outage. WhatsApp once faced massive delays during New Year's Eve due to ID-generation bottlenecks. By moving from a global sequence generator to localized sequence IDs (scoped to each chat room), they eliminated the global lock and handled millions of messages per second seamlessly.

The Sequence ID Solution

Why not just use time? Because server clocks drift.

War Story: The Time-Drift Phantom
In early versions of distributed chat systems, relying on NTP (Network Time Protocol) for timestamping led to a notorious bug: messages arriving "from the future" or out of order. If Server A's clock was 50ms ahead of Server B's, a reply processed by Server A could get a timestamp earlier than the original message processed by Server B. This forced engineering teams to abandon absolute timestamps for ordering, shifting entirely to logical clocks and sequence IDs to guarantee causal ordering.

  • Logical Clocks: A Sequence ID is a counter (1, 2, 3…) unique to a Chat Room.
  • Implementation: This is tricky. You can’t use a global counter (slow).
  • Solution: Since one person writes at a time (usually), the client can propose a temp ID, and the server assigns the final authoritative ID.
  • K-V Store: We store Current_Max_ID in Redis for each Chat ID. INCR is atomic.
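The per-chat counter can be sketched like this. In production the counter would be a Redis `INCR` on a key such as `seq:<chat_id>` (Redis `INCR` is atomic); here a dict of counters simulates it:

```python
import itertools
from collections import defaultdict

# Per-chat sequence generator. Stand-in for Redis INCR on "seq:<chat_id>":
# each chat room gets its own independent, monotonically increasing counter.
counters: dict[str, itertools.count] = defaultdict(lambda: itertools.count(1))

def next_seq(chat_id: str) -> int:
    return next(counters[chat_id])

# Two chats advance independently: no global lock, no cross-chat ordering.
print(next_seq("chat_1"), next_seq("chat_1"), next_seq("chat_2"))
# 1 2 1
```

Scoping the counter to the chat room is precisely the fix from the New Year's Eve war story: contention is limited to one conversation instead of the whole system.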

Active Session Sync:

  • If User A is online on both Phone and Laptop.
  • The Chat Server detects two active WebSocket connections for UserA.
  • When a message arrives for UserA, the server fans out the message to all active sockets.
  • Both devices receive the message instantly.
Example: a client reconnects holding LastReadID = 5, sends SYNC(5), and the server replies with the missing messages [6, 7, 8] from the database.

Presence Service

  • Heartbeat: Client sends a heartbeat every 5s over the WebSocket.
  • Redis: Presence Service updates “Last Seen” in Redis with TTL = 10s.
  • If Heartbeat stops, TTL expires → User is Offline.
  • Fanout: When User A comes online, we fanout this status to all their friends (Pub/Sub).
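The heartbeat-plus-TTL mechanism can be sketched with plain timestamps. In production the "last seen" value would live in Redis with a 10-second expiry (e.g. `SET presence:<user> <ts> EX 10`); here the TTL check is done manually so the expiry logic is visible:

```python
# Presence via heartbeat + TTL, simulated with explicit timestamps.
HEARTBEAT_TTL = 10.0   # seconds; heartbeats arrive every 5s, so one miss is tolerated

last_seen: dict[str, float] = {}

def heartbeat(user_id: str, now: float) -> None:
    """Client pings over its WebSocket; refresh the presence key."""
    last_seen[user_id] = now

def is_online(user_id: str, now: float) -> bool:
    """Online iff a heartbeat arrived within the TTL window."""
    ts = last_seen.get(user_id)
    return ts is not None and (now - ts) < HEARTBEAT_TTL

t0 = 1000.0
heartbeat("alice", t0)
print(is_online("alice", t0 + 5))    # True  (heartbeat 5s ago)
print(is_online("alice", t0 + 15))   # False (TTL expired -> offline)
```

Letting the key expire on its own means a crashed phone goes "offline" automatically, with no explicit disconnect event required.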

End-to-End Encryption (Signal Protocol)

How does WhatsApp ensure End-to-End Encryption (E2EE) where even the server cannot read the messages?

It uses the Signal Protocol, which combines X3DH (Extended Triple Diffie-Hellman) and the Double Ratchet Algorithm.

1. Key Generation (X3DH)

When Alice installs the app, she generates keys and uploads the public parts to the server:

  • Identity Key (IK): Long-term identity.
  • Signed PreKey (SPK): Medium-term key.
  • One-Time PreKeys (OPK): A stack of single-use keys.

2. Session Setup (Async)

When Bob wants to message Alice:

  1. Bob fetches Alice’s public keys (IK, SPK, OPK) from the server.
  2. Bob performs X3DH to derive a shared secret SK.
  3. Bob encrypts the first message using SK and sends it (along with his public keys).
  4. Crucially: Alice does NOT need to be online.

3. Double Ratchet (Forward Secrecy)

For every new message, the encryption key changes.

  • KDF Chain: A Key Derivation Function generates a new key for each message.
  • Forward Secrecy: If a hacker steals the key for Message 50, they cannot work backwards to derive the key for Message 49, because the KDF is one-way; past messages stay safe.
  • Self-Healing (Future Secrecy): If a chain key is compromised, the next Diffie-Hellman ratchet step injects fresh randomness and resets the chain, locking the attacker out of future messages.
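The one-way KDF-chain idea can be demonstrated with a toy symmetric ratchet built from HMAC-SHA256. This shows only the chain-key-to-message-key derivation; the real Double Ratchet also mixes in fresh Diffie-Hellman outputs at each round trip (the self-healing part), which this sketch omits:

```python
import hashlib
import hmac

# Toy KDF chain: each step hashes the chain key forward and derives a
# one-off message key. SHA-256/HMAC cannot be run backwards, so knowing
# message key N reveals nothing about message key N-1.
def kdf_step(chain_key: bytes) -> tuple[bytes, bytes]:
    next_chain = hmac.new(chain_key, b"chain", hashlib.sha256).digest()
    message_key = hmac.new(chain_key, b"message", hashlib.sha256).digest()
    return next_chain, message_key

chain = b"placeholder shared secret"   # stand-in for a real X3DH output
keys = []
for _ in range(3):
    chain, mk = kdf_step(chain)
    keys.append(mk)

# Every message gets a distinct 32-byte key.
print(len(set(keys)) == 3)   # True
```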

[!TIP] Interview Tip: Mention “Double Ratchet” for E2EE questions. It’s the gold standard (used by WhatsApp, Signal, Facebook Secret Conversations).


Interactive Decision Visualizer

Polling vs WebSocket Simulator

[!TIP] Try it yourself: Click the buttons to compare network overhead. Notice how Polling creates many requests (packets) even when no data is available.

  • HTTP Polling 🐢: the request counter climbs with every poll, even when the server has nothing to return.
  • WebSocket ⚡: the counter stays at 1 (the initial upgrade handshake); all subsequent messages flow over that single persistent connection.

7. Scale

Observability (RED Method)

  • Rate: Messages Sent per Second.
  • Errors: Failed Message Deliveries / WebSocket Disconnects.
  • Duration: End-to-End Latency (Sender → Server → Receiver).

Key Metrics:

  • active_connections: Number of open WebSockets per server (Capacity Planning).
  • message_queue_depth: If using Kafka for async tasks.

Deployment Strategy

  • Connection Draining: When deploying a new Chat Server, we cannot just kill the old one (it has active WebSockets). We must stop accepting new connections and wait for old ones to close (or force reconnect).
  • Blue/Green: Essential for zero-downtime updates of stateful services.

8. Summary & Evaluation

Requirements Traceability

| Requirement | Design Decision | Justification |
|-------------|-----------------|---------------|
| Real-Time Delivery | WebSockets | Only persistent connections can achieve <100ms latency. |
| Message Ordering | Sequence IDs | Time-sortable IDs (Snowflake) per chat ensure FIFO. |
| Scalability (Storage) | HBase / Cassandra | Wide-column stores handle billions of small writes efficiently. |
| Group Chat | Hybrid Fanout | Push for small groups, Pull for large channels. |
| Presence | Redis + Heartbeat | Ephemeral keys with TTL map perfectly to "Online Status". |

Interview Gauntlet

Q1: How do you handle “Read Receipts” in a Group of 500?

  • Answer: We do NOT send a separate event for every read. We aggregate them. The client sends “Read up to ID 100”. The server updates the “Read Watermark” for that user in that group.
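The watermark idea reduces N read events to one integer per member. A minimal sketch, assuming the watermark only ever moves forward (late or duplicate acks are ignored):

```python
# Read-receipt aggregation: one "read watermark" per (group, user)
# instead of one event per message per reader.
watermarks: dict[tuple[str, str], int] = {}   # (group_id, user_id) -> max read msg_id

def mark_read(group_id: str, user_id: str, up_to_id: int) -> None:
    key = (group_id, user_id)
    watermarks[key] = max(watermarks.get(key, 0), up_to_id)   # monotonic

def read_count(group_id: str, msg_id: int) -> int:
    """How many members have read message msg_id? One comparison per member."""
    return sum(1 for (g, _), w in watermarks.items() if g == group_id and w >= msg_id)

mark_read("g1", "alice", 100)
mark_read("g1", "bob", 80)
mark_read("g1", "alice", 90)    # stale ack: watermark stays at 100
print(read_count("g1", 90), read_count("g1", 50))   # 1 2
```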

Q2: What happens if a user is offline?

  • Answer: The message is stored in the DB (HBase). When the user connects, they send their LastReadID. The server queries HBase for WHERE chat_id = X AND msg_id > LastReadID.

Q3: How do you store images?

  • Answer: Never in the WebSocket. Upload to S3 (HTTP) → Get URL → Send URL in WebSocket message.

Whiteboard Summary

Real-Time Arch

  • Protocol: WebSockets (Stateful).
  • Discovery: Zookeeper (User → Server Map).
  • Sync: Sequence IDs (Logical Clock).

Storage & Ops

  • DB: HBase (Write-Heavy, Range Scan).
  • Presence: Redis (Heartbeat TTL).
  • Media: Store-and-Forward (S3).