Design WhatsApp

Ashish Pratap Singh

In this chapter, we will dive into the high-level design of a messaging system like WhatsApp.

This problem is a favorite in system design interviews because it tests your understanding of real-time communication, connection management, message ordering, and handling the complexities of a truly global-scale system.

Let's start by clarifying the requirements.

1. Clarifying Requirements

Before diving into the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.

In an interview, these questions would unfold as a short back-and-forth with the interviewer. After gathering the answers, we can summarize the key system requirements as follows.

1.1 Functional Requirements

  • One-on-One Chat: Users can send and receive messages in real-time with other users.
  • Group Chat: Users can create groups and send messages to multiple recipients (up to 500 members).
  • Message Delivery Status: Users can see delivery receipts (sent, delivered, read).
  • Online Presence: Users can see if their contacts are online, offline, or their last seen time.
  • Message History: Users can access their message history and sync across multiple devices.
  • Push Notifications: Offline users receive push notifications for new messages.

1.2 Non-Functional Requirements

  • Low Latency: Messages should be delivered within milliseconds for online users. Target: p99 < 100ms for message delivery.
  • High Availability: The system must be highly available (99.99% uptime). Users expect messaging to work 24/7.
  • Reliability: Messages must never be lost. Once sent, a message should eventually be delivered, even if the recipient is offline.
  • Scalability: Support 500M+ daily active users and 20B+ messages per day.
  • Ordering: Messages within a conversation should appear in the correct order.
  • Consistency: Eventually consistent for presence, strong consistency for message delivery.

2. Back-of-the-Envelope Estimation

To understand the scale of our system, let's make some reasonable assumptions.

Message Throughput

  • Total messages per day: 500M users x 40 messages = 20 billion messages/day
  • Average messages per second: 20B / 86,400 = ~230,000 messages/second
  • Peak load (3x factor): ~700,000 messages/second

Connection Load

  • Concurrent connections: If 10% of DAU are online at any time = 50 million concurrent connections
  • Peak concurrent connections: ~100 million

Each connection requires maintaining a persistent WebSocket, which is a significant infrastructure challenge.

Storage (Per Day)

  • Message storage: 20B messages x 100 bytes = 2 TB/day
  • Annual storage: 2 TB x 365 = 730 TB/year (just for messages)

Bandwidth

  • Incoming bandwidth: 230K msg/sec x 100 bytes = ~23 MB/sec (inbound)
  • Outgoing bandwidth: Higher due to group message fanout. For a message sent to a group of 20, it needs to be delivered 20 times.
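These figures can be sanity-checked with a few lines of arithmetic, using the constants from the assumptions above:

```python
# Back-of-the-envelope numbers from the assumptions above.
DAU = 500_000_000
MSGS_PER_USER_PER_DAY = 40
AVG_MSG_SIZE_BYTES = 100

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY              # 20 billion/day
msgs_per_sec = msgs_per_day / 86_400                    # ~230K/sec average
peak_msgs_per_sec = msgs_per_sec * 3                    # ~700K/sec at peak
concurrent_conns = int(DAU * 0.10)                      # 50M connections
storage_per_day_tb = msgs_per_day * AVG_MSG_SIZE_BYTES / 1e12   # 2 TB/day
inbound_mb_per_sec = msgs_per_sec * AVG_MSG_SIZE_BYTES / 1e6    # ~23 MB/sec

print(f"avg msgs/sec ~ {msgs_per_sec:,.0f}, storage/day = {storage_per_day_tb} TB")
```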

3. Core APIs

The messaging system needs a minimal but powerful set of APIs. Below are the core APIs required for the basic functionality.

1. Send Message

Endpoint: WebSocket message or POST /messages

Sends a message from one user to another user or group.

Request Parameters:
  • sender_id (required): ID of the user sending the message.
  • recipient_id (required): ID of the recipient user or group.
  • message_type (required): Type of recipient (user or group).
  • content (required): Message content (text).
  • client_message_id (required): Client-generated unique ID for deduplication.
  • timestamp (required): Client-side timestamp when message was created.
Sample Response:
  • message_id: Server-generated unique message ID.
  • status: Current status (sent, delivered, or read).
  • server_timestamp: Server-side timestamp for ordering.
Error Cases:
  • 400 Bad Request: Invalid message format or missing required fields.
  • 403 Forbidden: User not authorized to send to this recipient.
  • 429 Too Many Requests: Rate limit exceeded.
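A send-message payload over the WebSocket might look like the following sketch. The field names mirror the parameters listed above, while the envelope shape (the "type" wrapper) is an assumption of this example, not part of the spec:

```python
import json
import time
import uuid

# Illustrative WebSocket payload for the Send Message API.
send_message = {
    "type": "send_message",                   # envelope field (assumption)
    "sender_id": "user_123",
    "recipient_id": "user_456",
    "message_type": "user",                   # "user" or "group"
    "content": "Hello!",
    "client_message_id": str(uuid.uuid4()),   # client-generated, for dedup on retry
    "timestamp": int(time.time() * 1000),     # client-side, ms since epoch
}
payload = json.dumps(send_message)
```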

2. Fetch Messages

Endpoint: GET /conversations/{conversation_id}/messages

Retrieves message history for a conversation.

Request Parameters:
  • conversation_id (required): ID of the conversation.
  • cursor (optional): Pagination cursor for fetching older messages.
  • limit (optional): Number of messages to fetch (default: 50, max: 100).
Sample Response:
  • messages: Array of message objects with id, sender, content, timestamp, status.
  • next_cursor: Cursor for fetching the next page.
  • has_more: Boolean indicating if more messages exist.

3. Update Message Status

Endpoint: POST /messages/{message_id}/status

Updates the delivery status of a message (delivered, read).

Request Parameters:
  • message_id (required): ID of the message.
  • status (required): New status (delivered or read).
  • timestamp (required): When the status change occurred.

4. Get User Presence

Endpoint: GET /users/{user_id}/presence

Gets the online status and last seen time of a user.

Sample Response:
  • user_id: ID of the user.
  • status: Current status (online or offline).
  • last_seen: Timestamp of last activity (if offline).

4. High-Level Design

At a high level, our system must satisfy three core requirements:

  1. Real-time Message Delivery: Messages should reach online recipients instantly.
  2. Offline Message Handling: Messages for offline users should be stored and delivered when they come online.
  3. Group Message Distribution: A single message should be efficiently distributed to all group members.

The key insight is that messaging is fundamentally a push-based system. Unlike request-response APIs, we need to maintain persistent connections with clients to push messages as they arrive.

4.1 Requirement 1: Real-time One-on-One Messaging

Let's start with the core use case: User A sends a message to User B, who is currently online.

Components Needed

Chat Servers

These are stateful servers that maintain persistent WebSocket connections with clients. Each chat server handles thousands of concurrent connections.

Responsibilities:

  • Maintain WebSocket connections with clients
  • Receive messages from senders
  • Route messages to recipients (directly or via other chat servers)
  • Handle connection lifecycle (connect, disconnect, heartbeat)

Session Service

A fast lookup service that maps user IDs to their currently connected chat server.

Responsibilities:

  • Track which chat server each online user is connected to
  • Update mappings when users connect/disconnect
  • Provide O(1) lookup for message routing

This is typically implemented using Redis for its speed and pub/sub capabilities.
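A minimal in-memory sketch of the Session Service. A plain dict stands in for Redis here (in production this would be a SET with an EX expiry plus GET), and all class and method names are illustrative:

```python
import time

class SessionService:
    """Maps user_id -> chat server ID, with a TTL refreshed by heartbeats.
    A dict stands in for Redis; expiry is checked lazily on lookup."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._sessions = {}  # user_id -> (server_id, expires_at)

    def connect(self, user_id, server_id):
        # Called when a user establishes a WebSocket connection.
        self._sessions[user_id] = (server_id, time.time() + self.ttl)

    def heartbeat(self, user_id):
        # Refresh the expiry without changing the server mapping.
        if user_id in self._sessions:
            server_id, _ = self._sessions[user_id]
            self._sessions[user_id] = (server_id, time.time() + self.ttl)

    def lookup(self, user_id):
        # O(1) lookup for message routing; None means offline (or expired).
        entry = self._sessions.get(user_id)
        if entry is None or entry[1] < time.time():
            return None
        return entry[0]

    def disconnect(self, user_id):
        self._sessions.pop(user_id, None)
```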

Message Service

Handles message persistence and retrieval.

Responsibilities:

  • Persist messages to the database
  • Generate server-side message IDs and timestamps
  • Handle message status updates

Flow: Sending a One-on-One Message

  1. User A sends a message through their WebSocket connection to Chat Server 1.
  2. Chat Server 1 sends the message to the Message Service for persistence.
  3. Message Service stores the message in the database and returns a server-generated message ID and timestamp.
  4. Chat Server 1 queries the Session Service to find which server User B is connected to.
  5. Session Service returns that User B is connected to Chat Server 2.
  6. Chat Server 1 forwards the message to Chat Server 2 (via internal RPC or message queue).
  7. Chat Server 2 pushes the message to User B through their WebSocket connection.
  8. User B's client sends an acknowledgment back.
  9. The delivery status is updated, and User A sees the "delivered" checkmark.
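The routing steps above can be condensed into a small sketch. The `message_service`, `servers`, and `offline_queue` collaborators are illustrative stand-ins, not real APIs:

```python
def route_message(msg, session_service, message_service, servers, offline_queue):
    """Sketch of steps 2-7 above. `servers` maps server_id -> a chat server
    object with a deliver() method; all names are illustrative."""
    # Step 2-3: persist first. Never acknowledge before the write succeeds.
    stored = message_service.persist(msg)
    # Step 4-5: find which chat server the recipient is connected to.
    server_id = session_service.lookup(msg["recipient_id"])
    if server_id is None:
        # Recipient offline: queue for later delivery (Section 4.2).
        offline_queue.append(stored)
    else:
        # Step 6-7: forward to the recipient's server for WebSocket push.
        servers[server_id].deliver(stored)
    return stored
```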

4.2 Requirement 2: Handling Offline Users

What happens when User B is offline? We need to store the message and deliver it when they come online.

Additional Components Needed

Message Queue

For offline users, messages are queued for later delivery.

Responsibilities:

  • Store messages for offline users
  • Ensure messages are delivered in order when user comes online
  • Handle retry logic for failed deliveries

Push Notification Service

Sends push notifications to offline users' devices.

Responsibilities:

  • Integrate with APNs (iOS) and FCM (Android)
  • Send notifications for new messages
  • Handle notification preferences and quiet hours

Flow: Message to Offline User

  1. User A sends a message to User B.
  2. Chat Server 1 persists the message in the database via Message Service.
  3. Chat Server 1 queries Session Service and finds User B is offline.
  4. The message is added to User B's message queue (pending delivery).
  5. Push Notification Service sends a push notification to User B's device.
  6. When User B comes online:
    • They establish a WebSocket connection to a Chat Server
    • The server fetches all pending messages from the queue
    • Messages are delivered in order
    • Queue entries are cleared after successful delivery
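Steps 4 and 6 can be sketched with a per-user pending queue. This in-memory version is illustrative; a real system would back it with durable storage and clear entries only after delivery is acknowledged:

```python
from collections import defaultdict, deque

class OfflineQueue:
    """Per-user pending-message queue for offline delivery."""

    def __init__(self):
        self._pending = defaultdict(deque)

    def enqueue(self, user_id, message):
        # Step 4: message is held for an offline recipient, in arrival order.
        self._pending[user_id].append(message)

    def drain(self, user_id):
        """Step 6: on reconnect, return pending messages in order and clear
        the queue. (This sketch clears eagerly; production would clear only
        after the client ACKs each message.)"""
        return list(self._pending.pop(user_id, ()))
```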

4.3 Requirement 3: Group Messaging

Group messaging introduces a fanout challenge. When a user sends a message to a group of 100 members, the message needs to be delivered to all 100 recipients.

Approaches to Group Message Fanout

Approach 1: Sender-Side Fanout

The sender's chat server handles delivering to all group members.

Pros: Simple to implement.
Cons: Puts heavy load on a single server; doesn't scale for large groups.

Approach 2: Message Queue Fanout

Use a message queue with pub/sub capabilities (like Kafka) to distribute the work.

How it works:

  1. Sender publishes message to a group topic
  2. Multiple consumers process the message
  3. Each consumer handles delivery to a subset of group members

Pros: Scales horizontally; work is distributed.
Cons: Adds latency; more complex infrastructure.

Recommendation: pick the fanout strategy based on group size:

  • Small groups (< 100 members): Direct fanout from the sender's server
  • Large groups (100+ members): Use the message queue for distributed fanout

Flow: Group Message Delivery

  1. User A sends a message to Group G.
  2. Chat Server 1 persists the message with group_id in the database.
  3. Chat Server 1 queries the Group Service to get the list of group members.
  4. For each member, it queries Session Service to find their chat server.
  5. Messages are batched by destination chat server and forwarded.
  6. Each chat server delivers to its connected group members.
  7. Offline members' messages go to the message queue for later delivery.
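Steps 4 and 5 amount to bucketing members by destination server, so each chat server receives one batched forward instead of one call per member. A sketch, assuming `session_service` exposes a `lookup(user_id)` that returns a server ID or None:

```python
from collections import defaultdict

def plan_group_fanout(member_ids, session_service):
    """Bucket group members by the chat server they are connected to
    (steps 4-5 above). Offline members are collected separately so their
    messages can go to the offline message queue (step 7)."""
    batches = defaultdict(list)  # server_id -> [user_id, ...]
    offline = []
    for user_id in member_ids:
        server_id = session_service.lookup(user_id)
        if server_id is None:
            offline.append(user_id)
        else:
            batches[server_id].append(user_id)
    return dict(batches), offline
```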

4.4 Putting It All Together

Here's the complete architecture combining all requirements:

  • Load Balancer: Distributes WebSocket connections across chat servers
  • Chat Servers: Maintain persistent connections, route messages in real-time
  • API Gateway: Handles REST API requests for non-real-time operations
  • Session Service (Redis): Maps users to their connected chat server
  • Message Service: Handles message persistence and retrieval
  • Group Service: Manages group membership and metadata
  • Message Queue (Kafka): Buffers messages for offline users and handles fanout
  • Push Notification Service: Sends push notifications via APNs/FCM
  • Message Database: Stores message history (Cassandra for scale)
  • User Database: Stores user profiles and relationships (PostgreSQL)

5. Database Design

5.1 SQL vs NoSQL

To choose the right database for messages, let's consider the access patterns:

  • Write-heavy workload: 20 billion messages per day
  • Simple queries: Fetch messages by conversation, ordered by time
  • No complex joins: Messages are self-contained with sender/recipient IDs
  • Time-series nature: Recent messages are accessed far more than old ones
  • High availability required: Users expect messaging to always work

Given these points, a wide-column NoSQL database like Apache Cassandra or ScyllaDB is ideal for message storage due to:

  • Excellent write performance
  • Linear horizontal scalability
  • Time-series data optimization
  • Tunable consistency levels

For user and group data, a relational database like PostgreSQL works well due to the need for transactions and complex queries.

5.2 Database Schema

1. Messages Table (Cassandra)

Stores all messages with partition key optimized for conversation-based queries.

  • conversation_id (UUID, Partition Key): Unique identifier for the conversation
  • message_id (TimeUUID, Clustering Key): Time-based UUID for ordering
  • sender_id (UUID): ID of the message sender
  • content (Text): Message content
  • message_type (Text): Type: text, image, video
  • status (Text): Delivery status: sent, delivered, read
  • created_at (Timestamp): Server timestamp

Partition Key: conversation_id ensures all messages in a conversation are stored together.

Clustering Key: message_id (TimeUUID) ensures messages are sorted by time within each partition.
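One detail the schema leaves open is how the partition key is chosen for 1:1 chats. A common convention (an assumption of this sketch, not something stated above) is to derive a deterministic conversation_id from the sorted pair of user IDs, so both participants compute the same partition key without a coordination step:

```python
import uuid

def one_to_one_conversation_id(user_a, user_b):
    """Derive a deterministic conversation_id for a 1:1 chat.
    Sorting the pair makes the result order-independent; uuid5 gives a
    stable UUID from the same inputs every time."""
    low, high = sorted([str(user_a), str(user_b)])
    return uuid.uuid5(uuid.NAMESPACE_OID, f"{low}:{high}")
```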

2. User Conversations Table (Cassandra)

Index table to quickly find all conversations for a user.

  • user_id (UUID, Partition Key): User ID
  • conversation_id (UUID, Clustering Key): Conversation ID
  • last_message_at (Timestamp): Time of last message
  • unread_count (Integer): Number of unread messages
  • last_message_preview (Text): Preview of last message

3. Groups Table (PostgreSQL)

Stores group metadata.

  • group_id (UUID, PK): Unique group identifier
  • name (VARCHAR(100)): Group name
  • creator_id (UUID, FK): User who created the group
  • created_at (Timestamp): Creation time
  • member_count (Integer): Number of members

4. Group Members Table (PostgreSQL)

Maps users to groups.

  • group_id (UUID, PK, FK): Group ID
  • user_id (UUID, PK, FK): User ID
  • role (VARCHAR(20)): Role: admin or member
  • joined_at (Timestamp): When the user joined

6. Design Deep Dive

Now that we have the high-level architecture and database schema in place, let's dive deeper into some critical design choices.

6.1 WebSocket vs Long Polling vs Server-Sent Events

Real-time message delivery requires maintaining persistent connections between clients and servers. Let's compare the options.

Approach 1: HTTP Long Polling

The client makes an HTTP request, and the server holds it open until new data is available (or timeout).

How It Works

  1. Client sends HTTP request: "Any new messages?"
  2. Server holds the connection open (up to 30-60 seconds)
  3. When a message arrives, server responds immediately
  4. Client processes the response and immediately makes a new request
  5. If timeout occurs with no messages, server responds empty and client reconnects

Pros

  • Works through all firewalls and proxies
  • Simple to implement on the server side
  • Compatible with existing HTTP infrastructure

Cons

  • High overhead: new TCP connection for each polling cycle
  • Latency: messages wait until next poll cycle
  • Server resource waste: holding many idle connections

Approach 2: Server-Sent Events (SSE)

A one-way channel where the server can push data to the client over a single HTTP connection.

How It Works

  1. Client opens a persistent HTTP connection
  2. Server sends events as they occur
  3. Connection stays open indefinitely
  4. Client sends messages via separate HTTP POST requests

Pros

  • Lower overhead than long polling
  • Automatic reconnection built into the protocol
  • Works with HTTP/2 for multiplexing

Cons

  • Unidirectional: requires separate channel for client-to-server messages
  • Limited browser support for certain features
  • Not ideal for bidirectional real-time communication

Approach 3: WebSocket

A full-duplex, bidirectional communication channel over a single TCP connection.

How It Works

  1. Client initiates WebSocket handshake via HTTP upgrade request
  2. Server accepts and upgrades the connection
  3. Both sides can send messages at any time
  4. Connection stays open until explicitly closed

Pros

  • True bidirectional: Both client and server can send messages anytime
  • Low latency: No HTTP overhead after initial handshake
  • Efficient: Single TCP connection for all messages
  • Real-time: Messages delivered instantly

Cons

  • Requires WebSocket-aware load balancers
  • Stateful connections complicate horizontal scaling
  • Connection management overhead (heartbeats, reconnection)

Summary and Recommendation

  • Long Polling: High latency, low efficiency, low complexity. Best for legacy systems and simple notifications.
  • SSE: Medium latency, medium efficiency, medium complexity. Best for one-way streaming (news feeds, stock tickers).
  • WebSocket: Low latency, high efficiency, high complexity. Best for bidirectional real-time apps (chat, gaming).

Recommendation: Use WebSocket for messaging systems. The bidirectional, low-latency nature is essential for chat applications. Implement fallback to long polling for environments where WebSocket is blocked.

6.2 Message Delivery Guarantees

Users expect three levels of visibility into message status:

  1. Sent (single checkmark): Message reached the server
  2. Delivered (double checkmark): Message reached recipient's device
  3. Read (blue checkmarks): Recipient opened and viewed the message

How Delivery Confirmation Works

Each status transition is driven by an acknowledgment: the server ACKs the sender once the message is persisted (sent), the recipient's device ACKs receipt over its WebSocket (delivered), and the recipient's client reports when the message is actually viewed (read).

Ensuring At-Least-Once Delivery

Messages must never be lost, even if servers crash or networks fail.

Client-Side Retry with Idempotency

  1. Client generates a unique client_message_id before sending
  2. Client sends message to server
  3. If no ACK received within timeout, client retries with same client_message_id
  4. Server uses client_message_id to deduplicate
  5. Duplicate messages are acknowledged but not stored twice
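The dedup rule can be sketched as follows; an in-memory dict stands in for the database's unique constraint on (sender_id, client_message_id):

```python
class MessageStore:
    """Dedup sketch for at-least-once delivery: the server keys on
    (sender_id, client_message_id), so a retried send is acknowledged
    with the same message_id instead of being stored twice."""

    def __init__(self):
        self._by_client_id = {}  # (sender_id, client_message_id) -> message_id
        self._next_id = 0

    def persist(self, sender_id, client_message_id, content):
        key = (sender_id, client_message_id)
        if key in self._by_client_id:
            # Duplicate retry: ACK with the original server-side ID.
            return self._by_client_id[key]
        self._next_id += 1
        message_id = f"msg_{self._next_id}"
        self._by_client_id[key] = message_id
        return message_id
```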

Server-Side Persistence Before Acknowledgment

Critical rule: Never acknowledge a message until it's persisted.

  1. Server receives message
  2. Server writes to database
  3. Only after successful write: Server sends ACK to client

If the server crashes between receiving and persisting, the client will retry.

Handling Out-of-Order Messages

Network conditions can cause messages to arrive out of order. Solutions:

  1. Sequence numbers per conversation: Each message gets an incrementing sequence number
  2. Server-side timestamp: Server assigns authoritative timestamp for ordering
  3. Client-side reordering: Client sorts messages by sequence number before display
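Client-side reordering (solution 3) can be implemented with a small buffer that releases messages for display only when the next expected per-conversation sequence number has arrived:

```python
import heapq

class ReorderBuffer:
    """Holds out-of-order messages until the sequence gap fills, then
    releases them for display in order."""

    def __init__(self):
        self._next_seq = 1
        self._held = []  # min-heap of (seq, message)

    def receive(self, seq, message):
        heapq.heappush(self._held, (seq, message))
        ready = []
        # Release every message whose turn has come, in sequence order.
        while self._held and self._held[0][0] == self._next_seq:
            ready.append(heapq.heappop(self._held)[1])
            self._next_seq += 1
        return ready  # messages now safe to display
```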

6.3 Presence System (Online/Offline Status)

Presence indicates whether a user is currently online, offline, or their last seen time.

Challenges

  • Scale: With 50 million concurrent users, presence updates are frequent
  • Fanout: A user's presence change needs to reach all their contacts
  • Consistency: Status should be reasonably accurate without being perfect

Approach 1: Heartbeat-Based Presence

Clients send periodic heartbeats (every 5-10 seconds) to indicate they're online.

How It Works

  1. Client connects and sends initial "online" signal
  2. Client sends heartbeat every 5 seconds
  3. Server marks user online, sets expiry (e.g., 30 seconds)
  4. If heartbeat stops, user becomes "offline" after expiry
  5. On disconnect, immediate "offline" status

Pros

  • Simple to implement
  • Works across network disruptions (graceful degradation)

Cons

  • Delay in detecting offline status (up to expiry time)
  • Heartbeat overhead for millions of users
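The heartbeat-and-expiry logic above can be sketched as follows. The injectable clock is purely for testability; in production the expiry would be a Redis key TTL rather than a timestamp comparison:

```python
import time

class PresenceTracker:
    """Heartbeat-based presence: each heartbeat refreshes an expiry window.
    A user with no heartbeat inside the TTL window reads as offline."""

    def __init__(self, ttl_seconds=30, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for testing
        self._last_heartbeat = {}  # user_id -> timestamp of last heartbeat

    def heartbeat(self, user_id):
        self._last_heartbeat[user_id] = self.clock()

    def status(self, user_id):
        last = self._last_heartbeat.get(user_id)
        if last is None or self.clock() - last > self.ttl:
            return "offline"  # never seen, or heartbeats stopped past TTL
        return "online"
```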

Approach 2: Presence Channels with Pub/Sub

Use Redis pub/sub to distribute presence updates to interested parties.

How It Works

  1. User A's contacts subscribe to channel presence:user_a
  2. When User A's status changes, publish to presence:user_a
  3. All subscribers receive the update in real-time

Fanout Optimization

For users with many contacts (e.g., 1000), full fanout is expensive. Solutions:

  • Lazy presence: Only query presence when user opens a chat
  • Presence batching: Batch multiple presence updates together
  • Presence on demand: Contacts request presence only when viewing contact list

Last Seen Timestamp

Instead of binary online/offline, show "last seen at [time]":

  1. Update last_seen timestamp on every user action
  2. When queried, return the timestamp
  3. Client displays relative time ("last seen 5 minutes ago")

This provides useful information without real-time presence overhead.
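The client-side rendering in step 3 is a simple threshold function; the cutoffs here are illustrative:

```python
def last_seen_text(last_seen_ts, now_ts):
    """Render a last_seen timestamp as relative text (step 3 above).
    Both arguments are Unix timestamps in seconds."""
    delta = int(now_ts - last_seen_ts)
    if delta < 60:
        return "last seen just now"
    if delta < 3600:
        return f"last seen {delta // 60} minutes ago"
    if delta < 86400:
        return f"last seen {delta // 3600} hours ago"
    return f"last seen {delta // 86400} days ago"
```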

Recommendation

Use heartbeat-based presence with lazy querying:

  • Update presence on heartbeat (every 10 seconds)
  • Store in Redis with TTL
  • Query presence only when needed (opening chat, viewing contacts)
  • Avoid broadcasting presence to all contacts

6.4 Message Synchronization Across Devices

Users expect their message history to be available across all their devices (phone, tablet, web).

Sync Strategies

Approach 1: Pull-Based Sync

Client pulls messages it doesn't have by requesting messages after a certain timestamp or sequence number.

Pros: Simple; client controls sync timing.
Cons: May miss messages if the client is offline for long periods.

Approach 2: Push-Based Sync

Server pushes new messages to all connected devices in real-time.

Pros: Instant sync across devices.
Cons: Requires tracking all device connections per user.

Recommendation: combine both approaches:

  1. Real-time push: When a device is connected, push new messages immediately
  2. Catch-up pull: When a device comes online, pull any messages it missed
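The catch-up pull in step 2 reduces to filtering the server-side log by the highest sequence number the device has already seen. A sketch over an in-memory log:

```python
def catch_up(server_log, last_synced_seq):
    """Catch-up pull (step 2 above): a reconnecting device sends the highest
    per-conversation sequence number it has, and the server returns everything
    newer. `server_log` is a list of (seq, message) pairs, an in-memory
    stand-in for a database query ordered by sequence number."""
    return [msg for seq, msg in server_log if seq > last_synced_seq]
```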

Multi-Device Message Delivery

When User A has 3 devices connected:

  1. Message arrives for User A
  2. Session Service returns all 3 device connections
  3. Message is pushed to all 3 devices
  4. Each device sends independent ACK
  5. "Delivered" status is set when any device acknowledges
  6. "Read" status is set when user opens the message on any device

6.5 Scaling Chat Servers

Chat servers are the most resource-intensive component because they maintain millions of persistent WebSocket connections.

Connection Limits

A single server can handle approximately 50,000-100,000 concurrent WebSocket connections (depending on hardware and message throughput).

For 50 million concurrent users, we need: 50M / 50K = 1,000 chat servers

Sticky Sessions

WebSocket connections are stateful. Once established, all messages for that user must go through the same server.

Load balancer configuration:

  • Use consistent hashing based on user_id
  • Or use connection-aware load balancing

Handling Server Failures

When a chat server crashes:

  1. All connected clients detect disconnection
  2. Clients automatically reconnect to another server
  3. New server registers the connection in Session Service
  4. Pending messages are fetched from the message queue
  5. Message delivery resumes

Graceful Shutdown

For planned maintenance:

  1. Stop accepting new connections
  2. Notify connected clients to reconnect elsewhere
  3. Wait for connections to drain (with timeout)
  4. Shutdown server

6.6 End-to-End Encryption (Conceptual)

End-to-end encryption ensures that only the sender and recipient can read messages. Not even the service provider can decrypt them.

High-Level Approach (Signal Protocol)

  1. Key Generation: Each device generates a public/private key pair
  2. Key Exchange: Users exchange public keys when starting a conversation
  3. Message Encryption: Sender encrypts message with recipient's public key
  4. Transmission: Encrypted message travels through servers
  5. Decryption: Only recipient's private key can decrypt

Server's Role

With E2E encryption, the server:

  • Can: Route encrypted messages, store encrypted data, manage delivery status
  • Cannot: Read message content, provide message content to third parties

Trade-offs

  • Security: Strong privacy protection
  • Complexity: Key management across devices, handling key changes
  • Features limited: Server-side search, spam detection more difficult
