Last Updated: December 28, 2025
WhatsApp is a widely used instant messaging application that enables real-time communication between users. Users can send text messages, media files, and other content to individuals or groups, with messages typically delivered within milliseconds.
The core idea is deceptively simple: User A sends a message, and User B receives it instantly. However, achieving this at scale with billions of users, while ensuring message delivery guarantees, handling offline users, and supporting group conversations, introduces significant distributed systems challenges.
Other Popular Examples: Facebook Messenger, Telegram, Signal, WeChat
In this chapter, we will dive into the high-level design of a messaging system like WhatsApp.
This problem is a favorite in system design interviews because it touches on so many fundamental concepts: real-time communication, persistent connections, message ordering, delivery guarantees, and the challenges of building a truly global-scale system.
Let's start by understanding what exactly we are building.
Before diving into the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.
Here is an example of how a discussion between the candidate and the interviewer might unfold:
Candidate: "What is the expected scale? How many users and messages per day should the system support?"
Interviewer: "Let's design for 500 million daily active users (DAU) sending an average of 40 messages per day."
Candidate: "Should we support only one-on-one messaging, or also group chats?"
Interviewer: "Both. Group chats should support up to 500 members."
Candidate: "What types of content should messages support? Text only, or also media like images and videos?"
Interviewer: "Focus on text messages for the core design. You can mention media handling at a high level, but detailed media processing is out of scope."
Candidate: "Do we need to show online/offline status and typing indicators?"
Interviewer: "Yes, presence indicators (online/offline/last seen) are important. Typing indicators are nice-to-have."
Candidate: "What about message delivery guarantees? Should users see read receipts?"
Interviewer: "Yes. Users should see when their message is delivered and when it's read. Messages should never be lost."
Candidate: "Should messages be stored permanently, or can they expire?"
Interviewer: "Messages should be stored until explicitly deleted by the user. We need to support message history sync across devices."
Candidate: "What about end-to-end encryption?"
Interviewer: "You can mention it conceptually, but detailed cryptographic implementation is out of scope."
After gathering the details, we can summarize the key system requirements.
To keep our discussion focused, we will set aside a few features that, while important, would take us down rabbit holes:
With our requirements clear, let's understand the scale we are dealing with. In most interviews, you are not required to do a detailed estimation, but a few back-of-the-envelope numbers will ground our design decisions.
We will use these baseline numbers throughout our calculations:
Let's start with the fundamental question: how many messages flow through this system?
Twenty billion. That is 20,000,000,000 messages every single day. Let's convert that to something more tangible:
The 3x multiplier accounts for peak hours when everyone is awake and chatting. Traffic is never uniform throughout the day.
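The arithmetic behind these figures is worth sketching explicitly. All inputs come from the requirements above; the 3x peak multiplier is the usual rule of thumb for non-uniform traffic:

```python
# Back-of-envelope throughput estimate (numbers from the requirements above).
DAU = 500_000_000          # daily active users
MSGS_PER_USER_PER_DAY = 40
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 3        # traffic is never uniform across the day

messages_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_qps = messages_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_MULTIPLIER

print(f"{messages_per_day:,} messages/day")       # 20,000,000,000
print(f"~{avg_qps:,.0f} messages/sec average")    # ~231,481
print(f"~{peak_qps:,.0f} messages/sec at peak")   # ~694,444
```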
These numbers tell us something important: we are looking at hundreds of thousands of concurrent operations per second. This is not a system where we can make a database query for every message. We need persistent connections, efficient routing, and aggressive caching.
Here is where messaging systems get interesting, and fundamentally different from typical web applications. Unlike a website where users make requests and disconnect, a messaging app needs to push messages to users the instant they arrive.
That means maintaining persistent connections with every online user.
Each of these 50 million connections requires maintaining a persistent WebSocket. This is a fundamentally different challenge from handling 50 million HTTP requests per day. These connections stay open, consuming memory and file descriptors on our servers.
If a single well-tuned server can handle 50,000 concurrent WebSocket connections (a reasonable estimate for modern hardware with proper kernel tuning), we need:
Just for connection handling alone, we need a fleet of a thousand servers.
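A quick sanity check of that fleet size, assuming 50 million concurrent users (roughly 10% of DAU) and the 50,000-connections-per-server figure above:

```python
import math

ONLINE_CONCURRENT = 50_000_000      # assumed concurrent online users (~10% of DAU)
CONNECTIONS_PER_SERVER = 50_000     # well-tuned server with proper kernel tuning

servers_needed = math.ceil(ONLINE_CONCURRENT / CONNECTIONS_PER_SERVER)
print(servers_needed)  # 1000
```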
Storage requirements for a text-only system are more modest than you might expect:
Seven hundred terabytes per year sounds substantial, but it is well within reach of modern distributed databases like Cassandra or ScyllaDB. For context, a single NVMe drive can hold 4 TB, so we are talking about a few hundred drives worth of storage.
The real challenge with storage is not capacity, it is the access patterns. We need to write 230,000 messages per second while simultaneously reading message history and syncing devices. Latency matters more than raw throughput.
Bandwidth is rarely the bottleneck for text messaging, but let's verify:
When a message goes to a group of 20 members, it needs to reach 20 devices. If 30% of messages are group messages, outbound traffic multiplies accordingly.
But even accounting for this, we are looking at hundreds of megabytes per second, easily handled by modern network infrastructure.
Before diving into architecture, it helps to think about the API contract. What operations does our system need to support?
Defining the APIs early forces us to think concretely about what users can do and what data flows through the system.
A messaging system's API is unusual compared to typical web services. Most real-time communication happens over persistent WebSocket connections, not traditional HTTP request-response.
However, we still need REST endpoints for operations that do not require instant delivery, like fetching message history.
Let's walk through the essential APIs.
WebSocket message or `POST /messages`

This is the heart of our system. When a user taps send, this API handles getting the message from their device to ours.
In practice, this almost always goes over the WebSocket connection for lowest latency, but having a REST fallback is useful when WebSocket connections fail.
The client_message_id deserves special attention. Networks are unreliable. A user might tap send, their phone loses connectivity for a moment, and the app retries the send.
Without deduplication, the recipient would see the same message twice. By including a client-generated ID, the server can detect and ignore duplicates, giving us effectively exactly-once delivery on top of at-least-once retries.
`GET /conversations/{conversation_id}/messages`

When a user opens an old conversation or logs in on a new device, they need to see their message history. This endpoint retrieves messages for a conversation, typically the most recent ones first.
Notice that we use cursor-based pagination rather than offset-based. With billions of messages, a query like OFFSET 1000000 would be painfully slow, requiring the database to skip over a million rows. Cursor-based pagination uses an indexed value (like a message ID or timestamp) to efficiently jump to the right position.
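To make the difference concrete, here is a minimal Python sketch of cursor-based pagination. A sorted list stands in for the database index, and `bisect` plays the role of the indexed seek; the store layout and field names are illustrative:

```python
import bisect
from typing import Optional

def fetch_messages(messages: list[dict], limit: int = 50,
                   before_id: Optional[int] = None) -> tuple[list[dict], Optional[int]]:
    """Return up to `limit` messages older than `before_id`, newest first,
    plus the cursor to pass on the next call (None when the history is exhausted).
    `messages` is sorted ascending by message_id, like an index would be."""
    ids = [m["message_id"] for m in messages]
    # Jump straight to the cursor position via the index -- no OFFSET scan.
    end = bisect.bisect_left(ids, before_id) if before_id is not None else len(messages)
    start = max(0, end - limit)
    page = list(reversed(messages[start:end]))          # newest first
    next_cursor = page[-1]["message_id"] if start > 0 else None
    return page, next_cursor

convo = [{"message_id": i, "content": f"msg {i}"} for i in range(1, 121)]
page1, cursor = fetch_messages(convo, limit=50)                     # messages 120..71
page2, cursor = fetch_messages(convo, limit=50, before_id=cursor)   # messages 70..21
```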
`POST /messages/{message_id}/status`

This is what powers those checkmarks. When a message is delivered to the recipient's device or opened by the user, we need to update its status and notify the sender.
Status updates flow in one direction: sent → delivered → read. We never go backwards. The timestamp helps with edge cases where status updates arrive out of order due to network delays.
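This forward-only rule is easy to enforce with a rank comparison. A minimal sketch, using the status names from our API:

```python
# Status updates only move forward: sent -> delivered -> read.
STATUS_ORDER = {"sent": 0, "delivered": 1, "read": 2}

def apply_status_update(current: str, incoming: str) -> str:
    """Apply an incoming status update, ignoring any that would move the
    message backwards (e.g. a late-arriving 'delivered' after 'read')."""
    if STATUS_ORDER[incoming] > STATUS_ORDER[current]:
        return incoming
    return current

status = "sent"
status = apply_status_update(status, "read")       # jumping ahead is fine
status = apply_status_update(status, "delivered")  # late update, ignored
print(status)  # read
```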
`GET /users/{user_id}/presence`

Returns whether a user is currently online and, if offline, when they were last active. This powers the "online" indicator and "last seen" text in the UI.
Presence is intentionally kept simple. We do not need to know exactly what a user is doing, just whether they are actively connected. Privacy controls allow users to hide their last seen time, in which case we simply omit that field.
With our API contract defined, we have a clear picture of what the system needs to do. Now let's design the architecture that makes these APIs work at scale.
Now we get to the heart of the design. Rather than throwing a complex architecture diagram at you with 15 boxes and wondering what each one does, we are going to build this system incrementally.
We will start with the simplest possible design that solves our first requirement, then add components only as we encounter new challenges. This mirrors how you should think through the problem in an interview.
Our system must ultimately satisfy three core requirements:
Before we dive into the architecture, let's understand the key insight that shapes everything: messaging is fundamentally a push-based system.
Think about how a typical web application works. Your browser requests a page, the server responds, and the connection closes. If you want new data, you request again. This request-response pattern works great for most applications, but it falls apart for messaging.
You cannot expect users to constantly refresh to check for new messages. The moment a message arrives at our servers, we need to push it to the recipient's device immediately.
This push-based nature is why we need persistent WebSocket connections rather than traditional HTTP. And maintaining millions of persistent connections creates a whole set of challenges that we need to address.
Let's start building, one requirement at a time.
Let's start with the simplest possible scenario: User A sends a message to User B, and User B is currently online with the app open.
What do we need to make this work?
The naive approach might be: store the message in a database and have User B periodically check for new messages. But polling introduces latency and wastes resources.
We need to push the message the instant it arrives. This means maintaining a persistent connection between our servers and User B's device.
Let's introduce the components one by one, understanding why each exists.
These are the workhorses of our system. Each chat server maintains persistent WebSocket connections with thousands of clients simultaneously.
When User A opens the messaging app, their phone establishes a WebSocket connection to one of the chat servers. This connection stays open for as long as the app is in use. When User A sends a message, it travels over this existing connection, no need to establish a new one.
Here is an important insight: chat servers are stateful. Unlike typical web servers where any server can handle any request, User B's messages must go to the specific chat server where User B's connection lives. If User B is connected to Chat Server 2, sending their message to Chat Server 1 will not help.
This statefulness creates a routing challenge. When User A sends a message to User B, how does Chat Server 1 know that User B is on Chat Server 2?
This is where the Session Service comes in. It maintains a simple but critical mapping: which user is connected to which chat server.
When User B connects to Chat Server 2, that server registers the connection: "User B is on Chat Server 2." When User A wants to send a message to User B, we query the Session Service: "Where is User B?" It responds: "Chat Server 2."
We typically implement this using Redis because it offers exactly what we need: fast key-value lookups with built-in expiration for handling disconnections. The data structure is simple:
While routing messages in real-time is essential, we also need to persist them. Users expect to see their message history. If User B's phone dies right as a message arrives, we do not want to lose it.
Now let's trace what happens when User A sends "Hey, how's it going?" to User B. Both users are online, connected to different chat servers.
Let's walk through each step to understand what is happening:
Step 1-3: Receive and persist
User A taps send. The message travels over the existing WebSocket connection to Chat Server 1. Before doing anything else, Chat Server 1 asks the Message Service to persist the message. This is critical. If we route the message first and something fails, the message could be lost. By persisting first, we guarantee that no matter what happens next, the message is safely stored.
The Message Service writes the message to the database and returns a server-generated message ID and timestamp. The timestamp is important because the server's clock is the source of truth for message ordering, not the client's clock which might be wrong.
Step 4-5: Find the recipient
With the message safely stored, Chat Server 1 needs to find User B. It queries the Session Service: "Where is User B connected?" The Session Service responds: "Chat Server 2." This lookup takes less than a millisecond thanks to Redis.
Step 6-7: Route and deliver
Chat Server 1 forwards the message to Chat Server 2. This happens over a direct connection between servers, typically using gRPC or a similar efficient protocol. Chat Server 2 receives the message and pushes it to User B over their WebSocket connection.
Step 8-11: Acknowledge and confirm
User B's client receives the message and sends an acknowledgment back. This ACK travels back through the system, updating the message status to "delivered" in the database along the way. Finally, User A's client receives the delivery confirmation and updates the UI to show the double checkmark.
This entire round trip, from User A tapping send to seeing the delivered checkmark, typically completes in under 100 milliseconds when both users are online. That is fast enough that conversations feel instantaneous.
But here is the question that should be nagging at you: what happens when User B is not online?
The flow we just designed works beautifully when both users are online. But real-world messaging is messier. What happens when User B's phone is in airplane mode? What if they have not opened the app in hours? What if they are in a subway tunnel with no signal?
We cannot just drop the message. This would violate our reliability requirement. Users expect that once they tap send, the message will eventually arrive, even if the recipient is unreachable for hours or days.
This requirement forces us to think differently about message delivery. We cannot just push a message and forget about it. We need to track pending deliveries and retry when users come back online.
Let's introduce two new pieces to our architecture.
Think of the message queue as a mailbox. When we discover that User B is offline, instead of dropping the message, we place it in User B's queue. The messages sit there, safe and ordered, until User B comes back online.
We typically use a system like Kafka or Redis Streams for this. The key insight is that this is not the same as our message database. The database is for long-term storage and history. The queue is for pending deliveries, messages that have been persisted but not yet delivered to the recipient's device.
Even though we cannot deliver the message content directly to an offline user, we can still tell them something is waiting. This is where push notifications come in.
Let's trace what happens when User A sends a message but User B is offline.
The beauty of this design is that the message is never lost. Whether User B comes online in 10 seconds or 10 days, the message will be waiting. The queue acts as a reliable buffer between the sender and an unreachable recipient.
We persist the message to the database before adding it to the queue. This means even if the queue itself fails (a rare event), we have not lost the message. The database is our source of truth; the queue is just an optimization for fast delivery.
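Putting the persist-first rule and the queue together, the routing decision might look like this sketch, where dicts and lists stand in for the database, the per-user queues, and the Session Service:

```python
def handle_incoming_message(message: dict, db: list, queues: dict, sessions: dict) -> str:
    """Sketch of the offline-delivery flow: persist first, then either push to
    the recipient's chat server or park the message in their pending queue."""
    db.append(message)  # durable store is the source of truth; always write first
    recipient = message["to"]
    server = sessions.get(recipient)                   # Session Service lookup
    if server is not None:
        return f"pushed via {server}"                  # recipient online: real-time push
    queues.setdefault(recipient, []).append(message)   # offline: buffer for later
    return "queued"                                    # plus a push notification in practice

db, queues = [], {}
sessions = {"user_b": "chat-server-2"}  # Session Service stand-in

print(handle_incoming_message({"to": "user_b", "content": "hi"}, db, queues, sessions))
print(handle_incoming_message({"to": "user_c", "content": "hey"}, db, queues, sessions))
print(len(db))  # 2 -- both messages persisted regardless of delivery path
```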
So far we have handled one-on-one messaging elegantly. But groups introduce a new challenge that fundamentally changes our design: fanout.
Consider this scenario: User A sends "Happy New Year!" to a family group with 50 members. That single message needs to reach 50 different devices, potentially scattered across 20 different chat servers, some members online and some offline, some on fast WiFi and some on spotty mobile networks.
With one-on-one messaging, one input means one output. With groups, one input means many outputs. This multiplier effect is called fanout, and it can easily overwhelm a naive implementation.
Let's visualize what happens when a message goes to a group:
If User A sends a message to a group with 500 members (our maximum), and the sender's chat server has to individually deliver to all 500, we have a problem:
There are several ways to handle fanout. Let's examine each and understand their trade-offs.
The simplest approach is to have the sender's chat server do all the work. When User A sends a group message, Chat Server 1 looks up all group members, finds their chat servers, and delivers to each one.
How it works:
The good:
The bad:
This approach works fine for small groups (under 50-100 members), which are the majority of groups in typical usage patterns.
For larger groups, we can use a message queue to distribute the work across multiple workers.
How it works:
The good:
The bad:
The smart solution is to combine both approaches, choosing based on group size:
The threshold of 100 is not magic; it is a tunable parameter based on your server capacity. The key insight is that different group sizes warrant different delivery strategies. This hybrid approach gives us the best of both worlds: low latency for the common case and scalability for the edge cases.
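The routing decision itself is a one-liner; here is a sketch with the threshold as a tunable constant:

```python
DIRECT_FANOUT_THRESHOLD = 100  # tunable, based on chat server capacity

def choose_fanout_strategy(member_count: int) -> str:
    """Small groups are fanned out directly by the sender's chat server;
    large groups go through the message queue and a pool of workers."""
    if member_count <= DIRECT_FANOUT_THRESHOLD:
        return "direct"   # lowest latency, fine for the common case
    return "queue"        # bounded load on any single chat server

print(choose_fanout_strategy(12))   # direct
print(choose_fanout_strategy(500))  # queue
```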
Let's put it all together and trace a group message from send to delivery:
The sender's chat server first persists the message with a group_id instead of a single recipient, making it durable. From there, the flow handles groups of any size efficiently. For small groups, it completes in tens of milliseconds. For larger groups using the queue-based approach, delivery might take a bit longer but remains reliable.
We have now addressed each requirement incrementally. Let's step back and see the complete picture. This is the architecture you would draw on the whiteboard after explaining each component:
Looking at this architecture, we can identify distinct layers, each with a specific responsibility:
Client Layer: Mobile apps and web browsers connect to our system. From our perspective, they are all just WebSocket clients sending and receiving JSON messages.
Edge Layer: The load balancer distributes incoming connections across chat servers. For WebSocket connections, we typically use sticky sessions (or consistent hashing by user ID) so that a user's connection stays on the same server after initial assignment.
Real-time Chat Layer: The fleet of chat servers handles all the persistent connections. These are stateful servers, meaning they remember which users are connected to them. This is fundamentally different from stateless web servers where any server can handle any request.
Service Layer: These are traditional stateless services handling specific domains: messages, groups, users, and presence. They can scale horizontally without coordination.
Data Layer: Redis provides fast, ephemeral storage for session mappings and presence. Kafka queues messages for reliable delivery. Cassandra stores the actual message history, optimized for write-heavy, time-ordered data. PostgreSQL handles user and group data where we need transactions and complex queries.
| Component | Purpose | Why This Technology? |
|---|---|---|
| Load Balancer | Distributes WebSocket connections across chat servers | Sticky sessions for connection persistence |
| Chat Servers | Maintain persistent connections, route messages in real-time | Stateful, handles 50K+ connections each |
| API Gateway | Handles REST API requests for non-real-time operations | Rate limiting, authentication |
| Session Service (Redis) | Maps users to their connected chat server | Sub-millisecond lookups, pub/sub for presence |
| Message Service | Handles message persistence and retrieval | Decouples chat servers from storage |
| Group Service | Manages group membership and metadata | ACID transactions for consistency |
| Presence Service | Tracks online/offline status | Real-time updates via Redis |
| Message Queue (Kafka) | Buffers messages for offline users, handles fanout | Durability, ordering guarantees |
| Push Notification Service | Sends push notifications via APNs/FCM | Async processing, retry logic |
| Message Database (Cassandra) | Stores message history | Write-optimized, time-series friendly |
| User Database (PostgreSQL) | Stores user profiles and relationships | Complex queries, transactions |
With the high-level architecture clear, let's dive into how we store all this data efficiently.
The database layer can make or break a messaging system. With 20 billion messages per day and 500 million users, we need to make careful choices. The wrong database will become a bottleneck that is painful to fix later.
Let's think through the requirements and choose appropriately.
One of the most common mistakes in system design is treating all data the same. A messaging system has two fundamentally different types of data, and each deserves a different storage strategy.
Think about how we access messages:
Given these patterns, a wide-column NoSQL database like Apache Cassandra or ScyllaDB is the right choice:
Now think about user and group data:
For this, a relational database like PostgreSQL makes more sense:
With our database choices made, let's design the actual schemas. We have three categories of data, each stored in the technology best suited for it:
This is the heart of our storage layer. The schema design is driven by a single question: "What is the most common query we need to answer?"
For a messaging app, that query is: "Get the last 50 messages for this conversation, ordered by time."
We design the entire table around this access pattern:
| Field | Type | Description |
|---|---|---|
conversation_id | UUID (Partition Key) | Unique identifier for the conversation |
message_id | TimeUUID (Clustering Key) | Time-based UUID for ordering |
sender_id | UUID | ID of the message sender |
content | Text | Message content |
message_type | Text | Type: text, image, video |
status | Text | Delivery status: sent, delivered, read |
created_at | Timestamp | Server timestamp |
Let's understand why each field is where it is:
Partition Key (conversation_id): This determines which nodes store the data. All messages in a single conversation live together on the same nodes. When we query "last 50 messages for conversation X", Cassandra knows exactly which nodes to ask. This is what makes reads fast.
Clustering Key (message_id as TimeUUID): Within a partition (a conversation), messages are physically sorted on disk by the clustering key. A TimeUUID is a special UUID that encodes the timestamp, so messages are automatically ordered by time. Fetching "the last 50 messages" becomes a simple range scan, not a full table scan.
The combination of partition key and clustering key means that our most common query, "get recent messages for a conversation", hits a single partition on a small number of nodes and reads data that is already sorted. This is as fast as it gets.
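A toy Python model of this access pattern, with a sorted in-memory list standing in for the clustering order Cassandra maintains on disk. The real work happens in the database; this only illustrates why the read is a cheap range scan:

```python
from collections import defaultdict

# Toy model of the messages table: one partition per conversation, rows kept
# sorted by a time-ordered message id (standing in for a TimeUUID clustering key).
table: dict[str, list[dict]] = defaultdict(list)

def insert(conversation_id: str, message_id: int, content: str) -> None:
    partition = table[conversation_id]           # partition key picks the partition
    partition.append({"message_id": message_id, "content": content})
    partition.sort(key=lambda r: r["message_id"])  # clustering key keeps rows ordered

def recent_messages(conversation_id: str, limit: int = 50) -> list[dict]:
    # Single-partition read of already-sorted rows: a cheap range scan, newest first.
    return list(reversed(table[conversation_id][-limit:]))

for i in range(1, 61):
    insert("conv-1", i, f"msg {i}")
page = recent_messages("conv-1")
print(page[0]["message_id"], page[-1]["message_id"])  # 60 11
```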
When a user opens the app, the first thing they see is their conversation list. We need to answer: "What are this user's recent conversations, and what was the last message in each?"
| Field | Type | Description |
|---|---|---|
user_id | UUID (Partition Key) | User ID |
conversation_id | UUID (Clustering Key) | Conversation ID |
last_message_at | Timestamp | Time of last message |
unread_count | Integer | Number of unread messages |
last_message_preview | Text | Preview of last message |
Notice that we store last_message_preview directly in this table. This is intentional denormalization. When rendering the conversation list, we can show "Hey, are you coming for lunch..." without querying the messages table at all.
In a normalized design, we would have to join or make a second query. Here, one query gives us everything we need.
This is a common pattern in Cassandra: store the data in the shape you need to read it, even if it means duplicating information across tables.
Group metadata lives in PostgreSQL where we can use proper relational modeling:
| Field | Type | Description |
|---|---|---|
group_id | UUID (PK) | Unique group identifier |
name | VARCHAR(100) | Group name |
creator_id | UUID (FK) | User who created the group |
created_at | Timestamp | Creation time |
member_count | Integer | Number of members |
The member_count is denormalized here even though we could compute it from the members table. This avoids a COUNT query every time we need to display group info.
This is the join table that maps users to groups:
| Field | Type | Description |
|---|---|---|
group_id | UUID (PK, FK) | Group ID |
user_id | UUID (PK, FK) | User ID |
role | VARCHAR(20) | Role: admin, member |
joined_at | Timestamp | When user joined |
The composite primary key (group_id, user_id) serves two purposes:
With these tables, we can handle all group operations with standard SQL queries and proper transaction support. When a user joins a group, we update both the membership table and the group's member_count in a single transaction.
Now let's move on to the most interesting part of the design: the deep dive into specific challenges.
The high-level architecture gives us the skeleton, but interviewers often want to probe deeper into specific areas. This is where you demonstrate not just that you know what components to use, but that you understand how they work and why certain approaches are better than others.
Let's explore the trickiest aspects of building a messaging system.
We have mentioned WebSocket connections throughout this design, but why WebSocket specifically?
There are several ways to achieve real-time communication between clients and servers. Each has different trade-offs in terms of latency, resource usage, and complexity.
Let's understand them so you can explain the choice in an interview.
Long polling is the oldest technique for achieving real-time-like behavior with plain HTTP. It predates WebSockets and was the backbone of early real-time web apps like Gmail's chat.
The idea is simple: the client makes an HTTP request asking "any new messages for me?" Instead of responding immediately with "no," the server holds the connection open. If a new message arrives while the connection is open, the server responds with it immediately. If nothing happens for 30-60 seconds, the server responds with an empty result, and the client immediately makes another request.
The result is a continuous loop of requests that approximates a persistent connection using standard HTTP semantics.
The good:
The bad:
Long polling got us through the early web era, but it is not ideal for a modern messaging system with millions of concurrent users.
SSE improves on long polling by establishing a true persistent connection, but only in one direction. The server can push events to the client continuously, but the client still needs to use regular HTTP requests to send data back.
Think of SSE as a one-way pipe from server to client. The server can push events whenever it wants, but sending a message back requires a separate HTTP POST.
The good:
The bad:
SSE is a good fit for notification streams, live feeds, or stock tickers where the server broadcasts and the client mostly listens. For chat, where both sides constantly send data, we need something better.
WebSocket is the modern solution. It provides a true bidirectional channel where both client and server can send messages at any time, over a single persistent TCP connection.
The connection starts with a standard HTTP request that includes an "Upgrade" header. If the server supports WebSocket, it responds with 101 Switching Protocols, and from that point on, the connection is a full WebSocket. Both sides can send frames whenever they want; there is no request/response dance.
The good:
The bad:
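The upgrade handshake described above is concrete enough to show. Per RFC 6455, the server proves it understood the request by hashing the client's `Sec-WebSocket-Key` with a fixed GUID and echoing the result back as `Sec-WebSocket-Accept` alongside the 101 response:

```python
import base64
import hashlib

WS_MAGIC_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed by RFC 6455

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header the server returns with
    '101 Switching Protocols' to prove it understood the upgrade request."""
    digest = hashlib.sha1((sec_websocket_key + WS_MAGIC_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Example key and expected accept value from RFC 6455:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```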
| Approach | Latency | Overhead | Bidirectional | Best For |
|---|---|---|---|---|
| Long Polling | High | High | No | Legacy systems, fallback |
| SSE | Medium | Low | No | Notifications, live feeds |
| WebSocket | Lowest | Lowest | Yes | Chat, gaming, collaboration |
For a messaging system, WebSocket is the clear winner. The bidirectional nature matches how chat works: both users send and receive constantly. The low latency means conversations feel instant. The single-connection efficiency means we can handle more users per server.
The main challenge with WebSocket is the stateful nature, but we have already addressed this with our Session Service design. We accept the added complexity because the user experience benefits are substantial.
Always implement long polling as a fallback. Some corporate networks and older proxies still block WebSocket connections. Your client should detect this and gracefully fall back to long polling.
Everyone who has used WhatsApp knows the checkmarks: one gray for sent, two gray for delivered, two blue for read. These simple icons hide a lot of complexity.
How do we track message state reliably across unreliable networks, flaky mobile connections, and devices that go offline unpredictably?
Let's break down what each state means and how we guarantee correct transitions:
Getting these states right requires careful engineering. Networks fail, devices go offline mid-delivery, and the same message might be sent twice due to retries. Let's see how to handle this.
The cardinal rule of messaging: messages must never be lost. A user who sees the "sent" checkmark should be confident that their message will eventually reach its destination, even if networks fail, servers crash, or the recipient's phone runs out of battery.
Achieving this requires a combination of two techniques: the client retries aggressively, and the server deduplicates.
Here is the key insight that makes reliable messaging possible: if the server can detect duplicate messages, the client can safely retry as many times as needed without fear of the message appearing twice.
Here is how this works in practice:
1. The client generates a unique client_message_id (typically a UUID) for every message. This ID is the message's fingerprint.
2. Every send, and every retry of that send, carries the same client_message_id.
3. Before persisting, the server asks: "Have I seen this client_message_id before?" If yes, it discards the duplicate and re-sends the original acknowledgment; if not, it persists and delivers as usual.

This pattern is called idempotent delivery. The same operation can be performed multiple times with the same result. The client can retry as aggressively as it needs to, and the server guarantees that duplicate messages are detected and discarded.
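A minimal sketch of the server side of this contract, with an in-memory dict standing in for the durable store:

```python
class MessageServer:
    """Sketch of server-side dedup: the first send wins, retries are ignored."""

    def __init__(self):
        self.store: dict[str, dict] = {}  # client_message_id -> stored message

    def receive(self, client_message_id: str, content: str) -> str:
        if client_message_id in self.store:
            # Re-ACK so the retrying client stops, but store nothing new.
            return "ack (duplicate ignored)"
        self.store[client_message_id] = {"content": content}
        return "ack"

server = MessageServer()
print(server.receive("uuid-1", "Hello"))  # ack
print(server.receive("uuid-1", "Hello"))  # ack (duplicate ignored) -- a retry
print(len(server.store))                  # 1
```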
There is one more critical rule for reliable messaging: never acknowledge a message until it is persisted to durable storage.
If the server crashes between receiving a message and persisting it, the message is lost. By only sending ACK after persistence, we guarantee that any acknowledged message is safely stored.
Networks don't guarantee ordering. If User A sends "Hello" then "How are you?", network conditions might deliver them in reverse order.
The solution involves multiple mechanisms:
When the client receives a message, it doesn't immediately display it. Instead, it inserts it into the correct position based on sequence number, ensuring messages always appear in order regardless of arrival order.
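Client-side ordered insertion can be sketched in a few lines. Here, sequence numbers are assumed to be assigned per conversation by the server:

```python
import bisect

class Conversation:
    """Client-side view that inserts each arriving message at the position its
    sequence number dictates, regardless of network arrival order."""

    def __init__(self):
        self.seqs: list[int] = []
        self.messages: list[str] = []

    def on_receive(self, seq: int, text: str) -> None:
        pos = bisect.bisect_left(self.seqs, seq)  # find the correct slot
        self.seqs.insert(pos, seq)
        self.messages.insert(pos, text)

chat = Conversation()
chat.on_receive(2, "How are you?")  # arrives first due to network reordering
chat.on_receive(1, "Hello")
print(chat.messages)  # ['Hello', 'How are you?']
```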
The green "online" dot and "last seen at 3:45 PM" text seem like simple features. But think about what they require at scale: tracking 50 million concurrent users, notifying their contacts when status changes, and doing it all without overwhelming the system.
This is a classic trade-off between accuracy and efficiency. Perfect real-time presence would require broadcasting every status change to potentially hundreds of contacts, generating massive network traffic. We need a smarter approach.
The core challenges with presence are:
The practical solution is a combination of heartbeats for tracking and lazy queries for display.
The mechanism is elegantly simple:
On every heartbeat, the chat server refreshes a Redis entry such as presence:user_123 = online with a TTL of 30 seconds. If the heartbeats stop, the key simply expires and the user is considered offline.

The 30-second TTL is a deliberate choice. It means users appear offline within 30 seconds of actually going offline, which is acceptable for casual chat. If you needed faster detection (for a stock trading app, say), you could reduce the TTL and heartbeat interval, at the cost of more traffic.
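A sketch of the heartbeat/TTL mechanic, with explicit timestamps instead of a wall clock so the expiry behavior is easy to see (Redis would do the expiring for us):

```python
class PresenceService:
    """Heartbeat-refreshed presence with a TTL, as Redis would provide with
    SET presence:<user> online EX 30. A dict stands in for Redis here."""

    TTL = 30.0  # seconds; tune together with the heartbeat interval

    def __init__(self):
        self._expiry: dict[str, float] = {}
        self._last_seen: dict[str, float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        # Each heartbeat pushes the expiry forward and records activity.
        self._expiry[user_id] = now + self.TTL
        self._last_seen[user_id] = now

    def is_online(self, user_id: str, now: float) -> bool:
        return self._expiry.get(user_id, 0.0) > now

presence = PresenceService()
presence.heartbeat("user_a", now=0.0)
print(presence.is_online("user_a", now=10.0))  # True  (heartbeat still fresh)
print(presence.is_online("user_a", now=35.0))  # False (TTL expired, appears offline)
```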
The naive approach, broadcasting presence changes to all contacts, doesn't scale. If a user has 500 contacts, and 10% of users change presence every minute, the fanout traffic explodes.
Solution: Lazy Presence Queries
Instead of broadcasting, query presence only when needed:
When User A opens a chat with User B:
This drastically reduces presence traffic. We only track presence for users the client is actively viewing.
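The lazy-query idea can be sketched as follows: a single presence lookup when a chat opens, plus a narrow subscription that lasts only while the chat is on screen. The class and method names here are illustrative assumptions, not a real API.

```python
class LazyPresence:
    """Presence is queried on demand, never broadcast to full contact lists."""
    def __init__(self):
        self.online = set()   # maintained by the heartbeat mechanism (not shown)
        self.watchers = {}    # target_user -> set of users currently viewing them

    def open_chat(self, viewer, target):
        # One point query when the chat opens, plus a narrow live subscription.
        self.watchers.setdefault(target, set()).add(viewer)
        return target in self.online

    def close_chat(self, viewer, target):
        # The subscription ends as soon as the chat leaves the screen.
        self.watchers.get(target, set()).discard(viewer)

    def notify_on_change(self, user):
        # Only active viewers get notified, not the user's entire contact list.
        return self.watchers.get(user, set())

p = LazyPresence()
p.online.add("user_b")
assert p.open_chat("user_a", "user_b") is True
assert p.notify_on_change("user_b") == {"user_a"}
```

Compare the fanout: a status change reaches a handful of active viewers instead of hundreds of contacts.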
Instead of binary online/offline, many apps show "last seen at [time]":
The server updates the user's last_seen timestamp on every meaningful user action.

This provides useful information without the complexity of real-time presence. WhatsApp uses this approach, only showing "online" status for users you're actively chatting with.
Modern users expect their messages on every device: phone, tablet, laptop, web browser. When they read a message on their phone, it should show as read on their laptop too. This is multi-device sync.
When a message arrives for User A, we need to:
The best approach combines real-time push with catch-up pull:
When User A has multiple devices connected, the Session Service tracks all of them:
When a message arrives:
When a device comes online after being offline:
The device sends its last sync timestamp when connecting. The server fetches all messages since then and delivers them in bulk. This ensures no messages are ever missed, regardless of how long the device was offline.
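The catch-up pull described above can be sketched as a timestamp-based query against the user's message log. Field and method names here are illustrative; a production system would page through results and use server-assigned, monotonic timestamps or sequence numbers.

```python
class SyncServer:
    """Catch-up pull: a reconnecting device sends its last sync timestamp
    and receives everything it missed, in bulk."""
    def __init__(self):
        self.inbox = []  # append-only list of (server_timestamp, message)

    def store(self, ts, message):
        self.inbox.append((ts, message))

    def catch_up(self, last_sync_ts):
        # Every message persisted after the device's last successful sync.
        return [msg for ts, msg in self.inbox if ts > last_sync_ts]

server = SyncServer()
server.store(100, "msg-1")
server.store(200, "msg-2")
server.store(300, "msg-3")
# This device last synced at t=150, so it missed msg-2 and msg-3.
assert server.catch_up(150) == ["msg-2", "msg-3"]
```

Because the server's log is the source of truth, the same query works whether the device was offline for two seconds or two weeks.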
When User A reads a message on their phone:
All of User A's devices see the same read status. The sender (User B) also gets notified that the message was read.
Chat servers are fundamentally different from typical web servers. While a stateless API server can be scaled by simply adding more instances behind a load balancer, chat servers hold state: the WebSocket connections themselves.
Each connection represents a user, and that user's messages must be routed to their specific server. This stateful nature creates unique scaling challenges.
Let's walk through how to handle them.
A well-tuned server with proper kernel configuration can handle 50,000 to 100,000 concurrent WebSocket connections. The limits come from file descriptors, memory, and CPU for processing messages.
For our target of 50 million concurrent users, we need:
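The fleet-size arithmetic implied by these numbers is worth making explicit. Using the per-server capacity range stated above:

```python
# Back-of-the-envelope server count, using the figures from the text.
concurrent_users = 50_000_000
conns_per_server = 50_000  # conservative end of the 50k-100k range

servers_needed = concurrent_users // conns_per_server
assert servers_needed == 1_000   # ~1,000 servers at 50k connections each

# At the optimistic end (100k connections per server):
assert concurrent_users // 100_000 == 500

# In practice, plan for headroom: server failures and reconnect storms
# mean the surviving servers must absorb displaced connections.
```

So the honest answer is "on the order of 500 to 1,000 chat servers, plus headroom", not a single precise number.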
Unlike stateless HTTP servers, WebSocket connections are inherently stateful. Once User A connects to Chat Server 1, all their messages must route through that server until they disconnect.
Load balancer configuration options:
What happens when a chat server crashes? With 50,000 users per server, a crash is a significant event.
The recovery flow:
The key insight is that message persistence (in the database and queue) is separate from connection state. Even if a server crashes mid-delivery, the message is safe and will be delivered on reconnection.
Production systems need regular maintenance: OS patches, code deployments, hardware replacements. Graceful shutdown minimizes user impact:
Most clients will reconnect to other servers during the drain period, making the maintenance nearly invisible to users.
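The drain sequence can be sketched as: stop accepting new connections, then ask existing clients to reconnect elsewhere. The method names below are illustrative, not a real framework API; in practice the "please reconnect" signal would be a protocol-level frame (similar in spirit to HTTP/2's GOAWAY).

```python
class ChatServer:
    """Graceful shutdown sketch for a stateful connection server."""
    def __init__(self):
        self.accepting = True
        self.connections = set()

    def begin_drain(self):
        self.accepting = False  # load balancer stops routing new users here
        for conn in list(self.connections):
            self.ask_to_reconnect(conn)  # clients migrate to healthy servers

    def ask_to_reconnect(self, conn):
        # Stand-in for sending a reconnect hint, then closing the socket.
        self.connections.discard(conn)

server = ChatServer()
server.connections = {"conn_1", "conn_2"}
server.begin_drain()
assert server.accepting is False
assert server.connections == set()
```

Because message persistence lives outside the server (as noted above), any messages in flight during the drain are simply delivered after the client reconnects.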
End-to-end encryption (E2EE) ensures that only the sender and recipient can read messages. Even the service provider (WhatsApp, Signal, etc.) cannot decrypt message content.
Most modern messaging apps use the Signal Protocol or something similar:
The basic flow:
Benefits:
Challenges:
For an interview, it's sufficient to mention that E2EE is important for privacy and explain the high-level concept. The cryptographic details (perfect forward secrecy, double ratchet algorithm, etc.) are typically out of scope unless the interviewer specifically asks.
For WhatsApp-like messaging, what is the main benefit of using a persistent connection (e.g., long-lived TCP/WebSocket) for online users?