Overview
While answering any system design question, keep in mind that such questions can be really broad. Hence, never jump directly to the solution. It is good to discuss the use cases with the interviewer so as to grasp what they are looking for. Decide on a set of features that you are going to include in your system design.
This is also one of the aspects the interviewer is looking at. They might be evaluating
- How you are doing requirement analysis
- Whether you are able to list down all the requirements
- What questions you are asking
Also, it is important to do the system design incrementally. First list down all the features that your design will support, then discuss the design of the first feature, and later extend the design to cover the other features as well.
WhatsApp cannot work over the HTTP protocol
The thing to keep in mind while designing WhatsApp is that it maintains a connection per user. The important point to note here is that a persistent connection with the server is maintained for every user.
If you think about it, WhatsApp obviously cannot work over the plain HTTP protocol.
This is because HTTP is a client-to-server protocol. It follows a request-response architecture where the client sends a request and the server sends the response. Since the client always initiates the exchange, the server cannot push messages to the client on its own. Also, a persistent connection is not maintained.
Hence WhatsApp needs a protocol on top of TCP that allows persistent, bidirectional communication. Here are a few options
- HTTP Long Polling – The client sends a request and the server holds it open for a certain amount of time before responding (hence "long" polling). After that time, the client initiates the request again.
- Web sockets – A fully bidirectional connection in which the client can talk to the server and the server can also talk to the client. Either side can send/receive data at any time, and the connection stays open all the time.
Overall, web sockets are the best fit for our scenario. Another advantage of web sockets is the sticky session: once a particular user opens a connection to one of the instances at the server end, it stays connected to that instance, and all of that user's requests go to that particular instance only. This is what makes web sockets a good option for bidirectional, server-initiated communication.
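To make the idea concrete, below is a minimal sketch of a Session Service instance accepting persistent WebSocket connections, written with the third-party Python `websockets` library (a recent version that supports single-argument handlers). The handler name, port, and message format are assumptions for illustration only, not part of the original design.

```python
# Minimal sketch of a Session Service instance accepting WebSocket connections.
# Uses the third-party `websockets` library; names and message format are illustrative.
import asyncio
import json
import websockets

async def handle_connection(websocket):
    # In a real system the first frame would carry an auth token / user id.
    hello = json.loads(await websocket.recv())
    user_id = hello["user_id"]
    print(f"user {user_id} connected")
    try:
        async for raw in websocket:          # the server can also push at any time
            message = json.loads(raw)
            # Here the message would be forwarded to Kafka/SNS (see later sections).
            await websocket.send(json.dumps({"ack": message.get("message_id")}))
    finally:
        print(f"user {user_id} disconnected")

async def main():
    async with websockets.serve(handle_connection, "0.0.0.0", 8765):
        await asyncio.Future()               # run forever

if __name__ == "__main__":
    asyncio.run(main())
```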
Another thing to consider while doing the system design of any service is to not immediately jump to the solution. Since the application can have a lot of features, it is always good to clarify exactly which features you are going to include as part of your system design. This is very important to discuss. Another important thing is to keep the non-functional requirements in mind. Some of the non-functional requirements could be
- Single point of failures
- Scalability – Scale to millions of users
- Availability
- Low Latency or Performance
- Fault Tolerance
- Storage Estimation
- Cost Estimation
We will be targeting the below features as part of our system design
- One to one messaging
- Group messaging
- Read receipts
- Online status
- Storage of messages
- Image/Video Messages
Consistency is more important in WhatsApp than availability
With WhatsApp, availability is an important factor, but consistency is more important. Basically, it is more important for messages to be delivered, and to be received in the same order by all users, than for the system to be available.
One to one messaging
For one-to-one messaging, let's see the high-level design that will be required.
High-Level Design
On a high level, let's discuss what the overall flow will be and what services would exist.
- There will be an API gateway on which every request from all the users will land.
- There will be a Session Service. The Session Service will contain a group of instances to which users connect over web sockets. Since there is a limit on the number of web sockets you can open per machine (depending upon the load), the number of machines is decided based on the number of users.
- This Session Service will be connected to a distributed Redis cluster which will contain the information of which user is connected to which box. This information is transient (it only matters while the user is connected), hence distributed Redis is a good fit. It will be the responsibility of the Session Service to maintain the user-id to machine-id mapping (a small sketch of this mapping follows this list). This service will
  - Insert an entry when a user gets connected to a machine
  - Remove that entry when the user gets disconnected
- Other than this, the Session Service will be a dumb service in the sense that it will just accept connections and forward any request it receives onto an SNS/Kafka topic.
- The Session Service will publish a message to an SNS topic or Kafka on receiving any user activity.
- There will be a User Activity Service, a worker that listens to this SNS/Kafka topic. After receiving a message, it will talk to the distributed Redis and identify which machine the recipient user is connected to. This service will also handle the case where the recipient is offline. Once it has fetched all the information, it will send the message to another worker service, the Message Outbound Service.
- There will be a Message Outbound Service, a worker whose job is to send the outbound messages back to the user. This service will not have any kind of business logic at all. It only accepts a message that contains what to send, whom to send it to, and which machine that user is connected to. It is a very dumb service.
- We need a database that will store the messages if the recipient user is offline. We can use Mongo DB for that.
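As a rough sketch of the user-id to machine-id mapping mentioned above (using redis-py; the key naming and TTL are assumptions, not part of the original design):

```python
# Sketch of the Session Service maintaining the userId -> machineId mapping in Redis.
# Key names and the TTL are illustrative assumptions (redis-py client).
import redis

r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

def on_user_connected(user_id: str, machine_id: str) -> None:
    # Record which Session Service instance holds this user's WebSocket.
    # A TTL guards against stale entries if a box dies without cleaning up.
    r.set(f"session:{user_id}", machine_id, ex=3600)

def on_user_disconnected(user_id: str) -> None:
    r.delete(f"session:{user_id}")

def machine_for_user(user_id: str) -> str | None:
    # Used by the User Activity Service to route an outbound message.
    return r.get(f"session:{user_id}")
```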
Let's see what APIs will be needed
- Send text message
- Get all unread messages (when the user comes online after being offline for some time)
Mongo DB will have one table to store the messages. Let’s name this table Message Table.
Message Table
Below will be the fields in the message table.
- message_id
- sender_user_id
- receiver_user_id
- type – could be text, image, or video
- body – the actual message
- created
- updated
- is_received
- is_group – Whether this message is a group message or not
- group_id – It is set only if the message is a group message
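For illustration, here is roughly what a document in the Message Table could look like when inserted with pymongo. The database and collection names are assumptions; the fields follow the table above.

```python
# Illustrative shape of a document in the Message Table, inserted with pymongo.
# Database/collection names and datetime handling are assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://mongo:27017")
messages = client["whatsapp"]["messages"]

def save_offline_message(message: dict) -> None:
    now = datetime.now(timezone.utc)
    messages.insert_one({
        "message_id": message["message_id"],
        "sender_user_id": message["sender_user_id"],
        "receiver_user_id": message["receiver_user_id"],
        "type": message.get("type", "text"),    # text, image, or video
        "body": message["body"],                 # the actual message
        "created": now,
        "updated": now,
        "is_received": False,
        "is_group": message.get("is_group", False),
        "group_id": message.get("group_id"),     # set only for group messages
    })
```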
Let’s see how each feature will work
User A will get connected to one of the machines of the Session Service via web sockets. Let's say it gets connected to machine number 1. An entry will be made into distributed Redis for the User A to machine 1 mapping. Similarly, let's assume that User B is connected to machine number 3. An entry for the User B to machine 3 mapping will also be created in Redis.
- User A will send a message to User B. The message will come to the Session Service on machine 1, which will publish it to a Kafka/SNS topic.
- The User Activity Service/worker will listen for this message.
- It will check the distributed Redis to see which machine User B is connected to. Then it will publish two messages to Kafka/SNS on a different topic: one for the delivery to User B and another as the acknowledgment to User A that the message has been sent. This topic will be listened to by the Message Outbound Service.
- The Message Outbound Service will pick up these two messages and process them. This service is a dumb service that only knows how to forward a message to the right machine so that it gets delivered to the user. It only talks to the Session Service and nothing else.
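Putting these pieces together, here is a hedged sketch of what the User Activity worker loop could look like. The topic names, message shape, and libraries (kafka-python and redis-py) are all illustrative assumptions.

```python
# Sketch of the User Activity Service: consume inbound messages, look up the
# recipient's machine in Redis, and fan out delivery + acknowledgment events.
import json
import redis
from kafka import KafkaConsumer, KafkaProducer

r = redis.Redis(host="redis-cluster", decode_responses=True)
consumer = KafkaConsumer("inbound-messages", bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for record in consumer:
    msg = record.value
    machine_id = r.get(f"session:{msg['receiver_user_id']}")
    if machine_id is None:
        # Recipient is offline: persist the message in Mongo DB instead
        # (handled in the offline flow discussed later).
        continue
    # Delivery event for User B and an acknowledgment back to User A.
    producer.send("outbound-messages", {"deliver_to": msg["receiver_user_id"],
                                        "machine_id": machine_id, "payload": msg})
    sender_machine = r.get(f"session:{msg['sender_user_id']}")
    producer.send("outbound-messages", {"deliver_to": msg["sender_user_id"],
                                        "machine_id": sender_machine,
                                        "payload": {"ack": msg["message_id"]}})
```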
Below is the architecture diagram for the same. The diagram represents the flow for sending a message from User A to User B.
Let’s understand the above diagram.
Flow for User A
- User A makes a call to the API gateway.
- User Authentication happens with Token Service
- User A is connected to Box number 1
- An entry is made into Redis for userId-machineID mapping.
- The message that is sent to User B is published on the topic.
- It is picked up by the User Activity Service/worker.
- It checks Redis to know which machine User B is currently connected to. If User B is offline then Mongo DB comes into the picture; we will discuss this later.
- After knowing which machine User B is connected to, it publishes a message to the topic for the Outbound Message Service/worker.
- The Outbound Message Service/worker picks this message from the topic. The message contains the details of what message to send, whom to send it to, and which machine the user is connected to. This is a very dumb service that doesn't have any business logic at all.
- It makes an API call on machine 3 to which User B is connected.
- The message is then sent to User B via web sockets.
Flow for User B
- User B makes a call to the API gateway
- User Authentication happens with Token Service
- User B is connected to Box number 3
There are still some open questions that need to be answered
- How read receipts are going to work
- What if User B is offline
- How the ordering of messages is ensured at User B's end. Meaning that if User A has sent two messages M1 and M2 in that order, then User B should also receive them in the same order, i.e., User B should be shown M1 first and then M2.
- What if one of the machines to which a user is connected goes down
- What about race conditions? We will discuss an example of a race condition as well.
- What if User A is offline
Let’s discuss each of these points one by one
How read receipts are going to work
Once User B receives the message, an acknowledgment will be sent back again through machine 3, then to the User Activity Service, then to the Message Outbound Service, and then via machine 1 to User A. The same flow will happen when User B reads the message. That is how read receipts are going to work.
What if User B is offline
WhatsApp stores the message for up to 30 days in case the user is offline. So in this case WhatsApp will store the messages in its database, which is Mongo DB. When the user comes online and the connection is first established, below will be the flow.
- User B will be connected to, let's say, machine 7. An entry will be created in distributed Redis. A message will then be published on a topic.
- The User Activity Service will pick up this message. It will then check what messages exist in the database for User B.
- It will then send those messages to the Message Outbound Service, which will send them to machine 7 and then to User B.
- Once User B receives all those messages, an acknowledgment will be sent to all the senders that the message has been received. The same flow we discussed above will be followed.
- Once User B reads all those messages, an acknowledgment will be sent to all the senders that the message has been read. After this acknowledgment, the message will be deleted from the DB, possibly via another service, a Cleanup Service.
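A small sketch of the "deliver pending messages on reconnect" step, assuming the Message Table fields described earlier and a MongoDB collection and Kafka producer injected from the surrounding service (all names are illustrative):

```python
# Sketch of the "user came back online" path: fetch undelivered messages for the
# user and hand them to the outbound flow. Topic and field names are illustrative.
def deliver_pending_messages(messages_collection, producer,
                             user_id: str, machine_id: str) -> None:
    pending = messages_collection.find({"receiver_user_id": user_id,
                                        "is_received": False})
    for msg in pending:
        producer.send("outbound-messages", {"deliver_to": user_id,
                                            "machine_id": machine_id,
                                            "payload": msg["body"]})
    # Once delivery acknowledgments come back, is_received is set to True and the
    # Cleanup Service can later remove read messages from the DB.
```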
How the ordering of messages is ensured at User B's end
At the server end, message processing is stateless, which means that each message is processed independently of other messages. Below is one idea that we can use to ensure ordering.
- Every message will have a parent message-id. The parent message-id will be the id of the message that comes just before the current message in the order. The WA client will generate this message id as a UUID.
- Every message will be delivered to the WA client. It will be the WA client's responsibility to show the messages in order. For example, let's say there are three messages
- M1 with parent message-id as null, assuming this is the first message that is ever sent.
- M2 with parent message-id as M1's
- M3 with parent message-id as M2's
- All three messages will be processed individually in a stateless manner and sent to the WA client. If the WA client receives a message but has not yet received the message whose id equals the current message's parent message-id, it will wait for that message to arrive instead of showing the current one. For example, let's say the WA client receives M1 and M3 but not M2. Then it will only show M1 to the user and wait for M2 to arrive. How does it know that it has to wait for M2? Because M3's parent message-id is M2's id, so it knows there is one message missing.
- What if M2 doesn't arrive within time? In this case, the WA client on User B's side can ask for the message to be resent from User A.
This is one idea of how the ordering can be ensured.
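Below is a minimal sketch of how the WA client could implement this hold-back logic using the parent message-id. The class and field names are illustrative, not WhatsApp's actual client code.

```python
# Sketch of the client-side ordering logic: a message is shown only once the
# message it points to (its parent) has been shown. Names are illustrative.
class MessageBuffer:
    def __init__(self):
        self.shown_ids = {None}        # the very first message has parent None
        self.waiting = {}              # parent_message_id -> message held back

    def on_message(self, message: dict) -> list[dict]:
        """Returns the messages that can now be displayed, in order."""
        parent = message["parent_message_id"]
        if parent not in self.shown_ids:
            self.waiting[parent] = message   # hold back until the parent arrives
            return []
        displayable = [message]
        self.shown_ids.add(message["message_id"])
        # Showing this message may unblock messages that were waiting on it.
        while displayable[-1]["message_id"] in self.waiting:
            nxt = self.waiting.pop(displayable[-1]["message_id"])
            self.shown_ids.add(nxt["message_id"])
            displayable.append(nxt)
        return displayable
```

With the M1/M2/M3 example above, feeding M1 then M3 shows only M1 (M3 is held back); when M2 finally arrives, `on_message` returns both M2 and M3 in order.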
What if one of the machines to which a user is connected goes down
Let's consider a scenario where one of the instances to which a particular user is connected is terminated. It could be terminated due to autoscaling/downscaling or due to a maintenance activity. In this case, the user will make a connection again and will probably get connected to some other machine. Once the connection is made, the Redis entry will be updated to reflect the new machine-id.
It might be the case that several messages that were supposed to be sent to that user failed in between. But since we would have a retry mechanism everywhere, with some delay and jitter, all the messages will be correctly sent via the new connection once the user comes back online.
What about race conditions
For example, User A sends a message to User B. At that moment User B was offline, so the message was to be saved in the DB. But before the message could be saved, User B came online and fetched all pending messages from the DB, so it missed this latest message from User A.
To prevent this, the client can, at some interval, fetch all messages in the database that are still in an undelivered state.
What if User A is offline
If User A is offline, then the WhatsApp client on User A's device will store the message until User A comes online and it can be sent.
How online and last seen will work
For this, there will be an additional table maintained by a new service, which we can call the user_last_seen service. Below will be the fields in that table.
- user_id
- last_seen
Let's see how this table will be updated. Imagine there are two users, A and B.
For User A
- This table will be updated for User A on any user activity, which is any activity initiated by the user. On any such activity, the client sends it to the Session Service, which publishes a message on the SNS/Kafka topic. The user_last_seen service will also subscribe to this topic via a queue, and for every user activity it will update the last_seen field in the table.
- There could be a case where the user is online but is not doing any activity. In that case, a heartbeat message will be sent and this table will be updated.
- Do note that this table will not be updated by any non-user activity. A non-user activity could be, for example, when the user's WhatsApp is not open and it is fetching messages in the background.
For User B
- User B wants to get the online status of User A. It will send a request for the same. The request will come to the user_last_seen service.
- We can have a threshold. If the last_seen field of a user is within the last 2 seconds, the service will report the status as online, and User B will see User A as online.
- If the last_seen field is more than 2 seconds in the past, then it will report the online status as false and will also return the last_seen timestamp. This timestamp will be shown to User B.
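A tiny sketch of that threshold check (the 2-second value comes from the text above; the table access is simplified to an in-memory dict for illustration):

```python
# Sketch of the user_last_seen service's status check.
from datetime import datetime, timedelta, timezone

ONLINE_THRESHOLD = timedelta(seconds=2)

def get_status(last_seen_table: dict, user_id: str) -> dict:
    last_seen = last_seen_table.get(user_id)            # last_seen column per user
    if last_seen is None:
        return {"online": False, "last_seen": None}
    if datetime.now(timezone.utc) - last_seen <= ONLINE_THRESHOLD:
        return {"online": True}
    return {"online": False, "last_seen": last_seen.isoformat()}
```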
Group Messaging
Let's see how group messaging will work. For that, we can have a Group Service, which will again be a worker. It will use the below database tables.
Group Table
It will contain below fields
- group_id
- group_title
- created
- updated
- group_image_id
GroupId-UserId Mapping
It will contain below fields
- group_id
- user_id
- is_admin
An important thing to note about this GroupId-UserId mapping is that it will be sharded on group_id, so that the call goes to only one of the shards when fetching all the user ids belonging to a particular group_id.
We will have below APIs
- Group Create
- Group Member Add
- Group Member Delete
- Group Member Admin
- Group Member Remove
- Group Title Update
- Group Image Update
- Group message send
Let’s see how a group message will be sent.
- A group has four users. A, B, C, and D
- User A wants to send the message to the entire group
- It calls the send group message API and also sends the group id and the message. The message reaches the machine to which the user is connected. The machine publishes the message to a topic.
- It is picked up by the Group Service. It fetches all group members from the group table. For each group member, it fans out the message again to a different topic (a sketch of this fan-out step is shown below).
- All these fanned-out messages are then picked up by the Message Outbound Service/worker, processed, and sent to each of the group members, just like in the one-to-one flow.
- When any of the group members receives the message, an acknowledgment is sent back to the sender of the message.
- If any of the group members reads the message, then again an acknowledgment is sent back to the sender of the message.
Note that if any of the group members is offline, then an entry is made into the message table in Mongo DB for that member. That is why we have the group_id and is_group fields in the message table.
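Here is a rough sketch of the Group Service fan-out step under these assumptions (pymongo-style collection access and a Kafka producer; the topic and field names are illustrative):

```python
# Sketch of the Group Service fan-out: look up all members of the group (from the
# GroupId-UserId mapping, sharded on group_id) and emit one message per recipient.
def fan_out_group_message(group_members_collection, producer, msg: dict) -> None:
    members = group_members_collection.find({"group_id": msg["group_id"]})
    for member in members:
        if member["user_id"] == msg["sender_user_id"]:
            continue                       # don't send the message back to the sender
        producer.send("group-message-fanout", {
            "receiver_user_id": member["user_id"],
            "sender_user_id": msg["sender_user_id"],
            "group_id": msg["group_id"],
            "is_group": True,
            "body": msg["body"],
        })
```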
Below is the architecture diagram for the same. The diagram represents the flow for sending a message from User A to User B and User C, which belong to the same WhatsApp group.
Let’s understand the above diagram.
Flow for User A
- User A makes a call to the API gateway.
- User Authentication happens with Token Service
- User A is connected to Box number 1
- An entry is made into Redis for userId-machineID mapping.
- The message that is sent to the group is published on the topic.
- It is picked up by the Group Service/worker.
- It fetches all the other members of the group from the Mongo database.
- For each member, it then checks Redis to know which machine that member is connected to. If any of the members is offline, then Mongo DB comes into the picture.
- It fans out the message for each of the other members of the group. 9.1 – It sends the message out for User B. 9.2 – It sends the message out for User C.
- Both messages are picked up by the Outbound Message Service/worker. Each message contains the details of what message to send, whom to send it to, and which machine the user is connected to.
- It makes an API call to machine 3, to which User B is connected (11.1), and an API call to machine 4, to which User C is connected (11.2).
- The message is then sent to User B and User C via web sockets.
Flow for User B
- User B makes a call to the API gateway
- User Authentication happens with Token Service
- User B is connected to Box number 3
Flow for User C
- User C makes a call to the API gateway
- User Authentication happens with Token Service
- User C is connected to Box number 4
Uploading Images or Videos
Let's see how image and video uploads would work. For image and video uploads, we can make the assumption that the original size of the image or video will not be uploaded. A low-res version of it will be created at the client's end and then uploaded. Even the low-res version of any image or video would be only a few KBs. They can be uploaded to a storage provider directly. For example, let's say that the storage provider is AWS S3; then below will be the flow.
- Let's say User A, on its WA client, wants to upload an image. The client will send a request to the server asking for a presigned URL to which it can upload the image.
- The server will respond with a pre-signed URL whose validity can be a few hours. You can read this article to get an idea of pre-signed URLs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html . Basically, it is a URL that is already signed with a token and hence can be used to upload directly to that URL without requiring any further authentication. This is also called direct upload. The server will also return the image_id here.
- The client will upload the image to that URL. It will be stored directly in S3.
- Now the client will send the image/video message to the receiver. It will also pass the image_id in the request.
- The server will now send the message to User B using the one-to-one messaging flow that we discussed above.
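As a sketch of the pre-signed URL step on the server side, using boto3 (the bucket name, key scheme, and one-hour expiry are assumptions):

```python
# Sketch of the server generating a pre-signed S3 upload URL with boto3.
import uuid
import boto3

s3 = boto3.client("s3")

def create_upload_url(bucket: str = "wa-media-lowres") -> dict:
    image_id = str(uuid.uuid4())
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": f"images/{image_id}"},
        ExpiresIn=3600,                    # valid for one hour
    )
    # The client PUTs the image bytes directly to `url`; no further auth needed.
    return {"image_id": image_id, "upload_url": url}
```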
Let's consider a scenario where the same image is being uploaded by multiple people. This could very well be the case with a particular meme getting popular: it will be sent by many users. It would be a waste of storage to store a separate copy of the image every time it is sent. There is scope for optimization here.
What we can do here is calculate the digest or hash of the image at the client end. This hash or digest will be sent to the server. The server will check if this digest or hash already exists. If yes, then it will simply return the image id of that image and will not return a presigned URL. That way the client will know that the image already exists and will not upload it. It will simply use the image id to send the message.
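And a hedged sketch of the client-side dedup check, computing a SHA-256 digest and asking the server whether the image already exists (the endpoint URL and response shape are assumptions):

```python
# Sketch of the client-side dedup check before uploading an image.
import hashlib
import requests

def image_digest(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def upload_if_needed(path: str) -> str:
    digest = image_digest(path)
    resp = requests.post("https://api.example.com/media/check",
                         json={"digest": digest}).json()
    if resp.get("exists"):
        return resp["image_id"]            # reuse the existing image, skip the upload
    with open(path, "rb") as f:
        requests.put(resp["upload_url"], data=f)   # direct upload to S3
    return resp["image_id"]
```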
Let's see a diagram for the same as well. As you can see from the diagram, both User A and User B use direct upload and direct download to and from S3 (or any other storage layer). This traffic doesn't go through the API gateway, which eliminates the costly transfer of large image/video payloads through our servers.
Other common components
Other common components could be
- User Service – It holds the user profile information.
- Token/Auth Service – Management of User tokens
- SMS Service – Used for sending any kind of SMS to the user, for example an OTP
- Analytics Service – This could be used to track any kind of analytics
Non-Functional Requirement
We have discussed the system design for all the functional requirements. Now let's discuss some of the non-functional aspects.
- Scalability
- Availability
- Low Latency
- Moving closer to user location.
- Avoiding Single Point of Failures
Scalability
The first thing to consider with the above design is scalability. The scalability of each of the components in the system is very important. Here are some scalability challenges you may face and their possible resolutions.
- Each of the machines in the Session Service can hold only a limited number of connections. Hence, based upon the number of users online at a moment, the required number of machines and instances can be set up. For example, one machine has around 65,000 ports.
- Your Kafka system might not be able to take that much load. We can scale it horizontally, but only up to a limit. If that becomes a bottleneck, then depending upon the geography or user id we can have two or more such systems. Service discovery could be used to figure out which Kafka system a request needs to go to.
- A similar approach can be taken for other services as well.
- Another important factor for scalability here is that we have designed our system in such a way that none of the services is bogged down with too many things to do. There is a separation of concerns, and wherever there was too much responsibility on a service, we have broken it down.
Low latency
Messages could be sent in batches from the client. This would potentially reduce the round-trip time.
Availability
In order for the system to be highly available, it is very important to have redundancy/backups for almost all the components in the system. Here are some of the things that need to be done.
- In the case of our DB, we need to enable replication. There should be multiple slaves for each of the master shard nodes.
- For distributed Redis clusters we also need replication.
- For data redundancy, we could go multi-region as well. This would also help if one of the regions goes down.
- Disaster Recovery could also be set up
Alerting and Monitoring
Alerting and Monitoring is also a very important non-functional requirement. We should monitor each of our services and set up proper alerts as well. Some of the things that could be monitored are
- API Response Time
- Memory Consumption
- CPU Consumption
- Disk Space Consumption
- Queue Length
- ….
Moving closer to user location
There are a couple of architectures that could be followed here. One such is the Cell Architecture. You can read more about cell architecture here – https://github.com/wso2/reference-architecture/blob/master/reference-architecture-cell-based.md
Avoiding Single Point of Failures
A single point of failure is that part of a system which, if it stops working, would cause the entire system to fail. We should try to prevent any single point of failure in our design. Through redundancy and going multi-region we can prevent such failures.
Conclusion
This was all about the WhatsApp system design. Hope you liked this article. Please share feedback in the comments.