My Solution for Design Facebook Messenger with Score: 8/10

by zinan

System requirements


Functional:

  1. Send message - text, image
  2. Add/Remove a friend
  3. Supports both one-on-one chats and group chats - Create/Delete a group, invite your friends
  4. Online indicator
  5. Support mobile and web platforms



Non-Functional:

  1. High performance. Message delivery latency is the most critical metric, as no one likes waiting a long time for a message their friend has already sent.
  2. Highly available. The entire system should be highly available, and the user's experience should not be affected even when some component fails (e.g. a backend service goes down).
  3. Highly scalable. The service must scale horizontally to handle high traffic and avoid any single point of failure (SPOF).


Capacity estimation

  1. DAU: 500 million
  2. Each user generates 2 KB of text messages per day on average
  3. Each user may generate around 1 MB of media data (like images or videos) each day.
  4. Concurrent users: at peak times we can expect around 20% of DAU to be online, so approximately 100 million users could be active in chat at the same time.

Store messages for 1 year: 500 million users * (2 KB + 1,000 KB) per day * 365 days ≈ 500 TB per day, or roughly 183,000 TB ≈ 183 PB per year
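
A quick back-of-envelope check of that figure, using only the estimates listed above:

```python
# Back-of-envelope storage estimate for 1 year of messages.
DAU = 500_000_000            # daily active users
TEXT_PER_USER_KB = 2         # text per user per day
MEDIA_PER_USER_KB = 1_000    # ~1 MB of media per user per day
DAYS = 365

daily_kb = DAU * (TEXT_PER_USER_KB + MEDIA_PER_USER_KB)
yearly_pb = daily_kb * DAYS / 1_000 ** 4   # KB -> PB (decimal units)

print(f"~{daily_kb / 1_000 ** 3:.0f} TB per day, ~{yearly_pb:.0f} PB per year")
# -> ~501 TB per day, ~183 PB per year
```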



API design

Relational database

Post /user/v1/auth

Params userId, password/verification code

Return token, expiryTime


Post /user/v1/qrCode/generate

Return qrCodeId


Post /user/v1/qrCode/scan

Params qrCodeId, userId

Return token


Post /user/v1/chat/message/send

Params token, senderId, recipientId, content, groupId (if in a group), type (text, image, etc.)
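
For illustration, a hypothetical client call to the send-message endpoint above; the host name and the response fields (messageId, status) are assumptions, not part of the API definition:

```python
import requests

# Hypothetical call to the send-message endpoint sketched above.
resp = requests.post(
    "https://chat.example.com/user/v1/chat/message/send",
    json={
        "token": "eyJhbGciOi...",      # auth token from /user/v1/auth
        "senderId": "user_123",
        "recipientId": "user_456",
        "groupId": None,               # set only for group chats
        "type": "text",                # text, image, etc.
        "content": "hello!",
    },
    timeout=5,
)
print(resp.status_code, resp.json())   # e.g. 200 {"messageId": "...", "status": "SENT"}
```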


Post /user/v1/friend/{event} - add/remove

Params token, userId, friendId


Post /user/v1/group/create

Params userId, groupName, createTime, memberIds, description


Presence Cache

key: user_connection_state:{user_id}:status

value: online/offline


key: user_server_connection:{user_id}

value: Server A
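
A minimal sketch of how a chat server could maintain these two keys (using redis-py; the TTL-as-heartbeat idea and the cache host name are my assumptions):

```python
import redis

r = redis.Redis(host="presence-cache", port=6379)

def on_connect(user_id: str, server_id: str, ttl_seconds: int = 60) -> None:
    # Mark the user online and remember which chat server holds the socket.
    # A short TTL acts as a heartbeat: if the server dies, the keys expire.
    r.set(f"user_connection_state:{user_id}:status", "online", ex=ttl_seconds)
    r.set(f"user_server_connection:{user_id}", server_id, ex=ttl_seconds)

def on_disconnect(user_id: str) -> None:
    r.set(f"user_connection_state:{user_id}:status", "offline")
    r.delete(f"user_server_connection:{user_id}")
```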


Database design

I prefer using a relational database to store user and group data, as it can easily manage the relationships.


For storing raw images and videos I would use object storage such as Amazon S3.


We can also store the message metadata, text content, and image/video URLs in a NoSQL database like Cassandra, as it offers efficient writes.
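
One possible table layout for that message store, sketched with the DataStax Python driver; the table name, columns, and partitioning by conversation are assumptions rather than part of the design above:

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-node-1"]).connect("chat")

# Messages are partitioned by conversation (1:1 pair or group) and ordered by
# a time-sortable message_id (e.g. a Snowflake ID), newest first.
session.execute("""
    CREATE TABLE IF NOT EXISTS messages_by_conversation (
        conversation_id text,
        message_id      bigint,
        sender_id       text,
        type            text,      -- text, image, video
        content         text,      -- text body or object-storage URL
        created_at      timestamp,
        PRIMARY KEY ((conversation_id), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC)
""")
```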


For the persistent queue we can utilize Kafka, as it persists messages to disk and is very durable, so we don't have to worry about message loss.


High-level design

Shown as a diagram.





Request flows

1.Authentication flow:

User sends a token request -> API gateway forwards the request to the load balancer -> API service takes the request and generates a token based on the user's identity -> token is returned to the user
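
One way the API service could issue and validate that token, sketched with PyJWT; the claim names, HS256, and the shared secret are assumptions for illustration:

```python
import time
import jwt  # PyJWT

SECRET = "replace-with-a-real-secret"   # assumption: shared HMAC secret

def issue_token(user_id: str, ttl_seconds: int = 3600) -> tuple[str, int]:
    expiry = int(time.time()) + ttl_seconds
    token = jwt.encode({"sub": user_id, "exp": expiry}, SECRET, algorithm="HS256")
    return token, expiry               # matches the token/expiryTime response above

def validate_token(token: str) -> str:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on bad tokens.
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    return claims["sub"]
```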


2.Message sending flow (individual):

User A sends a text message and an image to user B -> the image is uploaded to object storage through a pre-signed URL issued by the API service -> the text message is sent to the API gateway -> the rate limiter checks the limit and forwards the request to the load balancer if it passes -> the API service validates the token first -> if the token is valid, the chat service performs the WebSocket handshake for user A -> once the connection is established, fetch user B's online status and which server user B is connected to


There are two situations after that:

If user B is online, send a request to Service Discovery to get the IP address or hostname of user B's server, connect to that server, and deliver the message to user B over its WebSocket connection.


If user B is offline, the message is sent to the notification service and the persistent queue so that user B can receive a notification. The persistence service takes the message from the queue and persists it into the non-relational database asynchronously, since some messages carry media and can be large. Once user B comes online, the client fetches the missed messages from the database.
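
A rough sketch of that online/offline branching on the sending chat server; the four injected callables are placeholders for the presence/discovery lookup, the server-to-server push, the persistent queue, and the notification service:

```python
from typing import Callable, Optional, Tuple

def route_message(
    msg: dict,
    lookup_presence: Callable[[str], Tuple[str, Optional[str]]],
    forward_to_server: Callable[[str, dict], None],
    enqueue: Callable[[str, dict], None],
    notify: Callable[[str, dict], None],
) -> None:
    """Sketch of the delivery branch described above; all helpers are placeholders."""
    recipient = msg["recipientId"]
    status, server = lookup_presence(recipient)   # e.g. ("online", "server-c")

    enqueue("chat-messages", msg)                 # always persisted asynchronously

    if status == "online" and server:
        forward_to_server(server, msg)            # recipient's server pushes over WebSocket
    else:
        notify(recipient, msg)                    # push notification; fetched once back online
```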


3.Message sending flow (group):

The message sending flow is the same as the individual one.


All online members of the group receive the message, with their chat servers consuming it from the persistent queue, and acknowledge it.


All offline members receive a push notification for the message.


Detailed component design

1.Service Discovery

To maintain the status of all registered chat servers and to find the target server for a given user, we need Service Discovery to manage the chat servers.


For example:

Server A queries: "Where is User B?"

Discovery Service answers: "User B is on Server C"

Server A sends the message directly to Server C

Server C delivers to User B over WebSocket
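
A minimal sketch of that lookup-and-forward step, assuming the discovery data lives in Redis under the same user_server_connection key as the presence cache, and that chat servers expose a hypothetical /internal/deliver endpoint for server-to-server pushes:

```python
import redis
import requests

discovery = redis.Redis(host="service-discovery", port=6379)

def deliver_via_discovery(user_id: str, msg: dict) -> bool:
    """'Where is User B?' -> forward the message to the server that holds B's socket."""
    server = discovery.get(f"user_server_connection:{user_id}")
    if server is None:
        return False                               # user offline; fall back to queue + notification
    requests.post(f"http://{server.decode()}/internal/deliver", json=msg, timeout=2)
    return True
```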


2.ID Generator

We have to make sure message IDs are unique and ordered in sequence; it does not make sense for a message sent later to appear before one sent earlier.


For distributed systems, we can use something like Snowflake to get a globally unique, roughly time-ordered 64-bit ID.
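
A simplified Snowflake-style generator to illustrate the idea; the 41/10/12 bit split follows the common Snowflake convention, and the custom epoch is an arbitrary choice:

```python
import threading
import time

class SnowflakeIdGenerator:
    """64-bit IDs: 41 bits of milliseconds since a custom epoch, 10 bits of
    worker id, 12 bits of per-millisecond sequence. IDs sort by send time."""
    EPOCH_MS = 1_577_836_800_000  # 2020-01-01, arbitrary custom epoch

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
                if self.sequence == 0:                        # sequence exhausted,
                    while now_ms <= self.last_ms:             # wait for the next millisecond
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence
```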


3.Message synchronization with multiple devices

Each device maintains its own WebSocket connection, so when user A sends a message to chat server A, it is delivered not only to the recipient (user B) but also fanned out to user A's other devices.
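
A sketch of that multi-device fan-out, assuming each device registers its own (device_id, server) entry in the presence cache; both helpers are placeholders:

```python
from typing import Callable, Iterable, Tuple

def fan_out_to_devices(
    msg: dict,
    list_device_servers: Callable[[str], Iterable[Tuple[str, str]]],
    forward_to_server: Callable[[str, dict], None],
) -> None:
    """Deliver to every device of the recipient AND every other device of the
    sender, so all of the sender's clients stay in sync."""
    for device_id, server in list_device_servers(msg["recipientId"]):
        forward_to_server(server, msg)

    for device_id, server in list_device_servers(msg["senderId"]):
        if device_id != msg.get("senderDeviceId"):     # skip the device that sent it
            forward_to_server(server, msg)
```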



Trade offs/Tech choices

1.Relational database or Non-relational database

For the chat service I used both, because a relational database handles relationships such as friends and groups very well.

Additionally, users write messages very frequently while chatting, so a write-optimized non-relational database such as Cassandra fits this situation.


2.Store session in backend service or external Cache?

For a small number of users we can leverage sticky sessions provided by the load balancer to handle session info properly. However, for a chat service with such high traffic, sticky sessions would become a bottleneck.

So I decided to use a global cache like Redis to store session data, so that it is shared across all services.


Failure scenarios/bottlenecks

1.When the connection is established, we return a response to user A while storing the session data into the cache asynchronously. What if user A receives the response but the session fails to save into the cache?

If cache save fails, we won't allow the WebSocket session to continue.


2.Persisting the message into the database is a time-consuming job and might become a bottleneck if traffic is high.

We can decouple the service and the database by sending the message to a persistent queue; it does not matter whether the message is persisted into the database in real time, so this step can be executed asynchronously.
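
A minimal sketch of that decoupling with the kafka-python client; the topic name and broker address are assumptions:

```python
import json
from typing import Callable
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda m: json.dumps(m).encode(),
)

def persist_async(msg: dict) -> None:
    # Chat-service side: hand the message off to the queue and return immediately.
    producer.send("chat-messages", msg)

def run_persistence_service(write_to_db: Callable[[dict], None]) -> None:
    # Persistence-service side: consume at its own pace and write to the database.
    consumer = KafkaConsumer(
        "chat-messages",
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda b: json.loads(b),
        group_id="persistence-service",
    )
    for record in consumer:
        write_to_db(record.value)    # e.g. an INSERT into the Cassandra messages table
```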


3.Connection dropped due to a server crash or bad network

The chat service maintains reconnection logic such as exponential backoff.
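
A sketch of client-side reconnection with capped exponential backoff; the base delay, cap, and jitter range are illustrative choices:

```python
import random
import time
from typing import Callable

def reconnect_with_backoff(connect: Callable[[], None],
                           base: float = 1.0, cap: float = 30.0,
                           max_attempts: int = 10) -> bool:
    """Call connect() until it succeeds, sleeping base * 2^attempt seconds
    (with jitter, capped at `cap`) between attempts. connect is a placeholder
    for re-establishing the WebSocket."""
    for attempt in range(max_attempts):
        try:
            connect()
            return True
        except ConnectionError:
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
    return False
```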


4.TCP, HTTP, or WebSocket?

HTTP is stateless and request/response based, so it is not a good fit for chat: the server cannot push messages to the client, and the client would have to poll or open a new request every time.

Raw TCP can be used since it is a full-duplex connection. However, it is a bit barebones: we would have to build our own message framing and encoding/decoding and choose a proper RPC protocol to transfer messages efficiently, which adds complexity to the system.

I prefer WebSocket, as it gives us the same full-duplex channel over a single connection but with framing and an HTTP upgrade handshake already packaged, so it is easy to use.
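
A tiny example of that choice in practice using the `websockets` library; the endpoint URL and message shape are assumptions:

```python
import asyncio
import json
import websockets  # pip install websockets

async def chat_client(token: str) -> None:
    # Single full-duplex connection: send and receive on the same socket.
    uri = f"wss://chat.example.com/ws?token={token}"   # hypothetical endpoint
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"recipientId": "user_456", "type": "text",
                                  "content": "hello!"}))
        incoming = await ws.recv()                     # server pushes messages here
        print("received:", incoming)

# asyncio.run(chat_client("eyJhbGciOi..."))
```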



Future improvements

1.Build a monitoring system to collect and visualize metrics for the critical path, such as Redis hit rate, CPU usage, memory usage, DAU, etc.

2.Move time-consuming jobs off the main request path and execute them asynchronously to improve latency.

3.Make CDN nodes geographically distributed so that users in APAC do not have to connect to a CDN node located in EMEA.