My Solution for Design Facebook Messenger with Score: 9/10
by luminous_vibe
System requirements
Functional:
- messaging
- receive
- send
- search
- delete
- react
- list of all chats
- online presence
- read receipts
- notifications
Out of scope:
- authentication
- social network features like add friend
Non-Functional:
- Real time - it should update my chat screen in real time
- Scalable - able to handle 1 billion users chatting
- Reliable - we don't want to lose any messages that have already been sent
- Available - we will aim for high availability; five nines (99.999%) is about 5 minutes of downtime per year
- Consistency - it's OK for messages to be delivered slightly late, but no more than ~250ms; the client can reorder them on arrival
- Fault tolerant - users should still be able to chat if some database nodes go down or if a session gets dropped
Capacity estimation
1 billion DAU - assume they're all chatting, average 10 messages per day
10B messages ~ 10e9 msgs / 86.4e3 s ~ (10/86.4)e6 ~ 0.116e6 ~ 116e3 ~ ~120K TPS
50e3 websockets per server, 1e9 concurrent users ~ 1e9 / 50e3 ~ 20e3 ~ 20K servers
10e9 messages per day * 1KB per message ~ 10e9 KB ~ 10e6 MB ~ 10e3 GB ~ 10 TB per day
300 TB per month ~ 3,600 TB per year ~ 3.6 PB per year ~ 36 PB over 10 years
~100 PB for messages alone once we add ~3x redundancy
metadata = 1e9 users * 1KB metadata ~ 1e9 KB ~ 1 TB of metadata
indexing:
1e9 users ~ 8B ~ 8e9B ~ 8e6KB ~ 8e3MB ~ 8GB for user index
messages ~ 10e9 per day * 365 * 10 years ~ 3.65e13 ~ 4e13 entries * 8B ~ 3.2e14 B
~ 320e3 GB ~ 320 TB just to index the messages
1 PB to be generous
~ altogether that is roughly 100 PB for messages (with redundancy) plus ~1 TB of metadata and ~1 PB of index, so on the order of 100 PB total
~ provisioning ~200 PB gives us a comfortable safety margin (the quick script below reproduces these numbers)
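As a sanity check, here is a quick script that reproduces the numbers above; all inputs are the assumptions already stated (1B DAU, 10 messages/day, 1 KB/message, 50K sockets per server, 8 B index entries):

```python
# Back-of-the-envelope capacity check for the figures above.
DAU = 1_000_000_000          # 1 billion daily active users
MSGS_PER_USER_PER_DAY = 10
MSG_SIZE_BYTES = 1_000       # ~1 KB per message
SOCKETS_PER_SERVER = 50_000
INDEX_ENTRY_BYTES = 8
YEARS = 10

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY                          # 10e9
tps = msgs_per_day / 86_400                                         # ~116K messages/sec
ws_servers = DAU / SOCKETS_PER_SERVER                               # ~20K servers if everyone connects at once
storage_per_day_tb = msgs_per_day * MSG_SIZE_BYTES / 1e12           # ~10 TB/day
storage_10y_pb = storage_per_day_tb * 365 * YEARS / 1e3             # ~36.5 PB raw, ~100 PB with 3x replication
index_tb = msgs_per_day * 365 * YEARS * INDEX_ENTRY_BYTES / 1e12    # ~292 TB (the ~320 TB above rounds 3.65e13 up to 4e13)

print(f"{tps:,.0f} msg/s, {ws_servers:,.0f} websocket servers")
print(f"{storage_per_day_tb:.0f} TB/day, {storage_10y_pb:.0f} PB over {YEARS} years (raw)")
print(f"{index_tb:.0f} TB of message index")
```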
API design
- SendMessage
- Params:
- api_key
- user_id
- other_user_id
- content
- timestamp
- Response
- OK 200
- Generic error
- GetStatus
- Params:
- api_key
- user_id
- other_user_id - array
- timestamp
- Response
- ok 2xx - JSON body: [{ "friendid1": "online" }]
- GetChats
- Params:
- api_key
- user_id
- timestamp
- Response
- ok 2xx
- json body: [{ "id": "friendid1", "info": { "status": "online" }, "preview": "lorem" }]
- SearchChats
- Params
- api_key
- user_id
- search_terms
- timestamp
- Response
- ok 2xx
- json body: [{ "id": "friendid1", "info": { "status": "online" }, "messageid": "1231s", "preview": "lorem" }]
- Error codes that these may return:
- 4xx errors we have to care about:
- timeout
- subcodes:
- 15xx - server-side timeout: the request reached the server but the response never made it back to the client
- 14xx - client-side timeout: the request never reached the server
- rate limited
- message too big
- not authenticated (forbidden, etc)
- 5xx server issues
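To make the contract concrete, here is a minimal client-side sketch of SendMessage; the `/v1/messages` path, host name, and exact field names are illustrative assumptions rather than a fixed spec:

```python
import requests  # illustrative HTTP client; any library works

# Hypothetical SendMessage payload matching the params listed above.
payload = {
    "api_key": "abc123",
    "user_id": "user_42",
    "other_user_id": "user_99",
    "content": "hey, are you around?",
    "timestamp": 1717000000,            # client send time, epoch seconds
}

resp = requests.post("https://chat.example.com/v1/messages", json=payload, timeout=5)

if resp.status_code == 200:
    pass                                # 200 OK: message accepted
elif 400 <= resp.status_code < 500:
    error = resp.json()                 # e.g. {"code": 429, "subcode": 1401, "msg": "rate limited"}
    print("client-side error:", error)  # 14xx/15xx subcodes distinguish client vs server timeouts
else:
    print("server error, retry with backoff:", resp.status_code)
```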
Database design
We will use a relational DB for user metadata.
We will use a NoSQL DB to store messages.
UserDB
- columns
- createdAt
- userId
- userName
- password (salted and hashed)
- location
- primary key: userId
- we can use that as partition key as well
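A minimal sketch of this table (sqlite3 is only a stand-in for whichever relational engine we pick; the column types are assumptions):

```python
import sqlite3  # stand-in for the real relational DB; the schema is what matters here

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id     TEXT PRIMARY KEY,   -- also usable as the partition key when sharded
        user_name   TEXT NOT NULL,
        password    TEXT NOT NULL,      -- store a salted hash, never the raw password
        location    TEXT,
        created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
```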
Messages
- columns
- createdAt
- fromId - foreign key
- toId - foreign key
- content
- isRead
- partition key should be the conversation (fromId_toId), so all messages for a chat land on the same partition (partitioning by createdAt alone would create hot partitions)
- sort/clustering key is createdAt, so messages come back in time order (see the table sketch below)
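A minimal sketch of this table using the DataStax Cassandra Python driver; the keyspace and table names are assumptions, and the key layout is the important part:

```python
# Sketch of the messages table (assumes a "chat" keyspace already exists).
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])          # seed node of the message cluster
session = cluster.connect("chat")

session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        chat_id     text,        -- fromId_toId, normalized so both directions map to the same id
        created_at  timeuuid,    -- time-ordered, also breaks ties for same-millisecond messages
        from_id     text,
        to_id       text,
        content     text,
        is_read     boolean,
        PRIMARY KEY ((chat_id), created_at)   -- partition by chat, cluster by time
    ) WITH CLUSTERING ORDER BY (created_at DESC)
""")
```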
Presence
- In a fast KV db we can store this information:
- userId_state: away, online, offline
- userId_lastSeen: date
UserServerDB
- In a fast KV db we store information on which server a user is connected to
- user_id : server1
- this way we can figure out where to send the incoming message to
- when server1 gets the message it figures out which websocket the user is connected to
- server1 then sends the message over the websocket to the user
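A minimal sketch of both lookups in Redis (the key names and TTLs are assumptions):

```python
# Presence + user->server lookups in a fast KV store (redis-py).
import redis

kv = redis.Redis(host="presence.internal", port=6379)

# Presence: state plus last-seen timestamp, with a TTL so stale entries expire.
kv.set("presence:user_42:state", "online", ex=60)
kv.set("presence:user_42:last_seen", "2024-06-01T12:00:00Z")

# UserServerDB: which websocket server user_42 is currently connected to.
kv.set("conn:user_42", "ws-server-17", ex=300)

# The ChatService uses this mapping to route an incoming message:
target_server = kv.get("conn:user_99")   # e.g. b"ws-server-03", or None if offline
```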
High-level design
Please refer to the HLD
Request flows
Please refer to the HLD
Detailed component design
Message Storage
- When a message comes in from one user to be sent to another user, we will save this message in our system for posterity
- What we do first is put this message in a distributed message queue so that we can decouple our chat service from the job of storing this message
- we will choose Kafka for our queue
- We will have a bunch of workers/servers that listen to the queue and save the message to the database
- The message cluster db is responsible for figuring out where to save this message within the cluster
- this DB is a Cassandra cluster running several nodes
- Cassandra has consistent hashing built in, so it is highly available and can absorb a very high write rate
- the partition key is the chat id (id_otherId) with createdAt as the clustering/sort key, matching the message schema above; a storage-worker sketch follows below
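A minimal sketch of such a storage worker, assuming a hypothetical `chat-messages` Kafka topic and the message schema above (topic, host names, and field names are assumptions):

```python
# Storage worker: pull messages off Kafka and persist them to Cassandra.
import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "chat-messages",
    bootstrap_servers=["kafka1:9092"],
    group_id="message-storage-workers",          # workers in the same group share partitions
    value_deserializer=lambda raw: json.loads(raw),
)

session = Cluster(["10.0.0.1"]).connect("chat")
insert = session.prepare(
    "INSERT INTO messages (chat_id, created_at, from_id, to_id, content, is_read) "
    "VALUES (?, now(), ?, ?, ?, false)"
)

for record in consumer:
    msg = record.value
    # Same chat_id regardless of who sent the message, so both directions share a partition.
    chat_id = "_".join(sorted([msg["from_id"], msg["to_id"]]))
    session.execute(insert, (chat_id, msg["from_id"], msg["to_id"], msg["content"]))
```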
Sending a Message
- We will use websockets to send and receive messages between users
- when a user comes online they do a handshake and are assigned a server, to which they establish a websocket connection
- this pairing is saved in the UserServerDB, so when a message arrives for that user, the ChatService knows which server to forward it to
- to scale message delivery we use a fanout technique
- the message goes onto a queue, and workers pick it up and push it to the server the target user is connected to (a delivery-worker sketch follows below)
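A minimal sketch of a delivery worker, assuming the same hypothetical `chat-messages` topic, the Redis `conn:` mapping above, and a hypothetical internal `/internal/deliver` endpoint on each websocket server:

```python
# Delivery worker: find which websocket server holds the recipient's connection
# and forward the message there.
import json
import redis
import requests
from kafka import KafkaConsumer

kv = redis.Redis(host="presence.internal", port=6379)
consumer = KafkaConsumer(
    "chat-messages",
    bootstrap_servers=["kafka1:9092"],
    group_id="message-delivery-workers",   # separate group, so storage and delivery each see every message
    value_deserializer=lambda raw: json.loads(raw),
)

for record in consumer:
    msg = record.value
    server = kv.get(f"conn:{msg['to_id']}")
    if server is None:
        continue                            # recipient offline: leave it for notifications / next sync
    # Push to the websocket server that owns the recipient's connection.
    requests.post(f"http://{server.decode()}/internal/deliver", json=msg, timeout=2)
```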
Trade offs/Tech choices
choice for queue
- we are choosing kafka for the queue
- we will be doing a pull based approach for our system
- this is so we can effectively decouple the systems from each other
- we are not doing a push based approach; letting workers pull messages whenever they have capacity gives us natural backpressure
choice for db
- we are using cassandra for our message db
- this is because cassandra can handle a large amount of writes due to how it is implemented
- it is implemented using an LSM tree with a Write-Ahead-Log based approach
- cassandra can also be deployed in a distributed manner easily
- since it uses a consistent hashing approach, it is highly available and very reliable
choice for websockets
- we choose websockets over long polling or short polling
- this is so that we can have real time updates on the client side
- the other choices were short polling, long polling, SSE.
- SSE is not really a good choice for us since it's one-way (server to client) communication
- short polling will create too much traffic
- long polling is the 2nd best choice
- each check still requires a new request
- long polling is older tech, so more devices support it
- low tech, can handle slower connections
- websockets win over these options since keeping the connection open uses fewer resources (long polling needs to re-establish a connection every time we check for new data)
Failure scenarios/bottlenecks
- number of servers dedicated to websocket connections
- since hundreds of millions of users can be connected at once and one server can handle around 50 thousand websocket connections, we need ~20 servers per 1 million concurrent users; at 500 million concurrent that's 20 * 500 ~ 10,000 servers just to hold persistent websocket connections to clients
- fanout from the ChatService to these servers is a bottleneck as well: the ChatService needs to look up where to push each message, and the Kafka queue has to sustain this throughput
- the number of workers that pull from the queue and forward messages to the right server is another number we have to watch
- the message DB cluster also needs to be built robustly so we don't lose any messages
- since we use Cassandra here, we should deploy enough nodes in the hash ring to avoid being overrun by the traffic
- we also replicate messages across a few backup nodes so that we can recover gracefully
- We need to handle reconnection as well
- when a user is disconnected from the service, we need to gracefully reconnect them to their proper server (a client-side backoff sketch follows below)
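A minimal client-side reconnection sketch with exponential backoff and jitter; the `websockets` library and the handshake URL are assumptions:

```python
# Client-side reconnection with exponential backoff.
import asyncio
import random
import websockets

async def stay_connected(user_id: str):
    backoff = 1
    while True:
        try:
            # The handshake assigns us a server; here we assume it is reachable via one URL.
            uri = f"wss://chat.example.com/connect?user_id={user_id}"
            async with websockets.connect(uri) as ws:
                backoff = 1                       # reset once we're connected again
                async for message in ws:
                    print("received:", message)   # hand off to the app's message handler
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(backoff + random.random())  # jitter to avoid thundering herds
            backoff = min(backoff * 2, 60)        # cap the backoff at one minute

# asyncio.run(stay_connected("user_42"))
```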
Future improvements
- multi-region and multi-datacenter deployments. This routes users to a datacenter near them. Cross-region chats become more complicated, but it is doable since we already track which server each user's websocket is connected to
- using Redis to manage sessions so that we can figure out if users are online
- make message sending more robust, for example handling the recipient going offline/online mid-delivery
- make reconnection more robust, for example if we already have a session (checked in Redis) we reconnect to the same server so we don't have to do the full handshake again
- have proper dashboards, notifications, logging for the whole system
- we will have to capture metrics from all the components so we can judge the health of the system and set up alerts when things get unhealthy. Examples of things to monitor:
- queues filling up faster than the workers can dequeue the jobs
- CPU usage on every worker and server so we can see how busy each one is
- cache misses so we can improve our caching strategies
- latencies on end-to-end transactions/api calls
- error codes and subcodes
- API tracing
- build tools to trace network requests so we can visualize the flow and see where the error came from
- distributed logs
- set up a system to collect logs from different servers and funnel them into a system that parses them
- this way developers can search through them when debugging issues
- example tools for this are Datadog or Azure Monitor logs