My Solution for Design Facebook Messenger with Score: 9/10
by luminous_vibe
System requirements
Functional:
- messaging
- receive
- send
- search
- delete
- react
- list of all chats
- online presence
- read receipts
- notifications
Out of scope:
- authentication
- social network features like add friend
Non-Functional:
- Real time - it should update my chat screen in real time
- Scalable - able to handle 1 billion users chatting
- Reliable - we don't want to lose any messages that have already been sent
- Available - we will aim for high availability; five nines (99.999%) is about 5 minutes of downtime per year
- Consistency - it's OK for messages to be delivered slightly late, but no more than ~250ms; the client can reorder them on arrival
- Fault tolerant - users should still be able to chat if some database nodes go down or if a session gets dropped
Capacity estimation
1 billion DAU - assume they're all chatting, average 10 messages per day
10B messages ~ 10e9 msgs / 86.4e3 s ~ (10/86.4)e6 ~ 0.116e6 ~ 116e3 ~ ~120K TPS
50e3 websockets per server, 1e9 concurrent users ~ 1e9 / 50e3 ~ 20e3 ~ 20K servers
10e9 messages per day * 1KB per message ~ 10e9 KB ~ 10e6 MB ~ 10e3 GB ~ 10 TB per day
300 TB per month ~ 3,600 TB per year ~ 3.6 PB per year ~ 36 PB over 10 years
~100 PB for messages alone once we add ~3x redundancy
metadata = 1e9 users * 1KB metadata ~ 1e9 KB ~ 1 TB of metadata
indexing:
1e9 users ~ 8B ~ 8e9B ~ 8e6KB ~ 8e3MB ~ 8GB for user index
messages ~ 10e9 per day * 365 * 10 years ~ 3.65e13 ~ 4e13 entries * 8B ~ 3.2e14 B
~ 320e3 GB ~ 320 TB just to index the messages
1 PB to be generous
~ altogether that is roughly 100 PB for messages (with redundancy) plus ~1 TB of metadata and ~1 PB of index, so on the order of 100 PB total
~ provisioning ~200 PB gives us a comfortable safety margin (the quick script below reproduces these numbers)
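As a sanity check, here is a quick script that reproduces the numbers above; all inputs are the assumptions already stated (1B DAU, 10 messages/day, 1 KB/message, 50K sockets per server, 8 B index entries):

```python
# Back-of-the-envelope capacity check for the figures above.
DAU = 1_000_000_000          # 1 billion daily active users
MSGS_PER_USER_PER_DAY = 10
MSG_SIZE_BYTES = 1_000       # ~1 KB per message
SOCKETS_PER_SERVER = 50_000
INDEX_ENTRY_BYTES = 8
YEARS = 10

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY                          # 10e9
tps = msgs_per_day / 86_400                                         # ~116K messages/sec
ws_servers = DAU / SOCKETS_PER_SERVER                               # ~20K servers if everyone connects at once
storage_per_day_tb = msgs_per_day * MSG_SIZE_BYTES / 1e12           # ~10 TB/day
storage_10y_pb = storage_per_day_tb * 365 * YEARS / 1e3             # ~36.5 PB raw, ~100 PB with 3x replication
index_tb = msgs_per_day * 365 * YEARS * INDEX_ENTRY_BYTES / 1e12    # ~292 TB (the ~320 TB above rounds 3.65e13 up to 4e13)

print(f"{tps:,.0f} msg/s, {ws_servers:,.0f} websocket servers")
print(f"{storage_per_day_tb:.0f} TB/day, {storage_10y_pb:.0f} PB over {YEARS} years (raw)")
print(f"{index_tb:.0f} TB of message index")
```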
API design
- SendMessage
- Params:
- api_key
- user_id
- other_user_id
- content
- timestamp
- Response
- OK 200
- Generic error
- GetStatus
- Params:
- api_key
- user_id
- other_user_id - array
- timestamp
- Response
- ok 2xx - JSON body: [{ "friendid1": "online" }]
- GetChats
- Params:
- api_key
- user_id
- timestamp
- Response
- ok 2xx
- json body: [{ "id": "friendid1", "info": { "status": "online" }, "preview": "lorem" }]
- SearchChats
- Params
- api_key
- user_id
- search_terms
- timestamp
- Response
- ok 2xx
- json body: [{ "id": "friendid1", "info": { "status": "online" }, "messageid": "1231s", "preview": "lorem" }]
- Error codes that these may return:
- 4xx errors we have to care about:
- timeout
- subcodes:
- 15xx - server-side timeout: the request reached the server but the response never made it back to the client
- 14xx - client-side timeout: the request never reached the server
- rate limited
- message too big
- not authenticated (forbidden, etc)
- 5xx server issues
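To make the contract concrete, here is a minimal client-side sketch of SendMessage; the `/v1/messages` path, host name, and exact field names are illustrative assumptions rather than a fixed spec:

```python
import requests  # illustrative HTTP client; any library works

# Hypothetical SendMessage payload matching the params listed above.
payload = {
    "api_key": "abc123",
    "user_id": "user_42",
    "other_user_id": "user_99",
    "content": "hey, are you around?",
    "timestamp": 1717000000,            # client send time, epoch seconds
}

resp = requests.post("https://chat.example.com/v1/messages", json=payload, timeout=5)

if resp.status_code == 200:
    pass                                # 200 OK: message accepted
elif 400 <= resp.status_code < 500:
    error = resp.json()                 # e.g. {"code": 429, "subcode": 1401, "msg": "rate limited"}
    print("client-side error:", error)  # 14xx/15xx subcodes distinguish client vs server timeouts
else:
    print("server error, retry with backoff:", resp.status_code)
```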
Database design
We will use a relational DB for user metadata.
We will use a NoSQL DB to store messages.
UserDB
- columns
- createdAt
- userId
- userName
- password (salted and hashed)
- location
- primary key: userId
- we can use that as partition key as well
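A minimal sketch of this table (sqlite3 is only a stand-in for whichever relational engine we pick; the column types are assumptions):

```python
import sqlite3  # stand-in for the real relational DB; the schema is what matters here

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id     TEXT PRIMARY KEY,   -- also usable as the partition key when sharded
        user_name   TEXT NOT NULL,
        password    TEXT NOT NULL,      -- store a salted hash, never the raw password
        location    TEXT,
        created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
```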
Messages
- columns
- createdAt
- fromId - foreign key
- toId - foreign key
- content
- isRead
- partition key should be the conversation (fromId_toId), so all messages for a chat land on the same partition (partitioning by createdAt alone would create hot partitions)
- sort/clustering key is createdAt, so messages come back in time order (see the table sketch below)
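A minimal sketch of this table using the DataStax Cassandra Python driver; the keyspace and table names are assumptions, and the key layout is the important part:

```python
# Sketch of the messages table (assumes a "chat" keyspace already exists).
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])          # seed node of the message cluster
session = cluster.connect("chat")

session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        chat_id     text,        -- fromId_toId, normalized so both directions map to the same id
        created_at  timeuuid,    -- time-ordered, also breaks ties for same-millisecond messages
        from_id     text,
        to_id       text,
        content     text,
        is_read     boolean,
        PRIMARY KEY ((chat_id), created_at)   -- partition by chat, cluster by time
    ) WITH CLUSTERING ORDER BY (created_at DESC)
""")
```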
Presence
- In a fast KV db we can store this information:
- userId_state: away, online, offline
- userId_lastSeen: date
UserServerDB
- In a fast KV db we store information on which server a user is connected to
- user_id : server1
- this way we can figure out where to send the incoming message to
- when server1 gets the message it figures out which websocket the user is connected to
- server1 then sends the message over the websocket to the user
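A minimal sketch of both lookups in Redis (the key names and TTLs are assumptions):

```python
# Presence + user->server lookups in a fast KV store (redis-py).
import redis

kv = redis.Redis(host="presence.internal", port=6379)

# Presence: state plus last-seen timestamp, with a TTL so stale entries expire.
kv.set("presence:user_42:state", "online", ex=60)
kv.set("presence:user_42:last_seen", "2024-06-01T12:00:00Z")

# UserServerDB: which websocket server user_42 is currently connected to.
kv.set("conn:user_42", "ws-server-17", ex=300)

# The ChatService uses this mapping to route an incoming message:
target_server = kv.get("conn:user_99")   # e.g. b"ws-server-03", or None if offline
```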
High-level design
Please refer to the HLD
Request flows
Please refer to the HLD
Detailed component design
Message Storage
- When a message comes in from one user to be sent to another user, we will save this message in our system for posterity
- What we do first is put this message in a distributed message queue so that we can decouple our chat service from the job of storing this message
- we will choose Kafka for our queue
- We will have a bunch of workers/servers that listen to the queue and save the message to the database
- The message cluster db is responsible for figuring out where to save this message within the cluster
- this DB is a Cassandra cluster running several nodes
- Cassandra has consistent hashing built in, so it is highly available and can absorb a very high write rate
- the partition key is the chat id (id_otherId) with createdAt as the clustering/sort key, matching the message schema above; a storage-worker sketch follows below
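A minimal sketch of such a storage worker, assuming a hypothetical `chat-messages` Kafka topic and the message schema above (topic, host names, and field names are assumptions):

```python
# Storage worker: pull messages off Kafka and persist them to Cassandra.
import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "chat-messages",
    bootstrap_servers=["kafka1:9092"],
    group_id="message-storage-workers",          # workers in the same group share partitions
    value_deserializer=lambda raw: json.loads(raw),
)

session = Cluster(["10.0.0.1"]).connect("chat")
insert = session.prepare(
    "INSERT INTO messages (chat_id, created_at, from_id, to_id, content, is_read) "
    "VALUES (?, now(), ?, ?, ?, false)"
)

for record in consumer:
    msg = record.value
    # Same chat_id regardless of who sent the message, so both directions share a partition.
    chat_id = "_".join(sorted([msg["from_id"], msg["to_id"]]))
    session.execute(insert, (chat_id, msg["from_id"], msg["to_id"], msg["content"]))
```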
Sending a Message
- We will use websockets to send and receive messages between users
- when a user comes online they do a handshake and are assigned a server, to which they establish a websocket connection
- this pairing is saved in the UserServerDB, so when a message arrives for that user, the ChatService knows which server to forward it to
- to scale message delivery we use a fanout technique
- the message goes onto a queue, and workers pick it up and push it to the server the target user is connected to (a delivery-worker sketch follows below)
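A minimal sketch of a delivery worker, assuming the same hypothetical `chat-messages` topic, the Redis `conn:` mapping above, and a hypothetical internal `/internal/deliver` endpoint on each websocket server:

```python
# Delivery worker: find which websocket server holds the recipient's connection
# and forward the message there.
import json
import redis
import requests
from kafka import KafkaConsumer

kv = redis.Redis(host="presence.internal", port=6379)
consumer = KafkaConsumer(
    "chat-messages",
    bootstrap_servers=["kafka1:9092"],
    group_id="message-delivery-workers",   # separate group, so storage and delivery each see every message
    value_deserializer=lambda raw: json.loads(raw),
)

for record in consumer:
    msg = record.value
    server = kv.get(f"conn:{msg['to_id']}")
    if server is None:
        continue                            # recipient offline: leave it for notifications / next sync
    # Push to the websocket server that owns the recipient's connection.
    requests.post(f"http://{server.decode()}/internal/deliver", json=msg, timeout=2)
```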
Trade offs/Tech choices
choice for queue
- we are choosing kafka for the queue
- we will be doing a pull based approach for our system
- this is so we can effectively decouple the systems from each other
- we are not doing a push based approach; letting workers pull messages whenever they have capacity gives us natural backpressure
choice for db
- we are using cassandra for our message db
- this is because cassandra can handle a large amount of writes due to how it is implemented
- it is implemented using an LSM tree with a Write-Ahead-Log based approach
- cassandra can also be deployed in a distributed manner easily
- since it uses a consistent hashing approach, it is highly available and very reliable
choice for websockets
- we choose websockets over long polling or short polling
- this is so that we can have real time updates on the client side
- the other choices were short polling, long polling, SSE.
- SSE is not really a good choice for us since it's one-way (server to client) communication
- short polling will create too much traffic
- long polling is the 2nd best choice
- each check still requires a new request
- long polling is older tech, so more devices support it
- low tech, can handle slower connections
- websockets win over these options since keeping the connection open uses fewer resources (long polling needs to re-establish a connection every time we check for new data)
Failure scenarios/bottlenecks
- number of servers dedicated to websocket connections
- since hundreds of millions of users can be connected at once and one server can handle around 50 thousand websocket connections, we need ~20 servers per 1 million concurrent users; at 500 million concurrent that's 20 * 500 ~ 10,000 servers just to hold persistent websocket connections to clients
- fanout from the ChatService to these servers is a bottleneck as well: the ChatService needs to look up where to push each message, and the Kafka queue has to sustain this throughput
- the number of workers that pull from the queue and forward messages to the right server is another number we have to watch
- the message DB cluster also needs to be built robustly so we don't lose any messages
- since we use Cassandra here, we should deploy enough nodes in the hash ring to avoid being overrun by the traffic
- we also replicate messages across a few backup nodes so that we can recover gracefully
- We need to handle reconnection as well
- when a user is disconnected from the service, we need to gracefully reconnect them to their proper server (a client-side backoff sketch follows below)
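A minimal client-side reconnection sketch with exponential backoff and jitter; the `websockets` library and the handshake URL are assumptions:

```python
# Client-side reconnection with exponential backoff.
import asyncio
import random
import websockets

async def stay_connected(user_id: str):
    backoff = 1
    while True:
        try:
            # The handshake assigns us a server; here we assume it is reachable via one URL.
            uri = f"wss://chat.example.com/connect?user_id={user_id}"
            async with websockets.connect(uri) as ws:
                backoff = 1                       # reset once we're connected again
                async for message in ws:
                    print("received:", message)   # hand off to the app's message handler
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(backoff + random.random())  # jitter to avoid thundering herds
            backoff = min(backoff * 2, 60)        # cap the backoff at one minute

# asyncio.run(stay_connected("user_42"))
```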
Future improvements
- multi-region and multi-datacenter deployments. This routes users to a datacenter near them. Cross-region chats become more complicated, but it is doable since we already track which server each user's websocket is connected to
- using Redis to manage sessions so that we can figure out if users are online
- make message sending more robust, for example handling the recipient going offline/online mid-delivery
- make reconnection more robust, for example if we already have a session (checked in Redis) we reconnect to the same server so we don't have to do the full handshake again
- have proper dashboards, notifications, logging for the whole system
- we will have to capture metrics from all the components so we can judge the health of the system and set up alerts when things get unhealthy. Examples of things to monitor:
- queues filling up faster than the workers can dequeue the jobs
- CPU usage on every worker and server so we can see how busy each one is
- cache misses so we can improve our caching strategies
- latencies on end-to-end transactions/api calls
- error codes and subcodes
- API tracing
- build tools to trace network requests so we can visualize the flow and see where the error came from
- distributed logs
- set up a system to collect logs from different servers and funnel them into a system that parses them
- this way developers can search through them when debugging issues
- example tools for this are Datadog or Azure Monitor logs