Design Twitter - System Design

System requirements

Functional:

Allows you to post tweets, which includes videos and photos
Allows you to follow other users
Allows you to load up a feed based on the most recently posted to least recently posted
Allows you to view a profile of a user that you are following and their tweets

Non-Functional:

When loading a feed, you should load at most 10 tweets at a time, more will be loaded as you request for more
System should be reliable, meaning your tweets should not go missing
System should be available, meaning that the system's should not have down time.
Your system should be reasonably fast even at times when there is high traffic

Capacity estimation

Assumptions

10 million total users
3 million active uses per day
Each user will send 10 new feed requests a day, and post 1 tweet a day (10 : 1 Read Write ratio)
5% of tweets will contain a photo / video.

Data entities

User
id (4 bytes)
username (20 bytes)
email (20 bytes)
password (20 bytes)
Tweet
id (4 bytes)
content (1KB)
user_id (4 bytes)
media_id (4 bytes)
Media
id (8 bytes)
media_url(8 bytes)

Traffic Estimations

Read requests
30 million read requests a day
Each request retrieves 10 tweets, 5% of them has a photo / video. Assuming this photo or video is 1MB
Total bytes = (4 + 1024 + 4 + 4) bytes * 30 million + (8+8+1MB) * 1.5 million = 30GB + 1.6TB (worth of photos) per day = 1.7TB a day
73GB per hour = 1250MB per min = 19MB per second
Write requests
3 million write requests a day
2MB per second

Storage Estimations

Assuming that we keep the data for 10 years
For photos and videos
1.5 million a day
600 million a year
6000 million for 10 years
6000 million MB = 6 PB
Users
10 million * (4+20+20+20) = 620 MB
Tweets
30 million * (4 + 1024 + 4 + 4) + 1.5 million * (8 + 8) = 30GB a day = 107 TB for 10 years

Cache Estimations

We definitely want to cache the 20% most popular tweets a day. Assuming we dont want to cache the media, only the meta data
30 million requests * 20% = 6 million requests
6 million tweets = 6 million * (4 + 4 + 4 + 1024) + 300 thousand * (8+8) = 6GB + 5MB = 6GB

API design

get_feed(user_id) : retrieves the 10 most recent tweets based on the user following
get_user_feed(user_id, target_id) : checks whether the user is following the target user, if so, retrieves the 10 most recent tweets from this target user
follow(user_id, target_id) : request for user to follow target user
post_tweet(user_id, content, media=None) posts a tweet, media is defaulted to None

Database design

Data entities

User
id (4 bytes)
username (20 bytes)
email (20 bytes)
password (20 bytes)
Tweet
id (4 bytes)
content (1KB)
user_id (4 bytes)
media_id (4 bytes)
Media
id (8 bytes)
media_url(8 bytes)
Following
User_id
following_id

The entities above are highly relational, tweets are related to users, and then tweets are related to media and users are related to following. Hence a relational SQL DB like MySQL or PostgresSQL will likely be needed to support this functionality. For photos and videos, we will use a cloud storage like Google Cloud to stores these entities.

In terms of cache, we will use a memcache or redis to support this. We will use the LRU algorithm for data eviction.

High-level design

Client servers - serves as a connection between users and backend servers
Backend servers - Execute workflows, retrieving data from our DBs or writing data into our DBs
SQL DBs - for our data
Cloud Storage - for photos and media
Cache - to cache the 20% most popular tweets
Load balancers - between the client and backend servers, and backend serves and DBs

Request flows

User can request to load their feed by calling the get_feed API, the backend servers will go to the database to retrieve their following list, then from the following list, retrieve the 10 most recent tweets
Users can request a target's feed. the backend servers will go to the database to check whether the user is following the target, if so, retrieve the 10 most recent tweets from the target
Users can follow other users, simply go to the DB and insert into the followings table
Post tweet: Backened servers will add the tweet into DB, and take note of the time as well.

Detailed component design

The biggest issue here is how can we load the 10 most recent tweets for a user in the best way possible?

A naive solution would be to use a SQL query to retrieve a list of tweets based on the user's following and sort them by time. This works, but it is terribly slow for users following a large number of users. How can we do better?

We can do an offline generation of a user's feed. Basically we can have dedicated servers that are continuously generating users’ newsfeed and storing them in memory. So, whenever a user requests for the new posts for their feed, we can simply serve it from the pre-generated, stored location. Using this scheme, user’s newsfeed is not compiled on load, but rather on a regular basis and returned to users whenever they request for it.

Whenever these servers need to generate the feed for a user, they will first query to see what was the last time the feed was generated for that user. Then, new feed data would be generated from that time onwards.

We can store these tweets in a Linked List fashion where a tweet leads to another feed.

We should do this generation for users that are online since majority of our users will be offline at any given time, we do not want to waste resources on the offline users. When the user comes online, the system will then start, users would have to wait a few seconds before the feed is generated for the first time but that is a trade off that we are willing to make.

users can continually make new request for feeds by scrolling, as the user reaches the bottom of the page, then the next 10 tweets are requested

In terms of data partitioning, we could potentially shard the data in terms of userID, but that would mean that for DBs that contain 'hot' users, the DB will be overloaded when users request for the tweets of the 'hot' user. What we could do is to use consistent hashing to distribute our tweets evenly among the DBs. this way the load is distributed across the DBs

Trade offs/Tech choices

The bad thing about our current data partitioning strategy is that the user's tweets are located in different DBs meaning that we could potentially be searching across all DBs to retrieve a person's tweets. But this is a risk that we are willing to make a tthe moment

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

Instead of showing tweets on feed based on time, we could show tweets based on importance or popularity, meaning that we could come up with an algorithm to determine what tweets would interest the target user the most

We could also show other tweets of accounts that are not followed by the user if our algorithm determines that this would be of interest to the target user. Of course, this would mean that there will be privacy settings as well.