System requirements
Functional:
- Allows you to post tweets, which includes videos and photos
- Allows you to follow other users
- Allows you to load up a feed based on the most recently posted to least recently posted
- Allows you to view a profile of a user that you are following and their tweets
Non-Functional:
- When loading a feed, you should load at most 10 tweets at a time, more will be loaded as you request for more
- System should be reliable, meaning your tweets should not go missing
- System should be available, meaning that the system's should not have down time.
- Your system should be reasonably fast even at times when there is high traffic
Capacity estimation
Assumptions
- 10 million total users
- 3 million active uses per day
- Each user will send 10 new feed requests a day, and post 1 tweet a day (10 : 1 Read Write ratio)
- 5% of tweets will contain a photo / video.
Data entities
- User
- id (4 bytes)
- username (20 bytes)
- email (20 bytes)
- password (20 bytes)
- Tweet
- id (4 bytes)
- content (1KB)
- user_id (4 bytes)
- media_id (4 bytes)
- Media
- id (8 bytes)
- media_url(8 bytes)
Traffic Estimations
- Read requests
- 30 million read requests a day
- Each request retrieves 10 tweets, 5% of them has a photo / video. Assuming this photo or video is 1MB
- Total bytes = (4 + 1024 + 4 + 4) bytes * 30 million + (8+8+1MB) * 1.5 million = 30GB + 1.6TB (worth of photos) per day = 1.7TB a day
- 73GB per hour = 1250MB per min = 19MB per second
- Write requests
- 3 million write requests a day
- 2MB per second
Storage Estimations
- Assuming that we keep the data for 10 years
- For photos and videos
- 1.5 million a day
- 600 million a year
- 6000 million for 10 years
- 6000 million MB = 6 PB
- Users
- 10 million * (4+20+20+20) = 620 MB
- Tweets
- 30 million * (4 + 1024 + 4 + 4) + 1.5 million * (8 + 8) = 30GB a day = 107 TB for 10 years
Cache Estimations
- We definitely want to cache the 20% most popular tweets a day. Assuming we dont want to cache the media, only the meta data
- 30 million requests * 20% = 6 million requests
- 6 million tweets = 6 million * (4 + 4 + 4 + 1024) + 300 thousand * (8+8) = 6GB + 5MB = 6GB
API design
- get_feed(user_id) : retrieves the 10 most recent tweets based on the user following
- get_user_feed(user_id, target_id) : checks whether the user is following the target user, if so, retrieves the 10 most recent tweets from this target user
- follow(user_id, target_id) : request for user to follow target user
- post_tweet(user_id, content, media=None) posts a tweet, media is defaulted to None
Database design
Data entities
- User
- id (4 bytes)
- username (20 bytes)
- email (20 bytes)
- password (20 bytes)
- Tweet
- id (4 bytes)
- content (1KB)
- user_id (4 bytes)
- media_id (4 bytes)
- Media
- id (8 bytes)
- media_url(8 bytes)
- Following
- User_id
- following_id
The entities above are highly relational, tweets are related to users, and then tweets are related to media and users are related to following. Hence a relational SQL DB like MySQL or PostgresSQL will likely be needed to support this functionality. For photos and videos, we will use a cloud storage like Google Cloud to stores these entities.
In terms of cache, we will use a memcache or redis to support this. We will use the LRU algorithm for data eviction.
High-level design
- Client servers - serves as a connection between users and backend servers
- Backend servers - Execute workflows, retrieving data from our DBs or writing data into our DBs
- SQL DBs - for our data
- Cloud Storage - for photos and media
- Cache - to cache the 20% most popular tweets
- Load balancers - between the client and backend servers, and backend serves and DBs
Request flows
- User can request to load their feed by calling the get_feed API, the backend servers will go to the database to retrieve their following list, then from the following list, retrieve the 10 most recent tweets
- Users can request a target's feed. the backend servers will go to the database to check whether the user is following the target, if so, retrieve the 10 most recent tweets from the target
- Users can follow other users, simply go to the DB and insert into the followings table
- Post tweet: Backened servers will add the tweet into DB, and take note of the time as well.
Detailed component design
The biggest issue here is how can we load the 10 most recent tweets for a user in the best way possible?
A naive solution would be to use a SQL query to retrieve a list of tweets based on the user's following and sort them by time. This works, but it is terribly slow for users following a large number of users. How can we do better?
We can do an offline generation of a user's feed. Basically we can have dedicated servers that are continuously generating users’ newsfeed and storing them in memory. So, whenever a user requests for the new posts for their feed, we can simply serve it from the pre-generated, stored location. Using this scheme, user’s newsfeed is not compiled on load, but rather on a regular basis and returned to users whenever they request for it.
Whenever these servers need to generate the feed for a user, they will first query to see what was the last time the feed was generated for that user. Then, new feed data would be generated from that time onwards.
We can store these tweets in a Linked List fashion where a tweet leads to another feed.
We should do this generation for users that are online since majority of our users will be offline at any given time, we do not want to waste resources on the offline users. When the user comes online, the system will then start, users would have to wait a few seconds before the feed is generated for the first time but that is a trade off that we are willing to make.
users can continually make new request for feeds by scrolling, as the user reaches the bottom of the page, then the next 10 tweets are requested
In terms of data partitioning, we could potentially shard the data in terms of userID, but that would mean that for DBs that contain 'hot' users, the DB will be overloaded when users request for the tweets of the 'hot' user. What we could do is to use consistent hashing to distribute our tweets evenly among the DBs. this way the load is distributed across the DBs
Trade offs/Tech choices
The bad thing about our current data partitioning strategy is that the user's tweets are located in different DBs meaning that we could potentially be searching across all DBs to retrieve a person's tweets. But this is a risk that we are willing to make a tthe moment
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
Instead of showing tweets on feed based on time, we could show tweets based on importance or popularity, meaning that we could come up with an algorithm to determine what tweets would interest the target user the most
We could also show other tweets of accounts that are not followed by the user if our algorithm determines that this would be of interest to the target user. Of course, this would mean that there will be privacy settings as well.