System requirements
Functional:
- Publish tweet
- See a feed of followed users' posts
- Favorite post
- List favorited posts
Non-Functional:
- Supports 500M DAU
Capacity estimation
500M users/day. Each writes 2 tweets = 1B tweets/day. Rounding a day to 100,000 seconds (10 hours x 100 min x 100 s): 1B / 10 hours = 100M/h / 100 min = 1M/min / 100 s = 10k tweets per second authored.
Each user checks the feed 10 times per day = 5B/day = 50k feed gets per sec.
Each user favorites 20 times per day = 10B/day = 100k favorites per sec.
Each user checks favorites 1 time per day = 500M/day = 5k favorites list per sec.
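The arithmetic above can be checked with a short script. This is only a sketch: it uses the same rounding as the estimate (a day treated as 100,000 seconds instead of 86,400), and all names in it are illustrative.

```python
# Back-of-the-envelope check of the per-second rates.
# Assumption: a day is rounded to 100,000 seconds (the real value is 86,400),
# matching the "10 hours x 100 min x 100 s" shortcut used in the estimate.
DAU = 500_000_000
SECONDS_PER_DAY_ROUNDED = 10 * 100 * 100  # 100,000

# Actions per user per day, taken from the estimates above.
ACTIONS_PER_USER_PER_DAY = {
    "tweets_authored": 2,
    "feed_reads": 10,
    "favorites": 20,
    "favorites_list_reads": 1,
}


def per_second(actions_per_user_per_day: int) -> int:
    """Convert a per-user daily action count into a global per-second rate."""
    return DAU * actions_per_user_per_day // SECONDS_PER_DAY_ROUNDED


rates = {name: per_second(n) for name, n in ACTIONS_PER_USER_PER_DAY.items()}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate:,} per second")
```

With the exact 86,400 seconds per day the rates come out roughly 16% higher, which does not change the order of magnitude.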
API design
POST /publishTweet
{
text: string
}
GET /feed?before=timestamp
PUT /favorite?tweetId=int
GET /favorites
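To illustrate how the before parameter of GET /feed could page through a feed, here is a minimal in-memory sketch. The Tweet class, the feed_page function, the limit parameter, and the integer timestamp unit are assumptions made for the example, not part of the API above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tweet:
    id: int
    user_id: int
    text: str
    timestamp: int  # assumed to be seconds since epoch


def feed_page(tweets: List[Tweet], before: Optional[int] = None, limit: int = 2) -> List[Tweet]:
    """Return up to `limit` tweets strictly older than `before`, newest first."""
    candidates = [t for t in tweets if before is None or t.timestamp < before]
    candidates.sort(key=lambda t: t.timestamp, reverse=True)
    return candidates[:limit]


# Five sample tweets with timestamps 100..104.
tweets = [Tweet(i, 1, f"tweet {i}", 100 + i) for i in range(5)]

first = feed_page(tweets)  # no cursor: the newest page
# The client passes the oldest timestamp it has seen to get the next page.
second = feed_page(tweets, before=first[-1].timestamp)
```

A timestamp cursor can skip tweets that share the same timestamp across a page boundary; a (timestamp, id) pair would be one possible way to avoid that.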
Database design
Tweets:
- id int
- userId int
- text string
- favoriteCount int
- timestamp time
FavoritedTweet:
- userId int
- tweetId int
Users:
- id int
UserFollow:
- followingUser int
- followedUser int
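The tables above can be sketched with Python's built-in sqlite3 module, together with the join that lists a user's favorited tweets. SQLite, the key constraints, and the integer timestamp are assumptions chosen to keep the example self-contained; the design does not name a specific database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema following the tables above; primary keys and NOT NULL are assumptions.
conn.executescript("""
CREATE TABLE Users (id INTEGER PRIMARY KEY);
CREATE TABLE Tweets (
    id INTEGER PRIMARY KEY,
    userId INTEGER NOT NULL REFERENCES Users(id),
    text TEXT NOT NULL,
    favoriteCount INTEGER NOT NULL DEFAULT 0,
    timestamp INTEGER NOT NULL
);
CREATE TABLE FavoritedTweet (
    userId INTEGER NOT NULL REFERENCES Users(id),
    tweetId INTEGER NOT NULL REFERENCES Tweets(id),
    PRIMARY KEY (userId, tweetId)
);
CREATE TABLE UserFollow (
    followingUser INTEGER NOT NULL REFERENCES Users(id),
    followedUser INTEGER NOT NULL REFERENCES Users(id),
    PRIMARY KEY (followingUser, followedUser)
);
""")

# Sample data: user 2 wrote two tweets, user 1 favorited both.
conn.executemany("INSERT INTO Users (id) VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO Tweets (id, userId, text, timestamp) VALUES (?, ?, ?, ?)",
    [(10, 2, "first", 1000), (11, 2, "second", 2000)],
)
conn.executemany(
    "INSERT INTO FavoritedTweet (userId, tweetId) VALUES (?, ?)",
    [(1, 10), (1, 11)],
)

# List favorited tweets for user 1, newest first.
favorites = conn.execute(
    """
    SELECT t.id, t.text
    FROM FavoritedTweet f
    JOIN Tweets t ON t.id = f.tweetId
    WHERE f.userId = ?
    ORDER BY t.timestamp DESC
    """,
    (1,),
).fetchall()
```

The composite primary key on FavoritedTweet keeps a user from favoriting the same tweet twice, which fits an idempotent PUT /favorite.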
High-level design
A web tier behind a load balancer handles the HTTP requests from the clients. We use a relational database to store tweets, favorites, and users, because joins are helpful for listing favorited tweets and because we may want to run different types of queries in the future. We shard the database by user and replicate it. A queue subscribes to changes on the Tweets table, and the feed service reads from the queue to assemble a feed for each user. The feed service has a feed cache so that a user's feed can be served quickly. If more tweets are requested than are in the cache, it queries the database for posts from users that the user follows.
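The queue-to-feed-cache step can be sketched in memory as follows. This is a simplified illustration, not the design's implementation: the followers map, FEED_CACHE_SIZE, and the drain function are hypothetical, and a standard-library Queue stands in for the real message queue.

```python
from collections import defaultdict, deque
from queue import Queue

FEED_CACHE_SIZE = 3  # hypothetical cap on cached tweets per user

# author id -> ids of users who follow that author (sample data)
followers = {1: [2, 3], 2: [3]}

# user id -> bounded deque of tweet ids, newest on the left
feed_cache = defaultdict(lambda: deque(maxlen=FEED_CACHE_SIZE))

tweet_queue = Queue()


def drain(queue: Queue) -> None:
    """Push every queued tweet into the feed cache of each follower."""
    while not queue.empty():
        tweet_id, author_id = queue.get()
        for follower_id in followers.get(author_id, []):
            # With maxlen set, appendleft drops the oldest entry on the right.
            feed_cache[follower_id].appendleft(tweet_id)


# (tweet id, author id) events in the order they were written
for event in [(100, 1), (101, 2), (102, 1), (103, 1)]:
    tweet_queue.put(event)

drain(tweet_queue)
```

User 3 follows both authors, so the oldest tweet (100) falls out of the bounded cache; a request that needs it would go to the database.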
Request flows
- When a tweet is authored, the web tier routes the request to the tweet service, which writes it to the database.
- When a tweet is favorited, the web tier routes the request to the tweet service, which increments the tweet's favoriteCount and adds a row to FavoritedTweet.
- When favorites are queried, the web tier routes the request to the tweet service, which queries FavoritedTweet.
- When a feed is requested, the web tier routes the request to the feed service, which checks the feed cache. If no feed is in the cache, it queries the database for posts from users that the user follows. A separate tweet cache can help reduce load for very popular tweets. It can also be used to provide some default tweets if the accounts the user follows have not tweeted recently.
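The feed request flow (cache first, database on a miss) can be sketched as below. The function names, the dict used as a cache, and the fake database loader are assumptions for illustration only.

```python
from typing import Callable, Dict, List

db_calls = 0  # counts fallbacks to the database, for illustration only


def load_feed_from_db(user_id: int, limit: int) -> List[int]:
    """Stand-in for the database query; returns fake tweet ids, newest first."""
    global db_calls
    db_calls += 1
    return list(range(1000, 1000 - limit, -1))


def get_feed(
    user_id: int,
    limit: int,
    cache: Dict[int, List[int]],
    load_from_db: Callable[[int, int], List[int]],
) -> List[int]:
    """Serve from the cache when it holds enough tweets, else query the database."""
    cached = cache.get(user_id)
    if cached is not None and len(cached) >= limit:
        return cached[:limit]
    feed = load_from_db(user_id, limit)
    cache[user_id] = feed  # refill the cache for later requests
    return feed


feed_cache: Dict[int, List[int]] = {}
first = get_feed(7, 3, feed_cache, load_feed_from_db)   # miss: goes to the database
second = get_feed(7, 2, feed_cache, load_feed_from_db)  # hit: served from the cache
third = get_feed(7, 5, feed_cache, load_feed_from_db)   # cache too short: database again
```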
Detailed component design
Assembling a feed can be complex. We want to prioritize tweets from the people the user follows, and also more recent tweets. This can be done in the SQL query: a WHERE clause (or a join on UserFollow) restricts the results to followed users, and ORDER BY on the timestamp puts recent tweets first. It can also be expensive to cache feeds for users who have not been active in the past week or month, so the feed service can skip those users when processing the queue.
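A minimal sketch of such a feed query, again using sqlite3 as a stand-in for the relational database. The exact query, the one-week ACTIVE_WINDOW_SECONDS cutoff, and the is_active helper are assumptions, not the design's definitive implementation.

```python
import sqlite3

ACTIVE_WINDOW_SECONDS = 7 * 24 * 3600  # hypothetical cutoff: one week

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tweets (id INTEGER PRIMARY KEY, userId INTEGER, text TEXT, timestamp INTEGER);
CREATE TABLE UserFollow (followingUser INTEGER, followedUser INTEGER);
INSERT INTO Tweets (id, userId, text, timestamp) VALUES
    (1, 20, 'old', 100),
    (2, 30, 'not followed', 150),
    (3, 20, 'new', 200),
    (4, 21, 'newest', 300);
INSERT INTO UserFollow (followingUser, followedUser) VALUES (10, 20), (10, 21);
""")

# Only tweets from followed users, older than the cursor, newest first.
FEED_QUERY = """
SELECT t.id
FROM Tweets t
JOIN UserFollow f ON f.followedUser = t.userId
WHERE f.followingUser = ? AND t.timestamp < ?
ORDER BY t.timestamp DESC
LIMIT ?
"""


def query_feed(user_id: int, before: int, limit: int) -> list:
    """Return tweet ids for one page of the user's feed."""
    rows = conn.execute(FEED_QUERY, (user_id, before, limit)).fetchall()
    return [row[0] for row in rows]


def is_active(last_seen: int, now: int) -> bool:
    """True if the user was seen within the window; inactive users are not cached."""
    return now - last_seen <= ACTIVE_WINDOW_SECONDS


page = query_feed(10, before=1000, limit=2)   # newest two tweets from followed users
older = query_feed(10, before=200, limit=2)   # next page, using the oldest timestamp seen
```

With the database sharded by user, the followed authors' tweets may live on several shards, so a real query would likely need to be fanned out and merged.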
Trade-offs/Tech choices
This architecture assumes that most users will not refresh their feed very frequently. If the feed service has too many cache misses, the database may get overloaded.
Failure scenarios/bottlenecks
The main bottleneck is the database behind the feed service: a high feed cache miss rate sends feed-assembly queries to the database, which may overload it.
Future improvements
We can have a separate NoSQL database for storing feeds, so that a cache miss is not as impactful.