System requirements
Functional:
- Get a feed of tweets aggregated from the users they follow, plus popular posts
- Follow and unfollow a user
- Post pictures or videos (optional)
- Search for tweets
- Log in securely
Non-Functional:
- high availability
- quick reads
- eventual consistency on posts
Capacity estimation
- assume 100 million daily active users
- assume 1 out of 10 users posts a tweet each minute: 10 million tweets per minute, or about 14.4 billion tweets a day
- 1 tweet will have at most 160 bytes of text, so we should have around 2.3 TB worth of tweet text a day
- each tweet will have another 160 bytes or so of metadata for uid, likes, retweets, etc., approximately doubling the daily storage to about 4.6 TB
- bandwidth: we would want at least 1 Mbps per user to load multiple tweets and their metadata
- if we have photos or videos, we would need blob storage, sized according to the limits we set on media posts
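The storage figure depends heavily on how the posting-rate assumption is read. Below is a minimal sketch of the arithmetic. The function names are hypothetical, and the second scenario (an average of ten tweets per active user per day) is an added assumption, included only for comparison with the per-minute reading.

```python
MINUTES_PER_DAY = 24 * 60
TWEET_BYTES = 160      # maximum tweet text size, from the estimate above
METADATA_BYTES = 160   # uid, likes, retweets, etc., from the estimate above


def daily_storage_bytes(tweets_per_day: int) -> dict:
    """Return the daily text, metadata, and total storage in bytes."""
    text = tweets_per_day * TWEET_BYTES
    metadata = tweets_per_day * METADATA_BYTES
    return {"text": text, "metadata": metadata, "total": text + metadata}


def tweets_per_day_from_minute_rate(daily_active_users: int, one_in: int) -> int:
    """Tweets per day if one in `one_in` active users posts every minute."""
    return (daily_active_users // one_in) * MINUTES_PER_DAY


if __name__ == "__main__":
    scenarios = [
        ("1 in 10 users posting every minute",
         tweets_per_day_from_minute_rate(100_000_000, 10)),
        ("10 tweets per active user per day (comparison only)",
         100_000_000 * 10),
    ]
    for label, tweets in scenarios:
        est = daily_storage_bytes(tweets)
        print(f"{label}: {tweets:,} tweets/day, "
              f"{est['text'] / 1e12:.2f} TB text, "
              f"{est['total'] / 1e12:.2f} TB with metadata")
```

The per-minute reading gives about 2.3 TB of tweet text a day, while the ten-per-day reading gives 160 GB, so it is worth pinning down the posting rate before sizing storage.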
API design
- postTweet(uid, content)
  - POST request
- getTweetFeed(uid)
  - GET request
- postTweetMedia(uid, content, fileType)
  - POST request
- searchTweet(content)
  - GET request
- getFollowers(uid)
  - GET request
- followUser(uid, followingUid)
  - PUT request
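To make the endpoint semantics concrete, here is a toy in-memory sketch of the calls listed above. It is not a real service: the class name `InMemoryTwitter`, the snake_case method names, the `limit` parameter, and the substring search are all assumptions made for illustration, and media upload is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count


@dataclass
class Tweet:
    tweet_id: int
    uid: str
    content: str


class InMemoryTwitter:
    """Toy stand-in for the service; each method maps to one endpoint."""

    def __init__(self):
        self._ids = count(1)
        self._tweets = []                    # all tweets, oldest first
        self._following = defaultdict(set)   # uid -> uids they follow

    def post_tweet(self, uid, content):      # POST
        if len(content.encode("utf-8")) > 160:
            raise ValueError("tweet exceeds 160 bytes")
        tweet = Tweet(next(self._ids), uid, content)
        self._tweets.append(tweet)
        return tweet

    def follow_user(self, uid, following_uid):  # PUT: repeating it is harmless
        self._following[uid].add(following_uid)

    def get_followers(self, uid):            # GET
        return sorted(u for u, followed in self._following.items()
                      if uid in followed)

    def get_tweet_feed(self, uid, limit=20):  # GET: newest first
        followed = self._following.get(uid, set())
        feed = [t for t in reversed(self._tweets) if t.uid in followed]
        return feed[:limit]

    def search_tweet(self, content):         # GET: naive substring match
        needle = content.lower()
        return [t for t in self._tweets if needle in t.content.lower()]
```

Using PUT for follow fits because following the same user twice leaves the state unchanged, which the set-based storage here reflects.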
Database design
- For the database we can have a followers table
  - uid | followingUid
- We can have a posts table
  - uid | tweetid | content | media link to blob storage
- Likes & retweets table
  - uid | likes | retweets
- User profile table
  - uid | email | age
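One way to try out these tables is with SQLite from the Python standard library. This is a sketch under assumptions: the column types, the snake_case column names, the composite primary key on the followers table, and the `posts_by_author` index are illustrative choices, and the likes & retweets table is left out to keep it short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    uid   INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,
    age   INTEGER
);
CREATE TABLE followers (
    uid           INTEGER NOT NULL REFERENCES users(uid),
    following_uid INTEGER NOT NULL REFERENCES users(uid),
    PRIMARY KEY (uid, following_uid)   -- lookups by uid use this key
);
CREATE TABLE posts (
    tweet_id   INTEGER PRIMARY KEY,
    uid        INTEGER NOT NULL REFERENCES users(uid),
    content    TEXT NOT NULL,
    media_link TEXT                    -- link into blob storage, if any
);
CREATE INDEX posts_by_author ON posts (uid, tweet_id);
"""


def latest_posts_from_followed(conn, uid, limit=20):
    """Return the newest posts written by the users that `uid` follows."""
    rows = conn.execute(
        """
        SELECT p.tweet_id, p.uid, p.content
        FROM posts AS p
        JOIN followers AS f ON f.following_uid = p.uid
        WHERE f.uid = ?
        ORDER BY p.tweet_id DESC
        LIMIT ?
        """,
        (uid, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES (1, 'a@example.com', 30)")
    conn.execute("INSERT INTO users VALUES (2, 'b@example.com', 25)")
    conn.execute("INSERT INTO followers VALUES (1, 2)")
    conn.execute("INSERT INTO posts (tweet_id, uid, content) VALUES (10, 2, 'hi')")
    print(latest_posts_from_followed(conn, 1))
```

The query assumes tweet ids increase over time, so ordering by `tweet_id` stands in for ordering by creation time.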
High-level design
An algorithm will determine hourly which posts have the highest engagement and load them into a cache, from which users pull them into their own feeds. For each user we will also cache the list of accounts they follow, backed by the followers table indexed on uid. That way we can quickly determine who they follow and pull the latest posts from those accounts.
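The hourly engagement pass could be as simple as a top-k selection. The sketch below assumes a hypothetical scoring rule (likes plus twice the retweets) and hypothetical names throughout; the actual ranking signal is not specified in this design.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class PostStats:
    tweet_id: int
    likes: int
    retweets: int


def engagement_score(stats, retweet_weight=2):
    """Hypothetical score: a retweet counts for more than a like."""
    return stats.likes + retweet_weight * stats.retweets


def top_engaged(posts_last_hour, k=100):
    """Return the k posts with the highest engagement score, best first."""
    return heapq.nlargest(k, posts_last_hour, key=engagement_score)


if __name__ == "__main__":
    sample = [PostStats(1, 10, 0), PostStats(2, 3, 5), PostStats(3, 1, 1)]
    for post in top_engaged(sample, k=2):
        print(post.tweet_id, engagement_score(post))
```

`heapq.nlargest` avoids sorting every post from the last hour when only the top few are needed for the cache.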
For popular users with many followers, we can also cache their posts to fan out to followers. Because these users have so many followers, caching their posts saves a lot of time when those followers log on and pull posts.
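A minimal sketch of the popular-user cache follows. It assumes a hypothetical follower-count threshold for "popular", a bounded per-author list of recent posts, and a `load_from_db` callback standing in for the posts table. None of these names or numbers come from the design itself.

```python
from collections import deque

POPULAR_FOLLOWER_THRESHOLD = 10_000   # hypothetical cutoff for "popular"
RECENT_POSTS_PER_AUTHOR = 50          # hypothetical cache size per author


class PopularPostCache:
    def __init__(self, follower_counts):
        self._follower_counts = follower_counts  # uid -> follower count
        self._recent = {}                        # uid -> deque, newest first

    def is_popular(self, uid):
        return self._follower_counts.get(uid, 0) >= POPULAR_FOLLOWER_THRESHOLD

    def on_post(self, uid, tweet_id, content):
        """Cache the post if its author is popular; return whether it was cached."""
        if not self.is_popular(uid):
            return False
        posts = self._recent.setdefault(
            uid, deque(maxlen=RECENT_POSTS_PER_AUTHOR))
        posts.appendleft((tweet_id, content))    # oldest entry drops off the end
        return True

    def recent_posts(self, uid):
        return list(self._recent.get(uid, ()))


def build_feed(followed_uids, cache, load_from_db, limit=20):
    """Merge cached posts from popular authors with database reads for the rest."""
    posts = []
    for author in followed_uids:
        cached = cache.recent_posts(author)
        # A cache miss (unpopular author or cold cache) falls back to the database.
        posts.extend(cached if cached else load_from_db(author))
    posts.sort(key=lambda post: post[0], reverse=True)  # newest tweet id first
    return posts[:limit]
```

With this split, a post from a popular author is written to the cache once and read by every follower, instead of being fetched from the database once per follower.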
Request flows
When a user logs in, they automatically see a feed pulled from a couple of caches: posts from the users they follow, mixed with a set of precomputed high-engagement posts. From there they can view each post, see reply posts, and like or retweet the post if they wish. When they reach the end of the page, we load more tweets and potentially update the cache for other users. This way, if they follow a lot of users, we pull tweets from a subset rather than from all 10k+ followed users, which increases throughput for these edge-case users.
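The "load more at the end of the page" step can be sketched as cursor-based pagination over a capped subset of followed users. The cap, the use of the tweet id as the cursor, and taking the first N followed users are all assumptions for illustration. How the subset is chosen is left open by this design.

```python
import heapq
from itertools import islice


def feed_page(followed_uids, posts_by_author, before_id=None,
              page_size=20, max_authors=200):
    """Return (page, next_cursor) for one scroll of the feed.

    posts_by_author maps uid -> list of (tweet_id, content) tuples.
    Only the first `max_authors` followed users are consulted, so a user
    following 10k+ accounts does not trigger 10k+ lookups per page.
    """
    authors = list(islice(followed_uids, max_authors))
    candidates = (
        post
        for author in authors
        for post in posts_by_author.get(author, [])
        if before_id is None or post[0] < before_id
    )
    page = heapq.nlargest(page_size, candidates, key=lambda post: post[0])
    # A full page suggests more may follow; the last tweet id becomes the cursor.
    next_cursor = page[-1][0] if len(page) == page_size else None
    return page, next_cursor
```

The client passes the returned cursor back as `before_id` when the user scrolls to the bottom; a cursor of `None` means there is nothing more to load from this subset.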
Detailed component design
Trade offs/Tech choices
The main trade-off is fast reads versus fast writes. Since most of our users will be reading tweets rather than creating them, it's fine if a tweet isn't shown to everyone right after it is posted. Because we optimize for fast reads, writes will be slower: for example, if followed users are tweeting a lot, we will have to update the cache constantly, making writes a lot slower.
Failure scenarios/bottlenecks
If a user has a lot of followers and tweets a lot, the cache will have to be updated constantly, which is expensive because we are doubling the number of writes and updates. We could instead refresh the cache periodically, or update it only when the user has a more popular post.
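Both mitigations (periodic refresh, and refreshing early only for popular posts) can be combined in a small batching wrapper. This is a sketch with hypothetical names and thresholds; `flush` stands in for whatever actually writes to the feed cache.

```python
import time


class BatchedFeedCacheWriter:
    """Buffer cache updates and flush at most once per interval,
    except when a post is popular enough to justify an immediate flush."""

    def __init__(self, flush, interval_seconds=60.0, hot_threshold=1_000,
                 clock=time.monotonic):
        self._flush = flush                  # callable taking a list of tweet ids
        self._interval = interval_seconds
        self._hot_threshold = hot_threshold  # hypothetical engagement cutoff
        self._clock = clock
        self._pending = {}                   # tweet_id -> latest engagement
        self._last_flush = clock()

    def record(self, tweet_id, engagement):
        """Note a new post or an engagement change for an existing post."""
        self._pending[tweet_id] = engagement
        due = self._clock() - self._last_flush >= self._interval
        if due or engagement >= self._hot_threshold:
            self.flush_now()

    def flush_now(self):
        if self._pending:
            self._flush(list(self._pending))
            self._pending = {}
        self._last_flush = self._clock()
```

A prolific author then costs one cache write per interval instead of one per tweet, at the price of followers seeing ordinary posts up to one interval late, which is acceptable under eventual consistency.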
Future improvements
We could have a better algorithm for determining which posts are popular, and potentially store more metadata for each post so we can show topics relevant to a user's preferences. We could potentially use an LLM or other machine learning algorithms to analyze the data in our database so that future content algorithms are better tailored to each user.