Requirements
Functional Requirements:
- Allow users to tweet messages up to 140 characters.
- Enable users to follow other users.
- Allow users to like tweets from other users.
- Display tweets from followed users in the home feed.
- Show top K popular tweets in the home feed based on likes and followers.
Non-Functional Requirements:
- System should be highly available, scalable and reliable. Consistency is not a strict requirement. Should be low latent.
Data Model
Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...
tweetdb:
User table
- PK user_id: UUID
- email: varchar
Tweet table
- PK id: UUID
- FK: user_id: UUID -- creator of tweet
- text: varchar(140)
Follower table:
- PK id: UUID
- FK: user_follower_id: UUID -- foreign key to user table
- FK: user_followee_id: UUID -- foreign key to user table
Tweet action table
- PK id: UUID
- FK tweet_id: uuid
- FK user_id: uuid (who make the action)
- action_type ENUM (like, favorite, retweet, share)
For simplicity, we didn't add audit columns (creation time, updated time...)
tweetsvc API Design
postweet(user_id, text,...)
follow_user(user_id, follower_id)
action_tweet(user_id, action_type [like, share, retweet,...])
timelinesvc API Design
getTimelineOfFollowedUsers(user_id) --> returns a list of tweets from followers
getTimelineOfTopK(user_id) --> returns a list of popular tweets
High-Level Design
Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.
- tweetsvc: svc for users to create, share and like a tweet. It can also allow retweets. It also allow users to follow other users. It also handles the creation of timelines for users
- timelinesvc: svc that generates timelines (feeds) for users
- distributed cache: store the timelines of users
- queue: store recently posted tweets. timelinesvc will consume these tweets to build timelines
Detailed Component Design
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.
We have tweetsvc to allow users to create tweets, share tweets, like tweets, repost tweets, follow other people,... As users posts tweet, we can insert the tweet in a queue. A separate process, tweet consumer, consumes the tweet and it will identify all the followers of user that posted tweet and append the tweet to each follower's timeline in a distributed cache. Then, when user asks for its timeline, we obtain it from the cache. This is called the fan out push model. Tradeoff: if a user has millions of followers, the tweet will appended a million times (every timeline). This does not scales well because it is having too many writes for one user. We can instead ignore the fanout model for users greater than 500 followers. We should handle celebrities separately. We can also try fan out pull model. Let's generate the timeline of tweets from follower's when user asks for it. Let's assume that the timeline will have at most 50 tweets (most recent). So first request to generate timeline will be a cache miss. After the timeline gets generated, it is stored in the cache for that user. A separate process can refresh the timeline in cache based on new tweets. Our solution will use a hybrid approach between fan out push/pull.
How to scale? We expect way many more reads than writes. That is why, the timelines of users are stored in a distributed cache. To further scale the cache, we can partition by user. Sure, a user can become a hot key and therefore it can be unbalanced partitions. Instead we can partition by tweet id to not have a hot partition (overloaded) We would use consistent hashing for this. We can also partition the database by tweet id to scale our storage layer. Queue can be kafka. Kafa can scale easily by adding more partitions.
So far, we covered getting the timeline of followed users. To obtain the top K, we would have a separate process to obtain the top K most popular tweets based on likes and followers (ranking system). Let's assume this is a global timeline for everyone based on their region. This is the trending section. Our ranking tweets process can run periodically to refresh top K most popular tweets. Let's keep this trending section in a cache as well.
How to have high availability? Each service have multiple instances (scale horizontally) and each cache and database have follower replicas.