Requirements

Functional Requirements:

Allow users to tweet messages up to 140 characters.
Enable users to follow other users.
Allow users to like tweets from other users.
Display tweets from followed users in the home feed.
Show top K popular tweets in the home feed based on likes and followers.

Non-Functional Requirements:

System should be highly available, scalable and reliable. Consistency is not a strict requirement. Should be low latent.

Data Model

Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...

tweetdb:

User table

PK user_id: UUID
email: varchar

Tweet table

PK id: UUID
FK: user_id: UUID -- creator of tweet
text: varchar(140)

Follower table:

PK id: UUID
FK: user_follower_id: UUID -- foreign key to user table
FK: user_followee_id: UUID -- foreign key to user table

Tweet action table

PK id: UUID
FK tweet_id: uuid
FK user_id: uuid (who make the action)
action_type ENUM (like, favorite, retweet, share)

For simplicity, we didn't add audit columns (creation time, updated time...)

tweetsvc API Design

postweet(user_id, text,...)

follow_user(user_id, follower_id)

action_tweet(user_id, action_type [like, share, retweet,...])

timelinesvc API Design

getTimelineOfFollowedUsers(user_id) --> returns a list of tweets from followers

getTimelineOfTopK(user_id) --> returns a list of popular tweets

High-Level Design

Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.

tweetsvc: svc for users to create, share and like a tweet. It can also allow retweets. It also allow users to follow other users. It also handles the creation of timelines for users
timelinesvc: svc that generates timelines (feeds) for users
distributed cache: store the timelines of users
queue: store recently posted tweets. timelinesvc will consume these tweets to build timelines

Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.

We have tweetsvc to allow users to create tweets, share tweets, like tweets, repost tweets, follow other people,... As users posts tweet, we can insert the tweet in a queue. A separate process, tweet consumer, consumes the tweet and it will identify all the followers of user that posted tweet and append the tweet to each follower's timeline in a distributed cache. Then, when user asks for its timeline, we obtain it from the cache. This is called the fan out push model. Tradeoff: if a user has millions of followers, the tweet will appended a million times (every timeline). This does not scales well because it is having too many writes for one user. We can instead ignore the fanout model for users greater than 500 followers. We should handle celebrities separately. We can also try fan out pull model. Let's generate the timeline of tweets from follower's when user asks for it. Let's assume that the timeline will have at most 50 tweets (most recent). So first request to generate timeline will be a cache miss. After the timeline gets generated, it is stored in the cache for that user. A separate process can refresh the timeline in cache based on new tweets. Our solution will use a hybrid approach between fan out push/pull.

How to scale? We expect way many more reads than writes. That is why, the timelines of users are stored in a distributed cache. To further scale the cache, we can partition by user. Sure, a user can become a hot key and therefore it can be unbalanced partitions. Instead we can partition by tweet id to not have a hot partition (overloaded) We would use consistent hashing for this. We can also partition the database by tweet id to scale our storage layer. Queue can be kafka. Kafa can scale easily by adding more partitions.

So far, we covered getting the timeline of followed users. To obtain the top K, we would have a separate process to obtain the top K most popular tweets based on likes and followers (ranking system). Let's assume this is a global timeline for everyone based on their region. This is the trending section. Our ranking tweets process can run periodically to refresh top K most popular tweets. Let's keep this trending section in a cache as well.

How to have high availability? Each service have multiple instances (scale horizontally) and each cache and database have follower replicas.