Design Twitter - System Design

System requirements

Functional:

User can post a tweet with up to 150 characters
User can like a tweet
User can follow and unfollow other users
User can view all tweets composed by themselves
User can view all tweets of friends they follow by reverse chronological order

Non-Functional:

High availability
View tweets in low latency
Eventual consistency
Highly scalable to handle peak usage (x10 concurrent users)

Assume there are 1 billion DAU, and each user composes 1 tweet per day and shares/likes 10 tweets per day.
There are 1 billion new tweets per day and the QPS for writing new tweets is 10^9 / 10^5 = 10K/second, and the QPS for like/share is 100K/second
There will be 150 characters per tweet and let's estimate the size of each tweet is up to 512 bytes, considering different languages, which means there will be 512GB storage increment per day
Twitter has heavier read operations, and let's assume read/write ratio is 5:1, so the read QPS is 50K/second
At peak and there're 20% DAU as concurrent user, which is 20M users

API design

Post tweet: tweet(user_id, content) and server returns tweet id
Like tweet: like(user_id, tweet_id)
Follow: follower(from_user_id, to_user_id)
Fetch homepage tweets: fetch(user_id, cursor_id, limit)

Database design

Users:

user_id primary key,

user_name index,

email,

created_at,

updated_at

Friendships:

from_user_id index,

to_user_id index,

created_at

primary_key(from_user_id, to_user_id)

To look up whom a user is following and we can make from_user_id as partition key and to_user_id as sort key.

To easily view who are following a user, we can create a GSI for revert relationship: to_user_id as partition key and from_user_id as sort key

Tweets:

tweet_id primary key,

content,

created_by index,

created_at inxex

We need fast reads on tweets, and there isn't a strong need to join with other tables, so we can select a key-value database like DynamoDB to allow fast reads and easy to be scalable.

To look up tweets by user id and sort tweets by created_at, we can create GSI: created_by as partition key and created_at as sort key

Likes:

tweet_id,

user_id,

created_at

Hashtag:

hashtag_id index,

tweet_Id index,

created_at

primary_Key(hashtag_id, tweet_id)

High-level design

There will be 3 services: friendship, tweet, and home feed
API server will be responsible for authentication, rate limit, and redirecting request to corresponding service
friendship will handle follow and unfollow
tweet will handle posting tweet, hastag, and liking tweet. It will also store tweets and hastag in a cache with long TTL because tweets are rarely edited
home feed service will fetch data from cache and database to precompute home feed page for each user

Request flows

When user follows/unfollows another user, the request will be redirected to friendship service and the service will update the database
When user posts or likes a tweets, the request will be redirected to tweet service and the service will update related tables and cache
When user requests homepage feeds, the request will be redirected to home feed service and it will compute the feeds to send back to user

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Trade offs/Tech choices

Computing feeds is costly. To reduce the latency, we can consider to fan out tweets to the user's followers. In addition, if a user follows a lot of users, we can create a new table called precompute_feeds. The partition key is user_id and the value is list of tweet ids ordered by created_at.
If a user has tons of followers, fanning out tweets is costly. For this condition, we can only send the tweet to a follower when the follower requests to update the homepage feeds

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?