Design Twitter - System Design

System requirements

Functional:

List functional requirements for the system (Ask the chat bot for hints if stuck.)...

Post tweets

View tweets of followed people as a news feed

Non-Functional:

List non-functional requirements for the system...

Highly available

High reliability

High freshness of the newsfeed

Capacity estimation

Estimate the scale of the system you are going to design...

If the system has 100M daily active users, each post 3 tweets a day on average, it means the system receives 300M tweets. That is about 300M/86400 ~ 3000QPS. This is a read heavy system. Assume the read requests for news feed is 10x of posting tweets, it corresponds to 30000QPS. This means we definitely need a distributed system for support such a service.

As storage, we will need to store 300M * 140 Bytes = 42 GB new tweets a day. This translates to 42 GB * 365 * 10 ~ 15 TB storage for 10 years. We also need to store metadata about user to tweet relationship and newsfeeds, which will require additional storage.

API design

Define what APIs are expected from the system...

post_tweet(api_dev_key, user_id, tweet, timestamp, isPublic=True)

get_newsfeed(api_dev_key, user_id, max_tweets=100)

Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...

We can use a relational database to store the metadata

User table

user_id

creation_time

user_name

user_gender

...

Tweet table

tweet_id

user_id

creation_time

is_public

tweet_text

tweet_url

link_to_image_in_tweet

Follow table

user_id

follower_id

follow_since

Besides we can use an object storage to store the images in the tweets, and populate the links to access them in the tweet table as link_to_image_in_tweet.

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...

For the tweet posting service, once a user request comes, the load balancer distributes the request to one of the frontend web server, which is in charge of forwarding the request to pre-processing component that extract the URL, images, and other necessary information from the tweet and write them into the backend database. If image exists, it will be written to an object storage and the link to access it will be stored in the tweet table. The data may also be written to a cache for efficient access during news feed generation. Once the write to the database is completed, the user will receive a response about successful post.

For the news feed service, once a user request comes, the load balancer distributes the request to one of the frontend web server, and it first checks if the user's newsfeed exists in the cache, if so, it will be directly returned. Otherwise, the request is forwarded to the backend system for generating the newsfeed. The backend server will fetch the list of users the current user follows from the cache and the most recent n tweets for each user from the cache, then aggregate and rank them based on some pre-defined criteria, such as recency, or relevance to get the top-k tweets as the newsfeed. The tweet cache and follow cache are sitting before the corresponding databases in the system.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

See above.

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

To scale the system, the database need to be shared. One option is to shard by user id. It has the advantage of accessing all tweets of a user from a single server but may lead to imbalanced load. This is because if a user is followed by many users the server hosting it will be overloaded. A better option is to shard by tweet id. In this way, load will be more evenly distributed across servers, but a single newsfeed request will access all servers.

To ensure freshness of the newsfeed, the cache needs to be refreshed based on newly available content. One approach is to periodically pull the new data from the database and generate the news feed to refresh the cache. Infrequent pull causes staled news feed and too frequent pull wastes bandwidth and processing resources. Another approach is to always push new data as soon as it is available. It is a waste for users whose tweets are rarely read. A hybrid approach that pulls for frequently changed users and push for infrequently changed users, or only pull when a user's newsfeed is likely to change using heuristic algorithm may be a better choice.

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Hot users followed by many users may cause a bottleneck for the system. To alleviate this problem, one possibility is to replicate the user's tweets to more servers so that loads can be distributed and handled by more servers to avoid slowing down the service.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?

To address the hot user problem, a hot user's tweet can be distributed to more servers by adding a prefix to its original id and hash each of them to a different key for partitioning.