Design Twitter - System Design

System requirements

Functional:

List functional requirements for the system (Ask the chat bot for hints if stuck.)...

Create a tweet
Tweet constraint 140 characters
Tweet can include image, video
Like a tweet
Comment a tweet
Retweet a tweet
View like count, retweet count, comment count, etc. for a tweet
View home feeds
View a user's tweet page
Follow/unfollow a user
Search user/tweet
notifications for new tweets/mentions/followers

Non-Functional:

List non-functional requirements for the system...

Scalability: More users, More data
Availability: Twitter should always be available to users. No down time.
Low Latency: Requests should be fulfilled quickly.
Durability: The data should be stored persistently.
Fault tolerant: The system shouldn't be affected by failure of some servers or disaster.
Eventual consistency: For the local area, users should get the same view. But if users are depart from each other geographically, it is acceptable that those users get the update later, but eventually all users should get the same view.

Capacity estimation

Estimate the scale of the system you are going to design...

Assume DAU: 500 million

Tweets created per day per user: 3 tweets

Tweets viewed per day: 20 tweets

70% text only, 20% image, 10% video

Text: 140 characters - 500 bytes per tweet

Image: 250KB

Video: 3MB

Storage:

500M * 3 * (500B + 250KB * 0.2 + 3MB * 0.1)

= 1500M * (0.5KB + 50KB + 300KB)

= 1500M * 350KB = 1500 * 350 GB = 525TB per day

525TB * 365 = 191 PB per year

Incoming bandwidth: 525TB/86400s

Outgoing bandwidth: 525TB / 3 * 20 / 86400s

Write heavy and read heavy system. More read heavy.

API design

Define what APIs are expected from the system...

Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...

User database: This can be sql database.

User_id
name
profile_image_path
description

Follow/Unfollow database: This can be a graph database.

follower_id
followee_id

Tweet database: This is a no-sql database since the data is huge.

Tweet_id
user_id
tweet_text
image_path
vide_path
timestamp
tags

Home feed database:

user_id
list of tweet_id

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...

We have several components:

Load balancers: evenly distribute the client requests among application servers
CDN: clients can get media files from CDN.
FollowService: This is used to handle user follow and unfollow requests.
FollowDatabase: FollowService talks to FollowDatabase. FollowDatabase stores the mapping between follower user id and followee user id. This database is graph database, which can capture the relationship among users.
TweetService: This is called when user posts a tweet.
TweetDatabase: TweetService stores tweet information in database, this only stores the text and tweet metadata. The database can be no-sql database.
Object Store: This stores media files of those tweets, for example, image, video. Object Store also pushes those static contents to CDN.
Tweet Cache: We use cache to store frequently accessed tweets. This can lower the latency.
Pub-sub queue: We can use kafka in this case. Every time a tweet is posted, a tweet is commented, a tweet is shared, etc., services should put the message into the pubsub queue based on different topics. Subscribers will get those messages and do the corresponding tasks.
Homefeed generator: it helps with offline homefeed generation for users and stores the homefeed in the database. Those home feed should also be maintained in the cache.
Homefeed database: It stores the home feed of users. It is a no-sql database. We can use document database.
Notification service: it subscribes to multiple topics. For each update, it notifies users about it.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Home feed service:

A user's home feed contains tweets from users that this user follows. If we generate the home feed during the runtime, it's quite slow. To achieve low latency, we can pre-generate the home feed for each user and store it in the database. When user requests the home feed, the server can return home feed directly to user. When a tweet is posted, tweet service should send that tweet into a pubsub queue. We have home feed generator workers subscribe to the tweet topic. For a tweet, we get tweet creator's followers and add the tweet id to those follower's home feed.

For celebrities, they have millions of followers. When a celebrity posts a tweet, it's not efficient to update all followers' home feed lists. While for other normal users, they usually have hundreds of followers. It's reasonable to update all follower's home feed lists. We can use a hybrid mode where the client will pull celebrities' tweets and load pre-generated home feed of other non celebrities' tweets when requesting home feed.

Tweets data is huge. We need to shard the tweets database to make it scalable since one server couldn't store such huge amount of data and it is single point of failure. We can partition tweets database by tweet_id. We take the hash value of tweet_id by applying a hash function and know which shard should store the tweet metadata. It makes sure that workload is distributed evenly among servers. We can avoid hot spot issue.

We also need to replicate the data to make sure that the data won't be lost. We can adopt primary-secondary replication strategy. So the tweet is written to the primary node only. And the primary node will update secondary nodes. Within a data center, updating secondary nodes should be pretty quick since the network latency is minimal. We can make sure that reading from the same data center provide the same data. But it takes some time to replicate data in nodes in a different data center. We have monitoring system to check the status of each nodes. If a primary node is down, the system should select the new primary node.

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?