Design Twitter - System Design

System requirements

Functional:

Follow and unfollow users.
Post tweets.
Search tweets.
Tweets can contain text, images, and video.
View tweets in the user's timeline.
Like a tweet.
Comment on a tweet.

Non-Functional:

Availability
Scalability
Latency
Reliability
Consistency

Capacity estimation

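Assume Twitter has 1B daily active users.

A day has 86,400 seconds, rounded to 100k for easy arithmetic. Counting roughly one read request per user per day, read QPS is 1B / 100k = 10k. Peak read QPS is 2 * 10k = 20k.

Assume 1% of users write 10 tweets per day.

Write QPS is 1B * 1% * 10 / 100k = 1k. Peak write QPS is 2 * 1k = 2k.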

Data storage estimation:

Assume 10% of tweets contain images or videos, and the average storage per tweet is 10 KB. The total daily storage usage is:

1B * 1% * 10 * 10 KB = 1 TB. Each piece of data needs two additional replicas, so the total daily storage usage is 1 TB * 3 = 3 TB.
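
The estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the constant names are my own, and the one-read-per-user figure is the assumption implied by dividing 1B by 100k.

```python
# Back-of-envelope capacity estimate using the assumptions stated above.
SECONDS_PER_DAY_APPROX = 100_000   # 86,400 rounded up for easy arithmetic
DAU = 1_000_000_000                # 1B daily active users
READS_PER_USER_PER_DAY = 1         # assumption implied by "1B / 100k"
WRITER_PERCENT = 1                 # 1% of users post tweets
TWEETS_PER_WRITER = 10
AVG_TWEET_BYTES = 10_000           # 10 KB average per tweet
PEAK_FACTOR = 2
COPIES = 3                         # one primary copy plus two replicas

read_qps = DAU * READS_PER_USER_PER_DAY // SECONDS_PER_DAY_APPROX
tweets_per_day = DAU * WRITER_PERCENT // 100 * TWEETS_PER_WRITER
write_qps = tweets_per_day // SECONDS_PER_DAY_APPROX

daily_bytes = tweets_per_day * AVG_TWEET_BYTES
daily_bytes_with_replicas = daily_bytes * COPIES

print("read QPS:", read_qps, "peak:", read_qps * PEAK_FACTOR)
print("write QPS:", write_qps, "peak:", write_qps * PEAK_FACTOR)
print("daily storage (TB):", daily_bytes / 10**12)
print("daily storage with replicas (TB):", daily_bytes_with_replicas / 10**12)
```

Changing any single assumption (for example, more reads per user) scales the matching result linearly, which makes it easy to test how sensitive the design is to each number.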

API design

GET getNewsFeed(

authToken,

userId,

lastSeenKey, (for pagination)

count

) => Tweet[] / Error

GET getComments(

authToken,

userId,

parentId, (Could be either a tweet or comment)

lastSeenKey (pagination)

) => Comment[] / Error

POST postTweet(

authToken,

userId,

content: Content

) => Success / Error

POST postComments(

authToken,

userId,

parentId, (Could be either a tweet or comment)

content: Content

) => Success / Error

POST likeTweet(

authToken,

userId,

tweetId

) => Success / Error

GET searchTweet(

authToken,

userId,

searchText,

lastSeenKey (for pagination)

) => Tweet[] / Error

PUT editTweet

POST followUser

POST unfollowUser
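
To illustrate how the lastSeenKey parameter could drive pagination in getNewsFeed, here is a minimal sketch of cursor-based paging. It assumes the feed is already sorted newest first by tweet ID and that lastSeenKey is the ID of the last tweet the client received. The function and field names are hypothetical, not part of the API above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    text: str


def get_news_feed(
    feed: List[Tweet], last_seen_key: Optional[int], count: int
) -> Tuple[List[Tweet], Optional[int]]:
    """Return up to `count` tweets older than `last_seen_key` and the next cursor.

    `feed` must be sorted newest first (descending tweet_id).
    The returned cursor is None when there are no more pages.
    """
    candidates = [
        t for t in feed if last_seen_key is None or t.tweet_id < last_seen_key
    ]
    page = candidates[:count]
    next_key = page[-1].tweet_id if len(candidates) > count else None
    return page, next_key


# Usage: walk through a five-tweet feed two tweets at a time.
feed = [Tweet(i, f"tweet {i}") for i in range(5, 0, -1)]
page1, key1 = get_news_feed(feed, None, 2)
page2, key2 = get_news_feed(feed, key1, 2)
page3, key3 = get_news_feed(feed, key2, 2)
```

Unlike offset-based paging, a cursor stays stable when new tweets arrive at the head of the feed between requests, so the client does not see duplicates.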

For error handling, the API returns an HTTP status code with a detailed error message for each kind of error. Some example HTTP error status codes are:

400 Bad Request

401 Unauthorized

403 Forbidden

404 Not Found

503 Service Unavailable
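
One way to keep these codes consistent across endpoints is to map each kind of error to a status in one place. The sketch below uses Python's standard http.HTTPStatus. The exception class names and the response shape are assumptions for illustration only.

```python
from http import HTTPStatus


class ApiError(Exception):
    """Base class; each subclass picks the HTTP status it maps to."""
    status = HTTPStatus.BAD_REQUEST


class UnauthorizedError(ApiError):
    status = HTTPStatus.UNAUTHORIZED


class ForbiddenError(ApiError):
    status = HTTPStatus.FORBIDDEN


class NotFoundError(ApiError):
    status = HTTPStatus.NOT_FOUND


class ServiceUnavailableError(ApiError):
    status = HTTPStatus.SERVICE_UNAVAILABLE


def to_error_response(error: ApiError) -> dict:
    """Build the error body: numeric code, standard phrase, detailed message."""
    return {
        "status": int(error.status),
        "error": error.status.phrase,
        "message": str(error),
    }


response = to_error_response(NotFoundError("tweet 42 does not exist"))
print(response)
```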

For authentication, each API request carries an authentication token. The token could be a session cookie, an OAuth 2.0 token, or a JWT used to validate the user's identity.
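
As a rough illustration of token validation, the sketch below issues and verifies a signed, expiring token with the standard hmac module. It is a simplified stand-in, not a JWT or OAuth implementation. The token layout, the secret, and the function names are all assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"demo-secret"  # hypothetical; a real service loads this from a secret store


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(user_id: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return "<user_id>.<expiry>.<signature>" for the given user."""
    issued_at = now if now is not None else time.time()
    payload = f"{user_id}.{int(issued_at + ttl_seconds)}"
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user ID if the token is authentic and unexpired, else None."""
    try:
        user_id, expires_at, signature = token.rsplit(".", 2)
    except ValueError:
        return None  # wrong number of parts
    expected = _sign(f"{user_id}.{expires_at}")
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    current = now if now is not None else time.time()
    if not expires_at.isdigit() or current >= int(expires_at):
        return None
    return user_id


token = issue_token("user-1", ttl_seconds=60, now=1000)
```

In this design the API gateway would run the verification step before routing, so downstream services can trust the userId they receive.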

Here are the objects in the api above:

Tweet {

tweetId,

creators,

createdAt,

likeCount,

content: Content,

topComments: Comment[]

}

Content {

contentId,

tweetId,

text,

media: Media[]

}

Media {

mediaType,

mediaSolution,

mediaURL

}
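
The objects above could be modeled as plain data classes. This is a sketch only: field names are converted to snake_case, the field types are guesses, and the Comment shape is hypothetical because the text does not define it.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Media:
    media_type: str       # e.g. "image" or "video"
    media_solution: str   # field name kept from the Media object above
    media_url: str


@dataclass
class Content:
    content_id: str
    tweet_id: str
    text: str
    media: List[Media] = field(default_factory=list)


@dataclass
class Comment:  # hypothetical shape; not defined in the text
    comment_id: str
    parent_id: str
    text: str


@dataclass
class Tweet:
    tweet_id: str
    creators: str
    created_at: str
    like_count: int
    content: Content
    top_comments: List[Comment] = field(default_factory=list)


# Example payload; the IDs and the example.com URL are placeholders.
tweet = Tweet(
    tweet_id="t1",
    creators="u1",
    created_at="2024-01-01T00:00:00Z",
    like_count=3,
    content=Content(
        content_id="c1",
        tweet_id="t1",
        text="hello",
        media=[Media("image", "1080p", "https://blob.example.com/m1.jpg")],
    ),
)
payload = asdict(tweet)  # nested dict, ready to serialize as JSON
```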

Database design

High-level design

I will use a SQL database to store user, tweet, and comment information. This data is relational in nature, so a SQL database makes it easier to join and query.

I will use a graph database to store the follow relationships between users.

I will use a NoSQL database to store media metadata.

I will use blob storage to store the actual images and videos attached to tweets.

I will use a search data store such as Elasticsearch to index tweets, which makes searching tweets faster.

SQL database Schema:

User Table: userId, userInfo...

Tweet Table: tweetId, userId, content, mediaId, likeCount, commentIds, createdAt, updatedAt

Comment Table: commentId, parentId, userId, content, createdAt, updatedAt
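
A minimal sketch of this schema, using an in-memory SQLite database so it runs anywhere. The column types and the index are my assumptions. The sketch leaves out commentIds and instead derives a tweet's comments through parentId, which assumes tweet and comment IDs never collide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        user_id   INTEGER PRIMARY KEY,
        user_info TEXT
    );
    CREATE TABLE tweets (
        tweet_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        content    TEXT NOT NULL,
        media_id   TEXT,
        like_count INTEGER NOT NULL DEFAULT 0,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        parent_id  INTEGER NOT NULL,  -- a tweet_id or another comment_id
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        content    TEXT NOT NULL,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL
    );
    CREATE INDEX idx_comments_parent ON comments(parent_id, created_at);
    """
)

ts = "2024-01-01T00:00:00Z"
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.execute(
    "INSERT INTO tweets (tweet_id, user_id, content, created_at, updated_at) "
    "VALUES (?, ?, ?, ?, ?)",
    (100, 1, "hello world", ts, ts),
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)",
    [(200, 100, 2, "nice", ts, ts), (201, 200, 1, "thanks", ts, ts)],
)

# Direct comments on tweet 100, joined with the commenter's info.
tweet_comments = conn.execute(
    "SELECT c.content, u.user_info FROM comments c "
    "JOIN users u ON u.user_id = c.user_id "
    "WHERE c.parent_id = ? ORDER BY c.created_at",
    (100,),
).fetchall()
print(tweet_comments)
```

The join in the last query is the kind of combination the relational model makes easy, which is the reason given above for choosing SQL here.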

Graph database schema: each node is a userId, and each edge is a relationship (following, follower) between two users.

NoSQL database for media metadata: the key is the media ID, and the value contains the media type (image or video) and the URLs of the media at each available resolution in blob storage.
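
To make the graph model concrete, here is a small in-memory sketch with users as nodes and directed follow edges. A real graph database would replace this class. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class FollowGraph:
    """Directed graph: an edge follower -> followee means "follower follows followee"."""

    def __init__(self) -> None:
        self._following: Dict[str, Set[str]] = defaultdict(set)
        self._followers: Dict[str, Set[str]] = defaultdict(set)

    def follow(self, follower: str, followee: str) -> None:
        if follower == followee:
            return  # ignore self-follows in this sketch
        self._following[follower].add(followee)
        self._followers[followee].add(follower)

    def unfollow(self, follower: str, followee: str) -> None:
        self._following[follower].discard(followee)
        self._followers[followee].discard(follower)

    def following(self, user_id: str) -> Set[str]:
        return set(self._following.get(user_id, set()))

    def followers(self, user_id: str) -> Set[str]:
        return set(self._followers.get(user_id, set()))


graph = FollowGraph()
graph.follow("alice", "bob")
graph.follow("carol", "bob")
graph.follow("alice", "carol")
graph.unfollow("carol", "bob")
```

Storing the edge in both directions costs extra space but lets both "who do I follow" and "who follows me" be answered with a single lookup.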
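
A media metadata record might look like the sketch below, with a plain dictionary standing in for the key-value store. The media ID, the resolution labels, and the example.com URLs are placeholders, and the fallback rule is an assumption.

```python
from typing import Dict, Optional

# Key: media ID. Value: media type plus one blob-storage URL per resolution.
media_metadata: Dict[str, dict] = {
    "media-123": {
        "type": "image",
        "urls": {
            "thumbnail": "https://blob.example.com/media-123/thumbnail.jpg",
            "original": "https://blob.example.com/media-123/original.jpg",
        },
    },
}


def media_url(media_id: str, resolution: str) -> Optional[str]:
    """Return the URL for the requested resolution, falling back to the original."""
    record = media_metadata.get(media_id)
    if record is None:
        return None
    urls = record["urls"]
    return urls.get(resolution, urls.get("original"))


thumbnail = media_url("media-123", "thumbnail")
fallback = media_url("media-123", "4k")
```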

Request flows

The client posts a tweet or comment.
The post request goes through the load balancer, which distributes it across servers.
The request then goes through the API gateway for authentication, rate limiting, and security checks.
The request reaches the tweets service.
The tweets service publishes tweet requests to the tweets message queue.
The tweets service publishes comment requests to the comments message queue.
The tweets service publishes the request to the media message queue if the post contains media.
One worker subscribed to the tweets message queue stores the tweet information in the SQL database.
Another worker subscribed to the tweets message queue stores the tweet in the Elasticsearch data store, which indexes the tweet content and provides fast tweet search.
The comments message queue subscriber saves the comment in the SQL database.
The SQL database updates the RDB cache.
The NewsFeed service reads from the RDB cache for new tweet updates. It is responsible for running an ML model that recommends each user's news feed timeline based on the user's followings, post popularity, the user's region, the user's interests, and other factors.
When a client requests the posts in their timeline, the request is routed to the get timeline service, which calls the NewsFeed service to return the pre-generated news feed for the client.
When a client searches tweets, the request is routed to the search tweets service, which queries the Elasticsearch cache and indexes to return search results.
When a client requests to follow or unfollow a user, the request is routed to the following service, which updates the relationship between users in the graph database.
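
The write path above, where one published tweet is consumed by both a storage worker and an indexing worker, can be sketched with a toy in-memory broker. Everything here is a stand-in: the Broker class is not a real message queue, the dictionary is not a SQL database, and the word-level inverted index only hints at what Elasticsearch does.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Set, Tuple

Handler = Callable[[dict], None]


class Broker:
    """Toy pub/sub: every subscriber of a topic receives every message."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self._pending: Deque[Tuple[str, dict]] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Publishing only enqueues, so the caller returns before workers run.
        self._pending.append((topic, message))

    def drain(self) -> None:
        # Deliver queued messages in order to each subscriber of their topic.
        while self._pending:
            topic, message = self._pending.popleft()
            for handler in self._subscribers[topic]:
                handler(message)


tweet_store: Dict[int, dict] = {}                      # stand-in for the SQL database
search_index: Dict[str, Set[int]] = defaultdict(set)   # stand-in for the search index


def store_tweet(message: dict) -> None:
    tweet_store[message["tweet_id"]] = message


def index_tweet(message: dict) -> None:
    for word in message["text"].lower().split():
        search_index[word].add(message["tweet_id"])


def search_tweets(query: str) -> List[int]:
    """Return IDs of tweets containing every query word, newest ID first."""
    words = query.lower().split()
    if not words:
        return []
    matches = [search_index.get(word, set()) for word in words]
    return sorted(set.intersection(*matches), reverse=True)


broker = Broker()
broker.subscribe("tweets", store_tweet)
broker.subscribe("tweets", index_tweet)
broker.publish("tweets", {"tweet_id": 1, "user_id": 7, "text": "Learning system design"})
broker.publish("tweets", {"tweet_id": 2, "user_id": 8, "text": "System design interview tomorrow"})

stored_before_drain = len(tweet_store)  # workers have not run yet
broker.drain()
results = search_tweets("system design")
```

Decoupling the tweets service from its workers this way means a slow indexer does not delay the post request, at the cost of search results lagging slightly behind writes.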

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?