My Solution for Design Twitter with Score: 8/10
by horizon6080
System requirements
Functional:
- Users can send tweets of up to 140 characters (roughly 150 bytes of text)
- Users can follow other users
- Users can like other users' tweets
- User's timeline
- lists the tweets the user has posted, in reverse chronological order
- A user's home feed timeline shows an aggregation of all tweets from the users they follow
- The home feed shows the top K most popular tweets, ranked by the number of likes each tweet has received and the number of followers its author has
Non-Functional:
- Scalability. 500M DAU
- Availability and low latency. Users must be able to post tweets quickly, and when a user opens the home feed, the first 10 tweets should show up within 500 ms.
- Consistency can be sacrificed for availability. Strong consistency like banking transactions is not needed; eventual consistency is okay. If a user sends a tweet, followers in the same geographic region can see it within 1 second, while users in other geographic regions may only see it after 30 seconds. This is acceptable.
- Security, content moderation, and anti-abuse protection.
Capacity estimation
500 M DAU
Each user sends 2 tweets per day on average: 1B tweets per day.
Each tweet has ~140 bytes of text; with metadata, about 500 bytes.
Storage: 1B tweets * 500 bytes -> 500 GB per day (storage for 2 years: 500 GB * 365 * 2 -> ~365 TB, round to 400 TB).
Database storage required: roughly 500 TB.
It is better to use a NoSQL database; a single relational database typically tops out at around 100 TB.
Document-based DB: MongoDB or DynamoDB
QPS: 500M / day (~10^5 seconds) -> 500 * 10^6 / 10^5 -> 5,000 QPS on average
If one API call to pull tweets takes 500 ms of core time: 5,000 QPS * 0.5 s -> ~2,500 cores busy on average; with headroom for peaks (assume ~4x), ~10,000 cores
10,000 cores / 8 cores per instance -> ~1,250 machine instances (provision more for redundancy)
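As a sanity check, here is a back-of-envelope script for the numbers above; the 4x peak factor and 8 cores per instance are assumptions I am adding for illustration, not requirements from the problem.

```python
# Back-of-envelope check for the storage / QPS / core estimates above.
DAU = 500e6                      # daily active users
TWEETS_PER_USER = 2              # tweets per user per day
TWEET_BYTES = 500                # 140-char text plus metadata
SECONDS_PER_DAY = 1e5            # 86,400 rounded for easy math

tweets_per_day = DAU * TWEETS_PER_USER                    # 1e9 tweets/day
text_gb_per_day = tweets_per_day * TWEET_BYTES / 1e9      # ~500 GB/day
text_tb_2_years = text_gb_per_day * 365 * 2 / 1e3         # ~365 TB, round up to 400-500 TB

avg_qps = DAU / SECONDS_PER_DAY                           # ~5,000 QPS
call_latency_s = 0.5                                      # 500 ms of core time per call
peak_factor = 4                                           # assumption: peak ~4x average
cores = avg_qps * call_latency_s * peak_factor            # Little's law -> ~10,000 cores
instances = cores / 8                                     # assumption: 8 cores/instance -> ~1,250

print(text_gb_per_day, text_tb_2_years, cores, instances)
```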
Each user:
- Visit home timeline API 5 times per day
- each timeline has 200 tweets
- 500M * 5 / day -> 25 * 10^8 / 10^5 -> 25,000 QPS
- downloading bandwidth: with ~20 tweets per page at ~0.5 KB each, 25,000 QPS * 10 KB -> ~250 MB/s of text (returning full 200-tweet timelines would be ~2.5 GB/s)
- Visit other users' timelines 5 times per day
- each timeline has 200 tweets
- 25K QPS
- another ~250 MB/s of text download bandwidth under the same assumptions
- 50,000 read QPS in total; at 500 ms of core time per API call that is ~25,000 cores busy on average -> roughly 1,500-3,000 server instances (8-16 cores each), more with peak headroom
Each user sends 2 tweets per day on average: 1B tweets per day.
- 10% of tweets have video (2 MB): 10^9 * 0.1 * 2 MB -> 2 * 10^5 GB -> 200 TB per day
- 20% of tweets have a picture (200 KB): 10^9 * 0.2 * 200 KB -> 40 TB per day
- tweet text with metadata: 500 GB per day (500 bytes * 1B tweets)
Bandwidth:
- Read (download, i.e. server egress) bandwidth:
- 50,000 QPS (each request returns ~20 tweets) -> 5 * 10^4 * 20 -> 10^6 tweets/second
- 10^6 * 0.5 KB -> 500 MB / second (text)
- 10^6 * 2MB * 10% -> 200 GB /second (video)
- 10^6 * 200 KB * 20% -> 40 GB/second (image)
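The same kind of sanity check for the read path; the 20-tweets-per-page figure and the media sizes are the ones assumed above.

```python
# Back-of-envelope check for read QPS and download (egress) bandwidth.
DAU = 500e6
TIMELINE_VISITS = 10             # 5 home-timeline + 5 user-timeline visits per user/day
SECONDS_PER_DAY = 1e5
TWEETS_PER_PAGE = 20             # tweets returned per request
TEXT_KB, IMAGE_KB, VIDEO_KB = 0.5, 200, 2_000
IMAGE_RATIO, VIDEO_RATIO = 0.2, 0.1

read_qps = DAU * TIMELINE_VISITS / SECONDS_PER_DAY        # ~50,000 QPS
tweets_per_sec = read_qps * TWEETS_PER_PAGE               # ~1e6 tweets/s served

text_mb_per_s = tweets_per_sec * TEXT_KB / 1e3                     # ~500 MB/s
image_gb_per_s = tweets_per_sec * IMAGE_RATIO * IMAGE_KB / 1e6     # ~40 GB/s
video_gb_per_s = tweets_per_sec * VIDEO_RATIO * VIDEO_KB / 1e6     # ~200 GB/s

print(read_qps, tweets_per_sec, text_mb_per_s, image_gb_per_s, video_gb_per_s)
```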
Bottlenecks:
- number_of_likes. If a famous person posts something and millions of users click "like" within a few minutes, the database server would be overwhelmed
- One approach is to break the like counter into multiple (say, 100) sub-counters and make different database nodes responsible for different sub-counters (see the sketch below)
- number of followers. If a famous person with millions of followers posts something, the tweet should show up in millions of people's home feeds within a short period of time. A message queue helps achieve this.
It is worth noting that millions of users might be viewing the same content concurrently.
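A minimal sketch of the sub-counter idea, assuming a Redis-style store accessed via redis-py; the key layout and the shard count of 100 are illustrative choices, not part of the design above.

```python
import random
import redis  # assumption: a Redis-style store holds the sub-counters

r = redis.Redis()
NUM_SUB_COUNTERS = 100  # sub-counters per tweet, spread across cache/DB nodes

def like(tweet_id: str) -> None:
    """Increment a random sub-counter so no single key (or node) becomes hot."""
    shard = random.randrange(NUM_SUB_COUNTERS)
    r.incr(f"likes:{tweet_id}:{shard}")

def like_count(tweet_id: str) -> int:
    """Sum all sub-counters; this read can itself be cached, since exact
    real-time counts are not required (eventual consistency is acceptable)."""
    keys = [f"likes:{tweet_id}:{i}" for i in range(NUM_SUB_COUNTERS)]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```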
API design
- postTweet(userToken, string tweet)
- deleteTweet(userToken, string tweetId)
- likeOrDislikeTweet(userToken, string tweetId, bool like)
- readHomeTimeLine(userToken, int pageSize, opt string pageToken)
- readUserTimeLine(userToken, string userId, int pageSize, opt string pageToken)
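To make the contract a bit more concrete, here is one possible rendering of the same surface in code; the Tweet/Page shapes and the opaque next-page token are my assumptions, the design above only fixes the logical signatures.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tweet:
    tweet_id: str
    created_by: str
    posted_time: str
    content: str

@dataclass
class Page:
    tweets: List[Tweet]
    next_page_token: Optional[str]  # None when there are no further pages

class TwitterAPI:
    """Logical API from the list above; authentication travels in user_token."""

    def post_tweet(self, user_token: str, tweet: str) -> str: ...  # returns tweet_id
    def delete_tweet(self, user_token: str, tweet_id: str) -> None: ...
    def like_or_dislike_tweet(self, user_token: str, tweet_id: str, like: bool) -> None: ...
    def read_home_timeline(self, user_token: str, page_size: int,
                           page_token: Optional[str] = None) -> Page: ...
    def read_user_timeline(self, user_token: str, user_id: str, page_size: int,
                           page_token: Optional[str] = None) -> Page: ...
```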
Database design
Tweet (document NoSQL database):
- tweet_id: primary key
- created_by: user ID
- posted_time
- content: string of 140 chars
- media_link: link to picture or video content (which can be stored in S3)
- number_of_likes
- hashtags: list of hashtag strings used in the tweet
- users mentioned: list of users mentioned in the tweet
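An example document for this schema (all values are made up; media_link would point at object storage such as S3, as noted above):

```python
# Illustrative Tweet document; number_of_likes may instead be derived from
# the sharded sub-counters described earlier.
tweet_doc = {
    "tweet_id": "t-9f3c2a",
    "created_by": "u-123456",
    "posted_time": "2025-06-01T12:34:56Z",
    "content": "Hello world, a tweet of at most 140 characters.",
    "media_link": "https://s3.example.com/media/t-9f3c2a/photo.jpg",
    "number_of_likes": 0,
    "hashtags": ["#hello", "#world"],
    "users_mentioned": ["u-654321"],
}
```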
User (can be stored in SQL database):
- user_id: primary key: Integer
- is_hot_user: boolean
- name
- nickname
- Date of birth
- gender
- lastLogin: datetime
Follower Table (can be stored in SQL database):
- userId1 (follower): integer
- userId2 (followee): integer
Timelines data (Cache or NoSQL database)
- home timeline
- user_id
- {tweet_id}
- pre-computed
- can be large, but does not need to be: with pagination the cache only has to hold the most recent tweets, and the front-end's pagination/scrolling design keeps users from reaching the very last page
- user timeline
- user_id
- {tweet_id}
- can be large for hot users
- tweets
- tweet_id
- content
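One way to realize these timeline structures, assuming a Redis-style cache (the key names and the 800-entry cap are illustrative assumptions): each timeline is a sorted set of tweet ids scored by post time, trimmed so it never grows without bound.

```python
import time
import redis  # assumption: Redis-style cache for precomputed timelines

r = redis.Redis()
MAX_TIMELINE_LEN = 800  # keep only the most recent N tweet ids per timeline

def push_to_timeline(key: str, tweet_id: str, posted_at: float) -> None:
    r.zadd(key, {tweet_id: posted_at})
    r.zremrangebyrank(key, 0, -MAX_TIMELINE_LEN - 1)  # drop the oldest entries

def read_timeline(key: str, page_size: int, offset: int = 0) -> list:
    # newest first, one page at a time (matches the pagination note above)
    return r.zrevrange(key, offset, offset + page_size - 1)

# "home:<user_id>" holds a home timeline, "user:<user_id>" a user timeline
push_to_timeline("user:u-123456", "t-9f3c2a", time.time())
first_page = read_timeline("user:u-123456", page_size=20)
```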
High-level design
Scenario 1: user posts a tweet
User -> LB -> tweet writer service -> DB
-> cache
Scenario 2: user visits a specific user's timeline
User -> LB -> tweet timeline service -> DB or cache
Very low latency is needed (< 200 ms), so a cache is required.
Naive approach: query the database by userId and fetch that user's tweets in reverse chronological order (DB reads are slow).
Fast approach: query the cache. The timeline data of a hot user is stored in the cache whenever that user posts a new tweet.
When to update the cache: when a user posts a new tweet (fan-out on write). The tweet-write event updates the cached tweet list of the author's own user timeline, as sketched below.
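A minimal sketch of that write path, assuming the same Redis-style timeline cache as above; save_tweet_to_db is a placeholder for the real NoSQL write.

```python
import time
import redis  # assumption: cache key "user:<user_id>" holds the user's timeline

cache = redis.Redis()

def save_tweet_to_db(tweet_id: str, user_id: str, content: str) -> None:
    """Placeholder for the write to the tweets store (e.g., MongoDB/DynamoDB)."""

def handle_post_tweet(user_id: str, tweet_id: str, content: str) -> None:
    save_tweet_to_db(tweet_id, user_id, content)
    # Fan out on write to the author's own cached user timeline, so reads of
    # this user's timeline are served from the cache instead of the database.
    cache.zadd(f"user:{user_id}", {tweet_id: time.time()})
```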
Scenario 3: user visits the home timeline
User -> LB -> tweet timeline service-> Database
-> Cache
Naive approach (Pull Model):
Query all the userIds that the user follows, then query the new tweets of those userIds, merge them with some TopN algorithm, and return the result to the user. (DB reads and in-memory compute on the fly are too slow.)
Improvement (Push Model):
Use a cache. Store a user's home timeline in the cache, so the service does not need to fetch data and do the compute and aggregation on the fly.
When to update the cache: when a user posts a new tweet (fan-out on write). The tweet-write event updates the cached home timeline tweet lists of all the author's followers (see the sketch after the pros/cons list).
Pros:
Read O(1)
Cons:
- Write is O(n), where n is the number of followers
- async task
- Eventual consistency
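A sketch of the asynchronous fan-out worker behind the Push Model; the Redis-style cache and the get_follower_ids placeholder are assumptions, and in practice the "tweet posted" event would arrive via the message queue mentioned earlier.

```python
import redis  # assumption: "home:<user_id>" sorted sets hold home timelines

cache = redis.Redis()
MAX_TIMELINE_LEN = 800

def get_follower_ids(user_id: str) -> list:
    """Placeholder: in reality this queries the Follower table."""
    return []

def handle_tweet_posted(event: dict) -> None:
    """Queue consumer: fan out a new tweet to every follower's home timeline.

    Write cost is O(n) in the number of followers, but it runs asynchronously,
    so followers see the tweet with eventual consistency while reads stay O(1).
    """
    tweet_id = event["tweet_id"]
    author_id = event["author_id"]
    posted_at = event["posted_at"]
    for follower_id in get_follower_ids(author_id):
        key = f"home:{follower_id}"
        cache.zadd(key, {tweet_id: posted_at})
        cache.zremrangebyrank(key, 0, -MAX_TIMELINE_LEN - 1)  # cap timeline size
```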
Request flows
Detailed component design
Trade offs/Tech choices
Bottlenecks:
- Huge number of followers. If a famous person with millions of followers posts something, the tweet should show up in millions of people's home feed timelines within a short period of time. Fanning out on write to that many users is not efficient in the pure Push Model.
- Solution: hybrid approach
- Non-hot users: Push Model, still fan out on write, writing to all followers' home timeline caches
- Hot users: Pull Model, fan-in on read. When a user calls the Home Timeline API, the service reads the User Timeline tweet lists of all hot users the user follows, merges them with the user's own cached home timeline list, and then does the aggregation on the fly (sketched below).
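A sketch of the hybrid home-timeline read, again assuming the Redis-style timeline keys from above; get_followed_hot_users is a placeholder for a lookup that combines the Follower table with the is_hot_user flag, and the final ranking step stands in for whatever TopN algorithm is chosen.

```python
import heapq
import redis  # assumption: sorted-set timelines scored by post time

cache = redis.Redis()

def get_followed_hot_users(user_id: str) -> list:
    """Placeholder: followees of user_id whose is_hot_user flag is set."""
    return []

def read_home_timeline(user_id: str, page_size: int = 20) -> list:
    # 1) Push-model part: the precomputed home timeline (non-hot followees).
    candidates = cache.zrevrange(f"home:{user_id}", 0, page_size - 1, withscores=True)

    # 2) Pull-model part: fetch hot followees' user timelines on the fly.
    for hot_user_id in get_followed_hot_users(user_id):
        candidates += cache.zrevrange(f"user:{hot_user_id}", 0, page_size - 1,
                                      withscores=True)

    # 3) Merge and rank; here simply newest-first by score (post time), whereas a
    #    real TopN ranker would also weigh likes and author follower counts.
    top = heapq.nlargest(page_size, candidates, key=lambda pair: pair[1])
    return [tweet_id for tweet_id, _score in top]
```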
Failure scenarios/bottlenecks
Future improvements
Scalability and Availability:
- Data sharding (a shard-routing sketch follows at the end of this section):
- Store data in a shard that resides in a region close to the user
- Tweets table sharding
- shard by creation time
- cons: hot/cold issue (recent shards high QPS)
- shard by userId
- Pros:
- simple
- query user's timeline is straightforward
- Cons:
- Home timeline still needs to query multiple shards
- Non-uniform storage distribution across users
- Hot users make their shard a high-QPS hotspot
- shard by tweetId
- Pros:
- Uniform distribution
- High availability
- Cons:
- Need to query all shards for both Home timeline and User Timeline
- Load balancing
- placed between
- user & app server
- app server & cache server
- app server & db server
- Data caching:
- CDN for media files
- application cache for precomputed timelines and hot tweets (as described above)
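Finally, a small sketch of how shard routing could look for the sharding options above; the hash function and shard count are illustrative assumptions, and a production system would more likely use consistent hashing so shards can be added without massive re-sharding.

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(key: str) -> int:
    """Stable hash of an id (userId or tweetId) -> shard number."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Shard by userId: a user's tweets stay together, so their user timeline is a
# single-shard query, but a hot user concentrates load on one shard.
user_shard = shard_for("u-123456")

# Shard by tweetId: uniform distribution, but home/user timeline reads must
# scatter-gather across all shards.
tweet_shard = shard_for("t-9f3c2a")
```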