My Solution for Design Twitter with Score: 8/10
by horizon6080
System requirements
Functional:
- Users can send tweets of up to 140 characters (roughly 150 bytes of text)
- Users can follow other users
- Users can like other users' tweets
- User's timeline
- lists the tweets the user has posted, in reverse chronological order
- A user's home feed timeline shows an aggregation of all tweets from the users they follow
- The home feed shows the top K most popular tweets, ranked by the number of likes each tweet has received and the number of followers its author has
Non-Functional:
- Scalability. 500M DAU
- Availability and low latency. Users must be able to post tweets quickly, and when a user opens the home feed, the first 10 tweets should show up within 500 ms.
- Consistency can be sacrificed for availability. Strong consistency like banking transactions is not needed; eventual consistency is okay. If a user sends a tweet, followers in the same geographic region can see it within 1 second, while users in other geographic regions may only see it after 30 seconds. This is acceptable.
- Security, content moderation, and anti-abuse protection.
Capacity estimation
500 M DAU
Each user sends 2 tweets per day on average: 1B tweets per day.
Each tweet has ~140 bytes of text; with metadata, about 500 bytes.
Storage: 1B tweets * 500 bytes -> 500 GB per day (storage for 2 years: 500 GB * 365 * 2 -> ~365 TB, round to 400 TB).
Database storage required: roughly 500 TB.
It is better to use a NoSQL database; a single relational database typically tops out at around 100 TB.
Document-based DB: MongoDB or DynamoDB
QPS: 500M / day (~10^5 seconds) -> 500 * 10^6 / 10^5 -> 5,000 QPS on average
If one API call to pull tweets takes 500 ms of core time: 5,000 QPS * 0.5 s -> ~2,500 cores busy on average; with headroom for peaks (assume ~4x), ~10,000 cores
10,000 cores / 8 cores per instance -> ~1,250 machine instances (provision more for redundancy)
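As a sanity check, here is a back-of-envelope script for the numbers above; the 4x peak factor and 8 cores per instance are assumptions I am adding for illustration, not requirements from the problem.

```python
# Back-of-envelope check for the storage / QPS / core estimates above.
DAU = 500e6                      # daily active users
TWEETS_PER_USER = 2              # tweets per user per day
TWEET_BYTES = 500                # 140-char text plus metadata
SECONDS_PER_DAY = 1e5            # 86,400 rounded for easy math

tweets_per_day = DAU * TWEETS_PER_USER                    # 1e9 tweets/day
text_gb_per_day = tweets_per_day * TWEET_BYTES / 1e9      # ~500 GB/day
text_tb_2_years = text_gb_per_day * 365 * 2 / 1e3         # ~365 TB, round up to 400-500 TB

avg_qps = DAU / SECONDS_PER_DAY                           # ~5,000 QPS
call_latency_s = 0.5                                      # 500 ms of core time per call
peak_factor = 4                                           # assumption: peak ~4x average
cores = avg_qps * call_latency_s * peak_factor            # Little's law -> ~10,000 cores
instances = cores / 8                                     # assumption: 8 cores/instance -> ~1,250

print(text_gb_per_day, text_tb_2_years, cores, instances)
```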
Each user:
- Visit home timeline API 5 times per day
- each timeline has 200 tweets
- 500M * 5 / day -> 25 * 10^8 / 10^5 -> 25,000 QPS
- downloading bandwidth: with ~20 tweets per page at ~0.5 KB each, 25,000 QPS * 10 KB -> ~250 MB/s of text (returning full 200-tweet timelines would be ~2.5 GB/s)
- Visit other users' timelines 5 times per day
- each timeline has 200 tweets
- 25K QPS
- another ~250 MB/s of text download bandwidth under the same assumptions
- 50,000 read QPS in total; at 500 ms of core time per API call that is ~25,000 cores busy on average -> roughly 1,500-3,000 server instances (8-16 cores each), more with peak headroom
Each user sends 2 tweets per day on average: 1B tweets per day.
- 10% of tweets have video (2 MB): 10^9 * 0.1 * 2 MB -> 2 * 10^5 GB -> 200 TB per day
- 20% of tweets have a picture (200 KB): 10^9 * 0.2 * 200 KB -> 40 TB per day
- tweet text with metadata: 500 GB per day (500 bytes * 1B tweets)
Bandwidth:
- Read (download, i.e. server egress) bandwidth:
- 50,000 QPS (each request returns ~20 tweets) -> 5 * 10^4 * 20 -> 10^6 tweets/second
- 10^6 * 0.5 KB -> 500 MB / second (text)
- 10^6 * 2MB * 10% -> 200 GB /second (video)
- 10^6 * 200 KB * 20% -> 40 GB/second (image)
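The same kind of sanity check for the read path; the 20-tweets-per-page figure and the media sizes are the ones assumed above.

```python
# Back-of-envelope check for read QPS and download (egress) bandwidth.
DAU = 500e6
TIMELINE_VISITS = 10             # 5 home-timeline + 5 user-timeline visits per user/day
SECONDS_PER_DAY = 1e5
TWEETS_PER_PAGE = 20             # tweets returned per request
TEXT_KB, IMAGE_KB, VIDEO_KB = 0.5, 200, 2_000
IMAGE_RATIO, VIDEO_RATIO = 0.2, 0.1

read_qps = DAU * TIMELINE_VISITS / SECONDS_PER_DAY        # ~50,000 QPS
tweets_per_sec = read_qps * TWEETS_PER_PAGE               # ~1e6 tweets/s served

text_mb_per_s = tweets_per_sec * TEXT_KB / 1e3                     # ~500 MB/s
image_gb_per_s = tweets_per_sec * IMAGE_RATIO * IMAGE_KB / 1e6     # ~40 GB/s
video_gb_per_s = tweets_per_sec * VIDEO_RATIO * VIDEO_KB / 1e6     # ~200 GB/s

print(read_qps, tweets_per_sec, text_mb_per_s, image_gb_per_s, video_gb_per_s)
```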
Bottlenecks:
- number_of_likes. If a famous person posts something and millions of users click "like" within a few minutes, the database server would be overwhelmed
- One approach is to break the like counter into multiple (say, 100) sub-counters and make different database nodes responsible for different sub-counters (see the sketch below)
- number of followers. If a famous person with millions of followers posts something, the tweet should show up in millions of people's home feeds within a short period of time. A message queue helps achieve this.
It is worth noting that millions of users might be viewing the same content concurrently.
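A minimal sketch of the sub-counter idea, assuming a Redis-style store accessed via redis-py; the key layout and the shard count of 100 are illustrative choices, not part of the design above.

```python
import random
import redis  # assumption: a Redis-style store holds the sub-counters

r = redis.Redis()
NUM_SUB_COUNTERS = 100  # sub-counters per tweet, spread across cache/DB nodes

def like(tweet_id: str) -> None:
    """Increment a random sub-counter so no single key (or node) becomes hot."""
    shard = random.randrange(NUM_SUB_COUNTERS)
    r.incr(f"likes:{tweet_id}:{shard}")

def like_count(tweet_id: str) -> int:
    """Sum all sub-counters; this read can itself be cached, since exact
    real-time counts are not required (eventual consistency is acceptable)."""
    keys = [f"likes:{tweet_id}:{i}" for i in range(NUM_SUB_COUNTERS)]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```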
API design
- postTweet(userToken, string tweet)
- deleteTweet(userToken, string tweetId)
- likeOrDislikeTweet(userToken, string tweetId, bool like)
- readHomeTimeLine(userToken, int pageSize, opt string pageToken)
- readUserTimeLine(userToken, string userId, int pageSize, opt string pageToken)
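To make the contract a bit more concrete, here is one possible rendering of the same surface in code; the Tweet/Page shapes and the opaque next-page token are my assumptions, the design above only fixes the logical signatures.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tweet:
    tweet_id: str
    created_by: str
    posted_time: str
    content: str

@dataclass
class Page:
    tweets: List[Tweet]
    next_page_token: Optional[str]  # None when there are no further pages

class TwitterAPI:
    """Logical API from the list above; authentication travels in user_token."""

    def post_tweet(self, user_token: str, tweet: str) -> str: ...  # returns tweet_id
    def delete_tweet(self, user_token: str, tweet_id: str) -> None: ...
    def like_or_dislike_tweet(self, user_token: str, tweet_id: str, like: bool) -> None: ...
    def read_home_timeline(self, user_token: str, page_size: int,
                           page_token: Optional[str] = None) -> Page: ...
    def read_user_timeline(self, user_token: str, user_id: str, page_size: int,
                           page_token: Optional[str] = None) -> Page: ...
```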
Database design
Tweet (document NoSQL database):
- tweet_id: primary key
- created_by: user ID
- posted_time
- content: string of 140 chars
- media_link: link to picture or video content (which can be stored in S3)
- number_of_likes
- hashtags: list of hashtag strings used in the tweet
- users mentioned: list of users mentioned in the tweet
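An example document for this schema (all values are made up; media_link would point at object storage such as S3, as noted above):

```python
# Illustrative Tweet document; number_of_likes may instead be derived from
# the sharded sub-counters described earlier.
tweet_doc = {
    "tweet_id": "t-9f3c2a",
    "created_by": "u-123456",
    "posted_time": "2025-06-01T12:34:56Z",
    "content": "Hello world, a tweet of at most 140 characters.",
    "media_link": "https://s3.example.com/media/t-9f3c2a/photo.jpg",
    "number_of_likes": 0,
    "hashtags": ["#hello", "#world"],
    "users_mentioned": ["u-654321"],
}
```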
User (can be stored in SQL database):
- user_id: primary key: Integer
- is_hot_user: boolean
- name
- nickname
- Date of birth
- gender
- lastLogin: datetime
Follower Table (can be stored in SQL database):
- userId1 (follower): integer
- userId2 (followee): integer
Timelines data (Cache or NoSQL database)
- home timeline
- user_id
- {tweet_id}
- pre-computed
- can be large, but does not need to be: with pagination the cache only has to hold the most recent tweets, and the front-end's pagination/scrolling design keeps users from reaching the very last page
- user timeline
- user_id
- {tweet_id}
- can be large for hot users
- tweets
- tweet_id
- content
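One way to realize these timeline structures, assuming a Redis-style cache (the key names and the 800-entry cap are illustrative assumptions): each timeline is a sorted set of tweet ids scored by post time, trimmed so it never grows without bound.

```python
import time
import redis  # assumption: Redis-style cache for precomputed timelines

r = redis.Redis()
MAX_TIMELINE_LEN = 800  # keep only the most recent N tweet ids per timeline

def push_to_timeline(key: str, tweet_id: str, posted_at: float) -> None:
    r.zadd(key, {tweet_id: posted_at})
    r.zremrangebyrank(key, 0, -MAX_TIMELINE_LEN - 1)  # drop the oldest entries

def read_timeline(key: str, page_size: int, offset: int = 0) -> list:
    # newest first, one page at a time (matches the pagination note above)
    return r.zrevrange(key, offset, offset + page_size - 1)

# "home:<user_id>" holds a home timeline, "user:<user_id>" a user timeline
push_to_timeline("user:u-123456", "t-9f3c2a", time.time())
first_page = read_timeline("user:u-123456", page_size=20)
```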
High-level design
Scenario 1: user posts a tweet
User -> LB -> tweet writer service -> DB
-> cache
Scenario 2: user visits a specific user's timeline
User -> LB -> tweet timeline service -> DB or cache
Very low latency is needed (< 200 ms), so a cache is required.
Naive approach: query the database by userId and fetch that user's tweets in reverse chronological order (DB reads are slow).
Fast approach: query the cache. The timeline data of a hot user is stored in the cache whenever that user posts a new tweet.
When to update the cache: when a user posts a new tweet (fan-out on write). The tweet-write event updates the cached tweet list of the author's own user timeline, as sketched below.
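A minimal sketch of that write path, assuming the same Redis-style timeline cache as above; save_tweet_to_db is a placeholder for the real NoSQL write.

```python
import time
import redis  # assumption: cache key "user:<user_id>" holds the user's timeline

cache = redis.Redis()

def save_tweet_to_db(tweet_id: str, user_id: str, content: str) -> None:
    """Placeholder for the write to the tweets store (e.g., MongoDB/DynamoDB)."""

def handle_post_tweet(user_id: str, tweet_id: str, content: str) -> None:
    save_tweet_to_db(tweet_id, user_id, content)
    # Fan out on write to the author's own cached user timeline, so reads of
    # this user's timeline are served from the cache instead of the database.
    cache.zadd(f"user:{user_id}", {tweet_id: time.time()})
```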
Scenario 3: user visits the home timeline
User -> LB -> tweet timeline service-> Database
-> Cache
Naive approach (Pull Model):
Query all the userIds that the user follows, then query the new tweets of those userIds, merge them with some TopN algorithm, and return the result to the user. (DB reads and in-memory compute on the fly are too slow.)
Improvement (Push Model):
Use a cache. Store a user's home timeline in the cache, so the service does not need to fetch data and do the compute and aggregation on the fly.
When to update the cache: when a user posts a new tweet (fan-out on write). The tweet-write event updates the cached home timeline tweet lists of all the author's followers (see the sketch after the pros/cons list).
Pros:
Read O(1)
Cons:
- Write is O(n), where n is the number of followers
- async task
- Eventual consistency
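A sketch of the asynchronous fan-out worker behind the Push Model; the Redis-style cache and the get_follower_ids placeholder are assumptions, and in practice the "tweet posted" event would arrive via the message queue mentioned earlier.

```python
import redis  # assumption: "home:<user_id>" sorted sets hold home timelines

cache = redis.Redis()
MAX_TIMELINE_LEN = 800

def get_follower_ids(user_id: str) -> list:
    """Placeholder: in reality this queries the Follower table."""
    return []

def handle_tweet_posted(event: dict) -> None:
    """Queue consumer: fan out a new tweet to every follower's home timeline.

    Write cost is O(n) in the number of followers, but it runs asynchronously,
    so followers see the tweet with eventual consistency while reads stay O(1).
    """
    tweet_id = event["tweet_id"]
    author_id = event["author_id"]
    posted_at = event["posted_at"]
    for follower_id in get_follower_ids(author_id):
        key = f"home:{follower_id}"
        cache.zadd(key, {tweet_id: posted_at})
        cache.zremrangebyrank(key, 0, -MAX_TIMELINE_LEN - 1)  # cap timeline size
```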
Request flows
Detailed component design
Trade offs/Tech choices
Bottlenecks:
- Huge number of followers. If a famous person with millions of followers posts something, the tweet should show up in millions of people's home feed timelines within a short period of time. Fanning out on write to that many users is not efficient in the pure Push Model.
- Solution: hybrid approach
- Non-hot users: Push Model, still fan out on write, writing to all followers' home timeline caches
- Hot users: Pull Model, fan-in on read. When a user calls the Home Timeline API, the service reads the User Timeline tweet lists of all hot users the user follows, merges them with the user's own cached home timeline list, and then does the aggregation on the fly (sketched below).
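A sketch of the hybrid home-timeline read, again assuming the Redis-style timeline keys from above; get_followed_hot_users is a placeholder for a lookup that combines the Follower table with the is_hot_user flag, and the final ranking step stands in for whatever TopN algorithm is chosen.

```python
import heapq
import redis  # assumption: sorted-set timelines scored by post time

cache = redis.Redis()

def get_followed_hot_users(user_id: str) -> list:
    """Placeholder: followees of user_id whose is_hot_user flag is set."""
    return []

def read_home_timeline(user_id: str, page_size: int = 20) -> list:
    # 1) Push-model part: the precomputed home timeline (non-hot followees).
    candidates = cache.zrevrange(f"home:{user_id}", 0, page_size - 1, withscores=True)

    # 2) Pull-model part: fetch hot followees' user timelines on the fly.
    for hot_user_id in get_followed_hot_users(user_id):
        candidates += cache.zrevrange(f"user:{hot_user_id}", 0, page_size - 1,
                                      withscores=True)

    # 3) Merge and rank; here simply newest-first by score (post time), whereas a
    #    real TopN ranker would also weigh likes and author follower counts.
    top = heapq.nlargest(page_size, candidates, key=lambda pair: pair[1])
    return [tweet_id for tweet_id, _score in top]
```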
Failure scenarios/bottlenecks
Future improvements
Scalability and Availability:
- Data sharding (a shard-routing sketch follows at the end of this section):
- Store data in a shard that resides in a region close to the user
- Tweets table sharding
- shard by creation time
- cons: hot/cold issue (recent shards high QPS)
- shard by userId
- Pros:
- simple
- query user's timeline is straightforward
- Cons:
- Home timeline still needs to query multiple shards
- Non-uniform storage distribution across users
- Hot users make their shard a high-QPS hotspot
- shard by tweetId
- Pros:
- Uniform distribution
- High availability
- Cons:
- Need to query all shards for both Home timeline and User Timeline
- Load balancing
- placed between
- user & app server
- app server & cache server
- app server & db server
- Data caching:
- CDN for media files
- application cache for precomputed timelines and hot tweets (as described above)
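Finally, a small sketch of how shard routing could look for the sharding options above; the hash function and shard count are illustrative assumptions, and a production system would more likely use consistent hashing so shards can be added without massive re-sharding.

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(key: str) -> int:
    """Stable hash of an id (userId or tweetId) -> shard number."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Shard by userId: a user's tweets stay together, so their user timeline is a
# single-shard query, but a hot user concentrates load on one shard.
user_shard = shard_for("u-123456")

# Shard by tweetId: uniform distribution, but home/user timeline reads must
# scatter-gather across all shards.
tweet_shard = shard_for("t-9f3c2a")
```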