My Solution for Design Twitter with Score: 8/10

by horizon6080

System requirements


Functional:

  1. Users can send tweets of up to 140 characters (stored as a string of ~150 bytes)
  2. Users can follow other users
  3. Users can like other users' tweets
  4. User timeline: lists the tweets a user has posted, in reverse chronological order
  5. Home feed timeline: shows an aggregation of tweets from all the users a user is following
  6. The home feed shows the top K popular tweets, ranked by the number of likes a tweet has received and the number of followers its author has


Non-Functional:

  1. Scalability: 500M DAU
  2. Availability and low latency: users must be able to post tweets quickly, and when a user opens the home feed, the first 10 tweets should show up within 500 ms
  3. Consistency can be sacrificed for availability. The system does not need strong consistency like banking transactions; eventual consistency is okay. If a user sends a tweet, followers in the same geographic region should see it within 1 second, while users in other geographic regions may see it up to 30 seconds later. This is acceptable.
  4. Security, content moderation, and anti-abuse protection


Capacity estimation

500 M DAU

Each user sends 2 tweets per day on average: 1B tweets per day.

Each tweet is ~140 bytes of text; with metadata, roughly 500 bytes.

Storage: 1B * 500 B -> 500 GB per day (storage for 2 years: 500 GB * 365 * 2 -> ~400 TB).

Database storage required (with some headroom): roughly 500 TB.

It is better to use a NoSQL database; a typical single relational database deployment tops out at around 100 TB.

Document-based DB: MongoDB or DynamoDB


QPS: 500M requests per day / ~10^5 seconds per day -> 5 * 10^8 / 10^5 -> 5,000 QPS on average

If one API call to pull tweets takes 500 ms of core time: 5,000 QPS * 0.5 s ≈ 2,500 cores at steady state; budget ~10,000 cores to allow for peak traffic.

10,000 cores / 8 cores per instance -> ~1,250 machine instances (provision closer to 2,000 for redundancy)
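To keep the arithmetic honest, here is a quick back-of-envelope sketch of the numbers above (every input is a rough assumption from this section, not a measurement):

# Back-of-envelope capacity math; every input is a rough assumption from above.
DAU = 500_000_000
TWEETS_PER_USER_PER_DAY = 2
BYTES_PER_TWEET = 500        # ~140 chars of text plus metadata
SECONDS_PER_DAY = 100_000    # ~86,400, rounded for mental math

tweets_per_day = DAU * TWEETS_PER_USER_PER_DAY                 # 1e9
text_gb_per_day = tweets_per_day * BYTES_PER_TWEET / 1e9       # ~500 GB
text_tb_two_years = text_gb_per_day * 365 * 2 / 1e3            # ~365 TB

avg_qps = 500_000_000 / SECONDS_PER_DAY                        # ~5,000 QPS
core_seconds_per_call = 0.5
steady_state_cores = avg_qps * core_seconds_per_call           # ~2,500 cores
provisioned_cores = 10_000                                     # peak headroom
instances = provisioned_cores / 8                              # ~1,250 machines

print(tweets_per_day, text_gb_per_day, text_tb_two_years)
print(avg_qps, steady_state_cores, instances)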


Each user:

  • Visits the home timeline API 5 times per day
  • each timeline holds up to 200 tweets, served in pages of ~20
  • 500M * 5 / day -> (25 * 10^8) / (10^5) -> 25 * 10^3 -> 25,000 QPS
  • download bandwidth (text only, one ~20-tweet page of ~500-byte tweets per call): 25,000 * 10 KB -> ~250 MB/s
  • Visits a specific user's timeline 5 times per day
  • each timeline holds up to 200 tweets
  • another 25K QPS
  • another ~250 MB/s of download bandwidth
  • 50,000 read QPS in total (at 500 ms of core time per API call) -> ~25,000 cores -> roughly 1,500-3,000 server instances (8-16 cores each)


Each user sends 2 tweets per day on average: 1B tweets per day.

  • 10% of tweets have a video (2 MB): 10^9 * 0.1 * 2 * 10^6 B -> 2 * 10^5 GB -> 200 TB per day
  • 20% of tweets have a picture (200 KB): 40 TB per day
  • tweet text with metadata: 500 GB per day (500 bytes * 1B tweets)


Bandwidth:

  • Read (download) bandwidth, i.e., egress from our servers:
  • 50,000 QPS (each request returns ~20 tweets) -> 5 * 10^4 * 20 -> 10^6 tweets/second
  • 10^6 * 0.5 KB -> 500 MB/second (text)
  • 10^6 * 2 MB * 10% -> 200 GB/second (video)
  • 10^6 * 200 KB * 20% -> 40 GB/second (image)
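The read-path bandwidth and daily media volume can be sanity-checked the same way (the page size, media percentages, and object sizes are the assumptions above):

# Read-path bandwidth and daily media storage; inputs are the assumptions above.
READ_QPS = 50_000                  # home timeline + user timeline reads
TWEETS_PER_PAGE = 20
TEXT_BYTES, VIDEO_BYTES, IMAGE_BYTES = 500, 2_000_000, 200_000
VIDEO_RATIO, IMAGE_RATIO = 0.10, 0.20
TWEETS_PER_DAY = 1_000_000_000

tweets_served_per_sec = READ_QPS * TWEETS_PER_PAGE                           # 1e6
text_mb_per_sec = tweets_served_per_sec * TEXT_BYTES / 1e6                   # ~500 MB/s
video_gb_per_sec = tweets_served_per_sec * VIDEO_BYTES * VIDEO_RATIO / 1e9   # ~200 GB/s
image_gb_per_sec = tweets_served_per_sec * IMAGE_BYTES * IMAGE_RATIO / 1e9   # ~40 GB/s

video_tb_per_day = TWEETS_PER_DAY * VIDEO_RATIO * VIDEO_BYTES / 1e12         # ~200 TB
image_tb_per_day = TWEETS_PER_DAY * IMAGE_RATIO * IMAGE_BYTES / 1e12         # ~40 TB

print(text_mb_per_sec, video_gb_per_sec, image_gb_per_sec)
print(video_tb_per_day, image_tb_per_day)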


Bottlenecks:

  1. number_of_likes. If a famous person posts something and millions of users click "like" within a few minutes, it would overwhelm the database server.
  2. One approach to overcome this is to break the like counter into multiple (say, 100) sub-counters and make a different database node responsible for each sub-counter (see the counter sketch below).
  3. Number of followers. If a famous person with millions of followers posts something, the tweet should show up in millions of people's home feeds within a short period of time. It is better to use a message queue to achieve this.


It is worth noting that millions of users might be viewing the same content concurrently.
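A minimal in-memory sketch of the sub-counter idea from bottleneck #2 (in production each sub-counter would live on a different database or cache node; the shard count of 100 follows the assumption above):

import random

NUM_SUB_COUNTERS = 100

# sub_counters[tweet_id] is a list of 100 independent counters; each slot
# could be owned by a different database node to spread the write load.
sub_counters = {}

def like(tweet_id: str) -> None:
    counters = sub_counters.setdefault(tweet_id, [0] * NUM_SUB_COUNTERS)
    # A random sub-counter is incremented, so millions of concurrent likes
    # do not contend on a single hot row.
    counters[random.randrange(NUM_SUB_COUNTERS)] += 1

def number_of_likes(tweet_id: str) -> int:
    # Reads sum all sub-counters; the total can be cached and refreshed lazily.
    return sum(sub_counters.get(tweet_id, []))

for _ in range(3):
    like("t1")
print(number_of_likes("t1"))   # 3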


API design

  1. postTweet(userToken, string tweet)
  2. deleteTweet(userToken, string tweetId)
  3. likeOrDislikeTweet(userToken, string tweetId, bool like)
  4. readHomeTimeLine(userToken, int pageSize, opt string pageToken)
  5. readUserTimeLine(userToken, string userId, int pageSize, opt string pageToken)
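As a sketch only (not a finalized contract), the same API surface could be written with explicit request/response shapes; field names such as next_page_token are illustrative assumptions:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TimelinePage:
    tweet_ids: List[str]
    next_page_token: Optional[str]   # None once the last page is reached

class TwitterApi:
    def post_tweet(self, user_token: str, tweet: str) -> str:
        """Returns the new tweetId; tweets longer than 140 chars are rejected."""
        raise NotImplementedError

    def delete_tweet(self, user_token: str, tweet_id: str) -> None:
        raise NotImplementedError

    def like_or_dislike_tweet(self, user_token: str, tweet_id: str, like: bool) -> None:
        raise NotImplementedError

    def read_home_timeline(self, user_token: str, page_size: int,
                           page_token: Optional[str] = None) -> TimelinePage:
        raise NotImplementedError

    def read_user_timeline(self, user_token: str, user_id: str, page_size: int,
                           page_token: Optional[str] = None) -> TimelinePage:
        raise NotImplementedError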



Database design

Tweet (document NoSQL database):

  • tweet_id: primary key
  • created_by: user ID
  • posted_time
  • content: string of 140 chars
  • media_link: link to a picture or video (which can be stored in S3)
  • number_of_likes
  • hashtags: list of hashtag strings used in the tweet
  • users_mentioned: list of users mentioned in the tweet
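For concreteness, a single tweet document might look like the sketch below; every value (including the media URL) is made up for illustration:

# Illustrative tweet document for the NoSQL store; all values are invented.
tweet_doc = {
    "tweet_id": "t-123456789",
    "created_by": "u-42",
    "posted_time": "2024-06-01T12:34:56Z",
    "content": "Hello world! #firsttweet",
    "media_link": "https://media.example.com/t-123456789.jpg",  # hypothetical S3/CDN URL
    "number_of_likes": 0,
    "hashtags": ["firsttweet"],
    "users_mentioned": [],
}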


User (can be stored in SQL database):

  • user_id: primary key: Integer
  • is_hot_user: boolean
  • email
  • name
  • nickname
  • Date of birth
  • gender
  • lastLogin: datetime


Follower Table (can be stored in SQL database):

  • userId1 (follower): integer
  • userId2 (followee): integer
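A minimal relational sketch of the Follower table and the two lookups the feed services need, using SQLite only for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE follower (
        user_id1 INTEGER NOT NULL,   -- follower
        user_id2 INTEGER NOT NULL,   -- followee
        PRIMARY KEY (user_id1, user_id2)
    )
""")
conn.executemany("INSERT INTO follower VALUES (?, ?)", [(1, 2), (1, 3), (4, 2)])

def get_followee_ids(follower_id: int) -> list:
    # Pull model / hybrid read: whose tweets should this user's feed include?
    rows = conn.execute("SELECT user_id2 FROM follower WHERE user_id1 = ?", (follower_id,))
    return [r[0] for r in rows]

def get_follower_ids(followee_id: int) -> list:
    # Fan-out on write: whose home timelines must be updated?
    rows = conn.execute("SELECT user_id1 FROM follower WHERE user_id2 = ?", (followee_id,))
    return [r[0] for r in rows]

print(get_followee_ids(1))   # [2, 3]
print(get_follower_ids(2))   # [1, 4]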


Timelines data (Cache or NoSQL database)

  1. home timeline (a layout sketch of these caches follows this list)
     • user_id
     • {tweet_id}
     • pre-computed
     • can be large, but does not need to be: with pagination the front end only loads a page at a time, and users rarely scroll to the last page
  2. user timeline
     • user_id
     • {tweet_id}
     • can be large for a hot user
  3. tweets
     • tweet_id
     • content



High-level design

Scenario 1: user posts a tweet

User -> LB -> tweet writer service -> DB and cache


Scenario 2: user visits a specific user's timeline

User -> LB -> tweet timeline service -> DB or cache


Very low latency is needed (< 200 ms), so a cache is required.

Naive approach: query the database by the specific userId and fetch that user's tweets in reverse chronological order (DB reads are slow).

Fast approach: query the cache. The timeline data of a hot user is written to the cache whenever that user posts a new tweet.

When to update the cache: when the user posts a new tweet (fan-out on write), the tweet-write event updates the cached tweet list of that user's own timeline. A read-through sketch of the read path follows below.
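A minimal read-through sketch of that user-timeline read path, reusing the user_timeline dict from the cache sketch above; db_fetch_user_tweets is a hypothetical helper standing in for the slow database query:

user_timeline = {}   # user_id -> [tweet_id, ...], kept fresh by fan-out on write

def db_fetch_user_tweets(user_id: str, limit: int = 200) -> list:
    # Placeholder for the slow path: SELECT by created_by, ORDER BY posted_time DESC.
    return []

def read_user_timeline(user_id: str, page_size: int = 20) -> list:
    cached = user_timeline.get(user_id)
    if cached is not None:
        return cached[:page_size]            # fast path: cache hit
    tweets = db_fetch_user_tweets(user_id)   # slow path: go to the database
    user_timeline[user_id] = tweets          # warm the cache for the next read
    return tweets[:page_size]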


Scenario 3: user visits the home timeline

User -> LB -> tweet timeline service -> DB or cache


Naive approach (Pull Model):

Query all the userIds the user is following, then query the new tweets of those userIds, merge these tweets with some top-N algorithm, and return the result to the user. (DB reads plus in-memory computation on the fly are too slow.)


Improvement (Push Model):

Use a cache. Store each user's home timeline in the cache, so that the service does not need to fetch data and compute the aggregation on the fly.

When to update the cache: when a user posts a new tweet (fan-out on write), the tweet-write event updates the cached home-timeline tweet lists of all of that user's followers, as sketched below.
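A sketch of this fan-out-on-write path; get_follower_ids is passed in as a hypothetical lookup, and the local deque stands in for a real message queue so the fan-out runs asynchronously:

from collections import deque

user_timeline = {}       # user_id -> [tweet_id, ...]
home_timeline = {}       # user_id -> [tweet_id, ...]
fanout_queue = deque()   # stand-in for a real message queue

def prepend(cache: dict, user_id: str, tweet_id: str, cap: int = 200) -> None:
    timeline = cache.setdefault(user_id, [])
    timeline.insert(0, tweet_id)
    del timeline[cap:]

def post_tweet(author_id: str, tweet_id: str, get_follower_ids) -> None:
    # 1) persist the tweet (omitted), 2) update the author's own user timeline,
    # 3) enqueue an async fan-out job for the author's followers.
    prepend(user_timeline, author_id, tweet_id)
    fanout_queue.append((author_id, tweet_id, get_follower_ids))

def run_fanout_worker() -> None:
    # Consumed asynchronously, which is why followers see the tweet with a
    # short delay (eventual consistency).
    while fanout_queue:
        author_id, tweet_id, get_follower_ids = fanout_queue.popleft()
        for follower_id in get_follower_ids(author_id):
            prepend(home_timeline, follower_id, tweet_id)

post_tweet("u1", "t99", get_follower_ids=lambda uid: ["u2", "u3"])
run_fanout_worker()
print(home_timeline)   # {'u2': ['t99'], 'u3': ['t99']}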


Pros:

Read O(1)


Cons:

  • Write is O(n), where n is the number of followers
  • requires asynchronous fan-out tasks
  • eventual consistency



Request flows







Detailed component design







Trade offs/Tech choices



Bottlenecks:

  1. number_of_likes. If a famous person posts something and millions of users click "like" within a few minutes, it would overwhelm the database server.
  2. One approach to overcome this is to break the like counter into multiple (say, 100) sub-counters and make a different database node responsible for each sub-counter.
  3. Huge number of followers. If a famous person with millions of followers posts something, the tweet should show up in millions of people's home feed timelines within a short period of time. Fanning out on write to up to a billion users is not efficient in the pure Push Model.
  4. Solution: a hybrid approach (see the merge sketch below)
  5. Non-hot users: Push Model. Still fan out on write, writing the tweet to all followers' caches.
  6. Hot users: Pull Model, i.e., fan-in on read. When a user calls the home timeline API, the service reads the user-timeline tweet lists of the hot users that this user follows, merges them with the user's own cached home-timeline tweet list, and does the aggregation on the fly.




Failure scenarios/bottlenecks







Future improvements



Scalability and Availability:

  1. Data sharding:
     • Store data in a shard that resides in a region close to the user
     • Tweets table sharding options:
        • shard by creation time
           • Cons: hot/cold issue (recent shards take most of the QPS)
        • shard by userId
           • Pros: simple; querying a user's timeline is straightforward
           • Cons: the home timeline still needs to query multiple shards; storage is non-uniformly distributed across users; a hot user turns one shard into a high-QPS hotspot
        • shard by tweetId (see the sketch below)
           • Pros: uniform distribution; high availability
           • Cons: need to query all shards for both the home timeline and the user timeline
  2. Load balancing, placed between:
     • user & app server
     • app server & cache server
     • app server & DB server
  3. Data caching:
     • CDN for media files
     • application cache (the timeline caches described above)
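A small sketch of shard selection for the tweetId option; the shard count and hash are assumptions, and a production system would likely use consistent hashing so that adding nodes does not reshuffle every tweet:

import hashlib

NUM_SHARDS = 1024   # assumed: many more logical shards than physical nodes

def shard_for_tweet(tweet_id: str) -> int:
    # A stable hash of tweetId gives a uniform spread of tweets across shards.
    digest = hashlib.md5(tweet_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def shards_for_timeline_query() -> range:
    # The cost of tweetId sharding: user/home timeline reads must scatter-gather
    # across every shard and merge the results.
    return range(NUM_SHARDS)

print(shard_for_tweet("t-123456789"))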