System requirements


Functional:

  1. User can post tweets
  2. User can follow other users
  3. User can view tweets from followed users on their home timeline
  4. User can view another user's profile home page
  5. Tweets are shown in reverse chronological order
  6. User can like a tweet
  7. A tweet can contain text and media files such as pictures



Non-Functional:

  1. Newly posted tweets should show up on timelines in real time
  2. Prioritize high availability over consistency





Capacity estimation

Daily active users: 1M

Tweets sent per day: 5M => write QPS: 5M / 24 / 3600 = 58

Tweet views per day: 500M => read QPS: 500M / 24 / 3600 = 5800

Favorites per day: 50M => write QPS: 580

Total QPS: 58 + 5800 + 580 ≈ 6500

Peak QPS estimation: 2 * 6500 = 13000

This can be handled by roughly 20 SQL machines


Reads far outnumber writes (roughly 9:1), so the design should optimize for the read path
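
The QPS arithmetic above can be reproduced with a short script. This is only a back-of-envelope sketch: the helper name `qps` is made up for illustration, and the peak factor of 2 is the one used in the estimate.

```python
SECONDS_PER_DAY = 24 * 3600


def qps(events_per_day: int) -> float:
    """Average queries per second for a daily event count."""
    return events_per_day / SECONDS_PER_DAY


tweet_write_qps = qps(5_000_000)      # ~58
read_qps = qps(500_000_000)           # ~5,787
favorite_write_qps = qps(50_000_000)  # ~579

total_qps = tweet_write_qps + read_qps + favorite_write_qps
peak_qps = 2 * total_qps  # peak factor of 2, as in the estimate
read_write_ratio = read_qps / (tweet_write_qps + favorite_write_qps)

print(round(total_qps), round(peak_qps), round(read_write_ratio, 1))
```

The exact total is a little under the rounded figure of 6500, and the read/write ratio comes out at about 9:1.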


Storage estimation:

One tweet = 200 bytes of text + 5 MB of media

Assuming 20% of tweets contain media

One day: (5 MB + 200 bytes) * 1M + 200 bytes * 4M ≈ 5 TB

If we store the data for 50 years: 50 * 365 * 5 TB ≈ 91,000 TB (about 91 PB)
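
A quick check of the storage arithmetic under the stated assumptions (200 bytes of text, 5 MB of media, 20% of 5M daily tweets carrying media). The constant names are illustrative, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
TEXT_BYTES = 200
MEDIA_BYTES = 5 * 10**6      # 5 MB, decimal units (assumption)
TWEETS_PER_DAY = 5_000_000
MEDIA_PERCENT = 20

media_tweets = TWEETS_PER_DAY * MEDIA_PERCENT // 100
text_only_tweets = TWEETS_PER_DAY - media_tweets

bytes_per_day = (
    media_tweets * (TEXT_BYTES + MEDIA_BYTES)
    + text_only_tweets * TEXT_BYTES
)
bytes_50_years = 50 * 365 * bytes_per_day

print(f"per day:  {bytes_per_day / 10**12:.2f} TB")
print(f"50 years: {bytes_50_years / 10**15:.1f} PB")
```

Media dominates the total: the text of all 5M tweets adds only about 1 GB per day.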




API design


  1. User can post tweets
       POST /v1/tweet
       body {auth_token, user_id, content}
  2. User can follow other users
       POST /v1/follow
       body {auth_token, user_id, follow_user_id}
  3. User can view tweets from followed users on their home timeline
       GET /v1/home:user_id
  4. User can view another user's profile home page
       GET /v1/profile:user_id
  5. User can like a tweet
       POST /v1/favorite
       body {auth_token, user_id, tweet_id}
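
As a rough illustration, the three POST bodies could be modelled as typed request objects. The class names, the `ROUTES` table, the `parse_body` helper and the field types (integer ids, string token) are all assumptions made for this sketch and are not part of the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostTweetRequest:
    auth_token: str
    user_id: int
    content: str


@dataclass(frozen=True)
class FollowRequest:
    auth_token: str
    user_id: int
    follow_user_id: int


@dataclass(frozen=True)
class FavoriteRequest:
    auth_token: str
    user_id: int
    tweet_id: int


# (method, path) -> request type, mirroring the POST endpoints in the list
ROUTES = {
    ("POST", "/v1/tweet"): PostTweetRequest,
    ("POST", "/v1/follow"): FollowRequest,
    ("POST", "/v1/favorite"): FavoriteRequest,
}


def parse_body(method: str, path: str, body: dict):
    """Build the typed request for a route.

    Raises KeyError for an unknown route and TypeError when the body has
    missing or unexpected fields.
    """
    request_type = ROUTES[(method.upper(), path)]
    return request_type(**body)
```

A gateway could call `parse_body("POST", "/v1/tweet", {...})` before routing, so malformed bodies are rejected early.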



Database design

Tweet table - Store tweet info

  • TweetId
  • OwnerId
  • Date
  • Text
  • Media link
  • Like count


User table - Store user info

  • UserId
  • UserName
  • RegisterDate
  • Follower count
  • Country
  • Gender
  • Birthday


Like table - Store like info

  • TweetId
  • LikedUserId


Follow table - Store follow info

  • UserId
  • FollowerId


Timeline table - Store timeline info (written by the fanout service)

  • UserId
  • TweetId
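
The five tables can be sketched as SQL using Python's built-in `sqlite3` module. Column types, primary keys, and the table name `tweet_like` (chosen to avoid the SQL keyword LIKE) are assumptions for illustration. A production system would likely use a different store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user (
    user_id        INTEGER PRIMARY KEY,
    user_name      TEXT NOT NULL,
    register_date  TEXT NOT NULL,
    follower_count INTEGER NOT NULL DEFAULT 0,
    country        TEXT,
    gender         TEXT,
    birthday       TEXT
);
CREATE TABLE tweet (
    tweet_id   INTEGER PRIMARY KEY,
    owner_id   INTEGER NOT NULL,
    date       TEXT NOT NULL,
    text       TEXT NOT NULL,
    media_link TEXT,
    like_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE tweet_like (
    tweet_id      INTEGER NOT NULL,
    liked_user_id INTEGER NOT NULL,
    PRIMARY KEY (tweet_id, liked_user_id)
);
CREATE TABLE follow (
    user_id     INTEGER NOT NULL,
    follower_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, follower_id)
);
CREATE TABLE timeline (
    user_id  INTEGER NOT NULL,
    tweet_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, tweet_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)


def timeline_tweet_ids(user_id: int, limit: int) -> list:
    """Newest-first tweet ids for one user, assuming ids sort by time."""
    rows = conn.execute(
        "SELECT tweet_id FROM timeline WHERE user_id = ? "
        "ORDER BY tweet_id DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

The composite primary keys make a repeated like or follow a constraint violation rather than a silent duplicate row.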



High-level design

Client

  • End user client

Media file CDN

  • Store media files for tweets to ensure they are highly available

Load balancer

  • Ensure requests are evenly distributed across servers

API Gateway / Webapp server

  • Serve the web page to the end user
  • Rate limiting
  • Route API requests to corresponding services

Tweet service

  • Handle post tweet request

Fanout service

  • Fan out a newly posted tweet to followers' timelines

Message queue - Kafka

  • Pub/sub for "tweet posted" messages between the tweet service and the fanout service

Follow service

  • Handle user follow request

Favorite service

  • Handle tweet like request

Home/profile service

  • Handle timeline request

Timeline Cache

  • Cache timelines so that timeline reads stay fast and highly available





Request flows

Post tweet

  1. The user posts a tweet from the client. A POST /v1/tweet request passes through the load balancer and API gateway to the tweet service
  2. The tweet service writes the new tweet into the tweet table
  3. The tweet service publishes a "tweet posted" message to the message queue
  4. The fanout service subscribes to the message queue. When it receives a message, it adds the new tweet to the timeline table entries of the author and the author's followers, and to the timeline cache
  5. On the client side, the author sees the new tweet on their profile, and followers see it on their home timelines
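
The post-tweet flow can be sketched in memory as follows. A `deque` stands in for the Kafka topic and plain dictionaries stand in for the tables. All names here are illustrative, not part of the design.

```python
from collections import defaultdict, deque

tweet_table = {}                    # tweet_id -> tweet record
follow_table = defaultdict(set)     # user_id -> ids of that user's followers
timeline_table = defaultdict(list)  # user_id -> tweet ids, newest first
message_queue = deque()             # stand-in for the Kafka topic


def post_tweet(tweet_id, owner_id, content):
    """Tweet service: persist the tweet, then publish a 'tweet posted' message."""
    tweet_table[tweet_id] = {"owner_id": owner_id, "content": content}
    message_queue.append({"tweet_id": tweet_id, "owner_id": owner_id})


def run_fanout():
    """Fanout service: drain the queue and add each tweet to the timelines
    of the author and of every follower."""
    while message_queue:
        msg = message_queue.popleft()
        recipients = {msg["owner_id"]} | follow_table[msg["owner_id"]]
        for user_id in recipients:
            timeline_table[user_id].insert(0, msg["tweet_id"])
```

Because the write to the tweet table and the fanout are decoupled by the queue, posting returns quickly even for authors with many followers. The cost is a short delay before followers see the tweet.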


View home/profile timeline

  1. When the user lands on the home page, the home/profile service first gets the timeline from the timeline cache, which stores the X most recent tweets, then fetches the tweet info from the tweet table and returns it so the client can render the page
  2. If more than X tweets are needed, the service also queries the timeline table in the DB
  3. Tweets are returned in reverse chronological order, sorted by tweet id
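
A minimal sketch of this read path, assuming the cache and tables are dictionaries and that tweet ids sort by creation time. The function name and parameters are hypothetical.

```python
def get_timeline(user_id, limit, timeline_cache, timeline_table, tweet_table):
    """Return up to `limit` tweets for a user, newest first."""
    tweet_ids = list(timeline_cache.get(user_id, []))
    if len(tweet_ids) < limit:
        # Cache miss, or more tweets requested than the cache holds:
        # fall back to the timeline table in the DB.
        tweet_ids = list(timeline_table.get(user_id, []))
    newest_first = sorted(set(tweet_ids), reverse=True)[:limit]
    # Hydrate ids into full tweet records from the tweet table.
    return [tweet_table[t] for t in newest_first if t in tweet_table]
```

Requests for the first page are served from the cache alone. Only deeper scrolling pays for a DB query.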


Follow / Unfollow

  1. When a user follows another user, a follow API request is sent to the follow service
  2. The follow service updates the follower count in the user table, then writes a new row into the follow table to record the follow
  3. Once the request completes, the service returns OK to the client, and the follow button on the client changes to unfollow
  4. Unfollow is the reverse operation of follow
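
The follow and unfollow steps might look like this in memory. The duplicate and self-follow checks are additions assumed for the sketch, and the two writes are deliberately not atomic, matching the flow described.

```python
from collections import defaultdict

follow_table = defaultdict(set)    # followee id -> follower ids
follower_count = defaultdict(int)  # followee id -> count kept in the user table


def follow(follower_id, followee_id):
    """Record a follow; returns False if there is nothing to do."""
    if follower_id == followee_id or follower_id in follow_table[followee_id]:
        return False  # keeps repeated requests idempotent
    follower_count[followee_id] += 1
    follow_table[followee_id].add(follower_id)
    return True


def unfollow(follower_id, followee_id):
    """Reverse of follow; returns False if the follow did not exist."""
    if follower_id not in follow_table[followee_id]:
        return False
    follower_count[followee_id] -= 1
    follow_table[followee_id].discard(follower_id)
    return True
```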


Like

  1. When the user clicks the like button on a tweet, a favorite request is sent to the favorite service
  2. The service updates the like count for the tweet in the tweet table and writes a new row into the like table
  3. Once the request completes, the service returns OK to the client and the user sees the updated like count




Detailed component design

Home/profile service

  1. The service first gets the timeline from the timeline cache, which stores the X most recent tweets, then fetches the tweet info from the tweet table and returns it so the client can render the page
  2. If more than X tweets are needed, the service also queries the timeline table in the DB


Timeline Cache

  1. The timeline cache stores the X most recent tweets for a user, which can be sorted in reverse chronological order by tweet id
  2. For cache eviction, the timelines of the least recently visited users are evicted first
  3. Users with a large follower group, such as celebrities, will most likely always have their timelines in the cache. So that other users can still benefit from the cache, we can provision separate caches for celebrities
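
One way to combine "X most recent tweets per user" with least-recently-visited eviction is an `OrderedDict` of bounded deques. This is a single-process sketch with hypothetical names. A real deployment would more likely use a shared cache service.

```python
from collections import OrderedDict, deque


class TimelineCache:
    """Keeps the X most recent tweet ids per user and evicts the least
    recently visited user's timeline when over capacity."""

    def __init__(self, max_users, tweets_per_user):
        self.max_users = max_users
        self.tweets_per_user = tweets_per_user
        self._timelines = OrderedDict()  # user_id -> deque, newest first

    def add_tweet(self, user_id, tweet_id):
        """Called by fanout; assumes tweets arrive in chronological order."""
        timeline = self._timelines.get(user_id)
        if timeline is None:
            timeline = deque(maxlen=self.tweets_per_user)
            self._timelines[user_id] = timeline
            if len(self._timelines) > self.max_users:
                self._timelines.popitem(last=False)  # least recently visited
        timeline.appendleft(tweet_id)  # maxlen drops the oldest id

    def get(self, user_id):
        """Return the cached timeline, or None on a miss."""
        timeline = self._timelines.get(user_id)
        if timeline is None:
            return None
        self._timelines.move_to_end(user_id)  # mark as recently visited
        return list(timeline)
```

Only `get` counts as a visit here, so a timeline that receives many fanout writes but is never read can still be evicted.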


Sharding strategy

  1. Similarly, celebrities' tweets receive far more view requests, which creates read hotspots on most tables. We can place celebrities on a separate DB shard to increase availability
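
A routing function for this strategy could look like the sketch below. The follower threshold, the number of regular shards, and the shard names are all assumed values chosen for illustration.

```python
import hashlib

NUM_REGULAR_SHARDS = 8                    # assumed
CELEBRITY_FOLLOWER_THRESHOLD = 1_000_000  # assumed cut-off for "celebrity"


def shard_for(user_id: int, follower_count: int) -> str:
    """Pick the DB shard that holds a user's data."""
    if follower_count >= CELEBRITY_FOLLOWER_THRESHOLD:
        return "celebrity"  # dedicated shard isolates the read hotspot
    # Stable hash so the same user always maps to the same shard.
    digest = hashlib.sha256(str(user_id).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % NUM_REGULAR_SHARDS
    return f"shard-{index}"
```

A user who crosses the threshold would have to be migrated between shards, which this sketch does not handle.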




Trade offs/Tech choices

TweetId

  • We can use a Snowflake id as the tweet id. It is a 64-bit id that embeds useful information such as timestamp and region. Because the timestamp occupies the high-order bits, sorting by tweet id yields a timeline in reverse chronological order, and ids can be generated without a central coordinator
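
A minimal Snowflake-style generator, assuming the commonly described layout of a 41-bit millisecond timestamp, a 10-bit node id and a 12-bit per-millisecond sequence. The custom epoch is an arbitrary choice for this sketch, and the node id is where region or machine information could be encoded.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, arbitrary custom epoch


class SnowflakeGenerator:
    """64-bit ids: timestamp (41 bits) | node id (10 bits) | sequence (12 bits)."""

    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < 1024:
            raise ValueError("node_id must fit in 10 bits")
        self.node_id = node_id
        self._clock = clock  # injectable for testing
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 ids already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.node_id << 12) | self._sequence
```

Ids from one generator are strictly increasing, and ids from different nodes are ordered by millisecond, which is enough for timeline sorting.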


Follow

  1. During the follow flow, if the follower count is updated but the write to the follow table fails, the user may see the follower count disagree with the actual followers. This is acceptable because it is not critical information. We can also introduce a daily worker that scans the follow table and fixes the follower count

Like

  1. Similar to follow, the like count can become inconsistent with the actual likes if part of a like request fails, which is acceptable. We can also introduce a daily worker to fix the count.
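
The daily worker could be as simple as recounting rows and overwriting any stored count that has drifted. The sketch below does this for likes, treating the like table as the source of truth. The data shapes and function name are assumptions. The same approach would apply to follower counts and the follow table.

```python
from collections import Counter


def reconcile_like_counts(tweet_table, like_table):
    """Recompute like counts from the like table and fix stored counts.

    tweet_table: dict of tweet_id -> {"like_count": int, ...}
    like_table:  iterable of (tweet_id, liked_user_id) rows
    Returns the ids of tweets whose count was corrected.
    """
    actual = Counter(tweet_id for tweet_id, _liked_user_id in like_table)
    fixed = []
    for tweet_id, tweet in tweet_table.items():
        if tweet["like_count"] != actual[tweet_id]:
            tweet["like_count"] = actual[tweet_id]
            fixed.append(tweet_id)
    return fixed
```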



Failure scenarios/bottlenecks




Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?