System Requirements

Functional requirements:

  1. The user can post and share tweets
  2. The user can like/favorite tweets
  3. The user can see their home timeline
  4. The user can see other users' timelines


Non-functional requirements


  1. Availability: Every request should receive a non-error response, without a guarantee that the data is the most recent
  2. Consistency: Eventual consistency is acceptable
  3. Partition tolerance: The system should keep operating even if some messages between nodes are dropped by the network
  4. Low latency: The user should see their timeline within 500 ms








Capacity Estimation

Assumptions:


200 million DAU, each user posts 3 tweets per day = 600 million tweets per day.

Each tweet has 140 bytes of content and 30 bytes of metadata; 20% of tweets contain a 20 KB photo, and 10% contain a 2 MB video.

Each user reads the home timeline 5 times and other users' timelines 5 times per day; each timeline contains 20 tweets.


Data storage:

So the total size will be: 600 million * (170 bytes + 20% * 20 KB + 10% * 2 MB) = 600 million * about 204 KB = about 122 TB per day


Bandwidth: 200 million * (5 + 5) * 20 * (140 bytes + 10% * 2 MB + 20% * 20 KB) / 86400 s = about 95 GB/s
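
The two estimates can be checked with a short script. This is only a sketch under the stated assumptions (20% photos, 10% videos) plus one extra assumption of decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes); all constant and function names are made up for illustration.

```python
# Inputs copied from the assumptions; decimal units are an assumption of this sketch.
DAU = 200_000_000
TWEETS_PER_USER_PER_DAY = 3
TEXT_BYTES = 140
METADATA_BYTES = 30
PHOTO_BYTES = 20_000          # 20 KB
VIDEO_BYTES = 2_000_000       # 2 MB
PHOTO_SHARE = 0.20            # fraction of tweets with a photo
VIDEO_SHARE = 0.10            # fraction of tweets with a video
TIMELINE_READS_PER_USER = 5 + 5   # home timeline + other users' timelines
TWEETS_PER_TIMELINE = 20
SECONDS_PER_DAY = 86_400


def avg_media_bytes() -> float:
    """Expected media payload per tweet, averaged over all tweets."""
    return PHOTO_SHARE * PHOTO_BYTES + VIDEO_SHARE * VIDEO_BYTES


def daily_storage_bytes() -> float:
    """New data written per day: text + metadata + average media."""
    tweets_per_day = DAU * TWEETS_PER_USER_PER_DAY
    return tweets_per_day * (TEXT_BYTES + METADATA_BYTES + avg_media_bytes())


def read_bandwidth_bytes_per_s() -> float:
    """Average read bandwidth; like the formula in the text, it omits metadata."""
    tweets_read_per_day = DAU * TIMELINE_READS_PER_USER * TWEETS_PER_TIMELINE
    return tweets_read_per_day * (TEXT_BYTES + avg_media_bytes()) / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"storage:   {daily_storage_bytes() / 1e12:.1f} TB/day")
    print(f"bandwidth: {read_bandwidth_bytes_per_s() / 1e9:.1f} GB/s")
```

Note that this is an average; peak bandwidth would be higher, and video dominates both numbers.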



API Design


  1. createTweet(userToken, String tweetContent) -> response status code
  2. homeTimeline(userToken, int pageSize, optional int pageOffset: indicates the current page location) -> list of tweets
  3. userTimeline(userToken, int userId, int pageSize, optional int pageOffset) -> list of tweets
  4. likeOrUnlikeTweet(userToken, int tweetId, boolean likeOrDislike) -> response status code
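
To make the call shapes concrete, here is a minimal in-memory sketch of three of the four calls. It is not a real service: it assumes the user token has already been resolved to a user id, that pageOffset is a zero-based page index, and that the status codes are HTTP-style (201, 400, 404, 200). The class name TweetApi and the Tweet fields are hypothetical. The home timeline is left out because it needs the follow graph.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Tweet:
    tweet_id: int
    user_id: int
    content: str
    liked_by: Set[int] = field(default_factory=set)


class TweetApi:
    MAX_CONTENT_LENGTH = 140  # assumed limit, taken from the 140-byte content size

    def __init__(self) -> None:
        self._tweets: Dict[int, Tweet] = {}
        self._next_id = 1  # ids grow with creation order in this sketch

    def create_tweet(self, user_id: int, content: str) -> int:
        """Store a tweet and return an HTTP-style status code."""
        if not content or len(content) > self.MAX_CONTENT_LENGTH:
            return 400
        self._tweets[self._next_id] = Tweet(self._next_id, user_id, content)
        self._next_id += 1
        return 201

    def user_timeline(self, user_id: int, page_size: int,
                      page_offset: int = 0) -> List[Tweet]:
        """Return one page of a user's tweets, newest first."""
        if page_size <= 0 or page_offset < 0:
            raise ValueError("page_size must be positive, page_offset non-negative")
        tweets = [t for t in self._tweets.values() if t.user_id == user_id]
        tweets.sort(key=lambda t: t.tweet_id, reverse=True)
        start = page_offset * page_size
        return tweets[start:start + page_size]

    def like_or_unlike_tweet(self, user_id: int, tweet_id: int, like: bool) -> int:
        """Add or remove a like; liking twice is a no-op."""
        tweet = self._tweets.get(tweet_id)
        if tweet is None:
            return 404
        if like:
            tweet.liked_by.add(user_id)
        else:
            tweet.liked_by.discard(user_id)
        return 200
```

Offset-based paging is simple, but pages can shift while new tweets arrive; a cursor based on the last seen tweet id would be a possible alternative.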






Database design


I'd choose MongoDB as our database, because:

  1. Tweet data grows by well over 100 TB per day, which is a very large volume.
  2. Low latency is a requirement.
  3. We need horizontal scalability.

Database design:


Tweet:


tweetId: Integer, primary key

content: Varchar(140)

metadata: Varchar(30)

....


User:

userId: Integer, primary key

email: Varchar(30)

isHotUser: Boolean


Follower:

followerUserId: Integer

followeeUserId: Integer

followingDate: Timestamp
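
As a rough illustration of how the User and Follower records might be used, the sketch below mirrors only the fields listed above as Python dataclasses and adds one lookup, followers_of, which is the query a fanout step would need. The snake_case names and the helper are assumptions for illustration, not part of the schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class User:
    user_id: int          # primary key
    email: str
    is_hot_user: bool


@dataclass(frozen=True)
class Follower:
    follower_user_id: int   # the user who follows
    followee_user_id: int   # the user being followed
    following_date: datetime


def followers_of(rows: List[Follower], followee_user_id: int) -> List[int]:
    """Return the ids of all users who follow the given user.

    A linear scan stands in for an indexed query on followee_user_id.
    """
    return [r.follower_user_id for r in rows
            if r.followee_user_id == followee_user_id]
```

In a real store this lookup would want an index on the followee id, since it runs on every post.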







High level design







Request Flow

The request flow can be seen in the high-level diagram.


Core component design


To keep latency low, every time a user posts a tweet, the fanout service retrieves the user's followers from the database and updates each follower's timeline in the cache. When a user requests a timeline, the cache is checked first; if the timeline is not there, the database is queried.
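
The write and read paths described here can be sketched in memory as follows. Plain dicts stand in for both the database and the cache, tweet ids are assumed to increase with creation time, and the names FanoutService, post_tweet and home_timeline are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set


class FanoutService:
    def __init__(self, followers: Dict[int, Set[int]], max_cached: int = 20) -> None:
        self.followers = followers  # author id -> follower ids (Follower table stand-in)
        self.tweets_by_author: Dict[int, List[int]] = defaultdict(list)  # database stand-in
        self.timeline_cache: Dict[int, Deque[int]] = {}  # cache stand-in
        self.max_cached = max_cached

    def post_tweet(self, author_id: int, tweet_id: int) -> None:
        """Persist the tweet, then push its id into every follower's cached timeline."""
        self.tweets_by_author[author_id].append(tweet_id)
        for follower_id in self.followers.get(author_id, set()):
            timeline = self.timeline_cache.setdefault(
                follower_id, deque(maxlen=self.max_cached))
            timeline.appendleft(tweet_id)  # newest first; oldest falls off when full

    def home_timeline(self, user_id: int) -> List[int]:
        """Serve from cache; on a miss, rebuild from the database and cache the result."""
        cached = self.timeline_cache.get(user_id)
        if cached is not None:
            return list(cached)
        followees = [a for a, fans in self.followers.items() if user_id in fans]
        ids = sorted(
            (t for a in followees for t in self.tweets_by_author[a]),
            reverse=True,
        )[: self.max_cached]
        self.timeline_cache[user_id] = deque(ids, maxlen=self.max_cached)
        return ids
```

The cost of this approach is on the write side: one post by a user with many followers triggers one cache update per follower.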


For viewing other users' timelines, we can combine the push and pull modes. For hot users we use push mode: their updates are added to the cache, which reduces database load and improves latency. For cold users we use pull mode and query the database only when needed, since querying the database is slower than reading from Redis.
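
A minimal sketch of this hot/cold split, assuming the set of hot users is known up front (for example from the isHotUser flag) and that a callable load_from_db returns a user's tweet ids newest first. The class and method names are made up; a dict stands in for Redis.

```python
from typing import Callable, Dict, List, Set


class UserTimelineReader:
    def __init__(self, hot_users: Set[int],
                 load_from_db: Callable[[int], List[int]]) -> None:
        self.hot_users = hot_users
        self.load_from_db = load_from_db
        self.cache: Dict[int, List[int]] = {}  # Redis stand-in, hot users only
        self.db_reads = 0                      # counter to make the effect visible

    def on_new_tweet(self, author_id: int, tweet_id: int) -> None:
        """Push mode: keep an already cached hot-user timeline up to date."""
        if author_id in self.hot_users and author_id in self.cache:
            self.cache[author_id].insert(0, tweet_id)

    def read(self, author_id: int) -> List[int]:
        """Hot users are served from cache; cold users always hit the database."""
        if author_id not in self.hot_users:
            self.db_reads += 1
            return self.load_from_db(author_id)
        if author_id not in self.cache:
            self.db_reads += 1
            self.cache[author_id] = list(self.load_from_db(author_id))
        return list(self.cache[author_id])
```

With this split, repeated reads of a hot user's timeline cost one database read in total, while each read of a cold user's timeline costs one.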


To limit staleness of the cached data, we can refresh the cache regularly, and use LFU as the cache eviction policy.
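
For reference, LFU (least frequently used) eviction can be sketched as below. This is a deliberately simple version with an O(n) scan on eviction and ties broken in favor of evicting the oldest inserted key; it is not how Redis implements its LFU policy, and the class name is hypothetical.

```python
from typing import Any, Dict, Hashable, Optional


class LFUCache:
    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._values: Dict[Hashable, Any] = {}
        self._hits: Dict[Hashable, int] = {}  # access count per key

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value (or None) and count the access."""
        if key not in self._values:
            return None
        self._hits[key] += 1
        return self._values[key]

    def put(self, key: Hashable, value: Any) -> None:
        """Insert or update a key, evicting the least frequently used key if full."""
        if key not in self._values and len(self._values) >= self.capacity:
            # min() scans in insertion order, so the oldest key wins ties.
            victim = min(self._hits, key=self._hits.get)
            del self._values[victim]
            del self._hits[victim]
        self._values[key] = value
        self._hits[key] = self._hits.get(key, 0) + 1
```

LFU suits timelines because a few timelines are read far more often than the rest; the usual caveat is that formerly popular entries can linger unless counts decay over time.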


For database sharding, we have 3 options: sharding by userId, by tweetCreationDate, or by tweetId. Here we can shard by tweetId, as it better avoids the hot user problem (one user's tweets are spread across shards), although the other two options are easier to implement and make queries simpler.
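
The trade-off of sharding by tweetId can be shown with a small sketch. It assumes a fixed number of shards and simple modulo placement (a real deployment would more likely use hashed or range-based shard keys); ShardedTweetStore and its methods are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


class ShardedTweetStore:
    def __init__(self, num_shards: int) -> None:
        self.num_shards = num_shards
        # Each shard maps tweet_id -> (author_id, content).
        self.shards: List[Dict[int, Tuple[int, str]]] = [
            {} for _ in range(num_shards)
        ]

    def shard_for(self, tweet_id: int) -> int:
        """Placement depends only on the tweet id, never on the author."""
        return tweet_id % self.num_shards

    def put(self, tweet_id: int, author_id: int, content: str) -> None:
        self.shards[self.shard_for(tweet_id)][tweet_id] = (author_id, content)

    def get(self, tweet_id: int) -> Optional[Tuple[int, str]]:
        """Lookup by tweet id touches exactly one shard."""
        return self.shards[self.shard_for(tweet_id)].get(tweet_id)

    def tweets_by_author(self, author_id: int) -> List[int]:
        """Lookup by author must ask every shard (scatter-gather)."""
        ids = [
            tweet_id
            for shard in self.shards
            for tweet_id, (author, _) in shard.items()
            if author == author_id
        ]
        return sorted(ids, reverse=True)
```

A hot user's tweets end up spread evenly over the shards, which is the benefit; the price is that building that user's timeline has to query all shards and merge the results.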


For the database, we will use a master-slave setup to ensure availability and eventual consistency. It avoids a single point of failure, although it increases implementation complexity and costs more resources.
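
What eventual consistency means in such a setup can be illustrated with a toy model. It assumes asynchronous replication, where the master acknowledges a write before the slaves have applied it; the classes below are an in-memory illustration only and do not model failover.

```python
from typing import Dict, List, Optional, Tuple


class Replica:
    """Read-only copy that lags behind the master until replication runs."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)


class Master:
    def __init__(self, replicas: List[Replica]) -> None:
        self.data: Dict[str, str] = {}
        self.replicas = replicas
        self._log: List[Tuple[str, str]] = []  # writes not yet shipped to replicas

    def write(self, key: str, value: str) -> None:
        """Apply the write locally and queue it for asynchronous replication."""
        self.data[key] = value
        self._log.append((key, value))

    def replicate(self) -> None:
        """Ship queued writes, in order, to every replica."""
        for key, value in self._log:
            for replica in self.replicas:
                replica.data[key] = value
        self._log.clear()
```

Between write and replicate, a read from a replica can return stale data; once replicate has run, all copies agree, which matches the consistency requirement stated earlier.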