System Requirements
Functional requirements:
- The user can post and share tweets
- The user can like/favorite tweets
- The user can see their home timeline
- The user can see other users' timelines
Non-functional requirements:
- Availability: Every request should receive a non-error response, although the data returned is not guaranteed to be the most recent
- Consistency: Eventual consistency is chosen
- Partition tolerance: The system should keep operating even if some messages between nodes are dropped by the network
- Low latency: The user can see their timeline within 500 ms
Capacity Estimation
Assumptions:
200 million DAU, each user posts 3 tweets per day = 600 million tweets per day.
Each tweet has 140 bytes of content and 30 bytes of metadata; 20% of tweets contain a 20 KB photo and 10% contain a 2 MB video.
Each user reads the home timeline 5 times and other users' timelines 5 times per day; each timeline read returns 20 tweets.
Data storage:
Total size: 600 million * (170 bytes + 20% * 20 KB + 10% * 2 MB) = 600 million * ~204 KB ≈ 122 TB per day
Bandwidth: 200 million * (5 + 5) * 20 * (140 bytes + 20% * 20 KB + 10% * 2 MB) / 86400 s ≈ 95 GB/s
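The two estimates above can be re-derived with a short script, which makes it easy to vary the assumptions. This is only a sketch: the constant and function names are made up for illustration, "Mb" is read as megabytes, and decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) are assumed. As in the bandwidth line, the 30 bytes of metadata are left out of the read path.

```python
# Back-of-envelope capacity estimate for the assumptions listed above.
# Assumption: decimal units (1 KB = 1e3 bytes, 1 MB = 1e6 bytes).
DAU = 200_000_000
TWEETS_PER_USER_PER_DAY = 3
TEXT_BYTES = 140
METADATA_BYTES = 30
PHOTO_BYTES = 20_000
PHOTO_FRACTION = 0.20
VIDEO_BYTES = 2_000_000
VIDEO_FRACTION = 0.10
TIMELINE_READS_PER_USER_PER_DAY = 5 + 5  # home timeline + other users' timelines
TWEETS_PER_TIMELINE = 20
SECONDS_PER_DAY = 86_400


def average_media_bytes() -> float:
    """Expected photo + video bytes attached to one tweet."""
    return PHOTO_FRACTION * PHOTO_BYTES + VIDEO_FRACTION * VIDEO_BYTES


def daily_storage_bytes() -> float:
    """New bytes written per day (text + metadata + media)."""
    tweets_per_day = DAU * TWEETS_PER_USER_PER_DAY
    return tweets_per_day * (TEXT_BYTES + METADATA_BYTES + average_media_bytes())


def read_bandwidth_bytes_per_second() -> float:
    """Average outbound bytes per second for timeline reads."""
    tweets_read_per_day = DAU * TIMELINE_READS_PER_USER_PER_DAY * TWEETS_PER_TIMELINE
    return tweets_read_per_day * (TEXT_BYTES + average_media_bytes()) / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"storage:   {daily_storage_bytes() / 1e12:.1f} TB/day")
    print(f"bandwidth: {read_bandwidth_bytes_per_second() / 1e9:.1f} GB/s")
```

These are daily averages; peak traffic would be higher, and switching to binary units (1 KB = 1,024 bytes) shifts both figures by a few percent.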
API Design
- createTweet(userToken, String tweetContent) -> response status code
- homeTimeline(userToken, int pageSize, optional int pageOffset: indicating current page location) -> tweets list
- userTimeline(userToken, int userId, int pageSize, optional int pageOffset) -> tweets list
- likeOrUnlikeTweet(userToken, int tweetId, boolean like: true to like, false to unlike) -> response status code
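To make the contract of the four calls above concrete, here is a toy in-memory sketch in Python. It is not the real service. The class name, the `follow` helper, the token-equals-user-id authentication, and the specific status codes (201, 400, 404, 200) are all assumptions made for illustration. `page_offset` is interpreted as a zero-based page index, which is one possible reading of "current page location".

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Set


@dataclass
class Tweet:
    tweet_id: int
    author_id: int
    content: str


class InMemoryTweetService:
    """Single-process stand-in for the four API calls; illustration only."""

    MAX_CONTENT_LENGTH = 140

    def __init__(self) -> None:
        self._ids = count(1)
        self._tweets: List[Tweet] = []            # creation order, oldest first
        self._likes: Dict[int, Set[int]] = {}     # tweet id -> ids of users who liked it
        self._following: Dict[int, Set[int]] = {} # follower id -> followee ids

    def _authenticate(self, user_token: str) -> int:
        # Hypothetical scheme: the token is just the user id as a string.
        return int(user_token)

    def follow(self, user_token: str, followee_id: int) -> None:
        # Helper that is not in the API list; needed here to populate home timelines.
        self._following.setdefault(self._authenticate(user_token), set()).add(followee_id)

    def create_tweet(self, user_token: str, content: str) -> int:
        if not content or len(content) > self.MAX_CONTENT_LENGTH:
            return 400
        author_id = self._authenticate(user_token)
        self._tweets.append(Tweet(next(self._ids), author_id, content))
        return 201

    def _page(self, author_ids: Set[int], page_size: int, page_offset: int) -> List[Tweet]:
        # Newest first, then slice out the requested page.
        newest_first = [t for t in reversed(self._tweets) if t.author_id in author_ids]
        start = page_offset * page_size
        return newest_first[start:start + page_size]

    def home_timeline(self, user_token: str, page_size: int, page_offset: int = 0) -> List[Tweet]:
        user_id = self._authenticate(user_token)
        return self._page(self._following.get(user_id, set()), page_size, page_offset)

    def user_timeline(self, user_token: str, user_id: int, page_size: int,
                      page_offset: int = 0) -> List[Tweet]:
        self._authenticate(user_token)
        return self._page({user_id}, page_size, page_offset)

    def like_or_unlike_tweet(self, user_token: str, tweet_id: int, like: bool) -> int:
        user_id = self._authenticate(user_token)
        if not any(t.tweet_id == tweet_id for t in self._tweets):
            return 404
        likers = self._likes.setdefault(tweet_id, set())
        if like:
            likers.add(user_id)
        else:
            likers.discard(user_id)
        return 200
```

A quick usage example: after `svc = InMemoryTweetService()`, `svc.create_tweet("1", "hello")` and `svc.follow("2", 1)`, calling `svc.home_timeline("2", page_size=20)` returns the single tweet written by user 1.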
Database design
I'd choose MongoDB as our database, because:
- Tweet data grows by more than 100 TB per day (see the capacity estimate), which is a lot of data.
- Low latency is a requirement.
- We have horizontal scalability needs.
Schema:
Tweet:
tweetId: Integer, primary key
content: Varchar(140)
metadata: Varchar(30)
....
User:
userId: Integer, primary key
email: Varchar(30)
isHotUser: Boolean
Follower:
followerUserId: Integer
followeeUserId: Integer
followingDate: Timestamp
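As a rough illustration of the three record types above, the sketch below models them as Python dataclasses that convert to plain dictionaries, the shape a document store would accept. The class names, the snake_case field names, and the `followee_ids` helper are hypothetical. The fields elided with "...." in the Tweet schema are deliberately not filled in.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class TweetRecord:
    tweet_id: int   # primary key
    content: str    # at most 140 characters
    metadata: str   # at most 30 characters
    # Further fields are elided in the schema above and are not guessed here.

    def __post_init__(self) -> None:
        if len(self.content) > 140:
            raise ValueError("content exceeds 140 characters")
        if len(self.metadata) > 30:
            raise ValueError("metadata exceeds 30 characters")


@dataclass(frozen=True)
class UserRecord:
    user_id: int    # primary key
    email: str
    is_hot_user: bool = False


@dataclass(frozen=True)
class FollowerRecord:
    follower_user_id: int   # the user who follows
    followee_user_id: int   # the user being followed
    following_date: datetime


def followee_ids(edges: Iterable[FollowerRecord], follower_user_id: int) -> List[int]:
    """Return the ids of everyone the given user follows, in ascending order."""
    return sorted(e.followee_user_id for e in edges
                  if e.follower_user_id == follower_user_id)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    edges = [FollowerRecord(2, 1, now), FollowerRecord(2, 3, now), FollowerRecord(3, 1, now)]
    print(asdict(TweetRecord(1, "hello", "")))
    print(followee_ids(edges, 2))
```

Keeping the follow relationship as separate edge records, as the Follower schema does, avoids an unbounded follower array inside a single user document.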
High-level design
Request Flow
The request flow can be seen in the high-level diagram.