My Solution for Design Twitter with Score: 8/10

by drift1945

System Requirements

Functional requirements:

  1. Users can post and share tweets
  2. Users can like/favorite tweets
  3. Users can see their home timeline
  4. Users can see other users' timelines


Non-functional requirements:


  1. Availability: Each request should get a non-error response, without the guarantee that the data is the most recent
  2. Consistency: Eventual consistency is chosen
  3. Partition tolerance: The system should keep operating even if some messages are dropped by the network between nodes
  4. Low latency: Users should see their timeline within 500 ms








Capacity Estimation

Assumptions:


200 million DAU, each user posting 3 tweets per day = 600 million tweets per day.

Each tweet carries 140 bytes of content and 30 bytes of metadata; 20% of tweets include a 20 KB photo and 10% include a 2 MB video.

Each user reads their home timeline 5 times and other users' timelines 5 times per day, and each timeline contains 20 tweets.


Data storage:

So the total size will be: 600 million * (170 bytes + 20% * 20 KB + 10% * 2 MB) ≈ 600 million * 204 KB ≈ 120 TB per day


Bandwidth: 200 million * (5 + 5) * 20 * (170 bytes + 20% * 20 KB + 10% * 2 MB) / 86,400 s ≈ 95 GB/s
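These estimates can be sanity-checked with a quick back-of-envelope script; the constants are taken directly from the assumptions above, and this is only a sketch of the arithmetic:

```python
# Back-of-envelope capacity estimation, using the assumptions stated above.
DAU = 200_000_000
TWEETS_PER_USER_PER_DAY = 3
TWEET_BYTES = 140 + 30                      # content + metadata
PHOTO_BYTES, PHOTO_RATIO = 20_000, 0.20     # 20 KB photo in 20% of tweets
VIDEO_BYTES, VIDEO_RATIO = 2_000_000, 0.10  # 2 MB video in 10% of tweets

# Average bytes carried by one tweet, media amortized across all tweets.
avg_tweet_bytes = TWEET_BYTES + PHOTO_RATIO * PHOTO_BYTES + VIDEO_RATIO * VIDEO_BYTES

tweets_per_day = DAU * TWEETS_PER_USER_PER_DAY          # 600 million
storage_per_day = tweets_per_day * avg_tweet_bytes      # ~120 TB/day

reads_per_day = DAU * (5 + 5) * 20                      # tweets served per day
bandwidth = reads_per_day * avg_tweet_bytes / 86_400    # ~95 GB/s
```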



API Design


  1. createTweet(userToken, string tweetContent) -> response status code
  2. getHomeTimeline(userToken, int pageSize, optional int pageOffset: current page location) -> list of tweets
  3. getUserTimeline(userToken, int userId, int pageSize, optional int pageOffset) -> list of tweets
  4. likeOrUnlikeTweet(userToken, int tweetId, boolean like) -> response status code
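A minimal sketch of these endpoints as Python signatures, assuming a simple Tweet record and HTTP-style status codes as return values; the bodies are stubs, and any names not in the list above are illustrative:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Tweet:
    tweetId: int
    authorUserId: int
    content: str  # up to 140 bytes

def createTweet(userToken: str, tweetContent: str) -> int:
    """Persist the tweet; return an HTTP-style status code."""
    if len(tweetContent.encode()) > 140:
        return 400  # content exceeds the 140-byte limit
    return 201      # created (persistence omitted in this sketch)

def getHomeTimeline(userToken: str, pageSize: int,
                    pageOffset: Optional[int] = None) -> List[Tweet]:
    """Return one page of the caller's home timeline."""
    return []

def getUserTimeline(userToken: str, userId: int, pageSize: int,
                    pageOffset: Optional[int] = None) -> List[Tweet]:
    """Return one page of another user's timeline."""
    return []

def likeOrUnlikeTweet(userToken: str, tweetId: int, like: bool) -> int:
    """Toggle a like on a tweet; return a status code."""
    return 200
```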






Database design


I'd choose MongoDB as our database because:

  1. The tweet volume is huge: over 100 TB of new data per day.
  2. Low latency is a hard requirement.
  3. We need horizontal scalability.

Database design:


Tweet:


tweetId: Integer, primary key

authorUserId: Integer, references User

content: Varchar(140)

metadata: Varchar(30)

creationDate: Timestamp

....


User:

userId: Integer, primary key

email: Varchar(30)

isHotUser: Boolean


Follower:

followerUserId: Integer

followeeUserId: Integer

followingDate: Timestamp
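Since MongoDB was chosen, these tables map naturally onto collections of documents. Below is a sketch of one example document per collection; all values are made up, and the pymongo index shown in the comment is an assumption about how the timeline query would be served:

```python
from datetime import datetime, timezone

# Example documents mirroring the schema above (values are hypothetical).
tweet_doc = {
    "tweetId": 101,
    "authorUserId": 7,
    "content": "hello world",          # <= 140 bytes
    "metadata": "geo:NYC",             # <= 30 bytes
    "creationDate": datetime.now(timezone.utc),
}

user_doc = {"userId": 7, "email": "alice@example.com", "isHotUser": False}

follower_doc = {
    "followerUserId": 12,              # user 12 follows ...
    "followeeUserId": 7,               # ... user 7
    "followingDate": datetime.now(timezone.utc),
}

# With pymongo, the user-timeline query would want a compound index:
# db.tweets.create_index([("authorUserId", 1), ("creationDate", -1)])
```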







High level design






Request Flow

The request flow can be seen in the high-level diagram above.


Core component design


To keep latency low, every time a user posts a tweet, the fanout service retrieves the follower list from the database and updates each follower's timeline in the cache. When a user requests their timeline, we check the cache first; only on a cache miss do we query the database.
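A toy sketch of this write/read path, with plain dicts standing in for the database and the Redis timeline cache (all names and data structures here are illustrative, not a real Redis API):

```python
from collections import defaultdict

# In-memory stand-ins for the follower table and the timeline cache.
followers_db = defaultdict(set)   # followeeUserId -> {followerUserId, ...}
tweets_db = {}                    # tweetId -> (authorUserId, content)
timeline_cache = {}               # userId -> [tweetId, ...], newest first

def post_tweet(tweet_id, author_id, content):
    """Write path: store the tweet, then fan out to followers' cached timelines."""
    tweets_db[tweet_id] = (author_id, content)
    for follower_id in followers_db[author_id]:
        timeline_cache.setdefault(follower_id, []).insert(0, tweet_id)

def home_timeline(user_id, page_size=20):
    """Read path: serve from cache; fall back to the database on a miss."""
    if user_id in timeline_cache:
        return timeline_cache[user_id][:page_size]
    return []  # cache miss: rebuild from followees' tweets in the DB (omitted)
```

For example, after `followers_db[1] = {2, 3}` and two calls to `post_tweet`, user 2's cached timeline holds both tweets, newest first.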


For viewing another user's timeline, we combine push and pull modes. For hot users we use push mode: their updates are written to the cache ahead of time, which reduces database load and improves latency. For cold users we only query the database on demand; the slower database lookup is acceptable because their timelines are read rarely, and keeping them in Redis would waste memory.
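The hybrid read path can be sketched as follows, with toy in-memory stand-ins for the hot-user cache and the tweet store (all data here is made up):

```python
# Toy stand-ins: hot users' tweets are pushed to a cache at write time;
# cold users' tweets are pulled from the store on demand.
hot_users = {1}                            # users whose isHotUser flag is set
hot_timeline_cache = {1: [103, 102, 101]}  # pre-pushed tweetIds, newest first
tweet_store = {201: (2, "hi"), 202: (2, "again"), 301: (3, "yo")}

def user_timeline(user_id, page_size=20):
    """Hot users are served from the cache (push); cold users via a DB scan (pull)."""
    if user_id in hot_users:
        return hot_timeline_cache.get(user_id, [])[:page_size]
    # Pull mode: scan the tweet store for this author's tweets, newest first.
    return sorted((tid for tid, (author, _) in tweet_store.items()
                   if author == user_id), reverse=True)[:page_size]
```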


And to reduce staleness of the data in the cache, we can refresh the cache periodically and use LFU as our cache eviction policy.
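A minimal LFU sketch to illustrate the eviction policy; this is not production code (in practice Redis's built-in `allkeys-lfu` maxmemory policy would be used instead):

```python
from collections import Counter

class LFUCache:
    """Tiny LFU sketch: evict the least-frequently-used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}          # key -> value
        self.freq = Counter()   # key -> access count

    def get(self, key):
        if key not in self.data:
            return None
        self.freq[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the key with the lowest access count.
            victim, _ = min(self.freq.items(), key=lambda kv: kv[1])
            del self.data[victim]
            del self.freq[victim]
        self.data[key] = value
        self.freq[key] += 1
```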


For database sharding we have 3 options: shard by userId, by tweet creation date, or by tweetId. Here we shard by tweetId, since it best avoids the hot-user problem, even though the other two options make queries easier and are simpler to implement. And to distribute requests evenly across shards, we can use consistent hashing when assigning tweets to shards.
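Consistent hashing for shard assignment can be sketched like this (a toy ring with virtual nodes; shard names and parameters are illustrative):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Map tweetIds onto shards via a consistent-hash ring.

    Virtual nodes spread each shard around the ring so load stays even,
    and only ~1/N of keys move when a shard is added or removed.
    """

    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{shard}:{i}"), shard)
            for shard in shards for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

    def shard_for(self, tweet_id):
        # First ring position clockwise from the key's hash owns the key.
        idx = bisect_right(self.keys, self._hash(tweet_id)) % len(self.keys)
        return self.ring[idx][1]
```

Adding a fourth shard to a three-shard ring moves only roughly a quarter of the keys, which is the property that makes resharding cheap.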


And for the database, we will use master-slave replication to ensure availability and eventual consistency and to avoid a single point of failure, although it increases implementation complexity and resource cost.



Trade-off



For the database, we have 2 options: SQL and NoSQL. Although SQL supports strong consistency, complex queries, and ACID transactions, NoSQL better fits our requirements thanks to its high availability and horizontal scalability. Our system prioritizes availability over consistency, and the eventual consistency NoSQL provides is acceptable for timelines.


For delivering tweets we can fan out on write or fan in on read. Fanout on write is good for reads, since it stores the data in the cache in advance, but it costs more resources. Fan-in on read saves resources and is easier to implement, but it makes reads slower, as every request must query the database. So here we combine both approaches: for hot users we fan out on write, and for cold users we fan in on read.



Bottlenecks


The high throughput is our main bottleneck: the system must absorb roughly 600 million tweet writes per day while serving timeline reads at a bandwidth on the order of 100 GB/s.





Future improvements


Although sharding by tweetId mitigates the hot-tweet problem to a certain extent, it's still possible for a single shard to hold far more hot tweets than the others.


Besides that, we can support more functionality in the future.