System requirements
Functional:
The critical requirements are:
- User Registration and Authentication
- Compose Tweets
- Follow/Unfollow
- View Timeline
- Notifications
Other features exist but won't be covered in this discussion:
- Like Tweets
- Retweet
- Search Functionality
- User profile
- Report/Block users
Non-Functional:
- Availability: 99.99% uptime
- Latency: viewing the timeline should take less than 10 seconds, and sending a tweet should complete within 2 seconds
- Consistency: the ordering of posts does not need strong consistency. Eventual consistency is acceptable, as long as a new post shows up in timelines soon.
- Security and privacy: user information must not be accessible to unauthorized parties.
Capacity estimation
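10 million DAU, 5000 peak view QPS.
Assume 20% of users tweet each day. At one tweet each, that is 2 million tweets per day, or about 23 writes per second on average, so budgeting for a write QPS of 1000 leaves ample headroom for peaks.
Assume each tweet takes 1 KB, so the daily storage requirement is about 2 GB. The overall storage requirement could reach terabytes within a few years.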
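The arithmetic behind these estimates can be checked with a short script. This is a rough sketch only: the DAU, the 20% tweeting share, and the 1 KB tweet size come from the text, while one tweet per tweeting user per day is an assumption made here to get a concrete number. Under that assumption the average write rate is far below 1000 QPS, so that figure is best read as a peak budget.

```python
# Back-of-envelope capacity estimate.
# DAU, tweeting share, and tweet size come from the text above.
# TWEETS_PER_USER_PER_DAY is an assumption, not a figure from the text.
DAU = 10_000_000
TWEETING_PERCENT = 20
TWEETS_PER_USER_PER_DAY = 1
TWEET_SIZE_BYTES = 1_000          # 1 KB per tweet (decimal units)
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = DAU * TWEETING_PERCENT // 100 * TWEETS_PER_USER_PER_DAY
avg_write_qps = tweets_per_day / SECONDS_PER_DAY
daily_storage_gb = tweets_per_day * TWEET_SIZE_BYTES / 1e9
yearly_storage_tb = daily_storage_gb * 365 / 1_000

print(f"tweets per day:   {tweets_per_day:,}")
print(f"avg write QPS:    {avg_write_qps:.1f}")
print(f"daily storage:    {daily_storage_gb:.1f} GB")
print(f"yearly storage:   {yearly_storage_tb:.2f} TB")
```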
API design
- Login: POST /login
  - Request {"account", "encoded_password"}
  - Response {"status", "message"}
- Compose Tweet: POST /tweet
  - Request {"user_id", "tweet"}
  - Response {"status", "message"}
- View Timeline: GET /timeline
  - Response {["author", "tweet", "likes", "created_at"]}
- Follow: POST /follow
  - Request {"user_id", "follow_id"}
  - Response {"status", "message"}
- Unfollow: POST /unfollow
  - Request {"user_id", "follow_id"}
  - Response {"status", "message"}
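To make the request and response shapes concrete, here is a minimal sketch of how the Follow and Unfollow endpoints might be handled. Everything beyond the field names listed above is an assumption for illustration: the handler names, the in-memory set standing in for the Follow table, the "ok"/"error" status values, and the rule that a user cannot follow themselves.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory stand-in for the Follow table: (from_user, to_user).
FollowStore = Set[Tuple[int, int]]


def _response(status: str, message: str) -> Dict[str, str]:
    """Build the {"status", "message"} response shape from the API list."""
    return {"status": status, "message": message}


def _missing_fields(request: dict) -> list:
    """Return the required request fields that are absent."""
    return [key for key in ("user_id", "follow_id") if key not in request]


def handle_follow(request: dict, store: FollowStore) -> Dict[str, str]:
    """Handle a Follow request (hypothetical handler)."""
    missing = _missing_fields(request)
    if missing:
        return _response("error", f"missing fields: {', '.join(missing)}")
    if request["user_id"] == request["follow_id"]:
        # Assumed rule, not stated in the design.
        return _response("error", "cannot follow yourself")
    store.add((request["user_id"], request["follow_id"]))
    return _response("ok", "followed")


def handle_unfollow(request: dict, store: FollowStore) -> Dict[str, str]:
    """Handle an Unfollow request (hypothetical handler)."""
    missing = _missing_fields(request)
    if missing:
        return _response("error", f"missing fields: {', '.join(missing)}")
    # discard() keeps the call idempotent: unfollowing twice is not an error.
    store.discard((request["user_id"], request["follow_id"]))
    return _response("ok", "unfollowed")
```

Both handlers are idempotent, which keeps client retries safe if a response is lost in transit.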
Database design
- User
  - int user_id (primary key)
  - string account
  - string salt
  - string encoded_password
  - datetime created_at
- Post
  - int post_id (primary key)
  - int user_id (foreign key)
  - string content
  - datetime created_at
- Follow
  - int follow_id (primary key)
  - int from_user (foreign key)
  - int to_user (foreign key)
  - enum status (follow, unfollow)
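The three tables can be sketched as SQL DDL. The example below uses SQLite through Python's standard `sqlite3` module purely for illustration; the design does not name a database engine. The plural table names, the NOT NULL and UNIQUE constraints, the CHECK that emulates the enum, and the index on posts are assumptions added here, not part of the original schema.

```python
import sqlite3

# Illustrative schema only. Column names follow the design above;
# constraints and the index are assumptions.
SCHEMA = """
CREATE TABLE users (
    user_id          INTEGER PRIMARY KEY,
    account          TEXT NOT NULL UNIQUE,
    salt             TEXT NOT NULL,
    encoded_password TEXT NOT NULL,
    created_at       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    content    TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE follows (
    follow_id INTEGER PRIMARY KEY,
    from_user INTEGER NOT NULL REFERENCES users(user_id),
    to_user   INTEGER NOT NULL REFERENCES users(user_id),
    status    TEXT NOT NULL CHECK (status IN ('follow', 'unfollow')),
    UNIQUE (from_user, to_user)
);
-- Timeline queries read a user's posts by recency.
CREATE INDEX idx_posts_user_created ON posts (user_id, created_at);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the tables and turn on foreign-key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Keeping a `status` column instead of deleting rows on unfollow matches the enum in the design and preserves the follow history.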
High-level design
We have the following components:
- API gateway: rate-limiting, load balancing, token validation
- Authentication Service: holds the authentication-specific logic: login, sign-up, account management
- Post Service: handles requests to create, delete, and modify posts, and generates timelines; can be replicated for scalability
- DB: stores the tables; can be replicated and sharded by user_id or location to improve scalability and reliability
- Message Queue: stores the notification events produced when actions such as a follow or a like happen
- Notification Service: consumes messages from the Message Queue and sends them through a third-party notification service
- Redis: in-memory store used to cache frequently queried data
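As an illustration of the rate-limiting responsibility of the API gateway, here is a minimal token-bucket sketch. The design does not say which algorithm the gateway uses, so the token bucket is an assumption chosen for the example, as are the class name and parameters.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket rate limiter (illustrative; not specified by the design).

    Tokens refill at `rate` per second up to `capacity`; each allowed
    request consumes one token.
    """

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it is throttled."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A gateway would typically keep one bucket per user or per API key, so a single heavy client cannot exhaust the budget of others.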
Request flows
Tweet flow
- The User initiates a tweet request through the API Gateway.
- The API Gateway verifies the user’s authentication by communicating with the Authentication Service.
- Once authentication is successful, the API Gateway forwards the request to the Post Service to create the tweet.
- The Post Service then stores the tweet in the Database and, after successfully storing it, sends a notification message to the Message Queue.
- The Notification Service consumes the notification message and sends out notifications to the user’s followers.
- Lastly, the API Gateway sends a confirmation back to the User.
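The store-then-notify part of this flow can be sketched as follows. This is a simplified, single-process illustration: Python's `queue.Queue` stands in for the Message Queue, a list stands in for the Database, and the class and method names are hypothetical. Authentication at the gateway is left out.

```python
import queue
from typing import Dict, List, Tuple


class PostService:
    """Stores a tweet, then publishes a notification event (sketch)."""

    def __init__(self, db: List[dict], mq: queue.Queue) -> None:
        self.db = db    # stand-in for the Post table
        self.mq = mq    # stand-in for the Message Queue

    def create_tweet(self, user_id: int, content: str) -> Dict[str, str]:
        post = {"post_id": len(self.db) + 1, "user_id": user_id, "content": content}
        self.db.append(post)                      # 1. persist the tweet first
        self.mq.put({"type": "new_tweet",         # 2. only then enqueue the event
                     "post_id": post["post_id"],
                     "author_id": user_id})
        return {"status": "ok", "message": "tweet created"}


class NotificationService:
    """Consumes events and fans them out to the author's followers (sketch)."""

    def __init__(self, mq: queue.Queue, followers: Dict[int, List[int]]) -> None:
        self.mq = mq
        self.followers = followers
        self.sent: List[Tuple[int, int]] = []     # (follower_id, post_id) pairs

    def drain(self) -> int:
        """Process all queued events; return the number of notifications sent."""
        count = 0
        while True:
            try:
                event = self.mq.get_nowait()
            except queue.Empty:
                break
            for follower_id in self.followers.get(event["author_id"], []):
                # A real service would call a third-party push provider here.
                self.sent.append((follower_id, event["post_id"]))
                count += 1
        return count
```

Because the queue decouples the two services, the Post Service can confirm the tweet without waiting for notifications to be delivered.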
View timeline flow
- The User requests to view the timeline through the API Gateway.
- The API Gateway verifies the user’s authentication by communicating with the Authentication Service.
- Once authentication is successful, the API Gateway forwards the request to the Post Service to generate the timeline.
- The Post Service looks up the tweets of all followed users in the DB, and looks up frequently viewed tweets in Redis.
- The Post Service sends the timeline back to the user.
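The timeline generation step can be sketched as a merge of two sources: cached posts and posts read from the DB. This is a minimal illustration under several assumptions: the function name is hypothetical, plain dictionaries and lists stand in for Redis and the Post table, `created_at` is a sortable timestamp, and the limit of 20 posts is arbitrary.

```python
from typing import Dict, List, Set


def build_timeline(user_id: int,
                   following: Dict[int, Set[int]],
                   db_posts: List[dict],
                   cache: Dict[int, List[dict]],
                   limit: int = 20) -> List[dict]:
    """Merge cached and DB posts from followed users, newest first."""
    followed = following.get(user_id, set())
    # Authors whose posts are already in the cache (stand-in for Redis).
    cached_authors = {author for author in followed if author in cache}

    posts: List[dict] = []
    for author in cached_authors:
        posts.extend(cache[author])
    # Everyone else is read from the DB (stand-in for the Post table).
    db_authors = followed - cached_authors
    posts.extend(p for p in db_posts if p["user_id"] in db_authors)

    posts.sort(key=lambda p: p["created_at"], reverse=True)
    return posts[:limit]
```

For example, if user 1 follows users 2 and 3 and only user 3 is cached, the result combines user 3's cached posts with user 2's posts from the DB, ordered by `created_at`.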
Detailed component design
DB sharding and Redis cache:
- Sharding can be based on user_id and following relationships, to avoid cross-shard lookups and keep queries efficient.
- Popular users who have a lot of followers can be sharded separately. Their posts can be cached in Redis, and every time the service processes a timeline request, it looks up these popular users' posts in Redis.
- Shards can be rebalanced at regular intervals to keep the sharding efficient.
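A minimal sketch of shard routing along these lines is shown below. It covers only two of the ideas above: hashing user_id to pick a shard, and sending popular users to a separate shard. The shard count, the follower threshold, and the function name are assumptions for illustration, and placement by following relationships is not modeled.

```python
import hashlib

# All three constants are illustrative assumptions, not values from the design.
NUM_SHARDS = 8
POPULAR_SHARD = "popular"
POPULAR_FOLLOWER_THRESHOLD = 100_000


def shard_for(user_id: int, follower_count: int) -> str:
    """Pick the shard that stores a user's posts (hypothetical routing rule)."""
    if follower_count >= POPULAR_FOLLOWER_THRESHOLD:
        # Popular users are kept apart so their read load does not
        # overwhelm a regular shard.
        return POPULAR_SHARD
    # A stable hash keeps the mapping identical across processes and restarts
    # (Python's built-in hash() is not stable for all types).
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return f"shard-{index}"
```

Plain modulo hashing moves most keys when `NUM_SHARDS` changes, so periodic rebalancing would likely call for consistent hashing or a lookup table instead.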
Trade offs/Tech choices
Since we only ask for eventual consistency, replication can use a single leader: writes go to the leader and stay fast, and reads are served quickly from the followers. The cost is that asynchronous updates to the replicas do not give strong consistency, which we do not require here.
We use microservices so that each service can scale independently; e.g., the Post Service can be scaled up more than the Authentication Service, since it serves a higher load.
Failure scenarios/bottlenecks
Future improvements