System requirements
Functional:
- Register user
- Login
- Update portfolio
- Compose tweet
- Get tweets
- Update tweet
- Delete tweet
- Favorite tweet
- Unfavorite tweet
- Follow other users
- Unfollow other users
Non-Functional:
- Availability over consistency
- Distributed system
- Scalability
Capacity estimation
DAU: 10 million users
Tweets: 10 million tweets/day
100 reads / tweet
Average: 500 bytes/tweet
Tweets created per second: 10 million tweets/day / (24 hours/day * 60 min/hour * 60 sec/min) = 10^7 / 86,400, rounded to 10^7 / 10^5 = 100 writes/second
Estimated peak tweets created per second: 2 * 100 = 200 writes/second
Tweets read per second: 100 writes/second * 100 reads/tweet = 10,000 reads/second
Estimated peak tweets read per second: 2 * 10,000 = 20,000 reads/second
Daily storage without media: 10 * 10^6 tweets * 500 bytes = 5 * 10^9 B = 5 GB
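The arithmetic above can be checked with a short script. This is only a sketch: the constant and function names are made up for illustration, and the peak factor of 2 is the assumption used in the estimates. It computes the figures both with the exact 86,400 seconds per day and with the rounded 10^5.

```python
# Inputs taken from the capacity estimates above.
TWEETS_PER_DAY = 10_000_000
READS_PER_TWEET = 100
BYTES_PER_TWEET = 500
PEAK_FACTOR = 2  # assumption: peak load is twice the average

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(seconds_per_day=SECONDS_PER_DAY):
    """Return average and peak rates plus daily storage (no media)."""
    writes = TWEETS_PER_DAY / seconds_per_day
    reads = writes * READS_PER_TWEET
    return {
        "writes_per_sec": writes,
        "peak_writes_per_sec": writes * PEAK_FACTOR,
        "reads_per_sec": reads,
        "peak_reads_per_sec": reads * PEAK_FACTOR,
        "daily_storage_gb": TWEETS_PER_DAY * BYTES_PER_TWEET / 10**9,
    }


exact = estimate()                         # uses 86,400 seconds/day
rounded = estimate(seconds_per_day=10**5)  # uses the rounded 10^5

print(rounded)
print(exact)
```

With the exact day length the average comes out slightly higher (about 116 writes/second), so the rounded numbers are a mild underestimate.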
API design
- Register user (email, password)
- Login (email, password)
- Update portfolio (userId, portfolio)
- Compose tweet (userId, content, createDate, updateDate)
- notifyUsers (tweet object, which includes userId)
- getTweet (tweetId)
- getTweets (userId)
- Update tweet (tweetId, content, updateDate)
- Delete tweet (tweetId)
- Favorite tweet (userId, tweetId)
Database design
The register user, login, and update portfolio endpoints mainly interact with the USER table.
Create, get, update, and delete tweet operations interact with the TWEET table and the USER table.
Following a user and notifying followers interact with the FOLLOW table.
Favoriting a tweet interacts with the FAVORITE table.
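A minimal sketch of the four tables, using Python's built-in sqlite3 module as a stand-in for the real SQL database. Only the table names come from the text; every column name, type, and key here is an assumption made for illustration, and the followers_of helper is hypothetical.

```python
import sqlite3

# Column names and keys are illustrative assumptions, not a final schema.
SCHEMA = """
CREATE TABLE "user" (
    user_id       INTEGER PRIMARY KEY,
    email         TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    portfolio     TEXT
);
CREATE TABLE tweet (
    tweet_id    INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES "user"(user_id),
    content     TEXT NOT NULL,
    create_date TEXT NOT NULL,
    update_date TEXT NOT NULL
);
CREATE TABLE follow (
    follower_id INTEGER NOT NULL REFERENCES "user"(user_id),
    followee_id INTEGER NOT NULL REFERENCES "user"(user_id),
    PRIMARY KEY (follower_id, followee_id)
);
CREATE TABLE favorite (
    user_id  INTEGER NOT NULL REFERENCES "user"(user_id),
    tweet_id INTEGER NOT NULL REFERENCES tweet(tweet_id),
    PRIMARY KEY (user_id, tweet_id)
);
"""


def followers_of(conn, user_id):
    """Return the ids of users who follow user_id (used when notifying)."""
    rows = conn.execute(
        "SELECT follower_id FROM follow WHERE followee_id = ? ORDER BY follower_id",
        (user_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    'INSERT INTO "user" (user_id, email, password_hash) VALUES (?, ?, ?)',
    [(1, "a@example.com", "x"), (2, "b@example.com", "y"), (3, "c@example.com", "z")],
)
# Users 2 and 3 follow user 1.
conn.executemany("INSERT INTO follow VALUES (?, ?)", [(2, 1), (3, 1)])
print(followers_of(conn, 1))
```

The composite primary keys on follow and favorite keep a user from following the same account or favoriting the same tweet twice.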
High-level design
A load balancer is placed in front of all the backend services, since our peak tweet reads per second can be very high. An API gateway is also placed at the load balancer level for authorization, for example to check whether a user is allowed to edit or delete a tweet.
For CRUD operations on tweets, the service reads from and writes to the database and the cache.
Operations that call notifyUsers go through a pub/sub system like Kafka. Workers listen to the queue and send notifications to the people who follow the user.
Request flows
Compose/Update tweet: The user sends a tweet to the service. When the service receives the tweet, it writes it to the database and puts it on the queue. Background workers fetch the tweet from the queue, look up the author's followers in the FOLLOW table, and notify them.
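The compose flow can be sketched with in-memory stand-ins. Here a dict plays the TWEET table, another dict plays the FOLLOW table, and Python's queue.Queue plays the Kafka topic; all function and variable names are hypothetical, and a real worker would run continuously rather than drain the queue once.

```python
import queue

tweets_db = {}               # stand-in for the TWEET table
follow_db = {1: [2, 3]}      # followee id -> follower ids (stand-in for FOLLOW)
tweet_queue = queue.Queue()  # stand-in for the Kafka topic
notifications = []           # (follower_id, tweet_id) pairs that were "sent"


def compose_tweet(tweet_id, user_id, content):
    """Write the tweet to the database, then enqueue it for the workers."""
    tweet = {"tweet_id": tweet_id, "user_id": user_id, "content": content}
    tweets_db[tweet_id] = tweet
    tweet_queue.put(tweet)


def run_worker_once():
    """Drain the queue and notify every follower of each tweet's author."""
    while True:
        try:
            tweet = tweet_queue.get_nowait()
        except queue.Empty:
            return
        for follower_id in follow_db.get(tweet["user_id"], []):
            notifications.append((follower_id, tweet["tweet_id"]))
        tweet_queue.task_done()


compose_tweet(10, 1, "hello")
run_worker_once()
print(notifications)
```

Because the write to the database happens before the enqueue, the user's request does not wait on fan-out to followers.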
Read tweet: The user sends a request to read a tweet. The service first checks the cache and returns the tweet if it is there. On a cache miss, the service retrieves the tweet from the db, stores the result in the cache, and returns it. If the tweet is not in the db either, the service returns not found.
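A minimal sketch of this read path (often called cache-aside), with plain dicts standing in for the cache and the db. The names are hypothetical, and the counter exists only to show how many reads reach the db.

```python
cache = {}                 # stand-in for the distributed cache
db = {42: "hello world"}   # stand-in for the TWEET table
stats = {"db_reads": 0}    # counts how often we fall through to the db


def get_tweet(tweet_id):
    """Return the tweet content, or None if it does not exist anywhere."""
    if tweet_id in cache:
        return cache[tweet_id]   # cache hit
    stats["db_reads"] += 1       # cache miss: go to the db
    tweet = db.get(tweet_id)
    if tweet is not None:
        cache[tweet_id] = tweet  # populate the cache for the next reader
    return tweet


get_tweet(42)  # miss, reads the db and fills the cache
get_tweet(42)  # hit, served from the cache
print(stats["db_reads"])
```

This sketch does not cache missing tweets, so repeated requests for an id that does not exist reach the db every time.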
Delete: The user sends a delete request. The service first deletes the record from the cache if it exists, then deletes it from the db if it exists. It also removes the tweet from the queue that the workers consume.
Detailed component design
Load balancer: A geo-based routing strategy could be used here, since users in the same geographic location tend to have similar interests.
API Gateway: To stop users from intentionally overwhelming the services with a large number of requests, we should have rate limiters. Perhaps 100 requests per minute for reads and 5 per minute for writes.
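One simple way to enforce such limits is a fixed-window counter per user. The sketch below is an illustration only: the class name is hypothetical, the window algorithm is one choice among several (token bucket and sliding window are common alternatives), and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import defaultdict

LIMITS = {"read": 100, "write": 5}  # requests per window, taken from the text
WINDOW_SECONDS = 60


class FixedWindowRateLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # (user_id, kind, window index) -> requests seen in that window
        self._counts = defaultdict(int)

    def allow(self, user_id, kind):
        """Return True if the request fits in the current window."""
        window = int(self._clock() // WINDOW_SECONDS)
        key = (user_id, kind, window)
        if self._counts[key] >= LIMITS[kind]:
            return False
        self._counts[key] += 1
        return True


now = [0.0]  # fake clock so the example is deterministic
limiter = FixedWindowRateLimiter(clock=lambda: now[0])
writes = [limiter.allow(1, "write") for _ in range(6)]
print(writes)  # the sixth write in the same minute is rejected
```

This sketch never discards counters for old windows; a real gateway would expire them, for example by keeping the counters in Redis with a TTL.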
Service:
The service handles CRUD operations for users and tweets. It interacts with the cache, db, and queue when needed. The service also has background workers that constantly listen to the queue to notify followers.
DB:
Since we will be handling up to 20,000 reads per second, we need database replicas for the case where the cache does not have the tweet. We can use a main-secondary approach for the db: the main db handles writes and propagates them to the secondary dbs, which serve reads.
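The main-secondary split can be sketched with dicts standing in for database nodes. The class and method names are hypothetical. For simplicity this sketch propagates each write to the secondaries immediately; a real deployment would typically replicate asynchronously, so a secondary may briefly return stale data.

```python
import itertools


class ReplicatedDB:
    """Writes go to the main node; reads rotate across the secondaries."""

    def __init__(self, replica_count):
        self.main = {}
        self.secondaries = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        self.main[key] = value
        # Simplification: propagate synchronously to every secondary.
        for replica in self.secondaries:
            replica[key] = value

    def read(self, key):
        replica = self.secondaries[next(self._next_replica)]
        return replica.get(key)


tweets = ReplicatedDB(replica_count=2)
tweets.write(1, "first tweet")
print(tweets.read(1), tweets.read(1))  # served by two different secondaries
```

Spreading reads across secondaries is what lets read capacity grow by adding replicas, while all writes still funnel through the single main node.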
Cache:
A distributed cache is needed here because the cache has to scale with the rest of the system.
Queue:
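Assuming we use a pub/sub system like Kafka, the number of partitions depends on the number of workers we run. Within a consumer group each partition is read by at most one worker, so the number of workers should not exceed the number of partitions; any extra workers would sit idle.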
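To see how the partition count and the worker count interact, here is a simplified round-robin assignment of partitions to the workers of one consumer group. It is a toy model, not Kafka's actual assignor, and the function name is hypothetical. The property it illustrates is real, though: within a consumer group a partition is read by only one consumer, so workers beyond the partition count receive nothing.

```python
def assign_partitions(num_partitions, workers):
    """Spread partitions over workers round-robin; one owner per partition."""
    assignment = {worker: [] for worker in workers}
    for partition in range(num_partitions):
        owner = workers[partition % len(workers)]
        assignment[owner].append(partition)
    return assignment


# More partitions than workers: every worker has something to read.
print(assign_partitions(4, ["w1", "w2"]))

# More workers than partitions: w3 is assigned nothing and sits idle.
print(assign_partitions(2, ["w1", "w2", "w3"]))
```

Having more partitions than workers is fine and leaves room to add workers later.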
Trade offs/Tech choices
SQL vs NoSQL DB: since we are storing structured data and we have a cache to reduce traffic to the database, SQL is picked here. It also makes analysis easy.
Distributed Redis can be used for the cache.
The reason to use a distributed cache is that it can serve reads concurrently across nodes, which improves performance. For the eviction strategy, we can use LRU (least recently used). The trade-off compared to LFU (least frequently used) is that people sometimes return to the same older post: under LFU that post has a high access count and tends to stay in the cache, whereas LRU may already have evicted it. On the other hand, LFU can let entries that were once popular take up space for a long time.
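The difference between the two eviction strategies can be shown with two tiny in-process caches. These are illustrative sketches only, not how Redis implements eviction (Redis uses approximations of both policies), and the class names are hypothetical. The scenario is an older post that was read several times before two new posts arrive.

```python
from collections import Counter, OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # ordered oldest use -> newest use

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}
        self.hits = Counter()  # key -> number of accesses

    def get(self, key):
        if key not in self.items:
            return None
        self.hits[key] += 1
        return self.items[key]

    def put(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            victim = min(self.items, key=lambda k: self.hits[k])
            del self.items[victim]  # evict least frequently used
            del self.hits[victim]
        self.items[key] = value
        self.hits[key] += 1


lru, lfu = LRUCache(2), LFUCache(2)
for cache in (lru, lfu):
    cache.put("old", "popular older post")
    for _ in range(3):
        cache.get("old")          # the older post is read several times
    cache.put("a", "new post a")
    cache.put("b", "new post b")  # cache is full, something must go

print(lru.get("old"), lfu.get("old"))
```

LRU drops the older post because it was the least recently touched entry when the cache filled up, while LFU keeps it because of its access count and evicts the newer post "a" instead.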
Queue:
Kafka can be used because of its resilience and scalability.
Failure scenarios/bottlenecks
One bottleneck can come from Redis when many entries expire at the same time. The requests that used to hit the cache all fall through to the database, causing a sudden spike in database traffic.
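One common mitigation, which is an assumption here and not part of the design above, is to add a random offset to each entry's TTL so that entries written together do not all expire together. The base TTL, the jitter range, and the function name below are made-up values for illustration.

```python
import random

BASE_TTL_SECONDS = 3600  # assumed base time-to-live for a cached tweet
JITTER_SECONDS = 300     # assumed maximum random extra lifetime


def ttl_with_jitter(rng=random):
    """Return a TTL in [BASE_TTL_SECONDS, BASE_TTL_SECONDS + JITTER_SECONDS]."""
    return BASE_TTL_SECONDS + rng.randint(0, JITTER_SECONDS)


rng = random.Random(0)  # seeded only to make the example repeatable
ttls = [ttl_with_jitter(rng) for _ in range(1000)]
print(min(ttls), max(ttls), len(set(ttls)))
```

With the expirations spread over a five-minute range, the cache misses reach the database gradually instead of in one burst.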
Future improvements
Use both SQL and NoSQL databases: SQL for writes and NoSQL for reads.