System requirements


Functional:

  1. Register user
  2. Login
  3. Update portfolio
  4. Compose tweet
  5. Get tweets
  6. Update tweet
  7. Delete tweet
  8. Favorite tweet
  9. Unfavorite tweet
  10. Follow other users
  11. Unfollow other users



Non-Functional:

  1. Availability over consistency
  2. Distributed system
  3. Scalability


Capacity estimation

DAU: 10 million users

Tweets: 10 million tweets/day

100 reads / tweet

Average: 500 bytes/tweet


Tweets created per second: 10 million tweets/day / (24 hours/day * 60 min/hour * 60 sec/min) = 1 * 10^7 / 86,400, rounded to 1 * 10^7 / (1 * 10^5) = 100 writes/second

Estimated max tweets created per second: 2 * 100 = 200 writes/second

Tweets read per second: 100 writes/second * 100 reads/tweet = 10,000 reads/second

Max tweets read per second: 2 * 10,000 = 20,000 reads/second

Daily storage without media: 10 * 10^6 * 500 bytes = 5 * 10^9 B = 5 GB
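
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the peak factor of 2 is the same assumption used in the estimate. It computes the figures both with the rounded 10^5 seconds per day and with the exact 86,400.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400; the estimate rounds this to 10^5

tweets_per_day = 10_000_000
reads_per_tweet = 100
bytes_per_tweet = 500
peak_factor = 2  # assumed peak-to-average ratio, as in the estimate


def estimate(seconds_per_day):
    """Return average and peak write/read rates for a given day length."""
    writes = tweets_per_day / seconds_per_day
    reads = writes * reads_per_tweet
    return {
        "writes_per_sec": writes,
        "peak_writes_per_sec": writes * peak_factor,
        "reads_per_sec": reads,
        "peak_reads_per_sec": reads * peak_factor,
    }


rounded = estimate(100_000)        # back-of-the-envelope numbers
exact = estimate(SECONDS_PER_DAY)  # slightly higher, about 116 writes/sec

daily_storage_bytes = tweets_per_day * bytes_per_tweet  # 5 * 10^9 bytes
daily_storage_gb = daily_storage_bytes / 10**9          # decimal gigabytes
```

The rounded and exact results differ by about 16%, which is small next to the 2x peak factor, so the rounded numbers are fine for sizing.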



API design

  1. Register user (email, password)
  2. Login (email, password)
  3. Update portfolio (userId, portfolio)
  4. Compose tweet (userId, content, createDate, updateDate)
  5. notifyUsers (tweet), where the tweet object includes userId
  6. getTweet (tweetId)
  7. getTweets (userId)
  8. Update tweet (tweetId, content, updateDate)
  9. Delete tweet (tweetId)
  10. Favorite tweet (userId, tweetId)


Database design

The register user, login, and update portfolio endpoints will mainly interact with the USER table.

Create, get, update, and delete tweet operations will interact with the TWEET table and the USER table.

When a user follows someone, or when followers need to be notified, we will interact with the FOLLOW table.

Favoriting a tweet will interact with the FAVORITE table.
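
One possible shape for these four tables is sketched below using SQLite from the Python standard library. The column names, types, and keys are assumptions made for illustration; the notes only name the tables. The final query shows the lookup a notification worker would need: all followers of a given user.

```python
import sqlite3

# Hypothetical schema: only the table names come from the design notes.
SCHEMA = """
CREATE TABLE user (
    user_id       INTEGER PRIMARY KEY,
    email         TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    portfolio     TEXT
);
CREATE TABLE tweet (
    tweet_id    INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES user(user_id),
    content     TEXT NOT NULL,
    create_date TEXT NOT NULL,
    update_date TEXT NOT NULL
);
CREATE TABLE follow (
    follower_id INTEGER NOT NULL REFERENCES user(user_id),
    followee_id INTEGER NOT NULL REFERENCES user(user_id),
    PRIMARY KEY (follower_id, followee_id)
);
CREATE TABLE favorite (
    user_id  INTEGER NOT NULL REFERENCES user(user_id),
    tweet_id INTEGER NOT NULL REFERENCES tweet(tweet_id),
    PRIMARY KEY (user_id, tweet_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO user (user_id, email, password_hash) VALUES (?, ?, ?)",
    (1, "a@example.com", "hash-a"),
)
conn.execute(
    "INSERT INTO user (user_id, email, password_hash) VALUES (?, ?, ?)",
    (2, "b@example.com", "hash-b"),
)
conn.execute("INSERT INTO follow VALUES (?, ?)", (2, 1))  # user 2 follows user 1

# Lookup needed when notifying followers of user 1.
followers_of_1 = [
    row[0]
    for row in conn.execute(
        "SELECT follower_id FROM follow WHERE followee_id = ?", (1,)
    )
]
```

The composite primary keys on follow and favorite keep a user from following the same account or favoriting the same tweet twice.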







High-level design

A load balancer is placed in front of all backend services, since our max tweets read per second can be very high. An API gateway is also placed at the load balancer level for authorization, for example to check whether a user is allowed to perform operations such as editing or deleting a tweet.


For "CRUD" operations on tweets, requests go to the database and the cache.

Operations that call notifyUsers go through a pub/sub system such as Kafka. Workers listen to the queue and send notifications to the people who follow the user.




Request flows

Compose/Update tweet: The user sends a tweet to the service. When the service receives the tweet, it writes it to the database and puts it on the queue. Background workers fetch the tweet, look up the author's followers in the FOLLOW table, and notify them.
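
The compose flow can be sketched with in-memory stand-ins: a dict for the TWEET table, a dict for the FOLLOW table, and queue.Queue in place of the Kafka topic. All names here (compose_tweet, run_worker_once, inbox) are hypothetical and only illustrate the order of the steps.

```python
import queue

tweets_db = {}                # stands in for the TWEET table
follow_table = {1: [2, 3]}    # followee userId -> follower userIds (FOLLOW table)
notify_queue = queue.Queue()  # stands in for the pub/sub topic
inbox = {}                    # follower userId -> tweetIds they were notified about


def compose_tweet(tweet_id, user_id, content):
    """Service side: persist the tweet, then enqueue it for notification."""
    tweet = {"tweetId": tweet_id, "userId": user_id, "content": content}
    tweets_db[tweet_id] = tweet
    notify_queue.put(tweet)
    return tweet


def run_worker_once():
    """Worker side: take one tweet off the queue and notify each follower."""
    tweet = notify_queue.get()
    for follower_id in follow_table.get(tweet["userId"], []):
        inbox.setdefault(follower_id, []).append(tweet["tweetId"])
    notify_queue.task_done()


compose_tweet(100, 1, "hello")
run_worker_once()
```

Because the service returns as soon as the tweet is stored and enqueued, a slow fan-out to many followers does not slow down the write path.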


Read tweet: The user sends a request to read a tweet. The service first checks the cache and returns the tweet if it is there. On a cache miss, the service retrieves the tweet from the db, stores the result in the cache, and returns it. If the tweet is not in the db either, the service returns not found.
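
A minimal sketch of this read path, with plain dicts standing in for the cache and the db. The function name follows getTweet from the API list; the db_reads counter is added only to show that the second read is served from the cache.

```python
cache = {}                                          # stands in for the cache
db = {42: {"tweetId": 42, "content": "hello"}}      # stands in for the TWEET table
db_reads = 0                                        # counts trips to the db


def get_tweet(tweet_id):
    """Check the cache first; on a miss, read the db and populate the cache."""
    global db_reads
    tweet = cache.get(tweet_id)
    if tweet is not None:
        return tweet          # cache hit
    db_reads += 1
    tweet = db.get(tweet_id)
    if tweet is None:
        return None           # not found in the cache or the db
    cache[tweet_id] = tweet   # store the db result for later reads
    return tweet


first = get_tweet(42)    # miss: goes to the db and fills the cache
second = get_tweet(42)   # hit: served from the cache
missing = get_tweet(7)   # not in the db either
```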


Delete: The user sends a delete request. The service first deletes the record from the cache if it exists, then deletes it from the db if it exists. It also removes the tweet from the queue that the workers consume.







Detailed component design

Load balancer: A geo-based strategy could be used here, since users in the same geographic location tend to have similar interests.

API Gateway: To guard against intentional abuse, such as overwhelming the services with a large number of requests, we should have rate limiters. Perhaps 100 tweets per minute for reads and 5 tweets per minute for writes.
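
One simple way to enforce such limits is a fixed-window counter, sketched below. The per-user keying, the class name, and the injectable clock are assumptions for illustration; the notes only give the two limits. A production limiter would also expire old windows and would typically keep its counters in a shared store.

```python
import time
from collections import defaultdict

LIMITS = {"read": 100, "write": 5}  # requests per window, from the notes
WINDOW_SECONDS = 60


class FixedWindowRateLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so tests can control time
        self._counts = defaultdict(int)  # (user, kind, window) -> request count

    def allow(self, user_id, kind):
        """Return True if this request fits in the current window."""
        window = int(self._clock() // WINDOW_SECONDS)
        key = (user_id, kind, window)
        if self._counts[key] >= LIMITS[kind]:
            return False
        self._counts[key] += 1
        return True


# Usage with a fake clock: five writes pass, the sixth is rejected,
# and the counter resets once the next one-minute window starts.
now = [0.0]
limiter = FixedWindowRateLimiter(clock=lambda: now[0])
write_results = [limiter.allow("u1", "write") for _ in range(6)]
now[0] = 60.0
after_reset = limiter.allow("u1", "write")
```

A fixed window allows short bursts around window boundaries; a sliding window or token bucket smooths that out at the cost of a little more state.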

Service:

The service handles CRUD operations for users and tweets. It interacts with the cache, db, and queue when needed. The service also has background workers that constantly listen to the queue to notify followers.

DB:

Since we will be handling at most 20,000 reads per second, we need database replicas to serve reads when the cache does not have the tweet. We can use a main-secondary approach for the db: the main db handles writes and propagates changes to the secondary dbs.
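
The main-secondary split can be sketched as a small router: writes go to the main store and are copied to the secondaries, while reads rotate across the secondaries. The class and its round-robin policy are hypothetical, and the synchronous copy is a simplification; real replication is usually asynchronous, so secondaries can briefly lag behind the main db.

```python
import itertools


class MainSecondaryStore:
    def __init__(self, num_secondaries=2):
        self.main = {}
        self.secondaries = [{} for _ in range(num_secondaries)]
        self._next = itertools.cycle(range(num_secondaries))

    def write(self, key, value):
        """Writes go to the main db, then propagate to every secondary."""
        self.main[key] = value
        for replica in self.secondaries:
            replica[key] = value  # simplified: synchronous propagation

    def read(self, key):
        """Reads are spread over the secondaries in round-robin order."""
        idx = next(self._next)
        return idx, self.secondaries[idx].get(key)


store = MainSecondaryStore(num_secondaries=2)
store.write(1, "first tweet")
reads = [store.read(1) for _ in range(3)]
```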


Cache:

A distributed cache is needed here because we want the cache to scale along with the rest of the system.


Queue:

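Assuming we use a pub/sub system like Kafka, the number of partitions depends on the number of workers we have. Within a consumer group each partition is read by only one worker, so the number of workers should not exceed the number of partitions; any extra workers would sit idle.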
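
To see how the partition count and the worker count interact, here is a simplified round-robin assignment. It is not Kafka's actual assignor, only an illustration of the rule that, within one consumer group, each partition is read by at most one consumer. The function and worker names are made up.

```python
def assign_partitions(num_partitions, workers):
    """Give each partition to exactly one worker, in round-robin order."""
    assignment = {worker: [] for worker in workers}
    for partition in range(num_partitions):
        owner = workers[partition % len(workers)]
        assignment[owner].append(partition)
    return assignment


# More partitions than workers: every worker has something to read.
busy = assign_partitions(4, ["w1", "w2"])

# More workers than partitions: the extra worker gets nothing.
idle = assign_partitions(2, ["w1", "w2", "w3"])
```

With four partitions and two workers, each worker reads two partitions. With two partitions and three workers, the third worker has an empty assignment and does no work.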






Trade offs/Tech choices

SQL vs NoSQL DB: since we are storing structured data and we have a cache to reduce traffic to the db, SQL is picked here. It also makes analysis easy.


Distributed Redis can be used for the cache.

The reason to use a distributed cache is that it can serve concurrent reads, which increases performance. For the eviction strategy, we can use LRU (least recently used). The trade-off compared to LFU (least frequently used) is that people sometimes keep coming back to the same older post: under LFU it has a high access count and tends to stay in the cache, while LRU may evict it once newer posts are read. On the other hand, under LFU an entry that was once popular can take up space for a while after people stop reading it.
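
The difference between the two eviction policies shows up in a tiny experiment. The two classes below are toy caches written for this comparison, not how Redis implements eviction. Post "A" is read three times, then posts "B" and "C" are read once each, with room for only two entries.

```python
from collections import Counter, OrderedDict


class LRUCache:
    """Evicts the entry that was accessed longest ago."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        if key in self.items:
            self.items.move_to_end(key)      # mark as most recently used
            return
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)   # drop the least recently used
        self.items[key] = True


class LFUCache:
    """Evicts the entry with the lowest access count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = Counter()

    def access(self, key):
        if key not in self.counts and len(self.counts) >= self.capacity:
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]          # drop the least frequently used
        self.counts[key] += 1


lru, lfu = LRUCache(2), LFUCache(2)
for post in ["A", "A", "A", "B", "C"]:
    lru.access(post)
    lfu.access(post)

lru_contents = list(lru.items)     # "A" was evicted when "C" arrived
lfu_contents = sorted(lfu.counts)  # "A" survived because of its high count
```

LRU ends up holding "B" and "C", so the next read of the popular older post "A" is a miss. LFU keeps "A" and evicts "B" instead, which is the behavior the trade-off above describes.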


Queue:

Kafka can be used because of its resilience and scalability.





Failure scenarios/bottlenecks

One of the bottlenecks can come from Redis when many items expire at the same time. The resulting cache misses suddenly cause high traffic to the database.
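
One common mitigation, which the notes above do not commit to, is to add random jitter to each entry's TTL so that items cached at the same moment do not all expire together. The sketch below only computes the jittered TTL; the base TTL of one hour and the 10% spread are assumed values chosen for illustration.

```python
import random

BASE_TTL_SECONDS = 3600  # assumed base TTL of one hour
JITTER_FRACTION = 0.1    # assumed spread of +/- 10%


def ttl_with_jitter(rng=random):
    """Return the base TTL shifted by a random amount within the spread."""
    jitter = rng.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    return int(BASE_TTL_SECONDS * (1 + jitter))


# Entries written in the same burst now expire over a 12-minute range
# instead of in the same second.
rng = random.Random(0)
ttls = [ttl_with_jitter(rng) for _ in range(1000)]
```

The returned value would be passed as the expiry when writing an entry to the cache.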






Future improvements

Use both SQL and NoSQL for the db: SQL for writes and NoSQL for reads.