Functional:
- Users publish tweets
- Users browse updates from the users they follow
- Users favorite others' tweets
Non-functional:
- High availability
- Scalability
- Eventual consistency
QPS:
Suppose there are 100 million daily active users and about 20% of users publish 1 tweet every day. The QPS for publishing is then 20,000,000 * 1 / 24 / 3600 ≈ 230, or roughly 200. Each tweet takes about 100 bytes, so the write throughput is about 200 * 100 / 1000 = 20 KB/s, and the daily storage is 20,000,000 * 100 bytes = 2 GB/day. After 1000 days, the storage would be 2 * 1000 / 1000 = 2 TB. The tweet text can still be stored on one machine.
Suppose each user reads once every day. The QPS for reading is 100,000,000 / 24 / 3600 ≈ 1,000. Reads outnumber writes by about 5 to 1, but a single machine can still handle that load.
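The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the constant and function names are illustrative, and the inputs are the assumptions stated in the text (100 million daily active users, 20% publishing once a day, 100 bytes per tweet, one read per user per day).

```python
SECONDS_PER_DAY = 24 * 3600

# Inputs taken from the assumptions in the text.
DAILY_ACTIVE_USERS = 100_000_000
PUBLISHING_PERCENT = 20        # share of users who publish each day
TWEETS_PER_PUBLISHER = 1
BYTES_PER_TWEET = 100          # text only; media is assumed to live elsewhere
READS_PER_USER = 1


def estimate_capacity():
    """Return average QPS and storage figures for the stated assumptions."""
    tweets_per_day = DAILY_ACTIVE_USERS * PUBLISHING_PERCENT // 100 * TWEETS_PER_PUBLISHER
    bytes_per_day = tweets_per_day * BYTES_PER_TWEET
    return {
        "write_qps": tweets_per_day / SECONDS_PER_DAY,
        "read_qps": DAILY_ACTIVE_USERS * READS_PER_USER / SECONDS_PER_DAY,
        "storage_gb_per_day": bytes_per_day / 1e9,
        "storage_tb_after_1000_days": bytes_per_day * 1000 / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.1f}")
```

These are averages over a full day; peak traffic would be higher, so the figures are a lower bound for sizing.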
Database:
- User table
- Tweet table
- User_Following table
- User_Follower table
For the User table, I would use SQL because each user's information is structured: a user has an id, name, password, gender, created time, and so on. The same applies to the User_Following and User_Follower tables.
For the Tweet table, I would use NoSQL because the data is not strictly structured. Each record should include tweet_id, user_id, text_content, picture/video URL, and the number of favorites.
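One possible shape for this data model is sketched below. SQLite stands in for the SQL store and a plain dictionary stands in for a NoSQL tweet document. The column names beyond those listed above (for example `password_hash` and `media_urls`) and the helper functions are assumptions made for illustration.

```python
import sqlite3

# Relational schema for the structured tables.
SCHEMA = """
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,          -- store a hash, never the raw password
    gender        TEXT,
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE user_following (
    user_id      INTEGER NOT NULL REFERENCES users(id),
    following_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (user_id, following_id)
);
CREATE TABLE user_follower (
    user_id     INTEGER NOT NULL REFERENCES users(id),
    follower_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (user_id, follower_id)
);
"""


def make_db():
    """Create an in-memory database with the three relational tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def make_tweet_document(tweet_id, user_id, text_content, media_urls=()):
    """Build a tweet record as it might be stored in a document store."""
    return {
        "tweet_id": tweet_id,
        "user_id": user_id,
        "text_content": text_content,
        "media_urls": list(media_urls),   # picture/video URLs, possibly empty
        "favorite_count": 0,
    }
```

Keeping both User_Following and User_Follower duplicates each relationship, but it lets "who do I follow" and "who follows me" each be answered from a single primary-key range.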
API Design:
- POST: "/users/{user_id}/publish"
- GET: "/users/{user_id}/browse"
- PATCH: "/users/{user_id}/like/{tweet_id}"
Traditionally, we can use push mode to deliver tweets: when a user publishes, the tweet is written to each follower's feed, so browsing is close to real-time. But celebrities have too many followers, and pushing every new tweet to each follower is expensive. For celebrities we can therefore apply pull mode, where followers fetch the celebrity's tweets when they browse.
For the favorites API, concurrent operations may produce an incorrect favorite count. We can introduce a message queue such as Kafka or RabbitMQ so that updates are applied in order and the count stays correct.
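The combination of push and pull can be sketched with in-memory structures. This is an illustrative model only: the class name, the follower threshold that decides who counts as a celebrity, and the logical clock used for ordering are all assumptions, not part of the original design.

```python
from collections import defaultdict


class TimelineService:
    """Hybrid fan-out: push for ordinary authors, pull for celebrities."""

    def __init__(self, celebrity_threshold=10_000):
        self.celebrity_threshold = celebrity_threshold  # hypothetical cutoff
        self.followers = defaultdict(set)          # author -> followers
        self.following = defaultdict(set)          # user -> authors followed
        self.tweets_by_author = defaultdict(list)  # author -> [(time, tweet_id)]
        self.inbox = defaultdict(list)             # user -> pushed [(time, tweet_id)]
        self._clock = 0                            # logical timestamp

    def follow(self, follower, author):
        self.followers[author].add(follower)
        self.following[follower].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def publish(self, author, tweet_id):
        self._clock += 1
        entry = (self._clock, tweet_id)
        self.tweets_by_author[author].append(entry)
        if not self.is_celebrity(author):
            # Push mode: write the tweet into every follower's inbox.
            for follower in self.followers[author]:
                self.inbox[follower].append(entry)

    def browse(self, user, limit=20):
        # A set removes duplicates if an author became a celebrity
        # after some of their tweets had already been pushed.
        entries = set(self.inbox[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Pull mode: read the celebrity's tweets at browse time.
                entries.update(self.tweets_by_author[author])
        newest_first = sorted(entries, reverse=True)
        return [tweet_id for _, tweet_id in newest_first[:limit]]
```

Publishing costs one write per follower for ordinary authors and a single write for celebrities; browsing pays for that with one extra lookup per followed celebrity.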
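The idea of routing favorites through a queue can be illustrated with the standard library. In this sketch `queue.Queue` stands in for Kafka or RabbitMQ and a single consumer thread stands in for the worker that owns the counter; the function name and parameters are hypothetical.

```python
import queue
import threading
from collections import Counter


def run_like_pipeline(producers, events_per_producer, tweet_id="t1"):
    """Send like events from many threads through one queue and count them."""
    like_events = queue.Queue()
    favorite_counts = Counter()

    def producer():
        # Request handlers only enqueue an event; they never touch the counter.
        for _ in range(events_per_producer):
            like_events.put(tweet_id)

    def consumer():
        # A single consumer applies updates one at a time, so increments
        # cannot interleave and overwrite each other.
        while True:
            event = like_events.get()
            if event is None:          # sentinel: no more events
                break
            favorite_counts[event] += 1

    worker = threading.Thread(target=consumer)
    worker.start()
    threads = [threading.Thread(target=producer) for _ in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    like_events.put(None)
    worker.join()
    return favorite_counts[tweet_id]
```

With 8 producers sending 1,000 events each, the function returns 8,000 regardless of thread scheduling. A real deployment would also need to deduplicate repeated likes from the same user, which this sketch does not cover.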
Optimization:
- We can use a CDN to serve static content from a location close to the user, which reduces load time.
- We can use load balancers to distribute a large number of requests across servers with a strategy such as round-robin or least connections.
- We can use Redis as a cache to reduce the number of read/write operations that hit the database.
- We can use read/write splitting with a master-slave database setup to increase performance.
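The caching item can be illustrated with a cache-aside read path. In this minimal sketch plain dictionaries stand in for both Redis and the database, and the class and method names are assumptions made for illustration.

```python
class CacheAsideTweetStore:
    """Read through a cache first and fall back to the database on a miss."""

    def __init__(self, database):
        self.database = database   # dict standing in for the tweet database
        self.cache = {}            # dict standing in for Redis
        self.db_reads = 0          # counts how often the database is hit

    def get_tweet(self, tweet_id):
        if tweet_id in self.cache:
            return self.cache[tweet_id]          # cache hit: no database read
        self.db_reads += 1
        tweet = self.database.get(tweet_id)
        if tweet is not None:
            self.cache[tweet_id] = tweet         # populate the cache for later reads
        return tweet

    def update_tweet(self, tweet_id, tweet):
        self.database[tweet_id] = tweet
        # Invalidate rather than update, so the next read reloads fresh data.
        self.cache.pop(tweet_id, None)
```

Repeated reads of a popular tweet then reach the database only once until the tweet changes, which is where most of the saving on a read-heavy workload would come from.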