Post Tweet API - users should be able to write a tweet and attach images or videos
Read timeline API - users should be able to generate a timeline of tweets
Favorite Tweet API - users should be able to favorite a tweet
Non-Functional:
1 million users
10 million write requests per month
1 billion read requests per month
Capacity estimation
1 million users
10 million write requests per month
1 billion read requests per month
There are roughly 2.5 million seconds in a month:
10 million / 2.5 million = 4 writes per second
1 billion / 2.5 million = 400 reads per second
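The same arithmetic can be checked quickly in code. This is a minimal sketch assuming a 30-day month (2,592,000 seconds, which the estimate above rounds to 2.5 million) and traffic spread evenly over the month; the function and variable names are illustrative only.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000; rounded to 2.5 million above


def per_second(requests_per_month: int) -> float:
    """Average request rate, assuming traffic is spread evenly over the month."""
    return requests_per_month / SECONDS_PER_MONTH


tweet_writes_per_second = per_second(10_000_000)   # roughly 4
reads_per_second = per_second(1_000_000_000)       # roughly 400

print(f"{tweet_writes_per_second:.1f} tweet writes/s, {reads_per_second:.0f} reads/s")
```

These are averages; peak traffic would be higher, so the real provisioning target should sit above these numbers.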
Assume a user sees 1000 tweets per month and likes 10% of them, which is 100 likes per user per month. Across 1 million users this means that there will be:
100 million likes per month
Likes are write requests, so total writes rise from 10 million to 110 million per month, which is about 44 writes per second on average
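Tweets will be capped at 255 characters, which is about 255 bytes assuming one byte per character, and the max attachment size will be 1MB
In the worst case, where every tweet carries a maximum-size attachment, the total data we need to store is roughly 1MB per tweet * 10 million tweets per month = 10TB per month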
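A small sketch of the storage estimate, assuming decimal units (1 MB = 1,000,000 bytes), one byte per character, and the worst case where every tweet has a maximum-size attachment. The constant names are illustrative.

```python
MAX_TEXT_BYTES = 255               # 255 characters at one byte each (assumption)
MAX_ATTACHMENT_BYTES = 1_000_000   # 1 MB cap, decimal megabyte (assumption)
TWEETS_PER_MONTH = 10_000_000

# Worst case: every tweet carries a maximum-size attachment.
worst_case_bytes = TWEETS_PER_MONTH * (MAX_TEXT_BYTES + MAX_ATTACHMENT_BYTES)
# Text alone is tiny by comparison.
text_only_bytes = TWEETS_PER_MONTH * MAX_TEXT_BYTES

print(f"worst case: {worst_case_bytes / 1e12:.2f} TB per month")
print(f"text only:  {text_only_bytes / 1e9:.2f} GB per month")
```

The attachments dominate the total, which is why they are kept in an object store rather than in the tweet documents.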
API design
The system will have 3 APIs
Tweet API - This takes a user's text input and attachment and stores it
Generate Timeline API - This API reads from a database a list of new tweets and displays them on the user's timeline.
Favorite API - This API increments a like counter on the tweet and stores a reference to the tweet in the user's database entry
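One way to picture the three APIs is as an interface. This is only a sketch: the method names, parameters, and return types are assumptions for illustration, and the size limits are taken from the capacity estimate.

```python
from typing import List, Optional, Protocol

MAX_TWEET_CHARS = 255              # cap from the capacity estimate
MAX_ATTACHMENT_BYTES = 1_000_000   # 1 MB cap from the capacity estimate


class TwitterAPI(Protocol):
    """Hypothetical interface for the three APIs described above."""

    def post_tweet(self, user_id: str, text: str,
                   attachment: Optional[bytes] = None) -> str:
        """Store the tweet and optional attachment; return the new tweet ID."""
        ...

    def read_timeline(self, user_id: str) -> List[dict]:
        """Return the tweets to display on the user's timeline."""
        ...

    def favorite_tweet(self, user_id: str, tweet_id: str) -> None:
        """Increment the tweet's like counter and record it on the user."""
        ...


def validate_post(text: str, attachment: Optional[bytes] = None) -> None:
    """Reject input that exceeds the limits before anything is stored."""
    if len(text) > MAX_TWEET_CHARS:
        raise ValueError("tweet text exceeds 255 characters")
    if attachment is not None and len(attachment) > MAX_ATTACHMENT_BYTES:
        raise ValueError("attachment exceeds 1 MB")
```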
Database design
Eventual consistency will be fine here as there is no time-critical data on a social media site
I will choose a NoSQL database here because of the structure of the calls that need to be made and the lack of necessity for relational mapping
NoSQL also allows us to scale horizontally and we could use a managed service such as DynamoDB or Firebase Firestore
The NoSQL database will have two collections: Tweets and Users
Tweets will be stored as separate documents with the following fields: timestamp, text, attachments, favorites and userID
Users will have documents containing fields such as tweets (maps or links to the tweet documents they have written) and favorites (maps or links to the tweets they have favorited)
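The two document shapes could look roughly like this. It is a sketch only: the field names follow the lists above, while the types, the tweet_id and user_id keys, and the use of dataclasses are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TweetDocument:
    tweet_id: str        # document key (assumed)
    user_id: str         # author of the tweet
    timestamp: float     # seconds since the Unix epoch (assumed representation)
    text: str
    attachments: List[str] = field(default_factory=list)  # references to stored objects
    favorites: int = 0   # like counter


@dataclass
class UserDocument:
    user_id: str         # document key (assumed)
    tweet_ids: List[str] = field(default_factory=list)           # tweets written by the user
    favorite_tweet_ids: List[str] = field(default_factory=list)  # tweets the user has favorited


tweet = TweetDocument("t1", "u1", 1_700_000_000.0, "hello world")
author = UserDocument("u1", tweet_ids=[tweet.tweet_id])
print(asdict(tweet))
```

Storing only IDs in the user document keeps it small; the tweet body lives in one place.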
High-level design
From a high level, the client will make a request to the server
The client request will hit a load balancer, which will use round robin to direct it to one of the servers
Depending on the request type, the server will call one of 3 APIs: write tweet, read timeline, or favorite tweet
These three APIs will perform their necessary actions against the NoSQL database
For write tweet, it will create a new document in the tweets collection with the following fields: timestamp, text, attachments, favorites and userID. Attachments should be stored in the object store bucket and the reference to the attachment should be put in the attachment field in the document.
The read timeline API first checks a cache to see if a recent enough version of the timeline, let's say within one minute, has already been read from the database, and if so returns that
If the cache is stale, which means the newest tweet in the cache is more than one minute old, we go to the database and fetch all documents that are newer than the newest tweet in the cache. We replace the current cache with the list of tweets we just obtained and return this data to the user to populate their timeline.
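The cache check described above can be sketched as follows. This is an in-memory illustration, not the real storage layer: a plain list stands in for the tweets collection, the class and field names are hypothetical, and when nothing newer exists it keeps the old cache instead of replacing it with an empty list, which is a small departure from the text.

```python
import time
from typing import Dict, List, Optional

CACHE_MAX_AGE_SECONDS = 60  # "within one minute"


class TimelineReader:
    def __init__(self, database: List[Dict]):
        self.database = database      # stand-in for the tweets collection
        self.cache: List[Dict] = []   # cached timeline, newest first

    def read_timeline(self, now: Optional[float] = None) -> List[Dict]:
        now = time.time() if now is None else now
        newest = self.cache[0]["timestamp"] if self.cache else float("-inf")

        # Fresh: the newest cached tweet is at most one minute old.
        if now - newest <= CACHE_MAX_AGE_SECONDS:
            return self.cache

        # Stale: fetch only documents newer than the newest cached tweet.
        newer = [t for t in self.database if t["timestamp"] > newest]
        if newer:
            self.cache = sorted(newer, key=lambda t: t["timestamp"], reverse=True)
        return self.cache
```

Passing now explicitly makes the staleness rule easy to exercise in tests without waiting a minute.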
The favorite tweet API finds the matching tweet document in the database and increments its like counter. It then looks up the current user by their UID and adds a reference to that tweet to the "liked tweets" field in the user's document
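The favorite step touches two documents, which a short sketch makes concrete. Dictionaries stand in for the two collections, all names are hypothetical, and the check that skips a repeated like is an added assumption that the text does not state.

```python
from typing import Dict

# In-memory stand-ins for the Tweets and Users collections.
tweets: Dict[str, dict] = {"t1": {"user_id": "u1", "text": "hello", "favorites": 0}}
users: Dict[str, dict] = {"u2": {"tweets": [], "favorites": []}}


def favorite_tweet(user_id: str, tweet_id: str) -> None:
    tweet = tweets[tweet_id]   # raises KeyError if the tweet does not exist
    user = users[user_id]      # raises KeyError if the user does not exist

    # Assumption: liking the same tweet twice should not count twice.
    if tweet_id in user["favorites"]:
        return

    tweet["favorites"] += 1               # increment the like counter
    user["favorites"].append(tweet_id)    # reference the tweet from the user
```

In a real database these two updates are separate writes, so under eventual consistency the counter and the user's list may briefly disagree.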
Request flows
Every request follows the same path as in the high-level design: the client calls the load balancer, which uses round robin to pick a server, and the server calls one of 3 APIs (write tweet, read timeline, or favorite tweet) against the NoSQL database
Write tweet: the attachment, if any, is stored in the object store, and a new document is created in the tweets collection with the following fields: timestamp, text, attachments, favorites and userID
Read timeline: the API first checks the cache. If the newest tweet in the cache is at most one minute old, the cached timeline is returned. Otherwise we fetch all documents from the database that are newer than the newest tweet in the cache, replace the current cache with them, and return this data to the user to populate their timeline
Favorite tweet: the API finds the matching tweet document in the database and increments its like counter. It then looks up the current user by their UID and adds a reference to that tweet to the "liked tweets" field in the user's document
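The routing part of this flow, round robin at the load balancer and dispatch by request type on the server, can be sketched as below. Server names, request type strings, and handler bodies are placeholders; in the design above the balancing is done by a managed load balancer rather than by application code.

```python
from itertools import cycle
from typing import Callable, Dict, List


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers: List[str]):
        self._servers = cycle(servers)

    def next_server(self) -> str:
        return next(self._servers)


# Placeholder handlers; the real ones would talk to the database and cache.
def write_tweet(payload: dict) -> str:
    return f"stored tweet from {payload['user_id']}"


def read_timeline(payload: dict) -> str:
    return f"timeline for {payload['user_id']}"


def favorite_tweet(payload: dict) -> str:
    return f"{payload['user_id']} favorited {payload['tweet_id']}"


HANDLERS: Dict[str, Callable[[dict], str]] = {
    "write_tweet": write_tweet,
    "read_timeline": read_timeline,
    "favorite_tweet": favorite_tweet,
}


def handle(request_type: str, payload: dict) -> str:
    """Server-side dispatch: pick the API that matches the request type."""
    handler = HANDLERS.get(request_type)
    if handler is None:
        raise ValueError(f"unknown request type: {request_type}")
    return handler(payload)
```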
Detailed component design
Read timeline API - The cache should be Amazon ElastiCache, allowing for managed scaling of the cache
The client should load attachments through a CDN like Amazon CloudFront, so that the images or videos are served from an edge location closer to the user and load faster
The write API should store the attachments in an object store like Amazon S3 and store a reference to the CDN URL of that object in the tweet document. For example, if we are storing objects in S3 and serving them through CloudFront, we will store the CloudFront link to that image in the tweet document
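As an illustration of what gets stored in the tweet document, the sketch below builds an object key and the matching CDN link. The upload itself is omitted; the CloudFront domain is a placeholder, and the key layout and function names are assumptions.

```python
import uuid
from urllib.parse import quote

# Placeholder distribution domain; a real one is assigned by CloudFront.
CDN_DOMAIN = "d111111abcdef8.cloudfront.net"


def attachment_key(user_id: str, filename: str) -> str:
    """Build a unique object key for an attachment (layout is an assumption)."""
    return f"attachments/{user_id}/{uuid.uuid4().hex}/{filename}"


def cdn_url(object_key: str) -> str:
    """Link stored in the tweet document's attachments field."""
    return f"https://{CDN_DOMAIN}/{quote(object_key)}"


key = attachment_key("u1", "cat.png")
print(cdn_url(key))
```

Storing the CDN link rather than the bucket URL means clients never hit the object store directly.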
Trade offs/Tech choices
We are using a NoSQL database here over a SQL database because there is no strict relational mapping between any of the actions the server needs to perform. Also, as the number of writes grows, scaling a SQL database vertically will become challenging and expensive. Using NoSQL from the beginning will allow us to scale horizontally to billions of users and write requests. Eventual consistency is also fine for a social media platform: we do not need ACID transactions to show people tweets or the number of likes, as a short delay will not affect the user experience.
We are also assuming that the user will see every tweet. If the user only wanted to see tweets from users they follow, SQL could be a better fit, since the read API could join tweets against the user's following list to select tweets whose author UID appears in it
Failure scenarios/bottlenecks
We can replicate the database so that in case of an emergency we have a backup. This can be done every hour, as losing an hour of tweets at worst does not have a big impact on user experience.
If we have a cache miss, there could be a large number of tweets that a user has to load. This means that at the tail end, one user will have a long wait while the new tweets are loaded and cached
Availability should not be a concern as we will use AWS managed services for the backend: ELB for the load balancers, EC2 for the servers, CloudFront for the CDN, S3 for the object store, ElastiCache for the read cache, and DynamoDB for the NoSQL database.
Future improvements
Have a worker that fans out new tweets to individual users' timelines. Instead of having to read from the cache or check the database, a worker would generate a timeline for the user before they request it. Then when it is requested it is guaranteed to already be generated
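A minimal in-memory sketch of the fan-out idea follows. The timeline length cap, the data structures, and the function names are assumptions; a real worker would consume new tweets from a queue and write to a shared store.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

TIMELINE_LENGTH = 100  # assumed cap on precomputed timeline size

# One precomputed timeline per user, newest tweet first.
timelines: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=TIMELINE_LENGTH))


def fan_out(tweet_id: str, recipient_ids: List[str]) -> None:
    """Worker step: push a new tweet onto every recipient's timeline."""
    for user_id in recipient_ids:
        # With maxlen set, appendleft drops the oldest entry when the deque is full.
        timelines[user_id].appendleft(tweet_id)


def read_precomputed_timeline(user_id: str) -> List[str]:
    """Read path: the timeline already exists, so no database scan is needed."""
    return list(timelines[user_id])
```

Since this design assumes every user sees every tweet, the recipient list would be all users, which makes each write expensive; the cost moves from read time to write time.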