System requirements
Functional:
- a user can post tweets
- a user can read and like tweets posted by other people
- a user can follow other users
- a user can be notified about new tweets from users he follows
- a user can search other users
Non-Functional:
- High availability
- Eventual consistency
- Read-heavy system (most tweets being read then written)
- uneven distribution of traffic
- posting a tweet should be fast
Capacity estimation
Assume 100 million active users
Assume 500 million new tweets per day
Assume that a tweet has an average size of 10 KB
Assume on average 10 users read a single tweet
50 GB of storage per day, 150 TB of storage per month
5 billion read requests a day, 150 billion a month
API design
search API: reached when a user is trying to search. Will be able to parse the input into a query based on key words and return a result
read API: reached when a user views their timeline. Will pull out the most relevant tweets to showcase.
write API: reached when a user wants to post a new tweet. Will store the tweet in the DB. A user can also like a tweet and that will also be stored in the DB through the write API
Database design
we will use a relational DB. We have to store a lot of data rigidly and will need to have many joins. a RDB will allow us that. I also allows us clear patterns for scaling.
Some of the tables in the DC can be:
Users: user_id, username, password, date_joined, followers, following
Tweets: tweet_id, content, media, time_stamp
Likes: tweet_id, user_id
To support the high traffic we will scale by using copies and master-slave replication, federation, or sharding.
We will also add a cache that will contain the most popular tweets so that they do not have to constantly be read from the DB.
We will add an object store that will have all the static content.
High-level design
the client directed through a web server to the load balancer which redistributes work. The Search API uses the search service to parse the search query a user inputs and return an answer. The read API uses the timeline service to update the timeline a user requests to see. The write API uses its services to post a new tweet and fan it out to the user's followers or like an already posted tweet.
All APIs use the database to get information. The DB has some master-slave replicas and a cache to help with the load
Request flows
Read request:
when a user wants to view a timeline, the load balancer will go to the read API, which using the timeline service will reach the database or the cache and will load a timeline with the most recent tweets posted by the users our user follows. The timeline is retrieved from the database.
Search request:
a user inputs a query. The search API using the search service parses the query and using the DB returns the most matching results.
Write request:
A user posts a new tweet. The write API, using the fanout service posts the tweet to the user's followers and notifies them using the notification service. It stores the tweet in the database and if the user id popular enough it will cache the tweet. The write API also supports the "like" option through the like service. This service will store into the DB info about the tweet liked and by which user.
Detailed component design
Read API and timeline service work with a "push" approach. Will only update a timeline when a user requests to see it and not immediately after posting a tweet. This is done to save storage and work. We can assume that not all users check their timeline at every given moment but once they do we know how to handle their request.
We can scale the database using master-slave replications. We assume that most users view timelines rather then post tweets so this is a read-heavy system. Thus, slave replicas will serve all the read requests, helping handle the massive amounts of users engaging with the platform.
Trade offs/Tech choices
I chose a relational DB even though I know that we handle massive amounts of data and it would take a lot of scaling because to handle a system of such scales with so many influencing factors we need highly structured data. Moreover, to receive all the information we need about a tweet or a user we will need complex joins which are better with a RDB.
Failure scenarios/bottlenecks
we are risking a bottle neck at the READ API. If many users try to view their timeline at once, our "push" logic for the timeline might become slow since we will have to update many timelines at once.
We can have a bottleneck at the load balancer as it redistributes traffic.
Future improvements
it is still preferable to updating timelines that will not be viewed and we can help scale the process by caching popular or very recent tweets to spare access to the DB.