System requirements


Functional:

We want to allow users to post a tweet, view feed and follow users.


Non-Functional:

We want to prioritize latency, and availability. Consistency is less important, if it takes a couple of seconds for our system to show the tweet globally that is okay.


Capacity estimation

Lets say we have 200 million daily active users on twitter, 10% of which will create tweets. so we have 20 million writes, lets say 150 char tweet is 150 bytes with some meta data. Each tweet is then 300 bytes lets average to 1mb.


Lets say of the 200 million daily each views 100 tweets per day so we have 200 million * 100 = 20 billion, and then 20billion * 1mb per tweet on avg is 20 petabytes daily, which is 20*30 600 monthly, and 600 * 12 = roughly 6000 petabytes annually.


In terms of writes we will have something like 20 million * 1mb which is 20 terabytes daily of writes.



API design

https://twitter/:user_id/ GET request that gets the feed for the user, so this will be a list of lets say 10 tweets and loading the rests using pagination. 400 request for the error and returning something like a list of


{

tweet_id,

created_at,

user_id,

image

payload

}


https://twitter/:tweet_id/ GET request of the tweet itself which returns something like


{

created_at,

user_id,

replies: [{ image, user_name, payload }]

}


or 400 if error.


https://twitter/user_id/create POST request which creates a tweet and returns


{

tweet_id,

created_at,

user_id,

image

payload

} with 201 status


or 400 if error.


Database design


We want to use a sql database for this system because we have follower/followee relationships within the database that we are using here. And users a feed is filled with tweets of the their own plus tweets of other users that they follow so naturally a sql database can these table joins and return the correct data.


the tweet schema:

{

tweet_id: PRIMARY,

created_at,

user_id,

media_url,

payload

}


the user schema:

{

user_id: Primary,

user_name,

created_at,

first_name,

last_name,

profile_picture

}


we want to shard our sql database based on user_id so that we query a smaller dataset.


High-level design


We want to use a load balancer to distribute the load accordingly to the server.


Redis LRU cache to store the most recently used tweets for quick read access.


We have a sql database which will store our tweets and user information. Everytime a feed is loaded we will do a table join of who the user is following plus the tweets and return that to the user. In terms of profile pictures or tweets that have video/images we will store the raw data in our s3 bucket. Use a message queue like amazon sqs to queue the task and then store the url of the s3 bucket in our sql database under media_url or if its a user profile picture under the user schema. profile_picture.



Request flows

Get tweet: Client access the server through a load balancer. We look at the redis cache on a tweet read, if a cache miss occurs we check the sql database for that tweet


Get feed: Client access the server through a load balancer. We do a table merge of who the user is following and their tweets, plus our own users tweets. Use some sort of algorithm to actually see what is most relevant to our user amongst these tweets, and then return the ten most recent tweets in this list.


Post tweet: Client access the server through a load balancer. If a media is present we start uploading to our s3 bucket and push to our aws sqs. Once the task is complete we write the tweet to our sql database, otherwise we just save the tweet in our sqs. We will also do write around caching so that we optimize latency for reads over writes.


Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...



Trade offs/Tech choices


LRU cache over LFU cache. If a celebrity posts a tweet and it gets a a lot of retweets, LFU could become stagnant over time. We also use write around caching which optimizes for reads over writes here.


Using a sql database means we are acid compliant but have more latency due to table joins. Although sharding our database based on user id will help us only read from relevant sql databases as opposed to the entire database. This will also reduce write latency.


Failure scenarios/bottlenecks


Large amount of users requesting a single tweet that was just created, if our read only replicas of our sql database and redis caches don't have the tweet yet, they will get an error. We can create a separate redis cache for users that have above an certain amount of followers so that their tweets are cached immediately, and other users can look at their tweets instantly.


SPOF, we need more load balancers, more redis caches, more sql databases to ensure availability if there are natural disasters especially.



Future improvements

Large amount of users requesting a single tweet that was just created, if our read only replicas of our sql database and redis caches don't have the tweet yet, they will get an error. We can create a separate redis cache for users that have above an certain amount of followers so that their tweets are cached immediately, and other users can look at their tweets instantly.


SPOF, we need more load balancers, more redis caches, more sql databases to ensure availability if there are natural disasters especially.