System requirements
Assuming we already have a user login system that handles authentication and authorization for us
Functional:
- users should be able to make a post, with an attached photo
- users should be able to re-post other people's tweets
- users should be able to respond to other people's tweets
- users should be able to follow and unfollow users
Non-Functional:
- a user's news feed should combine recency and engagement: show recent tweets, prioritized by engagement
- prioritize loading newsfeed quickly over optimizing the order of tweets perfectly, load 15 tweets at a time
- posts users make should show up the next time their followers refresh their news feed
- prioritizing speed and scalability over being incredibly accurate
- rate limit users to at most 10 posts per minute
- system needs to be scalable as we expect to continue growing in number of daily users and posts
- system needs to be available and fast
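The 10-posts-per-minute limit above could be enforced with a sliding-window counter per user. The sketch below is only one possible approach; the class name, the in-memory storage, and the injectable clock are assumptions for illustration, and a real deployment would more likely keep the counters in the API gateway or a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allows at most `limit` posts per `window_seconds` for each user."""

    def __init__(self, limit=10, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._events = defaultdict(deque)  # user_id -> timestamps of accepted posts

    def allow(self, user_id):
        now = self.clock()
        events = self._events[user_id]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.limit:
            return False  # over the limit: reject this post
        events.append(now)
        return True
```

A post handler would call `allow(user_id)` before writing the tweet and reject the request (for example with HTTP 429) when it returns False.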
Capacity estimation
Estimate the scale of the system you are going to design...
- expect 100k daily active users, with an average of 3 posts per user -> 300k posts per day
- 40% of tweets will have photos -> 120k tweets with photos
- 20% of tweets will have GIFs -> 60k tweets with GIFs
- 10% of tweets will have videos -> 30k tweets with videos
- budget the full 280 bytes of text for every tweet -> 300k * 280 B = 84 MB of text per day
- photo tweets -> 5 MB per photo * 120k photos -> 600k MB -> 600 GB per day
- GIF tweets -> 15 MB per GIF * 60k -> 900k MB -> 900 GB per day
- video tweets -> 512 MB per video * 30k -> 15.36M MB -> ~15 TB per day
- total -> roughly 16.9 TB of new data per day, dominated by video
to reduce GIF storage we can rely on a shared bank of GIFs: each GIF is stored once, and a tweet stores a link to the GIF it uses instead of its own copy
we can also cache the most popular media and the most popular tweets for faster retrieval
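The estimate above can be reproduced with a few lines of arithmetic. This sketch uses decimal units (1 MB = 10^6 bytes), as the figures above do; the variable names are illustrative only.

```python
DAILY_ACTIVE_USERS = 100_000
POSTS_PER_USER = 3
TEXT_BYTES_PER_TWEET = 280
MB = 10**6  # decimal megabytes, matching the estimate above

posts_per_day = DAILY_ACTIVE_USERS * POSTS_PER_USER

# media type -> (percentage of tweets, bytes per item)
media = {
    "photo": (40, 5 * MB),
    "gif": (20, 15 * MB),
    "video": (10, 512 * MB),
}

text_bytes = posts_per_day * TEXT_BYTES_PER_TWEET
media_bytes = {
    kind: (posts_per_day * percent // 100) * size
    for kind, (percent, size) in media.items()
}
total_bytes = text_bytes + sum(media_bytes.values())

print(f"posts per day: {posts_per_day}")
print(f"text:  {text_bytes / MB:.0f} MB")
for kind, size in media_bytes.items():
    print(f"{kind}: {size / 10**9:.0f} GB")
print(f"total: {total_bytes / 10**12:.2f} TB per day")
```

Video accounts for about 15.36 TB of the roughly 16.86 TB daily total, so video storage and delivery drive most of the capacity planning.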
API design
Define what APIs are expected from the system...
Assuming headers will contain information about which user is making the request
POST /tweet/ - allows users to post a tweet, can take in a media attachment
returns 200 upon successful write of tweet to DB
returns error if write is not successful
automatically retries on error, using an exponential retry system before declaring total failure after 3 retries
POST /retweet/ - allows users to retweet another user's post
returns 200 upon successfully writing a link to the original tweet into the retweeting user's list of tweets in the DB
returns error if write is not successful, using an exponential retry system for 3 retries before declaring total failure
GET /newsfeed/ - returns a user's newsfeed: fetches tweets made in the last 3 days by the people they follow, then runs them through a prioritization algorithm that sorts them by engagement
returns 200 and the newsfeed
returns a 5xx error on failure (404 is reserved for a resource that does not exist), using an exponential retry system
POST /response/ - allows users to respond to a tweet
re
DELETE /deleteTweet/ - allows a user to delete one of their own tweets
returns 200 if the user is authorized and the delete is successful
returns an error otherwise
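Several endpoints above retry with an exponential retry system and give up after 3 retries. A minimal sketch of that behavior follows; the function name, the base delay of 0.5 seconds, and the jitter fraction are assumptions rather than part of the design, and a real client would retry only on transient errors instead of on every exception.

```python
import random
import time


def call_with_backoff(operation, max_retries=3, base_delay=0.5,
                      jitter=0.1, sleep=time.sleep):
    """Run `operation`, retrying up to `max_retries` times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # total failure: the final retry also failed
            # The delay doubles on each retry: base, 2 * base, 4 * base.
            delay = base_delay * (2 ** attempt)
            # A small random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))
            attempt += 1
```

With the defaults, a request that keeps failing is attempted 4 times in total (1 initial call plus 3 retries), waiting roughly 0.5 s, 1 s, and 2 s between attempts.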
Database design
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
User DB
userID - primary key
account creation date - datetime
tweets - list of tweetIDs associated with an account
follows - list of userIDs user follows
followers - list of userIDs following user
Tweet DB
tweetID - primary key
text of tweet
mediaID - foreign key of media contained in tweet
author - foreign key of userID
creation timestamp
Media DB
media ID - primary key
media
tweetIDs - list of tweets associated with media
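The three tables above can be sketched as plain record types to make the relationships concrete. This is an illustration, not a schema definition: the field types are assumptions, and `location` assumes the Media DB stores a URL or object-store key rather than the raw media bytes. The hypothetical `follow` and `unfollow` helpers show that the two follower lists on the User record have to be updated together.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class User:
    user_id: str                                         # primary key
    created_at: datetime                                 # account creation date
    tweets: List[str] = field(default_factory=list)      # tweetIDs for this account
    follows: List[str] = field(default_factory=list)     # userIDs this user follows
    followers: List[str] = field(default_factory=list)   # userIDs following this user


@dataclass
class Tweet:
    tweet_id: str                    # primary key
    text: str
    author_id: str                   # foreign key to User.user_id
    created_at: datetime
    media_id: Optional[str] = None   # foreign key to Media.media_id, if any


@dataclass
class Media:
    media_id: str                    # primary key
    location: str                    # assumption: URL or object-store key
    tweet_ids: List[str] = field(default_factory=list)


def follow(follower: User, followee: User) -> None:
    """Record the relationship on both sides, ignoring duplicates."""
    if followee.user_id not in follower.follows:
        follower.follows.append(followee.user_id)
    if follower.user_id not in followee.followers:
        followee.followers.append(follower.user_id)


def unfollow(follower: User, followee: User) -> None:
    """Remove the relationship from both sides if it exists."""
    if followee.user_id in follower.follows:
        follower.follows.remove(followee.user_id)
    if follower.user_id in followee.followers:
        followee.followers.remove(follower.user_id)
```

Keeping followers as lists on the User record is what the design above describes; for accounts with very large follower counts, a separate follow table keyed by (follower, followee) may scale better.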
High-level design
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...
the user accesses the application via a web UI or an app, logs in, is authenticated, and is issued a token with a TTL so they can keep interacting with the application without logging in again
all user requests will pass through an API gateway to allow for built in rate limiting and authorization services
all requests will hit a load balancer that will route the request to the right service and ensure an even level of load across all hosts
the service will pull the needed information from the database and return it, or return an error if the database read was unsuccessful
the service will write to the database if the request is a POST
Service break down:
User service will manage:
- creation and deletion of users
- follower management
Tweet service will manage:
- creation and deletion of tweets
- associating a tweet with a userID
- fetching tweets for newsfeed and storing them in the cache
Media service will manage:
- uploading of media
- fetching media
- association of media with tweets
all databases will have backup, read-only replicas that are kept up to date using a gossip protocol
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
user logs in -> gets authenticated through the API gateway and receives an auth token so we don't have to re-authenticate on later requests -> a newsfeed request is generated automatically and passes through the load balancer -> the first 15 tweets are returned and the rest are placed in a cache for faster access as the user scrolls
user posts a tweet -> tweet is written to tweet database
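user retweets a tweet -> link to the original tweet is added to the user's list of tweet IDs in the DB
user follows another user -> the follower's userID is written to the followed user's followers list, and the followed user's userID is written to the follower's follows list in the DB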
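The newsfeed flow above (tweets from followed users in the last 3 days, ranked by engagement, served a page at a time with the remainder cached) can be sketched as follows. The engagement weights and the `likes`, `retweets`, and `replies` fields are hypothetical; the design does not specify how engagement is measured. The page size of 15 comes from the non-functional requirements.

```python
from datetime import datetime, timedelta

PAGE_SIZE = 15               # tweets loaded at a time
MAX_AGE = timedelta(days=3)  # only tweets from the last 3 days are considered


def engagement_score(tweet):
    # Hypothetical weighting: replies count most, then retweets, then likes.
    return tweet["likes"] + 2 * tweet["retweets"] + 3 * tweet["replies"]


def build_newsfeed(tweets, followed_ids, now):
    """Return (first_page, remainder) for a user who follows `followed_ids`.

    The first page is returned to the client right away; the remainder is
    what would be placed in the cache for later scrolling.
    """
    cutoff = now - MAX_AGE
    candidates = [
        t for t in tweets
        if t["author"] in followed_ids and t["created_at"] >= cutoff
    ]
    # Highest engagement first; newer tweets win ties.
    candidates.sort(
        key=lambda t: (engagement_score(t), t["created_at"]),
        reverse=True,
    )
    return candidates[:PAGE_SIZE], candidates[PAGE_SIZE:]
```

Sorting the whole candidate set on every request is the simple version; since the requirements favor speed over perfect ordering, the ranked list could be computed once per refresh and then served from the cache.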
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
- each service will use a load balancer to help scale up and down as traffic spikes
- databases will have to be partitioned (sharded) on their primary key, and services will have to use a hash function to know which partition to access (the specific hashing algorithm will depend on the service and database being accessed) to allow for horizontal DB scaling
- we will use a redis cache for caching the newsfeed of active users, as well as the currently popular tweets and the media associated with them
- we will use a least recently used method for clearing the cache when we run out of space in the cache
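Two of the points above, hash-based partition lookup and least-recently-used eviction, can be illustrated in a few lines. Both pieces are sketches under assumptions: `shard_for` uses SHA-256 with a simple modulo, which is only one possible hashing scheme, and `LRUCache` is an in-process stand-in meant to show the eviction policy, not a replacement for Redis.

```python
import hashlib
from collections import OrderedDict


def shard_for(key: str, num_shards: int) -> int:
    """Map a primary key to a partition index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # reading an entry marks it as recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```

With a plain modulo, changing the number of partitions remaps most keys, so consistent hashing is a common alternative when partitions are added often. Redis itself can apply LRU eviction through its `maxmemory-policy` setting (for example `allkeys-lru`), so the cache service would not need to implement eviction by hand.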
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
- SQL databases give us indexed reads and joins over relational data such as follows and authorship, but schema migrations can be a pain to manage when the schema needs to change
- using a gossip protocol means replicas are eventually consistent rather than immediately consistent, so newsfeeds may not be 100% accurate; for something like social media it's not a big deal if a user briefly misses a tweet that someone they follow just made
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
- a sudden traffic spike with lots of writes can overload the database: latency increases as requests wait for locks to be released before they can write, and requests can time out
- if the API gateway is failing - either the gateway service we use is down, or it somehow gets disconnected from the services - then the whole application will be down
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?
- be able to block users
- build in a system to delete users who have not been active in years to free up storage space
- we can use a content distribution network to deliver the media components of tweets faster if we're running into consistent latency issues
- for an overwhelming amount of database writes, we could implement a queue system where the tweets that need to be written are added to the queue, the tweet POST request returns a 200, and the tweet will eventually be written to the database for access as the queue is processed
- having a failover API gateway, or a backup authentication and authorization system that can take over if the gateway fails, will help to mitigate that scenario
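The write queue described above could look roughly like this. The class and method names are hypothetical, and the in-process `queue.Queue` stands in for what would normally be a durable message queue; with an in-memory queue, accepted tweets are lost if the process crashes before they are written.

```python
import queue


class TweetWriteBuffer:
    """Buffers tweet writes so that POST requests do not wait on the database."""

    def __init__(self, write_to_db, max_pending=10_000):
        self._queue = queue.Queue(maxsize=max_pending)
        self._write_to_db = write_to_db  # callable that performs the real DB write

    def submit(self, tweet):
        """Called by the POST handler. Returns True if the tweet was accepted."""
        try:
            self._queue.put_nowait(tweet)
            return True
        except queue.Full:
            return False  # queue is full: the handler should return an error

    def drain(self, max_items=100):
        """Called by a background worker. Writes up to `max_items` tweets in order."""
        written = 0
        while written < max_items:
            try:
                tweet = self._queue.get_nowait()
            except queue.Empty:
                break
            self._write_to_db(tweet)
            written += 1
        return written
```

Because the tweet is not yet stored when the request returns, responding with 202 Accepted instead of 200 may describe the behavior more accurately. Bounding the queue with `max_pending` keeps a long write spike from exhausting memory.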