System requirements
Functional:
Compose a tweet
Share a tweet
See tweets from other users
See metadata on the tweet
Favorite a tweet
Let's assume that tweets are completely raw text for now, and we can consider images, videos, etc. later
Non-Functional:
Extremely high capacity storage
Scales quickly during traffic spikes (major events can generate a burst of tweets)
Tweets need to load quickly (under 50 ms)
Capacity estimation
300 Million DAU
500 million tweets daily
Each user follows 100 other users
Average size of a tweet -> 128 bytes + random metadata -> 0.5 KB per tweet * 500 million tweets daily = 5 * 10^2 bytes * 5 * 10^8 = 2.5 * 10^11 bytes / 2^30 = roughly 230 gigabytes per day, and tweets are usually stored indefinitely on Twitter
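The storage estimate can be checked with a few lines of arithmetic. This is a minimal sketch that only multiplies the figures stated above (500 million tweets per day, roughly 500 bytes each); the constant and function names are made up for illustration.

```python
# Back-of-the-envelope storage estimate using the figures from this section.
TWEETS_PER_DAY = 500_000_000   # 500 million tweets daily
BYTES_PER_TWEET = 500          # ~0.5 KB: 128 bytes of text plus metadata


def daily_storage_bytes(tweets_per_day: int = TWEETS_PER_DAY,
                        bytes_per_tweet: int = BYTES_PER_TWEET) -> int:
    """Raw bytes written per day, before replication or indexes."""
    return tweets_per_day * bytes_per_tweet


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2^30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    daily = daily_storage_bytes()
    print(f"{daily:.2e} bytes/day, about {to_gib(daily):.0f} GiB/day")
    print(f"about {to_gib(daily * 365) / 1024:.0f} TiB/year")
```

Since tweets are kept indefinitely, the yearly figure is the one that drives storage planning, and replication would multiply it further.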
API design
I think we can consider the following APIs:
/tweet/compose
- This will compose a tweet and return a tweet ID of the tweet we composed
parameters:
- tweet text
- user id
/tweet/share
- This will share a tweet to the user's followers
/user/feed
- This will return the tweets for a given user, so you can query tweet info for each tweet per user
/tweet/favorite
- This will add a favorite to the tweet and return the tweet Id
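To make the request and response shapes concrete, here is a rough in-memory sketch of the four endpoints. The class and method names are hypothetical, there is no web framework or persistence involved, and the choice that the feed returns tweets the user composed or shared is an assumption, since the notes above do not pin that down.

```python
import itertools
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class InMemoryTweetApi:
    """Hypothetical stand-in for the endpoints; everything lives in memory."""
    tweets: dict = field(default_factory=dict)      # tweet_id -> (user_id, text)
    favorites: set = field(default_factory=set)     # (user_id, tweet_id) pairs
    shares: set = field(default_factory=set)        # (user_id, tweet_id) pairs
    _ids: Iterator[int] = field(
        default_factory=lambda: itertools.count(1), repr=False)

    def _require(self, tweet_id: int) -> None:
        if tweet_id not in self.tweets:
            raise KeyError(tweet_id)

    def compose(self, user_id: int, text: str) -> int:
        """/tweet/compose: store the tweet and return its new ID."""
        tweet_id = next(self._ids)
        self.tweets[tweet_id] = (user_id, text)
        return tweet_id

    def share(self, user_id: int, tweet_id: int) -> int:
        """/tweet/share: record that user_id shared the tweet."""
        self._require(tweet_id)
        self.shares.add((user_id, tweet_id))
        return tweet_id

    def favorite(self, user_id: int, tweet_id: int) -> int:
        """/tweet/favorite: record a favorite and return the tweet ID."""
        self._require(tweet_id)
        self.favorites.add((user_id, tweet_id))
        return tweet_id

    def feed(self, user_id: int) -> list:
        """/user/feed: IDs of tweets the user composed or shared (assumption)."""
        composed = {tid for tid, (author, _) in self.tweets.items()
                    if author == user_id}
        shared = {tid for uid, tid in self.shares if uid == user_id}
        return sorted(composed | shared)
```

Storing favorites and shares as (user, tweet) pairs makes repeated calls idempotent, which is convenient if clients retry requests.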
Database design
Tweet consists of the following information:
- User - FK
- Tweet text
- Timestamp
Likes Table - This can be used by querying a tweet Id, and doing a count to find all entries
- ID
- User - FK
- Timestamp
- Tweet Id - FK
Shares Table - This can be used by querying a tweet ID to find all shares of that tweet
- ID
- User - FK
- Timestamp
- Tweet Id - FK
User
- User ID
- Password (hashed)
- Profile Photo
- Headline
- Joined Date
- Verified
- Other metadata
Follows - this allows us to find all the followed users for a user
- User ID
- Followed User ID
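The tables above can be sketched with SQLite to show how the two lookups mentioned (counting likes for a tweet, listing the users someone follows) would work. SQLite is used here only because it ships with Python; it is not the storage engine this design proposes. The column names and the explicit tweet_id key on the tweets table are assumptions added for illustration.

```python
import sqlite3

# Illustrative schema; column names are assumptions based on the field lists.
SCHEMA = """
CREATE TABLE users   (user_id INTEGER PRIMARY KEY, headline TEXT);
CREATE TABLE tweets  (tweet_id INTEGER PRIMARY KEY,
                      user_id INTEGER REFERENCES users(user_id),
                      text TEXT, ts INTEGER);
CREATE TABLE likes   (id INTEGER PRIMARY KEY,
                      user_id INTEGER REFERENCES users(user_id),
                      tweet_id INTEGER REFERENCES tweets(tweet_id),
                      ts INTEGER);
CREATE TABLE follows (user_id INTEGER REFERENCES users(user_id),
                      followed_user_id INTEGER REFERENCES users(user_id),
                      PRIMARY KEY (user_id, followed_user_id));
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def like_count(conn: sqlite3.Connection, tweet_id: int) -> int:
    """Count the likes rows that reference one tweet."""
    row = conn.execute(
        "SELECT COUNT(*) FROM likes WHERE tweet_id = ?", (tweet_id,)
    ).fetchone()
    return row[0]


def followed_users(conn: sqlite3.Connection, user_id: int) -> list:
    """Return the IDs of the users that user_id follows."""
    rows = conn.execute(
        "SELECT followed_user_id FROM follows WHERE user_id = ? "
        "ORDER BY followed_user_id", (user_id,)
    ).fetchall()
    return [r[0] for r in rows]
```

Counting rows on every read is fine for a sketch, but at the volumes estimated earlier a cached or precomputed counter would likely be needed.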
High-level design
To develop this service, we'll need several components.
Because of the sheer load, we'll split the system into multiple services, each sitting behind a load balancer.
- User Metadata Service
- Tweet Service
- Newsfeed Service
These will all be backed by a NoSQL database because it scales horizontally and offers fast key-based lookups. There's a strong relational component here, but a relational database such as PostgreSQL is harder to scale out to this write volume than a NoSQL store. The databases will be replicated across multiple availability zones (AZs) to ensure high availability.
We should add a note here about how we plan to shard the database.
Furthermore, each of these services will need a cache like Redis backing it so that we can easily retrieve metadata, feeds, etc as needed.
To dig further into the Tweet Service, we need to consider the two important paths, which are compose and delete. These critical actions need strong consistency. That means a published tweet must be visible as soon as the write succeeds, since it'd be odd for a tweet to not show up after being published. Likewise, a deleted tweet must stop appearing once the delete succeeds.
For likes and shares, we just need eventual consistency since it's not super important the likes number is completely up to date, and then it's also fine if we don't have every single share.
For the newsfeed, we can have a service that dedicates itself to traversing the graph and finding all the different tweets for the different users that are followed by a user.
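The graph traversal described for the newsfeed can be sketched as a pull-style merge: look up who the user follows, then merge those users' timelines newest first. This is a minimal sketch under assumptions: the function name, the dictionary-based inputs, and the requirement that each timeline is already sorted newest first are all hypothetical choices.

```python
import heapq
from itertools import islice


def build_feed(user_id, follows, tweets_by_user, limit=20):
    """Merge the timelines of everyone user_id follows, newest first.

    follows:        user_id -> set of followed user IDs
    tweets_by_user: user_id -> list of (timestamp, tweet_id),
                    each list already sorted newest first (assumption)
    """
    timelines = [tweets_by_user.get(followed, [])
                 for followed in follows.get(user_id, ())]
    # heapq.merge with reverse=True expects inputs sorted largest to smallest
    # and yields one stream in the same order without sorting everything.
    merged = heapq.merge(*timelines, key=lambda item: item[0], reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, limit)]
```

With about 100 followed users per person, merging lazily and stopping at the limit avoids reading every tweet from every followed account.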
Request flows
The flow depends on the type of request the client makes.
For example, if it's a post, we'll need to authenticate and authorize the user before accepting it. This is probably the case for most of these endpoints since we want them to be protected: no one should be able to access a tweet without authorization unless it is considered "public".
For a read, the service first checks the cache for the tweet. If it is there, we can just return it. Otherwise, we'll have to retrieve the data from the underlying databases and return it.
For newsfeed, we'll make a request and it'll check in the cache if the user has recently loaded their newsfeed. We can then just load that newsfeed if it's already in there. We can have a TTL of 60 seconds, so that when they refresh the newsfeed, it'll make a new query to the NewsFeed service to get the information.
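The 60-second TTL behaviour can be sketched with a small in-process cache. In the design above this role would be played by Redis; the class below is only a hypothetical stand-in, and the injectable clock exists so the expiry logic can be exercised without waiting.

```python
import time


class TtlCache:
    """Tiny cache whose entries expire ttl_seconds after they are loaded."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call loader(key) on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: serve from cache
        value = loader(key)          # e.g. a call to the newsfeed service
        self._entries[key] = (now + self._ttl, value)
        return value
```

A refresh within the TTL window returns the same feed; the first request after expiry goes back to the newsfeed service and repopulates the entry.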
We could use a CDN for content as well.
Detailed component design
The home feed service essentially returns the most interesting tweets for a user. To determine that, we need some algorithm that looks at the tweets the person is liking/viewing and uses that to weight each candidate tweet. We can also rank a tweet by how much it's liked/followed.
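One possible shape for such a ranking is a popularity score scaled by how often the viewer has engaged with the author. This is only a sketch under assumptions: the field names, the weights, and the use of like and share counts as the popularity signals are hypothetical and are not specified anywhere in the design above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tweet_id: int
    author_id: int
    like_count: int
    share_count: int


def author_affinity(author_id, liked_authors):
    """1 plus the number of times the viewer liked this author's tweets."""
    return 1 + liked_authors.count(author_id)


def rank(candidates, liked_authors, like_weight=1.0, share_weight=2.0):
    """Order candidates by affinity-weighted popularity, highest first.

    liked_authors: list of author IDs, one entry per tweet the viewer liked.
    The weights are arbitrary placeholders, not tuned values.
    """
    def score(c):
        popularity = like_weight * c.like_count + share_weight * c.share_count
        return author_affinity(c.author_id, liked_authors) * popularity

    return sorted(candidates, key=score, reverse=True)
```

Multiplying rather than adding the affinity means a moderately popular tweet from a frequently liked author can outrank a more popular tweet from an author the viewer has never engaged with.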
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Scenarios to consider:
- Controversial tweets/High traffic tweets
- Notification of tweets that have been deleted
- Tweet by a person will be sent to tons of people and can overwhelm notification service
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?