System requirements


Functional:

Compose a tweet

Share a tweet

See tweets from other users

See metadata on the tweet

Favorite a tweet


Let's assume that tweets are completely raw text for now, and we can consider images, videos, etc. later


Non-Functional:

Extremely high capacity storage

Scales quickly during big spikes (events may generate a lot of tweets)

Tweets need to load quickly (< 50 ms)




Capacity estimation

300 Million DAU

500 million tweets daily

Each user follows 100 other users

Average size of a tweet -> 128 bytes of text + metadata -> ~0.5 KB per tweet. 0.5 KB * 500 million tweets daily = (5 * 10^2 bytes) * (5 * 10^8) = 2.5 * 10^11 bytes; divided by 2^30, that is roughly 230 GB per day. Tweets are usually stored indefinitely on Twitter, so this total keeps growing.
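The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 500-byte tweet size is the rounded figure from the estimate, and the yearly number assumes the daily volume stays constant, which is an assumption added here.

```python
# Back-of-the-envelope storage estimate using the figures above.
TWEETS_PER_DAY = 500_000_000   # 500 million tweets daily
BYTES_PER_TWEET = 500          # ~0.5 KB: 128 bytes of text plus metadata

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET   # 2.5 * 10^11 bytes
gib_per_day = bytes_per_day / 2**30                # bytes -> GiB
tib_per_year = gib_per_day * 365 / 1024            # assumes constant daily volume

print(f"{gib_per_day:.0f} GiB per day, {tib_per_year:.0f} TiB per year")
# prints "233 GiB per day, 83 TiB per year"
```

Since nothing is ever deleted in this model, storage grows linearly, which is what motivates the "extremely high capacity storage" requirement.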




API design

I think we can consider the following APIs:


/tweet/compose

  • This will compose a tweet and return the ID of the new tweet

parameters:

  • tweet text
  • user id

/tweet/share

  • This will share a tweet to the user's followers

/user/feed

  • This will return the tweets in a given user's feed; the client can then query the details of each tweet

/tweet/favorite

  • This will add a favorite to the tweet and return the tweet ID
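As a rough illustration of the request and response shapes above, the following Python sketch models three of the endpoints as methods on an in-memory class. The class name `TweetApi`, the method signatures (including the parameters for share and favorite, which the list above does not specify), and the storage layout are all assumptions made for this example. The feed endpoint is left out because it depends on the follow graph.

```python
import itertools


class TweetApi:
    """Toy in-memory stand-in for the endpoints; all names are hypothetical."""

    def __init__(self):
        self._next_id = itertools.count(1)  # simple incrementing tweet IDs
        self.tweets = {}      # tweet_id -> {"user_id": ..., "text": ...}
        self.favorites = {}   # tweet_id -> set of user IDs who favorited it
        self.shares = {}      # tweet_id -> set of user IDs who shared it

    def compose(self, user_id, text):
        """/tweet/compose: store the tweet and return its new ID."""
        tweet_id = next(self._next_id)
        self.tweets[tweet_id] = {"user_id": user_id, "text": text}
        self.favorites[tweet_id] = set()
        self.shares[tweet_id] = set()
        return tweet_id

    def share(self, user_id, tweet_id):
        """/tweet/share: record that user_id shared the tweet."""
        if tweet_id not in self.tweets:
            raise KeyError(f"unknown tweet {tweet_id}")
        self.shares[tweet_id].add(user_id)
        return tweet_id

    def favorite(self, user_id, tweet_id):
        """/tweet/favorite: record the favorite and return the tweet ID."""
        if tweet_id not in self.tweets:
            raise KeyError(f"unknown tweet {tweet_id}")
        self.favorites[tweet_id].add(user_id)  # a set keeps this idempotent
        return tweet_id
```

Storing favorites as a set means a repeated favorite from the same user is counted once, which is one reasonable choice if clients may retry requests.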





Database design

A tweet consists of the following information:

  • Tweet ID
  • User - FK
  • Tweet text
  • Timestamp


Likes Table - the like count for a tweet is found by querying on the tweet ID and counting the matching entries

  • ID
  • User - FK
  • Timestamp
  • Tweet ID - FK


Shares Table - the shares of a tweet are found by querying on the tweet ID

  • ID
  • User - FK
  • Timestamp
  • Tweet ID - FK


User

  • User ID
  • Password (hashed)
  • Profile Photo
  • Headline
  • Joined Date
  • Verified
  • Other metadata


Follows - this allows us to find all the users that a given user follows

  • User ID
  • Followed User ID
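To make the two lookups described above concrete (counting likes for a tweet and finding who a user follows), here is a minimal Python sketch over in-memory records. The dataclass and function names are hypothetical, and the linear scans stand in for what would be indexed queries in a real database.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Like:
    id: int
    user_id: str        # FK to User
    tweet_id: int       # FK to Tweet
    timestamp: datetime


@dataclass(frozen=True)
class Follow:
    user_id: str            # the follower
    followed_user_id: str   # the user being followed


def count_likes(likes, tweet_id):
    """Count the Likes rows that reference the given tweet."""
    return sum(1 for like in likes if like.tweet_id == tweet_id)


def followed_users(follows, user_id):
    """Return the IDs of all users that user_id follows."""
    return [f.followed_user_id for f in follows if f.user_id == user_id]
```

In a real store, both lookups would need an index (or partition key) on `tweet_id` and `user_id` respectively, since scanning every row per request would not meet the latency target.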



High-level design


To develop this service, we'll need a few different components.


We'll need multiple services due to the sheer load, and each will sit behind a load balancer.


  1. User Metadata Service
  2. Tweet Service
  3. Newsfeed Service


These will all be backed by a NoSQL database for its horizontal scalability and query speed. There's a strong relational component here, but PostgreSQL is harder to scale horizontally than a NoSQL store. The databases will be replicated across multiple availability zones (AZs) to ensure high availability.


We should add a note here about how we plan to shard the database.
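The sharding plan is still open. As one possible starting point only, the sketch below shows hash-based sharding keyed on user ID. The shard count, the choice of user ID as the key, and the function name `shard_for` are all assumptions for illustration, not decisions made in this design.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count, chosen only for the example


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() of a
    string is randomized per Python process and so is not stable.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Keying on user ID keeps all of one user's tweets on the same shard, but very active users can then create hot shards. Plain modulo hashing also remaps most keys when the shard count changes; consistent hashing is a common way to reduce that.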


Furthermore, each of these services will need a cache like Redis in front of its database so that we can quickly retrieve metadata, feeds, etc. as needed.


To dig further into the Tweet Service, we need to consider its two critical paths: compose and delete. Both need strong consistency. Once a tweet is published, reads must reflect it, since it would be odd for a published tweet not to show up. Likewise, a deleted tweet must reliably stop appearing.


For likes and shares, eventual consistency is enough: it's not critical that the like count is completely up to date, and it's fine if not every share is visible right away.


For the newsfeed, we can have a dedicated service that traverses the follow graph and collects the tweets from all the users that a given user follows.
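A minimal sketch of that traversal, assuming each user's tweets are already available sorted newest first (a "fan-out on read" approach). The function name, the dictionary-based inputs, and the `(timestamp, tweet_id)` tuple layout are hypothetical choices for the example.

```python
import heapq
from itertools import islice


def build_feed(user_id, follows, tweets_by_user, limit=20):
    """Merge the timelines of everyone user_id follows, newest first.

    follows:        dict of user ID -> list of followed user IDs
    tweets_by_user: dict of user ID -> list of (timestamp, tweet_id),
                    each list already sorted newest first
    """
    timelines = [tweets_by_user.get(followed, [])
                 for followed in follows.get(user_id, [])]
    # heapq.merge lazily merges sorted inputs; reverse=True expects each
    # input to be sorted from largest to smallest key.
    merged = heapq.merge(*timelines, key=lambda t: t[0], reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, limit)]


follows = {"alice": ["bob", "carol"]}
tweets_by_user = {
    "bob": [(30, "b2"), (10, "b1")],
    "carol": [(20, "c1")],
}
print(build_feed("alice", follows, tweets_by_user))  # prints ['b2', 'c1', 'b1']
```

With 100 followed users per person this merge is cheap, but for users who follow many accounts it may be worth precomputing feeds on write instead; that trade-off is not decided here.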



Request flows


The client makes a request, and the flow depends on the type of request.


For example, if it's a post, we need to authenticate and authorize the user before accepting the post. This is probably the case for most of these endpoints, since we want them to be protected: no one should be able to access them without authorization, unless the tweet being read is considered "public".


For a tweet read, the service first checks the cache for the tweet. If it is there, we can just return it. Otherwise, we have to retrieve the data from the relevant databases and return it.
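This read path is commonly called cache-aside. A minimal sketch, assuming a dict-like cache and a hypothetical `load_from_db` callable standing in for the database lookups:

```python
def get_tweet(tweet_id, cache, load_from_db):
    """Return a tweet, checking the cache before the database.

    cache:        any dict-like object (stand-in for Redis)
    load_from_db: hypothetical callable returning the tweet or None
    """
    tweet = cache.get(tweet_id)
    if tweet is not None:
        return tweet                   # cache hit: no database access
    tweet = load_from_db(tweet_id)     # cache miss: go to the database
    if tweet is not None:
        cache[tweet_id] = tweet        # populate the cache for later reads
    return tweet
```

Because compose and delete need strong consistency, a delete would also have to remove the cached entry, otherwise a deleted tweet could keep being served from the cache.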


For the newsfeed, the request first checks the cache to see whether the user has recently loaded their newsfeed. If so, we can just return the cached newsfeed. We can use a TTL of 60 seconds, so that a refresh after the entry expires makes a new query to the Newsfeed Service to get up-to-date information.
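The 60-second expiry can be sketched with a small in-process TTL cache. The class name `TtlCache` and the injectable clock are assumptions made so the example is self-contained and testable; with Redis, the same effect would come from setting an expiry on the key (the EX option of SET).

```python
import time


class TtlCache:
    """Tiny TTL cache; a stand-in for a Redis key with an expiry."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # key -> (value, expires_at)

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]   # expired: drop it and report a miss
            return None
        return value
```

On a miss, the caller would query the Newsfeed Service and `put` the fresh feed back, so a user sees a feed that is at most about 60 seconds stale.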


We could use a CDN for content as well.


Detailed component design


The Home feed service essentially returns the most interesting tweets for a user. To determine that, we need some algorithm that looks at the tweets the person is liking/viewing and uses that to weight each candidate tweet. We can also score a tweet by how much it's liked and how much its author is followed.
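One very simple way to express such a weighting is sketched below. The scoring formula, the 1.5 affinity boost, and the dictionary fields are all made up for illustration; the design does not specify a ranking algorithm.

```python
def score(tweet, liked_authors):
    """Score a tweet from its like count and the viewer's affinity.

    tweet:         dict with hypothetical keys "author" and "likes"
    liked_authors: set of authors whose tweets the viewer has liked
    """
    affinity = 1.5 if tweet["author"] in liked_authors else 1.0  # assumed boost
    return tweet["likes"] * affinity


def rank_feed(tweets, liked_authors):
    """Return tweets ordered from highest to lowest score."""
    return sorted(tweets, key=lambda t: score(t, liked_authors), reverse=True)
```

A production ranker would likely combine many more signals, but even this toy version shows the two inputs mentioned above: overall popularity and the user's own past behavior.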



Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...






Failure scenarios/bottlenecks

Scenarios to consider:

  • Controversial tweets/High traffic tweets
  • Notification of tweets that have been deleted
  • Tweet by a person will be sent to tons of people and can overwhelm notification service



Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?