System requirements


Functional:

  • Create tweet : text, photos, videos
  • character size limit : 140
  • Follow others
  • View timeline of the user : top tweets by people user follows


Non-Functional:

  • Read-heavy system : eventual consistency is ok
  • high availability, high reliability : information should be accurate
  • low latency : for generating timeline for user



Capacity estimation

  • 500 Mil total users
  • DAU : 200 mil
  • total reads/day : assme 100 reads per user per day -> 200 mil * 100 = 20 Billion reads
  • read storage estimate :
  • avg size of tweet - text (1KB) + image (500KB) + video (10MB) -> avg 1MB
  • daily read storage : 1 MB * 20 Billion = 20 PB / day
  • Read-heavy system
  • total writes/day : assume 50 mil people write, on average 2 tweets a day
  • daily write storage : 1MB * 50 mil * 2 = 1 MB * 100 mil = 100 TB / day
  • writes/second = 100 mil (writes/day) / 100000 (sec/day) = 1000 writes/sec
  • every user follows on average 100 people on twitter
  • some power users can have more followers = millions
  • think about celebrity tweet reading and storing


API design


createTweet(user_id, tweet_id, content, timestamp, authorization_token)


getFeed(user_id, authorization_token) : get feed (timeline) for current user


follow(user_id, target_user_id)


Database design


Option 1 : Relational DB to store the tweet metadata and Object Store (S3, GCS) to store media files since they are large in volume.

  • Advantages: support complex relationships, faster querying, maintain data integrity and consistency, highly realiable

Tweet table :

  • tweet_id : UUID
  • created_at: timestamp
  • user_id : UUID
  • content: text, reference to the media stored in object store


Follow table :

  • followee : people followed by follower
  • follower : user who follows other people
  • We want to get "Followees" of the current user quickly to generate timeline ->
  • index DB by "Followers" -> for faster search
  • groupby all the people follower (users) is following


System is going to be read-heavy. For relational DB it could be difficult to scale and reduce complexity as the data increases in volume


For this reason, we can consider NoSQL databases :

Option 2 :

  • For follow table relations - we can use Graph DB where relations are in the form of nodes and edges
  • user follows other user can be represented by outgoing edges
  • user getting followed by other user can be represented by incoming edges
  • For media, we can use Key-value object store (S3, GCS)
  • maintainable, scalable, provides security


High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...



flowchart TD

B[client]-- read tweet --> S[app server]

S[app server] --> C[cache]

C -- get tweet metadata --> RD[Relational Database]

RD -- tweet details (URL) --> S

S -- send media url --> B

B -- request media --> CDN[Content Delivery]

CDN <-- pull CDN --> OS[Object Storage]




Request flows


Write tweet:

sequenceDiagram

Client->>App-Server: write tweet

App-Server->>writeAPI: write_request

writeAPI->>RelationalDB: write tweet metadata

writeAPI->>ObjectStorage: write media

writeAPI->>Cache: store metadata

writeAPI->>App-Server: transaction complete


Read tweet:

(if present in the cache)

sequenceDiagram

Client->>App-Server: read tweet

App-Server->>readAPI: read_request

readAPI->>Cache: is present

Cache->>App-Server: tweet content

App-Server->>Client : tweet metadata

Client->>CDN: get media

CDN->>Client : tweer feed


(if not present in the cache)

sequenceDiagram

Client->>App-Server: read tweet

App-Server->>readAPI: read_request

readAPI->>Cache: is not present

readAPI->>relationalDB : get tweet metadata

relationalDB->>readAPI : tweet metadata

readAPI->>Cache: add tweet metadata

readAPI->>App-Server: tweet metadata

App-Server->>Client : tweet metadata

Client->>CDN: get media

CDN->>Client : tweet feed




Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...


Scaling DB :


Improvement-1 : system is read-heavy - maintain read-only replicas of the relational DB

  • Advantage : faster read performance since they are optimized for reading
  • Issue :
  • relational DB can async populate the replica with any new change providing eventual consistency
  • which is ok since we are preferring availability and low latency over consistency
  • We will have around 1000 writes/sec writing to the DB and replica (in non-peak hour) so we need have scaling enabled for writing only. Hence we can do "sharding".



Improvement-2 : Sharding :

  • based on user_id : user only cares about tweet from its followees
  • we separate the DB into multiple shards based on user follow patterns.
  • If we do based on tweet_id : we need to check all the shards for user followees for the current user, not ideal
  • Hence, sharding by user_id makes sense.


Creating timeline :

  • Ranking (sort) tweets by time they created
  • Two step algorithm:
  • Candidate generation -> get all users current user is following and then get all their tweets
  • Ranking -> Select top 20 tweets from all based on "relevance"
  • most "relevant" tweets can be determined by using ML techniques for embedding generation and similarity matching
  • pre-generate tweet timelines for each user and store in separate DB : asynchronusly
  • do it for every users (200 million)
  • use pubsub / message queue to get the new. tweet from app server -> user pusub + workers to publish that new tweet to separate Cache -> "Feed Cache"
  • Feed cache will contain cached tweets only for each user's feeds and continuously getting updated with new tweets
  • Feed cache size : 200 mill * top 20 -> 4 Billion tweets
  • perform sharding by user_id to improve performance
  • Updating feed cache with celebrities (100 million followers) could be expensive
  • use Pull mechanism for celebrities -> generate new feed only when they request
  • User follows someone new ? how to update the feed ?
  • send this new info to app server -> via pubsub and workers -> they update the feed for the user
  • workers will access Relational DB to get more tweets from new followee and updated feed for current user


Cache:

  • Store popular tweets in the memory cache for metadata
  • could be limited by actual memory of the system
  • LRU cache eviction - we care above most recent tweets
  • help with reduced latency to get top 20 tweets quickly
  • we may have to query multiple shards
  • CDN cache for object storage
  • create replicas for high availability and reduced latency across different regions


Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...






Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?