System requirements
Functional:
- Create tweet : text, photos, videos
- character size limit : 140
- Follow others
- View timeline of the user : top tweets by people user follows
Non-Functional:
- Read-heavy system : eventual consistency is ok
- high availability, high reliability : information should be accurate
- low latency : for generating timeline for user
Capacity estimation
- 500 Mil total users
- DAU : 200 mil
- total reads/day : assme 100 reads per user per day -> 200 mil * 100 = 20 Billion reads
- read storage estimate :
- avg size of tweet - text (1KB) + image (500KB) + video (10MB) -> avg 1MB
- daily read storage : 1 MB * 20 Billion = 20 PB / day
- Read-heavy system
- total writes/day : assume 50 mil people write, on average 2 tweets a day
- daily write storage : 1MB * 50 mil * 2 = 1 MB * 100 mil = 100 TB / day
- writes/second = 100 mil (writes/day) / 100000 (sec/day) = 1000 writes/sec
- every user follows on average 100 people on twitter
- some power users can have more followers = millions
- think about celebrity tweet reading and storing
API design
createTweet(user_id, tweet_id, content, timestamp, authorization_token)
getFeed(user_id, authorization_token) : get feed (timeline) for current user
follow(user_id, target_user_id)
Database design
Option 1 : Relational DB to store the tweet metadata and Object Store (S3, GCS) to store media files since they are large in volume.
- Advantages: support complex relationships, faster querying, maintain data integrity and consistency, highly realiable
Tweet table :
- tweet_id : UUID
- created_at: timestamp
- user_id : UUID
- content: text, reference to the media stored in object store
Follow table :
- followee : people followed by follower
- follower : user who follows other people
- We want to get "Followees" of the current user quickly to generate timeline ->
- index DB by "Followers" -> for faster search
- groupby all the people follower (users) is following
System is going to be read-heavy. For relational DB it could be difficult to scale and reduce complexity as the data increases in volume
For this reason, we can consider NoSQL databases :
Option 2 :
- For follow table relations - we can use Graph DB where relations are in the form of nodes and edges
- user follows other user can be represented by outgoing edges
- user getting followed by other user can be represented by incoming edges
- For media, we can use Key-value object store (S3, GCS)
- maintainable, scalable, provides security
High-level design
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...
flowchart TD
B[client]-- read tweet --> S[app server]
S[app server] --> C[cache]
C -- get tweet metadata --> RD[Relational Database]
RD -- tweet details (URL) --> S
S -- send media url --> B
B -- request media --> CDN[Content Delivery]
CDN <-- pull CDN --> OS[Object Storage]
Request flows
Write tweet:
sequenceDiagram
Client->>App-Server: write tweet
App-Server->>writeAPI: write_request
writeAPI->>RelationalDB: write tweet metadata
writeAPI->>ObjectStorage: write media
writeAPI->>Cache: store metadata
writeAPI->>App-Server: transaction complete
Read tweet:
(if present in the cache)
sequenceDiagram
Client->>App-Server: read tweet
App-Server->>readAPI: read_request
readAPI->>Cache: is present
Cache->>App-Server: tweet content
App-Server->>Client : tweet metadata
Client->>CDN: get media
CDN->>Client : tweer feed
(if not present in the cache)
sequenceDiagram
Client->>App-Server: read tweet
App-Server->>readAPI: read_request
readAPI->>Cache: is not present
readAPI->>relationalDB : get tweet metadata
relationalDB->>readAPI : tweet metadata
readAPI->>Cache: add tweet metadata
readAPI->>App-Server: tweet metadata
App-Server->>Client : tweet metadata
Client->>CDN: get media
CDN->>Client : tweet feed
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Scaling DB :
Improvement-1 : system is read-heavy - maintain read-only replicas of the relational DB
- Advantage : faster read performance since they are optimized for reading
- Issue :
- relational DB can async populate the replica with any new change providing eventual consistency
- which is ok since we are preferring availability and low latency over consistency
- We will have around 1000 writes/sec writing to the DB and replica (in non-peak hour) so we need have scaling enabled for writing only. Hence we can do "sharding".
Improvement-2 : Sharding :
- based on user_id : user only cares about tweet from its followees
- we separate the DB into multiple shards based on user follow patterns.
- If we do based on tweet_id : we need to check all the shards for user followees for the current user, not ideal
- Hence, sharding by user_id makes sense.
Creating timeline :
- Ranking (sort) tweets by time they created
- Two step algorithm:
- Candidate generation -> get all users current user is following and then get all their tweets
- Ranking -> Select top 20 tweets from all based on "relevance"
- most "relevant" tweets can be determined by using ML techniques for embedding generation and similarity matching
- pre-generate tweet timelines for each user and store in separate DB : asynchronusly
- do it for every users (200 million)
- use pubsub / message queue to get the new. tweet from app server -> user pusub + workers to publish that new tweet to separate Cache -> "Feed Cache"
- Feed cache will contain cached tweets only for each user's feeds and continuously getting updated with new tweets
- Feed cache size : 200 mill * top 20 -> 4 Billion tweets
- perform sharding by user_id to improve performance
- Updating feed cache with celebrities (100 million followers) could be expensive
- use Pull mechanism for celebrities -> generate new feed only when they request
- User follows someone new ? how to update the feed ?
- send this new info to app server -> via pubsub and workers -> they update the feed for the user
- workers will access Relational DB to get more tweets from new followee and updated feed for current user
Cache:
- Store popular tweets in the memory cache for metadata
- could be limited by actual memory of the system
- LRU cache eviction - we care above most recent tweets
- help with reduced latency to get top 20 tweets quickly
- we may have to query multiple shards
- CDN cache for object storage
- create replicas for high availability and reduced latency across different regions
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?