Design Twitter - System Design

System requirements

Functional:

Create tweet : text, photos, videos
character size limit : 140
Follow others
View timeline of the user : top tweets by people user follows

Non-Functional:

Read-heavy system : eventual consistency is ok
high availability, high reliability : information should be accurate
low latency : for generating timeline for user

Capacity estimation

500 Mil total users
DAU : 200 mil
total reads/day : assme 100 reads per user per day -> 200 mil * 100 = 20 Billion reads
read storage estimate :
avg size of tweet - text (1KB) + image (500KB) + video (10MB) -> avg 1MB
daily read storage : 1 MB * 20 Billion = 20 PB / day
Read-heavy system
total writes/day : assume 50 mil people write, on average 2 tweets a day
daily write storage : 1MB * 50 mil * 2 = 1 MB * 100 mil = 100 TB / day
writes/second = 100 mil (writes/day) / 100000 (sec/day) = 1000 writes/sec
every user follows on average 100 people on twitter
some power users can have more followers = millions
think about celebrity tweet reading and storing

API design

createTweet(user_id, tweet_id, content, timestamp, authorization_token)

getFeed(user_id, authorization_token) : get feed (timeline) for current user

follow(user_id, target_user_id)

Database design

Option 1 : Relational DB to store the tweet metadata and Object Store (S3, GCS) to store media files since they are large in volume.

Advantages: support complex relationships, faster querying, maintain data integrity and consistency, highly realiable

Tweet table :

tweet_id : UUID
created_at: timestamp
user_id : UUID
content: text, reference to the media stored in object store

Follow table :

followee : people followed by follower
follower : user who follows other people
We want to get "Followees" of the current user quickly to generate timeline ->
index DB by "Followers" -> for faster search
groupby all the people follower (users) is following

System is going to be read-heavy. For relational DB it could be difficult to scale and reduce complexity as the data increases in volume

For this reason, we can consider NoSQL databases :

Option 2 :

For follow table relations - we can use Graph DB where relations are in the form of nodes and edges
user follows other user can be represented by outgoing edges
user getting followed by other user can be represented by incoming edges
For media, we can use Key-value object store (S3, GCS)
maintainable, scalable, provides security

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...

flowchart TD

B[client]-- read tweet --> S[app server]

S[app server] --> C[cache]

C -- get tweet metadata --> RD[Relational Database]

RD -- tweet details (URL) --> S

S -- send media url --> B

B -- request media --> CDN[Content Delivery]

CDN <-- pull CDN --> OS[Object Storage]

Request flows

Write tweet:

sequenceDiagram

Client->>App-Server: write tweet

App-Server->>writeAPI: write_request

writeAPI->>RelationalDB: write tweet metadata

writeAPI->>ObjectStorage: write media

writeAPI->>Cache: store metadata

writeAPI->>App-Server: transaction complete

Read tweet:

(if present in the cache)

sequenceDiagram

Client->>App-Server: read tweet

App-Server->>readAPI: read_request

readAPI->>Cache: is present

Cache->>App-Server: tweet content

App-Server->>Client : tweet metadata

Client->>CDN: get media

CDN->>Client : tweer feed

(if not present in the cache)

sequenceDiagram

Client->>App-Server: read tweet

App-Server->>readAPI: read_request

readAPI->>Cache: is not present

readAPI->>relationalDB : get tweet metadata

relationalDB->>readAPI : tweet metadata

readAPI->>Cache: add tweet metadata

readAPI->>App-Server: tweet metadata

App-Server->>Client : tweet metadata

Client->>CDN: get media

CDN->>Client : tweet feed

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Scaling DB :

Improvement-1 : system is read-heavy - maintain read-only replicas of the relational DB

Advantage : faster read performance since they are optimized for reading
Issue :
relational DB can async populate the replica with any new change providing eventual consistency
which is ok since we are preferring availability and low latency over consistency
We will have around 1000 writes/sec writing to the DB and replica (in non-peak hour) so we need have scaling enabled for writing only. Hence we can do "sharding".

Improvement-2 : Sharding :

based on user_id : user only cares about tweet from its followees
we separate the DB into multiple shards based on user follow patterns.
If we do based on tweet_id : we need to check all the shards for user followees for the current user, not ideal
Hence, sharding by user_id makes sense.

Creating timeline :

Ranking (sort) tweets by time they created
Two step algorithm:
Candidate generation -> get all users current user is following and then get all their tweets
Ranking -> Select top 20 tweets from all based on "relevance"
most "relevant" tweets can be determined by using ML techniques for embedding generation and similarity matching
pre-generate tweet timelines for each user and store in separate DB : asynchronusly
do it for every users (200 million)
use pubsub / message queue to get the new. tweet from app server -> user pusub + workers to publish that new tweet to separate Cache -> "Feed Cache"
Feed cache will contain cached tweets only for each user's feeds and continuously getting updated with new tweets
Feed cache size : 200 mill * top 20 -> 4 Billion tweets
perform sharding by user_id to improve performance
Updating feed cache with celebrities (100 million followers) could be expensive
use Pull mechanism for celebrities -> generate new feed only when they request
User follows someone new ? how to update the feed ?
send this new info to app server -> via pubsub and workers -> they update the feed for the user
workers will access Relational DB to get more tweets from new followee and updated feed for current user

Cache:

Store popular tweets in the memory cache for metadata
could be limited by actual memory of the system
LRU cache eviction - we care above most recent tweets
help with reduced latency to get top 20 tweets quickly
we may have to query multiple shards
CDN cache for object storage
create replicas for high availability and reduced latency across different regions

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?