Design Twitter - System Design

Requirements

Functional Requirements:

Allow users to tweet messages up to 140 characters.
Enable users to follow other users.
Allow users to like tweets from other users.
Display tweets from followed users in the home feed.
Show top K popular tweets in the home feed based on likes and followers.

Non-Functional Requirements:

Security - authN, rate limiting
Scalability - hundreds of millions of worldwide active users
Latency should be low, many users reading many tweets
Writes can be a bit slower, ~1s acceptable "human operation speed"

Assumptions/Notes:

currently no req to "display a user's tweets"
currently no req to decay/time-window/weighting for the top-K
expected "celebrity problem" of particularly popular users and tweets

API Design

Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...

create_tweet(user_id, message) -> tweet_id

get_tweet(tweet_id) -> [tweet_id, message, user_id]

follow(follower_id, followed_id)

unfollow(follower_id, followed_id)

like(user_id, tweet_id)

unlike(user_id, tweet_id)

feed(user_id, next_page_token) -> [Tweet, ...]

High-Level Design

Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.

Capacity

est. 250 million MAU, 4 avg tweets per day,

1B tweets / 100k sec/day ~= 10k tweets/sec

Expect reads to be 10-100x

=> will need partitioning and caching

Architecture

Use CDN to serve static elements, including trending tweets, alleviating hotspots

API gateway/load balancer to direct & partition requests

User Service & DB

key on user_id, track other attrs like email/auth/etc

Also owns Follow relationships

DB tradeoffs:

sharded relational DB (Citus postgres) - good for consistency & bi-directional follow queries, slightly slower consistency, okay b/c # users is in millions
nosql /w secondary indexes (dynamo) - good for slightly faster latency, encoding followers in Sort Key, similar to concat(followed_id, follower_id), and followed relationship in an inverted index

Tweet Service & DB

key on tweet_id, track other attrs like user_id/message/etc

Also stores likes for tweets as count and per-user

DB: some nosql (dynamo) b/c # of tweets will be enormous, easily multi-billions per year, encode likes in Sort key, similar to concat(tweet_id, user_id)

Stream: also emit tweets to a high throughput queue like Kafka for analytics. Diff strategies exist: queue first, queue + DB, outbox/CDC

Celebrity User follows and Tweet likes (Top-K)

While nosql can handle individual follows and likes, aggregating to count large number of followers and likes probably need to be special-cased. With large numbers lose exact count accuracy and use eventual consistency. Can add a tier of sharded aggregations, then centrally aggregate the shards. Fundamental counting can be done with heap or sorted set or probabilistic count implementation.

Followed Users' Tweets

Do some precompute by maintaining tweet_id list cache per user that represents their followed users' tweets. For most normal users with few tweets and few followers, the write-fanout-on-tweet won't be very costly and optimizes for reads.

Celebrity Followed Users' Tweets

Once a user's follower count exceeds some limit, mark them as celebrities and exempt them from the write-fanout. Followers of celebrities will have to additionally query their celebrity's recent tweets and merge with their precomputed list on read, in a hybrid approach

Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.