Requirements
Functional Requirements:
- Allow users to tweet messages up to 140 characters.
- Enable users to follow other users.
- Allow users to like tweets from other users.
- Display tweets from followed users in the home feed.
- Show top K popular tweets in the home feed based on likes and followers.
Non-Functional Requirements:
- Security - authN, rate limiting
- Scalability - hundreds of millions of worldwide active users
- Latency should be low, many users reading many tweets
- Writes can be a bit slower, ~1s acceptable "human operation speed"
Assumptions/Notes:
- currently no req to "display a user's tweets"
- currently no req to decay/time-window/weighting for the top-K
- expected "celebrity problem" of particularly popular users and tweets
API Design
Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...
create_tweet(user_id, message) -> tweet_id
get_tweet(tweet_id) -> [tweet_id, message, user_id]
follow(follower_id, followed_id)
unfollow(follower_id, followed_id)
like(user_id, tweet_id)
unlike(user_id, tweet_id)
feed(user_id, next_page_token) -> [Tweet, ...]
High-Level Design
Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.
Capacity
est. 250 million MAU, 4 avg tweets per day,
1B tweets / 100k sec/day ~= 10k tweets/sec
Expect reads to be 10-100x
=> will need partitioning and caching
Architecture
Use CDN to serve static elements, including trending tweets, alleviating hotspots
API gateway/load balancer to direct & partition requests
User Service & DB
key on user_id, track other attrs like email/auth/etc
Also owns Follow relationships
DB tradeoffs:
- sharded relational DB (Citus postgres) - good for consistency & bi-directional follow queries, slightly slower consistency, okay b/c # users is in millions
- nosql /w secondary indexes (dynamo) - good for slightly faster latency, encoding followers in Sort Key, similar to concat(followed_id, follower_id), and followed relationship in an inverted index
Tweet Service & DB
key on tweet_id, track other attrs like user_id/message/etc
Also stores likes for tweets as count and per-user
DB: some nosql (dynamo) b/c # of tweets will be enormous, easily multi-billions per year, encode likes in Sort key, similar to concat(tweet_id, user_id)
Stream: also emit tweets to a high throughput queue like Kafka for analytics. Diff strategies exist: queue first, queue + DB, outbox/CDC
Celebrity User follows and Tweet likes (Top-K)
While nosql can handle individual follows and likes, aggregating to count large number of followers and likes probably need to be special-cased. With large numbers lose exact count accuracy and use eventual consistency. Can add a tier of sharded aggregations, then centrally aggregate the shards. Fundamental counting can be done with heap or sorted set or probabilistic count implementation.
Followed Users' Tweets
Do some precompute by maintaining tweet_id list cache per user that represents their followed users' tweets. For most normal users with few tweets and few followers, the write-fanout-on-tweet won't be very costly and optimizes for reads.
Celebrity Followed Users' Tweets
Once a user's follower count exceeds some limit, mark them as celebrities and exempt them from the write-fanout. Followers of celebrities will have to additionally query their celebrity's recent tweets and merge with their precomputed list on read, in a hybrid approach
Detailed Component Design
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.