System requirements
Functional:
- User can post a tweet with up to 150 characters
- User can like a tweet
- User can follow and unfollow other users
- User can view all tweets composed by themselves
- User can view all tweets of friends they follow by reverse chronological order
Non-Functional:
- High availability
- View tweets in low latency
- Eventual consistency
- Highly scalable to handle peak usage (x10 concurrent users)
- Assume there are 1 billion DAU, and each user composes 1 tweet per day and shares/likes 10 tweets per day.
- There are 1 billion new tweets per day and the QPS for writing new tweets is 10^9 / 10^5 = 10K/second, and the QPS for like/share is 100K/second
- There will be 150 characters per tweet and let's estimate the size of each tweet is up to 512 bytes, considering different languages, which means there will be 512GB storage increment per day
- Twitter has heavier read operations, and let's assume read/write ratio is 5:1, so the read QPS is 50K/second
- At peak and there're 20% DAU as concurrent user, which is 20M users
API design
- Post tweet: tweet(user_id, content) and server returns tweet id
- Like tweet: like(user_id, tweet_id)
- Follow: follower(from_user_id, to_user_id)
- Fetch homepage tweets: fetch(user_id, cursor_id, limit)
Database design
Users:
user_id primary key,
user_name index,
email,
created_at,
updated_at
Friendships:
from_user_id index,
to_user_id index,
created_at
primary_key(from_user_id, to_user_id)
To look up whom a user is following and we can make from_user_id as partition key and to_user_id as sort key.
To easily view who are following a user, we can create a GSI for revert relationship: to_user_id as partition key and from_user_id as sort key
Tweets:
tweet_id primary key,
content,
created_by index,
created_at inxex
We need fast reads on tweets, and there isn't a strong need to join with other tables, so we can select a key-value database like DynamoDB to allow fast reads and easy to be scalable.
To look up tweets by user id and sort tweets by created_at, we can create GSI: created_by as partition key and created_at as sort key
Likes:
tweet_id,
user_id,
created_at
Hashtag:
hashtag_id index,
tweet_Id index,
created_at
primary_Key(hashtag_id, tweet_id)
High-level design
- There will be 3 services: friendship, tweet, and home feed
- API server will be responsible for authentication, rate limit, and redirecting request to corresponding service
- friendship will handle follow and unfollow
- tweet will handle posting tweet, hastag, and liking tweet. It will also store tweets and hastag in a cache with long TTL because tweets are rarely edited
- home feed service will fetch data from cache and database to precompute home feed page for each user
Request flows
- When user follows/unfollows another user, the request will be redirected to friendship service and the service will update the database
- When user posts or likes a tweets, the request will be redirected to tweet service and the service will update related tables and cache
- When user requests homepage feeds, the request will be redirected to home feed service and it will compute the feeds to send back to user
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Trade offs/Tech choices
- Computing feeds is costly. To reduce the latency, we can consider to fan out tweets to the user's followers. In addition, if a user follows a lot of users, we can create a new table called precompute_feeds. The partition key is user_id and the value is list of tweet ids ordered by created_at.
- If a user has tons of followers, fanning out tweets is costly. For this condition, we can only send the tweet to a follower when the follower requests to update the homepage feeds
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?