Design Twitter - System Design

System requirements

Functional:

Create Tweet

author, timestamp, likes, retweets
Content: text (280 char), images, video, URLs, hashtags, mentions, polls, emojis

View Tweets - sorted chronologically OR relevance/popularity

Favorite Tweets

Follow Users

Search Tweets - keywords or hashtags

Filter/Sort - by media type

Non-Functional:

Large number of viewers

First 10 tweets show within 500ms

Eventual consistency OK

Capacity estimation

500 million DAU X 100 tweets viewed a day = 50 billion reads

500 mil DAU x 2 tweets written a day = 1 billion writes

Peak usage: 20% of users (100 million simultaneous read/writes)

API design

POST /v1/tweet/ Parameters: post_text, has_complex_content, auth_token

GET /v1/feed/ Parameters: auth_token, page_number, start_timestamp, end_timestamp (optional)

POST /v1/follow/id: auth_token

DELETE /v1/follow/id: auth_token

POST /v1/tweet/id/retweet: auth_token

POST /v1/tweet/id/favorite: auth_token

Database design

Users:

Use relational DB (User DB)
ID (index)
Username
Hashed Password
Profile Text
Last Login Date

Follows:

Use a graph database to connect User IDs to each other (User DB)
Add a follow vertex from the person to who they follow

Posts:

Use NoSQL DB (Post DB)
UUID (chronological) (index)
HTML content
User ID (author)
Retweet ID (optional) (index)

Favorites:

Use relational DB (Metadata DB)
Post ID (index)
User ID (who favorited)
Timestamp

Hashtags:

Use relational DB (Metadata DB)
UUID (Index)
Hashtag

HashtagPosts:

Use relational DB (Metadata DB)
Hashtag ID (index)
Post ID

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...

Request flows

Create Tweet:

User submits text and indication of whether there is complex content (image/video).
The Post Tweet logic fetches a chronological UUID from the UUID Generator and stores the author, text content, and ID in the DB.
If there are any hashtags, it stores a HashtagPost to track it in the Meta DB. It also creates a new Hashtag if one doesn't exist.
It returns success, and if there is non-text content to upload it generates a pre-signed URL and returns it. The client uses that to upload to S3. From there, the Image / video processing job processes the content into different formats and blocks, and when done updates the status of the post in the DB, as well as the HTML to include the URL of the content.
When the post is fully stored, Post Tweet or the Image / video processing job puts the post_id in the New Post Queue.
The Feed Generator workers pick it up, fetch users that follow the author, and puts the new post into newsfeed caches for those users.

Feed Generator:

Pulls the most recent X (which is configurable, let's say 20 by default) posts and overwrites the user's entry in the Newsfeed Cache with that feed.

Get News Feed:

The client sends a request to the API with no starting timestamp, and an end timestamp of the latest start timestamp in the local cache (if it exists).
The request goes to the web server, which checks the Newsfeed Cache for the feed. If not found, it fetches a page of X posts from the starting timestamp (or X latest posts if no start) from followed users (ordered chronologically by the UUID). If an end timestamp is provided, it stops there, or if it reaches X post, whichever is smaller. It fetches from the Post Cache or Post DB.
It also tries to fetch follow and retweet counts from the Meta Cache, and if it is not found it counts them in the DB. The client fetches any video/images from the CDN.
When the feed is returned, the posts are cached locally along with the starting timestamp of the latest post.

Follow/Unfollow:

User requests to follow, and a vertex is added to the UserDB from the requesting user to the followed user.
For unfollow, same path, but the vertex is deleted.

Favorite/Unfavorite:

User requests to favorite a post, and a Favorites table entry is added to the Meta DB.
Un-favorite: same, but the entry is deleted.

Detailed component design

User cache: uses user ID as key and shards on that

Newsfeed cache:

Uses Redis - can use sorted sets to maintain ordering quickly and change order between user ID (chronological) or something like popularity (# of retweets or favorites)
Contains user ID (key), posts, and metadata summaries for each.
Can shard on userID.
Has a timed invalidation that does not refresh itself
The feed generator overwrites the previous cache entry for the user every time it is prompted with a new post.

New Post Queue:

Uses Kafka
Stores post ID and post HTML

Meta Cache:

Uses Redis
Contains post ID (key), and counts of retweets and favorites for the post.
Can shard on user ID
Timed invalidation from when it is first inserted. Timer can reset whenever it's updated. Cache is updated for each new favorite or retweet, where the count is incremented by 1. If cache is locked for write already, skip the write to conserve speed.

Local Cache:

We can store recently fetched posts that drop out over time
We can store a list of who we follow so we don't have to look that up, although this will mean if you add a follow on a different device it may take time for the cache to invalidate and fetch again

Trade offs/Tech choices

Making the retweet an indexed part of the Post table slows down inserts, but since retweets act as posts in so many ways it can simplify that logic and speed up certain reads.
Only caching the first page to save on memory, since most users don't go past the first page, and caching pages individually could cause issues with showing duplicate posts if pages time out of the cache at different times.
Incrementing the Meta Cache on each retweet and favorite could cause some bottlenecking if a lot are done at once. We have the write skipped if it encounters a lock because accurate numbers are not super important, and it will be accurate on the next DB fetch.
The New Post Queue could be faster using Pub/Sub in Redis instead of Kafka, but that would keep us from updating the cache while a user is offline. It also carries more risk of missed posts if the queue crashes.

Failure scenarios/bottlenecks

Post Tweet failure: If the standard flow fails, an HTTP Error code is returned, which shows to the user in the client. If there is complex content (image/videos), if the Image / Video processing job fails explicitly, it sends a notification which will show to the user when notifications are fetched, and the post and content are removed. There can also be an occasional job that checks for too much time between initial post storage time and now, and if that has occurred, send a notification to the user and delete the content.

Get Feed failure: On error, return HTTP error code. If CDN is down, user can fetch directly from S3.

Hot Key problems: A user with a lot of followers could create a lot of traffic for the Feed Generator when they post. We could treat these users differently and have a separate infrastructure for fetching those posts (separate from other posts) instead of having them proactively pushed to the Feed Generator. We can flag these users so their posts aren't processed by Feed Generator, and

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?