Design Twitter - System Design

System requirements

Functional:

List functional requirements for the system (Ask the chat bot for hints if stuck.)...

Tweet Management

Post a tweet - max 280 chars
Tweet can have optional media - GIFs, images, short videos (1MB)
Reply to a tweet
Retweet a tweet
Like/Unlike a tweet
Delete a tweet

Follow System

User can follow/unfollow others

Search

Search by keywords - tweets, users, hashtags
Suggestions for people to follow

Out of Scope

User registration
Direct messaging to other users
Analytics - message count, leaderboard
Moderation - message filtering, abuse filtering, blocking users
Security - user authentication/authorisation
Language support - translation and mult-language support
API support - Read/write tweets

System Requirements

A tweet and comment is of size 280 chars
Support URLs but altogether has to be within 280 chars
Multi-meida in a tweet to be in 2KB

Non-Functional:

List non-functional requirements for the system...

Scale : supports hundreds of millions of users
Performance: low latency timeline ( < 200ms), tweets are visible to others within secs
Availability - Highly available (99,9%)
Durability - once tweeted, the tweets are not lost
Consistency - strong consistency for likes, eventual consistency for timeline or new tweets appearing in home page
Observability - monitoring system and components - health and wellness

Capacity estimation

Estimate the scale of the system you are going to design...

Total No of Users - 500M
Active Users 40% - 200M
Each user tweeting 10 tweets a day - 200M x 10 = 2B
Total Tweets/comments a day - 2B x 300 chars = 600GB
Total multimedia a day - assuming 20% of tweets has multi-media - 20M tweets x 2MB = approx 40 GB

API design

Define what APIs are expected from the system...

Core Entities

Tweets
Users
Followers
Following
Timeline
Trends

Tweets API

Get a Tweet

GET twitter.com/api/1/tweets/:id
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response : 200 OK
Respond Body : [Tweet instance]

Get Tweets by user

GET twitter.com/api/1/users/:id/tweets/
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response : 200 OK
Respond Body : [a list of tweets]

Delete a tweet

DELETE twitter.com/api/1/tweets/:id
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response : 200 OK

Post a Tweet

POST twitter.com/api/1/tweets
Request Header: [
  Authorization : Bearer <JWT Token>
]
Request Body: [A tweet]
Response: 201 Created

Update a tweet

PUT twitter.com/api/1/tweets
Request Header: [
  Authorization : Bearer <JWT Token>
]
Request Body: [A tweet]
Response: 200 OK

Retweet a tweet

POST twitter.com/api/1/users/:id/tweets/:source-tweet-id
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK

Unlike/Like API

PATCH twitter.com/api/1/tweets/:id/like
Request Header: [
  Authorization : Bearer <JWT Token>
]
Request Body: [a tweet identifier, like/unlike]

Search Trends

GET twitter.com/api/1/trending/topics
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [a list of Trending tweets]

Get User API

GET twitter.com/api/1/users/:id/
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [an instance of user]

Get User by username

GET twitter.com/api/1/users/by/username/:name
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [a list of users]

Get following

GET twitter.com/api/1/users/:id/following/
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [a list of users]

Get followers

GET twitter.com/api/1/users/:id/followers/
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [a list of users]

Follow a user

GET twitter.com/api/1/users/:source-user-id/following/
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK

UnFollow a user

GET twitter.com/api/1/users/:source-user-id/following/:target-user-id
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response: 200 OK

Timeline API

PATCH twitter.com/api/1/users/:id/timeline?cursor={date}&region={region}
Request Header: [
  Authorization : Bearer <JWT Token>
]
Response : 200 OK
Response Body : [a list of 1000 tweets in reverse_chronological order]

Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...

Tweets (In KV store and indexed in tweets index)
1. tweetId [a twitter-snowflake id, sortable, allows range queries]
2. tags: [a list of String #tags]
3. userId
4. title
5. name
6. content
7. attachments [a lis of images/gifs/short video]
8. createdOn
9. lastUpdatedOn
10. topic [an internal topic classification]
11. classification [a list of classification tags]
12. keywords: [a bag of keywords]

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Tweeting

User posts a tweet
The tweet is added to a KV store - the tweet id is sortable using Twitter-Snowflake id (contains sign bit, time part, machine id, mac address, sequence number)
The KV store changes are captured via a Change Data Capture stream, the changes are written into a distributed broker. This ensures strong consistency in creating a tweet. The KV store is s distributed database with atleast 3 nodes having the tweet added.
A stream processing job picks up the changes and runs a few analysis on the tweet. This is eventual consistency to bring up topic, trends, keywords etc.
1. sentiment analysis
2. Removes stop words, normalise/correct words, topic identification by TF-IDF, does sentiment analysis, identifies keywords by TF-IDF
Tweet Index - the result of stream processing is posted onto tweet index for faster retrieval by topic, keyword, id, sentiments, trends

Finding Top K HashTags/Topics

Twitter stream processing finds top trending topics, keywords, top k tweets around the globe. Its bucketised to regional (country, city). The real-time stream processing engine computes this incrementally

Named Entity Extraction - processor extracts named entities - names, places, things from tweet content body
Sliding window for tracking top terms in the tweet
Top K terms - maintains a list of top K terms found in the tweet plus historical tweets
Geo & Language segmentation - segment tweet content to geo location/language used

Sliding Window

Track frequency in last n minutes, hours

Surge Detection

Compare current frequency vs historical baseline

Top K

Count Min-Sketch - keep track of most frequent hash tags with low memory footprint, avoid botnet activity or spam, demote such active hashtags or hashtags used in a private group.

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Search Engine - Tweets are inserted into a distributed search engine from main database post processing, for faster lookup and multi-faceted search. The top trending, hash tags, etc populated in search index. This is an index similar to Elastic search
KV store - a distributed key/value store for persisting tweets, the key is tweet_id

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Hot partitions when a hashtag or tweet becomes viral. lot of people reads, tweets, retweets the original tweet. This creates a load onto the database, stream pipeline. This surge is usually region or place specific

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?

Spam detection, malicious content filtering
Security
Avoid spreading scare - demote such tweets