My Solution for Design Twitter with Score: 9/10
by serenade_xenon816
System requirements
Functional:
Tweet Management
- Post a tweet - max 280 chars
- Tweet can have optional media - GIFs, images, short videos (up to 2MB)
- Reply to a tweet
- Retweet a tweet
- Like/Unlike a tweet
- Delete a tweet
Follow System
- User can follow/unfollow others
Search
- Search by keywords - tweets, users, hashtags
- Suggestions for people to follow
Out of Scope
- User registration
- Direct messaging to other users
- Analytics - message count, leaderboard
- Moderation - message filtering, abuse filtering, blocking users
- Security - user authentication/authorisation
- Language support - translation and multi-language support
- API support - Read/write tweets
System Requirements
- A tweet or comment is at most 280 chars
- URLs are supported but count toward the 280-char limit
- Multimedia in a tweet is limited to 2MB
Non-Functional:
- Scale : supports hundreds of millions of users
- Performance: low-latency timeline (< 200ms); tweets are visible to others within seconds
- Availability - highly available (99.9%)
- Durability - once tweeted, the tweets are not lost
- Consistency - strong consistency for likes, eventual consistency for the timeline and for new tweets appearing on the home page
- Observability - monitoring of system components' health
Capacity estimation
- Total No of Users - 500M
- Active Users 40% - 200M
- Each user tweeting 10 tweets a day - 200M x 10 = 2B
- Total tweet/comment text a day - 2B x ~300 bytes = 600GB
- Total multimedia a day - assuming 20% of tweets carry multimedia - 400M tweets x 2MB ≈ 800TB
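The estimates above can be checked with quick arithmetic; note that at 20% of 2B tweets carrying 2MB of media each, media volume lands in the hundreds of terabytes per day:

```python
# Back-of-envelope capacity estimation using the numbers from this section.
TOTAL_USERS = 500_000_000
ACTIVE_USERS = TOTAL_USERS * 40 // 100        # 40% daily active -> 200M
TWEETS_PER_USER_PER_DAY = 10

tweets_per_day = ACTIVE_USERS * TWEETS_PER_USER_PER_DAY   # 2B tweets/day
text_bytes_per_day = tweets_per_day * 300                 # ~300 bytes of text per tweet
media_tweets = tweets_per_day * 20 // 100                 # 20% of tweets carry media
media_bytes_per_day = media_tweets * 2_000_000            # 2MB per attachment

print(f"tweets/day:       {tweets_per_day:,}")
print(f"text volume/day:  {text_bytes_per_day / 1e9:.0f} GB")
print(f"media volume/day: {media_bytes_per_day / 1e12:.0f} TB")
```

The text volume (600GB/day) is small enough for a single replicated KV cluster; the media volume (~800TB/day) is what forces a separate blob store and CDN.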
API design
Core Entities
- Tweets
- Users
- Followers
- Following
- Timeline
- Trends
Tweets API
Get a Tweet
GET twitter.com/api/1/tweets/:id
Request Header: [
Authorization : Bearer <JWT Token>
]
Response : 200 OK
Response Body : [Tweet instance]
Get Tweets by user
GET twitter.com/api/1/users/:id/tweets/
Request Header: [
Authorization : Bearer <JWT Token>
]
Response : 200 OK
Response Body : [a list of tweets]
Delete a tweet
DELETE twitter.com/api/1/tweets/:id
Request Header: [
Authorization : Bearer <JWT Token>
]
Response : 200 OK
Post a Tweet
POST twitter.com/api/1/tweets
Request Header: [
Authorization : Bearer <JWT Token>
]
Request Body: [A tweet]
Response: 201 Created
Update a tweet
PUT twitter.com/api/1/tweets
Request Header: [
Authorization : Bearer <JWT Token>
]
Request Body: [A tweet]
Response: 200 OK
Retweet a tweet
POST twitter.com/api/1/users/:id/tweets/:source-tweet-id
Request Header: [
Authorization : Bearer <JWT Token>
]
Response: 201 Created
Unlike/Like API
PATCH twitter.com/api/1/tweets/:id/like
Request Header: [
Authorization : Bearer <JWT Token>
]
Request Body: [action: like | unlike]
Response: 200 OK
Search Trends
GET twitter.com/api/1/trending/topics
Request Header: [
Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [a list of trending topics]
Get User API
GET twitter.com/api/1/users/:id/
Request Header: [
Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [an instance of user]
Get User by username
GET twitter.com/api/1/users/by/username/:name
Request Header: [
Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [an instance of user]
Get following
GET twitter.com/api/1/users/:id/following/
Request Header: [
Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [a list of users]
Get followers
GET twitter.com/api/1/users/:id/followers/
Request Header: [
Authorization : Bearer <JWT Token>
]
Response: 200 OK
Response Body: [a list of users]
Follow a user
POST twitter.com/api/1/users/:source-user-id/following/
Request Header: [
Authorization : Bearer <JWT Token>
]
Request Body: [target-user-id]
Response: 200 OK
UnFollow a user
DELETE twitter.com/api/1/users/:source-user-id/following/:target-user-id
Request Header: [
Authorization : Bearer <JWT Token>
]
Response: 200 OK
Timeline API
GET twitter.com/api/1/users/:id/timeline?cursor={date}&region={region}
Request Header: [
Authorization : Bearer <JWT Token>
]
Response : 200 OK
Response Body : [a page of up to 1000 tweets in reverse-chronological order]
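The timeline's cursor parameter can be sketched with a hypothetical helper: the cursor is simply the timestamp of the last tweet on the previous page, and an in-memory list stands in for the timeline store.

```python
PAGE_SIZE = 50  # illustrative default; the API above caps a page at 1000

def timeline_page(tweets, cursor=None, limit=PAGE_SIZE):
    """Return one reverse-chronological page plus the cursor for the next call.

    `tweets` is an iterable of dicts with an ISO-8601 `createdOn` field;
    ISO-8601 strings sort lexicographically, so plain comparison works.
    """
    ordered = sorted(tweets, key=lambda t: t["createdOn"], reverse=True)
    if cursor is not None:
        # Resume strictly after the last tweet the client already saw.
        ordered = [t for t in ordered if t["createdOn"] < cursor]
    page = ordered[:limit]
    # A full page implies there may be more; a short page ends pagination.
    next_cursor = page[-1]["createdOn"] if len(page) == limit else None
    return page, next_cursor
```

Cursoring on a timestamp (or the sortable Snowflake id) avoids the skipped/duplicated rows that offset-based pagination suffers when new tweets arrive between page requests.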
Database design
- Tweets (In KV store and indexed in tweets index)
- tweetId [a twitter-snowflake id, sortable, allows range queries]
- tags: [a list of String #tags]
- userId
- title
- name
- content
- attachments [a list of images/gifs/short videos]
- createdOn
- lastUpdatedOn
- topic [an internal topic classification]
- classification [a list of classification tags]
- keywords: [a bag of keywords]
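The tweet record above can be sketched as a dataclass. Field names follow the model; the 280-char check is an assumption based on the limit stated in the requirements.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tweet:
    """Tweet record as stored in the KV store (fields from the data model above)."""
    tweetId: int                      # Twitter-Snowflake id: sortable, range-queryable
    userId: int
    content: str                      # <= 280 chars, enforced below
    tags: List[str] = field(default_factory=list)         # '#' hashtags
    attachments: List[str] = field(default_factory=list)  # image/GIF/video references
    createdOn: Optional[str] = None
    lastUpdatedOn: Optional[str] = None
    topic: Optional[str] = None       # internal topic classification
    keywords: List[str] = field(default_factory=list)     # filled by stream processing

    def __post_init__(self):
        if len(self.content) > 280:
            raise ValueError("tweet exceeds 280 characters")
```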
High-level design
Request flows
Tweeting
- User posts a tweet
- The tweet is added to a KV store - the tweet id is sortable using Twitter-Snowflake id (contains sign bit, time part, machine id, mac address, sequence number)
- The KV store changes are captured via a Change Data Capture (CDC) stream, and the changes are written into a distributed broker. This ensures strong consistency for tweet creation. The KV store is a distributed database with at least 3 nodes holding each tweet.
- A stream processing job picks up the changes and runs several analyses on the tweet. This path is eventually consistent and produces topics, trends, keywords, etc.
- Removes stop words, normalises/corrects words, identifies topics and keywords by TF-IDF, and runs sentiment analysis
- Tweet Index - the result of stream processing is posted onto tweet index for faster retrieval by topic, keyword, id, sentiments, trends
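The TF-IDF keyword step in the flow above can be illustrated with a minimal pure-Python sketch; the stop-word list is a toy stand-in, and a production pipeline would use a real NLP library.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}  # tiny illustrative list

def tf_idf_keywords(tweet, corpus, top_n=3):
    """Rank non-stop-words of `tweet` by TF-IDF against `corpus` (a list of token lists)."""
    tokens = [w.lower() for w in tweet.split() if w.lower() not in STOP_WORDS]
    tf = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)        # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1          # smoothed inverse doc freq
        scores[word] = (count / len(tokens)) * idf
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```

Words rare in the corpus but present in the tweet score highest, which is exactly what makes them good index keywords.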
Finding Top K HashTags/Topics
Twitter stream processing finds the top trending topics, keywords, and top-k tweets around the globe. Results are bucketed by region (country, city). The real-time stream processing engine computes this incrementally
- Named Entity Extraction - processor extracts named entities - names, places, things from tweet content body
- Sliding window for tracking top terms in the tweet stream
- Top K terms - maintains a list of the top K terms across current and historical tweets
- Geo & Language segmentation - segment tweet content to geo location/language used
Sliding Window
Track frequency in last n minutes, hours
Surge Detection
Compare current frequency vs historical baseline
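The sliding window and surge detection can be sketched together; the default window length, baseline values, and surge ratio here are illustrative assumptions.

```python
import time
from collections import deque, defaultdict

class SlidingWindowCounter:
    """Count hashtag hits in the last `window` seconds; flag surges vs a baseline."""

    def __init__(self, window=600, surge_ratio=3.0):
        self.window = window
        self.surge_ratio = surge_ratio
        self.hits = defaultdict(deque)            # hashtag -> timestamps inside window
        self.baseline = defaultdict(lambda: 1.0)  # historical per-window average

    def record(self, hashtag, now=None):
        now = time.time() if now is None else now
        q = self.hits[hashtag]
        q.append(now)
        while q and q[0] < now - self.window:     # evict timestamps outside the window
            q.popleft()

    def is_surging(self, hashtag):
        # Surge: current-window frequency exceeds `surge_ratio` x historical baseline.
        return len(self.hits[hashtag]) >= self.surge_ratio * self.baseline[hashtag]
```

In the real pipeline the baseline would be maintained from historical aggregates rather than defaulting to a constant.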
Top K
Count-Min Sketch - tracks the most frequent hashtags with a low memory footprint; it also helps detect botnet or spam activity so such hashtags (or hashtags used in a private group) can be demoted.
Count Min-Sketch (CMS)
Instead of keeping every hashtag or trending topic in a cache or database to compute popularity, we use a Count-Min Sketch (CMS), which gives an approximate estimate of how popular each topic is. A count is estimated for each hashtag, and the top counts are maintained in a min-heap kept in cache. The CMS is an efficient data structure that uses very little memory.
For example, with two hashtags "#ukraine" and "#tariff" and depth = 5, we compute 5 hashes for each hashtag and take the minimum of the 5 corresponding counters as the estimate. That estimate is compared against the min-heap: if the heap's minimum is smaller than the count from the CMS, the minimum is replaced. This is memory-efficient: the only extra space is a min-heap of size K.
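A minimal sketch of the CMS-plus-min-heap idea; Python's built-in `hash` (salted per row) stands in for proper pairwise-independent hash functions.

```python
import heapq

class CountMinSketch:
    """Approximate frequency counter: `depth` hash rows x `width` counters per row."""

    def __init__(self, depth=5, width=1024):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for i in range(self.depth):
            yield i, hash((i, item)) % self.width   # row index salts the hash

    def add(self, item):
        for i, b in self._buckets(item):
            self.rows[i][b] += 1

    def estimate(self, item):
        # Collisions only ever inflate counters, so the row minimum is the
        # tightest (over-)estimate of the true count.
        return min(self.rows[i][b] for i, b in self._buckets(item))

def top_k(stream, k=2, depth=5, width=1024):
    """Track the top-k hashtags using a CMS plus a k-sized min-heap."""
    cms = CountMinSketch(depth, width)
    heap = []                                        # (estimate, hashtag), size <= k
    for tag in stream:
        cms.add(tag)
        est = cms.estimate(tag)
        entry = next((e for e in heap if e[1] == tag), None)
        if entry:                                    # already tracked: refresh estimate
            heap.remove(entry)
            heapq.heapify(heap)
            heapq.heappush(heap, (est, tag))
        elif len(heap) < k:
            heapq.heappush(heap, (est, tag))
        elif est > heap[0][0]:                       # beats the smallest tracked count
            heapq.heapreplace(heap, (est, tag))
    return sorted(heap, reverse=True)
```

Memory is O(depth x width) for the sketch plus O(k) for the heap, regardless of how many distinct hashtags flow through the stream.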
Detailed component design
Handle Hot Tweets
We use Twitter-Snowflake ID as tweetId, this is sortable by time and allows range queries. A tweetId consists of
timestamp (48 bits) | datacenterId (4 bits) | machineId (8 bits) | sequenceNo (4 bits)
The first 48 bits are the timestamp at the machine that received the tweet, followed by the id of the datacenter the machine belongs to, the assigned machine id, and a per-machine running counter. This makes each tweet id unique and usable for sharding: tweetId % N (for N database nodes) gives the node where the tweet is stored.
A user's timeline is sharded across thousands of machines by userId, i.e. shard_id = hash(userId) % N_shards
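The id layout and sharding rules above can be sketched as follows (bit widths taken from this section's layout, which packs 48+4+8+4 = 64 bits):

```python
# Bit layout: 48-bit timestamp | 4-bit datacenter | 8-bit machine | 4-bit sequence.
TS_BITS, DC_BITS, MACHINE_BITS, SEQ_BITS = 48, 4, 8, 4

def make_tweet_id(ts_ms, datacenter_id, machine_id, seq):
    """Pack the fields into one 64-bit id; the timestamp occupies the high bits,
    so ids generated later always compare greater (time-sortable)."""
    return (ts_ms << (DC_BITS + MACHINE_BITS + SEQ_BITS)) \
         | (datacenter_id << (MACHINE_BITS + SEQ_BITS)) \
         | (machine_id << SEQ_BITS) \
         | seq

def shard_for_tweet(tweet_id, n_nodes):
    return tweet_id % n_nodes          # tweet storage node, as described above

def shard_for_timeline(user_id, n_shards):
    return hash(user_id) % n_shards    # timeline shard keyed by userId
```

Because the timestamp sits in the high bits, a range scan over ids is also a range scan over time, which is what enables efficient "tweets since X" queries.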
Hot hashtags
What happens when a hashtag goes viral? For instance, #worldcup going viral introduces millions of tweets, retweets, likes, etc. We use a separate inverted index keyed by hashtag and attach the tweets to the hashtag.
- This index is sharded by hashtag not tweetId, for instance "hash(hashtag) % N"
- These hashtags are cached onto memory - Redis/Memcached/Dragonfly
- If a tweet is liked or retweeted, a fan-out write is performed: the tweet is written onto a Kafka broker and asynchronously written onto followers' timelines
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
- Search Engine - tweets are inserted into a distributed search engine from the main database after processing, for faster lookups and multi-faceted search. The top trending topics, hashtags, etc. are populated in the search index. This is an index similar to Elasticsearch.
- KV store - a distributed key/value store for persisting tweets, the key is tweet_id
- CDC - a change data capture from KV store pushes the tweets onto Kafka broker
- Broker - a distributed kafka cluster for tweets to flow through, the Kafka becomes the source for the stream processing pipeline
- Stream processing pipeline - a Flink streaming processor that pulls tweets from Kafka and performs the analyses above
- Top K - uses a CMS (Count-Min Sketch) in combination with a min-heap of hashtags to identify the most popular hashtags
- The min/max heap is stored in a cache
Failure scenarios/bottlenecks
- Hot partitions when a hashtag or tweet goes viral: many users read, tweet, and retweet the original tweet, which loads the database and the stream pipeline. Hashtags are sharded on their own key rather than by tweetId, and hot hashtags are found probabilistically with a CMS (Count-Min Sketch) rather than with exact tweet counts.
Future improvements
- Spam detection, malicious content filtering
- Security
- Demote fear-mongering tweets that spread panic