Codemia | Master System Design Interviews Through Active Practice

My Solution for Design Twitter Search with Score: 9/10

by alchemy1135

System requirements

Functional:

User Authentication:
Users should be able to create accounts and log in securely.
Tweet Storage:
Store tweets with associated metadata, including user information, content, timestamp, and media.
Support the efficient storage and retrieval of a large volume of tweets.
Search Functionality:
Allow users to search for tweets based on hashtags, keywords, user mentions, and specific users.
Implement features such as autocomplete, spell check, and query expansion to enhance search capabilities.
Provide relevant, timely, and accurate search results.
Real-time Updates:
Support real-time updates to display new tweets as they are posted.
Notify users of new tweets based on their search queries or personalized content.
Personalized Content:
Enable users to view personalized content based on their interests, followers, and past engagement.
Implement algorithms to recommend relevant content to users.
Trending Topics:
Highlight trending topics and hashtags to keep users informed about popular discussions.
Display trending topics based on real-time data and user engagement.
Tweet Interaction:
Allow users to like, retweet, reply to, and share tweets.
Track and display engagement metrics such as likes, retweets, and replies.
Indexing System:
Implement an indexing system based on inverted indexes to efficiently store and retrieve tweet information. This will involve mapping terms in tweets to the documents where they appear, enabling fast and accurate search results.
Query Processing:
Develop a query processing system that leverages inverted indexes for optimized search queries. Include features such as the intersection of posting lists for multi-term queries and ranking based on relevance metrics derived from the indexes.
Search Optimization:
Utilize inverted indexes to enhance search performance by minimizing the number of documents scanned, speeding up query processing, and improving the accuracy of search results.
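As a rough illustration of the indexing and query-processing requirements above, the sketch below builds a tiny in-memory inverted index and answers a multi-term query by intersecting posting lists. The function names and the sample tweets are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import defaultdict


def build_inverted_index(tweets):
    """Map each lowercase term to a sorted list of tweet IDs containing it."""
    postings = defaultdict(set)
    for tweet_id, text in tweets.items():
        for term in text.lower().split():
            postings[term].add(tweet_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_all_terms(index, query):
    """Return tweet IDs that contain every query term (AND semantics)."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# Hypothetical sample data.
tweets = {
    1: "system design interview tips",
    2: "twitter search system design",
    3: "search ranking with bm25",
}
index = build_inverted_index(tweets)
print(search_all_terms(index, "system design"))  # [1, 2]
print(search_all_terms(index, "search design"))  # [2]
```

Only the posting lists for the query terms are touched, which is what keeps the number of scanned documents small.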

Non-Functional:

Scalability:
The system should be highly scalable to handle a growing number of users, tweets, and search queries.
Support horizontal scaling and distributed systems to accommodate increased traffic.
Performance:
Ensure low-latency search results and real-time updates.
Optimize tweet storage and retrieval processes for efficient performance.
Reliability:
Provide a reliable and robust system with minimal downtime.
Implement backup and recovery mechanisms to ensure data integrity.
Security:
Safeguard user data through secure authentication and authorization mechanisms.
Encrypt sensitive information during storage and transmission.
Regularly conduct security audits and implement measures to prevent data breaches.
Monitoring and Analytics:
Implement monitoring tools to track system metrics, user behavior, and search performance.
Utilize analytics to analyze user patterns and improve the search experience.
Load Handling and Caching:
Ensure the system handles varying loads gracefully and can scale capacity up or down with user demand.
Implement caching mechanisms to improve response times and reduce server load.

Capacity estimation

Assumptions

Total user base: 1 billion users.
Monthly active users: 200 million (the estimates below treat all of them as active each day).
Average number of tweets per active user per day: 5.
Average size of a text tweet: 500 bytes.
10% of tweets contain images or videos.
Average media size per media tweet: 500 KB.

From the above assumptions, we can derive the following:

Total number of tweets per day: 200M * 5 = 1 billion tweets/day

Tweets with media = 10% of 1 billion = 100 million tweets/day

Number of write requests per second: 1 billion / (24 * 3600) = ~12k requests/sec

Storage Estimations

Text Tweets: 1 billion * 500 bytes = 500 GB / day => ~913 TB for 5 years

Media Tweets: 100 million * 500 KB = 50 TB / day => ~91 PB for 5 years
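The estimates can be re-derived directly from the stated assumptions with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it uses decimal units (1 KB = 1,000 bytes), a 365-day year, and treats all 200 million active users as tweeting daily.

```python
ACTIVE_USERS = 200_000_000        # active users from the assumptions
TWEETS_PER_USER_PER_DAY = 5
TEXT_TWEET_BYTES = 500
MEDIA_PERCENT = 10                # 10% of tweets carry media
MEDIA_BYTES = 500 * 1000          # 500 KB per media tweet
DAYS_IN_5_YEARS = 5 * 365

tweets_per_day = ACTIVE_USERS * TWEETS_PER_USER_PER_DAY
media_tweets_per_day = tweets_per_day * MEDIA_PERCENT // 100
writes_per_second = tweets_per_day / (24 * 3600)

text_bytes_per_day = tweets_per_day * TEXT_TWEET_BYTES
media_bytes_per_day = media_tweets_per_day * MEDIA_BYTES

TB = 10**12
PB = 10**15
text_tb_5y = text_bytes_per_day * DAYS_IN_5_YEARS / TB
media_pb_5y = media_bytes_per_day * DAYS_IN_5_YEARS / PB

print(f"tweets/day:         {tweets_per_day:,}")
print(f"media tweets/day:   {media_tweets_per_day:,}")
print(f"writes/sec:         {writes_per_second:,.0f}")
print(f"text storage, 5y:   {text_tb_5y:,.1f} TB")
print(f"media storage, 5y:  {media_pb_5y:,.2f} PB")
```

Media dominates the footprint by roughly two orders of magnitude, which is why it is kept in object storage rather than alongside the tweet text.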

Peak Traffic Loads:

During a big news break or celebrity interaction, traffic on the platform spikes sharply. Let's estimate the peak traffic for one hour.

Assumptions:

During peak hours, user activity may be 5 times higher than the average.
Each user might issue one search request every 2 minutes on average.

Calculation:

Peak User Activity = 5 * Monthly Active Users = 5 * 200 million = 1 billion users.
Search Requests per User per Hour = 60 minutes / 2 minutes = 30 requests per hour.
Total Search Requests during Peak Hour = Peak User Activity * Search Requests per User per Hour = 1 billion * 30 = 30 billion search requests per hour.

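Peak Search Rate = 30 billion searches / 3,600 seconds = ~8.3 million requests per second.
Concurrent Searches = ~8.3 million requests per second * 10 seconds = ~83 million, assuming each search occupies a connection for about 10 seconds.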

Search Requests Per Second:

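Now that we know we have 200M active users, let's make the assumptions below.

Assumptions:

Each user makes a total of 30 search requests per day.
Each server can handle 100K requests per second.

Total Search Requests per Day = 200M * 30 = 6 billion

Average Search Requests per Second = 6 billion / 86,400 sec = ~69,400 requests/sec

Number of Servers for Search Requests = 69,400 / 100K = ~1 server at average load. At the peak rate of 30 billion searches per hour (~8.3 million requests/sec), the same per-server capacity requires about 84 servers, before adding headroom and redundancy.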
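A quick way to sanity-check the search-rate figures is to convert both the daily total and the peak-hour total to requests per second before dividing by per-server capacity. The sketch below does that with the numbers from this section; the 100K requests/sec per server is the section's own assumption and is optimistic for a real search node.

```python
import math

ACTIVE_USERS = 200_000_000
SEARCHES_PER_USER_PER_DAY = 30
SERVER_CAPACITY_RPS = 100_000             # assumed per-server throughput
PEAK_SEARCHES_PER_HOUR = 30_000_000_000   # peak-hour total computed earlier

SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_HOUR = 3600

avg_rps = ACTIVE_USERS * SEARCHES_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_rps = PEAK_SEARCHES_PER_HOUR / SECONDS_PER_HOUR

# Round up: a fraction of a server still needs a whole machine.
avg_servers = math.ceil(avg_rps / SERVER_CAPACITY_RPS)
peak_servers = math.ceil(peak_rps / SERVER_CAPACITY_RPS)

print(f"average search rate: {avg_rps:,.0f} req/s -> {avg_servers} server(s)")
print(f"peak search rate:    {peak_rps:,.0f} req/s -> {peak_servers} server(s)")
```

Sizing for the peak rather than the average is what drives the server count, so the peak assumption deserves the most scrutiny.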

API design

1. User Authentication API:

Description: This API handles user authentication and authorization.
Input: User credentials (username, password), authentication method (OAuth tokens, MFA codes).
Output: Access tokens for authenticated users, allowing secure access to their accounts.

2. Tweet Creation API:

Description: Allows users to create and post tweets.
Input: User authentication token, tweet content, optional media attachments.
Output: Newly created tweet with metadata (tweet ID, timestamp, user information).

3. Tweet Retrieval API:

Description: Retrieves tweets based on specified search parameters.
Input: User authentication token, search parameters (hashtags, keywords, user mentions).
Output: List of relevant tweets with metadata (user, content, timestamp, engagement metrics).

4. Real-time Updates API:

Description: Sends real-time updates to users for new tweets matching their search criteria.
Input: User authentication token, last viewed timestamp.
Output: Real-time notifications of new tweets, allowing timely updates for users.

5. Personalized Content API:

Description: Generates personalized content recommendations for users.
Input: User authentication token, user preferences, past engagement data.
Output: Recommended tweets tailored to the user's interests and engagement history.

6. Trending Topics API:

Description: Retrieves trending topics and hashtags.
Input: User authentication token, location (optional).
Output: List of trending topics with associated metadata (hashtags, tweet counts).

7. Tweet Interaction API:

Description: Handles user interactions with tweets, including likes, retweets, replies, and sharing.
Input: User authentication token, tweet ID, interaction type.
Output: Updated tweet metrics (likes, retweets, replies) and notifications for relevant users.
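To make the inputs and outputs above more concrete, here is one possible shape for the Tweet Retrieval API payloads, written as Python dataclasses. All type names, field names, and limits (`SearchRequest`, `TweetResult`, `SearchResponse`, `limit`, `next_cursor`) are hypothetical and only illustrate the description; they are not a defined contract.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchRequest:
    auth_token: str
    keywords: List[str] = field(default_factory=list)
    hashtags: List[str] = field(default_factory=list)
    mentions: List[str] = field(default_factory=list)
    limit: int = 20  # hypothetical page size

    def validate(self) -> None:
        """Reject requests that cannot be served."""
        if not self.auth_token:
            raise ValueError("auth_token is required")
        if not (self.keywords or self.hashtags or self.mentions):
            raise ValueError("at least one search parameter is required")
        if not 1 <= self.limit <= 100:
            raise ValueError("limit must be between 1 and 100")


@dataclass
class TweetResult:
    tweet_id: int
    user: str
    content: str
    timestamp: float  # seconds since the Unix epoch
    likes: int = 0
    retweets: int = 0
    replies: int = 0


@dataclass
class SearchResponse:
    results: List[TweetResult]
    next_cursor: Optional[str] = None  # opaque pagination token, if any


request = SearchRequest(auth_token="token-123", hashtags=["#systemdesign"])
request.validate()  # raises ValueError if the request is malformed
```

Returning an opaque cursor instead of a page number is one common way to paginate results whose ordering changes as new tweets arrive.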

Database design

For the database design we will create the following tables

User Table
Tweet Table
Hashtag Table
Mention Table
Like Table
Follower Table
Notification Table
Media Table

To see how these tables are linked together have a look at the class diagram.

Database Choices:

User Data (Relational Database):
Database Type: Relational Database (e.g., PostgreSQL, MySQL)
Reasoning: Relational databases provide ACID transactions and strong consistency, making them suitable for ensuring data integrity and handling complex queries. In a replicated deployment they typically favor consistency over availability during a network partition.
CAP Theorem: Consistency-focused.
Tweet and Metadata (NoSQL Database):
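Database Type: NoSQL Database (e.g., Cassandra, MongoDB)
Reasoning: NoSQL databases offer flexible schemas and horizontal scalability for high-volume tweet data. The CAP trade-off depends on the product and its configuration: Cassandra favors Availability and Partition Tolerance with tunable consistency, while MongoDB's single-primary replica sets lean toward consistency by default.
CAP Theorem: Availability-focused when an AP store such as Cassandra is used.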
Media Content (Object Storage):
Database Type: Object Storage (e.g., Amazon S3, Azure Blob Storage)
Reasoning: Object storage delivers a highly scalable, durable, and distributed solution for storing and serving binary media content, and it pairs well with a CDN for delivery.
CAP Theorem: Availability-focused.
Search Index and Metadata (Search Engine):
Database Type: Search Engine (e.g., Elasticsearch)
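Reasoning: Search engines such as Elasticsearch shard and replicate the index across nodes for fast, highly available search. Indexing is near real-time, so a newly posted tweet becomes searchable after a short refresh delay, which is acceptable for search results.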
CAP Theorem: Availability-focused.

Each database type is chosen based on its strengths in handling specific types of data within the Twitter Search System, providing a balanced and efficient solution.

Data Partitioning:

Which Partitioning Strategy should we apply here?
Best Strategy: Hash-based Partitioning.
Reasoning: Hash-based partitioning is well-suited for even distribution of data across nodes, crucial for a system like Twitter where tweets and user interactions can generate high write and read loads. This approach ensures a balanced distribution of data and efficient utilization of resources.
Regional or Geographical Partitioning: Considering the global nature of Twitter, regional or geographical partitioning may not be optimal, as it could lead to uneven data distribution and impact system performance. Hash-based partitioning provides a more uniform approach.
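Hash-based partitioning as described above can be sketched in a few lines: hash the partition key with a stable hash function and take the result modulo the number of partitions. The key format, the partition count, and the `shard_for` helper are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Pick a partition using a stable hash of the key.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to keep placement stable across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Distribute 10,000 hypothetical tweet IDs over 8 partitions.
counts = Counter(shard_for(f"tweet-{i}", 8) for i in range(10_000))
for shard in sorted(counts):
    print(f"shard {shard}: {counts[shard]} keys")
```

One caveat of plain modulo hashing is that changing the number of partitions remaps most keys; consistent hashing is a common alternative when partitions are added or removed frequently.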

Sharding:

Which Sharding Strategy works the best in this scenario?
Best Strategy: Range Sharding or Composite Sharding.
Reasoning: As tables grow in size, range sharding on a specific range of values (e.g., tweet creation timestamp) keeps recent tweets together, which suits recency-biased search, but it concentrates writes on the shard holding the newest range. Composite sharding, which combines the range with another factor such as user or geographical location, spreads that load more evenly and helps the system scale without overwhelming individual shards.
Considerations: The choice between range and composite sharding depends on the specific characteristics of the data and the anticipated query patterns.
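The difference between range and composite sharding can be illustrated with a small sketch. It assumes time ranges split at year boundaries and a fixed number of hash buckets inside each range; the boundaries, the bucket count, and the function names are all hypothetical.

```python
import bisect
import hashlib
from datetime import datetime, timezone

# Hypothetical range boundaries: each entry is the start of one time range.
RANGE_STARTS = [
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 1, 1, tzinfo=timezone.utc),
]
BUCKETS_PER_RANGE = 4  # hypothetical number of hash buckets per time range


def range_index(created_at: datetime) -> int:
    """Range sharding: find the time range that contains the timestamp."""
    idx = bisect.bisect_right(RANGE_STARTS, created_at) - 1
    if idx < 0:
        raise ValueError("timestamp precedes the first range")
    return idx


def composite_shard(created_at: datetime, user_id: str) -> tuple:
    """Composite sharding: time range first, then a hash bucket by user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS_PER_RANGE
    return (range_index(created_at), bucket)


posted = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
print(range_index(posted))               # index of the 2024 range
print(composite_shard(posted, "alice"))  # (range index, user bucket)
```

With the composite key, writes for the current time range are spread over several buckets instead of landing on a single shard.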

Replication:

Which Replication Strategy?
Best Strategy: Multi-Datacenter Replication.
Reasoning: Given the global user base of Twitter, multi-datacenter replication ensures high availability and fault tolerance. This approach involves replicating data across geographically distributed data centers, reducing latency and improving user experience.

Load Balancing:

Which Load-balancing approach works the best here?
Best Strategy: Global Load Balancing with Dynamic Routing.
Reasoning: Global load balancing ensures even distribution of incoming requests across servers in different regions. Dynamic routing allows for intelligent routing based on server health and load, optimizing response times and minimizing downtime.

In summary, employing hash-based partitioning, range or composite sharding, multi-datacenter replication, and global load balancing with dynamic routing will contribute to the scalability, availability, and performance of the Twitter Search System, aligning with the characteristics of the data and user interactions on the platform.

High-level design

User Interface (UI):
Responsibility: Provides a user-friendly interface for users to interact with the Twitter platform, including the search functionality.
Authentication Service:
Responsibility: Manages user authentication and authorization, ensuring secure access to user accounts and search functionalities.
Search Service:
Responsibility: Handles user search queries, interacting with various components to retrieve relevant tweets based on hashtags, keywords, user mentions, and specific users.
Indexing and Query Processing:
Responsibility: Utilizes technologies like Elasticsearch for efficient indexing and query processing, enabling fast and accurate retrieval of tweets based on search parameters.
Tweet Service:
Responsibility: Manages the storage and retrieval of tweet data, including information such as user details, tweet content, timestamp, and associated media.
Media Service:
Responsibility: Stores and serves media content associated with tweets, handling the upload, storage, and retrieval of images and videos.
User Service:
Responsibility: Manages user-related data, including user profiles, followers, and interactions. Supports functionalities like liking, retweeting, and following.
Trending Topics Service:
Responsibility: Identifies and updates trending topics and hashtags based on user interactions and tweet popularity.
Notification Service:
Responsibility: Handles the generation and delivery of notifications to users for activities such as likes, retweets, and mentions.
Load Balancer:
Responsibility: Distributes incoming requests across multiple servers to ensure even load distribution and prevent bottlenecks.
Caching Layer:
Responsibility: Implements caching mechanisms to store frequently accessed search results, improving response times and reducing load on the database.
Monitoring and Analytics:
Responsibility: Incorporates tools like Prometheus and Grafana to monitor system metrics, track user behavior, and analyze search patterns.
Database (Distributed):
Responsibility: Utilizes distributed databases (e.g., Cassandra, MongoDB) to store user data, tweet content, media, and other related information.
Content Delivery Network (CDN):
Responsibility: Enhances the performance by caching and delivering static content, reducing latency for media and other non-dynamic assets.
Global DNS Resolution:
Responsibility: Ensures global availability by utilizing a global DNS resolution system, directing users to the nearest server or data center.

Below is a simplified component diagram that shows what happens when a user makes a search request.

Request flows

Here is a simple sequence diagram that shows the search request flow.

Detailed component design

Let's now look at the main components of our Twitter search engine. A production system would contain many more components, but for the scope of this interview problem we will discuss the ones below.

1. Ingestor:

Responsibility:
Processes incoming tweets, extracting relevant information, and performing text preprocessing tasks.
Tokenizes the tweet content, removing stop words to focus on essential keywords.
Applies stemming to reduce words to their root forms, enhancing search accuracy.
Passes the processed data to the Search Index component for storage and retrieval.
Tokenization Algorithms:
Whitespace Tokenization
N-gram Tokenization
NLTK tokenizers (the Natural Language Toolkit library, which provides word-level and tweet-aware tokenizers)
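The Ingestor's preprocessing steps (tokenize, drop stop words, stem) can be sketched with the standard library alone. The stop-word list and the suffix-stripping stemmer below are deliberately naive placeholders; a real pipeline would more likely use a library tokenizer and an established stemmer such as Porter's.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

# Keep hashtags and mentions as single tokens, e.g. "#twitter", "@alice".
TOKEN_PATTERN = re.compile(r"[#@]?\w+")

SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip one common English suffix; hashtags and mentions are left intact."""
    if token[0] in "#@":
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list:
    """Lowercase, tokenize, remove stop words, then stem."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Designing the search indexes for #Twitter with @alice"))
# ['design', 'search', 'index', '#twitter', '@alice']
```

The same `preprocess` function should be applied to queries, so that query terms and indexed terms are normalized identically.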

2. Search Index:

Responsibility:
Creates and maintains an index of tweets, metadata, and other relevant details for efficient search operations.
Stores the index in a dedicated Search Index Database for persistent storage.
Implements caching mechanisms to store frequently accessed search results, reducing response times.
Indexing Algorithms:
Inverted Indexing: Maps terms (tokens) to document IDs, facilitating quick retrieval of relevant documents.
Forward Indexing: Stores a list of terms and their occurrences in each document, aiding in scoring and ranking.
BM25 (Best Matching 25): A relevance scoring function, applied at query time on top of the index, that considers term frequency, inverse document frequency, and document length for ranking.
Caching Strategies:
LRU (Least Recently Used): Evicts the least recently used search results from the cache.
LFU (Least Frequently Used): Evicts the least frequently used search results from the cache.
Time-Based Expiry: Sets a time limit for cached results, ensuring freshness.
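Two of the caching strategies above, LRU eviction and time-based expiry, combine naturally in one structure. The sketch below is a minimal single-process version built on `collections.OrderedDict`; the class name, the capacity, and the TTL are hypothetical, and a production deployment would more likely use a shared cache service.

```python
import time
from collections import OrderedDict


class SearchResultCache:
    """LRU cache for search results with a per-entry time-to-live."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock              # injectable so tests can control time
        self._entries = OrderedDict()   # query -> (expires_at, results)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        expires_at, results = entry
        if self.clock() >= expires_at:
            # Stale entry: drop it so the caller re-runs the search.
            del self._entries[query]
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return results

    def put(self, query, results):
        self._entries[query] = (self.clock() + self.ttl, results)
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


cache = SearchResultCache(capacity=1000, ttl_seconds=30)
cache.put("#systemdesign", [101, 102, 103])
print(cache.get("#systemdesign"))  # [101, 102, 103]
```

A short TTL matters for tweet search in particular, because cached results for a trending query go stale within seconds.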

3. Query Processor:

Responsibility:
Receives user search queries and processes them to identify relevant terms.
Interacts with the Search Index to retrieve matching tweets based on query terms.
Utilizes ranking algorithms to determine the order of search results.
Query Processing Algorithms:
TF-IDF (Term Frequency-Inverse Document Frequency): Calculates the relevance of terms based on their frequency in a document and rarity across documents.
Boolean Retrieval: Matches documents that satisfy a Boolean expression over terms (AND, OR, NOT) without ranking them, suitable for exact-match queries such as a specific hashtag or user mention.
Vector Space Model: Represents documents and queries as vectors for similarity comparison.
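As an illustration of term-weighted scoring, the sketch below implements BM25, which extends the TF-IDF idea with term-frequency saturation and document-length normalization. It uses the common formulation with parameters `k1` and `b` and the smoothed IDF variant used by Lucene; the function name and sample documents are hypothetical, and documents are assumed to be pre-tokenized and non-empty.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query with BM25.

    docs maps a document ID to its list of tokens.
    """
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF: rare terms contribute more, never negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    1: ["search", "ranking", "search"],
    2: ["search", "index"],
    3: ["trending", "topics"],
}
scores = bm25_scores(["search"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # [1, 2, 3]
```

Because tweets are short and similar in length, the length-normalization term has less effect here than in web search, and recency and engagement signals usually carry more weight.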

4. Ranking Algorithms:

Responsibility:
Determines the order of search results based on various factors like recency, relevance, and user engagement.
Enhances user experience by presenting the most pertinent tweets first.
Ranking Strategies:
Recency Weighting: Boosts the importance of recent tweets for timely and relevant results.
Popularity Scoring: Considers the number of likes, retweets, and replies to gauge a tweet's popularity.
User Engagement: Incorporates metrics like the user's past interactions and engagement with tweets.
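One simple way to combine the ranking strategies above is to add a damped popularity signal to the text-relevance score and then apply an exponential recency decay. The weights, the half-life, and the formula itself are hypothetical choices for illustration, not a known production formula.

```python
import math


def rank_score(relevance, age_hours, likes, retweets, replies,
               half_life_hours=6.0, w_relevance=1.0, w_popularity=0.3):
    """Blend relevance, popularity, and recency into one score.

    All weights and the half-life are illustrative assumptions.
    """
    # Recency weighting: the score halves every half_life_hours.
    recency = 0.5 ** (age_hours / half_life_hours)
    # Popularity scoring: log damping so viral tweets do not dominate.
    popularity = math.log1p(likes + 2 * retweets + replies)
    return (w_relevance * relevance + w_popularity * popularity) * recency


fresh = rank_score(relevance=1.0, age_hours=1, likes=10, retweets=2, replies=1)
stale = rank_score(relevance=1.0, age_hours=24, likes=10, retweets=2, replies=1)
print(fresh > stale)  # True
```

In practice such weights would be tuned against click and engagement data rather than set by hand.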

5. Rate Limiting and Scalability:

Rate Limiting:
Implements token bucket or leaky bucket algorithms to control the rate of incoming search queries per user.
Prevents abuse and ensures fair access to the search functionality.
Scalability Considerations:
Horizontal Scaling: Distributes search functionality across multiple servers to handle increased traffic.
Load Balancing: Uses load balancers to evenly distribute search queries, optimizing resource utilization.
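The token bucket algorithm mentioned above can be sketched as follows: each user gets a bucket that refills at a fixed rate up to a maximum, and each search spends one token. The class name, rate, and capacity are hypothetical; a distributed deployment would keep the bucket state in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Per-user token bucket: allows bursts up to capacity, then a steady rate."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock               # injectable so tests can control time
        self.tokens = float(capacity)    # start with a full bucket
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical limits: 1 search/sec sustained, bursts of up to 5.
bucket = TokenBucket(rate_per_sec=1.0, capacity=5)
print([bucket.allow() for _ in range(6)])  # the sixth request is typically rejected
```

Unlike a leaky bucket, which smooths traffic to a constant outflow, the token bucket tolerates short bursts, which fits interactive search where a user may fire several queries in quick succession.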

Future improvements

Several potential future improvements could enhance our design for the Twitter Search System:

Advanced Natural Language Processing (NLP):
Integrate advanced NLP techniques to improve tweet understanding, sentiment analysis, and contextual relevance in search results.
Enhanced Ranking Algorithms:
Develop more sophisticated ranking algorithms, possibly incorporating machine learning models to personalize search results based on individual user preferences and behaviors.
Improved Personalization:
Enhance the personalization features by considering a broader range of user interactions, including saved searches, past engagement patterns, and user preferences.
Query Suggestions and Autocomplete:
Implement intelligent query suggestions and autocomplete features to assist users in formulating queries and discovering relevant content more efficiently.
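As a starting point for the autocomplete improvement, prefix completion over a sorted list of known terms can be done with a binary search. This is only a minimal sketch: the term list and function name are hypothetical, and a real feature would rank suggestions by popularity and typically use a trie or a search-engine completion feature.

```python
import bisect


def autocomplete(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`.

    sorted_terms must be sorted; all matches are then contiguous.
    """
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for i in range(start, len(sorted_terms)):
        if len(matches) == limit or not sorted_terms[i].startswith(prefix):
            break
        matches.append(sorted_terms[i])
    return matches


terms = sorted(["search", "season", "seattle", "system", "twitter"])
print(autocomplete(terms, "sea"))  # ['search', 'season', 'seattle']
print(autocomplete(terms, "sy"))   # ['system']
```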