Codemia | Master System Design Interviews Through Active Practice

My Solution for Design Twitter with Score: 9/10

by serenade3523

System requirements

Functional:

Tweets:

Users can create short text posts (tweets).
Tweets can include media (images, videos).
Tweets have a character limit (e.g., 280 characters).

Following:

Users can follow other users.
A timeline displays tweets from followed users in reverse chronological order.

Favoriting (Liking):

Users can mark tweets as favorites.
A list of favorited tweets is accessible.

Notifications:

Users receive notifications for mentions, replies, and likes.

Non-Functional:

Read heavy - The read-to-write ratio for Twitter is very high, so the system must be optimized for read-dominated traffic.
Fast rendering - Timelines and tweets should render with very low latency.
Fast tweet - Posting a tweet should complete quickly.
Lag is acceptable - The previous two NFRs imply that the system should be highly available and have very low latency. By "lag is acceptable" we mean that a notification about someone else's tweet may arrive a few seconds late, but the rendering of the content should be almost instantaneous.
Scalable - On an average day Twitter receives 5k+ tweets every second, and at peak times that can easily double. Those are only the writes. Because the read-to-write ratio is very high, the number of reads against these tweets is far larger still, which adds up to a huge number of requests per second.

So how do we design a system that delivers all our functional requirements without compromising the performance? Before we discuss the overall architecture, let’s split our users into different categories. Each of these categories will be handled in a slightly different manner.

Famous Users: Famous users are usually celebrities, sportspeople, politicians, or business leaders who have a lot of followers.
Active Users: These are the users who have accessed the system in the last couple of hours or days. For our discussion, we will consider people who have accessed Twitter in the last three days as active users.
Live Users: These are a subset of active users who are using the system right now, similar to the "online" status on Facebook or WhatsApp.
Passive Users: These are the users who have active accounts but haven’t accessed the system in the last three days.
Inactive Users: These are the “deleted” accounts so to speak. We don’t really delete any accounts, it is more of a soft delete, but as far as the users are concerned the account doesn’t exist anymore.
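The categories above can be expressed as a small classification function. This is only a sketch: the three-day window comes from the text, while `LIVE_WINDOW`, `FAMOUS_FOLLOWER_THRESHOLD`, and the function names are hypothetical values and names chosen for illustration.

```python
from datetime import datetime, timedelta

# The three-day window is from the text; the other two thresholds are assumptions.
ACTIVE_WINDOW = timedelta(days=3)
LIVE_WINDOW = timedelta(minutes=5)        # assumed meaning of "right now"
FAMOUS_FOLLOWER_THRESHOLD = 1_000_000     # assumed cutoff for "a lot of followers"


def activity_category(last_seen: datetime, now: datetime, soft_deleted: bool) -> str:
    """Return 'inactive', 'live', 'active', or 'passive' for one account."""
    if soft_deleted:
        return "inactive"
    idle = now - last_seen
    if idle <= LIVE_WINDOW:
        return "live"       # live users are a subset of active users
    if idle <= ACTIVE_WINDOW:
        return "active"
    return "passive"


def is_famous(follower_count: int) -> bool:
    """Fame is independent of activity: it depends only on follower count."""
    return follower_count >= FAMOUS_FOLLOWER_THRESHOLD
```

Keeping fame separate from activity lets a famous user also be live, active, or passive, which matches how the categories are described.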

Capacity estimation

Assume each request is 1000 bytes and the system handles 10K requests per second (QPS). The data volume per day is then:

1000 bytes * 10,000 * 60 * 60 * 24 = 864 GB per day
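The arithmetic can be checked with a few lines of Python. The inputs are the same assumptions stated above (1000 bytes per request, 10K QPS); the constant names are illustrative.

```python
# Back-of-the-envelope capacity check using the assumptions stated above.
BYTES_PER_REQUEST = 1000        # assumed average request size
REQUESTS_PER_SECOND = 10_000    # assumed sustained QPS
SECONDS_PER_DAY = 60 * 60 * 24

bytes_per_day = BYTES_PER_REQUEST * REQUESTS_PER_SECOND * SECONDS_PER_DAY
gigabytes_per_day = bytes_per_day / 10**9  # decimal gigabytes

print(f"{gigabytes_per_day:.0f} GB per day")  # prints "864 GB per day"
```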

API design

User APIs:

POST /users/register: Create a new user account.
POST /users/login: Authenticate a user.
GET /users/me: Get the current user's profile.
PUT /users/me: Update the current user's profile.
GET /users/{userId}: Get a user's profile by ID.
POST /users/{userId}/follow: Follow a user.
POST /users/{userId}/unfollow: Unfollow a user.
GET /users/{userId}/followers: Get a user's followers.
GET /users/{userId}/following: Get the users a user is following.

Tweet APIs:

POST /tweets: Create a new tweet.
GET /tweets: Get a feed of tweets (timeline).
GET /tweets/{tweetId}: Get a specific tweet by ID.
DELETE /tweets/{tweetId}: Delete a tweet.
POST /tweets/{tweetId}/like: Like (favorite) a tweet.
POST /tweets/{tweetId}/unlike: Unlike a tweet.
GET /tweets/search: Search for tweets based on keywords, hashtags, or users.

Timeline APIs:

GET /me/home_timeline: Get the home timeline (tweets from followed users).
GET /users/{userId}/timeline: Get a user's tweets.
GET /me/mentions: Get tweets that mention the current user.

Additional APIs (Optional):

POST /tweets/{tweetId}/retweet: Retweet a tweet.
GET /trends: Get trending topics.
POST /direct_messages: Send a direct message to a user.
GET /direct_messages: Get direct messages for the current user.

Authentication and Authorization:

All APIs (except register and login) should require authentication using a token-based mechanism (e.g., JWT).
Some APIs may require additional authorization checks (e.g., deleting a tweet should only be allowed by the author).

Database design

Entities:

User:
UserID (Primary Key)
Username (Unique)
Email (Unique)
Password (Hashed)
ProfilePicture
Bio
CreatedAt
UpdatedAt
Tweet:
TweetID (Primary Key)
UserID (Foreign Key referencing User)
Text
Media URLs (Array of strings)
CreatedAt
Follow: (Represents the relationship between users)
FollowerID (Foreign Key referencing User)
FolloweeID (Foreign Key referencing User)
CreatedAt
(Composite Primary Key: FollowerID, FolloweeID)
Like: (Represents the relationship between users and tweets)
UserID (Foreign Key referencing User)
TweetID (Foreign Key referencing Tweet)
CreatedAt
(Composite Primary Key: UserID, TweetID)
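One possible rendering of these entities, using Python's built-in `sqlite3` module so that it runs anywhere. The plural snake_case table and column names are an assumption (chosen partly because `LIKE` is an SQL keyword), the 280-character check comes from the functional requirements, and `media_urls` is stored as a JSON string because SQLite has no array type. A production system would use whichever database is chosen later in this design.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id         INTEGER PRIMARY KEY,
    username        TEXT NOT NULL UNIQUE,
    email           TEXT NOT NULL UNIQUE,
    password_hash   TEXT NOT NULL,
    profile_picture TEXT,
    bio             TEXT,
    created_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE tweets (
    tweet_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    text       TEXT NOT NULL CHECK (length(text) <= 280),
    media_urls TEXT NOT NULL DEFAULT '[]',  -- JSON array of URL strings
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(user_id),
    followee_id INTEGER NOT NULL REFERENCES users(user_id),
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (follower_id, followee_id)
);
CREATE TABLE likes (
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    tweet_id   INTEGER NOT NULL REFERENCES tweets(tweet_id),
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, tweet_id)
);
-- The composite key on follows already covers lookups by follower_id;
-- a separate index serves "who follows this user" queries.
CREATE INDEX idx_follows_followee ON follows (followee_id);
CREATE INDEX idx_tweets_user_created ON tweets (user_id, created_at);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all tables and indexes on the given connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
```

The composite primary keys make a duplicate follow or like fail at the database level, so the API layer does not need its own uniqueness check.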

Relationships:

User-Tweet: One-to-many relationship. A user can create many tweets, but a tweet belongs to only one user.
User-Follow: Many-to-many relationship. A user can follow many users, and a user can be followed by many users.
User-Like: Many-to-many relationship. A user can like many tweets, and a tweet can be liked by many users.

Considerations:

Indexes:
Create indexes on UserID in the Tweet table and on FollowerID and FolloweeID in the Follow table to optimize timeline generation queries.
Create indexes on frequently queried fields like Username for faster user lookups.
Denormalization:
Consider storing the author's username in the Tweet table to avoid joining the User table for every tweet in the timeline.
Partitioning (Optional):
If the service becomes very large, consider partitioning the Tweet table by date or some other criteria to improve query performance.

Implementation:

You can choose a relational database (e.g., PostgreSQL, MySQL) or a NoSQL database (e.g., MongoDB, Cassandra) based on your scalability and data model flexibility requirements.
Consider using an ORM (Object-Relational Mapper) like SQLAlchemy (Python), Hibernate (Java), or GORM (Go) to simplify database interactions.

Additional Tips:

Use meaningful names for tables, columns, and relationships.
Consider future features and their impact on the database schema.
Regularly optimize and maintain the database to ensure performance and data integrity.

High-level design

Components:

Client Applications:

Mobile Apps (iOS, Android): Native apps for mobile users.
Web Application: Web interface for desktop and mobile browsers.

Load Balancer:

Distributes incoming traffic across multiple API servers for scalability and high availability.

API Gateway (Optional):

Provides a single entry point for clients, handling routing, authentication, rate limiting, and other cross-cutting concerns.

API Servers:

Handles API requests from clients.
Implements the core business logic (e.g., tweet creation, timeline generation, user management).
Interacts with the database and other backend services.

Database:

Stores user data, tweet data, follow relationships, likes, and other application data.
Can be a relational database (e.g., PostgreSQL, MySQL) or a NoSQL database (e.g., MongoDB, Cassandra) based on your requirements.

Cache (Optional):

Stores frequently accessed data (e.g., timelines, user profiles) in memory to improve performance and reduce database load.
Can use a technology like Redis or Memcached.

Search Index (Optional):

Enables full-text search of tweets and user profiles.
Can use a technology like Elasticsearch or Solr.

Background Job Workers:

Performs asynchronous tasks like:
Processing media uploads.
Sending notifications.
Generating analytics.

Notification Service:

Sends real-time notifications to users (e.g., mentions, replies, likes).

Data Flow:

Client sends a request to the Load Balancer.
Load Balancer forwards the request to an available API Server.
API Server processes the request:

Fetches data from the database and/or cache.
Performs business logic.
May enqueue tasks for background workers.

API Server sends a response back to the client.
Background workers perform tasks asynchronously.
Notification Service sends real-time notifications to users.

Scalability Considerations:

Horizontal Scaling: Add more API servers and database replicas as traffic increases.
Caching: Cache frequently accessed data to reduce database load.
Asynchronous Processing: Offload heavy tasks to background workers.
Database Sharding (Optional): Distribute data across multiple database servers for very large datasets.
Content Delivery Network (CDN): Use a CDN to distribute media files to users from servers closer to them.
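As an illustration of the sharding item above, a request router can map a user ID to a shard with a stable hash. This is a minimal sketch under assumptions: `shard_for_user` is a hypothetical helper, and simple modulo placement is used only for brevity.

```python
import hashlib


def shard_for_user(user_id: int, num_shards: int) -> int:
    """Map a user ID to a shard index in [0, num_shards).

    A cryptographic hash is used instead of Python's built-in hash() so the
    mapping is identical across processes and restarts.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Modulo placement moves most keys when `num_shards` changes. If shards are expected to be added often, consistent hashing or a lookup table of key ranges would reduce data movement.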

Additional Considerations:

Security: Implement measures to protect user data, prevent unauthorized access, and mitigate common attacks.
Monitoring: Monitor system health, performance metrics, and errors to proactively identify and address issues.
Analytics: Collect data on user behavior and system performance to gain insights and improve the service.

Request flows

Sequence Diagram for Posting a Tweet:

Request Flow Description:

Client Initiates Request:

The user composes a tweet in the client application (mobile app or web app).
The client sends a POST /tweets request to the Load Balancer, including the tweet text and any media files.

Load Balancer Distributes Request:

The Load Balancer selects an available API server and forwards the request.

API Server Processes Request:

Authentication: The API server verifies the user's authentication token.
Validation: The server validates the tweet data (e.g., checks character limits, ensures media files are valid).
Database Interaction:
The server inserts the tweet data (text, media URLs, user ID, timestamp) into the database.
It updates the user's profile (e.g., increments tweet count).
Cache Update (Optional): If a cache is used, the server updates relevant cached timelines (e.g., the user's timeline, the home timelines of the user's followers).

Background Job Triggered:

The API server may enqueue a background job to process the media files (e.g., resizing images, transcoding videos).

Response Sent to Client:

The API server sends a success response (HTTP 201 Created) to the client, including the ID of the newly created tweet.

Notifications Sent (Optional):

If the tweet mentions other users, the API server (or a background worker) sends notifications to those users.

Background Job Processing:

The background job worker processes the media files and updates the tweet's media URLs in the database.
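The server-side steps of this flow (validate, store, enqueue media work, respond) can be sketched in memory. `TweetService` and its fields are hypothetical: a dictionary stands in for the database and a `queue.Queue` stands in for the background job queue.

```python
import itertools
import queue
import time
from dataclasses import dataclass
from typing import Dict, List

MAX_TWEET_LENGTH = 280  # character limit from the functional requirements


@dataclass
class Tweet:
    tweet_id: int
    user_id: int
    text: str
    media_urls: List[str]
    created_at: float


class TweetService:
    """In-memory stand-in for the API server's tweet-creation path."""

    def __init__(self) -> None:
        self.tweets: Dict[int, Tweet] = {}                    # stands in for the database
        self.media_jobs: "queue.Queue[int]" = queue.Queue()   # background job queue
        self._ids = itertools.count(1)

    def post_tweet(self, user_id: int, text: str, media_urls: List[str]) -> Tweet:
        # Validation: reject empty or over-long tweets before touching storage.
        if not text or len(text) > MAX_TWEET_LENGTH:
            raise ValueError("tweet text must be 1 to 280 characters")
        tweet = Tweet(next(self._ids), user_id, text, list(media_urls), time.time())
        self.tweets[tweet.tweet_id] = tweet           # database insert
        if media_urls:
            self.media_jobs.put(tweet.tweet_id)       # media is processed asynchronously
        return tweet  # the HTTP layer would answer 201 Created with tweet.tweet_id
```

Authentication is assumed to have happened before `post_tweet` is called, so `user_id` is already trusted here.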

Additional Request Flows:

Similar request flows exist for other actions, such as:

Fetching the Home Timeline:
The client sends a GET /me/home_timeline request.
The API server fetches tweets from the database for the user's followed users, considering the timeline generation algorithm (e.g., recency, relevance).
The server may use the cache to optimize timeline retrieval.
The server sends the tweets in the response.
Liking/Unliking a Tweet:
The client sends a POST /tweets/{tweetId}/like or POST /tweets/{tweetId}/unlike request.
The API server updates the database accordingly and potentially invalidates cached timelines.
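For the home timeline flow, the uncached path is a join between follow relationships and tweets. The sketch below assumes simple reverse chronological ordering and uses `sqlite3` with illustrative table and column names; it shows the query shape, not a recommendation of SQLite.

```python
import sqlite3
from typing import List, Tuple

# Minimal tables for the demo; names are illustrative assumptions.
MINIMAL_SCHEMA = """
CREATE TABLE tweets (
    tweet_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    text       TEXT NOT NULL,
    created_at INTEGER NOT NULL          -- Unix timestamp
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL,
    followee_id INTEGER NOT NULL,
    PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX idx_tweets_user_created ON tweets (user_id, created_at);
"""


def home_timeline(conn: sqlite3.Connection, user_id: int, limit: int = 20) -> List[Tuple]:
    """Return the newest tweets from accounts that user_id follows."""
    rows = conn.execute(
        """
        SELECT t.tweet_id, t.user_id, t.text, t.created_at
        FROM tweets AS t
        JOIN follows AS f ON f.followee_id = t.user_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC, t.tweet_id DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()
```

This query runs on every read, which is why the design leans on cached timelines for a read-heavy workload.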

Detailed component design

1. Timeline Generation Service

The Timeline Generation Service is responsible for creating personalized timelines for each user, showing tweets from the accounts they follow in a relevant order.

Scaling:

Caching: To scale efficiently, aggressive caching is essential.
User Timelines: Each user's timeline is cached. Updates to the cache are triggered when a user posts a tweet, when someone they follow posts a tweet, or when they follow/unfollow someone.
Pre-calculated Timelines: To optimize the initial timeline loading, the service can pre-calculate timelines for active users during off-peak hours and store them in a cache.
Fan-out on Write: When a user posts a tweet, the tweet is immediately written to the timelines of all their followers. This approach ensures that timelines are up-to-date but can become a bottleneck for users with a massive number of followers.
Mitigation: For users with a huge following, you can use a hybrid approach. Write the tweet to a subset of followers' timelines immediately and use background jobs to update the rest.
Sharding: The underlying database can be sharded based on user IDs to distribute the read/write load.
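The hybrid fan-out described above can be sketched as follows. Several details are assumptions rather than part of the original design: the "subset" written immediately is taken to be the currently live followers, `fanout_threshold` is an arbitrary cutoff, and tweet IDs are assumed to increase over time so that sorting by ID gives reverse chronological order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set, Tuple


class TimelineStore:
    """In-memory stand-in for the per-user timeline cache."""

    def __init__(self, fanout_threshold: int = 10_000) -> None:
        self.fanout_threshold = fanout_threshold            # assumed cutoff
        self.followers: Dict[int, Set[int]] = defaultdict(set)
        self.timelines: Dict[int, Set[int]] = defaultdict(set)
        self.deferred: Deque[Tuple[int, int]] = deque()     # (follower_id, tweet_id)

    def follow(self, follower_id: int, followee_id: int) -> None:
        self.followers[followee_id].add(follower_id)

    def publish(self, author_id: int, tweet_id: int, live_users: Set[int]) -> None:
        """Fan out on write; for huge audiences only live followers are updated now."""
        audience = self.followers[author_id]
        write_all_now = len(audience) <= self.fanout_threshold
        for follower_id in audience:
            if write_all_now or follower_id in live_users:
                self.timelines[follower_id].add(tweet_id)
            else:
                self.deferred.append((follower_id, tweet_id))

    def drain_deferred(self, batch_size: int) -> int:
        """Background job: apply up to batch_size deferred timeline writes."""
        done = 0
        while self.deferred and done < batch_size:
            follower_id, tweet_id = self.deferred.popleft()
            self.timelines[follower_id].add(tweet_id)
            done += 1
        return done

    def read(self, user_id: int, limit: int = 20) -> List[int]:
        """Newest first, assuming tweet IDs grow with time."""
        return sorted(self.timelines[user_id], reverse=True)[:limit]
```

Sorting on read keeps deferred writes from appearing out of order when a background job delivers an older tweet after newer ones.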

Algorithm and Data Structures:

Algorithm: The timeline generation algorithm should consider:
Recency: Newer tweets should be ranked higher.
Relevance: Tweets from accounts the user interacts with more frequently should be ranked higher.
Popularity: Tweets with more likes, retweets, or replies can be considered more relevant.
Algorithm Options:
Simple reverse chronological ordering.
Weighted ranking based on the above factors.
Machine learning models to personalize the timeline further.
Data Structures:
Sorted Sets (e.g., Redis Sorted Sets): Store tweets with scores based on the ranking algorithm to efficiently fetch the top N tweets for a timeline.
Inverted Indexes: If you want to support real-time search within timelines, inverted indexes can help quickly find tweets matching specific keywords.
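A weighted ranking over the three factors above might look like the sketch below. The weights, the six-hour half-life, and the normalization constant are all illustrative assumptions, not tuned values. In a Redis-backed design the returned score would be what is stored as the sorted set score; here `heapq.nlargest` plays the role of fetching the top N.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    tweet_id: int
    age_seconds: float
    author_affinity: float  # 0..1, how often the viewer interacts with the author
    engagement: int         # likes + retweets + replies


# Assumed weights and decay; real values would come from experimentation.
W_RECENCY, W_RELEVANCE, W_POPULARITY = 0.5, 0.3, 0.2
HALF_LIFE_SECONDS = 6 * 3600
ENGAGEMENT_CAP = 1_000_000


def score(c: Candidate) -> float:
    """Combine recency, relevance, and popularity into one score in [0, 1]."""
    recency = 0.5 ** (c.age_seconds / HALF_LIFE_SECONDS)   # halves every 6 hours
    popularity = min(1.0, math.log1p(c.engagement) / math.log1p(ENGAGEMENT_CAP))
    return (W_RECENCY * recency
            + W_RELEVANCE * c.author_affinity
            + W_POPULARITY * popularity)


def top_n(candidates: Iterable[Candidate], n: int) -> List[Candidate]:
    """Return the n highest-scoring candidates, best first."""
    return heapq.nlargest(n, candidates, key=score)
```

Setting `W_RELEVANCE` and `W_POPULARITY` to zero reduces this to simple reverse chronological ordering, so the same code path can serve both algorithm options.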

2. Search Service

The Search Service allows users to search for tweets, users, and hashtags.

Scaling:

Distributed Search Index: Use a distributed search engine like Elasticsearch or Solr to handle large volumes of tweets and provide fast search results.
Sharding: Shard the index based on tweet IDs or other criteria to distribute the search load.
Caching: Cache popular search results to reduce the load on the search index.
Rate Limiting: Limit the number of search requests per user to prevent abuse.
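Per-user rate limiting is often done with a token bucket. The sketch below is one way to do it in process; the class names and default limits are hypothetical, and a multi-server deployment would need to keep the buckets in a shared store instead.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class SearchRateLimiter:
    """One bucket per user; limits are illustrative defaults."""

    def __init__(self, capacity: int = 10, refill_per_second: float = 1.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: Dict[int, TokenBucket] = {}

    def allow(self, user_id: int, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, current)
            self._buckets[user_id] = bucket
        return bucket.allow(current)
```

A request for which `allow` returns `False` would typically be answered with HTTP 429 Too Many Requests.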

Algorithm and Data Structures:

Algorithms:
Full-Text Search: Implement full-text search using techniques like tokenization, stemming, and stop-word removal.
Ranking: Rank search results based on relevance, recency, popularity, and user preferences (e.g., accounts the user follows).
Auto-Completion: Provide auto-completion suggestions for search queries to improve the user experience.
Data Structures:
Inverted Index: The core data structure used for full-text search. It maps words (terms) to the documents (tweets) containing them.
Trie: Used for efficient auto-completion of search queries.
Bloom Filters (Optional): Can be used to quickly check if a term exists in the index, potentially avoiding unnecessary disk lookups.
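To make the two core data structures concrete, here is a toy inverted index and a trie for auto-completion. This is a sketch only: the tokenizer and stop-word list are deliberately tiny, stemming is omitted, and a search engine such as Elasticsearch or Solr would provide all of this in practice.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}  # tiny illustrative list


def tokenize(text: str) -> List[str]:
    """Lowercase, split into terms, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9#@_]+", text.lower()) if t not in STOP_WORDS]


class InvertedIndex:
    """Maps each term to the set of tweet IDs that contain it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, tweet_id: int, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add(tweet_id)

    def search(self, query: str) -> List[int]:
        """Return IDs of tweets containing every query term, newest ID first."""
        terms = tokenize(query)
        if not terms:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(matches, reverse=True)


END = ""  # end-of-word marker; cannot collide with a one-character key


class Trie:
    """Prefix tree for query auto-completion."""

    def __init__(self) -> None:
        self._root: dict = {}

    def insert(self, word: str) -> None:
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def complete(self, prefix: str, limit: int = 5) -> List[str]:
        """Return up to `limit` stored words starting with prefix, in sorted order."""
        node = self._root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        results: List[str] = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if END in current:
                results.append(word)
            # Push children in reverse so they pop in alphabetical order.
            for ch in sorted((k for k in current if k != END), reverse=True):
                stack.append((current[ch], word + ch))
        return results
```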

Additional Considerations:

Real-time Updates: Consider how to keep the search index updated in near real-time as new tweets are created.
Relevance Tuning: Experiment with different ranking algorithms and weighting factors to improve search relevance.
Analytics: Track search metrics (e.g., popular queries, click-through rates) to gain insights and improve the search experience.

Trade offs/Tech choices

1. Database Choice:

Relational vs. NoSQL:
Trade-off: Relational databases (like PostgreSQL, MySQL) offer strong consistency and transactional guarantees, which are crucial for maintaining data integrity in social networking applications. However, they might not scale as easily as NoSQL databases (like MongoDB, Cassandra) for massive amounts of data and high write throughput.
Choice: We could initially choose a relational database for its ease of use and strong guarantees. As the service grows, we might consider migrating to a NoSQL database or using a hybrid approach (relational for structured data, NoSQL for unstructured data) to optimize performance.
Sharding vs. Replication:
Trade-off: Sharding distributes data across multiple servers, improving scalability but adding complexity. Replication creates copies of data, improving read performance and availability but potentially increasing write latency.
Choice: We would likely start with replication for better read performance and availability. As the dataset grows, we would introduce sharding to handle the increasing write load.

2. Caching:

Local vs. Distributed Cache:
Trade-off: Local in-process caches (held in each API server's own memory) offer the fastest access but are limited by the memory of a single server and are not shared between servers. Distributed caches (like Redis or Memcached, both of which also keep data in memory) can scale across multiple servers but add a network round trip to every lookup.
Choice: We could start with a single Redis instance for its simplicity and speed. If memory becomes a bottleneck, we would move to a clustered deployment that spreads keys across multiple nodes.
Cache Write Strategy:
Trade-off: Write-through caches are simpler and keep the cache consistent with the database, but every write pays the latency of both. Write-back caches offer better write performance but can lose data if the cache fails before pending writes reach the database.
Choice: A write-through cache would be a good starting point for its simplicity. We could explore write-back caches later if write performance becomes a major concern.
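The write-through option chosen above can be sketched in a few lines. `WriteThroughCache` is a hypothetical class in which plain dictionaries stand in for both the cache and the database.

```python
from typing import Dict, Optional


class WriteThroughCache:
    """Every write goes to the backing store and the cache in the same call."""

    def __init__(self, backing_store: Dict[str, str]) -> None:
        self._store = backing_store          # stands in for the database
        self._cache: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        # Write the durable copy first, so a failure never leaves the cache
        # holding data the database does not have.
        self._store[key] = value
        self._cache[key] = value

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]
        value = self._store.get(key)         # cache miss: fall back to the store
        if value is not None:
            self._cache[key] = value
        return value
```

A write-back variant would update only the cache in `put` and flush to the store later, which is where the risk of losing unflushed writes comes from.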

3. Search:

Real-time vs. Batch Updates:
Trade-off: Real-time updates provide the most up-to-date search results but require more frequent updates to the index, potentially impacting write performance. Batch updates are less frequent and easier to manage but can lead to slightly stale results.
Choice: We could start with batch updates for their simplicity and then gradually move towards near-real-time updates as the service grows.
In-house vs. Managed Service:
Trade-off: Running an in-house search solution (e.g., a self-managed Elasticsearch cluster) offers more customization and control but requires more development and maintenance effort. Managed services (e.g., Amazon OpenSearch Service, formerly Amazon Elasticsearch Service, or Algolia) are easier to set up and manage but might be less flexible.
Choice: A managed service would be a good starting point to get up and running quickly. If we need more customization or control, we could consider switching to an in-house solution later.

4. Timeline Generation Algorithm:

Reverse Chronological vs. Ranked:
Trade-off: Reverse chronological ordering is simple and easy to understand but might not show the most relevant tweets to users. Ranked timelines (based on relevance, recency, popularity, etc.) can improve user engagement but are more complex to implement and maintain.
Choice: We could start with a reverse chronological timeline for its simplicity. As the service matures and we collect more data on user behavior, we could introduce a ranked timeline algorithm to personalize the experience.

Other Trade-offs:

Consistency vs. Availability: We might need to choose between strong consistency (all users see the same data at the same time) and high availability (the system remains available even if some nodes fail).
Features vs. Time-to-Market: We might need to prioritize certain features over others to launch the service sooner and get early feedback from users.
Cost vs. Performance: We might need to balance the cost of infrastructure and services with the desired performance and scalability.


Failure scenarios/bottlenecks

1. Database:

Overloaded Database:
High traffic volumes, especially during peak hours or trending events, can overwhelm the database, leading to slow query responses, timeouts, and even crashes.
Mitigation:
Database sharding: Distribute data across multiple servers to scale horizontally.
Caching: Cache frequently accessed data to reduce database load.
Query optimization: Optimize database queries for better performance.
Data Corruption:
Software bugs, hardware failures, or human error can lead to data corruption in the database, affecting data integrity and availability.
Mitigation:
Regular backups: Create regular backups of the database to enable recovery in case of corruption.
Data validation and integrity checks: Implement mechanisms to validate data input and detect inconsistencies.
Single Point of Failure:
If the database server fails, the entire service becomes unavailable.
Mitigation:
Replication: Create multiple replicas of the database to ensure high availability.
Failover mechanisms: Implement automatic failover to a replica in case the primary database fails.

2. API Servers:

High Latency/Timeouts:
Overloaded API servers, network congestion, or inefficient code can result in slow response times or timeouts for API requests.
Mitigation:
Horizontal scaling: Add more API servers to handle increased traffic.
Load balancing: Distribute traffic evenly across API servers.
Code optimization: Optimize API server code for better performance.
Caching: Cache API responses where appropriate to reduce server load.
Server Crashes:
Software bugs, memory leaks, or other issues can cause API servers to crash, making the service unavailable for users connected to those servers.
Mitigation:
Monitoring and alerting: Monitor server health and set up alerts to detect potential issues before they cause crashes.
Load balancing: Distribute traffic so that a single server failure doesn't take down the entire service.
Auto-scaling: Automatically add more servers to handle increased traffic and replace failed servers.

3. Caching Layer:

Cache Misses:
If the requested data is not found in the cache, it needs to be fetched from the database, increasing latency.
Mitigation:
Cache warming: Preload frequently accessed data into the cache.
Optimize cache eviction policies: Choose the right cache eviction strategy (e.g., LRU, LFU) based on your access patterns.
Cache Stampedes:
When a cached item expires, multiple requests might try to fetch it from the database simultaneously, overloading the database.
Mitigation:
Cache locking: Use a locking mechanism to prevent multiple requests from updating the cache simultaneously.
Staggered cache expiration: Expire cached items at slightly different times to avoid simultaneous requests.
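Both stampede mitigations can be combined in one small cache wrapper: a per-key lock so that only one caller reloads an expired entry, and a randomized TTL so that entries written together do not expire together. This is a single-process sketch with hypothetical names; across several API servers the lock would have to live in a shared system.

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class StampedeSafeCache:
    def __init__(self, base_ttl: float = 60.0, jitter: float = 0.1) -> None:
        self.base_ttl = base_ttl
        self.jitter = jitter                                  # +/- fraction of base_ttl
        self._data: Dict[str, Tuple[float, Any]] = {}         # key -> (expires_at, value)
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()                        # protects _locks

    def _ttl(self) -> float:
        # Staggered expiration: each entry gets a slightly different lifetime.
        return self.base_ttl * (1 + random.uniform(-self.jitter, self.jitter))

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._data.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._lock_for(key):
            # Re-check: another thread may have reloaded while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader()                                  # only one caller gets here
            self._data[key] = (time.monotonic() + self._ttl(), value)
            return value
```

The second freshness check inside the lock is what turns many concurrent misses into a single database query.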

4. Network:

Network Congestion:
High network traffic or network infrastructure issues can lead to slow responses, timeouts, or even service unavailability.
Mitigation:
Network monitoring and optimization: Monitor network performance and optimize network infrastructure for better throughput.
Content Delivery Network (CDN): Use a CDN to distribute media files to users from servers closer to them.

5. Security:

Data Breaches:
Unauthorized access to user data can lead to data leaks, identity theft, and other security risks.
Mitigation:
Security best practices: Follow security best practices for authentication, authorization, input validation, and encryption.
Regular security audits: Conduct regular security audits to identify and address vulnerabilities.
Denial of Service (DoS) Attacks:
DoS attacks can overload servers and make the service unavailable for legitimate users.
Mitigation:
DDoS protection: Implement DDoS mitigation techniques, such as rate limiting, traffic filtering, and scrubbing.

Future improvements

1. Enhanced Personalization:

AI-Powered Recommendations: Utilize machine learning to suggest tweets, accounts, and topics based on individual user interests, engagement history, and social connections.
Adaptive Timeline Algorithm: Continuously refine the timeline algorithm to prioritize content that users find most valuable, taking into account their feedback (e.g., likes, retweets, replies, dwell time).
Customizable Feeds: Allow users to create multiple feeds focused on specific topics or interests, tailoring their experience even further.

2. Richer Content Experiences:

Longer Tweets: Expand the character limit or introduce a separate "long-form" tweet format for more in-depth discussions and storytelling.
Multimedia Integration: Enable seamless integration with other media platforms (e.g., YouTube, Spotify) to embed videos, music, and other rich content within tweets.
Collaborative Content Creation: Explore features that allow users to co-create tweets, threads, or even entire stories together.

3. Community Building and Engagement:

Groups/Communities: Facilitate the creation of dedicated spaces for users with shared interests, allowing for focused discussions and interactions.
Events and Live Spaces: Enable the hosting of virtual events, Q&As, and live audio/video spaces to foster real-time engagement and community building.
Gamification and Rewards: Introduce gamification elements (e.g., badges, points) and rewards for active participation and contribution to the community.

4. Monetization and Creator Support:

Tipping and Donations: Allow users to tip or donate to their favorite creators directly on the platform.
Subscriptions and Exclusive Content: Enable creators to offer premium subscriptions for exclusive content and experiences.
Ads and Sponsored Content: Introduce targeted advertising and sponsored content that aligns with user interests and respects their privacy.

5. Accessibility and Inclusivity:

Improved Accessibility Features: Implement comprehensive accessibility features for users with disabilities, such as screen reader compatibility, text-to-speech, and high contrast modes.
Multilingual Support: Expand language support to reach a wider global audience and foster cross-cultural communication.

6. Cutting-Edge Technologies:

Decentralization: Explore decentralized technologies (e.g., blockchain) to give users more control over their data and potentially reduce reliance on a central platform.
Web3 Integration: Integrate with Web3 technologies (e.g., NFTs) to enable new forms of digital ownership and interaction.
Augmented Reality (AR) and Virtual Reality (VR): Experiment with AR and VR experiences to create immersive social interactions and new forms of content consumption.

Additional Considerations:

Privacy and Data Protection: Prioritize user privacy by implementing robust data protection measures, transparent data policies, and user-friendly privacy controls.
Content Moderation: Develop effective content moderation strategies to address issues like misinformation, hate speech, and harassment, while respecting freedom of expression.
Ethical AI: Ensure that AI-powered features are developed and used ethically, with transparency and accountability.

By continuously innovating and adapting to the evolving needs of users, a Twitter-like service can continue to thrive as a vibrant platform for communication, community building, and content sharing.