My Solution for Design Twitter with Score: 7/10
by blaze1807
System requirements
Functional:
- Users must be able to create posts
- Users must be able to see the posts of people/users they are connected with
- Users must be able to create and manage connections with other users (follower/followee)
- Users must be able to refresh their feed
- Users must be able to search for posts, other users, etc.
- Users must be able to like and comment on other people's posts
- Users must receive notifications about activity on their posts
Non-Functional:
Availability
Requirement: Users must be able to access their accounts and the platform from anywhere in the world.
Strategies:
- Geographically Distributed Data Centers: Deploy multiple data centers across different geographical regions to ensure redundancy and availability close to users.
- Load Balancing: Implement global load balancers (like AWS Route 53) to route user requests to the nearest data center based on latency and availability.
- Health Checks and Failover: Set up health checks to route traffic away from unhealthy instances and enable automatic failover mechanisms.
Metrics:
- Uptime Percentage: Aim for 99.99% uptime.
- Latency Measurements: Monitor response times across different regions to ensure they remain under 1 second.
Scalability
Requirement: The system must be able to handle requests at high scale, potentially hundreds of millions of requests daily.
Strategies:
- Horizontal Scaling: Use microservices architecture to allow different services (e.g., Tweet, User, Notification) to scale independently. Use container orchestration (like Kubernetes) for dynamic scaling.
- Load Testing: Perform regular load testing with tools like Apache JMeter to understand how the system behaves under peak loads and adjust scaling policies accordingly.
Metrics:
- Requests per Second (RPS): Continuously monitor the number of requests handled per second, aiming for scalability to handle peak loads during events.
- CPU and Memory Utilization: Keep an eye on resource usage metrics and maintain thresholds for scaling to ensure responsiveness during heavy traffic.
Security
Requirement: The system should enforce access control and encrypt request transmission.
Strategies:
- OAuth 2.0 for Authentication: Implement OAuth 2.0 or OpenID Connect for secure user authentication and authorization mechanisms.
- Transport Layer Security (TLS): Use TLS for all data transmissions to encrypt data between the client and the server.
- Role-Based Access Control (RBAC): Ensure that user roles restrict access to sensitive actions and data.
Metrics:
- Successful Authentication Rate: Track the percentage of successful logins to identify potential issues with access.
- Incident Response Time: Measure how quickly security incidents are identified and responded to.
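To make the RBAC strategy above concrete, here is a minimal sketch of a role-check decorator. It is illustrative only; the role names and the `delete_tweet` handler are hypothetical, not a prescribed implementation.

```python
from functools import wraps

def require_roles(*allowed_roles):
    """Reject callers whose user record lacks one of the allowed roles."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(user, *args, **kwargs):
            if not set(allowed_roles) & set(user.get("roles", [])):
                raise PermissionError("user lacks a required role")
            return handler(user, *args, **kwargs)
        return wrapper
    return decorator

@require_roles("admin", "moderator")
def delete_tweet(user, tweet_id):
    # Placeholder for the actual delete logic (hypothetical handler).
    return f"tweet {tweet_id} deleted by {user['nickname']}"

print(delete_tweet({"nickname": "mod1", "roles": ["moderator"]}, "101"))
```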
Consistency
Requirement: Users must see the most recent data upon request.
Strategies:
- Eventual Consistency Model: Design the system to handle eventual consistency where real-time updates are not critical. This is particularly relevant for non-immediate reads like feeds.
- Optimistic Concurrency Control: Implement mechanisms to handle concurrent modifications gracefully, particularly in write-heavy scenarios.
Metrics:
- Staleness Time: Measure how long it takes for data to propagate through the system after a write occurs.
- User Feedback on Data Freshness: Periodically survey users to assess their perception of data freshness.
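To illustrate the optimistic concurrency control strategy above, a minimal compare-and-set sketch on a version number; the in-memory `store` dict is just a stand-in for a real datastore.

```python
# In-memory stand-in for a datastore; each record carries a version number.
store = {"tweet:101": {"likes": 10, "version": 3}}

class VersionConflict(Exception):
    """Raised when the record changed between the read and the write."""

def update_likes(key, expected_version, new_likes):
    record = store[key]
    if record["version"] != expected_version:
        # Someone else wrote first: the caller re-reads and retries.
        raise VersionConflict(f"{key} modified since it was read")
    record["likes"] = new_likes
    record["version"] += 1
    return record

current = store["tweet:101"]
update_likes("tweet:101", current["version"], current["likes"] + 1)
print(store["tweet:101"])  # {'likes': 11, 'version': 4}
```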
Speed/Performance
Requirement: Users must see their feed and search results within around 500ms.
Strategies:
- Caching Strategies: Utilize in-memory caching (like Redis) for frequently accessed data (e.g., user feeds) to reduce database hits.
- Indexing in Database: Implement indexing on frequently queried fields to speed up data retrieval operations.
Metrics:
- Response Time Tracking: Continuously monitor the average response time for critical API endpoints, aiming for responses to be under 500ms.
- Throughput Ratio: Measure the ratio of read-to-write operations (targeting 50:1) to ensure the system remains optimized for read-heavy workloads.
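As an illustration of the caching strategy above, a minimal cache-aside sketch, assuming the redis-py client and a hypothetical `fetch_feed_from_db` helper:

```python
import json
import redis  # assumes the redis-py client and a reachable Redis instance

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def fetch_feed_from_db(user_id):
    # Hypothetical placeholder for the real feed query.
    return [{"tweet_id": "101", "content": "Learning API design!"}]

def get_feed(user_id, ttl_seconds=60):
    """Cache-aside read: try Redis first, fall back to the database on a miss."""
    key = f"feed:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    feed = fetch_feed_from_db(user_id)
    r.setex(key, ttl_seconds, json.dumps(feed))  # short TTL keeps feeds reasonably fresh
    return feed
```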
Compliance with Regulations
Requirement: The application should comply with applicable regulations (e.g., GDPR).
Strategies:
- Data Residency and Protection: Implement features for data localization, ensuring user data remains within specified regions as per legal requirements.
- Regular Audits and Legal Reviews: Conduct regular compliance audits to ensure adherence to legal requirements, including user data access and privacy controls.
Metrics:
- Audit Frequency: Schedule and document audits on a quarterly basis, or as dictated by regulation.
- Compliance Issue Tracking: Maintain a record of compliance issues and response plans with timelines for remediation.
Capacity estimation
Assumptions:
- 30% of posts include some form of media, averaging 1 MB per file
- Average read request payload size: 500 bytes
- Average write request payload size: 1.2 KB
- Requests per second: 10,000
Derived traffic:
- Requests/hour: 36 million
- Requests/day: ~864 million
- Requests/month: ~26 billion
- Requests/year: ~315 billion, rising toward ~330-340 billion if we assume a 5-8% increase in application usage over the year
With a 1:50 write-to-read ratio, that works out to roughly 510 million write requests per month, of which about 150 million (30%) carry a media attachment and about 360 million are simple text writes. The arithmetic is sketched below.
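A rough back-of-the-envelope sketch of the arithmetic behind these figures (rounding freely):

```python
RPS = 10_000                                   # assumed average request rate
requests_per_hour = RPS * 3_600                # 36 million
requests_per_day = requests_per_hour * 24      # ~864 million
requests_per_month = requests_per_day * 30     # ~26 billion
requests_per_year = requests_per_day * 365     # ~315 billion before growth

write_share = 1 / 51                           # 1 write for every 50 reads
writes_per_month = requests_per_month * write_share   # ~510 million
media_writes = writes_per_month * 0.30                 # ~150 million, ~1 MB each
media_storage_tb = media_writes / 1_000_000            # ~150 TB of new media per month

print(f"{requests_per_day / 1e6:.0f}M requests/day, "
      f"{writes_per_month / 1e6:.0f}M writes/month, "
      f"~{media_storage_tb:.0f} TB media/month")
```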
API design
1. User Management APIs
- Create User
  - Endpoint: POST /api/users
  - Request Body:
    {
      "nickname": "user123",
      "email": "[email protected]",
      "dob": "1990-01-01",
      "first_name": "John",
      "last_name": "Doe",
      "gender": "male"
    }
  - Response: 201 Created with user information.
- Get User Profile
  - Endpoint: GET /api/users/{user_id}
  - Response:
    {
      "user_id": "123",
      "nickname": "user123",
      "email": "[email protected]",
      "dob": "1990-01-01",
      "first_name": "John",
      "last_name": "Doe",
      "gender": "male"
    }
- Update User Profile
  - Endpoint: PUT /api/users/{user_id}
  - Request Body:
    {
      "nickname": "new_nickname",
      "email": "[email protected]",
      "dob": "1991-01-01",
      "first_name": "Jane",
      "last_name": "Doe",
      "gender": "female"
    }
2. Tweet Management APIs
- Create Tweet
  - Endpoint: POST /api/tweets
  - Request Body:
    {
      "user_id": "123",
      "tweet_url": "http://example.com/media.jpg",
      "hashtags": ["#fun", "#twitter"],
      "tagged_users_id": ["456", "789"]
    }
  - Response: 201 Created with tweet information including tweet_id.
- Get Tweet
  - Endpoint: GET /api/tweets/{tweet_id}
  - Response:
    {
      "tweet_id": "101",
      "user_id": "123",
      "tweet_url": "http://example.com/media.jpg",
      "number_of_likes": 10,
      "hashtags": ["#fun", "#twitter"],
      "tagged_users_id": ["456"]
    }
- Search Tweets
  - Endpoint: GET /api/tweets/search
  - Query Parameters: query (string for text search), user_id (optional)
  - Response:
    [
      {
        "tweet_id": "101",
        "user_id": "123",
        "content": "Learning API design!",
        "timestamp": "2023-01-01T12:00:00Z"
      }
    ]
3. Favorites Management APIs
- Like a Tweet
  - Endpoint: POST /api/favorites
  - Request Body:
    {
      "user_id": "123",
      "liked_tweet_id": "101"
    }
  - Response: 201 Created with confirmation of the liked tweet.
- Get Favorites
  - Endpoint: GET /api/users/{user_id}/favorites
  - Response:
    [
      {
        "liked_tweet_id": "101",
        "timestamp": "2023-01-01T12:00:00Z"
      }
    ]
4. Follow Management APIs
- Follow a User
  - Endpoint: POST /api/follows
  - Request Body:
    {
      "follower_id": "123",
      "followee_id": "456"
    }
  - Response: 201 Created indicating a successful follow.
- Get Followers
  - Endpoint: GET /api/users/{user_id}/followers
  - Response:
    [
      {
        "follower_id": "789",
        "timestamp": "2023-01-01T12:00:00Z"
      }
    ]
- Get Following Users
  - Endpoint: GET /api/users/{user_id}/following
  - Response:
    [
      {
        "followee_id": "456"
      }
    ]
5. Hashtag APIs
- Get Hashtags
  - Endpoint: GET /api/hashtags
  - Response:
    [
      {
        "hashtag_id": "1",
        "hashtag_content": "#fun"
      }
    ]
High-level design
Request flows
Detailed component design
In this instance, let's focus on the core functionalities; more specifically, Tweet Management, Likes/Favorites Management, and Follower/Followee Management.
Tweet Management:
This component handles the creation, retrieval, and management of tweets. It allows users to submit and retrieve tweets and/or search for tweets based on criteria or attributes.
Scalability and Optimization:
Horizontal scaling would be the preferred scaling method; through partitioning we can enhance retrieval speed and minimize latency. Moreover, we can cache frequently accessed tweets using services like Redis.
Storage Structure:
Database: We'll use a NoSQL database to keep storage and retrieval in line with OLTP workloads.
In-Memory Cache: Use a hash table for quick data access, with tweet IDs as keys to retrieve tweet information.
Indexing: Use Elasticsearch to optimize retrieval based on keywords or hashtags (a search sketch follows).
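A sketch of the keyword/hashtag search path, assuming the elasticsearch-py 8.x client; the `tweets` index and field names are illustrative, not a fixed schema.

```python
from elasticsearch import Elasticsearch  # assumes the elasticsearch-py 8.x client

es = Elasticsearch("http://localhost:9200")

def index_tweet(tweet):
    # Store only the searchable fields of a tweet in the (illustrative) 'tweets' index.
    es.index(index="tweets", id=tweet["tweet_id"], document={
        "user_id": tweet["user_id"],
        "content": tweet.get("content", ""),
        "hashtags": tweet.get("hashtags", []),
    })

def search_tweets(text):
    # Full-text match across content and hashtags; returns the stored documents.
    resp = es.search(index="tweets", query={
        "multi_match": {"query": text, "fields": ["content", "hashtags"]}
    })
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```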
Likes/Favorites Management
This component manages the functionality for users to like a tweet; it stores the relationship between users and the tweets they have liked.
Scalability:
This feature is write-heavy, so we can use append-only logs in combination with batch writes to improve write performance while lowering transaction costs. Additionally, this component can be partitioned on user IDs and tweet IDs to optimize both write and read transactions.
Data Structures:
This use case is well suited to a relational database, as it maps a many-to-many relationship between the two entities, consisting of user_id and liked_tweet_id.
In-Memory Structure: Considering the nature of the relationship, we could map it as either a set or a tuple for quick retrieval.
Batch processing would be required here to minimize cost, so the best option is to store recent write operations in a queue and process them as a batch at periodic intervals (sketched below).
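A minimal sketch of the batching idea, using an in-process queue and a periodic flush; in a real deployment the queue would be a durable log (e.g., Kafka) rather than process memory, and `write_likes_to_db` is a hypothetical bulk-insert helper.

```python
import queue
import threading

like_queue = queue.Queue()  # stand-in for a durable append-only log

def record_like(user_id, tweet_id):
    # The write path is cheap: append the event and return immediately.
    like_queue.put((user_id, tweet_id))

def write_likes_to_db(batch):
    # Hypothetical bulk insert into the likes table.
    print(f"bulk inserting {len(batch)} likes")

def flush_likes(batch_size=500):
    """Drain queued likes and persist them in a single batched write."""
    batch = []
    while not like_queue.empty() and len(batch) < batch_size:
        batch.append(like_queue.get())
    if batch:
        write_likes_to_db(batch)

def start_flusher(interval_seconds=5):
    # Re-run the flush on a fixed interval instead of writing one row per like.
    flush_likes()
    threading.Timer(interval_seconds, start_flusher, args=(interval_seconds,)).start()
```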
Follower/Followee Management:
This component maintains a relationship table that tracks which user follows which other users.
Scalability:
- Directed Graph Representation: Model the following relationships as a directed graph where nodes are users and edges represent following relationships. This allows easy navigability.
- Batch Operations: Handle bulk follow/unfollow requests efficiently during a user's actions.
Data Structures:
- Adjacency List/Matrix: Use an adjacency list or matrix to represent user relationships, allowing efficient querying of followers and followees.
- In-memory Cache: Maintain an in-memory representation of follow relationships for fast access.
Algorithms:
- Graph Traversal Algorithms: Use Depth-First Search (DFS) or Breadth-First Search (BFS) to explore followers for user recommendations and related functionalities.
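A sketch of the adjacency-list representation with a BFS-based friends-of-friends lookup; the sample data and the `recommend` helper are illustrative.

```python
from collections import deque

# user_id -> set of user_ids they follow (adjacency list of the directed graph)
following = {
    "123": {"456", "789"},
    "456": {"999"},
    "789": {"456", "111"},
}

def recommend(user_id, max_depth=2):
    """BFS out to friends-of-friends, skipping people the user already follows."""
    already_following = following.get(user_id, set())
    seen = {user_id}
    suggestions = set()
    frontier = deque([(user_id, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for neighbor in following.get(current, set()):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            frontier.append((neighbor, depth + 1))
            if depth + 1 > 1 and neighbor not in already_following:
                suggestions.add(neighbor)
    return suggestions

print(recommend("123"))  # e.g. {'999', '111'}
```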
Trade offs/Tech choices
Choice of Database
We chose NoSQL. A SQL database would provide the structure required for the complex queries our analytical service needs, but the volume of read requests in our system would make managing cost and complexity quite challenging. NoSQL databases, on the other hand, are better suited to online transactional workloads and to horizontal scaling.
Caching Strategy
Using an in-memory cache can significantly boost performance and reduce latency; maintaining cache consistency, however, poses some challenges, especially when it comes to data integrity. By using Redis, our ability to horizontally scale the in-memory cache increases significantly.
Batch Processing For Write Transactions
Real-time updates allow immediate reflection of likes in the UI, but can lead to increased write loads on the database. Batch processing can delay updates but decreases write frequency.
By leveraging NoSQL databases, in-memory caching, and graph-based representations for dynamic user relationships, we focus on creating a responsive and reliable platform. However, each decision also comes with potential risks, whether that involves managing cache coherence or ensuring that the database schema remains adaptable.
In a system like this, the trade-offs affect usability for end users while also allowing for future growth and feature expansion. Continuous monitoring and iteration based on usage patterns would also be crucial to adjust as user engagement evolves.
Failure scenarios/bottlenecks
Server Failures
- Scenario: One or more instances of application servers become unavailable due to hardware failure or crash.
- Impact: Users experience downtime or delayed requests.
- Mitigation: Implement load balancing with health checks to reroute traffic to healthy servers. Use redundant infrastructure with auto-scaling to shift the load.
Database Failures
- Scenario: A primary database becomes unresponsive or corrupt.
- Impact: Write operations fail, leading to the inability to create tweets or update user interactions.
- Mitigation: Deploy active-active or active-passive database replicas with automatic failover mechanisms. Regular backups ensure data recovery.
Cache Stampede
- Scenario: Multiple requests for the same data lead to cache misses, overwhelming the backend.
- Impact: Causes increased load on the database and can lead to timeouts.
- Mitigation: Utilize request coalescing strategies, where only one request for a given data point is allowed to fill the cache, while others wait. Introduce cache jittering to smoothen access patterns.
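A minimal single-process sketch of the request-coalescing idea with a per-key lock; a distributed deployment would coordinate through something like a Redis lock or a singleflight library instead.

```python
import threading

cache = {}
_key_locks = {}
_registry_lock = threading.Lock()

def _lock_for(key):
    # One lock per cache key, created lazily.
    with _registry_lock:
        return _key_locks.setdefault(key, threading.Lock())

def get_with_coalescing(key, load_from_db):
    """Only one caller per key refills the cache; concurrent callers wait, then reuse it."""
    value = cache.get(key)
    if value is not None:
        return value
    with _lock_for(key):
        # Re-check: another request may have filled the cache while we waited.
        value = cache.get(key)
        if value is None:
            value = load_from_db(key)
            cache[key] = value
        return value
```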
DDoS Attacks
- Scenario: A denial-of-service attack overwhelms the system with traffic, causing downtime.
- Impact: Legitimate users can experience significant service degradation or failure to connect entirely.
- Mitigation: Use rate limiting, IP blacklisting, and traffic filtering to thwart DDoS attacks. Deploy Web Application Firewalls (WAF) to detect and block malicious traffic.
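A sketch of per-client rate limiting with a token bucket; in practice this would be enforced at the edge (API gateway or WAF), but the core logic looks roughly like this, with illustrative rate and burst values.

```python
import time

class TokenBucket:
    """Allow an initial burst, then refill at `rate` tokens (requests) per second."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # client IP -> its bucket

def allow_request(client_ip):
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=10, burst=20))
    return bucket.allow()
```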
Network Latency
- Scenario: High network latency disrupts communication between clients and servers.
- Impact: Users face long wait times for actions like fetching their home feed or liking tweets.
- Mitigation: Use content delivery networks (CDNs) for static content, implement caching strategies, and optimize API responses to minimize payload size.
Future improvements
Advertisement and Business Profiles:
- Implement a feature to enable business profiles to advertise their merchandise, events, or products to targeted audiences based on selected trends
Third-Party API Integration
- Integrate with third-party services for additional functionalities (e.g., payment systems for premium features) or external content-sharing options (e.g., Reddit or Instagram).
Advanced Analytics and ML Recommendation:
- Implement features that leverage machine learning to personalize content based on users' activities; this feature should also feed into ad campaigns.
Enhanced Security Protocols:
- Regularly update security measures, including multi-factor authentication (MFA) and end-to-end encryption for sensitive user data.
Mitigation Strategies For Failure Scenarios
Server Failures
- Employ multi-region deployments for critical services. Use automated failover systems to switch to backup servers seamlessly if primary ones fail. Leverage container orchestration platforms (like Kubernetes) for efficient resource distribution and quick recovery.
Security
- Use rate limiting along with security event monitoring systems. Invest in continuous integration/continuous deployment (CI/CD) pipelines that automate testing, reduce the likelihood of bugs reaching production, and ensure quicker deployment of fixes.
Database Failures
- Use a multi-database strategy (operational and analytical databases) to segregate workloads and avoid resource contention.
Traffic Spikes
- Introduce auto-scaling groups that can rapidly adjust resources during peak times.