My Solution for Design Twitter with Score: 8/10
by quantum_vortex687
System requirements
Functional:
- Users can post a tweet (up to a certain character limit, typically 140 characters).
- Users can follow other users.
- Users can view tweets from the accounts they follow in their feed.
- Users can like tweets.
Non-Functional:
Let's summarise our non-functional requirements first before moving on to technology choices and working out our capacity estimation.
Non-Functional Requirements
- Scalability: The system should handle up to 500 million Daily Active Users, with the ability to scale efficiently like Twitter.
- Availability: Maximum availability is needed, with strategies such as:
- Redundant systems to ensure continuous service.
- Use of Content Delivery Networks (CDNs) for static content.
- Geo-distributed data centers to minimize outages.
- Latency: Aim for a response time of under 500 milliseconds for displaying tweets.
- Security: Implement measures for user authentication and authorization to ensure that only legitimate users can access the system features.
Capacity estimation
- Daily Tweets:
- Daily Tweet Volume: 500 million DAU * 2 tweets/user = 1 billion tweets per day.
- Views of Tweets:
- Daily Views: 500 million DAU * 100 views/user = 50 billion views per day.
- Storage Requirements:
- Daily Storage: 1 billion tweets * 512 bytes = 512 GB of data per day.
- Peak Active Users:
- Peak Load: 500 million DAU * 20% = 100 million active users during peak times.
With roughly 50 reads for every write, our system is clearly READ heavy, which shapes the rest of the design: caching and read scaling matter more than raw write throughput (see the quick calculation below).
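A quick back-of-envelope script makes the read-heavy shape explicit. The constants mirror the estimates above; the per-second figures assume traffic is spread evenly over the day, which real traffic is not:

```python
# Back-of-envelope capacity check (numbers from the estimates above, illustrative only).
DAU = 500_000_000
TWEETS_PER_USER_PER_DAY = 2
VIEWS_PER_USER_PER_DAY = 100
AVG_TWEET_SIZE_BYTES = 512

daily_tweets = DAU * TWEETS_PER_USER_PER_DAY                   # 1 billion writes/day
daily_views = DAU * VIEWS_PER_USER_PER_DAY                     # 50 billion reads/day
daily_storage_gb = daily_tweets * AVG_TWEET_SIZE_BYTES / 1e9   # ~512 GB/day

read_write_ratio = daily_views / daily_tweets                  # 50:1 -> read-heavy
avg_writes_per_sec = daily_tweets / 86_400                     # ~11.6k tweets/s
avg_reads_per_sec = daily_views / 86_400                       # ~579k views/s

print(f"storage/day: {daily_storage_gb:.0f} GB, read:write = {read_write_ratio:.0f}:1")
print(f"avg writes/s: {avg_writes_per_sec:,.0f}, avg reads/s: {avg_reads_per_sec:,.0f}")
```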
API design
1. RegistrationAPI
- POST /register → Create a new user entry in the database.
- Statelessness is preserved: The client sends the registration payload; the server responds independently without remembering previous client interactions.
2. LoginAPI
- POST /login → User sends credentials, server authenticates, and returns a JWT token.
- Using JWT is a smart move because the token is self-contained (no server-side session needed) — perfect for scaling horizontally and working with load balancers (a minimal token sketch follows the API list below).
3. TweetAPI (CRUD for tweets)
- POST /tweets → Create a tweet.
- GET /tweets/{tweet_id} → Retrieve a specific tweet.
- GET /timeline → Retrieve timeline (could include batch fetching tweets user follows).
- PUT /tweets/{tweet_id} → Update/edit a tweet.
- DELETE /tweets/{tweet_id} → Delete a tweet.
Retrieving multiple tweets will need to be optimised, likely with pagination (e.g., ?limit=20&offset=40); a paging sketch follows the API list below.
4. TwitterInteractionAPI
- POST /tweets/{tweet_id}/like → Like a tweet.
- DELETE /tweets/{tweet_id}/like → Unlike a tweet.
- POST /users/{user_id}/follow → Follow a user.
- DELETE /users/{user_id}/follow → Unfollow a user.
- Separating interactions into their own API keeps concerns modular and scalable (an illustrative like/unlike sketch also follows below).
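To make the stateless LoginAPI concrete, here is a minimal sketch of issuing and verifying a self-contained token, assuming the PyJWT library and an HS256 shared secret (the secret handling and function names are illustrative, not part of the design above):

```python
import datetime
import jwt  # PyJWT, assumed here purely for illustration

SECRET = "replace-with-a-real-secret"  # in practice, loaded from a secrets manager

def issue_token(user_id: str) -> str:
    """Issue a short-lived, self-contained JWT once credentials have been verified."""
    now = datetime.datetime.now(datetime.timezone.utc)
    payload = {"sub": user_id, "iat": now, "exp": now + datetime.timedelta(hours=1)}
    return jwt.encode(payload, SECRET, algorithm="HS256")

def verify_token(token: str) -> str:
    """Any stateless API instance can validate the token without a session store."""
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])  # raises if expired or tampered
    return claims["sub"]
```

Because no session lives on the server, any replica behind the load balancer can verify the token, which is exactly the horizontal-scaling property called out above.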
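For the timeline pagination, a sketch of a GET /timeline handler with ?limit=&offset= parameters, assuming FastAPI and an in-memory stand-in for the tweet store (both assumptions, purely to illustrate the API shape):

```python
from fastapi import FastAPI, Query

app = FastAPI()

# Hypothetical in-memory stand-in for the timeline store, newest tweet first.
FAKE_TIMELINE = [{"tweet_id": i, "text": f"tweet {i}"} for i in range(1000, 0, -1)]

@app.get("/timeline")
def get_timeline(limit: int = Query(20, ge=1, le=100), offset: int = Query(0, ge=0)):
    """Return one page of the caller's timeline; clients page with ?limit=20&offset=40."""
    page = FAKE_TIMELINE[offset : offset + limit]
    return {"tweets": page, "limit": limit, "offset": offset}
```

Offset pagination is the simplest option; a cursor-based scheme (paging by tweet id or timestamp) tends to behave better for fast-moving feeds, but either satisfies the requirement.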
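And for the interaction endpoints, one possible backing for like/unlike, sketched here with Redis sets purely for illustration (the key naming and the choice of Redis for this data are assumptions, not part of the database design below):

```python
import redis

r = redis.Redis()  # assumed local instance, for illustration only

def like(tweet_id: str, user_id: str) -> None:
    # SADD is idempotent: liking twice leaves one entry, matching POST .../like semantics.
    r.sadd(f"tweet:{tweet_id}:likes", user_id)

def unlike(tweet_id: str, user_id: str) -> None:
    # SREM is idempotent too, matching DELETE .../like semantics.
    r.srem(f"tweet:{tweet_id}:likes", user_id)

def like_count(tweet_id: str) -> int:
    return r.scard(f"tweet:{tweet_id}:likes")
```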
Database design
Database System:
- Choice: Use Cassandra for its horizontal scalability and its ability to handle high volumes of reads and writes efficiently. Core tables: Users, Tweets, Followers (a possible table layout is sketched below).
- Architecture: Cassandra is natively masterless (peer-to-peer), so rather than a literal master/slave split the same goal is achieved through replication:
- Writes (tweets, user actions) are spread across nodes by the partitioner.
- Reads can be served by any replica, giving the effect of read replicas absorbing the read traffic.
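A possible layout for the core tables, expressed through the DataStax Python driver; the keyspace name, column choices, and partitioning are assumptions that match the Users/Tweets/Followers tables above, not a prescribed schema:

```python
from cassandra.cluster import Cluster  # DataStax Python driver for Cassandra

cluster = Cluster(["127.0.0.1"])        # illustrative single local node
session = cluster.connect("twitter")    # assumes a 'twitter' keyspace already exists

# Tweets partitioned by author and clustered newest-first, so "all tweets by a user"
# is a single-partition read that Cassandra serves efficiently.
session.execute("""
    CREATE TABLE IF NOT EXISTS tweets_by_user (
        user_id    uuid,
        created_at timeuuid,
        tweet_id   uuid,
        body       text,
        PRIMARY KEY ((user_id), created_at)
    ) WITH CLUSTERING ORDER BY (created_at DESC)
""")

# Followers of a user live together in one partition for cheap fan-out lookups.
session.execute("""
    CREATE TABLE IF NOT EXISTS followers (
        user_id     uuid,
        follower_id uuid,
        PRIMARY KEY ((user_id), follower_id)
    )
""")
```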
Consistency:
- Eventual Consistency: Accepting eventual consistency is a good fit for a social media platform like Twitter. It allows for high availability and scalability while providing acceptable user experiences, even when updates might not be immediately reflected across all users simultaneously.
Caching Layer (protects the DB; reading from memory is faster than reading from disk):
- Caching System: Use Redis to cache frequently accessed tweets.
- Eviction Policy: Implement a Least-Frequently Used (LFU) policy to ensure that only the most accessed tweets remain in the cache.
- TTL (Time To Live): Apply a TTL to cached items to invalidate old or less relevant tweets.
- Prevent Cache Stampede: Use a mutex on the cache key so that multiple concurrent requests cannot overwhelm the backend when a cache entry is missed (see the cache-aside sketch below).
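A minimal cache-aside sketch showing the TTL and the per-key mutex together, assuming the redis-py client; `load_tweet_from_cassandra` is a hypothetical helper standing in for the database read:

```python
import json
import time
import redis

r = redis.Redis()  # assumed local instance, for illustration only

def get_tweet(tweet_id: str) -> dict:
    """Cache-aside read with a per-key mutex to avoid a cache stampede."""
    cache_key = f"tweet:{tweet_id}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    lock_key = f"lock:{cache_key}"
    # Only one caller wins the lock (SET NX with a short expiry); the rest back off and retry.
    if r.set(lock_key, "1", nx=True, ex=5):
        try:
            tweet = load_tweet_from_cassandra(tweet_id)   # hypothetical DB read
            r.set(cache_key, json.dumps(tweet), ex=3600)  # TTL invalidates stale entries
            return tweet
        finally:
            r.delete(lock_key)

    time.sleep(0.05)  # brief backoff, then re-check the cache
    return get_tweet(tweet_id)
```

The LFU eviction itself is server-side configuration (e.g. `maxmemory-policy allkeys-lfu`) rather than application code.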
High-level design
I'll assume users are already authenticated and actively using Twitter.
- The Client (User Interface) sends a request to view the timeline.
- The CDN serves static assets such as video files attached to tweets.
- The API Gateway, acting as a Layer 4 load balancer, routes the request and authenticates/authorises the user. A rate limiter is applied at this point to protect the API endpoints.
- The Twitter Interaction API checks the Cache Layer (using Redis with LFU).
- If the data is present in the cache, it's returned to the client. If not, the request is forwarded to the Database Layer (Cassandra).
- Cassandra retrieves the timeline data based on the user's subscriptions (assumed to be maintained by a separate service). The cache is updated asynchronously after the DB hit.
- The response is sent back to the API Gateway, which then returns it to the Client.
Request flows
- The client interacts with the system.
- The CDN serves our static assets.
- The request is directed to the API Gateway (load balancing, authentication/authorisation, and rate limiting before forwarding to the service that hosts the APIs).
- The API checks the cache; if the key exists, the cached data is returned, otherwise Cassandra is queried and the cache is repopulated.
Detailed component design
- API Gateway
- An L4 (Transport Layer) load balancer with a round-robin algorithm dynamically distributes requests across the horizontally replicated servers that host the API (a minimal round-robin sketch follows this list).
- Authenticate/Authorise User based on JWT.
- Rate Limit - Protect APIs
- CDN
- Serve Static Assets closer to the user to reduce latency.
- Redis
- Use Redis Cluster to handle in-memory data cache. Can be horizontally scaled to handle load.
- Cassandra
- Distributed Database. Can be horizontally scaled easily.
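A real L4 load balancer performs this rotation in the network path, but a toy sketch of round robin itself may help illustrate the idea (the backend addresses are made up):

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin selection over API server instances (illustrative only)."""

    def __init__(self, backends: list[str]):
        self._cycle = itertools.cycle(backends)

    def next_backend(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
assert [lb.next_backend() for _ in range(4)] == [
    "10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080", "10.0.0.1:8080",
]
```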
Trade offs/Tech choices
Database Choice: Cassandra vs. Relational Databases
- Trade-Off: Choosing Cassandra, a NoSQL database, offers a horizontally scalable solution suitable for high write and read operations. However, it sacrifices strong consistency in favour of availability and partition tolerance (AP in the CAP theorem).
- Reason: Given the nature of the application where a vast number of users are likely to be posting tweets simultaneously, Cassandra's ability to handle large volumes of write requests efficiently was prioritized to maintain performance.
Eventual Consistency vs. Strong Consistency
- Trade-Off: The decision to adopt eventual consistency allows the system to provide high availability and better performance across distributed nodes. It may lead to situations where users do not see the latest updates immediately.
- Reason: Users of social media platforms typically tolerate slight delays in updates. The choice helps maintain a responsive system, especially during peak loads.
Caching Strategy: Redis with LFU
- Trade-Off: Implementing a caching layer using Redis with an LFU eviction policy ensures that frequent requests can be served quickly. However, there is complexity introduced in managing the cache and the potential for stale data.
- Reason: Speed is crucial for user experience in a real-time application like Twitter. Using Redis helps mitigate database load by caching popular tweets, reducing latency for users.
Load Balancer / API Gateway
- Trade-Off: Relying on a Layer 4 Load Balancer abstracts network routing at the transport layer instead of the application layer, limiting the ability to perform some API-level decisions or application-level metrics.
- Reason: The focus was on ensuring efficient handling of simple routing and load balancing for high traffic volumes without overcomplicating the architecture at the initial stage.
Authentication with JWT
- Trade-Off: Using JSON Web Tokens (JWT) for authentication enables stateless sessions, improving scalability since there's no need to manage user sessions on the server. However, it complicates the handling of token expiration and revocation.
- Reason: In a high-traffic environment, maintaining server-side sessions can become a bottleneck. JWT provides a lightweight mechanism for securely passing user information.
Service Decomposition vs. Monolithic Architecture
- Trade-Off: Adopting a microservices architecture means more complexity, with services needing to communicate over the network, potentially increasing latency and troubleshooting challenges.
- Reason: This architecture promotes scalability and flexibility, allowing each service to be developed and deployed independently. This is crucial for adapting to varying loads and evolving features.
Conclusion
In summary, every technology choice and architectural decision is accompanied by trade-offs concerning consistency, availability, performance, and complexity. The design choices made aim to create a scalable, efficient, and user-friendly platform that meets the needs of a social media application.
Failure scenarios/bottlenecks
1. Cache Misses
- Scenario: If a large percentage of users try to access newly created or rarely accessed content, the cache might miss and cause a heavy load on the database (e.g., a cache stampede).
- Bottleneck: The database may become overwhelmed with queries, leading to increased response times and potentially downtime.
- Mitigation: Use an LFU cache alongside a TTL with Redis. Add a lock on a missed cache key so that only one request hits the database and repopulates the cache (as sketched in the caching section above).
2. Database Overload
- Scenario: During peak usage times, if many users are posting tweets or retrieving timelines simultaneously, the database may become a bottleneck.
- Bottleneck: High read and write concurrency could lead to increased latency or failures in database transactions.
- Mitigation: Design the database schema for horizontal scaling and optimize read/write paths. Use sharding to distribute the load across multiple database instances (a simple hash-based sharding sketch follows this list).
3. Network Latency
- Scenario: If network conditions fluctuate, requests and responses between services (Client, API Gateway, and Database) can be delayed.
- Bottleneck: This can lead to increased latency for users and poor response times for requests.
- Mitigation: Consider deploying services closer to end-users (geo-distribution) and use Content Delivery Networks (CDNs) for static resources.
4. Service Downtime
- Scenario: If any microservices (e.g., Tweet Service, User Service) become unavailable due to crashes or maintenance, the corresponding functionality would be impaired.
- Bottleneck: This may lead to partial outages where users can’t post tweets, view timelines, or interact with tweets.
- Mitigation: Implement health checks and auto-scaling mechanisms. Incorporate redundancy for critical services and design the system to gracefully degrade.
5. Rate Limiting Issues
- Scenario: Users generating excessive requests (for instance, rapidly liking tweets) could trigger rate limiting, preventing legitimate interactions.
- Bottleneck: Rate limiting might result in frustrated users unable to perform actions they expect.
- Mitigation: Adjust rate limits based on observed user behaviour patterns while still allowing a certain number of burst requests (see the token-bucket sketch after this list).
6. Ineffective Load Balancing
- Scenario: If the Load Balancer does not distribute requests evenly across services, some instances may become overloaded while others are underutilized.
- Bottleneck: This imbalance could lead to performance degradation and slow response times.
- Mitigation: Continuously monitor traffic patterns and periodically rebalance services based on load.
7. Error Handling and User Experience
- Scenario: Users encounter errors when performing actions (e.g., posting or liking tweets).
- Bottleneck: If errors aren't handled gracefully or logged appropriately, it can lead to confusion, more support requests, and a poor user experience.
- Mitigation: Incorporate robust error handling, provide user-friendly error messages, and log errors for analysis.
8. Content Moderation Bottlenecks
- Scenario: If moderation for user-generated content (tweets) becomes a bottleneck, it could lead to content violations or abuse going unchecked.
- Bottleneck: Increased flagged content may overwhelm moderation systems, affecting platform reputation.
- Mitigation: Use automated moderation tools along with human reviewers, set up alert systems for recurrent violations, and implement user reporting features.
9. Data Consistency Issues
- Scenario: In a system that allows eventual consistency, users may not see the latest updates from individuals they follow immediately.
- Bottleneck: This could confuse users, as they might think their interactions (likes or retweets) didn't go through.
- Mitigation: Provide feedback in the client interface to show pending actions or notify users of updates.
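For the database-overload mitigation (scenario 2), the core idea of sharding is simply routing each user's data to a fixed shard. Cassandra's partitioner does this automatically, but a minimal hand-rolled sketch shows the principle (shard count and key format are made up):

```python
import hashlib

NUM_SHARDS = 16  # illustrative shard count

def shard_for(user_id: str) -> int:
    """Map a user to a shard with a stable hash so their data always lands on the same node."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user_12345"))  # same user -> same shard on every call
```

Plain modulo sharding makes adding shards painful because most keys move; consistent hashing (which Cassandra uses internally) keeps that data movement small.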
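For the rate-limiting mitigation (scenario 5), a token bucket is one common way to allow short bursts while capping the sustained rate. A single-process sketch with made-up per-user limits (a production gateway would keep these counters in shared storage such as Redis):

```python
import time

class TokenBucket:
    """Allow short bursts while capping the sustained request rate (illustrative only)."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # tokens refilled per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=5, burst=20)  # e.g. 5 req/s sustained, bursts of up to 20
```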
Conclusion
Addressing these potential failure scenarios and bottlenecks is essential for ensuring the system remains reliable, performant, and user-friendly. Continuous monitoring, testing, and refinement of the architecture can greatly mitigate the risks associated with each scenario.
Future improvements
- Primary/Secondary Load Balancer to reduce Single-Point of Failure.
- Use a multi-region active-active strategy - deploy services in multiple regions to remove single points of failure and reduce latency.