System requirements
Functional:
- Users can post a tweet, up to a character limit (140 characters in Twitter's original design).
- Users can follow other users.
- Users can view tweets from the accounts they follow in their feed.
- Users can like tweets.
Non-Functional (summarised here before moving on to technologies and capacity estimation):
- Scalability: The system should handle up to 500 million Daily Active Users (DAU) and scale horizontally as load grows.
- Availability: High availability is required, supported by:
  - Redundant systems to ensure continuous service.
  - Content Delivery Networks (CDNs) for static content.
  - Geo-distributed data centers to limit the impact of regional outages.
- Latency: Aim for a response time of under 500 milliseconds for displaying tweets.
- Security: Implement measures for user authentication and authorization to ensure that only legitimate users can access the system features.
Capacity estimation
- Daily Tweets: 500 million DAU * 2 tweets/user = 1 billion tweets per day.
- Daily Views: 500 million DAU * 100 views/user = 50 billion views per day.
- Daily Storage: 1 billion tweets * 512 bytes/tweet = 512 GB of data per day.
- Peak Load: 500 million DAU * 20% = 100 million active users during peak times.
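The system is read-heavy: 50 billion views against 1 billion tweets per day is a 50:1 read-to-write ratio, which is why the design leans on caching and on spreading reads across many nodes.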
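The arithmetic above can be reproduced in a few lines. This is a rough sketch using only the figures already stated; the per-second rates and the yearly storage figure are derived values that assume traffic is spread evenly over the day, which real traffic is not.

```python
# Back-of-envelope capacity figures from the stated assumptions.
DAU = 500_000_000          # daily active users
TWEETS_PER_USER = 2        # tweets posted per user per day
VIEWS_PER_USER = 100       # tweets viewed per user per day
BYTES_PER_TWEET = 512      # assumed average stored size of one tweet
PEAK_PERCENT = 20          # share of DAU active at peak
SECONDS_PER_DAY = 86_400

daily_tweets = DAU * TWEETS_PER_USER
daily_views = DAU * VIEWS_PER_USER
daily_storage_gb = daily_tweets * BYTES_PER_TWEET / 1e9
peak_users = DAU * PEAK_PERCENT // 100

# Derived figures (averages only; peaks will be higher).
avg_write_qps = daily_tweets / SECONDS_PER_DAY
avg_read_qps = daily_views / SECONDS_PER_DAY
read_write_ratio = daily_views / daily_tweets
yearly_storage_tb = daily_storage_gb * 365 / 1000

if __name__ == "__main__":
    print(f"tweets/day:        {daily_tweets:,}")
    print(f"views/day:         {daily_views:,}")
    print(f"storage/day (GB):  {daily_storage_gb:,.0f}")
    print(f"peak users:        {peak_users:,}")
    print(f"avg write QPS:     {avg_write_qps:,.0f}")
    print(f"avg read QPS:      {avg_read_qps:,.0f}")
    print(f"read:write ratio:  {read_write_ratio:.0f}:1")
    print(f"storage/year (TB): {yearly_storage_tb:,.2f}")
```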
API design
1. RegistrationAPI
- POST /register → Create a new user entry in the database.
- Statelessness is preserved: The client sends the registration payload; the server responds independently without remembering previous client interactions.
2. LoginAPI
- POST /login → User sends credentials; the server authenticates them and returns a JWT.
- JWT fits here because the token is self-contained (no server-side session needed), which suits horizontal scaling behind load balancers.
3. TweetAPI (CRUD for tweets)
- POST /tweets → Create a tweet.
- GET /tweets/{tweet_id} → Retrieve a specific tweet.
- GET /timeline → Retrieve the timeline (batch-fetches tweets from the accounts the user follows).
- PUT /tweets/{tweet_id} → Update/edit a tweet.
- DELETE /tweets/{tweet_id} → Delete a tweet.
Retrieving multiple tweets needs to be optimized, most likely with pagination (e.g., ?limit=20&offset=40).
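A minimal sketch of the limit/offset pagination mentioned above, run against an in-memory list standing in for the timeline. The function name, the response shape and the `max_limit` cap are assumptions for illustration.

```python
from typing import Any, Dict, Sequence


def paginate(tweets: Sequence[Any], limit: int = 20, offset: int = 0,
             max_limit: int = 100) -> Dict[str, Any]:
    """Return one page of tweets plus the offset of the next page, if any."""
    # Clamp client-supplied values so a request cannot ask for an unbounded page.
    limit = max(1, min(limit, max_limit))
    offset = max(0, offset)

    page = list(tweets[offset:offset + limit])
    has_more = offset + limit < len(tweets)
    return {
        "items": page,
        "next_offset": offset + limit if has_more else None,
    }


if __name__ == "__main__":
    timeline = [f"tweet-{i}" for i in range(100)]
    # Equivalent of GET /timeline?limit=20&offset=40
    result = paginate(timeline, limit=20, offset=40)
    print(result["items"][0], result["items"][-1], result["next_offset"])
```

Offset pagination is simple, but deep offsets get slower and pages can shift when new tweets arrive; a cursor based on the last seen tweet id is a common alternative for timelines.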
4. TwitterInteractionAPI
- POST /tweets/{tweet_id}/like → Like a tweet.
- DELETE /tweets/{tweet_id}/like → Unlike a tweet.
- POST /users/{user_id}/follow → Follow a user.
- DELETE /users/{user_id}/follow → Unfollow a user.
- Separating interactions into their own API keeps concerns modular and scalable.
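One property worth having in these endpoints is idempotency: repeating the same POST or DELETE should not double-count a like or a follow. The sketch below shows that behaviour with in-memory sets. The class and method names are hypothetical, and both the idempotency requirement and the self-follow check are assumptions rather than something stated in the requirements.

```python
from collections import defaultdict
from typing import DefaultDict, FrozenSet, Set


class InteractionStore:
    """In-memory stand-in for the like/follow tables."""

    def __init__(self) -> None:
        self._likes: DefaultDict[str, Set[str]] = defaultdict(set)      # tweet_id -> user ids
        self._following: DefaultDict[str, Set[str]] = defaultdict(set)  # user_id -> followee ids

    def like(self, user_id: str, tweet_id: str) -> bool:
        """POST /tweets/{tweet_id}/like. Returns True only if the like is new."""
        is_new = user_id not in self._likes[tweet_id]
        self._likes[tweet_id].add(user_id)
        return is_new

    def unlike(self, user_id: str, tweet_id: str) -> bool:
        """DELETE /tweets/{tweet_id}/like. Returns True only if a like was removed."""
        if user_id in self._likes.get(tweet_id, ()):
            self._likes[tweet_id].remove(user_id)
            return True
        return False

    def like_count(self, tweet_id: str) -> int:
        return len(self._likes.get(tweet_id, ()))

    def follow(self, follower_id: str, followee_id: str) -> bool:
        """POST /users/{user_id}/follow. Returns True only if the follow is new."""
        if follower_id == followee_id:
            raise ValueError("a user cannot follow themselves")
        is_new = followee_id not in self._following[follower_id]
        self._following[follower_id].add(followee_id)
        return is_new

    def unfollow(self, follower_id: str, followee_id: str) -> bool:
        """DELETE /users/{user_id}/follow. Returns True only if a follow was removed."""
        if followee_id in self._following.get(follower_id, ()):
            self._following[follower_id].remove(followee_id)
            return True
        return False

    def following(self, user_id: str) -> FrozenSet[str]:
        return frozenset(self._following.get(user_id, ()))
```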
Database design
Database System:
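- Choice: Use Cassandra for its horizontal scalability and ability to handle high volumes of reads and writes efficiently. Tables: User, Tweets, Followers.
- Architecture: Cassandra is masterless (peer-to-peer), so a Master/Slave split does not apply to it.
  - Writes (tweets, user actions): any node can accept a write, and the data is replicated to other nodes according to the keyspace's replication factor.
  - Reads: any replica can serve a read, so read capacity grows by adding nodes rather than by adding dedicated read replicas.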
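The three tables (User, Tweets, Followers) could be modelled roughly as below. This is a sketch in Python dataclasses rather than a real schema: every column name and the choice of partition and clustering keys are assumptions, picked so that "all tweets by an author" and "all followers of a user" each live in one partition.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MAX_TWEET_CHARS = 140  # character limit from the functional requirements


@dataclass(frozen=True)
class User:
    user_id: str          # partition key (assumed)
    handle: str
    created_at: datetime


@dataclass(frozen=True)
class Tweet:
    author_id: str        # partition key (assumed): groups a user's tweets together
    tweet_id: int         # clustering key (assumed): time-ordered id within the partition
    body: str
    created_at: datetime

    def __post_init__(self) -> None:
        # Enforce the character limit before the row is ever written.
        if len(self.body) > MAX_TWEET_CHARS:
            raise ValueError(f"tweet exceeds {MAX_TWEET_CHARS} characters")


@dataclass(frozen=True)
class Follower:
    followee_id: str      # partition key (assumed): the account being followed
    follower_id: str      # clustering key (assumed): one row per follower


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    alice = User(user_id="u1", handle="alice", created_at=now)
    tweet = Tweet(author_id=alice.user_id, tweet_id=1, body="hello", created_at=now)
    edge = Follower(followee_id="u1", follower_id="u2")
    print(alice, tweet, edge, sep="\n")
```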
Consistency:
- Eventual Consistency: Accepting eventual consistency is a good fit for a social media platform like Twitter. It allows for high availability and scalability while providing acceptable user experiences, even when updates might not be immediately reflected across all users simultaneously.
Caching Layer (protects the DB; reading from memory is faster than reading from disk):
- Caching System: Use Redis to cache frequently accessed tweets.
- Eviction Policy: Implement a Least-Frequently Used (LFU) policy to ensure that only the most accessed tweets remain in the cache.
- TTL (Time To Live): Apply a TTL to cached items to invalidate old or less relevant tweets.
- Prevent Cache Stampede: Use a Mutex mechanism on the cache key to prevent multiple concurrent requests from overwhelming the backend when a cache entry is missed.
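The TTL and mutex ideas above can be sketched with an in-process cache. The dictionary here stands in for Redis, and the class and parameter names are hypothetical. In a real deployment LFU eviction would be left to Redis (its `maxmemory-policy allkeys-lfu` setting), and the per-key mutex would need to be a distributed lock, for example one built on `SET key value NX` with an expiry, rather than a `threading.Lock`.

```python
import threading
import time
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, Optional, Tuple


class StampedeSafeCache:
    """Cache-aside with a TTL and one lock per key to avoid duplicate loads."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)
        self._key_locks: DefaultDict[str, threading.Lock] = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        # Guard the lock table itself so two threads never create two locks for one key.
        with self._locks_guard:
            return self._key_locks[key]

    def _fresh_entry(self, key: str) -> Optional[Tuple[Any, float]]:
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None  # missing or expired

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        entry = self._fresh_entry(key)
        if entry is not None:
            return entry[0]  # cache hit, no locking needed

        with self._lock_for(key):
            # Re-check: another thread may have filled the entry while we waited.
            entry = self._fresh_entry(key)
            if entry is not None:
                return entry[0]
            value = loader(key)  # only one thread per key reaches the backend
            self._data[key] = (value, self._clock() + self._ttl)
            return value
```

With this in place, concurrent requests that miss on the same key trigger a single `loader` call; the other requests wait on the lock and then read the freshly cached value.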
High-level design
I'll focus on users who are already authenticated and using Twitter.
- The Client (User Interface) sends a request to view the timeline.
- The CDN returns static assets, such as video files attached to tweets.
- The API Gateway, which also acts as a Load Balancer (Layer 4), routes the request and authenticates/authorises the user. A Rate Limiter is applied at this point to protect the API endpoints.
- The TweetAPI (which serves GET /timeline) checks the Cache Layer (Redis with LFU).
- If the data is present in the cache, it's returned to the client. If not, the request is forwarded to the Database Layer (Cassandra).
- Cassandra retrieves the timeline data based on the accounts the user follows. (The timeline data is assumed to be kept up to date by a separate service.) The cache is updated asynchronously after the DB hit.
- The response is sent back to the API Gateway, which then returns it to the Client.
Request flows
- The Client interacts with the system.
- The CDN serves the static assets.
- The request is directed to the API Gateway, which load-balances, authenticates/authorises the request, and rate-limits it before forwarding it to a service that hosts the APIs.
- The API checks the cache. If the cache key exists it returns the cached data; if not, it queries Cassandra.
Detailed component design
- API Gateway
  - An L4 (transport-layer) load balancer using a round-robin algorithm distributes requests across the horizontally replicated servers that host the API.
  - Authenticates/authorises the user based on the JWT.
  - Rate limiting protects the APIs.
- CDN
  - Serves static assets closer to the user to reduce latency.
- Redis
  - Redis Cluster provides the in-memory data cache and can be scaled horizontally to handle load.
- Cassandra
  - Distributed database that can be scaled horizontally by adding nodes.
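Two of the gateway responsibilities listed above, round-robin balancing and rate limiting, are small enough to sketch. The rate-limiting algorithm is not specified in this design; a token bucket is assumed here as one common choice, and all class names and parameters are hypothetical.

```python
import itertools
import time
from typing import Callable, Iterable


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order."""

    def __init__(self, backends: Iterable[str]) -> None:
        backend_list = list(backends)
        if not backend_list:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backend_list)

    def next_backend(self) -> str:
        return next(self._cycle)


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = float(capacity)
        self._refill = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should be throttled."""
        now = self._clock()
        # Add the tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["api-1", "api-2", "api-3"])  # hypothetical server names
    limiter = TokenBucket(capacity=2, refill_per_second=1.0)
    for _ in range(4):
        if limiter.allow():
            print("forward to", balancer.next_backend())
        else:
            print("429 Too Many Requests")
```

In practice there would be one bucket per user or per API key, and with several gateway instances the bucket state would have to live in shared storage such as Redis.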
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
- Primary/Secondary Load Balancers, so that the load balancer is no longer a single point of failure.
- Use a multi-region active-active strategy: deploy services in multiple regions so that no single region is a point of failure, and to reduce latency for users.