Build a fault-tolerant Caching Pipeline

Question

Design a fault-tolerant caching system that handles millions of requests. Discuss trade-offs in consistency, availability, and performance.

Codemia · Accepted Answer

### Functional Requirements
1. **Request Handling**: The caching system must handle millions of requests, peaking near one million requests per second, with end-to-end latency under 100 ms.
2. **Fault Tolerance**: The system must ensure that it continues to serve requests even in the event of partial failures.
3. **Model Serving**: It should serve the latest machine learning models for pricing predictions, fraud detection, and user behavior modeling.
4. **Caching Layer**: Must cache frequently accessed data such as user wallets, transaction history, and market data.
5. **Real-time Updates**: The system should allow for real-time updates to the cache based on new incoming data.

### Non-Functional Requirements
1. **Scalability**: The system should scale horizontally to accommodate variable load.
2. **Consistency**: The system should provide eventual consistency for cached data.
3. **Availability**: Must achieve 99.99% uptime to ensure that users can access critical functionalities.
4. **Latency**: The caching layer itself should respond within 50 ms, leaving headroom inside the 100 ms end-to-end budget.
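
To make the 99.99% availability target concrete, it helps to convert it into an annual downtime budget. The following is a small worked calculation; the function name `downtime_budget_minutes` is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Return the allowed downtime per year, in minutes, for an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


budget = downtime_budget_minutes(0.9999)
print(f"99.99% uptime allows about {budget:.2f} minutes of downtime per year")
```

Under these assumptions the budget is a little under an hour per year, which is why partial failures have to be absorbed without taking the whole caching layer offline.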

### Capacity Estimation
Assume Coinbase has approximately 10 million active users.
- **Peak Requests**: If we estimate that each user generates 5 requests per minute at peak times, that results in:
  - 10,000,000 users * 5 requests/min = 50,000,000 requests/min = approximately 833,333 requests/sec.
- **Cache Size**: If each user’s wallet and transaction data averages around 1 KB, the total cache size would be:
  - 10,000,000 users * 1 KB = approximately 10 GB. To accommodate growth and redundancy, we should plan for at least 100 GB of cache storage.
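
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the constants come from the assumptions stated in this section, 1 KB is treated as 1,000 bytes, and `HEADROOM_FACTOR` is an illustrative name for the 10x margin implied by planning for 100 GB.

```python
ACTIVE_USERS = 10_000_000
REQUESTS_PER_USER_PER_MIN = 5
BYTES_PER_USER = 1_000   # 1 KB per user, decimal units
HEADROOM_FACTOR = 10     # assumption: growth plus redundancy margin

# Peak request rate
peak_requests_per_min = ACTIVE_USERS * REQUESTS_PER_USER_PER_MIN
peak_requests_per_sec = peak_requests_per_min / 60

# Cache footprint
working_set_gb = ACTIVE_USERS * BYTES_PER_USER / 1e9
provisioned_gb = working_set_gb * HEADROOM_FACTOR

print(f"Peak: {peak_requests_per_min:,} req/min = {peak_requests_per_sec:,.0f} req/sec")
print(f"Working set: {working_set_gb:.0f} GB, provisioned: {provisioned_gb:.0f} GB")
```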

### High-Level Architecture
The architecture consists of the following components:
1. **Load Balancer**: Distributes incoming requests among multiple caching servers to ensure even load distribution.
2. **Caching Layer**: 
   - **Redis**: Used for in-memory caching due to its speed and support for various data structures.
   - **Persistent Storage**: **Amazon S3** for fallback storage of less frequently accessed data.
3. **Model Serving Layer**: 
   - **TensorFlow Serving**: For deploying machine learning models, allowing for easy updates and rollbacks.
4. **Monitoring & Logging**: 
   - **Prometheus** for monitoring and alerting, and **ELK Stack** for logging and debugging.
5. **Feature Store**: 
   - **Feast**: To manage feature engineering and storage, ensuring features are up-to-date and retrievable in real-time.
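
One way the caching layer can keep serving requests during a partial failure is a cache-aside read that falls back to the backing store when the cache is unreachable. The sketch below illustrates the idea only: `InMemoryCache` is a hypothetical stand-in for a Redis node, `read_through` is an illustrative helper, and a plain dictionary plays the role of the slower fallback store.

```python
from typing import Callable, Dict, Optional


class InMemoryCache:
    """Stand-in for a cache node; setting healthy=False simulates an outage."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.healthy = True

    def get(self, key: str) -> Optional[str]:
        if not self.healthy:
            raise ConnectionError("cache node unavailable")
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.healthy:
            raise ConnectionError("cache node unavailable")
        self._data[key] = value


def read_through(key: str, cache: InMemoryCache,
                 load_from_store: Callable[[str], str]) -> str:
    """Cache-aside read that degrades to the backing store on cache failure."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached  # cache hit
    except ConnectionError:
        # Cache outage: serve from the slower store instead of failing the request.
        return load_from_store(key)

    value = load_from_store(key)  # cache miss
    try:
        cache.set(key, value)  # populate for later reads
    except ConnectionError:
        pass  # best effort: the read still succeeds
    return value


store = {"user:42": '{"wallet_balance": 10.5}'}
cache = InMemoryCache()

first = read_through("user:42", cache, store.__getitem__)   # miss, populates cache
second = read_through("user:42", cache, store.__getitem__)  # hit
cache.healthy = False
third = read_through("user:42", cache, store.__getitem__)   # outage, falls back
```

In a real deployment the fallback path would need its own protection, such as rate limiting or a circuit breaker, so that a cache outage does not overload the backing store.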

### Data Model
#### Schema Design
- **User Cache Table**:
  - `user_id` (Primary Key, String)
  - `wallet_balance` (Float)
  - `transaction_history` (List of Strings)
  - `last_updated` (Timestamp)
- **Feature Table** for ML Models:
  - `feature_id` (Primary Key, String)
  - `feature_name` (String)
  - `feature_value` (Float)
  - `last_updated` (Timestamp)
#### Access Patterns
- The most frequent reads fetch a user's wallet data and recent transaction history. Because both live in one record keyed by `user_id`, a single lookup serves them, which suits read-heavy workloads.
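
As a rough illustration, the User Cache Table could be represented as a dataclass that is serialized to JSON before being stored under a per-user key. The `user:<id>` key format, the helper names, and the use of a Unix timestamp for `last_updated` are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class UserCacheEntry:
    user_id: str
    wallet_balance: float
    transaction_history: List[str] = field(default_factory=list)
    last_updated: float = 0.0  # assumption: Unix timestamp in seconds


def cache_key(user_id: str) -> str:
    """Build the cache key for a user record (hypothetical key format)."""
    return f"user:{user_id}"


def to_cache_value(entry: UserCacheEntry) -> str:
    """Serialize an entry to a JSON string suitable for a string-valued cache."""
    return json.dumps(asdict(entry))


def from_cache_value(raw: str) -> UserCacheEntry:
    """Rebuild an entry from its cached JSON string."""
    return UserCacheEntry(**json.loads(raw))


entry = UserCacheEntry("42", 10.5, ["tx-001", "tx-002"], 1_700_000_000.0)
restored = from_cache_value(to_cache_value(entry))
```

The schema above lists `wallet_balance` as a Float, and the sketch follows it. Systems that handle money often store amounts as integers in the smallest currency unit or as decimals to avoid rounding error.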

### Key Trade-offs
1. **Consistency vs. Availability**: By choosing eventual consistency for the caching layer, we prioritize availability and performance. The cost is that a read arriving shortly after an update may be served stale data until the cache entry is refreshed or expires.
2. **Memory vs. Storage**: We must balance the amount of in-memory cache (using Redis) against long-term storage costs (using S3). Keeping frequently accessed data in Redis preserves low latency, while offloading less critical data to S3 reduces cost at the price of slower reads for that data.
3. **Model Freshness vs. Latency**: Regularly retraining and redeploying models keeps predictions fresh, but each rollout can add serving latency or introduce regressions. Implementing a canary deployment strategy allows us to roll out updates gradually, minimizing the impact on users while maintaining model freshness.
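
The first trade-off above can be made concrete with a time-to-live (TTL) cache: the TTL bounds how long a stale value may be served after the source of truth changes. This is a minimal sketch, not a production design; `TTLCache` and its injectable `clock` parameter are illustrative names, and the fake clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny TTL cache: an entry is served until its expiry time passes."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: force a reload from the source
            return None
        return value


now = [0.0]  # fake clock so the example does not depend on real time
cache = TTLCache(ttl_seconds=5.0, clock=lambda: now[0])

source_of_truth = {"balance": "100"}
cache.set("balance", source_of_truth["balance"])

source_of_truth["balance"] = "80"  # the update bypasses the cache

now[0] = 3.0
stale = cache.get("balance")       # still the old value inside the TTL window

now[0] = 5.0
expired = cache.get("balance")     # entry has expired, caller must reload
```

A shorter TTL narrows the staleness window but sends more reads to the backing store, which is the consistency versus performance tension in miniature.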
