Build a low-latency Image Processing Pipeline

Question

Design a low-latency image processing system that handles millions of requests. Discuss trade-offs in consistency, availability, and performance.

Codemia · Accepted Answer

### Functional Requirements:
1. **Image Upload:** Users can upload images through Slack channels. 
2. **Real-time Processing:** Images should be processed within 200ms to ensure low latency.
3. **Image Transformations:** Support for various transformations like resizing, filtering, and enhancement.
4. **Feedback Loop:** Allow users to rate image processing quality to improve ML models.

### Non-Functional Requirements:
1. **Scalability:** Support millions of image uploads per day, scaling horizontally.
2. **Reliability:** 99.9% uptime with efficient error handling.
3. **Performance:** Maintain a response latency of under 200ms for 95% of requests.
4. **Security:** Ensure user data privacy and secure image handling.
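
As a minimal sketch of how the 95th-percentile latency target could be checked against recorded request latencies, the snippet below uses the nearest-rank percentile method. The function names `percentile` and `meets_latency_slo` are hypothetical and not part of any monitoring library; in practice Prometheus histograms would likely do this job.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_latency_slo(latencies_ms, threshold_ms=200, pct=95):
    """True when the pct-th percentile latency is strictly under the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms
```

With 100 samples, the check passes if at most 5 requests exceed 200ms and fails as soon as a sixth does.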

### Capacity Estimation:
Assuming Slack has around 10 million daily active users and 20% of them upload one image per day:
- Daily Image Uploads = 10M users * 20% = 2M images/day.
- Hourly Image Uploads = 2M images / 24 hours = ~83,333 images/hour.
- Peak Load: Assuming peak traffic is 10x the average, around 833,330 images/hour.
- Peak Throughput: 833,330 images/hour / 3,600 seconds = ~231 images/second.
- Concurrency: At 200ms of processing per image, a single sequential worker handles about 5 images/second, so ~231 images/second keeps roughly 47 images in flight at once.
- Headroom: To keep meeting the latency target during bursts, the system should ideally support 500-1000 images/second at peak, which at 200ms per image means about 100-200 images in flight.
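
The estimate above can be reproduced with a few lines of arithmetic. This is a sketch using the same assumed inputs (10M daily active users, 20% uploading, 10x peak factor, 200ms per image). The concurrency figure comes from Little's law (in-flight work = arrival rate × time in system), which is an addition here and not part of the original estimate.

```python
import math

DAILY_ACTIVE_USERS = 10_000_000   # assumed Slack DAU
UPLOAD_PERCENT = 20               # assumed share of users uploading one image per day
PEAK_MULTIPLIER = 10              # assumed peak-to-average ratio
LATENCY_S = 0.200                 # target processing time per image

daily_uploads = DAILY_ACTIVE_USERS * UPLOAD_PERCENT // 100
avg_per_second = daily_uploads / 86_400
peak_per_second = avg_per_second * PEAK_MULTIPLIER

def in_flight(rate_per_second, latency_s=LATENCY_S):
    """Little's law: average number of images being processed at once."""
    return rate_per_second * latency_s

# Minimum concurrent processing slots needed to absorb the estimated peak.
min_slots_at_peak = math.ceil(in_flight(peak_per_second))
```

Here `peak_per_second` comes out at about 231 and `min_slots_at_peak` at 47. Provisioning for 1000 images/second raises the requirement to about 200 concurrent slots.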

### Component Diagram:
1. **Image Upload Service:** Receives images via Slack API.
2. **Queue System (Kafka):** Buffers incoming images for processing.
3. **Image Processing Workers (Kubernetes):** Processes images using a scalable architecture.
4. **Feature Store (Redis):** Caches frequently accessed features for fast retrieval.
5. **Model Serving (TensorFlow Serving):** Hosts ML models for real-time inference.
6. **Monitoring (Prometheus & Grafana):** Tracks system performance and model degradation.
7. **Feedback Collection Service:** Gathers user feedback on image processing quality.
8. **Database (PostgreSQL):** Stores user uploads, processing results, and feedback.
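
To make the upload → queue → worker flow concrete, here is a single-process sketch. It is only an illustration under stated assumptions: a bounded `queue.Queue` stands in for Kafka, threads stand in for Kubernetes worker pods, and `process_image` is a hypothetical placeholder for the real transformation and model inference step.

```python
import queue
import threading

def process_image(image_bytes):
    """Placeholder transformation; a real worker would resize, filter, or run inference."""
    return image_bytes[::-1]

def run_pipeline(images, num_workers=4):
    """Push images through a bounded queue to a pool of worker threads."""
    jobs = queue.Queue(maxsize=100)   # stand-in for the Kafka buffer
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()
            if item is None:          # sentinel: no more work for this worker
                return
            image_id, data = item
            processed = process_image(data)
            with lock:
                results[image_id] = processed

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for image_id, data in images.items():
        jobs.put((image_id, data))    # blocks when the queue is full (backpressure)
    for _ in threads:
        jobs.put(None)                # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

The bounded queue is the design point worth noting: when workers fall behind, producers block instead of letting memory grow without limit, which is the role the Kafka buffer plays in the component list.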

### Schema Design:
**Image Table:**
- `image_id`: UUID (Primary Key)
- `user_id`: UUID (Foreign Key)
- `upload_timestamp`: TIMESTAMP
- `processed_image_url`: TEXT
- `processing_time`: INTEGER (ms)
- `feedback_score`: INTEGER (1-5)

**Feedback Table:**
- `feedback_id`: UUID (Primary Key)
- `image_id`: UUID (Foreign Key)
- `user_id`: UUID (Foreign Key)
- `feedback_text`: TEXT
- `submitted_at`: TIMESTAMP

This design keeps the common reads cheap: an image and its rating come from a single row in the Image table, while free-text feedback is fetched from the Feedback table by `image_id`. Updating a rating touches only one column on one row.
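
The two tables above can be sketched as DDL. The example below uses Python's built-in `sqlite3` with an in-memory database purely so it runs without a server; the design itself targets PostgreSQL, where the ID columns would be `UUID` and the timestamps `TIMESTAMP`. The `CHECK` constraint, the index `idx_feedback_image`, and the helper `open_db` are assumptions added for illustration. No users table is defined in the schema, so `user_id` is left without a foreign-key constraint here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE image (
    image_id            TEXT PRIMARY KEY,   -- UUID stored as text in SQLite
    user_id             TEXT NOT NULL,      -- would reference a users table
    upload_timestamp    TEXT NOT NULL,
    processed_image_url TEXT,
    processing_time     INTEGER,            -- milliseconds
    feedback_score      INTEGER CHECK (feedback_score BETWEEN 1 AND 5)
);
CREATE TABLE feedback (
    feedback_id   TEXT PRIMARY KEY,
    image_id      TEXT NOT NULL REFERENCES image(image_id),
    user_id       TEXT NOT NULL,
    feedback_text TEXT,
    submitted_at  TEXT NOT NULL
);
-- Assumed index: supports fetching all feedback for one image.
CREATE INDEX idx_feedback_image ON feedback(image_id);
"""

def open_db():
    """Create an in-memory database with the schema and foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```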

### Trade-offs:
1. **Consistency vs Availability:** We opt for eventual consistency in feedback processing to maintain availability during peak loads. Updating ratings synchronously on the request path would add latency.
2. **Complexity vs Performance:** We choose a simpler CNN model for image processing to meet the latency requirement, rather than a more complex model that could provide marginally better accuracy at higher latency.
3. **Batch Processing vs Real-time:** Inference must run in real time, but model retraining runs as a batch job, and it needs enough labeled feedback to accumulate before retraining can improve the models.
4. **Feature Store Complexity:** Using Redis for caching speeds up retrieval but adds the burden of keeping the cache correct, which requires a robust invalidation strategy.
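
The cache-invalidation concern in the last point can be illustrated with a small read-through cache that combines expiry with explicit invalidation. This is an in-process sketch, not Redis: the class `TTLCache` and its `loader` callback are hypothetical names, and a real deployment would rely on Redis key expiry and deletes instead.

```python
import time

class TTLCache:
    """In-process stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._store = {}             # key -> (value, expires_at)

    def get(self, key, loader):
        """Return the cached value, or call loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = loader(key)
        self._store[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a key immediately, e.g. after the underlying feature changes."""
        self._store.pop(key, None)
```

Expiry bounds how stale a cached feature can get, while `invalidate` handles the case where the source of truth is known to have changed. Relying on expiry alone is simpler but accepts staleness for up to one TTL.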
