Requirements
Functional Requirements:
- Allow users to upload files to the system.
- Enable users to download uploaded files.
- Ensure synchronization of files between local and server storage.
- Support batch upload and download.
Non-Functional Requirements:
- Availability: 99.99%, roughly 52 minutes of downtime per year
- Consistency: eventual consistency for files and metadata
- Scalability and failover: when a server fails, traffic fails over to a healthy node
- Security: only users with permission can upload or view files; file objects are encrypted
- Durability: every component is replicated
Capacity Estimation
World population is 8B; assume 10% are users = 0.8B
System DAU is 10% of users = 80M
Read:write ratio = 10:1
W DAU
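The availability and user estimates above can be checked with a few lines of arithmetic. This is only a rough sketch: the 365-day year is an assumption, and the write share simply restates the 10:1 ratio as a fraction of all requests.

```python
# Back-of-envelope check of the availability and user estimates.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

availability = 0.9999
downtime_min = MINUTES_PER_YEAR * (1 - availability)

world_population = 8_000_000_000
users = world_population // 10   # 10% of the population
dau = users // 10                # 10% of users are active daily

# With R:W = 10:1, one in every eleven requests is a write.
write_fraction = 1 / (10 + 1)

print(f"downtime/year: {downtime_min:.2f} min")
print(f"users: {users:,}, DAU: {dau:,}")
print(f"write share of requests: {write_fraction:.1%}")
```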
Durability
- Once upload returns success, the raw file must already be persisted in durable object storage.
- Metadata must survive single-node or single-AZ failures.
- Chunk processing must be retryable without data loss.
- The system should tolerate consumer crash or retry under at-least-once delivery.
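One common way to tolerate redelivery under at-least-once delivery is to make the consumer idempotent by remembering which events it has already handled. The sketch below is illustrative only: the `eventId` field and the in-memory `processed` set are assumptions standing in for whatever durable dedup store the real system would use.

```python
# Sketch of an idempotent consumer under at-least-once delivery.
# `processed` stands in for a durable store (for example a DB table keyed by
# event ID); here it is an in-memory set purely for illustration.

def handle_event(event: dict, processed: set, results: list) -> bool:
    """Process an event once; return False if it was a duplicate."""
    event_id = event["eventId"]  # hypothetical field name
    if event_id in processed:
        return False             # redelivery after a crash or retry: skip
    results.append(f"chunked {event['fileId']}")  # the actual side effect
    processed.add(event_id)      # record only after the work succeeded
    return True


processed, results = set(), []
event = {"eventId": "e1", "fileId": "f123"}
first = handle_event(event, processed, results)
second = handle_event(event, processed, results)  # simulated redelivery
print(first, second, results)
```

A crash between the side effect and the dedup record would still cause the work to run twice, so in practice the side effect itself also needs to be safe to repeat.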
Security
All APIs require JWT-based authentication. The token is validated by the API Gateway before requests are routed to backend services.
In addition to authentication, the system enforces authorization checks to ensure that users can only upload, download, or read metadata for files they own or are explicitly allowed to access.
I would separate security into authentication, authorization, and storage protection. JWT proves user identity, but the backend still needs to verify ownership before returning file metadata or chunk URLs. All traffic should use TLS, objects should stay in private buckets, and downloads should use signed URLs or authenticated CDN access. I’d also add malware scanning and audit logging.
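The ownership check described above could look roughly like this. It is a minimal sketch: the `FileRecord` shape, the `shared_with` ACL, and the action names are assumptions for illustration, not part of the design as written.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    file_id: str
    owner_id: str
    # Hypothetical ACL: user ID -> set of allowed actions.
    shared_with: dict = field(default_factory=dict)


def is_authorized(user_id: str, record: FileRecord, action: str) -> bool:
    """Owner may do anything; others need an explicit grant for the action."""
    if user_id == record.owner_id:
        return True
    return action in record.shared_with.get(user_id, set())


record = FileRecord("f123", owner_id="alice", shared_with={"bob": {"download"}})
print(is_authorized("alice", record, "upload"))
print(is_authorized("bob", record, "download"))
print(is_authorized("bob", record, "upload"))
print(is_authorized("eve", record, "download"))
```

The `user_id` here would come from the already-validated JWT, so this check runs after authentication, not instead of it.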
API Design
Upload a single file
POST /v1/file {file info} -> returns {status, error code, location url}
Batch upload
POST /v1/files {files info} -> returns {status, error code, location url}
Download a file
GET /v1/file/{id}
Batch download
GET /v1/files?ids={id1},{id2},...
Finalize upload
POST /v1/files/{fileId}/finalize
Request example:
{
"checksum": "sha256-xxx",
"size": 1048576
}
Response example:
{
"fileId": "f123",
"status": "processing"
}
- Called after the client finishes uploading the raw object
- Verifies object existence and upload completeness
- Updates file status from uploading to processing
- Publishes a message to Kafka for asynchronous chunk processing
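The finalize steps can be sketched as a single handler. Everything here is a simplified assumption: the in-memory dicts stand in for the raw bucket, the metadata DB, and Kafka, the error codes are made up, and the checksum is assumed to be `sha256-` followed by a hex digest.

```python
import hashlib

# Hypothetical in-memory stand-ins for the raw bucket, metadata DB, and Kafka.
raw_bucket = {"f123": b"hello world"}
file_status = {"f123": "uploading"}
published = []


def finalize_upload(file_id: str, checksum: str, size: int) -> dict:
    data = raw_bucket.get(file_id)
    if data is None:
        return {"fileId": file_id, "status": "error", "error": "object_not_found"}
    actual = "sha256-" + hashlib.sha256(data).hexdigest()
    if len(data) != size or actual != checksum:
        return {"fileId": file_id, "status": "error", "error": "incomplete_upload"}
    if file_status[file_id] == "uploading":  # repeat calls are no-ops
        file_status[file_id] = "processing"
        published.append({"topic": "file-uploaded", "fileId": file_id})
    return {"fileId": file_id, "status": file_status[file_id]}


expected = "sha256-" + hashlib.sha256(b"hello world").hexdigest()
print(finalize_upload("f123", expected, 11))
print(finalize_upload("f123", "sha256-bad", 11))
```

Guarding the status transition means a client that retries finalize does not publish a second event.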
Get file info
GET /v1/files/{fileId}/metadata
- Returns file metadata and current processing status
All of the APIs above require a JWT as the user's security token.
High-Level Design
fileInfo: fileId, name, author, chunkIds:[], updated, created
chunkInfo: fileId, chunkId, chunkUrl
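As one possible concrete shape for these two records, they could be modelled as dataclasses. The field types, the UTC timestamps, and the example chunk URL are assumptions; only the field names come from the notes above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ChunkInfo:
    file_id: str
    chunk_id: str
    chunk_url: str


@dataclass
class FileInfo:
    file_id: str
    name: str
    author: str
    chunk_ids: list = field(default_factory=list)
    created: datetime = field(default_factory=_now)
    updated: datetime = field(default_factory=_now)


info = FileInfo("f123", "report.pdf", "alice")
chunk = ChunkInfo(info.file_id, "c0", "s3://chunk-bucket/f123/c0")  # hypothetical URL
info.chunk_ids.append(chunk.chunk_id)
print(info.chunk_ids)
```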
Upload Flow
1. Client → API Gateway
- Client calls:
upload(fileId)
- API Gateway responsibilities:
- Route request to available upload servers
- Use least-connections load balancing
- Handle failover when a node goes down
👉 Design intent
- Prevent hotspots on individual upload servers
- Ensure high availability at the entry point
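Least-connections routing with failover can be illustrated in a few lines. This is a sketch only: the server names are hypothetical, and a real gateway would track connection counts and health itself.

```python
def pick_server(active_connections: dict, healthy: set) -> str:
    """Return the healthy server with the fewest active connections."""
    candidates = {s: n for s, n in active_connections.items() if s in healthy}
    if not candidates:
        raise RuntimeError("no healthy upload servers")
    return min(candidates, key=candidates.get)


connections = {"upload-1": 12, "upload-2": 3, "upload-3": 7}  # hypothetical names
all_up = pick_server(connections, healthy={"upload-1", "upload-2", "upload-3"})
one_down = pick_server(connections, healthy={"upload-1", "upload-3"})  # upload-2 is down
print(all_up, one_down)
```

Filtering by health before picking the minimum is what gives failover: a downed node simply stops being a candidate.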
2. Raw Object Storage (S3)
- Upload server:
- Directly streams raw file → S3 (raw bucket)
👉 Why write to S3 first?
- Avoids coupling the upload with metadata or processing
- The raw file is in durable storage immediately, so a later processing failure cannot lose it
3. Metadata Registration (FileInfo Service)
- Client sends:
POST upload(fileInfo)
- FileInfo Service:
- Stores metadata into DB
fileInfo {
fileId,
name,
author,
created,
updated
}
👉 Design intent
- Metadata and file storage are decoupled
- Enables independent scaling
4. Async Processing via Kafka
- System publishes event:
topic: file-uploaded
👉 This triggers downstream processing pipeline
5. Object Processing Pipeline (Async Consumers)
A dedicated Object Handler Service consumes Kafka events:
Pipeline stages:
Raw S3 Object
↓
Split into chunks
↓
Fraud Detection
↓
Compression Service
👉 Why pipeline?
- Each stage is independently scalable
- Failures are isolated (retry per stage)
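The data flow through the stages can be sketched in-process. This is a simplification under stated assumptions: the chunk size is tiny for readability, `scan` is a placeholder for the detection stage, and zlib is just one possible compressor. In the actual design each stage would be a separate consumer so it can scale and retry on its own.

```python
import zlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def scan(chunk: bytes) -> bytes:
    # Placeholder for the detection stage; a real scanner would reject bad input.
    return chunk


def compress(chunk: bytes) -> bytes:
    return zlib.compress(chunk)


def run_pipeline(raw: bytes) -> list:
    return [compress(scan(c)) for c in split_into_chunks(raw)]


raw = b"abcdefghij"
processed = run_pipeline(raw)
restored = b"".join(zlib.decompress(c) for c in processed)
print(len(processed), restored == raw)
```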
6. Chunk Storage
- After processing:
- Each chunk uploaded to Chunk S3
👉 Guarantees:
- All chunks must upload successfully before the file is considered processed
- Writes are idempotent, so a retried upload does not create duplicate chunks
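One way to get idempotent chunk writes is a deterministic object key, so a retry overwrites the same object with the same bytes. The key layout below (file ID, chunk index, truncated SHA-256) is an assumption for illustration, and the dict stands in for the chunk bucket.

```python
import hashlib

chunk_store = {}  # stands in for the chunk bucket


def chunk_key(file_id: str, index: int, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()[:16]
    return f"{file_id}/{index:06d}-{digest}"


def put_chunk(file_id: str, index: int, data: bytes) -> str:
    key = chunk_key(file_id, index, data)
    chunk_store[key] = data  # same key and same bytes on retry: overwrite is harmless
    return key


k1 = put_chunk("f123", 0, b"abcd")
k2 = put_chunk("f123", 0, b"abcd")  # retry of the same chunk
print(k1 == k2, len(chunk_store))
```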
7. Chunk Notification (Kafka)
- Publish event:
topic: chunk-created
8. Chunk Metadata Service
- Chunk Service consumes event
- Stores chunk metadata:
chunkInfo {
fileId,
chunkId,
chunkUrl
}
👉 Design intent
- Enables parallel download later
- Avoids scanning S3 during reads
Download Flow
1. Client → API Gateway
2. CDN Check
- First check:
- CDN cache (edge)
👉 If hit → return immediately
👉 If miss → go to the backend
3. File Metadata Lookup
- Request goes to FileInfo Service
- Service fetches:
- fileInfo DB
- chunkInfo DB
4. Return Chunk URLs
- Response contains:
[
{chunkId, chunkUrl},
...
]
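Since downloads are meant to use signed URLs, each returned chunkUrl could carry an expiry and a signature. The sketch below uses a generic HMAC scheme to show the idea; it is not the S3 or CDN presigning algorithm, and the secret, path, and query parameter names are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical; a real key would come from a secret manager


def sign_url(path: str, expires_at: int) -> str:
    msg = f"{path}:{expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={sig}"


def verify_url(path: str, expires_at: int, sig: str, now: int) -> bool:
    if now > expires_at:
        return False  # link has expired
    msg = f"{path}:{expires_at}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)


url = sign_url("/chunks/f123/c0", expires_at=1_700_000_600)
sig = url.split("sig=")[1]
print(verify_url("/chunks/f123/c0", 1_700_000_600, sig, now=1_700_000_000))
print(verify_url("/chunks/f123/c0", 1_700_000_600, sig, now=1_700_001_000))
```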
5. Client Parallel Download
- Frontend:
- Downloads chunks in parallel
- Reconstructs file locally
👉 Why client-side merge?
- Reduces backend load
- Maximizes bandwidth usage
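Client-side parallel download and merge can be sketched with a thread pool. The assumptions are worth stating: `FAKE_STORE` replaces real HTTP GETs, and the chunk list is assumed to arrive in file order, which the notes do not specify.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for chunk storage; a real client would issue HTTP GETs instead.
FAKE_STORE = {"u0": b"hello ", "u1": b"wor", "u2": b"ld"}


def fetch(chunk: dict) -> bytes:
    return FAKE_STORE[chunk["chunkUrl"]]


def download_file(chunks: list, max_workers: int = 4) -> bytes:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so parts stay in chunk order
        parts = list(pool.map(fetch, chunks))
    return b"".join(parts)


chunks = [{"chunkId": i, "chunkUrl": f"u{i}"} for i in range(3)]
merged = download_file(chunks)
print(merged)
```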
Detailed Component Design
Fault Tolerance
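- The raw file is already in S3 before processing starts, so a processing failure causes no data loss
- Kafka retains events, so failed processing can be retried
- Chunk processing is idempotent, so retries are safe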
Scalability
- Chunking enables:
- parallel processing
- parallel download
- Each service scales independently
- Faster uploads
Failover / Replication Design
1. Local-region design
Write path
- Writes go to the primary node
- The primary replicates data to local replicas
- The write is considered successful only after:
- primary write succeeds, and
- at least one replica acknowledges the write
Read path
- Reads are served from local replicas whenever possible
- This reduces read latency and offloads traffic from the primary
Why this design
- Compared with waiting for all replicas, waiting for one replica ack gives a better balance:
- better durability than primary-only ack
- lower latency than full synchronous replication
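The local-region acknowledgement rule reduces to a small predicate. This is a sketch with hypothetical names; real replication would also involve timeouts and ordering, which are left out here.

```python
def write_with_ack(primary_ok: bool, replica_acks: list, min_replica_acks: int = 1) -> bool:
    """Succeed only if the primary wrote and enough replicas acknowledged."""
    if not primary_ok:
        return False
    return sum(replica_acks) >= min_replica_acks


print(write_with_ack(True, [True, False]))   # primary plus one replica ack
print(write_with_ack(True, [False, False]))  # no replica ack
print(write_with_ack(False, [True, True]))   # primary write failed
```

Setting `min_replica_acks` to the number of replicas would model full synchronous replication, which is the higher-latency alternative the notes compare against.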
2. Multi-region design
Replication strategy
- Use quorum-based replication across regions
For example:
- total replicas = 5
- write quorum = 3
Then:
- upload is considered successful once 3 out of 5 replicas acknowledge the write
Why quorum
- It allows the system to tolerate node or region failures
- As long as quorum is preserved, the system can continue serving writes safely
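The 3-of-5 example can be worked through in code. The read quorum `r` is an assumption added for illustration (the notes only give a write quorum); the overlap condition R + W > N is the standard rule for a read to intersect the latest write.

```python
def write_succeeds(acks: int, write_quorum: int) -> bool:
    return acks >= write_quorum


def tolerated_failures(total_replicas: int, write_quorum: int) -> int:
    """Replicas that can be down while writes still reach quorum."""
    return total_replicas - write_quorum


def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """Read and write sets overlap in at least one replica when R + W > N."""
    return r + w > n


N, W = 5, 3
print(write_succeeds(acks=3, write_quorum=W))
print(tolerated_failures(N, W))
print(reads_see_latest_write(N, W, r=3))
```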
3. Failover strategy
During failover, the system promotes the most up-to-date replica as the new primary, typically the one with the smallest replication lag or the latest committed log position.
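Choosing the replica to promote can be sketched as picking the healthy replica with the latest committed log position. The replica records and the `committed_lsn` field name are hypothetical.

```python
def pick_new_primary(replicas: list) -> str:
    """Choose the healthy replica with the highest committed log position."""
    healthy = [r for r in replicas if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy replica to promote")
    return max(healthy, key=lambda r: r["committed_lsn"])["name"]


replicas = [
    {"name": "r1", "committed_lsn": 1040, "healthy": True},
    {"name": "r2", "committed_lsn": 1055, "healthy": False},  # ahead but unreachable
    {"name": "r3", "committed_lsn": 1052, "healthy": True},
]
new_primary = pick_new_primary(replicas)
print(new_primary)
```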
Performance Optimization
- CDN for hot files
- Chunk-based parallel download
Extensibility
- Pipeline can easily add:
- virus scan
- AI tagging
- preview generation