Design Dropbox - System Design

System requirements

Functional:

Users can upload and download files.
Users can modify file content.
Users can share files with permissions (read-only, write-only, etc.).

Non-Functional:

Availability.
Durability.
Scalability.
Consistency.
Security.
Low latency.

Capacity estimation

Blob Storage (File Content)

Average file size: 10 MB
Uploads per user per day: 10
Active users: 100,000
Daily upload volume: 10 TB/day
Annual storage needed: ~3.65 PB/year
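
The arithmetic behind these figures can be checked with a short script. This is a minimal sketch that assumes decimal units (1 TB = 1,000,000 MB) and ignores replication overhead; the constant names are illustrative.

```python
# Back-of-the-envelope blob storage estimate (decimal units assumed).
AVG_FILE_SIZE_MB = 10
UPLOADS_PER_USER_PER_DAY = 10
ACTIVE_USERS = 100_000

MB_PER_TB = 1_000_000
TB_PER_PB = 1_000

daily_upload_tb = AVG_FILE_SIZE_MB * UPLOADS_PER_USER_PER_DAY * ACTIVE_USERS / MB_PER_TB
annual_storage_pb = daily_upload_tb * 365 / TB_PER_PB

print(f"Daily upload volume: {daily_upload_tb:g} TB/day")
print(f"Annual storage needed: {annual_storage_pb:g} PB/year")
```

Any replication factor applied by the blob store would multiply the annual figure accordingly.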

Metadata Database

a. Files table:

Rows/year: ~365 million
Row size: ~300 bytes
Total: ~110 GB

b. File versions table:

Rows/year: ~550 million (assuming 1.5 versions per file)
Row size: ~400 bytes
Total: ~220 GB

c. Other tables

Estimated size: ~100–150 GB
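
The metadata figures follow from the same upload rate. The sketch below recomputes them, assuming decimal gigabytes and one files row per upload; the exact results (109.5 GB and 219 GB) round to the estimates listed above.

```python
# Metadata size estimate; row sizes and versions-per-file come from the figures above.
FILES_PER_DAY = 10 * 100_000       # uploads per user per day * active users
FILE_ROW_BYTES = 300
VERSION_ROW_BYTES = 400
VERSIONS_PER_FILE = 1.5
BYTES_PER_GB = 1_000_000_000       # decimal GB assumed

file_rows_per_year = FILES_PER_DAY * 365
version_rows_per_year = file_rows_per_year * VERSIONS_PER_FILE

files_gb = file_rows_per_year * FILE_ROW_BYTES / BYTES_PER_GB
versions_gb = version_rows_per_year * VERSION_ROW_BYTES / BYTES_PER_GB

print(f"Files table: {file_rows_per_year:,} rows/year, {files_gb:g} GB")
print(f"File versions table: {version_rows_per_year:,.0f} rows/year, {versions_gb:g} GB")
```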

API design

File APIs:

POST /files – Create a new file upload (get upload URL)

PUT /uploads/{upload_id}/chunks/{part_number} – Upload a chunk

POST /uploads/{upload_id}/complete – Finalize chunked upload

GET /files – List user’s files

GET /files/{file_id}/download – Download a file

DELETE /files/{file_id} – Delete a file

Versioning:

GET /files/{file_id}/versions – List versions of a file

GET /files/{file_id}/versions/{version_id} – Get a specific version

Sync & Devices:

POST /devices – Register a new device

GET /sync – Get list of files needing sync

POST /sync/ack – Acknowledge synced files

Jobs:

GET /jobs/{job_id} – Get job (e.g. virus scan) status
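
To show how the upload endpoints fit together, here is a hedged sketch of the calls a client might make after POST /files has returned an upload ID. The function name, the 1-based part numbering, and the fixed chunk size are assumptions made for illustration; only the paths come from the list above.

```python
def plan_chunked_upload(upload_id: str, file_size: int, chunk_size: int) -> list[tuple[str, str]]:
    """Return the (method, path) calls that follow POST /files for one upload."""
    if file_size <= 0 or chunk_size <= 0:
        raise ValueError("file_size and chunk_size must be positive")
    part_count = -(-file_size // chunk_size)  # ceiling division
    # One PUT per chunk; part numbers are assumed to start at 1.
    calls = [("PUT", f"/uploads/{upload_id}/chunks/{n}") for n in range(1, part_count + 1)]
    # A final POST tells the server that every chunk has been sent.
    calls.append(("POST", f"/uploads/{upload_id}/complete"))
    return calls


for method, path in plan_chunked_upload("u1", file_size=10, chunk_size=4):
    print(method, path)
```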

Database design

Database tables:

Table users {

user_id <- key

}

Table files {

file_id <- key

user_id <- reference to users

name

type

size

version

created_at

updated_at

}

Table file_version {

version_id <- key

version

file_id <- reference to files

blob_path

status

checksum

uploaded_at

user_id <- reference to users

}

Table chunks {

chunk_id <- key

version_id <- reference to file_version

number

blob_path

created_at

size

}
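
The tables above can be tried out as a runnable schema. This sketch uses Python's built-in sqlite3 module as a stand-in for Postgres; the column types, the NOT NULL constraints, and the UNIQUE (version_id, number) constraint are assumptions, and a Postgres version would likely use types such as BIGINT and TIMESTAMPTZ instead.

```python
import sqlite3

# Column names follow the tables above; types and constraints are assumed.
SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY
);
CREATE TABLE files (
    file_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    name       TEXT NOT NULL,
    type       TEXT,
    size       INTEGER,
    version    INTEGER,
    created_at TEXT,
    updated_at TEXT
);
CREATE TABLE file_version (
    version_id  INTEGER PRIMARY KEY,
    version     INTEGER NOT NULL,
    file_id     INTEGER NOT NULL REFERENCES files(file_id),
    blob_path   TEXT,
    status      TEXT,
    checksum    TEXT,
    uploaded_at TEXT,
    user_id     INTEGER NOT NULL REFERENCES users(user_id)
);
CREATE TABLE chunks (
    chunk_id   INTEGER PRIMARY KEY,
    version_id INTEGER NOT NULL REFERENCES file_version(version_id),
    number     INTEGER NOT NULL,
    blob_path  TEXT,
    created_at TEXT,
    size       INTEGER,
    UNIQUE (version_id, number)
);
"""


def create_schema() -> sqlite3.Connection:
    """Create the four tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
    conn.executescript(SCHEMA)
    return conn
```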

High-level design

CDN - caches and accelerates delivery of static assets and downloads.
API Gateway - load balancer, router, auth, rate limiter and SSL termination.
File uploader - writes file metadata to the Metadata DB and returns the chunk uploader URL to the client.
Chunk uploader - writes file chunks to the blob store.
Blob store - stores user content (AWS S3).
Metadata DB - stores file metadata (Postgres).
Sync service - synchronizes data on the server and on the client device.
Async jobs - run background tasks and notify all devices of changes.

Request flows

Entry Point

Client interacts with the system.
CDN (Content Delivery Network) is used for caching and accelerating delivery of static assets or downloads.
API Gateway serves as the main entry point, routing requests to appropriate services.

Upload Flow

Client initiates an upload request via the API Gateway.
API Gateway routes the request to the File Upload Service.
File Upload Service:
- Registers the upload and generates an uploadID.
- Calculates chunking strategy based on file size.
- Returns chunk upload URLs (or endpoint) to the client.
Client uploads file chunks directly to the Chunk Uploader.
Chunk Uploader:
- Stores chunks in the Blob Store.
- Records upload metadata (e.g., file name, size, parts) in the Metadata DB.
Once upload is complete:
- Metadata and event notifications are sent to the Sync Service and/or published to the Message Queue.
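
The chunking and metadata steps of this flow can be sketched in memory. Everything here is illustrative: InMemoryBlobStore is a hypothetical stand-in for the Blob Store, the 4-byte chunk size is only for demonstration, and the metadata layout is an assumption that loosely mirrors the chunks table.

```python
import hashlib


class InMemoryBlobStore:
    """Hypothetical stand-in for the Blob Store (S3 in this design)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self.blobs[path] = data


def upload_file(data: bytes, upload_id: str, store: InMemoryBlobStore,
                chunk_size: int = 4) -> dict:
    """Split data into chunks, store each one, and return the metadata record."""
    parts = []
    for number, offset in enumerate(range(0, len(data), chunk_size), start=1):
        chunk = data[offset:offset + chunk_size]
        blob_path = f"{upload_id}/{number}"
        store.put(blob_path, chunk)
        parts.append({
            "number": number,
            "blob_path": blob_path,
            "size": len(chunk),
            "checksum": hashlib.sha256(chunk).hexdigest(),
        })
    # The whole-file checksum lets the server verify the upload on completion.
    return {
        "upload_id": upload_id,
        "size": len(data),
        "checksum": hashlib.sha256(data).hexdigest(),
        "parts": parts,
    }
```

In a real deployment the chunk size would be in the megabyte range, and the returned record would be written to the Metadata DB rather than handed back to the caller.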

Download Flow

Client sends a download request via the API Gateway.
API Gateway routes the request to the File Download Service.
File Download Service:
- Fetches file metadata from the Metadata DB.
- Coordinates with Chunk Downloader to retrieve individual file chunks from the Blob Store.
Chunks are reassembled and streamed back to the client.
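
Reassembly comes down to reading chunk rows in part order and checking the result. The sketch below assumes chunk rows shaped like the chunks table (number, blob_path) and a plain dict in place of the Blob Store; verifying a SHA-256 checksum at the end is an assumption about how integrity might be checked.

```python
import hashlib
from typing import Iterator


def stream_chunks(chunk_rows: list[dict], blobs: dict[str, bytes]) -> Iterator[bytes]:
    """Yield chunk contents in part order, regardless of row order."""
    for row in sorted(chunk_rows, key=lambda r: r["number"]):
        yield blobs[row["blob_path"]]


def download_file(chunk_rows: list[dict], blobs: dict[str, bytes],
                  expected_checksum: str) -> bytes:
    """Reassemble a file and verify it against the stored checksum."""
    data = b"".join(stream_chunks(chunk_rows, blobs))
    if hashlib.sha256(data).hexdigest() != expected_checksum:
        raise ValueError("checksum mismatch: file is corrupted or incomplete")
    return data
```

A streaming service would forward each yielded chunk to the client as it arrives instead of joining everything in memory first.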

Syncing & Background Processing

Sync Service maintains consistency across devices, triggering updates or syncing actions as needed when files change.
Async Jobs perform background processing such as:
- Virus scanning
- Preview/thumbnail generation
- File indexing or OCR
These jobs are triggered via events published to the Message Queue, ensuring non-blocking and scalable task execution.
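
The event-driven pattern can be illustrated with the standard library's queue module standing in for the Message Queue. The handler registry, the job names, and the event fields are all hypothetical; the point is that the upload path only enqueues an event and the workers run later.

```python
import queue
from typing import Callable

# Registry of background jobs; every handler sees every event in this sketch.
HANDLERS: dict[str, Callable[[dict], str]] = {}


def job(name: str):
    """Decorator that registers a handler under a job name."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[name] = func
        return func
    return register


@job("virus_scan")
def virus_scan(event: dict) -> str:
    return f"scanned file {event['file_id']}"


@job("thumbnail")
def thumbnail(event: dict) -> str:
    return f"rendered preview for file {event['file_id']}"


def publish(q: queue.Queue, event: dict) -> None:
    """Called on the upload path; returns as soon as the event is enqueued."""
    q.put(event)


def drain(q: queue.Queue) -> list[tuple[str, str]]:
    """Worker loop: run every registered job for each queued event."""
    results = []
    while True:
        try:
            event = q.get_nowait()
        except queue.Empty:
            return results
        for name, handler in HANDLERS.items():
            results.append((name, handler(event)))
        q.task_done()
```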

Detailed component design

API Gateway – Routes traffic, handles SSL, rate limiting.
Auth Service – Verifies users/devices via JWT.
File Upload Service – Manages file creation, versions, metadata.
Chunk Uploader – Handles actual file uploads to blob storage.
Blob Store (S3) – Stores raw file data (blobs).
Metadata DB – Stores file info, versions, sync state, jobs.
Sync Service – Tracks device sync state, lists deltas.
Message Queue (Kafka/SQS) – Handles async jobs (scan, sync).
Worker Services – Process background tasks like virus scanning.
Device Manager (optional) – Manages user devices.
Notification Service (optional) – Pushes updates to devices.

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

1. Chunk Upload Failures

Interrupted network, client crash, or chunk corruption.
Mitigation: Retry mechanism, chunk checksums, resumable uploads.
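
A hedged sketch of the retry and resume ideas: a chunk upload is retried with exponential backoff, and a resumed upload sends only the parts the server has not yet recorded. The send callback, the use of ConnectionError as the transient failure, and the delay values are assumptions.

```python
import time
from typing import Callable


def upload_with_retry(send: Callable[[int, bytes], str], part_number: int, data: bytes,
                      max_attempts: int = 3, base_delay: float = 0.1) -> str:
    """Call send(part_number, data), retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(part_number, data)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.1 s, 0.2 s, 0.4 s, ...
    raise ValueError("max_attempts must be at least 1")


def missing_parts(total_parts: int, uploaded: set[int]) -> list[int]:
    """Parts still to send when resuming an interrupted upload."""
    return [n for n in range(1, total_parts + 1) if n not in uploaded]
```

Re-sending a part is safe only if the server treats a repeated PUT for the same part number as an overwrite, which this sketch assumes.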

2. Metadata DB Outage

Failure in the SQL database (PostgreSQL/MySQL) blocks uploads, sync, versioning.
Mitigation: DB replicas, caching reads, partitioning, connection pooling.

3. Blob Store Unavailable

If the blob store (S3/MinIO) is down, file uploads fail and downloads break.
Mitigation: Retry logic, fallback storage region, temporary queuing.

4. Message Queue Backup

Queue gets overloaded (slow workers, sudden traffic spike).
Mitigation: Auto-scaling consumers, DLQ (dead-letter queues), rate limiting.
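
The dead-letter idea can be shown with a small in-process loop: a message that keeps failing is retried a bounded number of times and then set aside so it cannot block the rest of the queue. The function name and the limit of three attempts are assumptions for illustration.

```python
from collections import deque
from typing import Callable


def process_with_dlq(messages: list[str], handler: Callable[[str], None],
                     max_attempts: int = 3) -> tuple[list[str], list[str]]:
    """Process messages; return (processed, dead_lettered)."""
    pending = deque((message, 0) for message in messages)  # (message, failed attempts)
    processed: list[str] = []
    dead_letter: list[str] = []
    while pending:
        message, attempts = pending.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letter.append(message)  # park it for inspection
            else:
                pending.append((message, attempts))  # retry after the others
    return processed, dead_letter
```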

5. Worker Failures

Virus scan, sync propagation, or preview workers crash or misbehave.
Mitigation: Health checks, retries, circuit breakers.

6. File Sync Conflicts

Same file updated on two devices before sync.
Mitigation: Conflict resolution strategies (e.g., last-writer-wins, version merge).
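
A minimal sketch of last-writer-wins, assuming each device reports the server version it started from (base_version) and a modification timestamp. The Edit type and its field names are hypothetical. Last-writer-wins depends on device clocks and discards the losing edit, so a system may prefer to keep the loser as a separate version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    device_id: str
    base_version: int   # server version the device last synced
    modified_at: float  # device-reported timestamp (seconds)
    checksum: str


def is_conflict(a: Edit, b: Edit) -> bool:
    """Two edits conflict when they start from the same version but differ."""
    return a.base_version == b.base_version and a.checksum != b.checksum


def last_writer_wins(a: Edit, b: Edit) -> tuple[Edit, Edit]:
    """Return (winner, loser); ties on time are broken by device_id."""
    winner = max(a, b, key=lambda e: (e.modified_at, e.device_id))
    loser = b if winner is a else a
    return winner, loser
```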

7. API Gateway Overload

Too many concurrent users, slow downstream services.
Mitigation: Load balancing, autoscaling, rate limiting, connection timeout tuning.

8. Storage Limits Reached

Disk full on blob store or DB volume.
Mitigation: Monitoring, autoscaling volumes, cold storage tiering.

9. Authentication Failures

If the token service is down, all API calls fail.
Mitigation: Token caching, multiple auth nodes, fallback tokens.

10. Device Desync

Device misses events (offline too long, sync fails).
Mitigation: Sync checkpoints, incremental sync, replay missing deltas.
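
Checkpoints and delta replay can be sketched with an append-only change log and a per-device cursor. The ChangeLog class, the sequence numbering, and the page size are assumptions; in this design the cursor would be what GET /sync reads and POST /sync/ack advances.

```python
from typing import Callable


class ChangeLog:
    """Append-only log of file changes; sequence numbers start at 1."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def append(self, file_id: int, action: str) -> int:
        seq = len(self._events) + 1
        self._events.append({"seq": seq, "file_id": file_id, "action": action})
        return seq

    def since(self, cursor: int, limit: int = 100) -> list[dict]:
        """Events with seq greater than cursor, at most limit of them."""
        return self._events[cursor:cursor + limit]


def catch_up(log: ChangeLog, cursor: int, apply: Callable[[dict], None],
             page_size: int = 2) -> int:
    """Replay missed events page by page and return the new checkpoint."""
    while True:
        page = log.since(cursor, page_size)
        if not page:
            return cursor
        for event in page:
            apply(event)
        cursor = page[-1]["seq"]  # checkpoint after each page is applied
```

A device that has been offline simply calls catch_up with its last stored cursor, so it receives only the deltas it missed.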

Future improvements

1. Delta Sync (Block-Level)

Sync only the changed parts of large files (e.g., using an rsync-style algorithm or binary diff).
Greatly reduces upload/download costs.
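
As a simplified illustration, the sketch below compares SHA-256 hashes of fixed-size blocks to find which blocks need to be sent. This is not the rsync algorithm: rsync uses rolling checksums so that insertions do not shift every later block, whereas fixed offsets only handle in-place changes and appends well. The 4-byte block size is for demonstration only.

```python
import hashlib


def block_hashes(data: bytes, block_size: int) -> list[str]:
    """SHA-256 digest of each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old: bytes, new: bytes, block_size: int = 4) -> list[int]:
    """Indexes of blocks in new that differ from old or lie beyond its end."""
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [i for i, digest in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != digest]
```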

2. End-to-End Encryption

Encrypt files on client side before upload.
Improves privacy even from backend access.

3. Content Deduplication

Avoid storing duplicate file content using hash-based checks (e.g. SHA-256).
Saves storage and bandwidth.
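
One way to picture hash-based deduplication is a content-addressed store, where the SHA-256 digest of the content is the storage key. The DedupStore class and its reference counting are assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Content-addressed store: identical content is kept only once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}
        self.refcount: dict[str, int] = {}

    def put(self, data: bytes) -> str:
        """Store data if it is new and return its content hash."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data  # first copy: actually store the bytes
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key
```

The reference count lets the store delete a blob only when no file version points at it any longer. A client could also send the hash first and skip the upload when the server already holds that content, which is where the bandwidth saving comes from.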

4. Cold Storage Tiering

Move old versions to cheaper, slower storage (e.g., S3 Glacier).
Reduces cost for long-term retention.

5. Preview & Thumbnail Generation

Auto-generate image previews, PDF pages, video snapshots.
Improves UX on web/mobile clients.

6. Real-Time Sync with WebSockets

Use WebSockets or gRPC streams for instant device updates instead of polling.

7. Multi-Region Sync Support

Replicate blob data and metadata across regions for better latency and availability.

8. User & Team Sharing

Add sharing permissions, folder collaboration, and public links.

9. Audit Logging

Log user activity: uploads, downloads, deletes, syncs.
Useful for enterprise use cases.

10. Admin Dashboard & Analytics

File stats, usage trends, sync failures, storage consumption.