System requirements
Functional:
- Users can upload and download files.
- Users can modify file content.
- Users can share files with permissions (e.g., read-only, write-only).
Non-Functional:
- Availability.
- Durability.
- Scalability.
- Consistency.
- Security.
- Low latency.
Capacity estimation
Blob Storage (File Content)
- Average file size: 10 MB
- Uploads per user per day: 10
- Active users: 100,000
- Daily upload volume: 10 TB/day
- Annual storage needed: ~3.65 PB/year
Metadata Database
a. Files table:
- Rows/year: ~365 million
- Row size: ~300 bytes
- Total: ~110 GB/year
b. File versions table:
- Rows/year: ~550 million (assuming 1.5 versions per file)
- Row size: ~400 bytes
- Total: ~220 GB/year
c. Other tables:
- Estimated size: ~100–150 GB
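The estimates above can be re-derived with a few lines of arithmetic. This is only a sanity check using the stated inputs; decimal units (1 TB = 10^12 bytes) and the constant names are assumptions of this sketch.

```python
# Re-derive the capacity estimates from the stated inputs.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15

avg_file_size = 10 * MB
uploads_per_user_per_day = 10
active_users = 100_000

# Blob storage
files_per_day = uploads_per_user_per_day * active_users
daily_bytes = files_per_day * avg_file_size
annual_bytes = daily_bytes * 365

# Metadata: files table (~300 bytes per row)
file_rows_per_year = files_per_day * 365
files_table_bytes = file_rows_per_year * 300

# Metadata: file versions table (~400 bytes per row, ~550 million rows)
version_rows_per_year = 550_000_000
versions_table_bytes = version_rows_per_year * 400

print(f"daily upload volume : {daily_bytes / TB:.2f} TB")
print(f"annual blob storage : {annual_bytes / PB:.2f} PB")
print(f"files table         : {files_table_bytes / GB:.1f} GB/year")
print(f"file versions table : {versions_table_bytes / GB:.1f} GB/year")
```

The files table comes out at 109.5 GB, which the estimate rounds to ~110 GB.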
API design
File APIs:
POST /files – Create a new file upload (get upload URL)
PUT /uploads/{upload_id}/chunks/{part_number} – Upload a chunk
POST /uploads/{upload_id}/complete – Finalize chunked upload
GET /files – List user’s files
GET /files/{file_id}/download – Download a file
DELETE /files/{file_id} – Delete a file
Versioning:
GET /files/{file_id}/versions – List versions of a file
GET /files/{file_id}/versions/{version_id} – Get a specific version
Sync & Devices:
POST /devices – Register a new device
GET /sync – Get list of files needing sync
POST /sync/ack – Acknowledge synced files
Jobs:
GET /jobs/{job_id} – Get job (e.g. virus scan) status
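As an illustration of how the file APIs fit together, the sketch below lists the calls a client might make for one chunked upload. It assumes that POST /files returns the upload_id, that part numbers start at 1, and that the chunk size is chosen by the server; the function name and the 4 MB chunk size are hypothetical.

```python
def upload_call_plan(file_size: int, chunk_size: int, upload_id: str):
    """Return the ordered (method, path) calls for one chunked upload.

    Assumptions of this sketch: part numbers start at 1 and upload_id
    was returned by the initial POST /files call.
    """
    if file_size <= 0 or chunk_size <= 0:
        raise ValueError("file_size and chunk_size must be positive")
    parts = (file_size + chunk_size - 1) // chunk_size  # ceiling division
    calls = [("POST", "/files")]
    calls += [
        ("PUT", f"/uploads/{upload_id}/chunks/{n}") for n in range(1, parts + 1)
    ]
    calls.append(("POST", f"/uploads/{upload_id}/complete"))
    return calls


# A 10 MB file with a hypothetical 4 MB chunk size needs three chunks.
plan = upload_call_plan(10_000_000, 4_000_000, "u1")
for method, path in plan:
    print(method, path)
```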
Database design
Database tables:
Table users {
user_id <- key
}
Table files {
file_id <- key
user_id <- reference to users
name
type
size
version
created_at
updated_at
}
Table file_version {
version_id <- key
version
file_id <- reference to files
blob_path
status
checksum
uploaded_at
user_id <- reference to users
}
Table chunks {
chunk_id <- key
version_id <- reference to file_version
number
blob_path
created_at
size
}
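The tables above could be expressed as a relational schema roughly as follows. This is a sketch using SQLite in memory so it can be run directly; the column types, the NOT NULL constraints, and the UNIQUE (version_id, number) constraint are assumptions, since the outline lists only column names and keys.

```python
import sqlite3

# Column names follow the outline above; types and constraints are assumed.
SCHEMA = """
CREATE TABLE users (
    user_id     INTEGER PRIMARY KEY
);
CREATE TABLE files (
    file_id     INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    name        TEXT NOT NULL,
    type        TEXT,
    size        INTEGER,
    version     INTEGER,
    created_at  TEXT,
    updated_at  TEXT
);
CREATE TABLE file_version (
    version_id  INTEGER PRIMARY KEY,
    version     INTEGER NOT NULL,
    file_id     INTEGER NOT NULL REFERENCES files(file_id),
    blob_path   TEXT,
    status      TEXT,
    checksum    TEXT,
    uploaded_at TEXT,
    user_id     INTEGER NOT NULL REFERENCES users(user_id)
);
CREATE TABLE chunks (
    chunk_id    INTEGER PRIMARY KEY,
    version_id  INTEGER NOT NULL REFERENCES file_version(version_id),
    number      INTEGER NOT NULL,
    blob_path   TEXT,
    created_at  TEXT,
    size        INTEGER,
    UNIQUE (version_id, number)  -- assumed: one row per part of a version
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Insert one file with one version made of three chunks.
conn.execute("INSERT INTO users (user_id) VALUES (1)")
conn.execute(
    "INSERT INTO files (file_id, user_id, name, size, version) "
    "VALUES (1, 1, 'report.pdf', 10000000, 1)"
)
conn.execute(
    "INSERT INTO file_version (version_id, version, file_id, status, user_id) "
    "VALUES (1, 1, 1, 'uploading', 1)"
)
conn.executemany(
    "INSERT INTO chunks (chunk_id, version_id, number, size) VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 4000000), (2, 1, 2, 4000000), (3, 1, 3, 2000000)],
)

# Chunks of a version are read back in part order.
rows = conn.execute(
    "SELECT number, size FROM chunks WHERE version_id = 1 ORDER BY number"
).fetchall()
print(rows)
```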
High-level design
- CDN - caches and accelerates delivery of static assets and downloads.
- API Gateway - load balancer, router, auth, rate limiter and SSL termination.
- File uploader - writes file metadata to the Metadata DB and returns the chunk uploader URL to the client.
- Chunk uploader - uploads data to the blob store.
- Blob store - stores user content (AWS S3).
- Metadata DB - stores file metadata (Postgres).
- Sync service - keeps data on the server and on client devices in sync.
- Async jobs - notify all devices of changes.
Request flows
Entry Point
- Client interacts with the system.
- CDN (Content Delivery Network) is used for caching and accelerating delivery of static assets or downloads.
- API Gateway serves as the main entry point, routing requests to appropriate services.
Upload Flow
- Client initiates an upload request via the API Gateway.
- API Gateway routes the request to the File Upload Service.
- File Upload Service:
  - Registers the upload and generates an upload_id.
  - Calculates the chunking strategy based on file size.
  - Returns chunk upload URLs (or a single endpoint) to the client.
- Client uploads file chunks directly to the Chunk Uploader.
- Chunk Uploader:
  - Stores chunks in the Blob Store.
  - Records upload metadata (e.g., file name, size, parts) in the Metadata DB.
- Once the upload is complete:
  - Metadata and event notifications are sent to the Sync Service and/or published to the Message Queue.
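The upload flow above can be sketched end to end with in-memory stand-ins. The dictionaries replace the Blob Store and Metadata DB, a list replaces the Message Queue, and the function names, the fixed chunk size, and the "file.uploaded" event name are all assumptions of this sketch rather than part of the design.

```python
import hashlib
import uuid

CHUNK_SIZE = 4   # bytes; deliberately tiny so the example stays readable

blob_store = {}   # blob_path -> bytes (stand-in for the Blob Store)
metadata_db = {}  # upload_id -> upload record (stand-in for the Metadata DB)
events = []       # stand-in for the Message Queue


def start_upload(name: str, size: int):
    """File Upload Service: register the upload and plan its chunks."""
    upload_id = str(uuid.uuid4())
    parts = (size + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceiling division
    metadata_db[upload_id] = {
        "name": name, "size": size, "parts": parts,
        "received": {}, "status": "uploading",
    }
    return upload_id, parts


def upload_chunk(upload_id: str, part_number: int, data: bytes) -> None:
    """Chunk Uploader: store the chunk and record where it went."""
    blob_path = f"{upload_id}/{part_number}"
    blob_store[blob_path] = data
    metadata_db[upload_id]["received"][part_number] = blob_path


def complete_upload(upload_id: str) -> str:
    """Finalize: check all parts arrived, checksum them, publish an event."""
    record = metadata_db[upload_id]
    missing = [n for n in range(1, record["parts"] + 1)
               if n not in record["received"]]
    if missing:
        raise ValueError(f"missing parts: {missing}")
    digest = hashlib.sha256()
    for n in range(1, record["parts"] + 1):
        digest.update(blob_store[record["received"][n]])
    record["checksum"] = digest.hexdigest()
    record["status"] = "complete"
    events.append({"type": "file.uploaded", "upload_id": upload_id})
    return record["checksum"]


# Client side: split the content and upload each part in order.
content = b"hello, chunked world"
upload_id, parts = start_upload("hello.txt", len(content))
for n in range(1, parts + 1):
    upload_chunk(upload_id, n, content[(n - 1) * CHUNK_SIZE : n * CHUNK_SIZE])
checksum = complete_upload(upload_id)
```

Because each part is stored under its own key, a client that is interrupted can re-send only the parts that are not yet recorded.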
Download Flow
- Client sends a download request via the API Gateway.
- API Gateway routes the request to the File Download Service.
- File Download Service:
  - Fetches file metadata from the Metadata DB.
  - Coordinates with Chunk Downloader to retrieve individual file chunks from the Blob Store.
- Chunks are reassembled and streamed back to the client.
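A minimal sketch of the reassembly step, assuming the chunk rows carry a part number and a blob path (as in the chunks table) and that the version record stores a SHA-256 checksum of the whole file. The sample data and function names are hypothetical.

```python
import hashlib
from typing import Iterator

# Stand-ins for the Blob Store and for chunk rows read from the Metadata DB.
blob_store = {"v1/1": b"hello, ", "v1/2": b"chunked ", "v1/3": b"world"}
chunk_rows = [
    {"number": 3, "blob_path": "v1/3"},  # rows may arrive in any order
    {"number": 1, "blob_path": "v1/1"},
    {"number": 2, "blob_path": "v1/2"},
]
expected_checksum = hashlib.sha256(b"hello, chunked world").hexdigest()


def stream_file(rows, store) -> Iterator[bytes]:
    """Yield chunk contents in part-number order."""
    for row in sorted(rows, key=lambda r: r["number"]):
        yield store[row["blob_path"]]


def download(rows, store, checksum: str) -> bytes:
    """Reassemble the file and verify it against the stored checksum."""
    digest = hashlib.sha256()
    pieces = []
    for data in stream_file(rows, store):
        digest.update(data)
        pieces.append(data)
    if digest.hexdigest() != checksum:
        raise ValueError("checksum mismatch")
    return b"".join(pieces)


data = download(chunk_rows, blob_store, expected_checksum)
print(data)
```

A real service would stream chunks to the client as they arrive rather than buffer the whole file; the buffering here only keeps the example short.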
Syncing & Background Processing
- Sync Service maintains consistency across devices, triggering updates or syncing actions as needed when files change.
- Async Jobs perform background processing such as:
  - Virus scanning
  - Preview/thumbnail generation
  - File indexing or OCR
- These jobs are triggered via events published to the Message Queue, ensuring non-blocking and scalable task execution.
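The event-driven job pattern can be illustrated with an in-process queue and one worker thread. This is a sketch only: queue.Queue stands in for the Message Queue, and the event name and handler functions are hypothetical placeholders for real scanning, preview, and indexing code.

```python
import queue
import threading

job_queue = queue.Queue()   # stand-in for the Message Queue
results = []                # what the workers produced, in processing order


def virus_scan(event):
    return f"scanned {event['file_id']}"


def make_thumbnail(event):
    return f"thumbnail for {event['file_id']}"


def index_file(event):
    return f"indexed {event['file_id']}"


# Which jobs run for which event type (assumed mapping).
HANDLERS = {"file.uploaded": [virus_scan, make_thumbnail, index_file]}


def worker():
    """Consume events until a None sentinel arrives."""
    while True:
        event = job_queue.get()
        try:
            if event is None:
                return
            for handler in HANDLERS.get(event["type"], []):
                results.append(handler(event))
        finally:
            job_queue.task_done()


def publish(event):
    """Called on the request path; returns without waiting for the jobs."""
    job_queue.put(event)


thread = threading.Thread(target=worker, daemon=True)
thread.start()
publish({"type": "file.uploaded", "file_id": "f1"})
job_queue.put(None)   # tell the worker to stop once the queue is drained
job_queue.join()
thread.join()
print(results)
```

Scaling out then means running more workers against the same queue, which the upload path does not need to know about.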
Detailed component design
- API Gateway – Routes traffic, handles SSL, rate limiting.
- Auth Service – Verifies users/devices via JWT.
- File Upload Service – Manages file creation, versions, metadata.
- Chunk Uploader – Handles actual file uploads to blob storage.
- Blob Store (S3) – Stores raw file data (blobs).
- Metadata DB – Stores file info, versions, sync state, jobs.
- Sync Service – Tracks device sync state, lists deltas.
- Message Queue (Kafka/SQS) – Handles async jobs (scan, sync).
- Worker Services – Process background tasks like virus scanning.
- Device Manager (optional) – Manages user devices.
- Notification Service (optional) – Pushes updates to devices.
Trade-offs/Tech choices
Failure scenarios/bottlenecks
1. Chunk Upload Failures
- Interrupted network, client crash, or chunk corruption.
- Mitigation: Retry mechanism, chunk checksums, resumable uploads.
2. Metadata DB Outage
- Failure of the metadata database (PostgreSQL) blocks uploads, sync, and versioning.
- Mitigation: DB replicas, read caching, partitioning, connection pooling.
3. Blob Store Unavailable
- If the blob store (S3 or an S3-compatible store such as MinIO) is down, uploads and downloads fail.
- Mitigation: Retry logic, fallback storage region, temporary queuing.
4. Message Queue Backlog
- The queue backs up when workers are slow or traffic spikes suddenly.
- Mitigation: Auto-scaling consumers, DLQ (dead-letter queues), rate limiting.
5. Worker Failures
- Virus scan, sync propagation, or preview workers crash or misbehave.
- Mitigation: Health checks, retries, circuit breakers.
6. File Sync Conflicts
- Same file updated on two devices before sync.
- Mitigation: Conflict resolution strategies (e.g., last-writer-wins, version merge).
7. API Gateway Overload
- Too many concurrent users, slow downstream services.
- Mitigation: Load balancing, autoscaling, rate limiting, connection timeout tuning.
8. Storage Limits Reached
- Disk full on blob store or DB volume.
- Mitigation: Monitoring, autoscaling volumes, cold storage tiering.
9. Authentication Failures
- If the token service is down, all API calls fail.
- Mitigation: Token caching, multiple auth nodes, fallback tokens.
10. Device Desync
- Device misses events (offline too long, sync fails).
- Mitigation: Sync checkpoints, incremental sync, replay missing deltas.
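The checkpoint-and-replay mitigation for device desync (scenario 10) could look roughly like this. The sketch assumes an append-only change log with increasing sequence numbers and one cursor per device; the class and method names are hypothetical, with pending and ack loosely corresponding to GET /sync and POST /sync/ack.

```python
from dataclasses import dataclass, field


@dataclass
class SyncService:
    # Append-only log of (seq, file_id, action); seq increases by 1 per change.
    change_log: list = field(default_factory=list)
    # device_id -> highest sequence number the device has acknowledged.
    cursors: dict = field(default_factory=dict)

    def record_change(self, file_id: str, action: str) -> int:
        seq = len(self.change_log) + 1
        self.change_log.append((seq, file_id, action))
        return seq

    def register_device(self, device_id: str) -> None:
        self.cursors.setdefault(device_id, 0)

    def pending(self, device_id: str) -> list:
        """Deltas the device has not acknowledged yet."""
        cursor = self.cursors[device_id]
        return [change for change in self.change_log if change[0] > cursor]

    def ack(self, device_id: str, seq: int) -> None:
        """Advance the checkpoint; it never moves backwards."""
        self.cursors[device_id] = max(self.cursors[device_id], seq)


svc = SyncService()
svc.register_device("laptop")

# Three changes happen while the laptop is offline.
svc.record_change("f1", "created")
svc.record_change("f2", "created")
svc.record_change("f1", "updated")

missed = svc.pending("laptop")   # all three deltas are replayed
svc.ack("laptop", 2)             # the device confirms the first two
remaining = svc.pending("laptop")
print(missed, remaining)
```

Because the cursor only advances on acknowledgement, a device that crashes mid-sync receives the same deltas again on its next request.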
Future improvements
1. Delta Sync (Block-Level)
- Sync only changed parts of large files (e.g., using rsync or binary diff).
- Greatly reduces upload/download costs.
2. End-to-End Encryption
- Encrypt files on client side before upload.
- Improves privacy even from backend access.
3. Content Deduplication
- Avoid storing duplicate file content using hash-based checks (e.g. SHA-256).
- Saves storage and bandwidth.
4. Cold Storage Tiering
- Move old versions to cheaper, slower storage (e.g., S3 Glacier).
- Reduces cost for long-term retention.
5. Preview & Thumbnail Generation
- Auto-generate image previews, PDF pages, video snapshots.
- Improves UX on web/mobile clients.
6. Real-Time Sync with WebSockets
- Use WebSockets or gRPC streams for instant device updates instead of polling.
7. Multi-Region Sync Support
- Replicate blob data and metadata across regions for better latency and availability.
8. User & Team Sharing
- Add sharing permissions, folder collaboration, and public links.
9. Audit Logging
- Log user activity: uploads, downloads, deletes, syncs.
- Useful for enterprise use cases.
10. Admin Dashboard & Analytics
- File stats, usage trends, sync failures, storage consumption.
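As one illustration of the improvements above, content deduplication (item 3) can be sketched as a content-addressed store keyed by SHA-256. The class name and the reference-counting detail are assumptions of this sketch; a real system would also need to handle deletion and might deduplicate per chunk rather than per file.

```python
import hashlib


class DedupStore:
    """Content-addressed blob store: identical content is stored once."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> bytes
        self.refs = {}   # sha256 hex digest -> number of files using it

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data  # first copy: actually store it
        self.refs[key] = self.refs.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
key_a = store.put(b"quarterly report")
key_b = store.put(b"quarterly report")   # same content from another user
key_c = store.put(b"holiday photo")

print(key_a == key_b, store.stored_bytes())
```

A client could also send the hash first and skip the upload entirely when the server already holds that content, which is where the bandwidth saving comes from.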