Codemia | Master System Design Interviews Through Active Practice

My Solution for Design Dropbox with Score: 8/10

by iridescent_luminous693

System requirements

Functional Requirements

File Management:
- Upload, download, delete, and update files.
- Support folder structures for organizing files.
- Allow sharing files or folders with other users.
- Enable file versioning and recovery.
Synchronization:
- Sync files across multiple devices in real time.
- Detect and resolve conflicts during sync operations.
Access Control:
- Manage user permissions for files and folders (read/write/share).
- Support public links with optional expiration and passwords.
Search and Metadata:
- Allow users to search files by name, type, or content.
- Store and retrieve file metadata (e.g., size, type, owner, last modified).
Notifications:
- Notify users about shared files, changes, or sync updates.
Offline Access:
- Cache frequently accessed files for offline use.

Non-Functional Requirements

Scalability:
- Handle millions of users and petabytes of data efficiently.
- Scale storage and metadata services dynamically.
Availability:
- Ensure 99.99% uptime for continuous file access and sync operations.
Performance:
- Low latency for file uploads/downloads and real-time sync.
- Efficient handling of large files and bulk operations.
Reliability:
- Prevent data loss through redundancy and replication.
- Ensure strong data integrity with checksums and consistency mechanisms.
Security:
- Encrypt data at rest and in transit.
- Implement strong access control and audit trails.
Data Durability:
- Ensure a durability rate of 99.999999999% (11 nines) for stored files.
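
As a rough illustration of what the availability target above implies, the sketch below converts an uptime percentage into a yearly downtime budget. The function name is hypothetical and a 365-day year is assumed.

```python
def allowed_downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for a given uptime target."""
    minutes_per_year = 365 * 24 * 60  # assumes a 365-day year
    return minutes_per_year * (1 - availability_percent / 100)


downtime_9999 = allowed_downtime_minutes_per_year(99.99)
print(round(downtime_9999, 1))
```

For a 99.99% target this works out to roughly 52.6 minutes of downtime per year.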

Capacity estimation

Assumptions:

Users:
- Total registered users: 500 million.
- Active users per day: 100 million.
Files:
- Average files per user: 1,000.
- Total files: 500 million × 1,000 = 500 billion.
- Average file size: 1 MB.
- Total storage: 500 billion × 1 MB = 500 PB.
Requests:
- Daily file uploads/downloads: 2 billion.
- Average requests per second: 2 billion ÷ 86,400 seconds ≈ 23,148 RPS. Peak load will be higher than this daily average.
Replication:
- Replication factor: 3.
- Total storage with replication: 500 PB × 3 = 1.5 EB.
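
The arithmetic above can be reproduced with a short script. This is only a sketch of the estimate: the constant names are illustrative, and decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) are assumed.

```python
# Inputs taken from the assumptions listed above.
REGISTERED_USERS = 500_000_000
FILES_PER_USER = 1_000
AVG_FILE_SIZE_BYTES = 10**6        # 1 MB, decimal units assumed
DAILY_REQUESTS = 2_000_000_000
REPLICATION_FACTOR = 3

SECONDS_PER_DAY = 86_400
PB = 10**15                        # bytes per petabyte, decimal units assumed

total_files = REGISTERED_USERS * FILES_PER_USER
raw_storage_pb = total_files * AVG_FILE_SIZE_BYTES / PB
replicated_storage_pb = raw_storage_pb * REPLICATION_FACTOR
average_rps = DAILY_REQUESTS / SECONDS_PER_DAY

print(f"Total files: {total_files:,}")
print(f"Raw storage: {raw_storage_pb:,.0f} PB")
print(f"Replicated storage: {replicated_storage_pb / 1000:.1f} EB")
print(f"Average RPS: {average_rps:,.0f}")
```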

API design

1. User Management APIs

POST /api/users/register: Register a new user.
POST /api/users/login: Authenticate user credentials.
GET /api/users/profile: Retrieve user profile details.

2. File Management APIs

POST /api/files/upload: Upload a new file.
GET /api/files/download/{file_id}: Download a specific file.
DELETE /api/files/delete/{file_id}: Delete a file.
GET /api/files/metadata/{file_id}: Retrieve file metadata.

3. Folder Management APIs

POST /api/folders/create: Create a new folder.
GET /api/folders/{folder_id}: Retrieve folder contents.
PUT /api/folders/share: Share a folder with other users.

4. Sync and Conflict Resolution APIs

GET /api/sync/status: Check sync status for a device.
POST /api/sync/conflict/resolve: Resolve file conflicts.

5. Notifications APIs

GET /api/notifications: Fetch user notifications.
POST /api/notifications/mark_as_read: Mark notifications as read.

Database design

1. User Database

Schema Details:
- Table Name: Users
  - user_id (Primary Key): Unique identifier for each user.
  - username: User’s display name.
  - email: User’s email address.
  - password_hash: Hashed password.
  - created_at: Account creation timestamp.
  - last_login: Last login timestamp.
Purpose:
- Manage user accounts, credentials, and activity.
Tech Used:
- Relational Database (e.g., PostgreSQL).
Tradeoff:
- Pros: Strong consistency for user data and authentication.
- Cons: Requires sharding to scale for millions of users.

2. File Metadata Database

Schema Details:
- Table Name: FileMetadata
  - file_id (Primary Key): Unique identifier for each file.
  - user_id (Foreign Key): ID of the file owner.
  - file_name: Name of the file.
  - file_size: Size of the file in bytes.
  - version: Current version of the file.
  - replica_nodes: List of nodes storing replicas.
  - created_at: File creation timestamp.
  - updated_at: Last modified timestamp.
Purpose:
- Store metadata about files and their locations.
Tech Used:
- NoSQL Database (e.g., DynamoDB).
Tradeoff:
- Pros: High scalability and fast lookups for metadata operations.
- Cons: Limited support for complex queries.

3. File Storage System

Schema Details:
- Storage Layer:
  - Object-based storage with file chunks stored by unique IDs.
  - Each chunk includes:
    - chunk_id: Unique identifier.
    - data: Binary data of the file chunk.
    - checksum: Hash for data integrity validation.
Purpose:
- Store and retrieve actual file data efficiently.
Tech Used:
- Distributed storage (e.g., Amazon S3 for object storage, or HDFS as a distributed file system).
Tradeoff:
- Pros: Scalable, durable, and optimized for large files.
- Cons: Higher latency compared to block-based storage.
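
A minimal sketch of the chunk record described above, assuming SHA-256 as the checksum and a 4 MiB chunk size. Neither is specified in this design, and the `Chunk`, `split_into_chunks`, and `verify_chunk` names and the chunk ID format are hypothetical.

```python
import hashlib
from dataclasses import dataclass

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; the design does not fix one


@dataclass(frozen=True)
class Chunk:
    chunk_id: str   # unique identifier
    data: bytes     # binary data of the file chunk
    checksum: str   # SHA-256 hex digest of data (assumed hash function)


def split_into_chunks(file_id: str, content: bytes, chunk_size: int = CHUNK_SIZE) -> list[Chunk]:
    """Split content into fixed-size chunks, each carrying its own checksum."""
    chunks = []
    for index, offset in enumerate(range(0, len(content), chunk_size)):
        data = content[offset:offset + chunk_size]
        checksum = hashlib.sha256(data).hexdigest()
        chunks.append(Chunk(f"{file_id}-{index}", data, checksum))
    return chunks


def verify_chunk(chunk: Chunk) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return hashlib.sha256(chunk.data).hexdigest() == chunk.checksum
```

Verifying the checksum on both write and read lets the storage layer detect a corrupted chunk and fall back to another replica.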

4. Sync and Conflict Resolution Database

Schema Details:
- Table Name: SyncStatus
  - sync_id (Primary Key): Unique identifier for each sync operation.
  - device_id: ID of the device involved in the sync.
  - file_id: ID of the synced file.
  - status: Sync status (e.g., in-progress, completed, failed).
  - last_synced_at: Timestamp of the last sync operation.
Purpose:
- Track sync operations and conflicts across devices.
Tech Used:
- Relational Database (e.g., MySQL).
Tradeoff:
- Pros: Ensures strong consistency for sync state tracking.
- Cons: Requires optimization for high write throughput.

5. Notifications Database

Schema Details:
- Table Name: Notifications
  - notification_id (Primary Key): Unique identifier for each notification.
  - user_id (Foreign Key): Associated user ID.
  - type: Type of notification (e.g., shared file, sync update).
  - content: Notification content.
  - created_at: Timestamp of notification creation.
  - read: Boolean indicating if the notification has been read.
Purpose:
- Manage notifications and alerts for users.
Tech Used:
- NoSQL Database (e.g., MongoDB).
Tradeoff:
- Pros: High scalability for real-time notification storage.
- Cons: Requires periodic cleanup to manage storage costs.

High-level design

1. User Management Service

Overview:

Manages user accounts, authentication, and authorization. It handles operations like registration, login, profile updates, and permission management for file sharing.

Responsibilities:

Authenticate users with secure credentials.
Manage user roles and permissions for files and folders.
Track account activity and manage user-specific settings.

2. File Metadata Service

Overview:

Stores and retrieves metadata about files and folders. This service does not handle the actual file data but tracks details like file size, owner, version, and location.

Responsibilities:

Maintain metadata for files and folders.
Track file versions and their replication locations.
Manage relationships between files and their folders.

3. Storage Service

Overview:

Stores actual file data in a distributed and fault-tolerant manner. Files are split into chunks, and each chunk is replicated across multiple nodes.

Responsibilities:

Store and retrieve file chunks.
Ensure data durability through replication.
Validate data integrity using checksums.

4. Synchronization Service

Overview:

Handles real-time sync of files and folders across user devices. Tracks changes made to files and resolves conflicts in case of simultaneous updates.

Responsibilities:

Monitor file changes and propagate updates to connected devices.
Resolve sync conflicts with user or automated decisions.
Ensure efficient and consistent syncing across devices.

5. Search and Indexing Service

Overview:

Provides search capabilities for files and folders based on metadata or content. It indexes files for fast retrieval.

Responsibilities:

Index files and folders for efficient search.
Support advanced search queries based on content and metadata.
Provide search suggestions and auto-completion.

6. Notification Service

Overview:

Notifies users about file updates, sync statuses, and shared file activity. Handles both real-time and batched notifications.

Responsibilities:

Notify users of changes in shared files or folders.
Send alerts for sync conflicts or errors.
Manage delivery of notifications across devices.

7. Access Control and Security Service

Overview:

Enforces access control rules and provides data security features such as encryption.

Responsibilities:

Encrypt files during transit and at rest.
Manage user and group-level permissions for files.
Log access attempts for auditing and compliance.

8. Monitoring and Logging Service

Overview:

Tracks system health, performance, and user activity. Logs important events for debugging and auditing.

Responsibilities:

Monitor storage nodes for failures or resource exhaustion.
Log user activity and system errors.
Provide metrics for system performance and usage trends.

Request flows

1. File Upload Request

Objective: Upload a new file to the system.

Steps:

User Management Service:
- Validates the user’s authentication token.
- Verifies the user’s upload permissions.
File Metadata Service:
- Generates a unique file ID and plans how the file will be split into chunks (the chunk data itself is handled by the Storage Service).
- Determines storage nodes for each chunk and updates metadata.
Storage Service:
- Stores file chunks on the allocated storage nodes.
- Replicates chunks to additional nodes for redundancy.
File Metadata Service:
- Updates metadata with the replication status and chunk locations.
Notification Service:
- Notifies the user about successful upload.
Response:
- Returns the file ID and metadata to the user.

2. File Download Request

Objective: Retrieve a file from the system.

Steps:

User Management Service:
- Validates the user’s authentication token.
- Checks permissions to access the requested file.
File Metadata Service:
- Looks up the file ID and retrieves chunk locations.
Storage Service:
- Fetches file chunks from the storage nodes.
- Assembles the file from chunks.
Response:
- Sends the complete file to the user.

3. File Deletion Request

Objective: Delete a file from the system.

Steps:

User Management Service:
- Validates the user’s authentication token.
- Checks ownership and permissions for the file.
File Metadata Service:
- Marks the file as deleted in the metadata database.
- Identifies storage nodes storing the file chunks.
Storage Service:
- Deletes the file chunks from the storage nodes.
- Confirms deletion status with the File Metadata Service.
Notification Service:
- Notifies the user about successful deletion.
Response:
- Confirms the deletion to the user.

4. Real-Time Synchronization

Objective: Sync a file across devices.

Steps:

Synchronization Service:
- Detects changes in a file on one device.
- Updates the File Metadata Service with the change.
File Metadata Service:
- Identifies affected files and devices.
- Notifies the Synchronization Service.
Storage Service:
- Propagates updated file chunks to other devices.
Notification Service:
- Alerts the user about the sync completion or conflicts.

5. Search for Files

Objective: Search for files by name or metadata.

Steps:

User Management Service:
- Validates the user’s authentication token.
Search and Indexing Service:
- Queries the index for matching files and folders.
- Ranks results based on relevance and metadata.
Response:
- Returns a list of matching files and folders to the user.

6. Share a File or Folder

Objective: Share a file or folder with another user.

Steps:

User Management Service:
- Validates the user’s authentication token.
Access Control and Security Service:
- Updates permissions for the specified user or group.
- Logs the sharing activity.
Notification Service:
- Sends a notification to the recipient.
Response:
- Confirms the sharing action to the user.

Detailed component design

1. User Management Service

End-to-End Working:

The User Management Service is responsible for managing user accounts, authentication, and authorization. It ensures secure access by validating user credentials (e.g., using hashed passwords) and generating authentication tokens (e.g., JWT). When a user registers, the service verifies the input, stores credentials securely, and initializes user-specific settings. For authorization, it checks permissions and roles to grant or deny access to files or folders.

Communication:

Protocols Used:
- HTTPS: Ensures secure communication for sensitive operations like login or registration.
- REST APIs: Handles operations like user creation, login, and permission verification.
Inter-Service Communication:
- Communicates with the Access Control Service to verify permissions.
- Notifies the Notification Service for user-specific alerts (e.g., successful registration).

Data Structures and Algorithms:

Hashing for Password Storage:
- Uses algorithms like bcrypt or Argon2 to securely hash and store passwords.
Token Management:
- Generates and validates JWTs for session management.
RBAC Matrix:
- Implements a role-based access control system using hash maps to map roles to permissions.
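
The RBAC matrix can be sketched as a hash map from role to a set of permissions. The role names below are hypothetical. The permission names (read, write, share) come from the access control requirements.

```python
# Role -> permissions lookup; role names are illustrative assumptions.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "owner": {"read", "write", "share"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role grants the action; unknown roles grant nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Both the role lookup and the membership check are constant time on average, so the check adds little overhead to each request.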

Implementation Example (Password Hashing):

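```python
import bcrypt

def hash_password(password: str) -> bytes:
    # gensalt() creates a random salt; the salt and cost factor are embedded in the hash.
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())

def verify_password(password: str, hashed: bytes) -> bool:
    return bcrypt.checkpw(password.encode("utf-8"), hashed)
```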

Scaling for Peak Traffic:

Horizontal Scaling:
- Deploys multiple instances of the service behind a load balancer to handle traffic spikes.
Caching:
- Uses Redis to cache frequently accessed user data like session tokens.
Rate Limiting:
- Implements rate limiting to prevent brute-force attacks and abuse.
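
One common way to implement rate limiting is a token bucket. The sketch below is an in-process illustration, not the design's chosen algorithm. The class name and the injectable `clock` parameter are assumptions, and a real deployment would likely keep the counters in a shared store such as Redis.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Keeping one bucket per user or per client IP limits login attempts without penalizing other clients.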

Edge Cases:

Password Reset Attacks:
- Mitigated by using time-limited, cryptographically secure reset tokens.
Account Lockouts:
- Implements a temporary lockout mechanism after multiple failed login attempts.
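
A minimal sketch of a time-limited reset token, assuming a 15-minute lifetime and that only a SHA-256 hash of the token is stored server-side. The function names and record layout are hypothetical.

```python
import hashlib
import secrets
import time

RESET_TOKEN_TTL_SECONDS = 15 * 60  # assumed lifetime; tune to the product's needs


def issue_reset_token(now=None):
    """Return (token to send to the user, record to store server-side)."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)  # cryptographically secure random token
    record = {
        "token_hash": hashlib.sha256(token.encode()).hexdigest(),
        "expires_at": now + RESET_TOKEN_TTL_SECONDS,
    }
    return token, record


def is_reset_token_valid(token, record, now=None):
    """Check expiry, then compare hashes in constant time."""
    now = time.time() if now is None else now
    candidate = hashlib.sha256(token.encode()).hexdigest()
    return now < record["expires_at"] and secrets.compare_digest(candidate, record["token_hash"])
```

Storing only the hash means a leaked database row cannot be used directly as a reset link.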

2. File Metadata Service

End-to-End Working:

The File Metadata Service tracks information about files and folders, such as their size, version history, and location. When a user uploads a file, this service generates a unique file ID, records how the file is split into chunks, and assigns storage nodes for each chunk. It also records details like the owner, permissions, and timestamps.

Communication:

Protocols Used:
- HTTPS: Provides secure communication for metadata operations.
- gRPC: Ensures efficient communication with the Storage Service and Replication Manager.
Inter-Service Communication:
- Sends file location details to the Storage Service during upload and download.
- Communicates with the Access Control Service to enforce permissions.

Data Structures and Algorithms:

Distributed Hash Table (DHT):
- Maps file IDs to metadata, ensuring fast and consistent lookups across nodes.
Version Control:
- Uses a linked list to maintain version history for files.
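
The linked-list version history could look like the sketch below, where each version points to the one it replaced. The class and field names are hypothetical, and each version is assumed to reference its chunks by ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileVersion:
    version: int
    chunk_ids: list
    previous: Optional["FileVersion"] = None  # link to the older version


class VersionHistory:
    """Singly linked list of versions, newest first."""

    def __init__(self):
        self.head = None

    def add_version(self, chunk_ids):
        number = 1 if self.head is None else self.head.version + 1
        self.head = FileVersion(number, list(chunk_ids), self.head)
        return self.head

    def get_version(self, version):
        """Walk back from the newest version; returns None if not found."""
        node = self.head
        while node is not None and node.version != version:
            node = node.previous
        return node
```

Recovering an old version is a walk from the head, which is cheap for recent versions but linear in the worst case.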

Implementation Example (Metadata Storage):

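```python
class MetadataStore:
    """In-memory stand-in for the metadata database, keyed by file ID."""

    def __init__(self):
        self.metadata = {}

    def add_file(self, file_id, metadata):
        # Overwrites any existing entry for the same file ID.
        self.metadata[file_id] = metadata

    def get_file(self, file_id):
        # Returns None when the file ID is unknown.
        return self.metadata.get(file_id)
```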

Scaling for Peak Traffic:

Sharding:
- Partitions metadata by user ID or file ID to distribute load across multiple nodes.
Caching:
- Caches frequently accessed metadata for faster retrieval.
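
A simple way to pick a shard is to hash the partition key and take it modulo the shard count. This is a sketch only: the function name is hypothetical and SHA-256 is an assumed hash.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a user ID or file ID to a shard index in [0, shard_count)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer for a stable, well-spread value.
    return int.from_bytes(digest[:8], "big") % shard_count
```

Modulo hashing remaps most keys when the shard count changes, so systems that add nodes often use consistent hashing instead.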

Edge Cases:

Metadata Corruption:
- Uses checksums and periodic backups to ensure metadata integrity.
Node Failures:
- Replicates metadata across nodes for fault tolerance.

3. Storage Service

End-to-End Working:

The Storage Service stores the actual file data by splitting files into chunks and distributing them across multiple storage nodes. Each chunk is replicated to ensure durability. When retrieving a file, the service reassembles the file from its chunks.

Communication:

Protocols Used:
- HTTP/HTTPS: Handles file uploads and downloads.
- gRPC: Communicates with the Replication Manager for chunk replication tasks.
Inter-Service Communication:
- Notifies the File Metadata Service about the location of stored chunks.
- Interacts with the Monitoring Service to report storage health.

Data Structures and Algorithms:

Chunk Index:
- A local hash map tracks chunk IDs and their storage paths.
Erasure Coding (later phase):
- Can replace full replication to reduce storage overhead while preserving durability, by splitting data into fragments with parity blocks for reconstruction.
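
To illustrate the idea of parity-based reconstruction, the sketch below uses single XOR parity, the simplest form of erasure code, which tolerates the loss of one fragment. Production systems typically use Reed-Solomon codes instead. The function names are hypothetical and all fragments are assumed to have equal length.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(fragments: list[bytes]) -> bytes:
    """Compute one parity block as the XOR of all data fragments."""
    parity = bytes(len(fragments[0]))  # all-zero block of the fragment length
    for fragment in fragments:
        parity = xor_bytes(parity, fragment)
    return parity


def recover_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing fragment from the survivors and the parity block."""
    missing = parity
    for fragment in surviving:
        missing = xor_bytes(missing, fragment)
    return missing
```

With three data fragments and one parity block the storage overhead is about 1.33x, compared with 3x for triple replication, at the cost of reading every surviving fragment during recovery.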

Implementation Example (Chunk Storage):

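```python
class ChunkStore:
    """In-memory stand-in for a storage node's chunk index."""

    def __init__(self):
        self.chunks = {}

    def store_chunk(self, chunk_id, data):
        # Overwrites any existing chunk with the same ID.
        self.chunks[chunk_id] = data

    def retrieve_chunk(self, chunk_id):
        # Returns None when the chunk is not stored on this node.
        return self.chunks.get(chunk_id)
```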

Scaling for Peak Traffic:

Horizontal Scaling:
- Adds storage nodes dynamically to handle growing data volumes.
Load Balancing:
- Distributes file chunks evenly across storage nodes.

Edge Cases:

Node Failures:
- Automatically recovers lost chunks from replicas.
Chunk Corruption:
- Uses checksums to verify data integrity during storage and retrieval.

4. Synchronization Service

End-to-End Working:

The Synchronization Service ensures real-time updates across user devices. It monitors changes made to files and folders, tracks sync statuses, and resolves conflicts when simultaneous updates occur.

Communication:

Protocols Used:
- WebSockets: Enables real-time communication for sync updates.
- HTTPS: Handles conflict resolution and sync status checks.
Inter-Service Communication:
- Fetches updated metadata from the File Metadata Service.
- Notifies the Notification Service about sync statuses.

Data Structures and Algorithms:

Conflict Resolution:
- Uses operational transformation (OT) or last-writer-wins (LWW) algorithms to resolve conflicts.
Change Log:
- Maintains a log of changes made to files for efficient sync propagation.
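
Last-writer-wins is the simpler of the two strategies and can be sketched as below. The `FileUpdate` fields are hypothetical, and the approach assumes device clocks are reasonably synchronized, which is a known weakness of LWW.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileUpdate:
    file_id: str
    device_id: str
    modified_at: float   # timestamp reported by the device (assumed comparable)
    content_hash: str


def resolve_lww(a: FileUpdate, b: FileUpdate) -> FileUpdate:
    """Pick the update with the later timestamp; device_id breaks ties deterministically."""
    return max(a, b, key=lambda update: (update.modified_at, update.device_id))
```

The losing update is discarded under pure LWW, so a sync client may prefer to keep it as a "conflicted copy" rather than drop user data silently.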

Implementation Example (Change Tracking):

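```python
class ChangeLog:
    """Append-only list of file changes, in the order they were recorded."""

    def __init__(self):
        self.changes = []

    def log_change(self, file_id, change_type):
        # change_type is a free-form label, e.g. "update" or "delete".
        self.changes.append({"file_id": file_id, "type": change_type})
```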

Scaling for Peak Traffic:

Delta Sync:
- Transfers only changed parts of a file to reduce bandwidth usage.
Connection Pooling:
- Handles a large number of simultaneous WebSocket connections efficiently.
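
Delta sync can be approximated by comparing per-chunk hashes of the old and new file and uploading only the chunks that differ. This sketch assumes fixed-size chunks and SHA-256, and the function names are hypothetical.

```python
import hashlib


def chunk_hashes(content: bytes, chunk_size: int) -> list[str]:
    """Hash each fixed-size chunk of the content."""
    return [
        hashlib.sha256(content[i:i + chunk_size]).hexdigest()
        for i in range(0, len(content), chunk_size)
    ]


def changed_chunk_indexes(old: bytes, new: bytes, chunk_size: int) -> list[int]:
    """Return indexes of chunks in `new` that are absent from or differ in `old`."""
    old_hashes = chunk_hashes(old, chunk_size)
    new_hashes = chunk_hashes(new, chunk_size)
    return [
        index
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    ]
```

Fixed-size chunking handles in-place edits and appends well, but an insertion near the start shifts every later chunk. Rolling-hash schemes such as the one used by rsync address that case.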

Edge Cases:

Network Disruptions:
- Implements resumable sync to continue after interruptions.
Simultaneous Updates:
- Alerts users or applies automated conflict resolution.

5. Notification Service

End-to-End Working:

The Notification Service alerts users about events like shared file activity, sync conflicts, or completed uploads. It supports real-time notifications as well as batched updates.

Communication:

Protocols Used:
- HTTPS: Handles batched notifications.
- Push Notifications: Sends real-time alerts to devices.
Inter-Service Communication:
- Receives events from the File Metadata and Synchronization Services.

Data Structures and Algorithms:

Priority Queue:
- Ensures critical notifications (e.g., errors) are delivered before less important ones.
Pub/Sub Model:
- Publishes events to subscribed devices or users.
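
The pub/sub model can be illustrated with a small in-process dispatcher. In practice this role would likely be filled by a message broker. The class name and topic strings are hypothetical.

```python
from collections import defaultdict


class PubSub:
    """Minimal in-process publish/subscribe dispatcher."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Deliver the event to every subscriber of the topic; return the delivery count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)
```

A usage sketch: a device subscribes to a `"file.shared"` topic, and the File Metadata Service publishes an event to it when a share is created.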

Implementation Example (Notification Queue):

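```python
import itertools
from queue import PriorityQueue

class NotificationQueue:
    def __init__(self):
        self.queue = PriorityQueue()
        self._counter = itertools.count()  # tie-breaker: equal priorities stay FIFO

    def add_notification(self, priority, message):
        # Lower numbers are delivered first; the counter keeps messages from being compared.
        self.queue.put((priority, next(self._counter), message))

    def get_notification(self):
        # Blocks until a notification is available.
        priority, _, message = self.queue.get()
        return priority, message
```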

Scaling for Peak Traffic:

Push Notification Services:
- Integrates with third-party services (e.g., Firebase Cloud Messaging) for scalability.
Sharding:
- Distributes notification data across multiple nodes.

Edge Cases:

Delayed Notifications:
- Uses retry mechanisms for failed notification deliveries.
Overloaded Users:
- Groups similar notifications into batches to reduce spam.
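
The retry mechanism could use exponential backoff, as in the sketch below. The function name, the default attempt count, and the base delay are assumptions. `send` stands in for whatever delivery call the service uses.

```python
import time


def deliver_with_retry(send, message, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try to deliver a message, doubling the wait after each failure.

    Returns True on success and False once all attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False
```

Messages that still fail after the last attempt can be moved to a dead-letter queue for later inspection rather than being dropped.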

Trade offs/Tech choices

Replication vs. Erasure Coding:
- Trade-off: Replication increases storage overhead (3x), while erasure coding reduces overhead but increases reconstruction time.
- Reason: Chose replication for simplicity and faster recovery in the initial phase.
NoSQL for Metadata:
- Trade-off: Limited support for complex queries compared to relational databases.
- Reason: Prioritized scalability and fast lookups for metadata-heavy operations.
Delta Sync for Synchronization:
- Trade-off: Increased complexity in tracking changes but significantly reduced bandwidth usage.
- Reason: Optimized for real-time sync across devices.
Eventual Consistency for Non-Critical Updates:
- Trade-off: Temporary inconsistencies in low-priority operations.
- Reason: Balances performance and availability in a distributed system.
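
To put numbers on the replication versus erasure coding trade-off, the sketch below compares storage overhead for the 500 PB estimate. The 10 data + 4 parity layout is an illustrative assumption, not a choice made in this design, and the function names are hypothetical.

```python
def replication_overhead(replicas: int) -> float:
    """Storage multiplier for full replication."""
    return float(replicas)


def erasure_coding_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Storage multiplier for a (data + parity) erasure-coded layout."""
    return (data_fragments + parity_fragments) / data_fragments


RAW_STORAGE_PB = 500  # from the capacity estimation

replicated_pb = RAW_STORAGE_PB * replication_overhead(3)
erasure_coded_pb = RAW_STORAGE_PB * erasure_coding_overhead(10, 4)  # assumed 10+4 layout

print(f"3x replication: {replicated_pb:,.0f} PB")
print(f"10+4 erasure coding: {erasure_coded_pb:,.0f} PB")
```

Under these assumptions erasure coding needs about 700 PB instead of 1,500 PB, which is the saving that has to be weighed against slower reconstruction.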

Failure scenarios/bottlenecks

Node Failures:

Issue: Loss of file chunks stored on the failed node.
Mitigation: Use replication to maintain multiple copies of data and redistribute chunks from healthy nodes.

Sync Conflicts:

Issue: Simultaneous updates from multiple devices cause conflicts.
Mitigation: Implement conflict resolution algorithms like operational transformation or last-writer-wins.

Metadata Overload:

Issue: High traffic may overwhelm the metadata service.
Mitigation: Shard metadata by user/file ID and use in-memory caching (e.g., Redis) for frequent lookups.

Network Partitions:

Issue: Temporary loss of communication between nodes leads to inconsistent data.
Mitigation: Use quorum-based reads and writes to ensure data consistency.
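
Quorum-based reads and writes rely on the condition R + W > N, which guarantees that every read quorum overlaps every write quorum in at least one replica. A minimal sketch follows. The function names and the `(version, value)` response shape are hypothetical.

```python
def quorums_overlap(replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """True when any read quorum must include a replica from the latest write quorum."""
    return read_quorum + write_quorum > replicas


def read_latest(responses):
    """Given (version, value) pairs from a read quorum, return the newest pair."""
    return max(responses, key=lambda response: response[0])
```

With N = 3, choosing R = 2 and W = 2 satisfies the condition while still tolerating one unavailable replica.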

Delayed Notifications:

Issue: Notifications fail or are delayed due to server overload.
Mitigation: Use retry mechanisms and prioritize notifications with a queue.

Future improvements

Erasure Coding:

Improvement: Replace full replication with erasure coding for reduced storage costs.
Mitigation: Use fast reconstruction algorithms to minimize read latency for missing chunks.

Predictive Scaling:

Improvement: Leverage AI to predict traffic patterns and scale resources dynamically.
Mitigation: Prevent service slowdowns during peak periods.

Advanced Search and Indexing:

Improvement: Enhance indexing with machine learning to provide smarter and faster search results.
Mitigation: Use distributed search engines (e.g., Elasticsearch) to handle growing data volumes.

Geo-Replication:

Improvement: Distribute data across global regions to reduce latency for users in different locations.
Mitigation: Implement region-based access and asynchronous replication.

Self-Healing System:

Improvement: Automate failure detection and recovery with real-time monitoring and self-healing mechanisms.
Mitigation: Minimize downtime and operational overhead.