Codemia | Master System Design Interviews Through Active Practice

Design Dropbox with Score: 8/10

by alchemy1135

System requirements

Functional:

User Management:
Create an account with a unique username and a valid email address.
Log in securely with proper authentication mechanisms.
Log out from the system to terminate the session.
File Operations:
Upload files to the user's account.
Download files from the user's account.
Create, move, rename, and delete folders.
Move, rename, and delete files.
Access and manage previous versions of files.
Synchronization:
Automatically synchronize files across multiple devices in real-time.
Ensure changes made on one device are reflected on all other connected devices.
Sharing and Collaboration:
Share files/folders securely with other users.
Collaborate in real-time on shared files.
Set permissions for shared items (view-only, edit, etc.).
File Search:
Search for files/folders based on keywords.
Provide accurate and fast search results.

Non-Functional:

Security:
Implement robust encryption for data transmission and storage.
Regularly update security protocols to protect against emerging threats.
Monitor and log user activities for auditing and security purposes.
Scalability:
Design the system to handle a growing number of users and files.
Scale the infrastructure horizontally to accommodate increased load.
Performance:
Ensure low-latency file uploads and downloads.
Optimize search algorithms for quick and efficient results.
Minimize synchronization delay between devices.
Reliability:
Implement regular backups and data recovery mechanisms.
Provide system availability with minimal downtime for maintenance.
Compatibility:
Support a variety of file types and sizes for uploading and downloading.
Ensure compatibility with popular operating systems and browsers.
Compliance:
Comply with data protection regulations and privacy laws.
Maintain transparency in terms of data usage and storage policies.
Availability:
Design the system with high availability to minimize service downtime.
Implement redundant systems and failover mechanisms to ensure continuous service.

Capacity estimation

The total number of users = 500 million.

Total number of daily active users = 100 million

The average number of files stored by each user = 200

The average size of each file = 1 MB

Total number of active connections per minute = 1 million

Storage Estimations:

Total number of files = 500 million * 200 = 100 billion

Total storage required = 100 billion * 1 MB = 100 PB

Considering 1 server can handle 1000 requests concurrently, we would need 1 Million / 1000 = 1000 servers
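The arithmetic above can be rechecked with a few lines of Python. This is only a sketch of the same back-of-the-envelope estimate; the constant names are made up for illustration, and decimal units (1 PB = 10^9 MB) are assumed.

```python
# Inputs copied from the estimates above; names are illustrative only.
TOTAL_USERS = 500_000_000
FILES_PER_USER = 200
AVG_FILE_SIZE_MB = 1
CONNECTIONS_PER_MINUTE = 1_000_000
REQUESTS_PER_SERVER = 1_000

MB_PER_PB = 1_000_000_000  # assumption: decimal units, 1 PB = 10^9 MB

total_files = TOTAL_USERS * FILES_PER_USER
total_storage_pb = total_files * AVG_FILE_SIZE_MB / MB_PER_PB
servers_needed = CONNECTIONS_PER_MINUTE // REQUESTS_PER_SERVER

print(f"files: {total_files:,}")              # files: 100,000,000,000
print(f"storage: {total_storage_pb:.0f} PB")  # storage: 100 PB
print(f"servers: {servers_needed:,}")         # servers: 1,000
```

These figures cover raw file data only. Replication and stored file versions would multiply the storage requirement.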

API design

1. User Authentication API:

Description: This API handles user authentication, allowing users to securely log in and obtain access tokens.
Input: User credentials (username, password).
Output: Access token or an error message.

2. File Upload API:

Description: Enables users to upload files to their accounts.
Input: File data, user authentication token.
Output: Confirmation of successful upload or an error message.

3. File Download API:

Description: Allows users to download files from their accounts.
Input: File identifier, user authentication token.
Output: Downloaded file data or an error message.

4. File Management API:

Description: Provides functionality to manage files and folders (create, move, rename, delete).
Input: File/folder details, user authentication token.
Output: Confirmation of the operation or an error message.

5. File Synchronization API:

Description: Ensures synchronization of files across multiple devices in real-time.
Input: User authentication token, device identifier, file changes.
Output: Confirmation of synchronization status or an error message.

6. Sharing and Collaboration API:

Description: Facilitates secure sharing of files/folders and collaboration between users.
Input: Shared item details, user authentication token.
Output: Confirmation of successful sharing or an error message.

7. Version Control API:

Description: Manages access to previous versions of files.
Input: File identifier, version details, user authentication token.
Output: Previous version of the file or an error message.

8. File Search API:

Description: Allows users to search for files/folders based on keywords.
Input: Search query, user authentication token.
Output: List of search results or an empty result set.
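To make the input/output contracts above more concrete, here is a minimal in-memory sketch of the authentication, upload, and download APIs. The class name, method names, and the `ApiError` exception are hypothetical, and passwords are kept in plain text purely to keep the example short; a real service would hash them and persist data in the databases discussed in the next section.

```python
import secrets
from dataclasses import dataclass, field


class ApiError(Exception):
    """Stands in for the 'error message' output of each API."""


@dataclass
class InMemoryFileApi:
    # Hypothetical in-memory stores, for illustration only.
    passwords: dict = field(default_factory=dict)  # username -> password
    tokens: dict = field(default_factory=dict)     # token -> username
    files: dict = field(default_factory=dict)      # (username, file_id) -> bytes

    def login(self, username: str, password: str) -> str:
        """User Authentication API: credentials in, access token out."""
        if self.passwords.get(username) != password:
            raise ApiError("invalid credentials")
        token = secrets.token_hex(16)
        self.tokens[token] = username
        return token

    def _user_for(self, token: str) -> str:
        if token not in self.tokens:
            raise ApiError("invalid or expired token")
        return self.tokens[token]

    def upload(self, token: str, file_id: str, data: bytes) -> str:
        """File Upload API: file data and token in, confirmation out."""
        self.files[(self._user_for(token), file_id)] = data
        return "uploaded"

    def download(self, token: str, file_id: str) -> bytes:
        """File Download API: file identifier and token in, file data out."""
        key = (self._user_for(token), file_id)
        if key not in self.files:
            raise ApiError("file not found")
        return self.files[key]
```

Every call after login takes the token as input, which mirrors the "user authentication token" listed for each API above.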

Database design

For the tables required in this design, refer to the class diagram. The list of classes is not exhaustive, but it is a reasonable set of tables to start with.

Database Choice

User Data:
Database Type: Relational Database (e.g., PostgreSQL, MySQL)
CAP Focus: Balanced (Consistency and Availability)
Reasoning: User data often requires a balance between consistency (ensuring accurate and up-to-date user information) and availability (ensuring users can access the system). Relational databases are designed to provide a balanced approach.
File Metadata and Sharing Data:
Database Type: Relational Database (e.g., PostgreSQL, MySQL)
CAP Focus: Balanced (Consistency and Availability)
Reasoning: Similar to user data, file metadata and sharing data benefit from a balanced approach to ensure that users see accurate and consistent information while still allowing for system availability.
Search Data:
Database Type: Search Engine (e.g., Elasticsearch)
CAP Focus: Availability
Reasoning: Search functionality benefits from a focus on availability, allowing users to retrieve search results quickly. Search engines like Elasticsearch are optimized for distributed and scalable search operations.
Audit Logs and Version History:
Database Type: Relational Database (e.g., PostgreSQL, MySQL) or NoSQL Database (e.g., MongoDB)
CAP Focus: Depends on the use case
Reasoning: Depending on the specific requirements, the focus may vary. For strict consistency, a relational database may be suitable. If flexibility and availability are prioritized, a NoSQL database could be preferred.
File Chunks:
Storage Service: Amazon S3 or similar object storage service
CAP Focus: Availability
Reasoning: Cloud-based object storage services are specifically designed for storing large volumes of binary data, offering high availability, durability, and scalability. They are optimized for read and write operations and provide low-latency access to stored objects.

Data Partitioning:

Strategy: Hash-Based Partitioning
Explanation: For file hosting services, hash-based partitioning is often a suitable strategy. It evenly distributes data across multiple partitions based on a hash function applied to a chosen key (e.g., user ID, file ID). This ensures a balanced distribution of data and efficient retrieval.
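As a rough illustration of hash-based partitioning, the sketch below maps a key such as a user ID to a partition number. The function name and the choice of SHA-256 are assumptions made for the example, not part of the original design.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key (e.g. a user ID or file ID) to a partition index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Rough check that keys spread evenly over 8 partitions.
counts = Counter(partition_for(f"user-{i}", 8) for i in range(10_000))
print(sorted(counts.items()))
```

One caveat with plain modulo hashing is that changing the number of partitions remaps most keys; consistent hashing is a common way to limit that movement.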

Regional or Geographical Partitioning:

Applicability: Not necessary initially, but consider for scalability and performance optimization.
Explanation: Initially, a global approach may be sufficient. However, as the user base grows and the service expands globally, you might consider regional or geographical partitioning. This can enhance performance by placing data closer to users and addressing data residency and compliance requirements.

Sharding Strategy:

Strategy: Range-Based Sharding
Explanation: Range-based sharding involves dividing the dataset into ranges based on a specific criterion (e.g., user IDs, file IDs). This can be effective for tables that are expected to grow significantly, such as file chunks or version history. Each shard can then handle a specific range of data, enabling horizontal scalability.
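A range-based lookup can be sketched with a sorted list of range boundaries and a binary search. The boundary values and the function name below are hypothetical.

```python
import bisect

# Hypothetical exclusive upper bounds of each shard's user-ID range:
# shard 0 holds IDs [0, 1_000_000), shard 1 holds [1_000_000, 2_000_000), ...
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]


def shard_for(user_id: int) -> int:
    """Return the index of the shard whose range contains user_id."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    idx = bisect.bisect_right(SHARD_UPPER_BOUNDS, user_id)
    if idx == len(SHARD_UPPER_BOUNDS):
        raise ValueError("user_id is beyond the last shard's range")
    return idx
```

Adding capacity in this scheme means appending a new boundary, which leaves existing ranges untouched but can create hotspots if new IDs all land in the newest shard.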

Sharding Key Selection:

Key Criteria: Choose a sharding key that evenly distributes data and avoids hotspots.
Explanation: The choice of sharding key is crucial. It should distribute the data evenly across shards to prevent hotspots. For example, sharding files based on a user's geographical location might lead to uneven distribution if certain regions have a higher concentration of users.

Replication:

Strategy: Master-Slave Replication
Explanation: Implement master-slave replication to ensure data durability and availability. Writes are directed to the master node, while read queries are distributed across slave nodes. This improves read scalability; write throughput is still bounded by the single master, so scaling writes relies on the sharding described above.
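The read/write split can be sketched as a small router that sends writes to the master and rotates reads across the slaves. The class and node names are illustrative assumptions.

```python
import itertools


class ReplicatedRouter:
    """Route writes to the master and spread reads across slaves."""

    def __init__(self, master: str, slaves: list):
        self.master = master
        # cycle() yields the slaves in round-robin order indefinitely.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def node_for(self, is_write: bool) -> str:
        # With no slaves configured, the master serves reads too.
        if is_write or self._slaves is None:
            return self.master
        return next(self._slaves)


router = ReplicatedRouter("db-master", ["db-slave-1", "db-slave-2"])
print(router.node_for(is_write=True))   # db-master
print(router.node_for(is_write=False))  # db-slave-1
```

Because replication is usually asynchronous, a read served by a slave may briefly lag behind the latest write.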

Load Balancing:

Load Balancer Type: DNS Load Balancing for Global Distribution
Explanation: Use DNS load balancing to distribute incoming requests across multiple servers globally. This ensures efficient load distribution and improved response times for users in different regions.

High-level design

Let's break down the high-level design into its components:

1. Client-Side Components:

Watcher Component:
Monitors the sync folder for user activities (creating, updating, or deleting files/folders).
Sends notifications to Indexer and Chunker on any file/folder actions.
Chunker Component:
Breaks files into small chunks.
Uploads chunks to cloud storage with a unique ID or hash.
Detects and uploads only modified chunks to reduce bandwidth and storage usage.
Indexer Component:
Updates internal database upon receiving notifications from Watcher.
Receives chunk URLs and hashes from Chunker for modified chunks.
Communicates with Synchronization Service using Message Queuing Service.
Internal Database Component:
Stores information about files, chunks, versions, and their locations in the file system.
Allows efficient retrieval and management of client-side data.
Security Enhancements Component:
Manages encryption for file chunks in transit and at rest.
Implements multi-factor authentication for an added layer of user security.
Verifies the checksum of data it receives from the server.

2. Message Queuing Service:

Request Queue:

The global queue for clients to send update requests.
Handles asynchronous communication between clients and Synchronization Service.

Response Queues:

Individual queues for clients to receive updates.
Ensures that updates are delivered even if clients are temporarily disconnected.
Provides load balancing and elasticity for multiple instances of the Synchronization Service.

3. Synchronization Service:

Receives update requests from Request Queue.
Updates the Metadata Database with the latest changes.
Broadcasts updates to clients through their respective Response Queues.
Polls for new updates and synchronizes with clients once they are back online.

4. Cloud Storage:

Stores actual files and chunks.
Facilitates folder synchronization across clients.
Ensures data availability and durability.

5. Metadata Database:

Stores metadata information, including file indexes, chunks, and versions.
Maintains consistency with internal databases on the client side.
Provides information needed for file recreation and synchronization.
Version Control Database Service: Stores version-related data for the Version Control Service.
File Metadata Database Service: Stores structured data for file metadata, sharing, and other relevant information for the File Service.

6. Additional Server Side components

User Management Service:
Responsible for user registration, authentication, and authorization.
Interfaces with the User Database for user-related data.
Search Service:
Indexes and searches files based on user queries.
Interfaces with the Search Engine (e.g., Elasticsearch) for fast and efficient search results.
Version Control Service:
Manages and tracks version history for files.
Utilizes the Version Control Database for version-related data.
Audit Log Service:
Records and stores user activities for auditing purposes.
Interacts with the Audit Log Database for logging-related data.

Request flows

The below sequence diagram shows the flow of users uploading and sharing a file.

Detailed component design

Let's talk about what happens on the client side when a user uploads or modifies a file.

When the user updates an existing file or creates a new file in the folder selected for synchronization and backup, the following components come into play.

1. Watcher Component:

This component is responsible for monitoring the sync folder for user activities. It detects file creation, updates, or deletions and notifies the Indexer and Chunker about the observed actions.
It utilizes filesystem monitoring APIs to detect changes and communicates asynchronously with the Indexer and Chunker to initiate further actions.
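As a simplified stand-in for the Watcher, the sketch below detects changes by comparing two snapshots of the sync folder. This polling approach and the function names are assumptions for illustration; a production client would more likely rely on OS notification facilities such as inotify, FSEvents, or ReadDirectoryChangesW.

```python
from pathlib import Path


def snapshot(root: str) -> dict:
    """Map every file path under root to its last-modified time."""
    return {str(p): p.stat().st_mtime
            for p in Path(root).rglob("*") if p.is_file()}


def diff_snapshots(old: dict, new: dict):
    """Return (created, modified, deleted) paths between two snapshots."""
    created = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(p for p in new.keys() & old.keys() if new[p] != old[p])
    return created, modified, deleted
```

The three lists returned by `diff_snapshots` correspond to the create, update, and delete events that the Watcher forwards to the Indexer and Chunker.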

2. Chunker Component:

The Chunker starts when it receives a message from the Watcher and breaks the affected files into multiple chunks. Chunking matters because it makes transfers faster and more efficient and minimizes bandwidth usage.
Using a chunking algorithm, it breaks files into smaller, manageable pieces and generates a unique ID or hash for each chunk. It determines which chunks the user has changed by checking the internal database and uploads only the modified chunks to cloud storage, reducing data transfer.
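The chunk-and-compare step can be sketched as follows. Fixed-size chunks, SHA-256 hashes, and the 4 MB default are assumptions chosen for the example; the design above does not prescribe a specific algorithm or chunk size.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size of 4 MB


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks and return one hash per chunk."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def modified_chunks(old_hashes: list, new_hashes: list) -> list:
    """Return indexes of chunks that are new or whose hash changed."""
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]


old = chunk_hashes(b"a" * 8 + b"b" * 8, chunk_size=8)
new = chunk_hashes(b"a" * 8 + b"c" * 8 + b"d" * 3, chunk_size=8)
print(modified_chunks(old, new))  # [1, 2]
```

Only chunks 1 and 2 would be uploaded here. With fixed-size chunks, an insertion near the start of a file shifts every later chunk, which is one reason content-defined chunking is sometimes preferred.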

3. Indexer Component:

This component updates the internal database upon receiving notifications from the Watcher. It receives URLs and hashes from the Chunker for the modified chunks. It communicates with the Synchronization Service using the Message Queuing Service.
The Indexer also maintains the internal database that stores file metadata, versions, and chunk information.

4. Internal Database Component:

This component is responsible for storing information about files, chunks, versions, and their locations. It supports efficient retrieval and management of client-side data.
It maintains consistency with the Metadata Database on the cloud side and provides fast and efficient access to client-side data.

Server Side Components

Although there are multiple components on the server side, we will focus on the ones that are essential to the current design.

1. Message Queuing Service:

The queue service facilitates asynchronous communication between clients and the Synchronization Service. It manages message queues for both requests and responses, and its job is to ensure reliable and ordered message delivery.

Detailed Design:
Request Queue:
Global queue shared among all clients.
Clients send update requests through this queue.
Messages include details about file actions or synchronization requests, chunk information, file metadata.
Response Queues:
Individual queues corresponding to each client.
Multiple clients receive updates through their respective response queues.
Ensures that updates are delivered even if clients are temporarily disconnected.
Implementation:
Utilizes a high-performance and scalable message queuing system (e.g., RabbitMQ, Apache Kafka).
Ensures message durability and order through appropriate configurations.
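The queue layout described above, one global request queue plus one response queue per client, can be sketched in memory with the standard library. This is only a single-process illustration with hypothetical names; it offers none of the durability guarantees of a system such as RabbitMQ or Apache Kafka.

```python
import queue
from collections import defaultdict


class MessageBroker:
    """One shared request queue and one response queue per client."""

    def __init__(self):
        self.request_queue = queue.Queue()
        self.response_queues = defaultdict(queue.Queue)

    def send_request(self, client_id: str, update: dict) -> None:
        # All clients push their updates onto the same global queue.
        self.request_queue.put((client_id, update))

    def publish(self, client_id: str, update: dict) -> None:
        # Updates wait in the client's own queue until it collects them.
        self.response_queues[client_id].put(update)

    def drain(self, client_id: str) -> list:
        """Return all pending updates for a client in FIFO order."""
        pending = []
        q = self.response_queues[client_id]
        while True:
            try:
                pending.append(q.get_nowait())
            except queue.Empty:
                return pending
```

Because each client has its own response queue, a client that was offline simply finds its updates waiting in order when it calls `drain`.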

2. Synchronization Service:

This service is responsible for the following:
It receives update requests from the Request Queue.
It updates the Metadata Database with the latest changes.
The service will broadcast updates to clients through their respective Response Queues.
It keeps polling for new updates and synchronizes with clients once they are back online.
Detailed Design:
Update Processing:
The service keeps listening to the Request Queue for incoming update requests.
It processes requests, updates the Metadata Database, and triggers synchronization tasks.
Broadcasting Updates:
It sends updates to the appropriate Response Queues for broadcasting to clients.
It ensures reliable and ordered delivery of updates. Ordered delivery of updates also helps in conflict resolution when multiple updates are made to the same file at the same time by different users.
Polling Mechanism:
The service keeps periodically checking for updates from clients that were temporarily offline.
It initiates synchronization with offline clients upon their reconnection.
Implementation:
Utilizes a scalable and fault-tolerant service that supports concurrent updates and message broadcasting.
Incorporates retry mechanisms for handling communication failures.
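The update-processing and broadcasting steps can be sketched as below. The request fields (`user`, `device`, `path`, `chunks`), the per-file version counter, and the class name are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SyncService:
    """Apply an update to metadata, then fan it out to other devices."""

    def __init__(self, devices_by_user: dict):
        self.devices_by_user = devices_by_user      # user -> list of device IDs
        self.metadata = {}                          # (user, path) -> latest version
        self.response_queues = defaultdict(deque)   # device -> pending updates

    def handle(self, request: dict) -> None:
        key = (request["user"], request["path"])
        version = self.metadata.get(key, 0) + 1
        self.metadata[key] = version
        update = {"path": request["path"], "version": version,
                  "chunks": request["chunks"]}
        # Broadcast to every device of the user except the sender.
        for device in self.devices_by_user[request["user"]]:
            if device != request["device"]:
                self.response_queues[device].append(update)

    def poll(self, device: str) -> list:
        """Called by a device (e.g. after reconnecting) to fetch updates."""
        pending = list(self.response_queues[device])
        self.response_queues[device].clear()
        return pending
```

A device that was offline receives its queued updates in the order they were processed when it next calls `poll`.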

Horizontal Scaling:

For services like file upload/download, file synchronization, and user authentication, horizontal scaling ensures that the system can handle a growing user base and increasing demand by adding more servers dynamically. This allows the system to distribute incoming requests across multiple servers, preventing a single point of failure and improving overall performance.

Elasticity in Infrastructure:

Elasticity in infrastructure ensures that the system can adapt to varying workloads. We can use the auto-scaling features offered by cloud service providers to scale resources up during peak usage periods and down during periods of lower demand. This dynamic resource allocation optimizes costs and maintains efficient performance.

Load Balancing Techniques:

Load balancing is crucial for accommodating a large user base in the Dropbox system. By distributing requests across multiple servers, it ensures that no single server becomes a bottleneck, improving system reliability and scalability.
Techniques such as round-robin, weighted round-robin, least connections, and least response time can be employed to distribute the load effectively. Load balancers can be configured to monitor server health and route traffic only to healthy servers, enhancing the system's overall availability.
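As one example of these techniques, the sketch below implements least-connections selection with a health filter. The `Server` fields and function name are illustrative assumptions rather than any particular load balancer's API.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_connections: int = 0
    healthy: bool = True


def pick_least_connections(servers: list) -> Server:
    """Choose the healthy server with the fewest active connections."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy servers available")
    return min(candidates, key=lambda s: s.active_connections)


pool = [Server("a", 12), Server("b", 3, healthy=False), Server("c", 7)]
print(pick_least_connections(pool).name)  # c
```

Server "b" has the fewest connections but is skipped because it is marked unhealthy, which is the health-check behaviour described above.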

Conflict Management and Data Integrity

Let’s now discuss how to handle version conflicts in a collaborative file-editing scenario while ensuring data integrity and notifying users when their file is being updated by collaborators.

Client-Side:

The indexer monitors local changes made by the user and communicates with the server to check for changes made by other collaborators. By comparing versions, it can identify potential conflicts during the editing process.
In cases when the system by itself is not able to resolve conflicts, the client can provide an intuitive interface, highlighting conflicting changes, and allowing the user to easily understand and resolve conflicts.
In the case of real-time collaboration, the client communicates with the server during editing, and the user interface should show how and where multiple collaborators are making edits.

Server-Side:

The server, particularly the Synchronization Service, is responsible for implementing conflict resolution logic.
The server maintains version control mechanisms. By keeping a detailed version history, the server ensures that users can roll back to previous versions, providing an additional layer of control in case conflicts cannot be resolved manually.
If the server cannot resolve a conflict on its own, it can send a notification telling the user to compare versions and resolve the conflict.

Conflict Resolution Strategies:

Automatic Conflict Resolution: For simple conflicts, the system can automatically merge changes based on predefined rules. For example, if two users insert text at different positions in the file, the system can merge these changes without user intervention.
Manual Conflict Resolution: For complex conflicts, such as changes to the same line of text by two users, the system should provide tools for users to manually resolve conflicts. This could involve highlighting conflicting changes and allowing users to choose which version to keep.
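The split between automatic and manual resolution can be illustrated with a simplified three-way merge over lines. This sketch assumes both users only edited lines in place, so all three versions have the same number of lines; handling insertions and deletions needs a full diff-based merge, which is out of scope here. The function name is hypothetical.

```python
def three_way_merge(base: list, ours: list, theirs: list):
    """Merge two edited copies of base; return (merged, conflict_indexes)."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t or t == b:
            merged.append(o)      # same edit on both sides, or only we changed it
        elif o == b:
            merged.append(t)      # only they changed it
        else:
            merged.append(b)      # both changed it differently: keep base, flag it
            conflicts.append(i)
    return merged, conflicts


base = ["title", "intro", "body"]
print(three_way_merge(base, ["Title", "intro", "body"], ["title", "intro", "Body"]))
# (['Title', 'intro', 'Body'], [])
```

Edits to different lines merge automatically, while any index returned in the conflict list would be handed to the user for manual resolution.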

Trade-offs/Tech choices

Although we have covered a lot of ground, below are three areas we can explore to further improve the design:

Enhanced Security Measures:
Implement additional security measures such as encryption for data in transit and at rest, multi-factor authentication, and secure tokenization. This will enhance the overall security posture of the system, safeguarding user data and access credentials.
Advanced Caching Mechanisms:
Introduce advanced caching mechanisms, such as content-based caching or edge caching using a Content Delivery Network (CDN), to optimize the retrieval of frequently accessed files and reduce latency. This enhancement would improve the overall performance and user experience, especially for large-scale deployments.
Dynamic Scaling and Resource Allocation:
Implement dynamic scaling mechanisms and resource allocation strategies to adapt to varying workloads and efficiently utilize computing resources. This involves incorporating auto-scaling policies, load-based scaling, and efficient resource allocation algorithms to ensure optimal performance during peak usage periods while minimizing operational costs.

Future improvements

Enhanced File Versioning System:

Implement a more robust versioning system that allows users to revert to any historical version of a file. This can be achieved by extending the Metadata Database to store detailed version information and integrating it with the Chunker to efficiently manage and retrieve historical file versions.

Optimized Chunking Algorithm:

Explore and implement an optimized chunking algorithm that adapts dynamically to different file types and sizes, reducing the average chunk size for small files and improving upload/download efficiency. This enhancement in the Chunker component would further minimize bandwidth usage and enhance overall system performance.

Intelligent Conflict Resolution Mechanism:

Develop an intelligent conflict resolution mechanism in the Synchronization Service that can automatically handle conflicts arising from concurrent updates or offline modifications by analyzing the nature of changes and merging them seamlessly. This improvement will enhance user experience by reducing the need for manual conflict resolution.

Global File Deduplication:

Introduce a global file deduplication mechanism that identifies and eliminates duplicate files across users, optimizing storage usage and reducing redundancy in the Cloud Storage. This enhancement would involve integrating deduplication logic into the Synchronization Service to enhance overall storage efficiency and reduce costs.
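One way to picture deduplication is a content-addressed store that keeps each distinct blob once and counts references to it. The sketch below assumes SHA-256 as the content hash and uses hypothetical names; it is an in-memory illustration, not a description of how the Cloud Storage layer would actually be built.

```python
import hashlib


class DedupStore:
    """Store each distinct blob once, keyed by its content hash."""

    def __init__(self):
        self.blobs = {}      # content hash -> bytes
        self.refcount = {}   # content hash -> number of references

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data  # first copy: actually store the bytes
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def delete(self, digest: str) -> None:
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            # Last reference removed: the bytes can be reclaimed.
            del self.refcount[digest]
            del self.blobs[digest]
```

Two users uploading the same file would each receive the same hash while the bytes are stored once, and the data is removed only after the last reference is deleted.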