Design Dropbox - System Design

System requirements

Cloud file storage services have become very popular recently as they simplify the storage and exchange of digital resources among multiple devices. The shift from using single personal computers to using multiple devices with different platforms and operating systems such as smartphones and tablets each with portable access from various geographical locations at any time, is believed to be accountable for the huge popularity of cloud storage services.

Functional:

Users should be able to upload and download their files/photos from any device.
Users should be able to share files or folders with other users.
Our service should support automatic synchronization between devices, i.e., after updating a file on one device, it should get synchronized on all devices.
The system should support storing large files up to a GB.
Our system should support offline editing. Users should be able to add/delete/modify files while offline, and as soon as they come online, all their changes should be synced to the remote servers and other online devices.

Non-Functional:

Availability: The motto of cloud storage services is to have data availability anywhere, anytime. Users can access their files/photos from any device whenever and wherever they like.

Reliability and Durability: Another benefit of cloud storage is that it offers 100% reliability and durability of data. Cloud storage ensures that users will never lose their data by keeping multiple copies of the data stored on different geographically located servers.

Scalability: Users will never have to worry about getting out of storage space. With cloud storage you have unlimited storage as long as you are ready to pay for it.

Consistency: ACID is required. Atomicity, Consistency, Isolation and Durability of all file operations should be guaranteed.

Capacity estimation

Let’s assume that we have 500M total users, and 100M daily active users (DAU).
Let’s assume that on average each user connects from three different devices.
On average if a user has 200 files/photos, we will have 100 billion total files.
Let’s assume that average file size is 100KB, this would give us ten petabytes of total storage. 100B * 100KB => 10PB
Let’s also assume that we will have one million active connections per minute.

API design

Metadata Service

Save Metadata
1. Endpoint: /metadata
2. Request Type: POST
3. Request Parameters: None

Request JSON

{
    "filename": "example.txt",
    "path": "/user/directory/example.txt",
    "block_hashes": ["hash1", "hash2", ...]
}

Response JSON:

{
  "status": "success", // or "error"
  "message": "Metadata saved successfully." // or error message
}

Get Metadata
1. Endpoint: /metadata
2. Request Type: GET
3. Request Parameters: path (e.g., /metadata?path=/user/directory/example.txt)

Response JSON:

{
    "filename": "example.txt",
    "path": "/user/directory/example.txt",
    "block_hashes": ["hash1", "hash2", ...]
}

Block Service

Upload block
1. Endpoint /<block_hash>
2. Request Type: POST
3. Request Parameters: block_hash (e.g., hash1)

Request Data: Binary data of the block

Response JSON:

{
  "status": "success", // or "error"
  "message": "Block uploaded successfully." // or error message
}

Download block
1. Endpoint /<block_hash>
2. Request Type: GET
3. Request Parameters: block_hash (e.g., hash1)

Response: Binary data of the block

Notification Service

Retrieve events
1. Endpoint /poll
2. Request Type: GET

Response: Connection kept open until a change occurs.

Response JSON

{
  "status": "change_detected",
  "message": "A change has been detected.",
  "changed_files": [
    {
      "filename": "example.txt",
      "path": "/user/directory/example.txt"
    }
    // ... potentially other changed files
  ]
}

Database design

The core function of the Metadata Database in Dropbox is to store essential details about the files without actually storing the file content itself. One of the core requirements is maintaining strong ACID properties and consistency across devices. We use a Relational Database Management System (RDBMS) structure to manage and query the metadata generated by users.

To track file changes:

Addition: Add a new file entry in the File Metadata Table and an associated initial version in the Journal Table.
Modification: Add a new version to the Journal Table, detailing the changes.
Deletion: Add a new version with the change type "deletion" to the Journal Table.

Devices can compare the LatestJournalID of files with the one in the File Metadata Table during syncing. If they're different, the system knows updates have occurred. The Journal table helps identify changes, aiding synchronization.

High-level design

Client: The Client Application monitors the workspace folder on the user's machine and syncs all files/folders in it with the remote Cloud Storage. The client application will work with the storage servers to upload, download, and modify actual files to backend Cloud Storage. The client also interacts with the remote Synchronization Service to handle any file metadata updates, e.g., change in the file name, size, modification date, etc. Here are some of the essential operations for the client:
1. Upload and download files.
2. Detect file changes in the workspace folder.
3. Handle conflict due to offline or concurrent updates.

Client consists of main 4 parts:

1. Internal Metadata Database will keep track of all the files, chunks, their versions, and their location in the file system.
2. Chunker will split the files into smaller pieces called chunks. It will also be responsible for reconstructing a file from its chunks. Our chunking algorithm will detect the parts of the files that have been modified by the user and only transfer those parts to the Cloud Storage; this will save us bandwidth and synchronization time.
3. Watcher will monitor the local workspace folders and notify the Indexer (discussed below) of any action performed by the users, e.g. when users create, delete, or update files or folders. Watcher also listens to any changes happening on other clients that are broadcasted by Synchronization service.
4. Indexer will process the events received from the Watcher and update the internal metadata database with information about the chunks of the modified files. Once the chunks are successfully submitted/downloaded to the Cloud Storage, the Indexer will communicate with the remote Synchronization Service to broadcast changes to other clients and update the remote metadata database

Metadata Database: The Metadata Database is responsible for maintaining the versioning and metadata information about files/chunks, users, and workspaces. The Metadata Database can be a relational database such as MySQL or a NoSQL database service such as DynamoDB. Regardless of the type of the database, the Synchronization Service should be able to provide a consistent view of the files using a database, especially if more than one user is working with the same file simultaneously. Since NoSQL data stores do not support ACID properties in favor of scalability and performance, we need to incorporate the support for ACID properties programmatically in the logic of our Synchronization Service in case we opt for this kind of database. However, using a relational database can simplify the implementation of the Synchronization Service as they natively support ACID properties.
The Synchronization Service: is the component that processes file updates made by a client and applies these changes to other subscribed clients. It also synchronizes clients' local databases with the information stored in the remote Metadata DB. The Synchronization Service is the most important part of the system architecture due to its critical role in managing the metadata and synchronizing users' files. Desktop clients communicate with the Synchronization Service to either obtain updates from the Cloud Storage or send files and updates to the Cloud Storage and, potentially, other users. If a client was offline for a period, it polls the system for new updates as soon as they come online. When the Synchronization Service receives an update request, it checks with the Metadata Database for consistency and then proceeds with the update. Subsequently, a notification is sent to all subscribed users or devices to report the file update.
Cloud/Block Storage: stores chunks of files uploaded by the users. Clients directly interact with the storage to send and receive objects from it. Separation of the metadata from storage enables us to use any storage either in the cloud or in-house.

Request flows

Uploading a New File (Write Path)

When a user adds a new file to their Dropbox folder:

File Preparation: The Dropbox client on the user's device divides the file into 4MB blocks, calculating a hash for each block. These blocks are then transmitted to the Block Service.
Block Storage: The Block Service stores these blocks, using the block hashes as reference keys, in its Block Storage system.
Metadata Transmission: The client sends the file's metadata, which includes information such as namespace, path, and the list of blocks, to the Metadata Service via its load balancer.
Metadata Storage: The chosen Metadata Server then writes these details into the Metadata Database and simultaneously caches this information for quick access.
Notification: After recording the metadata, the Metadata Service alerts the Notification Service about the new file.
Syncing Alert: The Notification Service informs other client installations about the new file addition, prompting them to synchronize.

Downloading the New File (Read Path)

When another device attempts to sync the new file from Dropbox:

Notification Receipt: The client on the user's secondary device receives the update from the Notification Service about the newly added file and initiates the syncing process.
Metadata Retrieval: The client communicates with the Metadata Service's load balancer, which routes the request to an appropriate Metadata Server. This server checks its cache for the required metadata. If unavailable, it fetches the data from the Metadata Database.
Block Request: Armed with the metadata, the client sends a request to the Block Service to retrieve the associated file blocks.
Block Retrieval: The Block Service fetches the necessary blocks, identified by their hashes, and sends them to the client.
File Reconstruction: Upon receipt, the client assembles these blocks to reconstruct the original file and saves it to the device's storage.

Detailed component design

Data deduplication is a technique used for eliminating duplicate copies of data to improve storage utilization. It can also be applied to network data transfers to reduce the number of bytes that must be sent. For each new incoming chunk, we can calculate a hash of it and compare that hash with all the hashes of the existing chunks to see if we already have the same chunk present in our storage.

We can implement deduplication in two ways in our system:

a. Post-process deduplication

With post-process deduplication, new chunks are first stored on the storage device and later some process analyzes the data looking for duplication. The benefit is that clients will not need to wait for the hash calculation or lookup to complete before storing the data, thereby ensuring that there is no degradation in storage performance. Drawbacks of this approach are 1) We will unnecessarily be storing duplicate data, though for a short time, 2) Duplicate data will be transferred consuming bandwidth.

b. In-line deduplication

Alternatively, deduplication hash calculations can be done in real-time as the clients are entering data on their device. If our system identifies a chunk that it has already stored, only a reference to the existing chunk will be added in the metadata, rather than a full copy of the chunk. This approach will give us optimal network and storage usage.

Partitioning

1. Vertical Partitioning: We can partition our database in such a way that we store tables related to one particular feature on one server. For example, we can store all the user-related tables in one database and all files/chunks related tables in another database. Although this approach is straightforward to implement it has some issues:

Will we still have scale issues? What if we have trillions of chunks to be stored and our database cannot support storing such a huge number of records? How would we further partition such tables?
Joining two tables in two separate databases can cause performance and consistency issues. How frequently do we have to join user and file tables?

2. Range Based Partitioning: What if we store files/chunks in separate partitions based on the first letter of the File Path? In that case, we save all the files starting with the letter ‘A' in one partition and those that start with the letter ‘B' into another partition and so on. This approach is called range-based partitioning. We can even combine certain less frequently occurring letters into one database partition. We should come up with this partitioning scheme statically so that we can always store/find a file in a predictable manner.

The main problem with this approach is that it can lead to unbalanced servers. For example, if we decide to put all files starting with the letter ‘E' into a DB partition, and later we realize that we have too many files that start with the letter ‘E', to such an extent that we cannot fit them into one DB partition.

3. Hash-Based Partitioning: In this scheme we take a hash of the object we are storing and based on this hash we figure out the DB partition to which this object should go. In our case, we can take the hash of the ‘FileID' of the File object we are storing to determine the partition the file will be stored. Our hashing function will randomly distribute objects into different partitions, e.g., our hashing function can always map any ID to a number between [1…256], and this number would be the partition we will store our object.

This approach can still lead to overloaded partitions, which can be solved by using 'Consistent Hashing'.

Caching

We can have two kinds of caches in our system. To deal with hot files/chunks we can introduce a cache for Block storage. We can use an off-the-shelf solution like Memcached that can store whole chunks with its respective IDs/Hashes and Block servers before hitting Block storage can quickly check if the cache has desired chunk. Based on clients' usage patterns we can determine how many cache servers we need. A high-end commercial server can have 144GB of memory; one such server can cache 36K chunks.

Which cache replacement policy would best fit our needs? When the cache is full, and we want to replace a chunk with a newer/hotter chunk, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used chunk first. Similarly, we can have a cache for Metadata DB.

Conflict Management and Data Integrity

Let’s now discuss how we can handle version conflicts in a collaborative file editing scenario ensuring data integrity and making sure that users are notified when their file is being updated by their collaborators.

Client-side :

The indexer monitors local changes made by the user and communicates with the server to check for changes made by other collaborators. By comparing versions, it can identify potential conflicts during the editing process.
In cases when the system by itself is not able to resolve conflicts, the client can provide an intuitive interface, highlighting conflicting changes, and allowing the user to easily understand and resolve conflicts.
In case of real-time collaboration, the communicates with the server during editing, the user interface should be able to show how and where multiple collaborators are making edits.

Server-Side

The server, particularly the Synchronization Service, is responsible for implementing conflict resolution logic.
The server maintains version control mechanisms.By keeping a detailed version history, the server ensures that users can roll back to previous versions, providing an additional layer of control in case conflicts cannot be resolved manually.
If the server is not able to resolve the conflicts on its own, it can send notification to the user and communicate that the user needs to compare versions and provide conflict resolution.

Conflict Resolution Strategies:

Automatic Conflict Resolution: For simple conflicts, the system can automatically merge changes based on predefined rules. For example, if two users insert text at different positions in the file, the system can merge these changes without user intervention.
Manual Conflict Resolution: For complex conflicts, such as changes to the same line of text by two users, the system should provide tools for users to manually resolve conflicts. This could involve highlighting conflicting changes and allowing users to choose which version to keep.

Trade offs/Tech choices

An interesting trade-off is data robustness vs data privacy.

Privacy laws require that users can delete their data completely.

However, to be able to recover data from a disaster, it's important that the files are copied for backup in multiple places.

A reliable metadata store (e.g. in RDB) should remember where all the copied chunks are located, including backups. This way, when a user decides to delete the files, all the copies (copied chunks) are deleted.

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

Adding File Versioning System: Implement a versioning system allows users to revert to any historical version of a file. This can be achieved by extending the Metadata Database to store detailed version information and integrating it with the Chunker to efficiently manage and retrieve historical file versions.
Optimized Chunking Algorithm:Explore and implement an optimized chunking algorithm that adapts dynamically to different file types and sizes, reducing the average chunk size for small files and improving upload/download efficiency. This enhancement in the Chunker component would further minimize bandwidth usage and enhance overall system performance.