My Solution for Design Dropbox with Score: 8/10

by kraken_pinnacle338

System requirements


Functional:

+ Users should be able to upload files (Create / Update)

+ Users should be able to download files

+ Users should be able to share files (with specific users)

+ Users should be able to sync local folders with online folders


Non-Functional:

+ Authentication

+ Billing

+ Consistency

+ Availability


Out of scope:

+ Latency - not a hard requirement here, though we should still try to provide the best user experience we can.



Capacity estimation

I expect the system to support uploading large files (on the order of gigabytes).


We can expect millions of users, and millions of files.


The system is likely to be more read-intensive than write-intensive: files are typically downloaded more often than they are uploaded or updated.




API design

/api/v1/file [POST, PUT]

/api/v1/file [GET]

/api/v1/file/permissions [PUT]




Database design

Files

FileId (4 bytes)

UserId (4 bytes)

Path (200 chars) (800 bytes)

Checksum (4 bytes)

Chunk Strategy (ENUM) (2 bytes)


Chunk

ChunkId (4 bytes)

FileId (4 bytes)

Path (200 chars) (800 bytes)


Permissions

FileId (4 bytes)

UserId (4 bytes)

PermissionType (READ, WRITE) (1 byte)
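The tables above can be sketched as plain Python dataclasses; the field names and the enum values are hypothetical, chosen only to mirror the columns listed here.

```python
from dataclasses import dataclass
from enum import Enum

class ChunkStrategy(Enum):
    # Hypothetical strategies; the real set depends on what clients implement.
    FIXED_SIZE = 1
    CONTENT_DEFINED = 2

class PermissionType(Enum):
    READ = 1
    WRITE = 2

@dataclass
class File:
    file_id: int               # 4 bytes
    user_id: int               # 4 bytes (owner)
    path: str                  # up to 200 chars
    checksum: int              # 4 bytes, checksum of the whole file
    chunk_strategy: ChunkStrategy

@dataclass
class Chunk:
    chunk_id: int              # 4 bytes
    file_id: int               # 4 bytes, points back to File
    path: str                  # location of the chunk in blob storage

@dataclass
class Permission:
    file_id: int
    user_id: int
    permission_type: PermissionType
```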


High-level design

Every time a file is uploaded, the client creates chunks and then uploads each chunk to a server. Each file has a unique ID accompanying it, derived from the file path/name rather than its contents. This makes overwrites easier: if we used the checksum as the ID, the ID would change whenever the contents changed, and we would never be able to identify which file we're talking about across versions.
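The client-side chunking step can be sketched as below; fixed 4 MB chunks and SHA-256 checksums are assumptions, since the actual strategy is recorded per file in the metadata.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks (assumption)

def make_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split a file's bytes into fixed-size chunks, each with its own checksum."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        blob = data[offset:offset + chunk_size]
        chunks.append({
            "index": offset // chunk_size,          # position within the file
            "checksum": hashlib.sha256(blob).hexdigest(),
            "data": blob,
        })
    return chunks
```

Per-chunk checksums are what later lets the server overwrite only the chunks that actually changed.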


The client keeps a streaming connection (socket) open. In the first request it creates the file entry, then it uploads the chunks. If the connection is lost, the upload can resume, since we have a unique ID for this upload.


Once the request hits the server, we call two microservices. The metadata service maintains data about the file and its chunks. It also records details about the upload (e.g., which chunking strategy was used). Keeping a copy of the chunking strategy is useful in case we want to change it later.


The chunks are sent to the storage service. The storage service uses blob storage to ensure those chunks are stored on multiple different machines (resiliency).


For updates, the orchestration service queries the metadata service and compares the checksum of each chunk sent by the client against the stored one. Only the chunks whose checksums differ are overwritten.
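That delta-detection step is essentially a set difference on checksums; a minimal sketch, assuming the metadata service returns a map of chunk index to checksum (names hypothetical):

```python
def changed_chunks(server_checksums: dict, client_chunks: list) -> list:
    """Return only the client chunks whose checksum differs from the server's copy.

    server_checksums: {chunk_index: checksum} from the metadata service
    client_chunks: [{"index": ..., "checksum": ..., "data": ...}] from the client
    """
    return [
        c for c in client_chunks
        if server_checksums.get(c["index"]) != c["checksum"]
    ]
```

New chunks (indices the server has never seen) fall out naturally, since `dict.get` returns `None` for them and the comparison fails.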


For downloads, the client starts a streaming connection with the server, downloads all the chunks, and then reconstructs the file locally. Every time we send a request to download the file, we include the chunks we already have, so the server only sends the missing ones.
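A minimal sketch of that resume-friendly download flow, assuming the client reports the chunk indices it already holds (function names hypothetical):

```python
def missing_chunk_indices(total_chunks: int, already_have: set) -> list:
    """Indices the server still needs to send, given what the client reports."""
    return [i for i in range(total_chunks) if i not in already_have]

def reconstruct_file(chunks_by_index: dict) -> bytes:
    """Reassemble the file on the client once every chunk is present."""
    return b"".join(chunks_by_index[i] for i in range(len(chunks_by_index)))
```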


The sync service is responsible for querying the orchestration service, passing a path. The server uses the permissions table to determine which files this user has access to, and grants access accordingly. The client then receives a list of files it should download. These go into a queue, and the downloader service pulls from this queue to determine which file to download next.
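The permission filter the server applies before handing the client its download list can be sketched as below, assuming permissions are rows shaped like the Permissions table (field names hypothetical):

```python
def files_to_sync(candidate_files: list, permissions: list, user_id: int) -> list:
    """Filter files under the synced path down to those this user may read.

    candidate_files: [{"file_id": ..., "path": ...}] under the requested path
    permissions: rows from the Permissions table
    """
    readable = {p["file_id"] for p in permissions if p["user_id"] == user_id}
    return [f for f in candidate_files if f["file_id"] in readable]
```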


On the analytics side, it would be useful to track latency, chunk sizes, user demographics, error counts, etc.



Request flows








Detailed component design

For the storage, we can rely on something like S3 or Bigtable, sharded by chunk ID.


For the metadata, we need a relational database, since there's a one-to-many mapping between a file entry and its list of chunks, as well as associations between a file and its permissions. We could use Spanner, MySQL, Oracle, etc. I would use Spanner because I have experience with it.

We can shard based on file ID, generating file IDs as random UUIDs.
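Because random UUIDs are uniformly distributed, a simple modulo over the ID spreads files evenly across shards; a sketch, with the shard count as an assumption:

```python
import uuid

NUM_SHARDS = 16  # assumption; the real count depends on data volume

def shard_for_file(file_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map a random file UUID to a shard; uniform IDs give balanced shards."""
    return file_id.int % num_shards
```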


To handle hotspots, we can use a cache. Every time a chunk is downloaded, we add it to the cache (LRU eviction policy). This reduces how often we hit our database. For the cache, we can use Redis.
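In production Redis would handle eviction for us, but the LRU behavior we're relying on can be sketched in a few lines (class name hypothetical):

```python
from collections import OrderedDict

class LruChunkCache:
    """Tiny in-memory LRU cache for hot chunks (stand-in for Redis)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store = OrderedDict()  # insertion order tracks recency

    def get(self, chunk_id):
        if chunk_id not in self._store:
            return None  # cache miss: caller falls back to blob storage
        self._store.move_to_end(chunk_id)  # mark as most recently used
        return self._store[chunk_id]

    def put(self, chunk_id, data):
        self._store[chunk_id] = data
        self._store.move_to_end(chunk_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```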



Trade offs/Tech choices




Failure scenarios/bottlenecks

This design doesn't address concurrent writes: what happens when multiple users try to update or upload the same file at once. In that case, we can take a temporary lock (which can live at the database level). If the file is locked, the other user gets an error message.
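The lock table itself would live in the database, but its semantics can be sketched in memory: first writer wins, later writers are told to retry, and a TTL prevents a crashed client from holding the lock forever (class and field names hypothetical).

```python
import threading
import time

class FileLockTable:
    """In-memory stand-in for a DB-level lock table with a TTL (sketch)."""

    def __init__(self, ttl_seconds: float = 30.0):
        self._locks = {}  # file_id -> (user_id, expiry)
        self._mutex = threading.Lock()
        self._ttl = ttl_seconds

    def try_lock(self, file_id, user_id) -> bool:
        with self._mutex:
            holder = self._locks.get(file_id)
            if holder is not None and holder[1] > time.monotonic():
                return False  # held by someone: surface an error to the user
            self._locks[file_id] = (user_id, time.monotonic() + self._ttl)
            return True

    def release(self, file_id, user_id):
        with self._mutex:
            holder = self._locks.get(file_id)
            if holder is not None and holder[0] == user_id:
                del self._locks[file_id]
```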


What happens if a chunk upload fails? We should retry at the client level (e.g., exponential backoff, or just a single retry). We can also have an uploader service that runs in the background, continuously retrying pending uploads.
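The client-side retry with exponential backoff can be sketched as follows; the attempt count, base delay, and the `ConnectionError`-only retry policy are all assumptions.

```python
import time

def upload_with_retry(upload_fn, chunk, max_attempts: int = 3,
                      base_delay: float = 0.5):
    """Call upload_fn(chunk), retrying transient failures with backoff.

    Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
    The last failure is re-raised so the caller can surface an error.
    """
    for attempt in range(max_attempts):
        try:
            return upload_fn(chunk)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```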


What happens if all the servers are down? The user receives an error at the UI level, asking them to retry later.


The chunking strategy can become problematic for non-web clients (iOS, Android, PC/Mac). We might want to experiment with different chunking strategies, but the ability to upgrade the strategy depends on clients upgrading their binaries.


Future improvements

How can we colocate files with the user as much as possible?


If our user is on the West Coast, we should try to store the file on the West Coast to improve download speed.