My Solution for Design Dropbox with Score: 8/10

by serenade_xenon816

System requirements


Functional:



  1. Users can upload content
  2. Users can download content
  3. Users can share files/folders with other users
  4. Uploaded content is synced across a user's connected devices


System Properties

  1. Support a max file size of 50GB


Out Of Scope

  1. Billing and Payment
  2. Encryption at rest




Non-Functional:


  1. Conflict management - the system ensures integrity when content is updated. Content updated concurrently on device-1 and device-2 is handled gracefully; in case of a conflict a separate copy is kept so the user can verify and merge manually.
  2. The system lets devices upload content, download content, and sync metadata across devices. Files enabled for offline mode are downloaded to the local filesystem.
  3. Updates are eventually consistent - a newly uploaded file may not be visible in a different region for a couple of seconds. The system therefore favours availability over consistency.
  4. Low-latency sync and content downloads; content metadata downloads < 200 ms.
  5. Security - the client-to-server channel uses TLS, so data in transit is encrypted. Data at rest is encrypted. Users are authenticated and authorized using OAuth tokens (JWT).




Capacity estimation


  1. Total users - 500M; daily active users (70%) - 350M
  2. Average file size - 100 MB
  3. Average upload volume per day - 350M x 100 MB = 35,000,000,000 MB, i.e. 35 PB per day
  4. Number of devices per user - 4 on average
  5. Average download volume per day - 350M users x 4 devices x 100 MB = 35 PB x 4 = approx 140 PB
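A quick back-of-the-envelope check of these numbers in plain Python; the user counts, file size, and device count are the assumptions listed above:

```python
# Back-of-the-envelope capacity check (assumes the figures listed above).
total_users = 500_000_000
daily_active = int(total_users * 0.70)       # 350M DAU
avg_file_mb = 100
devices_per_user = 4

upload_pb_per_day = daily_active * avg_file_mb / 1_000_000_000   # MB -> PB
download_pb_per_day = upload_pb_per_day * devices_per_user

print(f"uploads/day   ~ {upload_pb_per_day:.0f} PB")     # ~35 PB
print(f"downloads/day ~ {download_pb_per_day:.0f} PB")   # ~140 PB
```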




API design



Get Metadata

Used by mobile/desktop/laptop apps and web browsers to fetch all content metadata for a user's account from the Dropbox server.

GET dropbox.api.com/1/contents
Request Header: [ Authorization = Bearer <OAuth Token> ]
Response: 200 OK
Response Body: { an array of ContentMetadata }


Start Chunk Upload - Content upload

Used by mobile/desktop/laptop apps to indicate that a file upload is about to begin. The client then uses the Post Chunk API to upload the file in chunks.

POST dropbox.api.com/1/contents
Request Header: [ Authorization = Bearer <OAuth Token> ]
Request Body: { ContentMetadata }
Response: 200
Response Header: [ SessionId = ZZZwweee... ]
Response Body: { ContentMetadata }


Post Chunk - content upload

Used by mobile/desktop/laptop apps to upload file content in parts. The Dropbox server assembles the chunks into the final content.


POST dropbox.api.com/1/contents/chunks
Request Header: [ Content-Type = application/octet-stream, Authorization = Bearer <OAuth Token>, SessionId = ZZZwweee... ]
Request Body: { ChunkMetadata + bytestream }
Response: 200
Response Body: { ChunkMetadata }


Finish Chunk Upload - Content upload

Used by mobile/desktop/laptop apps to signal that all chunks have been uploaded. The Dropbox server then assembles the chunks into the final content.

POST dropbox.api.com/1/contents/chunks
Request Header: [ Authorization = Bearer <OAuth Token>, SessionId = ZZZwweee... ]
Request Body: { ContentMetadata }
Response: 200
Response Body: { ContentMetadata }
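A minimal client-side sketch of the start / post chunk / finish sequence described above, using the Python `requests` library. The base URL, header names, and bodies follow the request/response shapes sketched in this section; they are assumptions of this design, not a real Dropbox API, and the `Chunk-Sequence` header is just an illustration of how per-chunk metadata could ride along with the bytes:

```python
import requests

BASE = "https://dropbox.api.com/1"            # assumed host from the API sketches above
AUTH = {"Authorization": "Bearer <OAuth Token>"}
CHUNK_SIZE = 16 * 1024 * 1024                 # 16 MB chunks

def upload(path: str, content_metadata: dict) -> dict:
    # 1. Start Chunk Upload: announce the file and obtain a session id.
    start = requests.post(f"{BASE}/contents", headers=AUTH, json=content_metadata)
    session = start.headers["SessionId"]

    # 2. Post Chunk: stream the file in 16 MB parts under the same session.
    with open(path, "rb") as f:
        seq = 0
        while chunk := f.read(CHUNK_SIZE):
            requests.post(f"{BASE}/contents/chunks", data=chunk,
                          headers={**AUTH, "SessionId": session,
                                   "Content-Type": "application/octet-stream",
                                   "Chunk-Sequence": str(seq)})   # illustrative only
            seq += 1

    # 3. Finish Chunk Upload: tell the server all chunks are in so it can assemble the file.
    finish = requests.post(f"{BASE}/contents/chunks",
                           headers={**AUTH, "SessionId": session},
                           json=content_metadata)
    return finish.json()
```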


Get Chunks

Used by mobile/desktop/laptop apps to get all chunk metadata for a content item from the Dropbox server.

GET dropbox.api.com/1/contents?content={contentId}
Request Header: [ Authorization = Bearer <OAuth Token> ]
Response: 200 OK
Response Body: { a list of ChunkMetadata }



Upload a Content

TBD - for browsers to upload a file in its complete form.



Database design



  1. ContentMetadata
  2. ChunkMetadata
  3. User
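A sketch of what these three records might look like as Python dataclasses. The field names are assumptions pieced together from the metadata examples elsewhere in this design:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class User:
    user_id: str
    email: str
    devices: list[str] = field(default_factory=list)   # device ids registered for sync

@dataclass
class ContentMetadata:
    content_id: str
    owner_id: str
    filename: str
    folder_id: str | None       # None means the root folder
    size_bytes: int
    version: int                # bumped on every committed change
    status: str                 # "in_progress" until all chunks are written

@dataclass
class ChunkMetadata:
    chunk_id: str
    content_id: str
    sequence: int
    offset_start: int
    offset_end: int
    checksum: str               # XXH3/BLAKE3 hash of the chunk
    location: str               # path on the storage node holding the chunk
```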




High-level design




Chunk and Hashing file content - Block-Based Content-Addressed Storage

A file can be at most 50 GB in size. The file is sliced into chunks of 4 MB to 16 MB. The native app computes a hash (XXH3 or BLAKE3) and prepares metadata for each chunk. This metadata consists of:

{
  chunkId: 1,
  sequence: 1,
  filename: "rocks.txt",
  offsetStart: 0,
  offsetEnd: 123456,
  createdOn: <timestamp>,
  lastUpdatedOn: <timestamp>,
  ownerId: "1234",
  checksum: ZZZeeewww...,
  location: "path in dropbox file server"
}
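A sketch of the client-side chunking and hashing step. BLAKE3 and XXH3 require third-party packages, so `hashlib.blake2b` from the standard library stands in here; the 4 MB chunk size and field names are assumptions consistent with the metadata above:

```python
import hashlib, os, time

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB; the design allows 4-16 MB chunks

def chunk_file(path: str, owner_id: str) -> list[dict]:
    """Slice a file into fixed-size chunks and compute a hash per chunk."""
    chunks = []
    with open(path, "rb") as f:
        offset, seq = 0, 0
        while data := f.read(CHUNK_SIZE):
            chunks.append({
                "sequence": seq,
                "filename": os.path.basename(path),
                "offsetStart": offset,
                "offsetEnd": offset + len(data) - 1,
                "ownerId": owner_id,
                # blake2b stands in for BLAKE3/XXH3 in this sketch
                "checksum": hashlib.blake2b(data, digest_size=32).hexdigest(),
                "createdOn": time.time(),
            })
            offset += len(data)
            seq += 1
    return chunks
```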


Uploading Content from Mobile/Laptop/Desktop


  1. New file upload - the user creates a folder that is to be synced to the Dropbox server. The files in this folder are sent to the server in chunks along with their metadata.
  2. File changes - the native OS app hooks into kernel events to listen for file changes, for instance inotify on Linux, FSEvents on macOS, and file change notifications on Windows. The native app picks up the changed file and slices it into chunks of the predefined size, computing metadata and hashes for each. The hash of each chunk is compared with the hash in the local metadata cache, and only the chunks that differ are sent to the server, limiting bandwidth usage - i.e. the delta is uploaded (see the sketch after this list).
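A sketch of the delta computation referenced in item 2: compare fresh chunk hashes against the local metadata cache and upload only what changed. The cache shape (sequence number to checksum) is an assumption:

```python
def chunks_to_upload(new_chunks: list[dict], cache: dict[int, str]) -> list[dict]:
    """Return only the chunks whose hash differs from the local metadata cache.

    `cache` maps chunk sequence number -> last known checksum for this file.
    """
    delta = []
    for chunk in new_chunks:
        if cache.get(chunk["sequence"]) != chunk["checksum"]:
            delta.append(chunk)            # changed or brand-new chunk
    return delta

# Usage sketch: upload only the delta, then refresh the cache.
# delta = chunks_to_upload(chunk_file("rocks.txt", "1234"), local_cache["rocks.txt"])
```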

File Sync Between Devices

When a file is changed on one device, the native app on that device follows #1 or #2 above and pushes the full file or the delta chunks to the Dropbox server. A server-sent event is then pushed from the Dropbox server to the other devices, prompting them to call home for changes. The native client reaches out to Dropbox with its last-sync timestamp, and the changes after that timestamp are returned as metadata. The client compares this metadata with its local metadata cache, downloads the chunks that were modified on the Dropbox server, and updates the local files accordingly. The updates may result in file delete, merge, or add operations.
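A sketch of the call-home step on a receiving device: a change notification triggers a metadata fetch since the last sync timestamp, and only missing or changed chunks are downloaded. The `since` parameter, response field names, and `download_and_apply` helper are assumptions added for illustration:

```python
import requests

BASE = "https://dropbox.api.com/1"
AUTH = {"Authorization": "Bearer <OAuth Token>"}

def on_change_notification(last_sync_ts: float, local_cache: dict) -> None:
    """Called when the server pushes a change event (SSE) to this device."""
    # Fetch metadata for everything that changed after our last sync
    # ("since" is an assumed query parameter, not part of the API sketch above).
    changed = requests.get(f"{BASE}/contents", headers=AUTH,
                           params={"since": last_sync_ts}).json()

    for content in changed:
        chunks = requests.get(f"{BASE}/contents", headers=AUTH,
                              params={"content": content["contentId"]}).json()
        for chunk in chunks:
            # Download only chunks whose checksum differs from the local copy.
            if local_cache.get(chunk["chunkId"]) != chunk["checksum"]:
                download_and_apply(chunk)          # hypothetical helper
                local_cache[chunk["chunkId"]] = chunk["checksum"]
```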


Web browser Downloading Content

Here the user logs in to their account using a browser; the web application lists the home drive and recently accessed files. The Dropbox web application lists files and folders, and users can navigate/search them.


Modification of a file from Multiple Devices - Conflict Resolution


Merging of Changes


There are cases where the same file is modified on multiple devices and uploaded at roughly the same time, resulting in a conflict. For instance, two clients modify the file, Client A and Client B. Client A commits first; the Dropbox server applies the chunks modified by Client A and bumps the file to version N. Client B is therefore on version N - 1. If Client B changed different chunks than Client A, Dropbox merges the changes and produces version N + 1. However, if both clients modified the same chunk, Client B is notified that a newer version exists and must download the latest version and merge. A separate copy of Client B's file, named `copy-of-<filename>`, is kept on the Dropbox server so that Client B's changes are preserved.
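A sketch of that merge-or-copy decision. The version numbers and chunk sequence sets are assumptions matching the narrative above: non-overlapping chunk changes merge, overlapping ones keep a conflict copy for the slower client:

```python
def commit(server_version: int, client_base_version: int,
           server_changed: set[int], client_changed: set[int]) -> str:
    """Decide how to handle Client B's commit against the current server state.

    server_changed / client_changed are the sets of chunk sequence numbers
    each side modified since client_base_version.
    """
    if client_base_version == server_version:
        return "fast-forward: apply client chunks, version -> N + 1"
    if server_changed.isdisjoint(client_changed):
        # Different chunks were touched: the server can merge both edits.
        return "merge: apply client chunks on top, version -> N + 1"
    # The same chunk was touched by both sides: keep the client's edits aside.
    return "conflict: store client's file as 'copy-of-<filename>', ask user to merge"
```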


Online Shared Updates - Conflict Resolution

The case here is two devices, or two different users, updating the same file at the same time. If both parties update different chunks, the Dropbox server merges the file and produces a new version. If the same chunk is modified by both parties, the last write wins. The version is bumped on every commit, so users can see the history of changes.



Storage Layer


Files are stored on data nodes spread across regions, which means the chunks of each file are stored on different physical servers. For instance, if a file is divided into 10 chunks, each chunk is stored on a different storage server in encrypted form.


Erasure Coding

When a file is divided into 10 data blocks (chunks), 6 parity blocks are added to them, i.e. (10, 6) erasure coding. This is cheaper than replication, where each block is copied to 3 or more storage servers (30 blocks altogether), which incurs huge storage costs at this scale (roughly 35 PB of new data per day).

Erasure coding keeps 6 parity blocks for 10 data blocks, making 16 blocks in total. We can recover or reconstruct the data even if we lose up to 6 nodes (assuming each block is stored on a different storage server). The worst case is losing 7+ nodes, in which case the file is lost and cannot be recovered. It is important to store these blocks on different storage nodes, so at least 16 server nodes are needed.
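A small arithmetic sketch comparing the raw storage overhead of (10, 6) erasure coding against 3x replication for the daily ingest estimated earlier:

```python
daily_ingest_pb = 35                                           # from the capacity estimate

data_blocks, parity_blocks = 10, 6
ec_overhead = (data_blocks + parity_blocks) / data_blocks      # 1.6x raw storage
replication_overhead = 3                                       # 3 full copies

print(f"erasure coding : {daily_ingest_pb * ec_overhead:.0f} PB/day stored")    # ~56 PB
print(f"3x replication : {daily_ingest_pb * replication_overhead:.0f} PB/day")  # ~105 PB
# (10, 6) EC tolerates losing any 6 of the 16 blocks; losing 7+ loses the file.
```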


Chunk Repairs

A storage monitoring service watches disk health, node health, and partition health and publishes a message when it detects a problem. The message might indicate a disk failure or a chunk failure.

A chunk repair service picks up the message and acts on it by looking at the degraded chunks and repairing them, reconstructing the data and placing the chunks on the same or different storage nodes. It reads the chunk metadata from the chunk coordinator service to do so. This runs as part of a health-and-wellness monitoring job that selects the chunks to repair.
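A sketch of that repair loop: the worker consumes failure messages, asks the coordinator for the affected chunk metadata, reconstructs each chunk from the surviving blocks, and writes it to a healthy node. All the collaborators here (queue, coordinator, placement, storage, encoder) are hypothetical service clients used only to illustrate the flow:

```python
def repair_worker(queue, coordinator, placement, storage, encoder):
    """Consume disk/chunk failure messages and repair degraded chunks."""
    while True:
        failure = queue.get()                              # e.g. {"node": "...", "disk": "..."}
        for chunk in coordinator.degraded_chunks(failure):
            # Rebuild the missing block from the surviving data + parity blocks.
            siblings = coordinator.sibling_chunks(chunk)
            rebuilt = encoder.reconstruct(siblings, missing=chunk)
            # Place the rebuilt chunk on a healthy node and record its new location.
            target = placement.pick_node(exclude=failure["node"])
            storage.write(target, chunk["chunkId"], rebuilt)
            coordinator.update_location(chunk["chunkId"], target)
```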


File Sharing & Security

User A shares a file/folder with User B. This creates an entry in the Sharing table in the database with the file id, target user id, permissions (read/write/both), timestamp, sharing user id, expiration time if set, etc. User A creates a pre-signed URL for the file/folder and shares it with User B. User B must have a Dropbox account to view the file; when User B logs in to Dropbox, the shared files appear under the "Shared with Me" tab. The sharing metadata lives in its own table rather than in the content metadata table. When the file is shared, it is loaded onto the CDN nearest to the target user(s), which speeds up downloading/reading.
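A sketch of generating an expiring pre-signed URL with an HMAC signature, similar in spirit to what object stores do. The signing key, URL shape, and default expiry are assumptions of this design:

```python
import hashlib, hmac, time
from urllib.parse import urlencode

SECRET = b"server-side signing key"           # assumed server-held secret

def presign(file_id: str, target_user: str, ttl_seconds: int = 3600) -> str:
    expires = int(time.time()) + ttl_seconds
    payload = f"{file_id}:{target_user}:{expires}".encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"user": target_user, "expires": expires, "sig": signature})
    return f"https://dropbox.api.com/1/contents/{file_id}?{query}"

def verify(file_id: str, user: str, expires: int, sig: str) -> bool:
    if time.time() > expires:
        return False                           # link expired
    expected = hmac.new(SECRET, f"{file_id}:{user}:{expires}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```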



Request flows



General Flow

  1. The client sends a request (upload, download, get) to an edge load balancer, which terminates SSL and forwards the request to the API gateway in the nearest datacenter.
  2. The API gateway authenticates, authorizes, and rate limits the request and forwards it to a DNS-based Kubernetes service such as `<service-name>-prod.svc.cluster.local`. Kubernetes internally routes the request to a pod grouped under a service name like `upload`, `encoder`, `chunkmetadata`, etc.
  3. The Kubernetes service, on receiving the request, forwards it to a pod.


Client Uploading a File


  1. The client divides a 5 GB file into 320 chunks of 16 MB each and sends each chunk to the server. The load balancer at the edge receives the request, terminates SSL, and routes it to the API gateway in the nearest datacenter.
  2. The API gateway authenticates and authorizes the request and forwards it to the DNS endpoint of the Kubernetes service for uploading chunks. The Kubernetes service load balances the request to a microservice running in a pod.
  3. The microservice, the chunk coordinator service, receives the request and asks the placement service to decide where to place the chunk. The placement service returns the chunk location based on chunk metadata.
  4. The chunk coordinator asks the encoder service to encode the chunk (Reed-Solomon encoder).
  5. The erasure coding service computes parity across all chunks when the client makes the finish API call.
  6. The chunk coordinator asks the chunk service to write the chunk, including parity chunks, to the location suggested by the placement service. The chunks are distributed across different servers in a rack, different disks within a server, and different datacenters.
  7. Once all chunks are written, the content metadata is constructed and updated in the database. Only when all chunks are complete is the content available for read; otherwise it is considered partial, and the unwritten chunks remain on the client device that started the upload (a sketch of this write path follows the list).
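A condensed sketch of the coordinator's write path from steps 3-7, with the collaborating services shown as injected dependencies. The service interfaces (`placement`, `encoder`, `storage`, `metadata_db`) are assumptions for illustration:

```python
def handle_chunk_upload(chunk_meta: dict, data: bytes,
                        placement, encoder, storage, metadata_db) -> None:
    """Chunk coordinator: place, encode, write, then record one chunk."""
    location = placement.pick_locations(chunk_meta)        # where to put data + parity
    encoded = encoder.reed_solomon_encode(data)            # step 4: encode the chunk
    storage.write(location, encoded)                       # step 6: spread across nodes/disks
    metadata_db.upsert_chunk({**chunk_meta, "location": location, "status": "written"})

def handle_finish(content_id: str, metadata_db) -> None:
    """On the finish call, mark the content readable only if every chunk landed."""
    chunks = metadata_db.chunks_for(content_id)
    if all(c["status"] == "written" for c in chunks):
        metadata_db.set_content_status(content_id, "available")
    else:
        metadata_db.set_content_status(content_id, "partial")   # stays invisible to readers
```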


Handling chunk failures

When the client decides to upload content, it makes a call to the `start` upload API, which returns a session key with an expiry period, say 48 hours. The client presents this session key in every chunk upload request and in the final `finish` API call. Only when the client invokes `finish` with all the chunks does the Dropbox server consider the content upload complete and make it available for read/write; until then, the content metadata shows the upload as in progress to the user who started it.

The client always caches the chunks of the file, their metadata, and the progress of the upload. This is verified periodically by invoking the chunk metadata API, which returns what Dropbox has received so far. If a chunk upload fails, or the device goes offline or shuts down, the local cache on the client is updated with the upload status of that chunk. When the client is back online, a background process downloads the latest chunk metadata for the content, compares it with the local cache, and retries the chunk uploads accordingly. If the client device was offline and the session has expired:

  1. the client can re-issue the session token, which extends the validity of the session key; however, this is limited to a configurable 1 or 2 extensions
  2. the session key has a maximum number of attempts and an overall maximum duration for the upload; if either is exceeded, the partial content is removed from the Dropbox server. This handles partial content or residue on the storage nodes.
  3. when a chunk upload fails midway, the Dropbox server compares the checksum in the chunk metadata with what was actually received; this identifies the partial chunk, which is removed, and the client has to upload that chunk again (a resume sketch follows this list).
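A sketch of the client-side resume logic after coming back online: compare the server's view of received chunks against the local cache, re-upload anything missing or checksum-mismatched, and renew the session only a bounded number of times. The `session` and `api` objects and their methods are hypothetical placeholders:

```python
MAX_SESSION_RENEWALS = 2

def resume_upload(content_id: str, session, local_chunks: list[dict], api) -> None:
    """Retry only the chunks the server has not confirmed receiving."""
    if session.expired():
        if session.renewals >= MAX_SESSION_RENEWALS:
            api.abort(content_id)            # server drops the partial content
            return
        session = api.renew_session(session) # bounded extension of the session key

    received = {c["sequence"]: c["checksum"] for c in api.get_chunks(content_id)}
    for chunk in local_chunks:
        # Re-upload anything the server lacks or received corrupted.
        if received.get(chunk["sequence"]) != chunk["checksum"]:
            api.upload_chunk(session, chunk)

    api.finish(session, content_id)          # only now does the content become readable
```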


Detailed component design





Trade offs/Tech choices



  1. Chunk/content metadata storage - a Cassandra cluster for fast and consistent writes.
  2. Blob/file storage - a cluster of block storage servers with onboard/attached storage disks: rack servers with disk arrays. Onboard disk arrays are faster than externally attached storage such as WD JBOD or Dell PowerVault. Software such as Cloudian HyperStore is installed and clustered.
  3. Microservices - deployed in Kubernetes and grouped by function: a chunk metadata service that reads metadata, a chunk service that writes chunks, a chunk encoding service that encodes a chunk, a chunk encryption service that encrypts/decrypts a chunk, etc.
  4. Load balancer - an edge L7 load balancer that forwards requests to the API gateway nearest to the client.
  5. API gateway - Kong, Apigee, etc. that routes traffic to the Kubernetes services.
  6. Kubernetes - all microservices are deployed in Kubernetes; each pod runs a stateless instance of a microservice and can be scaled up/down based on traffic.
  7. Database choices
    1. Chunk metadata - a key/value store - Cassandra
    2. Content metadata - relational data: users to files (one to many), folders to the files inside them. Each content row has a folder attribute; null means the root folder. Folder is a separate table with its own permissions and identifier. A clustered Postgres works well.
    3. Sharing metadata - relational data: folders to files, permissions, target user ids, etc. Postgres works fine.
    4. Monitoring metrics - Grafana + InfluxDB works well, or RRD files for plotting charts.




Failure scenarios/bottlenecks



Failures in Kubernetes Environment

Kubernetes ensures resilience and fault tolerance through self-healing, declarative management, and distributed orchestration. Kubernetes continuously reconciles "what is desired vs. what is running right now". This covers failures such as:

  1. Pod crashes - Kubernetes restarts the pod
  2. Node failure - Kubernetes spins up the pods on another node
  3. Auto-scaling pods - if the deployment desires 5 replicas and only 2 are running, 3 more are deployed
  4. Service failures - Kubernetes performs health checks and auto-restarts; when a container crashes or stops responding, it is restarted
  5. Readiness probes - check the health of the service running in the container; if it is found unhealthy, the service is restarted
  6. The microservices running in containers are stateless, meaning another instance of the microservice can pick up the workload and finish it


Partial/Orphan chunks

There are scenarios where clients lose network connectivity and leave partial chunks behind. A chunk monitoring job looks for these partial chunks; however, this introduces database I/O to remove them from the filesystem and update the chunk metadata. A partial chunk is identified by comparing the checksum declared in the request header with what the Dropbox server actually received. Such a chunk upload API call results in an error and the client is asked to retry.


Corrupted Chunks - Repair

Chunks are verified for correctness; if a chunk is found to be corrupted, the repair job reconstructs it from the remaining data and parity chunks.


Resilient Block Storage Layer

  1. The storage layer is spread across different datacenters and regions. The chunks are spread across different storage server nodes within a datacenter and are also replicated to another datacenter for disaster recovery.
  2. Erasure coding in the storage layer helps recover data for each content item and is cheaper than replication; however, a copy of all chunks is retained in another datacenter and region for disaster recovery. Once the datacenter or storage node is back online, the content is replicated back to it.


Resilience in Database Layer

  1. The Cassandra DB is clustered across multiple datacenters and regions. Writes are fast since it uses LSM trees and SSTables, and the metadata is replicated to multiple nodes across datacenters and regions. Reads are a bit slower since Cassandra may have to look through several SSTables on disk to find the latest write, but this ensures data reliability and consistency.


Future improvements


  1. Security - scan uploaded content for malware
  2. Encrypt data at rest
  3. Integrity - checksum verification at client and server
  4. Organising files into folders, classifying, grouping, tagging, etc.