System requirements


Functional:

store files

retrieve files

share files with others

look at file history

search for files user owns


Non-Functional:

high availability

good performance

security and encryption (so unauthorized users cannot see another person's personal files)

scalability

durability (prevent data loss, redundancy of stored files in different failure zones, etc)


Capacity estimation

10 million registered users

roughly 2-3 million DAU


qps:

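each user might upload 5-10 files a week, so with ~2M active users that is roughly 15M uploads weekly, or about 25 uploads/s on average.

downloads ~ 3 downloads per user per week, so ~ 6M downloads weekly, or about 10 downloads/s on average.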


storage:

avg file size 500 KB -> ~7.5 TB added weekly (15M uploads x 500 KB)

avg user uses 2 GB -> 20 PB across 10M registered users
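
The arithmetic behind these estimates can be checked with a short script. This is a rough sketch that assumes decimal units (1 KB = 1000 bytes) and load spread evenly over the week; real traffic will peak well above these averages.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600        # 604,800

uploads_per_week = 15_000_000
downloads_per_week = 6_000_000
avg_file_bytes = 500 * 1000             # 500 KB, decimal units (assumption)
registered_users = 10_000_000
avg_user_bytes = 2 * 1000**3            # 2 GB

upload_qps = uploads_per_week / SECONDS_PER_WEEK
download_qps = downloads_per_week / SECONDS_PER_WEEK
weekly_growth_tb = uploads_per_week * avg_file_bytes / 1000**4
total_storage_pb = registered_users * avg_user_bytes / 1000**5

print(f"average upload QPS:   {upload_qps:.1f}")
print(f"average download QPS: {download_qps:.1f}")
print(f"weekly growth (TB):   {weekly_growth_tb}")
print(f"total storage (PB):   {total_storage_pb}")
```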


sharing:

average shared file accessed by 5-10 users.


API design


We need API endpoints for the following:

upload_file: POST files/ with user details in parameters and byte content for the file.

status codes for uploads: 200 for successful upload, 400 for incorrect params or an ineligible upload (413 is the more specific code for a file that is too large; an upload over the storage quota could stay a 400), and 401/403 for an unauthenticated or unauthorized request


this upload endpoint also needs to handle conflicting writes from multiple users (e.g. between owner and someone else with write permission), perhaps using last-write-wins semantics, or by accepting both copies and requiring the user to pick one later down the line
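
One way to detect a conflicting write is for each upload to carry the version number the writer last saw. The sketch below is only an illustration of the two policies named above; `FileRecord`, `apply_write`, and the `base_version` parameter are hypothetical names, not part of the API described here.

```python
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    versions: list = field(default_factory=list)   # accepted versions, oldest first
    conflicts: list = field(default_factory=list)  # copies awaiting a user decision

def apply_write(record, base_version, content, policy="last_write_wins"):
    """base_version is the version the writer last saw (0 for a new file)."""
    current = len(record.versions)
    if base_version == current or policy == "last_write_wins":
        # No concurrent change, or the policy lets the newest write win.
        record.versions.append(content)
        return "accepted"
    # "keep_both": park the stale write so the user can pick one later.
    record.conflicts.append(content)
    return "conflict"

rec = FileRecord()
first = apply_write(rec, 0, b"owner edit")
second = apply_write(rec, 0, b"collaborator edit", policy="keep_both")
```

Here the second write was based on version 0 while the file is already at version 1, so under the keep-both policy it lands in `conflicts` instead of overwriting the owner's edit.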


download_file: GET files/{file_id}, might specify optional version number otherwise the latest is fetched and downloaded, 404 for garbage collected or nonexistent file, 403 for forbidden/wrong permissions.

view_files: GET users/{user_id}/files, gets list of files this user owns paginated by last-modified/alphabetical/other reasonable options


access_control/sharing: POST files/{file_id}/sharing updates sharing and permissions for a given user and file combo. Params include the requesting user_id and authentication, the list of users to share with, and a mode (adding or removing permission/visibility).


Database design

Probably picking nosql (like a wide column database like Bigtable), though a relational database could also work. Reasons for preferring nosql might be built-in horizontal scalability, support for semi-structured metadata, and faster read/write performance for single-file operations.

Reasons for preferring a relational database instead could be better query support and easier joins, and ACID properties to help provide native transactions/locking/isolation.


With the feature set in mind, going with nosql for now.


We need a Files/Metadata table and Users table.

Actual file content will be stored in bucket storage like S3 and referred to via url/file pointers in the metadata table.


Important fields will be file_id, file name, upload time (to help differentiate different file versions), owner/user_id, permissions (like an ACL data structure), file size, object storage location, active_state (like deleted or not), and the list of users this file was shared with. Other nice-to-have optional fields might include a TTL for the file (for temporary files) and a download count (for analytics).


Users table might have user_id, name, email, password hash + salt, and account creation date.


Depending on overall design, we may even want to extract Shares into their own table, representing the state of a particular file share with other users. E.g. a share_id, and sharer and sharee user_ids along with whatever access controls or time limits are set.
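
As a rough sketch, the three record types described above could look like the following. Field names follow the lists above where possible; the exact types, the shape of `acl`, and the `access` values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class FileMetadata:
    file_id: str
    file_name: str
    owner_id: str
    upload_time: datetime            # also distinguishes file versions
    size_bytes: int
    storage_url: str                 # pointer into object storage
    acl: dict = field(default_factory=dict)  # user_id -> "read" | "write" (assumed shape)
    deleted: bool = False            # active_state
    ttl_seconds: Optional[int] = None
    download_count: int = 0

@dataclass
class User:
    user_id: str
    name: str
    email: str
    password_hash: str
    salt: str
    created_at: datetime

@dataclass
class Share:
    share_id: str
    file_id: str
    sharer_id: str
    sharee_id: str
    access: str = "read"                   # assumed values: "read" or "write"
    expires_at: Optional[datetime] = None  # None means no time limit
```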


High-level design


In the user upload file flow, we have a client that will talk to an Uploads frontend service that will process a file upload. The user will provide some important parameters like file name or path (if we support directory structure) and initial permissions/users to share with, and also the actual content of the file. For simplicity right now we can assume files will be small enough to do single-shot uploads, but we can support multi-part upload for large files if necessary by extending the API, at the cost of some added complexity.


That frontend service will do some important computations and queries against the Metadata database: 1) check whether the file already exists; if so, create a new version and update the existing DB entry to point to it as the most recent one, otherwise create a new Metadata row in the DB and populate all the fields, 2) send a request to object storage to store the data and get a URL back, and 3) write to the Shares table for the N people that this file was shared with.
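
The three steps can be sketched with in-memory dictionaries standing in for the Metadata database, object storage, and the Shares table. All names here (`metadata_db`, `put_object`, `upload`) are hypothetical, and the sketch stores the bytes before writing metadata so that a metadata row never points at a missing object; that ordering is one possible choice, not something the design above fixes.

```python
import uuid

metadata_db = {}    # (owner_id, file_name) -> list of version dicts
object_store = {}   # url -> bytes, stand-in for S3-style bucket storage
shares_db = []      # (file_name, owner_id, sharee_id) rows

def put_object(content: bytes) -> str:
    """Store the bytes and return a URL-like pointer."""
    url = f"mem://objects/{uuid.uuid4()}"
    object_store[url] = content
    return url

def upload(owner_id, file_name, content, share_with=()):
    url = put_object(content)                                   # step 2
    versions = metadata_db.setdefault((owner_id, file_name), [])  # step 1
    versions.append({"version": len(versions) + 1,
                     "url": url,
                     "size": len(content)})
    for sharee_id in share_with:                                # step 3
        shares_db.append((file_name, owner_id, sharee_id))
    return versions[-1]["version"]

v1 = upload("u1", "notes.txt", b"hello", share_with=["u2", "u3"])
v2 = upload("u1", "notes.txt", b"hello again")
```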


We'll also use a microservices architecture so each service can be tuned and scaled independently, and have a separate Sharing service that handles all the sharing aspects of an initial file upload. The Upload service will send this service the details of which user is sharing with which other user (or revoking permission) through a message broker, so this can be done asynchronously (decoupling it from the critical hot path). This is fine because users can tolerate a little more latency for shares.


We will also have a Downloads service that will fetch the file metadata, do authn/authz checks, and fetch the content from object storage. The Downloads service will try to fetch the latest 'clean' or finalized file upload version, to avoid data corruption in cases of read/write races.
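
A minimal sketch of that lookup follows, assuming each version row carries a `finalized` flag that is set only once the upload has fully landed. The dictionary layout and the exception names are illustrative; they map loosely onto the 403 and 404 responses in the API section.

```python
class Forbidden(Exception):
    """Caller lacks permission (maps to a 403)."""

class NotFound(Exception):
    """File or version is missing or deleted (maps to a 404)."""

def resolve_download(file_meta, user_id, version=None):
    """Return the storage URL of the requested or latest finalized version."""
    if file_meta is None or file_meta.get("deleted"):
        raise NotFound()
    if user_id != file_meta["owner_id"] and user_id not in file_meta["acl"]:
        raise Forbidden()
    # Skip in-flight uploads so readers never see a partial write.
    candidates = [v for v in file_meta["versions"] if v["finalized"]]
    if version is not None:
        candidates = [v for v in candidates if v["version"] == version]
    if not candidates:
        raise NotFound()
    return max(candidates, key=lambda v: v["version"])["url"]

meta = {
    "owner_id": "u1",
    "acl": {"u2": "read"},
    "deleted": False,
    "versions": [
        {"version": 1, "url": "mem://v1", "finalized": True},
        {"version": 2, "url": "mem://v2", "finalized": False},  # still uploading
    ],
}
url_for_u2 = resolve_download(meta, "u2")
```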


A cache could sit in front of the Downloads service for popular metadata entries, and a CDN could help serve popular static content (rather than the slow fetch from object storage). Introduction of a cache would add some complexity though, due to the need for invalidation when a file changes.


Finally the client can also call a View Files service that queries an index on owner user_id field so we can quickly retrieve the files that a particular user owns. This service should support pagination and filtering for a better user experience, so the index might need additional columns like last_modified_timestamp or MIME type.
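
Pagination over that index could use a cursor on (last_modified, file_id) rather than an offset, so each page is a bounded range read. The sketch below sorts an in-memory list to stand in for the index; `list_files` and its parameters are hypothetical names.

```python
def list_files(files, owner_id, page_size=2, cursor=None):
    """Return (page, next_cursor) for one owner, newest first.

    cursor is the (last_modified, file_id) of the last row already seen.
    """
    owned = sorted(
        (f for f in files if f["owner_id"] == owner_id),
        key=lambda f: (f["last_modified"], f["file_id"]),
        reverse=True,
    )
    if cursor is not None:
        owned = [f for f in owned
                 if (f["last_modified"], f["file_id"]) < cursor]
    page = owned[:page_size]
    has_more = len(owned) > page_size
    next_cursor = (page[-1]["last_modified"], page[-1]["file_id"]) if has_more else None
    return page, next_cursor

files = [
    {"file_id": "a", "owner_id": "u1", "last_modified": 100},
    {"file_id": "b", "owner_id": "u1", "last_modified": 300},
    {"file_id": "c", "owner_id": "u2", "last_modified": 250},
    {"file_id": "d", "owner_id": "u1", "last_modified": 200},
]
page1, cur = list_files(files, "u1")
page2, end = list_files(files, "u1", cursor=cur)
```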


Offline processes:

We need a way to garbage collect TTLed files or expired time-based shares. One good way to accomplish this is to model these as events (file expiration, share expiration), put them on a message queue that supports scheduled delivery, and have a small worker farm process those events (by writing/updating the relevant DB entry).
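
To illustrate the idea, a priority queue ordered by due time can stand in for a broker with scheduled delivery. This is a single-process sketch with hypothetical names (`ExpiryQueue`, `process`); a real deployment would use a durable queue so events survive restarts.

```python
import heapq

class ExpiryQueue:
    """Stand-in for a message queue with scheduled delivery."""

    def __init__(self):
        self._heap = []

    def schedule(self, due_at, kind, target_id):
        heapq.heappush(self._heap, (due_at, kind, target_id))

    def pop_due(self, now):
        """Return every event whose due time has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap))
        return due

def process(events, files, shares):
    """Worker logic: apply each expiration to the relevant DB entry."""
    for _, kind, target_id in events:
        if kind == "file_expired":
            files[target_id]["deleted"] = True
        elif kind == "share_expired":
            shares.pop(target_id, None)

files = {"f1": {"deleted": False}, "f2": {"deleted": False}}
shares = {"s1": {"file_id": "f2"}}
queue = ExpiryQueue()
queue.schedule(100, "file_expired", "f1")
queue.schedule(150, "share_expired", "s1")
queue.schedule(900, "file_expired", "f2")
process(queue.pop_due(now=200), files, shares)
```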

There should also be some metadata consistency checker pipelines that periodically check for potential inconsistencies or incomplete uploads; these would either trigger alerts for oncalls or expose options to the user to do data recovery (say, restore an older version or retry an upload) if the service were unable to automatically recover.


Request flows







Detailed component design






Trade offs/Tech choices



Database choice: mentioned above but nosql offers native horizontal scalability and more flexibility around development and metadata modeling. Some wide-column database types offer some useful features like transactions support and indexing, but more complex joins and queries will still be difficult. I anticipate that to be ok, since our current feature set doesn't rely on those, and we can always ETL data snapshots from our DBs into an OLAP engine for more complex business analysis or for less realtime data needs, like storage estimates or user quota computation.


Talking about Deduping:

One important followup on the design is to implement some deduping mechanism to save on storage costs. Even if an object storage provider dedupes data internally, we should not rely on that: for proper security we'd like to encrypt the data per user, so identical files from two users would be sent as different bytes, which the provider cannot dedupe and which we would still pay to upload and store twice. Rather, we should compute file checksums ourselves, store them along with the metadata, and only call the object storage upload if the file checksum doesn't already exist somewhere in our metadata. This likely means extracting the checksum -> object storage file pointer mapping into a separate table, since it's not scalable to scan every Metadata table entry for a duplicate.
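
A minimal sketch of checksum-based deduping, with a dictionary standing in for the separate checksum table. SHA-256 is an assumed choice of checksum, and `checksum_index`, `blob_store`, and `store_deduped` are hypothetical names.

```python
import hashlib

checksum_index = {}   # sha256 hex digest -> object storage pointer
blob_store = {}       # pointer -> bytes, stand-in for object storage

def store_deduped(content: bytes):
    """Return (pointer, uploaded) and upload only for unseen content."""
    digest = hashlib.sha256(content).hexdigest()
    pointer = checksum_index.get(digest)
    if pointer is not None:
        return pointer, False            # duplicate: reuse the existing object
    pointer = f"mem://blobs/{digest}"
    blob_store[pointer] = content        # the only place an upload happens
    checksum_index[digest] = pointer
    return pointer, True

p1, uploaded1 = store_deduped(b"quarterly report")
p2, uploaded2 = store_deduped(b"quarterly report")
p3, uploaded3 = store_deduped(b"something else")
```

The lookup is a single key read on the checksum table, which avoids the full Metadata scan mentioned above.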


Scaling

How do we overcome scaling challenges? First, the bottlenecks around performance and scalability revolve around the database and how well we can load balance the user requests (uploads, downloads, view_files, sharing). For very popular files, we should put caching in front of the data fetch and disable modifications (to avoid hotspotting on a single metadata entry). We might also want to throttle uploads to alleviate some of the lock contention on multi-user updates to the same file, and space out or batch up the updates on the client to reduce load on the services. As mentioned before, pagination and filtering by default can reduce load and improve performance on the table index scans, compared to fetching every file a user owns (which they likely won't need all at once).

Finally, the database could be replicated (for read only use cases) and/or sharded across multiple hosts, based on some reasonable partition like hash(user_id), to more evenly distribute load from the upload and download services.
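
The hash(user_id) partitioning could be sketched as below. It uses a stable hash from `hashlib` rather than Python's built-in `hash()`, which is randomized per process for strings; the shard count of 8 and the name `shard_for` are assumptions for illustration.

```python
import hashlib

def shard_for(user_id: str, num_shards: int = 8) -> int:
    """Map a user to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Count how 10,000 synthetic users spread over the shards.
counts = [0] * 8
for i in range(10_000):
    counts[shard_for(f"user-{i}")] += 1
```

Plain modulo hashing moves most keys when the shard count changes, so a scheme such as consistent hashing may be preferable if resharding is expected.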


Failure scenarios/bottlenecks

We want to avoid data loss as much as possible. That means there should be redundancy at many levels:

  • file metadata should be replicated: the DB should have replicas across multiple failure zones. Because of this requirement it might make sense to set up leaderless replication with quorum reads and writes.
  • using object storage helps with actual data integrity, since providers like S3 and GCS offer high levels of durability due to the internal hardening/redundancy in those systems. However, we need to be careful to select an appropriate storage policy and not simply the cheapest (which might be single-zone and thus not durable enough), or go through extra work, like in S3, to set up cross-region replication.
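
For the leaderless replication mentioned in the first bullet, the usual quorum condition is that the write quorum W and read quorum R overlap across N replicas, i.e. W + R > N. The helper names below are hypothetical; this is a small sketch for reasoning about configurations, not a replication implementation.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return w + r > n

def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that can be down while both reads and writes still reach quorum."""
    return n - max(w, r)

# Three replicas, one per failure zone, with majority reads and writes.
ok = quorum_overlaps(n=3, w=2, r=2)
spare = tolerated_failures(n=3, w=2, r=2)
```

With N=3 and W=R=2, reads overlap the latest acknowledged write and one zone can be lost without blocking either path.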

Even beyond hardware and network faults, we need to beware of software induced failures and data loss. That means we need to define a well-exercised and battle tested data backup and recovery plan, and regularly test Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for our services. Untested backups == no backups. Oncallers should also be familiar with the process so catastrophic data loss doesn't cause extended downtime due to unfamiliarity with the recovery procedures.


We ought to have good monitoring and observability across the stack. Storage usage and capacity management is important, especially for a service as large as Dropbox with an expected 20 PB of data (and room for growth). The temperature of files (e.g. how fresh or recently accessed they are) opens up room for lowering object storage costs by downgrading from the normal access policy to a cold storage policy, and user monitoring can help move inactive user metadata onto disk-based storage on the DBs (vs. keeping it in memory or flash). Monitoring is also vital for ensuring the health of the service (error rates, avg and tail latencies) and tracking utilization, and can be helpful in dynamically adjusting compute capacity or cache sizes at peak vs. trough usage.
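
As one illustration of using file temperature, a periodic job could pick a storage tier from the last access time. The 90-day threshold and the tier names are assumptions chosen for the example; real thresholds would come from access-pattern data and provider pricing.

```python
from datetime import datetime, timedelta

def pick_storage_tier(last_accessed: datetime, now: datetime,
                      cold_after_days: int = 90) -> str:
    """Return "cold" for files untouched longer than the threshold."""
    if now - last_accessed > timedelta(days=cold_after_days):
        return "cold"
    return "standard"

now = datetime(2024, 6, 1)
recent = pick_storage_tier(datetime(2024, 5, 20), now)
stale = pick_storage_tier(datetime(2023, 12, 1), now)
```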


Logging is important for debugging and audit trails, and for overall analytics (product development and feature strategy).


Finally, load balancing distributes incoming requests evenly across multiple server instances and reduces single points of failure. We'd want to put load balancing before every frontend service that responds to users.


Future improvements

Many directions to take this system and improve it.

  • reliability and hardening features
  • performance enhancements, tuning UI responsiveness and profiling avg and tail latencies for upload/download requests
  • security features, data governance, rich permissions and permission inheritance
  • better collaboration features and support for intelligent conflict resolution