System requirements


Functional:

  1. User Management:
     • Create an account with a unique username and a valid email address.
     • Log in securely with proper authentication mechanisms.
     • Log out from the system to terminate the session.
  2. File Operations:
     • Upload files to the user's account.
     • Download files from the user's account.
     • Create, move, rename, and delete folders.
     • Move, rename, and delete files.
     • Access and manage previous versions of files.
  3. Synchronization:
     • Automatically synchronize files across multiple devices in real time.
     • Ensure changes made on one device are reflected on all other connected devices.
  4. Sharing and Collaboration:
     • Share files/folders securely with other users.
     • Collaborate in real time on shared files.
     • Set permissions for shared items (view-only, edit, etc.).
  5. File Search:
     • Search for files/folders based on keywords.
     • Provide accurate and fast search results.


Non-Functional:

  1. Security:
     • Implement robust encryption for data transmission and storage.
     • Regularly update security protocols to protect against emerging threats.
     • Monitor and log user activities for auditing and security purposes.
  2. Scalability:
     • Design the system to handle a growing number of users and files.
     • Scale the infrastructure horizontally to accommodate increased load.
  3. Performance:
     • Ensure low-latency file uploads and downloads.
     • Optimize search algorithms for quick and efficient results.
     • Minimize synchronization delay between devices.
  4. Reliability:
     • Implement regular backups and data recovery mechanisms.
     • Provide system availability with minimal downtime for maintenance.
  5. Compatibility:
     • Support a variety of file types and sizes for uploading and downloading.
     • Ensure compatibility with popular operating systems and browsers.
  6. Compliance:
     • Comply with data protection regulations and privacy laws.
     • Maintain transparency in terms of data usage and storage policies.
  7. Availability:
     • Design the system with high availability to minimize service downtime.
     • Implement redundant systems and failover mechanisms to ensure continuous service.



Capacity estimation


Total number of users = 500 million

Total number of daily active users = 100 million

The average number of files stored by each user = 200

The average size of each file = 1 MB

Total number of active connections per minute = 1 million


Storage Estimations:


Total number of files = 500 million * 200 = 100 billion

Total storage required = 100 billion * 1 MB = 100 PB

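Server Estimations:

Assuming one server can handle 1,000 concurrent connections, and treating the 1 million active connections per minute as the peak concurrent load, we would need 1,000,000 / 1,000 = 1,000 servers.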
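The arithmetic above can be reproduced with a short script. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and it assumes decimal units (1 PB = 10^9 MB).

```python
# Back-of-the-envelope capacity estimate using the figures stated above.
# All names are illustrative; decimal units are assumed (1 PB = 10^9 MB).
TOTAL_USERS = 500_000_000
FILES_PER_USER = 200
AVG_FILE_SIZE_MB = 1
ACTIVE_CONNECTIONS = 1_000_000
CONNECTIONS_PER_SERVER = 1_000

MB_PER_PB = 1_000_000_000


def estimate_capacity():
    """Return (total files, storage in PB, number of servers)."""
    total_files = TOTAL_USERS * FILES_PER_USER
    storage_pb = total_files * AVG_FILE_SIZE_MB / MB_PER_PB
    servers = ACTIVE_CONNECTIONS // CONNECTIONS_PER_SERVER
    return total_files, storage_pb, servers


if __name__ == "__main__":
    files, storage_pb, servers = estimate_capacity()
    print(f"files={files:,} storage={storage_pb:.0f} PB servers={servers:,}")
```

Changing any input, for example the average file size, shows immediately how sensitive the storage figure is to that assumption.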



API design

1. User Authentication API:

  • Description: This API handles user authentication, allowing users to securely log in and obtain access tokens.
  • Input: User credentials (username, password).
  • Output: Access token or an error message.

2. File Upload API:

  • Description: Enables users to upload files to their accounts.
  • Input: File data, user authentication token.
  • Output: Confirmation of successful upload or an error message.

3. File Download API:

  • Description: Allows users to download files from their accounts.
  • Input: File identifier, user authentication token.
  • Output: Downloaded file data or an error message.

4. File Management API:

  • Description: Provides functionality to manage files and folders (create, move, rename, delete).
  • Input: File/folder details, user authentication token.
  • Output: Confirmation of the operation or an error message.

5. File Synchronization API:

  • Description: Ensures synchronization of files across multiple devices in real-time.
  • Input: User authentication token, device identifier, file changes.
  • Output: Confirmation of synchronization status or an error message.

6. Sharing and Collaboration API:

  • Description: Facilitates secure sharing of files/folders and collaboration between users.
  • Input: Shared item details, user authentication token.
  • Output: Confirmation of successful sharing or an error message.

7. Version Control API:

  • Description: Manages access to previous versions of files.
  • Input: File identifier, version details, user authentication token.
  • Output: Previous version of the file or an error message.

8. File Search API:

  • Description: Allows users to search for files/folders based on keywords.
  • Input: Search query, user authentication token.
  • Output: List of search results or an empty result set.
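To make the input/output contracts above concrete, here is a minimal in-memory sketch of the authentication, upload, and download APIs. Everything in it is an assumption made for illustration: the class name, the method names, the dictionary-shaped responses, and the choice of PBKDF2 for password hashing are not prescribed by this design, and a real service would sit behind HTTP endpoints with persistent storage.

```python
import hashlib
import hmac
import secrets


class FileServiceSketch:
    """Hypothetical in-memory stand-in for the auth, upload, and download APIs."""

    def __init__(self):
        self._credentials = {}  # username -> (salt, password hash)
        self._tokens = {}       # access token -> username
        self._files = {}        # (username, file_id) -> file bytes

    @staticmethod
    def _hash(password, salt):
        # Salted PBKDF2 hash so plaintext passwords are never stored.
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def register(self, username, password):
        if username in self._credentials:
            return {"error": "username already taken"}
        salt = secrets.token_bytes(16)
        self._credentials[username] = (salt, self._hash(password, salt))
        return {"status": "created"}

    def login(self, username, password):
        record = self._credentials.get(username)
        if record is None or not hmac.compare_digest(
            record[1], self._hash(password, record[0])
        ):
            return {"error": "invalid credentials"}
        token = secrets.token_hex(16)  # opaque access token
        self._tokens[token] = username
        return {"access_token": token}

    def upload(self, token, file_id, data):
        user = self._tokens.get(token)
        if user is None:
            return {"error": "unauthorized"}
        self._files[(user, file_id)] = data
        return {"status": "uploaded", "file_id": file_id}

    def download(self, token, file_id):
        user = self._tokens.get(token)
        if user is None:
            return {"error": "unauthorized"}
        data = self._files.get((user, file_id))
        if data is None:
            return {"error": "file not found"}
        return {"file_id": file_id, "data": data}
```

Every file operation takes the access token returned by `login`, which mirrors the "user authentication token" input listed for each API. The remaining APIs (management, synchronization, sharing, versioning, search) would follow the same pattern.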




Database design

For the tables required in this design, refer to the class diagram. The list of classes is not exhaustive, but it is a good set of tables to start with.



Database Choice

  1. User Data:
     • Database Type: Relational Database (e.g., PostgreSQL, MySQL)
     • CAP Focus: Balanced (Consistency and Availability)
     • Reasoning: User data often requires a balance between consistency (ensuring accurate and up-to-date user information) and availability (ensuring users can access the system). Relational databases are designed to provide a balanced approach.
  2. File Metadata and Sharing Data:
     • Database Type: Relational Database (e.g., PostgreSQL, MySQL)
     • CAP Focus: Balanced (Consistency and Availability)
     • Reasoning: Similar to user data, file metadata and sharing data benefit from a balanced approach to ensure that users see accurate and consistent information while still allowing for system availability.
  3. Search Data:
     • Database Type: Search Engine (e.g., Elasticsearch)
     • CAP Focus: Availability
     • Reasoning: Search functionality benefits from a focus on availability, allowing users to retrieve search results quickly. Search engines like Elasticsearch are optimized for distributed and scalable search operations.
  4. Audit Logs and Version History:
     • Database Type: Relational Database (e.g., PostgreSQL, MySQL) or NoSQL Database (e.g., MongoDB)
     • CAP Focus: Depends on the use case
     • Reasoning: Depending on the specific requirements, the focus may vary. For strict consistency, a relational database may be suitable. If flexibility and availability are prioritized, a NoSQL database could be preferred.
  5. File Chunks:
     • Storage Service: Amazon S3 or similar object storage service
     • CAP Focus: Availability
     • Reasoning: Cloud-based object storage services are specifically designed for storing large volumes of binary data, offering high availability, durability, and scalability. They are optimized for read and write operations and provide low-latency access to stored objects.
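As an illustration of how file chunks might be laid out in object storage, the sketch below splits a file into fixed-size chunks and stores each one under its SHA-256 digest. The fixed chunk size, the content-addressed keys, and the plain dictionary standing in for an object store such as S3 are all assumptions for this example, not decisions made in this design.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB chunk size


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Split raw file bytes into consecutive fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_file(data, object_store, chunk_size=CHUNK_SIZE):
    """Store each chunk under its SHA-256 digest and return the ordered key list.

    `object_store` is a dict standing in for an object storage bucket.
    """
    manifest = []
    for chunk in split_into_chunks(data, chunk_size):
        key = hashlib.sha256(chunk).hexdigest()
        object_store.setdefault(key, chunk)  # identical chunks are stored once
        manifest.append(key)
    return manifest


def load_file(manifest, object_store):
    """Reassemble a file from its ordered list of chunk keys."""
    return b"".join(object_store[key] for key in manifest)
```

The manifest (the ordered list of chunk keys) is the kind of record that would live in the file metadata database, while the chunk bytes live in object storage. Keying by content hash also means an unchanged chunk does not need to be uploaded again when a file is edited.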


Data Partitioning:

  • Strategy: Hash-Based Partitioning
  • Explanation: For file hosting services, hash-based partitioning is often a suitable strategy. It evenly distributes data across multiple partitions based on a hash function applied to a chosen key (e.g., user ID, file ID). This ensures a balanced distribution of data and efficient retrieval.
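A minimal sketch of hash-based partitioning, assuming string keys and a fixed partition count. The function name and the choice of SHA-256 are illustrative; any stable, well-distributed hash would serve.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key (e.g. a user ID or file ID) to a partition index.

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings and would not give repeatable placement.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    for user_id in ("user-1", "user-2", "user-3"):
        print(user_id, "->", partition_for(user_id, 8))
```

One caveat worth keeping in mind: with plain modulo placement, changing the number of partitions remaps most keys, which is why consistent hashing is a common refinement when partitions are added or removed often.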


Regional or Geographical Partitioning:

  • Applicability: Not necessary initially, but consider for scalability and performance optimization.
  • Explanation: Initially, a global approach may be sufficient. However, as the user base grows and the service expands globally, you might consider regional or geographical partitioning. This can enhance performance by placing data closer to users and addressing data residency and compliance requirements.

Sharding Strategy:

  • Strategy: Range-Based Sharding
  • Explanation: Range-based sharding involves dividing the dataset into ranges based on a specific criterion (e.g., user IDs, file IDs). This can be effective for tables that are expected to grow significantly, such as file chunks or version history. Each shard can then handle a specific range of data, enabling horizontal scalability.
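Range-based shard lookup can be sketched with a sorted list of range boundaries and a binary search. The boundaries below are hypothetical: they assume numeric user IDs from 0 up to 500 million, split into five equal ranges.

```python
import bisect

# Hypothetical exclusive upper bounds of each shard's user-ID range.
# Shard 0 holds IDs [0, 100M), shard 1 holds [100M, 200M), and so on.
SHARD_UPPER_BOUNDS = [
    100_000_000,
    200_000_000,
    300_000_000,
    400_000_000,
    500_000_000,
]


def shard_for_user(user_id):
    """Return the index of the shard whose range contains user_id."""
    if user_id < 0 or user_id >= SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"user_id {user_id} is outside every shard range")
    # bisect_right counts the bounds that are <= user_id, which is
    # exactly the index of the shard whose range contains the ID.
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, user_id)
```

Because neighbouring IDs land on the same shard, range scans stay local to one shard, but sequentially assigned IDs concentrate new writes on the last shard, which is worth weighing against the hotspot criteria in the key selection below.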


Sharding Key Selection:

  • Key Criteria: Choose a sharding key that evenly distributes data and avoids hotspots.
  • Explanation: The choice of sharding key is crucial. It should distribute the data evenly across shards to prevent hotspots. For example, sharding files based on a user's geographical location might lead to uneven distribution if certain regions have a higher concentration of users.


Replication:

  • Strategy: Master-Slave Replication
  • Explanation: Implement master-slave replication to improve data durability and availability. All writes are directed to the master node, while read queries are distributed across slave nodes. This scales read throughput; write throughput remains bounded by the single master and has to be scaled through sharding.
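The read/write split can be sketched as follows. This is a toy model: the class name is made up, dictionaries stand in for database nodes, and replication is applied synchronously for simplicity, whereas real deployments often replicate asynchronously and must tolerate replication lag.

```python
import itertools


class ReplicatedStore:
    """Toy model of one master node and several read replicas."""

    def __init__(self, num_replicas):
        self.master = {}
        self.replicas = [{} for _ in range(num_replicas)]
        # Round-robin cursor over replica indexes for spreading reads.
        self._next_replica = itertools.cycle(range(num_replicas))

    def write(self, key, value):
        # Every write goes through the single master...
        self.master[key] = value
        # ...and is then copied to each replica (synchronously here).
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads rotate across replicas, so read capacity grows with replicas.
        index = next(self._next_replica)
        return index, self.replicas[index].get(key)
```

For example, with two replicas, consecutive reads of the same key are served by replica 0, then replica 1, then replica 0 again, while every write still passes through the one master.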



Load Balancing:

  • Load Balancer Type: DNS Load Balancing for Global Distribution
  • Explanation: Use DNS load balancing to distribute incoming requests across multiple servers globally. This ensures efficient load distribution and improved response times for users in different regions.
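One simple form of DNS load balancing is round-robin rotation of the address records returned for a hostname. The sketch below models only that rotation: the class name is hypothetical, the hostname and addresses are placeholders from documentation ranges, and real DNS services add health checks, TTLs, and geographic or latency-based routing on top.

```python
from collections import deque


class RoundRobinResolver:
    """Toy resolver that rotates the address list on every lookup."""

    def __init__(self, records):
        # hostname -> deque of server addresses
        self._records = {name: deque(addresses) for name, addresses in records.items()}

    def resolve(self, name):
        addresses = self._records[name]
        answer = list(addresses)  # clients typically try the first address
        addresses.rotate(-1)      # the next lookup starts with the next server
        return answer


if __name__ == "__main__":
    resolver = RoundRobinResolver(
        {"files.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
    )
    for _ in range(3):
        print(resolver.resolve("files.example.com")[0])
```

Since clients and intermediate resolvers cache answers, the resulting distribution is only approximate, which is one reason DNS balancing is often combined with a load balancer inside each region.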




High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...






Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...








Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...








Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...






Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?