System requirements
Functional:
User authentication and authorization
File management
Sharing and collaboration
Version control (file versioning)
Search on the file system
Notification of file events
Offline access
Security and encryption
Further description: To the end user, the system behaves like a network file system. Each user gets a logical storage space, and access control is enforced at the folder level. Client (edge) applications are notified of modification events on shared files and folders. Users can search across the files they own or have been granted access to.
Non-Functional:
Highly available
Scalable
Secure
High performance
Capacity estimation
User count: 100 million
Storage size per user: 3 GB on average
Writes: 5 per second per user
Reads: 10 per second per user
Reads make up 60% of all storage operations
Writes make up 40% of all storage operations
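As a rough back-of-the-envelope check, the sketch below turns the user count and per-user storage into a total storage figure. The replication factor of 3 and the use of decimal units (1 PB = 1,000,000 GB) are assumptions for illustration; neither is stated in the requirements.

```python
# Back-of-the-envelope storage estimate from the figures above.
USERS = 100_000_000        # 100 million users
STORAGE_PER_USER_GB = 3    # average storage per user
REPLICATION_FACTOR = 3     # assumption: not given in the requirements
GB_PER_PB = 1_000_000      # decimal units


def logical_storage_pb(users: int, gb_per_user: float) -> float:
    """Storage needed for one copy of every user's data, in PB."""
    return users * gb_per_user / GB_PER_PB


def physical_storage_pb(users: int, gb_per_user: float, replicas: int) -> float:
    """Storage needed once every file is kept `replicas` times, in PB."""
    return logical_storage_pb(users, gb_per_user) * replicas


logical = logical_storage_pb(USERS, STORAGE_PER_USER_GB)
physical = physical_storage_pb(USERS, STORAGE_PER_USER_GB, REPLICATION_FACTOR)
print(f"Logical storage:  {logical:.0f} PB")   # 300 PB
print(f"Physical storage: {physical:.0f} PB")  # 900 PB with 3 replicas
```

So one copy of all user data is about 300 PB, and the physical footprint scales linearly with whatever replication factor is chosen.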
API design
We need the following groups of APIs:
User management REST APIs: CRUD operations on users, plus associating users with groups, groups with roles, and roles with permissions.
File management APIs: CRUD operations on files, protected by role-based access control. The most important ones, such as Upload and Download, are highlighted below.
Upload API:
Method: HTTP POST
URL: v1/file/upload
Header: UserName: <>, access_token: <>,
content-type: multipart/form-data
content-disposition-type: filename
Download API:
Method: HTTP GET
URL: v1/file/download/:filepath
Header: UserName: <>, access_token: <>,
content-type: multipart/form-data
content-disposition-type: filename
Search API:
Method: HTTP GET
URL: v1/file/search/:searchString
Header: UserName: <>, access_token: <>,
content-type: application/json
response: {
  matches: ["files"]
}
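Since search must only return files the caller owns or can access, a minimal sketch of the server-side filtering could look like the following. The folder-level ACL dictionary, the `can_access` and `search` helpers, and the file-name substring match are all hypothetical simplifications; a real deployment would likely use a dedicated search index.

```python
from pathlib import PurePosixPath


def can_access(user: str, path: str, folder_acl: dict[str, set[str]]) -> bool:
    """True if any ancestor folder of `path` grants access to `user`."""
    return any(
        user in folder_acl.get(str(parent), set())
        for parent in PurePosixPath(path).parents
    )


def search(user: str, query: str, all_paths: list[str],
           folder_acl: dict[str, set[str]]) -> dict[str, list[str]]:
    """Match the query against file names, then drop files the user cannot see."""
    q = query.lower()
    hits = [
        p for p in all_paths
        if q in PurePosixPath(p).name.lower() and can_access(user, p, folder_acl)
    ]
    return {"matches": sorted(hits)}  # same shape as the response above


# Hypothetical data: folder -> users allowed to access it.
folder_acl = {
    "/alice": {"alice"},
    "/alice/shared": {"alice", "bob"},
    "/bob": {"bob"},
}
all_paths = [
    "/alice/docs/report.txt",
    "/alice/shared/report-q1.txt",
    "/bob/report.md",
    "/bob/notes.txt",
]
print(search("bob", "report", all_paths, folder_acl))
```

Here bob sees the shared report and his own, but not the one in alice's private folder.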
Notification API:
For real-time file change updates, the client can open a WebSocket subscription. Changes to the files and folders the client is allowed to access are sent over the WebSocket channel.
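The fan-out behind that subscription can be sketched without any networking. In the example below an in-memory outbox stands in for each client's WebSocket channel; `NotificationHub`, `FileEvent`, and the folder-based subscription model are hypothetical names chosen for illustration, not part of the design above.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileEvent:
    path: str
    action: str  # e.g. "modified" (hypothetical event name)


class NotificationHub:
    """Routes file events to clients subscribed to an ancestor folder."""

    def __init__(self) -> None:
        self._watchers: dict[str, set[str]] = defaultdict(set)   # folder -> client ids
        self.outbox: dict[str, deque] = defaultdict(deque)       # client id -> pending events

    def subscribe(self, client_id: str, folder: str) -> None:
        # Assumes the caller has already checked the client's access to `folder`.
        self._watchers[folder].add(client_id)

    def publish(self, event: FileEvent) -> set[str]:
        """Queue the event for every client watching a parent folder of the file."""
        recipients: set[str] = set()
        for parent in PurePosixPath(event.path).parents:
            recipients |= self._watchers.get(str(parent), set())
        for client_id in recipients:
            self.outbox[client_id].append(event)  # stand-in for a WebSocket send
        return recipients


hub = NotificationHub()
hub.subscribe("bob-laptop", "/alice/shared")
hub.publish(FileEvent("/alice/shared/plan.txt", "modified"))
hub.publish(FileEvent("/alice/private/diary.txt", "modified"))
print(list(hub.outbox["bob-laptop"]))
```

Only the event under the shared folder reaches bob's client; the private one is never queued for it.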
Database design
User and Permission management
An RDBMS will store users and their associated group and permission information.
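The user, group, role, and permission relationships can be sketched as plain data classes before committing to a table schema. The class names, the permission strings such as "file:write", and the `effective_permissions` helper are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    permissions: set[str] = field(default_factory=set)


@dataclass
class Group:
    name: str
    roles: list[Role] = field(default_factory=list)


@dataclass
class User:
    name: str
    groups: list[Group] = field(default_factory=list)


def effective_permissions(user: User) -> set[str]:
    """Union of the permissions of every role of every group the user belongs to."""
    perms: set[str] = set()
    for group in user.groups:
        for role in group.roles:
            perms |= role.permissions
    return perms


def is_allowed(user: User, permission: str) -> bool:
    return permission in effective_permissions(user)


# Hypothetical roles and users.
editor = Role("editor", {"file:read", "file:write"})
viewer = Role("viewer", {"file:read"})
alice = User("alice", [Group("engineering", [editor])])
bob = User("bob", [Group("guests", [viewer])])
print(is_allowed(alice, "file:write"), is_allowed(bob, "file:write"))
```

In a relational schema each class would map to a table, with join tables for the user-group, group-role, and role-permission associations.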
Files will be stored in a distributed file system that is sharded and replicated.
High-level design
Request flows
Detailed component design
Distributed storage detailed design:
The goal of this system is scalable, reliable, and highly available distributed file storage.
- There will be multiple storage machines with specialised storage hardware (HDD/SSD) attached.
- A software layer (agent) running on each storage machine pools the machine's storage into a logical storage layer.
- The storage type of the system is file storage.
- All storage machines will be connected through a Raft cluster.
- Data will be distributed using consistent hashing: storage nodes are placed on the hash ring, and each user's hash determines where that user's data falls on the ring.
- File content will be replicated across multiple storage machines to ensure high availability.
- A write request can arrive at any node and is routed to the shard that owns the requesting user's storage. The file is written, and its hash is compared with the hash supplied at upload. The same content is then replicated to the followers, and a notification of the file activity is sent to the notification service.
- A read request to the distributed file storage is routed to the shard responsible for the user's storage, and a file stream is opened to the client.
- The storage system encrypts data on disk; the keys needed for encryption are fetched from a key management system.
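The placement and write steps above can be sketched as follows. This is a simplified illustration under several assumptions: SHA-256 is used for both ring positions and file checksums, each node gets 64 virtual nodes, the replica count is 3, and replication is a plain loop rather than Raft. Unlike the description above, the sketch checks the hash before committing the write. `HashRing` and `write_file` are hypothetical names.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node is placed on the ring several times to even out the load.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def owners(self, user_id: str, replicas: int = 3) -> list[str]:
        """Walk clockwise from the user's position, collecting distinct nodes."""
        start = bisect.bisect_right(self._keys, _hash(user_id))
        found: list[str] = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found  # first entry acts as the primary, the rest as followers


def write_file(ring: HashRing, stores: dict, user_id: str, path: str,
               data: bytes, expected_sha256: str) -> list[str]:
    """Verify the upload hash, then store the content on every owning node."""
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch: upload rejected")
    owners = ring.owners(user_id)
    for node in owners:
        stores[node][(user_id, path)] = data  # stand-in for replicated disk writes
    return owners


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = HashRing(nodes)
stores = {n: {} for n in nodes}
payload = b"hello"
print(write_file(ring, stores, "user-42", "/docs/a.txt", payload,
                 hashlib.sha256(payload).hexdigest()))
```

Because placement depends only on the user's hash, any node can compute the owners locally and forward the request, and adding a node moves only the users whose ring segment it takes over.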
Trade offs/Tech choices
Instead of building our own custom storage system, we could use a battle-tested storage solution such as S3 or Azure disk.
Failure scenarios/bottlenecks
The key management service must be very secure and needs more detailed discussion: if a key is lost, the data encrypted with it cannot be recovered.
Future improvements
Disk encryption needs improvement, and cryptographic key management needs a detailed design.