System requirements
Functional:
- Authentication
- A user can upload files.
- A user can download a file.
- A user can modify the data.
- At any given time multiple people can update a folder.
- Sync between different users.
- Update only the delta (the changed part) of a file.
- Data storage.
- Versioning.
Non-Functional:
- Scalable.
- Reliable.
- Should not lose any update.
- Consistency.
- No data loss.
- User experience
Capacity estimation
Total users : 100 Million
Let's assume the data per user is : 200 Bytes
User space : 100 Million * 200 Bytes --> 20 Billion Bytes --> 20 GB
Storage size for a day (reading the formula as 10 Million uploads per day of 5 MB each) :
10 Million * 5*1000*1000 Bytes --> 50,000 Billion Bytes --> 50 TB/Day
TPS : 10 Million uploads / 86,400 seconds per day --> ~116 TPS
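The estimates can be recomputed with a few lines of arithmetic. This is only a sketch: reading 10 Million as the number of uploads per day and 5*1000*1000 as an average file size of 5 MB is an assumption taken from the formula, and the constant names are made up for illustration.

```python
# Back-of-the-envelope capacity estimate.
# All inputs are assumptions taken from the notes, not measured values.
TOTAL_USERS = 100_000_000          # 100 Million users
USER_RECORD_BYTES = 200            # data stored per user
DAILY_UPLOADS = 10_000_000         # assumed: 10 Million uploads per day
AVG_FILE_BYTES = 5 * 1000 * 1000   # assumed: 5 MB per uploaded file
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

user_space_bytes = TOTAL_USERS * USER_RECORD_BYTES
daily_storage_bytes = DAILY_UPLOADS * AVG_FILE_BYTES
uploads_per_second = DAILY_UPLOADS / SECONDS_PER_DAY

print(f"User space    : {user_space_bytes / 1e9:.0f} GB")
print(f"Daily storage : {daily_storage_bytes / 1e12:.0f} TB/day")
print(f"Upload rate   : {uploads_per_second:.0f} TPS")
```

This is an average rate. Peak traffic is usually several times higher, so the write path should be sized with headroom.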
API design
Post : /v1/upload --> data as bytes
Response will be {documentId:"xxxfstw7623672883"}
Get : /v1/documents?userId=''
Response --> {
docId,
createdAt,
lastUpdated,
lastUpdatedBy
}
Put : /v1/update/{documentId} --> body will contain the updated content
Post : /v1/share/{documentId} Body {List:{userId}}
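A possible shape for the document-listing response, as a sketch only. The class and function names are hypothetical. The camelCase field names and ISO 8601 timestamp strings are assumptions, since the notes do not fix a wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DocumentSummary:
    """One entry in the Get /v1/documents response (field names assumed)."""
    docId: str
    createdAt: str       # assumed: ISO 8601 timestamp string
    lastUpdated: str     # assumed: ISO 8601 timestamp string
    lastUpdatedBy: str   # userId of the last editor


def list_documents_response(docs: List[DocumentSummary]) -> str:
    """Serialize the summaries as a JSON array."""
    return json.dumps([asdict(d) for d in docs])


example = list_documents_response([
    DocumentSummary("doc1", "2024-01-01T10:00:00Z", "2024-01-02T12:30:00Z", "user2"),
])
print(example)
```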
Database design
User table ->{
userId: string(PK)
email : string(UK)
createdAt:
updatedAt
}
userDocumentMeta ->{
docId : PK
contentAddress: --> Location of data
createdBy : userId --> Index
createdAt :
}
DocumentVersion ->{
docId : PK (composite with version, since one document has many versions)
version : String
comment : String
createdAt : timestamp
updatedBy : userId
documentAddress : blockId
}
Since updates are allowed, block storage is recommended.
ContentTable ->{
blockId : PK
content: String
offset:
nextBlock:
}
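A minimal sketch of how the linked blocks could be written and read back, using an in-memory dict in place of the content table. The block size, the helper names, and the use of random hex ids as blockId are all assumptions for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional

BLOCK_SIZE = 4  # deliberately tiny so the example produces several blocks


@dataclass
class Block:
    blockId: str
    content: str
    offset: int                # position of this block's first character
    nextBlock: Optional[str]   # blockId of the following block, or None


def store_document(table: Dict[str, Block], text: str,
                   block_size: int = BLOCK_SIZE) -> Optional[str]:
    """Split text into blocks, link them, and return the first blockId."""
    chunks = [text[i:i + block_size] for i in range(0, len(text), block_size)]
    ids = [uuid.uuid4().hex for _ in chunks]
    for i, chunk in enumerate(chunks):
        next_id = ids[i + 1] if i + 1 < len(ids) else None
        table[ids[i]] = Block(ids[i], chunk, i * block_size, next_id)
    return ids[0] if ids else None


def read_document(table: Dict[str, Block], first_block_id: Optional[str]) -> str:
    """Follow the nextBlock chain and reassemble the content."""
    parts = []
    current = first_block_id
    while current is not None:
        block = table[current]
        parts.append(block.content)
        current = block.nextBlock
    return "".join(parts)
```

With this layout an edit only needs to rewrite the blocks it touches, which is the reason for preferring block storage when updates are allowed.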
High-level design
Components ->
Client
DropBoxServer
DB
BroadCastService : Queue-based notification (clients can connect and receive the latest metadata updates)
Client :
- It maintains the data locally and stores changes there.
- Once the client is online, it connects to the backend and pushes the changes.
Server
- The client logs in to DropBox.
- After authentication, the user uploads documents.
- The client connects to the server and uploads the document.
- Once the server has received the whole document, it starts processing.
- When data arrives for an update, a new version is created and the data is updated.
- The update is pushed to each client.
Request flows
Client APK/Client side:
- Open any folder; it is fetched from the server (if not available at the client).
- Update the document/folder.
- Save changes locally.
- Publish changes to the backend.
BackEnd:
Upload/update flow:
- Get changes from client1 for doc1.
- Create a new version of the data in the metadata table.
- Update the data in the document.
- Broadcast the change to all connected clients.
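The upload/update steps could look roughly like this on the backend. It is an in-memory sketch, not the actual service: the Backend class, apply_update, and the subscriber callbacks are hypothetical names, and version is an integer counter here purely for simplicity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class VersionRecord:
    docId: str
    version: int      # assumed: simple counter starting at 1
    updatedBy: str
    content: str


@dataclass
class Backend:
    # docId -> ordered version history (stand-in for the metadata table)
    versions: Dict[str, List[VersionRecord]] = field(default_factory=dict)
    # callbacks standing in for connected clients
    subscribers: List[Callable[[VersionRecord], None]] = field(default_factory=list)

    def apply_update(self, doc_id: str, user_id: str, content: str) -> VersionRecord:
        """Record a new version, store its data, then notify subscribers."""
        history = self.versions.setdefault(doc_id, [])
        record = VersionRecord(doc_id, len(history) + 1, user_id, content)
        history.append(record)            # versioning + data update
        for notify in self.subscribers:   # broadcast to connected clients
            notify(record)
        return record
```

Keeping the full history per docId is what makes rollback to an earlier version possible. A real system would store block addresses in each record instead of the content itself.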
Download flow:
- The user authenticates.
- Query all documents from the table on the basis of userId.
- Publish all docIds to the client.
- The user can choose any one docId.
- The client calls the download API with the docId.
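The download steps can be sketched with in-memory stand-ins for the tables. The token scheme, the sample data, and the function names are all hypothetical. The ownership check in download is an added assumption, since the notes only say the user picks a docId.

```python
from typing import Dict, List

# Hypothetical stand-ins for the session store, metadata table and content store.
VALID_TOKENS: Dict[str, str] = {"token-abc": "user1"}
DOCS_BY_USER: Dict[str, List[str]] = {"user1": ["doc1", "doc2"]}
CONTENT_BY_DOC: Dict[str, bytes] = {"doc1": b"hello", "doc2": b"world"}


def authenticate(token: str) -> str:
    """Return the userId for a token, or raise if the token is unknown."""
    user_id = VALID_TOKENS.get(token)
    if user_id is None:
        raise PermissionError("invalid token")
    return user_id


def list_documents(token: str) -> List[str]:
    """Query all docIds for the authenticated user."""
    return list(DOCS_BY_USER.get(authenticate(token), []))


def download(token: str, doc_id: str) -> bytes:
    """Return the content of one document the user is allowed to see."""
    user_id = authenticate(token)
    if doc_id not in DOCS_BY_USER.get(user_id, []):
        raise PermissionError("document not visible to this user")
    return CONTENT_BY_DOC[doc_id]
```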
Detailed component design
How does the client maintain the data locally?
How is updated data stored in the backend?
How is data synced from one client to another?
How is the delta stored at the backend?
How does the client maintain the data locally?
- The client connects to the server and downloads the list of all documents for a user.
- The user can pick any document and start updating it.
- Changes are saved locally and periodically pushed to the backend.
- There are two approaches. We can push changes immediately, as soon as they are made.
- Problem : scalability, as every minor change has to be maintained at the backend and broadcast to every user.
- Or we can push changes periodically.
- Advantage : scalable, and the push interval can be tuned for cost effectiveness (DB changes).
- Disadvantage : not real time, which is a problem if someone needs the changes immediately.
- The client gets updates for new versions of a document and resolves conflicts.
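The periodic-push approach can be sketched as a small client-side buffer. This is an illustration under assumptions: the class name, the 30-second default interval, and the choice to keep only the latest content per document are not from the notes. The clock is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Dict


class LocalChangeBuffer:
    """Holds local edits and pushes them to the backend at most once per interval."""

    def __init__(self, push: Callable[[Dict[str, str]], None],
                 interval_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._push = push
        self._interval = interval_seconds
        self._clock = clock
        self._pending: Dict[str, str] = {}
        self._last_flush = clock()

    def record(self, doc_id: str, content: str) -> None:
        """Save a change locally; only the latest content per document is kept."""
        self._pending[doc_id] = content
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Push pending changes if the interval has elapsed. Returns True if pushed."""
        if not self._pending:
            return False
        if self._clock() - self._last_flush < self._interval:
            return False
        batch, self._pending = self._pending, {}
        self._push(batch)
        self._last_flush = self._clock()
        return True
```

A longer interval means fewer backend writes and broadcasts, at the cost of other clients seeing changes later. That is the trade-off described in the list above.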
How is updated data stored in the backend?
How is data synced from one client to another?
- After authentication, the backend receives the changes from one client.
- It creates a new version of the data and updates the datastore.
- It also detects the changes and updates the content of the document.
- It broadcasts all changes to the notification servers.
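One way to picture the notification step is a queue per connected client, as in the sketch below. NotificationService and its methods are hypothetical names, and Python's in-process queue.Queue stands in for a real message broker. The message carries only metadata (docId and version), on the assumption that clients then fetch the content themselves.

```python
import queue
from typing import Dict


class NotificationService:
    """Keeps one queue per connected client and fans out change notices."""

    def __init__(self) -> None:
        self._queues: Dict[str, queue.Queue] = {}

    def connect(self, client_id: str) -> queue.Queue:
        """Register a client and return the queue it should read from."""
        return self._queues.setdefault(client_id, queue.Queue())

    def broadcast(self, sender_id: str, doc_id: str, version: int) -> int:
        """Notify every connected client except the sender. Returns the count."""
        delivered = 0
        for client_id, client_queue in self._queues.items():
            if client_id == sender_id:
                continue  # the sender already has this change
            client_queue.put({"docId": doc_id, "version": version})
            delivered += 1
        return delivered
```

The fan-out cost grows with the number of connected clients per document, which is why notification shows up later as the main bottleneck.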
How is the delta stored at the backend?
Trade offs/Tech choices
DB : For user and metadata storage : Relational DB --> Postgres/MySQL
For data content storage : NoSQL --> DynamoDB/Cassandra (chosen because of availability)
Trade Offs
Client pull vs push data sync to backend.
Failure scenarios/bottlenecks
bottlenecks:
- Notifying all the clients of changes. This approach works at small scale but becomes a bottleneck as the number of clients grows.
Future improvements
Scale the notification of changes to all the clients; the current approach works only at small scale.