My Solution for Design Pastebin with Score: 8/10
by monolith1144
System requirements
Functional:
10 KB max size per paste. This is a system limit that can't be changed by the user.
1 million writes per day
10 million reads per day
A user can create a paste and get a unique link to share.
Another user can use the link to view the aforementioned paste.
Non-Functional:
The system doesn't need user accounts.
Files are immutable once created, aside from the initial TTL set by the user when the file is created.
Capacity estimation
1 million writes per day of 10 KB is 10 million KB per day, or 3,650 million KB per year.
3,650,000,000,000 bytes ≈ 3.7 TB of writes per year.
At 10x the write volume, annual read bandwidth is roughly 37 TB per year.
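The arithmetic above can be sanity-checked in a few lines (assuming 1 KB = 1,000 bytes, as the estimate's rounding does):

```java
// Back-of-the-envelope capacity check using the numbers from the estimates above.
public class CapacityEstimate {
    static final long WRITES_PER_DAY = 1_000_000L;
    static final long READS_PER_DAY = 10_000_000L;
    static final long PASTE_BYTES = 10_000L; // 10 KB max per paste

    // Annual write volume in bytes.
    static long annualWriteBytes() {
        return WRITES_PER_DAY * PASTE_BYTES * 365;
    }

    // Annual read bandwidth in bytes.
    static long annualReadBytes() {
        return READS_PER_DAY * PASTE_BYTES * 365;
    }

    public static void main(String[] args) {
        System.out.println("writes/yr: " + annualWriteBytes() / 1e12 + " TB"); // 3.65 TB
        System.out.println("reads/yr:  " + annualReadBytes() / 1e12 + " TB");  // 36.5 TB
    }
}
```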
API design
Post file
- takes
- file contents
- TTL
- returns
- uuid of file
Get file
- takes
- file uuid
- returns
- file bytes
- expiration time.
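A minimal in-memory sketch of these two endpoints (class and method names are illustrative, not tied to any framework):

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// In-memory sketch of the POST/GET file API described above.
public class PasteApi {
    public record Paste(byte[] bytes, Instant expiresAt) {}

    private final Map<UUID, Paste> store = new HashMap<>();

    // POST file: takes contents + TTL (seconds), returns the file uuid.
    public UUID postFile(byte[] contents, long ttlSeconds) {
        if (contents.length > 10_000) {
            throw new IllegalArgumentException("paste exceeds 10 KB limit");
        }
        UUID id = UUID.randomUUID();
        store.put(id, new Paste(contents, Instant.now().plusSeconds(ttlSeconds)));
        return id;
    }

    // GET file: takes the uuid, returns bytes + expiration time,
    // or null if the file is missing or expired.
    public Paste getFile(UUID id) {
        Paste p = store.get(id);
        if (p == null || p.expiresAt().isBefore(Instant.now())) return null;
        return p;
    }
}
```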
Database design
Redis
- key: file uuid
- contents
- file bytes
- ttl
Cassandra
File table
- primary key: file uuid
- other columns
- file bytes
- ttl
S3
file name = file uuid
contents = bytes
ttl set on file
Kafka message
key: file uuid
file contents
ttl
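The Kafka payload above could be modeled as a small record. Keying messages by file uuid keeps every write for a file on the same partition, preserving per-file ordering; the hash below is a simple stand-in for Kafka's real (murmur2-based) default partitioner:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Sketch of the message the write path publishes to Kafka.
// Field names are illustrative, not a fixed wire format.
public record FileMessage(UUID fileId, byte[] contents, long ttlSeconds) {

    // Same idea as Kafka's default partitioner (which actually uses murmur2):
    // the same key always maps to the same partition.
    public int partition(int numPartitions) {
        byte[] key = fileId.toString().getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(java.util.Arrays.hashCode(key), numPartitions);
    }
}
```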
High-level design
Writes will use Kafka to distribute the data to Redis, Cassandra, and S3.
Reads will check Redis first; if the file isn't there, then Cassandra; if not there, then S3.
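A toy version of this fan-out, with an in-memory list standing in for Kafka and maps standing in for the three stores (all names illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// In-memory sketch of the fan-out: the server enqueues one message and each
// store's sync service consumes it independently.
public class WriteFanOut {
    record Msg(UUID fileId, byte[] contents, long ttlSeconds) {}

    private final List<Msg> queue = new ArrayList<>(); // stand-in for Kafka
    final Map<UUID, byte[]> redis = new HashMap<>();     // L1 cache
    final Map<UUID, byte[]> cassandra = new HashMap<>(); // L2 cache
    final Map<UUID, byte[]> s3 = new HashMap<>();        // source of truth

    // Write path: generate a uuid, enqueue once, return the uuid immediately.
    public UUID write(byte[] contents, long ttlSeconds) {
        UUID id = UUID.randomUUID();
        queue.add(new Msg(id, contents, ttlSeconds));
        return id;
    }

    // Sync services: drain the queue into all three stores.
    public void runSyncServices() {
        for (Msg m : queue) {
            redis.put(m.fileId(), m.contents());
            cassandra.put(m.fileId(), m.contents());
            s3.put(m.fileId(), m.contents());
        }
        queue.clear();
    }
}
```

Note the window between `write` returning and `runSyncServices` completing: until the consumers catch up, the uuid exists but no store has the bytes, which is exactly the syncing issue discussed under failure scenarios.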
Request flows
- Write flow
- The user makes a REST request with the file details to a Java server.
- The Java server generates a UUID for the file and enqueues the file data and UUID into Kafka.
- The Java service returns the file UUID to the user.
- Sync flow
- Each data store will have a sync service that reads the Kafka messages and writes them to its associated store.
- This allows writes to happen independently and also supports future batch writes if needed.
- Read flow
- Read requests will come to the Java service.
- The service will look up the file in Redis and return it if present.
- If not there, the service will look for the file in Cassandra and return it if present.
- If not there, it falls back to S3.
- The service will asynchronously backfill the file to any store it checked and didn't find it in. This makes subsequent lookups faster.
- Caching
- Redis will be sized to hold 1 percent of the annual data.
- Cassandra will be sized for 10 percent of the annual data.
- Both will be configured as LRU caches, purging the least recently used data when they fill up.
- A CDN can also be used to add further caching if needed.
- The system can also be distributed to additional data centers, using S3 as a fallback backbone to synchronize data across the centers.
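The read fallback and backfill described above can be sketched with maps as stand-ins for the three tiers (backfill is shown synchronously here for clarity; the design does it asynchronously):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of the tiered read path: Redis -> Cassandra -> S3, backfilling
// whichever tiers missed.
public class TieredRead {
    final Map<UUID, byte[]> redis = new HashMap<>();
    final Map<UUID, byte[]> cassandra = new HashMap<>();
    final Map<UUID, byte[]> s3 = new HashMap<>();

    public byte[] read(UUID id) {
        byte[] v = redis.get(id);
        if (v != null) return v;        // L1 hit

        v = cassandra.get(id);
        if (v != null) {
            redis.put(id, v);           // backfill L1
            return v;
        }

        v = s3.get(id);
        if (v != null) {
            cassandra.put(id, v);       // backfill L2
            redis.put(id, v);           // backfill L1
        }
        return v;                       // null if expired or unknown
    }
}
```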
Detailed component design
Redis and Cassandra will operate as L1 and L2 caches. Both will use the LRU algorithm to evict the least recently used data. As they are only caches, they will be configured with no data redundancy to save on capacity.
In addition to the LRU eviction, all three storage systems will use the incoming TTL to evict data. This ensures the data is purged from all systems by the appropriate time even if LRU eviction hasn't already done the job.
Redis will be deployed using AWS ElastiCache with 3 instances and zero replication in cluster mode.
Cassandra will be deployed using a managed Cassandra vendor, writing to a single node. If the data is lost, the system will repair from S3 eventually.
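The LRU behavior both cache tiers rely on can be sketched with `LinkedHashMap`'s access-order mode (Redis itself approximates LRU through its `maxmemory-policy allkeys-lru` setting):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap in access order evicts the least
// recently used entry once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: get() refreshes recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict least recently used when full
    }
}
```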
Trade offs/Tech choices
Redis could be replaced with another in-memory data store, but it's well adopted throughout the industry.
Other NoSQL DBs could be used instead of Cassandra, for instance DynamoDB. Primarily we need a NoSQL DB that handles TTLs and can scale to sizes greater than a TB if the need arises. Cassandra fits the bill.
S3 is nice because it provides easy global replication, but frequently reading many small files from it could become expensive, thus the caching.
Failure scenarios/bottlenecks
- Syncing issues
- It's possible that not all consumers will have consumed the Kafka message by the time a user tries to access the data.
- This could cause the data to disappear and reappear in the case where the data is written to Redis, then flushed by the LRU, and finally repopulated via the S3 flow.
- Downed systems
- Any of the three storage systems could go down. The system should have a timeout that falls back to the next data store in the chain if there's no response within a given time limit. This partially mitigates the issue, although response time would go up.
Future improvements
It'd be a good idea to implement file hashing so that a duplicate file is stored only once, with multiple UUIDs linking to it so it appears the same file was uploaded multiple times.
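A sketch of that dedup idea, with SHA-256 as an assumed content hash and maps standing in for the link table and blob store:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;
import java.util.UUID;

// Dedup sketch: bytes are stored once per content hash, and each upload's
// uuid links to the hash.
public class DedupStore {
    private final Map<String, byte[]> blobs = new HashMap<>(); // hash -> bytes
    private final Map<UUID, String> links = new HashMap<>();   // uuid -> hash

    public UUID upload(byte[] contents) {
        String hash;
        try {
            hash = HexFormat.of().formatHex(
                    MessageDigest.getInstance("SHA-256").digest(contents));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
        blobs.putIfAbsent(hash, contents); // duplicate contents stored once
        UUID id = UUID.randomUUID();
        links.put(id, hash);
        return id;
    }

    public byte[] download(UUID id) {
        return blobs.get(links.get(id));
    }

    public int blobCount() { return blobs.size(); }
}
```

One wrinkle this sketch ignores: with shared blobs, the TTL-based deletion would need reference counting (or a max-TTL rule) so one link expiring doesn't delete bytes another link still needs.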
To further reduce latency, the setup could be deployed to multiple isolated data centers around the globe, connected by S3. If a data center on the other side of the planet doesn't have the data locally, it would fall back to S3, which would also cause the data to be replicated into that data center. The drawback is that a user in the writing region could see a file before it is synced to S3, so a user on the other side of the globe might not see it until the sync is complete.
We could also add S3 lifecycle management that would automatically move infrequently used files to less expensive but slower access tiers.