Design Pastebin - System Design

System requirements

Functional:

Store Text
Retrieve Text
Expire Text on date and/or time

Non-Functional:

Performant in reads
Highly Available

Capacity estimation

Assuming 10,000 pastebins are created per day and each lives for 30 days on average, we would have around 300,000 active pastebins at any given time (10,000 x 30).

For each pastebin we need to store the text (as a text file), an expiry date and the URL that we generate.

Let's say the file has an average size of 1MB and the metadata, such as the expiry date and associated URL, is 200 bytes at most. For 300,000 pastebins we would have 300GB of text files (300,000 x 1MB) and 60MB of metadata (300,000 x 200 bytes).
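The arithmetic behind these estimates can be checked with a short script. This is only a sketch built on the assumptions stated above; the use of decimal units (1MB = 1,000,000 bytes) is an extra assumption made here for simplicity.

```python
# Back-of-the-envelope capacity estimate using the assumptions from the text.
PASTES_PER_DAY = 10_000
RETENTION_DAYS = 30
AVG_FILE_BYTES = 1_000_000   # 1MB average file, decimal units (assumption)
METADATA_BYTES = 200         # upper bound on metadata per pastebin

# Number of pastebins alive at any one time.
active_pastes = PASTES_PER_DAY * RETENTION_DAYS

# Storage needed for the files (in GB) and for the metadata (in MB).
file_storage_gb = active_pastes * AVG_FILE_BYTES / 1_000_000_000
metadata_mb = active_pastes * METADATA_BYTES / 1_000_000

print(f"active pastebins: {active_pastes}")
print(f"file storage:     {file_storage_gb} GB")
print(f"metadata storage: {metadata_mb} MB")
```

The metadata is several orders of magnitude smaller than the files, which is what makes it reasonable to keep the two in different stores.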

This is a read-heavy system, so the design should prioritise reads.

API design

For uploading we would have

POST /v1/upload

Parameters (text, expiry duration)

This API creates a hash for each upload and uses it as the URL path.

Returns a pastebin URL.
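One possible way to derive the URL path from a hash is sketched below. The salt, the 8-character length, the base64 encoding and the host name paste.example are all assumptions made for illustration; the design above only says that a hash is used as the path.

```python
import base64
import hashlib
import secrets

BASE_URL = "https://paste.example"  # hypothetical host name
PATH_LENGTH = 8                     # assumed path length


def make_url_path(text: str) -> str:
    """Hash the text with a random salt and keep a short URL-safe prefix."""
    # The salt means two uploads of identical text get different paths.
    salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + text.encode("utf-8")).digest()
    # URL-safe base64 uses only letters, digits, '-' and '_'.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:PATH_LENGTH]


def make_paste_url(text: str) -> str:
    """Build the full pastebin URL that the upload API would return."""
    return f"{BASE_URL}/{make_url_path(text)}"


print(make_paste_url("hello world"))
```

Truncating the hash makes collisions possible, so a real service would likely check the database for an existing path and retry with a new salt if one is found.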

GET /[...url path]

Returns the text file if the expiry date has not been reached.

If the expiry date has passed, return a 404 or another appropriate HTTP code (for example 410 Gone).

Database design

Use a relational database, as the metadata is structured, with the same fixed set of fields for every pastebin.

Most text files will be small, but outliers could be very large. We store the text as a file in an object store, rather than in the database, to keep the database performant.

The database contains the metadata for each piece of text. A table could look like:

id, pastebin url, object store url, expiry date. When a lookup is performed we can return the link to the object store file.
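The table above could be sketched as follows, using SQLite from the Python standard library purely as a stand-in for whichever relational database is chosen. The table name, column names, column types and the example URLs are assumptions, not part of the design.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pastes (
        id         INTEGER PRIMARY KEY,
        paste_path TEXT    NOT NULL UNIQUE,  -- URL path; UNIQUE also creates an index
        object_url TEXT    NOT NULL,         -- where the file lives in the object store
        expires_at INTEGER NOT NULL          -- expiry as a Unix timestamp (assumed format)
    )
    """
)

# Store the metadata for one upload (hypothetical values).
conn.execute(
    "INSERT INTO pastes (paste_path, object_url, expires_at) VALUES (?, ?, ?)",
    ("abc12345", "https://objects.example/abc12345.txt", 1_900_000_000),
)

# A lookup by URL path returns the link to the object store file.
row = conn.execute(
    "SELECT object_url, expires_at FROM pastes WHERE paste_path = ?",
    ("abc12345",),
).fetchone()
print(row)
```

Since every read looks a row up by its URL path, an index on that column (here provided by the UNIQUE constraint) is the main thing the read path depends on.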

As this is a read-heavy system we need to be performant and highly available. For high availability we can create read replicas for redundancy. For this application we do not need to ensure strict consistency, as pastebin contents are not modifiable.

For performance we can also introduce a cache. Some files will be shared more than others, so we can cache the mapping from frequently requested URLs to their file paths.

High-level design

Use a load balancer to distribute requests among multiple servers. On upload, the server stores the metadata in the database, creates a file and stores it in our object store. Storing files separately can improve database performance, as large files can slow down lookups.

When looking up a file we check the cache first. On a cache miss we query the database for the information.

Request flows

On upload we send the text to the upload API. The upload API converts the text to a file to be stored. The object store URL and pastebin URL are generated, and we store this metadata in the database along with the expiry time. Once this is done we return the pastebin URL.
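The upload flow can be sketched with plain dictionaries standing in for the object store and the database. Everything named here (the upload function, the example hosts, the 8-character path) is hypothetical and only illustrates the order of the steps.

```python
import hashlib
import secrets
import time

object_store = {}  # object URL -> file contents (stand-in for the object store)
metadata_db = {}   # URL path -> metadata (stand-in for the relational database)


def upload(text, ttl_seconds, now=None):
    """Store the text and its metadata, then return the pastebin URL."""
    now = time.time() if now is None else now

    # Generate the URL path from a salted hash of the text.
    salt = secrets.token_bytes(16)
    path = hashlib.sha256(salt + text.encode("utf-8")).hexdigest()[:8]

    # Write the file first, so the metadata never points at a missing object.
    object_url = f"https://objects.example/{path}.txt"
    object_store[object_url] = text

    # Then record the metadata, including the expiry time.
    metadata_db[path] = {"object_url": object_url, "expires_at": now + ttl_seconds}

    return f"https://paste.example/{path}"


print(upload("hello world", ttl_seconds=3600))
```

Writing the file before the metadata is a design choice in this sketch: if the second step fails, the result is an orphaned file that can be cleaned up later rather than a pastebin URL that resolves to nothing.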

For viewing we go to the URL and the server first checks the cache for it. If it exists in the cache then we return the file URL. If it does not exist in the cache then the server checks the database for an entry. If there is no entry, or the entry has expired, we return 404 not found; otherwise we return the file URL and add it to the cache.
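A minimal sketch of this read path is shown below, again with dictionaries standing in for the cache and the database. The lookup function, the status codes as plain integers and the sample entry are illustrative assumptions.

```python
import time

cache = {}  # URL path -> metadata (stand-in for an in-memory cache)
metadata_db = {  # URL path -> metadata (stand-in for the database)
    "abc12345": {
        "object_url": "https://objects.example/abc12345.txt",
        "expires_at": 2000.0,
    },
}


def lookup(path, now=None):
    """Return (status code, file URL) for a pastebin URL path."""
    now = time.time() if now is None else now

    # Check the cache first and fall back to the database on a miss.
    entry = cache.get(path)
    if entry is None:
        entry = metadata_db.get(path)

    # Unknown or expired pastebins are reported as not found.
    if entry is None or entry["expires_at"] <= now:
        cache.pop(path, None)
        return 404, None

    # Populate the cache so the next read skips the database.
    cache[path] = entry
    return 200, entry["object_url"]


print(lookup("abc12345", now=1000.0))
print(lookup("missing", now=1000.0))
```

The expiry check runs on cache hits as well as database hits, so a cached entry is never served after its expiry time even if it has not yet been evicted.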

Detailed component design

As this is a read-heavy system we need to be performant and highly available. For high availability we can create replicas for redundancy. For this application we do not need to ensure strict consistency, as pastebin contents are not modifiable. Therefore we can update each replica asynchronously. Since files won't be updated, we also do not need to worry about versioning here, as it is a write-once system.

The cache would be an in-memory key-value store, such as Redis, that maps the pastebin URL to the file URL. We can set the time to live for each value to the minimum of a fixed value and the time remaining until expiry. For example, each entry lives in the cache for 30 minutes or until its expiry time, whichever is shorter.
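The time-to-live rule can be written as a small helper. This is a sketch: the function name and the use of Unix timestamps in seconds are assumptions.

```python
MAX_CACHE_TTL_SECONDS = 30 * 60  # the 30 minute cap from the text


def cache_ttl(expires_at, now):
    """Seconds an entry may stay cached: the cap or the time left, whichever is shorter."""
    remaining = int(expires_at - now)
    # Never return a negative TTL for an entry that has already expired.
    return max(0, min(MAX_CACHE_TTL_SECONDS, remaining))


print(cache_ttl(expires_at=10_000, now=0))  # far from expiry: capped at 30 minutes
print(cache_ttl(expires_at=600, now=0))     # 10 minutes left: use the remaining time
```

With the redis-py client this value could be passed as the ex argument of set. A result of 0 means the entry has already expired and should not be cached at all.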

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?