System Requirements
Functional -
- Users should be able to paste their text.
- Users should be able to set the amount of time for which the paste is active, including indefinite time.
Non-Functional -
- High Availability - The service should be highly available. A short delay before a newly created paste link becomes readable is acceptable, but the service itself should be up at all times.
- Consistency - We aim for eventual consistency.
- Durability - System should be reliable, no text should be lost.
- Low Latency - The user should receive a paste link within 500 ms of creating a paste.
Capacity Estimation
- We assume each paste is 1 MB in size.
- We assume we generate 1 million pastes a day, meaning we create 10^12 bytes = 1 TB of data every day.
- Over a year, we generate 365 TB, which we round up to ~400 TB. Over 10 years we need to store about 4000 TB = 4 PB of data.
- We also need to replicate this data multiple times. Assuming roughly five copies, we would need about 20 PB over 10 years.
- We cache roughly 10% of one year's data (~400 TB) for faster access, so the cache holds about 40 TB.
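The arithmetic above can be checked with a short script. This is a back-of-the-envelope sketch: the replication factor of 5 is an assumption inferred from the 4 PB to 20 PB step, and the cache is taken as 10% of one year's data, which is what yields a figure near 40 TB. The exact results (365 TB per year, 3.65 PB over 10 years) are slightly below the rounded numbers used in the estimate.

```python
# Back-of-the-envelope storage estimate using the figures assumed above.
MB = 10**6
TB = 10**12
PB = 10**15

paste_size_bytes = 1 * MB        # assumed size of one paste
pastes_per_day = 1_000_000       # assumed daily write volume
replication_factor = 5           # assumption: implied by 4 PB -> 20 PB
cache_divisor = 10               # assumption: cache 1/10 of one year's data

daily_bytes = paste_size_bytes * pastes_per_day
yearly_bytes = daily_bytes * 365
ten_year_bytes = yearly_bytes * 10
replicated_bytes = ten_year_bytes * replication_factor
cache_bytes = yearly_bytes // cache_divisor

print(f"per day:            {daily_bytes / TB:.2f} TB")
print(f"per year:           {yearly_bytes / TB:.2f} TB")
print(f"10 years:           {ten_year_bytes / PB:.2f} PB")
print(f"10 years, x{replication_factor} copies: {replicated_bytes / PB:.2f} PB")
print(f"cache:              {cache_bytes / TB:.2f} TB")
```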
API Design
We would primarily need two APIs.
- Create Paste Link
- This would be a POST request which would have all the parameters needed to create a paste link, like the content, title of the paste, expiration time, etc.
- Takes in the link, the pasted content and the expiration time as parameters
- Returns the paste link on success, an error otherwise
- Get Paste
- This would take the provided paste link as an input and return the corresponding pasted text
- If the paste exists, this returns the pasted text along with a success status; otherwise it returns a 404 Not Found
- If the paste link itself is invalid (for example, malformed), we return a 4xx error
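The two endpoints can be sketched as plain functions that return a status code and a body. This is a minimal in-memory illustration, not a definitive implementation: the function names, the `BASE_URL` host, the `ttl_seconds` parameter and the specific 201/400 codes are assumptions made for the example, and the dictionary stands in for the real metadata table and content store.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

BASE_URL = "https://paste.example/"   # hypothetical host


@dataclass
class Paste:
    content: str
    title: Optional[str]
    created_at: float
    expires_at: Optional[float]       # None means the paste never expires


_PASTES = {}                          # in-memory stand-in for the real stores


def create_paste(content, title=None, ttl_seconds=None, now=None):
    """Sketch of the POST handler: returns (status_code, body)."""
    if not content:
        return 400, {"error": "content must not be empty"}
    if ttl_seconds is not None and ttl_seconds <= 0:
        return 400, {"error": "ttl_seconds must be positive"}
    now = time.time() if now is None else now
    paste_id = secrets.token_urlsafe(6)          # 8 URL-safe characters
    expires_at = None if ttl_seconds is None else now + ttl_seconds
    _PASTES[paste_id] = Paste(content, title, now, expires_at)
    return 201, {"link": BASE_URL + paste_id}


def get_paste(link, now=None):
    """Sketch of the GET handler: returns (status_code, body)."""
    if not link.startswith(BASE_URL):
        return 400, {"error": "malformed paste link"}
    paste = _PASTES.get(link[len(BASE_URL):])
    now = time.time() if now is None else now
    expired = paste is not None and paste.expires_at is not None and paste.expires_at <= now
    if paste is None or expired:
        return 404, {"error": "paste not found"}
    return 200, {"content": paste.content, "title": paste.title}
```

Passing `ttl_seconds=None` models the indefinite lifetime from the functional requirements.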
Database Design
- We are not storing user data since we assume they want to remain anonymous
- We will have a pasteInfo table that stores the metadata for each paste
- pasteInfo table
- pasteID Primary Key
- pasteLink Secondary Key
- createdAt
- expiryDate Index
- We use an object store like Amazon S3 or Google Cloud Storage to store the content of our links
- paste store
- pasteID Key
- content
- To quickly search through the expired links, we add an index to the expiryDate field in our pasteInfo table
- Alternatively, we can use a Redis sorted set that stores each link ID scored by its expiration time, so the entries stay ordered by expiration date. We keep checking the first element of the sorted set; if that link has expired, we delete the paste from our table and data store, and we also remove its entry from the sorted set
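The sorted-set cleanup can be sketched without a Redis server by using a heap as a stand-in. The class and function names here are hypothetical, and the dictionaries stand in for the pasteInfo table and the content store. With Redis itself, the same idea would map onto the sorted-set commands `ZADD`, `ZRANGEBYSCORE` and `ZREM`.

```python
import heapq


class ExpiryIndex:
    """In-memory stand-in for a sorted set ordered by expiry timestamp."""

    def __init__(self):
        self._heap = []   # (expires_at, paste_id), earliest expiry first

    def add(self, paste_id, expires_at):
        heapq.heappush(self._heap, (expires_at, paste_id))

    def pop_expired(self, now):
        """Remove and return the IDs whose expiry time is at or before `now`."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            _, paste_id = heapq.heappop(self._heap)
            expired.append(paste_id)
        return expired


def sweep(index, metadata, content_store, now):
    """Delete every expired paste from the metadata table and the content store."""
    expired = index.pop_expired(now)
    for paste_id in expired:
        metadata.pop(paste_id, None)
        content_store.pop(paste_id, None)
    return expired
```

Pastes with an indefinite lifetime are simply never added to the index, so the sweep never touches them.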
High-Level Design
- To prevent abuse of our system, we impose rate limiting based on IP address and a bucketing strategy. The bucketing strategy helps mitigate distributed attacks
- We allow a high number of reads but a lower number of writes. We also tell rate-limited users when they can make their next request
- All read and write requests go through a load balancing layer where the load balancer assigns servers based on one of multiple available schemes, like round robin, average response time etc.
- Our database uses single-leader replication: the leader takes all writes and its followers serve as read replicas
- S3 handles scaling and replication on its own for us
- To shard our database, we can use consistent hashing: the hash of the pasteID determines which shard stores the paste
- To avoid collisions between pasteIDs, we can Base64-encode a monotonically increasing value (such as a counter or a timestamp-based ID) to produce each pasteID. The encoding alone does not guarantee uniqueness; the unique, time-ordered input does
- We also add a caching layer and assume that 20% of our reads will be served from the cache. Inside the cache, we use the Least Recently Used (LRU) eviction policy to evict pastes. For our cache, we can use Redis
- Since we are using MySQL, write conflicts are handled with row-level locks held under two-phase locking: a process can write to a row only when no other process holds a lock on that row.
- A process that attempts a locking read on a row while it is being written must wait, because the writer holds an exclusive lock on that row
- To avoid losing pastes if a failure occurs in the middle of a write, we can use a Write Ahead Log (WAL) from which the database can be rebuilt
- All components are horizontally scaled
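The consistent-hashing idea for sharding can be sketched as a hash ring keyed by pasteID. This is an illustration under stated assumptions rather than the design's actual implementation: the `HashRing` class, the shard names, the use of SHA-256 and the 100 virtual nodes per shard are all choices made for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes=100):
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> shard name
        self._vnodes = vnodes  # virtual nodes per shard, to even out the load
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        for i in range(self._vnodes):
            point = _hash(f"{shard}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = shard

    def remove_shard(self, shard):
        self._points = [p for p in self._points if self._owner[p] != shard]
        self._owner = {p: s for p, s in self._owner.items() if s != shard}

    def shard_for(self, paste_id):
        """Return the shard owning the first ring position after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(paste_id))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["db-0", "db-1", "db-2"])   # hypothetical shard names
print(ring.shard_for("aZ3kQ9xP"))           # one of db-0, db-1, db-2
```

The point of the ring is that adding or removing a shard only moves the keys adjacent to that shard's positions, instead of remapping almost every pasteID as a plain `hash % N` scheme would.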
Request Flows
Explained in the high level design above
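As one concrete flow, the read path through the LRU cache described in the high-level design might look like the sketch below. The names `LRUCache` and `read_paste` are hypothetical, the dictionaries stand in for the pasteInfo table and the object store, and storing the expiry time next to the cached content is an assumption made so that a cache hit cannot return an expired paste.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry


def read_paste(paste_id, cache, metadata, content_store, now):
    """Read path: try the cache, fall back to metadata table plus object store."""
    entry = cache.get(paste_id)
    if entry is None:
        info = metadata.get(paste_id)
        if info is None:
            return None                        # unknown paste, i.e. a 404
        entry = (content_store[paste_id], info["expiryDate"])
        cache.put(paste_id, entry)
    content, expires_at = entry
    if expires_at is not None and expires_at <= now:
        return None                            # expired pastes are treated as missing
    return content
```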
Detailed Component Design
Explained in the high level design above
Trade Offs/Tech Choices
- S3 is ideal for putting in the pasted content. It can scale horizontally and handle replication on its own
Failure Scenarios/Bottlenecks
- Our data store runs out of space
- Our database goes down
- Too many requests can overload our service
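The rate limiting from the high-level design is the main guard against this overload. The sketch below assumes that the "bucketing strategy" means a token bucket, which the design does not confirm, and the per-operation limits are hypothetical numbers. Each (IP, operation) pair gets its own bucket, reads get a larger allowance than writes, and a rejected request learns how long to wait, which a server could surface as an HTTP 429 response with a `Retry-After` header.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second, now=None):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)          # start with a full bucket
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return (allowed, retry_after_seconds)."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        return False, (1 - self.tokens) / self.refill_per_second


class RateLimiter:
    """Keeps one bucket per (client IP, operation) pair."""

    # Hypothetical limits: (bucket capacity, tokens refilled per second).
    LIMITS = {"read": (100, 10.0), "write": (10, 1.0)}

    def __init__(self):
        self._buckets = {}

    def check(self, ip, operation, now=None):
        key = (ip, operation)
        if key not in self._buckets:
            capacity, rate = self.LIMITS[operation]
            self._buckets[key] = TokenBucket(capacity, rate, now)
        return self._buckets[key].allow(now)
```

In a horizontally scaled deployment the buckets would need to live in shared storage such as Redis rather than in process memory; that part is left out of the sketch.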
Future Improvements
- We can use a Write Ahead Log (WAL) to protect against data store failures
- Schedule a job to run routinely to get rid of expired paste links
- Support not just pasting text, but also media like photos, files and videos