System Requirements
Functional -
- Users should be able to paste their text and get a paste link in return
- Users should be able to set the expiration time for the pasted content, including indefinite time
- We are not storing user data since we want them to stay anonymous
Non-Functional -
- Availability - We want our service to be available at all times
- Consistency - We are okay with the user not seeing their pasted text on the provided URL instantly. We are aiming for eventual consistency
- Fault Tolerance - The system should keep serving requests when individual servers or replicas fail
- Latency - Latency should be fairly low
- Durability - We can never lose any text given by the user
Capacity Estimation
- We assume that each paste is 1 MB in size
- We assume 1 million pastes are generated every day, so we generate 10^6 * 10^6 = 10^12 bytes = 1 TB of data every day
- Over 10 years, we would need ~3,650 TB of storage, or roughly 4 PB
- Factoring in replication and caching, we would need roughly 20 PB of storage over 10 years
- We serve 5 million requests per day, read and write combined
- Queries per second: a day has 86,400 seconds, which we round to 10^5, so QPS = 5 * 10^6 / 10^5 = 50 queries served per second
- Peak queries per second would be 5 * QPS = 250
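The arithmetic above can be checked with a short script. This is only a sketch of the estimate: the overhead factor of 5 is an assumption inferred from the 4 PB to 20 PB step, and the script uses the exact 86,400 seconds per day instead of the rounded 10^5, so it lands near 58 average and 289 peak QPS rather than 50 and 250.

```python
# Back-of-the-envelope capacity estimate using the assumptions listed above.
PASTE_SIZE_BYTES = 10**6        # 1 MB per paste
PASTES_PER_DAY = 10**6          # 1 million pastes per day
YEARS = 10
OVERHEAD_FACTOR = 5             # assumed multiplier for replication and caching
REQUESTS_PER_DAY = 5 * 10**6    # reads and writes combined
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400; the notes round this to 10^5
PEAK_FACTOR = 5

daily_bytes = PASTE_SIZE_BYTES * PASTES_PER_DAY
raw_pb = daily_bytes * 365 * YEARS / 10**15
total_pb = raw_pb * OVERHEAD_FACTOR
avg_qps = REQUESTS_PER_DAY / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

print(f"daily storage:      {daily_bytes / 10**12:.0f} TB")
print(f"raw 10-year total:  {raw_pb:.2f} PB")
print(f"with overhead:      {total_pb:.2f} PB")
print(f"average QPS:        {avg_qps:.1f}")
print(f"peak QPS:           {peak_qps:.1f}")
```

The rounded figures in the list are the ones to plan against; the script mainly shows how sensitive the result is to each assumption.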
API Design
We would primarily need two APIs.
- Create Paste Link
- POST
- /paste/create
- Takes its arguments as a JSON body: the content of the paste, the expiration date, the title of the content, etc.
- Returns a 4XX status code in case of an error, 2XX otherwise
- Get Pasted Content From Link
- GET
- /paste/fetch/:link
- Takes in the link as the argument and returns the pasted content for the link
- Returns 2XX in case of success, 4XX in case of errors
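A framework-free sketch of the two endpoints is shown here to make the request and response shapes concrete. The field names (content, title, expiresAt), the specific status codes, the size cap, and the random link are all assumptions for illustration; the in-memory dictionary stands in for the database and object store.

```python
import secrets
import time

MAX_PASTE_BYTES = 10**6  # assumed cap, matching the 1 MB paste size estimate
_pastes = {}             # in-memory stand-in for the paste table and object store


def create_paste(body, now=None):
    """POST /paste/create: returns (status_code, json_body)."""
    now = time.time() if now is None else now
    content = body.get("content")
    if not isinstance(content, str) or not content:
        return 400, {"error": "content must be a non-empty string"}
    if len(content.encode("utf-8")) > MAX_PASTE_BYTES:
        return 413, {"error": "content exceeds the size limit"}
    expires_at = body.get("expiresAt")  # Unix seconds; None keeps the paste indefinitely
    if expires_at is not None and (
        not isinstance(expires_at, (int, float)) or expires_at <= now
    ):
        return 400, {"error": "expiresAt must be a future Unix timestamp"}
    # Placeholder link: a random 8-character URL-safe token.
    link = secrets.token_urlsafe(6)
    _pastes[link] = {
        "content": content,
        "title": body.get("title"),
        "expiresAt": expires_at,
    }
    return 201, {"link": link}


def fetch_paste(link, now=None):
    """GET /paste/fetch/:link: returns (status_code, json_body)."""
    now = time.time() if now is None else now
    paste = _pastes.get(link)
    if paste is None:
        return 404, {"error": "paste not found"}
    if paste["expiresAt"] is not None and paste["expiresAt"] <= now:
        return 404, {"error": "paste has expired"}
    return 200, {"content": paste["content"], "title": paste["title"]}
```

Usage: `create_paste({"content": "hello", "expiresAt": time.time() + 3600})` returns a 201 with a link, and `fetch_paste(link)` returns the content until the expiry passes.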
Database Design
- We are not storing user data since we assume they want to remain anonymous
- We can use a relational database to auto generate an ID that will serve as our pasteID
- paste
- pasteID Primary Key
- pasteLink Secondary Key
- createdAt
- expiryDate Index
- objectID Index
- paste
- We use an object store to store our content
- In our relational database along with the pasteID we store the object ID that our object store returns to map the paste to the pasted content
- To quickly search through our expired links, we add an index to the expiryDate field
- Alternatively, we can use a Redis sorted set that stores the expiration date of each link as the score and its pasteID as the member. We keep checking the first element of the set; if we see a link that has expired, we delete the paste from our paste table. We also mark the object as deleted in our object store
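The expiry sweep described in the last bullet can be sketched as follows. To keep the example self-contained, a heap stands in for the Redis sorted set (in Redis this would map to commands such as ZADD, ZRANGEBYSCORE and ZREM), and plain dictionaries stand in for the paste table and the object store; the class and function names are hypothetical.

```python
import heapq


class ExpiryIndex:
    """Stand-in for a sorted set: score = expiry timestamp, member = pasteID."""

    def __init__(self):
        self._heap = []  # (expires_at, paste_id), earliest expiry first

    def add(self, paste_id, expires_at):
        heapq.heappush(self._heap, (expires_at, paste_id))

    def pop_expired(self, now):
        """Remove and return every pasteID whose expiry is at or before `now`."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            _, paste_id = heapq.heappop(self._heap)
            expired.append(paste_id)
        return expired


def sweep(index, paste_table, object_store, now):
    """Delete expired rows and mark their objects as deleted."""
    removed = []
    for paste_id in index.pop_expired(now):
        row = paste_table.pop(paste_id, None)
        if row is not None:
            object_store[row["objectID"]]["deleted"] = True
            removed.append(paste_id)
    return removed
```

Pastes with an indefinite lifetime are simply never added to the index, so the sweep never touches them.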
High-Level Design
- To prevent abuse of our system, we impose rate limiting based on IP address and a bucketing strategy. This bucketing strategy helps mitigate distributed attacks
- We allow a high number of reads but a lower number of writes. Also, we let the users know that they are being rate limited along with the time when they can make their next request
- All read and write requests go through a load balancing layer where the load balancer assigns servers based on one of multiple available schemes, like round robin, average response time etc.
- Our database uses single-leader replication: the leader accepts all writes, and its followers act as replicas
- S3 handles scaling and replication on its own for us
- To shard our database, we can use consistent hashing on the pasteID to decide which shard stores each paste
- To avoid collisions between paste links, we can Base64-encode the auto-generated pasteID (using a URL-safe alphabet) instead of hashing the content. Because the underlying IDs are unique and increase with time, the encoded links are unique as well
- We also add a caching layer and assume that 20% of our reads will be handled by the cache. Inside the cache, we use the Least Recently Used (LRU) eviction policy to evict pastes. For our cache, we can use Redis
- Since we are using MySQL, write conflicts are handled with shared and exclusive row locks: a transaction can write to a row only when no other transaction holds a lock on that row
- A transaction that requests a lock on a row while it is being written has to wait, because the writer holds an exclusive lock on that row. Under InnoDB's default isolation level, plain SELECTs read from a consistent snapshot instead of waiting
- To avoid losing our pastes while our databases are being written, we can use a Write Ahead Log (WAL) that can rebuild our database in case of a failure
- All components are horizontally scaled
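The sharding bullet above relies on consistent hashing, which can be sketched as a hash ring with virtual nodes. This is a minimal illustration under assumptions: the shard names, the MD5-based hash, and the count of 100 virtual nodes per shard are all placeholders, not choices made in this design.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, paste_id) -> str:
        """Return the shard owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(str(paste_id)))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for(12345))
```

The point of the ring is that adding a shard only moves the keys that now belong to the new shard; every other pasteID stays where it was, which keeps resharding cheap.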
Request Flows
Explained in the high level design above
Detailed Component Design
Explained in the high level design above
Trade Offs/Tech Choices
- S3 is ideal for putting in the pasted content. It can scale horizontally and handle replication on its own
Failure Scenarios/Bottlenecks
- Our data store runs out of space
- Our database goes down
- Too many requests can overload our service
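The overload scenario is what the per-IP rate limiting from the high-level design is meant to absorb. A token bucket is one way to sketch it, assuming separate read and write buckets per IP address; the limits below are made-up numbers, and the class names are hypothetical. The second return value is the wait time we report back to a rate-limited user.

```python
import time

# Assumed limits: (bucket capacity, tokens refilled per second).
LIMITS = {"read": (100, 10.0), "write": (10, 1.0)}


class TokenBucket:
    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return (allowed, retry_after_seconds)."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        # Time until one full token is available again.
        return False, (1 - self.tokens) / self.refill_per_sec


class RateLimiter:
    def __init__(self):
        self._buckets = {}  # (ip, kind) -> TokenBucket

    def check(self, ip, kind, now=None):
        key = (ip, kind)
        if key not in self._buckets:
            capacity, rate = LIMITS[kind]
            self._buckets[key] = TokenBucket(capacity, rate, now)
        return self._buckets[key].allow(now)
```

Usage: call `limiter.check(client_ip, "write")` before handling a create request, and return a 4XX response with the retry time when the first value is False.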
Future Improvements
- We can use a Write Ahead Log (WAL) to protect from data store failures
- Schedule a job to run routinely to get rid of expired paste links
- Support not just pasting text, but also media like photos, files and videos
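The Write Ahead Log idea from the first bullet can be sketched as follows: every change is appended to a log file and flushed to disk before it is applied, so the state can be rebuilt by replaying the log after a crash. This is a toy model, not how MySQL implements its own logging; the JSON record format and the WalStore name are assumptions.

```python
import json
import os
import tempfile


class WalStore:
    """Toy key-value store mapping pasteID -> objectID, protected by a log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record from a crash mid-write
                self._apply(record)

    def _apply(self, record):
        if record["op"] == "put":
            self.data[record["id"]] = record["value"]
        elif record["op"] == "delete":
            self.data.pop(record["id"], None)

    def _append(self, record):
        # Log first and force it to disk, then change the in-memory state.
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._apply(record)

    def put(self, paste_id, object_id):
        self._append({"op": "put", "id": paste_id, "value": object_id})

    def delete(self, paste_id):
        self._append({"op": "delete", "id": paste_id})

    def close(self):
        self._log.close()


# Usage: write, drop the in-memory state, then recover it from the log.
log_path = os.path.join(tempfile.mkdtemp(), "paste.wal")
store = WalStore(log_path)
store.put(1, "obj-1")
store.put(2, "obj-2")
store.delete(1)
store.close()

recovered = WalStore(log_path)
print(recovered.data)  # {2: 'obj-2'}
recovered.close()
```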