Requirements
Functional Requirements:
- Allow users to upload and store text or code snippets.
- Generate a unique shareable URL for each paste.
- Enable retrieval of paste content by URL.
- Support expiration and TTL for pastes.
- Allow paste owners or the system to delete a paste before its natural expiration.
- URLs must be non-guessable if pastes are to be private.
- Paste content should not be lost and should be replicated for durability.
Non-Functional Requirements:
- High Availability
- Horizontal Scalability
- High reliability
- High Consistency
- Low latency
Capacity Estimation:
Let's consider writes first. A lower mid-size Pastebin-style application handles about 150 paste creations per second, which is 150 * 86,400 seconds per day, i.e. about 12.96 million writes per day.
Now assume the service is read-heavy, with each paste viewed many times over its lifetime and roughly 60 reads for every write on average. That gives 150 * 60 = 9,000 read requests per second.
Storage:
Each paste content would have the following:
- Paste content ID (16 bytes)
- Paste content userID ( 16 bytes)
- Paste Content (512 KB)
- Paste Content URL (100 bytes max)
- Created Time (8 bytes)
- Expiration Time (8 bytes)
- Access type (1 byte)
Roughly 512 KB per entry, since the content dominates the metadata.
Daily growth: 150 writes/sec * 86,400 sec/day * 512 KB = ~6.64 TB per day (taking 1 KB = 1,000 bytes).
Bandwidth:
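Write Bandwidth: 150 requests/sec * 256 bytes = ~38 KB/sec for metadata alone, which is negligible. With a full 512 KB payload it is 150 * 512 KB = ~77 MB/sec.
Read Bandwidth: 9,000 requests/sec * 256 bytes = ~2.3 MB/sec for metadata alone, which can be handled. With a full 512 KB payload it is 9,000 * 512 KB = ~4.6 GB/sec, so most reads need to be served from the CDN and cache.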
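The estimates above can be re-derived with a few lines of arithmetic. This is a sketch under the section's own assumptions (150 writes/sec, 9,000 reads/sec, a 512 KB payload, 256 bytes of metadata per request) and it takes 1 KB as 1,000 bytes. The constant and function names are illustrative only.

```python
# Back-of-the-envelope capacity figures; every input is an assumption from this section.
WRITES_PER_SEC = 150
READS_PER_SEC = 9_000
SECONDS_PER_DAY = 86_400
PASTE_BYTES = 512 * 1_000   # 512 KB payload, taking 1 KB = 1,000 bytes
METADATA_BYTES = 256        # per-request size used in the bandwidth lines


def capacity_estimate() -> dict:
    """Return daily write volume, daily storage growth, and bandwidth figures."""
    writes_per_day = WRITES_PER_SEC * SECONDS_PER_DAY
    return {
        "writes_per_day": writes_per_day,
        "storage_tb_per_day": writes_per_day * PASTE_BYTES / 1e12,
        "write_kb_per_sec_metadata": WRITES_PER_SEC * METADATA_BYTES / 1e3,
        "read_mb_per_sec_metadata": READS_PER_SEC * METADATA_BYTES / 1e6,
        "write_mb_per_sec_payload": WRITES_PER_SEC * PASTE_BYTES / 1e6,
        "read_gb_per_sec_payload": READS_PER_SEC * PASTE_BYTES / 1e9,
    }


if __name__ == "__main__":
    for name, value in capacity_estimate().items():
        print(f"{name}: {value:,.2f}")
```

Changing any constant at the top shows how sensitive the storage and bandwidth figures are to the payload size assumption.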
API Design
Create Paste Content
POST /api/v1/paste
Request Body:
Text: Content to paste
AccessType: Public, Private
ExpiresAt (optional, default: never): ISO 8601 timestamp for when the link should stop working
Response (201 Created):
PasteContent: The text the user pasted.
PasteContentID: Id of the pasteContent created
PasteContentUserID: Id of the user who created the pasteContent
ShareURL: url of the paste content for sharing
CreatedAt: ISO 8601 timestamp for when the pasteContent was created
ExpiresAt: ISO 8601 timestamp for when the link should stop working
AccessType: Public/private
Response (400 Bad Request) If the request is invalid.
Response (429 Too Many Requests) Returned when paste creation is rate limited to prevent abuse and spam.
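The request fields above suggest a small validation step before a paste is accepted. The sketch below is one possible shape for it, not a definitive implementation. It assumes 512 KB is an upper bound on Text, that ExpiresAt must lie in the future, and that timestamps without an offset are UTC. The function name validate_create_request and the error messages are hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple

MAX_PASTE_BYTES = 512 * 1_000          # assumption: 512 KB is the size limit
ACCESS_TYPES = {"Public", "Private"}   # values listed in the request body


def validate_create_request(body: dict, now: datetime) -> Tuple[int, str]:
    """Return (status_code, message): 201 if the body is acceptable, else 400."""
    text = body.get("Text")
    if not isinstance(text, str) or not text:
        return 400, "Text is required"
    if len(text.encode("utf-8")) > MAX_PASTE_BYTES:
        return 400, "Text exceeds the size limit"
    if body.get("AccessType") not in ACCESS_TYPES:
        return 400, "AccessType must be Public or Private"

    expires_at = body.get("ExpiresAt")   # optional; None means never expires
    if expires_at is not None:
        try:
            parsed = datetime.fromisoformat(expires_at)
        except (TypeError, ValueError):
            return 400, "ExpiresAt must be an ISO 8601 timestamp"
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        if parsed <= now:
            return 400, "ExpiresAt must be in the future"
    return 201, "ok"
```

Rate limiting (the 429 case) is left out here because it depends on per-client state rather than on the request body.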
Get Paste Content
GET /pasteContent/{ID}
Path Parameter:
PasteContentID: ID of the pasteContent to retrieve, passed as {ID} in the path.
Response (200 OK)
PasteContent: The text the user pasted
ExpiresAt: ISO 8601 timestamp for when the link should stop working
Response (404 Not Found) If the pasteContent does not exist.
Response (401 Unauthorized) If the paste is private and the user is not logged in.
Response (403 Forbidden) If the user does not have permission to view the paste.
Delete Paste Content
DELETE /pasteContent/{ID}
Path Parameter:
PasteContentID: ID of the pasteContent to delete, passed as {ID} in the path.
Response (204 No Content) When deleted successfully.
Response (401 Unauthorized) If the user is not logged in.
Response (403 Forbidden) If the user does not have permission to delete the paste.
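The response codes listed for Get Paste Content can be read as a small decision procedure. The following sketch shows one way to order the checks. It assumes that a private paste is readable only by its owner and that an expired paste is reported as 404. The Paste dataclass and get_status function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Paste:
    owner_id: str
    access_type: str                 # "Public" or "Private", as in the API above
    expires_at: Optional[datetime]   # None means the paste never expires


def get_status(paste: Optional[Paste], requester_id: Optional[str], now: datetime) -> int:
    """Return the HTTP status a GET for this paste would produce."""
    if paste is None:
        return 404                   # no such paste
    if paste.expires_at is not None and paste.expires_at <= now:
        return 404                   # expired pastes look the same as missing ones
    if paste.access_type == "Private":
        if requester_id is None:
            return 401               # not logged in
        if requester_id != paste.owner_id:
            return 403               # assumption: only the owner may read
    return 200
```

Checking existence and expiry before authentication means a missing paste and an expired paste are indistinguishable to callers, which is one reasonable choice among several.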
High-Level Design
Database Design
We can go with Postgres and two separate tables: a Details table for metadata and a Content table for the paste content.
Details Table Schema
- PasteContentID (Primary Key)
- UserID
- PasteURL
- CreatedAt
- ExpiresAt
- AccessType
Content Table Schema
- PasteContentID (Primary Key, Foreign Key references PasteContentID of Details Table)
- PasteContent
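To make the two-table layout concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs without a server. It assumes the paste ID is the key shared by both tables, and the snake_case table and column names are illustrative. A real Postgres schema would likely use native UUID and TIMESTAMPTZ column types.

```python
import sqlite3

SCHEMA = """
CREATE TABLE details (
    paste_content_id TEXT PRIMARY KEY,
    user_id          TEXT NOT NULL,
    paste_url        TEXT NOT NULL UNIQUE,
    created_at       TEXT NOT NULL,
    expires_at       TEXT,                -- NULL means the paste never expires
    access_type      TEXT NOT NULL CHECK (access_type IN ('public', 'private'))
);
CREATE TABLE content (
    paste_content_id TEXT PRIMARY KEY
        REFERENCES details (paste_content_id) ON DELETE CASCADE,
    paste_content    TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with both tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this to enforce the FK
    conn.executescript(SCHEMA)
    return conn


def fetch_paste(conn: sqlite3.Connection, paste_id: str):
    """Join metadata and content for one paste; returns None if it does not exist."""
    return conn.execute(
        "SELECT d.paste_url, d.expires_at, d.access_type, c.paste_content "
        "FROM details AS d "
        "JOIN content AS c ON c.paste_content_id = d.paste_content_id "
        "WHERE d.paste_content_id = ?",
        (paste_id,),
    ).fetchone()
```

With ON DELETE CASCADE, deleting the Details row also removes the Content row, so a delete or an expiry sweep only has to touch one table.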
Read Path
Client Click -> CDN -> API Gateway -> Load Balancer -> Getter Service -> Redis Cache -> Postgres
Write (Paste) Path
Client Click -> CDN -> API Gateway -> Load Balancer -> Create Service -> Message Queue -> Queue Worker -> Postgres
The load balancer is integrated with the API Gateway routes through a VPC endpoint, which lets the gateway communicate with the internal services.
Whenever a request reaches an API route, it goes through the load balancer and from there to the respective service.
If a message fails processing three times at the message queue, we send it to a dead-letter queue (DLQ) for manual analysis and rectification. For a write operation, we let the user know that something went wrong and that they have to try again.
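The retry-then-DLQ rule described above can be sketched with an in-memory queue. This is a simplified illustration rather than the behaviour of any particular message broker. The message shape (a dict with body and attempts) and the drain_queue function are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_ATTEMPTS = 3   # after this many failures a message goes to the DLQ


def drain_queue(
    queue: Deque[Dict],
    handler: Callable[[str], None],
    dead_letters: List[Dict],
) -> int:
    """Process every message; return how many were handled successfully."""
    processed = 0
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
            processed += 1
        except Exception as exc:
            message["attempts"] += 1
            message["last_error"] = repr(exc)   # kept for manual analysis
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(message)
            else:
                queue.append(message)           # retry later, behind other messages
    return processed
```

A production worker would normally add a delay or backoff between attempts; that is omitted here to keep the retry count logic visible.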
Since the CDN also caches responses, most reads will be served by the CDN itself. On a CDN cache miss the request comes into the system, where the Redis cache is checked next. Only after missing both does the request hit the DB.
On expiry we invalidate the pasteContent entry in both caches to keep it from going stale. For this we can use a cleanup service.
Detailed Component Design
Caching Strategy
CDN (Hot Tier): Most read requests will be served by the CDN itself. A TTL on cached entries keeps the cache fresh.
Redis (Warm Tier): If a cache miss happens at the CDN, Redis is the next layer. Data is loaded into Redis only when someone opens the pasteContent URL, and LRU eviction keeps the most recently used pastes in memory.
When a pasteContent is deleted we have to remove it from both caches, i.e. the CDN and Redis. For Redis we can delete the key, and for the CDN we can issue an invalidation.
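The warm-tier behaviour described above (load on first read, LRU eviction, delete on removal) can be sketched without a real Redis server. In this illustration an OrderedDict stands in for Redis and a plain dict stands in for Postgres. The PasteReader class and its capacity parameter are hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Optional


class PasteReader:
    """Cache-aside reads with LRU eviction over an in-memory stand-in store."""

    def __init__(self, database: Dict[str, str], capacity: int) -> None:
        self.database = database                      # stand-in for Postgres
        self.cache: "OrderedDict[str, str]" = OrderedDict()  # stand-in for Redis
        self.capacity = capacity
        self.db_reads = 0

    def get(self, paste_id: str) -> Optional[str]:
        if paste_id in self.cache:
            self.cache.move_to_end(paste_id)          # mark as most recently used
            return self.cache[paste_id]
        self.db_reads += 1
        content = self.database.get(paste_id)
        if content is not None:
            self.cache[paste_id] = content            # populate only on a read
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)        # evict the least recently used
        return content

    def delete(self, paste_id: str) -> None:
        self.database.pop(paste_id, None)
        self.cache.pop(paste_id, None)                # drop the cached copy as well
```

The CDN invalidation step is not modelled here; it would be an extra call in delete alongside the cache removal.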
Postgres
We went with Postgres because it makes it easy to keep the metadata and the content separate, and joining the two tables is straightforward.
Trade Offs:
Since we are using joins, reads pay some extra latency for the join at query time. A document DB would avoid the join, but the separation of concerns between metadata and content would be weaker.
CleanUp Service
The cleanup service deletes entries that have expired. Running it separately from the core services keeps expiry work off the request path. After deleting an entry from the DB, it also deletes the key from Redis and invalidates the CDN cache. The system determines that a paste has expired based on the ExpiresAt metadata, and uses this to remove the pasteContent. If a user tries to access an expired paste, we return 404 Not Found with a message saying the pasteContent has expired.
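One pass of the cleanup service could look like the sketch below. It is only an outline under simplifying assumptions: a dict of ExpiresAt values stands in for the DB, a set stands in for the Redis keys, and a list collects the CDN paths that would be invalidated. The function names find_expired and sweep are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Set


def find_expired(expires_at_by_id: Dict[str, Optional[datetime]], now: datetime) -> List[str]:
    """Return IDs whose ExpiresAt is set and not later than now."""
    return [
        paste_id
        for paste_id, expires_at in expires_at_by_id.items()
        if expires_at is not None and expires_at <= now
    ]


def sweep(
    expires_at_by_id: Dict[str, Optional[datetime]],
    redis_keys: Set[str],
    cdn_invalidations: List[str],
    now: datetime,
) -> List[str]:
    """Delete expired pastes from the store, then clear both cache tiers."""
    expired = find_expired(expires_at_by_id, now)
    for paste_id in expired:
        del expires_at_by_id[paste_id]                        # DB delete
        redis_keys.discard(paste_id)                          # Redis key delete
        cdn_invalidations.append(f"/pasteContent/{paste_id}")  # CDN path to invalidate
    return expired
```

Because the sweep runs periodically, the read path still needs its own ExpiresAt check so that a paste that expired between sweeps is not served.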
Key Generation:
To avoid collisions we can use random (version 4) UUIDs. Generating one billion UUIDs per second for about 86 years gives roughly a 50% chance of a single collision, so simultaneous paste creations at our write rate will effectively never collide.
Trade Off:
Although UUIDs are great at avoiding collisions, the generated key is not very readable. Each key is 128 bits (16 bytes), which the storage estimate already accounts for.
It works for this use case.
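The collision argument can be checked numerically with the standard birthday-bound approximation, p ≈ 1 - exp(-n^2 / (2 * 2^b)), where n is the number of IDs generated and b is the number of random bits. The sketch below assumes version-4 UUIDs, which carry 122 random bits, and uses Python's standard uuid module. The helper names are illustrative.

```python
import math
import uuid

RANDOM_BITS = 122   # a version-4 UUID has 128 bits, 6 of which are fixed


def new_paste_id() -> str:
    """Generate a random, non-guessable paste ID."""
    return str(uuid.uuid4())


def collision_probability(n: float, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation of at least one collision among n random IDs."""
    return -math.expm1(-(n * n) / (2.0 * 2.0 ** bits))


if __name__ == "__main__":
    ids_in_100_years = 150 * 86_400 * 365 * 100   # this design's write rate
    print(new_paste_id())
    print(collision_probability(ids_in_100_years))
    print(collision_probability(2.71e18))          # about 1e9 IDs/sec for 86 years
```

At 150 writes per second the probability after a century is vanishingly small, while it approaches one half only around 2.7 * 10^18 generated IDs.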
Rate Limiting:
Since we are using an API Gateway, it can handle rate limiting as well. We can set it to throttle on a steady rate and a burst size. If we issue API keys per client, we can rate limit per client too.
Trade off:
With this setup we cannot rate limit by IP address, which may be useful in some cases. If we want to rate limit by IP, we may have to use a Web Application Firewall.
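Rate and burst throttling is commonly modelled as a token bucket, and the sketch below illustrates that idea. It is a generic illustration, not the implementation of any specific gateway product. The TokenBucket class and the allow_request helper keyed by API key are hypothetical, and time is passed in explicitly so that the behaviour is deterministic.

```python
from typing import Dict


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)   # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # caller would respond with 429


def allow_request(buckets: Dict[str, TokenBucket], api_key: str, now: float,
                  rate: float = 1.0, burst: int = 2) -> bool:
    """Per-client limiting: one bucket per API key, created on first use."""
    bucket = buckets.setdefault(api_key, TokenBucket(rate, burst, now))
    return bucket.allow(now)
```

Keying the buckets by API key gives per-client limits; keying by source IP instead would need the IP to be visible at this layer, which is the trade-off noted above.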