Design Pastebin - System Design

Requirements

Functional Requirements:

Allow users to upload and store text or code snippets.
Generate a unique shareable URL for each paste.
Enable retrieval of paste content by URL.
Support expiration and TTL for pastes.
Allow paste owners or the system to delete a paste before its natural expiration.

Non-Functional Requirements:

Once text has been pasted, the upload of text should be done within 500ms. (write operation)
The retrieval of paste content by URL should be done within 500ms. (read operation)
The service should support up to 1 billion MAU. Peak QPS will be around 1500.
The service should be available at least 99.999% of time.
The service could store up to 10K words per user. Assuming 8 byte per word, that would be 80TB of storage needed.

API Design

Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...

There will be 4 APIs we need to support:

Upload of pasted text:

User_id is a unique identifier of a user, used as a path parameter.

POST v1/upload_text/{user_id}

{

post_id: UUID,

text: String,

createdAt: Timestamp,

textType: Enum, (code or text),

ttl: String

}

Generation of unique sharable URLs:

POST v1/generate_url/{user_id}

{

post_id: UUID

}

Retrieval of text blob using the URL:

GET v1/retrieve_text/{user_id}

{

url: String

}

Deletion of a post:

DELETE v1/delete_post/{user_id}

{

post_id: UUID

}

High-Level Design

Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.

First, we need to define the data model for storing the pasted text metadata:

table pasted_text_metadata {

post_id: UUID,

user_id: UUID,

created_at: Timestamp,

ttl: string,

text_type: string (whether the stored text is a text or code snippet)

}

For this table, we build an index on post_id.

table pasted_text_url {

sharable_url: string,

post_id: UUID,

user_id: UUID,

}

For this table, we have 2 indexes, one is on sharable_url, and the other is on post_id.

The above 2 tables store the metadata for the pastebin text. For metadata data storage, we will use postgres, which is a highly available distributed SQL database with ACID properties.

For the object storage, we will have post_id as keys, while the text blob as the stored object.

For all the requests, we first go through a load balancer, which routes traffic to different servers using consistent hashing. The request then goes to an API gateway, which does authentication, and rate_limiting by user_id, IP address etc.

Now let me walk through the 4 flows:

When users want to upload text, client triggers the text upload API. We create a unique UUID for the text, specify the text type, set the user_id of the user creating the text snippet, and set the corresponding creation timestamp and associated TTL. We will use a write-through cache, so all updates will happen at the caching layer and the database layer. This will increase latency but ensure consistency. Repeated reads will be optimized by the caching layer.
When users want to delete text, client triggers the text delete API. We evict the entry from cache, and trigger an update to the underlying database at the same time.
When users want to generate a sharable url. We take in user_id, post_id and current timestamp, and hash the inputs to generate a unique url. We then create mappings between the url and the associated post, and write the results to the pasted_text_url table. We then return the url to users.
When users want to retrieve the text from url, the url triggers the retrieve_text API. We find the associated post in the pasted_text_url table, and query the object storage based on post_id, to return the pasted text.

For removing stale pasted text after TTLs, we will rely on the inherent TTL features on both redis and postgres.

Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.

For the high level design, I will cover several questions, with focus on how does the system handle heavy read and write traffic, with high availability and relatively low latency.

Assuming the service supports up to 1 billion MAU, the peak read QPS will be around 1500, and write QPS will be 1/3 of that.

For the read path, we first hit the load balancer. To balance the traffic evenly on different servers and avoid hot spots, we will use consistent hashing and use the user_id as the hashing key. For a user, you can either write (add pasted text, remove text or generate url), or you can read (retrieve text based on a url). To prevent malicious actors from overwhelming the system, for each request from a user, we will do both authentication and rate limiting. On the rate limiting side, we will rate limit by user_id, and each user is allowed to send 5 write requests per minute, and retrieve contents for 10 urls per minute. As we handle more QPS, this ensures that we only accept requests from authenticated users under the rate limit threshold, and their requests are distributed relatively evenly on servers.
For the servers, they are stateless, so we can easily scale horizontally and vertically.
For the redis cache, we will use a write through cache instead of a write back cache. While the write back cache has lower latency, in case of a cache crash before data is written to the database, we will lose the data, and causing inconsistencies. For the pasted bin system, we don't want users to lose their pasted data. On the read path, a lot of QPS will be directed towards the same hot posts. In this case, our redis cache can quickly return the post metadata for these repeated requests. On the write path, the tradeoff for this system is relatively higher write latency, since we need to write to cache, to postgres cluster and to object storage. For postgres, we will build indexes on post_ids and sharable_urls. The indexes on post_id make sure that any read to retrieve text metadata gets quick results, and the indexes on sharable_urls makes the retrieval of post_id fast based on urls.
For both the redis and postgres cluster, since the data stored has TTL, we don't expect storage to keep increasing, so there is no need for sharding. However, as traffic increases, database CPU might become too hot. In this case, we will create replicas for both redis and postgres. There will be 1 primary database handling writes, and the rest are read replicas. For write, we will use sync the writes to read replicas before committing an operation. This ensures consistency, with the tradeoff of higher write latency.