Design Pastebin - System Design

Requirements

Functional Requirements:

Users should be able to paste text and create posts
Users should be able to read a post using its unique post id
Users should be able to delete posts
Posts should be retained for a configurable time and then deleted

Non-Functional Requirements:

Low latency paste reads (10s of milliseconds)
Low latency paste creation (10 - 50 milliseconds)
Durability and retention of pastes until their TTL expires
High availability for paste and view endpoints

Capacity Estimation:

Let us assume we have 1M DAU
Let us say 10% of the users create a paste every day, so 100k pastes are created per day
Let us say that users read on average 10 pastes per day, totaling 10M paste reads per day
This is about 1 paste write per second and 100 paste reads per second
On average a paste will be 100 kB in size. At 100k pastes per day, we will store 10 GB per day.
This is 300 GB of data that we have to store each month, if we cap TTL to be no more than 1 month
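
The arithmetic above can be checked with a short script. This is only a back-of-the-envelope sketch: the function name and parameter names are made up for illustration, and a month is assumed to be 30 days.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_capacity(dau=1_000_000, writer_percent=10, reads_per_user=10,
                      avg_paste_bytes=100_000, retention_days=30):
    """Return daily volumes, per-second rates and storage for the stated assumptions."""
    writes_per_day = dau * writer_percent // 100
    reads_per_day = dau * reads_per_user
    bytes_per_day = writes_per_day * avg_paste_bytes
    return {
        "writes_per_day": writes_per_day,
        "reads_per_day": reads_per_day,
        "writes_per_sec": writes_per_day / SECONDS_PER_DAY,
        "reads_per_sec": reads_per_day / SECONDS_PER_DAY,
        "storage_gb_per_day": bytes_per_day / 1e9,
        # Upper bound on live data if every paste uses the maximum TTL.
        "storage_gb_retained": bytes_per_day * retention_days / 1e9,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.2f}")
```

The exact rates come out slightly above the rounded figures (a little over 1 write and a little over 100 reads per second), which does not change any of the design decisions.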

API Design

Let's start simple, we will need three API endpoints:

POST /createPaste - {"text": "...", "ttl": "number of seconds"} returns the unique paste id
GET /view/id
DELETE /paste/id
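
As a rough sketch of what the create endpoint might check before accepting a request, the function below validates the JSON body. It assumes that ttl is sent as an integer number of seconds and that the one-month TTL cap from the capacity estimate is enforced at the API; the function name and error messages are hypothetical.

```python
import json

# Assumption: the one-month TTL cap from the capacity estimate, taken as 30 days.
MAX_TTL_SECONDS = 30 * 24 * 60 * 60


def parse_create_paste(body: str) -> dict:
    """Validate a createPaste request body and return its fields.

    Raises ValueError for malformed JSON or invalid fields.
    """
    data = json.loads(body)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")

    text = data.get("text")
    ttl = data.get("ttl")
    if not isinstance(text, str) or not text:
        raise ValueError("text must be a non-empty string")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(ttl, bool) or not isinstance(ttl, int) or ttl <= 0:
        raise ValueError("ttl must be a positive integer number of seconds")
    if ttl > MAX_TTL_SECONDS:
        raise ValueError("ttl exceeds the maximum retention period")
    return {"text": text, "ttl": ttl}
```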

High-Level Design

At a glance we will have an API gateway which directs requests to our application servers. The database is the most likely bottleneck in our application, so we will use a cache to reduce the read burden on it. Since our data is static, i.e. no edits, we will also keep a CDN cache. A cron job periodically removes expired pastes from our database.

To scale our system horizontally, we will deploy multiple application servers, and the API gateway will give us load balancing and rate limiting capabilities to smooth out traffic and ban abusers.

Detailed Component Design

Database:

The database will be the backbone of our application. While we could use a simple relational database for everything, relational databases are not well suited to storing long blobs of text. So we will use two stores, one for metadata and another for the actual paste text. For the metadata store we will use Postgres and for the object store we will use S3. To meet our durability and retention SLAs we will keep replicas of the metadata store; S3 handles durability and retention internally.

Application Servers:

For incoming read requests, the application servers query the metadata, fetch the associated paste text from the object store and return it. For big pastes, ones >50 MB in size, we will chunk them and stream the chunks. On the write side we will take the paste, store the metadata and then upload the text to S3. For big pastes, again >50 MB of text, we will send a pre-signed URL to the client so that they can upload directly to S3, reducing load on the servers.
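
The size-based routing and chunked streaming described above could look roughly like the sketch below. The 50 MB threshold comes from the text (taken here as decimal megabytes); the 1 MB chunk size and the names upload_route and iter_chunks are assumptions for illustration. Generating the actual pre-signed URL is left to the S3 client library and is not shown.

```python
from typing import Iterator

LARGE_PASTE_BYTES = 50 * 1000 * 1000  # 50 MB threshold from the design
CHUNK_BYTES = 1_000_000               # hypothetical chunk size for streaming


def upload_route(size_bytes: int) -> str:
    """Decide whether a paste goes through the app server or straight to S3."""
    if size_bytes > LARGE_PASTE_BYTES:
        return "presigned_url"  # client uploads directly to the object store
    return "via_server"         # app server uploads on the client's behalf


def iter_chunks(data: bytes, chunk_size: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of data, each at most chunk_size bytes long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```

In a real read path the chunks would come from a ranged or streaming read of the object store rather than from a bytes value already held in memory.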

Our cron job runs every 10 minutes, so an expired paste can sit in the database for up to 10 minutes. Even though its cache entry will have expired due to TTL, the paste could still be served through a cache miss. To prevent this, before returning any response the application servers check whether the paste is still active; if it is not, they return a 410 Gone status code.
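
A minimal sketch of that check, assuming the metadata row is available as a dict with a timezone-aware expires_at field (the field and function names are hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def read_status(meta: Optional[dict], now: datetime) -> int:
    """Return the HTTP status code the read path should answer with."""
    if meta is None:
        return 404  # no such paste id
    if now >= meta["expires_at"]:
        return 410  # Gone: the row still exists but the paste has expired
    return 200


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stale = {"expires_at": now - timedelta(minutes=3)}  # cron has not run yet
    print(read_status(stale, now))
```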

A better design here would be microservices, with a service for read and a service for write. This way we can scale reads and writes independently and according to traffic. The core logic per service remains the same.

One thing I overlooked was pasteId creation. Some approaches are:

Hashing the paste text and using the first 24 characters of the hash
Using a UUID

Since truncating the hash may lead to collisions, and identical texts always hash to the same id, let us go with a UUID.
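
The two approaches can be compared with the standard library. This sketch assumes SHA-256 for the hash option and a random (version 4) UUID for the other; the text above names neither, so both are illustrative choices.

```python
import hashlib
import uuid


def hashed_paste_id(text: str) -> str:
    """First 24 hex characters of the SHA-256 digest of the paste text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:24]


def uuid_paste_id() -> str:
    """Random version-4 UUID as 32 hex characters, independent of the content."""
    return uuid.uuid4().hex


if __name__ == "__main__":
    # Two users pasting the same text get the same hashed id ...
    print(hashed_paste_id("hello") == hashed_paste_id("hello"))
    # ... while UUID-based ids are generated independently per paste.
    print(uuid_paste_id() == uuid_paste_id())
```

With content hashing, two pastes of the same text by different users would share one id, which complicates per-user ownership and per-paste TTLs. A UUID avoids that at the cost of a slightly longer id.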

Deletion path: when a user makes a delete request, we first check whether this is the user that created the post, using user_id. If it is, we mark the corresponding paste as expired in the metadata store and update our cache and CDN so they stop serving it. The cron job can then delete the row and the paste text the next time it runs.
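
The deletion path might be sketched as follows, with plain dicts standing in for the metadata store and the cache. The status codes, the field names and the choice to mark a paste expired by setting expires_at to the current time are assumptions; purging the CDN depends on the provider and is only noted in a comment.

```python
from datetime import datetime, timedelta, timezone


def delete_paste(metadata: dict, cache: dict, paste_id: str,
                 user_id: str, now: datetime) -> int:
    """Soft-delete a paste and return the HTTP status code for the request."""
    meta = metadata.get(paste_id)
    if meta is None:
        return 404
    if meta["user_id"] != user_id:
        return 403  # only the creator may delete the paste
    # Mark as expired; the cron job removes the row and the blob later.
    meta["expires_at"] = now
    # Drop the cached entry so later reads fall through to the expiry check.
    cache.pop(paste_id, None)
    # A CDN purge for this paste would be issued here (provider specific).
    return 204


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    metadata = {"p1": {"user_id": "alice", "expires_at": now + timedelta(days=1)}}
    cache = {"p1": dict(metadata["p1"])}
    print(delete_paste(metadata, cache, "p1", "bob", now))    # not the owner
    print(delete_paste(metadata, cache, "p1", "alice", now))  # owner
```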

Cache:

We will use a Redis cache for requested pastes. On a cache miss, we query the metadata store and save only the metadata in Redis. On subsequent requests we read the metadata from Redis, get the paste text from S3 and return it. Since our data is static we can also use a CDN to cache full pastes at the edge, which will improve global latency.

As our data is ephemeral, we need to take care to invalidate cached pastes. To do this, when the read service inserts a cache entry it calculates the remaining lifetime of the paste from expires_at and the current time, and sets the entry's TTL to that value. This way we will not serve expired pastes from the cache.
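
The TTL calculation is small enough to sketch directly. It assumes expires_at is stored as a timezone-aware datetime; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def cache_ttl_seconds(expires_at: datetime, now: datetime) -> int:
    """Whole seconds the paste has left to live, never negative.

    Truncating to whole seconds means the cache entry expires slightly
    before the paste does, never after it.
    """
    remaining = int((expires_at - now).total_seconds())
    return max(remaining, 0)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    ttl = cache_ttl_seconds(now + timedelta(hours=2), now)
    print(ttl)  # seconds to pass as the cache entry's expiry
```

A result of 0 means the paste has already expired, so the read service should skip the cache write entirely rather than store an entry with no lifetime.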

On the availability and durability front, the cache and CDN can continue to serve already-cached pastes during a database or microservice outage.

CRON Job:

We will run a cron job every 10 minutes to clean up any expired pastes from both the metadata store and the object store. Caches don't need to be cleaned here as we have set the TTL appropriately.
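
One pass of the cleanup job could be sketched as below, with dicts standing in for Postgres and S3. The names are hypothetical, and deleting the blob before the metadata row is an assumption made here: if the job fails between the two steps, the surviving row lets the next run retry instead of leaving an orphaned object.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def cleanup_expired(metadata: dict, objects: dict, now: datetime) -> List[str]:
    """Delete every expired paste from both stores and return the removed ids."""
    expired = [pid for pid, meta in metadata.items() if meta["expires_at"] <= now]
    for pid in expired:
        objects.pop(pid, None)  # blob first; tolerate blobs that are already gone
        del metadata[pid]       # metadata row last
    return expired


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    metadata = {
        "old": {"expires_at": now - timedelta(minutes=1)},
        "live": {"expires_at": now + timedelta(days=1)},
    }
    objects = {"old": b"stale text", "live": b"fresh text"}
    print(cleanup_expired(metadata, objects, now))
```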