Design Pastebin - System Design

Requirements

Functional Requirements:

Create and share paste
- the system should create a paste for the user and share it via a unique URL
Expiration of paste
- allow users to set an expiration time for a paste; once that time has passed, the paste is deleted
Unique URL/ID generation
- the system should reliably generate a unique ID for each paste URL, with a collision-prevention mechanism
Paste Retrieval
- users should be able to retrieve a paste using its unique URL
User anonymity
- allow users to create a paste without creating an account, to preserve simplicity
Update Paste
- users should be able to update an existing paste
Delete Paste
- users should be able to delete an existing paste (immediate delete)

Non-Functional Requirements:

Low response time
- < 100 ms for saving text
- < 100 ms for retrieving text
Durability
- the data must be retained until it expires
High availability
- 99.99 % uptime
Scalable
- able to handle traffic spikes
Security
- data at rest: user text is encrypted with a standard symmetric algorithm such as AES-256
- data in transit: TLS is enforced for the website
Consistent user experience
- the paste must be available to user immediately after its creation.

Capacity estimation

Traffic (Writes)

Suppose we anticipate on the order of 1 million new pastes per day. That is roughly 12 paste creation requests per second on average (1,000,000 / 86,400 s ≈ 11.6). Traffic will not be uniform; at peak times we might see perhaps 10x the average load (on the order of 100+ writes per second during spikes). For this estimate we count each paste as a single write and ignore updates.

Traffic (Reads)

The system is read-heavy. Each paste may be read many times after creation. A read-to-write ratio of around 5:1 or higher is reasonable (many references assume between 5× and 10× more reads than writes). For example, if we have 1M new pastes/day, we might see ~5–10 million paste retrievals per day. This comes out to roughly 60–120 read requests per second on average, and potentially bursts of thousands of reads per second at peak if a particular paste goes viral. In a more extreme scenario (100:1 read/write ratio), read traffic could reach ~100M/day (~1200 reads/s), so our design should be prepared for high read throughput.
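The traffic figures above come from dividing daily volumes by the number of seconds in a day. A minimal sketch of that arithmetic follows; the helper name per_second is illustrative and not part of the design.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(requests_per_day: float) -> float:
    """Average requests per second for a given daily volume."""
    return requests_per_day / SECONDS_PER_DAY


writes_per_day = 1_000_000
avg_writes = per_second(writes_per_day)   # about 11.6 writes/s
peak_writes = 10 * avg_writes             # about 116 writes/s at a 10x spike

# Average read rate for the read-to-write ratios discussed above.
reads = {ratio: per_second(ratio * writes_per_day) for ratio in (5, 10, 100)}

print(f"avg writes/s: {avg_writes:.1f}, peak writes/s: {peak_writes:.0f}")
for ratio, rate in reads.items():
    print(f"{ratio}:1 read/write ratio -> {rate:.0f} reads/s on average")
```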

Data Size per Paste

We must decide on maximum and average paste sizes. To prevent abuse, we can cap the size of a paste (for example, maximum 1 MB or 5–10 MB of text). Realistically, many pastes are much smaller (snippets of code or logs). Let’s assume an average paste size ~10 KB (some may be just a few hundred bytes, some could be larger, but 10 KB is a typical order of magnitude).

Storage (Daily & Total)

At 1M pastes/day * 10 KB each, that’s about ~10 GB of new data per day that needs to be stored. Over longer periods:

In one month (~30 days): ~300 GB of data.

In one year: ~3.6 TB of data.

In five years: ~18 TB of data.

These figures assume we retain everything indefinitely. If we implement expirations, the actual stored data would depend on how long pastes live. For instance, if we only retain data for 3 months (90 days), we’d store roughly the last ~900 GB of pastes at any given time. Given millions of users, we should design for multiple terabytes of storage in the long run.
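The storage numbers can be reproduced with a few lines of arithmetic. This is a sketch under the assumptions stated above (1M pastes/day, 10 KB average, decimal units); the function name stored_gb is illustrative.

```python
PASTES_PER_DAY = 1_000_000
AVG_PASTE_BYTES = 10_000  # 10 KB, using decimal units (1 KB = 1,000 bytes)


def stored_gb(days: int) -> float:
    """Gigabytes accumulated after `days` days if nothing is deleted."""
    return PASTES_PER_DAY * AVG_PASTE_BYTES * days / 1e9


for label, days in [
    ("1 day", 1),
    ("1 month", 30),
    ("1 year", 365),
    ("5 years", 5 * 365),
    ("90-day retention window", 90),
]:
    print(f"{label}: {stored_gb(days):,.0f} GB")
```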

Total Number of Pastes

Over five years at 1M pastes/day, we could accumulate about 1.8–2 billion paste entries (if none expired). Even with expirations, the database could contain on the order of hundreds of millions of records. This impacts how we choose our data store and how we generate unique keys (IDs).

ID Space and Collision

We need a strategy to generate unique paste IDs that won’t run out. If we use an alphanumeric ID of length 6 (using 62 characters [0-9, a-z, A-Z]), we have 62^6 ≈ 56.8 billion possible IDs. This is plenty for our needs (e.g. 2 billion IDs used is only ~3.5% of that space). Using base64 (64 characters) with 6 characters gives ~68.7 billion possibilities. Even in the distant future with tens of billions of pastes, 6-character IDs suffice. We can always extend to 7 characters (62^7 ≈ 3.5 trillion combos) if needed. The chance that a newly generated random ID collides with an existing one equals the fraction of the ID space already in use, so it stays small at our scale (about 3.5% per attempt even with 2 billion stored IDs), but we will still handle collision cases (see Key Generation in detailed design).
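The ID-space numbers are easy to check directly. The sketch below computes the size of the space and the chance that a single random draw hits an ID already in use; both function names are illustrative.

```python
def id_space(alphabet_size: int, length: int) -> int:
    """Number of distinct IDs of the given length over the given alphabet."""
    return alphabet_size ** length


def collision_chance(existing_ids: int, alphabet_size: int = 62, length: int = 6) -> float:
    """Chance that one freshly drawn random ID matches an ID already in use."""
    return existing_ids / id_space(alphabet_size, length)


print(f"62^6 = {id_space(62, 6):,}")
print(f"64^6 = {id_space(64, 6):,}")
print(f"62^7 = {id_space(62, 7):,}")
print(f"per-attempt collision chance at 2 billion IDs: {collision_chance(2_000_000_000):.2%}")
```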

These estimations guide our design: we need a system that can handle tens of writes per second, hundreds to thousands of reads per second, and store terabytes of data over time. The design should support scaling out horizontally as these numbers grow.

Database design

The data model for a paste is simple. Each paste is essentially a small document with an ID as the key. We can separate metadata (ID, timestamps, etc.) from the actual content for efficiency. The design will include:

Paste Metadata: This will be stored in a fast, persistent database keyed by the paste ID (which is the unique URL token).

PasteID (Key): The unique identifier (e.g. “abc123”). This is the primary key used to look up the paste. It’s a short string (length ~6-8). All pasteIDs are unique; we can enforce this with a primary key or unique index.
ContentPointer: Instead of storing the full text content in the primary database, we store a pointer or reference to where the content is stored. For instance, if we use an object storage service (like AWS S3) to hold the paste text, this could be an object key or URL. Another approach is to store the content directly if it’s small – but given content can be up to megabytes, using external storage is more scalable (see below).
CreationTime: Timestamp of when the paste was created (or stored).
ExpirationTime: Timestamp of when the paste is set to expire (optional; null or a special value if the paste doesn’t expire). This allows us to implement TTL logic – we know after this time the paste is invalid and can be purged.

We can represent the Paste metadata table as follows:

id: Unique paste identifier (e.g. “abc123”). Primary key for lookup.
content_ref: Reference to stored content (e.g. an object store key or file path).
created_at: Timestamp when the paste was created.
expires_at: Timestamp when the paste will expire (nullable if no expiration).
size: Size of the paste content in bytes (for informational or optimization use).

Since we have no user accounts, we don’t need a user ID in this table. Each paste stands on its own.
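As a rough illustration, the metadata record could be modelled as a small immutable type. The field names follow the table above; the class name PasteMetadata and the is_expired helper are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class PasteMetadata:
    id: str                          # unique paste identifier, primary key
    content_ref: str                 # object store key or file path
    created_at: datetime
    expires_at: Optional[datetime]   # None means the paste never expires
    size: int                        # content size in bytes

    def is_expired(self, now: datetime) -> bool:
        """True once the expiration time has been reached."""
        return self.expires_at is not None and now >= self.expires_at


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
meta = PasteMetadata(
    id="abc123",
    content_ref="pastes/abc123",
    created_at=created,
    expires_at=created + timedelta(days=1),
    size=10_240,
)
print(meta.is_expired(created + timedelta(hours=1)))  # still valid
print(meta.is_expired(created + timedelta(days=2)))   # past expiration
```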

Paste Content Storage: For storing the actual text content, a common pattern (given potentially large volumes of text) is to use an object storage or blob storage service. The idea is to offload the heavy lifting of storing and serving large text blobs to a system designed for it. Options include cloud storage like Amazon S3, Google Cloud Storage, or a distributed file system. The metadata table will store a reference (e.g. an S3 key or URL) to the content.

If the paste is small (say under some threshold like 1-2 KB), we could consider storing it directly in the database record (to save an extra lookup). But for simplicity and consistency, we can store all content in the object store and keep only references in the DB. This also keeps the database size smaller (only ~hundreds of bytes per paste record for metadata, instead of potentially kilobytes).
Each paste’s content object can be named by the paste ID (for instance, we could use the pasteID as the filename in S3). This makes retrieval straightforward: if you have the ID, you can directly fetch the object from storage.
Storing content externally provides effectively unlimited scalability and high durability: object stores can handle billions of objects and large data volumes cheaply. It also allows independent scaling of storage and database: our DB handles small records (ID->pointer), while the heavy data is on a storage service that can scale out separately.
We should ensure the object store has proper replication and backup, or if self-hosted, that we replicate files to avoid data loss. Cloud object stores typically handle replication automatically across multiple availability zones.

Why not store content directly in the database?

Storing everything in one SQL or NoSQL database is simpler initially, but it can become a bottleneck:

Large text blobs can slow down database queries and replication. By separating content, the DB only handles keys and small metadata (fast to query).
Using an object store means we can serve content directly to users (possibly via CDN) without always going through our application servers, improving read throughput (more on this in High-level Design).
Object stores are optimized for high-throughput, streaming large data, and cost-effective storage (often cheaper per GB than database storage).

Database Choice

We need to pick a database for the metadata. Key considerations:

We will have on the order of billions of records over time, with a simple key->value access pattern (no complex relationships or joins).
We require high write and read throughput, and the ability to scale horizontally.
A NoSQL key-value store or wide-column store is a natural fit. For example, Amazon DynamoDB (a hosted key-value store) or Apache Cassandra (a distributed wide-column DB) are designed for this kind of workload (billions of keys, high read/write rates). They also natively support partitioning and replication.
A relational database (SQL) could be used for the metadata (with id as primary key). It provides strong consistency and ease of querying, but scaling to billions of rows might require sharding or federation. It’s feasible (companies do shard MySQL/Postgres for large data), but adds complexity. Since our access pattern is simple and mostly primary-key lookups, a NoSQL store can give scalability with less management overhead.
We can also consider the consistency needs. DynamoDB or Cassandra are typically eventually-consistent by default, but can be configured for strong consistency on reads/writes if needed. A relational DB would give strong consistency by default (reads from master) but might trade off availability in a partition. Given our requirement that newly created pastes be immediately available, we will ensure whichever DB we choose can satisfy that (e.g. perform reads at consistency level that ensures latest data).

DynamoDB (if using AWS) fits this data model: it can handle a very high scale of requests and has a TTL feature for items. Cassandra, which also supports TTL on written data, is an equally valid choice. Both meet our needs; we’ll proceed with the general concept, since the design is similar either way.

Search Indices

Our primary access pattern is by ID (exact lookup), which is the primary key. That lookup is O(1) with a hash-based index or O(log N) with a tree-based index, depending on the DB; either way it is very fast. We should also consider an index on the expires_at field if we want to efficiently query for expired records for cleanup. In Cassandra, one might use a time-series table or a separate mechanism for expiration, but since TTL will handle automatic expiration, we may not need a secondary index. If using a system without TTL, we might need a scheduled job to find expired pastes (e.g. scanning by time or keeping a sorted set of expiring IDs).
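For a store without built-in TTL, the "sorted set of expiring IDs" idea can be sketched with a min-heap ordered by expiration time. The class name ExpiryIndex and its methods are hypothetical; a real deployment would more likely use a database index or something like a Redis sorted set.

```python
import heapq


class ExpiryIndex:
    """Keeps (expires_at, paste_id) pairs ordered by expiration time."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def add(self, paste_id: str, expires_at: float) -> None:
        heapq.heappush(self._heap, (expires_at, paste_id))

    def pop_expired(self, now: float) -> list[str]:
        """Remove and return the IDs whose expiration time is <= now."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expired.append(heapq.heappop(self._heap)[1])
        return expired


index = ExpiryIndex()
index.add("a", expires_at=10.0)
index.add("b", expires_at=5.0)
index.add("c", expires_at=20.0)
first_batch = index.pop_expired(now=12.0)   # "b" and "a" are due, "c" is not
print(first_batch)
```

The cleanup job only touches entries that are actually due, so it never has to scan the whole table.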

Summary

Each paste operation will result in:

An object stored (the text content in blob storage).
A metadata entry in the database (with key and metadata).

This separation allows us to scale and manage the two concerns (metadata vs content) independently. The metadata DB remains lean and efficient for queries, and the object store handles large data storage with high durability.
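The two writes per paste can be sketched as follows. Plain dictionaries stand in for the object store and the metadata database, and the function names, the "pastes/" key prefix, and the choice to write the content before the metadata are assumptions of this sketch rather than requirements of the design.

```python
from datetime import datetime, timezone
from typing import Optional

object_store: dict[str, bytes] = {}   # stand-in for S3 or another blob store
metadata_db: dict[str, dict] = {}     # stand-in for the metadata database


def save_paste(paste_id: str, text: str, expires_at: Optional[datetime] = None) -> dict:
    body = text.encode("utf-8")
    content_ref = f"pastes/{paste_id}"
    # 1. Store the content first, so a metadata row never points at a missing object.
    object_store[content_ref] = body
    # 2. Then write the small metadata record keyed by paste ID.
    record = {
        "id": paste_id,
        "content_ref": content_ref,
        "created_at": datetime.now(timezone.utc),
        "expires_at": expires_at,
        "size": len(body),
    }
    metadata_db[paste_id] = record
    return record


def load_paste(paste_id: str) -> Optional[str]:
    record = metadata_db.get(paste_id)
    if record is None:
        return None
    return object_store[record["content_ref"]].decode("utf-8")


record = save_paste("abc123", "print('hi')")
print(record["size"], load_paste("abc123"))
```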

API Design

External API

1. Create Paste

Method: POST

Endpoint: /{version}/paste

Request Body:

paste_text: string (required) : text included in paste

Response:

Success:

status: ENUM: SUCCESS

status_code: int: 200

paste_id: id: id of paste

Error:

status: ENUM: ERROR

status_code: int: 4xx | 5xx

2. Get Paste

Method: GET

Endpoint: /{version}/paste/{paste_id}

Path Variable:

paste_id: id: id of the paste

Response:

Success:

status: ENUM: SUCCESS

status_code: int: 200

paste_text: string: text of the paste

Error:

status: ENUM: ERROR

status_code: int: 4xx | 5xx

3. Update Paste

Method: PATCH

Endpoint: /{version}/paste/{paste_id}

Path Variable:

paste_id: id: id of the paste

Request Body:

new_text: string: new text

Response:

Success:

status: ENUM: SUCCESS

status_code: int: 200

Error:

status: ENUM: ERROR

status_code: int: 4xx | 5xx

4. Delete Paste

Method: DELETE

Endpoint: /{version}/paste/{paste_id}

Path Variable:

paste_id: id: id of the paste

Response:

Success:

status: ENUM: SUCCESS

status_code: int: 200

Error:

status: ENUM: ERROR

status_code: int: 4xx | 5xx
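To make the four endpoints concrete, here is a framework-free sketch of handlers that return response bodies in the shape listed above. The function names, the in-memory store, the specific 400/404 codes, and the paste_text field returned by the GET handler are illustrative assumptions; routing, versioning, and persistence are left out.

```python
import secrets

_pastes: dict[str, str] = {}  # in-memory stand-in for the real storage layer


def _ok(**fields) -> dict:
    return {"status": "SUCCESS", "status_code": 200, **fields}


def _error(status_code: int) -> dict:
    return {"status": "ERROR", "status_code": status_code}


def create_paste(paste_text: str) -> dict:
    """POST /{version}/paste"""
    if not paste_text:
        return _error(400)              # paste_text is required
    paste_id = secrets.token_hex(4)     # 8 hex characters, for illustration only
    while paste_id in _pastes:          # retry on the unlikely collision
        paste_id = secrets.token_hex(4)
    _pastes[paste_id] = paste_text
    return _ok(paste_id=paste_id)


def get_paste(paste_id: str) -> dict:
    """GET /{version}/paste/{paste_id}"""
    if paste_id not in _pastes:
        return _error(404)
    return _ok(paste_text=_pastes[paste_id])


def update_paste(paste_id: str, new_text: str) -> dict:
    """PATCH /{version}/paste/{paste_id}"""
    if paste_id not in _pastes:
        return _error(404)
    if not new_text:
        return _error(400)
    _pastes[paste_id] = new_text
    return _ok()


def delete_paste(paste_id: str) -> dict:
    """DELETE /{version}/paste/{paste_id}"""
    if _pastes.pop(paste_id, None) is None:
        return _error(404)
    return _ok()
```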

High-Level Design

Service Component:

CDN: acts as a cache located close to the user.
API Gateway: responsible for authentication and for routing traffic to the correct microservice
Rate Limiter: part of the API gateway; limits the number of requests that can be sent to the servers to mitigate DoS attacks.
Load balancer: responsible for distributing load across the instances of each microservice
Create Paste Service: microservice responsible for creating pastes
Retrieve Paste Service: microservice responsible for retrieving pastes
Delete Paste Service: microservice responsible for deleting pastes
Update Paste Service: microservice responsible for updating existing pastes
Unique ID Generation Service: microservice responsible for creating the unique ID of each paste, using a secure random algorithm to generate an alphanumeric string of length 8
Cache Layer: serves frequently used data quickly, instead of reading it from the database, which is slower.
Database Cluster: a cluster of database instances that acts as the source of truth for the system.
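The document does not say which algorithm the rate limiter uses. One common choice is a token bucket, sketched below as an assumption; the class name, parameters, and the explicit `now` argument (used to keep the example deterministic) are all illustrative.

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_second=1.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]   # third request in the burst is rejected
later = bucket.allow(now=1.0)                       # one token has been refilled
print(burst, later)
```

In practice the gateway would keep one bucket per client key (for example per IP address), since pastes can be created anonymously.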

Diagram

Detailed Component Design

Gen UID Service:

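secure random
- use a cryptographically secure random generator to produce an alphanumeric string of length 8. With 62^8 (about 218 trillion) possible values, the chance that a new ID collides with one of n existing IDs is n/62^8. If a collision occurs, we generate a new random string and try again.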
KSG algorithm
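The secure-random approach can be sketched with Python's secrets module. The is_taken callback stands in for a lookup against the database, and the retry limit is an assumption of this sketch.

```python
import secrets
import string
from typing import Callable

ALPHABET = string.digits + string.ascii_letters  # 62 characters: 0-9, a-z, A-Z
ID_LENGTH = 8


def random_id(length: int = ID_LENGTH) -> str:
    """Draw each character from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def generate_unique_id(is_taken: Callable[[str], bool], max_attempts: int = 5) -> str:
    """Generate an ID, drawing again whenever the candidate already exists."""
    for _ in range(max_attempts):
        candidate = random_id()
        if not is_taken(candidate):
            return candidate
    raise RuntimeError("could not find a free ID; consider a longer ID length")


used_ids: set[str] = set()
new_id = generate_unique_id(lambda candidate: candidate in used_ids)
used_ids.add(new_id)
print(new_id)
```

Checking and inserting should be a single atomic step in the real store (for example an insert that fails on a duplicate primary key), otherwise two concurrent requests could both see the same ID as free.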

Database:

database choice:
- primary function of paste-bin is to map the unique id of paste to the text
- key-value database:
  - we can directly use a key-value database, like a map data structure, to map the uid (key) to the text (value)
- mysql database:
  - we can have the uid as the primary key of the table, while the text is just another column
- in my opinion, MySQL is the better option. Although mapping a uid to its text is the primary objective, we need to go further: we also need to keep track of the creation timestamp, expiration timestamp, title, isOneTimeRead, etc.
- In this case, MySQL is easier to extend with such fields than a plain key-value database.
database instance:
- a single database instance acts as the source of truth for the system
- simple to implement
database cluster:
- multiple instances work together to improve the availability of the data
- data may get out of sync between instances
  - solution:
    - Raft algorithm
    - the most up-to-date instance is elected as the leader
    - the rest are followers and replicate data from the leader via its log
    - if the leader becomes unhealthy, one of the followers is elected as the new leader
In this case we will go with a database cluster, as it is more scalable and provides higher availability, and the drawbacks of a distributed system can be mitigated by the Raft algorithm.
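As a sketch of the relational option discussed under database choice, the table below uses the uid as primary key plus the extra columns mentioned there. SQLite (from the Python standard library) is used here only as a runnable stand-in for MySQL; the table name, column names, and types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE paste (
        uid              TEXT PRIMARY KEY,
        content          TEXT NOT NULL,
        title            TEXT,
        created_at       INTEGER NOT NULL,   -- Unix timestamp
        expires_at       INTEGER,            -- NULL means no expiration
        is_one_time_read INTEGER NOT NULL DEFAULT 0
    )
    """
)


def insert_paste(uid, content, title, created_at, expires_at=None) -> bool:
    """Insert a row; return False if the uid is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO paste (uid, content, title, created_at, expires_at) "
                "VALUES (?, ?, ?, ?, ?)",
                (uid, content, title, created_at, expires_at),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key violation: the uid collided


def fetch_content(uid):
    row = conn.execute("SELECT content FROM paste WHERE uid = ?", (uid,)).fetchone()
    return row[0] if row else None


first = insert_paste("aB3dE9xZ", "hello", "demo", 1_700_000_000)
second = insert_paste("aB3dE9xZ", "other", "dup", 1_700_000_001)
print(first, second, fetch_content("aB3dE9xZ"))
```

The primary-key constraint doubles as the collision check for generated IDs: a failed insert signals that a new ID should be drawn.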

Paste expiration:

Since users can set an expiration condition for every paste, we need an efficient way to invalidate pastes in bulk. For a time-based condition, we just need to store the expiration timestamp with the ID. If a request arrives after the expiration timestamp, we know the paste is invalid and can delete it. In addition, a cron job can periodically check for pastes that have expired and delete them all to reduce storage cost.
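The two expiration paths described above (delete on read, plus a periodic sweep) can be sketched as follows. The dictionary stands in for the database, and the function names and the explicit `now` parameter are assumptions of this sketch.

```python
from typing import Optional

# paste_id -> {"text": str, "expires_at": float or None}
pastes: dict[str, dict] = {}


def _is_expired(record: dict, now: float) -> bool:
    return record["expires_at"] is not None and now >= record["expires_at"]


def read_paste(paste_id: str, now: float) -> Optional[str]:
    """Return the text, deleting the paste lazily if it has expired."""
    record = pastes.get(paste_id)
    if record is None:
        return None
    if _is_expired(record, now):
        del pastes[paste_id]
        return None
    return record["text"]


def sweep_expired(now: float) -> int:
    """What the cron job would run: delete every expired paste, return the count."""
    expired = [pid for pid, record in pastes.items() if _is_expired(record, now)]
    for pid in expired:
        del pastes[pid]
    return len(expired)


pastes["a"] = {"text": "short-lived", "expires_at": 100.0}
pastes["b"] = {"text": "also short-lived", "expires_at": 100.0}
pastes["c"] = {"text": "permanent", "expires_at": None}

before = read_paste("a", now=50.0)    # not yet expired
after = read_paste("a", now=150.0)    # expired: removed on read
removed = sweep_expired(now=150.0)    # removes "b"; "c" never expires
print(before, after, removed)
```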

Cache:

We can use a write-through cache to keep the data in the cache in sync with the database. We also need to tune and set a Time To Live (TTL) for data in the cache, so that after a reasonable time the item is cleared from the cache, saving space. When data is deleted from the database, we also need a trigger to make sure it is deleted from the cache as well.
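A write-through cache with a TTL might look like the sketch below. A dictionary stands in for the database, the class and parameter names are hypothetical, and the injectable clock exists only to make the example deterministic.

```python
import time


class WriteThroughCache:
    def __init__(self, database: dict, ttl_seconds: float, clock=time.monotonic) -> None:
        self._db = database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict = {}  # key -> (value, cached_until)

    def put(self, key, value) -> None:
        self._db[key] = value                                    # source of truth first
        self._entries[key] = (value, self._clock() + self._ttl)  # then the cache

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]                                      # fresh cache hit
        self._entries.pop(key, None)                             # missing or past its TTL
        value = self._db.get(key)
        if value is not None:
            self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def delete(self, key) -> None:
        self._db.pop(key, None)
        self._entries.pop(key, None)                             # keep the cache in step


fake_now = [0.0]
db: dict = {}
cache = WriteThroughCache(db, ttl_seconds=60, clock=lambda: fake_now[0])
cache.put("abc123", "hello")
print(db["abc123"], cache.get("abc123"))
```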
Algorithm
- LRU: the least-recently-used policy is one of the more popular cache eviction algorithms thanks to its simplicity. The cache has a predefined size, and when it exceeds this limit it removes the least recently used item to make room for a new one. This is one of the eviction policies built into Redis; however, it is not entirely suitable for our usage pattern because it does not account for how frequently an item is used.
- LFU: the least-frequently-used policy. As the name suggests, it evicts the item with the fewest accesses, so it takes usage frequency into account rather than only recency. That makes it more suitable for our use case.
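To illustrate the difference, here is a deliberately simple LFU cache. It is a sketch rather than a production structure: eviction scans all keys (O(n)), ties are broken by least recent use, and the class name is hypothetical.

```python
class LFUCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._values: dict = {}
        self._hits: dict = {}        # key -> number of accesses
        self._last_used: dict = {}   # key -> logical time of last access
        self._clock = 0

    def _touch(self, key) -> None:
        self._clock += 1
        self._hits[key] = self._hits.get(key, 0) + 1
        self._last_used[key] = self._clock

    def get(self, key):
        if key not in self._values:
            return None
        self._touch(key)
        return self._values[key]

    def put(self, key, value) -> None:
        if self.capacity <= 0:
            return
        if key not in self._values and len(self._values) >= self.capacity:
            # Evict the key with the fewest hits; break ties by least recent use.
            victim = min(self._values, key=lambda k: (self._hits[k], self._last_used[k]))
            for table in (self._values, self._hits, self._last_used):
                del table[victim]
        self._values[key] = value
        self._touch(key)


cache = LFUCache(capacity=2)
cache.put("popular", 1)
cache.get("popular")
cache.get("popular")
cache.put("recent", 2)   # used once, but more recently than "popular"
cache.put("new", 3)      # cache is full: LFU evicts "recent", LRU would evict "popular"
print(cache.get("popular"), cache.get("recent"), cache.get("new"))
```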

Scalability:

We will deploy stateless services on a container orchestration service such as AWS ECS, so we can scale containers up or down on demand. We can also define scaling conditions: for example, when the request rate of a service exceeds a predefined value, more instances are added to serve the incoming load. Because the services are stateless and tolerate interruption, we can also run them on spot instances to reduce infrastructure cost.
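The request-based scaling condition can be expressed as a simple target-tracking rule. This is a generic sketch, not the ECS auto scaling API; the function name, the per-instance target, and the min/max bounds are assumptions.

```python
import math


def desired_instances(
    requests_per_second: float,
    target_rps_per_instance: float,
    min_instances: int = 2,
    max_instances: int = 50,
) -> int:
    """Instances needed to keep each one at or below its target request rate."""
    needed = math.ceil(requests_per_second / target_rps_per_instance)
    return max(min_instances, min(max_instances, needed))


# Example with the extreme read estimate (~1,200 reads/s) and an assumed
# capacity of 100 requests/s per instance.
print(desired_instances(1200, 100))
```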