Design Pastebin - System Design

My Solution for Design Pastebin with Score: 8/10

by celestial_lotus529

System requirements

Functional:

List functional requirements for the system...

Store text and return a URL that gives access to it for a set period of time.
Support signup, login, and session management.
Content stops being accessible once that period has passed.

Non-Functional:

List non-functional requirements for the system...

Availability
Scalability

Capacity estimation

Estimate the scale of the system you are going to design...

1. Estimating Traffic and Data Load

A. Writes (Paste Creations)

Pastes per Day: Let's assume we anticipate around 1 million new pastes per day.
Average Paste Size: We'll assume an average paste size of 10 KB.
Total Daily Data Ingestion:
- $1 \text{ million pastes/day} \times 10 \text{ KB each} = 10 \text{ GB/day}$.
Writes per Second:
- With 1 million pastes per day:
  - $\frac{1,000,000}{24 \times 60 \times 60} \approx 12 \text{ writes per second}$ on average.
- During peak loads, this could spike to ten times the average, or just over a hundred writes per second.
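
The write-side arithmetic can be reproduced with a few lines of Python. This is only a sketch of the estimate: the inputs are the assumed figures from this section, decimal units are used (1 GB = 10^6 KB), and the names `per_second` and `peak_factor` are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Average rate per second for a daily event count."""
    return events_per_day / SECONDS_PER_DAY


pastes_per_day = 1_000_000  # assumed daily paste volume
avg_paste_kb = 10           # assumed average paste size
peak_factor = 10            # assumed peak-to-average multiplier

daily_ingest_gb = pastes_per_day * avg_paste_kb / 1_000_000  # KB -> GB (decimal)
avg_wps = per_second(pastes_per_day)
peak_wps = avg_wps * peak_factor

print(f"daily ingest: {daily_ingest_gb:.0f} GB")
print(f"average writes/s: {avg_wps:.1f}")
print(f"peak writes/s: {peak_wps:.0f}")
```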

B. Reads (Paste Retrievals)

Read to Write Ratio: We assume a read-heavy workload with a read-to-write ratio of 5:1, possibly as high as 10:1.
Reads per Day:
- For a 5:1 ratio:
  - $5 \times 1 \text{ million pastes per day} = 5 \text{ million read requests per day}$.
- Reads per Second:
  - Average: $\frac{5,000,000}{24 \times 60 \times 60} \approx 58 \text{ reads per second}$ (about 116 at a 10:1 ratio).
- This can burst significantly on viral pastes, potentially requiring thousands of reads per second at peak times.

2. Storage Requirements

Daily Storage: At 10 GB of new data per day, storage over time becomes significant.
Monthly Storage: Approximately $300 \text{ GB/month}$.
Yearly Storage: Approximately $3.6 \text{ TB/year}$.
Retention Policy: If pastes expire or are deleted after a fixed period, such as 3 months, storage levels off at around 900 GB (90 days at 10 GB/day).
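
A small sketch of the storage figures, assuming a 30-day month, a 365-day year, and "3 months" taken as 90 days. The function name is illustrative.

```python
def steady_state_storage_gb(daily_gb: float, retention_days: int) -> float:
    """Storage held once expiry removes data as fast as new data arrives."""
    return daily_gb * retention_days


daily_gb = 10  # assumed daily ingestion

monthly_gb = daily_gb * 30
yearly_tb = daily_gb * 365 / 1000
retained_gb = steady_state_storage_gb(daily_gb, retention_days=90)

print(f"monthly: {monthly_gb} GB, yearly: {yearly_tb:.2f} TB, retained: {retained_gb} GB")
```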

3. Unique ID Space and Collisions

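ID Size: Using a 6-character alphanumeric ID provides a space of $62^6 \approx 56.8 \text{ billion possibilities}$.
With 1 million new pastes every day, it would take roughly 155 years to exhaust this key space. Exhaustion is not the same as collision, though: if IDs are drawn at random, the birthday bound makes a duplicate likely after only a few hundred thousand pastes, so the service needs either a uniqueness check on insert or a longer ID (the store logic later in this design uses 8-12 characters).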
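
As a rough check on these numbers, the sketch below computes the key space, the time to exhaust it, and a birthday-bound estimate of collision probability. It assumes IDs are drawn uniformly at random, which the design does not state; with sequentially allocated IDs there would be no collisions until exhaustion. All function names are illustrative.

```python
import math

ALPHABET_SIZE = 62  # a-z, A-Z, 0-9


def keyspace(length: int, alphabet: int = ALPHABET_SIZE) -> int:
    """Number of distinct IDs of the given length."""
    return alphabet ** length


def years_to_exhaust(space: int, ids_per_day: int) -> float:
    """Years until every ID has been used once, at a constant daily rate."""
    return space / ids_per_day / 365


def collision_probability(n_ids: int, space: int) -> float:
    """Birthday approximation: P(at least one duplicate among n random IDs)."""
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


six = keyspace(6)
print(f"key space: {six:,}")
print(f"years to exhaust at 1M/day: {years_to_exhaust(six, 1_000_000):.0f}")
print(f"P(collision) after 1M random 6-char IDs: {collision_probability(1_000_000, six):.4f}")
print(f"P(collision) after 1M random 12-char IDs: {collision_probability(1_000_000, keyspace(12)):.2e}")
```

Under the random-ID assumption, a single day's worth of 6-character IDs is already very likely to contain a duplicate, while 12 characters makes that probability negligible.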

Key Considerations

Scalability: The design should allow horizontal scaling to handle the peak loads efficiently.
Data Durability: Critical to ensure no data loss occurs, particularly for popular pastes.
Caching Strategy: This is crucial for handling the read-heavy nature of the workload, with frequently accessed pastes cached to reduce database load.

API design

Define what APIs are expected from the system...

1. Store Data

This endpoint is used to create a new paste by sending the text content and user information.

Method: POST
Path: /pastebin/store

Request Body

Content-Type: application/json

| Field | Type | Description |
| --- | --- | --- |
| `text` | `string` | The text content to be stored. |
| `userid` | `string` | The unique identifier for the user. |

Example Request:

{
  "text": "This is a new paste for the service.",
  "userid": "user-12345"
}

Response

Status Code: 201 Created
Content-Type: application/json

| Field | Type | Description |
| --- | --- | --- |
| `url` | `string` | The unique URL for the newly created paste. |

Example Response:

{
  "url": "https://pastebin.example.com/abcdef123"
}

2. Fetch Data

This endpoint is used to retrieve the text content of a paste using its unique hash.

Method: GET
Path: /{hash}

Response

Status Code: 200 OK
Content-Type: application/json

| Field | Type | Description |
| --- | --- | --- |
| `text` | `string` | The raw text content of the paste. |

Example Response:

{
  "text": "This is a new paste for the service."
}

Note: If the hash is not found, the API returns a 404 Not Found status. If the paste exists but has expired, it returns 410 Gone.
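
The request and response shapes of the two endpoints can be sketched as plain Python handlers. This is a minimal illustration, not the service itself: the in-memory dict stands in for real storage, the ID is a random placeholder rather than the hash scheme of this design, and the 400 response for malformed input is an assumption that the API description does not specify.

```python
import json
import secrets
from http import HTTPStatus

BASE_URL = "https://pastebin.example.com"  # example domain used in this section
_pastes: dict[str, str] = {}               # hypothetical in-memory stand-in for storage


def handle_store(body: str) -> tuple[int, dict]:
    """POST /pastebin/store: validate the JSON body and return the paste URL."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return HTTPStatus.BAD_REQUEST, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return HTTPStatus.BAD_REQUEST, {"error": "body must be a JSON object"}
    text, userid = payload.get("text"), payload.get("userid")
    if not isinstance(text, str) or not isinstance(userid, str):
        return HTTPStatus.BAD_REQUEST, {"error": "text and userid must be strings"}
    paste_hash = secrets.token_urlsafe(6)  # placeholder ID, 8 URL-safe characters
    _pastes[paste_hash] = text
    return HTTPStatus.CREATED, {"url": f"{BASE_URL}/{paste_hash}"}


def handle_fetch(paste_hash: str) -> tuple[int, dict]:
    """GET /{hash}: return the stored text, or 404 if the hash is unknown."""
    text = _pastes.get(paste_hash)
    if text is None:
        return HTTPStatus.NOT_FOUND, {"error": "paste not found"}
    return HTTPStatus.OK, {"text": text}
```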

Database design

The design uses two stores: the raw paste text goes to AWS S3, and the metadata goes to a relational database.
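
One possible shape for the metadata table, sketched with Python's built-in `sqlite3` so it can be run locally. The column names follow the fields named elsewhere in this design (encodedUrl, s3Key, userId, expiryTimestamp); the `createdAt` column, the column types, the use of Unix-second timestamps, and the sample row are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE pastes (
    encodedUrl      TEXT NOT NULL PRIMARY KEY,  -- hash used in the URL; the key is indexed and unique
    s3Key           TEXT NOT NULL,              -- pointer to the paste body in S3
    userId          TEXT NOT NULL,
    createdAt       INTEGER NOT NULL,           -- Unix seconds (assumed representation)
    expiryTimestamp INTEGER NOT NULL            -- Unix seconds (assumed representation)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Illustrative row: a paste created at t and expiring 24 hours later.
conn.execute(
    "INSERT INTO pastes VALUES (?, ?, ?, ?, ?)",
    ("abcdef123", "pastes/abcdef123", "user-12345", 1_700_000_000, 1_700_086_400),
)

# The lookup the fetch path needs: hash -> (s3Key, expiryTimestamp).
row = conn.execute(
    "SELECT s3Key, expiryTimestamp FROM pastes WHERE encodedUrl = ?",
    ("abcdef123",),
).fetchone()
print(row)
```

Making the hash the primary key means a colliding hash is rejected by the database instead of silently overwriting an existing paste.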

High-level design

Pastebin Service ⚙️

This is the core service that orchestrates the entire process. Its responsibilities are:

Data Storage: It's responsible for the primary action of the service—storing text data. It writes the raw text content to AWS S3 and saves the corresponding metadata to the database.
URL Creation: It's tasked with generating the unique, shareable URL for each paste. This is done by creating a hash based on a UUID and User ID, which makes each hash effectively unique per paste. This hash is stored as encodedUrl in the database, and the final URL is constructed as domain/hash.
URL and Data Retrieval: When a user accesses a URL, the service must:
- Extract the hash from the URL.
- Use the hash to query the database and retrieve the associated metadata (including the S3 key and expiry time).
- Check whether the expiry time has passed; if it has, reject the request.
- Use the S3 key from the metadata to fetch the actual text data from AWS S3.
- Return the text content to the user.

How URLs are Created

The URL creation process is a key part of the service's design. The service combines two unique identifiers—a UUID (Universally Unique Identifier) and the User ID—to create a unique hash. This approach ensures that the URL is not easily guessable and is unique to a specific user and a specific "paste" session. The hash is then appended to the domain to form the complete URL, like domain/hash.
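
A minimal sketch of this hashing step, assuming SHA-256 over the user ID and a fresh UUID, base62 encoding, and an 8-character result. The function names, the separator, and the choice to keep the trailing characters are illustrative, not part of the original design.

```python
import hashlib
import uuid

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(number: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if number == 0:
        return BASE62[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))


def make_paste_hash(user_id: str, length: int = 8) -> str:
    """Hash user ID + random UUID with SHA-256 and keep `length` base62 chars."""
    seed = f"{user_id}:{uuid.uuid4()}".encode("utf-8")
    digest = hashlib.sha256(seed).digest()
    encoded = base62_encode(int.from_bytes(digest, "big"))
    # Pad in the (astronomically unlikely) case of a very small digest,
    # then keep the low-order characters.
    return encoded.rjust(length, BASE62[0])[-length:]


print(make_paste_hash("user-12345"))
```

Because the hash is truncated, uniqueness is probabilistic rather than guaranteed, so a uniqueness constraint in the metadata store is still worth having.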

Service Components

AWS S3 ☁️

AWS S3 (Simple Storage Service) is an object store used to hold the raw text data. It's chosen for its key features:

Cost-Effectiveness: It's a cheap and scalable storage solution, making it ideal for storing large volumes of data without high costs.
High Scalability: It can handle virtually unlimited data, ensuring the service can grow as needed.

Database 💾

The database's primary role is to act as a metadata store. It's crucial for the service's functionality and stores information such as:

encodedUrl: The unique hash used in the URL.
S3 Key: The pointer or key that links the database record to the actual text data stored in S3.
User Details: Information about the user who created the paste.
Other Metadata: This could include things like the creation date, expiration date, or privacy settings for the paste.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

1. Client

The Client is the user-facing component, which can be a web browser, a mobile app, or a command-line utility.

Key Responsibilities:

User Interface: Provides an interface for the user to input text and a user ID.
Request Handling: Sends POST requests to the API to store new pastes and GET requests to retrieve existing ones.
Data Presentation: Displays the generated URL for a new paste and presents the retrieved text content to the user.

2. API Gateway

The API Gateway acts as the entry point for all client requests. It provides a single, unified interface for the backend services.

Key Responsibilities:

Request Routing: Directs incoming requests to the appropriate backend service. For this design, it forwards all requests to the Pastebin Service.
Endpoint Management: Exposes the public endpoints (/pastebin/store and /{hash}).
Basic Validation: Can perform initial checks on the request path to ensure it is valid before forwarding.

3. Pastebin Service (Backend Logic)

This is the core business logic component of the system. It orchestrates the storage and retrieval of data by interacting with the data tier components.

Key Responsibilities:

Store Logic:
1. Receives the text and userid from the API Gateway.
2. Generates a collision-resistant hash based on the userid and a newly created UUID. A robust hashing algorithm (e.g., SHA-256) should be used, truncated to a URL-friendly length (e.g., 8-12 characters).
3. Writes the raw text data to AWS S3 under an s3Key chosen by the service (S3 object keys are supplied by the caller, not generated by S3). Non-blocking I/O keeps the service from tying up a thread while waiting for S3, but the write must complete before the metadata is committed.
4. Calculates an expiryTimestamp (e.g., 24 hours from creation).
5. Stores the metadata (the hash, s3Key, userid, creation timestamp, and expiryTimestamp) in the Database, with a retry mechanism if the database is temporarily unavailable.
6. Constructs the full, retrievable URL and returns it to the API Gateway.
7. Error Handling: Implements a rollback mechanism. If the database write fails after a successful S3 write, the S3 object must be deleted to prevent "orphaned" data.
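
The store steps, including the retry and the rollback, can be sketched as follows. The two classes are hypothetical in-memory stand-ins for S3 and the metadata database, the ID is a placeholder rather than the SHA-256 scheme, and the retry loop omits the backoff a real service would want.

```python
import time
import uuid


class FakeObjectStore:
    """Hypothetical in-memory stand-in for S3."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def delete(self, key):
        self.objects.pop(key, None)


class FakeMetadataDb:
    """Hypothetical in-memory stand-in for the metadata database."""

    def __init__(self, fail=False):
        self.rows = {}
        self.fail = fail

    def insert(self, row):
        if self.fail:
            raise ConnectionError("database unavailable")
        self.rows[row["encodedUrl"]] = row


def store_paste(text, user_id, s3, db, ttl_seconds=24 * 60 * 60,
                max_attempts=3, now=None):
    now = time.time() if now is None else now
    paste_hash = uuid.uuid4().hex[:10]  # placeholder ID for this sketch
    s3_key = f"pastes/{paste_hash}"     # the service chooses the object key
    s3.put(s3_key, text)                # body is written before the metadata
    row = {
        "encodedUrl": paste_hash,
        "s3Key": s3_key,
        "userId": user_id,
        "createdAt": now,
        "expiryTimestamp": now + ttl_seconds,
    }
    last_error = None
    for _ in range(max_attempts):       # simple retry, no backoff
        try:
            db.insert(row)
            return f"https://pastebin.example.com/{paste_hash}"
        except ConnectionError as exc:
            last_error = exc
    s3.delete(s3_key)                   # rollback: avoid an orphaned object
    raise last_error
```
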
Fetch Logic:
1. Receives a hash from the API Gateway.
2. Checks the Caching Layer: before querying the database, the service looks up the hash in a cache (e.g., Redis) that holds the s3Key and expiryTimestamp. This significantly improves performance for frequently accessed pastes.
3. On a cache miss, queries the Database using the hash to retrieve the corresponding s3Key and expiryTimestamp. The database should have an index on the encodedUrl column for fast lookups.
4. Error Handling: If the database query returns no result, the service returns a 404 Not Found response, preventing a request to S3.
5. Checks if the current time is past the retrieved expiryTimestamp.
  - If the paste has expired, the service returns a 410 Gone HTTP status code.
  - It can also asynchronously trigger a cleanup process to delete the expired data from both the database and S3.
6. Retrieves the text content from AWS S3 using the s3Key.
7. Returns the raw text content to the API Gateway.
8. Efficiency: The S3 retrieval is likely the most expensive step, so the service should reuse its S3 connections and keep the data transfer efficient.
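
The fetch steps can be sketched in the same spirit. Plain dicts stand in for Redis, the metadata database, and S3; the function name and the `(status, text)` return shape are assumptions made for illustration.

```python
import time
from typing import Optional, Tuple


def fetch_paste(paste_hash: str, cache: dict, db_rows: dict, s3_objects: dict,
                now: Optional[float] = None) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, text) for a paste hash."""
    now = time.time() if now is None else now

    meta = cache.get(paste_hash)          # cache lookup first
    if meta is None:
        meta = db_rows.get(paste_hash)    # fall back to the metadata store
        if meta is None:
            return 404, None              # unknown hash: S3 is never touched
        cache[paste_hash] = meta          # populate the cache for later reads

    if now >= meta["expiryTimestamp"]:
        cache.pop(paste_hash, None)       # do not keep serving expired entries
        return 410, None                  # cleanup of DB/S3 could be queued here

    return 200, s3_objects[meta["s3Key"]]
```

Caching the expiry alongside the S3 key lets a cache hit still enforce expiration without a database round trip.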

4. AWS S3 (Object Storage)

AWS S3 is the primary storage component for the raw text content.

Key Responsibilities:

Object Storage: Securely and durably stores the text data as individual, immutable objects.
Unique Keys: Stores each object under a unique key supplied by the service, which serves as the pointer for retrieval.
Scalability & Durability: Handles massive volumes of data with high availability and reliability.

5. Database (Metadata Storage)

The Database is responsible for storing and providing fast lookups for the metadata associated with each paste.

Key Responsibilities:

Metadata Storage: Stores a lightweight record for each paste, containing the encodedUrl (hash), the s3Key, the userId, and an expiryTimestamp.
Fast Lookups: Allows the Pastebin Service to quickly find the s3Key by querying the encodedUrl.
Data Integrity: Ensures consistency between the URL hash and the S3 object key, for example through a uniqueness constraint on encodedUrl.

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

1. Choice of AWS S3 for Text Data Storage

The primary trade-off with this choice is performance versus cost and scalability.

Trade-off: Performance vs. Cost/Scalability
- Why S3 is beneficial: It is an incredibly cost-effective solution for storing large amounts of unstructured data. You pay per gigabyte stored and for data transfers, but the cost is significantly lower than a traditional relational database. S3 is also highly scalable and durable, meaning it can handle massive volumes of data without you needing to manage the underlying infrastructure.
- The Compromise: The trade-off is that S3 is not designed for low-latency, real-time access. Retrieval times can be slightly slower compared to fetching data directly from a database. This is a deliberate design choice, as the service is not intended for high-speed transactions where every millisecond counts. For a pastebin service, a small delay in retrieval is acceptable.

2. Choice of a Database for Metadata

- Reasoning: Storing the raw text content in the database alongside the metadata would quickly become inefficient. Large text fields can slow down database queries, increase storage costs, and make backups cumbersome. The database's role is not to store large blobs of data, but to perform quick lookups.
- The "Best of Both Worlds" Approach: By using a database only for metadata (like the encodedUrl and s3Key), the service leverages the database's strengths:
  - Fast Lookups: It can perform very fast lookups to find the S3 key based on the URL hash.
  - Efficient Indexing: Databases are optimized for indexing, which is crucial for the retrieval endpoint.

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?

In the future, we could improve this service by evaluating alternative object stores.
