Design Pastebin - System Design

Requirements

Functional Requirements:

Allow users to upload and store text or code snippets.
Generate a unique shareable URL for each paste.
Enable retrieval of paste content by URL.
Support expiration and TTL for pastes.
Allow paste owners or the system to delete a paste before its natural expiration.

Non-Functional Requirements:

DAU of about 100 million users
read/write of about 100:1 - read heavy system
average user accesses links between 5-20 times each day, maybe creates 1 sharable link

so:

we want to be able to scale the system to support lots of users
we want to support high availability - system should remain accessible
low latency to access the data - <200ms, users wants to access data as fast as possible
system has to be durable to avoid data loss - users want to be able to access data even in cases of failure
security - URLs should not be guessable

API Design

POST /api/v1/pastes -> 201 Created + unique id for the URL

{

title: ?

syntax-type:

content:

timestamp-ttl: ? (could be minutes/hours/days up to 30 days, null for permanent)

}

GET /api/v1/pastes/{id} -> Paste

@auth - only owners of the paste are allowed to delete

DELETE /api/v1/pastes/{id} -> 2xx - might be 200 OK, 204 no content or 202 if not yet processed

to prevent abuse of the system we'll use rate limiting - 5 links per minute for a non-member, and 20 for members

High-Level Design

Core Entities:

Users - id(PK), name, email, createdAt
PastesMetadata- id(PK, unique), createdBy(FK), title, type, timestamp-ttl(index), signedUrl, status
Pastes - id, content

we'll use a CDN to serve static content as close as possible to the user. it will also serve as a proxy to access data from our main database. it will work using a LRU strategy and with a TTL value of 1 minute.

we need an API gateway to handle TLS termination, authentication and authorization, rate limiting routing, and load balancing.

for writing pastes -

we'll use an initial pastes write service that will generate a unique id and create an event and insert it to kafka with all the data. the event will be consumed by an ingestion service in the background which in turn insert the metadata to an relational DB and the data to s3. the insertion will first be to S3, get the signed link and storing it along with the metadata.

to create a unique id we'll use base62 encoding with 6 chars which gives us roughly 56 billion addresses.

for deleting pastes -

the reader service creates an event in kafka, which will be consumed by the cleanup service. the cleanup service will update the status of the metadata to INVALID so that CRON job will also delete it and invalidate the entry in the cache. that way when a user tries to read the data the reader service sees the status as invalid and returns an appropriate response to the user.

for reading pastes -

we'll use a pastes reader service to be able to fetch pastes. it will first try to fetch from a Redis cache, if the data isn't there it will fetch the data from the relational DB and store it in the cache and return a response to the user, meaning the signed url. to user could then access the data directly via the signed URL using the CDN.

for caching -

we'll use a Redis cache to reduce latency and make the reads faster. the cache will store the pastes metadata(which include the signed url). each metadata entry is roughly 100 bytes. we have a 100:1 read:write ratio so roughly 1 million metadata pastes to keep in the cache, which is roughly 0.1Gb which could fit inside a single Redis instance.

we could keep the same information at the CDN's cache but with a TTL of 1 minute to be able to synchronize with any updates.

for cleanup -

we'll use a cleanup service. the cleanup service has a CRON job so every 5 minutes it will delete records from the DB with a timestamp-ttl that is less or equal to its own OR with a status of INVALID.

for scale -

each of the services can scale up or down based on workload and traffic, so we could use multiple instances of each service and spread the load using dedicated load balancers in front of each service.

we will use a relational database such as PostgreSQL for storing the metadata because of the ACID properties of transactions. enabling use to SELECT FOR UPDATE to handle the metadata for reads or deletion before they can be read. for the data itself we'll use a S3 storage to separate the metadata from the data.

Detailed Component Design

for unique id generation we'll use base62 encoding with 7 letters - that about 56 billion possible ids. the problem is what happens with multiple instances of the writer service. for that we can use the Redis cache to also store a global atomic counter. each instance would get a set amount of ids equal to a 1000 and update the counter when even the writer service reaches the 1000 and get 1000 more. in case of writer service failure we have redundancy of services and other writer service will share the load, but in general we would monitor the service and scale up or down when needed(usage above 80% and 20% accordingly). for scale we could add more letters by allowing the global counter to go even higher, and support up to 10 digits which is more than enough and still sits with our rough estimates.

the hot path of our system is the read path and it has to be as fast as possible. using a CDN near to the client to cache metadata and content of pastes with low TTL contributes to low latency and handles roughly 90% of the traffic since it stores the signed links. we will also use multiple reader services, just like for the writer service, to spread the load and provide high availability. again failure in one reader service spreads the load over the rest of the reader services. the reader service also utilizes the Redis cache that can hold all the metadata in it, again reducing latency to about 5-10 ms on a cache hit. in case of a cache miss we fetch the metadata from the postgres DB, fill the cache and return a response to the user.

again the key is an atomically incremented counter inside the Redis cache(which will also be backed up to the metadata DB). each access to the global counter ensures single access and increment for the accessing service. that way a range a specific writer service fetches is unique and never repeated again.

if a service crashes we loose those 1000 ids for good, but the address space is large enough for us to continue as usual. the keys are then ingested by the writer service to create a unique set of generated ids that are used, base62(num in received range, salt) which will guarantee uniqueness and prevent collisions.

the key could be made hard to guess using some random salt that would be saved along side the global counter, or alternatively use an obfuscation method.

the id of a paste is idempotent. upon request we would also receive a hash of the content(sha256) to store for a short time(10 minutes), write the hash to both the cache and the database the id has a constraint to be unique. if another request comes for the same request, meaning the same hash(timestamp is probably different, but content is not), we would check if the data already exists in both metadataDB and S3 by comparing checking if the metadata contains the same hash value. if so the status of the metadata with status PENDING means the user already received the presignedURL and uploads the paste so we would drop the request. otherwise we would continue to return a response to the user to continue with saving the data.

if multiple requests come in at the same time, the load is spread across multiple writer services, each with its own range of ids from the global counter(writer1 - 1000-1999, write2 - 2000-2999 and so on). each request atomically increments the local counter in each service, ensuring unique ids per request.

for the write path we are ok with eventual consistency, but we still want low latency. we receive a request from the user generate a unique,base62 encoded id, fetch a presigned URL from S3 and write the metadata to the metadata DB, then return a response to the user. we avoid handling large files by letting the user upload the data by themselves, reducing load from the system, the metadata written is roughly 100b, with 100 million users a day, and even distribution throughout the day thats 100M*100b/24h which is roughly 1Kb of data per second. both the writer service and the metadata database can handle both easily.

S3 is a cheap NoSQL database, that can hold all the data we need. the average document will be around 0.5Mb or even less, which fits easily in S3. the metadata for a single record is roughly 100b, and roughly 1M users create pastes each day - that is 0.1Gb per day and 36.5Gb a year which fits nicely in a single postgres instance even without deleting irrelevant pastes metadata, and we could store data for years ahead.

we do however want to shard the postgres and S3 by the id to avoid congestion of request to a single data base instance, another suitable variation is to keep the region of the created user and use that as the partition key because most probably that the paste is shared with people from the same region, which allows better latency and scalability.

the Redis cache can store all data for the same day in a single instance - 100b * 1 million metadata records is about 0.1GB. we could store multiple days in a single instance without any problem. since pastes can't be changed after creation we can use a mapping of id to metadata with a TTL value of 12 hours for the heavy reads, and with an LRU eviction policy. also using a redis cluster to handle redundancy in case of failure. the CDN will also use LRU eviction policy with a shorter TLL value of 10 minutes.

to avoid a cache stampede we would batch together requests for the same resource(metadata and data) and use request coalescing to send a single request to fetch the data and/or metadata. that way we avoid overwhelming the cache and the database. in case of a cache failure and unavailability response latency will go up to around 50ms from the metadata DB but will still be functional. we could rebuild the cache by prefetching popular pastes and data about pastes from the last 6 hours using another worker for that.

for rate limiting in the API gateway we'll use token backet strategy, limiting non-member users(based on IP) to 5 pastes per minute and member users to 20 pastes per minute.

hot keys are spread out through multiple shards ensuring balancing load of incoming traffic, they are also cached in the Redis cache and the CDN.