My Solution for Design Pastebin with Score: 9/10

by grove383

System requirements


Functional:

  1. A user should be able to create a paste by specifying its content and, optionally, an expiration time, and receive a uniqueId in return. Pastes with no expiration time should be kept indefinitely.
  2. A user should be able to retrieve a paste by its uniqueId, provided the paste has not passed its expiration time.


Non-Functional:

  1. Strong consistency - Our system needs read-after-write consistency: once a paste is created and a uniqueId is returned, we must always be able to retrieve the same content with that uniqueId.
  2. Low latency - Writes should complete within tens to hundreds of milliseconds; reads should complete within a few to tens of milliseconds.
  3. Scalability - Writes should support an average of 12TPS and a peak of up to 120TPS. Reads should support 120TPS on average and up to 1200TPS at peak.
  4. Durability - Pastes without an expiration should be stored indefinitely.



Capacity estimation

  • Write throughput - Writes should support an average of 12TPS and a peak of up to 120TPS
  • Read throughput - Reads should support 120TPS on average and up to 1200TPS at peak
  • Storage of pastes - with an average paste size of 10 KB and an average of 12 writes per second, we store 10 KB * 12 = 120 KB per second. 120 KB/sec * 86,400 sec/day = 10,368,000 KB, or ~10.4 GB per day, which translates to ~3.8 TB per year.
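The storage estimate above can be reproduced with a quick back-of-the-envelope calculation (the constants are just the numbers from this section):

```python
# Back-of-the-envelope storage estimate from the numbers above.
AVG_PASTE_KB = 10      # average paste size in KB
AVG_WRITE_TPS = 12     # average write throughput

kb_per_second = AVG_PASTE_KB * AVG_WRITE_TPS   # 120 KB/s
kb_per_day = kb_per_second * 86_400            # 10,368,000 KB
gb_per_day = kb_per_day / 1_000_000            # ~10.4 GB/day
tb_per_year = gb_per_day * 365 / 1_000         # ~3.8 TB/year

print(f"{gb_per_day:.1f} GB/day, {tb_per_year:.2f} TB/year")
```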




API design

We will use HTTP/REST for our APIs, as follows:

  • Upload paste

Request

POST /api/v1/paste

Body:

{
  "content": "lorem ipsum...",
  "expires_in": "24hr"
}


Response

{
  "id": "xyz123"
}

  • Get paste

Request

GET /api/v1/paste/{id}


Response

{
  "content": "lorem ipsum..."
}
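A minimal in-memory sketch of these two endpoints; the function names, short id format, and expiry handling are illustrative assumptions, not part of the API contract:

```python
# In-memory model of the two API endpoints above.
# `create_paste`/`get_paste` and the id format are illustrative assumptions.
import secrets
import time
from typing import Optional

_PASTES = {}  # paste_id -> (content, expires_at epoch seconds or None)

def create_paste(content: str, expires_in_seconds: Optional[int] = None) -> dict:
    """POST /api/v1/paste - store content, return a uniqueId."""
    paste_id = secrets.token_urlsafe(6)
    expires_at = time.time() + expires_in_seconds if expires_in_seconds else None
    _PASTES[paste_id] = (content, expires_at)
    return {"id": paste_id}

def get_paste(paste_id: str) -> Optional[dict]:
    """GET /api/v1/paste/{id} - None for missing or expired pastes."""
    entry = _PASTES.get(paste_id)
    if entry is None:
        return None
    content, expires_at = entry
    if expires_at is not None and time.time() >= expires_at:
        return None
    return {"content": content}
```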





Database design

We will use two different stores for this system.

  1. Metadata DB - this will be a DynamoDB table that stores information about each paste: id (string), created_at (number), expires_in (string), and content, which is a pointer to the S3 object where the paste content is stored. The content must be stored separately because DynamoDB supports a maximum of 400KB per item, while our system supports pastes of up to 1MB. We will use strongly consistent reads so that a paste is always retrievable immediately after a successful upload, satisfying our read-after-write requirement. This is a trade off: strongly consistent reads are served from the leader replica, which adds read latency and consumes more read capacity than eventually consistent reads.
  2. Paste bucket - S3 bucket for storing our paste content. Using this as it fits our storage requirements for pastes.
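An illustrative shape for one Metadata DB item, following the attributes listed above; the helper name and S3 key format are assumptions:

```python
# Builds the Metadata DB item described above. The `s3://paste-bucket/...`
# key format and this helper's name are illustrative assumptions.
import time
from typing import Optional

def build_metadata_item(paste_id: str, expires_in: Optional[str] = None) -> dict:
    return {
        "id": paste_id,                   # partition key (string)
        "created_at": int(time.time()),   # epoch seconds (number)
        "expires_in": expires_in,         # e.g. "24hr", or None for no expiry
        "content": f"s3://paste-bucket/{paste_id}",  # pointer to the S3 object
    }
```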



Services we will need:

  1. Upload service - responsible for emitting paste upload events and writing to our Metadata table and S3 bucket. Sits behind a load balancer and makes use of autoscaling to automatically spin up new instances based on incoming traffic load.
  2. Unique id generation service - responsible for generating unique ids for each paste. The pastes should never have collisions.
  3. Expiration service - service responsible for deleting pastes that have expired from both the Metadata DB and our paste bucket.
  4. Retrieval service - service responsible for reading pastes.



Request flows

  1. Write flow - A client uploads a paste to our upload service load balancer, which distributes the request to an upload service instance using round robin. The upload service calls our unique id service to get a unique id for the paste, uploads the paste content to our Paste S3 bucket (which distributes the paste across our Paste CDN), writes the metadata associated with the paste to our Metadata DB, and finally pushes a message to our expiration queue. Once this is done, it responds to the client with the uniqueId for the uploaded paste.
  2. Expiration flow - The expiration queue is used for deletion of pastes with an expiration specified. The expiration service reads expiration messages from the expiration queue and deletes pastes for the associated message if they have hit the expiration date/time.
  3. Read flow - A client requests a paste using its unique id, and the retrieval service's load balancer distributes the request to a retrieval service instance using round robin. The retrieval service first checks our in-memory LRU paste cache built with Redis; on a miss, it retrieves the paste from the Paste CDN instance geographically closest to the user.
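The write flow above can be sketched as follows; the helper names (`generate_unique_id`, `put_paste_object`, etc.) are stand-ins for the real service and storage calls, passed in here so the ordering of the steps is explicit:

```python
# Sketch of the write flow: id generation, then S3, then Metadata DB,
# then the expiration queue. The injected callables are assumptions
# standing in for the real service/storage clients.
def handle_upload(content, expires_in,
                  generate_unique_id, put_paste_object,
                  put_metadata, enqueue_expiration):
    paste_id = generate_unique_id()              # 1. get a unique id
    put_paste_object(paste_id, content)          # 2. write content to S3
    put_metadata(paste_id, expires_in)           # 3. write the metadata row
    if expires_in is not None:
        enqueue_expiration(paste_id, expires_in) # 4. schedule deletion
    return {"id": paste_id}
```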




Detailed component design

  1. Unique Id service instance - this service generates candidate ids using UUID generation, then queries the Metadata DB to check whether the id already exists; if it does, it generates a new id until an unused one is found. This guarantees we don't have unique id collisions between pastes.
  2. Expiration service/queue - The queue guarantees message delivery, since clients expect pastes with a specified expiration to be deleted. When the expiration service reads a ready-to-expire paste message off the queue, the message is marked as pending (in flight) while the service does its work, ensuring we don't remove a message from the queue until the paste is confirmed deleted. Once the expiration service successfully deletes the metadata item and paste object, it removes the message from the queue.
  3. Multi region/availability zone setup - The application will be deployed across multiple availability zones so that data is replicated and our service keeps running even if an availability zone goes down.
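The collision-avoidance loop described in (1) can be sketched as follows; `id_exists` stands in for the Metadata DB lookup, and the 8-character id length is an assumption:

```python
# Generate-and-check loop from the Unique Id service description.
# `id_exists` is an assumed stand-in for the Metadata DB existence query.
import uuid

def generate_paste_id(id_exists) -> str:
    """Generate ids until one is not already present in the Metadata DB."""
    while True:
        candidate = uuid.uuid4().hex[:8]  # shortened id; length is an assumption
        if not id_exists(candidate):
            return candidate
```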




Trade offs/Tech choices

  1. Strong vs. eventual consistency in DynamoDB. This trade off ensures a paste can always be read immediately after its write is acknowledged, which was a design requirement. It comes at a cost: strongly consistent reads must be served by the leader replica, which increases read latency and consumes more read capacity than eventually consistent reads.
  2. The unique id generation service could, in principle, produce collisions with the UUID algorithm. To mitigate this, the service queries our DB to check whether the id already exists, which adds time to each write. This trade off guarantees that paste ids never collide.




Failure scenarios/bottlenecks

  1. A service instance goes down - all of our services are distributed across multiple nodes, so traffic will be directed to the remaining available nodes.
  2. An availability zone goes down - traffic will be redirected to the nearest healthy availability zone, which could be farther away and increase our latency.




Future improvements

  • Analytics service - for tracking which pastes are retrieved the most, who uses the pastes, etc.