Requirements


Functional Requirements:


  • Create a short URL for a given long URL.
  • Return the long URL associated with a given short URL.



Non-Functional Requirements:


  • The system should have less latency.
  • The system should scale to handle millions of URLs.


API Design

Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...


POST v1/shorten

{

url:

} -> shortened URL


GET v1/long/ -> long URL


Main estimation is that the number of GET API calls >> number of POST API calls. -> we can tradeoff some latency to POST to account for correctness over the GET endpoint.


We can have a delete endpoint, and an endpoint which encodes a URL for a specified period of time, but keeping it out of scope for now.


We can also have an endpoint for seeing site visit statistics but keeping it out of scope for now.


Statistics will be influenced by whether our redirect is temporary or permanent - we need to factor this and arrive at one after discussion with client.


High-Level Design

Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.




-- In a single geography -

Server - responsible for getting the long URL and encoding them.

DB -> a simple no-sql database containing information in two tables -

  1. Short URL -> {long_url, visit statistics, metadata}
  2. Long URL -> short URL

first collection will provide efficient lookup, while the second collection will lead to search before encoding a long URL. Alternatively, we can use DB specific nuances to reduce the memory footprint, ex - We can have just the first collection with a GSI on the long_url in DynamoDB.

Cache - serves popular URLs quickly. Invalidation occurs through decreasing popularity in a fixed key space. Can use LFU/LRU cache eviction policy.

Encoder - encodes long URLs into short ones. May use a hash based encoding with character truncation. To avoid collision, the short URL can be appended with a hexadecimal timestamp -> leading to unique short URL for every long URL. In case of multiple encoders being used, a unique machine identifier can be appended as well to ensure zero collisions.



-- multiple geographies -

Challenge becomes - ensure that no single long URL gets more than one short URL. Can be solved using multiple ways.

  1. Leader follower model - read from any geography. Writes to one. Will lead to slower short URL generation times.
  2. Asynchronous short URL generation - wherein each server in each geography generate the short URL on its own, but the response to the client is delayed pending eventual reconciliation.

Both ways lead to some latency on the POST endpoint. The DB will be eventually consistent and global replication will lead to a slight replication lag, but since we are not overwriting any entry, we will not get a wrong URL - either correct URL or no URL pending replication.


Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.