Requirements


Functional Requirements:

  • Create a short URL for a given long URL.
  • Redirect to long URL associated with a given short URL.
  • Optional: Custom Aliases, Expiration time



Non-Functional Requirements:


  • Redirect latency should be low
  • very high read throughput
  • highly available
  • highly durable(no data loss)
  • Strong consistency for URL mapping.
  • Eventual consistency for analytics.


Core Desgin Decisions:

  1. ID Generation Strategy: Auto increment, Random, Base62 encoding of chosen uniqueId
  2. Read Optimization: Cache(Redis)
  3. Storage: key-value store
  4. Redirection speed: CDN + cache


Back of the Envelope Estimation:

  1. Assume 10M new URLs per day
  2. Read: write = 100:1
  3. QPS: write: 10M/(24*3600) = 115 writes/sec. Read = 115*100 = 11,500 reads/sec
  4. Storage: Assume avg URL length = 500 bytes. metadata = 100 bytes, total = 600bytes. 10M/day -> 6GB/day for 1year = 2.2TB
  5. 20% of those urls in cache = 400GB


API Design

Create Short URL: post/shorten

req = json

{

"longurl' : "......",

"customAlias":"...."(optional),

"expiry":"...."

}

Redirect URL: Get/{shortcode}

Analytics URL: Get/stats/{shortcode}



High-Level Design

URL Shorten Service - generates short code

Redirect Service - Handles lookups and redirects

Analytics Service - async processing


DATABASE choise: key-value(DynamoDB), if we choose SQL, scaling writes is hard, sharding complexity.

Schema: table: URLMapping

shortcode (pk)

longurl

createdAt

expiryAt

userId(optional)


Data flow:

URL Creation flow:

  1. User -> API(Url shorten service)
  2. Generate unique ID
  3. Convert to Base62 ->shortcode
  4. Store in DB after storing in cache
  5. Return the short URL


Redirect flow:

  1. User hits short URL
  2. checks cache
    1. Hit -> redirect
    2. Miss -> DB Lookup -> Cache -> redirect


Detailed Component Design

Short URL Generation:

Approach1: Random string

. Generate a random 6-8 chars

. problem: collisions -> retry loops

Approach2: Hashing(MD5)

. Hash long URL

. Take first N chars

. Problem: collision risk

Approach3: Counter + Base62

. Generate a unique ID eveytime

. Convert to Base62:

. Why is it best: No Collisions, Deterministic, Compact

. Problem: Distributed ID generation

. use pre allocated ID ranges per server.



Scalability & Reliability:

Bottlenecks

  1. DB overload: Use Read replicas and sharding
  2. Cache misses: use correct eviction policy(LRU)
  3. Hot URL: use CDN
  4. ID generation bottleneck: Use Distributed ID generator
  5. Traffic spikes: use Auto scaling, and Rate limiting
  6. Request failures: use exponential backoffs