Requirements


Functional Requirements:


  • Create a short URL for a given long URL.
  • Return the long URL associated with a given short URL.



Non-Functional Requirements:


  • Low latency
  • High consistency
  • Scalable to 1 lakh req / s


API Design

GET /urls/?url=

Status code: 301 redirect

{

"redirected_url":

}


POST

/urls/

{

"url": "long format url",

"short_code": ,

"expiry":

}


return {

"url":

}



High-Level Design


APIs flow:


For POST request,

can have unique constraint on short_code ,

we can have bloom filter or can use snowflake mechanism to create short codes across distributed system.

Req will first land on Api gateway post which be handled automatically by LB.

Then application layer can go through bloom filter to check if the short code exists or not or can use snowflake creation mechanism so that we are fully sure that it will never collide across any distributed system. Then it will row in DB with short code if provided otherwise create using base62.




for GET, since there will be millions of data,we can have a redis layer to keep recently x hrs created data,

this would be highly accessible and fast ..

The req with short code can easily access through redis first and if not present can be seen in DB with the short code since it will be indexed.

It will give 301 as status code which will be locally cached in browser hence it will be accessible easily



Availability point of view:

Also this shoudl be highly consistent instead of available to avoid conflict of short codes

For availability perspective, load balancer or service deployed on ec2 instances are horizontally scalable, hence can be scaled.



URL expiration:

For expiry of urls, we can have postgresql extension installed to mark field as inactive when its ttl is crossed

pg_ttl_index can be used to expire short codes

So this is handled via background worker where it will scan and delete rows wherever expiration date exceeds

On update the cache can be dropped for that short code in redis as well.



Creation and redis flow -

On creation recent active links can recide on redis so that they are easily and fastly accessible.

And on deletion ,the will be accorindgly removed as well

The strategy to store most recently created can be on created_ts with their TTL as 24hr. and then can be removed from redis layer...

Mostly the links are created and majority like in 70% cases it will be hit within 24 hrs and then can be served from local chrome cache layer since we are responding with 301 status code..


For other remaining calls, it can go with directly hitting the DB.

Also since we have rate limiters so those can be managed.

we can use token bucket algorithm for rate limiting since these links can be fetched from multiple different IPs



// Cache misses

Cache miss is fine and can take hit on DB only if bloom filter says that the URL exists.. because bloom filter might give a few false positives which should be fine.

Also DB can be partitioned to geolocated spaces as per some IP hashing technique.




// High traffic handling or viral links handling

So we can introduce CDN as well to handle location based cache handling as well

So links creation would be mostly geographically close to each other .

Only miss will be when links are tried to be fetched on some different geo-located space

So each geo location say G1 will have x links generated and all people from G1 might be using x-y links which can be easily served from here.

For y links, first bloom filter will check if the short code exists or not and if yes will provide the long code


// Identifying wrong URL

Since we have introduced CDN we can first take redis hit to check if it exists and then bloom filter can be used to create/get accordingly..



// POST concurrent calls

  • The logic for creating links can be kept entirely on client level maybe in Python written server code
  • Also with this snowflake uses no lock sequence reservation which will make sure that no 2 ids collidate across 2 data nodes


Key Points summary-

URL can be shotened using base 62 encoding

We can have bloom filter to check if the url exists or not for the given short code while creation

We can use simple PostgresDB to store all the data with their expiry time as well


How bloom filter works for both GET and POST and how its useful for scaling and handling viral/wrong links

  • Bloom filter not only creates and check in POST req , it will also check in GET where to give error that code does not exist or hit the DB and get the long code for it
  • With each cache miss, bloom filter will check and only hit if it say True for existence otherwise throws error for the same


Strong consistency can be handled via snowflake package.

We have snowpark where unique short codes can be handled across multiple distributed systems

High availability can be handled via redis layer.



Detailed Component Design



Availaibility perspective:

Since we have redis layer on top of it , it can scale to 1m req / s

Also cannot induce CDN since we want strong consistency



Tradeoffs:

No tradeoffs since we have extension in place to handle expiration.

Snowflake package to handle unique short code generation


Concurrency handling:

Concurrent calls will not usually collide since in packages like snowflake it used multiple parameters to create a hash Id



Thundering herd problem:

We might have issue of redis miss cache for multiple requests, hence we would be introducing rate limiting as well so that LB does not allow request above certain limit to exceed