Requirements


Functional Requirements:

1.System should take the long url and give back the shortened url.

2.System should return the redirection to longurl if shorturl is given.

3.System should support custom aliases and each short url generated should have an expiration date.

Optional:

1.System can also support analytics and tracking for shorturl clicked.

2.Can perform the malicious urls check and spamming of system can be detected.


Non-Functional Requirements:

1.The system should be 99% available.

2.Latency of redirection should be less.

3.System should support 1B urls generated.

4.Urls generated should be unique with less collisions.

Question:

What to do when multiple same long urls are sent to post request should we return the same short url or different short url should get generated.


API Design

POST

\url

{

"long_url",

"optional:custom_alias"

"optional:expiration_date"

}

-> short)(200 status created)

GET

/short_url

-> long_url (302 redirect)


High-Level Design

1.On a high level design, our flow will be like there will be client and there will be url generator service which we are black boxing for now and discuss detail in deep dives so the client sends the long url and the request is sent to the url generator service .The url generator service will generate a unique short code which will be stored in the db.Then the short url is returned back to the client.If it is a get request then the url is checked if present in db then return the long url with status code 302.We can keep the status code 301 but 301 stands for permanent redirect so it's not a good option as we will be having less control and browser will be caching if we select the permanent redirect which is not a good option as we may need to change the url after.So also we can have a sort of background job which will be deleting the urls which are expired in the db returning the 410 gone error when the user tries to redirect to the url which is expired.



Detailed Component Design

Now lets discuss the deep dives of the system so we have to ensure that the url generated by us is unique and also short.

So the ways we can achieve this is by :

1.Generating the url via prefix of long url:

This can generate urls with prefix of long url so our url generated will be short but the problem is many urls can have the same prefix so which will be confusing whether to redirect to which url.

2.Generating url via hash generator

So via this approach we can have a random number generator but the entropy is not high as we can still have collisions so we can have hash function which will give the short url.The entropy is good but the problem is if we have hash function like SHA 256 imagine a scenario where we have to generate different short url for same long url sent then it will be difficult at that time as the same short url will get generated in that scenario. So in that scenario either we can add some salt using HMAC post the generation of url via hash function but still the entropy is not high enough and the url generated may not be that short .

3.So we have another approach to generate the short url which is via base 62 algorithm.So we can have a unique id generator and this unique id will be sent to base 62 to generate the unique code which will be added to our domain name with the current scale that was given this approach was good enought to generate enough number of unique codes.


Now coming to the next non functional requirement the latency of read should be less.

So we can do this by making the short url as the primary key here so the reads will be so fast.This is one way of making the reads faster.

But even in that way the reads will be still slower so we should have a cache before the db to reduce the latency of reads as the read from memory is 1000 times faster than read from ssd and million times faster than HDD.So we can place a cache in between and so if there is any hot url then the cache should be warmed up so the keys never expire and do the cache stampede.

Coming to our next requirement: system should be highly available.

So we should scale our application as currently our redis is a SPOF so we can use either redis sentinel or redis cluster.The database can also be sharded

So any url of get will go to redis first then goes to db if there is cache miss.

We have seen the read latency now if we see the application the read to write latency is 100:1.So clearly this is a read heavy system.Now we can also scale this system by seperating the reads and writes.