System requirements


Functional:

List functional requirements for the system (Ask interviewer if stuck)...

Create shortened url

Update existing urls

Delete urls

Fetch original url given the shorted url




Non-Functional:

List non-functional requirements for the system...

Highly available, multiple servers in case on goes down

TTL for existing urls, 1 year TTL

Cache for LRU urls

Store all URLs in DB



Capacity estimation

Estimate the scale of the system you are going to design...

10000 new urls stored every day

100 MB for each url entry

100 MB * 10,000 entries * 365 days = 365,000,000 MB or 365 GB per year, 3,650,000 entries per year



API design

Define what APIs are expected from the system...

exposed endpoints:

/create{url} -> returns the tiny url

/delete{url} -> returns 200 on success, 400 on DNE

/update{url} -> returns 201 on success, 400 on DNE

/get{url} -> return 200 with tiny url on success, 400 on DNE


internals:

generateUrl -> generates the unique tiny url

UpdateUrl -> updates the DB with the new url

deleteUrl -> removes the url from the DB

addToCache -> adds url to the cache

removeFromCache -> remove url from cache

purgeDB -> removes urls from the DB with TTL 0

updateTTL -> updates the TTLs for urls in the DB


The main algorithm will be the generateURL which needs to generate a unique short url given a full length url. To do this I think utilizing a hash to store the shortened url will help with retrieval from the DB since I anticipate the service to be very read heavy. Hashing will provide a unique and easy to store identification system this way we don't have multiple urls with the same tiny url.


Another consideration is collisions in the DB due to users trying to retrieve and update the same url at the same time. In this case I think it would make sense to read from the replicated DBs and write/update the master DBs. This way we aren't reading from the same DB we perform writes. This will require a replication service to apply the updates to the replicated data bases in order to keep all DBs up to date.





Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...


id -> unique id generated in our system

url -> original url given by client

tiny_url -> tiny url generated by our system

ttl -> ttl for url


replication across the multiple DBs, partition the DBs so that partition 0 is handled by one DB and the next partition is handled by another DB




High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...


Client will make a request to the service. A load balancer will direct the request to a server with capacity to handle additional load. The server will attempt to perform the given request and interface with the DB depending on the request type. Creates and Updates will interface with the master DB and reads will interface with the replicated DB.


The LB will also filter out any invalid requests by returning a default error when an invalid request is encountered i.e. request is not apart of the API. Valid requests will be forwarded to available servers. For reads we could first check the cache for the url we seek. If found the url is returned and if not then request is forwarded to a server to perform a fetch on the DB.


We want to limit the number of requests for each client. This way one client cannot hog the service and prevent additional clients from accessing. To do this we will want to implement rate limiting. This threshold should be configurable so that the rate can be scaled as the system scales.




Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

User creates a new url. The request will hit the LB which will determine if the client is under the rate limit then forward the request to an available server. Since this is a create request the generate Url will be called to create a hash and tiny url. On success we try to store the new url in the DB and prepare a response which will include the created url. The write to the DB will kick off a replication service to update all DBs in the cluster.

User fetches url. In this case the request will have the LB check the cache to see if the url exists. If it does then return the url. If not then perform a DB read by calling the get request and prepare a response.




Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Database. Since we are replicating the DB by writes and reads we need to make sure any changes to one DB are reflected in the other DBs. To do this we need a replication service. Writes to the write DB need to be reflected in the read DBs. This happens by keeping track of the write operations and performing them on all DBs in the cluster.


The LRU cache will need to be updated so that only the LRU data is in the cache and stale data is removed from the cache or updated when the DB is updated. So when changes are applied to the write DB those changes also need to be reflected in the cache so that users get the most recent data.



Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

I'm choosing to use a noSql solution like scylla or cassandra because the DBs are very performant and provide default behavior for TTLs which will be useful to remove stale data. This DB solution also utilizes partitions automatically which will help with DB collisions and is particularly useful for read heavy operations which is a good fit for this service. NoSQL solutions also scale horizontally well so as the amount of urls is added we also still have performant DB.




Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.


One service goes down. When this happens load will be dispersed among the available servers which will create additional stress on the system. The best case here is to try and spin up additional service to replace the downed server.


I Identified rate limiting as a solution to prevent single clients from hogging the url shortener resources so that the clients have a limit to the number of times they call the service over a given time frame.




Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?


I believe metrics are valuable insight in any system to help monitor system performance as well as identify potential issues and solve bugs. For this service I would recommend a metric product like prometheus or graphite to handle logging items like requests to the system and any failures that occur like missing data or a service not operating as intended.