System requirements


Functional:

User submits a longURL, a shortURL is generated, stored in the DB, and returned to the user


User hits the shortURL, retrieves the longURL from DB, caches the shortURL and longURL in a distributed cache, then redirects to the longURL


User is able to share the shortURL through various communication channels (will need to integrate with other communication channels) - in this case, we will support messenger, discord, and twitter



Non-Functional:

Response time => in the milliseconds, aim for <100ms

Minimize latency in response time

Minimize delay between reading and writing data to ensure real time availability => High availability

Handles multiple concurrent requests => high throughput

Consider caching



---

Fast response time

Caching may be needed

High throughput

High availability


--


Since we need to have high availability, the tradeoff based on the CAP theorem is that we need to decide on consistency vs availability. Since we cannot have both, we will focus on availability over consistency.


Based on this, we will most likely consider a NoSQL database such as Cassandra in the database design - we will consider the tradeoffs first


data consistency

error handling

fault tolerance


Capacity estimation

Estimate the scale of the system you are going to design...


Read queries per second:

100 reads/second (normal traffic) => ~ 9 million reads / day

1 url per user / day

10,000 DAU => 10,000 writes / day


-----

10,000 writes / day

9,00,000 reads / day


900 reads : 1 write ratio


we will simplify this to 10:1 read/write ratio

-------


during spikes, we will see 5x the traffic


50:5 read/write ratio


API design


The system has to convert a longURL to a shortURL


action: convert longURL and return shortURL

input: longURL


/v1/action/convert_to_short_url

params: {string: longURL}


The system has to take a shortURL and redirect to longURL


action: convert shortURL to longURL then redirect

input: shortURL


/v1/action/get_long_url

params: {string: shortURL}

response: return a 301 or 302 http code for browser


301 vs 302

301 will treat every redirect as a brand new redirect where as 302 will treat the redirect as a known one if it already existed and retrieve that info from a cache


301 is good if you are tracking analytics on that specific url and you want info such as how often it is used for it



Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...


  • solution should scale well for both reads and writes
  • we don't want to lose data (fault tolerance, data integrity) during network failures or hardware faults
  • recovering data in case of an outage
  • high availability


based on the data model, we can choose either a nosql or sql database. a sql database would make sense given the simplicity of the object and attributes stored. we also want high availability which would work for a sql database such as nosql because we can set up read replicas since there are more reads than writes. the bottleneck would be the writes but since there are less writes than reads, it is a tradeoff we are making. for fault tolerance, we can also have multiple data centers in different locations and we can set them up based on locale to ensure that users will be hitting datacenters closest to them.


we also need consistent data in which having replication from the write master machine to replicas would ensure that. whereas with a nosql database such as cassadra, we may encounter data that is inconsistent.


considering scaling, we will have to shard the mysql database when we want to add additional machines. this is additonal complexity that we will need to introduce by having a proxying service that will handle the logic for sharding. we will also need to consider an primary id generation in order to ensure that there are no collisions. we will also need to introduce a service discovery service such as zookeeper that will help us determine which machines are up and healthy.


by having multiple datacenters, we also handle the case of hardware failures or network outages because we will be able to reroute traffic to a backup data center if necessary


in terms of leader/followers for data consistency, we can use a master/slave model and if the model goes down, we can promote a slave to a master.


all of this can be done internally by mysql and will not require additional services.


the bottleneck overall is the replication. there may be replication delay due to the data replicating to all the nodes. however, because we need to achieve consistency more than availability (we need to read the correct data), this is important


High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...


data model:


longURL, shortURL, created_at


we want to store the longURL and the shortURL so that they can map to each other and we want a created_at for later down the line when we may want to expire certain urls after a certain time


pros: fast writes

cons: during traffic spikes, there will be a lot of writes/reads which can be costly (as a note, something we can discuss further is during traffic spikes, we can add additional worker machines to process writes/reads)


since we will be doing more reads than writes, we should also consider potentially adding a cache when doing reads because the urls may be accessed more often (the ratio is about 10 reads to 1 write) - we will also discuss more of this later on


since we expect writes to be fast (less than 100ms) and reads to also be fast (100ms), we will have to process on the fly via stream data processing rather than batch data processing


----

will store individual urls via stream data processing

store for several days (or chosen TTL) then purge

will also have a cache that will cache urls after they are redirected for faster access



Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...


generating short url from long url:

user -> browser -> request -> api gateway -> short url generator service -> database access service -> database


redirecting short url request to long url:


user -> browser -> request -> api gateway -> database access service -> database -> get longURL -> redirection on server side -> browser -> user


user -> url -> sharing service -> fan out based on service -> discord | twitter | messenger -> send to service



Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...


components i will focus on:


  1. processing the request to convert a longURL to shortURL and the response back to the user
  2. processing the request from shortURL to longURL and redirection



to process the request to convert the longURL to shortURL:


user -> browser -> request -> api gateway (will also handle the validations) -> long url consumer (establishes and maintains tcp connection, infinite loop, deserializes the data) -> hashing queues (which will take in jobs for each request to be hashed, we can scale up by haviing mulitiple partitions in order to allow jobs to be parallelized), hashing service (also can have multiple in order to parallelize and increase throughput) -> aggregator (which will take all of the jobs completed after a threshold and flush to the db) -> in memory store for the aggregator (saves the jobs in case they are not saved to try again if db is down) -> database writer (reads and writes to db as a service) -> database


in case of any errors, we will have a retry logic to try again - in order to not slam the database, we want to do some jittering or exponential backoff, we can also add a timeout limit if it deosn't respond, if it reaches the timeout limit, we can write to the cache from the aggregator to try again later so we dont block resources based on a time, if it was successful, we will then flush the cache. after writing to the database, we will return the shortURL to the browser and back to the user


to process the request from shortURL to longURL and redirection:





Trade offs/Tech choices

Explain any trade-offs you have made and why you made certain tech choices...




Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?