System requirements
Functional:
User submits a longURL, a shortURL is generated, stored in the DB, and returned to the user
User hits the shortURL, retrieves the longURL from DB, caches the shortURL and longURL in a distributed cache, then redirects to the longURL
User is able to share the shortURL through various communication channels (will need to integrate with other communication channels) - in this case, we will support messenger, discord, and twitter
Non-Functional:
Response time => in the milliseconds, aim for <100ms
Minimize latency in response time
Minimize delay between reading and writing data to ensure real time availability => High availability
Handles multiple concurrent requests => high throughput
Consider caching
---
Fast response time
Caching may be needed
High throughput
High availability
--
Since we need to have high availability, the tradeoff based on the CAP theorem is that we need to decide on consistency vs availability. Since we cannot have both, we will focus on availability over consistency.
Based on this, we will most likely consider a NoSQL database such as Cassandra in the database design - we will consider the tradeoffs first
data consistency
error handling
fault tolerance
Capacity estimation
Estimate the scale of the system you are going to design...
Read queries per second:
100 reads/second (normal traffic) => ~ 9 million reads / day
1 url per user / day
10,000 DAU => 10,000 writes / day
-----
10,000 writes / day
9,00,000 reads / day
900 reads : 1 write ratio
we will simplify this to 10:1 read/write ratio
-------
during spikes, we will see 5x the traffic
50:5 read/write ratio
API design
The system has to convert a longURL to a shortURL
action: convert longURL and return shortURL
input: longURL
/v1/action/convert_to_short_url
params: {string: longURL}
The system has to take a shortURL and redirect to longURL
action: convert shortURL to longURL then redirect
input: shortURL
/v1/action/get_long_url
params: {string: shortURL}
response: return a 301 or 302 http code for browser
301 vs 302
301 will treat every redirect as a brand new redirect where as 302 will treat the redirect as a known one if it already existed and retrieve that info from a cache
301 is good if you are tracking analytics on that specific url and you want info such as how often it is used for it
Database design
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
- solution should scale well for both reads and writes
- we don't want to lose data (fault tolerance, data integrity) during network failures or hardware faults
- recovering data in case of an outage
- high availability
based on the data model, we can choose either a nosql or sql database. a sql database would make sense given the simplicity of the object and attributes stored. we also want high availability which would work for a sql database such as nosql because we can set up read replicas since there are more reads than writes. the bottleneck would be the writes but since there are less writes than reads, it is a tradeoff we are making. for fault tolerance, we can also have multiple data centers in different locations and we can set them up based on locale to ensure that users will be hitting datacenters closest to them.
we also need consistent data in which having replication from the write master machine to replicas would ensure that. whereas with a nosql database such as cassadra, we may encounter data that is inconsistent.
considering scaling, we will have to shard the mysql database when we want to add additional machines. this is additonal complexity that we will need to introduce by having a proxying service that will handle the logic for sharding. we will also need to consider an primary id generation in order to ensure that there are no collisions. we will also need to introduce a service discovery service such as zookeeper that will help us determine which machines are up and healthy.
by having multiple datacenters, we also handle the case of hardware failures or network outages because we will be able to reroute traffic to a backup data center if necessary
in terms of leader/followers for data consistency, we can use a master/slave model and if the model goes down, we can promote a slave to a master.
all of this can be done internally by mysql and will not require additional services.
the bottleneck overall is the replication. there may be replication delay due to the data replicating to all the nodes. however, because we need to achieve consistency more than availability (we need to read the correct data), this is important
High-level design
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...
data model:
longURL, shortURL, created_at
we want to store the longURL and the shortURL so that they can map to each other and we want a created_at for later down the line when we may want to expire certain urls after a certain time
pros: fast writes
cons: during traffic spikes, there will be a lot of writes/reads which can be costly (as a note, something we can discuss further is during traffic spikes, we can add additional worker machines to process writes/reads)
since we will be doing more reads than writes, we should also consider potentially adding a cache when doing reads because the urls may be accessed more often (the ratio is about 10 reads to 1 write) - we will also discuss more of this later on
since we expect writes to be fast (less than 100ms) and reads to also be fast (100ms), we will have to process on the fly via stream data processing rather than batch data processing
----
will store individual urls via stream data processing
store for several days (or chosen TTL) then purge
will also have a cache that will cache urls after they are redirected for faster access
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
generating short url from long url:
user -> browser -> request -> api gateway -> short url generator service -> database access service -> database
redirecting short url request to long url:
user -> browser -> request -> api gateway -> database access service -> database -> get longURL -> redirection on server side -> browser -> user
user -> url -> sharing service -> fan out based on service -> discord | twitter | messenger -> send to service
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
components i will focus on:
- processing the request to convert a longURL to shortURL and the response back to the user
- processing the request from shortURL to longURL and redirection
to process the request to convert the longURL to shortURL:
user -> browser -> request -> api gateway (will also handle the validations) -> long url consumer (establishes and maintains tcp connection, infinite loop, deserializes the data) -> hashing queues (which will take in jobs for each request to be hashed, we can scale up by haviing mulitiple partitions in order to allow jobs to be parallelized), hashing service (also can have multiple in order to parallelize and increase throughput) -> aggregator (which will take all of the jobs completed after a threshold and flush to the db) -> in memory store for the aggregator (saves the jobs in case they are not saved to try again if db is down) -> database writer (reads and writes to db as a service) -> database
in case of any errors, we will have a retry logic to try again - in order to not slam the database, we want to do some jittering or exponential backoff, we can also add a timeout limit if it deosn't respond, if it reaches the timeout limit, we can write to the cache from the aggregator to try again later so we dont block resources based on a time, if it was successful, we will then flush the cache. after writing to the database, we will return the shortURL to the browser and back to the user
to process the request from shortURL to longURL and redirection:
Trade offs/Tech choices
Explain any trade-offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?