System requirements
Functional:
- URLs need to be permanently valid
- shortened URLs should be limited to 7 characters
- if we do need to remove URLs, the oldest, unused URLs should be removed first
- anyone should be allowed to create a shortened URL
- will want to use a hashing algorithm plus a collision check so that every shortened URL is unique (a hash truncated to 7 characters cannot guarantee uniqueness on its own)
- may want analytics for which URLs are being accessed the most
Non-Functional:
- service should respond with very low latency
- our URL storage needs to allow quick validation that a shortened URL is unique
Capacity estimation
- 50,000 users a day
- 227 bytes per user
- 50,000 * 227 = 11,350,000 bytes a day -> ~11 MB of new data a day
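The arithmetic behind the estimate can be checked with a few lines of Python. The yearly figure is an extrapolation that assumes the daily volume stays flat, which the estimate above does not state.

```python
# Inputs taken from the capacity estimate above.
USERS_PER_DAY = 50_000
BYTES_PER_USER = 227

bytes_per_day = USERS_PER_DAY * BYTES_PER_USER
# Assumption: the daily volume is constant for a whole year.
bytes_per_year = bytes_per_day * 365

print(f"{bytes_per_day:,} bytes/day = {bytes_per_day / 1e6:.2f} MB/day")
print(f"{bytes_per_year / 1e9:.2f} GB/year")
# 11,350,000 bytes/day = 11.35 MB/day
# 4.14 GB/year
```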
API design
- POST /shortenURL/{originalURL} - takes in the original URL, shortens it, validates that the shortened URL does not already exist in our system, and returns the shortened URL
- GET /retrieveURL/{shortenedURL} - takes in a shortened URL and returns the original URL
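A minimal sketch of the logic behind the two endpoints, assuming a SHA-256 hash encoded as 7 base62 characters and a plain dict standing in for the database. The function names, the base62 alphabet, and the salt-and-retry collision handling are illustrative assumptions, not something the design above fixes.

```python
import hashlib
import string
from typing import Dict, Optional

ALPHABET = string.digits + string.ascii_letters  # 62 symbols (assumed encoding)
CODE_LENGTH = 7


def encode_base62(number: int, length: int = CODE_LENGTH) -> str:
    """Encode the low-order base62 digits of `number` as a fixed-length string."""
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def shorten(original_url: str, store: Dict[str, str]) -> str:
    """Return a 7-character code for the URL, retrying with a salt on collision."""
    salt = 0
    while True:
        digest = hashlib.sha256(f"{original_url}#{salt}".encode()).digest()
        code = encode_base62(int.from_bytes(digest[:8], "big"))
        existing = store.get(code)
        if existing is None:
            store[code] = original_url  # code is free: claim it
            return code
        if existing == original_url:
            return code  # same URL was shortened before
        salt += 1  # code belongs to another URL: try a new hash


def retrieve(code: str, store: Dict[str, str]) -> Optional[str]:
    """Return the original URL for a code, or None if the code is unknown."""
    return store.get(code)
```

In a real service the "check, then store" step would need to be atomic, for example by relying on a unique constraint on the shortened URL column and retrying when the insert fails.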
Database design
erDiagram
SHORTENED_URL {
string shortened_url_id "Primary Key"
string original_url "Original URL"
datetime created_at "Creation Timestamp"
}
VISITS {
string visit_id "Primary Key"
string shortened_url_id "Foreign Key referencing SHORTENED_URL"
datetime visit_timestamp "Visit Timestamp"
string visitor_ip "Visitor IP Address"
string referer "Referring URL"
string device_info "Device Information"
}
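The diagram above could translate into tables along these lines. This is a sketch using Python's built-in sqlite3 so it can be run locally. It assumes both tables live in one database so the foreign key can be enforced, whereas the high-level design keeps them in separate databases. The column types, the index, and the helper names are also assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE shortened_url (
    shortened_url_id TEXT PRIMARY KEY,
    original_url     TEXT NOT NULL,
    created_at       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE visits (
    visit_id         TEXT PRIMARY KEY,
    shortened_url_id TEXT NOT NULL REFERENCES shortened_url(shortened_url_id),
    visit_timestamp  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    visitor_ip       TEXT,
    referer          TEXT,
    device_info      TEXT
);
-- Assumed index: analytics queries group visits by shortened URL.
CREATE INDEX idx_visits_url ON visits (shortened_url_id);
"""


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with the two tables from the diagram."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


def most_visited(conn: sqlite3.Connection, limit: int):
    """Return (shortened_url_id, visit_count) rows, most visited first."""
    return conn.execute(
        "SELECT shortened_url_id, COUNT(*) AS visit_count FROM visits "
        "GROUP BY shortened_url_id ORDER BY visit_count DESC LIMIT ?",
        (limit,),
    ).fetchall()
```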
High-level design
- have an API gateway that handles rate limiting, to protect against DDoS attacks and overload on the servers
- have two API services - one for creating the shortened URL and another for retrieving the original URL and updating the visits database
- two SQL databases - one to store the shortened URLs and another to store the visits to each URL
- have database replicas in other regions for failover, to avoid single points of failure
- a cache for recently accessed URLs to speed up serve time
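To make the rate limiting at the gateway concrete, here is a sketch of a token bucket, one common way to implement it. The design above does not say which algorithm the gateway uses, so the algorithm, class name, and parameters are all assumptions. In practice this would usually come from the gateway product's own configuration rather than custom code.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 now: Optional[float] = None) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start with a full bucket
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and spend a token if the request may pass."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A gateway would typically keep one bucket per client (for example per IP address) and reject requests with HTTP 429 when allow() returns False.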
Request flows
- user submits a URL to be shortened -> shortening service hashes the URL -> service stores the shortened URL in the DB along with the original URL, if that combination does not already exist in the DB
- user visits a shortened URL -> retrieval service checks the cache for the URL -> on a cache miss, fetch the original URL from a read-only copy of the shortened URL DB and add it to the cache -> record the visit in the visits database
- cache should periodically clear out stored URLs that haven't been visited in a while -> evict based on the timestamp of the last visit
- read-only copies of the DB can be kept up to date using a gossip protocol
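The retrieval flow and the periodic cache cleanup above can be sketched as follows. A dict stands in for the read-only replica and a callback stands in for the write to the visits database. The class name, the method names, and the one-hour idle limit are placeholders chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RetrievalService:
    """Cache-aside lookup that records a visit for every successful resolve."""

    def __init__(self, read_replica: Dict[str, str],
                 record_visit: Callable[[str], None],
                 max_idle_seconds: float = 3600.0) -> None:
        self.read_replica = read_replica    # stand-in for the read-only DB copy
        self.record_visit = record_visit    # stand-in for the visits DB write
        self.max_idle_seconds = max_idle_seconds
        self.cache: Dict[str, Tuple[str, float]] = {}  # code -> (url, last visit)

    def resolve(self, code: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self.cache.get(code)
        if entry is not None:
            original_url = entry[0]                     # cache hit
        else:
            original_url = self.read_replica.get(code)  # cache miss: ask the DB
            if original_url is None:
                return None                             # unknown short URL
        self.cache[code] = (original_url, now)          # refresh last-visit time
        self.record_visit(code)
        return original_url

    def evict_idle(self, now: Optional[float] = None) -> int:
        """Drop entries not visited within max_idle_seconds; return how many."""
        now = time.time() if now is None else now
        stale = [code for code, (_, last) in self.cache.items()
                 if now - last > self.max_idle_seconds]
        for code in stale:
            del self.cache[code]
        return len(stale)
```

evict_idle would be called on a schedule (for example from a background job). Recording the visit could also be made asynchronous so it does not add latency to the redirect.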
Detailed component design
- the shortening service and the retrieval service should both sit behind load balancers to help manage the traffic, and should have autoscaling configured to accommodate unexpected spikes in traffic
- we can run into storage limits in the databases, so we will need a partition strategy. this adds complexity when accessing the DB, as we'll have to calculate which partition holds the information we want instead of querying a single table directly.
- when we partition data for the visits database, we may want to pair the most visited URLs with the least visited to avoid overloading any one partition
- we should limit the size of the cache, and when the cache memory limit is reached we should evict the least recently visited URLs
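Two of the points above lend themselves to a short sketch: working out which partition holds a given shortened URL, and a size-limited cache that evicts the least recently visited entry. Hash-based partitioning is an assumption here, since the design does not pick a scheme, and it does not implement the hot/cold pairing idea for the visits data. The names partition_for and LRUCache are illustrative.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


def partition_for(code: str, partition_count: int) -> int:
    """Map a shortened URL code to a partition index in [0, partition_count)."""
    digest = hashlib.sha256(code.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class LRUCache:
    """Bounded cache that evicts the least recently visited URL when full."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, code: str) -> Optional[str]:
        if code not in self._entries:
            return None
        self._entries.move_to_end(code)  # mark as most recently visited
        return self._entries[code]

    def put(self, code: str, original_url: str) -> None:
        self._entries[code] = original_url
        self._entries.move_to_end(code)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently visited
```

Note that a plain modulo scheme moves most keys when partition_count changes; consistent hashing is a common alternative if partitions are expected to be added over time.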
Trade offs/Tech choices
- using SQL databases can make data migrations more difficult, because the more rigid data structure requires a migration for any new field or change to the schema
- using an API gateway adds another hop to every request and can increase latency, but it provides useful tools such as rate limiting, which helps us defend against attacks and keeps excessive load from reaching our servers
- utilizing read replicas can help us speed up database access on the retrieval service, but does come at a monetary cost
Failure scenarios/bottlenecks
- most bottlenecks come down to database access. without read-only replicas, reads and writes would compete for the same database: every write could lock the data it touches, and requests would queue up behind a database that is constantly being locked and unlocked
Future improvements
- add a check that there are no duplicate original URLs
- allow users to choose their own custom shortened URLs
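Both improvements could be sketched as below, with dicts standing in for the database. The reverse index by_url, the alias rule (1 to 7 alphanumeric characters, to match the 7-character limit), and all function names are assumptions made for illustration.

```python
import re
from typing import Dict

# Assumed rule: custom aliases use the same 7-character limit as generated codes.
ALIAS_PATTERN = re.compile(r"[0-9A-Za-z]{1,7}")


class AliasTakenError(Exception):
    """Raised when a custom alias already points at a different URL."""


def register(original_url: str, code: str,
             by_code: Dict[str, str], by_url: Dict[str, str]) -> str:
    """Store code -> URL unless the URL already has a code; return the code in use."""
    existing_code = by_url.get(original_url)
    if existing_code is not None:
        return existing_code  # duplicate original URL: reuse its code
    by_code[code] = original_url
    by_url[original_url] = code
    return code


def claim_custom_alias(alias: str, original_url: str,
                       by_code: Dict[str, str]) -> str:
    """Reserve a user-chosen alias if it is well formed and not already taken."""
    if ALIAS_PATTERN.fullmatch(alias) is None:
        raise ValueError("alias must be 1-7 alphanumeric characters")
    if alias in by_code and by_code[alias] != original_url:
        raise AliasTakenError(alias)
    by_code[alias] = original_url
    return alias
```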