Designing a Simple URL Shortening Service: A TinyURL Approach - System Design

System requirements

Functional:

shortened URLs should remain valid permanently
shortened URLs should be limited to 7 characters
if we ever do need to remove URLs, the oldest unused URLs should be removed first
anyone should be allowed to create a shortened URL
we will want a hashing scheme with a low collision rate, combined with a uniqueness check, so that every shortened URL is unique
we may want analytics on which URLs are accessed the most
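
One way to meet the 7-character and uniqueness requirements is to hash the original URL and base62-encode part of the digest. The sketch below is only an illustration: SHA-256, the alphabet, and the names `base62_encode` and `short_code` are assumptions, not choices made in this design. With 62 symbols and 7 positions there are 62^7 = 3,521,614,606,208 possible codes, but truncating a hash can still collide, so the caller must check storage and retry with a different `attempt` value.

```python
import hashlib

# Assumed alphabet: digits, lowercase, uppercase (62 symbols).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
CODE_LENGTH = 7  # from the functional requirement above


def base62_encode(number: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def short_code(original_url: str, attempt: int = 0) -> str:
    """Derive a 7-character code; bump `attempt` to get a new code after a collision."""
    salted = f"{original_url}#{attempt}".encode("utf-8")
    digest = hashlib.sha256(salted).digest()
    # Reduce the first 8 bytes of the digest into the 62^7 code space.
    number = int.from_bytes(digest[:8], "big") % (62 ** CODE_LENGTH)
    # Left-pad so every code is exactly CODE_LENGTH characters long.
    return base62_encode(number).rjust(CODE_LENGTH, ALPHABET[0])
```

The same URL and attempt number always produce the same code, so a retry after a collision only needs to increment `attempt`.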

Non-Functional:

the service should have very low latency, both when creating and when retrieving URLs
our URL storage must allow quick validation that a shortened URL is unique

Capacity estimation

50,000 users a day
227 bytes per user
50,000 x 227 = 11,350,000 bytes a day -> ~11 MB a day of data
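
The arithmetic behind the estimate can be reproduced in a few lines. The daily figures come from the numbers above; the yearly projection is an added assumption (365 days of constant traffic, decimal megabytes and gigabytes).

```python
USERS_PER_DAY = 50_000
BYTES_PER_USER = 227

bytes_per_day = USERS_PER_DAY * BYTES_PER_USER       # 11,350,000 bytes
mb_per_day = bytes_per_day / 1_000_000               # 11.35 MB
# Assumption: traffic stays constant for a 365-day year.
gb_per_year = bytes_per_day * 365 / 1_000_000_000    # about 4.14 GB

print(f"{bytes_per_day:,} bytes/day, {mb_per_day:.2f} MB/day, {gb_per_year:.2f} GB/year")
```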

API design

POST /shortenURL/{originalURL} - takes in the original URL (URL-encoded, since it is passed as a path parameter), shortens it, validates that the shortened URL does not already exist in our system, and returns the shortened URL
GET /retrieveURL/{shortenedURL} - takes in a shortened URL and returns the original URL
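
A minimal in-memory sketch of the two endpoints follows, to make the request and response shapes concrete. Everything beyond the two routes is an assumption for illustration: the status codes, the JSON field names, the `url_table` dictionary standing in for the database, and the placeholder `make_code` generator. The handlers receive already-decoded values.

```python
import hashlib
from typing import Dict, Tuple

# Stand-in for the shortened-URL table: short code -> original URL.
url_table: Dict[str, str] = {}


def make_code(original_url: str, attempt: int) -> str:
    """Placeholder generator: first 7 hex characters of a salted SHA-256."""
    salted = f"{original_url}#{attempt}".encode("utf-8")
    return hashlib.sha256(salted).hexdigest()[:7]


def post_shorten_url(original_url: str) -> Tuple[int, dict]:
    """POST /shortenURL: create a short code, retrying while the code is taken."""
    if not original_url.startswith(("http://", "https://")):
        return 400, {"error": "original URL must start with http:// or https://"}
    attempt = 0
    code = make_code(original_url, attempt)
    while code in url_table:  # validate that the shortened URL is not in use
        attempt += 1
        code = make_code(original_url, attempt)
    url_table[code] = original_url
    return 201, {"shortenedURL": code}


def get_retrieve_url(shortened_url: str) -> Tuple[int, dict]:
    """GET /retrieveURL/{shortenedURL}: look up the original URL."""
    original_url = url_table.get(shortened_url)
    if original_url is None:
        return 404, {"error": "unknown shortened URL"}
    return 200, {"originalURL": original_url}
```

In this sketch, submitting the same original URL twice produces two different codes, because the collision loop moves on to the next attempt.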

Database design

erDiagram
    SHORTENED_URL {
        string shortened_url_id "Primary Key"
        string original_url "Original URL"
        datetime created_at "Creation Timestamp"
    }
    VISITS {
        string visit_id "Primary Key"
        string shortened_url_id "Foreign Key referencing SHORTENED_URL"
        datetime visit_timestamp "Visit Timestamp"
        string visitor_ip "Visitor IP Address"
        string referer "Referring URL"
        string device_info "Device Information"
    }
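
The two entities could be expressed as SQL tables roughly as follows. This is a sketch using Python's built-in `sqlite3` module with an in-memory database. The column types, the `NOT NULL` and default constraints, and the index `idx_visits_url` are assumptions. Both tables live in one database here so that the foreign key can be enforced; with two physically separate databases the reference would have to be checked by the application instead.

```python
import sqlite3

SCHEMA = """
CREATE TABLE shortened_url (
    shortened_url_id TEXT PRIMARY KEY,
    original_url     TEXT NOT NULL,
    created_at       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE visits (
    visit_id         TEXT PRIMARY KEY,
    shortened_url_id TEXT NOT NULL REFERENCES shortened_url(shortened_url_id),
    visit_timestamp  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    visitor_ip       TEXT,
    referer          TEXT,
    device_info      TEXT
);
-- Assumed index to speed up per-URL analytics queries.
CREATE INDEX idx_visits_url ON visits(shortened_url_id);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO shortened_url (shortened_url_id, original_url) VALUES (?, ?)",
    ("abc1234", "https://example.com/page"),
)
conn.execute(
    "INSERT INTO visits (visit_id, shortened_url_id, visitor_ip) VALUES (?, ?, ?)",
    ("v1", "abc1234", "203.0.113.7"),
)
row = conn.execute(
    "SELECT original_url FROM shortened_url WHERE shortened_url_id = ?",
    ("abc1234",),
).fetchone()
```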

High-level design

have an API gateway that applies rate limiting, to guard against DDoS attacks and overload on the servers
have two API services - one for creating shortened URLs and another for retrieving the original URL and updating the visits database
two SQL databases - one to store the shortened URLs and another to store the visits to each URL
have database replicas in other regions for failover, to avoid single points of failure
a cache for recently accessed URLs to speed up serve time
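
The rate limiting mentioned for the API gateway is often implemented with a token bucket. The sketch below shows the idea only; the design above does not name an algorithm, so the token bucket itself, the per-client limits (10 requests burst, 5 per second), and the names `TokenBucket` and `allow_request` are all assumptions. A real gateway would normally provide this as configuration rather than application code.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self) -> bool:
        """Spend one token if available; return False when the client is over the limit."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client, keyed by a hypothetical client identifier such as an IP address.
buckets: Dict[str, TokenBucket] = {}


def allow_request(client_id: str) -> bool:
    if client_id not in buckets:
        buckets[client_id] = TokenBucket(capacity=10, refill_per_second=5.0)
    return buckets[client_id].allow()
```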

Request flows

user inputs a URL to be shortened -> shortening service hashes the URL -> service stores the shortened URL in the DB along with the original URL, if that combination does not already exist in the DB
user visits a shortened URL -> retrieval service checks the cache for the URL -> on a cache miss, fetch the original URL from a read-only copy of the shortened URL DB and update the cache to include it -> update the visits database with the visit
the cache should periodically clear out stored URLs that haven't been visited in a while -> evict based on the timestamp of the last visit
read-only copies of the DB can be updated using a gossip protocol
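
The retrieval flow above follows a cache-aside pattern, which can be sketched as below. The dictionaries `url_db` and `cache` and the list `visits` are in-memory stand-ins for the read replica, the cache, and the visits database. They, and the function name `resolve`, are assumptions for illustration only.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

url_db: Dict[str, str] = {"abc1234": "https://example.com/page"}  # read replica stand-in
cache: Dict[str, str] = {}                                        # cache stand-in
visits: List[dict] = []                                           # visits DB stand-in


def resolve(short_code: str, visitor_ip: str = "") -> Optional[str]:
    """Return the original URL for a short code, or None if it is unknown."""
    original_url = cache.get(short_code)
    if original_url is None:
        # Cache miss: fall back to the read-only copy of the URL database.
        original_url = url_db.get(short_code)
        if original_url is None:
            return None  # unknown code: nothing to cache or record
        cache[short_code] = original_url
    # Record the visit whether the lookup was served from the cache or the DB.
    visits.append({
        "shortened_url_id": short_code,
        "visit_timestamp": datetime.now(timezone.utc).isoformat(),
        "visitor_ip": visitor_ip,
    })
    return original_url
```

In a real service the visit write would likely be made asynchronous so it does not add latency to the redirect, but that is a design choice outside this sketch.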

Detailed component design

the shortening service and retrieval service should both sit behind load balancers to help manage the traffic, and should have autoscaling configured to accommodate unexpected spikes in traffic
we can run into capacity limits in the databases, so we will need a partitioning strategy at hand. this adds complexity to accessing the DB, as we'll have to calculate which partition holds the information we want instead of querying a single table directly.
when we partition data for the visits database, we may want to pair the most visited URLs with the least visited to avoid overloading any one partition
we should limit the size of the cache, and when the cache memory limit is reached we should evict the least recently visited URLs (an LRU policy)
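
The cache eviction rule in the last point is a least-recently-used (LRU) policy. A minimal sketch using `collections.OrderedDict` is shown below. It bounds the cache by entry count rather than by bytes of memory, which is a simplifying assumption, and the class name `LRUCache` is illustrative.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Maps short codes to original URLs, evicting the least recently used entry."""

    def __init__(self, max_entries: int):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, short_code: str) -> Optional[str]:
        if short_code not in self._entries:
            return None
        # A hit counts as a visit, so mark the entry as most recently used.
        self._entries.move_to_end(short_code)
        return self._entries[short_code]

    def put(self, short_code: str, original_url: str) -> None:
        self._entries[short_code] = original_url
        self._entries.move_to_end(short_code)
        if len(self._entries) > self.max_entries:
            # The first item in the ordering is the least recently used one.
            self._entries.popitem(last=False)
```

For example, with `max_entries=2`, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` was touched more recently.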

Trade-offs/Tech choices

using SQL databases can make data migrations more difficult, because the more rigid schema requires a migration for any new field or other schema change
using an API gateway adds another hop to every request and can increase latency, but it provides useful tools such as rate limiting, which will ultimately help us mitigate attacks and prevent too much load from reaching our servers
utilizing read replicas can help us speed up database access on the retrieval service, but does come at a monetary cost

Failure scenarios/bottlenecks

most bottlenecks come down to database access. without read-only replicas, the database would lock every time we write to it, and requests trying to read from and write to a database that is constantly being locked and unlocked would queue up behind one another

Future improvements

add a check for duplicate original URLs
allow users to create their own custom shortened URLs