My Solution for Designing a Simple URL Shortening Service: A TinyURL Approach with Score: 9/10

by serenade_xenon816

System requirements


Functional:


  1. Users can create a short URL with an optional alias
  2. Given a short URL, the system redirects to the target (long URL)
  3. Short URLs expire after a given period


Good to have

  1. Authentication and authorisation
  2. Billing based on the number of redirections
  3. URL history
  4. Security - Spam filtering


System Properties

  1. The system handles short URL collisions and produces unique short URLs
  2. Short URLs are human-readable and 6 to 8 characters long


Non-Functional:



  1. Reliability: the system produces unique short URLs 99% of the time
  2. Performance: the system supports hundreds of thousands of long-URL redirections with latency under 200ms
  3. The system is available for redirection 99.99% of the time
  4. The system produces short URLs consistently and remains available
  5. CAP
    1. The system prioritises availability over consistency for reads




Capacity estimation


  1. Total users = 100M
  2. Total URLs per month = ~1M new URLs per day x 30 days = 30M URLs per month
  3. Total URLs per year = 360M
  4. Total URLs for 5 yrs = 360M x 5 = 1.8B, rounded to 2B URLs
  5. WPS = 300 URLs/sec (design peak; the average works out to ~12/sec)
  6. Storage: 30M URLs x 5KB per record = approx 150GB per month; yearly approx 1.8TB; 5 yrs = approx 9-10TB
  7. RPS = 10K/sec
    1. Redirections per day per URL = ~20
    2. Redirections per day = 20 x 30M = 600M
    3. Spike of 50% when a URL goes viral = rounded to 1B redirections a day
    4. RPS = ~10K/sec
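A quick back-of-the-envelope check of the numbers above, as a runnable sketch; the per-record size, redirect ratio, and spike factor are the assumptions stated in the list:

```typescript
// Back-of-the-envelope capacity check; inputs are the assumptions listed above.
const urlsPerMonth = 30_000_000;                 // ~1M new URLs per day
const urlsFor5Years = urlsPerMonth * 12 * 5;     // 1.8B, rounded to ~2B
const recordSizeKB = 5;

const storagePerMonthGB = (urlsPerMonth * recordSizeKB) / 1_000_000;        // ~150 GB
const storageFor5YearsTB = (urlsFor5Years * recordSizeKB) / 1_000_000_000;  // ~9 TB

const redirectsPerDay = 20 * urlsPerMonth;          // 600M redirections/day
const peakRedirectsPerDay = redirectsPerDay * 1.5;  // ~900M/day with a 50% viral spike
const peakRps = Math.round(peakRedirectsPerDay / 86_400); // ~10K redirections/sec

console.log({ urlsFor5Years, storagePerMonthGB, storageFor5YearsTB, peakRps });
```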



API design



Core Entities

  1. Url: {shortUrl, longUrl, createdAt, expiresAt, owner, alias}
  2. User: {userid, name, email, ....}


Create a short url

Generate a short URL for a long URL, with an optional expiry date and alias.

POST /v1/urls
Request body: {longUrl, alias, expiresOn}
Response: 201 Created
Response body: {shortUrl, expiresAt}

Update a short url


Update the alias or expiry of a URL entity.

PUT /v1/urls
Request body: {shortUrl, alias, expiresOn}


Get a short Url

Get one or all short URLs for the caller.

GET /v1/urls/{shortUrl}
Response: 200 OK
Response body: {one or more instances of Url}


Redirection

Redirect a short URL to its long URL.

HEAD /v1/redirect/{shortUrl}
Response: 301 Moved Permanently
Response header: Location: {longUrl}
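A minimal TypeScript sketch of the create endpoint's request/response shapes and a sample client call. The field names follow the API above; the host name and error handling are placeholders, not part of the design:

```typescript
// Shapes for POST /v1/urls (field names taken from the API definition above).
interface CreateUrlRequest {
  longUrl: string;
  alias?: string;      // optional custom alias
  expiresOn?: string;  // optional ISO-8601 expiry date
}

interface CreateUrlResponse {
  shortUrl: string;
  expiresAt: string;
}

// Example client call; https://tiny.example is a placeholder host.
async function createShortUrl(req: CreateUrlRequest): Promise<CreateUrlResponse> {
  const res = await fetch("https://tiny.example/v1/urls", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (res.status !== 201) throw new Error(`create failed: ${res.status}`);
  return (await res.json()) as CreateUrlResponse;
}
```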





Database design



  1. URL - approx 5KB per record
  • id: int (8 bytes) PK
  • shortUrl: char(10), indexed, SK
  • longUrl: char(4096), indexed
  • alias: char(10), indexed // this can be merged into shortUrl
  • owner: char(15) FK
  • createdAt: byte(16)
  • expiresAt: byte(16)
  • lastUpdatedAt: byte(16)


  2. User - approx 5KB per record
  • userId: char(15)
  • name: char(20)
  • ....
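Since Aurora PostgreSQL is the database picked in the trade-offs section below, the URL table could be created roughly as in this node-postgres sketch; column names mirror the list above, while the concrete SQL types and the users table reference are assumptions:

```typescript
import { Pool } from "pg";

// Rough DDL for the URL table; types are assumptions based on the field list above.
const createUrlTable = `
  CREATE TABLE IF NOT EXISTS urls (
    id              BIGINT PRIMARY KEY,           -- Snowflake-style unique id
    short_url       VARCHAR(10) UNIQUE NOT NULL,  -- generated short code (doubles as alias)
    long_url        VARCHAR(4096) NOT NULL,
    owner           VARCHAR(15) REFERENCES users(user_id),  -- assumes a users table per the User entity
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at      TIMESTAMPTZ,
    last_updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );
  CREATE INDEX IF NOT EXISTS idx_urls_long_url ON urls (long_url);
`;

async function migrate(pool: Pool): Promise<void> {
  await pool.query(createUrlTable);
}
```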



High-level design








Request flows







Detailed component design



Hash Generation

  1. Total usable chars: [A-Za-z0-9] = 62 chars
  2. Exclude confusing chars (0, O, 1, l) = 58 chars
  3. Base58 encoding: 58^6 ≈ 38B URLs, which is sufficient for 5 yrs; after that, short URLs can be generated with 7 chars (58^7 ≈ 2T URLs)
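A minimal Base58 encoder sketch using the 58-character alphabet described above (dropping 0, O, 1, and l); treat it as illustrative rather than a vetted library:

```typescript
// 62 alphanumerics minus the confusing 0, O, 1, l => 58 usable characters.
const ALPHABET = "23456789ABCDEFGHIJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

// Encode a non-negative integer into Base58, left-padded to `length` characters.
function encodeBase58(id: bigint, length = 6): string {
  const base = BigInt(ALPHABET.length); // 58
  let out = "";
  let n = id;
  while (n > 0n) {
    out = ALPHABET[Number(n % base)] + out;
    n /= base;
  }
  while (out.length < length) out = ALPHABET[0] + out;
  return out; // 58^6 ≈ 38.1B distinct 6-character codes
}

console.log(encodeBase58(123456789n));
```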


Short Url Generation

  • Approach 1: a single service generates unique 6-character combinations from the 58 usable characters (58^6 ≈ 38B codes, well above our 2B requirement) and hashes them into short URLs. One service handling 300 URLs/sec is reasonable, but if it goes down we must bring up a replacement or keep a standby ready. More importantly, the chance of generating a duplicate is high, so every code needs a uniqueness check against the database - an extra 300 queries/sec, and the check load grows in step with the generation rate (e.g. if generation reaches 1,000 URLs/sec).
  • Approach 2: stand up 3 nodes for URL generation, each producing its own codes. Each machine generates a unique number (a Snowflake-style ID) and encodes it with Base58 (or Base64; Base64 cannot avoid the confusing chars 0, O, l, 1, and filtering them out wastes CPU cycles and slows generation). Because the machine ID is embedded in the Snowflake-style ID, even if one machine goes down we can bring in another node without worrying about hash collisions.


Twitter SnowFlake Id

A Snowflake-style 48-bit ID:

  1. Bit 0 (most significant): sign bit, always 0
  2. Bits 1-32: 32-bit timestamp (e.g. seconds since a custom epoch)
  3. Bits 33-36: 4-bit machine ID
  4. Bits 37-47: 11-bit random/sequence number


Base58-encoding the ID yields 6-character codes covering 58^6 ≈ 38B URLs. (A full 48-bit ID does not fit losslessly in 6 Base58 characters, so in practice the ID is hashed or truncated into that range, which is why the duplicate check under failure scenarios below is needed.) When we run out of unique codes, the timestamp epoch can be moved to a more recent date, say 01-01-2027 00:00:00, giving another ~38B unique URLs.

This approach lets us horizontally scale the number of nodes that generate unique short codes for URLs.
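A sketch of an ID generator following the 48-bit layout above, combined with the Base58 encoder from the earlier sketch. The custom epoch value and machine-id wiring are assumptions for illustration:

```typescript
// From the Base58 sketch above.
declare function encodeBase58(id: bigint, length?: number): string;

// Snowflake-style 48-bit id: 1 sign bit (always 0), 32-bit seconds since a custom
// epoch, 4-bit machine id, 11-bit random/sequence. The epoch below is a placeholder.
const CUSTOM_EPOCH_SECONDS = Date.UTC(2022, 0, 1) / 1000;

function nextId(machineId: number): bigint {
  const seconds = BigInt(Math.floor(Date.now() / 1000) - CUSTOM_EPOCH_SECONDS) & 0xffffffffn;
  const machine = BigInt(machineId & 0xf);                  // 4 bits
  const random = BigInt(Math.floor(Math.random() * 2048));  // 11 bits
  return (seconds << 15n) | (machine << 11n) | random;      // 47 data bits, sign bit stays 0
}

// Each node owns its machine-id bits, so two nodes never produce the same id.
// Note: ids above 58^6 encode to 7+ characters; keeping 6-character codes means
// hashing/truncating into that range and checking for duplicates (see failure scenarios).
const shortCode = encodeBase58(nextId(3));
console.log(shortCode);
```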



Redirecting URLs

The client's browser asks for the long URL with a HEAD request, which is lightweight and faster than a GET. The request URL contains the short code, which is looked up in an in-memory cache; on a cache miss the redirection microservice fetches the URL from the database, adds it to the cache with a TTL (LRU eviction strategy), and returns the long URL in the Location response header with a 301 (Moved Permanently) status. Because 301 responses are cached by browsers, CDNs and ISPs, repeat visits may never reach our servers, which reduces load; the trade-off is that we cannot count every redirection for a short code. (If per-redirection analytics or billing were required, a 302 response, which is not cached, would be the better choice.)
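A rough Express-style handler for this path; the cache and database lookups are placeholder interfaces (in the real design they would be Memcached and Aurora), not a specific library API:

```typescript
import express from "express";

// Placeholder dependencies; in the actual system these would be Memcached and Aurora.
declare const cache: {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSec: number): Promise<void>;
};
declare const db: { findLongUrl(shortCode: string): Promise<string | null> };

const app = express();

app.head("/v1/redirect/:shortCode", async (req, res) => {
  const { shortCode } = req.params;

  // 1. Try the cache first.
  let longUrl = await cache.get(shortCode);

  // 2. On a miss, fall back to the database and repopulate the cache with a TTL.
  if (!longUrl) {
    longUrl = await db.findLongUrl(shortCode);
    if (!longUrl) return res.sendStatus(404);
    await cache.set(shortCode, longUrl, 3600); // 1h TTL; LRU eviction handles memory pressure
  }

  // 3. 301 lets browsers/CDNs cache the mapping, trading off per-redirect counting.
  res.redirect(301, longUrl);
});
```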


Scaling redirection microservice

The redirection service experiences heavy traffic, and a viral URL introduces spikes that put it under heavy load. The service is stateless, so standing up multiple nodes (auto-scaling Lambdas/functions, or auto-scaling pods in a Kubernetes deployment) handles this.


Caching Short URLs

A Memcached layer caches the shortUrl -> longUrl mapping. Memcached is a distributed cache, so replicas can stand in multiple regions and stay in sync. We only need a simple key/value (Map) structure here, which helps absorb read spikes and viral URLs.
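As a local illustration of the Map-with-TTL-and-LRU idea described above (the distributed role is played by Memcached; sizes and TTLs here are arbitrary):

```typescript
// Minimal in-process LRU cache with TTL, mirroring the Map-based strategy above.
class LruTtlCache {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  constructor(private maxSize = 10_000) {}

  get(key: string): string | null {
    const hit = this.entries.get(key);
    if (!hit) return null;
    if (hit.expiresAt < Date.now()) {
      this.entries.delete(key);
      return null;
    }
    // Re-insert to mark as most recently used (Map preserves insertion order).
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: string, ttlSec: number): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxSize) {
      // Evict the least recently used entry (the first key in insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSec * 1000 });
  }
}
```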



Data durability

Multi-region data redundancy handles disaster recovery: database nodes are deployed across multiple regions with asynchronous replication that lags by less than 1 sec. A master node takes the writes (300 writes/sec), which are replicated to 2 other nodes, making a 3-node database cluster; the 2 replicas are deployed in different regions. If the master goes down, any read replica can be promoted to master within 1 sec, and the old master can later be brought back up and rejoin the cluster.


Data Reliability

In each region the data is held on 2 nodes, so with 3 regions there are 6 copies of a short URL record in total.


Data Eventual Consistency

Data is replicated asynchronously within a second, which means a URL generated in Reston, VA may not be available immediately in the Europe region. This is expected, and a 4xx error on a freshly created short URL is acceptable for up to 1 sec.


Data Partitioning

The URL record is partitioned by the generated short URL and sorted within a partition by the unique ID (Snowflake-style ID). This spreads the data across multiple partitions on the master and read replicas, avoiding hot partitions. The cache uses the same strategy, so cached data is likewise spread across regions and partitioned within each region.
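A tiny routing sketch for picking a partition from the short code; the hash function and shard count are placeholders, not part of the design:

```typescript
import { createHash } from "node:crypto";

// The short code is the partition key; the Snowflake-style id sorts records within
// a partition. 16 shards is an arbitrary placeholder.
function shardFor(shortCode: string, shardCount = 16): number {
  const digest = createHash("md5").update(shortCode).digest();
  return digest.readUInt32BE(0) % shardCount;
}

console.log(shardFor("Ab3kZ9")); // routes this code to one of the 16 shards
```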



Data Availability

Data in the database is replicated to 5 other nodes, which are read replicas. These 5 nodes serve only reads and refuse writes unless one of them is elected master; only the master node takes writes. On top of the database, the caching layer provides fast lookups, so the number of queries reaching the database is reduced.




Trade offs/Tech choices




  1. Memcached - we need to cache a simple key/value pair with extremely fast lookups, while asynchronously replicating cached data to multiple regions. Memcached suits this use case better than Redis; Redis does support more data structures, though - for instance, Bloom filters (via the RedisBloom module) that we could use for URL duplication checks.
  2. Amazon Aurora - Aurora (PostgreSQL) is chosen for its fast replication (within 1 sec), auto-scaling, and AWS-managed maintenance and self-healing. It can be deployed as a global database with a primary region and replicas in other regions, each region spanning multiple AZs. This is comparatively cheaper than standing up our own database and absorbing the maintenance cost.
  3. NodeJS/TypeScript microservices - there is not much business logic in the service layer; we have ~300 requests/sec to generate URLs, perhaps growing to 1,000 RPS. A minimal startup/boot time is preferred. These microservices can be deployed as Kubernetes pods with auto-scaling enabled, or as AWS Lambdas with auto-scaling.
  4. Load balancer - an ELB (or nginx) based service terminates SSL and forwards requests to the appropriate API gateway. The load balancer routes based on the URL path: /urls/* or /redirect/*.
  5. API gateway - handles authentication, rate limiting and spam filtering. Redirection requests do not need an authenticated session.


Failure scenarios/bottlenecks




Duplicate Checks

The short URLs are effectively unique even though they are produced by different nodes; however, once the ID range runs out (or the epoch is reset) duplicates become possible, so we introduce a low-latency duplication check using a Bloom filter in the cache. The filter has to be maintained on each node, which may make it less effective.
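A compact Bloom-filter sketch for this check; the sizing and hash construction below are illustrative assumptions (Redis also offers a native Bloom filter via RedisBloom, as noted in the trade-offs):

```typescript
import { createHash } from "node:crypto";

// Minimal Bloom filter for "have we issued this short code before?" checks.
// False positives only cost an extra DB lookup; false negatives never happen.
class BloomFilter {
  private bits: Uint8Array;
  constructor(private sizeBits = 1 << 24, private hashes = 4) {
    this.bits = new Uint8Array(Math.ceil(sizeBits / 8));
  }

  private positions(key: string): number[] {
    const digest = createHash("sha256").update(key).digest();
    const out: number[] = [];
    for (let i = 0; i < this.hashes; i++) {
      out.push(digest.readUInt32BE(i * 4) % this.sizeBits); // 4 digest bytes per hash
    }
    return out;
  }

  add(key: string): void {
    for (const p of this.positions(key)) this.bits[p >> 3] |= 1 << (p & 7);
  }

  mightContain(key: string): boolean {
    return this.positions(key).every((p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

// Usage: skip the DB uniqueness query when the filter says "definitely not seen before".
const seen = new BloomFilter();
seen.add("Ab3kZ9");
console.log(seen.mightContain("Ab3kZ9")); // true
console.log(seen.mightContain("Zz9x4q")); // almost certainly false
```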


Master Going Down

The master node going down impacts and delays writes, since one of the read replicas needs to be elected master. The delay is within a second, and at 300 write requests/sec this is acceptable.


Read Replica going down

A read replica going down is detected by health checks, and requests are not forwarded to the failed node.





Future improvements



  1. Authentication and authorisation
  2. Billing based on the number of redirections
  3. URL history
  4. Security - Authz, Spam filtering, rate limiting
  5. Observability - Logging and monitoring