My Solution for Designing a Simple URL Shortening Service: A TinyURL Approach (Score: 8/10)

by wulinjiansheng

System requirements


Functional:

  1. Generate short url for long url
  2. Redirect from short url to long url
  3. Clean up expired urls
  4. Analytics on urls



Non-Functional:

  1. Highly available
  2. Eventual consistency
  3. Scalable



Capacity estimation

Storage:

~200 URLs generated per second

Each URL record is ~200 bytes

200 * 200 bytes = 40 KB/s; 86,400 * 40 KB ≈ 3.4 GB/day

≈ 1.2 TB/year

For five years, we need ~6 TB


Read/Write

Write: ~200 RPS

Read:write ratio of 100:1, so reads ≈ 20,000 RPS

Cache ~20% of daily traffic (the 80/20 rule)

Hot URLs are roughly 1 in 100 of all URLs

Cache for hot URLs for one day: 20% * 3.4 GB ≈ 680 MB
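
A quick back-of-the-envelope check of these figures (a minimal sketch; the 200 RPS, 200 bytes per record, 100:1 read ratio, and 20% cache fraction are the assumptions stated above):

```python
# Back-of-the-envelope capacity check for the estimates above.
writes_per_sec = 200          # new URLs per second (assumption)
bytes_per_url = 200           # stored size per URL record (assumption)

write_bw = writes_per_sec * bytes_per_url      # 40,000 B/s = 40 KB/s
per_day = write_bw * 86_400                    # ~3.46 GB/day
five_years = per_day * 365 * 5                 # ~6.3 TB

reads_per_sec = writes_per_sec * 100           # 100:1 read:write -> 20,000 RPS
hot_cache = per_day * 0.20                     # ~20% of a day's URLs -> ~690 MB

print(f"write bandwidth : {write_bw / 1e3:.0f} KB/s")
print(f"storage per day : {per_day / 1e9:.2f} GB")
print(f"storage 5 years : {five_years / 1e12:.1f} TB")
print(f"read traffic    : {reads_per_sec} RPS")
print(f"daily hot cache : {hot_cache / 1e6:.0f} MB")
```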





API design

  • Create short url

/api/url/generate

method: POST

data: {originUrl}


  • Redirect to long url from short url

/api/url/redirect?shortUrl={shortUrl}

method: GET


  • Url analysis

/api/url/track

method: POST

data: {originUrl, shortUrl, device, region, sourcePageUrl}
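
A minimal sketch of these three endpoints, assuming Flask and simple in-memory stand-ins for the stores and queue described later (the base domain, handler names, and placeholder key assignment are illustrative, not part of the design):

```python
from flask import Flask, abort, jsonify, redirect, request

app = Flask(__name__)

# In-memory stand-ins for the urls DB and the analytics message queue.
URLS: dict[str, str] = {}      # hash -> origin url
EVENTS: list[dict] = []        # collected analytics events

@app.post("/api/url/generate")
def generate_short_url():
    body = request.get_json()
    # Placeholder key assignment; the real design claims a pre-generated key.
    short_hash = format(len(URLS) + 1, "x")
    URLS[short_hash] = body["originUrl"]
    return jsonify({"shortUrl": f"https://tiny.example/{short_hash}"}), 201

@app.get("/api/url/redirect")
def redirect_short_url():
    short_hash = request.args["shortUrl"].rsplit("/", 1)[-1]
    origin_url = URLS.get(short_hash)
    if origin_url is None:
        abort(404)
    return redirect(origin_url, code=302)

@app.post("/api/url/track")
def track_url():
    # {originUrl, shortUrl, device, region, sourcePageUrl}
    EVENTS.append(request.get_json())
    return "", 202
```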



Database design

For normal shortUrl storage

PK: Index

OriginUrl

Hash

CreatedAt

ExpireAt


For pre-generated keys storage

PK: Index

Hash

CreatedAt

UsedAt


For analytics storage

PK: Analytics Id (UUID)

OriginUrl

Hash

Device

Region

SourcePage

AccessedAt
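
A sketch of these three tables as DDL, run against SQLite here just for illustration (the column types, index names, and snake_case naming are assumptions; a production SQL DB would use a BIGINT auto-increment PK and proper timestamps):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normal short-URL storage
CREATE TABLE urls (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,   -- PK: Index
    origin_url  TEXT NOT NULL,
    hash        TEXT NOT NULL,
    created_at  TIMESTAMP NOT NULL,
    expire_at   TIMESTAMP
);
CREATE INDEX idx_urls_hash ON urls(hash);
CREATE INDEX idx_urls_expire_at ON urls(expire_at);

-- Pre-generated keys storage
CREATE TABLE generated_keys (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,   -- PK: Index
    hash        TEXT NOT NULL UNIQUE,
    created_at  TIMESTAMP NOT NULL,
    used_at     TIMESTAMP                            -- NULL = still unused
);

-- Analytics storage
CREATE TABLE url_analytics (
    analytics_id TEXT PRIMARY KEY,                   -- UUID
    origin_url   TEXT NOT NULL,
    hash         TEXT NOT NULL,
    device       TEXT,
    region       TEXT,
    source_page  TEXT,
    accessed_at  TIMESTAMP NOT NULL
);
""")
```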



High-level design

End to end, the main components are: clients hitting an API gateway (with rate limiting), the URL service (create short URL, redirect), the key generation service and its coordinator, the pre-generated keys DB, the urls DB, a cache (Redis/Memcached) for hot URLs and unused keys, a message queue feeding the analytics service and its analytics DB, and a daily cleanup job. A block diagram would show requests flowing from clients through the gateway to the URL service, which talks to the caches and DBs and emits analytics events to the queue. Each component is covered in the request flows and detailed design below.







Request flows

  • Flow to generate a short url for long url

The user sends a request with the long URL. We first check whether the long URL already exists in the urls DB; if so, we return its short URL (the hash appended to some base URL). If not, the URL service gets a pre-generated key (hash) from the generated keys DB. The key is marked as used once taken, then inserted into the urls DB along with the long URL and metadata (expiration, user info, etc.).


Multiple requests for the same long URL could arrive at the same time; we allow them to end up with different short URLs, so there is no uniqueness constraint here. The same long URL can be requested by different users, and we don't want them to share one short URL (expirations could differ, and we don't want to pile up their analytics data together).


To avoid multiple requests claiming the same short URL hash, we lock the hash row while claiming it (e.g. SQL's FOR UPDATE SKIP LOCKED, sketched below). Concurrent requests skip the locked row and pick another hash, and once the lock is released, later requests see that the hash is already used.
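
A sketch of claiming a pre-generated key this way, assuming PostgreSQL, psycopg2, and a generated_keys table like the one in the schema section (names are illustrative):

```python
import psycopg2

# conn = psycopg2.connect("dbname=tinyurl")   # connection details are assumptions

def claim_unused_key(conn) -> str | None:
    """Atomically claim one unused pre-generated hash.

    FOR UPDATE SKIP LOCKED makes concurrent transactions skip rows that are
    currently being claimed, so no two requests ever get the same hash.
    """
    with conn:                              # commit/rollback the transaction
        with conn.cursor() as cur:
            cur.execute("""
                SELECT id, hash
                  FROM generated_keys
                 WHERE used_at IS NULL
                 ORDER BY id
                 LIMIT 1
                 FOR UPDATE SKIP LOCKED
            """)
            row = cur.fetchone()
            if row is None:
                return None                 # pool of pre-generated keys is empty
            key_id, key_hash = row
            cur.execute(
                "UPDATE generated_keys SET used_at = now() WHERE id = %s",
                (key_id,),
            )
            return key_hash
```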


  • Flow to redirect from short url to long url

The URL service receives the short URL and looks up whether the hash (extracted from the short URL) exists in the urls DB. If it doesn't exist, return 404. If found, we get the origin URL and redirect.


  • Flow to analyze url

When we look up the short URL in the urls DB, we send an event to a message queue, which is consumed by the analytics service. The event carries the tracking data we need to store in the DB (OriginUrl, Hash, Device, Region, SourcePage, AccessedAt).
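
A sketch of the lookup-plus-tracking step, with the urls DB lookup and the queue producer passed in as stand-in callables (illustrative only, not a specific library):

```python
import datetime as dt

def handle_redirect(short_hash, urls_db, publish_event, request_meta):
    """Resolve a short-url hash and emit an analytics event.

    urls_db.find_by_hash and publish_event are stand-ins for the urls DB
    lookup and the message-queue producer described above.
    """
    record = urls_db.find_by_hash(short_hash)
    now = dt.datetime.now(dt.timezone.utc)

    # Track the access (the design also records accesses to expired URLs).
    publish_event({
        "originUrl": record["origin_url"] if record else None,
        "hash": short_hash,
        "device": request_meta.get("device"),
        "region": request_meta.get("region"),
        "sourcePage": request_meta.get("sourcePageUrl"),
        "accessedAt": now.isoformat(),
    })

    if record is None or (record["expire_at"] and record["expire_at"] < now):
        return None                   # caller returns 404
    return record["origin_url"]       # caller issues a 302 redirect
```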


  • Flow to clean up urls

We could have a job that runs daily to clean up expired URLs. The cleanup service scans the urls DB for expired URLs, then checks the analytics DB for each one's last access time. If an expired URL has had no or low access for a long period (e.g. 6 months), we remove it from the urls DB and mark the corresponding key in the generated keys DB as unused again.
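
A sketch of that daily job, with the three stores as stand-in objects (the helper method names and the 6-month cutoff are assumptions drawn from the description above):

```python
import datetime as dt

IDLE_CUTOFF = dt.timedelta(days=180)   # "no/low access for ~6 months"

def cleanup_expired_urls(urls_db, analytics_db, keys_db):
    """Daily job: recycle expired URLs that have gone quiet.

    urls_db / analytics_db / keys_db are stand-ins for the urls DB,
    analytics DB, and generated keys DB described above.
    """
    now = dt.datetime.now(dt.timezone.utc)
    for url in urls_db.find_expired(before=now):            # ExpireAt < now
        last_access = analytics_db.last_accessed_at(url["hash"])
        if last_access is None or now - last_access > IDLE_CUTOFF:
            urls_db.delete(url["id"])
            keys_db.mark_unused(url["hash"])                 # key can be reused
```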



Detailed component design


  • Key generation service

The key generation service is the core part. At the very beginning, based on the traffic estimate, we could pre-generate keys for 30 days of URLs.

200 URLs/second: 200 * 86,400 ≈ 17.3 million URLs/day

30 days ≈ 5.2 * 10^8 URLs

~0.52 billion


The hash is generated from the DB auto-increment index, encoded with Base64.

To allow multiple servers to generate hashes at the same time, we give each server a base offset.

E.g. with 1,000 servers generating 1 billion URLs, each server needs to generate 1M URLs. Server 1's base is 0, server 2's base is 1M, and so on.

The hash is then generated as:

Server 1: base64(index)

Server 2: base64(index + 1,000,000)
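
A sketch of that encoding, reading "base64(index)" as URL-safe Base64 over the integer's bytes (the 1M per-server range is the example figure above; the function name is illustrative):

```python
import base64

URLS_PER_SERVER = 1_000_000   # each server owns a contiguous index range

def hash_for(index: int, server_id: int) -> str:
    """Encode (server base + local index) as a short URL-safe token."""
    global_index = server_id * URLS_PER_SERVER + index
    raw = global_index.to_bytes((global_index.bit_length() + 7) // 8 or 1, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

# Server 1 (id 0) and server 2 (id 1) can never collide:
print(hash_for(0, server_id=0))   # first key of server 1 -> base64(0)
print(hash_for(0, server_id=1))   # first key of server 2 -> base64(1,000,000)
```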


All generated hashes are marked as unused at the beginning.


We could have a coordinator run the generation jobs on these servers, and once generation is done, mark the corresponding generated keys DB as ready to use.


A cache could be used here: we load 100K unused hashes from the DB into the cache in advance and mark those 100K hashes as used in the DB. When the cached supply runs low (< 10K), we fetch more hashes from the DB to refill it.
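
A sketch of that refill logic, assuming Redis holds the unused hashes as a list and keys_db.claim_batch is a stand-in that returns a batch of unused hashes and marks them used in the DB (e.g. via the SKIP LOCKED pattern above); the thresholds are the figures just mentioned:

```python
import redis

REFILL_BATCH = 100_000    # hashes loaded per refill
LOW_WATERMARK = 10_000    # refill when the cache drops below this

r = redis.Redis()

def take_key(keys_db) -> str | None:
    """Pop one pre-generated hash from the cache, refilling when it runs low."""
    if r.llen("unused_hashes") < LOW_WATERMARK:
        batch = keys_db.claim_batch(REFILL_BATCH)   # stand-in DB helper
        if batch:
            r.rpush("unused_hashes", *batch)
    key = r.lpop("unused_hashes")
    return key.decode() if key else None
```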



  • Url service to get short url for origin url

When a user asks for a short URL for their origin URL, the URL service first checks whether the origin URL already exists in the urls DB. If not, it asks the coordinator for a ready-to-use generated keys DB and takes the first unused key from it. The key is marked as used, then inserted into the urls DB.


A cache could be used when getting the short URL for an origin URL: once we retrieve the short URL, we put it in the cache, so the next time the same origin URL is requested we can return the short URL from the cache.



  • Url service to get origin url from short url

If we find the short URL in the urls DB, we return the origin URL and redirect. If it is not found or has expired, we return 404.

We send an event with the tracking data to a message queue even if the URL is expired. The events are consumed by the analytics service and recorded in the urls analytics DB.


Similar to the above, a cache (Redis/Memcached) could be used; looking up the origin URL from a short URL happens far more often. We could store the top 1,000 most-accessed short URLs in the cache and evict with LRU.
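
A sketch of that read-side cache, assuming Redis; the LRU cap on hot entries would normally be enforced by Redis itself (maxmemory plus maxmemory-policy allkeys-lru) rather than in application code, and the key prefix and TTL are assumptions:

```python
import redis

r = redis.Redis()   # configured with: maxmemory <size>, maxmemory-policy allkeys-lru

def resolve_short_url(short_hash: str, urls_db) -> str | None:
    """Cache-aside lookup: hot short URLs are served from Redis."""
    cached = r.get(f"url:{short_hash}")
    if cached is not None:
        return cached.decode()

    record = urls_db.find_by_hash(short_hash)   # stand-in for the urls DB
    if record is None:
        return None
    # Cache for a day; Redis LRU eviction keeps roughly the most-accessed URLs
    # once memory fills up.
    r.setex(f"url:{short_hash}", 86_400, record["origin_url"])
    return record["origin_url"]
```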



Trade offs/Tech choices


Using a key generation service helps us avoid generating the short URL (hash) on the fly, which usually involves random string generation and can lead to conflicts and retries.

But with a key generation service we need to pre-generate the keys, which means we must roughly estimate user demand so that we generate enough of them in advance. In addition, even though we can rely on the DB's auto-increment index, we can't store all hashes in a single DB if we want to scale. Thus we need multiple DBs and some globally coordinated index ranges, which adds complexity.


The key generation service makes a SQL DB a better choice here, as we can depend on the SQL auto-increment index and don't need to worry about hash conflicts within a single DB. If we generated URLs live instead, a NoSQL key-value DB (like DynamoDB) would be easier to scale.


The cleanup service runs daily, or whenever traffic is low, which means we won't recycle rarely accessed expired URLs right away. But since we have plenty of pre-generated keys (and can easily generate more), recycling URLs only once a day isn't a problem.



Failure scenarios/bottlenecks


As mentioned above, if we pre-generate keys, we need a good estimate of the traffic on our site so that we always have enough keys in store.


During key generation, if any DB goes down, we can assign its generation job to another DB, and we can keep replicas so that losing a generated keys DB isn't fatal. For the cache, losing a cache node doesn't hurt much: we only lose the hashes held in that cache, and there are still plenty of unused keys in the DB.


For URL retrieval, if traffic scales up we could partition the urls DB by the first one or two characters of the hash key, or hash the hash key and partition with consistent hashing. We could also replicate the urls DB by location; this may cause some consistency issues (we keep the service highly available), but we can guarantee eventual consistency: a URL is initially inserted into the geographically nearest urls DB, and an async job then replicates it to all other DBs.

To avoid different regions handing out the same short URL (hash), we could also place generated keys DBs in different regions and give each region its own index range for hash generation, or add a region-based prefix to the generated hash.
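
A minimal consistent-hashing sketch for mapping a short-URL hash to a urls-DB partition (illustrative only; the shard names and vnode count are assumptions, and a real deployment would use an existing library):

```python
import bisect
import hashlib

class HashRing:
    """Very small consistent-hash ring: key -> shard name."""

    def __init__(self, shards, vnodes=100):
        self._ring = []                     # sorted list of (point, shard)
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._point(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _point(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, self._point(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["urls-db-0", "urls-db-1", "urls-db-2"])
print(ring.shard_for("aZ3x9Q"))   # which partition stores this short-url hash
```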




Future improvements


We need rate limiting at the API gateway to prevent abuse / DDoS of our service. E.g. if a user/IP tries to generate more than 100 short URLs in 5 minutes, we return 429 and let them regain access after 1 hour plus some random number of minutes.
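
A sketch of that limit as a fixed window in Redis, using the numbers above (100 generations per 5 minutes, then a block of 1 hour plus jitter); the key names are assumptions:

```python
import random
import redis

r = redis.Redis()

LIMIT = 100           # max short-url generations
WINDOW = 5 * 60       # per 5-minute window
BLOCK = 60 * 60       # block for 1 hour + random jitter once exceeded

def allow_generate(client_id: str) -> bool:
    """Return False (caller answers 429) when the client exceeds the limit."""
    if r.exists(f"blocked:{client_id}"):
        return False

    key = f"gen_count:{client_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, WINDOW)        # start a new 5-minute window
    if count > LIMIT:
        r.setex(f"blocked:{client_id}", BLOCK + random.randint(0, 600), 1)
        return False
    return True
```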


We could analyze URL access patterns and then allocate more or fewer resources per region depending on its traffic.

As mentioned in the DB schema above, we could add a default expiration.


For disaster recovery, we could replicate the DB daily and write a log of every update to the master DB. This way, even when the master DB goes down, we can recover all data from the replica plus the logs. The same strategy can be applied to the caches and the message queue.