Requirements
Functional Requirements:
- Create a short URL for a given long URL.
- Return the long URL associated with a given short URL.
1
Non-Functional Requirements:
- Eventual Consistency - short-long url pair is not modifiable, so main load is read requests. we can afford data to be stale.
- Low latency - we want return endpoint to be quick. Otherwise it will take long time for user who clicks on short link to be redirected. As we have 100M read requests a day, RPS is following: 10,000,000 requests / 86,400 seconds ≈ 115.7 RPS. I would assume 50% of traffic occurs in 3–4 hours. 5M requests / 10,800 seconds ≈ 460+ RPS. Given that handshake takes up to 50ms, db query takes up to 5ms and we have context switching overhead - it we have target of 200ms response - we would end up with 460 x 0.2 = 92 concurrent processings needed. This requires several instances, but still doable. Go, P50 <40ms, P95 < 200ms, P99 <250ms is the final target. Latency for creating a link is more, let's say P95<500ms
- Availability - we want return endpoint to be highly available. Otherwise user will not be able to be redirected. 99.9% is 9 hours a year. 99.999% is 5 minutes a year. This would require multiple availablility zones active-active setup.
- Horizontal Scalability (data partition) - we want to be scalable in case we get huge read traffic.
API Design
Write path: client sends us full URL. We check if it already exists in DB. If yes, we return it. If no, we generate short URL. We store it in db as key-value pair. We return short URL.
Read path: Client sends us short URL and expects to get long url. We read the database/cache, and return long URL. We return 404 if not pair url found.
High-Level Design
We will need API gateway at the entrance will receive request, log it, apply rate limiting, do authorisation via json web token (or OAuth - maybe delegate to trusted providers), forward the request and it will also have load balancing to distribute requests across multiple instances its forwarding to
API gateway would expose /getLongUrl endpoint and /createShortUrl endpoints. it would redirect to main backend-server.
Then i would introduce main back-end server. It's responsible for handling both requests. When generating short URL, i would generate hashcode of it, and attempt to store first 6-7 characters. I would use sha256 for it. That would give me 3T of different combinations, so collisions would be rare. If it is there already, i would attempt to move sliding window of 6-7 characters along generated hashcode. Given that user is not generating urls often, target is <1s so we can afford retries.
As a database, i would go for non-relational db - we have standard key-value pairs. Redis is the first idea candidate. It would be responsible for storing both short and long urls. Redis is also in memory cache that has built in persistent - perfect for our case. I would use two tables. First is would have key as a shortUrl and value as a LongUrl. It would be used for read request. Second would have key as LongUrl and value as Short Url. It would be used to check if we already have URL for create request. Keys in both cases represent the url. And url should always lead to the same place. This that in case we face collision - it simply means we already have the shortUrl generated, and should just return it
Read path: API gateway (/getLongUrl) -> backend -> Redis (check if we have it) -> backend (return) -> API gateway
Write path: API gateway (/createShortUrl) -> backend -> Redis (check if we already have it) -> backend -> (create pair and return existing) -> Redis (save newly created only if we created pair)
Detailed Component Design
Requests to backend would be rooted by our API gateway, ethier by round-robin or by least connections principle.
Then i would introduce back-end server. It's responsible for handling both requests. When generating short URL, i would generate hashcode of it, and attempt to store first 6-7 characters. I would use sha256 for it. That would give me 3T of different combinations, so collisions would be rare. If it is there already, i would attempt to move sliding window of 6-7 characters along generated hashcode. Given that user is not generating urls often, target is <1s so we can afford retries. As it is not storing any data at runtime, it is stateless and we can have active-active configuration. If one instance goes down, other will take over the trafic. no prewarm needed - no downtime. As for split brain issue - when two instances consider themselves masters - we don't have modify endpoints - only create and read. For this reason collisions are not possible. for the same reason, we can scale those instances horizontaly when we have burst traffics. Retries could be handled by API gateway.
For disabled or edited links, we could have separate backend endpoint.
/api/disable/{shortUrl}. It would acess the cache and remove corresponding rows. First by taking long Id as a value from first table, then removing both rows in both tables by primary key. For editing. Customer cannot edit them as per functional requirements.
For this, we would have to lock the row before modifying it - in order to avoid concurrent interactions. I wouldn't purge the cash, only to invalidate single link, as restoring it is expensive process.
In order to decide, if cache can act like a persistent db, i will run calculations. Given that one entity contains following: shortUrl (250b), longUrl (1Kb), isEnabled(1b), we have 1.25kb per one entity. With 100M reads and read:write ration 100:1 we end up with 1M writes a day. With retention policy (TTL per entry) of 1 year, we will have 1.25kb x 1M x 365 = 456 gb of data. One single Redis instance can easily handle up to 100gb of data. Therefore, redis is not an enough solution.
As for the bandwith, we have for write, we have 500RPS x 1.250kb = 625kb/s
I would keep redis a a cache, having only 10% of the hottest trafic in the cache. This could be achieved by daily eviciton of the cache.
As a database, i would go for non-relational db - we have standard key-value pairs. MongoDb is the first idea candidate It would be responsible for storing both short and long urls. I would use two tables. First is would have key as a shortUrl and value as a LongUrl. It would be used for read request. Second would have key as LongUrl and value as Short Url. It would be used to check if we already have URL for create request. Keys in both cases represent the url. And url should always lead to the same place. This that in case we face collision - it simply means we already have the shortUrl generated, and should just return it
For a very hot IDs, we have in-memory cache: Redis. It can handle up to 1M request per second, which is sufficiend even for the celebrity problem - when one famous person creates a link and millions click on it within a minutes. In order to improve that, we could potentially add small in-memory cache like small hashmaps to account for the hottest IDs.