System requirements
Functional:
List functional requirements for the system (Ask interviewer if stuck)...
- User can create short url with given original/longer url
- User can access short url which will redirect them to the original url
- User can delete shortened url that they have created
Non-Functional:
List non-functional requirements for the system...
- Needs to be highly available
- Does not need strict consistency but should achieve high consistency - user creates short url and can access it from the shortened url shortly afterwards
- Reads >> writes (users will access URLs more often than they are created)
- User management (authn/authz) is out of scope
Capacity estimation
Estimate the scale of the system you are going to design...
- 100 million users
- User create 10 URLs
- Each URL - 100 clicks/month
- 100 * 10 = 1 billion URLs created
- 100 * 1 billion = 100 billion redirects per month
- Each URL original is 64 bytes
- 8 bytes for each shortened url
- ~72 billion bytes to store urls --> 72 GB to store urls
API design
Define what APIs are expected from the system...
POST /api/v1/url
Headers: User jwt token/session token
Body:
* original url (limit it to 64 bytes)
Response:
* shortened url
GET /api/v1/url/:shortened-url-id
Response:
* Original url --> used to do HTTP redirect to it
DELETE /api/v1/url/:shorted-url-id
Headers: user jwt/session token
Response:
* HTTP 204 if deleted
Database design
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
Entities:
* URL
- original url
- shortened url
- user_id who created
* User (out of scope)
* user_id
* ...
High-level design
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...
* Client iteracts with load balancer
* load balancer talks to multiple redundant servers
* Servers talk to DB to fetch data
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
* Create URL
- client issues a request to create url with given original url
- Hits load balancer which forwards request to server
- multiple different load balancer schemes - round robin, least network connections
- Can choose a smarter one like fewer network connections
- Load balancer will forward request to server
- Server will generate unique shortened url for the given original url
- Can be random hash of timestamp + user id + sequence number + machine number
- This will help avoid collisions
- Then store the result in DB
- Return shortened url to user
* Get URL
- Client issues request to access URL
- Hits LB
- Forward to server
- Server fetch entry from DB based off the shortened url which is the id + unique
- Return response with HTTP redirect to original url
* Delete URL
- Client issues request to delete URL
- Hits LB
- Forward to server
- Server will fetch entry from DB based off the shortened url
- Validate token and user id for that entry match else it is not permitted
- Delete entry from DB
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
- DB can be SQL database since the data we need to store isn't very large that it won't be able to fit into the DB
- DB can be scaled with read replicas, which introduce replica lag but will help with the high amount of reads and given we don't need strong consistency it should be fine if the url is not immediately accessible right after it is created
- Caching layer can be introduced in front of DB
- LRU cache
- Look aside cache, avoids caching entries that aren't accessed, with write-through will need more latency for writing and will cache everything that is written even if not accessed
- If needed to scale DB, can consider sharding, will shard based off unique shorted url
- Use consistent hashing to avoid hot spots and evenly spread load
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
* SQL DB since reads are more frequent than writes, and it makes look ups fast
* Look aside cache so we don't cache entries that aren't accesssed
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
* Bottleneck could be DB, so we add read replicas and also can consider sharding if we need to scale the DB
* With cache, potentially user can continue to access url after it is deleted, we can make sure we delete the entry in the cache after it is deleted from the DB, more complex logic and there may be a slight delay when the url is still accessible but should be fine given we are not strongly consistent there
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?
* Possibly geo replicate the DB so we store URLs closer to the geo graphic region where it was created, and depending on the access patterns, we can replicate it across geos