My Solution for Designing a Simple URL Shortening Service: A TinyURL Approach
by john_chen
System requirements
Functional:
- The service should generate a shorter, unique alias for a given URL, known as a short link, which should be easy to copy and paste.
- When users access a short link, the service should redirect them to the original URL.
- Users should have the option to choose a custom short link for their URL.
- Links will expire after a default timespan, but users should be able to specify a custom expiration time.
Non-Functional:
- The system must be highly available to ensure continuous URL redirections.
- URL redirection should occur in real-time with minimal latency.
- Shortened links should be unpredictable to prevent guessing.
Capacity estimation
Read-Heavy Nature:
- Our system will experience significantly more read operations (redirection requests) compared to write operations (new URL shortenings), with an assumed read-to-write ratio of 100:1.
Traffic Estimates:
- Assuming 500 million new URL shortenings per month, the system will handle 50 billion redirections during the same period, calculated as follows:
100 * 500M = 50B
Queries Per Second (QPS) Calculation:
- New URL shortenings per second:
500 million / (30 days * 24 hours * 3600 seconds) ≈ 200 URLs/s
- URL redirections per second, given the 100:1 read/write ratio:
100 * 200 URLs/s = 20K/s
Storage Estimates:
- Assuming each URL shortening request (and its associated short link) is stored for 5 years, the total number of stored objects will be 30 billion:
500 million * 5 years * 12 months = 30 billion
- Estimating each object to be approximately 500 bytes, the total storage needed is 15TB:
30 billion * 500 bytes = 15TB
Bandwidth Estimates:
- For write requests, with 200 new URLs per second, the total incoming data will be 100KB per second:
200 * 500 bytes = 100KB/s
- For read requests, with approximately 20K URL redirections per second, the total outgoing data will be 10MB per second:
20K * 500 bytes = 10MB/s
Memory Estimates for Caching:
- To cache frequently accessed URLs (following the 80-20 rule, where 20% of URLs generate 80% of the traffic), we need to consider:
20K requests/second * 3600 seconds/hour * 24 hours ≈ 1.7 billion requests/day
- To cache 20% of these requests:
0.2 * 1.7 billion * 500 bytes ≈ 170GB
- Note: Due to duplicate requests for the same URL, actual memory usage will be less than 170GB.
High-Level Estimates Summary:
- New URLs: 200/s
- URL redirections: 20K/s
- Incoming data: 100KB/s
- Outgoing data: 10MB/s
- Storage for 5 years: 15TB
- Memory for cache: 170GB
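These back-of-the-envelope numbers are easy to sanity-check in a few lines of Python. The constants below (500M new URLs/month, a 100:1 read/write ratio, 500-byte objects, 5-year retention) are exactly the assumptions made above; nothing new is introduced.

```python
# Sanity check of the capacity estimates above.
NEW_URLS_PER_MONTH = 500_000_000
READ_WRITE_RATIO = 100
OBJECT_SIZE_BYTES = 500
RETENTION_YEARS = 5
SECONDS_PER_MONTH = 30 * 24 * 3600

writes_per_sec = NEW_URLS_PER_MONTH / SECONDS_PER_MONTH      # ~193, rounded to 200
reads_per_sec = writes_per_sec * READ_WRITE_RATIO            # ~20K
stored_objects = NEW_URLS_PER_MONTH * 12 * RETENTION_YEARS   # 30 billion
total_storage = stored_objects * OBJECT_SIZE_BYTES           # 15 TB
incoming = writes_per_sec * OBJECT_SIZE_BYTES                # ~100 KB/s
outgoing = reads_per_sec * OBJECT_SIZE_BYTES                 # ~10 MB/s
daily_requests = reads_per_sec * 24 * 3600                   # ~1.7 billion/day
cache_memory = 0.2 * daily_requests * OBJECT_SIZE_BYTES      # ~170 GB (80-20 rule)

print(f"writes/s: {writes_per_sec:,.0f}  reads/s: {reads_per_sec:,.0f}")
print(f"storage: {total_storage / 1e12:.1f} TB  cache: {cache_memory / 1e9:.0f} GB")
```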
API design
Function: createURL
Parameters:
api_dev_key (string): The API developer key of a registered account, used for throttling users based on their allocated quota.
original_url (string): The original URL to be shortened.
custom_alias (string, optional): A custom key for the URL.
user_name (string, optional): A user name to be used in the encoding.
expire_date (string, optional): The expiration date for the shortened URL.
Returns:
On success, returns the shortened URL.
On failure, returns an error code.
Function: deleteURL
Parameters:
api_dev_key (string): The API developer key of a registered account.
url_key (string): The key of the shortened URL to be deleted.
Returns:
On success, returns ‘URL Removed’.
On failure, returns an error code.
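As a minimal sketch, the two endpoints above might look like the following Python signatures; the parameter names mirror the definitions above, while the return conventions are illustrative assumptions rather than a fixed contract.

```python
from typing import Optional

def createURL(api_dev_key: str,
              original_url: str,
              custom_alias: Optional[str] = None,
              user_name: Optional[str] = None,
              expire_date: Optional[str] = None) -> str:
    """Returns the shortened URL on success; an error code on failure."""
    ...

def deleteURL(api_dev_key: str, url_key: str) -> str:
    """Returns 'URL Removed' on success; an error code on failure."""
    ...
```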
Abuse Detection and Prevention: To protect against abuse, such as a malicious user consuming all available URL keys, we can implement rate limiting based on the api_dev_key. Each developer key can be restricted to a certain number of URL creations and redirections within a specified time frame, which can be configured differently for each developer key.
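One common way to realize this throttling is a token bucket per api_dev_key. The sketch below is an in-memory illustration (the rate and capacity values are assumptions); a real deployment would back the counters with a shared store so all application servers see the same quota.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allows `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per developer key; quotas can be configured per key.
buckets = defaultdict(lambda: TokenBucket(rate=10, capacity=100))

def is_allowed(api_dev_key: str) -> bool:
    return buckets[api_dev_key].allow()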
Database design
Data Observations:
- We need to store billions of records.
- Each stored object is small (less than 1KB).
- There are no relationships between records, except for tracking which user created a URL.
- Our service is read-heavy.
Database Schema:
We will need two tables:
- One for storing information about the URL mappings.
- One for storing data about the users who created the short links.
Database Choice:
Given that we expect to store billions of rows and do not require relationships between objects, a NoSQL database such as DynamoDB, Cassandra, or Riak is ideal. These NoSQL databases are also easier to scale, making them suitable for our needs.
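To make the schema concrete, here is one possible shape for the two tables, written as CQL strings (Cassandra being one of the candidates above). Beyond the key-to-URL mapping and the creating user, the column names are illustrative assumptions.

```python
# Illustrative CQL for the two tables; column names are assumptions.
URL_TABLE = """
CREATE TABLE urls (
    url_key         text PRIMARY KEY,   -- the short key, e.g. 'vzet59pa'
    original_url    text,
    creation_date   timestamp,
    expiration_date timestamp,
    user_id         bigint              -- who created the short link
);
"""

USER_TABLE = """
CREATE TABLE users (
    user_id       bigint PRIMARY KEY,
    name          text,
    creation_date timestamp
);
"""
```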
High-level design
The challenge is to generate a short and unique key for a given URL. For example, the shortened URL "https://tinyurl.com/vzet59pa" has "vzet59pa" as the short key. Here are two potential solutions:
Solution 1: Encoding the Actual URL
- Hashing the URL:
- Compute a unique hash (e.g., MD5 or SHA256) for the given URL.
- Encode this hash using base36 ([a-z0-9]), base62 ([A-Za-z0-9]), or Base64 (adding ‘+’ and ‘/’).
- Decide on the length of the short key, e.g., 6, 8, or 10 characters.
- Base64 Encoding:
- A 6-character key in Base64 can produce 64^6 ≈ 68.7 billion unique strings.
- An 8-character key in Base64 can produce 64^8 ≈ 281 trillion unique strings.
- Six-letter keys (68.7B unique strings) should suffice for our system.
- Handling Key Duplication:
- MD5 produces a 128-bit hash value, which yields a 22-character string after Base64 encoding (ignoring padding).
- For a 6-character key, take the first 6 characters. In case of duplicates, use other characters from the encoded string or swap characters.
- Challenges:
- Multiple users submitting the same URL will receive the same short URL, which is not acceptable.
- URLs with different encodings (e.g., http://example.com?id=design vs. http://example.com%3Fid%3Ddesign) should be treated as identical.
- Workaround:
- Append an increasing sequence number to each input URL to ensure uniqueness, and then generate its hash. This sequence number does not need to be stored in the database.
- Alternatively, append the user ID to the input URL. If the user is not signed in, ask for a unique key. Continue generating keys until a unique one is found.
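A sketch of Solution 1 under the assumptions above: MD5 the URL (plus an optional disambiguating suffix such as a sequence number or user ID), Base64-encode the digest, and keep the first six characters. Note the URL-safe Base64 variant is used here, substituting ‘-’ and ‘_’ for ‘+’ and ‘/’ so the key is safe inside a URL; the `exists` callback standing in for the database check is an assumption.

```python
import base64
import hashlib

def short_key(original_url: str, suffix: str = "", length: int = 6) -> str:
    """MD5 gives a 128-bit digest; Base64 turns it into a 22-character
    string (ignoring padding), of which we keep the first `length`."""
    digest = hashlib.md5((original_url + suffix).encode()).digest()
    encoded = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return encoded[:length]

def create_unique_key(original_url: str, exists) -> str:
    """Workaround from above: append an increasing sequence number and
    rehash until the resulting key is not already taken."""
    seq = 0
    while True:
        key = short_key(original_url, suffix=str(seq) if seq else "")
        if not exists(key):
            return key
        seq += 1
```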
Solution 2: Generating Keys Offline
- Key Generation Service (KGS):
- KGS generates random six-letter strings in advance and stores them in a database (key-DB).
- When shortening a URL, retrieve a pre-generated key from the key-DB. This approach avoids encoding and eliminates duplication concerns.
- Concurrency Issues:
- Keys must be marked as used in the database immediately to prevent multiple servers from using the same key.
- KGS can use two tables: one for unused keys and one for used keys. As soon as keys are assigned to a server, they are moved to the used keys table.
- KGS keeps some keys in memory for quick access. If KGS fails before using all keys, these keys are lost, which is acceptable given the large pool of keys.
- Preventing Duplicate Keys:
- Synchronize or lock the data structure holding keys before removing and assigning them to a server.
- Key-DB Size:
- Base64 encoding can generate 68.7 billion unique six-letter keys.
- Storing these keys at one byte per character requires: 6 bytes (per key) * 68.7B (unique keys) ≈ 412GB
- Single Point of Failure:
- A standby replica of KGS can take over if the primary server fails.
- App Server Caching:
- Application servers can cache keys from key-DB to speed up the process. Losing these cached keys is acceptable due to the vast number of available keys.
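A minimal sketch of the KGS hand-out path under the design above: keys are pre-generated offline, and a lock guards the hand-out so two servers can never receive the same key. The in-memory sets stand in for the unused-keys and used-keys tables.

```python
import random
import string
import threading

ALPHABET = string.ascii_letters + string.digits + "-_"  # 64 Base64 symbols

def generate_keys(n: int, length: int = 6) -> set:
    """Offline step: pre-generate n random six-character keys."""
    keys = set()
    while len(keys) < n:
        keys.add("".join(random.choices(ALPHABET, k=length)))
    return keys

class KeyGenerationService:
    def __init__(self, pregenerated: set):
        self._unused = pregenerated      # stands in for the unused-keys table
        self._used = set()               # stands in for the used-keys table
        self._lock = threading.Lock()    # prevents concurrent double hand-outs

    def next_key(self) -> str:
        with self._lock:
            key = self._unused.pop()     # mark as used immediately
            self._used.add(key)
            return key

kgs = KeyGenerationService(generate_keys(1_000))
print(kgs.next_key())
```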
Key Lookup and Custom Aliases:
- Key Lookup:
- Look up the key in the database to retrieve the full URL.
- If found, issue an “HTTP 302 Redirect” with the stored URL in the “Location” field.
- If not found, issue an “HTTP 404 Not Found” status or redirect to the homepage.
- Custom Aliases:
- Users can provide custom aliases, but this is optional.
- Imposing a size limit (e.g., 16 characters) on custom aliases keeps key sizes bounded and the URL database consistent.
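The lookup path can be expressed as a small HTTP handler. Flask is used here purely as an illustrative assumption, and `lookup_url` is a hypothetical stand-in for the cache-then-database read described above.

```python
from flask import Flask, abort, redirect

app = Flask(__name__)

def lookup_url(key: str):
    """Check the cache first, then the database; None if absent or expired."""
    ...  # assumed to be backed by Memcached plus the URL table

@app.route("/<key>")
def follow(key: str):
    original = lookup_url(key)
    if original is None:
        abort(404)                       # or redirect to the homepage instead
    return redirect(original, code=302)  # 302 with the URL in 'Location'
```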
Request flows
In the write path, the client calls createURL through the load balancer; an application server takes an unused key from KGS (or its locally cached batch), stores the key-to-URL mapping in the database, and returns the shortened URL. In the read path, the client requests a short link; the load balancer forwards it to an application server, which checks the cache for the key, falls back to the database on a miss (populating the cache), and responds with an HTTP 302 redirect to the original URL, or a 404 if the key is unknown or expired.
Detailed component design
Data Partitioning and Replication
To scale our database for storing billions of URLs, we need an effective partitioning scheme to distribute data across different database servers.
a. Range-Based Partitioning:
- URLs are stored in partitions based on the first letter of the hash key. For example, URLs with hash keys starting with 'A' (and 'a') are stored in one partition, 'B' in another, and so on.
- Less frequent letters can be combined into a single partition. This static partitioning ensures a predictable storage and retrieval process.
- A potential issue with this approach is unbalanced partitions. For example, if many URLs start with 'E', the partition for 'E' could become overloaded.
b. Hash-Based Partitioning:
- This method involves hashing the object (e.g., the short link) and using the hash value to determine the partition.
- The hash function randomly distributes URLs across partitions (e.g., mapping each 'key' to a number between 1 and 256).
- Although this method reduces the chance of unbalanced partitions, it can still occur. Consistent Hashing can resolve this issue by dynamically distributing the load.
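Consistent Hashing can be sketched compactly: servers are placed on a ring at many hashed positions (virtual nodes), and a key belongs to the first server position at or after its hash. MD5 for placement and the server names below are assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes: int = 100):
        self._ring = []                          # sorted (hash, server) points
        for server in servers:
            for i in range(vnodes):              # virtual nodes smooth the load
                bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def partition_for(self, short_key: str) -> str:
        """Walk clockwise to the first server at or after the key's hash."""
        idx = bisect.bisect(self._ring, (self._hash(short_key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.partition_for("vzet59pa"))
```

Adding or removing a server only remaps the keys adjacent to its ring positions, which is what keeps partitions balanced as the cluster changes.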
Cache
We can improve performance by caching frequently accessed URLs with an off-the-shelf solution like Memcached, keyed by the short key (hash) and storing the full URL. Application servers can check the cache before querying the backend storage.
- Cache Memory Requirements:
- Start by caching 20% of daily traffic. Based on usage patterns, adjust the number of cache servers as needed.
- As estimated, we need 170GB to cache 20% of daily traffic. A modern server with 256GB of memory can handle this, or use multiple smaller servers.
- Cache Eviction Policy:
- Least Recently Used (LRU): Discards the least recently used URL first. A Linked Hash Map or similar data structure can store URLs and their hashes, tracking recent accesses (see the sketch after this list).
- Replication: Distribute the load by replicating cache servers. When a cache miss occurs, update the cache and propagate the new entry to all replicas. Each replica updates its cache, ignoring duplicates if necessary.
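As a concrete illustration of the LRU policy, Python's OrderedDict provides the linked-hash-map behavior in a few lines; the capacity value would be derived from the 170GB estimate above.

```python
from collections import OrderedDict

class LRUCache:
    """Maps short keys to full URLs, evicting the least recently used entry."""
    def __init__(self, capacity: int):
        self._capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                     # miss: caller falls back to the DB
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, url):
        self._data[key] = url
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict the least recently used
```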
Load Balancer (LB)
We can implement a load balancing layer at three key points in our system:
- Between Clients and Application Servers
- Between Application Servers and Database Servers
- Between Application Servers and Cache Servers
Initially, we could use a simple Round Robin approach to distribute incoming requests equally among backend servers. This method is straightforward to implement and introduces minimal overhead. Paired with basic health checks, the load balancer can also take a non-responsive server out of the rotation so it stops receiving traffic.
However, the Round Robin approach does not account for server load. If a server is overloaded or slow, it will still receive new requests. To address this, we can implement a more advanced load balancer that periodically checks the load on backend servers and adjusts traffic distribution accordingly.
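A toy version of this progression: plain Round Robin is just a cycle over the backends, and a health-check hook is the minimal extension that keeps a dead server out of rotation. The `is_healthy` probe is an assumed stand-in for the load balancer's periodic checks.

```python
import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
        self._count = len(servers)

    def next_server(self, is_healthy) -> str:
        """Return the next responsive backend, skipping unhealthy ones."""
        for _ in range(self._count):
            server = next(self._cycle)
            if is_healthy(server):
                return server
        raise RuntimeError("no healthy backend available")

lb = RoundRobinBalancer(["app1", "app2", "app3"])
print(lb.next_server(lambda server: True))
```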
Purging or DB Cleanup
Should entries remain indefinitely, or should they be purged upon expiration? If a user-specified expiration time is reached, how should the link be handled?
- Continuously searching for expired links to remove them would strain our database. Instead, we can employ a lazy cleanup approach, gradually removing expired links. Some expired links may linger in storage for a while, but they are never returned to users.
- When a user attempts to access an expired link, we can delete it and return an error.
- A separate Cleanup service can periodically remove expired links from storage and cache. This lightweight service should run during periods of low user traffic.
- We can set a default expiration time for each link, such as two years.
- After removing an expired link, its key can be returned to the key-DB for reuse.
- Deciding whether to remove links that haven't been visited for a certain period, such as six months, can be complex. Given the decreasing cost of storage, we may opt to retain links indefinitely.
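The lazy-expiration scheme above amounts to two small pieces: an on-access check that deletes expired links as they are hit, and a periodic sweep for the rest. `db` and `key_db` are hypothetical interfaces to the URL store and the key-DB.

```python
from datetime import datetime, timezone

def get_url_or_expire(key: str, db, key_db):
    """On access: serve the URL, or lazily delete it if it has expired."""
    record = db.get(key)                        # assumed lookup
    if record is None:
        return None
    if record.expiration_date < datetime.now(timezone.utc):
        db.delete(key)                          # remove the expired mapping
        key_db.release(key)                     # return the key for reuse
        return None
    return record.original_url

def cleanup_sweep(db, key_db, batch_size: int = 1000):
    """Lightweight periodic sweep; schedule during low-traffic windows."""
    for record in db.find_expired(limit=batch_size):  # assumed query
        db.delete(record.key)
        key_db.release(record.key)
```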
Telemetry
To track how many times a short URL has been used, user locations, and other statistics, consider how to store this information. Updating a database row on every view will struggle under a high volume of concurrent requests for popular URLs; a batched-counting sketch follows the list below.
Statistics to track include:
- Country of the visitor
- Date and time of access
- Referring web page
- Browser or platform used for access
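As referenced above, one way to absorb write bursts on popular links is to count views in memory and flush aggregates in batches; the `db.increment_views` call below is an assumed batch-write interface.

```python
import threading
from collections import Counter

class ClickAggregator:
    """Buffers view counts in memory; flushes aggregates to the DB in batches."""
    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def record(self, key: str, country: str, referrer: str):
        with self._lock:
            self._counts[(key, country, referrer)] += 1

    def flush(self, db):
        with self._lock:
            pending, self._counts = self._counts, Counter()
        for (key, country, referrer), n in pending.items():
            db.increment_views(key, country, referrer, n)  # assumed batch write
```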
Security and Permissions
Can users create private URLs or restrict access to specific users?
We can store the permission level (public/private) with each URL in the database. Additionally, a separate table can store UserIDs that have permission to access a specific URL. If a user without permission tries to access a URL, we can return an HTTP 401 error. In a NoSQL wide-column database like Cassandra, the table storing permissions would use the URL 'Hash' (or KGS generated 'key') as the key. The columns will store the UserIDs of users who have permission to access the URL.