sho.rt/aB3kZ9) that redirects visitors back to the original.
POST /api/v1/urls Create short URL (body: long_url, alias?, ttl?)
GET /{code} Redirect to original URL (301/302)
GET /api/v1/urls/{code} Get URL metadata + analytics
PATCH /api/v1/urls/{code} Update alias, TTL, or active status
DELETE /api/v1/urls/{code} Deactivate / delete a short URL
GET /api/v1/urls/{code}/stats Click analytics (time-series, referrers, geo)
GET /api/v1/users/{uid}/urls List all URLs for a user (paginated)
The URL shortening service sits behind a global CDN layer that caches the most popular redirects at edge nodes close to users, so the majority of traffic never reaches the origin. Requests that miss the edge flow through an API gateway — which handles rate limiting and authentication — and are routed to one of two stateless service pools: a read-optimised redirect service for resolution lookups, and a write service for creating and managing URLs.
The redirect service checks a Redis cluster first; on a cache hit it returns the destination URL in under 5ms. On a miss it falls back to the primary database, backfills Redis, and warms the CDN for subsequent requests. Every resolved click asynchronously emits an event to a message queue, which feeds a stream processor that aggregates click analytics into a time-series store — completely off the critical path so redirects are never slowed by analytics writes.
The write service draws short codes from a pre-generated token pool rather than a live counter, avoiding hot-spot contention at scale. Persistent state lives in a primary relational or key-value database with read replicas serving analytics queries. A background worker continuously replenishes the token pool and a scheduled job handles TTL expiry, keeping the hot path free of any housekeeping work.
This is the most critical design decision. We need ~7-character codes that are globally unique and not guessable. Two viable strategies:
Base62 encoding of an ID — a central counter (or a distributed one via Twitter Snowflake) generates a monotonically increasing integer; we Base62-encode it. A 7-character Base62 string handles 62⁷ ≈ 3.5 trillion URLs. The downside is that sequential IDs produce predictable codes, making enumeration easy.
Pre-generated token pool — a background job pre-generates random Base62 codes, stores them in a "tokens" table, and atomically marks them used on demand. This avoids the hot counter bottleneck and produces unpredictable codes. It's the preferred approach at scale.
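The two strategies can be sketched side by side. This is a minimal illustration, not the service's actual code: the alphabet ordering, the names base62_encode, random_code, and TokenPool, and the in-memory sets standing in for the tokens table are all assumptions made for the example.

```python
import secrets
import string

# Digits, then uppercase, then lowercase: one common Base62 ordering (an assumption).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
CODE_LENGTH = 7


def base62_encode(n):
    """Strategy 1: encode a non-negative integer ID as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def random_code(length=CODE_LENGTH):
    """Strategy 2: draw an unpredictable code from a cryptographic RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


class TokenPool:
    """In-memory stand-in for the tokens table (hypothetical helper)."""

    def __init__(self):
        self._free = set()
        self._used = set()

    def replenish(self, target):
        """Top the pool up to `target` unused codes, skipping any already used."""
        while len(self._free) < target:
            code = random_code()
            if code not in self._used:
                self._free.add(code)

    def claim(self):
        """Remove one unused code from the pool and record it as used."""
        code = self._free.pop()
        self._used.add(code)
        return code
```

With this alphabet, sequential IDs 0, 1, 2 encode to "0", "1", "2", which shows why counter-based codes are easy to enumerate, whereas codes claimed from the pool carry no ordering information.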
The redirect path never writes to the database and has three tiers: CDN edge (fastest), Redis in-memory cache (fast), and DB (slowest, but the source of truth). On a cache miss at any tier, the tier below is queried and the result is backfilled upward. Click events are emitted asynchronously, so the redirect response is never blocked on analytics writes.
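The origin-side part of this lookup (cache first, database on a miss, backfill, then emit a click event) can be sketched as follows. Plain dictionaries stand in for Redis and the database, and the RedirectResolver class and emit_click callback are hypothetical names for this example; in a real service the callback would hand the event to a queue producer without blocking.

```python
from typing import Callable, Optional


class RedirectResolver:
    def __init__(self, cache: dict, db: dict, emit_click: Callable[[str], None]):
        self.cache = cache            # stand-in for Redis
        self.db = db                  # stand-in for the primary database
        self.emit_click = emit_click  # stand-in for a non-blocking queue producer

    def resolve(self, code: str) -> Optional[str]:
        """Return the destination URL for a code, or None if it is unknown."""
        long_url = self.cache.get(code)
        if long_url is None:
            long_url = self.db.get(code)
            if long_url is None:
                return None              # caller would answer 404
            self.cache[code] = long_url  # backfill the cache for later requests
        self.emit_click(code)            # recorded off the critical path
        return long_url
```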
The core urls table holds code (PK, indexed), long_url, user_id, created_at, expires_at, and is_active. A separate clicks table (or a time-series store like ClickHouse/TimescaleDB) stores code, timestamp, referrer, user_agent, country, and ip_hash. The tokens table for pre-generated codes holds token and used_at.
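As a rough sketch, the three tables could be declared like this using SQLite from Python's standard library. The column types, NULL conventions, and the claim_token helper are assumptions for illustration; the text above only names the columns.

```python
import sqlite3

SCHEMA = """
CREATE TABLE urls (
    code       TEXT PRIMARY KEY,
    long_url   TEXT NOT NULL,
    user_id    TEXT,
    created_at INTEGER NOT NULL,          -- Unix seconds (assumed representation)
    expires_at INTEGER,                   -- NULL means no expiry (assumed)
    is_active  INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE clicks (
    code       TEXT NOT NULL,
    timestamp  INTEGER NOT NULL,
    referrer   TEXT,
    user_agent TEXT,
    country    TEXT,
    ip_hash    TEXT
);
CREATE TABLE tokens (
    token   TEXT PRIMARY KEY,
    used_at INTEGER                       -- NULL until the token is claimed
);
"""


def create_schema(conn):
    conn.executescript(SCHEMA)


def claim_token(conn, now):
    """Mark one unused token as used and return it, or None if none is free."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT token FROM tokens WHERE used_at IS NULL LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The used_at IS NULL guard ensures a token can only be claimed once.
        cur = conn.execute(
            "UPDATE tokens SET used_at = ? WHERE token = ? AND used_at IS NULL",
            (now, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None
```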
Click events flow into Kafka. A stream processor (Flink or Lambda) aggregates them — total clicks, unique clicks by day, top referrers — and writes results to a read-optimised analytics store. Raw events are never queried directly.
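The aggregation step itself is simple to illustrate in isolation. The sketch below processes a batch of events held in memory rather than a Kafka stream, and it assumes each event is a dict with code, timestamp (Unix seconds), referrer, and ip_hash fields, with unique clicks counted by distinct ip_hash per day; none of those details are specified above.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def aggregate(events):
    """Roll raw click events up into the three summaries named above."""
    total = Counter()                 # code -> click count
    uniques = defaultdict(set)        # (code, day) -> distinct ip_hash values
    referrers = defaultdict(Counter)  # code -> referrer counts
    for event in events:
        day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).date().isoformat()
        total[event["code"]] += 1
        uniques[(event["code"], day)].add(event["ip_hash"])
        referrers[event["code"]][event["referrer"]] += 1
    return {
        "total_clicks": dict(total),
        "unique_clicks_by_day": {key: len(ips) for key, ips in uniques.items()},
        "top_referrers": {code: counts.most_common(3) for code, counts in referrers.items()},
    }
```

A stream processor would maintain the same counters incrementally per window instead of over a finished list.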
A scheduled worker scans the urls table for rows where expires_at < now() and marks them inactive. Redis TTLs are set to match the URL's expiry. This prevents serving stale redirects after expiry without scanning the DB on every request.
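A minimal sketch of the expiry job and the matching cache TTL calculation, again using SQLite as a stand-in database. The function names, integer Unix timestamps, and the default_ttl cap for URLs without an expiry are assumptions for this example.

```python
import sqlite3


def expire_urls(conn, now):
    """Deactivate every active row whose expires_at is in the past; return the count."""
    with conn:  # one transaction for the whole sweep
        cur = conn.execute(
            "UPDATE urls SET is_active = 0 "
            "WHERE is_active = 1 AND expires_at IS NOT NULL AND expires_at < ?",
            (now,),
        )
    return cur.rowcount


def cache_ttl_seconds(expires_at, now, default_ttl=3600):
    """TTL for a cache entry so that it never outlives the URL's own expiry."""
    if expires_at is None:
        return default_ttl
    return max(0, min(default_ttl, expires_at - now))
```

A TTL of 0 signals that the entry should not be cached at all, since the URL has already expired.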
Redirect and write services are stateless — they hold no session or URL state locally (that all lives in Redis and the database). This means you can run any number of instances behind the load balancer and add or remove them freely. An auto-scaling group watches CPU and request-queue depth; when either crosses a threshold, new instances spin up within ~60 seconds. Because each instance pre-fetches its own token buffer independently, there's no coordination overhead when you scale out the write tier.
Redis scales via cluster mode: the keyspace is divided into 16,384 hash slots (each key maps to a slot via a CRC16 of the key), and the slots are distributed across multiple nodes. Adding a node migrates a subset of slots to it with no downtime. For read-heavy workloads you add read replicas; writes still go to the primary shards.
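For reference, Redis Cluster places a key by taking a CRC16 of it (the XMODEM variant) modulo 16,384 hash slots. The sketch below reproduces that mapping with the standard library; the key_slot name is hypothetical, and hash tags (the {...} syntax that forces keys into the same slot) are deliberately left out.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Map a key to a Redis Cluster hash slot (hash tags are not handled)."""
    # binascii.crc_hqx with an initial value of 0 computes CRC16/XMODEM.
    return binascii.crc_hqx(key, 0) % HASH_SLOTS
```

Because nodes own slots rather than individual keys, adding a node only requires moving some slots to it; every other key stays where it was.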
The database scales reads horizontally through read replicas: analytics queries and metadata lookups hit replicas, never the primary. Writes are the harder problem: if write volume grows beyond a single primary's capacity, you partition (shard) the urls table by a hash of the short code, spreading writes across multiple primaries. Each shard owns a disjoint subset of the token pool.
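Routing a short code to its shard can be as small as the sketch below. The choice of SHA-256 and simple modulo placement is an assumption for illustration, and shard_for is a hypothetical helper.

```python
import hashlib


def shard_for(code: str, num_shards: int) -> int:
    """Pick the shard index for a short code from a stable hash of the code."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(code.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

One caveat with plain modulo placement is that changing num_shards remaps most codes, so resharding needs either a data migration or a scheme with a fixed number of logical partitions mapped onto physical primaries.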
The analytics pipeline scales by adding Kafka partitions and stream processor instances in lockstep — Kafka's consumer group model ensures each partition is consumed by exactly one processor at a time, so you scale throughput linearly by adding both.
Burst traffic is handled at multiple layers:
CDN absorption — a viral short URL that suddenly receives millions of hits will have its redirect cached at every CDN edge node after the first request per PoP. Subsequent hits never reach the origin at all. This is the single most effective burst buffer in the system.
API gateway rate limiting: each client (by IP or API key) is capped at a configurable request rate using a token bucket algorithm. When the limit is exceeded, the gateway returns a 429 Too Many Requests response with headers that tell the client exactly what to do:
Retry-After: 4
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1742123456
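Retry-After is the backoff hint: clients and SDKs that respect it wait that many seconds before retrying, which helps prevent a thundering herd from hammering the service the moment a rate limit window resets. The X-RateLimit-* headers report the configured limit, the requests remaining in the current window, and the time the window resets as a Unix timestamp.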
Redis as a shock absorber — because redirects are served from Redis rather than the database, a sudden spike in reads hits an in-memory store capable of millions of ops/second, not a disk-backed database. The database is largely insulated from read bursts.
Write service admission control — if the token buffer runs low or the database write queue backs up, the write service can shed load by returning 503 Service Unavailable with a Retry-After header, signalling clients to back off rather than pile on.
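The token bucket behind the gateway's rate limiter can be sketched as follows. This is a single-process illustration under assumed names (TokenBucket, allow) with an injectable clock for testing; a real gateway would keep one bucket per client in shared storage, and X-RateLimit-Reset is omitted here because it depends on wall-clock time.

```python
import math
import time


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def _refill(self):
        """Add the tokens earned since the last call, capped at capacity."""
        now = self.clock()
        earned = (now - self.updated) * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now

    def allow(self):
        """Return (allowed, headers) for one incoming request."""
        self._refill()
        headers = {"X-RateLimit-Limit": str(self.capacity)}
        if self.tokens >= 1:
            self.tokens -= 1
            headers["X-RateLimit-Remaining"] = str(int(self.tokens))
            return True, headers
        # Seconds until one full token is available, rounded up.
        wait = math.ceil((1 - self.tokens) / self.refill_rate)
        headers["X-RateLimit-Remaining"] = "0"
        headers["Retry-After"] = str(wait)
        return False, headers
```

When allow() returns False, the gateway would respond with 429 and the returned headers.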
No single points of failure — every tier runs multiple instances. The load balancer performs health checks every few seconds; an unhealthy instance is removed from rotation automatically within one check interval.
Redis failure — if Redis becomes unavailable, the redirect service falls back to the database directly. Latency rises but the service stays up. A Redis Sentinel or cluster setup with automatic failover keeps this fallback rare.
Database failure — the primary DB runs with synchronous replication to a standby. If the primary fails, the standby is promoted automatically (via tools like Patroni for PostgreSQL). Read replicas reconnect to the new primary. Target recovery time is under 30 seconds.
CDN as a buffer during origin outages — because popular redirects are cached at the edge with a non-zero TTL, a brief origin outage is invisible to the majority of redirect traffic. Only cache misses during the outage window experience errors.
Circuit breaker pattern — each service wraps its downstream calls (Redis, DB, token pool) in a circuit breaker. If the error rate on a downstream call exceeds a threshold, the breaker opens and the service immediately returns a degraded response or cached result rather than waiting for timeouts to pile up. This prevents a slow dependency from cascading into a full outage.
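A stripped-down circuit breaker looks roughly like this. Note the simplification: it trips on a count of consecutive failures rather than on an error rate over a window, and the class names, thresholds, and injectable clock are assumptions for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so this one call goes through as a trial.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would catch CircuitOpenError and serve the degraded response or cached result described above.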
Idempotent writes — URL creation requests include a client-generated idempotency key. If a request is retried after a timeout, the write service detects the duplicate key and returns the original response rather than creating a second short URL.
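The duplicate detection can be illustrated with an in-memory sketch. WriteService, allocate_code, and the response shape are hypothetical; in a real deployment the stored responses would live in the database and the check-and-insert would need to be atomic (for example via a unique constraint on the key), which this single-threaded example does not attempt.

```python
class WriteService:
    def __init__(self, allocate_code):
        self.allocate_code = allocate_code  # hypothetical: e.g. a claim from the token pool
        self.responses = {}                 # idempotency key -> first response
        self.urls = {}                      # code -> long_url

    def create(self, idempotency_key, long_url):
        """Create a short URL once per idempotency key; replay the response on retries."""
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        code = self.allocate_code()
        self.urls[code] = long_url
        response = {"code": code, "long_url": long_url}
        self.responses[idempotency_key] = response
        return response
```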