Load Balancing Fundamentals
Why does the L4 vs L7 distinction come up in every system design interview? Because that single choice decides whether your load balancer is a fast, protocol-agnostic traffic forwarder or a content-aware routing engine. The trade-off is always the same: L4 gives you speed and simplicity, while L7 gives you intelligence at the cost of extra processing.

Layer 4 load balancing operates on transport-layer metadata: source IP, destination IP, source port, destination port, and protocol number. It never looks inside the payload. A TCP SYN arrives, the balancer picks a backend using its algorithm, and every subsequent packet on that connection follows the same path. Because the balancer does not parse HTTP, gRPC, or any application protocol, it handles raw connections extremely quickly and works with any protocol that runs over TCP or UDP.
Consider a game server cluster that uses a custom binary protocol on TCP port 9000. An L4 balancer distributes connections across the fleet without knowing or caring what game packets look like. A video streaming service running a proprietary protocol over UDP benefits the same way. That protocol-agnosticism is L4's core strength: you never need to teach the balancer about your wire format.
From a performance perspective, L4 load balancers are lean. They do not terminate TLS, do not buffer full HTTP requests, and do not evaluate routing policies per request. A well-tuned L4 proxy on modest hardware can handle millions of concurrent connections because the per-connection overhead is minimal. AWS Network Load Balancer and Linux IPVS are examples of production-grade L4 solutions built for this scale.
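The connection-level behavior described above can be sketched as follows. This is a minimal illustration of 4-tuple hashing, not a real implementation; production L4 balancers such as IPVS or NLB do this in the kernel or in dedicated networking hardware, and the backend addresses here are made up.

```python
import hashlib

def pick_backend(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                 backends: list[str]) -> str:
    """Hash the connection 4-tuple so every packet on one connection
    maps to the same backend -- no payload inspection needed."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]

backends = ["10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000"]
# The same 4-tuple always lands on the same backend, so all packets
# of one connection follow one path.
first = pick_backend("203.0.113.7", 54321, "198.51.100.1", 9000, backends)
again = pick_backend("203.0.113.7", 54321, "198.51.100.1", 9000, backends)
assert first == again
```

Because the hash input is only transport metadata, this works identically for the game protocol on port 9000 and for any other TCP or UDP service.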
Layer 7 load balancing terminates or at least understands the application protocol. For HTTP traffic, it can inspect the host header, URL path, method, query parameters, cookies, and custom headers. For gRPC, it can read the service and method names. That visibility enables powerful routing decisions that L4 simply cannot make.
Here are some examples of L7 routing rules that are impossible at L4:
- Send `/api/v2/*` to a new backend service while `/api/v1/*` stays on the legacy fleet
- Route requests with header `x-canary: true` to a canary deployment for A/B testing
- Strip authentication tokens at the edge and inject verified identity headers
- Serve static assets from a CDN origin while forwarding API calls to the application fleet
- Rate-limit by API key, user ID, or endpoint path
The L7 balancer also terminates TLS, which means it can offload certificate management from your backends, inspect encrypted traffic after decryption, and rewrite headers before forwarding. That is how modern edge proxies like NGINX, HAProxy, and Envoy provide WAF protection, rate limiting, and per-request observability.
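The routing rules above can be sketched as a simple decision function. The pool names and rules are illustrative, standing in for the match/route configuration of a real L7 proxy such as NGINX or Envoy.

```python
def route(path: str, headers: dict[str, str]) -> str:
    """Content-aware routing: every branch here needs HTTP-level
    visibility that an L4 balancer simply does not have."""
    if headers.get("x-canary") == "true":
        return "canary-pool"
    if path.startswith("/api/v2/"):
        return "v2-pool"
    if path.startswith("/api/v1/"):
        return "legacy-pool"
    if path.startswith("/static/"):
        return "cdn-origin"
    return "default-pool"

assert route("/api/v2/users", {}) == "v2-pool"
assert route("/api/v1/users", {"x-canary": "true"}) == "canary-pool"
assert route("/static/logo.png", {}) == "cdn-origin"
```

Note the rule ordering: the canary header wins over path matching, which is exactly the kind of policy decision that lives in L7 configuration.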
| Dimension | L4 | L7 |
| --- | --- | --- |
| Routing signal | IP, port, protocol | HTTP path, headers, cookies, body |
| Protocol support | Any TCP or UDP protocol | Mostly HTTP-family, gRPC |
| TLS handling | Pass-through or simple termination | Full termination, certificate management |
| Observability | Connection-level metrics | Per-request metrics, traces, logs |
| Relative throughput | Higher | Lower due to parsing overhead |
| Multiplexing | One connection = one backend | Can route different requests on one connection to different backends |
The mistake interviewers catch most often is assuming L7 is universally "better." It is only better when you actually need application-aware decisions. If every request goes to the same backend pool and you just want to spread millions of connections with minimal edge overhead, L4 is usually the simpler, cheaper, and faster answer.
In practice, many production architectures use both layers together. An L4 balancer sits at the outermost edge for raw connection distribution and DDoS absorption, and an L7 proxy runs behind it for content-aware routing, TLS termination, and request-level policy enforcement. That layered approach gives you the throughput of L4 and the intelligence of L7 without forcing one component to do everything.
L4 load balancers optimize connection handling. L7 load balancers optimize request handling. If the routing decision depends on bytes above TCP, you need L7. If it depends only on where the connection is going, L4 is enough.
Choosing an algorithm is really choosing what signal the balancer trusts when it picks the next backend. If every server is identical and requests cost roughly the same, simple rotation works. If requests or backends differ in important ways, you need a smarter signal. The art is picking the simplest algorithm that matches reality, because complexity in the routing layer creates its own debugging and operational cost.

Round-robin sends request 1 to server A, request 2 to server B, request 3 to server C, and repeats. It is the cheapest option and easy to reason about. The assumption is that backends are equally fast and requests are equally expensive. When those assumptions hold, round-robin is hard to beat. Most stateless HTTP APIs start here and many never need to leave.
Weighted round-robin keeps the rotation idea but gives stronger machines a larger share. A backend with weight 4 receives roughly twice as much traffic as one with weight 2. This is useful when your fleet has mixed instance sizes, perhaps after a partial hardware upgrade or when running different VM types in the same pool. The weights are static configuration, though, so they drift from reality if load patterns change over time. An operator must update them manually.
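A minimal sketch of the weighted rotation: expand each backend into as many slots as its weight, then cycle. Production balancers (NGINX's smooth weighted round-robin, for example) interleave more evenly, but the traffic share comes out the same; the names and weights here are illustrative.

```python
from itertools import cycle

def weighted_rotation(weights: dict[str, int]):
    """Give each backend `weight` slots in the rotation, so a
    weight-4 node gets twice the traffic of a weight-2 node."""
    slots = [name for name, w in weights.items() for _ in range(w)]
    return cycle(slots)

rotation = weighted_rotation({"big": 4, "small": 2})
window = [next(rotation) for _ in range(12)]   # two full cycles
assert window.count("big") == 8
assert window.count("small") == 4
```

The static-weight drift problem is visible here too: nothing in this rotation reacts to how loaded "big" actually is at runtime.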
Least connections picks the backend with the fewest active connections at the moment a new request arrives. That helps when request duration varies a lot. Imagine a mix of fast health checks finishing in 5 milliseconds and slow file uploads lasting 30 seconds hitting the same pool. Round-robin would distribute them equally by count but not by actual load. A server that happens to receive several uploads in a row would be overwhelmed while its peers sit idle. Least connections adapts to the live shape of the work because it uses current connection count as a proxy for load.
IP hash computes a hash of the client IP and maps it to a backend. This creates implicit stickiness without requiring cookies, which is useful for non-browser protocols such as UDP-based services or raw TCP clients. The downside is that NAT gateways can collapse thousands of users into one public IP, creating artificial hot spots. Mobile users changing networks get a different hash and lose affinity entirely.
Consistent hashing maps a request attribute, such as `user_id`, `session_id`, or `cache_key`, onto a hash ring. The request goes to the first backend clockwise from the hash point. When a node is added or removed, only the keys on the affected arc of the ring move, not the entire keyspace. If you have 10 nodes and add an 11th, roughly 1/11 of keys move rather than the near-total reshuffle that modular hashing would cause. That bounded movement is why consistent hashing is the standard choice for distributed caches, sharded databases, and stateful gateways where reshuffling keys is expensive.
Virtual nodes improve consistent hashing further. Instead of one point per server on the ring, each server owns 100 or more virtual positions. This smooths out the distribution and prevents a single server from owning a disproportionately large arc due to bad luck in the hash function.
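Both ideas fit in a short sketch: a sorted ring of virtual-node positions, binary search for the first point clockwise from the key's hash, and a check that adding an 11th node moves only a bounded fraction of keys. Node and key names are illustrative.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes: each server owns
    `vnodes` positions, which smooths the arc sizes."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        self._ring: list[tuple[int, str]] = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        """First backend clockwise from the key's hash point;
        the modulo wraps past the end of the ring back to the start."""
        i = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[i][1]

# Adding a node moves only the keys on the affected arcs.
before = HashRing([f"node{i}" for i in range(10)])
after = HashRing([f"node{i}" for i in range(11)])
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
assert moved / len(keys) < 0.2   # roughly 1/11 in expectation, not ~10/11
```

The same `lookup` call demonstrates the hot-key caveat from the table below: all traffic for one very popular key still lands on a single node, no matter how many virtual nodes exist.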
| Algorithm | Best fit | Failure mode |
| --- | --- | --- |
| Round-robin | Identical backends, uniform requests | Ignores request cost differences |
| Weighted round-robin | Heterogeneous server sizes | Static weights drift from actual load |
| Least connections | Variable request duration | Connection count misleads if one request is CPU-heavy |
| IP hash | Sticky routing without cookies | NAT collapses many users to one backend |
| Consistent hashing | Cache locality, per-user affinity | Hot keys can still overload one node |
Start simple. If your servers are identical and your requests are uniform, round-robin is the right default. Complexity is justified only when the traffic pattern actually punishes a simpler choice, and you can name the specific imbalance you are fixing.
Use the simplest algorithm that matches your workload. Round-robin is a strong default for stateless HTTP fleets. Move to least connections or consistent hashing only when you can point to the specific imbalance you are solving.
A load balancer is only as useful as its health signal. Routing traffic to a dead node is obviously bad, but ejecting a healthy node because one dependency blinked can be just as damaging. Good health-check design is about asking the right question at the right frequency with the right tolerance for noise.

Active health checks probe targets on a schedule. A shallow check might simply open a TCP connection to port 8080 or send an HTTP GET to `/healthz` and accept any 200 response. A deep check might verify database connectivity, measure queue lag, or execute a synthetic read path that exercises the critical serving code.
Active checks give the balancer a direct, controlled signal. You choose the interval (every 5 seconds, every 10 seconds), you choose what the probe tests, and you define what "healthy" means. The downside is that active probes are periodic. If a node fails 1 second after the last successful probe and the interval is 10 seconds, up to 9 seconds of traffic hits a broken backend before the next probe detects the problem.
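The shallowest active probe fits in a few lines. This is a sketch, not a production prober: the host, port, and timeout are illustrative, and a real prober runs this on the configured interval per backend and feeds the results into failure thresholds before flipping state.

```python
import socket

def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Shallow active check: can we open a TCP connection at all?
    Answers 'is something listening?', not 'can it serve traffic?'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:          # refused, timed out, unreachable, ...
        return False

# Nothing is expected to listen on port 1 of localhost, so this
# demonstrates the failure path of the probe.
assert tcp_probe("127.0.0.1", 1, timeout=0.5) is False
```

The limitation in the surrounding text is visible here too: between probe runs this function reports nothing, which is the gap passive checks close.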
Passive health checks infer health from real traffic. If a backend starts returning connection resets, timeouts, or a burst of 5xx responses, the balancer marks it unhealthy based on observed behavior. Passive checks react quickly because they use every real request as an implicit probe. They are especially good at catching "gray failures" where the node responds but with errors or extreme latency that a simple `/healthz` probe might not exercise.
The downside of passive checks is that users are the canaries. The first few requests to a degraded node fail before the balancer notices the pattern and removes the node. That is why the strongest setups combine both approaches: active probes confirm baseline readiness on a schedule, while passive observation catches failures that happen between probe intervals.
Shallow vs deep checks is an important design axis. A shallow TCP check answers "is something listening on this port?" That is fast and low-overhead, but it cannot tell you whether the application is actually processing requests. The kernel can accept connections even when the application threads are deadlocked. A deep check that queries the database, verifies cache connectivity, and runs a synthetic transaction answers "can this node serve real traffic?" But deep checks are dangerous when designed carelessly.
Consider a checkout service that fails its health check whenever Redis has a 200-millisecond hiccup. If every backend in the pool shares the same Redis cluster, one brief Redis pause causes every node to fail its check simultaneously. The load balancer ejects the entire fleet, turning a minor hiccup into a total outage. A good health check asks "can this particular node serve the traffic I intend to route to it?" rather than "is every dependency in a perfect state?"
Practical protections against flapping and premature ejection:
- Consecutive failure thresholds before marking a node unhealthy. Requiring 3 failures in a row absorbs transient network blips without hiding real problems.
- Consecutive success thresholds before re-admitting a recovered node. Requiring 5 passes prevents a flapping node from rejoining and immediately failing again.
- Slow start ramp after recovery so the node is not instantly hit with a full traffic share while it warms caches and rebuilds connection pools.
- Connection draining before planned removal so in-flight requests finish cleanly instead of being dropped mid-response.
- Separate readiness from liveness during deploys. A starting instance should not be killed (liveness) just because it has not finished loading configuration (readiness). These are different questions with different consequences.
These controls turn health checks from a binary on/off switch into a graduated system that absorbs transient noise without hiding real failures.
Session stickiness sounds like a convenient shortcut: the load balancer pins a client to the same backend, so in-memory state just works. But that convenience comes with a price that compounds over time. Understanding when stickiness helps and when it hurts is a common interview question because it reveals whether a candidate thinks about operational consequences, not just happy-path functionality.
Cookie-based affinity is the most common browser-facing approach. The load balancer injects a cookie (or reads one set by the application) that encodes which backend should handle subsequent requests. As long as the cookie is valid, every request from that browser sticks to the same server. This is precise for browser clients because cookies travel with every HTTP request on that domain.
In practice, the cookie typically contains an opaque identifier for the backend server, not the session data itself. The balancer reads the cookie, maps it to a healthy backend, and forwards the request. If the target backend has died, the cookie becomes invalid, and the balancer must either re-route to a different backend (losing in-memory state) or return an error.
Source-IP hashing is simpler. The balancer computes a hash of the client IP and maps it to a backend. No cookies are needed, so it works for non-browser protocols like raw TCP or UDP clients. But source-IP hashing has two serious problems.
First, many corporate users share a single public IP behind a NAT gateway. Thousands of unrelated users from one enterprise can map to the same backend, creating an artificial hot spot that the balancer cannot fix without changing algorithms. Second, mobile users switching from Wi-Fi to cellular get a different source IP, which breaks the sticky mapping and forces them to start over. The hash input is neither unique enough nor stable enough for reliable affinity.
Stickiness helps when sessions are stored only in process memory and rewriting the application is not immediately feasible. It restores user continuity by constraining the scheduler. But that constraint is the root of the scaling problem. A hot user or a large enterprise customer can overload one node while others sit idle. Draining or replacing that node is risky because moving the session drops in-memory state. Autoscaling loses its value because new instances cannot absorb existing sticky users. Deploys become fragile because every restart drops some user sessions.
The long-term fix is almost always to externalize session state. Move login sessions to Redis with a TTL. Store durable business state like shopping carts and orders in a database. Use signed JWTs for small identity claims that the server can verify without any lookup. Once any web node can reconstruct user context from shared state, the balancer regains full freedom to route traffic evenly, and autoscaling, blue-green deploys, and failure recovery become routine operations again.
Sticky sessions are a bridge, not a destination. They preserve user state by constraining the scheduler, which means every scale event, deploy, and failure becomes harder than it needs to be. Plan the exit before you enable them.
Regional load balancing distributes traffic within a single data center or region. Global load balancing answers a different question: which region should the client enter in the first place? The two primary tools are GeoDNS and Anycast, and they operate at fundamentally different layers of the networking stack.

GeoDNS returns different IP addresses based on the location of the DNS resolver, the health status of each region, and business policies. You can express rules like "European users go to eu-west-1 unless that region is unhealthy, in which case fall back to us-east-1." You can also implement weighted traffic splitting for gradual region migrations or blue-green regional deployments. GeoDNS gives you fine-grained, policy-rich control over where users land.
However, GeoDNS has two important weaknesses. First, DNS answers are cached by recursive resolvers, operating systems, and browsers. A TTL of 60 seconds means that after a region goes down, up to 60 seconds of traffic continues flowing to the failed region because clients and resolvers have not yet expired their cached answer. Lowering the TTL helps but increases DNS query volume and is not always respected by all resolvers.
Second, GeoDNS routes by the resolver's location, not the client's actual location. A user in Paris whose company uses a US-based DNS resolver will be routed as if they are in the United States. EDNS Client Subnet (ECS) mitigates this by sending a prefix of the client IP to the authoritative DNS server, but not all resolvers support ECS.
Anycast takes a completely different approach. Instead of using different IPs per region, it advertises the same IP prefix from multiple geographic locations via BGP. When a client connects, the internet's routing infrastructure naturally delivers packets to the topologically "nearest" announcement based on BGP path selection. No DNS propagation delay is involved, because the IP address itself is the same everywhere.
Anycast is excellent for DDoS mitigation. Attack traffic spreads across all announcing locations instead of concentrating at one origin. A 100 Gbps attack hitting an Anycast IP gets divided across 20 edge sites, each absorbing a manageable fraction. Cloudflare, Google, and AWS use Anycast extensively for exactly this reason.
However, Anycast alone does not understand business logic. It cannot enforce data residency rules, premium-tier routing, or explicit failover preferences. It also works best for short-lived or connectionless protocols like DNS queries and HTTP requests. Long-lived TCP connections can break if BGP route changes mid-connection cause packets to shift to a different edge site, although modern Anycast implementations handle this better than they used to.
A common production pattern combines both mechanisms: Anycast to the nearest edge PoP for fast ingress, then GeoDNS or internal routing logic to steer traffic to the correct regional application cluster. Neither mechanism replaces the other everywhere. They solve different parts of the global routing problem.
Load balancers deliver their full value when any request can go to any healthy node. That is exactly what a stateless web tier provides. The servers hold application code, configuration, and temporary computation memory, but they never hold authoritative user state. If a node disappears, the next request simply lands on another node and the user notices nothing.
In a stateless architecture, session data lives in purpose-built external systems. Redis handles short-lived session tokens with TTLs. A relational database stores durable business state like orders and account records. Object storage holds uploads and media. Signed JWTs carry small identity claims that the server can verify without any server-side lookup. The web tier becomes a pure compute layer that can scale horizontally without coordination or state transfer.
Think of it this way: a stateless web server is like a restaurant cook who reads the order ticket and prepares the dish. The cook does not remember previous orders. If that cook goes home, another cook picks up the next ticket and the customer never knows the difference. The order tickets (session tokens, database records) are the shared state that makes this possible.
Stateful web tiers keep important per-user or per-connection state in process memory. That is sometimes the right design for specific workloads. Real-time game room servers need in-memory player state for microsecond access latency. WebSocket fan-out hubs maintain connection registries that track which clients are subscribed to which channels. Actor systems co-locate computation with data for performance reasons.
But statefulness demands deliberate infrastructure. You need consistent hashing or shard ownership to direct requests to the correct node. You need replication or periodic snapshotting to survive node failures. You need careful connection draining procedures during deploys so users are migrated without data loss. A generic round-robin balancer is not enough because sending a user to the wrong node means they hit an empty state.
For the vast majority of web applications, statelessness wins because it makes three critical operations dramatically easier:
- Autoscaling works because new nodes can serve traffic immediately without needing to receive state from existing nodes. Add 5 instances and they are productive in seconds.
- Blue-green deploys work because you can switch traffic between two complete fleets without migrating session state. The old fleet drains and the new fleet absorbs, and users see a seamless transition.
- Zone failure recovery works because surviving availability zones absorb the failed zone's traffic without needing to reconstruct any per-user state. The loss of a zone is a capacity event, not a data event.
The price of statelessness is that external state systems now carry the reliability burden. Redis needs replication and failover planning. The database needs connection pooling, read replicas, and capacity headroom. Object storage needs lifecycle policies. But those are well-understood problems with mature tooling, while recovering in-memory state from a crashed web server is a custom, fragile exercise every time it happens.
A useful litmus test for any web tier: if a node disappears mid-request, can another node continue the user's flow without any special migration or recovery step? If yes, the tier is effectively stateless. If no, the balancer is compensating for state that should probably live somewhere else.
Before adding a new web server feature, ask: does this store user state in process memory? If yes, you are making the tier stateful and every future scaling decision harder. Push that state into Redis, a database, or a signed token instead.