Networking & Load Balancing System Design

Load Balancing Fundamentals: L4, L7, and Why Round-Robin Lies

April 2, 2026

Load balancing splits into two layers, and confusing them is where most production incidents start.

L4 load balancing operates at the transport layer, on TCP connections or UDP flows. The LB sees a connection, picks a backend, and pipes bytes through. It does not parse HTTP. It cannot route on path, header, or method. AWS NLB and HAProxy in TCP mode work this way. L4 is fast and cheap because it does almost nothing.

L7 load balancing terminates HTTP. The LB reads the request line, decides where to send it on a per-request basis, and can do things like header-based routing, retries, and circuit breaking. Envoy, NGINX, and AWS ALB live here. L7 is more expensive per request but vastly more useful for microservices.
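
To make the split concrete, here is a minimal sketch of what each layer can see when it picks a backend. The addresses, pool names, the /api/ prefix, and the x-canary header are all hypothetical, chosen only for illustration. Real load balancers do far more than this.

```python
import hashlib

# Hypothetical backend addresses, for illustration only.
L4_BACKENDS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]


def pick_l4(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """L4: only the connection tuple is visible, so hash it to pick a backend."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(flow).digest()
    return L4_BACKENDS[int.from_bytes(digest[:8], "big") % len(L4_BACKENDS)]


def pick_l7(method: str, path: str, headers: dict[str, str]) -> str:
    """L7: the parsed request is visible, so route on header, path, or method."""
    if headers.get("x-canary") == "1":
        return "canary-pool"
    if path.startswith("/api/"):
        return "api-pool"
    if method == "GET":
        return "static-pool"
    return "default-pool"
```

The L4 function never receives a path or header, which is the whole point: it cannot route on what it cannot see.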

Now the algorithms.

Round-robin rotates through backends. Simple, stateless, and quietly wrong whenever your backends are not identical. It assumes every request costs the same and every server has the same capacity.
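
A small sketch of why equal request counts do not mean equal load. The backend names and capacity numbers are made up for illustration.

```python
from itertools import cycle

# Hypothetical sustainable requests per second for each backend.
CAPACITY = {"small-a": 100, "small-b": 100, "big-a": 250}


def round_robin_utilization(total_rps: int) -> dict[str, float]:
    """Hand out total_rps requests in rotation; return load / capacity per backend."""
    assigned = dict.fromkeys(CAPACITY, 0)
    rotation = cycle(CAPACITY)
    for _ in range(total_rps):
        assigned[next(rotation)] += 1
    return {name: assigned[name] / CAPACITY[name] for name in CAPACITY}


utilization = round_robin_utilization(300)
```

Every backend receives 100 requests per second. Under these assumed capacities the two small backends are saturated while the big one runs at 40 percent.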

Least-connections sends each new request to the backend with the fewest open connections. Better, because in-flight count loosely correlates with load. Works well for long-lived connections like database proxies.
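
A minimal sketch of the bookkeeping, assuming the balancer is told when a request starts and when it ends. The class and method names are illustrative, not taken from any particular proxy.

```python
class LeastConnections:
    """Track in-flight requests per backend and pick the least busy one."""

    def __init__(self, backends):
        self.in_flight = {backend: 0 for backend in backends}

    def acquire(self) -> str:
        # Ties go to whichever backend was listed first.
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        # Call this when the request or connection completes.
        self.in_flight[backend] -= 1
```

A slow backend holds its requests open longer, so its count stays high and it naturally receives less new traffic.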

EWMA of latency, often paired with power-of-two-choices, works like this: sample two random backends and route to the one with the lower exponentially weighted moving average of recent response times. This adapts to changing conditions without scanning the whole pool on every request. Linkerd, for example, balances this way by default.
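
The pick-two-then-compare idea can be sketched as follows. This is a simplified illustration, not any mesh's actual implementation: the class name, the alpha value, and the choice to start every backend at zero latency are assumptions.

```python
import random


class EwmaP2C:
    """Power-of-two-choices over an exponentially weighted latency average."""

    def __init__(self, backends, alpha=0.3, rng=None):
        self.alpha = alpha  # weight given to the newest sample
        # Assumption: unobserved backends start at 0.0, so they get tried early.
        self.latency = {backend: 0.0 for backend in backends}
        self.rng = rng or random.Random()

    def pick(self) -> str:
        # Sample two distinct backends and keep the one with the lower average.
        a, b = self.rng.sample(list(self.latency), 2)
        return a if self.latency[a] <= self.latency[b] else b

    def observe(self, backend: str, seconds: float) -> None:
        # Blend the new sample into the running average.
        previous = self.latency[backend]
        self.latency[backend] = self.alpha * seconds + (1 - self.alpha) * previous
```

Comparing only two candidates keeps each decision cheap, and the random sampling stops every client from stampeding onto the single best backend at once.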

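Consistent hashing routes based on a key, usually user ID or session ID, so the same key keeps landing on the same backend. Useful for cache affinity, awkward for elastic scaling: every scaling event remaps a share of the keys and cools their caches, and a hot key stays hot on one backend no matter how many you add.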
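
A minimal hash-ring sketch, assuming SHA-256 for placement and 100 virtual nodes per backend. Both choices, and all the names, are illustrative.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        # Each backend owns several points so load spreads more evenly.
        self.ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first point past the key, wrapping at the end.
        idx = bisect.bisect_right(self.points, _hash(key)) % len(self.points)
        return self.ring[idx][1]


keys = [f"user-{i}" for i in range(2000)]
before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
```

In this sketch, adding backend d only moves keys onto d. Everything else stays put, which is the property that preserves cache affinity. The keys that do move still arrive at a cold cache.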

Sticky sessions, where every request from a user pins to one backend, are a close cousin of consistent hashing, usually implemented with a cookie or a source-IP hash. They make stateful apps possible, but they fight every autoscaler you own. When you scale up, existing sessions stay pinned, so traffic does not rebalance onto the new pods. When you scale down, you drop every session on the drained pod.

Here is the production failure that broke us. An L4 NLB was round-robin balancing across a Kubernetes deployment that ran on mixed instance types. Some pods landed on m6i.large, others on c7i.xlarge. Round-robin sent the same request count to every pod, so the pods on c7i.xlarge sat at 40 percent CPU while the pods on m6i.large OOMed at the same per-pod request count. Latency on the m6i.large pods spiked, but the LB had no signal to react to. We saw 5xxs from a third of the fleet while the other two-thirds were yawning.

The fix was twofold. We switched the inner mesh to least-connections with topology-aware routing, so traffic preferred same-zone backends, and we pinned the deployment to a single instance class. The lesson: round-robin balances request counts, not work.
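
The routing half of that fix can be sketched roughly like this. It is not our mesh configuration, just an illustration of the selection logic. The Backend type and its field names are hypothetical, and a real mesh would also spill traffic to other zones when the local one is overloaded.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    zone: str
    in_flight: int = 0


def pick_zone_aware(backends: list[Backend], client_zone: str) -> Backend:
    """Least-connections, restricted to the caller's zone when possible."""
    local = [b for b in backends if b.zone == client_zone]
    # Fall back to the whole pool if the caller's zone has no backends.
    candidates = local or backends
    return min(candidates, key=lambda b: b.in_flight)
```

The zone filter runs first, so a same-zone backend wins even when a remote one is idler. That trades a little balance for lower cross-zone latency.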

Key takeaway

Round-robin distributes requests, not work. When your backends have different CPU classes or your requests have wildly different cost, you need an algorithm that measures the thing you actually care about: latency, in-flight load, or topology.

Originally posted on LinkedIn.