HPA vs Cluster Autoscaler: Two Loops, One Traffic Spike

April 19, 2026

Most production Kubernetes clusters run two autoscalers, and they work at different layers. Confusing them is how teams end up with thousands of Pending pods during a Black Friday spike.

The Horizontal Pod Autoscaler scales pod count. It watches a metric, usually CPU or a custom signal from Prometheus, and when the metric runs above your target the HPA controller raises the replicas field on your Deployment in proportion to the overshoot. More pods get scheduled. Throughput rises. Latency drops. This works only when there is room on existing nodes.
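The scaling rule is simple enough to sketch in a few lines. What follows is a simplified model of the documented HPA calculation (replicas scaled by the ratio of observed to target metric, rounded up), not the controller's actual code. The function name, the min/max bounds, and the example numbers are illustrative assumptions; the 0.1 tolerance mirrors the usual default.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale replicas by observed/target, rounded up."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left alone,
    # which keeps small metric wobbles from causing churn.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical spike: 10 pods averaging 180% of requested CPU, target 60%.
print(desired_replicas(10, 180, 60, max_replicas=50))  # 30
```

Note what the function does not take as input: free capacity on the nodes. The HPA asks for pods whether or not the cluster can hold them.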

The Cluster Autoscaler scales node count. It does not look at CPU usage on your pods. For scale-up it watches for one thing: pods stuck in Pending because no node has enough unreserved resources to satisfy their requests. When it sees that, it asks the cloud provider for a new node, waits for it to join the cluster, and then the scheduler places the pending pods.
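To make the "no room" condition concrete, here is a toy first-fit check over CPU and memory requests. It is a sketch under heavy assumptions: the real scheduler also weighs affinity, taints, topology, and more, and the Node and Pod classes and all numbers below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpu_m: int    # unreserved CPU, in millicores
    free_mem_mi: int   # unreserved memory, in MiB

@dataclass
class Pod:
    name: str
    cpu_m: int         # CPU request, in millicores
    mem_mi: int        # memory request, in MiB

def unschedulable(pods, nodes):
    """Place pods first-fit by requests; return names of pods that fit nowhere."""
    free = {n.name: [n.free_cpu_m, n.free_mem_mi] for n in nodes}
    pending = []
    for pod in pods:
        for cap in free.values():
            if pod.cpu_m <= cap[0] and pod.mem_mi <= cap[1]:
                # Reserve the pod's requests on this node.
                cap[0] -= pod.cpu_m
                cap[1] -= pod.mem_mi
                break
        else:
            # No node had room: this is the Pending pod a node autoscaler reacts to.
            pending.append(pod.name)
    return pending

nodes = [Node("node-a", 1000, 2048), Node("node-b", 500, 1024)]
pods = [Pod(f"web-{i}", 500, 512) for i in range(1, 5)]
print(unschedulable(pods, nodes))  # ['web-4']
```

The check runs on requests, not on live usage. A node can be idle in practice and still count as full if its pods have reserved everything.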

Notice the chain. HPA increases replicas, the scheduler finds no room, pods go Pending, Cluster Autoscaler reacts, a node comes up, pods schedule. That is two autoscaler loops in series with the scheduler between them, and each step adds its own delay.

The delay that bites: cold node warmup. A new EC2 or GCE instance can take anywhere from 60 seconds to 4 minutes to boot, join the cluster, pull your container images, and become ready. If your traffic spike is 90 seconds long, the new capacity can easily arrive after the incident is over. By then your existing pods have been overloaded, the queue has built up, retries are amplifying the load, and the first thing the new pods see is a backlog they cannot drain.
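A back-of-the-envelope budget shows how the delays stack. Every number below is an assumption for illustration: the 15 s HPA sync and 10 s Cluster Autoscaler scan reflect common defaults, and the rest are placeholders to replace with measurements from your own cluster.

```python
def capacity_ready_at(hpa_sync_s=15, metrics_lag_s=30, ca_scan_s=10,
                      node_boot_s=120, image_pull_s=30, readiness_s=15):
    """Rough worst-case seconds from spike start until new pods serve traffic."""
    stages = [hpa_sync_s, metrics_lag_s, ca_scan_s,
              node_boot_s, image_pull_s, readiness_s]
    # The stages run in series, so their delays simply add up.
    return sum(stages)

def seconds_uncovered(spike_s, ready_at_s):
    """Seconds of the spike served only by the pods that already existed."""
    return min(spike_s, ready_at_s)

ready = capacity_ready_at()
print(ready)                          # 220
print(seconds_uncovered(90, ready))   # 90: the whole spike, capacity came too late
print(seconds_uncovered(600, ready))  # 220: the first part of a long spike
```

With these assumed numbers, a 90-second spike is over more than two minutes before the new pods are ready.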

The production failure: an e-commerce team had HPA tuned to scale at a 60% CPU utilization target, with Cluster Autoscaler behind it. A flash sale started at 9:00 AM sharp. CPU spiked, HPA asked for 3x the pods, the new pods went Pending, and Cluster Autoscaler started provisioning. For the first three minutes, users saw 502s and 503s from the ingress because the existing pods were saturated. The autoscalers did exactly what they were configured to do, just not fast enough.

Two fixes pair well. Keep a small pool of pre-warmed capacity through overprovisioning: low-priority placeholder pods reserve room on the nodes and are preempted the moment real pods need it, so the first wave schedules immediately while Cluster Autoscaler adds nodes behind it. For event-driven workloads, drop in KEDA, which can scale from zero based on queue depth or pub/sub backlog. KEDA is the right tool when "the metric that matters" is not CPU.
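Both fixes come down to small sizing calculations. The sketch below is illustrative arithmetic, not KEDA's or Cluster Autoscaler's actual code: the function names, the per-replica backlog target, and the surge factor are all hypothetical inputs you would choose yourself.

```python
import math

def queue_replicas(queue_depth, msgs_per_replica, min_replicas=0, max_replicas=50):
    """Replicas needed so each one handles at most msgs_per_replica queued messages."""
    if queue_depth <= 0:
        # An empty queue lets the workload rest at its floor, which may be zero.
        return min_replicas
    desired = math.ceil(queue_depth / msgs_per_replica)
    return max(min_replicas, min(max_replicas, desired))

def placeholder_pods(current_replicas, expected_surge_factor):
    """Placeholder pods (sized like real ones) needed to absorb the first wave."""
    extra = current_replicas * (expected_surge_factor - 1)
    return max(0, math.ceil(extra))

print(queue_replicas(0, 100))    # 0
print(queue_replicas(950, 100))  # 10
print(placeholder_pods(10, 3))   # 20
```

The trade-off is plain in the second function: covering a 3x surge with no waiting means paying for 20 idle pods' worth of nodes all the time. Most teams size the headroom to cover only the warmup window.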

Autoscaling is not magic. It is two loops with a measurable latency between them, and your job is to make sure the warmup is shorter than the spike, or that spare capacity is already in place when the spike starts.

Key takeaway

HPA and Cluster Autoscaler solve different halves of the same problem. HPA adds pods, then waits for the cluster to find room. If new nodes take three minutes to warm up, a shorter spike is already over by the time you have capacity.

Originally posted on LinkedIn.