Restart VMs in scale set in AKS Node Pool
Introduction
AKS node pools are usually backed by Virtual Machine Scale Sets, but AKS still owns those machines. That detail matters because the safest answer is not "treat the VMs like ordinary Azure VMs" but "use AKS operations first, and touch the backing scale set only when you understand the support and disruption tradeoffs."
What AKS Actually Manages
Microsoft documents that AKS agent nodes appear as normal Azure IaaS resources, but direct customizations through raw IaaS APIs are not the supported management path and may not persist through upgrades, scaling, updates, or reboots. In other words, the VMSS exists, but AKS is the control plane that is supposed to manage it.
That gives you three practical choices depending on what you mean by "restart":
- restart an entire user node pool by stopping and starting the node pool
- recycle workload placement by draining nodes before maintenance
- target a specific VMSS instance only for troubleshooting, with caution
Supported Pool-Level Restart
If the goal is to power-cycle a whole user node pool, use the AKS node pool commands, `az aks nodepool stop` followed by `az aks nodepool start`, rather than the VMSS command surface directly.
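A minimal sketch of a pool-level stop and start, assuming the Azure CLI is installed and logged in to the right subscription. The function name `restart_nodepool` and the example resource names are hypothetical placeholders.

```shell
# Power-cycle one AKS user node pool through the AKS API.
# Arguments (placeholders you supply): resource group, cluster name, pool name.
restart_nodepool() {
  local rg="$1" cluster="$2" pool="$3"

  # Stop the pool first; start only runs if the stop succeeded.
  az aks nodepool stop \
    --resource-group "$rg" --cluster-name "$cluster" --name "$pool" &&
  az aks nodepool start \
    --resource-group "$rg" --cluster-name "$cluster" --name "$pool"
}

# Example call with hypothetical names:
# restart_nodepool myResourceGroup myAKSCluster userpool1
```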
That is the cleanest AKS-level restart workflow for an entire pool. It also keeps autoscaler state and AKS metadata in the expected management path.
Drain Before Disruption
Even when AKS is managing the restart path, you still need to think like a Kubernetes operator. Nodes host pods, and restarting or stopping them without preparation can cause avoidable disruption.
Before taking nodes down, cordon and drain them with `kubectl cordon` and `kubectl drain`.
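One way to sketch the cordon-and-drain step, assuming `kubectl` already points at the cluster. The helper name `drain_node` and the 300-second timeout are illustrative assumptions; note that `--delete-emptydir-data` discards any `emptyDir` contents on the evicted pods.

```shell
# Cordon a node so no new pods land on it, then evict what is running there.
# Argument: the Kubernetes node name (see `kubectl get nodes`).
drain_node() {
  local node="$1"

  kubectl cordon "$node" &&
  kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s
}

# After maintenance, allow scheduling on the node again:
# kubectl uncordon <node-name>
```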
If you are rotating multiple nodes, do it gradually. Pod disruption budgets, daemon sets, and capacity limits all affect how smoothly the cluster recovers.
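A gradual rotation might look like the sketch below. It assumes nodes carry the `kubernetes.azure.com/agentpool` label that AKS applies to pool members; the function name `rotate_pool_nodes`, the timeouts, and the pause length are hypothetical choices, and the maintenance step itself is left as a placeholder comment.

```shell
# Drain the nodes of one pool one at a time, pausing between nodes.
# Arguments: pool name, optional pause in seconds (default 60).
rotate_pool_nodes() {
  local pool="$1" pause="${2:-60}"
  local node

  # Review disruption budgets first; a budget that allows no evictions stalls a drain.
  kubectl get poddisruptionbudgets --all-namespaces

  for node in $(kubectl get nodes -l "kubernetes.azure.com/agentpool=$pool" \
      -o jsonpath='{.items[*].metadata.name}'); do
    # drain cordons the node before evicting its pods
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1

    # ... perform the maintenance action for this node here ...

    kubectl uncordon "$node"
    kubectl wait --for=condition=Ready "node/$node" --timeout=600s || return 1
    sleep "$pause"
  done
}

# Example call with a hypothetical pool name and a 2-minute pause:
# rotate_pool_nodes userpool1 120
```

Handling one node per iteration keeps the capacity loss bounded to a single node at any moment, which is usually what pod disruption budgets are sized for.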
Restarting a Specific VMSS Instance
Sometimes the issue is one unhealthy node, not the whole pool. In that case you may need to identify the backing scale set and restart a specific instance. Azure supports restarting VMSS instances, but this is better treated as targeted troubleshooting than as your normal AKS operating model.
First discover the node resource group (the cluster's `nodeResourceGroup` property) and the scale set inside it.
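The lookup can be sketched with three small helpers, assuming the Azure CLI is logged in. The helper names and the example resource names are hypothetical; the instance query assumes each instance's `osProfile.computerName` matches the Kubernetes node name, which is worth confirming against `kubectl get nodes` before acting.

```shell
# Print the node resource group that AKS created for a cluster.
# Arguments: cluster resource group, cluster name.
node_resource_group() {
  az aks show --resource-group "$1" --name "$2" \
    --query nodeResourceGroup --output tsv
}

# List the scale sets that live in the node resource group.
list_scale_sets() {
  az vmss list --resource-group "$1" --query "[].name" --output tsv
}

# Show instance IDs next to computer names for one scale set.
# Arguments: node resource group, scale set name.
list_instances() {
  az vmss list-instances --resource-group "$1" --name "$2" \
    --query "[].{instanceId:instanceId, computerName:osProfile.computerName}" \
    --output table
}

# Example flow with hypothetical names:
# NODE_RG="$(node_resource_group myResourceGroup myAKSCluster)"
# list_scale_sets "$NODE_RG"
# list_instances "$NODE_RG" <scale-set-name>
```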
Then restart one instance with `az vmss restart` once you have confirmed it is the correct target.
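A sketch of the single-instance restart, assuming you already know the node resource group, scale set name, and instance ID. The function name `restart_vmss_instance` is hypothetical. The guard is deliberate: when `--instance-ids` is omitted, `az vmss restart` applies to the scale set as a whole rather than to one instance.

```shell
# Restart exactly one instance of a scale set.
# Arguments: node resource group, scale set name, instance ID.
restart_vmss_instance() {
  local node_rg="$1" vmss="$2" instance_id="$3"

  # Refuse to run without an instance ID so the whole scale set is never hit by accident.
  if [ -z "$instance_id" ]; then
    echo "instance ID is required" >&2
    return 1
  fi

  az vmss restart \
    --resource-group "$node_rg" \
    --name "$vmss" \
    --instance-ids "$instance_id"
}

# Example call with hypothetical names:
# restart_vmss_instance MC_myResourceGroup_myAKSCluster_eastus <scale-set-name> 3
```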
This can be useful for one-off recovery, but it should not replace normal AKS-managed maintenance patterns.
When a Restart Is the Wrong Fix
A restart is not always the best answer. If you are patching nodes, rotating node images, or changing VM size, other operations are usually cleaner:
- scale out a fresh node pool and drain the old one
- use node image upgrade workflows
- resize by creating a new pool with the target SKU and migrating workloads
Those patterns are more aligned with how AKS expects node pools to evolve over time.
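The alternatives above could be sketched as follows, assuming the Azure CLI is logged in. The function names, pool names, VM size, and node count are hypothetical placeholders; drain the old pool's nodes before deleting it.

```shell
# Refresh the node image of an existing pool without changing the Kubernetes version.
# Arguments: resource group, cluster name, pool name.
upgrade_node_image() {
  az aks nodepool upgrade \
    --resource-group "$1" --cluster-name "$2" --name "$3" \
    --node-image-only
}

# Add a replacement pool with the target VM size.
# Arguments: resource group, cluster name, new pool name, VM size, node count.
add_replacement_pool() {
  az aks nodepool add \
    --resource-group "$1" --cluster-name "$2" --name "$3" \
    --node-vm-size "$4" --node-count "$5"
}

# Delete the old pool once its workloads have moved to the new one.
# Arguments: resource group, cluster name, old pool name.
delete_old_pool() {
  az aks nodepool delete \
    --resource-group "$1" --cluster-name "$2" --name "$3"
}

# Example migration with hypothetical names:
# add_replacement_pool myResourceGroup myAKSCluster userpool2 Standard_D4s_v5 3
# (cordon and drain the nodes of userpool1, then)
# delete_old_pool myResourceGroup myAKSCluster userpool1
```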
Common Pitfalls
- Managing AKS-backed VMs as if they were standalone hand-managed Azure VMs.
- Restarting nodes without draining workloads first.
- Forgetting to verify cluster capacity before taking a node or pool offline.
- Changing VMSS state while assuming AKS autoscaling and scheduling will immediately reconcile everything cleanly.
- Using direct VMSS actions as the default maintenance workflow instead of as a narrow troubleshooting tool.
Summary
- AKS node pools may be backed by VMSS, but AKS is still the supported management surface.
- For a whole user pool restart, prefer `az aks nodepool stop` and `az aks nodepool start`.
- Drain nodes before disruptive actions so Kubernetes can reschedule pods safely.
- `az vmss restart` can target a specific instance, but use it carefully and sparingly.
- For long-term maintenance, node pool rotation or upgrade workflows are usually better than hand-managing the scale set.

