Restart VMs in scale set in AKS Node Pool

Introduction

AKS node pools are usually backed by Virtual Machine Scale Sets (VMSS), but AKS still owns those machines. That detail matters: the safest approach is not to treat the VMs like ordinary Azure VMs, but to use AKS operations first and touch the backing scale set only when you understand the support and disruption tradeoffs.

What AKS Actually Manages

Microsoft documents that AKS agent nodes appear as normal Azure IaaS resources, but direct customizations through raw IaaS APIs are not the supported management path and may not persist through upgrades, scaling, updates, or reboots. In other words, the VMSS exists, but AKS is the control plane that is supposed to manage it.

That gives you three practical choices depending on what you mean by "restart":

restart an entire user node pool by stopping and starting the node pool
move workloads off nodes by cordoning and draining them before maintenance
target a specific VMSS instance only for troubleshooting, with caution

Supported Pool-Level Restart

If the goal is to power-cycle a whole user node pool, use the AKS node pool commands rather than the VMSS command surface directly:

bash

# Stop every node in the user pool.
az aks nodepool stop \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --nodepool-name userpool

# Check the power state and provisioning state before starting again.
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --nodepool-name userpool \
  --query '{powerState:powerState.code, provisioningState:provisioningState}'

# Start the pool again.
az aks nodepool start \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --nodepool-name userpool

That is the cleanest AKS-level restart workflow for an entire pool. It also keeps autoscaler state and AKS metadata in the expected management path.
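The stop, check, and start sequence can also be wrapped in one function so the pool is only started once AKS reports it as stopped. This is a minimal sketch, not an official workflow: `restart_nodepool` is a hypothetical helper name, and it assumes the power state code reads `Stopped` after a successful stop.

```bash
#!/usr/bin/env bash
# Sketch: power-cycle one AKS user node pool through the AKS API.
# restart_nodepool is a hypothetical helper name, not an az command.
restart_nodepool() {
  local rg="$1" cluster="$2" pool="$3" state

  az aks nodepool stop \
    --resource-group "$rg" --cluster-name "$cluster" --nodepool-name "$pool" || return 1

  # Confirm AKS reports the pool as stopped before starting it again.
  # Assumption: the power state code is "Stopped" after a successful stop.
  state=$(az aks nodepool show \
    --resource-group "$rg" --cluster-name "$cluster" --nodepool-name "$pool" \
    --query powerState.code -o tsv)
  if [ "$state" != "Stopped" ]; then
    echo "pool $pool is in state '$state', expected 'Stopped'" >&2
    return 1
  fi

  az aks nodepool start \
    --resource-group "$rg" --cluster-name "$cluster" --nodepool-name "$pool"
}
```

With the placeholder names used above, the call would be restart_nodepool myResourceGroup myAKSCluster userpool. Failing early on an unexpected state avoids issuing a start against a pool that is still transitioning.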

Drain Before Disruption

Even when AKS is managing the restart path, you still need to think like a Kubernetes operator. Nodes host pods, and restarting or stopping them without preparation can cause avoidable disruption.

Before taking nodes down, cordon and drain them:

bash

# List nodes to find the exact node name.
kubectl get nodes

# Mark the node unschedulable, then evict its pods.
kubectl cordon aks-userpool-12345678-vmss000000
kubectl drain aks-userpool-12345678-vmss000000 \
  --ignore-daemonsets \
  --delete-emptydir-data

If you are rotating multiple nodes, do it gradually. Pod disruption budgets, daemon sets, and capacity limits all affect how smoothly the cluster recovers.
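One way to keep a rotation gradual is to handle a single node at a time: cordon, drain, run the maintenance step, wait for the node to report Ready, then uncordon. The sketch below assumes that pattern. `with_drained_node` is a hypothetical helper name, and the timeouts are illustrative values rather than recommendations.

```bash
#!/usr/bin/env bash
# Sketch: run one maintenance command against a drained node, then return
# the node to service. with_drained_node is a hypothetical helper name.
with_drained_node() {
  local node="$1"
  shift

  kubectl cordon "$node" || return 1

  # --timeout makes a drain blocked by a pod disruption budget fail
  # instead of waiting indefinitely. On failure the node stays cordoned.
  kubectl drain "$node" \
    --ignore-daemonsets --delete-emptydir-data --timeout=300s || return 1

  # Maintenance step supplied by the caller (the remaining arguments).
  "$@" || return 1

  # Only uncordon once the node reports Ready again.
  kubectl wait --for=condition=Ready "node/$node" --timeout=600s || return 1
  kubectl uncordon "$node"
}
```

Calling this function in a loop over node names, and stopping at the first non-zero return, leaves the rest of the pool untouched when one node fails to drain or recover.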

Restarting a Specific VMSS Instance

Sometimes the issue is one unhealthy node, not the whole pool. In that case you may need to identify the backing scale set and restart a specific instance. Azure supports restarting VMSS instances, but this is better treated as targeted troubleshooting than as your normal AKS operating model.

First discover the node resource group and scale set:

bash

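# Look up the managed resource group that holds the node pool's scale sets.
NODE_RG=$(az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query nodeResourceGroup -o tsv)

# List the scale sets, then the instances in the one backing the pool.
az vmss list --resource-group "$NODE_RG" -o table
az vmss list-instances --resource-group "$NODE_RG" --name aks-userpool-12345678-vmss -o table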
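Confirming the target usually means mapping a Kubernetes node name to a VMSS instance ID. The sketch below assumes the common scale set naming scheme, in which the computer name ends with the instance ID written in base 36 and padded to six characters. `vmss_instance_id` is a hypothetical helper name, and the result is worth cross-checking against the list-instances output before acting on it.

```bash
#!/usr/bin/env bash
# Sketch: derive a VMSS instance ID from an AKS node name.
# Assumption: the node name ends in a six-character base-36 suffix,
# e.g. aks-userpool-12345678-vmss00000a. vmss_instance_id is hypothetical.
vmss_instance_id() {
  local node="$1"
  local suffix="${node: -6}"

  # Reject names that do not end in six base-36 digits.
  [[ "$suffix" =~ ^[0-9a-zA-Z]{6}$ ]] || return 1

  # Bash arithmetic accepts base#digits, so 36#00000a evaluates to 10.
  echo $((36#$suffix))
}
```

For example, vmss_instance_id aks-userpool-12345678-vmss00000a prints 10, and a name ending in vmss000003 maps to instance ID 3.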

Then restart one instance if you have confirmed it is the correct target:

bash

az vmss restart \
  --resource-group "$NODE_RG" \
  --name aks-userpool-12345678-vmss \
  --instance-ids 3

This can be useful for one-off recovery, but it should not replace normal AKS-managed maintenance patterns.

When a Restart Is the Wrong Fix

A restart is not always the best answer. If you are patching nodes, rotating node images, or changing VM size, other operations are usually cleaner:

scale out a fresh node pool and drain the old one
use node image upgrade workflows
resize by creating a new pool with the target SKU and migrating workloads

Those patterns are more aligned with how AKS expects node pools to evolve over time.
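Two of these alternatives map onto single Azure CLI calls. The sketch below wraps them in hypothetical helper functions (`upgrade_node_image` and `add_replacement_pool`); the pool name, VM size, and node count passed in are placeholders chosen by the caller, not values from this article.

```bash
#!/usr/bin/env bash
# Sketch: AKS-managed alternatives to restarting nodes.
# Both function names are hypothetical helpers, not az commands.

# Roll the pool onto the latest node image without changing the Kubernetes version.
upgrade_node_image() {
  local rg="$1" cluster="$2" pool="$3"
  az aks nodepool upgrade \
    --resource-group "$rg" --cluster-name "$cluster" --nodepool-name "$pool" \
    --node-image-only
}

# Create a new pool with the target VM size; workloads are then drained
# from the old pool and the old pool is removed separately.
add_replacement_pool() {
  local rg="$1" cluster="$2" pool="$3" vm_size="$4" count="$5"
  az aks nodepool add \
    --resource-group "$rg" --cluster-name "$cluster" --nodepool-name "$pool" \
    --node-vm-size "$vm_size" --node-count "$count"
}
```

Neither call restarts a VM in place: the first replaces node images through the AKS upgrade path, and the second leaves the old pool running until workloads have been moved.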

Common Pitfalls

Managing AKS-backed VMs as if they were standalone hand-managed Azure VMs.
Restarting nodes without draining workloads first.
Forgetting to verify cluster capacity before taking a node or pool offline.
Changing VMSS state while assuming AKS autoscaling and scheduling will immediately reconcile everything cleanly.
Using direct VMSS actions as the default maintenance workflow instead of as a narrow troubleshooting tool.

Summary

AKS node pools may be backed by VMSS, but AKS is still the supported management surface.
For a whole user pool restart, prefer az aks nodepool stop and az aks nodepool start.
Drain nodes before disruptive actions so Kubernetes can reschedule pods safely.
az vmss restart can target a specific instance, but use it carefully and sparingly.
For long-term maintenance, node pool rotation or upgrade workflows are usually better than hand-managing the scale set.