Install GPU Driver on autoscaling Node in GKE Cloud Composer
Introduction
For autoscaling GPU nodes in GKE, you should not rely on logging into individual nodes and installing drivers manually. New autoscaled nodes appear and disappear, so the workable approach is to use a dedicated GPU node pool with either GKE-managed driver installation or a driver installer DaemonSet, then schedule Cloud Composer workloads onto that pool.
Start with a Dedicated GPU Node Pool
Cloud Composer workloads run on a GKE cluster, but GPU workloads should generally not share the default pool used for normal Airflow components. A separate node pool gives you:
- the correct GPU machine type
- autoscaling rules specific to GPU work
- cleaner scheduling controls
- less risk of starving Composer system components
This matches the general Composer guidance to use separate node pools for specialized workloads.
Prefer Automatic Driver Installation When Supported
Recent GKE versions can install NVIDIA GPU drivers automatically on supported GPU node pools when a driver version is selected at node pool creation. That is the cleanest answer because every new autoscaled GPU node comes up through the same driver-management path, with no per-node action required.
If your cluster version or node-pool configuration supports GKE-managed GPU driver installation, use that instead of maintaining custom per-node bootstrap logic.
If automatic installation is not available for your exact setup, the fallback is a Kubernetes DaemonSet that installs drivers on matching GPU nodes.
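As a minimal sketch of the managed path, the helper below composes a gcloud command that creates an autoscaling GPU node pool and asks GKE to manage the driver through the gpu-driver-version field of the --accelerator flag. The pool, cluster, and region names, the machine type, and the GPU type are placeholder assumptions, and the helper only builds the command; it does not run it. Check that your GKE version accepts gpu-driver-version before relying on it.

```python
import shlex
from typing import List


def gpu_node_pool_command(
    pool: str,
    cluster: str,
    region: str,
    gpu_type: str = "nvidia-tesla-t4",  # assumption: pick the GPU you actually need
    gpu_count: int = 1,
    max_nodes: int = 3,
) -> List[str]:
    """Compose (not execute) a gcloud call for an autoscaling GPU node pool."""
    # gpu-driver-version=default asks GKE to install its default driver
    # on every node the autoscaler creates in this pool.
    accelerator = f"type={gpu_type},count={gpu_count},gpu-driver-version=default"
    return [
        "gcloud", "container", "node-pools", "create", pool,
        "--cluster", cluster,
        "--region", region,
        "--machine-type", "n1-standard-4",  # assumption: a type compatible with the GPU
        "--accelerator", accelerator,
        "--enable-autoscaling",
        "--min-nodes", "0",  # lets the pool scale to zero when idle
        "--max-nodes", str(max_nodes),
    ]


if __name__ == "__main__":
    # Hypothetical names, printed as a shell-safe command line for review.
    cmd = gpu_node_pool_command("gpu-pool", "my-composer-cluster", "us-central1")
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; in a regional cluster the minimum and maximum node counts apply per zone, so size them accordingly.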
Why Manual Node-by-Node Installation Fails with Autoscaling
Autoscaling changes the problem completely. A manual SSH-based driver installation works only until a new node is created. The autoscaler can add a fresh node at any time, and that node will not inherit your one-off manual changes.
That is why the driver installation mechanism must be declarative and cluster-managed.
In practice, that means:
- GKE-managed driver installation where supported
- or a driver installer DaemonSet targeting the GPU node pool
Schedule Composer Workloads onto the GPU Pool
Once the GPU node pool exists, run GPU-heavy Composer-launched pods there by using node selectors, tolerations, or affinity with KubernetesPodOperator.
The exact scheduling details depend on your cluster labels and taints, but the core idea is to target the dedicated GPU pool rather than hoping general-purpose nodes will work.
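The scheduling controls can be sketched as a plain pod spec fragment. The example assumes the pool is named gpu-pool, that nodes carry the cloud.google.com/gke-nodepool label, and that GPU nodes are tainted with the key nvidia.com/gpu and effect NoSchedule; confirm both on your own nodes before copying them. The function name and the image are hypothetical. KubernetesPodOperator exposes comparable scheduling arguments, but their exact names and types depend on the provider version, so this sketch stays at the level of the Kubernetes manifest.

```python
import json
from typing import Any, Dict


def gpu_pod_spec_fragment(pool_name: str, image: str, gpu_count: int = 1) -> Dict[str, Any]:
    """Build the scheduling-related part of a pod spec for a GPU task."""
    return {
        # Pin the pod to the dedicated GPU node pool.
        "nodeSelector": {"cloud.google.com/gke-nodepool": pool_name},
        # Tolerate the GPU taint (assumed key and effect, verify on your nodes).
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "containers": [
            {
                "name": "gpu-task",
                "image": image,
                # Requesting nvidia.com/gpu is what actually reserves a GPU.
                "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical pool and image names.
    fragment = gpu_pod_spec_fragment("gpu-pool", "example.com/my-gpu-image:latest")
    print(json.dumps(fragment, indent=2))
```

The node selector and toleration only steer placement. Without the nvidia.com/gpu limit the pod can land on a GPU node and still have no device attached.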
Validate Both Driver and Scheduling State
After setup, verify more than just node creation. You need to confirm:
- the GPU nodes joined successfully
- the driver installation completed
- the NVIDIA device plugin is advertising GPU resources
- Composer-launched pods can actually land on the GPU pool
Without that full validation, you may have nodes with accelerators attached but no usable nvidia.com/gpu resources exposed to workloads.
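One way to check the third item is to inspect the allocatable resources each node reports. The sketch below assumes you have saved the output of kubectl get nodes -o json to a file and that the pool is identified by the cloud.google.com/gke-nodepool label; the function names, the file name, and the sample data are illustrative only.

```python
import json
from typing import Any, Dict, List

GPU_RESOURCE = "nvidia.com/gpu"
POOL_LABEL = "cloud.google.com/gke-nodepool"


def gpu_capacity_by_node(node_list: Dict[str, Any], pool_name: str) -> Dict[str, int]:
    """Map each node in the pool to its allocatable GPU count (0 if none advertised)."""
    capacity: Dict[str, int] = {}
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        if metadata.get("labels", {}).get(POOL_LABEL) != pool_name:
            continue  # skip nodes from other pools
        allocatable = node.get("status", {}).get("allocatable", {})
        capacity[metadata["name"]] = int(allocatable.get(GPU_RESOURCE, "0"))
    return capacity


def nodes_missing_gpus(node_list: Dict[str, Any], pool_name: str) -> List[str]:
    """Return pool nodes that joined the cluster but expose no schedulable GPUs."""
    capacity = gpu_capacity_by_node(node_list, pool_name)
    return sorted(name for name, count in capacity.items() if count == 0)


if __name__ == "__main__":
    # Assumed workflow: kubectl get nodes -o json > nodes.json
    with open("nodes.json") as handle:
        nodes = json.load(handle)
    print(gpu_capacity_by_node(nodes, "gpu-pool"))
    print("missing GPUs:", nodes_missing_gpus(nodes, "gpu-pool"))
```

A node that appears in the missing list has joined the pool but is not yet advertising GPUs, which usually points at the driver installation or the device plugin rather than at the autoscaler. Freshly created nodes can show zero for a short time while installation finishes, so recheck before treating it as a failure.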
Common Pitfalls
- Installing drivers manually on one node and expecting autoscaled replacement nodes to behave the same way.
- Running GPU workloads on the default Cloud Composer node pool instead of a dedicated GPU pool.
- Forgetting that driver installation and the device plugin are separate operational concerns.
- Enabling autoscaling without verifying that newly created nodes receive drivers automatically.
- Checking only node creation and not whether GPU resources are actually schedulable.
Summary
- For autoscaling GPU nodes, use a dedicated GPU node pool, not ad hoc per-node setup.
- Prefer GKE-managed GPU driver installation when your cluster version supports it.
- Otherwise, use a driver installer DaemonSet so new nodes are configured automatically.
- Schedule Composer workloads to the GPU pool with Kubernetes pod scheduling controls.
- Validate driver readiness and nvidia.com/gpu availability, not just node existence.

