Install GPU Drivers on Autoscaling Nodes in GKE Cloud Composer

Introduction

For autoscaling GPU nodes in GKE, you should not rely on logging into individual nodes and installing drivers manually. New autoscaled nodes appear and disappear, so the workable approach is to use a dedicated GPU node pool with either GKE-managed driver installation or a driver installer DaemonSet, then schedule Cloud Composer workloads onto that pool.

Start with a Dedicated GPU Node Pool

Cloud Composer workloads run on a GKE cluster, but GPU workloads should generally not share the default pool used for normal Airflow components. A separate node pool gives you:

the correct GPU machine type
autoscaling rules specific to GPU work
cleaner scheduling controls
less risk of starving Composer system components

This matches the general Composer guidance to use separate node pools for specialized workloads.
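
As a rough sketch of what such a pool looks like, the snippet below assembles a `gcloud container node-pools create` command as a Python argument list. The pool name, cluster name, region, machine type, GPU type, and autoscaling bounds are all placeholder assumptions; substitute values that match the cluster behind your Composer environment.

```python
import shlex


def build_gpu_node_pool_command(
    pool_name: str,
    cluster: str,
    region: str,
    machine_type: str = "n1-standard-4",  # assumption: a machine type that can host the GPU
    gpu_type: str = "nvidia-tesla-t4",    # assumption: example accelerator type
    gpu_count: int = 1,
    min_nodes: int = 0,
    max_nodes: int = 3,
) -> list[str]:
    """Return the argv for creating an autoscaling GPU node pool."""
    if not 0 <= min_nodes <= max_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    return [
        "gcloud", "container", "node-pools", "create", pool_name,
        "--cluster", cluster,
        "--region", region,
        "--machine-type", machine_type,
        "--accelerator", f"type={gpu_type},count={gpu_count}",
        "--enable-autoscaling",
        "--min-nodes", str(min_nodes),
        "--max-nodes", str(max_nodes),
    ]


# Placeholder names; print the command for review instead of running it.
command = build_gpu_node_pool_command("gpu-pool", "my-composer-cluster", "us-central1")
print(shlex.join(command))
```

Once the values have been reviewed, the same list can be handed to `subprocess.run(command, check=True)`. Keeping the autoscaling bounds on the GPU pool itself means GPU capacity scales independently of the pool that runs the Airflow components.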

Prefer Automatic Driver Installation When Supported

Recent GKE versions can install NVIDIA GPU drivers automatically when the driver version is requested as part of the GPU node pool's configuration. That is the cleanest answer because the setting belongs to the node pool, so every node the autoscaler creates goes through the same driver-management path.

If your cluster version or node-pool configuration supports GKE-managed GPU driver installation, use that instead of maintaining custom per-node bootstrap logic.

If automatic installation is not available for your exact setup, the fallback is a Kubernetes DaemonSet that installs drivers on matching GPU nodes.
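
One way to picture the choice: with GKE-managed installation, the driver request travels with the node pool's accelerator setting, so every node the autoscaler creates inherits it. The helper below is a hypothetical sketch that builds the value for gcloud's `--accelerator` flag. It assumes your gcloud and GKE versions accept the `gpu-driver-version` key (`default`, `latest`, or `disabled`), which is worth confirming against the GKE documentation for your release.

```python
def accelerator_flag(gpu_type: str, count: int = 1, managed_drivers: bool = True) -> str:
    """Build the value for gcloud's --accelerator flag.

    managed_drivers=True asks GKE to install its default driver version.
    managed_drivers=False disables managed installation, which leaves the
    job to a driver installer DaemonSet that you deploy yourself.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    driver_version = "default" if managed_drivers else "disabled"
    return f"type={gpu_type},count={count},gpu-driver-version={driver_version}"


# "nvidia-tesla-t4" is a placeholder accelerator type.
print(accelerator_flag("nvidia-tesla-t4"))
print(accelerator_flag("nvidia-tesla-t4", managed_drivers=False))
```

In the DaemonSet case, you would apply Google's published driver installer manifest for your node image with `kubectl apply`. The DaemonSet controller then places an installer pod on each GPU node as it joins, which is what makes the fallback compatible with autoscaling.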

Why Manual Node-by-Node Installation Fails with Autoscaling

Autoscaling changes the problem completely. A manual SSH-based driver installation covers only the node you ran it on. The autoscaler can add a fresh node at any time, node upgrades and auto-repair can recreate existing ones, and none of those nodes inherit your one-off manual changes.

That is why the driver installation mechanism must be declarative and cluster-managed.

In practice, that means:

GKE-managed driver installation where supported
or a driver installer DaemonSet targeting the GPU node pool

Schedule Composer Workloads onto the GPU Pool

Once the GPU node pool exists, run GPU-heavy Composer-launched pods there by using node selectors, tolerations, or affinity with KubernetesPodOperator.

python

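from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

gpu_task = KubernetesPodOperator(
    task_id="gpu-job",
    name="gpu-job",
    namespace="default",
    image="gcr.io/my-project/my-gpu-image:latest",
    cmds=["python", "train.py"],
    # Request one GPU; the pod only fits on nodes that advertise nvidia.com/gpu.
    container_resources=k8s.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    # Pin the pod to the GPU pool. The accelerator value is an example;
    # use the type attached to your own pool.
    node_selector={"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"},
)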

The exact scheduling details depend on your cluster labels and taints, but the core idea is to target the dedicated GPU pool rather than hoping general-purpose nodes will work.
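
To make those scheduling details concrete, here is a minimal sketch that collects a node selector and a toleration into keyword arguments for KubernetesPodOperator. It assumes the GPU nodes carry GKE's `cloud.google.com/gke-accelerator` label and a `nvidia.com/gpu` taint with the `NoSchedule` effect. The accelerator value is a placeholder, and `gpu_scheduling_kwargs` is a hypothetical helper name. Check your own pool with `kubectl describe node` before relying on either assumption.

```python
def gpu_scheduling_kwargs(accelerator_type: str) -> dict:
    """Return scheduling kwargs that pin a pod to GPU nodes of one type."""
    return {
        # Only nodes labeled with this accelerator type are eligible.
        "node_selector": {"cloud.google.com/gke-accelerator": accelerator_type},
        # Allow the pod onto nodes tainted for GPU use (assumed taint key).
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
    }


kwargs = gpu_scheduling_kwargs("nvidia-tesla-t4")  # placeholder accelerator type
print(kwargs["node_selector"])
print(kwargs["tolerations"])
```

The result can be unpacked into the operator call, for example `KubernetesPodOperator(..., **gpu_scheduling_kwargs("nvidia-tesla-t4"))`, in place of hard-coded scheduling values. Depending on the provider version, the tolerations may need to be `kubernetes.client.models.V1Toleration` objects rather than plain dictionaries.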

Validate Both Driver and Scheduling State

After setup, verify more than just node creation. You need to confirm:

the GPU nodes joined successfully
the driver installation completed
the NVIDIA device plugin is advertising GPU resources
Composer-launched pods can actually land on the GPU pool

Without that full validation, you may have nodes with accelerators attached but no usable nvidia.com/gpu resources exposed to workloads.
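
The gap between "accelerator attached" and "GPU schedulable" can be checked mechanically. The sketch below scans a node list shaped like the output of `kubectl get nodes -o json` and reports nodes that carry the accelerator label but expose no allocatable `nvidia.com/gpu`. The function name and the sample data are illustrative, and the check assumes GPU nodes are identifiable by the `cloud.google.com/gke-accelerator` label.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"
ACCELERATOR_LABEL = "cloud.google.com/gke-accelerator"  # assumed GPU node marker


def gpu_nodes_without_capacity(node_list: dict) -> list[str]:
    """Return names of accelerator-labeled nodes exposing no allocatable GPUs."""
    missing = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        if ACCELERATOR_LABEL not in metadata.get("labels", {}):
            continue  # not a GPU node, nothing to check
        allocatable = node.get("status", {}).get("allocatable", {})
        if int(allocatable.get(GPU_RESOURCE, "0")) < 1:
            missing.append(metadata.get("name", "<unnamed>"))
    return missing


# Hand-written sample in the shape of `kubectl get nodes -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "gpu-node-a",
                  "labels": {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"}},
     "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "gpu-node-b",
                  "labels": {"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"}},
     "status": {"allocatable": {"cpu": "4"}}},
    {"metadata": {"name": "cpu-node", "labels": {}},
     "status": {"allocatable": {"cpu": "2"}}}
  ]
}
""")

print(gpu_nodes_without_capacity(sample))  # prints ['gpu-node-b']
```

Against a live cluster, the same function could read `json.load(sys.stdin)` fed by `kubectl get nodes -o json`. A non-empty result shortly after scale-up may just mean the driver is still installing; a node that stays in the list points at the driver installation or the device plugin.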

Common Pitfalls

Installing drivers manually on one node and expecting autoscaled replacement nodes to behave the same way.
Running GPU workloads on the default Cloud Composer node pool instead of a dedicated GPU pool.
Forgetting that driver installation and the device plugin are separate operational concerns.
Enabling autoscaling without verifying that newly created nodes receive drivers automatically.
Checking only node creation and not whether GPU resources are actually schedulable.

Summary

For autoscaling GPU nodes, use a dedicated GPU node pool, not ad hoc per-node setup.
Prefer GKE-managed GPU driver installation when your cluster version supports it.
Otherwise, use a driver installer DaemonSet so new nodes are configured automatically.
Schedule Composer workloads to the GPU pool with Kubernetes pod scheduling controls.
Validate driver readiness and nvidia.com/gpu availability, not just node existence.