How to Enable GPU Support in Kubernetes

Introduction

Kubernetes does not use GPUs automatically just because the node has one installed. GPU scheduling works only after the node OS, vendor drivers, container runtime, and Kubernetes device plugin are all configured correctly.

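Once that is in place, GPUs appear as extended resources such as nvidia.com/gpu, and pods request them in the container's resources section alongside CPU and memory. Unlike CPU, a GPU request must be a whole number and is set under limits. The cluster scheduler then places those pods only on nodes with enough unallocated GPUs.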

What Has To Be Installed

For NVIDIA GPUs, the usual stack is:

GPU-capable nodes
NVIDIA drivers on those nodes
NVIDIA Container Toolkit
NVIDIA device plugin for Kubernetes

On managed platforms such as GKE, EKS, or AKS, some of that setup can be automated by selecting a GPU node pool image. On self-managed clusters, you usually install each layer yourself.
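
On a self-managed node, one quick way to see which host-side layers are present is to look for their command-line tools. The sketch below assumes the driver and the NVIDIA Container Toolkit are installed directly on the host, so that nvidia-smi and nvidia-ctk are on the PATH. The helper name check_tools_on_path is hypothetical, and if your driver is delivered some other way the check may report it missing even though the stack works.

```shell
# Hypothetical helper: report which of the named tools are on the PATH.
# Returns non-zero if any of them is missing.
check_tools_on_path() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Run this on a GPU node, not on your workstation:
# check_tools_on_path nvidia-smi nvidia-ctk
```

A missing tool only tells you where to start looking. It does not prove the layer is misconfigured, and a found tool does not prove the container runtime has been wired up to use it.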

You can verify node capacity with:

bash

kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'

If the GPU column shows <none>, Kubernetes does not yet see the accelerator as a schedulable resource.
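
On a larger cluster it can help to filter that table down to the nodes that are not advertising a GPU. The following is a minimal sketch that reads the two-column output of the command above on standard input; the function name nodes_without_gpu is hypothetical, and it assumes kubectl prints <none> for a missing field.

```shell
# Hypothetical filter: print the names of nodes whose GPU column is <none> or 0.
# Expects the NAME/GPU table produced by the custom-columns query on stdin.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Usage against a live cluster:
# kubectl get nodes \
#   -o 'custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu' \
#   | nodes_without_gpu
```

Nodes that were never meant to carry a GPU will show up in the list too, so compare the result against the nodes you expect to be in your GPU pool.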

Installing the Device Plugin

The device plugin is what registers GPUs with the kubelet. Without it, the node may have drivers installed but Kubernetes still cannot advertise nvidia.com/gpu.

A typical installation uses a DaemonSet supplied by NVIDIA. After deployment, check that the plugin pod is running on each GPU node:

bash

kubectl get pods -n kube-system | grep nvidia
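
Rather than reading the pod list by eye, you can ask kubectl to wait until the DaemonSet has rolled out on every targeted node. This is a sketch under assumptions: the default namespace and DaemonSet name below are the ones commonly used by NVIDIA's static manifest, and other installation methods may use different names, so check yours and pass them as arguments. The function name wait_for_device_plugin is hypothetical.

```shell
# Hypothetical helper: block until the device plugin DaemonSet is fully rolled out.
# Assumed defaults: namespace kube-system, DaemonSet nvidia-device-plugin-daemonset.
# Override both if your installation uses different names.
wait_for_device_plugin() {
  local namespace="${1:-kube-system}"
  local daemonset="${2:-nvidia-device-plugin-daemonset}"
  kubectl rollout status "daemonset/${daemonset}" \
    --namespace "$namespace" --timeout=120s
}

# Usage:
# wait_for_device_plugin
# wait_for_device_plugin my-namespace my-daemonset-name
```

If the command times out, describing the DaemonSet's pods is usually the next step, since a crash-looping plugin pod often points to a missing driver or runtime configuration on that node.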

You should also label or taint GPU nodes so general workloads do not land there by accident.

Requesting a GPU in a Pod

After the node is configured, a workload must explicitly request GPU resources. This example asks for one GPU and runs nvidia-smi inside the container.

yaml

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Apply it with:

bash

kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test

If the logs show GPU information, the node, runtime, and plugin path are working together correctly.
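
Running kubectl logs immediately after kubectl apply can fail because the container has not started yet. A slightly more robust sketch waits for the pod to finish first and falls back to kubectl describe when it does not. It assumes kubectl 1.23 or newer, which added jsonpath conditions to kubectl wait; the function name run_gpu_smoke_check is hypothetical.

```shell
# Hypothetical helper: wait for the smoke-test pod to succeed, then print its logs.
# Assumes kubectl >= 1.23 (jsonpath support in `kubectl wait`).
run_gpu_smoke_check() {
  local pod="${1:-gpu-smoke-test}"
  if kubectl wait "pod/${pod}" \
      --for=jsonpath='{.status.phase}'=Succeeded --timeout=180s; then
    kubectl logs "$pod"
  else
    # A pod stuck in Pending usually means no node could satisfy
    # the nvidia.com/gpu request; the events explain why.
    kubectl describe pod "$pod"
    return 1
  fi
}

# Usage, after `kubectl apply -f gpu-smoke-test.yaml`:
# run_gpu_smoke_check
```

The events section of the describe output is typically where an unsatisfiable GPU request or an untolerated taint shows up.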

Scheduling GPU Workloads Safely

GPU nodes are expensive, so it is common to isolate them. A straightforward pattern is:

taint GPU nodes
add tolerations only to GPU workloads
use nodeSelector or affinity to target the right accelerator class

Example:

yaml

spec:
  nodeSelector:
    accelerator: nvidia
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"

This keeps ordinary services from consuming premium nodes unnecessarily.
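
The pod spec above only works if the node carries the matching label and taint. A minimal sketch of the node-side commands follows; the label accelerator=nvidia and the taint gpu=true are this example's own conventions rather than names Kubernetes defines, and the function name isolate_gpu_node is hypothetical.

```shell
# Hypothetical helper: label and taint one node so it matches the
# nodeSelector (accelerator: nvidia) and toleration (gpu=true:NoSchedule)
# used in the pod spec example.
isolate_gpu_node() {
  local node="$1"
  kubectl label node "$node" accelerator=nvidia --overwrite
  kubectl taint node "$node" gpu=true:NoSchedule --overwrite
}

# Usage:
# isolate_gpu_node my-gpu-node-1
```

The NoSchedule effect only blocks new pods without the toleration. Pods already running on the node stay where they are until they are rescheduled.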

Operational Considerations

Enabling GPU support is not only about scheduling. You also need to think about utilization and packaging.

For machine learning jobs, the main questions are:

Does the container image include the right CUDA libraries?
Does the driver version match the user-space stack?
Is the batch size large enough to benefit from the GPU?
Do you need one GPU per pod or several?

Kubernetes will schedule the resource request, but it will not fix CUDA compatibility mistakes inside your image.
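
One way to catch a driver mismatch early is to compare the node's driver version with the minimum your CUDA image needs. The sketch below is an illustration only: driver_at_least is a hypothetical helper, it relies on GNU sort -V for version ordering, and the minimum version must come from NVIDIA's compatibility documentation for your CUDA release rather than from this example.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Uses GNU `sort -V` to order dotted version strings.
driver_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a GPU node. MIN_DRIVER is a value you look up for your CUDA version.
# current="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
# driver_at_least "$current" "$MIN_DRIVER" \
#   || echo "driver $current is older than $MIN_DRIVER"
```

The same comparison can run as a pre-flight step in a job's entrypoint, so that an incompatible node fails fast instead of partway through a training run.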

Common Pitfalls

Installing drivers on the node but forgetting the Kubernetes device plugin.
Setting a GPU under requests without a matching limits entry. Extended resources must be specified in limits, and if requests is also set, the two values must be equal.
Running a CUDA image that does not match the node driver stack.
Assuming a GPU node pool automatically exposes nvidia.com/gpu without verification.
Forgetting to isolate GPU nodes, causing regular workloads to waste accelerator capacity.

Summary

Kubernetes GPU support requires node drivers, container runtime support, and a device plugin.
GPUs are exposed as extended resources such as nvidia.com/gpu.
Pods must explicitly request GPU resources to be allocated a GPU.
A simple nvidia-smi pod is the fastest smoke test for validation.
Node labels, taints, and image compatibility matter just as much as scheduler configuration.