How to enable GPU support in Kubernetes
Introduction
Kubernetes does not use GPUs automatically just because the node has one installed. GPU scheduling works only after the node OS, vendor drivers, container runtime, and Kubernetes device plugin are all configured correctly.
Once that is in place, GPUs appear as extended resources such as nvidia.com/gpu, and pods can request them in the container's resources section, much like CPU or memory, except that GPUs are requested in whole units. The cluster scheduler then places those pods only on nodes that advertise enough free GPUs.
What Has To Be Installed
For NVIDIA GPUs, the usual stack is:
- GPU-capable nodes
- NVIDIA drivers on those nodes
- NVIDIA Container Toolkit
- NVIDIA device plugin for Kubernetes
On managed platforms such as GKE, EKS, or AKS, some of that setup can be automated by selecting a GPU node pool image. On self-managed clusters, you usually install each layer yourself.
You can verify node capacity with:
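One way to get a per-node GPU column is kubectl's custom-columns output. The following is a minimal sketch, assuming kubectl is already configured for the cluster and that the GPUs are advertised under the NVIDIA resource name nvidia.com/gpu; the helper name gpu_capacity is only illustrative.

```shell
# List every node with its allocatable nvidia.com/gpu count.
# Assumes kubectl already points at the target cluster.
# gpu_capacity is an illustrative helper name, not a kubectl feature.
gpu_capacity() {
  # The dot in the resource name is escaped so it is not parsed as a field separator.
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Calling gpu_capacity prints one row per node, with the GPU count taken from the node's allocatable resources.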
If the GPU column is empty or shows <none>, Kubernetes does not yet see the accelerator as a schedulable resource.
Installing the Device Plugin
The device plugin is what registers GPUs with the kubelet. Without it, the node may have drivers installed but Kubernetes still cannot advertise nvidia.com/gpu.
A typical installation uses a DaemonSet supplied by NVIDIA. After deployment, check that the plugin pod is running on each GPU node:
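A hedged sketch of that check follows. The namespace and object names depend on how the plugin was installed (static manifest, Helm chart, or GPU Operator), so this version searches all namespaces and filters on the string nvidia-device-plugin, which is an assumption about the naming rather than a guarantee. The helper name plugin_status is illustrative.

```shell
# Show device plugin DaemonSets and pods, whichever namespace they live in.
# The "nvidia-device-plugin" filter is an assumed name fragment; adjust it
# if your installation names the plugin differently.
plugin_status() {
  kubectl get daemonsets --all-namespaces | grep -i 'nvidia-device-plugin'
  # -o wide adds the NODE column, so each pod can be matched to a GPU node.
  kubectl get pods --all-namespaces -o wide | grep -i 'nvidia-device-plugin'
}
```

Each GPU node should appear once in the pod list, with the pod in the Running state.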
You should also label or taint GPU nodes so general workloads do not land there by accident.
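Labeling and tainting can be done with two kubectl commands per node. In the sketch below, the label accelerator=nvidia-gpu and the taint nvidia.com/gpu=present:NoSchedule are example values chosen for illustration, and reserve_gpu_node is a hypothetical helper name; any consistent key and value will work.

```shell
# Label and taint one GPU node so only GPU workloads are scheduled there.
# The label and taint values are examples, not names Kubernetes requires.
reserve_gpu_node() {
  node="$1"
  # The label lets GPU workloads select this node class.
  kubectl label nodes "$node" accelerator=nvidia-gpu --overwrite
  # The NoSchedule taint repels pods that do not carry a matching toleration.
  kubectl taint nodes "$node" nvidia.com/gpu=present:NoSchedule --overwrite
}
```

For example, reserve_gpu_node gpu-node-1 would prepare a node named gpu-node-1 (a placeholder name).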
Requesting a GPU in a Pod
After the node is configured, a workload must explicitly request GPU resources. This example asks for one GPU and runs nvidia-smi inside the container.
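A minimal sketch of such a pod is shown here, written to a file with a heredoc. The file name gpu-smoke-test.yaml, the pod name, and the image tag are assumptions made for illustration; choose a CUDA image that is compatible with the driver on your nodes.

```shell
# Write a one-shot pod manifest that requests one GPU and runs nvidia-smi.
# File name, pod name, and image tag are illustrative assumptions.
cat > gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never          # run once and stop
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1     # one whole GPU
EOF
```

If the GPU nodes carry a taint, the pod also needs a matching toleration before it can be scheduled.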
Apply it with:
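The sketch below assumes the manifest was saved as gpu-smoke-test.yaml and that the pod is named gpu-smoke-test; both names are placeholders. It also assumes a reasonably recent kubectl, since waiting on a JSONPath condition is not available in older releases. The helper name run_gpu_smoke_test is illustrative.

```shell
# Apply the manifest, wait for the pod to finish, then print its output.
# gpu-smoke-test.yaml and the pod name gpu-smoke-test are assumed names.
run_gpu_smoke_test() {
  kubectl apply -f gpu-smoke-test.yaml &&
    # Block until the pod reports phase Succeeded, or give up after two minutes.
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      pod/gpu-smoke-test --timeout=120s &&
    kubectl logs gpu-smoke-test
}
```

If the wait step times out, kubectl describe pod gpu-smoke-test usually shows whether the pod is unschedulable or failing to start.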
If the logs show GPU information, the node, runtime, and plugin path are working together correctly.
Scheduling GPU Workloads Safely
GPU nodes are expensive, so it is common to isolate them. A straightforward pattern is:
- taint GPU nodes
- add tolerations only to GPU workloads
- use nodeSelector or affinity to target the right accelerator class
Example:
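The following sketch combines a node selector, a toleration, and a GPU limit in one pod. It assumes GPU nodes are labeled accelerator=nvidia-gpu and tainted with the key nvidia.com/gpu and effect NoSchedule; those values, along with the file name, pod name, and image tag, are illustrative and must match whatever your cluster actually uses.

```shell
# Write a pod manifest that targets labeled GPU nodes and tolerates their taint.
# Label, taint key, names, and image tag are illustrative assumptions.
cat > gpu-workload.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    accelerator: nvidia-gpu     # only nodes carrying this label
  tolerations:
    - key: nvidia.com/gpu       # matches the taint on GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```

Pods without the toleration are kept off the tainted nodes, while the node selector keeps this pod off nodes that lack the label.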
This keeps ordinary services from consuming premium nodes unnecessarily.
Operational Considerations
Enabling GPU support is not only about scheduling. You also need to think about utilization and packaging.
For machine learning jobs, the main questions are:
- does the container image include the right CUDA libraries
- does the driver version match the user-space stack
- is the batch size large enough to benefit from the GPU
- do you need one GPU per pod or several
Kubernetes will schedule the resource request, but it will not fix CUDA compatibility mistakes inside your image.
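When checking compatibility, it helps to know which driver version a node or container actually sees. A small sketch, assuming nvidia-smi is on the PATH (on a GPU node, or inside a container started through the NVIDIA runtime); gpu_driver_info is a hypothetical helper name.

```shell
# Print the GPU model and driver version as CSV, one line per GPU.
# Run on a GPU node or inside a GPU-enabled container.
gpu_driver_info() {
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

The reported driver version can then be compared against the minimum driver required by the CUDA version in the image.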
Common Pitfalls
- Installing drivers on the node but forgetting the Kubernetes device plugin.
- Requesting a GPU in requests but not in limits, or with different values in each. Extended resources must be set in limits, and requests, if present, must equal limits.
- Running a CUDA image that does not match the node driver stack.
- Assuming a GPU node pool automatically exposes nvidia.com/gpu without verification.
- Forgetting to isolate GPU nodes, causing regular workloads to waste accelerator capacity.
Summary
- Kubernetes GPU support requires node drivers, container runtime support, and a device plugin.
- GPUs are exposed as extended resources such as nvidia.com/gpu.
- Pods must explicitly request GPU resources to be scheduled onto GPU nodes.
- A simple nvidia-smi pod is the fastest smoke test for validation.
- Node labels, taints, and image compatibility matter just as much as scheduler configuration.

