GPU and ML Workloads on Kubernetes: Scheduling, Sharing, and Monitoring

Running GPU workloads on Kubernetes requires hardware-aware scheduling that the default scheduler does not provide out of the box. GPUs are expensive (an NVIDIA A100 node typically costs $3 to $12 per hour on the major cloud providers), so efficient utilization matters far more than it does for CPU workloads. This article covers the full stack, from device plugin installation through GPU sharing and monitoring.

The NVIDIA Device Plugin

Kubernetes has no native understanding of GPUs. The NVIDIA device plugin bridges that gap by exposing GPUs as a schedulable resource (nvidia.com/gpu). Without it, the scheduler has no idea which nodes have GPUs or how many are available.
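
In practice the device plugin runs as a DaemonSet on every GPU node and advertises nvidia.com/gpu as an extended resource; manifests and a Helm chart live in the NVIDIA/k8s-device-plugin repository. Once it is running, a pod asks for GPUs through its resource limits. A minimal sketch (the pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]     # prints the GPU the container was given
    resources:
      limits:
        nvidia.com/gpu: 1       # GPUs are requested via limits only
```

Two details worth noting: extended resources cannot be overcommitted, so if you set a request for nvidia.com/gpu it must equal the limit, and the count must be a whole number. A container therefore owns its GPUs exclusively unless one of the sharing mechanisms covered later is in place.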

Taints, Tolerations, and Node Affinity: Controlling Pod Placement

Pod scheduling in Kubernetes defaults to “run anywhere there is room.” In production, that is rarely what you want. GPU workloads should land on GPU nodes. System components should not compete with application pods. Nodes being drained should stop accepting new work. Taints, tolerations, and node affinity give you control over where pods run and where they do not.

Taints: Repelling Pods from Nodes

A taint is applied to a node and tells the scheduler “do not place pods here unless they explicitly tolerate this taint.” A taint has three parts: a key, an optional value, and an effect, where the effect is one of NoSchedule, PreferNoSchedule, or NoExecute.
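
For instance, you can taint every GPU node and give only GPU pods the matching toleration. A minimal sketch, with the node name and taint key chosen for illustration (some managed Kubernetes offerings apply a similar taint to GPU node pools automatically):

```yaml
# Taint the node once (shell command, not part of the manifest):
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
#
# Only pods carrying a matching toleration may now schedule there:
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: train
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal             # "Exists" would match any value for this key
    value: present
    effect: NoSchedule
```

Note that a toleration only permits scheduling onto the tainted node; it does not require it. To guarantee that GPU pods land on GPU nodes, pair the toleration with node affinity, discussed later in this article.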