GPU and ML Workloads on Kubernetes: Scheduling, Sharing, and Monitoring

GPU and ML Workloads on Kubernetes

Running GPU workloads on Kubernetes requires hardware-aware scheduling that the default scheduler does not provide out of the box. GPUs are expensive – an NVIDIA A100 node costs $3-12/hour on cloud providers – so efficient utilization matters far more than with CPU workloads. This article covers the full stack from device plugin installation through GPU sharing and monitoring.

The NVIDIA Device Plugin

Kubernetes has no native understanding of GPUs. The NVIDIA device plugin bridges that gap by exposing GPUs as a schedulable resource (nvidia.com/gpu). Without it, the scheduler has no idea which nodes have GPUs or how many are available.
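
Once the device plugin is running, a pod claims a GPU by listing nvidia.com/gpu under its container resource limits. The following is a minimal sketch that builds such a manifest as a Python dict and prints it as JSON, which kubectl apply -f accepts alongside YAML. The pod name, container name, and image are placeholders and do not come from this article.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole GPUs via nvidia.com/gpu."""
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",  # placeholder container name
                    "image": image,
                    # Extended resources are requested in whole units.
                    # Setting only the limit is enough: the request
                    # defaults to the same value.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder name and image, for illustration only.
pod = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-job:latest")
print(json.dumps(pod, indent=2))
```

If no node advertises nvidia.com/gpu, a pod like this stays Pending, which is a quick way to confirm whether the device plugin is actually registered.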

Kubernetes Namespace Organization: Strategies That Actually Work

Kubernetes Namespace Organization

Namespaces are Kubernetes’ primary mechanism for dividing a cluster among teams, applications, and environments. Getting the strategy right early saves significant pain later. Getting it wrong means RBAC tangles, resource contention, and deployment confusion.

Strategy 1: Per-Team Namespaces

Each team gets a namespace (team-platform, team-payments, team-frontend). All applications owned by that team deploy into it.

When it works: Clear team boundaries with shared responsibility for multiple services.
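
As a rough illustration of the per-team layout, the sketch below generates one Namespace manifest per team and wraps them in a v1 List that kubectl apply -f can consume as JSON. The team names come from the examples above; the "team" label key is a hypothetical convention, not something this article prescribes.

```python
import json

# Team names taken from the example namespaces above.
TEAMS = ["platform", "payments", "frontend"]


def team_namespace(team: str) -> dict:
    """Build a Namespace manifest named team-<team>."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": f"team-{team}",
            # Hypothetical label key; useful later as a selector target.
            "labels": {"team": team},
        },
    }


manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [team_namespace(t) for t in TEAMS],
}
print(json.dumps(manifest, indent=2))
```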

Namespace Strategy and Multi-Tenancy: Isolation, Quotas, and Policies

Namespace Strategy and Multi-Tenancy

Namespaces are the foundation for isolating workloads in a shared Kubernetes cluster. Without a deliberate strategy, teams deploy into arbitrary namespaces, resource consumption is unbounded, and one misbehaving application can take down the entire cluster.

Why Namespaces Matter

Namespaces provide four isolation boundaries:

  • RBAC scoping: Roles and RoleBindings are namespace-scoped, so you can grant teams access to their namespaces only.
  • Resource quotas: Limit CPU, memory, and object counts per namespace, preventing one team from starving others.
  • Network policies: Restrict traffic between namespaces so a compromised application cannot reach services it should not.
  • Organizational clarity: kubectl get pods -n payments-prod shows exactly what you expect, not a jumble of unrelated workloads.
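
The first three boundaries in the list above each map to a namespaced object. The sketch below builds one of each for a single tenant namespace: a RoleBinding to the built-in edit ClusterRole, a ResourceQuota, and a NetworkPolicy that admits ingress only from pods in the same namespace. The group name, object names, and quota figures are illustrative assumptions, and the NetworkPolicy only takes effect if the cluster's network plugin enforces policies.

```python
import json


def tenant_baseline(namespace: str, group: str) -> list[dict]:
    """Build RBAC, quota, and network isolation objects for one namespace."""

    def meta(name: str) -> dict:
        return {"name": name, "namespace": namespace}

    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": meta("team-edit"),  # hypothetical object name
        "subjects": [
            {
                "kind": "Group",
                "name": group,
                "apiGroup": "rbac.authorization.k8s.io",
            }
        ],
        # "edit" is one of the default user-facing ClusterRoles; binding it
        # through a RoleBinding limits its effect to this namespace.
        "roleRef": {
            "kind": "ClusterRole",
            "name": "edit",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": meta("compute-quota"),  # hypothetical object name
        "spec": {
            # Illustrative ceilings only; size these from real usage data.
            "hard": {
                "requests.cpu": "10",
                "requests.memory": "20Gi",
                "limits.cpu": "20",
                "limits.memory": "40Gi",
                "pods": "50",
            }
        },
    }
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": meta("same-namespace-only"),  # hypothetical object name
        "spec": {
            "podSelector": {},  # applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            # An empty podSelector with no namespaceSelector matches only
            # pods in this policy's own namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [role_binding, quota, network_policy]


items = tenant_baseline("payments-prod", "payments-devs")
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Note that once a quota constrains CPU or memory, pods in that namespace are rejected unless they declare the corresponding requests and limits, so a quota like this is usually paired with sensible defaults.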

System Namespaces

These exist in every cluster and should be off-limits to application teams:

Emulating Production Namespace Organization in Minikube

Setting up namespaces locally the same way you organize them in production builds muscle memory for real operations. When your local cluster mirrors production namespace structure, you catch RBAC misconfigurations, resource limit issues, and network policy gaps before they reach staging. It also means your Helm values files, Kustomize overlays, and deployment scripts work identically across environments.

Why Bother Locally

By default, everything in minikube gets deployed into the default namespace. This teaches bad habits. Developers forget -n flags, RBAC issues are never caught, resource contention is never simulated, and the first time anyone encounters namespace isolation is in production – where the consequences are real.

Kubernetes Resource Management: QoS Classes, Eviction, OOM Scoring, and Capacity Planning

Kubernetes Resource Management Deep Dive

Resource management in Kubernetes is the mechanism that decides which pods get scheduled, which pods get killed when a node runs low on resources, and how much CPU and memory each container is actually allowed to use. The surface-level concept of requests and limits is straightforward. The underlying mechanics – QoS classification, CFS CPU quotas, kernel OOM scoring, kubelet eviction thresholds – are where misconfigurations cause production outages.
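
To make QoS classification concrete, here is a simplified sketch of the rule Kubernetes applies to a pod's CPU and memory settings. It is not the kubelet's code: it treats each container as a plain dict, compares quantities as strings (so "500m" and "0.5" would wrongly count as different), and ignores pod-level details beyond the container list passed in.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    Each container is a dict with optional "requests" and "limits" maps,
    e.g. {"requests": {"cpu": "250m"}, "limits": {"cpu": "500m"}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A request that is omitted defaults to the matching limit.
        requests = {**limits, **container.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource in requests or resource in limits:
                any_set = True
            # Guaranteed needs a limit for both resources in every
            # container, with the request equal to that limit.
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pinned = {"requests": {"cpu": "500m", "memory": "1Gi"},
          "limits": {"cpu": "500m", "memory": "1Gi"}}
print(qos_class([pinned]))                          # Guaranteed
print(qos_class([{"requests": {"cpu": "100m"}}]))   # Burstable
print(qos_class([{}]))                              # BestEffort
```

The practical consequence is that a single container without a memory limit is enough to drop the whole pod out of the Guaranteed class, which in turn changes how it is treated under node memory pressure.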