Realistic GPU/Memory Sizing for Local LLMs

Decision-first: Budget for file size + KV cache (context) + overhead, not file size alone. On unified memory, subtract the OS and co-resident workloads first. “Barely fits” means it doesn’t fit. Size memory by total params, speed by active params.

Scope & freshness: General sizing principles (version-independent); worked numbers from 2026-05 on a GB10 (128 GB unified) + a 64 GB Apple-Silicon Mac. Re-measure resident sizes for your model/quant/context.

Resident size is bigger than the file

The single most common sizing mistake is equating the model file size with how much memory the model needs at runtime. Resident footprint is the file size plus the KV cache (which grows with context length) plus runtime overhead.
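
The budgeting rule stated at the top (file size + KV cache + overhead, with OS and co-resident workloads subtracted first on unified memory) can be sketched in a few lines. This is a minimal sketch, not a measured model: the KV-cache formula assumes a standard transformer with an unquantized per-layer K and V cache, the 10% headroom is an assumed way to encode "barely fits means doesn't fit", and every number in the usage lines is hypothetical. Re-measure resident sizes for your own model, quant, and context.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence.

    Assumption: standard attention storing one K and one V tensor per layer
    (hence the factor 2), with bytes_per_elem=2 for fp16/bf16 cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def resident_gib(file_gib, kv_gib, overhead_gib):
    """Resident footprint = model file + KV cache + runtime overhead."""
    return file_gib + kv_gib + overhead_gib


def fits(resident, total_gib, reserved_gib=0.0, headroom=0.10):
    """Return True only if the footprint fits with headroom to spare.

    reserved_gib: memory held by the OS and co-resident workloads
                  (relevant on unified memory).
    headroom:     assumed safety margin; 0.10 is an illustrative default.
    """
    usable = total_gib - reserved_gib
    return resident <= usable * (1.0 - headroom)


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 8192-token context.
    kv = kv_cache_bytes(32, 8, 128, 8192) / GIB
    # Hypothetical 40 GiB file and 2 GiB overhead on a 64 GiB unified-memory
    # machine with 16 GiB reserved for the OS and other workloads.
    need = resident_gib(file_gib=40.0, kv_gib=kv, overhead_gib=2.0)
    print(f"KV cache: {kv:.2f} GiB, resident: {need:.2f} GiB, fits: {fits(need, 64, 16)}")
```

The check deliberately compares against usable memory minus headroom rather than against total memory, so a footprint that lands within a few percent of the limit is reported as not fitting.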

GPU and ML Workloads on Kubernetes: Scheduling, Sharing, and Monitoring

GPU and ML Workloads on Kubernetes

Running GPU workloads on Kubernetes requires hardware-aware scheduling that the default scheduler does not provide out of the box. GPUs are expensive (an NVIDIA A100 node costs roughly $3-12/hour on cloud providers), so efficient utilization matters far more than with CPU workloads. This article covers the full stack from device plugin installation through GPU sharing and monitoring.

The NVIDIA Device Plugin

Kubernetes has no native understanding of GPUs. The NVIDIA device plugin bridges that gap by exposing GPUs as a schedulable resource (nvidia.com/gpu). Without it, the scheduler has no idea which nodes have GPUs or how many are available.
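
Once the device plugin is running, a workload asks for GPUs the same way it asks for any other extended resource: through the container's resource limits. The sketch below builds such a Pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The pod name and image are hypothetical placeholders, and running nvidia-smi assumes the image ships the NVIDIA tools.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests whole GPUs.

    The GPU count goes under resources.limits with the nvidia.com/gpu key,
    the resource name exposed by the NVIDIA device plugin.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image; substitute your own
                    "command": ["nvidia-smi"],
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-base:latest")
    print(json.dumps(manifest, indent=2))
```

Saved as a script (for example under the hypothetical name gpu_pod.py), the output can be piped to `kubectl apply -f -`. On a cluster without the device plugin, a pod like this stays Pending because no node advertises nvidia.com/gpu.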