Realistic GPU/Memory Sizing for Local LLMs

May 25, 2026

Gpu-Memory-Sizing, Model-Selection, Capacity-Planning

Local-Llm, Gpu-Memory, Vram, Unified-Memory, Kv-Cache, Moe, Gguf, Sizing, Ollama, Lm-Studio

Decision-first: Budget for file size + KV cache (which grows with context) + runtime overhead, not file size alone. On unified memory, subtract the OS and co-resident workloads first. “Barely fits” means it doesn’t fit. Size memory by total params, speed by active params.

Scope & freshness: General sizing principles (version-independent); worked numbers from 2026-05 on a GB10 (128 GB unified) + a 64 GB Apple-Silicon Mac. Re-measure resident sizes for your model/quant/context.

Resident size is bigger than the file

The single most common sizing mistake is equating the model file size with how much memory it needs at runtime. Resident footprint is:
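Going by the summary at the top (file size + KV(context) + overhead, minus OS and co-resident workloads on unified memory), a rough budget check could be sketched as below. This is a minimal sketch under stated assumptions: the KV-cache estimate uses the standard transformer formula (keys and values stored per layer, per KV head, per token), the function names and the 10% headroom margin are hypothetical choices for illustration, and every number in the example is made up. Measure the real resident size for your own model, quant, and context.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """Estimate KV-cache size for a standard attention stack.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16/bf16 cache (an assumption; quantized
    caches are smaller).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem

def fits(file_gib, kv_gib, overhead_gib, total_gib, reserved_gib, headroom=0.10):
    """Return True only if the model fits with margin to spare.

    reserved_gib: OS plus co-resident workloads (matters on unified memory).
    headroom: fraction of usable memory kept free, so "barely fits" is
    reported as not fitting. The 10% default is an illustrative assumption.
    """
    usable = (total_gib - reserved_gib) * (1.0 - headroom)
    needed = file_gib + kv_gib + overhead_gib
    return needed <= usable

# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 8192-token context.
kv_gib = kv_cache_bytes(32, 8, 128, 8192) / GIB

# Hypothetical 64 GiB unified-memory machine with 16 GiB reserved for the OS and other apps.
print(f"KV cache: {kv_gib:.2f} GiB")
print("40 GiB file fits:", fits(40, kv_gib, 2, total_gib=64, reserved_gib=16))
print("44 GiB file fits:", fits(44, kv_gib, 2, total_gib=64, reserved_gib=16))
```

The second call shows why file size alone misleads: a 44 GiB file is smaller than the 48 GiB left after the reservation, yet it fails once KV cache, overhead, and headroom are counted.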

GPU and ML Workloads on Kubernetes: Scheduling, Sharing, and Monitoring

February 22, 2026

Kubernetes

Intermediate

Gpu-Scheduling, Ml-Infrastructure, Resource-Management, Workload-Isolation, Gpu-Monitoring

Gpu, Nvidia, Machine-Learning, Device-Plugin, Mig, Time-Slicing, Mps, Cuda, Node-Affinity, Taints, Dcgm

Kubectl, Nvidia-Smi, Helm, Dcgm-Exporter, Prometheus, Grafana

GPU and ML Workloads on Kubernetes

Running GPU workloads on Kubernetes requires hardware-aware scheduling that the default scheduler does not provide out of the box. GPUs are expensive – an NVIDIA A100 node costs $3-12/hour on cloud providers – so efficient utilization matters far more than with CPU workloads. This article covers the full stack from device plugin installation through GPU sharing and monitoring.

The NVIDIA Device Plugin

Kubernetes has no native understanding of GPUs. The NVIDIA device plugin bridges that gap by exposing GPUs as a schedulable resource (nvidia.com/gpu). Without it, the scheduler has no idea which nodes have GPUs or how many are available.
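Once the device plugin advertises nvidia.com/gpu, a pod claims a GPU by listing that resource under its container limits. The following is a minimal sketch that builds such a pod manifest in Python and prints it as JSON. The helper name `gpu_pod_manifest`, the pod name, and the image reference are hypothetical placeholders, and running `nvidia-smi` as the command assumes the node's NVIDIA container runtime makes that binary available inside the container.

```python
import json

def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest that requests whole GPUs via nvidia.com/gpu.

    Extended resources such as nvidia.com/gpu are specified under limits;
    the scheduler then places the pod only on a node advertising enough
    of that resource.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # placeholder image, substitute your own
                    "command": ["nvidia-smi"],  # assumes the runtime exposes nvidia-smi
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }

# Hypothetical names for illustration only.
manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example.com/cuda-base:latest")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped to `kubectl apply -f -`. Without the device plugin installed, a pod like this stays Pending, since no node reports the nvidia.com/gpu resource.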