Gguf | Agent Zone

Realistic GPU/Memory Sizing for Local LLMs

May 25, 2026

Gpu-Memory-Sizing, Model-Selection, Capacity-Planning

Local-Llm, Gpu-Memory, Vram, Unified-Memory, Kv-Cache, Moe, Gguf, Sizing, Ollama, Lm-Studio

Decision-first: Budget for file size + KV cache (grows with context) + runtime overhead, not file size alone. On unified memory, subtract the OS and co-resident workloads first. “Barely fits” means doesn’t fit. Size memory by total params, speed by active params.

Scope & freshness: General sizing principles (version-independent); worked numbers from 2026-05 on a GB10 (128 GB unified) + a 64 GB Apple-Silicon Mac. Re-measure resident sizes for your model/quant/context.

Resident size is bigger than the file

The single most common sizing mistake is equating the model file size with how much memory it needs at runtime. Resident footprint is file size + KV cache (which grows with context) + runtime overhead.
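A minimal sketch of that budget, assuming a standard transformer whose KV cache stores one K and one V tensor per layer at 16-bit precision. Quantized KV caches, sliding-window attention, and runtime-specific buffers will change the numbers. The function names, the 10% headroom, and every input value below are hypothetical placeholders, not measurements from these machines.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def resident_gib(file_gib, kv_gib, overhead_gib):
    """Runtime footprint = weights on disk + KV cache + runtime overhead."""
    return file_gib + kv_gib + overhead_gib


def fits(resident, total_gib, os_gib, co_resident_gib, headroom_frac=0.10):
    """On unified memory, subtract OS and co-resident workloads first,
    then keep a margin so that 'barely fits' is treated as 'does not fit'."""
    available = total_gib - os_gib - co_resident_gib
    return resident <= available * (1 - headroom_frac)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 8192-token context.
kv_gib = kv_cache_bytes(32, 8, 128, 8192) / GIB  # 1.0 GiB at 16-bit

# Hypothetical 20 GiB file with 1.5 GiB of assumed overhead.
need = resident_gib(20.0, kv_gib, 1.5)  # 22.5 GiB

print(fits(need, total_gib=64, os_gib=8, co_resident_gib=30))  # True
print(fits(need, total_gib=64, os_gib=8, co_resident_gib=36))  # False
```

The overhead term is the hardest to predict, so treat it as an input to measure on your own runtime rather than a constant.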

Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)

May 25, 2026

Infrastructure

Intermediate, Advanced

Local-Llm-Deployment, Gpu-Memory-Sizing, Model-Runtime-Selection, Moe-Model-Selection

Gb10, Dgx-Spark, Asus-Ascent-Gx10, Local-Llm, Lm-Studio, Llama-Cpp, Gguf, Unified-Memory, Moe, Grace-Blackwell, Dcgm

Lm-Studio, Lms, Llama.cpp, Dcgm-Exporter, Ssh

Decision-first: On a GB10, pick low-active MoE models (A3B-class), serve GGUF (not MLX) via LM Studio, run one model at a time behind an OOM guard, and monitor the GPU via DCGM while reading the model footprint from system RAM, because DCGM reports no framebuffer metrics here. Dense 70B is unusable (~2-3 tok/s).

Scope & freshness: GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).
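The preference for low-active MoE models follows from a common rule of thumb: token generation is usually memory-bandwidth-bound, so each generated token costs roughly one read of the active weights. The sketch below computes that ceiling. The bandwidth figure and bytes-per-parameter value are hypothetical placeholders, not GB10 specifications, and real throughput sits below this bound.

```python
def decode_tok_per_s_ceiling(active_params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Upper bound on decode speed if every token reads all active weights once.

    Ignores KV cache reads, compute time, and runtime overhead, so measured
    throughput will be lower than this value.
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return mem_bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 280.0  # hypothetical placeholder, replace with your hardware's figure
BYTES_PER_PARAM = 1.0   # hypothetical, roughly an 8-bit quant

dense_70b = decode_tok_per_s_ceiling(70, BYTES_PER_PARAM, BANDWIDTH_GB_S)
moe_a3b = decode_tok_per_s_ceiling(3, BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"dense 70B ceiling: {dense_70b:.1f} tok/s")
print(f"3B-active MoE ceiling: {moe_a3b:.1f} tok/s")
print(f"ratio: {moe_a3b / dense_70b:.1f}x")
```

Whatever bandwidth you plug in, the ratio between the two depends only on active parameter count, which is why speed is sized by active params while memory is sized by total params.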

Serving LLMs on an Apple Silicon Mac That Also Runs a Dev Cluster

May 25, 2026

Infrastructure

Intermediate, Advanced

Mac-Llm-Hosting, Unified-Memory-Budgeting, Runtime-Selection

Apple-Silicon, Macos, Local-Llm, Ollama, Mlx, Gguf, Docker-Desktop, Minikube, Unified-Memory, Metal

Ollama, Lm-Studio, Docker-Desktop, Minikube

Decision-first: A Mac running a dev cluster is a lite-tier LLM host only (~8 GB models). It can’t hold even one large (~24 GB-resident) model alongside the cluster. Standardize on GGUF (Ollama can’t do MLX); don’t lower the Docker VM cap to “free RAM.”

Scope & freshness: 64 GB Apple-Silicon Mac running minikube/Docker Desktop, as of 2026-05-25. Numbers scale with your RAM and cluster size — re-measure, but the shape (cluster + one big model exhausts the box) holds.