---
title: "Realistic GPU/Memory Sizing for Local LLMs"
description: "How to predict whether a local LLM will actually fit: resident size is larger than the GGUF file (KV cache + runtime overhead), unified vs discrete memory behave differently, and active-vs-total params drive speed separately from capacity."
url: https://agent-zone.ai/knowledge/infrastructure/local-llm-gpu-memory-sizing/
section: knowledge
date: 2026-05-25
categories: ["infrastructure"]
tags: ["local-llm","gpu-memory","vram","unified-memory","kv-cache","moe","gguf","sizing","ollama","lm-studio"]
skills: ["gpu-memory-sizing","model-selection","capacity-planning"]
tools: ["ollama","lm-studio","nvidia-smi"]
levels: ["intermediate"]
word_count: 712
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/local-llm-gpu-memory-sizing/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/local-llm-gpu-memory-sizing/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Realistic+GPU%2FMemory+Sizing+for+Local+LLMs
---


> **Decision-first:** Budget **file size + KV(context) + overhead**, not file size — and on unified memory, subtract OS + co-resident workloads first. "Barely fits" means doesn't fit. Size memory by *total* params, speed by *active* params.

> **Scope & freshness:** General sizing principles (version-independent); worked numbers from 2026-05 on a GB10 (128 GB unified) + a 64 GB Apple-Silicon Mac. Re-measure resident sizes for your model/quant/context.

## Resident size is bigger than the file

The single most common sizing mistake is equating the model file size with how much memory it needs at runtime. Resident footprint is:

```
resident ≈ weights(quantized) + KV cache(context) + runtime/framebuffer overhead
```

Observed in practice:

- A **21 GB** GGUF (35B-A3B, Q4_K_M) occupies **~24 GB resident** at a 128k context.
- A **73 GB** model file drove system memory use to **~96 GB** at a 128k context.

The gap is the **KV cache** plus allocator/runtime overhead. Always budget the file size *plus* a context-dependent KV allowance, not just the file.
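As a quick illustration, the two observations above can be turned into a gap and a resident-to-file ratio. This is only a sketch: the helper name `resident_gap` is made up for this example, and the second figure is a system-memory reading, so it may include more than the model itself.

```python
# Observed (file_gb, resident_gb) pairs quoted above, both at a 128k context.
observations = {
    "35B-A3B Q4_K_M": (21.0, 24.0),
    "73 GB model": (73.0, 96.0),
}


def resident_gap(file_gb: float, resident_gb: float) -> tuple[float, float]:
    """Return (extra GB beyond the file, resident/file ratio)."""
    return resident_gb - file_gb, resident_gb / file_gb


for name, (file_gb, resident_gb) in observations.items():
    extra, ratio = resident_gap(file_gb, resident_gb)
    print(f"{name}: +{extra:.0f} GB beyond the file ({ratio:.2f}x)")
# 35B-A3B Q4_K_M: +3 GB beyond the file (1.14x)
# 73 GB model: +23 GB beyond the file (1.32x)
```

The ratio differs between the two models, which is why a measured resident size beats any fixed multiplier.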

### KV cache scales with context

KV-cache bytes grow roughly linearly with context length (and with layers × KV-heads × head-dim). Doubling the context window roughly doubles the KV allocation. Two levers:

- **Right-size the context.** Don't load at 128k if your tasks fit in 32k — the KV reservation is wasted memory.
- **Quantize the KV cache** (q8/q4 KV, where supported) to roughly halve/quarter it relative to a 16-bit cache, at a small quality cost.
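The linear scaling can be sketched with the usual per-token KV formula for standard attention, where every layer stores one K and one V vector per token. The model dimensions below are hypothetical and not taken from any model in this article. `bytes_per_element=2` assumes a 16-bit cache, and `1` approximates q8 while ignoring the small per-block overhead of quantized formats. Architectures that do not use full attention in every layer will come in lower.

```python
GIB = 1024**3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_element=2.0):
    """Approximate KV-cache size for standard attention.

    The leading 2 accounts for storing both K and V in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


# Hypothetical dimensions, chosen only to make the arithmetic easy to follow.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (32_768, 65_536, 131_072):
    f16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx) / GIB
    q8 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx, 1.0) / GIB
    print(f"{ctx:>7} tokens: 16-bit {f16:.1f} GiB, q8 ~{q8:.1f} GiB")
#   32768 tokens: 16-bit 4.0 GiB, q8 ~2.0 GiB
#   65536 tokens: 16-bit 8.0 GiB, q8 ~4.0 GiB
#  131072 tokens: 16-bit 16.0 GiB, q8 ~8.0 GiB
```

With these assumed dimensions, dropping from 128k to 32k frees about 12 GiB, which is more than KV quantization alone recovers at 128k.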

## Unified vs discrete memory — different failure modes

**Discrete VRAM** (e.g., a data-center GPU): VRAM is a separate, hard-capped pool. If the model + KV exceed VRAM, you OOM or spill to slow host RAM. Sizing is "does it fit in N GB of VRAM."

**Unified memory** (Apple Silicon, NVIDIA GB10): CPU and GPU share *one* pool. The model competes with the OS, other apps, and any co-resident workloads (e.g., a local k8s cluster) for the same bytes. Two consequences:

- **The GPU can't use 100% of unified memory.** macOS Metal caps the GPU working set to a fraction of total (configurable, but defaulted below 100%). A 64 GB Mac does not give a model 64 GB.
- **Co-resident workloads eat your budget.** A 64 GB Mac running a dev k8s cluster (Docker VM ~7-16 GB) + OS (~10-12 GB) leaves only ~30-40 GB of practical headroom, which is too tight to hold a 24 GB-resident model and still keep a comfortable safety margin.

## The fit formula

```
fits if:  weights + KV(ctx) + overhead  ≤  total_memory − OS − other_workloads − safety_margin
```

Worked example (64 GB Mac, hosting a cluster):
`24 (model+KV) + 7 (Docker VM, bursts to ~16) + 12 (OS) = ~43 GB used` → fits *barely* with the cluster idle, but a cold-start cluster burst (`24 + 16 + 12 = ~52 GB`) or a second model tips it into swap/compression. Treat "barely fits" as "doesn't fit."
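The fit formula is easy to script so the idle, burst, and second-model cases can be compared side by side. This is a minimal sketch using the worked-example numbers. The function name `headroom_gb` is hypothetical, and the safety margin is left at zero here, so whatever headroom remains still has to cover that margin and the GPU working-set cap.

```python
def headroom_gb(total_gb, os_gb, other_workloads_gb, model_resident_gb,
                safety_margin_gb=0.0):
    """Memory left over after everything is resident; negative means no fit."""
    budget = total_gb - os_gb - other_workloads_gb - safety_margin_gb
    return budget - model_resident_gb


scenarios = {
    "cluster idle": headroom_gb(64, 12, 7, 24),
    "cluster burst": headroom_gb(64, 12, 16, 24),
    "burst + second 24 GB model": headroom_gb(64, 12, 16, 24 + 24),
}

for name, left in scenarios.items():
    verdict = "fits on paper" if left >= 0 else "does not fit"
    print(f"{name}: {left:+.0f} GB ({verdict})")
# cluster idle: +21 GB (fits on paper)
# cluster burst: +12 GB (fits on paper)
# burst + second 24 GB model: -12 GB (does not fit)
```

Passing a non-zero `safety_margin_gb` is the simplest way to encode "barely fits means doesn't fit" in a planning script.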

## Active vs total parameters: a separate axis (speed)

For **MoE** models, total parameters drive *capacity* (memory) while **active** parameters drive *speed* (decode is bandwidth-bound — it reads active weights per token). A 35B-A3B model occupies 35B-worth of memory but decodes at ~3B-active speed. On bandwidth-limited hardware this is the difference between usable and unusable — see the GB10 guide. Size memory by total params; size throughput by active params.
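A rough way to see the speed side is a bandwidth ceiling: if decode reads every active weight once per token, tokens per second cannot exceed memory bandwidth divided by the bytes of active weights. The sketch below rests on stated assumptions. The 4.8 bits per weight is derived from the 21 GB file for 35B parameters quoted earlier (treating 1 GB as 10^9 bytes), the 250 GB/s bandwidth is a placeholder rather than a measured figure for any device, and KV-cache reads and compute time are ignored, so real throughput will be lower.

```python
def decode_ceiling_tok_s(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed when reading weights is the bottleneck."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BITS_PER_WEIGHT = 21 * 8 / 35   # 4.8, derived from the 21 GB / 35B file above
BANDWIDTH_GB_S = 250.0          # placeholder; substitute your hardware's figure

moe = decode_ceiling_tok_s(3, BITS_PER_WEIGHT, BANDWIDTH_GB_S)     # 3B active
dense = decode_ceiling_tok_s(35, BITS_PER_WEIGHT, BANDWIDTH_GB_S)  # all 35B read

print(f"3B-active ceiling: {moe:.0f} tok/s")
print(f"35B-dense ceiling: {dense:.0f} tok/s")
print(f"speed ratio: {moe / dense:.1f}x")
```

The ratio between the two ceilings is 35/3 regardless of the bandwidth assumed, which is the point: memory need follows the 35B total, decode speed follows the 3B active.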

> **Verify:** load the model at your real context and check resident size (`ollama ps`, LM Studio's loaded-instance size, or host RAM delta) — expect it to exceed the file by the KV allowance, not equal it.

## What didn't work (so you don't repeat it)

- **Trusting the file size as the memory need**: a 21 GB file was 24 GB resident at 128k; a 73 GB file was ~96 GB. Always add KV + overhead.
- **Assuming all RAM is free on a unified-memory host**: a 64 GB Mac running a dev cluster had only ~30 GB practical headroom, not 64 GB; a 24 GB model tipped it into swap.
- **Planning two large models on one device**: they essentially never co-fit; plan one-at-a-time or split hosts.

## Rules of thumb

- Budget **file size × ~1.2-1.5** for a working context, more at large contexts.
- On unified memory, subtract OS + co-resident workloads first; never assume the full RAM number.
- "Barely fits" = doesn't fit. Leave a margin for KV growth and bursts.
- Two large models on one device almost never fit — plan one-at-a-time, or split across hosts.
- For MoE models, treat speed and capacity as separate decisions: active params set decode speed, total params set the memory you need.

