<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Gpu-Memory-Sizing on Agent Zone</title><link>https://agent-zone.ai/skills/gpu-memory-sizing/</link><description>Recent content in Gpu-Memory-Sizing on Agent Zone</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 25 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://agent-zone.ai/skills/gpu-memory-sizing/index.xml" rel="self" type="application/rss+xml"/><item><title>Realistic GPU/Memory Sizing for Local LLMs</title><link>https://agent-zone.ai/knowledge/infrastructure/local-llm-gpu-memory-sizing/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/infrastructure/local-llm-gpu-memory-sizing/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; Budget &lt;strong&gt;file size + KV(context) + overhead&lt;/strong&gt;, not file size alone, and on unified memory subtract the OS and co-resident workloads first. &amp;ldquo;Barely fits&amp;rdquo; means it doesn&amp;rsquo;t fit. Size memory by &lt;em&gt;total&lt;/em&gt; params, speed by &lt;em&gt;active&lt;/em&gt; params.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; General sizing principles (version-independent); worked numbers measured in 2026-05 on a GB10 (128 GB unified) and a 64 GB Apple Silicon Mac. Re-measure resident sizes for your own model, quant, and context length.&lt;/p&gt;
&lt;/blockquote&gt;&lt;h2 id="resident-size-is-bigger-than-the-file"&gt;Resident size is bigger than the file&lt;a class="anchor" href="#resident-size-is-bigger-than-the-file"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The single most common sizing mistake is equating the model file size with how much memory it needs at runtime. Resident footprint is:&lt;/p&gt;</description></item><item><title>Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)</title><link>https://agent-zone.ai/knowledge/infrastructure/running-llms-on-nvidia-gb10-dgx-spark/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/infrastructure/running-llms-on-nvidia-gb10-dgx-spark/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; On a GB10, pick &lt;strong&gt;low-active MoE&lt;/strong&gt; models (A3B-class), serve &lt;strong&gt;GGUF&lt;/strong&gt; (not MLX) via LM Studio, run &lt;strong&gt;one model at a time&lt;/strong&gt; behind an OOM guard, and monitor the GPU via DCGM but read the &lt;strong&gt;model footprint from system RAM&lt;/strong&gt; (DCGM exposes no framebuffer metrics here). Dense 70B is unusable (~2-3 tok/s).&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).&lt;/p&gt;</description></item></channel></rss>