<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Lm-Studio on Agent Zone</title><link>https://agent-zone.ai/tools/lm-studio/</link><description>Recent content in Lm-Studio on Agent Zone</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 25 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://agent-zone.ai/tools/lm-studio/index.xml" rel="self" type="application/rss+xml"/><item><title>An End-to-End Workflow for Evaluating &amp; Tuning Local LLMs for Agents</title><link>https://agent-zone.ai/knowledge/agent-tooling/local-llm-evaluation-workflow/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/agent-tooling/local-llm-evaluation-workflow/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; Follow this order and you&amp;rsquo;ll have a deployable model + tuned config in days, not weeks: (1) scope the hardware, (2) shortlist by &lt;em&gt;active&lt;/em&gt; params, (3) build a per-model OFAT (one-factor-at-a-time) matrix, (4) run &lt;strong&gt;serially&lt;/strong&gt; with an OOM guard (&lt;strong&gt;smoke test first&lt;/strong&gt;), (5) write a finding card per model, (6) decide. The expensive mistakes are skipping the smoke step, sweeping more than one factor at once, and trusting a single run.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; Process is model/hardware-independent; the worked numbers are from a 2026-05 effort on a GB10 (128 GB) + an Apple-Silicon Mac, evaluating local MoE models vs cloud baselines for agentic coding. Re-validate the &lt;em&gt;findings&lt;/em&gt;, not the &lt;em&gt;workflow&lt;/em&gt;.&lt;/p&gt;</description></item><item><title>Benchmarking Local LLMs for Agentic Coding</title><link>https://agent-zone.ai/knowledge/agent-tooling/benchmarking-local-llms-for-agentic-coding/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/agent-tooling/benchmarking-local-llms-for-agentic-coding/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; Evaluate on the &lt;strong&gt;agent loop&lt;/strong&gt; (read/edit/test/push), not one-shot patches. Use a &lt;strong&gt;multi-file execution-stamina&lt;/strong&gt; task as your discriminator, tune &lt;strong&gt;OFAT (one factor at a time) at N≥3&lt;/strong&gt;, and distinguish turn-ceiling vs token-ceiling vs capability-ceiling: only the last is unfixable by config.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; Methodology is durable; the named results are 2026-05 snapshots — re-run the harness for current models.&lt;/p&gt;
&lt;/blockquote&gt;&lt;h2 id="why-public-leaderboard-scores-mislead"&gt;Why public leaderboard scores mislead&lt;a class="anchor" href="#why-public-leaderboard-scores-mislead"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;SWE-bench-style and chat leaderboards measure something adjacent to, but not the same as, &lt;strong&gt;autonomous tool-using coding&lt;/strong&gt;. A model can score well on one-shot patch generation and still fail as an agent because the agent loop demands sustained, multi-turn behavior: read files, edit several, run tests, react to failures, and &lt;em&gt;push&lt;/em&gt; — without giving up, looping, or declaring &amp;ldquo;done&amp;rdquo; early. Evaluate on the loop you&amp;rsquo;ll actually run.&lt;/p&gt;</description></item><item><title>Operational Pitfalls: Running Local LLMs Alongside Dev Clusters</title><link>https://agent-zone.ai/knowledge/sre/operational-pitfalls-local-llms-dev-clusters/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/sre/operational-pitfalls-local-llms-dev-clusters/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; One model per GPU (cloud-main + local-wake-filter for multi-model); unload-and-verify before every load; never lower the Docker Desktop VM cap; tunnel to loopback to dodge macOS Local Network Privacy; serialize loads and don&amp;rsquo;t download during inference.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; Apple-Silicon Mac + minikube/Docker Desktop and a single-GPU LLM host (GB10), as of 2026-05-25. Incident patterns are durable; specific recovery commands assume kubectl/minikube/Docker Desktop.&lt;/p&gt;
&lt;/blockquote&gt;&lt;p&gt;A field runbook of failure modes seen running local LLMs next to development Kubernetes clusters. Each is a real incident pattern, not a hypothetical. (This whole doc is effectively a &amp;ldquo;what didn&amp;rsquo;t work&amp;rdquo; catalog — that&amp;rsquo;s the point.)&lt;/p&gt;</description></item><item><title>Realistic GPU/Memory Sizing for Local LLMs</title><link>https://agent-zone.ai/knowledge/infrastructure/local-llm-gpu-memory-sizing/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/infrastructure/local-llm-gpu-memory-sizing/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; Budget &lt;strong&gt;file size + KV(context) + overhead&lt;/strong&gt;, not file size alone, and on unified memory, subtract OS + co-resident workloads first. &amp;ldquo;Barely fits&amp;rdquo; means doesn&amp;rsquo;t fit. Size memory by &lt;em&gt;total&lt;/em&gt; params, speed by &lt;em&gt;active&lt;/em&gt; params.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; General sizing principles (version-independent); worked numbers from 2026-05 on a GB10 (128 GB unified) + a 64 GB Apple-Silicon Mac. Re-measure resident sizes for your model/quant/context.&lt;/p&gt;
&lt;/blockquote&gt;&lt;h2 id="resident-size-is-bigger-than-the-file"&gt;Resident size is bigger than the file&lt;a class="anchor" href="#resident-size-is-bigger-than-the-file"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The single most common sizing mistake is equating the model file size with how much memory it needs at runtime. Resident footprint is:&lt;/p&gt;</description></item><item><title>Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)</title><link>https://agent-zone.ai/knowledge/infrastructure/running-llms-on-nvidia-gb10-dgx-spark/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/infrastructure/running-llms-on-nvidia-gb10-dgx-spark/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; On a GB10, pick &lt;strong&gt;low-active MoE&lt;/strong&gt; models (A3B-class), serve &lt;strong&gt;GGUF&lt;/strong&gt; (not MLX) via LM Studio, run &lt;strong&gt;one model at a time&lt;/strong&gt; behind an OOM guard, and monitor GPU via DCGM but read the &lt;strong&gt;model footprint from system RAM&lt;/strong&gt; (no framebuffer metrics). Dense 70B is unusable (~2-3 tok/s).&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).&lt;/p&gt;</description></item><item><title>Serving LLMs on an Apple Silicon Mac That Also Runs a Dev Cluster</title><link>https://agent-zone.ai/knowledge/infrastructure/llm-serving-on-apple-silicon-with-k8s/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/infrastructure/llm-serving-on-apple-silicon-with-k8s/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; A Mac running a dev cluster is a &lt;strong&gt;lite-tier&lt;/strong&gt; LLM host only (~8 GB models). It can&amp;rsquo;t hold even one large (~24 GB-resident) model alongside the cluster. Standardize on &lt;strong&gt;GGUF&lt;/strong&gt; (Ollama can&amp;rsquo;t do MLX); &lt;strong&gt;don&amp;rsquo;t&lt;/strong&gt; lower the Docker VM cap to &amp;ldquo;free RAM.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; 64 GB Apple-Silicon Mac running minikube/Docker Desktop, as of 2026-05-25. Numbers scale with your RAM and cluster size — re-measure, but the &lt;em&gt;shape&lt;/em&gt; (cluster + one big model exhausts the box) holds.&lt;/p&gt;</description></item><item><title>Tuning Local LLMs for Agentic Coding: Sampling, Reasoning, and Budgets</title><link>https://agent-zone.ai/knowledge/agent-tooling/tuning-local-llms-sampling-reasoning-budgets/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/agent-tooling/tuning-local-llms-sampling-reasoning-budgets/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; Per new model, sweep temperature (don&amp;rsquo;t assume 0.3), try reasoning &lt;strong&gt;off&lt;/strong&gt; for builders, test &lt;code&gt;echo_reasoning&lt;/code&gt; &lt;strong&gt;both ways&lt;/strong&gt;, and on &lt;code&gt;budget_exceeded&lt;/code&gt; check turns-vs-tokens before changing either. The right config is model-specific — assume nothing.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; Local + cloud models for agentic coding, 2026-05. Findings are per-model (see the specific models named); treat them as examples of &lt;em&gt;shape&lt;/em&gt;, not universal constants — re-sweep for any new model.&lt;/p&gt;</description></item></channel></rss>