<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm-Evaluation-Workflow on Agent Zone</title><link>https://agent-zone.ai/skills/llm-evaluation-workflow/</link><description>Recent content in Llm-Evaluation-Workflow on Agent Zone</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 25 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://agent-zone.ai/skills/llm-evaluation-workflow/index.xml" rel="self" type="application/rss+xml"/><item><title>An End-to-End Workflow for Evaluating &amp; Tuning Local LLMs for Agents</title><link>https://agent-zone.ai/knowledge/agent-tooling/local-llm-evaluation-workflow/</link><pubDate>Mon, 25 May 2026 00:00:00 +0000</pubDate><guid>https://agent-zone.ai/knowledge/agent-tooling/local-llm-evaluation-workflow/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Decision-first:&lt;/strong&gt; Follow this order and you&amp;rsquo;ll have a deployable model + tuned config in days, not weeks: (1) scope the hardware, (2) shortlist by &lt;em&gt;active&lt;/em&gt; params, (3) build a per-model OFAT (one-factor-at-a-time) matrix, (4) run &lt;strong&gt;serially&lt;/strong&gt; with an OOM guard (&lt;strong&gt;smoke first&lt;/strong&gt;), (5) write a finding card per model, (6) decide. The expensive mistakes are skipping the smoke step, sweeping more than one factor at once, and trusting a single run.&lt;/p&gt;
&lt;/blockquote&gt;&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Scope &amp;amp; freshness:&lt;/strong&gt; The process is model- and hardware-independent; the worked numbers come from a 2026-05 effort on a GB10 (128 GB) and an Apple-Silicon Mac, evaluating local MoE models against cloud baselines for agentic coding. Re-validate the &lt;em&gt;findings&lt;/em&gt;, not the &lt;em&gt;workflow&lt;/em&gt;.&lt;/p&gt;</description></item></channel></rss>