---
title: "Tuning Local LLMs for Agentic Coding: Sampling, Reasoning, and Budgets"
description: "Load-time and harness levers that move agentic-coding success: temperature is model-dependent (don't assume 0.3), reasoning-off beats reasoning-on for execution, echo_reasoning helps some models and hurts others, and turn vs token budgets are distinct ceilings."
url: https://agent-zone.ai/knowledge/agent-tooling/tuning-local-llms-sampling-reasoning-budgets/
section: knowledge
date: 2026-05-25
categories: ["agent-tooling"]
tags: ["local-llm","tuning","temperature","reasoning","sampling","prompt-engineering","moe","ollama","lm-studio","tool-calling"]
skills: ["llm-tuning","sampling-configuration","prompt-directive-design","budget-configuration"]
tools: ["lm-studio","ollama","llama.cpp"]
levels: ["intermediate","advanced"]
word_count: 723
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/tuning-local-llms-sampling-reasoning-budgets/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/tuning-local-llms-sampling-reasoning-budgets/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Tuning+Local+LLMs+for+Agentic+Coding%3A+Sampling%2C+Reasoning%2C+and+Budgets
---


> **Decision-first:** Per new model, sweep temperature (don't assume 0.3), try reasoning **off** for builders, test `echo_reasoning` **both ways**, and on `budget_exceeded` check turns-vs-tokens before changing either. The right config is model-specific — assume nothing.

> **Scope & freshness:** Local + cloud models for agentic coding, 2026-05. Findings are per-model (see the specific models named); treat them as examples of *shape*, not universal constants — re-sweep for any new model.

These findings come from running agentic-coding benchmarks (read/edit/test/push loops) across many local and cloud models. The recurring lesson: **the right config is model-specific — sweep, don't assume.**

## Temperature is model-dependent — do NOT default to 0.3

A widely-repeated rule is "use temp ~0.3 for coding." It does not generalize. Measured per model on the same task suite:

- A small lite model peaked at **temp 0.3** (0.0 looped, 1.0-default too random).
- A 26B MoE went **8/9 at any explicit temp (0.0/0.3/0.7) but only 6/9 at its default (~1.0)** — the win was simply *setting* a temperature.
- A 35B-A3B MoE went **9/9 at default/0.0 but dropped to 6-7/9 at 0.3/0.7** — mid-temp *hurt* it.

The only near-universal: **the model's default (~1.0) is rarely optimal**, and **0.0 is rarely the worst** at medium+ capability. Beyond that, sweep `{0.0, 0.3, 0.7, default}` per model and confirm at **N≥3** (single runs lie).

## Reasoning OFF beats reasoning ON for execution

For agentic *builders* (which should execute decisively, not deliberate), turning reasoning off consistently helped:

- A reasoning model went from `record-cost` **0/3 → 3/3** with `reasoning_effort: none`.
- A 120B model with high reasoning fell into **analysis paralysis** — it looped on search/read for dozens of turns and never committed to editing.

Heavy *builders* benefit from "execute, don't think." (This is the opposite of math/proof tasks — match the knob to the job.)

## `echo_reasoning` (CoT re-injection) is model-dependent — in opposite directions

Re-injecting the model's prior chain-of-thought across turns:

- One model (**GLM-4.5-Air**) *needed* it — multi-turn quality degraded without it.
- Another (**qwen3.6-35b-a3b**) was *hurt* by it — heavy-tier dropped 8/9 → 6/9 with echo on.

Never port one model's `echo_reasoning` setting to another blindly. Test on/off per model.

## `reasoning_effort` is provider-specific

`reasoning_effort` works for some backends (gpt-oss, nemotron, grok) but is a **no-op** for others. Some models gate thinking via `chat_template_kwargs.enable_thinking` instead; others are non-thinking entirely. Verify the knob actually changes behavior before trusting it.

## Turn budget vs token budget are DISTINCT ceilings

When a run ends `budget_exceeded`, identify *which* budget:

- If it stopped at **turns < your turn cap** and **wallclock < your wallclock cap**, it hit the **token budget**. Raising the turn limit won't help — raise tokens (or reduce per-turn token burn).
- If it stopped **at** the turn cap, it's a turn ceiling — more turns may help (a near-miss task that lands 4/5 criteria at the cap often clears with +20 turns).

A real case: a model's hardest task failed `budget_exceeded` at ~75 turns with a 100-turn cap — it was the **token** budget, so bumping turns 80→100 made it *worse*, not better. Diagnose before you tune.

Also watch **`max_tokens` set too low**: a task ending `incomplete` with `output_tokens == max_tokens` exactly is **silent mid-reasoning truncation**, not a model give-up. Some models need 16k-96k output budget.

## A pre-push completeness directive (harness lever)

A prompt directive instructing the model to **re-read the spec's file list and verify every file is edited before pushing** flipped a model from 8/9 → **9/9** — its remaining miss was *premature completion* (pushing after editing 4 of 5 files), not incapacity or a budget limit. Cheap, model-agnostic, high-leverage for multi-file specs.

## What didn't work (so you don't repeat it)

- **Defaulting to temp 0.3** — it was best for one model, worst-ish for another (mid-temp *hurt* a 35B-A3B). No universal value.
- **Bumping the turn budget to fix a `budget_exceeded`** that was actually the *token* budget — made results worse. Diagnose which ceiling first.
- **Porting one model's `echo_reasoning` setting to another** — it helped one and hurt another, in opposite directions.
- **Trusting `reasoning_effort`** without checking it changed behavior — it's a no-op on some backends.

## Checklist (per new model)

1. Sweep temperature `{0.0, 0.3, 0.7, default}`, N≥3 — don't assume 0.3.
2. Try **reasoning off** for builder/execution roles.
3. Test `echo_reasoning` **both ways** — it helps some, hurts others.
4. Confirm `reasoning_effort` actually changes behavior (provider-specific).
5. On `budget_exceeded`, check turns-vs-tokens before changing either.
6. Add a pre-push completeness directive for multi-file tasks.

