Building LLM Harnesses: Orchestrating Local Models into Workflows with Scoring, Retries, and Parallel Execution

Building LLM Harnesses#

A harness is the infrastructure that wraps LLM calls into a reliable, testable, and observable workflow. It handles the concerns that a raw API call does not: input preparation, output validation, error recovery, model routing, parallel execution, and quality scoring. Without a harness, you have a script. With one, you have a tool.

Harness Architecture#

Input
  │
  ├── Preprocessing (validate input, select model, prepare prompt)
  │
  ├── Execution (call Ollama with timeout, retry on failure)
  │
  ├── Post-processing (parse output, validate schema, score quality)
  │
  ├── Routing (if quality too low, escalate to larger model or flag)
  │
  └── Output (structured result + metadata)

Core Harness in Python#

import ollama
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class LLMResult:
    content: str
    model: str
    tokens_in: int
    tokens_out: int
    duration_ms: int
    ttft_ms: int
    success: bool
    retries: int = 0
    score: float | None = None
    metadata: dict = field(default_factory=dict)

@dataclass
class HarnessConfig:
    model: str = "qwen2.5-coder:7b"
    temperature: float = 0.0
    max_tokens: int = 1024
    json_mode: bool = False
    timeout_seconds: int = 120
    max_retries: int = 2
    retry_delay_seconds: float = 1.0

def call_llm(
    messages: list[dict],
    config: HarnessConfig,
) -> LLMResult:
    """Make a single LLM call with timing metadata."""
    start = time.monotonic()

    kwargs = {
        "model": config.model,
        "messages": messages,
        "options": {
            "temperature": config.temperature,
            "num_predict": config.max_tokens,
        },
        "stream": False,
    }
    if config.json_mode:
        kwargs["format"] = "json"

    try:
        response = ollama.chat(**kwargs)
        duration = int((time.monotonic() - start) * 1000)

        return LLMResult(
            content=response["message"]["content"],
            model=config.model,
            tokens_in=response.get("prompt_eval_count", 0),
            tokens_out=response.get("eval_count", 0),
            duration_ms=duration,
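            # Ollama reports durations in nanoseconds; prompt eval time approximates time-to-first-token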
            ttft_ms=int(response.get("prompt_eval_duration", 0) / 1_000_000),
            success=True,
        )
    except Exception as e:
        duration = int((time.monotonic() - start) * 1000)
        return LLMResult(
            content=str(e),
            model=config.model,
            tokens_in=0,
            tokens_out=0,
            duration_ms=duration,
            ttft_ms=0,
            success=False,
        )
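
A minimal usage sketch of the harness defined above; the prompt and the printed fields are illustrative:

config = HarnessConfig(model="qwen2.5-coder:7b", max_tokens=512)
result = call_llm(
    [
        {"role": "system", "content": "You are a concise code reviewer."},
        {"role": "user", "content": "Review this function for bugs:\n\ndef add(a, b):\n    return a - b"},
    ],
    config,
)
if result.success:
    print(f"{result.model}: {result.tokens_out} tokens in {result.duration_ms} ms")
    print(result.content)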

Retry with Validation#

Do not retry blindly. Retry only when the output fails validation:
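
Below is a minimal sketch of that loop, building on call_llm above; the validate callback and the is_valid_json helper are illustrative, not a fixed API.

def call_with_validation(
    messages: list[dict],
    config: HarnessConfig,
    validate: Callable[[str], bool],
) -> LLMResult:
    """Call the model, retrying only if the call failed or the output fails validation."""
    result = call_llm(messages, config)
    attempt = 0
    while attempt < config.max_retries and not (result.success and validate(result.content)):
        time.sleep(config.retry_delay_seconds)
        attempt += 1
        result = call_llm(messages, config)
        result.retries = attempt
    if not (result.success and validate(result.content)):
        result.metadata["validation_failed"] = True
    return result

def is_valid_json(text: str) -> bool:
    """Example validator: accept the output only if it parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

With json_mode=True in the config, pairing this with is_valid_json catches truncated or malformed responses before they propagate downstream.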

Choosing a Local Model: Size Tiers, Task Matching, and Cost Comparison with Cloud APIs

Choosing a Local Model#

The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed for equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.

Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.
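
As a rough illustration of what that matching looks like inside a harness, a routing table can map task classes to model tiers; the classes and model tags below are examples, not benchmark results.

# Illustrative routing table: task class -> model tag (example tags, not benchmark-derived)
MODEL_TIERS = {
    "classification": "llama3.2:3b",     # labels, routing decisions, yes/no checks
    "extraction": "qwen2.5-coder:7b",    # structured output, per-file summaries
    "reasoning": "qwen2.5-coder:32b",    # cross-file analysis, multi-step refactoring plans
}

def pick_model(task_class: str, default: str = "qwen2.5-coder:7b") -> str:
    """Map a task class to the smallest model tier that handles it well."""
    return MODEL_TIERS.get(task_class, default)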

Ollama Setup and Model Management: Installation, Model Selection, Memory Management, and ARM64 Native

Ollama Setup and Model Management#

Ollama turns running local LLMs into a single command. It handles model downloads, quantization, GPU memory allocation, and exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.

Installation#

# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Start the Ollama server:
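
ollama serve

The Linux install script and the Docker command above typically start the server for you. Pull a model and confirm it responds (the tag is just an example):

ollama pull qwen2.5-coder:7b
ollama run qwen2.5-coder:7b "Say hello"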

Prompt Engineering for Local Models: Presets, Focus Areas, and Differences from Cloud Model Prompting

Prompt Engineering for Local Models#

Prompting a 7B local model is not the same as prompting Claude or GPT-4. Cloud models are overtrained on instruction following, tolerate vague prompts, and self-correct. Small local models need more structure, more constraints, and more explicit formatting instructions. The prompts that work effortlessly on cloud models often produce garbage on local models.

This is not a weakness — it is a design consideration. Local models trade generality for speed and cost. Your prompts must compensate by being more specific.
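
One concrete illustration of that difference; the wording is an example, not a benchmark-tested prompt.

# A vague prompt that a cloud model tolerates but a 7B model often fumbles
loose_prompt = "Summarize this function."

# The same task with explicit structure, constraints, and output format
# ({code} is filled in later with the function body)
strict_prompt = """Summarize the function below.

Rules:
- Output exactly 3 bullet points, each under 15 words.
- Name the return type in the first bullet.
- Do not include code in your output.

Function:
{code}
"""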

RAG for Codebases Without Cloud APIs: ChromaDB, Embedding Models, and Semantic Code Search

RAG for Codebases Without Cloud APIs#

When a codebase has hundreds of files, neither direct concatenation nor summarize-then-correlate is ideal for targeted questions like “where is authentication handled?” or “what calls the payment API?” RAG (Retrieval-Augmented Generation) indexes the codebase into a vector database and retrieves only the relevant chunks for each query.

The key advantage: query time stays roughly constant as the codebase grows. Whether the codebase has 50 files or 5,000, a query takes about the same time, because only the top-K relevant chunks are retrieved and sent to the model.
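
A minimal indexing-and-retrieval sketch with ChromaDB, using its default embedding function; indexing one chunk per file and the src/ path are simplifications for illustration.

import pathlib
import chromadb

client = chromadb.PersistentClient(path=".code_index")
collection = client.get_or_create_collection("codebase")

# Index: one chunk per file here; a real pipeline would split large files into smaller chunks
for path in pathlib.Path("src").rglob("*.py"):
    collection.add(
        ids=[str(path)],
        documents=[path.read_text(errors="ignore")],
        metadatas=[{"path": str(path)}],
    )

# Query: only the top-K relevant chunks come back, regardless of codebase size
hits = collection.query(query_texts=["where is authentication handled?"], n_results=5)
for meta in hits["metadatas"][0]:
    print(meta["path"])

The retrieved chunks are then placed into the prompt of a local model as context for the actual answer.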

Structured Output from Small Local Models: JSON Mode, Extraction, Classification, and Token Runaway Fixes

Structured Output from Small Local Models#

Small models (2-7B parameters) produce structured output that is 85-95% as accurate as cloud APIs for well-defined extraction and classification tasks. The key is constraining the output space so the model’s limited reasoning capacity is focused on filling fields rather than deciding what to generate.

This is where local models genuinely compete with — and sometimes match — models 30x their size.

JSON Mode#

Ollama’s JSON mode forces the model to produce valid JSON:
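
Below is a minimal sketch of such a call; the prompt, model tag, and field names are illustrative.

import json
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{
        "role": "user",
        "content": (
            "Extract fields from the sentence and return JSON with keys "
            '"name", "language", and "stars" (integer).\n\n'
            "Sentence: The requests library for Python has over 50k stars on GitHub."
        ),
    }],
    format="json",                                # constrain decoding to valid JSON
    options={"temperature": 0, "num_predict": 256},  # low temp + token cap guard against runaway output
)

data = json.loads(response["message"]["content"])
print(data)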

Two-Pass Analysis: The Summarize-Then-Correlate Pattern for Scaling Beyond Context Windows

Two-Pass Analysis: Summarize-Then-Correlate#

A 32B model with a 32K context window can process roughly 8-10 source files at once. A real codebase has hundreds. Concatenating everything into one prompt fails — the context overflows, quality degrades, and the model either truncates or hallucinates connections.

The two-pass pattern solves this by splitting analysis into two stages:

  1. Pass 1 (Summarize): A fast 7B model reads each file independently and produces a focused summary.
  2. Pass 2 (Correlate): A capable 32B model reads all summaries (which are much shorter than the original files) and answers the cross-cutting question.

This effectively multiplies your context window by the compression ratio of summarization — typically 10-20x. A 32K context that handles 10 files directly can handle 100-200 files through summaries.
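
A compact sketch of the two passes; the model tags, prompts, and src/ globbing are illustrative choices, not fixed recommendations.

import pathlib
import ollama

SUMMARIZER = "qwen2.5-coder:7b"    # fast pass-1 model (example tag)
CORRELATOR = "qwen2.5-coder:32b"   # capable pass-2 model (example tag)

def summarize_file(path: pathlib.Path) -> str:
    """Pass 1: produce one focused summary per file, independently of the others."""
    response = ollama.chat(
        model=SUMMARIZER,
        messages=[{
            "role": "user",
            "content": f"Summarize what this file does in 5 bullet points:\n\n{path.read_text(errors='ignore')}",
        }],
        options={"temperature": 0},
    )
    return f"## {path}\n{response['message']['content']}"

def correlate(summaries: list[str], question: str) -> str:
    """Pass 2: answer the cross-cutting question over all summaries at once."""
    joined = "\n\n".join(summaries)
    response = ollama.chat(
        model=CORRELATOR,
        messages=[{
            "role": "user",
            "content": f"Using these file summaries, answer the question.\n\nQuestion: {question}\n\n{joined}",
        }],
        options={"temperature": 0},
    )
    return response["message"]["content"]

files = sorted(pathlib.Path("src").rglob("*.py"))
summaries = [summarize_file(p) for p in files]
print(correlate(summaries, "Which modules talk to the payment API, and how do they interact?"))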