Building LLM Harnesses: Orchestrating Local Models into Workflows with Scoring, Retries, and Parallel Execution

Building LLM Harnesses#

A harness is the infrastructure that wraps LLM calls into a reliable, testable, and observable workflow. It handles the concerns that a raw API call does not: input preparation, output validation, error recovery, model routing, parallel execution, and quality scoring. Without a harness, you have a script. With one, you have a tool.

Harness Architecture#

Input
  │
  ├── Preprocessing (validate input, select model, prepare prompt)
  │
  ├── Execution (call Ollama with timeout, retry on failure)
  │
  ├── Post-processing (parse output, validate schema, score quality)
  │
  ├── Routing (if quality too low, escalate to larger model or flag)
  │
  └── Output (structured result + metadata)
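
Before diving into the building blocks, here is a minimal sketch of how these stages might compose into a single entry point. It assumes the call_llm and HarnessConfig definitions from the next section; score_output is a stand-in for whatever quality check fits your task, and the 0.5 threshold is arbitrary.

def score_output(text: str) -> float:
    # Stand-in heuristic; real scoring (schema checks, rubric prompts) goes here
    return 1.0 if text.strip() else 0.0

def run_harness(task: str, config: HarnessConfig, fallback_model: str | None = None) -> LLMResult:
    # Preprocessing: validate the input and build the prompt
    messages = [{"role": "user", "content": task}]

    # Execution: one call with timing metadata (retries are covered below)
    result = call_llm(messages, config)

    # Post-processing: attach a quality score to the structured result
    if result.success:
        result.score = score_output(result.content)

    # Routing: escalate to a larger model when the call failed or quality is too low
    if fallback_model and (not result.success or (result.score or 0.0) < 0.5):
        result = call_llm(messages, HarnessConfig(model=fallback_model))
        result.metadata["escalated_from"] = config.model

    # Output: structured result plus metadata
    return result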

Core Harness in Python#

import ollama
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class LLMResult:
    content: str
    model: str
    tokens_in: int
    tokens_out: int
    duration_ms: int
    ttft_ms: int
    success: bool
    retries: int = 0
    score: float | None = None
    metadata: dict = field(default_factory=dict)

@dataclass
class HarnessConfig:
    model: str = "qwen2.5-coder:7b"
    temperature: float = 0.0
    max_tokens: int = 1024
    json_mode: bool = False
    timeout_seconds: int = 120
    max_retries: int = 2
    retry_delay_seconds: float = 1.0

def call_llm(
    messages: list[dict],
    config: HarnessConfig,
) -> LLMResult:
    """Make a single LLM call with timing metadata."""
    start = time.monotonic()

    kwargs = {
        "model": config.model,
        "messages": messages,
        "options": {
            "temperature": config.temperature,
            "num_predict": config.max_tokens,
        },
        "stream": False,
    }
    if config.json_mode:
        kwargs["format"] = "json"

    try:
        # Create a client so the configured request timeout is actually applied
        # (the module-level ollama.chat call does not accept a timeout argument)
        client = ollama.Client(timeout=config.timeout_seconds)
        response = client.chat(**kwargs)
        duration = int((time.monotonic() - start) * 1000)

        return LLMResult(
            content=response["message"]["content"],
            model=config.model,
            tokens_in=response.get("prompt_eval_count", 0),
            tokens_out=response.get("eval_count", 0),
            duration_ms=duration,
            # prompt_eval_duration is reported in nanoseconds; prompt-processing time is a proxy for time to first token
            ttft_ms=int(response.get("prompt_eval_duration", 0) / 1_000_000),
            success=True,
        )
    except Exception as e:
        duration = int((time.monotonic() - start) * 1000)
        return LLMResult(
            content=str(e),
            model=config.model,
            tokens_in=0,
            tokens_out=0,
            duration_ms=duration,
            ttft_ms=0,
            success=False,
        )
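
A quick smoke test of the core call (the prompt is arbitrary; use any model you have pulled):

config = HarnessConfig(model="qwen2.5-coder:7b", json_mode=True)
result = call_llm(
    [{"role": "user", "content": "Return a JSON object with a single key 'status' set to 'ok'."}],
    config,
)
print(result.success, result.duration_ms, result.tokens_out)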

Retry with Validation#

Do not retry blindly. Retry only when the output fails validation:
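
A minimal sketch of that pattern, assuming the caller supplies a validator (call_with_retry and is_valid_json are illustrative names, and the JSON check is just one example of a validation step):

def is_valid_json(text: str) -> bool:
    # Example validator: accept only parseable JSON
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def call_with_retry(
    messages: list[dict],
    config: HarnessConfig,
    validate: Callable[[str], bool],
) -> LLMResult:
    """Retry only when the call fails or the validator rejects the output."""
    result = call_llm(messages, config)
    attempts = 0
    while attempts < config.max_retries and not (result.success and validate(result.content)):
        attempts += 1
        time.sleep(config.retry_delay_seconds)
        result = call_llm(messages, config)
    result.retries = attempts
    return result

result = call_with_retry(
    [{"role": "user", "content": "Describe this error as JSON with keys 'cause' and 'fix'."}],
    HarnessConfig(json_mode=True),
    validate=is_valid_json,
)

If retries are exhausted and the output still fails validation, the routing stage above is where you escalate to a larger model or flag the result for review.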

Choosing a Local Model: Size Tiers, Task Matching, and Cost Comparison with Cloud APIs

Choosing a Local Model#

The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed for equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.

Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.

Designing Agent-Ready Projects: Structure That Benefits Humans and Agents Equally

Designing Agent-Ready Projects#

An “agent-ready” project is just a well-documented project. Every practice that helps an agent — clear conventions, explicit commands, tracked progress, documented decisions — also helps a new team member, a future-you who forgot the details, or a contractor picking up the project for the first time.

The difference is that humans can ask follow-up questions and gradually build context through conversation. Agents cannot. They need it written down, in the right place, at the right level of detail. Projects that meet this bar are better for everyone.

Detecting Infrastructure Knowledge Gaps: What Agents Don't Know They Don't Know

Detecting Infrastructure Knowledge Gaps#

The most dangerous thing an agent can do is confidently produce a deliverable based on wrong assumptions. An agent that assumes x86_64 when the target is ARM64, that assumes PostgreSQL 14 behavior when the target runs 15, or that assumes AWS IAM patterns when the target is Azure – that agent produces a runbook that will fail in ways the human did not expect and may not understand.

How Agents Communicate: Explaining Plans, Risks, and Trade-offs to Humans

How Agents Communicate#

The most capable agent in the world is useless if the human does not understand what it is doing, why it is doing it, or what it needs. Poor communication is the single largest cause of failed agent-human collaboration — not poor reasoning, not incorrect code, but the human losing confidence because the agent did not explain itself well enough.

This article covers communication patterns that build trust: how to present plans, explain risks, frame trade-offs, report progress, and escalate problems. It is written for agents to follow and for humans to know what good agent communication looks like.

Human-in-the-Loop Patterns: Approval Gates, Escalation, and Progressive Autonomy

Human-in-the-Loop Patterns#

The most common failure mode in agent-driven work is not a wrong answer – it is a correct action taken without permission. An agent that deletes a file to “clean up,” force-pushes a branch to “fix history,” or restarts a service to “apply changes” can cause more damage in one unauthorized action than a dozen wrong answers.

Human-in-the-loop design is not about limiting agent capability. It is about matching autonomy to risk. Safe, reversible actions should proceed without interruption. Dangerous, irreversible actions should require explicit approval. The challenge is building this classification into the workflow without turning every action into a confirmation dialog.
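
As a minimal illustration of that classification (the tiers, action names, and requires_approval helper below are illustrative, not a prescribed taxonomy):

from enum import Enum

class Risk(Enum):
    SAFE = "safe"            # reversible, proceed without interruption (read file, run tests)
    REVIEW = "review"        # reversible but visible (open a PR, write scratch files)
    DANGEROUS = "dangerous"  # irreversible or destructive (delete, force-push, restart)

ACTION_RISK = {
    "read_file": Risk.SAFE,
    "run_tests": Risk.SAFE,
    "open_pull_request": Risk.REVIEW,
    "force_push": Risk.DANGEROUS,
    "restart_service": Risk.DANGEROUS,
}

def requires_approval(action: str) -> bool:
    # Unknown actions default to the most restrictive tier
    return ACTION_RISK.get(action, Risk.DANGEROUS) is Risk.DANGEROUS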

Long-Running Workflow Orchestration: State Machines, Checkpointing, and Resumable Multi-Agent Execution

Long-Running Workflow Orchestration#

Most agent examples show single-turn or single-session tasks: answer a question, write a function, debug an error. Real projects are different. Building a feature, migrating a database, setting up a monitoring stack – these take hours, span multiple sessions, involve parallel work streams, and must survive context window resets, session timeouts, and partial failures.

This article covers the architecture for workflows that last hours or days: how to model progress as a state machine, how to checkpoint for reliable resumption, how to delegate to parallel sub-agents without losing coherence, and how to recover when things fail partway through.
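
As a tiny illustration of the state-machine-plus-checkpoint idea (the phase names and file layout are illustrative, not the article's design):

import json
from enum import Enum
from pathlib import Path

class Phase(Enum):
    PLANNED = "planned"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    DONE = "done"

CHECKPOINT = Path("workflow_state.json")

def save_checkpoint(steps: dict[str, Phase]) -> None:
    # Persist after every transition so a new session can resume exactly here
    CHECKPOINT.write_text(json.dumps({k: v.value for k, v in steps.items()}, indent=2))

def load_checkpoint() -> dict[str, Phase]:
    if not CHECKPOINT.exists():
        return {}
    return {k: Phase(v) for k, v in json.loads(CHECKPOINT.read_text()).items()}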

MCP Server Development: Building Servers from Scratch

MCP Server Development#

This reference covers building MCP servers from scratch – the server lifecycle, defining tools with proper JSON Schema, exposing resources, choosing transports, handling errors, and testing the result. If you want to understand when to use MCP versus alternatives, see the companion article on MCP Server Patterns. This article focuses on how to build one.

Server Lifecycle#

An MCP server goes through four phases: initialization, capability negotiation, operation, and shutdown.

Multi-Agent Coordination: Patterns for Dividing and Conquering Infrastructure Tasks

Multi-Agent Coordination#

A single agent can read files, call APIs, and reason about results. But some tasks are too broad, too slow, or too dangerous for one agent to handle alone. Debugging a production outage might require one agent analyzing logs, another checking infrastructure state, and a third reviewing recent deployments – simultaneously. Multi-agent coordination is how you split work across agents without them stepping on each other.

The hard part is not spawning multiple agents. The hard part is deciding which coordination pattern fits the task, how agents share information, and what happens when they disagree.

Ollama Setup and Model Management: Installation, Model Selection, Memory Management, and ARM64 Native

Ollama Setup and Model Management#

Ollama turns running local LLMs into a single command. It handles model downloads, quantization, GPU memory allocation, and exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.

Installation#

# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Start the Ollama server:
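
# Runs in the foreground (the Linux installer also configures a systemd service)
ollama serve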