Building LLM Harnesses: Orchestrating Local Models into Workflows with Scoring, Retries, and Parallel Execution

Building LLM Harnesses#

A harness is the infrastructure that wraps LLM calls into a reliable, testable, and observable workflow. It handles the concerns that a raw API call does not: input preparation, output validation, error recovery, model routing, parallel execution, and quality scoring. Without a harness, you have a script. With one, you have a tool.

Harness Architecture#

Input
  │
  ├── Preprocessing (validate input, select model, prepare prompt)
  │
  ├── Execution (call Ollama with timeout, retry on failure)
  │
  ├── Post-processing (parse output, validate schema, score quality)
  │
  ├── Routing (if quality too low, escalate to larger model or flag)
  │
  └── Output (structured result + metadata)

Core Harness in Python#

import ollama
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class LLMResult:
    content: str
    model: str
    tokens_in: int
    tokens_out: int
    duration_ms: int
    ttft_ms: int
    success: bool
    retries: int = 0
    score: float | None = None
    metadata: dict = field(default_factory=dict)

@dataclass
class HarnessConfig:
    model: str = "qwen2.5-coder:7b"
    temperature: float = 0.0
    max_tokens: int = 1024
    json_mode: bool = False
    timeout_seconds: int = 120
    max_retries: int = 2
    retry_delay_seconds: float = 1.0

def call_llm(
    messages: list[dict],
    config: HarnessConfig,
) -> LLMResult:
    """Make a single LLM call with timing metadata."""
    start = time.monotonic()

    kwargs = {
        "model": config.model,
        "messages": messages,
        "options": {
            "temperature": config.temperature,
            "num_predict": config.max_tokens,
        },
        "stream": False,
    }
    if config.json_mode:
        kwargs["format"] = "json"

    try:
        response = ollama.chat(**kwargs)
        duration = int((time.monotonic() - start) * 1000)

        return LLMResult(
            content=response["message"]["content"],
            model=config.model,
            tokens_in=response.get("prompt_eval_count", 0),
            tokens_out=response.get("eval_count", 0),
            duration_ms=duration,
            ttft_ms=int(response.get("prompt_eval_duration", 0) / 1_000_000),
            success=True,
        )
    except Exception as e:
        duration = int((time.monotonic() - start) * 1000)
        return LLMResult(
            content=str(e),
            model=config.model,
            tokens_in=0,
            tokens_out=0,
            duration_ms=duration,
            ttft_ms=0,
            success=False,
        )

Retry with Validation#

Do not retry blindly. Retry only when the output fails validation:

FIPS 140 Compliance: Validated Cryptography, FIPS-Enabled Runtimes, and Kubernetes Deployment

FIPS 140 Compliance#

FIPS 140 (Federal Information Processing Standard 140) is a US and Canadian government standard for cryptographic modules. If you sell software to US federal agencies, process federal data, or operate under FedRAMP, you must use FIPS 140-validated cryptographic modules. Many regulated industries (finance, healthcare, defense) also require or strongly prefer FIPS compliance.

FIPS 140 does not tell you which algorithms to use — it validates that a specific implementation of those algorithms has been tested and certified by an accredited lab (CMVP — Cryptographic Module Validation Program).

Rate Limiting Implementation Patterns

Rate Limiting Implementation Patterns#

Rate limiting controls how many requests a client can make within a time period. It protects services from overload, ensures fair usage across clients, prevents abuse, and provides a mechanism for graceful degradation under load. Every production API needs rate limiting at some layer.

Algorithm Comparison#

Fixed Window#

The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals) and count requests per window. When the count exceeds the limit, reject requests until the next window starts.

Structured Output from Small Local Models: JSON Mode, Extraction, Classification, and Token Runaway Fixes

Structured Output from Small Local Models#

Small models (2-7B parameters) produce structured output that is 85-95% as accurate as cloud APIs for well-defined extraction and classification tasks. The key is constraining the output space so the model’s limited reasoning capacity is focused on filling fields rather than deciding what to generate.

This is where local models genuinely compete with — and sometimes match — models 30x their size.

JSON Mode#

Ollama’s JSON mode forces the model to produce valid JSON:

Terraform Modules: Structure, Composition, and Reuse

What Modules Are#

A Terraform module is a directory containing .tf files. Every Terraform configuration is already a module (the “root module”). When you call another module from your root module, that is a “child module.” Modules let you encapsulate a set of resources behind a clean interface of input variables and outputs.

Module Structure#

A well-organized module looks like this:

modules/vpc/
  main.tf           # resource definitions
  variables.tf      # input variables
  outputs.tf        # output values
  versions.tf       # required providers and terraform version
  README.md         # usage documentation

The module itself has no backend, no provider configuration, and no hardcoded values. Everything configurable comes in through variables. Everything downstream consumers need comes out through outputs.

Writing Custom Prometheus Exporters: Exposing Application and Business Metrics

When to Write a Custom Exporter#

The Prometheus ecosystem has exporters for most infrastructure components: node_exporter for Linux hosts, kube-state-metrics for Kubernetes objects, mysqld_exporter for MySQL, and hundreds more. You write a custom exporter when your application or service does not have a Prometheus endpoint, you need business metrics that no generic exporter can provide (revenue, signups, queue depth), or you need to adapt a non-Prometheus system that exposes metrics in a proprietary format.