Verifying LLM-Written SDK Code Against the Pinned Version: A Recipe Against Type Hallucination

May 18, 2026

Sdk-Verification, Wire-Level-Testing, Hallucination-Detection

Llm-Hallucination, Sdk-Version-Pinning, Dependency-Management, Code-Review, Agent-Debugging, Test-Strategy

An agent writes a 200-line streaming-client implementation against your project’s pinned SDK. It compiles cleanly in the model’s head. The test code references SomeStreamEvent, the streaming function signature is func NewStreaming(ctx, params) (stream, error), and the iteration loop uses stream.Recv(). The reviewer skims it, sees plausible naming, and approves. CI fails with “undefined: SomeStreamEvent”. The agent escalates: “the SDK is broken, package not found.” Hours later, somebody figures out that the pinned SDK has none of those symbols. The import path is different. The function returns one value, not two. The iteration pattern is Next() / Current() / Err(), not Recv(). The model invented the API.
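
The check that would have caught this early is mechanical: ask the installed package, not the model, which symbols exist. A minimal sketch in Python, assuming two hypothetical helpers (`missing_symbols` and `parameter_names`) and using the standard-library `json` module as a stand-in for the pinned SDK:

```python
import importlib
import inspect


def missing_symbols(module_name: str, names: list[str]) -> list[str]:
    """Return the names that the installed module does not export."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


def parameter_names(module_name: str, func_name: str) -> list[str]:
    """Return a function's parameter names as installed, not as remembered."""
    module = importlib.import_module(module_name)
    return list(inspect.signature(getattr(module, func_name)).parameters)


if __name__ == "__main__":
    # `json` stands in for the pinned SDK; `Recv` plays the hallucinated symbol.
    print(missing_symbols("json", ["loads", "dumps", "Recv"]))
    print(parameter_names("json", "loads"))
```

Running a check like this against every symbol the generated code references, before review, turns "plausible naming" into a verified list. In a compiled language the build step plays the same role, provided it runs against the pinned version.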

Agent Debugging Patterns: Tracing Decisions in Production

February 22, 2026

Agent-Tooling

Intermediate, Advanced

Agent-Debugging, Observability-Design, Production-Monitoring

Debugging, Observability, Tracing, Logging, Hallucination

Python, Typescript, Opentelemetry, Structured-Logging

Agent Debugging Patterns

When an agent produces a wrong answer, the question is always the same: why did it do that? Unlike traditional software where you read a stack trace, agent failures are buried in a chain of LLM decisions, tool calls, and context accumulation. Debugging agents requires specialized observability that captures not just what happened, but what the agent was thinking at each step.

Tracing Agent Decision Chains

Every agent action follows a decision chain: the model reads its context, decides which tool to call (or whether to respond directly), processes the result, and decides again. To debug failures, you need to see this chain as a structured trace.
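
One way to make the chain visible is to record each step as a structured event. The sketch below is an assumption-laden illustration, not a specific tracing library: `TraceStep`, `DecisionTrace`, and the step kinds are hypothetical names chosen for this example.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    kind: str  # hypothetical kinds: "model_decision", "tool_call", "tool_result"
    detail: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class DecisionTrace:
    trace_id: str
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, kind: str, **detail) -> None:
        """Append one step of the decision chain in the order it happened."""
        self.steps.append(TraceStep(kind=kind, detail=detail))

    def to_json(self) -> str:
        """Serialize the whole chain so it can be logged or diffed."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    trace = DecisionTrace(trace_id="req-1")
    trace.record("model_decision", chosen_tool="search")
    trace.record("tool_call", tool="search", args={"q": "status"})
    trace.record("tool_result", tool="search", ok=True)
    print(trace.to_json())
```

Keeping the steps in a single ordered list preserves the read, decide, act, decide-again sequence, which is what you need when asking why a particular tool was chosen.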

Agent Error Handling: Retries, Degradation, and Circuit Breakers

February 22, 2026

Agent-Tooling

Intermediate

Error-Recovery, Resilient-Agent-Design

Error-Handling, Retries, Circuit-Breaker, Resilience

Python, Typescript

Agent Error Handling

Agents call tools that call APIs that talk to services that query databases. Every link in that chain can fail. The difference between a useful agent and a frustrating one is what happens when something breaks.

Classify the Failure First

Before deciding how to handle an error, classify it. The strategy depends entirely on whether the failure is transient or permanent.

Transient failures will likely succeed on retry: network timeouts, rate limits (HTTP 429), server overload (HTTP 503), connection resets, temporary DNS failures. These are the majority of failures in practice.

Agent Evaluation and Testing: Measuring What Matters in Agent Performance

February 22, 2026

Agent-Tooling

Advanced

Agent-Evaluation, Test-Harness-Design, Metrics-Engineering

Testing, Evaluation, Metrics, Benchmarks, Regression-Testing, A-B-Testing

Python, Pytest, Json-Schema

Agent Evaluation and Testing

You cannot improve what you cannot measure. Agent evaluation is harder than traditional software testing because agents are non-deterministic, their behavior depends on prompt wording, and the same input can produce multiple valid outputs. But “it is hard” is not an excuse for not doing it. This article provides a step-by-step framework for building an agent evaluation pipeline that catches regressions, compares configurations, and quantifies real-world performance.

Agent Memory and Retrieval: Patterns for Persistent, Searchable Agent Knowledge

February 22, 2026

Agent-Tooling

Intermediate

Memory-System-Design, Rag-Implementation, Context-Optimization

Memory, Retrieval, Rag, Vector-Databases, Context-Window, Embeddings

Chromadb, Pgvector, Sqlite, Redis, Python

Agent Memory and Retrieval

An agent without memory repeats mistakes, forgets context, and relearns the same facts every session. An agent with too much memory wastes context window tokens on irrelevant history and retrieves noise instead of signal. Effective memory sits between these extremes – storing what matters, retrieving what is relevant, and forgetting what is stale.

This reference covers the concrete patterns for building agent memory systems, from simple file-based approaches to production-grade retrieval pipelines.

Agent Security Patterns: Defending Against Injection, Leakage, and Misuse

February 22, 2026

Agent-Tooling

Intermediate

Secure-Agent-Design, Threat-Modeling

Security, Prompt-Injection, Sandboxing, Secrets, Permissions

Python, Typescript, Docker

Agent Security Patterns

An AI agent with tool access is a program that can read files, call APIs, execute code, and modify systems – driven by natural language input. Every classic security concern applies, plus new attack surfaces unique to LLM-powered systems. This article covers practical defenses, not theoretical risks.

Prompt Injection Defense

Prompt injection is the most agent-specific security threat. An attacker embeds instructions in data the agent processes – a file, a web page, an API response – and the agent follows those instructions as if they came from the user.

Agent-Friendly API Design: Building APIs That Agents Can Consume

February 22, 2026

Agent-Tooling

Intermediate

Api-Design, Agent-Integration, Developer-Experience

Api-Design, Rest-Api, Agent-Integration, Error-Handling, Pagination

Openapi, Json-Schema, Typescript, Python

Agent-Friendly API Design

Most APIs are designed for human developers who read documentation, interpret ambiguous error messages, and adapt their approach based on experience. Agents do not have these skills. They parse structured responses, follow explicit instructions, and fail on ambiguity. An API that is pleasant for humans to use may be impossible for an agent to use reliably.

This reference covers practical patterns for designing APIs – or modifying existing ones – so that agents can consume them effectively.

Automating Operational Runbooks

February 22, 2026

Sre

Intermediate, Advanced

Runbook-Design, Automation-Assessment, Guardrail-Implementation, Approval-Workflow-Design, Audit-Trail-Setup

Runbook-Automation, Rundeck, Ansible-Awx, Stackstorm, Automation, Guardrails, Approval-Workflow, Audit-Trail

Rundeck, Ansible, Ansible-Awx, Stackstorm, Bash, Python, Terraform, Kubectl

The Manual-to-Automated Progression

Not every runbook should be automated, and automation does not happen in a single jump. The progression builds confidence at each stage.

Level 0 – Tribal Knowledge: The procedure exists only in someone’s head. Invisible risk.

Level 1 – Documented Runbook: Step-by-step instructions a human follows, including commands, expected outputs, and decision points. Every runbook starts here.

Level 2 – Scripted Runbook: Manual steps encoded in a script that a human triggers and monitors. The script handles tedious parts; the human handles judgment calls.

Building LLM Harnesses: Orchestrating Local Models into Workflows with Scoring, Retries, and Parallel Execution

February 22, 2026

Agent-Tooling

Intermediate

Llm-Orchestration, Harness-Design, Output-Validation, Workflow-Automation

Llm-Harness, Orchestration, Local-Llm, Workflow, Scoring, Retry, Parallel-Execution, Ollama

Ollama, Python, Go

Building LLM Harnesses

A harness is the infrastructure that wraps LLM calls into a reliable, testable, and observable workflow. It handles the concerns that a raw API call does not: input preparation, output validation, error recovery, model routing, parallel execution, and quality scoring. Without a harness, you have a script. With one, you have a tool.

Harness Architecture

Input
  │
  ├── Preprocessing (validate input, select model, prepare prompt)
  │
  ├── Execution (call Ollama with timeout, retry on failure)
  │
  ├── Post-processing (parse output, validate schema, score quality)
  │
  ├── Routing (if quality too low, escalate to larger model or flag)
  │
  └── Output (structured result + metadata)

Core Harness in Python

import ollama
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class LLMResult:
    content: str
    model: str
    tokens_in: int
    tokens_out: int
    duration_ms: int
    ttft_ms: int
    success: bool
    retries: int = 0
    score: float | None = None
    metadata: dict = field(default_factory=dict)

@dataclass
class HarnessConfig:
    model: str = "qwen2.5-coder:7b"
    temperature: float = 0.0
    max_tokens: int = 1024
    json_mode: bool = False
    timeout_seconds: int = 120
    max_retries: int = 2
    retry_delay_seconds: float = 1.0

def call_llm(
    messages: list[dict],
    config: HarnessConfig,
) -> LLMResult:
    """Make a single LLM call with timing metadata."""
    start = time.monotonic()

    kwargs = {
        "model": config.model,
        "messages": messages,
        "options": {
            "temperature": config.temperature,
            "num_predict": config.max_tokens,
        },
        "stream": False,
    }
    if config.json_mode:
        kwargs["format"] = "json"

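    try:
        # Apply the configured timeout; the module-level ollama.chat() would ignore it.
        client = ollama.Client(timeout=config.timeout_seconds)
        response = client.chat(**kwargs)
        duration = int((time.monotonic() - start) * 1000)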

        return LLMResult(
            content=response["message"]["content"],
            model=config.model,
            tokens_in=response.get("prompt_eval_count", 0),
            tokens_out=response.get("eval_count", 0),
            duration_ms=duration,
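            # prompt_eval_duration is in nanoseconds; used here as a proxy for
            # time to first token (it excludes model load time).
            ttft_ms=int(response.get("prompt_eval_duration", 0) / 1_000_000),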
            success=True,
        )
    except Exception as e:
        duration = int((time.monotonic() - start) * 1000)
        return LLMResult(
            content=str(e),
            model=config.model,
            tokens_in=0,
            tokens_out=0,
            duration_ms=duration,
            ttft_ms=0,
            success=False,
        )

Retry with Validation

Do not retry blindly. Retry only when the output fails validation:

FIPS 140 Compliance: Validated Cryptography, FIPS-Enabled Runtimes, and Kubernetes Deployment

February 22, 2026

Security

Intermediate

Fips-Compliance, Cryptographic-Configuration, Compliant-Container-Builds

Fips, Fips-140, Cryptography, Compliance, Federal, Encryption, Fedramp

Openssl, Go, Python, Nodejs, Kubernetes, Red-Hat-Ubi

FIPS 140 Compliance

FIPS 140 (Federal Information Processing Standard 140) is a US and Canadian government standard for cryptographic modules. If you sell software to US federal agencies, process federal data, or operate under FedRAMP, you must use FIPS 140-validated cryptographic modules. Many regulated industries (finance, healthcare, defense) also require or strongly prefer FIPS compliance.

FIPS 140 does not tell you which algorithms to use; it validates that a specific implementation of those algorithms has been tested by an accredited lab and certified under the CMVP (Cryptographic Module Validation Program).