An agent writes a 200-line streaming-client implementation against your project’s pinned SDK. It compiles cleanly in the model’s head. The test code references SomeStreamEvent, the streaming function signature is func NewStreaming(ctx, params) (stream, error), and the iteration loop uses stream.Recv(). The reviewer skims it, sees plausible naming, approves. CI fails with “undefined: SomeStreamEvent”. The agent escalates: “the SDK is broken — package not found.” Hours later, somebody figures out that the SDK they’re pinned to has none of those symbols. The import path is different. The function returns one value not two. The iteration pattern is Next() / Current() / Err(), not Recv(). The model invented the API.
Agent Debugging Patterns: Tracing Decisions in Production
Agent Debugging Patterns#
When an agent produces a wrong answer, the question is always the same: why did it do that? Unlike traditional software where you read a stack trace, agent failures are buried in a chain of LLM decisions, tool calls, and context accumulation. Debugging agents requires specialized observability that captures not just what happened, but what the agent was thinking at each step.
Tracing Agent Decision Chains#
Every agent action follows a decision chain: the model reads its context, decides which tool to call (or whether to respond directly), processes the result, and decides again. To debug failures, you need to see this chain as a structured trace.
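One way to picture such a structured trace is the minimal sketch below. The `TraceStep` and `Trace` classes, their field names, and the `search_docs` tool are all hypothetical and chosen for illustration; they are not taken from any particular tracing library.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class TraceStep:
    """One link in the decision chain (field names are illustrative)."""
    step: int
    kind: str            # "decision", "tool_call", or "response"
    summary: str         # what the model decided or what the tool returned
    context_tokens: int  # size of the context the model saw at this step
    payload: dict[str, Any] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trace:
    """An ordered list of steps for a single agent run."""
    run_id: str
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, kind: str, summary: str, context_tokens: int, **payload: Any) -> TraceStep:
        # Steps are numbered in the order they are recorded, starting at 1.
        entry = TraceStep(len(self.steps) + 1, kind, summary, context_tokens, payload)
        self.steps.append(entry)
        return entry

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Hypothetical run: the tool name and token counts are made up.
trace = Trace(run_id="run-001")
trace.record("decision", "chose tool search_docs", context_tokens=1200, tool="search_docs")
trace.record("tool_call", "search_docs returned 3 hits", context_tokens=1200, hits=3)
trace.record("response", "answered from hit #2", context_tokens=1850)
print(trace.to_json())
```

Recording the context size at every step is a deliberate choice in this sketch: it makes context accumulation visible alongside the decisions themselves.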
Agent Error Handling: Retries, Degradation, and Circuit Breakers
Agent Error Handling#
Agents call tools that call APIs that talk to services that query databases. Every link in that chain can fail. The difference between a useful agent and a frustrating one is what happens when something breaks.
Classify the Failure First#
Before deciding how to handle an error, classify it. The strategy depends entirely on whether the failure is transient or permanent.
Transient failures will likely succeed on retry: network timeouts, rate limits (HTTP 429), server overload (HTTP 503), connection resets, temporary DNS failures. These are the majority of failures in practice.
Agent Evaluation and Testing: Measuring What Matters in Agent Performance
Agent Evaluation and Testing#
You cannot improve what you cannot measure. Agent evaluation is harder than traditional software testing because agents are non-deterministic, their behavior depends on prompt wording, and the same input can produce multiple valid outputs. But “it is hard” is not an excuse for not doing it. This article provides a step-by-step framework for building an agent evaluation pipeline that catches regressions, compares configurations, and quantifies real-world performance.
Agent Memory and Retrieval: Patterns for Persistent, Searchable Agent Knowledge
Agent Memory and Retrieval#
An agent without memory repeats mistakes, forgets context, and relearns the same facts every session. An agent with too much memory wastes context window tokens on irrelevant history and retrieves noise instead of signal. Effective memory sits between these extremes – storing what matters, retrieving what is relevant, and forgetting what is stale.
This reference covers the concrete patterns for building agent memory systems, from simple file-based approaches to production-grade retrieval pipelines.
Agent Security Patterns: Defending Against Injection, Leakage, and Misuse
Agent Security Patterns#
An AI agent with tool access is a program that can read files, call APIs, execute code, and modify systems – driven by natural language input. Every classic security concern applies, plus new attack surfaces unique to LLM-powered systems. This article covers practical defenses, not theoretical risks.
Prompt Injection Defense#
Prompt injection is the most agent-specific security threat. An attacker embeds instructions in data the agent processes – a file, a web page, an API response – and the agent follows those instructions as if they came from the user.
Agent-Friendly API Design: Building APIs That Agents Can Consume
Agent-Friendly API Design#
Most APIs are designed for human developers who read documentation, interpret ambiguous error messages, and adapt their approach based on experience. Agents do not have these skills. They parse structured responses, follow explicit instructions, and fail on ambiguity. An API that is pleasant for humans to use may be impossible for an agent to use reliably.
This reference covers practical patterns for designing APIs – or modifying existing ones – so that agents can consume them effectively.
Automating Operational Runbooks
The Manual-to-Automated Progression#
Not every runbook should be automated, and automation does not happen in a single jump. The progression builds confidence at each stage.
Level 0 – Tribal Knowledge: The procedure exists only in someone’s head. Invisible risk.
Level 1 – Documented Runbook: Step-by-step instructions a human follows, including commands, expected outputs, and decision points. Every runbook starts here.
Level 2 – Scripted Runbook: Manual steps encoded in a script that a human triggers and monitors. The script handles tedious parts; the human handles judgment calls.
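A Level 2 runbook step might look like the sketch below: the script runs the command, and a human approves it first. The `run_step` and `ask_operator` helpers are hypothetical, and the demo command is a stand-in, not a real operational check.

```python
import subprocess
import sys
from typing import Callable


def run_step(description: str, command: list[str], confirm: Callable[[str], bool]) -> bool:
    """Run one runbook step only if the operator approves; return True on success."""
    if not confirm(f"Run step '{description}'? [y/N] "):
        print(f"skipped: {description}")
        return False
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout.strip())
    return result.returncode == 0


def ask_operator(prompt: str) -> bool:
    """Interactive confirmation: the human makes the judgment call."""
    return input(prompt).strip().lower() == "y"


# Demo with an auto-approving stand-in so the sketch runs unattended;
# a real run would pass ask_operator instead.
ok = run_step(
    "check disk (stand-in command)",
    [sys.executable, "-c", "print('disk ok')"],
    confirm=lambda _prompt: True,
)
print("step succeeded:", ok)
```

Passing the confirmation function as a parameter keeps the human decision point explicit and makes the step testable without a terminal.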
Building LLM Harnesses: Orchestrating Local Models into Workflows with Scoring, Retries, and Parallel Execution
Building LLM Harnesses#
A harness is the infrastructure that wraps LLM calls into a reliable, testable, and observable workflow. It handles the concerns that a raw API call does not: input preparation, output validation, error recovery, model routing, parallel execution, and quality scoring. Without a harness, you have a script. With one, you have a tool.
Harness Architecture#
Input
│
├── Preprocessing (validate input, select model, prepare prompt)
│
├── Execution (call Ollama with timeout, retry on failure)
│
├── Post-processing (parse output, validate schema, score quality)
│
├── Routing (if quality too low, escalate to larger model or flag)
│
└── Output (structured result + metadata)
Core Harness in Python#
import ollama
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class LLMResult:
    content: str
    model: str
    tokens_in: int
    tokens_out: int
    duration_ms: int
    ttft_ms: int
    success: bool
    retries: int = 0
    score: float | None = None
    metadata: dict = field(default_factory=dict)


@dataclass
class HarnessConfig:
    model: str = "qwen2.5-coder:7b"
    temperature: float = 0.0
    max_tokens: int = 1024
    json_mode: bool = False
    timeout_seconds: int = 120
    max_retries: int = 2
    retry_delay_seconds: float = 1.0


def call_llm(
    messages: list[dict],
    config: HarnessConfig,
) -> LLMResult:
    """Make a single LLM call with timing metadata."""
    start = time.monotonic()
    kwargs = {
        "model": config.model,
        "messages": messages,
        "options": {
            "temperature": config.temperature,
            "num_predict": config.max_tokens,
        },
        "stream": False,
    }
    if config.json_mode:
        kwargs["format"] = "json"
    try:
        response = ollama.chat(**kwargs)
        duration = int((time.monotonic() - start) * 1000)
        return LLMResult(
            content=response["message"]["content"],
            model=config.model,
            tokens_in=response.get("prompt_eval_count", 0),
            tokens_out=response.get("eval_count", 0),
            duration_ms=duration,
            # Ollama reports durations in nanoseconds; prompt evaluation
            # time is used here as a proxy for time to first token.
            ttft_ms=int(response.get("prompt_eval_duration", 0) / 1_000_000),
            success=True,
        )
    except Exception as e:
        # On failure, return the error text in content with success=False.
        duration = int((time.monotonic() - start) * 1000)
        return LLMResult(
            content=str(e),
            model=config.model,
            tokens_in=0,
            tokens_out=0,
            duration_ms=duration,
            ttft_ms=0,
            success=False,
        )
Retry with Validation#
Do not retry blindly. Retry only when the output fails validation:
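A minimal sketch of validation-gated retry follows. It is written against a generic `call` function rather than the harness above so that it runs on its own; `retry_until_valid` and `is_json_object` are hypothetical names, and the canned replies stand in for real model output.

```python
import json
import time
from typing import Callable


def retry_until_valid(
    call: Callable[[], str],
    validate: Callable[[str], bool],
    max_retries: int = 2,
    retry_delay_seconds: float = 1.0,
) -> tuple[str, int, bool]:
    """Call until the output validates; return (output, retries_used, valid)."""
    output = ""
    for attempt in range(max_retries + 1):
        output = call()
        if validate(output):
            return output, attempt, True
        # Only sleep when another attempt is still coming.
        if attempt < max_retries:
            time.sleep(retry_delay_seconds)
    return output, max_retries, False


def is_json_object(text: str) -> bool:
    """Example validator: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


# Stand-in for a model call: the first reply is invalid, the second is valid.
replies = iter(["not json", '{"ok": true}'])
output, retries, valid = retry_until_valid(
    lambda: next(replies), is_json_object, retry_delay_seconds=0.0
)
print(output, retries, valid)
```

Returning the last output even when validation never passes lets the caller decide whether to escalate to a larger model or flag the result.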
FIPS 140 Compliance: Validated Cryptography, FIPS-Enabled Runtimes, and Kubernetes Deployment
FIPS 140 Compliance#
FIPS 140 (Federal Information Processing Standard 140) is a US and Canadian government standard for cryptographic modules. If you sell software to US federal agencies, process federal data, or operate under FedRAMP, you must use FIPS 140-validated cryptographic modules. Many regulated industries (finance, healthcare, defense) also require or strongly prefer FIPS compliance.
FIPS 140 does not tell you which algorithms to use. It validates that a specific implementation of those algorithms has been tested by an accredited lab and certified under the CMVP (Cryptographic Module Validation Program).