Verifying LLM-Written SDK Code Against the Pinned Version: A Recipe Against Type Hallucination

May 18, 2026

Sdk-Verification, Wire-Level-Testing, Hallucination-Detection

Llm-Hallucination, Sdk-Version-Pinning, Dependency-Management, Code-Review, Agent-Debugging, Test-Strategy

An agent writes a 200-line streaming-client implementation against your project’s pinned SDK. It compiles cleanly in the model’s head. The test code references SomeStreamEvent, the streaming function signature is func NewStreaming(ctx, params) (stream, error), and the iteration loop uses stream.Recv(). The reviewer skims it, sees plausible naming, approves. CI fails with “undefined: SomeStreamEvent”. The agent escalates: “the SDK is broken — package not found.” Hours later, somebody figures out that the SDK they’re pinned to has none of those symbols. The import path is different. The function returns one value not two. The iteration pattern is Next() / Current() / Err(), not Recv(). The model invented the API.
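The existence check at the heart of this failure can be made mechanical. The sketch below is in Python, not the Go-style SDK the teaser describes, and it uses the standard-library json module as a stand-in for a pinned SDK. The helper name missing_symbols is hypothetical. It reports which referenced names the installed package does not export, so an invented symbol surfaces before review rather than in CI.

```python
import importlib


def missing_symbols(module_name: str, names: list) -> list:
    """Return the names that the installed module does not export.

    Hypothetical helper: it checks the package that is actually installed,
    which is the point of verifying against the pinned version.
    """
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Stand-in example: json is real, SomeStreamEvent is the invented symbol.
referenced = ["loads", "JSONDecoder", "SomeStreamEvent"]
print(missing_symbols("json", referenced))  # → ['SomeStreamEvent']
```

A name-level check like this says nothing about signatures or return arity, so it is a first filter only; the mismatches described above (one return value versus two, Next() versus Recv()) still need a compile or a real call to catch.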

Local LLMs for AI Agents: When It Makes Sense, When It Doesn't

May 7, 2026

Agent-Tooling

Intermediate

Llm-Cost-Modeling, Hardware-vs-Api-Tradeoff-Analysis, Model-Capability-Benchmarking

Local-Llm, Cost-Analysis, Ollama, Mac-Studio, Dgx-Spark, Agent-Architecture, Hardware

Ollama, Anthropic-Api

A coding agent burns through tokens. The monthly bill from a frontier API provider for a single moderately active agent lands somewhere between fifty and a few hundred dollars, and the natural reaction is to check whether a one-time hardware purchase would be cheaper. The naive comparison — dollars per million tokens versus dollars amortized over five years — almost always concludes that local wins. The honest comparison rarely does, at least for coding workloads, at least as of mid-2026. The reason is a capability gap that doesn’t show up in any cost spreadsheet.

Wake-Filter Pattern: Cheap Classifier Before Expensive Agent

May 7, 2026

Agent-Tooling

Intermediate

Wake-Filter-Design, Classifier-Eval-Harness, Agent-Cost-Arithmetic

Wake-Filter, Classifier, Agent-Architecture, Cost-Optimization, Local-Llm, Ollama

Ollama, Anthropic-Api

An agent fleet wired to a high-volume trigger source — channel mentions, queue events, webhooks — pays full cost on every cycle, even when the trigger is noise. A classifier placed in front of the main agent decides which triggers deserve a real cycle and which to drop. The pattern is old; what is new is that local LLMs make the classifier cost effectively zero, which flips the arithmetic in the pattern’s favor for cases that previously didn’t justify the latency.
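A minimal sketch of the gate, assuming the classifier is any callable that returns True for triggers worth a full cycle. The names wake_filter, expected_cost_per_trigger, mentions_agent, and fake_agent are all hypothetical; in practice the classifier would be a call to a local model rather than a string match.

```python
from typing import Callable, Optional


def wake_filter(
    trigger: str,
    classify: Callable[[str], bool],
    run_agent: Callable[[str], str],
) -> Optional[str]:
    """Run the expensive agent only when the cheap classifier says so."""
    if not classify(trigger):
        return None  # dropped as noise, no agent cycle spent
    return run_agent(trigger)


def expected_cost_per_trigger(
    classifier_cost: float, agent_cost: float, pass_rate: float
) -> float:
    """Every trigger pays the classifier; only the passed fraction pays the agent."""
    return classifier_cost + pass_rate * agent_cost


# Hypothetical stand-ins for a local classifier and the main agent.
def mentions_agent(trigger: str) -> bool:
    return "@agent" in trigger.lower()


def fake_agent(trigger: str) -> str:
    return f"handled: {trigger}"


print(wake_filter("anyone up for lunch?", mentions_agent, fake_agent))
print(wake_filter("@agent rerun the build", mentions_agent, fake_agent))
```

The cost function shows why the classifier's price matters: with classifier_cost near zero, the filter pays off whenever pass_rate is below 1, and the remaining tradeoff is the added latency and the triggers the classifier wrongly drops.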

Agent Context Preservation for Long-Running Workflows: Checkpoints, Sub-Agent Delegation, and Avoiding Context Pollution

February 22, 2026

Agent-Tooling

Intermediate, Advanced

Context-Preservation, Checkpoint-Design, Sub-Agent-Delegation, Context-Scoping, Workflow-State-Management

Context-Management, Context-Preservation, Checkpoints, Sub-Agents, Context-Pollution, Long-Running, Todo-Lists, Claude-Code, Memory-Files, Skills, Spec-Documents, Agent-Delegation, Context-Window

Claude-Code, Filesystem, Git, Markdown

Agent Context Preservation for Long-Running Workflows

The context window is the single most important constraint in agent-driven work. A single-turn task uses a fraction of it. A multi-hour project fills it, overflows it, and degrades the agent’s reasoning quality long before the task is complete. Agents that work effectively on ambitious projects are not smarter – they manage context better.

This article covers practical, battle-tested patterns for preserving context across long sessions, delegating to sub-agents without losing coherence, and avoiding context pollution – the gradual degradation that happens when irrelevant information accumulates in the working context.

Agent Debugging Patterns: Tracing Decisions in Production

February 22, 2026

Agent-Tooling

Intermediate, Advanced

Agent-Debugging, Observability-Design, Production-Monitoring

Debugging, Observability, Tracing, Logging, Hallucination

Python, Typescript, Opentelemetry, Structured-Logging

Agent Debugging Patterns

When an agent produces a wrong answer, the question is always the same: why did it do that? Unlike traditional software, where you can read a stack trace, agent failures are buried in a chain of LLM decisions, tool calls, and context accumulation. Debugging agents requires specialized observability that captures not just what happened, but what the agent was thinking at each step.

Tracing Agent Decision Chains

Every agent action follows a decision chain: the model reads its context, decides which tool to call (or whether to respond directly), processes the result, and decides again. To debug failures, you need to see this chain as a structured trace.
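One way to picture such a trace, as a sketch only: each step records what the model decided, which tool it chose (if any), and a summary of the result. The Step and Trace classes and their field names are hypothetical and not tied to any tracing library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Step:
    index: int
    decision: str            # "tool_call" or "respond"
    reasoning: str           # short summary of why the model chose this
    tool: Optional[str]      # tool name, or None for a direct response
    result: Optional[str]    # summary of what came back


@dataclass
class Trace:
    trace_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, decision: str, reasoning: str,
               tool: Optional[str] = None, result: Optional[str] = None) -> None:
        """Append one decision to the chain, numbered in order."""
        self.steps.append(Step(len(self.steps), decision, reasoning, tool, result))

    def to_json(self) -> str:
        """Serialize the whole chain for a structured log line."""
        return json.dumps(asdict(self))


trace = Trace(trace_id="demo")
trace.record("tool_call", "need current docs", tool="search", result="3 hits")
trace.record("respond", "hits answer the question")
print(trace.to_json())
```

Keeping the steps ordered and serializable is the design point: a wrong answer can then be traced back to the specific step where the decision or the tool result went off course.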

Agent Error Handling: Retries, Degradation, and Circuit Breakers

February 22, 2026

Agent-Tooling

Intermediate

Error-Recovery, Resilient-Agent-Design

Error-Handling, Retries, Circuit-Breaker, Resilience

Python, Typescript

Agent Error Handling

Agents call tools that call APIs that talk to services that query databases. Every link in that chain can fail. The difference between a useful agent and a frustrating one is what happens when something breaks.

Classify the Failure First

Before deciding how to handle an error, classify it. The strategy depends entirely on whether the failure is transient or permanent.

Transient failures will likely succeed on retry: network timeouts, rate limits (HTTP 429), server overload (HTTP 503), connection resets, temporary DNS failures. These are the majority of failures in practice.
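The classification step can be sketched as a small function. The status codes and exception types below come from the list above; the names FailureKind and classify_failure are hypothetical, and treating every unlisted failure as permanent is an assumption of this sketch, not a rule from the article. Temporary DNS failures are left out because detecting them is platform-specific.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"   # likely to succeed on retry
    PERMANENT = "permanent"   # retrying will not help


# HTTP 429 (rate limit) and 503 (server overload), as listed above.
TRANSIENT_STATUS = {429, 503}
# Built-in exceptions for timeouts and connection resets.
TRANSIENT_EXCEPTIONS = (TimeoutError, ConnectionResetError)


def classify_failure(
    status: Optional[int] = None,
    exc: Optional[BaseException] = None,
) -> FailureKind:
    """Classify a failure from an HTTP status code and/or a raised exception."""
    if status in TRANSIENT_STATUS:
        return FailureKind.TRANSIENT
    if exc is not None and isinstance(exc, TRANSIENT_EXCEPTIONS):
        return FailureKind.TRANSIENT
    # Assumption: anything not recognized as transient is treated as permanent.
    return FailureKind.PERMANENT


print(classify_failure(status=429))
print(classify_failure(exc=TimeoutError("read timed out")))
print(classify_failure(status=404))
```

Defaulting to permanent is the conservative choice here: an unknown error is surfaced instead of being retried in a loop.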

Agent Evaluation and Testing: Measuring What Matters in Agent Performance

February 22, 2026

Agent-Tooling

Advanced

Agent-Evaluation, Test-Harness-Design, Metrics-Engineering

Testing, Evaluation, Metrics, Benchmarks, Regression-Testing, A-B-Testing

Python, Pytest, Json-Schema

Agent Evaluation and Testing

You cannot improve what you cannot measure. Agent evaluation is harder than traditional software testing because agents are non-deterministic, their behavior depends on prompt wording, and the same input can produce multiple valid outputs. But “it is hard” is not an excuse for not doing it. This article provides a step-by-step framework for building an agent evaluation pipeline that catches regressions, compares configurations, and quantifies real-world performance.

Agent Memory and Retrieval: Patterns for Persistent, Searchable Agent Knowledge

February 22, 2026

Agent-Tooling

Intermediate

Memory-System-Design, Rag-Implementation, Context-Optimization

Memory, Retrieval, Rag, Vector-Databases, Context-Window, Embeddings

Chromadb, Pgvector, Sqlite, Redis, Python

Agent Memory and Retrieval

An agent without memory repeats mistakes, forgets context, and relearns the same facts every session. An agent with too much memory wastes context window tokens on irrelevant history and retrieves noise instead of signal. Effective memory sits between these extremes – storing what matters, retrieving what is relevant, and forgetting what is stale.

This reference covers the concrete patterns for building agent memory systems, from simple file-based approaches to production-grade retrieval pipelines.

Agent Runbook Generation: Producing Verified Infrastructure Deliverables

February 22, 2026

Agent-Tooling

Intermediate

Runbook-Generation, Sandbox-Testing, Deliverable-Packaging

Runbooks, Deliverables, Sandbox, Infrastructure, Playbooks, Manifests

Helm, Kubectl, Terraform, Bash, Docker

Agent Runbook Generation

An agent that says “you should probably add a readiness probe to your deployment” is giving advice. An agent that hands you a tested manifest with the readiness probe configured, verified against a real cluster, with rollback steps in case the probe is misconfigured – that agent is producing a deliverable. The difference matters.

The core thesis of infrastructure agent work is that the output is always a deliverable – a runbook, playbook, tested manifest, or validated configuration – never a direct action on someone else’s systems. This article covers the complete workflow for generating those deliverables: understanding requirements, planning steps, executing in a sandbox, capturing what worked, and packaging the result.

Agent Sandboxing: Isolation Strategies for Execution Environments

February 22, 2026

Agent-Tooling

Intermediate, Advanced

Sandbox-Design, Security-Architecture, Container-Isolation

Sandboxing, Security, Containers, Isolation, Firecracker, Gvisor

Docker, Firecracker, Gvisor, Seccomp, Apparmor

Agent Sandboxing

An AI agent that can execute code, run shell commands, or call APIs needs a sandbox. Without one, a single bad tool call – whether from a bug, a hallucination, or a prompt injection attack – can read secrets, modify production data, or pivot to other systems. This article is a decision framework for choosing the right sandboxing strategy based on your trust level, threat model, and performance requirements.