Agent Evaluation and Testing: Measuring What Matters in Agent Performance

Agent Evaluation and Testing#

You cannot improve what you cannot measure. Agent evaluation is harder than traditional software testing: agents are non-deterministic, their behavior shifts with prompt wording, and the same input can produce multiple valid outputs. But “it is hard” is not an excuse to skip it. This article provides a step-by-step framework for building an agent evaluation pipeline that catches regressions, compares configurations, and quantifies real-world performance.
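
To make that concrete up front, here is a minimal sketch of a single eval case that handles non-determinism by running the agent several times and accepting any of a set of valid outputs – `run_agent`, the trial count, and the pass threshold are all illustrative assumptions, not a prescribed interface:

```python
# Minimal eval-case sketch. `run_agent` is a hypothetical entry point that
# takes a prompt and returns the agent's final answer as a string.
def evaluate_case(run_agent, prompt: str, valid_outputs: set[str],
                  trials: int = 5, pass_rate: float = 0.8) -> bool:
    """Run a non-deterministic agent several times; pass only if enough
    trials land on any acceptable output."""
    passes = sum(run_agent(prompt) in valid_outputs for _ in range(trials))
    return passes / trials >= pass_rate

# Usage: the case passes only if at least 80% of trials produce a valid answer.
# evaluate_case(my_agent, "Classify this bug: ...", {"critical", "high"})
```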

Agent-Friendly API Design: Building APIs That Agents Can Consume

Agent-Friendly API Design#

Most APIs are designed for human developers who read documentation, interpret ambiguous error messages, and adapt their approach based on experience. Agents do not have these skills. They parse structured responses, follow explicit instructions, and fail on ambiguity. An API that is pleasant for humans to use may be impossible for an agent to use reliably.

This reference covers practical patterns for designing APIs – or modifying existing ones – so that agents can consume them effectively.
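
For a concrete contrast, compare the same failure expressed two ways – the field names in the second response are illustrative, not a prescribed standard:

```python
# A human-friendly error: fine for a developer reading it, opaque to an agent.
human_error = {"error": "Something went wrong with your request."}

# An agent-friendly error: a stable machine-readable code, the exact field
# at fault, and an explicit, actionable next step. Field names are illustrative.
agent_error = {
    "error": {
        "code": "INVALID_PARAMETER",
        "field": "start_date",
        "message": "start_date must be ISO 8601 (YYYY-MM-DD).",
        "received": "01/02/2024",
        "retryable": True,
        "suggestion": "Reformat start_date as '2024-01-02' and retry.",
    }
}
```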

MCP Server Development: Building Servers from Scratch

MCP Server Development#

This reference covers building MCP servers from scratch – the server lifecycle, defining tools with proper JSON Schema, exposing resources, choosing transports, handling errors, and testing the result. If you want to understand when to use MCP versus alternatives, see the companion article on MCP Server Patterns. This article focuses on how to build one.

Server Lifecycle#

An MCP server goes through four phases: initialization, capability negotiation, operation, and shutdown.
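
As a point of reference before digging into each phase, here is a minimal server sketch using the official Python SDK's FastMCP helper (assuming `pip install mcp`); it drives the initialization handshake, capability negotiation, and shutdown for you, and the `echo` tool is just a placeholder:

```python
# Minimal MCP server sketch using the official Python SDK's FastMCP helper.
# FastMCP handles the lifecycle: it answers the client's `initialize` request,
# advertises the tools capability, serves requests, and shuts down cleanly.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def echo(text: str) -> str:
    """Echo the input back to the caller."""
    return text

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```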

Structured Output Patterns: Getting Reliable JSON from LLMs

Structured Output Patterns#

Agents need structured data from LLMs – not free-form text with JSON somewhere inside it. When an agent asks a model to classify a bug as critical/medium/low and gets back a paragraph explaining the classification, the agent cannot act on it programmatically. Structured output is the bridge between LLM reasoning and deterministic code.
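
To see why, consider a hypothetical agent trying to salvage a severity label from a prose reply – the string surgery below is exactly the brittleness structured output removes:

```python
import json
import re

# What the model returns when asked for a severity in free-form text.
reply = "I'd classify this as critical because the service crashes on startup."

# Brittle: hope some JSON is embedded in the prose and fish it out with a
# regex. Here there is none, so the agent is left with nothing to act on.
match = re.search(r"\{.*\}", reply, re.DOTALL)
severity = json.loads(match.group()) if match else None  # None here
```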

Three Approaches#

JSON Mode#

The simplest approach. Tell the API to return valid JSON and describe the shape you want in the prompt.
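
A hedged sketch against the OpenAI Chat Completions API – other providers expose an equivalent switch – with the desired shape described in the prompt and validated afterward, since JSON mode only guarantees syntactically valid JSON, not your keys:

```python
# JSON mode sketch (OpenAI Chat Completions API; other providers differ).
import json

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # guarantees valid JSON only
    messages=[{
        "role": "user",
        "content": (
            "Classify this bug report's severity. Respond with JSON shaped "
            'like {"severity": "critical" | "medium" | "low"}.\n\n'
            "Bug: the service crashes on startup."
        ),
    }],
)

# Validate the shape yourself: JSON mode does not enforce your schema.
result = json.loads(response.choices[0].message.content)
assert result.get("severity") in {"critical", "medium", "low"}
```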

Structured Skill Definitions: Describing What Agents Can Do

Structured Skill Definitions#

When an agent has access to dozens of tools, it needs more than names and descriptions to use them well. It needs to know what inputs each tool expects, what outputs it produces, what other tools or infrastructure must be present, and how expensive or risky a call is. A structured skill definition captures all of this in a machine-readable format.

Why Not Just Use Function Signatures?#

Function signatures tell you the types of parameters. They do not tell you that a skill requires kubectl to be installed, takes 10-30 seconds to run, needs cluster-admin permissions, and might delete resources if called with the wrong flags. Agents making autonomous decisions need this information up front, not buried in documentation they may not read.
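
One hypothetical shape for such a definition, here as a Python dataclass – every field name is illustrative, but it captures exactly the facts a signature omits:

```python
from dataclasses import dataclass, field

# Hypothetical structured skill definition. Field names are illustrative;
# the point is that requirements, latency, permissions, and risk are
# machine-readable rather than buried in prose documentation.
@dataclass
class SkillDefinition:
    name: str
    description: str
    inputs: dict[str, str]   # parameter name -> type/constraint
    outputs: dict[str, str]
    requires: list[str] = field(default_factory=list)     # external binaries, services
    permissions: list[str] = field(default_factory=list)
    latency_seconds: tuple[float, float] = (0.0, 1.0)     # expected min/max
    destructive: bool = False  # can this call delete or mutate resources?

restart_deployment = SkillDefinition(
    name="restart_deployment",
    description="Rolling-restart a Kubernetes deployment.",
    inputs={"deployment": "string", "namespace": "string"},
    outputs={"status": "string"},
    requires=["kubectl"],
    permissions=["cluster-admin"],
    latency_seconds=(10.0, 30.0),
    destructive=True,
)
```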