Agent Evaluation and Testing: Measuring What Matters in Agent Performance

Agent Evaluation and Testing#

You cannot improve what you cannot measure. Agent evaluation is harder than traditional software testing because agents are non-deterministic, their behavior depends on prompt wording, and the same input can produce multiple valid outputs. But “it is hard” is not an excuse for not doing it. This article provides a step-by-step framework for building an agent evaluation pipeline that catches regressions, compares configurations, and quantifies real-world performance.

Choosing a Monitoring Stack: Prometheus vs Datadog vs Cloud-Native vs VictoriaMetrics

Choosing a Monitoring Stack#

Monitoring is not optional. Without metrics, you are guessing. The question is not whether to monitor but which stack to use. The right choice depends on your cost tolerance, operational capacity, retention requirements, and how much you value control versus convenience.

Decision Criteria#

Before comparing tools, clarify what matters to your organization:

  • Cost model: Are you optimizing for infrastructure spend or engineering time? Self-managed tools cost less in licensing but more in operational hours. SaaS tools cost more in subscription fees but less in engineering effort.
  • Operational burden: Who manages the monitoring system? Do you have an infrastructure team, or are developers responsible for everything?
  • Data retention: Do you need metrics for 15 days, 90 days, or years? Long retention changes the equation significantly.
  • Query capability: Does your team know PromQL? Do they need ad-hoc analysis or mostly pre-built dashboards?
  • Alerting requirements: Simple threshold alerts, or complex multi-signal alerts with routing and escalation?
  • Team expertise: An organization fluent in Prometheus wastes that investment by switching to Datadog. An organization with no Prometheus experience faces a learning curve.

Options at a Glance#

Capability Prometheus + Grafana Prometheus + Thanos/Mimir VictoriaMetrics Datadog Cloud-Native Grafana Cloud
Cost model Infrastructure only Infrastructure only Infrastructure only Per host ($15-23/mo) Per metric/API call Per series/GB
Operational burden High Very high Medium None Low Low
Query language PromQL PromQL MetricsQL (PromQL-compatible) Datadog query language Vendor-specific PromQL, LogQL
Default retention 15 days (local disk) Unlimited (object storage) Unlimited (configurable) 15 months Varies (15 days - 15 months) Plan-dependent
HA built-in No (requires federation) Yes Yes (cluster mode) Yes Yes Yes
Multi-cluster Federation (limited) Yes (global view) Yes (cluster mode) Yes Per-account Yes
APM/Tracing No (separate tools) No (separate tools) No (separate tools) Yes (integrated) Varies Yes (Tempo)
Vendor lock-in None None Low High High Low-Medium

Prometheus + Grafana (Self-Managed)#

Prometheus is the de facto standard for Kubernetes metrics. It uses a pull-based model, scraping metrics from endpoints at configurable intervals, and stores time series data on local disk. Grafana provides visualization. Alertmanager handles alert routing.

Grafana Mimir for Long-Term Prometheus Storage

Grafana Mimir for Long-Term Prometheus Storage#

Prometheus stores metrics on local disk with a practical retention limit of weeks to a few months. Beyond that, you need a long-term storage solution. Grafana Mimir is a horizontally scalable, multi-tenant time series database designed for exactly this purpose. It is API-compatible with Prometheus – Grafana queries Mimir using the same PromQL, and Prometheus pushes data to Mimir via remote_write.

Mimir is the successor to Cortex. Grafana Labs forked Cortex, rewrote significant portions for performance, and released Mimir under the AGPLv3 license. If you see references to Cortex architecture, the concepts map directly to Mimir with improvements.

OpenTelemetry for Kubernetes

What OpenTelemetry Is#

OpenTelemetry (OTel) is a vendor-neutral framework for generating, collecting, and exporting telemetry data: traces, metrics, and logs. It provides APIs, SDKs, and the Collector – a standalone binary that receives, processes, and exports telemetry. OTel replaces the fragmented landscape of Jaeger client libraries, Zipkin instrumentation, Prometheus client libraries, and proprietary agents with a single standard.

The three signal types:

  • Traces: Record the path of a request through distributed services as a tree of spans. Each span has a name, duration, attributes, and parent reference.
  • Metrics: Numeric measurements (counters, gauges, histograms) emitted by applications and infrastructure. OTel metrics can be exported to Prometheus.
  • Logs: Structured log records correlated with trace context. OTel log support bridges existing logging libraries with trace correlation.

The OTel Collector Pipeline#

The Collector is the central hub. It has three pipeline stages:

PromQL Essentials: Practical Query Patterns

Instant Vectors vs Range Vectors#

An instant vector returns one sample per time series at a single point in time. A range vector returns multiple samples per time series over a time window.

# Instant vector: current value of each series
http_requests_total{job="api"}

# Range vector: last 5 minutes of samples for each series
http_requests_total{job="api"}[5m]

You cannot graph a range vector directly. Functions like rate() and increase() consume a range vector and return an instant vector, which Grafana can then plot.

Setting Up Full Observability from Scratch: Metrics, Logs, Traces, and Alerting

Setting Up Full Observability from Scratch#

This operational sequence deploys a complete observability stack on Kubernetes: metrics (Prometheus + Grafana), logs (Loki + Promtail), traces (Tempo + OpenTelemetry), and alerting (Alertmanager). Each phase is self-contained with verification steps. Complete them in order – later phases depend on earlier infrastructure.

Prerequisite: a running Kubernetes cluster with Helm installed and a monitoring namespace created.

kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Phase 1 – Metrics (Prometheus + Grafana)#

Metrics are the foundation. Logging and tracing integrations all route through Grafana, so this phase must be solid before continuing.

Time-Series Database Selection and Operations

Time-Series Database Selection and Operations#

Time-series databases optimize for a specific access pattern: high-volume writes of timestamped data points, queries that aggregate over time ranges, and automatic expiration of old data. Choosing the right one depends on your data model, query patterns, retention requirements, and operational constraints.

When You Need a Time-Series Database#

A dedicated time-series database is justified when you have high write throughput (thousands to millions of data points per second), queries that are predominantly time-range aggregations, and data that has a defined retention period. Common use cases: infrastructure metrics, application performance monitoring, IoT sensor data, financial tick data, and log analytics.

Advanced PromQL: Performance, Cardinality, and Complex Query Patterns

Cardinality Explosion#

Cardinality is the number of unique time series Prometheus tracks. Every unique combination of metric name and label key-value pairs creates a separate series. A metric with 3 labels, each having 100 possible values, generates up to 1,000,000 series. In practice, cardinality explosions are the single most common way to kill a Prometheus instance.

The usual culprits are labels containing user IDs, request paths with embedded IDs (like /api/users/a]3f7b2c1), session tokens, trace IDs, or any unbounded value set. A seemingly innocent label like path on an HTTP metric becomes catastrophic when your API has RESTful routes with UUIDs in the path.

Long-Term Metrics Storage: Thanos vs Grafana Mimir vs VictoriaMetrics

The Retention Problem#

Prometheus stores metrics on local disk with a default retention of 15 days. Most production teams extend this to 30 or 90 days, but local storage has hard limits. A single Prometheus instance cannot scale disk beyond the node it runs on. It provides no high availability – if the instance goes down, you lose scraping and query access. And each Prometheus instance only sees its own targets, so there is no unified view across clusters or regions.