Agent Evaluation and Testing: Measuring What Matters in Agent Performance

February 22, 2026

Advanced

Agent-Evaluation, Test-Harness-Design, Metrics-Engineering

Testing, Evaluation, Metrics, Benchmarks, Regression-Testing, A-B-Testing

Agent Evaluation and Testing#

You cannot improve what you cannot measure. Agent evaluation is harder than traditional software testing because agents are non-deterministic, their behavior depends on prompt wording, and the same input can produce multiple valid outputs. But “it is hard” is not an excuse for not doing it. This article provides a step-by-step framework for building an agent evaluation pipeline that catches regressions, compares configurations, and quantifies real-world performance.

Choosing a Monitoring Stack: Prometheus vs Datadog vs Cloud-Native vs VictoriaMetrics

February 22, 2026

Observability

Intermediate

Monitoring-Architecture, Cost-Analysis, Tradeoff-Analysis

Prometheus, Datadog, Victoria-Metrics, Cloudwatch, Grafana, Monitoring, Metrics, Decision-Framework

Prometheus, Grafana, Thanos, Mimir, Victoria-Metrics, Datadog

Choosing a Monitoring Stack#

Monitoring is not optional. Without metrics, you are guessing. The question is not whether to monitor but which stack to use. The right choice depends on your cost tolerance, operational capacity, retention requirements, and how much you value control versus convenience.

Decision Criteria#

Before comparing tools, clarify what matters to your organization:

Cost model: Are you optimizing for infrastructure spend or engineering time? Self-managed tools cost less in licensing but more in operational hours. SaaS tools cost more in subscription fees but less in engineering effort.
Operational burden: Who manages the monitoring system? Do you have an infrastructure team, or are developers responsible for everything?
Data retention: Do you need metrics for 15 days, 90 days, or years? Long retention changes the equation significantly.
Query capability: Does your team know PromQL? Do they need ad-hoc analysis or mostly pre-built dashboards?
Alerting requirements: Simple threshold alerts, or complex multi-signal alerts with routing and escalation?
Team expertise: An organization fluent in Prometheus wastes that investment by switching to Datadog. An organization with no Prometheus experience faces a learning curve.

Options at a Glance#

Capability	Prometheus + Grafana	Prometheus + Thanos/Mimir	VictoriaMetrics	Datadog	Cloud-Native	Grafana Cloud
Cost model	Infrastructure only	Infrastructure only	Infrastructure only	Per host ($15-23/mo)	Per metric/API call	Per series/GB
Operational burden	High	Very high	Medium	None	Low	Low
Query language	PromQL	PromQL	MetricsQL (PromQL-compatible)	Datadog query language	Vendor-specific	PromQL, LogQL
Default retention	15 days (local disk)	Unlimited (object storage)	Unlimited (configurable)	15 months	Varies (15 days - 15 months)	Plan-dependent
HA built-in	No (requires federation)	Yes	Yes (cluster mode)	Yes	Yes	Yes
Multi-cluster	Federation (limited)	Yes (global view)	Yes (cluster mode)	Yes	Per-account	Yes
APM/Tracing	No (separate tools)	No (separate tools)	No (separate tools)	Yes (integrated)	Varies	Yes (Tempo)
Vendor lock-in	None	None	Low	High	High	Low-Medium

Prometheus + Grafana (Self-Managed)#

Prometheus is the de facto standard for Kubernetes metrics. It uses a pull-based model, scraping metrics from endpoints at configurable intervals, and stores time series data on local disk. Grafana provides visualization. Alertmanager handles alert routing.

Grafana Mimir for Long-Term Prometheus Storage

February 22, 2026

Observability

Intermediate, Advanced

Mimir-Deployment, Prometheus-Remote-Write, Metrics-Storage-Architecture, Tenant-Isolation-Configuration, Performance-Tuning

Mimir, Prometheus, Long-Term-Storage, Grafana, Metrics, Remote-Write, Tsdb, Object-Storage, Multi-Tenancy

Mimir, Prometheus, Grafana, Helm, Kubectl, S3, Gcs, Minio

Grafana Mimir for Long-Term Prometheus Storage#

Prometheus stores metrics on local disk with a practical retention limit of weeks to a few months. Beyond that, you need a long-term storage solution. Grafana Mimir is a horizontally scalable, multi-tenant time series database designed for exactly this purpose. It is API-compatible with Prometheus – Grafana queries Mimir using the same PromQL, and Prometheus pushes data to Mimir via remote_write.

Mimir is the successor to Cortex. Grafana Labs forked Cortex, rewrote significant portions for performance, and released Mimir under the AGPLv3 license. If you see references to Cortex architecture, the concepts map directly to Mimir with improvements.

OpenTelemetry for Kubernetes

February 22, 2026

Observability

Intermediate

Otel-Collector-Config, Auto-Instrumentation, Trace-Pipeline-Design

Opentelemetry, Otel, Tracing, Metrics, Kubernetes, Collector

Opentelemetry, Otel-Collector, Jaeger, Tempo, Prometheus, Helm, Kubectl

What OpenTelemetry Is#

OpenTelemetry (OTel) is a vendor-neutral framework for generating, collecting, and exporting telemetry data: traces, metrics, and logs. It provides APIs, SDKs, and the Collector – a standalone binary that receives, processes, and exports telemetry. OTel replaces the fragmented landscape of Jaeger client libraries, Zipkin instrumentation, Prometheus client libraries, and proprietary agents with a single standard.

The three signal types:

Traces: Record the path of a request through distributed services as a tree of spans. Each span has a name, duration, attributes, and parent reference.
Metrics: Numeric measurements (counters, gauges, histograms) emitted by applications and infrastructure. OTel metrics can be exported to Prometheus.
Logs: Structured log records correlated with trace context. OTel log support bridges existing logging libraries with trace correlation.

The OTel Collector Pipeline#

The Collector is the central hub. It has three pipeline stages:

PromQL Essentials: Practical Query Patterns

February 22, 2026

Observability

Intermediate

Promql-Query-Writing, Metric-Analysis, Recording-Rule-Design

Promql, Prometheus, Metrics, Monitoring, Queries, Recording-Rules

Prometheus, Grafana

Instant Vectors vs Range Vectors#

An instant vector returns one sample per time series at a single point in time. A range vector returns multiple samples per time series over a time window.

# Instant vector: current value of each series
http_requests_total{job="api"}

# Range vector: last 5 minutes of samples for each series
http_requests_total{job="api"}[5m]

You cannot graph a range vector directly. Functions like rate() and increase() consume a range vector and return an instant vector, which Grafana can then plot.

Setting Up Full Observability from Scratch: Metrics, Logs, Traces, and Alerting

February 22, 2026

Observability

Intermediate

Observability-Stack-Deployment, Monitoring-Configuration, Alerting-Setup, Distributed-Tracing

Prometheus, Grafana, Loki, Tempo, Opentelemetry, Alertmanager, Tracing, Logging, Metrics

Helm, Kubectl, Prometheus, Grafana, Loki, Tempo, Opentelemetry

Setting Up Full Observability from Scratch#

This operational sequence deploys a complete observability stack on Kubernetes: metrics (Prometheus + Grafana), logs (Loki + Promtail), traces (Tempo + OpenTelemetry), and alerting (Alertmanager). Each phase is self-contained with verification steps. Complete them in order – later phases depend on earlier infrastructure.

Prerequisite: a running Kubernetes cluster with Helm installed and a monitoring namespace created.

kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Phase 1 – Metrics (Prometheus + Grafana)#

Metrics are the foundation. Logging and tracing integrations all route through Grafana, so this phase must be solid before continuing.

Time-Series Database Selection and Operations

February 22, 2026

Databases

Intermediate, Advanced

Timeseries-Database-Selection, Retention-Policy-Design, Downsampling-Operations, Metrics-Architecture

Timeseries, Influxdb, Timescaledb, Victoriametrics, Prometheus, Retention, Downsampling, Metrics, Iot

Influx, Psql, Vmctl, Promtool, Thanos, Cortex

Time-Series Database Selection and Operations#

Time-series databases optimize for a specific access pattern: high-volume writes of timestamped data points, queries that aggregate over time ranges, and automatic expiration of old data. Choosing the right one depends on your data model, query patterns, retention requirements, and operational constraints.

When You Need a Time-Series Database#

A dedicated time-series database is justified when you have high write throughput (thousands to millions of data points per second), queries that are predominantly time-range aggregations, and data that has a defined retention period. Common use cases: infrastructure metrics, application performance monitoring, IoT sensor data, financial tick data, and log analytics.

Advanced PromQL: Performance, Cardinality, and Complex Query Patterns

February 21, 2026

Observability

Advanced

Promql-Optimization, Cardinality-Management, Recording-Rule-Strategy, Sli-Query-Design

Promql, Prometheus, Cardinality, Recording-Rules, Performance, Metrics, Monitoring

Prometheus, Grafana, Promtool

Cardinality Explosion#

Cardinality is the number of unique time series Prometheus tracks. Every unique combination of metric name and label key-value pairs creates a separate series. A metric with 3 labels, each having 100 possible values, generates up to 1,000,000 series. In practice, cardinality explosions are the single most common way to kill a Prometheus instance.

The usual culprits are labels containing user IDs, request paths with embedded IDs (like /api/users/a]3f7b2c1), session tokens, trace IDs, or any unbounded value set. A seemingly innocent label like path on an HTTP metric becomes catastrophic when your API has RESTful routes with UUIDs in the path.

Long-Term Metrics Storage: Thanos vs Grafana Mimir vs VictoriaMetrics

February 21, 2026

Observability

Intermediate

Long-Term-Storage-Design, Multi-Cluster-Monitoring, Metrics-Architecture, Cost-Modeling

Prometheus, Thanos, Mimir, Victoriametrics, Long-Term-Storage, Multi-Cluster, Object-Storage, Metrics

Prometheus, Thanos, Grafana-Mimir, Victoriametrics, Grafana, S3, Gcs

The Retention Problem#

Prometheus stores metrics on local disk with a default retention of 15 days. Most production teams extend this to 30 or 90 days, but local storage has hard limits. A single Prometheus instance cannot scale disk beyond the node it runs on. It provides no high availability – if the instance goes down, you lose scraping and query access. And each Prometheus instance only sees its own targets, so there is no unified view across clusters or regions.