Agent Debugging Patterns: Tracing Decisions in Production

Agent Debugging Patterns#

When an agent produces a wrong answer, the question is always the same: why did it do that? Unlike traditional software, where a stack trace points you at the failure, an agent's failure is buried in a chain of LLM decisions, tool calls, and accumulated context. Debugging agents requires specialized observability that captures not just what happened, but what the agent was thinking at each step.

Tracing Agent Decision Chains#

Every agent action follows a decision chain: the model reads its context, decides which tool to call (or whether to respond directly), processes the result, and decides again. To debug failures, you need to see this chain as a structured trace.
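
A rough sketch of what that looks like in code, using the OpenTelemetry Python API: the llm.decide wrapper, the tools mapping, and the attribute names are illustrative, and spans are only recorded if a tracer provider with an exporter has been configured elsewhere.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("agent.debugging")

def run_agent(task, llm, tools, max_steps=10):
    # One span for the whole run; one child span per decision step.
    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.task", task)
        context = [{"role": "user", "content": task}]
        for step in range(max_steps):
            with tracer.start_as_current_span("agent.step") as span:
                span.set_attribute("agent.step.index", step)
                decision = llm.decide(context)  # illustrative LLM wrapper
                span.set_attribute("agent.decision", decision.action)  # "respond" or "tool_call"
                if decision.action == "respond":
                    return decision.text
                # Record which tool ran and a preview of what it returned.
                span.set_attribute("tool.name", decision.tool)
                result = tools[decision.tool](**decision.arguments)
                span.add_event("tool.result", {"preview": str(result)[:200]})
                context.append({"role": "tool", "content": str(result)})
        run_span.set_status(Status(StatusCode.ERROR, "max steps exceeded"))

Read top-down, the resulting trace reconstructs the chain: which action the model chose at each step, which tool ran with what arguments, and what came back before the next decision.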

Distributed Tracing in Practice

Trace, Span, and Context#

A trace represents a single request flowing through a distributed system. It is identified by a 128-bit trace ID. A span represents one unit of work within that trace – an HTTP handler, a database query, a message publish. Each span has a name, start time, duration, status, attributes (key-value pairs), and events (timestamped annotations). Spans form a tree: every span except the root has a parent span ID.
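
A minimal sketch with the OpenTelemetry Python SDK – the service, span, and attribute names are made up, and a real deployment would export to a backend rather than the console – shows how nesting spans builds that tree:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure the SDK once at startup so spans are recorded and exported.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("orders.service")

# The outer span becomes the root; the inner span records the outer
# span's ID as its parent, so both share one trace ID.
with tracer.start_as_current_span("http.handler") as root:
    root.set_attribute("http.method", "GET")
    with tracer.start_as_current_span("db.query") as child:
        child.set_attribute("db.statement", "SELECT * FROM orders WHERE id = ?")
        child.add_event("rows.fetched", {"count": 1})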

OpenTelemetry for Kubernetes

What OpenTelemetry Is#

OpenTelemetry (OTel) is a vendor-neutral framework for generating, collecting, and exporting telemetry data: traces, metrics, and logs. It provides APIs, SDKs, and the Collector – a standalone binary that receives, processes, and exports telemetry. OTel replaces the fragmented landscape of Jaeger client libraries, Zipkin instrumentation, Prometheus client libraries, and proprietary agents with a single standard.

The three signal types:

  • Traces: Record the path of a request through distributed services as a tree of spans. Each span has a name, duration, attributes, and parent reference.
  • Metrics: Numeric measurements (counters, gauges, histograms) emitted by applications and infrastructure. OTel metrics can be exported to Prometheus.
  • Logs: Structured log records correlated with trace context. OTel log support bridges existing logging libraries with trace correlation.
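
Traces are covered by the span example above. As a comparable sketch of the metrics API – instrument and attribute names are illustrative, and the calls do nothing unless a meter provider with an exporter (OTLP, Prometheus, etc.) has been configured:

from opentelemetry import metrics

# Assumes a MeterProvider with an exporter was configured at startup;
# without one, these calls are no-ops.
meter = metrics.get_meter("checkout.service")

request_counter = meter.create_counter(
    "http.server.requests", unit="1", description="Completed HTTP requests"
)
request_latency = meter.create_histogram(
    "http.server.duration", unit="ms", description="Request latency"
)

# Record one request and its latency, dimensioned by route and status code.
request_counter.add(1, {"http.route": "/orders", "http.status_code": 200})
request_latency.record(12.7, {"http.route": "/orders", "http.status_code": 200})

Logs are the third leg: the OTel logging bridge stamps log records with the active trace and span IDs, which is what makes the trace correlation described above possible.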

The OTel Collector Pipeline#

The Collector is the central hub. It has three pipeline stages:

  • Receivers: Ingest telemetry from instrumented applications – OTLP over gRPC/HTTP, plus formats such as Jaeger, Zipkin, and Prometheus.
  • Processors: Transform data in flight – batching, memory limiting, filtering, attribute enrichment.
  • Exporters: Send processed data to backends such as Tempo, Prometheus, Loki, or a vendor endpoint.
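
A minimal Collector configuration wires those stages into a traces pipeline. This is a sketch: the OTLP receiver and batch processor are standard components, but the Tempo endpoint shown is an assumed in-cluster address and will differ per deployment.

# Receive OTLP traces from instrumented applications over gRPC.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

# Batch spans before export to reduce outbound request volume.
processors:
  batch: {}

# Forward to the tracing backend; this Tempo address is illustrative.
exporters:
  otlp:
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]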

Setting Up Full Observability from Scratch#

This operational sequence deploys a complete observability stack on Kubernetes: metrics (Prometheus + Grafana), logs (Loki + Promtail), traces (Tempo + OpenTelemetry), and alerting (Alertmanager). Each phase is self-contained with verification steps. Complete them in order – later phases depend on earlier infrastructure.

Prerequisite: a running Kubernetes cluster with Helm installed and a monitoring namespace created.

kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Phase 1 – Metrics (Prometheus + Grafana)#

Metrics are the foundation. The Grafana instance deployed in this phase is also where the Loki and Tempo data sources from later phases are wired in, so it must be solid before continuing.
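
One common way to cover this phase (not necessarily the exact commands this guide uses next) is the kube-prometheus-stack chart from the repository added above, which bundles Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics in a single release. The release name is illustrative, and production installs usually pin a chart version and pass a values file:

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring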