---
title: "Agent Debugging Patterns: Tracing Decisions in Production"
description: "How to debug AI agent behavior in production — tracing decision chains, logging tool calls and responses, identifying hallucination patterns, managing timeouts and retries, and context window observability."
url: https://agent-zone.ai/knowledge/agent-tooling/agent-debugging-patterns/
section: knowledge
date: 2026-02-22
categories: ["agent-tooling"]
tags: ["debugging","observability","tracing","logging","hallucination"]
skills: ["agent-debugging","observability-design","production-monitoring"]
tools: ["python","typescript","opentelemetry","structured-logging"]
levels: ["intermediate","advanced"]
word_count: 1448
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/agent-debugging-patterns/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/agent-debugging-patterns/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Agent+Debugging+Patterns%3A+Tracing+Decisions+in+Production
---


# Agent Debugging Patterns

When an agent produces a wrong answer, the question is always the same: why did it do that? Unlike traditional software, where a stack trace points you at the failure, an agent's failure is buried in a chain of LLM decisions, tool calls, and context accumulation. Debugging agents requires specialized observability that captures not just what happened, but what the agent was thinking at each step.

## Tracing Agent Decision Chains

Every agent action follows a decision chain: the model reads its context, decides which tool to call (or whether to respond directly), processes the result, and decides again. To debug failures, you need to see this chain as a structured trace.

### The Agent Trace

Model each agent turn as a span in a distributed trace. The parent span is the user request. Child spans are individual LLM calls and tool invocations.

```python
import json

from opentelemetry import trace

tracer = trace.get_tracer("agent")

async def agent_loop(user_message: str, session_id: str):
    with tracer.start_as_current_span("agent_request", attributes={
        "session.id": session_id,
        "user.message_length": len(user_message),
    }) as request_span:
        messages = build_context(user_message)

        for turn in range(MAX_TURNS):
            with tracer.start_as_current_span(f"llm_call_{turn}") as llm_span:
                llm_span.set_attribute("context.token_count", count_tokens(messages))
                response = await call_llm(messages)
                llm_span.set_attribute("response.has_tool_calls", bool(response.tool_calls))
                llm_span.set_attribute("response.finish_reason", response.finish_reason)

            if response.tool_calls:
                # The assistant turn must be appended before its tool results,
                # or the next LLM call sees orphaned tool-result messages.
                messages.append(make_assistant_message(response))
                for call in response.tool_calls:
                    with tracer.start_as_current_span(f"tool_{call.name}") as tool_span:
                        tool_span.set_attribute("tool.name", call.name)
                        tool_span.set_attribute("tool.params", json.dumps(call.arguments))
                        result = await execute_tool(call)
                        tool_span.set_attribute("tool.result_length", len(str(result)))
                        tool_span.set_attribute("tool.is_error", result.get("isError", False))
                        messages.append(make_tool_result(call.id, result))
            else:
                request_span.set_attribute("total_turns", turn + 1)
                return response.content

        # Exhausting MAX_TURNS without a final answer is itself a signal worth recording.
        request_span.set_attribute("total_turns", MAX_TURNS)
        request_span.set_attribute("max_turns_exhausted", True)
        return None
```

This trace tells you: how many turns the agent took, which tools it called at each turn, how large the context was at each LLM call, and whether any tool returned an error. When something goes wrong, you open the trace and walk the decision chain step by step.

### Key Attributes to Capture

For each LLM call: token count (input and output), finish reason (stop, tool_use, length), model used, latency.

For each tool call: tool name, input parameters, result size, error status, latency.

For the overall request: total turns, total tool calls, total tokens consumed, final response length, session ID.
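One way to keep these attributes consistent across spans is to accumulate them in a single summary record and flatten it onto the request span at the end. A minimal sketch -- the `RequestSummary` name and field set are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class RequestSummary:
    """Aggregate attributes for one agent request, filled in as the loop runs."""
    session_id: str
    total_turns: int = 0
    total_tool_calls: int = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    tool_errors: int = 0

    def as_attributes(self) -> dict:
        """Flatten to namespaced span attributes, e.g. agent.total_turns."""
        return {f"agent.{key}": value for key, value in self.__dict__.items()}
```

At the end of the loop, `request_span.set_attributes(summary.as_attributes())` stamps the whole record in one call.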

## Logging Tool Calls and Responses

Structured logging is the foundation of agent debugging. Every tool invocation must be logged with enough detail to reproduce the issue without re-running the agent.

```python
import json
import time

import structlog

logger = structlog.get_logger()

async def execute_tool(call: ToolCall) -> dict:
    start = time.monotonic()

    logger.info("tool_call_start",
        tool=call.name,
        params=redact_sensitive(call.arguments),
        session_id=current_session_id(),
    )

    try:
        result = await tool_registry[call.name](**call.arguments)
        duration = time.monotonic() - start

        logger.info("tool_call_success",
            tool=call.name,
            duration_ms=round(duration * 1000),
            result_size=len(json.dumps(result)),
            result_preview=truncate(str(result), 200),
        )
        return result

    except Exception as e:
        duration = time.monotonic() - start
        logger.error("tool_call_error",
            tool=call.name,
            duration_ms=round(duration * 1000),
            error_type=type(e).__name__,
            error_message=sanitize_error(str(e)),
        )
        return {"isError": True, "content": [{"type": "text", "text": str(e)}]}
```

Critical rules for agent logging:

**Redact sensitive parameters.** Tool calls may contain file paths with usernames, API endpoints with tokens, or database queries with credentials. Redact before logging.
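A minimal regex-based sketch of the `redact_sensitive` helper used above. The patterns are illustrative examples, not a complete secret-detection list:

```python
import re

# Patterns that commonly leak secrets in tool parameters -- an illustrative
# set, not exhaustive; extend for your environment.
REDACTION_PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|password|secret)=([^&\s]+)"), r"\1=[REDACTED]"),
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "Bearer [REDACTED]"),
    (re.compile(r"/home/[^/\s]+"), "/home/[USER]"),
]

def redact_sensitive(params: dict) -> dict:
    """Return a copy of params with known secret patterns masked."""
    def scrub(value):
        if isinstance(value, str):
            for pattern, replacement in REDACTION_PATTERNS:
                value = pattern.sub(replacement, value)
        return value
    return {key: scrub(value) for key, value in params.items()}
```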

**Truncate large results.** A tool that reads a 10,000-line file should not dump all 10,000 lines into the log. Log a preview (first 200 characters) and the full size.
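The `truncate` helper referenced throughout can be a few lines -- a sketch that also records how much was dropped, so log readers know the preview is partial:

```python
def truncate(text: str, limit: int) -> str:
    """Cut text to limit characters, marking how much was dropped."""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}... [+{len(text) - limit} chars truncated]"
```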

**Log the decision, not just the action.** When possible, capture why the agent chose a particular tool. This is hard to extract from the model, but you can infer it from the sequence: if the agent called `search` followed by `read_file`, it was looking for something specific.

## Identifying Hallucination Patterns

Agent hallucinations in infrastructure contexts are especially dangerous because they look plausible. The agent might reference a file that does not exist, use a kubectl flag that is not real, or cite a configuration parameter that was never set. There are patterns you can watch for.

### File Path Hallucination

The agent references files it has not actually read. Detect this by comparing tool results against subsequent agent claims.

```python
class HallucinationDetector:
    def __init__(self):
        self.files_read: set[str] = set()
        self.files_confirmed: set[str] = set()

    def on_tool_result(self, tool_name: str, params: dict, result: dict):
        if tool_name == "read_file" and not result.get("isError"):
            self.files_read.add(params["path"])
        if tool_name in ("search", "glob"):
            for path in extract_paths(result):
                self.files_confirmed.add(path)

    def check_response(self, response: str) -> list[str]:
        warnings = []
        mentioned_paths = extract_file_paths(response)
        for path in mentioned_paths:
            if path not in self.files_read and path not in self.files_confirmed:
                warnings.append(f"Agent mentions {path} but never read or confirmed it")
        return warnings
```
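The detector relies on an `extract_file_paths` helper. A heuristic sketch for Unix-style absolute paths -- real detectors should also handle relative paths, quoted paths, and Windows paths:

```python
import re

# Matches absolute Unix-style paths like /etc/nginx/nginx.conf.
# Heuristic only: misses relative paths (./src/main.py) and quoted paths.
PATH_RE = re.compile(r"(?:/[\w.\-]+)+")

def extract_file_paths(text: str) -> set[str]:
    """Pull absolute-looking file paths out of a model response."""
    return set(PATH_RE.findall(text))
```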

### Command Hallucination

The agent suggests or executes commands with flags or subcommands that do not exist. This happens when the agent generalizes from similar commands. Validate commands against known schemas before execution.

```python
KNOWN_KUBECTL_SUBCOMMANDS = {
    "get", "describe", "logs", "apply", "delete", "create",
    "edit", "patch", "rollout", "scale", "exec", "port-forward",
}

def validate_kubectl_command(args: list[str]) -> list[str]:
    warnings = []
    if args and args[0] not in KNOWN_KUBECTL_SUBCOMMANDS:
        warnings.append(f"Unknown kubectl subcommand: {args[0]}")
    return warnings
```

### Confidence Decay

Watch for hallucination signals that correlate with context window usage. As the context fills up, the model has less room for reasoning and is more likely to confabulate. Track the ratio of context used to context available at each turn.

```python
def context_pressure(current_tokens: int, max_tokens: int) -> float:
    ratio = current_tokens / max_tokens
    if ratio > 0.85:
        logger.warning("high_context_pressure",
            ratio=round(ratio, 2),
            tokens_used=current_tokens,
            tokens_max=max_tokens,
        )
    return ratio
```

## Timeout and Retry Debugging

Timeout failures are the hardest to debug because the evidence disappears -- the operation was killed before it could report what went wrong.

### Layered Timeout Tracking

Agent systems have timeouts at multiple levels, and they interact in confusing ways. Track all of them.

```python
from dataclasses import dataclass

@dataclass
class TimeoutContext:
    tool_timeout: float        # Individual tool execution limit
    turn_timeout: float        # Single agent turn limit
    session_timeout: float     # Total session wall-clock limit
    elapsed_session: float     # Time spent so far

    def remaining(self) -> dict:
        # Tool and turn budgets reset per operation, so their full limit
        # is always available; only the session budget is cumulative.
        return {
            "tool": self.tool_timeout,
            "turn": self.turn_timeout,
            "session": self.session_timeout - self.elapsed_session,
        }
```

When a timeout fires, log which layer triggered it and how much time remained at other layers. A tool timeout at 30 seconds is expected behavior. A session timeout at 5 minutes because the agent retried a failing tool 15 times is a design problem.
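A small helper can capture that picture at the moment a timeout fires. A sketch, assuming `remaining` is the dict produced by `TimeoutContext.remaining()`:

```python
import logging

logger = logging.getLogger("agent.timeouts")

def log_timeout(fired_layer: str, remaining: dict[str, float]) -> dict:
    """Record which timeout layer fired and the headroom left at the others."""
    snapshot = {
        "fired_layer": fired_layer,
        "other_layers": {k: round(v, 1) for k, v in remaining.items() if k != fired_layer},
    }
    logger.warning("timeout_fired: %s", snapshot)
    return snapshot
```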

### Retry Loop Detection

Agents can get stuck retrying the same failing operation. Detect this by tracking tool call patterns within a session.

```python
class RetryDetector:
    def __init__(self, max_identical_calls: int = 3):
        self.call_history: list[tuple[str, str]] = []
        self.max_identical = max_identical_calls

    def check(self, tool_name: str, params: dict) -> bool:
        key = (tool_name, json.dumps(params, sort_keys=True))
        self.call_history.append(key)

        # Count identical recent calls
        recent = self.call_history[-self.max_identical:]
        if len(recent) == self.max_identical and len(set(recent)) == 1:
            logger.warning("retry_loop_detected",
                tool=tool_name,
                identical_calls=self.max_identical,
            )
            return True
        return False
```

This detector fires when the agent calls the same tool with the same parameters three times in a row. At that point, either the tool is broken or the agent is stuck. Either way, continuing the loop will not help.
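What happens after the detector fires is a design choice. One illustrative intervention -- the message wording and strategy here are assumptions, not a fixed protocol -- is to refuse the repeated call and steer the model instead:

```python
def handle_retry_loop(tool_name: str, messages: list[dict]) -> list[dict]:
    """Break a retry loop by telling the model to change strategy.

    Instead of executing the repeated call yet again, append a synthetic
    message that forbids the failing call and asks for an alternative.
    """
    messages.append({
        "role": "user",
        "content": (
            f"The tool '{tool_name}' has failed repeatedly with identical "
            "parameters. Do not call it again with those parameters; either "
            "try a different approach or report the failure to the user."
        ),
    })
    return messages
```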

## Context Window Management Debugging

Context window overflow is an invisible failure mode. The model silently loses information as earlier messages get truncated or summarized. Debug this by tracking what the agent can and cannot see.

### Context Budget Tracking

```python
class ContextBudgetTracker:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.entries: list[dict] = []

    def add_entry(self, role: str, content: str, source: str):
        tokens = count_tokens(content)
        self.entries.append({
            "role": role,
            "source": source,
            "tokens": tokens,
            "timestamp": time.time(),
        })

    def report(self) -> dict:
        total = sum(e["tokens"] for e in self.entries)
        by_source = {}
        for e in self.entries:
            by_source.setdefault(e["source"], 0)
            by_source[e["source"]] += e["tokens"]

        return {
            "total_tokens": total,
            "max_tokens": self.max_tokens,
            "utilization": round(total / self.max_tokens, 2),
            "by_source": dict(sorted(by_source.items(), key=lambda x: -x[1])),
        }
```

The `by_source` breakdown tells you what is consuming context. If tool results account for 70% of context, your tools are returning too much data. If system instructions take 30%, they need trimming.
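When tool results dominate, one common fix is to cap each result before it enters the context. A sketch using the rough four-characters-per-token heuristic, since the real `count_tokens` helper is tokenizer-specific:

```python
def cap_tool_result(text: str, max_tokens: int = 2000) -> str:
    """Trim a tool result to roughly max_tokens (~4 chars per token)."""
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return f"{text[:max_chars]}\n[... {dropped} characters trimmed to fit context budget]"
```

The trailer tells the model (and the debugging human) that content was cut, which avoids a second-order failure: the agent reasoning over a silently incomplete result.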

### Context Eviction Logging

When the agent runtime trims old messages to make room, log what was removed. Future debugging depends on knowing whether the agent still had access to a critical piece of information when it made a decision.

```python
def evict_oldest_messages(messages: list, target_tokens: int) -> list:
    evicted = []
    while count_tokens(messages) > target_tokens and len(messages) > 2:
        removed = messages.pop(1)  # Keep system message (index 0)
        evicted.append({
            "role": removed["role"],
            "tokens": count_tokens([removed]),
            "preview": truncate(removed["content"], 100),
        })

    if evicted:
        logger.info("context_eviction",
            messages_removed=len(evicted),
            tokens_freed=sum(e["tokens"] for e in evicted),
            previews=[e["preview"] for e in evicted],
        )

    return messages
```

## Building a Debugging Dashboard

Combine traces, logs, and metrics into a debugging workflow. The practical approach uses three views.

**Session timeline.** A chronological view of one agent session showing each LLM call, tool invocation, and result. Click any step to see full inputs and outputs. This is your primary debugging tool for individual failures.

**Aggregate metrics.** Track across all sessions: average turns per task, tool error rates, timeout frequency, context utilization distribution, and hallucination detection rates. These reveal systemic issues -- a tool that fails 30% of the time, a prompt that consistently leads to retry loops.

**Anomaly detection.** Flag sessions that deviate from normal patterns: unusually high turn counts, same tool called more than 5 times, context utilization above 90%, or tool error rates above the session average. These outliers are where the bugs live.
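Those anomaly rules can be expressed as a small flagging function over a per-session summary. A sketch -- the summary dict shape and the thresholds are illustrative starting points, not universal constants:

```python
def flag_anomalies(session: dict) -> list[str]:
    """Flag a session that deviates from the normal patterns listed above."""
    flags = []
    if session.get("turns", 0) > 10:
        flags.append("high_turn_count")
    if max(session.get("tool_call_counts", {}).values(), default=0) > 5:
        flags.append("repeated_tool")
    if session.get("context_utilization", 0.0) > 0.90:
        flags.append("high_context_utilization")
    if session.get("tool_error_rate", 0.0) > session.get("avg_error_rate", 0.0):
        flags.append("elevated_error_rate")
    return flags
```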

The investment in agent observability pays off immediately. Without it, debugging an agent means re-running the conversation and hoping you can reproduce the issue. With it, you open the trace and read exactly what happened.

