Agent Error Handling: Retries, Degradation, and Circuit Breakers

Agent Error Handling

Agents call tools that call APIs that talk to services that query databases. Every link in that chain can fail. The difference between a useful agent and a frustrating one is what happens when something breaks.

Classify the Failure First

Before deciding how to handle an error, classify it. The strategy depends entirely on whether the failure is transient or permanent.

Transient failures are those where a retry is likely to succeed: network timeouts, rate limits (HTTP 429), server overload (HTTP 503), connection resets, temporary DNS failures. These are the majority of failures in practice.
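As a rough illustration of classifying before retrying, the sketch below retries only failures it recognizes as transient and re-raises everything else immediately. The `ToolError` class, the `call_with_retry` helper, and the specific attempt counts and delays are hypothetical names and values chosen for this example. DNS failures are left out because telling a temporary one from a permanent one depends on the resolver.

```python
import random
import time

# HTTP statuses the text lists as transient. Treating every other status as
# permanent is an assumption of this sketch.
TRANSIENT_STATUS = {429, 503}
# Built-in exceptions that correspond to timeouts and connection resets.
TRANSIENT_EXCEPTIONS = (TimeoutError, ConnectionResetError)


class ToolError(Exception):
    """Hypothetical error raised by a tool call, carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"tool call failed with HTTP {status}")
        self.status = status


def is_transient(exc):
    """Return True if a retry has a reasonable chance of succeeding."""
    if isinstance(exc, ToolError):
        return exc.status in TRANSIENT_STATUS
    return isinstance(exc, TRANSIENT_EXCEPTIONS)


def call_with_retry(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Permanent failures and the final attempt are re-raised as-is.
            if not is_transient(exc) or attempt == max_attempts:
                raise
            # Full jitter: wait a random time up to the exponential cap.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Passing `sleep` as a parameter is a small design choice that keeps the helper testable without real waiting. A production version would probably also honor a `Retry-After` header on 429 and 503 responses when one is present.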

Circuit Breaker and Resilience Patterns

In a microservice architecture, any downstream dependency can fail. Without resilience patterns, a single slow or failing service cascades into total system failure. Resilience patterns prevent this by failing fast, isolating failures, and recovering gracefully.

Circuit Breaker

The circuit breaker pattern monitors calls to a downstream service and stops making calls when failures reach a threshold. It has three states.

States

Closed (normal operation): All requests pass through. The circuit breaker counts failures. When the failure count reaches the threshold within a time window, the breaker trips to Open.
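A minimal single-threaded sketch of this behavior might look like the following. The Closed logic follows the description above; the Open and Half-Open handling follows the conventional form of the pattern and is an assumption here, as are the class name, parameter names, and default values. The breaker trips once the number of failures inside the window reaches `failure_threshold`.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative breaker; not thread-safe, names and defaults are assumed."""

    def __init__(self, failure_threshold=5, window_seconds=30.0,
                 reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None

    def call(self, fn):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Assumed Half-Open behavior: let one trial call through.
            self.state = "half_open"
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            raise
        if self.state == "half_open":
            # The trial call succeeded, so resume normal operation.
            self.state = "closed"
            self.failures.clear()
        return result

    def _record_failure(self, now):
        if self.state == "half_open":
            self._trip(now)  # a failed trial call reopens the circuit
            return
        self.failures.append(now)
        # Forget failures that fell out of the sliding time window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

A caller would wrap each downstream request, for example `breaker.call(lambda: fetch_profile(user_id))` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to fall back or degrade rather than wait on a service that is known to be failing.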