Game Day and Tabletop Exercise Planning

SRE

Why Run Exercises

Runbooks that have never been tested are fiction. Failover procedures that have never been executed are hopes. Game days and tabletop exercises convert assumptions about system resilience into verified facts – or reveal that those assumptions were wrong before a real incident does.

The value is not just finding technical gaps. Exercises expose process gaps: unclear escalation paths, missing permissions, outdated contact lists, communication breakdowns between teams. These are invisible until a simulated failure forces people to actually follow the documented procedure.

Agent Error Handling: Retries, Degradation, and Circuit Breakers

Agent Error Handling

Agents call tools that call APIs that talk to services that query databases. Every link in that chain can fail. The difference between a useful agent and a frustrating one is what happens when something breaks.

Classify the Failure First

Before deciding how to handle an error, classify it. The strategy depends entirely on whether the failure is transient or permanent.

Transient failures are those where the same call will likely succeed on retry: network timeouts, rate limits (HTTP 429), server overload (HTTP 503), connection resets, temporary DNS failures. These are the majority of failures in practice.
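The classification step can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `ToolError`, `is_transient`, and `call_with_retry` are hypothetical names, and the exponential backoff with jitter is an assumed retry policy rather than one the text prescribes. Temporary DNS failures are left out for brevity.

```python
import random
import time

# Status codes the text lists as transient: rate limiting and server overload.
TRANSIENT_STATUS = {429, 503}


class ToolError(Exception):
    """Hypothetical error type for a failed tool call carrying an HTTP status."""

    def __init__(self, message, status=None):
        super().__init__(message)
        self.status = status


def is_transient(exc):
    """Return True if retrying the same call is likely to succeed."""
    # Network timeouts and connection resets map to built-in exceptions.
    if isinstance(exc, (TimeoutError, ConnectionResetError)):
        return True
    return isinstance(exc, ToolError) and exc.status in TRANSIENT_STATUS


def call_with_retry(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying only transient failures (assumed backoff policy)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Permanent failures, or an exhausted retry budget, surface at once.
            if not is_transient(exc) or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many callers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Anything `is_transient` does not recognize is treated as permanent and raised on the first attempt, so an unknown error never triggers a retry loop.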

Chaos Engineering: From First Experiment to Mature Practice

SRE

Why Break Things on Purpose

Production systems fail in ways that testing environments never reveal. A database connection pool exhausted under load, a timeout cascading across three services, a DNS cache that masks a routing change until it expires – these failures only surface when real conditions collide in ways nobody predicted. Chaos engineering is the discipline of deliberately injecting failures into a system to discover weaknesses before they cause outages.

Circuit Breaker and Resilience Patterns

In a microservice architecture, any downstream dependency can fail. Without resilience patterns, a single slow or failing service can cascade into system-wide failure. Resilience patterns prevent this by failing fast, isolating failures, and recovering gracefully.

Circuit Breaker

The circuit breaker pattern monitors calls to a downstream service and stops making calls when failures reach a threshold. It has three states.

States

Closed (normal operation): All requests pass through. The circuit breaker counts failures. When failures exceed the threshold within a time window, the breaker trips to Open.
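The Closed-state behavior described above can be sketched as a sliding-window failure counter. This is an illustrative sketch under stated assumptions: the class name `ClosedStateBreaker`, the exception `CircuitOpenError`, and the default threshold and window values are hypothetical. It models only the Closed state and the trip to Open, and deliberately omits any recovery logic.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker has tripped."""


class ClosedStateBreaker:
    """Counts failures in a time window and trips to Open past a threshold."""

    def __init__(self, failure_threshold=5, window_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.window_seconds = window_seconds        # assumed default
        self.clock = clock                          # injectable for tests
        self.state = "closed"
        self._failures = deque()  # timestamps of recent failures

    def call(self, fn):
        if self.state == "open":
            # Fail fast: the downstream service is not called at all.
            raise CircuitOpenError("breaker is open; call not attempted")
        try:
            return fn()
        except Exception:
            self._record_failure()
            raise

    def _record_failure(self):
        now = self.clock()
        self._failures.append(now)
        # Drop failures that have fallen out of the time window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        # "Exceed the threshold" is read here as strictly greater than.
        if len(self._failures) > self.failure_threshold:
            self.state = "open"
```

Storing timestamps rather than a plain counter means old failures age out on their own, so a service that fails occasionally over a long period does not trip the breaker.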