---
title: "Disaster Recovery Testing: From Tabletop Exercises to Full Regional Failover"
description: "How to validate your DR plan actually works. Covers the four types of DR tests, testing frequency by tier, automation with chaos engineering tools, the most common test failures, post-test reporting, and regulatory requirements for DR testing under SOC2 and PCI-DSS."
url: https://agent-zone.ai/knowledge/infrastructure/disaster-recovery-testing/
section: knowledge
date: 2026-02-22
categories: ["infrastructure"]
tags: ["disaster-recovery","dr-testing","chaos-engineering","tabletop-exercise","failover-testing","soc2","pci-dss","compliance","game-day"]
skills: ["dr-test-planning","failover-execution","chaos-experiment-design","compliance-validation"]
tools: ["chaos-mesh","litmus-chaos","gremlin","aws-fis","terraform","pagerduty","runbook"]
levels: ["intermediate","advanced"]
word_count: 1279
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/disaster-recovery-testing/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/disaster-recovery-testing/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Disaster+Recovery+Testing%3A+From+Tabletop+Exercises+to+Full+Regional+Failover
---


# Disaster Recovery Testing: From Tabletop Exercises to Full Regional Failover

An untested DR plan is a hope document. Every organization that has experienced a real disaster and failed to recover had a DR plan on paper. The plan was never tested, so the credentials had expired, the runbook referenced a service renamed six months earlier, DNS TTLs were longer than assumed, and nobody knew who was supposed to make the failover call.

DR testing is the only way to close the gap between what your plan says and what actually happens.

## Types of DR Tests

### Type 1: Tabletop Walkthrough

The team sits in a room (or video call) and walks through the DR runbook step by step. No systems are touched. The goal is to find gaps in the plan: missing steps, unclear ownership, unstated assumptions.

**How to run one:**
1. The facilitator presents a scenario: "It is 2 AM on a Saturday. AWS us-east-1 is completely unavailable. Walk me through what happens."
2. Each person describes what they would do. The facilitator asks probing questions: "Who makes the decision to failover? How do you know the database is caught up? Where are the credentials for the DR region?"
3. Document every gap, ambiguity, and assumption that surfaces.

**Duration:** 1-2 hours. **Frequency:** Quarterly for all tiers. **Cost:** Engineering time only.

Tabletop exercises are cheap and catch a surprising number of issues. They are the minimum viable DR test.

### Type 2: Component Failover

Test individual components in isolation. Fail over a single database, switch a single service to the DR region, or restore a single backup. This validates that the technical mechanisms work without the risk of a full failover.

**Examples:**
- Promote an RDS read replica to primary, verify application connectivity, then fail back.
- Restore a database backup to a test instance and verify data integrity.
- Switch one microservice to the DR region using weighted DNS routing.

**Duration:** 2-4 hours. **Frequency:** Monthly for Tier 2-4 workloads. **Cost:** Infrastructure costs during test, plus engineering time.
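
Each of these component tests boils down to "trigger the change, then poll until the component reports healthy, and record how long it took." A minimal sketch of that polling step -- the `check` callable is supplied by the caller (for the RDS example, a wrapper around a boto3 `describe_db_instances` call; the function itself is hypothetical, not part of any runbook):

```python
import time

def wait_for_state(check, target, timeout=600, interval=5,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns target; return seconds elapsed.

    check  -- zero-arg callable returning the component's current state,
              e.g. a wrapper around an RDS describe call
    target -- the state that means the failover step finished, e.g. "available"
    """
    start = clock()
    while clock() - start < timeout:
        if check() == target:
            return clock() - start
        sleep(interval)
    raise TimeoutError(f"state never reached {target!r} within {timeout}s")
```

Record the returned elapsed time in the test report -- it is your measured component-level RTO.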

### Type 3: Partial Failover

Fail over a subset of production traffic or a subset of services to the DR region. This tests the interaction between components in the DR environment and validates that services can actually communicate when running in the secondary region.

**Examples:**
- Route 10% of traffic to the DR region via weighted DNS.
- Fail over the entire data tier (database + cache) to DR while keeping application tier in the primary region (tests cross-region latency impact).
- Fail over one complete service including its dependencies.

**Duration:** 4-8 hours. **Frequency:** Quarterly for Tier 3-4 workloads. **Cost:** Significant -- running production traffic through DR infrastructure.
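
The 10% weighted split can be expressed as a Route 53 change batch. A sketch -- the record names, targets, and region identifiers are illustrative, not from any real environment:

```json
{
  "Comment": "DR test: shift 10% of traffic to the DR region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary-us-east-1",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "api-primary.example.com" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "SetIdentifier": "dr-us-west-2",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "api-dr.example.com" }]
      }
    }
  ]
}
```

Note the 60-second TTL on both records -- it bounds how quickly you can shift traffic back if the test goes badly.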

### Type 4: Full Regional Failover

Redirect all production traffic to the DR region. This is the real test. Everything runs from DR for a sustained period (minimum 2 hours, ideally 24 hours).

**Duration:** 8-24+ hours. **Frequency:** Annually for Tier 3-4 workloads. **Cost:** High -- full DR infrastructure at production scale, plus risk of customer impact if something goes wrong.

## Testing Frequency by DR Tier

| DR Tier | Tabletop | Component Failover | Partial Failover | Full Failover |
|---|---|---|---|---|
| Tier 1 (Backup/Restore) | Quarterly | Quarterly (restore test) | N/A | Annually |
| Tier 2 (Warm Standby) | Quarterly | Monthly | Quarterly | Annually |
| Tier 3 (Hot Standby) | Quarterly | Monthly | Quarterly | Semi-annually |
| Tier 4 (Active-Active) | Quarterly | Monthly | Monthly | Quarterly |

## Automating DR Tests with Chaos Engineering Tools

Manual DR tests are expensive and infrequent. Automated chaos experiments let you test DR mechanisms continuously.

**AWS Fault Injection Service (FIS)** -- native AWS chaos testing. Inject AZ failures, stop instances, throttle API calls. Integrates with CloudWatch for automated abort conditions.

```json
{
  "description": "Simulate AZ failure for DR validation",
  "targets": {
    "az-instances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "ALL",
      "filters": [
        { "path": "Placement.AvailabilityZone", "values": ["us-east-1a"] }
      ]
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT30M" },
      "targets": { "Instances": "az-instances" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:...:alarm:dr-test-abort" }
  ]
}
```

**Chaos Mesh (Kubernetes)** -- inject pod failures, network partitions, and IO faults. Schedule recurring experiments:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-db-failover-test
spec:
  schedule: "0 3 * * 1"   # Every Monday at 3 AM
  type: PodChaos
  podChaos:
    action: pod-kill    # pod-kill is instantaneous; no duration field applies
    mode: one
    selector:
      labelSelectors:
        app: postgresql
        role: primary
```

This kills the primary database pod weekly and validates that automatic failover to the replica works. Combine with monitoring alerts to verify that the application recovers within RTO.
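
One way to close that loop is an alert that fires only when recovery takes longer than the RTO, so routine experiments that recover quickly stay silent. A sketch as a Prometheus rule, assuming a standard `up` metric for the application -- the job label and 15-minute threshold are illustrative:

```yaml
groups:
  - name: dr-validation
    rules:
      # Fires only if the application stays down past the RTO,
      # i.e. the weekly pod-kill did NOT recover in time.
      - alert: FailoverExceededRTO
        expr: up{job="app"} == 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Application did not recover within the 15-minute RTO"
```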

## Common DR Test Failures

These are the failures that surface during real DR tests. Expect them.

**Stale credentials.** Service accounts, API keys, or database passwords in the DR region were rotated in production but not in DR. This is the single most common DR test failure. Fix: automate credential sync or use a centralized secret store (Vault, AWS Secrets Manager) that both regions read from.
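
With AWS Secrets Manager, cross-region replication can be declared directly on the secret, so rotations in the primary region propagate automatically. A Terraform sketch -- the secret name and replica region are illustrative:

```hcl
resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/db/password"

  # Rotations applied in the primary region replicate here,
  # so the DR region never reads a stale credential.
  replica {
    region = "us-west-2"
  }
}
```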

**DNS TTL surprises.** You planned for a 60-second failover, but the DNS TTL is 300 seconds and clients cache it. Some clients cache beyond TTL. Java applications with `networkaddress.cache.ttl` set to -1 (cache forever) are a classic. Fix: audit TTLs before the test. Set DR-critical records to 60 seconds or less. Test client behavior explicitly.
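
For the JVM case, the cache is controlled by a JDK security property (set in `$JAVA_HOME/conf/security/java.security`, or at runtime via `java.security.Security.setProperty()`), not an ordinary system property. A sketch of the fix -- the 60-second value is illustrative:

```properties
# Cache successful DNS lookups for 60 seconds instead of indefinitely,
# so clients pick up a DR failover within one TTL window.
networkaddress.cache.ttl=60
```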

**Forgotten dependencies.** The application fails over, but it depends on an internal service (authentication, configuration, feature flags) that has no DR presence. Fix: map all dependencies before the test. Include third-party SaaS dependencies -- your DR plan means nothing if your auth provider is also down.

**Data drift.** The DR database schema is out of date because migrations were applied to production but not to DR. Or the DR environment is missing configuration changes made directly in production. Fix: automate schema replication. Never make manual changes to production without also applying them to DR.

**Capacity limits.** The DR region does not have sufficient capacity. Instance quotas, IP address limits, or auto-scaling group maximums are too low. The region may not even have the instance types you need. Fix: pre-provision capacity or verify quotas quarterly.

**Certificate mismatches.** TLS certificates in DR are expired or issued for the wrong domain. Wildcard certs help, but do not assume -- verify before the test.
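
A pre-test check can compute days to expiry from the `notAfter` field that `ssl.getpeercert()` returns. A minimal stdlib sketch -- the helper names are hypothetical:

```python
import ssl
import socket
from datetime import datetime, timezone

def days_until_expiry(not_after: str) -> int:
    """Days until a cert expires, given ssl.getpeercert()'s notAfter string,
    e.g. "Mar  1 12:00:00 2026 GMT"."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

def fetch_not_after(host: str, port: int = 443) -> str:
    """Fetch the leaf certificate's notAfter field from a live endpoint."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run it against every DR-region endpoint before the test and flag anything under 30 days.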

## Post-Test Report

Every DR test produces a report. This is your compliance evidence and your improvement plan.

**Report template:**
- **Test date and type:** Full regional failover, 2026-01-15
- **Participants:** List of all team members involved
- **Scenario:** Complete loss of us-east-1 at 10:00 UTC
- **Actual RPO achieved:** 3 minutes 22 seconds (target: < 5 minutes)
- **Actual RTO achieved:** 12 minutes 45 seconds (target: < 15 minutes)
- **Issues found:** 4 (2 critical, 1 major, 1 minor)
- **Issue details:** For each issue -- description, impact, remediation, owner, due date
- **Pass/fail:** Conditional pass (RPO/RTO met, but critical issues require remediation)
- **Next test date:** 2026-04-15

## Regulatory Requirements

**SOC 2 (Trust Services Criteria):** CC7.5 requires that the entity tests recovery plan procedures. Annual testing is the minimum expectation. Auditors want evidence: test plans, test results, remediation tracking, and follow-up validation that issues were fixed.

**PCI-DSS (Requirement 12.10):** The incident response plan must be tested at least annually. PCI-DSS 4.0 (whose future-dated requirements became mandatory on March 31, 2025) strengthens this with Requirement 12.10.2: at least once every 12 months, the plan must be reviewed, updated as needed, and tested, including all elements listed in 12.10.1. DR testing is explicitly expected as part of the incident response plan.

**HIPAA:** The Security Rule requires a contingency plan (45 CFR 164.308(a)(7)) that includes a disaster recovery plan and testing-and-revision procedures. No testing frequency is specified, and the testing-and-revision specification is "addressable" -- you must either implement it or document why an alternative measure is reasonable and appropriate.

**ISO 27001:** Control A.17.1.3 requires that business continuity plans be "verified at regular intervals" to ensure they remain valid and effective.

For all frameworks: maintain a test log with dates, participants, results, and remediation actions. Auditors do not care about your plan -- they care about evidence that you tested the plan and fixed what was broken.
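
That test log can be a simple version-controlled file. A sketch of one entry, mirroring the report template earlier in this article -- the IDs, names, and dates are illustrative:

```yaml
- test_id: dr-2026-01-15-full-failover
  date: 2026-01-15
  type: full-regional-failover
  participants: [alice, bob, carol]
  rpo_target: 5m
  rpo_actual: 3m22s
  rto_target: 15m
  rto_actual: 12m45s
  result: conditional-pass
  issues:
    - id: DR-101
      severity: critical
      summary: stale DB credentials in DR region
      owner: alice
      due: 2026-02-15
      remediated: true   # the follow-up evidence auditors ask for
  next_test: 2026-04-15
```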

