---
title: "DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees"
description: "How to write DR runbooks that work under pressure. Covers step-by-step failover procedures with timing estimates, communication templates, decision trees for failover vs ride-it-out, role assignments, pre-failover checklists, post-failover validation, and failback procedures. Includes a complete runbook template."
url: https://agent-zone.ai/knowledge/infrastructure/dr-runbook-design/
section: knowledge
date: 2026-02-22
categories: ["infrastructure"]
tags: ["disaster-recovery","runbook","failover","failback","incident-management","communication","decision-tree","sre"]
skills: ["runbook-writing","failover-procedure-design","incident-communication","decision-framework"]
tools: ["pagerduty","opsgenie","slack","terraform","aws-cli","kubectl","route53"]
levels: ["intermediate","advanced"]
word_count: 1525
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/dr-runbook-design/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/dr-runbook-design/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=DR+Runbook+Design%3A+Failover+Procedures%2C+Communication+Plans%2C+and+Decision+Trees
---


# DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees

A DR runbook is used during the worst moments of an engineer's career: systems are down, customers are impacted, leadership is asking for updates, and decisions carry real consequences. The runbook must be clear enough that someone running on adrenaline and three hours of sleep can execute it correctly.

This means: short sentences, numbered steps, explicit commands (copy-paste ready), no ambiguity about who does what, and timing estimates for each phase so the incident commander knows if things are taking too long.

## Runbook Design Principles

**Write for the worst case.** Assume the person executing the runbook has never done this before. The primary on-call engineer is on vacation and the backup is handling their first real incident.

**Include exact commands.** Not "fail over the database" but the literal command to execute. Pre-populate hostnames, regions, and cluster names. The only variables should be things that genuinely change per incident.

**Add timing estimates.** Every step gets an expected duration. If step 3 says "15 minutes" and it has been 45 minutes, the operator knows something is wrong and should escalate.

**Test the runbook.** Run through it during a DR test. If any step is unclear, rewrite it immediately after the test while the confusion is fresh.

**Version and review.** Runbooks drift as infrastructure changes. Review quarterly. Stamp each runbook with a last-reviewed date. If the date is older than 6 months, the runbook is suspect.
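The staleness rule above is easy to automate. Below is a minimal sketch that scans runbook text for a `Last reviewed: YYYY-MM-DD` stamp and flags anything older than 6 months (or missing a stamp entirely); the filenames and 180-day threshold are illustrative assumptions, not a real tool.

```python
import re
from datetime import date, timedelta

STALE_AFTER = timedelta(days=180)  # older than ~6 months => suspect

def stale_runbooks(runbooks: dict[str, str], today: date) -> list[str]:
    """Return names of runbooks whose 'Last reviewed:' stamp is missing or stale."""
    flagged = []
    for name, text in runbooks.items():
        m = re.search(r"Last reviewed:\s*(\d{4})-(\d{2})-(\d{2})", text)
        if not m:
            flagged.append(name)  # no stamp at all: treat as stale
            continue
        reviewed = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
        if today - reviewed > STALE_AFTER:
            flagged.append(name)
    return flagged

# Example: one fresh, one stale, one unstamped runbook (all hypothetical).
docs = {
    "db-failover.md": "Last reviewed: 2026-01-10 | Owner: SRE",
    "dns-failover.md": "Last reviewed: 2025-03-01 | Owner: SRE",
    "cache-flush.md": "# no review stamp",
}
print(stale_runbooks(docs, today=date(2026, 2, 22)))
# → ['dns-failover.md', 'cache-flush.md']
```

Wiring something like this into CI keeps the quarterly review from depending on anyone's memory.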

## The Decision Tree: Failover vs Ride It Out

Not every outage requires a DR failover. Failover itself carries risk -- data loss from replication lag, application errors from stale caches, customer confusion from changed endpoints. The decision tree helps the incident commander make the call.

```
Is the primary region completely unavailable?
├── YES → Is the estimated time to recovery < 30 minutes?
│   ├── YES → WAIT. Monitor. Prepare failover but do not execute.
│   └── NO → FAILOVER. Execute DR runbook.
└── NO (degraded but partially available)
    ├── Is the degradation affecting revenue-critical paths?
    │   ├── YES → Is the estimated time to recovery < 15 minutes?
    │   │   ├── YES → WAIT. Mitigate with circuit breakers / feature flags.
    │   │   └── NO → FAILOVER.
    │   └── NO → WAIT. Monitor and mitigate. Do not fail over for non-critical degradation.
    └── Can we shed load or isolate the affected component?
        ├── YES → Shed load. Do not fail over.
        └── NO → Escalate to incident commander for failover decision.
```

Key factors in the decision:
- **Current replication lag:** If the DR database is 10 minutes behind, failover means losing 10 minutes of data. Is that acceptable?
- **Estimated primary recovery time:** Cloud provider status pages, past incident patterns, and direct communication with support.
- **Time of day and traffic volume:** A 3 AM failover with 5% of peak traffic carries less risk than a noon failover at peak.
- **Blast radius of failover:** Does failover affect only this service or does it cascade?
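For drills or IC tooling, the tree can be flattened into code so the decision is reproducible under pressure. This is a sketch of one reading of the tree; the `Assessment` fields and action labels are illustrative assumptions, not part of any real tool.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    primary_down: bool        # primary region completely unavailable?
    eta_recovery_min: float   # estimated time to primary recovery
    revenue_critical: bool    # degradation hits revenue-critical paths?
    can_shed_load: bool       # can we shed load / isolate the component?

def recommend(a: Assessment) -> str:
    """One flattening of the decision tree; returns a recommended action."""
    if a.primary_down:
        # Complete outage: wait only if recovery looks fast.
        return "PREPARE_AND_WAIT" if a.eta_recovery_min < 30 else "FAILOVER"
    # Degraded but partially available.
    if a.revenue_critical:
        # Mitigate with circuit breakers / feature flags if recovery is close.
        return "MITIGATE_AND_WAIT" if a.eta_recovery_min < 15 else "FAILOVER"
    if a.can_shed_load:
        return "SHED_LOAD"
    return "ESCALATE"  # IC makes the call

print(recommend(Assessment(primary_down=True, eta_recovery_min=120,
                           revenue_critical=True, can_shed_load=False)))
# → FAILOVER
```

Encoding the tree does not replace the IC's judgment -- it just makes the default path explicit and auditable.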

## Role Assignments

Define these roles before an incident. During an incident is too late.

**Incident Commander (IC):** Makes the failover/no-failover decision. Owns the timeline. Does not execute technical steps -- their job is coordination, decision-making, and communication with leadership.

**Technical Lead:** Executes the runbook. Reports status to the IC. Escalates if steps fail or take longer than expected. Has production access and credentials ready.

**Communications Lead:** Sends status page updates, customer notifications, and internal stakeholder updates using pre-written templates. Does not wait for technical details -- the first update goes out within 5 minutes of the incident being declared.

**Scribe:** Documents everything with timestamps. Who said what, what commands were run, what the results were. This becomes the post-incident timeline.

For smaller teams, the IC and Communications Lead can be the same person. The Technical Lead and Scribe should never be the same person -- the person executing commands should not also be documenting.

## Communication Plan Templates

Pre-write these templates. Fill in the blanks during the incident. Do not compose messages under pressure.

**Internal stakeholder notification (T+5 minutes):**
```
INCIDENT: [Service name] is experiencing [outage/degraded performance] in [region].
IMPACT: [Customer-facing description of what is broken].
STATUS: Investigating. DR failover is being evaluated.
NEXT UPDATE: In 15 minutes or sooner if status changes.
IC: [Name] | Tech Lead: [Name]
Bridge: [Zoom/Slack channel link]
```

**Customer-facing status page (T+10 minutes):**
```
We are currently experiencing issues with [feature/service].
Some users may experience [specific symptom].
Our team is actively working to resolve the issue.
We will provide updates every 15 minutes.
```

**Failover decision notification (T+N minutes):**
```
DECISION: Initiating DR failover to [DR region].
EXPECTED DATA LOSS: Up to [replication lag] minutes of transactions.
EXPECTED RECOVERY: [estimated minutes] minutes from now.
DO NOT make changes to production infrastructure during failover.
```
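If the Communications Lead sends these from a bot or script, the bracketed blanks can become `$placeholders` filled with `string.Template`. A minimal sketch, assuming the T+5 internal template above; all names and values are illustrative.

```python
from string import Template

# The T+5 internal notification, with the blanks turned into $placeholders.
INTERNAL_T5 = Template(
    "INCIDENT: $service is experiencing $impact_kind in $region.\n"
    "IMPACT: $impact.\n"
    "STATUS: Investigating. DR failover is being evaluated.\n"
    "NEXT UPDATE: In 15 minutes or sooner if status changes.\n"
    "IC: $ic | Tech Lead: $tech_lead\n"
    "Bridge: $bridge"
)

msg = INTERNAL_T5.substitute(
    service="myapp-api",
    impact_kind="degraded performance",
    region="us-east-1",
    impact="Checkout requests are timing out for ~20% of users",
    ic="Dana", tech_lead="Priya",
    bridge="#inc-2026-02-22",
)
print(msg)
```

A useful property of `Template.substitute` here: it raises `KeyError` if any blank is left unfilled, so a half-completed notification cannot go out.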

## Complete DR Runbook Template

```markdown
# DR Runbook: [Service Name] Regional Failover

Last reviewed: 2026-02-15 | Owner: SRE Team | DR Tier: 3

## Scope
This runbook covers complete failover of [service] from us-east-1 (primary)
to us-west-2 (DR) when the primary region is unavailable.

## Pre-Failover Checklist (5 minutes)

- [ ] Incident declared and IC assigned
- [ ] Communications Lead sending first update
- [ ] Verify DR region is healthy:
      aws ecs describe-services --cluster myapp-dr --services myapp-api \
          --region us-west-2 --query 'services[0].runningCount'
      # Expected: >= 2 running tasks
- [ ] Check database replication lag:
      aws rds describe-db-instances --db-instance-identifier myapp-dr-replica \
          --region us-west-2 \
          --query 'DBInstances[0].StatusInfos[?StatusType==`read replication`].Message'
      # Expected: replication lag < 60 seconds
- [ ] Record current replication lag: _____ seconds (this is your data loss)
- [ ] IC confirms: PROCEED WITH FAILOVER? (Y/N)

## Phase 1: Database Failover (10-15 minutes)

Step 1.1: Promote DR read replica to primary (10 min)
    aws rds promote-read-replica \
        --db-instance-identifier myapp-dr-replica \
        --region us-west-2
    # Wait for status to become "available"
    aws rds wait db-instance-available \
        --db-instance-identifier myapp-dr-replica \
        --region us-west-2

Step 1.2: Verify database is writable (1 min)
    psql -h myapp-dr-replica.xxxx.us-west-2.rds.amazonaws.com \
         -U myapp -d myapp -c "CREATE TABLE dr_test (id int); DROP TABLE dr_test;"
    # Expected: both commands succeed

Step 1.3: Update application config to point to new primary (2 min)
    aws ssm put-parameter --name "/myapp/prod/db-host" \
        --value "myapp-dr-replica.xxxx.us-west-2.rds.amazonaws.com" \
        --type String --overwrite --region us-west-2

## Phase 2: Application Failover (5-10 minutes)

Step 2.1: Scale up DR application tier (5 min)
    aws ecs update-service --cluster myapp-dr --service myapp-api \
        --desired-count 10 --region us-west-2
    # Wait for tasks to reach RUNNING
    aws ecs wait services-stable --cluster myapp-dr \
        --services myapp-api --region us-west-2

Step 2.2: Verify application health (2 min)
    curl -s https://dr-internal.myapp.com/health | jq .
    # Expected: {"status": "healthy", "database": "connected"}

## Phase 3: Traffic Cutover (5 minutes)

Step 3.1: Update Route 53 to point to DR region (2 min)
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z1234567890 \
        --change-batch '{
          "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "api.myapp.com",
              "Type": "A",
              "AliasTarget": {
                "HostedZoneId": "Z2222222222",
                "DNSName": "myapp-dr-alb.us-west-2.elb.amazonaws.com",
                "EvaluateTargetHealth": true
              }
            }
          }]
        }'

Step 3.2: Wait for DNS propagation (2-5 min depending on TTL)
    watch -n 5 'dig api.myapp.com +short'
    # Expected: resolves to DR region ALB IP

## Post-Failover Validation (10 minutes)

- [ ] Customer-facing endpoints return 200
- [ ] Login flow works end-to-end
- [ ] Order placement succeeds (test with internal test account)
- [ ] Background jobs are processing in DR region
- [ ] Monitoring dashboards show traffic in DR region
- [ ] Error rate < 1% (check Datadog/CloudWatch)
- [ ] Communications Lead sends "service restored" update

## Timing Summary

| Phase | Expected Duration | Running Total |
|---|---|---|
| Pre-flight checklist | 5 min | 5 min |
| Database failover | 10-15 min | 15-20 min |
| Application failover | 5-10 min | 20-30 min |
| Traffic cutover | 5 min | 25-35 min |
| Post-failover validation | 10 min | 35-45 min |
| **Total RTO** | **35-45 min** | |
```
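The per-phase estimates in the timing summary are what let the operator notice that "step 3 said 15 minutes and it has been 45." A small sketch of that escalation check, assuming phase names and budgets taken from the table above; the function and field names are illustrative, not an existing tool.

```python
# Expected max duration per phase, in minutes, from the timing summary.
PHASES = {
    "pre-flight": 5,
    "database-failover": 15,
    "application-failover": 10,
    "traffic-cutover": 5,
    "validation": 10,
}

def overdue_phases(started_at: dict[str, float], done: set[str],
                   now: float) -> list[str]:
    """Phases that have started, are not done, and have blown their budget.

    started_at maps phase name -> epoch seconds when the phase began;
    now is the current epoch time in seconds.
    """
    late = []
    for phase, budget_min in PHASES.items():
        t0 = started_at.get(phase)
        if t0 is None or phase in done:
            continue
        if (now - t0) / 60 > budget_min:
            late.append(phase)
    return late

# Example: database failover started 16 minutes ago and is still running.
print(overdue_phases({"database-failover": 0.0}, done=set(), now=16 * 60))
# → ['database-failover']
```

The scribe's timestamped log provides exactly the `started_at` data this check needs.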

## Failback Procedures

Failback (returning to the primary region after it recovers) is often harder than failover. The primary region's database is now stale. Data written during the DR period exists only in the DR region.

**Step 1: Verify primary region is stable.** Do not fail back the moment the region comes back. Wait at least 1 hour after the cloud provider declares the incident resolved. Early failback into a still-unstable region creates a second outage.

**Step 2: Re-establish replication.** Set up replication from the current primary (DR region) back to the original primary. This may require a fresh snapshot depending on how long the outage lasted and whether the original primary's data is salvageable.

**Step 3: Wait for replication to catch up.** Monitor lag until it reaches zero. Do not cut over while lag exists.

**Step 4: Schedule a maintenance window.** Unlike the emergency failover, failback should be planned. Pick a low-traffic window. Notify customers in advance.

**Step 5: Execute the reverse of the failover runbook.** Promote the original primary, update application config, redirect traffic, validate.

**Step 6: Re-establish DR replication.** After failback, the DR region becomes a replica again. Verify replication is healthy before closing the incident.

The failback runbook should be a separate document, tested independently. Many teams test failover but never test failback, and then discover the failback process has its own set of problems.
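Step 3 above (wait for replication to catch up, never cut over while lag exists) can be made mechanical. A minimal sketch of a lag gate: `get_lag_seconds` is a placeholder for however you actually measure lag (CloudWatch `ReplicaLag`, `pg_stat_replication`, etc.), and the `sleep` parameter is injected so the loop is testable.

```python
import time

def wait_for_zero_lag(get_lag_seconds, poll_every=30, timeout=3600,
                      sleep=time.sleep):
    """Block until replication lag reaches zero; raise if it never does.

    Returns the number of seconds waited. Raising on timeout forces a
    human decision instead of an indefinite silent wait.
    """
    waited = 0
    while waited <= timeout:
        lag = get_lag_seconds()
        if lag <= 0:
            return waited  # safe to proceed with cutover
        sleep(poll_every)
        waited += poll_every
    raise TimeoutError("replication never caught up; do not cut over")

# Example with simulated lag readings draining to zero.
readings = iter([120, 45, 0])
waited = wait_for_zero_lag(lambda: next(readings), poll_every=30,
                           sleep=lambda s: None)
print(waited)
# → 60
```

Gating the maintenance-window cutover on this check removes the temptation to "just go" while a few seconds of lag remain.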

