---
title: "Cloud Managed Database Disaster Recovery"
description: "Disaster recovery options for cloud managed databases — RDS Multi-AZ, Aurora Global Database, Cloud SQL HA and cross-region replicas, Azure SQL geo-replication, Cosmos DB multi-region writes, DynamoDB Global Tables — with real failover timings, cost comparisons, and automation decisions."
url: https://agent-zone.ai/knowledge/databases/cloud-managed-database-dr/
section: knowledge
date: 2026-02-22
categories: ["databases"]
tags: ["disaster-recovery","rds","aurora","cloud-sql","azure-sql","cosmos-db","dynamodb","multi-az","cross-region","failover","aws","gcp","azure"]
skills: ["cloud-database-architecture","disaster-recovery-planning","cost-optimization","failover-management"]
tools: ["aws-cli","gcloud","az-cli","terraform"]
levels: ["intermediate","advanced"]
word_count: 984
formats:
  json: https://agent-zone.ai/knowledge/databases/cloud-managed-database-dr/index.json
  html: https://agent-zone.ai/knowledge/databases/cloud-managed-database-dr/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Cloud+Managed+Database+Disaster+Recovery
---


# Cloud Managed Database Disaster Recovery

Every cloud provider offers managed database DR, but the actual behavior during a failure rarely matches the marketing. The documented failover time is the best case. The real failover time includes detection delay, DNS propagation, and connection draining. This guide covers what actually happens.

## AWS: RDS and Aurora

### RDS Multi-AZ

RDS Multi-AZ maintains a synchronous standby in a different availability zone. When the primary fails, RDS flips the DNS CNAME to the standby.

**Documented failover time:** 60-120 seconds. **Actual failover time:** 60-180 seconds. The variance comes from DNS caching (the 5-second TTL may be ignored by connection pools), failure detection delay (5-30 seconds), and crash recovery on the standby.
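Those components stack, so it helps to budget the worst case explicitly instead of trusting the headline number. A rough sketch of that budget (the per-component ranges below are planning assumptions, not AWS figures):

```python
# Illustrative worst-case RTO budget for RDS Multi-AZ failover.
# The per-component ranges are assumptions for planning, not AWS SLAs.
COMPONENTS = {
    "failure_detection": (5, 30),        # seconds until RDS declares the primary dead
    "dns_flip_and_ttl": (5, 60),         # CNAME update plus stale client-side caching
    "standby_crash_recovery": (10, 90),  # redo/rollback work on the promoted standby
}

def rto_budget(components):
    """Return (best_case, worst_case) total failover time in seconds."""
    best = sum(lo for lo, hi in components.values())
    worst = sum(hi for lo, hi in components.values())
    return best, worst

best, worst = rto_budget(COMPONENTS)
print(f"RTO budget: {best}-{worst}s")  # RTO budget: 20-180s
```

The point of the exercise is that the worst case is the sum of worst cases, which is why observed failovers run well past the documented window.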

Multi-AZ does not protect against region failure. Both AZs are in the same region. **Cost:** 2x the instance cost. The standby cannot serve reads.

### RDS Cross-Region Read Replicas

For cross-region DR, create a read replica in another region. This uses asynchronous replication, so there is a data loss window. Failover is manual -- you must promote the replica yourself:

```bash
aws rds promote-read-replica --db-instance-identifier myapp-dr-west --region us-west-2
```

Promotion takes 5-15 minutes. Your application needs a new connection string pointing to the promoted instance's endpoint. Total real RTO: 10-25 minutes including human decision time.

**Cost:** Full instance cost for the replica plus cross-region data transfer ($0.02/GB). A busy 500 GB database can generate roughly $2,600/month in transfer costs alone -- that figure implies about 130 TB of replication traffic per month, so measure your actual WAL volume before budgeting.
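The transfer line item is simple arithmetic once you know your replication volume. A sketch of the estimate, assuming the $0.02/GB rate above (verify current pricing for your region pair):

```python
# Estimate cross-region replication transfer cost.
# The rate and monthly volume are assumptions; check current AWS
# pricing and your own measured WAL/binlog volume.
def monthly_transfer_cost(gb_per_month: float, rate_per_gb: float = 0.02) -> float:
    return gb_per_month * rate_per_gb

# ~130 TB/month of replication traffic at $0.02/GB
print(monthly_transfer_cost(130_000))  # 2600.0
```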

### Aurora Global Database

Aurora Global Database replicates an entire Aurora cluster to up to five secondary regions using dedicated replication infrastructure outside of the database engine.

**Documented replication lag:** Under 1 second typically.
**Actual replication lag:** 100-500ms under normal load. Can spike to 5-10 seconds during heavy write bursts or during Aurora storage scaling events.
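Because lag can spike an order of magnitude past the typical figure, alert on measured lag against your RPO budget rather than assuming sub-second replication. A minimal sketch of the check (the samples here are simulated; in production, read the `AuroraGlobalDBReplicationLag` CloudWatch metric instead):

```python
# Evaluate replication-lag samples (ms) against an RPO budget.
# Threshold and samples are illustrative assumptions.
def lag_breaches(samples_ms, rpo_budget_ms=1000):
    """Return the samples that would exceed the data-loss window on failover."""
    return [s for s in samples_ms if s > rpo_budget_ms]

samples = [120, 340, 250, 6200, 410]  # normal traffic with one write-burst spike
print(lag_breaches(samples))  # [6200]
```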

**Managed failover** (planned): Aurora supports managed planned failover where it promotes a secondary region and demotes the old primary. This takes 1-3 minutes and involves a brief global write outage.

**Unplanned failover** (detach and promote): If the primary region is unreachable, you detach the secondary cluster and promote it. This takes 1-2 minutes for the promotion, but the decision to trigger it is on you.

```bash
# Detach and promote the secondary region (unplanned failover).
# --allow-data-loss acknowledges that un-replicated writes will be lost.
aws rds failover-global-cluster \
  --global-cluster-identifier myapp-global \
  --target-db-cluster-identifier arn:aws:rds:us-west-2:123456789:cluster:myapp-west \
  --allow-data-loss
```

**Cost:** Full Aurora cluster cost in each region. Aurora storage replication is included in the service. A db.r6g.2xlarge Aurora cluster costs roughly $1,400/month per region. Two regions = $2,800/month minimum for compute alone.

## GCP: Cloud SQL

### Cloud SQL HA

Cloud SQL HA uses a regional instance with a standby in a different zone within the same region. Failover is automatic.

**Documented failover time:** Under 60 seconds for most instance sizes.
**Actual failover time:** 30-120 seconds. Smaller instances fail over faster. The failover includes an IP address reassignment (not DNS), which eliminates the DNS propagation problem that plagues RDS.

### Cloud SQL Cross-Region Replicas

Cloud SQL cross-region replicas use asynchronous replication, so there is a data loss window here too. Promotion is manual:

```bash
gcloud sql instances promote-replica myapp-dr-west --project=my-project
```

**Actual promotion time:** 5-10 minutes. After promotion, the old primary and the new primary are completely independent -- no automatic reconfiguration.

**Cost:** Full instance cost in the DR region plus cross-region egress at $0.08-0.12/GB.

## Azure: SQL Database and Cosmos DB

### Azure SQL Geo-Replication

Azure SQL Database supports active geo-replication to up to four secondary regions. Each secondary is readable. **Failover groups** add a listener abstraction -- a single read-write endpoint and a read-only endpoint that automatically update DNS on failover.

```bash
az sql failover-group create \
  --name myapp-fg \
  --server myapp-primary-eastus \
  --resource-group myapp-rg \
  --partner-server myapp-dr-westus \
  --partner-resource-group myapp-dr-rg \
  --failover-policy Automatic \
  --grace-period 1
```

The grace period (in hours for the CLI) prevents flapping on transient failures. The default and the minimum are both 1 hour, so automatic failover cannot react faster than that; for a tighter RTO, trigger the failover manually.

**Actual failover time with failover groups:** 30-60 seconds for the database promotion plus the grace period, plus 30-60 seconds for DNS propagation.
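The grace period dominates that total, so it is worth budgeting explicitly. A sketch of the downtime budget with illustrative component values (a 3,600-second grace period here):

```python
# Total downtime budget for an Azure SQL failover group with automatic policy:
# grace period + database promotion + DNS propagation.
# Component values are planning assumptions, not Azure guarantees.
def failover_group_rto_s(grace_period_s: int,
                         promotion_s: int = 60,
                         dns_s: int = 60) -> int:
    return grace_period_s + promotion_s + dns_s

print(failover_group_rto_s(grace_period_s=3600))  # 3720
```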

### Cosmos DB Multi-Region Writes

Cosmos DB supports multi-region writes, where every region accepts writes simultaneously. Conflicts are resolved with last-write-wins by default, or with a custom merge stored procedure.

**Failover time:** Near zero -- all regions already accept writes. If a region becomes unreachable, clients redirect via SDK retry logic (10-30 seconds).

**Cost:** Multi-region writes roughly double your RU cost. A 10,000 RU/s container in two regions costs approximately $1,170/month.
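The doubling falls straight out of provisioning the same RU/s in each region. A sketch of the arithmetic (the per-100-RU/s hourly rate and hours-per-month figure are assumptions from public list pricing; verify current rates):

```python
# Estimate monthly cost of provisioned Cosmos DB throughput across regions.
# Rate and hours are assumed list-pricing figures; check current Azure pricing.
def cosmos_monthly_cost(ru_per_s: int, regions: int,
                        rate_per_100ru_hour: float = 0.008,
                        hours: int = 730) -> float:
    return (ru_per_s / 100) * rate_per_100ru_hour * hours * regions

print(cosmos_monthly_cost(10_000, regions=2))  # ~1168
```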

## AWS: DynamoDB Global Tables

DynamoDB Global Tables replicate tables across regions with multi-region writes. Conflict resolution is last-write-wins.

```bash
aws dynamodb update-table --table-name Orders \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]'
```

**Replication lag:** Typically under 1 second. DynamoDB publishes a `ReplicationLatency` CloudWatch metric per region pair.

**Failover:** There is no "failover" because all regions accept writes. If us-east-1 fails, your application in us-west-2 keeps working. You need to route traffic to the healthy region, but the database itself does not need any promotion.
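With Global Tables, the failover logic lives in your application's routing layer rather than the database. A minimal sketch of preference-ordered region selection (the health-check mechanism is left abstract and assumed to exist, e.g. a canary read per region):

```python
# Pick the first healthy region from a preference-ordered list.
# Health status would come from your own checks; here it is passed in directly.
def pick_region(preferred, healthy):
    for region in preferred:
        if healthy.get(region):
            return region
    raise RuntimeError("no healthy region available")

regions = ["us-east-1", "us-west-2"]
print(pick_region(regions, {"us-east-1": False, "us-west-2": True}))  # us-west-2
```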

**Cost:** Replicated write capacity is charged at 1.625x the standard rate. A table doing 1,000 WCU costs $467/month in one region and $759/month replicated to a second region.
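Those numbers follow directly from the 1.625x multiplier applied to the single-region write bill. A sketch (the base monthly figure is taken from the estimate above; note that actual Global Tables bills also depend on how writes are counted per replica region):

```python
# Replicated-write cost using DynamoDB's published rWCU-to-WCU price ratio.
# The base monthly figure is an assumed input, not a computed price.
RWCU_MULTIPLIER = 1.625

def replicated_write_cost(base_monthly: float) -> float:
    return base_monthly * RWCU_MULTIPLIER

print(round(replicated_write_cost(467)))  # 759
```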

## Cost Comparison Summary

| Service | Single-Region HA | Cross-Region DR | Monthly Cost Premium |
|---|---|---|---|
| RDS Multi-AZ | 2x instance | + replica + transfer | 2x-2.5x base |
| Aurora Global DB | Included | + full cluster per region | 2x-3x base |
| Cloud SQL HA | ~2x instance | + replica + egress | 2x-2.5x base |
| Azure SQL + FG | Included in tier | + secondary DTUs | 1.5x-2x base |
| Cosmos DB multi-write | N/A (serverless) | + RUs per region | 2x RU cost |
| DynamoDB Global Tables | N/A (serverless) | 1.625x WCU | 1.6x write cost |

## Automated vs Manual Failover Decisions

Automated failover sounds better, but it introduces the risk of split-brain: both regions think they are primary. Every managed service handles this differently, and not all of them handle it safely.

**Automate failover when:** You have a single-writer architecture, the service guarantees fencing of the old primary (Aurora Global, Azure SQL Failover Groups), and your RPO tolerance exceeds the typical replication lag.

**Keep failover manual when:** You have application-level state to coordinate (cache invalidation, queue draining), unpredictable replication lag, or the cost of a false positive exceeds a few extra minutes of downtime. Most teams start manual and automate only after doing it manually at least three times.
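The criteria above can be encoded as a checklist. A sketch (the inputs mirror the conditions named above, simplified to booleans and numbers; the thresholds are illustrative):

```python
# Encode the automate-vs-manual failover decision as an explicit checklist.
# Inputs mirror the criteria in the text; thresholds are illustrative assumptions.
def should_automate_failover(single_writer: bool,
                             old_primary_fenced: bool,
                             rpo_tolerance_s: float,
                             typical_lag_s: float,
                             app_state_to_coordinate: bool,
                             manual_failovers_performed: int) -> bool:
    return (single_writer
            and old_primary_fenced
            and rpo_tolerance_s > typical_lag_s
            and not app_state_to_coordinate
            and manual_failovers_performed >= 3)

# Aurora Global-style setup, practiced manually three times: automate.
print(should_automate_failover(True, True, 10.0, 0.5, False, 3))  # True
```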

