---
title: "PostgreSQL Disaster Recovery"
description: "Complete PostgreSQL DR strategy — streaming replication for failover, WAL archiving to object storage, point-in-time recovery with concrete commands, pgBackRest for backup management, cross-region replication with Patroni, and failover testing procedures."
url: https://agent-zone.ai/knowledge/databases/postgres-disaster-recovery/
section: knowledge
date: 2026-02-22
categories: ["databases"]
tags: ["postgresql","disaster-recovery","streaming-replication","wal-archiving","pitr","pgbackrest","patroni","failover","s3","backup"]
skills: ["disaster-recovery-planning","postgres-replication","backup-management","failover-execution"]
tools: ["postgresql","pg_basebackup","pgbackrest","patroni","pg_ctl","psql","aws-cli"]
levels: ["intermediate","advanced"]
word_count: 1011
formats:
  json: https://agent-zone.ai/knowledge/databases/postgres-disaster-recovery/index.json
  html: https://agent-zone.ai/knowledge/databases/postgres-disaster-recovery/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=PostgreSQL+Disaster+Recovery
---


# PostgreSQL Disaster Recovery

A DR plan for PostgreSQL has three layers: streaming replication for fast failover, WAL archiving for point-in-time recovery, and a backup tool like pgBackRest for managing retention. Each layer covers a different failure mode -- replication for server crashes, WAL archiving for data corruption that replicates, full backups for when everything goes wrong.

## Streaming Replication for DR

### Synchronous vs Asynchronous -- The Core Tradeoff

Asynchronous replication is the default. The primary streams WAL to the standby but does not wait for confirmation before committing. This keeps the primary fast, but the standby can be seconds behind. If the primary dies, any transactions committed on the primary but not yet replicated to the standby are lost.

Synchronous replication makes the primary wait until the standby confirms it has written the WAL to disk. Zero data loss, but every write now includes the round-trip latency to the standby.

Same-region sync replication adds 1-3ms per commit. Cross-region sync replication adds 30-80ms per commit. For a workload doing 500 commits/second, cross-region sync replication is usually not viable.

```
# postgresql.conf on primary
# Names refer to each standby's application_name (set in its primary_conninfo)
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on
```

A practical middle ground: use `synchronous_commit = remote_write` instead of `on`. The primary waits for the standby to receive the WAL and write it to the OS cache, but not to flush it to disk. This protects against primary failure but not against simultaneous primary and standby power loss.
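Whichever mode you pick, the primary reports each standby's effective mode, so you can confirm the configuration actually took. A quick check, run on the primary:

```sql
-- sync_state is 'sync', 'potential', 'quorum', or 'async' per standby
SELECT application_name, state, sync_state
FROM pg_stat_replication;
```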

### Initial Seeding with pg_basebackup

The standby needs a full copy of the primary's data directory before it can start receiving WAL.

```bash
pg_basebackup -h primary.db.internal -U replicator \
  -D /var/lib/postgresql/16/main \
  --checkpoint=fast --wal-method=stream -R -P
```

The `-R` flag writes `standby.signal` and connection info into `postgresql.auto.conf` so the standby connects to the primary on startup. For a 500 GB database, a 1 Gbps link tops out around 125 MB/s, so expect the copy to take 70-90 minutes.
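After the standby starts, it should report itself in recovery and show up as a streaming client on the primary. A quick sanity check (the hostnames here are examples):

```bash
# On the standby: returns 't' while it is replaying WAL
psql -h standby.db.internal -U postgres -c "SELECT pg_is_in_recovery();"

# On the primary: the standby should appear with state = 'streaming'
psql -h primary.db.internal -U postgres \
  -c "SELECT client_addr, state FROM pg_stat_replication;"
```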

## WAL Archiving to Object Storage

Streaming replication is only as durable as your replicas. If a bug corrupts data, that corruption replicates. WAL archiving to S3 or GCS gives you an independent timeline you can recover to any point within your retention window.

### Configure Archive Command

```
# postgresql.conf
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
archive_timeout = 60
```

Using pgBackRest's archive-push instead of a raw `aws s3 cp` gives you parallel archiving, compression, checksumming, and repository encryption.
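The counterpart on the restore side is `archive-get`, which recovery uses to fetch archived segments back out of the repository. Assuming the same stanza name:

```
# postgresql.conf on a standby or recovering node
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
```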

### pgBackRest Configuration

```ini
# /etc/pgbackrest/pgbackrest.conf
[main]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432

[global]
repo1-type=s3
repo1-s3-bucket=mycompany-pg-backups
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=14
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<encryption-passphrase>
compress-type=zst
compress-level=3

[global:archive-push]
compress-level=3
```
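Before the first backup can run, the stanza has to be initialized in the repository; `check` then verifies that `archive_command` round-trips a WAL segment to S3 and back:

```bash
sudo -u postgres pgbackrest --stanza=main stanza-create
sudo -u postgres pgbackrest --stanza=main check
```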

### Backup Schedule

```bash
# Full backup weekly (Sunday 2 AM)
0 2 * * 0  pgbackrest --stanza=main --type=full backup
# Differential backup daily (2 AM other days)
0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup
```

A 500 GB database typically produces a 100-120 GB compressed full backup. Differentials are usually 5-15% of full size depending on churn.
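To compare these estimates against reality, inspect what retention is actually holding:

```bash
# Lists every full/diff backup with its size, WAL range, and timestamp
pgbackrest --stanza=main info
```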

## Point-in-Time Recovery (PITR)

This is the scenario: someone ran `DELETE FROM orders` without a WHERE clause at 14:32:17 UTC on Tuesday. You need to recover the database to 14:32:16 UTC.

### Step 1 -- Identify the Target Time

```sql
-- Check the current WAL position before starting
SELECT pg_current_wal_lsn(), now();
```

Talk to the team. Confirm the exact timestamp. Get it wrong and you either lose more data or include the bad transaction.
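If nobody remembers the exact second, commit timestamps are recorded in the WAL itself and `pg_waldump` can narrow it down. The segment name below is a placeholder for whichever segment covers the incident window:

```bash
# List commit records with their timestamps in a candidate segment
pg_waldump --rmgr=Transaction \
  /var/lib/postgresql/16/main/pg_wal/000000010000000A000000FF | grep COMMIT
```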

### Step 2 -- Restore with pgBackRest

```bash
# Stop PostgreSQL
sudo systemctl stop postgresql-16

# Restore to a new data directory (safer than overwriting)
pgbackrest --stanza=main --type=time \
  --target="2026-02-18 14:32:16+00" \
  --target-action=promote \
  --set=20260216-020005F \
  --pg1-path=/var/lib/postgresql/16/recovery \
  restore
```

The `--set` flag picks which full backup to restore from. pgBackRest then replays WAL segments from the archive up to the target time.

### Step 3 -- Verify Before Promoting

Start PostgreSQL against the recovery directory in single-user mode or on a different port. Verify the data looks correct:

```bash
pg_ctl -D /var/lib/postgresql/16/recovery start -o "-p 5433"
psql -p 5433 -d myapp -c "SELECT count(*) FROM orders WHERE status = 'pending';"
```

Once verified, promote it or swap data directories.
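A directory swap, assuming the paths from the restore step, is a minimal sketch like this (ownership and permissions must survive the move):

```bash
# Stop the verification instance running on port 5433
pg_ctl -D /var/lib/postgresql/16/recovery stop

# Keep the damaged directory around until the incident is closed
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.damaged
sudo mv /var/lib/postgresql/16/recovery /var/lib/postgresql/16/main
sudo systemctl start postgresql-16
```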

### Recovery Timing

For a 500 GB database with 48 hours of WAL: restoring the base backup takes 20-40 minutes from S3, replaying WAL takes 10-30 minutes depending on write volume. Total recovery: 30-70 minutes. This is your actual RTO for PITR scenarios.

## Cross-Region DR with Patroni

Patroni manages automated failover within a region. For cross-region DR, run a Patroni standby cluster in the DR region that replicates from the primary cluster.

```yaml
# patroni.yml for DR cluster
bootstrap:
  dcs:
    standby_cluster:
      host: primary-haproxy.us-east-1.internal  # must route to the leader; PgBouncer cannot proxy replication
      port: 5432
      create_replica_methods:
        - basebackup
  pg_hba:
    - host replication replicator 10.0.0.0/8 scram-sha-256
    - host all all 10.0.0.0/8 scram-sha-256

postgresql:
  parameters:
    hot_standby: 'on'
    wal_level: replica
```

The DR cluster runs as a cascading standby. It cannot accept writes. To fail over to the DR region:

```bash
# Remove the standby_cluster configuration from DCS
patronictl -c /etc/patroni/patroni.yml edit-config
# Delete the standby_cluster block, save

# The DR cluster will automatically promote a member to primary
```

This gives you a cross-region RTO of 2-5 minutes (detect failure + operator decision + Patroni promotion). The RPO depends on replication lag, typically 1-10 seconds for async cross-region.
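During the failover, watch the DR cluster until one member takes the Leader role:

```bash
# Shows each member's role, state, and timeline
patronictl -c /etc/patroni/patroni.yml list
```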

## Monitoring Replication Lag

```sql
-- On the primary: check each standby's lag
SELECT client_addr, state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
  replay_lag
FROM pg_stat_replication;
```

Alert if replay lag exceeds your RPO. If your RPO is 30 seconds, alert at 15 seconds so you have time to investigate before it matters.
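When the two LSNs come from different servers (so `pg_wal_lsn_diff` cannot see both in one query), the lag can be computed by hand: an LSN `X/Y` is a 64-bit position, `X` shifted left 32 bits plus `Y`. A minimal bash sketch:

```bash
# Convert a pg_lsn like '1A/2B000000' to an absolute byte position
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( (16#$hi << 32) | 16#$lo ))
}

# Lag in bytes = primary position minus standby replay position
echo $(( $(lsn_to_bytes "0/5000000") - $(lsn_to_bytes "0/3000000") ))  # prints 33554432
```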

## Testing Failover

Untested failover is not failover. Run this quarterly.

```bash
# On the primary: simulate a crash
sudo systemctl stop postgresql-16

# On the standby: promote
pg_ctl promote -D /var/lib/postgresql/16/main

# Verify the standby is accepting writes
psql -c "CREATE TABLE failover_test (id serial); DROP TABLE failover_test;"
```

Time every step. Document the actual RTO. The first time, something will go wrong -- misconfigured pg_hba.conf, stale DNS, an application connection string that does not support failover. Finding these during a drill is the entire point.

### Recovery Scenarios Summary

| Scenario | Method | Typical RTO | RPO |
|---|---|---|---|
| Primary crash, same region | Patroni auto-failover | 10-30 seconds | 0 (sync) or seconds (async) |
| Primary crash, cross-region | Patroni standby cluster promote | 2-5 minutes | 1-10 seconds (async) |
| Data corruption (bad DELETE) | PITR from pgBackRest | 30-70 minutes | To the second |
| Complete region loss | pgBackRest restore from S3 | 30-90 minutes | Minutes (last archived WAL) |
| Backup verification | pgBackRest restore --type=immediate | 20-40 minutes | N/A |

