{"page":{"agent_metadata":{"content_type":"guide","outputs":["dr-replication-setup","wal-archiving-pipeline","pitr-recovery-procedure","failover-runbook"],"prerequisites":["postgresql-basics","postgresql-configuration","streaming-replication-concepts","object-storage-basics"]},"categories":["databases"],"content_plain":"PostgreSQL Disaster Recovery# A DR plan for PostgreSQL has three layers: streaming replication for fast failover, WAL archiving for point-in-time recovery, and a backup tool like pgBackRest for managing retention. Each layer covers a different failure mode \u0026ndash; replication for server crashes, WAL archiving for data corruption that replicates, full backups for when everything goes wrong.\nStreaming Replication for DR# Synchronous vs Asynchronous \u0026ndash; The Core Tradeoff# Asynchronous replication is the default. The primary streams WAL to the standby, but does not wait for confirmation before committing. This means the primary is fast, but the standby can be seconds behind. If the primary dies, those uncommitted-on-standby transactions are lost.\nSynchronous replication makes the primary wait until the standby confirms it has written the WAL to disk. Zero data loss, but every write now includes the round-trip latency to the standby.\nSame-region sync replication adds 1-3ms per commit. Cross-region sync replication adds 30-80ms per commit. For a workload doing 500 commits/second, cross-region sync replication is usually not viable.\n# postgresql.conf on primary synchronous_standby_names = \u0026#39;FIRST 1 (standby1, standby2)\u0026#39; synchronous_commit = onA practical middle ground: use synchronous_commit = remote_write instead of on. The primary waits for the standby to receive the WAL and write it to the OS cache, but not to flush it to disk. 
This protects against primary failure but not against simultaneous primary and standby power loss.\nInitial Seeding with pg_basebackup# The standby needs a full copy of the primary’s data directory before it can start receiving WAL.\npg_basebackup -h primary.db.internal -U replicator \\\n  -D /var/lib/postgresql/16/main \\\n  --checkpoint=fast --wal-method=stream -R -P\nThe -R flag writes standby.signal and connection info into postgresql.auto.conf so the standby connects to the primary on startup. For a 500 GB database, expect this to take 30-60 minutes over a 1 Gbps link.\nWAL Archiving to Object Storage# Streaming replication is only as durable as your replicas. If a bug corrupts data, that corruption replicates. WAL archiving to S3 or GCS gives you an independent timeline you can recover to any point within your retention window.\nConfigure Archive Command# # postgresql.conf\narchive_mode = on\narchive_command = 'pgbackrest --stanza=main archive-push %p'\narchive_timeout = 60\nUsing pgBackRest’s archive-push instead of a raw aws s3 cp gives you parallel archiving, compression, checksumming, and de-duplication.\npgBackRest Configuration# # /etc/pgbackrest/pgbackrest.conf\n[main]\npg1-path=/var/lib/postgresql/16/main\npg1-port=5432\n[global]\nrepo1-type=s3\nrepo1-s3-bucket=mycompany-pg-backups\nrepo1-s3-region=us-east-1\nrepo1-s3-endpoint=s3.amazonaws.com\nrepo1-path=/pgbackrest\nrepo1-retention-full=4\nrepo1-retention-diff=14\nrepo1-cipher-type=aes-256-cbc\nrepo1-cipher-pass=<encryption-passphrase>\ncompress-type=zst\ncompress-level=3\n[global:archive-push]\ncompress-level=3\nBackup Schedule# # Full backup weekly (Sunday 2 AM)\n0 2 * * 0 pgbackrest --stanza=main --type=full backup\n# Differential backup daily (2 AM other days)\n0 2 * * 1-6 pgbackrest --stanza=main --type=diff backup\nA 500 GB database typically produces a 100-120 GB compressed full backup. 
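A rough storage budget for the retention settings above can be sketched with shell arithmetic; the 110 GB full and diffs-at-10%-of-full figures are assumptions drawn from the sizing notes in this guide, not measured values:

```shell
# Hypothetical repo sizing: 4 weekly fulls + 14 daily diffs
full_gb=110                      # assumed compressed full backup size
fulls=4                          # repo1-retention-full=4
diff_gb=$(( full_gb / 10 ))      # assume diffs ~10% of a full
diffs=14                         # repo1-retention-diff=14
echo approx repo size: $(( fulls * full_gb + diffs * diff_gb )) GB
# prints: approx repo size: 594 GB
```

Rerun the arithmetic with your own measured backup sizes before provisioning the bucket.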
Differentials are usually 5-15% of full size depending on churn.\nPoint-in-Time Recovery (PITR)# This is the scenario: someone mistakenly ran DELETE FROM orders WHERE status = 'pending' at 14:32:17 UTC on Tuesday. You need to recover the database to 14:32:16 UTC.\nStep 1 – Identify the Target Time# -- Check the current WAL position before starting\nSELECT pg_current_wal_lsn(), now();\nTalk to the team. Confirm the exact timestamp. Get it wrong and you either lose more data or include the bad transaction.\nStep 2 – Restore with pgBackRest# # Stop PostgreSQL\nsudo systemctl stop postgresql-16\n# Restore to a new data directory (safer than overwriting)\npgbackrest --stanza=main --type=time \\\n  --target=\"2026-02-18 14:32:16+00\" \\\n  --target-action=promote \\\n  --set=20260216-020005F \\\n  --pg1-path=/var/lib/postgresql/16/recovery \\\n  restore\nThe --set flag picks which full backup to restore from. pgBackRest then replays WAL segments from the archive up to the target time.\nStep 3 – Verify Before Promoting# Start PostgreSQL against the recovery directory in single-user mode or on a different port. Verify the data looks correct:\npg_ctl -D /var/lib/postgresql/16/recovery start -o \"-p 5433\"\npsql -p 5433 -d myapp -c \"SELECT count(*) FROM orders WHERE status = 'pending';\"\nOnce verified, promote it or swap data directories.\nRecovery Timing# For a 500 GB database with 48 hours of WAL: restoring the base backup takes 20-40 minutes from S3, replaying WAL takes 10-30 minutes depending on write volume. Total recovery: 30-70 minutes. This is your actual RTO for PITR scenarios.\nCross-Region DR with Patroni# Patroni manages automated failover within a region. 
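Within a region, a quick sanity check of the topology Patroni sees (leader, replicas, state, and lag) is patronictl list; the config path here matches the one used elsewhere in this guide:

```
# Show leader, replicas, and per-member lag for the cluster
patronictl -c /etc/patroni/patroni.yml list
```

Run it before and after any failover or drill so you know the starting topology.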
For cross-region DR, run a Patroni standby cluster in the DR region that replicates from the primary cluster.\n# patroni.yml for DR cluster\nbootstrap:\n  dcs:\n    standby_cluster:\n      host: primary-pgbouncer.us-east-1.internal\n      port: 5432\n      create_replica_methods:\n      - basebackup\n  pg_hba:\n  - host replication replicator 10.0.0.0/8 scram-sha-256\n  - host all all 10.0.0.0/8 scram-sha-256\npostgresql:\n  parameters:\n    hot_standby: 'on'\n    wal_level: replica\nThe DR cluster runs as a cascading standby. It cannot accept writes. To fail over to the DR region:\n# Remove the standby_cluster configuration from DCS\npatronictl -c /etc/patroni/patroni.yml edit-config\n# Delete the standby_cluster block, save\n# The DR cluster will automatically promote a member to primary\nThis gives you a cross-region RTO of 2-5 minutes (detect failure + operator decision + Patroni promotion). The RPO depends on replication lag, typically 1-10 seconds for async cross-region.\nMonitoring Replication Lag# -- On the primary: check each standby's lag\nSELECT client_addr,\n       state,\n       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,\n       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,\n       replay_lag\nFROM pg_stat_replication;\nAlert if replay lag exceeds your RPO. If your RPO is 30 seconds, alert at 15 seconds so you have time to investigate before it matters.\nTesting Failover# Untested failover is not failover. Run this quarterly.\n# On the primary: simulate a crash\nsudo systemctl stop postgresql-16\n# On the standby: promote\npg_ctl promote -D /var/lib/postgresql/16/main\n# Verify the standby is accepting writes\npsql -c \"CREATE TABLE failover_test (id serial); DROP TABLE failover_test;\"\nTime every step. Document the actual RTO. The first time, something will go wrong – misconfigured pg_hba.conf, stale DNS, an application connection string that does not support failover. 
Finding these during a drill is the entire point.\nRecovery Scenarios Summary# Scenario | Method | Typical RTO | RPO\nPrimary crash, same region | Patroni auto-failover | 10-30 seconds | 0 (sync) or seconds (async)\nPrimary crash, cross-region | Patroni standby cluster promote | 2-5 minutes | 1-10 seconds (async)\nData corruption (bad DELETE) | PITR from pgBackRest | 30-70 minutes | To the second\nComplete region loss | pgBackRest restore from S3 | 30-90 minutes | Minutes (last archived WAL)\nBackup verification | pgBackRest restore --type=immediate | 20-40 minutes | N/A ","date":"2026-02-22","description":"Complete PostgreSQL DR strategy — streaming replication for failover, WAL archiving to object storage, point-in-time recovery with concrete commands, pgBackRest for backup management, cross-region replication with Patroni, and failover testing procedures.","lastmod":"2026-02-22","levels":["intermediate","advanced"],"reading_time_minutes":5,"section":"knowledge","skills":["disaster-recovery-planning","postgres-replication","backup-management","failover-execution"],"tags":["postgresql","disaster-recovery","streaming-replication","wal-archiving","pitr","pgbackrest","patroni","failover","s3","backup"],"title":"PostgreSQL Disaster Recovery","tools":["postgresql","pg_basebackup","pgbackrest","patroni","pg_ctl","psql","aws-cli"],"url":"https://agent-zone.ai/knowledge/databases/postgres-disaster-recovery/","word_count":1011}}