PostgreSQL Disaster Recovery#

A DR plan for PostgreSQL has three layers: streaming replication for fast failover, WAL archiving for point-in-time recovery, and a backup tool like pgBackRest for managing retention. Each layer covers a different failure mode – replication for server crashes, WAL archiving for data corruption that replicates, full backups for when everything goes wrong.

Streaming Replication for DR#

Synchronous vs Asynchronous – The Core Tradeoff#

Asynchronous replication is the default. The primary streams WAL to the standby but does not wait for confirmation before committing. This keeps the primary fast, but the standby can be seconds behind. If the primary dies, any transactions committed on the primary but not yet received by the standby are lost.
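You can watch that gap directly on the primary. A minimal sketch, assuming psycopg2 and a role that can read pg_stat_replication (host and credentials below are placeholders) – the reported lag is your data-loss exposure if the primary dies right now:

```python
# Sketch: report standby lag from the primary (assumes psycopg2;
# connection parameters are placeholders).
import psycopg2

conn = psycopg2.connect(host="primary.example.com", dbname="postgres",
                        user="monitor", password="...")
with conn.cursor() as cur:
    # replay_lsn is the last WAL position the standby has applied; the
    # byte difference from the primary's current LSN is the at-risk window.
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
               replay_lag  -- time-based lag, available in PostgreSQL 10+
        FROM pg_stat_replication
    """)
    for name, lag_bytes, lag_interval in cur.fetchall():
        print(f"{name}: {lag_bytes} bytes behind ({lag_interval})")
conn.close()
```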

Cloud Managed Database Disaster Recovery#

Every cloud provider offers managed database DR, but the actual behavior during a failure rarely matches the marketing. The documented failover time is the best case. The real failover time includes detection delay, DNS propagation, and connection draining. This guide covers what actually happens.
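Before trusting the documented numbers, measure them yourself. A rough client-side probe, with a placeholder endpoint, that records the outage window as your application would actually experience it:

```python
# Sketch: probe the database endpoint once a second and log the outage
# window. Host and port are placeholders; assumes TCP reachability is a
# reasonable proxy for availability.
import socket
import time

HOST, PORT = "mydb.example.com", 5432
down_since = None

while True:
    try:
        with socket.create_connection((HOST, PORT), timeout=2):
            if down_since is not None:
                print(f"Recovered after {time.time() - down_since:.1f}s of downtime")
                down_since = None
    except OSError:
        if down_since is None:
            down_since = time.time()
            print("Endpoint unreachable, outage started")
    time.sleep(1)
```

A successful TCP connect is not the same as the database accepting queries, so treat the number this reports as a lower bound on the real outage.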

AWS: RDS and Aurora#

RDS Multi-AZ#

RDS Multi-AZ maintains a synchronous standby in a different availability zone. When the primary fails, RDS flips the DNS CNAME to the standby.
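You can rehearse this instead of waiting for a real failure. A sketch assuming boto3 with rds:RebootDBInstance permission – the instance identifier, region, and endpoint are placeholders – that forces a failover and watches for the DNS flip to become visible:

```python
# Sketch: force an RDS Multi-AZ failover, then watch the endpoint's
# resolution change. Identifiers and endpoint are placeholders.
import socket
import time
import boto3

rds = boto3.client("rds", region_name="us-east-1")
rds.reboot_db_instance(DBInstanceIdentifier="prod-db", ForceFailover=True)

endpoint = "prod-db.abc123.us-east-1.rds.amazonaws.com"  # placeholder
last_ip = None
for _ in range(120):
    try:
        # Note: OS-level resolver caching can delay when the flip is visible.
        ip = socket.gethostbyname(endpoint)
        if ip != last_ip:
            print(f"{time.strftime('%X')} endpoint now resolves to {ip}")
            last_ip = ip
    except OSError:
        pass
    time.sleep(5)
```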

Disaster Recovery Strategy: RPO/RTO-Driven Decision Framework#

Every DR conversation starts with two numbers: RPO and RTO. Recovery Point Objective is how much data you can afford to lose. Recovery Time Objective is how long the business can survive without the system. These numbers drive everything – architecture, tooling, staffing, and cost.

The mistake most teams make is treating DR as a technical problem. It is a business problem with technical solutions. A payment processing system and an internal wiki do not need the same DR tier, and pretending they do either wastes money or leaves critical systems exposed.
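One way to keep that conversation honest is to write the tiering down as an explicit mapping. The thresholds below are illustrative assumptions, not prescriptions – the point is that each system's tier should be a deliberate choice:

```python
# Sketch: an illustrative RPO/RTO-to-pattern mapping. The tier boundaries
# here are assumptions for the example, not recommended values.
def dr_pattern(rpo_seconds: float, rto_seconds: float) -> str:
    if rpo_seconds == 0:
        return "synchronous replication with automated failover"
    if rpo_seconds <= 300 and rto_seconds <= 3600:
        return "asynchronous replication to a warm standby"
    if rto_seconds <= 24 * 3600:
        return "pilot light: restore backups into pre-provisioned infrastructure"
    return "backup and restore on demand"

print(dr_pattern(0, 60))         # e.g. payment processing
print(dr_pattern(3600, 86400))   # e.g. internal wiki
```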

DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees#

A DR runbook is used during the worst moments of an engineer’s career: systems are down, customers are impacted, leadership is asking for updates, and decisions carry real consequences. The runbook must be clear enough that someone running on adrenaline and three hours of sleep can execute it correctly.

This means: short sentences, numbered steps, explicit commands (copy-paste ready), no ambiguity about who does what, and timing estimates for each phase so the incident commander knows if things are taking too long.

Active-Passive vs Active-Active: Decision Framework for Multi-Region Architecture#

The Core Difference#

Active-passive: one region handles all traffic, a second region stands ready to take over. Failover is an event – something triggers it, traffic shifts, and there is a gap between detection and recovery.

Active-active: both regions handle production traffic simultaneously. There is no failover event for regional traffic – if one region fails, the other is already serving users. The complexity is in keeping data consistent across regions, not in switching traffic.
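At the DNS layer, active-passive maps naturally onto failover routing. A sketch using Route 53 via boto3 – the zone ID, health check ID, and addresses are placeholders:

```python
# Sketch: Route 53 failover routing for an active-passive pair.
# Zone ID, health check ID, and IPs are placeholders.
import boto3

r53 = boto3.client("route53")

def failover_record(role: str, ip: str, health_check_id=None):
    record = {
        "Name": "app.example.com.",
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,                    # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

r53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",
    ChangeBatch={"Changes": [
        failover_record("PRIMARY", "198.51.100.10", "hc-primary-id"),
        failover_record("SECONDARY", "203.0.113.10"),
    ]},
)
```

The health check attached to the primary record is what makes this a failover mechanism rather than static DNS – without it, Route 53 has no reason to stop returning the primary.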

Global Load Balancing and Geo-Routing: DNS GSLB, Anycast, and Cloud Provider Configurations#

DNS-Based Global Server Load Balancing#

Global server load balancing (GSLB) directs users to the nearest or healthiest regional deployment. The most common approach is DNS-based: the authoritative DNS server returns different IP addresses depending on the querying client’s location, the health of backend regions, or configured routing policies.

When a user resolves app.example.com, the GSLB-aware DNS server considers the user’s location (inferred from the resolver’s IP or EDNS Client Subnet), the health of each regional endpoint, and the configured routing policy. It returns the IP address of the best region for that user.
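You can observe these routing decisions from the outside. Assuming dnspython, the sketch below (domain and subnets are placeholders) queries a public resolver that honors EDNS Client Subnet while pretending to be clients in different locations:

```python
# Sketch: compare geo-routed answers by varying the EDNS Client Subnet.
# Assumes dnspython; the domain and subnets are placeholders.
import dns.edns
import dns.message
import dns.query

def resolve_as_if_from(subnet: str):
    ecs = dns.edns.ECSOption(subnet, srclen=24)
    q = dns.message.make_query("app.example.com", "A",
                               use_edns=0, options=[ecs])
    resp = dns.query.udp(q, "8.8.8.8", timeout=3)  # 8.8.8.8 honors ECS
    return [r.to_text() for rrset in resp.answer for r in rrset]

print("As an EU client:", resolve_as_if_from("81.2.69.0"))
print("As a US client:", resolve_as_if_from("23.20.0.0"))
```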

DNS Failover Patterns: TTL Tradeoffs, Health Check Design, and Real-World Failover Timing#

DNS Is Not a Load Balancer#

This needs to be said upfront: DNS was designed for name resolution, not traffic management. Using DNS for failover is a pragmatic hack that works well enough for most use cases, but it has fundamental limitations.

DNS responses are cached at multiple levels (recursive resolvers, OS caches, application caches, browser caches). You cannot force a client to re-resolve. You can set a TTL, but clients and resolvers are free to ignore it (and some do). Java applications, for example, cache successful DNS lookups forever by default when a security manager is enabled, unless you explicitly set networkaddress.cache.ttl.
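It is worth checking what TTL your records actually advertise and how a caching resolver counts it down. A small sketch assuming dnspython (the domain is a placeholder):

```python
# Sketch: watch the TTL a caching resolver reports for a record.
# Assumes dnspython; the domain is a placeholder.
import time
import dns.resolver

resolver = dns.resolver.Resolver()
resolver.nameservers = ["8.8.8.8"]

for _ in range(3):
    answer = resolver.resolve("app.example.com", "A")
    # On repeated queries, a caching resolver returns the remaining TTL.
    print(f"TTL={answer.rrset.ttl}s ->", [r.to_text() for r in answer])
    time.sleep(10)
```

Even when the advertised TTL is low, each caching layer listed above can still hold the old answer longer than you expect.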

Database High Availability Patterns#

Every database HA decision starts with two numbers: RPO (Recovery Point Objective – how much data you can afford to lose) and RTO (Recovery Time Objective – how long the database can be unavailable). These numbers dictate the pattern, and each pattern carries specific operational tradeoffs.

Core Concepts#

RPO = 0 means zero data loss. Every committed transaction must survive a failure. This requires synchronous replication, which adds latency to every write.
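Whether you actually have that guarantee is visible from the primary. A quick check, assuming psycopg2 (connection details are placeholders): synchronous_standby_names must be non-empty, and at least one standby must show a sync_state of 'sync' (or 'quorum') – otherwise commits are not waiting on anything.

```python
# Sketch: verify that synchronous replication is actually in effect.
# Assumes psycopg2; connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(host="primary.example.com", dbname="postgres",
                        user="monitor", password="...")
with conn.cursor() as cur:
    cur.execute("SHOW synchronous_standby_names")
    print("synchronous_standby_names =", cur.fetchone()[0])
    # sync_state is 'sync' (or 'quorum') only for standbys commits wait on
    cur.execute("SELECT application_name, sync_state FROM pg_stat_replication")
    for name, state in cur.fetchall():
        print(f"{name}: {state}")
conn.close()
```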