PostgreSQL Disaster Recovery

A DR plan for PostgreSQL has three layers: streaming replication for fast failover, WAL archiving for point-in-time recovery, and a backup tool like pgBackRest for managing retention. Each layer covers a different failure mode – replication for server crashes, WAL archiving for data corruption that replicates, full backups for when everything goes wrong.
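
The WAL-archiving layer is mostly configuration. A minimal sketch, assuming pgBackRest with a stanza named main (the stanza name and repository setup are placeholders for your environment):

# postgresql.conf – ship every completed WAL segment to the pgBackRest repository
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'

# Once the stanza is configured, take a full backup
pgbackrest --stanza=main --type=full backup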

Streaming Replication for DR#

Synchronous vs Asynchronous – The Core Tradeoff#

Asynchronous replication is the default. The primary streams WAL to the standby but does not wait for confirmation before acknowledging commits to clients. This keeps the primary fast, but the standby can be seconds behind. If the primary dies, transactions that committed locally but had not yet reached the standby are lost.
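
How far behind the standby is at any moment is visible on the primary. A quick check, assuming PostgreSQL 10 or later where the lag columns exist:

-- Per-standby replication lag as seen by the primary
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;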

Database Cross-Region Replication Patterns

Cross-region replication exists because regions fail. AWS us-east-1 has had multiple multi-hour outages. If your database runs in a single region, a regional failure takes your application down entirely. Cross-region replication gives you a copy of the data somewhere else so you can recover.

The fundamental problem is physics. Light through fiber between US East and US West takes about 30ms one way. Every replication strategy is a different answer to the question: do you wait for the remote region to confirm it has the data before telling the client the write succeeded?

Cloud Managed Database Disaster Recovery

Every cloud provider offers managed database DR, but the actual behavior during a failure rarely matches the marketing. The documented failover time is the best case. The real failover time includes detection delay, DNS propagation, and connection draining. This guide covers what actually happens.

AWS: RDS and Aurora#

RDS Multi-AZ#

RDS Multi-AZ maintains a synchronous standby in a different availability zone. When the primary fails, RDS flips the DNS CNAME to the standby.
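
Because the flip is DNS-based, clients have to re-resolve and reconnect before they see the new primary, so it is worth rehearsing the path. One way to do that (the instance identifier is a placeholder) is a forced failover via the AWS CLI:

# Reboot with failover – promotes the standby in the other AZ
aws rds reboot-db-instance \
  --db-instance-identifier mydb \
  --force-failover

The time your application takes to recover from this, not the documented figure, is your real RTO.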

Data Consistency in Multi-Region Deployments

When you replicate data across regions, you are forced to choose between consistency, latency, and availability. You cannot have all three. Every multi-region system makes this tradeoff explicitly or, more dangerously, implicitly by ignoring it until production exposes the consequences.

The Fundamental Tension#

Strong consistency means every read sees the most recent write, regardless of which region it comes from. This requires cross-region coordination on every write (30-100ms per round trip). Eventual consistency means reads might see stale data, but replicas converge given enough time – usually milliseconds to seconds, but during partitions it can be minutes.
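
Some systems let you make the choice per query. As one concrete illustration from CockroachDB (covered later in this document; the orders table is hypothetical), a follower read trades a small, bounded amount of staleness for a read served by the nearest replica:

-- Served locally; the result may be a few seconds stale
SELECT * FROM orders AS OF SYSTEM TIME follower_read_timestamp();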

CockroachDB Day-2 Operations

Adding and Removing Nodes#

Adding a node: start a new cockroach process with --join pointing to existing nodes. CockroachDB automatically rebalances ranges to the new node.

cockroach start --insecure --store=node4-data \
  --advertise-addr=node4:26257 \
  --join=node1:26257,node2:26257,node3:26257

Watch rebalancing in the DB Console under Metrics > Replication, or query directly:

SELECT node_id, range_count, lease_count FROM crdb_internal.kv_store_status;

Decommissioning a node moves all range replicas off before shutdown, preventing under-replication:

cockroach node decommission 4 --insecure --host=node1:26257

# Monitor progress
cockroach node status --insecure --host=node1:26257 --decommission

Do not simply kill a node. Without decommissioning, CockroachDB treats it as a failure and waits 5 minutes (the default server.time_until_store_dead) before re-replicating its ranges elsewhere. On Kubernetes with the operator, scale by changing spec.nodes in the CrdbCluster resource.
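
For example, growing an operator-managed cluster from three to four nodes is a one-field change (the resource name cockroachdb is a placeholder):

# Patch the CrdbCluster resource; the operator adds the pod and CockroachDB rebalances onto it
kubectl patch crdbcluster cockroachdb --type merge -p '{"spec":{"nodes":4}}'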

CockroachDB Debugging and Troubleshooting

Node Liveness Issues#

Every node must renew its liveness record every 4.5 seconds. Failure to renew marks the node suspect, then dead, triggering re-replication of its ranges.

cockroach node status --insecure --host=localhost:26257

Look at is_live. If a node shows false, check in order:

Process crashed. Check cockroach-data/logs/ for fatal or panic entries. OOM kills are the most common cause – check dmesg | grep -i oom on the host.

Network partition. The node runs but cannot reach peers. If cockroach node status succeeds locally but fails from other nodes, the problem is network-level (firewalls, security groups, DNS).
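
A quick way to separate the two cases from a healthy peer (hostnames are placeholders):

# Is the suspect node reachable on its RPC/SQL port at all?
nc -zv node4 26257

# Can it answer a trivial query when addressed directly?
cockroach sql --insecure --host=node4:26257 -e 'SELECT 1;'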

CockroachDB Setup and Architecture

Architecture: What CockroachDB Actually Does Under the Hood#

CockroachDB is a distributed SQL database that stores data across multiple nodes while presenting a single logical database to clients. Understanding three concepts is essential before deploying it.

Ranges. All data is stored in key-value pairs, sorted by key. CockroachDB splits this sorted keyspace into contiguous chunks called ranges, each targeting 512 MiB by default. Every SQL table, index, and system table maps to one or more ranges. When a range grows beyond the threshold, it splits automatically.
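
The mapping is easy to inspect. For any table (users here is hypothetical), SHOW RANGES lists the ranges backing it and where their replicas live:

SHOW RANGES FROM TABLE users;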

Database High Availability Patterns

Every database HA decision starts with two numbers: RPO (Recovery Point Objective – how much data you can afford to lose) and RTO (Recovery Time Objective – how long the database can be unavailable). These numbers dictate the pattern, and each pattern carries specific operational tradeoffs.

Core Concepts#

RPO = 0 means zero data loss. Every committed transaction must survive a failure. This requires synchronous replication, which adds latency to every write.

Database Performance Investigation Runbook

When a database is slow, resist the urge to immediately tune configuration parameters. Follow this sequence: identify what is slow, understand why, then fix the specific bottleneck. Most performance problems are caused by missing indexes or a single bad query, not global configuration issues.

Phase 1 – Identify Slow Queries#

The first step is always finding which queries are consuming the most time.

PostgreSQL: pg_stat_statements#

Enable the extension if not already loaded:
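
A typical sequence, as a sketch (the final query assumes PostgreSQL 13 or later, where the timing columns are named total_exec_time and mean_exec_time):

shared_preload_libraries = 'pg_stat_statements'   # postgresql.conf; requires a restart

-- In the target database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top consumers of total execution time
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;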

Database Testing Strategies

Database tests are the tests most teams get wrong. They either skip them entirely (testing with mocks, then discovering schema mismatches in production), or they build a fragile suite sharing a single database where tests interfere with each other. The right approach depends on what you are testing and what tradeoffs you can accept.

Fixtures vs Factories#

Fixtures#

Fixtures are static SQL files loaded before a test suite runs:
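
For example, a hypothetical users fixture:

-- fixtures/users.sql, loaded once before the suite runs
INSERT INTO users (id, email, created_at) VALUES
  (1, 'alice@example.com', '2024-01-01'),
  (2, 'bob@example.com', '2024-01-02');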