Game Day and Tabletop Exercise Planning

Why Run Exercises#

Runbooks that have never been tested are fiction. Failover procedures that have never been executed are hopes. Game days and tabletop exercises convert assumptions about system resilience into verified facts – or reveal that those assumptions were wrong before a real incident does.

The value is not just finding technical gaps. Exercises expose process gaps: unclear escalation paths, missing permissions, outdated contact lists, communication breakdowns between teams. These are invisible until a simulated failure forces people to actually follow the documented procedure.

PostgreSQL Disaster Recovery

A DR plan for PostgreSQL has three layers: streaming replication for fast failover, WAL archiving for point-in-time recovery, and a backup tool like pgBackRest for managing retention. Each layer covers a different failure mode – replication for server crashes, WAL archiving for data corruption that replicates, full backups for when everything goes wrong.
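
As a quick way to verify the first two layers are actually live (not just configured), here is a minimal sketch, assuming psycopg2 and a superuser connection to the primary; the DSN is hypothetical, and the pgBackRest layer is checked outside SQL:

```python
import psycopg2

conn = psycopg2.connect("dbname=postgres")  # hypothetical DSN
cur = conn.cursor()

# Layer 1: streaming replication -- is at least one standby connected?
cur.execute("SELECT count(*) FROM pg_stat_replication")
standbys = cur.fetchone()[0]

# Layer 2: WAL archiving -- enabled, and not silently failing?
cur.execute("SHOW archive_mode")
archive_mode = cur.fetchone()[0]
cur.execute("SELECT failed_count FROM pg_stat_archiver")
failed_archives = cur.fetchone()[0]

print(f"standbys connected: {standbys}")
print(f"archive_mode: {archive_mode}, failed archive attempts: {failed_archives}")
# Layer 3 (pgBackRest) lives outside SQL, e.g. `pgbackrest --stanza=<name> check`
```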

Streaming Replication for DR#

Synchronous vs Asynchronous – The Core Tradeoff#

Asynchronous replication is the default. The primary streams WAL to the standby but does not wait for confirmation before committing. This keeps the primary fast, but the standby can be seconds behind. If the primary dies, any transactions committed on the primary but not yet received by the standby are lost.
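
That lag is observable, which makes the potential data loss measurable. A minimal sketch, assuming psycopg2 and PostgreSQL 10 or later (the DSN is hypothetical):

```python
import psycopg2

conn = psycopg2.connect("dbname=postgres")  # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    SELECT application_name,
           sync_state,                                         -- 'async' or 'sync'
           pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)
               AS bytes_unflushed,                             -- WAL not yet durable on standby
           replay_lag                                          -- time behind, as an interval
      FROM pg_stat_replication
""")
for name, sync_state, bytes_unflushed, replay_lag in cur.fetchall():
    print(f"{name}: {sync_state}, {bytes_unflushed} bytes unflushed, replay lag {replay_lag}")
```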

Database Cross-Region Replication Patterns

Cross-region replication exists because regions fail. AWS us-east-1 has had multiple multi-hour outages. If your database runs in a single region, a regional failure takes your application down entirely. Cross-region replication gives you a copy of the data somewhere else so you can recover.

The fundamental problem is physics. Light through fiber between US East and US West takes about 30ms one way. Every replication strategy is a different answer to the question: do you wait for the remote region to confirm it has the data before telling the client the write succeeded?
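
The arithmetic is worth doing explicitly. A sketch using the one-way figure above; the local commit cost is an assumption:

```python
# Back-of-the-envelope cost of cross-region synchronous commit.
ONE_WAY_MS = 30        # US East <-> US West, one direction (from the text)
LOCAL_COMMIT_MS = 2    # assumed local WAL flush + commit cost

sync_ms = LOCAL_COMMIT_MS + 2 * ONE_WAY_MS   # wait for the remote ack: pay the round trip
async_ms = LOCAL_COMMIT_MS                   # don't wait: remote region may be behind

print(f"synchronous: ~{sync_ms} ms per commit (RPO ~0)")
print(f"asynchronous: ~{async_ms} ms per commit (RPO = replication lag)")
```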

Cloud Managed Database Disaster Recovery

Every cloud provider offers managed database DR, but the actual behavior during a failure rarely matches the marketing. The documented failover time is the best case. The real failover time includes detection delay, DNS propagation, and connection draining. This guide covers what actually happens.

AWS: RDS and Aurora#

RDS Multi-AZ#

RDS Multi-AZ maintains a synchronous standby in a different availability zone. When the primary fails, RDS flips the DNS CNAME to the standby. AWS documents failover as typically completing in 60 to 120 seconds, but clients holding cached DNS or long-lived connections can see longer.
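
The honest way to learn your real number is to trigger a failover yourself. A game-day sketch, assuming boto3 and psycopg2; the instance identifier, endpoint, and credentials are hypothetical:

```python
import time
import boto3
import psycopg2

DSN = dict(host="mydb.abc123.us-east-1.rds.amazonaws.com",  # hypothetical endpoint
           dbname="postgres", user="app", connect_timeout=3)

def reachable() -> bool:
    try:
        psycopg2.connect(**DSN).close()
        return True
    except psycopg2.OperationalError:
        return False

# Force a Multi-AZ failover on purpose (game-day style).
boto3.client("rds").reboot_db_instance(
    DBInstanceIdentifier="mydb", ForceFailover=True)  # hypothetical instance

while reachable():          # wait for the old primary to actually drop
    time.sleep(1)
start = time.monotonic()
while not reachable():      # wait for the CNAME flip to take effect
    time.sleep(2)
print(f"observed outage: {time.monotonic() - start:.0f}s after connections dropped")
```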

Disaster Recovery Strategy: RPO/RTO-Driven Decision Framework

Every DR conversation starts with two numbers: RPO and RTO. Recovery Point Objective is how much data you can afford to lose. Recovery Time Objective is how long the business can survive without the system. These numbers drive everything – architecture, tooling, staffing, and cost.
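
The mapping from those two numbers to an architecture can be made explicit. A sketch of the decision table this framework implies; the tier names are standard, but the thresholds here are illustrative assumptions, not prescriptions:

```python
def dr_tier(rpo_minutes: float, rto_minutes: float) -> str:
    """Map RPO/RTO targets to a DR tier. Thresholds are illustrative."""
    if rpo_minutes < 1 and rto_minutes < 5:
        return "active-active: multi-region, both serving traffic"
    if rpo_minutes < 15 and rto_minutes < 60:
        return "warm standby: async replica, automated failover"
    if rpo_minutes < 240 and rto_minutes < 480:
        return "pilot light: data replicated, compute provisioned on demand"
    return "backup and restore: cheapest, and hours of loss and downtime"

print(dr_tier(rpo_minutes=5, rto_minutes=30))       # warm standby
print(dr_tier(rpo_minutes=1440, rto_minutes=1440))  # backup and restore
```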

The mistake most teams make is treating DR as a technical problem. It is a business problem with technical solutions. A payment processing system and an internal wiki do not need the same DR tier, and pretending they do either wastes money or leaves critical systems exposed.

Disaster Recovery Testing: From Tabletop Exercises to Full Regional Failover

An untested DR plan is a hope document. Organizations that experience a real disaster and fail to recover almost always had a DR plan on paper. The plan was never tested, so the credentials had expired, the runbook referenced a service that was renamed six months ago, DNS TTLs were longer than assumed, and nobody knew who was supposed to make the failover call.
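
Most of those failures are cheap to check continuously instead of discovering them mid-disaster. A sketch of one such preflight, assuming dnspython; the hostnames and the 60-second assumption are hypothetical:

```python
import dns.resolver

FAILOVER_TTL_ASSUMPTION_S = 60  # what the runbook assumes about cutover speed

for name in ["db.example.com", "api.example.com"]:
    answer = dns.resolver.resolve(name, "A")
    addresses = [r.address for r in answer]
    ttl = answer.rrset.ttl
    print(f"{name}: {addresses}, TTL {ttl}s")
    if ttl > FAILOVER_TTL_ASSUMPTION_S:
        print(f"  WARNING: TTL {ttl}s is longer than the runbook assumes")
```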

DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees

A DR runbook is used during the worst moments of an engineer’s career: systems are down, customers are impacted, leadership is asking for updates, and decisions carry real consequences. The runbook must be clear enough that someone running on adrenaline and three hours of sleep can execute it correctly.

This means: short sentences, numbered steps, explicit commands (copy-paste ready), no ambiguity about who does what, and timing estimates for each phase so the incident commander knows if things are taking too long.
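
One way to enforce that discipline is to keep the runbook steps as data and render the checklist from it, so owner and timing can never be omitted. A minimal sketch with hypothetical commands and owners:

```python
from dataclasses import dataclass

@dataclass
class Step:
    owner: str          # who does this -- no ambiguity
    command: str        # copy-paste ready
    expected_min: int   # so the incident commander knows what "too long" means

RUNBOOK = [
    Step("db-oncall", "pgbackrest --stanza=main check", 2),              # hypothetical stanza
    Step("db-oncall", "pg_ctl promote -D /var/lib/postgresql/data", 1),  # hypothetical path
    Step("net-oncall", "update DNS: db.example.com -> standby IP", 5),   # hypothetical record
]

for i, step in enumerate(RUNBOOK, 1):
    print(f"{i}. [{step.owner}] {step.command}  (~{step.expected_min} min)")
```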

Active-Active Architecture Patterns: Multi-Region, Data Replication, and Split-Brain Resolution

What Active-Active Actually Means#

Active-active means both (or all) regions are serving production traffic simultaneously. Not standing by. Not warmed up and waiting. Actually processing real user requests right now. A user in Frankfurt hits the EU region; a user in Virginia hits the US-East region. Both regions are authoritative. Both can read and write.

This is fundamentally different from active-passive, where the secondary region exists but does not serve traffic until failover. The distinction matters because active-active introduces a class of problems that active-passive avoids entirely – primarily, what happens when two regions modify the same data at the same time.
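
To make that concrete, here is a sketch of last-writer-wins, the simplest (and bluntest) resolution policy. It assumes trustworthy cross-region clocks, which is itself a strong assumption; real systems often need vector clocks or CRDTs instead:

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp_ms: int   # assumes clocks across regions are trustworthy
    region: str

def resolve(a: Write, b: Write) -> Write:
    # Tie-break on region name so both regions converge deterministically.
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))

us = Write("email=old@example.com", 1_700_000_000_000, "us-east-1")
eu = Write("email=new@example.com", 1_700_000_000_005, "eu-central-1")
print(resolve(us, eu).value)  # the EU write wins: later timestamp
```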

Active-Passive vs Active-Active: Decision Framework for Multi-Region Architecture

The Core Difference#

Active-passive: one region handles all traffic, a second region stands ready to take over. Failover is an event – something triggers it, traffic shifts, and there is a gap between detection and recovery.

Active-active: both regions handle production traffic simultaneously. There is no failover event for regional traffic – if one region fails, the other is already serving users. The complexity is in keeping data consistent across regions, not in switching traffic.
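
The "failover is an event" part is easy to underestimate. A toy sketch of an active-passive controller, with hypothetical endpoints and thresholds; real controllers also need fencing and quorum so they do not promote into a split brain:

```python
import time
import urllib.request

PRIMARY_HEALTH = "https://primary.example.com/health"  # hypothetical endpoint
FAILURES_TO_TRIP = 3

def healthy(url: str) -> bool:
    try:
        return urllib.request.urlopen(url, timeout=2).status == 200
    except OSError:
        return False

failures = 0
while True:
    failures = 0 if healthy(PRIMARY_HEALTH) else failures + 1
    if failures >= FAILURES_TO_TRIP:
        print("promoting standby and shifting traffic")  # the failover event
        break
    time.sleep(10)  # the detection gap the text describes lives here
```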

DNS Failover Patterns: TTL Tradeoffs, Health Check Design, and Real-World Failover Timing

DNS Is Not a Load Balancer#

This needs to be said upfront: DNS was designed for name resolution, not traffic management. Using DNS for failover is a pragmatic hack that works well enough for most use cases, but it has fundamental limitations.

DNS responses are cached at multiple levels (recursive resolvers, OS caches, application caches, browser caches). You cannot force a client to re-resolve. You can set a TTL, but clients and resolvers are free to ignore it (and some do). Java applications, for example, cache successful DNS lookups indefinitely when a security manager is installed (and for an implementation-dependent interval otherwise) unless you explicitly set networkaddress.cache.ttl.
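
Given those layers, the realistic cutover time is detection plus cache expiry. A sketch of the worst-case arithmetic; the health-check parameters and TTL are illustrative assumptions:

```python
check_interval_s = 30     # health checker probes every 30 s
failures_to_trip = 3      # consecutive failures before the record is pulled
record_ttl_s = 60         # TTL on the A record

detection_s = check_interval_s * failures_to_trip
worst_case_s = detection_s + record_ttl_s  # for well-behaved resolvers

print(f"detection: up to ~{detection_s}s")
print(f"client cutover: up to ~{worst_case_s}s (resolvers that ignore TTL: unbounded)")
```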