DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees

A DR runbook is used during the worst moments of an engineer’s career: systems are down, customers are impacted, leadership is asking for updates, and decisions carry real consequences. The runbook must be clear enough that someone running on adrenaline and three hours of sleep can execute it correctly.

This means: short sentences, numbered steps, explicit commands (copy-paste ready), no ambiguity about who does what, and timing estimates for each phase so the incident commander knows if things are taking too long.
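One way to make these properties checkable is to store each step as structured data. The sketch below is a minimal illustration, not a prescribed format: the `RunbookStep` fields, the `overdue_steps` helper, and the placeholder commands are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class RunbookStep:
    number: int          # explicit ordering, as in a numbered list
    owner: str           # role that executes the step, so there is no ambiguity
    action: str          # one short sentence
    command: str         # copy-paste-ready command (placeholders in this sketch)
    budget_minutes: int  # timing estimate for the incident commander


def overdue_steps(steps, elapsed_minutes):
    """Return steps whose elapsed time exceeds their estimate.

    elapsed_minutes maps step number -> minutes spent so far;
    steps that have not started are treated as 0 minutes.
    """
    return [
        s for s in steps
        if elapsed_minutes.get(s.number, 0) > s.budget_minutes
    ]


# Hypothetical failover phase. The commands are placeholders, not real tooling.
steps = [
    RunbookStep(1, "on-call DBA", "Confirm the primary is unreachable.",
                "<health-check command>", 5),
    RunbookStep(2, "on-call DBA", "Promote the replica.",
                "<promote command>", 10),
    RunbookStep(3, "incident commander", "Post a status update.",
                "<status-page command>", 5),
]

late = overdue_steps(steps, {1: 4, 2: 17})
print([s.number for s in late])
```

With the sample timings, only step 2 is over its estimate, which is the signal the incident commander needs in order to escalate or change course.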

Database Performance Investigation Runbook

When a database is slow, resist the urge to immediately tune configuration parameters. Follow this sequence: identify what is slow, understand why, then fix the specific bottleneck. Most performance problems are caused by missing indexes or a single bad query, not global configuration issues.

Phase 1 – Identify Slow Queries

The first step is always finding which queries are consuming the most time.
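As a rough sketch of what "consuming the most time" means in practice, the snippet below ranks query statistics by cumulative execution time and reports each query's share of the total. The column names follow the `pg_stat_statements` view on PostgreSQL 13 and later (older versions use `total_time` instead of `total_exec_time`). The `rank_by_total_time` helper and the sample rows are hypothetical and exist only for illustration.

```python
# Query one might run against pg_stat_statements (PostgreSQL 13+ column names).
TOP_QUERIES_SQL = """
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
"""


def rank_by_total_time(rows, limit=10):
    """Sort rows by cumulative execution time and add each row's share of the total."""
    total = sum(r["total_exec_time"] for r in rows)
    ranked = sorted(rows, key=lambda r: r["total_exec_time"], reverse=True)
    return [
        {**r, "share": (r["total_exec_time"] / total) if total else 0.0}
        for r in ranked[:limit]
    ]


# Made-up sample rows (times in milliseconds), standing in for real query output.
sample = [
    {"query": "UPDATE sessions SET ...", "calls": 200000, "total_exec_time": 60000.0},
    {"query": "SELECT ... FROM orders WHERE customer_id = $1", "calls": 50000,
     "total_exec_time": 900000.0},
    {"query": "SELECT ... FROM reports ...", "calls": 12, "total_exec_time": 40000.0},
]

ranked = rank_by_total_time(sample)
for row in ranked:
    print(f'{row["share"]:.0%}  calls={row["calls"]}  {row["query"]}')
```

Sorting by total time rather than mean time is deliberate: a moderately slow query that runs tens of thousands of times can cost far more than a very slow query that runs a dozen times.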

PostgreSQL: pg_stat_statements

Enable the extension if not already loaded:

Infrastructure Disaster Recovery with Terraform: State Recovery, Blue-Green Infrastructure, and Rebuild Procedures

Application disaster recovery is well understood: replicate data, fail over traffic, restore from backups. Infrastructure disaster recovery is different: you are recovering the platform that applications run on. If your Terraform state is lost, your VPC is deleted, or an entire region goes down, how do you rebuild?

This article covers the DR patterns specific to Terraform-managed infrastructure: protecting state, recovering from state loss, designing infrastructure for regional failover, and the runbooks that agents and operators need when things go wrong.
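As a small illustration of recovering from state loss, the sketch below picks the most recent snapshot from a set of saved state files. It relies on the `serial` and `lineage` fields that Terraform writes at the top level of a state file. Everything else is an assumption for the example: the `newest_state` helper is hypothetical, and the inline JSON stands in for backups that would normally come from versioned storage.

```python
import json


def newest_state(candidates, lineage):
    """Return the snapshot with the highest serial for the given lineage.

    Snapshots from a different lineage belong to a different state history
    and are ignored rather than compared.
    """
    matching = [s for s in candidates if s.get("lineage") == lineage]
    if not matching:
        raise ValueError(f"no snapshot found for lineage {lineage!r}")
    return max(matching, key=lambda s: s["serial"])


# Made-up, heavily trimmed snapshots; real state files contain far more data.
raw_snapshots = [
    '{"version": 4, "serial": 41, "lineage": "aaaa-1111", "resources": []}',
    '{"version": 4, "serial": 42, "lineage": "aaaa-1111", "resources": []}',
    '{"version": 4, "serial": 97, "lineage": "bbbb-2222", "resources": []}',
]
snapshots = [json.loads(text) for text in raw_snapshots]

best = newest_state(snapshots, "aaaa-1111")
print(best["serial"])
```

Filtering by lineage first matters: the snapshot with serial 97 has the highest number, but it tracks a different state history and would be the wrong file to restore.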