Kubernetes Disaster Recovery Runbooks
These runbooks cover the incidents you are most likely to encounter in production Kubernetes environments. Each follows the same structure: detection, diagnosis, recovery, and prevention. Print them out, bookmark them, put them in your on-call wiki. When the alert fires at 2 AM, you want a checklist, not a tutorial.
Incident Response Framework
Every incident follows the same cycle:
- Detect – a monitoring alert, a user report, or kubectl showing an unhealthy state
- Assess – determine scope and severity. Is it one pod, one node, or the entire cluster?
- Contain – stop the bleeding. Prevent the issue from spreading.
- Recover – restore normal operation
- Post-mortem – document what happened, why, and how to prevent it
Runbook 1: Node Goes NotReady
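Detection: The node's Ready condition changes to False, or to Unknown if the kubelet stops reporting; kubectl get nodes shows NotReady in both cases. Pods managed by a controller such as a Deployment are replaced on other nodes once their not-ready toleration expires (300 seconds by default), so expect a delay before rescheduling starts. Monitoring alerts fire on node status.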
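The detection condition above can also be checked by script. What follows is a minimal Python sketch, assuming kubectl is on the PATH and pointed at the affected cluster. The function names not_ready_nodes and fetch_nodes are illustrative and not part of any tool. It flags every node whose Ready condition is anything other than True, which includes Unknown (reported when the kubelet stops posting status).

```python
import json
import subprocess


def not_ready_nodes(node_list: dict) -> list[tuple[str, str]]:
    """Return (node name, Ready status) for every node whose Ready status is not "True".

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    flagged = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        # Assumption: a node that reports no Ready condition is treated as Unknown.
        ready = next(
            (c["status"] for c in conditions if c["type"] == "Ready"),
            "Unknown",
        )
        if ready != "True":
            flagged.append((name, ready))
    return flagged


def fetch_nodes() -> dict:
    """Fetch the node list through kubectl (needs a working kubeconfig)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

To use it against a live cluster, call not_ready_nodes(fetch_nodes()) and page on a non-empty result. Keeping the parsing separate from the kubectl call means the check can be exercised against saved JSON from a past incident without cluster access.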