---
title: "Kubernetes Disaster Recovery: Runbooks for Common Incidents"
description: "Step-by-step runbooks for node failures, etcd quorum loss, control plane outages, certificate expiry, PVC data loss, bad deployments, and stuck namespaces."
url: https://agent-zone.ai/knowledge/kubernetes/disaster-recovery-runbooks/
section: knowledge
date: 2026-02-21
categories: ["kubernetes"]
tags: ["disaster-recovery","runbooks","incident-response","etcd","certificates","rollback","velero"]
skills: ["incident-response","etcd-recovery","certificate-renewal","deployment-rollback","backup-restore"]
tools: ["kubectl","etcdctl","kubeadm","velero","openssl"]
levels: ["intermediate"]
word_count: 1486
formats:
  json: https://agent-zone.ai/knowledge/kubernetes/disaster-recovery-runbooks/index.json
  html: https://agent-zone.ai/knowledge/kubernetes/disaster-recovery-runbooks/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Kubernetes+Disaster+Recovery%3A+Runbooks+for+Common+Incidents
---


# Kubernetes Disaster Recovery Runbooks

These runbooks cover the incidents you will encounter in production Kubernetes environments. Each follows the same structure: detection, diagnosis, recovery, and prevention. Print these out, bookmark them, put them in your on-call wiki. When the alert fires at 2 AM, you want a checklist, not a tutorial.

## Incident Response Framework

Every incident follows the same cycle:

1. **Detect** -- monitoring alert, user report, or kubectl showing unhealthy state
2. **Assess** -- determine scope and severity. Is it one pod, one node, or the entire cluster?
3. **Contain** -- stop the bleeding. Prevent the issue from spreading
4. **Recover** -- restore normal operation
5. **Post-mortem** -- document what happened, why, and how to prevent it

## Runbook 1: Node Goes NotReady

**Detection:** Node condition changes to `Ready=False`. Pods managed by controllers are rescheduled to other nodes after the eviction timeout (five minutes by default). Monitoring alerts on node status.

**Diagnosis:**

```bash
# Check node status and conditions
kubectl describe node <node-name>
```

Look at the Conditions section:

| Condition | Meaning |
|-----------|---------|
| `Ready=False` | Kubelet is not healthy or not communicating |
| `MemoryPressure=True` | Node is running out of memory |
| `DiskPressure=True` | Node is running out of disk |
| `PIDPressure=True` | Too many processes on the node |
| `NetworkUnavailable=True` | Network plugin not configured |

**If you have SSH access to the node:**

```bash
# Check kubelet
systemctl status kubelet
journalctl -u kubelet --since "15 minutes ago" --no-pager

# Check if the node can reach the API server
curl -k https://<api-server>:6443/healthz

# Check disk space
df -h

# Check memory
free -m
```

**Recovery:**

```bash
# Restart kubelet if it is stuck
systemctl restart kubelet

# If the node is unrecoverable, drain it so its workloads reschedule
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Then remove it from the cluster
kubectl delete node <node-name>
```

On cloud providers with auto-scaling groups (ASG on AWS, VMSS on Azure), the unhealthy node will be replaced automatically. Verify by watching for a new node to join:

```bash
kubectl get nodes -w
```

**Prevention:** Set up node health monitoring. Use the node-problem-detector daemonset to surface kernel issues, container runtime problems, and hardware failures as node conditions.
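A minimal node-problem-detector DaemonSet sketch (the image tag is illustrative; the full manifest in the kubernetes/node-problem-detector repository adds configuration and log mounts):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
        - name: node-problem-detector
          # Tag is illustrative -- pin to a current release
          image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
          securityContext:
            privileged: true    # needed to read kernel logs
          volumeMounts:
            - name: kmsg
              mountPath: /dev/kmsg
              readOnly: true
      volumes:
        - name: kmsg
          hostPath:
            path: /dev/kmsg
```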

## Runbook 2: etcd Cluster Degraded or Quorum Lost

**Detection:** API server returns errors (`connection refused`, `etcdserver: leader changed`). `etcdctl endpoint health` shows unhealthy members.

**Diagnosis:**

```bash
# Check etcd health (run on a control plane node)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

# Check member list (reuse the same TLS flags as above)
etcdctl member list --write-out=table
```

**Single member down (quorum maintained):** A 3-member cluster survives 1 failure. The cluster continues to operate normally. Fix or replace the failed member:

```bash
# Remove the failed member (same TLS flags as above)
etcdctl member remove <member-id>

# Add a replacement member, then start etcd on the new node with
# --initial-cluster-state=existing and an empty data directory
etcdctl member add <new-name> --peer-urls=https://<new-ip>:2380
```

**Quorum lost (2 of 3 down):** The cluster cannot commit writes, and most reads fail too. The API server cannot persist changes: no new pods, no updates. This is a critical incident.
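The quorum arithmetic is worth keeping at hand: an n-member cluster needs a majority (floor(n/2) + 1 members) and therefore tolerates floor((n-1)/2) failures. A one-line sketch:

```shell
# Failures an n-member etcd cluster can survive: floor((n-1)/2)
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

fault_tolerance 3   # prints 1
fault_tolerance 5   # prints 2
```

This is also why even-sized clusters buy nothing: 4 members tolerate the same single failure as 3.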

**Recovery from snapshot:**

```bash
# On a healthy etcd node, restore from snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380 \
  --initial-cluster-token=etcd-cluster-restored \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored

# Stop etcd, replace the data directory, restart
# (kubeadm runs etcd as a static pod: move etcd.yaml out of
# /etc/kubernetes/manifests instead of using systemctl)
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
systemctl start etcd
```

**Prevention:** Take regular etcd snapshots. For production, run 5 members (survives 2 failures). Automate snapshots:

```bash
# Cron job for etcd backup
0 */6 * * * ETCDCTL_API=3 etcdctl snapshot save \
  /backup/etcd-$(date +\%Y\%m\%d-\%H\%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
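Snapshots accumulate, so pair the cron job with a retention sweep. A sketch, where the `/backup` path and 7-day window are assumptions carried over from the cron example:

```shell
# Delete etcd snapshots older than a retention window.
prune_snapshots() {
  local dir="${1:-/backup}" days="${2:-7}"
  [ -d "$dir" ] || return 0    # nothing to do if the directory is absent
  find "$dir" -name 'etcd-*.db' -type f -mtime +"$days" -delete
}

prune_snapshots /backup 7
```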

## Runbook 3: Control Plane Down (Managed Kubernetes)

**Detection:** `kubectl` commands time out. New pods are not scheduled. Existing pods, however, continue to serve traffic.

This is an important detail: Kubernetes is designed so that the data plane survives control plane outages. Pods keep running, services keep routing, containers keep serving. You just cannot make changes.

**Diagnosis:**

```bash
# Check if the API server is reachable
kubectl cluster-info
# If this times out, the control plane is down

# Check cloud provider status
# AWS: https://health.aws.amazon.com
# Azure: https://status.azure.com
# GCP: https://status.cloud.google.com
```

**Recovery:** For managed Kubernetes (EKS, AKS, GKE), the cloud provider is responsible for control plane availability. Open a support ticket with high severity. There is little you can do except wait.

**What to communicate to stakeholders:** "Existing services continue to run normally. We cannot deploy new changes or scale workloads until the control plane recovers. Running applications are not affected."

**Prevention:** For critical workloads, run multi-cluster with failover. Do not put all workloads in a single cluster with a single control plane.

## Runbook 4: Certificate Expiry

**Detection:** Kubelet stops communicating with the API server. `kubectl` returns `x509: certificate has expired or is not yet valid`. Nodes go NotReady.

**Diagnosis:**

```bash
# Check certificate expiry dates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# For kubeadm clusters, check all certificates at once
kubeadm certs check-expiration
```

Output shows each certificate and its expiry date:

```
CERTIFICATE                EXPIRES                  RESIDUAL TIME
admin.conf                 Feb 21, 2027 00:00 UTC   364d
apiserver                  Feb 21, 2027 00:00 UTC   364d
apiserver-etcd-client      Feb 21, 2027 00:00 UTC   364d
apiserver-kubelet-client   Feb 21, 2027 00:00 UTC   364d
```

**Recovery:**

```bash
# Renew all certificates (kubeadm clusters)
kubeadm certs renew all

# Restart control plane components to pick up new certs
systemctl restart kubelet

# If control plane components run as static pods, move the manifests out and back
mkdir -p /tmp/manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/manifests/
# Wait ~20 seconds for the kubelet to stop the pods, then restore them
mv /tmp/manifests/*.yaml /etc/kubernetes/manifests/
```

After renewal, update kubeconfig files on any machine that uses them:

```bash
# Regenerate admin.conf
kubeadm kubeconfig user --client-name=admin --org=system:masters > /etc/kubernetes/admin.conf

# Copy to your user's kubeconfig
cp /etc/kubernetes/admin.conf ~/.kube/config
```

**Prevention:** Set up monitoring that alerts 30 days before expiry. cert-manager can automate certificate rotation. Prometheus can scrape certificate expiry metrics:

```yaml
# Prometheus alerting rule
- alert: KubernetesCertificateExpiringSoon
  expr: |
    apiserver_client_certificate_expiration_seconds_count > 0
    and histogram_quantile(0.01, rate(apiserver_client_certificate_expiration_seconds_bucket[5m])) < 2592000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kubernetes certificate expiring in less than 30 days"
```
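If you do not run Prometheus, a cron-driven check with openssl covers the same ground. A hypothetical helper (assumes GNU `date`; the PKI path matches kubeadm defaults):

```shell
#!/usr/bin/env bash
# Print days until a PEM certificate expires.
days_until_expiry() {
  local end_epoch now_epoch
  end_epoch=$(date -d "$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)" +%s)
  now_epoch=$(date +%s)
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Warn on every certificate in a directory with fewer than 30 days left.
scan_certs() {
  local cert days
  for cert in "$1"/*.crt; do
    [ -f "$cert" ] || continue
    days=$(days_until_expiry "$cert")
    if [ "$days" -lt 30 ]; then
      echo "WARNING: $cert expires in $days days"
    fi
  done
}

scan_certs "${1:-/etc/kubernetes/pki}"
```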

## Runbook 5: PVC Data Loss

**Detection:** Application reports data missing. PVC status shows `Lost` or the PV has been deleted.

**Assessment:** Check the reclaim policy of the storage class:

```bash
kubectl get storageclass -o custom-columns=NAME:.metadata.name,RECLAIM:.reclaimPolicy
```

If `reclaimPolicy` is `Delete`, deleting a PVC also deletes the PV and the underlying storage volume. The data is gone unless you have backups or storage-level snapshots.

**Recovery:**

```bash
# Option 1: Restore from Velero backup
velero restore create --from-backup <backup-name> \
  --include-resources persistentvolumeclaims,persistentvolumes \
  --include-namespaces <namespace>

# Option 2: Restore from cloud provider snapshot
# AWS example: create a new EBS volume from snapshot, then create a PV pointing to it
```

```yaml
# Manual PV creation from existing volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restored-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  awsElasticBlockStore:
    volumeID: vol-0abc123def456789
    fsType: ext4
```

**Prevention:**

```bash
# reclaimPolicy is immutable on an existing StorageClass -- set Retain at
# creation time, and patch already-bound PVs directly:
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

# Schedule Velero backups
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h
```

Use VolumeSnapshots for point-in-time recovery:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: postgres-data
```
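To restore, provision a new PVC from the snapshot via `dataSource` (the storage class name is a placeholder for your CSI class):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
spec:
  storageClassName: <your-csi-storage-class>
  dataSource:
    name: postgres-snapshot        # the VolumeSnapshot to restore from
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```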

## Runbook 6: Deployment Rollback

**Detection:** New deployment is causing errors, increased latency, or crashes.

**Immediate action:**

```bash
# Undo the last rollout
kubectl rollout undo deployment/<name> -n <namespace>

# Verify the rollback is progressing
kubectl rollout status deployment/<name> -n <namespace>
```

**If you need a specific revision:**

```bash
# Check revision history
kubectl rollout history deployment/<name> -n <namespace>

# Rollback to a specific revision
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=3

# Inspect what a specific revision contained
kubectl rollout history deployment/<name> -n <namespace> --revision=3
```

For ArgoCD-managed deployments, rollback means reverting the git commit:

```bash
git revert <bad-commit>
git push
# ArgoCD will sync automatically (or trigger manual sync)
```

**Prevention:** Use progressive delivery (canary or blue-green) so bad deployments affect only a fraction of traffic before full rollout. Set `maxUnavailable: 0` in the deployment strategy so the old pods are not removed until new pods pass readiness checks.
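A sketch of that strategy in a Deployment spec (the name, image, and probe endpoint are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # old pods stay until replacements are Ready
      maxSurge: 1         # roll one extra pod at a time
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2
          readinessProbe:   # gates the rollout
            httpGet:
              path: /healthz
              port: 8080
```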

## Runbook 7: Namespace Stuck in Terminating

**Detection:** `kubectl delete namespace <ns>` hangs. `kubectl get namespace <ns>` shows `Terminating` status indefinitely.

**Diagnosis:** Something in the namespace has a finalizer that cannot be processed.

```bash
# Find all remaining resources
kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -n1 -I{} sh -c 'echo "--- {}:" && kubectl get {} -n <ns> --no-headers 2>/dev/null'
```

**Recovery:** For each stuck resource, remove its finalizers:

```bash
# Find resources with finalizers
kubectl get <resource-type> <name> -n <ns> -o jsonpath='{.metadata.finalizers}'

# Patch to remove finalizers
kubectl patch <resource-type> <name> -n <ns> \
  -p '{"metadata":{"finalizers":null}}' --type=merge
```

If no individual resources are visible but the namespace is still stuck, the namespace itself may have a finalizer blocking it. Patch it directly via the API:

```bash
kubectl get namespace <ns> -o json | \
  jq '.spec.finalizers = []' | \
  kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -
```

**Prevention:** Before deleting a namespace, ensure all CRD controllers that manage resources in that namespace are still running. The most common cause is deleting a CRD operator before deleting the namespace it managed.

## Post-Incident

After every incident, document:

1. **Timeline** -- when was it detected, when was it resolved, total impact duration
2. **Root cause** -- what actually broke and why
3. **Impact** -- which services were affected, how many users impacted
4. **Resolution** -- exact steps taken to recover
5. **Action items** -- what changes will prevent recurrence (with owners and deadlines)

Store runbooks alongside monitoring configuration. The alert that fires should link directly to the runbook that resolves it.

