---
title: "Kubernetes Cluster Disaster Recovery: etcd Backup, Velero, and GitOps Recovery"
description: "Complete DR strategy for Kubernetes clusters covering etcd snapshot backup and restore, Velero for namespace and cluster-level recovery, GitOps-based rebuild, and what backup tools cannot capture."
url: https://agent-zone.ai/knowledge/kubernetes/kubernetes-disaster-recovery/
section: knowledge
date: 2026-02-22
categories: ["kubernetes"]
tags: ["disaster-recovery","etcd","velero","backup","restore","argocd","gitops","cluster-rebuild"]
skills: ["etcd-backup-restore","velero-dr","gitops-recovery","cluster-rebuild-planning"]
tools: ["etcdctl","velero","kubectl","argocd","helm"]
levels: ["intermediate","advanced"]
word_count: 1100
formats:
  json: https://agent-zone.ai/knowledge/kubernetes/kubernetes-disaster-recovery/index.json
  html: https://agent-zone.ai/knowledge/kubernetes/kubernetes-disaster-recovery/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Kubernetes+Cluster+Disaster+Recovery%3A+etcd+Backup%2C+Velero%2C+and+GitOps+Recovery
---


# Kubernetes Cluster Disaster Recovery

Your cluster will fail. The question is whether you can rebuild it in hours or weeks. Kubernetes DR is not a single tool -- it is a layered strategy combining etcd snapshots, resource-level backups, GitOps state, and tested recovery procedures.

The three layers of Kubernetes DR: etcd gives you raw cluster state, Velero gives you portable resource and volume backups, and GitOps gives you declarative rebuild capability. You need at least two of these.

## etcd Backup and Restore

etcd holds every Kubernetes object. Losing etcd means losing the entire cluster state -- every deployment, service, secret, and configmap.

### Taking a Snapshot

```bash
export ETCDCTL_API=3
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

Automate this with a cron job on every control plane node. Store snapshots off-cluster -- in S3, GCS, or an NFS mount. A snapshot sitting on the same disk as etcd is not a backup.

```bash
# Cron: every 6 hours, retain 7 days
0 */6 * * * /usr/local/bin/etcd-backup.sh && find /backup -name "etcd-*.db" -mtime +7 -delete
```
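The `etcd-backup.sh` referenced by the cron entry is yours to write. A minimal sketch, assuming kubeadm's default certificate paths and an S3 bucket named `etcd-snapshots` (both are assumptions to adapt):

```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/etcd-backup.sh. Certificate paths are kubeadm
# defaults; the S3 bucket name is an assumption.
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/backup}"

snapshot_path() {
  echo "${BACKUP_DIR}/etcd-$(date +%Y%m%d-%H%M%S).db"
}

take_snapshot() {
  local snap
  snap="$(snapshot_path)"
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key

  # Fail loudly on a corrupt snapshot before shipping it anywhere.
  etcdctl snapshot status "$snap" -w table

  # Off-cluster copy: a snapshot on the same disk as etcd is not a backup.
  aws s3 cp "$snap" "s3://etcd-snapshots/$(hostname)/"
}

# Only take a snapshot on nodes that actually have etcdctl installed
# (i.e. control plane nodes); sourcing this file elsewhere is harmless.
if command -v etcdctl >/dev/null 2>&1; then
  take_snapshot
fi
```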

Verify snapshots are valid:

```bash
etcdctl snapshot status /backup/etcd-20260222-020000.db -w table
```

### Restoring from Snapshot

Restoring etcd replaces all cluster state. This is a full cluster recovery -- not a surgical restore.

```bash
# Stop kube-apiserver and etcd on all control plane nodes
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# Restore on each member (different data-dir, different name, different peer URLs)
etcdctl snapshot restore /backup/etcd-20260222-020000.db \
  --data-dir=/var/lib/etcd-restored \
  --name=cp-1 \
  --initial-cluster=cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Update etcd manifest to point to new data-dir, then move manifests back
sed -i 's|/var/lib/etcd|/var/lib/etcd-restored|' /tmp/etcd.yaml
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
```

The critical gotcha: after an etcd restore, the actual state of nodes and pods no longer matches what etcd thinks. Pods that were running are now unknown to the restored etcd. Kubernetes will reconcile -- controllers will recreate deployments and pods -- but there is a period of disruption while this happens.
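That reconciliation window is worth gating on rather than guessing at. A small polling helper can hold the rest of the recovery runbook until the control plane actually converges; this is a generic sketch, and the kubectl checks in the comments are assumed usage, not prescribed commands:

```bash
# Retry a shell condition until it passes or a deadline expires.
wait_for() {
  local cond="$1" timeout="${2:-300}"
  local deadline=$(( $(date +%s) + timeout ))
  until eval "$cond"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 5
  done
}

# Example usage after moving the manifests back (assumes kubectl is configured):
#   wait_for 'kubectl get --raw /readyz >/dev/null 2>&1' 600
#   wait_for '! kubectl get nodes --no-headers | grep -vq " Ready "' 600
```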

## Velero for Namespace and Cluster Backup

Velero operates at the Kubernetes resource level. It exports resources as JSON and backs up persistent volumes via snapshots or file-level copies. Unlike etcd snapshots, Velero backups are portable -- you can restore to a completely different cluster.

### DR-Focused Backup Schedule

```bash
# Full cluster backup every 6 hours
velero schedule create cluster-dr \
  --schedule="0 */6 * * *" \
  --ttl 720h \
  --default-volumes-to-fs-backup

# Critical namespaces every hour
velero schedule create critical-hourly \
  --schedule="0 * * * *" \
  --include-namespaces production,payments,auth \
  --ttl 168h \
  --default-volumes-to-fs-backup
```

### Restore Procedures

Full cluster restore to a new cluster:

```bash
# Install Velero on the new cluster pointing to the same backup bucket
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --secret-file ./credentials-velero \
  --use-node-agent

# Verify backups are visible
velero backup get

# Restore everything
velero restore create full-recovery \
  --from-backup cluster-dr-20260222060000

# Monitor progress
velero restore describe full-recovery --details
```
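Restores run asynchronously, so automation should poll for a terminal phase rather than assume completion. A sketch (assumes `jq` is installed; the phase names are Velero's own):

```bash
# Poll a Velero restore until it reaches a terminal phase.
wait_for_restore() {
  local name="$1" phase
  while true; do
    phase=$(velero restore get "$name" -o json | jq -r '.status.phase')
    case "$phase" in
      Completed)
        return 0 ;;
      Failed|PartiallyFailed)
        echo "restore ${name} ended in phase ${phase}" >&2
        return 1 ;;
      *)
        sleep 10 ;;
    esac
  done
}

# Usage: wait_for_restore full-recovery
```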

Selective namespace restore for partial recovery:

```bash
velero restore create payments-recovery \
  --from-backup cluster-dr-20260222060000 \
  --include-namespaces payments \
  --restore-volumes=true
```

## What Velero Cannot Back Up

This is where teams get burned in real incidents. Velero captures Kubernetes API objects and PV data. It does not capture:

**Cluster-scoped configuration that is not in API objects.** Node-level configurations, kubelet arguments, CNI configs on disk, and container runtime settings are not Kubernetes resources. Velero does not touch them.

**External state.** External databases, cloud IAM roles, DNS records, TLS certificates in external stores (ACM, Let's Encrypt), and load balancer configurations managed by cloud providers are not in your cluster.

**CRDs and operator ordering.** Velero restores CRDs, but if you restore a CR before its operator is running, the controller will not reconcile it. You must install operators first, then restore their custom resources.

**Encrypted secrets at rest.** If your cluster uses KMS encryption for secrets, the restored cluster needs the same KMS key access. Otherwise, secrets are unreadable.
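For the operator-ordering problem in particular, one approach is to drive the restore in phases from the same backup. A sketch, assuming operators live in an `operators` namespace (the namespace and restore names are assumptions to adapt):

```bash
# Phased restore: CRDs, then operators, then everything else.
restore_in_order() {
  local backup="$1"

  # 1. CRDs first, so the API server accepts the custom types.
  velero restore create "${backup}-crds" \
    --from-backup "$backup" \
    --include-resources customresourcedefinitions.apiextensions.k8s.io

  # 2. The operators themselves, then wait until they are actually running.
  velero restore create "${backup}-operators" \
    --from-backup "$backup" \
    --include-namespaces operators
  kubectl wait --for=condition=Available deployment --all \
    -n operators --timeout=300s

  # 3. Everything else, now that controllers exist to reconcile it.
  velero restore create "${backup}-rest" \
    --from-backup "$backup" \
    --exclude-namespaces operators
}

# Usage: restore_in_order cluster-dr-20260222060000
```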

## Cluster Rebuild vs Restore: The Tradeoff

**Restore from backup** gets you running faster. etcd restore brings the cluster back in minutes. Velero restore to a new cluster takes longer but works cross-infrastructure. The risk: you restore whatever caused the failure in the first place.

**Rebuild from scratch** using GitOps is cleaner. You get a known-good cluster with only the intended state. The cost: it takes longer. You need to reinstall operators, wait for CRDs, re-sync applications, and restore data separately.

In practice, the best DR strategy combines both:

1. Rebuild the cluster infrastructure (Terraform or Cluster API)
2. Install core components (CNI, cert-manager, operators) via GitOps
3. Restore stateful data from Velero backups
4. Let ArgoCD sync application workloads
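As a sketch, the four steps might be glued together like this; every path, namespace, and directory layout below is an assumption to adapt:

```bash
# End-to-end rebuild: infrastructure, core components, data, then GitOps sync.
rebuild_cluster() {
  local backup="$1"

  # 1. Recreate the cluster infrastructure (assumed Terraform layout).
  terraform -chdir=infra/production apply -auto-approve

  # 2. Core components from Git (assumed kustomize bootstrap directory).
  kubectl apply -k bootstrap/

  # 3. Stateful data only -- manifests come from Git, not from the backup.
  velero restore create "${backup}-data" \
    --from-backup "$backup" \
    --include-resources persistentvolumes,persistentvolumeclaims

  # 4. Nothing to run here: ArgoCD, installed in step 2, syncs apps from Git.
}
```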

## GitOps as Disaster Recovery

If everything is in Git, rebuilding is straightforward. ArgoCD or Flux watches repositories and converges cluster state to match.

```yaml
# ArgoCD Application that deploys everything
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-platform
    path: clusters/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Once ArgoCD is installed on a fresh cluster and pointed at your repo, it recreates every application, namespace, RBAC policy, and configuration. What it cannot recreate: persistent data, runtime state, and anything not committed to Git.
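A sketch of that bootstrap sequence (the install manifest URL is ArgoCD's upstream stable one; the file name for the Application manifest above is an assumption):

```bash
# Bring up ArgoCD on a fresh cluster and hand it the bootstrap Application.
bootstrap_argocd() {
  kubectl create namespace argocd
  kubectl apply -n argocd \
    -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
  kubectl wait --for=condition=Available deployment/argocd-repo-server \
    -n argocd --timeout=300s
  kubectl apply -f cluster-bootstrap.yaml  # the Application manifest above
}
```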

The gap between GitOps and full recovery is your data. Databases, message queues, uploaded files -- these need separate backup and restore procedures.

## DR for Stateful Workloads

Stateful workloads need application-consistent backups, not just volume snapshots. A snapshot of a PostgreSQL data directory taken while the database is writing is crash-consistent at best: recovery depends on WAL replay, and if the snapshot misses the WAL segments it needs, it may be unrecoverable.

```bash
# CSI VolumeSnapshot for a PVC
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-$(date +%Y%m%d)
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: data-postgres-0
EOF
```

For application consistency, freeze writes before snapshotting. PostgreSQL provides `pg_backup_start()` and `pg_backup_stop()` (since PostgreSQL 15, both must be called from the same session). For databases managed by operators (CloudNativePG, Percona), use the operator's built-in backup CRDs -- they handle the freeze/snapshot/thaw cycle.
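With CloudNativePG, for example, a backup is itself a custom resource and the operator coordinates consistency. A minimal sketch (the cluster name and namespace are assumptions):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: postgres-dr-backup
  namespace: production
spec:
  cluster:
    name: postgres
```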

## Testing Cluster Recovery

An untested backup is not a backup. Schedule quarterly DR tests:

1. Provision a test cluster in an isolated environment
2. Restore from the latest Velero backup
3. Verify all deployments reach Ready state
4. Run smoke tests against restored applications
5. Validate data integrity in restored databases
6. Measure time-to-recovery and document it

```bash
# Quick validation after restore
kubectl get pods --all-namespaces --no-headers | grep -v Running | grep -v Completed
kubectl get pvc --all-namespaces --no-headers | grep -v Bound
```

Track your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) against actual test results. If your RTO is 1 hour but your restore takes 3, you need a different strategy -- not a better backup schedule.
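Measuring time-to-recovery is easy to script: wrap the drill in a timer so every test produces a number instead of an estimate. A tiny sketch (the drill script name is hypothetical):

```bash
# Time an arbitrary command and report elapsed seconds on stdout.
measure_rto() {
  local start elapsed
  start=$(date +%s)
  "$@"
  elapsed=$(( $(date +%s) - start ))
  echo "recovery took ${elapsed}s" >&2
  echo "$elapsed"
}

# Example (hypothetical drill script):
#   rto=$(measure_rto ./run-dr-drill.sh)
#   [ "$rto" -le 3600 ] || echo "RTO objective missed" >&2
```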

The most common DR failure is not a bad backup. It is missing credentials, expired certificates, or changed cloud IAM policies that prevent the restore from completing. Test the full chain, not just the `velero restore` command.

