---
title: "Kubernetes Cluster Disaster Recovery: etcd Backup, Velero, and GitOps Recovery"
date: 2026-02-22
lastmod: 2026-02-22
description: "Complete DR strategy for Kubernetes clusters covering etcd snapshot backup and restore, Velero for namespace and cluster-level recovery, GitOps-based rebuild, and what backup tools cannot capture."
tags: [disaster-recovery, etcd, velero, backup, restore, argocd, gitops, cluster-rebuild]
---

# Kubernetes Cluster Disaster Recovery

Your cluster will fail. The question is whether you can rebuild it in hours or weeks. Kubernetes DR is not a single tool; it is a layered strategy combining etcd snapshots, resource-level backups, GitOps state, and tested recovery procedures.

The three layers of Kubernetes DR: etcd gives you raw cluster state, Velero gives you portable resource and volume backups, and GitOps gives you declarative rebuild capability. You need at least two of these.

## etcd Backup and Restore

etcd holds every Kubernetes object. Losing etcd means losing the entire cluster state: every deployment, service, secret, and ConfigMap.

### Taking a Snapshot

```bash
export ETCDCTL_API=3
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

Automate this with a cron job on every control plane node. Store snapshots off-cluster, in S3, GCS, or an NFS mount. A snapshot sitting on the same disk as etcd is not a backup.

```bash
# Cron: every 6 hours, retain 7 days
0 */6 * * * /usr/local/bin/etcd-backup.sh && find /backup -name "etcd-*.db" -mtime +7 -delete
```

Verify snapshots are valid:

```bash
etcdctl snapshot status /backup/etcd-20260222-020000.db --write-out=table
```

### Restoring from Snapshot

Restoring etcd replaces all cluster state.
This is a full cluster recovery, not a surgical restore.

```bash
# Stop kube-apiserver and etcd on all control plane nodes
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# Restore on each member (different data-dir, different name, different peer URLs)
etcdctl snapshot restore /backup/etcd-20260222-020000.db \
  --data-dir=/var/lib/etcd-restored \
  --name=cp-1 \
  --initial-cluster=cp-1=https://10.0.1.10:2380,cp-2=https://10.0.1.11:2380,cp-3=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Update the etcd manifest to point at the new data-dir, then move manifests back
sed -i 's|/var/lib/etcd|/var/lib/etcd-restored|' /tmp/etcd.yaml
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
```

The critical gotcha: after an etcd restore, the actual state of nodes and pods no longer matches what etcd thinks. Pods that were running are now unknown to the restored etcd. Kubernetes will reconcile (controllers will recreate deployments and pods), but there is a period of disruption while this happens.

## Velero for Namespace and Cluster Backup

Velero operates at the Kubernetes resource level. It exports resources as JSON and backs up persistent volumes via snapshots or file-level copies.
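To see what that resource export looks like on disk, fetch a backup tarball with `velero backup download <name>` and list it. The sketch below fabricates a miniature tarball in Velero's per-resource layout (the directory structure and names here are illustrative assumptions), so it runs without a cluster:

```bash
# Build a tiny stand-in for a downloaded Velero backup tarball.
# Velero groups exported objects roughly as
# resources/<resource>/namespaces/<namespace>/<name>.json
# (the layout shown is an assumption for illustration).
work=$(mktemp -d)
mkdir -p "$work/backup/resources/deployments.apps/namespaces/production"
cat > "$work/backup/resources/deployments.apps/namespaces/production/api.json" <<'EOF'
{"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "api"}}
EOF
tar -czf "$work/cluster-dr-data.tar.gz" -C "$work/backup" .

# Inspect it the same way you would a real backup download
tar -tzf "$work/cluster-dr-data.tar.gz" | grep '\.json$'
```

The point: the export is plain JSON you can list, diff, and sanity-check before you ever run a restore.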
Unlike etcd snapshots, Velero backups are portable: you can restore to a completely different cluster.

### DR-Focused Backup Schedule

```bash
# Full cluster backup every 6 hours
velero schedule create cluster-dr \
  --schedule="0 */6 * * *" \
  --ttl 720h \
  --default-volumes-to-fs-backup

# Critical namespaces every hour
velero schedule create critical-hourly \
  --schedule="0 * * * *" \
  --include-namespaces production,payments,auth \
  --ttl 168h \
  --default-volumes-to-fs-backup
```

### Restore Procedures

Full cluster restore to a new cluster:

```bash
# Install Velero on the new cluster pointing to the same backup bucket
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --secret-file ./credentials-velero \
  --use-node-agent

# Verify backups are visible
velero backup get

# Restore everything
velero restore create full-recovery \
  --from-backup cluster-dr-20260222-060000

# Monitor progress
velero restore describe full-recovery --details
```

Selective namespace restore for partial recovery:

```bash
velero restore create payments-recovery \
  --from-backup cluster-dr-20260222-060000 \
  --include-namespaces payments \
  --restore-volumes=true
```

### What Velero Cannot Back Up

This is where teams get burned in real incidents. Velero captures Kubernetes API objects and PV data. It does not capture:

**Cluster-scoped configuration that is not in API objects.** Node-level configuration, kubelet arguments, CNI configs on disk, and container runtime settings are not Kubernetes resources. Velero does not touch them.

**External state.** External databases, cloud IAM roles, DNS records, TLS certificates in external stores (ACM, Let's Encrypt), and load balancer configurations managed by cloud providers are not in your cluster.

**CRDs and operator ordering.**
Velero restores CRDs, but if you restore a CR before its operator is running, the controller will not reconcile it. You must install operators first, then restore their custom resources.

**Encrypted secrets at rest.** If your cluster uses KMS encryption for secrets, the restored cluster needs the same KMS key access. Otherwise, secrets are unreadable.

## Cluster Rebuild vs Restore: The Tradeoff

Restoring from backup gets you running faster. etcd restore brings the cluster back in minutes. Velero restore to a new cluster takes longer but works cross-infrastructure. The risk: you restore whatever caused the failure in the first place.

Rebuilding from scratch using GitOps is cleaner. You get a known-good cluster with only the intended state. The cost: it takes longer. You need to reinstall operators, wait for CRDs, re-sync applications, and restore data separately.

In practice, the best DR strategy combines both:

1. Rebuild the cluster infrastructure (Terraform or Cluster API)
2. Install core components (CNI, cert-manager, operators) via GitOps
3. Restore stateful data from Velero backups
4. Let ArgoCD sync application workloads

## GitOps as Disaster Recovery

If everything is in Git, rebuilding is straightforward. ArgoCD or Flux watches repositories and converges cluster state to match.

```yaml
# ArgoCD Application that deploys everything
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-bootstrap
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-platform
    path: clusters/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

After installing ArgoCD on a fresh cluster and pointing it at your repo, it recreates every application, namespace, RBAC policy, and configuration. What it cannot recreate: persistent data, runtime state, and anything not committed to Git.

The gap between GitOps and full recovery is your data.
Databases, message queues, uploaded files: these need separate backup and restore procedures.

## DR for Stateful Workloads

Stateful workloads need application-consistent backups, not just volume snapshots. A snapshot of a PostgreSQL data directory taken while the database is writing may be corrupted.

```bash
# CSI VolumeSnapshot for a PVC
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-$(date +%Y%m%d)
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: data-postgres-0
EOF
```

For application consistency, freeze writes before snapshotting. PostgreSQL supports `pg_backup_start()` and `pg_backup_stop()`. For databases managed by operators (CloudNativePG, Percona), use the operator's built-in backup CRDs; they handle the freeze/snapshot/thaw cycle.

## Testing Cluster Recovery

An untested backup is not a backup. Schedule quarterly DR tests:

1. Provision a test cluster in an isolated environment
2. Restore from the latest Velero backup
3. Verify all deployments reach Ready state
4. Run smoke tests against restored applications
5. Validate data integrity in restored databases
6. Measure time-to-recovery and document it

```bash
# Quick validation after restore
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
kubectl get pvc --all-namespaces | grep -v Bound
```

Track your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) against actual test results. If your RTO is 1 hour but your restore takes 3, you need a different strategy, not a better backup schedule.

The most common DR failure is not a bad backup. It is missing credentials, expired certificates, or changed cloud IAM policies that prevent the restore from completing.
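Those failure modes are cheap to detect before an incident. A minimal pre-flight sketch, assuming the snapshot directory and credential paths from earlier sections (the `preflight` function name, the thresholds, and the default paths are all assumptions):

```bash
# Hypothetical pre-restore checks: credentials present, a recent-enough etcd
# snapshot on disk, and the etcd server cert not about to expire.
preflight() {
  local failed=0

  # 1. Velero needs its credentials file to reach the backup bucket
  if [[ ! -f "${CREDENTIALS_FILE:-./credentials-velero}" ]]; then
    echo "FAIL: credentials file missing"
    failed=1
  fi

  # 2. The newest etcd snapshot must fall inside the 6-hour backup window
  if [[ -z "$(find "${BACKUP_DIR:-/backup}" -name 'etcd-*.db' -mmin -360 2>/dev/null)" ]]; then
    echo "FAIL: no etcd snapshot taken within the last 6 hours"
    failed=1
  fi

  # 3. The etcd server cert must outlive the restore (7 days = 604800 seconds)
  local cert="${ETCD_CERT:-/etc/kubernetes/pki/etcd/server.crt}"
  if [[ -f "$cert" ]] && ! openssl x509 -checkend 604800 -noout -in "$cert" >/dev/null; then
    echo "FAIL: etcd server cert expires within 7 days"
    failed=1
  fi

  return "$failed"
}
```

Run it from the environment the restore would actually run in, with the same credentials; a pre-flight that passes from your laptop but not from the recovery environment proves nothing.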
Test the full chain, not just the `velero restore` command.
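Finally, keep the measured RTO honest by timing the drill itself. A sketch of a timing wrapper (the `drill` helper and the one-hour objective are assumptions, not part of any tool):

```bash
# Time a recovery procedure and compare the result to the stated RTO.
RTO_SECONDS=3600  # one-hour objective (assumption)

drill() {
  local start end elapsed
  start=$(date +%s)
  "$@"                      # the restore procedure under test
  end=$(date +%s)
  elapsed=$((end - start))
  if ((elapsed > RTO_SECONDS)); then
    echo "RTO MISSED: ${elapsed}s > ${RTO_SECONDS}s"
    return 1
  fi
  echo "RTO met: ${elapsed}s <= ${RTO_SECONDS}s"
}

# In a real drill this wraps the full procedure, for example:
#   drill ./restore-cluster.sh
drill sleep 1
```

Logging the elapsed time on every quarterly test gives you a trend line, which is what tells you whether your strategy still fits the objective.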