Kubernetes Cluster Disaster Recovery: etcd Backup, Velero, and GitOps Recovery

Kubernetes Cluster Disaster Recovery#

Your cluster will fail. The question is whether you can rebuild it in hours or weeks. Kubernetes DR is not a single tool – it is a layered strategy combining etcd snapshots, resource-level backups, GitOps state, and tested recovery procedures.

The three layers of Kubernetes DR: etcd gives you raw cluster state, Velero gives you portable resource and volume backups, and GitOps gives you declarative rebuild capability. You need at least two of these.

etcd Maintenance for Self-Managed Clusters

etcd Maintenance for Self-Managed Clusters#

etcd is the backing store for all Kubernetes cluster state. Every object – pods, services, secrets, configmaps – lives in etcd. If etcd is unhealthy, your cluster is unhealthy. If etcd data is lost, your cluster is gone. Managed Kubernetes services (EKS, GKE, AKS) handle etcd for you, but self-managed clusters require you to operate it directly.

All etcdctl commands below require TLS flags. Set these as environment variables to avoid repeating them:

Velero Backup and Restore: Disaster Recovery for Kubernetes

Velero Backup and Restore#

Velero backs up Kubernetes resources and persistent volume data to object storage. It handles scheduled backups, on-demand snapshots, and restores to the same or a different cluster. It is the standard tool for Kubernetes disaster recovery.

Velero captures two things: Kubernetes API objects (stored as JSON) and persistent volume data (via cloud volume snapshots or file-level backup with Kopia).

Installation#

You need an object storage bucket (S3, GCS, Azure Blob, or MinIO) and write credentials.