Kubernetes Cluster Disaster Recovery: etcd Backup, Velero, and GitOps Recovery

Kubernetes Cluster Disaster Recovery

Your cluster will fail. The question is whether you can rebuild it in hours or weeks. Kubernetes DR is not a single tool – it is a layered strategy combining etcd snapshots, resource-level backups, GitOps state, and tested recovery procedures.

The three layers of Kubernetes DR: etcd gives you raw cluster state, Velero gives you portable resource and volume backups, and GitOps gives you declarative rebuild capability. You need at least two of these.

etcd Maintenance for Self-Managed Clusters

etcd is the backing store for all Kubernetes cluster state. Every object – pods, services, secrets, configmaps – lives in etcd. If etcd is unhealthy, your cluster is unhealthy. If etcd data is lost, your cluster is gone. Managed Kubernetes services (EKS, GKE, AKS) handle etcd for you, but self-managed clusters require you to operate it directly.

All etcdctl commands below require TLS flags. Set these as environment variables to avoid repeating them:
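A minimal sketch of that setup, assuming a kubeadm-style layout: the endpoint and the certificate paths under /etc/kubernetes/pki/etcd are assumptions (they are the usual kubeadm defaults) and should be adjusted to wherever your etcd certificates actually live. The variable names themselves are the standard etcdctl environment equivalents of the --endpoints, --cacert, --cert, and --key flags.

```shell
# Use the v3 API (already the default in recent etcdctl releases; harmless to set).
export ETCDCTL_API=3

# Assumed values: a local etcd member with kubeadm's default certificate paths.
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/etc/kubernetes/pki/etcd/ca.crt"
export ETCDCTL_CERT="/etc/kubernetes/pki/etcd/server.crt"
export ETCDCTL_KEY="/etc/kubernetes/pki/etcd/server.key"
```

With these exported in the current shell, a command such as etcdctl endpoint health can be run without repeating the TLS flags on every invocation.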

Upgrading Self-Managed Kubernetes Clusters with kubeadm: Step-by-Step

Upgrading Self-Managed Kubernetes Clusters with kubeadm

Upgrading a kubeadm-managed cluster is a multi-step procedure that must be executed in a precise order. The control plane upgrades first, then worker nodes one at a time. Skipping steps or upgrading in the wrong order causes version skew violations that can break cluster communication.

This article provides the complete operational sequence. Execute each step in order. Do not skip ahead.

Version Skew Policy

Kubernetes defines strict version compatibility rules between components, but it does not stop you from breaking all of them. Violating these rules results in undefined behavior – sometimes things work, sometimes the API server rejects requests, sometimes components silently fail.
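One of these rules can be checked mechanically before an upgrade. The sketch below covers only the kubelet-to-API-server rule: the kubelet must not be newer than kube-apiserver and may lag by a limited number of minor versions. The function names minor_of and kubelet_skew_ok are hypothetical helpers, not part of any Kubernetes tool, and the default limit of 3 minor versions is an assumption that matches recent releases (older releases allowed 2), so verify it against the skew policy for your version. The check also assumes both versions share the same major version.

```shell
# minor_of VERSION: print the minor number of "v1.29.3" or "1.29.3" (prints 29).
minor_of() {
  local v="${1#v}"   # drop an optional leading "v"
  v="${v#*.}"        # drop the major component
  echo "${v%%.*}"    # keep everything up to the next dot
}

# kubelet_skew_ok APISERVER_VERSION KUBELET_VERSION [MAX_SKEW]
# Returns 0 when the kubelet is not newer than the API server and is at most
# MAX_SKEW minor versions behind it (assumed default: 3).
kubelet_skew_ok() {
  local api_minor kubelet_minor
  local max_skew="${3:-3}"
  api_minor="$(minor_of "$1")"
  kubelet_minor="$(minor_of "$2")"
  # Rule 1: the kubelet must never be newer than the API server.
  [ "$kubelet_minor" -le "$api_minor" ] || return 1
  # Rule 2: the kubelet may lag by at most max_skew minor versions.
  [ $((api_minor - kubelet_minor)) -le "$max_skew" ]
}
```

For example, kubelet_skew_ok v1.30.2 v1.28.9 succeeds, while kubelet_skew_ok v1.29.0 v1.30.0 fails because the kubelet would be newer than the API server.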