etcd Maintenance for Self-Managed Clusters

February 22, 2026

Kubernetes

Etcd-Backup-Restore, Etcd-Health-Monitoring, Etcd-Member-Management, Etcd-Disaster-Recovery

Etcd, Backup, Restore, Compaction, Defragmentation, Disaster-Recovery

Etcdctl, Kubectl

etcd Maintenance for Self-Managed Clusters

etcd is the backing store for all Kubernetes cluster state. Every object – pods, services, secrets, configmaps – lives in etcd. If etcd is unhealthy, your cluster is unhealthy. If etcd data is lost, your cluster is gone. Managed Kubernetes services (EKS, GKE, AKS) handle etcd for you, but self-managed clusters require you to operate it directly.

All etcdctl commands below require TLS flags. Set these as environment variables to avoid repeating them:
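A minimal sketch of that setup, assuming a kubeadm-provisioned control-plane node where the etcd certificates live under /etc/kubernetes/pki/etcd and etcd listens on the default local client port. The paths and endpoint are assumptions; adjust them to match your cluster:

```shell
# Assumed kubeadm default locations; other installers place certificates elsewhere.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
```

With these exported in the current shell, a command such as `etcdctl endpoint health` picks up the endpoint and TLS settings without extra flags.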

Securing etcd: Encryption at Rest, TLS, and Access Control

February 22, 2026

Security

Intermediate

Etcd-Security-Hardening, Encryption-at-Rest, Certificate-Management, Backup-Security

Etcd, Encryption, Tls, Secrets, Backup-Security

Kubectl, Etcdctl, Kubeadm, Openssl

Securing etcd

etcd is the single most critical component in a Kubernetes cluster. It stores everything: pod specs, secrets, configmaps, RBAC rules, service account tokens, and all cluster state. By default, Kubernetes stores secrets in etcd unencrypted; the base64 encoding you see in a Secret manifest is an encoding, not encryption. Anyone with read access to etcd has read access to every secret in the cluster. Securing etcd is not optional.

Why etcd Is the Crown Jewel

Run this against an unencrypted etcd and you will see why:
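One way to do that is sketched below with a hypothetical helper, `dump_secret`. It assumes `etcdctl` is already configured with TLS credentials and that the cluster uses the default `/registry` key prefix:

```shell
# dump_secret NAMESPACE NAME: hypothetical helper, not part of etcdctl.
# Reads the raw stored value of a Secret from etcd and prints it as a hex dump.
dump_secret() {
  secret_namespace="$1"
  secret_name="$2"
  etcdctl get "/registry/secrets/${secret_namespace}/${secret_name}" | hexdump -C
}
```

Calling `dump_secret default db-password` (an illustrative secret name) against an unencrypted etcd shows the secret's contents readable in the right-hand column of the dump. With encryption at rest enabled, the stored value instead begins with a `k8s:enc:` prefix followed by ciphertext.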

Upgrading Self-Managed Kubernetes Clusters with kubeadm: Step-by-Step

February 22, 2026

Kubernetes

Intermediate

Cluster-Upgrade-Execution, Etcd-Backup-Restore, Node-Drain, Version-Management, Rollback-Planning

Kubeadm, Upgrade, Self-Managed, Etcd, Backup, Rollback, Version-Skew, Control-Plane, Worker-Nodes, Drain

Kubeadm, Kubectl, Etcdctl, Systemctl, Apt-Get, Crictl

Upgrading Self-Managed Kubernetes Clusters with kubeadm

Upgrading a kubeadm-managed cluster is a multi-step procedure that must be executed in a precise order. The control plane upgrades first, then worker nodes one at a time. Skipping steps or upgrading in the wrong order causes version skew violations that can break cluster communication.

This article provides the complete operational sequence. Execute each step in order. Do not skip ahead.

Version Skew Policy

Kubernetes enforces strict version compatibility rules between components. Violating these rules results in undefined behavior – sometimes things work, sometimes the API server rejects requests, sometimes components silently fail.
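As a small illustration, the sketch below encodes just one well-known constraint: kubeadm upgrades are supported only within a minor release or to the next minor release, never skipping one. The helper name `can_upgrade` is hypothetical, and the check deliberately ignores the per-component skew rules:

```shell
# can_upgrade CURRENT TARGET: hypothetical pre-flight helper.
# Succeeds only if TARGET has the same major version as CURRENT and is the
# same minor release or exactly one minor release ahead.
# Versions are MAJOR.MINOR or MAJOR.MINOR.PATCH, with an optional leading "v".
can_upgrade() {
  current="${1#v}"
  target="${2#v}"
  cur_major="${current%%.*}"
  tgt_major="${target%%.*}"
  cur_rest="${current#*.}"
  tgt_rest="${target#*.}"
  cur_minor="${cur_rest%%.*}"
  tgt_minor="${tgt_rest%%.*}"
  [ "$cur_major" -eq "$tgt_major" ] || return 1
  step=$((tgt_minor - cur_minor))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}
```

For example, `can_upgrade 1.30.4 v1.31.0` succeeds, while `can_upgrade 1.29.0 1.31.0` fails because it would skip the 1.30 release.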

Choosing a Kubernetes Backup Strategy: Velero vs Native Snapshots vs Application-Level Backups

February 21, 2026

Infrastructure

Intermediate, Advanced

Backup-Strategy, Disaster-Recovery-Planning, Data-Protection

Velero, Backup, Disaster-Recovery, Volume-Snapshots, Etcd, Kubernetes, Persistent-Volumes

Velero, Etcdctl, Kubectl, Pg_dump, Mysqldump, Mongodump

Choosing a Kubernetes Backup Strategy

Kubernetes clusters contain two fundamentally different types of state: cluster state (the Kubernetes objects themselves – Deployments, Services, ConfigMaps, Secrets, CRDs) and application data (the contents of Persistent Volumes). A complete backup strategy must address both. Most backup failures happen because teams back up one but not the other, or because they never test the restore process.

What Needs Backing Up

Before choosing tools, inventory what your cluster contains:

Kubernetes Disaster Recovery: Runbooks for Common Incidents

February 21, 2026

Kubernetes

Intermediate

Incident-Response, Etcd-Recovery, Certificate-Renewal, Deployment-Rollback, Backup-Restore

Disaster-Recovery, Runbooks, Incident-Response, Etcd, Certificates, Rollback, Velero

Kubectl, Etcdctl, Kubeadm, Velero, Openssl

Kubernetes Disaster Recovery Runbooks

These runbooks cover the incidents you will encounter in production Kubernetes environments. Each follows the same structure: detection, diagnosis, recovery, and prevention. Print these out, bookmark them, put them in your on-call wiki. When the alert fires at 2 AM, you want a checklist, not a tutorial.

Incident Response Framework

Every incident follows the same cycle:

1. Detect – monitoring alert, user report, or kubectl showing unhealthy state.
2. Assess – determine scope and severity. Is it one pod, one node, or the entire cluster?
3. Contain – stop the bleeding. Prevent the issue from spreading.
4. Recover – restore normal operation.
5. Post-mortem – document what happened, why, and how to prevent it.

Runbook 1: Node Goes NotReady

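Detection: The node's Ready condition changes to False or Unknown. Pods managed by a controller such as a Deployment are rescheduled onto other nodes once the eviction timeout passes (five minutes by default). Monitoring alerts on node status.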
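For a quick manual check alongside monitoring, the following sketch filters the output of `kubectl get nodes --no-headers` down to the nodes that are not Ready. The helper name `not_ready_nodes` is hypothetical, and it assumes the default column layout in which the node name is the first column and STATUS is the second:

```shell
# not_ready_nodes: hypothetical filter, reads `kubectl get nodes --no-headers`
# output on stdin and prints the names of nodes whose STATUS does not start
# with "Ready" (so "Ready,SchedulingDisabled" is not reported).
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Intended use on a machine with cluster access:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Reading from standard input keeps the filter usable against saved output as well as a live cluster. Cordoned but healthy nodes are skipped on purpose, since draining a node for maintenance is not an incident.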