etcd Maintenance for Self-Managed Clusters#

etcd is the backing store for all Kubernetes cluster state. Every object – pods, services, secrets, configmaps – lives in etcd. If etcd is unhealthy, your cluster is unhealthy. If etcd data is lost, your cluster is gone. Managed Kubernetes services (EKS, GKE, AKS) handle etcd for you, but self-managed clusters require you to operate it directly.

All etcdctl commands below require TLS flags. Set these as environment variables to avoid repeating them:
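
A minimal sketch, assuming a kubeadm-style control-plane node where the etcd PKI lives under /etc/kubernetes/pki/etcd and etcd listens on localhost (adjust the paths and endpoint for your installation):

  export ETCDCTL_API=3
  export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
  export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
  export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
  export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

  # With these exported, etcdctl picks up the TLS material automatically
  etcdctl endpoint health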

Securing etcd#

etcd is the single most critical component in a Kubernetes cluster. It stores everything: pod specs, secrets, configmaps, RBAC rules, service account tokens, and all cluster state. By default, Kubernetes Secrets are stored in etcd unencrypted; the base64 encoding you see in API output is an encoding, not encryption. Anyone with read access to etcd has read access to every secret in the cluster. Securing etcd is not optional.

Why etcd Is the Crown Jewel#

Run this against an unencrypted etcd and you will see why:
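
For example (a sketch: demo-secret and hunter2 are throwaway placeholders, and the etcdctl TLS environment variables are assumed to be set as described earlier):

  # Create a secret through the API, then read it straight out of etcd
  kubectl create secret generic demo-secret --from-literal=password=hunter2

  # Secrets are stored under /registry/secrets/<namespace>/<name>
  etcdctl get /registry/secrets/default/demo-secret --print-value-only | strings | grep hunter2

  # Clean up
  kubectl delete secret demo-secret

The password comes back in cleartext. Anything that can read that key range can do the same for every secret in the cluster.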

Security Hardening a Kubernetes Cluster#

This operational sequence takes a default Kubernetes cluster and locks it down. Phases are ordered by impact and dependency: assessment first, then RBAC, pod security, networking, images, auditing, and finally data protection. Each phase includes the commands, policy YAML, and verification steps.

Do not skip the assessment phase. You need to know what you are fixing before you start fixing it.


Phase 1 – Assessment#

Before changing anything, establish a baseline. This phase produces a prioritized list of findings that drives the order of remediation in later phases.
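
A reasonable starting point looks something like this (a sketch, not an exhaustive audit; kube-bench is one widely used CIS benchmark scanner and must be installed separately):

  # CIS benchmark scan of control-plane and node configuration
  kube-bench run --targets master,node

  # What can unauthenticated users and the default service account actually do?
  kubectl auth can-i --list --as=system:anonymous
  kubectl auth can-i --list --as=system:serviceaccount:default:default

  # Find privileged containers across all namespaces
  kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" privileged="}{.spec.containers[*].securityContext.privileged}{"\n"}{end}' | grep privileged=true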

Upgrading Kubernetes Clusters Safely#

Kubernetes releases a new minor version roughly every four months. Staying current is not optional – only the newest three minor releases receive security patches, and skipping minor versions during an upgrade is not supported. Every upgrade must step through each minor version sequentially.

Version Skew Policy#

The version skew policy defines which component version combinations are supported:

  • kube-apiserver instances within an HA cluster can differ by at most 1 minor version.
  • kubelet can be up to 3 minor versions older than kube-apiserver (changed from 2 in Kubernetes 1.28+), but never newer.
  • kube-controller-manager, kube-scheduler, and cloud-controller-manager must not be newer than kube-apiserver and can be up to 1 minor version older.
  • kube-proxy follows the same rule as kubelet: up to 3 minor versions older than kube-apiserver, never newer.
  • kubectl is supported within 1 minor version (older or newer) of kube-apiserver.

The practical consequence: always upgrade the control plane first, then node pools. Never upgrade nodes past the control plane version.
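
Before planning an upgrade, confirm where every component currently sits (a quick sketch using kubectl):

  # API server version
  kubectl version

  # kubelet version on every node; none should be newer than the control plane
  kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion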

Upgrading Self-Managed Kubernetes Clusters with kubeadm#

Upgrading a kubeadm-managed cluster is a multi-step procedure that must be executed in a precise order. The control plane upgrades first, then worker nodes one at a time. Skipping steps or upgrading in the wrong order causes version skew violations that can break cluster communication.

This article provides the complete operational sequence. Execute each step in order. Do not skip ahead.
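
In outline, the first control-plane node looks like this (a sketch only: v1.29.3 is a placeholder target, <cp-node> is a placeholder node name, and the package commands assume a Debian/Ubuntu host using the upstream Kubernetes apt repository):

  # 1. Upgrade the kubeadm package first
  sudo apt-mark unhold kubeadm && sudo apt-get update && sudo apt-get install -y kubeadm='1.29.3-*' && sudo apt-mark hold kubeadm

  # 2. Review and apply the control-plane upgrade
  sudo kubeadm upgrade plan
  sudo kubeadm upgrade apply v1.29.3

  # 3. Drain the node, upgrade kubelet and kubectl, restart the kubelet, uncordon
  kubectl drain <cp-node> --ignore-daemonsets
  sudo apt-mark unhold kubelet kubectl && sudo apt-get install -y kubelet='1.29.3-*' kubectl='1.29.3-*' && sudo apt-mark hold kubelet kubectl
  sudo systemctl daemon-reload && sudo systemctl restart kubelet
  kubectl uncordon <cp-node>

Additional control-plane nodes run kubeadm upgrade node instead of upgrade apply; worker nodes follow the same drain, upgrade, uncordon cycle one at a time.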

Version Skew Policy#

Kubernetes enforces strict version compatibility rules between components. Violating these rules results in undefined behavior – sometimes things work, sometimes the API server rejects requests, sometimes components silently fail.

Choosing a Kubernetes Backup Strategy#

Kubernetes clusters contain two fundamentally different types of state: cluster state (the Kubernetes objects themselves – Deployments, Services, ConfigMaps, Secrets, CRDs) and application data (the contents of Persistent Volumes). A complete backup strategy must address both. Most backup failures happen because teams back up one but not the other, or because they never test the restore process.
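
To make the distinction concrete, a complete strategy usually pairs something like the following (a sketch: the path, namespace, and backup name are placeholders, Velero must already be installed with a volume-snapshot-capable provider, and the etcdctl TLS environment variables are assumed to be set):

  # Cluster state: periodic etcd snapshot, taken on a control-plane node
  etcdctl snapshot save /var/backups/etcd-$(date +%F).db

  # Cluster objects plus application data: a Velero backup that also snapshots PVs
  velero backup create nightly-$(date +%F) --include-namespaces production --snapshot-volumes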

What Needs Backing Up#

Before choosing tools, inventory what your cluster contains:
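
A few commands help build that inventory (illustrative, not exhaustive):

  # Every listable API resource type in the cluster, including custom resources
  kubectl api-resources --verbs=list -o name

  # Custom resource definitions: easy to forget, painful to lose
  kubectl get crds

  # Persistent volumes and claims: this is the application-data side
  kubectl get pv
  kubectl get pvc -A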

Kubernetes Disaster Recovery Runbooks#

These runbooks cover the incidents you will encounter in production Kubernetes environments. Each follows the same structure: detection, diagnosis, recovery, and prevention. Print these out, bookmark them, put them in your on-call wiki. When the alert fires at 2 AM, you want a checklist, not a tutorial.

Incident Response Framework#

Every incident follows the same cycle:

  1. Detect – monitoring alert, user report, or kubectl showing unhealthy state
  2. Assess – determine scope and severity. Is it one pod, one node, or the entire cluster?
  3. Contain – stop the bleeding. Prevent the issue from spreading
  4. Recover – restore normal operation
  5. Post-mortem – document what happened, why, and how to prevent it
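
For the assess step, a first-pass scope check is often just a handful of kubectl queries (a sketch; adapt to your monitoring stack):

  # One node, several nodes, or the control plane itself?
  kubectl get nodes
  kubectl get pods -n kube-system

  # Anything unhealthy cluster-wide?
  kubectl get pods -A --field-selector=status.phase!=Running

  # Recent events, newest last
  kubectl get events -A --sort-by=.lastTimestamp | tail -n 30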

Runbook 1: Node Goes NotReady#

Detection: The node condition changes to Ready=False or Ready=Unknown. Pods managed by controllers (Deployments, StatefulSets) are evicted and rescheduled after the eviction timeout, about five minutes by default. Monitoring alerts on node status.
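
Diagnosis typically starts with the node's conditions, then the kubelet on the node itself (a sketch; <node-name> is a placeholder):

  # Which condition is failing, and what do recent events say?
  kubectl describe node <node-name>

  # On the node itself (SSH or console): is the kubelet running, and why not?
  sudo systemctl status kubelet
  sudo journalctl -u kubelet --since "30 min ago" | tail -n 50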