Upgrading Self-Managed Kubernetes Clusters with kubeadm: Step-by-Step

Upgrading Self-Managed Kubernetes Clusters with kubeadm#

Upgrading a kubeadm-managed cluster is a multi-step procedure that must be executed in a precise order. The control plane upgrades first, then worker nodes one at a time. Skipping steps or upgrading in the wrong order causes version skew violations that can break cluster communication.

This article provides the complete operational sequence. Execute each step in order. Do not skip ahead.

Version Skew Policy#

Kubernetes enforces strict version compatibility rules between components. Violating these rules results in undefined behavior – sometimes things work, sometimes the API server rejects requests, sometimes components silently fail.

Validation Playbook Format: Structuring Portable Validation Procedures

Validation Playbook Format#

A validation playbook is a structured procedure that tells an agent exactly how to validate a specific type of infrastructure change. The key problem it solves: the same validation (for example, “verify this Helm chart works”) requires different commands depending on whether the agent has access to kind, minikube, a cloud cluster, or nothing but a linter. A playbook encodes all path variants in one document so the agent picks the right commands for its environment.

Velero Backup and Restore: Disaster Recovery for Kubernetes

Velero Backup and Restore#

Velero backs up Kubernetes resources and persistent volume data to object storage. It handles scheduled backups, on-demand snapshots, and restores to the same or a different cluster. It is the standard tool for Kubernetes disaster recovery.

Velero captures two things: Kubernetes API objects (stored as JSON) and persistent volume data (via cloud volume snapshots or file-level backup with Kopia).

Installation#

You need an object storage bucket (S3, GCS, Azure Blob, or MinIO) and write credentials.

Writing Effective Prometheus Alerting Rules

Rule Syntax#

Alerting rules live in rule files loaded by Prometheus. Each rule has an expression, an optional for duration, labels, and annotations.

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

The for duration is critical. Without it, a single bad scrape triggers an alert. With for: 5m, the condition must be continuously true across all evaluations for 5 minutes before the alert fires. During this window the alert is in pending state.

Admission Controllers and Webhooks: Intercepting and Enforcing Kubernetes API Requests

Admission Controllers and Webhooks#

Every request to the Kubernetes API server passes through a chain: authentication, authorization, and then admission control. Admission controllers are plugins that intercept requests after a user is authenticated and authorized but before the object is persisted to etcd. They can validate requests, reject them, or mutate objects on the fly. This is where you enforce organizational policy, inject sidecar containers, set defaults, and block dangerous configurations.

Advanced Kubernetes Debugging: CrashLoopBackOff, ImagePullBackOff, OOMKilled, and Stuck Pods

Advanced Kubernetes Debugging#

Every Kubernetes failure follows a pattern, and every pattern has a diagnostic sequence. This guide covers the most common failure modes you will encounter in production, with the exact commands and thought process to move from symptom to resolution.

Systematic Debugging Methodology#

Before diving into specific scenarios, internalize this sequence. It applies to nearly every pod issue:

# Step 1: What state is the pod in?
kubectl get pod <pod> -n <ns> -o wide

# Step 2: What does the full pod spec and event history show?
kubectl describe pod <pod> -n <ns>

# Step 3: What did the application log before it failed?
kubectl logs <pod> -n <ns> --previous --all-containers

# Step 4: Can you get inside the container?
kubectl exec -it <pod> -n <ns> -- /bin/sh

# Step 5: Is the node healthy?
kubectl describe node <node-name>
kubectl top node <node-name>

Each failure mode below follows this pattern, with specific things to look for at each step.

ARM64 Kubernetes: The QEMU Problem with Go Binaries

ARM64 Kubernetes: The QEMU Problem with Go Binaries#

If you run Kubernetes on Apple Silicon (M1/M2/M3/M4) via minikube with the Docker driver, you will eventually try to run an amd64-only container image. For most software this works through QEMU emulation. For Go binaries, it crashes hard.

The Problem#

Go’s garbage collector uses a lock-free stack (lfstack) that packs pointers with counter bits in the high bits of a 64-bit integer. QEMU’s user-mode address translation changes the effective address space layout, which breaks this packing assumption. The result:

Canary Deployments Deep Dive: Argo Rollouts, Flagger, and Metrics-Based Progressive Delivery

Canary Deployments Deep Dive#

A canary deployment sends a small percentage of traffic to a new version of your application while the majority continues hitting the stable version. You monitor the canary for errors, latency regressions, and business metric anomalies. If the canary is healthy, you gradually increase its traffic share until it handles 100%. If something is wrong, you roll back with minimal user impact.

Why Canary Over Rolling Update#

A standard Kubernetes rolling update replaces pods one by one until all pods run the new version. The problem is timing. By the time you notice a bug in your monitoring dashboards, the rolling update may have already replaced most or all pods. Every user is now hitting the broken version.

Choosing a Database Strategy: On Kubernetes vs Managed Service, and PostgreSQL vs MySQL vs CockroachDB

Choosing a Database Strategy#

Every Kubernetes-based platform eventually faces two questions: should the database run inside the cluster or as a managed service, and which database engine fits the workload? These decisions are difficult to reverse. A database migration is one of the highest-risk operations in production. Getting the initial decision roughly right saves months of future pain.

Where to Run: Kubernetes vs Managed Service#

This is not a technology question. It is an organizational question about who owns database operations and what tradeoffs the team will accept.

Choosing a GitOps Tool: ArgoCD vs Flux vs Jenkins vs GitHub Actions for Kubernetes Deployments

Choosing a GitOps Tool#

The term “GitOps” is applied to everything from a simple kubectl apply in a GitHub Actions workflow to a fully reconciled, pull-based deployment architecture with drift detection. These are fundamentally different approaches. Choosing between them depends on your team’s operational maturity, cluster count, and tolerance for running controllers in your cluster.

What GitOps Actually Means#

GitOps, as defined by the OpenGitOps principles (a CNCF sandbox project), has four requirements: declarative desired state, state versioned in git, changes applied automatically, and continuous reconciliation with drift detection. The last two are what separate true GitOps from “CI/CD that uses git.”