Advanced Kubernetes Debugging: CrashLoopBackOff, ImagePullBackOff, OOMKilled, and Stuck Pods

Advanced Kubernetes Debugging#

Every Kubernetes failure follows a pattern, and every pattern has a diagnostic sequence. This guide covers the most common failure modes you will encounter in production, with the exact commands and thought process to move from symptom to resolution.

Systematic Debugging Methodology#

Before diving into specific scenarios, internalize this sequence. It applies to nearly every pod issue:

# Step 1: What state is the pod in?
kubectl get pod <pod> -n <ns> -o wide

# Step 2: What does the full pod spec and event history show?
kubectl describe pod <pod> -n <ns>

# Step 3: What did the application log before it failed?
kubectl logs <pod> -n <ns> --previous --all-containers

# Step 4: Can you get inside the container?
kubectl exec -it <pod> -n <ns> -- /bin/sh

# Step 5: Is the node healthy?
kubectl describe node <node-name>
kubectl top node <node-name>

Each failure mode below follows this pattern, with specific things to look for at each step.
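
If you only need the high-level reason (CrashLoopBackOff, ImagePullBackOff, OOMKilled), you can pull it straight from the pod status with jsonpath instead of scanning the full describe output. A small sketch, with <pod> and <ns> as placeholders:

# Why is the container currently waiting? (CrashLoopBackOff, ImagePullBackOff, ...)
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'

# Why did the last run terminate? (OOMKilled, Error, Completed, ...)
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'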

ARM64 Kubernetes: The QEMU Problem with Go Binaries#

If you run Kubernetes on Apple Silicon (M1/M2/M3/M4) via minikube with the Docker driver, you will eventually try to run an amd64-only container image. For most software this works through QEMU emulation. For Go binaries, it crashes hard.
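
Before blaming the application, it is worth confirming the architecture mismatch. A quick check, assuming you have the Docker CLI available and the image is published to a registry:

# What architecture are the cluster nodes actually running?
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'

# Which platforms does the image provide? No linux/arm64 entry means QEMU emulation.
docker manifest inspect <image> | grep architecture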

The Problem#

Go’s garbage collector uses a lock-free stack (lfstack) that packs a pointer and a counter into a single 64-bit word, relying on assumptions about which bits of a user-space address are actually in use. QEMU’s user-mode emulation changes the effective address space layout, which violates those assumptions. The result:

Canary Deployments Deep Dive: Argo Rollouts, Flagger, and Metrics-Based Progressive Delivery

Canary Deployments Deep Dive#

A canary deployment sends a small percentage of traffic to a new version of your application while the majority continues hitting the stable version. You monitor the canary for errors, latency regressions, and business metric anomalies. If the canary is healthy, you gradually increase its traffic share until it handles 100%. If something is wrong, you roll back with minimal user impact.
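
The mechanics do not require a special controller. A minimal sketch using nothing but replica ratios behind a shared Service (the web-stable/web-canary Deployment names and the track=canary label are hypothetical; Argo Rollouts and Flagger automate exactly this loop):

# Both Deployments match the Service selector, so traffic splits roughly
# in proportion to ready pods: 9 stable + 1 canary ≈ 10% canary traffic.
kubectl scale deployment web-stable --replicas=9 -n <ns>
kubectl scale deployment web-canary --replicas=1 -n <ns>

# Watch the canary before increasing its share (label is illustrative).
kubectl logs -l track=canary -n <ns> --tail=50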

Why Canary Over Rolling Update#

A standard Kubernetes rolling update replaces pods one by one until all pods run the new version. The problem is timing. By the time you notice a bug in your monitoring dashboards, the rolling update may have already replaced most or all pods. Every user is now hitting the broken version.

Choosing a Database Strategy: On Kubernetes vs Managed Service, and PostgreSQL vs MySQL vs CockroachDB

Choosing a Database Strategy#

Every Kubernetes-based platform eventually faces two questions: should the database run inside the cluster or as a managed service, and which database engine fits the workload? These decisions are difficult to reverse. A database migration is one of the highest-risk operations in production. Getting the initial decision roughly right saves months of future pain.

Where to Run: Kubernetes vs Managed Service#

This is not a technology question. It is an organizational question about who owns database operations and what tradeoffs the team will accept.

Choosing a GitOps Tool: ArgoCD vs Flux vs Jenkins vs GitHub Actions for Kubernetes Deployments

Choosing a GitOps Tool#

The term “GitOps” is applied to everything from a simple kubectl apply in a GitHub Actions workflow to a fully reconciled, pull-based deployment architecture with drift detection. These are fundamentally different approaches. Choosing between them depends on your team’s operational maturity, cluster count, and tolerance for running controllers in your cluster.

What GitOps Actually Means#

GitOps, as defined by the OpenGitOps principles (a CNCF sandbox project), has four requirements: declarative desired state, desired state versioned in git, desired state pulled and applied automatically by software agents, and continuous reconciliation with drift detection. The last two are what separate true GitOps from “CI/CD that uses git.”
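
The last principle is easy to approximate by hand, which makes the distinction concrete. A rough sketch (the deploy/ directory is a hypothetical manifest path in the git repo):

# What a push-based pipeline does once per commit:
kubectl apply -f deploy/

# What a GitOps controller effectively does in a loop, forever:
git pull --ff-only
kubectl diff -f deploy/   # non-empty output = drift between git and the cluster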

Choosing a Kubernetes Backup Strategy: Velero vs Native Snapshots vs Application-Level Backups

Choosing a Kubernetes Backup Strategy#

Kubernetes clusters contain two fundamentally different types of state: cluster state (the Kubernetes objects themselves – Deployments, Services, ConfigMaps, Secrets, CRDs) and application data (the contents of Persistent Volumes). A complete backup strategy must address both. Most backup failures happen because teams back up one but not the other, or because they never test the restore process.
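
To make the split concrete, here is a naive sketch of what each half covers; purpose-built tools such as Velero handle CRDs, owner references, and restore ordering that this ignores:

# Cluster state: export API objects as YAML (deliberately incomplete; a real
# backup must also cover CRDs, RBAC, and other resource types).
kubectl get deploy,sts,svc,cm,secret,pvc -A -o yaml > cluster-state.yaml

# Application data: the PVC list shows what the export above does NOT capture.
# The bytes inside each volume need snapshots or application-level dumps.
kubectl get pvc -A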

What Needs Backing Up#

Before choosing tools, inventory what your cluster contains:

Choosing an Infrastructure as Code Tool: Terraform vs Pulumi vs CloudFormation/Bicep vs Crossplane

Choosing an Infrastructure as Code Tool#

Infrastructure as Code tools differ in language, state management, provider ecosystem, and operational model. The choice affects how your team writes, reviews, tests, and maintains infrastructure definitions for years. Switching IaC tools mid-project is possible but expensive – it typically means rewriting all definitions and carefully importing existing resources into the new tool’s state.

Decision Criteria#

Before comparing tools, establish what matters to your organization:

Choosing Kubernetes Storage: Local vs Network vs Cloud CSI Drivers

Choosing Kubernetes Storage#

Storage decisions in Kubernetes are harder to change than almost any other architectural choice. Migrating data between storage backends in production involves downtime, risk, and careful planning. Understand the tradeoffs before provisioning your first PersistentVolumeClaim.

The decision comes down to five criteria: performance (IOPS and latency), durability (can you survive node failure), portability (can you move the workload), cost, and access mode (exclusive to a single node or shared across many).

Storage Categories#

Block Storage (ReadWriteOnce)#

Block storage provides a raw disk attached to a single node. A ReadWriteOnce volume can only be mounted by one node at a time; pods on other nodes cannot attach it (the stricter ReadWriteOncePod mode limits access to a single pod). This is the most common storage type for databases, caches, and any workload that needs fast, consistent disk I/O.
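
As a concrete example, a minimal ReadWriteOnce claim looks like this (the name and size are illustrative; omitting storageClassName uses the cluster default):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF

# Confirm the claim bound to a volume from the default StorageClass.
kubectl get pvc db-data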

Cilium Deep Dive: eBPF Networking, L7 Policies, Hubble Observability, and Cluster Mesh

Cilium Deep Dive#

Cilium replaces the traditional Kubernetes networking stack with eBPF programs that run directly in the Linux kernel. Instead of kube-proxy translating Service definitions into iptables rules and a traditional CNI plugin managing pod networking through bridge interfaces and routing tables, Cilium attaches eBPF programs to kernel hooks that process packets at wire speed. The result is a networking layer that is faster at scale, enforces Layer 7 policies, and provides built-in observability without application instrumentation.
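
A quick way to see this in practice, assuming the Cilium and Hubble CLIs are installed (output fields and the agent binary name vary by Cilium version):

# Overall agent and operator health.
cilium status

# Confirm eBPF service handling has replaced kube-proxy.
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i kubeproxyreplacement

# Flow-level visibility with no application instrumentation.
hubble observe --namespace <ns> --last 20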

Custom Resource Definitions (CRDs): Extending the Kubernetes API

Custom Resource Definitions (CRDs)#

CRDs extend the Kubernetes API with your own resource types. Once you create a CRD, you can kubectl get, kubectl apply, and kubectl delete instances of your custom type just like built-in resources. The custom resources are stored in etcd alongside native Kubernetes objects, benefit from the same RBAC, and participate in the same API machinery.
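
As a minimal sketch (the group, names, and schema here are illustrative, not taken from any real operator):

kubectl apply -f - <<'EOF'
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string
EOF

# The new type behaves like any built-in resource.
kubectl get backups
kubectl api-resources | grep backups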

When to Use CRDs#

CRDs make sense when you need to represent application-specific concepts inside Kubernetes: