AKS Troubleshooting: Diagnosing Common Azure Kubernetes Problems

AKS Troubleshooting#

AKS problems fall into categories: node pool operations stuck or failed, pods not scheduling, storage not provisioning, authentication broken, and ingress not working. Each has Azure-specific causes that generic Kubernetes debugging will not surface.

Node Pool Stuck in Updating or Failed#

Node pool operations (scaling, upgrading, changing settings) can get stuck. The AKS API reports the pool as “Updating” indefinitely, or the pool transitions to “Failed.”

# Check node pool provisioning state
az aks nodepool show \
  --resource-group myapp-rg \
  --cluster-name myapp-aks \
  --name workload \
  --query provisioningState

# Check the activity log for errors
az monitor activity-log list \
  --resource-group myapp-rg \
  --query "[?contains(operationName.value, 'Microsoft.ContainerService')].{op:operationName.value, status:status.value, msg:properties.statusMessage}" \
  --output table

Common causes and fixes:

CockroachDB Debugging and Troubleshooting

Node Liveness Issues#

Every node must renew its liveness record every 4.5 seconds. Failure to renew marks the node suspect, then dead, triggering re-replication of its ranges.

cockroach node status --insecure --host=localhost:26257

Look at is_live. If a node shows false, check in order:

Process crashed. Check cockroach-data/logs/ for fatal or panic entries. OOM kills are the most common cause – check dmesg | grep -i oom on the host.

Network partition. The node runs but cannot reach peers. If cockroach node status succeeds locally but fails from other nodes, the problem is network-level (firewalls, security groups, DNS).
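
A quick sequence covering both checks, assuming the default cockroach-data directory and shell access to the affected host (the suspect node's address is a placeholder):

# 1. Did the process die? Look for fatal/panic entries and OOM kills on the host
grep -iE 'fatal|panic' cockroach-data/logs/cockroach.log | tail -20
dmesg | grep -i oom

# 2. Is it a partition? Run the same status check from a different node, pointed at the suspect one
cockroach node status --insecure --host=<suspect-node-address>:26257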

Debugging ArgoCD: Diagnosing Sync Failures, Health Checks, RBAC, and Repo Issues

Debugging ArgoCD#

Most ArgoCD problems fall into predictable categories: sync stuck in a bad state, resources showing OutOfSync when they should not be, health checks reporting wrong status, RBAC blocking operations, or repository connections failing. Here is how to diagnose and fix each one.
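
Whatever the symptom, a useful first pass is to pull the application's own view of things. A minimal sketch, assuming the argocd CLI is logged in, a default argocd namespace with standard labels, and a hypothetical app named myapp:

argocd app get myapp                 # sync status, health, and a per-resource breakdown
argocd app get myapp --refresh       # re-evaluate against the repo before trusting OutOfSync
kubectl -n argocd logs -l app.kubernetes.io/name=argocd-application-controller --tail=100   # controller-side errors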

Application Stuck in Progressing#

An application stuck in Progressing means ArgoCD is waiting for a resource to become healthy and it never does. The most common causes:

Debugging GitHub Actions: Triggers, Failures, Secrets, Caching, and Performance

Debugging GitHub Actions#

When a GitHub Actions workflow fails or does not behave as expected, the problem falls into a few predictable categories. This guide covers each one with the diagnostic steps and fixes.

Workflow Not Triggering#

The most common GitHub Actions “bug” is a workflow that never runs.

Check the event and branch filter. A push trigger with branches: [main] will not fire for pushes to feature/xyz. For a pull_request trigger, the branches filter is evaluated against the PR’s base branch, while the workflow itself runs against the PR’s head.
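
For illustration, a hypothetical minimal workflow showing both filters (the file path and job contents are placeholders, not taken from any real repo):

cat <<'EOF' > .github/workflows/ci.yml
name: ci
on:
  push:
    branches: [main]        # fires only for pushes to main, not feature/xyz
  pull_request:
    branches: [main]        # filter matches the PR's base branch; the job runs against the head
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
EOF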

Diagnosing Common Terraform Problems

Stuck State Lock#

A CI job was cancelled, a laptop lost network, or a process crashed mid-apply. Terraform refuses to run:

Error acquiring the state lock
Lock Info:
  ID:        f8e7d6c5-b4a3-2109-8765-43210fedcba9
  Operation: OperationTypeApply
  Who:       deploy@ci-runner
  Created:   2026-02-20 09:15:22 +0000 UTC

Verify the lock holder is truly dead. Check CI job status, then:

terraform force-unlock f8e7d6c5-b4a3-2109-8765-43210fedcba9

If the lock was from a crashed apply, the state may be partially updated. Run terraform plan immediately after unlocking to see what actually got created or changed before you retry.
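
A minimal follow-up, assuming the unlock succeeded (-detailed-exitcode makes any drift visible to scripts):

terraform plan -detailed-exitcode    # exit 0: no changes, 2: pending changes, 1: error
terraform apply                      # re-run the interrupted apply once the plan looks sane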

EKS Troubleshooting

EKS Troubleshooting#

EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.

Nodes Not Joining the Cluster#

Symptoms: kubectl get nodes shows fewer nodes than expected. ASG shows instances running, but they never register.

aws-auth ConfigMap Missing Node Role#

The most common cause. Worker nodes authenticate via aws-auth. If the node IAM role is not mapped, nodes are rejected silently.
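
A quick way to confirm and repair the mapping (a sketch; cluster name, region, account ID, and role name are placeholders, and eksctl writes the mapRoles entry for you):

# See which IAM roles are currently mapped
kubectl -n kube-system get configmap aws-auth -o yaml

# Add the missing node role
eksctl create iamidentitymapping \
  --cluster myapp-eks \
  --region us-east-1 \
  --arn arn:aws:iam::<account-id>:role/<node-instance-role> \
  --username 'system:node:{{EC2PrivateDNSName}}' \
  --group system:bootstrappers \
  --group system:nodes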

GKE Troubleshooting

GKE Troubleshooting#

GKE adds a layer of Google Cloud infrastructure on top of Kubernetes, which means some problems are pure Kubernetes issues and others are GKE-specific. This guide covers the GKE-specific problems that trip people up.

Autopilot Resource Adjustment#

Autopilot automatically mutates pod resource requests to fit its scheduling model. If you request cpu: 100m and memory: 128Mi, Autopilot may bump the request to cpu: 250m and memory: 512Mi. This affects your billing (you pay per resource request) and can cause unexpected OOMKills if the limits were set relative to the original request.
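
To see what Autopilot actually admitted, inspect the live pod spec rather than your manifest, since the mutation happens at admission. A quick check (pod and namespace are placeholders; recent Autopilot versions also record the change in an autopilot.gke.io/resource-adjustment annotation):

# Requests and limits as admitted, per container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'

# The adjustment annotation, if present
kubectl get pod <pod-name> -n <namespace> -o yaml | grep autopilot.gke.io/resource-adjustment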

Jenkins Debugging: Diagnosing Stuck Builds, Pipeline Failures, Performance Issues, and Kubernetes Agent Problems

Jenkins Debugging#

Jenkins failures fall into a few categories: builds stuck waiting, cryptic pipeline errors, performance degradation, and Kubernetes agent pods that refuse to launch.

Builds Stuck in Queue#

When a build sits in the queue and never starts, check the queue tooltip in the UI – it tells you why. Common causes:

No agents with matching labels. The pipeline requests agent { label 'docker-arm64' } but no agent has that label. Check Manage Jenkins > Nodes to see available labels.
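
To list the labels that actually exist without clicking through the UI, the node list is also available from the Jenkins REST API. A sketch, assuming an API token, the jq tool, and placeholder URL and credentials:

curl -s -u "$JENKINS_USER:$JENKINS_TOKEN" \
  "$JENKINS_URL/computer/api/json?tree=computer[displayName,offline,assignedLabels[name]]" \
  | jq '.computer[] | {name: .displayName, offline, labels: [.assignedLabels[].name]}'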

kubectl Debugging: A Practical Command Reference

kubectl Debugging#

When something breaks in Kubernetes, you need to move through a specific sequence of commands. Here is every debugging command you will reach for, plus a step-by-step workflow for a pod that will not start.

Logs#

kubectl logs <pod-name> -n <namespace>                           # basic
kubectl logs <pod-name> -c <container-name> -n <namespace>       # specific container
kubectl logs <pod-name> --previous -n <namespace>                # previous crash (essential for CrashLoopBackOff)
kubectl logs -f <pod-name> -n <namespace>                        # stream in real-time
kubectl logs --since=5m <pod-name> -n <namespace>                # last 5 minutes
kubectl logs -l app=payments-api -n payments-prod --all-containers  # all pods matching label

The --previous flag is critical for crash-looping pods where the current container has no logs yet. The --all-containers flag captures init containers and sidecars.

Kubernetes Events Debugging: Patterns, Filtering, and Alerting

Kubernetes Events Debugging#

Kubernetes events are the cluster’s built-in audit trail for what is happening to resources. When a pod fails to schedule, a container crashes, a node runs out of disk, or a volume fails to mount, the system records an event. Events are the first place to look when something goes wrong, and learning to read them efficiently separates quick diagnosis from hours of guessing.
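
The quickest way in is kubectl with sorting and field selectors (namespace and object names are placeholders):

# Most recent events last, across all namespaces
kubectl get events -A --sort-by=.lastTimestamp

# Only warnings in one namespace
kubectl get events -n <namespace> --field-selector type=Warning

# Events for a single object
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>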

Event Structure#

Every Kubernetes event has these fields: