AKS Troubleshooting: Diagnosing Common Azure Kubernetes Problems

AKS Troubleshooting#

AKS problems fall into categories: node pool operations stuck or failed, pods not scheduling, storage not provisioning, authentication broken, and ingress not working. Each has Azure-specific causes that generic Kubernetes debugging will not surface.

Node Pool Stuck in Updating or Failed#

Node pool operations (scaling, upgrading, changing settings) can get stuck. The AKS API reports the pool as “Updating” indefinitely or transitions to “Failed.”

# Check node pool provisioning state
az aks nodepool show \
  --resource-group myapp-rg \
  --cluster-name myapp-aks \
  --name workload \
  --query provisioningState

# Check the activity log for errors
az monitor activity-log list \
  --resource-group myapp-rg \
  --query "[?contains(operationName.value, 'Microsoft.ContainerService')].{op:operationName.value, status:status.value, msg:properties.statusMessage}" \
  --output table

Common causes and fixes:

Cloud Behavioral Divergence Guide: Where AWS, Azure, and GCP Actually Differ

Cloud Behavioral Divergence Guide#

Running the “same” workload on AWS, Azure, and GCP does not produce the same behavior. The Kubernetes API is portable, application containers are portable, and SQL queries are portable. Everything else – identity, networking, storage, load balancing, DNS, and managed service behavior – diverges in ways that matter for production reliability.

This guide documents the specific divergence points with practical examples. Use it when translating infrastructure from one cloud to another, when debugging behavior that differs between environments, or when assessing migration risk.

Choosing Kubernetes Storage: Local vs Network vs Cloud CSI Drivers

Choosing Kubernetes Storage#

Storage decisions in Kubernetes are harder to change than almost any other architectural choice. Migrating data between storage backends in production involves downtime, risk, and careful planning. Understand the tradeoffs before provisioning your first PersistentVolumeClaim.

The decision comes down to five criteria: performance (IOPS and latency), durability (can you survive node failure), portability (can you move the workload), cost, and access mode (single pod or shared).

Storage Categories#

Block Storage (ReadWriteOnce)#

Block storage provides a raw disk attached to a single node. Only one pod on that node can mount it at a time (ReadWriteOnce). This is the most common storage type for databases, caches, and any workload that needs fast, consistent disk I/O.

Kubernetes Troubleshooting Decision Trees: Symptom to Diagnosis to Fix

Kubernetes Troubleshooting Decision Trees#

Troubleshooting Kubernetes in production is about eliminating possibilities in the right order. Every symptom maps to a finite set of causes, and each cause has a specific diagnostic command. The decision trees below encode that mapping. Start at the symptom, follow the branches, run the commands, and the output tells you which branch to take next.

These trees are designed to be followed mechanically. No intuition required – just execute the commands and interpret the results.