AKS Troubleshooting: Diagnosing Common Azure Kubernetes Problems

AKS Troubleshooting#

AKS problems fall into categories: node pool operations stuck or failed, pods not scheduling, storage not provisioning, authentication broken, and ingress not working. Each has Azure-specific causes that generic Kubernetes debugging will not surface.

Node Pool Stuck in Updating or Failed#

Node pool operations (scaling, upgrading, changing settings) can get stuck. The AKS API reports the pool as “Updating” indefinitely, or the provisioning state transitions to “Failed.”

# Check node pool provisioning state
az aks nodepool show \
  --resource-group myapp-rg \
  --cluster-name myapp-aks \
  --name workload \
  --query provisioningState

# Check the activity log for errors
az monitor activity-log list \
  --resource-group myapp-rg \
  --query "[?contains(operationName.value, 'Microsoft.ContainerService')].{op:operationName.value, status:status.value, msg:properties.statusMessage}" \
  --output table

Common causes and fixes:
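
One commonly documented cause is exhausted compute quota in the target region, which blocks scale-ups and upgrades. A sketch of how to check quota and retry the operation, assuming a quota-related failure (the region and node count below are placeholders; the resource names match the example above):

# Check regional vCPU quota against current usage (hypothetical region)
az vm list-usage --location eastus --output table

# Once quota is freed or raised, re-issue the stuck operation, e.g. the scale
az aks nodepool scale \
  --resource-group myapp-rg \
  --cluster-name myapp-aks \
  --name workload \
  --node-count 3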

Alertmanager Configuration and Routing

Routing Tree#

Alertmanager receives alerts from Prometheus and decides where to send them based on a routing tree. Every alert enters at the root route and travels down the tree until it matches a child route. If no child matches, the root route’s receiver handles it.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxx"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: "default-slack"
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
      group_wait: 10s
      repeat_interval: 1h
      routes:
        - match:
            team: database
          receiver: "pagerduty-dba"
    - match:
        severity: warning
      receiver: "team-slack"
      repeat_interval: 12h
    - match_re:
        namespace: "staging|dev"
      receiver: "dev-slack"
      repeat_interval: 24h

Timing parameters matter. group_wait is how long Alertmanager waits after receiving the first alert in a new group before sending the notification – this lets it batch related alerts together. group_interval is the minimum time before sending updates about a group that already fired. repeat_interval controls how often an unchanged active alert is re-sent.
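
To see which receiver a given label set would reach in this tree, amtool (shipped alongside Alertmanager) can evaluate the routing config offline. A quick sketch, assuming the config above is saved as alertmanager.yml:

# Prints the receiver(s) matched by an alert carrying these labels
amtool config routes test --config.file=alertmanager.yml severity=critical team=database

With the config above, an alert carrying severity=critical and team=database should resolve to pagerduty-dba, the most specific matching child route.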

API Gateway Patterns: Selection, Configuration, and Routing

API Gateway Patterns#

An API gateway sits between clients and your backend services. It handles cross-cutting concerns – authentication, rate limiting, request transformation, routing – so your services do not have to. Choosing the right gateway and configuring it correctly is one of the first decisions in any microservices architecture.

Gateway Responsibilities#

Before selecting a gateway, clarify which responsibilities it should own:

  • Routing – directing requests to the correct backend service based on path, headers, or method.
  • Authentication and authorization – validating tokens, API keys, or certificates before requests reach backends.
  • Rate limiting – protecting backends from traffic spikes and enforcing usage quotas.
  • Request/response transformation – modifying headers, rewriting paths, converting between formats.
  • Load balancing – distributing traffic across service instances.
  • Observability – emitting metrics, logs, and traces for every request that passes through.
  • TLS termination – handling HTTPS so backends can speak plain HTTP internally.

No gateway does everything equally well. The right choice depends on which of these responsibilities matter most in your environment.
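
As a concrete illustration of the routing responsibility, here is a minimal sketch using the Kubernetes Gateway API; the gateway, hostname, and backend service names are hypothetical, and other gateways express the same idea in their own configuration formats:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-route
spec:
  parentRefs:
    - name: public-gateway        # hypothetical Gateway handling TLS and listeners
  hostnames:
    - "api.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /orders
      backendRefs:
        - name: orders-service    # backend Service that receives matched traffic
          port: 8080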

API Versioning Strategies

API Versioning Strategies#

APIs change. Fields get added, endpoints get restructured, response formats evolve. The question is not whether to version your API, but how. A versioning strategy determines how clients discover and select API versions, how you communicate changes, and how you eventually retire old versions.

Breaking vs Non-Breaking Changes#

Before picking a versioning strategy, you need a clear definition of what constitutes a breaking change. This determines when you need a new version versus when you can evolve the existing one.
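
The two most common selection mechanisms are a version segment in the URL path and a version carried in a request header. A quick sketch with hypothetical endpoints:

# URL path versioning: the version is explicit in every request and easy to route on
curl https://api.example.com/v1/orders/42

# Header (media type) versioning: the URL stays stable and clients opt in per request
curl -H "Accept: application/vnd.example.v2+json" https://api.example.com/orders/42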

ArgoCD Image Updater: Automatic Image Tag Updates Without Git Commits

ArgoCD Image Updater#

ArgoCD Image Updater watches container registries for new image tags and automatically updates ArgoCD Applications to use them. In a standard GitOps workflow, updating an image tag requires a Git commit that changes the tag in a values file or manifest. Image Updater automates that step.

The Problem It Solves#

Standard GitOps image update flow:

CI builds image → pushes myapp:v1.2.3 to registry
    → Developer (or CI) commits "update image tag to v1.2.3" to Git
    → ArgoCD detects Git change
    → ArgoCD syncs new tag to cluster

That middle step – committing the tag update – is friction. CI pipelines need Git write access, commit messages are noise (“bump image to v1.2.4”, “bump image to v1.2.5”), and the delay between image push and deployment depends on how fast the commit pipeline runs.
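
Image Updater is configured through annotations on the ArgoCD Application: they declare which images to track, how to pick new tags, and how to write the change back. A minimal sketch, assuming a Helm-based app; the registry path, alias, tag constraint, and repository URL are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
  annotations:
    # Track this image under the alias "myapp"
    argocd-image-updater.argoproj.io/image-list: myapp=registry.example.com/team/myapp
    # Follow the newest semantic-version tag allowed by the tag filter below
    argocd-image-updater.argoproj.io/myapp.update-strategy: semver
    argocd-image-updater.argoproj.io/myapp.allow-tags: regexp:^v1\..*
    # Commit the tag change back to Git instead of patching the Application directly
    argocd-image-updater.argoproj.io/write-back-method: git
spec:
  project: default
  source:
    repoURL: https://github.com/example/myapp-deploy   # hypothetical config repo
    path: helm/myapp
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp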

ArgoCD Multi-Cluster Management: Hub-Spoke Patterns, Cluster Registration, and Fleet Operations

ArgoCD Multi-Cluster Management#

A single ArgoCD instance can manage deployments across dozens of Kubernetes clusters. This is one of ArgoCD’s strongest features and the standard approach for organizations with multiple environments, regions, or cloud providers.

Hub-Spoke Architecture#

The standard multi-cluster pattern runs ArgoCD on one “hub” cluster that deploys to multiple “spoke” clusters:

Hub Cluster (management)
├── ArgoCD control plane
├── Application/ApplicationSet definitions
├── RBAC policies
└── Cluster credentials (Secrets)
    │
    ├──→ Spoke Cluster: dev (us-east-1)
    ├──→ Spoke Cluster: staging (us-west-2)
    ├──→ Spoke Cluster: prod-us (us-east-1)
    ├──→ Spoke Cluster: prod-eu (eu-west-1)
    └──→ Spoke Cluster: prod-apac (ap-southeast-1)

ArgoCD on the hub cluster connects to each spoke cluster’s API server to apply manifests and check health. The spoke clusters do not need ArgoCD installed.
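
Spoke clusters are registered either imperatively with argocd cluster add <kube-context> or declaratively as a labeled Secret on the hub. A sketch of the declarative form; the cluster name, API server URL, and credentials are placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-eu
  namespace: argocd
  labels:
    # This label is how ArgoCD discovers cluster credentials
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prod-eu
  server: https://prod-eu-api.example.com:6443
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-certificate>"
      }
    }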

ArgoCD Notifications: Slack, Teams, Webhooks, and Custom Triggers

ArgoCD Notifications#

ArgoCD Notifications is a built-in component (bundled into the default install since ArgoCD 2.3) that monitors applications and sends alerts when specific events occur – sync succeeded, sync failed, health degraded, new version deployed. Before notifications existed, teams polled the ArgoCD UI or built custom watchers. Notifications eliminates that.

Architecture#

ArgoCD Notifications runs as a controller alongside the ArgoCD application controller. It watches Application resources for state changes and matches them against triggers. When a trigger fires, it renders a template and sends it through a configured service (Slack, Teams, webhook, email, etc.).
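
Triggers, templates, and services live in the argocd-notifications-cm ConfigMap, and individual Applications opt in through annotations. A minimal sketch for Slack; the trigger, template text, and token key are placeholders patterned on the documented format:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Slack service; the token value is read from the argocd-notifications-secret Secret
  service.slack: |
    token: $slack-token
  # Fire when an Application finishes syncing successfully
  trigger.on-sync-succeeded: |
    - when: app.status.operationState.phase in ['Succeeded']
      send: [app-sync-succeeded]
  template.app-sync-succeeded: |
    message: "{{.app.metadata.name}} synced to revision {{.app.status.sync.revision}}"

An Application then subscribes with an annotation such as notifications.argoproj.io/subscribe.on-sync-succeeded.slack: my-channel.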

ArgoCD Patterns: App of Apps, ApplicationSets, Multi-Environment Management, and Source Strategies

ArgoCD Patterns#

Once ArgoCD is running and you have a few applications deployed, you hit a scaling problem: managing dozens or hundreds of Application resources by hand is unsustainable. These patterns solve that.

App of Apps#

The App of Apps pattern uses one ArgoCD Application to manage other Application resources. You create a “root” application that points to a directory containing Application YAML files. When ArgoCD syncs the root app, it creates all the child applications.
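
A sketch of what the root Application might look like; the repository URL and directory path are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-config   # hypothetical repo of Application YAMLs
    path: apps
    targetRevision: main
  destination:
    # Child Applications are created on the cluster where ArgoCD itself runs
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true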

ArgoCD Secrets Management: Sealed Secrets, External Secrets Operator, and SOPS

ArgoCD Secrets Management#

GitOps says everything should be in Git. Kubernetes Secrets are base64-encoded, not encrypted. Committing base64 secrets to Git is equivalent to committing plaintext – anyone with repo access can decode them. This is the fundamental tension of GitOps secrets management.

Three approaches solve this, each with different tradeoffs.

Approach 1: Sealed Secrets#

Sealed Secrets encrypts secrets client-side so the encrypted form can be safely committed to Git. Only the Sealed Secrets controller running in-cluster can decrypt them.
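
A sketch of the encryption step, assuming kubeseal is installed locally and the controller is running in the cluster; the secret name, namespace, and value are placeholders:

# Create a regular Secret manifest locally (never committed to Git)
kubectl create secret generic db-credentials \
  --namespace myapp \
  --from-literal=password='s3cr3t' \
  --dry-run=client -o yaml > db-credentials.yaml

# Encrypt it against the controller's public key; the output is safe to commit
kubeseal --format yaml < db-credentials.yaml > db-credentials-sealed.yaml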

ArgoCD Setup and Basics: Installation, CLI, First Application, and Sync Policies

ArgoCD Setup and Basics#

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes. It watches Git repositories and ensures your cluster state matches what is declared in those repos. When someone changes a manifest in Git, ArgoCD detects the drift and either alerts you or automatically applies the change.

Installation with Plain Manifests#

The fastest path to a running ArgoCD instance:

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

This installs the full ArgoCD stack: API server, repo server, application controller, Redis, and Dex (for SSO). For a minimal install without the API server, UI, and Dex, use core-install.yaml instead; namespace-install.yaml installs the same components as install.yaml but with namespace-scoped permissions rather than cluster-wide RBAC.
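
After the pods are running, the auto-generated admin password lives in a Secret, and the CLI can log in through a local port-forward. A quick sketch:

# Fetch the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

# Expose the API server locally and log in with the CLI
kubectl -n argocd port-forward svc/argocd-server 8080:443 &
argocd login localhost:8080 --username admin --insecure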