---
title: "EKS Troubleshooting"
description: "How to diagnose and fix common EKS problems: nodes not joining, pods stuck pending, load balancers not routing, EBS volumes not attaching, and DNS failures."
url: https://agent-zone.ai/knowledge/kubernetes/eks-troubleshooting/
section: knowledge
date: 2026-02-22
categories: ["kubernetes"]
tags: ["eks","aws","troubleshooting","debugging","vpc-cni","alb","ebs"]
skills: ["eks-troubleshooting","kubernetes-debugging","aws-networking-diagnosis"]
tools: ["kubectl","aws-cli","eksctl"]
levels: ["intermediate"]
word_count: 777
formats:
  json: https://agent-zone.ai/knowledge/kubernetes/eks-troubleshooting/index.json
  html: https://agent-zone.ai/knowledge/kubernetes/eks-troubleshooting/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=EKS+Troubleshooting
---


# EKS Troubleshooting

EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.

## Nodes Not Joining the Cluster

Symptoms: `kubectl get nodes` shows fewer nodes than expected. The Auto Scaling group shows instances running, but they never register with the cluster.

### aws-auth ConfigMap Missing Node Role

The most common cause. Worker nodes authenticate via `aws-auth`. If the node IAM role is not mapped, nodes are rejected silently.

```bash
kubectl get configmap aws-auth -n kube-system -o yaml
# Verify the node role ARN appears under mapRoles with groups:
#   system:bootstrappers and system:nodes

# If missing, add it:
eksctl create iamidentitymapping --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/eks-node-group-role \
  --group system:bootstrappers \
  --group system:nodes \
  --username system:node:{{EC2PrivateDNSName}}
```

### Security Group Rules

Nodes need outbound port 443 to the control plane (API server), and the control plane needs port 10250 to the nodes (kubelet). Verify that both directions are allowed in the cluster security group:

```bash
aws eks describe-cluster --name my-cluster \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId"
```
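With the security group ID from the previous command, you can list its rules directly; `sg-xxx` below is a placeholder:

```bash
# List ingress and egress rules for the cluster security group
# (sg-xxx is a placeholder -- substitute the ID returned above)
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-xxx \
  --query "SecurityGroupRules[].{Egress:IsEgress,Proto:IpProtocol,From:FromPort,To:ToPort}"
```

Look for an inbound rule covering 443 and 10250 between the control plane and node security groups.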

### AMI and Bootstrap Issues

Managed node groups handle AMIs automatically. Self-managed nodes must run `/etc/eks/bootstrap.sh my-cluster` -- if the script is missing or the cluster name is wrong, the node never joins.

If nodes appear but show `NotReady`, check kubelet logs via SSM (`journalctl -u kubelet -f`). Common causes: VPC CNI crashing (check `aws-node` DaemonSet), disk pressure, or memory pressure.
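The checks above can be sketched as a few commands; `<node-name>` is a placeholder:

```bash
# Node conditions: look for DiskPressure, MemoryPressure, and the Ready status
kubectl describe node <node-name> | grep -A 8 "Conditions:"

# Is the VPC CNI DaemonSet healthy on every node?
kubectl get daemonset aws-node -n kube-system
kubectl logs -n kube-system -l k8s-app=aws-node --tail=20
```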

## Pods Stuck in Pending

Run `kubectl describe pod <pod-name>` and check Events. Common causes:

**Insufficient resources:** Events show "0/3 nodes are available: 3 Insufficient cpu". Add nodes or configure Karpenter.
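To see how much headroom each node actually has, compare requested resources against allocatable capacity:

```bash
# Per-node view of requested vs allocatable CPU and memory
kubectl describe nodes | grep -A 8 "Allocated resources"
```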

**Fargate profile mismatch:** Fargate pods only schedule if namespace AND labels match a profile selector exactly. Check with `aws eks describe-fargate-profile`. Mismatches produce no useful error -- just "no nodes available."
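To compare a profile's selectors against the pod's namespace and labels (`my-profile` is a placeholder name):

```bash
# Selectors must match the pod's namespace AND every listed label exactly
aws eks describe-fargate-profile --cluster-name my-cluster \
  --fargate-profile-name my-profile \
  --query "fargateProfile.selectors"
```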

**VPC CNI IP exhaustion:** Nodes have capacity but pods stay Pending. Check `aws-node` logs for "ipamd: no available IP addresses":

```bash
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50

# Check remaining IPs
aws ec2 describe-subnets --subnet-ids subnet-xxx \
  --query "Subnets[].{ID:SubnetId,Available:AvailableIpAddressCount}"
```

If subnets are nearly full, enable prefix delegation or add subnets.
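Prefix delegation is controlled by an environment variable on the `aws-node` DaemonSet. One way to enable it (requires VPC CNI 1.9+ and Nitro-based instance types):

```bash
# Assign /28 IPv4 prefixes per ENI instead of individual secondary IPs
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```

Only newly launched nodes benefit fully; existing nodes keep their current ENI allocation until replaced.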

## ALB/NLB Not Routing Traffic

First verify the controller is running: `kubectl get deployment -n kube-system aws-load-balancer-controller`. Check its logs for IAM errors.

### Subnet Tagging

Missing tags are the single most common reason ALBs/NLBs fail to create. Required tags:

- Public subnets (internet-facing LBs): `kubernetes.io/role/elb = 1`
- Private subnets (internal LBs): `kubernetes.io/role/internal-elb = 1`
- All subnets: `kubernetes.io/cluster/<cluster-name> = shared`

```bash
aws ec2 describe-subnets --subnet-ids subnet-xxx \
  --query "Subnets[].Tags[?Key=='kubernetes.io/role/elb']"
```
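If a tag is missing, add it; `subnet-xxx` and the cluster name are placeholders:

```bash
# Tag a public subnet for internet-facing load balancer discovery
aws ec2 create-tags --resources subnet-xxx \
  --tags Key=kubernetes.io/role/elb,Value=1 \
         Key=kubernetes.io/cluster/my-cluster,Value=shared
```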

### Target Group Health Checks Failing

If all targets are unhealthy, no traffic routes. Check with `aws elbv2 describe-target-health --target-group-arn <arn>`. Common causes: health check path returns non-200, security group blocks ALB-to-pod traffic, or pod listens on wrong port.
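A query that surfaces only the unhealthy targets along with the reason codes:

```bash
# Show target ID, reason, and description for every non-healthy target
aws elbv2 describe-target-health --target-group-arn <arn> \
  --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.Reason,TargetHealth.Description]"
```

`Target.Timeout` reasons usually point at security groups; `Target.ResponseCodeMismatch` points at the health check path or port.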

### Ingress Shows No ADDRESS

Run `kubectl describe ingress my-app -n production` and check Events for "Failed to create load balancer" with subnet or IAM errors.

## EBS Volumes Not Attaching

### EBS CSI Driver Not Installed

EKS 1.23+ requires the EBS CSI driver add-on for EBS-backed PersistentVolumes; the in-tree provisioner no longer handles new volumes. If PVCs stay in Pending:

```bash
# Check if the driver is installed
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

# Install it
aws eks create-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::123456789012:role/ebs-csi-role
```

The driver needs an IAM role with `ec2:CreateVolume`, `ec2:AttachVolume`, `ec2:DetachVolume`, `ec2:DeleteVolume`, and related permissions.
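One way to create that role is via eksctl with the AWS-managed `AmazonEBSCSIDriverPolicy`; the role name and OIDC setup here are illustrative:

```bash
# Create an IRSA role for the EBS CSI controller service account
# (assumes an OIDC provider is already associated with the cluster)
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name ebs-csi-controller-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-only \
  --role-name ebs-csi-role \
  --approve
```

`--role-only` creates just the IAM role, since the add-on manages the service account itself.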

### Availability Zone Mismatch

EBS volumes are AZ-specific. If a pod is scheduled to `us-east-1a` but the PersistentVolume was created in `us-east-1b`, the volume cannot attach.

```bash
# Check the PV's AZ
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'

# Check the pod's node AZ
kubectl get node <node-name> -L topology.kubernetes.io/zone
```

Fix: use a StorageClass with `volumeBindingMode: WaitForFirstConsumer` so the volume is created in the same AZ as the pod:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```

## DNS Resolution Failures

### CoreDNS Not Running

```bash
kubectl get pods -n kube-system -l k8s-app=kube-dns
```

If CoreDNS pods are crashing, check logs:

```bash
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```

Common cause: CoreDNS tries to reach upstream DNS (the VPC DNS resolver at the VPC CIDR base +2, e.g., 10.0.0.2). If the node security group blocks outbound UDP/TCP port 53 to this address, DNS fails for the entire cluster.
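To test resolution from inside the cluster, a throwaway pod works:

```bash
# One-shot DNS check; failure here but success on the node itself
# points at CoreDNS or the security group paths above
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```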

### Pod DNS Not Working But Node DNS Works

If pods cannot resolve external names but nodes can, check that the `kube-dns` Service in `kube-system` has endpoints and that pods have the correct `/etc/resolv.conf`:

```bash
kubectl exec <pod> -- cat /etc/resolv.conf
# Should show: nameserver 172.20.0.10 (the kube-dns ClusterIP)

kubectl get endpoints kube-dns -n kube-system
# Should show CoreDNS pod IPs
```

## CloudWatch Container Insights

Enable Container Insights for cluster, node, and pod metrics:

```bash
aws eks create-addon --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability
```

This sends metrics to CloudWatch under the `ContainerInsights` namespace -- node CPU, pod memory, request vs capacity, and restart counts. Query container logs with CloudWatch Logs Insights when debugging pod restarts.
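Application logs land in `/aws/containerinsights/<cluster>/application`. A sketch of querying one pod's recent logs from the CLI, assuming GNU `date` (Linux); the pod name filter is a placeholder:

```bash
# Start a Logs Insights query over the last hour of application logs
aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, log | filter kubernetes.pod_name like "my-app" | sort @timestamp desc | limit 50'
```

`start-query` returns a query ID; fetch results with `aws logs get-query-results --query-id <id>`.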

