EKS Troubleshooting#

EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.

Nodes Not Joining the Cluster#

Symptoms: kubectl get nodes shows fewer nodes than expected. The Auto Scaling group shows instances running, but they never register with the cluster.

aws-auth ConfigMap Missing Node Role#

This is the most common cause. Worker nodes authenticate to the control plane through the aws-auth ConfigMap in kube-system. If the node IAM role is not mapped there, the kubelet's requests are silently rejected and the nodes never register.
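
A minimal mapRoles entry for the node role looks like this (the account ID and role name are placeholders; substitute the ARN of your node IAM role):

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/eks-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

Edit it in place with kubectl edit configmap aws-auth -n kube-system; once the mapping is correct, pending nodes typically register within a minute or two.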

etcd Maintenance for Self-Managed Clusters#

etcd is the backing store for all Kubernetes cluster state. Every object – pods, services, secrets, configmaps – lives in etcd. If etcd is unhealthy, your cluster is unhealthy. If etcd data is lost, your cluster is gone. Managed Kubernetes services (EKS, GKE, AKS) handle etcd for you, but self-managed clusters require you to operate it directly.

All etcdctl commands below require TLS flags. Set these as environment variables to avoid repeating them:
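
A typical setup on a kubeadm-provisioned control-plane node looks like the following; the certificate paths are the kubeadm defaults, so adjust them for your own PKI layout:

export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Quick sanity check that the flags work
etcdctl --endpoints=https://127.0.0.1:2379 endpoint health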

From Empty Cluster to Production-Ready: The Complete Setup Sequence#

This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.

Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.


Phase 1 – Foundation (Day 1)#

Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.
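
As a minimal sketch of that verification (assuming nothing beyond kubectl access), confirm nodes are Ready, system pods are healthy, and a default StorageClass exists before moving on:

kubectl get nodes                    # every node should report Ready
kubectl get pods -n kube-system      # core components Running, nothing crash-looping
kubectl get storageclass             # at least one class marked (default)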

GKE Networking#

GKE networking centers on VPC-native clusters, where pods and services get IP addresses from VPC subnet ranges. This integrates Kubernetes networking directly into Google Cloud’s VPC, enabling native routing, firewall rules, and load balancing without extra overlays.

VPC-Native Clusters and Alias IP Ranges#

VPC-native clusters use alias IP ranges on the subnet. You allocate two secondary ranges: one for pods, one for services.

# Create subnet with secondary ranges
gcloud compute networks subnets create gke-subnet \
  --network my-vpc \
  --region us-central1 \
  --range 10.0.0.0/20 \
  --secondary-range pods=10.4.0.0/14,services=10.8.0.0/20

# Create cluster using those ranges
gcloud container clusters create my-cluster \
  --region us-central1 \
  --network my-vpc \
  --subnetwork gke-subnet \
  --cluster-secondary-range-name pods \
  --services-secondary-range-name services \
  --enable-ip-alias

The pod range needs to be large. A /14 gives about 262,000 pod IPs. Each node reserves a /24 from the pod range (256 IPs, 110 usable pods per node). If you have 100 nodes, that consumes 100 /24 blocks. Undersizing the pod range is a common cause of IP exhaustion – the cluster cannot add nodes even though VMs are available.
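
To see how much of the pod range is already committed, list the block each node has been assigned; this is a read-only check that works on any existing cluster:

kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR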

GKE Security and Identity#

GKE security covers identity (who can do what), workload isolation (sandboxing untrusted code), supply chain integrity (ensuring only trusted images run), and data protection (encryption at rest). These features layer on top of standard Kubernetes RBAC and network policies.

Workload Identity Federation#

Workload Identity Federation is the successor to the original Workload Identity. It removes the need for a separate workload-pool flag and uses the standard GCP IAM federation model. The concept is the same: bind a Kubernetes service account to a Google Cloud service account so pods get GCP credentials without exported keys.
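
As an illustration of that binding (all names here are placeholders, not a prescribed layout): grant the Kubernetes service account permission to impersonate the Google service account, then annotate it so GKE knows which identity to issue.

# Allow the KSA (namespace/name) to impersonate the GSA
gcloud iam service-accounts add-iam-policy-binding \
  my-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"

# Point the KSA at the GSA
kubectl annotate serviceaccount my-ksa --namespace my-namespace \
  iam.gke.io/gcp-service-account=my-gsa@my-project.iam.gserviceaccount.com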

GKE Setup and Configuration#

GKE is Google’s managed Kubernetes service. The two major decisions when creating a cluster are the mode (Standard vs Autopilot) and the networking model (VPC-native is now the default and the only option for new clusters). Everything else – node pools, release channels, Workload Identity – layers on top of those choices.

Standard vs Autopilot#

Standard mode gives you full control over node pools, machine types, and node configuration. You manage capacity, pay per node (whether pods are using the resources or not), and can run DaemonSets, privileged containers, and host-network pods.
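
For reference, the create commands differ mainly in the subcommand; the cluster name and region below are placeholders:

# Standard: you define, scale, and pay for node pools
gcloud container clusters create my-cluster --region us-central1

# Autopilot: Google manages nodes; you pay per pod resource request
gcloud container clusters create-auto my-cluster --region us-central1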

GKE Troubleshooting#

GKE adds a layer of Google Cloud infrastructure on top of Kubernetes, which means some problems are pure Kubernetes issues and others are GKE-specific. This guide covers the GKE-specific problems that trip people up.

Autopilot Resource Adjustment#

Autopilot automatically mutates pod resource requests to fit its scheduling model. If you request cpu: 100m and memory: 128Mi, Autopilot may bump the request to cpu: 250m and memory: 512Mi. This affects your billing (you pay per resource request) and can cause unexpected OOMKills if the limits were set relative to the original request.
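
Because the adjustment happens at admission time, the simplest way to see what Autopilot actually granted is to read the requests back from the running pod (pod name and namespace are placeholders):

kubectl get pod my-pod -n my-namespace \
  -o jsonpath='{.spec.containers[*].resources.requests}{"\n"}'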

GPU and ML Workloads on Kubernetes: Scheduling, Sharing, and Monitoring#

Running GPU workloads on Kubernetes requires hardware-aware scheduling that the default scheduler does not provide out of the box. GPUs are expensive – an NVIDIA A100 node costs $3-12/hour on cloud providers – so efficient utilization matters far more than with CPU workloads. This article covers the full stack from device plugin installation through GPU sharing and monitoring.

The NVIDIA Device Plugin#

Kubernetes has no native understanding of GPUs. The NVIDIA device plugin bridges that gap by exposing GPUs as a schedulable resource (nvidia.com/gpu). Without it, the scheduler has no idea which nodes have GPUs or how many are available.
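
Once the plugin is installed, a pod requests GPUs through resource limits. A minimal sketch (the image tag is illustrative; any CUDA-enabled image works):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # whole GPUs only, unless sharing is configured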

HashiCorp Vault on Kubernetes: Secrets Management Done Right#

Vault centralizes secret management with dynamic credentials, encryption as a service, and fine-grained access control. On Kubernetes, workloads authenticate using service accounts and pull secrets without hardcoding anything.

Installation with Helm#

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update

Dev Mode (Single Pod, In-Memory)#

Dev mode is automatically initialized and unsealed, stores everything in memory, and loses all data on restart. The root token is root. Never use this in production.

helm upgrade --install vault hashicorp/vault \
  --namespace vault --create-namespace \
  --set server.dev.enabled=true \
  --set injector.enabled=true

Production Mode (HA with Integrated Raft Storage)#

Run Vault in HA mode with Raft consensus – a 3-node StatefulSet with persistent storage.
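
A sketch of the corresponding chart values follows; the release and namespace mirror the dev-mode example above, while the replica count and volume size are assumptions to adapt:

helm upgrade --install vault hashicorp/vault \
  --namespace vault --create-namespace \
  --set server.ha.enabled=true \
  --set server.ha.replicas=3 \
  --set server.ha.raft.enabled=true \
  --set server.dataStorage.size=10Gi

Unlike dev mode, each pod starts sealed: you still need to run vault operator init and unseal the pods (or configure auto-unseal) before the cluster serves requests.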

Helm Chart Development: Templates, Helpers, and Testing#

Writing your own Helm charts turns static YAML into reusable, configurable packages. The learning curve is in Go’s template syntax and Helm’s conventions, but once you internalize the patterns, chart development is fast.

Chart Structure#

Create a new chart scaffold:

helm create my-app

This generates:

my-app/
  Chart.yaml              # chart metadata (name, version, dependencies)
  values.yaml             # default configuration values
  charts/                 # dependency charts (populated by helm dependency update)
  templates/              # Kubernetes manifest templates
    deployment.yaml
    service.yaml
    ingress.yaml
    serviceaccount.yaml
    hpa.yaml
    NOTES.txt             # post-install instructions (printed after helm install)
    _helpers.tpl          # named template definitions
    tests/
      test-connection.yaml # helm test pod

Chart.yaml#

The Chart.yaml defines your chart’s identity and dependencies:
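
A representative Chart.yaml for the scaffold above; the dependency entry is illustrative (helm create does not generate one) and the versions are placeholders:

apiVersion: v2
name: my-app
description: A Helm chart for my-app
type: application
version: 0.1.0          # chart version; bump on every chart change
appVersion: "1.0.0"     # version of the application being packaged
dependencies:
  - name: postgresql
    version: "15.x.x"
    repository: https://charts.bitnami.com/bitnami
    condition: postgresql.enabled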