EKS Setup and Configuration#

Amazon EKS runs the Kubernetes control plane for you – managed etcd, API server, scheduler, and controller manager across multiple AZs. You are responsible for the worker nodes, networking configuration, and add-ons.

Cluster Creation Methods#

eksctl is the fastest path to a working cluster. It creates the VPC, subnets, NAT gateway, IAM roles, node groups, and kubeconfig in one command:

eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --version 1.31 \
  --nodegroup-name workers \
  --node-type m6i.large \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 10 \
  --managed

For repeatable setups, use a ClusterConfig file:
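
The sketch below mirrors the flags from the command above (values are illustrative; adjust for your environment). Apply it with eksctl create cluster -f cluster.yaml:

# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
  version: "1.31"
managedNodeGroups:
  - name: workers            # managed node group, equivalent to --managed above
    instanceType: m6i.large
    desiredCapacity: 3
    minSize: 2
    maxSize: 10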

EKS Troubleshooting#

EKS failure modes combine Kubernetes problems with AWS-specific issues. Most fall into a handful of categories: IAM permissions, networking/security groups, missing tags, and add-on misconfiguration.

Nodes Not Joining the Cluster#

Symptoms: kubectl get nodes shows fewer nodes than expected. The Auto Scaling group shows instances running, but they never register with the API server.

aws-auth ConfigMap Missing Node Role#

This is the most common cause. Worker nodes authenticate to the API server through the aws-auth ConfigMap in kube-system; if the node IAM role is not mapped there, nodes are rejected silently.
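
To confirm, read the ConfigMap and check that the node role ARN appears under mapRoles. A correct entry looks roughly like this (the account ID and role name are placeholders):

kubectl get configmap aws-auth -n kube-system -o yaml

# Relevant fragment of the ConfigMap; rolearn must be the node IAM role (placeholder shown)
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/eks-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

If the entry is missing, eksctl create iamidentitymapping can add it without hand-editing the ConfigMap.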

Ephemeral Cloud Clusters: Create, Validate, Destroy Sequences for EKS, GKE, and AKS#

Ephemeral clusters exist for one purpose: validate something, then disappear. They are not staging environments, not shared dev clusters, not long-lived resources that someone forgets to turn off. The operational model is strict – create, validate, destroy – and the entire sequence must be automated so that destruction cannot be forgotten.

The cost of getting this wrong is real. A three-node EKS cluster left running over a weekend costs roughly $15. Left running for a month, $200. Multiply by the number of developers or CI pipelines that create clusters, and forgotten ephemeral infrastructure becomes a significant budget line item. Every template in this article includes auto-destroy mechanisms to prevent this.
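
A minimal sketch of the idea (separate from the templates this article provides): register the destroy step before validation runs, so cleanup happens even when creation or validation fails. The validate.sh script is a placeholder for whatever the cluster exists to verify.

#!/usr/bin/env bash
set -euo pipefail

CLUSTER_NAME="ephemeral-$(date +%s)"

# Register the destroy step first so it runs on any exit, success or failure.
trap 'eksctl delete cluster --name "$CLUSTER_NAME" --wait' EXIT

eksctl create cluster --name "$CLUSTER_NAME" --region us-east-1 --nodes 3 --managed

# validate.sh is a placeholder for the actual validation workload.
./validate.sh "$CLUSTER_NAME"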

etcd Maintenance for Self-Managed Clusters#

etcd is the backing store for all Kubernetes cluster state. Every object – pods, services, secrets, configmaps – lives in etcd. If etcd is unhealthy, your cluster is unhealthy. If etcd data is lost, your cluster is gone. Managed Kubernetes services (EKS, GKE, AKS) handle etcd for you, but self-managed clusters require you to operate it directly.

All etcdctl commands below require TLS flags. Set these as environment variables to avoid repeating them:
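
The paths below assume kubeadm's default certificate layout on an etcd member; adjust them for your installation:

export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

# Quick sanity check that the variables are picked up
etcdctl endpoint health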

From Empty Cluster to Production-Ready: The Complete Setup Sequence#

This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.

Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.


Phase 1 – Foundation (Day 1)#

Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.
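
The specific checks depend on your distribution and storage driver, but a minimal verification pass looks roughly like this (namespace names are examples):

# All nodes Ready and on the expected version
kubectl get nodes -o wide

# Control-plane components and cluster DNS healthy
kubectl get pods -n kube-system

# Baseline namespaces exist (names are examples)
kubectl get namespace platform staging production

# A default StorageClass is set (look for "(default)" in the output)
kubectl get storageclass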

GitLab CI/CD Pipeline Patterns: Stages, DAG Pipelines, Includes, and Registry Integration#

GitLab CI/CD runs pipelines defined in a .gitlab-ci.yml file at the repository root. Every push, merge request, or tag triggers a pipeline consisting of stages that contain jobs. The pipeline configuration is version-controlled alongside your code, so the build process evolves with the application.

Basic .gitlab-ci.yml Structure#

A minimal pipeline defines stages and jobs. Stages run sequentially; jobs within the same stage run in parallel:

stages:
  - build
  - test
  - deploy

build-app:
  stage: build
  image: golang:1.22
  script:
    - go build -o myapp ./cmd/myapp
  artifacts:
    paths:
      - myapp
    expire_in: 1 hour

unit-tests:
  stage: test
  image: golang:1.22
  script:
    - go test ./... -v -coverprofile=coverage.out
    # coverage_report expects Cobertura XML, not Go's native profile, so convert first
    - go run github.com/boumenot/gocover-cobertura@latest < coverage.out > coverage.xml
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

deploy-staging:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl set image deployment/myapp myapp=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  environment:
    name: staging
    url: https://staging.example.com
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

Every job needs a script (trigger jobs are the exception); if stage is omitted, it defaults to test. The image field specifies the Docker image the job runs inside. If omitted, it falls back to the pipeline-level default image or the runner’s default.
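
For example, a pipeline-level default (a sketch; the alpine image and lint job are arbitrary):

default:
  image: alpine:3.20        # fallback for any job that does not set its own image

lint:
  stage: test
  script:
    - echo "runs in alpine:3.20 via the pipeline default"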

GKE Networking#

GKE networking centers on VPC-native clusters, where pods and services get IP addresses from VPC subnet ranges. This integrates Kubernetes networking directly into Google Cloud’s VPC, enabling native routing, firewall rules, and load balancing without extra overlays.

VPC-Native Clusters and Alias IP Ranges#

VPC-native clusters use alias IP ranges on the subnet. You allocate two secondary ranges: one for pods, one for services.

# Create subnet with secondary ranges
gcloud compute networks subnets create gke-subnet \
  --network my-vpc \
  --region us-central1 \
  --range 10.0.0.0/20 \
  --secondary-range pods=10.4.0.0/14,services=10.8.0.0/20

# Create cluster using those ranges
gcloud container clusters create my-cluster \
  --region us-central1 \
  --network my-vpc \
  --subnetwork gke-subnet \
  --cluster-secondary-range-name pods \
  --services-secondary-range-name services \
  --enable-ip-alias

The pod range needs to be large. A /14 gives about 262,000 pod IPs. Each node reserves a /24 from the pod range (256 IPs, 110 usable pods per node). If you have 100 nodes, that consumes 100 /24 blocks. Undersizing the pod range is a common cause of IP exhaustion – the cluster cannot add nodes even though VMs are available.
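
Two quick checks before creating the cluster: the arithmetic for maximum node count, and reading back the subnet's secondary ranges (the subnet name and region match the example above):

# Max nodes = number of /24 blocks in the pod range = 2^(24 - prefix)
#   /14 pod range -> 2^10 = 1024 nodes
#   /19 pod range -> 2^5  = 32 nodes (undersized for most clusters)

gcloud compute networks subnets describe gke-subnet \
  --region us-central1 \
  --format="value(secondaryIpRanges)"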

GKE Security and Identity#

GKE security covers identity (who can do what), workload isolation (sandboxing untrusted code), supply chain integrity (ensuring only trusted images run), and data protection (encryption at rest). These features layer on top of standard Kubernetes RBAC and network policies.

Workload Identity Federation#

Workload Identity Federation for GKE is the successor to the original Workload Identity and plugs into the standard GCP IAM federation model. The core idea is unchanged: bind a Kubernetes service account to a Google Cloud service account so pods get GCP credentials without exported keys.
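
A sketch of the binding flow, with placeholder project, namespace, and account names:

# Allow the Kubernetes service account to impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding app-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/app-ksa]"

# Annotate the Kubernetes service account so GKE injects the Google identity
kubectl annotate serviceaccount app-ksa \
  --namespace my-namespace \
  iam.gke.io/gcp-service-account=app-gsa@my-project.iam.gserviceaccount.com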

GKE Setup and Configuration#

GKE is Google’s managed Kubernetes service. The two major decisions when creating a cluster are the mode (Standard vs Autopilot) and the networking model (VPC-native is now the default and the only option for new clusters). Everything else – node pools, release channels, Workload Identity – layers on top of those choices.

Standard vs Autopilot#

Standard mode gives you full control over node pools, machine types, and node configuration. You manage capacity, pay per node (whether pods are using the resources or not), and can run DaemonSets, privileged containers, and host-network pods.
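
For comparison, the create commands for the two modes (cluster names and machine type are placeholders):

# Autopilot: Google manages nodes; you pay per pod resource request
gcloud container clusters create-auto my-autopilot-cluster \
  --region us-central1

# Standard: you choose machine types and node counts, and pay per node
gcloud container clusters create my-standard-cluster \
  --region us-central1 \
  --machine-type e2-standard-4 \
  --num-nodes 3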

GKE Troubleshooting#

GKE adds a layer of Google Cloud infrastructure on top of Kubernetes, which means some problems are pure Kubernetes issues and others are GKE-specific. This guide covers the GKE-specific problems that trip people up.

Autopilot Resource Adjustment#

Autopilot automatically mutates pod resource requests to fit its scheduling model. If you request cpu: 100m and memory: 128Mi, Autopilot may bump the request to cpu: 250m and memory: 512Mi. This affects your billing (you pay per resource request) and can cause unexpected OOMKills if the limits were set relative to the original request.
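
To see what Autopilot actually admitted, read the resources back from the live pod rather than trusting the manifest (the pod name is a placeholder):

# Show the requests and limits Autopilot set on the running pod
kubectl get pod myapp-7d9f8-abcde \
  -o jsonpath='{.spec.containers[0].resources}'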