---
title: "Ephemeral Cloud Clusters: Create, Validate, Destroy Sequences for EKS, GKE, and AKS"
description: "Operational sequence for creating and destroying ephemeral test clusters on AWS EKS, GCP GKE, and Azure AKS. Covers Terraform modules with auto-destroy mechanisms, cost estimation, and fully automated create-validate-destroy pipelines to prevent cost leakage."
url: https://agent-zone.ai/knowledge/infrastructure/ephemeral-cloud-clusters/
section: knowledge
date: 2026-02-22
categories: ["infrastructure"]
tags: ["ephemeral-clusters","eks","gke","aks","terraform","auto-destroy","cost-management","ci-cd","kubernetes","cloud-testing"]
skills: ["ephemeral-infrastructure","terraform-modules","cloud-cost-management","automated-testing"]
tools: ["terraform","aws-cli","gcloud","az","kubectl","helm"]
levels: ["intermediate","advanced"]
word_count: 1744
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/ephemeral-cloud-clusters/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/ephemeral-cloud-clusters/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Ephemeral+Cloud+Clusters%3A+Create%2C+Validate%2C+Destroy+Sequences+for+EKS%2C+GKE%2C+and+AKS
---


# Ephemeral Cloud Clusters

Ephemeral clusters exist for one purpose: validate something, then disappear. They are not staging environments, not shared dev clusters, not long-lived resources that someone forgets to turn off. The operational model is strict -- create, validate, destroy -- and the entire sequence must be automated so that destruction cannot be forgotten.

The cost of getting this wrong is real. A three-node EKS cluster left running over a weekend costs roughly $15. Left running for a month, $200. Multiply by the number of developers or CI pipelines that create clusters, and forgotten ephemeral infrastructure becomes a significant budget line item. Every template in this article includes auto-destroy mechanisms to prevent this.

## The Create-Validate-Destroy Pattern

Every ephemeral cluster follows the same lifecycle:

1. **Create** -- Terraform provisions the cluster with minimal configuration. No monitoring stack, no ingress controllers, no persistent storage unless the validation requires it.
2. **Configure** -- Get kubeconfig, install any test dependencies (a Helm chart being validated, a set of manifests, a database operator).
3. **Validate** -- Run the actual tests. Helm install succeeds, pods reach Running state, services respond on expected ports, integration tests pass.
4. **Destroy** -- Terraform destroys everything. No partial cleanup, no orphaned resources.

The critical rule: steps 1 through 4 must execute in a single automated sequence. If step 3 fails, step 4 still runs. If step 2 fails, step 4 still runs. The only acceptable outcome is that the cluster no longer exists when the sequence completes.

```bash
#!/bin/bash
set -euo pipefail

# Resolve to absolute paths so the cd below and the EXIT trap both work,
# regardless of the working directory when the script is invoked or exits.
CLUSTER_DIR="$(cd "$1" && pwd)"
VALIDATION_SCRIPT="$(cd "$(dirname "$2")" && pwd)/$(basename "$2")"

cleanup() {
  echo "Destroying ephemeral cluster..."
  cd "$CLUSTER_DIR"
  terraform destroy -auto-approve -input=false 2>&1 | tail -20
}
trap cleanup EXIT

cd "$CLUSTER_DIR"
terraform init -input=false
terraform apply -auto-approve -input=false

# Extract kubeconfig
terraform output -raw kubeconfig > /tmp/ephemeral-kubeconfig
export KUBECONFIG=/tmp/ephemeral-kubeconfig

# Wait for nodes to be ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# Run validation
bash "$VALIDATION_SCRIPT"
```

The `trap cleanup EXIT` is the most important line. It ensures `terraform destroy` runs regardless of how the script exits -- success, failure, or signal.
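
To see why the trap is reliable, the mechanism can be exercised in isolation. The sketch below uses a hypothetical `run_with_cleanup` helper (not part of the wrapper script above) to show the cleanup firing on both success and failure:

```bash
#!/bin/bash
# Demonstrates trap-on-EXIT semantics in isolation.
run_with_cleanup() {
  (
    # The subshell scopes the trap to this one invocation.
    trap 'echo "cleanup: terraform destroy would run here"' EXIT
    echo "validate: running $*"
    "$@"
  )
}

run_with_cleanup true                                       # cleanup still prints
run_with_cleanup false || echo "validation failed, but cleanup already ran"
```

The same guarantee holds for signals: an interrupted script still reaches the EXIT trap, which is exactly the property the destroy step depends on.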

## Ephemeral EKS on AWS

### Terraform Configuration

This module creates a minimal EKS cluster with managed node groups. It uses the official `terraform-aws-modules/eks/aws` module to avoid reinventing VPC and IAM configuration.

```hcl
# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region
}

variable "region" {
  default = "us-east-1"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  description = "Hours before auto-destroy (used for tagging)"
  default     = 4
}

locals {
  destroy_after = timeadd(timestamp(), "${var.ttl_hours}h")
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${var.cluster_name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.region}a", "${var.region}b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = local.destroy_after
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  cluster_endpoint_public_access = true

  eks_managed_node_groups = {
    ephemeral = {
      instance_types = ["t3.medium"]
      min_size       = 1
      max_size       = 3
      desired_size   = 2
    }
  }

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = local.destroy_after
  }
}

output "kubeconfig" {
  value = <<-EOT
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        server: ${module.eks.cluster_endpoint}
        certificate-authority-data: ${module.eks.cluster_certificate_authority_data}
      name: ${var.cluster_name}
    contexts:
    - context:
        cluster: ${var.cluster_name}
        user: ${var.cluster_name}
      name: ${var.cluster_name}
    current-context: ${var.cluster_name}
    users:
    - name: ${var.cluster_name}
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          command: aws
          args: ["eks", "get-token", "--cluster-name", "${var.cluster_name}", "--region", "${var.region}"]
  EOT
  sensitive = true
}
```

### Cost Estimate

EKS control plane: $0.10/hour. Two t3.medium nodes: $0.0416/hour each. NAT gateway: $0.045/hour. Total: approximately $0.23/hour or $5.50/day. The single NAT gateway and two-AZ VPC are the cheapest configuration that still allows EKS to function (EKS requires subnets in at least two AZs).
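
The hourly figure is just the sum of the components; a quick back-of-envelope check using this article's estimates (not live pricing, which varies by region and over time):

```bash
# Reproduces the ~$0.23/hr EKS estimate from the per-component prices above.
awk 'BEGIN {
  control = 0.10     # EKS control plane, $/hr
  node    = 0.0416   # one t3.medium on-demand, $/hr
  nat     = 0.045    # single NAT gateway, $/hr
  hourly  = control + 2 * node + nat
  printf "hourly=$%.3f daily=$%.2f\n", hourly, hourly * 24
}'
# prints: hourly=$0.228 daily=$5.48
```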

### Apply and Validate

```bash
terraform init -input=false
terraform apply -auto-approve -input=false -var="cluster_name=test-$(date +%s)"

terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig

# Validate cluster is functional
kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=120s
kubectl delete namespace validation-test
```

## Ephemeral GKE on GCP

GKE Autopilot is the best choice for ephemeral clusters because you pay only for running pods, there are no idle node costs, and you do not need to manage node pools.

### Terraform Configuration

```hcl
# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

variable "project_id" {
  description = "GCP project ID"
}

variable "region" {
  default = "us-central1"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  default = 4
}

resource "google_container_cluster" "ephemeral" {
  name     = var.cluster_name
  location = var.region

  enable_autopilot = true

  release_channel {
    channel = "RAPID"
  }

  resource_labels = {
    environment   = "ephemeral"
    destroy-after = formatdate("YYYY-MM-DD-hh-mm", timeadd(timestamp(), "${var.ttl_hours}h"))
  }

  deletion_protection = false
}

output "kubeconfig" {
  value = <<-EOT
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        server: https://${google_container_cluster.ephemeral.endpoint}
        certificate-authority-data: ${google_container_cluster.ephemeral.master_auth[0].cluster_ca_certificate}
      name: ${var.cluster_name}
    contexts:
    - context:
        cluster: ${var.cluster_name}
        user: ${var.cluster_name}
      name: ${var.cluster_name}
    current-context: ${var.cluster_name}
    users:
    - name: ${var.cluster_name}
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          command: gke-gcloud-auth-plugin
          installHint: "Install gke-gcloud-auth-plugin for kubectl"
  EOT
  sensitive = true
}
```

### Cost Estimate

GKE Autopilot charges per pod resource: $0.000017/vCPU-second, $0.000002/GB-second. The management fee is $0.10/hour. For a typical validation workload running 2 vCPUs and 4GB RAM for one hour, the cost is approximately $0.22. Autopilot has no idle node costs -- if no pods are running, you pay only the management fee.

### Apply and Validate

```bash
terraform init -input=false
terraform apply -auto-approve -input=false \
  -var="project_id=my-project" \
  -var="cluster_name=eph-$(date +%s)"

terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig

# GKE Autopilot may take a few minutes to schedule pods
kubectl get nodes
kubectl create namespace validation-test
cat <<'EOF' | kubectl apply -n validation-test -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:alpine
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
EOF
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=300s
kubectl delete namespace validation-test
```

Note the explicit resource requests in the pod spec. Autopilot requires requests on all pods and assigns defaults to pods that omit them, which may not match your expectations. A manifest is used here because recent kubectl releases removed the `--requests` flag from `kubectl run`.

## Ephemeral AKS on Azure

### Terraform Configuration

```hcl
# main.tf
terraform {
  required_version = ">= 1.5"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

variable "location" {
  default = "eastus"
}

variable "cluster_name" {
  default = "ephemeral"
}

variable "ttl_hours" {
  default = 4
}

resource "azurerm_resource_group" "ephemeral" {
  name     = "${var.cluster_name}-rg"
  location = var.location

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = timeadd(timestamp(), "${var.ttl_hours}h")
  }
}

resource "azurerm_kubernetes_cluster" "ephemeral" {
  name                = var.cluster_name
  location            = azurerm_resource_group.ephemeral.location
  resource_group_name = azurerm_resource_group.ephemeral.name
  dns_prefix          = var.cluster_name

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_B2s"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment  = "ephemeral"
    DestroyAfter = timeadd(timestamp(), "${var.ttl_hours}h")
  }
}

output "kubeconfig" {
  value     = azurerm_kubernetes_cluster.ephemeral.kube_config_raw
  sensitive = true
}
```

### Cost Estimate

Two Standard_B2s nodes: approximately $0.042/hour each. AKS control plane: free (unlike EKS). Total: approximately $0.084/hour or $2.00/day. AKS is the cheapest option for ephemeral clusters because the control plane has no charge.

### Apply and Validate

```bash
terraform init -input=false
terraform apply -auto-approve -input=false \
  -var="cluster_name=eph-$(date +%s)"

terraform output -raw kubeconfig > /tmp/eph-kubeconfig
export KUBECONFIG=/tmp/eph-kubeconfig

kubectl get nodes
kubectl create namespace validation-test
kubectl run nginx --image=nginx:alpine -n validation-test
kubectl wait --for=condition=Ready pod/nginx -n validation-test --timeout=120s
kubectl delete namespace validation-test
```

## Auto-Destroy Mechanisms

The Terraform configurations above tag resources with a `DestroyAfter` timestamp, but tags alone do not destroy anything. You need an active mechanism to enforce the TTL.

### CI-Triggered Destroy

The simplest approach: the same CI job that creates the cluster also destroys it. The wrapper script at the top of this article demonstrates this. In GitHub Actions:

```yaml
jobs:
  ephemeral-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Create, validate, destroy
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          bash scripts/ephemeral-test.sh ./terraform/eks ./tests/validate.sh
```

### Cron-Based Cleanup

For clusters created outside CI (manual testing, development), run a scheduled cleanup job that finds and destroys expired resources:

```bash
#!/bin/bash
# cleanup-expired-clusters.sh
# Run via cron: 0 * * * * /path/to/cleanup-expired-clusters.sh

NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# AWS: find EKS clusters tagged as ephemeral and past TTL
aws eks list-clusters --output json | jq -r '.clusters[]' | while read -r cluster; do
  destroy_after=$(aws eks describe-cluster --name "$cluster" \
    --query "cluster.tags.DestroyAfter" --output text 2>/dev/null)

  if [[ "$destroy_after" != "None" && "$destroy_after" < "$NOW" ]]; then
    echo "Destroying expired cluster: $cluster (expired: $destroy_after)"
    # Use terraform destroy if state is available; otherwise delete the
    # node group, wait for it to finish, then delete the cluster
    # (delete-cluster fails while node groups still exist):
    aws eks delete-nodegroup --cluster-name "$cluster" --nodegroup-name ephemeral --no-cli-pager
    aws eks wait nodegroup-deleted --cluster-name "$cluster" --nodegroup-name ephemeral
    aws eks delete-cluster --name "$cluster" --no-cli-pager
  fi
done
```

### TTL-Based with AWS Lambda

For a fully automated approach, deploy a Lambda function triggered by EventBridge on a schedule:

```python
# lambda_function.py
import boto3
from datetime import datetime, timezone

def handler(event, context):
    eks = boto3.client('eks')
    clusters = eks.list_clusters()['clusters']

    for cluster_name in clusters:
        cluster = eks.describe_cluster(name=cluster_name)['cluster']
        tags = cluster.get('tags', {})

        if tags.get('Environment') != 'ephemeral':
            continue

        destroy_after = tags.get('DestroyAfter')
        if not destroy_after:
            continue

        if datetime.fromisoformat(destroy_after.replace('Z', '+00:00')) < datetime.now(timezone.utc):
            print(f"Destroying expired cluster: {cluster_name}")
            # Delete node groups first
            nodegroups = eks.list_nodegroups(clusterName=cluster_name)['nodegroups']
            for ng in nodegroups:
                eks.delete_nodegroup(clusterName=cluster_name, nodegroupName=ng)
            # Wait for nodegroups to delete, then delete cluster
            # In production, use a Step Function for this orchestration
```

## How an Agent Should Use These

An agent tasked with validating infrastructure changes should follow this exact sequence:

1. **Select the cheapest provider** for the validation type. If the validation only needs a running Kubernetes cluster (not provider-specific features), use AKS ($0.084/hour) or GKE Autopilot (pay-per-pod). If the validation tests AWS-specific integrations (ALB ingress, EBS CSI, IAM roles for service accounts), use EKS.

2. **Set a tight TTL.** Most validations complete in under 30 minutes. Set `ttl_hours=1` as the default. Only increase it if the validation is known to take longer.

3. **Use the wrapper script.** Never run `terraform apply` without the `trap cleanup EXIT` pattern. The risk of forgetting to destroy is too high.

4. **Fail fast on creation errors.** If the cluster fails to create (quota limits, permissions issues, region capacity), do not retry automatically. Report the error and let a human investigate. Retrying in a loop can create partially-provisioned resources that are harder to clean up.

5. **Log the cost.** After destroy, estimate and log the cost: `duration_hours * hourly_rate`. This creates visibility into ephemeral cluster spending over time.
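
Step 5 can be a small helper. This sketch is illustrative: `log_cluster_cost` is a hypothetical function, and the rate passed in is this article's estimate rather than authoritative cloud pricing:

```bash
# Hypothetical cost logger: duration_hours * hourly_rate.
log_cluster_cost() {
  local provider="$1" start_epoch="$2" end_epoch="$3" hourly_rate="$4"
  awk -v p="$provider" -v s="$start_epoch" -v e="$end_epoch" -v r="$hourly_rate" \
    'BEGIN { h = (e - s) / 3600; printf "provider=%s duration_h=%.2f est_cost_usd=%.4f\n", p, h, h * r }'
}

# Example: a 30-minute EKS run at ~$0.23/hr
log_cluster_cost eks 0 1800 0.23
# prints: provider=eks duration_h=0.50 est_cost_usd=0.1150
```

Appending these lines to a shared log after each destroy gives a running record of ephemeral spend without needing cloud billing exports.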

## Cost Comparison Summary

| Provider | Hourly Cost (2 nodes) | Daily Cost | Control Plane | Best For |
|---|---|---|---|---|
| EKS | ~$0.23/hr | ~$5.50/day | $0.10/hr | AWS-specific testing |
| GKE Autopilot | ~$0.22/hr (varies) | ~$5.30/day | $0.10/hr | Pay-per-pod, no idle cost |
| AKS | ~$0.084/hr | ~$2.00/day | Free | Cheapest option |

These costs assume the cheapest viable node types and minimal configuration. Production-like configurations with larger nodes, multiple AZs, and additional services will cost more.

