Diagnosing Common Terraform Problems

Stuck State Lock#

A CI job was cancelled, a laptop lost network, or a process crashed mid-apply. Terraform refuses to run:

Error acquiring the state lock
Lock Info:
  ID:        f8e7d6c5-b4a3-2109-8765-43210fedcba9
  Operation: OperationTypeApply
  Who:       deploy@ci-runner
  Created:   2026-02-20 09:15:22 +0000 UTC

Verify the lock holder is truly dead. Check CI job status, then:

terraform force-unlock f8e7d6c5-b4a3-2109-8765-43210fedcba9

If the lock was from a crashed apply, the state may be partially updated. Run terraform plan immediately after unlocking to see the current situation.

EKS Setup and Configuration

EKS Setup and Configuration#

Amazon EKS runs the Kubernetes control plane for you – managed etcd, API server, and controller manager across multiple AZs. You are responsible for the worker nodes, networking configuration, and add-ons.

Cluster Creation Methods#

eksctl is the fastest path for a working cluster. It creates the VPC, subnets, NAT gateway, IAM roles, node groups, and kubeconfig in one command:

eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --version 1.31 \
  --nodegroup-name workers \
  --node-type m6i.large \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 10 \
  --managed

For repeatable setups, use a ClusterConfig file:

Ephemeral Cloud Clusters: Create, Validate, Destroy Sequences for EKS, GKE, and AKS

Ephemeral Cloud Clusters#

Ephemeral clusters exist for one purpose: validate something, then disappear. They are not staging environments, not shared dev clusters, not long-lived resources that someone forgets to turn off. The operational model is strict – create, validate, destroy – and the entire sequence must be automated so that destruction cannot be forgotten.

The cost of getting this wrong is real. A three-node EKS cluster left running over a weekend costs roughly $15. Left running for a month, $200. Multiply by the number of developers or CI pipelines that create clusters, and forgotten ephemeral infrastructure becomes a significant budget line item. Every template in this article includes auto-destroy mechanisms to prevent this.

GCP Terraform Patterns: Projects, GKE, Workload Identity, Cloud SQL, and Common Gotchas

GCP Terraform Patterns#

GCP’s Terraform provider (google and google-beta) has patterns distinct from both AWS and Azure. The biggest differences: APIs must be explicitly enabled per project, IAM uses a binding model (not inline policies), and GKE requires secondary IP ranges for VPC-native networking. GCP resources also tend to have longer creation times with more eventual consistency.

Projects and API Enablement#

Before creating any resource in GCP, the corresponding API must be enabled in the project. This is the most common source of first-time failures.

GitOps and Infrastructure as Code: Reconciliation Patterns for Terraform, ArgoCD, and Crossplane

GitOps and Infrastructure as Code#

GitOps says: the desired state is in Git. A controller continuously reconciles the real state to match. Infrastructure as Code says: the desired state is in code. A human (or agent) runs apply to push changes.

These two paradigms overlap but do not align perfectly. Kubernetes resources fit the GitOps model well — ArgoCD/Flux watch Git, detect differences, and apply changes continuously. Cloud infrastructure (VPCs, databases, IAM roles) fits the IaC model better — Terraform tracks state, computes diffs, and applies on command.

GKE Setup and Configuration

GKE Setup and Configuration#

GKE is Google’s managed Kubernetes service. The two major decisions when creating a cluster are the mode (Standard vs Autopilot) and the networking model (VPC-native is now the default and the only option for new clusters). Everything else – node pools, release channels, Workload Identity – layers on top of those choices.

Standard vs Autopilot#

Standard mode gives you full control over node pools, machine types, and node configuration. You manage capacity, pay per node (whether pods are using the resources or not), and can run DaemonSets, privileged containers, and host-network pods.

Grafana Organization: Folders, Permissions, Provisioning, and Dashboard Lifecycle

Folder Structure Strategy#

Grafana folders organize dashboards and control access through permissions. The folder structure you choose determines how teams find dashboards and who can edit them. Three patterns work in practice, each suited to a different organizational shape.

By Team#

When teams own distinct services and rarely need cross-team dashboards:

Platform/
  Node Overview
  Kubernetes Cluster
  Networking
Backend/
  API Gateway
  User Service
  Payment Service
Frontend/
  Web Vitals
  CDN Performance
Data/
  Kafka Pipelines
  ETL Jobs
  Data Quality

Each team gets Editor access to their folder and Viewer access to everything else. This works well when ownership boundaries are clear.

Infrastructure Disaster Recovery with Terraform: State Recovery, Blue-Green Infrastructure, and Rebuild Procedures

Infrastructure Disaster Recovery with Terraform#

Application disaster recovery is well-understood: replicate data, failover traffic, restore from backups. Infrastructure disaster recovery is different — you are recovering the platform that applications run on. If your Terraform state is lost, your VPC is deleted, or an entire region goes down, how do you rebuild?

This article covers the DR patterns specific to Terraform-managed infrastructure: protecting state, recovering from state loss, designing infrastructure for regional failover, and the runbooks that agents and operators need when things go wrong.

Integrating Infrastructure as Code with CI/CD: Patterns for Safe, Automated Infrastructure Delivery

Integrating Infrastructure as Code with CI/CD#

Running Terraform locally works for one person. It breaks down when multiple people (or agents) modify infrastructure concurrently, when changes need review before applying, and when environments (dev/staging/prod) need synchronized promotion. CI/CD pipelines solve this by making the plan-review-apply cycle automated, auditable, and safe.

This article covers the patterns for integrating Terraform into CI/CD — from the basic plan-on-PR flow to multi-directory monorepos with dependency ordering and environment promotion.

Multi-Account Cloud Architecture with Terraform: AWS Organizations, Azure Management Groups, and GCP Organizations

Multi-Account Cloud Architecture with Terraform#

Single-account cloud deployments work for learning and prototypes. Production systems need multiple accounts (AWS), subscriptions (Azure), or projects (GCP) for isolation — security boundaries, blast radius control, billing separation, and compliance requirements.

Terraform manages multi-account architectures well, but the patterns differ significantly from single-account work. Provider configuration, state isolation, cross-account references, and IAM trust relationships all need explicit design.

Why Multiple Accounts#

ReasonSingle Account ProblemMulti-Account Solution
Blast radiusMisconfigured IAM affects everythingDamage limited to one account
BillingCannot attribute costs to teamsPer-account billing and budgets
CompliancePCI data mixed with dev workloadsSeparate accounts for regulated workloads
Service limitsVPC limit of 5 per region sharedEach account has its own limits
Access controlComplex IAM policies to isolate teamsAccount boundary is the strongest isolation
TestingDev resources can affect productionImpossible for dev to touch prod resources

AWS Organizations#

Organization Structure#

Organization Root
├── Core OU
│   ├── Management Account (billing, org management)
│   ├── Security Account (GuardDuty, SecurityHub, audit logs)
│   └── Networking Account (Transit Gateway, shared VPCs)
├── Workload OU
│   ├── Production OU
│   │   ├── App-A Production Account
│   │   └── App-B Production Account
│   └── Non-Production OU
│       ├── App-A Development Account
│       └── App-A Staging Account
└── Sandbox OU
    └── Developer Sandbox Accounts

Terraform for AWS Organizations#

resource "aws_organizations_organization" "main" {
  feature_set = "ALL"

  enabled_policy_types = [
    "SERVICE_CONTROL_POLICY",
    "TAG_POLICY",
  ]
}

resource "aws_organizations_organizational_unit" "core" {
  name      = "Core"
  parent_id = aws_organizations_organization.main.roots[0].id
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = aws_organizations_organization.main.roots[0].id
}

resource "aws_organizations_organizational_unit" "production" {
  name      = "Production"
  parent_id = aws_organizations_organizational_unit.workloads.id
}

# Create a workload account
resource "aws_organizations_account" "app_production" {
  name      = "app-a-production"
  email     = "aws+app-a-prod@example.com"
  parent_id = aws_organizations_organizational_unit.production.id
  role_name = "OrganizationAccountAccessRole"  # cross-account admin role

  lifecycle {
    prevent_destroy = true  # accounts cannot be easily recreated
  }
}

Service Control Policies (SCPs)#

SCPs set permission boundaries for entire OUs: