Advanced Ansible Patterns: Roles, Collections, Dynamic Inventory, Vault, and Testing

Advanced Ansible Patterns#

As infrastructure grows from a handful of servers to hundreds or thousands, Ansible patterns that worked at small scale become bottlenecks. Playbooks that were simple and readable at 10 hosts become tangled at 100. Roles that were self-contained become duplicated across teams. This framework helps you decide which advanced patterns to adopt and when.

Roles vs Collections#

Roles and collections both organize Ansible content, but they serve different purposes and operate at different scales.
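A quick way to see the difference is in a requirements.yml: a role is installed as a single unit of reusable configuration, while a collection bundles roles, modules, and plugins under one namespace. A hedged sketch (the role and collection names below are arbitrary examples, not recommendations):

# requirements.yml -- illustrative only; substitute your own content
roles:
  - name: geerlingguy.nginx        # a standalone role from Galaxy

collections:
  - name: community.postgresql     # a collection: modules, plugins, and roles in one namespace
    version: ">=3.0.0"

Roles are installed with ansible-galaxy role install -r requirements.yml, collections with ansible-galaxy collection install -r requirements.yml.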

Advanced Terraform State Management

Remote Backends#

Every team beyond a single developer needs remote state. The three major backends:

S3 + DynamoDB (AWS):

terraform {
  backend "s3" {
    bucket         = "myorg-tfstate"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Azure Blob Storage:

terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "myorgtfstate"
    container_name       = "tfstate"
    key                  = "prod/network/terraform.tfstate"
  }
}

Google Cloud Storage:

terraform {
  backend "gcs" {
    bucket = "myorg-tfstate"
    prefix = "prod/network"
  }
}

All three backends support state locking (a DynamoDB table for S3, blob leases for Azure, and a lock file written alongside the state for GCS). Always enable encryption at rest and restrict access with IAM.
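For the S3 backend in particular, it is worth enabling bucket versioning alongside encryption so that a corrupted or mistakenly overwritten state file can be recovered. A sketch using the AWS CLI, reusing the bucket name from the example above:

# Keep previous state versions recoverable
aws s3api put-bucket-versioning \
  --bucket myorg-tfstate \
  --versioning-configuration Status=Enabled

# Encrypt state objects at rest by default
aws s3api put-bucket-encryption \
  --bucket myorg-tfstate \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'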

Agent-Oriented Terraform: Linear Patterns for Machine-Managed Infrastructure

Agent-Oriented Terraform#

Most Terraform code is written by humans for humans. It favors abstraction, DRY principles, and deep module nesting — patterns that make sense when a human maintains a mental model of the codebase. Agents do not maintain mental models. They read code fresh each time, trace references to resolve dependencies, and reason about the full resource graph in a single context window.

The patterns that make Terraform elegant for humans make it expensive for agents. Deep module nesting multiplies the files an agent must read. Variable threading through three layers of modules hides dependencies behind indirection. Complex for_each over maps of objects creates resources that are invisible until runtime. The agent spends most of its context on navigation, not comprehension.
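A hedged sketch of the contrast (resource names are illustrative): in the flat style the dependency is a direct reference an agent can follow in one file, while the nested style hides the same wiring behind module inputs.

# Flat: the relationship is visible where the resource is defined
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id      # direct reference, no indirection
  cidr_block = "10.0.1.0/24"
}

# Nested: the same dependency threads through module variables the agent must chase
module "network" {
  source   = "./modules/network"
  vpc_cidr = var.vpc_cidr
}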

Building Machine Images with Packer: Templates, Builders, Provisioners, and CI/CD

Building Machine Images with Packer#

Machine images (AMIs, Azure Managed Images, GCP Images) are the foundation of immutable infrastructure. Instead of provisioning a base OS and configuring it at boot, you build a pre-configured image and launch instances from it. Packer automates this process: it launches a temporary instance, runs provisioners to configure it, creates an image from the result, and destroys the temporary instance.

This operational sequence walks through building, testing, and managing machine images with Packer from template creation through CI/CD integration.
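Before the sequence itself, a minimal HCL template shows the moving parts Packer needs: a source (builder) that defines the temporary instance, and provisioners that configure it. The region, AMI filter, and packages below are placeholders, not a recommended baseline.

packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.2.0"
    }
  }
}

locals {
  image_timestamp = formatdate("YYYYMMDDhhmmss", timestamp())
}

source "amazon-ebs" "base" {
  region        = "us-east-1"
  instance_type = "t3.micro"
  ssh_username  = "ubuntu"
  ami_name      = "base-image-${local.image_timestamp}"

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["099720109477"]   # Canonical's account for official Ubuntu AMIs
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.base"]

  # Runs on the temporary instance before the image is captured
  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx"
    ]
  }
}

Running packer init . followed by packer build . exercises the whole lifecycle: temporary instance up, provisioners applied, AMI created, instance destroyed.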

Cloud Migration Strategies: The 7 Rs Framework

Cloud Migration Strategies#

A company does not “migrate to the cloud” – it migrates dozens or hundreds of applications, each with different characteristics, dependencies, and risk profiles. The 7 Rs framework provides vocabulary for per-workload decisions, but selecting the right R requires understanding the application, its dependencies, and the organization’s tolerance for change.

The 7 Rs#

Rehost (Lift and Shift)#

Move the application to cloud infrastructure with minimal changes. A VM on-premises becomes an EC2 instance. OS, application code, and configuration remain the same.

Diagnosing Common Terraform Problems

Stuck State Lock#

A CI job was cancelled, a laptop lost network, or a process crashed mid-apply. Terraform refuses to run:

Error acquiring the state lock
Lock Info:
  ID:        f8e7d6c5-b4a3-2109-8765-43210fedcba9
  Operation: OperationTypeApply
  Who:       deploy@ci-runner
  Created:   2026-02-20 09:15:22 +0000 UTC

Verify the lock holder is truly dead. Check CI job status, then:

terraform force-unlock f8e7d6c5-b4a3-2109-8765-43210fedcba9

If the lock was from a crashed apply, the state may be partially updated. Run terraform plan immediately after unlocking to see how much of the interrupted run actually made it into the state.

Docker Compose Patterns for Local Development

Multi-Service Stack Structure#

A typical local development stack has an application, a database, and maybe a cache or message broker. The compose file should read top-to-bottom like a description of your system.

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    env_file:
      - .env
    volumes:
      - ./src:/app/src
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: myapp
      POSTGRES_PASSWORD: localdev
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U myapp"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

volumes:
  pgdata:

depends_on and Healthchecks#

The depends_on field controls startup order, but without a condition it only waits for the container to start, not for the service inside to be ready. A Postgres container starts in under a second, but the database process takes several seconds to accept connections. Use condition: service_healthy paired with a healthcheck to block until the dependency is actually ready.

Ephemeral Cloud Clusters: Create, Validate, Destroy Sequences for EKS, GKE, and AKS

Ephemeral Cloud Clusters#

Ephemeral clusters exist for one purpose: validate something, then disappear. They are not staging environments, not shared dev clusters, not long-lived resources that someone forgets to turn off. The operational model is strict – create, validate, destroy – and the entire sequence must be automated so that destruction cannot be forgotten.

The cost of getting this wrong is real. A three-node EKS cluster left running over a weekend costs roughly $15. Left running for a month, $200. Multiply by the number of developers or CI pipelines that create clusters, and forgotten ephemeral infrastructure becomes a significant budget line item. Every template in this article includes auto-destroy mechanisms to prevent this.
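The simplest guard is to make destruction unconditional in whatever drives the sequence. A hedged sketch as a shell wrapper (directory and script names are hypothetical): the trap fires on every exit path, so a failed validation or an interrupted run still tears the cluster down.

#!/usr/bin/env bash
set -euo pipefail

CLUSTER_DIR="clusters/ephemeral"    # hypothetical Terraform config for the test cluster

cleanup() {
  # Runs on success, test failure, or interruption alike
  terraform -chdir="$CLUSTER_DIR" destroy -auto-approve
}
trap cleanup EXIT

terraform -chdir="$CLUSTER_DIR" apply -auto-approve

# Validation step; a non-zero exit still triggers cleanup via the trap
./run-conformance-tests.sh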

HAProxy Configuration and Operations

Configuration File Structure#

HAProxy uses a single configuration file, typically /etc/haproxy/haproxy.cfg. The configuration is divided into four sections: global (process-level settings), defaults (default values for all proxies), frontend (client-facing listeners), and backend (server pools).

global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 50000
    user haproxy
    group haproxy
    daemon
    stats socket /run/haproxy/admin.sock mode 660 level admin
    tune.ssl.default-dh-param 2048

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    timeout http-request 10s
    timeout http-keep-alive 10s
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http

The maxconn in the global section sets the hard limit on total simultaneous connections. The timeout values in defaults apply to all frontends and backends unless overridden. The stats socket directive enables the runtime API, which is essential for operational management.
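With the socket enabled as above, the runtime API can be driven with socat. A few representative commands (the backend and server names are placeholders for whatever your configuration defines):

# Dump per-frontend, backend, and server statistics as CSV
echo "show stat" | socat stdio /run/haproxy/admin.sock

# Put a server into maintenance for a rolling deploy, then bring it back
echo "set server web_backend/web1 state maint" | socat stdio /run/haproxy/admin.sock
echo "set server web_backend/web1 state ready" | socat stdio /run/haproxy/admin.sock

These changes take effect immediately but do not persist across a restart; permanent changes still belong in haproxy.cfg.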

Kubernetes Cost Audit and Reduction: A Systematic Operational Plan

Kubernetes Cost Audit and Reduction#

Kubernetes clusters accumulate cost waste silently. Resource requests padded “just in case” during initial deployment never get revisited. Load balancers created for debugging stay running. PVCs from deleted applications persist. Over six months, a cluster originally running at $5,000/month can drift to $12,000 with no corresponding increase in actual workload.

This operational plan works through cost reduction systematically, starting with visibility (you cannot cut what you cannot see), moving through quick wins, then tackling the larger structural optimizations that require data collection and careful rollout.
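Before installing any cost tooling, plain kubectl gives a rough first pass at visibility. A hedged sketch (kubectl top requires metrics-server; the output still needs human interpretation):

# Actual usage, sorted by memory, across all namespaces
kubectl top pods -A --sort-by=memory

# Requested CPU and memory per container, for comparison against usage
kubectl get pods -A -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Load balancers and persistent volume claims that may belong to nothing
kubectl get svc -A | grep LoadBalancer
kubectl get pvc -A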