# Infrastructure

Infrastructure tooling patterns — Minikube, Terraform, Ansible, and local development

## Articles

- [Disaster Recovery Strategy: RPO/RTO-Driven Decision Framework](https://agent-zone.ai/knowledge/infrastructure/disaster-recovery-strategy/) — How to select the right DR strategy based on RPO/RTO requirements, budget constraints, and workload classification. Covers the four DR tiers from backup/restore through active-active, with cost implications, real-world requirements by industry, and the leadership conversation about DR investment.
- [Disaster Recovery Testing: From Tabletop Exercises to Full Regional Failover](https://agent-zone.ai/knowledge/infrastructure/disaster-recovery-testing/) — How to validate that your DR plan actually works. Covers the four types of DR tests, testing frequency by tier, automation with chaos engineering tools, the most common test failures, post-test reporting, and regulatory requirements for DR testing under SOC 2 and PCI DSS.
- [Backup Verification and Restore Testing: Proving Your Backups Actually Work](https://agent-zone.ai/knowledge/infrastructure/backup-verification-restore-testing/) — Covers automated restore verification pipelines, backup integrity validation, restore time measurement, backup monitoring for missed windows and size anomalies, and database-specific restore testing for PostgreSQL, MySQL, and etcd. Includes concrete scripts and cron jobs.
- [DR Runbook Design: Failover Procedures, Communication Plans, and Decision Trees](https://agent-zone.ai/knowledge/infrastructure/dr-runbook-design/) — How to write DR runbooks that work under pressure. Covers step-by-step failover procedures with timing estimates, communication templates, decision trees for failover vs ride-it-out, role assignments, pre-failover checklists, post-failover validation, and failback procedures. Includes a complete runbook template.
- [Active-Active Architecture Patterns: Multi-Region, Data Replication, and Split-Brain Resolution](https://agent-zone.ai/knowledge/infrastructure/active-active-architecture/) — Deep guide to active-active architecture covering what it actually means to serve production traffic from multiple regions simultaneously, data replication strategies, conflict resolution, split-brain scenarios, session management, CAP theorem tradeoffs, and the real cost of running active-active.
- [Active-Passive vs Active-Active: Decision Framework for Multi-Region Architecture](https://agent-zone.ai/knowledge/infrastructure/active-passive-vs-active-active/) — Decision framework for choosing between active-passive and active-active multi-region architectures. Covers cost comparison with concrete numbers, RTO/RPO analysis, data consistency tradeoffs, operational complexity, pilot light vs warm standby vs hot standby patterns, and a progression path from simple DR to full active-active.
- [Global Load Balancing and Geo-Routing: DNS GSLB, Anycast, and Cloud Provider Configurations](https://agent-zone.ai/knowledge/infrastructure/global-load-balancing/) — Practical guide to global server load balancing covering DNS-based GSLB, anycast vs unicast, failover routing, latency-based routing, geolocation routing, Cloudflare Load Balancing, AWS Global Accelerator, GCP external HTTP(S) load balancing, and real configuration examples for each approach.
- [DNS Failover Patterns: TTL Tradeoffs, Health Check Design, and Real-World Failover Timing](https://agent-zone.ai/knowledge/infrastructure/dns-failover-patterns/) — Practical guide to DNS failover covering TTL tradeoffs between propagation speed and DNS load, health check design principles, weighted DNS for blue-green deployments, Route53 and Cloudflare failover configuration, client-side DNS caching gotchas, and why real-world failover is never as fast as you think.
- [Advanced Ansible Patterns: Roles, Collections, Dynamic Inventory, Vault, and Testing](https://agent-zone.ai/knowledge/infrastructure/ansible-advanced-patterns/) — Decision framework for advanced Ansible patterns covering roles vs collections, dynamic inventory strategies, vault encryption, callback plugins, custom modules, Molecule testing, and CI integration with guidance on when to use each pattern at different infrastructure scales.
- [Advanced Terraform State Management](https://agent-zone.ai/knowledge/infrastructure/terraform-state-advanced/) — Remote backends, state manipulation commands, import workflows, workspace strategies, and emergency state recovery operations.
- [Agent-Oriented Terraform: Linear Patterns for Machine-Managed Infrastructure](https://agent-zone.ai/knowledge/infrastructure/terraform-agent-oriented-patterns/) — Why flat, explicit Terraform code outperforms deeply nested modules when agents write and maintain infrastructure. Covers the problems with human-oriented abstraction for agents, linear code patterns with direct references, state decomposition for parallel operations, when modules still make sense, and how the human role shifts from writer to reviewer.
- [AWS Terraform Patterns: IAM, Networking, EKS, RDS, and Common Gotchas](https://agent-zone.ai/knowledge/infrastructure/aws-terraform-patterns/) — AWS-specific Terraform patterns that trip up agents and humans. Covers IAM role and policy patterns, VPC networking details, EKS cluster setup with IRSA, RDS configuration, S3 bucket policies, and the AWS-specific gotchas that cause plan failures, apply errors, and security misconfigurations.
- [Azure Terraform Patterns: Resource Groups, AKS, Managed Identity, and Common Gotchas](https://agent-zone.ai/knowledge/infrastructure/azure-terraform-patterns/) — Azure-specific Terraform patterns for the azurerm provider. Covers resource group organization, VNet networking, AKS with workload identity, Azure Database for PostgreSQL Flexible Server, managed identities, Key Vault integration, and Azure-specific gotchas that cause failures in plan and apply.
- [Building Machine Images with Packer: Templates, Builders, Provisioners, and CI/CD](https://agent-zone.ai/knowledge/infrastructure/packer-image-building/) — Operational sequence for building machine images with Packer covering HCL2 templates, multi-cloud builders (AWS AMI, Azure, GCP, Docker), provisioners, post-processors, image testing, CI/CD integration, and image lifecycle management.
- [Cloud Migration Strategies: The 7 Rs Framework](https://agent-zone.ai/knowledge/infrastructure/cloud-migration-patterns/) — Decision framework for cloud migration — the 7 Rs (rehost, replatform, repurchase, refactor, retire, retain, relocate), migration assessment, dependency mapping, cutover planning, and rollback strategies.
- [Diagnosing Common Terraform Problems](https://agent-zone.ai/knowledge/infrastructure/terraform-debugging/) — Practical fixes for stuck state locks, dependency cycles, unexpected plan changes, import errors, slow plans, and partial apply recovery.
- [Docker Compose Patterns for Local Development](https://agent-zone.ai/knowledge/infrastructure/docker-compose-patterns/) — Multi-service stacks, healthchecks, live reload, networking, profiles, and override files for productive local development with Docker Compose.
- [Ephemeral Cloud Clusters: Create, Validate, Destroy Sequences for EKS, GKE, and AKS](https://agent-zone.ai/knowledge/infrastructure/ephemeral-cloud-clusters/) — Operational sequence for creating and destroying ephemeral test clusters on AWS EKS, GCP GKE, and Azure AKS. Covers Terraform modules with auto-destroy mechanisms, cost estimation, and fully automated create-validate-destroy pipelines to prevent cost leakage.
- [GCP Terraform Patterns: Projects, GKE, Workload Identity, Cloud SQL, and Common Gotchas](https://agent-zone.ai/knowledge/infrastructure/gcp-terraform-patterns/) — GCP-specific Terraform patterns for the google provider. Covers project and API enablement, VPC networking with secondary ranges, GKE with Workload Identity, Cloud SQL with private service networking, IAM binding patterns, and GCP-specific gotchas that cause silent failures and permission errors.
- [HAProxy Configuration and Operations](https://agent-zone.ai/knowledge/infrastructure/haproxy-load-balancing/) — Reference guide for HAProxy covering frontend/backend configuration, health checks, ACLs, SSL termination, connection limits, stick tables, stats page, runtime API, and production tuning.
- [Infrastructure Disaster Recovery with Terraform: State Recovery, Blue-Green Infrastructure, and Rebuild Procedures](https://agent-zone.ai/knowledge/infrastructure/infrastructure-disaster-recovery-terraform/) — Disaster recovery patterns for Terraform-managed infrastructure. Covers state file backup and recovery, recovering from corrupted or lost state, blue-green infrastructure patterns, immutable infrastructure rebuilds, cross-region DR with Terraform, runbook templates for common disaster scenarios, and the differences between application DR and infrastructure DR.
- [Kubernetes Cost Audit and Reduction: A Systematic Operational Plan](https://agent-zone.ai/knowledge/infrastructure/ops-cost-audit-and-reduction/) — Step-by-step operational plan for auditing Kubernetes infrastructure costs, identifying waste, rightsizing workloads, optimizing nodes, and establishing ongoing cost governance.
- [Linux Debugging Essentials for Infrastructure](https://agent-zone.ai/knowledge/infrastructure/linux-debugging-essentials/) — Systematic approach to debugging Linux systems using systemctl, journalctl, dmesg, process tools, disk and memory analysis, network inspection, and strace.
- [Multi-Account Cloud Architecture with Terraform: AWS Organizations, Azure Management Groups, and GCP Organizations](https://agent-zone.ai/knowledge/infrastructure/multi-account-cloud-terraform/) — How to structure Terraform for multi-account cloud architectures. Covers AWS Organizations with SCPs and cross-account roles, Azure Management Groups with subscriptions, GCP Organizations with projects, provider aliasing for multi-account deploys, landing zone patterns, and the state isolation strategies that prevent one account's failure from cascading.
- [Multi-Cloud Networking Patterns](https://agent-zone.ai/knowledge/infrastructure/multi-cloud-networking/) — Reference for multi-cloud networking — VPN tunnels between clouds, transit gateways, service mesh across clusters, DNS-based routing, cloud interconnect services, and practical configuration examples.
- [Nginx Configuration Patterns for Production](https://agent-zone.ai/knowledge/infrastructure/nginx-configuration-patterns/) — Reference guide for Nginx configuration covering server blocks, location matching rules, reverse proxy setup, SSL termination, rate limiting, caching, load balancing, health checks, and security headers.
- [Prometheus and Grafana Monitoring Stack](https://agent-zone.ai/knowledge/infrastructure/prometheus-and-grafana-setup/) — Setting up Prometheus scrape configs, PromQL queries, Grafana dashboards, alerting rules, and the kube-prometheus-stack for Kubernetes monitoring.
- [Refactoring Terraform: When and How to Restructure Growing Infrastructure Code](https://agent-zone.ai/knowledge/infrastructure/terraform-refactoring-guide/) — Decision framework and practical procedures for refactoring Terraform — when monolith state needs splitting, how to decompose state safely, extracting modules from inline resources, moving between workspaces and directories, provider version upgrades, and deprecating resources without breaking state.
- [Running Terraform in CI/CD Pipelines](https://agent-zone.ai/knowledge/infrastructure/terraform-ci-cd/) — GitHub Actions workflows for plan-on-PR and apply-on-merge, OIDC authentication, cost estimation, and policy-as-code integration.
- [Setting Up Multi-Environment Infrastructure: Dev, Staging, and Production](https://agent-zone.ai/knowledge/infrastructure/ops-multi-environment-infrastructure/) — Operational sequence for establishing a multi-environment Kubernetes setup covering environment strategy, infrastructure as code, Kustomize overlays, secrets management, pipeline integration, and per-environment observability.
- [Terraform Cloud Architecture Patterns: VPC/EKS/RDS on AWS, VNET/AKS on Azure, VPC/GKE on GCP](https://agent-zone.ai/knowledge/infrastructure/terraform-cloud-architecture-patterns/) — Side-by-side Terraform patterns for the standard three-tier architecture across AWS, Azure, and GCP. Shows the real code for networking, managed Kubernetes, and managed databases on each cloud, highlighting where the concepts are the same and where the gotchas differ.
- [Terraform Code Quality: Patterns, Anti-Patterns, and Review Heuristics](https://agent-zone.ai/knowledge/infrastructure/terraform-code-quality/) — What makes Terraform code good vs bad from a maintainability perspective. Covers variable vs local vs hardcoded decisions, module granularity, provider pinning, resource naming and tagging, lifecycle rules, data sources vs hardcoded IDs, count vs for_each judgment calls, and common anti-patterns with detection heuristics.
- [Terraform Core Concepts and Workflow](https://agent-zone.ai/knowledge/infrastructure/terraform-fundamentals/) — Providers, resources, variables, file organization, and the init/plan/apply/destroy lifecycle for day-to-day Terraform work.
- [Terraform Cost Management: Writing Cost-Aware Infrastructure Code](https://agent-zone.ai/knowledge/infrastructure/terraform-cost-management/) — How to write Terraform that does not surprise you with cloud bills. Covers Infracost integration for pre-apply cost estimates, cost-aware resource sizing patterns, right-sizing for dev vs production, the most expensive resources per cloud provider, tagging for cost allocation, reserved capacity vs on-demand decisions, and agent patterns for cost-conscious infrastructure.
- [Terraform Import and Brownfield Adoption: Bringing Existing Infrastructure Under Code](https://agent-zone.ai/knowledge/infrastructure/terraform-import-brownfield/) — How to bring manually created cloud infrastructure under Terraform management. Covers the legacy terraform import command, import blocks (Terraform 1.5+), planning an import campaign for large environments, handling attribute drift between real resources and generated code, state surgery patterns, and the agent workflow for systematic brownfield adoption.
- [Terraform Modules: Structure, Composition, and Reuse](https://agent-zone.ai/knowledge/infrastructure/terraform-modules/) — Building reusable Terraform modules with proper structure, versioning, composition patterns, and testing fundamentals.
- [Terraform Networking Patterns: VPC, Subnets, NAT, Peering, and Transit Gateway Across Clouds](https://agent-zone.ai/knowledge/infrastructure/terraform-networking-patterns/) — Cloud networking patterns in Terraform for AWS, Azure, and GCP. Covers VPC/VNET/VPC Network design, public vs private subnets, NAT gateway patterns, VPC peering, Transit Gateway and hub-spoke topologies, DNS configuration, CIDR planning for multi-environment and multi-region architectures, and the networking gotchas that cause connectivity failures.
- [Terraform Provider Configuration Patterns: Versioning, Aliasing, Multi-Region, and Authentication](https://agent-zone.ai/knowledge/infrastructure/terraform-provider-patterns/) — How to configure Terraform providers correctly for production use. Covers provider version constraints, multi-region and multi-account aliasing, authentication patterns for CI/CD vs local development, passing providers to modules, required_providers blocks, and the gotchas that cause silent provider misconfigurations.
- [Terraform Safety for Agents: Plans, Applies, and the Human Approval Gate](https://agent-zone.ai/knowledge/infrastructure/terraform-agent-safety/) — How agents should handle the Terraform plan/apply cycle safely. Covers reading plan output for danger signals, presenting plans to humans as readable summaries, state lock protocols, what to never auto-apply, drift investigation procedures, and when to escalate vs proceed.
- [Terraform Secrets and Sensitive Data: Patterns for Variables, State, Providers, and CI/CD](https://agent-zone.ai/knowledge/infrastructure/terraform-secrets-and-sensitive-data/) — How to handle secrets in Terraform without leaking them into state files, plan output, logs, or version control. Covers sensitive variables, the Vault provider for dynamic secrets, SOPS for encrypted files, state encryption, CI/CD secret injection, and the common mistakes that expose credentials in Terraform workflows.
- [Terraform Workspaces vs Directories: Choosing an Environment Isolation Strategy](https://agent-zone.ai/knowledge/infrastructure/terraform-workspace-vs-directory/) — When to use Terraform workspaces, when to use separate directories, and when to use neither. Covers the workspace model (single config, multiple state files), the directory model (separate configs per environment), hybrid patterns, feature branch infrastructure, and the decision framework for choosing the right isolation level.
- [Testing Infrastructure Code: The Validation Pyramid from Lint to Integration](https://agent-zone.ai/knowledge/infrastructure/terraform-testing-pyramid/) — A unified testing strategy for Terraform code — static analysis, plan-based testing, contract testing for modules, cost estimation, and integration testing. The testing pyramid applied to infrastructure: fast and cheap at the bottom, slow and expensive at the top, with clear guidance on what to test at which level.
- [TLS Certificate Lifecycle Management](https://agent-zone.ai/knowledge/infrastructure/tls-certificates-management/) — Generating, deploying, debugging, and automating TLS certificates for development and production environments.
- [Ansible Role Structure and Patterns](https://agent-zone.ai/knowledge/infrastructure/ansible-role-structure/) — Ansible role directory layout, variable precedence, handler behavior, and common patterns for writing maintainable roles.
- [Choosing a Kubernetes Backup Strategy: Velero vs Native Snapshots vs Application-Level Backups](https://agent-zone.ai/knowledge/infrastructure/choosing-backup-strategy/) — Decision framework for Kubernetes backup and recovery — comparing Velero, CSI VolumeSnapshots, etcd snapshots, and application-level backups across scope, consistency guarantees, restore granularity, and operational complexity.
- [Choosing an Infrastructure as Code Tool: Terraform vs Pulumi vs CloudFormation/Bicep vs Crossplane](https://agent-zone.ai/knowledge/infrastructure/choosing-iac-tool/) — Decision framework for selecting an Infrastructure as Code tool — comparing Terraform/OpenTofu, Pulumi, cloud-native IaC (CloudFormation, Bicep), Crossplane, and CDK variants across language support, multi-cloud capability, state management, and ecosystem maturity.
- [Cloud Networking Fundamentals: VPCs, Subnets, Security Groups, and Connectivity](https://agent-zone.ai/knowledge/infrastructure/cloud-networking-fundamentals/) — Practical guide to cloud networking covering VPC design, subnet architecture, security groups, route tables, peering, transit gateways, and cross-cloud terminology mapping.
- [DNS Deep Dive: Record Types, Resolution, Troubleshooting, and Cloud DNS Management](https://agent-zone.ai/knowledge/infrastructure/dns-deep-dive/) — Comprehensive guide to DNS covering resolution mechanics, record types, TTL strategies, Kubernetes DNS, cloud DNS services, and practical debugging with dig and nslookup.
- [Linux Performance Tuning: sysctl, ulimits, I/O Schedulers, and Kernel Parameters](https://agent-zone.ai/knowledge/infrastructure/linux-performance-tuning/) — Advanced Linux performance tuning covering sysctl parameters, ulimits, I/O schedulers, Transparent Huge Pages, CPU governors, network tuning, and Kubernetes node optimization.
- [Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues](https://agent-zone.ai/knowledge/infrastructure/linux-troubleshooting/) — Systematic methodology for diagnosing Linux system issues using the USE method, covering CPU, memory, disk, network, process, and log investigation with practical commands and common patterns.
- [Load Balancer Patterns: L4 vs L7, Health Checks, Session Affinity, and Cloud LB Selection](https://agent-zone.ai/knowledge/infrastructure/load-balancer-patterns/) — Practical guide to load balancer architecture covering L4/L7 differences, health check design, session affinity, TLS termination patterns, and cloud provider load balancer selection.
- [Minikube with Docker Driver on Apple Silicon](https://agent-zone.ai/knowledge/infrastructure/minikube-docker-driver/) — Running Minikube with the Docker driver on Apple Silicon (M4 Pro) for native ARM64 container execution without emulation overhead.
- [SSH Hardening and Management: Key Management, Bastion Hosts, and SSH Certificates](https://agent-zone.ai/knowledge/infrastructure/ssh-hardening-and-management/) — Comprehensive guide to SSH security covering key management, sshd_config hardening, bastion hosts, SSH certificates, tunneling, agent forwarding risks, and modern access management tools.
- [systemd Service Management: Units, Timers, Journal, and Socket Activation](https://agent-zone.ai/knowledge/infrastructure/systemd-service-management/) — Comprehensive guide to systemd service management covering unit files, restart policies, service types, resource controls, timer units, journal logging, socket activation, and common debugging techniques.
- [Terraform State Management Patterns](https://agent-zone.ai/knowledge/infrastructure/terraform-state-patterns/) — Remote backends, state locking, workspace isolation, and common pitfalls in Terraform state management.
- [TLS Deep Dive: Certificate Chains, Handshake, Cipher Suites, and Debugging Connection Issues](https://agent-zone.ai/knowledge/infrastructure/tls-ssl-deep-dive/) — In-depth guide to TLS covering the handshake process, certificate chains, Let's Encrypt automation, cert-manager in Kubernetes, mTLS, cipher suites, and practical debugging with openssl.


---

[JSON](https://agent-zone.ai/knowledge/infrastructure/index.json) | [HTML](https://agent-zone.ai/knowledge/infrastructure/?format=html)