Infrastructure Capacity Planning: Measurement, Projection, and Scaling

What Capacity Planning Solves#

Running out of capacity during a traffic spike causes outages. Over-provisioning wastes money continuously. Capacity planning is the process of measuring what you use now, projecting what you will need, and ensuring resources are available before demand arrives. Without it, you are either constantly firefighting resource exhaustion or explaining to finance why your cloud bill doubled.

Capacity planning is not a one-time exercise. It is a recurring process – monthly for fast-growing services, quarterly for stable ones.

Infrastructure Disaster Recovery with Terraform: State Recovery, Blue-Green Infrastructure, and Rebuild Procedures

Infrastructure Disaster Recovery with Terraform#

Application disaster recovery is well-understood: replicate data, failover traffic, restore from backups. Infrastructure disaster recovery is different — you are recovering the platform that applications run on. If your Terraform state is lost, your VPC is deleted, or an entire region goes down, how do you rebuild?

This article covers the DR patterns specific to Terraform-managed infrastructure: protecting state, recovering from state loss, designing infrastructure for regional failover, and the runbooks that agents and operators need when things go wrong.

Infrastructure Knowledge Scoping for Agents

Infrastructure Knowledge Scoping for Agents#

An agent working on infrastructure tasks needs to operate at the right level of specificity. Giving generic Kubernetes advice when the user runs EKS with IRSA is unhelpful – the agent misses the IAM integration that will make or break the deployment. Giving EKS-specific advice when the user runs minikube on a laptop is equally unhelpful – the agent references services and configurations that do not exist.

Multi-Account Cloud Architecture with Terraform: AWS Organizations, Azure Management Groups, and GCP Organizations

Multi-Account Cloud Architecture with Terraform#

Single-account cloud deployments work for learning and prototypes. Production systems need multiple accounts (AWS), subscriptions (Azure), or projects (GCP) for isolation — security boundaries, blast radius control, billing separation, and compliance requirements.

Terraform manages multi-account architectures well, but the patterns differ significantly from single-account work. Provider configuration, state isolation, cross-account references, and IAM trust relationships all need explicit design.

Why Multiple Accounts#

ReasonSingle Account ProblemMulti-Account Solution
Blast radiusMisconfigured IAM affects everythingDamage limited to one account
BillingCannot attribute costs to teamsPer-account billing and budgets
CompliancePCI data mixed with dev workloadsSeparate accounts for regulated workloads
Service limitsVPC limit of 5 per region sharedEach account has its own limits
Access controlComplex IAM policies to isolate teamsAccount boundary is the strongest isolation
TestingDev resources can affect productionImpossible for dev to touch prod resources

AWS Organizations#

Organization Structure#

Organization Root
├── Core OU
│   ├── Management Account (billing, org management)
│   ├── Security Account (GuardDuty, SecurityHub, audit logs)
│   └── Networking Account (Transit Gateway, shared VPCs)
├── Workload OU
│   ├── Production OU
│   │   ├── App-A Production Account
│   │   └── App-B Production Account
│   └── Non-Production OU
│       ├── App-A Development Account
│       └── App-A Staging Account
└── Sandbox OU
    └── Developer Sandbox Accounts

Terraform for AWS Organizations#

resource "aws_organizations_organization" "main" {
  feature_set = "ALL"

  enabled_policy_types = [
    "SERVICE_CONTROL_POLICY",
    "TAG_POLICY",
  ]
}

resource "aws_organizations_organizational_unit" "core" {
  name      = "Core"
  parent_id = aws_organizations_organization.main.roots[0].id
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = aws_organizations_organization.main.roots[0].id
}

resource "aws_organizations_organizational_unit" "production" {
  name      = "Production"
  parent_id = aws_organizations_organizational_unit.workloads.id
}

# Create a workload account
resource "aws_organizations_account" "app_production" {
  name      = "app-a-production"
  email     = "aws+app-a-prod@example.com"
  parent_id = aws_organizations_organizational_unit.production.id
  role_name = "OrganizationAccountAccessRole"  # cross-account admin role

  lifecycle {
    prevent_destroy = true  # accounts cannot be easily recreated
  }
}

Service Control Policies (SCPs)#

SCPs set permission boundaries for entire OUs:

Multi-Cloud Networking Patterns

Multi-Cloud Networking Patterns#

Multi-cloud networking connects workloads across two or more cloud providers into a coherent network. The motivations vary – vendor redundancy, best-of-breed service selection, regulatory requirements – but the challenges are the same: private connectivity between isolated networks, consistent service discovery, and traffic routing that handles failures.

VPN Tunnels Between Clouds#

IPsec VPN tunnels are the simplest way to connect two cloud networks. Each provider offers managed VPN gateways that terminate IPsec tunnels, encrypting traffic between VPCs without exposing it to the public internet.

Terraform Cloud Architecture Patterns: VPC/EKS/RDS on AWS, VNET/AKS on Azure, VPC/GKE on GCP

Terraform Cloud Architecture Patterns#

The three-tier architecture — networking, managed Kubernetes, managed database — is the most common pattern for production deployments on any major cloud. The concepts are identical across AWS, Azure, and GCP. The Terraform code is not. Resource names differ, required arguments differ, default behaviors differ, and the gotchas that catch agents and humans are cloud-specific.

This article shows the real Terraform for each layer on each cloud, side by side, so agents can write correct infrastructure code for whichever cloud the user deploys to.

Terraform Cost Management: Writing Cost-Aware Infrastructure Code

Terraform Cost Management#

The most expensive line in your cloud bill was written in a .tf file. A single instance_type choice, a forgotten NAT Gateway, or an over-provisioned RDS instance can cost thousands per month — and none of these show up in terraform plan. Plan shows what changes. It does not show what it costs.

This article covers how to write cost-aware Terraform and catch expensive decisions before they reach production.

Terraform Import and Brownfield Adoption: Bringing Existing Infrastructure Under Code

Terraform Import and Brownfield Adoption#

Most organizations do not start with Infrastructure as Code. They start with console clicks, CLI commands, and scripts. At some point they decide to adopt Terraform — and now they have hundreds of existing resources that need to be brought under management without disruption.

This is the brownfield problem: writing Terraform code that matches existing infrastructure exactly, importing the state so Terraform knows about the resources, and resolving the inevitable drift between what exists and what the code describes.

Terraform Networking Patterns: VPC, Subnets, NAT, Peering, and Transit Gateway Across Clouds

Terraform Networking Patterns#

Networking is the first thing you build and the last thing you want to change. CIDR ranges, subnet allocation, and connectivity topology are difficult to modify after resources depend on them. Getting the network right in Terraform saves months of migration work later.

This article covers the networking patterns across AWS, Azure, and GCP — from basic VPC design to multi-region hub-spoke topologies.

CIDR Planning#

Plan CIDR ranges before writing any Terraform. Once a VPC is created with a CIDR block, changing it requires recreating the VPC and everything in it.

Terraform Secrets and Sensitive Data: Patterns for Variables, State, Providers, and CI/CD

Terraform Secrets and Sensitive Data#

Every Terraform configuration eventually needs a password, API key, or certificate. How you handle that secret determines whether it ends up in your state file (readable by anyone with state access), in plan output (visible in CI logs), in version control (permanent history), or properly managed through a secrets provider.

This article covers the patterns for handling secrets at every stage of the Terraform lifecycle — from variable declaration through state storage.