Infrastructure Knowledge Scoping for Agents

Infrastructure Knowledge Scoping for Agents#

An agent working on infrastructure tasks needs to operate at the right level of specificity. Giving generic Kubernetes advice when the user runs EKS with IRSA is unhelpful – the agent misses the IAM integration that will make or break the deployment. Giving EKS-specific advice when the user runs minikube on a laptop is equally unhelpful – the agent references services and configurations that do not exist.

Integrating Infrastructure as Code with CI/CD: Patterns for Safe, Automated Infrastructure Delivery

Integrating Infrastructure as Code with CI/CD#

Running Terraform locally works for one person. It breaks down when multiple people (or agents) modify infrastructure concurrently, when changes need review before applying, and when environments (dev/staging/prod) need synchronized promotion. CI/CD pipelines solve this by making the plan-review-apply cycle automated, auditable, and safe.

This article covers the patterns for integrating Terraform into CI/CD — from the basic plan-on-PR flow to multi-directory monorepos with dependency ordering and environment promotion.

Minikube to Cloud Migration: 10 Things That Change on EKS, GKE, and AKS

Minikube to Cloud Migration Guide#

Minikube is excellent for learning and local development. But almost everything that “just works” on minikube requires explicit configuration on a cloud cluster. Here are the 10 things that change.

1. Ingress Controller Becomes a Cloud Load Balancer#

On minikube: You enable the NGINX ingress addon with minikube addons enable ingress. Traffic reaches your services through minikube tunnel or minikube service.

On cloud: The ingress controller must be deployed explicitly, and it provisions a real cloud load balancer. On AWS, the AWS Load Balancer Controller creates ALBs or NLBs from Ingress resources. On GKE, the built-in GCE ingress controller creates Google Cloud Load Balancers. You pay per load balancer.

Multi-Account Cloud Architecture with Terraform: AWS Organizations, Azure Management Groups, and GCP Organizations

Multi-Account Cloud Architecture with Terraform#

Single-account cloud deployments work for learning and prototypes. Production systems need multiple accounts (AWS), subscriptions (Azure), or projects (GCP) for isolation — security boundaries, blast radius control, billing separation, and compliance requirements.

Terraform manages multi-account architectures well, but the patterns differ significantly from single-account work. Provider configuration, state isolation, cross-account references, and IAM trust relationships all need explicit design.

Why Multiple Accounts#

ReasonSingle Account ProblemMulti-Account Solution
Blast radiusMisconfigured IAM affects everythingDamage limited to one account
BillingCannot attribute costs to teamsPer-account billing and budgets
CompliancePCI data mixed with dev workloadsSeparate accounts for regulated workloads
Service limitsVPC limit of 5 per region sharedEach account has its own limits
Access controlComplex IAM policies to isolate teamsAccount boundary is the strongest isolation
TestingDev resources can affect productionImpossible for dev to touch prod resources

AWS Organizations#

Organization Structure#

Organization Root
├── Core OU
│   ├── Management Account (billing, org management)
│   ├── Security Account (GuardDuty, SecurityHub, audit logs)
│   └── Networking Account (Transit Gateway, shared VPCs)
├── Workload OU
│   ├── Production OU
│   │   ├── App-A Production Account
│   │   └── App-B Production Account
│   └── Non-Production OU
│       ├── App-A Development Account
│       └── App-A Staging Account
└── Sandbox OU
    └── Developer Sandbox Accounts

Terraform for AWS Organizations#

resource "aws_organizations_organization" "main" {
  feature_set = "ALL"

  enabled_policy_types = [
    "SERVICE_CONTROL_POLICY",
    "TAG_POLICY",
  ]
}

resource "aws_organizations_organizational_unit" "core" {
  name      = "Core"
  parent_id = aws_organizations_organization.main.roots[0].id
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = aws_organizations_organization.main.roots[0].id
}

resource "aws_organizations_organizational_unit" "production" {
  name      = "Production"
  parent_id = aws_organizations_organizational_unit.workloads.id
}

# Create a workload account
resource "aws_organizations_account" "app_production" {
  name      = "app-a-production"
  email     = "aws+app-a-prod@example.com"
  parent_id = aws_organizations_organizational_unit.production.id
  role_name = "OrganizationAccountAccessRole"  # cross-account admin role

  lifecycle {
    prevent_destroy = true  # accounts cannot be easily recreated
  }
}

Service Control Policies (SCPs)#

SCPs set permission boundaries for entire OUs:

Multi-Cloud Networking Patterns

Multi-Cloud Networking Patterns#

Multi-cloud networking connects workloads across two or more cloud providers into a coherent network. The motivations vary – vendor redundancy, best-of-breed service selection, regulatory requirements – but the challenges are the same: private connectivity between isolated networks, consistent service discovery, and traffic routing that handles failures.

VPN Tunnels Between Clouds#

IPsec VPN tunnels are the simplest way to connect two cloud networks. Each provider offers managed VPN gateways that terminate IPsec tunnels, encrypting traffic between VPCs without exposing it to the public internet.

Multi-Cloud vs Single-Cloud Strategy Decisions

Multi-Cloud vs Single-Cloud Strategy#

Multi-cloud is one of the most oversold strategies in infrastructure. Vendors, consultants, and conference speakers promote it as the default approach, but the reality is that most organizations are better served by a single cloud provider used well. This framework helps you determine whether multi-cloud is actually worth the cost for your situation.

The Default Answer Is Single-Cloud#

Start with single-cloud unless you have a specific, concrete reason to go multi-cloud. Here is why.

Prompt Engineering for Infrastructure Operations: Templates, Safety, and Structured Reasoning

Prompt Engineering for Infrastructure Operations#

Infrastructure prompts differ from general-purpose prompts in one critical way: the output often drives real actions on real systems. A hallucinated filename in a creative writing task is harmless. A hallucinated resource name in a Kubernetes delete command causes an outage. Every prompt pattern here is designed with that asymmetry in mind – prioritizing correctness and safety over cleverness.

Structured Output for Infrastructure Data#

Infrastructure operations produce structured data: IP addresses, resource names, status codes, configuration values. Free-form text responses create parsing fragility. Force structured output from the start.

Refactoring Terraform: When and How to Restructure Growing Infrastructure Code

Refactoring Terraform#

Terraform configurations grow organically. A project starts with 10 resources in one directory. Six months later it has 80 resources, 3 levels of modules, and a state file that takes 2 minutes to plan. Changes feel risky because everything is interconnected. New team members (or agents) cannot understand the structure without reading every file.

Refactoring addresses this — but Terraform refactoring is harder than code refactoring because the state file maps resource addresses to real infrastructure. Rename a resource and Terraform thinks you want to destroy the old one and create a new one. Move a resource into a module and Terraform plans to recreate it. Every structural change requires corresponding state manipulation.

Regulatory Compliance Frameworks: HIPAA, FedRAMP, ITAR, and SOX Technical Controls

Regulatory Compliance Frameworks#

Regulatory compliance translates legal requirements into technical controls. Understanding which regulations apply to your system and mapping them to infrastructure and application design is a core engineering responsibility in regulated industries.

This guide covers four major frameworks and their practical implications for software architecture. These are not exhaustive compliance guides — they map the most impactful technical controls for each framework.

HIPAA (Health Insurance Portability and Accountability Act)#

HIPAA applies to organizations handling Protected Health Information (PHI) — any data that can identify a patient and relates to their health condition, treatment, or payment.

Running Terraform in CI/CD Pipelines

The Core Pattern#

The standard CI/CD pattern for Terraform: run terraform plan on every pull request and post the output as a PR comment. Run terraform apply only when the PR merges to main. This gives reviewers visibility into what will change before approving.

GitHub Actions Workflow#

name: Terraform
on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: [main]
    paths: ["infra/**"]

permissions:
  id-token: write    # OIDC
  contents: read
  pull-requests: write  # PR comments

jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
          terraform_wrapper: true  # captures stdout for PR comments

      - name: Terraform Init
        run: terraform init -input=false

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        run: terraform plan -input=false -no-color -out=tfplan
        continue-on-error: true

      - name: Post Plan to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const plan = `${{ steps.plan.outputs.stdout }}`;
            const truncated = plan.length > 60000
              ? plan.substring(0, 60000) + "\n\n... truncated ..."
              : plan;
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `#### Terraform Plan\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -input=false tfplan

The plan step uses continue-on-error: true so the PR comment step still runs. A separate step checks the actual plan outcome. Apply only runs on pushes to main.