GKE Setup and Configuration

GKE Setup and Configuration#

GKE is Google’s managed Kubernetes service. The two major decisions when creating a cluster are the mode (Standard vs Autopilot) and the networking model (VPC-native is now the default and the only option for new clusters). Everything else (node pools, release channels, Workload Identity) layers on top of those choices.

Standard vs Autopilot#

Standard mode gives you full control over node pools, machine types, and node configuration. You manage capacity, pay per node (whether pods are using the resources or not), and can run DaemonSets, privileged containers, and host-network pods. Autopilot inverts this: Google provisions and manages the nodes, you pay for the CPU and memory your pods request, and workloads that need node-level access (privileged containers, host networking) are restricted.
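
For illustration, a minimal Standard cluster with an explicitly managed node pool might look like the Terraform sketch below (name, region, and sizes are hypothetical); in Autopilot mode the node pool resource goes away and enable_autopilot = true takes its place.

resource "google_container_cluster" "primary" {
  name     = "example-standard"
  location = "us-central1"

  # Manage node pools explicitly rather than through the default pool
  remove_default_node_pool = true
  initial_node_count       = 1

  # VPC-native: pod and service IPs come from secondary subnet ranges
  ip_allocation_policy {}
}

resource "google_container_node_pool" "general" {
  name       = "general"
  cluster    = google_container_cluster.primary.name
  location   = google_container_cluster.primary.location
  node_count = 2

  node_config {
    machine_type = "e2-standard-4"
  }
}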

Grafana Organization: Folders, Permissions, Provisioning, and Dashboard Lifecycle

Folder Structure Strategy#

Grafana folders organize dashboards and control access through permissions. The folder structure you choose determines how teams find dashboards and who can edit them. Three patterns work in practice, each suited to a different organizational shape.

By Team#

When teams own distinct services and rarely need cross-team dashboards:

Platform/
  Node Overview
  Kubernetes Cluster
  Networking
Backend/
  API Gateway
  User Service
  Payment Service
Frontend/
  Web Vitals
  CDN Performance
Data/
  Kafka Pipelines
  ETL Jobs
  Data Quality

Each team gets Editor access to their folder and Viewer access to everything else. This works well when ownership boundaries are clear.
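
One way to codify this pattern is the Grafana Terraform provider; the sketch below assumes its folder and permission resource schemas, with hypothetical team names.

resource "grafana_team" "backend" {
  name = "Backend"
}

resource "grafana_folder" "backend" {
  title = "Backend"
}

# Backend edits its own folder; every other authenticated user views
resource "grafana_folder_permission" "backend" {
  folder_uid = grafana_folder.backend.uid

  permissions {
    team_id    = grafana_team.backend.id
    permission = "Edit"
  }
  permissions {
    role       = "Viewer"
    permission = "View"
  }
}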

Refactoring Terraform: When and How to Restructure Growing Infrastructure Code

Refactoring Terraform#

Terraform configurations grow organically. A project starts with 10 resources in one directory. Six months later it has 80 resources, 3 levels of modules, and a state file large enough that a plan takes 2 minutes. Changes feel risky because everything is interconnected. New team members (or agents) cannot understand the structure without reading every file.

Refactoring addresses this — but Terraform refactoring is harder than code refactoring because the state file maps resource addresses to real infrastructure. Rename a resource and Terraform thinks you want to destroy the old one and create a new one. Move a resource into a module and Terraform plans to recreate it. Every structural change requires corresponding state manipulation.
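
Since Terraform 1.1, moved blocks record these refactors declaratively, turning a would-be destroy-and-create into a rename in state (terraform state mv is the imperative equivalent). A minimal sketch with hypothetical addresses:

# Kept in configuration next to the new resource location;
# terraform plan then reports a move instead of destroy-and-create.
moved {
  from = aws_instance.web
  to   = module.web.aws_instance.this
}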

Running Terraform in CI/CD Pipelines

The Core Pattern#

The standard CI/CD pattern for Terraform: run terraform plan on every pull request and post the output as a PR comment. Run terraform apply only when the PR merges to main. This gives reviewers visibility into what will change before approving.

GitHub Actions Workflow#

name: Terraform
on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: [main]
    paths: ["infra/**"]

permissions:
  id-token: write    # OIDC
  contents: read
  pull-requests: write  # PR comments

jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
          terraform_wrapper: true  # captures stdout for PR comments

      - name: Terraform Init
        run: terraform init -input=false

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        id: plan
        run: terraform plan -input=false -no-color -out=tfplan
        continue-on-error: true

      - name: Post Plan to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const plan = `${{ steps.plan.outputs.stdout }}`;
            const truncated = plan.length > 60000
              ? plan.substring(0, 60000) + "\n\n... truncated ..."
              : plan;
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `#### Terraform Plan\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Plan Status
        if: steps.plan.outcome == 'failure'
        run: exit 1

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -input=false tfplan

The plan step uses continue-on-error: true so the PR comment step still runs even when the plan fails. The separate Plan Status step then fails the job on a failed plan. Apply only runs on pushes to main.

Self-Service Infrastructure Patterns

The Problem Self-Service Solves#

Developers need infrastructure: databases, caches, message queues, storage buckets, DNS records. In most organizations, getting these means filing a ticket, waiting days for someone to provision, and receiving credentials in a Slack DM. This bottleneck incentivizes workarounds — manual console provisioning, skipped security configs, everything crammed into shared databases.

Self-service infrastructure lets developers provision what they need directly, within guardrails the platform team defines. Choose a resource from a catalog, fill in parameters, and the system provisions it and returns connection details. No tickets, no waiting.
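
With Terraform underneath, a catalog entry often reduces to a module call the developer fills in; the sketch below is hypothetical (module source, inputs, and outputs invented for illustration), with the guardrails (encryption, networking, tagging) baked into the module by the platform team.

module "orders_db" {
  source  = "app.terraform.io/acme/postgres/aws" # hypothetical registry module
  version = "~> 2.0"

  # The parameters the developer chooses
  name = "orders"
  size = "small" # the module maps this to instance class, storage, backups
}

output "orders_db_endpoint" {
  value     = module.orders_db.endpoint
  sensitive = true
}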

Setting Up Multi-Environment Infrastructure: Dev, Staging, and Production

Setting Up Multi-Environment Infrastructure: Dev, Staging, and Production#

Running a single environment is straightforward. Running three that drift apart silently is where teams lose weeks debugging “it works in dev.” This operational sequence walks through setting up dev, staging, and production environments that stay consistent where it matters and diverge only where intended.

Phase 1 – Environment Strategy#

Step 1: Define Environments#

Each environment serves a distinct purpose (the sketch after this list shows one way to encode the differences):

  • Dev: Rapid iteration. Developers deploy frequently, break things, and recover quickly. Data is disposable. Resources are minimal.
  • Staging: Production mirror. Same Kubernetes version, same network policies, same resource quotas. External services point to staging endpoints. Used for integration testing and pre-release validation.
  • Production: Real users, real data. Changes go through approval gates. Monitoring is comprehensive and alerting reaches on-call engineers.
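
A common encoding is a single root module with per-environment variable files, so intended divergence is explicit and reviewable; a hypothetical sketch:

# envs/dev.tfvars -- disposable and minimal
environment         = "dev"
node_count          = 1
machine_type        = "e2-small"
deletion_protection = false

# envs/prod.tfvars -- approval-gated and protected
environment         = "prod"
node_count          = 5
machine_type        = "e2-standard-8"
deletion_protection = true

Each environment is then applied with terraform apply -var-file=envs/dev.tfvars (or the prod counterpart), keeping resource definitions identical across all three.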

Step 2: Isolation Model#

Decision point: Separate clusters per environment versus namespaces in a shared cluster.

Terraform Cloud Architecture Patterns: VPC/EKS/RDS on AWS, VNET/AKS on Azure, VPC/GKE on GCP

Terraform Cloud Architecture Patterns#

The three-tier architecture — networking, managed Kubernetes, managed database — is the most common pattern for production deployments on any major cloud. The concepts are identical across AWS, Azure, and GCP. The Terraform code is not. Resource names differ, required arguments differ, default behaviors differ, and the gotchas that catch agents and humans are cloud-specific.

This article shows the real Terraform for each layer on each cloud, side by side, so agents can write correct infrastructure code for whichever cloud the user deploys to.
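
As a taste of the pattern, the GCP networking layer might start like the sketch below (names and CIDR ranges hypothetical): a custom-mode VPC plus a subnet whose secondary ranges the GKE layer consumes.

resource "google_compute_network" "main" {
  name                    = "main"
  auto_create_subnetworks = false # custom-mode: subnets defined explicitly
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke"
  network       = google_compute_network.main.id
  region        = "us-central1"
  ip_cidr_range = "10.0.0.0/20"

  # Secondary ranges for VPC-native GKE pods and services
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}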

Terraform Code Quality: Patterns, Anti-Patterns, and Review Heuristics

Terraform Code Quality#

Writing Terraform that works is easy. Writing Terraform that is safe, maintainable, and comprehensible to the next person (or agent) is harder. Most quality problems are not bugs — they are patterns that work today but create pain tomorrow: hardcoded IDs that break in a new account, missing lifecycle rules that cause accidental data loss, modules that are too big to understand or too small to justify their existence.
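
The missing lifecycle rule, for instance, is a one-block fix; a sketch on a hypothetical database resource:

resource "aws_db_instance" "main" {
  identifier                  = "main"
  engine                      = "postgres"
  instance_class              = "db.t3.micro"
  allocated_storage           = 20
  username                    = "app"
  manage_master_user_password = true # RDS stores the password in Secrets Manager

  # Any plan that would destroy this resource errors out; the block
  # must be removed (and applied) before a deliberate teardown.
  lifecycle {
    prevent_destroy = true
  }
}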

Terraform Core Concepts and Workflow

Providers, Resources, and Data Sources#

Terraform has three core object types. Providers are plugins that talk to APIs (AWS, Azure, GCP, Kubernetes, GitHub). Resources are the things you create and manage. Data sources read existing objects without managing them.

# providers.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  required_version = ">= 1.5.0"
}

provider "aws" {
  region = var.region
}

# A resource Terraform creates and manages
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  tags = { Name = "main-vpc" }
}

# Instances attach to subnets, not directly to VPCs
resource "aws_subnet" "main" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# A data source that reads an existing AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
  subnet_id     = aws_subnet.main.id
}

Resources create, update, and delete. Data sources only read. If you need information about something Terraform does not manage, use a data source.

Terraform Modules: Structure, Composition, and Reuse

What Modules Are#

A Terraform module is a directory containing .tf files. Every Terraform configuration is already a module (the “root module”). When you call another module from your root module, that is a “child module.” Modules let you encapsulate a set of resources behind a clean interface of input variables and outputs.

Module Structure#

A well-organized module looks like this:

modules/vpc/
  main.tf           # resource definitions
  variables.tf      # input variables
  outputs.tf        # output values
  versions.tf       # required providers and terraform version
  README.md         # usage documentation

The module itself has no backend, no provider configuration, and no hardcoded values. Everything configurable comes in through variables. Everything downstream consumers need comes out through outputs.
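
Calling the module from the root module then looks like the sketch below (the input names and the vpc_id output are hypothetical for a VPC module):

module "vpc" {
  source = "./modules/vpc"

  # Inputs declared in modules/vpc/variables.tf
  name       = "main"
  cidr_block = "10.0.0.0/16"
}

# Consume an output declared in modules/vpc/outputs.tf
resource "aws_security_group" "web" {
  vpc_id = module.vpc.vpc_id
}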