GCP Terraform Patterns: Projects, GKE, Workload Identity, Cloud SQL, and Common Gotchas

GCP Terraform Patterns#

GCP’s Terraform providers (google and google-beta) have patterns distinct from both the AWS and Azure providers. The biggest differences: APIs must be explicitly enabled per project, IAM uses a binding model (not inline policies), and GKE requires secondary IP ranges for VPC-native networking. GCP resources also tend to have longer creation times and more eventual-consistency delays.

Projects and API Enablement#

Before creating any resource in GCP, the corresponding API must be enabled in the project. This is the most common source of first-time failures.
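
Terraform manages this with the google_project_service resource; the gcloud equivalent looks like the sketch below (the project ID and the exact API list are placeholders for whatever your stack needs):

# Enable the APIs a typical GKE + Cloud SQL stack depends on
gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  sqladmin.googleapis.com \
  servicenetworking.googleapis.com \
  --project my-project-id

# Confirm what is enabled before running terraform apply
gcloud services list --enabled --project my-project-id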

GitOps and Infrastructure as Code: Reconciliation Patterns for Terraform, ArgoCD, and Crossplane

GitOps and Infrastructure as Code#

GitOps says: the desired state is in Git. A controller continuously reconciles the real state to match. Infrastructure as Code says: the desired state is in code. A human (or agent) runs apply to push changes.

These two paradigms overlap but do not align perfectly. Kubernetes resources fit the GitOps model well — ArgoCD/Flux watch Git, detect differences, and apply changes continuously. Cloud infrastructure (VPCs, databases, IAM roles) fits the IaC model better — Terraform tracks state, computes diffs, and applies on command.
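
To make the contrast concrete, the sketch below registers an application with the ArgoCD CLI so the controller reconciles a Git path continuously, then shows the explicit apply that the IaC model relies on (repository URL, paths, and names are placeholders):

# GitOps: register the app once; ArgoCD keeps the cluster in sync with Git
argocd app create demo-app \
  --repo https://github.com/example/deploy-manifests.git \
  --path apps/demo \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace demo \
  --sync-policy automated \
  --self-heal

# IaC: a human or pipeline computes the diff and pushes it on command
terraform plan -out=tfplan
terraform apply tfplan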

GKE Networking#

GKE networking centers on VPC-native clusters, where pods and services get IP addresses from VPC subnet ranges. This integrates Kubernetes networking directly into Google Cloud’s VPC, enabling native routing, firewall rules, and load balancing without extra overlays.

VPC-Native Clusters and Alias IP Ranges#

VPC-native clusters use alias IP ranges on the subnet. You allocate two secondary ranges: one for pods, one for services.

# Create subnet with secondary ranges
gcloud compute networks subnets create gke-subnet \
  --network my-vpc \
  --region us-central1 \
  --range 10.0.0.0/20 \
  --secondary-range pods=10.4.0.0/14,services=10.8.0.0/20

# Create cluster using those ranges
gcloud container clusters create my-cluster \
  --region us-central1 \
  --network my-vpc \
  --subnetwork gke-subnet \
  --cluster-secondary-range-name pods \
  --services-secondary-range-name services \
  --enable-ip-alias

The pod range needs to be large. A /14 gives about 262,000 pod IPs. Each node reserves a /24 from the pod range (256 IPs – roughly twice the default maximum of 110 pods per node, which leaves headroom for pod churn). If you have 100 nodes, that consumes 100 /24 blocks. Undersizing the pod range is a common cause of IP exhaustion – the cluster cannot add nodes even though VMs are available.

GKE Security and Identity#

GKE security covers identity (who can do what), workload isolation (sandboxing untrusted code), supply chain integrity (ensuring only trusted images run), and data protection (encryption at rest). These features layer on top of standard Kubernetes RBAC and network policies.

Workload Identity Federation#

Workload Identity Federation for GKE is the successor to (and the current name for) the original Workload Identity. It uses the standard GCP IAM federation model and can grant IAM roles to Kubernetes identities directly, but the common pattern is the same: bind a Kubernetes service account to a Google Cloud service account so pods get GCP credentials without exported keys.
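
A minimal sketch of that binding, assuming the cluster already has a workload identity pool enabled and using placeholder project, namespace, and account names:

# Allow the Kubernetes service account to impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding \
  app-sa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/app-ksa]"

# Annotate the Kubernetes service account so GKE exchanges its token for GCP credentials
kubectl annotate serviceaccount app-ksa \
  --namespace my-namespace \
  iam.gke.io/gcp-service-account=app-sa@my-project.iam.gserviceaccount.com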

GKE Setup and Configuration#

GKE is Google’s managed Kubernetes service. The two major decisions when creating a cluster are the mode (Standard vs Autopilot) and the networking model (VPC-native is now the default for new clusters and the only option in Autopilot). Everything else – node pools, release channels, Workload Identity – layers on top of those choices.

Standard vs Autopilot#

Standard mode gives you full control over node pools, machine types, and node configuration. You manage capacity, pay per node (whether pods are using the resources or not), and can run DaemonSets, privileged containers, and host-network pods.
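
A sketch of creating each mode with gcloud (cluster names, region, machine type, and node count are placeholders):

# Standard: you choose node pools and machine types, and pay per node
gcloud container clusters create standard-cluster \
  --region us-central1 \
  --machine-type e2-standard-4 \
  --num-nodes 3

# Autopilot: Google manages the nodes; you pay for pod resource requests
gcloud container clusters create-auto autopilot-cluster \
  --region us-central1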

Grafana Dashboards for Kubernetes Monitoring

Data Source Configuration#

Grafana connects to backend data stores through data sources. For a complete Kubernetes observability stack, you need three: Prometheus for metrics, Loki for logs, and Tempo for traces.

Provision data sources declaratively so they survive Grafana restarts and are version-controlled:

# grafana/provisioning/datasources/observability.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3100
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{key: "service.name", value: "job"}]
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true

The cross-linking configuration lets you click from a metric data point to the trace that generated it, and extract trace IDs from log lines to link to Tempo.

Grafana Organization: Folders, Permissions, Provisioning, and Dashboard Lifecycle

Folder Structure Strategy#

Grafana folders organize dashboards and control access through permissions. The folder structure you choose determines how teams find dashboards and who can edit them. Three patterns work in practice, each suited to a different organizational shape.

By Team#

When teams own distinct services and rarely need cross-team dashboards:

Platform/
  Node Overview
  Kubernetes Cluster
  Networking
Backend/
  API Gateway
  User Service
  Payment Service
Frontend/
  Web Vitals
  CDN Performance
Data/
  Kafka Pipelines
  ETL Jobs
  Data Quality

Each team gets Editor access to their folder and Viewer access to everything else. This works well when ownership boundaries are clear.
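
Folder creation and permissions can be scripted against the Grafana HTTP API instead of clicked through the UI. A minimal sketch, assuming an admin token in GRAFANA_TOKEN and a Backend team with ID 5 (permission 1 = Viewer, 2 = Editor):

# Create the folder; the response includes the uid used below
curl -s -X POST http://grafana.example.com/api/folders \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "Backend"}'

# Replace the folder's permissions: org-wide Viewer, Editor for the owning team
curl -s -X POST http://grafana.example.com/api/folders/<folder-uid>/permissions \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"items": [{"role": "Viewer", "permission": 1}, {"teamId": 5, "permission": 2}]}'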

Implementing Compliance as Code#

Compliance as code encodes compliance requirements as machine-readable policies evaluated automatically, continuously, and with every change. Instead of quarterly spreadsheet audits, a policy like “all S3 buckets must have encryption enabled” becomes a check that runs in CI, blocks non-compliant Terraform plans, and scans running infrastructure hourly. Evidence generation is automatic. Drift is detected immediately.
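
One way such a check can run in CI is Open Policy Agent’s conftest evaluating a rendered Terraform plan; a minimal sketch (the policy directory and file names are placeholders):

# Render the plan as JSON so policies can inspect every planned resource
terraform plan -out=tfplan
terraform show -json tfplan > plan.json

# Evaluate Rego policies; a non-zero exit code blocks the pipeline
conftest test plan.json --policy compliance/policies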

Step 1: Map Compliance Controls to Technical Policies#

Translate your compliance framework’s controls into specific, testable technical requirements. This mapping bridges auditor language and infrastructure code.
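
One way to keep this mapping auditable is to name each policy file after the control it implements, so a control traces directly to its check. A hypothetical layout (the control-to-check pairings are illustrative, not an authoritative mapping):

compliance/policies/
  soc2/
    cc6.1-storage-encryption-at-rest.rego
    cc6.6-no-public-buckets.rego
  pci/
    req-3.4-database-encryption.rego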

Infrastructure Capacity Planning: Measurement, Projection, and Scaling

What Capacity Planning Solves#

Running out of capacity during a traffic spike causes outages. Over-provisioning wastes money continuously. Capacity planning is the process of measuring what you use now, projecting what you will need, and ensuring resources are available before demand arrives. Without it, you are either constantly firefighting resource exhaustion or explaining to finance why your cloud bill doubled.

Capacity planning is not a one-time exercise. It is a recurring process – monthly for fast-growing services, quarterly for stable ones.
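
Both the measurement and the projection can come from metrics you already collect. A minimal sketch using the Prometheus HTTP API, where predict_linear extrapolates a week of filesystem history 30 days forward (the Prometheus URL and metric selector are placeholders):

# Project available filesystem space 30 days out from the last 7 days of data
curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode \
  'query=predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30 * 24 * 3600)'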

Infrastructure Disaster Recovery with Terraform: State Recovery, Blue-Green Infrastructure, and Rebuild Procedures

Infrastructure Disaster Recovery with Terraform#

Application disaster recovery is well-understood: replicate data, failover traffic, restore from backups. Infrastructure disaster recovery is different — you are recovering the platform that applications run on. If your Terraform state is lost, your VPC is deleted, or an entire region goes down, how do you rebuild?

This article covers the DR patterns specific to Terraform-managed infrastructure: protecting state, recovering from state loss, designing infrastructure for regional failover, and the runbooks that agents and operators need when things go wrong.
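
As a first line of defense for state, the sketch below assumes a GCS backend (S3 versioning works analogously): turn on object versioning for the state bucket, then use object generations to find and restore a known-good state file. Bucket, key, and generation values are placeholders:

# Keep every prior state file retrievable as an object generation
gsutil versioning set on gs://my-terraform-state

# List generations of the state object, then copy a known-good one back into place
gsutil ls -a gs://my-terraform-state/env/prod/default.tfstate
gsutil cp \
  gs://my-terraform-state/env/prod/default.tfstate#1717000000000000 \
  gs://my-terraform-state/env/prod/default.tfstate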