Change Management for Infrastructure

Why Change Management Matters#

Most production incidents trace back to a change. Code deployments, configuration updates, infrastructure modifications, database migrations – each introduces risk. Change management reduces that risk through structure, visibility, and accountability. The goal is not to prevent change but to make change safe, visible, and reversible.

Change Request Process#

Every infrastructure change flows through a structured request. The formality scales with risk, but the basic elements remain constant.

Chaos Engineering: From First Experiment to Mature Practice

Why Break Things on Purpose#

Production systems fail in ways that testing environments never reveal. A database connection pool exhaustion under load, a cascading timeout across three services, a DNS cache that masks a routing change until it expires – these failures only surface when real conditions collide in ways nobody predicted. Chaos engineering is the discipline of deliberately injecting failures into a system to discover weaknesses before they cause outages.

Choosing a CNI Plugin: Calico vs Cilium vs Flannel vs Cloud-Native CNI

Choosing a CNI Plugin#

The Container Network Interface (CNI) plugin is one of the most consequential infrastructure decisions in a Kubernetes cluster. It determines how pods get IP addresses, how traffic flows between them, whether network policies are enforced, and what observability you get into network behavior. Changing CNI after deployment is painful – it typically requires draining and rebuilding nodes, or rebuilding the cluster entirely. Choose carefully up front.

Choosing a Kubernetes Policy Engine: OPA/Gatekeeper vs Kyverno vs Pod Security Admission

Choosing a Kubernetes Policy Engine#

Kubernetes does not enforce security best practices by default. A freshly deployed cluster allows containers to run as root, pull images from any registry, mount the host filesystem, and use the host network. Policy engines close this gap by intercepting API requests through admission webhooks and rejecting or modifying resources that violate your rules.

The three main options – Pod Security Admission (built-in), OPA Gatekeeper, and Kyverno – serve different needs. Choosing the wrong one leads to either insufficient enforcement or unnecessary operational burden.

Choosing a Secret Management Strategy: K8s Secrets vs Vault vs Sealed Secrets vs External Secrets

Choosing a Secret Management Strategy#

Secrets – database credentials, API keys, TLS certificates, encryption keys – must be available to pods at runtime. At the same time, they must not be stored in plain text in git, should be rotatable without downtime, and should produce an audit trail showing who accessed what and when. No single tool satisfies every requirement, and the right choice depends on your security maturity, operational capacity, and compliance obligations.

Choosing an Autoscaling Strategy: HPA vs VPA vs KEDA vs Karpenter/Cluster Autoscaler

Choosing an Autoscaling Strategy#

Kubernetes autoscaling operates at two distinct layers: pod-level scaling changes how many pods run or how large they are, while node-level scaling changes how many nodes exist in the cluster to host those pods. Getting the right combination of tools at each layer is the key to a system that responds to demand without wasting resources.

The Two Scaling Layers#

Understanding which layer a tool operates on prevents the most common misconfiguration – expecting pod-level scaling to solve node-level capacity problems, or vice versa.

Choosing an Ingress Controller: Nginx vs Traefik vs HAProxy vs Cloud ALB/NLB

Choosing an Ingress Controller#

An Ingress controller is the component that actually routes external traffic into your cluster. The Ingress resource (or Gateway API resource) defines the rules – which hostnames and paths map to which backend Services – but without a controller watching those resources and configuring a reverse proxy, nothing happens. The choice of controller affects performance, configuration ergonomics, TLS management, protocol support, and operational cost.

Unlike CNI plugins, you can run multiple ingress controllers in the same cluster, which is a common pattern for separating internal and external traffic. This reduces the stakes of any single choice, but your primary controller still deserves careful selection.

Choosing Kubernetes Workload Types: Deployment vs StatefulSet vs DaemonSet vs Job

Choosing Kubernetes Workload Types#

Kubernetes provides several workload controllers, each designed for a specific class of application behavior. Choosing the wrong one leads to data loss, unnecessary complexity, or workloads that fight the platform instead of leveraging it. This guide walks through the decision criteria and tradeoffs for each type.

The Workload Types at a Glance#

Workload TypeLifecyclePod IdentityScaling ModelStorage ModelTypical Use
DeploymentLong-runningInterchangeableHorizontal replicasShared or noneWeb servers, APIs, stateless microservices
StatefulSetLong-runningStable, orderedOrdered horizontalPer-pod persistentDatabases, message queues, distributed consensus
DaemonSetLong-runningOne per nodeTied to node countNode-localLog collectors, monitoring agents, network plugins
JobRun to completionDisposableParallel completionsEphemeralBatch processing, migrations, one-time tasks
CronJobScheduledDisposablePer-schedule runEphemeralPeriodic backups, cleanup, scheduled reports
ReplicaSetLong-runningInterchangeableHorizontal replicasShared or noneAlmost never used directly

Decision Criteria#

The choice comes down to four questions:

Cloud Behavioral Divergence Guide: Where AWS, Azure, and GCP Actually Differ

Cloud Behavioral Divergence Guide#

Running the “same” workload on AWS, Azure, and GCP does not produce the same behavior. The Kubernetes API is portable, application containers are portable, and SQL queries are portable. Everything else – identity, networking, storage, load balancing, DNS, and managed service behavior – diverges in ways that matter for production reliability.

This guide documents the specific divergence points with practical examples. Use it when translating infrastructure from one cloud to another, when debugging behavior that differs between environments, or when assessing migration risk.

Cloud-Native vs Portable Infrastructure: A Decision Framework

Cloud-Native vs Portable Infrastructure#

Every infrastructure decision sits on a spectrum between portability and fidelity. On one end, you have generic Kubernetes running on minikube or kind – it works everywhere, costs nothing, and captures the behavior of the Kubernetes API itself. On the other end, you have cloud-native managed services – EKS with IRSA and ALB Ingress Controller, GKE with Workload Identity and Cloud Load Balancing, AKS with Azure AD Pod Identity and Azure Load Balancer. These capture the behavior of the actual platform your workloads will run on.