Automating Operational Runbooks

SRE

The Manual-to-Automated Progression#

Not every runbook should be automated, and automation does not happen in a single jump. The progression builds confidence at each stage.

Level 0 – Tribal Knowledge: The procedure exists only in someone’s head. Invisible risk.

Level 1 – Documented Runbook: Step-by-step instructions a human follows, including commands, expected outputs, and decision points. Every runbook starts here.

Level 2 – Scripted Runbook: Manual steps encoded in a script that a human triggers and monitors. The script handles tedious parts; the human handles judgment calls.
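A sketch of what Level 2 can look like in practice: the script below automates the mechanical steps of a hypothetical service restart and leaves the judgment call to the operator. The service name and commands are illustrative assumptions, not taken from any specific runbook.

```python
#!/usr/bin/env python3
"""Level 2 runbook sketch: restart a hypothetical worker service.

The script handles the tedious parts (collecting state, restarting, post-checks);
a human still triggers it and makes the judgment call.
"""
import subprocess
import sys

SERVICE = "example-worker"  # illustrative service name


def run(cmd: list[str], check: bool = True) -> str:
    """Run a command, echo it for the operator, and return its output."""
    print(f"$ {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True, check=check)
    print(result.stdout)
    return result.stdout


def main() -> None:
    # Tedious part: gather current state so the operator sees it in one place.
    run(["systemctl", "status", SERVICE, "--no-pager"], check=False)

    # Judgment call stays with the human, exactly as the written runbook says.
    answer = input(f"Status and queue depth look safe to restart {SERVICE}? [y/N] ")
    if answer.strip().lower() != "y":
        print("Aborting; no changes made.")
        sys.exit(1)

    # Tedious part: the restart and verification the operator would otherwise type.
    run(["systemctl", "restart", SERVICE])
    run(["systemctl", "is-active", SERVICE])
    print("Restart complete; verify dashboards before closing the ticket.")


if __name__ == "__main__":
    main()
```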


CDN and Edge Computing Patterns#

A CDN (Content Delivery Network) caches content at edge locations close to users, reducing latency and offloading traffic from origin servers. Edge computing extends this by running custom code at those edge locations, enabling request transformation, authentication, A/B testing, and dynamic content generation without round-tripping to an origin server.

CDN Cache Fundamentals#

Cache-Control Headers#

The origin server controls CDN caching behavior through HTTP headers. Getting these right is the single most impactful CDN optimization.
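For example, an origin behind a CDN can mark a catalogue endpoint as cacheable by shared caches and a personalised endpoint as uncacheable. A minimal sketch, assuming a Flask origin; the routes and TTL values are illustrative assumptions, not recommendations:

```python
# Sketch of an origin that tells the CDN how to cache each response.
# Assumes Flask is installed; paths and TTLs are placeholders.
from flask import Flask, jsonify

app = Flask(__name__)


@app.get("/api/products")
def products():
    resp = jsonify(items=["example"])
    # Browsers may keep this for 1 minute, shared caches (the CDN) for 10 minutes,
    # and the CDN may serve it stale for 30s while it revalidates in the background.
    resp.headers["Cache-Control"] = "public, max-age=60, s-maxage=600, stale-while-revalidate=30"
    return resp


@app.get("/api/account")
def account():
    resp = jsonify(user="example")
    # Personalised responses must never be stored by a shared cache.
    resp.headers["Cache-Control"] = "no-store"
    return resp
```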

Change Management for Infrastructure

SRE

Why Change Management Matters#

Most production incidents trace back to a change. Code deployments, configuration updates, infrastructure modifications, database migrations – each introduces risk. Change management reduces that risk through structure, visibility, and accountability. The goal is not to prevent change but to make change safe, visible, and reversible.

Change Request Process#

Every infrastructure change flows through a structured request. The formality scales with risk, but the basic elements remain constant.
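Those basic elements translate directly into a record most teams keep in a ticketing system. The sketch below captures one reasonable set of fields; the names are illustrative assumptions, not a standard.

```python
# Sketch of the basic elements of a change request, expressed as a data structure.
# Field names are illustrative; most ticketing tools capture something similar.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ChangeRequest:
    summary: str                 # what is changing, in one line
    risk: str                    # low / medium / high; scales the required formality
    rollback_plan: str           # how the change is reversed if it goes wrong
    validation_steps: list[str]  # how success is verified after the change
    window_start: datetime       # when the change is allowed to begin
    approvers: list[str] = field(default_factory=list)  # who signed off
```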

Cloud Behavioral Divergence Guide: Where AWS, Azure, and GCP Actually Differ

Cloud Behavioral Divergence Guide#

Running the “same” workload on AWS, Azure, and GCP does not produce the same behavior. The Kubernetes API is portable, application containers are portable, and SQL queries are portable. Everything else – identity, networking, storage, load balancing, DNS, and managed service behavior – diverges in ways that matter for production reliability.

This guide documents the specific divergence points with practical examples. Use it when translating infrastructure from one cloud to another, when debugging behavior that differs between environments, or when assessing migration risk.

Cloud Cost Optimization

The Cost Optimization Hierarchy#

Cloud cost optimization follows a hierarchy of impact. Work from the top down – fixing the wrong tier of commitment discount matters far less than shutting down resources nobody uses.

  1. Eliminate waste – turn off unused resources, delete orphaned storage
  2. Right-size – match instance sizes to actual usage
  3. Use commitment discounts – reserved instances, savings plans, CUDs
  4. Shift to spot/preemptible – for fault-tolerant workloads
  5. Optimize storage and network – tiering, transfer patterns, caching
  6. Architect for cost – serverless, auto-scaling, multi-region strategy

Eliminating Waste#

The fastest cost reduction comes from finding resources that serve no purpose. Every cloud provider accumulates these: instances left running after a test, snapshots from decommissioned servers, load balancers with no backends, unattached disks.
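A sweep for one of those categories can be a short read-only script. The sketch below lists unattached EBS volumes in a single region; it assumes AWS with boto3 and configured credentials, and the region is a placeholder.

```python
# Sketch: find unattached EBS volumes in one region, a common form of orphaned storage.
# Assumes boto3 is installed and credentials are configured; read-only API calls only.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Volumes with status "available" are not attached to any instance.
paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
    for vol in page["Volumes"]:
        print(vol["VolumeId"], vol["Size"], "GiB", vol["VolumeType"], vol["CreateTime"])
```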

Cloud Migration Strategies: The 7 Rs Framework

Cloud Migration Strategies#

A company does not “migrate to the cloud” – it migrates dozens or hundreds of applications, each with different characteristics, dependencies, and risk profiles. The 7 Rs framework provides vocabulary for per-workload decisions, but selecting the right R requires understanding the application, its dependencies, and the organization’s tolerance for change.

The 7 Rs#

Rehost (Lift and Shift)#

Move the application to cloud infrastructure with minimal changes. A VM on-premises becomes an EC2 instance. OS, application code, and configuration remain the same.

Cloud Vendor Product Matrix: Comparing Cloudflare, AWS, Azure, and GCP

Cloud Vendor Product Matrix#

Choosing between cloud vendors requires mapping equivalent services across providers. AWS has 200+ services. Azure has 200+. GCP has 100+. Cloudflare has 20+ but they are tightly integrated and edge-native. This article maps the services that matter for most applications – compute, serverless, databases, storage, networking, and observability – across all four vendors with pricing, availability, and portability for each.

How to Use This Matrix#

Each section maps equivalent products across vendors, then provides:

Cloud-Native vs Portable Infrastructure: A Decision Framework

Cloud-Native vs Portable Infrastructure#

Every infrastructure decision sits on a spectrum between portability and fidelity. On one end, you have generic Kubernetes running on minikube or kind – it works everywhere, costs nothing, and captures the behavior of the Kubernetes API itself. On the other end, you have cloud-native managed services – EKS with IRSA and ALB Ingress Controller, GKE with Workload Identity and Cloud Load Balancing, AKS with Azure AD Pod Identity and Azure Load Balancer. These capture the behavior of the actual platform your workloads will run on.


Database High Availability Patterns#

Every database HA decision starts with two numbers: RPO (Recovery Point Objective – how much data you can afford to lose) and RTO (Recovery Time Objective – how long the database can be unavailable). These numbers dictate the pattern, and each pattern carries specific operational tradeoffs.

Core Concepts#

RPO = 0 means zero data loss. Every committed transaction must survive a failure. This requires synchronous replication, which adds latency to every write.
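To make that concrete on one engine: in PostgreSQL, synchronous replication shows up in pg_stat_replication as a standby with sync_state of 'sync' (or 'quorum'), and RPO = 0 only holds while such a standby is attached. A minimal monitoring sketch, assuming psycopg2 and a placeholder connection string:

```python
# Sketch: verify that at least one PostgreSQL standby is replicating synchronously.
# Assumes psycopg2 is installed; the connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("host=primary.example.internal dbname=postgres user=monitor")
with conn, conn.cursor() as cur:
    cur.execute("SELECT application_name, state, sync_state FROM pg_stat_replication")
    rows = cur.fetchall()

synchronous = [r for r in rows if r[2] in ("sync", "quorum")]
if not synchronous:
    # Without a synchronous standby, a primary failure can lose committed but
    # unshipped WAL, so the effective RPO is no longer zero.
    print("WARNING: no synchronous standby attached")
for name, state, sync_state in rows:
    print(name, state, sync_state)
```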


Database Performance Investigation Runbook#

When a database is slow, resist the urge to immediately tune configuration parameters. Follow this sequence: identify what is slow, understand why, then fix the specific bottleneck. Most performance problems are caused by missing indexes or a single bad query, not global configuration issues.

Phase 1 – Identify Slow Queries#

The first step is always finding which queries are consuming the most time.

PostgreSQL: pg_stat_statements#

Enable the extension if it is not already loaded: pg_stat_statements must be listed in shared_preload_libraries (which requires a server restart), and CREATE EXTENSION then exposes its view in the target database.
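A sketch of the enablement and the top-query check driven from Python is shown below. It assumes psycopg2 is installed, the connection string is a placeholder, and the column names are those of PostgreSQL 13 and later (older versions call them total_time and mean_time).

```python
# Sketch: create the extension and pull the top queries by cumulative execution time.
# Assumes psycopg2, PostgreSQL 13+ column names, and a server started with
# shared_preload_libraries = 'pg_stat_statements'; the connection string is a placeholder.
import psycopg2

conn = psycopg2.connect("host=db.example.internal dbname=app user=postgres")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_stat_statements")
    cur.execute(
        """
        SELECT query, calls, total_exec_time, mean_exec_time, rows
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
        """
    )
    for query, calls, total_ms, mean_ms, row_count in cur.fetchall():
        print(f"{total_ms:10.1f} ms total  {mean_ms:8.2f} ms avg  {calls:8d} calls  {query[:80]}")
```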