Reliability Review Process

SRE

Reliability-Assessment, Error-Budget-Review, Incident-Trend-Analysis, Risk-Assessment

Reliability-Review, Error-Budget, Incident-Trends, Dependency-Risk, SRE, Metrics-Review

Grafana, Prometheus, Datadog, Jira, Confluence, PagerDuty, Opsgenie

Why Regular Reviews Matter

Reliability does not improve by accident. Without a structured review cadence, teams operate on vibes – “things feel okay” or “we’ve been having a lot of incidents lately.” Reliability reviews replace gut feelings with data. They surface slow-burning problems before they become outages, hold teams accountable for improvement actions, and create a shared understanding of system health across engineering and leadership.

Weekly Reliability Review

The weekly review is a 30-minute tactical meeting focused on what happened this week and what needs attention next week. Attendees: on-call engineers, team leads, SRE. Keep it tight.

Change Management for Infrastructure

February 22, 2026

SRE

Intermediate, Advanced

Change-Request-Workflow, Risk-Assessment, Rollback-Planning, Progressive-Rollout-Execution, Change-Freeze-Management

Change-Management, Rollback, Progressive-Rollout, Risk-Assessment, Change-Freeze, Infrastructure, Deployment

Git, Jira, PagerDuty, Slack, Terraform, Helm, Argo CD, kubectl

Why Change Management Matters

Most production incidents trace back to a change. Code deployments, configuration updates, infrastructure modifications, database migrations – each introduces risk. Change management reduces that risk through structure, visibility, and accountability. The goal is not to prevent change but to make change safe, visible, and reversible.

Change Request Process

Every infrastructure change flows through a structured request. The formality scales with risk, but the basic elements remain constant.
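As a minimal sketch of how formality might scale with risk while the basic elements stay fixed, the Python below models a change request. The field names, the three risk levels, and the approval counts are illustrative assumptions, not part of any specific tool or of the process described here.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical policy: approvals required at each risk level.
REQUIRED_APPROVALS = {Risk.LOW: 0, Risk.MEDIUM: 1, Risk.HIGH: 2}


@dataclass
class ChangeRequest:
    # Elements assumed to be present on every request, whatever its risk.
    summary: str
    risk: Risk
    rollback_plan: str
    approvers: tuple[str, ...] = ()

    def problems(self) -> list[str]:
        """Return the reasons this request is not ready to execute."""
        issues = []
        if not self.summary.strip():
            issues.append("missing summary")
        if not self.rollback_plan.strip():
            issues.append("missing rollback plan")
        needed = REQUIRED_APPROVALS[self.risk]
        have = len(set(self.approvers))  # count distinct approvers only
        if have < needed:
            issues.append(f"needs {needed} approval(s), has {have}")
        return issues

    def ready(self) -> bool:
        return not self.problems()
```

In this sketch a low-risk change needs only a summary and a rollback plan, while a high-risk change is blocked until two distinct people have approved it. The constant elements are checked the same way at every level.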

Terraform Safety for Agents: Plans, Applies, and the Human Approval Gate

February 22, 2026

Infrastructure

Intermediate

Terraform-Plan-Interpretation, Risk-Assessment, Plan-Presentation, State-Lock-Management, Drift-Investigation

Terraform, Agent-Safety, Plan-Apply, Approval-Gates, State-Locks, Drift, Destructive-Actions, Risk-Assessment, Human-in-the-Loop

Terraform, Claude-Code

Terraform Safety for Agents

Terraform is the most dangerous tool most agents have access to. A single terraform apply can create, modify, or destroy real infrastructure — databases with production data, networking that carries live traffic, security groups that protect running services. There is no undo button. terraform destroy is not an undo — it is a different destructive action.

This article defines the safety protocols agents must follow when working with Terraform: what to check before every plan, how to read plan output for danger, how to present plans to humans, when to apply versus when to stop, and how to handle state conflicts.
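One way to read plan output for danger is to inspect the machine-readable plan rather than the human-readable text. The sketch below assumes the plan was saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan`, and it relies on the `resource_changes[].address` and `resource_changes[].change.actions` fields of that JSON. The function names and the stop-on-delete policy are assumptions made for illustration, not the article's own protocol.

```python
import json


def classify(actions):
    """Map a Terraform action list to one coarse change kind."""
    # A replacement appears as ["delete", "create"] or ["create", "delete"],
    # so any list containing "delete" removes an existing resource.
    if "delete" in actions:
        return "replace" if "create" in actions else "destroy"
    if "update" in actions:
        return "update"
    if "create" in actions:
        return "create"
    return "no-op"  # covers ["no-op"] and ["read"]


def summarize_plan(plan_json: str) -> dict:
    """Group resource addresses by the kind of change the plan proposes."""
    plan = json.loads(plan_json)
    summary = {"create": [], "update": [], "replace": [], "destroy": [], "no-op": []}
    for rc in plan.get("resource_changes", []):
        kind = classify(rc["change"]["actions"])
        summary[kind].append(rc["address"])
    return summary


def requires_human_approval(summary: dict) -> bool:
    # Assumed policy for this sketch: any delete or replace stops the agent.
    return bool(summary["destroy"] or summary["replace"])
```

An agent could call `summarize_plan` on the exported JSON, list the `destroy` and `replace` addresses first when presenting the plan, and refuse to run an apply while `requires_human_approval` returns true. Updates can also be dangerous (a security group rule change, for example), so a real gate would likely be stricter than this.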

Threat Modeling for Developers: STRIDE, Attack Surfaces, Data Flow Diagrams, and Prioritization

February 22, 2026

Security

Intermediate

Threat-Modeling, Attack-Surface-Analysis, Risk-Assessment, Security-Architecture

Threat-Modeling, STRIDE, Attack-Surface, Security-Design, Risk-Assessment, SDL

Draw-Io, OWASP-Threat-Dragon, Microsoft-Threat-Modeling-Tool

Threat Modeling for Developers

Threat modeling is the practice of systematically identifying what can go wrong in a system before it goes wrong. It is not a security team activity that happens once. It is a design activity that happens every time the architecture changes.

The output of threat modeling is not a report that sits in a wiki. It is a prioritized list of threats that becomes security requirements in the backlog.
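To make the idea of a prioritized threat list concrete, here is a small sketch that tags each threat with a STRIDE category and orders threats by a simple likelihood times impact score. The 1 to 5 scales, the scoring formula, and the backlog line format are assumptions chosen for illustration. Teams often use other schemes.

```python
from dataclasses import dataclass
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of service"
    ELEVATION_OF_PRIVILEGE = "Elevation of privilege"


@dataclass(frozen=True)
class Threat:
    title: str
    category: Stride
    likelihood: int  # assumed scale: 1 (rare) to 5 (expected)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def prioritize(threats):
    """Order threats from highest to lowest score; ties keep input order."""
    return sorted(threats, key=lambda t: t.score, reverse=True)


def to_backlog(threats):
    """Render prioritized threats as one-line backlog entries."""
    return [f"[{t.category.value}] {t.title} (score {t.score})" for t in prioritize(threats)]
```

Each backlog line can then be turned into a security requirement ticket, with the highest-scoring threats addressed first.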