Opsgenie

On-Call Rotation Design

SRE

On-Call-Design, Escalation-Policy-Design, Alert-Tuning, On-Call-Management

On-Call, Pagerduty, Opsgenie, Escalation, Alert-Fatigue, Rotation, Sre

Pagerduty, Opsgenie, Slack, Grafana, Prometheus

On-Call Is a System, Not a Schedule

On-call done wrong burns out engineers and degrades reliability simultaneously. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable, well-compensated, and generates signal that drives real reliability improvements.

Rotation Schedule Types

Weekly Rotation

Each engineer is primary on-call for one full week, Monday to Monday. This is the simplest model and works for teams of 5 or more in a single timezone.
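A minimal sketch of the weekly model, assuming a simple round-robin order and a Monday handoff. The function name `weekly_rotation` and the engineer names are hypothetical, chosen for illustration only; real schedules also need overrides for vacations and swaps, which this sketch ignores.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Return (shift_start, shift_end, engineer) tuples, one per week.

    Assumes round-robin order and Monday-to-Monday shifts.
    """
    if start.weekday() != 0:  # 0 is Monday in Python's datetime
        raise ValueError("start must be a Monday")
    schedule = []
    for i in range(weeks):
        shift_start = start + timedelta(weeks=i)
        shift_end = shift_start + timedelta(weeks=1)  # handoff the next Monday
        schedule.append((shift_start, shift_end, engineers[i % len(engineers)]))
    return schedule

team = ["ana", "ben", "chen", "dee", "eli"]  # hypothetical five-person team
schedule = weekly_rotation(team, date(2026, 3, 2), 6)
for shift_start, shift_end, engineer in schedule:
    print(f"{shift_start} -> {shift_end}: {engineer}")
```

With five engineers, each person is primary one week in five, and the sixth week wraps back to the first engineer.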

Platform Team Structure and Operating Model

February 22, 2026

Platform-Engineering

Intermediate, Advanced

Team-Design, Platform-Strategy, Slo-Definition, Api-Design

Platform-Team, Team-Topologies, Platform-as-Product, Slos, Api-First, Organizational-Design, Platform-Engineering

Backstage, Kubernetes, Terraform, Opsgenie, Pagerduty, Prometheus

Why the Operating Model Matters

The platform team’s operating model determines whether the platform becomes a force multiplier or a bottleneck. A ticket-driven, gatekeeper-oriented team produces a platform developers route around. A product-oriented, self-service team produces a platform developers adopt voluntarily. Organizational structure shapes developer experience more than technology choices.

Team Topologies and Interaction Modes

The Team Topologies framework (Skelton & Pais) defines four team types relevant to platform engineering:

Production Readiness Reviews

SRE

Intermediate, Advanced

Production-Readiness-Assessment, Launch-Gate-Design, Service-Evaluation

Production-Readiness, Prr, Launch-Checklist, Reliability, Observability, Security, Capacity

Jira, Confluence, Grafana, Prometheus, Pagerduty, Opsgenie, Terraform

Why Services Need a Gate Before Production

Every production outage caused by a service that launched without monitoring, without runbooks, without capacity planning, without anyone knowing who owns it at 3 AM – every one of those was preventable. A production readiness review is the gate between “it works on my machine” and “it is ready for real users.” Google formalized this as the PRR process. You do not need Google-scale infrastructure to benefit from it.
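The gate can be sketched as a checklist that blocks launch until every gap is closed. This is an illustration only, assuming the four items named above (monitoring, runbooks, capacity planning, a known owner) are the whole checklist; the `ServiceReadiness` type and its field names are hypothetical, and a real review covers more than four booleans.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ServiceReadiness:
    """Hypothetical record of what a service has in place before launch."""
    name: str
    has_monitoring: bool
    has_runbooks: bool
    has_capacity_plan: bool
    on_call_owner: Optional[str]  # who gets paged at 3 AM, or None if nobody knows

def readiness_gaps(svc: ServiceReadiness) -> List[str]:
    """List the checklist items the service is still missing."""
    gaps = []
    if not svc.has_monitoring:
        gaps.append("monitoring")
    if not svc.has_runbooks:
        gaps.append("runbooks")
    if not svc.has_capacity_plan:
        gaps.append("capacity plan")
    if not svc.on_call_owner:
        gaps.append("owner")
    return gaps

def ready_for_production(svc: ServiceReadiness) -> bool:
    """The gate passes only when no gaps remain."""
    return not readiness_gaps(svc)

checkout = ServiceReadiness("checkout", True, False, True, None)
print(readiness_gaps(checkout))        # ['runbooks', 'owner']
print(ready_for_production(checkout))  # False
```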

Reliability Review Process

SRE

Intermediate, Advanced

Reliability-Assessment, Error-Budget-Review, Incident-Trend-Analysis, Risk-Assessment

Reliability-Review, Error-Budget, Incident-Trends, Dependency-Risk, Sre, Metrics-Review

Grafana, Prometheus, Datadog, Jira, Confluence, Pagerduty, Opsgenie

Why Regular Reviews Matter

Reliability does not improve by accident. Without a structured review cadence, teams operate on vibes – “things feel okay” or “we’ve been having a lot of incidents lately.” Reliability reviews replace gut feelings with data. They surface slow-burning problems before they become outages, hold teams accountable for improvement actions, and create a shared understanding of system health across engineering and leadership.

Weekly Reliability Review

The weekly review is a 30-minute tactical meeting focused on what happened this week and what needs attention next week. Attendees: on-call engineers, team leads, SRE. Keep it tight.

Blameless Post-Mortem Practices: Incident Timelines, Root Cause Analysis, and Organizational Learning

February 22, 2026

Observability

Intermediate

Post-Mortem-Facilitation, Root-Cause-Analysis, Incident-Timeline-Construction, Action-Item-Tracking

Post-Mortem, Incident-Response, Root-Cause-Analysis, 5-Whys, Blameless-Culture, Sre, Incident-Management, Action-Items

Grafana, Pagerduty, Opsgenie, Jira, Confluence, Slack

What a Post-Mortem Is and Is Not

A post-mortem is a structured analysis of an incident conducted after the incident is resolved. Its purpose is to understand what happened, why it happened, and what changes will prevent it from happening again. It is not a blame assignment exercise. It is not a performance review. It is not a formality to check a compliance box.

The output of a good post-mortem is a set of concrete action items that improve the system. Not the humans – the system. If your post-mortem concludes with “engineer X should have been more careful,” you have failed at the process. Humans make mistakes. Systems should be designed so that human mistakes rarely cause outages, and so that when they do, the blast radius is contained.

Incident Management Lifecycle

February 22, 2026

SRE

Intermediate, Advanced

Incident-Detection, Incident-Triage, Incident-Communication, Incident-Mitigation, Post-Incident-Review

Incident-Management, Incident-Response, Triage, Status-Page, Post-Incident-Review, On-Call, Communication

Pagerduty, Opsgenie, Slack, Statuspage, Grafana, Prometheus, Kubectl

Incident Lifecycle Overview

An incident is an unplanned disruption to a service requiring coordinated response. The lifecycle has six phases: detection, triage, communication, mitigation, resolution, and review. Each has defined actions, owners, and exit criteria.
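One way to picture the lifecycle is as an ordered set of states with a gate between them. The sketch below assumes the phases run strictly in sequence and that exit criteria reduce to a single yes/no flag; in practice phases such as communication and mitigation often overlap. The names `Phase` and `advance` are hypothetical.

```python
from enum import IntEnum

class Phase(IntEnum):
    """The six lifecycle phases, numbered in the order listed above."""
    DETECTION = 1
    TRIAGE = 2
    COMMUNICATION = 3
    MITIGATION = 4
    RESOLUTION = 5
    REVIEW = 6

def advance(current: Phase, exit_criteria_met: bool) -> Phase:
    """Move to the next phase only once the current phase's exit criteria are met."""
    if not exit_criteria_met or current is Phase.REVIEW:
        return current  # stay put: criteria unmet, or already in the final phase
    return Phase(current + 1)

phase = Phase.DETECTION
phase = advance(phase, exit_criteria_met=True)
print(phase.name)  # TRIAGE
phase = advance(phase, exit_criteria_met=False)
print(phase.name)  # TRIAGE
```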

Phase 1: Detection

Incidents are detected through three channels. Automated monitoring is the best case: alerts fire on SLO violations or error thresholds before users notice. Internal reports come from other teams that notice issues with their dependencies. Customer reports are the worst case: if users detect your incidents first, your observability has gaps.

SRE Fundamentals: SLOs, Error Budgets, and Reliability Practices

February 22, 2026

SRE

Intermediate

Slo-Definition, Error-Budget-Management, Toil-Identification, Production-Readiness-Review

Sre, Slo, Sli, Sla, Error-Budget, Toil, On-Call, Production-Readiness

Prometheus, Grafana, Pagerduty, Opsgenie, Datadog

The SRE Model

Site Reliability Engineering treats operations as a software engineering problem. Instead of a wall between developers who ship features and operators who keep things running, SRE defines reliability as a feature – one that can be measured, budgeted, and traded against velocity. The core insight is that 100% reliability is the wrong target. Users cannot tell the difference between 99.99% and 100%, but the engineering cost to close that gap is enormous. SRE makes this tradeoff explicit through service level objectives.
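The arithmetic behind that tradeoff is short enough to work through. The sketch below assumes a time-based availability SLO measured over a 30-day window; the function name `allowed_downtime_minutes` and the window length are illustrative choices, not a standard.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Error budget, in minutes of downtime, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60   # 43,200 minutes in 30 days
    return total_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes per 30 days")
```

At 99.9% the budget is about 43 minutes a month; at 99.99% it shrinks to about 4.3 minutes, and at 100% it is zero, which leaves no room for deploys, experiments, or any failure at all.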

Structuring Effective On-Call Runbooks: Format, Escalation, and Diagnostic Decision Trees

February 22, 2026

Observability

Intermediate

Runbook-Authoring, Escalation-Design, Incident-Triage, Diagnostic-Decision-Trees

Runbooks, On-Call, Incident-Response, Escalation, Alerting, Operations, Sre, Pagerduty, Opsgenie

Alertmanager, Pagerduty, Opsgenie, Grafana, Prometheus, Kubectl

Why Runbooks Exist

An on-call engineer paged at 3 AM has limited cognitive capacity. They may not be familiar with the specific service that is failing. They may have joined the team two weeks ago. A runbook bridges the gap between the alert firing and the correct human response. Without runbooks, incident response depends on tribal knowledge – the engineer who built the service and knows its failure modes. That engineer is on vacation when the incident hits.