On-Call

On-Call Rotation Design

On-Call Is a System, Not a Schedule

On-call done wrong burns out engineers and degrades reliability at the same time. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable and well-compensated, and it generates signal that drives real reliability improvements.

Rotation Schedule Types

Weekly Rotation

Each engineer is primary on-call for one full week, Monday to Monday. This is the simplest model and works for teams of 5 or more in a single timezone.
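
As a minimal sketch of the weekly model, the following Python generates Monday-to-Monday shifts by cycling through a roster. The function name `weekly_rotation`, the engineer names, and the start date are hypothetical and chosen only for illustration; a real schedule would usually live in a paging tool rather than in a script.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(engineers, start, weeks):
    """Return (shift_start, shift_end, engineer) tuples, one per week.

    Shifts run Monday to Monday. If `start` is not a Monday, the first
    shift begins on the next Monday after it.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    # date.weekday() is 0 for Monday, so the offset is 0 when start is a Monday.
    first_monday = start + timedelta(days=(7 - start.weekday()) % 7)
    roster = cycle(engineers)
    shifts = []
    for week in range(weeks):
        shift_start = first_monday + timedelta(weeks=week)
        shift_end = shift_start + timedelta(weeks=1)
        shifts.append((shift_start, shift_end, next(roster)))
    return shifts


# Hypothetical five-person team; each person is primary once every five weeks.
team = ["ana", "bo", "chen", "dee", "eli"]
for shift_start, shift_end, engineer in weekly_rotation(team, date(2026, 2, 23), weeks=6):
    print(f"{shift_start} -> {shift_end}: {engineer}")
```

With five engineers, each person carries the pager one week in five, which is one way to read the "5 or more" guidance above.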

Incident Management Lifecycle

February 22, 2026

SRE

Intermediate, Advanced

Incident-Detection, Incident-Triage, Incident-Communication, Incident-Mitigation, Post-Incident-Review

Incident-Management, Incident-Response, Triage, Status-Page, Post-Incident-Review, On-Call, Communication

PagerDuty, Opsgenie, Slack, Statuspage, Grafana, Prometheus, kubectl

Incident Lifecycle Overview

An incident is an unplanned disruption to a service requiring coordinated response. The lifecycle has six phases: detection, triage, communication, mitigation, resolution, and review. Each has defined actions, owners, and exit criteria.
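
One way to make the six phases concrete is to model them as an ordered enumeration. This is a sketch only: the `Phase` class and `next_phase` helper are hypothetical names, and the strict linear ordering is a simplification, since in practice communication tends to continue alongside mitigation.

```python
from enum import Enum


class Phase(Enum):
    """The six lifecycle phases, numbered in the order listed above."""
    DETECTION = 1
    TRIAGE = 2
    COMMUNICATION = 3
    MITIGATION = 4
    RESOLUTION = 5
    REVIEW = 6


def next_phase(current):
    """Return the phase that follows `current`, or None after review."""
    if current is Phase.REVIEW:
        return None
    return Phase(current.value + 1)


# Walk an incident through the lifecycle from detection to review.
phase = Phase.DETECTION
while phase is not None:
    print(phase.name.lower())
    phase = next_phase(phase)
```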

Phase 1: Detection

Incidents are detected through three channels. Automated monitoring is best: alerts fire on SLO violations or error thresholds before users notice. Internal reports come from other teams noticing issues with dependencies. Customer reports are the worst case: if users detect your incidents first, your observability has gaps.
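
The error-threshold case can be sketched as a simple ratio check. The function name `should_alert`, the 1% threshold, and the minimum-traffic guard are assumptions made for illustration; a real monitoring stack would express the same idea as an alerting rule evaluated over a time window.

```python
def should_alert(errors, total, threshold=0.01, min_requests=100):
    """Decide whether an error-rate alert should fire for one window.

    `threshold` is the highest tolerated error ratio (hypothetical 1%).
    `min_requests` suppresses alerts when traffic is too low for the
    ratio to be meaningful (also a hypothetical value).
    """
    if total < min_requests:
        return False
    return errors / total > threshold


# Hypothetical request counts for three evaluation windows.
windows = [(5, 1000), (20, 1000), (5, 10)]
for errors, total in windows:
    print(f"{errors}/{total} errors -> alert: {should_alert(errors, total)}")
```

The minimum-traffic guard is a design choice: without it, a handful of failed requests during a quiet period would page someone for a ratio that carries little information.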

SRE Fundamentals: SLOs, Error Budgets, and Reliability Practices

February 22, 2026

SRE

Intermediate

SLO-Definition, Error-Budget-Management, Toil-Identification, Production-Readiness-Review

SRE, SLO, SLI, SLA, Error-Budget, Toil, On-Call, Production-Readiness

Prometheus, Grafana, PagerDuty, Opsgenie, Datadog

The SRE Model

Site Reliability Engineering treats operations as a software engineering problem. Instead of a wall between developers who ship features and operators who keep things running, SRE defines reliability as a feature – one that can be measured, budgeted, and traded against velocity. The core insight is that 100% reliability is the wrong target. Users cannot tell the difference between 99.99% and 100%, but the engineering cost to close that gap is enormous. SRE makes this tradeoff explicit through service level objectives.
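
The tradeoff becomes easier to see as arithmetic. The sketch below converts an availability target into the downtime it permits over a window; the function name `error_budget_minutes` and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Return the minutes of unavailability an SLO target permits.

    `slo_target` is a fraction such as 0.9999 for 99.99%.
    `window_days` is the measurement window (hypothetical default of 30).
    """
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


for target in (0.999, 0.9999, 1.0):
    budget = error_budget_minutes(target)
    print(f"{target:.2%} over 30 days: {budget:.2f} minutes of budget")
```

Over 30 days, a 99.99% target leaves roughly 4.32 minutes of budget, while a 100% target leaves none, so any failure at all would breach it.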

Structuring Effective On-Call Runbooks: Format, Escalation, and Diagnostic Decision Trees

February 22, 2026

Observability

Intermediate

Runbook-Authoring, Escalation-Design, Incident-Triage, Diagnostic-Decision-Trees

Runbooks, On-Call, Incident-Response, Escalation, Alerting, Operations, SRE, PagerDuty, Opsgenie

Alertmanager, PagerDuty, Opsgenie, Grafana, Prometheus, kubectl

Why Runbooks Exist

An on-call engineer paged at 3 AM has limited cognitive capacity. They may not be familiar with the specific service that is failing. They may have joined the team two weeks ago. A runbook bridges the gap between the alert firing and the correct human response. Without runbooks, incident response depends on tribal knowledge – the engineer who built the service and knows its failure modes. That engineer is on vacation when the incident hits.
