On-Call Rotation Design

SRE

On-Call Is a System, Not a Schedule#

On-call done wrong burns out engineers and degrades reliability simultaneously. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable, well-compensated, and generates signal that drives real reliability improvements.

Rotation Schedule Types#

Weekly Rotation#

Each engineer is primary on-call for one full week, handing off Monday to Monday. This is the simplest model, and it works for teams of five or more in a single timezone: with five engineers, each person carries the pager at most one week in five.
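
The shape of such a schedule is simple enough to write down as plain data. A minimal sketch; the field names and engineer names here are illustrative, not taken from any specific paging product:

rotation:
  name: backend-primary     # hypothetical rotation identifier
  cadence: weekly           # one engineer per week
  handoff: Monday 10:00     # hand off mid-morning, not at midnight
  members:                  # order determines who is up next
    - alice
    - bob
    - carol
    - dana
    - eve

Handing off mid-morning rather than at midnight means both the outgoing and incoming engineers are awake and at work for the handoff conversation.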

Debugging and Tuning Alerts: Why Alerts Don't Fire, False Positives, and Threshold Selection

When an Alert Should Fire but Does Not#

Silent alerts are the most dangerous failure mode in monitoring. The system appears healthy because no one is being paged, but the condition you intended to catch is actively occurring. Work through this checklist in order.

Step 1: Verify the Expression Returns Results#

Open the Prometheus UI at /graph and run the alert expression directly. If the expression returns an empty result, the alert cannot fire, regardless of anything else in the rule.
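
A useful tactic is to decompose the expression into its parts. Using the HighErrorRate expression from the rule example later in this section as an illustration, run the inner selector on its own first, then layer the comparison back on:

# Step 1: run the raw selector alone. If this returns nothing, the
# recording rule behind it is missing, renamed, or not being evaluated.
job:http_errors:ratio5m

# Step 2: add the comparison back. In PromQL a comparison filters series
# rather than producing booleans, so an empty input always yields an
# empty result.
job:http_errors:ratio5m > 0.05

If step 1 returns series but step 2 does not, the expression is healthy and the threshold simply is not being crossed; if step 1 is already empty, the problem lies upstream of the alert rule.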

Writing Effective Prometheus Alerting Rules

Rule Syntax#

Alerting rules live in rule files loaded by Prometheus. Each rule has an expression, an optional for duration, labels, and annotations.

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
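
The expr above references job:http_errors:ratio5m, which is a recording rule rather than a raw metric. For completeness, a sketch of how such a rule might be defined; the source metric http_requests_total and its code label are assumptions about your instrumentation:

groups:
  - name: example-records
    rules:
      # Assumed source metric: http_requests_total with a `code` status label
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))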

The for duration is critical. Without it, the alert fires on the first evaluation where the expression is true, so a single bad scrape can page someone. With for: 5m, the condition must be continuously true across every evaluation for 5 minutes before the alert fires. During this window the alert is in the pending state.
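
You can watch an alert move through the pending state without waiting for a page: Prometheus exposes every active alert as a synthetic ALERTS time series, queryable like any other metric:

# One series per currently active alert, pending or firing
ALERTS{alertname="HighErrorRate"}

# Only alerts still waiting out their for duration
ALERTS{alertname="HighErrorRate", alertstate="pending"}

When the for window elapses, alertstate flips from pending to firing and the alert is dispatched to Alertmanager.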