SLO Practical Implementation Guide

From Theory to Running SLOs#

Every SRE resource explains what SLOs are. Few explain how to actually implement them from scratch – the Prometheus queries, the error budget math, the alerting rules, and the conversations with product managers when the budget runs out. This guide covers all of it.

Step 1: Choose Your SLIs#

SLIs must measure what users experience. Internal metrics like CPU usage or queue depth are useful for debugging but are not SLIs because users do not care about your CPU – they care whether the page loaded.
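
For a request-driven service, the most common availability SLI is the fraction of requests that succeed. A minimal PromQL sketch, assuming the service exposes an http_requests_total counter with job and code labels (adjust the metric and label names to your own instrumentation):

# Availability SLI: share of non-5xx responses over the last 5 minutes
sum(rate(http_requests_total{job="api", code!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))

The same ratio evaluated over your SLO window (for example 30 days) is what the error budget math is based on.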

Alertmanager Configuration and Routing

Routing Tree#

Alertmanager receives alerts from Prometheus and decides where to send them based on a routing tree. Every alert enters at the root route and travels down the tree until it matches a child route. If no child matches, the root route’s receiver handles it.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxx"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: "default-slack"
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
      group_wait: 10s
      repeat_interval: 1h
      routes:
        - match:
            team: database
          receiver: "pagerduty-dba"
    - match:
        severity: warning
      receiver: "team-slack"
      repeat_interval: 12h
    - match_re:
        namespace: "staging|dev"
      receiver: "dev-slack"
      repeat_interval: 24h

Timing parameters matter. group_wait is how long Alertmanager waits after receiving the first alert in a new group before sending the notification – this lets it batch related alerts together. group_interval is the minimum time before sending updates about a group that already fired. repeat_interval controls how often an unchanged active alert is re-sent.
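
The routing tree only references receivers by name; each receiver must also be defined in a receivers block. A minimal sketch for two of the receivers used above (channel names and keys are placeholders):

receivers:
  - name: "default-slack"
    slack_configs:
      - channel: "#alerts"
        send_resolved: true   # notify when the alert resolves, not only when it fires
  - name: "pagerduty-oncall"
    pagerduty_configs:
      - routing_key: "<events-api-v2-routing-key>"
        severity: "critical"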

ArgoCD Notifications: Slack, Teams, Webhooks, and Custom Triggers

ArgoCD Notifications#

ArgoCD Notifications is a built-in component (bundled with ArgoCD since 2.3) that monitors applications and sends alerts when specific events occur – sync succeeded, sync failed, health degraded, new version deployed. Before notifications existed, teams polled the ArgoCD UI or built custom watchers. The notifications controller eliminates that.

Architecture#

ArgoCD Notifications runs as a controller alongside the ArgoCD application controller. It watches Application resources for state changes and matches them against triggers. When a trigger fires, it renders a template and sends it through a configured service (Slack, Teams, webhook, email, etc.).
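
Triggers, templates, and notification services are configured in the argocd-notifications-cm ConfigMap, and individual applications opt in through an annotation. A minimal sketch, assuming a Slack token stored in argocd-notifications-secret (trigger, template, and channel names are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
  template.app-sync-failed: |
    message: |
      Sync of {{.app.metadata.name}} failed: {{.app.status.operationState.message}}

An application subscribes by adding an annotation such as notifications.argoproj.io/subscribe.on-sync-failed.slack: "my-channel" to its Application manifest.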

Debugging and Tuning Alerts: Why Alerts Don't Fire, False Positives, and Threshold Selection

When an Alert Should Fire but Does Not#

Silent alerts are the most dangerous failure mode in monitoring. The system appears healthy because no one is being paged, but the condition you intended to catch is actively occurring. Work through this checklist in order.

Step 1: Verify the Expression Returns Results#

Open the Prometheus UI at /graph and run the alert expression directly. If the expression returns no results, the alert cannot fire no matter how the rest of the pipeline is configured.
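
The same check can be scripted against the Prometheus HTTP API, which helps when you have no UI access or want to verify many expressions at once. A sketch using curl (substitute your Prometheus address and the alert's actual expression):

# Run the alert expression through the query API.
# An empty "result" array means there is nothing for the alert to fire on.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=job:http_errors:ratio5m > 0.05'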

Prometheus and Grafana Monitoring Stack

Prometheus Architecture#

Prometheus pulls metrics from targets at regular intervals (scraping). Each target exposes an HTTP endpoint (typically /metrics) that returns metrics in a text format. Prometheus stores the scraped data in a local time-series database and evaluates alerting rules against it. Grafana connects to Prometheus as a data source and renders dashboards.

Scrape Configuration#

The core of Prometheus configuration is the scrape config. Each scrape_config block defines a set of targets and how to scrape them.
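
A minimal sketch of a scrape config, assuming an application exposing metrics on port 8080 at the default /metrics path (job name and targets are illustrative):

# prometheus.yml
global:
  scrape_interval: 15s      # default interval between scrapes
  evaluation_interval: 15s  # how often recording and alerting rules run

scrape_configs:
  - job_name: "api"
    # metrics_path defaults to /metrics and scheme defaults to http
    static_configs:
      - targets: ["api-1:8080", "api-2:8080"]
        labels:
          team: backend

Every sample scraped through this block is stored with job="api" plus an instance label derived from the target address, which is what alerting rules and dashboards later group by.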

Structuring Effective On-Call Runbooks: Format, Escalation, and Diagnostic Decision Trees

Why Runbooks Exist#

An on-call engineer paged at 3 AM has limited cognitive capacity. They may not be familiar with the specific service that is failing. They may have joined the team two weeks ago. A runbook bridges the gap between the alert firing and the correct human response. Without runbooks, incident response depends on tribal knowledge – the engineer who built the service and knows its failure modes. That engineer is on vacation when the incident hits.

Writing Effective Prometheus Alerting Rules

Rule Syntax#

Alerting rules live in rule files loaded by Prometheus. Each rule has an expression, an optional for duration, labels, and annotations.

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

The for duration is critical. Without it, a single bad scrape triggers an alert. With for: 5m, the condition must be continuously true across all evaluations for 5 minutes before the alert fires. During this window the alert is in the pending state.
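
The expression above references job:http_errors:ratio5m, which is a recording rule rather than a raw query; precomputing the ratio keeps the alert expression cheap to evaluate and reusable in dashboards. A sketch of such a rule, assuming an http_requests_total counter with a code label:

groups:
  - name: http-recording-rules
    rules:
      # Precompute the 5-minute error ratio per job.
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))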

Synthetic Monitoring: Proactive Uptime Checks, Blackbox Exporter, and External Probing

What Synthetic Monitoring Is#

Synthetic monitoring means actively probing your services on a schedule rather than waiting for users to report problems. Instead of relying on internal health checks or real user traffic to detect issues, you send controlled requests and measure the results. The fundamental question it answers is: “Is my service reachable and responding correctly right now?”

This is distinct from real user monitoring (RUM), which observes actual user interactions. Synthetic probes run 24/7 regardless of traffic volume, so they catch outages at 3 AM when no users are active. They provide consistent, repeatable measurements that are easy to alert on. The tradeoff is that synthetic probes test a narrow, predefined path – they do not capture the full range of user experience.
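
With Prometheus, synthetic HTTP checks are typically built on the Blackbox Exporter: Prometheus scrapes the exporter's /probe endpoint and passes the real target URL as a parameter. A sketch of the scrape config, assuming the exporter is reachable at blackbox-exporter:9115 and uses the standard http_2xx module (target URLs are illustrative):

scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]            # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://example.com/healthz
          - https://example.com/login
    relabel_configs:
      # Pass the original target URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for dashboards and alerts
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115

Alerts then key off the probe_success and probe_duration_seconds metrics the exporter exposes for each target.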