---
title: "SLO Practical Implementation Guide"
description: "End-to-end guide to implementing SLOs — choosing SLIs, setting targets, calculating error budgets, defining error budget policies, SLO-based alerting with burn rates, and communicating with stakeholders."
url: https://agent-zone.ai/knowledge/sre/slo-implementation-guide/
section: knowledge
date: 0001-01-01
categories: ["sre"]
tags: ["slo","sli","error-budget","burn-rate","alerting","reliability","sre"]
skills: ["slo-definition","error-budget-calculation","burn-rate-alerting","stakeholder-communication"]
tools: ["prometheus","grafana","datadog","pagerduty","sloth","openslo"]
levels: ["intermediate","advanced"]
word_count: 1081
formats:
  json: https://agent-zone.ai/knowledge/sre/slo-implementation-guide/index.json
  html: https://agent-zone.ai/knowledge/sre/slo-implementation-guide/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=SLO+Practical+Implementation+Guide
---


## From Theory to Running SLOs

Every SRE resource explains what SLOs are. Few explain how to actually implement them from scratch -- the Prometheus queries, the error budget math, the alerting rules, and the conversations with product managers when the budget runs out. This guide covers all of it.

## Step 1: Choose Your SLIs

SLIs must measure what users experience. Internal metrics like CPU usage or queue depth are useful for debugging but are not SLIs because users do not care about your CPU -- they care whether the page loaded.

### The Four SLI Types

**Availability**: Did the request succeed?

```promql
# Availability SLI: ratio of successful requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

**Latency**: Was the request fast enough?

```promql
# Latency SLI: ratio of requests under 300ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```

**Correctness**: Did the response contain the right data? Harder to measure -- often requires application-level probes or synthetic checks that verify response content.

**Freshness**: Is the data recent enough? Critical for data pipelines and caches.

```promql
# Freshness SLI: time since last successful pipeline run
time() - pipeline_last_success_timestamp_seconds
```

Measure SLIs at the edge, not the origin. A load balancer's view captures network failures, TLS issues, and routing errors your application never sees. If you must measure at the application, ensure you also capture connection-level failures.

## Step 2: Set SLO Targets

SLO targets are not aspirational. They represent the level of reliability users actually need. Start with historical data.

```
# Pull 90 days of availability data
Query: avg_over_time(
  (sum(rate(http_requests_total{status!~"5.."}[1h]))
   / sum(rate(http_requests_total[1h])))[90d:1h])

Result: 99.95% historical availability
```

Set your initial SLO slightly below your historical performance. If you have been running at 99.95%, set 99.9%. This gives you headroom and makes the SLO achievable from day one. You can tighten it later.

Common SLO targets by service type:

| Service Type       | Availability | Latency (p99)    |
|--------------------|--------------|------------------|
| User-facing API    | 99.9%        | < 500ms          |
| Internal API       | 99.5%        | < 1000ms         |
| Data pipeline      | 99.5%        | Freshness < 5min |
| Batch processing   | 99.0%        | Completion < 4hr |
| Static content/CDN | 99.95%       | < 100ms          |
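The "set your initial target slightly below historical performance" heuristic can be sketched as a small helper. This is illustrative only -- the tier list and function name are assumptions, not part of any tool:

```python
# Sketch: pick an initial SLO target just below measured historical
# availability, using the common "nines" tiers. Tier list and helper
# name are illustrative assumptions, not from any standard library.
STANDARD_TARGETS = [0.99, 0.995, 0.999, 0.9995, 0.9999]

def initial_slo_target(historical_availability: float) -> float:
    """Return the highest standard tier strictly below historical performance."""
    candidates = [t for t in STANDARD_TARGETS if t < historical_availability]
    if not candidates:
        # History is worse than every tier; start at the lowest and improve.
        return STANDARD_TARGETS[0]
    return max(candidates)

# 99.95% historical availability -> start with a 99.9% SLO
print(initial_slo_target(0.9995))  # 0.999
```

This keeps the target achievable from day one while leaving room to tighten it later.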

## Step 3: Calculate Error Budgets

The error budget is the amount of unreliability your SLO permits over a given window.

```
Error Budget = 1 - SLO target

For a 99.9% SLO over 30 days:
  Error budget = 0.1% = 0.001
  Total minutes in 30 days: 43,200
  Allowed downtime: 43,200 × 0.001 = 43.2 minutes

For a 99.5% SLO over 30 days:
  Error budget = 0.5% = 0.005
  Allowed downtime: 43,200 × 0.005 = 216 minutes (3.6 hours)
```
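The arithmetic above is simple enough to express as a one-line function; the name and parameters here are illustrative:

```python
# Sketch of the error budget arithmetic: minutes of full downtime an
# SLO permits over a rolling window. Names are illustrative.
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total window minutes times the permitted unreliability fraction."""
    total_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return total_minutes * (1 - slo_target)

print(allowed_downtime_minutes(0.999))  # ~43.2 minutes
print(allowed_downtime_minutes(0.995))  # ~216 minutes
```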

Track error budget consumption as a percentage:

```yaml
# Error budget remaining (Prometheus recording rule)
- record: slo:error_budget_remaining:ratio
  expr: |
    1 - (
      (1 - (sum(rate(http_requests_total{status!~"5.."}[30d]))
            / sum(rate(http_requests_total[30d]))))
      / (1 - 0.999)
    )
```

When `slo:error_budget_remaining:ratio` hits 0, you have consumed your entire error budget for the window.
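The recording-rule math translates directly into plain arithmetic -- the fraction of the budget still unspent, given measured availability over the window. Function names here are illustrative:

```python
# Sketch of the budget-remaining formula from the recording rule:
# 1 - (observed error rate / permitted error rate). Names are illustrative.
def error_budget_remaining(availability: float, slo_target: float) -> float:
    error_rate = 1 - availability  # observed unreliability over the window
    budget = 1 - slo_target        # unreliability the SLO permits
    return 1 - error_rate / budget

# 99.95% measured availability against a 99.9% SLO: half the budget spent
print(error_budget_remaining(0.9995, 0.999))  # ~0.5
```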

## Step 4: Define Error Budget Policies

The error budget policy is what makes SLOs operational. Without a policy, the error budget is just a number on a dashboard that nobody acts on.

```markdown
## Error Budget Policy: payment-api

**SLO**: 99.9% availability, 30-day rolling window

### Budget > 50% remaining
- Normal development velocity
- Feature work proceeds as planned
- Standard deployment cadence

### Budget 25-50% remaining
- Prioritize reliability work in next sprint
- Increase deployment testing (canary duration from 10min to 30min)
- Review recent incidents for systemic issues

### Budget 5-25% remaining
- Freeze non-critical feature deployments
- All engineering effort shifts to reliability
- Daily error budget review in standup

### Budget < 5% remaining
- Complete feature freeze
- All deployments require SRE approval
- Incident review for every error budget consumption event
- Escalate to engineering leadership

### Budget exhausted
- Postmortem required identifying systemic causes
- Reliability sprint: minimum 2 weeks focused on fixes
- Feature freeze remains until budget recovers above 25%
```
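A policy like this is easy to encode so that dashboards or deploy tooling can act on it automatically. The stage names and thresholds below mirror the example policy; the function itself is an illustrative sketch, not part of any tool:

```python
# Sketch: map remaining-budget fraction to the policy stage defined
# above. Thresholds and stage names mirror the example policy.
def policy_stage(budget_remaining: float) -> str:
    if budget_remaining > 0.50:
        return "normal velocity"
    if budget_remaining > 0.25:
        return "prioritize reliability"
    if budget_remaining > 0.05:
        return "freeze non-critical deployments"
    if budget_remaining > 0:
        return "complete feature freeze"
    return "budget exhausted"

print(policy_stage(0.62))  # normal velocity
print(policy_stage(0.18))  # freeze non-critical deployments
```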

The policy must have teeth. If product management can override a feature freeze whenever they want, the error budget policy is fiction.

## Step 5: SLO-Based Alerting with Burn Rates

Threshold-based alerts are noisy. "Error rate > 1%" fires on a brief spike that consumes negligible budget. Burn rate alerting solves this by asking: "At the current rate of errors, when will we exhaust the error budget?"

```
Burn rate = (actual error rate) / (SLO-permitted error rate)

For a 99.9% SLO:
  Permitted error rate = 0.1%
  If current error rate = 0.5%
  Burn rate = 0.5% / 0.1% = 5x

  At 5x burn rate, a 30-day budget is consumed in 6 days.
```
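The burn-rate arithmetic above can be sketched in a couple of lines; names are illustrative:

```python
# Sketch of the burn-rate math: how fast errors are consuming the
# budget, and how long until it runs out. Names are illustrative.
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the SLO-permitted error rate."""
    return error_rate / (1 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, days until the window's budget is spent."""
    return window_days / rate

br = burn_rate(0.005, 0.999)   # 0.5% errors against a 0.1% allowance
print(br)                      # ~5x
print(days_to_exhaustion(br))  # ~6 days
```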

Implement multi-window burn rate alerts (Google's recommended approach):

```yaml
# Prometheus alerting rules for SLO burn rate
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x over 1 hour (exhausts budget in ~2 days)
      # Short window for confirmation: 5 minutes
      - alert: SLOHighBurnRate_Critical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate - budget exhausted in ~2 days"

      # Slow burn: 3x over 6 hours (exhausts budget in ~10 days)
      - alert: SLOHighBurnRate_Warning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (3 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate - budget exhausted in ~10 days"
```

The long window catches sustained problems. The short window prevents alerting on issues that have already resolved. This dual-window approach dramatically reduces false positives compared to single-threshold alerts.
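The dual-window condition reduces to requiring that both windows exceed the burn-rate threshold simultaneously. A minimal sketch, with illustrative names and a 99.9% SLO assumed for the default budget fraction:

```python
# Sketch: alert only when BOTH the long and short windows exceed the
# burn-rate threshold, mirroring the Prometheus rules above.
# Names are illustrative; budget_fraction=0.001 assumes a 99.9% SLO.
def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 burn_threshold: float,
                 budget_fraction: float = 0.001) -> bool:
    limit = burn_threshold * budget_fraction
    return long_window_error_rate > limit and short_window_error_rate > limit

# Sustained 2% errors in both windows at the 14.4x critical threshold: fires
print(should_alert(0.02, 0.02, 14.4))    # True
# Long window still elevated but the spike has resolved: stays quiet
print(should_alert(0.02, 0.0005, 14.4))  # False
```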

## Step 6: Stakeholder Communication

SLOs are useless if only the engineering team knows about them. Product managers, executives, and customer-facing teams need to understand what SLOs mean and how error budgets affect planning.

### Weekly SLO Report

```markdown
## SLO Status Report - Week of 2026-02-17

| Service      | SLO Target | Current | Budget Remaining | Trend  |
|--------------|------------|---------|------------------|--------|
| payment-api  | 99.9%      | 99.92%  | 62%              | Stable |
| search-api   | 99.5%      | 99.1%   | 18%              | Down   |
| auth-service | 99.9%      | 99.97%  | 88%              | Stable |

### Action Items
- search-api: Error budget below 25%. Reliability sprint started.
  Root cause: connection pool exhaustion under peak load (JIRA-4601).
  Feature deployments paused until budget recovers above 50%.
```

Frame error budgets as a shared resource. Product managers should think of error budget like a spending account: deploying a risky feature costs some budget. A planned maintenance window costs some budget. Running the budget to zero means no more risk-taking until it recovers. This turns reliability from an abstract concern into a concrete resource that competes fairly with feature work.

