---
title: "SLOs, Error Budgets, and SLI Implementation with Prometheus"
description: "Practical guide to defining SLOs, implementing SLIs in PromQL, multi-window burn-rate alerting, error budget tracking, and tooling with Pyrra and Sloth."
url: https://agent-zone.ai/knowledge/observability/slo-error-budgets/
section: knowledge
date: 2026-02-21
categories: ["observability"]
tags: ["slo","sli","error-budget","prometheus","promql","grafana","burn-rate","pyrra","sloth"]
skills: ["slo-definition","sli-implementation","burn-rate-alerting","error-budget-policy"]
tools: ["prometheus","grafana","pyrra","sloth","promtool"]
levels: ["advanced"]
word_count: 1723
formats:
  json: https://agent-zone.ai/knowledge/observability/slo-error-budgets/index.json
  html: https://agent-zone.ai/knowledge/observability/slo-error-budgets/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=SLOs%2C+Error+Budgets%2C+and+SLI+Implementation+with+Prometheus
---


## SLI, SLO, and SLA -- What They Actually Mean

An **SLI** (Service Level Indicator) is a quantitative measurement of service quality -- a number computed from your metrics. Examples: the proportion of successful HTTP requests, the proportion of requests faster than 500ms, the proportion of jobs completing within their deadline.

An **SLO** (Service Level Objective) is a target value for an SLI. It is an internal engineering commitment: "99.9% of requests will succeed over a 30-day rolling window."

An **SLA** (Service Level Agreement) is a business contract with consequences -- typically service credits if the SLO is not met. SLAs are always less aggressive than internal SLOs. If your SLO is 99.9%, your SLA might be 99.5%, giving you a buffer before contractual obligations kick in.

## Choosing SLIs

Good SLIs are user-facing measurements. Internal metrics like CPU usage or queue depth are useful for debugging but poor SLIs because they do not directly represent user experience.

**Availability**: the ratio of successful requests to total requests. Define "successful" precisely -- typically non-5xx -- and exclude intentional rejections such as 429 from the calculation entirely, since rate limiting is deliberate rather than a failure.

**Latency**: the proportion of requests faster than a threshold. Never use average latency -- it hides tail latency. Use a percentile-at-threshold: "99% of requests under 500ms."

**Freshness**: for data pipelines, the age of the most recent successfully processed record (a PromQL sketch follows this list).

**Throughput**: for batch systems, the proportion of jobs completing within their scheduled window.
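
Of these, freshness is the least obvious to implement, so here is a sketch. It assumes the pipeline exports a hypothetical gauge, `pipeline_last_success_timestamp_seconds`, set to the timestamp of the last successfully processed record:

```promql
# Age of the most recently processed record, in seconds (hypothetical gauge)
time() - max(pipeline_last_success_timestamp_seconds{job="pipeline"})

# Freshness SLI: proportion of the last 30 days where that age stayed under 10 minutes
avg_over_time(
  ((time() - max(pipeline_last_success_timestamp_seconds{job="pipeline"})) < bool 600)[30d:1m]
)
```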

## The Error Budget

If your SLO is 99.9% availability over 30 days, your error budget is 0.1%. In concrete terms:

```
30 days * 24 hours * 60 minutes = 43,200 minutes
0.1% of 43,200 = 43.2 minutes of allowed downtime
```

The error budget reframes the reliability conversation. Instead of "should we deploy on Friday?" the question becomes "do we have budget remaining to absorb a potential incident?" When budget remains, deploy aggressively. When it is exhausted, focus on reliability.

## Implementing Availability SLI in PromQL

The fundamental availability SLI query:

```promql
# 30-day availability ratio for a service
sum(rate(http_requests_total{job="api", code!~"5.."}[30d]))
/
sum(rate(http_requests_total{job="api"}[30d]))
```

This works but is expensive to evaluate -- every query scans 30 days of raw samples. In practice, you precompute the error ratios with recording rules, one per window that the burn-rate alerts below will reference:

```yaml
groups:
  - name: sli_availability
    interval: 30s
    rules:
      # Short-window error ratios (used by the page-worthy burn-rate alerts)
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:ratio_rate30m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[30m]))
          / sum by (job) (rate(http_requests_total[30m]))

      - record: job:http_errors:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
          / sum by (job) (rate(http_requests_total[1h]))

      - record: job:http_errors:ratio_rate2h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[2h]))
          / sum by (job) (rate(http_requests_total[2h]))

      - record: job:http_errors:ratio_rate6h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[6h]))
          / sum by (job) (rate(http_requests_total[6h]))

      # Long-window error ratios (used by the ticket-worthy burn-rate alerts)
      - record: job:http_errors:ratio_rate1d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1d]))
          / sum by (job) (rate(http_requests_total[1d]))

      - record: job:http_errors:ratio_rate3d
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[3d]))
          / sum by (job) (rate(http_requests_total[3d]))
```
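
Before loading a rule file like this, validate it; assuming the group above is saved as `sli_availability.yml`, a quick syntax check:

```
promtool check rules sli_availability.yml
```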

## Implementing Latency SLI with Histograms

For a latency SLI of "99% of requests complete in under 500ms":

```promql
# Proportion of requests faster than 500ms over 30 days
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
```

The `le="0.5"` bucket contains all observations less than or equal to 500ms. Dividing by the total count gives the proportion within the threshold.

Critical requirement: your histogram must have a bucket boundary at or near your SLO threshold. If your buckets are `[0.1, 0.25, 1.0, 5.0]` and your SLO threshold is 500ms, there is no `le="0.5"` bucket. You would have to use `le="1.0"`, which overstates compliance. Configure bucket boundaries to match your SLO thresholds.
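
If you are unsure which boundaries an existing histogram exposes, a query along these lines lists them, one result per `le` value:

```promql
# Audit the bucket boundaries the instrumentation actually exports
count by (le) (http_request_duration_seconds_bucket{job="api"})
```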

Recording rules for latency SLI follow the same pattern as availability:

```yaml
groups:
  - name: sli_latency
    interval: 30s
    rules:
      - record: job:http_latency_below_threshold:ratio_rate5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          / sum by (job) (rate(http_request_duration_seconds_count[5m]))
```

## Multi-Window Multi-Burn-Rate Alerting

### Why Simple Threshold Alerts Fail for SLOs

A naive alert like `error_ratio > 0.001` (targeting 99.9%) fires on any momentary spike, even ones that consume negligible budget. Setting a `for: 1h` duration avoids noise but means you do not get paged until an hour into a major incident. You need alerts that are sensitive to severe incidents and tolerant of minor blips.

### The Burn Rate Concept

Burn rate is how fast you are consuming error budget, expressed as a multiple of the sustainable baseline: a burn rate of 1 consumes budget at exactly the rate that exhausts it at the end of the SLO window, while a burn rate of 10 exhausts it in 1/10th of the window.

For a 99.9% SLO (0.1% error budget) over 30 days:

```
Burn rate 1:  0.1% error rate -- budget exhausted in 30 days (this is your baseline)
Burn rate 2:  0.2% error rate -- budget exhausted in 15 days
Burn rate 10: 1.0% error rate -- budget exhausted in 3 days
Burn rate 14: 1.4% error rate -- budget exhausted in ~2 days
Burn rate 36: 3.6% error rate -- budget exhausted in ~20 hours
```

### The Four-Window Approach

Google's SRE workbook recommends four alerts, each pairing a long window (for statistical significance) with a short window one-twelfth its length (for fast detection and reset):

**Page-worthy (immediate response required):**

| Severity | Burn Rate | Long Window | Short Window | Budget Consumed |
|---|---|---|---|---|
| Critical | 14.4x | 1h | 5m | 2% in 1 hour |
| Critical | 6x | 6h | 30m | 5% in 6 hours |

**Ticket-worthy (next business day):**

| Severity | Burn Rate | Long Window | Short Window | Budget Consumed |
|---|---|---|---|---|
| Warning | 3x | 1d | 2h | 10% in 1 day |
| Warning | 1x | 3d | 6h | 10% in 3 days |

The short window prevents the alert from firing when the problem has already resolved. Both windows must exceed the threshold for the alert to fire.
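
The budget-consumed column follows from a simple identity: the fraction of budget spent during the long window equals the burn rate times that window's share of the SLO period.

```
budget consumed = burn_rate * (long_window / slo_window)

14.4 * (1h  / 720h) = 2%
6    * (6h  / 720h) = 5%
3    * (24h / 720h) = 10%
1    * (72h / 720h) = 10%
```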

### Implementation as Alerting Rules

```yaml
groups:
  - name: slo-burn-rate-alerts
    rules:
      # Page: 2% budget consumed in 1 hour
      - alert: SLOHighBurnRate_Critical_1h
        expr: |
          job:http_errors:ratio_rate5m{job="api"} > (14.4 * 0.001)
          and
          job:http_errors:ratio_rate1h{job="api"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: "api-availability"
          window: "1h"
        annotations:
          summary: "API error budget burning at 14.4x -- 2% consumed in 1 hour"
          budget_consumed: "2%"

      # Page: 5% budget consumed in 6 hours
      - alert: SLOHighBurnRate_Critical_6h
        expr: |
          job:http_errors:ratio_rate30m{job="api"} > (6 * 0.001)
          and
          job:http_errors:ratio_rate6h{job="api"} > (6 * 0.001)
        for: 5m
        labels:
          severity: critical
          slo: "api-availability"
          window: "6h"
        annotations:
          summary: "API error budget burning at 6x -- 5% consumed in 6 hours"
          budget_consumed: "5%"

      # Ticket: 10% budget consumed in 1 day
      - alert: SLOHighBurnRate_Warning_1d
        expr: |
          job:http_errors:ratio_rate2h{job="api"} > (3 * 0.001)
          and
          job:http_errors:ratio_rate1d{job="api"} > (3 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: "api-availability"
          window: "1d"
        annotations:
          summary: "API error budget burning at 3x -- 10% consumed in 1 day"

      # Ticket: 10% budget consumed in 3 days
      - alert: SLOHighBurnRate_Warning_3d
        expr: |
          job:http_errors:ratio_rate6h{job="api"} > (1 * 0.001)
          and
          job:http_errors:ratio_rate3d{job="api"} > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: "api-availability"
          window: "3d"
        annotations:
          summary: "API error budget burning at 1x -- will exhaust in 30 days at this rate"
```
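
These alerts can be unit-tested with `promtool test rules`. A minimal sketch, assuming the group above is saved as `slo-alerts.yml` -- the recorded series are injected directly, so the recording rules themselves are not needed for the test:

```yaml
# slo-burn-rate-test.yml -- run with: promtool test rules slo-burn-rate-test.yml
rule_files:
  - slo-alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # A sustained 2% error ratio, well above the 14.4 * 0.001 = 1.44% threshold
      - series: 'job:http_errors:ratio_rate5m{job="api"}'
        values: "0.02+0x120"
      - series: 'job:http_errors:ratio_rate1h{job="api"}'
        values: "0.02+0x120"
    alert_rule_test:
      - eval_time: 1h
        alertname: SLOHighBurnRate_Critical_1h
        exp_alerts:
          - exp_labels:
              severity: critical
              slo: api-availability
              window: 1h
              job: api
            exp_annotations:
              summary: "API error budget burning at 14.4x -- 2% consumed in 1 hour"
              budget_consumed: "2%"
```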

The `0.001` in each expression is the error budget (1 - 0.999). Multiply by the burn rate to get the threshold error ratio.
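
The same structure carries over to the latency SLO by alerting on the failure ratio (one minus the recorded success ratio) against the latency budget of 0.01 (1 - 0.99). A sketch of the fastest-burn page, assuming a `ratio_rate1h` latency recording rule exists alongside the `ratio_rate5m` rule shown earlier:

```yaml
- alert: SLOLatencyHighBurnRate_Critical_1h
  expr: |
    (1 - job:http_latency_below_threshold:ratio_rate5m{job="api"}) > (14.4 * 0.01)
    and
    (1 - job:http_latency_below_threshold:ratio_rate1h{job="api"}) > (14.4 * 0.01)
  for: 2m
  labels:
    severity: critical
    slo: "api-latency"
    window: "1h"
```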

## Error Budget Dashboard

A Grafana dashboard for error budget tracking needs these panels:

**Budget remaining gauge:**

```promql
# Remaining error budget as a percentage (100 = untouched, 0 = fully spent)
(1 - (
  sum(rate(http_requests_total{job="api", code=~"5.."}[30d]))
  / sum(rate(http_requests_total{job="api"}[30d]))
) / 0.001) * 100
```

Simplify with recording rules. Color thresholds: green above 50%, yellow 20-50%, red below 20%.
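
A sketch of such a recording rule for the 99.9% target (the rule name is illustrative); the gauge panel then displays this series multiplied by 100:

```yaml
- record: slo:api_error_budget_remaining:ratio
  expr: |
    1 - (
      sum(rate(http_requests_total{job="api", code=~"5.."}[30d]))
      / sum(rate(http_requests_total{job="api"}[30d]))
    ) / 0.001
```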

**Current burn rate:**

```promql
# Current burn rate (1 = sustainable, >1 = consuming budget too fast)
job:http_errors:ratio_rate1h{job="api"} / 0.001
```

**Time until budget exhaustion at current rate:**

```promql
# Hours until the budget is gone at the current burn rate
# (negative means the budget is already exhausted)
(
  0.001
  - sum(rate(http_requests_total{job="api", code=~"5.."}[30d]))
    / sum(rate(http_requests_total{job="api"}[30d]))
)
/ sum(job:http_errors:ratio_rate1h{job="api"})
* 720
```

Where 720 = 30 days * 24 hours: an hour at error ratio `r` adds roughly `r / 720` to the 30-day ratio, so dividing the unspent budget by the current hourly spend gives hours remaining.

**Budget consumption by error type** (requires a label distinguishing error categories):

```promql
sum by (code) (rate(http_requests_total{job="api", code=~"5.."}[30d]))
/ on() group_left()
sum(rate(http_requests_total{job="api"}[30d]))
```

This reveals whether budget is consumed by 502s (upstream failures), 503s (overload), or 500s (application bugs).

## Error Budget Policy

Without a written policy, error budgets are just numbers on a dashboard. A practical policy:

**Above 50%**: Normal operations. Deploy freely. Run chaos experiments.

**20-50%**: Increased caution. Deployments require extra review. Investigate ongoing error sources.

**Below 20%**: Feature deployments paused unless they improve reliability. Post-incident reviews mandatory.

**Exhausted**: Feature freeze. Only bug fixes and reliability work. Freeze lifts when the rolling window recovers.

## Multi-Tier Application Example

For a 3-tier application (API gateway, worker service, PostgreSQL database), define SLOs per tier:

**API gateway**: 99.9% availability (non-5xx), 99% of requests under 500ms.

**Worker service**: 99.9% of jobs complete successfully, 99% of jobs complete within 60 seconds.

**Database**: 99.95% availability (connection success rate), 99% of queries under 100ms.

Each tier gets its own set of recording rules and burn-rate alerts. The API gateway SLO is the most user-facing and the most important -- backend issues that do not cause API errors do not consume the API's error budget.

```yaml
groups:
  - name: sli_api
    rules:
      - record: sli:api_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api-gateway", code!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api-gateway"}[5m]))

  - name: sli_worker
    rules:
      - record: sli:worker_success:ratio_rate5m
        expr: |
          sum(rate(jobs_completed_total{job="worker", status="success"}[5m]))
          / sum(rate(jobs_completed_total{job="worker"}[5m]))

  - name: sli_database
    rules:
      - record: sli:db_availability:ratio_rate5m
        expr: |
          sum(rate(pg_connections_total{job="postgres", status="success"}[5m]))
          / sum(rate(pg_connections_total{job="postgres"}[5m]))
```

## Pyrra and Sloth

Writing recording rules and burn-rate alerts by hand is tedious and error-prone. Two tools automate this.

**Sloth** takes an SLO definition in YAML and generates all the recording rules and multi-window burn-rate alerts:

```yaml
# sloth.yml
version: "prometheus/v1"
service: "api-gateway"
labels:
  team: "platform"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "API availability"
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="api-gateway", code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api-gateway"}[{{.window}}]))
    alerting:
      name: APIHighErrorRate
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```

Run `sloth generate -i sloth.yml` and it outputs a standard Prometheus rules file containing all the recording rules and multi-window burn-rate alerts.
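
In practice you write the output to a file and validate it like any hand-written rules file (the `-o` flag names the output path):

```
sloth generate -i sloth.yml -o slo-rules.yml
promtool check rules slo-rules.yml
```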

**Pyrra** provides similar functionality but also includes a web UI that displays SLO compliance, error budget status, and burn rate. It runs as a Kubernetes operator that watches SLO custom resources and generates PrometheusRule resources automatically.

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-availability
  namespace: monitoring
spec:
  target: "99.9"
  window: 30d
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="api-gateway", code=~"5.."}
      total:
        metric: http_requests_total{job="api-gateway"}
```

Both tools follow the Google SRE multi-window multi-burn-rate approach. Sloth is simpler (CLI tool, generates YAML). Pyrra is more integrated (Kubernetes operator, web dashboard). Either eliminates manual work and reduces miscalculated thresholds.

## Common Pitfalls

**SLOs too tight**: A 99.99% SLO gives 4.3 minutes of budget per month. A single rollback can consume half of it. Match your SLO to your deployment pipeline's capabilities.

**Measuring the wrong thing**: Status-code-only SLIs miss slow-but-successful requests. Combine availability and latency SLIs. Always exclude health check endpoints from SLI calculations (a sketch follows this list).

**Ignoring partial failures**: A 200 response with empty results or stale data looks healthy to a status-code SLI. Use application-level success signals when possible.

**No error budget policy**: Without documented, agreed-upon consequences, budgets are ignored when inconvenient. Get leadership buy-in before the budget runs out.

**Calendar vs rolling window**: Calendar-month SLOs reset to a full budget on the first of the month, so the same outage costs less the closer it lands to the reset -- a perverse incentive. A 30-day rolling window provides consistent pressure and is strongly preferred.
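
For the health-check pitfall above, exclusion is ordinary label matching. A sketch assuming the instrumentation exposes the request path as a `path` label:

```promql
# Availability SLI with synthetic health-check traffic excluded
sum(rate(http_requests_total{job="api", code!~"5..", path!~"/healthz|/readyz"}[5m]))
/
sum(rate(http_requests_total{job="api", path!~"/healthz|/readyz"}[5m]))
```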

