---
title: "Automating Operational Runbooks"
description: "Progressing from manual to automated runbooks, choosing automation tools, implementing safety checks and guardrails, and deciding when to automate versus keep manual."
url: https://agent-zone.ai/knowledge/sre/runbook-automation/
section: knowledge
date: 2026-02-22
categories: ["sre"]
tags: ["runbook-automation","rundeck","ansible-awx","stackstorm","automation","guardrails","approval-workflow","audit-trail"]
skills: ["runbook-design","automation-assessment","guardrail-implementation","approval-workflow-design","audit-trail-setup"]
tools: ["rundeck","ansible","ansible-awx","stackstorm","bash","python","terraform","kubectl"]
levels: ["intermediate","advanced"]
word_count: 1043
formats:
  json: https://agent-zone.ai/knowledge/sre/runbook-automation/index.json
  html: https://agent-zone.ai/knowledge/sre/runbook-automation/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Automating+Operational+Runbooks
---


## The Manual-to-Automated Progression

Not every runbook should be automated, and automation does not happen in a single jump. The progression builds confidence at each stage.

**Level 0 -- Tribal Knowledge:** The procedure exists only in someone's head. This is an invisible risk: nobody can review, test, or hand over a procedure that is not written down.

**Level 1 -- Documented Runbook:** Step-by-step instructions a human follows, including commands, expected outputs, and decision points. Every runbook starts here.

**Level 2 -- Scripted Runbook:** Manual steps encoded in a script that a human triggers and monitors. The script handles tedious parts; the human handles judgment calls.

**Level 3 -- Semi-Automated:** Runs automatically when triggered but pauses at key decision points for human approval. The sweet spot for most operational procedures.

**Level 4 -- Fully Automated:** End-to-end without human intervention. Appropriate only for well-understood, low-risk, high-frequency operations with comprehensive safety checks.

## When to Automate vs. Keep Manual

### Automate When

- **Frequency is high.** A procedure that runs more than once a week pays back its automation cost quickly.
- **Steps are deterministic.** Every step has a clear input, action, and expected output.
- **Time sensitivity matters.** The procedure must complete in minutes, and human execution is the bottleneck.
- **Manual errors are common.** Operators make typos, miss steps, or run steps in the wrong order.
- **The procedure is stable.** It has not changed significantly in the last three months.

### Keep Manual When

- **Complex judgment is required.** Each execution requires evaluating novel conditions.
- **Blast radius is catastrophic and irreversible.** Examples are data deletion and production schema changes.
- **Frequency is very low.** A procedure that runs once a year is not worth automating: the automation rots faster than it saves time.
- **The environment is unstable.** Target systems change frequently, so the automation needs constant maintenance.

```
                     High Frequency    Low Frequency
                   +------------------+------------------+
High Risk          | Semi-automated   | Manual with      |
                   | (Level 3)        | detailed docs    |
                   +------------------+------------------+
Low Risk           | Fully automated  | Scripted         |
                   | (Level 4)        | (Level 2)        |
                   +------------------+------------------+
```
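
The matrix can be read as a small lookup function. The sketch below is illustrative only: `Level`, `recommend_level`, and the default threshold of four runs per month (a rough stand-in for "more than once a week") are assumptions, not part of any tool.

```python
from enum import Enum


class Level(Enum):
    MANUAL = "Manual with detailed docs"
    SCRIPTED = "Scripted (Level 2)"
    SEMI_AUTOMATED = "Semi-automated (Level 3)"
    FULLY_AUTOMATED = "Fully automated (Level 4)"


def recommend_level(runs_per_month: float, high_risk: bool,
                    high_frequency_threshold: float = 4.0) -> Level:
    """Map frequency and risk onto the four cells of the matrix.

    The threshold is an assumed cut-off between "high" and "low"
    frequency; tune it to your own environment.
    """
    high_frequency = runs_per_month > high_frequency_threshold
    if high_risk:
        return Level.SEMI_AUTOMATED if high_frequency else Level.MANUAL
    return Level.FULLY_AUTOMATED if high_frequency else Level.SCRIPTED
```

Treat the result as a starting point for discussion. The other criteria in this section (determinism, stability, reversibility) still need a human assessment.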

## Automation Tools

### Rundeck

Job scheduler and runbook automation platform with a web UI for executing pre-defined procedures with RBAC.

**Best for:** Centralizing operations where on-call engineers click "Run" on pre-built procedures. Good for transitioning from manual to automated because it wraps existing scripts with a UI, access controls, and logging.

```yaml
# Rundeck job definition. Assumes the job also defines "deployment" and
# "namespace" options (omitted here for brevity).
# Inline scripts expand options as @option.name@; the ${option.name} form
# is only expanded in command steps and script arguments.
- id: restart-service
  name: "Restart Kubernetes Deployment"
  sequence:
    commands:
      - description: "Pre-check: verify deployment is healthy"
        script: |
          kubectl rollout status deployment/@option.deployment@ \
            -n @option.namespace@ --timeout=30s
      - description: "Execute rolling restart"
        script: |
          kubectl rollout restart deployment/@option.deployment@ \
            -n @option.namespace@
      - description: "Wait for rollout"
        script: |
          kubectl rollout status deployment/@option.deployment@ \
            -n @option.namespace@ --timeout=300s
```

**Strengths:** Web UI, RBAC, audit logging, LDAP integration, approval workflows.
**Weaknesses:** Java-based (resource-heavy), limited orchestration logic.

### Ansible AWX

Open-source upstream of the automation controller (formerly Ansible Tower) in Red Hat Ansible Automation Platform. It provides a web UI, REST API, and RBAC around Ansible playbooks.

**Best for:** Teams already using Ansible who want to expose playbooks as self-service operations with inventory management.

```yaml
- name: Rotate TLS certificates
  hosts: "{{ target_hosts }}"
  tasks:
    - name: Backup existing certificate
      copy:
        src: /etc/ssl/certs/service.crt
        dest: "/etc/ssl/backup/service.crt.{{ ansible_date_time.iso8601_basic }}"
        remote_src: true
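    - name: Generate new certificate
      community.crypto.x509_certificate:
        path: /etc/ssl/certs/service.crt
        provider: ownca
        # The ownca provider also needs a CSR and the CA private key.
        # Both paths below are placeholders.
        csr_path: /etc/ssl/csr/service.csr
        ownca_path: /etc/ssl/certs/ca.crt
        ownca_privatekey_path: /etc/ssl/private/ca.key
        ownca_not_after: "+365d"
        # Without force, an existing valid certificate is left untouched.
        force: true
    # Reload the service here so it starts serving the new certificate
    # (step omitted; it depends on how the service is managed).
    - name: Verify new certificate is served
      uri:
        url: "https://{{ inventory_hostname }}:{{ service_port }}/health"
        validate_certs: true
      register: health
      until: health.status == 200
      retries: 3
      delay: 5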
```

**Strengths:** Massive module ecosystem, agentless, declarative and idempotent.
**Weaknesses:** Slow at scale (SSH per host), AWX has significant infrastructure overhead.

### StackStorm

Event-driven automation platform connecting triggers (alerts, webhooks) to actions through rules and workflows.

**Best for:** Alert-driven remediation -- "if this alert fires, run that procedure."

```yaml
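# StackStorm rule: auto-remediate high memory
# The trigger type and action ref are illustrative; use the ones provided
# by the packs installed in your environment.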
name: remediate_high_memory
trigger:
  type: prometheus.webhook
  parameters:
    alert_name: "PodMemoryHigh"
criteria:
  trigger.labels.severity:
    type: equals
    pattern: "warning"
action:
  ref: kubernetes.restart_pod
  parameters:
    namespace: "{{ trigger.labels.namespace }}"
    pod_name: "{{ trigger.labels.pod }}"
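    # Passed to the action as an ordinary parameter; the action or workflow
    # must implement the pause. Rules have no built-in approval flag.
    approval_required: true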
```

**Strengths:** Event-driven, large integration ecosystem, complex workflow support.
**Weaknesses:** Complex setup, smaller community, requires dedicated infrastructure.

### Custom Scripts

For simple, targeted automation, a well-written Bash or Python script may be sufficient.

**Strengths:** Simple, no additional infrastructure, easy to version control.
**Weaknesses:** No built-in RBAC or UI, audit trail requires external logging, no approval workflows without wrapping.

## Safety Checks and Guardrails

Automated runbooks execute faster than humans can intervene. A buggy script can cause more damage in 10 seconds than a human could in 10 minutes, so the guardrails have to be built into the runbook itself.

### Pre-Execution

1. **Target validation.** Confirm the target exists and is in the expected state.
2. **Environment confirmation.** Verify you are operating in the intended environment.
3. **Dependency health.** Check that monitoring, logging, and backup systems are available.
4. **Concurrency guard.** Ensure another instance is not already running against the same target.
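
Two of these checks, environment confirmation and the concurrency guard, can be sketched in a few lines. This is a minimal example under stated assumptions: `PreCheckError`, `confirm_environment`, and `concurrency_guard` are hypothetical names, and the lock file only protects against concurrent runs on the same host.

```python
import os
import tempfile
from contextlib import contextmanager


class PreCheckError(RuntimeError):
    """Raised when a pre-execution check fails; nothing has been changed yet."""


def confirm_environment(expected: str, actual: str) -> None:
    """Refuse to run if the detected environment is not the intended one."""
    if expected != actual:
        raise PreCheckError(f"expected environment {expected!r}, got {actual!r}")


@contextmanager
def concurrency_guard(target: str, lock_dir: str = tempfile.gettempdir()):
    """Hold an exclusive lock file for `target` for the duration of the run."""
    lock_path = os.path.join(lock_dir, f"runbook-{target.replace('/', '_')}.lock")
    try:
        # O_EXCL makes creation atomic: it fails if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise PreCheckError(f"another run already holds {lock_path}") from None
    try:
        os.write(fd, str(os.getpid()).encode())
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A run killed without cleanup leaves a stale lock behind, and runbooks triggered from several hosts would need a shared lock (for example in a database or coordination service) instead of a local file.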

### During Execution

5. **Step validation.** After each step, verify the expected outcome before proceeding.
6. **Rate limiting.** Process multiple targets in batches with pauses between them.
7. **Timeout enforcement.** Every step has a timeout. A hanging step should fail the runbook, not block it.
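
The three in-flight checks might look like this in a script-based runbook. It is a sketch, not a framework: `run_step` and `process_in_batches` are hypothetical helpers, and the batch size and pause are assumed defaults.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def run_step(cmd: Sequence[str], timeout_s: float) -> str:
    """Run one step; a hang or non-zero exit raises instead of blocking.

    subprocess.run raises TimeoutExpired when the timeout elapses and
    CalledProcessError (because check=True) on a non-zero exit code.
    """
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return result.stdout


def process_in_batches(
    targets: Sequence[str],
    action: Callable[[str], None],
    verify: Callable[[str], bool],
    batch_size: int = 2,
    pause_s: float = 30.0,
) -> List[str]:
    """Apply `action` to targets in batches, verifying each before moving on."""
    done: List[str] = []
    for start in range(0, len(targets), batch_size):
        for target in targets[start:start + batch_size]:
            action(target)
            # Step validation: stop at the first target that does not verify.
            if not verify(target):
                raise RuntimeError(f"verification failed for {target}; stopping")
            done.append(target)
        # Rate limiting: pause between batches, but not after the last one.
        if start + batch_size < len(targets):
            time.sleep(pause_s)
    return done
```

Stopping at the first failed verification limits the damage to one batch, which is the point of processing targets in batches at all.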

### Post-Execution

8. **Health verification.** Verify the system is healthy after completion.
9. **Metric comparison.** Compare key metrics against the pre-execution baseline.
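
Metric comparison can be as simple as checking each baseline value against its post-execution counterpart. The sketch below assumes non-negative metrics where higher is better (such as a cache hit rate); `compare_metrics` and the 5% tolerance are illustrative choices.

```python
from typing import Dict, List


def compare_metrics(
    baseline: Dict[str, float],
    post: Dict[str, float],
    max_regression: float = 0.05,
) -> List[str]:
    """Return a list of violations; an empty list means the check passed.

    A metric violates the check if it is missing after execution or has
    dropped by more than `max_regression` (as a fraction of the baseline).
    """
    violations: List[str] = []
    for name, before in baseline.items():
        after = post.get(name)
        if after is None:
            violations.append(f"{name}: missing after execution")
        elif after < before * (1 - max_regression):
            violations.append(f"{name}: regressed from {before} to {after}")
    return violations
```

Metrics where lower is better (latency, error rate) would need the comparison inverted.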

## Approval Workflows

Semi-automated runbooks pause for human approval at critical points.

**Pre-execution approval:** Runbook waits for a human to approve before any action. Good for scheduled maintenance where an engineer reviews conditions first.

**Mid-execution approval:** Preparatory steps (gathering data, backups, pre-checks) run automatically, then the runbook pauses before the impactful step. The human sees pre-check results and decides.

**Escalating approval:** Low-risk steps execute automatically. If unexpected conditions arise or higher-risk action is needed, the runbook pauses for senior approval.
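
A mid-execution approval gate can be sketched as a function that takes the approval mechanism as a parameter. All names here (`Approval`, `run_with_approval`) are hypothetical; in practice `request_approval` would post to a chat channel or ticketing system and wait for a response.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Approval:
    approved: bool
    approver: str


def run_with_approval(
    precheck: Callable[[], Dict[str, str]],
    request_approval: Callable[[Dict[str, str]], Approval],
    impactful_step: Callable[[], None],
) -> str:
    """Run pre-checks automatically, then pause before the impactful step."""
    findings = precheck()
    # The approver sees the pre-check results before deciding.
    decision = request_approval(findings)
    if not decision.approved:
        return f"aborted: rejected by {decision.approver}"
    impactful_step()
    return f"completed: approved by {decision.approver}"
```

Passing the approver in as a callable keeps the runbook logic testable and lets the same runbook use different approval channels per environment.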

## Audit Trails

Every execution must produce an audit trail answering: who triggered it, when, what it did, and what the outcome was.

```json
{
  "runbook": "cache-rebuild",
  "execution_id": "exec-2026-0222-143052",
  "triggered_by": "alert:CacheHitRateLow",
  "triggered_at": "2026-02-22T14:30:52Z",
  "approved_by": "oncall-sre@company.com",
  "target": "api-server/production",
  "steps_executed": [
    {"step": "pre-check", "status": "pass", "timestamp": "2026-02-22T14:31:16Z"},
    {"step": "cache-rebuild", "status": "pass", "timestamp": "2026-02-22T14:31:45Z"},
    {"step": "post-check", "status": "pass", "timestamp": "2026-02-22T14:32:48Z"}
  ],
  "outcome": "success",
  "baseline_metrics": {"cache_hit_rate": 0.45},
  "post_metrics": {"cache_hit_rate": 0.92}
}
```

Store audit records in a durable, append-only system. The automation system should not be able to modify or delete its own logs.
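
One way to make tampering detectable is to chain records by hash, so that editing any earlier record invalidates everything after it. This is a sketch under assumptions: `append_record` and `verify_chain` are hypothetical helpers, and the in-memory list stands in for whatever durable store you use. The chain detects modification; it does not prevent it, so write access to the store still has to be restricted.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # assumed placeholder hash for the first record


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    # sort_keys gives a stable serialization, so the hash is reproducible.
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: List[Dict[str, Any]], record: Dict[str, Any]) -> Dict[str, Any]:
    """Append `record`, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, record),
    }
    log.append(entry)
    return entry


def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; False means a record was altered or removed."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```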

## Agent Operational Notes

- **Start automation at Level 2.** Once the runbook is documented, script the steps but keep human triggering. Move to Level 3 only after multiple successful scripted executions.
- **Never skip pre-checks.** A pre-check that prevents one bad execution justifies its existence for every future run.
- **Make rollback the default.** If any step fails or any post-check deviates from expected results, roll back rather than continue.
- **Log at the step level.** "Steps 1-5 succeeded, step 6 skipped (condition not met), step 7 succeeded" is useful. "The runbook succeeded" is not.
- **Treat automation code as production code.** Version control, code review, testing, and staged rollouts apply to runbook automation.
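
The rollback and step-level logging notes can be combined in a small step runner. This is a minimal sketch: `Step` and `execute` are hypothetical names, and the rollback callables are assumed to be safe to call after a partial run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    rollback: Optional[Callable[[], None]] = None


def execute(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns a step-level log rather than a single success/failure flag.
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        try:
            step.run()
        except Exception as exc:
            log.append(f"{step.name}: failed ({exc})")
            # Rollback is the default: undo everything that already ran.
            for done in reversed(completed):
                if done.rollback is not None:
                    done.rollback()
                    log.append(f"{done.name}: rolled back")
            return log
        log.append(f"{step.name}: succeeded")
        completed.append(step)
    return log
```

A rollback that itself fails will raise out of `execute`; a production version would need to decide whether to continue undoing the remaining steps or stop and page a human.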

