---
title: "Change Management for Infrastructure"
description: "Change request processes, risk assessment frameworks, rollback criteria, change windows, progressive rollouts, and change freeze policies for infrastructure operations."
url: https://agent-zone.ai/knowledge/sre/change-management/
section: knowledge
date: 2026-02-22
categories: ["sre"]
tags: ["change-management","rollback","progressive-rollout","risk-assessment","change-freeze","infrastructure","deployment"]
skills: ["change-request-workflow","risk-assessment","rollback-planning","progressive-rollout-execution","change-freeze-management"]
tools: ["git","jira","pagerduty","slack","terraform","helm","argocd","kubectl"]
levels: ["intermediate","advanced"]
word_count: 1034
formats:
  json: https://agent-zone.ai/knowledge/sre/change-management/index.json
  html: https://agent-zone.ai/knowledge/sre/change-management/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Change+Management+for+Infrastructure
---


## Why Change Management Matters

Most production incidents trace back to a change. Code deployments, configuration updates, infrastructure modifications, database migrations -- each introduces risk. Change management reduces that risk through structure, visibility, and accountability. The goal is not to prevent change but to make change safe, visible, and reversible.

## Change Request Process

Every infrastructure change flows through a structured request. The formality scales with risk, but the basic elements remain constant.

```yaml
change_request:
  id: "CR-2026-0451"
  title: "Upgrade Redis cluster from 7.0 to 7.2"
  requester: "platform-team"
  type: "normal"            # standard | normal | emergency
  risk_level: "medium"      # low | medium | high | critical
  environment: "production"
  services_affected:
    - "redis-cluster-primary"
    - "session-cache"
  implementation_plan: |
    1. Update Helm values for Redis chart version
    2. Apply to staging, run integration tests
    3. Apply to production using rolling update strategy
    4. Verify cluster health and replication status
  rollback_plan: |
    1. Revert Helm values to previous chart version
    2. Apply rollback with helm rollback redis-cluster
    3. Verify cluster re-forms with old version
  change_window: "2026-02-25 02:00-04:00 UTC"
  approvers: ["oncall-sre", "redis-team-lead"]
  estimated_duration: "45 minutes"
```

**Standard changes** are pre-approved, low-risk, repeatable operations (scaling a deployment, rotating a certificate). They follow a template and do not require per-instance approval.

**Normal changes** require review and approval. Most infrastructure changes fall here and follow the full workflow: request, review, approve, schedule, execute.

**Emergency changes** bypass normal approval when production is down or at immediate risk. They still require documentation after the fact, and every emergency change triggers a retrospective.
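A change request can be checked for completeness before it enters review. The following is a minimal sketch, assuming the request has been loaded into a plain dictionary with the field names from the YAML example above; `validate_change_request`, `REQUIRED_FIELDS`, and the approver rule for normal changes are illustrative choices, not part of any specific tool.

```python
# Hypothetical completeness check for a change request dictionary.
REQUIRED_FIELDS = (
    "id", "title", "requester", "type", "risk_level",
    "environment", "implementation_plan", "rollback_plan",
)
VALID_TYPES = {"standard", "normal", "emergency"}


def validate_change_request(cr: dict) -> list:
    """Return a list of problems; an empty list means the request looks complete."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not str(cr.get(name) or "").strip()
    ]
    change_type = cr.get("type")
    if change_type and change_type not in VALID_TYPES:
        problems.append(f"unknown type: {change_type}")
    # Assumption: normal changes must name at least one approver up front.
    if change_type == "normal" and not cr.get("approvers"):
        problems.append("normal changes need at least one approver")
    return problems
```

A request with an empty `rollback_plan` would be rejected here, before anyone is asked to approve it.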

## Risk Assessment Framework

Score every change along four dimensions from 1 (low) to 4 (critical):

| Dimension | 1 - Low | 2 - Medium | 3 - High | 4 - Critical |
|---|---|---|---|---|
| **Blast radius** | Single pod/service | Multiple services | Entire namespace | Cross-cluster or data layer |
| **Reversibility** | Instant rollback | Rollback under 5 min | Rollback requires downtime | Irreversible (data migration) |
| **Confidence** | Done 50+ times | Done several times | First time, tested in staging | First time, no staging equivalent |
| **Timing** | Off-peak | Normal hours | Business hours, moderate traffic | Peak traffic |

**Composite risk score** (sum of four scores):
- **4-6:** Low risk. Standard change process. Single approver.
- **7-10:** Medium risk. Two approvers. Monitoring dashboards open during execution.
- **11-13:** High risk. Extended review. Dedicated rollback operator. Incident channel open.
- **14-16:** Critical risk. Director-level approval. Full war-room staffing.
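
The scoring above is simple enough to automate. This sketch sums the four dimension scores and maps the total to the bands listed; the `RiskScores` class and `composite_risk` function are hypothetical names, and the example scores are illustrative rather than an assessment from the source.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class RiskScores:
    """One score per dimension, each from 1 (low) to 4 (critical)."""
    blast_radius: int
    reversibility: int
    confidence: int
    timing: int


# (inclusive upper bound of the band, risk level)
RISK_BANDS = [(6, "low"), (10, "medium"), (13, "high"), (16, "critical")]


def composite_risk(scores: RiskScores):
    """Return (total, level) for a scored change."""
    values = astuple(scores)
    for value in values:
        if not 1 <= value <= 4:
            raise ValueError(f"scores must be between 1 and 4, got {value}")
    total = sum(values)
    for upper, level in RISK_BANDS:
        if total <= upper:
            return total, level
    raise AssertionError("unreachable: total cannot exceed 16")
```

For example, a change scored 2 for blast radius, 2 for reversibility, 3 for confidence, and 1 for timing totals 8, which falls in the medium band and calls for two approvers.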

## Rollback Criteria

Define rollback triggers before executing a change. These are pre-committed conditions, not judgment calls during an incident.

```yaml
rollback_triggers:
  immediate:
    - "Error rate exceeds 5% for 2 consecutive minutes"
    - "P99 latency exceeds 3x baseline for 3 minutes"
    - "Health check failures on more than 20% of pods"
  time_bounded:
    - condition: "Error rate above 1%"
      duration: "15 minutes after deployment completes"
  manual_review:
    - "Any unexpected log pattern not seen in staging"
    - "Dependent service reports degradation"
```

**Decide when to roll back before you start, not during the stress of a failure.** If the conditions are met, roll back without debate.

Rollback execution: announce in the change channel, execute the pre-documented procedure, verify that the metrics that triggered the rollback have recovered, update the change request status, and create a follow-up investigation ticket.
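
Pre-committed triggers are easiest to honor when they are evaluated mechanically. The sketch below checks the three immediate triggers from the YAML above, assuming metrics arrive as one sample per minute with the newest sample last; `sustained_breach` and `should_roll_back` are hypothetical helpers, not an API of any monitoring tool.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` per-minute samples all exceed `threshold`."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    if len(samples) < minutes:
        return False  # not enough history yet to call it sustained
    return all(sample > threshold for sample in samples[-minutes:])


def should_roll_back(
    error_rates: Sequence[float],       # fraction of failed requests, per minute
    p99_latencies_ms: Sequence[float],  # P99 latency, per minute
    baseline_p99_ms: float,
    unhealthy_pod_fraction: float,
) -> bool:
    """Evaluate the immediate rollback triggers."""
    return (
        sustained_breach(error_rates, 0.05, 2)
        or sustained_breach(p99_latencies_ms, 3 * baseline_p99_ms, 3)
        or unhealthy_pod_fraction > 0.20
    )
```

Requiring consecutive samples, rather than any single spike, keeps a one-minute blip from aborting an otherwise healthy change.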

## Change Windows

Change windows constrain risk to periods of lower impact and higher staffing.

```
Production change windows:
  Standard:  Any time (automated, pre-approved)
  Normal:    Tuesday-Thursday, 02:00-06:00 UTC
  Emergency: Any time (with incident commander approval)

Exclusions:
  - No normal changes within 48 hours of a major release
  - No normal changes during peak traffic
  - No changes on company holidays
  - Respect change freeze periods
```

An agent executing changes should programmatically verify the window:

```python
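def can_execute_change(change_request, current_time):
    """Return True if the change may run now. Helper functions are defined elsewhere."""
    # Emergency changes skip windows and freezes but need incident commander sign-off.
    if change_request.type == "emergency":
        return has_incident_commander_approval(change_request)
    # A freeze blocks every non-emergency change, including pre-approved standard ones.
    if is_change_freeze_active(current_time):
        return False
    # Standard changes are pre-approved and may run at any time.
    if change_request.type == "standard":
        return True
    # Normal changes need both an open change window and the required approvals.
    if not is_within_change_window(current_time):
        return False
    return has_required_approvals(change_request)
```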
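
One possible shape for the `is_within_change_window` helper, assuming the Tuesday-Thursday, 02:00-06:00 UTC window listed above and an exclusive end time. This sketch covers only the weekly window; the exclusions (major releases, peak traffic, holidays) would need their own data sources.

```python
from datetime import datetime, time, timezone

# datetime.weekday() numbers Monday as 0, so Tuesday-Thursday is 1-3.
NORMAL_WINDOW_DAYS = {1, 2, 3}
WINDOW_START = time(2, 0)
WINDOW_END = time(6, 0)  # assumption: the window closes at 06:00 sharp


def is_within_change_window(current_time: datetime) -> bool:
    """True if `current_time` falls inside the normal change window."""
    if current_time.tzinfo is None:
        raise ValueError("current_time must be timezone-aware")
    now_utc = current_time.astimezone(timezone.utc)
    return (
        now_utc.weekday() in NORMAL_WINDOW_DAYS
        and WINDOW_START <= now_utc.time() < WINDOW_END
    )
```

Rejecting naive datetimes is deliberate: a local-time timestamp silently interpreted as UTC could open the window hours early or late.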

## Progressive Rollouts

Never deploy a change to 100% of production simultaneously. Progressive rollouts catch problems when the blast radius is still small.

- **Step 1 -- Canary (1-5%):** Deploy to a single pod or small traffic slice. Monitor for 15-30 minutes.
- **Step 2 -- Limited (10-25%):** Expand to a quarter of the fleet. Monitor for 15-30 minutes.
- **Step 3 -- Broad (50%):** Half the fleet runs the new version. Monitor for 15 minutes.
- **Step 4 -- Full (100%):** Complete the deployment. Monitor for the bake period (1-2 hours).

For fine-grained control, use Argo Rollouts:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
# Fragment: metadata, replicas, selector, and pod template are omitted.
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 15m }
        - setWeight: 25
        - pause: { duration: 15m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
```
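
Where a rollout controller is not available, the same staged progression can be driven by a small loop. This is a sketch under stated assumptions: `set_weight`, `is_healthy`, and `roll_back` are hypothetical callables supplied by the caller, and `is_healthy` is assumed to block for the monitoring period of that stage before returning.

```python
from typing import Callable, Sequence


def run_progressive_rollout(
    weights: Sequence[int],
    set_weight: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    roll_back: Callable[[], None],
) -> bool:
    """Step through traffic weights; roll back and stop at the first unhealthy stage."""
    for weight in weights:
        set_weight(weight)
        if not is_healthy(weight):
            roll_back()
            return False
    return True
```

Calling it with `weights=[5, 25, 50, 100]` mirrors the four steps above; the function returns `False` as soon as a stage fails, so later stages are never reached.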

## Change Freeze Policies

Change freezes prohibit non-emergency changes during high-risk periods: end-of-quarter, major product launches, holiday periods with reduced staffing, or after a major incident until the post-mortem is complete.

```yaml
change_freeze:
  name: "Q1 2026 End-of-Quarter"
  start: "2026-03-28T00:00:00Z"
  end: "2026-04-02T00:00:00Z"
  scope: "all-production"
  exceptions:
    - "Security patches rated Critical or High"
    - "Fixes for active production incidents"
  enforcement: "automated"
  approver_for_exceptions: "vp-engineering"
```

Enforce freezes in your deployment pipeline. The deployment tool should query the freeze schedule and block non-exempt changes automatically.
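
A pipeline gate for this can be very small. The sketch below assumes the freeze schedule has already been loaded into `ChangeFreeze` records (a hypothetical type mirroring the YAML above), that the `end` timestamp is exclusive, and that the caller has already decided whether the change qualifies for an exception.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChangeFreeze:
    name: str
    start: datetime
    end: datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def active_freeze(freezes: Iterable[ChangeFreeze], now: datetime) -> Optional[ChangeFreeze]:
    """Return the first freeze covering `now`, or None."""
    for freeze in freezes:
        if freeze.start <= now < freeze.end:
            return freeze
    return None


def is_blocked_by_freeze(freezes: Iterable[ChangeFreeze], now: datetime, exempt: bool) -> bool:
    """True if a freeze is active and the change has no approved exception."""
    return active_freeze(freezes, now) is not None and not exempt
```

Returning the matching freeze from `active_freeze` lets the pipeline name the freeze in its rejection message, which tells the requester who approves exceptions.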

## Step-by-Step Change Execution Workflow

1. **Submit change request.** Fill out all fields including implementation plan, rollback plan, and rollback criteria.
2. **Risk assessment.** Score the change across four dimensions to determine the approval path.
3. **Peer review.** At least one engineer reviews the plan and rollback procedure.
4. **Approval.** Required approvers sign off based on risk level.
5. **Schedule.** Assign to an approved change window. Notify affected teams.
6. **Pre-flight checks.** Verify monitoring works, rollback tooling is accessible, no freeze is active. Take backups if applicable.
7. **Announce.** Post in the change channel with CR ID, description, ETA, and rollback criteria reference.
8. **Execute.** Follow the implementation plan step by step. Verify expected outcomes before proceeding.
9. **Monitor.** Watch dashboards for the bake period. Compare against baseline metrics.
10. **Validate.** Run smoke tests or synthetic checks against the changed system.
11. **Close.** Update the change request to completed. Post a summary with before/after metrics.

If rollback criteria are triggered during steps 8-10, execute the rollback plan immediately.

## Agent Operational Notes

- **Never skip the rollback plan.** Every change request must have one before execution begins.
- **Log every action.** Humans must be able to reconstruct exactly what the agent did and when.
- **Respect approval gates.** An agent should not self-approve its own changes.
- **Fail safe.** If monitoring data is unavailable during a rollout, treat it as a rollback trigger.
- **Communicate proactively.** Post status updates at each stage. Silence during a change is alarming.
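
The fail-safe rule in the notes above can be made explicit in code. In this minimal sketch, a missing metric is represented as `None` and is handled exactly like a breached threshold; `rollout_decision` and the 5% default are illustrative assumptions.

```python
from typing import Optional


def rollout_decision(error_rate: Optional[float], max_error_rate: float = 0.05) -> str:
    """Return 'continue' or 'rollback' for the current rollout stage."""
    if error_rate is None:
        # Monitoring data is unavailable: fail safe rather than proceed blind.
        return "rollback"
    return "rollback" if error_rate > max_error_rate else "continue"
```

Making the `None` branch explicit prevents a common failure mode where a missing value is coerced to zero and read as a healthy signal.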

