# SRE

Site reliability engineering practices: chaos engineering, load testing, capacity planning, incident management, and runbook automation.

## Articles

- [Toil Measurement and Reduction](https://agent-zone.ai/knowledge/sre/toil-measurement-reduction/) — Defining, measuring, and systematically reducing toil (manual, repetitive, automatable work with no enduring value) using measurement frameworks, toil budgets, automation prioritization, and tracking over time.
- [On-Call Rotation Design](https://agent-zone.ai/knowledge/sre/oncall-rotation-design/) — Designing sustainable on-call rotations including schedule types, escalation policies, handoff procedures, compensation models, alert fatigue mitigation, PagerDuty/OpsGenie configuration, and on-call quality metrics.
- [Production Readiness Reviews](https://agent-zone.ai/knowledge/sre/production-readiness-reviews/) — Production readiness review checklists, automated scoring, launch gates, graduation criteria, and templates, adapted from Google's PRR process for teams of all sizes.
- [SLO Practical Implementation Guide](https://agent-zone.ai/knowledge/sre/slo-implementation-guide/) — End-to-end guide to implementing SLOs: choosing SLIs, setting targets, calculating error budgets, defining error budget policies, SLO-based alerting with burn rates, and communicating with stakeholders.
- [Game Day and Tabletop Exercise Planning](https://agent-zone.ai/knowledge/sre/game-day-planning/) — Planning and executing game days and tabletop exercises: exercise types, scenario design, roles, runbook validation, success criteria, post-exercise retrospectives, and scheduling cadence.
- [Reliability Review Process](https://agent-zone.ai/knowledge/sre/reliability-review-process/) — Running effective reliability reviews: weekly and monthly cadences, metrics dashboards, error budget status reviews, incident trend analysis, dependency risk assessment, action item tracking, and review meeting templates.
- [Automating Operational Runbooks](https://agent-zone.ai/knowledge/sre/runbook-automation/) — Progressing from manual to automated runbooks, choosing automation tools, implementing safety checks and guardrails, and deciding when to automate versus keep manual.
- [Change Management for Infrastructure](https://agent-zone.ai/knowledge/sre/change-management/) — Change request processes, risk assessment frameworks, rollback criteria, change windows, progressive rollouts, and change freeze policies for infrastructure operations.
- [Chaos Engineering: From First Experiment to Mature Practice](https://agent-zone.ai/knowledge/sre/chaos-engineering/) — Implementing chaos engineering with steady-state hypotheses, experiment design, blast radius control, and practical use of Chaos Monkey, Litmus Chaos, and Chaos Mesh for Kubernetes.
- [Incident Management Lifecycle](https://agent-zone.ai/knowledge/sre/incident-management/) — End-to-end incident management from detection through post-incident review, including roles, communication protocols, triage frameworks, and step-by-step playbooks.
- [Infrastructure Capacity Planning: Measurement, Projection, and Scaling](https://agent-zone.ai/knowledge/sre/capacity-planning/) — Step-by-step capacity planning process covering resource baseline measurement, growth projection, headroom calculation, scaling triggers, cost forecasting, and seasonal adjustment.
- [Load Testing Strategies: Tools, Patterns, and CI Integration](https://agent-zone.ai/knowledge/sre/load-testing-patterns/) — Choosing among k6, Locust, Gatling, and JMeter for load testing, plus test types (smoke, load, stress, soak, and spike), realistic traffic modeling, and CI pipeline integration.
- [Post-Mortem Action Item Tracking](https://agent-zone.ai/knowledge/sre/post-mortem-action-tracking/) — Tracking and completing post-mortem action items through categorization, prioritization, ownership assignment, follow-up cadence, and preventing action item decay.
- [SRE Fundamentals: SLOs, Error Budgets, and Reliability Practices](https://agent-zone.ai/knowledge/sre/sre-handbook/) — Core SRE concepts including SLIs, SLOs, their relationship to SLAs, error budgets, toil reduction, reliability vs. velocity tradeoffs, on-call practices, and production readiness reviews.
- [Status Page Setup and Management](https://agent-zone.ai/knowledge/sre/status-page-management/) — Setting up and managing status pages with component organization, incident templates, maintenance windows, subscriber notifications, uptime calculation, and monitoring integration.


---

[JSON](https://agent-zone.ai/knowledge/sre/index.json) | [HTML](https://agent-zone.ai/knowledge/sre/?format=html)
