Infrastructure Capacity Planning: Measurement, Projection, and Scaling

Sre

What Capacity Planning Solves#

Running out of capacity during a traffic spike causes outages. Over-provisioning wastes money continuously. Capacity planning is the process of measuring what you use now, projecting what you will need, and ensuring resources are available before demand arrives. Without it, you are either constantly firefighting resource exhaustion or explaining to finance why your cloud bill doubled.

Capacity planning is not a one-time exercise. It is a recurring process – monthly for fast-growing services, quarterly for stable ones.

Load Testing Strategies: Tools, Patterns, and CI Integration

Sre

Why Load Test#

Performance problems discovered in production are expensive. A service that handles 100 requests per second in dev might collapse at 500 in production because connection pools exhaust, garbage collection pauses compound, or a downstream service starts throttling. Load testing reveals these limits before users do.

Load testing answers specific questions: What is the maximum throughput before errors start? At what concurrency does latency degrade beyond acceptable limits? Can the system sustain expected traffic for hours without resource leaks? Will a traffic spike cause cascading failures?

Post-Mortem Action Item Tracking

Sre

The Action Item Problem#

Post-mortem reviews produce action items. Teams agree on what needs to change. Then weeks pass, priorities shift, and items quietly decay into a backlog nobody checks. The next incident hits the same root cause, and the post-mortem produces the same action items again.

Studies of recurring incidents consistently show the root cause was identified in a previous post-mortem, and the corresponding action item was never completed. Action item tracking is the mechanism by which incidents make systems more reliable instead of just more documented.

SRE Fundamentals: SLOs, Error Budgets, and Reliability Practices

Sre

The SRE Model#

Site Reliability Engineering treats operations as a software engineering problem. Instead of a wall between developers who ship features and operators who keep things running, SRE defines reliability as a feature – one that can be measured, budgeted, and traded against velocity. The core insight is that 100% reliability is the wrong target. Users cannot tell the difference between 99.99% and 100%, but the engineering cost to close that gap is enormous. SRE makes this tradeoff explicit through service level objectives.

Status Page Setup and Management

Sre

Purpose of a Status Page#

A status page is the single source of truth for service health. It communicates current status, provides historical reliability data, and sets expectations during incidents through regular updates. A well-maintained status page reduces support tickets during incidents, builds customer trust, and gives teams a structured communication channel.

Platform Options#

Statuspage.io (Atlassian)#

The most widely adopted hosted solution. Integrates with the Atlassian ecosystem.

# Create a component
curl -X POST https://api.statuspage.io/v1/pages/${PAGE_ID}/components \
  -H "Authorization: OAuth ${API_KEY}" \
  -d '{"component": {"name": "API", "status": "operational", "showcase": true}}'

# Create an incident
curl -X POST https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents \
  -H "Authorization: OAuth ${API_KEY}" \
  -d '{"incident": {"name": "Elevated Error Rates", "status": "investigating",
       "impact_override": "minor", "component_ids": ["id"]}}'

Strengths: Highly reliable, subscriber notifications built-in, custom domains, API-first. Weaknesses: Expensive ($399+/month business plan), limited customization, component limits on lower tiers.