Status Page Setup and Management

Purpose of a Status Page

A status page is the single source of truth for service health. It communicates current status, provides historical reliability data, and sets expectations during incidents through regular updates. A well-maintained status page reduces support tickets during incidents, builds customer trust, and gives teams a structured communication channel.

Platform Options

Statuspage.io (Atlassian)

The most widely adopted hosted solution. Integrates with the Atlassian ecosystem.

# Create a component (the JSON body needs an explicit Content-Type header)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/components" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"component": {"name": "API", "status": "operational", "showcase": true}}'

# Create an incident ("id" is a placeholder for the affected component's ID)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {"name": "Elevated Error Rates", "status": "investigating",
       "impact_override": "minor", "component_ids": ["id"]}}'

Strengths: Highly reliable, subscriber notifications built-in, custom domains, API-first.

Weaknesses: Expensive ($399+/month business plan), limited customization, component limits on lower tiers.

Structuring Effective On-Call Runbooks: Format, Escalation, and Diagnostic Decision Trees

Why Runbooks Exist

An on-call engineer paged at 3 AM has limited cognitive capacity. They may not be familiar with the specific service that is failing. They may have joined the team two weeks ago. A runbook bridges the gap between the alert firing and the correct human response. Without runbooks, incident response depends on tribal knowledge – the engineer who built the service and knows its failure modes. That engineer is on vacation when the incident hits.

Writing Effective Prometheus Alerting Rules

Rule Syntax

Alerting rules live in rule files loaded by Prometheus. Each rule has an expression, an optional for duration, labels, and annotations.

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

The for duration is critical. Without it, a single bad scrape triggers an alert. With for: 5m, the condition must be continuously true across all evaluations for 5 minutes before the alert fires. During this window the alert is in pending state.
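
The pending-to-firing behaviour can be sketched as a small simulation. This is a simplified model rather than Prometheus's actual implementation: the function name `alert_states` and its parameters are hypothetical, and it assumes evenly spaced evaluations and ignores features such as `keep_firing_for`.

```python
from typing import List, Optional


def alert_states(condition_results: List[bool], eval_interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each rule evaluation.

    condition_results[i] is True when the rule expression returned a result
    at evaluation i. Evaluations are assumed to be eval_interval_s apart.
    """
    states: List[str] = []
    active_since: Optional[int] = None  # time the condition first became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation resets the timer; the alert goes inactive.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires only once the condition has held for the full duration.
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


if __name__ == "__main__":
    # One-minute evaluations with for: 5m (300 seconds).
    print(alert_states([True] * 6, 60, 300))
    # A single bad evaluation never gets past pending.
    print(alert_states([False, True, False, False], 60, 300))
```

In this model a one-off spike shows up as a single pending evaluation and then returns to inactive, which is the noise the for duration is meant to filter out.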

Advanced PromQL: Performance, Cardinality, and Complex Query Patterns

Cardinality Explosion

Cardinality is the number of unique time series Prometheus tracks. Every unique combination of metric name and label key-value pairs creates a separate series. A metric with 3 labels, each having 100 possible values, generates up to 1,000,000 series. In practice, cardinality explosions are the single most common way to kill a Prometheus instance.

The usual culprits are labels containing user IDs, request paths with embedded IDs (like /api/users/a3f7b2c1), session tokens, trace IDs, or any unbounded value set. A seemingly innocent label like path on an HTTP metric becomes catastrophic when your API has RESTful routes with UUIDs in the path.
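
One common mitigation is to collapse ID-like path segments into a placeholder before the path is used as a label value. The sketch below is illustrative only: `normalize_path`, the `:id` placeholder, and the rule that any all-digit or 8+ character hex segment counts as an ID are assumptions, and a real service would more likely use the route template from its web framework.

```python
import re

# A canonical UUID such as 123e4567-e89b-12d3-a456-426614174000.
_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
# Purely numeric segments, or hex strings of 8 or more characters.
_NUMERIC_OR_HEX = re.compile(r"\d+|[0-9a-fA-F]{8,}")


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with ':id' to keep the label set bounded."""
    normalized = []
    for segment in path.split("/"):
        if _UUID.fullmatch(segment) or _NUMERIC_OR_HEX.fullmatch(segment):
            normalized.append(":id")
        else:
            normalized.append(segment)
    return "/".join(normalized)


if __name__ == "__main__":
    for raw in ("/api/users/a3f7b2c1", "/api/users/42/orders", "/api/users"):
        print(raw, "->", normalize_path(raw))
```

With this in place, every user-specific URL under /api/users maps to the same label value, so the number of series depends on the number of routes rather than the number of users.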

Canary Deployments Deep Dive: Argo Rollouts, Flagger, and Metrics-Based Progressive Delivery

Canary Deployments Deep Dive

A canary deployment sends a small percentage of traffic to a new version of your application while the majority continues hitting the stable version. You monitor the canary for errors, latency regressions, and business metric anomalies. If the canary is healthy, you gradually increase its traffic share until it handles 100%. If something is wrong, you roll back with minimal user impact.
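
The promote-or-roll-back loop described here can be sketched in a few lines. This is a conceptual model, not how Argo Rollouts or Flagger are implemented: `run_canary`, the list of traffic weights, and the `is_healthy` callback (standing in for whatever metric analysis you run at each step) are all hypothetical.

```python
from typing import Callable, List, Tuple


def run_canary(steps: List[int], is_healthy: Callable[[int], bool]) -> Tuple[str, List[int]]:
    """Walk through traffic weights, stopping at the first unhealthy step.

    steps holds the canary traffic percentages to try in order.
    is_healthy(weight) reports whether the canary looks healthy at that weight.
    Returns the outcome and the weights that were actually applied.
    """
    applied: List[int] = []
    for weight in steps:
        applied.append(weight)  # shift this share of traffic to the canary
        if not is_healthy(weight):
            # Abort: traffic goes back to the stable version.
            return "rolled_back", applied
    return "promoted", applied


if __name__ == "__main__":
    # Assumed scenario: the canary degrades once it takes half the traffic.
    print(run_canary([5, 25, 50, 100], lambda weight: weight < 50))
```

The key property is that a failure at an early step is only ever seen by the small share of users routed to the canary at that point.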

Why Canary Over Rolling Update

A standard Kubernetes rolling update replaces pods one by one until all pods run the new version. The problem is timing. By the time you notice a bug in your monitoring dashboards, the rolling update may have already replaced most or all pods. Every user is now hitting the broken version.

Kubernetes Cost Optimization: Rightsizing, Resource Efficiency, and Waste Reduction

Kubernetes Cost Optimization

Most Kubernetes clusters run at 15-30% actual CPU utilization but are billed for the full provisioned capacity. The gap between what you reserve and what you use is pure waste. This article covers the practical workflow for finding and eliminating that waste.

The Cost Problem: Requests vs Actual Usage

Kubernetes resource requests are the foundation of cost. When a pod requests 4 CPUs, the scheduler reserves 4 CPUs on a node regardless of whether the pod ever uses more than 0.1 CPU. The node is sized (and billed) based on what is reserved, not what is consumed.
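
To make the gap concrete, the sketch below totals requested and used CPU for a set of pods and reports the idle share. The `PodCpu` type, the pod names, and the usage figures are made up for illustration; in practice the numbers would come from your metrics system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PodCpu:
    name: str
    requested_millicores: int  # what the scheduler reserves
    used_millicores: int       # what the pod actually consumes


def cpu_waste(pods: List[PodCpu]) -> Dict[str, float]:
    """Summarize reserved versus consumed CPU across pods."""
    requested = sum(p.requested_millicores for p in pods)
    used = sum(p.used_millicores for p in pods)
    return {
        "requested_millicores": requested,
        "used_millicores": used,
        "idle_millicores": requested - used,
        "utilization": used / requested if requested else 0.0,
    }


if __name__ == "__main__":
    pods = [
        PodCpu("api", requested_millicores=4000, used_millicores=100),
        PodCpu("worker", requested_millicores=2000, used_millicores=1100),
    ]
    print(cpu_waste(pods))
```

In this made-up example the two pods reserve 6 CPUs and use 1.2, so 80% of the reserved capacity is paid for but idle.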

Kubernetes Resource Management: QoS Classes, Eviction, OOM Scoring, and Capacity Planning

Kubernetes Resource Management Deep Dive

Resource management in Kubernetes is the mechanism that decides which pods get scheduled, which pods get killed when the node runs low, and how much CPU and memory each container is actually allowed to use. The surface-level concept of requests and limits is straightforward. The underlying mechanics – QoS classification, CFS CPU quotas, kernel OOM scoring, kubelet eviction thresholds – are where misconfigurations cause production outages.
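
As an illustration of the QoS classification step, the sketch below assigns a class from each container's CPU and memory requests and limits. It is a simplified model of the documented rules: `qos_class` and the dictionary layout are hypothetical, and quantities are compared as plain strings, so equivalent values written differently (such as "500m" and "0.5") would be treated as unequal.

```python
from typing import Dict, List

# Each container is {"requests": {...}, "limits": {...}}; both keys are optional.
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    has_any = False    # does any container set a CPU or memory request or limit?
    guaranteed = True  # does every container have limits equal to requests?
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is set, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                has_any = True
            if limit is None or request != limit:
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
    print(qos_class([{}]))
```

The class matters because it feeds into which pods are evicted or OOM-killed first under node pressure, with BestEffort pods the most exposed and Guaranteed pods the least.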

Long-Term Metrics Storage: Thanos vs Grafana Mimir vs VictoriaMetrics

The Retention Problem

Prometheus stores metrics on local disk with a default retention of 15 days. Most production teams extend this to 30 or 90 days, but local storage has hard limits. A single Prometheus instance cannot scale disk beyond the node it runs on. It provides no high availability – if the instance goes down, you lose scraping and query access. And each Prometheus instance only sees its own targets, so there is no unified view across clusters or regions.

Monitoring Prometheus Itself: Capacity Planning, Self-Monitoring, and Scaling

Why Monitor Your Monitoring

If Prometheus runs out of memory and crashes, you lose all alerting. If its disk fills up, it stops ingesting and you have a blind spot that may last hours before anyone notices. If scrapes start timing out, metrics go stale and alerts based on rate() produce no data (which means they silently stop firing rather than triggering). Prometheus must be the most reliably monitored component in your stack.

Prometheus Cardinality Management: Detecting, Preventing, and Reducing High-Cardinality Metrics

What Cardinality Means

In Prometheus, cardinality is the number of unique time series. Every unique combination of metric name and label key-value pairs constitutes one series. The metric http_requests_total{method="GET", path="/api/users", status="200"} is one series. Change any label value and you get a different series. http_requests_total{method="POST", path="/api/users", status="201"} is a second series.

A single metric name can produce thousands or millions of series depending on its labels. A metric with no labels is exactly one series. A metric with one label that has 10 possible values is 10 series. A metric with three labels, each having 100 possible values, is up to 1,000,000 series (100 x 100 x 100), though in practice not every combination occurs.
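
The arithmetic above can be checked with a short sketch. The helper names `max_series` and `observed_series` are hypothetical: the first computes the theoretical upper bound from the number of values per label, and the second counts the distinct series actually present in a list of (metric name, labels) pairs.

```python
from math import prod
from typing import Dict, Iterable, List, Tuple


def max_series(label_value_counts: List[int]) -> int:
    """Upper bound on series for one metric: the product of per-label value counts.

    An empty list (a metric with no labels) yields exactly one series.
    """
    return prod(label_value_counts)


def observed_series(samples: Iterable[Tuple[str, Dict[str, str]]]) -> int:
    """Count distinct (metric name, label set) combinations."""
    return len({(name, frozenset(labels.items())) for name, labels in samples})


if __name__ == "__main__":
    print(max_series([]))               # no labels
    print(max_series([10]))             # one label with 10 values
    print(max_series([100, 100, 100]))  # three labels with 100 values each

    samples = [
        ("http_requests_total", {"method": "GET", "path": "/api/users", "status": "200"}),
        ("http_requests_total", {"method": "POST", "path": "/api/users", "status": "201"}),
        # Same label set as the first sample, so it is not a new series.
        ("http_requests_total", {"method": "GET", "path": "/api/users", "status": "200"}),
    ]
    print(observed_series(samples))
```

The gap between the two numbers is the "not every combination occurs" caveat: the product is a ceiling, while the observed count is what Prometheus actually has to store.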