Debugging and Tuning Alerts: Why Alerts Don't Fire, False Positives, and Threshold Selection

When an Alert Should Fire but Does Not#

Silent alerts are the most dangerous failure mode in monitoring. The system appears healthy because no one is being paged, but the condition you intended to catch is actively occurring. Work through this checklist in order.

Step 1: Verify the Expression Returns Results#

Open the Prometheus UI at /graph and run the alert expression directly. If the expression returns an empty result, the alert cannot fire, no matter what else is configured.
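
The same check can be run from a terminal. A minimal sketch, assuming Prometheus listens on the default localhost:9090 and the alert uses a hypothetical recording rule job:http_errors:ratio5m:

# Evaluate the alert expression exactly as Prometheus would at evaluation time.
# An empty result means no series currently satisfies the condition, so the alert cannot fire.
promtool query instant http://localhost:9090 'job:http_errors:ratio5m > 0.05'

# The same check over the HTTP API:
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=job:http_errors:ratio5m > 0.05'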

Observability Stack Troubleshooting: Diagnosing Prometheus, Alertmanager, Grafana, and Pipeline Failures

“I’m Not Seeing Metrics” – Systematic Diagnosis#

This is the most common observability complaint. Work through these steps in order to isolate where the pipeline breaks.

Step 1: Is the Target Being Scraped?#

Open the Prometheus UI at /targets. Search for the job name or target address. Look at three things: state (UP or DOWN), last scrape timestamp, and error message.

Status: UP    Last Scrape: 3s ago    Duration: 12ms    Error: (none)
Status: DOWN  Last Scrape: 15s ago   Duration: 0ms     Error: connection refused
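
The same information is available over the HTTP API, which helps when the UI is not reachable. A sketch, assuming the default port 9090 and jq installed:

# List active targets with their job, health, and last scrape error.
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'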

If the target does not appear at all, Prometheus does not know about it, which means the scrape configuration (or ServiceMonitor) does not match the target. Jump to the ServiceMonitor checklist at the end of this guide.
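
For clusters using the Prometheus Operator, "matching" has several layers, each of which can fail silently. A minimal ServiceMonitor sketch with the usual failure points called out in comments (all names and labels here are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    release: prometheus        # must match the Prometheus resource's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app              # must match labels on the Service, not the Pod
  namespaceSelector:
    matchNames:
      - my-namespace           # must include the namespace the Service lives in
  endpoints:
    - port: metrics            # must match a *named* port on the Service
      interval: 30s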

Time-Series Database Selection and Operations

Time-series databases optimize for a specific access pattern: high-volume writes of timestamped data points, queries that aggregate over time ranges, and automatic expiration of old data. Choosing the right one depends on your data model, query patterns, retention requirements, and operational constraints.

When You Need a Time-Series Database#

A dedicated time-series database is justified when you have high write throughput (thousands to millions of data points per second), queries that are predominantly time-range aggregations, and data that has a defined retention period. Common use cases: infrastructure metrics, application performance monitoring, IoT sensor data, financial tick data, and log analytics.

Writing Effective Prometheus Alerting Rules

Rule Syntax#

Alerting rules live in rule files loaded by Prometheus. Each rule has an expression, an optional for duration, labels, and annotations.

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
          description: "Current error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

The for duration is critical. Without it, the alert fires on the first evaluation where the expression is true, so a single bad scrape can page someone. With for: 5m, the condition must be continuously true across every evaluation for 5 minutes before the alert fires. During this window the alert is in the pending state.
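
Before reloading Prometheus with a new or edited rule file, promtool can catch YAML and PromQL syntax errors up front (a sketch, assuming the file is saved as alerts.yml):

# Validate rule group structure and the PromQL in each expr.
promtool check rules alerts.yml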

Advanced PromQL: Performance, Cardinality, and Complex Query Patterns

Cardinality Explosion#

Cardinality is the number of unique time series Prometheus tracks. Every unique combination of metric name and label key-value pairs creates a separate series. A metric with 3 labels, each having 100 possible values, generates up to 1,000,000 series. In practice, cardinality explosions are the single most common way to kill a Prometheus instance.

The usual culprits are labels containing user IDs, request paths with embedded IDs (like /api/users/a3f7b2c1), session tokens, trace IDs, or any unbounded value set. A seemingly innocent label like path on an HTTP metric becomes catastrophic when your API has RESTful routes with UUIDs in the path.
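
If the instrumentation cannot be fixed right away, one stopgap is to normalize the offending label at scrape time with metric_relabel_configs. A minimal sketch, assuming the high-cardinality label is called path and the unbounded IDs live under /api/users/ (the job name and target are placeholders):

scrape_configs:
  - job_name: my-app
    static_configs:
      - targets: ['my-app:8080']
    metric_relabel_configs:
      # Collapse /api/users/<id> into a single path value before ingestion.
      - source_labels: [path]
        regex: '(/api/users/).*'     # relabel regexes are fully anchored RE2
        action: replace
        target_label: path
        replacement: '${1}:id'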

Prometheus Cardinality Management: Detecting, Preventing, and Reducing High-Cardinality Metrics

What Cardinality Means#

In Prometheus, cardinality is the number of unique time series. Every unique combination of metric name and label key-value pairs constitutes one series. The metric http_requests_total{method="GET", path="/api/users", status="200"} is one series. Change any label value and you get a different series. http_requests_total{method="POST", path="/api/users", status="201"} is a second series.

A single metric name can produce thousands or millions of series depending on its labels. A metric with no labels is exactly one series. A metric with one label that has 10 possible values is 10 series. A metric with three labels, each having 100 possible values, is up to 1,000,000 series (100 x 100 x 100), though in practice not every combination occurs.
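
To see where series are coming from on a running server, a few instant queries help (the topk query touches every series, so run it sparingly on large instances). The http_requests_total and path names below are from the example above:

# Total series currently in the head block.
prometheus_tsdb_head_series

# Top 10 metric names by series count.
topk(10, count by (__name__) ({__name__=~".+"}))

# Series for one metric, broken down by a suspect label.
count by (path) (http_requests_total)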

SLOs, Error Budgets, and SLI Implementation with Prometheus

SLI, SLO, and SLA – What They Actually Mean#

An SLI (Service Level Indicator) is a quantitative measurement of service quality – a number computed from your metrics. Examples: the proportion of successful HTTP requests, the proportion of requests faster than 500ms, the proportion of jobs completing within their deadline.

An SLO (Service Level Objective) is a target value for an SLI. It is an internal engineering commitment: “99.9% of requests will succeed over a 30-day rolling window.”
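
As a concrete sketch of implementing an availability SLI, a recording rule can precompute the success ratio from a request counter (assuming a counter named http_requests_total with a status label, as in the cardinality examples above):

groups:
  - name: sli
    rules:
      # Fraction of requests over the last 5 minutes that did not return a 5xx.
      - record: job:http_availability:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))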