A receiver YAML passes static review and the Helm release reports deployed. The alertmanager pod is Running 1/1. A real critical alert fires and goes nowhere. The alertmanager pod logs are clean. The receiver works fine for a hand-rolled curl to the webhook URL. The trap is that the prometheus-operator generated a Secret containing the rendered config but flagged a sync error in its own logs, and the alertmanager pod kept silently serving the previous-good rendering.

This article assumes familiarity with the basic alertmanager routing tree, receivers, inhibition rules, and templating covered in alertmanager-configuration. It extends that material with the Day-2 operations of the kube-prometheus-stack chart specifically: where errors actually surface, what the native receiver schemas allow (and don’t), and the silence discipline that keeps the alert pipeline trustworthy.
Alertmanager Configuration and Routing
Routing Tree
Alertmanager receives alerts from Prometheus and decides where to send them based on a routing tree. Every alert enters at the root route and descends through the child routes, which are evaluated in order. By default the first matching child claims the alert, and the deepest matching route supplies the receiver. If no child matches, the root route’s receiver handles it.
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxx"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
route:
  receiver: "default-slack"
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
      group_wait: 10s
      repeat_interval: 1h
      routes:
        - match:
            team: database
          receiver: "pagerduty-dba"
    - match:
        severity: warning
      receiver: "team-slack"
      repeat_interval: 12h
    - match_re:
        namespace: "staging|dev"
      receiver: "dev-slack"
      repeat_interval: 24h

Timing parameters matter. group_wait is how long Alertmanager waits after receiving the first alert in a new group before sending the notification, which lets it batch related alerts together. group_interval is the minimum time before sending updates about a group that already fired. repeat_interval controls how often an unchanged active alert is re-sent.
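To make the routing tree in the configuration above concrete, here is a minimal Python sketch of how an alert's labels resolve to a receiver. It is a simplified model, not Alertmanager's implementation: it assumes no route sets `continue: true`, it ignores grouping and timing, and the `ROUTE` dictionary and the `matches` and `resolve` helpers are illustrative names.

```python
import re

# Simplified copy of the routing tree from alertmanager.yml (receivers only).
ROUTE = {
    "receiver": "default-slack",
    "routes": [
        {
            "match": {"severity": "critical"},
            "receiver": "pagerduty-oncall",
            "routes": [
                {"match": {"team": "database"}, "receiver": "pagerduty-dba"},
            ],
        },
        {"match": {"severity": "warning"}, "receiver": "team-slack"},
        {"match_re": {"namespace": "staging|dev"}, "receiver": "dev-slack"},
    ],
}


def matches(route, labels):
    """True if every equality and regex matcher on the route is satisfied."""
    for name, value in route.get("match", {}).items():
        # A missing label is treated as the empty string.
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Regex matchers are anchored, so the whole value must match.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels):
    """Return the receiver of the deepest matching route (first sibling wins)."""
    for child in route.get("routes", []):
        if matches(child, labels):
            # A child without its own receiver inherits the parent's.
            child = {"receiver": route["receiver"], **child}
            return resolve(child, labels)
    return route["receiver"]
```

One consequence of sibling order in this tree: `resolve(ROUTE, {"severity": "critical", "namespace": "staging"})` yields `pagerduty-oncall`, because the critical route is listed before the staging/dev route. If staging alerts should never page, the namespace route would need to come first.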
Debugging and Tuning Alerts: Why Alerts Don't Fire, False Positives, and Threshold Selection
When an Alert Should Fire but Does Not
Silent alerts are the most dangerous failure mode in monitoring. The system appears healthy because no one is being paged, but the condition you intended to catch is actively occurring. Work through this checklist in order.
Step 1: Verify the Expression Returns Results
Open the Prometheus UI at /graph and run the alert expression directly. If the expression returns empty, the alert cannot fire regardless of anything else.
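The same check can be scripted against the Prometheus HTTP API's instant-query endpoint, `/api/v1/query`. The sketch below assumes the alert expression evaluates to an instant vector (the usual case for alerting rules); the function names, base URL, and example expression are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def count_series(payload):
    """Return how many series an instant-query response contains."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # For a vector result, data.result is a list with one entry per series.
    return len(payload["data"]["result"])


def expression_series(base_url, expr, timeout=10):
    """Run expr as an instant query and return the number of series returned."""
    query = urllib.parse.urlencode({"query": expr})
    url = f"{base_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return count_series(json.load(resp))


# Example call (hypothetical address and expression):
#   expression_series("http://localhost:9090", 'up{job="api"} == 0')
# A return value of 0 means the expression is empty and the alert cannot fire.
```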
Observability Stack Troubleshooting: Diagnosing Prometheus, Alertmanager, Grafana, and Pipeline Failures
“I’m Not Seeing Metrics” – Systematic Diagnosis
This is the most common observability complaint. Work through these steps in order to isolate where the pipeline breaks.
Step 1: Is the Target Being Scraped?
Open the Prometheus UI at /targets. Search for the job name or target address. Look at three things: state (UP or DOWN), last scrape timestamp, and error message.
Status: UP    Last Scrape: 3s ago    Duration: 12ms   Error: (none)
Status: DOWN  Last Scrape: 15s ago   Duration: 0ms    Error: connection refused

If the target does not appear at all, Prometheus does not know about it. This means the scrape configuration (or ServiceMonitor) is not matching the target. Jump to the ServiceMonitor checklist at the end of this guide.
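When the UI is not convenient, the same three fields can be read from the Prometheus HTTP API at `/api/v1/targets`. This is a rough sketch: the helper names and the base URL are hypothetical, and it filters on the `job` label only.

```python
import json
import urllib.request


def summarize_targets(payload, job):
    """Return (scrape_url, health, last_scrape, last_error) per active target of a job."""
    return [
        (t["scrapeUrl"], t["health"], t.get("lastScrape", ""), t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t["labels"].get("job") == job
    ]


def fetch_targets(base_url, timeout=10):
    """Fetch the raw /api/v1/targets response as a dictionary."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


# Example call (hypothetical address and job name):
#   summarize_targets(fetch_targets("http://localhost:9090"), "api")
```

An empty list corresponds to the "target does not appear at all" case: Prometheus has no active target for that job, so the scrape configuration or ServiceMonitor selection is the next thing to check.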