{"page":{"agent_metadata":{"content_type":"runbook","outputs":["operator-validation-checklist","mattermost-receiver-config","silence-policy","synthetic-alert-test-recipe","webhook-secret-mount"],"prerequisites":["alertmanager-configuration","prometheus-alerting-rules"]},"categories":["observability"],"content_plain":"A receiver YAML passes static review and the helm release reports deployed. The alertmanager pod is Running 1/1. A real critical alert fires and goes nowhere. The alertmanager pod logs are clean. The receiver works fine for a hand-rolled curl to the webhook URL. The trap is that the prometheus-operator generated a Secret containing the rendered config but flagged a sync error in its own logs — and the alertmanager pod kept serving the previous-good rendering, silently. This article assumes familiarity with the basic alertmanager routing tree, receivers, inhibition rules, and templating covered in alertmanager-configuration. It extends that material with the Day-2 operations of the kube-prometheus-stack chart specifically: where errors actually surface, what the native receiver schemas allow (and don\u0026rsquo;t), and the silence discipline that keeps the alert pipeline trustworthy.\nOperator validation hierarchy# When the prometheus-operator manages alertmanager via the Alertmanager and AlertmanagerConfig CRDs (or kube-prometheus-stack\u0026rsquo;s alertmanager.config helm-values block), the validation pipeline has three layers — and errors at each layer surface in a different place.\nLayer Validates Errors surface in Helm template YAML syntax, helm template functions helm upgrade output prometheus-operator reconcile alertmanager config schema (unknown fields, type mismatches, regex compile) operator pod logs Alertmanager pod Final coordinator.go load of the rendered file alertmanager pod logs The trap: layers 1 and 3 commonly look healthy when layer 2 has rejected the new config. 
The operator generates the Secret, attaches a sync-error condition, and stops promoting it; the alertmanager pod continues serving the last successfully-loaded config. There is no CrashLoopBackOff, no Pending, no helm-release failure — only a quiet reconcile loop in operator logs.\nlevel=error ts=2026-05-07T14:22:11.493Z caller=operator.go:1421 component=alertmanageroperator msg=\u0026#34;provisioning alertmanager configuration failed\u0026#34; err=\u0026#34;failed to unmarshal alertmanager config: yaml: unmarshal errors: line 47: field title not found in type config.plain\u0026#34;The reconcile retry is silent and indefinite (~30s by default). Nothing in the alertmanager pod, the helm release, or kubectl get alertmanager makes this visible without targeted log inspection.\nThe 7-step diagnostic ladder# When an alert isn\u0026rsquo;t reaching its receiver, run these in order. Each step rules out one layer.\n# (1) Did the operator successfully reconcile the new config? kubectl logs -n monitoring deploy/\u0026lt;release\u0026gt;-kube-prometheus-operator --tail=200 \\ | grep -iE \u0026#39;alertmanager|provision|reconcil|error\u0026#39; # (2) What did the operator render into the alertmanager Secret? kubectl get secret -n monitoring \\ alertmanager-\u0026lt;release\u0026gt;-kube-prometh-alertmanager-generated \\ -o jsonpath=\u0026#39;{.data.alertmanager\\.yaml\\.gz}\u0026#39; | base64 -d | gunzip | head -200 # (3) Did the alertmanager pod actually load it? kubectl port-forward -n monitoring svc/\u0026lt;release\u0026gt;-kube-prometh-alertmanager 9093:9093 amtool config show --alertmanager.url=http://localhost:9093 curl -s http://localhost:9093/api/v2/status | jq .config # (4) Does the alert route to the expected receiver? amtool config routes test --alertmanager.url=http://localhost:9093 \\ severity=critical alertname=PodCrashLoopBackOff namespace=production # (5) Did the alert reach alertmanager at all? 
amtool alert query --alertmanager.url=http://localhost:9093 # (6) Is it silenced or inhibited? amtool silence query --alertmanager.url=http://localhost:9093 # (7) Did the receiver call succeed? kubectl logs -n monitoring alertmanager-\u0026lt;release\u0026gt;-kube-prometh-alertmanager-0 \\ | grep -iE \u0026#39;notify|error|retry\u0026#39;A green step (1) with a stale config in step (2) and matching config show in step (3) confirms the silent-rejection trap. A green step (3) with empty step (5) points at Prometheus-side rule problems (see PrometheusRule discipline below). A green step (5) with empty step (7) points at receiver connectivity (DNS, secret mount, network policy).\nThe Mattermost native receiver# Alertmanager v0.27 introduced mattermost_configs as a first-class receiver, removing the need for a webhook translator sidecar. The schema is real but not symmetric with slack_configs or pagerduty_configs. The full set of allowed fields:\nwebhook_url or webhook_url_file send_resolved (bool) channel (string — channel name, no leading #) username (string — overrides webhook default) icon_emoji or icon_url priority (string — important, urgent, or empty) text (template-rendered body) http_config There is no title: field. Slack, PagerDuty, OpsGenie, and Email receivers all have one. Mattermost does not. A config that copies the Slack pattern and sets title: will pass static review, render, and produce the silent-rejection symptom from the previous section. 
Fold the title prefix into the first line of text::\nreceivers: - name: \u0026#39;ops-critical\u0026#39; mattermost_configs: - webhook_url_file: \u0026#39;/etc/alertmanager/secrets/monitoring-secrets/alertmanager-mm-webhook\u0026#39; send_resolved: true channel: \u0026#39;your-alerts-channel\u0026#39; # no leading # text: | [CRITICAL] {{ .GroupLabels.alertname }} @oncall @platform-lead {{ range .Alerts }}**{{ .Annotations.summary }}** {{ .Annotations.description }} --- Pod: `{{ .Labels.namespace }}/{{ .Labels.pod }}` · Status: `{{ .Status }}` {{ end }}The wider lesson: native receiver schemas are not symmetric. Read the alertmanager config-schema reference for each integration before assuming a field exists. Static YAML review will not catch unknown-field errors; only an operator dry-run or amtool check-config against the rendered Secret will.\nMounting the webhook URL via prometheus-operator secrets:# Webhook URLs contain bearer-token-equivalent path segments. Embedding them in a helm values file (or in the operator-generated alertmanager Secret committed via GitOps) defeats rotation hygiene. The webhook_url_file: form combined with the operator\u0026rsquo;s secrets: mount field reads from an out-of-band Secret that can be rotated without a helm upgrade.\nIn the Alertmanager CRD spec — or under alertmanager.alertmanagerSpec in kube-prometheus-stack values:\nalertmanager: alertmanagerSpec: secrets: - monitoring-secrets # k8s Secret in the same namespaceMount path inside the alertmanager pod follows a fixed pattern:\n/etc/alertmanager/secrets/\u0026lt;secret-name\u0026gt;/\u0026lt;key\u0026gt; # e.g. 
/etc/alertmanager/secrets/monitoring-secrets/alertmanager-mm-webhookCreate or rotate the Secret, then trigger the alertmanager pod to re-read:\nkubectl create secret generic monitoring-secrets \\ -n monitoring \\ --from-literal=alertmanager-mm-webhook=\u0026#39;http://mattermost.svc.cluster.local:8065/hooks/REDACTED\u0026#39; \\ --dry-run=client -o yaml | kubectl apply -f - kubectl rollout restart statefulset -n monitoring \\ alertmanager-\u0026lt;release\u0026gt;-kube-prometh-alertmanagerRender path quirk: props.attachments[0].text# When the receiver is Mattermost, alertmanager v0.32 sends notifications using MM\u0026rsquo;s attachment format, not as plain post text. Tools that grep the Mattermost REST API response for keywords in posts[*].message will see nothing — the rendered alert body lives in posts[*].props.attachments[0].text:\n{ \u0026#34;id\u0026#34;: \u0026#34;abc123\u0026#34;, \u0026#34;message\u0026#34;: \u0026#34;\u0026#34;, \u0026#34;props\u0026#34;: { \u0026#34;attachments\u0026#34;: [{ \u0026#34;fallback\u0026#34;: \u0026#34;[CRITICAL] PodCrashLoopBackOff\u0026#34;, \u0026#34;text\u0026#34;: \u0026#34;[CRITICAL] PodCrashLoopBackOff\\n@oncall ...\u0026#34; }] } }Visual rendering in the UI is correct. Test harnesses that assert against message will report false delivery failures.\nSilencing: amtool vs helm-values disable# Chart-default rules generate false positives on minikube and other single-node dev clusters where kube-etcd, kube-controller-manager, kube-scheduler, and kube-proxy are not exposed as Service resources. The TargetDown, etcdMembersDown, and NodeClockNotSynchronising rules fire indefinitely. Silencing them buys quiet but doesn\u0026rsquo;t fix the underlying noise generator. 
Silences are the bridge; helm-values rule-disablement is the destination.\nThe two-tier strategy:\n# Tier 1: immediate silence at runtime via amtool (or the UI) amtool silence add \\ alertname=TargetDown job=kube-etcd \\ --alertmanager.url=http://localhost:9093 \\ --duration=720h \\ --comment=\u0026#39;single-node cluster, no etcd Service\u0026#39; amtool silence add \\ \u0026#39;alertname=~Target.*\u0026#39; job=kube-controller-manager \\ --alertmanager.url=http://localhost:9093 \\ --duration=720h \\ --comment=\u0026#39;single-node cluster\u0026#39;# Tier 2: disable the rule generators in helm values defaultRules: disabled: NodeClockNotSynchronising: true kubeEtcd: enabled: false kubeControllerManager: enabled: false kubeScheduler: enabled: false kubeProxy: enabled: falseOnce the Tier 2 PR ships, the silences become redundant and expire naturally. Editing the rule directly is not durable: chart-managed PrometheusRule resources regenerate on the next helm upgrade. Use defaultRules.disabled.\u0026lt;RuleName\u0026gt;: true for chart rules; use kubeEtcd.enabled: false and siblings to stop the upstream ServiceMonitor/Endpoint generators.\nSilence retention semantics# Silences are deleted ~5 days after they expire (default --data.retention=120h on alertmanager). The expired state is queryable for that retention window via amtool silence query --expired. Silences survive alertmanager container restarts (persisted in alertmanager\u0026rsquo;s silences snapshot file under --storage.path, default /alertmanager); surviving pod rescheduling requires that path to sit on a persistent volume. Silences are runtime state, not config. They do not sync from prometheus-operator; recreate after a fresh cluster bootstrap. Verifying routing with synthetic alerts# A synthetic alert via POST /api/v2/alerts is the only test that exercises the full receiver path: routing tree match, template render, secret read, webhook resolve, downstream channel post. Static amtool check-config catches schema. 
The chart\u0026rsquo;s always-firing Watchdog alert proves the pipeline runs end-to-end on its baseline route; it does not prove that a new severity branch added in this helm upgrade routes correctly.\nTreat synthetic-alert delivery as a mandatory post-change verification step.\ncurl -X POST http://localhost:9093/api/v2/alerts \\ -H \u0026#39;Content-Type: application/json\u0026#39; \\ -d \u0026#39;[{ \u0026#34;labels\u0026#34;: { \u0026#34;alertname\u0026#34;: \u0026#34;TestAlert\u0026#34;, \u0026#34;severity\u0026#34;: \u0026#34;warning\u0026#34;, \u0026#34;namespace\u0026#34;: \u0026#34;production\u0026#34; }, \u0026#34;annotations\u0026#34;: { \u0026#34;summary\u0026#34;: \u0026#34;Synthetic test from operator\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;Verifying receiver routing\u0026#34; }, \u0026#34;startsAt\u0026#34;: \u0026#34;\u0026#39;\u0026#34;$(date -u +%Y-%m-%dT%H:%M:%SZ)\u0026#34;\u0026#39;\u0026#34;, \u0026#34;endsAt\u0026#34;: \u0026#34;\u0026#39;\u0026#34;$(date -u -v+5M +%Y-%m-%dT%H:%M:%SZ)\u0026#34;\u0026#39;\u0026#34; }]\u0026#39;Equivalent via amtool:\namtool alert add alertname=TestAlert severity=warning namespace=production \\ --alertmanager.url=http://localhost:9093 \\ --annotation=summary=\u0026#39;Synthetic test\u0026#39; \\ --annotation=description=\u0026#39;Verifying delivery\u0026#39;Run one synthetic alert per severity branch on every config change. Confirm receipt in the downstream channel before declaring the change complete.\nAuthoring custom PrometheusRule# Three discipline points trip up most PrometheusRule authors writing rules outside the chart defaults.\nThe release: selector label. The operator selects PrometheusRule resources via Prometheus.spec.ruleSelector. The kube-prometheus-stack default is:\nruleSelector: matchLabels: release: \u0026lt;helm-release-name\u0026gt;A PrometheusRule without this label is dropped silently. 
Symptom: kubectl get prometheusrules shows the resource, but /rules in the Prometheus UI doesn\u0026rsquo;t list it, and the alert never fires regardless of expression value.\nfor: thresholds calibrated to rollout windows. for: controls how long an expression must be true before the alert transitions from pending to firing. Setting it too short produces false positives during normal helm upgrade rollouts; setting it too long hides slow rollouts past the point of usefulness.\n- alert: DeploymentNotAvailable expr: | kube_deployment_status_replicas_available{namespace=\u0026#34;production\u0026#34;} / kube_deployment_spec_replicas{namespace=\u0026#34;production\u0026#34;} == 0 for: 4m labels: severity: critical4m is the empirical sweet spot for healthcheck-gated pods with ~30s readiness. During a rollout, replicas can briefly hit 0 available (old terminating, new not-yet-Ready). 2m fires false positives; 5m hides slow rollouts; 4m is long enough to span typical rollouts and short enough to catch real outages within the operator\u0026rsquo;s first reconcile cycle. Calibrate to your own readiness SLAs, but pick a number deliberately, not a round one.\nCompanion rules for missing-limit cases. A common pattern is alerting on a ratio whose denominator can be absent. The main rule fails to fire for the most dangerous case (no limit set at all) because the ratio returns no series for that container:\n- alert: PodCPUSustained_NoLimit expr: | rate(container_cpu_usage_seconds_total{namespace=\u0026#34;production\u0026#34;,container!=\u0026#34;\u0026#34;}[5m]) \u0026gt; 1.0 unless on (namespace, pod, container) kube_pod_container_resource_limits{namespace=\u0026#34;production\u0026#34;,resource=\u0026#34;cpu\u0026#34;} for: 10m labels: { severity: warning }Pair every ratio-based rule with a companion that catches the denominator-missing case explicitly. Use unless for per-series absence: absent() returns a result only when the selector matches no series at all, and its output carries no pod or container labels to join on.\nThe --reuse-values interaction# helm upgrade --reuse-values -f new-values.yaml silently ignores -f for any path already present in stored values. 
The operator continues serving the old alertmanager.config block. There is no error and the helm release shows deployed. This is the most common reason a \u0026ldquo;deployed\u0026rdquo; alertmanager config update has no effect.\n# Wrong — appears to succeed, no-ops the -f helm upgrade \u0026lt;release\u0026gt; prometheus-community/kube-prometheus-stack \\ --reuse-values -f bootstrap/helm-values/prometheus-stack.yaml # Right — full re-render with -f helm upgrade \u0026lt;release\u0026gt; prometheus-community/kube-prometheus-stack \\ -n monitoring -f bootstrap/helm-values/prometheus-stack.yaml \\ --version \u0026lt;chart-version\u0026gt; # Helm 3.14+: reset stored values, then layer new -f on top helm upgrade \u0026lt;release\u0026gt; prometheus-community/kube-prometheus-stack \\ --reset-then-reuse-values -f bootstrap/helm-values/prometheus-stack.yamlVerify the new values landed:\nhelm get values \u0026lt;release\u0026gt; -n monitoring --revision \u0026lt;n\u0026gt; helm history \u0026lt;release\u0026gt; -n monitoringIf helm get values doesn\u0026rsquo;t show your -f paths, the upgrade silently dropped them. Full mechanic and recovery patterns in helm-gotchas-reuse-values-revisions-rollback.\nOperational lessons# When the prometheus-operator manages alertmanager, alertmanager config errors surface in operator logs — the alertmanager pod itself stays quiet on bad config and serves the previous-good rendering. Always include a step-(1) check on operator logs before chasing alertmanager-pod symptoms. Native receiver schemas are not symmetric. mattermost_configs has no title: field despite slack/pagerduty/opsgenie all having one. Don\u0026rsquo;t assume; read the schema for the specific integration. Static review can\u0026rsquo;t catch unknown-field errors. An amtool check-config (or operator dry-run) belongs in CI as a gate, not on the code reviewer\u0026rsquo;s checklist. helm upgrade --reuse-values -f new.yaml silently ignores -f for paths already in stored values. 
If the alertmanager config doesn\u0026rsquo;t seem to be updating after a \u0026ldquo;successful\u0026rdquo; upgrade, suspect this first. Silence chart-default false-positives, but layer that under a helm-values rule-disablement PR. Silences are the bridge, not the destination. A synthetic POST /api/v2/alerts is the only test that exercises the full receiver path. Watchdog proves the pipeline runs; it doesn\u0026rsquo;t prove a new severity branch routes correctly. When the receiver is Mattermost, debugging tools that grep the post message field will see nothing. Alertmanager writes to props.attachments[0].text. ","date":"2026-05-07","description":"Day-2 operations for kube-prometheus-stack alertmanager: where validation errors actually surface, native receiver schema asymmetries (including Mattermost), secret-mount patterns, two-tier silence discipline, and synthetic alert testing.","lastmod":"2026-05-07","levels":["intermediate"],"reading_time_minutes":10,"section":"knowledge","skills":["alertmanager-debugging","operator-config-validation","silence-discipline","synthetic-alert-testing"],"tags":["alertmanager","prometheus-operator","kube-prometheus-stack","mattermost","silences","amtool","helm","observability"],"title":"Operating prometheus-stack Alertmanager: Operator Validation, Native Receivers, and Silence Discipline","tools":["alertmanager","prometheus-operator","amtool","helm","kubectl"],"url":"https://agent-zone.ai/knowledge/observability/prometheus-stack-alertmanager-operations/","word_count":1937}}