# Observability

Logging, metrics, tracing, and alerting with Prometheus, Grafana, Alertmanager, Loki, Jaeger, and OpenTelemetry.

## Articles

- [Alertmanager Configuration and Routing](https://agent-zone.ai/knowledge/observability/alertmanager-configuration/) — Configuring Alertmanager routing trees, receivers, inhibition rules, silences, templates, and high-availability for production alerting.
- [Blameless Post-Mortem Practices: Incident Timelines, Root Cause Analysis, and Organizational Learning](https://agent-zone.ai/knowledge/observability/post-mortem-practices/) — Comprehensive guide to running blameless post-mortems. Covers incident timeline construction, root cause analysis techniques including 5 Whys and fishbone diagrams, action item tracking, post-mortem templates, and building a learning culture around incidents.
- [Choosing a Log Aggregation Stack: Loki vs Elasticsearch vs CloudWatch Logs vs Vector+ClickHouse](https://agent-zone.ai/knowledge/observability/choosing-log-aggregation/) — Decision framework for selecting the right log aggregation solution based on cost, query requirements, operational complexity, log volume, and correlation with metrics and traces.
- [Choosing a Monitoring Stack: Prometheus vs Datadog vs Cloud-Native vs VictoriaMetrics](https://agent-zone.ai/knowledge/observability/choosing-monitoring-stack/) — Decision framework for selecting the right metrics and monitoring stack based on cost model, operational burden, retention needs, query capability, and organizational scale.
- [Debugging and Tuning Alerts: Why Alerts Don't Fire, False Positives, and Threshold Selection](https://agent-zone.ai/knowledge/observability/alerting-debugging-and-tuning/) — Systematic approaches to diagnosing silent alerts, reducing alert fatigue, selecting production-grade thresholds, and managing alert lifecycle.
- [Distributed Tracing in Practice](https://agent-zone.ai/knowledge/observability/distributed-tracing/) — Implementing distributed tracing with Jaeger and Grafana Tempo on Kubernetes, including instrumentation patterns, trace correlation, and debugging slow requests.
- [Grafana Dashboards for Kubernetes Monitoring](https://agent-zone.ai/knowledge/observability/grafana-dashboards/) — Data source configuration, dashboard design patterns using USE and RED methods, variable templates, panel types, provisioning, and Grafana as Code.
- [Grafana Loki for Log Aggregation](https://agent-zone.ai/knowledge/observability/loki-log-aggregation/) — Deploying and querying Grafana Loki for scalable log aggregation in Kubernetes with Promtail, LogQL, and label cardinality management.
- [Grafana Mimir for Long-Term Prometheus Storage](https://agent-zone.ai/knowledge/observability/mimir-deep-dive/) — Reference for Grafana Mimir architecture, deployment modes, tenant isolation, remote_write configuration, retention policies, and performance tuning. Covers distributors, ingesters, store-gateway, compactor, and practical setup examples for production long-term metrics storage.
- [Grafana Organization: Folders, Permissions, Provisioning, and Dashboard Lifecycle](https://agent-zone.ai/knowledge/observability/grafana-organization-and-operations/) — Structuring Grafana for teams at scale with folder strategies, RBAC, provisioning pipelines, GitOps workflows, and multi-tenancy patterns.
- [Log Analysis and Management Strategies: Structured Logging, Aggregation, Retention, and Correlation](https://agent-zone.ai/knowledge/observability/log-analysis-patterns/) — Decision framework for log analysis and management. Covers structured logging best practices, log aggregation architecture with Loki, ELK, and Fluentd, log retention policies, log-based alerting, and correlation with traces and metrics.
- [Logging Patterns in Kubernetes](https://agent-zone.ai/knowledge/observability/kubernetes-logging-patterns/) — Kubernetes logging architecture from stdout capture to centralized aggregation with Fluent Bit, Fluentd, structured logging, and cost management.
- [Observability Stack Troubleshooting: Diagnosing Prometheus, Alertmanager, Grafana, and Pipeline Failures](https://agent-zone.ai/knowledge/observability/observability-troubleshooting-runbook/) — Systematic runbooks for diagnosing missing metrics, slow Prometheus, silent Alertmanager, broken Grafana dashboards, missing logs, and ServiceMonitor failures.
- [OpenTelemetry for Kubernetes](https://agent-zone.ai/knowledge/observability/opentelemetry-basics/) — Deploying the OpenTelemetry Collector on Kubernetes with auto-instrumentation, context propagation, sampling strategies, and exporter configuration.
- [Prometheus Architecture Deep Dive](https://agent-zone.ai/knowledge/observability/prometheus-architecture/) — How Prometheus scraping, storage, service discovery, relabeling, federation, and remote storage work together in production monitoring systems.
- [PromQL Essentials: Practical Query Patterns](https://agent-zone.ai/knowledge/observability/promql-essentials/) — PromQL instant vectors, range vectors, rate functions, aggregation operators, and real queries for the most common monitoring scenarios.
- [Real User Monitoring (RUM) and Frontend Observability: Core Web Vitals, Error Tracking, and Session Replay](https://agent-zone.ai/knowledge/observability/real-user-monitoring/) — Comprehensive reference for implementing Real User Monitoring. Covers Core Web Vitals, performance metrics collection, frontend error tracking, session replay tools, integration with backend observability, and practical comparison of synthetic vs real user monitoring.
- [Setting Up Full Observability from Scratch: Metrics, Logs, Traces, and Alerting](https://agent-zone.ai/knowledge/observability/ops-full-observability-setup/) — Step-by-step operational sequence for deploying a complete observability stack on Kubernetes including Prometheus, Grafana, Loki, Tempo, OpenTelemetry, and Alertmanager.
- [Structuring Effective On-Call Runbooks: Format, Escalation, and Diagnostic Decision Trees](https://agent-zone.ai/knowledge/observability/on-call-runbook-patterns/) — Practical guide to building on-call runbooks that reduce incident response time. Covers runbook format, escalation procedures, diagnostic decision trees, common scenario templates, runbook testing, and integration with alerting systems.
- [Writing Effective Prometheus Alerting Rules](https://agent-zone.ai/knowledge/observability/prometheus-alerting-rules/) — Practical alerting rule patterns for Kubernetes infrastructure, thresholds that avoid alert fatigue, and testing with promtool.
- [Advanced PromQL: Performance, Cardinality, and Complex Query Patterns](https://agent-zone.ai/knowledge/observability/promql-advanced-patterns/) — Deep-dive into PromQL cardinality control, expensive query avoidance, binary operator matching, prediction functions, subqueries, and recording rule strategy.
- [Long-Term Metrics Storage: Thanos vs Grafana Mimir vs VictoriaMetrics](https://agent-zone.ai/knowledge/observability/thanos-mimir-victoriametrics/) — A decision framework for choosing between Thanos, Grafana Mimir, and VictoriaMetrics for long-term Prometheus metrics storage, multi-cluster aggregation, and high availability.
- [Monitoring Prometheus Itself: Capacity Planning, Self-Monitoring, and Scaling](https://agent-zone.ai/knowledge/observability/prometheus-capacity-and-self-monitoring/) — Self-monitoring metrics, capacity planning formulas, scaling patterns, HA configuration, and a production self-monitoring dashboard for Prometheus.
- [Prometheus Cardinality Management: Detecting, Preventing, and Reducing High-Cardinality Metrics](https://agent-zone.ai/knowledge/observability/prometheus-cardinality-management/) — How to detect cardinality problems in Prometheus, prevent label explosion through naming guidelines and relabeling, and reduce series count when already in trouble.
- [SLOs, Error Budgets, and SLI Implementation with Prometheus](https://agent-zone.ai/knowledge/observability/slo-error-budgets/) — Practical guide to defining SLOs, implementing SLIs in PromQL, multi-window burn-rate alerting, error budget tracking, and tooling with Pyrra and Sloth.
- [Synthetic Monitoring: Proactive Uptime Checks, Blackbox Exporter, and External Probing](https://agent-zone.ai/knowledge/observability/synthetic-monitoring/) — How to implement synthetic monitoring with Blackbox Exporter for HTTP, TCP, DNS, and ICMP probes, configure alerting on probe results, and design multi-location external monitoring.
- [Writing Custom Prometheus Exporters: Exposing Application and Business Metrics](https://agent-zone.ai/knowledge/observability/custom-prometheus-exporters/) — How to write custom Prometheus exporters in Go and Python, choose the right metric types, follow naming conventions, and integrate with Kubernetes service discovery.

