Prometheus and Grafana Monitoring Stack

Prometheus Architecture#

Prometheus pulls metrics from targets at regular intervals (scraping). Each target exposes an HTTP endpoint (typically /metrics) that returns metrics in a text format. Prometheus stores the scraped data in a local time-series database and evaluates alerting rules against it. Grafana connects to Prometheus as a data source and renders dashboards.

Scrape Configuration#

The core of Prometheus configuration is the scrape config. Each scrape_config block defines a set of targets and how to scrape them.

Setting Up Full Observability from Scratch: Metrics, Logs, Traces, and Alerting

Setting Up Full Observability from Scratch#

This operational sequence deploys a complete observability stack on Kubernetes: metrics (Prometheus + Grafana), logs (Loki + Promtail), traces (Tempo + OpenTelemetry), and alerting (Alertmanager). Each phase is self-contained with verification steps. Complete them in order – later phases depend on earlier infrastructure.

Prerequisite: a running Kubernetes cluster with Helm installed and a monitoring namespace created.

kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

Phase 1 – Metrics (Prometheus + Grafana)#

Metrics are the foundation. Logging and tracing integrations all route through Grafana, so this phase must be solid before continuing.

SLOs, Error Budgets, and SLI Implementation with Prometheus

SLI, SLO, and SLA – What They Actually Mean#

An SLI (Service Level Indicator) is a quantitative measurement of service quality – a number computed from your metrics. Examples: the proportion of successful HTTP requests, the proportion of requests faster than 500ms, the proportion of jobs completing within their deadline.

An SLO (Service Level Objective) is a target value for an SLI. It is an internal engineering commitment: “99.9% of requests will succeed over a 30-day rolling window.”

Synthetic Monitoring: Proactive Uptime Checks, Blackbox Exporter, and External Probing

What Synthetic Monitoring Is#

Synthetic monitoring means actively probing your services on a schedule rather than waiting for users to report problems. Instead of relying on internal health checks or real user traffic to detect issues, you send controlled requests and measure the results. The fundamental question it answers is: “Is my service reachable and responding correctly right now?”

This is distinct from real user monitoring (RUM), which observes actual user interactions. Synthetic probes run 24/7 regardless of traffic volume, so they catch outages at 3 AM when no users are active. They provide consistent, repeatable measurements that are easy to alert on. The tradeoff is that synthetic probes test a narrow, predefined path – they do not capture the full range of user experience.