---
title: "Prometheus and Grafana Monitoring Stack"
description: "Setting up Prometheus scrape configs, PromQL queries, Grafana dashboards, alerting rules, and the kube-prometheus-stack for Kubernetes monitoring."
url: https://agent-zone.ai/knowledge/infrastructure/prometheus-and-grafana-setup/
section: knowledge
date: 2026-02-22
categories: ["infrastructure"]
tags: ["prometheus","grafana","monitoring","alerting","observability","kubernetes"]
skills: ["monitoring-setup","promql","alerting-configuration"]
tools: ["prometheus","grafana","helm","kubectl"]
levels: ["intermediate"]
word_count: 808
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/prometheus-and-grafana-setup/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/prometheus-and-grafana-setup/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Prometheus+and+Grafana+Monitoring+Stack
---


## Prometheus Architecture

Prometheus pulls metrics from targets at regular intervals (scraping). Each target exposes an HTTP endpoint (typically `/metrics`) that returns metrics in a text format. Prometheus stores the scraped data in a local time-series database and evaluates alerting rules against it. Grafana connects to Prometheus as a data source and renders dashboards.
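For reference, the text exposition format a target returns from `/metrics` looks like this (a hypothetical counter and gauge; the metric names are illustrative):

```text
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",status_code="200"} 1027
http_requests_total{method="POST",status_code="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.5e+07
```

Each line is a metric name, an optional label set, and a sample value; Prometheus attaches the scrape timestamp on ingestion.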

## Scrape Configuration

The core of Prometheus configuration is the scrape config. Each `scrape_config` block defines a set of targets and how to scrape them.

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "app"
    metrics_path: /metrics
    static_configs:
      - targets: ["app:8080"]
        labels:
          env: "production"

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]
```

For dynamic environments, use service discovery instead of static configs. In Kubernetes, Prometheus discovers pods and services via the API server:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+);(.+)
        replacement: $2:$1
```

This scrapes any pod with the annotation `prometheus.io/scrape: "true"`. The relabel configs extract the metrics path and port from pod annotations.
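On the workload side, the matching annotations go on the pod template. A hypothetical Deployment snippet (the annotation keys are the conventional `prometheus.io/*` ones that the relabel rules above read):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
      annotations:
        prometheus.io/scrape: "true"   # matched by the keep rule
        prometheus.io/path: "/metrics" # rewritten into __metrics_path__
        prometheus.io/port: "8080"     # combined with the pod IP into __address__
    spec:
      containers:
        - name: app
          image: app:latest
          ports:
            - containerPort: 8080
```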

## PromQL Essentials

PromQL is Prometheus's query language. An expression evaluates to an instant vector (one sample per matching time series), a range vector (a window of samples per series, consumed by functions like `rate()`), or a scalar.

```promql
# Current CPU usage rate per instance (last 5 minutes)
rate(node_cpu_seconds_total{mode!="idle"}[5m])

# Total HTTP requests per second by status code
sum by (status_code) (rate(http_requests_total[5m]))

# 95th percentile request latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk space remaining percentage
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Container CPU usage in a Kubernetes cluster
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Container memory working set
sum by (pod) (container_memory_working_set_bytes{container!=""})
```

Key functions: `rate()` computes the per-second average rate of increase of a counter over the range window, `increase()` gives the total increase over the window, `sum by ()` aggregates across the listed labels, and `histogram_quantile()` estimates percentiles from histogram buckets.
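The `rate()`/`increase()` distinction matters most for counters. A quick sketch, reusing the hypothetical `http_requests_total` counter from above:

```promql
# Per-second average over the window -- use for graphs, ratios, and alerts
rate(http_requests_total[5m])

# Total increase over the window -- use for "how many requests in the last hour?"
increase(http_requests_total[1h])
```

Both handle counter resets (e.g. process restarts); `increase()` is effectively `rate()` multiplied by the window length.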

## Alerting Rules

Prometheus evaluates alerting rules and fires an alert when a condition holds for the specified duration. Firing alerts are sent to Alertmanager, which handles routing, deduplication, and notification.

```yaml
# alert-rules.yml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage above 80% for 10 minutes. Current: {{ $value | printf \"%.1f\" }}%"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"

      - alert: HighErrorRate
        expr: sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
```

The `for` duration prevents alerting on brief spikes. A condition must hold continuously for the entire duration before the alert fires.
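On the Alertmanager side, a minimal configuration that routes on the `severity` label set by the rules above might look like this (receiver names, the Slack webhook URL, and the PagerDuty key are placeholders):

```yaml
# alertmanager.yml
route:
  receiver: default
  group_by: [alertname, job]
  group_wait: 30s        # wait to batch related alerts into one notification
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
receivers:
  - name: default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXXXX   # placeholder
        channel: "#alerts"
  - name: pagerduty
    pagerduty_configs:
      - routing_key: YOUR-INTEGRATION-KEY                 # placeholder
```

Warnings fall through to the default Slack receiver; criticals page via the child route.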

## Grafana Data Sources and Dashboards

Connect Grafana to Prometheus by adding it as a data source. This can be provisioned declaratively:

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

Dashboard provisioning loads JSON dashboards from files on startup:

```yaml
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ""
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```

Place dashboard JSON files in the configured path. Export dashboards from the Grafana UI (Share > Export > Save to file) and commit them to version control for reproducibility.
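Outside Kubernetes, wiring the provisioning files in is just a matter of mounting them at Grafana's provisioning path. A docker-compose sketch (the image tag and host paths are illustrative):

```yaml
services:
  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    volumes:
      # data source and dashboard provider configs
      - ./grafana/provisioning:/etc/grafana/provisioning
      # the dashboard JSON files referenced by the provider's `path`
      - ./grafana/dashboards:/var/lib/grafana/dashboards
```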

## USE and RED Methods

Structure your monitoring around established methodologies.

**USE method** (for infrastructure resources -- CPU, memory, disk, network):
- **Utilization**: What percentage of the resource is in use? (`node_cpu_seconds_total`, `node_memory_MemAvailable_bytes`)
- **Saturation**: Is work queuing? (`node_load1` vs CPU count, swap usage)
- **Errors**: Are there error conditions? (`node_disk_io_errors`, `node_network_receive_errs_total`)

**RED method** (for request-driven services -- APIs, web servers):
- **Rate**: Requests per second (`rate(http_requests_total[5m])`)
- **Errors**: Error rate or ratio (`rate(http_requests_total{status_code=~"5.."}[5m])`)
- **Duration**: Latency distribution (`histogram_quantile(0.95, ...)`)
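The RED queries can be precomputed as recording rules so dashboards stay fast. A sketch, assuming the `http_requests_total` / `http_request_duration_seconds` instrumentation used earlier:

```yaml
groups:
  - name: red
    rules:
      # Rate: requests per second per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Errors: 5xx ratio per job
      - record: job:http_errors:ratio5m
        expr: sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))
      # Duration: p95 latency per job
      - record: job:http_request_duration:p95_5m
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

The `level:metric:operations` naming convention makes it obvious what a recorded series aggregates over.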

## kube-prometheus-stack for Kubernetes

The kube-prometheus-stack Helm chart deploys Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics in one shot. It is the standard way to monitor a Kubernetes cluster.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=admin \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

This deploys Prometheus scraping the Kubernetes API, kubelet, node-exporter, and kube-state-metrics. Grafana comes with dashboards for cluster health, node resources, and pod workloads. Access Grafana with `kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80`.
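For anything beyond a quick install, the `--set` flags are better expressed as a values file committed to version control (same settings as the command above):

```yaml
# values.yaml -- apply with:
# helm upgrade --install monitoring prometheus-community/kube-prometheus-stack -f values.yaml
grafana:
  adminPassword: admin
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
```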

To add custom scrape targets, create a ServiceMonitor resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
  labels:
    release: monitoring    # must match the Helm release label selector
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```

The Prometheus Operator watches for ServiceMonitor resources and updates the scrape configuration automatically. The `release: monitoring` label is critical -- without it, the Operator ignores the ServiceMonitor.
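Note that a ServiceMonitor selects a Service, not pods directly. The Service it matches needs the `app: myapp` label and a port literally named `metrics` (a hypothetical example matching the ServiceMonitor above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: default
  labels:
    app: myapp         # matched by spec.selector.matchLabels
spec:
  selector:
    app: myapp
  ports:
    - name: metrics    # matched by endpoints[].port (by name, not number)
      port: 8080
      targetPort: 8080
```

A mismatched port name is the other common reason, besides the missing `release` label, that targets silently fail to appear.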

