---
title: "Grafana Dashboards for Kubernetes Monitoring"
description: "Data source configuration, dashboard design patterns using USE and RED methods, variable templates, panel types, provisioning, and Grafana as Code."
url: https://agent-zone.ai/knowledge/observability/grafana-dashboards/
section: knowledge
date: 2026-02-22
categories: ["observability"]
tags: ["grafana","dashboards","prometheus","loki","tempo","kubernetes","monitoring","grafonnet"]
skills: ["grafana-dashboard-design","data-source-configuration","dashboard-provisioning","grafana-as-code"]
tools: ["grafana","prometheus","loki","tempo","grafonnet","terraform"]
levels: ["intermediate"]
word_count: 754
formats:
  json: https://agent-zone.ai/knowledge/observability/grafana-dashboards/index.json
  html: https://agent-zone.ai/knowledge/observability/grafana-dashboards/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Grafana+Dashboards+for+Kubernetes+Monitoring
---


## Data Source Configuration

Grafana connects to backend data stores through data sources. For a complete Kubernetes observability stack, you need three: Prometheus for metrics, Loki for logs, and Tempo for traces.

Provision data sources declaratively so they survive Grafana restarts and are version-controlled:

```yaml
# grafana/provisioning/datasources/observability.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus   # uid referenced by the Loki and Tempo cross-links below
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo        # matches the datasourceUid references above
    access: proxy
    url: http://tempo:3200   # Tempo's default HTTP port (3100 belongs to Loki)
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{key: "service.name", value: "job"}]
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
```

This cross-linking configuration lets you jump from an exemplar on a metrics panel to the trace that produced it, and extracts trace IDs from log lines so each Loki entry links directly to Tempo. Note the explicit `uid` fields: the `datasourceUid` references only resolve if the target data sources are provisioned with those exact UIDs.
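To see what the Loki `matcherRegex` actually captures, here is the extraction in plain Python (the log line and its field names are illustrative assumptions):

```python
import re

# The derivedFields matcherRegex from the Loki datasource config
matcher = re.compile(r'"traceID":"(\w+)"')

# An illustrative JSON-formatted log line
line = '{"level":"error","msg":"upstream timeout","traceID":"a1b2c3d4e5f6"}'

# The captured group is the value Grafana turns into a Tempo link
trace_id = matcher.search(line).group(1)
print(trace_id)  # a1b2c3d4e5f6
```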

## Dashboard Design: USE Method

The USE method monitors infrastructure resources. Build one row per resource type.

**CPU Row:**

```promql
# Utilization - time series panel
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation - time series panel
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

# Errors - stat panel (node_exporter exposes no direct CPU error counter;
# guest time is a weak stand-in that should be 0 on nodes not hosting VMs)
sum by (instance) (rate(node_cpu_guest_seconds_total[5m]))
```

**Memory Row:**

```promql
# Utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Saturation (swap usage indicates memory pressure)
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes

# Errors (OOM kills)
increase(node_vmstat_oom_kill[1h])
```

**Disk Row:**

```promql
# Utilization
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

# Saturation (IO wait)
rate(node_cpu_seconds_total{mode="iowait"}[5m])

# Errors (node_exporter has no disk I/O error counter; filesystem
# stat failures are the closest built-in signal)
node_filesystem_device_error{mountpoint="/"}
```
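Dashboards covering many nodes re-evaluate these expressions on every refresh. They can be precomputed as Prometheus recording rules; a sketch, where the rule names follow the `level:metric:operations` convention but are otherwise illustrative:

```yaml
# prometheus-rules.yml (illustrative rule names)
groups:
  - name: node-use-method
    rules:
      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:node_memory_utilization:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

Panels then query the short rule name instead of the full expression, which also keeps queries consistent between dashboards and alerts.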

## Dashboard Design: RED Method

The RED method monitors request-driven services. One dashboard per service.

```promql
# Rate - time series panel
sum by (handler) (rate(http_requests_total[5m]))

# Errors - time series panel (show as percentage)
sum by (handler) (rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum by (handler) (rate(http_requests_total[5m])) * 100

# Duration - time series panel with multiple percentile lines
# p50
histogram_quantile(0.50, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
# p95
histogram_quantile(0.95, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
# p99
histogram_quantile(0.99, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
```

Place all three queries on the same duration panel with distinct colors. Seeing p50, p95, and p99 together reveals tail latency issues that averages would hide.
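It helps to know what `histogram_quantile` is doing with those `_bucket` series: it finds the bucket containing the target rank and linearly interpolates within it. A minimal sketch, assuming cumulative `le`-style buckets (the sample counts are made up):

```python
def histogram_quantile(q, buckets):
    # buckets: sorted (upper_bound, cumulative_count) pairs, last bound +inf,
    # mirroring Prometheus le-labelled histogram buckets
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # like PromQL, cap at the highest finite bucket boundary
                return prev_bound
            # linear interpolation inside the matching bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 100ms, 90 under 500ms, all under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The interpolation is why bucket boundaries matter: a p99 that lands in a wide bucket is only as precise as that bucket's edges.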

## Variable Templates

Dashboard variables make dashboards reusable across namespaces, clusters, and workloads. Define them in the dashboard settings.

**Namespace selector** -- variable type Query, data source Prometheus:

```
label_values(kube_pod_info, namespace)
```

Enable multi-value and "Include All" to allow selecting multiple namespaces or all at once.
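In the dashboard JSON model, that selector looks roughly like the fragment below (the datasource `uid` is an assumption; `refresh: 2` re-queries values on every time-range change so new namespaces appear without a reload):

```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "label": "Namespace",
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "query": "label_values(kube_pod_info, namespace)",
        "multi": true,
        "includeAll": true,
        "refresh": 2
      }
    ]
  }
}
```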

**Pod selector** (chained to namespace):

```
label_values(kube_pod_info{namespace=~"$namespace"}, pod)
```

**Node selector:**

```
label_values(kube_node_info, node)
```

Use variables in panel queries with `$variable` syntax:

```promql
sum by (pod) (rate(container_cpu_usage_seconds_total{
  namespace=~"$namespace",
  pod=~"$pod",
  container!=""
}[5m]))
```

The `=~` operator with `$namespace` handles both single selection and the "All" option (which produces a regex like `ns1|ns2|ns3`).
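The anchoring behavior is worth spelling out: PromQL's `=~` matches the whole label value, equivalent to wrapping the pattern in `^(?:...)$`. A quick Python sketch with illustrative namespace names:

```python
import re

# What Grafana substitutes for $namespace when "All" is selected
namespace_value = "payments|checkout|inventory"

# PromQL's =~ is fully anchored, equivalent to ^(?:...)$
pattern = re.compile(rf"^(?:{namespace_value})$")

print(bool(pattern.match("checkout")))     # True
print(bool(pattern.match("checkout-v2")))  # False -- anchoring prevents partial matches
```

This is why `namespace=~"$namespace"` is safe even with a single selection: the regex only ever matches exact label values.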

## Panel Types

Choose the right panel type for the data:

- **Time series**: Primary panel for anything over time -- CPU, memory, request rate, latency.
- **Stat**: Single-value with thresholds -- error count, uptime, active replicas.
- **Gauge**: Value within a known range -- disk/memory/CPU percentage.
- **Table**: Multi-column data -- top pods by CPU, certificate expiration dates.
- **Logs**: Loki log streams. Pair with metrics panels above to correlate spikes.
- **Bar gauge**: Horizontal bars for ranked comparisons -- top 10 pods by memory.
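As a concrete example, a stat panel with thresholds is a small JSON fragment in the dashboard model (the query and threshold values here are illustrative):

```json
{
  "type": "stat",
  "title": "OOM Kills (1h)",
  "datasource": {"type": "prometheus", "uid": "prometheus"},
  "targets": [
    {"expr": "sum(increase(node_vmstat_oom_kill[1h]))", "refId": "A"}
  ],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "red", "value": 1}
        ]
      }
    }
  }
}
```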

## Dashboard Provisioning

In Kubernetes, provision dashboards via ConfigMaps. Grafana's sidecar container watches for ConfigMaps with a specific label and loads their contents as dashboards.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label the sidecar watches for
data:
  app-dashboard.json: |
    {
      "dashboard": {
        "title": "Application Overview",
        "panels": [ ... ],
        "templating": { ... }
      }
    }
```

With kube-prometheus-stack, the sidecar label is configured via `grafana.sidecar.dashboards.label` in Helm values (default: `grafana_dashboard`).
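A minimal sidecar configuration in Helm values might look like this sketch (field names follow the upstream grafana chart; `searchNamespace: ALL` lets application teams ship dashboards from their own namespaces):

```yaml
# kube-prometheus-stack values.yaml (sketch)
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      labelValue: "1"
      searchNamespace: ALL
      folderAnnotation: grafana_folder   # optional: folder placement via annotation
```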

For file-based provisioning outside Kubernetes:

```yaml
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Infrastructure"
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```

## Grafana as Code

**Grafonnet** is a Jsonnet library for generating dashboards programmatically:

```jsonnet
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

grafana.dashboard.new('Service Overview')
+ grafana.dashboard.withPanels([
  grafana.panel.timeSeries.new('Request Rate')
  + grafana.panel.timeSeries.queryOptions.withTargets([
    grafana.query.prometheus.new('Prometheus',
      'sum by (handler) (rate(http_requests_total{namespace="$namespace"}[5m]))')
  ])
  + grafana.panel.timeSeries.standardOptions.withUnit('reqps'),
])
```

Build with `jsonnet -J vendor service-dashboard.jsonnet > service-dashboard.json`.

**Terraform provider** manages Grafana resources as infrastructure:

```hcl
resource "grafana_dashboard" "app" {
  config_json = file("dashboards/app.json")
  folder      = grafana_folder.monitoring.id
}
```
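The `grafana_folder.monitoring` reference above needs provider and folder resources to exist; a minimal sketch, where the URL and variable name are placeholders:

```hcl
terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

provider "grafana" {
  url  = "https://grafana.example.com"      # placeholder
  auth = var.grafana_service_account_token  # service account token
}

resource "grafana_folder" "monitoring" {
  title = "Monitoring"
}
```

Keeping dashboards in Terraform means drift detection catches manual edits made in the UI.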

## Community Dashboards

Import proven dashboards by ID rather than building from scratch: Node Exporter Full (1860), Kubernetes Cluster (7249), Kubernetes Pods (6879), CoreDNS (5926), NGINX Ingress (9614). Import via `grafana.com/grafana/dashboards/{ID}` and customize thresholds to match your environment.

## Grafana Alerting vs Alertmanager

Grafana 9+ has a built-in alerting engine that evaluates rules against any data source. Use Grafana alerting for Loki log queries or multi-datasource conditions. Use Alertmanager for purely Prometheus-based alerts with gossip HA deduplication. Running both is common: Prometheus rules go through Alertmanager, Loki-based alerts go through Grafana alerting. Avoid duplicating the same alert in both.

