Prometheus and Grafana on Minikube: Production-Like Monitoring Without the Cost

Why Monitor a POC Cluster#

Monitoring on minikube serves two purposes. First, it catches resource problems early – your app might work in tests but OOM-kill under load, and you will not know without metrics. Second, it validates that your monitoring configuration works before you deploy it to production. If your ServiceMonitors, dashboards, and alert rules work on minikube, they will work on EKS or GKE.

The Right Chart: kube-prometheus-stack#

There are multiple Prometheus-related Helm charts. The one to use is kube-prometheus-stack from the prometheus-community repository: it bundles the Prometheus Operator (with the ServiceMonitor and PrometheusRule CRDs), Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics in a single release. The standalone prometheus chart skips the Operator and its CRDs, and the old stable/prometheus-operator chart has been deprecated in favor of kube-prometheus-stack.

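A minimal install sketch, assuming the prometheus-community repository and a monitoring namespace (both are conventions, not requirements):

# Add the community chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# One release installs Prometheus Operator, Prometheus, Alertmanager,
# Grafana, node-exporter, and kube-state-metrics
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# Reach Grafana locally (the service name follows the release name)
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
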
Choosing a Monitoring Stack: Prometheus vs Datadog vs Cloud-Native vs VictoriaMetrics

Choosing a Monitoring Stack#

Monitoring is not optional. Without metrics, you are guessing. The question is not whether to monitor but which stack to use. The right choice depends on your cost tolerance, operational capacity, retention requirements, and how much you value control versus convenience.

Decision Criteria#

Before comparing tools, clarify what matters to your organization:

  • Cost model: Are you optimizing for infrastructure spend or engineering time? Self-managed tools cost less in licensing but more in operational hours. SaaS tools cost more in subscription fees but less in engineering effort.
  • Operational burden: Who manages the monitoring system? Do you have an infrastructure team, or are developers responsible for everything?
  • Data retention: Do you need metrics for 15 days, 90 days, or years? Long retention changes the equation significantly.
  • Query capability: Does your team know PromQL? Do they need ad-hoc analysis or mostly pre-built dashboards?
  • Alerting requirements: Simple threshold alerts, or complex multi-signal alerts with routing and escalation?
  • Team expertise: An organization fluent in Prometheus wastes that investment by switching to Datadog. An organization with no Prometheus experience faces a learning curve.

Options at a Glance#

| Capability | Prometheus + Grafana | Prometheus + Thanos/Mimir | VictoriaMetrics | Datadog | Cloud-Native | Grafana Cloud |
|---|---|---|---|---|---|---|
| Cost model | Infrastructure only | Infrastructure only | Infrastructure only | Per host ($15-23/mo) | Per metric/API call | Per series/GB |
| Operational burden | High | Very high | Medium | None | Low | Low |
| Query language | PromQL | PromQL | MetricsQL (PromQL-compatible) | Datadog query language | Vendor-specific | PromQL, LogQL |
| Default retention | 15 days (local disk) | Unlimited (object storage) | Unlimited (configurable) | 15 months | Varies (15 days - 15 months) | Plan-dependent |
| HA built-in | No (requires federation) | Yes | Yes (cluster mode) | Yes | Yes | Yes |
| Multi-cluster | Federation (limited) | Yes (global view) | Yes (cluster mode) | Yes | Per-account | Yes |
| APM/Tracing | No (separate tools) | No (separate tools) | No (separate tools) | Yes (integrated) | Varies | Yes (Tempo) |
| Vendor lock-in | None | None | Low | High | High | Low-Medium |

Prometheus + Grafana (Self-Managed)#

Prometheus is the de facto standard for Kubernetes metrics. It uses a pull-based model, scraping metrics from endpoints at configurable intervals, and stores time series data on local disk. Grafana provides visualization. Alertmanager handles alert routing.

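How the pieces connect is easiest to see in a minimal prometheus.yml sketch; the hostnames, ports, and paths below are placeholders:

global:
  scrape_interval: 30s        # how often targets are pulled
  evaluation_interval: 30s    # how often alerting rules are evaluated

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']

rule_files:
  - /etc/prometheus/rules/*.yaml   # alerting and recording rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
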
CockroachDB Day-2 Operations

Adding and Removing Nodes#

Adding a node: start a new cockroach process with --join pointing to existing nodes. CockroachDB automatically rebalances ranges to the new node.

cockroach start --insecure --store=node4-data \
  --advertise-addr=node4:26257 \
  --join=node1:26257,node2:26257,node3:26257

Watch rebalancing in the DB Console under Metrics > Replication, or query directly:

SELECT node_id, range_count, lease_count FROM crdb_internal.kv_store_status;

Decommissioning a node moves all range replicas off before shutdown, preventing under-replication:

cockroach node decommission 4 --insecure --host=node1:26257

# Monitor progress
cockroach node status --insecure --host=node1:26257 --decommission

Do not simply kill a node. Without decommissioning, CockroachDB treats it as a failure and waits 5 minutes before re-replicating. On Kubernetes with the operator, scale by changing spec.nodes in the CrdbCluster resource.

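For the operator path, a sketch of the scale change; the resource kind, field, and cluster name assume the public CockroachDB operator and a cluster named cockroachdb:

# Scale up: the operator adds a pod and CockroachDB rebalances onto it
kubectl patch crdbcluster cockroachdb --type merge -p '{"spec":{"nodes":4}}'

# Scale down: the operator is expected to decommission the node before removing the pod
kubectl patch crdbcluster cockroachdb --type merge -p '{"spec":{"nodes":3}}'
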
Grafana Dashboards for Kubernetes Monitoring

Data Source Configuration#

Grafana connects to backend data stores through data sources. For a complete Kubernetes observability stack, you need three: Prometheus for metrics, Loki for logs, and Tempo for traces.

Provision data sources declaratively so they survive Grafana restarts and are version-controlled:

# grafana/provisioning/datasources/observability.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3100
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{key: "service.name", value: "job"}]
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true

The cross-linking configuration lets you click from a metric data point to the trace that generated it, and extract trace IDs from log lines to link to Tempo.

Kubernetes Events Debugging: Patterns, Filtering, and Alerting

Kubernetes Events Debugging#

Kubernetes events are the cluster’s built-in audit trail for what is happening to resources. When a pod fails to schedule, a container crashes, a node runs out of disk, or a volume fails to mount, the system records an event. Events are the first place to look when something goes wrong, and learning to read them efficiently separates quick diagnosis from hours of guessing.

Event Structure#

Every Kubernetes event has the same core fields:

  • Type: Normal or Warning.
  • Reason: a short, machine-readable cause such as FailedScheduling, FailedMount, or BackOff.
  • Message: a human-readable description of what happened.
  • InvolvedObject: a reference to the resource the event is about (pod, node, PVC, and so on).
  • Source: the component that reported it, such as the kubelet or the scheduler.
  • Count with FirstTimestamp and LastTimestamp: how often and over what window the event has recurred.

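A quick way to pull those fields out side by side with plain kubectl (no extra tooling assumed):

# Warnings only, newest last, with the key fields as columns
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp \
  -o custom-columns='LAST:.lastTimestamp,TYPE:.type,REASON:.reason,KIND:.involvedObject.kind,NAME:.involvedObject.name,COUNT:.count,MESSAGE:.message'
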
Prometheus and Grafana Monitoring Stack

Prometheus Architecture#

Prometheus pulls metrics from targets at regular intervals (scraping). Each target exposes an HTTP endpoint (typically /metrics) that returns metrics in a text format. Prometheus stores the scraped data in a local time-series database and evaluates alerting rules against it. Grafana connects to Prometheus as a data source and renders dashboards.

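For illustration, this is roughly what a scraped /metrics endpoint returns; the metric name and labels here are made up:

# HELP http_requests_total Total number of HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
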
Scrape Configuration#

The core of Prometheus configuration is the scrape config. Each scrape_config block defines a set of targets and how to scrape them.

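A sketch of a Kubernetes-aware scrape config using pod discovery; the prometheus.io/* annotations are a widespread convention, not something Prometheus itself defines:

scrape_configs:
  - job_name: kubernetes-pods
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod                      # discover every pod via the Kubernetes API
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Let pods override the metrics path via prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Carry namespace and pod name through as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
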
PromQL Essentials: Practical Query Patterns

Instant Vectors vs Range Vectors#

An instant vector returns one sample per time series at a single point in time. A range vector returns multiple samples per time series over a time window.

# Instant vector: current value of each series
http_requests_total{job="api"}

# Range vector: last 5 minutes of samples for each series
http_requests_total{job="api"}[5m]

You cannot graph a range vector directly. Functions like rate() and increase() consume a range vector and return an instant vector, which Grafana can then plot.

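For example, reusing the metric from the snippet above:

# Per-second request rate over the last 5 minutes: range vector in, instant vector out
rate(http_requests_total{job="api"}[5m])

# Total increase over the last hour
increase(http_requests_total{job="api"}[1h])
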
Advanced PromQL: Performance, Cardinality, and Complex Query Patterns

Cardinality Explosion#

Cardinality is the number of unique time series Prometheus tracks. Every unique combination of metric name and label key-value pairs creates a separate series. A metric with 3 labels, each having 100 possible values, generates up to 1,000,000 series. In practice, cardinality explosions are the single most common way to kill a Prometheus instance.

The usual culprits are labels containing user IDs, request paths with embedded IDs (like /api/users/a3f7b2c1), session tokens, trace IDs, or any unbounded value set. A seemingly innocent label like path on an HTTP metric becomes catastrophic when your API has RESTful routes with UUIDs in the path.

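Two standard PromQL queries help spot the offenders on a running instance (no extra tooling assumed):

# Total number of active series in the head block
prometheus_tsdb_head_series

# The ten metric names with the most series – the usual suspects for label explosions
topk(10, count by (__name__)({__name__=~".+"}))
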
Kubernetes Resource Management: QoS Classes, Eviction, OOM Scoring, and Capacity Planning

Kubernetes Resource Management Deep Dive#

Resource management in Kubernetes is the mechanism that decides which pods get scheduled, which pods get killed when the node runs low, and how much CPU and memory each container is actually allowed to use. The surface-level concept of requests and limits is straightforward. The underlying mechanics – QoS classification, CFS CPU quotas, kernel OOM scoring, kubelet eviction thresholds – are where misconfigurations cause production outages.

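As a reference point for the surface-level part, a minimal pod sketch (name, image, and values are arbitrary): setting requests equal to limits for both CPU and memory yields the Guaranteed QoS class, requests below limits yields Burstable, and omitting both yields BestEffort.

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 500m        # used by the scheduler for placement
          memory: 256Mi
        limits:
          cpu: 500m        # enforced as a CFS CPU quota
          memory: 256Mi    # exceeding this gets the container OOM-killed
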
Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues

The USE Method: A Framework for Systematic Diagnosis#

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system – CPU, memory, disk, network – you check three things:

  • Utilization: How busy is the resource? (e.g., CPU at 90%)
  • Saturation: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
  • Errors: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found your bottleneck.
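
A sketch of one pass over the four resources with stock Linux tools (iostat and sar assume the sysstat package is installed; column names vary slightly by version):

# CPU: utilization (us/sy/id columns) and saturation (run queue in the 'r' column)
vmstat 1 5

# Memory: utilization via free, saturation via swap-in/swap-out (si/so in vmstat)
free -m

# Disk: utilization (%util), saturation (await and the queue-size column), errors in the kernel log
iostat -xz 1 5
dmesg | grep -i error

# Network: per-interface throughput plus drops and errors
sar -n DEV 1 5
ip -s link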