---
title: "GPU and Host Monitoring Across Mac and Linux/GB10 in One Prometheus"
description: "Scrape heterogeneous LLM hosts (macOS + Linux/NVIDIA) into one kube-prometheus stack: the macOS-vs-Linux node_exporter metric-name divergence, why the stock node dashboard hides Darwin, DCGM's GB10 gaps, scraping external hosts via ScrapeConfig, and a combined OS-normalized dashboard."
url: https://agent-zone.ai/knowledge/observability/gpu-and-host-monitoring-mac-and-linux/
section: knowledge
date: 2026-05-25
categories: ["observability"]
tags: ["prometheus","grafana","node-exporter","dcgm","gpu-monitoring","macos","darwin","scrapeconfig","kube-prometheus","local-llm"]
skills: ["heterogeneous-host-monitoring","scrapeconfig-authoring","cross-os-promql","gpu-telemetry"]
tools: ["prometheus","grafana","node-exporter","dcgm-exporter","kube-prometheus-stack"]
levels: ["intermediate","advanced"]
word_count: 747
formats:
  json: https://agent-zone.ai/knowledge/observability/gpu-and-host-monitoring-mac-and-linux/index.json
  html: https://agent-zone.ai/knowledge/observability/gpu-and-host-monitoring-mac-and-linux/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=GPU+and+Host+Monitoring+Across+Mac+and+Linux%2FGB10+in+One+Prometheus
---


> **Decision-first:** macOS and Linux node_exporter expose **different metric names** — write per-OS memory/disk expressions. The stock node dashboard **hides Darwin on purpose**. Scrape external hosts via `ScrapeConfig` + relabel `job`/`instance`. On a GB10, there are **no GPU framebuffer or profiling metrics** — read model footprint from system RAM.

> **Scope & freshness:** kube-prometheus-stack + node_exporter + DCGM, macOS + Linux/GB10, as of 2026-05-25. Re-check the GB10 DCGM gaps after a DCGM/driver bump.

## The problem

You want one Grafana to watch a fleet of LLM hosts that aren't homogeneous — say a **macOS** box and a **Linux/NVIDIA GB10** box — from a `kube-prometheus-stack` running in a cluster. Three things bite you: metric-name divergence, a dashboard that silently hides Macs, and GPU-metric gaps on the GB10.

## macOS and Linux node_exporter expose *different* metric names

`node_exporter` on macOS does **not** emit the Linux memory metrics. This breaks naive cross-host queries:

| Concept | Linux | macOS |
|---|---|---|
| total memory | `node_memory_MemTotal_bytes` | `node_memory_total_bytes` |
| available | `node_memory_MemAvailable_bytes` | *(absent)* — derive from `free+inactive+purgeable` |
| used breakdown | `MemFree/Buffers/Cached` | `active/wired/compressed/inactive/free/purgeable` |
| disk write bytes | `node_disk_written_bytes_total` | `node_disk_written_bytes_total` ✓ |

`node_cpu_seconds_total{mode="idle"}` exists on both (so `1 - idle` works cross-OS), but memory and disk need per-OS expressions. A combined "memory used %" panel needs two targets:

```promql
# Linux node
100 * (1 - node_memory_MemAvailable_bytes{instance="linux-host"} / node_memory_MemTotal_bytes{instance="linux-host"})
# macOS node (no MemAvailable; "used" ≈ active+wired+compressed)
100 * (node_memory_active_bytes{instance="mac"} + node_memory_wired_bytes{instance="mac"} + node_memory_compressed_bytes{instance="mac"}) / node_memory_total_bytes{instance="mac"}
```

## The stock node dashboard hides macOS on purpose

The kube-prometheus-stack **"Node Exporter / Nodes"** dashboard's `instance` variable query filters `sysname!="Darwin"`:

```
label_values(node_uname_info{job="node-exporter", sysname!="Darwin"}, instance)
```

So a Mac never appears in the dropdown — by design, because the dashboard's panels query Linux-only metric names that would render empty for Darwin. Don't chase this as a scrape bug. For Macs, build a **macOS-native dashboard** using the macOS metric names above.

## Scraping external (non-cluster) hosts

Run `node_exporter` (and `dcgm-exporter` for NVIDIA) on each external host, then add a `ScrapeConfig` (prometheus-operator CRD). Relabel to a friendly `job`/`instance` so external hosts land on the standard dashboards alongside in-cluster nodes:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: gpu-host-node-exporter
  labels: { release: dt-monitoring }   # must match the Prometheus selector
spec:
  staticConfigs:
    - targets: ["10.0.0.97:9100"]
  relabelings:
    - { targetLabel: job, replacement: node-exporter }   # share the standard job
    - { targetLabel: instance, replacement: gpu-host }    # friendly name in dropdowns
```

Reaching hosts from a minikube-based Prometheus: a LAN host is reachable by its **Wi-Fi/LAN IP**, not a point-to-point direct-link IP (which pods can't route). For the *host running minikube itself*, use `host.minikube.internal`.

## DCGM on a GB10: works, with two gaps

`dcgm-exporter` gives core NVIDIA telemetry — `GPU_UTIL`, `GPU_TEMP`, `MEMORY_TEMP`, `POWER_USAGE`, `SM_CLOCK`, `MEM_COPY_UTIL`. On a GB10 (Grace-Blackwell, unified memory):

- **No profiling metrics.** `DCGM_FI_PROF_*` (incl. `DRAM_ACTIVE` = true memory-bandwidth utilization) fail to load — PerfWorks returns `NVPW_DCGM_LoadDriver returned 1`. Use `MEM_COPY_UTIL` as a coarser bandwidth proxy until a newer DCGM/driver supports it.
- **No framebuffer metrics.** `DCGM_FI_DEV_FB_USED`/`FB_FREE` return no data — there's no discrete VRAM. **The loaded-model footprint lives in system RAM** (`node_memory_*`), so pair a GPU-util panel with host memory-used for the "is a model loaded and how big" view.

Relabel the DCGM job to match whatever GPU dashboard you import (e.g. `job=dcgm-exporter` for grafana.com dashboard 12239) — note 12239's framebuffer panels read no-data on a GB10, expectedly.

## Combined cross-node dashboard

Build one dashboard with per-concept panels that overlay both hosts using OS-normalized expressions:

- **CPU**: `100 * (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])))` — works on both.
- **Memory used %**: two targets (Linux MemAvailable formula; macOS active+wired+compressed formula).
- **Network**: per-interface rates, with friendly legends (e.g., wifi vs direct-link device names differ per host).

## What didn't work (so you don't repeat it)

- **Debugging "macOS host missing from the node dashboard" as a scrape/relabel bug** — it's the dashboard's `sysname!="Darwin"` filter *by design*; the scrape was fine. Build a macOS-native dashboard instead.
- **Expecting GPU-memory panels to populate on a GB10** — `FB_USED`/`FB_FREE` return no data (unified memory). Use `node_memory_*` for the model-cache view.
- **Reaching a LAN host by its point-to-point direct-link IP from in-cluster pods** — not routable; use the host's Wi-Fi/LAN IP (or `host.minikube.internal` for the host running minikube).

> **Verify:** after adding a `ScrapeConfig`, query `up{instance="<your-label>"}` in Prometheus — expect `1`. If the external metric matchers return empty but `up` is 1, you have a metric-name (OS) mismatch, not a scrape failure.

## Checklist

- Expect macOS↔Linux metric-name divergence; write per-OS memory/disk expressions.
- The stock node dashboard excludes Darwin — build a macOS-native one.
- Scrape external hosts via `ScrapeConfig` + relabel `job`/`instance`; reach them by LAN IP.
- On a GB10: no `DCGM_FI_PROF_*`, no `FB_USED` — use `node_memory` for model footprint, `MEM_COPY_UTIL` for bandwidth.

