---
title: "Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)"
description: "Practical guide to serving LLMs on a GB10 Grace-Blackwell box: why it's bandwidth-bound (low-active MoE only), GGUF vs MLX, LM Studio + the lms CLI, unified-memory sizing, the one-model-at-a-time OOM guard, and DCGM monitoring limits."
url: https://agent-zone.ai/knowledge/infrastructure/running-llms-on-nvidia-gb10-dgx-spark/
section: knowledge
date: 2026-05-25
categories: ["infrastructure"]
tags: ["gb10","dgx-spark","asus-ascent-gx10","local-llm","lm-studio","llama-cpp","gguf","unified-memory","moe","grace-blackwell","dcgm"]
skills: ["local-llm-deployment","gpu-memory-sizing","model-runtime-selection","moe-model-selection"]
tools: ["lm-studio","lms","llama.cpp","dcgm-exporter","ssh"]
levels: ["intermediate","advanced"]
word_count: 1267
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/running-llms-on-nvidia-gb10-dgx-spark/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/running-llms-on-nvidia-gb10-dgx-spark/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Running+Local+LLMs+on+the+NVIDIA+GB10+%28DGX+Spark+%2F+ASUS+Ascent+GX10%29
---


> **Decision-first:** On a GB10, pick **low-active MoE** models (A3B-class), serve **GGUF** (not MLX) via LM Studio, run **one model at a time** behind an OOM guard, and monitor GPU via DCGM but read the **model footprint from system RAM** (no framebuffer metrics). Dense 70B is unusable (~2-3 tok/s).

> **Scope & freshness:** GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).

## What a GB10 box actually is

The NVIDIA **GB10** (Grace-Blackwell) ships in several forms — the NVIDIA **DGX Spark** and OEM variants like the **ASUS Ascent GX10**. They share the defining trait: a single Grace-Blackwell GPU with **128 GB of unified LPDDR5x memory** shared between CPU and GPU. It runs ARM64 Linux (e.g., Ubuntu).

The number that governs everything you do with it is **memory bandwidth**, in the ~270 GB/s class. That is an order of magnitude below a discrete data-center GPU (an H100 is ~3 TB/s). The GB10 has plenty of *capacity* (128 GB) but is severely *bandwidth-limited*.

## The binding constraint: bandwidth, not compute or capacity

LLM decode (token generation) reads the model's active weights from memory **once per token**. Throughput is therefore bounded by `memory_bandwidth / active_parameter_bytes`, not by FLOPs. On a ~270 GB/s box this has hard consequences:

- **Dense 70B models are unusable for interactive work** — decode lands around **~2-3 tokens/sec**. The weights don't fit the bandwidth budget per token.
- **Low-active-parameter MoE models are the sweet spot.** A 35B model with only **3B active** (35B-A3B) reads ~3B params/token, so it decodes many times faster than a 35B-dense model despite occupying similar memory. MoE turns the GB10's capacity advantage into usable speed.

**Selection rule:** on a GB10, choose models by **active** parameter count, not total. Favor MoE with small active experts (A3B-class). Treat total parameters as a memory-capacity question and active parameters as a speed question — they are separate axes.

## Model format: GGUF, not MLX

A common mix-up: **MLX is Apple-Silicon only.** It will not run on a GB10 (that's an ARM+CUDA Linux box, not an Apple GPU). Use **GGUF** (the llama.cpp format), which is cross-platform and runs on the GB10's CUDA backend.

This is the inverse of a Mac, where MLX is often the *faster* native option. If you're moving a workload between a Mac and a GB10, the GB10 forces GGUF; only the Mac side has the MLX choice.

## Runtime: LM Studio + the `lms` CLI

LM Studio (llama.cpp backend) is a practical server for a GB10. It exposes:

- An **OpenAI-compatible** API at `http://<host>:8888/v1` (`/v1/chat/completions`, `/v1/models`).
- A **management** API at `http://<host>:8888/api/v1/models` (richer: load state, context length, loaded instances).

The `lms` CLI manages models. Note that a **non-interactive SSH session won't have `lms` on PATH** — prepend it:

```bash
LMS='PATH="$HOME/.lmstudio/bin:$HOME/bin:$PATH"'
ssh user@gb10 "$LMS lms ls"                       # list local models + LOADED state
ssh user@gb10 "$LMS lms server start"             # start the HTTP server (needed after reboot)
ssh user@gb10 "$LMS lms-model load 'org/model' --context-kb 128 --parallel 1 --gpu max"
ssh user@gb10 "$LMS lms-model unload --all"
```

After a reboot the LM Studio backend may need a restart (`systemctl restart lmstudio` then `lms server start`); a "passkey" error on `lms` usually means the backend is still mid-startup.

## Memory sizing and the one-model-at-a-time rule

**Resident size is larger than the GGUF file** — the KV cache, context allocation, and runtime overhead add up. A model whose file is ~21 GB can occupy **~24 GB+ resident** at a 128k context, and a 73 GB model can push **~96 GB** of system memory at 128k. Budget for the file size **plus** KV cache (which scales with context length) **plus** OS/runtime overhead.

The practical ceiling on a 128 GB GB10 is **one large model at a time**. Two large models exceed 128 GB and OOM the box (a real incident: a 73 GB model + a 48 GB model = ~121 GB of weights alone, before KV/overhead, which hung the machine). LM Studio's JIT auto-load can silently reload a second model on request, so guard against it explicitly.

### OOM guard: verify clean before every load

Before loading a target model, **unload everything and confirm zero models are resident** — don't assume the previous unload landed (especially over a flaky SSH link). Pattern:

```bash
ssh user@gb10 "$LMS lms-model unload --all"
# Retry + verify: refuse to load if anything is still resident
for attempt in 1 2 3; do
  loaded=$(ssh user@gb10 "$LMS lms ls 2>/dev/null | grep -i LOADED | awk '{print \$1}'" | xargs)
  [ -z "$loaded" ] && break
  ssh user@gb10 "$LMS lms-model unload --all"; sleep 3
done
[ -n "$loaded" ] && { echo "ABORT: still loaded [$loaded] — refusing to load (OOM risk)"; exit 1; }
ssh user@gb10 "$LMS lms-model load '$MODEL' --context-kb 128 --gpu max"
```

If you run benchmarks or batch jobs, run them **serially** (one model loaded at a time) and **don't download a model while inference is running** — both saturate the same memory bandwidth and tank each other.

## Connecting from a Mac: the Local Network Privacy trap

If you drive the GB10 from a Mac and use a **locally-built binary** (e.g., a Go client), macOS **Local Network Privacy** can block it from reaching the GB10's LAN IPs with `no route to host` — while `curl` works fine (it's exempt). This looks like a flaky link or cable but isn't.

**Fix:** SSH-tunnel the LM Studio port to loopback and point the client at `127.0.0.1` (loopback is LNP-exempt):

```bash
ssh -N -f -L 18888:localhost:8888 user@gb10
# client now targets http://127.0.0.1:18888/v1 instead of http://gb10:8888/v1
```

## Monitoring: DCGM works, but with GB10-specific gaps

The NVIDIA **DCGM exporter** runs on the GB10 and gives you the core GPU telemetry — `DCGM_FI_DEV_GPU_UTIL`, `GPU_TEMP`, `MEMORY_TEMP`, `POWER_USAGE`, `SM_CLOCK`, `MEM_COPY_UTIL`. Two gaps to expect on current GB10 + DCGM + driver combos:

- **No profiling metrics.** The `DCGM_FI_PROF_*` family (including `DRAM_ACTIVE`, the true memory-bandwidth-utilization metric you'd most want on a bandwidth-bound box) **fails to load** — the profiling module errors with `NVPW_DCGM_LoadDriver returned 1` because NVIDIA PerfWorks won't initialize on this device. Re-check after a DCGM/driver bump; the counters CSV is harmless to leave configured. Until then, `MEM_COPY_UTIL` is your (coarser) bandwidth proxy.
- **No framebuffer metrics.** `DCGM_FI_DEV_FB_USED` / `FB_FREE` return no data — there is no discrete VRAM on a unified-memory box. **The loaded-model footprint shows up in *system RAM* instead** (scrape the host's `node_exporter` `node_memory_*`), not in any GPU framebuffer metric. A dashboard panel pairing `DCGM_FI_DEV_GPU_UTIL` with host `node_memory` used is the practical "is a model loaded and how big" view.

## What didn't work (so you don't repeat it)

- **Enabling DCGM profiling metrics** (`DCGM_FI_PROF_*`, incl. `DRAM_ACTIVE`) — PerfWorks won't initialize on GB10 (`NVPW_DCGM_LoadDriver returned 1`). It's a hardware/driver gap, not a config error; two sessions chased it before confirming. Re-check at a newer DCGM/driver.
- **Loading two models to compare them quickly** — a 73 GB + 48 GB model = ~121 GB before KV/overhead → hung the box (required a hard reboot). Always unload-and-verify first.
- **Diagnosing a Mac→GB10 LAN connection failure as a cable/link issue** — it was macOS Local Network Privacy blocking a locally-built binary; `curl` worked, the binary didn't. Tunnel to loopback.

## Checklist

- Pick models by **active** params (A3B-class MoE), not total — dense 70B is ~2-3 tok/s.
- Use **GGUF** (MLX is Apple-only and won't run here).
- Serve via LM Studio; manage with `lms` (prepend PATH over SSH; `lms server start` after reboot).
- **One large model at a time**; resident size > file size; unload-and-verify before each load.
- From a Mac client, SSH-tunnel to loopback to dodge Local Network Privacy.
- Monitor with DCGM for util/temp/power; use host `node_memory` for the model-cache view (no FB metrics, no profiling metrics on current GB10).

