---
title: "Ollama Setup and Model Management: Installation, Model Selection, Memory Management, and ARM64 Native"
description: "Installing and configuring Ollama for local LLM inference — pulling models, managing GPU memory, running multiple models, understanding quantization levels, and optimizing for Apple Silicon and ARM64."
url: https://agent-zone.ai/knowledge/agent-tooling/ollama-setup-and-model-management/
section: knowledge
date: 2026-02-22
categories: ["agent-tooling"]
tags: ["ollama","local-llm","model-management","apple-silicon","arm64","gpu-memory","quantization"]
skills: ["ollama-setup","model-management","local-inference-configuration"]
tools: ["ollama","docker"]
levels: ["intermediate"]
word_count: 1200
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/ollama-setup-and-model-management/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/ollama-setup-and-model-management/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Ollama+Setup+and+Model+Management%3A+Installation%2C+Model+Selection%2C+Memory+Management%2C+and+ARM64+Native
---


# Ollama Setup and Model Management

Ollama turns running local LLMs into a single command. It handles model downloads, quantization, GPU memory allocation, and exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.

## Installation

```bash
# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
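
The container above is CPU-only. On a Linux host with an NVIDIA GPU, the official image can use the card through the NVIDIA Container Toolkit; on macOS, containers cannot access Metal, so prefer the native install. A sketch of the GPU-enabled variant, assuming the toolkit is already installed:

```bash
# NVIDIA GPU passthrough (requires the NVIDIA Container Toolkit on the host)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```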

Start the Ollama server:

```bash
# macOS: Ollama runs as a menu bar app, or start the server manually:
ollama serve

# Linux: systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
```
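
By default the server listens on localhost only. If other machines or containers need to reach the API, the listen address is controlled by the OLLAMA_HOST environment variable; for the systemd service, a drop-in override is the usual route (a sketch, adjust to your own security requirements):

```bash
# Expose the API beyond 127.0.0.1 (think about firewalling before doing this)
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
```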

Verify it is running:

```bash
curl http://localhost:11434/api/tags
```

## Pulling and Running Models

```bash
# Pull a model (downloads once, reuses after)
ollama pull qwen2.5-coder:7b

# Run interactively
ollama run qwen2.5-coder:7b

# List downloaded models
ollama list

# Show model details (parameters, quantization, size)
ollama show qwen2.5-coder:7b
```

## Model Naming Convention

Ollama model names follow the pattern `name:tag` where the tag indicates size and quantization:

```
qwen2.5-coder:7b        # 7B parameters, default quantization (Q4_K_M)
qwen2.5-coder:7b-q8_0   # 7B parameters, Q8_0 quantization (higher quality, more memory)
qwen2.5-coder:32b       # 32B parameters
llama3.3:70b            # 70B parameters
phi3:mini               # 3.8B parameters (alias)
```

## Quantization and Quality Tradeoffs

Quantization reduces model precision to use less memory. Ollama models default to Q4_K_M, which is a good balance:

| Quantization | Bits per Weight | Memory (7B) | Memory (32B) | Quality Impact |
|---|---|---|---|---|
| Q4_K_M | ~4.5 | ~5 GB | ~22 GB | Slight degradation, good for most tasks |
| Q5_K_M | ~5.5 | ~6 GB | ~26 GB | Minimal degradation |
| Q6_K | ~6.5 | ~7 GB | ~30 GB | Near-original quality |
| Q8_0 | 8 | ~8 GB | ~36 GB | Essentially lossless |
| FP16 | 16 | ~14 GB | ~64 GB | Original precision |

For code tasks, Q4_K_M is sufficient for extraction and classification. For complex reasoning where every token matters, Q5_K_M or Q6_K can measurably improve output quality.
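
The quantization level is selected at pull time through the tag, so comparing levels is just another pull (which tags exist varies by model in the Ollama library):

```bash
# Pull a higher-precision build alongside the default Q4_K_M one
ollama pull qwen2.5-coder:7b-q8_0

# Confirm parameter count, quantization, and context length before standardizing on it
ollama show qwen2.5-coder:7b-q8_0
```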

## Memory Management

This is where most people hit problems. Understanding how Ollama manages GPU memory prevents the most common issues.

### How Ollama Loads Models

When you run or call a model, Ollama loads it into GPU memory (unified memory on Apple Silicon, VRAM on discrete GPUs). The model stays loaded after the request completes for fast subsequent requests.

```bash
# See what models are currently loaded
ollama ps

# Output:
# NAME                     SIZE      PROCESSOR    UNTIL
# qwen2.5-coder:32b        22 GB     100% GPU     4 minutes from now
```

Models are evicted after an idle timeout (default 5 minutes). You can explicitly stop a model:

```bash
ollama stop qwen2.5-coder:32b
```
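
The five-minute window itself is configurable, either server-wide with the OLLAMA_KEEP_ALIVE environment variable or per request with the keep_alive field. Exact defaults depend on the Ollama version, so treat this as a sketch:

```bash
# Keep idle models resident for 30 minutes instead of 5
OLLAMA_KEEP_ALIVE=30m ollama serve

# Or per request: keep this model loaded for an hour after the call completes
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "say hello",
  "stream": false,
  "keep_alive": "1h"
}'
# keep_alive of 0 unloads immediately; a negative value keeps the model loaded indefinitely
```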

### Memory Budget Planning

Plan your model loading around your available memory:

| Hardware | Total Memory | Usable for Models | Recommended Setup |
|---|---|---|---|
| Mac Mini M4 Pro (48GB) | 48 GB | ~38 GB (reserve ~10 GB for the OS) | 32B primary + 7B worker loaded simultaneously |
| Mac Mini M4 Pro (64GB) | 64 GB | ~52 GB | 70B loaded, or 32B + 7B + embeddings |
| Linux with 24GB VRAM | 24 GB | ~22 GB | 32B quantized, or two 7B models |
| Linux with 48GB VRAM | 48 GB | ~44 GB | 70B quantized |
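
Before loading a large model, check what is actually free rather than trusting the table. A quick sketch; it assumes `nvidia-smi` on a Linux box with an NVIDIA card, and on macOS the relevant number is total system RAM:

```bash
# Linux with an NVIDIA GPU: VRAM free right now
nvidia-smi --query-gpu=memory.free --format=csv

# macOS: total unified memory in bytes (subtract ~10 GB for the OS and apps)
sysctl -n hw.memsize

# What Ollama already has loaded
ollama ps
```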

### Running Multiple Models

Ollama can hold multiple models in memory simultaneously if they fit:

```bash
# Load a small model for fast extraction
ollama run qwen2.5-coder:7b "extract the function names from this code"

# While 7B is still loaded, call the 32B for correlation
ollama run qwen2.5-coder:32b "analyze these summaries for architectural issues"

# Both stay in memory if RAM permits
ollama ps
# qwen2.5-coder:7b     5 GB    100% GPU    4 minutes
# qwen2.5-coder:32b    22 GB   100% GPU    4 minutes
```

When memory is tight and you need a larger model:

```bash
# Explicitly stop the 32B to free memory for the 70B
ollama stop qwen2.5-coder:32b
ollama run llama3.3:70b "deep analysis of this architecture"
```
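
How many models stay resident and how many requests run in parallel are controlled by server environment variables. The names below come from recent Ollama releases and the defaults have shifted between versions, so verify them against your install:

```bash
# Allow up to two models in memory at once, two parallel requests per model
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=2 ollama serve

# Requests beyond these limits queue rather than fail; queue depth is OLLAMA_MAX_QUEUE
```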

### Apple Silicon Unified Memory

On Apple Silicon Macs (M1/M2/M3/M4), CPU and GPU share the same memory pool. This means:

- Models run on the GPU (Metal) natively with no data copying.
- Token generation speed is excellent (30-80 tokens/sec for 7B, 15-30 for 32B on M4 Pro).
- The OS, applications, and models all compete for the same memory pool. Budget 10GB for the OS and apps.
- There is no discrete VRAM — "GPU memory" and "system RAM" are the same thing.

### ARM64 Native

Ollama on Apple Silicon and ARM64 Linux runs natively. There is no emulation layer. This matters because:

- Performance is significantly better than x86 emulation (Rosetta or QEMU).
- The inference kernels rely on CPU-specific instruction sets (AVX2 on x86, NEON on ARM64), and Ollama ships the correct native build automatically; a quick check follows this list.
- Docker on ARM64 Macs uses the native ARM64 Ollama image.
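
To confirm the install is genuinely native rather than emulated, check the machine and binary architectures (a quick sanity check; paths assume a standard install):

```bash
# arm64 (macOS) or aarch64 (Linux) means a native ARM64 kernel
uname -m

# The binary itself should report ARM64, not x86_64
file "$(command -v ollama)"
```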

## The Ollama REST API

Every Ollama command maps to an HTTP API call. Applications should use the API, not shell out to the CLI:

```bash
# Generate a completion
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "explain this function"}],
  "stream": false
}'

# Generate embeddings
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "function calculateTotal(items) { return items.reduce(...) }"
}'

# List loaded models
curl http://localhost:11434/api/ps
```
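
With `stream` left at its default of true, the same endpoints return newline-delimited JSON chunks instead of a single object, which is what interactive UIs consume:

```bash
# Streaming response: one JSON object per line; the final line has "done": true plus timing stats
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "explain this function"}]
}'
```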

### API Options

Control generation behavior per-request:

```json
{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "extract fields as JSON"}],
  "stream": false,
  "options": {
    "temperature": 0.0,
    "num_predict": 1024,
    "num_ctx": 8192,
    "top_p": 0.9
  },
  "format": "json"
}
```

Key options:
- **`temperature`** — 0.0 for deterministic output (extraction, classification). 0.7+ for creative generation.
- **`num_predict`** — Maximum tokens to generate. Critical for small models that can loop in JSON mode (see Structured Output article).
- **`num_ctx`** — Context window size. Larger contexts use more memory. Model maximums range from 4,096 to 131,072 tokens, but the default Ollama allocates is often much smaller, so set it explicitly for long inputs.
- **`format: "json"`** — Constrains the output to valid JSON; the prompt should still describe the structure you expect.

## Client Libraries

### Go

```go
import "github.com/ollama/ollama/api"

client, _ := api.ClientFromEnvironment()

// Chat delivers the response through a callback; with streaming disabled it is called once.
stream := false
_ = client.Chat(ctx, &api.ChatRequest{
    Model:    "qwen2.5-coder:7b",
    Messages: []api.Message{{Role: "user", Content: prompt}},
    Stream:   &stream,
    Options: map[string]interface{}{
        "temperature": 0.0,
        "num_predict": 1024,
    },
}, func(resp api.ChatResponse) error {
    fmt.Print(resp.Message.Content)
    return nil
})
```

### Python

```python
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0.0, "num_predict": 1024},
    format="json",
)
print(response["message"]["content"])
```

### HTTP (Language-Agnostic)

Any language with an HTTP client can call Ollama. The API is simple JSON over HTTP — no SDK required.

## Pre-Flight Checks

Before integrating Ollama into a workflow, verify the setup:

```bash
# Is Ollama running?
curl -s http://localhost:11434/api/tags > /dev/null && echo "OK" || echo "Ollama not running"

# Is the required model pulled?
ollama list | grep -q "qwen2.5-coder:7b" && echo "Model ready" || echo "Pull model first"

# How much memory is available?
ollama ps  # Shows loaded models and their memory usage

# Test a generation
ollama run qwen2.5-coder:7b "say hello" --verbose 2>&1 | grep "eval rate"
# Shows tokens/second — expect 30-80 tok/s for 7B on M4 Pro
```

## Common Mistakes

1. **Not checking loaded models before loading a new one.** Ollama does not warn when a model will not fit in GPU memory. It silently spills some or all layers to the CPU, which is 10-50x slower. Check `ollama ps` and stop unneeded models first.
2. **Using default context window for large inputs.** The default context varies by model. If your input exceeds it, the model silently truncates. Set `num_ctx` explicitly based on your input size.
3. **Shelling out to `ollama run` instead of using the API.** The CLI adds overhead (process startup, output parsing). Use the HTTP API or a client library for programmatic access.
4. **Expecting cloud-model quality from 7B models.** A 7B model is excellent for extraction, classification, and structured output. It is not a replacement for GPT-4 or Claude on complex reasoning. Match model size to task complexity.
5. **Not pinning model versions.** `ollama pull qwen2.5-coder:7b` pulls whatever the tag currently points to, which can change when the upstream model is updated. For reproducible results in production, record the model digest (from `ollama show --modelfile` or the tags API) and verify it matches before each run, as in the sketch below.

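
A minimal check for the last point, assuming `jq` is available and using the digest field exposed by the `/api/tags` response:

```bash
# Record the digest once, then fail fast if the local model ever changes underneath you
EXPECTED="sha256:..."   # paste the digest you recorded
ACTUAL=$(curl -s http://localhost:11434/api/tags | jq -r '.models[] | select(.name == "qwen2.5-coder:7b") | .digest')
[ "$ACTUAL" = "$EXPECTED" ] && echo "model pinned" || echo "digest mismatch: $ACTUAL"
```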
