Choosing a Local Model: Size Tiers, Task Matching, and Cost Comparison with Cloud APIs

February 22, 2026

Model-Selection, Cost-Analysis, Task-Model-Matching

Local-Llm, Model-Selection, Benchmarking, Ollama, Cost-Comparison, Small-Models

Choosing a Local Model#

The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed for equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.

Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.

Ollama Setup and Model Management: Installation, Model Selection, Memory Management, and ARM64 Native

February 22, 2026

Agent-Tooling

Intermediate

Ollama-Setup, Model-Management, Local-Inference-Configuration

Ollama, Local-Llm, Model-Management, Apple-Silicon, Arm64, Gpu-Memory, Quantization

Ollama, Docker

Ollama Setup and Model Management#

Ollama turns running local LLMs into a single command. It handles model downloads, quantization, and GPU memory allocation, and it exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.

Installation#

# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Start the Ollama server:

Prompt Engineering for Local Models: Presets, Focus Areas, and Differences from Cloud Model Prompting

February 22, 2026

Agent-Tooling

Intermediate

Local-Model-Prompting, Preset-Design, Prompt-Debugging

Prompt-Engineering, Local-Llm, Ollama, Presets, Structured-Prompts, Small-Models

Ollama, Qwen, Llama, Python

Prompt Engineering for Local Models#

Prompting a 7B local model is not the same as prompting Claude or GPT-4. Cloud models are heavily trained on instruction following, tolerate vague prompts, and self-correct. Small local models need more structure, more constraints, and more explicit formatting instructions. Prompts that work effortlessly on cloud models often produce garbage on local models.

This is not a weakness — it is a design consideration. Local models trade generality for speed and cost. Your prompts must compensate by being more specific.
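As a rough illustration of what "more structure" can mean, the sketch below assembles a prompt from explicit sections. The section names and the `build_structured_prompt` helper are hypothetical, chosen for this example rather than taken from any preset in the article.

```python
def build_structured_prompt(task, constraints, output_format, text):
    """Assemble a prompt with explicit, labeled sections.

    The section labels (TASK, CONSTRAINTS, OUTPUT FORMAT, INPUT) are an
    assumed layout for illustration, not a documented convention.
    """
    lines = ["TASK:", task, "", "CONSTRAINTS:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "OUTPUT FORMAT:", output_format, "", "INPUT:", text]
    return "\n".join(lines)


prompt = build_structured_prompt(
    task="Classify the log line by severity.",
    constraints=["Answer with exactly one word.", "Do not explain your answer."],
    output_format="One of: INFO, WARNING, ERROR",
    text="disk usage at 97% on /var",
)
print(prompt)
```

Each constraint is stated on its own line so a small model does not have to infer it from context.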

RAG for Codebases Without Cloud APIs: ChromaDB, Embedding Models, and Semantic Code Search

February 22, 2026

Agent-Tooling

Intermediate

Rag-Pipeline-Construction, Embedding-Model-Usage, Semantic-Code-Search

Rag, Embeddings, Chromadb, Local-Llm, Semantic-Search, Code-Search, Vector-Database

Ollama, Chromadb, Python, Nomic-Embed-Text

RAG for Codebases Without Cloud APIs#

When a codebase has hundreds of files, neither direct concatenation nor summarize-then-correlate is ideal for targeted questions like “where is authentication handled?” or “what calls the payment API?” RAG (Retrieval-Augmented Generation) indexes the codebase into a vector database and retrieves only the relevant chunks for each query.

The key advantage: query time stays nearly constant as the codebase grows. Whether the codebase has 50 files or 5,000, a query takes about the same time because only the top-K relevant chunks are retrieved and sent to the model.
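The top-K retrieval step can be sketched in plain Python. This is a toy stand-in, not the ChromaDB API: the three-dimensional vectors are hand-written placeholders for real embeddings, the chunk labels are invented, and the linear scan stands in for the index lookup a vector database would perform.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical index: (chunk text, toy embedding) pairs.
index = [
    ("auth/login.py: verify_password()", [0.9, 0.1, 0.0]),
    ("billing/charge.py: call_payment_api()", [0.1, 0.9, 0.0]),
    ("utils/strings.py: slugify()", [0.0, 0.1, 0.9]),
]

query = [1.0, 0.2, 0.0]  # toy embedding of "where is authentication handled?"
context = top_k(query, index, k=2)
print(context)
```

However large `index` becomes, the prompt sent to the model contains only `k` chunks, which is why the generation cost does not grow with the codebase.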

Structured Output from Small Local Models: JSON Mode, Extraction, Classification, and Token Runaway Fixes

February 22, 2026

Agent-Tooling

Intermediate

Structured-Extraction, Json-Output-Engineering, Classification-Pipeline, Output-Scoring

Local-Llm, Structured-Output, Json-Mode, Extraction, Classification, Function-Calling, Ollama

Ollama, Qwen, Ministral, Python, Go

Structured Output from Small Local Models#

Small models (2-7B parameters) produce structured output that is 85-95% as accurate as cloud APIs for well-defined extraction and classification tasks. The key is constraining the output space so the model’s limited reasoning capacity is focused on filling fields rather than deciding what to generate.

This is where local models genuinely compete with — and sometimes match — models 30x their size.

JSON Mode#

Ollama’s JSON mode forces the model to produce valid JSON:
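A minimal sketch of such a request over the REST API, using only the Python standard library. It assumes a local Ollama server on the default port 11434; the helper names are hypothetical, and any model name passed in must already be pulled locally.

```python
import json
import urllib.request

# Assumes the default local Ollama address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_json_request(model, prompt):
    """Build a request body that asks Ollama for JSON-constrained output."""
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",  # enables JSON mode
        "stream": False,   # return one complete response instead of chunks
    }


def generate_json(model, prompt, url=OLLAMA_URL):
    """Send the request and parse the model's output as JSON.

    Requires a running Ollama server; raises json.JSONDecodeError if the
    returned text is not valid JSON (for example, if it was cut off).
    """
    body = json.dumps(build_json_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        envelope = json.load(response)
    # The generated text is returned as a string in the "response" field.
    return json.loads(envelope["response"])
```

It generally helps to also describe the expected keys in the prompt itself, since JSON mode constrains the syntax but not which fields the model chooses to emit.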

Two-Pass Analysis: The Summarize-Then-Correlate Pattern for Scaling Beyond Context Windows

February 22, 2026

Agent-Tooling

Intermediate

Multi-File-Analysis, Llm-Orchestration, Context-Window-Management

Local-Llm, Two-Pass, Summarize-Correlate, Codebase-Analysis, Context-Window, Architecture-Pattern

Ollama, Python, Qwen

Two-Pass Analysis: Summarize-Then-Correlate#

A 32B model with a 32K context window can process roughly 8-10 source files at once. A real codebase has hundreds. Concatenating everything into one prompt fails: the context overflows, quality degrades, and either the input gets truncated or the model hallucinates connections.

The two-pass pattern solves this by splitting analysis into two stages:

Pass 1 (Summarize): A fast 7B model reads each file independently and produces a focused summary.
Pass 2 (Correlate): A capable 32B model reads all summaries (which are much shorter than the original files) and answers the cross-cutting question.

This effectively multiplies your context window by the compression ratio of summarization — typically 10-20x. A 32K context that handles 10 files directly can handle 100-200 files through summaries.
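The two passes can be sketched as a small orchestration function. This is an outline under stated assumptions: `generate` is a caller-supplied function `(model, prompt) -> text`, the model names `fast-7b` and `strong-32b` are placeholders, and the prompt wording is illustrative. A stub stands in for the real model call so the flow can be followed without a server.

```python
def two_pass(files, question, generate, fast_model="fast-7b", strong_model="strong-32b"):
    """Summarize each file with a fast model, then correlate with a stronger one.

    files: mapping of path -> source text.
    generate: callable (model, prompt) -> str, supplied by the caller.
    """
    # Pass 1: each file is summarized independently.
    summaries = []
    for path, source in files.items():
        prompt = (
            f"Summarize this file in a few sentences, focusing on: {question}\n\n"
            f"FILE: {path}\n{source}"
        )
        summaries.append(f"{path}: {generate(fast_model, prompt)}")

    # Pass 2: the stronger model sees only the summaries.
    combined = "\n".join(summaries)
    return generate(strong_model, f"Using these file summaries, answer: {question}\n\n{combined}")


# Stub generator that records which model was called (illustration only).
calls = []


def stub_generate(model, prompt):
    calls.append(model)
    return f"output from {model}"


answer = two_pass(
    {"a.py": "print(1)", "b.py": "print(2)"},
    "what does this code print?",
    stub_generate,
)
print(calls, answer)
```

Because every pass 1 call is independent of the others, those calls could also run in parallel; only pass 2 needs all summaries at once.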