Choosing a Local Model

The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed with equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.

Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.

Prompt Engineering for Local Models

Prompting a 7B local model is not the same as prompting Claude or GPT-4. Cloud models are extensively trained on instruction following, tolerate vague prompts, and self-correct. Small local models need more structure, more constraints, and more explicit formatting instructions. The prompts that work effortlessly on cloud models often produce garbage on local models.

This is not a weakness — it is a design consideration. Local models trade generality for speed and cost. Your prompts must compensate by being more specific.

Structured Output from Small Local Models

Small models (2-7B parameters) produce structured output that is 85-95% as accurate as cloud APIs for well-defined extraction and classification tasks. The key is constraining the output space so the model’s limited reasoning capacity is focused on filling fields rather than deciding what to generate.

This is where local models genuinely compete with — and sometimes match — models 30x their size.

JSON Mode

Ollama’s JSON mode forces the model to produce valid JSON.

Two-Pass Analysis: Summarize-Then-Correlate

A 32B model with a 32K context window can process roughly 8-10 source files at once. A real codebase has hundreds. Concatenating everything into one prompt fails — the context overflows, quality degrades, and the model either truncates or hallucinates connections.

The two-pass pattern solves this by splitting analysis into two stages:

  1. Pass 1 (Summarize): A fast 7B model reads each file independently and produces a focused summary.
  2. Pass 2 (Correlate): A capable 32B model reads all summaries (which are much shorter than the original files) and answers the cross-cutting question.

This effectively multiplies your context window by the compression ratio of summarization — typically 10-20x. A 32K context that handles 10 files directly can handle 100-200 files through summaries.