---
title: "Prompt Engineering for Local Models: Presets, Focus Areas, and Differences from Cloud Model Prompting"
description: "Prompt engineering techniques specific to small and medium local models — why local models need different prompt strategies, using presets and focus areas, schema-driven prompts, and common failures with fixes."
url: https://agent-zone.ai/knowledge/agent-tooling/prompt-engineering-local-models/
section: knowledge
date: 2026-02-22
categories: ["agent-tooling"]
tags: ["prompt-engineering","local-llm","ollama","presets","structured-prompts","small-models"]
skills: ["local-model-prompting","preset-design","prompt-debugging"]
tools: ["ollama","qwen","llama","python"]
levels: ["intermediate"]
word_count: 1504
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/prompt-engineering-local-models/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/prompt-engineering-local-models/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Prompt+Engineering+for+Local+Models%3A+Presets%2C+Focus+Areas%2C+and+Differences+from+Cloud+Model+Prompting
---


# Prompt Engineering for Local Models

Prompting a 7B local model is not the same as prompting Claude or GPT-4. Cloud models are trained extensively for instruction following, tolerate vague prompts, and self-correct. Small local models need more structure, more constraints, and more explicit formatting instructions. Prompts that work effortlessly on cloud models often produce garbage on local models.

This is not a weakness — it is a design consideration. Local models trade generality for speed and cost. Your prompts must compensate by being more specific.

## How Local Models Differ

### Less Instruction Following

Cloud models are trained with extensive RLHF (reinforcement learning from human feedback) to follow instructions precisely. A 7B model has less of this training. Result:

```
# Works on Claude, fails on 7B local
"Analyze this code for security issues. Be thorough but concise.
Focus on injection vulnerabilities and authentication bypass."

# Works on both
"List security issues in this code. For each issue, output:
- LINE: the line number
- TYPE: one of [injection, auth_bypass, xss, ssrf, other]
- SEVERITY: one of [low, medium, high, critical]
- DESCRIPTION: one sentence explaining the issue"
```

The second prompt constrains the output format so the model does not have to decide how to structure its response.

### Narrower Context Understanding

Cloud models maintain coherence over long prompts. Local models lose focus after a few hundred tokens of instructions. Front-load the important parts:

```
# Bad: important instruction buried at the end
"Here is a 500-line source file. <file content>
By the way, only focus on the error handling patterns."

# Good: instruction first, content second
"Focus ONLY on error handling patterns in the following code.
For each error handling pattern found, output the function name
and how errors are propagated.

<file content>"
```

### Less Self-Correction

Cloud models catch their own mistakes mid-generation. Local models commit to their first token and follow through. If the first few tokens go wrong, the rest follows:

```
# Local model starts generating a list when you wanted prose
"1. The first issue is..."  → continues as a numbered list even if you wanted paragraphs

# Fix: explicitly state the format
"Write your response as continuous paragraphs, not as a list."
```

## The Preset Pattern

Presets are reusable prompt templates with a defined focus area and output format. Instead of writing a new prompt for every task, select a preset.

### Defining Presets

```python
PRESETS = {
    "architecture": {
        "system": "You are a software architect analyzing code structure.",
        "focus": "dependencies, imports, data flow, coupling between components, design patterns",
        "output_format": "Organize findings by theme. Reference file names.",
        "question": "How do the components of this codebase fit together?",
    },
    "security": {
        "system": "You are a security auditor reviewing code for vulnerabilities.",
        "focus": "input validation, authentication, authorization, secrets in code, error messages that leak information, injection points",
        "output_format": "For each finding: file, line/function, severity (low/medium/high/critical), description, fix.",
        "question": "What security vulnerabilities exist in this code?",
    },
    "review": {
        "system": "You are a senior developer reviewing code for bugs.",
        "focus": "bugs, edge cases, off-by-one errors, null/nil handling, unchecked errors, race conditions, resource leaks",
        "output_format": "For each issue: file, function, severity, description, suggested fix.",
        "question": "What bugs and issues exist in this code?",
    },
    "consistency": {
        "system": "You are reviewing code for consistency across a codebase.",
        "focus": "naming conventions, error handling patterns, logging patterns, API conventions, code style",
        "output_format": "Group inconsistencies by category. Show examples from specific files.",
        "question": "What inconsistencies exist across this codebase?",
    },
    "document": {
        "system": "You are a technical writer documenting code.",
        "focus": "purpose, public API, parameters, return values, side effects, usage examples",
        "output_format": "Markdown documentation with code examples.",
        "question": "Generate documentation for this code.",
    },
}
```

### Using Presets

```python
def build_prompt(preset_name: str, content: str, custom_question: str | None = None) -> list[dict]:
    preset = PRESETS[preset_name]
    question = custom_question or preset["question"]

    return [
        {"role": "system", "content": preset["system"]},
        {"role": "user", "content": f"""Focus on: {preset["focus"]}

Output format: {preset["output_format"]}

{content}

Question: {question}"""},
    ]
```

Presets eliminate prompt drift — the tendency to tweak prompts per-run until they work for one example but fail on others. A well-tested preset works consistently across inputs.
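
A preset-built prompt drops straight into a chat call. A minimal sketch, assuming the `ollama` Python package; the model tag and file path are examples, not requirements:

```python
import ollama

# Run the "security" preset against one source file.
with open("app/auth.py") as f:            # hypothetical file
    source = f.read()

messages = build_prompt("security", source)
response = ollama.chat(
    model="qwen2.5-coder:7b",             # example tag, use whatever you run locally
    messages=messages,
    options={"temperature": 0.0},         # deterministic output for analysis tasks
)
print(response["message"]["content"])
```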

## Schema-Driven Prompts

For structured output, include the exact schema in the prompt. This is the single most impactful technique for local models.

### Extraction Schema

```python
import json

def extraction_prompt(text: str, schema: dict) -> str:
    return f"""Extract information from the following text.
Return a JSON object matching this EXACT schema:

{json.dumps(schema, indent=2)}

Rules:
- Use ONLY the values specified in "enum" fields.
- Set fields to null if the information is not present.
- Do not add fields not in the schema.

Text:
{text}"""
```
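
For illustration, a hypothetical schema with `enum` fields might look like the following; the field names are invented for the example:

```python
# Invented schema for extracting issue reports. The enum pins the allowed
# severities so the model cannot make up new ones.
ISSUE_SCHEMA = {
    "type": "object",
    "properties": {
        "component": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "reproducible": {"type": ["boolean", "null"]},
        "summary": {"type": "string"},
    },
    "required": ["component", "severity", "summary"],
}

prompt = extraction_prompt(report_text, ISSUE_SCHEMA)  # report_text: the text to extract from
```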

### Classification Schema

```python
import json

LABELS = ["bug", "feature", "question", "docs"]

def classification_prompt(text: str) -> str:
    return f"""Classify the following text into exactly ONE category.

Categories: {json.dumps(LABELS)}

Return JSON: {{"category": "<one of the categories>", "confidence": <0.0 to 1.0>}}

Text:
{text}"""
```

Explicitly listing the valid values and the exact JSON format leaves no ambiguity for the model.

## Prompt Debugging

When a prompt produces wrong output, diagnose systematically:

### 1. Check the Output Format

Is the model generating the right format? If you expect JSON and get prose, the format instruction is not strong enough.

**Fix:** Add `format="json"` to the Ollama call AND include JSON format instructions in the prompt. Belt and suspenders.
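
A hedged sketch of the belt-and-suspenders version with the `ollama` Python package, reusing `classification_prompt` from above; the model tag and input text are examples:

```python
import json
import ollama

text = "The app crashes whenever I click Save."     # example input

response = ollama.chat(
    model="llama3.1:8b",                             # example tag
    messages=[{"role": "user", "content": classification_prompt(text)}],
    format="json",                                   # ask Ollama to emit valid JSON
    options={"temperature": 0.0},
)
result = json.loads(response["message"]["content"])
```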

### 2. Check the First Few Tokens

Local models commit early. If the first token is wrong, everything after follows:

```
Expected: {"category": "bug", ...}
Got:      The category of this issue is bug because...

Fix: Start the assistant message with the opening brace
```

```python
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": "{"},  # Prime the model to start with JSON
]
```

### 3. Check Token Budget

If JSON output is truncated, `num_predict` is too low:

```
Got: {"category": "bug", "details": "This is a significant issue that affects the core authenticat
```

**Fix:** Increase `num_predict` or simplify the requested output.
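
In Ollama this is one option; the value below is an arbitrary example, not a recommendation:

```python
# Give the model room for a longer JSON response.
options = {"temperature": 0.0, "num_predict": 1024}
```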

### 4. Check Temperature

Temperature > 0 adds randomness. For extraction and classification, randomness is noise:

```python
# WRONG for structured output
options={"temperature": 0.7}

# RIGHT for structured output
options={"temperature": 0.0}
```

Use temperature > 0 only for generation tasks where variety is desired (writing, brainstorming).

### 5. Compare Against a Larger Model

If the prompt works on 32B but fails on 7B, the task may be too complex for the small model. Options:
- Simplify the prompt (fewer fields, simpler schema)
- Split into multiple smaller calls
- Use the larger model

## Prompt Anti-Patterns for Local Models

### Too Many Instructions

```
# Bad: 8 instructions, local model loses track after 3
"Analyze this code. Focus on security. Also check performance.
Consider edge cases. Look at error handling. Check naming conventions.
Verify the API contract. Suggest refactoring opportunities."

# Good: one clear instruction
"List security vulnerabilities in this code.
For each: file, line, severity (low/medium/high/critical), description."
```

### Implicit Output Format

```
# Bad: model decides format
"What's wrong with this code?"

# Good: explicit format
"List issues in this code as a JSON array:
[{\"file\": \"...\", \"line\": N, \"issue\": \"...\", \"severity\": \"...\"}]"
```

### Negative Instructions

```
# Bad: models are weak at "don't"
"Don't include explanations. Don't add commentary. Don't use markdown."

# Good: say what you want, not what you don't want
"Output ONLY the JSON object. No text before or after the JSON."
```

### Asking for Confidence Without Calibration

Small models are poorly calibrated on confidence. A 7B model saying "confidence: 0.95" does not mean it is 95% likely to be correct. Use confidence scores for relative ranking (higher is more likely correct than lower), not as absolute probabilities.
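
If you use the confidence field at all, treat it as an ordering signal. A minimal sketch, assuming `results` is a list of parsed classification outputs shaped like the JSON above; the 0.6 cutoff is an arbitrary illustration:

```python
# Rank by self-reported confidence and route the least confident items to review.
ranked = sorted(results, key=lambda r: r["confidence"], reverse=True)
needs_review = [r for r in ranked if r["confidence"] < 0.6]
```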

## Testing Prompts

Never deploy a prompt tested on one example. Build a small test suite:

```python
TEST_CASES = [
    {"input": "...", "expected": {"category": "bug", ...}},
    {"input": "...", "expected": {"category": "feature", ...}},
    {"input": "...", "expected": {"category": "question", ...}},
    # Include edge cases:
    {"input": "(empty string)", "expected": {"category": "other", ...}},
    {"input": "(ambiguous input)", "expected": {"category": "question", ...}},
    {"input": "(very long input)", "expected": {"category": "bug", ...}},
]

def test_prompt(prompt_fn, model, test_cases):
    results = []
    for case in test_cases:
        actual = run_prompt(prompt_fn, model, case["input"])
        match = actual == case["expected"]
        results.append({"input": case["input"][:50], "match": match, "actual": actual})

    accuracy = sum(r["match"] for r in results) / len(results)
    print(f"Accuracy: {accuracy:.0%} ({sum(r['match'] for r in results)}/{len(results)})")
    return results
```
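
The `run_prompt` helper is whatever wrapper you use to call the model; a minimal sketch with the `ollama` package might look like this:

```python
import json
import ollama

def run_prompt(prompt_fn, model: str, text: str) -> dict:
    """Build the prompt, call the model deterministically, parse the JSON reply."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt_fn(text)}],
        format="json",
        options={"temperature": 0.0},
    )
    try:
        return json.loads(response["message"]["content"])
    except json.JSONDecodeError:
        return {}  # malformed output counts as a miss
```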

Test across at least 3 models at the target size tier. A prompt that works on Qwen but fails on Llama is too model-specific — generalize it.

## Common Mistakes

1. **Using cloud-model prompting style on local models.** Vague instructions, implicit formats, and conversational prompts work on GPT-4 and Claude. Local models need explicit structure, constrained output, and front-loaded instructions.
2. **Not using the system message.** The system message sets the model's role and behavior. Local models respond noticeably better when given a clear system role ("You are a security auditor") versus user-only prompts.
3. **Changing prompts based on one failure.** A prompt that works 90% of the time should not be rewritten because of one bad output. Test on a suite. If accuracy is below your threshold, then adjust.
4. **Including unnecessary context.** Local models have smaller effective context windows. Every token of irrelevant context reduces the quality of the response. Send only what the model needs.
5. **Expecting chain-of-thought from 3-4B models.** Small models cannot reliably reason through multiple steps. If the task requires reasoning, either use a larger model or decompose the task into sequential calls where each step is simple, as in the sketch after this list.
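
A hedged sketch of that decomposition, reusing the prompt builders from earlier; the model tag and the second-step schema are invented for the example:

```python
import json
import ollama

def classify_then_extract(text: str, model: str = "llama3.2:3b") -> dict:
    """Two simple calls instead of one complex prompt. Model tag is an example."""
    # Step 1: classification only -- a constrained task a small model handles well.
    first = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": classification_prompt(text)}],
        format="json",
        options={"temperature": 0.0},
    )
    category = json.loads(first["message"]["content"])["category"]

    # Step 2: extract details for that single category (schema invented for illustration).
    schema = {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": [category]},
            "summary": {"type": "string"},
            "affected_component": {"type": ["string", "null"]},
        },
    }
    second = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": extraction_prompt(text, schema)}],
        format="json",
        options={"temperature": 0.0},
    )
    return json.loads(second["message"]["content"])
```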

