---
title: "Structured Output from Small Local Models: JSON Mode, Extraction, Classification, and Token Runaway Fixes"
description: "Getting reliable structured output (JSON, classifications, function calls) from 2-7B local models — using JSON mode, constraining output schemas, handling token runaway, and scoring extraction accuracy."
url: https://agent-zone.ai/knowledge/agent-tooling/structured-output-local-models/
section: knowledge
date: 2026-02-22
categories: ["agent-tooling"]
tags: ["local-llm","structured-output","json-mode","extraction","classification","function-calling","ollama"]
skills: ["structured-extraction","json-output-engineering","classification-pipeline","output-scoring"]
tools: ["ollama","qwen","ministral","python","go"]
levels: ["intermediate"]
word_count: 1250
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/structured-output-local-models/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/structured-output-local-models/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Structured+Output+from+Small+Local+Models%3A+JSON+Mode%2C+Extraction%2C+Classification%2C+and+Token+Runaway+Fixes
---


# Structured Output from Small Local Models

Small models (2-7B parameters) produce structured output that is 85-95% as accurate as cloud APIs for well-defined extraction and classification tasks. The key is constraining the output space so the model's limited reasoning capacity is focused on filling fields rather than deciding what to generate.

This is where local models genuinely compete with — and sometimes match — models 30x their size.

## JSON Mode

Ollama's JSON mode forces the model to produce valid JSON:

```python
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{
        "role": "user",
        "content": """Extract the following fields from this support ticket as JSON:
- category (one of: billing, technical, account, other)
- priority (one of: low, medium, high, critical)
- summary (one sentence)

Ticket: "My credit card was charged twice for the same order #12345.
I need an immediate refund for the duplicate charge of $49.99."
"""
    }],
    format="json",
    options={"temperature": 0.0, "num_predict": 512},
)
print(response["message"]["content"])
```

Output:

```json
{
  "category": "billing",
  "priority": "high",
  "summary": "Customer was charged twice for order #12345 and needs a refund of $49.99."
}
```

### The Token Runaway Problem

Small models in JSON mode can enter a loop where they generate thousands of repetitive tokens — repeating fields, nesting infinitely, or producing valid-looking JSON that never terminates.

```json
{"category": "billing", "priority": "high", "summary": "...",
 "details": {"category": "billing", "priority": "high", "summary": "...",
  "details": {"category": "billing", ...
```

**The fix is `num_predict`.** Always set a maximum output token limit:

```python
response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[...],
    format="json",
    options={
        "temperature": 0.0,
        "num_predict": 1024,  # CRITICAL: cap output tokens
    },
)
```

For extraction tasks, 256-1024 tokens is almost always sufficient. For complex multi-field schemas, 2048 may be needed. Never leave `num_predict` unlimited with small models in JSON mode.
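When the cap is hit, the JSON is almost certainly truncated and invalid, so detect it rather than parsing blindly. Recent Ollama versions report why generation stopped in a `done_reason` field (`"stop"` for a natural finish, `"length"` when the token limit was reached) — a minimal sketch, assuming that field is present in your Ollama version:

```python
def hit_token_limit(response: dict) -> bool:
    """True when generation stopped at num_predict rather than finishing naturally.

    Assumes the response carries Ollama's `done_reason` field:
    "stop" for a natural finish, "length" for a truncated one.
    """
    return response.get("done_reason") == "length"
```

If this returns true, retry with a tighter prompt or a higher `num_predict` instead of passing the truncated output downstream.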

### Schema-in-Prompt Pattern

The most reliable pattern for structured extraction: include the exact JSON schema in the prompt.

```python
import json

SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "account", "other"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "summary": {"type": "string", "maxLength": 100},
        "action_required": {"type": "boolean"},
    },
    "required": ["category", "priority", "summary", "action_required"]
}

prompt = f"""Extract information from the following text.
Return a JSON object matching this schema:
{json.dumps(SCHEMA, indent=2)}

Text: {input_text}"""
```

Including `enum` values in the schema dramatically improves accuracy. The model selects from the provided options instead of generating arbitrary values.
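The schema can do double duty as a post-hoc check on the model's output. A minimal hand-rolled sketch that only checks required fields and enum membership (the `jsonschema` package does full validation if you want more):

```python
def validate_output(schema: dict, data: dict) -> list:
    """Return a list of problems; an empty list means the output passes."""
    errors = []
    props = schema.get("properties", {})
    # Every required field must be present
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"missing required field: {field}")
    # Every returned field must be known, and enum fields must use allowed values
    for field, value in data.items():
        spec = props.get(field)
        if spec is None:
            errors.append(f"unexpected field: {field}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{field}: {value!r} not in {spec['enum']}")
    return errors
```

Reject and retry (or escalate) whenever the error list is non-empty, rather than letting an invented category value flow into downstream code.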

## Structured Extraction

### Invoice Parsing

```python
def extract_invoice(text: str) -> dict:
    prompt = f"""Extract invoice details from this text as JSON with these fields:
- invoice_number: string
- date: string (YYYY-MM-DD format)
- vendor: string
- line_items: array of {{description: string, quantity: number, unit_price: number}}
- total: number
- currency: string (3-letter code)

Text:
{text}"""

    response = ollama.chat(
        model="qwen2.5-coder:7b",
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 1024},
    )
    return json.loads(response["message"]["content"])
```
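Extracted numbers can be cross-checked deterministically, which catches many extraction errors without a second model call. A sketch that verifies the line items sum to the extracted total (field names follow the prompt above; the tolerance absorbs float rounding):

```python
def totals_consistent(invoice: dict, tolerance: float = 0.01) -> bool:
    """Cross-check an extracted invoice: line items should sum to the total."""
    computed = sum(item["quantity"] * item["unit_price"]
                   for item in invoice.get("line_items", []))
    return abs(computed - invoice.get("total", 0.0)) <= tolerance
```

A failed cross-check usually means a missed line item or a misread digit — both worth flagging for review.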

### Log Event Parsing

```python
def parse_log_event(log_line: str) -> dict:
    prompt = f"""Parse this log line into structured JSON:
- timestamp: string (ISO 8601)
- level: string (one of: DEBUG, INFO, WARN, ERROR, FATAL)
- service: string
- message: string
- error_code: string or null
- stack_trace: boolean

Log: {log_line}"""

    response = ollama.chat(
        model="ministral:3b",  # Small model is sufficient for single-line parsing
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 256},
    )
    return json.loads(response["message"]["content"])
```

## Classification and Routing

Classification is the sweet spot for small models. The output space is small and well-defined.

### Multi-Label Classification

```python
CATEGORIES = ["bug", "feature-request", "question", "documentation", "security"]

def classify_issue(title: str, body: str) -> dict:
    prompt = f"""Classify this GitHub issue. Return JSON with:
- primary_category: one of {CATEGORIES}
- confidence: number between 0.0 and 1.0
- suggested_labels: array of up to 3 strings from {CATEGORIES}

Issue title: {title}
Issue body: {body}"""

    response = ollama.chat(
        model="qwen3:4b",
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 256},
    )
    return json.loads(response["message"]["content"])
```

### Routing with Confidence Threshold

Use the model's confidence to route uncertain classifications to a larger model:

```python
result = classify_issue(title, body)

if result["confidence"] >= 0.8:
    # Small model is confident — use its classification
    apply_label(result["primary_category"])
elif result["confidence"] >= 0.5:
    # Medium confidence — use 32B for a second opinion
    result_32b = classify_issue_32b(title, body)
    apply_label(result_32b["primary_category"])
else:
    # Low confidence — flag for human review
    flag_for_review(title, body, result)
```

## Function Calling

Function calling with small models works when the tool schema is small and well-defined.

```python
TOOLS = [
    {
        "name": "search_docs",
        "description": "Search documentation for a topic",
        "parameters": {
            "query": {"type": "string", "description": "search query"},
            "section": {"type": "string", "enum": ["api", "guides", "faq"]},
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket",
        "parameters": {
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "category": {"type": "string", "enum": ["billing", "technical", "account"]},
        }
    },
    {
        "name": "check_status",
        "description": "Check status of an order or ticket",
        "parameters": {
            "id": {"type": "string", "description": "order or ticket ID"},
        }
    },
]

def select_tool(user_message: str) -> dict:
    prompt = f"""Given the user message, select the appropriate tool and fill in its parameters.
Return JSON with:
- tool: the tool name
- parameters: object with the tool's parameters filled in

Available tools:
{json.dumps(TOOLS, indent=2)}

User message: {user_message}"""

    response = ollama.chat(
        model="ministral:3b",
        messages=[{"role": "user", "content": prompt}],
        format="json",
        options={"temperature": 0.0, "num_predict": 256},
    )
    return json.loads(response["message"]["content"])
```

Small models reliably select the correct tool when there are 3-8 tools with clear descriptions. Beyond 10-15 tools, accuracy drops and you should either use a larger model or pre-filter the tool list.
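Pre-filtering can be as simple as keyword overlap between the user message and each tool's name and description — a hypothetical sketch, not any library's API:

```python
def prefilter_tools(tools: list, user_message: str, limit: int = 8) -> list:
    """Keep the `limit` tools whose name/description best overlap the message."""
    words = set(user_message.lower().split())

    def overlap(tool: dict) -> int:
        # Treat underscores in tool names as word separators
        text = (tool["name"].replace("_", " ") + " " + tool["description"]).lower()
        return len(words & set(text.split()))

    return sorted(tools, key=overlap, reverse=True)[:limit]
```

Feed the filtered list into the selection prompt so the small model only ever sees a handful of candidates. For larger tool catalogs, embedding-based retrieval is the natural upgrade.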

## Scoring Extraction Quality

Never trust model output without measurement. Score extraction quality with deterministic metrics:

### Field-Level Exact Match

```python
def score_extraction(expected: dict, actual: dict) -> dict:
    scores = {}
    for field, expected_value in expected.items():
        actual_value = actual.get(field)
        if actual_value is None:
            scores[field] = 0.0
        elif isinstance(expected_value, bool):
            # bool must be checked before int/float: bool is a subclass of int in Python,
            # so a bool expected value would otherwise fall into the numeric branch
            scores[field] = 1.0 if actual_value == expected_value else 0.0
        elif isinstance(expected_value, str):
            scores[field] = 1.0 if str(actual_value).strip().lower() == expected_value.strip().lower() else 0.0
        elif isinstance(expected_value, (int, float)):
            # Guard the type so a string like "49.99" scores 0.0 instead of raising TypeError
            scores[field] = (
                1.0
                if isinstance(actual_value, (int, float)) and abs(actual_value - expected_value) < 0.01
                else 0.0
            )
        else:
            scores[field] = 1.0 if actual_value == expected_value else 0.0

    scores["overall"] = sum(scores.values()) / len(scores) if scores else 0.0
    return scores
```

### F1 Score for Multi-Label

```python
def f1_score(expected_labels: set, actual_labels: set) -> float:
    if not expected_labels and not actual_labels:
        return 1.0
    if not expected_labels or not actual_labels:
        return 0.0

    true_positives = len(expected_labels & actual_labels)
    precision = true_positives / len(actual_labels) if actual_labels else 0
    recall = true_positives / len(expected_labels) if expected_labels else 0

    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
```

### Running a Scoring Suite

```python
import json

# Load test fixtures
with open("testdata/invoices.json") as f:
    test_cases = json.load(f)

results = []
for case in test_cases:
    actual = extract_invoice(case["input"])
    score = score_extraction(case["expected"], actual)
    results.append({"case": case["name"], "score": score["overall"], "details": score})

avg_score = sum(r["score"] for r in results) / len(results)
print(f"Average extraction accuracy: {avg_score:.1%}")
print(f"Cases below 80%: {sum(1 for r in results if r['score'] < 0.8)}/{len(results)}")
```

Run scoring suites against every model change, prompt change, and quantization change. Regressions in extraction quality are silent — you will not notice them without automated testing.
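One way to make those silent regressions loud: persist per-case scores from a known-good run and diff each new run against them. A sketch (the name-to-score mapping is an assumed storage format, not a standard):

```python
def find_regressions(current: dict, baseline: dict, tolerance: float = 0.02) -> list:
    """Case names whose score dropped more than `tolerance` below the baseline run.

    `current` and `baseline` both map case name -> overall score (0.0-1.0);
    the tolerance absorbs run-to-run noise so only real drops are flagged.
    """
    return [name for name, score in sorted(current.items())
            if name in baseline and score < baseline[name] - tolerance]
```

Fail CI when the returned list is non-empty, and refresh the baseline deliberately when a change is an intentional improvement.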

## Common Mistakes

1. **Not setting `num_predict` in JSON mode.** The single most common issue with local model structured output. Small models can generate 10,000+ tokens of repetitive JSON without this limit.
2. **Using high temperature for extraction.** Temperature adds randomness. Extraction should be deterministic. Always use `temperature: 0.0` for structured output tasks.
3. **Providing open-ended schemas without enums.** A field like `category: string` gives the model unlimited options. `category: one of [billing, technical, account]` constrains it to valid values.
4. **Not validating JSON output.** Even with JSON mode, the output may be truncated (hit token limit) or have wrong types. Always parse with error handling and validate against the expected schema.
5. **Testing on one example and deploying.** Small models are less consistent than large models. A prompt that works on your test example may fail on edge cases. Build a test suite of 20+ examples covering variations and edge cases.
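The validation called for in mistake 4 can be sketched as a defensive parse wrapper (the `required` map of field names to expected Python types is a hypothetical convention for illustration):

```python
import json


def safe_parse(raw: str, required: dict):
    """Parse model output defensively; None means the caller should retry or escalate.

    Catches truncated/malformed JSON and wrong field types, neither of which
    JSON mode alone prevents.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # truncated (hit the token limit) or malformed
    if not isinstance(data, dict):
        return None
    for field, typ in required.items():
        if not isinstance(data.get(field), typ):
            return None  # missing field or wrong type
    return data
```

Route `None` results back through a retry with a shorter input, or up to a larger model, instead of crashing on `json.loads` deep in a pipeline.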

