---
title: "Wake-Filter Pattern: Cheap Classifier Before Expensive Agent"
description: "Putting a small classifier in front of a frontier agent to avoid paying full-cycle cost on noise — design, eval discipline, failure modes, and the cost arithmetic that flips when local LLMs are available."
url: https://agent-zone.ai/knowledge/agent-tooling/wake-filter-pattern/
section: knowledge
date: 2026-05-07
categories: ["agent-tooling"]
tags: ["wake-filter","classifier","agent-architecture","cost-optimization","local-llm","ollama"]
skills: ["wake-filter-design","classifier-eval-harness","agent-cost-arithmetic"]
tools: ["ollama","anthropic-api"]
levels: ["intermediate"]
word_count: 1954
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/wake-filter-pattern/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/wake-filter-pattern/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Wake-Filter+Pattern%3A+Cheap+Classifier+Before+Expensive+Agent
---


An agent fleet wired to a high-volume trigger source — channel mentions, queue events, webhooks — pays full cost on every cycle, even when the trigger is noise. A classifier placed in front of the main agent decides which triggers deserve a real cycle and which to drop. The pattern is old; what is new is that local LLMs make the classifier cost effectively zero, which flips the arithmetic in the pattern's favor for cases where the classifier's own cost and latency previously outweighed the savings.

## The pattern in one diagram

```
                  Trigger event (mention, queue item, webhook, etc.)
                                    │
                                    ▼
                           ┌─────────────────┐
                           │  Wake-filter    │  ← cheap classifier
                           │  (small model)  │     "Should the main agent run on this?"
                           └─────────────────┘
                                    │
                          ┌─────────┴─────────┐
                          ▼                   ▼
                    classify=wake      classify=skip
                          │                   │
                          ▼                   ▼
                   ┌──────────────┐    (no-op; cycle ends)
                   │  Main agent  │  ← expensive call
                   │  (frontier)  │
                   └──────────────┘
```

The classifier is a single LLM call with a focused prompt and a structured output. The main agent is the full reasoning loop — multi-turn, tool-calling, expensive. The wake-filter's job is to spend a few cents (or less) to decide whether to spend dollars.

## Cost arithmetic

The pattern only pays off when the math works. Three regimes, with `signal_rate` = the fraction of triggers that are real work:

| Scenario | Per-trigger cost (no wake-filter) | Per-trigger cost (with wake-filter) | Verdict |
|---|---|---|---|
| Frontier-API main + frontier-API filter | $X | $X × signal_rate + $small | Improves only if classifier is meaningfully cheaper than main |
| Frontier-API main + local filter | $X | $X × signal_rate + ~$0 | **Wins significantly when signal_rate < 50%** |
| Local main + local filter | ~$0 | ~$0 | Latency optimization, not cost |

The middle row is where the local-LLM era flips the decision. Before cheap local inference, a wake-filter's per-call cost ate into the savings; the pattern paid off only when noise was overwhelming and the main agent was extremely expensive. With a local classifier the per-call cost rounds to zero, so the break-even shifts dramatically: even a 30% noise rate justifies the pattern.
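
To make the middle row concrete, here is a minimal break-even sketch in Go; the $0.50 main-cycle cost is an assumed illustrative number, not a figure from any particular deployment:

```go
package main

import "fmt"

// expectedCost is the expected per-trigger spend with a wake-filter in front
// of the main agent: the main agent runs only on real work (signalRate of
// triggers), plus the classifier's own per-call cost.
func expectedCost(mainCost, filterCost, signalRate float64) float64 {
	return mainCost*signalRate + filterCost
}

func main() {
	// Assumed numbers: $0.50 per frontier main-agent cycle, ~$0 local filter,
	// 30% noise (signal_rate = 0.70).
	fmt.Printf("with filter: $%.2f  without: $%.2f\n", expectedCost(0.50, 0, 0.70), 0.50)
	// Prints: with filter: $0.35  without: $0.50 (a 30% saving at ~zero added cost).
}
```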

This flip is the spine of the broader local-vs-API decision. See the companion piece, [Local LLM Cost-Capability Tradeoff]({{< relref "local-llm-cost-capability-tradeoff" >}}), for the general framing — the wake-filter is the strongest specific case where local wins outright on a workload the API would otherwise dominate.

**The pattern wins when:**
- The trigger source is mostly noise (broad channel mentions, fan-out subscriptions)
- The main agent is frontier-API priced (per-cycle cost is meaningful)
- A local LLM is available for classification (per-call cost ~$0)
- Classifier latency fits within end-to-end SLO

**The pattern doesn't win when:**
- The trigger source is already curated (every backlog assignment is a directive — there is nothing to filter)
- The main agent is itself cheap (local-on-local just adds latency)
- End-to-end response is latency-critical and the classifier round-trip pushes past budget

## Implementation: Ollama OpenAI-compat

A working wake-filter is a single LLM call with three properties: deterministic output (`temperature: 0`), structured response, and a focused prompt. The OpenAI-compatible Ollama endpoint makes the integration boilerplate minimal.

```go
// Adapt to the host runtime; the shape is the same in any language. This
// sketch assumes the go-openai client (github.com/sashabaranov/go-openai)
// pointed at Ollama's OpenAI-compatible endpoint.
package wakefilter

import (
    "context"
    "encoding/json"
    "fmt"

    openai "github.com/sashabaranov/go-openai"
)

// Event is a minimal trigger shape; production events carry more fields.
type Event struct {
    ID      string
    Summary string
}

type WakeFilter interface {
    Classify(ctx context.Context, event Event) (wake bool, reason string, err error)
}

type LocalWakeFilter struct {
    client    *openai.Client // OpenAI-compat client pointed at Ollama (host:11434/v1)
    model     string         // "gemma4:e4b" or another small fast model
    promptTpl string         // System prompt: "You are a wake-filter for agent X. Decide..."
}

func (l *LocalWakeFilter) Classify(ctx context.Context, e Event) (bool, string, error) {
    resp, err := l.client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
        Model: l.model,
        Messages: []openai.ChatCompletionMessage{
            {Role: "system", Content: l.promptTpl},
            {Role: "user", Content: fmt.Sprintf("Event: %s\nDecide wake/skip with reason.", e.Summary)},
        },
        Temperature:    0,
        ResponseFormat: &openai.ChatCompletionResponseFormat{Type: "json_object"},
    })
    if err != nil {
        // On classifier infrastructure failure, default to wake (false-positive bias).
        return true, "classifier_unavailable", err
    }
    var decision struct {
        Wake   bool   `json:"wake"`
        Reason string `json:"reason"`
    }
    if err := json.Unmarshal([]byte(resp.Choices[0].Message.Content), &decision); err != nil {
        return true, "parse_failure", err
    }
    return decision.Wake, decision.Reason, nil
}
```

The host running Ollama owns the latency budget. A small model like `gemma4:e4b` typically returns a JSON decision in under a second on consumer hardware; that fits inside most agent cycle SLOs. If the classifier service becomes unreachable, the integration above defaults to wake — wasting a cycle is recoverable; missing real work usually is not.

The classifier prompt is the part that takes iteration. A vague prompt like "Should the agent wake?" is too abstract for a small model. A concrete prompt — "Is this message a directive that requires the PM agent to dispatch a backlog item, or is it a status update?" — gives the model criteria it can apply consistently. **Wake-filter prompts must be concrete, role-specific, and grounded in examples.** A wake-filter for a code-review agent classifies differently from one for a triage agent; the prompts should not be shared.

A working prompt has four parts. First, a one-sentence role definition that names the agent the filter is gating. Second, a list of wake criteria — concrete patterns that should always trigger the main agent. Third, a list of skip criteria with examples of the noise to drop. Fourth, an explicit JSON schema for the output. The order matters: small models anchor on the role, then check the wake list, then the skip list, and finally produce the structured response.

```
You are a wake-filter for the PM agent. The PM dispatches backlog
items to builders when given a directive.

Wake the PM if the event is:
- An explicit dispatch instruction ("assign X to Y", "queue this")
- A new backlog item that needs triage
- A direct mention with an actionable request

Skip the PM if the event is:
- A status update or progress report (no action requested)
- A message tagged at another agent (PM is not the addressee)
- A heartbeat, log line, or routine system message

Respond with JSON: {"wake": <bool>, "reason": "<short phrase>"}
```

The skip list is what saves the most cost. Most production noise comes from a small number of recurring patterns — heartbeats, automated reports, broadcast announcements. Naming those patterns explicitly in the skip list teaches the small model to recognize them at a glance, without the model needing to reason about why they're noise.

## Eval discipline

A classifier without an eval harness is a silent liability. Prompt edits, model upgrades, and upstream trigger changes all drift the decision boundary. Without a corpus to regress against, the drift becomes visible only after production breaks — either a cost spike (false positives) or a backlog of missed work (false negatives).

The minimum viable harness:

```
eval/wake-filter/
├── corpus.jsonl       # 20-50 hand-labeled examples: {event: ..., expected_wake: bool}
├── runner.py          # iterates corpus, calls classifier, computes precision/recall
└── results-2026-05-07.md  # snapshot per evaluation run
```

The corpus should cover every cluster of triggers the production stream produces, with a deliberate mix of obvious-wake, obvious-skip, and ambiguous edge cases. New trigger patterns observed in production should be added to the corpus with their correct labels — this is how the harness stays honest as the trigger source evolves.
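
A few illustrative corpus lines in that schema (the events below are invented for illustration; a real corpus is sampled from the production trigger stream):

```
{"event": "assign the auth-refactor backlog item to builder-2", "expected_wake": true}
{"event": "heartbeat: queue-consumer alive, lag 0s", "expected_wake": false}
{"event": "@pm can you triage the flaky-deploy report?", "expected_wake": true}
{"event": "weekly status: 4 items shipped, 2 in review", "expected_wake": false}
```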

Pre-merge requirement: the classifier must hit a target accuracy on the corpus before the prompt or model change deploys. Some teams use 95% as the bar; others require 100% on small corpora because every miss is a known production scenario the agent will see. One reference deployment uses 100% on a 23-case corpus as the merge gate — that specific number isn't universal, but the discipline is: **without an eval harness, classifier drift is invisible until production breaks.**
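
The runner itself stays small. The layout above names a Python script; the sketch below shows the same loop in Go for continuity with the earlier classifier code, and assumes the `WakeFilter` and `Event` types from that sketch:

```go
package wakefilter

import (
    "bufio"
    "context"
    "encoding/json"
    "os"
)

// corpusCase mirrors one line of corpus.jsonl.
type corpusCase struct {
    Event        string `json:"event"`
    ExpectedWake bool   `json:"expected_wake"`
}

// EvalResult holds the counts the merge gate and precision/recall math need.
type EvalResult struct {
    Total, Correct, FalseWake, MissedWake int
}

// RunEval replays the labeled corpus through the classifier.
func RunEval(ctx context.Context, f WakeFilter, path string) (EvalResult, error) {
    var r EvalResult
    file, err := os.Open(path)
    if err != nil {
        return r, err
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        var c corpusCase
        if err := json.Unmarshal(scanner.Bytes(), &c); err != nil {
            return r, err
        }
        r.Total++
        // Classifier infrastructure errors fall back to wake (see Classify),
        // so they surface here as false wakes when the label says skip.
        wake, _, _ := f.Classify(ctx, Event{Summary: c.Event})
        switch {
        case wake == c.ExpectedWake:
            r.Correct++
        case wake && !c.ExpectedWake:
            r.FalseWake++
        default:
            r.MissedWake++
        }
    }
    return r, scanner.Err()
}
```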

## Failure modes and the false-positive bias

A wake-filter has two failure modes, and they are not symmetric.

| Failure | What it costs | Recovery | Bias toward |
|---|---|---|---|
| **False positive** (wake on noise) | One wasted main-agent cycle | Automatic next cycle | acceptable |
| **False negative** (skip real work) | Missed work, user-visible delay | Manual escalation or next trigger | avoid |

A wasted cycle costs the price of one main-agent invocation and is recoverable on the next trigger. A missed cycle hides real work — it doesn't surface until a user notices something is overdue. **Tune the classifier prompt for false-positive bias: when the model is uncertain, default to wake.** This is the right asymmetry for almost every wake-filter use case.

The same bias governs error handling in the integration code. Network errors talking to the classifier, malformed JSON in the response, model timeouts — every one of these should default to wake. The classifier exists to save money on the easy-to-classify majority; the hard cases should fall through to the main agent that can actually reason about them.

## Debugging classifier failures

Three failure signatures appear in production wake-filters often enough to recognize on sight, plus a fourth that the eval harness catches before a change ships.

**Malformed JSON output.** Small models occasionally drift from the requested format, especially after a prompt edit or model upgrade:

```
Expected: {"wake": true, "reason": "explicit dispatch directive"}
Got:      The answer here is to wake because this looks like...
```

The mitigation has two layers. First, request `response_format: {type: "json_object"}` if the model and inference server support it — this enforces JSON at decode time. Second, wrap the parser in a fallback that defaults to wake on parse failure. A classifier that occasionally produces prose is recoverable; one that produces prose AND blocks the main agent is not.

**False-positive cluster.** Suddenly the main agent runs on everything; cost spikes; the alerts queue fills with low-value cycles.

- *Diagnostic*: dump the last 100 classifier decisions. Look for `wake: true, reason: ""` (an empty reason usually means the classifier punted to the safe default).
- *Causes*: prompt edit that loosened criteria, corpus drift that no longer matches production triggers, or an upstream change to the trigger source that introduced new event shapes the prompt doesn't recognize.

**False-negative cluster.** Real work backs up; users escalate manually because the agent went quiet.

- *Diagnostic*: dump the last 100 classifier decisions. Look for `wake: false` on items that look like real work in retrospect.
- *Causes*: prompt overfit to skip examples, an unbalanced corpus that taught the classifier to be too aggressive, or a model upgrade that shifted the decision boundary.

**Eval regression at merge time.** The harness is the cheapest place to catch a problem:

```
eval pre-merge required: 100% on corpus
actual:                  22/23 (95.7%)
diff:                    case "X" now classified as skip; was wake before
```

Block the deploy. The root cause is almost always either a prompt edit that changed semantics in a way the author didn't anticipate, or a model upgrade that moved the boundary on a borderline case. Either way, the corpus is doing its job — investigate before merging, not after.

## Operational tips

A few things worth doing on day one rather than discovering through outages.

**Log every decision.** `{event_id, wake, reason, model, prompt_version, latency_ms}` per classification. Cost-spike investigations and accuracy audits both need this data; reconstructing it after the fact is impossible.
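
A minimal sketch of that record in the same Go as the earlier code; the field names follow the list above, and the JSON-lines sink is an assumption, any structured log destination works:

```go
package wakefilter

import (
    "encoding/json"
    "io"
)

// Decision is one wake-filter classification, one line per trigger.
type Decision struct {
    EventID       string `json:"event_id"`
    Wake          bool   `json:"wake"`
    Reason        string `json:"reason"`
    Model         string `json:"model"`
    PromptVersion string `json:"prompt_version"`
    LatencyMS     int64  `json:"latency_ms"`
}

// LogDecision appends the record as a JSON line to any writer (file, pipe,
// log shipper); cost-spike investigations and accuracy audits replay this stream.
func LogDecision(w io.Writer, d Decision) error {
    line, err := json.Marshal(d)
    if err != nil {
        return err
    }
    _, err = w.Write(append(line, '\n'))
    return err
}
```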

**Version the prompt.** Treat the wake-filter prompt as code — store it in a file, commit it, tag deploys with the version. When false positives or negatives spike, the first question is "did the prompt change recently?" and the answer should be a one-line check.

**Pin the model.** `gemma4:e4b` is not the same classifier as `gemma4:e4b-instruct-q4` — model tags drift, and Ollama will quietly pull a new version. Pin the exact tag, and treat a model upgrade as a prompt change for eval-gating purposes.

**Re-eval on trigger-source changes.** A new event type, a new channel subscription, or a new producer of triggers can blow up classifier accuracy without anything in the wake-filter itself changing. Add representative samples of new event shapes to the corpus and re-run the harness before assuming the classifier still works.

**Watch latency, not just accuracy.** A classifier that takes three seconds to decide is a classifier that pushed the agent past its SLO. The host running Ollama should be capacity-checked under realistic load — a single stuck request behind a slow classifier batch is a worse outcome than no classifier at all.

