{"page":{"agent_metadata":{"content_type":"pattern","outputs":["wake-filter-implementation","classifier-eval-corpus","false-positive-false-negative-tradeoff-analysis"],"prerequisites":["agent-architecture-basics","agent-cost-tracking","local-llm-cost-capability-tradeoff"]},"categories":["agent-tooling"],"content_plain":"An agent fleet wired to a high-volume trigger source — channel mentions, queue events, webhooks — pays full cost on every cycle, even when the trigger is noise. A classifier placed in front of the main agent decides which triggers deserve a real cycle and which to drop. The pattern is old; what is new is that local LLMs make the classifier cost effectively zero, which flips the arithmetic in the pattern\u0026rsquo;s favor for cases that previously didn\u0026rsquo;t justify the latency.\nThe pattern in one diagram# Trigger event (mention, queue item, webhook, etc.) │ ▼ ┌─────────────────┐ │ Wake-filter │ ← cheap classifier │ (small model) │ \u0026#34;Should the main agent run on this?\u0026#34; └─────────────────┘ │ ┌─────────┴─────────┐ ▼ ▼ classify=wake classify=skip │ │ ▼ ▼ ┌──────────────┐ (no-op; cycle ends) │ Main agent │ ← expensive call │ (frontier) │ └──────────────┘The classifier is a single LLM call with a focused prompt and a structured output. The main agent is the full reasoning loop — multi-turn, tool-calling, expensive. The wake-filter\u0026rsquo;s job is to spend a few cents (or less) to decide whether to spend dollars.\nCost arithmetic# The pattern only pays off when the math works. 
Three regimes, with signal_rate = the fraction of triggers that are real work:\nScenario Per-trigger cost (no wake-filter) Per-trigger cost (with wake-filter) Verdict Frontier-API main + frontier-API filter $X $X × signal_rate + $small Improves only if classifier is meaningfully cheaper than main Frontier-API main + local filter $X $X × signal_rate + ~$0 Wins significantly when signal_rate \u0026lt; 50% Local main + local filter ~$0 ~$0 Latency optimization, not cost The middle row is where the local-LLM era flips the decision. Before cheap local inference, a wake-filter\u0026rsquo;s per-call cost ate into the savings; the pattern paid off only when noise was overwhelming and the main agent was extremely expensive. With a local classifier the per-call cost rounds to zero, so the break-even shifts dramatically: even a 30% noise rate justifies the pattern.\nThis flip is the spine of the broader local-vs-API decision. See the companion piece, Local LLM Cost-Capability Tradeoff, for the general framing — the wake-filter is the strongest specific case where local wins outright on a workload the API would otherwise dominate.\nThe pattern wins when:\nThe trigger source is mostly noise (broad channel mentions, fan-out subscriptions) The main agent is frontier-API priced (per-cycle cost is meaningful) A local LLM is available for classification (per-call cost ~$0) Classifier latency fits within end-to-end SLO The pattern doesn\u0026rsquo;t win when:\nThe trigger source is already curated (every backlog assignment is a directive — there is nothing to filter) The main agent is itself cheap (local-on-local just adds latency) End-to-end response is latency-critical and the classifier round-trip pushes past budget Implementation: Ollama OpenAI-compat# A working wake-filter is a single LLM call with three properties: deterministic output (temperature: 0), structured response, and a focused prompt. 
The OpenAI-compatible Ollama endpoint makes the integration boilerplate minimal.\n// Adapt to the host runtime; the shape is the same in any language. type WakeFilter interface { Classify(ctx context.Context, event Event) (wake bool, reason string, err error) } type LocalWakeFilter struct { client *openai.Client // OpenAI-compat client pointed at Ollama (host:11434/v1) model string // \u0026#34;gemma4:e4b\u0026#34; or another small fast model promptTpl string // System prompt: \u0026#34;You are a wake-filter for agent X. Decide...\u0026#34; } func (l *LocalWakeFilter) Classify(ctx context.Context, e Event) (bool, string, error) { resp, err := l.client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{ Model: l.model, Messages: []openai.ChatCompletionMessage{ {Role: \u0026#34;system\u0026#34;, Content: l.promptTpl}, {Role: \u0026#34;user\u0026#34;, Content: fmt.Sprintf(\u0026#34;Event: %s\\nDecide wake/skip with reason.\u0026#34;, e.Summary)}, }, Temperature: 0, ResponseFormat: \u0026amp;openai.ChatCompletionResponseFormat{Type: \u0026#34;json_object\u0026#34;}, }) if err != nil { // On classifier infrastructure failure, default to wake (false-positive bias). return true, \u0026#34;classifier_unavailable\u0026#34;, err } if len(resp.Choices) == 0 { // An empty choices list would panic on the index below; treat it as a classifier failure and wake. return true, \u0026#34;empty_response\u0026#34;, fmt.Errorf(\u0026#34;classifier returned no choices\u0026#34;) } var decision struct { Wake bool `json:\u0026#34;wake\u0026#34;` Reason string `json:\u0026#34;reason\u0026#34;` } if err := json.Unmarshal([]byte(resp.Choices[0].Message.Content), \u0026amp;decision); err != nil { // Malformed output also defaults to wake. return true, \u0026#34;parse_failure\u0026#34;, err } return decision.Wake, decision.Reason, nil }The host running Ollama owns the latency budget. A small model like gemma4:e4b typically returns a JSON decision in under a second on consumer hardware; that fits inside most agent cycle SLOs. If the classifier service becomes unreachable, the integration above defaults to wake — wasting a cycle is recoverable; missing real work usually is not.\nThe classifier prompt is the part that takes iteration. 
A vague prompt like \u0026ldquo;Should the agent wake?\u0026rdquo; is too abstract for a small model. A concrete prompt — \u0026ldquo;Is this message a directive that requires the PM agent to dispatch a backlog item, or is it a status update?\u0026rdquo; — gives the model criteria it can apply consistently. Wake-filter prompts must be concrete, role-specific, and grounded in examples. A wake-filter for a code-review agent classifies differently from one for a triage agent; the prompts should not be shared.\nA working prompt has four parts. First, a one-sentence role definition that names the agent the filter is gating. Second, a list of wake criteria — concrete patterns that should always trigger the main agent. Third, a list of skip criteria with examples of the noise to drop. Fourth, an explicit JSON schema for the output. The order matters: small models anchor on the role, then check the wake list, then the skip list, and finally produce the structured response.\nYou are a wake-filter for the PM agent. The PM dispatches backlog items to builders when given a directive. Wake the PM if the event is: - An explicit dispatch instruction (\u0026#34;assign X to Y\u0026#34;, \u0026#34;queue this\u0026#34;) - A new backlog item that needs triage - A direct mention with an actionable request Skip the PM if the event is: - A status update or progress report (no action requested) - A message tagged at another agent (PM is not the addressee) - A heartbeat, log line, or routine system message Respond with JSON: {\u0026#34;wake\u0026#34;: \u0026lt;bool\u0026gt;, \u0026#34;reason\u0026#34;: \u0026#34;\u0026lt;short phrase\u0026gt;\u0026#34;}The skip list is what saves the most cost. Most production noise comes from a small number of recurring patterns — heartbeats, automated reports, broadcast announcements. 
Naming those patterns explicitly in the skip list teaches the small model to recognize them at a glance, without the model needing to reason about why they\u0026rsquo;re noise.\nEval discipline# A classifier without an eval harness is a silent liability. Prompt edits, model upgrades, and upstream trigger changes all drift the decision boundary. Without a corpus to regress against, the drift becomes visible only after production breaks — either a cost spike (false positives) or a backlog of missed work (false negatives).\nThe minimum viable harness:\neval/wake-filter/ ├── corpus.jsonl # 20-50 hand-labeled examples: {event: ..., expected_wake: bool} ├── runner.py # iterates corpus, calls classifier, computes precision/recall └── results-2026-05-07.md # snapshot per evaluation runThe corpus should cover every cluster of triggers the production stream produces, with a deliberate mix of obvious-wake, obvious-skip, and ambiguous edge cases. New trigger patterns observed in production should be added to the corpus with their correct labels — this is how the harness stays honest as the trigger source evolves.\nPre-merge requirement: the classifier must hit a target accuracy on the corpus before the prompt or model change deploys. Some teams use 95% as the bar; others require 100% on small corpora because every miss is a known production scenario the agent will see. 
One reference deployment uses 100% on a 23-case corpus as the merge gate — that specific number isn\u0026rsquo;t universal, but the discipline is: without an eval harness, classifier drift is invisible until production breaks.\nFailure modes and the false-positive bias# A wake-filter has two failure modes, and they are not symmetric.\nFailure What it costs Recovery Bias toward False positive (wake on noise) One wasted main-agent cycle Automatic next cycle acceptable False negative (skip real work) Missed work, user-visible delay Manual escalation or next trigger avoid A wasted cycle costs the price of one main-agent invocation and is recoverable on the next trigger. A missed cycle hides real work — it doesn\u0026rsquo;t surface until a user notices something is overdue. Tune the classifier prompt for false-positive bias: when the model is uncertain, default to wake. This is the right asymmetry for almost every wake-filter use case.\nThe same bias governs error handling in the integration code. Network errors talking to the classifier, malformed JSON in the response, model timeouts — every one of these should default to wake. The classifier exists to save money on the easy-to-classify majority; the hard cases should fall through to the main agent that can actually reason about them.\nDebugging classifier failures# Three failure signatures appear in production wake-filters often enough to recognize on sight; a fourth is caught earlier, at merge time.\nMalformed JSON output. Small models occasionally drift from the requested format, especially after a prompt edit or model upgrade:\nExpected: {\u0026#34;wake\u0026#34;: true, \u0026#34;reason\u0026#34;: \u0026#34;explicit dispatch directive\u0026#34;} Got: The answer here is to wake because this looks like...The mitigation has two layers. First, request response_format: {type: \u0026quot;json_object\u0026quot;} if the model and inference server support it — this enforces JSON at decode time. 
Second, wrap the parser in a fallback that defaults to wake on parse failure. A classifier that occasionally produces prose is recoverable; one that produces prose AND blocks the main agent is not.\nFalse-positive cluster. Suddenly the main agent runs on everything; cost spikes; the alerts queue fills with low-value cycles.\nDiagnostic: dump the last 100 classifier decisions. Look for wake: true, reason: \u0026quot;\u0026quot; (an empty reason usually means the classifier punted to the safe default). Causes: a prompt edit that loosened criteria, an eval corpus that has drifted away from production triggers, or an upstream change to the trigger source that introduced new event shapes the prompt doesn\u0026rsquo;t recognize. False-negative cluster. Real work backs up; users escalate manually because the agent went quiet.\nDiagnostic: dump the last 100 classifier decisions. Look for wake: false on items that look like real work in retrospect. Causes: a prompt overfit to its skip examples, a prompt tuned against a skip-heavy corpus until it skips too aggressively, or a model upgrade that shifted the decision boundary. Eval regression at merge time. The harness is the cheapest place to catch a problem:\neval pre-merge required: 100% on corpus actual: 22/23 (95.7%) diff: case \u0026#34;X\u0026#34; now classified as skip; was wake beforeBlock the deploy. The root cause is almost always either a prompt edit that changed semantics in a way the author didn\u0026rsquo;t anticipate, or a model upgrade that moved the boundary on a borderline case. Either way, the corpus is doing its job — investigate before merging, not after.\nOperational tips# A few things worth doing on day one rather than discovering through outages.\nLog every decision. {event_id, wake, reason, model, prompt_version, latency_ms} per classification. Cost-spike investigations and accuracy audits both need this data; reconstructing it after the fact is impossible.\nVersion the prompt. 
Treat the wake-filter prompt as code — store it in a file, commit it, tag deploys with the version. When false positives or negatives spike, the first question is \u0026ldquo;did the prompt change recently?\u0026rdquo; and the answer should be a one-line check.\nPin the model. gemma4:e4b is not the same classifier as gemma4:e4b-instruct-q4 — model tags are mutable, and a fresh pull or a rebuilt host can fetch different weights under the same tag. Pin the exact tag, and treat a model upgrade as a prompt change for eval-gating purposes.\nRe-eval on trigger-source changes. A new event type, a new channel subscription, or a new producer of triggers can blow up classifier accuracy without anything in the wake-filter itself changing. Add representative samples of new event shapes to the corpus and re-run the harness before assuming the classifier still works.\nWatch latency, not just accuracy. A classifier that takes three seconds to decide is a classifier that pushed the agent past its SLO. The host running Ollama should be capacity-checked under realistic load — a single stuck request behind a slow classifier batch is a worse outcome than no classifier at all.\n","date":"2026-05-07","description":"Putting a small classifier in front of a frontier agent to avoid paying full-cycle cost on noise — design, eval discipline, failure modes, and the cost arithmetic that flips when local LLMs are available.","lastmod":"2026-05-07","levels":["intermediate"],"reading_time_minutes":10,"section":"knowledge","skills":["wake-filter-design","classifier-eval-harness","agent-cost-arithmetic"],"tags":["wake-filter","classifier","agent-architecture","cost-optimization","local-llm","ollama"],"title":"Wake-Filter Pattern: Cheap Classifier Before Expensive Agent","tools":["ollama","anthropic-api"],"url":"https://agent-zone.ai/knowledge/agent-tooling/wake-filter-pattern/","word_count":1954}}