---
title: "DeepSeek V4 Operational Quirks: Pro vs Flash, Reasoning Echo, and the Discount Cliff"
description: "Eleven concrete DeepSeek V4 quirks observed in production OFAT matrix runs — V4-Pro vs V4-Flash differences, reasoning_content echo enforcement, the May 2026 discount expiration, and the asymmetric prompt-style rule."
url: https://agent-zone.ai/knowledge/agent-tooling/deepseek-v4-operational-quirks/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["deepseek","deepseek-v4","llm-quirks","reasoning-models","openai-compatible","production","cost-modeling"]
skills: ["llm-adapter-development","provider-integration","cost-modeling"]
tools: ["deepseek","deepseek-v4-pro","deepseek-v4-flash","go"]
levels: ["intermediate","advanced"]
word_count: 2209
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/deepseek-v4-operational-quirks/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/deepseek-v4-operational-quirks/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=DeepSeek+V4+Operational+Quirks%3A+Pro+vs+Flash%2C+Reasoning+Echo%2C+and+the+Discount+Cliff
---


# DeepSeek V4 Operational Quirks

DeepSeek V4 ships two models behind one OpenAI-compatible API: V4-Pro (reasoning) at $1.74/M input / $3.48/M output and V4-Flash (chat) at $0.28/M input / $1.10/M output. Until 2026-05-31 V4-Pro carries a 75% discount, putting it at $0.435/M input — cheap enough to use as a heavy-tier coding model. After that, the cost steps up 4×.

The two models live on the same endpoint but want very different things. V4-Pro behaves like a reasoning model (thin prompts, reasoning_content echo required, tool_choice restrictions). V4-Flash behaves like a chat model (rich prompts win dramatically; rejects nothing). Confuse them and your matrix lights up red.

This page is the production-confirmed quirks list, each as `Symptom → Cause → Fix → Verify`. Numbers come from an OFAT matrix of 48 runs (16 cells × N=3) on `api.deepseek.com`, totaling $6.26. Full matrix synthesis: `~/.claude/projects/-Users-mstather/memory/project_deepseek_matrix_v1_final_2026_05_19.md`.

## TL;DR — pattern-match before reading

- If `dr-echo-drop` 0/3 at $0.00 sounds familiar: the API rejected the request before any tokens were charged because `reasoning_content` was missing from history
- If you are choosing between V4-Pro and V4-Flash: Pro for tier-3 heavy refactors; Flash with a d4-rich prompt for tier-2 bug-bundle work at $0.04/run
- If your cost model assumes today's V4-Pro rate forever, you are off by 4× starting 2026-06-01 — the 75% discount expires
- If you are setting `reasoning_effort: max` on V4-Pro, drop it — observed quality regression (1/3 vs 3/3 baseline)
- If you are setting `strict_tools: true` on V4-Pro, drop it — mild regression (2/3 vs 3/3 baseline), opposite of the kimi rule
- If you are setting `tool_choice: "required"`, expect HTTP 400 on Pro and infinite loops on Flash
- If `finish_reason == "length"` and `content == ""` on V4-Pro, raise `max_tokens` to ≥32K — reasoning shares the budget
- If you are coalescing same-role messages "for token efficiency", stop — matrix proved zero benefit
- If you turned on `parallel_tool_calls: true`, turn it off — unreliable

## 1. V4-Pro vs V4-Flash — pick the right model for the work

**Symptom**: tier-2 bug-bundle runs cost $1.61 each on V4-Pro and $0.04 each on V4-Flash. Tier-3 heavy refactors pass on V4-Pro and fail on V4-Flash. The "wrong" model for a workload either over-pays or under-delivers.

**Cause**: the two models are tuned for different work shapes. V4-Pro is a reasoning model: it emits `reasoning_content`, takes longer per turn, and earns its price on multi-file refactors that need plan-then-execute. V4-Flash is non-reasoning chat: fast, cheap, no reasoning channel — needs prompt scaffolding to compose multi-step work.

**Fix**: route by tier. Concrete production-deployable configs from the matrix:

```yaml
# Tier-3 heavy refactor pod (V4-Pro):
models:
  provider: deepseek
  main: deepseek-reasoner       # V4-Pro
  max_output_tokens: 32000
  # do NOT set: prompt scaffolding, reasoning_effort, strict_tools

# Tier-2 bug-bundle pod (V4-Flash):
models:
  provider: deepseek
  main: deepseek-chat            # V4-Flash, aliases server-side
  max_output_tokens: 8000
prompt:
  style: d4-rich                  # completion-contract + named-callsite enumeration
```

V4-Flash + d4-rich is the cheapest passing config observed in any matrix to date: 3/3 pass at $0.04/run. 190× cheaper than uncached Sonnet for the same work shape.

**Verify**: run the same tier-2 canary on both models. If V4-Flash with d4-rich beats V4-Pro on cost AND pass rate, the routing rule is validated.

## 2. `reasoning_content` echo is HARD-required on V4-Pro

**Symptom**: turn 2 of any V4-Pro tool loop returns an immediate API error before any tokens are charged. Matrix cell `dr-echo-drop` registered 0/3 pass at exactly $0.00.

**Cause**: DeepSeek's V4 reasoning models emit `reasoning_content` on every assistant response. The API rejects subsequent requests whose history omits this field on prior assistant turns. The error is `"reasoning_content missing in assistant message"`. The `$0.00` cost is the smoking gun — the request never reached generation.

**Fix**: same shape as the Moonshot quirk. Capture `reasoning_content` on response, re-emit it on every assistant message in history:

```go
type wireMessage struct {
    Role             string         `json:"role"`
    Content          string         `json:"content"`
    ReasoningContent string         `json:"-"`
    ToolCalls        []wireToolCall `json:"tool_calls,omitempty"`
}

func (m wireMessage) MarshalJSON() ([]byte, error) {
    type alias wireMessage
    raw, _ := json.Marshal(alias(m))
    if m.Role != "assistant" {
        return raw, nil
    }
    var obj map[string]json.RawMessage
    json.Unmarshal(raw, &obj)
    rc, _ := json.Marshal(m.ReasoningContent)
    obj["reasoning_content"] = rc
    return json.Marshal(obj)
}
```

Unlike xAI, where dropping `reasoning_content` silently loses chain-of-thought, DeepSeek hard-rejects. If you ask DeepSeek-self "do I need this echoed?" it sometimes claims "no, it is harmful" — the matrix proved this self-report is wrong.

**Verify**: send a 3-turn tool loop with `reasoning_content` deliberately stripped on turn 2's request. If the response is an error at $0.00 cost, the echo requirement is active.

## 3. The 75% V4-Pro discount expires 2026-05-31

**Symptom**: cost models budgeted at current observed V4-Pro rates will under-budget by 4× starting 2026-06-01.

**Cause**: DeepSeek launched V4-Pro with a temporary discount: 75% off list. Current observed rates are $0.435/M input cache-miss and $0.87/M output. List rates are $1.74/M input and $3.48/M output. The discount has a hard expiration date.

**Fix**: encode the cliff in your cost tracker:

```go
func deepseekProRates(now time.Time) (input, output float64) {
    cliff := time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)
    if now.Before(cliff) {
        return 0.435, 0.87  // 75% discount window
    }
    return 1.74, 3.48
}
```

Re-plan capacity for 2026-06-01: V4-Pro $/run roughly $0.17 → $0.68. Still 11× cheaper than uncached Sonnet, ~2× cheaper than properly-cached Sonnet. V4-Flash is unaffected ($0.28/$1.10 list with no temporary discount) and stays at $0.04/run.

**Verify**: probe DeepSeek's pricing page on 2026-06-01 and reconcile your tracker. If your cost numbers do not 4× on Pro that day, the cliff is not encoded.

## 4. `max_tokens` includes reasoning tokens on V4-Pro

**Symptom**: V4-Pro responses have `content: ""` and `finish_reason: "length"`. `completion_tokens` equals exactly the configured `max_tokens`. Same shape as the kimi-k2.6 trap.

**Cause**: V4-Pro reasoning routinely consumes 10-30K tokens before emitting visible content. `max_tokens` covers reasoning AND content together. At an OpenAI-default 2048, V4-Pro spends the entire budget thinking and returns empty.

**Fix**:

```yaml
# pod config:
models:
  max_output_tokens: 32000   # floor for V4-Pro tier-3 work
```

Adapter default:

```go
const defaultMaxTokens = 8192  // safe for V4-Flash
// V4-Pro callers should pass MaxTokens >= 16000 explicitly
if req.Model == "deepseek-reasoner" && req.MaxTokens < 16000 {
    req.MaxTokens = 32000
}
```

V4-Flash doesn't have a reasoning channel, so its budget can stay smaller (8K is fine).

**Verify**: same as kimi — grep task-complete logs for `output_tokens` clustering at a round number.

## 5. Role coalescing is optional — don't bother

**Symptom**: developers add same-role message coalescing "to save tokens" and observe zero measurable benefit.

**Cause**: DeepSeek tolerates consecutive same-role messages in history (e.g. two user messages back-to-back). Some adapters coalesce them into a single message hoping to reduce per-message overhead. The matrix cell `dr-coalesce-roles` was equivalent to baseline at 3/3 pass and $0.17/run — no win, no harm.

**Fix**: skip the coalescing logic unless you are deliberately optimizing for sub-1% token savings on a high-volume workload. The complexity is not worth it.

```go
// Don't bother with this:
func coalesce(msgs []wireMessage) []wireMessage { /* ... */ }
```

If you are chasing the last percent, validate on your own workload — the matrix used a single canary.

**Verify**: A/B the same task with and without coalescing. If cost difference is <2%, the optimization is noise.

## 6. `reasoning_effort: max` HURTS V4-Pro

**Symptom**: setting `reasoning_effort: "max"` on V4-Pro regressed tier-3 pass rate from 3/3 baseline to 1/3 in the matrix.

**Cause**: V4-Pro's default reasoning effort is already tuned for its training distribution. Forcing maximum effort extends reasoning chains past the convergence point — the model second-guesses, refactors plans, and runs out of effective context for actual emission.

**Fix**:

```go
// Don't:
req.ReasoningEffort = "max"

// Do:
// Leave ReasoningEffort unset — accept the default.
```

Same finding holds across grok-4.20-reasoning (rejects the param entirely) and kimi-k2.6 (matrix `kimi-baseline` beats every variant). For reasoning models, defaults are tuned; community advice to "crank reasoning_effort" is consistently wrong in our data.

**Verify**: run the same heavy canary with `reasoning_effort` unset, then `"max"`. If max regresses, the default was optimal.

## 7. `strict_tools: true` mildly hurts V4-Pro

**Symptom**: matrix cell `dr-strict-true` regressed tier-3 from 3/3 baseline to 2/3 pass at $0.27/run.

**Cause**: this is the opposite of the kimi rule. V4-Pro's reasoning channel handles tool argument schemas internally — the model plans before emitting tool calls and rarely produces malformed JSON. Adding the `strict: true` flag introduces server-side validation overhead that occasionally rejects edge-case-but-valid arguments.

**Fix**:

```go
type wireFunctionDef struct {
    Name        string `json:"name"`
    Parameters  any    `json:"parameters"`
    // Don't set Strict for V4-Pro; default-false works.
}
```

Note: V4-Flash is untested in the matrix for strict mode. If you adopt Flash for production, A/B it separately.

**Verify**: run a tier-3 canary with `strict: true` vs unset on V4-Pro. If strict regresses, the default-false is correct.

## 8. `tool_choice: "required"` is rejected by V4-Pro, loops V4-Flash

**Symptom**: V4-Pro returns HTTP 400 on `tool_choice: "required"`. V4-Flash accepts it but enters an infinite loop, calling tools until `max_tokens` is exhausted.

**Cause**: the reasoning model treats forced tool calls as incompatible with its plan-first architecture. The chat model accepts the directive but has no convergence criterion when forced — it keeps emitting calls hoping to satisfy the "must call tool" rule.

**Fix**:

```go
// In all cases:
req.ToolChoice = "auto"  // or "none" to suppress tools

// "required" is unsafe on both V4-Pro and V4-Flash.
```

If you need terminal-tool enforcement (model must call `push_branch` or `escalate` before ending), use a runtime-side guardrail instead — see the LLM adapter audit checklist's "terminal-tool guardrail" entry.

**Verify**: probe with `"tool_choice": "required"`. V4-Pro returns 400; V4-Flash returns a long chain of unnecessary tool calls. Either way, switch to `"auto"`.

## 9. V4-Flash with d4-rich prompt — the cheapest passing config

**Symptom**: V4-Flash with default thin prompt: 33% pass on tier-2 canary. V4-Flash with d4-rich prompt addendum (completion contract + named-callsite enumeration): 100% pass at $0.04/run.

**Cause**: non-reasoning chat models need explicit scaffolding to compose multi-step work. The d4-rich prompt provides the planning structure that V4-Pro's reasoning channel does internally. This is the asymmetric tuning rule: reasoning models want thin prompts; chat models want rich prompts.

**Fix**: deploy V4-Flash with the d4-rich addendum for tier-2 bug-bundle workloads:

```yaml
# pod-builder-medium-flash.yaml
models:
  provider: deepseek
  main: deepseek-chat
prompt:
  base: CLAUDE-builder-medium.md
  addendum: |
    ## Completion contract
    Before push_branch:
    - Every file listed in the spec must be present in the diff
    - Tests for each modified function must exist
    - run_tests must return zero failures
    - PR body must enumerate which callsites were modified by name
```

This is the engine of the builder-heavy-fast-0 pod deployed 2026-05-19 in production. The same matrix axis on V4-Pro (`dr-prompt-d4-both`) regressed from 3/3 to 2/3 — confirming the asymmetric pattern across kimi, sonnet, grok-reasoning, and now deepseek-reasoner.

**Verify**: A/B V4-Flash with and without d4 on your own tier-2 canary. If pass rate jumps from ~30% to ~100% at <$0.10/run, the pattern transfers.

## 10. No reliable `parallel_tool_calls` support

**Symptom**: setting `parallel_tool_calls: true` results in tool-call results returning in unpredictable order. Sometimes the model emits valid parallel calls; sometimes it serializes them anyway.

**Cause**: DeepSeek's tool-call infrastructure does not consistently parallelize across both models. The feature is advertised but not load-bearing.

**Fix**:

```go
req.ParallelToolCalls = boolPtr(false)
```

Same recommendation across grok and most reasoning models. If your orchestrator can't safely interleave reasoning with parallel tool execution, the safe default is sequential.

**Verify**: trace a multi-tool turn with parallel enabled. If results return in non-call order or one tool gets serialized, the feature is unreliable for your workload.

## 11. Rate-card miscost — adapter must split cache-hit vs cache-miss

**Symptom**: `hub_agent_budget.cost_usd` for a DeepSeek pod is 2-5× the actual DeepSeek billing console number.

**Cause**: DeepSeek bills input tokens at one rate when cached (`prompt_cache_hit_tokens`) and another when uncached (`prompt_cache_miss_tokens`). Adapters that lump all input tokens at the cache-miss rate over-bill by the cached fraction. Reasoning tokens bill at the OUTPUT rate (DeepSeek bundles them into `completion_tokens` for accounting).

**Fix**:

```go
func costFromUsage(u Usage) float64 {
    // V4-Flash rates (calibrated 2026-05-16 against billing console):
    const (
        inputMiss  = 0.28 / 1e6
        inputHit   = 0.028 / 1e6
        output     = 1.10 / 1e6
    )
    cost := float64(u.PromptCacheMissTokens) * inputMiss
    cost += float64(u.PromptCacheHitTokens) * inputHit
    cost += float64(u.CompletionTokens) * output  // includes reasoning_tokens
    return cost
}
```

If the response omits the cache split (older response shape), treat all input as cache-miss — upper-bounds the cost.

**Verify**: compare 24h of your tracker's cost number against the DeepSeek billing console. If your tracker > console by >2×, the split is wrong.

## Bonus: pricing summary

| Model | Input (miss) | Input (hit) | Output | Notes |
|---|---|---|---|---|
| V4-Pro | $1.74/M | n/a | $3.48/M | 75% off until 2026-05-31: $0.435/$0.87 |
| V4-Flash | $0.28/M | $0.028/M | $1.10/M | No discount; permanent |

## Common Mistakes

**Treating V4-Pro and V4-Flash as one model with different latency.** They are different tools for different jobs. Pro for plan-then-execute heavy work; Flash + d4-rich for cheap tier-2 fixes. Routing by tier captures both wins.

**Trusting DeepSeek's self-report that reasoning_content echo is "harmful".** Matrix cell `dr-echo-drop` rejected at 0/3 / $0.00. The four external sources (Cline, RooCode, Continue.dev, the API docs themselves) were right; the model's introspection was wrong.

**Stacking community advice without testing.** Three commonly-recommended levers (`reasoning_effort: max`, `strict_tools: true`, `prompt-rich` on Pro) ALL regressed quality in the matrix. The DeepSeek-Pro baseline is already well-tuned; every variant beat it on cost, none beat it on pass rate.

**Budgeting at today's V4-Pro rate forever.** The 4× cliff on 2026-06-01 surprises cost models that don't encode the date. Encode it.

**Ignoring the cache-hit/miss split when computing cost.** Adapters that bill all input as cache-miss over-bill by 2-5× on workloads with stable system prompts. Use the per-token split from `usage`, not a single rate.

