{"page":{"agent_metadata":{"content_type":"reference","outputs":["diagnose-kimi-400-errors","configure-kimi-pod","audit-moonshot-adapter"],"prerequisites":["openai-api-basics","reasoning-model-fundamentals","json-http-clients"]},"categories":["agent-tooling"],"content_plain":"Moonshot Kimi K2.6 Operational Quirks# Kimi K2.6 is one of the cheapest competent reasoning models — $0.95/M input cache-miss, $0.16/M cache-hit, $4.00/M output, 256K context. It is also one of the most opinionated. Half of what works on OpenAI breaks here, and the failures are silent: empty content, mid-reasoning truncation, 400 errors that don\u0026rsquo;t mention the actual problem, and a cache key parameter that makes cost go up instead of down.\nThis page is the production-confirmed list of quirks, each as Symptom → Cause → Fix → Verify. Numbers come from an OFAT matrix of 48 runs (8 cells × 2 canaries × N=3) executed 2026-05-20 against api.moonshot.ai. The full matrix synthesis is in dream-team/planning/kimi-matrix-v1-results-2026-05-20.md.\nTL;DR — pattern-match before reading# If your call returns HTTP 400 about temperature, drop to temperature: 1.0 — it is hard-locked in thinking mode If turn 2 of a tool loop returns 400 about reasoning_content, your adapter is stripping the field on round-trip If finish_reason == \u0026quot;length\u0026quot; and content == \u0026quot;\u0026quot;, max_tokens is too low — raise to ≥32K If your adapter sets tool_choice: \u0026quot;required\u0026quot; and gets 400, switch to \u0026quot;auto\u0026quot; — \u0026quot;required\u0026quot; is rejected in thinking mode If you get HTTP 401 from a valid-looking key, you are pointing the wrong region endpoint (api.moonshot.ai vs api.moonshot.cn) If your cost rose after enabling prompt_cache_key, remove it — the key actively hurt cost by 44% in the matrix If tool_calls[] is missing on a multi-turn task, you may be hitting the default 50 RPM tier limit — email support@moonshot.ai for production tier Default strict_tools: true on coding agents — strict: false dropped tier-3 pass rate from 2/3 to 1/3 in the matrix 1. temperature locked to 1.0 in thinking mode# Symptom: HTTP 400 with \u0026quot;invalid temperature: only 1 is allowed for this model\u0026quot; on every request that sets temperature to anything other than 1.0.\nCause: Moonshot\u0026rsquo;s thinking-mode API contract pins three sampling parameters. The kimi-k2.6 reasoning channel was trained at fixed sampler settings and the server rejects deviations. top_p is similarly pinned to 0.95, presence_penalty and frequency_penalty must be 0, and n must be 1. This is documented but not loud — the OpenAI Python client default temperature of 0.7 fails immediately.\nFix: when thinking.type: \u0026quot;enabled\u0026quot; (the default for kimi-k2.6), hardcode the constrained sampler:\nreq := chatRequest{ Model: \u0026#34;kimi-k2.6\u0026#34;, Messages: messages, Temperature: 1.0, // hard-required in thinking mode TopP: 0.95, // hard-required in thinking mode N: 1, // hard-required in thinking mode // do NOT set PresencePenalty or FrequencyPenalty Thinking: \u0026amp;thinking{Type: \u0026#34;enabled\u0026#34;, Keep: \u0026#34;all\u0026#34;}, }If you need lower-temperature behavior, disable thinking (thinking.type: \u0026quot;disabled\u0026quot;) — that path accepts temperature: 0.6. But the matrix proved thinking-off drops tier-3 pass rate from 2/3 to 0/3; the reasoning channel is essential for hard work.\nVerify: curl -sS https://api.moonshot.ai/v1/chat/completions -H \u0026quot;Authorization: Bearer $KEY\u0026quot; -H \u0026quot;Content-Type: application/json\u0026quot; -d '{\u0026quot;model\u0026quot;:\u0026quot;kimi-k2.6\u0026quot;,\u0026quot;messages\u0026quot;:[{\u0026quot;role\u0026quot;:\u0026quot;user\u0026quot;,\u0026quot;content\u0026quot;:\u0026quot;hi\u0026quot;}],\u0026quot;temperature\u0026quot;:0.7}' | jq .error — if you see \u0026quot;invalid temperature\u0026quot;, the lock is active.\n2. reasoning_content MUST round-trip on every assistant turn# Symptom: First turn of a tool loop succeeds. Second turn returns HTTP 400: \u0026quot;thinking is enabled but reasoning_content is missing in assistant tool call message at index N\u0026quot;.\nCause: kimi-k2.6 in thinking mode emits a reasoning_content field on every assistant response, alongside content and tool_calls. On the next request, Moonshot requires the field be echoed back verbatim in the conversation history. Most OpenAI-shape adapters strip it because the standard OpenAI client library doesn\u0026rsquo;t know about it. This is documented in LiteLLM issue #26156 and confirmed by Moonshot\u0026rsquo;s own docs.\nFix: capture reasoning_content on response, re-emit on every assistant message in the request history:\ntype wireMessage struct { Role string `json:\u0026#34;role\u0026#34;` Content string `json:\u0026#34;content\u0026#34;` ReasoningContent string `json:\u0026#34;-\u0026#34;` ToolCalls []wireToolCall `json:\u0026#34;tool_calls,omitempty\u0026#34;` } func (m wireMessage) MarshalJSON() ([]byte, error) { type alias wireMessage raw, _ := json.Marshal(alias(m)) if m.Role != \u0026#34;assistant\u0026#34; { return raw, nil } var obj map[string]json.RawMessage json.Unmarshal(raw, \u0026amp;obj) rc, _ := json.Marshal(m.ReasoningContent) obj[\u0026#34;reasoning_content\u0026#34;] = rc return json.Marshal(obj) }If you ask the model \u0026ldquo;do you need reasoning_content echoed?\u0026rdquo; it often answers false. That answer is wrong. Trust the 400 response, not the self-report.\nVerify: run a 3-turn tool-use trace. If turn 2 returns 400 mentioning reasoning_content, the round-trip is missing.\n3. max_tokens includes reasoning tokens# Symptom: response has content: \u0026quot;\u0026quot; and finish_reason: \u0026quot;length\u0026quot;. No error, no warning. completion_tokens equals exactly the configured max_tokens (smoking gun: round numbers like 2048, 4096, 8192).\nCause: in thinking mode, max_tokens covers reasoning AND content together. Reasoning routinely consumes 10-30K tokens on heavy multi-file tasks. At the OpenAI-default max_tokens: 2048, kimi spends the entire budget thinking and never emits a visible response or tool call. The runtime then treats it as \u0026ldquo;model gave up\u0026rdquo; — but it was a silent truncation.\nFix:\n# pod-builder-medium-3.yaml or equivalent models: max_output_tokens: 32000 # matrix-validated floor for tier-3 workAdapter-side default:\nmaxTokens := req.MaxTokens if maxTokens == 0 || maxTokens \u0026lt; 16000 { maxTokens = 32000 // 96000 is Moonshot\u0026#39;s documented default }The matrix proved 16K truncates on tier-3 (0/3 pass), 32K is optimal (2/3 pass at $1.61/run), 64K wastes money (still 0/3 pass at +46% cost). 32K is the floor.\nVerify: kubectl logs -l app=\u0026lt;agent\u0026gt; -c main --tail=500 | grep '\u0026quot;main: task complete\u0026quot;' | jq -r .output_tokens | sort | uniq -c | sort -rn — if the output clusters at exactly 2048 or 4096, you have the silent truncation.\n4. thinking.keep: \u0026quot;all\u0026quot; required for multi-turn tool use# Symptom: multi-turn tool flows work for ~2 turns then start dropping reasoning context. Model behavior degrades: forgets earlier tool results, repeats already-executed actions, or fails to compose multi-step reasoning.\nCause: thinking.keep controls how reasoning history is retained across turns. Default behavior in some adapter shapes drops older reasoning blocks. For multi-turn coding agents, this destroys the chain-of-thought that makes the reasoning channel useful.\nFix:\ntype thinkingConfig struct { Type string `json:\u0026#34;type\u0026#34;` // \u0026#34;enabled\u0026#34; | \u0026#34;disabled\u0026#34; Keep string `json:\u0026#34;keep\u0026#34;` // MUST be \u0026#34;all\u0026#34; for multi-turn tool use } req.Thinking = \u0026amp;thinkingConfig{Type: \u0026#34;enabled\u0026#34;, Keep: \u0026#34;all\u0026#34;}Verify: trace a 5-turn tool conversation. Inspect each request\u0026rsquo;s assistant messages — every prior assistant turn should still carry its original reasoning_content.\n5. tool_choice restricted to \u0026quot;auto\u0026quot; or \u0026quot;none\u0026quot; in thinking mode# Symptom: HTTP 400 when setting tool_choice: \u0026quot;required\u0026quot; or tool_choice: {\u0026quot;type\u0026quot;:\u0026quot;function\u0026quot;,\u0026quot;function\u0026quot;:{\u0026quot;name\u0026quot;:\u0026quot;...\u0026quot;}}.\nCause: thinking-mode kimi only accepts the loose tool-choice values. The forcing variants are rejected.\nFix:\n// In thinking mode: req.ToolChoice = \u0026#34;auto\u0026#34; // or \u0026#34;none\u0026#34; to suppress tools entirely // \u0026#34;required\u0026#34; or a named-function force is illegal.If you genuinely need to force a tool call, disable thinking mode for that request. The matrix proved strict_tools: true is a better lever than tool_choice: required for coding-agent reliability.\nVerify: send a request with \u0026quot;tool_choice\u0026quot;: \u0026quot;required\u0026quot; — if HTTP 400, the restriction is active.\n6. Region-locked API keys# Symptom: HTTP 401 from a valid-looking API key. The key works in one context, fails in another.\nCause: Moonshot operates two regional endpoints with separate key namespaces:\nhttps://api.moonshot.ai/v1 — international tenant https://api.moonshot.cn/v1 — China tenant Keys issued on .ai do NOT work on .cn and vice versa. If you copy a curl example from the wrong region\u0026rsquo;s docs, you get 401 with no explanation.\nFix: pin the endpoint in your adapter and match it to your key\u0026rsquo;s origin:\nconst defaultEndpoint = \u0026#34;https://api.moonshot.ai/v1/chat/completions\u0026#34; // or \u0026#34;https://api.moonshot.cn/v1/chat/completions\u0026#34; if your key is China-regionVerify: curl -sS https://api.moonshot.ai/v1/models -H \u0026quot;Authorization: Bearer $KEY\u0026quot; — if 401, try the .cn endpoint. If 200, you have the right pairing.\n7. prompt_cache_key HURTS cost in coding workloads# Symptom: enabling prompt_cache_key with a stable per-task value increased observed cost by 44% on both tier-2 and tier-3 canaries. Quality also dropped (tier-3 pass rate 2/3 → 1/3).\nCause: Moonshot caches input prompts and bills cache-hit input at $0.16/M instead of $0.95/M (6× savings). For this to help, the first ~2K tokens of the prompt must be byte-identical across requests. In coding-agent workloads the prompt embeds spec content, file paths, and tool results that differ every cycle — so the cache never hits. Worse, the matrix observed a 44% cost INCREASE when the key was set, possibly because Moonshot bills cache-lookup attempts even on miss. The mechanism is not fully understood.\nFix: do not set prompt_cache_key for coding agents. If you need to test it for a stable-prompt workload (e.g. system-prompt-only queries with no embedded user content):\n// Probe before adopting: // 1. Send 5 identical requests with cache_key set. // 2. Inspect billing dashboard for cache_hit_tokens \u0026gt; 0. // 3. If 5/5 miss, the cache_key is not helping.Verify: instrument adapter to record usage.prompt_cache_hit_tokens vs usage.prompt_cache_miss_tokens per call. If hit-rate is \u0026lt;10%, remove the key.\n8. Default tier limited to 50 RPM# Symptom: bursty agent fleets hit HTTP 429 after a small number of concurrent requests. Tier-2 and tier-3 canary throughput stalls.\nCause: new Moonshot accounts default to a 50 RPM tier — fine for a single agent, hostile to a pool of 4+ pods making concurrent multi-turn tool calls. The tier is not visible in the dashboard until you exceed it.\nFix: email support@moonshot.ai with your account ID and request the production tier. Until granted, throttle concurrency in the adapter:\n// Per-pod rate limiter: limiter := rate.NewLimiter(rate.Every(2*time.Second), 1) // ~30 RPM, headroomCombine with the standard retry-with-backoff pattern for 429 responses, but bound total time at the call edge (single retry loop with per-attempt timeout can multiply stuck-time N× — see the adapter audit checklist).\nVerify: tail logs for HTTP 429 frequency. If 429s appear at low concurrency (≤4 pods), you are tier-limited.\n9. strict_tools: true is the right default for coding agents# Symptom: with strict: false on function definitions, tier-3 coding tasks regressed from 2/3 pass to 1/3 pass in the matrix. Tier-2 was unaffected.\nCause: when strict: false, Moonshot performs only JSON-validity checks on tool arguments, not schema enforcement. Malformed-but-parseable args propagate into conversation history and poison further generations. On heavy tasks where the model emits many tool calls per turn, even one bad call class compounds into session-level failure.\nFix:\ntype wireFunctionDef struct { Name string `json:\u0026#34;name\u0026#34;` Description string `json:\u0026#34;description,omitempty\u0026#34;` Parameters any `json:\u0026#34;parameters\u0026#34;` Strict bool `json:\u0026#34;strict\u0026#34;` // default true for coding agents } for _, t := range req.Tools { def := wireFunctionDef{ Name: t.Name, Description: t.Description, Parameters: t.Parameters, Strict: true, } // ... }The community guidance to use lenient mode for \u0026ldquo;compatibility\u0026rdquo; is wrong for this workload. Strict mode does not introduce 422s in well-formed schemas; it just catches the malformed calls that would otherwise poison the loop.\nVerify: run a 10-turn multi-tool task with strict: false, then again with strict: true. If the strict run completes more tool calls successfully, the lenient mode was hurting you.\nBonus: pricing summary# Item Rate Input (cache miss) $0.95/M Input (cache hit) $0.16/M Output $4.00/M Context window 256K Real rates as of 2026-05-20. Plug them into your cost tracker — the OpenAI-default Sonnet rates in many adapters over-bill kimi by ~5×, triggering false budget pauses.\nCommon Mistakes# Trusting kimi\u0026rsquo;s self-report about its own quirks. Asked \u0026ldquo;do you need reasoning_content echoed?\u0026rdquo; kimi answers false. Reality (per docs, LiteLLM #26156, and every framework we audited) is true. Verify against the actual 400 response, not the model\u0026rsquo;s introspection.\nAdopting prompt_cache_key on the assumption it always helps. Validated 44% cost regression in coding workloads. If your prompt is dynamic per-task, the cache never hits and the key actively hurts.\nDefaulting max_tokens to OpenAI\u0026rsquo;s 2048. Reasoning models share that budget with reasoning. 2048 silently truncates everything. The runtime sees \u0026ldquo;no tool calls\u0026rdquo; and concludes \u0026ldquo;model gave up\u0026rdquo; — wrong diagnosis, wrong fix.\nSkipping the rate-card audit. Adapters that fall back to Sonnet rates over-bill kimi by ~5×. The tracker\u0026rsquo;s cost number is fictional until you add an explicit {0.95, 4.00, ..., 0.16} entry for kimi-k2.6.\nRunning prompt-rich (d4/scaffolded) prompts on kimi. The matrix proved this is 12× more expensive on tier-2 with zero quality improvement. Kimi is a reasoning model — same asymmetry as sonnet, grok-reasoning, deepseek-reasoner. Thin directive prompts win; rich examples-heavy prompts hurt.\n","date":"2026-05-20","description":"Nine concrete Moonshot Kimi K2.6 quirks observed in production OFAT matrix runs — temperature locks, reasoning_content echo, max_tokens traps, region-locked keys, and prompt_cache_key cost regression.","lastmod":"2026-05-20","levels":["intermediate","advanced"],"reading_time_minutes":10,"section":"knowledge","skills":["llm-adapter-development","provider-integration","production-debugging"],"tags":["moonshot","kimi","kimi-k2","llm-quirks","reasoning-models","openai-compatible","production","thinking-mode"],"title":"Moonshot Kimi K2.6 Operational Quirks: What Breaks in Production","tools":["moonshot","kimi-k2.6","go"],"url":"https://agent-zone.ai/knowledge/agent-tooling/moonshot-kimi-k2.6-operational-quirks/","word_count":1964}}