Moonshot Kimi K2.6 Operational Quirks
Kimi K2.6 is one of the cheapest competent reasoning models — $0.95/M input cache-miss, $0.16/M cache-hit, $4.00/M output, 256K context. It is also one of the most opinionated. Half of what works on OpenAI breaks here, and the failures are silent: empty content, mid-reasoning truncation, 400 errors that don’t mention the actual problem, and a cache key parameter that makes cost go up instead of down.
This page is the production-confirmed list of quirks, each as Symptom → Cause → Fix → Verify. Numbers come from an OFAT matrix of 48 runs (8 cells × 2 canaries × N=3) executed 2026-05-20 against api.moonshot.ai. The full matrix synthesis is in dream-team/planning/kimi-matrix-v1-results-2026-05-20.md.
TL;DR: pattern-match before reading
- If your call returns HTTP 400 about `temperature`, set `temperature: 1.0`; it is hard-locked in thinking mode.
- If turn 2 of a tool loop returns 400 about `reasoning_content`, your adapter is stripping the field on round-trip.
- If `finish_reason == "length"` and `content == ""`, `max_tokens` is too low; raise to ≥32K.
- If your adapter sets `tool_choice: "required"` and gets 400, switch to `"auto"`; `"required"` is rejected in thinking mode.
- If you get HTTP 401 from a valid-looking key, you are pointing at the wrong region endpoint (`api.moonshot.ai` vs `api.moonshot.cn`).
- If your cost rose after enabling `prompt_cache_key`, remove it; the key raised cost by 44% in the matrix.
- If `tool_calls[]` is missing on a multi-turn task, you may be hitting the default 50 RPM tier limit; email support@moonshot.ai for production tier.
- Default `strict_tools: true` on coding agents; `strict: false` dropped tier-3 pass rate from 2/3 to 1/3 in the matrix.
1. temperature locked to 1.0 in thinking mode
Symptom: HTTP 400 with "invalid temperature: only 1 is allowed for this model" on every request that sets temperature to anything other than 1.0.
Cause: Moonshot’s thinking-mode API contract pins the sampling parameters. The kimi-k2.6 reasoning channel was trained at fixed sampler settings and the server rejects deviations: temperature must be 1.0, top_p is similarly pinned to 0.95, presence_penalty and frequency_penalty must be 0, and n must be 1. This is documented but not loud, so any adapter that sends a lower default such as temperature: 0.7 fails immediately.
Fix: when thinking.type: "enabled" (the default for kimi-k2.6), hardcode the constrained sampler:

    req := chatRequest{
        Model:       "kimi-k2.6",
        Messages:    messages,
        Temperature: 1.0,  // hard-required in thinking mode
        TopP:        0.95, // hard-required in thinking mode
        N:           1,    // hard-required in thinking mode
        // do NOT set PresencePenalty or FrequencyPenalty
        Thinking: &thinkingConfig{Type: "enabled", Keep: "all"},
    }

If you need lower-temperature behavior, disable thinking (thinking.type: "disabled"); that path accepts temperature: 0.6. But the matrix proved thinking-off drops tier-3 pass rate from 2/3 to 0/3; the reasoning channel is essential for hard work.
Verify: curl -sS https://api.moonshot.ai/v1/chat/completions -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d '{"model":"kimi-k2.6","messages":[{"role":"user","content":"hi"}],"temperature":0.7}' | jq .error — if you see "invalid temperature", the lock is active.
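One way to keep the lock from leaking into callers is to pin the sampler in a single adapter-side function. The sketch below is a minimal illustration, not Moonshot's API: `chatRequest` and `pinThinkingSampler` are hypothetical names, and the pinned values are the ones listed in the Cause above.

```go
package main

import "fmt"

// chatRequest is a hypothetical, trimmed-down request shape for illustration.
type chatRequest struct {
	Model            string
	ThinkingEnabled  bool
	Temperature      float64
	TopP             float64
	N                int
	PresencePenalty  float64
	FrequencyPenalty float64
}

// pinThinkingSampler overwrites the sampler fields with the values this page
// lists as required in thinking mode and returns the names of the fields it
// changed. Requests with thinking disabled are left untouched.
func pinThinkingSampler(req *chatRequest) []string {
	if !req.ThinkingEnabled {
		return nil
	}
	var changed []string
	if req.Temperature != 1.0 {
		req.Temperature = 1.0
		changed = append(changed, "temperature")
	}
	if req.TopP != 0.95 {
		req.TopP = 0.95
		changed = append(changed, "top_p")
	}
	if req.N != 1 {
		req.N = 1
		changed = append(changed, "n")
	}
	if req.PresencePenalty != 0 {
		req.PresencePenalty = 0
		changed = append(changed, "presence_penalty")
	}
	if req.FrequencyPenalty != 0 {
		req.FrequencyPenalty = 0
		changed = append(changed, "frequency_penalty")
	}
	return changed
}

func main() {
	req := chatRequest{Model: "kimi-k2.6", ThinkingEnabled: true, Temperature: 0.7, TopP: 1.0, N: 1}
	fmt.Println(pinThinkingSampler(&req), req.Temperature, req.TopP)
}
```

Returning the list of overwritten fields makes it cheap to log which callers are still sending OpenAI-style sampler settings.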
2. reasoning_content MUST round-trip on every assistant turn
Symptom: First turn of a tool loop succeeds. Second turn returns HTTP 400: "thinking is enabled but reasoning_content is missing in assistant tool call message at index N".
Cause: kimi-k2.6 in thinking mode emits a reasoning_content field on every assistant response, alongside content and tool_calls. On the next request, Moonshot requires the field be echoed back verbatim in the conversation history. Most OpenAI-shape adapters strip it because the standard OpenAI client library doesn’t know about it. This is documented in LiteLLM issue #26156 and confirmed by Moonshot’s own docs.
Fix: capture reasoning_content on response, re-emit on every assistant message in the request history:
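
    import "encoding/json"

    // wireMessage is the request-side message shape. ReasoningContent is
    // excluded from default encoding (json:"-") and re-added by MarshalJSON
    // for assistant turns only, so decoding it from responses needs its own path.
    type wireMessage struct {
        Role             string         `json:"role"`
        Content          string         `json:"content"`
        ReasoningContent string         `json:"-"`
        ToolCalls        []wireToolCall `json:"tool_calls,omitempty"`
    }

    func (m wireMessage) MarshalJSON() ([]byte, error) {
        type alias wireMessage // local type has no MarshalJSON, so no recursion
        raw, err := json.Marshal(alias(m))
        if err != nil {
            return nil, err
        }
        if m.Role != "assistant" {
            return raw, nil
        }
        var obj map[string]json.RawMessage
        if err := json.Unmarshal(raw, &obj); err != nil {
            return nil, err
        }
        rc, err := json.Marshal(m.ReasoningContent)
        if err != nil {
            return nil, err
        }
        obj["reasoning_content"] = rc // always echoed on assistant turns, even when empty
        return json.Marshal(obj)
    }

If you ask the model “do you need reasoning_content echoed?” it often answers false. That answer is wrong. Trust the 400 response, not the self-report.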
Verify: run a 3-turn tool-use trace. If turn 2 returns 400 mentioning reasoning_content, the round-trip is missing.
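The capture half of the round-trip can be sketched as follows. This is an illustration under assumptions: `responseMessage`, `historyMessage`, and `toHistory` are hypothetical names, and only the JSON field names (`reasoning_content`, `tool_calls`) come from this page. A pointer field is used so that assistant turns always carry the key, even when the reasoning is an empty string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// responseMessage decodes the assistant message from a completion response.
type responseMessage struct {
	Role             string          `json:"role"`
	Content          string          `json:"content"`
	ReasoningContent string          `json:"reasoning_content"`
	ToolCalls        json.RawMessage `json:"tool_calls,omitempty"`
}

// historyMessage is the shape appended to the next request's messages.
// A nil ReasoningContent is omitted; a non-nil pointer is always emitted.
type historyMessage struct {
	Role             string          `json:"role"`
	Content          string          `json:"content"`
	ReasoningContent *string         `json:"reasoning_content,omitempty"`
	ToolCalls        json.RawMessage `json:"tool_calls,omitempty"`
}

// toHistory converts a raw response message into a history entry, carrying
// reasoning_content across verbatim for assistant turns.
func toHistory(raw []byte) (historyMessage, error) {
	var rm responseMessage
	if err := json.Unmarshal(raw, &rm); err != nil {
		return historyMessage{}, err
	}
	h := historyMessage{Role: rm.Role, Content: rm.Content, ToolCalls: rm.ToolCalls}
	if rm.Role == "assistant" {
		rc := rm.ReasoningContent
		h.ReasoningContent = &rc
	}
	return h, nil
}

func main() {
	raw := []byte(`{"role":"assistant","content":"","reasoning_content":"plan: call list_files","tool_calls":[{"id":"call_1"}]}`)
	h, err := toHistory(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	out, _ := json.Marshal(h)
	fmt.Println(string(out))
}
```

Keeping `tool_calls` as `json.RawMessage` avoids re-modelling the tool-call schema just to pass it back unchanged.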
3. max_tokens includes reasoning tokens
Symptom: response has content: "" and finish_reason: "length". No error, no warning. completion_tokens equals exactly the configured max_tokens (smoking gun: round numbers like 2048, 4096, 8192).
Cause: in thinking mode, max_tokens covers reasoning AND content together. Reasoning routinely consumes 10-30K tokens on heavy multi-file tasks. At a typical adapter default of max_tokens: 2048, kimi spends the entire budget thinking and never emits a visible response or tool call. The runtime then treats it as “model gave up”, but it was a silent truncation.
Fix:

    # pod-builder-medium-3.yaml or equivalent
    models:
      max_output_tokens: 32000  # matrix-validated floor for tier-3 work

Adapter-side default:

    maxTokens := req.MaxTokens
    if maxTokens < 16000 { // also covers the unset case (0)
        maxTokens = 32000 // 96000 is Moonshot's documented default
    }

The matrix proved 16K truncates on tier-3 (0/3 pass), 32K is optimal (2/3 pass at $1.61/run), 64K wastes money (still 0/3 pass at +46% cost). 32K is the floor.
Verify: kubectl logs -l app=<agent> -c main --tail=500 | grep '"main: task complete"' | jq -r .output_tokens | sort | uniq -c | sort -rn — if the output clusters at exactly 2048 or 4096, you have the silent truncation.
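The symptom can also be caught per response instead of from logs. The sketch below is a hedged illustration: the `completion` struct and `silentlyTruncated` are hypothetical names, and the check simply encodes the three signals described in the Symptom (finish reason `length`, empty content, token count at the configured limit).

```go
package main

import "fmt"

// completion is a hypothetical, flattened view of one response.
type completion struct {
	FinishReason     string
	Content          string
	ToolCallCount    int
	CompletionTokens int
}

// silentlyTruncated reports whether a response matches the symptom above:
// the budget ran out ("length") before any visible content or tool call.
func silentlyTruncated(c completion, maxTokens int) bool {
	return c.FinishReason == "length" &&
		c.Content == "" &&
		c.ToolCallCount == 0 &&
		c.CompletionTokens >= maxTokens
}

func main() {
	c := completion{FinishReason: "length", CompletionTokens: 2048}
	if silentlyTruncated(c, 2048) {
		fmt.Println("truncated during reasoning: raise max_tokens instead of treating this as a model failure")
	}
}
```

Surfacing this as its own error class keeps the runtime from recording a truncation as "model gave up".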
4. thinking.keep: "all" required for multi-turn tool use
Symptom: multi-turn tool flows work for ~2 turns then start dropping reasoning context. Model behavior degrades: forgets earlier tool results, repeats already-executed actions, or fails to compose multi-step reasoning.
Cause: thinking.keep controls how reasoning history is retained across turns. Default behavior in some adapter shapes drops older reasoning blocks. For multi-turn coding agents, this destroys the chain-of-thought that makes the reasoning channel useful.
Fix:

    type thinkingConfig struct {
        Type string `json:"type"` // "enabled" | "disabled"
        Keep string `json:"keep"` // MUST be "all" for multi-turn tool use
    }

    req.Thinking = &thinkingConfig{Type: "enabled", Keep: "all"}

Verify: trace a 5-turn tool conversation. Inspect each request’s assistant messages — every prior assistant turn should still carry its original reasoning_content.
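The trace inspection described in Verify can be automated with a small audit over the outgoing history. This is a sketch under assumptions: `message` and `assistantTurnsMissingReasoning` are hypothetical names, and it treats an empty string as missing, which may be too strict if the model legitimately returns empty reasoning on some turns.

```go
package main

import "fmt"

// message is a hypothetical, trimmed-down history entry.
type message struct {
	Role             string
	ReasoningContent string
}

// assistantTurnsMissingReasoning returns the indexes of assistant messages
// whose reasoning_content is empty.
func assistantTurnsMissingReasoning(history []message) []int {
	var missing []int
	for i, m := range history {
		if m.Role == "assistant" && m.ReasoningContent == "" {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	history := []message{
		{Role: "user"},
		{Role: "assistant", ReasoningContent: "need to list files first"},
		{Role: "tool"},
		{Role: "assistant"}, // reasoning was dropped somewhere in the adapter
	}
	fmt.Println(assistantTurnsMissingReasoning(history))
}
```

Running the audit just before each request is sent points at the turn where reasoning was dropped, rather than the turn where behavior visibly degrades.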
5. tool_choice restricted to "auto" or "none" in thinking mode
Symptom: HTTP 400 when setting tool_choice: "required" or tool_choice: {"type":"function","function":{"name":"..."}}.
Cause: thinking-mode kimi only accepts the loose tool-choice values. The forcing variants are rejected.
Fix:

    // In thinking mode:
    req.ToolChoice = "auto" // or "none" to suppress tools entirely
    // "required" or a named-function force is illegal.

If you genuinely need to force a tool call, disable thinking mode for that request. The matrix proved strict_tools: true is a better lever than tool_choice: required for coding-agent reliability.
Verify: send a request with "tool_choice": "required" — if HTTP 400, the restriction is active.
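The two options from the Fix (downgrade to "auto", or disable thinking when forcing is genuinely needed) can be folded into one decision function. A minimal sketch, assuming a hypothetical `toolChoice` struct and a caller-supplied `mustForce` flag; neither is part of Moonshot's API.

```go
package main

import "fmt"

// toolChoice is a hypothetical adapter-side representation.
type toolChoice struct {
	Mode     string // "auto", "none", "required", or "function"
	Function string // only used when Mode == "function"
}

// planToolChoice returns the tool_choice to send and whether thinking should
// stay enabled. In thinking mode, forcing variants are either downgraded to
// "auto" or kept with thinking switched off, depending on mustForce.
func planToolChoice(requested toolChoice, thinking, mustForce bool) (toolChoice, bool) {
	if !thinking {
		return requested, false
	}
	if requested.Mode == "auto" || requested.Mode == "none" {
		return requested, true
	}
	if mustForce {
		return requested, false // keep the force, pay for it by disabling thinking
	}
	return toolChoice{Mode: "auto"}, true
}

func main() {
	choice, thinking := planToolChoice(toolChoice{Mode: "required"}, true, false)
	fmt.Println(choice.Mode, thinking)
}
```

Since the matrix showed thinking-off hurting tier-3 pass rate, `mustForce` is best reserved for requests where a missing tool call is worse than weaker reasoning.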
6. Region-locked API keys
Symptom: HTTP 401 from a valid-looking API key. The key works in one context, fails in another.
Cause: Moonshot operates two regional endpoints with separate key namespaces:
- https://api.moonshot.ai/v1 (international tenant)
- https://api.moonshot.cn/v1 (China tenant)
Keys issued on .ai do NOT work on .cn and vice versa. If you copy a curl example from the wrong region’s docs, you get 401 with no explanation.
Fix: pin the endpoint in your adapter and match it to your key’s origin:

    const defaultEndpoint = "https://api.moonshot.ai/v1/chat/completions"
    // or "https://api.moonshot.cn/v1/chat/completions" if your key is China-region

Verify: curl -sS https://api.moonshot.ai/v1/models -H "Authorization: Bearer $KEY" — if 401, try the .cn endpoint. If 200, you have the right pairing.
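The Verify step can be run once at adapter startup. The sketch below is illustrative: `pickEndpoint`, `statusFunc`, and `httpStatus` are hypothetical names, and the probe is injected so the selection logic can be exercised without network access. Note that probing both regions sends the key to a tenant that did not issue it, so pinning the endpoint in configuration remains the safer default.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var endpoints = []string{"https://api.moonshot.ai/v1", "https://api.moonshot.cn/v1"}

// statusFunc returns the HTTP status of GET <base>/models for some key.
type statusFunc func(base string) (int, error)

// pickEndpoint returns the first regional base URL that answers 200.
// A 401 moves on to the next region; any other status aborts.
func pickEndpoint(status statusFunc) (string, error) {
	for _, base := range endpoints {
		code, err := status(base)
		if err != nil {
			return "", fmt.Errorf("probe %s: %w", base, err)
		}
		if code == http.StatusOK {
			return base, nil
		}
		if code != http.StatusUnauthorized {
			return "", fmt.Errorf("probe %s: unexpected status %d", base, code)
		}
	}
	return "", errors.New("key rejected (401) by both regional endpoints")
}

// httpStatus builds the real probe; it is not called in main to keep the
// example offline.
func httpStatus(client *http.Client, key string) statusFunc {
	return func(base string) (int, error) {
		req, err := http.NewRequest(http.MethodGet, base+"/models", nil)
		if err != nil {
			return 0, err
		}
		req.Header.Set("Authorization", "Bearer "+key)
		resp, err := client.Do(req)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

func main() {
	// Offline demo: pretend the key belongs to the .cn tenant.
	fake := func(base string) (int, error) {
		if base == "https://api.moonshot.cn/v1" {
			return http.StatusOK, nil
		}
		return http.StatusUnauthorized, nil
	}
	fmt.Println(pickEndpoint(fake))
}
```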
7. prompt_cache_key HURTS cost in coding workloads
Symptom: enabling prompt_cache_key with a stable per-task value increased observed cost by 44% on both tier-2 and tier-3 canaries. Quality also dropped (tier-3 pass rate 2/3 → 1/3).
Cause: Moonshot caches input prompts and bills cache-hit input at $0.16/M instead of $0.95/M (6× savings). For this to help, the first ~2K tokens of the prompt must be byte-identical across requests. In coding-agent workloads the prompt embeds spec content, file paths, and tool results that differ every cycle — so the cache never hits. Worse, the matrix observed a 44% cost INCREASE when the key was set, possibly because Moonshot bills cache-lookup attempts even on miss. The mechanism is not fully understood.
Fix: do not set prompt_cache_key for coding agents. If you need to test it for a stable-prompt workload (e.g. system-prompt-only queries with no embedded user content):

    // Probe before adopting:
    // 1. Send 5 identical requests with cache_key set.
    // 2. Inspect billing dashboard for cache_hit_tokens > 0.
    // 3. If 5/5 miss, the cache_key is not helping.

Verify: instrument adapter to record usage.prompt_cache_hit_tokens vs usage.prompt_cache_miss_tokens per call. If hit-rate is <10%, remove the key.
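The instrumentation in Verify might look like the following. It is a sketch: the `usage` field names are the ones quoted above, while `cacheStats` and its methods are hypothetical, and the 10% threshold is the rule of thumb from this section rather than a Moonshot-documented value.

```go
package main

import "fmt"

// usage mirrors the two usage fields named in the Verify step.
type usage struct {
	PromptCacheHitTokens  int `json:"prompt_cache_hit_tokens"`
	PromptCacheMissTokens int `json:"prompt_cache_miss_tokens"`
}

// cacheStats accumulates hit and miss tokens across calls.
type cacheStats struct {
	hit, miss int
}

func (s *cacheStats) record(u usage) {
	s.hit += u.PromptCacheHitTokens
	s.miss += u.PromptCacheMissTokens
}

// hitRate is the share of input tokens served from cache (0 when empty).
func (s cacheStats) hitRate() float64 {
	total := s.hit + s.miss
	if total == 0 {
		return 0
	}
	return float64(s.hit) / float64(total)
}

// shouldDropKey applies the <10% rule once any traffic has been recorded.
func (s cacheStats) shouldDropKey() bool {
	return s.hit+s.miss > 0 && s.hitRate() < 0.10
}

func main() {
	var s cacheStats
	for i := 0; i < 5; i++ {
		s.record(usage{PromptCacheMissTokens: 2000}) // five probes, all misses
	}
	fmt.Printf("hit rate %.2f, drop key: %v\n", s.hitRate(), s.shouldDropKey())
}
```

Weighting by tokens rather than by call count matches how the cache is billed: one large hit matters more than several tiny ones.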
8. Default tier limited to 50 RPM
Symptom: bursty agent fleets hit HTTP 429 after a small number of concurrent requests. Tier-2 and tier-3 canary throughput stalls.
Cause: new Moonshot accounts default to a 50 RPM tier — fine for a single agent, hostile to a pool of 4+ pods making concurrent multi-turn tool calls. The tier is not visible in the dashboard until you exceed it.
Fix: email support@moonshot.ai with your account ID and request the production tier. Until granted, throttle concurrency in the adapter:

    // Per-pod rate limiter (rate is golang.org/x/time/rate):
    limiter := rate.NewLimiter(rate.Every(2*time.Second), 1) // ~30 RPM, headroom

Combine with the standard retry-with-backoff pattern for 429 responses, but bound total time at the call edge: a single retry loop with a per-attempt timeout can multiply stuck time N× (see the adapter audit checklist).
Verify: tail logs for HTTP 429 frequency. If 429s appear at low concurrency (≤4 pods), you are tier-limited.
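Bounding total time at the call edge can be sketched with one context deadline that covers every attempt and every backoff sleep. The names here (`retry429`, `simulate`, `errRateLimited`) are hypothetical, and the fake call stands in for the real HTTP request; the delays are shortened so the example runs quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errRateLimited = errors.New("rate limited (HTTP 429)")

// retry429 calls fn until it succeeds, fails with a non-429 error, or ctx
// expires. The delay doubles after each 429. It returns the attempt count.
func retry429(ctx context.Context, base time.Duration, fn func(context.Context) error) (int, error) {
	delay := base
	for attempt := 1; ; attempt++ {
		err := fn(ctx)
		if err == nil || !errors.Is(err, errRateLimited) {
			return attempt, err
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err() // total budget spent, stop retrying
		case <-time.After(delay):
		}
		delay *= 2
	}
}

// simulate runs retry429 against a fake call that returns 429 for the first
// `failures` attempts, under a single total budget in milliseconds.
func simulate(failures, budgetMs int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(budgetMs)*time.Millisecond)
	defer cancel()
	calls := 0
	return retry429(ctx, time.Millisecond, func(context.Context) error {
		calls++
		if calls <= failures {
			return errRateLimited
		}
		return nil
	})
}

func main() {
	fmt.Println(simulate(2, 1000)) // recovers after two 429s
	fmt.Println(simulate(1000, 20)) // never recovers, stops at the budget
}
```

Because the deadline lives on the context passed into every attempt, per-attempt timeouts cannot stack up beyond the total budget.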
9. strict_tools: true is the right default for coding agents
Symptom: with strict: false on function definitions, tier-3 coding tasks regressed from 2/3 pass to 1/3 pass in the matrix. Tier-2 was unaffected.
Cause: when strict: false, Moonshot performs only JSON-validity checks on tool arguments, not schema enforcement. Malformed-but-parseable args propagate into conversation history and poison further generations. On heavy tasks where the model emits many tool calls per turn, even one bad call class compounds into session-level failure.
Fix:

    type wireFunctionDef struct {
        Name        string `json:"name"`
        Description string `json:"description,omitempty"`
        Parameters  any    `json:"parameters"`
        Strict      bool   `json:"strict"` // default true for coding agents
    }

    for _, t := range req.Tools {
        def := wireFunctionDef{
            Name:        t.Name,
            Description: t.Description,
            Parameters:  t.Parameters,
            Strict:      true,
        }
        // ...
    }

The community guidance to use lenient mode for “compatibility” is wrong for this workload. Strict mode does not introduce 422s in well-formed schemas; it just catches the malformed calls that would otherwise poison the loop.
Verify: run a 10-turn multi-tool task with strict: false, then again with strict: true. If the strict run completes more tool calls successfully, the lenient mode was hurting you.
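To see what "malformed-but-parseable" means in practice, the sketch below runs a client-side check on tool-call arguments before they are appended to history. It is an illustration only, not a substitute for server-side strict mode: `checkArgs` is a hypothetical helper that verifies the arguments are a JSON object and that required keys are present, nothing more.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkArgs verifies that rawArgs is a JSON object containing every key in
// required. It does not validate value types or nested schemas.
func checkArgs(rawArgs string, required []string) error {
	var args map[string]json.RawMessage
	if err := json.Unmarshal([]byte(rawArgs), &args); err != nil {
		return fmt.Errorf("arguments are not a JSON object: %w", err)
	}
	for _, key := range required {
		if _, ok := args[key]; !ok {
			return fmt.Errorf("missing required argument %q", key)
		}
	}
	return nil
}

func main() {
	required := []string{"path", "content"}
	// Valid JSON, so a JSON-validity check passes, but the call is unusable.
	fmt.Println(checkArgs(`{"path":"main.go"}`, required))
	fmt.Println(checkArgs(`{"path":"main.go","content":"package main"}`, required))
}
```

Logging these rejections alongside pass rates is one way to confirm whether lenient mode is the source of a regression.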
Bonus: pricing summary
| Item | Value |
|---|---|
| Input (cache miss) | $0.95/M |
| Input (cache hit) | $0.16/M |
| Output | $4.00/M |
| Context window | 256K |
Real rates as of 2026-05-20. Plug them into your cost tracker: the Sonnet rates that many adapters fall back to by default over-bill kimi by ~5×, triggering false budget pauses.
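A rate-card entry built from the table above can be sketched as follows. The `rateCard` struct and `costUSD` method are hypothetical and will not match the field layout of any particular cost tracker; only the three rates come from this page.

```go
package main

import "fmt"

// rateCard holds prices in USD per million tokens.
type rateCard struct {
	InputMiss float64
	InputHit  float64
	Output    float64
}

// kimiK26 uses the rates from the pricing table (as of 2026-05-20).
var kimiK26 = rateCard{InputMiss: 0.95, InputHit: 0.16, Output: 4.00}

// costUSD prices one call from its token counts.
func (r rateCard) costUSD(missTokens, hitTokens, outputTokens int) float64 {
	const perMillion = 1_000_000.0
	return (float64(missTokens)*r.InputMiss +
		float64(hitTokens)*r.InputHit +
		float64(outputTokens)*r.Output) / perMillion
}

func main() {
	// Example call: 40K uncached input, no cache hits, 32K output.
	fmt.Printf("$%.4f\n", kimiK26.costUSD(40_000, 0, 32_000))
}
```

With a 32K output budget, output tokens dominate the bill, which is why an unnecessarily large max_tokens shows up directly as cost.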
Common Mistakes
Trusting kimi’s self-report about its own quirks. Asked “do you need reasoning_content echoed?” kimi answers false. Reality (per docs, LiteLLM #26156, and every framework we audited) is true. Verify against the actual 400 response, not the model’s introspection.
Adopting prompt_cache_key on the assumption it always helps. Validated 44% cost regression in coding workloads. If your prompt is dynamic per-task, the cache never hits and the key actively hurts.
Defaulting max_tokens to an OpenAI-style 2048. In thinking mode that budget covers reasoning and visible output together, so 2048 silently truncates everything. The runtime sees “no tool calls” and concludes “model gave up”: wrong diagnosis, wrong fix.
Skipping the rate-card audit. Adapters that fall back to Sonnet rates over-bill kimi by ~5×. The tracker’s cost number is fictional until you add an explicit {0.95, 4.00, ..., 0.16} entry for kimi-k2.6.
Running prompt-rich (d4/scaffolded) prompts on kimi. The matrix proved this is 12× more expensive on tier-2 with zero quality improvement. Kimi is a reasoning model — same asymmetry as sonnet, grok-reasoning, deepseek-reasoner. Thin directive prompts win; rich examples-heavy prompts hurt.