Moonshot Kimi K2.6 Operational Quirks

Kimi K2.6 is one of the cheapest competent reasoning models — $0.95/M input cache-miss, $0.16/M cache-hit, $4.00/M output, 256K context. It is also one of the most opinionated. Half of what works on OpenAI breaks here, and the failures are silent: empty content, mid-reasoning truncation, 400 errors that don’t mention the actual problem, and a cache key parameter that makes cost go up instead of down.

This page is the production-confirmed list of quirks, each as Symptom → Cause → Fix → Verify. Numbers come from an OFAT matrix of 48 runs (8 cells × 2 canaries × N=3) executed 2026-05-20 against api.moonshot.ai. The full matrix synthesis is in dream-team/planning/kimi-matrix-v1-results-2026-05-20.md.

TL;DR: pattern-match before reading

  • If your call returns HTTP 400 about temperature, set temperature: 1.0; it is hard-locked in thinking mode
  • If turn 2 of a tool loop returns 400 about reasoning_content, your adapter is stripping the field on round-trip
  • If finish_reason == "length" and content == "", max_tokens is too low — raise to ≥32K
  • If your adapter sets tool_choice: "required" and gets 400, switch to "auto"; "required" is rejected in thinking mode
  • If you get HTTP 401 from a valid-looking key, you are pointing the wrong region endpoint (api.moonshot.ai vs api.moonshot.cn)
  • If your cost rose after enabling prompt_cache_key, remove it; setting the key raised cost by 44% in the matrix
  • If tool_calls[] is missing on a multi-turn task, you may be hitting the default 50 RPM tier limit — email support@moonshot.ai for production tier
  • Default strict_tools: true on coding agents — strict: false dropped tier-3 pass rate from 2/3 to 1/3 in the matrix

1. temperature locked to 1.0 in thinking mode

Symptom: HTTP 400 with "invalid temperature: only 1 is allowed for this model" on every request that sets temperature to anything other than 1.0.

Cause: Moonshot’s thinking-mode API contract pins the sampling parameters. The kimi-k2.6 reasoning channel was trained at fixed sampler settings and the server rejects deviations: temperature must be 1.0, top_p is pinned to 0.95, presence_penalty and frequency_penalty must be 0, and n must be 1. This is documented but easy to miss, so an adapter that carries over a common default such as temperature: 0.7 fails immediately.

Fix: when thinking.type: "enabled" (the default for kimi-k2.6), hardcode the constrained sampler:

req := chatRequest{
    Model:       "kimi-k2.6",
    Messages:    messages,
    Temperature: 1.0,    // hard-required in thinking mode
    TopP:        0.95,   // hard-required in thinking mode
    N:           1,      // hard-required in thinking mode
    // do NOT set PresencePenalty or FrequencyPenalty
    Thinking: &thinkingConfig{Type: "enabled", Keep: "all"}, // same type as in quirk 4
}

If you need lower-temperature behavior, disable thinking (thinking.type: "disabled") — that path accepts temperature: 0.6. But the matrix proved thinking-off drops tier-3 pass rate from 2/3 to 0/3; the reasoning channel is essential for hard work.
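One way to keep callers from tripping the lock is to normalize sampler values in the adapter just before the request is built. The sketch below assumes a hypothetical samplerParams struct and normalizeSampler helper (neither is part of any Moonshot SDK); the pinned values are the ones listed under Cause.

```go
package main

// samplerParams is a hypothetical adapter-side struct; the field names are
// illustrative and not part of any Moonshot SDK.
type samplerParams struct {
	Temperature      float64
	TopP             float64
	N                int
	PresencePenalty  float64
	FrequencyPenalty float64
}

// normalizeSampler overwrites caller-supplied sampling values with the ones
// thinking mode accepts. With thinking disabled it returns the input unchanged.
func normalizeSampler(p samplerParams, thinkingEnabled bool) samplerParams {
	if !thinkingEnabled {
		return p
	}
	return samplerParams{
		Temperature:      1.0,  // only accepted value in thinking mode
		TopP:             0.95, // pinned
		N:                1,    // pinned
		PresencePenalty:  0,    // must be 0
		FrequencyPenalty: 0,    // must be 0
	}
}
```

Overwriting silently is a design choice; an adapter that prefers loud failures could return an error instead when the caller's values differ from the pinned ones.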

Verify: curl -sS https://api.moonshot.ai/v1/chat/completions -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d '{"model":"kimi-k2.6","messages":[{"role":"user","content":"hi"}],"temperature":0.7}' | jq .error — if you see "invalid temperature", the lock is active.

2. reasoning_content MUST round-trip on every assistant turn

Symptom: First turn of a tool loop succeeds. Second turn returns HTTP 400: "thinking is enabled but reasoning_content is missing in assistant tool call message at index N".

Cause: kimi-k2.6 in thinking mode emits a reasoning_content field on every assistant response, alongside content and tool_calls. On the next request, Moonshot requires the field be echoed back verbatim in the conversation history. Most OpenAI-shape adapters strip it because the standard OpenAI client library doesn’t know about it. This is documented in LiteLLM issue #26156 and confirmed by Moonshot’s own docs.

Fix: capture reasoning_content on response, re-emit on every assistant message in the request history:

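type wireMessage struct {
    Role             string         `json:"role"`
    Content          string         `json:"content"`
    ReasoningContent string         `json:"-"` // injected by MarshalJSON on assistant turns only
    ToolCalls        []wireToolCall `json:"tool_calls,omitempty"`
}

// MarshalJSON re-emits reasoning_content on assistant messages so the field
// round-trips in the request history. Other roles are serialized unchanged.
func (m wireMessage) MarshalJSON() ([]byte, error) {
    type alias wireMessage // alias has no MarshalJSON method, so this does not recurse
    raw, err := json.Marshal(alias(m))
    if err != nil {
        return nil, err
    }
    if m.Role != "assistant" {
        return raw, nil
    }
    var obj map[string]json.RawMessage
    if err := json.Unmarshal(raw, &obj); err != nil {
        return nil, err
    }
    rc, err := json.Marshal(m.ReasoningContent)
    if err != nil {
        return nil, err
    }
    obj["reasoning_content"] = rc
    return json.Marshal(obj)
}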

If you ask the model “do you need reasoning_content echoed?” it often answers false. That answer is wrong. Trust the 400 response, not the self-report.
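The capture half of the round-trip can be sketched as follows. This assumes the response follows the OpenAI-compatible choices[].message shape and uses hypothetical type and function names (chatResponse, historyEntry, appendAssistantTurn); tool_calls are left out for brevity, and serializing the stored entry back onto the wire is a separate step.

```go
package main

import "encoding/json"

// responseMessage is a hypothetical response-side shape. Only the role,
// content, and reasoning_content field names come from the text above.
type responseMessage struct {
	Role             string `json:"role"`
	Content          string `json:"content"`
	ReasoningContent string `json:"reasoning_content"`
}

// chatResponse assumes the OpenAI-compatible choices[].message layout.
type chatResponse struct {
	Choices []struct {
		Message responseMessage `json:"message"`
	} `json:"choices"`
}

// historyEntry is what the adapter stores and later replays in requests.
type historyEntry struct {
	Role             string
	Content          string
	ReasoningContent string
}

// appendAssistantTurn decodes a raw response body and appends the first
// choice to history with reasoning_content preserved verbatim.
func appendAssistantTurn(history []historyEntry, body []byte) ([]historyEntry, error) {
	var resp chatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return history, err
	}
	if len(resp.Choices) == 0 {
		return history, nil // nothing to record
	}
	m := resp.Choices[0].Message
	return append(history, historyEntry{
		Role:             m.Role,
		Content:          m.Content,
		ReasoningContent: m.ReasoningContent,
	}), nil
}
```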

Verify: run a 3-turn tool-use trace. If turn 2 returns 400 mentioning reasoning_content, the round-trip is missing.

3. max_tokens includes reasoning tokens

Symptom: response has content: "" and finish_reason: "length". No error, no warning. completion_tokens equals exactly the configured max_tokens (smoking gun: round numbers like 2048, 4096, 8192).

Cause: in thinking mode, max_tokens covers reasoning AND content together. Reasoning routinely consumes 10-30K tokens on heavy multi-file tasks. At the OpenAI-default max_tokens: 2048, kimi spends the entire budget thinking and never emits a visible response or tool call. The runtime then treats it as “model gave up” — but it was a silent truncation.

Fix:

# pod-builder-medium-3.yaml or equivalent
models:
  max_output_tokens: 32000   # matrix-validated floor for tier-3 work

Adapter-side default:

maxTokens := req.MaxTokens
if maxTokens < 32000 { // also covers the unset (0) case
    maxTokens = 32000 // matrix-validated floor; 96000 is Moonshot's documented default
}

The matrix showed that 16K truncates on tier-3 (0/3 pass), 32K is optimal (2/3 pass at $1.61/run), and 64K wastes money (0/3 pass at +46% cost). Treat 32K as the floor.
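Because the truncation produces no error, it helps to detect it explicitly rather than let the runtime conclude that the model gave up. A minimal sketch, assuming a hypothetical flattened completion struct that the adapter fills from finish_reason, content, tool_calls, and usage.completion_tokens:

```go
package main

// completion is a hypothetical, flattened view of one chat completion.
type completion struct {
	FinishReason     string
	Content          string
	ToolCallCount    int
	CompletionTokens int
}

// isSilentTruncation reports whether the response stopped on length with no
// visible text and no tool calls, i.e. the budget went to reasoning.
func isSilentTruncation(c completion) bool {
	return c.FinishReason == "length" && c.Content == "" && c.ToolCallCount == 0
}

// hitBudgetExactly reports the "round number" signal: completion tokens equal
// to the configured max_tokens.
func hitBudgetExactly(c completion, maxTokens int) bool {
	return maxTokens > 0 && c.CompletionTokens == maxTokens
}
```

Logging both signals together makes the Verify step a one-line query instead of a guess.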

Verify: kubectl logs -l app=<agent> -c main --tail=500 | grep '"main: task complete"' | jq -r .output_tokens | sort | uniq -c | sort -rn — if the output clusters at exactly 2048 or 4096, you have the silent truncation.

4. thinking.keep: "all" required for multi-turn tool use

Symptom: multi-turn tool flows work for ~2 turns then start dropping reasoning context. Model behavior degrades: forgets earlier tool results, repeats already-executed actions, or fails to compose multi-step reasoning.

Cause: thinking.keep controls how reasoning history is retained across turns. Default behavior in some adapter shapes drops older reasoning blocks. For multi-turn coding agents, this destroys the chain-of-thought that makes the reasoning channel useful.

Fix:

type thinkingConfig struct {
    Type string `json:"type"` // "enabled" | "disabled"
    Keep string `json:"keep"` // MUST be "all" for multi-turn tool use
}

req.Thinking = &thinkingConfig{Type: "enabled", Keep: "all"}

Verify: trace a 5-turn tool conversation. Inspect each request’s assistant messages — every prior assistant turn should still carry its original reasoning_content.
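The inspection can be automated against captured request bodies. The sketch below uses a hypothetical missingReasoning helper; it only checks that each assistant message carries a non-empty reasoning_content, not that the text matches what the model originally returned.

```go
package main

import "encoding/json"

// missingReasoning parses an outgoing request body and returns the indexes of
// assistant messages whose reasoning_content is absent, null, or empty.
func missingReasoning(requestBody []byte) ([]int, error) {
	var req struct {
		Messages []struct {
			Role             string  `json:"role"`
			ReasoningContent *string `json:"reasoning_content"`
		} `json:"messages"`
	}
	if err := json.Unmarshal(requestBody, &req); err != nil {
		return nil, err
	}
	var missing []int
	for i, m := range req.Messages {
		if m.Role != "assistant" {
			continue // only assistant turns carry reasoning
		}
		if m.ReasoningContent == nil || *m.ReasoningContent == "" {
			missing = append(missing, i)
		}
	}
	return missing, nil
}
```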

5. tool_choice restricted to "auto" or "none" in thinking mode

Symptom: HTTP 400 when setting tool_choice: "required" or tool_choice: {"type":"function","function":{"name":"..."}}.

Cause: thinking-mode kimi only accepts the loose tool-choice values. The forcing variants are rejected.

Fix:

// In thinking mode:
req.ToolChoice = "auto"  // or "none" to suppress tools entirely

// "required" or a named-function force is illegal.

If you genuinely need to force a tool call, disable thinking mode for that request. The matrix proved strict_tools: true is a better lever than tool_choice: required for coding-agent reliability.
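A pre-flight guard turns the opaque 400 into a local, descriptive error. This is a sketch with a hypothetical checkToolChoice helper; treating an unset value as acceptable is an assumption about the server default, not something the matrix tested.

```go
package main

import "fmt"

// checkToolChoice validates the tool_choice value the caller intends to send:
// a string such as "auto", or a map for a named-function force.
func checkToolChoice(toolChoice any, thinkingEnabled bool) error {
	if !thinkingEnabled {
		return nil // forcing variants are only rejected in thinking mode
	}
	switch v := toolChoice.(type) {
	case nil:
		return nil // unset; assumed to fall back to the server default
	case string:
		if v == "auto" || v == "none" {
			return nil
		}
		return fmt.Errorf("tool_choice %q is rejected in thinking mode; use \"auto\" or \"none\"", v)
	default:
		return fmt.Errorf("named-function tool_choice is rejected in thinking mode")
	}
}
```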

Verify: send a request with "tool_choice": "required" — if HTTP 400, the restriction is active.

6. Region-locked API keys

Symptom: HTTP 401 from a valid-looking API key. The key works in one context, fails in another.

Cause: Moonshot operates two regional endpoints with separate key namespaces:

  • https://api.moonshot.ai/v1 — international tenant
  • https://api.moonshot.cn/v1 — China tenant

Keys issued on .ai do NOT work on .cn and vice versa. If you copy a curl example from the wrong region’s docs, you get 401 with no explanation.

Fix: pin the endpoint in your adapter and match it to your key’s origin:

const defaultEndpoint = "https://api.moonshot.ai/v1/chat/completions"
// or "https://api.moonshot.cn/v1/chat/completions" if your key is China-region

Verify: curl -sS https://api.moonshot.ai/v1/models -H "Authorization: Bearer $KEY" — if 401, try the .cn endpoint. If 200, you have the right pairing.

7. prompt_cache_key HURTS cost in coding workloads

Symptom: enabling prompt_cache_key with a stable per-task value increased observed cost by 44% on both tier-2 and tier-3 canaries. Quality also dropped (tier-3 pass rate 2/3 → 1/3).

Cause: Moonshot caches input prompts and bills cache-hit input at $0.16/M instead of $0.95/M (6× savings). For this to help, the first ~2K tokens of the prompt must be byte-identical across requests. In coding-agent workloads the prompt embeds spec content, file paths, and tool results that differ every cycle — so the cache never hits. Worse, the matrix observed a 44% cost INCREASE when the key was set, possibly because Moonshot bills cache-lookup attempts even on miss. The mechanism is not fully understood.

Fix: do not set prompt_cache_key for coding agents. If you need to test it for a stable-prompt workload (e.g. system-prompt-only queries with no embedded user content):

// Probe before adopting:
// 1. Send 5 identical requests with cache_key set.
// 2. Inspect billing dashboard for cache_hit_tokens > 0.
// 3. If 5/5 miss, the cache_key is not helping.

Verify: instrument the adapter to record usage.prompt_cache_hit_tokens vs usage.prompt_cache_miss_tokens per call. If the hit rate is below 10%, remove the key.
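The hit-rate check can be sketched as a small accumulator. The cacheUsage type and its methods are hypothetical; the two counters are the usage fields named in the Verify step, and the 10% threshold is the one stated there.

```go
package main

// cacheUsage accumulates the two usage counters across calls.
type cacheUsage struct {
	HitTokens  int64 // sum of usage.prompt_cache_hit_tokens
	MissTokens int64 // sum of usage.prompt_cache_miss_tokens
}

// add records one call's counters.
func (u *cacheUsage) add(hit, miss int64) {
	u.HitTokens += hit
	u.MissTokens += miss
}

// hitRate returns hit tokens over total input tokens, or 0 when nothing has
// been recorded yet.
func (u *cacheUsage) hitRate() float64 {
	total := u.HitTokens + u.MissTokens
	if total == 0 {
		return 0
	}
	return float64(u.HitTokens) / float64(total)
}

// shouldRemoveKey applies the below-10% rule.
func (u *cacheUsage) shouldRemoveKey() bool {
	return u.hitRate() < 0.10
}
```

The rate here is token-weighted rather than per-call, which matches how the savings are billed.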

8. Default tier limited to 50 RPM

Symptom: bursty agent fleets hit HTTP 429 after a small number of concurrent requests. Tier-2 and tier-3 canary throughput stalls.

Cause: new Moonshot accounts default to a 50 RPM tier — fine for a single agent, hostile to a pool of 4+ pods making concurrent multi-turn tool calls. The tier is not visible in the dashboard until you exceed it.

Fix: email support@moonshot.ai with your account ID and request the production tier. Until granted, throttle concurrency in the adapter:

// Per-pod rate limiter:
limiter := rate.NewLimiter(rate.Every(2*time.Second), 1)  // ~30 RPM per pod; with several pods, split the 50 RPM budget so the fleet total stays under it

Combine with the standard retry-with-backoff pattern for 429 responses, but bound the total time at the call edge: a retry loop that has only a per-attempt timeout can multiply stuck time by the number of attempts (see the adapter audit checklist).

Verify: tail logs for HTTP 429 frequency. If 429s appear at low concurrency (≤4 pods), you are tier-limited.

9. strict_tools: true is the right default for coding agents

Symptom: with strict: false on function definitions, tier-3 coding tasks regressed from 2/3 pass to 1/3 pass in the matrix. Tier-2 was unaffected.

Cause: when strict: false, Moonshot performs only JSON-validity checks on tool arguments, not schema enforcement. Malformed-but-parseable args propagate into conversation history and poison further generations. On heavy tasks where the model emits many tool calls per turn, even one bad call class compounds into session-level failure.

Fix:

type wireFunctionDef struct {
    Name        string `json:"name"`
    Description string `json:"description,omitempty"`
    Parameters  any    `json:"parameters"`
    Strict      bool   `json:"strict"`  // default true for coding agents
}

for _, t := range req.Tools {
    def := wireFunctionDef{
        Name:        t.Name,
        Description: t.Description,
        Parameters:  t.Parameters,
        Strict:      true,
    }
    // ...
}

The community guidance to use lenient mode for “compatibility” is wrong for this workload. Strict mode does not introduce 422s in well-formed schemas; it just catches the malformed calls that would otherwise poison the loop.

Verify: run a 10-turn multi-tool task with strict: false, then again with strict: true. If the strict run completes more tool calls successfully, the lenient mode was hurting you.

Bonus: pricing summary

| Item | Rate |
|---|---|
| Input (cache miss) | $0.95/M |
| Input (cache hit) | $0.16/M |
| Output | $4.00/M |
| Context window | 256K |

Real rates as of 2026-05-20. Plug them into your cost tracker: the fallback Sonnet rates that many adapters apply to unknown models overstate kimi's cost by ~5×, triggering false budget pauses.
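A rate-card entry and cost function can be sketched as follows. The rateCard struct, kimiK26 variable, and costUSD function are hypothetical names for illustration; the numbers are the rates listed above.

```go
package main

// rateCard holds prices in USD per million tokens.
type rateCard struct {
	InputMiss float64
	InputHit  float64
	Output    float64
}

// kimiK26 uses the rates from the pricing summary, as of 2026-05-20.
var kimiK26 = rateCard{InputMiss: 0.95, InputHit: 0.16, Output: 4.00}

// costUSD prices one call from its token counts.
func costUSD(r rateCard, missTokens, hitTokens, outputTokens int64) float64 {
	const perMillion = 1_000_000.0
	return (float64(missTokens)*r.InputMiss +
		float64(hitTokens)*r.InputHit +
		float64(outputTokens)*r.Output) / perMillion
}
```

For example, a call with 100K cache-miss input tokens and 20K output tokens costs 0.095 + 0.08 = $0.175 at these rates.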

Common Mistakes

Trusting kimi’s self-report about its own quirks. Asked “do you need reasoning_content echoed?” kimi answers false. Reality (per docs, LiteLLM #26156, and every framework we audited) is true. Verify against the actual 400 response, not the model’s introspection.

Adopting prompt_cache_key on the assumption it always helps. Validated 44% cost regression in coding workloads. If your prompt is dynamic per-task, the cache never hits and the key actively hurts.

Defaulting max_tokens to OpenAI’s 2048. In thinking mode that budget is shared between reasoning and visible output, so 2048 silently truncates everything. The runtime sees “no tool calls” and concludes “model gave up”, which is the wrong diagnosis and leads to the wrong fix.

Skipping the rate-card audit. Adapters that fall back to Sonnet rates over-bill kimi by ~5×. The tracker’s cost number is fictional until you add an explicit {0.95, 4.00, ..., 0.16} entry for kimi-k2.6.

Running prompt-rich (d4/scaffolded) prompts on kimi. The matrix proved this is 12× more expensive on tier-2 with zero quality improvement. Kimi is a reasoning model — same asymmetry as sonnet, grok-reasoning, deepseek-reasoner. Thin directive prompts win; rich examples-heavy prompts hurt.