---
title: "Moonshot Kimi K2.6 Operational Quirks: What Breaks in Production"
description: "Nine concrete Moonshot Kimi K2.6 quirks observed in production OFAT matrix runs — temperature locks, reasoning_content echo, max_tokens traps, region-locked keys, and prompt_cache_key cost regression."
url: https://agent-zone.ai/knowledge/agent-tooling/moonshot-kimi-k2.6-operational-quirks/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["moonshot","kimi","kimi-k2","llm-quirks","reasoning-models","openai-compatible","production","thinking-mode"]
skills: ["llm-adapter-development","provider-integration","production-debugging"]
tools: ["moonshot","kimi-k2.6","go"]
levels: ["intermediate","advanced"]
word_count: 1964
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/moonshot-kimi-k2.6-operational-quirks/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/moonshot-kimi-k2.6-operational-quirks/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Moonshot+Kimi+K2.6+Operational+Quirks%3A+What+Breaks+in+Production
---


# Moonshot Kimi K2.6 Operational Quirks

Kimi K2.6 is one of the cheapest competent reasoning models — $0.95/M input cache-miss, $0.16/M cache-hit, $4.00/M output, 256K context. It is also one of the most opinionated. Half of what works on OpenAI breaks here, and the failures are silent: empty content, mid-reasoning truncation, 400 errors that don't mention the actual problem, and a cache key parameter that makes cost go up instead of down.

This page is the production-confirmed list of quirks, each as `Symptom → Cause → Fix → Verify`. Numbers come from an OFAT matrix of 48 runs (8 cells × 2 canaries × N=3) executed 2026-05-20 against `api.moonshot.ai`. The full matrix synthesis is in `dream-team/planning/kimi-matrix-v1-results-2026-05-20.md`.

## TL;DR — pattern-match before reading

- If your call returns HTTP 400 about `temperature`, drop to `temperature: 1.0` — it is hard-locked in thinking mode
- If turn 2 of a tool loop returns 400 about `reasoning_content`, your adapter is stripping the field on round-trip
- If `finish_reason == "length"` and `content == ""`, `max_tokens` is too low — raise to ≥32K
- If your adapter sets `tool_choice: "required"` and gets 400, switch to `"auto"` — `"required"` is rejected in thinking mode
- If you get HTTP 401 from a valid-looking key, you are pointing at the wrong regional endpoint (`api.moonshot.ai` vs `api.moonshot.cn`)
- If your cost rose after enabling `prompt_cache_key`, remove it; the key raised cost by 44% in the matrix
- If a multi-pod fleet hits HTTP 429 at low concurrency, you are probably on the default 50 RPM tier: throttle per pod and email support@moonshot.ai for the production tier
- Default `strict_tools: true` on coding agents — `strict: false` dropped tier-3 pass rate from 2/3 to 1/3 in the matrix

## 1. `temperature` locked to 1.0 in thinking mode

**Symptom**: HTTP 400 with `"invalid temperature: only 1 is allowed for this model"` on every request that sets `temperature` to anything other than 1.0.

**Cause**: Moonshot's thinking-mode API contract pins the sampling parameters. The kimi-k2.6 reasoning channel was trained at fixed sampler settings and the server rejects deviations: `temperature` must be 1.0, `top_p` is pinned to 0.95, `presence_penalty` and `frequency_penalty` must be 0, and `n` must be 1. This is documented but easy to miss. Any client, framework, or adapter that injects its own default temperature (0.7 is a common choice) fails immediately.

**Fix**: when `thinking.type: "enabled"` (the default for kimi-k2.6), hardcode the constrained sampler:

```go
req := chatRequest{
    Model:       "kimi-k2.6",
    Messages:    messages,
    Temperature: 1.0,    // hard-required in thinking mode
    TopP:        0.95,   // hard-required in thinking mode
    N:           1,      // hard-required in thinking mode
    // do NOT set PresencePenalty or FrequencyPenalty
    Thinking: &thinkingConfig{Type: "enabled", Keep: "all"}, // type defined in section 4
}
```

If you need lower-temperature behavior, disable thinking (`thinking.type: "disabled"`); that path accepts `temperature: 0.6`. But in the matrix, turning thinking off dropped tier-3 pass rate from 2/3 to 0/3, so the reasoning channel is essential for hard work.

**Verify**: `curl -sS https://api.moonshot.ai/v1/chat/completions -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" -d '{"model":"kimi-k2.6","messages":[{"role":"user","content":"hi"}],"temperature":0.7}' | jq .error` — if you see `"invalid temperature"`, the lock is active.
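
One way to keep the lock from surfacing as a 400 is to normalize sampler settings in the adapter before the request is built. The sketch below is illustrative only: `samplerParams` and `pinForThinking` are hypothetical names, and the pinned values are the ones listed in this section rather than anything re-verified against the API. Note that an unset (zero) field is also reported as overridden.

```go
package main

import "fmt"

// samplerParams is a hypothetical adapter-side view of the sampling fields.
type samplerParams struct {
	Temperature      float64
	TopP             float64
	N                int
	PresencePenalty  float64
	FrequencyPenalty float64
}

// pinForThinking returns the values this section lists as required in
// thinking mode, plus the names of caller-supplied fields that differed.
func pinForThinking(in samplerParams) (samplerParams, []string) {
	pinned := samplerParams{Temperature: 1.0, TopP: 0.95, N: 1}
	var overridden []string
	if in.Temperature != pinned.Temperature {
		overridden = append(overridden, "temperature")
	}
	if in.TopP != pinned.TopP {
		overridden = append(overridden, "top_p")
	}
	if in.N != pinned.N {
		overridden = append(overridden, "n")
	}
	if in.PresencePenalty != 0 {
		overridden = append(overridden, "presence_penalty")
	}
	if in.FrequencyPenalty != 0 {
		overridden = append(overridden, "frequency_penalty")
	}
	return pinned, overridden
}

func main() {
	// A caller that carried temperature 0.7 and top_p 1 over from another provider.
	pinned, overridden := pinForThinking(samplerParams{Temperature: 0.7, TopP: 1, N: 1})
	fmt.Printf("sending %+v, overrode %v\n", pinned, overridden)
}
```

Logging the overridden names makes it visible when a caller expected a sampler setting that thinking mode cannot honor.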

## 2. `reasoning_content` MUST round-trip on every assistant turn

**Symptom**: First turn of a tool loop succeeds. Second turn returns HTTP 400: `"thinking is enabled but reasoning_content is missing in assistant tool call message at index N"`.

**Cause**: kimi-k2.6 in thinking mode emits a `reasoning_content` field on every assistant response, alongside `content` and `tool_calls`. On the next request, Moonshot requires the field be echoed back verbatim in the conversation history. Most OpenAI-shape adapters strip it because the standard OpenAI client library doesn't know about it. This is documented in LiteLLM issue #26156 and confirmed by Moonshot's own docs.

**Fix**: capture `reasoning_content` on response, re-emit on every assistant message in the request history:

```go
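// Requires: import "encoding/json"
type wireMessage struct {
    Role             string         `json:"role"`
    Content          string         `json:"content"`
    ReasoningContent string         `json:"reasoning_content,omitempty"` // populated when decoding responses
    ToolCalls        []wireToolCall `json:"tool_calls,omitempty"`
}

// MarshalJSON forces reasoning_content onto every assistant message, even
// when it is empty, and keeps it off every other role.
func (m wireMessage) MarshalJSON() ([]byte, error) {
    type alias wireMessage // alias has no methods, so Marshal does not recurse
    if m.Role != "assistant" {
        m.ReasoningContent = "" // omitempty then drops the key
        return json.Marshal(alias(m))
    }
    raw, err := json.Marshal(alias(m))
    if err != nil {
        return nil, err
    }
    var obj map[string]json.RawMessage
    if err := json.Unmarshal(raw, &obj); err != nil {
        return nil, err
    }
    rc, err := json.Marshal(m.ReasoningContent)
    if err != nil {
        return nil, err
    }
    obj["reasoning_content"] = rc
    return json.Marshal(obj)
}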
```

If you ask the model "do you need reasoning_content echoed?" it often answers `false`. That answer is wrong. Trust the 400 response, not the self-report.

**Verify**: run a 3-turn tool-use trace. If turn 2 returns 400 mentioning `reasoning_content`, the round-trip is missing.
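
The capture half of the round trip can be sketched separately. The example below assumes the usual OpenAI-shaped response layout (`choices[0].message`); `assistantTurn` and `captureAssistantTurn` are hypothetical names, and `tool_calls` is kept as raw JSON so that fields the adapter does not model survive the trip back.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// assistantTurn mirrors the assistant message fields discussed above.
// reasoning_content has no omitempty, so it is always re-emitted.
type assistantTurn struct {
	Role             string          `json:"role"`
	Content          string          `json:"content"`
	ReasoningContent string          `json:"reasoning_content"`
	ToolCalls        json.RawMessage `json:"tool_calls,omitempty"`
}

// captureAssistantTurn extracts choices[0].message from a response body.
func captureAssistantTurn(body []byte) (assistantTurn, error) {
	var resp struct {
		Choices []struct {
			Message assistantTurn `json:"message"`
		} `json:"choices"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return assistantTurn{}, err
	}
	if len(resp.Choices) == 0 {
		return assistantTurn{}, fmt.Errorf("response has no choices")
	}
	return resp.Choices[0].Message, nil
}

func main() {
	// Hand-written sample body, not a recorded Moonshot response.
	body := []byte(`{"choices":[{"message":{"role":"assistant","content":"",` +
		`"reasoning_content":"list the files first",` +
		`"tool_calls":[{"id":"call_1","type":"function","function":{"name":"ls","arguments":"{}"}}]}}]}`)
	turn, err := captureAssistantTurn(body)
	if err != nil {
		panic(err)
	}
	// Append this value to the history that goes out with the next request.
	echoed, err := json.Marshal(turn)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(echoed))
}
```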

## 3. `max_tokens` includes reasoning tokens

**Symptom**: response has `content: ""` and `finish_reason: "length"`. No error, no warning. `completion_tokens` equals exactly the configured `max_tokens` (smoking gun: round numbers like 2048, 4096, 8192).

**Cause**: in thinking mode, `max_tokens` covers reasoning AND content together. Reasoning routinely consumes 10-30K tokens on heavy multi-file tasks. At the OpenAI-default `max_tokens: 2048`, kimi spends the entire budget thinking and never emits a visible response or tool call. The runtime then treats it as "model gave up" — but it was a silent truncation.

**Fix**:

```yaml
# pod-builder-medium-3.yaml or equivalent
models:
  max_output_tokens: 32000   # matrix-validated floor for tier-3 work
```

Adapter-side default:

```go
const thinkingFloor = 32000 // matrix-validated floor; Moonshot's documented default is 96000

maxTokens := req.MaxTokens
if maxTokens < thinkingFloor { // also catches the unset (zero) case
    maxTokens = thinkingFloor
}
```

In the matrix, 16K truncated on tier-3 (0/3 pass), 32K was optimal (2/3 pass at $1.61/run), and 64K cost 46% more while still passing 0/3. Treat 32K as the floor.

**Verify**: `kubectl logs -l app=<agent> -c main --tail=500 | grep '"main: task complete"' | jq -r .output_tokens | sort | uniq -c | sort -rn` — if the output clusters at exactly 2048 or 4096, you have the silent truncation.
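
The same symptom can be caught per response inside the adapter instead of after the fact in logs. This is a minimal sketch: the `completion` struct and `isSilentTruncation` are hypothetical and only encode the symptom described in this section (finish reason `length`, empty content, no tool calls).

```go
package main

import "fmt"

// completion is a hypothetical, flattened view of one chat response.
type completion struct {
	FinishReason     string
	Content          string
	ToolCalls        int // number of tool calls in the response
	CompletionTokens int
}

// isSilentTruncation matches the symptom above: the budget ran out before
// any visible content or tool call was emitted.
func isSilentTruncation(c completion) bool {
	return c.FinishReason == "length" && c.Content == "" && c.ToolCalls == 0
}

func main() {
	c := completion{FinishReason: "length", CompletionTokens: 2048}
	if isSilentTruncation(c) {
		fmt.Printf("silent truncation after %d completion tokens: raise max_tokens\n", c.CompletionTokens)
	}
}
```

Surfacing this as an explicit error keeps the runtime from misreading the empty response as the model giving up.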

## 4. `thinking.keep: "all"` required for multi-turn tool use

**Symptom**: multi-turn tool flows work for ~2 turns then start dropping reasoning context. Model behavior degrades: forgets earlier tool results, repeats already-executed actions, or fails to compose multi-step reasoning.

**Cause**: `thinking.keep` controls how reasoning history is retained across turns. Default behavior in some adapter shapes drops older reasoning blocks. For multi-turn coding agents, this destroys the chain-of-thought that makes the reasoning channel useful.

**Fix**:

```go
type thinkingConfig struct {
    Type string `json:"type"` // "enabled" | "disabled"
    Keep string `json:"keep"` // MUST be "all" for multi-turn tool use
}

req.Thinking = &thinkingConfig{Type: "enabled", Keep: "all"}
```

**Verify**: trace a 5-turn tool conversation. Inspect each request's assistant messages — every prior assistant turn should still carry its original `reasoning_content`.
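
That inspection can be automated with a small pre-flight check on the outgoing request body. The sketch below is an assumption-laden helper, not part of any Moonshot SDK: `missingReasoning` is a hypothetical name, and it treats an empty or null `reasoning_content` as missing, which is a stricter choice than the server's own validation may be.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// missingReasoning returns the indexes of assistant messages in an outgoing
// request body whose reasoning_content is absent, null, or empty.
func missingReasoning(requestBody []byte) ([]int, error) {
	var req struct {
		Messages []map[string]json.RawMessage `json:"messages"`
	}
	if err := json.Unmarshal(requestBody, &req); err != nil {
		return nil, err
	}
	var missing []int
	for i, msg := range req.Messages {
		var role string
		if err := json.Unmarshal(msg["role"], &role); err != nil || role != "assistant" {
			continue // only assistant turns carry reasoning
		}
		var rc string
		raw, ok := msg["reasoning_content"]
		if !ok || json.Unmarshal(raw, &rc) != nil || rc == "" {
			missing = append(missing, i)
		}
	}
	return missing, nil
}

func main() {
	body := []byte(`{"messages":[
		{"role":"user","content":"fix the build"},
		{"role":"assistant","content":"","reasoning_content":"inspect the Makefile"},
		{"role":"tool","content":"ok"},
		{"role":"assistant","content":"done"}]}`)
	missing, err := missingReasoning(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("assistant messages without reasoning_content:", missing)
}
```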

## 5. `tool_choice` restricted to `"auto"` or `"none"` in thinking mode

**Symptom**: HTTP 400 when setting `tool_choice: "required"` or `tool_choice: {"type":"function","function":{"name":"..."}}`.

**Cause**: thinking-mode kimi only accepts the loose tool-choice values. The forcing variants are rejected.

**Fix**:

```go
// In thinking mode:
req.ToolChoice = "auto"  // or "none" to suppress tools entirely

// "required" or a named-function force is illegal.
```

If you genuinely need to force a tool call, disable thinking mode for that request. The matrix proved `strict_tools: true` is a better lever than `tool_choice: required` for coding-agent reliability.

**Verify**: send a request with `"tool_choice": "required"` — if HTTP 400, the restriction is active.
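
The two options above (downgrade to `"auto"`, or disable thinking when a forced call is truly needed) can be folded into one policy function. This is a sketch under stated assumptions: `resolveToolChoice` and its `mustForce` flag are hypothetical, and `run_tests` is an example tool name.

```go
package main

import "fmt"

// resolveToolChoice applies the thinking-mode restriction described above.
// It returns the tool_choice to send and whether thinking stays enabled.
//   - "auto", "none", or an unset choice pass through unchanged.
//   - A forcing choice is downgraded to "auto", unless mustForce is set,
//     in which case thinking is disabled for this request instead.
func resolveToolChoice(requested any, thinking, mustForce bool) (toolChoice any, keepThinking bool) {
	if !thinking {
		return requested, false
	}
	if requested == nil {
		return nil, true
	}
	if s, ok := requested.(string); ok && (s == "auto" || s == "none") {
		return s, true
	}
	if mustForce {
		return requested, false
	}
	return "auto", true
}

func main() {
	named := map[string]any{
		"type":     "function",
		"function": map[string]any{"name": "run_tests"},
	}
	choice, thinking := resolveToolChoice(named, true, true)
	fmt.Println(choice, thinking)

	choice, thinking = resolveToolChoice("required", true, false)
	fmt.Println(choice, thinking)
}
```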

## 6. Region-locked API keys

**Symptom**: HTTP 401 from a valid-looking API key. The key works in one context, fails in another.

**Cause**: Moonshot operates two regional endpoints with separate key namespaces:

- `https://api.moonshot.ai/v1` — international tenant
- `https://api.moonshot.cn/v1` — China tenant

Keys issued on `.ai` do NOT work on `.cn` and vice versa. If you copy a curl example from the wrong region's docs, you get 401 with no explanation.

**Fix**: pin the endpoint in your adapter and match it to your key's origin:

```go
const defaultEndpoint = "https://api.moonshot.ai/v1/chat/completions"
// or "https://api.moonshot.cn/v1/chat/completions" if your key is China-region
```

**Verify**: `curl -sS https://api.moonshot.ai/v1/models -H "Authorization: Bearer $KEY"` — if 401, try the `.cn` endpoint. If 200, you have the right pairing.
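
The same check can run once at adapter startup. The sketch below assumes, as the curl command does, that `GET /models` returns 200 for a key in its home region; `statusProbe`, `httpStatusProbe`, and `matchRegion` are hypothetical names. The probe is injected so the selection logic can be exercised without network access.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// The two regional base URLs named in this section.
var regionBaseURLs = []string{
	"https://api.moonshot.ai/v1",
	"https://api.moonshot.cn/v1",
}

// statusProbe returns the HTTP status of an authenticated request to baseURL.
type statusProbe func(baseURL, key string) (int, error)

// httpStatusProbe issues GET <baseURL>/models with the key as a bearer token.
func httpStatusProbe(baseURL, key string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/models", nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Authorization", "Bearer "+key)
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// matchRegion returns the first base URL that accepts the key.
func matchRegion(key string, probe statusProbe) (string, error) {
	for _, base := range regionBaseURLs {
		status, err := probe(base, key)
		if err == nil && status == http.StatusOK {
			return base, nil
		}
	}
	return "", errors.New("key was not accepted by any region endpoint")
}

func main() {
	// Offline stand-in for httpStatusProbe: pretends the key is China-region.
	fake := func(baseURL, key string) (int, error) {
		if baseURL == "https://api.moonshot.cn/v1" {
			return http.StatusOK, nil
		}
		return http.StatusUnauthorized, nil
	}
	base, err := matchRegion("sk-example", fake)
	fmt.Println(base, err)
}
```

In a real adapter, pass `httpStatusProbe` instead of the fake and fail fast if no region accepts the key.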

## 7. `prompt_cache_key` HURTS cost in coding workloads

**Symptom**: enabling `prompt_cache_key` with a stable per-task value increased observed cost by 44% on both tier-2 and tier-3 canaries. Quality also dropped (tier-3 pass rate 2/3 → 1/3).

**Cause**: Moonshot caches input prompts and bills cache-hit input at $0.16/M instead of $0.95/M (6× savings). For this to help, the first ~2K tokens of the prompt must be byte-identical across requests. In coding-agent workloads the prompt embeds spec content, file paths, and tool results that differ every cycle — so the cache never hits. Worse, the matrix observed a 44% cost INCREASE when the key was set, possibly because Moonshot bills cache-lookup attempts even on miss. The mechanism is not fully understood.

**Fix**: do not set `prompt_cache_key` for coding agents. If you need to test it for a stable-prompt workload (e.g. system-prompt-only queries with no embedded user content):

```go
// Probe before adopting:
// 1. Send 5 identical requests with prompt_cache_key set.
// 2. Check usage.prompt_cache_hit_tokens (or the billing dashboard) for values > 0.
// 3. If 5/5 miss, prompt_cache_key is not helping.
```

**Verify**: instrument adapter to record `usage.prompt_cache_hit_tokens` vs `usage.prompt_cache_miss_tokens` per call. If hit-rate is <10%, remove the key.
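
A minimal accumulator for that instrumentation might look like the following. The `usage` field names are taken from the Verify step above and have not been independently checked against the API; `cacheStats` and the 10% threshold helper are hypothetical.

```go
package main

import "fmt"

// usage holds the two cache counters named in the Verify step.
type usage struct {
	PromptCacheHitTokens  int `json:"prompt_cache_hit_tokens"`
	PromptCacheMissTokens int `json:"prompt_cache_miss_tokens"`
}

// cacheStats accumulates hit and miss tokens across calls.
type cacheStats struct {
	hit, miss int
}

func (s *cacheStats) record(u usage) {
	s.hit += u.PromptCacheHitTokens
	s.miss += u.PromptCacheMissTokens
}

// hitRate is the fraction of input tokens served from cache (0 with no data).
func (s *cacheStats) hitRate() float64 {
	total := s.hit + s.miss
	if total == 0 {
		return 0
	}
	return float64(s.hit) / float64(total)
}

// shouldDropKey applies the <10% rule above once any tokens were recorded.
func (s *cacheStats) shouldDropKey() bool {
	return s.hit+s.miss > 0 && s.hitRate() < 0.10
}

func main() {
	var stats cacheStats
	for i := 0; i < 5; i++ {
		stats.record(usage{PromptCacheMissTokens: 4000}) // five straight misses
	}
	fmt.Printf("hit rate %.0f%%, drop key: %v\n", stats.hitRate()*100, stats.shouldDropKey())
}
```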

## 8. Default tier limited to 50 RPM

**Symptom**: bursty agent fleets hit HTTP 429 after a small number of concurrent requests. Tier-2 and tier-3 canary throughput stalls.

**Cause**: new Moonshot accounts default to a 50 RPM tier — fine for a single agent, hostile to a pool of 4+ pods making concurrent multi-turn tool calls. The tier is not visible in the dashboard until you exceed it.

**Fix**: email `support@moonshot.ai` with your account ID and request the production tier. Until granted, throttle concurrency in the adapter:

```go
// Per-pod rate limiter (golang.org/x/time/rate):
limiter := rate.NewLimiter(rate.Every(2*time.Second), 1)  // ~30 RPM per pod; with several pods, split the 50 RPM account budget across them
```
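
Because the 50 RPM limit applies to the account rather than to each pod, the per-pod spacing has to shrink as the fleet grows. The arithmetic can be sketched as follows; `perPodInterval` and the headroom percentage are hypothetical, and only the standard library is used.

```go
package main

import (
	"fmt"
	"time"
)

// perPodInterval returns the minimum spacing between requests from one pod
// so that the whole fleet stays at headroomPercent of the account limit.
// It returns 0 for non-positive inputs.
func perPodInterval(accountRPM, pods, headroomPercent int) time.Duration {
	if accountRPM <= 0 || pods <= 0 || headroomPercent <= 0 {
		return 0
	}
	// Fleet budget in hundredths of a request per minute, kept in integers.
	fleetBudget := accountRPM * headroomPercent
	return time.Minute * time.Duration(pods*100) / time.Duration(fleetBudget)
}

func main() {
	// 50 RPM account, 4 pods, use 80% of the limit:
	// 40 RPM for the fleet, 10 RPM per pod, one request every 6 seconds.
	fmt.Println(perPodInterval(50, 4, 80))
}
```

The result can be passed to `rate.Every` when constructing the limiter.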

Combine with the standard retry-with-backoff pattern for 429 responses, but bound total time at the call edge: a retry loop that applies a fresh timeout to every attempt can multiply stuck time by the number of attempts (see the adapter audit checklist).
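
A minimal sketch of that bound, assuming a single `context` deadline is shared by every attempt and every backoff wait. All names here (`attemptFunc`, `retryOn429`, `withBudget`, `throttledThenOK`) are hypothetical, and `throttledThenOK` is only a stand-in for a real HTTP call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// attemptFunc performs one request and returns its HTTP status.
type attemptFunc func(ctx context.Context) (int, error)

// retryOn429 retries do with exponential backoff while it returns 429.
// Every attempt and every wait shares ctx, so one deadline bounds the loop.
func retryOn429(ctx context.Context, maxAttempts int, baseDelay time.Duration, do attemptFunc) (status, attempts int, err error) {
	delay := baseDelay
	for attempts = 1; ; attempts++ {
		status, err = do(ctx)
		if err != nil || status != 429 || attempts >= maxAttempts {
			return status, attempts, err
		}
		select {
		case <-ctx.Done():
			return status, attempts, ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
}

// withBudget runs retryOn429 under a single overall deadline.
func withBudget(budget, baseDelay time.Duration, maxAttempts int, do attemptFunc) (int, int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	return retryOn429(ctx, maxAttempts, baseDelay, do)
}

// throttledThenOK simulates an endpoint that returns 429 n times, then 200.
func throttledThenOK(n int) attemptFunc {
	calls := 0
	return func(ctx context.Context) (int, error) {
		if err := ctx.Err(); err != nil {
			return 0, err
		}
		calls++
		if calls <= n {
			return 429, nil
		}
		return 200, nil
	}
}

func main() {
	status, attempts, err := withBudget(2*time.Second, 10*time.Millisecond, 5, throttledThenOK(2))
	fmt.Println(status, attempts, err)
}
```

The worst case is then the budget passed to `withBudget`, regardless of how many attempts fit inside it.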

**Verify**: tail logs for HTTP 429 frequency. If 429s appear at low concurrency (≤4 pods), you are tier-limited.

## 9. `strict_tools: true` is the right default for coding agents

**Symptom**: with `strict: false` on function definitions, tier-3 coding tasks regressed from 2/3 pass to 1/3 pass in the matrix. Tier-2 was unaffected.

**Cause**: when `strict: false`, Moonshot performs only JSON-validity checks on tool arguments, not schema enforcement. Malformed-but-parseable args propagate into conversation history and poison later generations. On heavy tasks where the model emits many tool calls per turn, even one class of bad call compounds into session-level failure.

**Fix**:

```go
type wireFunctionDef struct {
    Name        string `json:"name"`
    Description string `json:"description,omitempty"`
    Parameters  any    `json:"parameters"`
    Strict      bool   `json:"strict"`  // default true for coding agents
}

for _, t := range req.Tools {
    def := wireFunctionDef{
        Name:        t.Name,
        Description: t.Description,
        Parameters:  t.Parameters,
        Strict:      true,
    }
    // ...
}
```

The community guidance to use lenient mode for "compatibility" is wrong for this workload. Strict mode does not introduce 422s in well-formed schemas; it just catches the malformed calls that would otherwise poison the loop.

**Verify**: run a 10-turn multi-tool task with `strict: false`, then again with `strict: true`. If the strict run completes more tool calls successfully, the lenient mode was hurting you.

## Bonus: pricing summary

| Item | Rate |
|---|---|
| Input (cache miss) | $0.95/M |
| Input (cache hit) | $0.16/M |
| Output | $4.00/M |
| Context window | 256K |

Real rates as of 2026-05-20. Plug them into your cost tracker: the fallback Sonnet rates in many adapters over-bill kimi by ~5×, triggering false budget pauses.
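
As a worked example of the table, a rate-card entry and cost function might look like this. The `rateCard` type and `costUSD` method are hypothetical; the three prices are the ones in the table above.

```go
package main

import "fmt"

// rateCard holds USD prices per million tokens.
type rateCard struct {
	InputMiss float64
	InputHit  float64
	Output    float64
}

// kimiK26 uses the rates from the pricing table above.
var kimiK26 = rateCard{InputMiss: 0.95, InputHit: 0.16, Output: 4.00}

// costUSD prices one call from its cache-miss, cache-hit, and output tokens.
func (r rateCard) costUSD(missTokens, hitTokens, outputTokens int) float64 {
	const perMillion = 1e6
	return (float64(missTokens)*r.InputMiss +
		float64(hitTokens)*r.InputHit +
		float64(outputTokens)*r.Output) / perMillion
}

func main() {
	// 100K uncached input tokens and 20K output tokens:
	// 0.1 * 0.95 + 0.02 * 4.00 = 0.095 + 0.080 = 0.175 USD.
	fmt.Printf("$%.4f\n", kimiK26.costUSD(100_000, 0, 20_000))
}
```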

## Common Mistakes

**Trusting kimi's self-report about its own quirks.** Asked "do you need reasoning_content echoed?" kimi answers `false`. Reality (per docs, LiteLLM #26156, and every framework we audited) is `true`. Verify against the actual 400 response, not the model's introspection.

**Adopting `prompt_cache_key` on the assumption it always helps.** Validated 44% cost regression in coding workloads. If your prompt is dynamic per-task, the cache never hits and the key actively hurts.

**Defaulting `max_tokens` to OpenAI's 2048.** In thinking mode that budget is shared between reasoning and visible output, so 2048 silently truncates everything. The runtime sees "no tool calls" and concludes "model gave up": wrong diagnosis, wrong fix.

**Skipping the rate-card audit.** Adapters that fall back to Sonnet rates over-bill kimi by ~5×. The tracker's cost number is fictional until you add an explicit `{0.95, 4.00, ..., 0.16}` entry for kimi-k2.6.

**Running prompt-rich (d4/scaffolded) prompts on kimi.** The matrix proved this is 12× more expensive on tier-2 with zero quality improvement. Kimi is a reasoning model — same asymmetry as sonnet, grok-reasoning, deepseek-reasoner. Thin directive prompts win; rich examples-heavy prompts hurt.

