---
title: "xAI Grok Operational Quirks: Error Shapes, Rate-Limit HTML, and Per-Model Tool Surfaces"
description: "Ten concrete xAI Grok quirks observed in production matrix runs: wireError object vs string, HTML rate-limit responses, paginated read expectations, per-model tool exclusions, and the grok-4.3 vs grok-4.20-reasoning trade-off."
url: https://agent-zone.ai/knowledge/agent-tooling/xai-grok-operational-quirks/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["xai","grok","grok-4","llm-quirks","openai-compatible","production","reasoning-models"]
skills: ["llm-adapter-development","provider-integration","production-debugging"]
tools: ["xai","grok-4.3","grok-4.20-reasoning","go"]
levels: ["intermediate","advanced"]
word_count: 2586
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/xai-grok-operational-quirks/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/xai-grok-operational-quirks/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=xAI+Grok+Operational+Quirks%3A+Error+Shapes%2C+Rate-Limit+HTML%2C+and+Per-Model+Tool+Surfaces
---

# xAI Grok Operational Quirks

xAI's Grok API is OpenAI-compatible on paper. In practice it has more wire-format edge cases than any other provider we run in production: error responses change shape, rate-limit pages come back as HTML, assistant turns reject missing fields with HTTP 422, and the two flagship models (grok-4.3 and grok-4.20-reasoning) have incompatible parameter sets. Wrap it carelessly and the adapter crashes the conversation mid-turn.

This page is the production-confirmed quirks list, each as `Symptom → Cause → Fix → Verify`. Numbers come from two OFAT (one-factor-at-a-time) matrix runs (15 cells × N=3 baseline, 3 cells × N=5 validation) on `api.x.ai` and from the heavy-tier POC. Full synthesis: `~/.claude/projects/-Users-mstather/memory/project_xai_adapter_wireerror_bug_2026_05_19.md` and `project_grok_matrix_v1_2026_05_19.md`.

## TL;DR: pattern-match before reading

- If a 4xx response causes a whole-turn JSON decode crash, your adapter typed `error` as `string`; xAI returns an object on some errors
- If you see `parse response: invalid character 'F' looking for beginning of value` during bursts, you got an HTML rate-limit page instead of JSON
- If turn 2 of a tool loop returns HTTP 422 `"missing field 'content'"`, you have `omitempty` on `Content` in your wireMessage
- If you set `reasoning_effort` on grok-4.20-reasoning and get 400, drop it: that model rejects the parameter; only grok-4.3 accepts it
- If grok-4.3 defers on multi-file work with tier-2 phrasing, fix the heavy-tier prompt: it is prompt-bleed, not a capability gap
- If grok-4.20-reasoning is your heavy-tier pick, switch to grok-4.3 + tuned prompt: 4.20-reasoning is strictly worse for coding
- If grok emits raw XML tool-call markup in content, your `read_file` is missing `offset`/`limit` params
- If grok produces a confident PR body that doesn't match the diff, diff-verify every bullet; don't approve on the body alone
- If a single LLM call hangs 21 minutes, your retry loop multiplied the per-attempt timeout; bound total time at the call edge
- If you have one tool list for "all grok models", split it: Roo Code's per-model `excludedTools` pattern is correct

## 1. `wireError` as string: whole-response decode crashes

**Symptom**: a real provider error returns `parse response: invalid character '{' looking for beginning of string value`. The entire turn's response body is discarded. The agent thinks the call timed out.

**Cause**: most providers return the error as a string field: `"error": "rate limited"`. xAI sometimes returns an object: `"error": {"message": "...", "type": "...", "code": "..."}`. If your `wireResponse` has `Error string`, the object form fails `json.Unmarshal` for the whole response, including the choices array, the usage, everything. The conversation state is lost.

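A minimal reproduction of the cause, assuming Go's standard `encoding/json`. The `brittleResponse` type and the values inside the object-shaped payload are hypothetical placeholders, and the exact error text depends on the decoder in use:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// brittleResponse is a hypothetical adapter type that assumes "error" is
// always a JSON string.
type brittleResponse struct {
	Error string `json:"error,omitempty"`
}

// decodeBrittle reports whether a body decodes cleanly into brittleResponse.
func decodeBrittle(body string) error {
	var r brittleResponse
	return json.Unmarshal([]byte(body), &r)
}

func main() {
	stringShape := `{"error": "rate limited"}`
	// Placeholder values; only the object shape matters here.
	objectShape := `{"error": {"message": "bad model", "type": "invalid_request", "code": "not_found"}}`

	fmt.Println("string shape:", decodeBrittle(stringShape)) // decodes without error
	fmt.Println("object shape:", decodeBrittle(objectShape)) // returns a decode error
}
```

An adapter that returns early on that decode error throws away the rest of the body, which is how a provider error ends up looking like a lost turn.
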
**Fix**: type the error field as `json.RawMessage` and decode in a helper:

```go
type wireResponse struct {
	Choices []wireChoice    `json:"choices"`
	Usage   wireUsage       `json:"usage"`
	Error   json.RawMessage `json:"error,omitempty"`
}

// decodeError accepts both the string shape and the object shape.
func decodeError(raw json.RawMessage) string {
	if len(raw) == 0 {
		return ""
	}
	// String shape: "error": "rate limited"
	var s string
	if err := json.Unmarshal(raw, &s); err == nil {
		return s
	}
	// Object shape: "error": {"message": "...", "type": "...", "code": "..."}
	var obj struct {
		Message string `json:"message"`
		Code    string `json:"code"`
		Type    string `json:"type"`
	}
	if err := json.Unmarshal(raw, &obj); err == nil && obj.Message != "" {
		return obj.Message
	}
	// Unknown shape: surface the raw bytes rather than swallowing the error.
	return string(raw)
}
```

This bug caused at least one lost grok-4.20-reasoning run during the 2026-05-19 POC. It looked like a network timeout in logs.

**Verify**: send a request with a bad model name (`grok-nonexistent`). If your adapter returns "decode response: ..." instead of the provider's error message, you have the bug.

## 2. Non-JSON rate-limit responses (HTML pages)

**Symptom**: under burst load (PARALLEL≥3) ~80% of requests fail with `xai: parse response: invalid character 'F' looking for beginning of value`. Calls succeed at low concurrency.

**Cause**: xAI returns non-JSON bodies on 429s and some 5xxs: HTML error pages, or a plain-text body such as `=== rate limit ===`. The `Content-Type` header may or may not say `text/html`. The adapter's `json.Unmarshal` chokes on the first non-`{` byte.

**Fix**: detect non-JSON bodies before decoding and feed them into the retry path:

```go
func sendWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
	backoffs := []time.Duration{1 * time.Second, 3 * time.Second, 9 * time.Second}
	var lastErr error
	for attempt := 0; attempt <= len(backoffs); attempt++ {
		// Clone per attempt and rewind the body; a consumed body cannot be resent.
		attemptReq := req.Clone(ctx)
		if req.GetBody != nil {
			rc, err := req.GetBody()
			if err != nil {
				return nil, err
			}
			attemptReq.Body = rc
		}
		resp, err := http.DefaultClient.Do(attemptReq)
		if err != nil {
			lastErr = err // network error: fall through to retry
		} else {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			switch {
			case readErr != nil:
				lastErr = readErr
			case isRetryableStatus(resp.StatusCode) || !looksLikeJSON(body):
				lastErr = fmt.Errorf("retryable: %d %s", resp.StatusCode, snippet(body))
			default:
				// JSON body (success or deterministic 4xx): hand it back to the caller.
				resp.Body = io.NopCloser(bytes.NewReader(body))
				return resp, nil
			}
		}
		if attempt < len(backoffs) {
			select {
			case <-time.After(backoffs[attempt]):
			case <-ctx.Done():
				return nil, ctx.Err()
			}
		}
	}
	return nil, lastErr
}

// isRetryableStatus retries 429 and 5xx only; other 4xx codes are deterministic.
func isRetryableStatus(status int) bool {
	return status == http.StatusTooManyRequests || status >= 500
}

func looksLikeJSON(b []byte) bool {
	b = bytes.TrimSpace(b)
	return len(b) > 0 && (b[0] == '{' || b[0] == '[')
}

// snippet truncates a body for log output.
func snippet(b []byte) string {
	if len(b) > 200 {
		b = b[:200]
	}
	return string(b)
}
```

Important: do NOT retry 4xx non-429 status codes. A 422 validation error is deterministic, so retrying burns budget on the same failure. The matrix v1 wasted 4 attempts per 422 before discovery.

Also: keep PARALLEL low (1-2) on xAI even with retry. Bench runs with high concurrency waste both wall time (retry waits) and signal (cells differ in call rate, not capability). The grok matrix v1 was 82% contaminated by this exact issue.

**Verify**: hit `api.x.ai` from 5 concurrent processes for 30 seconds. If you see HTML in error logs, the rate-limit page is firing.

## 3. `Content,omitempty`: HTTP 422 on tool-call-only assistant turns

**Symptom**: multi-turn tool flows work on the first call, then return HTTP 422 `"messages[N]: missing field 'content'"` on the second assistant-tool-call turn.

**Cause**: your `wireMessage` has ``Content string `json:"content,omitempty"` ``. When an assistant turn has only `tool_calls` and no text, `Content` is the empty string.
`omitempty` drops the field. xAI rejects this shape with 422. Same trap as Moonshot.

**Fix**:

```go
// wrong:
type wireMessage struct {
	Role      string     `json:"role"`
	Content   string     `json:"content,omitempty"` // ← drops on empty
	ToolCalls []toolCall `json:"tool_calls,omitempty"`
}

// right:
type wireMessage struct {
	Role      string     `json:"role"`
	Content   string     `json:"content"` // ← always present
	ToolCalls []toolCall `json:"tool_calls,omitempty"`
}
```

This bug contaminated 80%+ of the v2 and v3 grok matrices before discovery. Lesson: any field the upstream requires to be present must NOT be `omitempty`.

**Verify**: marshal a synthetic assistant tool-call message with empty content. Inspect the JSON for `"content":""`. If it is missing, you have the bug.

## 4. `reasoning_effort` accepted on grok-4.3, rejected on grok-4.20-reasoning

**Symptom**: setting `reasoning_effort: "high"` on grok-4.20-reasoning returns HTTP 400 `"Model grok-4.20-0309-reasoning does not support parameter reasoningEffort."` The same parameter on grok-4.3 passes and improves quality (0/3 → 2/3 on the matrix tier-3 canary).

**Cause**: grok-4.20-reasoning has always-on reasoning at a fixed effort level, so the param is not meaningful and is rejected. grok-4.3 is non-reasoning by default; `reasoning_effort: "high"` engages a deeper reasoning pass on heavy work.

**Fix**: gate the param by model:

```go
func reasoningEffortFor(model string, requested string) string {
	if requested == "" {
		return ""
	}
	if strings.Contains(model, "reasoning") {
		return "" // grok-4.20-reasoning rejects this param
	}
	return requested // grok-4.3 accepts "low" | "medium" | "high"
}
```

For coding-agent multi-file work, prefer `reasoning_effort: "high"` on grok-4.3. The community advice to use `"low"` for "agentic loops" was wrong for heavy tier in our data: `"low"` gave 0/3 fast-fails, `"high"` gave 2/3 pass.

**Verify**: probe both models with the same param. If grok-4.20-reasoning 400s and grok-4.3 succeeds, the per-model gating is needed.

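The gate only helps if an empty value is left out of the request body entirely. A minimal sketch of that wiring follows; `chatRequest` and `buildRequestJSON` are hypothetical names, and the struct shows only the two fields relevant here:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// chatRequest is a hypothetical, trimmed-down request body.
type chatRequest struct {
	Model           string `json:"model"`
	ReasoningEffort string `json:"reasoning_effort,omitempty"` // omitted when empty
}

// reasoningEffortFor mirrors the gating helper from the fix above.
func reasoningEffortFor(model, requested string) string {
	if requested == "" || strings.Contains(model, "reasoning") {
		return ""
	}
	return requested
}

// buildRequestJSON applies the gate before marshaling.
func buildRequestJSON(model, requested string) (string, error) {
	b, err := json.Marshal(chatRequest{
		Model:           model,
		ReasoningEffort: reasoningEffortFor(model, requested),
	})
	return string(b), err
}

func main() {
	for _, m := range []string{"grok-4.3", "grok-4.20-reasoning"} {
		body, err := buildRequestJSON(m, "high")
		if err != nil {
			panic(err)
		}
		fmt.Println(body)
	}
	// Prints:
	// {"model":"grok-4.3","reasoning_effort":"high"}
	// {"model":"grok-4.20-reasoning"}
}
```

Here `omitempty` is the right choice, the mirror image of quirk 3: the upstream rejects the field's presence rather than its absence.
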
## 5. grok-4.3 prefers d4-rich prompt; defers without it

**Symptom**: grok-4.3 dispatched on heavy-tier multi-file specs defers with phrasing like `"Complex multi-repo service implementation exceeds single-cycle scope"` or `"cannot complete full implementation in single cycle without risking incomplete status"`. The model is capable (it shipped a 9-turn multi-file PR at $0.33 cost on a similar spec), but the prompt lets it off the hook.

**Cause**: grok-4.3 is non-reasoning chat. Without explicit scaffolding ("multi-file IS the heavy-tier mandate"), it inherits tier-2-style "defer when uncertain" reasoning patterns. The same model with a tuned `heavy_scope_directive` prompt shipped end-to-end work in the POC.

**Fix**: add to `CLAUDE-builder-heavy.md` (or your equivalent):

```markdown
## Heavy-tier scope (DO NOT defer on this alone)

Multi-file changes ARE the heavy-tier mandate. A spec listing 8+ files across multiple repos is your assignment, not an over-scope warning.

DO defer on:
- Named blockers (missing file, ambiguous spec line, compile error you can't resolve in 2-3 attempts)
- Acceptance criteria that genuinely don't fit the runtime

DO NOT defer on:
- "Multi-file scope" alone
- "Risk of incomplete status": incomplete IS still useful; ship it
- "Complex spec": every heavy-tier spec is complex by design
```

Heavy-tier prompt + grok-4.3 went 1/4 → 1/2 fair attempts in the POC (POC ran 8 attempts; 2 were lost to harness bugs since fixed). Don't conclude on a single PR; soak ≥24h before deciding.

**Verify**: dispatch a known-multi-file spec to grok-4.3 with the default prompt vs the tuned prompt. If the tuned-prompt version ships and the default defers, prompt-bleed was the issue.

## 6. grok-4.20-reasoning is strictly worse than grok-4.3 for coding

**Symptom**: in the heavy-tier POC, grok-4.20-reasoning went 0/2: slow, test-skipping, and timeout-prone. grok-4.3 with the same prompt went 1/2 on fair attempts. The Sonnet baseline went 2/2.

**Cause**: grok-4.20-reasoning is positioned for "one-shot deep problems", not iterative agentic loops. Each turn takes 2-5 minutes; on a 30-turn multi-file refactor that compounds into wall-clock failure modes. It also tends to skip writing tests and produces PR bodies that don't match the diff (see quirk #8).

**Fix**: default to grok-4.3 for any coding workload. Use grok-4.20-reasoning only for non-agentic single-shot reasoning queries.

```yaml
# pod-builder-heavy-grok.yaml
models:
  provider: xai
  main: grok-4.3            # NOT grok-4.20-reasoning
  reasoning_effort: high    # accepted on grok-4.3
  max_output_tokens: 32000
```

Cost data: grok-4.20-reasoning shipped 0 PRs in 48h at $54.76 burn (bh-0, 2026-05-18). grok-4.3 shipped 1 PR in 3.7h at $0.33 (bh-3, same day). Per-PR economics: undefined (zero PRs shipped) vs $0.33.

**Verify**: A/B both models on the same canary. If grok-4.20-reasoning costs more and ships less, the recommendation holds.

## 7. Paginated `read_file` required: grok emits XML when missing

**Symptom**: grok-4 family models emit raw XML in response content, for example tags wrapping a value such as `100`. The next-turn parser sees this as malformed text, and the conversation goes off the rails.

**Cause**: grok-4.3 and grok-4.20-reasoning both expect a paged `read_file(path, offset?, limit?)` tool by default. When the harness only exposes `read_file(path)`, grok tries to call the paged API anyway by emitting XML in content. Grok itself flagged this in self-evaluation. Claude Code's Read tool has offset/limit, and OpenHands' does too; pagination is the agentic-framework norm grok was trained against.

**Fix**: advertise the paginated signature even when files are small:

```go
toolDef := ToolDef{
	Name: "read_file",
	Parameters: map[string]any{
		"type": "object",
		"properties": map[string]any{
			"path":   map[string]any{"type": "string"},
			"offset": map[string]any{"type": "integer", "default": 0},
			"limit":  map[string]any{"type": "integer", "default": 2000},
		},
		"required": []string{"path"}, // offset and limit stay optional
	},
}
```

The implementation can default to "read the whole file" if the params aren't useful for your workload. The point is the schema must advertise them.

**Verify**: trace a grok-4.x conversation with a single-param `read_file`. If you see raw XML tool-call markup in content, the pagination expectation is unmet.

## 8. PR body claims don't match diff

**Symptom**: grok-authored PRs (especially REQUIRED-FIX retries) have confident descriptions listing concrete changes ("renamed X to Y", "fixed N findings", "added function F"). Inspection of the diff shows some or all of those changes are absent. Concrete example: gotools PR #15 claimed 4 specific fixes; the diff contained zero of them.

**Cause**: not fully understood. Hypotheses:

- Multi-turn reasoning loses track of which edits actually landed vs which were "decided"
- Output-token budget tightening near the PR-body composition step drops the "did I actually do this?" check
- Model bias toward affirmative phrasing during summarization

High-cost reasoning (POC-02 retry: 37 turns, $3.09) does NOT correlate with diff fidelity; paying more does not fix it.

**Fix**: at the reviewer-agent layer, diff-verify every bullet in grok-authored PR bodies before approving. Treat the body as a plan, not a description.
```go
// In reviewer logic for grok-authored PRs.
// extractClaims and diffContains are reviewer-side helpers not shown here.
func verifyPRClaims(pr *PR, diff string) []string {
	var unverified []string
	for _, bullet := range extractClaims(pr.Body) {
		if !diffContains(diff, bullet) {
			unverified = append(unverified, bullet)
		}
	}
	return unverified
}
```

Watch REQUIRED-FIX retries especially: the retry seems disproportionately likely to over-promise.

**Verify**: pick 10 random grok-authored PRs and diff-verify every claim. If fidelity is <80%, apply the rule.

## 9. HTTP timeout multiplication: single turn hangs N×

**Symptom**: a grok matrix run hung 21 minutes on a single `Generate()` call. Per-attempt timeout was 600s; max retries 4. The worst case is 4 × 600s = 2400s, about 40 minutes.

**Cause**: a retry loop with `http.Client.Timeout = 600s` and `maxAttempts = 4` has a worst-case stuck time of `N × T` (not `T`). A dead upstream burns wall time linearly across attempts. The matrix runner's background job sat stuck for 20+ minutes, wasting budget on a clearly dead call.

**Fix**: bind a single context at the call edge (Generate/Send). Reuse it for every retry attempt. Make the backoff ctx-aware:

```go
func (c *Client) Send(ctx context.Context, req Request) (*Response, error) {
	// Bound TOTAL time at the call edge:
	totalBudget := 700 * time.Second // 600s call + 1+3+9s backoff slack
	ctx, cancel := context.WithTimeout(ctx, totalBudget)
	defer cancel()

	// Marshal once (assumes Request is JSON-serializable); each attempt gets a fresh reader.
	payload, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	for attempt := 0; attempt < maxAttempts; attempt++ {
		httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, c.endpoint, bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		resp, err := c.httpClient.Do(httpReq)
		if err == nil {
			return decode(resp)
		}
		if !isRetryable(err) {
			return nil, err
		}
		select {
		case <-time.After(backoff(attempt)):
		case <-ctx.Done():
			return nil, ctx.Err() // total budget spent: stop, do not start another attempt
		}
	}
	return nil, fmt.Errorf("exhausted retries")
}
```

This generalizes: the same issue exists in any provider adapter that combines retry and timeout. Audit `internal/anthropic`, `internal/deepseek`, `internal/moonshot`, `internal/gemini` for the same pattern.

**Verify**: simulate a flaky upstream (an httptest server that always returns 503). Measure end-to-end Send time. If it exceeds your declared per-call budget, the multiplication is happening.

## 10. Per-model `excludedTools`: `apply_diff` is a weak spot

**Symptom**: tool calls to `apply_diff` (unified diff edit format) succeed less often on grok than on other models. Switching to `search_replace`-shape edits (string substitution) recovers reliability.

**Cause**: grok-4 family models appear to be trained on string-replacement edit shapes, not unified-diff shapes. Roo Code's catalog confirms this: it sets `excludedTools: ["apply_diff"]` and `includedTools: ["search_replace"]` for EVERY grok model variant. Aider agrees: it uses its `diff` edit format (search/replace blocks rather than unified diff) for grok-4 and moves grok-3-mini to whole-file rewrite.

**Fix**: borrow Roo's per-model tool-surface pattern:

```go
type ModelToolOverrides struct {
	Excluded []string
	Included []string
}

var perModelOverrides = map[string]ModelToolOverrides{
	"grok-4.3": {
		Excluded: []string{"apply_diff"},
		Included: []string{"search_replace", "insert_after", "append_to_file"},
	},
	"grok-4.20-reasoning": {
		Excluded: []string{"apply_diff"},
		Included: []string{"search_replace", "insert_after", "append_to_file"},
	},
}

// toolsFor returns the base list unchanged for models without an override.
func toolsFor(model string, base []Tool) []Tool {
	o, ok := perModelOverrides[model]
	if !ok {
		return base
	}
	return applyOverrides(base, o)
}
```

For grok, also include companion `insert_after` / `append_to_file` tools. Grok reaches for "insert at end" intent that string replacement doesn't naturally cover.

**Verify**: A/B `apply_diff`-only vs `search_replace`+`insert_after` on a grok-4.3 multi-edit task. If the latter completes more edits successfully, the override is correct.

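The `toolsFor` snippet above calls `applyOverrides` without showing it. One possible shape is sketched below, under the assumption that a `Tool` can be identified by name alone; the `Tool` struct here is a minimal stand-in, and a real adapter would look up full tool definitions rather than construct them from a name.

```go
package main

import "fmt"

// Tool and ModelToolOverrides are minimal stand-ins for the adapter's types.
type Tool struct{ Name string }

type ModelToolOverrides struct {
	Excluded []string
	Included []string
}

// applyOverrides drops excluded tools from base, then appends any included
// tool that is not already present. The order of base is preserved.
func applyOverrides(base []Tool, o ModelToolOverrides) []Tool {
	excluded := make(map[string]bool, len(o.Excluded))
	for _, name := range o.Excluded {
		excluded[name] = true
	}
	present := make(map[string]bool, len(base))
	out := make([]Tool, 0, len(base)+len(o.Included))
	for _, t := range base {
		if excluded[t.Name] {
			continue
		}
		present[t.Name] = true
		out = append(out, t)
	}
	for _, name := range o.Included {
		if present[name] || excluded[name] {
			continue
		}
		present[name] = true
		out = append(out, Tool{Name: name}) // placeholder definition, name only
	}
	return out
}

func main() {
	base := []Tool{{Name: "read_file"}, {Name: "apply_diff"}, {Name: "search_replace"}}
	o := ModelToolOverrides{
		Excluded: []string{"apply_diff"},
		Included: []string{"search_replace", "insert_after", "append_to_file"},
	}
	for _, t := range applyOverrides(base, o) {
		fmt.Println(t.Name)
	}
	// Prints read_file, search_replace, insert_after, append_to_file (one per line).
}
```

Exclusion wins over inclusion in this sketch, so a tool listed in both sets stays out; that is a design choice, not something the Roo Code pattern dictates.
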
## Bonus: pricing summary

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Verdict |
|---|---|---|---|
| grok-4.3 | ~$0.50 | ~$2.50 | Use this for coding |
| grok-4.20-reasoning | ~$3 | ~$15 | 6× pricier, strictly worse for coding: don't use |

Verify rates via the trailing `usage.cost_in_usd_ticks` field (xAI-authoritative billed cost in 1e-10 USD units), not a constant. Variants and prices shift mid-month.

## Common Mistakes

**Treating `wireError` as a string.** xAI returns object-shape errors on some 4xx responses. String-typed Error fields crash the whole-response decode. Use `json.RawMessage` + a decode helper.

**Running PARALLEL≥3 grok bench cells.** Roughly 80% of results were rate-limit-poisoned in our matrices. Even with retry, the noise contaminates per-cell signal. PARALLEL=1 is the right default for xAI.

**Picking grok-4.20-reasoning because "reasoning sounds better".** It is strictly worse for coding: slower, skips tests, 6× more expensive, 0/2 vs 1/2 on the heavy POC. Pick grok-4.3 with `reasoning_effort: "high"`.

**Trusting grok-authored PR bodies.** Especially on REQUIRED-FIX retries, the body reads like a plan and the diff is missing claimed changes. Diff-verify every bullet.

**Per-attempt timeouts without a total-time cap.** N retries × T timeout = N×T worst-case stuck time. Bind a ctx at the call edge, not just at each attempt. The 21-minute hang was avoidable.

**One tool list for "all grok models".** Per-model `excludedTools` is correct. Apply Roo's pattern: exclude `apply_diff`, include `search_replace`+`insert_after`+`append_to_file` for every grok variant.
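For the `usage.cost_in_usd_ticks` check in the pricing summary, the conversion is a division by 1e10. A small sketch follows; the `usageEnvelope` and `usage` types and the sample payload are hypothetical, and the tick value is chosen only to illustrate a $0.33 run:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ticksPerUSD reflects the 1e-10 USD unit stated for cost_in_usd_ticks.
const ticksPerUSD = 1e10

// usage is a hypothetical slice of the response's usage object.
type usage struct {
	CostInUSDTicks int64 `json:"cost_in_usd_ticks"`
}

// usageEnvelope is a hypothetical wrapper that pulls usage out of a response body.
type usageEnvelope struct {
	Usage usage `json:"usage"`
}

// costUSD converts billed ticks to dollars.
func (u usage) costUSD() float64 {
	return float64(u.CostInUSDTicks) / ticksPerUSD
}

// parseCostUSD extracts the billed cost in USD from a response body.
func parseCostUSD(body []byte) (float64, error) {
	var env usageEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return 0, err
	}
	return env.Usage.costUSD(), nil
}

func main() {
	// Illustrative payload: 3,300,000,000 ticks is $0.33.
	body := []byte(`{"usage": {"cost_in_usd_ticks": 3300000000}}`)
	cost, err := parseCostUSD(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("$%.2f\n", cost) // prints $0.33
}
```

Summing ticks as integers across calls and converting once at the end avoids accumulating floating-point error in per-run cost reports.
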