---
title: "xAI Grok Operational Quirks: Error Shapes, Rate-Limit HTML, and Per-Model Tool Surfaces"
description: "Ten concrete xAI Grok quirks observed in production matrix runs — wireError object vs string, HTML rate-limit responses, paginated read expectations, per-model tool exclusions, and the grok-4.3 vs grok-4.20-reasoning trade-off."
url: https://agent-zone.ai/knowledge/agent-tooling/xai-grok-operational-quirks/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["xai","grok","grok-4","llm-quirks","openai-compatible","production","reasoning-models"]
skills: ["llm-adapter-development","provider-integration","production-debugging"]
tools: ["xai","grok-4.3","grok-4.20-reasoning","go"]
levels: ["intermediate","advanced"]
word_count: 2586
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/xai-grok-operational-quirks/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/xai-grok-operational-quirks/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=xAI+Grok+Operational+Quirks%3A+Error+Shapes%2C+Rate-Limit+HTML%2C+and+Per-Model+Tool+Surfaces
---


# xAI Grok Operational Quirks

xAI's Grok API is OpenAI-compatible on paper. In practice it has more wire-format edge cases than any other provider we run in production: error responses change shape, rate-limit responses come back as HTML, assistant turns that omit `content` are rejected with HTTP 422, and the two flagship models (grok-4.3 and grok-4.20-reasoning) accept different parameter sets. Wrap it carelessly and the adapter crashes the conversation mid-turn.

This page lists the production-confirmed quirks, each written up as `Symptom → Cause → Fix → Verify`. Numbers come from two OFAT matrix runs (15 cells × N=3 baseline, 3 cells × N=5 validation) on `api.x.ai` and from the heavy-tier POC. Full synthesis: `~/.claude/projects/-Users-mstather/memory/project_xai_adapter_wireerror_bug_2026_05_19.md` and `project_grok_matrix_v1_2026_05_19.md`.

## TL;DR — pattern-match before reading

- If a 4xx response causes a whole-turn JSON decode crash, your adapter typed `error` as `string` — xAI returns an object on some errors
- If you see `parse response: invalid character 'F' looking for beginning of value` during bursts, you got an HTML rate-limit page instead of JSON
- If turn 2 of a tool loop returns HTTP 422 `"missing field 'content'"`, you have `omitempty` on `Content` in your wireMessage
- If you set `reasoning_effort` on grok-4.20-reasoning and get 400, drop it — that model rejects the parameter; only grok-4.3 accepts it
- If grok-4.3 defers on multi-file work with tier-2 phrasing, fix the heavy-tier prompt — it is prompt-bleed, not capability
- If grok-4.20-reasoning is your heavy-tier pick, switch to grok-4.3 + tuned prompt — 4.20-reasoning is strictly worse for coding
- If grok emits raw `<function_calls><parameter name="offset">` XML in content, your `read_file` is missing `offset`/`limit` params
- If grok produces a confident PR body that doesn't match the diff, diff-verify every bullet — don't approve on the body alone
- If a single LLM call hangs 21 minutes, your retry loop multiplied the per-attempt timeout; bound total time at the call edge
- If you have one tool list for "all grok models", split it — Roo Code's per-model `excludedTools` pattern is correct

## 1. `wireError` as string — whole-response decode crashes

**Symptom**: a real provider error returns `parse response: invalid character '{' looking for beginning of string value`. The entire turn's response body is discarded. The agent thinks the call timed out.

**Cause**: most providers return `error` as a string: `"error": "rate limited"`. xAI sometimes returns an object: `"error": {"message": "...", "type": "...", "code": "..."}`. If your `wireResponse` declares `Error string`, the object form makes `json.Unmarshal` return an error, and an adapter that treats any decode error as fatal discards the whole response: the choices array, the usage, everything. The conversation state is lost.

**Fix**: type the error field as `json.RawMessage` and decode in a helper:

```go
type wireResponse struct {
    Choices []wireChoice    `json:"choices"`
    Usage   wireUsage       `json:"usage"`
    Error   json.RawMessage `json:"error,omitempty"`
}

func decodeError(raw json.RawMessage) string {
    if len(raw) == 0 {
        return ""
    }
    var s string
    if err := json.Unmarshal(raw, &s); err == nil {
        return s
    }
    var obj struct {
        Message string `json:"message"`
        Code    string `json:"code"`
        Type    string `json:"type"`
    }
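    // Only trust the object form when it actually carries a message;
    // otherwise an empty string would hide a real error.
    if err := json.Unmarshal(raw, &obj); err == nil && obj.Message != "" {
        return obj.Message
    }
    return string(raw) // unknown shape: surface the raw payload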
}
```

This bug caused at least one lost grok-4.20-reasoning run during the 2026-05-19 POC. It looked like a network timeout in logs.

**Verify**: send a request with a bad model name (`grok-nonexistent`). If your adapter returns `parse response: ...` instead of the provider's error message, you have the bug.
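
The same check can be run offline before touching the network. The sketch below feeds a synthetic object-shaped error body (the field values are made up for illustration) through a string-typed struct and a `json.RawMessage`-typed struct. `strictResponse`, `tolerantResponse`, `decodeStrict`, and `decodeTolerant` are hypothetical names, not part of any adapter.

```go
package main

import "encoding/json"

// strictResponse models the buggy adapter: error typed as a plain string.
type strictResponse struct {
	Error string `json:"error,omitempty"`
}

// tolerantResponse defers interpretation of the error field.
type tolerantResponse struct {
	Error json.RawMessage `json:"error,omitempty"`
}

// decodeStrict returns the json.Unmarshal error (if any) for the
// string-typed struct.
func decodeStrict(body []byte) error {
	var r strictResponse
	return json.Unmarshal(body, &r)
}

// decodeTolerant returns the raw error payload without interpreting it.
func decodeTolerant(body []byte) (json.RawMessage, error) {
	var r tolerantResponse
	err := json.Unmarshal(body, &r)
	return r.Error, err
}
```

Calling `decodeStrict` on a body like `{"error":{"message":"bad model"}}` returns a non-nil error, while `decodeTolerant` returns the raw object for a helper such as `decodeError` to interpret. Both accept the string form.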

## 2. Non-JSON rate-limit responses (HTML pages)

**Symptom**: under burst load (PARALLEL≥3) ~80% of requests fail with `xai: parse response: invalid character 'F' looking for beginning of value`. Calls succeed fine at low concurrency.

**Cause**: xAI returns non-JSON bodies on 429s and some 5xxs: HTML error pages, or a plain-text body such as `=== rate limit ===`. The `Content-Type` header may or may not say `text/html`. The adapter's `json.Unmarshal` fails on the first byte that cannot start a JSON value.

**Fix**: detect non-JSON bodies before decoding and feed them into the retry path:

```go
func sendWithRetry(ctx context.Context, req *http.Request) (*http.Response, error) {
    backoffs := []time.Duration{1 * time.Second, 3 * time.Second, 9 * time.Second}
    var lastErr error
    for attempt := 0; attempt <= len(backoffs); attempt++ {
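        // Rewind the body: the previous attempt consumed it. GetBody is set
        // automatically when the request was built from a bytes/strings reader.
        if attempt > 0 && req.GetBody != nil {
            rc, err := req.GetBody()
            if err != nil {
                return nil, err
            }
            req.Body = rc
        }
        // Bind the attempt to ctx so cancellation also aborts an in-flight call.
        resp, err := http.DefaultClient.Do(req.WithContext(ctx))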
        if err != nil {
            lastErr = err
            // fall through to retry
        } else {
            body, _ := io.ReadAll(resp.Body)
            resp.Body.Close()
            if isRetryable(resp.StatusCode) || !looksLikeJSON(body) {
                lastErr = fmt.Errorf("retryable: %d %s", resp.StatusCode, snippet(body))
            } else {
                resp.Body = io.NopCloser(bytes.NewReader(body))
                return resp, nil
            }
        }
        if attempt < len(backoffs) {
            select {
            case <-time.After(backoffs[attempt]):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
    }
    return nil, lastErr
}

func looksLikeJSON(b []byte) bool {
    b = bytes.TrimSpace(b)
    return len(b) > 0 && (b[0] == '{' || b[0] == '[')
}

func snippet(b []byte) string {
    if len(b) > 200 { b = b[:200] }
    return string(b)
}
```

Important: do NOT retry 4xx status codes other than 429. A 422 validation error is deterministic, so retrying burns budget on the same failure. The matrix v1 wasted 4 attempts per 422 before we found this.

Also: keep PARALLEL low (1-2) on xAI even with retry. Bench runs with high concurrency waste both wall time (retry waits) and signal (cells differ in call rate, not capability). The grok matrix v1 was 82% contaminated by this exact issue.
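
The retry rule above (retry 429 and 5xx, never other 4xx) maps onto the `isRetryable` helper that `sendWithRetry` calls but does not define. A minimal sketch, assuming the status code alone drives the decision:

```go
package main

import "net/http"

// isRetryable reports whether an HTTP status is worth another attempt.
// Assumption: 429 and every 5xx are transient; all other statuses,
// including 422 validation errors, are deterministic and must not be retried.
func isRetryable(status int) bool {
	if status == http.StatusTooManyRequests {
		return true
	}
	return status >= 500 && status <= 599
}
```

With this shape a 422 with a JSON body is returned to the caller on the first attempt, while a 429 or 503 goes through the backoff schedule.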

**Verify**: hit `api.x.ai` from 5 concurrent processes for 30 seconds. If you see HTML in error logs, the rate-limit page is firing.

## 3. `Content,omitempty` — HTTP 422 on tool-call-only assistant turns

**Symptom**: multi-turn tool flows work on the first call, then return HTTP 422 `"messages[N]: missing field 'content'"` on the second assistant tool-call turn.

**Cause**: your `wireMessage` has `Content string \`json:"content,omitempty"\``. When an assistant turn has only `tool_calls` and no text, `Content` is the empty string. `omitempty` drops the field. xAI rejects this shape with 422. Same trap as Moonshot.

**Fix**:

```go
// wrong:
type wireMessage struct {
    Role      string     `json:"role"`
    Content   string     `json:"content,omitempty"`  // ← drops on empty
    ToolCalls []toolCall `json:"tool_calls,omitempty"`
}

// right:
type wireMessage struct {
    Role      string     `json:"role"`
    Content   string     `json:"content"`            // ← always present
    ToolCalls []toolCall `json:"tool_calls,omitempty"`
}
```

This bug contaminated 80%+ of the v2 and v3 grok matrices before discovery. Lesson: any field the upstream requires to be present must NOT be tagged `omitempty`.

**Verify**: marshal a synthetic assistant tool-call message with empty content. Inspect the JSON for `"content": ""`. If missing, you have the bug.
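
One way to automate that check is to marshal the message and look for the key rather than eyeballing the output. This is a sketch under assumptions: `toolCall` is a simplified, hypothetical stand-in for the real tool-call shape, and `buggyMessage` / `fixedMessage` exist only to contrast the two struct tags. Note that `json.Marshal` emits `"content":""` without a space after the colon, so the helper checks key presence instead of matching a literal substring.

```go
package main

import "encoding/json"

// toolCall is a simplified, hypothetical tool-call shape.
type toolCall struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// buggyMessage drops "content" when it is empty.
type buggyMessage struct {
	Role      string     `json:"role"`
	Content   string     `json:"content,omitempty"`
	ToolCalls []toolCall `json:"tool_calls,omitempty"`
}

// fixedMessage always emits "content", even when empty.
type fixedMessage struct {
	Role      string     `json:"role"`
	Content   string     `json:"content"`
	ToolCalls []toolCall `json:"tool_calls,omitempty"`
}

// hasContentField marshals msg and reports whether the top-level
// "content" key is present in the resulting JSON object.
func hasContentField(msg any) (bool, error) {
	b, err := json.Marshal(msg)
	if err != nil {
		return false, err
	}
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(b, &fields); err != nil {
		return false, err
	}
	_, ok := fields["content"]
	return ok, nil
}
```

A unit test that calls `hasContentField` on a tool-call-only assistant message catches the regression if someone re-adds `omitempty` later.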

## 4. `reasoning_effort` accepted on grok-4.3, rejected on grok-4.20-reasoning

**Symptom**: setting `reasoning_effort: "high"` on grok-4.20-reasoning returns HTTP 400 `"Model grok-4.20-0309-reasoning does not support parameter reasoningEffort."` Same parameter on grok-4.3 passes and improves quality (0/3 → 2/3 on the matrix tier-3 canary).

**Cause**: grok-4.20-reasoning has always-on reasoning at a fixed effort level — the param is not meaningful and is rejected. grok-4.3 is non-reasoning by default; `reasoning_effort: "high"` engages a deeper reasoning pass on heavy work.

**Fix**: gate the param by model:

```go
func reasoningEffortFor(model string, requested string) string {
    if requested == "" {
        return ""
    }
    if strings.Contains(model, "reasoning") {
        return ""  // grok-4.20-reasoning rejects this param
    }
    return requested  // grok-4.3 accepts "low" | "medium" | "high"
}
```

For coding-agent multi-file work, prefer `reasoning_effort: "high"` on grok-4.3. The community advice to use `"low"` for "agentic loops" was wrong for heavy tier in our data — `"low"` gave 0/3 fast-fails, `"high"` gave 2/3 pass.

**Verify**: probe both models with the same param. If grok-4.20-reasoning 400s and grok-4.3 succeeds, the per-model gating is needed.
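
Returning `""` from `reasoningEffortFor` only helps if the empty value is then left out of the request body. Unlike `content` in quirk #3, this is a field where `omitempty` is the right tag. The sketch below assumes a hypothetical `wireRequest` struct and `buildRequest` helper; the real request type will carry more fields.

```go
package main

import (
	"encoding/json"
	"strings"
)

// wireRequest is a hypothetical, trimmed-down request body.
type wireRequest struct {
	Model string `json:"model"`
	// omitempty is intended here: an empty effort must not reach the wire.
	ReasoningEffort string `json:"reasoning_effort,omitempty"`
}

// reasoningEffortFor mirrors the gating rule from the Fix above.
func reasoningEffortFor(model, requested string) string {
	if requested == "" || strings.Contains(model, "reasoning") {
		return ""
	}
	return requested
}

// buildRequest marshals a request with the effort gated by model.
func buildRequest(model, requested string) ([]byte, error) {
	return json.Marshal(wireRequest{
		Model:           model,
		ReasoningEffort: reasoningEffortFor(model, requested),
	})
}
```

With this shape, a request for grok-4.20-reasoning carries no `reasoning_effort` key at all, while a request for grok-4.3 carries the requested value.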

## 5. grok-4.3 prefers d4-rich prompt; defers without it

**Symptom**: grok-4.3 dispatched on heavy-tier multi-file specs defers with phrasing like `"Complex multi-repo service implementation exceeds single-cycle scope"` or `"cannot complete full implementation in single cycle without risking incomplete status"`. The model is capable — it shipped a 9-turn multi-file PR ($0.33 cost) on a similar spec — but the prompt lets it off the hook.

**Cause**: grok-4.3 is non-reasoning chat. Without explicit scaffolding ("multi-file IS the heavy-tier mandate"), it inherits tier-2-style "defer when uncertain" reasoning patterns. The same model with a tuned `heavy_scope_directive` prompt shipped end-to-end work in the POC.

**Fix**: add to `CLAUDE-builder-heavy.md` (or your equivalent):

```markdown
## Heavy-tier scope (DO NOT defer on this alone)

Multi-file changes ARE the heavy-tier mandate. A spec listing 8+ files
across multiple repos is your assignment, not an over-scope warning.

DO defer on:
- Named blockers (missing file, ambiguous spec line, compile error
  you can't resolve in 2-3 attempts)
- Acceptance criteria that genuinely don't fit the runtime

DO NOT defer on:
- "Multi-file scope" alone
- "Risk of incomplete status" — incomplete IS still useful; ship it
- "Complex spec" — every heavy-tier spec is complex by design
```

With the heavy-tier prompt, grok-4.3 went from 1/4 to 1/2 on fair attempts in the POC (the POC ran 8 attempts; 2 were lost to harness bugs that have since been fixed). Don't conclude from a single PR; soak for ≥24h before deciding.

**Verify**: dispatch a known-multi-file spec to grok-4.3 with the default prompt vs the tuned prompt. If the tuned-prompt version ships and the default defers, prompt-bleed was the issue.

## 6. grok-4.20-reasoning is strictly worse than grok-4.3 for coding

**Symptom**: heavy-tier POC: grok-4.20-reasoning 0/2 pass, slow, skips tests, timeout-prone. grok-4.3 with the same prompt: 1/2 fair attempts. Sonnet baseline: 2/2.

**Cause**: grok-4.20-reasoning is positioned for "one-shot deep problems", not iterative agentic loops. Each turn takes 2-5 minutes, so a 30-turn multi-file refactor needs 60-150 minutes of wall-clock time, which is where the timeout failures come from. It also tends to skip writing tests and produces PR bodies that don't match the diff (see quirk #8).

**Fix**: default to grok-4.3 for any coding workload. Use grok-4.20-reasoning only for non-agentic single-shot reasoning queries.

```yaml
# pod-builder-heavy-grok.yaml
models:
  provider: xai
  main: grok-4.3              # NOT grok-4.20-reasoning
  reasoning_effort: high       # accepted on grok-4.3
  max_output_tokens: 32000
```

Cost data: grok-4.20-reasoning shipped 0 PRs in 48h at $54.76 burn (bh-0, 2026-05-18). grok-4.3 shipped 1 PR in 3.7h at $0.33 (bh-3, same day). Per-PR economics: undefined (no PR shipped) vs $0.33.

**Verify**: A/B both models on the same canary. If grok-4.20-reasoning costs more and ships less, the recommendation holds.

## 7. Paginated `read_file` required — grok emits XML when missing

**Symptom**: grok-4 family models emit raw XML in response content: `<function_calls><parameter name="offset">100</parameter>`. The next-turn parser sees this as malformed text, and the conversation goes off the rails.

**Cause**: grok-4.3 and grok-4.20-reasoning both expect a paged `read_file(path, offset?, limit?)` tool by default. When the harness only exposes `read_file(path)`, grok tries to call the paged API anyway by emitting XML in content. Grok itself flagged this in self-evaluation; Claude Code's Read tool has offset/limit, OpenHands' does too — pagination is the agentic-framework norm grok was trained against.

**Fix**: advertise the paginated signature even when files are small:

```go
toolDef := ToolDef{
    Name: "read_file",
    Parameters: map[string]any{
        "type": "object",
        "properties": map[string]any{
            "path":   map[string]any{"type": "string"},
            "offset": map[string]any{"type": "integer", "default": 0},
            "limit":  map[string]any{"type": "integer", "default": 2000},
        },
        "required": []string{"path"},
    },
}
```

The implementation can default to "read the whole file" if the params aren't useful for your workload. The point is the schema must advertise them.

**Verify**: trace a grok-4.x conversation with a single-param `read_file`. If you see `<function_calls>` XML in content, the pagination expectation is unmet.

## 8. PR body claims don't match diff

**Symptom**: grok-authored PRs (especially REQUIRED-FIX retries) have confident descriptions listing concrete changes ("renamed X to Y", "fixed N findings", "added function F"). Inspection of the diff shows some or all of those changes are absent. Concrete example: gotools PR #15 claimed 4 specific fixes; diff contained zero of them.

**Cause**: not fully understood. Hypotheses:

- Multi-turn reasoning loses track of which edits actually landed vs which were "decided"
- Output-token budget tightening near the PR-body composition step drops the "did I actually do this?" check
- Model bias toward affirmative phrasing during summarization

High-cost reasoning (POC-02 retry: 37 turns, $3.09) does NOT correlate with diff fidelity — paying more does not fix it.

**Fix**: at the reviewer-agent layer, diff-verify every bullet in grok-authored PR bodies before approving. Treat the body as a plan, not a description.

```go
// In reviewer logic for grok-authored PRs:
func verifyPRClaims(pr *PR, diff string) []string {
    var unverified []string
    for _, bullet := range extractClaims(pr.Body) {
        if !diffContains(diff, bullet) {
            unverified = append(unverified, bullet)
        }
    }
    return unverified
}
```

Especially watch REQUIRED-FIX retries — the retry seems disproportionately likely to over-promise.
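
The reviewer snippet leaves `extractClaims` and `diffContains` undefined. The sketch below is one deliberately crude heuristic, not the verification we run: it treats every markdown bullet as a claim and accepts a claim only if each backticked identifier in it appears on an added or removed line of a unified diff. Claims without identifiers are reported as unverified so a human reads them.

```go
package main

import (
	"regexp"
	"strings"
)

// backticked matches `identifier` spans inside a claim.
var backticked = regexp.MustCompile("`([^`]+)`")

// extractClaims returns the markdown bullet lines ("- " or "* ") of a PR body.
func extractClaims(body string) []string {
	var claims []string
	for _, line := range strings.Split(body, "\n") {
		t := strings.TrimSpace(line)
		if strings.HasPrefix(t, "- ") || strings.HasPrefix(t, "* ") {
			claims = append(claims, strings.TrimSpace(t[2:]))
		}
	}
	return claims
}

// diffContains reports whether every backticked identifier in claim appears
// on an added or removed line of the unified diff. Claims with no
// identifiers return false so they are flagged for manual review.
func diffContains(diff, claim string) bool {
	ids := backticked.FindAllStringSubmatch(claim, -1)
	if len(ids) == 0 {
		return false
	}
	var changed strings.Builder
	for _, line := range strings.Split(diff, "\n") {
		if strings.HasPrefix(line, "+++") || strings.HasPrefix(line, "---") {
			continue // file headers, not content
		}
		if strings.HasPrefix(line, "+") || strings.HasPrefix(line, "-") {
			changed.WriteString(line[1:])
			changed.WriteByte('\n')
		}
	}
	text := changed.String()
	for _, m := range ids {
		if !strings.Contains(text, m[1]) {
			return false
		}
	}
	return true
}
```

This catches the "claimed rename, no rename in diff" case cheaply. It cannot judge claims like "fixed 4 findings", which is why those fall through to the unverified list.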

**Verify**: pick 10 random grok-authored PRs, diff-verify every claim. If fidelity is <80%, apply the rule.

## 9. HTTP timeout multiplication — single turn hangs N×

**Symptom**: a grok matrix run hung 21 minutes on a single `Generate()` call. With a 600s per-attempt timeout and 4 attempts, the worst case is 4 × 600s = 40 minutes.

**Cause**: a retry loop with `http.Client.Timeout = 600s` and `maxAttempts = 4` has a worst-case stuck time of `N × T` (not `T`). A dead upstream burns wall time linearly across attempts. The matrix runner's background job sat stuck for 20+ minutes, wasting budget on a clearly dead call.

**Fix**: bind a single context at the call edge (Generate/Send). Reuse it for every retry attempt. Make the backoff ctx-aware:

```go
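func (c *Client) Send(ctx context.Context, req Request) (*Response, error) {
    // Bound TOTAL time at the call edge, not per attempt:
    totalBudget := 700 * time.Second // 600s call + 1+3+9s backoff, plus slack
    ctx, cancel := context.WithTimeout(ctx, totalBudget)
    defer cancel()

    // Assumes Request marshals to the provider's JSON wire format.
    payload, err := json.Marshal(req)
    if err != nil {
        return nil, err
    }

    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        // Fresh reader per attempt: a consumed body cannot be re-sent.
        httpReq, err := http.NewRequestWithContext(ctx, "POST", c.endpoint, bytes.NewReader(payload))
        if err != nil {
            return nil, err
        }
        resp, err := c.httpClient.Do(httpReq)
        if err == nil {
            return decode(resp)
        }
        if !isRetryable(err) {
            return nil, err
        }
        lastErr = err

        select {
        case <-time.After(backoff(attempt)):
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }
    return nil, fmt.Errorf("exhausted retries: %w", lastErr)
}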
```

This generalizes — same issue in any provider adapter that uses retry+timeout. Audit `internal/anthropic`, `internal/deepseek`, `internal/moonshot`, `internal/gemini` for the same pattern.

**Verify**: simulate a flaky upstream (httptest server that always 503s). Measure end-to-end Send time. If it exceeds your declared per-call budget, the multiplication is happening.

## 10. Per-model `excludedTools` — `apply_diff` is a weak spot

**Symptom**: tool calls to `apply_diff` (unified diff edit format) succeed less often on grok than on other models. Switching to `search_replace`-shape edits (string substitution) recovers reliability.

**Cause**: grok-4 family models appear to have been trained on string-replacement edit shapes rather than unified-diff shapes. Roo Code's catalog is consistent with this: it sets `excludedTools: ["apply_diff"]` and `includedTools: ["search_replace"]` for EVERY grok model variant. Aider agrees (it uses its `diff` edit format, which is search/replace blocks rather than unified diff, for grok-4, and moves grok-3-mini to whole-file rewrite).

**Fix**: borrow Roo's per-model tool-surface pattern:

```go
type ModelToolOverrides struct {
    Excluded []string
    Included []string
}

var perModelOverrides = map[string]ModelToolOverrides{
    "grok-4.3": {
        Excluded: []string{"apply_diff"},
        Included: []string{"search_replace", "insert_after", "append_to_file"},
    },
    "grok-4.20-reasoning": {
        Excluded: []string{"apply_diff"},
        Included: []string{"search_replace", "insert_after", "append_to_file"},
    },
}

func toolsFor(model string, base []Tool) []Tool {
    o, ok := perModelOverrides[model]
    if !ok { return base }
    return applyOverrides(base, o)
}
```

For grok, also include companion `insert_after` / `append_to_file` tools — grok reaches for "insert at end" intent that string-replacement doesn't naturally cover.
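
`applyOverrides` is not shown above. A minimal sketch follows, assuming a hypothetical `Tool` type keyed by name and a hypothetical package-level `toolRegistry` from which included tools are looked up when they are not already in the base list:

```go
package main

// Tool is a hypothetical minimal tool descriptor.
type Tool struct {
	Name string
}

// ModelToolOverrides mirrors the struct from the Fix above.
type ModelToolOverrides struct {
	Excluded []string
	Included []string
}

// toolRegistry is a hypothetical catalog of every tool the harness can expose.
var toolRegistry = map[string]Tool{
	"read_file":      {Name: "read_file"},
	"apply_diff":     {Name: "apply_diff"},
	"search_replace": {Name: "search_replace"},
	"insert_after":   {Name: "insert_after"},
	"append_to_file": {Name: "append_to_file"},
}

// applyOverrides drops excluded tools from base, then appends included tools
// that are known to the registry and not already present. Exclusion wins if
// a name appears in both lists.
func applyOverrides(base []Tool, o ModelToolOverrides) []Tool {
	excluded := make(map[string]bool, len(o.Excluded))
	for _, name := range o.Excluded {
		excluded[name] = true
	}
	seen := make(map[string]bool, len(base))
	out := make([]Tool, 0, len(base)+len(o.Included))
	for _, t := range base {
		if excluded[t.Name] || seen[t.Name] {
			continue
		}
		seen[t.Name] = true
		out = append(out, t)
	}
	for _, name := range o.Included {
		if seen[name] || excluded[name] {
			continue
		}
		if t, ok := toolRegistry[name]; ok {
			seen[name] = true
			out = append(out, t)
		}
	}
	return out
}
```

One caveat on the lookup in `toolsFor`: it matches the model string exactly, so a dated identifier such as the `grok-4.20-0309-reasoning` seen in the quirk #4 error message would miss the `grok-4.20-reasoning` key and fall back to the base list. Normalizing the model name first, or matching on a prefix, avoids that.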

**Verify**: A/B `apply_diff`-only vs `search_replace`+`insert_after` on a grok-4.3 multi-edit task. If the latter completes more edits successfully, the override is correct.

## Bonus: pricing summary

| Model | Input | Output | Verdict |
|---|---|---|---|
| grok-4.3 | ~$0.50/M | ~$2.50/M | Use this for coding |
| grok-4.20-reasoning | ~$3/M | ~$15/M | 6× pricier, strictly worse for coding — don't use |

Verify rates via the trailing `usage.cost_in_usd_ticks` field (xAI-authoritative billed cost in 1e-10 USD units), not a constant. Variants and prices shift mid-month.
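
Converting the field is a single division, since one tick is 1e-10 USD. The helpers below are a sketch; `ticksToUSD` and `usdPerMillionTokens` are hypothetical names, and the second one assumes you divide a call's billed cost by its token count to sanity-check the table above.

```go
package main

// ticksPerUSD follows the unit stated above: one tick is 1e-10 USD.
const ticksPerUSD = 10_000_000_000

// ticksToUSD converts a cost_in_usd_ticks value to dollars.
func ticksToUSD(ticks int64) float64 {
	return float64(ticks) / ticksPerUSD
}

// usdPerMillionTokens derives an effective blended rate from a billed cost
// and the number of tokens it covered. It returns 0 when tokens is 0.
func usdPerMillionTokens(ticks, tokens int64) float64 {
	if tokens <= 0 {
		return 0
	}
	return ticksToUSD(ticks) * 1_000_000 / float64(tokens)
}
```

For example, 5,000,000,000 ticks is $0.50; if that covered 1,000,000 tokens, the effective rate is $0.50/M.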

## Common Mistakes

**Treating `wireError` as a string.** xAI returns object-shape errors on some 4xx responses. String-typed Error fields crash the whole-response decode. Use `json.RawMessage` + a decode helper.

**Running PARALLEL≥3 grok bench cells.** ~80% rate-limit-poisoned in our matrices. Even with retry, the noise contaminates per-cell signal. PARALLEL=1 is the right default for xAI.

**Picking grok-4.20-reasoning because "reasoning sounds better".** It is strictly worse for coding: slower, skips tests, 6× more expensive, 0/2 vs 1/2 on heavy POC. Pick grok-4.3 with `reasoning_effort: "high"`.

**Trusting grok-authored PR bodies.** Especially on REQUIRED-FIX retries, the body reads like a plan and the diff is missing claims. Diff-verify every bullet.

**Per-attempt timeouts without a total-time cap.** N retries × T timeout = N×T worst-case stuck time. Bind a ctx at the call edge, not just at each attempt. The 21-minute hang was avoidable.

**One tool list for "all grok models".** Per-model `excludedTools` is correct. Apply Roo's pattern: exclude `apply_diff`, include `search_replace`+`insert_after`+`append_to_file` for every grok variant.

