OFAT Matrix LLM Tuning: A Methodology for Picking Sampling Params, Tool Configs, and Prompts Without Guessing

OFAT Matrix LLM Tuning

When a new provider or model lands and you have to decide what temperature, max_tokens, tool_choice, prompt-shape, and turn budget to ship in production, the default is to pick by hunch. Read the model card, copy a partner adapter’s defaults, ship. A week later you find out reasoning_effort=high doubled cost for no quality gain, max_tokens=2048 silently truncated half your tier-3 runs, and the “prompt-rich” pattern you copied from grok-4.3 actively hurts kimi.
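
An OFAT (one-factor-at-a-time) matrix replaces the hunch with a small, enumerable set of runs: one baseline cell, plus one cell for each alternative value of each factor, with everything else held at baseline. The sketch below shows only the cell enumeration. The function name `ofat_cells`, the baseline values, and the candidate levels are all illustrative assumptions, not the values used in any matrix described here.

```python
from typing import Any


def ofat_cells(baseline: dict[str, Any], levels: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Return the baseline cell plus one cell per alternative level of each factor."""
    cells = [dict(baseline)]
    for factor, values in levels.items():
        for value in values:
            if value == baseline[factor]:
                continue  # the baseline cell already covers this level
            cell = dict(baseline)
            cell[factor] = value  # change exactly one factor
            cells.append(cell)
    return cells


# Hypothetical baseline and candidate levels, for illustration only.
baseline = {"temperature": 0.7, "max_tokens": 4096, "tool_choice": "auto"}
levels = {
    "temperature": [0.0, 0.7, 1.0],
    "max_tokens": [2048, 4096, 8192],
    "tool_choice": ["auto", "required"],
}

cells = ofat_cells(baseline, levels)
for cell in cells:
    print(cell)
```

Because each cell differs from the baseline in a single factor, any change in pass rate or cost can be attributed to that factor, at the price of not observing interactions between factors.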

Reasoning-Model Tuning Asymmetry: Why Thin Prompts Beat Rich Prompts (and When They Don't)

Reasoning-Model Tuning Asymmetry

Practitioners assume “better prompt = better output”. For one model class, that assumption is correct. For the other, the same prompt makes things measurably worse. This article documents the asymmetry, names the dividing line, and gives you a 4-cell test to confirm it on your own canary before you commit to a prompt.

The asymmetry is empirical, not theoretical. It shows up cleanly across four independent OFAT (one-factor-at-a-time) matrices run between 2026-05-18 and 2026-05-20: sonnet POC, grok matrix v1+v2, deepseek matrix v1, kimi matrix v1.

Stateful vs Stateless Agent Daemons: A-Mode /loop vs C-Mode cron

Stateful vs Stateless Agent Daemons

Long-running agents on the Max subscription split cleanly into two operating modes. A-mode keeps a single /loop session alive across cycles, accumulating in-session context that gets cleared once a day. C-mode wraps claude -p in a bash sleep loop; every cycle is a fresh process with zero carryover. Both run forever in tmux. Both cost $0 of Anthropic API spend (the subscription pays). They behave very differently per cycle.

The d4-rich Prompt Pattern: Unlocking Non-Reasoning Models on Multi-File Tasks

The d4-rich Prompt Pattern

Non-reasoning chat models (deepseek-V4-Flash, grok-4.3, kimi with thinking disabled) collapse on multi-file refactor tasks when given thin or baseline prompts: pass rates fall to 0-33% on canaries that reasoning models clear at 67-100%. The cheap fix is a three-part prompt addendum: a completion checklist, a callsites-exhaustively-updated rule, and a verify-before-push instruction. Drop it into the system prompt of a non-reasoning model and the canaries go green. Drop it into a reasoning model and you pay 12× more for no quality improvement.
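
A minimal sketch of how the addendum might be gated by model class follows. The addendum wording, the constant `D4_RICH_ADDENDUM`, and the helper `build_system_prompt` are hypothetical; only the three-part structure and the rule of skipping reasoning models come from the description above.

```python
# Hypothetical wording: only the three-part structure is taken from the text.
D4_RICH_ADDENDUM = "\n".join([
    "Before you finish:",
    "1. Completion checklist: confirm every requested change is done.",
    "2. Callsites: update every callsite of anything you renamed or changed.",
    "3. Verify before push: run the build and tests, and push only if they pass.",
])


def build_system_prompt(base_prompt: str, is_reasoning_model: bool) -> str:
    """Append the addendum for non-reasoning models; leave reasoning models thin."""
    if is_reasoning_model:
        return base_prompt
    return base_prompt.rstrip() + "\n\n" + D4_RICH_ADDENDUM


thin = build_system_prompt("You are a refactoring agent.", is_reasoning_model=True)
rich = build_system_prompt("You are a refactoring agent.", is_reasoning_model=False)
print(rich)
```

Keeping the decision in one function makes the model-class split explicit, so a new model has to be classified before it can receive either prompt shape.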

The Self-Ask Trap: Why LLMs Are Unreliable Sources About Their Own Quirks

The Self-Ask Trap

Practitioners ask the LLM about itself as a research shortcut: “What are your common quirks? What temperature should I use? Do you need reasoning_content echoed in multi-turn?” The output looks plausible, often cites specific behaviors, sometimes includes API parameter names. It is often wrong.

The 2026-05-20 kimi-k2.6 tuning research surfaced a clean example. Self-ask said one thing. Documentation, partner adapter source, GitHub issues, and direct API probes said the opposite. The model is provably wrong about itself, and the failure mode is structural — not specific to kimi.

xAI Grok Operational Quirks: Error Shapes, Rate-Limit HTML, and Per-Model Tool Surfaces

xAI Grok Operational Quirks

xAI’s Grok API is OpenAI-compatible on paper. In practice it has more wire-format edge cases than any other provider in production: error responses change shape, rate-limit pages come back as HTML, assistant turns reject missing fields with HTTP 422, and the two flagship models (grok-4.3 and grok-4.20-reasoning) have incompatible parameter sets. Wrap it carelessly and the adapter crashes the conversation mid-turn.

This page is the production-confirmed quirks list, each as Symptom → Cause → Fix → Verify. Numbers come from two OFAT matrix runs (15 cells × N=3 baseline, 3 cells × N=5 validation) on api.x.ai and the heavy-tier POC. Full synthesis: ~/.claude/projects/-Users-mstather/memory/project_xai_adapter_wireerror_bug_2026_05_19.md and project_grok_matrix_v1_2026_05_19.md.
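
As a rough illustration of wrapping the API defensively, the sketch below normalizes an error response body without assuming it is JSON. The helper name `normalize_error` and the two JSON shapes it handles (`error` as a plain string, or `error` as an object with a `message` field) are assumptions made for illustration; they are not taken from the quirks list or from xAI documentation. The only behavior drawn from the text is that the body may be HTML and that the shape may vary.

```python
import json


def normalize_error(status: int, body: str) -> dict:
    """Turn an error response body of unknown shape into one predictable dict."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all, e.g. an HTML rate-limit page: keep a short excerpt.
        return {"status": status, "kind": "non_json", "message": body.strip()[:200]}

    error = payload.get("error") if isinstance(payload, dict) else None
    if isinstance(error, dict):
        message = str(error.get("message", ""))  # assumed shape: {"error": {"message": ...}}
    elif isinstance(error, str):
        message = error  # assumed shape: {"error": "..."}
    else:
        message = json.dumps(payload)[:200]  # unknown shape: keep the raw payload
    return {"status": status, "kind": "json", "message": message}


print(normalize_error(429, "<html><body>Too Many Requests</body></html>"))
print(normalize_error(422, '{"error": "missing field"}'))
```

The point of the design is that no branch raises: an unexpected body becomes a structured error the caller can log or retry on, rather than an exception in the middle of a turn.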

CircleCI Pipeline Patterns: Orbs, Executors, Workspaces, Parallelism, and Approval Workflows

CircleCI Pipeline Patterns

CircleCI pipelines are defined in .circleci/config.yml. The configuration model uses workflows to orchestrate jobs, jobs to define execution units, and steps to define commands within a job. Every job runs inside an executor – a Docker container, Linux VM, macOS VM, or Windows VM.

Config Structure and Executors

A minimal config defines a job and a workflow:

version: 2.1

executors:
  go-executor:
    docker:
      - image: cimg/go:1.22
    resource_class: medium
    working_directory: ~/project

jobs:
  build:
    executor: go-executor
    steps:
      - checkout
      - run:
          name: Build application
          command: go build -o myapp ./cmd/myapp

workflows:
  main:
    jobs:
      - build

Named executors let you reuse environment definitions across jobs. The resource_class controls CPU and memory. For the Docker executor the common classes are small (1 vCPU/2GB), medium (2 vCPU/4GB), large (4 vCPU/8GB), and xlarge (8 vCPU/16GB). Choose the smallest class that avoids OOM kills to keep costs down.

Golden Paths and Paved Roads

What Golden Paths Are

A golden path is a pre-built, opinionated workflow that gets a developer from zero to a production-ready artifact with minimal decisions. The term comes from Spotify’s internal platform work. Netflix calls them “paved roads.” The idea is the same: provide a well-maintained, well-tested default path that handles 80% of use cases, while allowing teams to go off-road when they have legitimate reasons.

A golden path is not a mandate. It is a recommendation backed by automation. Create a new Go microservice using the golden path and you get a repository with CI/CD, Kubernetes manifests, observability, and a Backstage catalog entry — working in minutes. The golden path removes the 40+ decisions a developer would otherwise need to make.

Saga Pattern: Choreography, Orchestration, and Compensating Transactions

Saga Pattern

In a monolith, a single database transaction can span multiple operations atomically. In microservices, each service owns its database. There is no distributed transaction that works reliably across services. The saga pattern solves this by breaking a transaction into a sequence of local transactions, each with a corresponding compensating transaction that undoes its work if a later step fails.

The Problem: No Distributed ACID

Consider an order placement that must: (1) reserve inventory, (2) charge payment, (3) create shipment. In a monolith, this is one transaction. In microservices, these are three services with three databases. Two-phase commit (2PC) across them is fragile and slow, and most message brokers and modern databases do not support it across service boundaries.
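
The order example can be sketched as an in-process saga: each step pairs a local action with a compensating action, and a failure triggers the compensations of the completed steps in reverse order. This is a simplified illustration under stated assumptions. The names `run_saga`, `reserve`, `charge`, `ship` and the in-memory `state` dict are hypothetical stand-ins for calls to three separate services, and a real saga would also need durable state and retries.

```python
from typing import Callable

# (step name, local action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return log
        log.append(f"{name}: done")
        completed.append((name, compensate))
    return log


# Hypothetical stand-ins for the inventory, payment, and shipping services.
state = {"reserved": False, "charged": False}

def reserve() -> None: state["reserved"] = True
def release() -> None: state["reserved"] = False
def charge() -> None: state["charged"] = True
def refund() -> None: state["charged"] = False
def ship() -> None: raise RuntimeError("carrier unavailable")
def cancel_shipment() -> None: pass

log = run_saga([
    ("reserve_inventory", reserve, release),
    ("charge_payment", charge, refund),
    ("create_shipment", ship, cancel_shipment),
])
print(log)
```

Here the shipment step fails, so the payment is refunded and the inventory released, in that order. The failed step itself is not compensated because its local transaction never committed.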

On-Call Rotation Design

On-Call Is a System, Not a Schedule

On-call done wrong burns out engineers and degrades reliability simultaneously. Exhausted responders make worse decisions, and teams that dread on-call avoid owning production systems. Done right, on-call is sustainable, well-compensated, and generates signal that drives real reliability improvements.

Rotation Schedule Types

Weekly Rotation

Each engineer is primary on-call for one full week, Monday to Monday. This is the simplest model and works for teams of 5 or more in a single timezone.
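
A weekly rotation reduces to simple date arithmetic, sketched below. The function `primary_on_call`, the engineer names, and the start date are illustrative assumptions. The sketch treats the handover as happening at the start of each Monday and ignores time of day, overrides, and holidays.

```python
from datetime import date


def primary_on_call(engineers: list[str], rotation_start: date, day: date) -> str:
    """Return the primary for `day`; `rotation_start` must be a Monday."""
    if rotation_start.weekday() != 0:
        raise ValueError("rotation_start must be a Monday")
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Hypothetical five-person team; 2024-01-01 was a Monday.
team = ["ana", "ben", "chen", "dee", "eli"]
start = date(2024, 1, 1)

print(primary_on_call(team, start, date(2024, 1, 7)))  # Sunday of the first week
print(primary_on_call(team, start, date(2024, 1, 8)))  # Monday of the second week
```

With five engineers, each person is primary one week in five, and the cycle returns to the first engineer on the sixth Monday.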