DeepSeek V4 Operational Quirks: Pro vs Flash, Reasoning Echo, and the Discount Cliff

DeepSeek V4 Operational Quirks#

DeepSeek V4 ships two models behind one OpenAI-compatible API: V4-Pro (reasoning) at $1.74/M input / $3.48/M output and V4-Flash (chat) at $0.28/M input / $1.10/M output. Until 2026-05-31 V4-Pro carries a 75% discount, putting it at $0.435/M input — cheap enough to use as a heavy-tier coding model. After that, the cost steps up 4×.

The two models live on the same endpoint but want very different things. V4-Pro behaves like a reasoning model (thin prompts, reasoning_content echo required, tool_choice restrictions). V4-Flash behaves like a chat model (rich prompts win dramatically; rejects nothing). Confuse them and your matrix lights up red.

LLM Adapter Audit Checklist: 10 Bugs That Hide in OpenAI-Compatible Providers

LLM Adapter Audit Checklist#

When you wrap an OpenAI-compatible LLM provider (Moonshot, DeepSeek, xAI, Together, Fireworks, OpenRouter, vLLM, anything else that exposes POST /v1/chat/completions) in a Go HTTP client, the same ten bug classes show up. They all silently degrade or break the agent — none of them crash loudly. Each was observed in production across at least one of xAI, DeepSeek, or Moonshot during a two-week audit period.

This checklist is the audit. Run it against any new adapter before shipping. Each entry is Symptom → Cause → Fix with a code shape you can grep your repo for.

Moonshot Kimi K2.6 Operational Quirks: What Breaks in Production

Moonshot Kimi K2.6 Operational Quirks#

Kimi K2.6 is one of the cheapest competent reasoning models — $0.95/M input cache-miss, $0.16/M cache-hit, $4.00/M output, 256K context. It is also one of the most opinionated. Half of what works on OpenAI breaks here, and the failures are silent: empty content, mid-reasoning truncation, 400 errors that don’t mention the actual problem, and a cache key parameter that makes cost go up instead of down.

xAI Grok Operational Quirks: Error Shapes, Rate-Limit HTML, and Per-Model Tool Surfaces

xAI Grok Operational Quirks#

xAI’s Grok API is OpenAI-compatible on paper. In practice it has more wire-format edge cases than any other provider in production: error responses change shape, rate-limit pages come back as HTML, assistant turns reject missing fields with HTTP 422, and the two flagship models (grok-4.3 and grok-4.20-reasoning) have incompatible parameter sets. Wrap it carelessly and the adapter crashes the conversation mid-turn.

This page is the production-confirmed quirks list, each as Symptom → Cause → Fix → Verify. Numbers come from two OFAT matrix runs (15 cells × N=3 baseline, 3 cells × N=5 validation) on api.x.ai and the heavy-tier POC. Full synthesis: ~/.claude/projects/-Users-mstather/memory/project_xai_adapter_wireerror_bug_2026_05_19.md and project_grok_matrix_v1_2026_05_19.md.

Temporal High Availability: Multi-Component Cluster on Kubernetes

Temporal High Availability#

A single-replica Temporal deployment works for development, but any pod going down takes the workflow engine offline. This guide configures a multi-replica cluster with proper resource allocation, Elasticsearch visibility, and health monitoring.

For the single-replica setup this builds on, see Running Temporal Server on Minikube.

Why HA Matters#

ComponentWhat Breaks When It Goes Down
FrontendNo client can start, signal, query, or cancel workflows. Workers cannot poll.
HistoryRunning workflows stall. No state transitions. Timers do not fire.
MatchingTasks queue up but never dispatch. Workflows appear frozen.
WorkerInternal system workflows stop (archival, replication). Application workflows unaffected.

With multiple replicas, losing a pod triggers a brief rebalance (seconds), not an outage.

From Empty Cluster to Production-Ready: The Complete Setup Sequence

From Empty Cluster to Production-Ready#

This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.

Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.


Phase 1 – Foundation (Day 1)#

Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.

Kubernetes Production Readiness Checklist: Everything to Verify Before Going Live

Kubernetes Production Readiness Checklist#

This checklist is designed for agents to audit a Kubernetes cluster before production workloads run on it. Every item includes the verification command and what a passing result looks like. Work through each category sequentially. A failing item in Cluster Health should be fixed before checking Workload Configuration.


Cluster Health#

These are non-negotiable. If any of these fail, stop and fix them before evaluating anything else.

Minikube to Cloud Migration: 10 Things That Change on EKS, GKE, and AKS

Minikube to Cloud Migration Guide#

Minikube is excellent for learning and local development. But almost everything that “just works” on minikube requires explicit configuration on a cloud cluster. Here are the 10 things that change.

1. Ingress Controller Becomes a Cloud Load Balancer#

On minikube: You enable the NGINX ingress addon with minikube addons enable ingress. Traffic reaches your services through minikube tunnel or minikube service.

On cloud: The ingress controller must be deployed explicitly, and it provisions a real cloud load balancer. On AWS, the AWS Load Balancer Controller creates ALBs or NLBs from Ingress resources. On GKE, the built-in GCE ingress controller creates Google Cloud Load Balancers. You pay per load balancer.

Sandbox to Production: The Complete Workflow for Verified Infrastructure Deliverables

Sandbox to Production#

An agent that produces infrastructure deliverables works in a sandbox. It does not touch production. It does not reach into someone else’s cluster, database, or cloud account. It works in an isolated environment, tests its work, captures the results, and hands the human a verified deliverable they can execute on their own infrastructure.

This is not a limitation – it is a design choice. The output is always a deliverable, never a direct action on someone else’s systems. This boundary is what makes the approach safe enough for production infrastructure work and trustworthy enough for enterprise change management.