Builder Pool Naming: The (role, tier, replica) Coordinate Decouples Identity From Model

May 20, 2026

Fleet-Architecture, Identity-Design, Pool-Management

Agent-Fleet, Pool-Naming, Identity, Kubernetes, Mattermost, Gitea, Operations

Builder Pool Naming: The (role, tier, replica) Coordinate

Naming agent pools after the model they run today (kimi-N, deepseek-N, flash-N, lite-N) felt natural when each pool ran one model. It stopped feeling natural the third time a pool’s model churned: the lite tier swapped through qwen → gemma → gemini in six weeks, and every rename cascaded through Kubernetes manifests, secret names, Mattermost bot accounts, Gitea identities, and Helm values. The fix was to make pool names model-independent: builder-lite-0 runs whatever model the pool config assigns it today.
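A minimal sketch of the coordinate idea, assuming the name is simply the three parts joined with hyphens (as builder-lite-0 suggests). The `model_for_tier` lookup is a hypothetical stand-in for the pool config, and its single entry is taken from the churn example above, not from any real configuration.

```shell
#!/bin/sh
# Build a model-independent pool name from the (role, tier, replica) coordinate.
pool_name() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Hypothetical pool config: the only place a model name appears.
# Swapping the lite tier's model means editing this mapping and nothing else.
model_for_tier() {
  case "$1" in
    lite) echo "gemini" ;;
    *)    echo "unknown"; return 1 ;;
  esac
}

name=$(pool_name builder lite 0)
echo "$name runs $(model_for_tier lite)"
```

Manifests, secrets, and bot accounts would reference only the output of `pool_name`, so a model swap never touches them.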

Claude Code /loop Daemon Hygiene: Daily Clear + Delete-Before-Create Crons

May 20, 2026

Agent-Tooling

Intermediate, Advanced

Daemon-Operations, Context-Window-Management, Agent-Runbook-Design

Claude-Code, Loop, Daemon, Context-Bloat, Cron, Tmux, Operations, Anthropic

Claude-Code, Tmux, Bash

Claude Code /loop Daemon Hygiene

Running claude /loop 5m /role-daemon in a tmux session is the easiest way to run an autonomous agent on a Max subscription: one command, and it comes back every five minutes forever. It works perfectly for the first hour. By hour six it has accumulated 50,000+ tokens of stale “in cycle 47 I posted to MM” history that is sent to Anthropic with every prompt. By day two it has three overlapping cron entries firing the same daemon every two minutes instead of every five. By day three it has auto-compact-exited and the tmux session is bare.
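The overlapping-entries failure is what a delete-before-create step guards against. The excerpt does not show how the entries are registered, so the sketch below assumes a plain crontab-style text list as a stand-in. The `managed:` marker convention, the function name, and the script path in the usage comment are all hypothetical.

```shell
#!/bin/sh
# Delete-before-create: read a crontab-style list on stdin, drop every line
# tagged with the marker, then append exactly one fresh entry. Running it
# repeatedly gives the same result, so entries cannot pile up.
replace_cron_entry() {
  marker=$1
  entry=$2
  # grep exits non-zero when nothing is left to print; that is not an error here.
  grep -v -F "# managed:$marker" || true
  printf '%s # managed:%s\n' "$entry" "$marker"
}

# Hypothetical usage against the system crontab (not executed here):
#   crontab -l 2>/dev/null \
#     | replace_cron_entry role-daemon '*/5 * * * * /path/to/run-daemon.sh' \
#     | crontab -
```

Because the marker is matched as a fixed substring, marker names should not be prefixes of one another.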

Stateful vs Stateless Agent Daemons: A-Mode /loop vs C-Mode cron

May 20, 2026

Agent-Tooling

Intermediate, Advanced

Daemon-Design, Agent-Architecture, Cost-Management

Claude-Code, Daemon, Loop, Tmux, State-Management, Fleet-Design, Operations

Claude-Code, Tmux, Bash

Stateful vs Stateless Agent Daemons

Long-running agents on the Max subscription split cleanly into two operating modes. A-mode keeps a single /loop session alive across cycles, accumulating in-session context that gets cleared once a day. C-mode wraps claude -p in a bash sleep loop; every cycle is a fresh process with zero carryover. Both run forever in tmux. Both cost $0 of Anthropic API spend (the subscription pays). They behave very differently per cycle.
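A rough sketch of the C-mode shape, assuming nothing beyond what the paragraph describes: a shell loop that starts a fresh process each cycle and then sleeps. The `run_cycles` helper, its arguments, and the prompt text in the usage comment are hypothetical. The cycle limit exists only so the loop can be exercised without running forever.

```shell
#!/bin/sh
# C-mode sketch: run a command as a fresh process each cycle, then sleep.
# max_cycles=0 means loop forever, which is how it would run inside tmux.
run_cycles() {
  cmd=$1
  interval=$2
  max_cycles=$3
  cycle=0
  while [ "$max_cycles" -eq 0 ] || [ "$cycle" -lt "$max_cycles" ]; do
    # Each invocation is a new process with no history from earlier cycles.
    sh -c "$cmd" || echo "cycle $cycle failed" >&2
    cycle=$((cycle + 1))
    sleep "$interval"
  done
}

# Hypothetical invocation (placeholder prompt, not executed here):
#   run_cycles 'claude -p "run one daemon cycle"' 300 0
```

Any state that must survive between cycles has to live outside the process, for example in files the command reads at startup.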

Running 7 Helm-Managed Services on One Kubernetes Cluster: A Cross-Cutting Survey

May 7, 2026

Platform-Engineering

Intermediate

Helm-Multi-Service-Operation, Single-Node-Capacity-Planning, Helm-Values-Customization, K8s-Debugging

Helm, Kubernetes, Single-Node, Minikube, Arm64, Operations, Homelab

Helm, Kubectl, Minikube, Docker

A single-node Kubernetes cluster running seven Helm-managed services concurrently — Gitea, Mattermost, PostgreSQL, kube-prometheus-stack, Jenkins, Temporal, and NATS — looks tractable on paper. The charts are all upstream-maintained. The hardware is modest but adequate. The operational reality is that zero of the seven ran cleanly on out-of-the-box values. Every chart needed at least one customization to coexist with the others, and several needed substantial rewrites of the helm-values surface. This survey catalogs what those customizations are, why each was necessary, and what the common failure modes look like across the fleet.

CockroachDB Day-2 Operations

February 22, 2026

Databases

Intermediate

Cockroachdb-Administration, Database-Operations, Disaster-Recovery

Cockroachdb, Operations, Backup, Monitoring, Cdc, Multi-Region

Cockroach, Kubectl, Db-Console

Adding and Removing Nodes

Adding a node: start a new cockroach process with --join pointing to existing nodes. CockroachDB automatically rebalances ranges to the new node.

cockroach start --insecure --store=node4-data \
  --advertise-addr=node4:26257 \
  --join=node1:26257,node2:26257,node3:26257

Watch rebalancing in the DB Console under Metrics > Replication, or query directly:

SELECT node_id, range_count, lease_count FROM crdb_internal.kv_store_status;

Decommissioning a node moves all range replicas off before shutdown, preventing under-replication:

cockroach node decommission 4 --insecure --host=node1:26257

# Monitor progress
cockroach node status --insecure --host=node1:26257 --decommission

Do not simply kill a node. Without decommissioning, CockroachDB treats the loss as a failure and, by default, waits 5 minutes before re-replicating, so the node's ranges stay under-replicated in the meantime. On Kubernetes with the operator, scale by changing spec.nodes in the CrdbCluster resource.

From Empty Cluster to Production-Ready: The Complete Setup Sequence

February 22, 2026

Kubernetes

Intermediate

Cluster-Bootstrapping, Production-Hardening, Infrastructure-Automation

Cluster-Setup, Production, Operations, Rbac, Ingress, Cert-Manager, Observability, Security, Gitops, Disaster-Recovery

Kubectl, Helm, Argocd, Cert-Manager, Prometheus, Velero

From Empty Cluster to Production-Ready

This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.

Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.

Phase 1 – Foundation (Day 1)

Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.

Kubernetes Production Readiness Checklist: Everything to Verify Before Going Live

February 22, 2026

Kubernetes

Intermediate

Cluster-Auditing, Production-Readiness-Assessment, Pre-Launch-Verification

Production, Checklist, Audit, Security, Reliability, Observability, Operations

Kubectl, Helm, Trivy, Kube-Bench

Kubernetes Production Readiness Checklist

This checklist is designed for agents to audit a Kubernetes cluster before production workloads run on it. Every item includes the verification command and what a passing result looks like. Work through each category sequentially. A failing item in Cluster Health should be fixed before checking Workload Configuration.

Cluster Health

These are non-negotiable. If any of these fail, stop and fix them before evaluating anything else.
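As an illustration of the "verification command plus passing result" shape, here is a sketch of one such check. It assumes node readiness is among the Cluster Health items, which the excerpt does not confirm. The `all_nodes_ready` helper is hypothetical; it only parses the standard column layout of `kubectl get nodes --no-headers` (NAME, STATUS, ...).

```shell
#!/bin/sh
# Read `kubectl get nodes --no-headers` output on stdin and fail if any
# node's STATUS column is not exactly "Ready". "NotReady" and
# "Ready,SchedulingDisabled" both count as failures, and so does an empty
# listing, since zero nodes is not a healthy cluster.
all_nodes_ready() {
  awk '$2 != "Ready" { print "not ready: " $1; bad = 1 }
       END { exit (bad || NR == 0) }'
}

# Hypothetical usage (not executed here):
#   kubectl get nodes --no-headers | all_nodes_ready || exit 1
```

A passing result is silent with exit status 0; a failing one names each offending node and exits non-zero, which is what "stop and fix" needs.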

MongoDB Operational Patterns

February 22, 2026

Databases

Intermediate

Mongodb-Administration, Replica-Set-Management, Sharding-Operations, Mongodb-Backup, Query-Optimization

Mongodb, Replica-Set, Sharding, Mongodump, Indexing, Explain, Mongostat, Mongotop, Operations

Mongosh, Mongod, Mongos, Mongodump, Mongorestore, Mongostat, Mongotop

MongoDB Operational Patterns

MongoDB operations center on three areas: keeping the cluster healthy (replica sets and sharding), protecting data (backups), and keeping queries fast (indexes and explain plans). This reference covers the practical commands and patterns for each.

Replica Set Setup

A replica set is the minimum production deployment – three data-bearing members that elect a primary and maintain identical copies of the data.

Launching Members

Each member runs mongod with the same --replSet name:
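The excerpt ends before showing the commands, so the following is only a sketch of what three such launches could look like on a single development host. The replica set name rs0, the ports, and the /data paths are assumptions. The helper prints the command lines rather than starting any processes.

```shell
#!/bin/sh
# Print the mongod command line for one replica set member.
# Every member gets the same --replSet name; port and data path differ.
member_cmd() {
  rs_name=$1
  index=$2
  port=$((27017 + index))
  printf 'mongod --replSet %s --port %s --dbpath /data/%s-%s --bind_ip localhost\n' \
    "$rs_name" "$port" "$rs_name" "$index"
}

# Three data-bearing members, matching the minimum described above.
for i in 0 1 2; do
  member_cmd rs0 "$i"
done
```

In production each member would normally run on its own host and bind to a reachable address instead of localhost.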

Structuring Effective On-Call Runbooks: Format, Escalation, and Diagnostic Decision Trees

February 22, 2026

Observability

Intermediate

Runbook-Authoring, Escalation-Design, Incident-Triage, Diagnostic-Decision-Trees

Runbooks, On-Call, Incident-Response, Escalation, Alerting, Operations, Sre, Pagerduty, Opsgenie

Alertmanager, Pagerduty, Opsgenie, Grafana, Prometheus, Kubectl

Why Runbooks Exist

An on-call engineer paged at 3 AM has limited cognitive capacity. They may not be familiar with the specific service that is failing. They may have joined the team two weeks ago. A runbook bridges the gap between the alert firing and the correct human response. Without runbooks, incident response depends on tribal knowledge – the engineer who built the service and knows its failure modes. That engineer is on vacation when the incident hits.