CockroachDB Day-2 Operations

Adding and Removing Nodes#

Adding a node: start a new cockroach process with --join pointing to existing nodes. CockroachDB automatically rebalances ranges to the new node.

cockroach start --insecure --store=node4-data \
  --advertise-addr=node4:26257 \
  --join=node1:26257,node2:26257,node3:26257

Watch rebalancing in the DB Console under Metrics > Replication, or query directly:

SELECT node_id, range_count, lease_count FROM crdb_internal.kv_store_status;

Decommissioning a node moves all of its range replicas to other nodes before shutdown, preventing under-replication:

cockroach node decommission 4 --insecure --host=node1:26257

# Monitor progress
cockroach node status --insecure --host=node1:26257 --decommission

Do not simply kill a node. Without decommissioning, CockroachDB treats it as a failure, waits 5 minutes (the default server.time_until_store_dead) before declaring the store dead, and only then re-replicates its ranges, leaving data under-replicated in the meantime. On Kubernetes with the operator, scale by changing spec.nodes in the CrdbCluster resource.
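
As a sketch of what that operator-driven scaling looks like, one option is a merge patch on the CrdbCluster resource; the cluster name cockroachdb and namespace cockroach-operator-system below are assumptions and will differ per deployment:

# Scale to 4 nodes by setting spec.nodes (cluster name and namespace are illustrative)
kubectl patch crdbcluster cockroachdb \
  -n cockroach-operator-system \
  --type merge -p '{"spec":{"nodes":4}}'

# Watch the new pod come up, then confirm rebalancing as described above
kubectl get pods -n cockroach-operator-system -w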

From Empty Cluster to Production-Ready: The Complete Setup Sequence#

This is the definitive operational plan for taking a fresh Kubernetes cluster and making it production-ready. Each phase builds on the previous one, with verification steps between phases and rollback notes where applicable. An agent should be able to follow this sequence end-to-end.

Estimated timeline: 5 days for a single operator. Phases 1-2 are blocking prerequisites. Phases 3-6 can partially overlap.


Phase 1 – Foundation (Day 1)#

Everything else depends on a healthy cluster with proper namespacing and storage. Do not proceed until every verification step passes.
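
As a sketch of what those verification steps cover, a minimal Phase 1 pass checks node health, storage, and namespacing; the production namespace name below is an assumption:

kubectl get nodes                                    # every node should report STATUS Ready
kubectl get storageclass                             # one StorageClass should be marked (default)
kubectl get namespace production                     # the target namespace exists
kubectl -n production get resourcequota,limitrange   # quotas and limits are applied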

Kubernetes Production Readiness Checklist: Everything to Verify Before Going Live#

This checklist is designed for agents to audit a Kubernetes cluster before production workloads run on it. Every item includes the verification command and what a passing result looks like. Work through each category sequentially. A failing item in Cluster Health should be fixed before checking Workload Configuration.


Cluster Health#

These are non-negotiable. If any of these fail, stop and fix them before evaluating anything else.
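
For illustration of the command-plus-passing-result format (these are generic health checks, not the checklist items themselves), a first pass typically runs:

kubectl get nodes -o wide                             # pass: every node Ready, kubelet versions consistent
kubectl get pods -n kube-system                       # pass: control-plane, DNS, and CNI pods Running
kubectl get events -A --field-selector type=Warning   # pass: no recurring warnings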

MongoDB Operational Patterns#

MongoDB operations center on three areas: keeping the cluster healthy (replica sets and sharding), protecting data (backups), and keeping queries fast (indexes and explain plans). This reference covers the practical commands and patterns for each.

Replica Set Setup#

A replica set is the minimum production deployment – three data-bearing members that elect a primary and maintain identical copies of the data.

Launching Members#

Each member runs mongod with the same --replSet name:
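
A minimal sketch follows; the set name rs0, the hostname mongo1, and the data path are placeholder values to adapt per member.

# Sketch only: run one mongod per member, changing the hostname in --bind_ip
mongod --replSet rs0 \
  --port 27017 \
  --dbpath /var/lib/mongodb \
  --bind_ip localhost,mongo1

Once all three members are running, rs.initiate() with a members list covering the three hosts (run from mongosh on any one member) forms the set.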

Structuring Effective On-Call Runbooks: Format, Escalation, and Diagnostic Decision Trees

Why Runbooks Exist#

An on-call engineer paged at 3 AM has limited cognitive capacity. They may not be familiar with the specific service that is failing. They may have joined the team two weeks ago. A runbook bridges the gap between the alert firing and the correct human response. Without runbooks, incident response depends on tribal knowledge – the engineer who built the service and knows its failure modes. That engineer is on vacation when the incident hits.