---
title: "CockroachDB Debugging and Troubleshooting"
description: "Diagnosing and resolving common CockroachDB problems: node liveness, under-replicated ranges, slow queries, transaction retries, hot ranges, clock skew, and storage issues."
url: https://agent-zone.ai/knowledge/databases/cockroachdb-debugging/
section: knowledge
date: 2026-02-22
categories: ["databases"]
tags: ["cockroachdb","debugging","performance","transactions","troubleshooting"]
skills: ["cockroachdb-debugging","distributed-database-troubleshooting","query-optimization"]
tools: ["cockroach","db-console","psql"]
levels: ["intermediate"]
word_count: 843
formats:
  json: https://agent-zone.ai/knowledge/databases/cockroachdb-debugging/index.json
  html: https://agent-zone.ai/knowledge/databases/cockroachdb-debugging/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=CockroachDB+Debugging+and+Troubleshooting
---


## Node Liveness Issues

Every node must renew its liveness record every 4.5 seconds. Failure to renew marks the node suspect, then dead, triggering re-replication of its ranges.

```bash
cockroach node status --insecure --host=localhost:26257
```

Look at `is_live`. If a node shows `false`, check in order:

**Process crashed.** Check `cockroach-data/logs/` for `fatal` or `panic` entries. OOM kills are the most common cause -- check `dmesg | grep -i oom` on the host.

**Network partition.** The node runs but cannot reach peers. If `cockroach node status` succeeds locally but fails from other nodes, the problem is network-level (firewalls, security groups, DNS).

**Disk stall.** CockroachDB logs `disk stall detected` when storage I/O blocks liveness renewal:

```sql
SELECT node_id, store_id, available, used, capacity
FROM crdb_internal.kv_store_status;
```
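
Whatever the underlying cause, the raw liveness records are visible from SQL. A quick look (column names vary slightly across versions, so treat this as a sketch):

```sql
-- Raw liveness records: epoch, expiration, and draining/decommission state per node
SELECT node_id, epoch, draining, membership, expiration
FROM crdb_internal.gossip_liveness;
```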

## Under-Replicated Ranges

An under-replicated range has fewer replicas than its configured replication factor. This is normal briefly during node failures or rebalancing, but persistent under-replication needs attention.

```sql
SELECT range_id, start_key, end_key, replicas, lease_holder
FROM crdb_internal.ranges
WHERE array_length(replicas, 1) < 3;
```

Common causes: too few nodes for the replication factor, a decommission that never completed, or zone constraints that cannot be satisfied. If a node was lost permanently, re-replication starts after `server.time_until_store_dead` (default 5 minutes). If under-replication persists, check for stale locality constraints:

```sql
SHOW ZONE CONFIGURATIONS;
```
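
A quick sanity check is to compare the configured replication factor against the number of live nodes and the store-dead window; a sketch:

```sql
-- Replication factor for the default zone (per-table zones can override it)
SHOW ZONE CONFIGURATION FROM RANGE default;

-- Live node count: must be at least num_replicas or under-replication cannot resolve
SELECT count(*) AS live_nodes FROM crdb_internal.gossip_nodes WHERE is_live;

-- How long the cluster waits before declaring a lost store dead and re-replicating
SHOW CLUSTER SETTING server.time_until_store_dead;
```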

## Slow Queries: EXPLAIN ANALYZE

CockroachDB distributes query execution across nodes. `EXPLAIN ANALYZE` shows the distributed execution plan with per-node timing.

```sql
EXPLAIN ANALYZE SELECT o.id, o.total, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01' AND o.total > 100
ORDER BY o.total DESC LIMIT 50;
```

Key things to look for:

- **Full table scans.** `spans: FULL SCAN` means no index is being used. Add an appropriate index (a sketch follows this list).
- **High intermediate row counts.** A join reading millions of rows to return 50 means filters are applied too late.
- **Network overhead.** Disproportionate `KV rows read` on one node suggests data locality issues.
- **Contention time.** Non-zero values mean the query waited on locks from other transactions.
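
For the sample query above, a covering index is one plausible fix; the index name and the `STORING` column list here are illustrative, not prescriptive:

```sql
-- Hypothetical covering index for the orders query: filter on created_at,
-- keep total and customer_id in the index so the join input avoids a full scan.
CREATE INDEX IF NOT EXISTS idx_orders_created_total
    ON orders (created_at, total) STORING (customer_id);
```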

Find the slowest queries cluster-wide:

```sql
SELECT fingerprint_id, metadata ->> 'query' AS query,
       statistics -> 'statistics' -> 'latencyInfo' ->> 'p99' AS p99_latency,
       statistics -> 'statistics' ->> 'cnt' AS exec_count
FROM crdb_internal.statement_statistics
ORDER BY (statistics -> 'statistics' -> 'latencyInfo' ->> 'p99')::FLOAT DESC
LIMIT 20;
```

## Transaction Retry Errors (SQLSTATE 40001)

Under serializable isolation, conflicting transactions can fail with SQLSTATE `40001` and must be retried. This is expected behavior, not a bug, so your application must implement a retry loop:

```python
import psycopg2
import time

def run_transaction(conn, callback, max_retries=5):
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cur:
                callback(cur)
                conn.commit()
                return
        except psycopg2.errors.SerializationFailure:
            conn.rollback()
            sleep_time = (2 ** attempt) * 0.01  # exponential backoff
            time.sleep(sleep_time)
    raise Exception("Transaction failed after max retries")

def transfer_funds(cur):
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")

conn = psycopg2.connect("postgresql://appuser@localhost:26257/myapp?sslmode=disable")
conn.set_session(autocommit=False)
run_transaction(conn, transfer_funds)
```

If you see high retry rates, investigate contention:

```sql
-- Find transactions with the most contention
SELECT fingerprint_id,
       metadata -> 'stmtFingerprintIDs' AS stmt_fingerprint_ids,
       statistics -> 'execution_statistics' -> 'contentionTime' ->> 'mean' AS avg_contention
FROM crdb_internal.transaction_statistics
WHERE (statistics -> 'execution_statistics' -> 'contentionTime' ->> 'mean')::FLOAT > 0
ORDER BY (statistics -> 'execution_statistics' -> 'contentionTime' ->> 'mean')::FLOAT DESC
LIMIT 10;
```

Reduce contention by keeping transactions small, avoiding long-running transactions, and not funneling concurrent updates through hot rows.
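
When concurrent writes to the same rows are unavoidable, locking them at the start of the transaction often converts retries into short waits; a sketch reusing the `accounts` table from the retry example:

```sql
BEGIN;
-- SELECT FOR UPDATE takes an exclusive lock, so later writers queue behind
-- this transaction instead of triggering serialization retries.
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```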

## Hot Ranges

A hot range receives disproportionate traffic, creating a bottleneck on its leaseholder node. The DB Console "Hot Ranges" page shows the top ranges by QPS.

From SQL:

```sql
SELECT range_id, start_key, end_key, queries_per_second, lease_holder
FROM crdb_internal.ranges
ORDER BY queries_per_second DESC
LIMIT 10;
```

Common causes and fixes:

- **Sequential primary keys.** `SERIAL` or auto-incrementing IDs cause all inserts to hit the last range. Fix: use `UUID` primary keys with `gen_random_uuid()`, or use `HASH`-sharded indexes.
- **Single-row hot spot.** A counter or status row updated by every request. Fix: shard the counter across multiple rows and sum on read (a sharded-counter sketch follows the example below).

For write-heavy tables, combine a `UUID` primary key with a hash-sharded secondary index on the timestamp column:

```sql
CREATE TABLE events (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  created_at TIMESTAMPTZ DEFAULT now(),
  data JSONB,
  INDEX idx_events_created (created_at) USING HASH
);
```
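
For the single-row hot spot, a sharded counter spreads the writes across several rows; this is a sketch with a hypothetical `page_views` table and 16 shards:

```sql
-- Hypothetical sharded counter: each increment lands on a random shard row.
CREATE TABLE page_views (
  page_id UUID NOT NULL,
  shard   INT NOT NULL,           -- 0..15
  views   INT NOT NULL DEFAULT 0,
  PRIMARY KEY (page_id, shard)
);

-- Increment: pick a random shard instead of updating one hot row.
INSERT INTO page_views (page_id, shard, views)
VALUES ('a4c8ddb4-43bd-4d04-8a0a-3c0d1a1f3e11', floor(random() * 16)::INT, 1)
ON CONFLICT (page_id, shard) DO UPDATE SET views = page_views.views + 1;

-- Read: sum the shards.
SELECT sum(views) FROM page_views
WHERE page_id = 'a4c8ddb4-43bd-4d04-8a0a-3c0d1a1f3e11';
```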

## Clock Skew Issues

CockroachDB requires synchronized clocks across nodes. The default maximum clock offset is 500ms. If a node's clock drifts beyond this, it self-terminates to protect consistency.

Symptoms: a node crashes with `clock synchronization error` in logs, or transactions fail with timestamp-related errors.

```sql
-- Check the mean clock offset per node (also visible in the DB Console)
SELECT node_id,
       (metrics ->> 'clock-offset.meannanos')::FLOAT / 1e6 AS offset_ms
FROM crdb_internal.kv_node_status;
```

Fix: Run NTP or chrony on every node. On Kubernetes, node clocks come from the host -- ensure the host has NTP configured. On cloud VMs, the cloud provider typically handles this, but verify with `timedatectl status`.

## Storage Capacity Problems

CockroachDB stores data in Pebble, an LSM-tree storage engine. As a store fills up (past roughly 95% used), the allocator stops placing new replicas on it, and a completely full disk will bring the node down.

```sql
-- Check per-node storage
SELECT node_id, store_id,
       (used::FLOAT / capacity::FLOAT * 100)::INT AS pct_used,
       available / 1024 / 1024 / 1024 AS available_gib
FROM crdb_internal.kv_store_status;
```

CockroachDB pre-allocates a ballast file (`auxiliary/EMERGENCY_BALLAST`, by default the smaller of 1 GiB or 1% of the store's capacity). When the disk is full, delete this file to free enough space for recovery:

```bash
rm cockroach-data/auxiliary/EMERGENCY_BALLAST
```

Then add storage, decommission the node, or drop unused tables/indexes. After cleanup, the node recreates the ballast on restart. Set alerts on `capacity_available`, and remember that space freed by deletes is only reclaimed after the MVCC garbage-collection window (`gc.ttlseconds`) passes and compaction runs, not immediately.
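
If you need space back sooner after large deletes, the garbage-collection window is the usual lever; a sketch that temporarily lowers it for the `events` table from the hot-ranges example (very low values shorten the window for `AS OF SYSTEM TIME` queries and incremental backups):

```sql
-- Deleted rows only become reclaimable after gc.ttlseconds elapses; compaction
-- then removes them in the background. 600s is an aggressive, temporary value.
ALTER TABLE events CONFIGURE ZONE USING gc.ttlseconds = 600;
```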

