---
title: "Database Performance Investigation Runbook"
description: "Step-by-step operational sequence for investigating database performance issues — identifying slow queries, analyzing execution plans, checking lock contention, reviewing connection pools, analyzing I/O, and examining cache hit ratios for PostgreSQL and MySQL."
url: https://agent-zone.ai/knowledge/databases/database-performance-runbook/
section: knowledge
date: 2026-02-22
categories: ["databases"]
tags: ["performance","postgresql","mysql","slow-queries","execution-plans","locking","connection-pool","io","buffer-cache","runbook"]
skills: ["performance-diagnosis","query-optimization","database-troubleshooting","capacity-analysis"]
tools: ["psql","mysql","pg_stat_statements","performance_schema","pt-query-digest","iostat","vmstat"]
levels: ["intermediate","advanced"]
word_count: 1329
formats:
  json: https://agent-zone.ai/knowledge/databases/database-performance-runbook/index.json
  html: https://agent-zone.ai/knowledge/databases/database-performance-runbook/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Database+Performance+Investigation+Runbook
---


# Database Performance Investigation Runbook

When a database is slow, resist the urge to immediately tune configuration parameters. Follow this sequence: identify what is slow, understand why, then fix the specific bottleneck. Most performance problems are caused by missing indexes or a single bad query, not global configuration issues.

## Phase 1 -- Identify Slow Queries

The first step is always finding which queries are consuming the most time.

### PostgreSQL: pg_stat_statements

Enable the extension if not already loaded:

```sql
-- Check if loaded
SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements';

-- Enable (requires adding to shared_preload_libraries and a restart).
-- Note: ALTER SYSTEM SET replaces the whole list -- include any libraries
-- already configured there.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- Restart PostgreSQL, then:
CREATE EXTENSION pg_stat_statements;
```

Find the top queries by total time:

```sql
-- Column names are PostgreSQL 13+; on 12 and earlier use
-- total_time / mean_time instead of total_exec_time / mean_exec_time.
SELECT
  substring(query, 1, 80) AS query_preview,
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS avg_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

This gives you the top 10 queries ranked by cumulative execution time. Focus on these -- they represent the biggest opportunities.

To reset stats after making changes (so you can measure the impact): `SELECT pg_stat_statements_reset();`
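
The `pct` column drives prioritization. As a sketch (hypothetical data and function name), the same logic can pick the smallest set of queries covering 80% of total execution time:

```python
# Sketch: given (query, total_ms) pairs, find the smallest set of queries
# covering a target share of cumulative execution time. Data is hypothetical.
def top_offenders(stats, coverage=0.80):
    total = sum(ms for _, ms in stats)
    picked, acc = [], 0.0
    for query, ms in sorted(stats, key=lambda s: s[1], reverse=True):
        picked.append(query)
        acc += ms
        if acc / total >= coverage:
            break
    return picked

stats = [
    ("SELECT * FROM orders WHERE status = %s", 9000.0),
    ("UPDATE inventory SET qty = qty - 1", 600.0),
    ("SELECT count(*) FROM sessions", 300.0),
    ("INSERT INTO audit_log ...", 100.0),
]
print(top_offenders(stats))  # usually one or two queries dominate
```

In practice the distribution is heavily skewed, which is why "fix the top query" so often resolves an incident.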

### MySQL: Slow Query Log and Performance Schema

Enable the slow query log dynamically:

```sql
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0.5;  -- seconds
SET GLOBAL log_queries_not_using_indexes = 1;  -- can be noisy on small tables
```

Analyze accumulated slow queries with `pt-query-digest`:

```bash
pt-query-digest /var/log/mysql/slow.log --limit 10
```

Or query Performance Schema directly for the top queries by total execution time:

```sql
SELECT
  LEFT(DIGEST_TEXT, 80) AS query_preview,
  COUNT_STAR AS calls,
  ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_sec,  -- timer columns are in picoseconds
  ROUND(AVG_TIMER_WAIT / 1e12, 4) AS avg_sec,
  SUM_ROWS_EXAMINED AS rows_examined,
  SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```

A large gap between `rows_examined` and `rows_sent` signals a missing or ineffective index.
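
That ratio check is easy to automate. A sketch over hypothetical digest rows (names and thresholds are illustrative):

```python
# Sketch: flag query digests whose rows_examined far exceeds rows_sent --
# a sign the server scans many rows to return few. Data is hypothetical.
def index_candidates(digests, ratio_threshold=100):
    flagged = []
    for name, examined, sent in digests:
        # max(sent, 1) avoids division by zero for queries returning no rows
        ratio = examined / max(sent, 1)
        if ratio >= ratio_threshold:
            flagged.append((name, ratio))
    return flagged

digests = [
    ("SELECT ... WHERE email = ?", 2_000_000, 1),  # full scan to find one row
    ("SELECT ... WHERE id = ?", 3, 1),             # index lookup, healthy
]
print(index_candidates(digests))
```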

## Phase 2 -- Analyze Execution Plans

Once you have the slow queries, examine their execution plans.

### PostgreSQL: EXPLAIN (ANALYZE, BUFFERS)

```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > now() - interval '7 days';
```

What to look for:

- **Seq Scan on a large table**: Missing index. Check if a WHERE clause column or JOIN column is unindexed.
- **Nested Loop with high row count in outer loop**: Consider if a Hash Join or Merge Join would be better. The planner may be choosing poorly due to stale statistics -- run `ANALYZE tablename;`.
- **Sort with external merge**: `work_mem` is too low for this query; raise it, or add an index that matches the sort order.
- **Buffers: shared read** much larger than **shared hit**: Data is not in the buffer cache. Either the table is too large for `shared_buffers`, or this is a cold query path.

```sql
-- Check if table statistics are current
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_dead_tup DESC;

-- Force statistics refresh
ANALYZE orders;
```

### MySQL: EXPLAIN ANALYZE

```sql
EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > DATE_SUB(NOW(), INTERVAL 7 DAY);
```

What to look for:

- **type: ALL**: Full table scan. Needs an index on the filter columns.
- **type: index**: Full index scan (reading every row of the index). Better than ALL but still expensive.
- **Extra: Using temporary; Using filesort**: Query requires a temporary table and/or sort. Consider adding an index that covers the ORDER BY.
- **rows**: Estimated rows to examine. Compare against actual rows returned. Large disparity means statistics are stale: `ANALYZE TABLE orders;`.

## Phase 3 -- Check Lock Contention

Slow queries are not always caused by missing indexes. Lock contention can make fast queries wait.

### PostgreSQL: Lock Investigation

```sql
-- Find waiting queries and what they are waiting on
SELECT pid, wait_event_type, wait_event, state, query,
  age(now(), query_start) AS duration
FROM pg_stat_activity
WHERE wait_event IS NOT NULL AND state != 'idle'
ORDER BY query_start;

-- Find blocked/blocking pairs (PostgreSQL 9.6+)
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
  blocking.pid AS blocking_pid, blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
```

Common causes: long-running transactions holding `RowExclusiveLock`, `ALTER TABLE` holding `AccessExclusiveLock`, autovacuum blocking DDL.
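
Blockers often form chains (A blocks B, B blocks C), and terminating a mid-chain session accomplishes nothing. A sketch of finding root blockers from the `(blocked_pid, blocking_pid)` pairs the query above returns (PIDs are hypothetical); roots are the candidates for `pg_terminate_backend()`:

```python
# Sketch: given (blocked_pid, blocking_pid) pairs, find root blockers --
# sessions that block others but are not themselves blocked.
def root_blockers(pairs):
    blocked = {b for b, _ in pairs}
    blocking = {k for _, k in pairs}
    return sorted(blocking - blocked)

# 101 blocks 202, and 202 blocks 303: 101 is the root of the chain
pairs = [(202, 101), (303, 202)]
print(root_blockers(pairs))  # [101]
```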

### MySQL: Lock Investigation

```sql
-- Current lock waits
SELECT * FROM sys.innodb_lock_waits\G

-- If the sys schema is unavailable (MySQL 5.7; in 8.0 these
-- information_schema tables were replaced by performance_schema.data_lock_waits)
SELECT
  r.trx_id AS waiting_trx,
  r.trx_mysql_thread_id AS waiting_pid,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx,
  b.trx_mysql_thread_id AS blocking_pid,
  b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

-- Metadata locks (DDL vs DML contention)
SELECT * FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING';
```

Common causes: long-running `SELECT` preventing `ALTER TABLE`, uncommitted transactions, large `UPDATE` or `DELETE` holding row locks.
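
On the application side, transient lock-wait timeouts and deadlocks are usually handled by retrying with backoff. A minimal sketch -- `LockWaitTimeout` and the `execute()` callable are hypothetical stand-ins for your driver's exception and transaction body:

```python
import time

# Sketch: retry a transaction that failed on lock contention, with
# exponential backoff. Exception type and execute() are hypothetical.
class LockWaitTimeout(Exception):
    pass

def run_with_retry(execute, attempts=3, base_delay=0.1):
    for attempt in range(attempts):
        try:
            return execute()
        except LockWaitTimeout:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...
```

Retrying is a mitigation, not a fix -- if timeouts recur, shorten the blocking transactions instead.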

## Phase 4 -- Review Connection Pool Saturation

If the database itself is not slow but applications experience timeouts, the connection pool may be exhausted.

### PostgreSQL

```sql
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
SHOW max_connections;

-- Idle in transaction is the worst offender -- holds locks while doing nothing
SELECT pid, usename, query, age(now(), xact_start) AS txn_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```

If `idle in transaction` persists, set `idle_in_transaction_session_timeout = '30s'`.

### MySQL

```sql
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
```

If `Max_used_connections` approaches `max_connections`, the fix is usually a connection pooler (ProxySQL for MySQL, PgBouncer for PostgreSQL) or application-side pooling, rather than raising the limit.
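
A monitoring check for this is a one-liner. A sketch with hypothetical values (in production they would come from `SHOW STATUS` / `SHOW VARIABLES`):

```python
# Sketch: warn when connection usage approaches the configured limit.
# Input values are hypothetical.
def pool_headroom(max_used, max_connections, warn_ratio=0.8):
    usage = max_used / max_connections
    return {"usage_pct": round(usage * 100, 1), "saturated": usage >= warn_ratio}

print(pool_headroom(max_used=170, max_connections=200))
```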

## Phase 5 -- Analyze I/O Patterns

If queries are well-indexed and there is no lock contention, the bottleneck may be disk I/O.

### Operating System Level

```bash
# I/O utilization per device
iostat -xz 2

# Key columns:
# %util: percentage of time the device was busy (> 80% suggests saturation)
# r_await / w_await: average read/write latency in ms (> ~10 ms on an SSD
#   is a problem; older sysstat versions show a single combined await column)
# r/s, w/s: reads and writes per second
```
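
Those thresholds can be applied mechanically. A sketch that parses `iostat -x` style output by header name (the sample text is hypothetical, and column layouts vary across sysstat versions):

```python
# Sketch: parse iostat -x style output and flag saturated devices.
# Sample is hypothetical; resolve columns by header name, not position.
SAMPLE = """\
Device            r/s     w/s     r_await  w_await  %util
nvme0n1         120.0   450.0       0.40    12.50   91.0
sda               2.0     1.0       1.10     2.00    3.5
"""

def saturated_devices(text, util_limit=80.0, await_limit=10.0):
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    idx = {name: i for i, name in enumerate(header)}
    flagged = []
    for row in rows:
        util = float(row[idx["%util"]])
        worst_await = max(float(row[idx["r_await"]]), float(row[idx["w_await"]]))
        if util > util_limit or worst_await > await_limit:
            flagged.append(row[idx["Device"]])
    return flagged

print(saturated_devices(SAMPLE))  # ['nvme0n1']
```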

### PostgreSQL: I/O Statistics

```sql
-- Table I/O: sequential vs index scans
SELECT relname,
  seq_scan, seq_tup_read,
  idx_scan, idx_tup_fetch,
  n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
```

Tables with high `seq_scan` and high `seq_tup_read` are candidates for index creation. Tables with high `n_tup_upd` and `n_tup_del` generate dead tuples and need regular vacuuming.

```sql
-- Check for tables needing vacuum
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
```
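
To turn those numbers into a work queue, rank tables by dead-tuple ratio. A sketch with hypothetical row data (in production the rows come from `pg_stat_user_tables`):

```python
# Sketch: rank tables by dead tuples to prioritize vacuuming.
# Row data (name, n_live_tup, n_dead_tup) is hypothetical.
def vacuum_priority(tables, min_dead=10_000):
    scored = []
    for name, live, dead in tables:
        if dead < min_dead:
            continue
        ratio = dead / max(live + dead, 1)  # share of the table that is dead
        scored.append((name, dead, round(ratio, 3)))
    return sorted(scored, key=lambda t: t[1], reverse=True)

tables = [
    ("orders", 1_000_000, 250_000),
    ("sessions", 50_000, 400_000),   # churn-heavy: mostly dead tuples
    ("customers", 200_000, 500),     # below threshold, ignored
]
print(vacuum_priority(tables))
```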

### MySQL: I/O Statistics

```sql
-- Table I/O waits
SELECT object_schema, object_name,
  count_read, count_write,
  sum_timer_read / 1e12 AS read_sec,
  sum_timer_write / 1e12 AS write_sec
FROM performance_schema.table_io_waits_summary_by_table
WHERE object_schema NOT IN ('mysql', 'performance_schema', 'sys')
ORDER BY sum_timer_wait DESC
LIMIT 10;
```

## Phase 6 -- Check Buffer and Cache Hit Ratios

Low cache hit ratios mean the database is reading from disk when it should be reading from memory.

### PostgreSQL: Buffer Cache Hit Ratio

```sql
-- Overall hit ratio (should be > 99% for OLTP)
SELECT
  round(sum(blks_hit)::numeric / nullif(sum(blks_hit) + sum(blks_read), 0) * 100, 2) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```

If below 99%, `shared_buffers` is likely too small. Increase it (standard guidance: 25% of total RAM). If a specific table has a low hit ratio, consider whether the working set fits in memory -- options are more RAM, partitioning, or archiving old data.
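
The arithmetic itself is simple. A minimal sketch with hypothetical counter values:

```python
# Sketch: buffer cache hit ratio from pg_stat_database-style counters.
# Counter values are hypothetical.
def hit_ratio(blks_hit, blks_read):
    total = blks_hit + blks_read
    if total == 0:
        return None  # no activity recorded yet
    return round(blks_hit / total * 100, 2)

print(hit_ratio(blks_hit=9_950_000, blks_read=50_000))  # 99.5
```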

### MySQL: InnoDB Buffer Pool Hit Ratio

```sql
SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- all logical read requests
SHOW STATUS LIKE 'Innodb_buffer_pool_reads';          -- requests that missed the cache and hit disk
-- hit_ratio = 1 - (reads / read_requests). Should be > 99%.
```

If below 99%, increase `innodb_buffer_pool_size`. On MySQL 5.7 and later, this can be done dynamically without a restart:

```sql
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;  -- 8 GB
```
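
Note that InnoDB adjusts the requested size upward to a multiple of `innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances` (per the MySQL docs). A sketch of the effective size, assuming the documented 128 MiB default chunk size:

```python
# Sketch: InnoDB rounds innodb_buffer_pool_size up to a multiple of
# chunk_size * instances (per the MySQL docs). Values are illustrative.
def effective_buffer_pool_size(requested, chunk_size=128 * 2**20, instances=1):
    unit = chunk_size * instances
    return ((requested + unit - 1) // unit) * unit  # round up to next multiple

gib = 2**30
# With 8 instances the rounding unit is 1 GiB: a 9.5 GiB request becomes 10 GiB
print(effective_buffer_pool_size(int(9.5 * gib), instances=8) // gib)  # 10
```

This is why a resize sometimes allocates more memory than requested -- budget for the rounded-up value.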

## Summary: Investigation Order

1. **Slow queries** (pg_stat_statements / Performance Schema) -- find the top offenders.
2. **Execution plans** (EXPLAIN ANALYZE) -- understand why they are slow.
3. **Lock contention** (pg_locks / innodb_lock_waits) -- rule out waiting as the cause.
4. **Connection pool** (pg_stat_activity / processlist) -- ensure connections are not exhausted.
5. **I/O patterns** (iostat + database I/O stats) -- check disk saturation.
6. **Cache hit ratios** (buffer cache stats) -- verify the working set fits in memory.

Most investigations end at step 1 or 2. A missing index or a bad query plan is the cause in the majority of cases. Steps 3 through 6 matter when the queries themselves are well-optimized but the system is still slow.

