---
title: "Backup Verification and Restore Testing: Proving Your Backups Actually Work"
description: "Automated restore verification pipelines, backup integrity validation, restore time measurement, backup monitoring for missed windows and size anomalies, and database-specific restore testing for PostgreSQL, MySQL, and etcd. Concrete scripts and cron jobs."
url: https://agent-zone.ai/knowledge/infrastructure/backup-verification-restore-testing/
section: knowledge
date: 2026-02-22
categories: ["infrastructure"]
tags: ["backup","restore-testing","backup-verification","postgresql","mysql","etcd","monitoring","data-integrity","automation"]
skills: ["backup-validation","restore-testing","backup-monitoring","database-recovery"]
tools: ["pg_restore","pg_dump","mysql","mysqldump","etcdctl","aws-cli","prometheus","cron","bash"]
levels: ["intermediate","advanced"]
word_count: 1241
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/backup-verification-restore-testing/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/backup-verification-restore-testing/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Backup+Verification+and+Restore+Testing%3A+Proving+Your+Backups+Actually+Work
---


# Backup Verification and Restore Testing

An untested backup is not a backup. It is a file that might contain your data and might be restorable. Teams discover the difference during an actual incident, when the database backup turns out to be corrupted, the restore takes 6 hours instead of the expected 30 minutes, or the backup process silently stopped running three weeks ago.

Backup verification is the practice of regularly proving that your backups contain valid data and can be restored within your required RTO.

## The Restore Verification Pipeline

A proper backup verification pipeline runs automatically, on a schedule, and alerts when anything fails. The core loop is: take the most recent backup, restore it to a throwaway environment, validate the data, measure the restore time, tear down the environment.

### Pipeline Architecture

```
Backup Storage (S3/GCS)
        |
   [Scheduled Trigger - cron/CloudWatch/CronJob]
        |
   Pull latest backup
        |
   Provision test instance (RDS snapshot restore / Docker container / temp VM)
        |
   Restore backup into test instance
        |
   Run validation queries (row counts, recent timestamps, checksums)
        |
   Record restore time + results to monitoring
        |
   Tear down test instance
        |
   Alert on failure
```

### Automated PostgreSQL Restore Verification

This script pulls the latest PostgreSQL backup, restores it, runs validation queries, and reports results. Run it nightly via cron.

```bash
#!/bin/bash
# restore-verify-pg.sh - Automated PostgreSQL backup restore verification
set -euo pipefail

BACKUP_BUCKET="s3://myapp-db-backups/postgresql"
RESTORE_HOST="localhost"
RESTORE_PORT="5433"
RESTORE_DB="restore_test"
METRICS_FILE="/var/log/backup-verify/pg-restore-$(date +%Y%m%d).json"
ALERT_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"

mkdir -p /var/log/backup-verify

# Find the latest backup (aws s3 ls lines begin with the modification date,
# so a plain sort is chronological; keys are assumed to sit directly under
# the prefix - with --recursive the full key would break the paths below)
LATEST_BACKUP=$(aws s3 ls "${BACKUP_BUCKET}/" | sort | tail -1 | awk '{print $4}')
if [ -z "$LATEST_BACKUP" ]; then
    curl -s -X POST "$ALERT_WEBHOOK" -d '{"text":"CRITICAL: No PostgreSQL backup found in '"$BACKUP_BUCKET"'"}'
    exit 1
fi

# Check backup age - alert if older than 26 hours (allows for schedule drift)
BACKUP_DATE=$(aws s3 ls "${BACKUP_BUCKET}/${LATEST_BACKUP}" | awk '{print $1" "$2}')
BACKUP_EPOCH=$(date -d "$BACKUP_DATE" +%s 2>/dev/null || date -j -f "%Y-%m-%d %H:%M:%S" "$BACKUP_DATE" +%s)
NOW_EPOCH=$(date +%s)
AGE_HOURS=$(( (NOW_EPOCH - BACKUP_EPOCH) / 3600 ))

if [ "$AGE_HOURS" -gt 26 ]; then
    curl -s -X POST "$ALERT_WEBHOOK" \
        -d '{"text":"WARNING: Latest PostgreSQL backup is '"$AGE_HOURS"' hours old (expected < 26h)"}'
fi

# Download and restore
RESTORE_START=$(date +%s)
aws s3 cp "${BACKUP_BUCKET}/${LATEST_BACKUP}" /tmp/pg-restore-test.dump

dropdb --if-exists -h "$RESTORE_HOST" -p "$RESTORE_PORT" "$RESTORE_DB" 2>/dev/null || true
createdb -h "$RESTORE_HOST" -p "$RESTORE_PORT" "$RESTORE_DB"
pg_restore -h "$RESTORE_HOST" -p "$RESTORE_PORT" -d "$RESTORE_DB" \
    --no-owner --no-privileges --jobs=4 /tmp/pg-restore-test.dump

RESTORE_END=$(date +%s)
RESTORE_SECONDS=$(( RESTORE_END - RESTORE_START ))

# Validate: check row counts and most recent timestamp
# (-tA strips psql's alignment padding so the JSON below stays clean)
USERS_COUNT=$(psql -h "$RESTORE_HOST" -p "$RESTORE_PORT" -d "$RESTORE_DB" -tA \
    -c "SELECT count(*) FROM users;")
ORDERS_COUNT=$(psql -h "$RESTORE_HOST" -p "$RESTORE_PORT" -d "$RESTORE_DB" -tA \
    -c "SELECT count(*) FROM orders;")
LATEST_ORDER=$(psql -h "$RESTORE_HOST" -p "$RESTORE_PORT" -d "$RESTORE_DB" -tA \
    -c "SELECT max(created_at) FROM orders;")

# Write metrics
cat > "$METRICS_FILE" <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "backup_file": "$LATEST_BACKUP",
  "backup_age_hours": $AGE_HOURS,
  "restore_time_seconds": $RESTORE_SECONDS,
  "validation": {
    "users_count": $USERS_COUNT,
    "orders_count": $ORDERS_COUNT,
    "latest_order_timestamp": "$LATEST_ORDER"
  },
  "status": "success"
}
EOF

# Clean up
dropdb -h "$RESTORE_HOST" -p "$RESTORE_PORT" "$RESTORE_DB"
rm /tmp/pg-restore-test.dump

echo "Restore verified: ${RESTORE_SECONDS}s, ${USERS_COUNT} users, ${ORDERS_COUNT} orders"
```

**Cron entry:**
```
# Run restore verification every night at 4 AM
0 4 * * * /opt/scripts/restore-verify-pg.sh >> /var/log/backup-verify/cron.log 2>&1
```

### MySQL Point-in-Time Recovery Verification

MySQL point-in-time recovery (PITR) depends on binary logs being intact and continuous from the last full backup. The verification must test both the full restore and the binlog replay.

```bash
#!/bin/bash
# verify-mysql-pitr.sh - Verify MySQL full backup + binlog replay
set -euo pipefail

BACKUP_DIR="/backup/mysql"
RESTORE_DIR="/tmp/mysql-restore-test"
MYSQL_PORT=3307

# "|| true" keeps set -e/pipefail from aborting before the empty-check below
LATEST_FULL=$(ls -t "${BACKUP_DIR}"/full-*.xbstream 2>/dev/null | head -1 || true)
if [ -z "$LATEST_FULL" ]; then
    echo "CRITICAL: No full backup found" >&2
    exit 1
fi

RESTORE_START=$(date +%s)

# Decompress and prepare
mkdir -p "$RESTORE_DIR"
xbstream -x -C "$RESTORE_DIR" < "$LATEST_FULL"
xtrabackup --prepare --target-dir="$RESTORE_DIR"

# Start a temporary MySQL instance with the restored data
mysqld_safe --datadir="$RESTORE_DIR" --port="$MYSQL_PORT" --socket=/tmp/mysql-restore.sock &
MYSQL_PID=$!
# Wait for the socket instead of sleeping a fixed interval
for _ in $(seq 1 30); do
    [ -S /tmp/mysql-restore.sock ] && break
    sleep 1
done

# Apply binary logs up to 5 minutes ago (test PITR capability)
TARGET_TIME=$(date -d '5 minutes ago' '+%Y-%m-%d %H:%M:%S' 2>/dev/null || \
              date -v-5M '+%Y-%m-%d %H:%M:%S')
# binlog.[0-9]* avoids feeding the binlog.index file to mysqlbinlog
mysqlbinlog --stop-datetime="$TARGET_TIME" "${BACKUP_DIR}"/binlog.[0-9]* | \
    mysql --socket=/tmp/mysql-restore.sock

RESTORE_END=$(date +%s)
echo "PITR restore completed in $(( RESTORE_END - RESTORE_START )) seconds"

# Validate
mysql --socket=/tmp/mysql-restore.sock -e "SELECT COUNT(*) FROM myapp.orders;" 2>/dev/null

# Teardown - killing mysqld_safe can leave mysqld running or restart it,
# so shut down cleanly through the socket
mysqladmin --socket=/tmp/mysql-restore.sock shutdown
rm -rf "$RESTORE_DIR"
```
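
The binlog replay above only works if the log sequence from the full backup onward is unbroken. A quick continuity check before replay catches gaps early. This is a sketch: the `binlog.NNNNNN` filename pattern is an assumption, so adjust it to your `log_bin` basename.

```bash
#!/bin/bash
# check_binlog_continuity - verify binlog file numbers form an unbroken
# sequence. A gap means PITR replay silently loses a window of changes.
set -u
shopt -s nullglob

check_binlog_continuity() {
    local dir="$1" prev="" num f
    for f in "$dir"/binlog.[0-9]*; do
        num=$((10#${f##*.}))        # strip path and basename, force base 10
        if [ -n "$prev" ] && [ "$num" -ne $((prev + 1)) ]; then
            echo "GAP: binlog jumps from ${prev} to ${num}" >&2
            return 1
        fi
        prev=$num
    done
    [ -n "$prev" ] || { echo "CRITICAL: no binlogs found in ${dir}" >&2; return 1; }
    echo "Binlog sequence continuous through ${prev}"
}

# Example: check_binlog_continuity /backup/mysql || exit 1
```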

### etcd Snapshot Restore Verification

```bash
#!/bin/bash
# verify-etcd-restore.sh - Verify etcd snapshot is restorable
set -euo pipefail

SNAPSHOT="/backup/etcd/snapshot-latest.db"

# Verify snapshot integrity (under set -e a post-hoc $? check never runs,
# so test the command directly)
if ! ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" --write-out=table; then
    echo "CRITICAL: etcd snapshot integrity check failed" >&2
    exit 1
fi

# Test restore to temporary directory
RESTORE_DIR=$(mktemp -d)
ETCDCTL_API=3 etcdctl snapshot restore "$SNAPSHOT" \
    --data-dir="$RESTORE_DIR/etcd-data" \
    --name=restore-test \
    --initial-cluster=restore-test=http://localhost:2390 \
    --initial-advertise-peer-urls=http://localhost:2390

# Count keys to verify data
etcd --data-dir="$RESTORE_DIR/etcd-data" \
    --listen-client-urls=http://localhost:2389 \
    --advertise-client-urls=http://localhost:2389 &
ETCD_PID=$!
sleep 3

# --keys-only prints a blank line after each key; count only non-empty lines
KEY_COUNT=$(ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2389 \
    get "" --prefix --keys-only | grep -c . || true)
echo "Restored ${KEY_COUNT} keys from etcd snapshot"

kill "$ETCD_PID"
rm -rf "$RESTORE_DIR"
```

## Backup Monitoring

Automated restore testing catches corruption. Monitoring catches operational failures: backups that never ran, backups that are suspiciously small, and retention policies that are not being enforced.

### Key Metrics to Monitor

**Backup freshness.** Alert when the most recent backup is older than expected. If your RPO is 1 hour, alert at 90 minutes.

**Backup size.** Track backup size over time. A sudden 50% drop in size probably means a table was dropped or the backup is incomplete. A sudden 200% increase might mean a data explosion or a backup scope change.
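
Alert rules like these assume the metrics exist somewhere Prometheus can scrape. One common pattern is having the backup job drop a file for node_exporter's textfile collector after each run. A sketch, where the metric and label names are assumptions to be matched to your own rules:

```bash
#!/bin/bash
# export_backup_metrics - publish backup freshness and size for
# node_exporter's textfile collector.

export_backup_metrics() {
    local backup_file="$1" textfile_dir="$2" db="$3" size mtime tmp
    # stat flags differ between GNU and BSD; try both
    size=$(stat -c %s "$backup_file" 2>/dev/null || stat -f %z "$backup_file")
    mtime=$(stat -c %Y "$backup_file" 2>/dev/null || stat -f %m "$backup_file")
    # Write atomically so node_exporter never scrapes a half-written file
    tmp=$(mktemp "${textfile_dir}/backup.prom.XXXXXX")
    {
        echo "backup_last_success_timestamp_seconds{database=\"${db}\"} ${mtime}"
        echo "backup_size_bytes{database=\"${db}\"} ${size}"
    } > "$tmp"
    mv "$tmp" "${textfile_dir}/backup.prom"
}

# Example, as the last step of a backup job:
#   export_backup_metrics /backup/latest.dump /var/lib/node_exporter/textfile myapp
```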

```yaml
# Prometheus alerting rules for backup monitoring
groups:
  - name: backup_alerts
    rules:
      - alert: BackupTooOld
        expr: (time() - backup_last_success_timestamp_seconds) > 93600  # 26 hours
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backup older than 26 hours for {{ $labels.database }}"

      - alert: BackupSizeAnomaly
        expr: |
          abs(backup_size_bytes - backup_size_bytes offset 1d)
          / backup_size_bytes offset 1d > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Backup size changed >50% for {{ $labels.database }}"

      - alert: RestoreTimeDegraded
        expr: backup_restore_duration_seconds > 1800  # 30 minutes
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Restore time exceeds 30 minutes for {{ $labels.database }}"
```

**Restore time trending.** Track how long restores take over time. If your database grows 20% per quarter, your restore time grows too. If your RTO is 30 minutes and your current restore time is 25 minutes, you have a few months before you violate your RTO. This is a capacity planning problem.

**Retention compliance.** Verify that backups exist for the required retention period. If policy requires 90 days of daily backups, count the backups and alert when any day is missing.
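
A retention check can be as simple as walking the window day by day and confirming at least one file per date. This sketch assumes a flat directory of filenames containing a `YYYYMMDD` datestamp; adapt the pattern to your layout.

```bash
#!/bin/bash
# check_retention - verify at least one backup file exists for each day
# in the retention window.

check_retention() {
    local dir="$1" days="$2" missing=0 i day
    for (( i=1; i<=days; i++ )); do
        # GNU date first, BSD date as fallback
        day=$(date -d "-${i} days" +%Y%m%d 2>/dev/null || date -v-"${i}"d +%Y%m%d)
        if ! ls "${dir}"/*"${day}"* >/dev/null 2>&1; then
            echo "MISSING: no backup for ${day}"
            missing=$((missing + 1))
        fi
    done
    echo "${missing} missing days out of ${days}"
    [ "$missing" -eq 0 ]
}

# Example: check_retention /backup/daily 90 || alert "retention gap"
```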

## Measuring Restore Time Accurately

Restore time is not just "how long pg_restore takes." The real RTO includes:

1. **Detection time:** How long until you know you need to restore (minutes to hours)
2. **Decision time:** How long until someone authorizes the restore (minutes)
3. **Infrastructure provisioning:** Spinning up a new database instance (5-45 minutes for cloud-managed databases)
4. **Data transfer:** Downloading the backup from storage (depends on size and network)
5. **Restore execution:** The actual pg_restore/mysql import (depends on data size and instance type)
6. **Validation:** Confirming the restore is correct (minutes)
7. **Traffic cutover:** Pointing the application at the restored database (minutes)

Measure each component separately. The automated restore test gives you components 4 through 6. A tabletop exercise reveals the real numbers for components 1-3. Most teams underestimate their actual RTO by 2-4x because they only measure the restore execution time.
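
Per-phase numbers are easier to collect if the drill script wraps each step in a small timing helper. A minimal bash sketch, where the phase names and output format are up to you:

```bash
#!/bin/bash
# Per-phase timing helpers for a restore drill: wrap each step in
# phase_start/phase_end, then emit durations in a Prometheus-friendly form.
declare -A PHASE_START PHASE_SECONDS

phase_start() { PHASE_START[$1]=$(date +%s); }
phase_end()   { PHASE_SECONDS[$1]=$(( $(date +%s) - PHASE_START[$1] )); }

emit_phases() {
    local name
    for name in "${!PHASE_SECONDS[@]}"; do
        echo "restore_phase_seconds{phase=\"${name}\"} ${PHASE_SECONDS[$name]}"
    done
}

# Sketch of a drill:
#   phase_start data_transfer;     aws s3 cp ...;  phase_end data_transfer
#   phase_start restore_execution; pg_restore ...; phase_end restore_execution
#   phase_start validation;        psql -c ...;    phase_end validation
#   emit_phases >> /var/log/backup-verify/rto-drill.prom
```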

## Checksum Validation

For file-level backups, verify integrity with checksums:

```bash
# Generate checksum during backup
sha256sum /backup/db-dump-20260222.sql.gz > /backup/db-dump-20260222.sql.gz.sha256

# Verify before restore
sha256sum -c /backup/db-dump-20260222.sql.gz.sha256
```

For S3 backups, enable additional checksums at upload time and verify the stored checksum on download; do not rely on the ETag alone, since it is not a simple MD5 for multipart uploads. For critical backups, use S3 Object Lock to prevent accidental or malicious deletion.
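
When an object is uploaded with `aws s3api put-object --checksum-algorithm SHA256` (supported in recent AWS CLI versions), S3 stores the checksum base64-encoded. A small helper to compute the matching form locally; the bucket and key in the comments are placeholders:

```bash
#!/bin/bash
# Compute a file's SHA-256 in the base64 form S3 uses for additional
# checksums, so the local and stored values can be compared directly.

local_b64_sha256() {
    openssl dgst -sha256 -binary "$1" | base64
}

# Upload with an additional checksum (S3 computes and stores it):
#   aws s3api put-object --bucket myapp-db-backups --key dump.sql.gz \
#       --body dump.sql.gz --checksum-algorithm SHA256
# Fetch the stored checksum later:
#   REMOTE=$(aws s3api get-object-attributes --bucket myapp-db-backups \
#       --key dump.sql.gz --object-attributes Checksum \
#       --query 'Checksum.ChecksumSHA256' --output text)
# Compare:
#   [ "$(local_b64_sha256 dump.sql.gz)" = "$REMOTE" ] || echo "MISMATCH"
```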

