{"page":{"agent_metadata":{"content_type":"operational_sequence","outputs":["restore-verification-pipeline","backup-monitoring-alerts","restore-time-baseline","backup-integrity-report"],"prerequisites":["backup-fundamentals","database-administration-basics","linux-scripting"]},"categories":["infrastructure"],"content_plain":"Backup Verification and Restore Testing# An untested backup is not a backup. It is a file that might contain your data and might be restorable. Teams discover the difference during an actual incident, when the database backup turns out to be corrupted, the restore takes 6 hours instead of the expected 30 minutes, or the backup process silently stopped running three weeks ago.\nBackup verification is the practice of regularly proving that your backups contain valid data and can be restored within your required RTO.\nThe Restore Verification Pipeline# A proper backup verification pipeline runs automatically, on a schedule, and alerts when anything fails. The core loop is: take the most recent backup, restore it to a throwaway environment, validate the data, measure the restore time, tear down the environment.\nPipeline Architecture# Backup Storage (S3/GCS) | [Scheduled Trigger - cron/CloudWatch/CronJob] | Pull latest backup | Provision test instance (RDS snapshot restore / Docker container / temp VM) | Restore backup into test instance | Run validation queries (row counts, recent timestamps, checksums) | Record restore time + results to monitoring | Tear down test instance | Alert on failure\nAutomated PostgreSQL Restore Verification# This script pulls the latest PostgreSQL backup, restores it, runs validation queries, and reports results. 
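Before wiring the full script into cron, the freshness arithmetic it relies on can be exercised in isolation (a minimal sketch; backup_age_hours is a hypothetical helper, and the 26-hour threshold matches the age check in the script):

```shell
# Age in whole hours between a backup's mtime and "now", both as epoch seconds.
backup_age_hours() {
  echo $(( ($2 - $1) / 3600 ))
}

# Example: a backup written 28.5 hours ago trips the 26h freshness alert.
AGE=$(backup_age_hours 1770000000 1770102600)
echo "backup is ${AGE}h old"
[ "$AGE" -gt 26 ] && echo "STALE: older than the 26h budget"
```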
Run it nightly via cron.\n#!/bin/bash # restore-verify-pg.sh - Automated PostgreSQL backup restore verification set -euo pipefail BACKUP_BUCKET=\u0026#34;s3://myapp-db-backups/postgresql\u0026#34; RESTORE_HOST=\u0026#34;localhost\u0026#34; RESTORE_PORT=\u0026#34;5433\u0026#34; RESTORE_DB=\u0026#34;restore_test\u0026#34; METRICS_FILE=\u0026#34;/var/log/backup-verify/pg-restore-$(date +%Y%m%d).json\u0026#34; ALERT_WEBHOOK=\u0026#34;https://hooks.slack.com/services/XXX/YYY/ZZZ\u0026#34; mkdir -p /var/log/backup-verify # Find the latest backup (non-recursive listing: with --recursive the listed key would repeat the prefix, breaking the s3 cp path below) LATEST_BACKUP=$(aws s3 ls \u0026#34;${BACKUP_BUCKET}/\u0026#34; | sort | tail -1 | awk \u0026#39;{print $4}\u0026#39;) if [ -z \u0026#34;$LATEST_BACKUP\u0026#34; ]; then curl -s -X POST \u0026#34;$ALERT_WEBHOOK\u0026#34; -d \u0026#39;{\u0026#34;text\u0026#34;:\u0026#34;CRITICAL: No PostgreSQL backup found in \u0026#39;\u0026#34;$BACKUP_BUCKET\u0026#34;\u0026#39;\u0026#34;}\u0026#39; exit 1 fi # Check backup age - alert if older than 26 hours (allows for schedule drift) BACKUP_DATE=$(aws s3 ls \u0026#34;${BACKUP_BUCKET}/${LATEST_BACKUP}\u0026#34; | awk \u0026#39;{print $1\u0026#34; \u0026#34;$2}\u0026#39;) BACKUP_EPOCH=$(date -d \u0026#34;$BACKUP_DATE\u0026#34; +%s 2\u0026gt;/dev/null || date -j -f \u0026#34;%Y-%m-%d %H:%M:%S\u0026#34; \u0026#34;$BACKUP_DATE\u0026#34; +%s) NOW_EPOCH=$(date +%s) AGE_HOURS=$(( (NOW_EPOCH - BACKUP_EPOCH) / 3600 )) if [ \u0026#34;$AGE_HOURS\u0026#34; -gt 26 ]; then curl -s -X POST \u0026#34;$ALERT_WEBHOOK\u0026#34; \\ -d \u0026#39;{\u0026#34;text\u0026#34;:\u0026#34;WARNING: Latest PostgreSQL backup is \u0026#39;\u0026#34;$AGE_HOURS\u0026#34;\u0026#39; hours old (expected \u0026lt; 26h)\u0026#34;}\u0026#39; fi # Download and restore RESTORE_START=$(date +%s) aws s3 cp \u0026#34;${BACKUP_BUCKET}/${LATEST_BACKUP}\u0026#34; /tmp/pg-restore-test.dump dropdb --if-exists -h \u0026#34;$RESTORE_HOST\u0026#34; -p \u0026#34;$RESTORE_PORT\u0026#34; \u0026#34;$RESTORE_DB\u0026#34; 2\u0026gt;/dev/null 
|| true createdb -h \u0026#34;$RESTORE_HOST\u0026#34; -p \u0026#34;$RESTORE_PORT\u0026#34; \u0026#34;$RESTORE_DB\u0026#34; pg_restore -h \u0026#34;$RESTORE_HOST\u0026#34; -p \u0026#34;$RESTORE_PORT\u0026#34; -d \u0026#34;$RESTORE_DB\u0026#34; \\ --no-owner --no-privileges --jobs=4 /tmp/pg-restore-test.dump RESTORE_END=$(date +%s) RESTORE_SECONDS=$(( RESTORE_END - RESTORE_START )) # Validate: check row counts and most recent timestamp USERS_COUNT=$(psql -h \u0026#34;$RESTORE_HOST\u0026#34; -p \u0026#34;$RESTORE_PORT\u0026#34; -d \u0026#34;$RESTORE_DB\u0026#34; -t \\ -c \u0026#34;SELECT count(*) FROM users;\u0026#34;) ORDERS_COUNT=$(psql -h \u0026#34;$RESTORE_HOST\u0026#34; -p \u0026#34;$RESTORE_PORT\u0026#34; -d \u0026#34;$RESTORE_DB\u0026#34; -t \\ -c \u0026#34;SELECT count(*) FROM orders;\u0026#34;) LATEST_ORDER=$(psql -h \u0026#34;$RESTORE_HOST\u0026#34; -p \u0026#34;$RESTORE_PORT\u0026#34; -d \u0026#34;$RESTORE_DB\u0026#34; -t \\ -c \u0026#34;SELECT max(created_at) FROM orders;\u0026#34;) # Write metrics cat \u0026gt; \u0026#34;$METRICS_FILE\u0026#34; \u0026lt;\u0026lt;EOF { \u0026#34;timestamp\u0026#34;: \u0026#34;$(date -u +%Y-%m-%dT%H:%M:%SZ)\u0026#34;, \u0026#34;backup_file\u0026#34;: \u0026#34;$LATEST_BACKUP\u0026#34;, \u0026#34;backup_age_hours\u0026#34;: $AGE_HOURS, \u0026#34;restore_time_seconds\u0026#34;: $RESTORE_SECONDS, \u0026#34;validation\u0026#34;: { \u0026#34;users_count\u0026#34;: $USERS_COUNT, \u0026#34;orders_count\u0026#34;: $ORDERS_COUNT, \u0026#34;latest_order_timestamp\u0026#34;: \u0026#34;$LATEST_ORDER\u0026#34; }, \u0026#34;status\u0026#34;: \u0026#34;success\u0026#34; } EOF # Clean up dropdb -h \u0026#34;$RESTORE_HOST\u0026#34; -p \u0026#34;$RESTORE_PORT\u0026#34; \u0026#34;$RESTORE_DB\u0026#34; rm /tmp/pg-restore-test.dump echo \u0026#34;Restore verified: ${RESTORE_SECONDS}s, ${USERS_COUNT} users, ${ORDERS_COUNT} orders\u0026#34;\nCron entry:\n# Run restore verification every night at 4 AM 0 4 * * * /opt/scripts/restore-verify-pg.sh 
\u0026gt;\u0026gt; /var/log/backup-verify/cron.log 2\u0026gt;\u0026amp;1\nMySQL Point-in-Time Recovery Verification# MySQL point-in-time recovery (PITR) depends on binary logs being intact and continuous from the last full backup. The verification must test both the full restore and the binlog replay.\n#!/bin/bash # verify-mysql-pitr.sh - Verify MySQL full backup + binlog replay set -euo pipefail BACKUP_DIR=\u0026#34;/backup/mysql\u0026#34; RESTORE_DIR=\u0026#34;/tmp/mysql-restore-test\u0026#34; MYSQL_PORT=3307 LATEST_FULL=$(ls -t ${BACKUP_DIR}/full-*.xbstream 2\u0026gt;/dev/null | head -1) if [ -z \u0026#34;$LATEST_FULL\u0026#34; ]; then echo \u0026#34;CRITICAL: No full backup found\u0026#34; \u0026gt;\u0026amp;2 exit 1 fi RESTORE_START=$(date +%s) # Decompress and prepare mkdir -p \u0026#34;$RESTORE_DIR\u0026#34; xbstream -x -C \u0026#34;$RESTORE_DIR\u0026#34; \u0026lt; \u0026#34;$LATEST_FULL\u0026#34; xtrabackup --prepare --target-dir=\u0026#34;$RESTORE_DIR\u0026#34; # Start a temporary MySQL instance with the restored data (mysqld directly, not mysqld_safe, so $! is the server PID we kill at teardown; mysqld_safe would respawn the server) mysqld --datadir=\u0026#34;$RESTORE_DIR\u0026#34; --port=\u0026#34;$MYSQL_PORT\u0026#34; --socket=/tmp/mysql-restore.sock \u0026amp; MYSQL_PID=$! 
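```shell
# (Added sketch) A fixed sleep can race a slow InnoDB recovery on the restored
# datadir. Poll the server before replaying binlogs; assumes mysqladmin is on
# PATH. wait_ready TIMEOUT CMD... retries CMD once per second until it succeeds.
wait_ready() {
  local t="$1"; shift
  while [ "$t" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    t=$((t - 1)); sleep 1
  done
  return 1
}
wait_ready 15 mysqladmin --socket=/tmp/mysql-restore.sock ping ||
  echo "WARNING: restored instance not answering after 15s" >&2
```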
sleep 10 # Apply binary logs up to 5 minutes ago (test PITR capability); in production, start replay at the file/position recorded in xtrabackup_binlog_info so events already in the backup are not applied twice TARGET_TIME=$(date -d \u0026#39;5 minutes ago\u0026#39; \u0026#39;+%Y-%m-%d %H:%M:%S\u0026#39; 2\u0026gt;/dev/null || \\ date -v-5M \u0026#39;+%Y-%m-%d %H:%M:%S\u0026#39;) mysqlbinlog --stop-datetime=\u0026#34;$TARGET_TIME\u0026#34; ${BACKUP_DIR}/binlog.* | \\ mysql --socket=/tmp/mysql-restore.sock RESTORE_END=$(date +%s) echo \u0026#34;PITR restore completed in $(( RESTORE_END - RESTORE_START )) seconds\u0026#34; # Validate mysql --socket=/tmp/mysql-restore.sock -e \u0026#34;SELECT COUNT(*) FROM myapp.orders;\u0026#34; 2\u0026gt;/dev/null # Teardown kill \u0026#34;$MYSQL_PID\u0026#34; 2\u0026gt;/dev/null rm -rf \u0026#34;$RESTORE_DIR\u0026#34;\netcd Snapshot Restore Verification# #!/bin/bash # verify-etcd-restore.sh - Verify etcd snapshot is restorable set -euo pipefail SNAPSHOT=\u0026#34;/backup/etcd/snapshot-latest.db\u0026#34; # Verify snapshot integrity (test the command directly: under set -e a separate $? check would never run) if ! ETCDCTL_API=3 etcdctl snapshot status \u0026#34;$SNAPSHOT\u0026#34; --write-out=table; then echo \u0026#34;CRITICAL: etcd snapshot integrity check failed\u0026#34; \u0026gt;\u0026amp;2 exit 1 fi # Test restore to temporary directory RESTORE_DIR=$(mktemp -d) ETCDCTL_API=3 etcdctl snapshot restore \u0026#34;$SNAPSHOT\u0026#34; \\ --data-dir=\u0026#34;$RESTORE_DIR/etcd-data\u0026#34; \\ --name=restore-test \\ --initial-cluster=restore-test=http://localhost:2390 \\ --initial-advertise-peer-urls=http://localhost:2390 # Count keys to verify data (--name must match the name used at restore time) etcd --name=restore-test --data-dir=\u0026#34;$RESTORE_DIR/etcd-data\u0026#34; \\ --listen-client-urls=http://localhost:2389 \\ --advertise-client-urls=http://localhost:2389 \u0026amp; ETCD_PID=$! 
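```shell
# (Added sketch) Give the throwaway etcd a moment to come up before counting
# keys: poll endpoint health rather than trusting the fixed sleep alone.
# Assumes etcdctl v3 on PATH; falls through with a warning after 15s.
ETCD_READY=0
for _ in $(seq 15); do
  if ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2389 endpoint health >/dev/null 2>&1; then
    ETCD_READY=1
    break
  fi
  sleep 1
done
[ "$ETCD_READY" -eq 1 ] || echo "WARNING: etcd restore instance not healthy after 15s" >&2
```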
sleep 3 # --keys-only prints a blank line after each key, so count non-blank lines instead of using wc -l KEY_COUNT=$(ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2389 \\ get \u0026#34;\u0026#34; --prefix --keys-only | grep -c . || true) echo \u0026#34;Restored ${KEY_COUNT} keys from etcd snapshot\u0026#34; kill \u0026#34;$ETCD_PID\u0026#34; rm -rf \u0026#34;$RESTORE_DIR\u0026#34;\nBackup Monitoring# Automated restore testing catches corruption. Monitoring catches operational failures: backups that never ran, backups that are suspiciously small, and retention policies that are not being enforced.\nKey Metrics to Monitor# Backup freshness. Alert when the most recent backup is older than expected. If your RPO is 1 hour, alert at 90 minutes.\nBackup size. Track backup size over time. A sudden 50% drop in size probably means a table was dropped or the backup is incomplete. A sudden 200% increase might mean a data explosion or a backup scope change.\n# Prometheus alerting rules for backup monitoring groups: - name: backup_alerts rules: - alert: BackupTooOld expr: (time() - backup_last_success_timestamp_seconds) \u0026gt; 93600 # 26 hours for: 5m labels: severity: critical annotations: summary: \u0026#34;Backup older than 26 hours for {{ $labels.database }}\u0026#34; - alert: BackupSizeAnomaly expr: | abs(backup_size_bytes - backup_size_bytes offset 1d) / backup_size_bytes offset 1d \u0026gt; 0.5 for: 5m labels: severity: warning annotations: summary: \u0026#34;Backup size changed \u0026gt;50% for {{ $labels.database }}\u0026#34; - alert: RestoreTimeDegraded expr: backup_restore_duration_seconds \u0026gt; 1800 # 30 minutes for: 5m labels: severity: warning annotations: summary: \u0026#34;Restore time exceeds 30 minutes for {{ $labels.database }}\u0026#34;\nRestore time trending. Track how long restores take over time. If your database grows 20% per quarter, your restore time grows too. If your RTO is 30 minutes and your current restore time is 25 minutes, you have a few months before you violate your RTO. This is a capacity planning problem.\nRetention compliance. 
Verify that backups exist for the required retention period. If policy requires 90 days of daily backups, count the backups and alert when any day is missing.\nMeasuring Restore Time Accurately# Restore time is not just \u0026ldquo;how long pg_restore takes.\u0026rdquo; The real RTO includes:\nDetection time: How long until you know you need to restore (minutes to hours) Decision time: How long until someone authorizes the restore (minutes) Infrastructure provisioning: Spinning up a new database instance (5-45 minutes for cloud-managed databases) Data transfer: Downloading the backup from storage (depends on size and network) Restore execution: The actual pg_restore/mysql import (depends on data size and instance type) Validation: Confirming the restore is correct (minutes) Traffic cutover: Pointing the application at the restored database (minutes) Measure each component separately. The automated restore test gives you components 4-6: transfer, restore execution, and validation. A tabletop exercise, where the team walks through the incident on paper, reveals the real numbers for 1-3. Most teams underestimate their actual RTO by 2-4x because they only measure the restore execution time.\nChecksum Validation# For file-level backups, verify integrity with checksums:\n# Generate checksum during backup sha256sum /backup/db-dump-20260222.sql.gz \u0026gt; /backup/db-dump-20260222.sql.gz.sha256 # Verify before restore sha256sum -c /backup/db-dump-20260222.sql.gz.sha256\nFor S3 backups, enable Content-MD5 verification on upload and verify the ETag on download; for multipart uploads the ETag is not a plain MD5, so use S3\u0026#39;s additional checksum algorithms (such as SHA-256) instead. For critical backups, use S3 Object Lock to prevent accidental or malicious deletion.\n","date":"2026-02-22","description":"Automated restore verification pipelines, backup integrity validation, restore time measurement, backup monitoring for missed windows and size anomalies, and database-specific restore testing for PostgreSQL, MySQL, and etcd. 
Concrete scripts and cron jobs.","lastmod":"2026-02-22","levels":["intermediate","advanced"],"reading_time_minutes":6,"section":"knowledge","skills":["backup-validation","restore-testing","backup-monitoring","database-recovery"],"tags":["backup","restore-testing","backup-verification","postgresql","mysql","etcd","monitoring","data-integrity","automation"],"title":"Backup Verification and Restore Testing: Proving Your Backups Actually Work","tools":["pg_restore","pg_dump","mysql","mysqldump","etcdctl","aws-cli","prometheus","cron","bash"],"url":"https://agent-zone.ai/knowledge/infrastructure/backup-verification-restore-testing/","word_count":1241}}