Disaster Recovery Runbook
PostgreSQL PITR, snapshots, S3 replication, and recovery procedures for AACsearch.
This runbook covers recovery procedures for the AACsearch platform. Target RTO: 1 hour. Target RPO: 15 minutes.
Backup Architecture
| Component | Method | Schedule | Storage |
|---|---|---|---|
| PostgreSQL | WAL-G continuous archiving | Continuous | S3 (WAL segments) |
| AACSearch | Snapshot API | Every 6h | S3 |
| Application | Git (code) + Coolify | Per deploy | GitHub + Coolify |
| Uploaded files | Direct S3 (replicated) | Real-time | Cross-region S3 |
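The schedules in the table imply a maximum acceptable age for each backup type (for example, six hours for a search snapshot). A minimal sketch of a freshness check follows. It assumes the snapshot timestamp uses the YYYYMMDD-HHMMSS format written by the snapshot script, that the timestamp is in UTC, and that GNU date is available. The function names are hypothetical and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the AACsearch tooling.

# Convert a snapshot timestamp such as 20260503-120000 to epoch seconds.
# Assumes the timestamp is UTC and that GNU date (-d) is available.
snapshot_epoch() {
  local ts=$1
  date -u -d "${ts:0:4}-${ts:4:2}-${ts:6:2} ${ts:9:2}:${ts:11:2}:${ts:13:2}" +%s
}

# Succeed if a backup taken at $1 (epoch seconds) is at most $2 seconds old.
# $3 is the current time in epoch seconds and defaults to now.
backup_is_fresh() {
  local taken=$1 max_age=$2 now=${3:-$(date +%s)}
  (( now - taken <= max_age ))
}
```

One possible use is to pass the contents of the latest-snapshot.txt pointer to `snapshot_epoch` and call `backup_is_fresh` with a limit of `$((6 * 3600))` seconds, alerting when it fails.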
Prerequisites
Before any recovery operation, ensure:
- WAL-G is installed on the database server
- The S3 bucket is accessible with the required IAM permissions
- The AACSearch admin API key (AACSEARCH_ADMIN_KEY) is available
- Coolify admin access is available
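The locally checkable items in this list can be verified with a short preflight script before starting a restore. The sketch below is illustrative only: the function names are hypothetical, and the command and variable names are taken from the commands used later in this runbook. It does not test S3 permissions or Coolify access, which still need a manual check.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper; tool and variable names mirror this runbook.

# Print the names of required commands that are not on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Print the names of required environment variables that are unset or empty.
missing_env() {
  local name
  for name in "$@"; do
    [ -n "${!name:-}" ] || echo "$name"
  done
}

# Run all checks; returns non-zero if anything is missing.
preflight() {
  local missing
  missing="$(missing_commands wal-g aws psql curl
             missing_env WALG_S3_PREFIX AACSEARCH_ADMIN_KEY)"
  if [ -n "$missing" ]; then
    echo "Missing prerequisites:" >&2
    echo "$missing" >&2
    return 1
  fi
  echo "All prerequisites present"
}
```

Running `preflight` after sourcing the file prints either a confirmation or the list of missing items.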
Quick Recovery (RTO < 1h)
1. Restore PostgreSQL from WAL-G
# Set environment
export PGDATABASE=aacsearch
export PGHOST=localhost
export PGUSER=postgres
export WALG_S3_PREFIX=s3://aacsearch-backups/postgres/
# List available backups
wal-g backup-list
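# Stop PostgreSQL before restoring; backup-fetch expects an empty data directory
systemctl stop postgresql
# Restore latest full backup
wal-g backup-fetch /var/lib/postgresql/data LATEST
# Create recovery signal
touch /var/lib/postgresql/data/recovery.signal
# Tell PostgreSQL how to fetch archived WAL (skip if restore_command is already set)
echo "restore_command = 'wal-g wal-fetch %f %p'" >> /var/lib/postgresql/data/postgresql.conf
# Configure recovery target (omit for latest)
# echo "recovery_target_time = '2026-05-03 12:00:00 UTC'" >> /var/lib/postgresql/data/postgresql.conf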
# Start PostgreSQL
systemctl start postgresql
# Verify recovery
psql -c "SELECT pg_is_in_recovery();"
# Should return 't' during recovery, then 'f' when complete
2. Restore AACSearch from Snapshot
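# Look up the latest snapshot timestamp from the pointer file written by the snapshot script
LATEST=$(aws s3 cp s3://aacsearch-backups/AACSearch/latest-snapshot.txt -)
# Download that snapshot from S3
aws s3 cp "s3://aacsearch-backups/AACSearch/AACSearch-$LATEST.tar.gz" /tmp/latest-snapshot.tar.gz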
# Stop AACSearch
systemctl stop AACSearch-server
# Extract snapshot to data directory
tar -xzf /tmp/latest-snapshot.tar.gz -C /var/lib/AACSearch/data/
# Start AACSearch
systemctl start AACSearch-server
# Verify restoration
curl "http://localhost:8108/health" -H "X-AACSEARCH-API-KEY: $AACSEARCH_ADMIN_KEY"
# Should return {"ok": true}
3. Verify Application Health
# Check API health endpoint
curl -s http://localhost:3000/api/health
# Expected: "OK"
# Check Prometheus metrics endpoint
curl -s http://localhost:3000/api/metrics | grep aacsearch
# Verify search works
curl -X POST http://localhost:3000/api/v1/indexes/example/search \
-H "Content-Type: application/json" \
-d '{"q": "*", "perPage": 1}'
Scheduled Backups
PostgreSQL (WAL-G)
WAL-G is configured for continuous WAL archiving. Full backups run daily via cron:
# Daily full backup at 02:00 UTC
0 2 * * * wal-g backup-push /var/lib/postgresql/data
# Verify last backup
wal-g backup-list
AACSearch Snapshot
Snapshot and upload script (/usr/local/bin/AACSearch-snapshot.sh):
#!/bin/bash
set -euo pipefail
SNAPSHOT_DIR=/tmp/AACSearch-snapshot
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
S3_BUCKET=s3://aacsearch-backups/AACSearch/
# Trigger snapshot via API (-f makes curl exit non-zero on an HTTP error, so set -e stops the script)
curl -fsS -X POST "http://localhost:8108/snapshots" \
-H "X-AACSEARCH-API-KEY: $AACSEARCH_ADMIN_KEY" \
-H "Content-Type: application/json" \
-d "{\"snapshot_path\": \"$SNAPSHOT_DIR\"}"
# Compress and upload
tar -czf "/tmp/AACSearch-$TIMESTAMP.tar.gz" -C "$SNAPSHOT_DIR" .
aws s3 cp "/tmp/AACSearch-$TIMESTAMP.tar.gz" "$S3_BUCKET"
# Cleanup
rm -rf "$SNAPSHOT_DIR" "/tmp/AACSearch-$TIMESTAMP.tar.gz"
# Update latest pointer (S3_BUCKET already ends with a slash)
echo "$TIMESTAMP" | aws s3 cp - "${S3_BUCKET}latest-snapshot.txt"
Install via cron (every 6 hours):
0 */6 * * * /usr/local/bin/AACSearch-snapshot.sh
Coolify Backup Configuration
In Coolify, configure the following backup schedule:
- Navigate to your resource → Backups
- Enable Automated Backups
- Set Schedule: 0 2 * * * (daily at 02:00 UTC)
- Set Number of backups to keep: 7
- Set Storage: S3-compatible
- Configure S3 credentials:
  - Endpoint: https://s3.amazonaws.com
  - Bucket: aacsearch-backups
  - Region: us-east-1
  - Access Key: <from 1Password>
  - Secret Key: <from 1Password>
Monthly DR Drill Checklist
Run this drill on the first Monday of every month.
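Standard cron cannot express "first Monday" directly: when both the day-of-month and day-of-week fields are restricted, a job runs if either one matches. A common workaround is to schedule the job for days 1 to 7 and test the weekday in the command. The sketch below assumes GNU date; the function name and the reminder script path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical guard for a drill reminder job.

# Succeed if the given date (default: today) is the first Monday of its month.
# Assumes GNU date for the -d option and the %-d format.
is_first_monday() {
  local day=${1:-today}
  local dow dom
  dow=$(date -d "$day" +%u)   # 1 = Monday
  dom=$(date -d "$day" +%-d)  # day of month without zero padding
  [ "$dow" -eq 1 ] && [ "$dom" -le 7 ]
}

# Example crontab entry (percent signs must be escaped in crontab;
# the reminder script path is hypothetical):
# 0 9 1-7 * * [ "$(date +\%u)" -eq 1 ] && /usr/local/bin/dr-drill-reminder.sh
```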
- Spare environment check
  - Verify spare environment exists and is accessible
  - Check Coolify can deploy from the current git ref
- Restore PostgreSQL
  - Download latest WAL-G backup
  - Restore and verify database integrity
  - Run VACUUM ANALYZE and check for corruption
- Restore AACSearch
  - Download latest snapshot
  - Restore and verify search results
- Application test
  - Verify /api/health returns OK
  - Run a test search query
  - Verify API key authentication
  - Check widget JS is served correctly
- Metrics verification
  - Prometheus targets are up
  - Grafana dashboards show data
  - No critical alerts firing
- Sign-off
  - Log drill results in company wiki
  - Note any issues found and file follow-up issues
  - Update runbook if gaps were found
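Parts of the checklist can be scripted so that each drill produces the same pass/fail signal. The sketch below covers the PostgreSQL recovery check and the application endpoints used in this runbook. The function names are hypothetical, and the defaults (60 attempts, 5 seconds apart) are arbitrary assumptions. If a recovery target is configured, PostgreSQL may pause at the target instead of leaving recovery, in which case the wait will time out.

```shell
#!/usr/bin/env bash
# Hypothetical drill helpers; endpoints and ports are the ones used in this runbook.

# Poll until PostgreSQL reports it has left recovery, or give up.
# $1 = max attempts (default 60), $2 = seconds between attempts (default 5).
wait_for_recovery() {
  local attempts=${1:-60} delay=${2:-5} i state
  for (( i = 1; i <= attempts; i++ )); do
    # -A (unaligned) and -t (tuples only) make psql print just 't' or 'f'
    state=$(psql -At -c "SELECT pg_is_in_recovery();" 2>/dev/null || true)
    if [ "$state" = "f" ]; then
      echo "PostgreSQL recovery complete after $i check(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "PostgreSQL still in recovery after $attempts checks" >&2
  return 1
}

# Succeed if the URL answers without an HTTP error status.
check_http() {
  curl -fsS -o /dev/null "$1"
}

# Run the database and application checks in order, stopping at the first failure.
drill_checks() {
  wait_for_recovery "$@" &&
  check_http "http://localhost:3000/api/health" &&
  check_http "http://localhost:3000/api/metrics"
}
```

Calling `drill_checks` with no arguments uses the default polling settings; its exit status can be recorded in the drill sign-off.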
Alert Contacts
| Role | Contact |
|---|---|
| Infrastructure | Via PagerDuty (schedule: infra) |
| Database Admin | Via PagerDuty (schedule: dba) |
| Security | security@aacsearch.com |
Related
- Prometheus Alert Rules — Alertmanager configuration
- Grafana Dashboards — Infrastructure dashboards
- WAL-G Documentation — PostgreSQL backup tool
- AACSearch Snapshots — AACSearch backup guide