Disks fail, RAM runs short, software breaks, and human error introduces faults that spread through a PostgreSQL cluster without warning. When these events occur, data integrity depends on a disciplined recovery process rather than ad hoc fixes. This talk provides a structured approach to handling corruption and service failures in production environments. The session begins with early-detection methods based on log analysis, checksum validation, page header inspection, and common indicators of broken storage or inconsistent WAL records.

Once failure signals appear, the next step is to stop the service immediately to prevent further changes to damaged files. Recovery then moves to restoration from verified backups, with emphasis on base backups checked for integrity and WAL archives stored with consistent retention rules. After restoration, point-in-time recovery establishes a clean state by selecting precise timestamps or LSN markers that precede the corrupting event. When backups are incomplete or missing, salvage techniques extract healthy tables via targeted dumps or low-level page inspection, enabling partial recovery when a full restore is not possible. In situations where the cluster cannot proceed with standard recovery, pg_resetwal remains an option of last resort, used only to regain startup access while accepting the loss of recent transactions.

The session concludes with a practical set of measures to reduce future risk, including routine checksum use, scheduled integrity checks, durable backup policies, WAL archiving discipline, and the addition of high-availability replicas to support failover during critical events. The focus stays on established commands, stable operational habits, and recovery actions proven to limit data loss in real deployments.
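A minimal sketch of the early-detection step, assuming a running PostgreSQL 12+ cluster; the log directory, database name, and grep patterns are illustrative placeholders, and exact log wording varies between versions:

```bash
#!/usr/bin/env bash
# Early-detection sketch: scan logs, read checksum-failure counters, and run
# amcheck probes on a running cluster. Paths and the database name (mydb)
# are placeholders.
set -euo pipefail

LOGDIR=/var/log/postgresql        # hypothetical log directory
DB=mydb                           # hypothetical database name

# 1. Scan recent server logs for typical corruption indicators
#    (message wording differs across PostgreSQL versions).
grep -E "invalid page|checksum|could not read block|incorrect resource manager data" \
    "$LOGDIR"/*.log || echo "no corruption indicators found in logs"

# 2. If data checksums are enabled, failed page verifications are counted
#    per database in pg_stat_database.
psql -d "$DB" -Atc "SELECT datname, checksum_failures
                    FROM pg_stat_database
                    WHERE checksum_failures > 0;"

# 3. Logical verification of btree indexes with the amcheck extension.
psql -d "$DB" -c "CREATE EXTENSION IF NOT EXISTS amcheck;"
psql -d "$DB" -c "SELECT bt_index_check(c.oid)
                  FROM pg_class c
                  JOIN pg_am am ON am.oid = c.relam
                  WHERE am.amname = 'btree' AND c.relkind = 'i';"

# For an offline scan of every data page, stop the cluster cleanly and run:
#   pg_checksums --check --pgdata=/var/lib/postgresql/16/main   # PG 12+
```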
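Restoration and point-in-time recovery could then look roughly like the following, assuming a plain-format base backup taken with pg_basebackup, a WAL archive reachable on a local path, and PostgreSQL 13 or later for pg_verifybackup; every path and the recovery target are hypothetical:

```bash
#!/usr/bin/env bash
# PITR sketch: verify a base backup, restore it, and replay archived WAL up to
# a point just before the corrupting event. Paths and timestamps are placeholders.
set -euo pipefail

BACKUP=/backups/base/20240501            # hypothetical plain-format base backup
ARCHIVE=/backups/wal                     # hypothetical WAL archive
PGDATA=/var/lib/postgresql/16/main       # hypothetical data directory
TARGET_TIME='2024-05-01 09:55:00+00'     # hypothetical point just before corruption

# 1. Check the base backup against its manifest (PostgreSQL 13+).
pg_verifybackup "$BACKUP"

# 2. Stop the damaged cluster and move it aside rather than deleting it.
pg_ctl -D "$PGDATA" stop -m fast || true
mv "$PGDATA" "${PGDATA}.damaged.$(date +%s)"

# 3. Restore the verified base backup into a fresh data directory.
cp -a "$BACKUP" "$PGDATA"
chmod 700 "$PGDATA"

# 4. Configure recovery (assumes postgresql.conf lives inside the data directory).
cat >> "$PGDATA/postgresql.conf" <<EOF
restore_command = 'cp $ARCHIVE/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
# recovery_target_lsn = '0/3000060'      # alternative: stop at a known-good LSN
touch "$PGDATA/recovery.signal"

# 5. Start the server; it replays WAL until the target, then promotes.
pg_ctl -D "$PGDATA" start
```

A recovery_target_lsn taken from pg_waldump output can be more precise than a timestamp when the moment of corruption is known only from the WAL stream.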
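When no usable backup exists, a salvage pass along the lines sketched below can recover whatever still reads cleanly; the database and table names are placeholders, simple unquoted table names are assumed, and zero_damaged_pages requires superuser rights and knowingly discards the rows on broken pages:

```bash
#!/usr/bin/env bash
# Salvage sketch: dump tables individually so one broken table does not abort
# the whole export, then inspect damaged pages at a low level.
set -euo pipefail

DB=mydb                 # hypothetical database name
OUTDIR=/salvage         # hypothetical output directory
mkdir -p "$OUTDIR"

# 1. Per-table dumps; failures are recorded instead of stopping the loop.
for tbl in $(psql -d "$DB" -Atc \
    "SELECT schemaname || '.' || tablename FROM pg_tables
     WHERE schemaname NOT IN ('pg_catalog', 'information_schema');"); do
  if ! pg_dump -d "$DB" --table="$tbl" --file="$OUTDIR/${tbl//./_}.sql"; then
    echo "$tbl" >> "$OUTDIR/failed_tables.txt"
  fi
done

# 2. Inspect page headers of a suspect table (here 'public.orders') with pageinspect.
psql -d "$DB" -c "CREATE EXTENSION IF NOT EXISTS pageinspect;"
psql -d "$DB" -c "SELECT * FROM page_header(get_raw_page('public.orders', 0));"

# 3. Deliberate trade-off: zero_damaged_pages lets a copy proceed past broken
#    pages, losing their rows (superuser only).
psql -d "$DB" \
  -c "SET zero_damaged_pages = on;" \
  -c "\copy public.orders TO '$OUTDIR/orders_partial.copy'"
```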
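As a last resort, when the cluster will not start and no recovery path remains, the following sketch shows pg_resetwal used with the safety steps the talk stresses; the paths are placeholders, and the reset cluster should be dumped and rebuilt rather than kept in service:

```bash
#!/usr/bin/env bash
# Last-resort sketch: pg_resetwal discards WAL the server could not replay, so
# recent transactions are lost and the cluster must be treated as suspect.
set -euo pipefail

PGDATA=/var/lib/postgresql/16/main     # hypothetical data directory

# 1. Preserve the evidence: raw copy of the data directory before touching it.
cp -a "$PGDATA" "${PGDATA}.before-resetwal.$(date +%s)"

# 2. Dry run: report the control values that would be written, change nothing.
pg_resetwal -n "$PGDATA"

# 3. The real reset; -f forces it even when pg_control looks damaged.
pg_resetwal -f "$PGDATA"

# 4. Start the cluster only to dump it, then rebuild from the dump.
pg_ctl -D "$PGDATA" start
pg_dumpall --file=/salvage/post_resetwal_dumpall.sql
```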
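Finally, a prevention checklist in the same spirit, with all paths, schedules, and settings as illustrative assumptions rather than recommendations for any particular cluster:

```bash
#!/usr/bin/env bash
# Prevention sketch: data checksums, disciplined WAL archiving, verified base
# backups, and a streaming replica to fail over to. Paths are placeholders.
set -euo pipefail

PGDATA=/var/lib/postgresql/16/main     # hypothetical data directory
ARCHIVE=/backups/wal                   # hypothetical WAL archive
STAMP=$(date +%Y%m%d)

# 1. Data checksums, either at initdb time or retrofitted offline (PG 12+):
#      initdb --data-checksums -D "$PGDATA"
#      pg_checksums --enable --pgdata="$PGDATA"     # cluster must be stopped
psql -Atc "SHOW data_checksums;"                     # should report 'on'

# 2. WAL archiving with a command that fails instead of overwriting
#    (archive_mode takes effect after a restart).
cat >> "$PGDATA/postgresql.conf" <<EOF
archive_mode = on
archive_command = 'test ! -f $ARCHIVE/%f && cp %p $ARCHIVE/%f'
EOF

# 3. Routine base backups, verified and kept off the primary host.
pg_basebackup -D "/backups/base/$STAMP" -Fp --checkpoint=fast --progress
pg_verifybackup "/backups/base/$STAMP"

# 4. A hot-standby replica to fail over to; confirm it is streaming.
psql -Atc "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
```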