# 🔒 Disaster Recovery Procedures
This document outlines disaster recovery procedures for the homelab infrastructure. Follow them when dealing with catastrophic failures or data loss events.

## 🎯 Recovery Objectives

### Recovery Time Objective (RTO)

- **Critical Services**: 30 minutes
- **Standard Services**: 2 hours
- **Non-Critical**: 1 day

### Recovery Point Objective (RPO)

- **Critical Data**: 1 hour
- **Standard Data**: 24 hours
- **Non-Critical**: 7 days
## 🧰 Recovery Resources

### Backup Locations

1. **Local NAS Copies**: Hyper Backup to Calypso
2. **Cloud Storage**: Backblaze B2 (primary)
3. **Offsite Replication**: Syncthing to Setillo
4. **Docker Configs**: Git repository with Syncthing sync

### Emergency Access

- Tailscale VPN access (primary)
- Physical console access to hosts
- SSH keys stored in Vaultwarden
- Emergency USB drives with recovery tools
## 🚨 Incident Response Workflow

### 1. Initial Assessment

```
1. Confirm nature of incident
2. Determine scope and impact
3. Notify team members
4. Document incident time and details
5. Activate appropriate recovery procedures
```
### 2. Service Restoration Priority

```
Critical (1-2 hours):
├── Authentik SSO
├── Gitea Git hosting
├── Vaultwarden password manager
└── Nginx Proxy Manager

Standard (6-24 hours):
├── Docker configurations
├── Database services
├── Media servers
└── Monitoring stack

Non-Critical (1 week):
├── Development instances
└── Test environments
```
### 3. Recovery Steps

#### Docker Stack Recovery

1. Navigate to the corresponding Git repository
2. Verify stack compose file integrity
3. Deploy using GitOps in Portainer
4. Restore any required data from backups
5. Validate container status and service access
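The steps above can be sketched as a shell helper, assuming a local Git checkout of each stack and the `docker compose` plugin; the path and stack name are placeholders, not the real repository layout:

```shell
#!/bin/sh
# Sketch of the Docker stack recovery steps above. The repo path and
# stack name are hypothetical, not the real layout.
set -eu

recover_stack() {
    stack_dir=$1                              # e.g. ~/stacks/vaultwarden (assumed)
    compose="$stack_dir/docker-compose.yml"

    git -C "$stack_dir" pull --ff-only        # 1. fetch latest committed config
    docker compose -f "$compose" config -q    # 2. verify the compose file parses
    docker compose -f "$compose" up -d        # 3. deploy (manual fallback to GitOps)
    # 4. data restores are covered under "Data Restoration" below
    docker compose -f "$compose" ps           # 5. validate container status
}

# Usage (not run here): recover_stack "$HOME/stacks/vaultwarden"
```

Portainer GitOps normally performs the deploy; the manual form is useful when Portainer itself is down.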
#### Data Restoration

1. Identify backup source (Backblaze B2, NAS)
2. Confirm available restore points
3. Select appropriate backup version
4. Execute restoration process
5. Verify data integrity
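Steps 2-3 can be sketched in shell. The snapshot-per-day directory naming (`YYYY-MM-DD`) is an assumption about the backup layout, and the throwaway directory stands in for the real backup share:

```shell
#!/bin/sh
# Enumerate restore points and pick the newest (steps 2-3 above).
# Daily YYYY-MM-DD snapshot naming is an assumed layout.
set -eu

latest_restore_point() {
    # Lexicographic sort is chronological for YYYY-MM-DD names.
    ls "$1" | sort | tail -n 1
}

# Demo against a throwaway directory standing in for the backup share.
demo=$(mktemp -d)
mkdir "$demo/2025-01-02" "$demo/2025-01-10" "$demo/2025-01-07"
latest_restore_point "$demo"    # prints 2025-01-10
rm -rf "$demo"
```

The same selection logic applies whether the restore points live on the NAS or in a B2 bucket listing.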
## 📦 Service-Specific Recovery

### Authentik SSO Recovery

- Source: Calypso B2 daily backups
- Restoration time: <30 minutes
- Key files: PostgreSQL database and config files
- Required permissions for restore access

### Gitea Git Hosting

- Source: Calypso B2 daily backups
- Restoration time: <30 minutes
- Key files: MariaDB database, repository data
- Ensure service accounts are recreated post-restore
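For the Authentik database, a restore might look like the following sketch; the container names (`authentik-server`, `authentik-postgres`), database name, and dump path are assumptions, not the deployment's real names:

```shell
#!/bin/sh
# Hypothetical Authentik PostgreSQL restore sketch. Container names,
# database user, and dump filename are assumptions.
set -eu

restore_authentik_db() {
    dump=$1    # e.g. a pg_dump .sql.gz pulled back from B2

    docker stop authentik-server                   # stop writers first
    gunzip -c "$dump" | docker exec -i authentik-postgres \
        psql -U authentik -d authentik             # load the dump
    docker start authentik-server
}

# Usage (not run here): restore_authentik_db ./authentik-2025-01-10.sql.gz
```

The Gitea MariaDB restore follows the same stop-load-start shape with `mysql` in place of `psql`.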
### Backup Systems

- Local Hyper Backup: Calypso /volume1/backups/
- Cloud B2 buckets: vk-atlantis, vk-concord-1, vk-setillo, vk-guava
- Critical services: Atlantis NAS, Calypso NAS, Setillo NAS, Guava TrueNAS
- Restore method: manual, using existing Hyper Backup tasks or restoring from other sources

### Media Services

- Plex: Local storage + metadata backed up
- Jellyfin: Local storage with metadata recovery
- Immich: Photo DB plus media backup
- Recovery time: <1 hour for basic access
## 🎯 Recovery Testing

### Quarterly Tests

1. Simulate hardware failures
2. Conduct full data restores
3. Verify service availability post-restore
4. Document test results and improvements

### Automation Testing

- Scripted recovery workflows
- Docker compose file validation
- Backup integrity checks
- Restoration time measurements
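A minimal form of the backup integrity check is a stored checksum manifest re-verified at test time. This demo runs against throwaway files; the real check would walk the backup share:

```shell
#!/bin/sh
# Backup integrity check via a sha256 manifest, demonstrated on
# throwaway files standing in for real archives.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

printf 'backup payload' > app-data.tar    # stand-in for a real archive
sha256sum app-data.tar > SHA256SUMS       # manifest written at backup time

# During a quarterly test (or a restore), re-verify the archive:
sha256sum -c SHA256SUMS                   # prints "app-data.tar: OK"

cd - >/dev/null
rm -rf "$workdir"
```

A non-zero exit from `sha256sum -c` is the signal to pull the file from another backup tier instead.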
## 📋 Recovery Checklists

### Complete Infrastructure Restore

- [ ] Power cycle failed hardware
- [ ] Reinstall operating system (DSM for Synology)
- [ ] Configure basic network settings
- [ ] Initialize storage volumes
- [ ] Install Docker and Portainer
- [ ] Clone Git repository to local directory
- [ ] Deploy stacks from Git (Portainer GitOps)
- [ ] Restore service-specific data from backups
- [ ] Test all services through Tailscale
- [ ] Verify external access through Cloudflare
### Critical Service Restore

- [ ] Confirm service is down
- [ ] Validate backup availability for service
- [ ] Initiate restore process
- [ ] Monitor progress
- [ ] Resume service configuration
- [ ] Test functionality
- [ ] Update monitoring
## 🔄 Failover Procedures

### Host-Level Failover

1. Identify primary host failure
2. Deploy stack to alternative host
3. Validate access via Tailscale
4. Update DNS if needed (Cloudflare)
5. Confirm service availability externally
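Step 2 can be sketched with the Docker CLI's SSH transport, assuming SSH access to the alternative host and the `docker compose` plugin; the host name and stack path are placeholders:

```shell
#!/bin/sh
# Deploy a stack to an alternative host over SSH (step 2 above).
# The host name "concord" and stack path are hypothetical.
set -eu

failover_stack() {
    alt_host=$1     # e.g. concord (assumed Tailscale hostname)
    stack_dir=$2    # local checkout containing the compose file

    # DOCKER_HOST=ssh://... points the local CLI at the remote daemon,
    # so the same compose file deploys without copying it over first.
    DOCKER_HOST="ssh://$alt_host" \
        docker compose -f "$stack_dir/docker-compose.yml" up -d
}

# Usage (not run here): failover_stack concord "$HOME/stacks/vaultwarden"
```

Bind-mounted data still has to be restored onto the alternative host separately; only the stack definition travels this way.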
### Network-Level Failover

1. Switch traffic routing via Cloudflare
2. Update DNS records for affected services
3. Test connectivity from multiple sources
4. Monitor service health in Uptime Kuma
5. Document routing changes
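Steps 2-3 can be verified with standard tools; the hostname below is a placeholder, not a real service URL:

```shell
#!/bin/sh
# Verify a DNS update took effect and the service answers (steps 2-3).
# The hostname passed in is hypothetical.
set -eu

check_service() {
    host=$1

    dig +short "$host"                                  # confirm the record resolves
    curl -fsS -o /dev/null --max-time 10 "https://$host" \
        && echo "$host reachable"
}

# Usage (not run here): check_service vault.example.net
```

Running the same check from a host outside Tailscale catches cases where the service is only reachable over the tailnet.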
## ⚠️ Known Limitations

### Unbacked Data

- **Jellyfish (RPi 5)**: Photos-only backup, no cloud sync
- **Homelab VM**: Monitoring databases are stateless and rebuildable
- **Concord NUC**: Small config files that can be regenerated

### Recovery Dependencies

- Some services require Tailscale access for proper operation
- External DNS resolution depends on Cloudflare being operational
- Backup restoration assumes sufficient disk space is available
## 📚 Related Documentation

- [Backup Strategy](../infrastructure/backup-strategy.md)
- [Security Model](../infrastructure/security.md)
- [Monitoring Stack](../infrastructure/monitoring/README.md)
- [Troubleshooting Guide](../troubleshooting/comprehensive-troubleshooting.md)

---

*Last updated: 2026*