Sanitized mirror from private repository - 2026-04-05 05:34:18 UTC
176
docs/admin/disaster-recovery.md
Normal file
@@ -0,0 +1,176 @@
# 🔒 Disaster Recovery Procedures

This document describes disaster recovery procedures for the homelab infrastructure. Follow them after a catastrophic failure or data loss event.

## 🎯 Recovery Objectives

### Recovery Time Objective (RTO)

- **Critical Services**: 30 minutes
- **Standard Services**: 2 hours
- **Non-Critical**: 1 day

### Recovery Point Objective (RPO)

- **Critical Data**: 1 hour
- **Standard Data**: 24 hours
- **Non-Critical**: 7 days
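
An RPO is only met if the newest backup of each dataset is younger than the tier's limit. The sketch below shows one way to check that, using the RPO values listed above. The dataset names and timestamps are hypothetical; in practice the timestamps would have to come from the backup tool's own logs or API.

```python
from datetime import datetime, timedelta, timezone

# RPO per data tier, copied from the list above.
RPO = {
    "critical": timedelta(hours=1),
    "standard": timedelta(hours=24),
    "non-critical": timedelta(days=7),
}


def rpo_violations(last_backup, now):
    """Return {dataset: overdue_by} for datasets whose newest backup exceeds its RPO.

    last_backup maps a dataset name to (tier, timestamp of the newest backup).
    """
    violations = {}
    for name, (tier, taken_at) in last_backup.items():
        age = now - taken_at
        if age > RPO[tier]:
            violations[name] = age - RPO[tier]
    return violations


# Hypothetical inventory, for illustration only.
now = datetime(2026, 4, 5, 12, 0, tzinfo=timezone.utc)
inventory = {
    "vaultwarden-db": ("critical", now - timedelta(minutes=20)),
    "media-metadata": ("standard", now - timedelta(hours=30)),
}
print(rpo_violations(inventory, now))
```

A non-empty result means at least one dataset would lose more data than its tier allows if it had to be restored right now.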

## 🧰 Recovery Resources

### Backup Locations

1. **Local NAS Copies**: Hyper Backup to Calypso
2. **Cloud Storage**: Backblaze B2 (primary)
3. **Offsite Replication**: Syncthing to Setillo
4. **Docker Configs**: Git repository with Syncthing sync

### Emergency Access

- Tailscale VPN access (primary)
- Physical console access to hosts
- SSH keys stored in Vaultwarden
- Emergency USB drives with recovery tools

## 🚨 Incident Response Workflow

### 1. **Initial Assessment**

```
1. Confirm nature of incident
2. Determine scope and impact
3. Notify team members
4. Document incident time and details
5. Activate appropriate recovery procedures
```

### 2. **Service Restoration Priority**

```
Critical (1-2 hours):
├── Authentik SSO
├── Gitea Git hosting
├── Vaultwarden password manager
└── Nginx Proxy Manager

Standard (6-24 hours):
├── Docker configurations
├── Database services
├── Media servers
└── Monitoring stack

Non-Critical (1 week):
├── Development instances
└── Test environments
```
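
When several services are down at once, the priority tree above determines the order of work. A minimal sketch of turning it into a restore queue follows; the tier and service names are copied from the tree, while `restore_queue` is a hypothetical helper, not part of any existing tooling.

```python
# Tiers and services in the order given by the priority tree above.
TIERS = [
    ("critical", ["Authentik SSO", "Gitea Git hosting",
                  "Vaultwarden password manager", "Nginx Proxy Manager"]),
    ("standard", ["Docker configurations", "Database services",
                  "Media servers", "Monitoring stack"]),
    ("non-critical", ["Development instances", "Test environments"]),
]


def restore_queue(affected):
    """Sort affected services by tier, then by their position within the tier.

    Services that are not in the tree are placed last, in their input order.
    """
    rank = {}
    for tier_index, (_, services) in enumerate(TIERS):
        for position, service in enumerate(services):
            rank[service] = (tier_index, position)
    unknown = (len(TIERS), 0)
    return sorted(affected, key=lambda service: rank.get(service, unknown))


print(restore_queue(["Media servers", "Test environments",
                     "Vaultwarden password manager", "Authentik SSO"]))
```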

### 3. **Recovery Steps**

#### Docker Stack Recovery

1. Navigate to corresponding Git repository
2. Verify stack compose file integrity
3. Deploy using GitOps in Portainer
4. Restore any required data from backups
5. Validate container status and service access
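
Step 2 can be scripted before anything is deployed. The sketch below assumes the Docker Compose v2 CLI is installed on the host and relies on `docker compose config --quiet`, which parses the file and exits non-zero if it is invalid. The function names and the injectable `run` parameter are illustrative choices, not an existing interface.

```python
import subprocess


def compose_check_command(compose_file):
    """Build the command that validates a compose file without printing it."""
    return ["docker", "compose", "-f", str(compose_file), "config", "--quiet"]


def compose_file_is_valid(compose_file, run=subprocess.run):
    """Return True if Docker Compose accepts the file.

    `run` is injectable so the check can be exercised without Docker present.
    """
    result = run(compose_check_command(compose_file),
                 capture_output=True, text=True)
    return result.returncode == 0
```

This only confirms that the file parses and resolves. It does not prove that images can be pulled or that volumes contain the expected data.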

#### Data Restoration

1. Identify backup source (Backblaze B2, NAS)
2. Confirm available restore points
3. Select appropriate backup version
4. Execute restoration process
5. Verify data integrity
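
One way to make step 5 concrete is to compare SHA-256 hashes of the restored files against a manifest recorded at backup time. The sketch below assumes such a manifest exists as a mapping of relative path to hex digest; this document does not say whether one is kept, so treat the manifest format as hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest, restored_root):
    """Return the relative paths that are missing or whose hash differs.

    manifest maps a relative path to the SHA-256 hex digest recorded at backup time.
    """
    root = Path(restored_root)
    bad = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(relative_path)
    return sorted(bad)
```

An empty list means every file in the manifest was restored bit-for-bit. Files present on disk but absent from the manifest are not reported.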

## 📦 Service-Specific Recovery

### Authentik SSO Recovery

- Source: Calypso B2 daily backups
- Restoration time: <30 minutes
- Key files: PostgreSQL database and config files
- Restore requires appropriate access permissions

### Gitea Git Hosting

- Source: Calypso B2 daily backups
- Restoration time: <30 minutes
- Key files: MariaDB database, repository data
- Ensure service accounts are recreated post-restore

### Backup Systems

- Local Hyper Backup: Calypso /volume1/backups/
- Cloud B2: vk-atlantis, vk-concord-1, vk-setillo, vk-guava
- Critical services: Atlantis NAS, Calypso NAS, Setillo NAS, Guava TrueNAS
- Restore method: manual, either through the existing backup tasks or from another backup source

### Media Services

- Plex: Local storage, with metadata backed up
- Jellyfin: Local storage with metadata recovery
- Immich: Photo DB plus media backup
- Recovery time: <1 hour for basic access

## 🎯 Recovery Testing

### Quarterly Tests

1. Simulate hardware failures
2. Conduct full data restores
3. Verify service availability post-restore
4. Document test results and improvements

### Automation Testing

- Scripted recovery workflows
- Docker compose file validation
- Backup integrity checks
- Restoration time measurements
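
Restoration time measurements are most useful when compared directly against the RTO for the service's tier. The following sketch wraps a scripted restore in a timer and records whether it finished within the limit. The RTO values come from the Recovery Objectives section; `timed_restore` and the result format are hypothetical.

```python
import time
from contextlib import contextmanager
from datetime import timedelta

# RTO per service tier, copied from the Recovery Objectives section.
RTO = {
    "critical": timedelta(minutes=30),
    "standard": timedelta(hours=2),
    "non-critical": timedelta(days=1),
}


@contextmanager
def timed_restore(service, tier, results, clock=time.monotonic):
    """Time the wrapped block and append the outcome to `results`.

    The measurement is recorded even if the restore raises an exception.
    """
    started = clock()
    try:
        yield
    finally:
        elapsed = timedelta(seconds=clock() - started)
        results.append({
            "service": service,
            "elapsed": elapsed,
            "within_rto": elapsed <= RTO[tier],
        })


results = []
with timed_restore("authentik", "critical", results):
    pass  # run the scripted restore for this service here
print(results)
```

Keeping the results from each quarterly test makes it possible to see whether restore times drift toward the RTO as data volumes grow.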

## 📋 Recovery Checklists

### Complete Infrastructure Restore

- [ ] Power cycle failed hardware
- [ ] Reinstall operating system (DSM for Synology)
- [ ] Configure basic network settings
- [ ] Initialize storage volumes
- [ ] Install Docker and Portainer
- [ ] Clone Git repository to local directory
- [ ] Deploy stacks from Git (Portainer GitOps)
- [ ] Restore service-specific data from backups
- [ ] Test all services through Tailscale
- [ ] Verify external access through Cloudflare

### Critical Service Restore

- [ ] Confirm service is down
- [ ] Validate backup availability for service
- [ ] Initiate restore process
- [ ] Monitor progress
- [ ] Resume service configuration
- [ ] Test functionality
- [ ] Update monitoring

## 🔄 Failover Procedures

### Host-Level Failover

1. Identify primary host failure
2. Deploy stack to alternative host
3. Validate access via Tailscale
4. Update DNS if needed (Cloudflare)
5. Confirm service availability from external access

### Network-Level Failover

1. Switch traffic routing via Cloudflare
2. Update DNS records for affected services
3. Test connectivity from multiple sources
4. Monitor service health in Uptime Kuma
5. Document routing changes
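
The connectivity checks in both failover procedures can be partly scripted as plain TCP probes. The sketch below only confirms that a port accepts connections; it says nothing about whether the application behind it is healthy, so it complements Uptime Kuma rather than replacing it. The helper names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints):
    """Probe each endpoint; endpoints maps a label to a (host, port) pair."""
    return {name: port_is_open(host, port)
            for name, (host, port) in endpoints.items()}
```

Running `check_endpoints` once over the Tailscale addresses and once from an outside network covers both the internal and external view of a failover. The host names and ports to probe depend on the deployment and are not specified here.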

## ⚠️ Known Limitations

### Unbacked Data

- **Jellyfish (RPi 5)**: Photos-only backup, no cloud sync
- **Homelab VM**: Monitoring databases hold no critical state and can be rebuilt
- **Concord NUC**: Small config files that can be regenerated

### Recovery Dependencies

- Some services require Tailscale access for proper operation
- External DNS resolution depends on Cloudflare being operational
- Backup restoration assumes sufficient disk space is available
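
The disk space assumption is cheap to verify before a restore starts. A minimal sketch using the standard library follows; the 10% headroom is an arbitrary illustrative margin, and the backup size has to be supplied by the caller.

```python
import shutil


def has_room_for_restore(target_dir, backup_bytes, headroom=0.10):
    """Return True if the filesystem holding target_dir can fit the backup.

    `headroom` adds a safety margin on top of the raw backup size.
    """
    free = shutil.disk_usage(target_dir).free
    return free >= backup_bytes * (1 + headroom)
```

For compressed or deduplicated backups the restored size can be much larger than the stored size, so the uncompressed size is the safer number to pass in.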

## 📚 Related Documentation

- [Backup Strategy](../infrastructure/backup-strategy.md)
- [Security Model](../infrastructure/security.md)
- [Monitoring Stack](../infrastructure/monitoring/README.md)
- [Troubleshooting Guide](../troubleshooting/comprehensive-troubleshooting.md)

---

*Last updated: 2026*