Sanitized mirror from private repository - 2026-04-05 05:34:18 UTC

2026-04-05 05:34:18 +00:00
commit 3406d7ce05
1390 changed files with 353978 additions and 0 deletions
--- a/docs/admin/disaster-recovery.md
+++ b/docs/admin/disaster-recovery.md
@@ -0,0 +1,176 @@
+# 🔒 Disaster Recovery Procedures
+
+This document describes the disaster recovery procedures for the homelab infrastructure. Follow them after a catastrophic failure or data loss event.
+
+## 🎯 Recovery Objectives
+
+### Recovery Time Objective (RTO)
+- **Critical Services**: 30 minutes
+- **Standard Services**: 2 hours  
+- **Non-Critical**: 1 day
+
+### Recovery Point Objective (RPO)
+- **Critical Data**: 1 hour
+- **Standard Data**: 24 hours
+- **Non-Critical**: 7 days
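+
+The RPO tiers above can be checked mechanically against the age of the newest backup. A minimal Python sketch, assuming hypothetical tier keys (`critical`, `standard`, `non_critical`) and that the timestamp of the last successful backup can be read from the backup tool:
+
+```python
+from datetime import datetime, timedelta, timezone
+
+# RPO per data tier, copied from the list above.
+# The dictionary keys are hypothetical labels chosen for this sketch.
+RPO = {
+    "critical": timedelta(hours=1),
+    "standard": timedelta(hours=24),
+    "non_critical": timedelta(days=7),
+}
+
+
+def meets_rpo(tier: str, last_backup: datetime, now: datetime) -> bool:
+    """Return True if the newest backup is recent enough for the tier's RPO."""
+    return now - last_backup <= RPO[tier]
+
+
+# Example: the newest backup finished three hours before the check.
+now = datetime(2026, 4, 5, 12, 0, tzinfo=timezone.utc)
+last_backup = now - timedelta(hours=3)
+status = {tier: meets_rpo(tier, last_backup, now) for tier in RPO}
+```
+
+In this example a three-hour-old backup satisfies the standard and non-critical tiers but misses the one-hour target for critical data.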
+
+## 🧰 Recovery Resources
+
+### Backup Locations
+1. **Local NAS Copies**: Hyper Backup to Calypso
+2. **Cloud Storage**: Backblaze B2 (primary)
+3. **Offsite Replication**: Syncthing to Setillo
+4. **Docker Configs**: Git repository with Syncthing sync
+
+### Emergency Access
+- Tailscale VPN access (primary)
+- Physical console access to hosts
+- SSH keys stored in Vaultwarden
+- Emergency USB drives with recovery tools
+
+## 🚨 Incident Response Workflow
+
+### 1. **Initial Assessment**
+```
+1. Confirm nature of incident
+2. Determine scope and impact
+3. Notify team members
+4. Document incident time and details
+5. Activate appropriate recovery procedures
+```
+
+### 2. **Service Restoration Priority**
+```
+Critical (RTO: 30 minutes):
+├── Authentik SSO
+├── Gitea Git hosting
+├── Vaultwarden password manager
+└── Nginx Proxy Manager
+
+Standard (RTO: 2 hours):
+├── Docker configurations
+├── Database services
+├── Media servers
+└── Monitoring stack
+
+Non-Critical (RTO: 1 day):
+├── Development instances
+└── Test environments
+```
+
+### 3. **Recovery Steps**
+
+#### Docker Stack Recovery
+1. Navigate to corresponding Git repository
+2. Verify stack compose file integrity
+3. Deploy using GitOps in Portainer  
+4. Restore any required data from backups
+5. Validate container status and service access
+
+#### Data Restoration
+1. Identify backup source (Backblaze B2, NAS)
+2. Confirm available restore points
+3. Select appropriate backup version
+4. Execute restoration process
+5. Verify data integrity
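+
+Steps 2 and 3 amount to choosing the newest restore point that predates the incident, so that the restored data does not include the corruption or loss. A minimal sketch, assuming the restore point timestamps have already been listed from the backup source; `pick_restore_point` is a hypothetical helper, not part of Hyper Backup or any B2 tooling:
+
+```python
+from datetime import datetime, timezone
+from typing import Optional, Sequence
+
+
+def pick_restore_point(
+    restore_points: Sequence[datetime], incident_time: datetime
+) -> Optional[datetime]:
+    """Return the newest restore point taken strictly before the incident.
+
+    Returns None when no restore point predates the incident.
+    """
+    candidates = [p for p in restore_points if p < incident_time]
+    return max(candidates, default=None)
+
+
+# Example: daily restore points, incident noticed on 4 April at 09:30 UTC.
+points = [
+    datetime(2026, 4, 2, 3, 0, tzinfo=timezone.utc),
+    datetime(2026, 4, 3, 3, 0, tzinfo=timezone.utc),
+    datetime(2026, 4, 4, 3, 0, tzinfo=timezone.utc),
+]
+chosen = pick_restore_point(points, datetime(2026, 4, 4, 9, 30, tzinfo=timezone.utc))
+```
+
+If the damage may have started before it was noticed, pass the earliest suspected time as `incident_time` rather than the detection time.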
+
+## 📦 Service-Specific Recovery
+
+### Authentik SSO Recovery
+- Source: Calypso B2 daily backups
+- Restoration time: <30 minutes  
+- Key files: PostgreSQL database and config files
+- Prerequisite: an account with permission to access and restore these backups
+
+### Gitea Git Hosting
+- Source: Calypso B2 daily backups
+- Restoration time: <30 minutes
+- Key files: MariaDB database, repository data
+- Ensure service accounts are recreated post-restore
+
+### Backup Systems 
+- Local Hyper Backup: Calypso /volume1/backups/
+- Cloud B2: vk-atlantis, vk-concord-1, vk-setillo, vk-guava  
+- Covered hosts: Atlantis NAS, Calypso NAS, Setillo NAS, Guava TrueNAS
+- Restore method: manual, using the existing backup tasks; if those are unavailable, restore from another backup location
+
+### Media Services 
+- Plex: Local storage + metadata backed up
+- Jellyfin: Local storage with metadata recovery
+- Immich: Photo DB plus media backup  
+- Recovery time: <1 hour for basic access
+
+## 🎯 Recovery Testing
+
+### Quarterly Tests
+1. Simulate hardware failures
+2. Conduct full data restores
+3. Verify service availability post-restore
+4. Document test results and improvements
+
+### Automation Testing
+- Scripted recovery workflows 
+- Docker compose file validation
+- Backup integrity checks  
+- Restoration time measurements
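+
+One way to script the backup integrity check is to compare a file's SHA-256 digest with a digest recorded when the backup was taken. The sketch below assumes such a recorded digest exists; the function names are hypothetical, and it does not replace the integrity checks built into the backup tools themselves:
+
+```python
+import hashlib
+from pathlib import Path
+
+
+def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
+    """Hash a file in chunks so large archives do not have to fit in memory."""
+    digest = hashlib.sha256()
+    with path.open("rb") as fh:
+        for chunk in iter(lambda: fh.read(chunk_size), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def backup_is_intact(path: Path, expected_sha256: str) -> bool:
+    """Compare the file's current digest with the one recorded at backup time."""
+    return sha256_of(path) == expected_sha256.strip().lower()
+```
+
+Reading in fixed-size chunks keeps memory use flat regardless of archive size, which matters on low-memory hosts.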
+
+## 📋 Recovery Checklists
+
+### Complete Infrastructure Restore
+□ Power cycle failed hardware  
+□ Reinstall operating system (DSM for Synology)  
+□ Configure basic network settings  
+□ Initialize storage volumes  
+□ Install Docker and Portainer  
+□ Clone Git repository to local directory  
+□ Deploy stacks from Git (Portainer GitOps)  
+□ Restore service-specific data from backups  
+□ Test all services through Tailscale  
+□ Verify external access through Cloudflare  
+
+### Critical Service Restore  
+□ Confirm service is down
+□ Validate backup availability for service  
+□ Initiate restore process 
+□ Monitor progress
+□ Reapply service configuration
+□ Test functionality
+□ Update monitoring
+
+## 🔄 Failover Procedures
+
+### Host-Level Failover
+1. Identify primary host failure
+2. Deploy stack to alternative host
+3. Validate access via Tailscale  
+4. Update DNS if needed (Cloudflare)
+5. Confirm service availability from external access
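+
+Steps 3 and 5 can be partly scripted as a TCP reachability probe. A minimal sketch, assuming the service listens on a known TCP port; `is_reachable` is a hypothetical helper, and an open port only shows that something is listening, not that the application behind it is healthy:
+
+```python
+import socket
+
+
+def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
+    """Return True if a TCP connection to host:port succeeds within timeout."""
+    try:
+        with socket.create_connection((host, port), timeout=timeout):
+            return True
+    except OSError:
+        # Covers refused connections, timeouts, and name resolution failures.
+        return False
+```
+
+For example, `is_reachable("gitea.example.internal", 443)` (a hypothetical hostname) returns `True` only if the connection is accepted within the timeout. Run the probe once over Tailscale and once from an external network to cover both steps.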
+
+### Network-Level Failover
+1. Switch traffic routing via Cloudflare
+2. Update DNS records for affected services  
+3. Test connectivity from multiple sources
+4. Monitor service health in Uptime Kuma
+5. Document routing changes
+
+## ⚠️ Known Limitations
+
+### Data Without Full Backup Coverage
+- **Jellyfish (RPi 5)**: Photos-only backup, no cloud sync
+- **Homelab VM**: Monitoring databases are not backed up; they can be rebuilt
+- **Concord NUC**: Small config files are not backed up; they can be regenerated
+
+### Recovery Dependencies
+- Some services require Tailscale access for proper operation
+- External DNS resolution depends on Cloudflare being operational  
+- Backup restoration assumes sufficient disk space is available
+
+## 📚 Related Documentation
+
+- [Backup Strategy](../infrastructure/backup-strategy.md)
+- [Security Model](../infrastructure/security.md) 
+- [Monitoring Stack](../infrastructure/monitoring/README.md)
+- [Troubleshooting Guide](../troubleshooting/comprehensive-troubleshooting.md)
+
+---
+*Last updated: 2026*