# 🔒 Disaster Recovery Procedures
This document outlines disaster recovery procedures for the homelab infrastructure. Follow them when dealing with catastrophic failures or data loss events.

## 🎯 Recovery Objectives

### Recovery Time Objective (RTO)

- **Critical Services**: 30 minutes
- **Standard Services**: 2 hours
- **Non-Critical**: 1 day

### Recovery Point Objective (RPO)

- **Critical Data**: 1 hour
- **Standard Data**: 24 hours
- **Non-Critical**: 7 days
## 🧰 Recovery Resources

### Backup Locations

1. **Local NAS Copies**: Hyper Backup to Calypso
2. **Cloud Storage**: Backblaze B2 (primary)
3. **Offsite Replication**: Syncthing to Setillo
4. **Docker Configs**: Git repository with Syncthing sync

### Emergency Access

- Tailscale VPN access (primary)
- Physical console access to hosts
- SSH keys stored in Vaultwarden
- Emergency USB drives with recovery tools
## 🚨 Incident Response Workflow

### 1. Initial Assessment

```
1. Confirm nature of incident
2. Determine scope and impact
3. Notify team members
4. Document incident time and details
5. Activate appropriate recovery procedures
```
### 2. Service Restoration Priority

```
Critical (1-2 hours):
├── Authentik SSO
├── Gitea Git hosting
├── Vaultwarden password manager
└── Nginx Proxy Manager

Standard (6-24 hours):
├── Docker configurations
├── Database services
├── Media servers
└── Monitoring stack

Non-Critical (1 week):
├── Development instances
└── Test environments
```
### 3. Recovery Steps

#### Docker Stack Recovery

1. Navigate to the corresponding Git repository
2. Verify stack compose file integrity
3. Deploy using GitOps in Portainer
4. Restore any required data from backups
5. Validate container status and service access
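The steps above can be sketched as a shell helper, assuming a local Git checkout of each stack and the `docker compose` plugin; the path and stack name are placeholders, not the real repository layout:

```shell
#!/bin/sh
# Sketch of the Docker stack recovery steps above. The repo path and
# stack name are hypothetical, not the real layout.
set -eu

recover_stack() {
    stack_dir=$1                              # e.g. ~/stacks/vaultwarden (assumed)
    compose="$stack_dir/docker-compose.yml"

    git -C "$stack_dir" pull --ff-only        # 1. fetch latest committed config
    docker compose -f "$compose" config -q    # 2. verify the compose file parses
    docker compose -f "$compose" up -d        # 3. deploy (manual fallback to GitOps)
    # 4. data restores are covered under "Data Restoration" below
    docker compose -f "$compose" ps           # 5. validate container status
}

# Usage (not run here): recover_stack "$HOME/stacks/vaultwarden"
```

Portainer GitOps normally performs the deploy; the manual form is useful when Portainer itself is down.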
#### Data Restoration

1. Identify backup source (Backblaze B2, NAS)
2. Confirm available restore points
3. Select appropriate backup version
4. Execute restoration process
5. Verify data integrity
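Steps 2-3 can be sketched in shell. The snapshot-per-day directory naming (`YYYY-MM-DD`) is an assumption about the backup layout, and the throwaway directory stands in for the real backup share:

```shell
#!/bin/sh
# Enumerate restore points and pick the newest (steps 2-3 above).
# Daily YYYY-MM-DD snapshot naming is an assumed layout.
set -eu

latest_restore_point() {
    # Lexicographic sort is chronological for YYYY-MM-DD names.
    ls "$1" | sort | tail -n 1
}

# Demo against a throwaway directory standing in for the backup share.
demo=$(mktemp -d)
mkdir "$demo/2025-01-02" "$demo/2025-01-10" "$demo/2025-01-07"
latest_restore_point "$demo"    # prints 2025-01-10
rm -rf "$demo"
```

The same selection logic applies whether the restore points live on the NAS or in a B2 bucket listing.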
## 📦 Service-Specific Recovery

### Authentik SSO Recovery

- Source: Calypso B2 daily backups
- Restoration time: <30 minutes
- Key files: PostgreSQL database and config files
- Required permissions for restore access

### Gitea Git Hosting

- Source: Calypso B2 daily backups
- Restoration time: <30 minutes
- Key files: MariaDB database, repository data
- Ensure service accounts are recreated post-restore
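For the Authentik database, a restore might look like the following sketch; the container names (`authentik-server`, `authentik-postgres`), database name, and dump path are assumptions, not the deployment's real names:

```shell
#!/bin/sh
# Hypothetical Authentik PostgreSQL restore sketch. Container names,
# database user, and dump filename are assumptions.
set -eu

restore_authentik_db() {
    dump=$1    # e.g. a pg_dump .sql.gz pulled back from B2

    docker stop authentik-server                   # stop writers first
    gunzip -c "$dump" | docker exec -i authentik-postgres \
        psql -U authentik -d authentik             # load the dump
    docker start authentik-server
}

# Usage (not run here): restore_authentik_db ./authentik-2025-01-10.sql.gz
```

The Gitea MariaDB restore follows the same stop-load-start shape with `mysql` in place of `psql`.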
### Backup Systems

- Local Hyper Backup: Calypso /volume1/backups/
- Cloud B2 buckets: vk-atlantis, vk-concord-1, vk-setillo, vk-guava
- Critical services: Atlantis NAS, Calypso NAS, Setillo NAS, Guava TrueNAS
- Restore method: manual, using existing Hyper Backup tasks or restoring from other sources

### Media Services

- Plex: Local storage + metadata backed up
- Jellyfin: Local storage with metadata recovery
- Immich: Photo DB plus media backup
- Recovery time: <1 hour for basic access
## 🎯 Recovery Testing

### Quarterly Tests

1. Simulate hardware failures
2. Conduct full data restores
3. Verify service availability post-restore
4. Document test results and improvements

### Automation Testing

- Scripted recovery workflows
- Docker compose file validation
- Backup integrity checks
- Restoration time measurements
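A minimal form of the backup integrity check is a stored checksum manifest re-verified at test time. This demo runs against throwaway files; the real check would walk the backup share:

```shell
#!/bin/sh
# Backup integrity check via a sha256 manifest, demonstrated on
# throwaway files standing in for real archives.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

printf 'backup payload' > app-data.tar    # stand-in for a real archive
sha256sum app-data.tar > SHA256SUMS       # manifest written at backup time

# During a quarterly test (or a restore), re-verify the archive:
sha256sum -c SHA256SUMS                   # prints "app-data.tar: OK"

cd - >/dev/null
rm -rf "$workdir"
```

A non-zero exit from `sha256sum -c` is the signal to pull the file from another backup tier instead.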
## 📋 Recovery Checklists

### Complete Infrastructure Restore

- [ ] Power cycle failed hardware
- [ ] Reinstall operating system (DSM for Synology)
- [ ] Configure basic network settings
- [ ] Initialize storage volumes
- [ ] Install Docker and Portainer
- [ ] Clone Git repository to local directory
- [ ] Deploy stacks from Git (Portainer GitOps)
- [ ] Restore service-specific data from backups
- [ ] Test all services through Tailscale
- [ ] Verify external access through Cloudflare
### Critical Service Restore

- [ ] Confirm service is down
- [ ] Validate backup availability for service
- [ ] Initiate restore process
- [ ] Monitor progress
- [ ] Resume service configuration
- [ ] Test functionality
- [ ] Update monitoring
## 🔄 Failover Procedures

### Host-Level Failover

1. Identify primary host failure
2. Deploy stack to alternative host
3. Validate access via Tailscale
4. Update DNS if needed (Cloudflare)
5. Confirm service availability externally
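Step 2 can be sketched with the Docker CLI's SSH transport, assuming SSH access to the alternative host and the `docker compose` plugin; the host name and stack path are placeholders:

```shell
#!/bin/sh
# Deploy a stack to an alternative host over SSH (step 2 above).
# The host name "concord" and stack path are hypothetical.
set -eu

failover_stack() {
    alt_host=$1     # e.g. concord (assumed Tailscale hostname)
    stack_dir=$2    # local checkout containing the compose file

    # DOCKER_HOST=ssh://... points the local CLI at the remote daemon,
    # so the same compose file deploys without copying it over first.
    DOCKER_HOST="ssh://$alt_host" \
        docker compose -f "$stack_dir/docker-compose.yml" up -d
}

# Usage (not run here): failover_stack concord "$HOME/stacks/vaultwarden"
```

Bind-mounted data still has to be restored onto the alternative host separately; only the stack definition travels this way.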
### Network-Level Failover

1. Switch traffic routing via Cloudflare
2. Update DNS records for affected services
3. Test connectivity from multiple sources
4. Monitor service health in Uptime Kuma
5. Document routing changes
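Steps 2-3 can be verified with standard tools; the hostname below is a placeholder, not a real service URL:

```shell
#!/bin/sh
# Verify a DNS update took effect and the service answers (steps 2-3).
# The hostname passed in is hypothetical.
set -eu

check_service() {
    host=$1

    dig +short "$host"                                  # confirm the record resolves
    curl -fsS -o /dev/null --max-time 10 "https://$host" \
        && echo "$host reachable"
}

# Usage (not run here): check_service vault.example.net
```

Running the same check from a host outside Tailscale catches cases where the service is only reachable over the tailnet.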
## ⚠️ Known Limitations

### Unbacked Data

- **Jellyfish (RPi 5)**: Photos-only backup, no cloud sync
- **Homelab VM**: Monitoring databases are stateless and rebuildable
- **Concord NUC**: Small config files that can be regenerated

### Recovery Dependencies

- Some services require Tailscale access for proper operation
- External DNS resolution depends on Cloudflare being operational
- Backup restoration assumes sufficient disk space is available
## 📚 Related Documentation

- [Backup Strategy](../infrastructure/backup-strategy.md)
- [Security Model](../infrastructure/security.md)
- [Monitoring Stack](../infrastructure/monitoring/README.md)
- [Troubleshooting Guide](../troubleshooting/comprehensive-troubleshooting.md)

---

*Last updated: 2026*