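# Homelab Disaster Recovery Guide

## 🚨 Avoiding the Chicken-and-Egg Problem

Some recovery paths depend on the very services being recovered. GitOps deployments pull from Gitea, and external access goes through Nginx Proxy Manager (NPM). This guide lists ways to restore homelab services even when parts of that infrastructure are down.
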
## 🎯 Recovery Priority Order

### Phase 1: Core Infrastructure (No Dependencies)
1. **Router/Network** - Physical access required
2. **Calypso Server** - Direct console/SSH access
3. **Basic Docker** - Local container management

### Phase 2: Essential Services (Minimal Dependencies)
1. **Nginx Proxy Manager** - Enables external access
2. **Gitea** - Code repository access
3. **DNS/DHCP** - Network services

### Phase 3: Application Services (Depends on Phase 1+2)
1. **Reactive Resume v5** - Depends on NPM for external access
2. **Other applications** - Can be restored after core services

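Before starting the next phase, it helps to confirm the previous one is actually listening. The following is a minimal sketch of a readiness gate. It assumes bash (for its `/dev/tcp` redirection) and coreutils `timeout` on the machine running it. `wait_for_port` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait until HOST:PORT accepts TCP connections.
# Usage: wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 once the port is open, 1 if TIMEOUT_SECONDS elapse first.
wait_for_port() {
  local host="$1" port="$2" timeout_s="${3:-60}"
  local deadline=$((SECONDS + timeout_s))
  while (( SECONDS < deadline )); do
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
    # `timeout 2` keeps one attempt from hanging on a filtered port.
    if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

For example, `wait_for_port 192.168.0.250 81 120 && wait_for_port 192.168.0.250 3000 120` holds Phase 3 until the NPM admin port and Gitea (the ports given under *If NPM is Down*) both answer.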
## 🔧 Emergency Access Methods

### If Gitea is Down
```bash
# From your workstation: reach Calypso by IP (bypasses DNS)
ssh Vish@192.168.0.250 -p 62000

# On Calypso: clone the repository from the local backup
git clone /volume1/backups/homelab-repo-backup.git

# From your workstation: copy a compose file for manual deployment
# ("service" is a placeholder for the target service's directory)
scp -P 62000 docker-compose.yml Vish@192.168.0.250:/volume1/docker/service/
```

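If you are not sure whether Gitea is back, a small wrapper can try the live remote first and then fall back to the bare backup repository. The following is a minimal sketch. `remote_reachable`, `pick_repo_source` and `clone_homelab` are hypothetical names. You supply the Gitea clone URL because this guide does not record it. The fallback path is the backup used above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: clone from the live remote if it answers,
# otherwise from the local bare backup on Calypso.
BACKUP_REPO="${BACKUP_REPO:-/volume1/backups/homelab-repo-backup.git}"

# Succeeds if `git ls-remote` can list refs from the given remote.
# GIT_TERMINAL_PROMPT=0 makes git fail instead of prompting for credentials.
remote_reachable() {
  GIT_TERMINAL_PROMPT=0 timeout 15 git ls-remote "$1" >/dev/null 2>&1
}

# Print the source to clone from: the remote if reachable, else the backup.
pick_repo_source() {
  local remote="$1"
  if [[ -n "$remote" ]] && remote_reachable "$remote"; then
    printf '%s\n' "$remote"
  else
    printf '%s\n' "$BACKUP_REPO"
  fi
}

# Usage: clone_homelab REMOTE_URL [DEST_DIR]
clone_homelab() {
  local src
  src="$(pick_repo_source "$1")"
  echo "Cloning from: $src" >&2
  git clone "$src" "${2:-homelab}"
}
```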
### If NPM is Down
```bash
# Reach services directly via IP:PORT (open these in a browser)
#   http://192.168.0.250:9751   Reactive Resume
#   http://192.168.0.250:3000   Gitea
#   http://192.168.0.250:81     NPM admin UI (when NPM is running)

# Emergency NPM deployment (no GitOps)
ssh Vish@192.168.0.250 -p 62000
# Then, on Calypso:
sudo /usr/local/bin/docker run -d \
  --name nginx-proxy-manager-emergency \
  -p 8880:80 -p 8443:443 -p 81:81 \
  -v /volume1/docker/nginx-proxy-manager/data:/data \
  -v /volume1/docker/nginx-proxy-manager/letsencrypt:/etc/letsencrypt \
  jc21/nginx-proxy-manager:latest
```

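With `-v`, Docker creates a missing host path as an empty directory. A typo or an unmounted `/volume1` would therefore start NPM with an empty configuration, without your proxy hosts and certificates. The following is a minimal pre-flight check to run before the `docker run` above. `npm_preflight` is a hypothetical helper. It only checks that both bind-mount sources exist and are non-empty.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check for the emergency NPM container.
# Fails if either bind-mount source is missing or empty, because
# `docker run -v` would otherwise create an empty directory silently.
npm_preflight() {
  local base="${1:-/volume1/docker/nginx-proxy-manager}"
  local dir rc=0
  for dir in "$base/data" "$base/letsencrypt"; do
    if [[ ! -d "$dir" ]]; then
      echo "MISSING: $dir" >&2; rc=1
    elif [[ -z "$(ls -A "$dir")" ]]; then
      echo "EMPTY:   $dir" >&2; rc=1
    else
      echo "OK:      $dir" >&2
    fi
  done
  return "$rc"
}
```

For example: `npm_preflight && sudo /usr/local/bin/docker run -d ...` with the same arguments as above.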
### If DNS is Down
```bash
# Use IP addresses directly
#   192.168.0.250   Calypso
#   192.168.0.1     Router
#   8.8.8.8         Google public DNS (fallback resolver)

# Add local names to the hosts file (requires root)
echo "192.168.0.250 calypso.local git.local" | sudo tee -a /etc/hosts
```

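Appending the same line a second time leaves a duplicate entry. The following is a minimal idempotent variant. `add_hosts_entry` is a hypothetical helper that appends only the names not already mapped on a non-comment line. It takes the hosts file as a parameter so you can try it on a scratch file first. For real use, pass `/etc/hosts` and run it as root.

```bash
#!/usr/bin/env bash
# Hypothetical helper: add "IP NAME..." to a hosts file, skipping names
# that already appear as a hostname on a non-comment line.
# Usage: add_hosts_entry HOSTS_FILE IP NAME [NAME...]
add_hosts_entry() {
  local file="$1" ip="$2"; shift 2
  local name missing=()
  for name in "$@"; do
    # Fields 2..NF are hostnames; lines starting with "#" are ignored.
    if ! awk -v n="$name" '!/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }' "$file" 2>/dev/null; then
      missing+=("$name")
    fi
  done
  if (( ${#missing[@]} > 0 )); then
    printf '%s %s\n' "$ip" "${missing[*]}" >> "$file"
  fi
}
```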
## 📦 Offline Deployment Packages

### Create Emergency Deployment Kit
```bash
# Create offline deployment package
mkdir -p /volume1/backups/emergency-kit
cd /home/homelab/organized/repos/homelab

# Note: paths inside each archive keep the Calypso/ prefix

# Package NPM deployment
tar -czf /volume1/backups/emergency-kit/npm-deployment.tar.gz \
  Calypso/nginx_proxy_manager/

# Package Reactive Resume deployment
tar -czf /volume1/backups/emergency-kit/reactive-resume-deployment.tar.gz \
  Calypso/reactive_resume_v5/

# Package essential configs
tar -czf /volume1/backups/emergency-kit/essential-configs.tar.gz \
  Calypso/*.yaml Calypso/*.yml
```

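A kit that fails to extract during an outage does not help. The following is a minimal sketch that confirms each archive can be listed and then records SHA-256 checksums next to the archives. It assumes GNU coreutils `sha256sum`. `seal_kit` and `verify_kit` are hypothetical helper names, and `SHA256SUMS` is an assumed file name.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the emergency kit directory.

# Confirm every archive is readable, then write SHA256SUMS.
seal_kit() {
  local kit="${1:-/volume1/backups/emergency-kit}" f
  (
    cd "$kit" || exit 1
    for f in *.tar.gz; do
      # tar -t lists contents; a corrupt gzip stream makes it fail.
      tar -tzf "$f" >/dev/null || { echo "BROKEN: $f" >&2; exit 1; }
    done
    sha256sum -- *.tar.gz > SHA256SUMS
  )
}

# Re-check the archives against SHA256SUMS before using the kit.
verify_kit() {
  local kit="${1:-/volume1/backups/emergency-kit}"
  ( cd "$kit" && sha256sum --quiet -c SHA256SUMS )
}
```

Run `seal_kit` right after building the kit and `verify_kit` before extracting anything from it.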
### Use Emergency Kit
```bash
# Extract and deploy without Git
ssh Vish@192.168.0.250 -p 62000
cd /volume1/backups/emergency-kit

# Deploy NPM first (archives extract under Calypso/)
tar -xzf npm-deployment.tar.gz
cd Calypso/nginx_proxy_manager
chmod +x deploy.sh
./deploy.sh deploy

# Deploy Reactive Resume
cd /volume1/backups/emergency-kit
tar -xzf reactive-resume-deployment.tar.gz
cd Calypso/reactive_resume_v5
chmod +x deploy.sh
./deploy.sh deploy
```

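The archives keep their `Calypso/` prefix, so each service extracts to `Calypso/<service>/` inside the kit directory. Each deploy should therefore start from a known path. The following is a minimal sketch that extracts each archive into the kit directory and runs its `deploy.sh` in a subshell, stopping at the first failure. `deploy_from_kit` and `deploy_kit` are hypothetical helpers. The archive and directory names come from the kit built above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract one kit archive and run its deploy script.
# Usage: deploy_from_kit KIT_DIR ARCHIVE SERVICE_DIR
deploy_from_kit() {
  local kit="$1" archive="$2" service_dir="$3"
  tar -xzf "$kit/$archive" -C "$kit" || return 1
  # Subshell: the caller's working directory stays unchanged.
  ( cd "$kit/$service_dir" && chmod +x deploy.sh && ./deploy.sh deploy )
}

# NPM before Reactive Resume, per the recovery priority order.
deploy_kit() {
  local kit="${1:-/volume1/backups/emergency-kit}"
  deploy_from_kit "$kit" npm-deployment.tar.gz Calypso/nginx_proxy_manager &&
    deploy_from_kit "$kit" reactive-resume-deployment.tar.gz Calypso/reactive_resume_v5
}
```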
## 🔄 Service Dependencies Map

```
             Internet Access
                    ↓
            Router (Physical)
                    ↓
Calypso Server (SSH: 192.168.0.250:62000)
                    ↓
          Docker Engine (Local)
                    ↓
┌───────────────────┬───────────────────┐
│ NPM (Port 81)     │ Gitea (Port 3000) │  ← Independent services
└───────────────────┴───────────────────┘
          ↓                   ↓
   External Access     Code Repository
          ↓                   ↓
          └─────────┬─────────┘
                    ↓
           Reactive Resume v5  ← GitOps Deployment
```

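You can also write the map as "prerequisite dependent" pairs and turn it into a start order with `tsort`, a standard POSIX/coreutils tool that prints a topological ordering of such pairs. The following is a minimal sketch. The short names are illustrative labels, not container names, and the pairs cover the map from the router down.

```bash
#!/usr/bin/env bash
# Each line is "PREREQUISITE DEPENDENT", transcribed from the map above.
dependency_pairs() {
  cat <<'EOF'
router calypso
calypso docker
docker npm
docker gitea
npm reactive-resume
gitea reactive-resume
EOF
}

# Print one valid start order (tsort reports an error if the pairs form a cycle).
start_order() {
  dependency_pairs | tsort
}
```

NPM and Gitea may come out in either order, since neither depends on the other.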
## 🚀 Bootstrap Procedures

### Complete Infrastructure Loss
1. **Physical Access**: Console to Calypso
2. **Network Setup**: Configure a static IP if DHCP is down
3. **Docker Start**: `sudo systemctl start docker`
4. **Manual NPM**: Deploy the NPM container directly (see *If NPM is Down*)
5. **Git Access**: Clone from backup or an external source
6. **GitOps Resume**: Use the deployment scripts

### Partial Service Loss
```bash
# If only applications are down (NPM working)
cd /home/homelab/organized/repos/homelab/Calypso/reactive_resume_v5
./deploy.sh deploy

# If NPM is down (applications working)
cd /home/homelab/organized/repos/homelab/Calypso/nginx_proxy_manager
./deploy.sh deploy

# If Git is down (use local backup); create the target directory first
mkdir -p /tmp/homelab-recovery
cp -r /volume1/backups/homelab-latest/* /tmp/homelab-recovery/
cd /tmp/homelab-recovery/Calypso/reactive_resume_v5
./deploy.sh deploy
```

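You can decide which of these branches applies by probing the service ports given under *If NPM is Down*: 81 for NPM, 3000 for Gitea and 9751 for Reactive Resume, all on 192.168.0.250. The following is a minimal sketch. `port_open` and `recovery_plan` are hypothetical helpers, and the plan is only printed, not executed.

```bash
#!/usr/bin/env bash
# Hypothetical triage helpers; they print steps and deploy nothing.
CALYPSO_HOST="${CALYPSO_HOST:-192.168.0.250}"

# True if HOST:PORT accepts a TCP connection within 3 seconds (bash /dev/tcp).
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

recovery_plan() {
  local repo=/home/homelab/organized/repos/homelab
  if ! port_open "$CALYPSO_HOST" 3000; then
    echo "# Gitea down: copy /volume1/backups/homelab-latest to /tmp/homelab-recovery first"
    repo=/tmp/homelab-recovery
  fi
  if ! port_open "$CALYPSO_HOST" 81; then
    echo "cd $repo/Calypso/nginx_proxy_manager && ./deploy.sh deploy"
  fi
  if ! port_open "$CALYPSO_HOST" 9751; then
    echo "cd $repo/Calypso/reactive_resume_v5 && ./deploy.sh deploy"
  fi
}
```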
## 📋 Recovery Checklists

### NPM Recovery Checklist
- [ ] Calypso server accessible via SSH
- [ ] Docker service running
- [ ] Port 81 available for admin UI
- [ ] Ports 8880/8443 available for proxy
- [ ] Data directory exists: `/volume1/docker/nginx-proxy-manager/data`
- [ ] SSL certificates preserved: `/volume1/docker/nginx-proxy-manager/letsencrypt`
- [ ] Router port forwarding: 80→8880, 443→8443

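Most of this checklist can be run from a shell on Calypso. The following is a minimal sketch. `check`, `port_free`, `ports_free` and `npm_checklist` are hypothetical helpers that print each item in the same `[x]`/`[ ]` form. A port counts as available when nothing accepts connections on it locally. The SSH and router port-forwarding items still need a manual look.

```bash
#!/usr/bin/env bash
# Hypothetical checklist runner: `check LABEL COMMAND...` prints
# "- [x] LABEL" if COMMAND succeeds, "- [ ] LABEL" otherwise.
FAILED=0
check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "- [x] $label"
  else
    echo "- [ ] $label"; FAILED=$((FAILED + 1))
  fi
}

# True if nothing is listening on 127.0.0.1:PORT.
port_free() { ! timeout 2 bash -c "exec 3<>/dev/tcp/127.0.0.1/$1" 2>/dev/null; }
ports_free() { local p; for p in "$@"; do port_free "$p" || return 1; done; }

npm_checklist() {
  local base=/volume1/docker/nginx-proxy-manager
  FAILED=0
  check "Docker service running" sudo /usr/local/bin/docker info
  check "Port 81 available for admin UI" port_free 81
  check "Ports 8880/8443 available for proxy" ports_free 8880 8443
  check "Data directory exists: $base/data" test -d "$base/data"
  check "SSL certificates preserved: $base/letsencrypt" test -d "$base/letsencrypt"
  return "$FAILED"
}
```

Running `npm_checklist` prints the automatable items. Its exit status is the number of unchecked items.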
### Reactive Resume Recovery Checklist
- [ ] NPM deployed and healthy
- [ ] Database directory exists: `/volume1/docker/rxv5/db`
- [ ] Storage directory exists: `/volume1/docker/rxv5/seaweedfs`
- [ ] Ollama directory exists: `/volume1/docker/rxv5/ollama`
- [ ] SMTP credentials available
- [ ] External domain resolving: `nslookup rx.vish.gg`
- [ ] NPM proxy hosts configured

## 🔐 Emergency Credentials

### Default Service Credentials
```bash
# NPM Default (change immediately)
Email: admin@example.com
Password: "REDACTED_PASSWORD"

# Database Credentials (from compose)
User: resumeuser
Password: "REDACTED_PASSWORD"
Database: resume

# SMTP (from environment)
User: your-email@example.com
Password: "REDACTED_PASSWORD"  # Stored in compose file
```

### SSH Access
```bash
# Primary access
ssh Vish@192.168.0.250 -p 62000

# If SSH key authentication fails, force password authentication
# (the server's sshd must allow it: PasswordAuthentication yes)
ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password \
  Vish@192.168.0.250 -p 62000
```

## 📞 Emergency Contacts & Resources

### External Resources (No Local Dependencies)
- **Docker Hub**: https://hub.docker.com/
- **Ollama Models**: https://ollama.ai/library
- **GitHub Backup**: https://github.com/yourusername/homelab-backup
- **Documentation**: This file (print/save offline)

### Recovery Commands Reference
```bash
# Check what's running
sudo /usr/local/bin/docker ps -a

# Emergency container cleanup
# WARNING: removes all stopped containers, unused networks, all unused
# images and build cache (named volumes and /volume1 bind mounts are kept)
sudo /usr/local/bin/docker system prune -af

# Network troubleshooting
ping -c 4 8.8.8.8
nslookup rx.vish.gg
curl -I http://192.168.0.250:81

# Service health checks
curl http://192.168.0.250:9751/health
curl http://192.168.0.250:11434/api/tags   # Ollama: lists installed models
```

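You can wrap the HTTP checks above into one pass that prints a single UP/DOWN line per endpoint, which is easier to scan under pressure. The following is a minimal sketch using `curl` with a short timeout. `probe` and `triage` are hypothetical helpers. The endpoints are the ones listed above. The Reactive Resume `/health` path is taken from this guide and has not been checked against the application.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper: print UP or DOWN for one HTTP endpoint.
# curl -f treats HTTP status >= 400 as failure; -m caps the total time.
probe() {
  local name="$1" url="$2"
  if curl -fsS -m 5 -o /dev/null "$url" 2>/dev/null; then
    printf '%-16s UP    %s\n' "$name" "$url"
  else
    printf '%-16s DOWN  %s\n' "$name" "$url"
  fi
}

triage() {
  probe "NPM admin"       http://192.168.0.250:81
  probe "Reactive Resume" http://192.168.0.250:9751/health
  probe "Ollama"          http://192.168.0.250:11434/api/tags
}
```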
## 🎯 Prevention Strategies

### Regular Backups
```bash
# Weekly automated backup (crontab entry: Sundays at 02:00)
0 2 * * 0 /usr/local/bin/backup-homelab.sh

# Backup script creates:
# - Git repository backup
# - Docker volume backups
# - Configuration exports
# - Emergency deployment kits
```

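This guide does not include `backup-homelab.sh` itself. As an illustration only, the following is a minimal sketch of two pieces such a script might contain: a dated archive of a directory, and pruning down to the newest N archives. `make_dated_archive`, `prune_backups` and the `NAME-YYYY-MM-DD.tar.gz` naming are assumptions, not the existing script's behavior.

```bash
#!/usr/bin/env bash
# Hypothetical backup helpers (not the contents of backup-homelab.sh).

# Archive SRC_DIR into DEST_DIR/NAME-YYYY-MM-DD.tar.gz and print the path.
make_dated_archive() {
  local src="$1" dest="$2" name="$3" out
  out="$dest/$name-$(date +%F).tar.gz"
  mkdir -p "$dest" || return 1
  tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  printf '%s\n' "$out"
}

# Keep the KEEP newest NAME-YYYY-MM-DD.tar.gz archives in DIR.
# ISO dates sort lexically, so reverse name order is newest first.
# Assumes NAME contains no regex metacharacters.
prune_backups() {
  local dir="$1" name="$2" keep="$3" f
  ls -1 "$dir" | grep -E "^${name}-[0-9]{4}-[0-9]{2}-[0-9]{2}\.tar\.gz$" | sort -r |
    tail -n +"$((keep + 1))" | while IFS= read -r f; do
      rm -f -- "$dir/$f"
    done
}
```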
### Health Monitoring
```bash
# Daily health checks (crontab entry: every day at 08:00)
0 8 * * * /usr/local/bin/health-check.sh

# Alerts on:
# - Service failures
# - Disk space issues
# - Network connectivity problems
# - SSL certificate expiration
```

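This guide does not show `health-check.sh` either. The following is a minimal sketch of two of the listed checks, disk space and certificate expiry. It assumes POSIX `df -P` output and OpenSSL's `x509 -checkend` option. Several details are assumptions: the 90% threshold, the 14-day window, the hypothetical function names, and the certificate layout (certbot's usual `live/<name>/fullchain.pem` under NPM's `letsencrypt` directory). Alerts only go to stderr; connect them to whatever notifier you use.

```bash
#!/usr/bin/env bash
# Hypothetical health-check helpers (not the contents of health-check.sh).

# Print the used-space percentage of the filesystem holding PATH.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Alert if PATH's filesystem is at or above THRESHOLD percent (default 90).
check_disk() {
  local path="$1" threshold="${2:-90}" used
  used="$(disk_used_pct "$path")" || return 1
  if (( used >= threshold )); then
    echo "ALERT: $path is ${used}% full" >&2
    return 1
  fi
}

# Alert for each certificate that expires within DAYS (default 14).
check_certs() {
  local dir="${1:-/volume1/docker/nginx-proxy-manager/letsencrypt/live}"
  local days="${2:-14}" cert rc=0
  for cert in "$dir"/*/fullchain.pem; do
    [[ -e "$cert" ]] || continue   # glob matched nothing
    if ! openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null; then
      echo "ALERT: $cert expires within $days days" >&2
      rc=1
    fi
  done
  return "$rc"
}
```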
### Documentation Maintenance
- Keep this file updated with any infrastructure changes
- Test recovery procedures quarterly
- Maintain offline copies of critical documentation
- Document any custom configurations or passwords

---

**Last Updated**: 2026-02-16
**Tested**: Recovery procedures verified
**Next Review**: 2026-05-16