Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC

# 🚨 Disaster Recovery Guide
**🔴 Advanced Guide**
This guide covers critical disaster recovery scenarios for your homelab, including complete router failure, network reconfiguration, and service restoration procedures.
## 🎯 Disaster Scenarios Covered
1. **🔥 Router Failure** - Complete router replacement and reconfiguration
2. **🌐 Network Reconfiguration** - ISP changes, subnet changes, IP conflicts
3. **🔌 Power Outage Recovery** - Bringing services back online in correct order
4. **💾 Storage Failure** - Data recovery and service restoration
5. **🔐 Password Manager Outage** - Accessing credentials when Vaultwarden is down
---
## 🔥 Router Failure Recovery
### 📋 **Pre-Disaster Preparation (Do This Now!)**
#### 1. **Document Current Network Configuration**
```bash
# Create network documentation file
mkdir -p ~/homelab-recovery
cat > ~/homelab-recovery/network-config.md << 'EOF'
# Network Configuration Backup
## Router Information
- **Model**: [Your Router Model]
- **Firmware**: [Version]
- **Admin URL**: http://192.168.1.1
- **Admin User**: admin
- **Admin Password**: [Document in password manager]
## Network Settings
- **WAN Type**: DHCP / Static / PPPoE
- **ISP Settings**: [Document ISP-specific settings]
- **Subnet**: 192.168.1.0/24
- **DHCP Range**: 192.168.1.100-192.168.1.200
- **DNS Servers**: 1.1.1.1, 8.8.8.8
## Static IP Assignments
EOF
# Append each reservation under the "Static IP Assignments" heading written above
echo "- atlantis.vish.local: 192.168.1.100 (MAC: [Document MAC])" >> ~/homelab-recovery/network-config.md
```
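The template above is filled in by hand, so it can help to store a machine-generated snapshot next to it. The following is a minimal sketch, assuming a Linux host with `ip` from iproute2; the function name `snapshot_network` and the snapshot filename are hypothetical and not part of the original setup.

```bash
#!/bin/bash
# Hypothetical helper: save this host's current network state next to the
# hand-written documentation. Assumes Linux; missing tools are skipped.
snapshot_network() {
    local out_dir="${1:-$HOME/homelab-recovery}"
    local out_file
    out_file="$out_dir/network-snapshot-$(date +%Y%m%d).txt"
    mkdir -p "$out_dir" || return 1
    {
        echo "# Snapshot taken on $(uname -n) at $(date)"
        if command -v ip >/dev/null 2>&1; then
            echo "## Addresses"
            ip -brief addr show
            echo "## Routes"
            ip route show
        fi
        if [ -r /etc/resolv.conf ]; then
            echo "## DNS"
            grep '^nameserver' /etc/resolv.conf
        fi
    } > "$out_file" 2>/dev/null
    # Print the path so callers can reference the snapshot
    echo "$out_file"
}
```

Running `snapshot_network` with no argument writes into `~/homelab-recovery`; pass a directory to write elsewhere.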
#### 2. **Export Router Configuration**
```bash
# Most routers allow config export
# Login to router web interface
# Look for: System → Backup/Restore → Export Configuration
# Save to: ~/homelab-recovery/router-backup-$(date +%Y%m%d).bin
```
#### 3. **Document Port Forwarding Rules**
```bash
cat > ~/homelab-recovery/port-forwarding.md << 'EOF'
# Port Forwarding Rules
## Essential Services
| External Port | Internal IP | Internal Port | Protocol | Service |
|---------------|-------------|---------------|----------|---------|
| 51820 | 192.168.1.100 | 51820 | UDP | WireGuard (Atlantis) |
| 51821 | 192.168.1.102 | 51820 | UDP | WireGuard (Concord) |
| 80 | 192.168.1.100 | 8341 | TCP | HTTP (Nginx Proxy) |
| 443 | 192.168.1.100 | 8766 | TCP | HTTPS (Nginx Proxy) |
## Gaming Services (Optional)
| External Port | Internal IP | Internal Port | Protocol | Service |
|---------------|-------------|---------------|----------|---------|
| 7777 | 192.168.1.103 | 7777 | TCP/UDP | Satisfactory |
| 27015 | 192.168.1.103 | 27015 | TCP/UDP | L4D2 Server |
## Dynamic DNS
- **Service**: [Your DDNS Provider]
- **Hostname**: vishinator.synology.me
- **Update URL**: [Document update mechanism]
EOF
```
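A router can typically forward a given external port and protocol to only one internal host, so it is worth checking the documented rules for collisions before relying on them. The sketch below assumes the Markdown table layout used above (external port in the first column, protocol in the fourth); `find_duplicate_forwards` is a hypothetical name. It compares the protocol column literally, so it will not notice a `TCP` rule overlapping a `TCP/UDP` rule.

```bash
#!/bin/bash
# Hypothetical helper: print external port/protocol pairs that appear more
# than once in a Markdown table laid out like the ones in port-forwarding.md.
find_duplicate_forwards() {
    awk -F'|' '
        # Data rows have a purely numeric external port in the first column
        $2 ~ /^ *[0-9]+ *$/ {
            port = $2;  gsub(/ /, "", port)
            proto = $5; gsub(/ /, "", proto)
            key = port "/" proto
            # Report each colliding pair once, on its second appearance
            if (++seen[key] == 2) print key
        }
    ' "$1"
}
```

For example, `find_duplicate_forwards ~/homelab-recovery/port-forwarding.md` prints nothing when every external port/protocol pair is unique.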
### 🛠️ **Router Replacement Procedure**
#### **Step 1: Physical Setup**
```bash
# 1. Connect new router to modem
# 2. Connect computer directly to router via Ethernet
# 3. Power on router and wait for boot (2-3 minutes)
```
#### **Step 2: Basic Network Configuration**
```bash
# Access router admin interface
# Default is usually: http://192.168.1.1 or http://192.168.0.1
# For TP-Link Archer BE800 v1.6: http://192.168.0.1 or http://tplinkwifi.net
# Default login: admin/admin
# If different subnet, find router IP:
ip route | grep default
# or
arp -a | grep -E "(router|gateway)"
```
**Router Configuration Checklist:**
```bash
# ✅ Set admin password (use password manager)
# ✅ Configure WAN connection (DHCP/Static/PPPoE)
# ✅ Set WiFi SSID and password
# ✅ Configure subnet: 192.168.1.0/24
# ✅ Set DHCP range: 192.168.1.100-192.168.1.200
# ✅ Configure DNS servers: 1.1.1.1, 8.8.8.8
# ✅ Enable UPnP (if needed)
# ✅ Disable WPS (security)
```
**📖 For TP-Link Archer BE800 v1.6 specific instructions, see: [TP-Link Archer BE800 Setup Guide](../infrastructure/tplink-archer-be800-setup.md)**
#### **Step 3: Static IP Assignment**
**Critical Static IPs (Configure First):**
```bash
# In router DHCP reservation settings:
# Primary Infrastructure
atlantis.vish.local → 192.168.1.100 # MAC: [Document MAC]
calypso.vish.local → 192.168.1.101 # MAC: [Document MAC]
concord-nuc.vish.local → 192.168.1.102 # MAC: [Document MAC]
# Virtual Machines
homelab-vm.vish.local → 192.168.1.103 # MAC: [Document MAC]
chicago-vm.vish.local → 192.168.1.104 # MAC: [Document MAC]
bulgaria-vm.vish.local → 192.168.1.105 # MAC: [Document MAC]
# Specialized Hosts
anubis.vish.local → 192.168.1.106 # MAC: [Document MAC]
guava.vish.local → 192.168.1.107 # MAC: [Document MAC]
setillo.vish.local → 192.168.1.108 # MAC: [Document MAC]
# Raspberry Pi Cluster
rpi-vish.vish.local → 192.168.1.109 # MAC: [Document MAC]
rpi-kevin.vish.local → 192.168.1.110 # MAC: [Document MAC]
# Edge Devices
nvidia-shield.vish.local → 192.168.1.111 # MAC: [Document MAC]
```
**Find MAC Addresses:**
```bash
# On each host, run:
ip link show | grep -E "(ether|link)"
# or
cat /sys/class/net/eth0/address
# From router, check DHCP client list
# Or use network scanner:
nmap -sn 192.168.1.0/24
arp -a
```
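Interface names vary between hosts (`eth0`, `enp3s0`, `ovs_eth0`, and so on), so a loop over every interface is often more convenient than naming one. This is a minimal sketch assuming Linux sysfs; `list_macs` is a hypothetical helper, and its optional directory argument exists only to make it easy to test.

```bash
#!/bin/bash
# Hypothetical helper: print "interface MAC" for every non-loopback interface.
# Assumes Linux sysfs under /sys/class/net unless another path is given.
list_macs() {
    local sys_net="${1:-/sys/class/net}"
    local dev name
    for dev in "$sys_net"/*; do
        # Skip entries that do not expose a hardware address
        [ -r "$dev/address" ] || continue
        name="$(basename "$dev")"
        [ "$name" = "lo" ] && continue
        printf '%s %s\n' "$name" "$(cat "$dev/address")"
    done
}
```

Run `list_macs` on each host and paste the output into the `[Document MAC]` placeholders above.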
#### **Step 4: Port Forwarding Configuration**
**Essential Port Forwards (Configure Immediately):**
```bash
# VPN Access (Highest Priority)
External: 51820/UDP → Internal: 192.168.1.100:51820 (Atlantis WireGuard)
External: 51821/UDP → Internal: 192.168.1.102:51820 (Concord WireGuard)
# Web Services (If needed)
External: 80/TCP → Internal: 192.168.1.100:8341 (HTTP)
External: 443/TCP → Internal: 192.168.1.100:8766 (HTTPS)
```
**Gaming Services (If hosting public games):**
```bash
# Satisfactory Server
External: 7777/TCP → Internal: 192.168.1.103:7777
External: 7777/UDP → Internal: 192.168.1.103:7777
# Left 4 Dead 2 Server
External: 27015/TCP → Internal: 192.168.1.103:27015
External: 27015/UDP → Internal: 192.168.1.103:27015
External: 27020/UDP → Internal: 192.168.1.103:27020
External: 27005/UDP → Internal: 192.168.1.103:27005
```
#### **Step 5: Dynamic DNS Configuration**
**Update DDNS Settings:**
```bash
# Method 1: Router Built-in DDNS
# Configure in router: Advanced → Dynamic DNS
# Service: [Your provider]
# Hostname: vishinator.synology.me
# Username: [Your DDNS username]
# Password: [Your DDNS password]
# Method 2: Manual Update (if router doesn't support your provider)
# SSH to a homelab host and run:
curl -u "username:password" \
"https://your-ddns-provider.com/update?hostname=vishinator.synology.me&myip=$(curl -s ifconfig.me)"
```
**Test DDNS:**
```bash
# Wait 5-10 minutes, then test:
nslookup vishinator.synology.me
dig vishinator.synology.me
# Should return your new external IP
curl ifconfig.me # Compare with DDNS result
```
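The manual comparison above can be scripted. The following is a sketch, assuming `dig` and `curl` are installed and that `ifconfig.me` returns the bare IPv4 address when queried by curl; `ips_match` and `check_ddns` are hypothetical names.

```bash
#!/bin/bash
# Hypothetical helpers: compare the address a hostname resolves to with the
# current public address. Assumes dig and curl are available.
ips_match() {
    # Succeeds only when both values are non-empty and identical
    [ -n "$1" ] && [ "$1" = "$2" ]
}

check_ddns() {
    local host="$1" dns_ip public_ip
    # The last line of +short output is the address when a CNAME is involved
    dns_ip="$(dig +short "$host" A | tail -n 1)"
    public_ip="$(curl -4 -s --max-time 10 ifconfig.me)"
    if ips_match "$dns_ip" "$public_ip"; then
        echo "OK: $host resolves to $public_ip"
    else
        echo "MISMATCH: DNS=${dns_ip:-none} public=${public_ip:-none}"
        return 1
    fi
}
```

Usage: `check_ddns vishinator.synology.me`. A mismatch shortly after a router swap may only mean the DNS record has not propagated yet.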
### 🔧 **Service Recovery Order**
**Phase 1: Core Infrastructure (First 30 minutes)**
```bash
# 1. Verify network connectivity
ping 8.8.8.8
ping google.com
# 2. Check all hosts are reachable
ping atlantis.vish.local
ping calypso.vish.local
ping concord-nuc.vish.local
# 3. Verify DNS resolution
nslookup atlantis.vish.local
```
**Phase 2: Essential Services (Next 30 minutes)**
```bash
# 4. Check VPN services
# Test WireGuard from external device
# Verify Tailscale connectivity
# 5. Verify password manager
curl -I https://atlantis.vish.local:8222 # Vaultwarden
# 6. Check monitoring
curl -I https://atlantis.vish.local:3000 # Grafana
curl -I https://atlantis.vish.local:3001 # Uptime Kuma
```
**Phase 3: Media and Applications (Next hour)**
```bash
# 7. Media services
curl -I https://atlantis.vish.local:32400 # Plex
curl -I https://calypso.vish.local:2283 # Immich
# 8. Communication services
curl -I https://homelab-vm.vish.local:8065 # Mattermost
# 9. Development services
curl -I https://atlantis.vish.local:8929 # GitLab
```
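The three phases can be checked in one pass with a small loop. This is a sketch under assumptions: `-k` skips certificate verification on the guess that internal services use self-signed certificates, and a 401 or 403 is counted as "up" because the service answered. The function names are hypothetical.

```bash
#!/bin/bash
# Hypothetical helpers: probe a list of service URLs and summarize the result.
http_status() {
    # Prints the HTTP status code, or 000 if the connection failed
    curl -k -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

status_is_up() {
    # Treat 2xx, 3xx, 401 and 403 as "the service is answering"
    case "$1" in
        2??|3??|401|403) return 0 ;;
        *) return 1 ;;
    esac
}

check_services() {
    local url code failed=0
    for url in "$@"; do
        code="$(http_status "$url")"
        if status_is_up "$code"; then
            echo "UP   $code $url"
        else
            echo "DOWN $code $url"
            failed=1
        fi
    done
    return "$failed"
}
```

Usage: `check_services https://atlantis.vish.local:8222 https://atlantis.vish.local:3000`. Listing URLs in phase order keeps the output aligned with the recovery sequence.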
### 📱 **Mobile Hotspot Emergency Access**
If your internet is down but you need to configure the router:
```bash
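# 1. Connect your computer to the new router (Ethernet or WiFi) so the admin page is reachable
# 2. Enable a mobile hotspot on a phone or other device
# 3. Use a second device on the hotspot for internet access (documentation, password manager)
# 4. Access router via: http://192.168.1.1
# 5. If the router supports it, configure WAN settings to use the mobile hotspot temporarily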
```
---
## 🌐 Network Reconfiguration Scenarios
### **ISP Changes (New Modem/Different Settings)**
#### **Scenario 1: New Cable Modem**
```bash
# 1. Connect new modem to router WAN port
# 2. Power cycle both devices (modem first, then router)
# 3. Check WAN connection in router interface
# 4. Update DDNS if external IP changed
# 5. Test port forwarding from external network
```
#### **Scenario 2: Fiber Installation**
```bash
# 1. Configure router for new connection type
# 2. May need PPPoE credentials from ISP
# 3. Update MTU settings if required (typically 1500; 1492 when the link uses PPPoE)
# 4. Test speed and latency
# 5. Update monitoring dashboards with new metrics
```
#### **Scenario 3: Subnet Change Required**
```bash
# If you need to change from 192.168.1.x to different subnet:
# 1. Plan new IP scheme
# Old: 192.168.1.0/24
# New: 192.168.2.0/24 (example)
# 2. Update router DHCP settings
# 3. Update static IP reservations
# 4. Update all service configurations
# 5. Update Tailscale subnet routes
# 6. Update monitoring configurations
# 7. Update documentation
```
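Step 4 usually means rewriting many hard-coded addresses. The sketch below shows one way to remap a /24 prefix in text while keeping the host part; `remap_subnet` is a hypothetical helper, and it assumes the old and new subnets are both /24 so only the first three octets change. Review the output before overwriting any real configuration file.

```bash
#!/bin/bash
# Hypothetical helper: rewrite addresses from one /24 prefix to another,
# reading text on stdin and writing the result to stdout.
remap_subnet() {
    local old_prefix="$1" new_prefix="$2"
    local old_re
    # Escape dots so they match literally in the sed pattern
    old_re="$(printf '%s' "$old_prefix" | sed 's/\./\\./g')"
    # The leading group prevents matching inside a longer address such as
    # 10.192.168.1.5
    sed -E "s/(^|[^0-9.])${old_re}\.([0-9]{1,3})/\1${new_prefix}.\2/g"
}
```

Usage: `remap_subnet 192.168.1 192.168.2 < docker-compose.yml > docker-compose.new.yml`, then diff the two files.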
### **IP Conflict Resolution**
```bash
# If new router uses different default subnet:
# 1. Identify conflicts
nmap -sn 192.168.0.0/24 # Scan new subnet
nmap -sn 192.168.1.0/24 # Scan old subnet
# 2. Choose resolution strategy:
# Option A: Change router to use 192.168.1.x
# Option B: Reconfigure all devices for new subnet
# 3. Update all static configurations
# 4. Update firewall rules
# 5. Update service discovery
```
---
## 🔌 Power Outage Recovery
### **Startup Sequence (Critical Order)**
```bash
# Phase 1: Infrastructure (0-5 minutes)
# 1. Modem/Internet connection
# 2. Router/Switch
# 3. NAS devices (Atlantis, Calypso) - these take longest to boot
# Phase 2: Core Services (5-10 minutes)
# 4. Primary compute hosts (concord-nuc)
# 5. Virtual machine hosts
# Phase 3: Applications (10-15 minutes)
# 6. Raspberry Pi devices
# 7. Edge devices
# 8. Verify all services are running
```
**Automated Startup Script:**
```bash
#!/bin/bash
# ~/homelab-recovery/startup-sequence.sh
echo "🔌 Starting homelab recovery sequence..."
# Wait for network
echo "⏳ Waiting for network connectivity..."
while ! ping -c 1 8.8.8.8 >/dev/null 2>&1; do
    sleep 5
done
echo "✅ Network is up"
# Check each host
hosts=(
    "atlantis.vish.local"
    "calypso.vish.local"
    "concord-nuc.vish.local"
    "homelab-vm.vish.local"
    "chicago-vm.vish.local"
    "bulgaria-vm.vish.local"
)
for host in "${hosts[@]}"; do
    echo "🔍 Checking $host..."
    if ping -c 1 "$host" >/dev/null 2>&1; then
        echo "$host is responding"
    else
        echo "$host is not responding"
    fi
done
echo "🎯 Recovery sequence complete"
```
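The network wait in the script above loops forever if the uplink never returns. A bounded retry is one alternative; the following is a sketch with a hypothetical `wait_for` helper that runs any command up to a fixed number of times.

```bash
#!/bin/bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" >/dev/null 2>&1 && return 0
        # Skip the pause after the final failed attempt
        [ "$i" -lt "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

For example, `wait_for 60 5 ping -c 1 8.8.8.8 || echo "Network still down after 5 minutes"` gives up after roughly five minutes instead of hanging.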
---
## 💾 Storage Failure Recovery
### **Backup Verification**
```bash
# Before disaster strikes, verify backups exist:
# 1. Docker volume backups
ls -la /volume1/docker/*/
du -sh /volume1/docker/*/
# 2. Configuration backups
find ~/homelab-recovery -name "*.yml" -o -name "*.yaml"
# 3. Database backups
ls -la /volume1/docker/*/backup/
ls -la /volume1/docker/*/db_backup/
```
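Listing a directory shows that backups exist but not whether they are recent. A freshness check can be sketched as follows; `backup_is_fresh` is a hypothetical helper, and the 7-day default is an arbitrary assumption to adjust to your backup schedule.

```bash
#!/bin/bash
# Hypothetical helper: succeed if a directory holds at least one regular file
# modified within the last N days (default 7).
backup_is_fresh() {
    local dir="$1" max_days="${2:-7}"
    [ -d "$dir" ] || return 1
    # -mtime -N matches files modified less than N*24 hours ago
    [ -n "$(find "$dir" -type f -mtime "-$max_days" | head -n 1)" ]
}
```

Usage: `backup_is_fresh /volume1/docker/vaultwarden/backup 2 || echo "Vaultwarden backup is stale"`. The Vaultwarden path here is illustrative only.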
### **Service Restoration Priority**
```bash
# 1. Password Manager (Vaultwarden) - Need passwords for everything else
# 2. DNS/DHCP (Pi-hole) - Network services
# 3. Monitoring (Grafana/Prometheus) - Visibility into recovery
# 4. VPN (WireGuard/Tailscale) - Remote access
# 5. Media services - Lower priority
# 6. Development services - Lowest priority
```
---
## 🔧 Emergency Toolkit
### **Essential Recovery Files**
Create and maintain these files:
```bash
# Create recovery directory
mkdir -p ~/homelab-recovery/{configs,scripts,docs,backups}
# Network configuration
~/homelab-recovery/docs/network-config.md
~/homelab-recovery/docs/port-forwarding.md
~/homelab-recovery/docs/static-ips.md
# Service configurations
~/homelab-recovery/configs/docker-compose-essential.yml
~/homelab-recovery/configs/nginx-proxy-manager.conf
~/homelab-recovery/configs/wireguard-configs/
# Recovery scripts
~/homelab-recovery/scripts/startup-sequence.sh
~/homelab-recovery/scripts/test-connectivity.sh
~/homelab-recovery/scripts/restore-services.sh
# Backup files
~/homelab-recovery/backups/router-config-$(date +%Y%m%d).bin
~/homelab-recovery/backups/vaultwarden-backup.json
~/homelab-recovery/backups/essential-passwords.txt.gpg
```
### **Emergency Contact Information**
```bash
cat > ~/homelab-recovery/docs/emergency-contacts.md << 'EOF'
# Emergency Contacts
## ISP Support
- **Provider**: [Your ISP]
- **Phone**: [Support number]
- **Account**: [Account number]
- **Service Address**: [Your address]
## Hardware Vendors
- **Router**: [Manufacturer support]
- **NAS**: Synology Support
- **Server**: [Hardware vendor]
## Service Providers
- **Domain Registrar**: [Your registrar]
- **DDNS Provider**: [Your DDNS service]
- **Cloud Backup**: [Your backup service]
EOF
```
### **Quick Reference Commands**
```bash
# Network diagnostics
ping 8.8.8.8 # Internet connectivity
nslookup google.com # DNS resolution
ip route # Routing table
arp -a # ARP table
netstat -rn # Network routes
# Service checks
docker ps # Running containers
systemctl status tailscaled # Tailscale status
systemctl status docker # Docker status
# Port checks
nmap -p 22,80,443,51820 localhost
telnet hostname port
nc -zv hostname port
```
---
## 📋 Recovery Checklists
### **🔥 Router Failure Checklist**
```bash
☐ Physical setup (modem → router → computer)
☐ Access router admin interface
☐ Configure basic settings (SSID, password, subnet)
☐ Set static IP reservations for all hosts
☐ Configure port forwarding rules
☐ Update DDNS settings
☐ Test VPN connectivity
☐ Verify all services accessible
☐ Update documentation with any changes
☐ Test from external network
```
### **🌐 Network Change Checklist**
```bash
☐ Document old configuration
☐ Plan new IP scheme
☐ Update router settings
☐ Update static IP reservations
☐ Update service configurations
☐ Update Tailscale subnet routes
☐ Update monitoring dashboards
☐ Update documentation
☐ Test all services
☐ Update backup scripts
```
### **🔌 Power Outage Checklist**
```bash
☐ Wait for stable power (use UPS if available)
☐ Start devices in correct order
☐ Verify network connectivity
☐ Check all hosts are responding
☐ Verify essential services are running
☐ Check for any corrupted data
☐ Update monitoring dashboards
☐ Document any issues encountered
```
---
## 🚨 Emergency Procedures
### **If Everything is Down**
```bash
# 1. Stay calm and work systematically
# 2. Check physical connections first
# 3. Verify power to all devices
# 4. Check internet connectivity with direct connection
# 5. Work through recovery checklists step by step
# 6. Document everything for future reference
```
### **If You're Locked Out**
```bash
# 1. Try default router credentials (often admin/admin)
# 2. Look for reset button on router (hold 10-30 seconds)
# 3. Check router label for default WiFi password
# 4. Use mobile hotspot for internet access during recovery
# 5. Access password manager from mobile device if needed
```
### **If Services Won't Start**
```bash
# 1. Check Docker daemon is running
systemctl status docker
# 2. Check disk space
df -h
# 3. Check for port conflicts
netstat -tulpn | grep :port
# 4. Check container logs
docker logs container-name
# 5. Try starting services individually
docker-compose up service-name
```
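A full disk is a common reason containers fail to start after an outage, so the `df -h` step can be turned into a quick threshold check. This is a sketch; `disk_usage_pct` is a hypothetical helper and the 90% threshold in the usage line is an assumption.

```bash
#!/bin/bash
# Hypothetical helper: print the usage percentage (digits only) of the
# filesystem that holds the given path.
disk_usage_pct() {
    # df -P prints one POSIX-format line per filesystem; column 5 is capacity
    df -P "${1:-/}" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}
```

Usage: `[ "$(disk_usage_pct /volume1)" -ge 90 ] && echo "Disk nearly full"`.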
---
## 📚 Related Documentation
- [Tailscale Setup Guide](../infrastructure/tailscale-setup-guide.md) - Alternative access method
- [Port Forwarding Guide](../infrastructure/port-forwarding-guide.md) - Detailed port configuration
- [Security Model](../infrastructure/security.md) - Security considerations during recovery
- [Offline Password Access](offline-password-access.md) - Accessing passwords when Vaultwarden is down
- [Authentik SSO Rebuild](authentik-sso-rebuild.md) - Complete SSO/OAuth2 disaster recovery
- [Authentik SSO Setup](../infrastructure/authentik-sso.md) - SSO configuration reference
---
**💡 Pro Tip**: Practice these procedures when everything is working! Run through the checklists quarterly to ensure your documentation is current and you're familiar with the process. A disaster is not the time to learn these procedures for the first time.