# Emergency Procedures

This document outlines emergency procedures for critical failures in the homelab infrastructure.

## 🚨 Emergency Contact Information

### Critical Service Access
- **Vaultwarden Emergency**: See [Offline Password Access](offline-password-access.md)
- **Network Emergency**: Router admin at `192.168.0.1` (admin/admin)
- **Power Emergency**: UPS management at `192.168.0.50`

### External Services
- **Cloudflare**: Dashboard access for DNS/tunnel management
- **Tailscale**: Admin console for mesh VPN recovery
- **Domain Registrar**: For DNS changes if Cloudflare fails

## 🔥 Critical Failure Scenarios

### Complete Network Failure

#### Symptoms
- No internet connectivity
- Cannot access local services
- Router/switch unresponsive

#### Immediate Actions
1. **Check Physical Connections**
   ```bash
   # Check cable connections
   # Verify power to router/switches
   # Check UPS status
   ```

2. **Router Recovery**
   ```bash
   # Power cycle router (30-second wait)
   # Access router admin: http://192.168.0.1
   # Check WAN connection status
   # Verify DHCP is enabled
   ```

3. **Switch Recovery**
   ```bash
   # Power cycle managed switches
   # Check link lights on all ports
   # Verify VLAN configuration if applicable
   ```

#### Recovery Steps
1. Restore basic internet connectivity
2. Verify internal network communication
3. Restart critical services in order (see [Service Dependencies](../services/dependencies.md))
4. Test external access through port forwards

### Power Outage Recovery

#### During Outage
- UPS should maintain critical systems for 15-30 minutes
- Graceful shutdown sequence will be triggered automatically
- Monitor UPS status via web interface if accessible

#### After Power Restoration
1. **Wait for Network Stability** (5 minutes)

2. **Start Core Infrastructure**
   ```bash
   # Synology NAS systems (auto-start enabled)
   # Router and switches (auto-start)
   # Internet connection verification
   ```

3. **Start Host Systems in Order**
   - Proxmox hosts
   - Physical machines (Anubis, Guava, Concord NUC)
   - Raspberry Pi devices

4. **Verify Service Health**
   ```bash
   # Check Portainer endpoints
   # Verify monitoring stack
   # Test critical services (Plex, Vaultwarden, etc.)
   ```

### Storage System Failure

#### Synology NAS Failure
```bash
# Check RAID status
cat /proc/mdstat

# Check disk health
smartctl -a /dev/sda

# Emergency data recovery
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups
```

#### Critical Data Recovery Priority
1. **Vaultwarden database** - Password access
2. **Configuration files** - Service configs
3. **Media libraries** - Plex/Jellyfin content
4. **Personal data** - Photos, documents

### Authentication System Failure (Authentik)

#### Symptoms
- Cannot log into SSO-protected services
- Grafana, Portainer access denied
- Web services show authentication errors

#### Emergency Access
1. **Use Local Admin Accounts**
   ```bash
   # Portainer: Use local admin account
   # Grafana: Use admin/admin fallback
   # Direct service access via IP:port
   ```

2. **Bypass Authentication Temporarily**
   ```bash
   # Edit compose files to disable auth
   # Restart services without SSO
   # Fix Authentik issues
   # Re-enable authentication
   ```

### Database Corruption

#### PostgreSQL Recovery
```bash
# Stop all dependent services
docker stop service1 service2

# Backup corrupted database
docker exec postgres pg_dump -U user database > backup.sql

# Restore from backup
docker exec -i postgres psql -U user database < clean_backup.sql

# Restart services
docker start service1 service2
```

#### Redis Recovery
```bash
# Stop Redis
docker stop redis

# Check data integrity
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb

# Restore from backup or start fresh
docker start redis
```

## 🛠️ Emergency Toolkit

### Essential Commands
```bash
# System status overview (htop is interactive; quit with q to continue)
htop && df -h && docker ps

# Network connectivity test (-c limits the packet count so the second ping runs)
ping -c 4 8.8.8.8 && ping -c 4 google.com

# Service restart (replace service-name)
docker restart service-name

# Emergency container stop
docker stop $(docker ps -q)

# Emergency system reboot
sudo reboot
```

### Emergency Access Methods

#### SSH Access
```bash
# Direct IP access
ssh user@192.168.0.XXX

# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX

# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname
```

#### Web Interface Access
```bash
# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT

# Tailscale access
http://100.XXX.XXX.XXX:PORT

# Emergency port forwards
# Check router configuration for emergency access
```

### Emergency Configuration Files

#### Minimal Docker Compose
```yaml
# Emergency Portainer deployment
version: '3.8'
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped

volumes:
  portainer_data:
```

#### Emergency Nginx Config
```nginx
# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

## 📱 Communication During Emergencies

### Notification Channels
1. **ntfy** - If homelab services are partially functional
2. **Signal** - For critical alerts (if bridge is working)
3. **Email** - External email for status updates
4. **SMS** - For complete infrastructure failure

### Status Communication
```bash
# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC

# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
```

## 🔄 Recovery Verification

### Post-Emergency Checklist
- [ ] All hosts responding to ping
- [ ] Critical services accessible
- [ ] Monitoring stack operational
- [ ] External access working
- [ ] Backup systems functional
- [ ] Security services active

### Service Priority Recovery Order
1. **Network Infrastructure** (Router, switches, DNS)
2. **Storage Systems** (Synology, TrueNAS)
3. **Authentication** (Authentik, Vaultwarden)
4. **Monitoring** (Prometheus, Grafana)
5. **Core Services** (Portainer, reverse proxy)
6. **Media Services** (Plex, arr stack)
7. **Communication** (Matrix, Mastodon)
8. **Development** (Gitea, CI/CD)
9. **Optional Services** (Gaming, AI/ML)

## 📋 Emergency Documentation

### Quick Reference Cards
Keep printed copies of:
- Network diagram with IP addresses
- Critical service URLs and ports
- Emergency contact information
- Basic recovery commands

### Offline Access
- USB drive with critical configs
- Printed network documentation
- Mobile hotspot for internet access
- Laptop with SSH clients configured

## 🔍 Post-Emergency Analysis

### Incident Documentation
```bash
# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report

**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements

## Timeline
- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored

## Lessons Learned
- What worked well
- What could be improved
- Action items for prevention
EOF
```

### Improvement Actions
1. Update emergency procedures based on lessons learned
2. Test backup systems regularly
3. Improve monitoring and alerting
4. Document new failure scenarios
5. Update emergency contact information

---

*This document should be reviewed and updated after each emergency incident*
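
Parts of the Post-Emergency Checklist can be scripted. The following is a minimal sketch, not a tested part of this homelab: the `check` and `run_checklist` helpers are hypothetical names, the router and UPS addresses are the ones listed under Critical Service Access, and the `ping -W 2` timeout assumes Linux `iputils` ping. Any other hosts or services would need to be added by hand.

```bash
#!/usr/bin/env bash
# Post-emergency checklist sketch (hypothetical helper names).
# Assumes Linux ping, where -W is a per-reply timeout in seconds.

# check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly and prints a Markdown checklist line:
# "- [x] LABEL" on success, "- [ ] LABEL" on failure.
# Returns non-zero on failure so callers can count problems.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf -- '- [x] %s\n' "$label"
  else
    printf -- '- [ ] %s\n' "$label"
    return 1
  fi
}

# run_checklist: runs each probe and reports how many failed.
# Exits with status 0 only when every probe succeeded.
run_checklist() {
  local failed=0
  check "Router responds to ping" ping -c 1 -W 2 192.168.0.1 || failed=$((failed + 1))
  check "UPS management responds to ping" ping -c 1 -W 2 192.168.0.50 || failed=$((failed + 1))
  check "External DNS resolves and responds" ping -c 1 -W 2 google.com || failed=$((failed + 1))
  check "Docker daemon answering" docker info || failed=$((failed + 1))
  echo "Failed checks: $failed"
  [ "$failed" -eq 0 ]
}
```

After sourcing the file, calling `run_checklist` prints one checklist line per probe, so the output can be pasted straight into an incident report. The script only defines functions and runs nothing on its own, which keeps it safe to source on a partially recovered host.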