# Emergency Procedures
This document outlines emergency procedures for critical failures in the homelab infrastructure.
## 🚨 Emergency Contact Information
### Critical Service Access
- **Vaultwarden Emergency**: See [Offline Password Access](offline-password-access.md)
- **Network Emergency**: Router admin at `192.168.0.1` (admin/admin)
- **Power Emergency**: UPS management at `192.168.0.50`
### External Services
- **Cloudflare**: Dashboard access for DNS/tunnel management
- **Tailscale**: Admin console for mesh VPN recovery
- **Domain Registrar**: For DNS changes if Cloudflare fails
## 🔥 Critical Failure Scenarios
### Complete Network Failure
#### Symptoms
- No internet connectivity
- Cannot access local services
- Router/switch unresponsive
#### Immediate Actions
1. **Check Physical Connections**
```bash
# Check cable connections
# Verify power to router/switches
# Check UPS status
```
2. **Router Recovery**
```bash
# Power cycle router (30-second wait)
# Access router admin: http://192.168.0.1
# Check WAN connection status
# Verify DHCP is enabled
```
3. **Switch Recovery**
```bash
# Power cycle managed switches
# Check link lights on all ports
# Verify VLAN configuration if applicable
```
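The physical checks above can be backed by a quick interface sweep from any Linux host on the segment. A minimal sketch, assuming the `ip` tool from iproute2 is available:

```shell
# List interfaces the kernel reports as DOWN -- a fast first check
# before suspecting the router or switch hardware.
down_ifaces() {
  # Reads `ip -o link` output on stdin; prints the name of each DOWN interface.
  awk -F': ' '/state DOWN/ {print $2}'
}

ip -o link 2>/dev/null | down_ifaces
```

No output means every interface is at least administratively up; a listed interface points at a cable, port, or driver problem rather than upstream connectivity.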
#### Recovery Steps
1. Restore basic internet connectivity
2. Verify internal network communication
3. Restart critical services in order (see [Service Dependencies](../services/dependencies.md))
4. Test external access through port forwards
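Steps 1–2 can be run as a repeatable sweep. A sketch; the host list is illustrative (gateway first, then an external reference), so substitute your own IPs:

```shell
# Ping each recovery-critical host once and report the result.
report() {
  # report <host> <exit-status>: prints OK/FAIL based on the ping result.
  if [ "$2" -eq 0 ]; then echo "OK   $1"; else echo "FAIL $1"; fi
}

HOSTS="192.168.0.1 8.8.8.8"   # example: gateway, then external reference
for h in $HOSTS; do
  ping -c 1 -W 2 "$h" > /dev/null 2>&1
  report "$h" "$?"
done
```

If the gateway answers but the external host does not, the problem is the WAN link rather than the internal network.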
### Power Outage Recovery
#### During Outage
- UPS should maintain critical systems for 15-30 minutes
- Graceful shutdown sequence will be triggered automatically
- Monitor UPS status via web interface if accessible
#### After Power Restoration
1. **Wait for Network Stability** (5 minutes)
2. **Start Core Infrastructure**
```bash
# Synology NAS systems (auto-start enabled)
# Router and switches (auto-start)
# Internet connection verification
```
3. **Start Host Systems in Order**
- Proxmox hosts
- Physical machines (Anubis, Guava, Concord NUC)
- Raspberry Pi devices
4. **Verify Service Health**
```bash
# Check Portainer endpoints
# Verify monitoring stack
# Test critical services (Plex, Vaultwarden, etc.)
```
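The container checks in step 4 can be scripted. A hedged sketch -- the container names below are examples drawn from this document, not a definitive list:

```shell
# Report the state of each critical container
# ("missing" if it cannot be queried or does not exist).
container_state() {
  docker inspect -f '{{.State.Status}}' "$1" 2>/dev/null || echo "missing"
}

for c in portainer vaultwarden plex; do
  echo "$c: $(container_state "$c")"
done
```

Anything other than `running` after the startup sequence warrants a look at that container's logs before moving down the priority list.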
### Storage System Failure
#### Synology NAS Failure
```bash
# Check RAID status
cat /proc/mdstat
# Check disk health
smartctl -a /dev/sda
# Emergency data recovery
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups
```
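The per-disk health check above can be swept across all drives at once. A sketch, assuming smartmontools is installed; the `/dev/sd?` glob is illustrative and may need widening for NVMe or additional bays:

```shell
# Summarize the overall SMART verdict for each disk.
health_line() {
  # Extract PASSED/FAILED from `smartctl -H` output on stdin.
  grep -Eo 'PASSED|FAILED' | head -n 1
}

for d in /dev/sd?; do
  echo "$d: $(smartctl -H "$d" 2>/dev/null | health_line)"
done
```

Any `FAILED` verdict means that disk goes to the top of the data-recovery priority list below.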
#### Critical Data Recovery Priority
1. **Vaultwarden database** - Password access
2. **Configuration files** - Service configs
3. **Media libraries** - Plex/Jellyfin content
4. **Personal data** - Photos, documents
### Authentication System Failure (Authentik)
#### Symptoms
- Cannot log into SSO-protected services
- Grafana, Portainer access denied
- Web services show authentication errors
#### Emergency Access
1. **Use Local Admin Accounts**
```bash
# Portainer: Use local admin account
# Grafana: Use admin/admin fallback
# Direct service access via IP:port
```
2. **Bypass Authentication Temporarily**
```bash
# Edit compose files to disable auth
# Restart services without SSO
# Fix Authentik issues
# Re-enable authentication
```
### Database Corruption
#### PostgreSQL Recovery
```bash
# Stop all dependent services
docker stop service1 service2
# Attempt to dump the damaged database for forensics (may fail if corruption is severe)
docker exec postgres pg_dump -U user database > backup.sql
# Restore from a known-good backup
docker exec -i postgres psql -U user database < clean_backup.sql
# Restart services
docker start service1 service2
```
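Before restoring, it is worth confirming that the backup you are about to trust is plausible. A minimal sketch, assuming a plain-format `pg_dump` SQL file (such files contain `CREATE` and `COPY` statements):

```shell
# A plain pg_dump file should be non-empty and contain CREATE/COPY statements.
dump_ok() {
  [ -s "$1" ] && grep -qE '^(CREATE|COPY) ' "$1"
}

if dump_ok clean_backup.sql; then
  echo "backup looks plausible"
else
  echo "backup looks empty or truncated -- do not restore from it"
fi
```

This catches the common failure mode of a scheduled backup job silently producing empty or truncated files.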
#### Redis Recovery
```bash
# Stop Redis
docker stop redis
# Check RDB snapshot integrity (use redis-check-aof instead if AOF persistence is enabled)
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb
# Restore from backup or start fresh
docker start redis
```
## 🛠️ Emergency Toolkit
### Essential Commands
```bash
# System status overview
htop && df -h && docker ps
# Network connectivity test
ping -c 3 8.8.8.8 && ping -c 3 google.com
# Service restart (replace service-name)
docker restart service-name
# Emergency container stop
docker stop $(docker ps -q)
# Emergency system reboot
sudo reboot
```
### Emergency Access Methods
#### SSH Access
```bash
# Direct IP access
ssh user@192.168.0.XXX
# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX
# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname
```
#### Web Interface Access
```bash
# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT
# Tailscale access
http://100.XXX.XXX.XXX:PORT
# Emergency port forwards
# Check router configuration for emergency access
```
### Emergency Configuration Files
#### Minimal Docker Compose
```yaml
# Emergency Portainer deployment
version: '3.8'
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped
volumes:
  portainer_data:
```
#### Emergency Nginx Config
```nginx
# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
## 📱 Communication During Emergencies
### Notification Channels
1. **ntfy** - If homelab services are partially functional
2. **Signal** - For critical alerts (if bridge is working)
3. **Email** - External email for status updates
4. **SMS** - For complete infrastructure failure
### Status Communication
```bash
# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC
# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
```
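If ntfy itself is unreachable, a small fallback wrapper keeps a local record of unsent updates so the timeline can be reconstructed later. A sketch; the URL and log path are placeholders, not the real topic:

```shell
# Try ntfy first; if the server is unreachable, append the message to a local log.
NTFY_URL="${NTFY_URL:-https://ntfy.example/emergency}"   # placeholder topic
LOGFILE="${LOGFILE:-/tmp/emergency.log}"

notify_or_log() {
  if ! curl -fsS -m 5 -d "$1" "$NTFY_URL" > /dev/null 2>&1; then
    echo "$(date): UNSENT: $1" >> "$LOGFILE"
  fi
}

notify_or_log "Emergency: recovery in progress"
```

The 5-second timeout keeps the wrapper from hanging when DNS or the WAN link is also down.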
## 🔄 Recovery Verification
### Post-Emergency Checklist
- [ ] All hosts responding to ping
- [ ] Critical services accessible
- [ ] Monitoring stack operational
- [ ] External access working
- [ ] Backup systems functional
- [ ] Security services active
### Service Priority Recovery Order
1. **Network Infrastructure** (Router, switches, DNS)
2. **Storage Systems** (Synology, TrueNAS)
3. **Authentication** (Authentik, Vaultwarden)
4. **Monitoring** (Prometheus, Grafana)
5. **Core Services** (Portainer, reverse proxy)
6. **Media Services** (Plex, arr stack)
7. **Communication** (Matrix, Mastodon)
8. **Development** (Gitea, CI/CD)
9. **Optional Services** (Gaming, AI/ML)
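Working down this list is easier with a retry helper that blocks until a tier's health check passes before starting the next tier. A sketch; the gateway IP in the example is illustrative:

```shell
# wait_for <tries> <delay-seconds> <command...>
# Retries the command until it succeeds, or gives up after <tries> attempts.
wait_for() {
  tries=$1; delay=$2; shift 2
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep "$delay"
  done
}

# Example: do not start a later tier until the network tier answers pings.
wait_for 3 1 ping -c 1 -W 1 192.168.0.1 > /dev/null 2>&1 \
  && echo "network tier reachable" \
  || echo "network tier still down after 3 attempts"
```

Gating each tier this way prevents the common mistake of starting services whose dependencies (storage, auth) are not actually healthy yet.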
## 📋 Emergency Documentation
### Quick Reference Cards
Keep printed copies of:
- Network diagram with IP addresses
- Critical service URLs and ports
- Emergency contact information
- Basic recovery commands
### Offline Access
- USB drive with critical configs
- Printed network documentation
- Mobile hotspot for internet access
- Laptop with SSH clients configured
## 🔍 Post-Emergency Analysis
### Incident Documentation
```bash
# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report
**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements
## Timeline
- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored
## Lessons Learned
- What worked well
- What could be improved
- Action items for prevention
EOF
```
### Improvement Actions
1. Update emergency procedures based on lessons learned
2. Test backup systems regularly
3. Improve monitoring and alerting
4. Document new failure scenarios
5. Update emergency contact information
---
*This document should be reviewed and updated after each emergency incident*