# Emergency Procedures

This document outlines emergency procedures for critical failures in the homelab infrastructure.

## 🚨 Emergency Contact Information

### Critical Service Access

- **Vaultwarden Emergency**: See Offline Password Access
- **Network Emergency**: Router admin at `192.168.0.1` (admin/admin)
- **Power Emergency**: UPS management at `192.168.0.50`

### External Services

- **Cloudflare**: Dashboard access for DNS/tunnel management
- **Tailscale**: Admin console for mesh VPN recovery
- **Domain Registrar**: For DNS changes if Cloudflare fails
## 🔥 Critical Failure Scenarios

### Complete Network Failure

#### Symptoms

- No internet connectivity
- Cannot access local services
- Router/switch unresponsive

#### Immediate Actions

1. **Check Physical Connections**

   ```bash
   # Check cable connections
   # Verify power to router/switches
   # Check UPS status
   ```

2. **Router Recovery**

   ```bash
   # Power cycle router (30-second wait)
   # Access router admin: http://192.168.0.1
   # Check WAN connection status
   # Verify DHCP is enabled
   ```

3. **Switch Recovery**

   ```bash
   # Power cycle managed switches
   # Check link lights on all ports
   # Verify VLAN configuration if applicable
   ```

#### Recovery Steps

1. Restore basic internet connectivity
2. Verify internal network communication
3. Restart critical services in order (see Service Dependencies)
4. Test external access through port forwards
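The connectivity checks above can be bundled into a layered sweep that stops at the first failing hop, so the break point is obvious. A minimal sketch, assuming the router is at `192.168.0.1` (adjust for your network); `PING` is overridable so the logic can be dry-run:

```bash
# Layered connectivity sweep: loopback -> router -> external IP -> DNS.
# PING can be overridden (e.g. PING=true) to dry-run the logic.
sweep() {
    ping_cmd="${PING:-ping -c 1 -W 2}"
    for host in 127.0.0.1 192.168.0.1 8.8.8.8 google.com; do
        if $ping_cmd "$host" > /dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host  (break is at or before this layer)"
            return 1
        fi
    done
    echo "connectivity restored"
}
```

Run `sweep` after each physical-layer fix; the first `FAIL` line shows which layer still needs attention.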
### Power Outage Recovery

#### During Outage

- UPS should maintain critical systems for 15-30 minutes
- Graceful shutdown sequence will be triggered automatically
- Monitor UPS status via web interface if accessible

#### After Power Restoration

1. **Wait for Network Stability** (5 minutes)

2. **Start Core Infrastructure**

   ```bash
   # Synology NAS systems (auto-start enabled)
   # Router and switches (auto-start)
   # Internet connection verification
   ```

3. **Start Host Systems in Order**

   - Proxmox hosts
   - Physical machines (Anubis, Guava, Concord NUC)
   - Raspberry Pi devices

4. **Verify Service Health**

   ```bash
   # Check Portainer endpoints
   # Verify monitoring stack
   # Test critical services (Plex, Vaultwarden, etc.)
   ```
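The service-health step can be scripted as a probe loop. A sketch: the `name|url` list is a placeholder (the `XXX` octets mirror the placeholders used elsewhere in this document; fill in real IPs and ports), and `CURL` is overridable for dry runs:

```bash
# Probe each critical service endpoint and report UP/DOWN.
# The name|url list below is a placeholder -- fill in real IPs and ports.
# CURL can be overridden (e.g. CURL=true) to dry-run the logic.
check_services() {
    curl_cmd="${CURL:-curl -fsS -m 5 -o /dev/null}"
    fails=0
    while IFS='|' read -r name url; do
        [ -z "$name" ] && continue
        if $curl_cmd "$url" 2> /dev/null; then
            echo "UP   $name"
        else
            echo "DOWN $name ($url)"
            fails=$((fails + 1))
        fi
    done <<EOF
portainer|http://192.168.0.XXX:9000
grafana|http://192.168.0.XXX:3000
vaultwarden|http://192.168.0.XXX:8080
EOF
    return "$fails"
}
```

The exit status doubles as the number of services still down, so `check_services || echo "$? services down"` works in follow-up scripts.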
### Storage System Failure

#### Synology NAS Failure

```bash
# Check RAID status
cat /proc/mdstat

# Check disk health
smartctl -a /dev/sda

# Emergency data recovery:
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups
```
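The single-disk `smartctl` check above covers `/dev/sda` only; during an array failure you usually want a verdict on every member disk. A sketch (it assumes `smartctl` is reachable over SSH on the NAS; `SMARTCTL` is overridable for dry runs):

```bash
# Quick SMART verdict across disks (defaults to /dev/sd?).
# SMARTCTL can be overridden (e.g. SMARTCTL="echo PASSED for") to dry-run.
disk_health_sweep() {
    smart_cmd="${SMARTCTL:-smartctl -H}"
    [ $# -eq 0 ] && set -- /dev/sd?
    bad=0
    for dev in "$@"; do
        [ -e "$dev" ] || { echo "skip $dev (not present)"; continue; }
        if $smart_cmd "$dev" | grep -Eqi 'PASSED|OK'; then
            echo "HEALTHY $dev"
        else
            echo "SUSPECT $dev  (inspect with: smartctl -a $dev)"
            bad=$((bad + 1))
        fi
    done
    return "$bad"
}
```

The return status is the count of suspect disks, so zero means the whole array reported healthy.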
#### Critical Data Recovery Priority

1. **Vaultwarden database** - Password access
2. **Configuration files** - Service configs
3. **Media libraries** - Plex/Jellyfin content
4. **Personal data** - Photos, documents
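The priority list above can drive the copy itself, so the most critical tier lands first even if the transfer is interrupted. A sketch; the source paths are assumptions (match them to your volume layout) and `COPY` is overridable for dry runs:

```bash
# Copy data tier by tier, highest priority first; a failed tier is
# reported but does not block the tiers below it.
# COPY can be overridden (e.g. COPY=echo) to dry-run.
priority_copy() {
    copy_cmd="${COPY:-rsync -a --partial}"
    dest="$1"
    for tier in \
        /volume1/docker/vaultwarden \
        /volume1/docker/configs \
        /volume1/media \
        /volume1/personal
    do
        echo "copying tier: $tier"
        $copy_cmd "$tier" "$dest/" || echo "tier failed: $tier (continuing)"
    done
}
```

Usage: `priority_copy /mnt/recovery` once the drives are mounted on another system.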
### Authentication System Failure (Authentik)

#### Symptoms

- Cannot log into SSO-protected services
- Grafana and Portainer access denied
- Web services show authentication errors

#### Emergency Access

1. **Use Local Admin Accounts**

   ```bash
   # Portainer: use the local admin account
   # Grafana: use the admin/admin fallback
   # Direct service access via IP:port
   ```

2. **Bypass Authentication Temporarily**

   ```bash
   # Edit compose files to disable auth
   # Restart services without SSO
   # Fix Authentik issues
   # Re-enable authentication
   ```
### Database Corruption

#### PostgreSQL Recovery

```bash
# Stop all dependent services
docker stop service1 service2

# Dump the corrupted database for forensics before overwriting anything
docker exec postgres pg_dump -U user database > corrupted_backup.sql

# Restore from a known-good backup
docker exec -i postgres psql -U user database < clean_backup.sql

# Restart services
docker start service1 service2
```

#### Redis Recovery

```bash
# Stop Redis
docker stop redis

# Check data integrity
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb

# Restore from backup or start fresh
docker start redis
```
## 🛠️ Emergency Toolkit

### Essential Commands

```bash
# System status overview
htop && df -h && docker ps

# Network connectivity test
ping 8.8.8.8 && ping google.com

# Service restart (replace service-name)
docker restart service-name

# Emergency stop of all running containers
docker stop $(docker ps -q)

# Emergency system reboot
sudo reboot
```
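On a shaky emergency SSH session it helps to capture everything in one command. A sketch that bundles the status commands above into a single function and tees the output to a file for the later incident report:

```bash
# One-shot status snapshot: prints to the terminal and saves a copy
# (default path is a timestamped file under /tmp).
status_snapshot() {
    out="${1:-/tmp/status_$(date +%Y%m%d_%H%M%S).txt}"
    {
        echo "== $(date) =="
        echo "-- disk --";   df -h 2>&1 || true
        echo "-- load --";   uptime 2>&1 || true
        echo "-- docker --"; docker ps 2>&1 || echo "docker unavailable"
    } | tee "$out"
}
```

Usage: `status_snapshot` for the default path, or `status_snapshot /tmp/before_reboot.txt` to name the file.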
### Emergency Access Methods

#### SSH Access

```bash
# Direct IP access
ssh user@192.168.0.XXX

# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX

# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname
```

#### Web Interface Access

```
# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT

# Tailscale access
http://100.XXX.XXX.XXX:PORT

# Emergency port forwards: check router configuration for emergency access
```
### Emergency Configuration Files

#### Minimal Docker Compose

```yaml
# Emergency Portainer deployment
version: '3.8'

services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped

volumes:
  portainer_data:
```
#### Emergency Nginx Config

```nginx
# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
## 📱 Communication During Emergencies

### Notification Channels

- **ntfy** - If homelab services are partially functional
- **Signal** - For critical alerts (if the bridge is working)
- **Email** - External email for status updates
- **SMS** - For complete infrastructure failure

### Status Communication

```bash
# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC

# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
```
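The two commands above can be combined into one helper that always writes the local log first, since the log must survive even when the network is down. A sketch; `LOG_FILE` and `NTFY_URL` default to the paths used above and are overridable:

```bash
# Log an emergency action locally, then push it to ntfy best-effort.
emergency_note() {
    msg="$1"
    log="${LOG_FILE:-/var/log/emergency.log}"
    echo "$(date '+%Y-%m-%d %H:%M:%S'): $msg" >> "$log"
    # best-effort push; a dead network must never abort the logging path
    curl -fsS -m 5 -d "$msg" "${NTFY_URL:-ntfy.vish.gg/REDACTED_NTFY_TOPIC}" \
        > /dev/null 2>&1 || echo "ntfy unreachable; logged locally only"
}
```

Usage: `emergency_note "Router power-cycled, WAN still down"`.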
## 🔄 Recovery Verification

### Post-Emergency Checklist

- [ ] All hosts responding to ping
- [ ] Critical services accessible
- [ ] Monitoring stack operational
- [ ] External access working
- [ ] Backup systems functional
- [ ] Security services active
### Service Priority Recovery Order

1. **Network Infrastructure** (Router, switches, DNS)
2. **Storage Systems** (Synology, TrueNAS)
3. **Authentication** (Authentik, Vaultwarden)
4. **Monitoring** (Prometheus, Grafana)
5. **Core Services** (Portainer, reverse proxy)
6. **Media Services** (Plex, arr stack)
7. **Communication** (Matrix, Mastodon)
8. **Development** (Gitea, CI/CD)
9. **Optional Services** (Gaming, AI/ML)
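The order above can be enforced with a small loop that brings up one tier, waits for it to settle, then moves on. A sketch: the compose project names are hypothetical (map them to your own stacks; only the first five tiers are shown), and `DOCKER`/`SLEEP` are overridable for dry runs:

```bash
# Restart stacks tier by tier; pause between tiers so dependencies
# settle before their consumers start. Project names are placeholders.
# DOCKER and SLEEP can be overridden (e.g. DOCKER=echo SLEEP=true) to dry-run.
recover_in_order() {
    docker_cmd="${DOCKER:-docker}"
    sleep_cmd="${SLEEP:-sleep 30}"
    for stack in storage auth monitoring core media; do
        echo "starting tier: $stack"
        $docker_cmd compose -p "$stack" up -d || echo "tier $stack failed; fix before continuing"
        $sleep_cmd
    done
}
```

A fixed sleep is crude but predictable under stress; per-service health checks (as in the toolkit section) are the more thorough follow-up.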
## 📋 Emergency Documentation

### Quick Reference Cards

Keep printed copies of:

- Network diagram with IP addresses
- Critical service URLs and ports
- Emergency contact information
- Basic recovery commands

### Offline Access

- USB drive with critical configs
- Printed network documentation
- Mobile hotspot for internet access
- Laptop with SSH clients configured
## 🔍 Post-Emergency Analysis

### Incident Documentation

```bash
# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report

**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements

## Timeline

- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored

## Lessons Learned

- What worked well
- What could be improved
- Action items for prevention
EOF
```
### Improvement Actions

- Update emergency procedures based on lessons learned
- Test backup systems regularly
- Improve monitoring and alerting
- Document new failure scenarios
- Update emergency contact information

*This document should be reviewed and updated after each emergency incident.*