# Emergency Procedures

This document outlines emergency procedures for critical failures in the homelab infrastructure.
## 🚨 Emergency Contact Information

### Critical Service Access

- **Vaultwarden Emergency**: See [Offline Password Access](offline-password-access.md)
- **Network Emergency**: Router admin at `192.168.0.1` (admin/admin)
- **Power Emergency**: UPS management at `192.168.0.50`

### External Services

- **Cloudflare**: Dashboard access for DNS/tunnel management
- **Tailscale**: Admin console for mesh VPN recovery
- **Domain Registrar**: For DNS changes if Cloudflare fails
## 🔥 Critical Failure Scenarios

### Complete Network Failure

#### Symptoms

- No internet connectivity
- Cannot access local services
- Router/switch unresponsive

#### Immediate Actions

1. **Check Physical Connections**

   ```bash
   # Check cable connections
   # Verify power to router/switches
   # Check UPS status
   ```

2. **Router Recovery**

   ```bash
   # Power cycle router (30-second wait)
   # Access router admin: http://192.168.0.1
   # Check WAN connection status
   # Verify DHCP is enabled
   ```

3. **Switch Recovery**

   ```bash
   # Power cycle managed switches
   # Check link lights on all ports
   # Verify VLAN configuration if applicable
   ```

#### Recovery Steps

1. Restore basic internet connectivity
2. Verify internal network communication
3. Restart critical services in order (see [Service Dependencies](../services/dependencies.md))
4. Test external access through port forwards
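Step 2 of the recovery sequence can be scripted as a quick reachability sweep. A minimal sketch; the example IPs in the comment are placeholders for your own gateway, UPS, and hosts:

```shell
#!/usr/bin/env bash
# Report reachability for each host passed as an argument.

check_host() {
  # One ping, 2-second timeout; success means the host answered.
  if ping -c 1 -W 2 "$1" > /dev/null 2>&1; then
    echo "$1 reachable"
  else
    echo "$1 unreachable"
  fi
}

sweep() {
  local host
  for host in "$@"; do
    check_host "$host"
  done
}

# Example (placeholder IPs): sweep 192.168.0.1 192.168.0.50
```

Run it against the router first; if the gateway itself is unreachable, skip straight to the physical checks above.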
### Power Outage Recovery

#### During Outage

- UPS should maintain critical systems for 15-30 minutes
- Graceful shutdown sequence will be triggered automatically
- Monitor UPS status via web interface if accessible

#### After Power Restoration

1. **Wait for Network Stability** (5 minutes)

2. **Start Core Infrastructure**

   ```bash
   # Synology NAS systems (auto-start enabled)
   # Router and switches (auto-start)
   # Internet connection verification
   ```

3. **Start Host Systems in Order**

   - Proxmox hosts
   - Physical machines (Anubis, Guava, Concord NUC)
   - Raspberry Pi devices

4. **Verify Service Health**

   ```bash
   # Check Portainer endpoints
   # Verify monitoring stack
   # Test critical services (Plex, Vaultwarden, etc.)
   ```
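The service-health step can be automated by polling each endpoint until it answers. A sketch, assuming the services expose plain HTTP endpoints; the example URL is a placeholder:

```shell
#!/usr/bin/env bash
# Poll a URL until it responds, or give up after N tries (default 5).

wait_healthy() {
  local url="$1" tries="${2:-5}" i
  for ((i = 1; i <= tries; i++)); do
    # -f: fail on HTTP errors; -m 3: 3-second timeout per attempt.
    if curl -fsS -m 3 "$url" > /dev/null 2>&1; then
      echo "healthy: $url"
      return 0
    fi
    sleep 1
  done
  echo "unhealthy: $url"
  return 1
}

# Example (placeholder URL): wait_healthy http://192.168.0.10:9000
```

Services still booting after a power cut can take minutes to answer, so raise the retry count for slow starters rather than declaring them dead early.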
### Storage System Failure

#### Synology NAS Failure

```bash
# Check RAID status
cat /proc/mdstat

# Check disk health
smartctl -a /dev/sda

# Emergency data recovery
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups
```

#### Critical Data Recovery Priority

1. **Vaultwarden database** - Password access
2. **Configuration files** - Service configs
3. **Media libraries** - Plex/Jellyfin content
4. **Personal data** - Photos, documents
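The priority list above can be encoded as data so the copy order is never improvised under pressure. A dry-run sketch that only prints the commands; every path, including the rescue mount point, is a placeholder for the actual share layout:

```shell
#!/usr/bin/env bash
# Print recovery copy commands in priority order (dry run).
# Source paths and /mnt/rescue are placeholders.

PRIORITIES=(
  "/volume1/docker/vaultwarden"   # 1. password database
  "/volume1/docker/configs"       # 2. service configs
  "/volume1/media"                # 3. media libraries
  "/volume1/photos"               # 4. personal data
)

for src in "${PRIORITIES[@]}"; do
  echo "rsync -a ${src}/ /mnt/rescue${src}/"
done
```

Pipe the output through `sh` only after eyeballing it; keeping the script as a printer makes it safe to run on a half-broken system.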
### Authentication System Failure (Authentik)

#### Symptoms

- Cannot log into SSO-protected services
- Grafana, Portainer access denied
- Web services show authentication errors

#### Emergency Access

1. **Use Local Admin Accounts**

   ```bash
   # Portainer: Use local admin account
   # Grafana: Use admin/admin fallback
   # Direct service access via IP:port
   ```

2. **Bypass Authentication Temporarily**

   ```bash
   # Edit compose files to disable auth
   # Restart services without SSO
   # Fix Authentik issues
   # Re-enable authentication
   ```
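One low-risk way to do the temporary bypass is a compose override file rather than editing the main compose file: it publishes the container port directly and is trivially reverted by deleting one file. A hypothetical sketch — the service name and port are assumptions, not the actual stack layout:

```yaml
# docker-compose.override.yml - temporary emergency access (assumed service/port).
# Delete this file and redeploy once Authentik is healthy again.
services:
  grafana:
    ports:
      - "3000:3000"   # direct access at http://HOST-IP:3000, bypassing SSO
```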
### Database Corruption

#### PostgreSQL Recovery

```bash
# Stop all dependent services
docker stop service1 service2

# Backup corrupted database
docker exec postgres pg_dump -U user database > backup.sql

# Restore from backup
docker exec -i postgres psql -U user database < clean_backup.sql

# Restart services
docker start service1 service2
```

#### Redis Recovery

```bash
# Stop Redis
docker stop redis

# Check data integrity
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb

# Restore from backup or start fresh
docker start redis
```
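The PostgreSQL steps can be wrapped so a snapshot is always taken before the restore overwrites anything, and the restore aborts on the first failure. A hedged sketch; the container, user, database, and file names are placeholders matching the example commands above:

```shell
#!/usr/bin/env bash
# Snapshot the current database, then restore over it; stop on first failure.
# All arguments are placeholders - adjust for the real stack.

restore_pg() {
  local container="$1" user="$2" db="$3" backup="$4"
  local snapshot
  snapshot="pre_restore_$(date +%Y%m%d%H%M%S).sql"
  # Keep a copy of whatever is there now, even if it is corrupted.
  docker exec "$container" pg_dump -U "$user" "$db" > "$snapshot" || return 1
  # Overwrite with the known-good backup.
  docker exec -i "$container" psql -U "$user" "$db" < "$backup" || return 1
  echo "restored $db from $backup (snapshot: $snapshot)"
}

# Example: restore_pg postgres user database clean_backup.sql
```

Note that `pg_dump` against a badly corrupted database may itself fail; in that case fall back to copying the raw data directory before restoring.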
## 🛠️ Emergency Toolkit

### Essential Commands

```bash
# System status overview
htop && df -h && docker ps

# Network connectivity test (raw IP first, then DNS resolution)
ping -c 4 8.8.8.8 && ping -c 4 google.com

# Service restart (replace service-name)
docker restart service-name

# Emergency container stop
docker stop $(docker ps -q)

# Emergency system reboot
sudo reboot
```
### Emergency Access Methods

#### SSH Access

```bash
# Direct IP access
ssh user@192.168.0.XXX

# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX

# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname
```

#### Web Interface Access

```bash
# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT

# Tailscale access
http://100.XXX.XXX.XXX:PORT

# Emergency port forwards
# Check router configuration for emergency access
```
### Emergency Configuration Files

#### Minimal Docker Compose

```yaml
# Emergency Portainer deployment
version: '3.8'
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped
volumes:
  portainer_data:
```

#### Emergency Nginx Config

```nginx
# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
## 📱 Communication During Emergencies

### Notification Channels

1. **ntfy** - If homelab services are partially functional
2. **Signal** - For critical alerts (if bridge is working)
3. **Email** - External email for status updates
4. **SMS** - For complete infrastructure failure

### Status Communication

```bash
# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC

# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
```
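The ad-hoc logging `echo` can be wrapped in a small helper so every entry gets a consistent timestamp. The default path matches `/var/log/emergency.log` above; the `EMERGENCY_LOG` override variable is a suggested convention, not something the rest of the stack relies on:

```shell
#!/usr/bin/env bash
# Append a timestamped entry to the emergency log.
# EMERGENCY_LOG is a hypothetical override for the default path.

log_emergency() {
  local logfile="${EMERGENCY_LOG:-/var/log/emergency.log}"
  printf '%s: %s\n' "$(date)" "$*" >> "$logfile"
}

# Example: log_emergency "Power restored, starting Proxmox hosts"
```

A consistent log makes the post-incident timeline below much easier to reconstruct.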
## 🔄 Recovery Verification

### Post-Emergency Checklist

- [ ] All hosts responding to ping
- [ ] Critical services accessible
- [ ] Monitoring stack operational
- [ ] External access working
- [ ] Backup systems functional
- [ ] Security services active

### Service Priority Recovery Order

1. **Network Infrastructure** (Router, switches, DNS)
2. **Storage Systems** (Synology, TrueNAS)
3. **Authentication** (Authentik, Vaultwarden)
4. **Monitoring** (Prometheus, Grafana)
5. **Core Services** (Portainer, reverse proxy)
6. **Media Services** (Plex, arr stack)
7. **Communication** (Matrix, Mastodon)
8. **Development** (Gitea, CI/CD)
9. **Optional Services** (Gaming, AI/ML)
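The container tiers of this order (network and storage come up by hand first) can be encoded as data so restarts happen tier by tier. A dry-run sketch that only prints the commands; the service-to-tier mapping is illustrative, not a complete inventory:

```shell
#!/usr/bin/env bash
# Print restart commands grouped by recovery tier (dry run).
# Service names per tier are illustrative placeholders.

TIERS=(
  "authentication:authentik vaultwarden"
  "monitoring:prometheus grafana"
  "core:portainer nginx-proxy"
  "media:plex sonarr radarr"
)

for tier in "${TIERS[@]}"; do
  name="${tier%%:*}"        # text before the colon: tier label
  for svc in ${tier#*:}; do # text after the colon: space-separated services
    echo "docker restart $svc  # tier: $name"
  done
done
```

Keeping it a printer means you can review the plan, then paste commands one tier at a time while watching the monitoring stack come back.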
## 📋 Emergency Documentation

### Quick Reference Cards

Keep printed copies of:

- Network diagram with IP addresses
- Critical service URLs and ports
- Emergency contact information
- Basic recovery commands

### Offline Access

- USB drive with critical configs
- Printed network documentation
- Mobile hotspot for internet access
- Laptop with SSH clients configured
## 🔍 Post-Emergency Analysis

### Incident Documentation

```bash
# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report

**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements

## Timeline
- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored

## Lessons Learned
- What worked well
- What could be improved
- Action items for prevention
EOF
```

### Improvement Actions

1. Update emergency procedures based on lessons learned
2. Test backup systems regularly
3. Improve monitoring and alerting
4. Document new failure scenarios
5. Update emergency contact information

---

*This document should be reviewed and updated after each emergency incident.*