Sanitized mirror from private repository - 2026-03-22 08:40:48 UTC
327
docs/troubleshooting/emergency.md
Normal file
@@ -0,0 +1,327 @@
# Emergency Procedures

This document outlines emergency procedures for critical failures in the homelab infrastructure.

## 🚨 Emergency Contact Information

### Critical Service Access
- **Vaultwarden Emergency**: See [Offline Password Access](offline-password-access.md)
- **Network Emergency**: Router admin at `192.168.0.1` (admin/admin)
- **Power Emergency**: UPS management at `192.168.0.50`

### External Services
- **Cloudflare**: Dashboard access for DNS/tunnel management
- **Tailscale**: Admin console for mesh VPN recovery
- **Domain Registrar**: For DNS changes if Cloudflare fails

## 🔥 Critical Failure Scenarios

### Complete Network Failure

#### Symptoms
- No internet connectivity
- Cannot access local services
- Router/switch unresponsive

#### Immediate Actions
1. **Check Physical Connections**
   ```bash
   # Check cable connections
   # Verify power to router/switches
   # Check UPS status
   ```

2. **Router Recovery**
   ```bash
   # Power cycle router (30-second wait)
   # Access router admin: http://192.168.0.1
   # Check WAN connection status
   # Verify DHCP is enabled
   ```

3. **Switch Recovery**
   ```bash
   # Power cycle managed switches
   # Check link lights on all ports
   # Verify VLAN configuration if applicable
   ```
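
Once the hardware is back up, the remaining checks can be partly automated. The following is a minimal sketch, assuming a Linux host with iputils `ping` and the router at `192.168.0.1`; the `diagnose_network` function and the `PROBE` variable are illustrative names, not existing homelab tooling. It probes the gateway, a public IP, and a hostname in turn and reports the first layer that fails.

```shell
# Probe command; overridable so the logic can be tried without touching the network.
# Assumption: iputils ping, where -W is a per-reply timeout in seconds.
PROBE=${PROBE:-"ping -c 2 -W 2"}

# diagnose_network [GATEWAY]: report the first failing layer (LAN, WAN, DNS).
# Exit status: 0 = all good, 1 = LAN, 2 = WAN, 3 = DNS.
diagnose_network() {
  local gateway="${1:-192.168.0.1}"
  if ! $PROBE "$gateway" >/dev/null 2>&1; then
    echo "LAN: gateway $gateway unreachable (check cables, switch, router power)"
    return 1
  fi
  if ! $PROBE 8.8.8.8 >/dev/null 2>&1; then
    echo "WAN: gateway responds but the internet does not (check router WAN status)"
    return 2
  fi
  if ! $PROBE google.com >/dev/null 2>&1; then
    echo "DNS: raw IP works but name resolution fails"
    return 3
  fi
  echo "OK: LAN, WAN and DNS all respond"
}
```

Running `diagnose_network` from a wired host narrows the fault before any equipment is power cycled again; the exit status (1, 2, or 3) identifies the failing layer for use in scripts.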

#### Recovery Steps
1. Restore basic internet connectivity
2. Verify internal network communication
3. Restart critical services in order (see [Service Dependencies](../services/dependencies.md))
4. Test external access through port forwards

### Power Outage Recovery

#### During Outage
- UPS should maintain critical systems for 15-30 minutes
- Graceful shutdown sequence will be triggered automatically
- Monitor UPS status via web interface if accessible

#### After Power Restoration
1. **Wait for Network Stability** (5 minutes)
2. **Start Core Infrastructure**
   ```bash
   # Synology NAS systems (auto-start enabled)
   # Router and switches (auto-start)
   # Internet connection verification
   ```

3. **Start Host Systems in Order**
   - Proxmox hosts
   - Physical machines (Anubis, Guava, Concord NUC)
   - Raspberry Pi devices

4. **Verify Service Health**
   ```bash
   # Check Portainer endpoints
   # Verify monitoring stack
   # Test critical services (Plex, Vaultwarden, etc.)
   ```
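
Hosts come back at different speeds after an outage, so it can help to wait for each one before moving to the next. The following is a minimal sketch; `wait_until` is a hypothetical helper, and the host names in the usage comment are placeholders rather than confirmed DNS names.

```shell
# wait_until TRIES DELAY CMD...: rerun CMD until it succeeds or TRIES attempts are used.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (placeholder host names; up to 30 attempts, 10 seconds apart):
# for host in proxmox-host anubis guava; do
#   wait_until 30 10 ping -c 1 -W 2 "$host" || echo "$host did not come up"
# done
```

Because the command is passed as arguments, the same helper can wrap `curl` against a service URL or `docker ps` on a remote host instead of `ping`.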

### Storage System Failure

#### Synology NAS Failure
```bash
# Check RAID status
cat /proc/mdstat

# Check disk health
smartctl -a /dev/sda

# Emergency data recovery
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups
```
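
In `/proc/mdstat`, each array's member status appears as a bracketed string such as `[UU]`; an underscore in place of a `U` marks a missing or failed member. The following is a minimal sketch that lists arrays in that state. `degraded_arrays` is an illustrative name, and the optional file argument exists only so the function can be tried against a saved copy of the file.

```shell
# degraded_arrays [FILE]: print the name of every md array whose status
# string contains an underscore (for example [U_]), one name per line.
degraded_arrays() {
  local file="${1:-/proc/mdstat}"
  awk '
    # Remember the array name from lines like "md0 : active raid1 ..."
    /^md[^ ]* :/ { name = $1 }
    # A status string made of U and _ with at least one _ means a member is down
    /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
  ' "$file"
}
```

An array that is rebuilding also shows an underscore until the resync completes, so a name printed here means "not fully redundant right now", not necessarily "disk dead"; `smartctl` remains the check for the disk itself.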

#### Critical Data Recovery Priority
1. **Vaultwarden database** - Password access
2. **Configuration files** - Service configs
3. **Media libraries** - Plex/Jellyfin content
4. **Personal data** - Photos, documents

### Authentication System Failure (Authentik)

#### Symptoms
- Cannot log into SSO-protected services
- Grafana, Portainer access denied
- Web services show authentication errors

#### Emergency Access
1. **Use Local Admin Accounts**
   ```bash
   # Portainer: Use local admin account
   # Grafana: Use admin/admin fallback
   # Direct service access via IP:port
   ```

2. **Bypass Authentication Temporarily**
   ```bash
   # Edit compose files to disable auth
   # Restart services without SSO
   # Fix Authentik issues
   # Re-enable authentication
   ```

### Database Corruption

#### PostgreSQL Recovery
```bash
# Stop all dependent services
docker stop service1 service2

# Back up the corrupted database before overwriting it
# (pg_dump may fail partway if the corruption is severe; check the output file)
docker exec postgres pg_dump -U user database > backup.sql

# Restore from a known-good backup
docker exec -i postgres psql -U user database < clean_backup.sql

# Restart services
docker start service1 service2
```
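
Before restoring, it is worth confirming that the backup file is a usable dump rather than an empty or truncated file. The following is a minimal sketch, assuming a plain-format (SQL text) dump, which `pg_dump` begins with a `-- PostgreSQL database dump` comment header; `looks_like_pg_dump` is a hypothetical helper, and it checks only the header, not the completeness of the data.

```shell
# looks_like_pg_dump FILE: succeed only if FILE is non-empty and its first
# lines carry the comment header that plain-format pg_dump output starts with.
looks_like_pg_dump() {
  local file="$1"
  [ -s "$file" ] && head -n 5 "$file" | grep -q 'PostgreSQL database dump'
}

# Example, using the file and container names from the block above:
# looks_like_pg_dump clean_backup.sql \
#   && docker exec -i postgres psql -U user database < clean_backup.sql
```

Custom-format dumps (`pg_dump -Fc`) are binary and would fail this check; they are restored with `pg_restore` instead of `psql`.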

#### Redis Recovery
```bash
# Stop Redis
docker stop redis

# Check data integrity
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb

# Restore from backup or start fresh
docker start redis
```

## 🛠️ Emergency Toolkit

### Essential Commands
```bash
# System status overview (batch-mode top, so the chain runs without user input)
top -b -n 1 | head -n 15 && df -h && docker ps

# Network connectivity test (-c limits the packet count so the second ping runs)
ping -c 4 8.8.8.8 && ping -c 4 google.com

# Service restart (replace service-name)
docker restart service-name

# Emergency container stop
docker stop $(docker ps -q)

# Emergency system reboot
sudo reboot
```

### Emergency Access Methods

#### SSH Access
```bash
# Direct IP access
ssh user@192.168.0.XXX

# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX

# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname
```
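
The direct and Tailscale routes above can be tried in order before falling back to the Cloudflare tunnel. The following is a minimal sketch, assuming bash (for its `/dev/tcp` redirection) and coreutils `timeout`; `first_reachable` and `ssh_port_open` are illustrative names. The tunnel route is not probed here because it does not expose port 22 directly.

```shell
# first_reachable PROBE HOST...: print the first HOST for which "PROBE HOST"
# succeeds; return 1 if none of them do.
first_reachable() {
  local probe="$1" host
  shift
  for host in "$@"; do
    if $probe "$host" >/dev/null 2>&1; then
      echo "$host"
      return 0
    fi
  done
  return 1
}

# ssh_port_open HOST: succeed if a TCP connection to HOST:22 opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so it is invoked through bash explicitly.
ssh_port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/22"' _ "$1"
}

# Example (fill in the real addresses for the XXX placeholders):
# target=$(first_reachable ssh_port_open 192.168.0.XXX 100.XXX.XXX.XXX) \
#   && ssh "user@$target"
```

Listing the LAN address first keeps traffic local when it works and only falls through to Tailscale when the host is not reachable directly.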

#### Web Interface Access
```bash
# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT

# Tailscale access
http://100.XXX.XXX.XXX:PORT

# Emergency port forwards
# Check router configuration for emergency access
```

### Emergency Configuration Files

#### Minimal Docker Compose
```yaml
# Emergency Portainer deployment
version: '3.8'
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped
volumes:
  portainer_data:
```

#### Emergency Nginx Config
```nginx
# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

## 📱 Communication During Emergencies

### Notification Channels
1. **ntfy** - If homelab services are partially functional
2. **Signal** - For critical alerts (if bridge is working)
3. **Email** - External email for status updates
4. **SMS** - For complete infrastructure failure

### Status Communication
```bash
# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC

# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
```
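
A small wrapper keeps log lines consistent and easy to sort when writing the incident timeline later. The following is a minimal sketch; `log_emergency` and the `EMERGENCY_LOG` variable are illustrative names, and the default path assumes the current user can write to `/var/log` (otherwise point `EMERGENCY_LOG` at a writable file or run as root).

```shell
# Log destination; override by exporting EMERGENCY_LOG before calling the function.
EMERGENCY_LOG=${EMERGENCY_LOG:-/var/log/emergency.log}

# log_emergency MESSAGE...: append one line with a UTC ISO-8601 timestamp.
log_emergency() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$EMERGENCY_LOG"
}

# Example:
# log_emergency "Router power cycled, WAN link restored"
```

UTC timestamps avoid ambiguity when entries from several hosts are merged into one timeline.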

## 🔄 Recovery Verification

### Post-Emergency Checklist
- [ ] All hosts responding to ping
- [ ] Critical services accessible
- [ ] Monitoring stack operational
- [ ] External access working
- [ ] Backup systems functional
- [ ] Security services active
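
Parts of this checklist can be scripted. The following is a minimal sketch; `run_checklist` is a hypothetical helper that reads `label|command` pairs from standard input and prints one checkbox line per item, and the commands in the usage comment are placeholders to adapt to the actual hosts and services.

```shell
# run_checklist: for each "label|command" line on stdin, run the command and
# print "- [x] label" on success or "- [ ] label" on failure.
# Returns 0 only if every command succeeded.
run_checklist() {
  local label cmd failed=0
  while IFS='|' read -r label cmd; do
    [ -n "$label" ] || continue
    # stdin is redirected so the command cannot swallow the remaining list
    if sh -c "$cmd" >/dev/null 2>&1 </dev/null; then
      printf '%s\n' "- [x] $label"
    else
      printf '%s\n' "- [ ] $label"
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example (placeholder commands):
# run_checklist <<'EOF'
# Router responding to ping|ping -c 1 -W 2 192.168.0.1
# Docker daemon reachable|docker ps
# EOF
```

The output uses the same checkbox syntax as the list above, so it can be pasted straight into an incident report.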

### Service Priority Recovery Order
1. **Network Infrastructure** (Router, switches, DNS)
2. **Storage Systems** (Synology, TrueNAS)
3. **Authentication** (Authentik, Vaultwarden)
4. **Monitoring** (Prometheus, Grafana)
5. **Core Services** (Portainer, reverse proxy)
6. **Media Services** (Plex, arr stack)
7. **Communication** (Matrix, Mastodon)
8. **Development** (Gitea, CI/CD)
9. **Optional Services** (Gaming, AI/ML)

## 📋 Emergency Documentation

### Quick Reference Cards
Keep printed copies of:
- Network diagram with IP addresses
- Critical service URLs and ports
- Emergency contact information
- Basic recovery commands

### Offline Access
- USB drive with critical configs
- Printed network documentation
- Mobile hotspot for internet access
- Laptop with SSH clients configured

## 🔍 Post-Emergency Analysis

### Incident Documentation
```bash
# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report

**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements

## Timeline
- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored

## Lessons Learned
- What worked well
- What could be improved
- Action items for prevention
EOF
```

### Improvement Actions
1. Update emergency procedures based on lessons learned
2. Test backup systems regularly
3. Improve monitoring and alerting
4. Document new failure scenarios
5. Update emergency contact information

---

*This document should be reviewed and updated after each emergency incident*