# Emergency Procedures
This document outlines emergency procedures for critical failures in the homelab infrastructure.
## 🚨 Emergency Contact Information
### Critical Service Access
- **Vaultwarden Emergency**: See [Offline Password Access](offline-password-access.md)
- **Network Emergency**: Router admin at `192.168.0.1` (admin/admin)
- **Power Emergency**: UPS management at `192.168.0.50`
### External Services
- **Cloudflare**: Dashboard access for DNS/tunnel management
- **Tailscale**: Admin console for mesh VPN recovery
- **Domain Registrar**: For DNS changes if Cloudflare fails
## 🔥 Critical Failure Scenarios
### Complete Network Failure
#### Symptoms
- No internet connectivity
- Cannot access local services
- Router/switch unresponsive
#### Immediate Actions
1. **Check Physical Connections**
```bash
# Check cable connections
# Verify power to router/switches
# Check UPS status
```
2. **Router Recovery**
```bash
# Power cycle router (30-second wait)
# Access router admin: http://192.168.0.1
# Check WAN connection status
# Verify DHCP is enabled
```
3. **Switch Recovery**
```bash
# Power cycle managed switches
# Check link lights on all ports
# Verify VLAN configuration if applicable
```
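The physical checks above can be backed by a quick interface sweep from any Linux host on the segment. A minimal sketch, assuming the `ip` tool from iproute2 is available:

```shell
# List interfaces the kernel reports as DOWN -- a fast first check
# before suspecting the router or switch hardware.
down_ifaces() {
  # Reads `ip -o link` output on stdin; prints the name of each DOWN interface.
  awk -F': ' '/state DOWN/ {print $2}'
}

ip -o link 2>/dev/null | down_ifaces
```

No output means every interface is at least administratively up; a listed interface points at a cable, port, or driver problem rather than upstream connectivity.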
#### Recovery Steps
1. Restore basic internet connectivity
2. Verify internal network communication
3. Restart critical services in order (see [Service Dependencies](../services/dependencies.md))
4. Test external access through port forwards
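Steps 1–2 can be run as a repeatable sweep. A sketch; the host list is illustrative (gateway first, then an external reference), so substitute your own IPs:

```shell
# Ping each recovery-critical host once and report the result.
report() {
  # report <host> <exit-status>: prints OK/FAIL based on the ping result.
  if [ "$2" -eq 0 ]; then echo "OK   $1"; else echo "FAIL $1"; fi
}

HOSTS="192.168.0.1 8.8.8.8"   # example: gateway, then external reference
for h in $HOSTS; do
  ping -c 1 -W 2 "$h" > /dev/null 2>&1
  report "$h" "$?"
done
```

If the gateway answers but the external host does not, the problem is the WAN link rather than the internal network.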
### Power Outage Recovery
#### During Outage
- UPS should maintain critical systems for 15-30 minutes
- Graceful shutdown sequence will be triggered automatically
- Monitor UPS status via web interface if accessible
#### After Power Restoration
1. **Wait for Network Stability** (5 minutes)
2. **Start Core Infrastructure**
```bash
# Synology NAS systems (auto-start enabled)
# Router and switches (auto-start)
# Internet connection verification
```
3. **Start Host Systems in Order**
- Proxmox hosts
- Physical machines (Anubis, Guava, Concord NUC)
- Raspberry Pi devices
4. **Verify Service Health**
```bash
# Check Portainer endpoints
# Verify monitoring stack
# Test critical services (Plex, Vaultwarden, etc.)
```
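The container checks in step 4 can be scripted. A hedged sketch -- the container names below are examples drawn from this document, not a definitive list:

```shell
# Report the state of each critical container
# ("missing" if it cannot be queried or does not exist).
container_state() {
  docker inspect -f '{{.State.Status}}' "$1" 2>/dev/null || echo "missing"
}

for c in portainer vaultwarden plex; do
  echo "$c: $(container_state "$c")"
done
```

Anything other than `running` after the startup sequence warrants a look at that container's logs before moving down the priority list.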
### Storage System Failure
#### Synology NAS Failure
```bash
# Check RAID status
cat /proc/mdstat
# Check disk health
smartctl -a /dev/sda
# Emergency data recovery
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups
```
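The per-disk health check above can be swept across all drives at once. A sketch, assuming smartmontools is installed; the `/dev/sd?` glob is illustrative and may need widening for NVMe or additional bays:

```shell
# Summarize the overall SMART verdict for each disk.
health_line() {
  # Extract PASSED/FAILED from `smartctl -H` output on stdin.
  grep -Eo 'PASSED|FAILED' | head -n 1
}

for d in /dev/sd?; do
  echo "$d: $(smartctl -H "$d" 2>/dev/null | health_line)"
done
```

Any `FAILED` verdict means that disk goes to the top of the data-recovery priority list below.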
#### Critical Data Recovery Priority
1. **Vaultwarden database** - Password access
2. **Configuration files** - Service configs
3. **Media libraries** - Plex/Jellyfin content
4. **Personal data** - Photos, documents
### Authentication System Failure (Authentik)
#### Symptoms
- Cannot log into SSO-protected services
- Grafana, Portainer access denied
- Web services show authentication errors
#### Emergency Access
1. **Use Local Admin Accounts**
```bash
# Portainer: Use local admin account
# Grafana: Use admin/admin fallback
# Direct service access via IP:port
```
2. **Bypass Authentication Temporarily**
```bash
# Edit compose files to disable auth
# Restart services without SSO
# Fix Authentik issues
# Re-enable authentication
```
### Database Corruption
#### PostgreSQL Recovery
```bash
# Stop all dependent services
docker stop service1 service2
# Attempt to dump the damaged database for forensics (may fail if corruption is severe)
docker exec postgres pg_dump -U user database > backup.sql
# Restore from a known-good backup
docker exec -i postgres psql -U user database < clean_backup.sql
# Restart services
docker start service1 service2
```
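Before restoring, it is worth confirming that the backup you are about to trust is plausible. A minimal sketch, assuming a plain-format `pg_dump` SQL file (such files contain `CREATE` and `COPY` statements):

```shell
# A plain pg_dump file should be non-empty and contain CREATE/COPY statements.
dump_ok() {
  [ -s "$1" ] && grep -qE '^(CREATE|COPY) ' "$1"
}

if dump_ok clean_backup.sql; then
  echo "backup looks plausible"
else
  echo "backup looks empty or truncated -- do not restore from it"
fi
```

This catches the common failure mode of a scheduled backup job silently producing empty or truncated files.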
#### Redis Recovery
```bash
# Stop Redis
docker stop redis
# Check RDB snapshot integrity (use redis-check-aof instead if AOF persistence is enabled)
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb
# Restore from backup or start fresh
docker start redis
```
## 🛠️ Emergency Toolkit
### Essential Commands
```bash
# System status overview
htop && df -h && docker ps
# Network connectivity test
ping -c 3 8.8.8.8 && ping -c 3 google.com
# Service restart (replace service-name)
docker restart service-name
# Emergency container stop
docker stop $(docker ps -q)
# Emergency system reboot
sudo reboot
```
### Emergency Access Methods
#### SSH Access
```bash
# Direct IP access
ssh user@192.168.0.XXX
# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX
# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname
```
#### Web Interface Access
```bash
# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT
# Tailscale access
http://100.XXX.XXX.XXX:PORT
# Emergency port forwards
# Check router configuration for emergency access
```
### Emergency Configuration Files
#### Minimal Docker Compose
```yaml
# Emergency Portainer deployment
version: '3.8'
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped
volumes:
  portainer_data:
```
#### Emergency Nginx Config
```nginx
# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
## 📱 Communication During Emergencies
### Notification Channels
1. **ntfy** - If homelab services are partially functional
2. **Signal** - For critical alerts (if bridge is working)
3. **Email** - External email for status updates
4. **SMS** - For complete infrastructure failure
### Status Communication
```bash
# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC
# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
```
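If ntfy itself is unreachable, a small fallback wrapper keeps a local record of unsent updates so the timeline can be reconstructed later. A sketch; the URL and log path are placeholders, not the real topic:

```shell
# Try ntfy first; if the server is unreachable, append the message to a local log.
NTFY_URL="${NTFY_URL:-https://ntfy.example/emergency}"   # placeholder topic
LOGFILE="${LOGFILE:-/tmp/emergency.log}"

notify_or_log() {
  if ! curl -fsS -m 5 -d "$1" "$NTFY_URL" > /dev/null 2>&1; then
    echo "$(date): UNSENT: $1" >> "$LOGFILE"
  fi
}

notify_or_log "Emergency: recovery in progress"
```

The 5-second timeout keeps the wrapper from hanging when DNS or the WAN link is also down.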
## 🔄 Recovery Verification
### Post-Emergency Checklist
- [ ] All hosts responding to ping
- [ ] Critical services accessible
- [ ] Monitoring stack operational
- [ ] External access working
- [ ] Backup systems functional
- [ ] Security services active
### Service Priority Recovery Order
1. **Network Infrastructure** (Router, switches, DNS)
2. **Storage Systems** (Synology, TrueNAS)
3. **Authentication** (Authentik, Vaultwarden)
4. **Monitoring** (Prometheus, Grafana)
5. **Core Services** (Portainer, reverse proxy)
6. **Media Services** (Plex, arr stack)
7. **Communication** (Matrix, Mastodon)
8. **Development** (Gitea, CI/CD)
9. **Optional Services** (Gaming, AI/ML)
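Working down this list is easier with a retry helper that blocks until a tier's health check passes before starting the next tier. A sketch; the gateway IP in the example is illustrative:

```shell
# wait_for <tries> <delay-seconds> <command...>
# Retries the command until it succeeds, or gives up after <tries> attempts.
wait_for() {
  tries=$1; delay=$2; shift 2
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep "$delay"
  done
}

# Example: do not start a later tier until the network tier answers pings.
wait_for 3 1 ping -c 1 -W 1 192.168.0.1 > /dev/null 2>&1 \
  && echo "network tier reachable" \
  || echo "network tier still down after 3 attempts"
```

Gating each tier this way prevents the common mistake of starting services whose dependencies (storage, auth) are not actually healthy yet.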
## 📋 Emergency Documentation
### Quick Reference Cards
Keep printed copies of:
- Network diagram with IP addresses
- Critical service URLs and ports
- Emergency contact information
- Basic recovery commands
### Offline Access
- USB drive with critical configs
- Printed network documentation
- Mobile hotspot for internet access
- Laptop with SSH clients configured
## 🔍 Post-Emergency Analysis
### Incident Documentation
```bash
# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report
**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements
## Timeline
- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored
## Lessons Learned
- What worked well
- What could be improved
- Action items for prevention
EOF
```
### Improvement Actions
1. Update emergency procedures based on lessons learned
2. Test backup systems regularly
3. Improve monitoring and alerting
4. Document new failure scenarios
5. Update emergency contact information
---
*This document should be reviewed and updated after each emergency incident*