# Emergency Procedures

This document outlines emergency procedures for critical failures in the homelab infrastructure.
## 🚨 Emergency Contact Information

### Critical Service Access

- **Vaultwarden Emergency**: See [Offline Password Access](offline-password-access.md)
- **Network Emergency**: Router admin at `192.168.0.1` (admin/admin)
- **Power Emergency**: UPS management at `192.168.0.50`

### External Services

- **Cloudflare**: Dashboard access for DNS/tunnel management
- **Tailscale**: Admin console for mesh VPN recovery
- **Domain Registrar**: For DNS changes if Cloudflare fails
## 🔥 Critical Failure Scenarios

### Complete Network Failure

#### Symptoms

- No internet connectivity
- Cannot access local services
- Router/switch unresponsive

#### Immediate Actions

1. **Check Physical Connections**

   ```bash
   # Check cable connections
   # Verify power to router/switches
   # Check UPS status
   ```

2. **Router Recovery**

   ```bash
   # Power cycle router (30-second wait)
   # Access router admin: http://192.168.0.1
   # Check WAN connection status
   # Verify DHCP is enabled
   ```

3. **Switch Recovery**

   ```bash
   # Power cycle managed switches
   # Check link lights on all ports
   # Verify VLAN configuration if applicable
   ```

#### Recovery Steps

1. Restore basic internet connectivity
2. Verify internal network communication
3. Restart critical services in order (see [Service Dependencies](../services/dependencies.md))
4. Test external access through port forwards
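Step 2 of the recovery sequence can be scripted as a quick reachability sweep. A minimal sketch; the example IPs in the comment are placeholders for your own gateway, UPS, and hosts:

```shell
#!/usr/bin/env bash
# Report reachability for each host passed as an argument.

check_host() {
  # One ping, 2-second timeout; success means the host answered.
  if ping -c 1 -W 2 "$1" > /dev/null 2>&1; then
    echo "$1 reachable"
  else
    echo "$1 unreachable"
  fi
}

sweep() {
  local host
  for host in "$@"; do
    check_host "$host"
  done
}

# Example (placeholder IPs): sweep 192.168.0.1 192.168.0.50
```

Run it against the router first; if the gateway itself is unreachable, skip straight to the physical checks above.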
### Power Outage Recovery

#### During Outage

- UPS should maintain critical systems for 15-30 minutes
- Graceful shutdown sequence will be triggered automatically
- Monitor UPS status via web interface if accessible

#### After Power Restoration

1. **Wait for Network Stability** (5 minutes)

2. **Start Core Infrastructure**

   ```bash
   # Synology NAS systems (auto-start enabled)
   # Router and switches (auto-start)
   # Internet connection verification
   ```

3. **Start Host Systems in Order**

   - Proxmox hosts
   - Physical machines (Anubis, Guava, Concord NUC)
   - Raspberry Pi devices

4. **Verify Service Health**

   ```bash
   # Check Portainer endpoints
   # Verify monitoring stack
   # Test critical services (Plex, Vaultwarden, etc.)
   ```
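The service-health step can be automated by polling each endpoint until it answers. A sketch, assuming the services expose plain HTTP endpoints; the example URL is a placeholder:

```shell
#!/usr/bin/env bash
# Poll a URL until it responds, or give up after N tries (default 5).

wait_healthy() {
  local url="$1" tries="${2:-5}" i
  for ((i = 1; i <= tries; i++)); do
    # -f: fail on HTTP errors; -m 3: 3-second timeout per attempt.
    if curl -fsS -m 3 "$url" > /dev/null 2>&1; then
      echo "healthy: $url"
      return 0
    fi
    sleep 1
  done
  echo "unhealthy: $url"
  return 1
}

# Example (placeholder URL): wait_healthy http://192.168.0.10:9000
```

Services still booting after a power cut can take minutes to answer, so raise the retry count for slow starters rather than declaring them dead early.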
### Storage System Failure

#### Synology NAS Failure

```bash
# Check RAID status
cat /proc/mdstat

# Check disk health
smartctl -a /dev/sda

# Emergency data recovery
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups
```

#### Critical Data Recovery Priority

1. **Vaultwarden database** - Password access
2. **Configuration files** - Service configs
3. **Media libraries** - Plex/Jellyfin content
4. **Personal data** - Photos, documents
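The priority list above can be encoded as data so the copy order is never improvised under pressure. A dry-run sketch that only prints the commands; every path, including the rescue mount point, is a placeholder for the actual share layout:

```shell
#!/usr/bin/env bash
# Print recovery copy commands in priority order (dry run).
# Source paths and /mnt/rescue are placeholders.

PRIORITIES=(
  "/volume1/docker/vaultwarden"   # 1. password database
  "/volume1/docker/configs"       # 2. service configs
  "/volume1/media"                # 3. media libraries
  "/volume1/photos"               # 4. personal data
)

for src in "${PRIORITIES[@]}"; do
  echo "rsync -a ${src}/ /mnt/rescue${src}/"
done
```

Pipe the output through `sh` only after eyeballing it; keeping the script as a printer makes it safe to run on a half-broken system.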
### Authentication System Failure (Authentik)

#### Symptoms

- Cannot log into SSO-protected services
- Grafana, Portainer access denied
- Web services show authentication errors

#### Emergency Access

1. **Use Local Admin Accounts**

   ```bash
   # Portainer: Use local admin account
   # Grafana: Use admin/admin fallback
   # Direct service access via IP:port
   ```

2. **Bypass Authentication Temporarily**

   ```bash
   # Edit compose files to disable auth
   # Restart services without SSO
   # Fix Authentik issues
   # Re-enable authentication
   ```
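One low-risk way to do the temporary bypass is a compose override file rather than editing the main compose file: it publishes the container port directly and is trivially reverted by deleting one file. A hypothetical sketch — the service name and port are assumptions, not the actual stack layout:

```yaml
# docker-compose.override.yml - temporary emergency access (assumed service/port).
# Delete this file and redeploy once Authentik is healthy again.
services:
  grafana:
    ports:
      - "3000:3000"   # direct access at http://HOST-IP:3000, bypassing SSO
```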
### Database Corruption

#### PostgreSQL Recovery

```bash
# Stop all dependent services
docker stop service1 service2

# Backup corrupted database
docker exec postgres pg_dump -U user database > backup.sql

# Restore from backup
docker exec -i postgres psql -U user database < clean_backup.sql

# Restart services
docker start service1 service2
```

#### Redis Recovery

```bash
# Stop Redis
docker stop redis

# Check data integrity
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb

# Restore from backup or start fresh
docker start redis
```
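The PostgreSQL steps can be wrapped so a snapshot is always taken before the restore overwrites anything, and the restore aborts on the first failure. A hedged sketch; the container, user, database, and file names are placeholders matching the example commands above:

```shell
#!/usr/bin/env bash
# Snapshot the current database, then restore over it; stop on first failure.
# All arguments are placeholders - adjust for the real stack.

restore_pg() {
  local container="$1" user="$2" db="$3" backup="$4"
  local snapshot
  snapshot="pre_restore_$(date +%Y%m%d%H%M%S).sql"
  # Keep a copy of whatever is there now, even if it is corrupted.
  docker exec "$container" pg_dump -U "$user" "$db" > "$snapshot" || return 1
  # Overwrite with the known-good backup.
  docker exec -i "$container" psql -U "$user" "$db" < "$backup" || return 1
  echo "restored $db from $backup (snapshot: $snapshot)"
}

# Example: restore_pg postgres user database clean_backup.sql
```

Note that `pg_dump` against a badly corrupted database may itself fail; in that case fall back to copying the raw data directory before restoring.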
## 🛠️ Emergency Toolkit

### Essential Commands

```bash
# System status overview
htop && df -h && docker ps

# Network connectivity test (raw IP first, then DNS resolution)
ping -c 4 8.8.8.8 && ping -c 4 google.com

# Service restart (replace service-name)
docker restart service-name

# Emergency container stop
docker stop $(docker ps -q)

# Emergency system reboot
sudo reboot
```
### Emergency Access Methods

#### SSH Access

```bash
# Direct IP access
ssh user@192.168.0.XXX

# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX

# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname
```

#### Web Interface Access

```bash
# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT

# Tailscale access
http://100.XXX.XXX.XXX:PORT

# Emergency port forwards
# Check router configuration for emergency access
```
### Emergency Configuration Files

#### Minimal Docker Compose

```yaml
# Emergency Portainer deployment
version: '3.8'
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped
volumes:
  portainer_data:
```

#### Emergency Nginx Config

```nginx
# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
## 📱 Communication During Emergencies

### Notification Channels

1. **ntfy** - If homelab services are partially functional
2. **Signal** - For critical alerts (if bridge is working)
3. **Email** - External email for status updates
4. **SMS** - For complete infrastructure failure

### Status Communication

```bash
# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC

# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
```
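The ad-hoc logging `echo` can be wrapped in a small helper so every entry gets a consistent timestamp. The default path matches `/var/log/emergency.log` above; the `EMERGENCY_LOG` override variable is a suggested convention, not something the rest of the stack relies on:

```shell
#!/usr/bin/env bash
# Append a timestamped entry to the emergency log.
# EMERGENCY_LOG is a hypothetical override for the default path.

log_emergency() {
  local logfile="${EMERGENCY_LOG:-/var/log/emergency.log}"
  printf '%s: %s\n' "$(date)" "$*" >> "$logfile"
}

# Example: log_emergency "Power restored, starting Proxmox hosts"
```

A consistent log makes the post-incident timeline below much easier to reconstruct.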
## 🔄 Recovery Verification

### Post-Emergency Checklist

- [ ] All hosts responding to ping
- [ ] Critical services accessible
- [ ] Monitoring stack operational
- [ ] External access working
- [ ] Backup systems functional
- [ ] Security services active

### Service Priority Recovery Order

1. **Network Infrastructure** (Router, switches, DNS)
2. **Storage Systems** (Synology, TrueNAS)
3. **Authentication** (Authentik, Vaultwarden)
4. **Monitoring** (Prometheus, Grafana)
5. **Core Services** (Portainer, reverse proxy)
6. **Media Services** (Plex, arr stack)
7. **Communication** (Matrix, Mastodon)
8. **Development** (Gitea, CI/CD)
9. **Optional Services** (Gaming, AI/ML)
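The container tiers of this order (network and storage come up by hand first) can be encoded as data so restarts happen tier by tier. A dry-run sketch that only prints the commands; the service-to-tier mapping is illustrative, not a complete inventory:

```shell
#!/usr/bin/env bash
# Print restart commands grouped by recovery tier (dry run).
# Service names per tier are illustrative placeholders.

TIERS=(
  "authentication:authentik vaultwarden"
  "monitoring:prometheus grafana"
  "core:portainer nginx-proxy"
  "media:plex sonarr radarr"
)

for tier in "${TIERS[@]}"; do
  name="${tier%%:*}"        # text before the colon: tier label
  for svc in ${tier#*:}; do # text after the colon: space-separated services
    echo "docker restart $svc  # tier: $name"
  done
done
```

Keeping it a printer means you can review the plan, then paste commands one tier at a time while watching the monitoring stack come back.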
## 📋 Emergency Documentation

### Quick Reference Cards

Keep printed copies of:

- Network diagram with IP addresses
- Critical service URLs and ports
- Emergency contact information
- Basic recovery commands

### Offline Access

- USB drive with critical configs
- Printed network documentation
- Mobile hotspot for internet access
- Laptop with SSH clients configured
## 🔍 Post-Emergency Analysis

### Incident Documentation

```bash
# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report

**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements

## Timeline
- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored

## Lessons Learned
- What worked well
- What could be improved
- Action items for prevention
EOF
```

### Improvement Actions

1. Update emergency procedures based on lessons learned
2. Test backup systems regularly
3. Improve monitoring and alerting
4. Document new failure scenarios
5. Update emergency contact information

---

*This document should be reviewed and updated after each emergency incident.*