
Emergency Procedures

This document outlines emergency procedures for critical failures in the homelab infrastructure.

🚨 Emergency Contact Information

Critical Service Access

  • Vaultwarden Emergency: See Offline Password Access
  • Network Emergency: Router admin at 192.168.0.1 (admin/admin)
  • Power Emergency: UPS management at 192.168.0.50

External Services

  • Cloudflare: Dashboard access for DNS/tunnel management
  • Tailscale: Admin console for mesh VPN recovery
  • Domain Registrar: For DNS changes if Cloudflare fails

🔥 Critical Failure Scenarios

Complete Network Failure

Symptoms

  • No internet connectivity
  • Cannot access local services
  • Router/switch unresponsive

Immediate Actions

  1. Check Physical Connections

    # Check cable connections
    # Verify power to router/switches
    # Check UPS status
    
  2. Router Recovery

    # Power cycle router (30-second wait)
    # Access router admin: http://192.168.0.1
    # Check WAN connection status
    # Verify DHCP is enabled
    
  3. Switch Recovery

    # Power cycle managed switches
    # Check link lights on all ports
    # Verify VLAN configuration if applicable
    

Recovery Steps

  1. Restore basic internet connectivity
  2. Verify internal network communication
  3. Restart critical services in order (see Service Dependencies)
  4. Test external access through port forwards
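Recovery steps 1–2 (internet and internal connectivity) can be verified with a quick reachability sweep; a minimal sketch, assuming the 192.168.0.1 gateway used elsewhere in this doc:

```shell
#!/bin/sh
# Quick reachability triage: gateway first, then raw internet, then DNS.
# GATEWAY matches the router address documented above.
GATEWAY="${GATEWAY:-192.168.0.1}"

check_host() {
    # Single ping with a 2-second timeout; prints "up" or "down"
    if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
        echo "up"
    else
        echo "down"
    fi
}

echo "gateway  ($GATEWAY): $(check_host "$GATEWAY")"
echo "internet (8.8.8.8): $(check_host 8.8.8.8)"
echo "dns      (google.com): $(check_host google.com)"
```

If the gateway is down but the host's own interface is up, the fault sits between the host and the router (cable, switch, or the router itself); if the gateway and 8.8.8.8 are up but google.com is down, the problem is DNS, not connectivity.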

Power Outage Recovery

During Outage

  • UPS should maintain critical systems for 15-30 minutes
  • Graceful shutdown sequence will be triggered automatically
  • Monitor UPS status via web interface if accessible

After Power Restoration

  1. Wait for Network Stability (5 minutes)

  2. Start Core Infrastructure

    # Synology NAS systems (auto-start enabled)
    # Router and switches (auto-start)
    # Internet connection verification
    
  3. Start Host Systems in Order

    • Proxmox hosts
    • Physical machines (Anubis, Guava, Concord NUC)
    • Raspberry Pi devices
  4. Verify Service Health

    # Check Portainer endpoints
    # Verify monitoring stack
    # Test critical services (Plex, Vaultwarden, etc.)
    
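The health checks in step 4 can be scripted as a simple sweep; a sketch in which every URL is a placeholder to replace with your real endpoints (Portainer, Grafana, Plex, etc.):

```shell
#!/bin/sh
# Post-restoration health sweep. All URLs below are placeholders --
# substitute your actual service endpoints.
ENDPOINTS="
http://192.168.0.10:9000
http://192.168.0.10:3000
http://192.168.0.10:32400/web
"

check_endpoint() {
    # -sf: silent, fail on HTTP errors; --max-time keeps a dead host
    # from hanging the whole sweep
    if curl -sf --max-time 5 "$1" >/dev/null 2>&1; then
        echo "OK   $1"
    else
        echo "FAIL $1"
    fi
}

for url in $ENDPOINTS; do
    check_endpoint "$url"
done
```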

Storage System Failure

Synology NAS Failure

# Check RAID status
cat /proc/mdstat

# Check disk health
smartctl -a /dev/sda

# Emergency data recovery
# 1. Stop all Docker containers
# 2. Mount drives on another system
# 3. Copy critical data
# 4. Restore from backups

Critical Data Recovery Priority

  1. Vaultwarden database - Password access
  2. Configuration files - Service configs
  3. Media libraries - Plex/Jellyfin content
  4. Personal data - Photos, documents
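The priority list above can be encoded in a rescue script so nothing gets copied out of order; a sketch that only prints the rsync commands for review (source root, destination, and share names are assumptions; remove the `echo` to execute):

```shell
#!/bin/sh
# Prints copy commands in priority order; remove `echo` to actually run them.
# SRC_ROOT, DEST, and the share names are placeholders for your layout.
SRC_ROOT="/volume1"
DEST="/mnt/rescue"

rescue() {
    # -a preserves permissions/timestamps; --ignore-errors pushes past bad sectors
    echo "rsync -a --ignore-errors $SRC_ROOT/$1/ $DEST/$1/"
}

# Highest-value data first: if the disk dies mid-copy, passwords are already safe
for share in vaultwarden configs media photos; do
    rescue "$share"
done
```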

Authentication System Failure (Authentik)

Symptoms

  • Cannot log into SSO-protected services
  • Grafana, Portainer access denied
  • Web services show authentication errors

Emergency Access

  1. Use Local Admin Accounts

    # Portainer: Use local admin account
    # Grafana: Use admin/admin fallback
    # Direct service access via IP:port
    
  2. Bypass Authentication Temporarily

    # Edit compose files to disable auth
    # Restart services without SSO
    # Fix Authentik issues
    # Re-enable authentication
    
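Step 2 can be done with a compose override instead of editing the main file, which makes the change trivial to revert; a hypothetical example that publishes Grafana directly on the host, bypassing the SSO-protecting proxy (service name and port are assumptions):

```shell
#!/bin/sh
# Writes a temporary override exposing Grafana directly on the host.
# Delete docker-compose.override.yml to revert once Authentik is fixed.
cat > docker-compose.override.yml << 'EOF'
services:
  grafana:
    ports:
      - "3000:3000"
EOF
# Apply with: docker compose up -d grafana
# Then reach Grafana at http://HOST_IP:3000 using the local admin account
```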

Database Corruption

PostgreSQL Recovery

# Stop all dependent services
docker stop service1 service2

# Attempt a dump of the corrupted database (pg_dump may fail if corruption is severe)
docker exec postgres pg_dump -U user database > backup.sql

# Restore from backup
docker exec -i postgres psql -U user database < clean_backup.sql

# Restart services
docker start service1 service2
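The same sequence can be wrapped in one guarded script so a failed step aborts the rest; a sketch reusing the placeholder names above (the function is defined but not run until you uncomment the call):

```shell
#!/bin/sh
# Guarded version of the recovery sequence above; placeholder names throughout.
DEPENDENTS="service1 service2"

backup_name() {
    # Timestamped filename so repeated attempts never overwrite an earlier dump
    echo "pg_backup_$(date +%Y%m%d_%H%M%S).sql"
}

restore_postgres() {
    set -e                     # abort the sequence on the first failure
    docker stop $DEPENDENTS
    docker exec postgres pg_dump -U user database > "$(backup_name)"
    docker exec -i postgres psql -U user database < clean_backup.sql
    docker start $DEPENDENTS
}

# restore_postgres   # uncomment to run
```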

Redis Recovery

# Stop Redis
docker stop redis

# Check data integrity
docker run --rm -v redis_data:/data redis redis-check-rdb /data/dump.rdb

# Restore from backup or start fresh
docker start redis

🛠️ Emergency Toolkit

Essential Commands

# System status overview (run htop separately; it is interactive)
uptime && df -h && docker ps

# Network connectivity test (-c 4 stops after four pings)
ping -c 4 8.8.8.8 && ping -c 4 google.com

# Service restart (replace service-name)
docker restart service-name

# Emergency container stop
docker stop $(docker ps -q)

# Emergency system reboot
sudo reboot

Emergency Access Methods

SSH Access

# Direct IP access
ssh user@192.168.0.XXX

# Tailscale access (if available)
ssh user@100.XXX.XXX.XXX

# Cloudflare tunnel access
ssh -o ProxyCommand='cloudflared access ssh --hostname %h' user@hostname

Web Interface Access

# Direct IP access (bypass DNS)
http://192.168.0.XXX:PORT

# Tailscale access
http://100.XXX.XXX.XXX:PORT

# Emergency port forwards
# Check router configuration for emergency access

Emergency Configuration Files

Minimal Docker Compose

# Emergency Portainer deployment
version: '3.8'
services:
  portainer:
    image: portainer/portainer-ce:latest
    ports:
      - "9000:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    restart: unless-stopped
volumes:
  portainer_data:

Emergency Nginx Config

# Basic reverse proxy for emergency access
server {
    listen 80;
    server_name _;
    
    location / {
        proxy_pass http://backend-service:port;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

📱 Communication During Emergencies

Notification Channels

  1. ntfy - If homelab services are partially functional
  2. Signal - For critical alerts (if bridge is working)
  3. Email - External email for status updates
  4. SMS - For complete infrastructure failure

Status Communication

# Send status update via ntfy
curl -d "Emergency: System status update" ntfy.vish.gg/REDACTED_NTFY_TOPIC

# Log emergency actions
echo "$(date): Emergency action taken" >> /var/log/emergency.log
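The two commands above combine naturally into one helper that logs first and pushes second, so the local record survives even when the network does not; a sketch (the ntfy topic is the redacted one from this doc, and the log path can be overridden):

```shell
#!/bin/sh
# Log-then-notify helper. NTFY_URL keeps the redacted topic from this doc.
NTFY_URL="${NTFY_URL:-https://ntfy.vish.gg/REDACTED_NTFY_TOPIC}"
LOGFILE="${LOGFILE:-/var/log/emergency.log}"

emergency_note() {
    # Local log first; failures (e.g. unwritable path) are tolerated
    echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" >> "$LOGFILE" 2>/dev/null || true
    # Best-effort push; failures are ignored so recovery work continues
    curl -sf -d "$1" "$NTFY_URL" >/dev/null 2>&1 || true
}

emergency_note "Emergency: recovery procedures started"
```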

🔄 Recovery Verification

Post-Emergency Checklist

  • All hosts responding to ping
  • Critical services accessible
  • Monitoring stack operational
  • External access working
  • Backup systems functional
  • Security services active

Service Priority Recovery Order

  1. Network Infrastructure (Router, switches, DNS)
  2. Storage Systems (Synology, TrueNAS)
  3. Authentication (Authentik, Vaultwarden)
  4. Monitoring (Prometheus, Grafana)
  5. Core Services (Portainer, reverse proxy)
  6. Media Services (Plex, arr stack)
  7. Communication (Matrix, Mastodon)
  8. Development (Gitea, CI/CD)
  9. Optional Services (Gaming, AI/ML)
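The order above can be driven by a small loop so nothing is started out of sequence; a sketch with assumed stack names (swap in your real compose project names and an actual start command):

```shell
#!/bin/sh
# Start stacks strictly in the priority order documented above.
# Stack names and the start command are assumptions for illustration.
PRIORITY="network storage auth monitoring core media communication development optional"

start_stack() {
    echo "starting: $1"
    # e.g. docker compose -f "/opt/stacks/$1/compose.yml" up -d
}

for stack in $PRIORITY; do
    start_stack "$stack"
    # In practice, gate each step on a health check before moving on
done
```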

📋 Emergency Documentation

Quick Reference Cards

Keep printed copies of:

  • Network diagram with IP addresses
  • Critical service URLs and ports
  • Emergency contact information
  • Basic recovery commands

Offline Access

  • USB drive with critical configs
  • Printed network documentation
  • Mobile hotspot for internet access
  • Laptop with SSH clients configured

🔍 Post-Emergency Analysis

Incident Documentation

# Create incident report
cat > incident_$(date +%Y%m%d).md << EOF
# Emergency Incident Report

**Date**: $(date)
**Duration**: X hours
**Affected Services**: List services
**Root Cause**: Description
**Resolution**: Steps taken
**Prevention**: Future improvements

## Timeline
- HH:MM - Issue detected
- HH:MM - Emergency procedures initiated
- HH:MM - Service restored

## Lessons Learned
- What worked well
- What could be improved
- Action items for prevention
EOF

Improvement Actions

  1. Update emergency procedures based on lessons learned
  2. Test backup systems regularly
  3. Improve monitoring and alerting
  4. Document new failure scenarios
  5. Update emergency contact information

This document should be reviewed and updated after each emergency incident.