# Watchtower Emergency Procedures

## 🚨 Emergency Response Guide

This document provides step-by-step procedures for diagnosing and fixing Watchtower issues across your homelab infrastructure.

## 📊 Current Status (Last Updated: 2026-02-09)

### Endpoint Status Summary

| Endpoint | Status | Port | Notification URL | Notes |
|----------|--------|------|------------------|-------|
| **Calypso** | 🟢 HEALTHY | 8080 | `generic+http://localhost:8081/updates` | Fixed crash loop |
| **Atlantis** | 🟢 HEALTHY | 8081 | `generic+http://localhost:8082/updates` | Fixed port conflict |
| **vish-concord-nuc** | 🟢 HEALTHY | 8080 | None configured | Stable for 2+ weeks |
| **rpi5** | ❌ NOT DEPLOYED | - | - | Consider deployment |
| **Homelab VM** | ⚠️ OFFLINE | - | - | Endpoint unreachable |

## 🔧 Emergency Fix Scripts

### Quick Status Check

```bash
# Run comprehensive status check
./scripts/check-watchtower-status.sh
```

### Emergency Crash Loop Fix

```bash
# Fix notification URL format issues
./scripts/portainer-fix-v2.sh
```

### Port Conflict Resolution

```bash
# Fix port conflicts (Atlantis specific)
./scripts/fix-atlantis-port.sh
```

## 🚨 Common Issues and Solutions

### Issue 1: Crash Loop with "unknown service 'http'" Error

**Symptoms:**

```
level=fatal msg="Failed to initialize Shoutrrr notifications: error initializing router services: unknown service \"http\""
```

**Root Cause:** Invalid Shoutrrr notification URL format

**Solution:**

```bash
# WRONG FORMAT:
WATCHTOWER_NOTIFICATION_URL=ntfy://localhost:8081/updates?insecure=yes

# CORRECT FORMAT:
WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates
```

**Emergency Fix:**

1. Stop the crash-looping container
2. Remove the broken container
3. Recreate it with the correct notification URL format
4. Start the new container

### Issue 2: Port Conflict (Address Already in Use)

**Symptoms:**

```
Error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use
```

**Solution:**

1. Identify the conflicting service on port 8080
2. Use an alternative port (8081, 8082, etc.)
3. Update the port mapping in the container configuration

**Emergency Fix:**

```bash
# HostConfig fragment: publish container port 8080 on host port 8081 instead
"PortBindings": {"8080/tcp": [{"HostPort": "8081"}]}
```

### Issue 3: Notification Service Connection Refused

**Symptoms:**

```
error="Post \"http://localhost:8081/updates\": dial tcp 127.0.0.1:8081: connect: connection refused"
```

**Root Cause:** ntfy service not running on the target port

**Solutions:**

1. **Deploy ntfy service locally:**

   ```yaml
   # hosts/[hostname]/ntfy.yaml
   version: '3.8'
   services:
     ntfy:
       image: binwiederhier/ntfy
       ports:
         - "8081:80"
       command: serve
       volumes:
         - ntfy-data:/var/lib/ntfy

   # Named volumes must be declared at the top level
   volumes:
     ntfy-data:
   ```

2. **Use external ntfy service:**

   ```bash
   WATCHTOWER_NOTIFICATION_URL=ntfy://ntfy.vish.gg/REDACTED_NTFY_TOPIC
   ```

3. **Disable notifications temporarily:**

   ```bash
   # Remove notification environment variables
   # (recreate the container afterwards so the change takes effect)
   unset WATCHTOWER_NOTIFICATIONS
   unset WATCHTOWER_NOTIFICATION_URL
   ```

## 🔍 Diagnostic Commands

### Check Container Status

```bash
# Via Portainer API
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/json" | \
  jq '.[] | select(.Names[]? | contains("watchtower"))'
```

### View Container Logs

```bash
# Last 50 lines
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/$CONTAINER_ID/logs?stdout=true&stderr=true&tail=50"
```

### Check Port Usage

```bash
# SSH to host and check port usage
netstat -tulpn | grep :8080
lsof -i :8080
```

### Verify Notification Service

```bash
# Test ntfy service
curl -d "Test message" http://localhost:8081/updates
```

## 🛠️ Manual Recovery Procedures

### Complete Watchtower Rebuild

1. **Stop and remove existing container:**

   ```bash
   docker stop watchtower
   docker rm watchtower
   ```

2. **Pull latest image:**

   ```bash
   docker pull containrrr/watchtower:latest
   ```

3. **Deploy with correct configuration:**

   ```bash
   docker run -d \
     --name watchtower \
     --restart always \
     -p 8080:8080 \
     -v /var/run/docker.sock:/var/run/docker.sock \
     -e WATCHTOWER_CLEANUP=true \
     -e WATCHTOWER_INCLUDE_RESTARTING=true \
     -e WATCHTOWER_INCLUDE_STOPPED=true \
     -e WATCHTOWER_REVIVE_STOPPED=false \
     -e WATCHTOWER_POLL_INTERVAL=3600 \
     -e WATCHTOWER_TIMEOUT=10s \
     -e WATCHTOWER_HTTP_API_UPDATE=true \
     -e WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" \
     -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
     -e WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates \
     -e TZ=America/Los_Angeles \
     containrrr/watchtower:latest
   ```

### Notification Service Deployment

1. **Deploy ntfy service:**

   ```bash
   docker run -d \
     --name ntfy \
     --restart always \
     -p 8081:80 \
     -v ntfy-data:/var/lib/ntfy \
     binwiederhier/ntfy serve
   ```

2. **Test notification:**

   ```bash
   curl -d "Watchtower test notification" http://localhost:8081/updates
   ```

## 📋 Preventive Measures

### Regular Health Checks

```bash
# Add to crontab for automated monitoring (runs every 6 hours)
0 */6 * * * /home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh
```

### Configuration Validation

```bash
# Validate Docker Compose before deployment
docker-compose -f watchtower.yml config
```

### Backup Configurations

```bash
# Backup working configurations
cp watchtower.yml watchtower.yml.backup.$(date +%Y%m%d)
```

## 🔄 Recovery Testing

### Monthly Recovery Drill

1. Intentionally stop Watchtower on a test endpoint
2. Run the emergency recovery procedures
3. Verify functionality and notifications
4. Document any issues or improvements needed

### Notification Testing

```bash
# Test all notification endpoints
for endpoint in localhost:8081 localhost:8082 ntfy.vish.gg; do
  curl -d "Test from $(hostname)" "http://$endpoint/homelab-alerts"
done
```

## 📞 Escalation Procedures

### Level 1: Automated Recovery

- Scripts attempt automatic recovery
- Status checks verify success
- Notifications sent on failure

### Level 2: Manual Intervention

- Review logs and error messages
- Apply manual fixes using this guide
- Update configurations as needed

### Level 3: Infrastructure Review

- Assess overall architecture
- Consider alternative solutions
- Update emergency procedures

## 📚 Reference Information

### Shoutrrr URL Formats

```bash
# Generic HTTP webhook
generic+http://localhost:8081/updates

# ntfy service (HTTPS)
ntfy://ntfy.example.com/topic

# Discord webhook
discord://token@channel

# Slack webhook
slack://token@channel
```

### Environment Variables Reference

```bash
WATCHTOWER_CLEANUP=true                          # Remove old images
WATCHTOWER_INCLUDE_RESTARTING=true               # Update restarting containers
WATCHTOWER_INCLUDE_STOPPED=true                  # Update stopped containers
WATCHTOWER_REVIVE_STOPPED=false                  # Don't start stopped containers
WATCHTOWER_POLL_INTERVAL=3600                    # Check every hour
WATCHTOWER_TIMEOUT=10s                           # Container stop timeout
WATCHTOWER_HTTP_API_UPDATE=true                  # Enable HTTP API
WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"  # API authentication
WATCHTOWER_NOTIFICATIONS=shoutrrr                # Enable notifications
WATCHTOWER_NOTIFICATION_URL=url                  # Notification endpoint
TZ=America/Los_Angeles                           # Timezone
```

### API Endpoints

```bash
# Portainer API base
BASE_URL="http://vishinator.synology.me:10000"

# Endpoint IDs
ATLANTIS_ID=2
CALYPSO_ID=443397
CONCORD_NUC_ID=443398
RPI5_ID=443395
HOMELAB_VM_ID=443399
```

## 🔐 Security Considerations

### API Key Management

- Store API keys securely
- Rotate keys regularly
- Use environment variables, not hardcoded values

### Container Security

- Run with minimal privileges
- Use read-only Docker socket when possible
- Implement network segmentation

### Notification Security

- Use HTTPS for external notifications
- Implement authentication for notification endpoints
- Avoid sensitive information in notification messages

## 📈 Monitoring and Metrics

### Key Metrics to Track

- Container update success rate
- Notification delivery success
- Recovery time from failures
- Resource usage trends

### Alerting Thresholds

- Watchtower down for > 5 minutes: Critical
- Failed updates > 3 in 24 hours: Warning
- Notification failures > 10%: Warning

## 🔄 Continuous Improvement

### Regular Reviews

- Monthly review of emergency procedures
- Quarterly testing of all recovery scenarios
- Annual architecture assessment

### Documentation Updates

- Update procedures after each incident
- Incorporate lessons learned
- Maintain current contact information

---

**Last Updated:** 2026-02-09
**Next Review:** 2026-03-09
**Document Owner:** Homelab Operations Team
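
### Appendix: Alert Threshold Check (Sketch)

The three thresholds listed under "Alerting Thresholds" could be encoded in a small shell helper along the following lines. This is only a sketch under stated assumptions: the function name `classify_watchtower_alert`, its argument order, and the severity strings are hypothetical and do not come from the existing scripts. How the input numbers are collected (downtime, failure counts) is left to the caller.

```bash
# Hypothetical helper: maps the documented thresholds to a severity string.
# Arguments (non-negative integers; names are illustrative only):
#   $1  seconds Watchtower has been down
#   $2  failed updates in the last 24 hours
#   $3  notifications attempted
#   $4  notifications that failed
# Note: uses plain (global) variables so it also runs under POSIX sh.
classify_watchtower_alert() {
  down_seconds=$1
  failed_updates=$2
  notif_total=$3
  notif_failed=$4

  if [ "$down_seconds" -gt 300 ]; then
    # Down for more than 5 minutes: Critical
    echo "CRITICAL"
  elif [ "$failed_updates" -gt 3 ]; then
    # More than 3 failed updates in 24 hours: Warning
    echo "WARNING"
  elif [ "$notif_total" -gt 0 ] && [ $((notif_failed * 10)) -gt "$notif_total" ]; then
    # Failure rate above 10%, written as failed*10 > total to stay in integers
    echo "WARNING"
  else
    echo "OK"
  fi
}
```

A status script might call it as `classify_watchtower_alert "$down" "$failed" "$sent" "$lost"` and forward anything other than `OK` to the notification endpoint. The integer comparison means a failure rate of exactly 10% is not flagged, matching the "> 10%" wording.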