Watchtower Emergency Procedures
🚨 Emergency Response Guide
This document provides step-by-step procedures for diagnosing and fixing Watchtower issues across your homelab infrastructure.
📊 Current Status (Last Updated: 2026-02-09)
Endpoint Status Summary
| Endpoint | Status | Port | Notification URL | Notes |
|---|---|---|---|---|
| Calypso | 🟢 HEALTHY | 8080 | generic+http://localhost:8081/updates | Fixed crash loop |
| Atlantis | 🟢 HEALTHY | 8081 | generic+http://localhost:8082/updates | Fixed port conflict |
| vish-concord-nuc | 🟢 HEALTHY | 8080 | None configured | Stable for 2+ weeks |
| rpi5 | ❌ NOT DEPLOYED | - | - | Consider deployment |
| Homelab VM | ⚠️ OFFLINE | - | - | Endpoint unreachable |
🔧 Emergency Fix Scripts
Quick Status Check
# Run comprehensive status check
./scripts/check-watchtower-status.sh
Emergency Crash Loop Fix
# Fix notification URL format issues
./scripts/portainer-fix-v2.sh
Port Conflict Resolution
# Fix port conflicts (Atlantis specific)
./scripts/fix-atlantis-port.sh
🚨 Common Issues and Solutions
Issue 1: Crash Loop with "unknown service 'http'" Error
Symptoms:
level=fatal msg="Failed to initialize Shoutrrr notifications: error initializing router services: unknown service \"http\""
Root Cause: Invalid Shoutrrr notification URL format
Solution:
# WRONG FORMAT:
WATCHTOWER_NOTIFICATION_URL=ntfy://localhost:8081/updates?insecure=yes
# CORRECT FORMAT:
WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates
Emergency Fix:
- Stop the crash looping container
- Remove the broken container
- Recreate with correct notification URL format
- Start the new container
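The URL format can be checked before the container is recreated. The following is a minimal sketch, assuming a hypothetical helper named `check_notification_url`; the list of accepted schemes is a hand-picked, partial subset of Shoutrrr service names and is not the full set.

```shell
# Hypothetical pre-flight check for WATCHTOWER_NOTIFICATION_URL.
# The accepted schemes below are an illustrative subset, not a complete list.
check_notification_url() {
  local url="$1"
  local scheme="${url%%://*}"   # everything before the first "://"

  # If nothing was stripped, the URL has no scheme at all.
  if [ "$scheme" = "$url" ]; then
    echo "invalid: missing scheme"
    return 1
  fi

  case "$scheme" in
    generic|generic+http|generic+https|ntfy|discord|slack|gotify|telegram|smtp)
      echo "ok: $scheme"
      ;;
    http|https)
      # A bare http(s) scheme triggers the "unknown service" error.
      echo "invalid: bare $scheme scheme, try generic+$scheme://"
      return 1
      ;;
    *)
      echo "unknown: $scheme (not in this script's partial list)"
      return 2
      ;;
  esac
}
```

Example: `check_notification_url "$WATCHTOWER_NOTIFICATION_URL" || exit 1` placed ahead of the recreate step stops the fix script before it starts another crash loop.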
Issue 2: Port Conflict (Address Already in Use)
Symptoms:
Error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use
Solution:
- Identify conflicting service on port 8080
- Use alternative port (8081, 8082, etc.)
- Update port mapping in container configuration
Emergency Fix:
# Use different port in HostConfig
"PortBindings": {"8080/tcp": [{"HostPort": "8081"}]}
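Choosing the alternative port can be scripted. This is a rough sketch, assuming a Linux host with `ss` from iproute2; `listening_ports` and `first_free_port` are hypothetical helper names, not part of the existing scripts.

```shell
# Print the TCP ports currently in LISTEN state, one per line.
# Assumes iproute2's ss is available on the host.
listening_ports() {
  ss -Htln | awk '{print $4}' | sed 's/.*://' | sort -un
}

# first_free_port "USED PORTS" CANDIDATE...
# Prints the first candidate that is not in the space-separated used list.
first_free_port() {
  local used="$1"
  shift
  local p
  for p in "$@"; do
    case " $used " in
      *" $p "*) continue ;;   # candidate is already taken
    esac
    echo "$p"
    return 0
  done
  return 1                     # every candidate is in use
}
```

Example: `first_free_port "$(listening_ports | tr '\n' ' ')" 8080 8081 8082` prints the port to use as `HostPort` in the port binding.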
Issue 3: Notification Service Connection Refused
Symptoms:
error="Post \"http://localhost:8081/updates\": dial tcp 127.0.0.1:8081: connect: connection refused"
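Root Cause: ntfy service not running on the target port, or not reachable from inside the Watchtower container. With Docker's default bridge networking, localhost inside the container refers to the container itself, not the host.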
Solutions:
- Deploy ntfy service locally:
# hosts/[hostname]/ntfy.yaml
version: '3.8'
services:
  ntfy:
    image: binwiederhier/ntfy
    ports:
      - "8081:80"
    command: serve
    volumes:
      - ntfy-data:/var/lib/ntfy
# Named volumes must be declared at the top level
volumes:
  ntfy-data:
- Use external ntfy service:
WATCHTOWER_NOTIFICATION_URL=ntfy://ntfy.vish.gg/REDACTED_NTFY_TOPIC
- Disable notifications temporarily:
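# Remove the notification environment variables.
# Unsetting them in the host shell does not change a container that is already
# running; also drop them from the container definition and recreate it.
unset WATCHTOWER_NOTIFICATIONS
unset WATCHTOWER_NOTIFICATION_URL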
🔍 Diagnostic Commands
Check Container Status
# Via Portainer API
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/json" | \
  jq '.[] | select(.Names[]? | contains("watchtower"))'
View Container Logs
# Last 50 lines
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/$CONTAINER_ID/logs?stdout=true&stderr=true&tail=50"
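The two Portainer calls above share the same URL shape, so they can be wrapped in small functions. This is a sketch only: `portainer_docker_url` and `watchtower_logs` are hypothetical names, and the `BASE_URL` default is a placeholder to be replaced with the real Portainer address.

```shell
# Placeholder default; export BASE_URL with the real Portainer address.
BASE_URL="${BASE_URL:-http://localhost:9000}"

# portainer_docker_url ENDPOINT_ID DOCKER_API_PATH
# Builds the proxied Docker API URL for one Portainer endpoint.
portainer_docker_url() {
  printf '%s/api/endpoints/%s/docker/%s\n' "$BASE_URL" "$1" "$2"
}

# watchtower_logs ENDPOINT_ID CONTAINER_ID [TAIL]
# Fetches the last TAIL log lines (default 50). Requires API_KEY to be set.
watchtower_logs() {
  local endpoint_id="$1" container_id="$2" tail="${3:-50}"
  curl -fsS -H "X-API-Key: ${API_KEY:?API_KEY is not set}" \
    "$(portainer_docker_url "$endpoint_id" \
      "containers/$container_id/logs?stdout=true&stderr=true&tail=$tail")"
}
```

Example: `watchtower_logs "$ENDPOINT_ID" "$CONTAINER_ID" 100` fetches the last 100 lines. The `-f` flag makes curl return a non-zero exit code on HTTP errors, which a wrapper script can check.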
Check Port Usage
# SSH to host and check port usage
netstat -tulpn | grep :8080
lsof -i :8080
Verify Notification Service
# Test ntfy service
curl -d "Test message" http://localhost:8081/updates
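Posting a test message publishes to the topic every time it runs. For repeated checks, a read-only probe may be preferable. The sketch below assumes the deployed ntfy version exposes the `/v1/health` endpoint (worth confirming against your instance); `ntfy_healthy` and `report_ntfy` are hypothetical helper names.

```shell
# ntfy_healthy BASE_URL
# Returns 0 if the server answers its health endpoint within 5 seconds.
# Assumption: the ntfy instance serves GET /v1/health.
ntfy_healthy() {
  local base="${1%/}"   # drop a trailing slash, if any
  curl -fsS --max-time 5 "$base/v1/health" >/dev/null 2>&1
}

# report_ntfy BASE_URL
# Prints a one-line verdict and returns non-zero when unreachable.
report_ntfy() {
  if ntfy_healthy "$1"; then
    echo "reachable: $1"
  else
    echo "unreachable: $1"
    return 1
  fi
}
```

Example: `report_ntfy http://localhost:8081` run on the host checks the published port. Running the same probe from inside the Watchtower container's network shows whether the notification URL is reachable from where Watchtower actually sends it.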
🛠️ Manual Recovery Procedures
Complete Watchtower Rebuild
- Stop and remove existing container:
docker stop watchtower
docker rm watchtower
- Pull latest image:
docker pull containrrr/watchtower:latest
- Deploy with correct configuration:
docker run -d \
  --name watchtower \
  --restart always \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e WATCHTOWER_CLEANUP=true \
  -e WATCHTOWER_INCLUDE_RESTARTING=true \
  -e WATCHTOWER_INCLUDE_STOPPED=true \
  -e WATCHTOWER_REVIVE_STOPPED=false \
  -e WATCHTOWER_POLL_INTERVAL=3600 \
  -e WATCHTOWER_TIMEOUT=10s \
  -e WATCHTOWER_HTTP_API_UPDATE=true \
  -e WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" \
  -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
  -e WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates \
  -e TZ=America/Los_Angeles \
  containrrr/watchtower:latest
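After the rebuild, it helps to confirm that the new container stays up instead of entering another crash loop. This is a minimal sketch, assuming a restart count above 3 is treated as a crash loop (an arbitrary threshold); `classify_state` and `verify_watchtower` are hypothetical helper names.

```shell
# classify_state STATUS RESTART_COUNT
# Maps Docker's container status and restart count to a simple verdict.
# The threshold of 3 restarts is an assumption, not a Docker default.
classify_state() {
  local status="$1" restarts="$2"
  if [ "$status" = "restarting" ] || [ "$restarts" -gt 3 ]; then
    echo "crash-loop"
  elif [ "$status" = "running" ]; then
    echo "healthy"
  else
    echo "down"
  fi
}

# verify_watchtower [CONTAINER_NAME]
# Reads the live state from Docker and classifies it.
verify_watchtower() {
  local name="${1:-watchtower}" state
  state="$(docker inspect --format '{{.State.Status}} {{.RestartCount}}' "$name")" || return 2
  # Intentional word splitting: state holds "STATUS COUNT".
  # shellcheck disable=SC2086
  classify_state $state
}
```

Example: run `sleep 30; verify_watchtower` after the `docker run` command, so that a container that fails on startup has time to accumulate restarts.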
Notification Service Deployment
- Deploy ntfy service:
docker run -d \
  --name ntfy \
  --restart always \
  -p 8081:80 \
  -v ntfy-data:/var/lib/ntfy \
  binwiederhier/ntfy serve
- Test notification:
curl -d "Watchtower test notification" http://localhost:8081/updates
📋 Preventive Measures
Regular Health Checks
# Add to crontab for automated monitoring
0 */6 * * * /home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh
Configuration Validation
# Validate Docker Compose before deployment
docker-compose -f watchtower.yml config
Backup Configurations
# Backup working configurations
cp watchtower.yml watchtower.yml.backup.$(date +%Y%m%d)
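Dated backups accumulate over time. The sketch below extends the copy command with simple pruning; `backup_config` is a hypothetical helper, and keeping 7 backups by default is an arbitrary choice.

```shell
# backup_config FILE [KEEP]
# Copies FILE to FILE.backup.YYYYMMDD and keeps only the newest KEEP backups.
# Prints the path of the backup that was written.
backup_config() {
  local file="$1" keep="${2:-7}"
  local dest
  dest="$file.backup.$(date +%Y%m%d)"

  cp -p -- "$file" "$dest" || return 1

  # The date suffix sorts lexicographically, so reverse sort is newest first.
  # Everything after the first KEEP entries is removed.
  printf '%s\n' "$file".backup.* | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done

  echo "$dest"
}
```

Example: `backup_config watchtower.yml 14` keeps two weeks of daily backups. Running it more than once on the same day overwrites that day's copy.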
🔄 Recovery Testing
Monthly Recovery Drill
- Intentionally stop Watchtower on test endpoint
- Run emergency recovery procedures
- Verify functionality and notifications
- Document any issues or improvements needed
Notification Testing
# Test all notification endpoints
for endpoint in localhost:8081 localhost:8082 ntfy.vish.gg; do
  curl -d "Test from $(hostname)" "http://$endpoint/homelab-alerts"
done
📞 Escalation Procedures
Level 1: Automated Recovery
- Scripts attempt automatic recovery
- Status checks verify success
- Notifications sent on failure
Level 2: Manual Intervention
- Review logs and error messages
- Apply manual fixes using this guide
- Update configurations as needed
Level 3: Infrastructure Review
- Assess overall architecture
- Consider alternative solutions
- Update emergency procedures
📚 Reference Information
Shoutrrr URL Formats
# Generic HTTP webhook
generic+http://localhost:8081/updates
# ntfy service (HTTPS)
ntfy://ntfy.example.com/topic
# Discord webhook (webhook token and webhook ID)
discord://token@id
# Slack webhook
slack://token@channel
Environment Variables Reference
WATCHTOWER_CLEANUP=true # Remove old images
WATCHTOWER_INCLUDE_RESTARTING=true # Update restarting containers
WATCHTOWER_INCLUDE_STOPPED=true # Update stopped containers
WATCHTOWER_REVIVE_STOPPED=false # Don't start stopped containers
WATCHTOWER_POLL_INTERVAL=3600 # Check every hour
WATCHTOWER_TIMEOUT=10s # Container stop timeout
WATCHTOWER_HTTP_API_UPDATE=true # Enable HTTP API (periodic polls stop unless WATCHTOWER_HTTP_API_PERIODIC_POLLS=true)
WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" # API authentication
WATCHTOWER_NOTIFICATIONS=shoutrrr # Enable notifications
WATCHTOWER_NOTIFICATION_URL=url # Notification endpoint
TZ=America/Los_Angeles # Timezone
API Endpoints
# Portainer API base
BASE_URL="http://vishinator.synology.me:10000"
# Endpoint IDs
ATLANTIS_ID=2
CALYPSO_ID=443397
CONCORD_NUC_ID=443398
RPI5_ID=443395
HOMELAB_VM_ID=443399
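Scripts that take a host name on the command line need to translate it into one of the IDs above. A small lookup sketch follows; the function name `endpoint_id` and the lowercase keys are hypothetical, while the IDs are the ones listed in this section.

```shell
# endpoint_id NAME
# Prints the Portainer endpoint ID for a host, or fails for unknown names.
# The lowercase keys are illustrative; the IDs come from the list above.
endpoint_id() {
  case "$1" in
    atlantis)    echo 2 ;;
    calypso)     echo 443397 ;;
    concord-nuc) echo 443398 ;;
    rpi5)        echo 443395 ;;
    homelab-vm)  echo 443399 ;;
    *)
      echo "unknown endpoint: $1" >&2
      return 1
      ;;
  esac
}
```

Example: `ENDPOINT_ID="$(endpoint_id calypso)" || exit 1` sets the variable used by the diagnostic commands and aborts on a mistyped name.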
🔐 Security Considerations
API Key Management
- Store API keys securely
- Rotate keys regularly
- Use environment variables, not hardcoded values
Container Security
- Run with minimal privileges
- Use read-only Docker socket when possible
- Implement network segmentation
Notification Security
- Use HTTPS for external notifications
- Implement authentication for notification endpoints
- Avoid sensitive information in notification messages
📈 Monitoring and Metrics
Key Metrics to Track
- Container update success rate
- Notification delivery success
- Recovery time from failures
- Resource usage trends
Alerting Thresholds
- Watchtower down for > 5 minutes: Critical
- Failed updates > 3 in 24 hours: Warning
- Notification failures > 10%: Warning
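The thresholds above translate directly into a severity check. This is a sketch under the assumption that the three metrics are already collected as whole numbers by some other means; `alert_level` is a hypothetical function name.

```shell
# alert_level DOWN_MINUTES FAILED_UPDATES_24H NOTIFY_FAIL_PERCENT
# Applies the documented thresholds; all inputs must be integers.
alert_level() {
  local down_minutes="$1" failed_updates="$2" notify_fail_pct="$3"
  if [ "$down_minutes" -gt 5 ]; then
    echo "critical"     # Watchtower down for more than 5 minutes
  elif [ "$failed_updates" -gt 3 ] || [ "$notify_fail_pct" -gt 10 ]; then
    echo "warning"      # too many failed updates or notification failures
  else
    echo "ok"
  fi
}
```

Example: `alert_level 0 4 2` prints `warning` because four failed updates exceed the limit of three. Values exactly at a threshold are still reported as `ok`.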
🔄 Continuous Improvement
Regular Reviews
- Monthly review of emergency procedures
- Quarterly testing of all recovery scenarios
- Annual architecture assessment
Documentation Updates
- Update procedures after each incident
- Incorporate lessons learned
- Maintain current contact information
Last Updated: 2026-02-09
Next Review: 2026-03-09
Document Owner: Homelab Operations Team