homelab-optimized/docs/troubleshooting/WATCHTOWER_EMERGENCY_PROCEDURES.md
Sanitized mirror from private repository - 2026-04-04 03:23:14 UTC

Watchtower Emergency Procedures

🚨 Emergency Response Guide

This document provides step-by-step procedures for diagnosing and fixing Watchtower issues across your homelab infrastructure.

📊 Current Status (Last Updated: 2026-02-09)

Endpoint Status Summary

| Endpoint | Status | Port | Notification URL | Notes |
|----------|--------|------|------------------|-------|
| Calypso | 🟢 HEALTHY | 8080 | generic+http://localhost:8081/updates | Fixed crash loop |
| Atlantis | 🟢 HEALTHY | 8081 | generic+http://localhost:8082/updates | Fixed port conflict |
| vish-concord-nuc | 🟢 HEALTHY | 8080 | None configured | Stable for 2+ weeks |
| rpi5 | NOT DEPLOYED | - | - | Consider deployment |
| Homelab VM | ⚠️ OFFLINE | - | - | Endpoint unreachable |

🔧 Emergency Fix Scripts

Quick Status Check

# Run comprehensive status check
./scripts/check-watchtower-status.sh

Emergency Crash Loop Fix

# Fix notification URL format issues
./scripts/portainer-fix-v2.sh

Port Conflict Resolution

# Fix port conflicts (Atlantis specific)
./scripts/fix-atlantis-port.sh

🚨 Common Issues and Solutions

Issue 1: Crash Loop with "unknown service 'http'" Error

Symptoms:

level=fatal msg="Failed to initialize Shoutrrr notifications: error initializing router services: unknown service \"http\""

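Root Cause: Invalid Shoutrrr notification URL format. Shoutrrr reads the URL scheme as the name of the notification service, so a scheme it does not recognise (here "http") aborts startup and the container restarts in a loop.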

Solution:

# WRONG FORMAT:
WATCHTOWER_NOTIFICATION_URL=ntfy://localhost:8081/updates?insecure=yes

# CORRECT FORMAT:
WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates
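
A notification URL can be sanity-checked before it reaches a container. The sketch below is a hypothetical helper (`check_notification_url` is not part of Watchtower or Shoutrrr) that only inspects the scheme. The list of accepted schemes is an illustrative subset, so treat it as an assumption and compare it against the Shoutrrr version bundled with your Watchtower image.

```shell
#!/bin/sh
# Hypothetical pre-flight check: inspects only the URL scheme.
# The accepted list is an illustrative subset, not the full Shoutrrr catalogue.
check_notification_url() {
  local url="$1"
  local scheme="${url%%://*}"

  # If stripping "://..." changed nothing, the URL has no scheme at all.
  if [ "$scheme" = "$url" ]; then
    echo "invalid: no scheme"
    return 1
  fi

  case "$scheme" in
    generic|generic+http|generic+https|ntfy|discord|slack|gotify|smtp|telegram)
      echo "ok: $scheme"
      return 0 ;;
    http|https)
      # A bare http(s) URL is what produces: unknown service "http"
      echo "invalid: use generic+$scheme:// for a plain webhook"
      return 1 ;;
    *)
      echo "unknown: $scheme is not in this helper's list"
      return 2 ;;
  esac
}
```

Calling `check_notification_url "$WATCHTOWER_NOTIFICATION_URL"` in a deploy script and aborting on a non-zero exit status would catch this class of mistake before the container starts.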

Emergency Fix:

  1. Stop the crash looping container
  2. Remove the broken container
  3. Recreate with correct notification URL format
  4. Start the new container
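
The four steps above can be sketched as a single shell function. This is a minimal sketch, not the contents of `portainer-fix-v2.sh`: the function name `recreate_watchtower` and the `DOCKER` override are assumptions made for illustration, and only the notification variables are set. The full set of options appears under "Complete Watchtower Rebuild".

```shell
#!/bin/sh
# DOCKER can be overridden (for example DOCKER=echo) to preview the commands
# without touching any container. This override is an assumption of the sketch.
DOCKER="${DOCKER:-docker}"

recreate_watchtower() {
  local url="$1"
  if [ -z "$url" ]; then
    echo "usage: recreate_watchtower NOTIFICATION_URL" >&2
    return 1
  fi

  # Steps 1 and 2: stop and remove the crash-looping container.
  # Failures are ignored so the function also works when it is already gone.
  "$DOCKER" stop watchtower || true
  "$DOCKER" rm watchtower || true

  # Steps 3 and 4: "run -d" both creates and starts the replacement.
  "$DOCKER" run -d \
    --name watchtower \
    --restart always \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
    -e WATCHTOWER_NOTIFICATION_URL="$url" \
    containrrr/watchtower:latest
}
```

Running `DOCKER=echo; recreate_watchtower 'generic+http://localhost:8081/updates'` prints the three Docker commands instead of executing them, which is a cheap way to review the change during an incident.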

Issue 2: Port Conflict (Address Already in Use)

Symptoms:

Error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use

Solution:

  1. Identify conflicting service on port 8080
  2. Use alternative port (8081, 8082, etc.)
  3. Update port mapping in container configuration

Emergency Fix:

# Use different port in HostConfig
"PortBindings": {"8080/tcp": [{"HostPort": "8081"}]}
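
Choosing the alternative port can be automated instead of guessed. The sketch below assumes a Linux host with `ss` available. Both function names are hypothetical, and the candidate ports are the ones mentioned in this guide.

```shell
#!/bin/sh
# Print the TCP ports currently in LISTEN state, space separated.
# Assumes iproute2's ss; column 4 of "ss -Htln" is "Local Address:Port".
listening_ports() {
  ss -Htln | awk '{ n = split($4, a, ":"); print a[n] }' | sort -un | tr '\n' ' '
}

# first_free_port "USED PORTS" CANDIDATE...
# Prints the first candidate that does not appear in the used list.
first_free_port() {
  local used="$1"
  shift
  local port
  for port in "$@"; do
    case " $used " in
      *" $port "*) continue ;;   # already taken, try the next candidate
    esac
    echo "$port"
    return 0
  done
  return 1                       # every candidate is in use
}
```

On the affected host, `first_free_port "$(listening_ports)" 8080 8081 8082` prints the value to use as `HostPort` in the binding above.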

Issue 3: Notification Service Connection Refused

Symptoms:

error="Post \"http://localhost:8081/updates\": dial tcp 127.0.0.1:8081: connect: connection refused"

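Root Cause: the ntfy service is not running on the target port, or is not reachable from Watchtower. Inside a container, localhost refers to the container itself unless it runs with host networking, so a service published on the host's port 8081 may still refuse connections from Watchtower.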

Solutions:

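  1. Deploy ntfy service locally:
# hosts/[hostname]/ntfy.yaml
version: '3.8'
services:
  ntfy:
    image: binwiederhier/ntfy
    ports:
      - "8081:80"
    command: serve
    volumes:
      - ntfy-data:/var/lib/ntfy
# Declare the named volume so Compose can create it
volumes:
  ntfy-data:
  2. Use external ntfy service:
WATCHTOWER_NOTIFICATION_URL=ntfy://ntfy.vish.gg/REDACTED_NTFY_TOPIC
  3. Disable notifications temporarily:
# Remove these variables from the container definition (compose file, stack,
# or docker run command), then recreate the container. Running `unset` in a
# host shell does not change the environment of an existing container.
#   WATCHTOWER_NOTIFICATIONS
#   WATCHTOWER_NOTIFICATION_URL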

🔍 Diagnostic Commands

Check Container Status

# Via Portainer API
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/json" | \
  jq '.[] | select(.Names[]? | contains("watchtower"))'

View Container Logs

# Last 50 lines
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/$CONTAINER_ID/logs?stdout=true&stderr=true&tail=50"

Check Port Usage

# SSH to host and check port usage
netstat -tulpn | grep :8080
lsof -i :8080

Verify Notification Service

# Test ntfy service
curl -d "Test message" http://localhost:8081/updates
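
When the test message fails, curl's exit status usually points at the cause. The helper below is a hypothetical wrapper (both function names are assumptions) that maps the most common curl exit codes to a short diagnosis. Codes it does not recognise are reported as-is.

```shell
#!/bin/sh
# Translate a curl exit status into a short diagnosis.
explain_curl_exit() {
  case "$1" in
    0)  echo "delivered" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "connection refused or host unreachable" ;;
    22) echo "server returned an HTTP error" ;;
    28) echo "timed out" ;;
    *)  echo "curl exit code $1" ;;
  esac
}

# Send a test message and report the outcome.
# --fail turns HTTP status >= 400 into exit code 22.
test_ntfy() {
  local url="$1"
  curl --silent --show-error --fail --max-time 5 \
    -d "Watchtower test notification" "$url" >/dev/null
  explain_curl_exit "$?"
}
```

For example, `test_ntfy http://localhost:8081/updates` reporting "connection refused or host unreachable" matches the symptom described in Issue 3.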

🛠️ Manual Recovery Procedures

Complete Watchtower Rebuild

  1. Stop and remove existing container:
docker stop watchtower
docker rm watchtower
  2. Pull latest image:
docker pull containrrr/watchtower:latest
  3. Deploy with correct configuration:
docker run -d \
  --name watchtower \
  --restart always \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e WATCHTOWER_CLEANUP=true \
  -e WATCHTOWER_INCLUDE_RESTARTING=true \
  -e WATCHTOWER_INCLUDE_STOPPED=true \
  -e WATCHTOWER_REVIVE_STOPPED=false \
  -e WATCHTOWER_POLL_INTERVAL=3600 \
  -e WATCHTOWER_TIMEOUT=10s \
  -e WATCHTOWER_HTTP_API_UPDATE=true \
  -e WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" \
  -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
  -e WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates \
  -e TZ=America/Los_Angeles \
  containrrr/watchtower:latest

Notification Service Deployment

  1. Deploy ntfy service:
docker run -d \
  --name ntfy \
  --restart always \
  -p 8081:80 \
  -v ntfy-data:/var/lib/ntfy \
  binwiederhier/ntfy serve
  2. Test notification:
curl -d "Watchtower test notification" http://localhost:8081/updates

📋 Preventive Measures

Regular Health Checks

# Add to crontab for automated monitoring
0 */6 * * * /home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh

Configuration Validation

# Validate Docker Compose before deployment
docker-compose -f watchtower.yml config

Backup Configurations

# Backup working configurations
cp watchtower.yml watchtower.yml.backup.$(date +%Y%m%d)

🔄 Recovery Testing

Monthly Recovery Drill

  1. Intentionally stop Watchtower on test endpoint
  2. Run emergency recovery procedures
  3. Verify functionality and notifications
  4. Document any issues or improvements needed

Notification Testing

# Test all notification endpoints and report any that fail
for endpoint in localhost:8081 localhost:8082 ntfy.vish.gg; do
  curl --fail -d "Test from $(hostname)" "http://$endpoint/homelab-alerts" \
    || echo "Notification test failed for $endpoint" >&2
done

📞 Escalation Procedures

Level 1: Automated Recovery

  • Scripts attempt automatic recovery
  • Status checks verify success
  • Notifications sent on failure

Level 2: Manual Intervention

  • Review logs and error messages
  • Apply manual fixes using this guide
  • Update configurations as needed

Level 3: Infrastructure Review

  • Assess overall architecture
  • Consider alternative solutions
  • Update emergency procedures

📚 Reference Information

Shoutrrr URL Formats

# Generic HTTP webhook
generic+http://localhost:8081/updates

# ntfy service (HTTPS)
ntfy://ntfy.example.com/topic

# Discord webhook
discord://token@webhookid

# Slack webhook
slack://token@channel

Environment Variables Reference

WATCHTOWER_CLEANUP=true                    # Remove old images
WATCHTOWER_INCLUDE_RESTARTING=true         # Update restarting containers
WATCHTOWER_INCLUDE_STOPPED=true            # Update stopped containers
WATCHTOWER_REVIVE_STOPPED=false            # Don't start stopped containers
WATCHTOWER_POLL_INTERVAL=3600              # Check every hour
WATCHTOWER_TIMEOUT=10s                     # Container stop timeout
WATCHTOWER_HTTP_API_UPDATE=true            # Enable HTTP API
WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"       # API authentication
WATCHTOWER_NOTIFICATIONS=shoutrrr          # Enable notifications
WATCHTOWER_NOTIFICATION_URL=url            # Notification endpoint
TZ=America/Los_Angeles                     # Timezone

API Endpoints

# Portainer API base
BASE_URL="http://vishinator.synology.me:10000"

# Endpoint IDs
ATLANTIS_ID=2
CALYPSO_ID=443397
CONCORD_NUC_ID=443398
RPI5_ID=443395
HOMELAB_VM_ID=443399
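
The endpoint IDs above can be wrapped in a small lookup so the diagnostic commands take a host name instead of a number. This is a sketch only. The function names and the short endpoint names (`calypso`, `concord-nuc`, and so on) are assumptions, and `list_watchtower` additionally expects `API_KEY` to be set and `curl` and `jq` to be installed.

```shell
#!/bin/sh
BASE_URL="${BASE_URL:-http://vishinator.synology.me:10000}"

# Map a short endpoint name to its Portainer endpoint ID.
endpoint_id() {
  case "$1" in
    atlantis)    echo 2 ;;
    calypso)     echo 443397 ;;
    concord-nuc) echo 443398 ;;
    rpi5)        echo 443395 ;;
    homelab-vm)  echo 443399 ;;
    *) echo "unknown endpoint: $1" >&2; return 1 ;;
  esac
}

# Build the container-list URL for one endpoint.
containers_url() {
  local id
  id="$(endpoint_id "$1")" || return 1
  echo "$BASE_URL/api/endpoints/$id/docker/containers/json"
}

# Show Watchtower containers on one endpoint (requires API_KEY, curl, jq).
list_watchtower() {
  local url
  url="$(containers_url "$1")" || return 1
  curl -s -H "X-API-Key: $API_KEY" "$url" |
    jq '.[] | select(.Names[]? | contains("watchtower"))'
}
```

With this in place, `list_watchtower calypso` is equivalent to the "Check Container Status" command with `ENDPOINT_ID=443397`.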

🔐 Security Considerations

API Key Management

  • Store API keys securely
  • Rotate keys regularly
  • Use environment variables, not hardcoded values

Container Security

  • Run with minimal privileges
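  • Restrict Docker socket access where possible (for example through a socket proxy); a read-only bind mount of the socket does not limit Docker API calls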
  • Implement network segmentation

Notification Security

  • Use HTTPS for external notifications
  • Implement authentication for notification endpoints
  • Avoid sensitive information in notification messages

📈 Monitoring and Metrics

Key Metrics to Track

  • Container update success rate
  • Notification delivery success
  • Recovery time from failures
  • Resource usage trends

Alerting Thresholds

  • Watchtower down for > 5 minutes: Critical
  • Failed updates > 3 in 24 hours: Warning
  • Notification failures > 10%: Warning
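
The thresholds above translate directly into a severity check. The sketch below assumes the three inputs are already collected as whole numbers by some monitoring job. The function name `classify_alert` and the argument order are hypothetical.

```shell
#!/bin/sh
# classify_alert DOWN_MINUTES FAILED_UPDATES_24H NOTIFY_FAILURE_PERCENT
# Prints "critical", "warning", or "ok" using the thresholds listed above.
classify_alert() {
  local down_min="$1" failed="$2" notify_pct="$3"

  if [ "$down_min" -gt 5 ]; then
    echo "critical"              # Watchtower down for more than 5 minutes
  elif [ "$failed" -gt 3 ] || [ "$notify_pct" -gt 10 ]; then
    echo "warning"               # too many failed updates or notifications
  else
    echo "ok"
  fi
}
```

For instance, `classify_alert 0 4 0` prints `warning`, while `classify_alert 5 3 10` prints `ok` because every threshold is strict.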

🔄 Continuous Improvement

Regular Reviews

  • Monthly review of emergency procedures
  • Quarterly testing of all recovery scenarios
  • Annual architecture assessment

Documentation Updates

  • Update procedures after each incident
  • Incorporate lessons learned
  • Maintain current contact information

Last Updated: 2026-02-09
Next Review: 2026-03-09
Document Owner: Homelab Operations Team