Vish/homelab-optimized

Sanitized mirror from private repository - 2026-04-04 03:23:14 UTC

Watchtower Emergency Procedures

🚨 Emergency Response Guide

This document provides step-by-step procedures for diagnosing and fixing Watchtower issues across your homelab infrastructure.

📊 Current Status (Last Updated: 2026-02-09)

Endpoint Status Summary

Endpoint	Status	Port	Notification URL	Notes
Calypso	🟢 HEALTHY	8080	`generic+http://localhost:8081/updates`	Fixed crash loop
Atlantis	🟢 HEALTHY	8081	`generic+http://localhost:8082/updates`	Fixed port conflict
vish-concord-nuc	🟢 HEALTHY	8080	None configured	Stable for 2+ weeks
rpi5	❌ NOT DEPLOYED	-	-	Consider deployment
Homelab VM	⚠️ OFFLINE	-	-	Endpoint unreachable

🔧 Emergency Fix Scripts

Quick Status Check

# Run comprehensive status check
./scripts/check-watchtower-status.sh

Emergency Crash Loop Fix

# Fix notification URL format issues
./scripts/portainer-fix-v2.sh

Port Conflict Resolution

# Fix port conflicts (Atlantis specific)
./scripts/fix-atlantis-port.sh

🚨 Common Issues and Solutions

Issue 1: Crash Loop with "unknown service 'http'" Error

Symptoms:

level=fatal msg="Failed to initialize Shoutrrr notifications: error initializing router services: unknown service \"http\""

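Root Cause: Invalid Shoutrrr notification URL format. Shoutrrr reads the URL scheme as the service name, so a plain http:// URL fails with unknown service "http"; a plain HTTP webhook needs the generic+http:// prefix.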

Solution:

# WRONG FORMAT:
WATCHTOWER_NOTIFICATION_URL=ntfy://localhost:8081/updates?insecure=yes

# CORRECT FORMAT:
WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates

Emergency Fix:

1. Stop the crash looping container
2. Remove the broken container
3. Recreate with correct notification URL format
4. Start the new container
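
Before recreating the container in step 3, the notification URL can be checked for the mistake described above. The following is a minimal sketch; `check_notification_url` is a hypothetical helper, not part of Watchtower or Shoutrrr, and it only catches bare http/https URLs and URLs with no scheme. It does not verify that the scheme names a real Shoutrrr service.

```shell
# Hypothetical helper: return 0 if the URL has a non-http scheme, 1 if it is
# a bare http/https URL (rejected by Shoutrrr as unknown service "http")
# or has no scheme at all.
check_notification_url() {
  case "$1" in
    http://*|https://*)
      echo "invalid: bare ${1%%://*} URL; try generic+$1" >&2
      return 1 ;;
    *://*)
      return 0 ;;
    *)
      echo "invalid: no scheme in '$1'" >&2
      return 1 ;;
  esac
}
```

For example, `check_notification_url "$WATCHTOWER_NOTIFICATION_URL" || exit 1` near the top of a deploy script would stop before a container with a bare http:// URL is created.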

Issue 2: Port Conflict (Address Already in Use)

Symptoms:

Error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use

Solution:

1. Identify conflicting service on port 8080
2. Use alternative port (8081, 8082, etc.)
3. Update port mapping in container configuration
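
Picking the alternative port can be scripted once the ports already in use are known. This is a sketch only; `first_free_port` is a hypothetical helper, and the list of used ports is assumed to be gathered separately (for example from the port usage check in the Diagnostic Commands section).

```shell
# Hypothetical helper: print the first candidate port that does not appear
# in the space-separated list of used ports given as the first argument.
first_free_port() {
  used=" $1 "
  shift
  for port in "$@"; do
    case "$used" in
      *" $port "*) ;;                  # taken, try the next candidate
      *) echo "$port"; return 0 ;;     # first port not in the used list
    esac
  done
  return 1                             # every candidate is in use
}

# Usage (assumed values): ports 8080 and 9000 are taken
# first_free_port "8080 9000" 8080 8081 8082
```

The function returns a non-zero status when every candidate is taken, so a calling script can fail instead of binding to a busy port.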

Emergency Fix:

# Use different port in HostConfig
"PortBindings": {"8080/tcp": [{"HostPort": "8081"}]}

Issue 3: Notification Service Connection Refused

Symptoms:

error="Post \"http://localhost:8081/updates\": dial tcp 127.0.0.1:8081: connect: connection refused"

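Root Cause: ntfy service not running on target port. If Watchtower runs on Docker's default bridge network, localhost inside the container refers to the container itself, so an ntfy service published on the host is also unreachable at this address.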

Solutions:

Deploy ntfy service locally:

# hosts/[hostname]/ntfy.yaml
version: '3.8'
services:
  ntfy:
    image: binwiederhier/ntfy
    ports:
      - "8081:80"
    command: serve
    volumes:
      - ntfy-data:/var/lib/ntfy

# Named volumes must be declared at the top level
volumes:
  ntfy-data:

Use external ntfy service:

WATCHTOWER_NOTIFICATION_URL=ntfy://ntfy.vish.gg/REDACTED_NTFY_TOPIC

Disable notifications temporarily:

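# Remove these variables from the container definition (compose file or
# docker run flags) and recreate the container. Running unset in a host
# shell only clears them for that shell, not for a running container.
unset WATCHTOWER_NOTIFICATIONS
unset WATCHTOWER_NOTIFICATION_URL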

🔍 Diagnostic Commands

Check Container Status

# Via Portainer API (-sS hides the progress meter but still prints errors)
curl -sS -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/json" | \
  jq '.[] | select(.Names[]? | contains("watchtower"))'

View Container Logs

# Last 50 lines
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/$CONTAINER_ID/logs?stdout=true&stderr=true&tail=50"

Check Port Usage

# SSH to host and check port usage
netstat -tulpn | grep :8080
lsof -i :8080

Verify Notification Service

# Test ntfy service
curl -d "Test message" http://localhost:8081/updates
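
Output gathered with the commands above can be matched against the three issues described earlier. The sketch below assumes the error strings quoted in this guide; `classify_watchtower_log` is a hypothetical helper and recognizes nothing beyond those three messages.

```shell
# Hypothetical helper: read log or error lines on stdin and print which
# issue from this guide each known error message corresponds to.
classify_watchtower_log() {
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      *'unknown service'*)
        echo "Issue 1: invalid notification URL format" ;;
      *'address already in use'*)
        echo "Issue 2: port conflict" ;;
      *'connection refused'*)
        echo "Issue 3: notification service unreachable" ;;
    esac
  done
}
```

On a host with direct Docker access, `docker logs --tail 50 watchtower 2>&1 | classify_watchtower_log` is one way to feed it. The port conflict message comes from the Docker daemon when the container starts, so it may appear in the output of the start command rather than in the container log.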

🛠️ Manual Recovery Procedures

Complete Watchtower Rebuild

Stop and remove existing container:

docker stop watchtower
docker rm watchtower

Pull latest image:

docker pull containrrr/watchtower:latest

Deploy with correct configuration:

docker run -d \
  --name watchtower \
  --restart always \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e WATCHTOWER_CLEANUP=true \
  -e WATCHTOWER_INCLUDE_RESTARTING=true \
  -e WATCHTOWER_INCLUDE_STOPPED=true \
  -e WATCHTOWER_REVIVE_STOPPED=false \
  -e WATCHTOWER_POLL_INTERVAL=3600 \
  -e WATCHTOWER_TIMEOUT=10s \
  -e WATCHTOWER_HTTP_API_UPDATE=true \
  -e WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" \
  -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
  -e WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates \
  -e TZ=America/Los_Angeles \
  containrrr/watchtower:latest

Notification Service Deployment

Deploy ntfy service:

docker run -d \
  --name ntfy \
  --restart always \
  -p 8081:80 \
  -v ntfy-data:/var/lib/ntfy \
  binwiederhier/ntfy serve

Test notification:

curl -d "Watchtower test notification" http://localhost:8081/updates

📋 Preventive Measures

Regular Health Checks

# Add to crontab for automated monitoring
0 */6 * * * /home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh

Configuration Validation

# Validate Docker Compose before deployment
docker-compose -f watchtower.yml config

Backup Configurations

# Backup working configurations
cp watchtower.yml watchtower.yml.backup.$(date +%Y%m%d)

🔄 Recovery Testing

Monthly Recovery Drill

1. Intentionally stop Watchtower on test endpoint
2. Run emergency recovery procedures
3. Verify functionality and notifications
4. Document any issues or improvements needed

Notification Testing

# Test all notification endpoints
for endpoint in localhost:8081 localhost:8082 ntfy.vish.gg; do
  curl -d "Test from $(hostname)" http://$endpoint/homelab-alerts
done
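
The loop above prints whatever each server returns but does not say which endpoints failed. A variant that reports per-endpoint results could look like the sketch below. `ntfy_url` and `test_ntfy_endpoints` are hypothetical helpers, and the choice of https for anything other than loopback addresses is an assumption based on the Notification Security guidance in this document.

```shell
# Hypothetical helper: use http for loopback endpoints, https otherwise.
ntfy_url() {
  case "$1" in
    localhost|localhost:*|127.0.0.1|127.0.0.1:*) echo "http://$1/$2" ;;
    *) echo "https://$1/$2" ;;
  esac
}

# Post a test message to each endpoint; print OK or FAIL per endpoint and
# return non-zero if any endpoint failed.
test_ntfy_endpoints() {
  topic=$1
  shift
  failed=0
  for endpoint in "$@"; do
    if curl -fsS -m 10 -d "Test from $(hostname)" \
        "$(ntfy_url "$endpoint" "$topic")" >/dev/null; then
      echo "OK   $endpoint"
    else
      echo "FAIL $endpoint"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

It would be called as `test_ntfy_endpoints homelab-alerts localhost:8081 localhost:8082 ntfy.vish.gg`.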

📞 Escalation Procedures

Level 1: Automated Recovery

Scripts attempt automatic recovery
Status checks verify success
Notifications sent on failure

Level 2: Manual Intervention

Review logs and error messages
Apply manual fixes using this guide
Update configurations as needed

Level 3: Infrastructure Review

Assess overall architecture
Consider alternative solutions
Update emergency procedures

📚 Reference Information

Shoutrrr URL Formats

# Generic HTTP webhook
generic+http://localhost:8081/updates

# ntfy service (HTTPS)
ntfy://ntfy.example.com/topic

# Discord webhook (token and webhook ID, both taken from the webhook URL)
discord://token@webhookid

# Slack webhook
slack://token@channel

Environment Variables Reference

WATCHTOWER_CLEANUP=true                    # Remove old images
WATCHTOWER_INCLUDE_RESTARTING=true         # Update restarting containers
WATCHTOWER_INCLUDE_STOPPED=true            # Update stopped containers
WATCHTOWER_REVIVE_STOPPED=false            # Don't start stopped containers
WATCHTOWER_POLL_INTERVAL=3600              # Check every hour
WATCHTOWER_TIMEOUT=10s                     # Container stop timeout
WATCHTOWER_HTTP_API_UPDATE=true            # Enable HTTP API; periodic polls stop unless WATCHTOWER_HTTP_API_PERIODIC_POLLS=true
WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"       # API authentication
WATCHTOWER_NOTIFICATIONS=shoutrrr          # Enable notifications
WATCHTOWER_NOTIFICATION_URL=url            # Notification endpoint
TZ=America/Los_Angeles                     # Timezone

API Endpoints

# Portainer API base
BASE_URL="http://vishinator.synology.me:10000"

# Endpoint IDs
ATLANTIS_ID=2
CALYPSO_ID=443397
CONCORD_NUC_ID=443398
RPI5_ID=443395
HOMELAB_VM_ID=443399
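
The IDs above slot into the Portainer URLs used in the diagnostic commands. A small lookup keeps scripts from hardcoding the numbers in several places. This is a sketch; `endpoint_id` and `containers_url` are hypothetical helper names, and the short endpoint names (atlantis, calypso, and so on) are chosen here for illustration.

```shell
# Value copied from the reference above.
BASE_URL="http://vishinator.synology.me:10000"

# Hypothetical helper: map a short endpoint name to its Portainer endpoint ID.
endpoint_id() {
  case "$1" in
    atlantis)    echo 2 ;;
    calypso)     echo 443397 ;;
    concord-nuc) echo 443398 ;;
    rpi5)        echo 443395 ;;
    homelab-vm)  echo 443399 ;;
    *) echo "unknown endpoint: $1" >&2; return 1 ;;
  esac
}

# Build the container-list URL for a named endpoint.
containers_url() {
  id=$(endpoint_id "$1") || return 1
  echo "$BASE_URL/api/endpoints/$id/docker/containers/json"
}
```

With this in place, the container status check becomes `curl -sS -H "X-API-Key: $API_KEY" "$(containers_url calypso)"`.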

🔐 Security Considerations

API Key Management

Store API keys securely
Rotate keys regularly
Use environment variables, not hardcoded values

Container Security

Run with minimal privileges
Use read-only Docker socket when possible
Implement network segmentation

Notification Security

Use HTTPS for external notifications
Implement authentication for notification endpoints
Avoid sensitive information in notification messages

📈 Monitoring and Metrics

Key Metrics to Track

Container update success rate
Notification delivery success
Recovery time from failures
Resource usage trends

Alerting Thresholds

Watchtower down for > 5 minutes: Critical
Failed updates > 3 in 24 hours: Warning
Notification failures > 10%: Warning
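
The thresholds above translate directly into a check that a monitoring script could call. The sketch below is one possible encoding; `watchtower_alert_level` is a hypothetical helper, the "OK" level is an assumption for the case where no threshold is crossed, and collecting the three input numbers is left to the caller.

```shell
# Hypothetical helper. Arguments: minutes Watchtower has been down, failed
# updates in the last 24 hours, notification failure rate as a whole-number
# percentage. Prints Critical, Warning, or OK (OK is an assumed default).
watchtower_alert_level() {
  down_minutes=$1
  failed_updates=$2
  notify_fail_pct=$3
  if [ "$down_minutes" -gt 5 ]; then
    echo "Critical"
  elif [ "$failed_updates" -gt 3 ] || [ "$notify_fail_pct" -gt 10 ]; then
    echo "Warning"
  else
    echo "OK"
  fi
}
```

All comparisons are strict, matching the ">" in the thresholds, so exactly 5 minutes of downtime or exactly 3 failed updates does not raise an alert.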

🔄 Continuous Improvement

Regular Reviews

Monthly review of emergency procedures
Quarterly testing of all recovery scenarios
Annual architecture assessment

Documentation Updates

Update procedures after each incident
Incorporate lessons learned
Maintain current contact information

Last Updated: 2026-02-09
Next Review: 2026-03-09
Document Owner: Homelab Operations Team
