# Watchtower Emergency Procedures

## 🚨 Emergency Response Guide

This document provides step-by-step procedures for diagnosing and fixing Watchtower issues across your homelab infrastructure.

## 📊 Current Status (Last Updated: 2026-02-09)

### Endpoint Status Summary

| Endpoint | Status | Port | Notification URL | Notes |
|----------|--------|------|------------------|-------|
| **Calypso** | 🟢 HEALTHY | 8080 | `generic+http://localhost:8081/updates` | Fixed crash loop |
| **Atlantis** | 🟢 HEALTHY | 8081 | `generic+http://localhost:8082/updates` | Fixed port conflict |
| **vish-concord-nuc** | 🟢 HEALTHY | 8080 | None configured | Stable for 2+ weeks |
| **rpi5** | ❌ NOT DEPLOYED | - | - | Consider deployment |
| **Homelab VM** | ⚠️ OFFLINE | - | - | Endpoint unreachable |

## 🔧 Emergency Fix Scripts

### Quick Status Check

```bash
# Run comprehensive status check
./scripts/check-watchtower-status.sh
```

### Emergency Crash Loop Fix

```bash
# Fix notification URL format issues
./scripts/portainer-fix-v2.sh
```

### Port Conflict Resolution

```bash
# Fix port conflicts (Atlantis specific)
./scripts/fix-atlantis-port.sh
```

## 🚨 Common Issues and Solutions

### Issue 1: Crash Loop with "unknown service 'http'" Error

**Symptoms:**

```
level=fatal msg="Failed to initialize Shoutrrr notifications: error initializing router services: unknown service \"http\""
```

**Root Cause:** Invalid Shoutrrr notification URL format. Shoutrrr reads the URL scheme as the service name, so a scheme it does not recognize (here `http`) aborts startup.

**Solution:**

```bash
# WRONG FORMAT:
WATCHTOWER_NOTIFICATION_URL=ntfy://localhost:8081/updates?insecure=yes

# CORRECT FORMAT:
WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates
```

**Emergency Fix:**

1. Stop the crash-looping container
2. Remove the broken container
3. Recreate it with the correct notification URL format
4. Start the new container

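Before recreating the container, the notification URL can be checked for the scheme problem named in the fatal error above. The following is a minimal sketch, not part of the fix scripts: the function name `shoutrrr_scheme_ok` is hypothetical, and it only rejects a missing scheme or a bare `http`/`https` scheme rather than validating every Shoutrrr service.

```bash
# Sketch: reject notification URLs whose scheme is a bare transport.
# Assumption: shoutrrr_scheme_ok is an illustrative helper, not an existing script.
shoutrrr_scheme_ok() {
  local url="$1" scheme
  case "$url" in
    *://*) scheme="${url%%://*}" ;;   # text before the first "://"
    *) echo "no scheme in: $url" >&2; return 1 ;;
  esac
  case "$scheme" in
    http|https)
      # Shoutrrr would fail with: unknown service "http"
      echo "bare '$scheme' is not a Shoutrrr service; try generic+$url" >&2
      return 1 ;;
  esac
  return 0
}
```

Calling `shoutrrr_scheme_ok "$WATCHTOWER_NOTIFICATION_URL" || exit 1` ahead of the recreate step would stop a known-bad URL from starting another crash loop.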
### Issue 2: Port Conflict (Address Already in Use)

**Symptoms:**

```
Error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use
```

**Solution:**

1. Identify the service already bound to port 8080
2. Choose an alternative host port (8081, 8082, etc.)
3. Update the port mapping in the container configuration

**Emergency Fix:** Map container port 8080 to a different host port in `HostConfig`:

```json
"PortBindings": {"8080/tcp": [{"HostPort": "8081"}]}
```

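Choosing the alternative port in step 2 can be scripted. This is a minimal sketch under assumptions: `first_free_port` is a hypothetical helper, and it works on a list of ports you pass in, so it does not probe the host itself.

```bash
# Sketch: print the first candidate port that is not in the "used" list.
# Usage: first_free_port "USED PORTS (space separated)" CANDIDATE...
# Assumption: first_free_port is an illustrative name, not an existing script.
first_free_port() {
  local used=" $1 " port
  shift
  for port in "$@"; do
    case "$used" in
      *" $port "*) ;;                 # already bound, try the next candidate
      *) echo "$port"; return 0 ;;
    esac
  done
  return 1                            # every candidate is taken
}
```

On a Linux host the used list could be gathered with something like `ss -Htln | awk '{n=split($4,a,":"); print a[n]}' | tr '\n' ' '`, then passed as the first argument along with candidates such as `8080 8081 8082`.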
### Issue 3: Notification Service Connection Refused

**Symptoms:**

```
error="Post \"http://localhost:8081/updates\": dial tcp 127.0.0.1:8081: connect: connection refused"
```

**Root Cause:** The ntfy service is not listening on the target port. Note that inside a bridge-networked container, `localhost` resolves to the container itself rather than the Docker host.

**Solutions:**

1. **Deploy ntfy service locally:**

   ```yaml
   # hosts/[hostname]/ntfy.yaml
   version: '3.8'
   services:
     ntfy:
       image: binwiederhier/ntfy
       ports:
         - "8081:80"
       command: serve
       volumes:
         - ntfy-data:/var/lib/ntfy

   # Named volumes must be declared at the top level
   volumes:
     ntfy-data:
   ```

2. **Use external ntfy service:**

   ```bash
   WATCHTOWER_NOTIFICATION_URL=ntfy://ntfy.vish.gg/REDACTED_NTFY_TOPIC
   ```

3. **Disable notifications temporarily:**

   ```bash
   # Remove both variables from the container definition, then recreate the container.
   # In a shell that launches Watchtower directly, unset them first:
   unset WATCHTOWER_NOTIFICATIONS
   unset WATCHTOWER_NOTIFICATION_URL
   ```

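To see which address Watchtower is actually failing to reach, the refused targets can be pulled out of the log lines shown under Symptoms. A minimal sketch, assuming the log format matches the sample above; `refused_targets` is a hypothetical helper name.

```bash
# Sketch: list the unique host:port pairs that refused a connection.
# Reads log lines on stdin; assumes the "dial tcp ADDR: connect: connection refused" format.
refused_targets() {
  sed -n 's/.*dial tcp \([^ ]*\): connect: connection refused.*/\1/p' | sort -u
}
```

For example, `docker logs watchtower 2>&1 | refused_targets` would print `127.0.0.1:8081` for the sample error, which is the port to check for a running ntfy service.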
## 🔍 Diagnostic Commands

### Check Container Status

```bash
# Via Portainer API
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/json" | \
  jq '.[] | select(.Names[]? | contains("watchtower"))'
```

### View Container Logs

```bash
# Last 50 lines
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/$CONTAINER_ID/logs?stdout=true&stderr=true&tail=50"
```

### Check Port Usage

```bash
# SSH to host and check port usage
netstat -tulpn | grep :8080
lsof -i :8080
```

### Verify Notification Service

```bash
# Test ntfy service
curl -d "Test message" http://localhost:8081/updates
```

## 🛠️ Manual Recovery Procedures

### Complete Watchtower Rebuild

1. **Stop and remove existing container:**

   ```bash
   docker stop watchtower
   docker rm watchtower
   ```

2. **Pull latest image:**

   ```bash
   docker pull containrrr/watchtower:latest
   ```

3. **Deploy with correct configuration:**

   ```bash
   docker run -d \
     --name watchtower \
     --restart always \
     -p 8080:8080 \
     -v /var/run/docker.sock:/var/run/docker.sock \
     -e WATCHTOWER_CLEANUP=true \
     -e WATCHTOWER_INCLUDE_RESTARTING=true \
     -e WATCHTOWER_INCLUDE_STOPPED=true \
     -e WATCHTOWER_REVIVE_STOPPED=false \
     -e WATCHTOWER_POLL_INTERVAL=3600 \
     -e WATCHTOWER_TIMEOUT=10s \
     -e WATCHTOWER_HTTP_API_UPDATE=true \
     -e WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" \
     -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
     -e WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates \
     -e TZ=America/Los_Angeles \
     containrrr/watchtower:latest
   ```

### Notification Service Deployment

1. **Deploy ntfy service:**

   ```bash
   docker run -d \
     --name ntfy \
     --restart always \
     -p 8081:80 \
     -v ntfy-data:/var/lib/ntfy \
     binwiederhier/ntfy serve
   ```

2. **Test notification:**

   ```bash
   curl -d "Watchtower test notification" http://localhost:8081/updates
   ```

## 📋 Preventive Measures

### Regular Health Checks

```bash
# Add to crontab for automated monitoring
0 */6 * * * /home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh
```

### Configuration Validation

```bash
# Validate Docker Compose before deployment
docker-compose -f watchtower.yml config
```

### Backup Configurations

```bash
# Backup working configurations
cp watchtower.yml watchtower.yml.backup.$(date +%Y%m%d)
```

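The backup one-liner can be wrapped so that a missing file or a failed copy is reported instead of passing silently. This is a minimal sketch; `backup_config` is a hypothetical helper, and it keeps the same `.backup.YYYYMMDD` naming used above.

```bash
# Sketch: copy a config file to FILE.backup.YYYYMMDD and verify the copy.
# Assumption: backup_config is an illustrative name, not an existing script.
backup_config() {
  local src="$1" dest
  [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }
  dest="$src.backup.$(date +%Y%m%d)"
  # cp -p keeps mode and timestamps; cmp -s confirms the bytes match
  cp -p "$src" "$dest" && cmp -s "$src" "$dest" && echo "$dest"
}
```

Running `backup_config watchtower.yml` prints the path of the new backup on success and returns a non-zero status otherwise, which makes it usable from the cron health check.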
## 🔄 Recovery Testing

### Monthly Recovery Drill

1. Intentionally stop Watchtower on test endpoint
2. Run emergency recovery procedures
3. Verify functionality and notifications
4. Document any issues or improvements needed

### Notification Testing

```bash
# Test all notification endpoints
for endpoint in localhost:8081 localhost:8082 ntfy.vish.gg; do
  curl -d "Test from $(hostname)" "http://$endpoint/homelab-alerts"
done
```

## 📞 Escalation Procedures

### Level 1: Automated Recovery

- Scripts attempt automatic recovery
- Status checks verify success
- Notifications sent on failure

### Level 2: Manual Intervention

- Review logs and error messages
- Apply manual fixes using this guide
- Update configurations as needed

### Level 3: Infrastructure Review

- Assess overall architecture
- Consider alternative solutions
- Update emergency procedures

## 📚 Reference Information

### Shoutrrr URL Formats

```bash
# Generic HTTP webhook
generic+http://localhost:8081/updates

# ntfy service (HTTPS)
ntfy://ntfy.example.com/topic

# Discord webhook (token and ID both come from the webhook URL)
discord://token@webhook_id

# Slack webhook
slack://token@channel
```

### Environment Variables Reference

```bash
WATCHTOWER_CLEANUP=true                           # Remove old images
WATCHTOWER_INCLUDE_RESTARTING=true                # Update restarting containers
WATCHTOWER_INCLUDE_STOPPED=true                   # Update stopped containers
WATCHTOWER_REVIVE_STOPPED=false                   # Don't start stopped containers
WATCHTOWER_POLL_INTERVAL=3600                     # Check every hour
WATCHTOWER_TIMEOUT=10s                            # Container stop timeout
WATCHTOWER_HTTP_API_UPDATE=true                   # Enable HTTP API
WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"   # API authentication
WATCHTOWER_NOTIFICATIONS=shoutrrr                 # Enable notifications
WATCHTOWER_NOTIFICATION_URL=url                   # Notification endpoint
TZ=America/Los_Angeles                            # Timezone
```

### API Endpoints

```bash
# Portainer API base
BASE_URL="http://vishinator.synology.me:10000"

# Endpoint IDs
ATLANTIS_ID=2
CALYPSO_ID=443397
CONCORD_NUC_ID=443398
RPI5_ID=443395
HOMELAB_VM_ID=443399
```

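The IDs above plug into the `$ENDPOINT_ID` placeholder used by the diagnostic commands. As a minimal sketch, a lookup helper can build the container-list URL from an endpoint name; the function names and the lowercase endpoint labels are assumptions made for illustration, while the base URL and IDs are the ones listed above.

```bash
# Sketch: map an endpoint name to its Portainer ID and build the containers URL.
# Assumption: endpoint_id/containers_url and the lowercase labels are illustrative.
BASE_URL="http://vishinator.synology.me:10000"

endpoint_id() {
  case "$1" in
    atlantis)    echo 2 ;;
    calypso)     echo 443397 ;;
    concord-nuc) echo 443398 ;;
    rpi5)        echo 443395 ;;
    homelab-vm)  echo 443399 ;;
    *) echo "unknown endpoint: $1" >&2; return 1 ;;
  esac
}

containers_url() {
  local id
  id="$(endpoint_id "$1")" || return 1
  echo "$BASE_URL/api/endpoints/$id/docker/containers/json"
}
```

With this in place, `curl -H "X-API-Key: $API_KEY" "$(containers_url calypso)"` would query the Calypso endpoint without retyping its ID.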
## 🔐 Security Considerations

### API Key Management

- Store API keys securely
- Rotate keys regularly
- Use environment variables, not hardcoded values

### Container Security

- Run with minimal privileges
- Use read-only Docker socket when possible
- Implement network segmentation

### Notification Security

- Use HTTPS for external notifications
- Implement authentication for notification endpoints
- Avoid sensitive information in notification messages

## 📈 Monitoring and Metrics

### Key Metrics to Track

- Container update success rate
- Notification delivery success
- Recovery time from failures
- Resource usage trends

### Alerting Thresholds

- Watchtower down for > 5 minutes: Critical
- Failed updates > 3 in 24 hours: Warning
- Notification failures > 10%: Warning

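The thresholds above can be expressed as a small classifier for use in a status script. This is a minimal sketch: `classify_watchtower` and its argument order are hypothetical, and how the four counters are collected is left to the caller.

```bash
# Sketch: turn raw counters into an alert level using the thresholds above.
# Usage: classify_watchtower DOWN_MINUTES FAILED_UPDATES_24H NOTIFY_SENT NOTIFY_FAILED
# Assumption: classify_watchtower is an illustrative name, not an existing script.
classify_watchtower() {
  local down_min="$1" failed="$2" sent="$3" notify_failed="$4"
  # Down for more than 5 minutes: critical
  if [ "$down_min" -gt 5 ]; then echo critical; return; fi
  # More than 3 failed updates in 24 hours: warning
  if [ "$failed" -gt 3 ]; then echo warning; return; fi
  # More than 10% notification failures, compared with integer arithmetic
  if [ "$sent" -gt 0 ] && [ $((notify_failed * 100)) -gt $((sent * 10)) ]; then
    echo warning; return
  fi
  echo ok
}
```

The comparisons are strict, so exactly 5 minutes of downtime, 3 failed updates, or a 10% failure rate still reports `ok`, matching the `>` signs in the list.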
## 🔄 Continuous Improvement

### Regular Reviews

- Monthly review of emergency procedures
- Quarterly testing of all recovery scenarios
- Annual architecture assessment

### Documentation Updates

- Update procedures after each incident
- Incorporate lessons learned
- Maintain current contact information

---

**Last Updated:** 2026-02-09
**Next Review:** 2026-03-09
**Document Owner:** Homelab Operations Team