# Watchtower Emergency Procedures
## 🚨 Emergency Response Guide
This document provides step-by-step procedures for diagnosing and fixing Watchtower issues across your homelab infrastructure.
## 📊 Current Status (Last Updated: 2026-02-09)
### Endpoint Status Summary
| Endpoint | Status | Port | Notification URL | Notes |
|----------|--------|------|------------------|-------|
| **Calypso** | 🟢 HEALTHY | 8080 | `generic+http://localhost:8081/updates` | Fixed crash loop |
| **Atlantis** | 🟢 HEALTHY | 8081 | `generic+http://localhost:8082/updates` | Fixed port conflict |
| **vish-concord-nuc** | 🟢 HEALTHY | 8080 | None configured | Stable for 2+ weeks |
| **rpi5** | ❌ NOT DEPLOYED | - | - | Consider deployment |
| **Homelab VM** | ⚠️ OFFLINE | - | - | Endpoint unreachable |
## 🔧 Emergency Fix Scripts
### Quick Status Check
```bash
# Run comprehensive status check
./scripts/check-watchtower-status.sh
```
### Emergency Crash Loop Fix
```bash
# Fix notification URL format issues
./scripts/portainer-fix-v2.sh
```
### Port Conflict Resolution
```bash
# Fix port conflicts (Atlantis specific)
./scripts/fix-atlantis-port.sh
```
## 🚨 Common Issues and Solutions
### Issue 1: Crash Loop with "unknown service 'http'" Error
**Symptoms:**
```
level=fatal msg="Failed to initialize Shoutrrr notifications: error initializing router services: unknown service \"http\""
```
**Root Cause:** Invalid Shoutrrr notification URL format
**Solution:**
```bash
# WRONG FORMAT (a bare http:// scheme is read as a Shoutrrr service named "http"):
WATCHTOWER_NOTIFICATION_URL=http://localhost:8081/updates
# CORRECT FORMAT:
WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates
```
**Emergency Fix:**
1. Stop the crash looping container
2. Remove the broken container
3. Recreate with correct notification URL format
4. Start the new container
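Before recreating the container in step 3, it can help to confirm that the notification URL will not trigger the same fatal error. The sketch below is a hypothetical pre-flight helper (`check_notification_url` is an illustrative name, not part of Watchtower or this repository's scripts). It assumes, based on the error message above, that a bare `http` or `https` scheme is looked up as a service name, and it checks only the scheme prefix, not the rest of the URL.

```bash
# Hypothetical pre-flight check; the function name is illustrative only.
# Returns 0 if the URL has a non-HTTP scheme, 1 otherwise.
check_notification_url() {
  local url=$1
  case "$url" in
    http://*|https://*)
      # A bare HTTP scheme produced the "unknown service" failure at startup
      echo "invalid: bare scheme, try generic+${url}" >&2
      return 1
      ;;
    *://*)
      return 0
      ;;
    *)
      echo "invalid: no scheme found in '${url}'" >&2
      return 1
      ;;
  esac
}
```

Calling `check_notification_url "$WATCHTOWER_NOTIFICATION_URL"` before `docker run` lets a deployment script stop early instead of starting another crash loop.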
### Issue 2: Port Conflict (Address Already in Use)
**Symptoms:**
```
Error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use
```
**Solution:**
1. Identify conflicting service on port 8080
2. Use alternative port (8081, 8082, etc.)
3. Update port mapping in container configuration
**Emergency Fix:** use a different host port in the container's `HostConfig`:
```json
"PortBindings": {"8080/tcp": [{"HostPort": "8081"}]}
```
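Steps 1 and 2 of the solution can be scripted. The following is a minimal sketch, assuming the list of ports already in use is supplied on standard input, one per line. `pick_free_port` is a hypothetical helper name, and the port range is only an example.

```bash
# pick_free_port START END
# Reads used port numbers (one per line) on stdin and prints the first
# port in [START, END] that is not in that list. Returns 1 if none is free.
pick_free_port() {
  local start=$1 end=$2 used port
  used=$(cat)
  port=$start
  while [ "$port" -le "$end" ]; do
    if ! printf '%s\n' "$used" | grep -qx "$port"; then
      printf '%s\n' "$port"
      return 0
    fi
    port=$((port + 1))
  done
  return 1
}

# Example (run on the Docker host, not executed here):
#   netstat -tln | awk 'NR > 2 { print $4 }' | sed 's/.*://' | pick_free_port 8080 8090
```

The printed port can then be used as the `HostPort` value in the port binding.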
### Issue 3: Notification Service Connection Refused
**Symptoms:**
```
error="Post \"http://localhost:8081/updates\": dial tcp 127.0.0.1:8081: connect: connection refused"
```
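**Root Cause:** Nothing is listening on the target address. Either the ntfy service is not running on that port, or Watchtower runs in a bridged container, where `localhost` refers to the Watchtower container itself rather than the Docker host.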
**Solutions:**
1. **Deploy ntfy service locally:**
```yaml
# hosts/[hostname]/ntfy.yaml
version: '3.8'
services:
  ntfy:
    image: binwiederhier/ntfy
    ports:
      - "8081:80"
    command: serve
    volumes:
      - ntfy-data:/var/lib/ntfy
# Named volumes must be declared at the top level
volumes:
  ntfy-data:
```
2. **Use external ntfy service:**
```bash
WATCHTOWER_NOTIFICATION_URL=ntfy://ntfy.vish.gg/REDACTED_NTFY_TOPIC
```
3. **Disable notifications temporarily:**
```bash
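# Omit these variables when recreating the container (drop the matching
# -e flags or compose entries). Unsetting them in the host shell only
# helps if the container is started from that shell's environment.
unset WATCHTOWER_NOTIFICATIONS
unset WATCHTOWER_NOTIFICATION_URL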
```
## 🔍 Diagnostic Commands
### Check Container Status
```bash
# Via Portainer API
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/json" | \
  jq '.[] | select(.Names[]? | contains("watchtower"))'
```
### View Container Logs
```bash
# Last 50 lines
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/$CONTAINER_ID/logs?stdout=true&stderr=true&tail=50"
```
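Once the logs are retrieved, the error signatures listed under Common Issues can be matched mechanically. The sketch below is a hypothetical triage helper (`diagnose_watchtower_log` is an illustrative name) that reads log text on standard input and reports which issue each signature points to. The Issue 2 message is usually reported by Docker when the container is started, so it may show up in the start response rather than in the container log.

```bash
# Hypothetical triage helper: reads log text on stdin and prints one line
# per known error signature from this guide. Returns 1 if nothing matched.
diagnose_watchtower_log() {
  local log found=1
  log=$(cat)
  case "$log" in
    *'unknown service'*)
      echo "issue-1: invalid notification URL format"; found=0 ;;
  esac
  case "$log" in
    *'address already in use'*)
      echo "issue-2: port conflict"; found=0 ;;
  esac
  case "$log" in
    *'connection refused'*)
      echo "issue-3: notification service unreachable"; found=0 ;;
  esac
  return "$found"
}
```

Piping the output of the log request above into `diagnose_watchtower_log` gives a quick pointer to the matching section of this guide.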
### Check Port Usage
```bash
# SSH to host and check port usage
netstat -tulpn | grep :8080
lsof -i :8080
```
### Verify Notification Service
```bash
# Test ntfy service
curl -d "Test message" http://localhost:8081/updates
```
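To test the address that Watchtower is actually configured with, rather than a hand-typed one, the curl target can be derived from the notification URL. This is a minimal sketch: `notification_probe_url` is a hypothetical helper, and stripping the `generic+` prefix is a plain string rewrite for test purposes, not a reimplementation of Shoutrrr's URL parsing.

```bash
# notification_probe_url URL
# Prints the plain HTTP(S) URL behind a generic+http(s):// notification URL.
# Other schemes are rejected because they cannot be probed with a simple POST.
notification_probe_url() {
  local url=$1
  case "$url" in
    generic+http://*|generic+https://*)
      printf '%s\n' "${url#generic+}"
      ;;
    *)
      echo "unsupported scheme in '${url}'" >&2
      return 1
      ;;
  esac
}

# Example (not executed here):
#   curl -fsS -d "Test message" "$(notification_probe_url "$WATCHTOWER_NOTIFICATION_URL")"
```

Run the probe from the same network context as Watchtower where possible, since `localhost` resolves differently inside a container than on the host.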
## 🛠️ Manual Recovery Procedures
### Complete Watchtower Rebuild
1. **Stop and remove existing container:**
```bash
docker stop watchtower
docker rm watchtower
```
2. **Pull latest image:**
```bash
docker pull containrrr/watchtower:latest
```
3. **Deploy with correct configuration:**
```bash
docker run -d \
  --name watchtower \
  --restart always \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e WATCHTOWER_CLEANUP=true \
  -e WATCHTOWER_INCLUDE_RESTARTING=true \
  -e WATCHTOWER_INCLUDE_STOPPED=true \
  -e WATCHTOWER_REVIVE_STOPPED=false \
  -e WATCHTOWER_POLL_INTERVAL=3600 \
  -e WATCHTOWER_TIMEOUT=10s \
  -e WATCHTOWER_HTTP_API_UPDATE=true \
  -e WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" \
  -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
  -e WATCHTOWER_NOTIFICATION_URL="generic+http://localhost:8081/updates" \
  -e TZ=America/Los_Angeles \
  containrrr/watchtower:latest
```
### Notification Service Deployment
1. **Deploy ntfy service:**
```bash
docker run -d \
  --name ntfy \
  --restart always \
  -p 8081:80 \
  -v ntfy-data:/var/lib/ntfy \
  binwiederhier/ntfy serve
```
2. **Test notification:**
```bash
curl -d "Watchtower test notification" http://localhost:8081/updates
```
## 📋 Preventive Measures
### Regular Health Checks
```bash
# Add to crontab for automated monitoring
0 */6 * * * /home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh
```
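The contents of `check-watchtower-status.sh` are not shown in this guide. As an illustration of the kind of decision such a check might make, the sketch below maps a Docker container state and restart count to the status labels used in the Endpoint Status Summary. `classify_container` is a hypothetical function, and the restart threshold of 3 is an assumed value, not one taken from the real script.

```bash
# classify_container STATE [RESTART_COUNT]
# STATE is the Docker state string (running, restarting, exited, ...);
# an empty STATE means no container was found.
classify_container() {
  local state=$1 restarts=${2:-0}
  case "$state" in
    running)
      # Assumed threshold: more than 3 restarts suggests instability
      if [ "$restarts" -gt 3 ]; then
        echo "WARNING"
      else
        echo "HEALTHY"
      fi
      ;;
    restarting) echo "CRASH_LOOP" ;;
    "") echo "NOT_DEPLOYED" ;;
    *) echo "DOWN" ;;
  esac
}

# Example (run on the Docker host, not executed here):
#   classify_container \
#     "$(docker inspect -f '{{.State.Status}}' watchtower 2>/dev/null)" \
#     "$(docker inspect -f '{{.RestartCount}}' watchtower 2>/dev/null)"
```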
### Configuration Validation
```bash
# Validate Docker Compose before deployment
docker-compose -f watchtower.yml config
```
### Backup Configurations
```bash
# Backup working configurations
cp watchtower.yml watchtower.yml.backup.$(date +%Y%m%d)
```
## 🔄 Recovery Testing
### Monthly Recovery Drill
1. Intentionally stop Watchtower on test endpoint
2. Run emergency recovery procedures
3. Verify functionality and notifications
4. Document any issues or improvements needed
### Notification Testing
```bash
# Test all notification endpoints
for endpoint in localhost:8081 localhost:8082 ntfy.vish.gg; do
  curl -d "Test from $(hostname)" "http://$endpoint/homelab-alerts"
done
```
## 📞 Escalation Procedures
### Level 1: Automated Recovery
- Scripts attempt automatic recovery
- Status checks verify success
- Notifications sent on failure
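The Level 1 flow (attempt a fix, verify, notify on failure) could be wired together roughly as follows. This is a sketch under assumptions: `attempt_recovery` is a hypothetical wrapper, and the fix, check, and notify commands passed to it stand in for whatever scripts are actually used.

```bash
# attempt_recovery MAX_TRIES FIX_CMD CHECK_CMD NOTIFY_CMD
# Runs FIX_CMD followed by CHECK_CMD up to MAX_TRIES times. If no attempt
# passes the check, calls NOTIFY_CMD with a message and returns 1.
# Each command is the name of a function or executable taking no arguments
# (NOTIFY_CMD receives the message as its only argument).
attempt_recovery() {
  local max=$1 fix=$2 check=$3 notify=$4 try=1
  while [ "$try" -le "$max" ]; do
    if "$fix" && "$check"; then
      return 0
    fi
    try=$((try + 1))
  done
  "$notify" "recovery failed after ${max} attempts"
  return 1
}
```

For example, `attempt_recovery 3 ./scripts/portainer-fix-v2.sh ./scripts/check-watchtower-status.sh send_alert` would retry the crash loop fix three times, where `send_alert` is a placeholder for a notification command. A non-zero exit is the signal to move to Level 2.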
### Level 2: Manual Intervention
- Review logs and error messages
- Apply manual fixes using this guide
- Update configurations as needed
### Level 3: Infrastructure Review
- Assess overall architecture
- Consider alternative solutions
- Update emergency procedures
## 📚 Reference Information
### Shoutrrr URL Formats
```bash
# Generic HTTP webhook
generic+http://localhost:8081/updates
# ntfy service (HTTPS)
ntfy://ntfy.example.com/topic
# Discord webhook (token and webhook ID taken from the webhook URL)
discord://token@webhook_id
# Slack webhook
slack://token@channel
```
### Environment Variables Reference
```bash
WATCHTOWER_CLEANUP=true # Remove old images
WATCHTOWER_INCLUDE_RESTARTING=true # Update restarting containers
WATCHTOWER_INCLUDE_STOPPED=true # Update stopped containers
WATCHTOWER_REVIVE_STOPPED=false # Don't start stopped containers
WATCHTOWER_POLL_INTERVAL=3600 # Check every hour
WATCHTOWER_TIMEOUT=10s # Container stop timeout
WATCHTOWER_HTTP_API_UPDATE=true # Enable HTTP API updates (periodic polls then need WATCHTOWER_HTTP_API_PERIODIC_POLLS=true)
WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" # API authentication
WATCHTOWER_NOTIFICATIONS=shoutrrr # Enable notifications
WATCHTOWER_NOTIFICATION_URL=url # Notification endpoint
TZ=America/Los_Angeles # Timezone
```
### API Endpoints
```bash
# Portainer API base
BASE_URL="http://vishinator.synology.me:10000"
# Endpoint IDs
ATLANTIS_ID=2
CALYPSO_ID=443397
CONCORD_NUC_ID=443398
RPI5_ID=443395
HOMELAB_VM_ID=443399
```
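The IDs above can be wrapped in a small lookup so that the diagnostic commands take a host name instead of a number. A minimal sketch: `endpoint_id` and `containers_url` are hypothetical helpers, and the lowercase short names are assumed aliases for the endpoints listed in this guide.

```bash
BASE_URL="http://vishinator.synology.me:10000"

# endpoint_id NAME: prints the Portainer endpoint ID for a short host name.
endpoint_id() {
  case "$1" in
    atlantis) echo 2 ;;
    calypso) echo 443397 ;;
    concord-nuc) echo 443398 ;;
    rpi5) echo 443395 ;;
    homelab-vm) echo 443399 ;;
    *) echo "unknown endpoint: $1" >&2; return 1 ;;
  esac
}

# containers_url NAME: prints the container list URL for that endpoint.
containers_url() {
  local id
  id=$(endpoint_id "$1") || return 1
  printf '%s/api/endpoints/%s/docker/containers/json\n' "$BASE_URL" "$id"
}
```

With these defined, `curl -H "X-API-Key: $API_KEY" "$(containers_url calypso)"` lists the containers on Calypso.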
## 🔐 Security Considerations
### API Key Management
- Store API keys securely
- Rotate keys regularly
- Use environment variables, not hardcoded values
### Container Security
- Run with minimal privileges
- Use read-only Docker socket when possible
- Implement network segmentation
### Notification Security
- Use HTTPS for external notifications
- Implement authentication for notification endpoints
- Avoid sensitive information in notification messages
## 📈 Monitoring and Metrics
### Key Metrics to Track
- Container update success rate
- Notification delivery success
- Recovery time from failures
- Resource usage trends
### Alerting Thresholds
- Watchtower down for > 5 minutes: Critical
- Failed updates > 3 in 24 hours: Warning
- Notification failures > 10%: Warning
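These thresholds translate directly into a severity check. The sketch below assumes the three metrics are already collected as whole numbers; `alert_level` is a hypothetical function name, and how the inputs are measured is left open.

```bash
# alert_level DOWN_MINUTES FAILED_UPDATES_24H NOTIFY_FAIL_PERCENT
# Prints CRITICAL, WARNING, or OK according to the thresholds listed above.
# All arguments are integers; the percentage is given without the % sign.
alert_level() {
  local down=$1 failed=$2 notify_pct=$3
  if [ "$down" -gt 5 ]; then
    echo "CRITICAL"
  elif [ "$failed" -gt 3 ] || [ "$notify_pct" -gt 10 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}
```

Values exactly at a threshold (5 minutes, 3 failures, 10%) report `OK`, matching the strict "greater than" wording of the list.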
## 🔄 Continuous Improvement
### Regular Reviews
- Monthly review of emergency procedures
- Quarterly testing of all recovery scenarios
- Annual architecture assessment
### Documentation Updates
- Update procedures after each incident
- Incorporate lessons learned
- Maintain current contact information
---
**Last Updated:** 2026-02-09
**Next Review:** 2026-03-09
**Document Owner:** Homelab Operations Team