Sanitized mirror from private repository - 2026-03-14 10:23:53 UTC
docs/troubleshooting/WATCHTOWER_EMERGENCY_PROCEDURES.md
# Watchtower Emergency Procedures

## 🚨 Emergency Response Guide

This document provides step-by-step procedures for diagnosing and fixing Watchtower issues across your homelab infrastructure.

## 📊 Current Status (Last Updated: 2026-02-09)

### Endpoint Status Summary
| Endpoint | Status | Port | Notification URL | Notes |
|----------|--------|------|------------------|-------|
| **Calypso** | 🟢 HEALTHY | 8080 | `generic+http://localhost:8081/updates` | Fixed crash loop |
| **Atlantis** | 🟢 HEALTHY | 8081 | `generic+http://localhost:8082/updates` | Fixed port conflict |
| **vish-concord-nuc** | 🟢 HEALTHY | 8080 | None configured | Stable for 2+ weeks |
| **rpi5** | ❌ NOT DEPLOYED | - | - | Consider deployment |
| **Homelab VM** | ⚠️ OFFLINE | - | - | Endpoint unreachable |

## 🔧 Emergency Fix Scripts

### Quick Status Check
```bash
# Run comprehensive status check
./scripts/check-watchtower-status.sh
```

### Emergency Crash Loop Fix
```bash
# Fix notification URL format issues
./scripts/portainer-fix-v2.sh
```

### Port Conflict Resolution
```bash
# Fix port conflicts (Atlantis specific)
./scripts/fix-atlantis-port.sh
```

## 🚨 Common Issues and Solutions

### Issue 1: Crash Loop with "unknown service 'http'" Error

**Symptoms:**
```
level=fatal msg="Failed to initialize Shoutrrr notifications: error initializing router services: unknown service \"http\""
```

**Root Cause:** Invalid Shoutrrr notification URL format. Shoutrrr reads the URL scheme as the service name and aborts startup when it does not recognize it.

**Solution:**
```bash
# WRONG FORMAT:
WATCHTOWER_NOTIFICATION_URL=ntfy://localhost:8081/updates?insecure=yes

# CORRECT FORMAT:
WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates
```

**Emergency Fix:**
1. Stop the crash-looping container
2. Remove the broken container
3. Recreate it with the correct notification URL format
4. Start the new container
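
Before recreating the container, the notification URL can be checked so that a bad scheme is caught before Watchtower starts. The following is a minimal sketch, not one of the repository's scripts: the function name `is_valid_shoutrrr_url` is hypothetical, and the service list is a partial allow-list chosen for illustration, so extend it to match the services you actually use.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeeds only when the URL scheme names a
# service from the partial, illustrative allow-list below.
is_valid_shoutrrr_url() {
    url=$1
    case $url in
        *://*) scheme=${url%%://*} ;;   # text before the first "://"
        *) return 1 ;;                  # no scheme at all
    esac
    # "generic+http" and "generic+https" are variants of the generic service,
    # so only the part before "+" is compared.
    service=${scheme%%+*}
    case $service in
        generic|ntfy|discord|slack|smtp|gotify|telegram) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_valid_shoutrrr_url "$WATCHTOWER_NOTIFICATION_URL" || echo "invalid notification URL" >&2` reports a plain `http://` URL before it can cause a crash loop.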

### Issue 2: Port Conflict (Address Already in Use)

**Symptoms:**
```
Error starting userland proxy: listen tcp4 0.0.0.0:8080: bind: address already in use
```

**Solution:**
1. Identify the conflicting service on port 8080
2. Choose an alternative host port (8081, 8082, etc.)
3. Update the port mapping in the container configuration

**Emergency Fix:** Bind a different host port in the container's `HostConfig`:
```json
"PortBindings": {"8080/tcp": [{"HostPort": "8081"}]}
```
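
Choosing the alternative port can be automated. The sketch below scans a range and prints the first port that nothing is listening on. Both function names are hypothetical, and `port_in_use` assumes Linux `netstat -tln` output, where the fourth column is the local address; if `netstat` is missing, every port is reported as free, so treat the result as a hint rather than a guarantee.

```shell
#!/bin/sh
# Succeeds when something is listening on TCP port $1.
# Assumes Linux netstat output (column 4 = local address, e.g. 0.0.0.0:8080).
port_in_use() {
    netstat -tln 2>/dev/null | awk -v suffix=":$1" '
        substr($4, length($4) - length(suffix) + 1) == suffix { found = 1 }
        END { exit !found }'
}

# Prints the first port in the range $1..$2 that port_in_use reports as free.
# Fails without output when every port in the range is taken.
first_free_port() {
    port=$1
    while [ "$port" -le "$2" ]; do
        if ! port_in_use "$port"; then
            echo "$port"
            return 0
        fi
        port=$((port + 1))
    done
    return 1
}
```

Calling `first_free_port 8080 8090` on the affected host gives a candidate value for `HostPort`.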

### Issue 3: Notification Service Connection Refused

**Symptoms:**
```
error="Post \"http://localhost:8081/updates\": dial tcp 127.0.0.1:8081: connect: connection refused"
```

**Root Cause:** The ntfy service is not running on the target port, or it is not reachable from the Watchtower container. On a bridge network, `localhost` inside the container refers to the container itself, not the host.

**Solutions:**
1. **Deploy ntfy service locally:**
   ```yaml
   # hosts/[hostname]/ntfy.yaml
   version: '3.8'
   services:
     ntfy:
       image: binwiederhier/ntfy
       ports:
         - "8081:80"
       command: serve
       volumes:
         - ntfy-data:/var/lib/ntfy

   # Named volumes must be declared at the top level
   volumes:
     ntfy-data:
   ```

2. **Use external ntfy service:**
   ```bash
   WATCHTOWER_NOTIFICATION_URL=ntfy://ntfy.vish.gg/REDACTED_NTFY_TOPIC
   ```

3. **Disable notifications temporarily:**
   ```bash
   # Remove the notification variables from the environment used to create
   # the container, then recreate it (a running container keeps its
   # original environment)
   unset WATCHTOWER_NOTIFICATIONS
   unset WATCHTOWER_NOTIFICATION_URL
   ```

## 🔍 Diagnostic Commands

### Check Container Status
```bash
# Via Portainer API
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/json" | \
  jq '.[] | select(.Names[]? | contains("watchtower"))'
```

### View Container Logs
```bash
# Last 50 lines
curl -H "X-API-Key: $API_KEY" \
  "$BASE_URL/api/endpoints/$ENDPOINT_ID/docker/containers/$CONTAINER_ID/logs?stdout=true&stderr=true&tail=50"
```

### Check Port Usage
```bash
# SSH to host and check port usage
netstat -tulpn | grep :8080
lsof -i :8080
```

### Verify Notification Service
```bash
# Test ntfy service
curl -d "Test message" http://localhost:8081/updates
```
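
A single probe can fail while ntfy is still starting. The sketch below wraps a command in a small retry helper; the name `retry` and its argument order (attempts, delay in seconds, then the command) are assumptions made for illustration.

```shell
#!/bin/sh
# Runs a command up to $1 times, sleeping $2 seconds between attempts.
# Succeeds as soon as the command does; fails if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For example, `retry 5 2 curl -fsS -o /dev/null -d "Test message" http://localhost:8081/updates` tries the test message up to five times, two seconds apart. Note that every successful attempt publishes a message to the topic.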

## 🛠️ Manual Recovery Procedures

### Complete Watchtower Rebuild

1. **Stop and remove existing container:**
   ```bash
   docker stop watchtower
   docker rm watchtower
   ```

2. **Pull latest image:**
   ```bash
   docker pull containrrr/watchtower:latest
   ```

3. **Deploy with correct configuration:**
   ```bash
   docker run -d \
     --name watchtower \
     --restart always \
     -p 8080:8080 \
     -v /var/run/docker.sock:/var/run/docker.sock \
     -e WATCHTOWER_CLEANUP=true \
     -e WATCHTOWER_INCLUDE_RESTARTING=true \
     -e WATCHTOWER_INCLUDE_STOPPED=true \
     -e WATCHTOWER_REVIVE_STOPPED=false \
     -e WATCHTOWER_POLL_INTERVAL=3600 \
     -e WATCHTOWER_TIMEOUT=10s \
     -e WATCHTOWER_HTTP_API_UPDATE=true \
     -e WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN" \
     -e WATCHTOWER_NOTIFICATIONS=shoutrrr \
     -e WATCHTOWER_NOTIFICATION_URL=generic+http://localhost:8081/updates \
     -e TZ=America/Los_Angeles \
     containrrr/watchtower:latest
   ```
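
After a rebuild it is worth confirming that the new container has not fallen back into a crash loop. The sketch below classifies a container from its state and restart count. The function name `classify_container` and the default limit of 3 restarts are assumptions, and the rule is deliberately crude: the restart count is cumulative, so a container that recovered long ago can still be flagged.

```shell
#!/bin/sh
# Classifies a container as "healthy", "crash-loop", or "down".
# $1: state string (e.g. "running", "restarting", "exited")
# $2: restart count
# $3: restart count treated as a crash loop (assumed default: 3)
classify_container() {
    state=$1
    restarts=$2
    limit=${3:-3}
    if [ "$state" = "restarting" ] || [ "$restarts" -ge "$limit" ]; then
        echo "crash-loop"
    elif [ "$state" = "running" ]; then
        echo "healthy"
    else
        echo "down"
    fi
}

# Example on a Docker host (left commented out because it needs Docker):
#   set -- $(docker inspect -f '{{.State.Status}} {{.RestartCount}}' watchtower)
#   classify_container "$1" "$2"
```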

### Notification Service Deployment

1. **Deploy ntfy service:**
   ```bash
   docker run -d \
     --name ntfy \
     --restart always \
     -p 8081:80 \
     -v ntfy-data:/var/lib/ntfy \
     binwiederhier/ntfy serve
   ```

2. **Test notification:**
   ```bash
   curl -d "Watchtower test notification" http://localhost:8081/updates
   ```

## 📋 Preventive Measures

### Regular Health Checks
```bash
# Add to crontab for automated monitoring
0 */6 * * * /home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh
```

### Configuration Validation
```bash
# Validate Docker Compose before deployment
docker-compose -f watchtower.yml config
```

### Backup Configurations
```bash
# Backup working configurations
cp watchtower.yml watchtower.yml.backup.$(date +%Y%m%d)
```

## 🔄 Recovery Testing

### Monthly Recovery Drill
1. Intentionally stop Watchtower on a test endpoint
2. Run the emergency recovery procedures
3. Verify functionality and notifications
4. Document any issues or improvements needed

### Notification Testing
```bash
# Test all notification endpoints
for endpoint in localhost:8081 localhost:8082 ntfy.vish.gg; do
  curl -d "Test from $(hostname)" "http://$endpoint/homelab-alerts"
done
```

## 📞 Escalation Procedures

### Level 1: Automated Recovery
- Scripts attempt automatic recovery
- Status checks verify success
- Notifications sent on failure

### Level 2: Manual Intervention
- Review logs and error messages
- Apply manual fixes using this guide
- Update configurations as needed

### Level 3: Infrastructure Review
- Assess overall architecture
- Consider alternative solutions
- Update emergency procedures

## 📚 Reference Information

### Shoutrrr URL Formats
```bash
# Generic HTTP webhook
generic+http://localhost:8081/updates

# ntfy service (HTTPS)
ntfy://ntfy.example.com/topic

# Discord webhook
discord://token@webhook_id

# Slack webhook
slack://token@channel
```

### Environment Variables Reference
```bash
WATCHTOWER_CLEANUP=true                          # Remove old images
WATCHTOWER_INCLUDE_RESTARTING=true               # Update restarting containers
WATCHTOWER_INCLUDE_STOPPED=true                  # Update stopped containers
WATCHTOWER_REVIVE_STOPPED=false                  # Don't start stopped containers
WATCHTOWER_POLL_INTERVAL=3600                    # Check every hour
WATCHTOWER_TIMEOUT=10s                           # Container stop timeout
WATCHTOWER_HTTP_API_UPDATE=true                  # Enable HTTP API (stops periodic polls unless WATCHTOWER_HTTP_API_PERIODIC_POLLS=true)
WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"  # API authentication
WATCHTOWER_NOTIFICATIONS=shoutrrr                # Enable notifications
WATCHTOWER_NOTIFICATION_URL=url                  # Notification endpoint
TZ=America/Los_Angeles                           # Timezone
```

### API Endpoints
```bash
# Portainer API base
BASE_URL="http://vishinator.synology.me:10000"

# Endpoint IDs
ATLANTIS_ID=2
CALYPSO_ID=443397
CONCORD_NUC_ID=443398
RPI5_ID=443395
HOMELAB_VM_ID=443399
```
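
The IDs above slot into the Portainer Docker-proxy path used by the diagnostic commands. As a sketch, the two helpers below map an endpoint name to its ID and build the request URL. The function names and the lowercase endpoint names are hypothetical; the IDs and base URL are the values listed in this guide.

```shell
#!/bin/sh
# Portainer base URL from this guide; override it through the environment.
BASE_URL=${BASE_URL:-http://vishinator.synology.me:10000}

# Maps an endpoint name to its Portainer endpoint ID.
endpoint_id() {
    case $1 in
        atlantis)    echo 2 ;;
        calypso)     echo 443397 ;;
        concord-nuc) echo 443398 ;;
        rpi5)        echo 443395 ;;
        homelab-vm)  echo 443399 ;;
        *) echo "unknown endpoint: $1" >&2; return 1 ;;
    esac
}

# Builds the Docker-proxy URL for endpoint $1 and API path $2.
docker_api_url() {
    id=$(endpoint_id "$1") || return 1
    echo "$BASE_URL/api/endpoints/$id/docker/$2"
}
```

With these in place, `curl -H "X-API-Key: $API_KEY" "$(docker_api_url calypso containers/json)"` lists the containers on Calypso without copying the ID by hand.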

## 🔐 Security Considerations

### API Key Management
- Store API keys securely
- Rotate keys regularly
- Use environment variables, not hardcoded values

### Container Security
- Run with minimal privileges
- Use read-only Docker socket when possible
- Implement network segmentation

### Notification Security
- Use HTTPS for external notifications
- Implement authentication for notification endpoints
- Avoid sensitive information in notification messages

## 📈 Monitoring and Metrics

### Key Metrics to Track
- Container update success rate
- Notification delivery success
- Recovery time from failures
- Resource usage trends

### Alerting Thresholds
- Watchtower down for > 5 minutes: Critical
- Failed updates > 3 in 24 hours: Warning
- Notification failures > 10%: Warning
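
The thresholds above can be encoded as a single check. This is a minimal sketch: the function name `alert_level` and its three whole-number inputs are assumptions, and collecting the inputs (downtime, failed updates, failure percentage) is left to whatever monitoring is already in place.

```shell
#!/bin/sh
# Prints "critical", "warning", or "ok" for the thresholds listed above.
# $1: minutes Watchtower has been down
# $2: failed updates in the last 24 hours
# $3: notification failures as a whole-number percentage
alert_level() {
    if [ "$1" -gt 5 ]; then
        echo "critical"
    elif [ "$2" -gt 3 ] || [ "$3" -gt 10 ]; then
        echo "warning"
    else
        echo "ok"
    fi
}
```

The comparisons are strict, matching the `>` in the list, so exactly 5 minutes of downtime does not yet count as critical.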

## 🔄 Continuous Improvement

### Regular Reviews
- Monthly review of emergency procedures
- Quarterly testing of all recovery scenarios
- Annual architecture assessment

### Documentation Updates
- Update procedures after each incident
- Incorporate lessons learned
- Maintain current contact information

---

**Last Updated:** 2026-02-09
**Next Review:** 2026-03-09
**Document Owner:** Homelab Operations Team