# Watchtower Atlantis Incident Report - February 9, 2026
## 📋 Incident Summary
| Field | Value |
|-------|-------|
| **Date** | February 9, 2026 |
| **Time** | 01:45 PST |
| **Severity** | Medium |
| **Status** | ✅ RESOLVED |
| **Affected Service** | Watchtower (Atlantis) |
| **Duration** | ~15 minutes |
| **Reporter** | User |
| **Resolver** | OpenHands Agent |
## 🚨 Problem Description
**Issue**: Watchtower container on Atlantis server was not running, preventing automatic Docker container updates.
**Symptoms**:
- Watchtower container in "Created" state but not running
- No automatic container updates occurring
- Container logs empty (never started)
## 🔍 Root Cause Analysis
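**Primary Cause**: Container was created but never started. The exact trigger was not determined. A manual stop is unlikely, because a stopped container shows as "Exited" and keeps its earlier logs, whereas this one was in "Created" with empty logs. Likely candidates:
- Container created or recreated without a subsequent start
- System restart without proper container startup
- Docker daemon restart that didn't auto-start the container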
**Contributing Factors**:
- User permission issues requiring `sudo` for Docker commands
- Container was properly configured but simply not running
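
The container state drives the remediation, so it can help to make that mapping explicit. The following is a minimal sketch, not part of the original incident response: `watchtower_action` is a hypothetical helper, and the suggested actions are assumptions about sensible defaults.

```bash
# Hypothetical helper: map a container state (as printed by
# `docker inspect --format '{{.State.Status}}' <name>`) to a suggested action.
watchtower_action() {
    case "$1" in
        running)        echo "none" ;;
        created|exited) echo "start" ;;        # this incident: "created"
        paused)         echo "unpause" ;;
        restarting)     echo "wait" ;;
        "")             echo "recreate" ;;     # empty state: container not found
        *)              echo "investigate" ;;  # e.g. "dead", "removing"
    esac
}
```

Usage, assuming `sudo` is required as it was on Atlantis: `watchtower_action "$(sudo docker inspect --format '{{.State.Status}}' watchtower 2>/dev/null)"`.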
## 🛠️ Resolution Steps
### 1. Initial Diagnosis
```bash
# Connected to Atlantis server via SSH
ssh atlantis
# Attempted to check container status (permission denied)
docker ps -a | grep -i watchtower
# Error: permission denied while trying to connect to Docker daemon socket
# Used sudo to check container status
sudo docker ps -a | grep -i watchtower
# Found: Container in "Created" state, not running
```
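
The permission error in the first attempt can be anticipated before running any Docker command. This sketch assumes the Docker socket is group-owned by `docker`, which is the usual Docker Engine default on Linux but was not verified on Atlantis. `in_docker_group` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if a space-separated group list
# (the output format of `id -nG`) contains the `docker` group.
in_docker_group() {
    case " $1 " in
        *" docker "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Example use: pick the command prefix for the current user.
# if in_docker_group "$(id -nG)"; then DOCKER="docker"; else DOCKER="sudo docker"; fi
```

Padding the list with spaces keeps a group such as `dockerusers` from matching by accident.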
### 2. Container Analysis
```bash
# Checked container logs (empty - never started)
sudo docker logs watchtower
# Verified container configuration
sudo docker inspect watchtower | grep -A 5 -B 5 "RestartPolicy"
# Confirmed: RestartPolicy set to "always" (correct)
```
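
Grepping the full `docker inspect` output works, but a Go template can pull out just the fields of interest. The sketch below is an alternative to the commands above, not what was run during the incident. The function names are hypothetical, and `sudo` is used because Docker commands on Atlantis required it.

```bash
# Print the restart policy name of a container (e.g. "always", "no").
restart_policy() {
    sudo docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}

# Print the current state and the last start time of a container.
# A container that has never run typically reports the zero time
# "0001-01-01T00:00:00Z" as its start time.
state_and_started_at() {
    sudo docker inspect --format '{{.State.Status}} {{.State.StartedAt}}' "$1"
}
```

Calling `restart_policy watchtower` and `state_and_started_at watchtower` gives two short lines that are easier to compare across hosts than the grep output.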
### 3. Resolution Implementation
```bash
# Started the Watchtower container
sudo docker start watchtower
# Result: watchtower (container started successfully)
# Verified container is running
sudo docker ps | grep watchtower
# Result: Container running and healthy
```
### 4. Functionality Verification
```bash
# Checked container logs for proper startup
sudo docker logs watchtower --tail 20
# Confirmed: Watchtower 1.7.1 started successfully
# Confirmed: HTTP API enabled on port 8080 (mapped to 8082)
# Confirmed: Checking all containers enabled
# Tested HTTP API (without authentication)
curl -s -w "\nHTTP Status: %{http_code}\n" http://localhost:8082/v1/update
# Result: HTTP 401 (expected - API requires authentication)
# Verified API token configuration
sudo docker inspect watchtower | grep -i "api\|token\|auth" -A 2 -B 2
# Found: WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"
```
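
The `grep` on the inspect output prints the API token in clear text, which is why it had to be redacted here. A minimal sketch of a safer check follows. `mask_secrets` is a hypothetical filter, and the set of key patterns it masks (`TOKEN`, `PASSWORD`, `SECRET`) is an assumption.

```bash
# Hypothetical filter: mask the value of any KEY=VALUE line whose key
# contains TOKEN, PASSWORD, or SECRET. Reads stdin, writes stdout.
mask_secrets() {
    sed -E 's/^([A-Za-z0-9_]*(TOKEN|PASSWORD|SECRET)[A-Za-z0-9_]*)=.*/\1=***/'
}

# Example use: list the container environment with secrets masked.
# sudo docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' watchtower | mask_secrets
```

This still confirms that `WATCHTOWER_HTTP_API_TOKEN` is set without exposing its value in terminal history or reports.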
## ✅ Current Status
**Container Status**: ✅ Running and Healthy
- Container ID: `9f8fee3fbcea`
- Status: Up and running (healthy)
- Uptime: Stable since fix
- Port Mapping: 8082:8080 (HTTP API accessible)
**Configuration Verified**:
- ✅ Restart Policy: `always` (will auto-start on reboot)
- ✅ HTTP API: Enabled with authentication token
- ✅ Cleanup: Enabled (removes old images)
- ✅ Rolling Restart: Enabled (minimizes disruption)
- ✅ Timeout: 10s in the running container (the repository config specifies 30s)
**API Access**:
- URL: `http://atlantis:8082/v1/update`
- Authentication: Bearer token `watchtower-update-token`
- Status: Functional and secured
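
When scripting checks against this endpoint, the HTTP status code is the main signal. The sketch below uses a hypothetical helper, `describe_api_status`. The 401 case matches what was observed in this incident. The 200 case is an assumption about the response to a correctly authenticated request, and `000` is what curl's `%{http_code}` reports when no HTTP response was received.

```bash
# Hypothetical helper: interpret the HTTP status returned by /v1/update.
describe_api_status() {
    case "$1" in
        200) echo "authorized: update triggered" ;;       # assumed success code
        401) echo "rejected: missing or wrong token" ;;   # observed in this incident
        000) echo "unreachable: no HTTP response" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# Example use (unauthenticated probe, expected to be rejected):
# describe_api_status "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8082/v1/update)"
```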
## 🔧 Configuration Details
### Current Watchtower Configuration
```yaml
# From running container inspection
Environment:
- WATCHTOWER_POLL_INTERVAL=3600
- WATCHTOWER_TIMEOUT=10s
- WATCHTOWER_HTTP_API_UPDATE=true
- WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"
- TZ=America/Los_Angeles
Restart Policy: always
Port Mapping: 8082:8080
Volume Mounts: /var/run/docker.sock:/var/run/docker.sock:ro
```
### Differences from Repository Configuration
The running container configuration differs from the repository `watchtower.yml`:
| Setting | Repository Config | Running Container |
|---------|------------------|-------------------|
| API Token | `REDACTED_WATCHTOWER_TOKEN` | `watchtower-update-token` |
| Poll Interval | Not set (uses schedule) | `3600` seconds |
| Timeout | `30s` | `10s` |
| Schedule | `"0 0 */2 * * *"` | Not visible (may use polling) |
**Recommendation**: Decide which configuration is authoritative (repository or running container) and update the other to match, so the two stay consistent.
## 🚀 Prevention Measures
### Immediate Actions Taken
1. ✅ Container started and verified functional
2. ✅ Confirmed restart policy is set to "always"
3. ✅ Verified API functionality and security
### Recommended Long-term Improvements
#### 1. Monitoring Enhancement
```bash
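# Add to monitoring stack (sketch): stream Watchtower state changes
# and alert on any start, stop, or die event
sudo docker events --filter container=watchtower \
  --filter event=start --filter event=stop --filter event=die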
```
#### 2. Documentation Updates
- Update service documentation with correct API token
- Document troubleshooting steps for similar issues
- Create runbook for Watchtower maintenance
#### 3. Automation Improvements
```bash
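#!/bin/bash
# Health check script: start Watchtower if it is not running
# (covers the "Created" and "Exited" states alike)
state=$(sudo docker inspect --format '{{.State.Status}}' watchtower 2>/dev/null)
if [ "$state" != "running" ]; then
    echo "Watchtower not running (state: ${state:-missing}), starting..."
    sudo docker start watchtower
fi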
```
#### 4. Configuration Synchronization
- Reconcile differences between repository config and running container
- Implement configuration management to prevent drift
## 📚 Related Documentation
- **Service Config**: `/home/homelab/organized/repos/homelab/Atlantis/watchtower.yml`
- **Status Script**: `/home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh`
- **Emergency Script**: `/home/homelab/organized/repos/homelab/scripts/emergency-fix-watchtower-crash.sh`
- **Service Docs**: `/home/homelab/organized/repos/homelab/docs/services/individual/watchtower.md`
## 🔗 Useful Commands
### Status Checking
```bash
# Check container status
sudo docker ps | grep watchtower
# View container logs
sudo docker logs watchtower --tail 20
# Check container health
sudo docker inspect watchtower --format='{{.State.Health.Status}}'
```
### API Testing
```bash
# Test API without authentication (should return 401)
curl -s -w "\nHTTP Status: %{http_code}\n" http://localhost:8082/v1/update
# Test API with authentication (note: a successful call triggers an update run)
curl -s -H "Authorization: Bearer watchtower-update-token" http://localhost:8082/v1/update
```
### Container Management
```bash
# Start container
sudo docker start watchtower
# Restart container
sudo docker restart watchtower
# View container configuration
sudo docker inspect watchtower
```
## 📊 Lessons Learned
1. **Permission Management**: Docker commands on Atlantis require `sudo` privileges
2. **Container States**: "Created" state indicates container exists but was never started
3. **Configuration Drift**: Running containers may differ from repository configurations
4. **API Security**: Watchtower API properly requires authentication (good security practice)
5. **Restart Policies**: "always" restart policy doesn't help if container was never started initially
## 🎯 Action Items
- [ ] Update repository configuration to match running container
- [ ] Implement automated health checks for Watchtower
- [ ] Add Watchtower monitoring to existing monitoring stack
- [ ] Create user permissions documentation for Docker access
- [ ] Schedule regular configuration drift checks
---
**Incident Closed**: February 9, 2026 02:00 PST
**Resolution Time**: 15 minutes
**Next Review**: February 16, 2026 (1 week follow-up)