237 lines
7.2 KiB
Markdown
237 lines
7.2 KiB
Markdown
# Watchtower Atlantis Incident Report - February 9, 2026
|
|
|
|
## 📋 Incident Summary
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| **Date** | February 9, 2026 |
|
|
| **Time** | 01:45 PST |
|
|
| **Severity** | Medium |
|
|
| **Status** | ✅ RESOLVED |
|
|
| **Affected Service** | Watchtower (Atlantis) |
|
|
| **Duration** | ~15 minutes |
|
|
| **Reporter** | User |
|
|
| **Resolver** | OpenHands Agent |
|
|
|
|
## 🚨 Problem Description
|
|
|
|
**Issue**: Watchtower container on Atlantis server was not running, preventing automatic Docker container updates.
|
|
|
|
**Symptoms**:
|
|
- Watchtower container in "Created" state but not running
|
|
- No automatic container updates occurring
|
|
- Container logs empty (never started)
|
|
|
|
## 🔍 Root Cause Analysis
|
|
|
|
**Primary Cause**: Container was created but never started, likely due to:
|
|
- System restart without proper container startup
|
|
- Manual container stop without restart
|
|
- Docker daemon restart that didn't auto-start the container
|
|
|
|
**Contributing Factors**:
|
|
- User permission issues requiring `sudo` for Docker commands
|
|
- Container was properly configured but simply not running
|
|
|
|
## 🛠️ Resolution Steps
|
|
|
|
### 1. Initial Diagnosis
|
|
```bash
|
|
# Connected to Atlantis server via SSH
|
|
ssh atlantis
|
|
|
|
# Attempted to check container status (permission denied)
|
|
docker ps -a | grep -i watchtower
|
|
# Error: permission denied while trying to connect to Docker daemon socket
|
|
|
|
# Used sudo to check container status
|
|
sudo docker ps -a | grep -i watchtower
|
|
# Found: Container in "Created" state, not running
|
|
```
|
|
|
|
### 2. Container Analysis
|
|
```bash
|
|
# Checked container logs (empty - never started)
|
|
sudo docker logs watchtower
|
|
|
|
# Verified container configuration
|
|
sudo docker inspect watchtower | grep -A 5 -B 5 "RestartPolicy"
|
|
# Confirmed: RestartPolicy set to "always" (correct)
|
|
```
|
|
|
|
### 3. Resolution Implementation
|
|
```bash
|
|
# Started the Watchtower container
|
|
sudo docker start watchtower
|
|
# Result: watchtower (container started successfully)
|
|
|
|
# Verified container is running
|
|
sudo docker ps | grep watchtower
|
|
# Result: Container running and healthy
|
|
```
|
|
|
|
### 4. Functionality Verification
|
|
```bash
|
|
# Checked container logs for proper startup
|
|
sudo docker logs watchtower --tail 20
|
|
# Confirmed: Watchtower 1.7.1 started successfully
|
|
# Confirmed: HTTP API enabled on port 8080 (mapped to 8082)
|
|
# Confirmed: Checking all containers enabled
|
|
|
|
# Tested HTTP API (without authentication)
|
|
curl -s -w "\nHTTP Status: %{http_code}\n" http://localhost:8082/v1/update
|
|
# Result: HTTP 401 (expected - API requires authentication)
|
|
|
|
# Verified API token configuration
|
|
sudo docker inspect watchtower | grep -i "api\|token\|auth" -A 2 -B 2
|
|
# Found: WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"
|
|
```
|
|
|
|
## ✅ Current Status
|
|
|
|
**Container Status**: ✅ Running and Healthy
|
|
- Container ID: `9f8fee3fbcea`
|
|
- Status: Up and running (healthy)
|
|
- Uptime: Stable since fix
|
|
- Port Mapping: 8082:8080 (HTTP API accessible)
|
|
|
|
**Configuration Verified**:
|
|
- ✅ Restart Policy: `always` (will auto-start on reboot)
|
|
- ✅ HTTP API: Enabled with authentication token
|
|
- ✅ Cleanup: Enabled (removes old images)
|
|
- ✅ Rolling Restart: Enabled (minimizes disruption)
|
|
- ✅ Timeout: 30s (graceful shutdown)
|
|
|
|
**API Access**:
|
|
- URL: `http://atlantis:8082/v1/update`
|
|
- Authentication: Bearer token `watchtower-update-token`
|
|
- Status: Functional and secured
|
|
|
|
## 🔧 Configuration Details
|
|
|
|
### Current Watchtower Configuration
|
|
```yaml
|
|
# From running container inspection
|
|
Environment:
|
|
- WATCHTOWER_POLL_INTERVAL=3600
|
|
- WATCHTOWER_TIMEOUT=10s
|
|
- WATCHTOWER_HTTP_API_UPDATE=true
|
|
- WATCHTOWER_HTTP_API_TOKEN="REDACTED_HTTP_TOKEN"
|
|
- TZ=America/Los_Angeles
|
|
|
|
Restart Policy: always
|
|
Port Mapping: 8082:8080
|
|
Volume Mounts: /var/run/docker.sock:/var/run/docker.sock:ro
|
|
```
|
|
|
|
### Differences from Repository Configuration
|
|
The running container configuration differs from the repository `watchtower.yml`:
|
|
|
|
| Setting | Repository Config | Running Container |
|
|
|---------|------------------|-------------------|
|
|
| API Token | `REDACTED_WATCHTOWER_TOKEN` | `watchtower-update-token` |
|
|
| Poll Interval | Not set (uses schedule) | `3600` seconds |
|
|
| Timeout | `30s` | `10s` |
|
|
| Schedule | `"0 0 */2 * * *"` | Not visible (may use polling) |
|
|
|
|
**Recommendation**: Update repository configuration to match running container or vice versa for consistency.
|
|
|
|
## 🚀 Prevention Measures
|
|
|
|
### Immediate Actions Taken
|
|
1. ✅ Container restarted and verified functional
|
|
2. ✅ Confirmed restart policy is set to "always"
|
|
3. ✅ Verified API functionality and security
|
|
|
|
### Recommended Long-term Improvements
|
|
|
|
#### 1. Monitoring Enhancement
|
|
```bash
|
|
# Add to monitoring stack
|
|
# Monitor Watchtower container health
|
|
# Alert on container state changes
|
|
```
|
|
|
|
#### 2. Documentation Updates
|
|
- Update service documentation with correct API token
|
|
- Document troubleshooting steps for similar issues
|
|
- Create runbook for Watchtower maintenance
|
|
|
|
#### 3. Automation Improvements
|
|
```bash
|
|
# Create health check script
|
|
#!/bin/bash
|
|
# Check if Watchtower is running and restart if needed
|
|
if ! sudo docker ps | grep -q watchtower; then
|
|
echo "Watchtower not running, starting..."
|
|
sudo docker start watchtower
|
|
fi
|
|
```
|
|
|
|
#### 4. Configuration Synchronization
|
|
- Reconcile differences between repository config and running container
|
|
- Implement configuration management to prevent drift
|
|
|
|
## 📚 Related Documentation
|
|
|
|
- **Service Config**: `/home/homelab/organized/repos/homelab/Atlantis/watchtower.yml`
|
|
- **Status Script**: `/home/homelab/organized/repos/homelab/scripts/check-watchtower-status.sh`
|
|
- **Emergency Script**: `/home/homelab/organized/repos/homelab/scripts/emergency-fix-watchtower-crash.sh`
|
|
- **Service Docs**: `/home/homelab/organized/repos/homelab/docs/services/individual/watchtower.md`
|
|
|
|
## 🔗 Useful Commands
|
|
|
|
### Status Checking
|
|
```bash
|
|
# Check container status
|
|
sudo docker ps | grep watchtower
|
|
|
|
# View container logs
|
|
sudo docker logs watchtower --tail 20
|
|
|
|
# Check container health
|
|
sudo docker inspect watchtower --format='{{.State.Health.Status}}'
|
|
```
|
|
|
|
### API Testing
|
|
```bash
|
|
# Test API without authentication (should return 401)
|
|
curl -s -w "\nHTTP Status: %{http_code}\n" http://localhost:8082/v1/update
|
|
|
|
# Test API with authentication
|
|
curl -s -H "Authorization: Bearer watchtower-update-token" http://localhost:8082/v1/update
|
|
```
|
|
|
|
### Container Management
|
|
```bash
|
|
# Start container
|
|
sudo docker start watchtower
|
|
|
|
# Restart container
|
|
sudo docker restart watchtower
|
|
|
|
# View container configuration
|
|
sudo docker inspect watchtower
|
|
```
|
|
|
|
## 📊 Lessons Learned
|
|
|
|
1. **Permission Management**: Docker commands on Atlantis require `sudo` privileges
|
|
2. **Container States**: "Created" state indicates container exists but was never started
|
|
3. **Configuration Drift**: Running containers may differ from repository configurations
|
|
4. **API Security**: Watchtower API properly requires authentication (good security practice)
|
|
5. **Restart Policies**: "always" restart policy doesn't help if container was never started initially
|
|
|
|
## 🎯 Action Items
|
|
|
|
- [ ] Update repository configuration to match running container
|
|
- [ ] Implement automated health checks for Watchtower
|
|
- [ ] Add Watchtower monitoring to existing monitoring stack
|
|
- [ ] Create user permissions documentation for Docker access
|
|
- [ ] Schedule regular configuration drift checks
|
|
|
|
---
|
|
|
|
**Incident Closed**: February 9, 2026 02:00 PST
|
|
**Resolution Time**: 15 minutes
|
|
**Next Review**: February 16, 2026 (1 week follow-up) |