Sanitized mirror from private repository - 2026-03-11 06:12:00 UTC
106
docs/admin/OPERATIONAL_NOTES.md
Normal file
@@ -0,0 +1,106 @@
# Operational Notes & Known Issues

*Last Updated: 2026-01-26*

This document contains important operational notes, known issues, and fixes for the homelab infrastructure.

---

## Server-Specific Notes

### Concord NUC (100.72.55.21)

#### Node Exporter
- **Runs on bare metal** (not containerized)
- Port: 9100
- Prometheus scrapes successfully from `100.72.55.21:9100`
- Do NOT deploy a containerized node_exporter here; it will conflict with the host service on port 9100
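
The port conflict described above can be checked before deploying anything. The following is a minimal sketch, assuming the host's node_exporter accepts connections on localhost; the function name `port_is_listening` is illustrative and not part of any existing tooling in this repo.

```python
import socket

NODE_EXPORTER_PORT = 9100  # port named in the notes above


def port_is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_is_listening(NODE_EXPORTER_PORT):
        print("Port 9100 is taken: skip the containerized node_exporter.")
    else:
        print("Port 9100 is free.")
```

Run on the Concord NUC itself, a "taken" result suggests the bare metal exporter is already serving metrics and a container publishing the same port would fail to start.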

#### Watchtower
- Requires the `DOCKER_API_VERSION=1.44` environment variable
- This is because the Portainer Edge Agent uses an older Docker API version
- Without this variable, Watchtower fails with: `client version 1.25 is too old`
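
To illustrate why `1.25` is rejected while `1.44` is accepted, here is a small sketch of an API version comparison. It assumes the error means the client's version is below the minimum the other side accepts, and it treats `1.44` as that minimum only because it is the value used in the fix above. All function names are hypothetical.

```python
def parse_api_version(version: str) -> tuple[int, int]:
    """Split a Docker-style API version such as '1.44' into (major, minor)."""
    major, minor = version.split(".")
    return int(major), int(minor)


def client_too_old(client: str, minimum: str) -> bool:
    """Return True if the client version is below the required minimum."""
    return parse_api_version(client) < parse_api_version(minimum)


def watchtower_env(pinned: str = "1.44") -> dict[str, str]:
    """Environment entry that pins the API version, as described in the notes."""
    return {"DOCKER_API_VERSION": pinned}
```

Versions are compared as integer pairs rather than as strings or floats, since `1.9` would otherwise sort above `1.44`.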

#### Invidious
- The health check reports "unhealthy", but the application works fine
- The health check calls `/api/v1/trending`, which returns HTTP 500
- This is a known upstream issue caused by YouTube's API changes
- **Workaround**: Ignore the unhealthy status or point the health check at a different endpoint
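
Before changing the health check endpoint, it may help to probe a few candidates and see which one answers with a 2xx status. The sketch below is one way to do that. The path `/api/v1/stats` and the base URL `http://localhost:3000` are assumptions for illustration and should be verified against the running instance; the function names are hypothetical.

```python
from typing import Callable, Iterable, Optional
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is still useful.
        return err.code
    except OSError:
        # Covers URLError, connection refused, and timeouts.
        return 0


def first_healthy_path(
    base_url: str,
    paths: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Optional[str]:
    """Return the first path that answers with a 2xx status, or None."""
    for path in paths:
        if 200 <= fetch(base_url + path) < 300:
            return path
    return None


if __name__ == "__main__":
    # Assumed base URL and candidate paths; adjust to the actual deployment.
    candidates = ["/api/v1/trending", "/api/v1/stats"]
    print(first_healthy_path("http://localhost:3000", candidates))
```

The `fetch` parameter exists so the selection logic can be exercised without a live Invidious instance.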

---

## Prometheus Monitoring

### Active Targets (as of 2026-01-26)

| Job | Target | Status |
|-----|--------|--------|
| prometheus | prometheus:9090 | 🟢 UP |
| homelab-node | 100.67.40.126:9100 | 🟢 UP |
| atlantis-node | 100.83.230.112:9100 | 🟢 UP |
| atlantis-snmp | 100.83.230.112:9116 | 🟢 UP |
| calypso-node | 100.103.48.78:9100 | 🟢 UP |
| calypso-snmp | 100.103.48.78:9116 | 🟢 UP |
| concord-nuc-node | 100.72.55.21:9100 | 🟢 UP |
| setillo-node | 100.125.0.20:9100 | 🟢 UP |
| setillo-snmp | 100.125.0.20:9116 | 🟢 UP |
| truenas-node | 100.75.252.64:9100 | 🟢 UP |
| proxmox-node | 100.87.12.28:9100 | 🟢 UP |
| raspberry-pis (pi-5) | 100.77.151.40:9100 | 🟢 UP |

### Intentionally Offline Targets

| Job | Target | Reason |
|-----|--------|--------|
| raspberry-pis (pi-5-kevin) | 100.123.246.75:9100 | Intentionally offline |
| vmi2076105-node | 100.99.156.20:9100 | Intentionally offline |
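
The two tables above can be turned into an automated check: list every target Prometheus reports as not `up`, minus the ones that are offline on purpose. This is a rough sketch using the Prometheus HTTP API endpoint `/api/v1/targets`. It assumes each target's `instance` label matches the Target column (that is, no relabeling rewrites it) and that Prometheus is reachable at `http://prometheus:9090`; the function names are illustrative.

```python
import json
import urllib.request

# Targets from the "Intentionally Offline Targets" table.
INTENTIONALLY_OFFLINE = frozenset({"100.123.246.75:9100", "100.99.156.20:9100"})


def unexpected_down(
    payload: dict, ignore: frozenset[str] = INTENTIONALLY_OFFLINE
) -> list[tuple[str, str]]:
    """Return (job, instance) pairs for targets that are not 'up' and not ignored."""
    result = []
    for target in payload["data"]["activeTargets"]:
        labels = target.get("labels", {})
        instance = labels.get("instance", "")
        if target.get("health") != "up" and instance not in ignore:
            result.append((labels.get("job", ""), instance))
    return result


def fetch_targets(base_url: str) -> dict:
    """Fetch the target list from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

A typical call would be `unexpected_down(fetch_targets("http://prometheus:9090"))`; an empty list means every target is either up or on the ignore list.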

---

## Deployment Architecture

### Git-Linked Stacks
- Most stacks are deployed from Gitea (`git.vish.gg/Vish/homelab`)
- Branch: `wip`
- Portainer pulls configs directly from the repo
- Changes to repo configs take effect on deployed stacks at the next redeploy/update

### Standalone Containers
The following containers are managed directly in Portainer (NOT Git-linked):
- `portainer` / `portainer_edge_agent` - Infrastructure
- `watchtower` - Auto-updates (on some servers)
- `node-exporter` containers (where not bare metal)
- Various testing/temporary containers

### Bare Metal Services
Some services run directly on hosts, not in containers:
- **Concord NUC**: node_exporter (port 9100)

---

## Common Issues & Solutions

### Issue: Watchtower restart loop on Edge Agent hosts
**Symptom**: Watchtower continuously restarts with an API version error
**Cause**: The Portainer Edge Agent uses an older Docker API version
**Solution**: Add `DOCKER_API_VERSION=1.44` to the watchtower container environment

### Issue: Port 9100 already in use for node_exporter container
**Symptom**: Container fails to start with "address already in use"
**Cause**: node_exporter is already running on bare metal
**Solution**: Don't run a containerized node_exporter; use the bare metal instance

### Issue: Invidious health check failing
**Symptom**: Container shows "unhealthy" but works fine
**Cause**: YouTube API changes cause `/api/v1/trending` to return HTTP 500
**Solution**: This is cosmetic; the app works. Consider updating the health check endpoint.

---

## Maintenance Checklist

- [ ] Check Prometheus targets regularly for DOWN status
- [ ] Monitor watchtower logs for update failures
- [ ] Review Portainer for containers in restart loops
- [ ] Keep Git repo configs in sync with running stacks
- [ ] Document any manual container changes in this file