# Tailscale Host Monitoring Status Report

## 📊 Current Status

**Generated:** February 15, 2026

### Monitored Tailscale Hosts (13 total)

#### ✅ Online Hosts (10)
- **atlantis-node** (100.83.230.112:9100) - Synology NAS
- **atlantis-snmp** (100.83.230.112) - SNMP monitoring
- **calypso-node** (100.103.48.78:9100) - Node exporter
- **calypso-snmp** (100.103.48.78) - SNMP monitoring
- **concord-nuc-node** (100.72.55.21:9100) - Intel NUC
- **proxmox-node** (100.87.12.28:9100) - Proxmox server
- **raspberry-pis** (100.77.151.40:9100) - Pi cluster node
- **setillo-node** (100.125.0.20:9100) - Node exporter
- **setillo-snmp** (100.125.0.20) - SNMP monitoring
- **truenas-node** (100.75.252.64:9100) - TrueNAS server

#### ❌ Offline Hosts (3)
- **homelab-node** (100.67.40.126:9100) - Main homelab VM
- **raspberry-pis** (100.123.246.75:9100) - Pi cluster node
- **vmi2076105-node** (100.99.156.20:9100) - VPS instance

## 🚨 Active Alerts

### Critical HostDown Alerts (2 firing)
1. **vmi2076105-node** (100.99.156.20:9100)
   - Status: Firing since Feb 14, 07:57 UTC
   - Duration: ~24 hours
   - Notifications: Sent to ntfy + Signal

2. **homelab-node** (100.67.40.126:9100)
   - Status: Firing since Feb 14, 09:23 UTC
   - Duration: ~22 hours
   - Notifications: Sent to ntfy + Signal
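
The durations above are the elapsed time between each alert's start and the moment the report was generated. A minimal sketch of that arithmetic follows. The report only gives a generation date, so the 08:00 UTC generation time is an assumption, the helper name `hours_between` is hypothetical, and GNU `date` is required for the `-d` option.

```bash
# Whole hours elapsed between two timestamps (GNU date required for -d).
hours_between() {
  local start_epoch end_epoch
  start_epoch=$(date -u -d "$1" +%s) || return 1
  end_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( (end_epoch - start_epoch) / 3600 ))
}

# Assumption: the report was generated around 08:00 UTC on Feb 15, 2026.
REPORT_TIME="2026-02-15 08:00 UTC"
hours_between "2026-02-14 07:57 UTC" "$REPORT_TIME"   # vmi2076105-node, prints 24
hours_between "2026-02-14 09:23 UTC" "$REPORT_TIME"   # homelab-node, prints 22
```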

## 📬 Notification System Status

### ✅ Working Notification Channels
- **ntfy**: http://192.168.0.210:8081/homelab-alerts ✅
- **Signal**: Via signal-bridge (critical alerts) ✅
- **Alertmanager**: http://100.67.40.126:9093 ✅
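
To exercise the ntfy channel by hand, a message can be posted straight to the topic URL listed above. This is a sketch rather than the project's own test script: the function name `ntfy_publish` and the `DRY_RUN` switch are hypothetical, while the `Title` and `Priority` headers are standard ntfy publishing headers.

```bash
NTFY_URL="http://192.168.0.210:8081/homelab-alerts"   # topic URL from this report

# Publish a message to the ntfy topic.
# Set DRY_RUN=1 to print the request instead of sending it.
ntfy_publish() {
  local title=$1 priority=$2 message=$3
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'POST %s title=%s priority=%s body=%s\n' \
      "$NTFY_URL" "$title" "$priority" "$message"
    return 0
  fi
  curl -fsS --max-time 10 \
    -H "Title: $title" \
    -H "Priority: $priority" \
    -d "$message" \
    "$NTFY_URL"
}
```

For example, `DRY_RUN=1 ntfy_publish "Test alert" default "Notification path check"` prints the request without touching the network; dropping `DRY_RUN=1` sends it.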

### Test Results
- ntfy notification test: **PASSED** ✅
- Message delivery: **CONFIRMED** ✅
- Alert routing: **WORKING** ✅

## ⚙️ Monitoring Configuration

### Alert Rules
- **Trigger**: Host unreachable for 2+ minutes
- **Severity**: Critical (dual-channel notifications)
- **Query**: `up{job=~".*-node"} == 0`
- **Evaluation**: Every 30 seconds
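
Prometheus anchors label regexes at both ends, so `job=~".*-node"` selects only jobs whose name ends in `-node`. The sketch below mirrors that test with a shell glob and shows how the same expression could be sent to the Prometheus query API. It assumes the bold host names in this report are the Prometheus job labels, which the report does not confirm; `matches_node_job` and `query_down_nodes` are hypothetical helper names.

```bash
PROM_URL="http://100.67.40.126:9090"   # Prometheus address from this report

# Succeeds when a job name would match job=~".*-node".
# The glob *-node is equivalent to the anchored regex ^.*-node$.
matches_node_job() {
  case "$1" in
    *-node) return 0 ;;
    *)      return 1 ;;
  esac
}

# Ask Prometheus which matching targets are currently down.
query_down_nodes() {
  curl -fsS --max-time 10 -G "$PROM_URL/api/v1/query" \
    --data-urlencode 'query=up{job=~".*-node"} == 0'
}

for job in homelab-node vmi2076105-node raspberry-pis atlantis-snmp; do
  if matches_node_job "$job"; then
    echo "$job: covered by the rule"
  else
    echo "$job: not covered by the rule"
  fi
done
```

Under that assumption, `raspberry-pis` and the `-snmp` jobs fall outside the rule, which would be consistent with three hosts being offline while only two HostDown alerts are firing.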

### Notification Routing
- **Warning alerts** → ntfy only
- **Critical alerts** → ntfy + Signal
- **Resolved alerts** → Both channels
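
The routing table above can be written as a small lookup, which is handy when checking what a given alert should trigger. This is only an illustration of the table, not the Alertmanager configuration itself; the function name `channels_for` and the `unrouted` fallback for other severities are assumptions.

```bash
# Map an alert's severity and status to the channels listed in this report.
channels_for() {
  local severity=$1 status=${2:-firing}
  if [ "$status" = "resolved" ]; then
    echo "ntfy signal"            # resolved alerts go to both channels
    return 0
  fi
  case "$severity" in
    critical) echo "ntfy signal" ;;
    warning)  echo "ntfy" ;;
    *)        echo "unrouted"; return 1 ;;   # assumption: not covered by the report
  esac
}
```

For instance, `channels_for critical` prints `ntfy signal` and `channels_for warning` prints `ntfy`.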

## 🔧 Infrastructure Details

### Monitoring Stack
- **Prometheus**: http://100.67.40.126:9090
- **Grafana**: http://100.67.40.126:3300
- **Alertmanager**: http://100.67.40.126:9093
- **Bridge Services**: ntfy-bridge (5001), signal-bridge (5000)
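
A quick way to confirm the stack itself is responding is to probe each service's health endpoint. The sketch below uses the standard `/-/healthy` endpoints of Prometheus and Alertmanager and Grafana's `/api/health`. The bridge services are left out because the report does not say whether they expose a health endpoint; the function names and the `CURL` override are hypothetical.

```bash
STACK_HOST="100.67.40.126"   # address from this report

# Print one "name url" pair per service health endpoint.
stack_health_urls() {
  echo "prometheus http://$STACK_HOST:9090/-/healthy"
  echo "alertmanager http://$STACK_HOST:9093/-/healthy"
  echo "grafana http://$STACK_HOST:3300/api/health"
}

# Probe each endpoint. CURL can be overridden (e.g. CURL=true) for a dry run.
check_stack() {
  stack_health_urls | while read -r name url; do
    if ${CURL:-curl} -fsS -o /dev/null --max-time 5 "$url"; then
      echo "$name OK"
    else
      echo "$name FAILED"
    fi
  done
}
```

Running `check_stack` from a machine on the tailnet prints one line per service.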

### Data Collection
- **Node Exporter**: System metrics on port 9100
- **SNMP Exporter**: Network device metrics on port 9116
- **Scrape Interval**: 15 seconds
- **Retention**: Prometheus default (15 days unless `--storage.tsdb.retention.time` is set)

## 📋 Recommendations

### Immediate Actions
1. **Investigate offline hosts**:
   - Check homelab-node (100.67.40.126) - main VM reported down; Prometheus, Grafana, and Alertmanager are listed at the same address, so confirm whether the whole VM or only the exporter on port 9100 is affected
   - Verify vmi2076105-node (100.99.156.20) - VPS status
   - Check raspberry-pis node (100.123.246.75)

2. **Verify notifications**:
   - Confirm you're receiving ntfy alerts on mobile
   - Test Signal notifications for critical alerts
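
For the first action above, a first pass is to check whether each offline target still answers on its node exporter port. This is a minimal sketch using the three offline addresses from this report; `check_target`, `check_offline_targets`, and the `CURL` override are hypothetical names.

```bash
# Offline targets listed in this report.
OFFLINE_TARGETS="100.67.40.126:9100 100.99.156.20:9100 100.123.246.75:9100"

# Report whether a node exporter answers on /metrics.
# CURL can be overridden (e.g. CURL=false) to simulate a failure.
check_target() {
  if ${CURL:-curl} -fsS -o /dev/null --max-time 5 "http://$1/metrics"; then
    echo "$1 UP"
  else
    echo "$1 DOWN"
  fi
}

check_offline_targets() {
  for target in $OFFLINE_TARGETS; do
    check_target "$target"
  done
}
```

If a target reports `DOWN`, running `tailscale ping` against its address from the monitoring host can help separate a tailnet reachability problem from a stopped exporter.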

### Maintenance
- Monitor disk space on active hosts
- Review alert thresholds if needed
- Consider adding more monitoring targets

## 🧪 Testing

Use the test script to verify monitoring:
```bash
./scripts/test-tailscale-monitoring.sh
```

For manual testing:
1. Stop node_exporter on any host: `sudo systemctl stop node_exporter`
2. Wait 2+ minutes for alert to fire
3. Check ntfy app and Signal for notifications
4. Restart: `sudo systemctl start node_exporter`
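
Step 2 of the manual test can be automated by polling Prometheus until the alert shows up. The sketch below assumes the rule is named `HostDown` (taken from the alert heading in this report) and that its `instance` label is the `host:port` target; both are assumptions, as are the helper names. It queries the built-in `ALERTS` series through the Prometheus query API.

```bash
PROM_URL="http://100.67.40.126:9090"   # Prometheus address from this report

# Retry a command until it succeeds or the timeout (in seconds) expires.
wait_for() {
  local timeout=$1 interval=$2 elapsed=0
  shift 2
  while ! "$@"; do
    elapsed=$((elapsed + interval))
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
  done
  return 0
}

# Succeeds once a HostDown alert is firing for the given instance.
hostdown_firing() {
  local query
  query=$(printf 'ALERTS{alertname="HostDown",alertstate="firing",instance="%s"}' "$1")
  curl -fsS --max-time 10 -G "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$query" | grep -q '"alertname":"HostDown"'
}
```

After stopping the exporter, something like `wait_for 300 15 hostdown_firing "100.83.230.112:9100"` polls every 15 seconds for up to 5 minutes and returns non-zero if the alert never fires.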

---

**Last Updated:** February 15, 2026
**Next Review:** Weekly or when infrastructure changes