Sanitized mirror from private repository - 2026-04-20 00:50:49 UTC
docs/admin/tailscale-monitoring-status.md
# Tailscale Host Monitoring Status Report

> **⚠️ Historical Snapshot**: This document was generated on Feb 15, 2026. The alerts and offline status listed here are no longer current. For live node status, run `tailscale status` on the homelab VM or check Grafana at `http://100.67.40.126:3000`.

## 📊 Status Snapshot

**Generated:** February 15, 2026

### Monitored Tailscale Hosts (13 total)

#### ✅ Online Hosts (10)
- **atlantis-node** (100.83.230.112:9100) - Synology NAS
- **atlantis-snmp** (100.83.230.112) - SNMP monitoring
- **calypso-node** (100.103.48.78:9100) - Node exporter
- **calypso-snmp** (100.103.48.78) - SNMP monitoring
- **concord-nuc-node** (100.72.55.21:9100) - Intel NUC
- **proxmox-node** (100.87.12.28:9100) - Proxmox server
- **raspberry-pis** (100.77.151.40:9100) - Pi cluster node
- **setillo-node** (100.125.0.20:9100) - Node exporter
- **setillo-snmp** (100.125.0.20) - SNMP monitoring
- **truenas-node** (100.75.252.64:9100) - TrueNAS server

#### ❌ Offline Hosts (3)
- **homelab-node** (100.67.40.126:9100) - Main homelab VM
- **raspberry-pis** (100.123.246.75:9100) - Pi cluster node
- **vmi2076105-node** (100.99.156.20:9100) - VPS instance

## 🚨 Active Alerts

### Critical HostDown Alerts (2 firing)
1. **vmi2076105-node** (100.99.156.20:9100)
   - Status: Firing since Feb 14, 07:57 UTC
   - Duration: ~24 hours
   - Notifications: Sent to ntfy + Signal

2. **homelab-node** (100.67.40.126:9100)
   - Status: Firing since Feb 14, 09:23 UTC
   - Duration: ~22 hours
   - Notifications: Sent to ntfy + Signal

## 📬 Notification System Status

### ✅ Working Notification Channels
- **ntfy**: http://192.168.0.210:8081/homelab-alerts ✅
- **Signal**: Via signal-bridge (critical alerts) ✅
- **Alertmanager**: http://100.67.40.126:9093 ✅

### Test Results
- ntfy notification test: **PASSED** ✅
- Message delivery: **CONFIRMED** ✅
- Alert routing: **WORKING** ✅

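
A test like the one reported above can be repeated by publishing a message straight to the ntfy topic. The sketch below is an assumption about how such a test might look, not the project's actual test script: `send_ntfy_test` and the `CURL` override are hypothetical names. It relies only on ntfy's plain HTTP publish interface (message in the POST body, optional `Title` and `Priority` headers).

```bash
# Hypothetical helper: publish a test message to the ntfy topic listed above.
NTFY_URL="${NTFY_URL:-http://192.168.0.210:8081/homelab-alerts}"
CURL="${CURL:-curl}"   # overridable so the function can be dry-run

send_ntfy_test() {
    message="${1:-Monitoring test notification}"
    # -f: fail on HTTP errors, -sS: quiet but still print errors,
    # --max-time: give up instead of hanging if the server is unreachable
    $CURL -fsS --max-time 10 \
        -H "Title: Monitoring test" \
        -H "Priority: default" \
        -d "$message" \
        "$NTFY_URL"
}
```

Calling `send_ntfy_test "hello from homelab"` should produce a notification in any client subscribed to the topic. This bypasses Alertmanager and the bridge services, so it checks ntfy delivery only, not alert routing.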
## ⚙️ Monitoring Configuration

### Alert Rules
- **Trigger**: Host unreachable for 2+ minutes
- **Severity**: Critical (dual-channel notifications)
- **Query**: `up{job=~".*-node"} == 0`
- **Evaluation**: Every 30 seconds

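
Prometheus anchors label regexes at both ends, so `job=~".*-node"` selects only jobs whose names end in `-node`. The sketch below checks a few names from the host list against the equivalent anchored expression. It assumes that the names in the host list are the Prometheus `job` labels, which the report does not confirm. `matches_hostdown_query` is a hypothetical helper.

```bash
# Returns 0 if the job name would be selected by job=~".*-node".
# The explicit ^( )$ mirrors the anchoring Prometheus applies to label regexes.
matches_hostdown_query() {
    printf '%s\n' "$1" | grep -Eq '^(.*-node)$'
}

for job in atlantis-node atlantis-snmp homelab-node raspberry-pis vmi2076105-node; do
    if matches_hostdown_query "$job"; then
        echo "$job: covered by the alert query"
    else
        echo "$job: not covered"
    fi
done
```

Under that assumption the `raspberry-pis` and `*-snmp` jobs fall outside the query, which would be consistent with three targets being offline while only two HostDown alerts were firing.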
### Notification Routing
- **Warning alerts** → ntfy only
- **Critical alerts** → ntfy + Signal
- **Resolved alerts** → Both channels

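
The routing rules above can be restated as a small lookup. This is only an illustration of the rules as written. In the real stack the logic presumably lives in Alertmanager's routing configuration and the bridge services, which this report does not show. `channels_for` is a hypothetical name.

```bash
# Hypothetical mapping from alert state to notification channels.
# Usage: channels_for <severity> [status]   (status: firing | resolved)
channels_for() {
    severity="$1"
    status="${2:-firing}"
    if [ "$status" = "resolved" ]; then
        echo "ntfy signal"      # resolved alerts go to both channels
        return 0
    fi
    case "$severity" in
        critical) echo "ntfy signal" ;;
        warning)  echo "ntfy" ;;
        *)        echo "unknown severity: $severity" >&2; return 1 ;;
    esac
}
```

For example, `channels_for warning firing` prints `ntfy`, while `channels_for warning resolved` prints `ntfy signal`, reading the third rule as applying to resolved alerts of any severity.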
## 🔧 Infrastructure Details

### Monitoring Stack
- **Prometheus**: http://100.67.40.126:9090
- **Grafana**: http://100.67.40.126:3000
- **Alertmanager**: http://100.67.40.126:9093
- **Bridge Services**: ntfy-bridge (5001), signal-bridge (5000)

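
A quick way to confirm that the stack itself is up is to probe each component's health endpoint. The sketch below uses the standard `/-/healthy` endpoints of Prometheus and Alertmanager and Grafana's `/api/health`. The function names and the `STACK_HOST` variable are hypothetical, and the bridge services are left out because the report does not say whether they expose a health endpoint.

```bash
STACK_HOST="${STACK_HOST:-100.67.40.126}"

# Print "name url" pairs for the health endpoint of each component.
health_urls() {
    echo "prometheus http://$STACK_HOST:9090/-/healthy"
    echo "grafana http://$STACK_HOST:3000/api/health"
    echo "alertmanager http://$STACK_HOST:9093/-/healthy"
}

# Probe each endpoint; curl -f turns HTTP errors into a non-zero exit status.
check_stack() {
    health_urls | while read -r name url; do
        if curl -fsS --max-time 5 -o /dev/null "$url"; then
            echo "$name: healthy"
        else
            echo "$name: unreachable or unhealthy"
        fi
    done
}
```

Running `check_stack` from any machine on the tailnet prints one line per component.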
### Data Collection
- **Node Exporter**: System metrics on port 9100
- **SNMP Exporter**: Network device metrics on port 9116
- **Scrape Interval**: 15 seconds
- **Retention**: Prometheus default (15 days unless overridden with `--storage.tsdb.retention.time`)

## 📋 Recommendations

### Immediate Actions
1. **Investigate offline hosts**:
   - Check homelab-node (100.67.40.126) - main VM down
   - Verify vmi2076105-node (100.99.156.20) - VPS status
   - Check raspberry-pis node (100.123.246.75)

2. **Verify notifications**:
   - Confirm you're receiving ntfy alerts on mobile
   - Test Signal notifications for critical alerts

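
When investigating the offline hosts, it helps to separate "the machine is unreachable" from "the machine is up but node_exporter is not answering". The sketch below is one hedged way to do that. `triage_target`, `triage_all`, `PING_CMD` and `HTTP_CMD` are hypothetical names, and it assumes that `tailscale ping` exits non-zero when no reply arrives.

```bash
# Hypothetical triage helper: tell "host down" apart from "exporter down".
# PING_CMD and HTTP_CMD are overridable so the logic can be exercised offline.
PING_CMD="${PING_CMD:-tailscale ping -c 3}"
HTTP_CMD="${HTTP_CMD:-curl -fsS --max-time 5 -o /dev/null}"

triage_target() {
    target="$1"
    if ! $PING_CMD "$target" >/dev/null 2>&1; then
        echo "$target: unreachable over the tailnet"
    elif ! $HTTP_CMD "http://$target:9100/metrics" >/dev/null 2>&1; then
        echo "$target: host reachable, node_exporter not answering on 9100"
    else
        echo "$target: ok"
    fi
}

# The three targets reported offline in this snapshot.
triage_all() {
    for ip in 100.67.40.126 100.99.156.20 100.123.246.75; do
        triage_target "$ip"
    done
}
```

Run `triage_all` from a machine that is itself on the tailnet. A host that answers the ping but not the metrics request points at the exporter service rather than the machine.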
### Maintenance
- Monitor disk space on active hosts
- Review alert thresholds if needed
- Consider adding more monitoring targets

## 🧪 Testing

Use the test script to verify monitoring:
```bash
./scripts/test-tailscale-monitoring.sh
```

For manual testing:
1. Stop node_exporter on any host: `sudo systemctl stop node_exporter`
2. Wait 2+ minutes for the alert to fire
3. Check the ntfy app and Signal for notifications
4. Restart: `sudo systemctl start node_exporter`

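
The waiting step in the manual procedure can be scripted by polling the Prometheus alerts API instead of watching the clock. This is a sketch under assumptions, not part of the existing test script: `wait_until` and `hostdown_firing` are hypothetical names, the alert name `HostDown` is taken from the alert heading in this report, and `hostdown_firing` needs `curl` and `jq` installed.

```bash
# Poll a command until it succeeds or the timeout (in seconds) is exceeded.
# Usage: wait_until <timeout> <interval> <command...>
wait_until() {
    timeout="$1"; interval="$2"; shift 2
    elapsed=0
    while ! "$@"; do
        elapsed=$((elapsed + interval))
        if [ "$elapsed" -gt "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Assumed check: succeeds once at least one HostDown alert is in state "firing".
hostdown_firing() {
    curl -fsS --max-time 5 "http://100.67.40.126:9090/api/v1/alerts" |
        jq -e '[.data.alerts[] | select(.labels.alertname == "HostDown" and .state == "firing")] | length > 0' >/dev/null
}
```

After stopping node_exporter, `wait_until 240 15 hostdown_firing && echo "alert fired"` allows for the 2-minute trigger window plus scrape and evaluation delays. The check matches any firing HostDown alert, so it is only meaningful when none is already firing.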
---

## 🟢 Verified Online Nodes (March 2026)

As of March 11, 2026, all 16 nodes active at that time were verified reachable via ping. The table also reflects later changes described in the Notes, so its row count differs from 16:

| Node | Tailscale IP | Role |
|------|-------------|------|
| atlantis | 100.83.230.112 | Primary NAS, exit node |
| calypso | 100.103.48.78 | Secondary NAS, Headscale host |
| setillo | 100.125.0.20 | Remote NAS, Tucson |
| homelab | 100.67.40.126 | Main VM (this host) |
| pve | 100.87.12.28 | Proxmox hypervisor |
| vish-concord-nuc | 100.72.55.21 | Intel NUC, exit node |
| pi-5 | 100.77.151.40 | Raspberry Pi 5 |
| matrix-ubuntu | 100.85.21.51 | Atlantis VM |
| guava | 100.75.252.64 | TrueNAS Scale |
| jellyfish | 100.69.121.120 | Pi 5 media/NAS |
| gl-mt3000 | 100.126.243.15 | GL.iNet Beryl AX (travel router, repeater behind GL-MT3600BE, exit node) |
| gl-be3600 | 100.105.59.123 | GL.iNet Slate 7 (travel router, exit node) |
| gl-mt3600be | 100.64.0.10 | GL.iNet Beryl 7 (remote primary gateway, subnet + exit node) |
| homeassistant | 100.112.186.90 | HA Green (via remote subnet, behind GL-MT3600BE) |
| seattle | 100.82.197.124 | Contabo VPS, exit node |
| shinku-ryuu | 100.98.93.15 | Desktop workstation (Windows) |
| moon | 100.64.0.6 | Debian x86_64, remote subnet (`192.168.12.223`, behind GL-MT3600BE) |
| jellyfish | 100.69.121.120 | Remote workstation (behind GL-MT3600BE) |
| headscale-test | 100.64.0.1 | Headscale test node |

### Notes
- **moon** was migrated from public Tailscale (`dvish92@`) to Headscale on 2026-03-14. It sits on the `192.168.12.0/24` subnet, now behind the GL-MT3600BE (Beryl 7) router, which replaced the GL-MT3000 on 2026-04-16. `accept_routes=true` is enabled so that it can reach `192.168.0.0/24` (home LAN) via Calypso's subnet advertisement.
- **guava** has `accept_routes=false` so that Calypso's `192.168.0.0/24` route does not override replies on its own LAN. See `docs/troubleshooting/guava-smb-incident-2026-03-14.md`.
- **shinku-ryuu** also has `accept_routes=false` for the same reason.

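
The per-node settings in these notes can be captured as a lookup that prints the command to run on each node. This is a sketch only. The notes write the setting as `accept_routes=...`, while the sketch assumes it is applied with the Tailscale CLI flag `--accept-routes`. Both function names are hypothetical.

```bash
# Hypothetical lookup of the accept-routes setting described in the notes.
accept_routes_for() {
    case "$1" in
        moon)               echo "true" ;;
        guava|shinku-ryuu)  echo "false" ;;
        *)                  return 1 ;;   # node not covered by the notes
    esac
}

# Print (do not run) the command to apply on the node itself.
accept_routes_command() {
    value="$(accept_routes_for "$1")" || return 1
    echo "tailscale set --accept-routes=$value"
}
```

For example, `accept_routes_command guava` prints `tailscale set --accept-routes=false`. Nodes not mentioned in the notes return a non-zero status rather than a guessed value.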
---

**Last Updated:** April 2026

**Note:** The Feb 2026 alerts (homelab-node and vmi2076105-node offline) were resolved. Both nodes are now online.