# 📊 Monitoring and Alerting Setup

This document details the monitoring and alerting infrastructure for the homelab environment, providing configuration guidance and operational procedures.

## 🧰 Monitoring Stack Overview

### Services Deployed

- **Grafana** (v12.4.0): Visualization and dashboarding
- **Prometheus**: Metrics collection and storage
- **Node Exporter**: Host-level metrics
- **SNMP Exporter**: Synology NAS metrics collection

### Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Services     │───▶│   Prometheus    │───▶│     Grafana     │
│  (containers)   │    │   (scraping)    │    │    (visual)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      Hosts      │    │    Exporters    │    │   Dashboards    │
│ (node_exporter) │    │ (snmp_exporter) │    │  (Grafana UI)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

## 🔧 Current Configuration

### Active Monitoring Services

| Service | Host | Port | URL | Purpose |
|---------|------|------|-----|---------|
| **Grafana** | Homelab VM | 3300 | `https://gf.vish.gg` | Dashboards & visualization |
| **Prometheus** | Homelab VM | 9090 | `http://192.168.0.210:9090` | Metrics collection & storage |
| **Alertmanager** | Homelab VM | 9093 | `http://192.168.0.210:9093` | Alert routing & dedup |
| **ntfy** | Homelab VM | 8081 | `https://ntfy.vish.gg` | Push notifications |
| **Uptime Kuma** | RPi 5 | 3001 | `http://192.168.0.66:3001` or `https://kuma.vish.gg` | Uptime monitoring (97 monitors) |
| **DIUN** | Atlantis | n/a | ntfy topic `diun` | Docker image update detection |
| **Scrutiny** | Multiple | 8090 | `http://192.168.0.210:8090` | SMART disk health |

### Prometheus Targets (14 active)

| Job | Target | Type | Status |
|-----|--------|------|--------|
| atlantis-node | atlantis | node_exporter | Up |
| atlantis-snmp | atlantis | SNMP exporter | Up |
| calypso-node | calypso | node_exporter | Up |
| calypso-snmp | calypso | SNMP exporter | Up |
| concord-nuc-node | concord-nuc | node_exporter | Up |
| homelab-node | homelab-vm | node_exporter | Up |
| node_exporter | homelab-vm | node_exporter (self) | Up |
| prometheus | localhost:9090 | self-scrape | Up |
| proxmox-node | proxmox | node_exporter | Up |
| raspberry-pis | pi-5 | node_exporter | Up |
| seattle-node | seattle | node_exporter | Up |
| setillo-node | setillo | node_exporter | Up |
| setillo-snmp | setillo | SNMP exporter | Up |
| truenas-node | guava | node_exporter | Up |

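
The status column in the targets table can be re-checked programmatically. The following is a minimal sketch, assuming the Prometheus instance at `http://192.168.0.210:9090` is reachable and exposes the standard `/api/v1/targets` endpoint without authentication. The function names are illustrative and not part of the homelab repository.

```python
import json
from urllib.request import urlopen

# Base URL taken from the services table; adjust if Prometheus moves.
PROMETHEUS_URL = "http://192.168.0.210:9090"


def summarize_targets(payload: dict) -> dict:
    """Group active targets by job name and collect each target's health."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"]["job"]
        summary.setdefault(job, []).append(target["health"])
    return summary


def unhealthy_jobs(payload: dict) -> list:
    """Return sorted job names with at least one target that is not 'up'."""
    summary = summarize_targets(payload)
    return sorted(
        job for job, states in summary.items()
        if any(state != "up" for state in states)
    )


def fetch_targets(base_url: str = PROMETHEUS_URL) -> dict:
    """Fetch the targets payload from Prometheus (needs network access)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

Calling `unhealthy_jobs(fetch_targets())` from a host on the LAN should return an empty list when every job in the table is up.
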
## 📈 Key Metrics Monitored

### System Resources

- CPU utilization percentage
- Memory usage and availability
- Disk space and I/O operations
- Network traffic and latency

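
As an illustration of how these system metrics might be queried, the sketch below maps a few of them to PromQL expressions over standard node_exporter metric names and builds the corresponding Prometheus instant-query URLs. The specific expressions, the 5-minute rate window, and the filesystem filter are assumptions, not the queries used by the existing dashboards.

```python
from urllib.parse import urlencode

# Base URL taken from the services table.
PROMETHEUS_URL = "http://192.168.0.210:9090"

# Hypothetical PromQL expressions built on standard node_exporter metrics.
QUERIES = {
    "cpu_used_percent": (
        '100 * (1 - avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])))'
    ),
    "memory_available_percent": (
        "100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes"
    ),
    "disk_free_percent": (
        '100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} '
        '/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}'
    ),
    "network_receive_bytes_per_s": (
        'rate(node_network_receive_bytes_total{device!="lo"}[5m])'
    ),
}


def query_url(name: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the instant-query URL for one of the named expressions."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": QUERIES[name]})
```

These expressions apply only to hosts scraped through node_exporter; the SNMP-scraped Synology targets expose different metric names.
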
### Service Availability

- HTTP response times (Uptime Kuma)
- Container restart counts
- Database connection status
- Backup success rates

### Network Health

- Tailscale connectivity status
- External service reachability
- DNS resolution times
- Cloudflare metrics

## ⚠️ Alerting Strategy

### Alert Levels

1. **Critical (Immediate Action)**
   - Service downtime (>5 min)
   - System resource exhaustion (<10% free)
   - Backup failures

2. **Warning (Review Required)**
   - High resource usage (>80%)
   - Container restarts
   - Slow response times

3. **Info (Monitoring Only)**
   - New service deployments
   - Configuration changes
   - Routine maintenance

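
The numeric thresholds in the levels above can be expressed as a small classifier. This is a sketch of the stated rules only (less than 10% free is critical, more than 80% used is a warning, more than 5 minutes of downtime is critical). The function names and the `"ok"` return value for readings below both thresholds are assumptions.

```python
def classify_resource(used_percent: float) -> str:
    """Map a resource usage percentage to an alert level."""
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    free_percent = 100 - used_percent
    if free_percent < 10:   # resource exhaustion: less than 10% free
        return "critical"
    if used_percent > 80:   # high resource usage: more than 80% used
        return "warning"
    return "ok"             # below both thresholds; no alert raised


def is_critical_downtime(minutes_down: float) -> bool:
    """Service downtime counts as critical once it exceeds 5 minutes."""
    return minutes_down > 5
```

Note that exactly 90% used leaves exactly 10% free, which this sketch treats as a warning rather than critical because the rule is a strict "less than".
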
### Alert Channels

- ntfy notifications for critical issues
- Email alerts to administrators
- Slack integration for team communication
- Uptime Kuma dashboard for service status

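
For the ntfy channel, a notification is an HTTP POST to a topic URL with optional `Title` and `Priority` headers. The sketch below builds such a request against `https://ntfy.vish.gg`. The topic name `homelab-alerts` in the usage note and the level-to-priority mapping are hypothetical, and the sketch assumes the topic accepts unauthenticated publishes, which may not hold for this server.

```python
from urllib.request import Request, urlopen

# ntfy base URL taken from the services table.
NTFY_URL = "https://ntfy.vish.gg"

# Assumed mapping from this document's alert levels to ntfy priority names.
PRIORITY_BY_LEVEL = {"critical": "urgent", "warning": "default", "info": "low"}


def build_notification(topic: str, title: str, message: str,
                       level: str = "info") -> Request:
    """Build (but do not send) a ntfy publish request for the given topic."""
    return Request(
        f"{NTFY_URL}/{topic}",
        data=message.encode("utf-8"),
        # Keep the title ASCII: HTTP header values are not reliably UTF-8 safe.
        headers={"Title": title, "Priority": PRIORITY_BY_LEVEL[level]},
        method="POST",
    )


def send(request: Request) -> int:
    """Send a prepared request and return the HTTP status code."""
    with urlopen(request, timeout=10) as resp:
        return resp.status
```

For example, `send(build_notification("homelab-alerts", "Disk low", "atlantis is below 10% free", "critical"))` would publish an urgent message to the hypothetical topic.
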
## 📋 Maintenance Procedures

### Regular Tasks

1. **Daily**
   - Review Uptime Kuma service status
   - Check Prometheus metrics for anomalies
   - Verify Grafana dashboards display correctly

2. **Weekly**
   - Update dashboard panels if needed
   - Review and update alert thresholds
   - Validate alert routes are working properly

3. **Monthly**
   - Audit alert configurations
   - Test alert delivery mechanisms
   - Review Prometheus storage usage

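
One way to approach the monthly delivery test is to post a synthetic alert to the Alertmanager v2 API and confirm it arrives on the expected channel. The sketch below only builds the request. The alert name `DeliveryTest`, its labels, and the 5-minute lifetime are hypothetical, and whether the alert is actually delivered depends on the routing rules in the Alertmanager configuration, which this document does not show.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request

# Alertmanager base URL taken from the services table.
ALERTMANAGER_URL = "http://192.168.0.210:9093"


def build_test_alert(now=None, minutes: int = 5) -> Request:
    """Build a POST request carrying one short-lived synthetic alert."""
    now = now or datetime.now(timezone.utc)
    alert = {
        # Hypothetical labels; match them to a route in your configuration.
        "labels": {"alertname": "DeliveryTest", "severity": "info"},
        "annotations": {"summary": "Monthly alert delivery test"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }
    return Request(
        f"{ALERTMANAGER_URL}/api/v2/alerts",
        data=json.dumps([alert]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` from a host on the LAN submits the alert.
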
## 📚 Related Documentation

- [Image Update Guide](IMAGE_UPDATE_GUIDE.md): Renovate, DIUN, Watchtower
- [Ansible Playbook Guide](ANSIBLE_PLAYBOOK_GUIDE.md): `health_check.yml`, `service_status.yml`
- [Backup Strategy](../infrastructure/backup-strategy.md): backup monitoring
- [Offline & Remote Access](../infrastructure/offline-and-remote-access.md): accessing monitoring when internet is down
- [Disaster Recovery Procedures](disaster-recovery.md)
- [Security Hardening](security-hardening.md)

---

*Last updated: 2026*