# 📊 Monitoring and Alerting Setup

This document details the monitoring and alerting infrastructure for the homelab environment, providing configuration guidance and operational procedures.

## 🧰 Monitoring Stack Overview

### Services Deployed

- **Grafana** (v12.4.0): Visualization and dashboarding
- **Prometheus**: Metrics collection and storage
- **Node Exporter**: Host-level metrics
- **SNMP Exporter**: Synology NAS metrics collection

### Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Services     │───▶│   Prometheus    │───▶│     Grafana     │
│  (containers)   │    │   (scraping)    │    │    (visual)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      Hosts      │    │    Exporters    │    │   Dashboards    │
│ (node_exporter) │    │ (snmp_exporter) │    │  (Grafana UI)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

## 🔧 Current Configuration

### Active Monitoring Services

| Service | Host | Port | URL | Purpose |
|---------|------|------|-----|---------|
| **Grafana** | Homelab VM | 3300 | `https://gf.vish.gg` | Dashboards & visualization |
| **Prometheus** | Homelab VM | 9090 | `http://192.168.0.210:9090` | Metrics collection & storage |
| **Alertmanager** | Homelab VM | 9093 | `http://192.168.0.210:9093` | Alert routing & dedup |
| **ntfy** | Homelab VM | 8081 | `https://ntfy.vish.gg` | Push notifications |
| **Uptime Kuma** | RPi 5 | 3001 | `http://192.168.0.66:3001` or `https://kuma.vish.gg` | Uptime monitoring (97 monitors) |
| **DIUN** | Atlantis | - | ntfy topic `diun` | Docker image update detection |
| **Scrutiny** | Multiple | 8090 | `http://192.168.0.210:8090` | SMART disk health |

### Prometheus Targets (14 active)

| Job | Target | Type | Status |
|-----|--------|------|--------|
| atlantis-node | atlantis | node_exporter | Up |
| atlantis-snmp | atlantis | SNMP exporter | Up |
| calypso-node | calypso | node_exporter | Up |
| calypso-snmp | calypso | SNMP exporter | Up |
| concord-nuc-node | concord-nuc | node_exporter | Up |
| homelab-node | homelab-vm | node_exporter | Up |
| node_exporter | homelab-vm | node_exporter (self) | Up |
| prometheus | localhost:9090 | self-scrape | Up |
| proxmox-node | proxmox | node_exporter | Up |
| raspberry-pis | pi-5 | node_exporter | Up |
| seattle-node | seattle | node_exporter | Up |
| setillo-node | setillo | node_exporter | Up |
| setillo-snmp | setillo | SNMP exporter | Up |
| truenas-node | guava | node_exporter | Up |

## 📈 Key Metrics Monitored

### System Resources

- CPU utilization percentage
- Memory usage and availability
- Disk space and I/O operations
- Network traffic and latency

### Service Availability

- HTTP response times (Uptime Kuma)
- Container restart counts
- Database connection status
- Backup success rates

### Network Health

- Tailscale connectivity status
- External service reachability
- DNS resolution times
- Cloudflare metrics

## ⚠️ Alerting Strategy

### Alert Levels

1. **Critical (Immediate Action)**
   - Service downtime (>5 min)
   - System resource exhaustion (<10% free)
   - Backup failures

2. **Warning (Review Required)**
   - High resource usage (>80%)
   - Container restarts
   - Slow response times

3. **Info (Monitoring Only)**
   - New service deployments
   - Configuration changes
   - Routine maintenance

### Alert Channels

- ntfy notifications for critical issues
- Email alerts to administrators
- Slack integration for team communication
- Uptime Kuma dashboard for service status

## 📋 Maintenance Procedures

### Regular Tasks

1. **Daily**
   - Review Uptime Kuma service status
   - Check Prometheus metrics for anomalies
   - Verify Grafana dashboards display correctly

2. **Weekly**
   - Update dashboard panels if needed
   - Review and update alert thresholds
   - Validate alert routes are working properly

3. **Monthly**
   - Audit alert configurations
   - Test alert delivery mechanisms
   - Review Prometheus storage usage

## 📚 Related Documentation

- [Image Update Guide](IMAGE_UPDATE_GUIDE.md): Renovate, DIUN, Watchtower
- [Ansible Playbook Guide](ANSIBLE_PLAYBOOK_GUIDE.md): `health_check.yml`, `service_status.yml`
- [Backup Strategy](../infrastructure/backup-strategy.md): backup monitoring
- [Offline & Remote Access](../infrastructure/offline-and-remote-access.md): accessing monitoring when internet is down
- [Disaster Recovery Procedures](disaster-recovery.md)
- [Security Hardening](security-hardening.md)

---

*Last updated: 2026*
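
The daily Prometheus check under Regular Tasks could be scripted along the following lines. This is a minimal sketch, not part of the deployed stack: it assumes the Prometheus instance at `http://192.168.0.210:9090` from the services table is reachable without authentication, and it reads the standard `/api/v1/targets` endpoint of the Prometheus HTTP API. The function names `fetch_targets` and `down_targets` are hypothetical, chosen for illustration.

```python
import json
from urllib.request import urlopen

# Address taken from the Active Monitoring Services table; adjust if it changes.
PROMETHEUS_URL = "http://192.168.0.210:9090"


def fetch_targets(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Fetch the scrape-target list via GET /api/v1/targets (assumes no auth)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def down_targets(payload: dict) -> list[tuple[str, str]]:
    """Return sorted (job, instance) pairs for active targets not reporting "up"."""
    active = payload.get("data", {}).get("activeTargets", [])
    return sorted(
        (t["labels"].get("job", "?"), t["labels"].get("instance", "?"))
        for t in active
        if t.get("health") != "up"
    )


if __name__ == "__main__":
    down = down_targets(fetch_targets())
    if down:
        for job, instance in down:
            print(f"DOWN: {job} ({instance})")
    else:
        print("All active targets report up")
```

Keeping `down_targets` separate from the network call means the filtering logic can be checked against a saved API response without touching the live server.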