📊 Monitoring and Alerting Setup
This document details the monitoring and alerting infrastructure for the homelab environment, providing configuration guidance and operational procedures.
🧰 Monitoring Stack Overview
Services Deployed
- Grafana (v12.4.0): Visualization and dashboarding
- Prometheus: Metrics collection and storage
- Node Exporter: Host-level metrics
- SNMP Exporter: Synology NAS metrics collection
Architecture
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│   Services    │───▶│  Prometheus   │───▶│    Grafana    │
│ (containers)  │    │  (scraping)   │    │   (visual)    │
└───────────────┘    └───────────────┘    └───────────────┘
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│     Hosts     │    │   Exporters   │    │  Dashboards   │
│(node_exporter)│    │(snmp_exporter)│    │ (Grafana UI)  │
└───────────────┘    └───────────────┘    └───────────────┘
🔧 Current Configuration
Active Monitoring Services
| Service | Host | Port | URL | Purpose |
|---|---|---|---|---|
| Grafana | Homelab VM | 3300 | https://gf.vish.gg | Dashboards & visualization |
| Prometheus | Homelab VM | 9090 | http://192.168.0.210:9090 | Metrics collection & storage |
| Alertmanager | Homelab VM | 9093 | http://192.168.0.210:9093 | Alert routing & dedup |
| ntfy | Homelab VM | 8081 | https://ntfy.vish.gg | Push notifications |
| Uptime Kuma | RPi 5 | 3001 | http://192.168.0.66:3001 or https://kuma.vish.gg | Uptime monitoring (97 monitors) |
| DIUN | Atlantis | n/a | ntfy topic diun | Docker image update detection |
| Scrutiny | Multiple | 8090 | http://192.168.0.210:8090 | SMART disk health |
Prometheus Targets (14 active)
| Job | Target | Type | Status |
|---|---|---|---|
| atlantis-node | atlantis | node_exporter | Up |
| atlantis-snmp | atlantis | SNMP exporter | Up |
| calypso-node | calypso | node_exporter | Up |
| calypso-snmp | calypso | SNMP exporter | Up |
| concord-nuc-node | concord-nuc | node_exporter | Up |
| homelab-node | homelab-vm | node_exporter | Up |
| node_exporter | homelab-vm | node_exporter (self) | Up |
| prometheus | localhost:9090 | self-scrape | Up |
| proxmox-node | proxmox | node_exporter | Up |
| raspberry-pis | pi-5 | node_exporter | Up |
| seattle-node | seattle | node_exporter | Up |
| setillo-node | setillo | node_exporter | Up |
| setillo-snmp | setillo | SNMP exporter | Up |
| truenas-node | guava | node_exporter | Up |
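The Status column above can be checked programmatically against Prometheus' `/api/v1/targets` HTTP endpoint. The following is a minimal sketch, assuming the Prometheus address from the services table is reachable without authentication; the function names are illustrative and not part of any existing tooling in this homelab.

```python
import json
import urllib.request

# Address taken from the services table; adjust if Prometheus moves.
PROMETHEUS_URL = "http://192.168.0.210:9090"


def summarize_targets(payload: dict) -> dict:
    """Map each job name to a list of (scrape URL, health) pairs."""
    summary: dict = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        summary.setdefault(job, []).append((target["scrapeUrl"], target["health"]))
    return summary


def unhealthy_jobs(payload: dict) -> list:
    """Return sorted job names with at least one target not reporting 'up'."""
    summary = summarize_targets(payload)
    return sorted(
        job
        for job, targets in summary.items()
        if any(health != "up" for _, health in targets)
    )


def fetch_targets(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Fetch the current target list from the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `unhealthy_jobs(fetch_targets())` from a host on the LAN should return an empty list when all 14 targets are up.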
📈 Key Metrics Monitored
System Resources
- CPU utilization percentage
- Memory usage and availability
- Disk space and I/O operations
- Network traffic and latency
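As a rough illustration of how these system metrics map onto node_exporter series, the sketch below pairs each one with a commonly used PromQL expression and builds the matching instant-query URL. The 5-minute rate window and the filesystem and device filters are assumptions, not values taken from this homelab's dashboards, and latency is left out because node_exporter does not expose it directly.

```python
from urllib.parse import urlencode

# Address taken from the services table.
PROMETHEUS_URL = "http://192.168.0.210:9090"

# PromQL expressions over standard node_exporter metric names.
# Windows and label filters are illustrative assumptions.
QUERIES = {
    "cpu_used_percent": (
        '100 - (avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
    ),
    "memory_available_percent": (
        "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100"
    ),
    "disk_free_percent": (
        'node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} '
        '/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100'
    ),
    "network_receive_bytes_per_second": (
        'rate(node_network_receive_bytes_total{device!="lo"}[5m])'
    ),
}


def instant_query_url(name: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a /api/v1/query URL for one of the named expressions."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": QUERIES[name]})
```

The same expressions can be pasted into the Prometheus expression browser or a Grafana panel without the URL wrapper.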
Service Availability
- HTTP response times (Uptime Kuma)
- Container restart counts
- Database connection status
- Backup success rates
Network Health
- Tailscale connectivity status
- External service reachability
- DNS resolution times
- Cloudflare metrics
⚠️ Alerting Strategy
Alert Levels
- Critical (Immediate Action)
  - Service downtime (>5 min)
  - System resource exhaustion (<10% free)
  - Backup failures
- Warning (Review Required)
  - High resource usage (>80%)
  - Container restarts
  - Slow response times
- Info (Monitoring Only)
  - New service deployments
  - Configuration changes
  - Routine maintenance
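The thresholds above can be expressed as a small classification function. This is a sketch only: the parameter names are hypothetical, "slow response times" is omitted because no threshold is given for it, and the actual rules presumably live in Prometheus and Alertmanager configuration rather than in Python.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


def classify(
    downtime_minutes: float = 0.0,
    resource_used_percent: float = 0.0,
    backup_failed: bool = False,
    container_restarts: int = 0,
) -> Severity:
    """Apply the documented alert levels to one observation."""
    free_percent = 100 - resource_used_percent

    # Critical: downtime over 5 minutes, under 10% free, or a failed backup.
    if downtime_minutes > 5 or free_percent < 10 or backup_failed:
        return Severity.CRITICAL

    # Warning: usage above 80% or any container restart.
    if resource_used_percent > 80 or container_restarts > 0:
        return Severity.WARNING

    # Everything else is informational.
    return Severity.INFO
```

Note that a resource at exactly 90% used (10% free) stays at warning, since the critical rule is strictly "less than 10% free".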
Alert Channels
- ntfy notifications for critical issues
- Email alerts to administrators
- Slack integration for team communication
- Uptime Kuma dashboard for service status
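For the ntfy channel, a notification is a plain HTTP POST to a topic URL, with the title and priority passed as headers. The sketch below builds such a request against the ntfy instance listed earlier. The topic name `homelab-alerts` and the severity-to-priority mapping are assumptions for illustration, and an instance with access control enabled would also need an authorization header.

```python
import urllib.request

# Base URL taken from the services table.
NTFY_URL = "https://ntfy.vish.gg"
# Hypothetical topic name; substitute the topic actually in use.
TOPIC = "homelab-alerts"

# Assumed mapping from alert level to ntfy priority names.
PRIORITY_BY_SEVERITY = {
    "critical": "urgent",
    "warning": "high",
    "info": "default",
}


def build_ntfy_request(
    title: str,
    message: str,
    severity: str = "critical",
    topic: str = TOPIC,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes one message."""
    return urllib.request.Request(
        f"{NTFY_URL}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": PRIORITY_BY_SEVERITY[severity]},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Separating request construction from sending keeps the message format easy to inspect without touching the network. For example, `send(build_ntfy_request("Prometheus down", "No scrape for 5 min"))` would publish a single urgent notification.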
📋 Maintenance Procedures
Regular Tasks
- Daily
  - Review Uptime Kuma service status
  - Check Prometheus metrics for anomalies
  - Verify Grafana dashboards display correctly
- Weekly
  - Update dashboard panels if needed
  - Review and update alert thresholds
  - Validate alert routes are working properly
- Monthly
  - Audit alert configurations
  - Test alert delivery mechanisms
  - Review Prometheus storage usage
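Part of the daily review could be scripted by polling the health endpoints the stack exposes: `/-/ready` on Prometheus and Alertmanager, and `/api/health` on Grafana. The sketch below assumes the addresses from the services table and treats any 2xx response as healthy. The `opener` parameter is a hypothetical hook so the check can be exercised without network access.

```python
import urllib.request

# Addresses taken from the services table; paths are the standard
# readiness and health endpoints of each service.
HEALTH_ENDPOINTS = {
    "prometheus": "http://192.168.0.210:9090/-/ready",
    "alertmanager": "http://192.168.0.210:9093/-/ready",
    "grafana": "https://gf.vish.gg/api/health",
}


def is_healthy(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # connection failures, timeouts, and non-2xx responses.
        return False


def check_all(endpoints: dict = HEALTH_ENDPOINTS, opener=urllib.request.urlopen) -> dict:
    """Check every endpoint and map service name to health."""
    return {name: is_healthy(url, opener=opener) for name, url in endpoints.items()}
```

A result such as `{"prometheus": True, "alertmanager": False, "grafana": True}` would point at Alertmanager as the first thing to investigate.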
📚 Related Documentation
- Image Update Guide: Renovate, DIUN, Watchtower
- Ansible Playbook Guide: health_check.yml, service_status.yml
- Backup Strategy: backup monitoring
- Offline & Remote Access: accessing monitoring when internet is down
- Disaster Recovery Procedures
- Security Hardening
Last updated: 2026