Sanitized mirror from private repository - 2026-04-05 05:32:08 UTC
# 📊 Monitoring and Alerting Setup
This document describes the homelab's monitoring and alerting infrastructure: the deployed stack, its current configuration, the alerting strategy, and routine maintenance procedures.
## 🧰 Monitoring Stack Overview
### Services Deployed
- **Grafana** (v12.4.0): Visualization and dashboarding
- **Prometheus**: Metrics collection and storage
- **Node Exporter**: Host-level metrics
- **SNMP Exporter**: Synology NAS metrics collection
### Architecture
```
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│   Services    │───▶│  Prometheus   │───▶│    Grafana    │
│ (containers)  │    │  (scraping)   │    │   (visual)    │
└───────────────┘    └───────────────┘    └───────────────┘
        │                    │                    │
        ▼                    ▼                    ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│     Hosts     │    │   Exporters   │    │  Dashboards   │
│(node_exporter)│    │(snmp_exporter)│    │ (Grafana UI)  │
└───────────────┘    └───────────────┘    └───────────────┘
```
## 🔧 Current Configuration
### Active Monitoring Services
| Service | Host | Port | URL | Purpose |
|---------|------|------|-----|---------|
| **Grafana** | Homelab VM | 3300 | `https://gf.vish.gg` | Dashboards & visualization |
| **Prometheus** | Homelab VM | 9090 | `http://192.168.0.210:9090` | Metrics collection & storage |
| **Alertmanager** | Homelab VM | 9093 | `http://192.168.0.210:9093` | Alert routing & dedup |
| **ntfy** | Homelab VM | 8081 | `https://ntfy.vish.gg` | Push notifications |
| **Uptime Kuma** | RPi 5 | 3001 | `http://192.168.0.66:3001` or `https://kuma.vish.gg` | Uptime monitoring (97 monitors) |
| **DIUN** | Atlantis | — | ntfy topic `diun` | Docker image update detection |
| **Scrutiny** | Multiple | 8090 | `http://192.168.0.210:8090` | SMART disk health |
### Prometheus Targets (14 active)
| Job | Target | Type | Status |
|-----|--------|------|--------|
| atlantis-node | atlantis | node_exporter | Up |
| atlantis-snmp | atlantis | SNMP exporter | Up |
| calypso-node | calypso | node_exporter | Up |
| calypso-snmp | calypso | SNMP exporter | Up |
| concord-nuc-node | concord-nuc | node_exporter | Up |
| homelab-node | homelab-vm | node_exporter | Up |
| node_exporter | homelab-vm | node_exporter (self) | Up |
| prometheus | localhost:9090 | self-scrape | Up |
| proxmox-node | proxmox | node_exporter | Up |
| raspberry-pis | pi-5 | node_exporter | Up |
| seattle-node | seattle | node_exporter | Up |
| setillo-node | setillo | node_exporter | Up |
| setillo-snmp | setillo | SNMP exporter | Up |
| truenas-node | guava | node_exporter | Up |
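The status column above is a snapshot. A minimal sketch of how it could be re-checked against the live server, assuming the Prometheus instance at `http://192.168.0.210:9090` exposes its standard HTTP API without authentication; the function names are illustrative and not part of this repository:

```python
import json
import urllib.request
from collections import Counter


def fetch_targets(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the raw /api/v1/targets response from a Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def summarize_targets(payload: dict):
    """Return (health counts, sorted job names with a target that is not 'up')."""
    active = payload.get("data", {}).get("activeTargets", [])
    # Each active target reports health as "up", "down", or "unknown".
    counts = Counter(t.get("health", "unknown") for t in active)
    unhealthy = sorted(
        {
            t.get("labels", {}).get("job", "unknown")
            for t in active
            if t.get("health") != "up"
        }
    )
    return counts, unhealthy
```

Calling `summarize_targets(fetch_targets("http://192.168.0.210:9090"))` from a host on the LAN should report 14 targets as `up` if the table is still accurate; any job name in the second return value needs attention.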
## 📈 Key Metrics Monitored
### System Resources
- CPU utilization percentage
- Memory usage and availability
- Disk space and I/O operations
- Network traffic and latency
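For the node_exporter hosts, these resources are usually read with PromQL expressions over the standard Linux node_exporter metrics. The sketch below builds instant-query URLs for the Prometheus HTTP API. The expressions are commonly used forms and are an assumption here; they are not taken from this homelab's dashboards, and the Synology hosts scraped over SNMP expose different metric names.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://192.168.0.210:9090"  # from the services table above

# Assumed PromQL expressions based on standard node_exporter metric names.
QUERIES = {
    "cpu_used_percent": (
        '100 - (avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
    ),
    "memory_used_percent": (
        "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
    ),
    "disk_used_percent": (
        "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100"
    ),
    "network_receive_bytes_per_second": "rate(node_network_receive_bytes_total[5m])",
}


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the /api/v1/query endpoint with the expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"
```

Each URL can be opened with `urllib.request.urlopen` or pasted into a browser; the same expressions also work in the Prometheus expression browser and in Grafana panels.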
### Service Availability
- HTTP response times (Uptime Kuma)
- Container restart counts
- Database connection status
- Backup success rates
### Network Health
- Tailscale connectivity status
- External service reachability
- DNS resolution times
- Cloudflare metrics
## ⚠️ Alerting Strategy
### Alert Levels
1. **Critical (Immediate Action)**
   - Service downtime (>5 min)
   - System resource exhaustion (<10% free)
   - Backup failures
2. **Warning (Review Required)**
   - High resource usage (>80%)
   - Container restarts
   - Slow response times
3. **Info (Monitoring Only)**
   - New service deployments
   - Configuration changes
   - Routine maintenance
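The thresholds above can be expressed as a small classification function. This is a sketch for illustration only, not the actual Alertmanager or Prometheus rule configuration; the `Observation` type and its field names are hypothetical. "Slow response times" is left out because the list gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical snapshot of one service or host."""
    downtime_minutes: float = 0.0
    resource_used_percent: float = 0.0
    backup_failed: bool = False
    container_restarts: int = 0


def classify(obs: Observation) -> str:
    """Map an observation to 'critical', 'warning', or 'info'."""
    free_percent = 100 - obs.resource_used_percent
    # Critical: downtime over 5 minutes, under 10% free, or a failed backup.
    if obs.downtime_minutes > 5 or free_percent < 10 or obs.backup_failed:
        return "critical"
    # Warning: usage over 80% or any container restart.
    if obs.resource_used_percent > 80 or obs.container_restarts > 0:
        return "warning"
    return "info"
```

Checking the critical conditions first means a host at 95% usage is reported once as critical rather than also as a warning.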
### Alert Channels
- ntfy notifications for critical issues
- Email alerts to administrators
- Slack integration for team communication
- Uptime Kuma dashboard for service status
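As an illustration of the ntfy channel, the sketch below builds a publish request using ntfy's HTTP interface (message in the POST body, `Title` and `Priority` headers). The topic name `homelab-alerts` and the level-to-priority mapping are assumptions made for this example, and the server may additionally require an access token that is not shown here.

```python
import urllib.request

NTFY_URL = "https://ntfy.vish.gg"  # from the services table above

# Assumed mapping from this document's alert levels to ntfy priority names.
PRIORITY = {"critical": "urgent", "warning": "default", "info": "low"}


def build_ntfy_request(
    topic: str, title: str, message: str, level: str = "info"
) -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes to an ntfy topic."""
    return urllib.request.Request(
        f"{NTFY_URL}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": PRIORITY[level]},
        method="POST",
    )
```

To send the notification, pass the result to `urllib.request.urlopen(req, timeout=5)`. Keeping construction separate from sending makes the request easy to inspect during the monthly alert delivery test.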
## 📋 Maintenance Procedures
### Regular Tasks
1. **Daily**
   - Review Uptime Kuma service status
   - Check Prometheus metrics for anomalies
   - Verify Grafana dashboards display correctly
2. **Weekly**
   - Update dashboard panels if needed
   - Review and update alert thresholds
   - Validate alert routes are working properly
3. **Monthly**
   - Audit alert configurations
   - Test alert delivery mechanisms
   - Review Prometheus storage usage
## 📚 Related Documentation
- [Image Update Guide](IMAGE_UPDATE_GUIDE.md) — Renovate, DIUN, Watchtower
- [Ansible Playbook Guide](ANSIBLE_PLAYBOOK_GUIDE.md) — `health_check.yml`, `service_status.yml`
- [Backup Strategy](../infrastructure/backup-strategy.md) — backup monitoring
- [Offline & Remote Access](../infrastructure/offline-and-remote-access.md) — accessing monitoring when internet is down
- [Disaster Recovery Procedures](disaster-recovery.md)
- [Security Hardening](security-hardening.md)
---
*Last updated: 2026*