Sanitized mirror from private repository - 2026-04-05 10:08:22 UTC
docs/admin/monitoring-setup.md
# 📊 Monitoring and Alerting Setup

This document details the monitoring and alerting infrastructure for the homelab environment, providing configuration guidance and operational procedures.
## 🧰 Monitoring Stack Overview

### Services Deployed

- **Grafana** (v12.4.0): Visualization and dashboarding
- **Prometheus**: Metrics collection and storage
- **Node Exporter**: Host-level metrics
- **SNMP Exporter**: Synology NAS metrics collection
### Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Services     │───▶│   Prometheus    │───▶│     Grafana     │
│  (containers)   │    │   (scraping)    │    │    (visual)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      Hosts      │    │    Exporters    │    │   Dashboards    │
│ (node_exporter) │    │ (snmp_exporter) │    │  (Grafana UI)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
## 🔧 Current Configuration

### Active Monitoring Services

| Service | Host | Port | URL | Purpose |
|---------|------|------|-----|---------|
| **Grafana** | Homelab VM | 3300 | `https://gf.vish.gg` | Dashboards & visualization |
| **Prometheus** | Homelab VM | 9090 | `http://192.168.0.210:9090` | Metrics collection & storage |
| **Alertmanager** | Homelab VM | 9093 | `http://192.168.0.210:9093` | Alert routing & dedup |
| **ntfy** | Homelab VM | 8081 | `https://ntfy.vish.gg` | Push notifications |
| **Uptime Kuma** | RPi 5 | 3001 | `http://192.168.0.66:3001` or `https://kuma.vish.gg` | Uptime monitoring (97 monitors) |
| **DIUN** | Atlantis | n/a | ntfy topic `diun` | Docker image update detection |
| **Scrutiny** | Multiple | 8090 | `http://192.168.0.210:8090` | SMART disk health |
### Prometheus Targets (14 active)

| Job | Target | Type | Status |
|-----|--------|------|--------|
| atlantis-node | atlantis | node_exporter | Up |
| atlantis-snmp | atlantis | SNMP exporter | Up |
| calypso-node | calypso | node_exporter | Up |
| calypso-snmp | calypso | SNMP exporter | Up |
| concord-nuc-node | concord-nuc | node_exporter | Up |
| homelab-node | homelab-vm | node_exporter | Up |
| node_exporter | homelab-vm | node_exporter (self) | Up |
| prometheus | localhost:9090 | self-scrape | Up |
| proxmox-node | proxmox | node_exporter | Up |
| raspberry-pis | pi-5 | node_exporter | Up |
| seattle-node | seattle | node_exporter | Up |
| setillo-node | setillo | node_exporter | Up |
| setillo-snmp | setillo | SNMP exporter | Up |
| truenas-node | guava | node_exporter | Up |
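
The status column in the targets table can be re-checked against Prometheus itself. The following is a minimal sketch, assuming the Prometheus instance at `http://192.168.0.210:9090` exposes the standard `/api/v1/targets` endpoint without authentication. The `SAMPLE` payload is hand-written for illustration (including the "down" entry) and is not real output from this environment.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Base URL taken from the services table in this document.
PROMETHEUS_URL = "http://192.168.0.210:9090"


def summarize_targets(payload: dict) -> tuple:
    """Count active targets by health and list the jobs that are not up."""
    targets = payload["data"]["activeTargets"]
    health = Counter(t["health"] for t in targets)
    not_up = sorted(t["labels"]["job"] for t in targets if t["health"] != "up")
    return health, not_up


def fetch_targets(base_url: str = PROMETHEUS_URL) -> dict:
    """Fetch /api/v1/targets. Needs network access to the Prometheus host."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


# Hypothetical sample in the shape of the API response (not real data).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "atlantis-node"}, "health": "up"},
            {"labels": {"job": "prometheus"}, "health": "up"},
            {"labels": {"job": "setillo-snmp"}, "health": "down"},
        ]
    },
}

if __name__ == "__main__":
    health, not_up = summarize_targets(SAMPLE)
    print(dict(health), not_up)
```

On the monitoring host, `summarize_targets(fetch_targets())` would give the live counts to compare with the 14 targets listed above.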
## 📈 Key Metrics Monitored

### System Resources

- CPU utilization percentage
- Memory usage and availability
- Disk space and I/O operations
- Network traffic and latency
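
The system resource metrics listed above map onto standard node_exporter series. The sketch below shows PromQL expressions commonly used for them, together with a helper that builds an instant-query URL for the Prometheus HTTP API. The expressions are assumptions for illustration; the queries actually used by the Grafana dashboards are not documented here.

```python
from urllib.parse import urlencode

# Base URL taken from the services table in this document.
PROMETHEUS_URL = "http://192.168.0.210:9090"

# Illustrative PromQL over standard node_exporter metrics (assumed, not
# copied from the real dashboards).
QUERIES = {
    "cpu_used_percent": (
        '100 - (avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
    ),
    "memory_available_percent": (
        "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100"
    ),
    "disk_free_percent": (
        'node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} '
        '/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100'
    ),
    "network_receive_bytes_per_second": (
        'rate(node_network_receive_bytes_total{device!="lo"}[5m])'
    ),
}


def instant_query_url(name: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': QUERIES[name]})}"


if __name__ == "__main__":
    for name in QUERIES:
        print(name, instant_query_url(name))
```

Each URL can be opened in a browser or fetched with any HTTP client; the same expressions can also be pasted into the Prometheus expression browser.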
### Service Availability

- HTTP response times (Uptime Kuma)
- Container restart counts
- Database connection status
- Backup success rates

### Network Health

- Tailscale connectivity status
- External service reachability
- DNS resolution times
- Cloudflare metrics
## ⚠️ Alerting Strategy

### Alert Levels

1. **Critical (Immediate Action)**
   - Service downtime (>5 min)
   - System resource exhaustion (<10% free)
   - Backup failures

2. **Warning (Review Required)**
   - High resource usage (>80%)
   - Container restarts
   - Slow response times

3. **Info (Monitoring Only)**
   - New service deployments
   - Configuration changes
   - Routine maintenance
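
The alert levels above can be read as a simple decision rule. The following sketch encodes the thresholds that this document states explicitly (more than 5 minutes of downtime, less than 10% free, more than 80% used). The `Observation` type and `classify` function are hypothetical names introduced for illustration, and "slow response times" is left out because no threshold is given for it.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical snapshot of one service or host."""
    downtime_minutes: float = 0.0      # time the service has been unreachable
    free_percent: float = 100.0        # scarcest resource still free
    usage_percent: float = 0.0         # highest resource utilization
    backup_failed: bool = False
    container_restarted: bool = False


def classify(obs: Observation) -> str:
    """Map an observation to the alert levels described in this section."""
    # Critical: downtime over 5 minutes, under 10% free, or a failed backup.
    if obs.downtime_minutes > 5 or obs.free_percent < 10 or obs.backup_failed:
        return "critical"
    # Warning: usage above 80% or a container restart.
    if obs.usage_percent > 80 or obs.container_restarted:
        return "warning"
    # Everything else is informational only.
    return "info"


if __name__ == "__main__":
    print(classify(Observation(downtime_minutes=6)))
    print(classify(Observation(usage_percent=85)))
    print(classify(Observation()))
```

In practice these conditions would live in Prometheus alerting rules with a `severity` label; the function only makes the precedence explicit (critical conditions are checked before warnings).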
### Alert Channels

- ntfy notifications for critical issues
- Email alerts to administrators
- Slack integration for team communication
- Uptime Kuma dashboard for service status
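
As an illustration of the ntfy channel, the sketch below builds a publish request against the ntfy instance listed in the services table. ntfy accepts a plain HTTP POST to the topic URL with optional `Title` and `Priority` headers. The topic name `homelab-alerts` is hypothetical, and the sketch assumes the topic accepts unauthenticated publishes, which may not hold for this server.

```python
from urllib.request import Request, urlopen

NTFY_URL = "https://ntfy.vish.gg"  # from the services table in this document
TOPIC = "homelab-alerts"           # hypothetical topic name


def build_notification(message: str, title: str,
                       priority: str = "default",
                       topic: str = TOPIC) -> Request:
    """Build (but do not send) an ntfy publish request."""
    return Request(
        f"{NTFY_URL}/{topic}",
        data=message.encode("utf-8"),
        # ntfy priorities: min, low, default, high, urgent (alias: max).
        headers={"Title": title, "Priority": priority},
        method="POST",
    )


def send(request: Request) -> int:
    """Send the request and return the HTTP status. Needs network access."""
    with urlopen(request, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    req = build_notification(
        "Disk almost full on atlantis",
        "Critical: disk space",
        priority="urgent",
    )
    print(req.get_method(), req.full_url)
```

Building and sending are kept separate so the request can be inspected or logged before anything reaches a phone.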
## 📋 Maintenance Procedures

### Regular Tasks

1. **Daily**
   - Review Uptime Kuma service status
   - Check Prometheus metrics for anomalies
   - Verify Grafana dashboards display correctly

2. **Weekly**
   - Update dashboard panels if needed
   - Review and update alert thresholds
   - Validate alert routes are working properly

3. **Monthly**
   - Audit alert configurations
   - Test alert delivery mechanisms
   - Review Prometheus storage usage
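
One way to carry out the monthly delivery test is to post a short-lived synthetic alert to Alertmanager and confirm that it arrives on the expected channel. The sketch below uses the Alertmanager v2 API (`POST /api/v2/alerts`) at the address from the services table. The alert name `DeliveryTest` and its labels are hypothetical, and whether the alert is actually delivered depends on routing rules that this document does not describe.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request, urlopen

# Base URL taken from the services table in this document.
ALERTMANAGER_URL = "http://192.168.0.210:9093"


def build_test_alert(now=None, ttl_minutes: int = 5) -> list:
    """Build a synthetic alert that expires after ttl_minutes."""
    now = now or datetime.now(timezone.utc)
    return [{
        # Hypothetical labels; adjust to match a route in alertmanager.yml.
        "labels": {"alertname": "DeliveryTest", "severity": "info"},
        "annotations": {"summary": "Monthly alert delivery test"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]


def post_alerts(alerts: list, base_url: str = ALERTMANAGER_URL) -> int:
    """POST alerts to Alertmanager. Needs network access to the host."""
    req = Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    print(json.dumps(build_test_alert(), indent=2))
```

Setting `endsAt` a few minutes ahead lets the test alert resolve on its own, so nothing has to be cleaned up afterwards.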
## 📚 Related Documentation

- [Image Update Guide](IMAGE_UPDATE_GUIDE.md): Renovate, DIUN, Watchtower
- [Ansible Playbook Guide](ANSIBLE_PLAYBOOK_GUIDE.md): `health_check.yml`, `service_status.yml`
- [Backup Strategy](../infrastructure/backup-strategy.md): backup monitoring
- [Offline & Remote Access](../infrastructure/offline-and-remote-access.md): accessing monitoring when internet is down
- [Disaster Recovery Procedures](disaster-recovery.md)
- [Security Hardening](security-hardening.md)

---

*Last updated: 2026*