Sanitized mirror from private repository - 2026-04-05 10:08:22 UTC

2026-04-05 10:08:22 +00:00
commit 0067767ff4
1394 changed files with 355699 additions and 0 deletions
--- a/docs/admin/monitoring-setup.md
+++ b/docs/admin/monitoring-setup.md
@@ -0,0 +1,130 @@
+# 📊 Monitoring and Alerting Setup
+
+This document describes the monitoring and alerting stack for the homelab: which services are deployed, how they are configured, and the routine procedures for operating them.
+
+## 🧰 Monitoring Stack Overview
+
+### Services Deployed
+- **Grafana** (v12.4.0): Visualization and dashboarding
+- **Prometheus**: Metrics collection and storage
+- **Node Exporter**: Host-level metrics
+- **SNMP Exporter**: Synology NAS metrics collection
+
+### Architecture
+```
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│    Services     │───▶│   Prometheus    │───▶│     Grafana     │
+│  (containers)   │    │   (scraping)    │    │    (visual)     │
+└─────────────────┘    └─────────────────┘    └─────────────────┘
+         │                      │                      │
+         ▼                      ▼                      ▼
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│      Hosts      │    │    Exporters    │    │   Dashboards    │
+│ (node_exporter) │    │ (snmp_exporter) │    │  (Grafana UI)   │
+└─────────────────┘    └─────────────────┘    └─────────────────┘
+```
+
+## 🔧 Current Configuration
+
+### Active Monitoring Services
+| Service | Host | Port | URL | Purpose |
+|---------|------|------|-----|---------|
+| **Grafana** | Homelab VM | 3300 | `https://gf.vish.gg` | Dashboards & visualization |
+| **Prometheus** | Homelab VM | 9090 | `http://192.168.0.210:9090` | Metrics collection & storage |
+| **Alertmanager** | Homelab VM | 9093 | `http://192.168.0.210:9093` | Alert routing & dedup |
+| **ntfy** | Homelab VM | 8081 | `https://ntfy.vish.gg` | Push notifications |
+| **Uptime Kuma** | RPi 5 | 3001 | `http://192.168.0.66:3001` or `https://kuma.vish.gg` | Uptime monitoring (97 monitors) |
+| **DIUN** | Atlantis | — | ntfy topic `diun` | Docker image update detection |
+| **Scrutiny** | Multiple | 8090 | `http://192.168.0.210:8090` | SMART disk health |
+
+### Prometheus Targets (14 active)
+| Job | Target | Type | Status |
+|-----|--------|------|--------|
+| atlantis-node | atlantis | node_exporter | Up |
+| atlantis-snmp | atlantis | SNMP exporter | Up |
+| calypso-node | calypso | node_exporter | Up |
+| calypso-snmp | calypso | SNMP exporter | Up |
+| concord-nuc-node | concord-nuc | node_exporter | Up |
+| homelab-node | homelab-vm | node_exporter | Up |
+| node_exporter | homelab-vm | node_exporter (self) | Up |
+| prometheus | localhost:9090 | self-scrape | Up |
+| proxmox-node | proxmox | node_exporter | Up |
+| raspberry-pis | pi-5 | node_exporter | Up |
+| seattle-node | seattle | node_exporter | Up |
+| setillo-node | setillo | node_exporter | Up |
+| setillo-snmp | setillo | SNMP exporter | Up |
+| truenas-node | guava | node_exporter | Up |
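+
+The status column above can be cross-checked against Prometheus itself. The following is a minimal sketch, assuming the standard response shape of the Prometheus `/api/v1/targets` HTTP endpoint; the helper name `unhealthy_targets`, the sample payload, and the instance values in it are illustrative and not taken from the deployed configuration.
+
+```python
+from typing import Any
+
+
+def unhealthy_targets(payload: dict[str, Any]) -> list[tuple[str, str]]:
+    """Return (job, instance) pairs for active targets whose health is not "up".
+
+    `payload` is the decoded JSON body of GET /api/v1/targets.
+    """
+    active = payload.get("data", {}).get("activeTargets", [])
+    return [
+        (t["labels"].get("job", "?"), t["labels"].get("instance", "?"))
+        for t in active
+        if t.get("health") != "up"
+    ]
+
+
+# Hypothetical sample payload, trimmed to the fields used above.
+sample = {
+    "status": "success",
+    "data": {
+        "activeTargets": [
+            {"labels": {"job": "atlantis-node", "instance": "atlantis:9100"}, "health": "up"},
+            {"labels": {"job": "setillo-snmp", "instance": "setillo"}, "health": "down"},
+        ]
+    },
+}
+
+print(unhealthy_targets(sample))
+```
+
+In practice the payload would be fetched from `http://192.168.0.210:9090/api/v1/targets`; an empty result means every scraped target reports `up`.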
+
+## 📈 Key Metrics Monitored
+
+### System Resources
+- CPU utilization percentage
+- Memory usage and availability
+- Disk space and I/O operations
+- Network traffic and latency
+
+### Service Availability
+- HTTP response times (Uptime Kuma)
+- Container restart counts
+- Database connection status
+- Backup success rates
+
+### Network Health
+- Tailscale connectivity status
+- External service reachability
+- DNS resolution times
+- Cloudflare metrics
+
+## ⚠️ Alerting Strategy
+
+### Alert Levels
+1. **Critical (Immediate Action)**
+   - Service downtime (>5 min)
+   - System resource exhaustion (<10% free)
+   - Backup failures
+
+2. **Warning (Review Required)**
+   - High resource usage (>80%)
+   - Container restarts
+   - Slow response times
+
+3. **Info (Monitoring Only)**
+   - New service deployments
+   - Configuration changes
+   - Routine maintenance
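+
+The thresholds above can be expressed as a small decision function. This is only a sketch of the stated rules, not the deployed Alertmanager configuration: the function name `classify` and its parameters are hypothetical, and "slow response times" is left out because the list gives no numeric threshold for it.
+
+```python
+def classify(downtime_min: float = 0.0, free_pct: float = 100.0,
+             usage_pct: float = 0.0, backup_failed: bool = False,
+             restarts: int = 0) -> str:
+    """Map raw observations to the alert levels listed above."""
+    # Critical: downtime over 5 minutes, under 10% free, or a failed backup.
+    if downtime_min > 5 or free_pct < 10 or backup_failed:
+        return "critical"
+    # Warning: resource usage over 80%, or any container restart.
+    if usage_pct > 80 or restarts > 0:
+        return "warning"
+    # Everything else is informational.
+    return "info"
+
+
+print(classify(downtime_min=6))   # downtime exceeds the 5 minute limit
+print(classify(usage_pct=85))     # above the 80% usage threshold
+print(classify())                 # nothing abnormal observed
+```
+
+Checking the critical conditions first means an observation that satisfies both levels is reported at the higher one.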
+
+### Alert Channels
+- ntfy notifications for critical issues
+- Email alerts to administrators
+- Slack integration for team communication
+- Uptime Kuma dashboard for service status
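+
+For the ntfy channel, a notification is an HTTP POST to the topic URL, with the message as the body and optional `Title` and `Priority` headers. The sketch below only builds the request and does not send it. The topic name `homelab-alerts` and the mapping from alert level to ntfy priority are assumptions made for illustration; only the base URL comes from the services table.
+
+```python
+import urllib.request
+
+# Assumed mapping from this document's alert levels to ntfy priority names.
+NTFY_PRIORITY = {"critical": "urgent", "warning": "default", "info": "low"}
+
+
+def build_ntfy_request(base_url: str, topic: str, title: str,
+                       message: str, level: str) -> urllib.request.Request:
+    """Build (but do not send) a POST request that publishes to an ntfy topic."""
+    return urllib.request.Request(
+        url=f"{base_url.rstrip('/')}/{topic}",
+        data=message.encode("utf-8"),
+        headers={"Title": title, "Priority": NTFY_PRIORITY[level]},
+        method="POST",
+    )
+
+
+req = build_ntfy_request(
+    "https://ntfy.vish.gg",   # base URL from the services table
+    "homelab-alerts",         # hypothetical topic name
+    "Backup failed",
+    "Nightly backup job did not complete.",
+    "critical",
+)
+print(req.get_method(), req.full_url)
+```
+
+Passing `req` to `urllib.request.urlopen` would deliver the notification; keeping construction separate from sending makes the request easy to inspect in a dry run.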
+
+## 📋 Maintenance Procedures
+
+### Regular Tasks
+1. **Daily**
+   - Review Uptime Kuma service status
+   - Check Prometheus metrics for anomalies
+   - Verify Grafana dashboards display correctly
+
+2. **Weekly**
+   - Update dashboard panels if needed
+   - Review and update alert thresholds
+   - Validate alert routes are working properly
+
+3. **Monthly**
+   - Audit alert configurations
+   - Test alert delivery mechanisms
+   - Review Prometheus storage usage
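+
+For the monthly storage review, the Prometheus documentation gives a rough sizing rule: needed disk space is retention time in seconds times ingested samples per second times bytes per sample, with an average of roughly 1 to 2 bytes per sample. The sketch below applies that rule. The function name and the input values are hypothetical; the real ingestion rate can be read from `rate(prometheus_tsdb_head_samples_appended_total[2h])`.
+
+```python
+def estimate_tsdb_bytes(retention_days: float, samples_per_second: float,
+                        bytes_per_sample: float = 2.0) -> float:
+    """Estimate TSDB size as retention_seconds * samples_per_second * bytes_per_sample."""
+    return retention_days * 86_400 * samples_per_second * bytes_per_sample
+
+
+# Hypothetical inputs: 15 days of retention at 1000 samples per second.
+estimate = estimate_tsdb_bytes(retention_days=15, samples_per_second=1000)
+print(f"{estimate / 1e9:.2f} GB")
+```
+
+Comparing this estimate with the actual size of the Prometheus data directory shows whether retention or scrape intervals need adjusting.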
+
+## 📚 Related Documentation
+
+- [Image Update Guide](IMAGE_UPDATE_GUIDE.md) — Renovate, DIUN, Watchtower
+- [Ansible Playbook Guide](ANSIBLE_PLAYBOOK_GUIDE.md) — `health_check.yml`, `service_status.yml`
+- [Backup Strategy](../infrastructure/backup-strategy.md) — backup monitoring
+- [Offline & Remote Access](../infrastructure/offline-and-remote-access.md) — accessing monitoring when internet is down
+- [Disaster Recovery Procedures](disaster-recovery.md)
+- [Security Hardening](security-hardening.md)
+
+---
+*Last updated: 2026*