# 📊 Monitoring and Alerting Setup

This document describes the homelab's monitoring and alerting infrastructure: the services deployed, their current configuration, the alerting strategy, and routine maintenance procedures.

## 🧰 Monitoring Stack Overview

### Services Deployed
- **Grafana** (v12.4.0): Visualization and dashboarding
- **Prometheus**: Metrics collection and storage
- **Node Exporter**: Host-level metrics
- **SNMP Exporter**: Synology NAS metrics collection

### Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Services     │───▶│   Prometheus    │───▶│     Grafana     │
│  (containers)   │    │   (scraping)    │    │    (visual)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      Hosts      │    │    Exporters    │    │   Dashboards    │
│ (node_exporter) │    │ (snmp_exporter) │    │  (Grafana UI)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

## 🔧 Current Configuration

### Active Monitoring Services
| Service | Host | Port | URL | Purpose |
|---------|------|------|-----|---------|
| **Grafana** | Homelab VM | 3300 | `https://gf.vish.gg` | Dashboards & visualization |
| **Prometheus** | Homelab VM | 9090 | `http://192.168.0.210:9090` | Metrics collection & storage |
| **Alertmanager** | Homelab VM | 9093 | `http://192.168.0.210:9093` | Alert routing & dedup |
| **ntfy** | Homelab VM | 8081 | `https://ntfy.vish.gg` | Push notifications |
| **Uptime Kuma** | RPi 5 | 3001 | `http://192.168.0.66:3001` or `https://kuma.vish.gg` | Uptime monitoring (97 monitors) |
| **DIUN** | Atlantis | — | ntfy topic `diun` | Docker image update detection |
| **Scrutiny** | Multiple | 8090 | `http://192.168.0.210:8090` | SMART disk health |

### Prometheus Targets (14 active)
| Job | Target | Type | Status |
|-----|--------|------|--------|
| atlantis-node | atlantis | node_exporter | Up |
| atlantis-snmp | atlantis | SNMP exporter | Up |
| calypso-node | calypso | node_exporter | Up |
| calypso-snmp | calypso | SNMP exporter | Up |
| concord-nuc-node | concord-nuc | node_exporter | Up |
| homelab-node | homelab-vm | node_exporter | Up |
| node_exporter | homelab-vm | node_exporter (self) | Up |
| prometheus | localhost:9090 | self-scrape | Up |
| proxmox-node | proxmox | node_exporter | Up |
| raspberry-pis | pi-5 | node_exporter | Up |
| seattle-node | seattle | node_exporter | Up |
| setillo-node | setillo | node_exporter | Up |
| setillo-snmp | setillo | SNMP exporter | Up |
| truenas-node | guava | node_exporter | Up |
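
The status column above can be checked against the Prometheus HTTP API rather than by eye. The following is a minimal sketch, assuming the Prometheus address from the services table and the standard `/api/v1/targets` endpoint; the function names `down_targets` and `fetch_targets` are illustrative and not part of the existing setup.

```python
import json
from urllib.request import urlopen

# Address taken from the services table above; adjust if Prometheus moves.
PROMETHEUS_URL = "http://192.168.0.210:9090"


def down_targets(payload):
    """Return sorted (job, instance) pairs for active targets whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return sorted(
        (t.get("labels", {}).get("job", "?"), t.get("labels", {}).get("instance", "?"))
        for t in targets
        if t.get("health") != "up"
    )


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5.0):
    """Fetch the target listing from Prometheus (performs a network call)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `down_targets(fetch_targets())` from a host that can reach Prometheus should return an empty list when all 14 targets are healthy.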

## 📈 Key Metrics Monitored

### System Resources
- CPU utilization percentage
- Memory usage and availability
- Disk space and I/O operations
- Network traffic and latency
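
As an illustration of how the CPU, memory, and disk figures might be derived from node_exporter data, the sketch below builds instant-query URLs for the Prometheus HTTP API. The PromQL expressions use standard Linux node_exporter metric names, but the exact expressions, the `fstype` filter, and the `QUERIES` and `query_url` names are assumptions, not the queries used by the existing dashboards. Metric names may differ on non-Linux hosts.

```python
from urllib.parse import urlencode

# Address taken from the services table above.
PROMETHEUS_URL = "http://192.168.0.210:9090"

# Assumed PromQL expressions; each yields a used-percentage per instance.
QUERIES = {
    "cpu_used_percent": (
        '100 - avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'
    ),
    "memory_used_percent": (
        "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
    ),
    "disk_used_percent": (
        '(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} '
        '/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100'
    ),
}


def query_url(name, base_url=PROMETHEUS_URL):
    """Build the /api/v1/query URL for one of the named expressions."""
    params = urlencode({"query": QUERIES[name]})
    return f"{base_url}/api/v1/query?{params}"
```

The resulting URL can be opened with `curl` or `urllib.request.urlopen`; the response is JSON with one sample per matching series.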

### Service Availability
- HTTP response times (Uptime Kuma)
- Container restart counts
- Database connection status
- Backup success rates

### Network Health
- Tailscale connectivity status
- External service reachability
- DNS resolution times
- Cloudflare metrics

## ⚠️ Alerting Strategy

### Alert Levels
1. **Critical (Immediate Action)**
   - Service downtime (>5 min)
   - System resource exhaustion (<10% free)
   - Backup failures

2. **Warning (Review Required)**
   - High resource usage (>80%)
   - Container restarts
   - Slow response times

3. **Info (Monitoring Only)**
   - New service deployments
   - Configuration changes
   - Routine maintenance
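
The numeric thresholds above can be expressed as a small decision function. This is a sketch only: the function name `classify` and its parameters are hypothetical, and it covers just the conditions that have a stated threshold. Slow response times have no threshold in this document and are left out, and info-level events are not threshold-based, so a healthy reading returns `"ok"`.

```python
def classify(downtime_minutes=0.0, free_percent=100.0, backup_failed=False, restarts=0):
    """Map a set of readings to an alert level using the thresholds listed above.

    free_percent is the percentage of a resource (disk, memory) still free.
    """
    # Critical: downtime over 5 minutes, under 10% free, or a failed backup.
    if downtime_minutes > 5 or free_percent < 10 or backup_failed:
        return "critical"
    # Warning: more than 80% of the resource in use, or any container restart.
    if (100 - free_percent) > 80 or restarts > 0:
        return "warning"
    return "ok"
```

For example, a host with 15% disk free is at 85% usage, which lands in the warning band, while 9% free crosses into critical.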

### Alert Channels
- ntfy notifications for critical issues
- Email alerts to administrators
- Slack integration for team communication
- Uptime Kuma dashboard for service status
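
For the ntfy channel, a notification is an HTTP POST to a topic URL, with the message as the body and optional `Title` and `Priority` headers. The sketch below builds such a request against the ntfy instance from the services table. The topic name `homelab-alerts` and the severity-to-priority mapping are assumptions for illustration; this document only states that ntfy is used for critical issues.

```python
from urllib.request import Request, urlopen

# Server URL from the services table above.
NTFY_URL = "https://ntfy.vish.gg"
TOPIC = "homelab-alerts"  # hypothetical topic name

# Assumed mapping from alert level to ntfy priority names.
PRIORITY = {"critical": "urgent", "warning": "default", "info": "low"}


def build_notification(severity, title, message, base_url=NTFY_URL, topic=TOPIC):
    """Build (but do not send) an ntfy publish request."""
    return Request(
        f"{base_url}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": PRIORITY[severity]},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send a prepared request and return the HTTP status (performs a network call)."""
    with urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Separating request construction from sending keeps the payload easy to inspect before anything reaches a phone. If the ntfy instance requires authentication, an `Authorization` header would need to be added.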

## 📋 Maintenance Procedures

### Regular Tasks
1. **Daily**
   - Review Uptime Kuma service status
   - Check Prometheus metrics for anomalies
   - Verify Grafana dashboards display correctly

2. **Weekly**
   - Update dashboard panels if needed
   - Review and update alert thresholds
   - Validate alert routes are working properly

3. **Monthly**
   - Audit alert configurations
   - Test alert delivery mechanisms
   - Review Prometheus storage usage
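
One way to approach the monthly "test alert delivery" task is to post a short-lived synthetic alert to Alertmanager and confirm it arrives on each channel. The sketch below builds that request for the Alertmanager v2 API at the address from the services table. The alert name `DeliveryTest`, its labels, and the five-minute lifetime are assumptions; whether the alert is actually delivered depends on the routing rules in the Alertmanager configuration, which are not shown in this document.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request

# Address taken from the services table above.
ALERTMANAGER_URL = "http://192.168.0.210:9093"


def build_test_alert(now=None, ttl_minutes=5, base_url=ALERTMANAGER_URL):
    """Build (but do not send) a POST that creates a synthetic, self-expiring alert."""
    now = now or datetime.now(timezone.utc)
    alert = {
        # Hypothetical labels; match them to a route that should fire.
        "labels": {"alertname": "DeliveryTest", "severity": "info"},
        "annotations": {"summary": "Monthly alert delivery test"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }
    return Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps([alert]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` submits the alert. Because `endsAt` is set, the alert should resolve on its own after the chosen lifetime.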

## 📚 Related Documentation

- [Image Update Guide](IMAGE_UPDATE_GUIDE.md) — Renovate, DIUN, Watchtower
- [Ansible Playbook Guide](ANSIBLE_PLAYBOOK_GUIDE.md) — `health_check.yml`, `service_status.yml`
- [Backup Strategy](../infrastructure/backup-strategy.md) — backup monitoring
- [Offline & Remote Access](../infrastructure/offline-and-remote-access.md) — accessing monitoring when internet is down
- [Disaster Recovery Procedures](disaster-recovery.md)
- [Security Hardening](security-hardening.md)

---
*Last updated: 2026*