# 📊 Monitoring Infrastructure

*Docker-based monitoring stack for comprehensive homelab observability*

## Overview

This directory contains the Docker-based monitoring infrastructure that provides comprehensive observability across the entire homelab environment.

## Architecture

### Core Components

- **Prometheus** - Metrics collection and storage
- **Grafana** - Visualization and dashboards
- **AlertManager** - Alert routing and management
- **Node Exporter** - System metrics collection
- **cAdvisor** - Container metrics collection

### Deployment Structure

```
monitoring/
├── prometheus/
│   ├── prometheus.yml          # Main configuration
│   ├── alert-rules.yml         # Alert definitions
│   └── targets/                # Service discovery configs
├── grafana/
│   ├── provisioning/           # Dashboard and datasource configs
│   └── dashboards/             # JSON dashboard definitions
├── alertmanager/
│   └── alertmanager.yml        # Alert routing configuration
└── docker-compose.yml          # Complete monitoring stack
```

## Service Endpoints

### Internal Access

- **Prometheus**: `http://prometheus:9090`
- **Grafana**: `http://grafana:3000`
- **AlertManager**: `http://alertmanager:9093`

### External Access (via Nginx Proxy Manager)

- **Grafana**: `https://grafana.vish.gg`
- **Prometheus**: `https://prometheus.vish.gg` (admin only)
- **AlertManager**: `https://alerts.vish.gg` (admin only)

## Metrics Collection

### System Metrics

- **Node Exporter**: CPU, memory, disk, network statistics
- **SNMP Exporter**: Network equipment monitoring
- **Blackbox Exporter**: Service availability checks

### Container Metrics

- **cAdvisor**: Docker container resource usage
- **Portainer metrics**: Container orchestration metrics
- **Docker daemon metrics**: Docker engine statistics

### Application Metrics

- **Plex**: Media server performance metrics
- **Nginx**: Web server access and performance
- **Database metrics**: PostgreSQL, Redis performance

### Custom Metrics

- **Backup status**: Success/failure rates
- **Storage usage**: Disk space across all hosts
- **Network performance**: Bandwidth and latency

## Dashboard Categories

### Infrastructure Dashboards

- **Host Overview**: System resource utilization
- **Network Performance**: Bandwidth and connectivity
- **Storage Monitoring**: Disk usage and health
- **Docker Containers**: Container resource usage

### Service Dashboards

- **Media Services**: Plex, Arr suite performance
- **Web Services**: Nginx, application response times
- **Database Performance**: Query performance and connections
- **Backup Monitoring**: Backup job status and trends

### Security Dashboards

- **Authentication Events**: Login attempts and failures
- **Network Security**: Firewall logs and intrusion attempts
- **Certificate Monitoring**: SSL certificate expiration
- **Vulnerability Scanning**: Security scan results

## Alert Configuration

### Critical Alerts

- **Host down**: System unreachable
- **High resource usage**: CPU/Memory > 90%
- **Disk space critical**: < 10% free space
- **Service unavailable**: Key services down

### Warning Alerts

- **High resource usage**: CPU/Memory > 80%
- **Disk space low**: < 20% free space
- **Certificate expiring**: < 30 days to expiration
- **Backup failures**: Failed backup jobs

### Info Alerts

- **System updates**: Available updates
- **Maintenance windows**: Scheduled maintenance
- **Performance trends**: Unusual patterns
- **Capacity planning**: Resource growth trends

## Data Retention

### Prometheus Retention

- **Raw metrics**: 15 days high resolution
- **Downsampled**: 90 days medium resolution
- **Long-term**: 1 year low resolution

### Grafana Data

- **Dashboards**: Version controlled in Git
- **User preferences**: Backed up weekly
- **Annotations**: Retained for 1 year

### Log Retention

- **Application logs**: 30 days
- **System logs**: 90 days
- **Audit logs**: 1 year
- **Security logs**: 2 years

## Backup and Recovery

The commands below write to `/backup` inside each container, so they assume a backup volume is mounted at that path.

### Configuration Backup

```bash
# Backup Prometheus configuration
docker exec prometheus tar -czf /backup/prometheus-config-$(date +%Y%m%d).tar.gz /etc/prometheus/

# Backup the Grafana data directory (includes dashboards saved in grafana.db)
docker exec grafana tar -czf /backup/grafana-dashboards-$(date +%Y%m%d).tar.gz /var/lib/grafana/
```

### Data Backup

```bash
# Backup Prometheus data
docker exec prometheus tar -czf /backup/prometheus-data-$(date +%Y%m%d).tar.gz /prometheus/

# Backup Grafana database (requires the sqlite3 CLI inside the grafana container)
docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"
```

### Disaster Recovery

1. **Restore configurations** from backup
2. **Redeploy containers** with restored configs
3. **Import historical data** if needed
4. **Verify alert routing** and dashboard functionality

## Performance Optimization

### Prometheus Optimization

- **Recording rules**: Pre-calculate expensive queries
- **Metric relabeling**: Reduce cardinality
- **Storage optimization**: Efficient time series storage
- **Query optimization**: Efficient PromQL queries

### Grafana Optimization

- **Dashboard caching**: Reduce query load
- **Panel optimization**: Efficient visualizations
- **User management**: Role-based access control
- **Plugin management**: Only necessary plugins

### Network Optimization

- **Local metrics**: Minimize network traffic
- **Compression**: Enable metric compression
- **Batching**: Batch metric collection
- **Filtering**: Collect only necessary metrics

## Troubleshooting

### Common Issues

#### High Memory Usage

```bash
# Check Prometheus memory usage
docker stats prometheus

# Reduce retention period
# This is a command-line flag; set it under `command:` for the
# prometheus service in docker-compose.yml:
#   --storage.tsdb.retention.time=7d
```

#### Missing Metrics

```bash
# Check target status
curl http://prometheus:9090/api/v1/targets

# List the metric names Prometheus has ingested
curl http://prometheus:9090/api/v1/label/__name__/values
```

#### Dashboard Loading Issues

```bash
# Check Grafana logs
docker logs grafana

# Verify datasource connectivity (URL quoted so the shell leaves "?" alone)
curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"
```

### Monitoring Health Checks

```bash
# Prometheus health
curl http://prometheus:9090/-/healthy

# Grafana health
curl http://grafana:3000/api/health

# AlertManager health
curl http://alertmanager:9093/-/healthy
```

## Security Configuration

### Authentication

- **Grafana**: OAuth integration with Authentik
- **Prometheus**: Basic auth via reverse proxy
- **AlertManager**: Basic auth via reverse proxy

### Network Security

- **Internal network**: Isolated Docker network
- **Reverse proxy**: Nginx Proxy Manager
- **SSL termination**: Let's Encrypt certificates
- **Access control**: IP-based restrictions

### Data Security

- **Encryption at rest**: Encrypted storage volumes
- **Encryption in transit**: TLS for all communications
- **Access logging**: Comprehensive audit trails
- **Regular updates**: Automated security updates

## Integration Points

### External Systems

- **NTFY**: Push notifications for alerts
- **Email**: Backup notification channel
- **Slack**: Team notifications (optional)
- **PagerDuty**: Escalation for critical alerts

### Automation

- **Ansible**: Configuration management
- **GitOps**: Version-controlled configurations
- **CI/CD**: Automated deployment pipeline
- **Backup automation**: Scheduled backups

## Future Enhancements

### Planned Features

- **Log aggregation**: Centralized log management
- **Distributed tracing**: Application tracing
- **Synthetic monitoring**: Proactive service testing
- **Machine learning**: Anomaly detection

### Scaling Considerations

- **High availability**: Multi-instance deployment
- **Load balancing**: Distribute query load
- **Federation**: Multi-cluster monitoring
- **Storage scaling**: Efficient long-term storage

---

**Status**: ✅ Comprehensive monitoring infrastructure operational across all homelab systems
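
## Appendix: Disk Space Threshold Sketch

The disk space thresholds under Alert Configuration (critical below 10% free, warning below 20% free) can be illustrated with a small shell helper. This is a minimal sketch, not the stack's actual alerting logic, which would normally live in `alert-rules.yml`. The function name `disk_severity` and the `unknown` and `ok` labels are hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): map a free-space
# percentage to the severity levels listed under "Alert Configuration".
disk_severity() {
    free_pct="$1"
    # Accept only whole numbers; anything else is reported as unknown.
    case "$free_pct" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ "$free_pct" -gt 100 ]; then
        echo "unknown"    # a percentage above 100 is not meaningful
        return 1
    elif [ "$free_pct" -lt 10 ]; then
        echo "critical"   # README threshold: < 10% free space
    elif [ "$free_pct" -lt 20 ]; then
        echo "warning"    # README threshold: < 20% free space
    else
        echo "ok"
    fi
}
```

For example, `disk_severity 7` prints `critical` and `disk_severity 15` prints `warning`. The comparisons are strict, so exactly 10% free is a warning and exactly 20% free is `ok`, matching the `<` signs in the alert lists.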