Sanitized mirror from private repository - 2026-04-05 13:06:07 UTC

2026-04-05 13:06:07 +00:00
commit da2060f709
1401 changed files with 358437 additions and 0 deletions
--- a/docs/infrastructure/docker/monitoring/README.md
+++ b/docs/infrastructure/docker/monitoring/README.md
@@ -0,0 +1,261 @@
+# 📊 Monitoring Infrastructure
+
+*Docker-based monitoring stack for comprehensive homelab observability*
+
+## Overview
+This directory contains the Docker Compose stack that collects metrics, renders dashboards, and routes alerts for the hosts, containers, and services in the homelab.
+
+## Architecture
+
+### Core Components
+- **Prometheus** - Metrics collection and storage
+- **Grafana** - Visualization and dashboards
+- **AlertManager** - Alert routing and management
+- **Node Exporter** - System metrics collection
+- **cAdvisor** - Container metrics collection
+
+### Deployment Structure
+```
+monitoring/
+├── prometheus/
+│   ├── prometheus.yml     # Main configuration
+│   ├── alert-rules.yml    # Alert definitions
+│   └── targets/           # Service discovery configs
+├── grafana/
+│   ├── provisioning/      # Dashboard and datasource configs
+│   └── dashboards/        # JSON dashboard definitions
+├── alertmanager/
+│   └── alertmanager.yml   # Alert routing configuration
+└── docker-compose.yml     # Complete monitoring stack
+```
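+
+As a rough illustration, the layout above could map onto a `docker-compose.yml` along these lines. This is a minimal sketch, not the actual file: the image tags, the `monitoring` network name, the named volumes, and the host mounts for the exporters are all assumptions. The container paths `/etc/prometheus`, `/prometheus`, and `/var/lib/grafana` match the ones used by the backup commands later in this README.
+
+```yaml
+# docker-compose.yml (sketch; tags, network, and volume names are assumptions)
+services:
+  prometheus:
+    image: prom/prometheus:latest
+    command:
+      - --config.file=/etc/prometheus/prometheus.yml
+      - --storage.tsdb.path=/prometheus
+    volumes:
+      - ./prometheus:/etc/prometheus:ro
+      - prometheus-data:/prometheus
+    networks: [monitoring]
+    restart: unless-stopped
+
+  grafana:
+    image: grafana/grafana:latest
+    volumes:
+      - ./grafana/provisioning:/etc/grafana/provisioning:ro
+      - ./grafana/dashboards:/etc/grafana/dashboards:ro
+      - grafana-data:/var/lib/grafana
+    networks: [monitoring]
+    restart: unless-stopped
+
+  alertmanager:
+    image: prom/alertmanager:latest
+    volumes:
+      # The image's default command reads /etc/alertmanager/alertmanager.yml
+      - ./alertmanager:/etc/alertmanager:ro
+    networks: [monitoring]
+    restart: unless-stopped
+
+  node-exporter:
+    image: prom/node-exporter:latest
+    command:
+      - --path.rootfs=/host
+    volumes:
+      - /:/host:ro,rslave
+    pid: host
+    networks: [monitoring]
+    restart: unless-stopped
+
+  cadvisor:
+    image: gcr.io/cadvisor/cadvisor:latest
+    volumes:
+      - /:/rootfs:ro
+      - /var/run:/var/run:ro
+      - /sys:/sys:ro
+      - /var/lib/docker/:/var/lib/docker:ro
+    networks: [monitoring]
+    restart: unless-stopped
+
+networks:
+  monitoring:
+
+volumes:
+  prometheus-data:
+  grafana-data:
+```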
+
+## Service Endpoints
+
+### Internal Access
+- **Prometheus**: `http://prometheus:9090`
+- **Grafana**: `http://grafana:3000`
+- **AlertManager**: `http://alertmanager:9093`
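+
+The internal Prometheus address is what Grafana would typically use as its datasource URL. A provisioning file under `grafana/provisioning/` might look like the sketch below; the file name and the datasource name are assumptions.
+
+```yaml
+# grafana/provisioning/datasources/prometheus.yml (hypothetical file name)
+apiVersion: 1
+
+datasources:
+  - name: Prometheus          # assumed display name
+    type: prometheus
+    access: proxy             # Grafana's backend makes the requests
+    url: http://prometheus:9090
+    isDefault: true
+```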
+
+### External Access (via Nginx Proxy Manager)
+- **Grafana**: `https://grafana.vish.gg`
+- **Prometheus**: `https://prometheus.vish.gg` (admin only)
+- **AlertManager**: `https://alerts.vish.gg` (admin only)
+
+## Metrics Collection
+
+### System Metrics
+- **Node Exporter**: CPU, memory, disk, network statistics
+- **SNMP Exporter**: Network equipment monitoring
+- **Blackbox Exporter**: Service availability checks
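+
+A `prometheus.yml` fragment for the first and last of these exporters could look roughly like this. It is a sketch under assumptions: the job names, the `node-*.yml` file pattern inside `targets/`, and the `blackbox-exporter:9115` service name are illustrative, and `http_2xx` is the module name from the Blackbox Exporter's example configuration.
+
+```yaml
+# prometheus.yml fragment (sketch; job names and target files are assumptions)
+scrape_configs:
+  # Node Exporter hosts discovered from files in the targets/ directory
+  - job_name: node
+    file_sd_configs:
+      - files:
+          - /etc/prometheus/targets/node-*.yml   # hypothetical file pattern
+
+  # Blackbox Exporter: probe a URL and report whether it answered
+  - job_name: blackbox-http
+    metrics_path: /probe
+    params:
+      module: [http_2xx]
+    static_configs:
+      - targets:
+          - https://grafana.vish.gg
+    relabel_configs:
+      # Pass the listed URL to the exporter as the ?target= parameter
+      - source_labels: [__address__]
+        target_label: __param_target
+      # Keep the probed URL as the instance label
+      - source_labels: [__param_target]
+        target_label: instance
+      # Send the scrape itself to the exporter (assumed service name)
+      - target_label: __address__
+        replacement: blackbox-exporter:9115
+```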
+
+### Container Metrics
+- **cAdvisor**: Docker container resource usage
+- **Portainer metrics**: Container orchestration metrics
+- **Docker daemon metrics**: Docker engine statistics
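+
+For the container side, the corresponding scrape jobs might be sketched as follows. The `cadvisor:8080` target assumes the service name and cAdvisor's default port. The Docker engine job assumes the daemon has `metrics-addr` set in `daemon.json`, and `docker-host:9323` is a placeholder address, not a value taken from this repository.
+
+```yaml
+# prometheus.yml fragment (sketch; targets are assumptions)
+scrape_configs:
+  - job_name: cadvisor
+    static_configs:
+      - targets: ['cadvisor:8080']
+
+  - job_name: docker-engine
+    static_configs:
+      - targets: ['docker-host:9323']   # placeholder; requires metrics-addr in daemon.json
+```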
+
+### Application Metrics
+- **Plex**: Media server performance metrics
+- **Nginx**: Web server access and performance
+- **Database metrics**: PostgreSQL, Redis performance
+
+### Custom Metrics
+- **Backup status**: Success/failure rates
+- **Storage usage**: Disk space across all hosts
+- **Network performance**: Bandwidth and latency
+
+## Dashboard Categories
+
+### Infrastructure Dashboards
+- **Host Overview**: System resource utilization
+- **Network Performance**: Bandwidth and connectivity
+- **Storage Monitoring**: Disk usage and health
+- **Docker Containers**: Container resource usage
+
+### Service Dashboards
+- **Media Services**: Plex, Arr suite performance
+- **Web Services**: Nginx, application response times
+- **Database Performance**: Query performance and connections
+- **Backup Monitoring**: Backup job status and trends
+
+### Security Dashboards
+- **Authentication Events**: Login attempts and failures
+- **Network Security**: Firewall logs and intrusion attempts
+- **Certificate Monitoring**: SSL certificate expiration
+- **Vulnerability Scanning**: Security scan results
+
+## Alert Configuration
+
+### Critical Alerts
+- **Host down**: System unreachable
+- **High resource usage**: CPU/Memory > 90%
+- **Disk space critical**: < 10% free space
+- **Service unavailable**: Key services down
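+
+The thresholds above could be expressed in `alert-rules.yml` roughly as shown below. This is a sketch, not the deployed rule set: the alert names, the `for` durations, the `job="node"` selector, and the filesystem filter are assumptions, and the expressions rely on standard Node Exporter metrics.
+
+```yaml
+# alert-rules.yml fragment (sketch; names and durations are assumptions)
+groups:
+  - name: critical
+    rules:
+      - alert: HostDown
+        expr: up{job="node"} == 0
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "{{ $labels.instance }} is unreachable"
+
+      - alert: CpuCritical
+        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
+        for: 10m
+        labels:
+          severity: critical
+
+      - alert: MemoryCritical
+        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
+        for: 10m
+        labels:
+          severity: critical
+
+      - alert: DiskSpaceCritical
+        expr: 100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 10
+        for: 5m
+        labels:
+          severity: critical
+```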
+
+### Warning Alerts
+- **High resource usage**: CPU/Memory > 80%
+- **Disk space low**: < 20% free space
+- **Certificate expiring**: < 30 days to expiration
+- **Backup failures**: Failed backup jobs
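+
+Two of the warning thresholds might translate into rules like the following sketch. The certificate rule assumes the endpoints are probed through the Blackbox Exporter, which exposes `probe_ssl_earliest_cert_expiry`; the alert names and `for` durations are illustrative.
+
+```yaml
+# alert-rules.yml fragment (sketch; names and durations are assumptions)
+groups:
+  - name: warning
+    rules:
+      - alert: DiskSpaceLow
+        expr: 100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 20
+        for: 15m
+        labels:
+          severity: warning
+
+      - alert: CertificateExpiring
+        # Seconds until expiry, compared against 30 days
+        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
+        for: 1h
+        labels:
+          severity: warning
+        annotations:
+          summary: "Certificate for {{ $labels.instance }} expires in under 30 days"
+```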
+
+### Info Alerts
+- **System updates**: Available updates
+- **Maintenance windows**: Scheduled maintenance
+- **Performance trends**: Unusual patterns
+- **Capacity planning**: Resource growth trends
+
+## Data Retention
+
+### Prometheus Retention
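+- **Raw metrics**: 15 days high resolution
+- **Downsampled**: 90 days medium resolution
+- **Long-term**: 1 year low resolution
+
+Prometheus's local TSDB keeps every sample at full resolution for a single retention window and does not downsample on its own, so the 90-day and 1-year tiers need an additional layer such as recording rules or a remote long-term store.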
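+
+The 15-day raw window corresponds to Prometheus's `--storage.tsdb.retention.time` flag, which is passed on the command line. In Compose that might look like the sketch below; the other two flags are repeated because overriding `command` replaces the image's default arguments.
+
+```yaml
+# docker-compose.yml fragment (sketch)
+services:
+  prometheus:
+    command:
+      - --config.file=/etc/prometheus/prometheus.yml
+      - --storage.tsdb.path=/prometheus
+      - --storage.tsdb.retention.time=15d
+```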
+
+### Grafana Data
+- **Dashboards**: Version controlled in Git
+- **User preferences**: Backed up weekly
+- **Annotations**: Retained for 1 year
+
+### Log Retention
+- **Application logs**: 30 days
+- **System logs**: 90 days
+- **Audit logs**: 1 year
+- **Security logs**: 2 years
+
+## Backup and Recovery
+
+### Configuration Backup
+```bash
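+# Back up the Prometheus configuration.
+# Assumes /backup is a volume mounted into the container; otherwise the archive
+# is written to the container's own filesystem and is lost when it is recreated.
+docker exec prometheus tar -czf /backup/prometheus-config-$(date +%Y%m%d).tar.gz /etc/prometheus/
+
+# Back up Grafana dashboards. This archives the whole /var/lib/grafana data
+# directory, including grafana.db, not only the dashboards.
+docker exec grafana tar -czf /backup/grafana-dashboards-$(date +%Y%m%d).tar.gz /var/lib/grafana/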
+```
+
+### Data Backup
+```bash
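+# Back up Prometheus data.
+# Archiving /prometheus while Prometheus is writing can capture an inconsistent TSDB.
+# For a consistent copy, take a snapshot through the TSDB admin API
+# (requires --web.enable-admin-api) and archive the snapshot instead.
+docker exec prometheus tar -czf /backup/prometheus-data-$(date +%Y%m%d).tar.gz /prometheus/
+
+# Back up the Grafana database.
+# Assumes the sqlite3 CLI is installed in the container; the stock Grafana image may not include it.
+docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"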
+```
+
+### Disaster Recovery
+1. **Restore configurations** from backup
+2. **Redeploy containers** with restored configs
+3. **Import historical data** if needed
+4. **Verify alert routing** and dashboard functionality
+
+## Performance Optimization
+
+### Prometheus Optimization
+- **Recording rules**: Pre-calculate expensive queries
+- **Metric relabeling**: Reduce cardinality
+- **Storage optimization**: Efficient time series storage
+- **Query optimization**: Efficient PromQL queries
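+
+The first two items can be illustrated with a short sketch. The recording rule pre-computes per-host CPU utilisation so dashboards query one cheap series, and the `metric_relabel_configs` entry drops series before they are stored. The rule file name, the rule name, and the choice of dropped cAdvisor metrics are assumptions made for the example.
+
+```yaml
+# prometheus/recording-rules.yml (hypothetical file; list it under rule_files in prometheus.yml)
+groups:
+  - name: node-recording
+    interval: 1m
+    rules:
+      - record: instance:node_cpu_utilisation:rate5m
+        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
+---
+# prometheus.yml fragment: drop unneeded series at scrape time
+scrape_configs:
+  - job_name: cadvisor
+    static_configs:
+      - targets: ['cadvisor:8080']
+    metric_relabel_configs:
+      - source_labels: [__name__]
+        regex: container_tasks_state|container_memory_failures_total
+        action: drop
+```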
+
+### Grafana Optimization
+- **Dashboard caching**: Reduce query load
+- **Panel optimization**: Efficient visualizations
+- **User management**: Role-based access control
+- **Plugin management**: Only necessary plugins
+
+### Network Optimization
+- **Local metrics**: Minimize network traffic
+- **Compression**: Enable metric compression
+- **Batching**: Batch metric collection
+- **Filtering**: Collect only necessary metrics
+
+## Troubleshooting
+
+### Common Issues
+
+#### High Memory Usage
+```bash
+# Check Prometheus memory usage
+docker stats prometheus
+
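+# Reduce the retention period.
+# --storage.tsdb.retention.time is a command-line flag, so set it in the prometheus
+# service's command in docker-compose.yml (for example 7d) and recreate the container.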
+```
+
+#### Missing Metrics
+```bash
+# Check target status
+curl http://prometheus:9090/api/v1/targets
+
+# List the metric names Prometheus has ingested to confirm the expected series exist
+curl http://prometheus:9090/api/v1/label/__name__/values
+```
+
+#### Dashboard Loading Issues
+```bash
+# Check Grafana logs
+docker logs grafana
+
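+# Verify datasource connectivity through Grafana's datasource proxy.
+# Assumes the Prometheus datasource has ID 1; add credentials (for example -u user:password)
+# if anonymous access is disabled. The URL is quoted so the shell does not interpret "?".
+curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"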
+```
+
+### Monitoring Health Checks
+```bash
+# Prometheus health
+curl http://prometheus:9090/-/healthy
+
+# Grafana health
+curl http://grafana:3000/api/health
+
+# AlertManager health
+curl http://alertmanager:9093/-/healthy
+```
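+
+The same endpoints could also drive Compose health checks so `docker ps` reports container health. The sketch below assumes the Prometheus and Alertmanager images provide a BusyBox `wget`; the intervals and retry counts are illustrative.
+
+```yaml
+# docker-compose.yml fragment (sketch; assumes wget is available in the images)
+services:
+  prometheus:
+    healthcheck:
+      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
+      interval: 30s
+      timeout: 5s
+      retries: 3
+
+  alertmanager:
+    healthcheck:
+      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9093/-/healthy"]
+      interval: 30s
+      timeout: 5s
+      retries: 3
+```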
+
+## Security Configuration
+
+### Authentication
+- **Grafana**: OAuth integration with Authentik
+- **Prometheus**: Basic auth via reverse proxy
+- **AlertManager**: Basic auth via reverse proxy
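+
+Grafana's generic OAuth provider is one way to wire up the Authentik integration. The sketch below uses Grafana's `GF_AUTH_GENERIC_OAUTH_*` environment variables; the `authentik.example.com` host and the two `${...}` variables are placeholders, and the endpoint paths follow Authentik's usual `/application/o/` layout, which should be checked against the actual provider settings.
+
+```yaml
+# docker-compose.yml fragment (sketch; host name and secrets are placeholders)
+services:
+  grafana:
+    environment:
+      GF_SERVER_ROOT_URL: https://grafana.vish.gg
+      GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
+      GF_AUTH_GENERIC_OAUTH_NAME: Authentik
+      GF_AUTH_GENERIC_OAUTH_CLIENT_ID: ${GRAFANA_OAUTH_CLIENT_ID}
+      GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: ${GRAFANA_OAUTH_CLIENT_SECRET}
+      GF_AUTH_GENERIC_OAUTH_SCOPES: openid profile email
+      GF_AUTH_GENERIC_OAUTH_AUTH_URL: https://authentik.example.com/application/o/authorize/
+      GF_AUTH_GENERIC_OAUTH_TOKEN_URL: https://authentik.example.com/application/o/token/
+      GF_AUTH_GENERIC_OAUTH_API_URL: https://authentik.example.com/application/o/userinfo/
+```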
+
+### Network Security
+- **Internal network**: Isolated Docker network
+- **Reverse proxy**: Nginx Proxy Manager
+- **SSL termination**: Let's Encrypt certificates
+- **Access control**: IP-based restrictions
+
+### Data Security
+- **Encryption at rest**: Encrypted storage volumes
+- **Encryption in transit**: TLS for all communications
+- **Access logging**: Comprehensive audit trails
+- **Regular updates**: Automated security updates
+
+## Integration Points
+
+### External Systems
+- **NTFY**: Push notifications for alerts
+- **Email**: Backup notification channel
+- **Slack**: Team notifications (optional)
+- **PagerDuty**: Escalation for critical alerts
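+
+An `alertmanager.yml` routing tree for these channels might be sketched as follows. Everything receiver-specific is an assumption: the `ntfy-bridge` webhook URL stands in for whatever forwards Alertmanager webhooks to NTFY, the PagerDuty key is a placeholder, and the email and Slack receivers are left out for brevity.
+
+```yaml
+# alertmanager.yml fragment (sketch; URLs and keys are placeholders)
+route:
+  receiver: ntfy                  # default for warning and info alerts
+  group_by: [alertname, instance]
+  group_wait: 30s
+  repeat_interval: 4h
+  routes:
+    - matchers:
+        - severity="critical"
+      receiver: critical
+
+receivers:
+  - name: ntfy
+    webhook_configs:
+      - url: http://ntfy-bridge:8080/alerts   # hypothetical webhook-to-NTFY bridge
+
+  - name: critical
+    # Critical alerts go to both the push channel and PagerDuty
+    webhook_configs:
+      - url: http://ntfy-bridge:8080/alerts
+    pagerduty_configs:
+      - routing_key: "<pagerduty-integration-key>"
+```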
+
+### Automation
+- **Ansible**: Configuration management
+- **GitOps**: Version-controlled configurations
+- **CI/CD**: Automated deployment pipeline
+- **Backup automation**: Scheduled backups
+
+## Future Enhancements
+
+### Planned Features
+- **Log aggregation**: Centralized log management
+- **Distributed tracing**: Application tracing
+- **Synthetic monitoring**: Proactive service testing
+- **Machine learning**: Anomaly detection
+
+### Scaling Considerations
+- **High availability**: Multi-instance deployment
+- **Load balancing**: Distribute query load
+- **Federation**: Multi-cluster monitoring
+- **Storage scaling**: Efficient long-term storage
+
+---
+**Status**: ✅ Comprehensive monitoring infrastructure operational across all homelab systems