📊 Monitoring Infrastructure
Docker-based monitoring stack for comprehensive homelab observability
Overview
This directory contains the Docker-based monitoring stack for the homelab. It covers metrics collection, dashboards, and alerting for every host and service.
Architecture
Core Components
- Prometheus - Metrics collection and storage
- Grafana - Visualization and dashboards
- AlertManager - Alert routing and management
- Node Exporter - System metrics collection
- cAdvisor - Container metrics collection
Deployment Structure
monitoring/
├── prometheus/
│ ├── prometheus.yml # Main configuration
│ ├── alert-rules.yml # Alert definitions
│ └── targets/ # Service discovery configs
├── grafana/
│ ├── provisioning/ # Dashboard and datasource configs
│ └── dashboards/ # JSON dashboard definitions
├── alertmanager/
│ └── alertmanager.yml # Alert routing configuration
└── docker-compose.yml # Complete monitoring stack
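A quick way to confirm that a checkout matches this layout is a small shell check. The sketch below is illustrative only: `check_layout` is a hypothetical helper, not something shipped in this directory, and it only tests that each path from the tree above exists.

```shell
# Report every path from the layout above that is missing under a root directory.
# check_layout is a hypothetical helper name, not part of this repository.
check_layout() {
    root=${1:-monitoring}
    missing=0
    for path in \
        prometheus/prometheus.yml \
        prometheus/alert-rules.yml \
        prometheus/targets \
        grafana/provisioning \
        grafana/dashboards \
        alertmanager/alertmanager.yml \
        docker-compose.yml
    do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    # Exit status 0 means the layout is complete, 1 means something is missing.
    return "$missing"
}
```

Running `check_layout monitoring` from the parent directory prints one line per missing path and returns a non-zero status if any are absent.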
Service Endpoints
Internal Access
- Prometheus: http://prometheus:9090
- Grafana: http://grafana:3000
- AlertManager: http://alertmanager:9093
External Access (via Nginx Proxy Manager)
- Grafana: https://grafana.vish.gg
- Prometheus: https://prometheus.vish.gg (admin only)
- AlertManager: https://alerts.vish.gg (admin only)
Metrics Collection
System Metrics
- Node Exporter: CPU, memory, disk, network statistics
- SNMP Exporter: Network equipment monitoring
- Blackbox Exporter: Service availability checks
Container Metrics
- cAdvisor: Docker container resource usage
- Portainer metrics: Container orchestration metrics
- Docker daemon metrics: Docker engine statistics
Application Metrics
- Plex: Media server performance metrics
- Nginx: Web server access and performance
- Database metrics: PostgreSQL, Redis performance
Custom Metrics
- Backup status: Success/failure rates
- Storage usage: Disk space across all hosts
- Network performance: Bandwidth and latency
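One common way to feed custom metrics such as backup status into Prometheus is Node Exporter's textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below assumes that collector is enabled. The function name, the metric names, and the `backup_job` label are hypothetical and not taken from this stack.

```shell
# Write a backup-status metric file for Node Exporter's textfile collector.
# Assumption: node_exporter runs with --collector.textfile.directory set to $dir.
# Metric and label names are illustrative, not ones this stack is known to use.
write_backup_metric() {
    job=$1      # name of the backup job, e.g. "grafana"
    status=$2   # 1 = success, 0 = failure
    dir=$3      # textfile collector directory
    tmp="$dir/backup_$job.prom.$$"
    {
        echo "# HELP homelab_backup_success Whether the last backup run succeeded (1) or failed (0)."
        echo "# TYPE homelab_backup_success gauge"
        echo "homelab_backup_success{backup_job=\"$job\"} $status"
        echo "# HELP homelab_backup_last_run_timestamp_seconds Unix time of the last backup run."
        echo "# TYPE homelab_backup_last_run_timestamp_seconds gauge"
        echo "homelab_backup_last_run_timestamp_seconds{backup_job=\"$job\"} $(date +%s)"
    } > "$tmp"
    # Rename into place so the collector never reads a half-written file;
    # the temporary name does not end in .prom, so it is ignored until the rename.
    mv "$tmp" "$dir/backup_$job.prom"
}
```

A backup script could call `write_backup_metric grafana 1 "$TEXTFILE_DIR"` after a successful run and pass `0` on failure, where `TEXTFILE_DIR` is whatever directory the collector is configured to read.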
Dashboard Categories
Infrastructure Dashboards
- Host Overview: System resource utilization
- Network Performance: Bandwidth and connectivity
- Storage Monitoring: Disk usage and health
- Docker Containers: Container resource usage
Service Dashboards
- Media Services: Plex, Arr suite performance
- Web Services: Nginx, application response times
- Database Performance: Query performance and connections
- Backup Monitoring: Backup job status and trends
Security Dashboards
- Authentication Events: Login attempts and failures
- Network Security: Firewall logs and intrusion attempts
- Certificate Monitoring: SSL certificate expiration
- Vulnerability Scanning: Security scan results
Alert Configuration
Critical Alerts
- Host down: System unreachable
- High resource usage: CPU/Memory > 90%
- Disk space critical: < 10% free space
- Service unavailable: Key services down
Warning Alerts
- High resource usage: CPU/Memory > 80%
- Disk space low: < 20% free space
- Certificate expiring: < 30 days to expiration
- Backup failures: Failed backup jobs
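The host-down and disk-space thresholds above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch, not the contents of the real `alert-rules.yml`: the function name, group name, alert names, and `for` durations are assumptions, and the expressions rely on the standard `up` series and Node Exporter filesystem metrics.

```shell
# Write an example alerting-rules file to the path given as $1.
# All names and durations below are illustrative assumptions.
write_example_alert_rules() {
    # The quoted 'EOF' stops the shell from expanding $labels in the templates.
    cat > "$1" <<'EOF'
groups:
  - name: homelab-example
    rules:
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
      - alert: DiskSpaceCritical
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% disk space free on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% disk space free on {{ $labels.instance }}"
EOF
}
```

If `promtool` is installed, `promtool check rules <file>` can validate the generated file before Prometheus loads it. Note that a filesystem below 10% free matches both disk rules, so the warning fires alongside the critical alert unless it is inhibited in the alert routing.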
Info Alerts
- System updates: Available updates
- Maintenance windows: Scheduled maintenance
- Performance trends: Unusual patterns
- Capacity planning: Resource growth trends
Data Retention
Prometheus Retention
- Raw metrics: 15 days high resolution
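- Downsampled: 90 days medium resolution
- Long-term: 1 year low resolution
Note: Prometheus stores a single resolution and does not downsample on its own, so the last two tiers depend on a separate long-term storage layer.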
Grafana Data
- Dashboards: Version controlled in Git
- User preferences: Backed up weekly
- Annotations: Retained for 1 year
Log Retention
- Application logs: 30 days
- System logs: 90 days
- Audit logs: 1 year
- Security logs: 2 years
Backup and Recovery
Configuration Backup
# Back up the Prometheus configuration
docker exec prometheus tar -czf /backup/prometheus-config-$(date +%Y%m%d).tar.gz /etc/prometheus/
# Back up the Grafana data directory (includes dashboards stored in grafana.db)
docker exec grafana tar -czf /backup/grafana-dashboards-$(date +%Y%m%d).tar.gz /var/lib/grafana/
Data Backup
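# Back up Prometheus data (this archives a live data directory; for a consistent copy, prefer the TSDB snapshot API)
docker exec prometheus tar -czf /backup/prometheus-data-$(date +%Y%m%d).tar.gz /prometheus/
# Back up the Grafana database (requires the sqlite3 CLI inside the container)
docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"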
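The commands above write archives to `/backup` inside each container, so they assume that path is a mounted volume. A host-side variant with simple rotation might look like the sketch below. The function names, the source and destination paths, and the number of days to keep are all assumptions rather than settings documented for this stack.

```shell
# Archive a directory into a dated tarball and print the archive path.
# backup_dir and prune_backups are hypothetical helper names.
backup_dir() {
    src=$1    # directory to archive, e.g. a bind-mounted config directory
    dest=$2   # directory that receives the archives (must not be inside src)
    name=$3   # archive name prefix, e.g. "prometheus-config"
    archive="$dest/$name-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest" || return 1
    # -C stores paths relative to the source directory.
    tar -czf "$archive" -C "$src" . || return 1
    # Listing the archive is a cheap check that it is readable.
    tar -tzf "$archive" > /dev/null || return 1
    echo "$archive"
}

# Delete archives with the given prefix that are older than keep_days days.
prune_backups() {
    dest=$1
    name=$2
    keep_days=$3
    find "$dest" -type f -name "$name-*.tar.gz" -mtime +"$keep_days" -exec rm -f {} +
}
```

For example, `backup_dir ./prometheus /srv/backups prometheus-config` followed by `prune_backups /srv/backups prometheus-config 30` keeps roughly a month of configuration archives; both paths are placeholders.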
Disaster Recovery
- Restore configurations from backup
- Redeploy containers with restored configs
- Import historical data if needed
- Verify alert routing and dashboard functionality
Performance Optimization
Prometheus Optimization
- Recording rules: Pre-calculate expensive queries
- Metric relabeling: Reduce cardinality
- Storage optimization: Efficient time series storage
- Query optimization: Efficient PromQL queries
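As an illustration of the recording-rules item, the sketch below writes a rules file that pre-computes per-host CPU utilisation so dashboards can query one cheap series instead of re-evaluating `rate()` over every core. The function name, group name, rule name, and evaluation interval are assumptions; the expression uses the standard Node Exporter metric `node_cpu_seconds_total`.

```shell
# Write an example recording-rules file to the path given as $1.
# The group, rule name, and interval are illustrative assumptions.
write_example_recording_rules() {
    cat > "$1" <<'EOF'
groups:
  - name: homelab-recording-example
    interval: 1m
    rules:
      # CPU utilisation per host as a 0-1 ratio, averaged over all cores.
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
EOF
}
```

The generated file only takes effect once it is listed under `rule_files` in `prometheus.yml` and Prometheus has reloaded its configuration.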
Grafana Optimization
- Dashboard caching: Reduce query load
- Panel optimization: Efficient visualizations
- User management: Role-based access control
- Plugin management: Only necessary plugins
Network Optimization
- Local metrics: Minimize network traffic
- Compression: Enable metric compression
- Batching: Batch metric collection
- Filtering: Collect only necessary metrics
Troubleshooting
Common Issues
High Memory Usage
# Check Prometheus memory usage
docker stats prometheus
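# Reduce the retention period
# --storage.tsdb.retention.time is a command-line flag, so set it in the
# prometheus service's command in docker-compose.yml: --storage.tsdb.retention.time=7d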
Missing Metrics
# Check target status
curl http://prometheus:9090/api/v1/targets
# List the metric names Prometheus has actually ingested
curl http://prometheus:9090/api/v1/label/__name__/values
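The targets endpoint returns a large JSON document, so a per-state count is often easier to read. The sketch below is a rough text match that assumes the compact JSON Prometheus normally emits; `summarize_target_health` is a hypothetical helper, and a real JSON parser such as `jq` would be more robust.

```shell
# Count scrape targets by health state from /api/v1/targets JSON on stdin.
# Prints one "state=count" line per state (up, down, unknown).
summarize_target_health() {
    grep -o '"health":"[a-z]*"' \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{ print $2 "=" $1 }'
}
```

Typical use would be `curl -s http://prometheus:9090/api/v1/targets | summarize_target_health`; any `down` count above zero points at the targets to inspect first.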
Dashboard Loading Issues
# Check Grafana logs
docker logs grafana
# Verify datasource connectivity (assumes datasource ID 1; pass credentials if anonymous access is disabled)
curl 'http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up'
Monitoring Health Checks
# Prometheus health
curl http://prometheus:9090/-/healthy
# Grafana health
curl http://grafana:3000/api/health
# AlertManager health
curl http://alertmanager:9093/-/healthy
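The three checks above can be wrapped into one pass that reports each service and returns a failing status if any is down, which is convenient for cron jobs or post-deploy verification. This is a minimal sketch: `check_endpoint` and `check_stack` are hypothetical names and the 5-second timeout is an arbitrary choice.

```shell
# Probe one health URL and print "<name> UP" or "<name> DOWN".
check_endpoint() {
    name=$1
    url=$2
    # -f makes curl treat HTTP error statuses as failures.
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
        echo "$name UP"
    else
        echo "$name DOWN"
        return 1
    fi
}

# Probe all three services using the internal endpoints listed in this document.
check_stack() {
    failed=0
    check_endpoint prometheus http://prometheus:9090/-/healthy || failed=1
    check_endpoint grafana http://grafana:3000/api/health || failed=1
    check_endpoint alertmanager http://alertmanager:9093/-/healthy || failed=1
    return "$failed"
}
```

The internal hostnames only resolve from inside the monitoring Docker network, so `check_stack` would need to run from a container attached to it.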
Security Configuration
Authentication
- Grafana: OAuth integration with Authentik
- Prometheus: Basic auth via reverse proxy
- AlertManager: Basic auth via reverse proxy
Network Security
- Internal network: Isolated Docker network
- Reverse proxy: Nginx Proxy Manager
- SSL termination: Let's Encrypt certificates
- Access control: IP-based restrictions
Data Security
- Encryption at rest: Encrypted storage volumes
- Encryption in transit: TLS for all communications
- Access logging: Comprehensive audit trails
- Regular updates: Automated security updates
Integration Points
External Systems
- NTFY: Push notifications for alerts
- Email: Backup notification channel
- Slack: Team notifications (optional)
- PagerDuty: Escalation for critical alerts
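How alerts actually reach NTFY is not described here, so the following is only a sketch of what a push could look like from a script. It uses ntfy's plain HTTP publishing with the `Title` and `Priority` headers; the function names, the `NTFY_URL` and `NTFY_TOPIC` variables and their defaults, and the severity-to-priority mapping are assumptions.

```shell
# Map this document's alert levels to ntfy priority names.
ntfy_priority() {
    case $1 in
        critical) echo urgent ;;
        warning)  echo high ;;
        *)        echo default ;;
    esac
}

# Publish a notification. NTFY_URL and NTFY_TOPIC are placeholders for
# your own server and topic; the defaults below are not real endpoints.
notify_ntfy() {
    severity=$1
    title=$2
    message=$3
    curl -fsS --max-time 10 \
        -H "Title: $title" \
        -H "Priority: $(ntfy_priority "$severity")" \
        -d "$message" \
        "${NTFY_URL:-https://ntfy.example.invalid}/${NTFY_TOPIC:-homelab-alerts}" > /dev/null
}
```

For instance, `notify_ntfy critical "Host down" "nas01 is unreachable"` would send an urgent notification once `NTFY_URL` points at a reachable ntfy server; the host name is made up for the example.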
Automation
- Ansible: Configuration management
- GitOps: Version-controlled configurations
- CI/CD: Automated deployment pipeline
- Backup automation: Scheduled backups
Future Enhancements
Planned Features
- Log aggregation: Centralized log management
- Distributed tracing: Application tracing
- Synthetic monitoring: Proactive service testing
- Machine learning: Anomaly detection
Scaling Considerations
- High availability: Multi-instance deployment
- Load balancing: Distribute query load
- Federation: Multi-cluster monitoring
- Storage scaling: Efficient long-term storage
Status: ✅ Comprehensive monitoring infrastructure operational across all homelab systems