# 📊 Monitoring Infrastructure

*Docker-based monitoring stack for comprehensive homelab observability*

## Overview
This directory contains the Docker-based monitoring infrastructure that provides comprehensive observability across the entire homelab environment.

## Architecture

### Core Components
- **Prometheus** - Metrics collection and storage
- **Grafana** - Visualization and dashboards
- **AlertManager** - Alert routing and management
- **Node Exporter** - System metrics collection
- **cAdvisor** - Container metrics collection

### Deployment Structure
```
monitoring/
├── prometheus/
│   ├── prometheus.yml          # Main configuration
│   ├── alert-rules.yml         # Alert definitions
│   └── targets/                # Service discovery configs
├── grafana/
│   ├── provisioning/           # Dashboard and datasource configs
│   └── dashboards/             # JSON dashboard definitions
├── alertmanager/
│   └── alertmanager.yml        # Alert routing configuration
└── docker-compose.yml          # Complete monitoring stack
```
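
When setting the stack up on a new host, the layout above can be scaffolded before any configuration is written. The following is a minimal sketch, not the repository's actual bootstrap script: `STACK_ROOT` is an assumed variable name, and the placeholder files are created empty.

```bash
# Create the directory skeleton from the tree above under STACK_ROOT.
# STACK_ROOT is an assumed name; it defaults to ./monitoring.
set -eu

STACK_ROOT="${STACK_ROOT:-./monitoring}"

mkdir -p "$STACK_ROOT/prometheus/targets" \
         "$STACK_ROOT/grafana/provisioning" \
         "$STACK_ROOT/grafana/dashboards" \
         "$STACK_ROOT/alertmanager"

# Create empty placeholder files only when they do not exist yet,
# so an existing configuration is never overwritten.
for f in prometheus/prometheus.yml prometheus/alert-rules.yml \
         alertmanager/alertmanager.yml docker-compose.yml; do
  [ -e "$STACK_ROOT/$f" ] || : > "$STACK_ROOT/$f"
done

# Print the resulting layout for a quick visual check.
find "$STACK_ROOT" | sort
```

Because existing files are left untouched, the script is safe to re-run against a checkout that already contains real configuration.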

## Service Endpoints

### Internal Access
- **Prometheus**: `http://prometheus:9090`
- **Grafana**: `http://grafana:3000`
- **AlertManager**: `http://alertmanager:9093`

### External Access (via Nginx Proxy Manager)
- **Grafana**: `https://grafana.vish.gg`
- **Prometheus**: `https://prometheus.vish.gg` (admin only)
- **AlertManager**: `https://alerts.vish.gg` (admin only)

## Metrics Collection

### System Metrics
- **Node Exporter**: CPU, memory, disk, network statistics
- **SNMP Exporter**: Network equipment monitoring
- **Blackbox Exporter**: Service availability checks

### Container Metrics
- **cAdvisor**: Docker container resource usage
- **Portainer metrics**: Container orchestration metrics
- **Docker daemon metrics**: Docker engine statistics
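
Exporters such as Node Exporter and cAdvisor are wired into Prometheus through scrape jobs. The sketch below writes an example configuration to a scratch file for review; it is not the stack's real `prometheus.yml`. The job names, the `node-exporter` and `cadvisor` hostnames, the 30s interval, and the output path are all assumptions, while 9100 and 8080 are the exporters' default ports.

```bash
# Write an example scrape configuration to a scratch file.
# SCRAPE_EXAMPLE is an assumed name; nothing here touches a running Prometheus.
SCRAPE_EXAMPLE="${SCRAPE_EXAMPLE:-./prometheus.example.yml}"

cat > "$SCRAPE_EXAMPLE" <<'EOF'
global:
  scrape_interval: 30s          # assumed interval

rule_files:
  - /etc/prometheus/alert-rules.yml

scrape_configs:
  # Host metrics (Node Exporter default port 9100; hostname is assumed)
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']

  # Container metrics (cAdvisor default port 8080; hostname is assumed)
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']

  # File-based discovery reading the targets/ directory
  - job_name: file-sd
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml
EOF

echo "wrote $SCRAPE_EXAMPLE"
```

File-based discovery lets new hosts be added by dropping a file into `targets/` without editing the main configuration.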

### Application Metrics
- **Plex**: Media server performance metrics
- **Nginx**: Web server access and performance
- **Database metrics**: PostgreSQL, Redis performance

### Custom Metrics
- **Backup status**: Success/failure rates
- **Storage usage**: Disk space across all hosts
- **Network performance**: Bandwidth and latency

## Dashboard Categories

### Infrastructure Dashboards
- **Host Overview**: System resource utilization
- **Network Performance**: Bandwidth and connectivity
- **Storage Monitoring**: Disk usage and health
- **Docker Containers**: Container resource usage

### Service Dashboards
- **Media Services**: Plex, Arr suite performance
- **Web Services**: Nginx, application response times
- **Database Performance**: Query performance and connections
- **Backup Monitoring**: Backup job status and trends

### Security Dashboards
- **Authentication Events**: Login attempts and failures
- **Network Security**: Firewall logs and intrusion attempts
- **Certificate Monitoring**: SSL certificate expiration
- **Vulnerability Scanning**: Security scan results
## Alert Configuration

### Critical Alerts
- **Host down**: System unreachable
- **High resource usage**: CPU/Memory > 90%
- **Disk space critical**: < 10% free space
- **Service unavailable**: Key services down

### Warning Alerts
- **High resource usage**: CPU/Memory > 80%
- **Disk space low**: < 20% free space
- **Certificate expiring**: < 30 days to expiration
- **Backup failures**: Failed backup jobs

### Info Alerts
- **System updates**: Available updates
- **Maintenance windows**: Scheduled maintenance
- **Performance trends**: Unusual patterns
- **Capacity planning**: Resource growth trends
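
The critical and warning thresholds in this section map onto Prometheus alerting rules. The following sketch writes a few of them to a scratch file. It is an illustration, not the contents of the real `alert-rules.yml`: the alert names, the `job="node"` label, the `for:` durations, the filesystem filter, and the output path are assumptions. The metric names are the standard Node Exporter ones.

```bash
# Write example alerting rules to a scratch file for review.
# ALERT_EXAMPLE is an assumed name; the quoted heredoc keeps {{ ... }} literal.
ALERT_EXAMPLE="${ALERT_EXAMPLE:-./alert-rules.example.yml}"

cat > "$ALERT_EXAMPLE" <<'EOF'
groups:
  - name: homelab-example
    rules:
      # Critical: host down
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"

      # Critical: less than 10% free disk space
      - alert: DiskSpaceCritical
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 10m
        labels:
          severity: critical

      # Warning: less than 20% free disk space
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.20
        for: 10m
        labels:
          severity: warning

      # Critical: CPU above 90% averaged over 5 minutes
      - alert: HighCpuUsage
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.90
        for: 10m
        labels:
          severity: critical
EOF

echo "wrote $ALERT_EXAMPLE"
```

A file like this can be validated with `promtool check rules` before Prometheus loads it. The `severity` label is what a routing tree in `alertmanager.yml` would typically match on.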

## Data Retention

### Prometheus Retention
- **Raw metrics**: 15 days high resolution
- **Downsampled**: 90 days medium resolution
- **Long-term**: 1 year low resolution

### Grafana Data
- **Dashboards**: Version controlled in Git
- **User preferences**: Backed up weekly
- **Annotations**: Retained for 1 year

### Log Retention
- **Application logs**: 30 days
- **System logs**: 90 days
- **Audit logs**: 1 year
- **Security logs**: 2 years
## Backup and Recovery

### Configuration Backup
```bash
# Backup Prometheus configuration
# (assumes a /backup volume is mounted in each container)
docker exec prometheus tar -czf "/backup/prometheus-config-$(date +%Y%m%d).tar.gz" /etc/prometheus/

# Backup Grafana dashboards (archives the whole Grafana data directory)
docker exec grafana tar -czf "/backup/grafana-dashboards-$(date +%Y%m%d).tar.gz" /var/lib/grafana/
```

### Data Backup
```bash
# Backup Prometheus data
# (a tar of a live TSDB can be inconsistent; stop the container or take a snapshot first)
docker exec prometheus tar -czf "/backup/prometheus-data-$(date +%Y%m%d).tar.gz" /prometheus/

# Backup Grafana database
# (assumes the sqlite3 CLI is available inside the grafana container)
docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"
```

### Disaster Recovery
1. **Restore configurations** from backup
2. **Redeploy containers** with restored configs
3. **Import historical data** if needed
4. **Verify alert routing** and dashboard functionality
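
Step 1 above can be sketched as a small helper that checks a backup archive before unpacking it. This is an illustration under assumptions, not the stack's actual recovery tooling: `restore_archive` is a hypothetical function name, and the paths in the usage comment are examples only.

```bash
# restore_archive ARCHIVE DEST
# Unpack a .tar.gz backup into DEST after confirming it is readable.
restore_archive() {
  archive="$1"
  dest="$2"

  if [ ! -f "$archive" ]; then
    echo "missing archive: $archive" >&2
    return 1
  fi

  # Listing the archive first catches truncated or corrupt backups
  # before anything is written to disk.
  if ! tar -tzf "$archive" >/dev/null; then
    echo "unreadable archive: $archive" >&2
    return 1
  fi

  mkdir -p "$dest"
  tar -xzf "$archive" -C "$dest"
}

# Example usage (hypothetical paths):
#   restore_archive ./backup/prometheus-config-YYYYMMDD.tar.gz ./restore
#   docker compose up -d     # step 2, once the files are back in place
```

Because `tar` normally strips the leading `/` when creating an archive, the files land under `DEST/etc/prometheus/` rather than overwriting the live configuration. That leaves room to diff them against the Git-tracked versions before redeploying.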

## Performance Optimization

### Prometheus Optimization
- **Recording rules**: Pre-calculate expensive queries
- **Metric relabeling**: Reduce cardinality
- **Storage optimization**: Efficient time series storage
- **Query optimization**: Efficient PromQL queries
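
Recording rules and metric relabeling are the two items above that translate most directly into configuration. The sketch below writes one example of each to scratch files. The file names, rule group name, and `cadvisor` job are assumptions. The dropped series are real cAdvisor metrics, chosen here only as an illustration of cardinality reduction.

```bash
# File names are hypothetical; nothing here touches a running Prometheus.
RULES_EXAMPLE="${RULES_EXAMPLE:-./recording-rules.example.yml}"
RELABEL_EXAMPLE="${RELABEL_EXAMPLE:-./relabel.example.yml}"

# Recording rule: store per-instance CPU utilisation as its own series so
# dashboards read one cheap series instead of re-evaluating rate() each time.
cat > "$RULES_EXAMPLE" <<'EOF'
groups:
  - name: homelab-recording-example
    interval: 1m
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
EOF

# Metric relabeling: drop unwanted series at scrape time, before storage.
cat > "$RELABEL_EXAMPLE" <<'EOF'
scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_tasks_state|container_memory_failures_total'
        action: drop
EOF

echo "wrote $RULES_EXAMPLE and $RELABEL_EXAMPLE"
```

A dashboard panel would then query `instance:node_cpu_utilisation:rate5m` directly, which keeps query cost flat as the number of hosts grows.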

### Grafana Optimization
- **Dashboard caching**: Reduce query load
- **Panel optimization**: Efficient visualizations
- **User management**: Role-based access control
- **Plugin management**: Only necessary plugins

### Network Optimization
- **Local metrics**: Minimize network traffic
- **Compression**: Enable metric compression
- **Batching**: Batch metric collection
- **Filtering**: Collect only necessary metrics
## Troubleshooting

### Common Issues

#### High Memory Usage
```bash
# Check Prometheus memory usage (one-off sample instead of a live stream)
docker stats --no-stream prometheus

# Reduce retention period
# Retention is a command-line flag, not a prometheus.yml setting; set it in the
# prometheus service's command in docker-compose.yml: --storage.tsdb.retention.time=7d
```

#### Missing Metrics
```bash
# Check target status
curl http://prometheus:9090/api/v1/targets

# List the metric names Prometheus has ingested
curl http://prometheus:9090/api/v1/label/__name__/values
```

#### Dashboard Loading Issues
```bash
# Check Grafana logs
docker logs grafana

# Verify datasource connectivity (datasource ID 1; usually needs Grafana credentials or an API token)
curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"
```

### Monitoring Health Checks
```bash
# Prometheus health
curl http://prometheus:9090/-/healthy

# Grafana health
curl http://grafana:3000/api/health

# AlertManager health
curl http://alertmanager:9093/-/healthy
```
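
The three probes above can be combined into one function whose exit status is usable from cron or a CI job. This is a minimal sketch: `check_stack_health` is a hypothetical name and the 5-second timeout is an assumption. The endpoints are the ones listed above.

```bash
# check_stack_health: probe each service once, print OK/FAIL per service,
# and return non-zero if any probe failed.
check_stack_health() {
  failed=0
  for endpoint in \
    "prometheus=http://prometheus:9090/-/healthy" \
    "grafana=http://grafana:3000/api/health" \
    "alertmanager=http://alertmanager:9093/-/healthy"
  do
    name="${endpoint%%=*}"   # text before the first '='
    url="${endpoint#*=}"     # text after the first '='
    # -f: treat HTTP errors as failures, -sS: quiet but still report errors,
    # --max-time: bound each probe so a hung service cannot stall the check.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_stack_health || echo "monitoring stack degraded"` from a host on the same Docker network gives a quick pass/fail summary.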

## Security Configuration

### Authentication
- **Grafana**: OAuth integration with Authentik
- **Prometheus**: Basic auth via reverse proxy
- **AlertManager**: Basic auth via reverse proxy

### Network Security
- **Internal network**: Isolated Docker network
- **Reverse proxy**: Nginx Proxy Manager
- **SSL termination**: Let's Encrypt certificates
- **Access control**: IP-based restrictions

### Data Security
- **Encryption at rest**: Encrypted storage volumes
- **Encryption in transit**: TLS for all communications
- **Access logging**: Comprehensive audit trails
- **Regular updates**: Automated security updates
## Integration Points

### External Systems
- **NTFY**: Push notifications for alerts
- **Email**: Backup notification channel
- **Slack**: Team notifications (optional)
- **PagerDuty**: Escalation for critical alerts

### Automation
- **Ansible**: Configuration management
- **GitOps**: Version-controlled configurations
- **CI/CD**: Automated deployment pipeline
- **Backup automation**: Scheduled backups

## Future Enhancements

### Planned Features
- **Log aggregation**: Centralized log management
- **Distributed tracing**: Application tracing
- **Synthetic monitoring**: Proactive service testing
- **Machine learning**: Anomaly detection

### Scaling Considerations
- **High availability**: Multi-instance deployment
- **Load balancing**: Distribute query load
- **Federation**: Multi-cluster monitoring
- **Storage scaling**: Efficient long-term storage

---
**Status**: ✅ Comprehensive monitoring infrastructure operational across all homelab systems