Sanitized mirror from private repository - 2026-03-26 10:25:55 UTC
docs/infrastructure/docker/monitoring/README.md
# 📊 Monitoring Infrastructure

*Docker-based monitoring stack for comprehensive homelab observability*

## Overview
This directory contains the Docker-based monitoring infrastructure that provides comprehensive observability across the entire homelab environment.

## Architecture

### Core Components
- **Prometheus** - Metrics collection and storage
- **Grafana** - Visualization and dashboards
- **AlertManager** - Alert routing and management
- **Node Exporter** - System metrics collection
- **cAdvisor** - Container metrics collection

### Deployment Structure
```
monitoring/
├── prometheus/
│   ├── prometheus.yml      # Main configuration
│   ├── alert-rules.yml     # Alert definitions
│   └── targets/            # Service discovery configs
├── grafana/
│   ├── provisioning/       # Dashboard and datasource configs
│   └── dashboards/         # JSON dashboard definitions
├── alertmanager/
│   └── alertmanager.yml    # Alert routing configuration
└── docker-compose.yml      # Complete monitoring stack
```
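
Before starting the stack, it can help to confirm that the files shown in the tree are actually present. The following is a minimal sketch, assuming the checkout matches the layout above exactly; `check_layout` is an illustrative helper name and is not part of the repository.

```bash
# Sketch: report which expected files or directories are missing from a
# monitoring/ checkout. The path list mirrors the tree shown above.

check_layout() {
  local root="${1:-monitoring}"   # directory that holds docker-compose.yml
  local missing=0
  local path
  for path in \
    prometheus/prometheus.yml \
    prometheus/alert-rules.yml \
    prometheus/targets \
    grafana/provisioning \
    grafana/dashboards \
    alertmanager/alertmanager.yml \
    docker-compose.yml
  do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  # Exit status 0 only when every expected path exists.
  return "$missing"
}
```

One possible use is to gate deployment on the check: `check_layout ./monitoring && docker compose -f ./monitoring/docker-compose.yml up -d`.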

## Service Endpoints

### Internal Access
- **Prometheus**: `http://prometheus:9090`
- **Grafana**: `http://grafana:3000`
- **AlertManager**: `http://alertmanager:9093`

### External Access (via Nginx Proxy Manager)
- **Grafana**: `https://grafana.vish.gg`
- **Prometheus**: `https://prometheus.vish.gg` (admin only)
- **AlertManager**: `https://alerts.vish.gg` (admin only)

## Metrics Collection

### System Metrics
- **Node Exporter**: CPU, memory, disk, network statistics
- **SNMP Exporter**: Network equipment monitoring
- **Blackbox Exporter**: Service availability checks

### Container Metrics
- **cAdvisor**: Docker container resource usage
- **Portainer metrics**: Container orchestration metrics
- **Docker daemon metrics**: Docker engine statistics

### Application Metrics
- **Plex**: Media server performance metrics
- **Nginx**: Web server access and performance
- **Database metrics**: PostgreSQL, Redis performance

### Custom Metrics
- **Backup status**: Success/failure rates
- **Storage usage**: Disk space across all hosts
- **Network performance**: Bandwidth and latency
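
The custom metrics above have no dedicated exporter, so they have to be produced by scripts. One common route, shown here as a sketch rather than as this stack's actual mechanism, is Node Exporter's textfile collector: a job writes a `.prom` file and Node Exporter serves its contents on the next scrape. The metric names, the `backup_job` label, and the default directory below are assumptions made for illustration.

```bash
# Sketch: publish the result of a backup job as Prometheus metrics through
# Node Exporter's textfile collector. Metric names, the label name and the
# default directory are hypothetical.

write_backup_metrics() {
  local job="$1"        # name of the backup job, e.g. "grafana"
  local status="$2"     # 1 = last run succeeded, 0 = last run failed
  local dir="${3:-/var/lib/node_exporter/textfile}"   # hypothetical path
  local now tmp
  now="$(date +%s)"
  # The temporary name does not end in .prom, so the collector ignores it.
  tmp="$(mktemp "$dir/backup_${job}.prom.XXXXXX")" || return 1

  {
    echo "# HELP homelab_backup_last_run_success Whether the last backup run succeeded (1) or failed (0)."
    echo "# TYPE homelab_backup_last_run_success gauge"
    echo "homelab_backup_last_run_success{backup_job=\"${job}\"} ${status}"
    echo "# HELP homelab_backup_last_run_timestamp_seconds Unix time of the last backup run."
    echo "# TYPE homelab_backup_last_run_timestamp_seconds gauge"
    echo "homelab_backup_last_run_timestamp_seconds{backup_job=\"${job}\"} ${now}"
  } > "$tmp"

  # mktemp creates the file with mode 600; make it readable for the exporter.
  chmod 644 "$tmp"
  # Rename into place so Node Exporter never reads a half-written file.
  mv "$tmp" "$dir/backup_${job}.prom"
}
```

For this to take effect, Node Exporter has to be started with `--collector.textfile.directory` pointing at the same directory, which in a Docker deployment usually means bind-mounting it into the container.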

## Dashboard Categories

### Infrastructure Dashboards
- **Host Overview**: System resource utilization
- **Network Performance**: Bandwidth and connectivity
- **Storage Monitoring**: Disk usage and health
- **Docker Containers**: Container resource usage

### Service Dashboards
- **Media Services**: Plex, Arr suite performance
- **Web Services**: Nginx, application response times
- **Database Performance**: Query performance and connections
- **Backup Monitoring**: Backup job status and trends

### Security Dashboards
- **Authentication Events**: Login attempts and failures
- **Network Security**: Firewall logs and intrusion attempts
- **Certificate Monitoring**: SSL certificate expiration
- **Vulnerability Scanning**: Security scan results

## Alert Configuration

### Critical Alerts
- **Host down**: System unreachable
- **High resource usage**: CPU/Memory > 90%
- **Disk space critical**: < 10% free space
- **Service unavailable**: Key services down

### Warning Alerts
- **High resource usage**: CPU/Memory > 80%
- **Disk space low**: < 20% free space
- **Certificate expiring**: < 30 days to expiration
- **Backup failures**: Failed backup jobs
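
The critical and warning thresholds overlap (usage above 90% is also above 80%), so the stricter check has to be evaluated first. The sketch below only makes those boundaries explicit for use in ad-hoc scripts; it is not how the alerts are defined in `alert-rules.yml`, and the function names are illustrative. It handles whole-number percentages only.

```bash
# Sketch: map an integer percentage onto the severities listed above.

# CPU or memory usage in percent: > 90 is critical, > 80 is a warning.
classify_usage() {
  local pct="$1"
  if [ "$pct" -gt 90 ]; then
    echo critical
  elif [ "$pct" -gt 80 ]; then
    echo warning
  else
    echo ok
  fi
}

# Free disk space in percent: < 10 is critical, < 20 is a warning.
classify_disk_free() {
  local free="$1"
  if [ "$free" -lt 10 ]; then
    echo critical
  elif [ "$free" -lt 20 ]; then
    echo warning
  else
    echo ok
  fi
}
```

For example, `classify_usage 85` prints `warning`, and `classify_disk_free 9` prints `critical`.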

### Info Alerts
- **System updates**: Available updates
- **Maintenance windows**: Scheduled maintenance
- **Performance trends**: Unusual patterns
- **Capacity planning**: Resource growth trends

## Data Retention

### Prometheus Retention
- **Raw metrics**: 15 days high resolution
- **Downsampled**: 90 days medium resolution
- **Long-term**: 1 year low resolution

### Grafana Data
- **Dashboards**: Version controlled in Git
- **User preferences**: Backed up weekly
- **Annotations**: Retained for 1 year

### Log Retention
- **Application logs**: 30 days
- **System logs**: 90 days
- **Audit logs**: 1 year
- **Security logs**: 2 years
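
The log retention periods can be expressed as a small lookup that a cleanup job consults. This is a sketch under assumptions: the category names and the idea of one directory per category are hypothetical, and the listing function only prints candidates instead of deleting them.

```bash
# Sketch: look up the retention period for a log category and list files
# that have aged out. Category names and directory layout are assumptions.

retention_days() {
  case "$1" in
    application) echo 30 ;;
    system)      echo 90 ;;
    audit)       echo 365 ;;   # 1 year
    security)    echo 730 ;;   # 2 years
    *)           echo "unknown log category: $1" >&2; return 1 ;;
  esac
}

# Print (do not delete) files older than the category's retention period.
list_expired_logs() {
  local category="$1" dir="$2" days
  days="$(retention_days "$category")" || return 1
  find "$dir" -type f -mtime +"$days" -print
}
```

Keeping the deletion step separate from the listing makes it easy to review what would be removed before wiring the function into a scheduled job.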

## Backup and Recovery

### Configuration Backup
```bash
# Back up the Prometheus configuration (/backup must be a volume mounted into the container)
docker exec prometheus tar -czf "/backup/prometheus-config-$(date +%Y%m%d).tar.gz" /etc/prometheus/

# Back up the Grafana data directory (includes grafana.db with UI-saved dashboards, and installed plugins)
docker exec grafana tar -czf "/backup/grafana-dashboards-$(date +%Y%m%d).tar.gz" /var/lib/grafana/
```

### Data Backup
```bash
# Back up Prometheus data. Archiving a live TSDB can yield an inconsistent copy;
# stop the container first or use the TSDB snapshot API (requires --web.enable-admin-api).
docker exec prometheus tar -czf "/backup/prometheus-data-$(date +%Y%m%d).tar.gz" /prometheus/

# Back up the Grafana database (assumes the sqlite3 CLI is available inside the container)
docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"
```
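
Each run of the commands above leaves another date-stamped archive in `/backup`, so the volume grows until old archives are removed. The following is a sketch of one way to keep only the newest few archives per prefix; `prune_backups` is a hypothetical helper, and it relies on the `YYYYMMDD` naming used above.

```bash
# Sketch: keep the newest N date-stamped archives for one prefix and remove
# the rest. Assumes names of the form <prefix>-YYYYMMDD.<extension>.

prune_backups() {
  local dir="$1"      # backup directory, e.g. /backup
  local prefix="$2"   # e.g. prometheus-data
  local keep="$3"     # number of archives to keep

  # YYYYMMDD stamps sort chronologically, so a reverse lexical sort puts the
  # newest archive first; everything after the first $keep lines is removed.
  ls -1 "$dir" |
    grep "^${prefix}-[0-9]\{8\}\." |
    sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r name; do
      rm -- "$dir/$name"
      echo "removed $name"
    done
}
```

For example, `prune_backups /backup prometheus-data 7` would keep the seven most recent Prometheus data archives and leave files with other prefixes untouched.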

### Disaster Recovery
1. **Restore configurations** from backup
2. **Redeploy containers** with restored configs
3. **Import historical data** if needed
4. **Verify alert routing** and dashboard functionality
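
The recovery steps above can be sketched as a script. This is an outline under stated assumptions, not a tested runbook: it assumes the container's `/etc/prometheus` is a bind mount of `prometheus/` in the stack directory, that archives follow the naming from the backup commands, and that a `DRY_RUN` switch (introduced here for illustration) is wanted for rehearsals. The health checks cover only the first part of step 4; alert routing and dashboards still need a manual look.

```bash
# Sketch of the recovery steps. Archive names, the stack directory layout
# and the DRY_RUN switch are assumptions for illustration.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restore_stack() {
  local backup_dir="$1"   # where the dated archives live
  local stamp="$2"        # YYYYMMDD of the backup to restore
  local stack_dir="$3"    # directory containing docker-compose.yml

  # 1. Restore configurations. The archive stores paths as etc/prometheus/...,
  #    so two leading path components are stripped on extraction.
  run tar -xzf "$backup_dir/prometheus-config-$stamp.tar.gz" \
    -C "$stack_dir/prometheus" --strip-components=2 || return 1

  # 2. Redeploy containers with the restored configs.
  run docker compose -f "$stack_dir/docker-compose.yml" up -d || return 1

  # 3. Importing historical data is optional and site-specific; not automated.

  # 4. Confirm that each service answers its health endpoint.
  run curl -fsS http://prometheus:9090/-/healthy || return 1
  run curl -fsS http://grafana:3000/api/health || return 1
  run curl -fsS http://alertmanager:9093/-/healthy || return 1
}
```

Running `DRY_RUN=1 restore_stack /backup 20260326 /opt/monitoring` prints the commands without touching anything, which is a cheap way to rehearse the procedure; the paths in that call are placeholders.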

## Performance Optimization

### Prometheus Optimization
- **Recording rules**: Pre-calculate expensive queries
- **Metric relabeling**: Reduce cardinality
- **Storage optimization**: Efficient time series storage
- **Query optimization**: Efficient PromQL queries

### Grafana Optimization
- **Dashboard caching**: Reduce query load
- **Panel optimization**: Efficient visualizations
- **User management**: Role-based access control
- **Plugin management**: Only necessary plugins

### Network Optimization
- **Local metrics**: Minimize network traffic
- **Compression**: Enable metric compression
- **Batching**: Batch metric collection
- **Filtering**: Collect only necessary metrics

## Troubleshooting

### Common Issues

#### High Memory Usage
```bash
# Check Prometheus memory usage
docker stats prometheus

# Reduce the retention period. Retention is a command-line flag, not a
# prometheus.yml setting; add it to the Prometheus command in docker-compose.yml:
#   --storage.tsdb.retention.time=7d
```

#### Missing Metrics
```bash
# Check target status
curl http://prometheus:9090/api/v1/targets

# List the metric names Prometheus has ingested; an expected name missing
# from this list means its target is not being scraped
curl http://prometheus:9090/api/v1/label/__name__/values
```
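
The targets endpoint returns a large JSON document. When the only question is which targets are failing, an instant query for `up == 0` is often quicker to read. A minimal sketch, assuming the internal Prometheus address listed under Service Endpoints; the `PROM_URL` variable and function names are illustrative.

```bash
# Sketch: ask Prometheus which scrape targets are currently down.
PROM_URL="${PROM_URL:-http://prometheus:9090}"

# Run an instant PromQL query. --get with --data-urlencode lets curl
# URL-encode the expression, so spaces and operators need no manual escaping.
prom_query() {
  curl -fsS --max-time 10 --get \
    --data-urlencode "query=$1" \
    "$PROM_URL/api/v1/query"
}

# `up == 0` returns one series per target whose most recent scrape failed.
list_down_targets() {
  prom_query 'up == 0'
}
```

An empty `result` array in the response means every configured target was scraped successfully on its last attempt.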

#### Dashboard Loading Issues
```bash
# Check Grafana logs
docker logs grafana

# Verify datasource connectivity. Quote the URL so the shell does not treat "?"
# as a glob character; the call usually needs Grafana credentials or an API token.
curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"
```

### Monitoring Health Checks
```bash
# Prometheus health
curl http://prometheus:9090/-/healthy

# Grafana health
curl http://grafana:3000/api/health

# AlertManager health
curl http://alertmanager:9093/-/healthy
```
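
The three checks can be combined into one command that returns a non-zero status when any service fails, which suits cron jobs and deploy scripts. This is a sketch: the function name and the `name=url` argument format are choices made here, and the default endpoints are the ones listed above.

```bash
# Sketch: probe each health endpoint and summarise the result.
# With no arguments, the three internal endpoints from this README are used.

check_stack_health() {
  if [ "$#" -eq 0 ]; then
    set -- \
      "prometheus=http://prometheus:9090/-/healthy" \
      "grafana=http://grafana:3000/api/health" \
      "alertmanager=http://alertmanager:9093/-/healthy"
  fi

  local failed=0 entry name url
  for entry in "$@"; do
    name="${entry%%=*}"   # text before the first "="
    url="${entry#*=}"     # text after the first "="
    # -f makes curl return non-zero on HTTP errors such as 503.
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      echo "ok   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}
```

Because the hostnames resolve only on the Docker network, the function has to run from a container attached to that network, or be given externally reachable URLs as arguments.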

## Security Configuration

### Authentication
- **Grafana**: OAuth integration with Authentik
- **Prometheus**: Basic auth via reverse proxy
- **AlertManager**: Basic auth via reverse proxy

### Network Security
- **Internal network**: Isolated Docker network
- **Reverse proxy**: Nginx Proxy Manager
- **SSL termination**: Let's Encrypt certificates
- **Access control**: IP-based restrictions

### Data Security
- **Encryption at rest**: Encrypted storage volumes
- **Encryption in transit**: TLS for all communications
- **Access logging**: Comprehensive audit trails
- **Regular updates**: Automated security updates

## Integration Points

### External Systems
- **NTFY**: Push notifications for alerts
- **Email**: Backup notification channel
- **Slack**: Team notifications (optional)
- **PagerDuty**: Escalation for critical alerts
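
For scripts that sit outside the alerting pipeline, such as a backup job reporting its own failure, a notification can be published to ntfy with a plain HTTP POST. The sketch below makes several assumptions: the server URL and topic are placeholders, and the mapping from this document's severities to ntfy priorities (1 to 5) is one possible choice, not a documented setting of this stack.

```bash
# Sketch: publish a message to an ntfy topic with a severity-based priority.
NTFY_URL="${NTFY_URL:-https://ntfy.example.com}"   # hypothetical server
NTFY_TOPIC="${NTFY_TOPIC:-homelab-alerts}"         # hypothetical topic

# Map the severities used in this README onto ntfy priorities (5 is highest).
ntfy_priority() {
  case "$1" in
    critical) echo 5 ;;
    warning)  echo 4 ;;
    info)     echo 2 ;;
    *)        echo 3 ;;   # ntfy's default priority
  esac
}

# POST the message body; title and priority travel as HTTP headers.
send_alert() {
  local severity="$1" title="$2" message="$3"
  curl -fsS --max-time 10 \
    -H "Title: $title" \
    -H "Priority: $(ntfy_priority "$severity")" \
    -d "$message" \
    "$NTFY_URL/$NTFY_TOPIC"
}
```

A call might look like `send_alert warning "Backup failed" "grafana backup exited with status 1"`.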

### Automation
- **Ansible**: Configuration management
- **GitOps**: Version-controlled configurations
- **CI/CD**: Automated deployment pipeline
- **Backup automation**: Scheduled backups

## Future Enhancements

### Planned Features
- **Log aggregation**: Centralized log management
- **Distributed tracing**: Application tracing
- **Synthetic monitoring**: Proactive service testing
- **Machine learning**: Anomaly detection

### Scaling Considerations
- **High availability**: Multi-instance deployment
- **Load balancing**: Distribute query load
- **Federation**: Multi-cluster monitoring
- **Storage scaling**: Efficient long-term storage

---
**Status**: ✅ Comprehensive monitoring infrastructure operational across all homelab systems