Sanitized mirror from private repository - 2026-04-05 13:06:07 UTC

2026-04-05 13:06:07 +00:00
commit da2060f709
1401 changed files with 358437 additions and 0 deletions

# 📊 Monitoring Infrastructure
*Docker-based monitoring stack for comprehensive homelab observability*
## Overview
This directory contains the Docker Compose stack that collects metrics, renders dashboards, and routes alerts for every host and service in the homelab.
## Architecture
### Core Components
- **Prometheus** - Metrics collection and storage
- **Grafana** - Visualization and dashboards
- **AlertManager** - Alert routing and management
- **Node Exporter** - System metrics collection
- **cAdvisor** - Container metrics collection
### Deployment Structure
```
monitoring/
├── prometheus/
│ ├── prometheus.yml # Main configuration
│ ├── alert-rules.yml # Alert definitions
│ └── targets/ # Service discovery configs
├── grafana/
│ ├── provisioning/ # Dashboard and datasource configs
│ └── dashboards/ # JSON dashboard definitions
├── alertmanager/
│ └── alertmanager.yml # Alert routing configuration
└── docker-compose.yml # Complete monitoring stack
```
## Service Endpoints
### Internal Access
- **Prometheus**: `http://prometheus:9090`
- **Grafana**: `http://grafana:3000`
- **AlertManager**: `http://alertmanager:9093`
### External Access (via Nginx Proxy Manager)
- **Grafana**: `https://grafana.vish.gg`
- **Prometheus**: `https://prometheus.vish.gg` (admin only)
- **AlertManager**: `https://alerts.vish.gg` (admin only)
## Metrics Collection
### System Metrics
- **Node Exporter**: CPU, memory, disk, network statistics
- **SNMP Exporter**: Network equipment monitoring
- **Blackbox Exporter**: Service availability checks
### Container Metrics
- **cAdvisor**: Docker container resource usage
- **Portainer metrics**: Container orchestration metrics
- **Docker daemon metrics**: Docker engine statistics
### Application Metrics
- **Plex**: Media server performance metrics
- **Nginx**: Web server access and performance
- **Database metrics**: PostgreSQL, Redis performance
### Custom Metrics
- **Backup status**: Success/failure rates
- **Storage usage**: Disk space across all hosts
- **Network performance**: Bandwidth and latency
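The source does not say how the custom backup-status metric is produced. One common approach, sketched here as an assumption, is to have each backup job write a small file for Node Exporter's textfile collector. The metric names, the `backup_job` label, and the directory argument are hypothetical.

```bash
# Write backup status in Prometheus text exposition format.
# Assumes Node Exporter runs with --collector.textfile.directory pointing at $3.
write_backup_metric() {
  # $1 = backup job name, $2 = exit status of the backup command, $3 = textfile directory
  wbm_tmp="$3/backup_$1.prom.$$"
  if [ "$2" -eq 0 ]; then wbm_ok=1; else wbm_ok=0; fi
  {
    printf '# HELP homelab_backup_last_success 1 if the last backup run succeeded, 0 if it failed.\n'
    printf '# TYPE homelab_backup_last_success gauge\n'
    printf 'homelab_backup_last_success{backup_job="%s"} %s\n' "$1" "$wbm_ok"
    printf '# HELP homelab_backup_last_run_timestamp_seconds Unix time of the last backup run.\n'
    printf '# TYPE homelab_backup_last_run_timestamp_seconds gauge\n'
    printf 'homelab_backup_last_run_timestamp_seconds{backup_job="%s"} %s\n' "$1" "$(date +%s)"
  } > "$wbm_tmp"
  # Rename into place so the collector never reads a half-written file.
  mv "$wbm_tmp" "$3/backup_$1.prom"
}
```

A backup script could call `write_backup_metric prometheus "$?" /path/to/textfile-dir` immediately after the backup command, where the directory path depends on how Node Exporter is deployed.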
## Dashboard Categories
### Infrastructure Dashboards
- **Host Overview**: System resource utilization
- **Network Performance**: Bandwidth and connectivity
- **Storage Monitoring**: Disk usage and health
- **Docker Containers**: Container resource usage
### Service Dashboards
- **Media Services**: Plex, Arr suite performance
- **Web Services**: Nginx, application response times
- **Database Performance**: Query performance and connections
- **Backup Monitoring**: Backup job status and trends
### Security Dashboards
- **Authentication Events**: Login attempts and failures
- **Network Security**: Firewall logs and intrusion attempts
- **Certificate Monitoring**: SSL certificate expiration
- **Vulnerability Scanning**: Security scan results
## Alert Configuration
### Critical Alerts
- **Host down**: System unreachable
- **High resource usage**: CPU/Memory > 90%
- **Disk space critical**: < 10% free space
- **Service unavailable**: Key services down
### Warning Alerts
- **High resource usage**: CPU/Memory > 80%
- **Disk space low**: < 20% free space
- **Certificate expiring**: < 30 days to expiration
- **Backup failures**: Failed backup jobs
### Info Alerts
- **System updates**: Available updates
- **Maintenance windows**: Scheduled maintenance
- **Performance trends**: Unusual patterns
- **Capacity planning**: Resource growth trends
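The critical and warning thresholds listed above can be spelled out as a small classifier. This is only a sketch of the boundaries, assuming integer percentages and strict comparisons. The real rules presumably live in `alert-rules.yml` as PromQL expressions, and the function names here are hypothetical.

```bash
# CPU or memory utilisation in integer percent: > 90 critical, > 80 warning.
classify_usage() {
  if   [ "$1" -gt 90 ]; then echo critical
  elif [ "$1" -gt 80 ]; then echo warning
  else echo ok
  fi
}

# Free disk space in integer percent: < 10 critical, < 20 warning.
classify_disk_free() {
  if   [ "$1" -lt 10 ]; then echo critical
  elif [ "$1" -lt 20 ]; then echo warning
  else echo ok
  fi
}

# Days until a certificate expires: < 30 warning.
classify_cert_days() {
  if [ "$1" -lt 30 ]; then echo warning; else echo ok; fi
}
```

For example, `classify_usage 85` prints `warning` and `classify_disk_free 9` prints `critical`. Values exactly on a boundary (90% usage, 10% free) fall into the less severe level because the comparisons are strict.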
## Data Retention
### Prometheus Retention
- **Raw metrics**: 15 days high resolution
- **Downsampled**: 90 days medium resolution
- **Long-term**: 1 year low resolution
### Grafana Data
- **Dashboards**: Version controlled in Git
- **User preferences**: Backed up weekly
- **Annotations**: Retained for 1 year
### Log Retention
- **Application logs**: 30 days
- **System logs**: 90 days
- **Audit logs**: 1 year
- **Security logs**: 2 years
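The log retention periods above could be enforced with a scheduled pruning job. The sketch below assumes logs are plain files grouped in one directory per category, which the source does not state, and it treats a year as 365 days. Both function names are hypothetical.

```bash
# Map a log category to its retention period in days.
retention_days() {
  case "$1" in
    application) echo 30 ;;
    system)      echo 90 ;;
    audit)       echo 365 ;;
    security)    echo 730 ;;
    *) echo "unknown log category: $1" >&2; return 1 ;;
  esac
}

# Delete files in a directory that are older than the category's retention period.
prune_logs() {
  # $1 = log category, $2 = directory holding that category's log files
  pl_days=$(retention_days "$1") || return 1
  # -print lists each file as it is removed, so a cron job leaves a record.
  find "$2" -type f -mtime +"$pl_days" -print -delete
}
```

A daily cron entry could then run something like `prune_logs application /path/to/app-logs`, with the path adjusted to wherever the logs are actually written.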
## Backup and Recovery
### Configuration Backup
```bash
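# Back up the Prometheus configuration.
# Assumes /backup is a volume mounted into each container; $(date ...) is expanded by the host shell.
docker exec prometheus tar -czf "/backup/prometheus-config-$(date +%Y%m%d).tar.gz" /etc/prometheus/
# Back up the Grafana data directory (database, plugins, and any dashboards stored there)
docker exec grafana tar -czf "/backup/grafana-dashboards-$(date +%Y%m%d).tar.gz" /var/lib/grafana/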
```
### Data Backup
```bash
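# Back up Prometheus data.
# Archiving a live TSDB directory can capture an inconsistent state; stop the container first,
# or take a snapshot through the admin API (requires --web.enable-admin-api).
docker exec prometheus tar -czf "/backup/prometheus-data-$(date +%Y%m%d).tar.gz" /prometheus/
# Back up the Grafana database.
# Assumes the sqlite3 CLI is installed in the container; it may be missing from the stock image.
docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"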
```
### Disaster Recovery
1. **Restore configurations** from backup
2. **Redeploy containers** with restored configs
3. **Import historical data** if needed
4. **Verify alert routing** and dashboard functionality
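Step 1 requires picking the most recent archive. Because the backup commands embed `date +%Y%m%d` in each filename, lexical order matches chronological order. The helper below is a minimal sketch of that selection. The function name and the directory argument are hypothetical.

```bash
# Print the newest dated archive for a given prefix, or return non-zero if none exists.
latest_backup() {
  # $1 = backup directory, $2 = filename prefix such as "prometheus-config"
  lb_latest=""
  for lb_file in "$1"/"$2"-[0-9]*.tar.gz; do
    [ -e "$lb_file" ] || continue
    # Glob results are sorted, so the last match carries the latest YYYYMMDD stamp.
    lb_latest="$lb_file"
  done
  [ -n "$lb_latest" ] && printf '%s\n' "$lb_latest"
}
```

For example, `latest_backup /backup prometheus-config` would print the path of the newest configuration archive, which can then be extracted before the containers are redeployed.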
## Performance Optimization
### Prometheus Optimization
- **Recording rules**: Pre-calculate expensive queries
- **Metric relabeling**: Reduce cardinality
- **Storage optimization**: Efficient time series storage
- **Query optimization**: Efficient PromQL queries
### Grafana Optimization
- **Dashboard caching**: Reduce query load
- **Panel optimization**: Efficient visualizations
- **User management**: Role-based access control
- **Plugin management**: Only necessary plugins
### Network Optimization
- **Local metrics**: Minimize network traffic
- **Compression**: Enable metric compression
- **Batching**: Batch metric collection
- **Filtering**: Collect only necessary metrics
## Troubleshooting
### Common Issues
#### High Memory Usage
```bash
# Check Prometheus memory usage
docker stats prometheus
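# Reduce the retention period. Retention is a command-line flag, not a prometheus.yml setting:
# add --storage.tsdb.retention.time=7d to the Prometheus command in docker-compose.yml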
```
#### Missing Metrics
```bash
# Check target status
curl http://prometheus:9090/api/v1/targets
# List the metric names Prometheus has ingested (confirms that samples are arriving)
curl http://prometheus:9090/api/v1/label/__name__/values
```
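The raw response from `/api/v1/targets` is long, so a quick summary of target health helps when hunting for missing metrics. The filter below is a sketch that avoids a `jq` dependency by matching the `health` field textually. It assumes the response keeps that field name, and the function name is hypothetical.

```bash
# Read a /api/v1/targets response on stdin and print one "state=count" line per health value.
summarize_target_health() {
  grep -o '"health"[[:space:]]*:[[:space:]]*"[a-z]*"' \
    | sed 's/.*"\([a-z]*\)"$/\1/' \
    | sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

Typical use would be `curl -s http://prometheus:9090/api/v1/targets | summarize_target_health`. Any non-zero `down` count points at a scrape target to investigate.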
#### Dashboard Loading Issues
```bash
# Check Grafana logs
docker logs grafana
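# Verify datasource connectivity through Grafana's proxy (datasource ID 1).
# The URL is quoted so the shell does not interpret "?"; the call needs Grafana credentials or an API token.
curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"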
```
### Monitoring Health Checks
```bash
# Prometheus health
curl http://prometheus:9090/-/healthy
# Grafana health
curl http://grafana:3000/api/health
# AlertManager health
curl http://alertmanager:9093/-/healthy
```
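The three checks above can be wrapped into one function that reports each service and fails if any endpoint is unhealthy. This is a minimal sketch using the internal hostnames listed under Service Endpoints. The function names are hypothetical.

```bash
# Probe one health endpoint; print "<label>: ok" or "<label>: FAIL".
check_endpoint() {
  # $1 = label, $2 = health URL
  if curl -fsS --max-time 5 -o /dev/null "$2"; then
    printf '%s: ok\n' "$1"
  else
    printf '%s: FAIL\n' "$1"
    return 1
  fi
}

# Probe all three services; the exit status is non-zero if any probe failed.
check_stack() {
  cs_failures=0
  check_endpoint prometheus   http://prometheus:9090/-/healthy   || cs_failures=$((cs_failures + 1))
  check_endpoint grafana      http://grafana:3000/api/health     || cs_failures=$((cs_failures + 1))
  check_endpoint alertmanager http://alertmanager:9093/-/healthy || cs_failures=$((cs_failures + 1))
  [ "$cs_failures" -eq 0 ]
}
```

Running `check_stack` from a container on the monitoring network gives a one-line status per service, and the exit code makes it usable in cron jobs or deployment scripts.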
## Security Configuration
### Authentication
- **Grafana**: OAuth integration with Authentik
- **Prometheus**: Basic auth via reverse proxy
- **AlertManager**: Basic auth via reverse proxy
### Network Security
- **Internal network**: Isolated Docker network
- **Reverse proxy**: Nginx Proxy Manager
- **SSL termination**: Let's Encrypt certificates
- **Access control**: IP-based restrictions
### Data Security
- **Encryption at rest**: Encrypted storage volumes
- **Encryption in transit**: TLS for all communications
- **Access logging**: Comprehensive audit trails
- **Regular updates**: Automated security updates
## Integration Points
### External Systems
- **NTFY**: Push notifications for alerts
- **Email**: Backup notification channel
- **Slack**: Team notifications (optional)
- **PagerDuty**: Escalation for critical alerts
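The source does not describe how alerts reach NTFY. As an illustration only, the sketch below publishes a message to an ntfy topic over HTTP and maps the document's three alert levels to ntfy priorities. The `NTFY_URL` variable, the severity mapping, and both function names are assumptions.

```bash
# Map an alert severity to an ntfy priority name.
ntfy_priority() {
  case "$1" in
    critical) echo urgent ;;
    warning)  echo high ;;
    info)     echo low ;;
    *)        echo default ;;
  esac
}

# Publish an alert to the ntfy topic URL held in NTFY_URL.
send_alert() {
  # $1 = severity, $2 = notification title, $3 = message body
  curl -fsS --max-time 10 \
    -H "Title: $2" \
    -H "Priority: $(ntfy_priority "$1")" \
    -d "$3" \
    "${NTFY_URL:?set NTFY_URL to the ntfy topic URL}"
}
```

With `NTFY_URL` pointing at the homelab's topic, a script could call `send_alert critical "Host down" "nas unreachable"`. Alerts raised by AlertManager itself would normally be delivered through a webhook receiver rather than a script like this.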
### Automation
- **Ansible**: Configuration management
- **GitOps**: Version-controlled configurations
- **CI/CD**: Automated deployment pipeline
- **Backup automation**: Scheduled backups
## Future Enhancements
### Planned Features
- **Log aggregation**: Centralized log management
- **Distributed tracing**: Application tracing
- **Synthetic monitoring**: Proactive service testing
- **Machine learning**: Anomaly detection
### Scaling Considerations
- **High availability**: Multi-instance deployment
- **Load balancing**: Distribute query load
- **Federation**: Multi-cluster monitoring
- **Storage scaling**: Efficient long-term storage
---
**Status**: ✅ Comprehensive monitoring infrastructure operational across all homelab systems