# 📊 Monitoring Infrastructure

*Docker-based monitoring stack for comprehensive homelab observability*

## Overview
This directory contains the Docker Compose stack and configuration files for homelab monitoring: metrics collection, dashboards, and alerting for hosts, containers, and services.

## Architecture

### Core Components
- **Prometheus** - Metrics collection and storage
- **Grafana** - Visualization and dashboards
- **AlertManager** - Alert routing and management
- **Node Exporter** - System metrics collection
- **cAdvisor** - Container metrics collection

### Deployment Structure
```
monitoring/
├── prometheus/
│   ├── prometheus.yml      # Main configuration
│   ├── alert-rules.yml     # Alert definitions
│   └── targets/           # Service discovery configs
├── grafana/
│   ├── provisioning/      # Dashboard and datasource configs
│   └── dashboards/        # JSON dashboard definitions
├── alertmanager/
│   └── alertmanager.yml   # Alert routing configuration
└── docker-compose.yml     # Complete monitoring stack
```
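
A minimal sketch of scaffolding this layout from a shell, for anyone starting from an empty directory. The `scaffold_monitoring` function name and the default `./monitoring` root are illustrative and not part of the repository.

```bash
# Hypothetical helper: create the directory tree shown above with empty config files.
scaffold_monitoring() {
  local root="${1:-./monitoring}"
  mkdir -p "$root/prometheus/targets" \
           "$root/grafana/provisioning" \
           "$root/grafana/dashboards" \
           "$root/alertmanager"
  # touch creates missing files and leaves the contents of existing ones unchanged
  touch "$root/prometheus/prometheus.yml" \
        "$root/prometheus/alert-rules.yml" \
        "$root/alertmanager/alertmanager.yml" \
        "$root/docker-compose.yml"
}
```

Calling `scaffold_monitoring` with no argument creates `./monitoring`; passing a path creates the tree there instead.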

## Service Endpoints

### Internal Access
- **Prometheus**: `http://prometheus:9090`
- **Grafana**: `http://grafana:3000`
- **AlertManager**: `http://alertmanager:9093`

### External Access (via Nginx Proxy Manager)
- **Grafana**: `https://grafana.vish.gg`
- **Prometheus**: `https://prometheus.vish.gg` (admin only)
- **AlertManager**: `https://alerts.vish.gg` (admin only)

## Metrics Collection

### System Metrics
- **Node Exporter**: CPU, memory, disk, network statistics
- **SNMP Exporter**: Network equipment monitoring
- **Blackbox Exporter**: Service availability checks
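
One way to register Node Exporter hosts is a file-based service discovery entry under `prometheus/targets/`. The sketch below writes such a file. The host names `nas.lan` and `pve.lan`, the file name `node.yml`, the `env` label, and the `write_node_targets` helper are assumptions for illustration; `prometheus.yml` would also need a `file_sd_configs` entry that points at this directory.

```bash
# Hypothetical helper: write a file_sd target list for Node Exporter (default port 9100).
write_node_targets() {
  local out="${1:-./prometheus/targets/node.yml}"
  mkdir -p "$(dirname "$out")"
  cat > "$out" <<'EOF'
# Example hosts only; replace with the real Node Exporter endpoints.
- targets:
    - nas.lan:9100
    - pve.lan:9100
  labels:
    env: homelab
EOF
}
```

Prometheus re-reads file-based discovery files when they change, so adding a host this way should not require a restart.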

### Container Metrics
- **cAdvisor**: Docker container resource usage
- **Portainer metrics**: Container orchestration metrics
- **Docker daemon metrics**: Docker engine statistics

### Application Metrics
- **Plex**: Media server performance metrics
- **Nginx**: Web server access and performance
- **Database metrics**: PostgreSQL, Redis performance

### Custom Metrics
- **Backup status**: Success/failure rates
- **Storage usage**: Disk space across all hosts
- **Network performance**: Bandwidth and latency
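
Custom metrics such as backup status can be exposed through Node Exporter's textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The following is a sketch under the assumption that this collector is enabled; the metric names, the `backup_job` label, and the `write_backup_metric` helper are hypothetical.

```bash
# Hypothetical helper: record the outcome of a backup job in Prometheus text format.
# Usage: write_backup_metric <textfile_dir> <job_name> <exit_code>
write_backup_metric() {
  local dir="$1" job="$2" exit_code="$3"
  local success=0
  if [ "$exit_code" -eq 0 ]; then
    success=1
  fi
  # The temporary name does not end in .prom, so the collector ignores it.
  local tmp="$dir/backup_${job}.prom.$$"
  {
    echo "# HELP backup_last_run_success Whether the last backup run succeeded (1) or failed (0)."
    echo "# TYPE backup_last_run_success gauge"
    echo "backup_last_run_success{backup_job=\"$job\"} $success"
    echo "# HELP backup_last_run_timestamp_seconds Unix time of the last backup run."
    echo "# TYPE backup_last_run_timestamp_seconds gauge"
    echo "backup_last_run_timestamp_seconds{backup_job=\"$job\"} $(date +%s)"
  } > "$tmp"
  # Rename into place so the collector never reads a half-written file.
  mv "$tmp" "$dir/backup_${job}.prom"
}
```

A backup script could call `write_backup_metric /path/to/textfile-dir nightly "$?"` immediately after the backup command, where the directory path is whatever the collector is configured to read.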

## Dashboard Categories

### Infrastructure Dashboards
- **Host Overview**: System resource utilization
- **Network Performance**: Bandwidth and connectivity
- **Storage Monitoring**: Disk usage and health
- **Docker Containers**: Container resource usage

### Service Dashboards
- **Media Services**: Plex, Arr suite performance
- **Web Services**: Nginx, application response times
- **Database Performance**: Query performance and connections
- **Backup Monitoring**: Backup job status and trends

### Security Dashboards
- **Authentication Events**: Login attempts and failures
- **Network Security**: Firewall logs and intrusion attempts
- **Certificate Monitoring**: SSL certificate expiration
- **Vulnerability Scanning**: Security scan results

## Alert Configuration

### Critical Alerts
- **Host down**: System unreachable
- **High resource usage**: CPU/Memory > 90%
- **Disk space critical**: < 10% free space
- **Service unavailable**: Key services down

### Warning Alerts
- **High resource usage**: CPU/Memory > 80%
- **Disk space low**: < 20% free space
- **Certificate expiring**: < 30 days to expiration
- **Backup failures**: Failed backup jobs

### Info Alerts
- **System updates**: Available updates
- **Maintenance windows**: Scheduled maintenance
- **Performance trends**: Unusual patterns
- **Capacity planning**: Resource growth trends
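
Several of the thresholds above can be expressed as Prometheus alerting rules. The sketch below writes a starter rules file. It assumes the Node Exporter metrics `node_filesystem_avail_bytes` and `node_filesystem_size_bytes` and the Blackbox Exporter metric `probe_ssl_earliest_cert_expiry` are being scraped; the alert names, `for` durations, and the `write_alert_rules` helper are illustrative and may differ from the rules actually deployed.

```bash
# Hypothetical helper: write a starter alert-rules.yml reflecting the thresholds listed above.
write_alert_rules() {
  local out="${1:-./prometheus/alert-rules.yml}"
  cat > "$out" <<'EOF'
groups:
  - name: homelab
    rules:
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
      - alert: DiskSpaceCritical
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20
        for: 15m
        labels:
          severity: warning
      - alert: CertificateExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 86400
        for: 1h
        labels:
          severity: warning
EOF
}
```

If `promtool` is installed, `promtool check rules prometheus/alert-rules.yml` can validate the generated file before Prometheus loads it.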

## Data Retention

### Prometheus Retention
- **Raw metrics**: 15 days high resolution
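- **Downsampled**: 90 days medium resolution (Prometheus does not downsample natively, so this tier requires recording rules or an external long-term store)
- **Long-term**: 1 year low resolution (same requirement as the downsampled tier)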

### Grafana Data
- **Dashboards**: Version controlled in Git
- **User preferences**: Backed up weekly
- **Annotations**: Retained for 1 year

### Log Retention
- **Application logs**: 30 days
- **System logs**: 90 days
- **Audit logs**: 1 year
- **Security logs**: 2 years
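
This document does not state how the log retention periods are enforced. One simple approach for file-based logs is a scheduled prune, sketched below; the `prune_older_than` helper and any directory passed to it are hypothetical.

```bash
# Hypothetical helper: delete regular files under <dir> by modification age.
# find rounds ages down to whole days, so "+30" matches files at least 31 days old.
prune_older_than() {
  local dir="$1" days="$2"
  find "$dir" -type f -mtime +"$days" -delete
}
```

For example, `prune_older_than /path/to/app-logs 30` would apply the 30-day application log policy to that directory, and a daily cron job or systemd timer could run one call per log category.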

## Backup and Recovery

### Configuration Backup
```bash
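# Both commands assume /backup is a volume mounted into the container.

# Back up the Prometheus configuration
docker exec prometheus tar -czf "/backup/prometheus-config-$(date +%Y%m%d).tar.gz" /etc/prometheus/

# Back up the Grafana data directory (includes grafana.db, where UI-created dashboards are stored)
docker exec grafana tar -czf "/backup/grafana-dashboards-$(date +%Y%m%d).tar.gz" /var/lib/grafana/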
```

### Data Backup
```bash
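# Back up Prometheus data (archiving a live TSDB can capture an inconsistent state;
# stop the container first, or use the snapshot API if the admin API is enabled)
docker exec prometheus tar -czf "/backup/prometheus-data-$(date +%Y%m%d).tar.gz" /prometheus/

# Back up the Grafana database (assumes the sqlite3 CLI is available inside the container)
docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"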
```

### Disaster Recovery
1. **Restore configurations** from backup
2. **Redeploy containers** with restored configs
3. **Import historical data** if needed
4. **Verify alert routing** and dashboard functionality
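
The first recovery step can be sketched as a small helper that unpacks a configuration archive into a staging directory for review before anything is overwritten. The `restore_config` name and the staging approach are assumptions, not the documented procedure.

```bash
# Hypothetical helper: unpack a configuration archive into a staging directory.
# Usage: restore_config <archive.tar.gz> <staging_dir>
restore_config() {
  local archive="$1" dest="$2"
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  mkdir -p "$dest"
  tar -xzf "$archive" -C "$dest"
}

# After copying the reviewed files into place:
#   docker compose up -d                        # step 2: redeploy containers
#   curl -f http://prometheus:9090/-/healthy    # step 4: verify the stack responds
```

Archives created with `tar -czf ... /etc/prometheus/` store paths without the leading slash, so the restored files appear under `<staging_dir>/etc/prometheus/`.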

## Performance Optimization

### Prometheus Optimization
- **Recording rules**: Pre-calculate expensive queries
- **Metric relabeling**: Reduce cardinality
- **Storage optimization**: Efficient time series storage
- **Query optimization**: Efficient PromQL queries
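
As an illustration of recording rules, the sketch below writes a rules file that pre-computes per-host CPU and memory utilisation from standard Node Exporter metrics. The rule names, the file name `recording-rules.yml`, and the `write_recording_rules` helper are assumptions; the file would also need to be listed under `rule_files` in `prometheus.yml`.

```bash
# Hypothetical helper: write recording rules that pre-compute common dashboard queries.
write_recording_rules() {
  local out="${1:-./prometheus/recording-rules.yml}"
  cat > "$out" <<'EOF'
groups:
  - name: homelab-recording
    rules:
      # Fraction of CPU time spent non-idle, averaged across cores per instance
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Fraction of memory that is not available to applications
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
EOF
}
```

Dashboards can then query `instance:node_cpu_utilisation:rate5m` directly instead of re-evaluating the `rate()` expression on every panel refresh.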

### Grafana Optimization
- **Dashboard caching**: Reduce query load
- **Panel optimization**: Efficient visualizations
- **User management**: Role-based access control
- **Plugin management**: Only necessary plugins

### Network Optimization
- **Local metrics**: Minimize network traffic
- **Compression**: Enable metric compression
- **Batching**: Batch metric collection
- **Filtering**: Collect only necessary metrics

## Troubleshooting

### Common Issues

#### High Memory Usage
```bash
# Check Prometheus memory usage
docker stats prometheus

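# Reduce the retention period. Retention is a command-line flag, not a prometheus.yml setting;
# set it in the Prometheus service's `command:` in docker-compose.yml:
#   --storage.tsdb.retention.time=7d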
```

#### Missing Metrics
```bash
# Check target status
curl http://prometheus:9090/api/v1/targets

# List the metric names Prometheus has ingested
curl http://prometheus:9090/api/v1/label/__name__/values
```

#### Dashboard Loading Issues
```bash
# Check Grafana logs
docker logs grafana

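# Verify datasource connectivity (the URL is quoted so the shell does not interpret "?";
# the Grafana API requires authentication, e.g. -u user:password or an API token)
curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"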
```

### Monitoring Health Checks
```bash
# Prometheus health
curl http://prometheus:9090/-/healthy

# Grafana health
curl http://grafana:3000/api/health

# AlertManager health
curl http://alertmanager:9093/-/healthy
```
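
The three checks above can be combined into one script that reports every service and exits non-zero if any is unhealthy. This is a sketch; the `probe` and `check_all` function names are illustrative.

```bash
# Probe a single URL. Fails on connection errors, timeouts, and HTTP status codes >= 400.
probe() {
  curl -fsS --max-time 5 -o /dev/null "$1" </dev/null
}

# Read "name url" pairs from stdin, print one status line per service,
# and return non-zero if any probe failed.
check_all() {
  local failed=0 name url
  while read -r name url; do
    [ -n "$name" ] || continue
    if probe "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}

# Example invocation:
#   check_all <<'EOF'
#   prometheus   http://prometheus:9090/-/healthy
#   grafana      http://grafana:3000/api/health
#   alertmanager http://alertmanager:9093/-/healthy
#   EOF
```

Keeping the probe in its own function makes it easy to swap in a different check, for example one that also inspects the response body.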

## Security Configuration

### Authentication
- **Grafana**: OAuth integration with Authentik
- **Prometheus**: Basic auth via reverse proxy
- **AlertManager**: Basic auth via reverse proxy

### Network Security
- **Internal network**: Isolated Docker network
- **Reverse proxy**: Nginx Proxy Manager
- **SSL termination**: Let's Encrypt certificates
- **Access control**: IP-based restrictions

### Data Security
- **Encryption at rest**: Encrypted storage volumes
- **Encryption in transit**: TLS for all external traffic, terminated at the reverse proxy; internal traffic stays on the isolated Docker network
- **Access logging**: Comprehensive audit trails
- **Regular updates**: Automated security updates

## Integration Points

### External Systems
- **NTFY**: Push notifications for alerts
- **Email**: Backup notification channel
- **Slack**: Team notifications (optional)
- **PagerDuty**: Escalation for critical alerts
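
For ad hoc notifications outside Alertmanager (for example from a backup script), ntfy accepts a plain HTTP POST with `Title` and `Priority` headers. The sketch below assumes the caller sets `NTFY_URL` to the server and topic; the severity-to-priority mapping and both function names are illustrative choices, not the configured alert routing.

```bash
# Map an alert severity to an ntfy priority name (ntfy accepts min, low, default, high, urgent).
ntfy_priority() {
  case "$1" in
    critical) echo urgent ;;
    warning)  echo high ;;
    *)        echo default ;;
  esac
}

# Publish a message. NTFY_URL (server plus topic) is a placeholder the caller must set.
# Usage: send_ntfy <severity> <title> <message>
send_ntfy() {
  local severity="$1" title="$2" message="$3"
  curl -fsS \
    -H "Title: $title" \
    -H "Priority: $(ntfy_priority "$severity")" \
    -d "$message" \
    "${NTFY_URL:?set NTFY_URL to the ntfy server and topic}"
}
```

With `NTFY_URL` exported, `send_ntfy critical "Backup failed" "nightly job exited with an error"` would publish an urgent notification to that topic.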

### Automation
- **Ansible**: Configuration management
- **GitOps**: Version-controlled configurations
- **CI/CD**: Automated deployment pipeline
- **Backup automation**: Scheduled backups

## Future Enhancements

### Planned Features
- **Log aggregation**: Centralized log management
- **Distributed tracing**: Application tracing
- **Synthetic monitoring**: Proactive service testing
- **Machine learning**: Anomaly detection

### Scaling Considerations
- **High availability**: Multi-instance deployment
- **Load balancing**: Distribute query load
- **Federation**: Multi-cluster monitoring
- **Storage scaling**: Efficient long-term storage

---
**Status**: ✅ Comprehensive monitoring infrastructure operational across all homelab systems