Sanitized mirror from private repository - 2026-04-19 09:48:50 UTC

📊 Monitoring Infrastructure

Docker-based monitoring stack for comprehensive homelab observability

Overview

This directory holds the Docker Compose stack and configuration files for homelab monitoring: metrics collection, dashboards, and alerting across all hosts.

Architecture

Core Components

  • Prometheus - Metrics collection and storage
  • Grafana - Visualization and dashboards
  • AlertManager - Alert routing and management
  • Node Exporter - System metrics collection
  • cAdvisor - Container metrics collection

Deployment Structure

monitoring/
├── prometheus/
│   ├── prometheus.yml     # Main configuration
│   ├── alert-rules.yml    # Alert definitions
│   └── targets/           # Service discovery configs
├── grafana/
│   ├── provisioning/      # Dashboard and datasource configs
│   └── dashboards/        # JSON dashboard definitions
├── alertmanager/
│   └── alertmanager.yml   # Alert routing configuration
└── docker-compose.yml     # Complete monitoring stack
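
A quick way to confirm a checkout matches this layout is a small check function. The following is a minimal sketch: the function name is hypothetical, and the path list is copied from the tree above rather than from any script in this repository.

```shell
# Print every expected path that is missing under the given monitoring root.
# Returns 0 only when the whole layout is present.
check_monitoring_layout() {
  local root=${1:-.}
  local missing=0
  local path
  for path in \
    prometheus/prometheus.yml \
    prometheus/alert-rules.yml \
    prometheus/targets \
    grafana/provisioning \
    grafana/dashboards \
    alertmanager/alertmanager.yml \
    docker-compose.yml
  do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $path"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Running `check_monitoring_layout ./monitoring` before `docker compose up -d` would catch a missing config file before a container fails on startup.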

Service Endpoints

Internal Access

  • Prometheus: http://prometheus:9090
  • Grafana: http://grafana:3000
  • AlertManager: http://alertmanager:9093

External Access (via Nginx Proxy Manager)

  • Grafana: https://grafana.vish.gg
  • Prometheus: https://prometheus.vish.gg (admin only)
  • AlertManager: https://alerts.vish.gg (admin only)

Metrics Collection

System Metrics

  • Node Exporter: CPU, memory, disk, network statistics
  • SNMP Exporter: Network equipment monitoring
  • Blackbox Exporter: Service availability checks

Container Metrics

  • cAdvisor: Docker container resource usage
  • Portainer metrics: Container orchestration metrics
  • Docker daemon metrics: Docker engine statistics

Application Metrics

  • Plex: Media server performance metrics
  • Nginx: Web server access and performance
  • Database metrics: PostgreSQL, Redis performance

Custom Metrics

  • Backup status: Success/failure rates
  • Storage usage: Disk space across all hosts
  • Network performance: Bandwidth and latency
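
One common way to expose custom metrics such as backup status is the Node Exporter textfile collector, which reads `*.prom` files from the directory passed via `--collector.textfile.directory`. The sketch below assumes that approach; the metric names, the function name, and the output directory are illustrative and not taken from this repository.

```shell
# Write backup status metrics in the Prometheus text exposition format.
# Assumption: node_exporter runs with --collector.textfile.directory=<dir>.
write_backup_metrics() {
  local job=$1 exit_code=$2 dir=$3
  local tmp="$dir/backup_$job.prom.$$"
  local success=0
  if [ "$exit_code" -eq 0 ]; then
    success=1
  fi
  {
    echo "# HELP homelab_backup_last_run_success 1 if the last backup run succeeded, else 0."
    echo "# TYPE homelab_backup_last_run_success gauge"
    echo "homelab_backup_last_run_success{job=\"$job\"} $success"
    echo "# HELP homelab_backup_last_run_timestamp_seconds Unix time of the last backup run."
    echo "# TYPE homelab_backup_last_run_timestamp_seconds gauge"
    echo "homelab_backup_last_run_timestamp_seconds{job=\"$job\"} $(date +%s)"
  } > "$tmp"
  # A rename on the same filesystem is atomic, so the collector never reads a partial file.
  mv "$tmp" "$dir/backup_$job.prom"
}
```

A backup script could end with `write_backup_metrics nightly "$?" /var/lib/node_exporter/textfile` (path hypothetical), and an alert could then compare `time()` against the timestamp metric to detect backups that stopped running.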

Dashboard Categories

Infrastructure Dashboards

  • Host Overview: System resource utilization
  • Network Performance: Bandwidth and connectivity
  • Storage Monitoring: Disk usage and health
  • Docker Containers: Container resource usage

Service Dashboards

  • Media Services: Plex, Arr suite performance
  • Web Services: Nginx, application response times
  • Database Performance: Query performance and connections
  • Backup Monitoring: Backup job status and trends

Security Dashboards

  • Authentication Events: Login attempts and failures
  • Network Security: Firewall logs and intrusion attempts
  • Certificate Monitoring: SSL certificate expiration
  • Vulnerability Scanning: Security scan results

Alert Configuration

Critical Alerts

  • Host down: System unreachable
  • High resource usage: CPU/Memory > 90%
  • Disk space critical: < 10% free space
  • Service unavailable: Key services down

Warning Alerts

  • High resource usage: CPU/Memory > 80%
  • Disk space low: < 20% free space
  • Certificate expiring: < 30 days to expiration
  • Backup failures: Failed backup jobs
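
The critical and warning thresholds above could be expressed as Prometheus alerting rules. This is a sketch under stated assumptions: the alert names, the `for` durations, the 5-minute rate window, and the filesystem filter are illustrative choices, and the expressions assume Node Exporter metrics. Only CPU and disk are covered; memory would follow the same pattern.

```shell
# Write a rule file covering the host-down, CPU, and disk thresholds listed above.
# The heredoc is quoted so the shell leaves {{ $labels.instance }} untouched.
write_alert_rules() {
  cat > "$1" <<'EOF'
groups:
  - name: homelab-resources
    rules:
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
      - alert: HighCpuUsage
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 10m
        labels:
          severity: warning
      - alert: HighCpuUsageCritical
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
        labels:
          severity: critical
      - alert: DiskSpaceLow
        expr: 100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 20
        for: 15m
        labels:
          severity: warning
      - alert: DiskSpaceCritical
        expr: 100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 10
        for: 5m
        labels:
          severity: critical
EOF
}
```

After `write_alert_rules prometheus/alert-rules.yml`, the file can be validated with `promtool check rules prometheus/alert-rules.yml` before Prometheus is reloaded.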

Info Alerts

  • System updates: Available updates
  • Maintenance windows: Scheduled maintenance
  • Performance trends: Unusual patterns
  • Capacity planning: Resource growth trends

Data Retention

Prometheus Retention

  • Raw metrics: 15 days high resolution
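  • Downsampled: 90 days medium resolution (Prometheus does not downsample natively; this tier requires recording rules or an external long-term store)
  • Long-term: 1 year low resolution (same requirement as the downsampled tier)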

Grafana Data

  • Dashboards: Version controlled in Git
  • User preferences: Backed up weekly
  • Annotations: Retained for 1 year

Log Retention

  • Application logs: 30 days
  • System logs: 90 days
  • Audit logs: 1 year
  • Security logs: 2 years
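
If the logs are kept as plain files on disk, these retention periods could be enforced with a scheduled prune. This is only a sketch: the function name and any directory passed to it are hypothetical, and it does not apply to logs held in a log database or rotated by another tool.

```shell
# Delete regular files older than a given number of days under a log directory.
# Prints each path it removes so the run can be audited.
prune_logs() {
  local dir=$1 days=$2
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  # -mtime +N matches files whose age, counted in whole days, exceeds N.
  find "$dir" -type f -mtime +"$days" -print -delete
}
```

For example, `prune_logs /srv/logs/apps 30` and `prune_logs /srv/logs/system 90` would map to the first two tiers (paths hypothetical).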

Backup and Recovery

Configuration Backup

# Backup Prometheus configuration (assumes a /backup volume is mounted in the container)
docker exec prometheus tar -czf /backup/prometheus-config-$(date +%Y%m%d).tar.gz /etc/prometheus/

# Backup the Grafana data directory (dashboards saved through the UI are stored in grafana.db)
docker exec grafana tar -czf /backup/grafana-dashboards-$(date +%Y%m%d).tar.gz /var/lib/grafana/

Data Backup

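# Backup Prometheus data
# Note: archiving a live TSDB can capture partially written blocks; stop the container or use a TSDB snapshot for a consistent copy
docker exec prometheus tar -czf /backup/prometheus-data-$(date +%Y%m%d).tar.gz /prometheus/

# Backup Grafana database
# Note: requires the sqlite3 CLI inside the container, which the official Grafana image may not include
docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /backup/grafana-$(date +%Y%m%d).db"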
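
For a consistent copy of the Prometheus data, the TSDB admin API offers a snapshot endpoint. The sketch below assumes Prometheus was started with `--web.enable-admin-api` (not shown in this repository) and that `PROM_URL` points at the internal endpoint; the function name is hypothetical.

```shell
# Base URL of the Prometheus server; override via the environment if needed.
PROM_URL=${PROM_URL:-http://prometheus:9090}

# Ask Prometheus for a TSDB snapshot and print the snapshot directory name.
create_snapshot() {
  local response name
  response=$(curl -fsS -X POST "$PROM_URL/api/v1/admin/tsdb/snapshot") || return 1
  # Successful responses look like {"status":"success","data":{"name":"<dir>"}}.
  name=$(printf '%s' "$response" | sed -n 's/.*"name":"\([^"]*\)".*/\1/p')
  [ -n "$name" ] || { echo "unexpected response: $response" >&2; return 1; }
  echo "$name"
}
```

The snapshot is written to `snapshots/<name>` inside the data directory, so a follow-up such as `docker exec prometheus tar -czf /backup/prometheus-data-$(date +%Y%m%d).tar.gz -C /prometheus/snapshots "$(create_snapshot)"` would archive it rather than the live blocks.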

Disaster Recovery

  1. Restore configurations from backup
  2. Redeploy containers with restored configs
  3. Import historical data if needed
  4. Verify alert routing and dashboard functionality
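
The recovery steps above can be sketched as two helper functions. The function names, the staging directory, and the retry limits are assumptions; step 3 (importing historical data) is left out because it depends on how the data backup was taken.

```shell
# Step 1: unpack a configuration archive into a staging directory for review.
restore_archive() {
  local archive=$1 staging=$2
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  mkdir -p "$staging"
  tar -xzf "$archive" -C "$staging"
}

# Steps 2 and 4: recreate the containers, then poll Prometheus until it reports healthy.
# Assumes the current directory holds docker-compose.yml and that the
# prometheus hostname resolves from where the script runs.
redeploy_and_wait() {
  docker compose up -d || return 1
  local attempt=0
  until curl -fsS http://prometheus:9090/-/healthy >/dev/null 2>&1; do
    attempt=$((attempt + 1))
    [ "$attempt" -ge 30 ] && { echo "Prometheus did not become healthy" >&2; return 1; }
    sleep 2
  done
}
```

Because `tar` strips the leading `/` when creating the archives, restored files land under `<staging>/etc/prometheus/` and can be compared against the live configuration before they are copied into place.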

Performance Optimization

Prometheus Optimization

  • Recording rules: Pre-calculate expensive queries
  • Metric relabeling: Reduce cardinality
  • Storage optimization: Efficient time series storage
  • Query optimization: Efficient PromQL queries
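
As an illustration of the recording-rule item, the sketch below writes a rule that pre-computes per-instance CPU utilisation. The rule name, the group name, and the evaluation interval are assumptions, and the expression assumes Node Exporter as the metric source.

```shell
# Write a recording rule file that pre-computes per-instance CPU utilisation.
# The rule name follows the common level:metric:operations naming convention.
write_recording_rules() {
  cat > "$1" <<'EOF'
groups:
  - name: node-recording
    interval: 1m
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
EOF
}
```

Once the file is listed under `rule_files` in `prometheus.yml`, dashboards can query `instance:node_cpu_utilisation:rate5m` directly instead of re-evaluating the rate expression on every refresh.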

Grafana Optimization

  • Dashboard caching: Reduce query load
  • Panel optimization: Efficient visualizations
  • User management: Role-based access control
  • Plugin management: Only necessary plugins

Network Optimization

  • Local metrics: Minimize network traffic
  • Compression: Enable metric compression
  • Batching: Batch metric collection
  • Filtering: Collect only necessary metrics

Troubleshooting

Common Issues

High Memory Usage

# Check Prometheus memory usage
docker stats prometheus

# Reduce retention period
# Pass the flag to the Prometheus command in docker-compose.yml (it is a CLI flag, not a prometheus.yml key):
# --storage.tsdb.retention.time=7d

Missing Metrics

# Check target status
curl http://prometheus:9090/api/v1/targets

# List the metric names Prometheus has ingested
curl http://prometheus:9090/api/v1/label/__name__/values
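
The raw targets response is long, so a per-state count is often easier to read. This is a rough sketch without a JSON parser: the function name is hypothetical, and the pattern relies on Prometheus emitting compact JSON with no spaces around the colon.

```shell
# Base URL of the Prometheus server; override via the environment if needed.
PROM_URL=${PROM_URL:-http://prometheus:9090}

# Count active scrape targets by health state ("up", "down", "unknown").
target_health_summary() {
  curl -fsS "$PROM_URL/api/v1/targets?state=active" \
    | grep -o '"health":"[a-z]*"' \
    | sort | uniq -c
}
```

Any count next to `"health":"down"` points at a target whose `lastError` field in the full response is worth reading.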

Dashboard Loading Issues

# Check Grafana logs
docker logs grafana

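# Verify datasource connectivity (assumes datasource ID 1; the Grafana API needs credentials, e.g. -u user:password)
# The URL is quoted so the shell does not interpret the "?"
curl "http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up"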

Monitoring Health Checks

# Prometheus health
curl http://prometheus:9090/-/healthy

# Grafana health
curl http://grafana:3000/api/health

# AlertManager health
curl http://alertmanager:9093/-/healthy
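
The three checks can be combined into one function that reports per service and returns non-zero if any probe fails. This is a minimal sketch: the function name and the 5-second timeout are assumptions, while the URLs are the ones listed above.

```shell
# Probe each health endpoint and print one result line per service.
check_stack_health() {
  local failed=0 entry name url
  for entry in \
    "prometheus=http://prometheus:9090/-/healthy" \
    "grafana=http://grafana:3000/api/health" \
    "alertmanager=http://alertmanager:9093/-/healthy"
  do
    name=${entry%%=*}   # text before the first "="
    url=${entry#*=}     # text after the first "="
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "ok   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}
```

The non-zero exit status makes the function usable from cron or a CI job, for example `check_stack_health || echo "monitoring stack degraded"`.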

Security Configuration

Authentication

  • Grafana: OAuth integration with Authentik
  • Prometheus: Basic auth via reverse proxy
  • AlertManager: Basic auth via reverse proxy

Network Security

  • Internal network: Isolated Docker network
  • Reverse proxy: Nginx Proxy Manager
  • SSL termination: Let's Encrypt certificates
  • Access control: IP-based restrictions

Data Security

  • Encryption at rest: Encrypted storage volumes
  • Encryption in transit: TLS for all communications
  • Access logging: Comprehensive audit trails
  • Regular updates: Automated security updates

Integration Points

External Systems

  • NTFY: Push notifications for alerts
  • Email: Backup notification channel
  • Slack: Team notifications (optional)
  • PagerDuty: Escalation for critical alerts
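
For scripts that need to push a notification directly, ntfy accepts a plain HTTP POST with the message as the body and the title and priority as headers. In the sketch below, `NTFY_URL`, the topic name, and the function name are placeholders, not values from this repository.

```shell
# Base URL of the ntfy server; the default is a placeholder.
NTFY_URL=${NTFY_URL:-https://ntfy.example.com}

# Publish a message to an ntfy topic.
# Priority accepts ntfy's named levels: min, low, default, high, urgent.
notify_ntfy() {
  local topic=$1 title=$2 priority=$3 message=$4
  curl -fsS \
    -H "Title: $title" \
    -H "Priority: $priority" \
    -d "$message" \
    "$NTFY_URL/$topic" >/dev/null
}
```

A backup script could call `notify_ntfy homelab-alerts "Backup failed" high "nightly job exited with status 2"` when a job fails.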

Automation

  • Ansible: Configuration management
  • GitOps: Version-controlled configurations
  • CI/CD: Automated deployment pipeline
  • Backup automation: Scheduled backups

Future Enhancements

Planned Features

  • Log aggregation: Centralized log management
  • Distributed tracing: Application tracing
  • Synthetic monitoring: Proactive service testing
  • Machine learning: Anomaly detection

Scaling Considerations

  • High availability: Multi-instance deployment
  • Load balancing: Distribute query load
  • Federation: Multi-cluster monitoring
  • Storage scaling: Efficient long-term storage

Status: Comprehensive monitoring infrastructure operational across all homelab systems