homelab-optimized/MONITORING_ARCHITECTURE.md
Sanitized mirror from private repository - 2026-03-12 11:57:19 UTC

# 📊 Monitoring Architecture
*Comprehensive monitoring and observability infrastructure for Vish's homelab*
## 🎯 Overview
The homelab monitoring architecture provides observability across infrastructure components, services, and applications. It is built on Prometheus (metrics collection and storage), Grafana (dashboards), and AlertManager (alert routing and notifications).
## 🏗️ Architecture Components
### Core Monitoring Stack
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Grafana     │    │   Prometheus    │    │  AlertManager   │
│  Visualization  │◄───┤  Metrics Store  │◄───┤    Alerting     │
│   gf.vish.gg    │    │    Port 9090    │    │    Port 9093    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                       ┌─────────────────┐
                       │    Exporters    │
                       │   Node, SNMP,   │
                       │    Container    │
                       └─────────────────┘
```
### Data Collection Layer
#### Node Exporters
- **Location**: All hosts (Atlantis, Calypso, Concord NUC, Homelab VM, RPi5)
- **Port**: 9100
- **Metrics**: CPU, memory, disk, network, system stats
- **Frequency**: 15-second scrape interval
#### SNMP Monitoring
- **Targets**: Synology NAS devices (Atlantis DS1823xs+, Calypso DS723+)
- **Metrics**: Storage usage, temperature, RAID status, network interfaces
- **Protocol**: SNMPv2c with community strings
- **Frequency**: 30-second scrape interval
#### Container Monitoring
- **cAdvisor**: Container resource usage and performance
- **Docker Metrics**: Container health, restart counts, image info
- **Portainer Integration**: Stack deployment status
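A scrape job for container metrics might look like the following sketch. The job name, the hostnames (reused from the node exporter targets), and port 8080 (cAdvisor's default) are assumptions, since this document does not state where cAdvisor runs.

```yaml
scrape_configs:
  - job_name: 'cadvisor'   # hypothetical job name
    static_configs:
      # Assumed targets: cAdvisor listening on its default port 8080
      - targets: ['atlantis:8080', 'calypso:8080']
```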
## 📈 Metrics Collection
### System Metrics
- **CPU Usage**: Per-core utilization, load averages, context switches
- **Memory**: Usage, available, buffers, cache, swap
- **Storage**: Disk usage, I/O operations, read/write rates
- **Network**: Interface statistics, bandwidth utilization, packet counts
### Application Metrics
- **Container Health**: Running status, restart counts, resource limits
- **Service Availability**: HTTP response codes, response times
- **Database Performance**: Query times, connection counts
- **Custom Metrics**: Application-specific KPIs
### Infrastructure Metrics
- **NAS Health**: RAID status, disk temperatures, volume usage
- **Network Performance**: Latency, throughput, packet loss
- **Power Consumption**: UPS status, power draw (where available)
- **Environmental**: Temperature sensors, fan speeds
## 📊 Visualization & Dashboards
### Grafana Configuration
- **URL**: https://gf.vish.gg
- **Version**: Latest stable
- **Authentication**: Integrated with Authentik SSO
- **Data Sources**: Prometheus, InfluxDB (legacy)
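The Prometheus data source can be declared through Grafana's file-based provisioning rather than the UI. The sketch below assumes a provisioning file under `/etc/grafana/provisioning/datasources/` and a Prometheus instance reachable as `prometheus:9090`; the hostname is hypothetical.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend performs the queries
    url: http://prometheus:9090    # assumed address, port taken from this document
    isDefault: true
```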
### Dashboard Categories
#### Infrastructure Overview
- **System Health**: Multi-host overview with key metrics
- **Resource Utilization**: CPU, memory, storage across all hosts
- **Network Performance**: Bandwidth, latency, connectivity status
- **Storage Analytics**: Disk usage trends, RAID health, backup status
#### Service Monitoring
- **Container Status**: All running containers with health indicators
- **Application Performance**: Response times, error rates, throughput
- **GitOps Deployments**: Stack status, deployment history
- **Gaming Services**: Player counts, server performance, uptime
#### Specialized Dashboards
- **Synology NAS**: Detailed storage and system metrics
- **Tailscale Mesh**: VPN connectivity and performance
- **Security Monitoring**: Failed login attempts, firewall activity
- **Backup Verification**: Backup job status and data integrity
## 🚨 Alerting System
### AlertManager Configuration
- **High Availability**: Clustered deployment across multiple hosts
- **Notification Channels**: NTFY, email, webhook integrations
- **Alert Routing**: Based on severity, service, and host labels
- **Silencing**: Maintenance windows and temporary suppressions
### Alert Rules
#### Critical Alerts
- **Host Down**: Node exporter unreachable for > 5 minutes
- **High CPU**: Sustained > 90% CPU usage for > 10 minutes
- **Memory Exhaustion**: Available memory < 5% for > 5 minutes
- **Disk Full**: Filesystem usage > 95%
- **Service Down**: Critical service unavailable for > 2 minutes
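Several of the critical thresholds above can be expressed as Prometheus alerting rules. This is a minimal sketch, not the contents of the actual rules file: the group name, alert names, `severity` label, and filesystem filter are assumptions, and the expressions use standard node exporter metrics.

```yaml
groups:
  - name: critical-alerts   # hypothetical group name
    rules:
      - alert: HostDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"

      - alert: HighCpuUsage
        # Busy CPU percentage, averaged over all cores of each instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"

      - alert: MemoryExhaustion
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Available memory below 5% on {{ $labels.instance }}"

      - alert: DiskFull
        # Assumed filter: ignore tmpfs and overlay filesystems
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 95
        labels:
          severity: critical
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is over 95% full"
```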
#### Warning Alerts
- **High Resource Usage**: CPU > 80% or memory > 85% for > 15 minutes
- **Disk Space**: Filesystem usage > 85%
- **Container Restart**: Container restarted > 3 times in 1 hour
- **Network Issues**: High packet loss or latency spikes
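Two of the warning conditions could be sketched as follows. The alert names and the `severity` label are assumptions. cAdvisor does not expose a restart counter directly, so the restart rule approximates restarts by counting changes in `container_start_time_seconds`; the real rules may use a different signal.

```yaml
groups:
  - name: warning-alerts   # hypothetical group name
    rules:
      - alert: DiskSpaceWarning
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is over 85% full"

      - alert: ContainerRestarting
        # Approximation: each restart changes the container's start timestamp
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} restarted more than 3 times in the last hour"
```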
#### Informational Alerts
- **Backup Completion**: Daily backup job status
- **Security Events**: SSH login attempts, firewall blocks
- **System Updates**: Available package updates
- **Certificate Expiry**: SSL certificates expiring within 30 days
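The certificate expiry check could be written as the rule below, assuming the endpoints are probed with the Prometheus blackbox exporter, which this document does not mention. `probe_ssl_earliest_cert_expiry` is the metric that exporter publishes for HTTPS probes; the alert name and `severity` value are hypothetical.

```yaml
groups:
  - name: info-alerts   # hypothetical group name
    rules:
      - alert: CertificateExpiringSoon
        # Seconds until expiry, compared against 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
        labels:
          severity: info
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires within 30 days"
```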
## 🔧 Configuration Management
### Prometheus Configuration
```yaml
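global:
  scrape_interval: 15s      # default scrape interval (node exporters)
  evaluation_interval: 15s  # how often alert rules are evaluated

rule_files:
  - "alert-rules.yml"

scrape_configs:
  # Host-level metrics from node exporters
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['atlantis:9100', 'calypso:9100', 'concord:9100']

  # Synology NAS metrics via SNMP
  - job_name: 'snmp-synology'
    scrape_interval: 30s    # SNMP targets use the 30-second interval noted above
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']
    metrics_path: /snmp
    params:
      module: [synology]
```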
```
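Two pieces are commonly added to a configuration like this one. With the SNMP exporter, Prometheus scrapes the exporter and passes the NAS address as a `target` parameter, which is done with relabeling. Prometheus also needs an `alerting` section to forward alerts to AlertManager. The sketch below assumes hypothetical hostnames `snmp-exporter` and `alertmanager`; port 9116 is the SNMP exporter's default and 9093 is the AlertManager port given in this document.

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed hostname

scrape_configs:
  - job_name: 'snmp-synology'
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']
    metrics_path: /snmp
    params:
      module: [synology]
    relabel_configs:
      # Pass the NAS address to the exporter as ?target=<address>
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the NAS address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the SNMP exporter (assumed address)
      - target_label: __address__
        replacement: snmp-exporter:9116
```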
### Alert Rules
- **File**: `prometheus/alert-rules.yml`
- **Validation**: Automated syntax checking in CI/CD
- **Testing**: Alert rule unit tests for reliability
- **Documentation**: Each rule includes description and runbook links
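Syntax validation and unit tests are typically run with `promtool check rules` and `promtool test rules`. The test file below is a sketch under one stated assumption: that the rules file defines a `HostDown` alert with `expr: up{job="node-exporter"} == 0`, `for: 5m`, a `severity: critical` label, and a single `summary` annotation of `Host {{ $labels.instance }} is down`. The actual rules may differ, in which case the expected labels and annotations need adjusting.

```yaml
# Hypothetical test file, run with: promtool test rules alert-rules.test.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target reports down (0) for eleven consecutive samples
      - series: 'up{job="node-exporter", instance="atlantis:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # After 6 minutes the 5-minute "for" clause has elapsed, so the alert fires
      - eval_time: 6m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: atlantis:9100
            exp_annotations:
              summary: "Host atlantis:9100 is down"
```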
## 📱 Notification System
### NTFY Integration
- **Server**: Self-hosted NTFY instance
- **Topics**: Separate channels for different alert severities
- **Mobile Apps**: Push notifications to admin devices
- **Web Interface**: Browser-based notification viewing
### Notification Routing
```
Critical Alerts → NTFY + Email + SMS
Warning Alerts → NTFY + Email
Info Alerts → NTFY only
Maintenance → Dedicated maintenance channel
```
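The routing table above could map onto an AlertManager configuration along these lines. Everything specific here is an assumption: the `severity` label values, the SMTP settings, the email address, and the webhook URLs. AlertManager has no built-in NTFY or SMS receiver, so the sketch routes both through hypothetical webhook bridges.

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'      # placeholder SMTP relay
  smtp_from: 'alertmanager@example.com'       # placeholder sender

route:
  receiver: info                  # default: informational alerts go to NTFY only
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: critical
    - matchers:
        - severity="warning"
      receiver: warning

receivers:
  - name: critical                # NTFY + Email + SMS
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/critical'   # hypothetical NTFY bridge
      - url: 'http://sms-gateway:8080/send'       # hypothetical SMS gateway
    email_configs:
      - to: 'admin@example.com'
  - name: warning                 # NTFY + Email
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/warning'
    email_configs:
      - to: 'admin@example.com'
  - name: info                    # NTFY only
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/info'
```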
## 🔍 Log Management
### Centralized Logging
- **Collection**: Docker log drivers, syslog forwarding
- **Storage**: Local retention with rotation policies
- **Analysis**: Grafana Loki for log aggregation and search
- **Correlation**: Metrics and logs correlation in Grafana
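One way to ship container logs to Loki through a Docker log driver is the Loki Docker logging plugin, configured per service in a Compose file. This sketch assumes the plugin is installed under the alias `loki` and that Loki is reachable at `loki:3100`; the service name and image are placeholders, and the actual collection setup may differ.

```yaml
services:
  example-app:              # hypothetical service
    image: nginx:stable     # placeholder image
    logging:
      driver: loki          # requires the Loki Docker logging plugin
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"   # assumed Loki address
```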
### Log Sources
- **System Logs**: Syslog from all hosts
- **Container Logs**: Docker container stdout/stderr
- **Application Logs**: Service-specific log files
- **Security Logs**: Auth logs, firewall logs, intrusion detection
## 📊 Performance Optimization
### Query Optimization
- **Recording Rules**: Pre-computed expensive queries
- **Retention Policies**: Tiered storage with different retention periods
- **Downsampling**: Reduced resolution for historical data
- **Indexing**: Optimized label indexing for fast queries
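A recording rule pre-computes an expensive expression so that dashboards can query the stored result instead. The following is a minimal sketch; the group name and the recorded series name (following the `level:metric:operations` naming convention) are assumptions.

```yaml
groups:
  - name: node-recording   # hypothetical group name
    interval: 1m
    rules:
      # Per-instance CPU utilisation as a 0..1 ratio
      - record: instance:node_cpu_utilisation:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

A dashboard panel can then query `instance:node_cpu_utilisation:ratio` directly rather than re-evaluating the `rate()` over every core on each refresh.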
### Resource Management
- **Memory Tuning**: Prometheus memory configuration
- **Storage Optimization**: Efficient time series storage
- **Network Efficiency**: Compression and batching
- **Caching**: Query result caching in Grafana
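Local storage retention in Prometheus is set with a command-line flag rather than in `prometheus.yml`. The Compose fragment below is a sketch that assumes Prometheus runs as a container from the `prom/prometheus` image; the 15-day value matches the high-resolution retention stated at the end of this document.

```yaml
services:
  prometheus:
    image: prom/prometheus
    command:
      # Overriding the command replaces the image defaults, so the config path is restated
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
```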
## 🔐 Security & Access Control
### Authentication
- **SSO Integration**: Authentik-based authentication
- **Role-Based Access**: Different permission levels
- **API Security**: Token-based API access
- **Network Security**: Internal network access only
### Data Protection
- **Encryption**: TLS for all communications
- **Backup**: Regular backup of monitoring data
- **Retention**: Compliance with data retention policies
- **Privacy**: Sensitive data scrubbing and anonymization
## 🚀 Future Enhancements
### Planned Improvements
- **Distributed Tracing**: OpenTelemetry integration
- **Machine Learning**: Anomaly detection and predictive alerting
- **Mobile Dashboard**: Dedicated mobile monitoring app
- **Advanced Analytics**: Custom metrics and business intelligence
### Scalability Considerations
- **Federation**: Multi-cluster Prometheus federation
- **High Availability**: Redundant monitoring infrastructure
- **Performance**: Horizontal scaling capabilities
- **Integration**: Additional data sources and exporters
## 📚 Documentation & Runbooks
### Operational Procedures
- **Alert Response**: Step-by-step incident response procedures
- **Maintenance**: Monitoring system maintenance procedures
- **Troubleshooting**: Common issues and resolution steps
- **Capacity Planning**: Resource growth and scaling guidelines
### Training Materials
- **Dashboard Usage**: Guide for reading and interpreting dashboards
- **Alert Management**: How to handle and resolve alerts
- **Query Language**: PromQL tutorial and best practices
- **Custom Metrics**: Adding new metrics and dashboards
---
**Architecture Version**: 2.0
**Last Updated**: February 24, 2026
**Status**: ✅ **PRODUCTION** - Full monitoring coverage
**Metrics Retention**: 15 days high-resolution, 1 year downsampled