Sanitized mirror from private repository - 2026-03-16 09:26:06 UTC
246
MONITORING_ARCHITECTURE.md
Normal file
@@ -0,0 +1,246 @@
# 📊 Monitoring Architecture

*Comprehensive monitoring and observability infrastructure for Vish's homelab*

## 🎯 Overview

The homelab monitoring architecture provides complete observability across all infrastructure components, services, and applications using a modern monitoring stack built on Prometheus, Grafana, and AlertManager.

## 🏗️ Architecture Components

### Core Monitoring Stack

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Grafana      │    │   Prometheus    │    │  AlertManager   │
│  Visualization  │◄───┤  Metrics Store  │◄───┤   Alerting      │
│   gf.vish.gg    │    │   Port 9090     │    │   Port 9093     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │   Exporters     │
                       │  Node, SNMP,    │
                       │   Container     │
                       └─────────────────┘
```

### Data Collection Layer

#### Node Exporters

- **Location**: All hosts (Atlantis, Calypso, Concord NUC, Homelab VM, RPi5)
- **Port**: 9100
- **Metrics**: CPU, memory, disk, network, system stats
- **Frequency**: 15-second scrape interval

#### SNMP Monitoring

- **Targets**: Synology NAS devices (Atlantis DS1823xs+, Calypso DS723+)
- **Metrics**: Storage usage, temperature, RAID status, network interfaces
- **Protocol**: SNMPv2c with community strings
- **Frequency**: 30-second scrape interval

#### Container Monitoring

- **cAdvisor**: Container resource usage and performance
- **Docker Metrics**: Container health, restart counts, image info
- **Portainer Integration**: Stack deployment status
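
A minimal sketch of how cAdvisor could be added as a scrape job, assuming it runs on each Docker host on its default port 8080. The job name and the target list are illustrative assumptions, not taken from the deployed configuration.

```yaml
# Fragment for prometheus.yml (sketch). Hostnames and port are assumptions.
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      # cAdvisor listens on 8080 unless started with a different --port
      - targets: ['atlantis:8080', 'calypso:8080']
```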

## 📈 Metrics Collection

### System Metrics

- **CPU Usage**: Per-core utilization, load averages, context switches
- **Memory**: Usage, available, buffers, cache, swap
- **Storage**: Disk usage, I/O operations, read/write rates
- **Network**: Interface statistics, bandwidth utilization, packet counts

### Application Metrics

- **Container Health**: Running status, restart counts, resource limits
- **Service Availability**: HTTP response codes, response times
- **Database Performance**: Query times, connection counts
- **Custom Metrics**: Application-specific KPIs

### Infrastructure Metrics

- **NAS Health**: RAID status, disk temperatures, volume usage
- **Network Performance**: Latency, throughput, packet loss
- **Power Consumption**: UPS status, power draw (where available)
- **Environmental**: Temperature sensors, fan speeds

## 📊 Visualization & Dashboards

### Grafana Configuration

- **URL**: https://gf.vish.gg
- **Version**: Latest stable
- **Authentication**: Integrated with Authentik SSO
- **Data Sources**: Prometheus, InfluxDB (legacy)
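
The Prometheus data source can be declared through Grafana's provisioning files instead of the UI. The sketch below assumes Grafana reaches Prometheus at `http://prometheus:9090`; that hostname and the file path are placeholders. The legacy InfluxDB source is left out because its connection details are not documented here.

```yaml
# provisioning/datasources/prometheus.yml (sketch; the URL is an assumption)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend proxies the queries
    url: http://prometheus:9090   # placeholder address
    isDefault: true
```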

### Dashboard Categories

#### Infrastructure Overview

- **System Health**: Multi-host overview with key metrics
- **Resource Utilization**: CPU, memory, storage across all hosts
- **Network Performance**: Bandwidth, latency, connectivity status
- **Storage Analytics**: Disk usage trends, RAID health, backup status

#### Service Monitoring

- **Container Status**: All running containers with health indicators
- **Application Performance**: Response times, error rates, throughput
- **GitOps Deployments**: Stack status, deployment history
- **Gaming Services**: Player counts, server performance, uptime

#### Specialized Dashboards

- **Synology NAS**: Detailed storage and system metrics
- **Tailscale Mesh**: VPN connectivity and performance
- **Security Monitoring**: Failed login attempts, firewall activity
- **Backup Verification**: Backup job status and data integrity

## 🚨 Alerting System

### AlertManager Configuration

- **High Availability**: Clustered deployment across multiple hosts
- **Notification Channels**: NTFY, email, webhook integrations
- **Alert Routing**: Based on severity, service, and host labels
- **Silencing**: Maintenance windows and temporary suppressions

### Alert Rules

#### Critical Alerts

- **Host Down**: Node exporter unreachable for > 5 minutes
- **High CPU**: Sustained > 90% CPU usage for > 10 minutes
- **Memory Exhaustion**: Available memory < 5% for > 5 minutes
- **Disk Full**: Filesystem usage > 95%
- **Service Down**: Critical service unavailable for > 2 minutes
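
The thresholds above map fairly directly onto Prometheus alerting rules. The following is a sketch rather than the contents of the real rule file: the alert names, the group name, the `fstype` filter, and the annotation text are assumptions, while the metric names are the standard node exporter ones. The Service Down rule is omitted because it depends on how each service is probed.

```yaml
groups:
  - name: critical-alerts          # hypothetical group name
    rules:
      - alert: HostDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'Node exporter on {{ $labels.instance }} has been unreachable for more than 5 minutes.'

      - alert: HighCpu
        # Busy share = 1 - idle share, averaged over all cores of a host
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
        labels:
          severity: critical

      - alert: MemoryExhaustion
        expr: 100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 5
        for: 5m
        labels:
          severity: critical

      - alert: DiskFull
        # Pseudo-filesystems are excluded; the filter is an assumption
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 95
        labels:
          severity: critical
```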

#### Warning Alerts

- **High Resource Usage**: CPU > 80% or memory > 85% for > 15 minutes
- **Disk Space**: Filesystem usage > 85%
- **Container Restart**: Container restarted > 3 times in 1 hour
- **Network Issues**: High packet loss or latency spikes
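
Two of these warnings are sketched below under stated assumptions. The combined CPU-or-memory condition uses PromQL's `or`, with both sides aggregated to the `instance` label so that each host yields at most one alert. The restart rule counts changes of cAdvisor's `container_start_time_seconds`, which is only one possible way to approximate restart counts. Alert and group names are hypothetical.

```yaml
groups:
  - name: warning-alerts           # hypothetical group name
    rules:
      - alert: HighResourceUsage
        expr: |
          100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
          or
          100 * (1 - avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 85
        for: 15m
        labels:
          severity: warning

      - alert: ContainerRestarting
        # Assumes cAdvisor is scraped and that a restart updates the start time
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        labels:
          severity: warning
```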

#### Informational Alerts

- **Backup Completion**: Daily backup job status
- **Security Events**: SSH login attempts, firewall blocks
- **System Updates**: Available package updates
- **Certificate Expiry**: SSL certificates expiring within 30 days
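
As an illustration of the certificate check, the rule below assumes the endpoints are probed by the Prometheus blackbox exporter, which exposes `probe_ssl_earliest_cert_expiry` as a Unix timestamp. The document does not say which exporter is actually used, so treat the metric source, alert name, and group name as assumptions.

```yaml
groups:
  - name: informational-alerts     # hypothetical group name
    rules:
      - alert: CertificateExpiringSoon
        # Seconds until expiry compared against 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
        labels:
          severity: info
        annotations:
          description: 'Certificate for {{ $labels.instance }} expires in less than 30 days.'
```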

## 🔧 Configuration Management

### Prometheus Configuration

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['atlantis:9100', 'calypso:9100', 'concord:9100']

  - job_name: 'snmp-synology'
    scrape_interval: 30s  # SNMP targets use the 30-second interval, not the 15s default
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']
    metrics_path: /snmp
    params:
      module: [synology]
    # Hand each NAS address to snmp_exporter as ?target=... and scrape the
    # exporter itself. The exporter address below is an assumption.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'snmp-exporter:9116'
```

### Alert Rules

- **File**: `prometheus/alert-rules.yml`
- **Validation**: Automated syntax checking in CI/CD
- **Testing**: Alert rule unit tests for reliability
- **Documentation**: Each rule includes description and runbook links
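
Syntax validation is typically done with `promtool check rules prometheus/alert-rules.yml`, and unit tests with `promtool test rules <test file>`. The test file below is a sketch: it assumes a rule named `HostDown` defined as `up{job="node-exporter"} == 0` with `for: 5m` and a `severity: critical` label, and it assumes the test file sits next to the rule file. Neither the rule name nor the test file name comes from the repository.

```yaml
# prometheus/alert-rules.test.yml (hypothetical file name)
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: up for two minutes, then down
      - series: 'up{job="node-exporter", instance="atlantis:9100"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # Down for only 2 minutes: the alert is pending, not firing
      - eval_time: 4m
        alertname: HostDown
        exp_alerts: []
      # Down for 7 minutes: the 5-minute "for" clause has elapsed
      - eval_time: 9m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: atlantis:9100
            # If the rule carries annotations, list them under
            # exp_annotations here, otherwise the comparison fails.
```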

## 📱 Notification System

### NTFY Integration

- **Server**: Self-hosted NTFY instance
- **Topics**: Separate channels for different alert severities
- **Mobile Apps**: Push notifications to admin devices
- **Web Interface**: Browser-based notification viewing

### Notification Routing

```
Critical Alerts → NTFY + Email + SMS
Warning Alerts  → NTFY + Email
Info Alerts     → NTFY only
Maintenance     → Dedicated maintenance channel
```
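
A routing tree along these lines could look like the Alertmanager sketch below. Alertmanager has no built-in NTFY receiver, so the example assumes a webhook endpoint (for instance a small bridge service) that forwards alerts to NTFY topics. Every URL, address, and receiver name is a placeholder. SMS delivery and the maintenance channel are left out because the document does not say how they are implemented.

```yaml
# alertmanager.yml (sketch; all hosts and addresses are placeholders)
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: ntfy-info              # default: info alerts go to NTFY only
  group_by: ['alertname', 'instance']
  routes:
    - matchers: ['severity="critical"']
      receiver: critical
    - matchers: ['severity="warning"']
      receiver: warning

receivers:
  - name: ntfy-info
    webhook_configs:
      - url: 'http://ntfy-bridge.example.internal:8080/info'

  - name: warning
    webhook_configs:
      - url: 'http://ntfy-bridge.example.internal:8080/warning'
    email_configs:
      - to: 'admin@example.com'

  - name: critical
    webhook_configs:
      - url: 'http://ntfy-bridge.example.internal:8080/critical'
    email_configs:
      - to: 'admin@example.com'
```

Routing on a `severity` label only works if every alert rule sets that label consistently.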

## 🔍 Log Management

### Centralized Logging

- **Collection**: Docker log drivers, syslog forwarding
- **Storage**: Local retention with rotation policies
- **Analysis**: Grafana Loki for log aggregation and search
- **Correlation**: Metrics and logs correlation in Grafana
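
One way to ship container logs to Loki through a Docker log driver is the Loki Docker driver plugin. The Compose fragment below is a sketch that assumes the plugin is installed on the host under the alias `loki` and that Loki is reachable at `http://loki:3100`. The service name, image, and URL are placeholders.

```yaml
# docker-compose.yml fragment (sketch; names and URL are assumptions)
services:
  example-app:
    image: example/app:latest
    logging:
      driver: loki                 # requires the Loki Docker driver plugin
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"
```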

### Log Sources

- **System Logs**: Syslog from all hosts
- **Container Logs**: Docker container stdout/stderr
- **Application Logs**: Service-specific log files
- **Security Logs**: Auth logs, firewall logs, intrusion detection

## 📊 Performance Optimization

### Query Optimization

- **Recording Rules**: Pre-computed expensive queries
- **Retention Policies**: Tiered storage with different retention periods
- **Downsampling**: Reduced resolution for historical data
- **Indexing**: Optimized label indexing for fast queries
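
Recording rules store the result of an expression as a new time series, so dashboards can read the precomputed series instead of re-evaluating the query on every refresh. The sketch below uses standard node exporter metrics. The rule names follow the common `level:metric:operations` convention and, like the group name and interval, are illustrative assumptions.

```yaml
groups:
  - name: node-recording-rules     # hypothetical group name
    interval: 1m
    rules:
      # Per-host CPU utilisation as a 0..1 ratio
      - record: instance:node_cpu_utilisation:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Per-host share of memory still available
      - record: instance:node_memory_available:ratio
        expr: avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```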

### Resource Management

- **Memory Tuning**: Prometheus memory configuration
- **Storage Optimization**: Efficient time series storage
- **Network Efficiency**: Compression and batching
- **Caching**: Query result caching in Grafana

## 🔐 Security & Access Control

### Authentication

- **SSO Integration**: Authentik-based authentication
- **Role-Based Access**: Different permission levels
- **API Security**: Token-based API access
- **Network Security**: Internal network access only

### Data Protection

- **Encryption**: TLS for all communications
- **Backup**: Regular backup of monitoring data
- **Retention**: Compliance with data retention policies
- **Privacy**: Sensitive data scrubbing and anonymization

## 🚀 Future Enhancements

### Planned Improvements

- **Distributed Tracing**: OpenTelemetry integration
- **Machine Learning**: Anomaly detection and predictive alerting
- **Mobile Dashboard**: Dedicated mobile monitoring app
- **Advanced Analytics**: Custom metrics and business intelligence

### Scalability Considerations

- **Federation**: Multi-cluster Prometheus federation
- **High Availability**: Redundant monitoring infrastructure
- **Performance**: Horizontal scaling capabilities
- **Integration**: Additional data sources and exporters
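
If federation is adopted, a central Prometheus would scrape the `/federate` endpoint of the others. The fragment below is a sketch of that pattern. The downstream target name and the `match[]` selector are assumptions chosen for illustration.

```yaml
# Fragment for the central prometheus.yml (sketch; target is a placeholder)
scrape_configs:
  - job_name: 'federate'
    honor_labels: true             # keep the labels set by the source server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'
    static_configs:
      - targets: ['prometheus-remote:9090']
```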

## 📚 Documentation & Runbooks

### Operational Procedures

- **Alert Response**: Step-by-step incident response procedures
- **Maintenance**: Monitoring system maintenance procedures
- **Troubleshooting**: Common issues and resolution steps
- **Capacity Planning**: Resource growth and scaling guidelines

### Training Materials

- **Dashboard Usage**: Guide for reading and interpreting dashboards
- **Alert Management**: How to handle and resolve alerts
- **Query Language**: PromQL tutorial and best practices
- **Custom Metrics**: Adding new metrics and dashboards

---

**Architecture Version**: 2.0
**Last Updated**: February 24, 2026
**Status**: ✅ **PRODUCTION** - Full monitoring coverage
**Metrics Retention**: 15 days high-resolution, 1 year downsampled