Sanitized mirror from private repository - 2026-03-16 09:26:06 UTC

2026-03-16 09:26:06 +00:00
commit 231ee6e3d4
1217 changed files with 304021 additions and 0 deletions
--- a/MONITORING_ARCHITECTURE.md
+++ b/MONITORING_ARCHITECTURE.md
@@ -0,0 +1,246 @@
+# 📊 Monitoring Architecture
+
+*Comprehensive monitoring and observability infrastructure for Vish's homelab*
+
+## 🎯 Overview
+
+The homelab monitoring architecture covers every host, service, and container in the lab. Prometheus scrapes and stores metrics, Grafana visualizes them, and AlertManager routes the alerts that Prometheus fires.
+
+## 🏗️ Architecture Components
+
+### Core Monitoring Stack
+```
+┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
+│    Grafana      │    │   Prometheus    │    │  AlertManager   │
+│  Visualization  │◄───┤  Metrics Store  ├───►│   Alerting      │
+│   gf.vish.gg    │    │   Port 9090     │    │   Port 9093     │
+└─────────────────┘    └─────────────────┘    └─────────────────┘
+                                  ▲
+                                  │
+                                  │
+                    ┌─────────────────┐
+                    │   Exporters     │
+                    │  Node, SNMP,    │
+                    │  Container      │
+                    └─────────────────┘
+```
+
+### Data Collection Layer
+
+#### Node Exporters
+- **Location**: All hosts (Atlantis, Calypso, Concord NUC, Homelab VM, RPi5)
+- **Port**: 9100
+- **Metrics**: CPU, memory, disk, network, system stats
+- **Frequency**: 15-second scrape interval
+
+#### SNMP Monitoring
+- **Targets**: Synology NAS devices (Atlantis DS1823xs+, Calypso DS723+)
+- **Metrics**: Storage usage, temperature, RAID status, network interfaces
+- **Protocol**: SNMPv2c with community strings
+- **Frequency**: 30-second scrape interval
+
+#### Container Monitoring
+- **cAdvisor**: Container resource usage and performance
+- **Docker Metrics**: Container health, restart counts, image info
+- **Portainer Integration**: Stack deployment status
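+
+A minimal sketch of how cAdvisor could be added as a Prometheus scrape job. The hostnames are taken from the node exporter list, and port 8080 is cAdvisor's default; both are assumptions about this deployment rather than its confirmed configuration.
+
+```yaml
+scrape_configs:
+  - job_name: 'cadvisor'
+    static_configs:
+      # Assumed targets: one cAdvisor instance per Docker host, default port 8080
+      - targets: ['atlantis:8080', 'calypso:8080']
+```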
+
+## 📈 Metrics Collection
+
+### System Metrics
+- **CPU Usage**: Per-core utilization, load averages, context switches
+- **Memory**: Usage, available, buffers, cache, swap
+- **Storage**: Disk usage, I/O operations, read/write rates
+- **Network**: Interface statistics, bandwidth utilization, packet counts
+
+### Application Metrics
+- **Container Health**: Running status, restart counts, resource limits
+- **Service Availability**: HTTP response codes, response times
+- **Database Performance**: Query times, connection counts
+- **Custom Metrics**: Application-specific KPIs
+
+### Infrastructure Metrics
+- **NAS Health**: RAID status, disk temperatures, volume usage
+- **Network Performance**: Latency, throughput, packet loss
+- **Power Consumption**: UPS status, power draw (where available)
+- **Environmental**: Temperature sensors, fan speeds
+
+## 📊 Visualization & Dashboards
+
+### Grafana Configuration
+- **URL**: https://gf.vish.gg
+- **Version**: Latest stable
+- **Authentication**: Integrated with Authentik SSO
+- **Data Sources**: Prometheus, InfluxDB (legacy)
+
+### Dashboard Categories
+
+#### Infrastructure Overview
+- **System Health**: Multi-host overview with key metrics
+- **Resource Utilization**: CPU, memory, storage across all hosts
+- **Network Performance**: Bandwidth, latency, connectivity status
+- **Storage Analytics**: Disk usage trends, RAID health, backup status
+
+#### Service Monitoring
+- **Container Status**: All running containers with health indicators
+- **Application Performance**: Response times, error rates, throughput
+- **GitOps Deployments**: Stack status, deployment history
+- **Gaming Services**: Player counts, server performance, uptime
+
+#### Specialized Dashboards
+- **Synology NAS**: Detailed storage and system metrics
+- **Tailscale Mesh**: VPN connectivity and performance
+- **Security Monitoring**: Failed login attempts, firewall activity
+- **Backup Verification**: Backup job status and data integrity
+
+## 🚨 Alerting System
+
+### AlertManager Configuration
+- **High Availability**: Clustered deployment across multiple hosts
+- **Notification Channels**: NTFY, email, webhook integrations
+- **Alert Routing**: Based on severity, service, and host labels
+- **Silencing**: Maintenance windows and temporary suppressions
+
+### Alert Rules
+
+#### Critical Alerts
+- **Host Down**: Node exporter unreachable for > 5 minutes
+- **High CPU**: Sustained > 90% CPU usage for > 10 minutes
+- **Memory Exhaustion**: Available memory < 5% for > 5 minutes
+- **Disk Full**: Filesystem usage > 95%
+- **Service Down**: Critical service unavailable for > 2 minutes
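+
+The thresholds above could be expressed as Prometheus alerting rules along these lines. This is a sketch, not the contents of the actual rules file: the group name, alert names, and the `fstype` filter are illustrative assumptions, while the metric names are the standard node exporter ones.
+
+```yaml
+groups:
+  - name: critical-alerts            # hypothetical group name
+    rules:
+      - alert: HostDown
+        expr: up{job="node-exporter"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Node exporter on {{ $labels.instance }} unreachable for more than 5 minutes"
+
+      - alert: HighCpuUsage
+        # Percentage of non-idle CPU time per host, averaged over all cores
+        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
+        for: 10m
+        labels:
+          severity: critical
+
+      - alert: MemoryExhaustion
+        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
+        for: 5m
+        labels:
+          severity: critical
+
+      - alert: DiskFull
+        # Assumed filter: ignore tmpfs and overlay mounts
+        expr: |
+          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
+             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 95
+        labels:
+          severity: critical
+```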
+
+#### Warning Alerts
+- **High Resource Usage**: CPU > 80% or memory > 85% for > 15 minutes
+- **Disk Space**: Filesystem usage > 85%
+- **Container Restart**: Container restarted > 3 times in 1 hour
+- **Network Issues**: High packet loss or latency spikes
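+
+Two of the warning conditions could be sketched as follows. The container rule assumes cAdvisor metrics are being scraped and approximates restarts by counting changes in `container_start_time_seconds`; the alert names and the `name!=""` filter are illustrative assumptions.
+
+```yaml
+groups:
+  - name: warning-alerts             # hypothetical group name
+    rules:
+      - alert: DiskSpaceWarning
+        expr: |
+          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
+             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
+        labels:
+          severity: warning
+
+      - alert: ContainerRestartLoop
+        # Each restart changes the container's start time, so more than
+        # 3 changes in 1 hour approximates "restarted > 3 times in 1 hour"
+        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
+        labels:
+          severity: warning
+```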
+
+#### Informational Alerts
+- **Backup Completion**: Daily backup job status
+- **Security Events**: SSH login attempts, firewall blocks
+- **System Updates**: Available package updates
+- **Certificate Expiry**: TLS certificates expiring within 30 days
+
+## 🔧 Configuration Management
+
+### Prometheus Configuration
+```yaml
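+global:
+  scrape_interval: 15s       # default for every job unless overridden
+  evaluation_interval: 15s   # how often alert rules are evaluated
+
+rule_files:
+  - "alert-rules.yml"
+
+scrape_configs:
+  - job_name: 'node-exporter'
+    static_configs:
+      # Homelab VM and RPi5 also run node exporter; add their targets here
+      - targets: ['atlantis:9100', 'calypso:9100', 'concord:9100']
+
+  - job_name: 'snmp-synology'
+    scrape_interval: 30s     # matches the 30-second SNMP frequency stated above
+    static_configs:
+      - targets: ['192.168.0.200', '192.168.0.201']  # NAS devices to query
+    metrics_path: /snmp
+    params:
+      module: [synology]
+    relabel_configs:
+      # Pass the NAS address as ?target= and scrape the SNMP exporter itself
+      - source_labels: [__address__]
+        target_label: __param_target
+      - source_labels: [__param_target]
+        target_label: instance
+      - target_label: __address__
+        replacement: 'snmp-exporter:9116'  # assumed exporter address; adjust to the deployment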
+```
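+
+For fired alerts to reach AlertManager, the Prometheus configuration also needs an `alerting` section. A minimal sketch, assuming AlertManager is reachable under the hostname `alertmanager` (hypothetical) on port 9093 as shown in the architecture diagram:
+
+```yaml
+alerting:
+  alertmanagers:
+    - static_configs:
+        # Hostname is an assumption; the port comes from the diagram above
+        - targets: ['alertmanager:9093']
+```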
+
+### Alert Rules
+- **File**: `prometheus/alert-rules.yml`
+- **Validation**: Automated syntax checking in CI/CD
+- **Testing**: Alert rule unit tests for reliability
+- **Documentation**: Each rule includes description and runbook links
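+
+A rule unit test for `promtool test rules` might look like the sketch below. It assumes `alert-rules.yml` defines a `HostDown` alert with the expression `up{job="node-exporter"} == 0`, `for: 5m`, a `severity: critical` label, and a single `summary` annotation worded as shown; the actual rule may differ, in which case the expected labels and annotations need adjusting.
+
+```yaml
+rule_files:
+  - alert-rules.yml
+
+evaluation_interval: 1m
+
+tests:
+  - interval: 1m
+    input_series:
+      # The target reports down (0) for the first 10 minutes
+      - series: 'up{job="node-exporter", instance="atlantis:9100"}'
+        values: '0x10'
+    alert_rule_test:
+      # After 6 minutes the 5-minute "for" clause has elapsed, so the alert fires
+      - eval_time: 6m
+        alertname: HostDown
+        exp_alerts:
+          - exp_labels:
+              severity: critical
+              job: node-exporter
+              instance: atlantis:9100
+            exp_annotations:
+              summary: "Node exporter on atlantis:9100 unreachable for more than 5 minutes"
+```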
+
+## 📱 Notification System
+
+### NTFY Integration
+- **Server**: Self-hosted NTFY instance
+- **Topics**: Separate channels for different alert severities
+- **Mobile Apps**: Push notifications to admin devices
+- **Web Interface**: Browser-based notification viewing
+
+### Notification Routing
+```
+Critical Alerts → NTFY + Email + SMS
+Warning Alerts → NTFY + Email
+Info Alerts → NTFY only
+Maintenance → Dedicated maintenance channel
+```
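+
+The routing table above could map onto an AlertManager configuration roughly like this. Every hostname, URL, and address here is a placeholder: the sketch assumes a hypothetical bridge that forwards AlertManager webhooks to NTFY topics, a webhook-based SMS gateway, and an SMTP relay, none of which are specified in this document. The maintenance channel is left out.
+
+```yaml
+global:
+  smtp_smarthost: 'smtp.example.com:587'    # placeholder SMTP relay
+  smtp_from: 'alertmanager@example.com'     # placeholder sender
+
+route:
+  receiver: info                            # default: info alerts go to NTFY only
+  group_by: ['alertname', 'instance']
+  routes:
+    - matchers: ['severity="critical"']
+      receiver: critical
+    - matchers: ['severity="warning"']
+      receiver: warning
+
+receivers:
+  - name: critical                          # NTFY + Email + SMS
+    webhook_configs:
+      - url: 'http://ntfy-bridge:8080/alerts-critical'   # hypothetical NTFY bridge
+      - url: 'http://sms-gateway:8080/send'              # hypothetical SMS gateway
+    email_configs:
+      - to: 'admin@example.com'
+
+  - name: warning                           # NTFY + Email
+    webhook_configs:
+      - url: 'http://ntfy-bridge:8080/alerts-warning'
+    email_configs:
+      - to: 'admin@example.com'
+
+  - name: info                              # NTFY only
+    webhook_configs:
+      - url: 'http://ntfy-bridge:8080/alerts-info'
+```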
+
+## 🔍 Log Management
+
+### Centralized Logging
+- **Collection**: Docker log drivers, syslog forwarding
+- **Storage**: Local retention with rotation policies
+- **Analysis**: Grafana Loki for log aggregation and search
+- **Correlation**: Metrics and logs correlation in Grafana
+
+### Log Sources
+- **System Logs**: Syslog from all hosts
+- **Container Logs**: Docker container stdout/stderr
+- **Application Logs**: Service-specific log files
+- **Security Logs**: Auth logs, firewall logs, intrusion detection
+
+## 📊 Performance Optimization
+
+### Query Optimization
+- **Recording Rules**: Pre-computed expensive queries
+- **Retention Policies**: Tiered storage with different retention periods
+- **Downsampling**: Reduced resolution for historical data
+- **Indexing**: Optimized label indexing for fast queries
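+
+As an illustration of recording rules, the sketch below pre-computes two per-host ratios that dashboards would otherwise recalculate on every refresh. The group and rule names are hypothetical and follow the common `level:metric:operations` naming pattern.
+
+```yaml
+groups:
+  - name: node-recording             # hypothetical group name
+    interval: 1m
+    rules:
+      # Fraction of CPU time spent non-idle, per host
+      - record: instance:node_cpu_utilisation:ratio
+        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
+
+      # Fraction of memory still available, per host
+      - record: instance:node_memory_available:ratio
+        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
+```
+
+Dashboards and alert rules can then query `instance:node_cpu_utilisation:ratio` directly instead of repeating the full `rate()` expression.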
+
+### Resource Management
+- **Memory Tuning**: Prometheus memory configuration
+- **Storage Optimization**: Efficient time series storage
+- **Network Efficiency**: Compression and batching
+- **Caching**: Query result caching in Grafana
+
+## 🔐 Security & Access Control
+
+### Authentication
+- **SSO Integration**: Authentik-based authentication
+- **Role-Based Access**: Different permission levels
+- **API Security**: Token-based API access
+- **Network Security**: Internal network access only
+
+### Data Protection
+- **Encryption**: TLS for all communications
+- **Backup**: Regular backup of monitoring data
+- **Retention**: Compliance with data retention policies
+- **Privacy**: Sensitive data scrubbing and anonymization
+
+## 🚀 Future Enhancements
+
+### Planned Improvements
+- **Distributed Tracing**: OpenTelemetry integration
+- **Machine Learning**: Anomaly detection and predictive alerting
+- **Mobile Dashboard**: Dedicated mobile monitoring app
+- **Advanced Analytics**: Custom metrics and business intelligence
+
+### Scalability Considerations
+- **Federation**: Multi-cluster Prometheus federation
+- **High Availability**: Redundant monitoring infrastructure
+- **Performance**: Horizontal scaling capabilities
+- **Integration**: Additional data sources and exporters
+
+## 📚 Documentation & Runbooks
+
+### Operational Procedures
+- **Alert Response**: Step-by-step incident response procedures
+- **Maintenance**: Monitoring system maintenance procedures
+- **Troubleshooting**: Common issues and resolution steps
+- **Capacity Planning**: Resource growth and scaling guidelines
+
+### Training Materials
+- **Dashboard Usage**: Guide for reading and interpreting dashboards
+- **Alert Management**: How to handle and resolve alerts
+- **Query Language**: PromQL tutorial and best practices
+- **Custom Metrics**: Adding new metrics and dashboards
+
+---
+
+**Architecture Version**: 2.0  
+**Last Updated**: February 24, 2026  
+**Status**: ✅ **PRODUCTION** - Full monitoring coverage  
+**Metrics Retention**: 15 days high-resolution, 1 year downsampled