# 📊 Monitoring Architecture

*Comprehensive monitoring and observability infrastructure for Vish's homelab*

## 🎯 Overview

The homelab monitoring architecture provides complete observability across all infrastructure components, services, and applications using a modern monitoring stack built on Prometheus, Grafana, and AlertManager.

## 🏗️ Architecture Components

### Core Monitoring Stack

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Grafana     │    │   Prometheus    │    │  AlertManager   │
│  Visualization  │◄───┤  Metrics Store  │◄───┤    Alerting     │
│   gf.vish.gg    │    │    Port 9090    │    │    Port 9093    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │    Exporters    │
                       │   Node, SNMP,   │
                       │    Container    │
                       └─────────────────┘
```

### Data Collection Layer

#### Node Exporters

- **Location**: All hosts (Atlantis, Calypso, Concord NUC, Homelab VM, RPi5)
- **Port**: 9100
- **Metrics**: CPU, memory, disk, network, system stats
- **Frequency**: 15-second scrape interval

#### SNMP Monitoring

- **Targets**: Synology NAS devices (Atlantis DS1823xs+, Calypso DS723+)
- **Metrics**: Storage usage, temperature, RAID status, network interfaces
- **Protocol**: SNMPv2c with community strings
- **Frequency**: 30-second scrape interval

#### Container Monitoring

- **cAdvisor**: Container resource usage and performance
- **Docker Metrics**: Container health, restart counts, image info
- **Portainer Integration**: Stack deployment status

## 📈 Metrics Collection

### System Metrics

- **CPU Usage**: Per-core utilization, load averages, context switches
- **Memory**: Usage, available, buffers, cache, swap
- **Storage**: Disk usage, I/O operations, read/write rates
- **Network**: Interface statistics, bandwidth utilization, packet counts

### Application Metrics

- **Container Health**: Running status, restart counts, resource limits
- **Service Availability**: HTTP response codes, response times
- **Database Performance**: Query times, connection counts
- **Custom Metrics**: Application-specific KPIs

### Infrastructure Metrics

- **NAS Health**: RAID status, disk temperatures, volume usage
- **Network Performance**: Latency, throughput, packet loss
- **Power Consumption**: UPS status, power draw (where available)
- **Environmental**: Temperature sensors, fan speeds

## 📊 Visualization & Dashboards

### Grafana Configuration

- **URL**: https://gf.vish.gg
- **Version**: Latest stable
- **Authentication**: Integrated with Authentik SSO
- **Data Sources**: Prometheus, InfluxDB (legacy)

### Dashboard Categories

#### Infrastructure Overview

- **System Health**: Multi-host overview with key metrics
- **Resource Utilization**: CPU, memory, storage across all hosts
- **Network Performance**: Bandwidth, latency, connectivity status
- **Storage Analytics**: Disk usage trends, RAID health, backup status

#### Service Monitoring

- **Container Status**: All running containers with health indicators
- **Application Performance**: Response times, error rates, throughput
- **GitOps Deployments**: Stack status, deployment history
- **Gaming Services**: Player counts, server performance, uptime

#### Specialized Dashboards

- **Synology NAS**: Detailed storage and system metrics
- **Tailscale Mesh**: VPN connectivity and performance
- **Security Monitoring**: Failed login attempts, firewall activity
- **Backup Verification**: Backup job status and data integrity

## 🚨 Alerting System

### AlertManager Configuration

- **High Availability**: Clustered deployment across multiple hosts
- **Notification Channels**: NTFY, email, webhook integrations
- **Alert Routing**: Based on severity, service, and host labels
- **Silencing**: Maintenance windows and temporary suppressions

### Alert Rules

#### Critical Alerts

- **Host Down**: Node exporter unreachable for > 5 minutes
- **High CPU**: Sustained > 90% CPU usage for > 10 minutes
- **Memory Exhaustion**: Available memory < 5% for > 5 minutes
- **Disk Full**: Filesystem usage > 95%
- **Service Down**: Critical service unavailable for > 2 minutes

#### Warning Alerts

- **High Resource Usage**: CPU > 80% or memory > 85% for > 15 minutes
- **Disk Space**: Filesystem usage > 85%
- **Container Restart**: Container restarted > 3 times in 1 hour
- **Network Issues**: High packet loss or latency spikes

#### Informational Alerts

- **Backup Completion**: Daily backup job status
- **Security Events**: SSH login attempts, firewall blocks
- **System Updates**: Available package updates
- **Certificate Expiry**: SSL certificates expiring within 30 days

## 🔧 Configuration Management

### Prometheus Configuration

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['atlantis:9100', 'calypso:9100', 'concord:9100']

  - job_name: 'snmp-synology'
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']
    metrics_path: /snmp
    params:
      module: [synology]
```

### Alert Rules

- **File**: `prometheus/alert-rules.yml`
- **Validation**: Automated syntax checking in CI/CD
- **Testing**: Alert rule unit tests for reliability
- **Documentation**: Each rule includes description and runbook links

## 📱 Notification System

### NTFY Integration

- **Server**: Self-hosted NTFY instance
- **Topics**: Separate channels for different alert severities
- **Mobile Apps**: Push notifications to admin devices
- **Web Interface**: Browser-based notification viewing

### Notification Routing

```
Critical Alerts → NTFY + Email + SMS
Warning Alerts  → NTFY + Email
Info Alerts     → NTFY only
Maintenance     → Dedicated maintenance channel
```

## 🔍 Log Management

### Centralized Logging

- **Collection**: Docker log drivers, syslog forwarding
- **Storage**: Local retention with rotation policies
- **Analysis**: Grafana Loki for log aggregation and search
- **Correlation**: Metrics and logs correlation in Grafana

### Log Sources

- **System Logs**: Syslog from all hosts
- **Container Logs**: Docker container stdout/stderr
- **Application Logs**: Service-specific log files
- **Security Logs**: Auth logs, firewall logs, intrusion detection

## 📊 Performance Optimization

### Query Optimization

- **Recording Rules**: Pre-computed expensive queries
- **Retention Policies**: Tiered storage with different retention periods
- **Downsampling**: Reduced resolution for historical data
- **Indexing**: Optimized label indexing for fast queries

### Resource Management

- **Memory Tuning**: Prometheus memory configuration
- **Storage Optimization**: Efficient time series storage
- **Network Efficiency**: Compression and batching
- **Caching**: Query result caching in Grafana

## 🔐 Security & Access Control

### Authentication

- **SSO Integration**: Authentik-based authentication
- **Role-Based Access**: Different permission levels
- **API Security**: Token-based API access
- **Network Security**: Internal network access only

### Data Protection

- **Encryption**: TLS for all communications
- **Backup**: Regular backup of monitoring data
- **Retention**: Compliance with data retention policies
- **Privacy**: Sensitive data scrubbing and anonymization

## 🚀 Future Enhancements

### Planned Improvements

- **Distributed Tracing**: OpenTelemetry integration
- **Machine Learning**: Anomaly detection and predictive alerting
- **Mobile Dashboard**: Dedicated mobile monitoring app
- **Advanced Analytics**: Custom metrics and business intelligence

### Scalability Considerations

- **Federation**: Multi-cluster Prometheus federation
- **High Availability**: Redundant monitoring infrastructure
- **Performance**: Horizontal scaling capabilities
- **Integration**: Additional data sources and exporters

## 📚 Documentation & Runbooks

### Operational Procedures

- **Alert Response**: Step-by-step incident response procedures
- **Maintenance**: Monitoring system maintenance procedures
- **Troubleshooting**: Common issues and resolution steps
- **Capacity Planning**: Resource growth and scaling guidelines

### Training Materials

- **Dashboard Usage**: Guide for reading and interpreting dashboards
- **Alert Management**: How to handle and resolve alerts
- **Query Language**: PromQL tutorial and best practices
- **Custom Metrics**: Adding new metrics and dashboards

---

**Architecture Version**: 2.0

**Last Updated**: February 24, 2026

**Status**: ✅ **PRODUCTION** - Full monitoring coverage

**Metrics Retention**: 15 days high-resolution, 1 year downsampled
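
As an illustration of how the critical thresholds listed under Alert Rules might be written in `prometheus/alert-rules.yml`, here is a minimal sketch. It is not the actual rules file: the group name, alert names, annotation text, and the `fstype` filter are assumptions, and the expressions assume the standard node exporter metric names plus the `node-exporter` job name from the scrape configuration. The Service Down rule is left out because the document does not say how service availability is probed.

```yaml
# Hypothetical sketch of prometheus/alert-rules.yml (critical alerts only).
# Group name, alert names, and annotation text are illustrative assumptions.
groups:
  - name: critical-alerts
    rules:
      # Host Down: node exporter unreachable for > 5 minutes
      - alert: HostDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node exporter on {{ $labels.instance }} is unreachable"

      # High CPU: sustained > 90% CPU usage for > 10 minutes
      # (100% minus the average idle fraction across all cores of a host)
      - alert: HighCpuUsage
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"

      # Memory Exhaustion: available memory < 5% for > 5 minutes
      - alert: MemoryExhaustion
        expr: 100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Available memory below 5% on {{ $labels.instance }}"

      # Disk Full: filesystem usage > 95% (no duration is stated, so no "for")
      # The fstype filter is an assumption to skip pseudo-filesystems.
      - alert: DiskFull
        expr: 100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) > 95
        labels:
          severity: critical
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is over 95% full"
```

A file like this can be syntax-checked with `promtool check rules prometheus/alert-rules.yml`. The `severity` label is what severity-based routing in AlertManager would match on.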