
# 📊 Monitoring Architecture

*Comprehensive monitoring and observability infrastructure for Vish's homelab*

## 🎯 Overview

The monitoring architecture covers the homelab's hosts, NAS devices, containers, and services. It is built on Prometheus (metrics collection and storage), Grafana (dashboards), and Alertmanager (alert routing and notification).

## 🏗️ Architecture Components

### Core Monitoring Stack
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Grafana      │    │   Prometheus    │    │  Alertmanager   │
│  Visualization  │◄───┤  Metrics Store  ├───►│   Alerting      │
│   gf.vish.gg    │    │   Port 9090     │    │   Port 9093     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                ▲
                                │
                       ┌────────┴────────┐
                       │   Exporters     │
                       │  Node, SNMP,    │
                       │  Container      │
                       └─────────────────┘
```

### Data Collection Layer

#### Node Exporters
- **Location**: All hosts (Atlantis, Calypso, Concord NUC, Homelab VM, RPi5)
- **Port**: 9100
- **Metrics**: CPU, memory, disk, network, system stats
- **Frequency**: 15-second scrape interval

#### SNMP Monitoring
- **Targets**: Synology NAS devices (Atlantis DS1823xs+, Calypso DS723+)
- **Metrics**: Storage usage, temperature, RAID status, network interfaces
- **Protocol**: SNMPv2c with community strings
- **Frequency**: 30-second scrape interval

#### Container Monitoring
- **cAdvisor**: Container resource usage and performance
- **Docker Metrics**: Container health, restart counts, image info
- **Portainer Integration**: Stack deployment status
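
A scrape job for cAdvisor might look like the sketch below. It assumes cAdvisor runs on each Docker host and listens on its default port 8080; the hostnames are borrowed from the node exporter targets and may not match the actual deployment.

```yaml
# Sketch only: hostnames and the presence of cAdvisor on each host are assumptions.
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['atlantis:8080', 'calypso:8080', 'concord:8080']
```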

## 📈 Metrics Collection

### System Metrics
- **CPU Usage**: Per-core utilization, load averages, context switches
- **Memory**: Usage, available, buffers, cache, swap
- **Storage**: Disk usage, I/O operations, read/write rates
- **Network**: Interface statistics, bandwidth utilization, packet counts

### Application Metrics
- **Container Health**: Running status, restart counts, resource limits
- **Service Availability**: HTTP response codes, response times
- **Database Performance**: Query times, connection counts
- **Custom Metrics**: Application-specific KPIs

### Infrastructure Metrics
- **NAS Health**: RAID status, disk temperatures, volume usage
- **Network Performance**: Latency, throughput, packet loss
- **Power Consumption**: UPS status, power draw (where available)
- **Environmental**: Temperature sensors, fan speeds

## 📊 Visualization & Dashboards

### Grafana Configuration
- **URL**: https://gf.vish.gg
- **Version**: Latest stable
- **Authentication**: Integrated with Authentik SSO
- **Data Sources**: Prometheus, InfluxDB (legacy)
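
The Prometheus data source can be declared through Grafana's file-based provisioning rather than the UI. The following is a minimal sketch; the file path and the `prometheus` hostname are assumptions, while port 9090 comes from the architecture diagram.

```yaml
# provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend performs the queries
    url: http://prometheus:9090   # assumed hostname
    isDefault: true
    jsonData:
      timeInterval: 15s           # matches the 15-second scrape interval
```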

### Dashboard Categories

#### Infrastructure Overview
- **System Health**: Multi-host overview with key metrics
- **Resource Utilization**: CPU, memory, storage across all hosts
- **Network Performance**: Bandwidth, latency, connectivity status
- **Storage Analytics**: Disk usage trends, RAID health, backup status

#### Service Monitoring
- **Container Status**: All running containers with health indicators
- **Application Performance**: Response times, error rates, throughput
- **GitOps Deployments**: Stack status, deployment history
- **Gaming Services**: Player counts, server performance, uptime

#### Specialized Dashboards
- **Synology NAS**: Detailed storage and system metrics
- **Tailscale Mesh**: VPN connectivity and performance
- **Security Monitoring**: Failed login attempts, firewall activity
- **Backup Verification**: Backup job status and data integrity

## 🚨 Alerting System

### Alertmanager Configuration
- **High Availability**: Clustered deployment across multiple hosts
- **Notification Channels**: NTFY, email, webhook integrations
- **Alert Routing**: Based on severity, service, and host labels
- **Silencing**: Maintenance windows and temporary suppressions

### Alert Rules

#### Critical Alerts
- **Host Down**: Node exporter unreachable for > 5 minutes
- **High CPU**: Sustained > 90% CPU usage for > 10 minutes
- **Memory Exhaustion**: Available memory < 5% for > 5 minutes
- **Disk Full**: Filesystem usage > 95%
- **Service Down**: Critical service unavailable for > 2 minutes
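
The host-level critical alerts above could be expressed as Prometheus rules roughly as follows. This is a sketch built on standard node_exporter metrics, not the contents of the actual rule file; the alert names, the `severity` label values, and the filesystem filter are assumptions. "Service Down" is left out because it depends on how service availability is probed.

```yaml
groups:
  - name: critical
    rules:
      - alert: HostDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node exporter on {{ $labels.instance }} unreachable for more than 5 minutes"

      - alert: HighCpuCritical
        # Busy fraction = 1 - idle fraction, averaged over all cores per host.
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.90
        for: 10m
        labels:
          severity: critical

      - alert: MemoryExhaustion
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05
        for: 5m
        labels:
          severity: critical

      - alert: DiskFull
        # Excludes pseudo filesystems; the fstype filter is an assumption.
        expr: |
          1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} > 0.95
        labels:
          severity: critical
```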

#### Warning Alerts
- **High Resource Usage**: CPU > 80% or memory > 85% for > 15 minutes
- **Disk Space**: Filesystem usage > 85%
- **Container Restart**: Container restarted > 3 times in 1 hour
- **Network Issues**: High packet loss or latency spikes
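
A possible encoding of the warning thresholds is sketched below. Alert names are hypothetical, and the container restart rule is an approximation that assumes cAdvisor's `container_start_time_seconds` metric is being scraped. "Network Issues" is omitted because no threshold is stated.

```yaml
groups:
  - name: warning
    rules:
      - alert: HighCpuWarning
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
        for: 15m
        labels:
          severity: warning

      - alert: HighMemoryWarning
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.85
        for: 15m
        labels:
          severity: warning

      - alert: DiskSpaceWarning
        expr: |
          1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} > 0.85
        labels:
          severity: warning

      - alert: ContainerRestarting
        # Each restart changes the container's reported start time, so counting
        # changes over one hour approximates the restart count.
        expr: changes(container_start_time_seconds{name!=""}[1h]) > 3
        labels:
          severity: warning
```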

#### Informational Alerts
- **Backup Completion**: Daily backup job status
- **Security Events**: SSH login attempts, firewall blocks
- **System Updates**: Available package updates
- **Certificate Expiry**: SSL certificates expiring within 30 days
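
Certificate expiry is commonly checked with blackbox_exporter, which this document does not mention; the rule below assumes it is deployed and probing the HTTPS endpoints. The alert name and `severity: info` label are illustrative.

```yaml
groups:
  - name: informational
    rules:
      - alert: CertificateExpiringSoon
        # probe_ssl_earliest_cert_expiry is a Unix timestamp exposed by
        # blackbox_exporter for TLS probes (assumed to be in use here).
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
        labels:
          severity: info
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in under 30 days"
```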

## 🔧 Configuration Management

### Prometheus Configuration
```yaml
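global:
  scrape_interval: 15s      # default for every job unless overridden
  evaluation_interval: 15s  # how often alert and recording rules are evaluated

rule_files:
  - "alert-rules.yml"

scrape_configs:
  # Host metrics from node_exporter.
  # Only three of the five hosts are listed; add Homelab VM and RPi5
  # under their actual hostnames.
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['atlantis:9100', 'calypso:9100', 'concord:9100']

  # Synology NAS metrics via snmp_exporter. The targets are the NAS
  # addresses; relabeling passes each one as the ?target= parameter and
  # sends the scrape itself to the exporter.
  - job_name: 'snmp-synology'
    scrape_interval: 30s    # SNMP polling runs at 30 seconds, not the 15s default
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']
    metrics_path: /snmp
    params:
      module: [synology]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116  # assumed exporter address; adjust as needed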
```

### Alert Rules
- **File**: `prometheus/alert-rules.yml`
- **Validation**: Automated syntax checking in CI/CD
- **Testing**: Alert rule unit tests for reliability
- **Documentation**: Each rule includes description and runbook links
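
Alert rule unit tests are written as YAML and run with `promtool test rules`. The sketch below assumes the rule file defines a hypothetical `HostDown` alert with the expression `up{job="node-exporter"} == 0`, `for: 5m`, a `severity: critical` label, and the summary annotation shown; the expected labels and annotations must be adjusted to match the real rule.

```yaml
# alert-rules.test.yml (hypothetical file name); run with:
#   promtool test rules alert-rules.test.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Eleven samples of 0, one per minute: the target is down throughout.
      - series: 'up{job="node-exporter", instance="atlantis:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # At 4m the alert is still pending, so nothing is expected to fire.
      - eval_time: 4m
        alertname: HostDown
        exp_alerts: []
      # At 6m the 5-minute "for" window has elapsed and the alert fires.
      - eval_time: 6m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: atlantis:9100
            exp_annotations:
              summary: "Node exporter on atlantis:9100 unreachable for more than 5 minutes"
```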

## 📱 Notification System

### NTFY Integration
- **Server**: Self-hosted NTFY instance
- **Topics**: Separate channels for different alert severities
- **Mobile Apps**: Push notifications to admin devices
- **Web Interface**: Browser-based notification viewing

### Notification Routing
```
Critical Alerts → NTFY + Email + SMS
Warning Alerts → NTFY + Email
Info Alerts → NTFY only
Maintenance → Dedicated maintenance channel
```
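
In Alertmanager terms, this routing could be sketched as a route tree keyed on a `severity` label. Everything concrete here is a placeholder: the SMTP host, the email addresses, the receiver names, and the `ntfy-bridge` URLs, which stand in for a hypothetical service that translates Alertmanager webhook payloads into NTFY messages. SMS and the maintenance channel are omitted because the delivery mechanism is not described.

```yaml
# alertmanager.yml sketch; all hosts, addresses and receiver names are placeholders.
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: ntfy-info                 # default: informational alerts go to NTFY only
  group_by: ['alertname', 'instance']
  routes:
    - matchers: ['severity="critical"']
      receiver: ntfy-email-critical
    - matchers: ['severity="warning"']
      receiver: ntfy-email-warning

receivers:
  - name: ntfy-info
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/alerts-info'
  - name: ntfy-email-warning
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/alerts-warning'
    email_configs:
      - to: 'admin@example.com'
  - name: ntfy-email-critical
    webhook_configs:
      - url: 'http://ntfy-bridge:8080/alerts-critical'
    email_configs:
      - to: 'admin@example.com'
```
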

## 🔍 Log Management

### Centralized Logging
- **Collection**: Docker log drivers, syslog forwarding
- **Storage**: Local retention with rotation policies
- **Analysis**: Grafana Loki for log aggregation and search
- **Correlation**: Metrics and logs correlation in Grafana
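
One way to ship container logs to Loki through a Docker log driver is the Loki Docker driver plugin, configured per service in Compose. This sketch assumes the plugin is installed on the host under the alias `loki` and that Loki is reachable at `loki:3100`; the service name and image are placeholders.

```yaml
# docker-compose.yml fragment (sketch)
services:
  example-service:               # hypothetical service name
    image: nginx:stable          # placeholder image
    logging:
      driver: loki               # requires the Loki Docker driver plugin
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"   # assumed Loki address
```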

### Log Sources
- **System Logs**: Syslog from all hosts
- **Container Logs**: Docker container stdout/stderr
- **Application Logs**: Service-specific log files
- **Security Logs**: Auth logs, firewall logs, intrusion detection

## 📊 Performance Optimization

### Query Optimization
- **Recording Rules**: Pre-computed expensive queries
- **Retention Policies**: Tiered storage with different retention periods
- **Downsampling**: Reduced resolution for historical data
- **Indexing**: Optimized label indexing for fast queries
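
As an illustration of recording rules, the group below pre-computes per-host CPU and memory utilisation so dashboards and alerts can query the short series name instead of re-evaluating the full expression. The rule names follow the common `level:metric:operations` convention and are not taken from the actual rule files.

```yaml
groups:
  - name: node-recording
    interval: 1m
    rules:
      # Per-host CPU utilisation as a 0-1 ratio, averaged across cores.
      - record: instance:node_cpu_utilisation:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Per-host memory utilisation as a 0-1 ratio.
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```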

### Resource Management
- **Memory Tuning**: Prometheus memory configuration
- **Storage Optimization**: Efficient time series storage
- **Network Efficiency**: Compression and batching
- **Caching**: Query result caching in Grafana
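
Storage and memory settings for Prometheus are usually applied where the container is defined. The Compose fragment below is a sketch: the 15-day retention matches the high-resolution window stated at the end of this document, while the volume layout and the 2 GB memory cap are illustrative assumptions. Prometheus does not downsample on its own, so the one-year downsampled tier would need a separate long-term store that is not shown here.

```yaml
# docker-compose.yml fragment (sketch)
services:
  prometheus:
    image: prom/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'   # high-resolution retention window
    volumes:
      - ./prometheus:/etc/prometheus          # assumed config directory
      - prometheus-data:/prometheus
    mem_limit: 2g                             # illustrative cap, not a measured value

volumes:
  prometheus-data:
```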

## 🔐 Security & Access Control

### Authentication
- **SSO Integration**: Authentik-based authentication
- **Role-Based Access**: Different permission levels
- **API Security**: Token-based API access
- **Network Security**: Internal network access only

### Data Protection
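- **Encryption**: TLS for web-facing endpoints such as Grafana; SNMPv2c polling is unencrypted by design and is confined to the internal network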
- **Backup**: Regular backup of monitoring data
- **Retention**: Compliance with data retention policies
- **Privacy**: Sensitive data scrubbing and anonymization

## 🚀 Future Enhancements

### Planned Improvements
- **Distributed Tracing**: OpenTelemetry integration
- **Machine Learning**: Anomaly detection and predictive alerting
- **Mobile Dashboard**: Dedicated mobile monitoring app
- **Advanced Analytics**: Custom metrics and business intelligence

### Scalability Considerations
- **Federation**: Multi-cluster Prometheus federation
- **High Availability**: Redundant monitoring infrastructure
- **Performance**: Horizontal scaling capabilities
- **Integration**: Additional data sources and exporters
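
If federation is adopted, a central Prometheus would scrape the `/federate` endpoint of each remote server. The sketch below follows the standard federation job layout; the `remote-prometheus` hostname and the selected series are assumptions.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true            # keep the labels assigned by the source server
    metrics_path: /federate
    params:
      'match[]':
        - '{job="node-exporter"}'
        - '{__name__=~"instance:.*"}'   # pre-aggregated recording-rule series
    static_configs:
      - targets: ['remote-prometheus:9090']   # hypothetical remote server
```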

## 📚 Documentation & Runbooks

### Operational Procedures
- **Alert Response**: Step-by-step incident response procedures
- **Maintenance**: Monitoring system maintenance procedures
- **Troubleshooting**: Common issues and resolution steps
- **Capacity Planning**: Resource growth and scaling guidelines

### Training Materials
- **Dashboard Usage**: Guide for reading and interpreting dashboards
- **Alert Management**: How to handle and resolve alerts
- **Query Language**: PromQL tutorial and best practices
- **Custom Metrics**: Adding new metrics and dashboards

---

- **Architecture Version**: 2.0
- **Last Updated**: February 24, 2026
- **Status**: ✅ **PRODUCTION** - Full monitoring coverage
- **Metrics Retention**: 15 days high-resolution, 1 year downsampled