Sanitized mirror from private repository - 2026-04-19 08:15:48 UTC

📊 Monitoring Architecture

Comprehensive monitoring and observability infrastructure for Vish's homelab

🎯 Overview

The homelab monitoring architecture provides observability across infrastructure components, services, and applications. Prometheus scrapes and stores metrics, Grafana visualizes them, and AlertManager routes the resulting alerts.

🏗️ Architecture Components

Core Monitoring Stack

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Grafana      │    │   Prometheus    │    │  AlertManager   │
│  Visualization  │◄───┤  Metrics Store  ├───►│   Alerting      │
│   gf.vish.gg    │    │   Port 9090     │    │   Port 9093     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                ▲
                                │
                       ┌────────┴────────┐
                       │   Exporters     │
                       │  Node, SNMP,    │
                       │  Container      │
                       └─────────────────┘

Data Collection Layer

Node Exporters

  • Location: All hosts (Atlantis, Calypso, Concord NUC, Homelab VM, RPi5)
  • Port: 9100
  • Metrics: CPU, memory, disk, network, system stats
  • Frequency: 15-second scrape interval

SNMP Monitoring

  • Targets: Synology NAS devices (Atlantis DS1823xs+, Calypso DS723+)
  • Metrics: Storage usage, temperature, RAID status, network interfaces
  • Protocol: SNMPv2c with community strings
  • Frequency: 30-second scrape interval

Container Monitoring

  • cAdvisor: Container resource usage and performance
  • Docker Metrics: Container health, restart counts, image info
  • Portainer Integration: Stack deployment status
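
As a rough sketch of how cAdvisor could be wired into Prometheus, the scrape job below reuses the hostnames from the node-exporter targets in the Prometheus configuration. The job name and port 8080 (cAdvisor's default) are assumptions, not values taken from the actual deployment.

```yaml
scrape_configs:
  - job_name: 'cadvisor'            # hypothetical job name
    static_configs:
      # Port 8080 is cAdvisor's default; adjust if it is published elsewhere
      - targets: ['atlantis:8080', 'calypso:8080', 'concord:8080']
```

With a job like this in place, queries such as `rate(container_cpu_usage_seconds_total[5m])` and `container_memory_usage_bytes` give per-container CPU and memory usage.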

📈 Metrics Collection

System Metrics

  • CPU Usage: Per-core utilization, load averages, context switches
  • Memory: Usage, available, buffers, cache, swap
  • Storage: Disk usage, I/O operations, read/write rates
  • Network: Interface statistics, bandwidth utilization, packet counts

Application Metrics

  • Container Health: Running status, restart counts, resource limits
  • Service Availability: HTTP response codes, response times
  • Database Performance: Query times, connection counts
  • Custom Metrics: Application-specific KPIs

Infrastructure Metrics

  • NAS Health: RAID status, disk temperatures, volume usage
  • Network Performance: Latency, throughput, packet loss
  • Power Consumption: UPS status, power draw (where available)
  • Environmental: Temperature sensors, fan speeds

📊 Visualization & Dashboards

Grafana Configuration

  • URL: https://gf.vish.gg
  • Version: Latest stable
  • Authentication: Integrated with Authentik SSO
  • Data Sources: Prometheus, InfluxDB (legacy)
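
One way to declare the Prometheus data source is a Grafana provisioning file. This is a minimal sketch: the `prometheus` hostname is an assumption (only port 9090 comes from this document), and the legacy InfluxDB source is left out because its database settings are not documented here.

```yaml
# e.g. provisioning/datasources/prometheus.yml (path is illustrative)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                   # Grafana's backend proxies the queries
    url: http://prometheus:9090     # hypothetical hostname
    isDefault: true
```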

Dashboard Categories

Infrastructure Overview

  • System Health: Multi-host overview with key metrics
  • Resource Utilization: CPU, memory, storage across all hosts
  • Network Performance: Bandwidth, latency, connectivity status
  • Storage Analytics: Disk usage trends, RAID health, backup status

Service Monitoring

  • Container Status: All running containers with health indicators
  • Application Performance: Response times, error rates, throughput
  • GitOps Deployments: Stack status, deployment history
  • Gaming Services: Player counts, server performance, uptime

Specialized Dashboards

  • Synology NAS: Detailed storage and system metrics
  • Tailscale Mesh: VPN connectivity and performance
  • Security Monitoring: Failed login attempts, firewall activity
  • Backup Verification: Backup job status and data integrity

🚨 Alerting System

AlertManager Configuration

  • High Availability: Clustered deployment across multiple hosts
  • Notification Channels: NTFY, email, webhook integrations
  • Alert Routing: Based on severity, service, and host labels
  • Silencing: Maintenance windows and temporary suppressions

Alert Rules

Critical Alerts

  • Host Down: Node exporter unreachable for > 5 minutes
  • High CPU: Sustained > 90% CPU usage for > 10 minutes
  • Memory Exhaustion: Available memory < 5% for > 5 minutes
  • Disk Full: Filesystem usage > 95%
  • Service Down: Critical service unavailable for > 2 minutes
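
The first four thresholds above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch, not the contents of the real rule file: the alert names, the `fstype` filter, the annotation text, and the `example.com` runbook URLs are all assumptions, and the metrics are the standard node exporter ones.

```yaml
groups:
  - name: critical                  # hypothetical group name
    rules:
      - alert: HostDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.instance }} has been unreachable for more than 5 minutes."
          runbook_url: "https://example.com/runbooks/host-down"    # placeholder

      - alert: HighCPU
        # Busy share = 1 - idle share, averaged over all cores of a host
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          description: "CPU usage on {{ $labels.instance }} has stayed above 90% for 10 minutes."
          runbook_url: "https://example.com/runbooks/high-cpu"     # placeholder

      - alert: MemoryExhaustion
        expr: 100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 5
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Available memory on {{ $labels.instance }} is below 5%."
          runbook_url: "https://example.com/runbooks/memory"       # placeholder

      - alert: DiskFull
        # Skip pseudo filesystems that would otherwise produce noise
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 95
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.mountpoint }} on {{ $labels.instance }} is more than 95% full."
          runbook_url: "https://example.com/runbooks/disk-full"    # placeholder
```

The `severity` label is what alert routing would key on, so warning and informational rules can follow the same shape with a different label value.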

Warning Alerts

  • High Resource Usage: CPU > 80% or memory > 85% for > 15 minutes
  • Disk Space: Filesystem usage > 85%
  • Container Restart: Container restarted > 3 times in 1 hour
  • Network Issues: High packet loss or latency spikes

Informational Alerts

  • Backup Completion: Daily backup job status
  • Security Events: SSH login attempts, firewall blocks
  • System Updates: Available package updates
  • Certificate Expiry: SSL certificates expiring within 30 days

🔧 Configuration Management

Prometheus Configuration

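global:
  scrape_interval: 15s        # default for jobs without their own interval
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['atlantis:9100', 'calypso:9100', 'concord:9100']

  - job_name: 'snmp-synology'
    scrape_interval: 30s      # SNMP targets are polled every 30 seconds
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']
    metrics_path: /snmp
    params:
      module: [synology]
    relabel_configs:
      # Pass each NAS address to the SNMP exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the NAS address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself; this address is an assumption
      - target_label: __address__
        replacement: 'snmp-exporter:9116'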

Alert Rules

  • File: prometheus/alert-rules.yml
  • Validation: Automated syntax checking in CI/CD
  • Testing: Alert rule unit tests for reliability
  • Documentation: Each rule includes description and runbook links
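
A rule unit test for `promtool test rules` might look like the sketch below. It assumes that `alert-rules.yml` defines a `HostDown` alert with the expression `up{job="node-exporter"} == 0`, `for: 5m`, a `severity: critical` label, a `description` annotation of `{{ $labels.instance }} has been unreachable for more than 5 minutes.`, and a placeholder `runbook_url`. The real rule may differ, and the expected labels and annotations must match it exactly.

```yaml
# Hypothetical file name: alert-rules.test.yml
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target is up for two samples, then down from the 2-minute mark
      - series: 'up{job="node-exporter", instance="atlantis:9100"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # Down for only 2 minutes: the alert is still pending, so nothing fires
      - eval_time: 4m
        alertname: HostDown
        exp_alerts: []
      # Down for 7 minutes: past the 5-minute "for" window, so the alert fires
      - eval_time: 9m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: atlantis:9100
            exp_annotations:
              description: "atlantis:9100 has been unreachable for more than 5 minutes."
              runbook_url: "https://example.com/runbooks/host-down"
```

Running `promtool test rules alert-rules.test.yml` in CI would then fail the build if the rule's threshold, duration, or labels drift from what the test expects.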

📱 Notification System

NTFY Integration

  • Server: Self-hosted NTFY instance
  • Topics: Separate channels for different alert severities
  • Mobile Apps: Push notifications to admin devices
  • Web Interface: Browser-based notification viewing

Notification Routing

Critical Alerts → NTFY + Email + SMS
Warning Alerts → NTFY + Email
Info Alerts → NTFY only
Maintenance → Dedicated maintenance channel
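
The severity routing above maps naturally onto an AlertManager route tree. The following is a sketch under several assumptions: alerts carry a `severity` label, NTFY and SMS are reached through webhook bridges, and every hostname, URL, and address shown is a placeholder. The maintenance channel is omitted because its matching label is not documented here.

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'      # placeholder
  smtp_from: 'alertmanager@example.com'       # placeholder

route:
  receiver: info                              # default: NTFY only
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: critical
    - matchers:
        - severity="warning"
      receiver: warning

receivers:
  - name: critical                            # NTFY + Email + SMS
    webhook_configs:
      - url: 'http://ntfy-bridge.example.internal/critical'   # placeholder bridge
      - url: 'http://sms-gateway.example.internal/send'       # placeholder gateway
    email_configs:
      - to: 'admin@example.com'
  - name: warning                             # NTFY + Email
    webhook_configs:
      - url: 'http://ntfy-bridge.example.internal/warning'
    email_configs:
      - to: 'admin@example.com'
  - name: info                                # NTFY only
    webhook_configs:
      - url: 'http://ntfy-bridge.example.internal/info'
```

Using the informational receiver as the default route means an alert without a recognized severity still reaches NTFY rather than being dropped.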

🔍 Log Management

Centralized Logging

  • Collection: Docker log drivers, syslog forwarding
  • Storage: Local retention with rotation policies
  • Analysis: Grafana Loki for log aggregation and search
  • Correlation: Metrics and logs cross-referenced in Grafana

Log Sources

  • System Logs: Syslog from all hosts
  • Container Logs: Docker container stdout/stderr
  • Application Logs: Service-specific log files
  • Security Logs: Auth logs, firewall logs, intrusion detection

📊 Performance Optimization

Query Optimization

  • Recording Rules: Pre-computed expensive queries
  • Retention Policies: Tiered storage with different retention periods
  • Downsampling: Reduced resolution for historical data
  • Indexing: Optimized label indexing for fast queries
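
Recording rules of the kind mentioned above could look like this sketch. The group name and rule names are illustrative (they follow the common `level:metric:operations` naming pattern), and the expressions use standard node exporter metrics.

```yaml
groups:
  - name: node-recording-rules      # hypothetical group name
    interval: 1m
    rules:
      # Fraction of CPU time spent non-idle, per host
      - record: instance:node_cpu_utilization:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Fraction of memory still available, per host
      - record: instance:node_memory_available:ratio
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```

Dashboards and alerts can then query the precomputed series instead of re-evaluating the `rate()` over every core on each refresh.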

Resource Management

  • Memory Tuning: Prometheus memory configuration
  • Storage Optimization: Efficient time series storage
  • Network Efficiency: Compression and batching
  • Caching: Query result caching in Grafana

🔐 Security & Access Control

Authentication

  • SSO Integration: Authentik-based authentication
  • Role-Based Access: Different permission levels
  • API Security: Token-based API access
  • Network Security: Internal network access only

Data Protection

  • Encryption: TLS for HTTP-based communications (SNMPv2c polling itself is unencrypted)
  • Backup: Regular backup of monitoring data
  • Retention: Compliance with data retention policies
  • Privacy: Sensitive data scrubbing and anonymization

🚀 Future Enhancements

Planned Improvements

  • Distributed Tracing: OpenTelemetry integration
  • Machine Learning: Anomaly detection and predictive alerting
  • Mobile Dashboard: Dedicated mobile monitoring app
  • Advanced Analytics: Custom metrics and business intelligence

Scalability Considerations

  • Federation: Multi-cluster Prometheus federation
  • High Availability: Redundant monitoring infrastructure
  • Performance: Horizontal scaling capabilities
  • Integration: Additional data sources and exporters
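
If federation is adopted, the pulling Prometheus would typically scrape the `/federate` endpoint of the others. This sketch only illustrates the shape of such a job: the source hostname is a placeholder, and the `match[]` selector is an example.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true              # keep the labels set by the source server
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node-exporter"}'   # which series to pull; adjust as needed
    static_configs:
      - targets:
          - 'source-prometheus:9090'   # placeholder hostname
```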

📚 Documentation & Runbooks

Operational Procedures

  • Alert Response: Step-by-step incident response procedures
  • Maintenance: Monitoring system maintenance procedures
  • Troubleshooting: Common issues and resolution steps
  • Capacity Planning: Resource growth and scaling guidelines

Training Materials

  • Dashboard Usage: Guide for reading and interpreting dashboards
  • Alert Management: How to handle and resolve alerts
  • Query Language: PromQL tutorial and best practices
  • Custom Metrics: Adding new metrics and dashboards

Architecture Version: 2.0
Last Updated: February 24, 2026
Status: PRODUCTION - Full monitoring coverage
Metrics Retention: 15 days high-resolution, 1 year downsampled