📊 Monitoring Architecture
Comprehensive monitoring and observability infrastructure for Vish's homelab
🎯 Overview
The homelab monitoring architecture provides complete observability across all infrastructure components, services, and applications using a modern monitoring stack built on Prometheus, Grafana, and AlertManager.
🏗️ Architecture Components
Core Monitoring Stack
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│     Grafana     │    │   Prometheus    │    │  AlertManager   │
│  Visualization  │◄───┤  Metrics Store  │◄───┤    Alerting     │
│   gf.vish.gg    │    │    Port 9090    │    │    Port 9093    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      ▲
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │    Exporters    │
                       │   Node, SNMP,   │
                       │    Container    │
                       └─────────────────┘
Data Collection Layer
Node Exporters
- Location: All hosts (Atlantis, Calypso, Concord NUC, Homelab VM, RPi5)
- Port: 9100
- Metrics: CPU, memory, disk, network, system stats
- Frequency: 15-second scrape interval
SNMP Monitoring
- Targets: Synology NAS devices (Atlantis DS1823xs+, Calypso DS723+)
- Metrics: Storage usage, temperature, RAID status, network interfaces
- Protocol: SNMPv2c with community strings
- Frequency: 30-second scrape interval
Container Monitoring
- cAdvisor: Container resource usage and performance
- Docker Metrics: Container health, restart counts, image info
- Portainer Integration: Stack deployment status
📈 Metrics Collection
System Metrics
- CPU Usage: Per-core utilization, load averages, context switches
- Memory: Usage, available, buffers, cache, swap
- Storage: Disk usage, I/O operations, read/write rates
- Network: Interface statistics, bandwidth utilization, packet counts
Application Metrics
- Container Health: Running status, restart counts, resource limits
- Service Availability: HTTP response codes, response times
- Database Performance: Query times, connection counts
- Custom Metrics: Application-specific KPIs
Infrastructure Metrics
- NAS Health: RAID status, disk temperatures, volume usage
- Network Performance: Latency, throughput, packet loss
- Power Consumption: UPS status, power draw (where available)
- Environmental: Temperature sensors, fan speeds
📊 Visualization & Dashboards
Grafana Configuration
- URL: https://gf.vish.gg
- Version: Latest stable
- Authentication: Integrated with Authentik SSO
- Data Sources: Prometheus, InfluxDB (legacy)
Dashboard Categories
Infrastructure Overview
- System Health: Multi-host overview with key metrics
- Resource Utilization: CPU, memory, storage across all hosts
- Network Performance: Bandwidth, latency, connectivity status
- Storage Analytics: Disk usage trends, RAID health, backup status
Service Monitoring
- Container Status: All running containers with health indicators
- Application Performance: Response times, error rates, throughput
- GitOps Deployments: Stack status, deployment history
- Gaming Services: Player counts, server performance, uptime
Specialized Dashboards
- Synology NAS: Detailed storage and system metrics
- Tailscale Mesh: VPN connectivity and performance
- Security Monitoring: Failed login attempts, firewall activity
- Backup Verification: Backup job status and data integrity
🚨 Alerting System
AlertManager Configuration
- High Availability: Clustered deployment across multiple hosts
- Notification Channels: NTFY, email, webhook integrations
- Alert Routing: Based on severity, service, and host labels
- Silencing: Maintenance windows and temporary suppressions
Alert Rules
Critical Alerts
- Host Down: Node exporter unreachable for > 5 minutes
- High CPU: Sustained > 90% CPU usage for > 10 minutes
- Memory Exhaustion: Available memory < 5% for > 5 minutes
- Disk Full: Filesystem usage > 95%
- Service Down: Critical service unavailable for > 2 minutes
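The critical thresholds above could be expressed as Prometheus alerting rules roughly as follows. This is a minimal sketch, not the contents of the actual rule file: the alert names, the `severity` label, the annotation text, and the filesystem-type filter are assumptions, and the expressions rely on standard node exporter metric names and the `node-exporter` job name.

```yaml
groups:
  - name: critical                      # hypothetical group name
    rules:
      - alert: HostDown
        # `up` is 0 when the last scrape of the target failed
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node exporter on {{ $labels.instance }} is unreachable"

      - alert: HighCpu
        # busy percentage = 100 * (1 - idle fraction), averaged over all cores
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
        labels:
          severity: critical

      - alert: MemoryExhaustion
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05
        for: 5m
        labels:
          severity: critical

      - alert: DiskFull
        # pseudo-filesystems are excluded; the exact filter is an assumption
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 95
        labels:
          severity: critical
```

The Service Down rule is left out because it depends on how service availability is probed, which this document does not specify.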
Warning Alerts
- High Resource Usage: CPU > 80% or memory > 85% for > 15 minutes
- Disk Space: Filesystem usage > 85%
- Container Restart: Container restarted > 3 times in 1 hour
- Network Issues: High packet loss or latency spikes
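The combined "CPU or memory" warning can be written as a single rule with the PromQL `or` operator. The sketch below assumes node exporter metrics; the alert names and the `severity: warning` label are illustrative, not taken from the real rule file.

```yaml
groups:
  - name: warning                       # hypothetical group name
    rules:
      - alert: HighResourceUsage
        # Both sides are aggregated to the `instance` label so that `or`
        # yields at most one series per host.
        expr: |
          (100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80)
          or
          (100 * (1 - avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 85)
        for: 15m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 85
        labels:
          severity: warning
```

When a host exceeds both thresholds, `or` keeps the left-hand (CPU) sample, so the alert value shows CPU usage in that case.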
Informational Alerts
- Backup Completion: Daily backup job status
- Security Events: SSH login attempts, firewall blocks
- System Updates: Available package updates
- Certificate Expiry: SSL certificates expiring within 30 days
🔧 Configuration Management
Prometheus Configuration
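global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['atlantis:9100', 'calypso:9100', 'concord:9100']

  - job_name: 'snmp-synology'
    scrape_interval: 30s  # SNMP targets use the 30-second interval, overriding the global 15s
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']
    metrics_path: /snmp
    params:
      module: [synology]
    # Note: scraping through snmp_exporter also requires relabel_configs that
    # pass each target as the ?target= parameter (not shown in this excerpt).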
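The `snmp-synology` job as excerpted would have Prometheus scrape the NAS addresses directly. With snmp_exporter the usual pattern is to relabel each target into the `target` query parameter and point the scrape at the exporter itself. The sketch below shows that pattern plus an `alerting` block. The hostnames `snmp-exporter` and `alertmanager` are placeholders, not names from this deployment; 9116 is snmp_exporter's default port and 9093 is the AlertManager port shown in the diagram.

```yaml
scrape_configs:
  - job_name: 'snmp-synology'
    scrape_interval: 30s
    metrics_path: /snmp
    params:
      module: [synology]
    static_configs:
      - targets: ['192.168.0.200', '192.168.0.201']   # NAS devices to query
    relabel_configs:
      # Pass the NAS address to the exporter as ?target=<address>
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the NAS address as the instance label on the resulting series
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual HTTP scrape to the exporter (placeholder address)
      - target_label: __address__
        replacement: 'snmp-exporter:9116'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']              # placeholder hostname
```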
Alert Rules
- File: prometheus/alert-rules.yml
- Validation: Automated syntax checking in CI/CD
- Testing: Alert rule unit tests for reliability
- Documentation: Each rule includes description and runbook links
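A rule unit test of the kind mentioned above can be written for `promtool test rules`. The sketch assumes that `alert-rules.yml` contains a hypothetical `HostDown` rule with `expr: up{job="node-exporter"} == 0`, `for: 5m`, `severity: critical`, and a single `summary` annotation of `Node exporter on {{ $labels.instance }} is unreachable`. The real rule's annotations are not shown in this document, and `exp_annotations` must match them exactly.

```yaml
# alert-rules.test.yml (hypothetical file name)
rule_files:
  - alert-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Up for the first two samples, then down for the remaining eleven
      - series: 'up{job="node-exporter", instance="atlantis:9100"}'
        values: '1 1 0+0x10'
    alert_rule_test:
      # 2 minutes after going down: still pending, so nothing is firing
      - eval_time: 4m
        alertname: HostDown
        exp_alerts: []
      # 8 minutes after going down: past the 5m `for` window, so firing
      - eval_time: 10m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: atlantis:9100
            exp_annotations:
              summary: "Node exporter on atlantis:9100 is unreachable"
```

In CI this would run as `promtool test rules alert-rules.test.yml`, alongside `promtool check rules alert-rules.yml` for the syntax check.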
📱 Notification System
NTFY Integration
- Server: Self-hosted NTFY instance
- Topics: Separate channels for different alert severities
- Mobile Apps: Push notifications to admin devices
- Web Interface: Browser-based notification viewing
Notification Routing
Critical Alerts → NTFY + Email + SMS
Warning Alerts → NTFY + Email
Info Alerts → NTFY only
Maintenance → Dedicated maintenance channel
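The routing table above maps onto an Alertmanager route tree along these lines. Everything concrete here is a placeholder: the receiver names, the `category` label, all URLs and addresses, and the SMTP settings. Alertmanager has no built-in NTFY or SMS receiver, so the sketch assumes both are reached through `webhook_configs` endpoints that accept Alertmanager's webhook payload, typically via a small bridge service.

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'      # placeholder
  smtp_from: 'alertmanager@example.com'       # placeholder

route:
  receiver: info                              # default: NTFY only
  group_by: ['alertname', 'instance']
  routes:
    # The first matching route wins, so maintenance is checked before severity
    - matchers: ['category="maintenance"']
      receiver: maintenance
    - matchers: ['severity="critical"']
      receiver: critical
    - matchers: ['severity="warning"']
      receiver: warning

receivers:
  - name: critical                            # NTFY + Email + SMS
    webhook_configs:
      - url: 'http://ntfy-bridge.example/critical'
      - url: 'http://sms-gateway.example/send'
    email_configs:
      - to: 'admin@example.com'
  - name: warning                             # NTFY + Email
    webhook_configs:
      - url: 'http://ntfy-bridge.example/warning'
    email_configs:
      - to: 'admin@example.com'
  - name: info                                # NTFY only
    webhook_configs:
      - url: 'http://ntfy-bridge.example/info'
  - name: maintenance                         # dedicated maintenance channel
    webhook_configs:
      - url: 'http://ntfy-bridge.example/maintenance'
```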
🔍 Log Management
Centralized Logging
- Collection: Docker log drivers, syslog forwarding
- Storage: Local retention with rotation policies
- Analysis: Grafana Loki for log aggregation and search
- Correlation: Metrics and logs correlation in Grafana
Log Sources
- System Logs: Syslog from all hosts
- Container Logs: Docker container stdout/stderr
- Application Logs: Service-specific log files
- Security Logs: Auth logs, firewall logs, intrusion detection
📊 Performance Optimization
Query Optimization
- Recording Rules: Pre-computed expensive queries
- Retention Policies: Tiered storage with different retention periods
- Downsampling: Reduced resolution for historical data
- Indexing: Optimized label indexing for fast queries
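As an illustration of the recording rules mentioned above, the sketch below precomputes two per-host ratios that dashboards and alerts would otherwise recalculate on every query. The group name, the evaluation interval, and the rule names (which follow the common `level:metric:operations` naming convention) are assumptions; the metrics are standard node exporter series.

```yaml
groups:
  - name: node-recording                # hypothetical group name
    interval: 1m
    rules:
      # Fraction of CPU time spent non-idle, per host, over 5 minutes
      - record: instance:node_cpu_utilisation:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Fraction of memory still available, per host
      - record: instance:node_memory_available:ratio
        expr: avg by (instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

A dashboard panel can then query `instance:node_cpu_utilisation:ratio_rate5m` directly instead of evaluating the `rate()` over every per-core series.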
Resource Management
- Memory Tuning: Prometheus memory configuration
- Storage Optimization: Efficient time series storage
- Network Efficiency: Compression and batching
- Caching: Query result caching in Grafana
🔐 Security & Access Control
Authentication
- SSO Integration: Authentik-based authentication
- Role-Based Access: Different permission levels
- API Security: Token-based API access
- Network Security: Internal network access only
Data Protection
- Encryption: TLS for all communications
- Backup: Regular backup of monitoring data
- Retention: Compliance with data retention policies
- Privacy: Sensitive data scrubbing and anonymization
🚀 Future Enhancements
Planned Improvements
- Distributed Tracing: OpenTelemetry integration
- Machine Learning: Anomaly detection and predictive alerting
- Mobile Dashboard: Dedicated mobile monitoring app
- Advanced Analytics: Custom metrics and business intelligence
Scalability Considerations
- Federation: Multi-cluster Prometheus federation
- High Availability: Redundant monitoring infrastructure
- Performance: Horizontal scaling capabilities
- Integration: Additional data sources and exporters
📚 Documentation & Runbooks
Operational Procedures
- Alert Response: Step-by-step incident response procedures
- Maintenance: Monitoring system maintenance procedures
- Troubleshooting: Common issues and resolution steps
- Capacity Planning: Resource growth and scaling guidelines
Training Materials
- Dashboard Usage: Guide for reading and interpreting dashboards
- Alert Management: How to handle and resolve alerts
- Query Language: PromQL tutorial and best practices
- Custom Metrics: Adding new metrics and dashboards
Architecture Version: 2.0
Last Updated: February 24, 2026
Status: ✅ PRODUCTION - Full monitoring coverage
Metrics Retention: 15 days high-resolution, 1 year downsampled