homelab-optimized/docs/admin/monitoring-setup.md
Sanitized mirror from private repository - 2026-03-30 00:10:29 UTC

📊 Monitoring and Alerting Setup

This document details the monitoring and alerting infrastructure for the homelab environment, providing configuration guidance and operational procedures.

🧰 Monitoring Stack Overview

Services Deployed

  • Grafana (v12.4.0): Visualization and dashboarding
  • Prometheus: Metrics collection and storage
  • Node Exporter: Host-level metrics
  • SNMP Exporter: Synology NAS metrics collection

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Services     │───▶│   Prometheus    │───▶│     Grafana     │
│  (containers)   │    │   (scraping)    │    │    (visual)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      Hosts      │    │    Exporters    │    │   Dashboards    │
│ (node_exporter) │    │ (snmp_exporter) │    │  (Grafana UI)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

🔧 Current Configuration

Active Monitoring Services

| Service | Host | Port | URL | Purpose |
|---|---|---|---|---|
| Grafana | Homelab VM | 3300 | https://gf.vish.gg | Dashboards & visualization |
| Prometheus | Homelab VM | 9090 | http://192.168.0.210:9090 | Metrics collection & storage |
| Alertmanager | Homelab VM | 9093 | http://192.168.0.210:9093 | Alert routing & dedup |
| ntfy | Homelab VM | 8081 | https://ntfy.vish.gg | Push notifications |
| Uptime Kuma | RPi 5 | 3001 | http://192.168.0.66:3001 or https://kuma.vish.gg | Uptime monitoring (97 monitors) |
| DIUN | Atlantis | | ntfy topic diun | Docker image update detection |
| Scrutiny | Multiple | 8090 | http://192.168.0.210:8090 | SMART disk health |

Prometheus Targets (14 active)

| Job | Target | Type | Status |
|---|---|---|---|
| atlantis-node | atlantis | node_exporter | Up |
| atlantis-snmp | atlantis | SNMP exporter | Up |
| calypso-node | calypso | node_exporter | Up |
| calypso-snmp | calypso | SNMP exporter | Up |
| concord-nuc-node | concord-nuc | node_exporter | Up |
| homelab-node | homelab-vm | node_exporter | Up |
| node_exporter | homelab-vm | node_exporter (self) | Up |
| prometheus | localhost:9090 | self-scrape | Up |
| proxmox-node | proxmox | node_exporter | Up |
| raspberry-pis | pi-5 | node_exporter | Up |
| seattle-node | seattle | node_exporter | Up |
| setillo-node | setillo | node_exporter | Up |
| setillo-snmp | setillo | SNMP exporter | Up |
| truenas-node | guava | node_exporter | Up |
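
The status column above is a snapshot, so it can go stale. As a rough sketch, the same information could be read from the Prometheus HTTP API (`/api/v1/targets`). The base URL is taken from the services table; the function names are hypothetical helpers for illustration and are not part of this repository.

```python
import json
import urllib.request

# Assumption: Prometheus base URL as listed in the services table.
PROMETHEUS_URL = "http://192.168.0.210:9090"


def summarize_targets(payload: dict) -> dict:
    """Map each scrape job to a list of (instance, health) pairs."""
    summary: dict = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        job = labels.get("job", "unknown")
        instance = labels.get("instance", "unknown")
        summary.setdefault(job, []).append((instance, target["health"]))
    return summary


def unhealthy_jobs(summary: dict) -> list:
    """Return sorted job names that have at least one target not reporting 'up'."""
    return sorted(
        job
        for job, targets in summary.items()
        if any(health != "up" for _, health in targets)
    )


def fetch_targets(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Fetch the target list from a live Prometheus server (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

On a host that can reach Prometheus, `unhealthy_jobs(summarize_targets(fetch_targets()))` would list any job that needs attention; an empty list corresponds to every row above reading "Up".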

📈 Key Metrics Monitored

System Resources

  • CPU utilization percentage
  • Memory usage and availability
  • Disk space and I/O operations
  • Network traffic and latency
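
For orientation, the system resource metrics above map onto standard node_exporter series. The sketch below builds example PromQL expressions as Python strings; the 5-minute window, the `/` mountpoint, and the helper names are assumptions chosen for illustration, not values taken from this setup's dashboards.

```python
from urllib.parse import urlencode

# Assumption: Prometheus base URL as listed in the services table.
PROMETHEUS_URL = "http://192.168.0.210:9090"


def cpu_utilization_query(window: str = "5m") -> str:
    """Percent of CPU time not spent idle, averaged per instance."""
    return (
        "100 - (avg by (instance) "
        f'(rate(node_cpu_seconds_total{{mode="idle"}}[{window}])) * 100)'
    )


def memory_used_query() -> str:
    """Percent of memory in use, based on MemAvailable."""
    return "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"


def disk_free_query(mountpoint: str = "/") -> str:
    """Percent of filesystem space still available on one mountpoint."""
    selector = f'{{mountpoint="{mountpoint}"}}'
    return (
        f"node_filesystem_avail_bytes{selector} "
        f"/ node_filesystem_size_bytes{selector} * 100"
    )


def network_receive_query(window: str = "5m") -> str:
    """Inbound traffic in bytes per second, per interface."""
    return f"rate(node_network_receive_bytes_total[{window}])"


def instant_query_url(query: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"
```

The same expressions can be pasted into the Prometheus expression browser or a Grafana panel; `instant_query_url` is only a convenience for scripted checks.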

Service Availability

  • HTTP response times (Uptime Kuma)
  • Container restart counts
  • Database connection status
  • Backup success rates

Network Health

  • Tailscale connectivity status
  • External service reachability
  • DNS resolution times
  • Cloudflare metrics

⚠️ Alerting Strategy

Alert Levels

  1. Critical (Immediate Action)

    • Service downtime (>5 min)
    • System resource exhaustion (<10% free)
    • Backup failures
  2. Warning (Review Required)

    • High resource usage (>80%)
    • Container restarts
    • Slow response times
  3. Info (Monitoring Only)

    • New service deployments
    • Configuration changes
    • Routine maintenance
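
The thresholds above can be read as a small decision rule. The following is a minimal sketch of that rule, assuming a hypothetical `Observation` record; it is not how Alertmanager is configured here. "Slow response times" is left out because the list gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical snapshot of one service or host (illustration only)."""
    downtime_minutes: float = 0.0
    resource_used_percent: float = 0.0
    backup_failed: bool = False
    container_restarts: int = 0


def classify(obs: Observation) -> str:
    """Return 'critical', 'warning', or 'info' following the levels listed above."""
    free_percent = 100.0 - obs.resource_used_percent
    # Critical: downtime over 5 minutes, under 10% free, or a failed backup.
    if obs.downtime_minutes > 5 or free_percent < 10 or obs.backup_failed:
        return "critical"
    # Warning: usage above 80% or any container restart.
    if obs.resource_used_percent > 80 or obs.container_restarts > 0:
        return "warning"
    # Everything else is informational.
    return "info"
```

Checking the critical conditions first means an observation that matches both levels is reported at the higher one.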

Alert Channels

  • ntfy notifications for critical issues
  • Email alerts to administrators
  • Slack integration for team communication
  • Uptime Kuma dashboard for service status
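
As an illustration of the ntfy channel, a notification can be published with a plain HTTP POST to a topic URL, using the `Title` and `Priority` headers that ntfy accepts. The topic name in the usage note and the level-to-priority mapping are assumptions made for this sketch; the base URL comes from the services table.

```python
import urllib.request

# Assumption: ntfy base URL as listed in the services table.
NTFY_BASE_URL = "https://ntfy.vish.gg"

# Assumed mapping from this document's alert levels to ntfy priority names.
NTFY_PRIORITY = {"critical": "urgent", "warning": "default", "info": "low"}


def build_ntfy_request(
    topic: str, title: str, message: str, level: str = "critical"
) -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes to an ntfy topic."""
    return urllib.request.Request(
        url=f"{NTFY_BASE_URL}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": NTFY_PRIORITY[level]},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

For example, `send(build_ntfy_request("homelab-alerts", "Backup failed", "Nightly backup did not complete"))` would publish to a hypothetical `homelab-alerts` topic. If the server requires authentication, an `Authorization` header would also be needed.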

📋 Maintenance Procedures

Regular Tasks

  1. Daily

    • Review Uptime Kuma service status
    • Check Prometheus metrics for anomalies
    • Verify Grafana dashboards display correctly
  2. Weekly

    • Update dashboard panels if needed
    • Review and update alert thresholds
    • Validate alert routes are working properly
  3. Monthly

    • Audit alert configurations
    • Test alert delivery mechanisms
    • Review Prometheus storage usage
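
One possible way to carry out the monthly "test alert delivery" task is to post a short-lived synthetic alert to the Alertmanager v2 API and confirm that it arrives on the expected channel. This is a sketch under assumptions: the alert name, labels, and 5-minute lifetime are invented for illustration, and whether the alert is delivered anywhere depends on the routing rules, which this document does not show.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: Alertmanager base URL as listed in the services table.
ALERTMANAGER_URL = "http://192.168.0.210:9093"


def build_test_alert(now: Optional[datetime] = None, ttl_minutes: int = 5) -> list:
    """Build a one-element alert list in the format accepted by POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            # Hypothetical labels; adjust so they match a real route.
            "labels": {
                "alertname": "DeliveryTest",
                "severity": "info",
                "instance": "manual-test",
            },
            "annotations": {"summary": "Monthly alert delivery test"},
            "startsAt": now.isoformat(),
            # The alert resolves on its own once endsAt has passed.
            "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
        }
    ]


def post_alerts(
    alerts: list, base_url: str = ALERTMANAGER_URL, timeout: float = 5.0
) -> int:
    """POST the alerts to Alertmanager and return the HTTP status (needs network access)."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Running `post_alerts(build_test_alert())` from a host that can reach Alertmanager should make the test alert visible in the Alertmanager UI until it expires.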

Last updated: 2026