Sanitized mirror from private repository - 2026-03-30 00:10:29 UTC


📊 Monitoring and Alerting Setup

This document details the monitoring and alerting infrastructure for the homelab environment, providing configuration guidance and operational procedures.

🧰 Monitoring Stack Overview

Services Deployed

Grafana (v12.4.0): Visualization and dashboarding
Prometheus: Metrics collection and storage
Alertmanager: Alert routing and deduplication
Node Exporter: Host-level metrics
SNMP Exporter: Synology NAS metrics collection

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Services     │───▶│   Prometheus    │───▶│     Grafana     │
│  (containers)   │    │   (scraping)    │    │    (visual)     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│      Hosts      │    │    Exporters    │    │   Dashboards    │
│ (node_exporter) │    │ (snmp_exporter) │    │  (Grafana UI)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

🔧 Current Configuration

Active Monitoring Services

Service	Host	Port	URL	Purpose
Grafana	Homelab VM	3300	`https://gf.vish.gg`	Dashboards & visualization
Prometheus	Homelab VM	9090	`http://192.168.0.210:9090`	Metrics collection & storage
Alertmanager	Homelab VM	9093	`http://192.168.0.210:9093`	Alert routing & dedup
ntfy	Homelab VM	8081	`https://ntfy.vish.gg`	Push notifications
Uptime Kuma	RPi 5	3001	`http://192.168.0.66:3001` or `https://kuma.vish.gg`	Uptime monitoring (97 monitors)
DIUN	Atlantis	—	ntfy topic `diun`	Docker image update detection
Scrutiny	Multiple	8090	`http://192.168.0.210:8090`	SMART disk health

Prometheus Targets (14 active)

Job	Target	Type	Status
atlantis-node	atlantis	node_exporter	Up
atlantis-snmp	atlantis	SNMP exporter	Up
calypso-node	calypso	node_exporter	Up
calypso-snmp	calypso	SNMP exporter	Up
concord-nuc-node	concord-nuc	node_exporter	Up
homelab-node	homelab-vm	node_exporter	Up
node_exporter	homelab-vm	node_exporter (self)	Up
prometheus	localhost:9090	self-scrape	Up
proxmox-node	proxmox	node_exporter	Up
raspberry-pis	pi-5	node_exporter	Up
seattle-node	seattle	node_exporter	Up
setillo-node	setillo	node_exporter	Up
setillo-snmp	setillo	SNMP exporter	Up
truenas-node	guava	node_exporter	Up
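
The target list above can also be checked programmatically. The following is a minimal sketch, assuming the JSON shape returned by Prometheus's `/api/v1/targets` HTTP endpoint (`data.activeTargets[]` with `labels` and `health` fields). The function name `summarize_targets` and the `SAMPLE` payload are illustrative only: the sample is made-up data, not the live state of this homelab, and `example-down-job` is a hypothetical job name.

```python
from collections import Counter


def summarize_targets(payload: dict) -> dict:
    """Summarize a Prometheus /api/v1/targets response by target health."""
    active = payload.get("data", {}).get("activeTargets", [])
    health = Counter(t.get("health", "unknown") for t in active)
    # Collect the job label of every target that is not reporting "up".
    down_jobs = sorted(
        t.get("labels", {}).get("job", "<no job>")
        for t in active
        if t.get("health") != "up"
    )
    return {"total": len(active), "up": health.get("up", 0), "down_jobs": down_jobs}


# Made-up sample payload, trimmed to the fields used above (assumption, not real data).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "atlantis-node", "instance": "atlantis"}, "health": "up"},
            {"labels": {"job": "example-down-job", "instance": "example"}, "health": "down"},
        ]
    },
}

if __name__ == "__main__":
    print(summarize_targets(SAMPLE))
```

Against the live server, the payload would come from `http://192.168.0.210:9090/api/v1/targets`, and a healthy state would presumably show `total` and `up` both equal to 14 with an empty `down_jobs` list.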

📈 Key Metrics Monitored

System Resources

CPU utilization percentage
Memory usage and availability
Disk space and I/O operations
Network traffic and latency
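
As a rough illustration of how these system metrics might be expressed, the sketch below maps each one to a PromQL expression and builds an instant-query URL for the Prometheus HTTP API. The metric names are the standard Linux node_exporter ones; the exact expressions, the `fstype` and `device` filters, the 5-minute rate window, and the `QUERIES` and `instant_query_url` names are assumptions, not taken from this homelab's actual dashboards or rules.

```python
from urllib.parse import urlencode

# Prometheus address as listed in the services table.
PROMETHEUS_URL = "http://192.168.0.210:9090"

# Illustrative PromQL per metric (assumed expressions, not the deployed ones).
QUERIES = {
    "cpu_used_percent": (
        '100 - (avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
    ),
    "memory_used_percent": (
        "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
    ),
    "disk_free_percent": (
        'node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} '
        '/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100'
    ),
    "network_rx_bytes_per_s": (
        'rate(node_network_receive_bytes_total{device!="lo"}[5m])'
    ),
}


def instant_query_url(name: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an /api/v1/query URL for one of the named expressions."""
    return f"{base_url}/api/v1/query?{urlencode({'query': QUERIES[name]})}"


if __name__ == "__main__":
    for metric in QUERIES:
        print(metric, "->", instant_query_url(metric))
```

The same expressions can be pasted into the Prometheus expression browser or a Grafana panel; the URL builder is only there to show how the HTTP API receives them.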

Service Availability

HTTP response times (Uptime Kuma)
Container restart counts
Database connection status
Backup success rates

Network Health

Tailscale connectivity status
External service reachability
DNS resolution times
Cloudflare metrics

⚠️ Alerting Strategy

Alert Levels

Critical (Immediate Action)
- Service downtime (>5 min)
- System resource exhaustion (<10% free)
- Backup failures
Warning (Review Required)
- High resource usage (>80%)
- Container restarts
- Slow response times
Info (Monitoring Only)
- New service deployments
- Configuration changes
- Routine maintenance
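
The thresholds above can be read as a simple classification rule. The following is a minimal sketch of that logic, assuming a hypothetical `Observation` record and `classify` function; in practice these thresholds would more likely live in Prometheus alerting rules. "Slow response times" is left out because the document gives no numeric threshold for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical snapshot of one service or host (illustrative fields only)."""
    downtime_minutes: float = 0.0
    free_percent: Optional[float] = None   # remaining capacity of a resource
    used_percent: Optional[float] = None   # current utilization of a resource
    backup_failed: bool = False
    container_restarts: int = 0


def classify(obs: Observation) -> str:
    """Map an observation to 'critical', 'warning', or 'info'."""
    # Critical: downtime over 5 minutes, under 10% free, or a failed backup.
    if obs.downtime_minutes > 5:
        return "critical"
    if obs.free_percent is not None and obs.free_percent < 10:
        return "critical"
    if obs.backup_failed:
        return "critical"
    # Warning: usage above 80% or any container restart.
    if obs.used_percent is not None and obs.used_percent > 80:
        return "warning"
    if obs.container_restarts > 0:
        return "warning"
    # Everything else is informational.
    return "info"


if __name__ == "__main__":
    print(classify(Observation(downtime_minutes=6)))
    print(classify(Observation(used_percent=85)))
    print(classify(Observation()))
```

Critical conditions are checked first so that a host that is both over 80% used and under 10% free is reported once, at the higher level.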

Alert Channels

ntfy notifications for critical issues
Email alerts to administrators
Slack integration for team communication
Uptime Kuma dashboard for service status
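
For the ntfy channel, a notification is an HTTP POST to a topic URL, with the message as the body and optional `Title` and `Priority` headers. The sketch below builds such a request without sending it. The topic name `homelab-alerts`, the severity-to-priority mapping, and the function name `build_ntfy_request` are assumptions for illustration; the document only states that ntfy is used for critical issues.

```python
import urllib.request

NTFY_BASE_URL = "https://ntfy.vish.gg"   # from the services table
NTFY_TOPIC = "homelab-alerts"            # hypothetical topic name

# Assumed mapping from alert level to ntfy priority names.
PRIORITY_BY_SEVERITY = {"critical": "urgent", "warning": "default", "info": "low"}


def build_ntfy_request(severity: str, title: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes to an ntfy topic."""
    return urllib.request.Request(
        url=f"{NTFY_BASE_URL}/{NTFY_TOPIC}",
        data=message.encode("utf-8"),
        headers={
            "Title": title,  # keep header values ASCII
            "Priority": PRIORITY_BY_SEVERITY.get(severity, "default"),
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_ntfy_request("critical", "Service down", "Grafana unreachable for 6 minutes")
    print(req.get_method(), req.full_url, dict(req.header_items()))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it keeps the example safe to run offline and makes the headers easy to inspect. If the ntfy instance requires authentication, an `Authorization` header would also be needed.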

📋 Maintenance Procedures

Regular Tasks

Daily
- Review Uptime Kuma service status
- Check Prometheus metrics for anomalies
- Verify Grafana dashboards display correctly
Weekly
- Update dashboard panels if needed
- Review and update alert thresholds
- Validate alert routes are working properly
Monthly
- Audit alert configurations
- Test alert delivery mechanisms
- Review Prometheus storage usage

🔗 Related Documentation

Image Update Guide: Renovate, DIUN, Watchtower
Ansible Playbook Guide: health_check.yml, service_status.yml
Backup Strategy: backup monitoring
Offline & Remote Access: accessing monitoring when internet is down
Disaster Recovery Procedures
Security Hardening

Last updated: 2026
