Sanitized mirror from private repository - 2026-04-05 10:53:12 UTC

Homelab VM Runbook

Proxmox VM - Monitoring & DevOps

Endpoint ID: 443399
Status: 🟢 Online
Hardware: 4 vCPU, 28GB RAM
Access: 192.168.0.210


Overview

The Homelab VM is a Proxmox guest that hosts the monitoring and alerting stack, the development services, and their backing databases.

Hardware Specs

| Component | Specification |
|-----------|---------------|
| Platform  | Proxmox VE    |
| vCPU      | 4 cores       |
| RAM       | 28GB          |
| Storage   | 100GB SSD     |
| Network   | 1x 1GbE       |

Services

Monitoring Stack

| Service       | Port | Purpose            |
|---------------|------|--------------------|
| Prometheus    | 9090 | Metrics collection |
| Grafana       | 3000 | Dashboards         |
| Alertmanager  | 9093 | Alert routing      |
| Node Exporter | 9100 | System metrics     |
| cAdvisor      | 8080 | Container metrics  |
| Uptime Kuma   | 3001 | Uptime monitoring  |

Development

| Service      | Port | Purpose      |
|--------------|------|--------------|
| Gitea        | 3000 | Git hosting  |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands    | 8000 | AI developer |

Database

| Service    | Port | Purpose  |
|------------|------|----------|
| PostgreSQL | 5432 | Database |
| Redis      | 6379 | Caching  |

Daily Operations

Check Monitoring

# Prometheus targets
curl -s http://192.168.0.210:9090/api/v1/targets | jq .

# Grafana dashboards
open http://192.168.0.210:3000

Alert Status

# Alertmanager
open http://192.168.0.210:9093

# Check ntfy for recent alerts (poll=1 returns the cached messages as JSON, then exits)
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20

Prometheus Configuration

Scraping Targets

  • Node exporters (all hosts)
  • cAdvisor (all hosts)
  • Prometheus self-monitoring
  • Application-specific metrics

Retention

  • Time: 30 days
  • Storage: 20GB
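
Prometheus reads retention limits from startup flags rather than from prometheus.yml. The sketch below prints the two flags that correspond to the limits above; the function name is hypothetical, and how the flags reach the container (compose `command:` or otherwise) is an assumption about this setup.

```shell
#!/bin/sh
# Sketch: build the Prometheus retention flags for the limits listed above.
# Defaults (30 days, 20GB) come from this runbook; the function name is hypothetical.
prometheus_retention_flags() {
    days="${1:-30}"
    size="${2:-20GB}"
    printf '%s %s\n' \
        "--storage.tsdb.retention.time=${days}d" \
        "--storage.tsdb.retention.size=${size}"
}

# Print the flags to append to the Prometheus command line.
prometheus_retention_flags 30 20GB
```

When both flags are set, Prometheus removes old blocks as soon as either limit is reached.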

Maintenance

# Check TSDB size
du -sh /var/lib/prometheus/

# Inspect TSDB blocks (Prometheus compacts on its own schedule; promtool has no manual compact subcommand)
docker exec prometheus promtool tsdb list /prometheus

Grafana Dashboards

Key Dashboards

  • Infrastructure Overview
  • Container Health
  • Network Traffic
  • Service-specific metrics

Alert Rules

  • CPU > 80% for 5 minutes
  • Memory > 90% for 5 minutes
  • Disk > 85%
  • Service down > 2 minutes

Common Issues

Prometheus Not Scraping

  1. Check targets: Prometheus UI → Status → Targets
  2. Verify network connectivity
  3. Check firewall rules
  4. Review scrape errors in logs
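
Steps 2 and 3 can be approximated by fetching an exporter's `/metrics` endpoint from the Prometheus host. This is a minimal sketch: the ports come from the Services tables in this runbook, while the function names and the short exporter labels are hypothetical.

```shell
#!/bin/sh
# Sketch for steps 2-3: probe an exporter's /metrics endpoint from the Prometheus host.
# Ports come from this runbook; function names and exporter labels are hypothetical.
metrics_url() {
    host="$1"
    case "$2" in
        node-exporter) port=9100 ;;
        cadvisor)      port=8080 ;;
        prometheus)    port=9090 ;;
        *) echo "unknown exporter: $2" >&2; return 1 ;;
    esac
    echo "http://${host}:${port}/metrics"
}

probe_exporter() {
    url="$(metrics_url "$1" "$2")" || return 1
    # --max-time keeps a silently dropped (firewalled) port from hanging the check.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
        echo "OK   $url"
    else
        echo "FAIL $url" >&2
        return 1
    fi
}

# Example: probe_exporter 192.168.0.210 node-exporter
```

A timeout usually points at a firewall or routing problem, while an immediate connection refusal suggests the exporter is not running.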

Grafana Dashboards Slow

  1. Check Prometheus query performance
  2. Reduce time range
  3. Optimize queries
  4. Check resource usage

Alerts Not Firing

  1. Verify Alertmanager config
  2. Check ntfy integration
  3. Review alert rules syntax
  4. Test with artificial alert
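
For step 4, one option is to post a synthetic alert straight to Alertmanager's v2 API and watch whether it reaches ntfy. A sketch under stated assumptions: the alert name, labels, and annotation are made up for the test, and the default URL is the Alertmanager address used elsewhere in this runbook.

```shell
#!/bin/sh
# Sketch: build and send a synthetic alert to Alertmanager's v2 API.
# Assumptions: alert name, labels and annotation are invented for the test.
test_alert_payload() {
    name="${1:-RunbookTestAlert}"
    printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Manual test alert"}}]\n' "$name"
}

send_test_alert() {
    am_url="${1:-http://192.168.0.210:9093}"
    test_alert_payload "${2:-}" |
        curl -sS -X POST -H 'Content-Type: application/json' \
             --data-binary @- "${am_url}/api/v2/alerts"
}

# Example: send_test_alert http://192.168.0.210:9093 RunbookTestAlert
```

Because the payload carries no end time, Alertmanager should treat the alert as resolved once its configured resolve timeout passes, so no cleanup is needed.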

Maintenance

Weekly

  • Review alert history
  • Check disk space
  • Verify backups

Monthly

  • Clean old metrics
  • Update dashboards
  • Review alert thresholds

Quarterly

  • Test alert notifications
  • Review retention policy
  • Optimize queries

Backup Procedures

Configuration

# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
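
The plain `cp -r` commands above write into the same destination on every run. A possible variant, sketched here, copies into a dated subdirectory so earlier backups are kept; the dated layout and the function name are assumptions, not part of the existing procedure.

```shell
#!/bin/sh
# Sketch: copy config directories into a dated folder so runs do not overwrite each other.
# Assumption: the dated-subdirectory layout and function name are hypothetical.
backup_configs() {
    dest_root="$1"; shift
    stamp="$(date +%Y-%m-%d)"
    dest="${dest_root}/${stamp}"
    mkdir -p "$dest" || return 1
    for src in "$@"; do
        if [ -d "$src" ]; then
            cp -r "$src" "$dest/" || return 1
        else
            echo "skipping missing directory: $src" >&2
        fi
    done
    # Print the destination so callers can log or verify it.
    echo "$dest"
}

# Example (paths from this runbook):
#   backup_configs /backup /opt/grafana/dashboards /opt/prometheus/rules
```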

Ansible

ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm

Emergency Procedures

Prometheus Full

  1. Check storage: docker system df
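  2. Reduce retention with the --storage.tsdb.retention.time or --storage.tsdb.retention.size startup flags (retention is not a prometheus.yml setting)
  3. Last resort, clear the WAL: docker exec prometheus rm -rf /prometheus/wal/* (this discards the most recent samples not yet written to a block, not old data; old blocks are removed once retention is lowered)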
  4. Restart container

VM Down

  1. Check Proxmox: qm list
  2. Start VM: qm start <vmid>
  3. Check console: qm terminal <vmid>
  4. Review logs in Proxmox UI

Useful Commands

# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# Check targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager