Sanitized mirror from private repository - 2026-04-05 10:53:12 UTC

Homelab VM Runbook

Proxmox VM - Monitoring & DevOps

Endpoint ID: 443399
Status: 🟢 Online
Hardware: 4 vCPU, 28GB RAM
Access: 192.168.0.210

Overview

Homelab VM runs monitoring, alerting, and development services on Proxmox.

Hardware Specs

| Component | Specification |
| --- | --- |
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |

Services

Monitoring Stack

| Service | Port | Purpose |
| --- | --- | --- |
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Dashboards |
| Alertmanager | 9093 | Alert routing |
| Node Exporter | 9100 | System metrics |
| cAdvisor | 8080 | Container metrics |
| Uptime Kuma | 3001 | Uptime monitoring |

Development

| Service | Port | Purpose |
| --- | --- | --- |
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |

Database

| Service | Port | Purpose |
| --- | --- | --- |
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |

Daily Operations

Check Monitoring

# Prometheus targets
curl -s http://192.168.0.210:9090/api/v1/targets | jq

# Grafana dashboards
open http://192.168.0.210:3000

Alert Status

# Alertmanager
open http://192.168.0.210:9093

# Check ntfy for recent alerts (poll cached messages as JSON, then exit)
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20
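
The daily checks above can be rolled into one pass over the services' built-in health endpoints. A minimal sketch, assuming the host and ports from the Services tables; the function names are illustrative and not part of the repository:

```shell
#!/usr/bin/env bash
# Sketch: probe the health endpoints of the monitoring stack.
# HOST and ports are taken from the runbook; function names are hypothetical.

HOST="192.168.0.210"

# Print the health URL for a known service, or fail for an unknown one.
health_url() {
  case "$1" in
    prometheus)   echo "http://${HOST}:9090/-/healthy" ;;
    alertmanager) echo "http://${HOST}:9093/-/healthy" ;;
    grafana)      echo "http://${HOST}:3000/api/health" ;;
    *)            return 1 ;;
  esac
}

# Report OK/FAIL per service based on whether the endpoint answers with 2xx.
check_stack() {
  local svc url
  for svc in prometheus alertmanager grafana; do
    url="$(health_url "$svc")"
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $svc"
    else
      echo "FAIL $svc ($url)"
    fi
  done
}
```

Source the file and run `check_stack`; any `FAIL` line points at the service to inspect first.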

Prometheus Configuration

Scraping Targets

Node exporters (all hosts)
cAdvisor (all hosts)
Prometheus self-monitoring
Application-specific metrics

Retention

Time: 30 days
Storage: 20GB
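
Prometheus takes its retention limits as command-line flags rather than from prometheus.yml, and when both a time and a size limit are set, whichever is reached first applies. A sketch of the flags matching the values above; the config and data paths are assumptions based on the official image defaults, and `prometheus_args` is a hypothetical helper:

```shell
# Sketch: print Prometheus flags for the retention targets (30 days, 20GB).
# Paths are assumed image defaults; override the limits via $1 and $2.
prometheus_args() {
  printf '%s\n' \
    "--config.file=/etc/prometheus/prometheus.yml" \
    "--storage.tsdb.path=/prometheus" \
    "--storage.tsdb.retention.time=${1:-30d}" \
    "--storage.tsdb.retention.size=${2:-20GB}"
}
```

The printed flags would go into the `command:` list of the Prometheus service in its compose file, assuming the stack is deployed that way.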

Maintenance

# Check TSDB size
du -sh /var/lib/prometheus/

# Inspect TSDB blocks (compaction runs automatically; promtool has no manual compact command)
docker exec prometheus promtool tsdb list /prometheus
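
To turn the size check into a pass/fail signal, the directory size can be compared against the 20GB retention budget. A rough sketch, assuming GNU `du` and treating 1GB as 1024^3 bytes; the function names and the 80% warning threshold are illustrative:

```shell
# Integer percentage of a byte budget that is in use.
usage_pct() {
  local used="$1" limit="$2"
  echo $(( used * 100 / limit ))
}

# Size of a directory in bytes (relies on GNU du -b).
dir_bytes() {
  du -sb "$1" | cut -f1
}

# Warn when the TSDB directory exceeds warn_pct of the retention budget.
check_tsdb_budget() {
  local dir="${1:-/var/lib/prometheus}" limit_gb="${2:-20}" warn_pct="${3:-80}"
  local pct
  pct="$(usage_pct "$(dir_bytes "$dir")" $(( limit_gb * 1024 * 1024 * 1024 )))"
  if [ "$pct" -ge "$warn_pct" ]; then
    echo "WARN: $dir at ${pct}% of ${limit_gb}GB"
    return 1
  fi
  echo "OK: $dir at ${pct}% of ${limit_gb}GB"
}
```

Calling `check_tsdb_budget` with no arguments checks `/var/lib/prometheus` against 20GB and returns non-zero above 80%, which makes it usable from cron or a weekly checklist.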

Grafana Dashboards

Key Dashboards

Infrastructure Overview
Container Health
Network Traffic
Service-specific metrics

Alert Rules

CPU > 80% for 5 minutes
Memory > 90% for 5 minutes
Disk > 85%
Service down > 2 minutes
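
One way these thresholds might look as a Prometheus rules file is sketched below. The expressions assume Node Exporter metric names; the alert names, severity labels, and the filesystem filter are illustrative choices, not the rules actually deployed:

```shell
# Sketch: write a rules file covering the four thresholds listed above.
# Alert names and labels are hypothetical; expressions assume Node Exporter.
write_alert_rules() {
  local out="${1:-homelab_alerts.yml}"
  cat > "$out" <<'EOF'
groups:
  - name: homelab
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
        labels:
          severity: warning
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
EOF
}
```

After `write_alert_rules /tmp/homelab_alerts.yml`, the file can be validated with `promtool check rules /tmp/homelab_alerts.yml` before it is copied into the rules directory.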

Common Issues

Prometheus Not Scraping

Check targets: Prometheus UI → Status → Targets
Verify network connectivity
Check firewall rules
Review scrape errors in logs
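
The connectivity and firewall checks can be scripted as a TCP probe from the Prometheus host. A small sketch using bash's `/dev/tcp` and coreutils `timeout`; the ports are Node Exporter and cAdvisor from the Services table, and the function names are hypothetical:

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Probe the exporter ports (Node Exporter 9100, cAdvisor 8080) on one host.
probe_exporters() {
  local host="$1" port
  for port in 9100 8080; do
    if port_open "$host" "$port"; then
      echo "open   ${host}:${port}"
    else
      echo "closed ${host}:${port}"
    fi
  done
}
```

For example, `probe_exporters 192.168.0.210` prints one line per port. A `closed` result narrows the problem to the network or firewall, or to the exporter not running, rather than the scrape config.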

Grafana Dashboards Slow

Check Prometheus query performance
Reduce time range
Optimize queries
Check resource usage

Alerts Not Firing

Verify Alertmanager config
Check ntfy integration
Review alert rules syntax
Test with artificial alert
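
An artificial alert can be pushed straight to Alertmanager's v2 API to exercise routing and the ntfy receiver without waiting for a rule to fire. A sketch under the assumption that Alertmanager listens at the address used elsewhere in this runbook; the alert name and labels are made-up test values:

```shell
# Build a minimal Alertmanager v2 payload: a JSON array with one alert.
# The alert name and labels are hypothetical test values.
test_alert_payload() {
  local name="${1:-RunbookTestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Synthetic alert from the runbook"}}]\n' "$name"
}

# POST the payload to Alertmanager's /api/v2/alerts endpoint.
send_test_alert() {
  local am_url="${1:-http://192.168.0.210:9093}"
  test_alert_payload | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- "${am_url}/api/v2/alerts"
}
```

After `send_test_alert`, the alert should appear in the Alertmanager UI and, if routing matches, in the ntfy topic. Because no end time is sent, it should resolve on its own once Alertmanager's resolve timeout passes.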

Maintenance

Weekly

Review alert history
Check disk space
Verify backups

Monthly

Clean old metrics
Update dashboards
Review alert thresholds

Quarterly

Test alert notifications
Review retention policy
Optimize queries

Backup Procedures

Configuration

# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
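
A plain `cp -r` overwrites the previous copy each time. If dated snapshots are wanted, a tarball per run is one option. A minimal sketch; `backup_dir` is a hypothetical helper and the archive naming is an assumption:

```shell
# Archive a config directory into a dated tarball and print the archive path.
backup_dir() {
  local src="$1" dest="$2" stamp archive
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="${dest}/$(basename "$src")-${stamp}.tar.gz"
  mkdir -p "$dest"
  # -C keeps only the directory name inside the archive, not the full path.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
  echo "$archive"
}
```

Usage would be `backup_dir /opt/grafana/dashboards /backup` and `backup_dir /opt/prometheus/rules /backup`; restoring is a `tar -xzf` into the parent directory.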

Ansible

ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm

Emergency Procedures

Prometheus Full

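Check storage: docker system df
Reduce retention via the --storage.tsdb.retention.time or --storage.tsdb.retention.size flags in the container command (retention is not set in prometheus.yml)
Last resort, delete the write-ahead log: docker exec prometheus rm -rf /prometheus/wal/* (this discards the most recent, not-yet-compacted samples, not old data)
Restart container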

VM Down

Check Proxmox: qm list
Start VM: qm start <vmid>
Check console: qm terminal <vmid>
Review logs in Proxmox UI
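
The first two steps can be combined so the VM is started only when Proxmox reports it as stopped. A sketch meant to run on the Proxmox host; `ensure_vm_running` is a hypothetical helper and the VM ID must come from `qm list`:

```shell
# Start a VM only if `qm status` reports it as stopped.
ensure_vm_running() {
  local vmid="$1" status
  # `qm status <vmid>` prints a line like "status: running".
  status="$(qm status "$vmid" | awk '{print $2}')"
  case "$status" in
    running) echo "VM $vmid already running" ;;
    stopped) echo "VM $vmid stopped, starting"; qm start "$vmid" ;;
    *)       echo "VM $vmid in unexpected state: ${status:-unknown}"; return 1 ;;
  esac
}
```

Any state other than running or stopped returns non-zero, which is the cue to move on to the console and the logs.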

Useful Commands

# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# List targets that are currently down
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
