Homelab VM Runbook
Proxmox VM - Monitoring & DevOps
Endpoint ID: 443399
Status: 🟢 Online
Hardware: 4 vCPU, 28GB RAM
Access: 192.168.0.210
Overview
The Homelab VM is a Proxmox guest that runs the monitoring, alerting, and development services.
Hardware Specs
| Component | Specification |
|---|---|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |
Services
Monitoring Stack
| Service | Port | Purpose |
|---|---|---|
| Prometheus | 9090 | Metrics collection |
| Grafana | 3000 | Dashboards |
| Alertmanager | 9093 | Alert routing |
| Node Exporter | 9100 | System metrics |
| cAdvisor | 8080 | Container metrics |
| Uptime Kuma | 3001 | Uptime monitoring |
Development
| Service | Port | Purpose |
|---|---|---|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |
Database
| Service | Port | Purpose |
|---|---|---|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |
Daily Operations
Check Monitoring
# Prometheus targets
curl -s http://192.168.0.210:9090/api/v1/targets | jq .
# Grafana dashboards
open http://192.168.0.210:3000
Alert Status
# Alertmanager
open http://192.168.0.210:9093
# Check ntfy for alerts
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20
Prometheus Configuration
Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics
Retention
- Time: 30 days
- Storage: 20GB
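Prometheus normally takes these limits as startup flags rather than as settings in prometheus.yml. The sketch below only prints the two flags for the values listed above; how they reach the container (compose file, systemd unit) is an assumption about this setup, and the function name is illustrative.

```shell
# Retention targets from this runbook.
RETENTION_TIME="30d"
RETENTION_SIZE="20GB"

# Print the Prometheus flags that enforce both limits.
# When both are set, whichever limit is hit first causes old blocks to be dropped.
prometheus_retention_flags() {
    printf '%s %s\n' \
        "--storage.tsdb.retention.time=${RETENTION_TIME}" \
        "--storage.tsdb.retention.size=${RETENTION_SIZE}"
}

prometheus_retention_flags
# Assumed usage: add both flags to the Prometheus container's command,
# alongside the existing --config.file and --storage.tsdb.path flags.
```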
Maintenance
# Check TSDB size
du -sh /var/lib/prometheus/
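# Inspect block and series statistics (Prometheus compacts blocks on its own; promtool has no manual compaction subcommand)
docker exec prometheus promtool tsdb analyze /prometheus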
Grafana Dashboards
Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics
Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes
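As a rough illustration, the four thresholds above could be written as Prometheus alerting rules like this. The file name, group name, and alert names are made up for the example, and the expressions assume standard Node Exporter metrics; the real rules under /opt/prometheus/rules may look different.

```shell
# Write an example rules file for the four thresholds listed above.
# RULES_FILE, the group name, and the alert names are illustrative assumptions.
RULES_FILE="${RULES_FILE:-homelab-alerts.rules.yml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: homelab-basics
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
      - alert: ServiceDown
        expr: up == 0
        for: 2m
EOF

# Validate the syntax when promtool is installed locally; skip otherwise.
if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$RULES_FILE"
fi
```

The disk rule as written also matches tmpfs and overlay mounts; a label filter on fstype or mountpoint is usually worth adding.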
Common Issues
Prometheus Not Scraping
- Check targets: Prometheus UI → Status → Targets
- Verify network connectivity
- Check firewall rules
- Review scrape errors in logs
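The connectivity check can be scripted. This is a minimal sketch that probes the health and metrics endpoints on the ports from the Monitoring Stack table; the function names are illustrative, and the paths are the upstream defaults, which may differ here.

```shell
# Return 0 if the URL answers with a successful HTTP status within 5 seconds.
check_endpoint() {
    curl -fsS --max-time 5 -o /dev/null "$1" 2>/dev/null
}

# Probe Prometheus, Node Exporter, cAdvisor, and Alertmanager on one host.
# Prints one OK/FAIL line per endpoint and returns 1 if any probe failed.
check_monitoring_stack() {
    host="$1"
    failed=0
    for target in 9090/-/healthy 9100/metrics 8080/metrics 9093/-/healthy; do
        url="http://${host}:${target}"
        if check_endpoint "$url"; then
            echo "OK   $url"
        else
            echo "FAIL $url"
            failed=1
        fi
    done
    return "$failed"
}

# Example: check_monitoring_stack 192.168.0.210
```

Running it from the Prometheus host rather than a workstation tests the same network path the scraper uses.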
Grafana Dashboards Slow
- Check Prometheus query performance
- Reduce time range
- Optimize queries
- Check resource usage
Alerts Not Firing
- Verify Alertmanager config
- Check ntfy integration
- Review alert rules syntax
- Test with artificial alert
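One way to produce an artificial alert is to post it straight to Alertmanager's v2 API, bypassing Prometheus. The sketch below assumes Alertmanager at the address used elsewhere in this runbook; the helper names, the alert name, and the severity label are hypothetical, and whether the alert reaches ntfy depends on the routing config.

```shell
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://192.168.0.210:9093}"

# Build the JSON body for one synthetic alert.
# $1 = alert name, $2 = summary text (keep both free of quotes and backslashes).
build_test_alert() {
    printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"%s"}}]' "$1" "$2"
}

# POST the alert to Alertmanager, which then routes it like any other alert.
send_test_alert() {
    build_test_alert "$1" "$2" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
            --data-binary @- "${ALERTMANAGER_URL}/api/v2/alerts"
}

# Example: send_test_alert RunbookTestAlert "Synthetic alert from the runbook"
```

Because no end time is set, Alertmanager should mark the alert resolved on its own once its resolve timeout passes.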
Maintenance
Weekly
- Review alert history
- Check disk space
- Verify backups
Monthly
- Clean old metrics
- Update dashboards
- Review alert thresholds
Quarterly
- Test alert notifications
- Review retention policy
- Optimize queries
Backup Procedures
Configuration
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/
# Prometheus rules
cp -r /opt/prometheus/rules /backup/
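A plain cp -r overwrites the previous copy each time. If keeping several generations is useful, a timestamped archive is one option; the function name and archive naming below are illustrative, not part of the existing backup playbook.

```shell
# Archive the given directories into a timestamped tarball under <dest_dir>.
# Usage: backup_configs <dest_dir> <source_dir>...
# Prints the path of the archive it created.
backup_configs() {
    dest="$1"
    shift
    mkdir -p "$dest" || return 1
    archive="${dest}/monitoring-config-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" "$@" || return 1
    echo "$archive"
}

# Example: backup_configs /backup /opt/grafana/dashboards /opt/prometheus/rules
```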
Ansible
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
Emergency Procedures
Prometheus Full
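- Check storage: docker system df
- Reduce retention (normally set through the --storage.tsdb.retention.time and --storage.tsdb.retention.size startup flags)
- Last resort, clear the write-ahead log: docker exec prometheus rm -rf /prometheus/wal/* (this discards the most recent samples that have not yet been compacted into blocks, not old data)
- Restart container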
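To catch a filling volume before it becomes an emergency, the 85% disk threshold from the alert rules can also be checked by hand. A small sketch using df; the function names are illustrative, and the path in the example is the TSDB location used earlier in this runbook.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem holding $1.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage of the filesystem holding $1 is above threshold $2 (default 85).
disk_over_threshold() {
    [ "$(disk_usage_pct "$1")" -gt "${2:-85}" ]
}

# Example:
# disk_over_threshold /var/lib/prometheus 85 && echo "Prometheus volume above 85%"
```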
VM Down
- Check Proxmox: qm list
- Start VM: qm start <vmid>
- Check console: qm terminal <vmid>
- Review logs in Proxmox UI
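The first two steps can be combined into a guard that only starts the VM when it is not already running. This is a sketch to run on the Proxmox host; the function name is illustrative and the VM ID is left as a placeholder because this runbook does not record it.

```shell
# Start a Proxmox VM only if `qm status` does not report it as running.
# $1 is the numeric VM ID.
ensure_vm_running() {
    vmid="$1"
    # `qm status <vmid>` prints a line such as "status: running".
    state="$(qm status "$vmid" | awk '{ print $2 }')"
    if [ "$state" = "running" ]; then
        echo "VM $vmid already running"
    else
        echo "VM $vmid is '$state', starting"
        qm start "$vmid"
    fi
}

# Example: ensure_vm_running <vmid>
```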
Useful Commands
# SSH access
ssh homelab@192.168.0.210
# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart
# Check targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager