homelab-optimized/docs/infrastructure/hosts/homelab-vm-runbook.md
Sanitized mirror from private repository - 2026-04-18 11:13:17 UTC

# Homelab VM Runbook
*Proxmox VM - Monitoring & DevOps*

- **Endpoint ID:** 443399
- **Status:** 🟢 Online
- **Hardware:** 4 vCPU, 28GB RAM
- **Access:** `192.168.0.210`

---
## Overview
The Homelab VM runs monitoring, alerting, and development services on Proxmox.
## Hardware Specs
| Component | Specification |
|----------|---------------|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |
## Services
### Monitoring Stack
| Service | Port | Purpose |
|---------|------|---------|
| **Prometheus** | 9090 | Metrics collection |
| **Grafana** | 3000 | Dashboards |
| **Alertmanager** | 9093 | Alert routing |
| **Node Exporter** | 9100 | System metrics |
| **cAdvisor** | 8080 | Container metrics |
| **Uptime Kuma** | 3001 | Uptime monitoring |
### Development
| Service | Port | Purpose |
|---------|------|---------|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |
### Database
| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |
---
## Daily Operations
### Check Monitoring
```bash
# Prometheus targets
curl -s http://192.168.0.210:9090/api/v1/targets | jq
# Grafana dashboards (`open` is macOS; use `xdg-open` on Linux)
open http://192.168.0.210:3000
```
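
For a scripted pass over the same services, a minimal sketch follows. It assumes the stock health endpoints (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana) are reachable without authentication. `HOMELAB_HOST`, `health_url`, and `check_all` are hypothetical names introduced here, not part of the existing setup.

```bash
#!/usr/bin/env bash
# Hypothetical helper names; host and ports come from the service tables above.
HOMELAB_HOST="${HOMELAB_HOST:-192.168.0.210}"

# Map a service name to its built-in health endpoint.
health_url() {
  case "$1" in
    prometheus)   echo "http://${HOMELAB_HOST}:9090/-/healthy" ;;
    alertmanager) echo "http://${HOMELAB_HOST}:9093/-/healthy" ;;
    grafana)      echo "http://${HOMELAB_HOST}:3000/api/health" ;;
    *)            return 1 ;;
  esac
}

# Probe each endpoint and print one status line per service.
check_all() {
  local svc
  for svc in prometheus alertmanager grafana; do
    if curl -fsS --max-time 5 -o /dev/null "$(health_url "$svc")"; then
      echo "${svc}: ok"
    else
      echo "${svc}: FAIL"
    fi
  done
}

# Usage: check_all
```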
### Alert Status
```bash
# Alertmanager
open http://192.168.0.210:9093
# Check ntfy for recent alerts (poll cached messages as JSON, then exit)
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20
```
---
## Prometheus Configuration
### Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics
### Retention
- Time: 30 days
- Storage: 20GB
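
Prometheus applies retention through startup flags, and the two limits above correspond to `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size`. When both are set, whichever limit is reached first takes effect. The sketch below only prints the flags; `retention_flags` is a hypothetical helper, and where the flags go (for example the `command:` list of the compose file) is an assumption about this deployment.

```bash
# Hypothetical helper: print the Prometheus startup flags for a retention policy.
# Defaults mirror the limits listed above (30 days, 20GB).
retention_flags() {
  local time="${1:-30d}" size="${2:-20GB}"
  printf -- '--storage.tsdb.retention.time=%s --storage.tsdb.retention.size=%s\n' "$time" "$size"
}

# Usage:
# retention_flags            # policy above
# retention_flags 15d 10GB   # tighter policy
```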
### Maintenance
```bash
# Check TSDB size
du -sh /var/lib/prometheus/
# Inspect TSDB blocks and series churn (compaction runs automatically;
# promtool has no manual compact subcommand)
docker exec prometheus promtool tsdb analyze /prometheus
```
---
## Grafana Dashboards
### Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics
### Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes
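
The list above does not say whether these rules live in Grafana or in Prometheus. As a sketch, assuming Prometheus rule syntax and node_exporter metric names, the thresholds could be expressed as follows. The helper name, alert names, and output path are illustrative, and the disk rule has no `for:` clause because the list gives no duration for it.

```bash
# Hypothetical helper: write an example rule file expressing the thresholds above.
# Metric names assume node_exporter; alert names are illustrative.
write_example_rules() {
  local out="${1:-homelab-example.rules.yml}"
  cat > "$out" <<'EOF'
groups:
  - name: homelab-example
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
      - alert: ServiceDown
        expr: up == 0
        for: 2m
EOF
  echo "$out"
}

# Usage: validate the generated file if promtool is installed.
# promtool check rules "$(write_example_rules /tmp/homelab-example.rules.yml)"
```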
---
## Common Issues
### Prometheus Not Scraping
1. Check targets: Prometheus UI → Status → Targets
2. Verify network connectivity
3. Check firewall rules
4. Review scrape errors in logs
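
Steps 2 and 3 can be checked from the VM with a plain TCP probe. The sketch below relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command; `port_open` and `check_exporters` are hypothetical helper names, and the host list is whatever you pass in.

```bash
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
# Uses bash's built-in /dev/tcp redirection, so it needs bash (not plain sh).
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Report reachability of one exporter port on each host passed as an argument.
check_exporters() {
  local port="$1" host
  shift
  for host in "$@"; do
    if port_open "$host" "$port"; then
      echo "${host}:${port} reachable"
    else
      echo "${host}:${port} UNREACHABLE"
    fi
  done
}

# Usage: check_exporters 9100 192.168.0.210
```

A port that is reachable here but still fails in Prometheus usually points at the scrape config or the exporter itself rather than the network.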
### Grafana Dashboards Slow
1. Check Prometheus query performance
2. Reduce time range
3. Optimize queries
4. Check resource usage
### Alerts Not Firing
1. Verify Alertmanager config
2. Check ntfy integration
3. Review alert rules syntax
4. Test with artificial alert
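
One way to carry out step 4 is to post a synthetic alert directly to Alertmanager's v2 API (`POST /api/v2/alerts`). This is a sketch: the alert name, labels, and the `ALERTMANAGER_URL` variable are made up for illustration, and whether the alert reaches ntfy depends on the routing config.

```bash
# Assumed address, taken from the service table; override if yours differs.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://192.168.0.210:9093}"

# Hypothetical helper: build a one-alert JSON payload with an illustrative label set.
test_alert_payload() {
  local name="${1:-RunbookTestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Synthetic alert from the runbook"}}]\n' "$name"
}

# Send the payload; the alert carries no end time, so it should clear on its own
# once Alertmanager's resolve timeout passes.
send_test_alert() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "$1")" \
    "${ALERTMANAGER_URL}/api/v2/alerts"
}

# Usage: send_test_alert RunbookTestAlert
```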
---
## Maintenance
### Weekly
- [ ] Review alert history
- [ ] Check disk space
- [ ] Verify backups
### Monthly
- [ ] Clean old metrics
- [ ] Update dashboards
- [ ] Review alert thresholds
### Quarterly
- [ ] Test alert notifications
- [ ] Review retention policy
- [ ] Optimize queries
---
## Backup Procedures
### Configuration
```bash
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/
# Prometheus rules
cp -r /opt/prometheus/rules /backup/
```
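
A plain `cp -r` overwrites files from the previous copy, so only the latest state is kept. If dated snapshots are wanted, a sketch along these lines could be used instead; `backup_dir` is a hypothetical helper and the archive naming is an assumption.

```bash
# Hypothetical helper: archive one config directory into a dated tarball.
backup_dir() {
  local src="$1" dest="${2:-/backup}"
  local stamp archive
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="${dest}/$(basename "$src")-${stamp}.tar.gz"
  mkdir -p "$dest"
  # -C keeps paths inside the archive relative to the parent directory.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
  echo "$archive"
}

# Usage:
# backup_dir /opt/grafana/dashboards
# backup_dir /opt/prometheus/rules
```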
### Ansible
```bash
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
```
---
## Emergency Procedures
### Prometheus Full
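1. Check storage: `docker system df`
2. Reduce retention. This is normally set with the `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size` flags on the Prometheus command line rather than in prometheus.yml
3. Free space as a last resort: `docker exec prometheus rm -rf /prometheus/wal/*`. The WAL holds the most recent samples, not old data, so this discards recent metrics; old blocks are dropped by the retention setting after a restart
4. Restart container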
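
To see how close the TSDB is to its size limit before it fills up, a rough check can compare the data directory against the configured limit. This is a sketch: `tsdb_usage_pct` is a hypothetical helper, the default path is the one used under Prometheus Configuration, and 1GB is treated as 1024*1024 KiB.

```bash
# Hypothetical helper: print TSDB disk usage as a whole-number percentage of a size limit.
# Treats 1GB as 1024*1024 KiB; integer division rounds down.
tsdb_usage_pct() {
  local dir="${1:-/var/lib/prometheus}" limit_gb="${2:-20}"
  local used_kb
  used_kb="$(du -sk "$dir" | awk '{ print $1 }')"
  echo $(( used_kb * 100 / (limit_gb * 1024 * 1024) ))
}

# Usage: warn once usage passes 90% of the limit.
# [ "$(tsdb_usage_pct)" -ge 90 ] && echo "Prometheus TSDB is close to its size limit"
```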
### VM Down
1. Check Proxmox: `qm list`
2. Start VM: `qm start <vmid>`
3. Check console: `qm terminal <vmid>`
4. Review logs in Proxmox UI
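
Steps 1 and 2 can be combined so the VM is started only when Proxmox reports it as stopped. The sketch below is meant to run on the Proxmox host and assumes `qm status <vmid>` prints a line such as `status: running`; `vm_state` and `ensure_vm_running` are hypothetical helper names.

```bash
# Hypothetical helper: extract the state word from `qm status <vmid>` output,
# which prints a line such as "status: running".
vm_state() {
  awk '/^status:/ { print $2 }'
}

# Start the VM only if Proxmox reports it as stopped. Run on the Proxmox host.
ensure_vm_running() {
  local vmid="$1" state
  state="$(qm status "$vmid" | vm_state)"
  if [ "$state" = "stopped" ]; then
    qm start "$vmid"
  else
    echo "VM ${vmid} is ${state}"
  fi
}

# Usage: ensure_vm_running <vmid>
```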
---
## Useful Commands
```bash
# SSH access
ssh homelab@192.168.0.210
# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart
# Check targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
```
---
## Links
- [Prometheus](http://192.168.0.210:9090)
- [Grafana](http://192.168.0.210:3000)
- [Alertmanager](http://192.168.0.210:9093)
- [Uptime Kuma](http://192.168.0.210:3001)