# Homelab VM Runbook

*Proxmox VM - Monitoring & DevOps*

**Endpoint ID:** 443399
**Status:** 🟢 Online
**Hardware:** 4 vCPU, 28GB RAM
**Access:** `192.168.0.210`

---

## Overview

The Homelab VM runs the monitoring and alerting stack, development services, and supporting databases on Proxmox VE.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |

## Services

### Monitoring Stack

| Service | Port | Purpose |
|---------|------|---------|
| **Prometheus** | 9090 | Metrics collection |
| **Grafana** | 3000 | Dashboards |
| **Alertmanager** | 9093 | Alert routing |
| **Node Exporter** | 9100 | System metrics |
| **cAdvisor** | 8080 | Container metrics |
| **Uptime Kuma** | 3001 | Uptime monitoring |

### Development

| Service | Port | Purpose |
|---------|------|---------|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |

### Database

| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |

---

## Daily Operations

### Check Monitoring
```bash
# Prometheus targets
curl -s http://192.168.0.210:9090/api/v1/targets | jq .

# Grafana dashboards
open http://192.168.0.210:3000
```

### Alert Status
```bash
# Alertmanager
open http://192.168.0.210:9093

# Check ntfy for alerts (poll cached messages as JSON)
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20
```

---

## Prometheus Configuration

### Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics

### Retention
- Time: 30 days
- Storage: 20GB

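Prometheus usually takes these limits as command-line flags rather than `prometheus.yml` settings. The following is a minimal sketch, assuming the stock `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size` flags; the helper name `retention_flags` is hypothetical and only illustrates how the policy above maps to flags.

```bash
#!/usr/bin/env bash
# Sketch: build the Prometheus flags that express the retention policy above.
# Whichever limit is reached first causes the oldest blocks to be removed.
retention_flags() {
  local keep_time="${1:-30d}" keep_size="${2:-20GB}"
  printf -- '--storage.tsdb.retention.time=%s --storage.tsdb.retention.size=%s\n' \
    "$keep_time" "$keep_size"
}

retention_flags 30d 20GB
```

The printed flags would typically go in the `command:` list of the Prometheus service in its compose file; where that file lives depends on the deployment.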
### Maintenance
```bash
# Check TSDB size
du -sh /var/lib/prometheus/

# List TSDB blocks (compaction runs automatically; promtool has no manual compact command)
docker exec prometheus promtool tsdb list -r /prometheus
```

---

## Grafana Dashboards

### Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics

### Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes

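The thresholds above could be written as a Prometheus rule file. This is a sketch only, assuming the rules are evaluated by Prometheus against standard Node Exporter metrics; the group name, alert names, and output path are illustrative and not taken from the live configuration. The generated file can be validated with `promtool check rules <file>` before it is loaded.

```bash
#!/usr/bin/env bash
# Sketch: emit a rule file matching the thresholds listed above.
# Alert names, group name, and the default path are assumptions.
write_alert_rules() {
  local out="$1"
  # Quoted heredoc delimiter: nothing inside is expanded by the shell
  cat > "$out" <<'EOF'
groups:
  - name: homelab-vm
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
      - alert: ServiceDown
        expr: up == 0
        for: 2m
EOF
}

write_alert_rules "${RULES_FILE:-/tmp/homelab-vm.rules.yml}"
```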
---

## Common Issues

### Prometheus Not Scraping
1. Check targets: Prometheus UI → Status → Targets
2. Verify network connectivity from the VM to each target's exporter port (9100 for Node Exporter, 8080 for cAdvisor)
3. Check firewall rules on the target host
4. Review scrape errors in the logs: `docker logs prometheus`

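The connectivity check in step 2 can be scripted. This is a rough sketch, assuming bash and the GNU `timeout` command are available on the VM; the function names are hypothetical, and the ports are the Node Exporter and cAdvisor ports from the Services table.

```bash
#!/usr/bin/env bash
# Sketch: test whether exporter ports accept TCP connections.
port_open() {
  local host="$1" port="$2"
  # bash's /dev/tcp opens a TCP connection; timeout guards against dropped packets
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

check_exporters() {
  local host port
  for host in "$@"; do
    for port in 9100 8080; do   # Node Exporter, cAdvisor
      if port_open "$host" "$port"; then
        echo "OK    ${host}:${port}"
      else
        echo "FAIL  ${host}:${port}"
      fi
    done
  done
}

# Usage: check_exporters 192.168.0.210
```

A `FAIL` line points at the network or firewall (steps 2 and 3); an `OK` line with a target that is still down points at the exporter or the scrape config.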
### Grafana Dashboards Slow
1. Check Prometheus query performance
2. Reduce the dashboard time range
3. Optimize slow queries
4. Check CPU and memory usage on the VM: `docker stats`

### Alerts Not Firing
1. Verify the Alertmanager config: `amtool check-config <file>`
2. Check the ntfy integration
3. Review alert rule syntax: `promtool check rules <file>`
4. Test with an artificial alert

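One way to carry out step 4 is to post a synthetic alert to Alertmanager's v2 API and watch for the notification in ntfy. This is a sketch under assumptions: the alert name and label values are made up for the test, and `ALERTMANAGER_URL` defaults to the address listed in this runbook. Calling `send_test_alert` sends the alert; it resolves on its own once it stops being re-sent.

```bash
#!/usr/bin/env bash
# Sketch: push a synthetic alert into Alertmanager to exercise routing.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://192.168.0.210:9093}"

test_alert_payload() {
  # The v2 API accepts a JSON array of alerts; these labels are illustrative
  cat <<'EOF'
[{"labels": {"alertname": "RunbookTestAlert", "severity": "info", "instance": "homelab-vm"},
  "annotations": {"summary": "Synthetic alert from the runbook, safe to ignore"}}]
EOF
}

send_test_alert() {
  # --data @- reads the request body from stdin
  test_alert_payload | curl -sS -X POST \
    -H 'Content-Type: application/json' \
    --data @- "${ALERTMANAGER_URL}/api/v2/alerts"
}
```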
---

## Maintenance

### Weekly
- [ ] Review alert history
- [ ] Check disk space
- [ ] Verify backups

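The disk space check can be made repeatable with a small script. This is a minimal sketch, assuming the 85% limit from the alert rules is the right threshold for the weekly check as well; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: warn when a filesystem is above a usage threshold.
disk_usage_pct() {
  # df -P gives stable POSIX columns; column 5 is the capacity, e.g. "42%"
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_disk() {
  local path="$1" limit="${2:-85}" used
  used="$(disk_usage_pct "$path")"
  if [ "$used" -gt "$limit" ]; then
    echo "WARN ${path} at ${used}% (limit ${limit}%)"
    return 1
  fi
  echo "OK   ${path} at ${used}%"
}

check_disk / 85 || true
```

The non-zero return code on `WARN` makes the function usable from cron or an Ansible task that should fail when the disk is nearly full.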
### Monthly
- [ ] Clean old metrics
- [ ] Update dashboards
- [ ] Review alert thresholds

### Quarterly
- [ ] Test alert notifications
- [ ] Review retention policy
- [ ] Optimize queries

---

## Backup Procedures

### Configuration
```bash
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
```

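A plain `cp -r` overwrites the previous copy on each run. If keeping several generations is useful, a dated archive is one option. This is a sketch only; the `backup_dir` helper and the archive naming scheme are assumptions, not an established convention of this setup.

```bash
#!/usr/bin/env bash
# Sketch: archive a config directory into a timestamped tarball.
backup_dir() {
  local src="$1" dest="$2" name stamp archive
  name="$(basename "$src")"
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="${dest}/${name}-${stamp}.tar.gz"
  mkdir -p "$dest"
  # -C keeps archive paths relative to the parent of the source directory
  tar -czf "$archive" -C "$(dirname "$src")" "$name"
  echo "$archive"
}

# Usage:
#   backup_dir /opt/grafana/dashboards /backup
#   backup_dir /opt/prometheus/rules /backup
```

The function prints the archive path, so a caller can verify it with `tar -tzf` as part of the weekly backup check.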
### Ansible
```bash
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
```

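---

## Emergency Procedures

### Prometheus Full
1. Check storage: `docker system df`
2. Reduce retention: `--storage.tsdb.retention.time` / `--storage.tsdb.retention.size` (in most setups these are command-line flags on the container, not `prometheus.yml` settings)
3. Last resort, clear the WAL: `docker exec prometheus rm -rf /prometheus/wal/*` (this discards the most recent, not-yet-compacted samples rather than old blocks)
4. Restart container
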
### VM Down
1. Check Proxmox: `qm list`
2. Start VM: `qm start <vmid>`
3. Check console: `qm terminal <vmid>`
4. Review logs in Proxmox UI

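Steps 1 and 2 can be combined into a guarded start. This is a sketch, assuming it runs on the Proxmox host where `qm` is available; the function name is hypothetical and the VM ID is passed as an argument because this runbook does not record it.

```bash
#!/usr/bin/env bash
# Sketch: start the VM only if Proxmox reports it as not running.
ensure_vm_running() {
  local vmid="$1" state
  # `qm status <vmid>` prints a line such as "status: running"
  state="$(qm status "$vmid" | awk '{ print $2 }')"
  if [ "$state" = "running" ]; then
    echo "VM ${vmid} already running"
  else
    echo "VM ${vmid} is '${state}', starting"
    qm start "$vmid"
  fi
}

# Usage: ensure_vm_running <vmid>
```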
---

## Useful Commands

```bash
# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# List targets that are down
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
```

---

## Links

- [Prometheus](http://192.168.0.210:9090)
- [Grafana](http://192.168.0.210:3000)
- [Alertmanager](http://192.168.0.210:9093)
- [Uptime Kuma](http://192.168.0.210:3001)