# Homelab VM Runbook

*Proxmox VM - Monitoring & DevOps*

- **Endpoint ID:** 443399
- **Status:** 🟢 Online
- **Hardware:** 4 vCPU, 28GB RAM
- **Access:** `192.168.0.210`

---

## Overview

The Homelab VM is a Proxmox guest that hosts the monitoring and alerting stack, the development services, and their supporting databases.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |

## Services

### Monitoring Stack

| Service | Port | Purpose |
|---------|------|---------|
| **Prometheus** | 9090 | Metrics collection |
| **Grafana** | 3000 | Dashboards |
| **Alertmanager** | 9093 | Alert routing |
| **Node Exporter** | 9100 | System metrics |
| **cAdvisor** | 8080 | Container metrics |
| **Uptime Kuma** | 3001 | Uptime monitoring |

### Development

| Service | Port | Purpose |
|---------|------|---------|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |

### Database

| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |

---

## Daily Operations

### Check Monitoring
```bash
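# Prometheus targets (-s hides curl's progress meter so only JSON reaches jq)
curl -s http://192.168.0.210:9090/api/v1/targets | jq .

# Grafana dashboards (open is macOS; use xdg-open on Linux)
open http://192.168.0.210:3000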
```

### Alert Status
```bash
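# Alertmanager (open is macOS; use xdg-open on Linux)
open http://192.168.0.210:9093

# Check ntfy for recent alerts (poll=1 returns cached messages and exits
# instead of holding the connection open)
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20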
```

---

## Prometheus Configuration

### Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics

### Retention
- Time: 30 days
- Storage: 20GB
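
Prometheus takes these limits from command-line flags rather than from `prometheus.yml`. The following is a minimal sketch of the flags that would match the values above, assuming the container accepts extra startup arguments; the helper name `retention_flags` is hypothetical and not part of this setup.

```bash
# Hypothetical helper: print the Prometheus retention flags for the limits above.
# --storage.tsdb.retention.time and --storage.tsdb.retention.size are real
# Prometheus flags; the function itself is only an illustration.
retention_flags() {
  local time="${1:-30d}"   # retention window
  local size="${2:-20GB}"  # maximum size of stored TSDB blocks
  printf -- '--storage.tsdb.retention.time=%s --storage.tsdb.retention.size=%s\n' "$time" "$size"
}

retention_flags
```

When both flags are set, whichever limit is reached first triggers removal of the oldest data.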

### Maintenance
```bash
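# Check TSDB size (host path; adjust if the data directory is mounted elsewhere)
du -sh /var/lib/prometheus/

# Inspect the TSDB. Prometheus compacts blocks automatically and promtool has
# no manual "compact" subcommand; analyze reports block and series statistics.
docker exec prometheus promtool tsdb analyze /prometheus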
```

---

## Grafana Dashboards

### Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics

### Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes
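
The thresholds above could be written as Prometheus alerting rules roughly as follows. This is a sketch only: it assumes the standard Node Exporter metric names, and the alert names, the output path, and the exact PromQL expressions are illustrative rather than taken from the deployed rules.

```bash
# Sketch: write an example rules file. Alert names, file name, and expressions
# are assumptions; compare against the real rules before using any of it.
write_example_rules() {
  local out="${1:-./homelab-alerts.example.yml}"
  cat > "$out" <<'EOF'
groups:
  - name: homelab-example
    rules:
      - alert: HighCpu
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
      - alert: HighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
      - alert: DiskAlmostFull
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
      - alert: ServiceDown
        expr: up == 0
        for: 2m
EOF
}

write_example_rules "${TMPDIR:-/tmp}/homelab-alerts.example.yml"
```

A file like this can be syntax-checked with `promtool check rules <file>` before Prometheus loads it.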

---

## Common Issues

### Prometheus Not Scraping
1. Check targets: Prometheus UI → Status → Targets
2. Verify network connectivity
3. Check firewall rules
4. Review scrape errors in logs

### Grafana Dashboards Slow
1. Check Prometheus query performance
2. Reduce time range
3. Optimize queries
4. Check resource usage

### Alerts Not Firing
1. Verify Alertmanager config
2. Check ntfy integration
3. Review alert rules syntax
4. Test with artificial alert
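
Step 4 can be sketched by posting a synthetic alert straight to Alertmanager's v2 API, which bypasses Prometheus and exercises only the routing and ntfy path. The function names and label values below are hypothetical; the endpoint `/api/v2/alerts` is Alertmanager's own.

```bash
# Build the JSON body for a synthetic alert (label values are illustrative).
test_alert_payload() {
  local name="${1:-RunbookTestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Manual test alert"}}]\n' "$name"
}

# Post the payload to Alertmanager's v2 alerts endpoint.
send_test_alert() {
  local url="${1:-http://192.168.0.210:9093}"
  test_alert_payload | curl -sS -X POST -H 'Content-Type: application/json' \
    --data-binary @- "$url/api/v2/alerts"
}

# Print the payload only; call send_test_alert to actually fire it.
test_alert_payload
```

Because the payload carries no end time, Alertmanager should treat the alert as resolved once its `resolve_timeout` elapses, so the test cleans up after itself.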

---

## Maintenance

### Weekly
- [ ] Review alert history
- [ ] Check disk space
- [ ] Verify backups

### Monthly
- [ ] Clean old metrics
- [ ] Update dashboards
- [ ] Review alert thresholds

### Quarterly
- [ ] Test alert notifications
- [ ] Review retention policy
- [ ] Optimize queries

---

## Backup Procedures

### Configuration
```bash
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
```
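
Plain `cp -r` overwrites the previous copy each time. If dated snapshots are wanted instead, a small wrapper along these lines may help; `backup_dir` is a hypothetical helper, not something that exists on the VM.

```bash
# Hypothetical helper: archive a config directory into a dated tarball
# and print the path of the archive it created.
backup_dir() {
  local src="$1" dest="$2"
  local stamp name
  stamp="$(date +%Y%m%d)"
  name="$(basename "$src")-${stamp}.tar.gz"
  mkdir -p "$dest"
  # -C keeps archive paths relative to the parent of the source directory
  tar -czf "$dest/$name" -C "$(dirname "$src")" "$(basename "$src")"
  printf '%s\n' "$dest/$name"
}

# Example, using the paths from the commands above:
# backup_dir /opt/grafana/dashboards /backup
# backup_dir /opt/prometheus/rules /backup
```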

### Ansible
```bash
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
```

---

## Emergency Procedures

### Prometheus Full
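1. Check storage: `df -h` on the host, then `docker system df` for Docker's own usage
2. Reduce retention with the `--storage.tsdb.retention.time` or `--storage.tsdb.retention.size` flags on the Prometheus container (retention is a startup flag, not a `prometheus.yml` setting); the oldest blocks are dropped on the next cleanup
3. Last resort only: `docker exec prometheus rm -rf /prometheus/wal/*`. This clears the write-ahead log, which holds the most recent samples not yet compacted into a block, so it discards new data rather than old data
4. Restart the container immediately afterwards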
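
For the storage check, a quick way to see how full the underlying filesystem is, sketched with hypothetical helpers `disk_usage_pct` and `check_disk`. The 85% default mirrors the disk threshold listed under Alert Rules; the path to pass in depends on where the Prometheus data directory is mounted, which this runbook does not state.

```bash
# Hypothetical helper: print the usage percentage of the filesystem holding a path.
disk_usage_pct() {
  local path="${1:-/}"
  # df -P gives stable POSIX columns; column 5 is the capacity, e.g. "42%"
  df -P "$path" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Report OK or WARN depending on whether usage exceeds the limit (default 85).
check_disk() {
  local path="${1:-/}" limit="${2:-85}" used
  used="$(disk_usage_pct "$path")"
  if [ "$used" -gt "$limit" ]; then
    echo "WARN: $path at ${used}% (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path at ${used}%"
}

# Demonstration with a limit that cannot be exceeded; use the real data path
# and the default limit in practice, e.g. check_disk /var/lib/prometheus
check_disk / 100
```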

### VM Down
1. Check Proxmox: `qm list`
2. Start VM: `qm start <vmid>`
3. Check console: `qm terminal <vmid>`
4. Review logs in Proxmox UI

---

## Useful Commands

```bash
# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# List targets that are currently down (-s keeps curl's progress meter out of the output)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
```

---

## Links

- [Prometheus](http://192.168.0.210:9090)
- [Grafana](http://192.168.0.210:3000)
- [Alertmanager](http://192.168.0.210:9093)
- [Uptime Kuma](http://192.168.0.210:3001)