Sanitized mirror from private repository - 2026-04-20 01:24:42 UTC

docs/infrastructure/hosts/homelab-vm-runbook.md

# Homelab VM Runbook

*Proxmox VM - Monitoring & DevOps*

**Endpoint ID:** 443399
**Status:** 🟢 Online
**Hardware:** 4 vCPU, 28GB RAM
**Access:** `192.168.0.210`

---

## Overview

Homelab VM runs monitoring, alerting, and development services on Proxmox.

## Hardware Specs

| Component | Specification |
|-----------|---------------|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |

## Services

### Monitoring Stack

| Service | Port | Purpose |
|---------|------|---------|
| **Prometheus** | 9090 | Metrics collection |
| **Grafana** | 3000 | Dashboards |
| **Alertmanager** | 9093 | Alert routing |
| **Node Exporter** | 9100 | System metrics |
| **cAdvisor** | 8080 | Container metrics |
| **Uptime Kuma** | 3001 | Uptime monitoring |

### Development

| Service | Port | Purpose |
|---------|------|---------|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |

### Database

| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |

---

## Daily Operations

### Check Monitoring
```bash
# Prometheus targets (-s hides curl's progress meter so only JSON reaches jq)
curl -s http://192.168.0.210:9090/api/v1/targets | jq .

# Grafana dashboards
open http://192.168.0.210:3000
```

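For a quicker daily check than opening each UI, the built-in health endpoints can be probed from a shell. The following is a minimal sketch, not part of the existing tooling: the function names and the `MON_HOST` variable are hypothetical, and it assumes the ports listed in the Services tables and that `curl` is installed.

```bash
# Hypothetical helper: probe the health endpoints of the monitoring stack.
MON_HOST="${MON_HOST:-192.168.0.210}"   # address from this runbook

# Print one "name url" pair per line for services with a health endpoint.
health_urls() {
  printf '%s\n' \
    "prometheus http://$1:9090/-/healthy" \
    "alertmanager http://$1:9093/-/healthy" \
    "grafana http://$1:3000/api/health"
}

# Probe each endpoint; curl -f turns HTTP error codes into a non-zero exit.
check_health() {
  health_urls "$1" | while read -r name url; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)"
    fi
  done
}

# Usage: check_health "$MON_HOST"
```

A `FAIL` line only says the endpoint did not answer with a success code within five seconds; the container logs remain the place to find out why.
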
### Alert Status
```bash
# Alertmanager
open http://192.168.0.210:9093

# Check ntfy for recent alerts (poll the topic's JSON feed instead of fetching the web UI)
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20
```

---

## Prometheus Configuration

### Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics

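The target groups above could map onto `scrape_configs` roughly as follows. This is an illustrative sketch only: the job names, the output path, and the single-host target lists are assumptions, and the real configuration may use different jobs or service discovery. The fragment is written to a scratch file so the live config is never touched.

```bash
# Write an illustrative scrape configuration to a scratch file (not the live config).
OUT="${OUT:-${TMPDIR:-/tmp}/prometheus.scrape-example.yml}"

cat > "$OUT" <<'EOF'
# Illustrative only: job names and target lists are assumptions.
scrape_configs:
  - job_name: prometheus          # Prometheus self-monitoring
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node                # Node Exporter, one entry per host
    static_configs:
      - targets: ['192.168.0.210:9100']

  - job_name: cadvisor            # cAdvisor, one entry per host
    static_configs:
      - targets: ['192.168.0.210:8080']
EOF

echo "wrote $OUT"
# If promtool is installed, the file can be validated with:
#   promtool check config "$OUT"
```
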
### Retention
- Time: 30 days
- Storage: 20GB

### Maintenance
```bash
# Check TSDB size
du -sh /var/lib/prometheus/

# Prometheus compacts blocks automatically and promtool has no manual
# compaction command; list the existing blocks to inspect their state
docker exec prometheus promtool tsdb list /prometheus
```

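To relate the TSDB size to the 20GB retention limit, the `du` output can be turned into a percentage. A small sketch under stated assumptions: the helper names are hypothetical, the budget is treated as 20 GiB, and `/var/lib/prometheus` is taken from the command above.

```bash
# Report TSDB usage as a percentage of the retention size budget.
TSDB_DIR="${TSDB_DIR:-/var/lib/prometheus}"   # path used in this runbook
BUDGET_GB="${BUDGET_GB:-20}"                  # "Storage: 20GB", treated as GiB

# Print the size of a directory in KiB.
dir_kib() {
  du -sk "$1" | awk '{ print $1 }'
}

# Print usage as an integer percentage: $1 = used KiB, $2 = budget in GiB.
budget_pct() {
  echo $(( $1 * 100 / ($2 * 1024 * 1024) ))
}

# Usage:
#   used=$(dir_kib "$TSDB_DIR")
#   echo "TSDB at $(budget_pct "$used" "$BUDGET_GB")% of ${BUDGET_GB} GiB"
```
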
---

## Grafana Dashboards

### Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics

### Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes

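If these thresholds are expressed as Prometheus alerting rules, they might look like the sketch below. The runbook does not say where the rules are defined, so this assumes Prometheus-style rules built on standard Node Exporter metrics; the alert names, group name, and output path are hypothetical.

```bash
# Write example alerting rules to a scratch file (not the live rules directory).
RULES="${RULES:-${TMPDIR:-/tmp}/homelab-alerts.example.yml}"

cat > "$RULES" <<'EOF'
# Illustrative only: alert and group names are assumptions.
groups:
  - name: homelab-example
    rules:
      - alert: HighCpu            # CPU > 80% for 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
      - alert: HighMemory         # Memory > 90% for 5 minutes
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
      - alert: DiskFilling        # Disk > 85%
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
      - alert: ServiceDown        # Service down > 2 minutes
        expr: up == 0
        for: 2m
EOF

echo "wrote $RULES"
# If promtool is installed, check the syntax with:
#   promtool check rules "$RULES"
```
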
---

## Common Issues

### Prometheus Not Scraping
1. Check targets: Prometheus UI → Status → Targets
2. Verify network connectivity
3. Check firewall rules
4. Review scrape errors in logs

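The first and last steps can be combined on the command line, since the targets API reports the last scrape error for each target. A minimal sketch, assuming `jq` is installed; the `down_targets` name is hypothetical.

```bash
# Print "scrapeUrl<TAB>lastError" for every target that is not healthy.
# Reads the JSON returned by /api/v1/targets on stdin.
down_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | [.scrapeUrl, .lastError] | @tsv'
}

# Usage:
#   curl -s http://192.168.0.210:9090/api/v1/targets | down_targets
# To test connectivity, fetch a reported scrapeUrl directly from the VM:
#   curl -sS --max-time 5 <scrapeUrl> | head
```

Empty output means every active target is currently reported as `up`.
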
### Grafana Dashboards Slow
1. Check Prometheus query performance
2. Reduce time range
3. Optimize queries
4. Check resource usage

### Alerts Not Firing
1. Verify Alertmanager config
2. Check ntfy integration
3. Review alert rules syntax
4. Test with artificial alert

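For the last step, one way to raise an artificial alert is to post it directly to Alertmanager's v2 API, which bypasses Prometheus and exercises only routing and the notification path. This is a sketch under assumptions: the alert name, the `severity` label, and the function names are hypothetical, and the routing config must match these labels for a notification to go out.

```bash
ALERTMANAGER="${ALERTMANAGER:-http://192.168.0.210:9093}"   # address from this runbook

# Build the JSON payload for one synthetic alert; $1 = alert name.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Synthetic alert from the runbook"}}]' "$1"
}

# Post the synthetic alert to Alertmanager.
send_test_alert() {
  test_alert_payload "${1:-RunbookTestAlert}" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$ALERTMANAGER/api/v2/alerts"
}

# Usage: send_test_alert RunbookTestAlert
```

Because the payload sets no end time, Alertmanager should resolve the alert on its own once its resolve timeout passes.
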
---

## Maintenance

### Weekly
- [ ] Review alert history
- [ ] Check disk space
- [ ] Verify backups

### Monthly
- [ ] Clean old metrics
- [ ] Update dashboards
- [ ] Review alert thresholds

### Quarterly
- [ ] Test alert notifications
- [ ] Review retention policy
- [ ] Optimize queries

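The weekly "Check disk space" item can be scripted against the same 85% threshold used in the alert rules. A minimal sketch with hypothetical helper names; it assumes a POSIX `df` and mount points without spaces in the device name.

```bash
# Print the used percentage (integer) of the filesystem holding a path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or below the threshold: $1 = path, $2 = percent.
disk_ok() {
  [ "$(disk_used_pct "$1")" -le "$2" ]
}

# Usage:
#   disk_ok / 85 || echo "disk above 85% on /"
```
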
---

## Backup Procedures

### Configuration
```bash
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
```

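Plain `cp -r` overwrites the previous copy each time. If keeping several generations is useful, a dated archive per directory is one option. This is a sketch, not the existing procedure: the `backup_dir` function and the archive naming scheme are assumptions.

```bash
# Create a dated tar.gz of one directory inside a backup root.
# $1 = source directory, $2 = backup root; prints the archive path.
backup_dir() {
  src=$1
  dest="$2/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$2" &&
    tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")" &&
    echo "$dest"
}

# Usage, with the paths from this runbook:
#   backup_dir /opt/grafana/dashboards /backup
#   backup_dir /opt/prometheus/rules /backup
```

Listing an archive with `tar -tzf` is a cheap way to cover the weekly "Verify backups" item.
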
### Ansible
```bash
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
```

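---

## Emergency Procedures

### Prometheus Full
1. Check storage: `docker system df`
2. Reduce retention: lower `--storage.tsdb.retention.time` or `--storage.tsdb.retention.size` (these are usually Prometheus startup flags rather than `prometheus.yml` settings)
3. As a last resort, clear the WAL: `docker exec prometheus rm -rf /prometheus/wal/*`. The WAL holds the most recent samples that have not yet been compacted into blocks, so this discards recent data, not old data
4. Restart container
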
### VM Down
1. Check Proxmox: `qm list`
2. Start VM: `qm start <vmid>`
3. Check console: `qm terminal <vmid>`
4. Review logs in Proxmox UI

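The first two steps can be wrapped so the VM is started only when it is not already running. A minimal sketch to run on the Proxmox host; the function names are hypothetical and the VM ID stays a placeholder, since this runbook does not state it.

```bash
# Extract the state from `qm status` output, e.g. "status: stopped" -> "stopped".
vm_state() {
  awk '/^status:/ { print $2 }'
}

# Start the VM only if it is not running; $1 = VM ID.
ensure_running() {
  state=$(qm status "$1" | vm_state)
  if [ "$state" = "running" ]; then
    echo "VM $1 is already running"
  else
    echo "VM $1 is '$state', starting"
    qm start "$1"
  fi
}

# Usage (on the Proxmox host): ensure_running <vmid>
```
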
---

## Useful Commands

```bash
# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# Check targets (-s hides curl's progress meter; prints only targets that are down)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
```

---

## Links

- [Prometheus](http://192.168.0.210:9090)
- [Grafana](http://192.168.0.210:3000)
- [Alertmanager](http://192.168.0.210:9093)
- [Uptime Kuma](http://192.168.0.210:3001)