Sanitized mirror from private repository - 2026-04-20 01:24:42 UTC

2026-04-20 01:24:42 +00:00
commit e71c8ddb4b
1441 changed files with 363888 additions and 0 deletions
--- a/docs/infrastructure/hosts/homelab-vm-runbook.md
+++ b/docs/infrastructure/hosts/homelab-vm-runbook.md
@@ -0,0 +1,218 @@
+# Homelab VM Runbook
+
+*Proxmox VM - Monitoring & DevOps*
+
+**Endpoint ID:** 443399  
+**Status:** 🟢 Online  
+**Hardware:** 4 vCPU, 28GB RAM  
+**Access:** `192.168.0.210`
+
+---
+
+## Overview
+
+Homelab VM runs monitoring, alerting, and development services on Proxmox.
+
+## Hardware Specs
+
+| Component | Specification |
+|----------|---------------|
+| Platform | Proxmox VE |
+| vCPU | 4 cores |
+| RAM | 28GB |
+| Storage | 100GB SSD |
+| Network | 1x 1GbE |
+
+## Services
+
+### Monitoring Stack
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| **Prometheus** | 9090 | Metrics collection |
+| **Grafana** | 3000 | Dashboards |
+| **Alertmanager** | 9093 | Alert routing |
+| **Node Exporter** | 9100 | System metrics |
+| **cAdvisor** | 8080 | Container metrics |
+| **Uptime Kuma** | 3001 | Uptime monitoring |
+
+### Development
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| Gitea | 3000 | Git hosting |
+| Gitea Runner | 3008 | CI/CD runner |
+| OpenHands | 8000 | AI developer |
+
+### Database
+
+| Service | Port | Purpose |
+|---------|------|---------|
+| PostgreSQL | 5432 | Database |
+| Redis | 6379 | Caching |
+
+---
+
+## Daily Operations
+
+### Check Monitoring
+```bash
+# Prometheus targets
+curl -s http://192.168.0.210:9090/api/v1/targets | jq .
+
+# Grafana dashboards (`open` is macOS; use `xdg-open` on Linux)
+open http://192.168.0.210:3000
+```
+
+### Alert Status
+```bash
+# Alertmanager (`open` is macOS; use `xdg-open` on Linux)
+open http://192.168.0.210:9093
+
+# Check ntfy for alerts (poll cached messages as JSON, then exit)
+curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20
+```
+
+---
+
+## Prometheus Configuration
+
+### Scraping Targets
+- Node exporters (all hosts)
+- cAdvisor (all hosts)
+- Prometheus self-monitoring
+- Application-specific metrics
+
+### Retention
+- Time: 30 days
+- Storage: 20GB
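+
+A minimal sketch of how these limits are commonly expressed as Prometheus startup flags. The config and data paths are assumptions (the defaults of the official image), not taken from this host. Prometheus removes old blocks when either limit is reached, whichever comes first.
+
+```bash
+# Assumed invocation: adjust paths to match how the container is actually started.
+prometheus \
+  --config.file=/etc/prometheus/prometheus.yml \
+  --storage.tsdb.path=/prometheus \
+  --storage.tsdb.retention.time=30d \
+  --storage.tsdb.retention.size=20GB
+```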
+
+### Maintenance
+```bash
+# Check TSDB size
+du -sh /var/lib/prometheus/
+
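+# Compaction runs automatically; promtool has no manual compact subcommand.
+# Inspect the existing TSDB blocks instead:
+docker exec prometheus promtool tsdb list --human-readable /prometheus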
+```
+
+---
+
+## Grafana Dashboards
+
+### Key Dashboards
+- Infrastructure Overview
+- Container Health
+- Network Traffic
+- Service-specific metrics
+
+### Alert Rules
+- CPU > 80% for 5 minutes
+- Memory > 90% for 5 minutes
+- Disk > 85%
+- Service down > 2 minutes
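+
+The thresholds above can be spot-checked against the Prometheus query API. This is a sketch that assumes the standard node_exporter metric names. It evaluates the condition at this instant only, so the "for 5 minutes" hold period of the actual rules is not reproduced.
+
+```bash
+# Instances whose CPU usage is currently above 80% (assumes node_exporter metrics)
+curl -sG http://192.168.0.210:9090/api/v1/query \
+  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80' \
+  | jq '.data.result'
+
+# Instances whose memory usage is currently above 90%
+curl -sG http://192.168.0.210:9090/api/v1/query \
+  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90' \
+  | jq '.data.result'
+```
+
+An empty result list means no instance currently exceeds the threshold.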
+
+---
+
+## Common Issues
+
+### Prometheus Not Scraping
+1. Check targets: Prometheus UI → Status → Targets
+2. Verify network connectivity
+3. Check firewall rules
+4. Review scrape errors in logs
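+
+Steps 1 and 4 can be combined from the command line. The sketch below lists only unhealthy targets together with the last scrape error reported by the targets API.
+
+```bash
+# Show job, scrape URL, and last error for every target that is not "up"
+curl -s http://192.168.0.210:9090/api/v1/targets \
+  | jq '.data.activeTargets[]
+        | select(.health != "up")
+        | {job: .labels.job, url: .scrapeUrl, error: .lastError}'
+```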
+
+### Grafana Dashboards Slow
+1. Check Prometheus query performance
+2. Reduce time range
+3. Optimize queries
+4. Check resource usage
+
+### Alerts Not Firing
+1. Verify Alertmanager config
+2. Check ntfy integration
+3. Review alert rules syntax
+4. Test with a synthetic alert
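+
+One way to run the test in step 4 is to post an alert straight to the Alertmanager v2 API, bypassing Prometheus. The alert name, labels, and annotation text are hypothetical and may need to match the routes in the local Alertmanager config.
+
+```bash
+# Send a test alert directly to Alertmanager (label values are illustrative)
+curl -s -X POST http://192.168.0.210:9093/api/v2/alerts \
+  -H 'Content-Type: application/json' \
+  -d '[{
+        "labels": {"alertname": "TestAlert", "severity": "info", "instance": "homelab-vm"},
+        "annotations": {"summary": "Manual test alert from the runbook"}
+      }]'
+```
+
+If routing works, a notification should arrive on the ntfy topic. The alert should clear on its own once Alertmanager's resolve timeout passes without it being re-sent.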
+
+---
+
+## Maintenance
+
+### Weekly
+- [ ] Review alert history
+- [ ] Check disk space
+- [ ] Verify backups
+
+### Monthly
+- [ ] Clean old metrics
+- [ ] Update dashboards
+- [ ] Review alert thresholds
+
+### Quarterly
+- [ ] Test alert notifications
+- [ ] Review retention policy
+- [ ] Optimize queries
+
+---
+
+## Backup Procedures
+
+### Configuration
+```bash
+# Grafana dashboards
+cp -r /opt/grafana/dashboards /backup/
+
+# Prometheus rules
+cp -r /opt/prometheus/rules /backup/
+```
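+
+Plain `cp -r` overwrites the previous copy each time. A minimal sketch of a dated archive instead, so older backups are kept; `backup_dir` is a hypothetical helper, not part of the existing tooling.
+
+```bash
+# Archive a directory into <dest>/<name>-YYYYMMDD.tar.gz and print the archive path.
+backup_dir() {
+  local src="$1" dest="$2"
+  local name
+  name="$(basename "$src")-$(date +%Y%m%d).tar.gz"
+  mkdir -p "$dest"
+  # -C keeps paths inside the archive relative to the parent directory
+  tar -czf "$dest/$name" -C "$(dirname "$src")" "$(basename "$src")"
+  echo "$dest/$name"
+}
+```
+
+Usage: `backup_dir /opt/grafana/dashboards /backup` and `backup_dir /opt/prometheus/rules /backup`.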
+
+### Ansible
+```bash
+ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
+```
+
+---
+
+## Emergency Procedures
+
+### Prometheus Full
+1. Check storage: `docker system df`
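+2. Reduce retention (typically the `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size` startup flags, not `prometheus.yml`)
+3. Last resort, clear the WAL: `docker exec prometheus rm -rf /prometheus/wal/*` (this discards the most recent, not-yet-compacted samples rather than old data)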
+4. Restart container
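+
+Before deleting anything, it helps to confirm how full the filesystem really is. A small sketch, where `usage_pct` is a hypothetical helper that prints the used percentage of the filesystem holding a path:
+
+```bash
+# Print the used-space percentage (without the % sign) for the filesystem containing a path.
+usage_pct() {
+  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
+}
+```
+
+Usage: `usage_pct /var/lib/prometheus` prints a bare number that can be compared against the 85% disk alert threshold.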
+
+### VM Down
+1. Check Proxmox: `qm list`
+2. Start VM: `qm start <vmid>`
+3. Check console: `qm terminal <vmid>`
+4. Review logs in Proxmox UI
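+
+Steps 1 and 2 can be sketched as a single check run on the Proxmox host. `ensure_running` is a hypothetical helper; it relies on `qm status` printing a `status: running` line for a running VM.
+
+```bash
+# Start the VM only if Proxmox does not already report it as running.
+ensure_running() {
+  local vmid="$1"
+  if qm status "$vmid" | grep -q 'status: running'; then
+    echo "VM $vmid already running"
+  else
+    qm start "$vmid"
+  fi
+}
+```
+
+Usage: `ensure_running <vmid>`, with the VM ID taken from `qm list`.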
+
+---
+
+## Useful Commands
+
+```bash
+# SSH access
+ssh homelab@192.168.0.210
+
+# Restart monitoring
+cd /opt/docker/prometheus && docker-compose restart
+cd /opt/docker/grafana && docker-compose restart
+
+# Check targets
+curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
+
+# View logs
+docker logs prometheus
+docker logs grafana
+docker logs alertmanager
+```
+
+---
+
+## Links
+
+- [Prometheus](http://192.168.0.210:9090)
+- [Grafana](http://192.168.0.210:3000)
+- [Alertmanager](http://192.168.0.210:9093)
+- [Uptime Kuma](http://192.168.0.210:3001)