Sanitized mirror from private repository - 2026-04-20 01:24:42 UTC
228
docs/infrastructure/hosts/atlantis-runbook.md
Normal file
@@ -0,0 +1,228 @@
# Atlantis Runbook

*Synology DS1821+ - Primary NAS and Media Server*

**Endpoint ID:** 2
**Status:** 🟢 Online
**Hardware:** AMD Ryzen V1500B, 32GB RAM, 8 bays
**Access:** `atlantis.vish.local`

---

## Overview

Atlantis is the primary Synology NAS serving as the homelab's central storage and media infrastructure.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Model | Synology DS1821+ |
| CPU | AMD Ryzen V1500B (4-core) |
| RAM | 32GB |
| Storage | 8-bay RAID6 + SSD cache |
| Network | 4x 1GbE (Link aggregated) |

## Services

### Critical Services

| Service | Port | Purpose | Docker Image |
|---------|------|---------|--------------|
| **Vaultwarden** | 8080 | Password manager | vaultwarden/server |
| **Immich** | 2283 | Photo backup | ghcr.io/immich-app/immich-server |
| **Plex** | 32400 | Media server | plexinc/pms-docker |
| **Ollama** | 11434 | AI/ML | ollama/ollama |

### Media Stack

| Service | Port | Purpose |
|---------|------|---------|
| arr-suite | Various | Sonarr, Radarr, Lidarr, Prowlarr |
| qBittorrent | 8080 | Download client |
| Jellyseerr | 5055 | Media requests |

### Infrastructure

| Service | Port | Purpose |
|---------|------|---------|
| Portainer | 9000 | Container management |
| Watchtower | 9001 | Auto-updates |
| Dozzle | 8081 | Log viewer |
| Nginx Proxy Manager | 81/444 | Legacy proxy |

### Additional Services

- Jitsi (Video conferencing)
- Matrix/Synapse (Chat)
- Mastodon (Social)
- Paperless-NGX (Documents)
- Syncthing (File sync)
- Grafana + Prometheus (Monitoring)

---

## Storage Layout

```
/volume1/
├── docker/           # Docker volumes
├── docker/compose/   # Service configurations
├── media/            # Media files
│   ├── movies/
│   ├── tv/
│   ├── music/
│   └── books/
├── photos/           # Immich storage
├── backups/          # Backup destination
└── shared/           # Shared folders
```

---

## Daily Operations

### Check Service Health
```bash
# Via Portainer
open http://atlantis.vish.local:9000

# Via SSH
ssh admin@atlantis.vish.local
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```
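
The status column from `docker ps` can be filtered so that only containers needing attention are listed. The sketch below is a minimal illustration, assuming a `Name|Status` listing; the `unhealthy_containers` helper is hypothetical and not part of the original runbook.

```bash
# Hypothetical helper: reads "name|status" lines on stdin and prints the names
# of containers that are not running cleanly (exited, restarting, or unhealthy).
unhealthy_containers() {
  awk -F '|' '$2 !~ /^Up/ || $2 ~ /unhealthy/ { print $1 }'
}

# Usage on the host (requires Docker):
#   docker ps -a --format '{{.Names}}|{{.Status}}' | unhealthy_containers
```

An empty result means every container reports an `Up` status without an `unhealthy` health check.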

### Check Disk Usage
```bash
# SSH to Atlantis
ssh admin@atlantis.vish.local

# Synology storage manager
sudo syno-storage-usage -a

# Volume usage with standard df
df -h /volume1

# Or via Docker
docker system df
```

### View Logs
```bash
# Specific service
docker logs vaultwarden

# Follow logs
docker logs -f vaultwarden
```

---

## Common Issues

### Service Won't Start
1. Check if port is already in use: `netstat -tulpn | grep <port>`
2. Check logs: `docker logs <container>`
3. Verify volume paths exist
4. Restart Docker: on DSM, restart the Container Manager package from Package Center or with `sudo synopkg restart ContainerManager` (the package is named `Docker` on DSM releases before 7.2); `sudo systemctl restart docker` applies to non-Synology hosts
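
A plain `grep <port>` also matches longer port numbers (searching for `80` hits `8080`). The sketch below shows one way to match the port exactly; `port_in_use` is a hypothetical helper, and it assumes listener lines in the format printed by `netstat -tuln`.

```bash
# Hypothetical helper: reads listener lines (as printed by `netstat -tuln`)
# on stdin and succeeds if any address field ends in ":<port>".
port_in_use() {
  awk -v port="$1" '
    { for (i = 1; i <= NF; i++) if ($i ~ (":" port "$")) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage on the host:
#   netstat -tuln | port_in_use 8080 && echo "port 8080 is taken"
```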

### Storage Full
1. Identify large images, containers, and volumes: `docker system df -v`
2. Clean Docker: `docker system prune -a` (removes stopped containers and every image not used by a container, not only dangling ones)
3. Check Synology Storage Analyzer
4. Archive old media files
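
Before pruning, it can help to confirm how full the volume actually is. This is a minimal sketch built on standard `df -P` output; both helper names and the 85% threshold are assumptions for illustration, not values taken from this runbook.

```bash
# Hypothetical helper: prints the used-space percentage (without the % sign)
# of the filesystem that holds the given path.
volume_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeeds when usage is at or above the given threshold.
volume_is_full() {
  [ "$(volume_usage_pct "$1")" -ge "$2" ]
}

# Usage on the host (the 85 threshold is an assumption):
#   volume_is_full /volume1 85 && echo "/volume1 is above 85%"
```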

### Performance Issues
1. Check resource usage: `docker stats`
2. Review Plex transcode logs
3. Check RAID health: `cat /proc/mdstat`, then `sudo mdadm --detail /dev/md2` (on Synology, `md0` and `md1` are normally the system and swap arrays; data volumes usually start at `md2`)

---

## Maintenance

### Weekly
- [ ] Verify backup completion
- [ ] Check disk health (S.M.A.R.T.)
- [ ] Review Watchtower updates
- [ ] Check Plex library integrity

### Monthly
- [ ] Run Docker cleanup
- [ ] Update Docker Compose files
- [ ] Review storage usage trends
- [ ] Check security updates

### Quarterly
- [ ] Deep clean unused images/containers
- [ ] Review service dependencies
- [ ] Test disaster recovery
- [ ] Update documentation

---

## Backup Procedures

### Configuration Backup
```bash
# Via Ansible
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags atlantis
```

### Data Backup
- Synology Hyper Backup to external drive
- Cloud sync to Backblaze B2
- Critical configs to Git repository

### Verification
```bash
ansible-playbook ansible/automation/playbooks/backup_verification.yml
```

---

## Emergency Procedures

### Complete Outage
1. Verify Synology is powered on
2. Check network connectivity
3. Access via DSM: `https://atlantis.vish.local:5001`
4. Check Storage Manager for RAID status
5. Contact via serial if no network

### RAID Degraded
1. Identify failed drive via Storage Manager
2. Replace the failed drive (the DS1821+ bays are hot-swappable, so powering down is optional)
3. Start the repair from Storage Manager if the rebuild does not begin automatically
4. Monitor rebuild progress

### Data Recovery
See [Disaster Recovery Guide](../troubleshooting/disaster-recovery.md)

---

## Useful Commands

```bash
# SSH access
ssh admin@atlantis.vish.local

# Container management
cd /volume1/docker/compose/<service>
docker-compose restart <service>

# View all containers
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Logs for critical services
docker logs vaultwarden
docker logs plex
docker logs immich
```

---

## Links

- [Synology DSM](https://atlantis.vish.local:5001)
- [Portainer](http://atlantis.vish.local:9000)
- [Vaultwarden](http://atlantis.vish.local:8080)
- [Plex](http://atlantis.vish.local:32400)
- [Immich](http://atlantis.vish.local:2283)
237
docs/infrastructure/hosts/calypso-runbook.md
Normal file
@@ -0,0 +1,237 @@
# Calypso Runbook

*Synology DS723+ - Secondary NAS and Infrastructure*

**Endpoint ID:** 443397
**Status:** 🟢 Online
**Hardware:** AMD Ryzen R1600, 32GB RAM, 2 bays + expansion
**Access:** `calypso.vish.local`

---

## Overview

Calypso is the secondary Synology NAS handling critical infrastructure services including authentication, reverse proxy, and monitoring.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Model | Synology DS723+ |
| CPU | AMD Ryzen R1600 (2-core/4-thread) |
| RAM | 32GB |
| Storage | 2-bay SHR + eSATA expansion |
| Network | 2x 1GbE |

## Services

### Critical Infrastructure

| Service | Port | Purpose | Status |
|---------|------|---------|--------|
| **Nginx Proxy Manager** | 80/443 | SSL termination & routing | Required |
| **Authentik** | 9000 | SSO authentication | Required |
| **Prometheus** | 9090 | Metrics collection | Required |
| **Grafana** | 3000 | Dashboards | Required |
| **Alertmanager** | 9093 | Alert routing | Required |

### Additional Services

| Service | Port | Purpose |
|---------|------|---------|
| AdGuard | 3053 | DNS filtering (backup) |
| Paperless-NGX | 8000 | Document management |
| Reactive Resume | 3001 | Resume builder |
| Gitea | 3000/22 | Git hosting |
| Gitea Runner | 3008 | CI/CD |
| Headscale | 8080 | WireGuard VPN controller |
| Seafile | 8082 | File sync & share |
| Syncthing | 8384 | File sync |
| WireGuard | 51820 | VPN server |
| Portainer Agent | 9001 | Container management |

### Media (ARR Stack)

- Sonarr, Radarr, Lidarr
- Prowlarr (indexers)
- Bazarr (subtitles)

---

## Storage Layout

```
/volume1/
├── docker/
├── docker/compose/
├── appdata/          # Application data
│   ├── authentik/
│   ├── npm/
│   ├── prometheus/
│   └── grafana/
├── documents/        # Paperless
├── seafile/          # Seafile data
└── backups/          # Backup destination
```

---

## Daily Operations

### Check Service Health
```bash
# Via Portainer (port 9001 on this host is the Portainer Agent, which serves no web UI;
# open the Portainer server, assumed here to be the instance on Atlantis, and pick the Calypso endpoint)
open http://atlantis.vish.local:9000

# Via SSH
ssh admin@calypso.vish.local
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```

### Monitor Critical Services
```bash
# Check NPM
curl -I http://localhost:80

# Check Authentik
curl -I http://localhost:9000

# Check Prometheus
curl -I http://localhost:9090
```
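
The three `curl -I` checks can be reduced to a pass/fail verdict per endpoint. The following is a sketch under stated assumptions: both helpers are hypothetical, and treating any 2xx or 3xx response as healthy is a simplification (a login redirect counts as "up").

```bash
# Hypothetical helper: treats 2xx and 3xx responses as healthy. curl reports
# 000 when it cannot connect at all, which falls through to "unhealthy".
status_is_healthy() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: fetches only the status code for a URL and prints a verdict.
check_endpoint() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
  if status_is_healthy "$code"; then
    echo "OK   $1 ($code)"
  else
    echo "FAIL $1 ($code)"
    return 1
  fi
}

# Usage on Calypso:
#   for port in 80 9000 9090; do check_endpoint "http://localhost:$port"; done
```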

---

## Common Issues

### NPM Not Routing
1. Check if NPM is running: `docker ps | grep npm`
2. Verify proxy hosts configured: Access NPM UI → Proxy Hosts
3. Check SSL certificates
4. Review NPM logs: `docker logs nginx-proxy-manager`

### Authentik SSO Broken
1. Check Authentik running: `docker ps | grep authentik`
2. Verify PostgreSQL: `docker logs authentik-postgresql`
3. Check Redis: `docker logs authentik-redis`
4. Review OIDC configurations in services

### Prometheus Down
1. Check storage: `docker system df`
2. Verify volume: `docker volume ls | grep prometheus`
3. Check retention settings
4. Review logs: `docker logs prometheus`

---

## Maintenance

### Weekly
- [ ] Verify Authentik users can login
- [ ] Check Prometheus metrics collection
- [ ] Review Alertmanager notifications
- [ ] Verify NPM certificates

### Monthly
- [ ] Clean unused Docker images
- [ ] Review Prometheus retention
- [ ] Update applications
- [ ] Check disk usage

### Quarterly
- [ ] Test OAuth flows
- [ ] Verify backup restoration
- [ ] Review monitoring thresholds
- [ ] Update SSL certificates

---

## SSL Certificate Management

NPM handles all SSL certificates:

1. **Automatic Renewal**: Let's Encrypt (default)
2. **Manual**: Access NPM → SSL Certificates → Add
3. **Check Status**: NPM Dashboard → SSL

### Common Certificate Issues
- Rate limits: Wait 1 hour between requests
- DNS challenge: Verify external DNS
- Self-signed: Use for internal services

---

## Backup Procedures

### Configuration Backup
```bash
# Via Ansible
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags calypso
```

### Key Data to Backup
- NPM configurations: `/volume1/docker/compose/nginx_proxy_manager/`
- Authentik: `/volume1/docker/appdata/authentik/`
- Prometheus: `/volume1/docker/appdata/prometheus/`
- Grafana: `/volume1/docker/appdata/grafana/`

---

## Emergency Procedures

### Authentik Down
**Impact**: SSO broken for all services

1. Verify containers running
2. Check PostgreSQL: `docker logs authentik-postgresql`
3. Check Redis: `docker logs authentik-redis`
4. Restart Authentik: `docker-compose restart`
5. If needed, restore from backup

### NPM Down
**Impact**: No external access

1. Verify container: `docker ps | grep npm`
2. Check ports 80/443: `netstat -tulpn | grep -E ':(80|443) '`
3. Restart: `docker-compose restart`
4. Check DNS resolution

### Prometheus Full
**Impact**: No metrics

1. Check storage: `docker system df`
2. Reduce retention: lower `--storage.tsdb.retention.time` or `--storage.tsdb.retention.size` on the Prometheus container (retention is set by command-line flags, not in `prometheus.yml`)
3. Clean old data through the TSDB admin API (requires starting Prometheus with `--web.enable-admin-api`): `curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="<job>"}'`, then `curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones`
4. Restart container

---

## Useful Commands

```bash
# SSH access
ssh admin@calypso.vish.local

# Check critical services
docker ps --filter "name=nginx" --filter "name=authentik" --filter "name=prometheus"

# Restart infrastructure
cd /volume1/docker/compose/nginx_proxy_manager && docker-compose restart
cd /volume1/docker/compose/authentik && docker-compose restart

# View logs
docker logs -f nginx-proxy-manager
docker logs -f authentik-server
docker logs -f prometheus
```

---

## Links

- [Synology DSM](https://calypso.vish.local:5001)
- [Nginx Proxy Manager](http://calypso.vish.local:81)
- [Authentik](http://calypso.vish.local:9000)
- [Prometheus](http://calypso.vish.local:9090)
- [Grafana](http://calypso.vish.local:3000)
- [Alertmanager](http://calypso.vish.local:9093)
244
docs/infrastructure/hosts/concord-nuc-runbook.md
Normal file
@@ -0,0 +1,244 @@
# Concord NUC Runbook

*Intel NUC6i3SYB - Home Automation & DNS*

**Endpoint ID:** 443398
**Status:** 🟢 Online
**Hardware:** Intel Core i3-6100U, 16GB RAM, 256GB SSD
**Access:** `concordnuc.vish.local`

---

## Overview

Concord NUC runs lightweight services focused on home automation, DNS filtering, and local network services.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Model | Intel NUC6i3SYB |
| CPU | Intel Core i3-6100U (2-core) |
| RAM | 16GB |
| Storage | 256GB SSD |
| Network | 1x 1GbE |

## Services

### Critical Services

| Service | Port | Purpose | Docker Image |
|---------|------|---------|---------------|
| **AdGuard Home** | 3053/53 | DNS filtering | adguard/adguardhome |
| **Home Assistant** | 8123 | Home automation | homeassistant/home-assistant |
| **Matter Server** | 5580 | Matter protocol | ghcr.io/home-assistant-libs/python-matter-server |

### Additional Services

| Service | Port | Purpose |
|---------|------|---------|
| Plex | 32400 | Media server |
| Invidious | 2999 | YouTube frontend |
| Piped | 1234 | YouTube music |
| Syncthing | 8384 | File sync |
| WireGuard | 51820 | VPN server |
| Portainer Agent | 9001 | Container management |
| Node Exporter | 9100 | Metrics |

---

## Network Position

```
Internet
    │
    ▼
[Home Router] ──WAN──► (Public IP)
    │
    ├─► [Pi-hole Primary]
    │
    └─► [AdGuard Home] ──► Local DNS
            │
            ▼
    [Home Assistant] ──► Zigbee/Z-Wave
```

---

## Daily Operations

### Check Service Health
```bash
# Via Portainer (port 9001 on this host is the Portainer Agent, which serves no web UI;
# open the Portainer server, assumed here to be the instance on Atlantis, and pick the Concord NUC endpoint)
open http://atlantis.vish.local:9000

# Via SSH
ssh homelab@concordnuc.vish.local
docker ps
```

### Home Assistant
```bash
# Access UI
open http://concordnuc.vish.local:8123

# Check logs
docker logs homeassistant
```

### AdGuard Home
```bash
# Access UI
open http://concordnuc.vish.local:3053

# Check DNS filtering
# Admin → Dashboard → DNS Queries
```

---

## Common Issues

### Home Assistant Won't Start
1. Check logs: `docker logs homeassistant`
2. Verify config: `config/configuration.yaml`
3. Check Zigbee/Z-Wave stick
4. Restore from backup if needed

### AdGuard Not Filtering
1. Check service: `docker ps | grep adguard`
2. Verify DNS settings on router
3. Check filter lists: Admin → Filters
4. Review query log
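
Filtering can also be spot-checked from the command line by resolving a name that should be blocked. This is a minimal sketch, assuming AdGuard Home is left in its default blocking mode, where blocked names resolve to the zero address; both helpers and the example domain are hypothetical.

```bash
# Hypothetical helper: succeeds when a DNS answer is the "zero" address that
# AdGuard Home returns for blocked names in its default blocking mode.
answer_is_blocked() {
  case "$1" in
    0.0.0.0|::) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: resolves a name against a resolver (default 127.0.0.1)
# and reports whether the first answer looks blocked.
check_filtering() {
  answer=$(dig +short "$1" "@${2:-127.0.0.1}" | head -n 1)
  if answer_is_blocked "$answer"; then
    echo "$1 is blocked"
  else
    echo "$1 resolves to ${answer:-nothing}"
  fi
}

# Usage on the NUC (doubleclick.net is only an example of a commonly blocked name):
#   check_filtering doubleclick.net
```

If a different blocking mode is configured (for example NXDOMAIN), the check in `answer_is_blocked` would need to change accordingly.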

### No Network Connectivity
1. Check Docker: `systemctl status docker`
2. Verify network: `ip addr`
3. Check firewall: `sudo ufw status`

---

## Home Assistant Configuration

### Add-ons Running
- Zigbee2MQTT
- Z-Wave JS UI
- File editor
- Terminal

### Backup
```bash
# Manual backup via UI:
#   Configuration → Backups → Create backup

# Automated to Synology:
#   Syncthing → Backups/homeassistant/
```

### Restoration
1. Access HA in safe mode
2. Configuration → Backups
3. Select backup → Restore

---

## AdGuard Home Configuration

### DNS Providers
- Cloudflare: 1.1.1.1
- Google: 8.8.8.8

### Blocklists Enabled
- AdGuard Default
- AdAway
- Malware domains

### Query Log
Access: Admin → Logs
- Useful for debugging DNS issues
- Check for blocked domains

---

## Maintenance

### Weekly
- [ ] Check HA logs for errors
- [ ] Review AdGuard query log
- [ ] Verify backups completed

### Monthly
- [ ] Update Home Assistant
- [ ] Review AdGuard filters
- [ ] Clean unused Docker images

### Quarterly
- [ ] Test automation reliability
- [ ] Review device states
- [ ] Check Zigbee network health

---

## Emergency Procedures

### Home Assistant Down
**Impact**: Smart home controls unavailable

1. Check container: `docker ps | grep homeassistant`
2. Restart: `docker-compose restart`
3. Check logs: `docker logs homeassistant`
4. If corrupted, restore from backup

### AdGuard Down
**Impact**: DNS issues on network

1. Verify: `dig google.com @localhost`
2. Restart: `docker-compose restart`
3. Check config in UI
4. Fallback to Pi-hole

### Complete Hardware Failure
1. Replace NUC hardware
2. Reinstall Ubuntu/Debian
3. Run deploy playbook:
```bash
ansible-playbook ansible/homelab/playbooks/deploy_concord_nuc.yml
```

---

## Useful Commands

```bash
# SSH access
ssh homelab@concordnuc.vish.local

# Restart services
docker-compose -f /opt/docker/compose/homeassistant.yaml restart
docker-compose -f /opt/docker/compose/adguard.yaml restart

# View logs
docker logs -f homeassistant
docker logs -f adguard

# Check resource usage
docker stats
```

---

## Device Access

| Device | Protocol | Address |
|--------|----------|---------|
| Zigbee Coordinator | USB | /dev/serial/by-id/* |
| Z-Wave Controller | USB | /dev/serial/by-id/* |

---

## Links

- [Home Assistant](http://concordnuc.vish.local:8123)
- [AdGuard Home](http://concordnuc.vish.local:3053)
- [Plex](http://concordnuc.vish.local:32400)
- [Invidious](http://concordnuc.vish.local:2999)
218
docs/infrastructure/hosts/homelab-vm-runbook.md
Normal file
@@ -0,0 +1,218 @@
# Homelab VM Runbook

*Proxmox VM - Monitoring & DevOps*

**Endpoint ID:** 443399
**Status:** 🟢 Online
**Hardware:** 4 vCPU, 28GB RAM
**Access:** `192.168.0.210`

---

## Overview

Homelab VM runs monitoring, alerting, and development services on Proxmox.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |

## Services

### Monitoring Stack

| Service | Port | Purpose |
|---------|------|---------|
| **Prometheus** | 9090 | Metrics collection |
| **Grafana** | 3000 | Dashboards |
| **Alertmanager** | 9093 | Alert routing |
| **Node Exporter** | 9100 | System metrics |
| **cAdvisor** | 8080 | Container metrics |
| **Uptime Kuma** | 3001 | Uptime monitoring |

### Development

| Service | Port | Purpose |
|---------|------|---------|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |

### Database

| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |

---

## Daily Operations

### Check Monitoring
```bash
# Prometheus targets
curl -s http://192.168.0.210:9090/api/v1/targets | jq .

# Grafana dashboards
open http://192.168.0.210:3000
```

### Alert Status
```bash
# Alertmanager
open http://192.168.0.210:9093

# Check ntfy for alerts (poll cached messages as JSON, one per line)
curl -s "ntfy.vish.local/homelab-alerts/json?poll=1" | head -20
```

---

## Prometheus Configuration

### Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics

### Retention
- Time: 30 days (`--storage.tsdb.retention.time=30d`)
- Storage: 20GB (`--storage.tsdb.retention.size=20GB`)

### Maintenance
```bash
# Check TSDB size
du -sh /var/lib/prometheus/

# Compaction runs automatically; inspect block and series statistics instead
docker exec prometheus promtool tsdb analyze /prometheus
```

---

## Grafana Dashboards

### Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics

### Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes

---

## Common Issues

### Prometheus Not Scraping
1. Check targets: Prometheus UI → Status → Targets
2. Verify network connectivity
3. Check firewall rules
4. Review scrape errors in logs

### Grafana Dashboards Slow
1. Check Prometheus query performance
2. Reduce time range
3. Optimize queries
4. Check resource usage

### Alerts Not Firing
1. Verify Alertmanager config
2. Check ntfy integration
3. Review alert rules syntax
4. Test with artificial alert
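
One way to produce an artificial alert is to post it straight to Alertmanager's v2 API, which bypasses Prometheus rules and exercises only the routing and notification path. This is a sketch under stated assumptions: the helper names, the `RunbookTestAlert` name, and the label values are all hypothetical.

```bash
# Hypothetical helper: prints a one-element JSON array in the shape accepted by
# Alertmanager's v2 alerts endpoint. The label values are illustrative.
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Manual test alert"}}]\n' "$1"
}

# Hypothetical helper: posts the test alert to an Alertmanager base URL.
send_test_alert() {
  build_test_alert "${2:-RunbookTestAlert}" |
    curl -s -X POST -H 'Content-Type: application/json' --data-binary @- "$1/api/v2/alerts"
}

# Usage:
#   send_test_alert http://192.168.0.210:9093
```

If routing works, the alert should appear in the Alertmanager UI and then in the ntfy topic; because no end time is set, it resolves on its own once Alertmanager's resolve timeout passes.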

---

## Maintenance

### Weekly
- [ ] Review alert history
- [ ] Check disk space
- [ ] Verify backups

### Monthly
- [ ] Clean old metrics
- [ ] Update dashboards
- [ ] Review alert thresholds

### Quarterly
- [ ] Test alert notifications
- [ ] Review retention policy
- [ ] Optimize queries

---

## Backup Procedures

### Configuration
```bash
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
```

### Ansible
```bash
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
```
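
---

## Emergency Procedures

### Prometheus Full
1. Check storage: `docker system df`
2. Reduce retention: lower `--storage.tsdb.retention.time` or `--storage.tsdb.retention.size` on the Prometheus container (retention is set by command-line flags, not in `prometheus.yml`)
3. Delete old data through the TSDB admin API (requires `--web.enable-admin-api`): `curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="<job>"}&end=<unix_timestamp>'`, then `curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones`. Do not remove `/prometheus/wal/*`: the WAL holds the most recent samples, not old data
4. Restart container

### VM Down
1. Check Proxmox: `qm list`
2. Start VM: `qm start <vmid>`
3. Check console: `qm terminal <vmid>`
4. Review logs in Proxmox UI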

---

## Useful Commands

```bash
# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# Check targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
```

---

## Links

- [Prometheus](http://192.168.0.210:9090)
- [Grafana](http://192.168.0.210:3000)
- [Alertmanager](http://192.168.0.210:9093)
- [Uptime Kuma](http://192.168.0.210:3001)
179
docs/infrastructure/hosts/rpi5-runbook.md
Normal file
@@ -0,0 +1,179 @@
# RPi5 Runbook

*Raspberry Pi 5 - Edge Services*

**Endpoint ID:** 443395
**Status:** 🟢 Online
**Hardware:** ARM Cortex-A76, 16GB RAM, 512GB USB SSD
**Access:** `rpi5-vish.local`

---

## Overview

Raspberry Pi 5 runs edge services including Immich backup and lightweight applications.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Model | Raspberry Pi 5 |
| CPU | ARM Cortex-A76 (4-core) |
| RAM | 16GB |
| Storage | 512GB USB-C SSD |
| Network | 1x 1GbE (Pi 4 adapter) |

## Services

### Primary Services

| Service | Port | Purpose |
|---------|------|---------|
| **Immich** | 2283 | Photo backup (edge) |
| Portainer Agent | 9001 | Container management |
| Node Exporter | 9100 | Metrics |

### Services (if enabled)

| Service | Port | Purpose |
|---------|------|---------|
| Plex | 32400 | Media server |
| WireGuard | 51820 | VPN |

## Secondary Pi Nodes

### Pi-5-Kevin
This is a secondary Raspberry Pi 5 node (same board, with less RAM and storage than the primary) that is not typically online.

- **CPU**: Broadcom BCM2712 (4-core, 2.4GHz)
- **RAM**: 8GB LPDDR4X
- **Storage**: 64GB microSD
- **Network**: Gigabit Ethernet + Wi-Fi 5 (802.11ac)

---

## Daily Operations

### Check Service Health
```bash
# Via Portainer (port 9001 on this host is the Portainer Agent, which serves no web UI;
# open the Portainer server, assumed here to be the instance on Atlantis, and pick the RPi5 endpoint)
open http://atlantis.vish.local:9000

# Via SSH
ssh pi@rpi5-vish.local
docker ps
```

### Immich Status
```bash
# Access UI
open http://rpi5-vish.local:2283

# Check sync status (docker logs writes to both stdout and stderr)
docker logs immich-server 2>&1 | grep -i sync
```

---

## Common Issues

### Container Won't Start (ARM compatibility)
1. Verify image supports ARM64: `docker pull --platform linux/arm64 <image>`
2. Check container logs
3. Verify Raspberry Pi OS 64-bit
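
The platform string passed to `docker pull --platform` can be derived from the machine name instead of being typed by hand. The sketch below is illustrative only; `platform_for_arch` is a hypothetical helper covering the architectures likely to appear in this homelab.

```bash
# Hypothetical helper: maps a `uname -m` machine name to the Docker platform
# string used by `docker pull --platform`.
platform_for_arch() {
  case "$1" in
    aarch64|arm64) echo "linux/arm64" ;;
    armv7l)        echo "linux/arm/v7" ;;
    x86_64|amd64)  echo "linux/amd64" ;;
    *)             echo "unknown"; return 1 ;;
  esac
}

# Usage on the Pi:
#   docker pull --platform "$(platform_for_arch "$(uname -m)")" <image>
```

A pull that fails with a "no matching manifest" error for that platform suggests the image does not publish an ARM64 variant.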

### Storage Slow
1. Check USB drive: `lsusb`
2. Verify SSD: `sudo hdparm -t /dev/sda`
3. Use a USB 3.0 port (blue Type-A); the USB-C port on the Pi 5 is the power input

### Network Issues
1. Check adapter compatibility
2. Verify driver loaded: `lsmod | grep smsc95xx`
3. Update firmware: `sudo rpi-eeprom-update`

---

## Storage

### Layout
```
/home/pi/
├── docker/           # Docker data
├── immich/           # Photo storage
└── backups/          # Local backups
```

### Performance Tips
- Use a USB 3.0 SSD
- Use a quality power supply (5V 5A)
- Set `usb_max_current_enable=1` in config.txt to raise the USB current limit

---

## Maintenance

### Weekly
- [ ] Check Docker disk usage
- [ ] Verify Immich backup
- [ ] Check container health

### Monthly
- [ ] Update Raspberry Pi OS
- [ ] Clean unused images
- [ ] Review resource usage

### Quarterly
- [ ] Test backup restoration
- [ ] Verify ARM image compatibility
- [ ] Check firmware updates

---

## Emergency Procedures

### SD Card/Storage Failure
1. Replace storage drive
2. Reinstall Raspberry Pi OS
3. Run deploy playbook:
```bash
ansible-playbook ansible/homelab/playbooks/deploy_rpi5_vish.yml
```

### Overheating
1. Add heatsinks
2. Enable fan
3. Reduce CPU frequency: `echo "arm_freq=1800" | sudo tee -a /boot/firmware/config.txt`, then reboot (with `sudo echo ... >> file` the redirection runs as the unprivileged user and fails)

## Notes

This Raspberry Pi 5 system is the primary node that runs Immich and other services, with the secondary node **pi-5-kevin** intentionally kept offline for backup purposes when needed.

---

## Useful Commands

```bash
# SSH access
ssh pi@rpi5-vish.local

# Check temperature
vcgencmd measure_temp

# Check throttling
vcgencmd get_throttled

# Update firmware
sudo rpi-eeprom-update
sudo rpi-eeprom-update -a

# View Immich logs
docker logs -f immich-server
```
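
`vcgencmd get_throttled` prints a hexadecimal bitmask rather than readable text. The sketch below decodes the documented bits into messages; `decode_throttled` is a hypothetical helper and not part of the original runbook.

```bash
# Hypothetical helper: decodes the bitmask printed by `vcgencmd get_throttled`
# (for example "throttled=0x50005") into one line per set flag.
decode_throttled() {
  value=$(( ${1#throttled=} ))
  found=0
  for bit in 0 1 2 3 16 17 18 19; do
    if [ $(( $value & (1 << $bit) )) -ne 0 ]; then
      found=1
      case "$bit" in
        0)  echo "under-voltage detected" ;;
        1)  echo "ARM frequency capped" ;;
        2)  echo "currently throttled" ;;
        3)  echo "soft temperature limit active" ;;
        16) echo "under-voltage has occurred" ;;
        17) echo "ARM frequency capping has occurred" ;;
        18) echo "throttling has occurred" ;;
        19) echo "soft temperature limit has occurred" ;;
      esac
    fi
  done
  [ "$found" -eq 1 ] || echo "no throttling flags set"
}

# Usage on the Pi:
#   decode_throttled "$(vcgencmd get_throttled)"
```

Bits 0 to 3 describe the current state, while bits 16 to 19 record that the condition has occurred since boot, which helps separate an ongoing power problem from a past one.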

---

## Links

- [Immich](http://rpi5-vish.local:2283)
- [Portainer](http://atlantis.vish.local:9000) (server UI, assumed to run on Atlantis; this host runs only the agent on port 9001)
66
docs/infrastructure/hosts/runbooks.md
Normal file
@@ -0,0 +1,66 @@
# Host Runbooks

This directory contains operational runbooks for each host in the homelab infrastructure.

## Available Runbooks

- [Atlantis Runbook](./atlantis-runbook.md) - Synology DS1821+ (Primary NAS)
- [Calypso Runbook](./calypso-runbook.md) - Synology DS723+ (Secondary NAS)
- [Concord NUC Runbook](./concord-nuc-runbook.md) - Intel NUC (Home Automation & DNS)
- [Homelab VM Runbook](./homelab-vm-runbook.md) - Proxmox VM (Monitoring & DevOps)
- [RPi5 Runbook](./rpi5-runbook.md) - Raspberry Pi 5 (Edge Services)

---

## Common Tasks

All hosts share common operational procedures:

### Viewing Logs
```bash
# Via SSH to host
docker logs <container_name>

# Via Portainer:
#   Portainer → Containers → <container> → Logs
```

### Restarting Services
```bash
# Via docker-compose
cd hosts/<host>/<service>
docker-compose restart <service>

# Via Portainer:
#   Portainer → Stacks → <stack> → Restart
```

### Checking Resource Usage
```bash
# Via Portainer:
#   Portainer → Containers → Sort by CPU/Memory

# Via CLI
docker stats
```

---

## Emergency Contacts

| Role | Contact | When to Contact |
|------|---------|------------------|
| Primary Admin | User | All critical issues |
| Emergency | NTFY | Critical alerts only |

---

## Quick Reference

| Host | Primary Role | Critical Services | SSH Access |
|------|--------------|-------------------|------------|
| Atlantis | Media, Vault | Vaultwarden, Plex, Immich | atlantis.vish.local |
| Calypso | Infrastructure | NPM, Authentik, Prometheus | calypso.vish.local |
| Concord NUC | DNS, HA | AdGuard, Home Assistant | concordnuc.vish.local |
| Homelab VM | Monitoring | Prometheus, Grafana | 192.168.0.210 |
| RPi5 | Edge | Immich (backup) | rpi5-vish.local |
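
The SSH targets can be collected in a small lookup so that the login name and address for each host live in one place. This is a minimal sketch: the `ssh_target` helper and its short host keys are hypothetical, while the user names and addresses are the ones used in the individual runbooks.

```bash
# Hypothetical helper: prints the user@host SSH target for a host key,
# using the login names and addresses from the individual runbooks.
ssh_target() {
  case "$1" in
    atlantis)    echo "admin@atlantis.vish.local" ;;
    calypso)     echo "admin@calypso.vish.local" ;;
    concord-nuc) echo "homelab@concordnuc.vish.local" ;;
    homelab-vm)  echo "homelab@192.168.0.210" ;;
    rpi5)        echo "pi@rpi5-vish.local" ;;
    *)           echo "unknown host: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   ssh "$(ssh_target calypso)" docker ps
```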