Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC

docs/infrastructure/hosts/atlantis-runbook.md
# Atlantis Runbook

*Synology DS1821+ - Primary NAS and Media Server*

**Endpoint ID:** 2
**Status:** 🟢 Online
**Hardware:** AMD Ryzen V1500B, 32GB RAM, 8 bays
**Access:** `atlantis.vish.local`

---

## Overview

Atlantis is the primary Synology NAS serving as the homelab's central storage and media infrastructure.

## Hardware Specs

| Component | Specification |
|-----------|---------------|
| Model | Synology DS1821+ |
| CPU | AMD Ryzen V1500B (4-core) |
| RAM | 32GB |
| Storage | 8-bay RAID6 + SSD cache |
| Network | 4x 1GbE (link aggregated) |

## Services

### Critical Services

| Service | Port | Purpose | Docker Image |
|---------|------|---------|--------------|
| **Vaultwarden** | 8080 | Password manager | vaultwarden/server |
| **Immich** | 2283 | Photo backup | immich-app/immich |
| **Plex** | 32400 | Media server | plexinc/pms-docker |
| **Ollama** | 11434 | AI/ML | ollama/ollama |

### Media Stack

| Service | Port | Purpose |
|---------|------|---------|
| arr-suite | Various | Sonarr, Radarr, Lidarr, Prowlarr |
| qBittorrent | 8080 | Download client |
| Jellyseerr | 5055 | Media requests |

### Infrastructure

| Service | Port | Purpose |
|---------|------|---------|
| Portainer | 9000 | Container management |
| Watchtower | 9001 | Auto-updates |
| Dozzle | 8081 | Log viewer |
| Nginx Proxy Manager | 81/444 | Legacy proxy |

### Additional Services

- Jitsi (video conferencing)
- Matrix/Synapse (chat)
- Mastodon (social)
- Paperless-NGX (documents)
- Syncthing (file sync)
- Grafana + Prometheus (monitoring)

---

## Storage Layout

```
/volume1/
├── docker/           # Docker volumes
├── docker/compose/   # Service configurations
├── media/            # Media files
│   ├── movies/
│   ├── tv/
│   ├── music/
│   └── books/
├── photos/           # Immich storage
├── backups/          # Backup destination
└── shared/           # Shared folders
```

---

## Daily Operations

### Check Service Health

```bash
# Via Portainer
open http://atlantis.vish.local:9000

# Via SSH
ssh admin@atlantis.vish.local
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```

### Check Disk Usage

```bash
# SSH to Atlantis
ssh admin@atlantis.vish.local

# Overall volume usage
df -h /volume1

# Docker's share of it
docker system df
```

### View Logs

```bash
# Specific service
docker logs vaultwarden

# Follow logs
docker logs -f vaultwarden
```

---

## Common Issues

### Service Won't Start

1. Check if the port is already in use: `netstat -tulpn | grep <port>`
2. Check logs: `docker logs <container>`
3. Verify volume paths exist
4. Restart Docker: `sudo systemctl restart docker`
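
Step 1 can be scripted across all of the expected ports at once. A minimal sketch, assuming `ss` (or `netstat`) is available on the host; the port list is illustrative and mirrors the Critical Services table:

```shell
#!/usr/bin/env bash
# Report which of the expected service ports are currently bound.
check_port() {
  if command -v ss >/dev/null 2>&1; then
    # -H suppresses the header; grep -q . succeeds if any listener matched
    ss -tlnH "sport = :$1" | grep -q .
  else
    netstat -tuln 2>/dev/null | grep -q ":$1 "
  fi
}

for port in 8080 2283 32400 11434; do
  if check_port "$port"; then
    echo "port $port: in use"
  else
    echo "port $port: free"
  fi
done
```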

### Storage Full

1. Identify large files: `docker system df -v`
2. Clean Docker: `docker system prune -a`
3. Check Synology Storage Analyzer
4. Archive old media files
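
To see where the space is actually going before archiving, a quick sketch (the path assumes the Atlantis volume layout above):

```shell
# Ten largest directories up to two levels below /volume1.
du -xh --max-depth=2 /volume1 2>/dev/null | sort -h | tail -n 10
```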

### Performance Issues

1. Check resource usage: `docker stats`
2. Review Plex transcode logs
3. Check RAID health: `sudo mdadm --detail /dev/md0`

---

## Maintenance

### Weekly
- [ ] Verify backup completion
- [ ] Check disk health (S.M.A.R.T.)
- [ ] Review Watchtower updates
- [ ] Check Plex library integrity

### Monthly
- [ ] Run Docker cleanup
- [ ] Update Docker Compose files
- [ ] Review storage usage trends
- [ ] Check security updates

### Quarterly
- [ ] Deep clean unused images/containers
- [ ] Review service dependencies
- [ ] Test disaster recovery
- [ ] Update documentation

---

## Backup Procedures

### Configuration Backup
```bash
# Via Ansible
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags atlantis
```

### Data Backup
- Synology Hyper Backup to external drive
- Cloud sync to Backblaze B2
- Critical configs to Git repository

### Verification
```bash
ansible-playbook ansible/automation/playbooks/backup_verification.yml
```

---

## Emergency Procedures

### Complete Outage
1. Verify the Synology is powered on
2. Check network connectivity
3. Access via DSM: `https://atlantis.vish.local:5001`
4. Check Storage Manager for RAID status
5. Connect via serial console if there is no network

### RAID Degraded
1. Identify the failed drive via Storage Manager
2. Power down and replace the drive
3. The rebuild starts automatically
4. Monitor rebuild progress

### Data Recovery
See [Disaster Recovery Guide](../troubleshooting/disaster-recovery.md)

---

## Useful Commands

```bash
# SSH access
ssh admin@atlantis.vish.local

# Container management
cd /volume1/docker/compose/<service>
docker-compose restart <service>

# View all containers
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Logs for critical services
docker logs vaultwarden
docker logs plex
docker logs immich
```

---

## Links

- [Synology DSM](https://atlantis.vish.local:5001)
- [Portainer](http://atlantis.vish.local:9000)
- [Vaultwarden](http://atlantis.vish.local:8080)
- [Plex](http://atlantis.vish.local:32400)
- [Immich](http://atlantis.vish.local:2283)

---

docs/infrastructure/hosts/calypso-runbook.md
# Calypso Runbook

*Synology DS723+ - Secondary NAS and Infrastructure*

**Endpoint ID:** 443397
**Status:** 🟢 Online
**Hardware:** AMD Ryzen R1600, 32GB RAM, 2 bays + expansion
**Access:** `calypso.vish.local`

---

## Overview

Calypso is the secondary Synology NAS handling critical infrastructure services including authentication, reverse proxy, and monitoring.

## Hardware Specs

| Component | Specification |
|-----------|---------------|
| Model | Synology DS723+ |
| CPU | AMD Ryzen R1600 (2-core/4-thread) |
| RAM | 32GB |
| Storage | 2-bay SHR + eSATA expansion |
| Network | 2x 1GbE |

## Services

### Critical Infrastructure

| Service | Port | Purpose | Status |
|---------|------|---------|--------|
| **Nginx Proxy Manager** | 80/443 | SSL termination & routing | Required |
| **Authentik** | 9000 | SSO authentication | Required |
| **Prometheus** | 9090 | Metrics collection | Required |
| **Grafana** | 3000 | Dashboards | Required |
| **Alertmanager** | 9093 | Alert routing | Required |

### Additional Services

| Service | Port | Purpose |
|---------|------|---------|
| AdGuard | 3053 | DNS filtering (backup) |
| Paperless-NGX | 8000 | Document management |
| Reactive Resume | 3001 | Resume builder |
| Gitea | 3000/22 | Git hosting |
| Gitea Runner | 3008 | CI/CD |
| Headscale | 8080 | WireGuard VPN controller |
| Seafile | 8082 | File sync & share |
| Syncthing | 8384 | File sync |
| WireGuard | 51820 | VPN server |
| Portainer Agent | 9001 | Container management |

### Media (ARR Stack)

- Sonarr, Radarr, Lidarr
- Prowlarr (indexers)
- Bazarr (subtitles)

---

## Storage Layout

```
/volume1/
├── docker/
├── docker/compose/
├── appdata/          # Application data
│   ├── authentik/
│   ├── npm/
│   ├── prometheus/
│   └── grafana/
├── documents/        # Paperless
├── seafile/          # Seafile data
└── backups/          # Backup destination
```

---

## Daily Operations

### Check Service Health

```bash
# Via Portainer
open http://calypso.vish.local:9001

# Via SSH
ssh admin@calypso.vish.local
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```

### Monitor Critical Services

```bash
# Check NPM
curl -I http://localhost:80

# Check Authentik
curl -I http://localhost:9000

# Check Prometheus
curl -I http://localhost:9090
```
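
The three probes above can be collapsed into one loop that prints an HTTP status per endpoint — a sketch using the ports from the Critical Infrastructure table:

```shell
# 000 means the endpoint did not answer at all.
for url in http://localhost:80 http://localhost:9000 http://localhost:9090; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  printf '%s -> %s\n' "$url" "$code"
done
```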

---

## Common Issues

### NPM Not Routing
1. Check if NPM is running: `docker ps | grep npm`
2. Verify proxy hosts are configured: NPM UI → Proxy Hosts
3. Check SSL certificates
4. Review NPM logs: `docker logs nginx-proxy-manager`

### Authentik SSO Broken
1. Check Authentik is running: `docker ps | grep authentik`
2. Verify PostgreSQL: `docker logs authentik-postgresql`
3. Check Redis: `docker logs authentik-redis`
4. Review OIDC configurations in services

### Prometheus Down
1. Check storage: `docker system df`
2. Verify the volume: `docker volume ls | grep prometheus`
3. Check retention settings
4. Review logs: `docker logs prometheus`

---

## Maintenance

### Weekly
- [ ] Verify Authentik users can log in
- [ ] Check Prometheus metrics collection
- [ ] Review Alertmanager notifications
- [ ] Verify NPM certificates

### Monthly
- [ ] Clean unused Docker images
- [ ] Review Prometheus retention
- [ ] Update applications
- [ ] Check disk usage

### Quarterly
- [ ] Test OAuth flows
- [ ] Verify backup restoration
- [ ] Review monitoring thresholds
- [ ] Update SSL certificates

---

## SSL Certificate Management

NPM handles all SSL certificates:

1. **Automatic renewal**: Let's Encrypt (default)
2. **Manual**: NPM → SSL Certificates → Add
3. **Check status**: NPM Dashboard → SSL

### Common Certificate Issues
- Rate limits: wait 1 hour between requests
- DNS challenge: verify external DNS
- Self-signed: use for internal services

---

## Backup Procedures

### Configuration Backup
```bash
# Via Ansible
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags calypso
```

### Key Data to Backup
- NPM configurations: `/volume1/docker/compose/nginx_proxy_manager/`
- Authentik: `/volume1/docker/appdata/authentik/`
- Prometheus: `/volume1/docker/appdata/prometheus/`
- Grafana: `/volume1/docker/appdata/grafana/`

---

## Emergency Procedures

### Authentik Down
**Impact**: SSO broken for all services

1. Verify containers are running
2. Check PostgreSQL: `docker logs authentik-postgresql`
3. Check Redis: `docker logs authentik-redis`
4. Restart Authentik: `docker-compose restart`
5. If needed, restore from backup

### NPM Down
**Impact**: No external access

1. Verify the container: `docker ps | grep npm`
2. Check ports 80/443: `netstat -tulpn | grep -E ':(80|443)\s'`
3. Restart: `docker-compose restart`
4. Check DNS resolution

### Prometheus Full
**Impact**: No metrics

1. Check storage: `docker system df`
2. Reduce retention: adjust the `--storage.tsdb.retention.time` flag in the container's command (retention is a command-line flag, not a `prometheus.yml` setting)
3. Clean old data via the TSDB admin API (requires `--web.enable-admin-api`): delete series with `POST /api/v1/admin/tsdb/delete_series`, then reclaim disk with `curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones`
4. Restart the container

---

## Useful Commands

```bash
# SSH access
ssh admin@calypso.vish.local

# Check critical services
docker ps --filter "name=nginx" --filter "name=authentik" --filter "name=prometheus"

# Restart infrastructure
cd /volume1/docker/compose/nginx_proxy_manager && docker-compose restart
cd /volume1/docker/compose/authentik && docker-compose restart

# View logs
docker logs -f nginx-proxy-manager
docker logs -f authentik-server
docker logs -f prometheus
```

---

## Links

- [Synology DSM](https://calypso.vish.local:5001)
- [Nginx Proxy Manager](http://calypso.vish.local:81)
- [Authentik](http://calypso.vish.local:9000)
- [Prometheus](http://calypso.vish.local:9090)
- [Grafana](http://calypso.vish.local:3000)
- [Alertmanager](http://calypso.vish.local:9093)

---

docs/infrastructure/hosts/concord-nuc-runbook.md
# Concord NUC Runbook

*Intel NUC6i3SYB - Home Automation & DNS*

**Endpoint ID:** 443398
**Status:** 🟢 Online
**Hardware:** Intel Core i3-6100U, 16GB RAM, 256GB SSD
**Access:** `concordnuc.vish.local`

---

## Overview

Concord NUC runs lightweight services focused on home automation, DNS filtering, and local network services.

## Hardware Specs

| Component | Specification |
|-----------|---------------|
| Model | Intel NUC6i3SYB |
| CPU | Intel Core i3-6100U (2-core) |
| RAM | 16GB |
| Storage | 256GB SSD |
| Network | 1x 1GbE |

## Services

### Critical Services

| Service | Port | Purpose | Docker Image |
|---------|------|---------|--------------|
| **AdGuard Home** | 3053/53 | DNS filtering | adguard/adguardhome |
| **Home Assistant** | 8123 | Home automation | homeassistant/home-assistant |
| **Matter Server** | 5580 | Matter protocol | matter-server/matter-server |

### Additional Services

| Service | Port | Purpose |
|---------|------|---------|
| Plex | 32400 | Media server |
| Invidious | 2999 | YouTube frontend |
| Piped | 1234 | YouTube music |
| Syncthing | 8384 | File sync |
| WireGuard | 51820 | VPN server |
| Portainer Agent | 9001 | Container management |
| Node Exporter | 9100 | Metrics |

---

## Network Position

```
Internet
   │
   ▼
[Home Router] ──WAN──► (Public IP)
   │
   ├─► [Pi-hole Primary]
   │
   └─► [AdGuard Home] ──► Local DNS
            │
            ▼
   [Home Assistant] ──► Zigbee/Z-Wave
```

---

## Daily Operations

### Check Service Health

```bash
# Via Portainer
open http://concordnuc.vish.local:9001

# Via SSH
ssh homelab@concordnuc.vish.local
docker ps
```

### Home Assistant

```bash
# Access UI
open http://concordnuc.vish.local:8123

# Check logs
docker logs homeassistant
```

### AdGuard Home

```bash
# Access UI
open http://concordnuc.vish.local:3053

# Check DNS filtering
# Admin → Dashboard → DNS Queries
```

---

## Common Issues

### Home Assistant Won't Start
1. Check logs: `docker logs homeassistant`
2. Verify the config: `config/configuration.yaml`
3. Check the Zigbee/Z-Wave stick
4. Restore from backup if needed

### AdGuard Not Filtering
1. Check the service: `docker ps | grep adguard`
2. Verify DNS settings on the router
3. Check filter lists: Admin → Filters
4. Review the query log

### No Network Connectivity
1. Check Docker: `systemctl status docker`
2. Verify the network: `ip addr`
3. Check the firewall: `sudo ufw status`

---

## Home Assistant Configuration

### Add-ons Running
- Zigbee2MQTT
- Z-Wave JS UI
- File editor
- Terminal

### Backup
```
# Manual backup via the UI
Configuration → Backups → Create backup

# Automated to Synology
Syncthing → Backups/homeassistant/
```

### Restoration
1. Access HA in safe mode
2. Configuration → Backups
3. Select backup → Restore

---

## AdGuard Home Configuration

### DNS Providers
- Cloudflare: 1.1.1.1
- Google: 8.8.8.8

### Blocklists Enabled
- AdGuard Default
- AdAway
- Malware domains

### Query Log
Access: Admin → Logs
- Useful for debugging DNS issues
- Check for blocked domains

---

## Maintenance

### Weekly
- [ ] Check HA logs for errors
- [ ] Review the AdGuard query log
- [ ] Verify backups completed

### Monthly
- [ ] Update Home Assistant
- [ ] Review AdGuard filters
- [ ] Clean unused Docker images

### Quarterly
- [ ] Test automation reliability
- [ ] Review device states
- [ ] Check Zigbee network health

---

## Emergency Procedures

### Home Assistant Down
**Impact**: Smart home controls unavailable

1. Check the container: `docker ps | grep homeassistant`
2. Restart: `docker-compose restart`
3. Check logs: `docker logs homeassistant`
4. If corrupted, restore from backup

### AdGuard Down
**Impact**: DNS issues on the network

1. Verify: `dig google.com @localhost`
2. Restart: `docker-compose restart`
3. Check the config in the UI
4. Fall back to Pi-hole

### Complete Hardware Failure
1. Replace the NUC hardware
2. Reinstall Ubuntu/Debian
3. Run the deploy playbook:
```bash
ansible-playbook ansible/homelab/playbooks/deploy_concord_nuc.yml
```

---

## Useful Commands

```bash
# SSH access
ssh homelab@concordnuc.vish.local

# Restart services
docker-compose -f /opt/docker/compose/homeassistant.yaml restart
docker-compose -f /opt/docker/compose/adguard.yaml restart

# View logs
docker logs -f homeassistant
docker logs -f adguard

# Check resource usage
docker stats
```

---

## Device Access

| Device | Protocol | Address |
|--------|----------|---------|
| Zigbee Coordinator | USB | /dev/serial/by-id/* |
| Z-Wave Controller | USB | /dev/serial/by-id/* |

---

## Links

- [Home Assistant](http://concordnuc.vish.local:8123)
- [AdGuard Home](http://concordnuc.vish.local:3053)
- [Plex](http://concordnuc.vish.local:32400)
- [Invidious](http://concordnuc.vish.local:2999)

---

docs/infrastructure/hosts/deck-runbook.md
# Steam Deck Runbook

*SteamOS handheld — tailnet node with self-healing watchdog.*

**Headscale ID:** 29
**Status:** 🟢 Online
**Hardware:** Steam Deck (AMD APU, SteamOS Holo, btrfs root)
**Access:** `ssh deck` (key-based, via `~/.ssh/config` alias → `192.168.0.140`)
**Tailnet IP:** `100.64.0.11` (MagicDNS `deck.tail.vish.gg`)

---

## Overview

The Steam Deck participates in the homelab tailnet for SSH/remote access. Because SteamOS ships with an immutable `/usr` and a read-only `/usr/local`, all custom state lives in `/etc` (writable via overlay) and `/opt` (writable, outside Valve's managed tree).

## Filesystem layout

| Path | Purpose | Survives SteamOS update? |
|------|---------|--------------------------|
| `/opt/tailscale/tailscale`, `/opt/tailscale/tailscaled` | Tailscale binaries (standard Steam Deck install location) | Usually yes |
| `/etc/systemd/system/tailscaled.service` + `tailscaled.service.d/override.conf` | systemd unit + override pointing ExecStart at `/opt/tailscale/tailscaled` | Overlay — may be wiped on major updates |
| `/etc/tailscale/authkey` (0600 root) | Reusable Headscale preauth key used by the watchdog for re-auth | Overlay — may be wiped |
| `/etc/tailscale/watchdog.sh` (0755 root) | Re-auth watchdog (bash + python3) | Overlay — may be wiped |
| `/etc/systemd/system/tailscale-watchdog.{service,timer}` | Watchdog systemd units | Overlay — may be wiped |
| `/etc/hosts` | Contains a pin `<public-ip> headscale.vish.gg` maintained by the watchdog | Overlay — may be wiped |
| `/var/log/tailscale-watchdog.log` | Watchdog activity log | Yes (on /var) |

> **After any SteamOS upgrade**, verify these files still exist: `ls /etc/tailscale/ /etc/systemd/system/tailscale-watchdog.*`. If the overlay was reset, re-run the setup (see "Recovering after a SteamOS update" below).

## Tailscale / Headscale

- **Control server:** `https://headscale.vish.gg:8443` (migrated off public Tailscale 2026-04-19).
- **Preauth key:** reusable, 1-year expiry, stored at `/etc/tailscale/authkey`. Reusable so the watchdog can re-authenticate without human intervention.
- **Node expiry:** registered nodes in Headscale do not auto-expire unless explicitly expired with `headscale nodes expire`. If you want the `0001-01-01` sentinel (node-level "never expires"), that requires Headscale DB manipulation — not currently applied.

### Watchdog behavior

`/etc/tailscale/watchdog.sh` runs every 5 minutes via `tailscale-watchdog.timer` (`OnBootSec=2min`, `OnUnitActiveSec=5min`). Each tick:

1. Calls `tailscale status --json` and extracts `BackendState` via python3.
2. If `BackendState` is `Running`, exits silently.
3. Otherwise (`NeedsLogin`, `Stopped`, `NoState`, or daemon missing):
   - Refreshes the `/etc/hosts` pin for `headscale.vish.gg` using **DNS-over-HTTPS** (`dns.google`, fallback `1.1.1.1`). This is needed because the Deck has no `dig`/`nslookup`/`host` — only `python3` — and because the local resolver returns the *internal* LAN IP for `headscale.vish.gg` when on-LAN (split-horizon DNS), which is useless when the Deck is travelling.
   - Re-runs `tailscale up --login-server=https://headscale.vish.gg:8443 --authkey=<stored> --accept-routes=false --hostname=deck`.
   - Logs to `/var/log/tailscale-watchdog.log`.

### Verified failure-recovery matrix (2026-04-19)

| Failure | Recovery mechanism | Recovery time |
|---------|-------------------|---------------|
| `kill -9 tailscaled` | `Restart=on-failure` in tailscaled.service | ~3 s, PID rotated, state preserved |
| `tailscale down` | Watchdog detects `Stopped`, runs `tailscale up` | ~1 s after next timer tick (≤5 min) |
| `tailscale logout` | Watchdog detects `NeedsLogin`, runs `tailscale up` with stored authkey | ~4 s after next timer tick (≤5 min) |
| Boot | tailscaled auto-starts from `/var/lib/tailscale/tailscaled.state`; watchdog fires 2 min after boot as a safety net | not yet validated |

### Known gap

If tailscaled is stopped **cleanly** (`systemctl stop tailscaled`), the current watchdog logs "tailscaled not running" and tries `tailscale up`, which fails because the daemon socket is missing. On boot this is a non-issue (systemd starts tailscaled). During runtime, this would leave the Deck disconnected. If this becomes a problem, extend the watchdog to `systemctl start tailscaled` when `pidof tailscaled` is empty.
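
If that extension is ever needed, it could look like the following — a sketch only, not currently deployed; it would run at the top of the watchdog before the re-auth check:

```shell
# Start tailscaled if it is not running at all, so a later `tailscale up`
# has a daemon socket to talk to. Sketch; mirrors the suggestion above.
ensure_daemon() {
  if ! pidof tailscaled >/dev/null 2>&1; then
    echo "tailscaled not running - starting it"
    systemctl start tailscaled
    sleep 2   # give the daemon a moment to create its socket
  fi
}
```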

## SSH

- **Alias on homelab-vm:** `~/.ssh/config` entry → `Host deck / HostName 192.168.0.140 / User deck / IdentityFile ~/.ssh/id_ed25519`.
- **Installed key:** `admin@thevish.io` ed25519 pubkey in `/home/deck/.ssh/authorized_keys`.
- **Password** (for sudo): same as initial login.
- **MCP:** `deck` is in `scripts/homelab-mcp/server.py` `SSH_KNOWN_HOSTS`, so `ssh_exec(host="deck", …)` works from the homelab MCP.

## Recovering after a SteamOS update

If the `/etc` overlay was wiped:

```bash
# 1. Re-install key
cat ~/.ssh/id_ed25519.pub | sshpass -p '<password>' ssh -o StrictHostKeyChecking=accept-new deck@192.168.0.140 \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'

# 2. Restore systemd override for tailscaled
ssh deck 'echo <pw> | sudo -S mkdir -p /etc/systemd/system/tailscaled.service.d && \
  echo -e "[Service]\nExecStartPre=\nExecStartPre=/opt/tailscale/tailscaled --cleanup\nExecStart=\nExecStart=/opt/tailscale/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=\${PORT} \$FLAGS\nExecStopPost=\nExecStopPost=/opt/tailscale/tailscaled --cleanup" | sudo tee /etc/systemd/system/tailscaled.service.d/override.conf'

# 3. Restore /etc/hosts pin
ssh deck 'echo <pw> | sudo -S sh -c "grep -q headscale.vish.gg /etc/hosts || echo 184.23.52.14 headscale.vish.gg >> /etc/hosts"'

# 4. Create a fresh reusable preauth key (via MCP headscale_create_preauth_key) and store it.
# Note: the remote command is double-quoted so $AUTHKEY expands locally before
# being sent; with outer single quotes the remote shell would see an empty variable.
AUTHKEY='hskey-auth-…'  # pragma: allowlist secret (placeholder)
ssh deck "echo <pw> | sudo -S sh -c 'mkdir -p /etc/tailscale && umask 077 && printf %s $AUTHKEY > /etc/tailscale/authkey && chmod 600 /etc/tailscale/authkey'"

# 5. Reinstall watchdog (copy from git or re-apply from this runbook's source repo)
scp docs/infrastructure/hosts/deck/watchdog.sh deck:/tmp/
ssh deck 'echo <pw> | sudo -S install -m 0755 /tmp/watchdog.sh /etc/tailscale/watchdog.sh'

# 6. Reinstall + enable systemd units (see files/ directory)
scp docs/infrastructure/hosts/deck/tailscale-watchdog.{service,timer} deck:/tmp/
ssh deck 'echo <pw> | sudo -S sh -c "install -m 0644 /tmp/tailscale-watchdog.service /etc/systemd/system/ && install -m 0644 /tmp/tailscale-watchdog.timer /etc/systemd/system/ && systemctl daemon-reload && systemctl enable --now tailscaled.service tailscale-watchdog.timer"'
```

The watchdog script and systemd unit sources are checked in under `docs/infrastructure/hosts/deck/` so a recovery doesn't require reconstructing them from memory.

---

docs/infrastructure/hosts/deck/tailscale-watchdog.service

```
[Unit]
Description=Tailscale re-auth watchdog
After=network-online.target tailscaled.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/etc/tailscale/watchdog.sh
Nice=5
```

---

docs/infrastructure/hosts/deck/tailscale-watchdog.timer

```
[Unit]
Description=Run tailscale watchdog every 5 min

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=tailscale-watchdog.service

[Install]
WantedBy=timers.target
```

---

docs/infrastructure/hosts/deck/watchdog.sh

```bash
#!/usr/bin/env bash
# Tailscale re-auth watchdog for Steam Deck.
# Runs every 5 min via systemd timer. If tailscale is logged out / stopped,
# refreshes the /etc/hosts pin for headscale.vish.gg (Deck's own DNS may fail
# when off-LAN) and re-runs `tailscale up` with the stored reusable key.

set -u

LOG=/var/log/tailscale-watchdog.log
AUTHKEY_FILE=/etc/tailscale/authkey
HEADSCALE_HOST=headscale.vish.gg
LOGIN_SERVER=https://${HEADSCALE_HOST}:8443
TS=/opt/tailscale/tailscale

log() { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*" >> "$LOG"; }

get_backend_state() {
  "$TS" status --json 2>/dev/null | python3 -c '
import json, sys
try:
    print(json.load(sys.stdin).get("BackendState", ""))
except Exception:
    print("")
'
}

need_reauth() {
  if ! pidof tailscaled >/dev/null; then
    log "tailscaled not running"
    return 0
  fi
  local state
  state=$(get_backend_state)
  case "$state" in
    Running) return 1 ;;
    NeedsLogin|Stopped|NoState|"") log "BackendState=$state"; return 0 ;;
    *) return 1 ;;
  esac
}

resolve_headscale_public() {
  # DNS-over-HTTPS via Google (then Cloudflare). Returns an A record or empty.
  python3 - "$HEADSCALE_HOST" <<'PY'
import json, sys, urllib.request, urllib.error
name = sys.argv[1]
for url in (
    f"https://dns.google/resolve?name={name}&type=A",
    f"https://1.1.1.1/dns-query?name={name}&type=A",
):
    try:
        req = urllib.request.Request(url, headers={"accept": "application/dns-json"})
        with urllib.request.urlopen(req, timeout=4) as r:
            d = json.load(r)
        for a in d.get("Answer", []):
            if a.get("type") == 1:
                print(a["data"])
                sys.exit(0)
    except Exception:
        continue
sys.exit(1)
PY
}

refresh_hosts_pin() {
  local ip current
  ip=$(resolve_headscale_public) || true
  if [[ -z "$ip" ]]; then
    log "could not resolve $HEADSCALE_HOST via DoH"
    return
  fi
  current=$(grep -E "[[:space:]]${HEADSCALE_HOST}$" /etc/hosts | awk '{print $1}' | head -1)
  if [[ "$current" != "$ip" ]]; then
    sed -i.bak "/[[:space:]]${HEADSCALE_HOST}\$/d" /etc/hosts
    printf '%s %s\n' "$ip" "$HEADSCALE_HOST" >> /etc/hosts
    log "pinned $HEADSCALE_HOST -> $ip (was ${current:-none})"
  fi
}

if need_reauth; then
  refresh_hosts_pin
  if [[ -r "$AUTHKEY_FILE" ]]; then
    AUTHKEY=$(cat "$AUTHKEY_FILE")
    if "$TS" up --login-server="$LOGIN_SERVER" --authkey="$AUTHKEY" --accept-routes=false --hostname=deck >> "$LOG" 2>&1; then
      log "tailscale up succeeded"
    else
      log "tailscale up failed (rc=$?)"
    fi
  else
    log "missing $AUTHKEY_FILE"
  fi
fi
```

---

docs/infrastructure/hosts/homelab-vm-runbook.md
# Homelab VM Runbook

*Proxmox VM - Monitoring & DevOps*

**Endpoint ID:** 443399
**Status:** 🟢 Online
**Hardware:** 4 vCPU, 28GB RAM
**Access:** `192.168.0.210`

---

## Overview

Homelab VM runs monitoring, alerting, and development services on Proxmox.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |

## Services

### Monitoring Stack

| Service | Port | Purpose |
|---------|------|---------|
| **Prometheus** | 9090 | Metrics collection |
| **Grafana** | 3000 | Dashboards |
| **Alertmanager** | 9093 | Alert routing |
| **Node Exporter** | 9100 | System metrics |
| **cAdvisor** | 8080 | Container metrics |
| **Uptime Kuma** | 3001 | Uptime monitoring |

### Development

| Service | Port | Purpose |
|---------|------|---------|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |

### Database

| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |

---

## Daily Operations

### Check Monitoring
```bash
# Prometheus targets
curl http://192.168.0.210:9090/api/v1/targets | jq

# Grafana dashboards
open http://192.168.0.210:3000
```

### Alert Status
```bash
# Alertmanager
open http://192.168.0.210:9093

# Check ntfy for alerts
curl -s ntfy.vish.local/homelab-alerts | head -20
```

---

## Prometheus Configuration

### Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics
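A minimal `scrape_configs` sketch matching the target list above (job names and the non-local targets are illustrative assumptions, not copied from the live config):

```yaml
# prometheus.yml fragment -- illustrative sketch only
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 192.168.0.210:9100    # homelab VM node exporter
          - rpi5-vish.local:9100  # example remote host
  - job_name: cadvisor
    static_configs:
      - targets: ['192.168.0.210:8080']
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```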

### Retention
- Time: 30 days
- Storage: 20GB
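These limits are set via Prometheus startup flags rather than in `prometheus.yml`; a docker-compose sketch (the service layout is an assumption) applying the policy above:

```yaml
# docker-compose.yml fragment -- illustrative
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=20GB
```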

### Maintenance
```bash
# Check TSDB size
du -sh /var/lib/prometheus/

# Manual compaction
docker exec prometheus promtool tsdb compact /prometheus
```

---

## Grafana Dashboards

### Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics

### Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes
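Expressed as a Prometheus alerting rule, the first threshold above looks roughly like this (file name and label values are assumptions):

```yaml
# rules/cpu.yml -- illustrative sketch of the CPU threshold
groups:
  - name: host-alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }} for 5 minutes"
```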

---

## Common Issues

### Prometheus Not Scraping
1. Check targets: Prometheus UI → Status → Targets
2. Verify network connectivity
3. Check firewall rules
4. Review scrape errors in logs

### Grafana Dashboards Slow
1. Check Prometheus query performance
2. Reduce time range
3. Optimize queries
4. Check resource usage

### Alerts Not Firing
1. Verify Alertmanager config
2. Check ntfy integration
3. Review alert rules syntax
4. Test with artificial alert
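A synthetic alert can be posted straight to the Alertmanager v2 API to exercise the notification path; this sketch validates the payload locally before posting (alert name and labels are illustrative):

```shell
# Post a synthetic alert to Alertmanager to exercise routing.
payload='[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Synthetic test alert"}}]'

# Sanity-check the payload is valid JSON before posting
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload ok"

# Fire it (adjust the host for your setup; it resolves on its own later)
curl -XPOST -H 'Content-Type: application/json' \
  -d "$payload" http://192.168.0.210:9093/api/v2/alerts || true
```

If the notification never arrives, the problem is downstream of Prometheus (Alertmanager routing or the ntfy integration), which narrows the search considerably.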

---

## Maintenance

### Weekly
- [ ] Review alert history
- [ ] Check disk space
- [ ] Verify backups

### Monthly
- [ ] Clean old metrics
- [ ] Update dashboards
- [ ] Review alert thresholds

### Quarterly
- [ ] Test alert notifications
- [ ] Review retention policy
- [ ] Optimize queries

---

## Backup Procedures

### Configuration
```bash
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
```

### Ansible
```bash
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
```

---

## Emergency Procedures

### Prometheus Full
1. Check storage: `docker system df`
2. Reduce retention (the `--storage.tsdb.retention.time` / `--storage.tsdb.retention.size` startup flags)
3. Delete old data: `docker exec prometheus rm -rf /prometheus/wal/*`
4. Restart container

### VM Down
1. Check Proxmox: `qm list`
2. Start VM: `qm start <vmid>`
3. Check console: `qm terminal <vmid>`
4. Review logs in Proxmox UI

---

## Useful Commands

```bash
# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# Check targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
```

---

## Links

- [Prometheus](http://192.168.0.210:9090)
- [Grafana](http://192.168.0.210:3000)
- [Alertmanager](http://192.168.0.210:9093)
- [Uptime Kuma](http://192.168.0.210:3001)
179
docs/infrastructure/hosts/rpi5-runbook.md
Normal file
@@ -0,0 +1,179 @@
# RPi5 Runbook

*Raspberry Pi 5 - Edge Services*

**Endpoint ID:** 443395
**Status:** 🟢 Online
**Hardware:** ARM Cortex-A76, 16GB RAM, 512GB USB SSD
**Access:** `rpi5-vish.local`

---

## Overview

Raspberry Pi 5 runs edge services including Immich backup and lightweight applications.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Model | Raspberry Pi 5 |
| CPU | ARM Cortex-A76 (4-core) |
| RAM | 16GB |
| Storage | 512GB USB-C SSD |
| Network | 1x 1GbE (Pi 4 adapter) |

## Services

### Primary Services

| Service | Port | Purpose |
|---------|------|---------|
| **Immich** | 2283 | Photo backup (edge) |
| Portainer Agent | 9001 | Container management |
| Node Exporter | 9100 | Metrics |

### Services (if enabled)

| Service | Port | Purpose |
|---------|------|---------|
| Plex | 32400 | Media server |
| WireGuard | 51820 | VPN |

## Secondary Pi Nodes

### Pi-5-Kevin
A secondary Raspberry Pi 5 node that is not typically online:

- **CPU**: Broadcom BCM2712 (4-core, 2.4GHz)
- **RAM**: 8GB LPDDR4X
- **Storage**: 64GB microSD
- **Network**: Gigabit Ethernet + WiFi 6

---

## Daily Operations

### Check Service Health
```bash
# Via Portainer
open http://rpi5-vish.local:9001

# Via SSH
ssh pi@rpi5-vish.local
docker ps
```

### Immich Status
```bash
# Access UI
open http://rpi5-vish.local:2283

# Check sync status
docker logs immich-server | grep -i sync
```

---

## Common Issues

### Container Won't Start (ARM compatibility)
1. Verify image supports ARM64: `docker pull --platform linux/arm64 <image>`
2. Check container logs
3. Verify Raspberry Pi OS 64-bit

### Storage Slow
1. Check USB drive: `lsusb`
2. Verify SSD: `sudo hdparm -t /dev/sda`
3. Use fast USB port (USB-C)

### Network Issues
1. Check adapter compatibility
2. Verify driver loaded: `lsmod | grep smsc95xx`
3. Update firmware: `sudo rpi-eeprom-update`

---

## Storage

### Layout
```
/home/pi/
├── docker/    # Docker data
├── immich/    # Photo storage
└── backups/   # Local backups
```

### Performance Tips
- Use USB 3.0 SSD
- Use a quality power supply (5V 5A)
- Enable USB max_current in config.txt
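The max_current tip corresponds to a single `config.txt` entry on the Pi 5 (the file lives at `/boot/firmware/config.txt` on current Raspberry Pi OS releases; older releases use `/boot/config.txt`); a sketch:

```ini
# /boot/firmware/config.txt -- illustrative entry for the tip above
usb_max_current_enable=1   ; allow higher USB current draw on the Pi 5
```

A reboot is required for config.txt changes to take effect.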

---

## Maintenance

### Weekly
- [ ] Check Docker disk usage
- [ ] Verify Immich backup
- [ ] Check container health

### Monthly
- [ ] Update Raspberry Pi OS
- [ ] Clean unused images
- [ ] Review resource usage

### Quarterly
- [ ] Test backup restoration
- [ ] Verify ARM image compatibility
- [ ] Check firmware updates

---

## Emergency Procedures

### SD Card/Storage Failure
1. Replace storage drive
2. Reinstall Raspberry Pi OS
3. Run deploy playbook:
```bash
ansible-playbook ansible/homelab/playbooks/deploy_rpi5_vish.yml
```

### Overheating
1. Add heatsinks
2. Enable fan
3. Reduce CPU frequency: `echo "arm_freq=1800" | sudo tee -a /boot/config.txt`

## Notes

This Raspberry Pi 5 is the primary node running Immich and other edge services; the secondary node **pi-5-kevin** is intentionally kept offline and brought up only when needed as a backup.

---

## Useful Commands

```bash
# SSH access
ssh pi@rpi5-vish.local

# Check temperature
vcgencmd measure_temp

# Check throttling
vcgencmd get_throttled

# Update firmware
sudo rpi-eeprom-update
sudo rpi-eeprom-update -a

# View Immich logs
docker logs -f immich-server
```

---

## Links

- [Immich](http://rpi5-vish.local:2283)
- [Portainer](http://rpi5-vish.local:9001)
66
docs/infrastructure/hosts/runbooks.md
Normal file
@@ -0,0 +1,66 @@
# Host Runbooks

This directory contains operational runbooks for each host in the homelab infrastructure.

## Available Runbooks

- [Atlantis Runbook](./atlantis-runbook.md) - Synology DS1821+ (Primary NAS)
- [Calypso Runbook](./calypso-runbook.md) - Synology DS723+ (Secondary NAS)
- [Concord NUC Runbook](./concord-nuc-runbook.md) - Intel NUC (Home Automation & DNS)
- [Homelab VM Runbook](./homelab-vm-runbook.md) - Proxmox VM (Monitoring & DevOps)
- [RPi5 Runbook](./rpi5-runbook.md) - Raspberry Pi 5 (Edge Services)

---

## Common Tasks

All hosts share common operational procedures:

### Viewing Logs
```bash
# Via SSH to host
docker logs <container_name>

# Via Portainer:
#   Containers → <container> → Logs
```

### Restarting Services
```bash
# Via docker-compose
cd hosts/<host>/<service>
docker-compose restart <service>

# Via Portainer:
#   Stacks → <stack> → Restart
```

### Checking Resource Usage
```bash
# Via Portainer:
#   Containers → Sort by CPU/Memory

# Via CLI
docker stats
```

---

## Emergency Contacts

| Role | Contact | When to Contact |
|------|---------|------------------|
| Primary Admin | User | All critical issues |
| Emergency | NTFY | Critical alerts only |

---

## Quick Reference

| Host | Primary Role | Critical Services | SSH Access |
|------|--------------|-------------------|------------|
| Atlantis | Media, Vault | Vaultwarden, Plex, Immich | atlantis.local |
| Calypso | Infrastructure | NPM, Authentik, Prometheus | calypso.local |
| Concord NUC | DNS, HA | AdGuard, Home Assistant | concord-nuc.local |
| Homelab VM | Monitoring | Prometheus, Grafana | 192.168.0.210 |
| RPi5 | Edge | Immich (backup) | rpi5-vish.local |
Block a user