Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC

docs/infrastructure/hosts/atlantis-runbook.md
# Atlantis Runbook

*Synology DS1821+ - Primary NAS and Media Server*

**Endpoint ID:** 2
**Status:** 🟢 Online
**Hardware:** AMD Ryzen V1500B, 32GB RAM, 8 bays
**Access:** `atlantis.vish.local`

---

## Overview

Atlantis is the primary Synology NAS serving as the homelab's central storage and media infrastructure.

## Hardware Specs

| Component | Specification |
|-----------|---------------|
| Model | Synology DS1821+ |
| CPU | AMD Ryzen V1500B (4-core) |
| RAM | 32GB |
| Storage | 8-bay RAID6 + SSD cache |
| Network | 4x 1GbE (link aggregated) |

## Services

### Critical Services

| Service | Port | Purpose | Docker Image |
|---------|------|---------|--------------|
| **Vaultwarden** | 8080 | Password manager | vaultwarden/server |
| **Immich** | 2283 | Photo backup | immich-app/immich |
| **Plex** | 32400 | Media server | plexinc/pms-docker |
| **Ollama** | 11434 | AI/ML | ollama/ollama |

### Media Stack

| Service | Port | Purpose |
|---------|------|---------|
| arr-suite | Various | Sonarr, Radarr, Lidarr, Prowlarr |
| qBittorrent | 8080 | Download client |
| Jellyseerr | 5055 | Media requests |

### Infrastructure

| Service | Port | Purpose |
|---------|------|---------|
| Portainer | 9000 | Container management |
| Watchtower | 9001 | Auto-updates |
| Dozzle | 8081 | Log viewer |
| Nginx Proxy Manager | 81/444 | Legacy proxy |

### Additional Services

- Jitsi (video conferencing)
- Matrix/Synapse (chat)
- Mastodon (social)
- Paperless-NGX (documents)
- Syncthing (file sync)
- Grafana + Prometheus (monitoring)

---

## Storage Layout

```
/volume1/
├── docker/           # Docker volumes
├── docker/compose/   # Service configurations
├── media/            # Media files
│   ├── movies/
│   ├── tv/
│   ├── music/
│   └── books/
├── photos/           # Immich storage
├── backups/          # Backup destination
└── shared/           # Shared folders
```

---

## Daily Operations

### Check Service Health

```bash
# Via Portainer
open http://atlantis.vish.local:9000

# Via SSH
ssh admin@atlantis.vish.local
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```

### Check Disk Usage

```bash
# SSH to Atlantis
ssh admin@atlantis.vish.local

# Overall volume usage
df -h /volume1

# Docker's share of it
docker system df
```

### View Logs

```bash
# Specific service
docker logs vaultwarden

# Follow logs
docker logs -f vaultwarden
```

---

## Common Issues

### Service Won't Start

1. Check if the port is already in use: `netstat -tulpn | grep <port>`
2. Check logs: `docker logs <container>`
3. Verify volume paths exist
4. Restart Docker: `sudo systemctl restart docker`
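
Step 1 can be scripted across all of the expected ports at once. A minimal sketch, assuming `ss` (or `netstat`) is available on the host; the port list is illustrative and mirrors the Critical Services table:

```shell
#!/usr/bin/env bash
# Report which of the expected service ports are currently bound.
check_port() {
  if command -v ss >/dev/null 2>&1; then
    # -H suppresses the header; grep -q . succeeds if any listener matched
    ss -tlnH "sport = :$1" | grep -q .
  else
    netstat -tuln 2>/dev/null | grep -q ":$1 "
  fi
}

for port in 8080 2283 32400 11434; do
  if check_port "$port"; then
    echo "port $port: in use"
  else
    echo "port $port: free"
  fi
done
```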

### Storage Full

1. Identify large files: `docker system df -v`
2. Clean Docker: `docker system prune -a`
3. Check Synology Storage Analyzer
4. Archive old media files
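
To see where the space is actually going before archiving, a quick sketch (the path assumes the Atlantis volume layout above):

```shell
# Ten largest directories up to two levels below /volume1.
du -xh --max-depth=2 /volume1 2>/dev/null | sort -h | tail -n 10
```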

### Performance Issues

1. Check resource usage: `docker stats`
2. Review Plex transcode logs
3. Check RAID health: `sudo mdadm --detail /dev/md0`

---

## Maintenance

### Weekly
- [ ] Verify backup completion
- [ ] Check disk health (S.M.A.R.T.)
- [ ] Review Watchtower updates
- [ ] Check Plex library integrity

### Monthly
- [ ] Run Docker cleanup
- [ ] Update Docker Compose files
- [ ] Review storage usage trends
- [ ] Check security updates

### Quarterly
- [ ] Deep clean unused images/containers
- [ ] Review service dependencies
- [ ] Test disaster recovery
- [ ] Update documentation

---

## Backup Procedures

### Configuration Backup
```bash
# Via Ansible
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags atlantis
```

### Data Backup
- Synology Hyper Backup to external drive
- Cloud sync to Backblaze B2
- Critical configs to Git repository

### Verification
```bash
ansible-playbook ansible/automation/playbooks/backup_verification.yml
```

---

## Emergency Procedures

### Complete Outage
1. Verify the Synology is powered on
2. Check network connectivity
3. Access via DSM: `https://atlantis.vish.local:5001`
4. Check Storage Manager for RAID status
5. Connect via serial console if there is no network

### RAID Degraded
1. Identify the failed drive via Storage Manager
2. Power down and replace the drive
3. The rebuild starts automatically
4. Monitor rebuild progress

### Data Recovery
See [Disaster Recovery Guide](../troubleshooting/disaster-recovery.md)

---

## Useful Commands

```bash
# SSH access
ssh admin@atlantis.vish.local

# Container management
cd /volume1/docker/compose/<service>
docker-compose restart <service>

# View all containers
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Logs for critical services
docker logs vaultwarden
docker logs plex
docker logs immich
```

---

## Links

- [Synology DSM](https://atlantis.vish.local:5001)
- [Portainer](http://atlantis.vish.local:9000)
- [Vaultwarden](http://atlantis.vish.local:8080)
- [Plex](http://atlantis.vish.local:32400)
- [Immich](http://atlantis.vish.local:2283)

---

docs/infrastructure/hosts/calypso-runbook.md
# Calypso Runbook

*Synology DS723+ - Secondary NAS and Infrastructure*

**Endpoint ID:** 443397
**Status:** 🟢 Online
**Hardware:** AMD Ryzen R1600, 32GB RAM, 2 bays + expansion
**Access:** `calypso.vish.local`

---

## Overview

Calypso is the secondary Synology NAS handling critical infrastructure services including authentication, reverse proxy, and monitoring.

## Hardware Specs

| Component | Specification |
|-----------|---------------|
| Model | Synology DS723+ |
| CPU | AMD Ryzen R1600 (2-core/4-thread) |
| RAM | 32GB |
| Storage | 2-bay SHR + eSATA expansion |
| Network | 2x 1GbE |

## Services

### Critical Infrastructure

| Service | Port | Purpose | Status |
|---------|------|---------|--------|
| **Nginx Proxy Manager** | 80/443 | SSL termination & routing | Required |
| **Authentik** | 9000 | SSO authentication | Required |
| **Prometheus** | 9090 | Metrics collection | Required |
| **Grafana** | 3000 | Dashboards | Required |
| **Alertmanager** | 9093 | Alert routing | Required |

### Additional Services

| Service | Port | Purpose |
|---------|------|---------|
| AdGuard | 3053 | DNS filtering (backup) |
| Paperless-NGX | 8000 | Document management |
| Reactive Resume | 3001 | Resume builder |
| Gitea | 3000/22 | Git hosting |
| Gitea Runner | 3008 | CI/CD |
| Headscale | 8080 | WireGuard VPN controller |
| Seafile | 8082 | File sync & share |
| Syncthing | 8384 | File sync |
| WireGuard | 51820 | VPN server |
| Portainer Agent | 9001 | Container management |

### Media (ARR Stack)

- Sonarr, Radarr, Lidarr
- Prowlarr (indexers)
- Bazarr (subtitles)

---

## Storage Layout

```
/volume1/
├── docker/
├── docker/compose/
├── appdata/          # Application data
│   ├── authentik/
│   ├── npm/
│   ├── prometheus/
│   └── grafana/
├── documents/        # Paperless
├── seafile/          # Seafile data
└── backups/          # Backup destination
```

---

## Daily Operations

### Check Service Health

```bash
# Via Portainer
open http://calypso.vish.local:9001

# Via SSH
ssh admin@calypso.vish.local
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```

### Monitor Critical Services

```bash
# Check NPM
curl -I http://localhost:80

# Check Authentik
curl -I http://localhost:9000

# Check Prometheus
curl -I http://localhost:9090
```
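
The three probes above can be collapsed into one loop that prints an HTTP status per endpoint — a sketch using the ports from the Critical Infrastructure table:

```shell
# 000 means the endpoint did not answer at all.
for url in http://localhost:80 http://localhost:9000 http://localhost:9090; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  printf '%s -> %s\n' "$url" "$code"
done
```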

---

## Common Issues

### NPM Not Routing
1. Check if NPM is running: `docker ps | grep npm`
2. Verify proxy hosts are configured: NPM UI → Proxy Hosts
3. Check SSL certificates
4. Review NPM logs: `docker logs nginx-proxy-manager`

### Authentik SSO Broken
1. Check Authentik is running: `docker ps | grep authentik`
2. Verify PostgreSQL: `docker logs authentik-postgresql`
3. Check Redis: `docker logs authentik-redis`
4. Review OIDC configurations in services

### Prometheus Down
1. Check storage: `docker system df`
2. Verify the volume: `docker volume ls | grep prometheus`
3. Check retention settings
4. Review logs: `docker logs prometheus`

---

## Maintenance

### Weekly
- [ ] Verify Authentik users can log in
- [ ] Check Prometheus metrics collection
- [ ] Review Alertmanager notifications
- [ ] Verify NPM certificates

### Monthly
- [ ] Clean unused Docker images
- [ ] Review Prometheus retention
- [ ] Update applications
- [ ] Check disk usage

### Quarterly
- [ ] Test OAuth flows
- [ ] Verify backup restoration
- [ ] Review monitoring thresholds
- [ ] Update SSL certificates

---

## SSL Certificate Management

NPM handles all SSL certificates:

1. **Automatic renewal**: Let's Encrypt (default)
2. **Manual**: NPM → SSL Certificates → Add
3. **Check status**: NPM Dashboard → SSL

### Common Certificate Issues
- Rate limits: wait 1 hour between requests
- DNS challenge: verify external DNS
- Self-signed: use for internal services

---

## Backup Procedures

### Configuration Backup
```bash
# Via Ansible
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags calypso
```

### Key Data to Backup
- NPM configurations: `/volume1/docker/compose/nginx_proxy_manager/`
- Authentik: `/volume1/docker/appdata/authentik/`
- Prometheus: `/volume1/docker/appdata/prometheus/`
- Grafana: `/volume1/docker/appdata/grafana/`

---

## Emergency Procedures

### Authentik Down
**Impact**: SSO broken for all services

1. Verify containers are running
2. Check PostgreSQL: `docker logs authentik-postgresql`
3. Check Redis: `docker logs authentik-redis`
4. Restart Authentik: `docker-compose restart`
5. If needed, restore from backup

### NPM Down
**Impact**: No external access

1. Verify the container: `docker ps | grep npm`
2. Check ports 80/443: `netstat -tulpn | grep -E ':(80|443)\s'`
3. Restart: `docker-compose restart`
4. Check DNS resolution

### Prometheus Full
**Impact**: No metrics

1. Check storage: `docker system df`
2. Reduce retention: adjust the `--storage.tsdb.retention.time` flag in the container's command (retention is a command-line flag, not a `prometheus.yml` setting)
3. Clean old data via the TSDB admin API (requires `--web.enable-admin-api`): delete series with `POST /api/v1/admin/tsdb/delete_series`, then reclaim disk with `curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones`
4. Restart the container

---

## Useful Commands

```bash
# SSH access
ssh admin@calypso.vish.local

# Check critical services
docker ps --filter "name=nginx" --filter "name=authentik" --filter "name=prometheus"

# Restart infrastructure
cd /volume1/docker/compose/nginx_proxy_manager && docker-compose restart
cd /volume1/docker/compose/authentik && docker-compose restart

# View logs
docker logs -f nginx-proxy-manager
docker logs -f authentik-server
docker logs -f prometheus
```

---

## Links

- [Synology DSM](https://calypso.vish.local:5001)
- [Nginx Proxy Manager](http://calypso.vish.local:81)
- [Authentik](http://calypso.vish.local:9000)
- [Prometheus](http://calypso.vish.local:9090)
- [Grafana](http://calypso.vish.local:3000)
- [Alertmanager](http://calypso.vish.local:9093)

---

docs/infrastructure/hosts/concord-nuc-runbook.md
# Concord NUC Runbook

*Intel NUC6i3SYB - Home Automation & DNS*

**Endpoint ID:** 443398
**Status:** 🟢 Online
**Hardware:** Intel Core i3-6100U, 16GB RAM, 256GB SSD
**Access:** `concordnuc.vish.local`

---

## Overview

Concord NUC runs lightweight services focused on home automation, DNS filtering, and local network services.

## Hardware Specs

| Component | Specification |
|-----------|---------------|
| Model | Intel NUC6i3SYB |
| CPU | Intel Core i3-6100U (2-core) |
| RAM | 16GB |
| Storage | 256GB SSD |
| Network | 1x 1GbE |

## Services

### Critical Services

| Service | Port | Purpose | Docker Image |
|---------|------|---------|--------------|
| **AdGuard Home** | 3053/53 | DNS filtering | adguard/adguardhome |
| **Home Assistant** | 8123 | Home automation | homeassistant/home-assistant |
| **Matter Server** | 5580 | Matter protocol | matter-server/matter-server |

### Additional Services

| Service | Port | Purpose |
|---------|------|---------|
| Plex | 32400 | Media server |
| Invidious | 2999 | YouTube frontend |
| Piped | 1234 | YouTube music |
| Syncthing | 8384 | File sync |
| WireGuard | 51820 | VPN server |
| Portainer Agent | 9001 | Container management |
| Node Exporter | 9100 | Metrics |

---

## Network Position

```
Internet
   │
   ▼
[Home Router] ──WAN──► (Public IP)
   │
   ├─► [Pi-hole Primary]
   │
   └─► [AdGuard Home] ──► Local DNS
            │
            ▼
   [Home Assistant] ──► Zigbee/Z-Wave
```

---

## Daily Operations

### Check Service Health

```bash
# Via Portainer
open http://concordnuc.vish.local:9001

# Via SSH
ssh homelab@concordnuc.vish.local
docker ps
```

### Home Assistant

```bash
# Access UI
open http://concordnuc.vish.local:8123

# Check logs
docker logs homeassistant
```

### AdGuard Home

```bash
# Access UI
open http://concordnuc.vish.local:3053

# Check DNS filtering
# Admin → Dashboard → DNS Queries
```

---

## Common Issues

### Home Assistant Won't Start
1. Check logs: `docker logs homeassistant`
2. Verify the config: `config/configuration.yaml`
3. Check the Zigbee/Z-Wave stick
4. Restore from backup if needed

### AdGuard Not Filtering
1. Check the service: `docker ps | grep adguard`
2. Verify DNS settings on the router
3. Check filter lists: Admin → Filters
4. Review the query log

### No Network Connectivity
1. Check Docker: `systemctl status docker`
2. Verify the network: `ip addr`
3. Check the firewall: `sudo ufw status`

---

## Home Assistant Configuration

### Add-ons Running
- Zigbee2MQTT
- Z-Wave JS UI
- File editor
- Terminal

### Backup
```
# Manual backup via the UI
Configuration → Backups → Create backup

# Automated to Synology
Syncthing → Backups/homeassistant/
```

### Restoration
1. Access HA in safe mode
2. Configuration → Backups
3. Select backup → Restore

---

## AdGuard Home Configuration

### DNS Providers
- Cloudflare: 1.1.1.1
- Google: 8.8.8.8

### Blocklists Enabled
- AdGuard Default
- AdAway
- Malware domains

### Query Log
Access: Admin → Logs
- Useful for debugging DNS issues
- Check for blocked domains

---

## Maintenance

### Weekly
- [ ] Check HA logs for errors
- [ ] Review the AdGuard query log
- [ ] Verify backups completed

### Monthly
- [ ] Update Home Assistant
- [ ] Review AdGuard filters
- [ ] Clean unused Docker images

### Quarterly
- [ ] Test automation reliability
- [ ] Review device states
- [ ] Check Zigbee network health

---

## Emergency Procedures

### Home Assistant Down
**Impact**: Smart home controls unavailable

1. Check the container: `docker ps | grep homeassistant`
2. Restart: `docker-compose restart`
3. Check logs: `docker logs homeassistant`
4. If corrupted, restore from backup

### AdGuard Down
**Impact**: DNS issues on the network

1. Verify: `dig google.com @localhost`
2. Restart: `docker-compose restart`
3. Check the config in the UI
4. Fall back to Pi-hole

### Complete Hardware Failure
1. Replace the NUC hardware
2. Reinstall Ubuntu/Debian
3. Run the deploy playbook:
```bash
ansible-playbook ansible/homelab/playbooks/deploy_concord_nuc.yml
```

---

## Useful Commands

```bash
# SSH access
ssh homelab@concordnuc.vish.local

# Restart services
docker-compose -f /opt/docker/compose/homeassistant.yaml restart
docker-compose -f /opt/docker/compose/adguard.yaml restart

# View logs
docker logs -f homeassistant
docker logs -f adguard

# Check resource usage
docker stats
```

---

## Device Access

| Device | Protocol | Address |
|--------|----------|---------|
| Zigbee Coordinator | USB | /dev/serial/by-id/* |
| Z-Wave Controller | USB | /dev/serial/by-id/* |

---

## Links

- [Home Assistant](http://concordnuc.vish.local:8123)
- [AdGuard Home](http://concordnuc.vish.local:3053)
- [Plex](http://concordnuc.vish.local:32400)
- [Invidious](http://concordnuc.vish.local:2999)

---

docs/infrastructure/hosts/deck-runbook.md
# Steam Deck Runbook

*SteamOS handheld — tailnet node with self-healing watchdog.*

**Headscale ID:** 29
**Status:** 🟢 Online
**Hardware:** Steam Deck (AMD APU, SteamOS Holo, btrfs root)
**Access:** `ssh deck` (key-based, via `~/.ssh/config` alias → `192.168.0.140`)
**Tailnet IP:** `100.64.0.11` (MagicDNS `deck.tail.vish.gg`)

---

## Overview

The Steam Deck participates in the homelab tailnet for SSH/remote access. Because SteamOS ships with an immutable `/usr` and a read-only `/usr/local`, all custom state lives in `/etc` (writable via overlay) and `/opt` (writable, outside Valve's managed tree).

## Filesystem layout

| Path | Purpose | Survives SteamOS update? |
|------|---------|--------------------------|
| `/opt/tailscale/tailscale`, `/opt/tailscale/tailscaled` | Tailscale binaries (standard Steam Deck install location) | Usually yes |
| `/etc/systemd/system/tailscaled.service` + `tailscaled.service.d/override.conf` | systemd unit + override pointing ExecStart at `/opt/tailscale/tailscaled` | Overlay — may be wiped on major updates |
| `/etc/tailscale/authkey` (0600 root) | Reusable Headscale preauth key used by the watchdog for re-auth | Overlay — may be wiped |
| `/etc/tailscale/watchdog.sh` (0755 root) | Re-auth watchdog (bash + python3) | Overlay — may be wiped |
| `/etc/systemd/system/tailscale-watchdog.{service,timer}` | Watchdog systemd units | Overlay — may be wiped |
| `/etc/hosts` | Contains a pin `<public-ip> headscale.vish.gg` maintained by the watchdog | Overlay — may be wiped |
| `/var/log/tailscale-watchdog.log` | Watchdog activity log | Yes (on /var) |

> **After any SteamOS upgrade**, verify these files still exist: `ls /etc/tailscale/ /etc/systemd/system/tailscale-watchdog.*`. If the overlay was reset, re-run the setup (see "Recovering after a SteamOS update" below).

## Tailscale / Headscale

- **Control server:** `https://headscale.vish.gg:8443` (migrated off public Tailscale 2026-04-19).
- **Preauth key:** reusable, 1-year expiry, stored at `/etc/tailscale/authkey`. Reusable so the watchdog can re-authenticate without human intervention.
- **Node expiry:** registered nodes in Headscale do not auto-expire unless explicitly expired with `headscale nodes expire`. If you want the `0001-01-01` sentinel (node-level "never expires"), that requires Headscale DB manipulation — not currently applied.

### Watchdog behavior

`/etc/tailscale/watchdog.sh` runs every 5 minutes via `tailscale-watchdog.timer` (`OnBootSec=2min`, `OnUnitActiveSec=5min`). Each tick:

1. Calls `tailscale status --json` and extracts `BackendState` via python3.
2. If `BackendState` is `Running`, exits silently.
3. Otherwise (`NeedsLogin`, `Stopped`, `NoState`, or daemon missing):
   - Refreshes the `/etc/hosts` pin for `headscale.vish.gg` using **DNS-over-HTTPS** (`dns.google`, fallback `1.1.1.1`). This is needed because the Deck has no `dig`/`nslookup`/`host` — only `python3` — and because the local resolver returns the *internal* LAN IP for `headscale.vish.gg` when on-LAN (split-horizon DNS), which is useless when the Deck is travelling.
   - Re-runs `tailscale up --login-server=https://headscale.vish.gg:8443 --authkey=<stored> --accept-routes=false --hostname=deck`.
   - Logs to `/var/log/tailscale-watchdog.log`.

### Verified failure-recovery matrix (2026-04-19)

| Failure | Recovery mechanism | Recovery time |
|---------|-------------------|---------------|
| `kill -9 tailscaled` | `Restart=on-failure` in tailscaled.service | ~3 s, PID rotated, state preserved |
| `tailscale down` | Watchdog detects `Stopped`, runs `tailscale up` | ~1 s after next timer tick (≤5 min) |
| `tailscale logout` | Watchdog detects `NeedsLogin`, runs `tailscale up` with stored authkey | ~4 s after next timer tick (≤5 min) |
| Boot | tailscaled auto-starts from `/var/lib/tailscale/tailscaled.state`; watchdog fires 2 min after boot as a safety net | not yet validated |

### Known gap

If tailscaled is stopped **cleanly** (`systemctl stop tailscaled`), the current watchdog logs "tailscaled not running" and tries `tailscale up`, which fails because the daemon socket is missing. On boot this is a non-issue (systemd starts tailscaled). During runtime, this would leave the Deck disconnected. If this becomes a problem, extend the watchdog to `systemctl start tailscaled` when `pidof tailscaled` is empty.
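
If that extension is ever needed, it could look like the following — a sketch only, not currently deployed; it would run at the top of the watchdog before the re-auth check:

```shell
# Start tailscaled if it is not running at all, so a later `tailscale up`
# has a daemon socket to talk to. Sketch; mirrors the suggestion above.
ensure_daemon() {
  if ! pidof tailscaled >/dev/null 2>&1; then
    echo "tailscaled not running - starting it"
    systemctl start tailscaled
    sleep 2   # give the daemon a moment to create its socket
  fi
}
```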

## SSH

- **Alias on homelab-vm:** `~/.ssh/config` entry → `Host deck / HostName 192.168.0.140 / User deck / IdentityFile ~/.ssh/id_ed25519`.
- **Installed key:** `admin@thevish.io` ed25519 pubkey in `/home/deck/.ssh/authorized_keys`.
- **Password** (for sudo): same as initial login.
- **MCP:** `deck` is in `scripts/homelab-mcp/server.py` `SSH_KNOWN_HOSTS`, so `ssh_exec(host="deck", …)` works from the homelab MCP.

## Recovering after a SteamOS update

If the `/etc` overlay was wiped:

```bash
# 1. Re-install key
cat ~/.ssh/id_ed25519.pub | sshpass -p '<password>' ssh -o StrictHostKeyChecking=accept-new deck@192.168.0.140 \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'

# 2. Restore systemd override for tailscaled
ssh deck 'echo <pw> | sudo -S mkdir -p /etc/systemd/system/tailscaled.service.d && \
  echo -e "[Service]\nExecStartPre=\nExecStartPre=/opt/tailscale/tailscaled --cleanup\nExecStart=\nExecStart=/opt/tailscale/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=\${PORT} \$FLAGS\nExecStopPost=\nExecStopPost=/opt/tailscale/tailscaled --cleanup" | sudo tee /etc/systemd/system/tailscaled.service.d/override.conf'

# 3. Restore /etc/hosts pin
ssh deck 'echo <pw> | sudo -S sh -c "grep -q headscale.vish.gg /etc/hosts || echo 184.23.52.14 headscale.vish.gg >> /etc/hosts"'

# 4. Create a fresh reusable preauth key (via MCP headscale_create_preauth_key) and store it.
# Note: the remote command is double-quoted so $AUTHKEY expands locally before
# being sent; with outer single quotes the remote shell would see an empty variable.
AUTHKEY='hskey-auth-…'  # pragma: allowlist secret (placeholder)
ssh deck "echo <pw> | sudo -S sh -c 'mkdir -p /etc/tailscale && umask 077 && printf %s $AUTHKEY > /etc/tailscale/authkey && chmod 600 /etc/tailscale/authkey'"

# 5. Reinstall watchdog (copy from git or re-apply from this runbook's source repo)
scp docs/infrastructure/hosts/deck/watchdog.sh deck:/tmp/
ssh deck 'echo <pw> | sudo -S install -m 0755 /tmp/watchdog.sh /etc/tailscale/watchdog.sh'

# 6. Reinstall + enable systemd units (see files/ directory)
scp docs/infrastructure/hosts/deck/tailscale-watchdog.{service,timer} deck:/tmp/
ssh deck 'echo <pw> | sudo -S sh -c "install -m 0644 /tmp/tailscale-watchdog.service /etc/systemd/system/ && install -m 0644 /tmp/tailscale-watchdog.timer /etc/systemd/system/ && systemctl daemon-reload && systemctl enable --now tailscaled.service tailscale-watchdog.timer"'
```

The watchdog script and systemd unit sources are checked in under `docs/infrastructure/hosts/deck/` so a recovery doesn't require reconstructing them from memory.

---

docs/infrastructure/hosts/deck/tailscale-watchdog.service

```
[Unit]
Description=Tailscale re-auth watchdog
After=network-online.target tailscaled.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/etc/tailscale/watchdog.sh
Nice=5
```

---

docs/infrastructure/hosts/deck/tailscale-watchdog.timer

```
[Unit]
Description=Run tailscale watchdog every 5 min

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=tailscale-watchdog.service

[Install]
WantedBy=timers.target
```

---

docs/infrastructure/hosts/deck/watchdog.sh

```bash
#!/usr/bin/env bash
# Tailscale re-auth watchdog for Steam Deck.
# Runs every 5 min via systemd timer. If tailscale is logged out / stopped,
# refreshes the /etc/hosts pin for headscale.vish.gg (Deck's own DNS may fail
# when off-LAN) and re-runs `tailscale up` with the stored reusable key.

set -u

LOG=/var/log/tailscale-watchdog.log
AUTHKEY_FILE=/etc/tailscale/authkey
HEADSCALE_HOST=headscale.vish.gg
LOGIN_SERVER=https://${HEADSCALE_HOST}:8443
TS=/opt/tailscale/tailscale

log() { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*" >> "$LOG"; }

get_backend_state() {
  "$TS" status --json 2>/dev/null | python3 -c '
import json, sys
try:
    print(json.load(sys.stdin).get("BackendState", ""))
except Exception:
    print("")
'
}

need_reauth() {
  if ! pidof tailscaled >/dev/null; then
    log "tailscaled not running"
    return 0
  fi
  local state
  state=$(get_backend_state)
  case "$state" in
    Running) return 1 ;;
    NeedsLogin|Stopped|NoState|"") log "BackendState=$state"; return 0 ;;
    *) return 1 ;;
  esac
}

resolve_headscale_public() {
  # DNS-over-HTTPS via Google (then Cloudflare). Returns an A record or empty.
  python3 - "$HEADSCALE_HOST" <<'PY'
import json, sys, urllib.request, urllib.error
name = sys.argv[1]
for url in (
    f"https://dns.google/resolve?name={name}&type=A",
    f"https://1.1.1.1/dns-query?name={name}&type=A",
):
    try:
        req = urllib.request.Request(url, headers={"accept": "application/dns-json"})
        with urllib.request.urlopen(req, timeout=4) as r:
            d = json.load(r)
        for a in d.get("Answer", []):
            if a.get("type") == 1:
                print(a["data"])
                sys.exit(0)
    except Exception:
        continue
sys.exit(1)
PY
}

refresh_hosts_pin() {
  local ip current
  ip=$(resolve_headscale_public) || true
  if [[ -z "$ip" ]]; then
    log "could not resolve $HEADSCALE_HOST via DoH"
    return
  fi
  current=$(grep -E "[[:space:]]${HEADSCALE_HOST}$" /etc/hosts | awk '{print $1}' | head -1)
  if [[ "$current" != "$ip" ]]; then
    sed -i.bak "/[[:space:]]${HEADSCALE_HOST}\$/d" /etc/hosts
    printf '%s %s\n' "$ip" "$HEADSCALE_HOST" >> /etc/hosts
    log "pinned $HEADSCALE_HOST -> $ip (was ${current:-none})"
  fi
}

if need_reauth; then
  refresh_hosts_pin
  if [[ -r "$AUTHKEY_FILE" ]]; then
    AUTHKEY=$(cat "$AUTHKEY_FILE")
    if "$TS" up --login-server="$LOGIN_SERVER" --authkey="$AUTHKEY" --accept-routes=false --hostname=deck >> "$LOG" 2>&1; then
      log "tailscale up succeeded"
    else
      log "tailscale up failed (rc=$?)"
    fi
  else
    log "missing $AUTHKEY_FILE"
  fi
fi
```

---

docs/infrastructure/hosts/homelab-vm-runbook.md
# Homelab VM Runbook

*Proxmox VM - Monitoring & DevOps*

**Endpoint ID:** 443399
**Status:** 🟢 Online
**Hardware:** 4 vCPU, 28GB RAM
**Access:** `192.168.0.210`

---

## Overview

Homelab VM runs monitoring, alerting, and development services on Proxmox.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Platform | Proxmox VE |
| vCPU | 4 cores |
| RAM | 28GB |
| Storage | 100GB SSD |
| Network | 1x 1GbE |

## Services

### Monitoring Stack

| Service | Port | Purpose |
|---------|------|---------|
| **Prometheus** | 9090 | Metrics collection |
| **Grafana** | 3000 | Dashboards |
| **Alertmanager** | 9093 | Alert routing |
| **Node Exporter** | 9100 | System metrics |
| **cAdvisor** | 8080 | Container metrics |
| **Uptime Kuma** | 3001 | Uptime monitoring |

### Development

| Service | Port | Purpose |
|---------|------|---------|
| Gitea | 3000 | Git hosting |
| Gitea Runner | 3008 | CI/CD runner |
| OpenHands | 8000 | AI developer |

### Database

| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL | 5432 | Database |
| Redis | 6379 | Caching |

---

## Daily Operations

### Check Monitoring
```bash
# Prometheus targets
curl http://192.168.0.210:9090/api/v1/targets | jq

# Grafana dashboards
open http://192.168.0.210:3000
```

### Alert Status
```bash
# Alertmanager
open http://192.168.0.210:9093

# Check ntfy for alerts
curl -s ntfy.vish.local/homelab-alerts | head -20
```

---

## Prometheus Configuration

### Scraping Targets
- Node exporters (all hosts)
- cAdvisor (all hosts)
- Prometheus self-monitoring
- Application-specific metrics
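A minimal `scrape_configs` sketch matching the target list above (job names and the non-local targets are illustrative assumptions, not copied from the live config):

```yaml
# prometheus.yml fragment -- illustrative sketch only
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 192.168.0.210:9100    # homelab VM node exporter
          - rpi5-vish.local:9100  # example remote host
  - job_name: cadvisor
    static_configs:
      - targets: ['192.168.0.210:8080']
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
```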

### Retention
- Time: 30 days
- Storage: 20GB
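These limits are set via Prometheus startup flags rather than in `prometheus.yml`; a docker-compose sketch (the service layout is an assumption) applying the policy above:

```yaml
# docker-compose.yml fragment -- illustrative
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=20GB
```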

### Maintenance
```bash
# Check TSDB size
du -sh /var/lib/prometheus/

# Manual compaction
docker exec prometheus promtool tsdb compact /prometheus
```

---

## Grafana Dashboards

### Key Dashboards
- Infrastructure Overview
- Container Health
- Network Traffic
- Service-specific metrics

### Alert Rules
- CPU > 80% for 5 minutes
- Memory > 90% for 5 minutes
- Disk > 85%
- Service down > 2 minutes
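Expressed as a Prometheus alerting rule, the first threshold above looks roughly like this (file name and label values are assumptions):

```yaml
# rules/cpu.yml -- illustrative sketch of the CPU threshold
groups:
  - name: host-alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }} for 5 minutes"
```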

---

## Common Issues

### Prometheus Not Scraping
1. Check targets: Prometheus UI → Status → Targets
2. Verify network connectivity
3. Check firewall rules
4. Review scrape errors in logs

### Grafana Dashboards Slow
1. Check Prometheus query performance
2. Reduce time range
3. Optimize queries
4. Check resource usage

### Alerts Not Firing
1. Verify Alertmanager config
2. Check ntfy integration
3. Review alert rules syntax
4. Test with artificial alert
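A synthetic alert can be posted straight to the Alertmanager v2 API to exercise the notification path; this sketch validates the payload locally before posting (alert name and labels are illustrative):

```shell
# Post a synthetic alert to Alertmanager to exercise routing.
payload='[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Synthetic test alert"}}]'

# Sanity-check the payload is valid JSON before posting
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload ok"

# Fire it (adjust the host for your setup; it resolves on its own later)
curl -XPOST -H 'Content-Type: application/json' \
  -d "$payload" http://192.168.0.210:9093/api/v2/alerts || true
```

If the notification never arrives, the problem is downstream of Prometheus (Alertmanager routing or the ntfy integration), which narrows the search considerably.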

---

## Maintenance

### Weekly
- [ ] Review alert history
- [ ] Check disk space
- [ ] Verify backups

### Monthly
- [ ] Clean old metrics
- [ ] Update dashboards
- [ ] Review alert thresholds

### Quarterly
- [ ] Test alert notifications
- [ ] Review retention policy
- [ ] Optimize queries

---

## Backup Procedures

### Configuration
```bash
# Grafana dashboards
cp -r /opt/grafana/dashboards /backup/

# Prometheus rules
cp -r /opt/prometheus/rules /backup/
```

### Ansible
```bash
ansible-playbook ansible/automation/playbooks/backup_configs.yml --tags homelab_vm
```

---

## Emergency Procedures

### Prometheus Full
1. Check storage: `docker system df`
2. Reduce retention (the `--storage.tsdb.retention.time` / `--storage.tsdb.retention.size` startup flags)
3. Delete old data: `docker exec prometheus rm -rf /prometheus/wal/*`
4. Restart container

### VM Down
1. Check Proxmox: `qm list`
2. Start VM: `qm start <vmid>`
3. Check console: `qm terminal <vmid>`
4. Review logs in Proxmox UI

---

## Useful Commands

```bash
# SSH access
ssh homelab@192.168.0.210

# Restart monitoring
cd /opt/docker/prometheus && docker-compose restart
cd /opt/docker/grafana && docker-compose restart

# Check targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# View logs
docker logs prometheus
docker logs grafana
docker logs alertmanager
```

---

## Links

- [Prometheus](http://192.168.0.210:9090)
- [Grafana](http://192.168.0.210:3000)
- [Alertmanager](http://192.168.0.210:9093)
- [Uptime Kuma](http://192.168.0.210:3001)
179
docs/infrastructure/hosts/rpi5-runbook.md
Normal file
@@ -0,0 +1,179 @@
# RPi5 Runbook

*Raspberry Pi 5 - Edge Services*

**Endpoint ID:** 443395
**Status:** 🟢 Online
**Hardware:** ARM Cortex-A76, 16GB RAM, 512GB USB SSD
**Access:** `rpi5-vish.local`

---

## Overview

Raspberry Pi 5 runs edge services including Immich backup and lightweight applications.

## Hardware Specs

| Component | Specification |
|----------|---------------|
| Model | Raspberry Pi 5 |
| CPU | ARM Cortex-A76 (4-core) |
| RAM | 16GB |
| Storage | 512GB USB-C SSD |
| Network | 1x 1GbE (Pi 4 adapter) |

## Services

### Primary Services

| Service | Port | Purpose |
|---------|------|---------|
| **Immich** | 2283 | Photo backup (edge) |
| Portainer Agent | 9001 | Container management |
| Node Exporter | 9100 | Metrics |

### Services (if enabled)

| Service | Port | Purpose |
|---------|------|---------|
| Plex | 32400 | Media server |
| WireGuard | 51820 | VPN |

## Secondary Pi Nodes

### Pi-5-Kevin
A secondary Raspberry Pi 5 node that is not typically online:

- **CPU**: Broadcom BCM2712 (4-core, 2.4GHz)
- **RAM**: 8GB LPDDR4X
- **Storage**: 64GB microSD
- **Network**: Gigabit Ethernet + WiFi 6

---

## Daily Operations

### Check Service Health
```bash
# Via Portainer
open http://rpi5-vish.local:9001

# Via SSH
ssh pi@rpi5-vish.local
docker ps
```

### Immich Status
```bash
# Access UI
open http://rpi5-vish.local:2283

# Check sync status
docker logs immich-server | grep -i sync
```

---

## Common Issues

### Container Won't Start (ARM compatibility)
1. Verify image supports ARM64: `docker pull --platform linux/arm64 <image>`
2. Check container logs
3. Verify Raspberry Pi OS 64-bit

### Storage Slow
1. Check USB drive: `lsusb`
2. Verify SSD: `sudo hdparm -t /dev/sda`
3. Use fast USB port (USB-C)

### Network Issues
1. Check adapter compatibility
2. Verify driver loaded: `lsmod | grep smsc95xx`
3. Update firmware: `sudo rpi-eeprom-update`

---

## Storage

### Layout
```
/home/pi/
├── docker/    # Docker data
├── immich/    # Photo storage
└── backups/   # Local backups
```

### Performance Tips
- Use USB 3.0 SSD
- Use a quality power supply (5V 5A)
- Enable USB max_current in config.txt
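The max_current tip corresponds to a single `config.txt` entry on the Pi 5 (the file lives at `/boot/firmware/config.txt` on current Raspberry Pi OS releases; older releases use `/boot/config.txt`); a sketch:

```ini
# /boot/firmware/config.txt -- illustrative entry for the tip above
usb_max_current_enable=1   ; allow higher USB current draw on the Pi 5
```

A reboot is required for config.txt changes to take effect.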

---

## Maintenance

### Weekly
- [ ] Check Docker disk usage
- [ ] Verify Immich backup
- [ ] Check container health

### Monthly
- [ ] Update Raspberry Pi OS
- [ ] Clean unused images
- [ ] Review resource usage

### Quarterly
- [ ] Test backup restoration
- [ ] Verify ARM image compatibility
- [ ] Check firmware updates

---

## Emergency Procedures

### SD Card/Storage Failure
1. Replace storage drive
2. Reinstall Raspberry Pi OS
3. Run deploy playbook:
```bash
ansible-playbook ansible/homelab/playbooks/deploy_rpi5_vish.yml
```

### Overheating
1. Add heatsinks
2. Enable fan
3. Reduce CPU frequency: `echo "arm_freq=1800" | sudo tee -a /boot/config.txt`

## Notes

This Raspberry Pi 5 is the primary node running Immich and other edge services; the secondary node **pi-5-kevin** is intentionally kept offline and brought up only when needed as a backup.

---

## Useful Commands

```bash
# SSH access
ssh pi@rpi5-vish.local

# Check temperature
vcgencmd measure_temp

# Check throttling
vcgencmd get_throttled

# Update firmware
sudo rpi-eeprom-update
sudo rpi-eeprom-update -a

# View Immich logs
docker logs -f immich-server
```

---

## Links

- [Immich](http://rpi5-vish.local:2283)
- [Portainer](http://rpi5-vish.local:9001)
66
docs/infrastructure/hosts/runbooks.md
Normal file
@@ -0,0 +1,66 @@
# Host Runbooks

This directory contains operational runbooks for each host in the homelab infrastructure.

## Available Runbooks

- [Atlantis Runbook](./atlantis-runbook.md) - Synology DS1821+ (Primary NAS)
- [Calypso Runbook](./calypso-runbook.md) - Synology DS723+ (Secondary NAS)
- [Concord NUC Runbook](./concord-nuc-runbook.md) - Intel NUC (Home Automation & DNS)
- [Homelab VM Runbook](./homelab-vm-runbook.md) - Proxmox VM (Monitoring & DevOps)
- [RPi5 Runbook](./rpi5-runbook.md) - Raspberry Pi 5 (Edge Services)

---

## Common Tasks

All hosts share common operational procedures:

### Viewing Logs
```bash
# Via SSH to host
docker logs <container_name>

# Via Portainer:
#   Containers → <container> → Logs
```

### Restarting Services
```bash
# Via docker-compose
cd hosts/<host>/<service>
docker-compose restart <service>

# Via Portainer:
#   Stacks → <stack> → Restart
```

### Checking Resource Usage
```bash
# Via Portainer:
#   Containers → Sort by CPU/Memory

# Via CLI
docker stats
```

---

## Emergency Contacts

| Role | Contact | When to Contact |
|------|---------|------------------|
| Primary Admin | User | All critical issues |
| Emergency | NTFY | Critical alerts only |

---

## Quick Reference

| Host | Primary Role | Critical Services | SSH Access |
|------|--------------|-------------------|------------|
| Atlantis | Media, Vault | Vaultwarden, Plex, Immich | atlantis.local |
| Calypso | Infrastructure | NPM, Authentik, Prometheus | calypso.local |
| Concord NUC | DNS, HA | AdGuard, Home Assistant | concord-nuc.local |
| Homelab VM | Monitoring | Prometheus, Grafana | 192.168.0.210 |
| RPi5 | Edge | Immich (backup) | rpi5-vish.local |
Block a user