# Steam Deck Runbook

*SteamOS handheld: tailnet node with self-healing watchdog.*

- **Headscale ID:** 29
- **Status:** 🟢 Online
- **Hardware:** Steam Deck (AMD APU, SteamOS Holo, btrfs root)
- **Access:** `ssh deck` (key-based, via `~/.ssh/config` alias → `192.168.0.140`)
- **Tailnet IP:** `100.64.0.11` (MagicDNS `deck.tail.vish.gg`)

---

## Overview

The Steam Deck participates in the homelab tailnet for SSH/remote access. Because SteamOS ships with an immutable `/usr` and a read-only `/usr/local`, all custom configuration lives in `/etc` (writable via overlay) and `/opt` (writable, outside Valve's managed tree). The tailscaled state file and the watchdog log live under `/var`.

## Filesystem layout

| Path | Purpose | Survives SteamOS update? |
|------|---------|--------------------------|
| `/opt/tailscale/tailscale`, `/opt/tailscale/tailscaled` | Tailscale binaries (standard Steam Deck install location) | Usually yes |
| `/etc/systemd/system/tailscaled.service` + `tailscaled.service.d/override.conf` | systemd unit + override pointing ExecStart at `/opt/tailscale/tailscaled` | Overlay; may be wiped on major updates |
| `/etc/tailscale/authkey` (0600 root) | Reusable Headscale preauth key used by the watchdog for re-auth | Overlay; may be wiped |
| `/etc/tailscale/watchdog.sh` (0755 root) | Re-auth watchdog (bash + python3) | Overlay; may be wiped |
| `/etc/systemd/system/tailscale-watchdog.{service,timer}` | Watchdog systemd units | Overlay; may be wiped |
| `/etc/hosts` | Contains a pin `<public-ip> headscale.vish.gg` maintained by the watchdog | Overlay; may be wiped |
| `/var/log/tailscale-watchdog.log` | Watchdog activity log | Yes (on /var) |

> **After any SteamOS upgrade**, verify these files still exist: `ls /etc/tailscale/ /etc/systemd/system/tailscale-watchdog.*`. If the overlay was reset, re-run the setup (see `docs/infrastructure/hosts/deck-runbook.md`, section "Recovering after a SteamOS update").

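The post-upgrade check can be scripted so that a partial wipe is not missed. The following is a minimal sketch, assuming the paths from the table above; `check_overlay` is a hypothetical helper name and is not part of the checked-in watchdog.

```bash
# Sketch: report which Tailscale/watchdog files are missing after an update.
# check_overlay is a hypothetical helper; the path list mirrors the table above.
check_overlay() {
  # Optional first argument: a root prefix, handy for dry runs against a copy.
  local root="${1:-}" missing=0 path
  for path in \
    /opt/tailscale/tailscale \
    /opt/tailscale/tailscaled \
    /etc/systemd/system/tailscaled.service \
    /etc/systemd/system/tailscaled.service.d/override.conf \
    /etc/tailscale/authkey \
    /etc/tailscale/watchdog.sh \
    /etc/systemd/system/tailscale-watchdog.service \
    /etc/systemd/system/tailscale-watchdog.timer
  do
    if [ ! -e "${root}${path}" ]; then
      echo "MISSING ${path}"
      missing=1
    fi
  done
  # Exit status 0 means everything is present, 1 means at least one gap.
  return "$missing"
}
```

Run on the Deck, `check_overlay || echo "overlay reset, run the recovery steps"` prints one `MISSING` line per absent file.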
## Tailscale / Headscale

- **Control server:** `https://headscale.vish.gg:8443` (migrated off public Tailscale on 2026-04-19).
- **Preauth key:** reusable, 1-year expiry, stored at `/etc/tailscale/authkey`. It is reusable so the watchdog can re-authenticate without human intervention.
- **Node expiry:** nodes registered in Headscale do not expire on their own; they expire only when explicitly expired with `headscale nodes expire`. Setting the `0001-01-01` sentinel (node-level "never expires") would require Headscale DB manipulation and is not currently applied.

### Watchdog behavior

`/etc/tailscale/watchdog.sh` runs every 5 minutes via `tailscale-watchdog.timer` (`OnBootSec=2min`, `OnUnitActiveSec=5min`). Each tick:

1. Calls `tailscale status --json` and extracts `BackendState` via python3.
2. If `BackendState` is `Running`, exits silently.
3. Otherwise (`NeedsLogin`, `Stopped`, `NoState`, or daemon missing):
   - Refreshes the `/etc/hosts` pin for `headscale.vish.gg` using **DNS-over-HTTPS** (`dns.google`, fallback `1.1.1.1`). This is needed because the Deck has no `dig`/`nslookup`/`host` (only `python3`), and because the local resolver returns the *internal* LAN IP for `headscale.vish.gg` when on-LAN (split-horizon DNS), which is useless when the Deck is travelling.
   - Re-runs `tailscale up --login-server=https://headscale.vish.gg:8443 --authkey=<stored> --accept-routes=false --hostname=deck`.
   - Logs to `/var/log/tailscale-watchdog.log`.

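The tick described above can be sketched roughly as follows. This is an illustration only and not the checked-in `/etc/tailscale/watchdog.sh`: every function name is hypothetical, the log line format is an assumption, the `Missing` state label is invented for the "daemon missing" case, and the `1.1.1.1` fallback is omitted for brevity.

```bash
# Print BackendState from `tailscale status --json` output on stdin.
# Prints "Missing" (assumed label) when the input is empty or not JSON.
backend_state() {
  python3 -c '
import json, sys
try:
    print(json.load(sys.stdin).get("BackendState", "Missing"))
except ValueError:
    print("Missing")
'
}

# Succeeds when the state calls for re-authentication.
needs_reauth() {
  [ "$1" != "Running" ]
}

# Resolve an A record via the dns.google JSON API (requires network access).
resolve_doh() {
  python3 - "$1" <<'PY'
import json, sys, urllib.request
url = "https://dns.google/resolve?name=" + sys.argv[1] + "&type=A"
with urllib.request.urlopen(url, timeout=10) as resp:
    answers = json.load(resp).get("Answer", [])
ips = [a["data"] for a in answers if a.get("type") == 1]
print(ips[0] if ips else "")
PY
}

# Replace or append the pin for a hostname. Usage: pin_host <ip> <name> [file]
pin_host() {
  local ip="$1" name="$2" file="${3:-/etc/hosts}" tmp
  tmp="$(mktemp)"
  # Keep every line whose second field is not the hostname, then add the pin.
  awk -v n="$name" '$2 != n' "$file" > "$tmp" 2>/dev/null || true
  printf '%s %s\n' "$ip" "$name" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# One watchdog tick: exit quietly when Running, otherwise re-pin and re-auth.
watchdog_tick() {
  local state ip
  state="$(tailscale status --json 2>/dev/null | backend_state)"
  needs_reauth "$state" || return 0
  ip="$(resolve_doh headscale.vish.gg 2>/dev/null)" || ip=""
  if [ -n "$ip" ]; then
    pin_host "$ip" headscale.vish.gg
  fi
  tailscale up --login-server=https://headscale.vish.gg:8443 \
    --authkey="$(cat /etc/tailscale/authkey)" --accept-routes=false --hostname=deck
  echo "$(date '+%Y-%m-%dT%H:%M:%S%z') state=$state re-ran tailscale up" \
    >> /var/log/tailscale-watchdog.log
}
```

Keeping the state parsing and the hosts-file rewrite in separate functions makes them easy to exercise on a workstation (for example `pin_host 184.23.52.14 headscale.vish.gg /tmp/hosts-copy`) without touching the Deck.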
### Verified failure-recovery matrix (2026-04-19)

| Failure | Recovery mechanism | Recovery time |
|---------|-------------------|---------------|
| `kill -9 tailscaled` | `Restart=on-failure` in tailscaled.service | ~3 s, PID rotated, state preserved |
| `tailscale down` | Watchdog detects `Stopped`, runs `tailscale up` | ~1 s after next timer tick (≤5 min) |
| `tailscale logout` | Watchdog detects `NeedsLogin`, runs `tailscale up` with stored authkey | ~4 s after next timer tick (≤5 min) |
| Boot | tailscaled auto-starts from `/var/lib/tailscale/tailscaled.state`; watchdog fires 2 min after boot as a safety net | not yet validated |

### Known gap

If tailscaled is stopped **cleanly** (`systemctl stop tailscaled`), `Restart=on-failure` does not apply. The current watchdog logs "tailscaled not running" and tries `tailscale up`, which fails because the daemon socket is missing. On boot this is a non-issue (systemd starts tailscaled), but at runtime it leaves the Deck disconnected until the daemon is started manually or the Deck reboots. If this becomes a problem, extend the watchdog to run `systemctl start tailscaled` when `pidof tailscaled` returns nothing.

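The proposed extension could look roughly like the sketch below. `ensure_daemon` is a hypothetical function name, and the sketch assumes the watchdog runs as root, so `systemctl start` needs no password.

```bash
# Sketch of the proposed fix: start tailscaled when no process is found.
# ensure_daemon is a hypothetical name; it is not in the current watchdog.
ensure_daemon() {
  if ! pidof tailscaled >/dev/null 2>&1; then
    systemctl start tailscaled
  fi
}
```

Calling `ensure_daemon` at the top of each tick, before `tailscale status --json`, would let the existing re-auth path proceed against a live socket.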
## SSH

- **Alias on homelab-vm:** `~/.ssh/config` entry → `Host deck / HostName 192.168.0.140 / User deck / IdentityFile ~/.ssh/id_ed25519`.
- **Installed key:** `admin@thevish.io` ed25519 pubkey in `/home/deck/.ssh/authorized_keys`.
- **Password** (for sudo): same as initial login.
- **MCP:** `deck` is in `scripts/homelab-mcp/server.py` `SSH_KNOWN_HOSTS`, so `ssh_exec(host="deck", …)` works from the homelab MCP.

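Written out in full, the alias from the first bullet is a four-line `~/.ssh/config` stanza. The sketch below appends it only when no `Host deck` entry exists yet; `add_deck_alias` is a hypothetical helper, and the values are the ones listed above.

```bash
# Sketch: add the "deck" alias to an ssh config file if it is not there yet.
# add_deck_alias is a hypothetical helper name.
add_deck_alias() {
  local cfg="${1:-$HOME/.ssh/config}"
  mkdir -p "$(dirname "$cfg")"
  touch "$cfg"
  if ! grep -q '^Host deck$' "$cfg"; then
    cat >> "$cfg" <<'EOF'

Host deck
    HostName 192.168.0.140
    User deck
    IdentityFile ~/.ssh/id_ed25519
EOF
  fi
}
```

Running `add_deck_alias` twice leaves a single stanza, so it is safe to include in a workstation bootstrap script.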
## Recovering after a SteamOS update

If the `/etc` overlay was wiped:

```bash
# 1. Re-install the SSH public key
cat ~/.ssh/id_ed25519.pub | sshpass -p '<password>' ssh -o StrictHostKeyChecking=accept-new deck@192.168.0.140 \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'

# 2. Restore systemd override for tailscaled
ssh deck 'echo "<pw>" | sudo -S mkdir -p /etc/systemd/system/tailscaled.service.d && \
  echo -e "[Service]\nExecStartPre=\nExecStartPre=/opt/tailscale/tailscaled --cleanup\nExecStart=\nExecStart=/opt/tailscale/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=\${PORT} \$FLAGS\nExecStopPost=\nExecStopPost=/opt/tailscale/tailscaled --cleanup" | sudo tee /etc/systemd/system/tailscaled.service.d/override.conf'

# 3. Restore /etc/hosts pin
ssh deck 'echo "<pw>" | sudo -S sh -c "grep -q headscale.vish.gg /etc/hosts || echo 184.23.52.14 headscale.vish.gg >> /etc/hosts"'

# 4. Create fresh reusable preauth key (via MCP headscale_create_preauth_key) and store it
AUTHKEY='hskey-auth-…' # pragma: allowlist secret (placeholder)
# Outer double quotes so $AUTHKEY expands locally; the variable is not set on the Deck.
ssh deck "echo '<pw>' | sudo -S sh -c 'mkdir -p /etc/tailscale && umask 077 && printf %s \"$AUTHKEY\" > /etc/tailscale/authkey && chmod 600 /etc/tailscale/authkey'"

# 5. Reinstall watchdog (copy from git or re-apply from this runbook's source repo)
scp docs/infrastructure/hosts/deck/watchdog.sh deck:/tmp/
ssh deck 'echo "<pw>" | sudo -S install -m 0755 /tmp/watchdog.sh /etc/tailscale/watchdog.sh'

# 6. Reinstall + enable systemd units (sources in docs/infrastructure/hosts/deck/)
scp docs/infrastructure/hosts/deck/tailscale-watchdog.{service,timer} deck:/tmp/
ssh deck 'echo "<pw>" | sudo -S sh -c "install -m 0644 /tmp/tailscale-watchdog.service /etc/systemd/system/ && install -m 0644 /tmp/tailscale-watchdog.timer /etc/systemd/system/ && systemctl daemon-reload && systemctl enable --now tailscaled.service tailscale-watchdog.timer"'
```

The watchdog script and systemd unit sources are checked in under `docs/infrastructure/hosts/deck/`, so a recovery doesn't require reconstructing them from memory.