homelab-optimized/docs/infrastructure/hosts/deck-runbook.md
Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC

Steam Deck Runbook

SteamOS handheld — tailnet node with self-healing watchdog.

  • Headscale ID: 29
  • Status: 🟢 Online
  • Hardware: Steam Deck (AMD APU, SteamOS Holo, btrfs root)
  • Access: ssh deck (key-based, via ~/.ssh/config alias → 192.168.0.140)
  • Tailnet IP: 100.64.0.11 (MagicDNS deck.tail.vish.gg)


Overview

The Steam Deck participates in the homelab tailnet for SSH/remote access. Because SteamOS ships with an immutable /usr and a read-only /usr/local, all custom state lives in /etc (writable via overlay) and /opt (writable, outside Valve's managed tree).

Filesystem layout

| Path | Purpose | Survives SteamOS update? |
| --- | --- | --- |
| /opt/tailscale/tailscale, /opt/tailscale/tailscaled | Tailscale binaries (standard Steam Deck install location) | Usually yes |
| /etc/systemd/system/tailscaled.service + tailscaled.service.d/override.conf | systemd unit + override pointing ExecStart at /opt/tailscale/tailscaled | Overlay, may be wiped on major updates |
| /etc/tailscale/authkey (0600 root) | Reusable Headscale preauth key used by the watchdog for re-auth | Overlay, may be wiped |
| /etc/tailscale/watchdog.sh (0755 root) | Re-auth watchdog (bash + python3) | Overlay, may be wiped |
| /etc/systemd/system/tailscale-watchdog.{service,timer} | Watchdog systemd units | Overlay, may be wiped |
| /etc/hosts | Contains a pin <public-ip> headscale.vish.gg maintained by the watchdog | Overlay, may be wiped |
| /var/log/tailscale-watchdog.log | Watchdog activity log | Yes (on /var) |

After any SteamOS upgrade, verify these files still exist: ls /etc/tailscale/ /etc/systemd/system/tailscale-watchdog.*. If the overlay was reset, re-run the setup described in the section "Recovering after a SteamOS update" of this runbook.
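The post-upgrade check can also be scripted so that nothing is missed by eye. The following is a minimal sketch, not part of the checked-in tooling: the names OVERLAY_PATHS and missing_paths are hypothetical, and the path list is copied from the overlay rows of the table above.

```python
import os

# Paths copied from the filesystem-layout table; all of them live in the
# /etc overlay and may disappear after a SteamOS update.
OVERLAY_PATHS = [
    "/etc/systemd/system/tailscaled.service.d/override.conf",
    "/etc/tailscale/authkey",
    "/etc/tailscale/watchdog.sh",
    "/etc/systemd/system/tailscale-watchdog.service",
    "/etc/systemd/system/tailscale-watchdog.timer",
]


def missing_paths(paths, exists=os.path.exists):
    """Return the paths that do not exist, preserving input order.

    `exists` is injectable so the function can be exercised without
    touching the real filesystem.
    """
    return [path for path in paths if not exists(path)]


def main():
    """Print each missing path and return a shell-style exit code."""
    gone = missing_paths(OVERLAY_PATHS)
    for path in gone:
        print("MISSING", path)
    return 1 if gone else 0
```

Calling main() on the Deck (as root, in case /etc/tailscale is not readable by the deck user) returns 0 when every file is present and 1 otherwise. The /etc/hosts pin is not covered here, because the file itself always exists; only its content can be lost.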

Tailscale / Headscale

  • Control server: https://headscale.vish.gg:8443 (migrated off public Tailscale 2026-04-19).
  • Preauth key: reusable, 1-year expiry, stored at /etc/tailscale/authkey. Reusable so the watchdog can re-authenticate without human intervention.
  • Node expiry: registered nodes in Headscale do not auto-expire unless explicitly expired with headscale nodes expire. Setting the 0001-01-01 sentinel (node-level "never expires") would require editing the Headscale database directly; this has not been applied.

Watchdog behavior

/etc/tailscale/watchdog.sh runs every 5 minutes via the tailscale-watchdog.timer (OnBootSec=2min, OnUnitActiveSec=5min). Each tick:

  1. Calls tailscale status --json, extracts BackendState via python3.
  2. If BackendState is Running, exits silently.
  3. Otherwise (NeedsLogin, Stopped, NoState, or daemon missing):
    • Refreshes the /etc/hosts pin for headscale.vish.gg using DNS-over-HTTPS (dns.google, fallback 1.1.1.1). The lookup goes through python3 and DoH for two reasons: the Deck has no dig, nslookup, or host, and the local resolver returns the internal LAN IP for headscale.vish.gg when on-LAN (split-horizon DNS), which is useless once the Deck is travelling.
    • Re-runs tailscale up --login-server=https://headscale.vish.gg:8443 --authkey=<stored> --accept-routes=false --hostname=deck.
    • Logs to /var/log/tailscale-watchdog.log.
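The decision logic of one tick can be sketched in Python. This is an illustration under stated assumptions, not the contents of watchdog.sh: all function names are hypothetical, and the DoH answer is assumed to be the JSON form returned by dns.google's /resolve endpoint, where A records carry "type": 1 and the address in "data".

```python
import json

HEADSCALE_HOST = "headscale.vish.gg"


def backend_state(status_json):
    """Extract BackendState from `tailscale status --json` output.

    Returns None when the output is empty or not a JSON object, which is
    what the caller would see if the daemon could not be reached.
    """
    try:
        return json.loads(status_json).get("BackendState")
    except (ValueError, TypeError, AttributeError):
        return None


def needs_reauth(state):
    """Only a Running backend is healthy; anything else triggers re-auth."""
    return state != "Running"


def first_a_record(doh_json):
    """Return the first IPv4 address in a DoH JSON answer, or None."""
    for record in json.loads(doh_json).get("Answer", []):
        if record.get("type") == 1:  # 1 is the DNS record type for A
            return record["data"]
    return None


def pin_host(hosts_text, ip, hostname=HEADSCALE_HOST):
    """Return hosts-file text with exactly one `ip hostname` pin.

    Existing entries for the hostname are dropped; comments and all
    other lines are kept in their original order.
    """
    kept = [
        line for line in hosts_text.splitlines()
        if hostname not in line.split("#", 1)[0].split()
    ]
    kept.append(f"{ip} {hostname}")
    return "\n".join(kept) + "\n"
```

Keeping these functions pure (strings in, strings out) means the parsing can be checked on any machine. The network fetch, the write to /etc/hosts, and the tailscale up call stay in the wrapper script.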

Verified failure-recovery matrix (2026-04-19)

| Failure | Recovery mechanism | Recovery time |
| --- | --- | --- |
| kill -9 tailscaled | Restart=on-failure in tailscaled.service | ~3 s, PID rotated, state preserved |
| tailscale down | Watchdog detects Stopped, runs tailscale up | ~1 s after next timer tick (≤5 min) |
| tailscale logout | Watchdog detects NeedsLogin, runs tailscale up with stored authkey | ~4 s after next timer tick (≤5 min) |
| Boot | tailscaled auto-starts from /var/lib/tailscale/tailscaled.state; watchdog fires 2 min after boot as a safety net | not yet validated |

Known gap

If tailscaled is stopped cleanly (systemctl stop tailscaled), the current watchdog logs "tailscaled not running" and tries tailscale up, which fails because the daemon socket is missing. On boot this is a non-issue (systemd starts tailscaled). During runtime, this would leave the Deck disconnected. If this becomes a problem, extend the watchdog to systemctl start tailscaled when pidof tailscaled is empty.
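The proposed extension can be sketched as a planning function that decides which commands one tick should run. This is a hypothetical illustration, not the current watchdog: the name recovery_commands and its parameters are assumptions, while the tailscale up flags are the ones listed under "Watchdog behavior".

```python
def recovery_commands(daemon_running, state, authkey):
    """Return the commands (as argv lists) one tick would run, in order.

    daemon_running: whether a tailscaled process was found (for example
                    via `pidof tailscaled`).
    state:          BackendState string, or None if it could not be read.
    authkey:        contents of /etc/tailscale/authkey.
    """
    commands = []
    if not daemon_running:
        # Proposed extension: bring the daemon back first, so the socket
        # exists before `tailscale up` is attempted.
        commands.append(["systemctl", "start", "tailscaled"])
    if not daemon_running or state != "Running":
        commands.append([
            "tailscale", "up",
            "--login-server=https://headscale.vish.gg:8443",
            f"--authkey={authkey}",
            "--accept-routes=false",
            "--hostname=deck",
        ])
    return commands
```

A real implementation would probably need a short wait between the two commands, since the daemon socket may not be ready the instant systemctl start returns. That delay is not modelled here.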

SSH

  • Alias on homelab-vm: ~/.ssh/config entry → Host deck / HostName 192.168.0.140 / User deck / IdentityFile ~/.ssh/id_ed25519.
  • Installed key: admin@thevish.io ed25519 pubkey in /home/deck/.ssh/authorized_keys.
  • Password (for sudo): same as initial login.
  • MCP: deck is in scripts/homelab-mcp/server.py SSH_KNOWN_HOSTS, so ssh_exec(host="deck", …) works from the homelab MCP.

Recovering after a SteamOS update

If the /etc overlay was wiped:

# 1. Re-install key
cat ~/.ssh/id_ed25519.pub | sshpass -p '<pw>' ssh -o StrictHostKeyChecking=accept-new deck@192.168.0.140 \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'

# 2. Restore systemd override for tailscaled
ssh deck 'echo <pw> | sudo -S mkdir -p /etc/systemd/system/tailscaled.service.d && \
  echo -e "[Service]\nExecStartPre=\nExecStartPre=/opt/tailscale/tailscaled --cleanup\nExecStart=\nExecStart=/opt/tailscale/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=\${PORT} \$FLAGS\nExecStopPost=\nExecStopPost=/opt/tailscale/tailscaled --cleanup" | sudo tee /etc/systemd/system/tailscaled.service.d/override.conf'

# 3. Restore /etc/hosts pin
ssh deck 'echo <pw> | sudo -S sh -c "grep -q headscale.vish.gg /etc/hosts || echo 184.23.52.14 headscale.vish.gg >> /etc/hosts"'

# 4. Create fresh reusable preauth key (via MCP headscale_create_preauth_key) and store it
AUTHKEY='hskey-auth-…'  # pragma: allowlist secret (placeholder)
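# The outer quotes are double quotes so that $AUTHKEY expands locally; inside single quotes it would expand on the Deck, where it is unset, and write an empty file.
ssh deck "echo <pw> | sudo -S sh -c 'mkdir -p /etc/tailscale && umask 077 && printf %s \"$AUTHKEY\" > /etc/tailscale/authkey && chmod 600 /etc/tailscale/authkey'"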

# 5. Reinstall watchdog (copy from git or re-apply from this runbook's source repo)
scp docs/infrastructure/hosts/deck/watchdog.sh deck:/tmp/
ssh deck 'echo <pw> | sudo -S install -m 0755 /tmp/watchdog.sh /etc/tailscale/watchdog.sh'

# 6. Reinstall + enable systemd units (sources under docs/infrastructure/hosts/deck/)
scp docs/infrastructure/hosts/deck/tailscale-watchdog.{service,timer} deck:/tmp/
ssh deck 'echo <pw> | sudo -S sh -c "install -m 0644 /tmp/tailscale-watchdog.service /etc/systemd/system/ && install -m 0644 /tmp/tailscale-watchdog.timer /etc/systemd/system/ && systemctl daemon-reload && systemctl enable --now tailscaled.service tailscale-watchdog.timer"'

The watchdog script and systemd unit sources are checked in under docs/infrastructure/hosts/deck/ so a recovery doesn't require reconstructing them from memory.