homelab-optimized/CLAUDE.md

# Homelab Claude Code Instructions

## Deployment

- When deploying services, always verify the target host before proceeding. Confirm which host a service should run on and check for port conflicts with existing services.
- Run `ss -tlnp | grep <port>` on the target host before deploying, and treat any matching listener as a conflict to resolve first.
- Hosts: atlantis (Synology NAS, media/arr), calypso (Synology, DNS/SSO), olares (K3s, GPU), nuc (lightweight), rpi5 (Kuma), homelab-vm (monitoring/dashboard), guava (TrueNAS), seattle (remote), matrix-ubuntu (NPM/CrowdSec).
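
A plain `grep <port>` also matches longer port numbers that contain the same digits (for example `80` inside `8080`). The sketch below shows one way to check for an exact match by parsing `ss -tlnp` output. The function names and the sample output are illustrative assumptions, not part of any existing script in this repo.

```python
import re

def listening_ports(ss_output: str) -> set[int]:
    """Extract local TCP listening ports from `ss -tlnp` output."""
    ports = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows start with the socket state; this skips the header row.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # Fourth column is Local Address:Port, e.g. 0.0.0.0:3100 or [::]:22.
        match = re.search(r":(\d+)$", fields[3])
        if match:
            ports.add(int(match.group(1)))
    return ports

def port_is_free(ss_output: str, port: int) -> bool:
    """True when no listener in the given output uses exactly this port."""
    return port not in listening_ports(ss_output)

# Hypothetical sample output, for illustration only.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0      4096         0.0.0.0:3100      0.0.0.0:*     users:(("node",pid=1234,fd=19))
LISTEN 0      128             [::]:22           [::]:*     users:(("sshd",pid=801,fd=4))
"""
```

With the sample above, `port_is_free(SAMPLE, 310)` is true even though `grep 310` would have matched the `3100` line.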

## Configuration Management

- Before modifying config files (YAML, JSON, etc.), always create a backup copy first.
- Never use sed for complex YAML edits — use a proper parser or manual editing to avoid duplicate keys and corruption.
- For YAML changes, validate with `python3 -c "import yaml; yaml.safe_load(open('file.yaml'))"` after editing.
- Never empty or overwrite a config file without reading it first.
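
As a minimal sketch of the backup-then-validate habit, the helper below copies a file to a timestamped backup and parses JSON while rejecting duplicate keys. The function names and the `.bak-<timestamp>` suffix are assumptions chosen for illustration. It uses only the standard library, so it covers JSON; the `yaml.safe_load` one-liner above remains the check for YAML.

```python
import json
import shutil
import time
from pathlib import Path

def backup_file(path: str) -> Path:
    """Copy `path` to `<name>.bak-<timestamp>` next to it and return the backup path."""
    src = Path(path)
    dest = src.with_name(f"{src.name}.bak-{time.strftime('%Y%m%d-%H%M%S')}")
    shutil.copy2(src, dest)  # copy2 also preserves mtime and permission bits
    return dest

def _reject_duplicates(pairs):
    """object_pairs_hook that raises instead of silently keeping the last value."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen

def load_json_strict(text: str):
    """Parse JSON, raising ValueError on syntax errors or duplicate keys."""
    return json.loads(text, object_pairs_hook=_reject_duplicates)
```

A typical order would be `backup_file(path)`, then edit, then `load_json_strict(Path(path).read_text())` before restarting the service.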

## Homelab SSH & Networking

- For homelab SSH operations: if MCP SSH times out on large outputs, fall back to Bash with `ssh` directly.
- Always use the correct Tailscale/LAN IP for each host. When Ollama or another service isn't on localhost, look up the endpoint in saved memory or ask for it instead of guessing.
- After making infrastructure changes (Tailscale, DNS, networking), always verify connectivity from affected hosts before marking complete.
- Never run a second instance of a network daemon (tailscaled, etc.) — it will break host networking.
- homelab-vm IS localhost: never SSH into it; run commands locally.
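
The local-versus-remote rule can be sketched as a small command builder. `build_command` and `LOCAL_HOSTS` are hypothetical names; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options, used here so a missing key or unreachable host fails fast instead of hanging.

```python
LOCAL_HOSTS = {"homelab-vm", "localhost"}

def build_command(host: str, remote_cmd: str, connect_timeout: int = 10) -> list[str]:
    """Return the argv to run `remote_cmd` on `host`, locally if host is this machine."""
    if host in LOCAL_HOSTS:
        # homelab-vm is the machine we are already on: no SSH hop.
        return ["bash", "-c", remote_cmd]
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # never prompt for a password
        "-o", f"ConnectTimeout={connect_timeout}",  # give up quickly on dead hosts
        host,
        remote_cmd,
    ]
```

The returned list can be passed straight to `subprocess.run` together with a timeout.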

## Heterogeneous Host Awareness

- Before installing/running anything on a remote host, probe the environment first: `uname -a`, `which <binary>`, `mount | grep noexec`, `sudo -n true`. Adapt the plan or propose alternatives up front rather than failing and then pivoting.
- Tailscale binary paths differ across hosts (Synology, GL.iNet, k3s, standard Linux) — verify with `which tailscale` before assuming.
- Synology `/tmp` is `noexec` — stage scripts in `/volume1` or user home.
- Synology has no `git` and no SFTP subsystem. Copy files with an SSH pipe (`cat file | ssh host 'cat > dest'`), and invoke Docker as `sudo /usr/local/bin/docker`.
- GL.iNet travel routers wipe config on firmware update — reapply watchdog/Tailscale config after every flash.
- uqiyoe is **Windows** — use `dir`/`del`/`rmdir`, not `ls`/`rm`. SSH user is `vish`, not `homelab`.
- Check architecture (`uname -m`) before downloading binaries; the fleet has mixed amd64/arm64.
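
For the architecture check, one possible approach is to normalize the `uname -m` output to the `amd64`/`arm64` labels that release assets commonly use. The alias table and function names below are illustrative assumptions, and the asset file names in the usage note are made up.

```python
# Common `uname -m` values mapped to the labels used in this document.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}

def normalize_arch(uname_m: str) -> str:
    """Map raw `uname -m` output to 'amd64' or 'arm64'; raise on anything else."""
    key = uname_m.strip().lower()
    if key not in ARCH_ALIASES:
        raise ValueError(f"unrecognized architecture: {uname_m!r}")
    return ARCH_ALIASES[key]

def pick_asset(uname_m: str, assets: dict[str, str]) -> str:
    """Choose the download for this host from a {arch: filename} mapping."""
    return assets[normalize_arch(uname_m)]
```

For example, `pick_asset("aarch64\n", {"amd64": "tool-linux-amd64", "arm64": "tool-linux-arm64"})` selects the arm64 entry, and an unknown value raises instead of downloading the wrong binary.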

## Long-Running Commands

- Set explicit, short timeouts on SSH/Bash commands. Default 30s, max 120s for known-slow ops.
- For potentially slow operations (find on NAS, large rsync, apt upgrade): run with `run_in_background: true` and poll, or scope tightly with `-maxdepth`/path filters.
- Never run unbounded `find /` on NAS or Synology hosts — always anchor to a specific path.
- For destructive/mutating ops (rsync, dd, rm -rf, db edits): dry-run first, verify checksums/counts, take a backup before applying. Don't trust silent successes — `rsync` once truncated 70 GB to 74 MB without erroring.
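
The timeout rule (default 30s, hard cap 120s) could be enforced with a thin wrapper around `subprocess.run`. This is a sketch; `run_bounded` and the constant names are hypothetical.

```python
import subprocess

DEFAULT_TIMEOUT = 30   # seconds, per the rule above
MAX_TIMEOUT = 120      # hard cap for known-slow operations

def run_bounded(argv: list[str], timeout: int = DEFAULT_TIMEOUT):
    """Run argv with the timeout clamped to [1, MAX_TIMEOUT] seconds.

    Returns (returncode, stdout). returncode is None if the command timed out;
    subprocess.run kills the child process in that case.
    """
    limit = max(1, min(timeout, MAX_TIMEOUT))
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=limit)
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout
```

A `None` return code signals that the operation should be rescoped or moved to a background run, not simply retried with a longer timeout.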

## Debugging Discipline

- Before changing anything to "fix" an issue, list the top 2–3 candidate root causes ranked by likelihood with one diagnostic per candidate. Run the diagnostics first, share results, then propose a fix. Don't patch the visible symptom (e.g., disabling a Kuma monitor) before confirming the underlying cause.

## Verification Discipline

- After deploying or fixing a service, verify end-to-end before declaring done: curl the endpoint, check Kuma status, tail logs for >60s of clean uptime.
- Kuma `accepted_statuscodes` must be a JSON array of quoted strings: `["200-299"]`, not `[200-299]` (parse error otherwise).
- Commit and push documentation changes in the same session as the infra change — don't leave docs lagging behind reality.
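
The `accepted_statuscodes` rule can be checked before sending a payload to Kuma. The validator below is a sketch with a hypothetical function name; it relies only on the standard `json` module, which rejects the unquoted form as invalid JSON.

```python
import json

def parse_status_codes(raw: str) -> list[str]:
    """Parse an accepted_statuscodes value, requiring a JSON list of strings."""
    codes = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(codes, list) or not all(isinstance(c, str) for c in codes):
        raise ValueError("accepted_statuscodes must be a JSON list of strings")
    return codes
```

`parse_status_codes('["200-299"]')` returns the list unchanged, while `'[200-299]'` fails at the JSON parsing step and `'[200]'` fails the string check.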

## LLM Services

- When working with LLM model deployments (Ollama, vLLM), always verify: 1) GPU access, 2) context length meets the consumer's requirements, 3) tool-calling support if needed.
- Ollama is at `http://192.168.0.145:31434` (Olares LAN NodePort), NOT localhost.
- HAMI vGPU on Olares causes ffmpeg segfaults. Do NOT request `nvidia.com/gpu` resources; use `runtimeClassName: nvidia` directly.
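
A quick reachability check against the Ollama endpoint above might look like the following sketch. It uses Ollama's `/api/tags` model listing; the function names are hypothetical. It only confirms that the endpoint answers and which models are present, so GPU access, context length, and tool-calling support still need their own checks.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.0.145:31434"  # Olares LAN NodePort from the note above

def model_names(tags_payload: dict) -> list[str]:
    """Pull model names out of an /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]

def fetch_model_names(base_url: str = OLLAMA_URL, timeout: int = 10) -> list[str]:
    """Query the Ollama endpoint and return the names of the models it serves."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling `fetch_model_names()` from the consumer's host, not from localhost, also confirms that the LAN route to the NodePort works.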

## Olares (K3s)

- Olares admission webhook blocks hostNetwork and reverts custom NetworkPolicies.
- Use Calico GlobalNetworkPolicy for LAN access (it can't be overridden by the webhook).
- The Olares proxy adds ~100ms latency — use direct LAN NodePorts for streaming/high-throughput services.
- Marketplace app patches (NFS mounts, GPU) are lost on app updates — re-apply after updates.

## Git & Commits

- Never add Co-Authored-By lines to git commits.
- If a `.secrets.baseline` file exists, always run `detect-secrets scan --baseline .secrets.baseline` before committing.
- Use `pragma: allowlist secret` comments for intentional secrets in private repo files.

## Documentation

- After completing each task, immediately update the relevant documentation in the repo and commit with a descriptive message before moving to the next task.
- Key docs: `docs/services/individual/dashboard.md`, `docs/services/individual/olares.md`, `scripts/README.md`.

## Portainer

- API uses `X-API-Key` header (NOT Bearer token).
- Portainer URL: `http://100.83.230.112:10000` (Tailscale IP).
- Endpoints: atlantis=2, calypso=443397, nuc=443398, homelab=443399, rpi5=443395.
- GitOps stacks use Gitea token for auth — if redeploy fails with "authentication required", credentials need re-entry in Portainer UI.
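
As an illustration of the header rule, the sketch below builds a request that lists containers on one endpoint through Portainer's Docker API proxy path. The URL and endpoint IDs come from the notes above; `containers_request` is a hypothetical helper, and the API key is a placeholder supplied by the caller.

```python
import urllib.request

PORTAINER_URL = "http://100.83.230.112:10000"
ENDPOINT_IDS = {
    "atlantis": 2,
    "calypso": 443397,
    "nuc": 443398,
    "homelab": 443399,
    "rpi5": 443395,
}

def containers_request(endpoint: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the container list on a named endpoint."""
    endpoint_id = ENDPOINT_IDS[endpoint]  # KeyError on an unknown endpoint name
    url = f"{PORTAINER_URL}/api/endpoints/{endpoint_id}/docker/containers/json"
    # Portainer expects the key in X-API-Key, not in an Authorization: Bearer header.
    return urllib.request.Request(url, headers={"X-API-Key": api_key})
```

Pass the result to `urllib.request.urlopen(req, timeout=30)` to send it.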

## Dashboard

- Dashboard runs at `http://homelab.tail.vish.gg:3100` (Next.js on port 3100, FastAPI API on port 18888).
- The API is proxied through Next.js rewrites: the frontend calls `/api/*`, which routes to `localhost:18888`.
- 16 glassmorphism themes with Exo 2 font.
- To rebuild: `cd dashboard/ui && rm -rf .next && BACKEND_URL=http://localhost:18888 npm run build && cp -r .next/static .next/standalone/.next/static && cp -r public .next/standalone/public`.