Sanitized mirror from private repository - 2026-04-20 01:32:01 UTC

2026-04-20 01:32:01 +00:00
commit e7652c8dab
1445 changed files with 364095 additions and 0 deletions
--- a/docs/services/individual/olares.md
+++ b/docs/services/individual/olares.md
@@ -0,0 +1,456 @@
+# Olares
+
+**Kubernetes Self-Hosting Platform**
+
+## Service Overview
+
+| Property | Value |
+|----------|-------|
+| **Host** | olares (192.168.0.145) |
+| **OS** | Ubuntu 24.04.3 LTS |
+| **Platform** | Olares (Kubernetes/K3s with Calico CNI) |
+| **Hardware** | Intel Core Ultra 9 275HX, 96GB DDR5, RTX 5090 Max-Q, 2TB NVMe |
+| **SSH** | `ssh olares` (key auth, user: olares) |
+
+## Purpose
+
+Olares is a Kubernetes-based self-hosting platform running on a high-end mini PC. It provides a managed app store for deploying containerized services with built-in auth (Authelia), networking (Envoy sidecars), and GPU scheduling (HAMI).
+
+Primary use case: **local LLM inference** via vLLM and Ollama, exposed as OpenAI-compatible API endpoints for coding agents (OpenCode, OpenClaw).
+
+## LLM Services
+
+Models are deployed via the Olares app store and served as OpenAI-compatible APIs. Each model gets a unique subdomain under `*.vishinator.olares.com`.
+
+### Available Models
+
+| Model | Backend | Namespace | Endpoint | Context | Notes |
+|-------|---------|-----------|----------|---------|-------|
+| Qwen3-Coder 30B | Ollama | `ollamaserver-shared` | `https://a5be22681.vishinator.olares.com/v1` | 65k tokens | MoE (3.3B active), coding-focused, currently active |
+| Qwen3 30B A3B (4-bit) | vLLM | `vllmqwen330ba3bv2server-shared` | `https://04521407.vishinator.olares.com/v1` | ~40k tokens | MoE, fast inference, limited tool calling |
+| Qwen3 30B A3B (4-bit) | vLLM | `vllmqwen330ba3binstruct4bitv2-vishinator` | — | ~40k tokens | Duplicate deployment (vishinator namespace) |
+| Qwen3.5 27B Q4_K_M | Ollama | `ollamaqwen3527bq4kmv2server-shared` | `https://37e62186.vishinator.olares.com/v1` | 40k+ (262k native) | Dense, best for agentic coding |
+| GPT-OSS 20B | vLLM | `vllmgptoss20bv2server-shared` | `https://6941bf89.vishinator.olares.com/v1` | 65k tokens | Requires auth bypass in Olares settings |
+| Qwen3.5 9B | Ollama | `ollamaqwen359bv2server-shared` | — | — | Lightweight, scaled to 0 |
+| Qwen3-30B-A3B AWQ 4-bit | vLLM | `vllm-qwen3:32b` | — (raw kubectl, no Olares URL) | 16k tokens | **Failed experiment** — context too small for agentic coding, scaled to 0. See opencode.md |
+
+### GPU Memory Constraints (RTX 5090 Max-Q, 24 GB VRAM)
+
+- Only run **one model at a time** to avoid VRAM exhaustion
+- `--gpu-memory-utilization 0.95` is the default for the vLLM apps here (upstream vLLM defaults to 0.9)
+- Context limits are determined by available KV cache after model loading
+- Use `nvidia-smi` for current VRAM usage and the vLLM startup logs for the actual KV cache capacity
+- Before starting a model, scale down all others (see Scaling Operations below)
+
+### Scaling Operations
+
+Only one model should be loaded at a time due to VRAM constraints. Use these commands to switch between models.
+
+**Check what's running:**
+```bash
+ssh olares "sudo kubectl get deployments -A | grep -iE 'vllm|ollama'"
+ssh olares "nvidia-smi --query-gpu=memory.used,memory.free --format=csv"
+```
+
+**Stop all LLM deployments (free GPU):**
+```bash
+# Qwen3-Coder (Ollama — currently active)
+ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=0"
+ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=0"
+
+# Qwen3 30B A3B vLLM (shared)
+ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=0"
+
+# Qwen3 30B A3B vLLM (vishinator)
+ssh olares "sudo kubectl scale deployment vllmqwen330ba3binstruct4bitv2 -n vllmqwen330ba3binstruct4bitv2-vishinator --replicas=0"
+
+# Qwen3.5 27B Ollama
+ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
+ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
+
+# GPT-OSS 20B vLLM
+ssh olares "sudo kubectl scale deployment vllm -n vllmgptoss20bv2server-shared --replicas=0"
+```
+
+**Start Qwen3-Coder (Ollama):**
+```bash
+ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=1"
+ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=1"
+```
+
+**Start Qwen3 30B A3B (vLLM):**
+```bash
+ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1"
+# Wait 2-3 minutes for vLLM startup, then check:
+ssh olares "sudo kubectl logs -n vllmqwen330ba3bv2server-shared -l io.kompose.service=vllm --tail=5"
+```
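+
+The wait-then-check step can also be scripted. A minimal sketch, assuming the endpoint serves the standard OpenAI-compatible `/v1/models` route and that auth bypass is already configured; `wait_for_llm` and `POLL_INTERVAL` are hypothetical names, not part of Olares or vLLM:
+
+```bash
+# Poll an OpenAI-compatible endpoint until it answers HTTP 200 or a timeout expires.
+# wait_for_llm and POLL_INTERVAL are illustrative helpers, not Olares or vLLM features.
+wait_for_llm() {
+  local base_url="$1"                    # e.g. https://04521407.vishinator.olares.com/v1
+  local timeout="${2:-300}"              # seconds to wait before giving up
+  local interval="${POLL_INTERVAL:-5}"   # seconds between probes
+  local waited=0 code
+
+  while :; do
+    # -w prints only the HTTP status; connection failures yield 000
+    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${base_url}/models" || true)"
+    if [ "$code" = "200" ]; then
+      echo "ready after ${waited}s"
+      return 0
+    fi
+    if [ "$waited" -ge "$timeout" ]; then
+      echo "timed out after ${waited}s (last HTTP status: ${code:-none})" >&2
+      return 1
+    fi
+    sleep "$interval"
+    waited=$((waited + interval))
+  done
+}
+```
+
+Example: `wait_for_llm https://04521407.vishinator.olares.com/v1 300`. A persistent `302` status suggests Authelia is still intercepting the request.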
+
+**Start Qwen3.5 27B (Ollama):**
+```bash
+ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
+ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
+```
+
+**Unload a model from Ollama (without scaling down the pod):**
+```bash
+ssh olares "sudo kubectl exec -n ollamaserver-shared \$(sudo kubectl get pods -n ollamaserver-shared -l io.kompose.service=ollama -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama stop qwen3:32b"
+```
+
+### vLLM max_model_len
+
+The `max_model_len` parameter is set in the deployment command args. To check the hardware-safe maximum, look at vLLM startup logs:
+
+```
+Available KV cache memory: X.XX GiB
+GPU KV cache size: XXXXX tokens
+```
+
+To change it, either:
+1. Edit in the **Olares app settings UI** (persistent across redeploys)
+2. Patch the deployment directly (resets on redeploy):
+   ```bash
+   kubectl get deployment vllm -n <namespace> -o json > /tmp/patch.json
+   # Edit max-model-len in the command string
+   kubectl apply -f /tmp/patch.json
+   ```
+
+## OpenClaw (Chat Agent)
+
+OpenClaw runs as a Kubernetes app in the `clawdbot-vishinator` namespace.
+
+### Configuration
+
+Config file inside the pod: `/home/node/.openclaw/openclaw.json`
+
+To read/write config:
+```bash
+ssh olares
+sudo kubectl exec -n clawdbot-vishinator <pod> -c clawdbot -- cat /home/node/.openclaw/openclaw.json
+```
+
+### Key Settings
+
+- **Compaction**: `mode: "safeguard"` with `maxHistoryShare: 0.5` prevents context overflow
+- **contextWindow**: Must match vLLM's actual `max_model_len` (not the model's native limit)
+- **Workspace data**: Lives at `/home/node/.openclaw/workspace/` inside the pod
+- **Brew packages**: OpenClaw has Homebrew; install tools with `brew install <pkg>` from the agent or pod
+
+### Troubleshooting
+
+| Error | Cause | Fix |
+|-------|-------|-----|
+| `localhost:8000 connection refused` | Model provider not configured or not running | Check model endpoint URL in config, verify vLLM pod is running |
+| `Context overflow` | Prompt exceeded model's context limit | Enable compaction, or `/reset` the session |
+| `pairing required` (WebSocket 1008) | Device pairing data was cleared | Reload the Control UI page to re-pair |
+| `does not support tools` (400) | Ollama model lacks tool calling template | Use vLLM with `--enable-auto-tool-choice` instead of Ollama |
+| `max_tokens must be at least 1, got negative` | Context window too small for system prompt + tools | Increase `max_model_len` (vLLM) or `num_ctx` (Ollama) |
+| `bad request` / 400 from Ollama | Request exceeds `num_ctx` | Increase `num_ctx` in Modelfile: `ollama create model -f Modelfile` |
+| 302 redirect on model endpoint | Olares auth (Authelia) blocking API access | Disable auth for the endpoint in Olares app settings |
+| vLLM server pod scaled to 0 | Previously stopped, client pod crashes | Scale up: `kubectl scale deployment vllm -n <namespace> --replicas=1` |
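+
+For the `num_ctx` rows above, a minimal Modelfile sketch. The base tag is taken from the model list; the `40960` value and the `qwen3.5-27b-40k` name are illustrative assumptions:
+
+```
+# Modelfile
+FROM qwen3.5:27b-q4_K_M
+PARAMETER num_ctx 40960
+```
+
+Build it inside the Ollama container with `ollama create qwen3.5-27b-40k -f Modelfile`, then point the client at the new model name.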
+
+## OpenCode Configuration
+
+OpenCode on the homelab VM and on moon is configured to use these endpoints.
+
+### Config Location
+
+- **homelab VM**: `~/.config/opencode/opencode.json`
+- **moon**: `~/.config/opencode/opencode.json` (user: moon)
+
+### Model Switching
+
+Change the `"model"` field in `opencode.json`:
+
+```json
+"model": "olares//models/qwen3-30b"
+```
+
+Available provider/model strings:
+- `olares//models/qwen3-30b` (recommended — supports tool calling via vLLM)
+- `olares-gptoss//models/gpt-oss-20b`
+- `olares-qwen35/qwen3.5:27b-q4_K_M` (Ollama — does NOT support tool calling, avoid for OpenCode)
+
+**Important**: OpenCode requires tool/function calling support. Ollama models often lack tool call templates, causing 400 errors. Use vLLM with `--enable-auto-tool-choice --tool-call-parser hermes` for reliable tool use.
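+
+The double slash in these strings is consistent with OpenCode's `provider/model` format when the model ID itself starts with a slash (vLLM commonly serves a model under its path, such as `/models/qwen3-30b`). A sketch of what the matching provider entry might look like; the `npm` package, display name, and `limit` values are assumptions, not copied from the actual config:
+
+```json
+{
+  "provider": {
+    "olares": {
+      "npm": "@ai-sdk/openai-compatible",
+      "options": { "baseURL": "https://04521407.vishinator.olares.com/v1" },
+      "models": {
+        "/models/qwen3-30b": {
+          "name": "Qwen3 30B A3B (vLLM)",
+          "limit": { "context": 40000, "output": 8192 }
+        }
+      }
+    }
+  },
+  "model": "olares//models/qwen3-30b"
+}
+```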
+
+### Loop Prevention
+
+```json
+"mode": {
+  "build": {
+    "steps": 25,
+    "permission": { "doom_loop": "deny" }
+  },
+  "plan": {
+    "steps": 15,
+    "permission": { "doom_loop": "deny" }
+  }
+}
+```
+
+## Storage — NFS Mount from Atlantis
+
+Olares has an NFS mount from Atlantis for persistent storage shared with the homelab:
+
+| Property | Value |
+|----------|-------|
+| **Mount point** | `/mnt/atlantis_olares_storage` |
+| **Source** | `192.168.0.200:/volume1/documents/olares_storage` |
+| **Access** | Read/write (`all_squash`, anonuid=1026/anongid=100) |
+| **Persistent** | Yes — configured in `/etc/fstab` |
+| **Capacity** | 84TB pool (46TB free as of 2026-03-16) |
+
+### fstab entry
+```
+192.168.0.200:/volume1/documents/olares_storage /mnt/atlantis_olares_storage nfs rw,async,hard,intr,rsize=8192,wsize=8192,timeo=14 0 0
+```
+
+### Mount/unmount manually
+```bash
+# Mount
+sudo mount /mnt/atlantis_olares_storage
+
+# Unmount
+sudo umount /mnt/atlantis_olares_storage
+
+# Check
+df -h /mnt/atlantis_olares_storage
+ls /mnt/atlantis_olares_storage
+```
+
+### Troubleshooting
+- If mount fails after reboot, check Atlantis is up and NFS is running: `sudo showmount -e 192.168.0.200`
+- Fail2ban on Olares may ban homelab-vm (`192.168.0.210`) — whitelist is `/etc/fail2ban/jail.d/local.conf` with `ignoreip = 127.0.0.1/8 ::1 192.168.0.0/24`
+- SSH to Olares uses key auth (`ssh olares` works from homelab-vm) — key installed 2026-03-16
+
+---
+
+## Built-in Services
+
+Olares runs its own infrastructure in Kubernetes:
+
+- **Headscale + Tailscale**: Internal mesh network (separate tailnet from homelab, IP 100.64.0.1)
+- **Authelia**: SSO/auth gateway for app endpoints
+- **Envoy**: Sidecar proxy for all apps
+- **HAMI**: GPU device scheduler for vLLM/Ollama pods
+- **Prometheus**: Metrics collection
+
+## Network
+
+| Interface | IP | Notes |
+|-----------|-----|-------|
+| LAN (enp129s0) | 192.168.0.145/24 | Primary access |
+| Tailscale (K8s pod) | 100.64.0.1 | Olares internal tailnet only |
+
+Note: The host does **not** have Tailscale installed directly. The K8s Tailscale pod uses `tailscale0` and conflicts with host-level tailscale (causes network outage if both run). Access via LAN only.
+
+## Media — Jellyfin
+
+Jellyfin is deployed from the Olares marketplace with manual patches for NFS media and GPU transcoding. See [jellyfin-olares.md](jellyfin-olares.md) for full details.
+
+| Property | Value |
+|----------|-------|
+| **Namespace** | `jellyfin-vishinator` |
+| **LAN Access** | `http://192.168.0.145:30096` |
+| **Olares Proxy** | `https://7e89d2a1.vishinator.olares.com` |
+| **Media Source** | Atlantis NFS → `/media/` (movies, tv, anime, music, audiobooks) |
+| **GPU Transcoding** | NVIDIA NVENC (AV1/HEVC/H264), tone mapping, hardware decode |
+
+**Important**: Use LAN URL for streaming (Olares proxy adds ~100ms latency per request, causes buffering).
+
+## Ollama LAN Access
+
+Ollama is exposed directly on LAN for services that can't authenticate through the Olares proxy (e.g., Gmail auto-labeler cron jobs).
+
+| Property | Value |
+|----------|-------|
+| **LAN URL** | `http://192.168.0.145:31434` |
+| **Olares Proxy** | `https://a5be22681.vishinator.olares.com` (requires auth) |
+| **Service** | `ollama-lan` in `ollamaserver-shared` namespace |
+| **Calico Policy** | `allow-lan-to-ollama` GlobalNetworkPolicy |
+
+## Tdarr Node (GPU Transcoding)
+
+Tdarr transcoding node using the RTX 5090 NVENC hardware encoder. Fastest node in the cluster.
+
+| Property | Value |
+|----------|-------|
+| **Namespace** | `tdarr-node` |
+| **Manifest** | `olares/tdarr-node.yaml` |
+| **Version** | 2.67.01 (pinned by digest) |
+| **Server** | Atlantis (192.168.0.200:8266) |
+| **GPU Encoders** | h264_nvenc, hevc_nvenc, av1_nvenc |
+| **Workers** | GPU=2, CPU=0, Health=1 |
+
+**NFS mounts:**
+- `/mnt/atlantis_media` — media library (read-write, Tdarr needs write access to replace transcoded files in place)
+- `/mnt/atlantis_cache` — transcoding cache (read-write, shared with all nodes)
+
+**NFS cache mount (`/mnt/atlantis_cache`):**
+- Source: `192.168.0.200:/volume1/data/tdarr_cache`
+- Added to `/etc/fstab` on Olares for persistence
+- Required `no_root_squash` on Atlantis NFS export for `192.168.0.145` (Olares containers run as root, default `root_squash` maps root to `nobody` causing permission denied on cache writes)
+
+**Deploy/redeploy:**
+```bash
+ssh olares "kubectl apply -f -" < olares/tdarr-node.yaml
+ssh olares "kubectl get pods -n tdarr-node"
+ssh olares "kubectl exec -n tdarr-node deploy/tdarr-node -- nvidia-smi"
+```
+
+**GPU contention:** NVENC uses dedicated hardware separate from CUDA cores. Tdarr + Ollama coexist fine. Tdarr + Jellyfin may compete for NVENC sessions (RTX 5090 supports up to 8 concurrent).
+
+### Troubleshooting: slow transcodes (CPU fallback)
+
+Symptom: Tdarr jobs running but GPU utilization is 0%; `ps -ef` inside the node pod shows `tdarr-ffmpeg ... -c:v libx265 ...` instead of `hevc_nvenc`.
+
+Two root causes to check in order:
+
+**1. Pod lost GPU runtime state**
+
+```bash
+ssh olares 'kubectl exec -n tdarr-node $(kubectl get pod -n tdarr-node -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi'
+```
+
+If you see `Failed to initialize NVML: Unknown Error` or a direct NVENC test fails with `CUDA_ERROR_NO_DEVICE`, the pod's GPU bindings are stale (usually after NVIDIA driver activity on the host while the pod kept running). **Fix:**
+
+```bash
+ssh olares 'kubectl delete pod -n tdarr-node $(kubectl get pod -n tdarr-node -o jsonpath="{.items[0].metadata.name}")'
+# Deployment recreates it with working GPU access
+```
+
+**2. Library plugin is Intel-QSV or CPU-only**
+
+The Tdarr library's plugin stack may be using an Intel-only encoder that falls back silently to libx265 on NVIDIA nodes. Avoid these plugins on Olares:
+
+- `Tdarr_Plugin_bsh1_Boosh_FFMPEG_QSV_HEVC` — requires Intel iGPU
+- `Tdarr_Plugin_MC93_Migz1FFMPEG_CPU` — explicitly CPU
+
+Use `Tdarr_Plugin_MC93_Migz1FFMPEG` (no suffix), which auto-detects NVIDIA and uses `hevc_nvenc`. Swap via the Tdarr UI (`http://192.168.0.200:8265/` → Libraries → per-library plugin stack) or directly via the server API. The following call reads a library's current plugin stack:
+
+```bash
+curl -sX POST http://192.168.0.200:8266/api/v2/cruddb \
+  -H "Content-Type: application/json" \
+  -d '{"data":{"collection":"LibrarySettingsJSONDB","mode":"getById","docID":"<LIB_ID>"}}' \
+  | jq '.pluginIDs[] | {id, checked, priority}'
+```
+
+Healthy state — the ffmpeg command should look like this (verify inside the pod with `kubectl exec ... -- ps -ef | grep tdarr-ffmpeg`):
+
+```
+tdarr-ffmpeg -y -hwaccel cuda -hwaccel_device 0 -i <input> \
+  -map 0:0 -c:0 hevc_nvenc -qp 20 -preset p5 -gpu 0 ...
+```
+
+And `nvidia-smi` on the host should show `utilization.encoder = 80-100%` per active worker, each using ~500 MiB GPU RAM.
+
+*Incident: 2026-04-19 — pod had 13d uptime with stale NVML state + library plugins were QSV/CPU. Fixed by pod restart + plugin swap; went from libx265 CPU grind to 6 parallel NVENC workers at 100% encoder util.*
+
+## Calico GlobalNetworkPolicies
+
+Olares auto-creates restrictive `app-np` NetworkPolicies per namespace that block LAN traffic. These cannot be modified (the admission webhook reverts changes) or supplemented (the webhook deletes custom policies). The solution is Calico GlobalNetworkPolicies, which operate at a level Olares can't override.
+
+Active policies:
+```bash
+kubectl get globalnetworkpolicy
+# allow-lan-to-jellyfin — 192.168.0.0/24 → app=jellyfin
+# allow-lan-to-ollama   — 192.168.0.0/24 → io.kompose.service=ollama
+# allow-lan-to-tdarr    — 192.168.0.0/24 ingress + all egress → app=tdarr-node
+```
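+
+For reference, a sketch of what a policy like `allow-lan-to-ollama` might look like, reconstructed from the summary above rather than copied from the cluster. The `apiVersion` is an assumption (`projectcalico.org/v3` needs `calicoctl` or the Calico API server), and the `order` value is illustrative:
+
+```yaml
+apiVersion: projectcalico.org/v3
+kind: GlobalNetworkPolicy
+metadata:
+  name: allow-lan-to-ollama
+spec:
+  order: 100                                  # illustrative; lower values are evaluated first
+  selector: io.kompose.service == 'ollama'    # pods labelled io.kompose.service=ollama
+  types:
+    - Ingress
+  ingress:
+    - action: Allow
+      source:
+        nets:
+          - 192.168.0.0/24                    # LAN clients only
+```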
+
+## NFS Media Mount from Atlantis
+
+| Property | Value |
+|----------|-------|
+| **Mount Point** | `/mnt/atlantis_media` |
+| **Source** | `192.168.0.200:/volume1/data/media` |
+| **Contents** | movies, tv, anime, music, audiobooks, ebooks, podcasts |
+| **Performance** | 180-420 MB/s sequential read |
+| **Persistent** | Yes — in `/etc/fstab` |
+
+```
+# /etc/fstab
+192.168.0.200:/volume1/data/media /mnt/atlantis_media nfs rw,async,hard,intr,rsize=131072,wsize=131072 0 0
+```
+
+## Known Issues
+
+- **Do NOT install host-level Tailscale** — it conflicts with the K8s Tailscale pod's `tailscale0` interface and causes total network loss requiring physical reboot
+- **Ollama Qwen3.5 27B lacks tool calling** — Ollama's model template doesn't support tools; use vLLM for coding agents
+- **Only run one model at a time** — running multiple vLLM instances exhausts 24GB VRAM; scale unused deployments to 0
+- **vLLM startup takes 2-3 minutes** — requests during startup return 502/connection refused; wait for "Application startup complete" in logs
+- **Olares auth (Authelia) blocks API endpoints by default** — new model endpoints need auth bypass configured in Olares app settings
+- **Raw kubectl deployments don't get Olares URLs** — apps deployed outside Studio/Market have no managed ingress (`*.vishinator.olares.com`). Use SSH tunnels or NodePort (if networking allows) as workarounds
+- **HAMI GPU scheduler requires Olares labels** — pods requesting GPU without `applications.app.bytetrade.io/name` label will fail to schedule with `cannot schedule pod without applications.app.bytetrade.io/name label`
+- **Never name a k8s service `vllm`** — Kubernetes auto-injects `VLLM_PORT` env var from service discovery, which conflicts with vLLM's own config. Use `vllm-server` or similar
+- **HAMI vGPU causes ffmpeg segfaults** — HAMI injects `libvgpu.so` via `/etc/ld.so.preload` which intercepts CUDA calls. This causes ffmpeg to crash (exit 139) during GPU transcoding. Fix: don't request `nvidia.com/gpu` resources, use `runtimeClassName: nvidia` directly
+- **Olares admission webhook blocks LAN access** — auto-created `app-np` NetworkPolicies can't be modified or supplemented. Use Calico GlobalNetworkPolicy for LAN access
+- **Olares proxy adds ~100ms latency** — direct LAN access via NodePort + GlobalNetworkPolicy is 88x faster; use for streaming/high-throughput services
+- **hostNetwork blocked** — Olares admission webhook rejects `hostNetwork: true` pods with "HostNetwork Enabled Unsupported"
+- **Marketplace app patches lost on update** — kubectl patches to marketplace apps (NFS mounts, GPU access) are overwritten when the app is updated. Re-apply after updates
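+
+To illustrate the `vllm` naming issue: Kubernetes injects Docker-link style variables such as `<NAME>_PORT=tcp://<cluster-ip>:<port>` for each Service in the namespace, so a Service named `vllm` produces a `VLLM_PORT` that vLLM then tries to read as its own port setting. A sketch of a safely named Service; the port and selector are assumptions based on the labels and port used elsewhere in this document:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm-server        # not "vllm": that name would inject VLLM_PORT=tcp://<cluster-ip>:8000
+  namespace: <namespace>
+spec:
+  selector:
+    io.kompose.service: vllm
+  ports:
+    - port: 8000           # assumed vLLM listen port
+      targetPort: 8000
+```
+
+Setting `enableServiceLinks: false` in the pod spec is another way to suppress these injected variables.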
+
+## Remote Management with k9s
+
+k9s and kubectl are installed on the homelab VM for managing Olares pods without SSH.
+
+### Setup
+
+| Component | Details |
+|-----------|---------|
+| **kubectl** | `/usr/local/bin/kubectl` (v1.35.2) |
+| **k9s** | `/usr/local/bin/k9s` (v0.50.18) |
+| **kubeconfig** | `~/.kube/config` → `https://192.168.0.145:6443` |
+| **Access** | Full admin (K3s default user), LAN only |
+
+The kubeconfig was copied from `/etc/rancher/k3s/k3s.yaml` on Olares with the server address changed from `127.0.0.1` to `192.168.0.145`.
+
+### Usage
+
+```bash
+# Launch k9s (interactive TUI)
+k9s
+
+# Filter by namespace
+k9s -n ollamaserver-shared
+
+# Quick kubectl checks
+kubectl get pods -A
+kubectl get deployments -A | grep -iE 'ollama|vllm'
+kubectl logs -n ollamaserver-shared -l io.kompose.service=ollama --tail=20
+kubectl scale deployment ollama -n ollamaserver-shared --replicas=0
+```
+
+### Limitations
+
+- **LAN only** — Olares has no host-level Tailscale, so k9s only works from the local network
+- **Metrics API not available** — `kubectl top` / k9s resource view won't work
+- **Kubeconfig rotation** — if Olares is reinstalled or K3s certs rotate, re-copy the kubeconfig:
+  ```bash
+  ssh olares "sudo cat /etc/rancher/k3s/k3s.yaml" | sed 's|https://127.0.0.1:6443|https://192.168.0.145:6443|' > ~/.kube/config
+  chmod 600 ~/.kube/config
+  ```
+
+## Dashboard Integration
+
+The [Homelab Dashboard](dashboard.md) monitors Olares via SSH:
+- **GPU status**: `nvidia-smi` queries displayed on Dashboard and Infrastructure pages
+- **K3s pods**: pod listing on Infrastructure page (`/api/olares/pods`)
+- **Jellyfin**: sessions and recently added items on Media page (via `kubectl exec` + curl to bypass Olares auth sidecar)
+- **Ollama**: availability check and AI chat widget (uses Ollama LAN endpoint at `192.168.0.145:31434`)
+- **Quick actions**: restart Jellyfin and Ollama deployments via `kubectl rollout restart`
+
+## Maintenance
+
+### Reboot
+```bash
+ssh olares 'sudo reboot'
+```
+Allow 3-5 minutes for K8s pods to come back up. Check with:
+```bash
+ssh olares 'sudo kubectl get pods -A | grep -v Running'
+```
+
+### Memory Management
+With 96 GB RAM, multiple models can load into system memory, but GPU VRAM is the bottleneck. Monitor with:
+```bash
+ssh olares 'free -h; nvidia-smi'
+```