
Olares

Kubernetes Self-Hosting Platform

Service Overview

| Property | Value |
|---|---|
| Host | olares (192.168.0.145) |
| OS | Ubuntu 24.04.3 LTS |
| Platform | Olares (Kubernetes/K3s with Calico CNI) |
| Hardware | Intel Core Ultra 9 275HX, 96GB DDR5, RTX 5090 Max-Q, 2TB NVMe |
| SSH | ssh olares (key auth, user: olares) |

Purpose

Olares is a Kubernetes-based self-hosting platform running on a high-end mini PC. It provides a managed app store for deploying containerized services with built-in auth (Authelia), networking (Envoy sidecars), and GPU scheduling (HAMI).

Primary use case: local LLM inference via vLLM and Ollama, exposed as OpenAI-compatible API endpoints for coding agents (OpenCode, OpenClaw).

LLM Services

Models are deployed via the Olares app store and served as OpenAI-compatible APIs. Each model gets a unique subdomain under *.vishinator.olares.com.

Available Models

| Model | Backend | Namespace | Endpoint | Context | Notes |
|---|---|---|---|---|---|
| Qwen3-Coder 30B | Ollama | ollamaserver-shared | https://a5be22681.vishinator.olares.com/v1 | 65k tokens | MoE (3.3B active), coding-focused, currently active |
| Qwen3 30B A3B (4-bit) | vLLM | vllmqwen330ba3bv2server-shared | https://04521407.vishinator.olares.com/v1 | ~40k tokens | MoE, fast inference, limited tool calling |
| Qwen3 30B A3B (4-bit) | vLLM | vllmqwen330ba3binstruct4bitv2-vishinator | — | ~40k tokens | Duplicate deployment (vishinator namespace) |
| Qwen3.5 27B Q4_K_M | Ollama | ollamaqwen3527bq4kmv2server-shared | https://37e62186.vishinator.olares.com/v1 | 40k+ (262k native) | Dense, best for agentic coding |
| GPT-OSS 20B | vLLM | vllmgptoss20bv2server-shared | https://6941bf89.vishinator.olares.com/v1 | 65k tokens | Requires auth bypass in Olares settings |
| Qwen3.5 9B | Ollama | ollamaqwen359bv2server-shared | — | — | Lightweight, scaled to 0 |
| Qwen3-30B-A3B AWQ 4-bit | vLLM | vllm-qwen3:32b | — (raw kubectl, no Olares URL) | 16k tokens | Failed experiment — context too small for agentic coding, scaled to 0. See opencode.md |
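Before pointing an agent at an endpoint, it helps to smoke-test it. OpenAI-compatible servers expose GET `<base>/models`; a 200 with a JSON model list means the backend is up. A minimal sketch (`base_url` and `check_llm` are local helpers, not part of any deployment):

```shell
# base_url normalizes a base URL (with or without trailing slash) and
# appends the models path; check_llm then fetches it.
base_url() {
  printf '%s/models' "${1%/}"
}

check_llm() {
  # Usage: check_llm https://04521407.vishinator.olares.com/v1
  curl -fsS "$(base_url "$1")"
}
```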

GPU Memory Constraints (RTX 5090 Max-Q, 24 GB VRAM)

  • Only run one model at a time to avoid VRAM exhaustion
  • vLLM --gpu-memory-utilization 0.95 is the default
  • Context limits are determined by available KV cache after model loading
  • Use nvidia-smi or check vLLM logs for actual KV cache capacity
  • Before starting a model, scale down all others (see Scaling Operations below)
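A minimal pre-flight sketch for the checklist above (the 20 GB threshold is an assumption, tune it per model):

```shell
# Pre-flight: refuse to scale a model up unless enough VRAM is free.
# vram_free_mib queries the host over SSH; vram_ok is a pure comparison.
vram_free_mib() {
  ssh olares "nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits" \
    | head -n1 | tr -d ' '
}

vram_ok() {
  # Usage: vram_ok <free_mib> <required_mib> — exit 0 iff enough is free
  [ "$1" -ge "$2" ]
}

# Example gate before a scale-up:
# vram_ok "$(vram_free_mib)" 20000 && echo "safe to start" || echo "scale others down first"
```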

Scaling Operations

Only one model should be loaded at a time due to VRAM constraints. Use these commands to switch between models.

Check what's running:

ssh olares "sudo kubectl get deployments -A | grep -iE 'vllm|ollama'"
ssh olares "nvidia-smi --query-gpu=memory.used,memory.free --format=csv"

Stop all LLM deployments (free GPU):

# Qwen3-Coder (Ollama — currently active)
ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=0"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=0"

# Qwen3 30B A3B vLLM (shared)
ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=0"

# Qwen3 30B A3B vLLM (vishinator)
ssh olares "sudo kubectl scale deployment vllmqwen330ba3binstruct4bitv2 -n vllmqwen330ba3binstruct4bitv2-vishinator --replicas=0"

# Qwen3.5 27B Ollama
ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=0"

# GPT-OSS 20B vLLM
ssh olares "sudo kubectl scale deployment vllm -n vllmgptoss20bv2server-shared --replicas=0"

Start Qwen3-Coder (Ollama):

ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=1"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=1"

Start Qwen3 30B A3B (vLLM):

ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1"
# Wait 2-3 minutes for vLLM startup, then check:
ssh olares "sudo kubectl logs -n vllmqwen330ba3bv2server-shared -l io.kompose.service=vllm --tail=5"

Start Qwen3.5 27B (Ollama):

ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=1"

Unload a model from Ollama (without scaling down the pod):

ssh olares "sudo kubectl exec -n ollamaserver-shared \$(sudo kubectl get pods -n ollamaserver-shared -l io.kompose.service=ollama -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama stop qwen3:32b"
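The scale-up commands above can be wrapped in a dry-run switcher that prints what it would do (namespaces are the ones from the model table; pipe the output to `ssh olares sudo bash` to actually execute, after scaling everything else to 0 first):

```shell
# switch_model: print the scale-up commands for one model (dry run).
switch_model() {
  case "$1" in
    qwen3-coder)
      echo "kubectl scale deployment ollama -n ollamaserver-shared --replicas=1"
      echo "kubectl scale deployment terminal -n ollamaserver-shared --replicas=1" ;;
    qwen3-30b-vllm)
      echo "kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1" ;;
    qwen35-27b)
      echo "kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
      echo "kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=1" ;;
    *)
      echo "unknown model: $1" >&2
      return 1 ;;
  esac
}
```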

vLLM max_model_len

The max_model_len parameter is set in the deployment command args. To check the hardware-safe maximum, look at vLLM startup logs:

Available KV cache memory: X.XX GiB
GPU KV cache size: XXXXX tokens
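A quick way to pull just those lines out of the startup log (label selector matches the one used in the scaling commands above):

```shell
ssh olares "sudo kubectl logs -n <namespace> -l io.kompose.service=vllm | grep -iE 'KV cache'"
```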

To change it, either:

  1. Edit in the Olares app settings UI (persistent across redeploys)
  2. Patch the deployment directly (resets on redeploy):
    kubectl get deployment vllm -n <namespace> -o json > /tmp/patch.json
    # Edit max-model-len in the command string
    kubectl apply -f /tmp/patch.json
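Option 2 can also be done as a one-shot pipe. This is a sketch; the exact flag spelling in the Olares-generated command string is an assumption, so verify it first with `kubectl get deployment vllm -n <namespace> -o json | grep -o 'max-model-len[^ "]*'`:

```shell
# bump_max_len: rewrite the max-model-len flag in the deployment JSON and
# re-apply. Like any direct patch, this resets on redeploy.
bump_max_len() {
  # Usage: bump_max_len <namespace> <tokens>
  kubectl get deployment vllm -n "$1" -o json \
    | sed "s/--max-model-len[= ][0-9]*/--max-model-len $2/" \
    | kubectl apply -f -
}
```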
    

OpenClaw (Chat Agent)

OpenClaw runs as a Kubernetes app in the clawdbot-vishinator namespace.

Configuration

Config file inside the pod: /home/node/.openclaw/openclaw.json

To read/write config:

ssh olares
sudo kubectl exec -n clawdbot-vishinator <pod> -c clawdbot -- cat /home/node/.openclaw/openclaw.json

Key Settings

  • Compaction: mode: "safeguard" with maxHistoryShare: 0.5 prevents context overflow
  • contextWindow: Must match vLLM's actual max_model_len (not the model's native limit)
  • Workspace data: Lives at /home/node/.openclaw/workspace/ inside the pod
  • Brew packages: OpenClaw has Homebrew; install tools with brew install <pkg> from the agent or pod
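Adjusting `contextWindow` can be done as a pure jq edit on the config JSON (a sketch; the key name is taken from the settings above, and applying it back means piping the result into the pod, e.g. via `kubectl exec -i ... -- tee /home/node/.openclaw/openclaw.json`):

```shell
# set_context_window: print the config JSON with contextWindow replaced.
set_context_window() {
  # Usage: set_context_window '<json>' <tokens>
  printf '%s' "$1" | jq --argjson n "$2" '.contextWindow = $n'
}
```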

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| localhost:8000 connection refused | Model provider not configured or not running | Check model endpoint URL in config, verify vLLM pod is running |
| Context overflow | Prompt exceeded model's context limit | Enable compaction, or /reset the session |
| pairing required (WebSocket 1008) | Device pairing data was cleared | Reload the Control UI page to re-pair |
| does not support tools (400) | Ollama model lacks tool calling template | Use vLLM with --enable-auto-tool-choice instead of Ollama |
| max_tokens must be at least 1, got negative | Context window too small for system prompt + tools | Increase max_model_len (vLLM) or num_ctx (Ollama) |
| bad request / 400 from Ollama | Request exceeds num_ctx | Increase num_ctx in Modelfile: ollama create model -f Modelfile |
| 302 redirect on model endpoint | Olares auth (Authelia) blocking API access | Disable auth for the endpoint in Olares app settings |
| vLLM server pod scaled to 0 | Previously stopped, client pod crashes | Scale up: kubectl scale deployment vllm -n <namespace> --replicas=1 |
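The num_ctx fix looks like this in practice (a sketch; the model tag and context size are illustrative, and the new tag name is made up here):

```shell
cat > Modelfile <<'EOF'
FROM qwen3.5:27b-q4_K_M
PARAMETER num_ctx 65536
EOF
ollama create qwen35-27b-64k -f Modelfile
```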

OpenCode Configuration

OpenCode on both the homelab VM and moon is configured to use these endpoints.

Config Location

  • homelab VM: ~/.config/opencode/opencode.json
  • moon: ~/.config/opencode/opencode.json (user: moon)

Model Switching

Change the "model" field in opencode.json:

"model": "olares//models/qwen3-30b"

Available provider/model strings:

  • olares//models/qwen3-30b (recommended — supports tool calling via vLLM)
  • olares-gptoss//models/gpt-oss-20b
  • olares-qwen35/qwen3.5:27b-q4_K_M (Ollama — does NOT support tool calling, avoid for OpenCode)

Important: OpenCode requires tool/function calling support. Ollama models often lack tool call templates, causing 400 errors. Use vLLM with --enable-auto-tool-choice --tool-call-parser hermes for reliable tool use.
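For reference, the shape of a vLLM launch with those flags (a sketch only; the model path and context length are placeholders, not the exact Olares deployment args):

```shell
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```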

Loop Prevention

"mode": {
  "build": {
    "steps": 25,
    "permission": { "doom_loop": "deny" }
  },
  "plan": {
    "steps": 15,
    "permission": { "doom_loop": "deny" }
  }
}

Storage — NFS Mount from Atlantis

Olares has an NFS mount from Atlantis for persistent storage shared with the homelab:

| Property | Value |
|---|---|
| Mount point | /mnt/atlantis_olares_storage |
| Source | 192.168.0.200:/volume1/documents/olares_storage |
| Access | Read/write (all_squash, anonuid=1026/anongid=100) |
| Persistent | Yes — configured in /etc/fstab |
| Capacity | 84TB pool (46TB free as of 2026-03-16) |

fstab entry

192.168.0.200:/volume1/documents/olares_storage /mnt/atlantis_olares_storage nfs rw,async,hard,intr,rsize=8192,wsize=8192,timeo=14 0 0

Mount/unmount manually

# Mount
sudo mount /mnt/atlantis_olares_storage

# Unmount
sudo umount /mnt/atlantis_olares_storage

# Check
df -h /mnt/atlantis_olares_storage
ls /mnt/atlantis_olares_storage

Troubleshooting

  • If mount fails after reboot, check Atlantis is up and NFS is running: sudo showmount -e 192.168.0.200
  • Fail2ban on Olares may ban homelab-vm (192.168.0.210) — whitelist is /etc/fail2ban/jail.d/local.conf with ignoreip = 127.0.0.1/8 ::1 192.168.0.0/24
  • SSH to Olares uses key auth (ssh olares works from homelab-vm) — key installed 2026-03-16

Built-in Services

Olares runs its own infrastructure in Kubernetes:

  • Headscale + Tailscale: Internal mesh network (separate tailnet from homelab, IP 100.64.0.1)
  • Authelia: SSO/auth gateway for app endpoints
  • Envoy: Sidecar proxy for all apps
  • HAMI: GPU device scheduler for vLLM/Ollama pods
  • Prometheus: Metrics collection

Network

| Interface | IP | Notes |
|---|---|---|
| LAN (enp129s0) | 192.168.0.145/24 | Primary access |
| Tailscale (K8s pod) | 100.64.0.1 | Olares internal tailnet only |

Note: The host does not have Tailscale installed directly. The K8s Tailscale pod uses tailscale0 and conflicts with host-level tailscale (causes network outage if both run). Access via LAN only.

Media — Jellyfin

Jellyfin is deployed from the Olares marketplace with manual patches for NFS media and GPU transcoding. See jellyfin-olares.md for full details.

| Property | Value |
|---|---|
| Namespace | jellyfin-vishinator |
| LAN Access | http://192.168.0.145:30096 |
| Olares Proxy | https://7e89d2a1.vishinator.olares.com |
| Media Source | Atlantis NFS → /media/ (movies, tv, anime, music, audiobooks) |
| GPU Transcoding | NVIDIA NVENC (AV1/HEVC/H264), tone mapping, hardware decode |

Important: Use LAN URL for streaming (Olares proxy adds ~100ms latency per request, causes buffering).

Ollama LAN Access

Ollama is exposed directly on LAN for services that can't authenticate through the Olares proxy (e.g., Gmail auto-labeler cron jobs).

| Property | Value |
|---|---|
| LAN URL | http://192.168.0.145:31434 |
| Olares Proxy | https://a5be22681.vishinator.olares.com (requires auth) |
| Service | ollama-lan in ollamaserver-shared namespace |
| Calico Policy | allow-lan-to-ollama GlobalNetworkPolicy |

Tdarr Node (GPU Transcoding)

Tdarr transcoding node using the RTX 5090 NVENC hardware encoder. Fastest node in the cluster.

| Property | Value |
|---|---|
| Namespace | tdarr-node |
| Manifest | olares/tdarr-node.yaml |
| Version | 2.67.01 (pinned by digest) |
| Server | Atlantis (192.168.0.200:8266) |
| GPU Encoders | h264_nvenc, hevc_nvenc, av1_nvenc |
| Workers | GPU=2, CPU=0, Health=1 |

NFS mounts:

  • /mnt/atlantis_media — media library (read-write, Tdarr needs write access to replace transcoded files in place)
  • /mnt/atlantis_cache — transcoding cache (read-write, shared with all nodes)

NFS cache mount (/mnt/atlantis_cache):

  • Source: 192.168.0.200:/volume1/data/tdarr_cache
  • Added to /etc/fstab on Olares for persistence
  • Required no_root_squash on the Atlantis NFS export for 192.168.0.145 (Olares containers run as root; the default root_squash maps root to nobody, causing permission-denied errors on cache writes)

Deploy/redeploy:

ssh olares "kubectl apply -f -" < olares/tdarr-node.yaml
ssh olares "kubectl get pods -n tdarr-node"
ssh olares "kubectl exec -n tdarr-node deploy/tdarr-node -- nvidia-smi"

GPU contention: NVENC uses dedicated hardware separate from CUDA cores. Tdarr + Ollama coexist fine. Tdarr + Jellyfin may compete for NVENC sessions (RTX 5090 supports up to 8 concurrent).

Troubleshooting: slow transcodes (CPU fallback)

Symptom: Tdarr jobs running but GPU utilization is 0%; ps -ef inside the node pod shows tdarr-ffmpeg ... -c:v libx265 ... instead of hevc_nvenc.

Two root causes to check in order:

1. Pod lost GPU runtime state

ssh olares 'kubectl exec -n tdarr-node $(kubectl get pod -n tdarr-node -o jsonpath="{.items[0].metadata.name}") -- nvidia-smi'

If you see Failed to initialize NVML: Unknown Error, or a direct NVENC test fails with CUDA_ERROR_NO_DEVICE, the pod's GPU bindings are stale (typically after host NVIDIA driver activity while the pod kept running). Fix:

ssh olares 'kubectl delete pod -n tdarr-node $(kubectl get pod -n tdarr-node -o jsonpath="{.items[0].metadata.name}")'
# Deployment recreates it with working GPU access

2. Library plugin is Intel-QSV or CPU-only

The Tdarr library's plugin stack may be using an Intel-only encoder that falls back silently to libx265 on NVIDIA nodes. Avoid these plugins on Olares:

  • Tdarr_Plugin_bsh1_Boosh_FFMPEG_QSV_HEVC — requires Intel iGPU
  • Tdarr_Plugin_MC93_Migz1FFMPEG_CPU — explicitly CPU

Use Tdarr_Plugin_MC93_Migz1FFMPEG (no suffix) — it auto-detects NVIDIA and uses hevc_nvenc. Swap via the Tdarr UI (http://192.168.0.200:8265/ → Libraries → per-library plugin stack) or directly via the server API:

curl -sX POST http://192.168.0.200:8266/api/v2/cruddb \
  -H "Content-Type: application/json" \
  -d '{"data":{"collection":"LibrarySettingsJSONDB","mode":"getById","docID":"<LIB_ID>"}}' \
  | jq '.pluginIDs[] | {id, checked, priority}'

Healthy state — the ffmpeg command should look like this (verify inside the pod with kubectl exec ... -- ps -ef | grep tdarr-ffmpeg):

tdarr-ffmpeg -y -hwaccel cuda -hwaccel_device 0 -i <input> \
  -map 0:0 -c:0 hevc_nvenc -qp 20 -preset p5 -gpu 0 ...

And nvidia-smi on the host should show utilization.encoder = 80-100% per active worker, each using ~500 MiB GPU RAM.

Incident: 2026-04-19 — pod had 13d uptime with stale NVML state + library plugins were QSV/CPU. Fixed by pod restart + plugin swap; went from libx265 CPU grind to 6 parallel NVENC workers at 100% encoder util.

Calico GlobalNetworkPolicies

Olares auto-creates restrictive app-np NetworkPolicies per namespace that block LAN traffic. These cannot be modified (the admission webhook reverts changes) or supplemented (the webhook deletes custom policies). The solution is Calico GlobalNetworkPolicies, which operate at a layer Olares cannot override.

Active policies:

kubectl get globalnetworkpolicy
# allow-lan-to-jellyfin — 192.168.0.0/24 → app=jellyfin
# allow-lan-to-ollama   — 192.168.0.0/24 → io.kompose.service=ollama
# allow-lan-to-tdarr    — 192.168.0.0/24 ingress + all egress → app=tdarr-node
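A sketch of what one of these looks like, applied via heredoc (the selector, protocol, and CIDR here are illustrative; compare against the live object with `kubectl get globalnetworkpolicy allow-lan-to-ollama -o yaml`):

```shell
kubectl apply -f - <<'EOF'
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-lan-to-ollama
spec:
  selector: io.kompose.service == 'ollama'
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        nets:
          - 192.168.0.0/24
EOF
```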

NFS Media Mount from Atlantis

| Property | Value |
|---|---|
| Mount Point | /mnt/atlantis_media |
| Source | 192.168.0.200:/volume1/data/media |
| Contents | movies, tv, anime, music, audiobooks, ebooks, podcasts |
| Performance | 180-420 MB/s sequential read |
| Persistent | Yes — in /etc/fstab |
# /etc/fstab
192.168.0.200:/volume1/data/media /mnt/atlantis_media nfs rw,async,hard,intr,rsize=131072,wsize=131072 0 0

Known Issues

  • Do NOT install host-level Tailscale — it conflicts with the K8s Tailscale pod's tailscale0 interface and causes total network loss requiring physical reboot
  • Ollama Qwen3.5 27B lacks tool calling — Ollama's model template doesn't support tools; use vLLM for coding agents
  • Only run one model at a time — running multiple vLLM instances exhausts 24GB VRAM; scale unused deployments to 0
  • vLLM startup takes 2-3 minutes — requests during startup return 502/connection refused; wait for "Application startup complete" in logs
  • Olares auth (Authelia) blocks API endpoints by default — new model endpoints need auth bypass configured in Olares app settings
  • Raw kubectl deployments don't get Olares URLs — apps deployed outside Studio/Market have no managed ingress (*.vishinator.olares.com). Use SSH tunnels or NodePort (if networking allows) as workarounds
  • HAMI GPU scheduler requires Olares labels — pods requesting GPU without applications.app.bytetrade.io/name label will fail to schedule with cannot schedule pod without applications.app.bytetrade.io/name label
  • Never name a k8s service vllm — Kubernetes auto-injects VLLM_PORT env var from service discovery, which conflicts with vLLM's own config. Use vllm-server or similar
  • HAMI vGPU causes ffmpeg segfaults — HAMI injects libvgpu.so via /etc/ld.so.preload which intercepts CUDA calls. This causes ffmpeg to crash (exit 139) during GPU transcoding. Fix: don't request nvidia.com/gpu resources, use runtimeClassName: nvidia directly
  • Olares admission webhook blocks LAN access — auto-created app-np NetworkPolicies can't be modified or supplemented. Use Calico GlobalNetworkPolicy for LAN access
  • Olares proxy adds ~100ms latency — direct LAN access via NodePort + GlobalNetworkPolicy is 88x faster; use for streaming/high-throughput services
  • hostNetwork blocked — Olares admission webhook rejects hostNetwork: true pods with "HostNetwork Enabled Unsupported"
  • Marketplace app patches lost on update — kubectl patches to marketplace apps (NFS mounts, GPU access) are overwritten when the app is updated. Re-apply after updates

Remote Management with k9s

k9s and kubectl are installed on the homelab VM for managing Olares pods without SSH.

Setup

| Component | Details |
|---|---|
| kubectl | /usr/local/bin/kubectl (v1.35.2) |
| k9s | /usr/local/bin/k9s (v0.50.18) |
| kubeconfig | ~/.kube/config (server: https://192.168.0.145:6443) |
| Access | Full admin (K3s default user), LAN only |

The kubeconfig was copied from /etc/rancher/k3s/k3s.yaml on Olares with the server address changed from 127.0.0.1 to 192.168.0.145.

Usage

# Launch k9s (interactive TUI)
k9s

# Filter by namespace
k9s -n ollamaserver-shared

# Quick kubectl checks
kubectl get pods -A
kubectl get deployments -A | grep -iE 'ollama|vllm'
kubectl logs -n ollamaserver-shared -l io.kompose.service=ollama --tail=20
kubectl scale deployment ollama -n ollamaserver-shared --replicas=0

Limitations

  • LAN only — Olares has no host-level Tailscale, so k9s only works from the local network
  • Metrics API not available — kubectl top / k9s resource view won't work
  • Kubeconfig rotation — if Olares is reinstalled or K3s certs rotate, re-copy the kubeconfig:
    ssh olares "sudo cat /etc/rancher/k3s/k3s.yaml" | sed 's|https://127.0.0.1:6443|https://192.168.0.145:6443|' > ~/.kube/config
    chmod 600 ~/.kube/config
    

Dashboard Integration

The Homelab Dashboard monitors Olares via SSH:

  • GPU status: nvidia-smi queries displayed on Dashboard and Infrastructure pages
  • K3s pods: pod listing on Infrastructure page (/api/olares/pods)
  • Jellyfin: sessions and recently added items on Media page (via kubectl exec + curl to bypass Olares auth sidecar)
  • Ollama: availability check and AI chat widget (uses Ollama LAN endpoint at 192.168.0.145:31434)
  • Quick actions: restart Jellyfin and Ollama deployments via kubectl rollout restart

Maintenance

Reboot

ssh olares 'sudo reboot'

Allow 3-5 minutes for K8s pods to come back up. Check with:

ssh olares 'sudo kubectl get pods -A | grep -v Running'

Memory Management

With 96 GB RAM, multiple models can load into system memory but GPU VRAM is the bottleneck. Monitor with:

ssh olares 'free -h; nvidia-smi'