Olares
Kubernetes Self-Hosting Platform
Service Overview
| Property | Value |
|---|---|
| Host | olares (192.168.0.145) |
| OS | Ubuntu 24.04.3 LTS |
| Platform | Olares (Kubernetes/K3s with Calico CNI) |
| Hardware | Intel Core Ultra 9 275HX, 96GB DDR5, RTX 5090 Max-Q, 2TB NVMe |
| SSH | ssh olares (key auth, user: olares) |
Purpose
Olares is a Kubernetes-based self-hosting platform running on a high-end mini PC. It provides a managed app store for deploying containerized services with built-in auth (Authelia), networking (Envoy sidecars), and GPU scheduling (HAMI).
Primary use case: local LLM inference via vLLM and Ollama, exposed as OpenAI-compatible API endpoints for coding agents (OpenCode, OpenClaw).
LLM Services
Models are deployed via the Olares app store and served as OpenAI-compatible APIs. Each model gets a unique subdomain under *.vishinator.olares.com.
Available Models
| Model | Backend | Namespace | Endpoint | Context | Notes |
|---|---|---|---|---|---|
| Qwen3-Coder 30B | Ollama | ollamaserver-shared | https://a5be22681.vishinator.olares.com/v1 | 65k tokens | MoE (3.3B active), coding-focused, currently active |
| Qwen3 30B A3B (4-bit) | vLLM | vllmqwen330ba3bv2server-shared | https://04521407.vishinator.olares.com/v1 | ~40k tokens | MoE, fast inference, limited tool calling |
| Qwen3 30B A3B (4-bit) | vLLM | vllmqwen330ba3binstruct4bitv2-vishinator | — | ~40k tokens | Duplicate deployment (vishinator namespace) |
| Qwen3.5 27B Q4_K_M | Ollama | ollamaqwen3527bq4kmv2server-shared | https://37e62186.vishinator.olares.com/v1 | 40k+ (262k native) | Dense, best for agentic coding |
| GPT-OSS 20B | vLLM | vllmgptoss20bv2server-shared | https://6941bf89.vishinator.olares.com/v1 | 65k tokens | Requires auth bypass in Olares settings |
| Qwen3.5 9B | Ollama | ollamaqwen359bv2server-shared | — | — | Lightweight, scaled to 0 |
| Qwen3-30B-A3B AWQ 4-bit | vLLM | vllm-qwen3:32b | — (raw kubectl, no Olares URL) | 16k tokens | Failed experiment; context too small for agentic coding, scaled to 0. See opencode.md |
GPU Memory Constraints (RTX 5090 Max-Q, 24 GB VRAM)
- Only run one model at a time to avoid VRAM exhaustion
- vLLM's `--gpu-memory-utilization 0.95` is the default
- Context limits are determined by available KV cache after model loading
- Use `nvidia-smi` or check vLLM logs for actual KV cache capacity
- Before starting a model, scale down all others (see Scaling Operations below)
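The free-VRAM check can be scripted as a guard before scaling a model up. A minimal sketch; the `vram_free_mib` and `can_start_model` helper names and the 20 GiB threshold are illustrative, not part of the deployment:

```shell
#!/usr/bin/env bash
# Illustrative guard: only start a model when enough VRAM is free.
# Helper names and the threshold below are hypothetical.

vram_free_mib() {
  # Queries the GPU over SSH; prints free VRAM in MiB.
  ssh olares "nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits" | head -1 | tr -d ' '
}

can_start_model() {
  # Usage: can_start_model <required_mib> [free_mib]
  # The optional second argument allows testing without a GPU.
  local need="$1" free="${2:-$(vram_free_mib)}"
  [ "$free" -ge "$need" ]
}

# A 4-bit 30B MoE plus its KV cache fills most of the 24 GB card,
# so require roughly 20 GiB free before scaling a deployment up:
# can_start_model 20480 && ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1"
```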
Scaling Operations
Only one model should be loaded at a time due to VRAM constraints. Use these commands to switch between models.
Check what's running:
```shell
ssh olares "sudo kubectl get deployments -A | grep -iE 'vllm|ollama'"
ssh olares "nvidia-smi --query-gpu=memory.used,memory.free --format=csv"
```
Stop all LLM deployments (free GPU):
```shell
# Qwen3-Coder (Ollama — currently active)
ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=0"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=0"
# Qwen3 30B A3B vLLM (shared)
ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=0"
# Qwen3 30B A3B vLLM (vishinator)
ssh olares "sudo kubectl scale deployment vllmqwen330ba3binstruct4bitv2 -n vllmqwen330ba3binstruct4bitv2-vishinator --replicas=0"
# Qwen3.5 27B Ollama
ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
# GPT-OSS 20B vLLM
ssh olares "sudo kubectl scale deployment vllm -n vllmgptoss20bv2server-shared --replicas=0"
```
Start Qwen3-Coder (Ollama):
```shell
ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=1"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=1"
```
Start Qwen3 30B A3B (vLLM):
```shell
ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1"
# Wait 2-3 minutes for vLLM startup, then check:
ssh olares "sudo kubectl logs -n vllmqwen330ba3bv2server-shared -l io.kompose.service=vllm --tail=5"
```
Start Qwen3.5 27B (Ollama):
```shell
ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
```
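Switching models is mechanical enough to wrap in a script. A sketch under these assumptions: the `run`/`stop_all`/`start_model` names and the `DRY_RUN` flag are hypothetical conveniences, while the deployment and namespace names come from the commands documented above.

```shell
#!/usr/bin/env bash
# Hypothetical switch-model helper: stop every LLM deployment, then start one.
set -u

run() {
  # With DRY_RUN=1, print the command instead of executing it over SSH.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    ssh olares "sudo $*"
  fi
}

stop_all() {
  run kubectl scale deployment ollama -n ollamaserver-shared --replicas=0
  run kubectl scale deployment terminal -n ollamaserver-shared --replicas=0
  run kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=0
  run kubectl scale deployment vllmqwen330ba3binstruct4bitv2 -n vllmqwen330ba3binstruct4bitv2-vishinator --replicas=0
  run kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=0
  run kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=0
  run kubectl scale deployment vllm -n vllmgptoss20bv2server-shared --replicas=0
}

start_model() {
  case "${1:-}" in
    qwen3-coder)
      run kubectl scale deployment ollama -n ollamaserver-shared --replicas=1
      run kubectl scale deployment terminal -n ollamaserver-shared --replicas=1
      ;;
    qwen3-30b)
      run kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1
      ;;
    qwen35-27b)
      run kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=1
      run kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=1
      ;;
    *)
      echo "usage: start_model {qwen3-coder|qwen3-30b|qwen35-27b}" >&2
      return 1
      ;;
  esac
}

# Example: preview a switch without touching the cluster.
# DRY_RUN=1 stop_all && DRY_RUN=1 start_model qwen3-30b
```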
Unload a model from Ollama (without scaling down the pod):
```shell
ssh olares "sudo kubectl exec -n ollamaserver-shared \$(sudo kubectl get pods -n ollamaserver-shared -l io.kompose.service=ollama -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama stop qwen3:32b"
```
vLLM max_model_len
The `max_model_len` parameter is set in the deployment command args. To check the hardware-safe maximum, look at the vLLM startup logs:

```
Available KV cache memory: X.XX GiB
GPU KV cache size: XXXXX tokens
```
To change it, either:
- Edit in the Olares app settings UI (persistent across redeploys)
- Patch the deployment directly (resets on redeploy):

```shell
kubectl get deployment vllm -n <namespace> -o json > /tmp/patch.json
# Edit max-model-len in the command string
kubectl apply -f /tmp/patch.json
```
OpenClaw (Chat Agent)
OpenClaw runs as a Kubernetes app in the clawdbot-vishinator namespace.
Configuration
Config file inside the pod: `/home/node/.openclaw/openclaw.json`
To read/write config:

```shell
ssh olares
sudo kubectl exec -n clawdbot-vishinator <pod> -c clawdbot -- cat /home/node/.openclaw/openclaw.json
```
Key Settings
- Compaction: `mode: "safeguard"` with `maxHistoryShare: 0.5` prevents context overflow
- `contextWindow`: must match vLLM's actual `max_model_len` (not the model's native limit)
- Workspace data: lives at `/home/node/.openclaw/workspace/` inside the pod
- Brew packages: OpenClaw has Homebrew; install tools with `brew install <pkg>` from the agent or pod
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `localhost:8000 connection refused` | Model provider not configured or not running | Check model endpoint URL in config, verify vLLM pod is running |
| Context overflow | Prompt exceeded model's context limit | Enable compaction, or `/reset` the session |
| `pairing required` (WebSocket 1008) | Device pairing data was cleared | Reload the Control UI page to re-pair |
| `does not support tools` (400) | Ollama model lacks tool calling template | Use vLLM with `--enable-auto-tool-choice` instead of Ollama |
| `max_tokens must be at least 1, got negative` | Context window too small for system prompt + tools | Increase `max_model_len` (vLLM) or `num_ctx` (Ollama) |
| bad request / 400 from Ollama | Request exceeds `num_ctx` | Increase `num_ctx` in Modelfile: `ollama create model -f Modelfile` |
| 302 redirect on model endpoint | Olares auth (Authelia) blocking API access | Disable auth for the endpoint in Olares app settings |
| vLLM server pod scaled to 0 | Previously stopped, client pod crashes | Scale up: `kubectl scale deployment vllm -n <namespace> --replicas=1` |
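For the `num_ctx` fix, the Modelfile only needs a FROM line and the parameter. A sketch assuming the Qwen3.5 27B tag used elsewhere in this document; the 40960 value and the new tag name are examples:

```
FROM qwen3.5:27b-q4_K_M
PARAMETER num_ctx 40960
```

Then rebuild under a new tag with `ollama create qwen3.5:27b-40k -f Modelfile` and point clients at that tag.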
OpenCode Configuration
The OpenCode installs on the homelab VM and on moon are configured to use these endpoints.
Config Location
- homelab VM: `~/.config/opencode/opencode.json`
- moon: `~/.config/opencode/opencode.json` (user: moon)
Model Switching
Change the `"model"` field in `opencode.json`:

```json
"model": "olares//models/qwen3-30b"
```
Available provider/model strings:
- `olares//models/qwen3-30b` (recommended; supports tool calling via vLLM)
- `olares-gptoss//models/gpt-oss-20b`
- `olares-qwen35/qwen3.5:27b-q4_K_M` (Ollama; does NOT support tool calling, avoid for OpenCode)
Important: OpenCode requires tool/function calling support. Ollama models often lack tool call templates, causing 400 errors. Use vLLM with `--enable-auto-tool-choice --tool-call-parser hermes` for reliable tool use.
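For reference, the container args on a vLLM deployment with tool calling enabled would look roughly like this. The model path is a placeholder; the tool-calling flags are the ones named above and the rest mirror the defaults described earlier, so treat this as a sketch rather than the exact deployment spec:

```yaml
# Illustrative vLLM args fragment
args:
  - "--model=/models/qwen3-30b"        # placeholder path
  - "--max-model-len=40960"            # matches the ~40k context above
  - "--gpu-memory-utilization=0.95"    # the default noted earlier
  - "--enable-auto-tool-choice"
  - "--tool-call-parser=hermes"
```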
Loop Prevention
```json
"mode": {
  "build": {
    "steps": 25,
    "permission": { "doom_loop": "deny" }
  },
  "plan": {
    "steps": 15,
    "permission": { "doom_loop": "deny" }
  }
}
```
Storage — NFS Mount from Atlantis
Olares has an NFS mount from Atlantis for persistent storage shared with the homelab:
| Property | Value |
|---|---|
| Mount point | /mnt/atlantis_olares_storage |
| Source | 192.168.0.200:/volume1/documents/olares_storage |
| Access | Read/write (all_squash, anonuid=1026/anongid=100) |
| Persistent | Yes — configured in /etc/fstab |
| Capacity | 84TB pool (46TB free as of 2026-03-16) |
fstab entry
```
192.168.0.200:/volume1/documents/olares_storage /mnt/atlantis_olares_storage nfs rw,async,hard,intr,rsize=8192,wsize=8192,timeo=14 0 0
```
Mount/unmount manually
```shell
# Mount
sudo mount /mnt/atlantis_olares_storage
# Unmount
sudo umount /mnt/atlantis_olares_storage
# Check
df -h /mnt/atlantis_olares_storage
ls /mnt/atlantis_olares_storage
```
Troubleshooting
- If the mount fails after reboot, check that Atlantis is up and NFS is running: `sudo showmount -e 192.168.0.200`
- Fail2ban on Olares may ban homelab-vm (`192.168.0.210`); the whitelist is `/etc/fail2ban/jail.d/local.conf` with `ignoreip = 127.0.0.1/8 ::1 192.168.0.0/24`
- SSH to Olares uses key auth (`ssh olares` works from homelab-vm); key installed 2026-03-16
Built-in Services
Olares runs its own infrastructure in Kubernetes:
- Headscale + Tailscale: Internal mesh network (separate tailnet from homelab, IP 100.64.0.1)
- Authelia: SSO/auth gateway for app endpoints
- Envoy: Sidecar proxy for all apps
- HAMI: GPU device scheduler for vLLM/Ollama pods
- Prometheus: Metrics collection
Network
| Interface | IP | Notes |
|---|---|---|
| LAN (enp129s0) | 192.168.0.145/24 | Primary access |
| Tailscale (K8s pod) | 100.64.0.1 | Olares internal tailnet only |
Note: The host does not have Tailscale installed directly. The K8s Tailscale pod uses tailscale0 and conflicts with host-level tailscale (causes network outage if both run). Access via LAN only.
Media — Jellyfin
Jellyfin is deployed from the Olares marketplace with manual patches for NFS media and GPU transcoding. See jellyfin-olares.md for full details.
| Property | Value |
|---|---|
| Namespace | jellyfin-vishinator |
| LAN Access | http://192.168.0.145:30096 |
| Olares Proxy | https://7e89d2a1.vishinator.olares.com |
| Media Source | Atlantis NFS → /media/ (movies, tv, anime, music, audiobooks) |
| GPU Transcoding | NVIDIA NVENC (AV1/HEVC/H264), tone mapping, hardware decode |
Important: Use LAN URL for streaming (Olares proxy adds ~100ms latency per request, causes buffering).
Ollama LAN Access
Ollama is exposed directly on LAN for services that can't authenticate through the Olares proxy (e.g., Gmail auto-labeler cron jobs).
| Property | Value |
|---|---|
| LAN URL | http://192.168.0.145:31434 |
| Olares Proxy | https://a5be22681.vishinator.olares.com (requires auth) |
| Service | ollama-lan in ollamaserver-shared namespace |
| Calico Policy | allow-lan-to-ollama GlobalNetworkPolicy |
Tdarr Node (GPU Transcoding)
Tdarr transcoding node using the RTX 5090 NVENC hardware encoder. Fastest node in the cluster.
| Property | Value |
|---|---|
| Namespace | tdarr-node |
| Manifest | olares/tdarr-node.yaml |
| Version | 2.67.01 (pinned by digest) |
| Server | Atlantis (192.168.0.200:8266) |
| GPU Encoders | h264_nvenc, hevc_nvenc, av1_nvenc |
| Workers | GPU=2, CPU=0, Health=1 |
NFS mounts:
- `/mnt/atlantis_media`: media library (read-write; Tdarr needs write access to replace transcoded files in place)
- `/mnt/atlantis_cache`: transcoding cache (read-write, shared with all nodes)
NFS cache mount (`/mnt/atlantis_cache`):
- Source: `192.168.0.200:/volume1/data/tdarr_cache`
- Added to `/etc/fstab` on Olares for persistence
- Required `no_root_squash` on the Atlantis NFS export for `192.168.0.145` (Olares containers run as root; the default `root_squash` maps root to `nobody`, causing permission denied on cache writes)
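On the Atlantis side, the export would look roughly like the line below. This is standard `/etc/exports` syntax; Synology's UI manages this file, so the exact option set it writes may differ:

```
/volume1/data/tdarr_cache 192.168.0.145(rw,async,no_root_squash)
```

After editing, the export table is reloaded with `exportfs -ra` on a stock Linux NFS server.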
Deploy/redeploy:
```shell
ssh olares "kubectl apply -f -" < olares/tdarr-node.yaml
ssh olares "kubectl get pods -n tdarr-node"
ssh olares "kubectl exec -n tdarr-node deploy/tdarr-node -- nvidia-smi"
```
GPU contention: NVENC uses dedicated hardware separate from CUDA cores. Tdarr + Ollama coexist fine. Tdarr + Jellyfin may compete for NVENC sessions (RTX 5090 supports up to 8 concurrent).
Calico GlobalNetworkPolicies
Olares auto-creates restrictive `app-np` NetworkPolicies per namespace that block LAN traffic. These cannot be modified (an admission webhook reverts changes) or supplemented (the webhook deletes custom policies). The solution is Calico GlobalNetworkPolicies, which operate at a layer Olares can't override.
Active policies:
```shell
kubectl get globalnetworkpolicy
# allow-lan-to-jellyfin — 192.168.0.0/24 → app=jellyfin
# allow-lan-to-ollama — 192.168.0.0/24 → io.kompose.service=ollama
# allow-lan-to-tdarr — 192.168.0.0/24 ingress + all egress → app=tdarr-node
```
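A sketch of what one of these policies looks like in Calico v3 syntax, using the ollama selector from the listing above; the protocol restriction is illustrative and the actual policy on the cluster may differ:

```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-lan-to-ollama
spec:
  # Applies to pods matching this label, in any namespace
  selector: io.kompose.service == 'ollama'
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        nets:
          - 192.168.0.0/24
```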
NFS Media Mount from Atlantis
| Property | Value |
|---|---|
| Mount Point | /mnt/atlantis_media |
| Source | 192.168.0.200:/volume1/data/media |
| Contents | movies, tv, anime, music, audiobooks, ebooks, podcasts |
| Performance | 180-420 MB/s sequential read |
| Persistent | Yes — in /etc/fstab |
```
# /etc/fstab
192.168.0.200:/volume1/data/media /mnt/atlantis_media nfs rw,async,hard,intr,rsize=131072,wsize=131072 0 0
```
Known Issues
- Do NOT install host-level Tailscale: it conflicts with the K8s Tailscale pod's `tailscale0` interface and causes total network loss requiring a physical reboot
- Ollama Qwen3.5 27B lacks tool calling: Ollama's model template doesn't support tools; use vLLM for coding agents
- Only run one model at a time: running multiple vLLM instances exhausts 24GB VRAM; scale unused deployments to 0
- vLLM startup takes 2-3 minutes: requests during startup return 502/connection refused; wait for "Application startup complete" in the logs
- Olares auth (Authelia) blocks API endpoints by default: new model endpoints need an auth bypass configured in Olares app settings
- Raw kubectl deployments don't get Olares URLs: apps deployed outside Studio/Market have no managed ingress (`*.vishinator.olares.com`). Use SSH tunnels or NodePort (if networking allows) as workarounds
- HAMI GPU scheduler requires Olares labels: pods requesting GPU without the `applications.app.bytetrade.io/name` label fail to schedule with `cannot schedule pod without applications.app.bytetrade.io/name label`
- Never name a k8s service `vllm`: Kubernetes auto-injects a `VLLM_PORT` env var from service discovery, which conflicts with vLLM's own config. Use `vllm-server` or similar
- HAMI vGPU causes ffmpeg segfaults: HAMI injects `libvgpu.so` via `/etc/ld.so.preload`, which intercepts CUDA calls and crashes ffmpeg (exit 139) during GPU transcoding. Fix: don't request `nvidia.com/gpu` resources; use `runtimeClassName: nvidia` directly
- Olares admission webhook blocks LAN access: auto-created `app-np` NetworkPolicies can't be modified or supplemented. Use a Calico GlobalNetworkPolicy for LAN access
- Olares proxy adds ~100ms latency: direct LAN access via NodePort + GlobalNetworkPolicy is 88x faster; use it for streaming/high-throughput services
- hostNetwork blocked: the Olares admission webhook rejects `hostNetwork: true` pods with "HostNetwork Enabled Unsupported"
- Marketplace app patches lost on update: kubectl patches to marketplace apps (NFS mounts, GPU access) are overwritten when the app is updated. Re-apply after updates
Remote Management with k9s
k9s and kubectl are installed on the homelab VM for managing Olares pods without SSH.
Setup
| Component | Details |
|---|---|
| kubectl | /usr/local/bin/kubectl (v1.35.2) |
| k9s | /usr/local/bin/k9s (v0.50.18) |
| kubeconfig | ~/.kube/config → https://192.168.0.145:6443 |
| Access | Full admin (K3s default user), LAN only |
The kubeconfig was copied from `/etc/rancher/k3s/k3s.yaml` on Olares with the server address changed from `127.0.0.1` to `192.168.0.145`.
Usage
```shell
# Launch k9s (interactive TUI)
k9s
# Filter by namespace
k9s -n ollamaserver-shared
# Quick kubectl checks
kubectl get pods -A
kubectl get deployments -A | grep -iE 'ollama|vllm'
kubectl logs -n ollamaserver-shared -l io.kompose.service=ollama --tail=20
kubectl scale deployment ollama -n ollamaserver-shared --replicas=0
```
Limitations
- LAN only: Olares has no host-level Tailscale, so k9s only works from the local network
- Metrics API not available: `kubectl top` / the k9s resource view won't work
- Kubeconfig rotation: if Olares is reinstalled or K3s certs rotate, re-copy the kubeconfig:

```shell
ssh olares "sudo cat /etc/rancher/k3s/k3s.yaml" | sed 's|https://127.0.0.1:6443|https://192.168.0.145:6443|' > ~/.kube/config
chmod 600 ~/.kube/config
```
Dashboard Integration
The Homelab Dashboard monitors Olares via SSH:
- GPU status: `nvidia-smi` queries displayed on the Dashboard and Infrastructure pages
- K3s pods: pod listing on the Infrastructure page (`/api/olares/pods`)
- Jellyfin: sessions and recently added items on the Media page (via `kubectl exec` + curl to bypass the Olares auth sidecar)
- Ollama: availability check and AI chat widget (uses the Ollama LAN endpoint at `192.168.0.145:31434`)
- Quick actions: restart the Jellyfin and Ollama deployments via `kubectl rollout restart`
Maintenance
Reboot
```shell
ssh olares 'sudo reboot'
```
Allow 3-5 minutes for K8s pods to come back up. Check with:
```shell
ssh olares 'sudo kubectl get pods -A | grep -v Running'
```
Memory Management
With 96 GB RAM, multiple models can load into system memory but GPU VRAM is the bottleneck. Monitor with:
```shell
ssh olares 'free -h; nvidia-smi'
```