Files
homelab-optimized/docs/services/individual/olares.md
Gitea Mirror Bot a95a68e477
Some checks failed
Documentation / Build Docusaurus (push) Failing after 8s
Documentation / Deploy to GitHub Pages (push) Has been skipped
Sanitized mirror from private repository - 2026-03-15 06:28:27 UTC
2026-03-15 06:28:27 +00:00

7.1 KiB

Olares

Kubernetes Self-Hosting Platform

Service Overview

Property Value
Host olares (192.168.0.145)
OS Ubuntu 24.04.3 LTS
Platform Olares (Kubernetes/K3s with Calico CNI)
Hardware Intel Core Ultra 9 275HX, 96GB DDR5, RTX 5090 Max-Q, 2TB NVMe
SSH ssh olares (key auth, user: olares)

Purpose

Olares is a Kubernetes-based self-hosting platform running on a high-end mini PC. It provides a managed app store for deploying containerized services with built-in auth (Authelia), networking (Envoy sidecars), and GPU scheduling (HAMI).

Primary use case: local LLM inference via vLLM and Ollama, exposed as OpenAI-compatible API endpoints for coding agents (OpenCode, OpenClaw).

LLM Services

Models are deployed via the Olares app store and served as OpenAI-compatible APIs. Each model gets a unique subdomain under *.vishinator.olares.com.

Available Models

Model Backend Endpoint Context Notes
Qwen3 30B A3B (4-bit) vLLM https://04521407.vishinator.olares.com/v1 ~40k tokens MoE, fast inference, limited tool calling
Qwen3.5 27B Q4_K_M Ollama https://37e62186.vishinator.olares.com/v1 40k+ (262k native) Dense, best for agentic coding
GPT-OSS 20B vLLM https://6941bf89.vishinator.olares.com/v1 65k tokens Requires auth bypass in Olares settings
Gemma 3 27B vLLM https://d3ea5398.vishinator.olares.com/v1 ~8k tokens Limited by VRAM for KV cache

GPU Memory Constraints (RTX 5090 Max-Q, 24 GB VRAM)

  • Only run one model at a time to avoid VRAM exhaustion
  • vLLM --gpu-memory-utilization 0.95 is the default
  • Context limits are determined by available KV cache after model loading
  • Use nvidia-smi or check vLLM logs for actual KV cache capacity
  • Scale unused model deployments to 0: kubectl scale deployment vllm -n <namespace> --replicas=0

vLLM max_model_len

The max_model_len parameter is set in the deployment command args. To check the hardware-safe maximum, look at vLLM startup logs:

Available KV cache memory: X.XX GiB
GPU KV cache size: XXXXX tokens

To change it, either:

  1. Edit in the Olares app settings UI (persistent across redeploys)
  2. Patch the deployment directly (resets on redeploy):
    kubectl get deployment vllm -n <namespace> -o json > /tmp/patch.json
    # Edit max-model-len in the command string
    kubectl apply -f /tmp/patch.json
    

OpenClaw (Chat Agent)

OpenClaw runs as a Kubernetes app in the clawdbot-vishinator namespace.

Configuration

Config file inside the pod: /home/node/.openclaw/openclaw.json

To read/write config:

ssh olares
sudo kubectl exec -n clawdbot-vishinator <pod> -c clawdbot -- cat /home/node/.openclaw/openclaw.json

Key Settings

  • Compaction: mode: "safeguard" with maxHistoryShare: 0.5 prevents context overflow
  • contextWindow: Must match vLLM's actual max_model_len (not the model's native limit)
  • Workspace data: Lives at /home/node/.openclaw/workspace/ inside the pod
  • Brew packages: OpenClaw has Homebrew; install tools with brew install <pkg> from the agent or pod

Troubleshooting

Error Cause Fix
localhost:8000 connection refused Model provider not configured or not running Check model endpoint URL in config, verify vLLM pod is running
Context overflow Prompt exceeded model's context limit Enable compaction, or /reset the session
pairing required (WebSocket 1008) Device pairing data was cleared Reload the Control UI page to re-pair
does not support tools (400) Ollama model lacks tool calling template Use vLLM with --enable-auto-tool-choice instead of Ollama
max_tokens must be at least 1, got negative Context window too small for system prompt + tools Increase max_model_len (vLLM) or num_ctx (Ollama)
bad request / 400 from Ollama Request exceeds num_ctx Increase num_ctx in Modelfile: ollama create model -f Modelfile
302 redirect on model endpoint Olares auth (Authelia) blocking API access Disable auth for the endpoint in Olares app settings
vLLM server pod scaled to 0 Previously stopped, client pod crashes Scale up: kubectl scale deployment vllm -n <namespace> --replicas=1

OpenCode Configuration

OpenCode on the homelab VM and moon are configured to use these endpoints.

Config Location

  • homelab VM: ~/.config/opencode/opencode.json
  • moon: ~/.config/opencode/opencode.json (user: moon)

Model Switching

Change the "model" field in opencode.json:

"model": "olares//models/qwen3-30b"

Available provider/model strings:

  • olares//models/qwen3-30b (recommended — supports tool calling via vLLM)
  • olares-gptoss//models/gpt-oss-20b
  • olares-qwen35/qwen3.5:27b-q4_K_M (Ollama — does NOT support tool calling, avoid for OpenCode)

Important: OpenCode requires tool/function calling support. Ollama models often lack tool call templates, causing 400 errors. Use vLLM with --enable-auto-tool-choice --tool-call-parser hermes for reliable tool use.

Loop Prevention

"mode": {
  "build": {
    "steps": 25,
    "permission": { "doom_loop": "deny" }
  },
  "plan": {
    "steps": 15,
    "permission": { "doom_loop": "deny" }
  }
}

Built-in Services

Olares runs its own infrastructure in Kubernetes:

  • Headscale + Tailscale: Internal mesh network (separate tailnet from homelab, IP 100.64.0.1)
  • Authelia: SSO/auth gateway for app endpoints
  • Envoy: Sidecar proxy for all apps
  • HAMI: GPU device scheduler for vLLM/Ollama pods
  • Prometheus: Metrics collection

Network

Interface IP Notes
LAN (enp129s0) 192.168.0.145/24 Primary access
Tailscale (K8s pod) 100.64.0.1 Olares internal tailnet only

Note: The host does not have Tailscale installed directly. The K8s Tailscale pod uses tailscale0 and conflicts with host-level tailscale (causes network outage if both run). Access via LAN only.

Known Issues

  • Do NOT install host-level Tailscale — it conflicts with the K8s Tailscale pod's tailscale0 interface and causes total network loss requiring physical reboot
  • Ollama Qwen3.5 27B lacks tool calling — Ollama's model template doesn't support tools; use vLLM for coding agents
  • Only run one model at a time — running multiple vLLM instances exhausts 24GB VRAM; scale unused deployments to 0
  • vLLM startup takes 2-3 minutes — requests during startup return 502/connection refused; wait for "Application startup complete" in logs
  • Olares auth (Authelia) blocks API endpoints by default — new model endpoints need auth bypass configured in Olares app settings

Maintenance

Reboot

ssh olares 'sudo reboot'

Allow 3-5 minutes for K8s pods to come back up. Check with:

ssh olares 'sudo kubectl get pods -A | grep -v Running'

Memory Management

With 96 GB RAM, multiple models can load into system memory but GPU VRAM is the bottleneck. Monitor with:

ssh olares 'free -h; nvidia-smi'