# Olares: Kubernetes Self-Hosting Platform

## Service Overview
| Property | Value |
|---|---|
| Host | olares (192.168.0.145) |
| OS | Ubuntu 24.04.3 LTS |
| Platform | Olares (Kubernetes/K3s with Calico CNI) |
| Hardware | Intel Core Ultra 9 275HX, 96GB DDR5, RTX 5090 Max-Q, 2TB NVMe |
| SSH | ssh olares (key auth, user: olares) |
## Purpose
Olares is a Kubernetes-based self-hosting platform running on a high-end mini PC. It provides a managed app store for deploying containerized services with built-in auth (Authelia), networking (Envoy sidecars), and GPU scheduling (HAMI).
Primary use case: local LLM inference via vLLM and Ollama, exposed as OpenAI-compatible API endpoints for coding agents (OpenCode, OpenClaw).
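A quick way to sanity-check an endpoint is a plain OpenAI-style request. This is a sketch: the model name `qwen3-coder:latest` is an assumption, so list the real names via `GET /v1/models` first.

```shell
# Build an OpenAI chat-completions request body (model name is assumed).
BASE="https://a5be22681.vishinator.olares.com/v1"
payload='{"model":"qwen3-coder:latest","messages":[{"role":"user","content":"Say hi"}]}'
echo "$payload"
# Then, against the live endpoint:
#   curl -s "$BASE/models"
#   curl -s "$BASE/chat/completions" -H 'Content-Type: application/json' -d "$payload"
```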
## LLM Services
Models are deployed via the Olares app store and served as OpenAI-compatible APIs. Each model gets a unique subdomain under *.vishinator.olares.com.
### Available Models
| Model | Backend | Namespace | Endpoint | Context | Notes |
|---|---|---|---|---|---|
| Qwen3-Coder 30B | Ollama | `ollamaserver-shared` | https://a5be22681.vishinator.olares.com/v1 | 65k tokens | MoE (3.3B active), coding-focused, currently active |
| Qwen3 30B A3B (4-bit) | vLLM | `vllmqwen330ba3bv2server-shared` | https://04521407.vishinator.olares.com/v1 | ~40k tokens | MoE, fast inference, limited tool calling |
| Qwen3 30B A3B (4-bit) | vLLM | `vllmqwen330ba3binstruct4bitv2-vishinator` | — | ~40k tokens | Duplicate deployment (vishinator namespace) |
| Qwen3.5 27B Q4_K_M | Ollama | `ollamaqwen3527bq4kmv2server-shared` | https://37e62186.vishinator.olares.com/v1 | 40k+ (262k native) | Dense, best for agentic coding |
| GPT-OSS 20B | vLLM | `vllmgptoss20bv2server-shared` | https://6941bf89.vishinator.olares.com/v1 | 65k tokens | Requires auth bypass in Olares settings |
| Qwen3.5 9B | Ollama | `ollamaqwen359bv2server-shared` | — | — | Lightweight, scaled to 0 |
### GPU Memory Constraints (RTX 5090 Max-Q, 24 GB VRAM)
- Only run one model at a time to avoid VRAM exhaustion
- vLLM defaults to `--gpu-memory-utilization 0.95`
- Context limits are determined by available KV cache after model loading
- Use `nvidia-smi` or check vLLM logs for actual KV cache capacity
- Before starting a model, scale down all others (see Scaling Operations below)
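As a back-of-envelope check of why only one model fits: subtract an assumed weight footprint from the utilization budget. The numbers below are illustrative assumptions, not measurements; read the real figure from the vLLM startup logs.

```shell
# Rough KV-cache budget for a 4-bit ~30B model on 24 GiB VRAM (assumed numbers).
TOTAL_MIB=24576      # RTX 5090 Max-Q VRAM
UTIL_PCT=95          # vLLM --gpu-memory-utilization 0.95
WEIGHTS_MIB=17000    # assumed weight footprint of a 4-bit 30B MoE
KV_BUDGET=$(( TOTAL_MIB * UTIL_PCT / 100 - WEIGHTS_MIB ))
echo "KV cache budget: ${KV_BUDGET} MiB"
```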
## Scaling Operations
Only one model should be loaded at a time due to VRAM constraints. Use these commands to switch between models.
Check what's running:

```shell
ssh olares "sudo kubectl get deployments -A | grep -iE 'vllm|ollama'"
ssh olares "nvidia-smi --query-gpu=memory.used,memory.free --format=csv"
```
Stop all LLM deployments (free GPU):

```shell
# Qwen3-Coder (Ollama — currently active)
ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=0"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=0"
# Qwen3 30B A3B vLLM (shared)
ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=0"
# Qwen3 30B A3B vLLM (vishinator)
ssh olares "sudo kubectl scale deployment vllmqwen330ba3binstruct4bitv2 -n vllmqwen330ba3binstruct4bitv2-vishinator --replicas=0"
# Qwen3.5 27B Ollama
ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
# GPT-OSS 20B vLLM
ssh olares "sudo kubectl scale deployment vllm -n vllmgptoss20bv2server-shared --replicas=0"
```
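The per-deployment commands above can also be generated from a list instead of maintained by hand. This is a sketch: `llm_scale_cmds` is a hypothetical helper, and the deployment/namespace pairs are copied from this section.

```shell
# Emit scale commands for every known LLM deployment; pass 0 to stop, 1 to start.
llm_scale_cmds() {
  replicas="$1"
  for pair in \
    "ollama ollamaserver-shared" \
    "terminal ollamaserver-shared" \
    "vllm vllmqwen330ba3bv2server-shared" \
    "vllmqwen330ba3binstruct4bitv2 vllmqwen330ba3binstruct4bitv2-vishinator" \
    "ollama ollamaqwen3527bq4kmv2server-shared" \
    "api ollamaqwen3527bq4kmv2server-shared" \
    "vllm vllmgptoss20bv2server-shared"
  do
    # Split "deployment namespace" into $1 and $2
    set -- $pair
    echo "sudo kubectl scale deployment $1 -n $2 --replicas=$replicas"
  done
}

# Review the commands, then pipe them to the host to execute:
llm_scale_cmds 0
#   llm_scale_cmds 0 | ssh olares sh
```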
Start Qwen3-Coder (Ollama):

```shell
ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=1"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=1"
```
Start Qwen3 30B A3B (vLLM):

```shell
ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1"
# Wait 2-3 minutes for vLLM startup, then check:
ssh olares "sudo kubectl logs -n vllmqwen330ba3bv2server-shared -l io.kompose.service=vllm --tail=5"
```
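Rather than sleeping a fixed 2-3 minutes, you can poll until the API answers. This is a sketch: `wait_for` is a hypothetical helper, and the endpoint URL is the one from the models table.

```shell
# Retry a command once per second until it succeeds, up to N tries.
wait_for() {
  tries="$1"; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      return 1
    fi
    sleep 1
  done
}

# e.g. give vLLM up to ~3 minutes to come up:
#   wait_for 180 curl -sf https://04521407.vishinator.olares.com/v1/models
```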
Start Qwen3.5 27B (Ollama):

```shell
ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
```
Unload a model from Ollama (without scaling down the pod):

```shell
ssh olares "sudo kubectl exec -n ollamaserver-shared \$(sudo kubectl get pods -n ollamaserver-shared -l io.kompose.service=ollama -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama stop qwen3-coder:latest"
```
### vLLM max_model_len

The `max_model_len` parameter is set in the deployment command args. To check the hardware-safe maximum, look at the vLLM startup logs:

```
Available KV cache memory: X.XX GiB
GPU KV cache size: XXXXX tokens
```
To change it, either:

- Edit in the Olares app settings UI (persistent across redeploys)
- Patch the deployment directly (resets on redeploy):

```shell
kubectl get deployment vllm -n <namespace> -o json > /tmp/patch.json
# Edit max-model-len in the command string
kubectl apply -f /tmp/patch.json
```
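To read the current value without opening the JSON by hand, something like this works. It is a sketch assuming `jq` is installed; the inline sample stands in for the real `kubectl get deployment vllm -n <namespace> -o json` output, whose container/arg layout may differ on your deployment.

```shell
# Pull --max-model-len out of the container command array.
sample='{"spec":{"template":{"spec":{"containers":[{"command":["vllm","serve","--max-model-len","40960"]}]}}}}'
echo "$sample" \
  | jq -r '.spec.template.spec.containers[0].command | join(" ")' \
  | grep -o -- '--max-model-len [0-9]*'
```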
## OpenClaw (Chat Agent)
OpenClaw runs as a Kubernetes app in the clawdbot-vishinator namespace.
### Configuration
Config file inside the pod: `/home/node/.openclaw/openclaw.json`

To read/write config:

```shell
ssh olares
sudo kubectl exec -n clawdbot-vishinator <pod> -c clawdbot -- cat /home/node/.openclaw/openclaw.json
```
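The commands above cover the read path; writing the file back can be done the same way. This is a sketch: the pod name lookup (first pod in the namespace) is an assumption and may need a label selector on a busier namespace.

```shell
# Resolve the pod name, then stream a local config into the pod.
POD=$(sudo kubectl get pods -n clawdbot-vishinator -o jsonpath='{.items[0].metadata.name}')
sudo kubectl exec -i -n clawdbot-vishinator "$POD" -c clawdbot -- \
  sh -c 'cat > /home/node/.openclaw/openclaw.json' < openclaw.json
```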
### Key Settings
- Compaction: `mode: "safeguard"` with `maxHistoryShare: 0.5` prevents context overflow
- `contextWindow`: must match vLLM's actual `max_model_len` (not the model's native limit)
- Workspace data: lives at `/home/node/.openclaw/workspace/` inside the pod
- Brew packages: OpenClaw has Homebrew; install tools with `brew install <pkg>` from the agent or pod
### Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `localhost:8000` connection refused | Model provider not configured or not running | Check model endpoint URL in config, verify vLLM pod is running |
| Context overflow | Prompt exceeded model's context limit | Enable compaction, or `/reset` the session |
| `pairing required` (WebSocket 1008) | Device pairing data was cleared | Reload the Control UI page to re-pair |
| `does not support tools` (400) | Ollama model lacks a tool calling template | Use vLLM with `--enable-auto-tool-choice` instead of Ollama |
| `max_tokens must be at least 1, got negative` | Context window too small for system prompt + tools | Increase `max_model_len` (vLLM) or `num_ctx` (Ollama) |
| Bad request / 400 from Ollama | Request exceeds `num_ctx` | Increase `num_ctx` in the Modelfile: `ollama create model -f Modelfile` |
| 302 redirect on model endpoint | Olares auth (Authelia) blocking API access | Disable auth for the endpoint in Olares app settings |
| vLLM server pod scaled to 0 | Previously stopped; client pod crashes | Scale up: `kubectl scale deployment vllm -n <namespace> --replicas=1` |
## OpenCode Configuration
OpenCode instances on the homelab VM and on moon are configured to use these endpoints.
### Config Location

- homelab VM: `~/.config/opencode/opencode.json`
- moon: `~/.config/opencode/opencode.json` (user: moon)
### Model Switching

Change the `"model"` field in `opencode.json`:

```json
"model": "olares//models/qwen3-30b"
```
Available provider/model strings:

- `olares//models/qwen3-30b` (recommended — supports tool calling via vLLM)
- `olares-gptoss//models/gpt-oss-20b`
- `olares-qwen35/qwen3.5:27b-q4_K_M` (Ollama — does NOT support tool calling, avoid for OpenCode)
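The switch can also be scripted. A sketch, assuming `jq` is installed: set `CFG` to the real config path from above; the temp-file fallback here only keeps the example self-contained.

```shell
# For the real config, set CFG=~/.config/opencode/opencode.json before running;
# a temp file with the same shape stands in otherwise.
CFG="${CFG:-$(mktemp)}"
[ -s "$CFG" ] || echo '{"model":"olares-gptoss//models/gpt-oss-20b"}' > "$CFG"
# Rewrite the "model" field in place via a temp file.
jq '.model = "olares//models/qwen3-30b"' "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"
jq -r .model "$CFG"
```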
Important: OpenCode requires tool/function calling support. Ollama models often lack tool call templates, causing 400 errors. Use vLLM with `--enable-auto-tool-choice --tool-call-parser hermes` for reliable tool use.
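For reference, those flags slot into a serve command roughly like this. A sketch only: the model ID and length are assumptions taken from the models table, not the exact args of the running pod.

```shell
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```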
### Loop Prevention

```json
"mode": {
  "build": {
    "steps": 25,
    "permission": { "doom_loop": "deny" }
  },
  "plan": {
    "steps": 15,
    "permission": { "doom_loop": "deny" }
  }
}
```
## Storage — NFS Mount from Atlantis
Olares has an NFS mount from Atlantis for persistent storage shared with the homelab:
| Property | Value |
|---|---|
| Mount point | /mnt/atlantis_olares_storage |
| Source | 192.168.0.200:/volume1/documents/olares_storage |
| Access | Read/write (all_squash, anonuid=1026/anongid=100) |
| Persistent | Yes — configured in /etc/fstab |
| Capacity | 84TB pool (46TB free as of 2026-03-16) |
### fstab entry

```
192.168.0.200:/volume1/documents/olares_storage /mnt/atlantis_olares_storage nfs rw,async,hard,intr,rsize=8192,wsize=8192,timeo=14 0 0
```
### Mount/unmount manually

```shell
# Mount
sudo mount /mnt/atlantis_olares_storage
# Unmount
sudo umount /mnt/atlantis_olares_storage
# Check
df -h /mnt/atlantis_olares_storage
ls /mnt/atlantis_olares_storage
```
### Troubleshooting

- If mount fails after reboot, check Atlantis is up and NFS is running: `sudo showmount -e 192.168.0.200`
- Fail2ban on Olares may ban homelab-vm (`192.168.0.210`) — whitelist is `/etc/fail2ban/jail.d/local.conf` with `ignoreip = 127.0.0.1/8 ::1 192.168.0.0/24`
- SSH to Olares uses key auth (`ssh olares` works from homelab-vm) — key installed 2026-03-16
## Built-in Services
Olares runs its own infrastructure in Kubernetes:
- Headscale + Tailscale: Internal mesh network (separate tailnet from homelab, IP 100.64.0.1)
- Authelia: SSO/auth gateway for app endpoints
- Envoy: Sidecar proxy for all apps
- HAMI: GPU device scheduler for vLLM/Ollama pods
- Prometheus: Metrics collection
## Network
| Interface | IP | Notes |
|---|---|---|
| LAN (enp129s0) | 192.168.0.145/24 | Primary access |
| Tailscale (K8s pod) | 100.64.0.1 | Olares internal tailnet only |
Note: The host does not have Tailscale installed directly. The K8s Tailscale pod uses `tailscale0` and conflicts with host-level Tailscale (causes a network outage if both run). Access via LAN only.
## Known Issues
- Do NOT install host-level Tailscale — it conflicts with the K8s Tailscale pod's `tailscale0` interface and causes total network loss requiring a physical reboot
- Ollama Qwen3.5 27B lacks tool calling — Ollama's model template doesn't support tools; use vLLM for coding agents
- Only run one model at a time — running multiple vLLM instances exhausts 24 GB VRAM; scale unused deployments to 0
- vLLM startup takes 2-3 minutes — requests during startup return 502/connection refused; wait for "Application startup complete" in the logs
- Olares auth (Authelia) blocks API endpoints by default — new model endpoints need an auth bypass configured in Olares app settings
## Remote Management with k9s
k9s and kubectl are installed on the homelab VM for managing Olares pods without SSH.
### Setup
| Component | Details |
|---|---|
| kubectl | /usr/local/bin/kubectl (v1.35.2) |
| k9s | /usr/local/bin/k9s (v0.50.18) |
| kubeconfig | ~/.kube/config → https://192.168.0.145:6443 |
| Access | Full admin (K3s default user), LAN only |
The kubeconfig was copied from `/etc/rancher/k3s/k3s.yaml` on Olares with the server address changed from `127.0.0.1` to `192.168.0.145`.
### Usage

```shell
# Launch k9s (interactive TUI)
k9s
# Filter by namespace
k9s -n ollamaserver-shared
# Quick kubectl checks
kubectl get pods -A
kubectl get deployments -A | grep -iE 'ollama|vllm'
kubectl logs -n ollamaserver-shared -l io.kompose.service=ollama --tail=20
kubectl scale deployment ollama -n ollamaserver-shared --replicas=0
```
### Limitations

- LAN only — Olares has no host-level Tailscale, so k9s only works from the local network
- Metrics API not available — `kubectl top` / the k9s resource view won't work
- Kubeconfig rotation — if Olares is reinstalled or K3s certs rotate, re-copy the kubeconfig:

```shell
ssh olares "sudo cat /etc/rancher/k3s/k3s.yaml" | sed 's|https://127.0.0.1:6443|https://192.168.0.145:6443|' > ~/.kube/config
chmod 600 ~/.kube/config
```
## Maintenance

### Reboot

```shell
ssh olares 'sudo reboot'
```

Allow 3-5 minutes for K8s pods to come back up. Check with:

```shell
ssh olares 'sudo kubectl get pods -A | grep -v Running'
```
### Memory Management

With 96 GB RAM, multiple models can load into system memory, but GPU VRAM is the bottleneck. Monitor with:

```shell
ssh olares 'free -h; nvidia-smi'
```