Vish/homelab-optimized

Fork 0

Files

Gitea Mirror Bot a95a68e477

Documentation / Build Docusaurus (push) Failing after 8s

Details

Documentation / Deploy to GitHub Pages (push) Has been skipped

Details

Sanitized mirror from private repository - 2026-03-15 06:28:27 UTC

2026-03-15 06:28:27 +00:00

7.1 KiB

Raw Blame History

Olares

Kubernetes Self-Hosting Platform

Service Overview

Property	Value
Host	olares (192.168.0.145)
OS	Ubuntu 24.04.3 LTS
Platform	Olares (Kubernetes/K3s with Calico CNI)
Hardware	Intel Core Ultra 9 275HX, 96GB DDR5, RTX 5090 Max-Q, 2TB NVMe
SSH	`ssh olares` (key auth, user: olares)

Purpose

Olares is a Kubernetes-based self-hosting platform running on a high-end mini PC. It provides a managed app store for deploying containerized services with built-in auth (Authelia), networking (Envoy sidecars), and GPU scheduling (HAMI).

Primary use case: local LLM inference via vLLM and Ollama, exposed as OpenAI-compatible API endpoints for coding agents (OpenCode, OpenClaw).

LLM Services

Models are deployed via the Olares app store and served as OpenAI-compatible APIs. Each model gets a unique subdomain under *.vishinator.olares.com.

Available Models

Model	Backend	Endpoint	Context	Notes
Qwen3 30B A3B (4-bit)	vLLM	`https://04521407.vishinator.olares.com/v1`	~40k tokens	MoE, fast inference, limited tool calling
Qwen3.5 27B Q4_K_M	Ollama	`https://37e62186.vishinator.olares.com/v1`	40k+ (262k native)	Dense, best for agentic coding
GPT-OSS 20B	vLLM	`https://6941bf89.vishinator.olares.com/v1`	65k tokens	Requires auth bypass in Olares settings
Gemma 3 27B	vLLM	`https://d3ea5398.vishinator.olares.com/v1`	~8k tokens	Limited by VRAM for KV cache

GPU Memory Constraints (RTX 5090 Max-Q, 24 GB VRAM)

Only run one model at a time to avoid VRAM exhaustion
vLLM --gpu-memory-utilization 0.95 is the default
Context limits are determined by available KV cache after model loading
Use nvidia-smi or check vLLM logs for actual KV cache capacity
Scale unused model deployments to 0: kubectl scale deployment vllm -n <namespace> --replicas=0

vLLM max_model_len

The max_model_len parameter is set in the deployment command args. To check the hardware-safe maximum, look at vLLM startup logs:

Available KV cache memory: X.XX GiB
GPU KV cache size: XXXXX tokens

To change it, either:

Edit in the Olares app settings UI (persistent across redeploys)

Patch the deployment directly (resets on redeploy):

kubectl get deployment vllm -n <namespace> -o json > /tmp/patch.json
# Edit max-model-len in the command string
kubectl apply -f /tmp/patch.json

OpenClaw (Chat Agent)

OpenClaw runs as a Kubernetes app in the clawdbot-vishinator namespace.

Configuration

Config file inside the pod: /home/node/.openclaw/openclaw.json

To read/write config:

ssh olares
sudo kubectl exec -n clawdbot-vishinator <pod> -c clawdbot -- cat /home/node/.openclaw/openclaw.json

Key Settings

Compaction: mode: "safeguard" with maxHistoryShare: 0.5 prevents context overflow
contextWindow: Must match vLLM's actual max_model_len (not the model's native limit)
Workspace data: Lives at /home/node/.openclaw/workspace/ inside the pod
Brew packages: OpenClaw has Homebrew; install tools with brew install <pkg> from the agent or pod

Troubleshooting

Error	Cause	Fix
`localhost:8000 connection refused`	Model provider not configured or not running	Check model endpoint URL in config, verify vLLM pod is running
`Context overflow`	Prompt exceeded model's context limit	Enable compaction, or `/reset` the session
`pairing required` (WebSocket 1008)	Device pairing data was cleared	Reload the Control UI page to re-pair
`does not support tools` (400)	Ollama model lacks tool calling template	Use vLLM with `--enable-auto-tool-choice` instead of Ollama
`max_tokens must be at least 1, got negative`	Context window too small for system prompt + tools	Increase `max_model_len` (vLLM) or `num_ctx` (Ollama)
`bad request` / 400 from Ollama	Request exceeds `num_ctx`	Increase `num_ctx` in Modelfile: `ollama create model -f Modelfile`
302 redirect on model endpoint	Olares auth (Authelia) blocking API access	Disable auth for the endpoint in Olares app settings
vLLM server pod scaled to 0	Previously stopped, client pod crashes	Scale up: `kubectl scale deployment vllm -n <namespace> --replicas=1`

OpenCode Configuration

OpenCode on the homelab VM and moon are configured to use these endpoints.

Config Location

homelab VM: ~/.config/opencode/opencode.json
moon: ~/.config/opencode/opencode.json (user: moon)

Model Switching

Change the "model" field in opencode.json:

"model": "olares//models/qwen3-30b"

Available provider/model strings:

olares//models/qwen3-30b (recommended — supports tool calling via vLLM)
olares-gptoss//models/gpt-oss-20b
olares-qwen35/qwen3.5:27b-q4_K_M (Ollama — does NOT support tool calling, avoid for OpenCode)

Important: OpenCode requires tool/function calling support. Ollama models often lack tool call templates, causing 400 errors. Use vLLM with --enable-auto-tool-choice --tool-call-parser hermes for reliable tool use.

Loop Prevention

"mode": {
  "build": {
    "steps": 25,
    "permission": { "doom_loop": "deny" }
  },
  "plan": {
    "steps": 15,
    "permission": { "doom_loop": "deny" }
  }
}

Built-in Services

Olares runs its own infrastructure in Kubernetes:

Headscale + Tailscale: Internal mesh network (separate tailnet from homelab, IP 100.64.0.1)
Authelia: SSO/auth gateway for app endpoints
Envoy: Sidecar proxy for all apps
HAMI: GPU device scheduler for vLLM/Ollama pods
Prometheus: Metrics collection

Network

Interface	IP	Notes
LAN (enp129s0)	192.168.0.145/24	Primary access
Tailscale (K8s pod)	100.64.0.1	Olares internal tailnet only

Note: The host does not have Tailscale installed directly. The K8s Tailscale pod uses tailscale0 and conflicts with host-level tailscale (causes network outage if both run). Access via LAN only.

Known Issues

Do NOT install host-level Tailscale — it conflicts with the K8s Tailscale pod's tailscale0 interface and causes total network loss requiring physical reboot
Ollama Qwen3.5 27B lacks tool calling — Ollama's model template doesn't support tools; use vLLM for coding agents
Only run one model at a time — running multiple vLLM instances exhausts 24GB VRAM; scale unused deployments to 0
vLLM startup takes 2-3 minutes — requests during startup return 502/connection refused; wait for "Application startup complete" in logs
Olares auth (Authelia) blocks API endpoints by default — new model endpoints need auth bypass configured in Olares app settings

Maintenance

Reboot

ssh olares 'sudo reboot'

Allow 3-5 minutes for K8s pods to come back up. Check with:

ssh olares 'sudo kubectl get pods -A | grep -v Running'

Memory Management

With 96 GB RAM, multiple models can load into system memory but GPU VRAM is the bottleneck. Monitor with:

ssh olares 'free -h; nvidia-smi'

7.1 KiB Raw Blame History