homelab-optimized/docs/services/individual/olares.md

# Olares

**Kubernetes Self-Hosting Platform**

## Service Overview

| Property | Value |
|----------|-------|
| **Host** | olares (192.168.0.145) |
| **OS** | Ubuntu 24.04.3 LTS |
| **Platform** | Olares (Kubernetes/K3s with Calico CNI) |
| **Hardware** | Intel Core Ultra 9 275HX, 96GB DDR5, RTX 5090 Max-Q, 2TB NVMe |
| **SSH** | `ssh olares` (key auth, user: olares) |

## Purpose

Olares is a Kubernetes-based self-hosting platform running on a high-end mini PC. It provides a managed app store for deploying containerized services with built-in auth (Authelia), networking (Envoy sidecars), and GPU scheduling (HAMI).

Primary use case: **local LLM inference** via vLLM and Ollama, exposed as OpenAI-compatible API endpoints for coding agents (OpenCode, OpenClaw).

## LLM Services

Models are deployed via the Olares app store and served as OpenAI-compatible APIs. Each model gets a unique subdomain under `*.vishinator.olares.com`.

### Available Models

| Model | Backend | Namespace | Endpoint | Context | Notes |
|-------|---------|-----------|----------|---------|-------|
| Qwen3-Coder 30B | Ollama | `ollamaserver-shared` | `https://a5be22681.vishinator.olares.com/v1` | 65k tokens | MoE (3.3B active), coding-focused, currently active |
| Qwen3 30B A3B (4-bit) | vLLM | `vllmqwen330ba3bv2server-shared` | `https://04521407.vishinator.olares.com/v1` | ~40k tokens | MoE, fast inference, limited tool calling |
| Qwen3 30B A3B (4-bit) | vLLM | `vllmqwen330ba3binstruct4bitv2-vishinator` | — | ~40k tokens | Duplicate deployment (vishinator namespace) |
| Qwen3.5 27B Q4_K_M | Ollama | `ollamaqwen3527bq4kmv2server-shared` | `https://37e62186.vishinator.olares.com/v1` | 40k+ (262k native) | Dense, best for agentic coding |
| GPT-OSS 20B | vLLM | `vllmgptoss20bv2server-shared` | `https://6941bf89.vishinator.olares.com/v1` | 65k tokens | Requires auth bypass in Olares settings |
| Qwen3.5 9B | Ollama | `ollamaqwen359bv2server-shared` | — | — | Lightweight, scaled to 0 |

### GPU Memory Constraints (RTX 5090 Max-Q, 24 GB VRAM)

- Only run **one model at a time** to avoid VRAM exhaustion
- vLLM `--gpu-memory-utilization 0.95` is the default
- Context limits are determined by available KV cache after model loading
- Use `nvidia-smi` or check vLLM logs for actual KV cache capacity
- Before starting a model, scale down all others (see Scaling Operations below)

### Scaling Operations

Only one model should be loaded at a time due to VRAM constraints. Use these commands to switch between models.

**Check what's running:**
```bash
ssh olares "sudo kubectl get deployments -A | grep -iE 'vllm|ollama'"
ssh olares "nvidia-smi --query-gpu=memory.used,memory.free --format=csv"
```

**Stop all LLM deployments (free GPU):**
```bash
# Qwen3-Coder (Ollama — currently active)
ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=0"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=0"

# Qwen3 30B A3B vLLM (shared)
ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=0"

# Qwen3 30B A3B vLLM (vishinator)
ssh olares "sudo kubectl scale deployment vllmqwen330ba3binstruct4bitv2 -n vllmqwen330ba3binstruct4bitv2-vishinator --replicas=0"

# Qwen3.5 27B Ollama
ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=0"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=0"

# GPT-OSS 20B vLLM
ssh olares "sudo kubectl scale deployment vllm -n vllmgptoss20bv2server-shared --replicas=0"
```

**Start Qwen3-Coder (Ollama):**
```bash
ssh olares "sudo kubectl scale deployment ollama -n ollamaserver-shared --replicas=1"
ssh olares "sudo kubectl scale deployment terminal -n ollamaserver-shared --replicas=1"
```

**Start Qwen3 30B A3B (vLLM):**
```bash
ssh olares "sudo kubectl scale deployment vllm -n vllmqwen330ba3bv2server-shared --replicas=1"
# Wait 2-3 minutes for vLLM startup, then check:
ssh olares "sudo kubectl logs -n vllmqwen330ba3bv2server-shared -l io.kompose.service=vllm --tail=5"
```

**Start Qwen3.5 27B (Ollama):**
```bash
ssh olares "sudo kubectl scale deployment ollama -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
ssh olares "sudo kubectl scale deployment api -n ollamaqwen3527bq4kmv2server-shared --replicas=1"
```

**Unload a model from Ollama (without scaling down the pod):**
```bash
ssh olares "sudo kubectl exec -n ollamaserver-shared \$(sudo kubectl get pods -n ollamaserver-shared -l io.kompose.service=ollama -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama stop qwen3-coder:latest"
```

### vLLM max_model_len

The `max_model_len` parameter is set in the deployment command args. To check the hardware-safe maximum, look at vLLM startup logs:

```
Available KV cache memory: X.XX GiB
GPU KV cache size: XXXXX tokens
```

To change it, either:
1. Edit in the **Olares app settings UI** (persistent across redeploys)
2. Patch the deployment directly (resets on redeploy):
   ```bash
   kubectl get deployment vllm -n <namespace> -o json > /tmp/patch.json
   # Edit max-model-len in the command string
   kubectl apply -f /tmp/patch.json
   ```

## OpenClaw (Chat Agent)

OpenClaw runs as a Kubernetes app in the `clawdbot-vishinator` namespace.

### Configuration

Config file inside the pod: `/home/node/.openclaw/openclaw.json`

To read/write config:
```bash
ssh olares
sudo kubectl exec -n clawdbot-vishinator <pod> -c clawdbot -- cat /home/node/.openclaw/openclaw.json
```

### Key Settings

- **Compaction**: `mode: "safeguard"` with `maxHistoryShare: 0.5` prevents context overflow
- **contextWindow**: Must match vLLM's actual `max_model_len` (not the model's native limit)
- **Workspace data**: Lives at `/home/node/.openclaw/workspace/` inside the pod
- **Brew packages**: OpenClaw has Homebrew; install tools with `brew install <pkg>` from the agent or pod

### Troubleshooting

| Error | Cause | Fix |
|-------|-------|-----|
| `localhost:8000 connection refused` | Model provider not configured or not running | Check model endpoint URL in config, verify vLLM pod is running |
| `Context overflow` | Prompt exceeded model's context limit | Enable compaction, or `/reset` the session |
| `pairing required` (WebSocket 1008) | Device pairing data was cleared | Reload the Control UI page to re-pair |
| `does not support tools` (400) | Ollama model lacks tool calling template | Use vLLM with `--enable-auto-tool-choice` instead of Ollama |
| `max_tokens must be at least 1, got negative` | Context window too small for system prompt + tools | Increase `max_model_len` (vLLM) or `num_ctx` (Ollama) |
| `bad request` / 400 from Ollama | Request exceeds `num_ctx` | Increase `num_ctx` in Modelfile: `ollama create model -f Modelfile` |
| 302 redirect on model endpoint | Olares auth (Authelia) blocking API access | Disable auth for the endpoint in Olares app settings |
| vLLM server pod scaled to 0 | Previously stopped, client pod crashes | Scale up: `kubectl scale deployment vllm -n <namespace> --replicas=1` |

## OpenCode Configuration

OpenCode on the homelab VM and moon are configured to use these endpoints.

### Config Location

- **homelab VM**: `~/.config/opencode/opencode.json`
- **moon**: `~/.config/opencode/opencode.json` (user: moon)

### Model Switching

Change the `"model"` field in `opencode.json`:

```json
"model": "olares//models/qwen3-30b"
```

Available provider/model strings:
- `olares//models/qwen3-30b` (recommended — supports tool calling via vLLM)
- `olares-gptoss//models/gpt-oss-20b`
- `olares-qwen35/qwen3.5:27b-q4_K_M` (Ollama — does NOT support tool calling, avoid for OpenCode)

**Important**: OpenCode requires tool/function calling support. Ollama models often lack tool call templates, causing 400 errors. Use vLLM with `--enable-auto-tool-choice --tool-call-parser hermes` for reliable tool use.

### Loop Prevention

```json
"mode": {
  "build": {
    "steps": 25,
    "permission": { "doom_loop": "deny" }
  },
  "plan": {
    "steps": 15,
    "permission": { "doom_loop": "deny" }
  }
}
```

## Built-in Services

Olares runs its own infrastructure in Kubernetes:

- **Headscale + Tailscale**: Internal mesh network (separate tailnet from homelab, IP 100.64.0.1)
- **Authelia**: SSO/auth gateway for app endpoints
- **Envoy**: Sidecar proxy for all apps
- **HAMI**: GPU device scheduler for vLLM/Ollama pods
- **Prometheus**: Metrics collection

## Network

| Interface | IP | Notes |
|-----------|-----|-------|
| LAN (enp129s0) | 192.168.0.145/24 | Primary access |
| Tailscale (K8s pod) | 100.64.0.1 | Olares internal tailnet only |

Note: The host does **not** have Tailscale installed directly. The K8s Tailscale pod uses `tailscale0` and conflicts with host-level tailscale (causes network outage if both run). Access via LAN only.

## Known Issues

- **Do NOT install host-level Tailscale** — it conflicts with the K8s Tailscale pod's `tailscale0` interface and causes total network loss requiring physical reboot
- **Ollama Qwen3.5 27B lacks tool calling** — Ollama's model template doesn't support tools; use vLLM for coding agents
- **Only run one model at a time** — running multiple vLLM instances exhausts 24GB VRAM; scale unused deployments to 0
- **vLLM startup takes 2-3 minutes** — requests during startup return 502/connection refused; wait for "Application startup complete" in logs
- **Olares auth (Authelia) blocks API endpoints by default** — new model endpoints need auth bypass configured in Olares app settings

## Maintenance

### Reboot
```bash
ssh olares 'sudo reboot'
```
Allow 3-5 minutes for K8s pods to come back up. Check with:
```bash
ssh olares 'sudo kubectl get pods -A | grep -v Running'
```

### Memory Management
With 96 GB RAM, multiple models can load into system memory but GPU VRAM is the bottleneck. Monitor with:
```bash
ssh olares 'free -h; nvidia-smi'
```