# OpenCode
**AI-Powered Coding Agent CLI**
## Service Overview
| Property | Value |
|----------|-------|
| **Service Name** | opencode |
| **Category** | AI / Development |
| **Hosts** | homelab VM (192.168.0.210), moon (100.64.0.6) |
| **Install** | `curl -fsSL https://opencode.ai/install \| bash` |
| **Config** | `~/.config/opencode/opencode.json` |
| **LLM Backend** | Olares Ollama (Qwen3-Coder 30B A3B) |
| **Agent Name** | Vesper |
## Purpose
OpenCode is an interactive CLI coding agent (similar to Claude Code) that connects to local LLM backends for AI-assisted software engineering. It runs on developer workstations and connects to the Olares Kubernetes appliance for GPU-accelerated inference.
## Architecture
```
Developer Host (homelab VM / moon)
└── opencode CLI
    └── HTTPS → Olares (192.168.0.145)
        └── Ollama (RTX 5090 Max-Q, 24GB VRAM)
            └── qwen3:32b (Qwen3-Coder 30B A3B, Q4_K_M)
```
### Ollama Infrastructure
- **Host**: Olares appliance at 192.168.0.145 (SSH: `ssh olares`)
- **Runtime**: Kubernetes (k3s), namespace `ollamaserver-shared`
- **Pod**: `ollama-*` in deployment `ollama`
- **API endpoint**: `https://a5be22681.vishinator.olares.com`
- **GPU**: NVIDIA RTX 5090 Laptop GPU, 24GB VRAM, compute capability 12.0
- **Flash attention**: Enabled (`OLLAMA_FLASH_ATTENTION=1` env var on deployment)
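
To confirm the flag is still set, `kubectl set env --list` prints the deployment's current env vars (a sketch, reusing the same `ssh olares` + k3s invocation style as the commands later in this doc):

```bash
# OLLAMA_FLASH_ATTENTION=1 should appear in the output
ssh olares "sudo k3s kubectl set env deployment/ollama -n ollamaserver-shared --list"
```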
### Models on Ollama
| Model | Size | Context | VRAM Usage | Notes |
|-------|------|---------|------------|-------|
| `qwen3:32b` | 18GB | 32k tokens | ~22GB (fits in VRAM) | **Default for everything** |
| `qwen3:32b-65k:latest` | 18GB | 65k tokens | ~25.3GB (spills to system RAM) | Experimental, not recommended (see below) |
| `devstral-small-2:latest` | 15GB | 32k tokens | — | Alternative model |
### Shared LLM — All Services Use the Same Model
`qwen3:32b` is used by opencode, the email organizers (3 accounts), and AnythingLLM. Because the 24GB GPU holds only one of these models in VRAM at a time, every service must use the same model name to avoid constant load/unload cycles (~12s per swap).
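
To see which model is currently resident in VRAM, Ollama exposes a running-models endpoint (sketch, using the API URL listed above):

```bash
# Lists models currently loaded; ideally only qwen3:32b appears
curl -s https://a5be22681.vishinator.olares.com/api/ps
```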
## Configuration
Config: `~/.config/opencode/opencode.json`
### Default Model
```json
"model": "olares-qwen3:32b//qwen3:32b"
```
Context is set to 40k in the opencode config (Ollama physically loads 32k). This matches the original configuration before the vLLM endpoint went down.
### Agent Personality (Vesper)
OpenCode is configured with a personality via both `instructions` in the config and `AGENTS.md` in the repo root:
- **Name**: Vesper
- **Style**: Concise, witty, competent — executes commands directly instead of explaining
- **Guardian role**: Proactively warns about bad practices (secrets in git, missing dry-runs, open permissions)
- **Safety practices**: Works in branches, dry-runs first, backs up before modifying, verifies after acting
### Configured Provider
Single provider (dead vLLM endpoints were removed):
```json
"olares-qwen3:32b": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "Olares Ollama (Qwen3-Coder)",
  "options": { "baseURL": "https://a5be22681.vishinator.olares.com/v1" },
  "models": {
    "qwen3:32b": { "context": 40000, "output": 8192 }
  }
}
```
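
Because the provider is OpenAI-compatible, the endpoint can be smoke-tested outside opencode with a plain chat completion (a sketch; model name and base URL taken from the config above):

```bash
curl -s https://a5be22681.vishinator.olares.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3:32b", "messages": [{"role": "user", "content": "hello"}]}'
```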
### Permissions (Full Autonomy)
```json
"permission": {
  "bash": "allow",
  "edit": "allow",
  "write": "allow",
  "read": "allow",
  "glob": "allow",
  "grep": "allow",
  "question": "allow",
  "external_directory": "allow",
  "mcp": "allow"
}
```
### Loop Prevention
```json
"mode": {
  "build": { "steps": 50, "permission": { "doom_loop": "deny" } },
  "plan": { "steps": 25, "permission": { "doom_loop": "deny" } }
}
```
### MCP Integration
The homelab MCP server is configured on the homelab VM:
```json
"mcp": {
  "homelab": {
    "type": "local",
    "command": ["python3", "/home/homelab/organized/repos/homelab/scripts/homelab-mcp/server.py"],
    "enabled": true
  }
}
```
## Host-Specific Setup
### homelab VM (192.168.0.210)
- **User**: homelab
- **Binary**: `~/.opencode/bin/opencode`
- **Config**: `~/.config/opencode/opencode.json`
- **Backup**: `~/.config/opencode/opencode.json.bak.*`
- **MCP**: homelab MCP server enabled
### moon (100.64.0.6 via Tailscale)
- **User**: moon (access via `ssh moon`, then `sudo -i su - moon`)
- **Binary**: `~/.opencode/bin/opencode`
- **Config**: `~/.config/opencode/opencode.json`
- **Config may need updating** to point at the active Ollama endpoint
## Failed Experiment: 65k Context (2026-03-24)
Attempted to increase context from 32k to 65k to reduce compaction in opencode. Did not work well.
### What Was Tried
1. **Created `qwen3:32b-65k` model** — Modelfile wrapper with `PARAMETER num_ctx 65536` around the same weights as `qwen3:32b`
2. **Enabled flash attention** by setting `OLLAMA_FLASH_ATTENTION=1` on the Ollama k8s deployment. This allowed the 65k context to load (it wouldn't fit without it)
3. **Pointed all services** (opencode, email organizers, AnythingLLM) at the 65k model
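
The Modelfile wrapper from step 1 needs only two lines (a reconstruction; the exact file was not preserved):

```
# Reuses the qwen3:32b weights, overriding only the context window
FROM qwen3:32b
PARAMETER num_ctx 65536
```

Built with `ollama create qwen3:32b-65k -f Modelfile` inside the Ollama pod.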
### What Happened
- The 65k model loaded but used **25.3GB VRAM** on a 24GB GPU — the ~1.3GB overflow spilled to system RAM via resizable BAR
- OpenCode still compacted constantly — the model's behavior (mass-globbing 50 files, web fetching full pages) consumed context faster than the extra headroom helped
- Having two model names (`qwen3:32b` and `qwen3:32b-65k:latest`) caused Ollama to constantly swap models in VRAM when different services used different names
### Why It Failed
The compaction wasn't a context size problem — it was a **model behavior problem**. Qwen3-Coder 30B with opencode's system prompt + MCP tool definitions (~15-20k tokens) leaves only ~12-15k for conversation at 32k. One or two large tool results (glob with 50 matches, web fetch) fills the remainder. More context just delays the inevitable by one more tool call.
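
The budget arithmetic is easy to verify (a sketch; the 18k overhead figure is a midpoint of the ~15-20k estimate above, not a measured value):

```shell
# Context left for conversation after fixed overhead at 32k
TOTAL=32768       # tokens Ollama actually loads
OVERHEAD=18000    # assumed: system prompt + MCP tool definitions (~15-20k)
echo "remaining: $((TOTAL - OVERHEAD)) tokens"   # prints "remaining: 14768 tokens"
```

One 50-match glob result or a full web fetch can consume most of that remainder in a single tool call.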
### What Was Reverted
- OpenCode and email organizers back to `qwen3:32b` (32k actual, 40k in config)
- Flash attention left enabled (harmless, improves VRAM efficiency)
- `qwen3:32b-65k` model left on Ollama (unused, can be removed)
### To Remove the 65k Model
```bash
ssh olares "sudo k3s kubectl exec -n ollamaserver-shared \$(sudo k3s kubectl get pod -n ollamaserver-shared -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama rm qwen3:32b-65k"
```
### To Disable Flash Attention
```bash
ssh olares "sudo k3s kubectl set env deployment/ollama -n ollamaserver-shared -c ollama OLLAMA_FLASH_ATTENTION-"
```
### What Would Actually Fix Compaction
- **More VRAM** (48GB+ GPU) to run 65k+ context without spill
- **Smarter model** that doesn't waste context on mass globs and web fetches
- **Fewer MCP tools** registered (each tool definition consumes tokens in every request)
## Failed Experiment: vLLM with Qwen3-30B-A3B AWQ (2026-03-30)
Attempted to replace Ollama with vLLM for better inference performance and context handling. Did not work for agentic coding.
### What Was Tried
1. **Deployed vLLM via raw kubectl** on Olares — `vllm/vllm-openai:latest` with model `cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit` (compressed-tensors quantization, MoE with 3B active params)
2. **Overcame multiple deployment issues:**
- HAMI GPU scheduler requires Olares-specific labels (`applications.app.bytetrade.io/name`, `hami.io/vgpu-node`, etc.) — pods won't schedule without them
- Kubernetes service named `vllm` injected `VLLM_PORT=tcp://...` env var, conflicting with vLLM's own `VLLM_PORT` config — renamed service to `vllm-server`
- Model uses `compressed-tensors` quantization, not `awq` — vLLM auto-detects when `--quantization` flag is omitted
- `nvidia.com/gpumem` resource limit needed for HAMI to allocate VRAM (set to `20480` MiB)
3. **Got vLLM running and serving** — model loaded, CUDA graphs compiled, inference working via ClusterIP
4. **NodePort (30800) didn't work** — Olares networking blocks external NodePort access. Used SSH tunnel instead (`autossh -L 30800:ClusterIP:8000 olares`)
5. **Added `--enable-auto-tool-choice --tool-call-parser hermes`** for OpenCode tool calling support
6. **Added `olares-vllm` provider preset** to OpenCode config alongside existing Ollama preset
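
The `vllm-tunnel.service` unit from step 4 looked roughly like this (a reconstruction; the installed unit was not preserved, so paths and options are assumptions, and `ClusterIP` stands in for the actual service IP):

```ini
[Unit]
Description=SSH tunnel to vLLM on Olares
After=network-online.target

[Service]
# Forward local 30800 to the vLLM ClusterIP inside the cluster
ExecStart=/usr/bin/autossh -M 0 -N -L 30800:ClusterIP:8000 olares
Restart=always

[Install]
WantedBy=multi-user.target
```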
### Why It Failed
- **16K context too small** — OpenCode's system prompt + instructions + MCP tool definitions = ~20K tokens, exceeding the model's `max_model_len` of 16384. Error: `maximum context length is 16384 tokens, your request has 20750 input tokens`
- Increasing `max_model_len` wasn't viable — only 1.56 GiB KV cache available after model loading, barely enough for 17K tokens at 16384
- The A3B model (3B active params) also has weaker reasoning than the dense 30B, making it doubly unsuitable for agentic coding
### What Was Reverted
- OpenCode switched back to `olares-qwen3:32b//qwen3:32b` (Ollama, dense 30B, 32K+ context)
- vLLM deployment scaled to 0 (namespace `vllm-qwen3:32b` still exists)
- `vllm-tunnel.service` (autossh systemd unit) stopped and disabled on homelab-vm
- ComfyUI was scaled down during testing, can be re-enabled
### Artifacts Left Behind
- **Olares k8s namespace**: `vllm-qwen3:32b` (deployment scaled to 0, PVC with cached model ~10GB)
- **Systemd unit**: `/etc/systemd/system/vllm-tunnel.service` (disabled)
- **OpenCode config**: `olares-vllm` provider preset remains but is not the active model
- **HuggingFace model cache**: On PVC `vllm-model-cache` in `vllm-qwen3:32b` namespace
### Cleanup (if desired)
```bash
# Delete the vLLM deployment and namespace
ssh olares "kubectl delete namespace vllm-qwen3:32b"
# Remove the tunnel service
sudo rm /etc/systemd/system/vllm-tunnel.service
sudo systemctl daemon-reload
# Remove vLLM provider from opencode config
# Edit ~/.config/opencode/opencode.json and remove the "olares-vllm" block
```
### Lessons Learned
- **MoE models with small active params are poor for agentic coding** — tool schemas alone can exceed their context limits
- **Olares raw kubectl deployments bypass the app framework** — no managed URLs, no auth integration, no ingress. Use Studio or Market for proper integration
- **HAMI GPU scheduler needs specific pod labels** — any GPU workload deployed outside Olares Market needs `applications.app.bytetrade.io/name` and related labels
- **Kubernetes service names can collide with app env vars** — never name a k8s service the same as the app binary (vLLM reads `VLLM_PORT` which k8s auto-sets from service discovery)
- **Dense Qwen3-Coder 30B via Ollama remains the best option** for this hardware — sufficient context (32K), good reasoning, and Ollama's auto-unload keeps VRAM available for other apps
## Requirements
- **Tool calling support required** — OpenCode sends tools with every request. Models without tool call templates return 400 errors
- **Large context needed** — System prompt + tool definitions use ~15-20k tokens. Models with less than 32k context will fail
- **Flash attention recommended** — `OLLAMA_FLASH_ATTENTION=1` on the Ollama deployment allows larger contexts within VRAM limits
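
A quick way to check whether a backend supports tool calling is to send a minimal request with a `tools` array; models without a tool-call template typically return a 400 (a sketch, using the endpoint from this doc and a throwaway `noop` function schema):

```bash
# Prints the HTTP status: 200 means the model accepted the tool schema
curl -s -o /dev/null -w '%{http_code}\n' \
  https://a5be22681.vishinator.olares.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3:32b",
       "messages": [{"role": "user", "content": "hi"}],
       "tools": [{"type": "function", "function": {"name": "noop",
         "description": "does nothing",
         "parameters": {"type": "object", "properties": {}}}}]}'
```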
## Troubleshooting
| Error | Cause | Fix |
|-------|-------|-----|
| `bad request` / 400 | Model doesn't support tools, or context exceeded | Switch to model with tool calling support |
| `model not found` | Wrong model name (e.g., `qwen3:coder` vs `qwen3:32b`) | Check `ollama list` for exact names |
| Constant compaction | Model consuming context with large tool results | Reduce web fetches, use targeted globs, or increase VRAM |
| 502 Bad Gateway | Ollama pod restarting or endpoint down | Check pod: `ssh olares "sudo k3s kubectl get pods -n ollamaserver-shared"` |
| Stuck in loops | Model keeps retrying failed tool calls | `doom_loop: "deny"` and reduce `steps` |
| Won't run Ansible | Model too cautious, AGENTS.md too restrictive | Review instructions in the config and AGENTS.md |
| Web fetch eating context | Model searching the internet for local info | Ensure instructions tell it to read local files first |
| Model swap lag | Different services using different model names | Ensure all services use the same model name |
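
A first triage step for most rows in this table is to confirm the endpoint is up and the model name is exact (command sketch, reusing the invocation style from earlier sections):

```bash
# 1. Endpoint reachable? (expect 200)
curl -s -o /dev/null -w '%{http_code}\n' https://a5be22681.vishinator.olares.com/v1/models
# 2. Exact model names as Ollama sees them
ssh olares "sudo k3s kubectl exec -n ollamaserver-shared deploy/ollama -c ollama -- ollama list"
```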