# OpenCode
**AI-Powered Coding Agent CLI**
## Service Overview
| Property | Value |
|----------|-------|
| **Service Name** | opencode |
| **Category** | AI / Development |
| **Hosts** | homelab VM (192.168.0.210), moon (100.64.0.6) |
| **Install** | `curl -fsSL https://opencode.ai/install \| bash` |
| **Config** | `~/.config/opencode/opencode.json` |
| **LLM Backend** | Olares Ollama (Qwen3-Coder 30B A3B) |
| **Agent Name** | Vesper |
## Purpose
OpenCode is an interactive CLI coding agent (similar to Claude Code) that connects to local LLM backends for AI-assisted software engineering. It runs on developer workstations and connects to the Olares Kubernetes appliance for GPU-accelerated inference.
## Architecture
```
Developer Host (homelab VM / moon)
└── opencode CLI
    └── HTTPS → Olares (192.168.0.145)
        └── Ollama (RTX 5090 Max-Q, 24GB VRAM)
            └── qwen3-coder:latest (Qwen3-Coder 30B A3B, Q4_K_M)
```
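A quick way to confirm the chain above is alive is Ollama's native model-listing endpoint. A minimal sketch (definition only; the base URL is this homelab's endpoint from the section below):

```shell
# List models available on the Olares Ollama instance via its native API.
# /api/tags is Ollama's model-listing endpoint.
check_ollama() {
  # $1: base URL, e.g. https://a5be22681.vishinator.olares.com
  curl -fsS --max-time 10 "$1/api/tags"
}
# Usage:
# check_ollama https://a5be22681.vishinator.olares.com
```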
### Ollama Infrastructure
- **Host**: Olares appliance at 192.168.0.145 (SSH: `ssh olares`)
- **Runtime**: Kubernetes (k3s), namespace `ollamaserver-shared`
- **Pod**: `ollama-*` in deployment `ollama`
- **API endpoint**: `https://a5be22681.vishinator.olares.com`
- **GPU**: NVIDIA RTX 5090 Laptop GPU, 24GB VRAM, compute capability 12.0
- **Flash attention**: Enabled (`OLLAMA_FLASH_ATTENTION=1` env var on deployment)
### Models on Ollama
| Model | Size | Context | VRAM Usage | Notes |
|-------|------|---------|------------|-------|
| `qwen3-coder:latest` | 18GB | 32k tokens | ~22GB (fits in VRAM) | **Default for everything** |
| `qwen3-coder-65k:latest` | 18GB | 65k tokens | ~25.3GB (spills to system RAM) | Experimental, not recommended (see below) |
| `devstral-small-2:latest` | 15GB | 32k tokens | — | Alternative model |
### Shared LLM — All Services Use the Same Model
`qwen3-coder:latest` is used by opencode, email organizers (3 accounts), and AnythingLLM. Since Ollama only keeps one model in VRAM at a time on 24GB, everything must use the same model name to avoid constant load/unload cycles (~12s each swap).
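The single-model rule can be checked mechanically. A hedged sketch that compares model names extracted from each service's config (extraction is left to the caller, since the organizer/AnythingLLM config paths vary):

```shell
# Fail if any service is configured with a different model name, which would
# trigger Ollama's ~12s load/unload cycle on every cross-service request.
same_model() {
  # $@: model names read from each service's config
  first="$1"
  for m in "$@"; do
    if [ "$m" != "$first" ]; then
      echo "mismatch: $m vs $first"
      return 1
    fi
  done
  echo "all services use $first"
}
# Usage:
# same_model "qwen3-coder:latest" "qwen3-coder:latest" "qwen3-coder:latest"
```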
## Configuration
Config: `~/.config/opencode/opencode.json`
### Default Model
```json
"model": "olares-qwen3-coder//qwen3-coder:latest"
```
Context is set to 40k in the opencode config, while Ollama actually loads the model with a 32k window. This matches the original configuration from before the vLLM endpoint went down.
### Agent Personality (Vesper)
OpenCode is configured with a personality via both `instructions` in the config and `AGENTS.md` in the repo root:
- **Name**: Vesper
- **Style**: Concise, witty, competent — executes commands directly instead of explaining
- **Guardian role**: Proactively warns about bad practices (secrets in git, missing dry-runs, open permissions)
- **Safety practices**: Works in branches, dry-runs first, backs up before modifying, verifies after acting
### Configured Provider
Single provider (dead vLLM endpoints were removed):
```json
"olares-qwen3-coder": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "Olares Ollama (Qwen3-Coder)",
  "options": { "baseURL": "https://a5be22681.vishinator.olares.com/v1" },
  "models": {
    "qwen3-coder:latest": { "context": 40000, "output": 8192 }
  }
}
```
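Because the provider is OpenAI-compatible, the endpoint can be smoke-tested with a plain model listing before opencode is pointed at it (sketch, definition only):

```shell
# List models via the OpenAI-compatible surface that opencode actually uses.
check_endpoint() {
  # $1: baseURL exactly as in the provider config (includes /v1)
  curl -fsS --max-time 10 "$1/models"
}
# Usage:
# check_endpoint https://a5be22681.vishinator.olares.com/v1
```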
### Permissions (Full Autonomy)
```json
"permission": {
  "bash": "allow",
  "edit": "allow",
  "write": "allow",
  "read": "allow",
  "glob": "allow",
  "grep": "allow",
  "question": "allow",
  "external_directory": "allow",
  "mcp": "allow"
}
```
### Loop Prevention
```json
"mode": {
  "build": { "steps": 50, "permission": { "doom_loop": "deny" } },
  "plan": { "steps": 25, "permission": { "doom_loop": "deny" } }
}
```
### MCP Integration
The homelab MCP server is configured on the homelab VM:
```json
"mcp": {
  "homelab": {
    "type": "local",
    "command": ["python3", "/home/homelab/organized/repos/homelab/scripts/homelab-mcp/server.py"],
    "enabled": true
  }
}
```
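Since a broken MCP process fails silently inside opencode, it can be worth checking the script compiles before opencode spawns it. A small sketch (`check_mcp` is a hypothetical helper, not part of opencode):

```shell
# Verify the MCP server script exists and is valid Python before opencode
# tries to spawn it as a local MCP process.
check_mcp() {
  # $1: path to server.py from the config above
  [ -f "$1" ] || { echo "missing: $1"; return 1; }
  python3 -m py_compile "$1" || return 1
  echo "mcp server ok"
}
# Usage:
# check_mcp /home/homelab/organized/repos/homelab/scripts/homelab-mcp/server.py
```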
## Host-Specific Setup
### homelab VM (192.168.0.210)
- **User**: homelab
- **Binary**: `~/.opencode/bin/opencode`
- **Config**: `~/.config/opencode/opencode.json`
- **Backup**: `~/.config/opencode/opencode.json.bak.*`
- **MCP**: homelab MCP server enabled
### moon (100.64.0.6 via Tailscale)
- **User**: moon (access via `ssh moon`, then `sudo -i su - moon`)
- **Binary**: `~/.opencode/bin/opencode`
- **Config**: `~/.config/opencode/opencode.json`
- **May need config updated** to point at active Ollama endpoint
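Moon's stale config can be repointed in place. A sketch using python3 to edit the JSON (assumes the provider block lives under a top-level `provider` key in `opencode.json`; `set_endpoint` is a hypothetical helper):

```shell
# Rewrite the baseURL of one provider inside an opencode.json.
set_endpoint() {
  # $1: config path  $2: provider id  $3: new baseURL
  python3 - "$1" "$2" "$3" <<'PY'
import json, sys
path, provider, url = sys.argv[1:4]
with open(path) as f:
    cfg = json.load(f)
cfg["provider"][provider]["options"]["baseURL"] = url
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
PY
}
# Usage:
# set_endpoint ~/.config/opencode/opencode.json olares-qwen3-coder \
#   https://a5be22681.vishinator.olares.com/v1
```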
## Failed Experiment: 65k Context (2026-03-24)
Attempted to increase context from 32k to 65k to reduce compaction in opencode. Did not work well.
### What Was Tried
1. **Created `qwen3-coder-65k` model** — Modelfile wrapper with `PARAMETER num_ctx 65536` around the same weights as `qwen3-coder:latest`
2. **Enabled flash attention** — set `OLLAMA_FLASH_ATTENTION=1` on the Ollama k8s deployment. This allowed the 65k context to load (it wouldn't fit without it)
3. **Pointed all services** (opencode, email organizers, AnythingLLM) at the 65k model
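Step 1 amounts to a two-line Modelfile (reconstructed here from the description above; the exact original wasn't preserved):

```shell
# Wrapper Modelfile: same weights as qwen3-coder:latest, larger context window.
cat > Modelfile.65k <<'EOF'
FROM qwen3-coder:latest
PARAMETER num_ctx 65536
EOF
# Then, inside the Ollama pod:
# ollama create qwen3-coder-65k -f Modelfile.65k
```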
### What Happened
- The 65k model loaded but used **25.3GB VRAM** on a 24GB GPU — the ~1.3GB overflow spilled to system RAM via resizable BAR
- OpenCode still compacted constantly — the model's behavior (mass-globbing 50 files, web fetching full pages) consumed context faster than the extra headroom helped
- Having two model names (`qwen3-coder:latest` and `qwen3-coder-65k:latest`) caused Ollama to constantly swap models in VRAM when different services used different names
### Why It Failed
The compaction wasn't a context size problem — it was a **model behavior problem**. Qwen3-Coder 30B with opencode's system prompt + MCP tool definitions (~15-20k tokens) leaves only ~12-15k for conversation at 32k. One or two large tool results (glob with 50 matches, web fetch) fills the remainder. More context just delays the inevitable by one more tool call.
### What Was Reverted
- OpenCode and email organizers back to `qwen3-coder:latest` (32k actual, 40k in config)
- Flash attention left enabled (harmless, improves VRAM efficiency)
- `qwen3-coder-65k` model left on Ollama (unused, can be removed)
### To Remove the 65k Model
```bash
ssh olares "sudo k3s kubectl exec -n ollamaserver-shared \$(sudo k3s kubectl get pod -n ollamaserver-shared -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama rm qwen3-coder-65k"
```
### To Disable Flash Attention
```bash
ssh olares "sudo k3s kubectl set env deployment/ollama -n ollamaserver-shared -c ollama OLLAMA_FLASH_ATTENTION-"
```
### What Would Actually Fix Compaction
- **More VRAM** (48GB+ GPU) to run 65k+ context without spill
- **Smarter model** that doesn't waste context on mass globs and web fetches
- **Fewer MCP tools** registered (each tool definition consumes tokens in every request)
## Requirements
- **Tool calling support required** — OpenCode sends tools with every request. Models without tool call templates return 400 errors
- **Large context needed** — System prompt + tool definitions use ~15-20k tokens. Models with less than 32k context will fail
- **Flash attention recommended** — `OLLAMA_FLASH_ATTENTION=1` on the Ollama deployment allows larger contexts within VRAM limits
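The tool-calling requirement is visible in the request shape: every opencode call carries a `tools` array in the standard OpenAI-compatible form, roughly like this (the tool name and schema below are illustrative, not opencode's actual definitions):

```shell
# Minimal chat request with a tools array, in the OpenAI-compatible shape.
req='{
  "model": "qwen3-coder:latest",
  "messages": [{"role": "user", "content": "list files"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "bash",
      "description": "run a shell command",
      "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}}
      }
    }
  }]
}'
echo "$req" | python3 -m json.tool > /dev/null && echo "payload ok"
# A model without a tool-call template answers this with HTTP 400:
# curl -fsS "$BASE/v1/chat/completions" -H 'Content-Type: application/json' -d "$req"
```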
## Troubleshooting
| Error | Cause | Fix |
|-------|-------|-----|
| `bad request` / 400 | Model doesn't support tools, or context exceeded | Switch to model with tool calling support |
| `model not found` | Wrong model name (e.g., `qwen3:coder` vs `qwen3-coder:latest`) | Check `ollama list` for exact names |
| Constant compaction | Model consuming context with large tool results | Reduce web fetches, use targeted globs, or increase VRAM |
| 502 Bad Gateway | Ollama pod restarting or endpoint down | Check pod: `ssh olares "sudo k3s kubectl get pods -n ollamaserver-shared"` |
| Stuck in loops | Model keeps retrying failed tool calls | `doom_loop: "deny"` and reduce `steps` |
| Won't run ansible | Model too cautious, AGENTS.md too restrictive | Check instructions in config and AGENTS.md |
| Web fetch eating context | Model searching internet for local info | Instructions tell it to read local files first |
| Model swap lag | Different services using different model names | Ensure all services use the same model name |
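Several rows above funnel into the same three checks; a combined sketch (definition only, run from any host with SSH access to olares):

```shell
# Pod health, recent logs, and which model is resident, in one pass.
diagnose_ollama() {
  ssh olares "sudo k3s kubectl get pods -n ollamaserver-shared"
  ssh olares "sudo k3s kubectl logs deployment/ollama -n ollamaserver-shared -c ollama --tail=20"
  ssh olares "sudo k3s kubectl exec -n ollamaserver-shared deploy/ollama -c ollama -- ollama ps"
}
```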