Sanitized mirror from private repository - 2026-04-05 10:05:14 UTC

OpenCode

AI-Powered Coding Agent CLI

Service Overview

| Property | Value |
| --- | --- |
| Service Name | opencode |
| Category | AI / Development |
| Hosts | homelab VM (192.168.0.210), moon (100.64.0.6) |
| Install | curl -fsSL https://opencode.ai/install \| bash |
| Config | ~/.config/opencode/opencode.json |
| LLM Backend | Olares Ollama (Qwen3-Coder 30B A3B) |
| Agent Name | Vesper |

Purpose

OpenCode is an interactive CLI coding agent (similar to Claude Code) that connects to local LLM backends for AI-assisted software engineering. It runs on developer workstations and connects to the Olares Kubernetes appliance for GPU-accelerated inference.

Architecture

Developer Host (homelab VM / moon)
  └── opencode CLI
        └── HTTPS → Olares (192.168.0.145)
              └── Ollama (RTX 5090 Max-Q, 24GB VRAM)
                    └── qwen3-coder:latest (Qwen3-Coder 30B A3B, Q4_K_M)

Ollama Infrastructure

  • Host: Olares appliance at 192.168.0.145 (SSH: ssh olares)
  • Runtime: Kubernetes (k3s), namespace ollamaserver-shared
  • Pod: ollama-* in deployment ollama
  • API endpoint: https://a5be22681.vishinator.olares.com
  • GPU: NVIDIA RTX 5090 Laptop GPU, 24GB VRAM, compute capability 12.0
  • Flash attention: Enabled (OLLAMA_FLASH_ATTENTION=1 env var on deployment)

Models on Ollama

| Model | Size | Context | VRAM Usage | Notes |
| --- | --- | --- | --- | --- |
| qwen3-coder:latest | 18GB | 32k tokens | ~22GB (fits in VRAM) | Default for everything |
| qwen3-coder-65k:latest | 18GB | 65k tokens | ~25.3GB (spills to system RAM) | Experimental, not recommended (see below) |
| devstral-small-2:latest | 15GB | 32k tokens | | Alternative model |

Shared LLM — All Services Use the Same Model

qwen3-coder:latest is used by opencode, email organizers (3 accounts), and AnythingLLM. Since Ollama only keeps one model in VRAM at a time on 24GB, everything must use the same model name to avoid constant load/unload cycles (~12s each swap).
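Because a single mismatched config re-triggers the ~12s swap on every request, it is worth checking that all services agree on the model name. A minimal illustrative sketch (the service-to-model mapping below is hypothetical; real values would be read from each service's config):

```python
def find_model_mismatches(service_models: dict[str, str],
                          expected: str = "qwen3-coder:latest") -> list[str]:
    """Return the services whose configured model differs from the shared one."""
    return sorted(s for s, m in service_models.items() if m != expected)

# Hypothetical snapshot of what each service has configured
configured = {
    "opencode": "qwen3-coder:latest",
    "email-organizer-1": "qwen3-coder:latest",
    "anythingllm": "qwen3-coder-65k:latest",  # would trigger ~12s swaps on every alternation
}
print(find_model_mismatches(configured))  # → ['anythingllm']
```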

Configuration

Config: ~/.config/opencode/opencode.json

Default Model

"model": "olares-qwen3-coder//qwen3-coder:latest"

Context is set to 40k in the opencode config (Ollama physically loads 32k). This matches the original configuration before the vLLM endpoint went down.

Agent Personality (Vesper)

OpenCode is configured with a personality via both instructions in the config and AGENTS.md in the repo root:

  • Name: Vesper
  • Style: Concise, witty, competent — executes commands directly instead of explaining
  • Guardian role: Proactively warns about bad practices (secrets in git, missing dry-runs, open permissions)
  • Safety practices: Works in branches, dry-runs first, backs up before modifying, verifies after acting

Configured Provider

Single provider (dead vLLM endpoints were removed):

"olares-qwen3-coder": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "Olares Ollama (Qwen3-Coder)",
  "options": { "baseURL": "https://a5be22681.vishinator.olares.com/v1" },
  "models": {
    "qwen3-coder:latest": { "context": 40000, "output": 8192 }
  }
}
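Since the provider speaks the OpenAI-compatible API, the endpoint can be smoke-tested outside opencode. A sketch, assuming the baseURL above is reachable from the caller and needs no auth header (an assumption; add one if the endpoint requires it):

```python
import json
import urllib.request

BASE_URL = "https://a5be22681.vishinator.olares.com/v1"  # from the provider config above

def build_chat_request(prompt: str, model: str = "qwen3-coder:latest") -> dict:
    """Request body for POST {BASE_URL}/chat/completions (OpenAI-compatible schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def smoke_test(prompt: str) -> str:
    """Send one completion request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

A 400 here usually means the tool/template mismatch described under Troubleshooting; a 502 means the Ollama pod is down.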

Permissions (Full Autonomy)

"permission": {
  "bash": "allow",
  "edit": "allow",
  "write": "allow",
  "read": "allow",
  "glob": "allow",
  "grep": "allow",
  "question": "allow",
  "external_directory": "allow",
  "mcp": "allow"
}

Loop Prevention

"mode": {
  "build": { "steps": 50, "permission": { "doom_loop": "deny" } },
  "plan": { "steps": 25, "permission": { "doom_loop": "deny" } }
}

MCP Integration

The homelab MCP server is configured on the homelab VM:

"mcp": {
  "homelab": {
    "type": "local",
    "command": ["python3", "/home/homelab/organized/repos/homelab/scripts/homelab-mcp/server.py"],
    "enabled": true
  }
}

Host-Specific Setup

homelab VM (192.168.0.210)

  • User: homelab
  • Binary: ~/.opencode/bin/opencode
  • Config: ~/.config/opencode/opencode.json
  • Backup: ~/.config/opencode/opencode.json.bak.*
  • MCP: homelab MCP server enabled

moon (100.64.0.6 via Tailscale)

  • User: moon (access via ssh moon, then sudo -i su - moon)
  • Binary: ~/.opencode/bin/opencode
  • Config: ~/.config/opencode/opencode.json
  • May need config updated to point at active Ollama endpoint

Failed Experiment: 65k Context (2026-03-24)

Attempted to increase context from 32k to 65k to reduce compaction in opencode. Did not work well.

What Was Tried

  1. Created qwen3-coder-65k model — Modelfile wrapper with PARAMETER num_ctx 65536 around the same weights as qwen3-coder:latest
  2. Enabled flash attention: OLLAMA_FLASH_ATTENTION=1 on the Ollama k8s deployment. This allowed the 65k context to load (it wouldn't fit without it)
  3. Pointed all services (opencode, email organizers, AnythingLLM) at the 65k model
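The Modelfile from step 1 is not preserved here; a reconstruction consistent with the description above (same weights as qwen3-coder:latest, larger num_ctx) would look like:

```
FROM qwen3-coder:latest
PARAMETER num_ctx 65536
```

The wrapper model would then have been built inside the Ollama pod with ollama create qwen3-coder-65k -f Modelfile.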

What Happened

  • The 65k model loaded but used 25.3GB VRAM on a 24GB GPU — the ~1.3GB overflow spilled to system RAM via resizable BAR
  • OpenCode still compacted constantly — the model's behavior (mass-globbing 50 files, web fetching full pages) consumed context faster than the extra headroom helped
  • Having two model names (qwen3-coder:latest and qwen3-coder-65k:latest) caused Ollama to constantly swap models in VRAM when different services used different names

Why It Failed

The compaction wasn't a context size problem — it was a model behavior problem. Qwen3-Coder 30B with opencode's system prompt + MCP tool definitions (~15-20k tokens) leaves only ~12-15k for conversation at 32k. One or two large tool results (glob with 50 matches, web fetch) fills the remainder. More context just delays the inevitable by one more tool call.
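The budget arithmetic behind that conclusion can be made explicit. A back-of-envelope sketch using the approximate figures quoted above (all numbers are the doc's estimates, not measurements; in practice the model's mass-globbing consumed the 65k headroom almost as fast):

```python
def conversation_budget(context_window: int, fixed_overhead: int,
                        avg_tool_result: int) -> tuple[int, int]:
    """Tokens left for conversation after fixed overhead, and how many
    average-sized large tool results fit in that remainder."""
    remaining = context_window - fixed_overhead
    return remaining, remaining // avg_tool_result

# ~17.5k fixed overhead (midpoint of the 15-20k system prompt + MCP tool definitions),
# ~8k assumed for one large tool result (a 50-file glob or full web-page fetch)
for ctx in (32_000, 65_536):
    left, fits = conversation_budget(ctx, 17_500, 8_000)
    print(f"{ctx} tokens context -> {left} left for conversation (~{fits} large tool results)")
```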

What Was Reverted

  • OpenCode and email organizers back to qwen3-coder:latest (32k actual, 40k in config)
  • Flash attention left enabled (harmless, improves VRAM efficiency)
  • qwen3-coder-65k model left on Ollama (unused, can be removed)

To Remove the 65k Model

ssh olares "sudo k3s kubectl exec -n ollamaserver-shared \$(sudo k3s kubectl get pod -n ollamaserver-shared -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama rm qwen3-coder-65k"

To Disable Flash Attention

ssh olares "sudo k3s kubectl set env deployment/ollama -n ollamaserver-shared -c ollama OLLAMA_FLASH_ATTENTION-"

What Would Actually Fix Compaction

  • More VRAM (48GB+ GPU) to run 65k+ context without spill
  • Smarter model that doesn't waste context on mass globs and web fetches
  • Fewer MCP tools registered (each tool definition consumes tokens in every request)

Failed Experiment: vLLM with Qwen3-30B-A3B AWQ (2026-03-30)

Attempted to replace Ollama with vLLM for better inference performance and context handling. Did not work for agentic coding.

What Was Tried

  1. Deployed vLLM via raw kubectl on Olares — vllm/vllm-openai:latest with model cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit (compressed-tensors quantization, MoE with 3B active params)
  2. Overcame multiple deployment issues:
    • HAMI GPU scheduler requires Olares-specific labels (applications.app.bytetrade.io/name, hami.io/vgpu-node, etc.) — pods won't schedule without them
    • Kubernetes service named vllm injected VLLM_PORT=tcp://... env var, conflicting with vLLM's own VLLM_PORT config — renamed service to vllm-server
    • Model uses compressed-tensors quantization, not awq — vLLM auto-detects when --quantization flag is omitted
    • nvidia.com/gpumem resource limit needed for HAMI to allocate VRAM (set to 20480 MiB)
  3. Got vLLM running and serving — model loaded, CUDA graphs compiled, inference working via ClusterIP
  4. NodePort (30800) didn't work — Olares networking blocks external NodePort access. Used SSH tunnel instead (autossh -L 30800:ClusterIP:8000 olares)
  5. Added --enable-auto-tool-choice --tool-call-parser hermes for OpenCode tool calling support
  6. Added olares-vllm provider preset to OpenCode config alongside existing Ollama preset
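The vllm-tunnel.service unit from step 4 is not reproduced in this doc; a sketch of what such an autossh unit typically looks like, with the ClusterIP left as a placeholder (it is not recorded here):

```ini
# /etc/systemd/system/vllm-tunnel.service -- illustrative reconstruction
[Unit]
Description=autossh tunnel to vLLM ClusterIP on Olares
After=network-online.target

[Service]
# <CLUSTER-IP> is a placeholder; look it up with:
#   ssh olares "sudo k3s kubectl get svc -n vllm-qwen3-coder"
ExecStart=/usr/bin/autossh -M 0 -N -L 30800:<CLUSTER-IP>:8000 olares
Restart=always

[Install]
WantedBy=multi-user.target
```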

Why It Failed

  • 16K context too small — OpenCode's system prompt + instructions + MCP tool definitions = ~20K tokens, exceeding the model's max_model_len of 16384. Error: maximum context length is 16384 tokens, your request has 20750 input tokens
  • Increasing max_model_len wasn't viable — only 1.56 GiB KV cache available after model loading, barely enough for 17K tokens at 16384
  • The A3B model (3B active params) also has weaker reasoning than the dense 30B, making it doubly unsuitable for agentic coding

What Was Reverted

  • OpenCode switched back to olares-qwen3-coder//qwen3-coder:latest (Ollama, dense 30B, 32K+ context)
  • vLLM deployment scaled to 0 (namespace vllm-qwen3-coder still exists)
  • vllm-tunnel.service (autossh systemd unit) stopped and disabled on homelab-vm
  • ComfyUI was scaled down during testing, can be re-enabled

Artifacts Left Behind

  • Olares k8s namespace: vllm-qwen3-coder (deployment scaled to 0, PVC with cached model ~10GB)
  • Systemd unit: /etc/systemd/system/vllm-tunnel.service (disabled)
  • OpenCode config: olares-vllm provider preset remains but is not the active model
  • HuggingFace model cache: On PVC vllm-model-cache in vllm-qwen3-coder namespace

Cleanup (if desired)

# Delete the vLLM deployment and namespace
ssh olares "sudo k3s kubectl delete namespace vllm-qwen3-coder"

# Remove the tunnel service
sudo rm /etc/systemd/system/vllm-tunnel.service
sudo systemctl daemon-reload

# Remove vLLM provider from opencode config
# Edit ~/.config/opencode/opencode.json and remove the "olares-vllm" block

Lessons Learned

  • MoE models with small active params are poor for agentic coding — tool schemas alone can exceed their context limits
  • Olares raw kubectl deployments bypass the app framework — no managed URLs, no auth integration, no ingress. Use Studio or Market for proper integration
  • HAMI GPU scheduler needs specific pod labels — any GPU workload deployed outside Olares Market needs applications.app.bytetrade.io/name and related labels
  • Kubernetes service names can collide with app env vars — never name a k8s service the same as the app binary (vLLM reads VLLM_PORT which k8s auto-sets from service discovery)
  • Dense Qwen3-Coder 30B via Ollama remains the best option for this hardware — sufficient context (32K), good reasoning, and Ollama's auto-unload keeps VRAM available for other apps
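The service-name collision in the fourth lesson comes from Kubernetes' docker-link-style service environment variables: for every Service, kubelet injects <NAME>_PORT=tcp://<ip>:<port> into pods, with the name upper-cased and dashes mapped to underscores. A quick sketch of the naming rule, showing why vllm collided and vllm-server does not:

```python
def k8s_service_env_var(service_name: str) -> str:
    """Name of the <SERVICE>_PORT env var Kubernetes injects for a Service."""
    return service_name.upper().replace("-", "_") + "_PORT"

print(k8s_service_env_var("vllm"))         # VLLM_PORT -- clobbers vLLM's own VLLM_PORT setting
print(k8s_service_env_var("vllm-server"))  # VLLM_SERVER_PORT -- no collision
```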

Requirements

  • Tool calling support required — OpenCode sends tools with every request. Models without tool call templates return 400 errors
  • Large context needed — System prompt + tool definitions use ~15-20k tokens. Models with less than 32k context will fail
  • Flash attention recommended: OLLAMA_FLASH_ATTENTION=1 on the Ollama deployment allows larger contexts within VRAM limits

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| bad request / 400 | Model doesn't support tools, or context exceeded | Switch to a model with tool calling support |
| model not found | Wrong model name (e.g., qwen3:coder vs qwen3-coder:latest) | Check ollama list for exact names |
| Constant compaction | Model consuming context with large tool results | Reduce web fetches, use targeted globs, or increase VRAM |
| 502 Bad Gateway | Ollama pod restarting or endpoint down | Check pod: ssh olares "sudo k3s kubectl get pods -n ollamaserver-shared" |
| Stuck in loops | Model keeps retrying failed tool calls | doom_loop: "deny" and reduce steps |
| Won't run ansible | Model too cautious, AGENTS.md too restrictive | Check instructions in config and AGENTS.md |
| Web fetch eating context | Model searching the internet for local info | Instructions tell it to read local files first |
| Model swap lag | Different services using different model names | Ensure all services use the same model name |