Sanitized mirror from private repository - 2026-04-05 10:05:14 UTC

OpenCode

AI-Powered Coding Agent CLI

Service Overview

| Property | Value |
| --- | --- |
| Service Name | opencode |
| Category | AI / Development |
| Hosts | homelab VM (192.168.0.210), moon (100.64.0.6) |
| Install | curl -fsSL https://opencode.ai/install \| bash |
| Config | ~/.config/opencode/opencode.json |
| LLM Backend | Olares Ollama (Qwen3-Coder 30B A3B) |
| Agent Name | Vesper |

Purpose

OpenCode is an interactive CLI coding agent (similar to Claude Code) that connects to local LLM backends for AI-assisted software engineering. It runs on developer workstations and connects to the Olares Kubernetes appliance for GPU-accelerated inference.

Architecture

Developer Host (homelab VM / moon)
  └── opencode CLI
        └── HTTPS → Olares (192.168.0.145)
              └── Ollama (RTX 5090 Max-Q, 24GB VRAM)
                    └── qwen3-coder:latest (Qwen3-Coder 30B A3B, Q4_K_M)

Ollama Infrastructure

  • Host: Olares appliance at 192.168.0.145 (SSH: ssh olares)
  • Runtime: Kubernetes (k3s), namespace ollamaserver-shared
  • Pod: ollama-* in deployment ollama
  • API endpoint: https://a5be22681.vishinator.olares.com
  • GPU: NVIDIA RTX 5090 Laptop GPU, 24GB VRAM, compute capability 12.0
  • Flash attention: Enabled (OLLAMA_FLASH_ATTENTION=1 env var on deployment)

Models on Ollama

| Model | Size | Context | VRAM Usage | Notes |
| --- | --- | --- | --- | --- |
| qwen3-coder:latest | 18GB | 32k tokens | ~22GB (fits in VRAM) | Default for everything |
| qwen3-coder-65k:latest | 18GB | 65k tokens | ~25.3GB (spills to system RAM) | Experimental, not recommended (see below) |
| devstral-small-2:latest | 15GB | 32k tokens | | Alternative model |

Shared LLM — All Services Use the Same Model

qwen3-coder:latest is used by opencode, email organizers (3 accounts), and AnythingLLM. Since Ollama only keeps one model in VRAM at a time on 24GB, everything must use the same model name to avoid constant load/unload cycles (~12s each swap).
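Because a single mismatched config re-triggers the ~12s swap on every request, it is worth checking that all services agree on the model name. A minimal illustrative sketch (the service-to-model mapping below is hypothetical; real values would be read from each service's config):

```python
def find_model_mismatches(service_models: dict[str, str],
                          expected: str = "qwen3-coder:latest") -> list[str]:
    """Return the services whose configured model differs from the shared one."""
    return sorted(s for s, m in service_models.items() if m != expected)

# Hypothetical snapshot of what each service has configured
configured = {
    "opencode": "qwen3-coder:latest",
    "email-organizer-1": "qwen3-coder:latest",
    "anythingllm": "qwen3-coder-65k:latest",  # would trigger ~12s swaps on every alternation
}
print(find_model_mismatches(configured))  # → ['anythingllm']
```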

Configuration

Config: ~/.config/opencode/opencode.json

Default Model

"model": "olares-qwen3-coder//qwen3-coder:latest"

Context is set to 40k in the opencode config (Ollama physically loads 32k). This matches the original configuration before the vLLM endpoint went down.

Agent Personality (Vesper)

OpenCode is configured with a personality via both instructions in the config and AGENTS.md in the repo root:

  • Name: Vesper
  • Style: Concise, witty, competent — executes commands directly instead of explaining
  • Guardian role: Proactively warns about bad practices (secrets in git, missing dry-runs, open permissions)
  • Safety practices: Works in branches, dry-runs first, backs up before modifying, verifies after acting

Configured Provider

Single provider (dead vLLM endpoints were removed):

"olares-qwen3-coder": {
  "npm": "@ai-sdk/openai-compatible",
  "name": "Olares Ollama (Qwen3-Coder)",
  "options": { "baseURL": "https://a5be22681.vishinator.olares.com/v1" },
  "models": {
    "qwen3-coder:latest": { "context": 40000, "output": 8192 }
  }
}
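Since the provider speaks the OpenAI-compatible API, the endpoint can be smoke-tested outside opencode. A sketch, assuming the baseURL above is reachable from the caller and needs no auth header (an assumption; add one if the endpoint requires it):

```python
import json
import urllib.request

BASE_URL = "https://a5be22681.vishinator.olares.com/v1"  # from the provider config above

def build_chat_request(prompt: str, model: str = "qwen3-coder:latest") -> dict:
    """Request body for POST {BASE_URL}/chat/completions (OpenAI-compatible schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def smoke_test(prompt: str) -> str:
    """Send one completion request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

A 400 here usually means the tool/template mismatch described under Troubleshooting; a 502 means the Ollama pod is down.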

Permissions (Full Autonomy)

"permission": {
  "bash": "allow",
  "edit": "allow",
  "write": "allow",
  "read": "allow",
  "glob": "allow",
  "grep": "allow",
  "question": "allow",
  "external_directory": "allow",
  "mcp": "allow"
}

Loop Prevention

"mode": {
  "build": { "steps": 50, "permission": { "doom_loop": "deny" } },
  "plan": { "steps": 25, "permission": { "doom_loop": "deny" } }
}

MCP Integration

The homelab MCP server is configured on the homelab VM:

"mcp": {
  "homelab": {
    "type": "local",
    "command": ["python3", "/home/homelab/organized/repos/homelab/scripts/homelab-mcp/server.py"],
    "enabled": true
  }
}

Host-Specific Setup

homelab VM (192.168.0.210)

  • User: homelab
  • Binary: ~/.opencode/bin/opencode
  • Config: ~/.config/opencode/opencode.json
  • Backup: ~/.config/opencode/opencode.json.bak.*
  • MCP: homelab MCP server enabled

moon (100.64.0.6 via Tailscale)

  • User: moon (access via ssh moon, then sudo -i su - moon)
  • Binary: ~/.opencode/bin/opencode
  • Config: ~/.config/opencode/opencode.json
  • May need config updated to point at active Ollama endpoint

Failed Experiment: 65k Context (2026-03-24)

Attempted to increase context from 32k to 65k to reduce compaction in opencode. Did not work well.

What Was Tried

  1. Created qwen3-coder-65k model — Modelfile wrapper with PARAMETER num_ctx 65536 around the same weights as qwen3-coder:latest
  2. Enabled flash attention: OLLAMA_FLASH_ATTENTION=1 on the Ollama k8s deployment. This allowed the 65k context to load (it wouldn't fit without it)
  3. Pointed all services (opencode, email organizers, AnythingLLM) at the 65k model
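The Modelfile from step 1 is not preserved here; a reconstruction consistent with the description above (same weights as qwen3-coder:latest, larger num_ctx) would look like:

```
FROM qwen3-coder:latest
PARAMETER num_ctx 65536
```

The wrapper model would then have been built inside the Ollama pod with ollama create qwen3-coder-65k -f Modelfile.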

What Happened

  • The 65k model loaded but used 25.3GB VRAM on a 24GB GPU — the ~1.3GB overflow spilled to system RAM via resizable BAR
  • OpenCode still compacted constantly — the model's behavior (mass-globbing 50 files, web fetching full pages) consumed context faster than the extra headroom helped
  • Having two model names (qwen3-coder:latest and qwen3-coder-65k:latest) caused Ollama to constantly swap models in VRAM when different services used different names

Why It Failed

The compaction wasn't a context size problem — it was a model behavior problem. Qwen3-Coder 30B with opencode's system prompt + MCP tool definitions (~15-20k tokens) leaves only ~12-15k for conversation at 32k. One or two large tool results (glob with 50 matches, web fetch) fills the remainder. More context just delays the inevitable by one more tool call.
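The budget arithmetic behind that conclusion can be made explicit. A back-of-envelope sketch using the approximate figures quoted above (all numbers are the doc's estimates, not measurements; in practice the model's mass-globbing consumed the 65k headroom almost as fast):

```python
def conversation_budget(context_window: int, fixed_overhead: int,
                        avg_tool_result: int) -> tuple[int, int]:
    """Tokens left for conversation after fixed overhead, and how many
    average-sized large tool results fit in that remainder."""
    remaining = context_window - fixed_overhead
    return remaining, remaining // avg_tool_result

# ~17.5k fixed overhead (midpoint of the 15-20k system prompt + MCP tool definitions),
# ~8k assumed for one large tool result (a 50-file glob or full web-page fetch)
for ctx in (32_000, 65_536):
    left, fits = conversation_budget(ctx, 17_500, 8_000)
    print(f"{ctx} tokens context -> {left} left for conversation (~{fits} large tool results)")
```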

What Was Reverted

  • OpenCode and email organizers back to qwen3-coder:latest (32k actual, 40k in config)
  • Flash attention left enabled (harmless, improves VRAM efficiency)
  • qwen3-coder-65k model left on Ollama (unused, can be removed)

To Remove the 65k Model

ssh olares "sudo k3s kubectl exec -n ollamaserver-shared \$(sudo k3s kubectl get pod -n ollamaserver-shared -o jsonpath='{.items[0].metadata.name}') -c ollama -- ollama rm qwen3-coder-65k"

To Disable Flash Attention

ssh olares "sudo k3s kubectl set env deployment/ollama -n ollamaserver-shared -c ollama OLLAMA_FLASH_ATTENTION-"

What Would Actually Fix Compaction

  • More VRAM (48GB+ GPU) to run 65k+ context without spill
  • Smarter model that doesn't waste context on mass globs and web fetches
  • Fewer MCP tools registered (each tool definition consumes tokens in every request)

Failed Experiment: vLLM with Qwen3-30B-A3B AWQ (2026-03-30)

Attempted to replace Ollama with vLLM for better inference performance and context handling. Did not work for agentic coding.

What Was Tried

  1. Deployed vLLM via raw kubectl on Olares — vllm/vllm-openai:latest with model cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit (compressed-tensors quantization, MoE with 3B active params)
  2. Overcame multiple deployment issues:
    • HAMI GPU scheduler requires Olares-specific labels (applications.app.bytetrade.io/name, hami.io/vgpu-node, etc.) — pods won't schedule without them
    • Kubernetes service named vllm injected VLLM_PORT=tcp://... env var, conflicting with vLLM's own VLLM_PORT config — renamed service to vllm-server
    • Model uses compressed-tensors quantization, not awq — vLLM auto-detects when --quantization flag is omitted
    • nvidia.com/gpumem resource limit needed for HAMI to allocate VRAM (set to 20480 MiB)
  3. Got vLLM running and serving — model loaded, CUDA graphs compiled, inference working via ClusterIP
  4. NodePort (30800) didn't work — Olares networking blocks external NodePort access. Used SSH tunnel instead (autossh -L 30800:ClusterIP:8000 olares)
  5. Added --enable-auto-tool-choice --tool-call-parser hermes for OpenCode tool calling support
  6. Added olares-vllm provider preset to OpenCode config alongside existing Ollama preset
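The vllm-tunnel.service unit from step 4 is not reproduced in this doc; a sketch of what such an autossh unit typically looks like, with the ClusterIP left as a placeholder (it is not recorded here):

```ini
# /etc/systemd/system/vllm-tunnel.service -- illustrative reconstruction
[Unit]
Description=autossh tunnel to vLLM ClusterIP on Olares
After=network-online.target

[Service]
# <CLUSTER-IP> is a placeholder; look it up with:
#   ssh olares "sudo k3s kubectl get svc -n vllm-qwen3-coder"
ExecStart=/usr/bin/autossh -M 0 -N -L 30800:<CLUSTER-IP>:8000 olares
Restart=always

[Install]
WantedBy=multi-user.target
```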

Why It Failed

  • 16K context too small — OpenCode's system prompt + instructions + MCP tool definitions = ~20K tokens, exceeding the model's max_model_len of 16384. Error: maximum context length is 16384 tokens, your request has 20750 input tokens
  • Increasing max_model_len wasn't viable — only 1.56 GiB KV cache available after model loading, barely enough for 17K tokens at 16384
  • The A3B model (3B active params) also has weaker reasoning than the dense 30B, making it doubly unsuitable for agentic coding

What Was Reverted

  • OpenCode switched back to olares-qwen3-coder//qwen3-coder:latest (Ollama, dense 30B, 32K+ context)
  • vLLM deployment scaled to 0 (namespace vllm-qwen3-coder still exists)
  • vllm-tunnel.service (autossh systemd unit) stopped and disabled on homelab-vm
  • ComfyUI was scaled down during testing, can be re-enabled

Artifacts Left Behind

  • Olares k8s namespace: vllm-qwen3-coder (deployment scaled to 0, PVC with cached model ~10GB)
  • Systemd unit: /etc/systemd/system/vllm-tunnel.service (disabled)
  • OpenCode config: olares-vllm provider preset remains but is not the active model
  • HuggingFace model cache: On PVC vllm-model-cache in vllm-qwen3-coder namespace

Cleanup (if desired)

# Delete the vLLM deployment and namespace
ssh olares "sudo k3s kubectl delete namespace vllm-qwen3-coder"

# Remove the tunnel service
sudo rm /etc/systemd/system/vllm-tunnel.service
sudo systemctl daemon-reload

# Remove vLLM provider from opencode config
# Edit ~/.config/opencode/opencode.json and remove the "olares-vllm" block

Lessons Learned

  • MoE models with small active params are poor for agentic coding — tool schemas alone can exceed their context limits
  • Olares raw kubectl deployments bypass the app framework — no managed URLs, no auth integration, no ingress. Use Studio or Market for proper integration
  • HAMI GPU scheduler needs specific pod labels — any GPU workload deployed outside Olares Market needs applications.app.bytetrade.io/name and related labels
  • Kubernetes service names can collide with app env vars — never name a k8s service the same as the app binary (vLLM reads VLLM_PORT which k8s auto-sets from service discovery)
  • Dense Qwen3-Coder 30B via Ollama remains the best option for this hardware — sufficient context (32K), good reasoning, and Ollama's auto-unload keeps VRAM available for other apps
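The service-name collision in the fourth lesson comes from Kubernetes' docker-link-style service environment variables: for every Service, kubelet injects <NAME>_PORT=tcp://<ip>:<port> into pods, with the name upper-cased and dashes mapped to underscores. A quick sketch of the naming rule, showing why vllm collided and vllm-server does not:

```python
def k8s_service_env_var(service_name: str) -> str:
    """Name of the <SERVICE>_PORT env var Kubernetes injects for a Service."""
    return service_name.upper().replace("-", "_") + "_PORT"

print(k8s_service_env_var("vllm"))         # VLLM_PORT -- clobbers vLLM's own VLLM_PORT setting
print(k8s_service_env_var("vllm-server"))  # VLLM_SERVER_PORT -- no collision
```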

Requirements

  • Tool calling support required — OpenCode sends tools with every request. Models without tool call templates return 400 errors
  • Large context needed — System prompt + tool definitions use ~15-20k tokens. Models with less than 32k context will fail
  • Flash attention recommended: OLLAMA_FLASH_ATTENTION=1 on the Ollama deployment allows larger contexts within VRAM limits

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| bad request / 400 | Model doesn't support tools, or context exceeded | Switch to a model with tool calling support |
| model not found | Wrong model name (e.g., qwen3:coder vs qwen3-coder:latest) | Check ollama list for exact names |
| Constant compaction | Model consuming context with large tool results | Reduce web fetches, use targeted globs, or increase VRAM |
| 502 Bad Gateway | Ollama pod restarting or endpoint down | Check pod: ssh olares "sudo k3s kubectl get pods -n ollamaserver-shared" |
| Stuck in loops | Model keeps retrying failed tool calls | doom_loop: "deny" and reduce steps |
| Won't run ansible | Model too cautious, AGENTS.md too restrictive | Check instructions in config and AGENTS.md |
| Web fetch eating context | Model searching the internet for local info | Instructions tell it to read local files first |
| Model swap lag | Different services using different model names | Ensure all services use the same model name |