Evolution: Oh My OpenAgent Configuration Iteration Log
The previous article covered the initial configuration setup. This one documents the adjustments after two weeks of running: expanding from a single vendor to a four-tier model pool, adding fallback chains, and running into the GLM-4.5-air trap of analyzing without writing code.
This post covers: fallback strategy design, complete free model pool inventory and analysis, concurrency control configuration, and the decision process for GLM-4.5-air replacement.
I ran the initial configuration from the previous article for two weeks, and the issues that needed fixing surfaced.
From Single Model to Multi-Vendor
The initial config had a problem: almost all Agents were tied to the Zhipu GLM series, with free models only filling a few minor positions. Running it revealed two issues.
First, GLM-4.7-flash and GLM-4.5-flash had already been deprecated — the config entries were dead weight. Second, some GLM models had unreliable connectivity; glm-4.7-flashx and glm-5v-turbo timed out completely. When the primary model failed, there was no reliable fallback chain, and the entire Agent would hang.
The solution was straightforward: expand the model pool. Zhipu GLM as the workhorse, NVIDIA paid models as alternatives, OpenRouter free models as the safety net.
The model sources now span four tiers:
| Tier | Vendor | Purpose |
|---|---|---|
| 1 | zhipuai-coding-plan | GLM full series, core tasks |
| 2 | nvidia | Multi-brand hosting (Qwen, Nemotron, Devstral, DeepSeek) |
| 3 | openrouter | Free models only (free suffix) |
| 4 | opencode built-in | Free fallback |
Fallback Chain: What Happens When the API Goes Down
This was entirely missing from the initial config. I added a runtime_fallback configuration to cover it.
Every Agent and Category now has a fallback_models array. When the primary model returns a 400, 429, 503, or 529 error, the request automatically cascades down to the next model in the array.
This actually triggered several times, mainly 429 (Zhipu rate limiting). Falling back to NVIDIA’s Qwen3-Coder was essentially seamless — no cliff-drop in code quality.
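The cascade behavior can be sketched in Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: `ModelError`, `call_with_fallback`, the `fake_send` stub, and the model ID strings are all hypothetical. Only the four status codes and the idea of an ordered fallback_models array come from the configuration described here.

```python
from typing import Callable, Optional, Sequence, Tuple

# Status codes that trigger a fallback, as listed in this post.
RETRYABLE_STATUS = {400, 429, 503, 529}


class ModelError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, model: str, status: int) -> None:
        super().__init__(f"{model} failed with HTTP {status}")
        self.model = model
        self.status = status


def call_with_fallback(
    primary: str,
    fallback_models: Sequence[str],
    send: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the primary model, then each fallback in order.

    Returns (model_used, response). Errors with a status outside
    RETRYABLE_STATUS propagate immediately; if every model fails,
    the last error is re-raised.
    """
    last_error: Optional[ModelError] = None
    for model in (primary, *fallback_models):
        try:
            return model, send(model)
        except ModelError as err:
            if err.status not in RETRYABLE_STATUS:
                raise
            last_error = err
    assert last_error is not None
    raise last_error


def fake_send(model: str) -> str:
    # Simulate the primary model being rate limited (hypothetical model IDs).
    if model == "zhipuai-coding-plan/glm-4.7":
        raise ModelError(model, 429)
    return f"response from {model}"


used, reply = call_with_fallback(
    "zhipuai-coding-plan/glm-4.7",
    ["nvidia/qwen3-coder", "opencode/big-pickle"],
    fake_send,
)
```

In this sketch the 429 from the primary is swallowed and the request lands on the first fallback, which mirrors the Zhipu-to-NVIDIA switch described above.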
The GLM-4.5-air Pitfall: Analyzes But Doesn’t Act
This was the most interesting issue discovered after two weeks of running.
In the initial config, GLM-4.5-air handled three types of tasks: quick (small changes), unspecified-low (low-complexity odd jobs), and writing (documentation). The positioning was a value model — fast, cheap, adequate.
But in practice, a fixed pattern emerged: ask it to change code, and it would diligently analyze what should be changed, why, and the impact of the changes — then simply not make them. Every response was “I suggest you modify XXX” rather than delivering the modified code.
This happened too consistently to be an intermittent issue.
I analyzed the config — glm-4.5-air appeared in 10 positions, 5 of which required writing code:
| Position | Role | Writes Code? |
|---|---|---|
| sisyphus compaction | Context compression | No |
| atlas fallback | Todo management | No |
| librarian fallback | Document search | No |
| explore fallback | Code search | No |
| writing primary model | Write docs | No (writes text, not code) |
| quick primary model | Single file modification | Yes |
| unspecified-low primary model | Simple tasks | Yes |
| sisyphus-junior fallback | Task execution | Yes |
| deep fallback | Research + implementation | Yes |
| unspecified-high fallback | Complex tasks | Yes |
The strategy was clear: keep the 5 non-code positions as-is, and replace all 5 code-writing positions.
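The keep-or-replace split can be expressed as a small audit over the table above. This is only a sketch: the `Position` type and the `split_by_code_writing` helper are hypothetical names, and the real config is not structured this way. The ten entries are copied from the table.

```python
from typing import List, NamedTuple, Tuple


class Position(NamedTuple):
    name: str
    role: str
    writes_code: bool


# The ten glm-4.5-air positions from the table above.
POSITIONS = [
    Position("sisyphus compaction", "Context compression", False),
    Position("atlas fallback", "Todo management", False),
    Position("librarian fallback", "Document search", False),
    Position("explore fallback", "Code search", False),
    Position("writing primary model", "Write docs", False),
    Position("quick primary model", "Single file modification", True),
    Position("unspecified-low primary model", "Simple tasks", True),
    Position("sisyphus-junior fallback", "Task execution", True),
    Position("deep fallback", "Research + implementation", True),
    Position("unspecified-high fallback", "Complex tasks", True),
]


def split_by_code_writing(
    positions: List[Position],
) -> Tuple[List[str], List[str]]:
    """Return (positions to keep, positions to replace)."""
    keep = [p.name for p in positions if not p.writes_code]
    replace = [p.name for p in positions if p.writes_code]
    return keep, replace


keep, replace = split_by_code_writing(POSITIONS)
```

Keeping the audit as data makes it easy to rerun when another model shows the same behavior.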
Free Model Pool
To replace them, I needed alternatives. The opencode models command returned 29 currently available free models (25 from OpenRouter plus 4 built into OpenCode). I listed and analyzed all of them.
OpenRouter Free Models (25)
Primary Coding Models
| Model | Architecture | Coding Ability | Context | Tool Calling | Notes |
|---|---|---|---|---|---|
| gemma-4-31b-it | Dense 31B | HumanEval+ 88.4%, LiveCodeBench 80.0% | 256K | Native | Strongest coder in free pool, BFCL 92.25% |
| llama-3.3-70b-instruct | Dense 70B | HumanEval 88.4% | 128K | Native | Large model, solid coding and reasoning |
| gemma-4-26b-a4b-it | MoE 26B (4B activated) | LiveCodeBench 77.1% | 256K | Native | 4B activation means speed, quality close to 31B |
| gemma-3-12b-it | Dense 12B | HumanEval 85.4% | 128K | Uncertain | Good coding scores, MATH 83.8%, but tool calling lacks official docs |
| nemotron-3-nano-30b-a3b | MoE 30B (3B activated) | HumanEval 78%, SWE-Bench 38.8% | 1M | Native | 1M context is a killer feature, tool calling BFCL 53.8% |
Gemma 4 26B is a MoE architecture — 26B total parameters but only 4B activated — meaning inference speed is close to a 4B model while coding quality approaches 30B level. Perfect for quick tasks (renaming variables, fixing imports). The 31B Dense version is an even stronger coder (HumanEval+ 88.4%), but slower.
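One way to turn that reasoning into a shortlist is to filter the table by tool calling and a coding score, then sort by activated parameters as a rough proxy for speed. The sketch below is an assumption-laden illustration: the `Candidate` type, the `shortlist` helper, and the 75% LiveCodeBench cutoff are hypothetical choices, and models without a LiveCodeBench score in the table are simply excluded rather than estimated.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    active_params_b: float          # activated parameters, in billions
    tool_calling: str               # "Native" or "Uncertain", per the table
    livecodebench: Optional[float]  # None when the table gives no score


# Values copied from the Primary Coding Models table above.
POOL = [
    Candidate("gemma-4-31b-it", 31, "Native", 80.0),
    Candidate("llama-3.3-70b-instruct", 70, "Native", None),
    Candidate("gemma-4-26b-a4b-it", 4, "Native", 77.1),
    Candidate("gemma-3-12b-it", 12, "Uncertain", None),
    Candidate("nemotron-3-nano-30b-a3b", 3, "Native", None),
]


def shortlist(pool: List[Candidate], min_lcb: float = 75.0) -> List[str]:
    """Native tool calling, known LiveCodeBench >= min_lcb, fastest first."""
    eligible = [
        c for c in pool
        if c.tool_calling == "Native"
        and c.livecodebench is not None
        and c.livecodebench >= min_lcb
    ]
    # Fewer activated parameters is used here as a stand-in for lower latency.
    return [c.name for c in sorted(eligible, key=lambda c: c.active_params_b)]


quick_candidates = shortlist(POOL)
```

Under these assumptions the MoE model sorts ahead of the dense 31B, which matches the trade-off described above: slightly lower coding scores in exchange for speed.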
Reasoning Specialists
| Model | Architecture | Reasoning Ability | Context | Notes |
|---|---|---|---|---|
| gpt-oss-120b | MoE 120B | AIME 2025 92.1% | 128K | OpenAI open-source, extremely strong reasoning but mediocre coding (HumanEval 73%) |
| gpt-oss-20b | MoE 20B (3.6B activated) | AIME 2025 92.1% | 128K | Small footprint but same reasoning strength as 120B |
| nemotron-3-super-120b-a12b | MoE 120B (12B activated) | Strong reasoning | 256K | Balanced coding and reasoning |
| nemotron-3-nano-omni-30b-a3b-reasoning | MoE 31B (3B activated) | LiveCodeBench 63.2% | 256K | nano-30b’s multimodal + reasoning variant, adds vision/audio |
The GPT-OSS series AIME scores are impressive — 92.1% matching many closed-source models. But coding is the weak spot (HumanEval 73%), making them suitable for reasoning scenarios rather than coding.
Multimodal
| Model | Architecture | Capability | Context | Notes |
|---|---|---|---|---|
| nemotron-nano-12b-v2-vl | Dense 12B | Vision (image understanding) | — | Only vision model in the free pool, used by multimodal-looker |
| gemma-3n-e4b-it | MatFormer ~4B | Vision | 32K | Too small, tool calling immature |
| gemma-3n-e2b-it | MatFormer ~2B | Vision | 32K | Even smaller, basically unusable for coding |
Vision model choices are limited. nemotron-nano-12b-v2-vl is the only free model that can handle images with reliable tool calling. The Gemma 3n series, though vision-capable, are too small (2-4B effective parameters), with LiveCodeBench at only 25.7%.
Mid-Range
| Model | Architecture | Coding Ability | Context | Notes |
|---|---|---|---|---|
| gemma-3-27b-it | Dense 27B | LiveCodeBench 29.7% | 128K | Much better than older Gemma-2 27B, but far behind Gemma 4 series |
| nemotron-nano-9b-v2 | Dense 9B | Lightweight coding | — | Lightweight but functional, good for fallback |
| glm-4.5-air | Dense (?) | Strong analysis, weak coding | — | Analyzes without writing code, adequate for writing and search |
| minimax-m2.5 | — | General purpose | — | MiniMax product, decent general capability |
Too Small for Coding
| Model | Parameters | Notes |
|---|---|---|
| gemma-3-4b-it | 4B | Weak coding ability |
| llama-3.2-3b-instruct | 3B | Too small |
| lfm-2.5-1.2b-instruct | 1.2B | AIME25 14%, basically can’t code |
| lfm-2.5-1.2b-thinking | 1.2B | Same, reasoning doesn’t help |
| laguna-m.1 | — | From Poolside, capability unknown |
| laguna-xs.2 | — | Same |
| dolphin-mistral-24b-venice-edition | 24B | Venice custom, uncensored but tool calling unclear |
| hermes-3-llama-3.1-405b | 405B | Large params but no tool calling, unusable for Agent scenarios |
| trinity-large-preview | — | Unstable endpoint |
OpenCode Built-in Free Models (4)
| Model | Notes |
|---|---|
| big-pickle | General purpose, ultimate fallback |
| hy3-preview-free | Tencent Hunyuan, good for Chinese scenarios |
| minimax-m2.5-free | MiniMax product, general purpose |
| nemotron-3-super-free | NVIDIA Nemotron 3 Super, balanced reasoning and coding |
OpenCode built-in models don’t have public benchmark data, but from actual usage they’re adequate as fallback safety nets. big-pickle is the most common fallback — any Agent can run on it, and while quality is average, it never hangs.
The Final Replacement Plan
Side effect: quick and unspecified-low switched from paid to free models — small tasks now cost nothing.
I kept glm-4.5-air in the remaining positions: context compression, document search, code search, Todo management, and documentation writing. These are all read-only or pure text output scenarios, where “analyzes but doesn’t act” is actually quite fitting.
Concurrency Control
The initial config had no concurrency limits — multiple Agents running simultaneously would often hit vendor rate limits. This time I configured concurrency based on model size and vendor constraints:
- GLM-5.1 (flagship, slow): 3 concurrent
- GLM-5-turbo / GLM-4.7 (mid-range): 5-8 concurrent
- NVIDIA large models (480B, 253B): 2-3 concurrent
- OpenRouter free: 3-8 concurrent (by model size)
- OpenCode built-in free: 10-20 concurrent
In practice, rate limit hits have essentially stopped. The occasional 429 errors that used to happen now become seamless fallback switches thanks to the fallback chain.
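The per-model caps listed above amount to a gate in front of each model. A minimal sketch with asyncio semaphores follows. It is not how the tool implements its limits: the `ConcurrencyGate` class, the `fake_request` stub, and the keys in `LIMITS` are hypothetical labels, and where the post gives a range the sketch picks the lower bound.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Per-model concurrency caps, taken from the ranges listed above.
# Keys are illustrative labels, not real model IDs.
LIMITS: Dict[str, int] = {
    "glm-5.1": 3,
    "glm-4.7": 5,
    "nvidia-large": 2,
    "openrouter-free": 3,
    "opencode-free": 10,
}


class ConcurrencyGate:
    """Caps in-flight requests per model and records the observed peak."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._semaphores = {k: asyncio.Semaphore(v) for k, v in limits.items()}
        self.in_flight = {k: 0 for k in limits}
        self.peak = {k: 0 for k in limits}

    async def run(self, model: str, task: Callable[[], Awaitable[str]]) -> str:
        # Wait here until a slot for this model is free.
        async with self._semaphores[model]:
            self.in_flight[model] += 1
            self.peak[model] = max(self.peak[model], self.in_flight[model])
            try:
                return await task()
            finally:
                self.in_flight[model] -= 1


async def demo() -> int:
    gate = ConcurrencyGate(LIMITS)

    async def fake_request() -> str:
        # Stand-in for a real model call.
        await asyncio.sleep(0.01)
        return "ok"

    # Ten Agents ask for the flagship model at once; only three run at a time.
    await asyncio.gather(*(gate.run("glm-5.1", fake_request) for _ in range(10)))
    return gate.peak["glm-5.1"]


peak = asyncio.run(demo())
```

Requests beyond the cap wait for a slot instead of reaching the vendor and coming back as a 429, so the fallback chain is only exercised by genuine rate limiting.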
Current Complete Configuration
The restructured configuration (only showing free model and OpenCode built-in model related chains):
Lessons Learned
Model works ≠ model is suitable. GLM-4.5-air’s coding ability is fine — it just refuses to output code changes, only analysis. This behavioral quirk can’t be discovered through benchmarks; you have to use it to find out.
Free model pools change fast. Models that didn’t exist two weeks ago are now available, and some that were available before have been deprecated. Run the opencode models command periodically to see what’s new.
Fallback chains aren’t a nice-to-have; they’re essential. Without fallback, a single 429 error can stall the entire Agent. With fallback in place, a rate-limited primary model hands off to its alternatives seamlessly.
Concurrency limits must align with vendor constraints. Large models have inherently low concurrency limits (2-3). Without proper configuration, multiple Agents running simultaneously trigger rate limiting, then fallback, then fallback models also hit limits — cascading failure.
For the next phase, I want to try putting NVIDIA’s new models (DeepSeek-v4, Llama-4 Maverick) into the fallback chain for testing. I also plan to run a connectivity test to screen all usable OpenRouter free pool models and see if there’s a better option than Gemma 4 26B for quick tasks.