Evolution: Oh My OpenAgent Configuration Iteration Log

May 7, 2026 AI Tools AI Programming, Multi-Model Orchestration 1653 words 8 min read

The previous article covered the initial configuration setup. This one documents the adjustments made after two weeks of running it: expanding from a single vendor to a four-tier model pool, adding fallback chains, and running into the GLM-4.5-air trap of analyzing without writing code.
This post covers: fallback strategy design, complete free model pool inventory and analysis, concurrency control configuration, and the decision process for GLM-4.5-air replacement.

I ran the initial configuration from the previous article for two weeks, which was long enough for the issues that needed fixing to surface.

From Single Model to Multi-Vendor

The initial config had a problem: almost all Agents were tied to the Zhipu GLM series, with free models filling only a few minor positions. Running it revealed two issues.

First, GLM-4.7-flash and GLM-4.5-flash had already been deprecated, so their config entries were dead weight. Second, some GLM models had unreliable connectivity; glm-4.7-flashx and glm-5v-turbo timed out completely. When the primary model failed, there was no reliable fallback chain, and the entire Agent would hang.

The solution was straightforward: expand the model pool. Zhipu GLM stays the workhorse, NVIDIA paid models serve as alternatives, OpenRouter free models act as the safety net, and the opencode built-in free models are the last resort.

The model sources now span four tiers:

Tier	Vendor	Purpose
1	zhipuai-coding-plan	GLM full series, core tasks
2	nvidia	Multi-brand hosting (Qwen, Nemotron, Devstral, DeepSeek)
3	openrouter	Free models only (`free` suffix)
4	opencode built-in	Free fallback

Fallback Chain: What Happens When the API Goes Down

This was entirely missing from the initial config. I added the runtime_fallback configuration:

json
"runtime_fallback": {
  "enabled": true,
  "retry_on_errors": [400, 429, 503, 529],
  "max_fallback_attempts": 3,
  "cooldown_seconds": 60
}

Every Agent and Category now has a fallback_models array. When the primary model returns 400/429/503/529 errors, it automatically cascades down. A typical chain:

glm-5.1 → glm-5-turbo → glm-4.7 → NVIDIA strong model → OpenRouter free → opencode free fallback

This actually triggered several times, mainly 429 (Zhipu rate limiting). Falling back to NVIDIA’s Qwen3-Coder was essentially seamless — no cliff-drop in code quality.
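The cascade described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the plugin implements it: `FallbackPolicy`, `call_with_fallback`, and the `send` callback are hypothetical names, `send` stands in for the real API call, and the cooldown handling is left out because its exact semantics are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FallbackPolicy:
    # Values mirror the runtime_fallback block; the class itself is hypothetical.
    retry_on_errors: frozenset = frozenset({400, 429, 503, 529})
    max_fallback_attempts: int = 3


def call_with_fallback(chain, send, policy=FallbackPolicy()):
    """Try models in order and return (model, text) from the first success.

    `send(model)` is a stand-in for the real request and returns (status, text).
    """
    # The primary model plus at most `max_fallback_attempts` fallbacks.
    candidates = list(chain)[: 1 + policy.max_fallback_attempts]
    last_status = None
    for model in candidates:
        status, text = send(model)
        if status == 200:
            return model, text
        if status not in policy.retry_on_errors:
            # Errors outside the retry list are not worth cascading on.
            raise RuntimeError(f"{model} failed with non-retryable status {status}")
        last_status = status
    raise RuntimeError(f"fallback chain exhausted, last status {last_status}")


if __name__ == "__main__":
    # Simulated responses: the first two models are rate limited or unavailable.
    statuses = {"glm-5.1": 429, "glm-5-turbo": 503, "glm-4.7": 200}
    model, _ = call_with_fallback(
        ["glm-5.1", "glm-5-turbo", "glm-4.7"],
        lambda m: (statuses[m], "ok"),
    )
    print(model)
```

Capping the candidate list up front keeps the attempt count easy to reason about: with the settings above, one request touches at most four models.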

The GLM-4.5-air Pitfall: Analyzes But Doesn’t Act

This was the most interesting issue discovered after two weeks of running.

In the initial config, GLM-4.5-air handled three types of tasks: quick (small changes), unspecified-low (low-complexity odd jobs), and writing (documentation). The positioning was a value model — fast, cheap, adequate.

But in practice, a fixed pattern emerged: ask it to change code, and it would diligently analyze what should be changed, why, and the impact of the changes — then simply not make them. Every response was “I suggest you modify XXX” rather than delivering the modified code.

It happened on nearly every code-change request, so this was a consistent behavior rather than an intermittent glitch.

I analyzed the config — glm-4.5-air appeared in 10 positions, 5 of which required writing code:

Position	Role	Writes Code?
`sisyphus` compaction	Context compression	No
`atlas` fallback	Todo management	No
`librarian` fallback	Document search	No
`explore` fallback	Code search	No
`writing` primary model	Write docs	No (writes text, not code)
`quick` primary model	Single file modification	Yes
`unspecified-low` primary model	Simple tasks	Yes
`sisyphus-junior` fallback	Task execution	Yes
`deep` fallback	Research + implementation	Yes
`unspecified-high` fallback	Complex tasks	Yes

The strategy was clear: keep the 5 non-code positions as-is, and replace all 5 code-writing positions.
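The same audit can be written down as data. The sketch below copies the ten positions from the table and splits them by whether the position has to write code; the tuple layout and the function name are my own illustration, not part of the actual config.

```python
# Positions copied from the table above: (position, slot, writes_code).
POSITIONS = [
    ("sisyphus", "compaction", False),
    ("atlas", "fallback", False),
    ("librarian", "fallback", False),
    ("explore", "fallback", False),
    ("writing", "primary", False),
    ("quick", "primary", True),
    ("unspecified-low", "primary", True),
    ("sisyphus-junior", "fallback", True),
    ("deep", "fallback", True),
    ("unspecified-high", "fallback", True),
]


def split_by_code_writing(positions):
    """Return (keep, replace): positions that only read or write text vs. write code."""
    keep = [name for name, _slot, writes_code in positions if not writes_code]
    replace = [name for name, _slot, writes_code in positions if writes_code]
    return keep, replace


if __name__ == "__main__":
    keep, replace = split_by_code_writing(POSITIONS)
    print("keep glm-4.5-air:", keep)
    print("replace glm-4.5-air:", replace)
```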

Free Model Pool

To replace them, I needed alternatives. Running opencode models returned 29 currently available free models (25 OpenRouter + 4 OpenCode built-in). I listed and analyzed all of them.

OpenRouter Free Models (25)

Primary Coding Models

Model	Architecture	Coding Ability	Context	Tool Calling	Notes
gemma-4-31b-it	Dense 31B	HumanEval+ 88.4%, LiveCodeBench 80.0%	256K	Native	Strongest coder in free pool, BFCL 92.25%
llama-3.3-70b-instruct	Dense 70B	HumanEval 88.4%	128K	Native	Large model, solid coding and reasoning
gemma-4-26b-a4b-it	MoE 26B (4B activated)	LiveCodeBench 77.1%	256K	Native	4B activation means speed, quality close to 31B
gemma-3-12b-it	Dense 12B	HumanEval 85.4%	128K	Uncertain	Good coding scores, MATH 83.8%, but tool calling lacks official docs
nemotron-3-nano-30b-a3b	MoE 30B (3B activated)	HumanEval 78%, SWE-Bench 38.8%	1M	Native	1M context is a killer feature, tool calling BFCL 53.8%

Gemma 4 26B is a MoE architecture — 26B total parameters but only 4B activated — meaning inference speed is close to a 4B model while coding quality approaches 30B level. Perfect for quick tasks (renaming variables, fixing imports). The 31B Dense version is an even stronger coder (HumanEval+ 88.4%), but slower.
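To make the speed argument concrete, here is a rough sketch that shortlists models for latency-sensitive Agent work using only two columns from the table: activated parameters and native tool calling. The 5B cutoff is an arbitrary assumption for illustration, dense models are counted as fully activated, and none of this is a benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    active_params_b: float  # activated parameters in billions (dense = all of them)
    native_tool_calling: bool


# Values taken from the "Primary Coding Models" table.
CANDIDATES = [
    Candidate("gemma-4-31b-it", 31, True),
    Candidate("llama-3.3-70b-instruct", 70, True),
    Candidate("gemma-4-26b-a4b-it", 4, True),
    Candidate("gemma-3-12b-it", 12, False),  # tool calling listed as "Uncertain"
    Candidate("nemotron-3-nano-30b-a3b", 3, True),
]


def fast_agent_candidates(candidates, max_active_b=5.0):
    """Models with native tool calling and few activated parameters, smallest first.

    `max_active_b` is a hypothetical cutoff, not a value from any config.
    """
    picked = [
        c for c in candidates
        if c.native_tool_calling and c.active_params_b <= max_active_b
    ]
    return [c.name for c in sorted(picked, key=lambda c: c.active_params_b)]


if __name__ == "__main__":
    print(fast_agent_candidates(CANDIDATES))
```

A filter like this only narrows the field. Choosing between the survivors still comes down to the coding scores in the table and to trying them on real tasks.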

Reasoning Specialists

Model	Architecture	Reasoning Ability	Context	Notes
gpt-oss-120b	MoE 120B	AIME 2025 92.1%	128K	OpenAI open-source, extremely strong reasoning but mediocre coding (HumanEval 73%)
gpt-oss-20b	MoE 20B (3.6B activated)	AIME 2025 92.1%	128K	Small footprint but same reasoning strength as 120B
nemotron-3-super-120b-a12b	MoE 120B (12B activated)	Strong reasoning	256K	Balanced coding and reasoning
nemotron-3-nano-omni-30b-a3b-reasoning	MoE 31B (3B activated)	LiveCodeBench 63.2%	256K	nano-30b’s multimodal + reasoning variant, adds vision/audio

The GPT-OSS series AIME scores are impressive: at 92.1% they match many closed-source models. Coding is the weak spot (HumanEval 73%), which makes them a better fit for reasoning scenarios than for coding.

Multimodal

Model	Architecture	Capability	Context	Notes
nemotron-nano-12b-v2-vl	Dense 12B	Vision (image understanding)	—	Only vision model in the free pool, used by multimodal-looker
gemma-3n-e4b-it	MatFormer ~4B	Vision	32K	Too small, tool calling immature
gemma-3n-e2b-it	MatFormer ~2B	Vision	32K	Even smaller, basically unusable for coding

Vision model choices are limited. nemotron-nano-12b-v2-vl is the only free model that can handle images with reliable tool calling. The Gemma 3n series, though vision-capable, are too small (2-4B effective parameters), with LiveCodeBench at only 25.7%.

Mid-Range

Model	Architecture	Coding Ability	Context	Notes
gemma-3-27b-it	Dense 27B	LiveCodeBench 29.7%	128K	Much better than older Gemma-2 27B, but far behind Gemma 4 series
nemotron-nano-9b-v2	Dense 9B	Lightweight coding	—	Lightweight but functional, good for fallback
glm-4.5-air	Dense (?)	Strong analysis, weak coding	—	Analyzes without writing code, adequate for writing and search
minimax-m2.5	—	General purpose	—	MiniMax product, decent general capability

Too Small or Otherwise Unsuitable for Coding

Model	Parameters	Notes
gemma-3-4b-it	4B	Weak coding ability
llama-3.2-3b-instruct	3B	Too small
lfm-2.5-1.2b-instruct	1.2B	AIME25 14%, basically can’t code
lfm-2.5-1.2b-thinking	1.2B	Same, reasoning doesn’t help
laguna-m.1	—	From Poolside, capability unknown
laguna-xs.2	—	Same
dolphin-mistral-24b-venice-edition	24B	Venice custom, uncensored but tool calling unclear
hermes-3-llama-3.1-405b	405B	Large params but no tool calling, unusable for Agent scenarios
trinity-large-preview	—	Unstable endpoint

OpenCode Built-in Free Models (4)

Model	Notes
big-pickle	General purpose, ultimate fallback
hy3-preview-free	Tencent Hunyuan, good for Chinese scenarios
minimax-m2.5-free	MiniMax product, general purpose
nematron-3-super-free	NVIDIA Nemotron 3 Super, balanced reasoning and coding

OpenCode built-in models don’t have public benchmark data, but from actual usage they’re adequate as fallback safety nets. big-pickle is the most common fallback — any Agent can run on it, and while quality is average, it never hangs.

The Final Replacement Plan

quick / unspecified-low primary model:
  glm-4.5-air → gemma-4-26b-a4b-it:free

sisyphus-junior fallback:
  glm-4.5-air → nemotron-3-nano-30b-a3b:free

deep / unspecified-high fallback:
  glm-4.5-air → llama-3.3-70b-instruct:free

Side effect: quick and unspecified-low switched from paid to free models — small tasks now cost nothing.

I kept glm-4.5-air in the remaining positions: context compression, document search, code search, Todo management, and documentation writing. These are all read-only or pure text output scenarios, where “analyzes but doesn’t act” is actually quite fitting.
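Applied to a config, the plan is a targeted swap rather than a global search and replace. The sketch below assumes a simplified shape in which each position has a `model` key and a `fallback_models` list; the `model` key and the `apply_plan` helper are hypothetical, and the real file may be laid out differently.

```python
import copy

OLD_MODEL = "glm-4.5-air"

# Replacement plan from this section; ":free" is the OpenRouter free suffix.
REPLACEMENTS = {
    "quick": "gemma-4-26b-a4b-it:free",
    "unspecified-low": "gemma-4-26b-a4b-it:free",
    "sisyphus-junior": "nemotron-3-nano-30b-a3b:free",
    "deep": "llama-3.3-70b-instruct:free",
    "unspecified-high": "llama-3.3-70b-instruct:free",
}


def apply_plan(config, replacements, old=OLD_MODEL):
    """Return a copy of `config` with `old` swapped out in the listed positions only."""
    updated = copy.deepcopy(config)
    for position, new_model in replacements.items():
        entry = updated.get(position)
        if entry is None:
            continue  # position not present in this config
        if entry.get("model") == old:
            entry["model"] = new_model
        entry["fallback_models"] = [
            new_model if m == old else m for m in entry.get("fallback_models", [])
        ]
    return updated


if __name__ == "__main__":
    config = {
        "quick": {"model": "glm-4.5-air", "fallback_models": []},
        "deep": {"model": "glm-5-turbo", "fallback_models": ["glm-4.7", "glm-4.5-air"]},
        "writing": {"model": "glm-4.5-air", "fallback_models": []},
    }
    for name, entry in apply_plan(config, REPLACEMENTS).items():
        print(name, entry)
```

Positions that are not in the plan, such as `writing` in the example, keep glm-4.5-air untouched.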

Concurrency Control

The initial config had no concurrency limits — multiple Agents running simultaneously would often hit vendor rate limits. This time I configured concurrency based on model size and vendor constraints:

GLM-5.1 (flagship, slow): 3 concurrent
GLM-5-turbo / GLM-4.7 (mid-range): 5-8 concurrent
NVIDIA large models (480B, 253B): 2-3 concurrent
OpenRouter free: 3-8 concurrent (by model size)
OpenCode built-in free: 10-20 concurrent

In practice, rate limit hits have essentially stopped. The occasional 429 errors that used to happen now become seamless fallback switches thanks to the fallback chain.
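Conceptually, a per-model concurrency cap behaves like one semaphore per model. The following sketch shows that idea with `asyncio.Semaphore`; the `ModelLimiter` class, the specific limits (picked from within the ranges above), and the fake request are assumptions for illustration, not the tool's actual mechanism.

```python
import asyncio

# Illustrative limits chosen from within the ranges listed above.
LIMITS = {"glm-5.1": 3, "glm-5-turbo": 5, "big-pickle": 10}


class ModelLimiter:
    """Caps in-flight requests per model with one semaphore per model."""

    def __init__(self, limits, default=3):
        self._limits = limits
        self._default = default
        self._semaphores = {}
        self._active = {}
        self.peak = {}  # highest observed concurrency per model

    def _semaphore(self, model):
        if model not in self._semaphores:
            limit = self._limits.get(model, self._default)
            self._semaphores[model] = asyncio.Semaphore(limit)
        return self._semaphores[model]

    async def run(self, model, task):
        """Run the coroutine function `task` once a slot for `model` is free."""
        async with self._semaphore(model):
            self._active[model] = self._active.get(model, 0) + 1
            self.peak[model] = max(self.peak.get(model, 0), self._active[model])
            try:
                return await task()
            finally:
                self._active[model] -= 1


async def demo():
    limiter = ModelLimiter(LIMITS)

    async def fake_request():
        await asyncio.sleep(0.01)  # stands in for a real API call
        return "ok"

    results = await asyncio.gather(
        *(limiter.run("glm-5.1", fake_request) for _ in range(10))
    )
    return limiter.peak["glm-5.1"], len(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Ten requests are submitted at once, but no more than three are in flight against glm-5.1 at any moment; the rest wait for a slot instead of triggering a 429.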

Current Complete Configuration

The restructured configuration (only showing free model and OpenCode built-in model related chains):

agents:
  sisyphus:     glm-5.1    → ... → gemma-4-31b:free → llama-3.3-70b:free → big-pickle
  hephaestus:   glm-5.1    → ... → gemma-4-31b:free → big-pickle
  oracle:       glm-5.1    → ... → gemma-4-31b:free
  prometheus:   glm-5.1    → ... → gemma-4-31b:free → llama-3.3-70b:free
  metis:        glm-5.1    → ... → gemma-4-31b:free
  momus:        glm-5.1    → ... → gemma-4-31b:free
  atlas:        glm-5-turbo → ... → gemma-4-26b:free → big-pickle
  librarian:    glm-4.7    → ... → gemma-4-26b:free → gpt-5-nano(opencode) → ling-2.6-flash(opencode)
  explore:      glm-4.7    → ... → gemma-4-26b:free → gpt-5-nano(opencode) → ling-2.6-flash(opencode)
  multimodal:   nemotron-nano-12b-v2-vl:free → nemotron-3-super:free → nematron-3-super-free(opencode)
  junior:       glm-5-turbo → ... → nemotron-3-nano-30b:free → gemma-4-31b:free → llama-3.3-70b:free → big-pickle

categories:
  visual-engineering: glm-5.1 → ... → gemma-4-31b:free → big-pickle
  ultrabrain:         glm-5.1 → ... → gemma-4-31b:free → llama-3.3-70b:free
  deep:               glm-5-turbo → ... → llama-3.3-70b:free → nemotron-3-nano-30b:free
  artistry:           glm-5.1 → ... → gemma-4-31b:free
  quick:              gemma-4-26b-a4b-it:free → nemotron-nano-9b:free → gpt-5-nano(opencode) → ling-2.6-flash(opencode)
  unspecified-low:    gemma-4-26b-a4b-it:free → nemotron-nano-9b:free → minimax-m2.5-free(opencode) → hy3(opencode) → ling-2.6-flash(opencode)
  unspecified-high:   glm-5-turbo → ... → gemma-4-31b:free → llama-3.3-70b:free → big-pickle
  writing:            glm-4.5-air → gemma-4-26b:free → minimax-m2.5-free(opencode) → ling-2.6-flash(opencode) → gpt-5-nano(opencode)

Lessons Learned

Model works ≠ model is suitable. GLM-4.5-air’s coding ability is fine — it just refuses to output code changes, only analysis. This behavioral quirk can’t be discovered through benchmarks; you have to use it to find out.

Free model pools change fast. Models that didn’t exist two weeks ago are now available, and those that were available before have been deprecated. Run opencode models periodically to see what’s new.
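One low-effort way to track this churn is to save the output of opencode models each time and diff it against the previous snapshot. The sketch below assumes the output is one model id per line; the helper names and the sample ids are illustrative assumptions, not real output.

```python
def parse_models(output: str) -> set:
    """Turn a saved model listing (assumed: one id per line) into a set of ids."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def diff_snapshots(old_output: str, new_output: str) -> dict:
    """Report which model ids appeared or disappeared between two snapshots."""
    old, new = parse_models(old_output), parse_models(new_output)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def only_free(model_ids):
    """Keep ids carrying the ':free' suffix used for OpenRouter free models."""
    return [m for m in model_ids if m.endswith(":free")]


if __name__ == "__main__":
    # Sample ids are made up for the example.
    two_weeks_ago = "vendor/model-a:free\nvendor/model-b:free\nvendor/model-c\n"
    today = "vendor/model-b:free\nvendor/model-c\nvendor/model-d:free\n"
    changes = diff_snapshots(two_weeks_ago, today)
    print("new free models:", only_free(changes["added"]))
    print("gone:", changes["removed"])
```

Anything listed under "gone" that still appears in a fallback chain is a dead entry worth removing.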

Fallback chains aren’t a nice-to-have; they’re essential. Without fallback, a single 429 error can stall the entire Agent. With fallback in place, primary model rate limiting switches to alternatives seamlessly.

Concurrency limits must align with vendor constraints. Large models have inherently low concurrency limits (2-3). Without proper configuration, multiple Agents running simultaneously trigger rate limiting, then fallback, then fallback models also hit limits — cascading failure.

For the next phase, I want to try putting NVIDIA’s new models (DeepSeek-v4, Llama-4 Maverick) into the fallback chain for testing. I also plan to run a connectivity test to screen all usable OpenRouter free pool models and see if there’s a better option than Gemma 4 26B for quick tasks.