Evolution: Oh My OpenAgent Configuration Iteration Log

The previous article covered the initial configuration setup. This one documents the adjustments after two weeks of running: expanding from a single vendor to a four-tier model pool, adding fallback chains, and hitting the GLM-4.5-air trap of analyzing without writing code.

This post covers: fallback strategy design, complete free model pool inventory and analysis, concurrency control configuration, and the decision process for GLM-4.5-air replacement.

I ran the initial configuration from the previous article for two weeks, which was long enough for the issues that needed fixing to surface.

From Single Model to Multi-Vendor

The initial config had a problem: almost all Agents were tied to the Zhipu GLM series, with free models only filling a few minor positions. Running it revealed two issues.

First, GLM-4.7-flash and GLM-4.5-flash had already been deprecated, so their config entries were dead weight. Second, some GLM models had unreliable connectivity: glm-4.7-flashx and glm-5v-turbo timed out completely. With no reliable fallback chain in place, a failed primary model left the entire Agent hanging.

The solution was straightforward: expand the model pool. Zhipu GLM as the workhorse, NVIDIA paid models as alternatives, OpenRouter free models as the safety net.

The model sources now span four tiers:

| Tier | Vendor | Purpose |
|------|--------|---------|
| 1 | zhipuai-coding-plan | GLM full series, core tasks |
| 2 | nvidia | Multi-brand hosting (Qwen, Nemotron, Devstral, DeepSeek) |
| 3 | openrouter | Free models only (free suffix) |
| 4 | opencode built-in | Free fallback |

Fallback Chain: What Happens When the API Goes Down

This was entirely missing from the initial config. Added the runtime_fallback configuration:

json
"runtime_fallback": {
  "enabled": true,
  "retry_on_errors": [400, 429, 503, 529],
  "max_fallback_attempts": 3,
  "cooldown_seconds": 60
}

Every Agent and Category now has a fallback_models array. When the primary model returns 400/429/503/529 errors, it automatically cascades down. A typical chain:

glm-5.1 → glm-5-turbo → glm-4.7 → NVIDIA strong model → OpenRouter free → opencode free fallback

This actually triggered several times, mainly 429 (Zhipu rate limiting). Falling back to NVIDIA’s Qwen3-Coder was essentially seamless — no cliff-drop in code quality.
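
The cascade can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's actual implementation: `run_with_fallback`, `call_model`, and `Result` are hypothetical names, the status codes and attempt limit are copied from the runtime_fallback block, and reading max_fallback_attempts as "the primary plus up to three fallbacks" is an assumption. cooldown_seconds is left out because its exact semantics are not described here.

```python
from dataclasses import dataclass

# Values mirror the runtime_fallback block above.
RETRY_ON_ERRORS = {400, 429, 503, 529}
MAX_FALLBACK_ATTEMPTS = 3  # assumed to mean: primary + up to 3 fallbacks


@dataclass
class Result:
    model: str
    status: int


def run_with_fallback(chain, call_model):
    """Try the primary model, then cascade down the chain on retryable errors.

    `call_model` is a hypothetical callable that takes a model name and
    returns an HTTP-like status code.
    """
    candidates = chain[: 1 + MAX_FALLBACK_ATTEMPTS]
    last = None
    for model in candidates:
        status = call_model(model)
        last = Result(model, status)
        if status not in RETRY_ON_ERRORS:
            # Success, or an error that should not trigger a fallback.
            return last
    # Every attempted model returned a retryable error.
    return last


# Simulated run: the first two models are rate limited (429).
chain = ["glm-5.1", "glm-5-turbo", "glm-4.7", "qwen3-coder"]
statuses = {"glm-5.1": 429, "glm-5-turbo": 429, "glm-4.7": 200}
result = run_with_fallback(chain, lambda model: statuses.get(model, 200))
```

In this simulated run the request lands on glm-4.7, the first model in the chain that does not return a retryable status.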

The GLM-4.5-air Pitfall: Analyzes But Doesn’t Act

This was the most interesting issue discovered after two weeks of running.

In the initial config, GLM-4.5-air handled three types of tasks: quick (small changes), unspecified-low (low-complexity odd jobs), and writing (documentation). The positioning was a value model — fast, cheap, adequate.

But in practice, a fixed pattern emerged: ask it to change code, and it would diligently analyze what should be changed, why, and the impact of the changes — then simply not make them. Every response was “I suggest you modify XXX” rather than delivering the modified code.

This happened too often to be an intermittent issue.

I audited the config: glm-4.5-air appeared in 10 positions, 5 of which required writing code:

| Position | Role | Writes Code? |
|----------|------|--------------|
| sisyphus compaction | Context compression | No |
| atlas fallback | Todo management | No |
| librarian fallback | Document search | No |
| explore fallback | Code search | No |
| writing primary model | Write docs | No (writes text, not code) |
| quick primary model | Single file modification | Yes |
| unspecified-low primary model | Simple tasks | Yes |
| sisyphus-junior fallback | Task execution | Yes |
| deep fallback | Research + implementation | Yes |
| unspecified-high fallback | Complex tasks | Yes |

The strategy was clear: keep the 5 non-code positions as-is, and replace all 5 code-writing positions.

Free Model Pool

To replace them, I needed alternatives. Running opencode models returned 29 currently available free models (25 OpenRouter + 4 OpenCode built-in). I listed and analyzed all of them.

OpenRouter Free Models (25)

Primary Coding Models

| Model | Architecture | Coding Ability | Context | Tool Calling | Notes |
|-------|--------------|----------------|---------|--------------|-------|
| gemma-4-31b-it | Dense 31B | HumanEval+ 88.4%, LiveCodeBench 80.0% | 256K | Native | Strongest coder in free pool, BFCL 92.25% |
| llama-3.3-70b-instruct | Dense 70B | HumanEval 88.4% | 128K | Native | Large model, solid coding and reasoning |
| gemma-4-26b-a4b-it | MoE 26B (4B activated) | LiveCodeBench 77.1% | 256K | Native | 4B activation means speed, quality close to 31B |
| gemma-3-12b-it | Dense 12B | HumanEval 85.4% | 128K | Uncertain | Good coding scores, MATH 83.8%, but tool calling lacks official docs |
| nemotron-3-nano-30b-a3b | MoE 30B (3B activated) | HumanEval 78%, SWE-Bench 38.8% | 1M | Native | 1M context is a killer feature, tool calling BFCL 53.8% |

Gemma 4 26B is a MoE architecture — 26B total parameters but only 4B activated — meaning inference speed is close to a 4B model while coding quality approaches 30B level. Perfect for quick tasks (renaming variables, fixing imports). The 31B Dense version is an even stronger coder (HumanEval+ 88.4%), but slower.
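
The trade-off behind that choice can be written down as a small selection rule. The sketch below is only an illustration of the reasoning, with several assumptions: the `FreeModel` type and `pick_quick_model` function are hypothetical, the 75% LiveCodeBench cut-off is an arbitrary threshold chosen for the example, and "Uncertain" tool calling is treated as not usable. The figures are copied from the Primary Coding Models table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FreeModel:
    name: str
    active_params_b: float            # parameters activated per token, in billions
    live_code_bench: Optional[float]  # None where the table lists no score
    native_tool_calling: bool


# Figures copied from the "Primary Coding Models" table.
POOL = [
    FreeModel("gemma-4-31b-it", 31, 80.0, True),
    FreeModel("llama-3.3-70b-instruct", 70, None, True),
    FreeModel("gemma-4-26b-a4b-it", 4, 77.1, True),
    FreeModel("gemma-3-12b-it", 12, None, False),  # "Uncertain" treated as False
    FreeModel("nemotron-3-nano-30b-a3b", 3, None, True),
]

MIN_LIVE_CODE_BENCH = 75.0  # assumed cut-off, not a value from the article


def pick_quick_model(pool):
    """Among models that clear the quality bar, pick the fastest one.

    Speed is approximated by activated parameter count (smaller is faster).
    """
    eligible = [
        m for m in pool
        if m.native_tool_calling
        and m.live_code_bench is not None
        and m.live_code_bench >= MIN_LIVE_CODE_BENCH
    ]
    return min(eligible, key=lambda m: m.active_params_b)


choice = pick_quick_model(POOL)
```

Under these assumptions the rule selects gemma-4-26b-a4b-it: only the two Gemma 4 models clear the bar, and the MoE variant activates far fewer parameters.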

Reasoning Specialists

| Model | Architecture | Reasoning Ability | Context | Notes |
|-------|--------------|-------------------|---------|-------|
| gpt-oss-120b | MoE 120B | AIME 2025 92.1% | 128K | OpenAI open-source, extremely strong reasoning but mediocre coding (HumanEval 73%) |
| gpt-oss-20b | MoE 20B (3.6B activated) | AIME 2025 92.1% | 128K | Small footprint but same reasoning strength as 120B |
| nemotron-3-super-120b-a12b | MoE 120B (12B activated) | Strong reasoning | 256K | Balanced coding and reasoning |
| nemotron-3-nano-omni-30b-a3b-reasoning | MoE 31B (3B activated) | LiveCodeBench 63.2% | 256K | nano-30b’s multimodal + reasoning variant, adds vision/audio |

The GPT-OSS series AIME scores are impressive — 92.1% matching many closed-source models. But coding is the weak spot (HumanEval 73%), making them suitable for reasoning scenarios rather than coding.

Multimodal

| Model | Architecture | Capability | Context | Notes |
|-------|--------------|------------|---------|-------|
| nemotron-nano-12b-v2-vl | Dense 12B | Vision (image understanding) | | Only vision model in the free pool, used by multimodal-looker |
| gemma-3n-e4b-it | MatFormer ~4B | Vision | 32K | Too small, tool calling immature |
| gemma-3n-e2b-it | MatFormer ~2B | Vision | 32K | Even smaller, basically unusable for coding |

Vision model choices are limited. nemotron-nano-12b-v2-vl is the only free model that can handle images with reliable tool calling. The Gemma 3n series, though vision-capable, are too small (2-4B effective parameters), with LiveCodeBench at only 25.7%.

Mid-Range

| Model | Architecture | Coding Ability | Context | Notes |
|-------|--------------|----------------|---------|-------|
| gemma-3-27b-it | Dense 27B | LiveCodeBench 29.7% | 128K | Much better than older Gemma-2 27B, but far behind Gemma 4 series |
| nemotron-nano-9b-v2 | Dense 9B | Lightweight coding | | Lightweight but functional, good for fallback |
| glm-4.5-air | MoE 106B (12B activated) | Strong analysis, weak coding | | Analyzes without writing code, adequate for writing and search |
| minimax-m2.5 | | General purpose | | MiniMax product, decent general capability |

Too Small for Coding

| Model | Parameters | Notes |
|-------|------------|-------|
| gemma-3-4b-it | 4B | Weak coding ability |
| llama-3.2-3b-instruct | 3B | Too small |
| lfm-2.5-1.2b-instruct | 1.2B | AIME25 14%, basically can’t code |
| lfm-2.5-1.2b-thinking | 1.2B | Same, reasoning doesn’t help |
| laguna-m.1 | | From Poolside, capability unknown |
| laguna-xs.2 | | Same |
| dolphin-mistral-24b-venice-edition | 24B | Venice custom, uncensored but tool calling unclear |
| hermes-3-llama-3.1-405b | 405B | Large params but no tool calling, unusable for Agent scenarios |
| trinity-large-preview | | Unstable endpoint |

OpenCode Built-in Free Models (4)

| Model | Notes |
|-------|-------|
| big-pickle | General purpose, ultimate fallback |
| hy3-preview-free | Tencent Hunyuan, good for Chinese scenarios |
| minimax-m2.5-free | MiniMax product, general purpose |
| nematron-3-super-free | NVIDIA Nemotron 3 Super, balanced reasoning and coding |

OpenCode built-in models don’t have public benchmark data, but from actual usage they’re adequate as fallback safety nets. big-pickle is the most common fallback — any Agent can run on it, and while quality is average, it never hangs.

The Final Replacement Plan

quick / unspecified-low primary model:
  glm-4.5-air → gemma-4-26b-a4b-it:free

sisyphus-junior fallback:
  glm-4.5-air → nemotron-3-nano-30b-a3b:free

deep / unspecified-high fallback:
  glm-4.5-air → llama-3.3-70b-instruct:free

Side effect: quick and unspecified-low switched from paid to free models — small tasks now cost nothing.

Positions where I kept glm-4.5-air unchanged — context compression, document search, code search, Todo management, documentation writing — these are all read-only or pure text output scenarios where “analyzes but doesn’t act” is actually quite fitting.
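
As a sanity check on the plan, the replacement can be modeled as a simple mapping over the ten positions. This is a sketch for illustration only: the dictionary layout and the `apply_plan` helper are hypothetical and do not reflect the real config schema. The position names and target models are taken from the position table and the replacement plan.

```python
GLM_AIR = "glm-4.5-air"

# Position -> whether it has to write code, from the position table.
WRITES_CODE = {
    "sisyphus compaction": False,
    "atlas fallback": False,
    "librarian fallback": False,
    "explore fallback": False,
    "writing primary model": False,
    "quick primary model": True,
    "unspecified-low primary model": True,
    "sisyphus-junior fallback": True,
    "deep fallback": True,
    "unspecified-high fallback": True,
}

# Replacement plan from the listing above.
REPLACEMENTS = {
    "quick primary model": "gemma-4-26b-a4b-it:free",
    "unspecified-low primary model": "gemma-4-26b-a4b-it:free",
    "sisyphus-junior fallback": "nemotron-3-nano-30b-a3b:free",
    "deep fallback": "llama-3.3-70b-instruct:free",
    "unspecified-high fallback": "llama-3.3-70b-instruct:free",
}


def apply_plan(assignments, replacements):
    """Return a new position -> model mapping with the replacements applied."""
    return {pos: replacements.get(pos, model) for pos, model in assignments.items()}


before = {pos: GLM_AIR for pos in WRITES_CODE}
after = apply_plan(before, REPLACEMENTS)

# Positions that still use glm-4.5-air after the change.
still_on_air = sorted(pos for pos, model in after.items() if model == GLM_AIR)
```

After applying the mapping, `still_on_air` holds only the five non-code positions, which is the invariant the plan is meant to guarantee.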

Concurrency Control

The initial config had no concurrency limits — multiple Agents running simultaneously would often hit vendor rate limits. This time I configured concurrency based on model size and vendor constraints:

  • GLM-5.1 (flagship, slow): 3 concurrent
  • GLM-5-turbo / GLM-4.7 (mid-range): 5-8 concurrent
  • NVIDIA large models (480B, 253B): 2-3 concurrent
  • OpenRouter free: 3-8 concurrent (by model size)
  • OpenCode built-in free: 10-20 concurrent

In practice, rate limit hits have essentially stopped. The occasional 429 errors that used to happen now become seamless fallback switches thanks to the fallback chain.
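
The effect of a per-model cap can be illustrated with one semaphore per model. This is a minimal sketch and not how the plugin enforces its limits: `ModelGate`, `LIMITS`, and `fake_request` are hypothetical names, the model keys are illustrative, and where the list above gives a range the lower bound is used as an assumption.

```python
import asyncio

# Caps taken from the list above; lower bound used where a range is given.
LIMITS = {"glm-5.1": 3, "glm-5-turbo": 5, "nvidia-large": 2}


class ModelGate:
    """Caps in-flight requests per model with one semaphore each."""

    def __init__(self, limits):
        self._limits = dict(limits)
        self._semaphores = {}
        self.in_flight = {name: 0 for name in limits}
        self.peak = {name: 0 for name in limits}

    async def run(self, model, job):
        # Semaphores are created lazily, inside the running event loop.
        sem = self._semaphores.setdefault(
            model, asyncio.Semaphore(self._limits[model])
        )
        async with sem:
            self.in_flight[model] += 1
            self.peak[model] = max(self.peak[model], self.in_flight[model])
            try:
                return await job()
            finally:
                self.in_flight[model] -= 1


async def demo():
    gate = ModelGate(LIMITS)

    async def fake_request():
        await asyncio.sleep(0.01)  # stand-in for a real model call
        return "ok"

    # Ten Agents ask for glm-5.1 at once; only three run at any moment.
    results = await asyncio.gather(
        *(gate.run("glm-5.1", fake_request) for _ in range(10))
    )
    return gate.peak["glm-5.1"], len(results)
```

Running `asyncio.run(demo())` completes all ten requests while the number in flight never exceeds the cap of 3, so the vendor sees a bounded request rate instead of a burst.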

Current Complete Configuration

The restructured configuration (only showing free model and OpenCode built-in model related chains):

agents:
  sisyphus:     glm-5.1    → ... → gemma-4-31b:free → llama-3.3-70b:free → big-pickle
  hephaestus:   glm-5.1    → ... → gemma-4-31b:free → big-pickle
  oracle:       glm-5.1    → ... → gemma-4-31b:free
  prometheus:   glm-5.1    → ... → gemma-4-31b:free → llama-3.3-70b:free
  metis:        glm-5.1    → ... → gemma-4-31b:free
  momus:        glm-5.1    → ... → gemma-4-31b:free
  atlas:        glm-5-turbo → ... → gemma-4-26b:free → big-pickle
  librarian:    glm-4.7    → ... → gemma-4-26b:free → gpt-5-nano(opencode) → ling-2.6-flash(opencode)
  explore:      glm-4.7    → ... → gemma-4-26b:free → gpt-5-nano(opencode) → ling-2.6-flash(opencode)
  multimodal:   nemotron-nano-12b-v2-vl:free → nemotron-3-super:free → nematron-3-super-free(opencode)
  junior:       glm-5-turbo → ... → nemotron-3-nano-30b:free → gemma-4-31b:free → llama-3.3-70b:free → big-pickle

categories:
  visual-engineering: glm-5.1 → ... → gemma-4-31b:free → big-pickle
  ultrabrain:         glm-5.1 → ... → gemma-4-31b:free → llama-3.3-70b:free
  deep:               glm-5-turbo → ... → llama-3.3-70b:free → nemotron-3-nano-30b:free
  artistry:           glm-5.1 → ... → gemma-4-31b:free
  quick:              gemma-4-26b-a4b-it:free → nemotron-nano-9b:free → gpt-5-nano(opencode) → ling-2.6-flash(opencode)
  unspecified-low:    gemma-4-26b-a4b-it:free → nemotron-nano-9b:free → minimax-m2.5-free(opencode) → hy3(opencode) → ling-2.6-flash(opencode)
  unspecified-high:   glm-5-turbo → ... → gemma-4-31b:free → llama-3.3-70b:free → big-pickle
  writing:            glm-4.5-air → gemma-4-26b:free → minimax-m2.5-free(opencode) → ling-2.6-flash(opencode) → gpt-5-nano(opencode)

Lessons Learned

Model works ≠ model is suitable. GLM-4.5-air’s coding ability is fine — it just refuses to output code changes, only analysis. This behavioral quirk can’t be discovered through benchmarks; you have to use it to find out.

Free model pools change fast. Models that didn’t exist two weeks ago are now available, and those that were available before have been deprecated. Run opencode models periodically to see what’s new.
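
One lightweight way to track that churn is to save the model list at intervals and diff the snapshots. The sketch below assumes each snapshot is a list of model names, for example the saved output of opencode models with one name per line. `diff_pools` is a hypothetical helper and the snapshot contents are made up for illustration ("old-model-a" is a placeholder, not a real model).

```python
def diff_pools(previous, current):
    """Return (added, removed) model names between two snapshots."""
    prev, cur = set(previous), set(current)
    return sorted(cur - prev), sorted(prev - cur)


# Hypothetical snapshots taken two weeks apart.
two_weeks_ago = ["glm-4.5-air", "gemma-3-27b-it", "old-model-a"]
today = ["glm-4.5-air", "gemma-3-27b-it", "gemma-4-26b-a4b-it"]

added, removed = diff_pools(two_weeks_ago, today)
```

Anything in `removed` that still appears in a fallback chain is a dead entry worth cleaning up, and anything in `added` is a candidate for testing.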

Fallback chains aren’t a nice-to-have; they’re essential. Without fallback, a single 429 error can stall the entire Agent. With fallback in place, primary model rate limiting switches to alternatives seamlessly.

Concurrency limits must align with vendor constraints. Large models have inherently low concurrency limits (2-3). Without proper configuration, multiple Agents running simultaneously trigger rate limiting, then fallback, then fallback models also hit limits — cascading failure.

For the next phase, I want to try putting NVIDIA’s new models (DeepSeek-v4, Llama-4 Maverick) into the fallback chain for testing. I also plan to run a connectivity test to screen all usable OpenRouter free pool models and see if there’s a better option than Gemma 4 26B for quick tasks.