$ cat ./posty/devlogi/iris-devlog-1-qwen35-9b-winner.md

iris devlog #1 — qwen3.5:9b winner, 25 models tested

maj 31 02:36 | 5 min | autor: esej | #devlogi

Two days of testing. 25+ LLM models from 0.8B to 122B. Goal: find a model for my IRIS agent — Polish persona, native tool-calling, access to Obsidian vault. Result: qwen3.5:9b with think=True. 28 seconds per question, 7.5/10 quality, runs on two cards providing a combined 20 GB VRAM.

Hardware

  • Dual-socket Xeon E5-2690v3 (24 cores / 48 threads)
  • 96 GB DDR4 ECC
  • RTX 5070 12 GB (Blackwell) + Tesla P4 8 GB (Pascal)
  • Total 20 GB VRAM for Ollama
  • Windows 11 Pro PL, S3 sleep after 15 minutes idle

The idea is simple: cheap retired hardware from the secondary market (Xeon E5 + Tesla P4 for pennies) plus one modern gaming card. In theory, it should suffice for mid-range models. In practice — a fascinating clash of hardware with models.

Lesson #1: Dense 70B+ on 20 GB VRAM = RAM-bound garbage

  • llama3.3:70b — 0.9 tok/s, 5 out of 16 tests timed out
  • deepseek-r1:70b — 0.8 tok/s, 3 timeouts + HTTP 400 on native tool
  • mixtral:8x22b — 1.4 tok/s, FizzBuzz bug (wrong order of conditions)
  • qwen3-coder-next:80b-A3B — 5.6 tok/s, 0 timeouts (reference)

What happened: dense 70B in Q4 weighs ~40 GB. I have 20 GB VRAM. The rest? Goes to RAM and CPU. Memory bandwidth DDR4 (~50 GB/s) vs VRAM (~600 GB/s on 5070) = ~12× slower. Token generation with a large model becomes a tortoise on a chariot.

Conclusion: active params matter more than total params. MoE with 3-10B active will beat dense 30-70B with limited VRAM.

Lesson #2: MoE A3B on the edge of VRAM = sweet spot

  • qwen3.6:35b dense (35B / 35B active) — 10.2 tok/s
  • qwen3.5:35b-a3b MoE (35B / 3.6B active) — 9.6 tok/s
  • qwen3.5:122b-a10b (122B / 10B active) — 2.6 tok/s (75 GB Q4 = large spillover)

Surprise: dense 35B won against MoE 35B-A3B. Reason: both models in Q4 weigh 22 GB. They fit almost entirely into 20 GB VRAM. The bottleneck is memory bandwidth within VRAM, not active parameters.

Lesson #3: Tool calling is NOT free

Raw tok/s benchmark is only one-third of the story. Live test with a real agent loop (8 questions from vault MCP + math + code):

ModelTime/8qVault MCPMathCode
qwen3.6:35b-a3b19m54s✓ full exploration✓ with verification
qwen3.6:27b dense17m57schaoticFizzBuzz bug
qwen3:14b3m31sAccess denied failgenerator-in-print
qwen3.5:9b3m4sfinds✓ with self-correction✓ multi-line
qwen3.5:4b2mHALLUCINATINGmulti-line instead of 1-line
gemma4:e4bcrashAccess denieddidn't make it
ministral-3:14b2m40s✓ Clarify patternarithmetic errorbroken slice

Key takeaway: qwen3:14b was 5.7× faster than qwen3.6:35b-a3b, but tool calling completely failed. "Access denied" for vault, "script to download courses works, check API" instead of real code. Speed without quality = useless for an agent.

Lesson #4: Native function calling > strong but "dumb" models

Native function calling = the model learned the JSON tool call format during training. Without it, the model guesses what to do — sometimes well, more often hallucinating paths or ignoring tools and responding with imagination.

  • Qwen3.5/3.6 — entire family, BFCL ~62-72, stable Polish
  • Mistral / Ministral — native FC OK, but Chinese characters on truncation
  • Nemotron — NVIDIA tuning on BFCL, top reasoning
  • Granite 4.1 — BFCL 68.27, but no official Polish (disqualifier)
  • Gemma 4 — weaker FC stability, vault MCP fail, word hallucinations ("Sardynegry")
  • Hermes-4-14B GGUF — broken chat template, generates empty responses

Lesson #5: Multi-specialist at 20 GB VRAM = myth

The idea seemed brilliant: instead of one "do-it-all model" — several specialists. Math to Nemotron-3-Nano-4B (AIME 89.1), code to Qwen2.5-Coder-7B (HumanEval 88.4), router qwen3.5:9b.

Problem: each specialist ~3-5 GB Q4 + KV cache ~5 GB. Plus router. Plus aux. We quickly exceed 20 GB VRAM. Ollama starts evict & reload — 30-second timeouts, 500 errors, swap loop.

GPU available: 0.7 GB ← almost zero
"model requires more gpu memory, evicting"
after evict: 10.1 GB free
new model loaded
500 ERROR (reload didn't finish in 30s)

Conclusion: multi-specialist makes sense for datacenter VRAM (80 GB+). For 20 GB, one universal model with good FC beats the specialist idea.

Lesson #6: Polish and folklore hallucinations

Every LLM, regardless of size, has a common problem with Polish: proverbs are made up.

Test question: "Tell me a Polish proverb about chickens that don't hatch."

  • qwen3.6:35b-a3b — "A hen laying eggs doesn't brood"
  • qwen3.5:9b — "A hen that doesn't lay eggs doesn't feed on grain — it just gets angry"
  • qwen3:14b — "Hens don't hatch — mushrooms don't hatch"
  • gemma4:e4b — "Hens don't hatch — mushrooms don't hatch"
  • ministral-3:14b — "From eggs of chicks without heating, none will hatch early in the morning"

The real (known to me) proverbs are "Don't praise the day before sunset" or "Don't count chicks before hatching". No model knew this. Fundamental gap in training data for Polish idioms.

The only model capable of defending itself by admitting ignorance: qwen3.5:9b with think=True and a properly written persona — "I don't make up what I don't know". Plus it honestly noted "thought about OR" when making things up.

Final stack

model: qwen3.5:9b              # router + aux + heavy
context_length: 262144         # 256K native Qwen3.5
kv_cache_type: q4_0            # compact
think: True                    # CoT compensates for 9B capacity
reasoning_effort: medium       # balanced
delegation: disabled           # one model = no swap
auxiliary_models: same         # qwen3.5:9b for compression/title/curator

VRAM balance: qwen3.5:9b Q4 ~6 GB + KV cache 256K q4_0 ~10 GB = 16 GB in 20 GB VRAM (4 GB buffer).

Performance: 8 questions in 3m 4s. 25 questions in 11m 32s. Vs qwen3.6:35b-a3b (highest quality) — 5× faster.

Quality breakdown (25 tests):

  • Math 4/4 (with self-correction via think=True)
  • Technical knowledge (CAP, HTTP/2, race condition, GC)
  • Code: JS debounce, SQL OFFSET, Bash one-liner (with a minor du bug)
  • Vault MCP — finds files, reads daily notes
  • Memory recall — USER profile + journal
  • Polish proverbs always made up
  • Translation sometimes imperfect ("Distributed systems" → "networking")
  • Haiku doesn't keep 5-7-5

TL;DR for the impatient

HardwareModelTrade-off
8 GB VRAMqwen3.5:4b or gemma4:e4bSpeed, poor quality
12 GB VRAMqwen3.5:9b with think=TrueSweet spot
20 GB VRAMqwen3.6:35b-a3b MoEMaximum quality, slow
24+ GB VRAMHeavy MoE with native FCDatacenter territory
80+ GB VRAMMulti-specialist senseNot a home use case

For a home AI agent with a Polish persona + vault MCP + native tool calling, in 2026, on ~20 GB VRAM: qwen3.5:9b with think=True is the choice.

And one more thing: don't try 14B base models for agent loops. They are fast, but in tests with a real agent they fail at things where 9B with think=True does well. Sometimes less + thinking > more without.

IRIS lives. Automatic setup — server sleeps after 15 minutes of idle, remembers what we talked about, has a Polish persona with dry sarcasm. Like a real agent.