iris devlog #1 — qwen3.5:9b winner, 25 models tested

Two days of testing. 25+ LLM models from 0.8B to 122B. Goal: find a model for my IRIS agent — Polish persona, native tool-calling, access to Obsidian vault. Result: qwen3.5:9b with think=True. 28 seconds per question, 7.5/10 quality, runs on two cards providing a combined 20 GB VRAM.

Hardware

Dual-socket Xeon E5-2690v3 (24 cores / 48 threads)
96 GB DDR4 ECC
RTX 5070 12 GB (Blackwell) + Tesla P4 8 GB (Pascal)
Total 20 GB VRAM for Ollama
Windows 11 Pro PL, S3 sleep after 15 minutes idle

The idea is simple: cheap retired hardware from the secondary market (Xeon E5 + Tesla P4 for pennies) plus one modern gaming card. In theory, it should suffice for mid-range models. In practice — a fascinating clash of hardware with models.

Lesson #1: Dense 70B+ on 20 GB VRAM = RAM-bound garbage

llama3.3:70b — 0.9 tok/s, 5 out of 16 tests timed out
deepseek-r1:70b — 0.8 tok/s, 3 timeouts + HTTP 400 on native tool
mixtral:8x22b — 1.4 tok/s, FizzBuzz bug (wrong order of conditions)
qwen3-coder-next:80b-A3B — 5.6 tok/s, 0 timeouts (reference)

What happened: dense 70B in Q4 weighs ~40 GB. I have 20 GB VRAM. The rest? Goes to RAM and CPU. Memory bandwidth DDR4 (~50 GB/s) vs VRAM (~600 GB/s on 5070) = ~12× slower. Token generation with a large model becomes a tortoise on a chariot.

Conclusion: active params matter more than total params. MoE with 3-10B active will beat dense 30-70B with limited VRAM.

Lesson #2: MoE A3B on the edge of VRAM = sweet spot

qwen3.6:35b dense (35B / 35B active) — 10.2 tok/s
qwen3.5:35b-a3b MoE (35B / 3.6B active) — 9.6 tok/s
qwen3.5:122b-a10b (122B / 10B active) — 2.6 tok/s (75 GB Q4 = large spillover)

Surprise: dense 35B won against MoE 35B-A3B. Reason: both models in Q4 weigh 22 GB. They fit almost entirely into 20 GB VRAM. The bottleneck is memory bandwidth within VRAM, not active parameters.

Lesson #3: Tool calling is NOT free

Raw tok/s benchmark is only one-third of the story. Live test with a real agent loop (8 questions from vault MCP + math + code):

Model	Time/8q	Vault MCP	Math	Code
qwen3.6:35b-a3b	19m54s	✓ full exploration	✓ with verification	✓
qwen3.6:27b dense	17m57s	chaotic	✓	FizzBuzz bug
qwen3:14b	3m31s	Access denied fail	✓	generator-in-print
qwen3.5:9b	3m4s	finds	✓ with self-correction	✓ multi-line
qwen3.5:4b	2m	HALLUCINATING	✓	multi-line instead of 1-line
gemma4:e4b	crash	Access denied	✓	didn't make it
ministral-3:14b	2m40s	✓ Clarify pattern	arithmetic error	broken slice

Key takeaway: qwen3:14b was 5.7× faster than qwen3.6:35b-a3b, but tool calling completely failed. "Access denied" for vault, "script to download courses works, check API" instead of real code. Speed without quality = useless for an agent.

Lesson #4: Native function calling > strong but "dumb" models

Native function calling = the model learned the JSON tool call format during training. Without it, the model guesses what to do — sometimes well, more often hallucinating paths or ignoring tools and responding with imagination.

Qwen3.5/3.6 — entire family, BFCL ~62-72, stable Polish
Mistral / Ministral — native FC OK, but Chinese characters on truncation
Nemotron — NVIDIA tuning on BFCL, top reasoning
Granite 4.1 — BFCL 68.27, but no official Polish (disqualifier)
Gemma 4 — weaker FC stability, vault MCP fail, word hallucinations ("Sardynegry")
Hermes-4-14B GGUF — broken chat template, generates empty responses

Lesson #5: Multi-specialist at 20 GB VRAM = myth

The idea seemed brilliant: instead of one "do-it-all model" — several specialists. Math to Nemotron-3-Nano-4B (AIME 89.1), code to Qwen2.5-Coder-7B (HumanEval 88.4), router qwen3.5:9b.

Problem: each specialist ~3-5 GB Q4 + KV cache ~5 GB. Plus router. Plus aux. We quickly exceed 20 GB VRAM. Ollama starts evict & reload — 30-second timeouts, 500 errors, swap loop.

GPU available: 0.7 GB ← almost zero
"model requires more gpu memory, evicting"
after evict: 10.1 GB free
new model loaded
500 ERROR (reload didn't finish in 30s)

Conclusion: multi-specialist makes sense for datacenter VRAM (80 GB+). For 20 GB, one universal model with good FC beats the specialist idea.

Lesson #6: Polish and folklore hallucinations

Every LLM, regardless of size, has a common problem with Polish: proverbs are made up.

Test question: "Tell me a Polish proverb about chickens that don't hatch."

qwen3.6:35b-a3b — "A hen laying eggs doesn't brood"
qwen3.5:9b — "A hen that doesn't lay eggs doesn't feed on grain — it just gets angry"
qwen3:14b — "Hens don't hatch — mushrooms don't hatch"
gemma4:e4b — "Hens don't hatch — mushrooms don't hatch"
ministral-3:14b — "From eggs of chicks without heating, none will hatch early in the morning"

The real (known to me) proverbs are "Don't praise the day before sunset" or "Don't count chicks before hatching". No model knew this. Fundamental gap in training data for Polish idioms.

The only model capable of defending itself by admitting ignorance: qwen3.5:9b with think=True and a properly written persona — "I don't make up what I don't know". Plus it honestly noted "thought about OR" when making things up.

Final stack

model: qwen3.5:9b              # router + aux + heavy
context_length: 262144         # 256K native Qwen3.5
kv_cache_type: q4_0            # compact
think: True                    # CoT compensates for 9B capacity
reasoning_effort: medium       # balanced
delegation: disabled           # one model = no swap
auxiliary_models: same         # qwen3.5:9b for compression/title/curator

VRAM balance: qwen3.5:9b Q4 ~6 GB + KV cache 256K q4_0 ~10 GB = 16 GB in 20 GB VRAM (4 GB buffer).

Performance: 8 questions in 3m 4s. 25 questions in 11m 32s. Vs qwen3.6:35b-a3b (highest quality) — 5× faster.

Quality breakdown (25 tests):

Math 4/4 (with self-correction via think=True)
Technical knowledge (CAP, HTTP/2, race condition, GC)
Code: JS debounce, SQL OFFSET, Bash one-liner (with a minor du bug)
Vault MCP — finds files, reads daily notes
Memory recall — USER profile + journal
Polish proverbs always made up
Translation sometimes imperfect ("Distributed systems" → "networking")
Haiku doesn't keep 5-7-5

TL;DR for the impatient

Hardware	Model	Trade-off
8 GB VRAM	qwen3.5:4b or gemma4:e4b	Speed, poor quality
12 GB VRAM	qwen3.5:9b with think=True	Sweet spot
20 GB VRAM	qwen3.6:35b-a3b MoE	Maximum quality, slow
24+ GB VRAM	Heavy MoE with native FC	Datacenter territory
80+ GB VRAM	Multi-specialist sense	Not a home use case

For a home AI agent with a Polish persona + vault MCP + native tool calling, in 2026, on ~20 GB VRAM: qwen3.5:9b with think=True is the choice.

And one more thing: don't try 14B base models for agent loops. They are fast, but in tests with a real agent they fail at things where 9B with think=True does well. Sometimes less + thinking > more without.

IRIS lives. Automatic setup — server sleeps after 15 minutes of idle, remembers what we talked about, has a Polish persona with dry sarcasm. Like a real agent.

→ more about esej.space

← cutty.dev devlog #4 — marathon of 7 PRs pre-launch · ls ../ · cutty.dev devlog #8 — pre-launch finale →

$ cat ./komentarze/iris-devlog-1-qwen35-9b-winner.log // 0 comments