iris devlog #1 — qwen3.5:9b winner, 25 models tested
Two days of testing. 25+ LLM models from 0.8B to 122B. Goal: find a model for my IRIS agent — Polish persona, native tool-calling, access to Obsidian vault. Result: qwen3.5:9b with think=True. 28 seconds per question, 7.5/10 quality, runs on two cards providing a combined 20 GB VRAM.
Hardware
- Dual-socket Xeon E5-2690v3 (24 cores / 48 threads)
- 96 GB DDR4 ECC
- RTX 5070 12 GB (Blackwell) + Tesla P4 8 GB (Pascal)
- Total 20 GB VRAM for Ollama
- Windows 11 Pro PL, S3 sleep after 15 minutes idle
The idea is simple: cheap retired hardware from the secondary market (Xeon E5 + Tesla P4 for pennies) plus one modern gaming card. In theory, it should suffice for mid-range models. In practice — a fascinating clash of hardware with models.
Lesson #1: Dense 70B+ on 20 GB VRAM = RAM-bound garbage
- llama3.3:70b — 0.9 tok/s, 5 out of 16 tests timed out
- deepseek-r1:70b — 0.8 tok/s, 3 timeouts + HTTP 400 on native tool
- mixtral:8x22b — 1.4 tok/s, FizzBuzz bug (wrong order of conditions)
- qwen3-coder-next:80b-A3B — 5.6 tok/s, 0 timeouts (reference)
What happened: dense 70B in Q4 weighs ~40 GB. I have 20 GB VRAM. The rest? Goes to RAM and CPU. Memory bandwidth DDR4 (~50 GB/s) vs VRAM (~600 GB/s on 5070) = ~12× slower. Token generation with a large model becomes a tortoise on a chariot.
Conclusion: active params matter more than total params. MoE with 3-10B active will beat dense 30-70B with limited VRAM.
Lesson #2: MoE A3B on the edge of VRAM = sweet spot
- qwen3.6:35b dense (35B / 35B active) — 10.2 tok/s
- qwen3.5:35b-a3b MoE (35B / 3.6B active) — 9.6 tok/s
- qwen3.5:122b-a10b (122B / 10B active) — 2.6 tok/s (75 GB Q4 = large spillover)
Surprise: dense 35B won against MoE 35B-A3B. Reason: both models in Q4 weigh 22 GB. They fit almost entirely into 20 GB VRAM. The bottleneck is memory bandwidth within VRAM, not active parameters.
Lesson #3: Tool calling is NOT free
Raw tok/s benchmark is only one-third of the story. Live test with a real agent loop (8 questions from vault MCP + math + code):
| Model | Time/8q | Vault MCP | Math | Code |
|---|---|---|---|---|
| qwen3.6:35b-a3b | 19m54s | ✓ full exploration | ✓ with verification | ✓ |
| qwen3.6:27b dense | 17m57s | chaotic | ✓ | FizzBuzz bug |
| qwen3:14b | 3m31s | Access denied fail | ✓ | generator-in-print |
| qwen3.5:9b | 3m4s | finds | ✓ with self-correction | ✓ multi-line |
| qwen3.5:4b | 2m | HALLUCINATING | ✓ | multi-line instead of 1-line |
| gemma4:e4b | crash | Access denied | ✓ | didn't make it |
| ministral-3:14b | 2m40s | ✓ Clarify pattern | arithmetic error | broken slice |
Key takeaway: qwen3:14b was 5.7× faster than qwen3.6:35b-a3b, but tool calling completely failed. "Access denied" for vault, "script to download courses works, check API" instead of real code. Speed without quality = useless for an agent.
Lesson #4: Native function calling > strong but "dumb" models
Native function calling = the model learned the JSON tool call format during training. Without it, the model guesses what to do — sometimes well, more often hallucinating paths or ignoring tools and responding with imagination.
- Qwen3.5/3.6 — entire family, BFCL ~62-72, stable Polish
- Mistral / Ministral — native FC OK, but Chinese characters on truncation
- Nemotron — NVIDIA tuning on BFCL, top reasoning
- Granite 4.1 — BFCL 68.27, but no official Polish (disqualifier)
- Gemma 4 — weaker FC stability, vault MCP fail, word hallucinations ("Sardynegry")
- Hermes-4-14B GGUF — broken chat template, generates empty responses
Lesson #5: Multi-specialist at 20 GB VRAM = myth
The idea seemed brilliant: instead of one "do-it-all model" — several specialists. Math to Nemotron-3-Nano-4B (AIME 89.1), code to Qwen2.5-Coder-7B (HumanEval 88.4), router qwen3.5:9b.
Problem: each specialist ~3-5 GB Q4 + KV cache ~5 GB. Plus router. Plus aux. We quickly exceed 20 GB VRAM. Ollama starts evict & reload — 30-second timeouts, 500 errors, swap loop.
GPU available: 0.7 GB ← almost zero "model requires more gpu memory, evicting" after evict: 10.1 GB free new model loaded 500 ERROR (reload didn't finish in 30s)
Conclusion: multi-specialist makes sense for datacenter VRAM (80 GB+). For 20 GB, one universal model with good FC beats the specialist idea.
Lesson #6: Polish and folklore hallucinations
Every LLM, regardless of size, has a common problem with Polish: proverbs are made up.
Test question: "Tell me a Polish proverb about chickens that don't hatch."
- qwen3.6:35b-a3b — "A hen laying eggs doesn't brood"
- qwen3.5:9b — "A hen that doesn't lay eggs doesn't feed on grain — it just gets angry"
- qwen3:14b — "Hens don't hatch — mushrooms don't hatch"
- gemma4:e4b — "Hens don't hatch — mushrooms don't hatch"
- ministral-3:14b — "From eggs of chicks without heating, none will hatch early in the morning"
The real (known to me) proverbs are "Don't praise the day before sunset" or "Don't count chicks before hatching". No model knew this. Fundamental gap in training data for Polish idioms.
The only model capable of defending itself by admitting ignorance: qwen3.5:9b with think=True and a properly written persona — "I don't make up what I don't know". Plus it honestly noted "thought about OR" when making things up.
Final stack
model: qwen3.5:9b # router + aux + heavy
context_length: 262144 # 256K native Qwen3.5
kv_cache_type: q4_0 # compact
think: True # CoT compensates for 9B capacity
reasoning_effort: medium # balanced
delegation: disabled # one model = no swap
auxiliary_models: same # qwen3.5:9b for compression/title/curator
VRAM balance: qwen3.5:9b Q4 ~6 GB + KV cache 256K q4_0 ~10 GB = 16 GB in 20 GB VRAM (4 GB buffer).
Performance: 8 questions in 3m 4s. 25 questions in 11m 32s. Vs qwen3.6:35b-a3b (highest quality) — 5× faster.
Quality breakdown (25 tests):
- Math 4/4 (with self-correction via think=True)
- Technical knowledge (CAP, HTTP/2, race condition, GC)
- Code: JS debounce, SQL OFFSET, Bash one-liner (with a minor du bug)
- Vault MCP — finds files, reads daily notes
- Memory recall — USER profile + journal
- Polish proverbs always made up
- Translation sometimes imperfect ("Distributed systems" → "networking")
- Haiku doesn't keep 5-7-5
TL;DR for the impatient
| Hardware | Model | Trade-off |
|---|---|---|
| 8 GB VRAM | qwen3.5:4b or gemma4:e4b | Speed, poor quality |
| 12 GB VRAM | qwen3.5:9b with think=True | Sweet spot |
| 20 GB VRAM | qwen3.6:35b-a3b MoE | Maximum quality, slow |
| 24+ GB VRAM | Heavy MoE with native FC | Datacenter territory |
| 80+ GB VRAM | Multi-specialist sense | Not a home use case |
For a home AI agent with a Polish persona + vault MCP + native tool calling, in 2026, on ~20 GB VRAM: qwen3.5:9b with think=True is the choice.
And one more thing: don't try 14B base models for agent loops. They are fast, but in tests with a real agent they fail at things where 9B with think=True does well. Sometimes less + thinking > more without.
IRIS lives. Automatic setup — server sleeps after 15 minutes of idle, remembers what we talked about, has a Polish persona with dry sarcasm. Like a real agent.