We Tested Local Models So You Don't Have To
· Hivemind Team
The Model That Looks Best on Paper Might Not Be the One That Works
We spent the last few days doing something every self-hosted AI platform eventually has to do: figuring out which local models actually work in production. Not on benchmarks, not in demos — in real agent workflows with real tool calling.
Here's what we learned.
We Started With the Obvious Choice
We began with the Unsloth Dynamic GGUF of Qwen3-Coder 30B. On paper, it's excellent — top-tier Aider Polyglot scores, near-lossless quantization, massive community buzz. In practice? Our agents were hallucinating tool calls. Instead of invoking functions through the actual tool-calling mechanism, the model was narrating what it would do: writing `[Executing: shell ls -la]` in plain text instead of making real API calls.
The tool definitions were right. The system prompt was right. The model just… refused to use the tools it was given, preferring to role-play using them instead.
If you've ever debugged an agent that "works" but never actually does anything, you know how maddening this is. The logs look reasonable. The model sounds confident. But nothing happens.
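To make the failure concrete, here's a minimal sketch of the two response shapes (field names follow the OpenAI-style chat message format our agent loop consumes; the helper and payloads are illustrative, not our production code):

```python
# Illustrative sketch: telling a real tool call apart from a narrated one.
# Assumes OpenAI-style assistant messages with an optional "tool_calls" field.

def used_real_tool_call(message: dict) -> bool:
    """True if the model populated tool_calls, False if it only described the call."""
    return bool(message.get("tool_calls"))

# What we wanted: empty content plus a structured tool call.
real = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"function": {"name": "shell", "arguments": '{"cmd": "ls -la"}'}}],
}

# What we got with the mismatched template: a role-played call in plain text.
narrated = {
    "role": "assistant",
    "content": "[Executing: shell ls -la]",
    "tool_calls": [],
}

assert used_real_tool_call(real)
assert not used_real_tool_call(narrated)
```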
The Fix Wasn't the Model Weights. It Was the Chat Template.
We switched to the Ollama-native `qwen3-coder:30b` — same architecture, same parameter count, same quantization level — and tool calling worked perfectly on the first try.
The difference was Ollama's built-in chat template, which is tested against their tool-calling pipeline. The Unsloth GGUF uses a different template that Ollama didn't handle correctly. The model wasn't broken. The way it was being prompted was.
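For the curious, a minimal sketch of the kind of smoke test we ran against Ollama's `/api/chat` endpoint (the `shell` tool definition here is illustrative):

```python
import requests

# Assumes a default local Ollama install; adjust the URL and model tag for your setup.
OLLAMA_URL = "http://localhost:11434/api/chat"

tools = [{
    "type": "function",
    "function": {
        "name": "shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string", "description": "Command to run"}},
            "required": ["cmd"],
        },
    },
}]

payload = {
    "model": "qwen3-coder:30b",
    "messages": [{"role": "user", "content": "List the files in the current directory."}],
    "tools": tools,
    "stream": False,
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=120).json()

# With the Ollama-native package, tool_calls is populated instead of narrated text.
print(resp["message"].get("tool_calls"))
```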
This is the kind of thing that doesn't show up in any benchmark. A model can score 90% on HumanEval and still be completely unusable if the chat template mangles the tool-calling tokens.
Takeaway: If you're running models through Ollama, prefer Ollama-native model packages over third-party GGUFs. The chat template compatibility is worth more than marginal quantization improvements.
The Settings Matter as Much as the Model
Even with the right model and the right template, we weren't done. Ollama's default temperature is ~0.8. For tool calling, that's asking the model to get creative with function names and parameter values. We dropped it to 0.1 and the hallucinations stopped.
We also learned that Ollama defaults to a tiny context window (often 2048 tokens) unless you explicitly set `num_ctx`. With 9 tools and a detailed system prompt, that was silently truncating the prompt and producing garbage. No error, no warning — just an agent that couldn't see half its instructions.
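One rough way to catch this early, since Ollama's non-streaming responses report `prompt_eval_count`: flag any request whose evaluated prompt tokens sit right at the `num_ctx` ceiling. A sketch with an illustrative threshold:

```python
def probably_truncated(response: dict, num_ctx: int, margin: float = 0.95) -> bool:
    """Rough heuristic: if the evaluated prompt tokens sit at or near num_ctx,
    the prompt was likely cut off. Tune the margin for your setup."""
    prompt_tokens = response.get("prompt_eval_count", 0)
    return prompt_tokens >= int(num_ctx * margin)

# Example: a /api/chat response while num_ctx was still at a 2048-token default.
resp = {"prompt_eval_count": 2045, "eval_count": 120}
if probably_truncated(resp, num_ctx=2048):
    print("Warning: prompt likely truncated; raise num_ctx.")
```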
The working config:
```
temperature: 0.1
top_p: 0.8
top_k: 20
repeat_penalty: 1.05
num_ctx: 32768
```
Every one of these settings was the result of a debugging session. The defaults are tuned for chat, not for structured tool calling. If you're running agents locally, you need to tune them explicitly.
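If you're using the `ollama` Python client, the same settings can be passed per request via `options` (they can also be baked into a Modelfile); a minimal sketch:

```python
import ollama

# The values mirror the working config above; applied per request so defaults
# never silently creep back in.
OPTIONS = {
    "temperature": 0.1,
    "top_p": 0.8,
    "top_k": 20,
    "repeat_penalty": 1.05,
    "num_ctx": 32768,
}

response = ollama.chat(
    model="qwen3-coder:30b",
    messages=[{"role": "user", "content": "List the files in the current directory."}],
    options=OPTIONS,
)
```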
MoE Architecture Is a Game-Changer for Local Inference
Here's the reason Qwen3-Coder 30B is even viable on consumer hardware: it's a Mixture-of-Experts model. 30B total parameters, but only 3.3B active per token. On an M4 Max with 32GB unified memory, it's remarkably fast. It feels nothing like running a traditional 30B dense model.
This is the architectural shift that makes self-hosted AI agents practical. You get the quality of a large model with the inference speed of a small one. The trade-off is memory — you still need to fit all 30B parameters in RAM — but on Apple Silicon with unified memory, that's increasingly feasible.
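Back-of-envelope numbers on why this fits in 32GB (rough estimates that depend on the quant and KV cache, not measured figures):

```python
# Why a 30B MoE is workable on a 32GB machine: all experts live in memory,
# but only a small fraction of parameters runs per token.
# Assumes roughly 4.5 bits/parameter for a Q4-class GGUF quant.

total_params = 30e9      # every expert must be resident in RAM
active_params = 3.3e9    # parameters touched per token (drives per-token compute)
bits_per_param = 4.5

weights_gb = total_params * bits_per_param / 8 / 1e9
print(f"Weights: ~{weights_gb:.0f} GB")                          # roughly 17 GB
print(f"Active fraction: {active_params / total_params:.0%}")    # roughly 11% per token
```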
For HiveMentality, this means our users can run capable coding agents on a MacBook Pro. No cloud API calls, no per-token costs, no data leaving the machine. That's the promise of self-hosted AI, and MoE models are what makes it real.
What We're Building Next
We're designing HiveMentality's backend adapter to support any OpenAI-compatible endpoint — not just Ollama. One universal adapter, many backends. Users configure a URL, we auto-discover available models via `/v1/models`, and the router picks the best backend for each task. No more hardcoded integrations.
The goal is simple: you should be able to point Hivemind at any local inference server — Ollama, llama.cpp, vLLM, TGI, LM Studio — and have it just work. Same agent configs, same tool definitions, same Mission Control dashboard. The backend becomes an implementation detail.
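The discovery piece itself is small. A sketch of the idea, with illustrative local endpoints rather than anything we ship:

```python
import requests

# Hypothetical backend list; the ports are common local defaults, not shipped config.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
    "lmstudio": "http://localhost:1234/v1",
}

def discover_models(base_url: str) -> list[str]:
    """List model IDs from any OpenAI-compatible /v1/models endpoint."""
    resp = requests.get(f"{base_url}/models", timeout=5)
    resp.raise_for_status()
    return [m["id"] for m in resp.json().get("data", [])]

for name, url in BACKENDS.items():
    try:
        print(name, discover_models(url))
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```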
The Gap Nobody Talks About
The local AI ecosystem is moving fast. Models that didn't exist three months ago are now matching proprietary alternatives on coding benchmarks. But "matching on benchmarks" and "working reliably in production agent loops" are two very different things.
The gap between them is chat templates, sampling parameters, and context window configuration — boring infrastructure work that nobody talks about. It's not glamorous. It doesn't make for good Twitter threads. But it's the difference between a demo that looks impressive and an agent that actually ships code.
We'll keep sharing what we learn. If you're running local models for agent workflows, join the Discord — we're building a community of people figuring this out together.
The best model is the one that works. Not the one that benchmarks best.