Intuition Labs · φ research · 21 · measured · 19 may 2026

one switch changed everything

We built a coding agent around our local model. It started failing tasks that should have been trivial. The cause was a single default setting: the model deliberates at length before every action. For an agent loop where each step is a quick shell command, that deliberation pays its cost on every single turn. Flipping that one switch made the agent 5.5x faster on the previously-failing tasks and turned three timeouts into three clean passes.

5.5× faster · 3-of-3 reliability flip
density = pass-rate ÷ wall-seconds  ·  same model, same harness, same prompts — one substrate switch

the takeaway in one paragraph

A modern open-source model has two response modes: think-out-loud-first (the model spends time deliberating before answering) and answer-directly. For an interactive Q&A, deliberation usually helps. For an agentic tool-use loop — where the agent issues one shell command per step, gets the result, and decides the next step — deliberation is overhead on every single turn. Our agent uses 8-15 turns on a coding task; if each turn pays 30-60 seconds of upfront deliberation, the budget evaporates. Our local model has deliberation on by default. Telling it 'just answer directly' for agent calls (not for chat — chat still benefits from thinking) made the agent 5.5× faster on tasks that previously timed out, and flipped 3 cases from fail to pass.

the paired flip · the cleanest evidence

The simplest experiment to interpret is one variable changed, everything else held constant. We ran our coding agent against three tasks (a fizzbuzz script, a file-processing script, a git-repo setup) at default settings. The agent failed all three by timing out at 120 seconds per model call. The model often had the right answer mid-loop — on word-count, it ran 'wc /app/input.txt', got the exact correct counts back, then timed out trying to write a small Python script around those numbers.

Then we flipped one setting: tell the model to answer directly on agent calls, not deliberate. We re-ran the exact same three tasks. All three passed in 23-25 seconds each:

taskbefore (thinking on)after (thinking off)wall ratio
fizzbuzz (3-line script)0.0 · 133s timeout1.0 · 24s pass5.5×
word-count (file I/O)0.0 · 141s timeout1.0 · 25s pass5.6×
git-init (repo setup)0.0 · 133s timeout1.0 · 23s pass5.8×

Same agent code. Same model. Same prompts. Same hardware. Same hour. The only thing different is one boolean in the request body. The result: 3-of-3 reliability flip, average 5.5× wall reduction, three reward-zero outcomes turned into three reward-one outcomes.

why thinking-mode was the bottleneck

The local model has a chat-template mode called 'enable_thinking'. When ON (the default), the model emits a long deliberation block before its actual answer. For chat use, this is great — you can see the reasoning. For an agent that's about to translate one observation into one command, the deliberation is pure overhead. Worse, the deliberation can grow with prompt complexity — a longer transcript means longer thinking, which can exceed the per-call timeout cap.

The fail traces tell the story. On word-count, step 1 was 'wc /app/input.txt' — fast, answer back: '5 21 125'. Step 2 was supposed to translate that into a tiny Python script that prints those numbers. The model started its thinking block and didn't finish it before the 120-second per-call cap fired. The script never got written. From the agent's perspective, the model just stopped responding mid-task.

Two failure modes have the same root cause but they look very different. The popular general-purpose agent framework hits this from the opposite direction: its turn budgets are much bigger, so the model has room to deliberate, but its outer task-timeout (6 minutes by default) eventually catches up. We saw it time out on the same word-count task too. Same model, same problem, different shape of overrun.

the switch (and what it cost)

Our agent now passes 'enable_thinking: false' on every model call inside the agent loop. The model answers directly, without the long deliberation prefix. We did NOT change this globally — chat-style queries (the ones that benefit from deliberation) still default to thinking-on. The agent's calls are tagged as agent calls, and the agent's policy is 'answer directly, you can think on retry if you fail.'

The implementation is one extra field on each request and one merge in the proxy. About fifteen lines of code. No model changes, no infrastructure changes.

Cost: chat responses now require a separate parameter to enable thinking when a user wants reasoning shown. The agent's calls and the user's chat calls live behind the same proxy but route different defaults. That separation is itself a useful pattern — agent intent is different from human intent.

the variance discovery

We tried to also do a head-to-head comparison: same model, our agent vs a popular general-purpose framework. We expected to publish density numbers (pass-rate per second). What we found instead surprised us: the popular framework is bimodally distributed on the harder tasks. On word-count, three trials gave a fast pass at 57 seconds, a full-budget timeout at 380 seconds, and another timeout at 383 seconds. Same task, same harness, same model, same prompt.

On fizzbuzz, four trials gave 47/48/58 second passes and one 322-second timeout. The substrate variance is real — running once gives you a single sample of a wide distribution, and any single-trial comparison can mislead by 5-6× on wall time or flip a pass to a fail entirely.

Our agent with thinking-off showed much tighter variance: when it passes, it passes in 23-25 seconds, consistently. That tighter band IS part of the win. Reliability is pass-rate AND predictability of wall-time. A harness that solves a task in 'somewhere between 47 and 322 seconds' is operationally different from one that solves it in 'about 24 seconds.'

what we learned

lessonwhy it matters
Substrate tuning beats harness choice (in this regime)Most of the speedup came from a model setting, not the loop design. Tighter loops compound it, but they're not the primary lever.
Deliberation is a tool for after-fail, not for defaultWhen the model has a quick answer (run a shell command), force it to answer. Reserve deliberation for the second attempt if the first fails.
Single-trial comparisons can be 6× wrongOur first measurement said our agent was slower; with N=3 we see the popular framework is bimodal and our tuned agent is tight. Always measure variance.
Same model, two configurations, very different productsA "local LLM" isn't one thing. The same weights in two configs behave very differently. The build is the difference.

links