Intuition Labs · φ research · 32 · measured · 21 may 2026

the throttle and the edge

Last week we reported fifty percent on a published coding corpus. The same model on the same exercises moved to sixty eight percent after one configuration change, and to seventy seven percent after a second. The number we called a capability rate was actually a measurement of how much room the box was giving the model to think. Inside the residual failures, one separate failure mode appears that requires a different kind of fix — best-of-three sampling at non-zero temperature on the retry turn unlocks the case where the model wrote almost all of the answer and was locked into the wrong edge by greedy decoding.

50% → 77% on the same model · the box was the binding constraint
pass-rate(per-turn-budget = X seconds) is a function of X · most of what we called capacity was actually room  ·  paired-seed comparison on 34 python exercises across three harness configurations

the takeaway in one paragraph

The fifty percent we measured last week was a measurement of how much room the box was giving the model. Not a measurement of what the model could do. The same model on the same exercises moved to sixty eight percent after one configuration change, and to seventy seven percent after a second. Inside the residual failures, one new failure mode appears that requires a different kind of fix, and a small number of exercises stay failed regardless — they are where the model itself is the binding constraint.

hypothesis

A published coding benchmark returned fifty percent on thirty four python exercises against a local thirty five billion parameter mixture of experts model. We did not know whether that number was a ceiling on the model or on the harness around it. The two failure modes look identical from outside — pass-rate is pass-rate — but the right next move differs. If the model is at its ceiling, the move is a different model. If the harness is at its ceiling, the move is to fix the harness. We tested the harness side.

method

Thirty four exercises drawn from a published coding benchmark. One local model served by one inference stack. Three configurations run paired against each other so the same exercise produces a pair of outcomes.

The baseline uses a per-turn time budget of one hundred and eighty seconds, three turns per exercise, and one inference engine that holds state across all thirty four runs.

The second configuration raises the per-turn budget to three hundred seconds, restarts the inference engine between each exercise, and reshapes the test feedback so the failing assertion appears first and unrelated warnings are stripped.

The third configuration keeps the second's changes and adds two cheap client-side options: prompt caching on retries, and a syntax check before each test run.

A fourth small slice fired three candidate completions in parallel at non-zero temperature on the retry turns where the second configuration had failed, and picked the candidate that passed the most tests.

Model, prompts, hardware, and evaluation harness are byte-for-byte the same across configurations. Pairing makes a McNemar exact two-sided test valid on the discordant set; Wilson ninety five percent intervals are appropriate for the pass-rate point estimates.

results

configurationpass rateWilson 95%paired delta vs baselineMcNemar exact p
baseline (180s per-turn budget)17 / 34 = 50.0%34.1 – 65.9%
extended budget + engine restart + reshaped feedback23 / 34 = 67.6%50.8 – 80.9%+6 passes, +17.6 pp0.0312
extended stack + prompt cache + syntax check21 / 34 = 61.8%45.0 – 76.1%+4 passes, +11.8 pp0.3438 vs baseline
extended stack + best-of-3 at T=0.7 on retry turns26 / 34 = 76.5%60.0 – 87.6%+9 passes, +26.5 pp0.0039

The second configuration is the biggest result. Six exercises recovered that the baseline had failed (bottle-song, dominoes, hangman, rest-api, robot-name, scale-generator). Zero regressed. The McNemar exact two-sided p is three percent, significant at the conventional five percent threshold.

The third configuration was a wash relative to the second. Against v2 it gained two exercises and lost four at a paired p of about sixty nine percent. Prompt caching and a syntax check shaved wall time but did not move the pass rate at this sample size. Substrate run-to-run noise dominated their nominal effect.

The fourth configuration is the strongest result. Nine exercises recovered against baseline with zero regressions and a McNemar exact two-sided p of four-tenths of a percent. Three of those recoveries — bowling, tree-building and zebra-puzzle — were exercises that the second configuration could not reach. On each, the model knew nearly all of the answer on its first attempt and was locked by greedy decoding into the wrong edge case on retries. One of three parallel attempts at non-zero temperature found the missing case.

mechanism

Three things were going on.

The first is what we called the substrate throttle. In the baseline, every failed exercise had at least one turn hit the hundred-and-eighty-second wall. Nine of seventeen failures had all three turns hit it. The model never finished writing what it was working on. The fraction of all turns that hit the time limit was sixty five percent. After the second configuration, that fraction dropped to thirty four percent. Six of the recoveries are the model finishing thoughts it had previously been cut off inside.

The second is engine state. A local inference stack accumulates a small amount of state across requests. After an hour or two of sustained load, calls that used to return in two seconds start returning in twenty. The fix is to restart the engine between runs. This unlocked the bottle-song, dominoes and hangman cluster — those exercises did not need more time, they needed a clean engine. The model passed each of them in one turn under fifty seconds once the engine was reset.

The third is sampling. The bowling recovery points at something separable. The model passed twenty seven of twenty eight bowling tests on its first try and missed one edge case. On retry at temperature zero, it produced the same wrong answer each time because greedy decoding collapses to the same path. The fourth configuration sampled three attempts in parallel at non-zero temperature, and one of the three landed the missing edge case. This lever does not give the model more time and it does not restart the engine. It widens the distribution of attempts on turns where the first attempt landed close.

implication

Three independent levers exist for this kind of harness, and they fix three different things.

Room to think plus a clean engine between runs is the biggest lever. It moved the pass rate by about eighteen percentage points and it is the one most likely to be invisible to a casual user, because the failure mode it fixes — turn hit the time limit — looks identical to any other failure when reported.

Prompt caching and a syntax check shave wall time per turn but did not move the pass rate at this sample size. The lesson is operational. You cannot tell whether a cheap option helped or hurt without paired comparison.

Multiple attempts at non-zero temperature on retries is a narrow attack surface. Most residual failures are not the bowling shape. For the ones where the model has nearly all of the answer and is locked into the wrong edge case by greedy decoding, this lever may be the only one that unlocks them.

The capability ceiling on this benchmark is bounded above somewhere around seventy five to eighty percent. Two failures out of thirty four — paasio and react — finished all three turns of the second configuration under the time budget and still landed short. The model had room there and the answer was still wrong. Under the fourth configuration both hit the new three-hundred-second wall on at least one turn, so v4 alone does not separate substrate-throttle from capability-cap on those two. The v2 evidence is what supports the ceiling claim. That is a different thing from the fifty percent baseline, which is closer to a measurement of harness budget than of model capability. The published leaderboard reports a smaller dense model at sixteen percent and a larger sparse model at sixty percent on the full multi-language corpus, running harnesses we did not control. We measured the python subset only, so strict comparison is not valid, but the order of magnitude lines up.

limitations

The sample is thirty four exercises, python only. The published benchmark spans six languages. The levers have not yet been measured across the other five.

We tried to predict which exercises the multiple-attempts lever would unlock using the proximity of the first turn to passing — how many tests passed before the first failure. The prediction got seven of nine cases right but was wrong in two directions. It missed forth, which had the highest proximity score of any still-failing exercise and still did not unlock. And it missed zebra-puzzle, which had a proximity score of zero and did unlock. Proximity is a real but noisy signal. The mechanism that unlocks bowling does not generalize cleanly to all close-but-failed cases, and it can fire on cases that did not look close in advance.

The local inference stack has a real variance floor. At temperature zero on this hardware, ninety one percent of paired identical-prompt calls diverge token-by-token because of the interaction of speculative decoding, mixture of experts routing, and the local compute path. Every comparison in this note sits on top of that variance. Paired-seed McNemar is the right tool for the regime; single-shot before-after numbers are noise.

This is one model on one hardware. A denser model without an internal thinking mode, or a different inference stack, would shift the exact percentages. The mechanism — the harness was throttling thinking time and the model recovered when given room — carries across deployments.

what we are taking from this

The harness supercharge around a local model is removed throttling, clean state between runs, and a narrow band of more attempts at hard edges. We were measuring how long we let the model write when we thought we were measuring what the model could write. The next number we publish has to be paired-seed gated.

citations

links