the harness was the multiplier
Two different agent frameworks ran the same local model against the same three coding tasks. One framework averaged 161 seconds per task. The other averaged 37. The model is identical. The hardware is identical. The prompts are byte-for-byte the same. The difference is the build around the model — the configuration choices, the speculative decoding, the way the loop talks to the container. We have been working on these pieces one at a time over the past several months. Today's measurement puts them together. The substrate around a model is worth roughly four and a half times on wall-time and six and a half times on passes-per-second of compute. The model is not the product. The product is the model plus everything we built around it.
the takeaway in one paragraph
We ran two agent frameworks against three small coding tasks on a local model. The first framework is a well-known open-source agent for terminal tasks, the kind of thing many teams pick up as a default. The second is the agent we have been building, which routes through the configuration choices we keep notes about in this series. Same model behind both. Across the full run our agent passed eighty-nine percent of trials at an average of thirty-seven seconds. The popular framework passed sixty percent at one hundred sixty-one seconds. The compound of the choices on this list (no deliberation by default for agent calls, speculative decoding with a small draft model, a terse parse contract, host-side state) is what makes the four-and-a-half-times wall-time difference and the six-and-a-half-times difference in how many passes per second of compute the box returns.
the comparison
Three tasks. Each is a small but realistic coding job. Fizzbuzz is a three-line script. Word-count writes a file-processing script and verifies exact integer output. Git-init initializes a repository, creates two files with exact content, stages both files, and makes a single commit with an exact message. Git-init is the longest of the three because it has seven small actions to execute in order.
Clean-trial walls per task (the trials where the substrate did not interrupt the run):
| task | our agent (mean wall) | popular framework (mean wall) | wall ratio |
|---|---|---|---|
| fizzbuzz (3-line script) | 24 seconds (n=3) | 43 seconds (n=1) | 1.8× |
| word-count (file I/O) | 32 seconds (n=2) | 58 seconds (n=1) | 1.8× |
| git-init (7 small actions) | 23 seconds (n=2) | 442 seconds (n=1) | 19× |
Aggregate across all fourteen trials we ran today: our agent at 0.024 passes per wall-second, the popular framework at 0.0037. The compound difference is wider than any single task ratio because reliability compounds with speed. Failing fewer trials AND finishing faster on the ones that succeed multiplies the numbers together.
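The aggregate figures can be reproduced from the pass rates and mean walls quoted earlier. The sketch below assumes that passes per wall-second is simply pass rate divided by mean wall time per trial; that definition is an assumption, though it matches the published numbers. The dictionary layout and function name are illustrative only.

```python
# Figures quoted in this post: pass rate and mean wall-seconds per trial.
runs = {
    "our agent": {"pass_rate": 0.89, "mean_wall_s": 37.0},
    "popular framework": {"pass_rate": 0.60, "mean_wall_s": 161.0},
}


def passes_per_wall_second(pass_rate: float, mean_wall_s: float) -> float:
    """Expected passing trials returned per second of wall time (assumed definition)."""
    return pass_rate / mean_wall_s


ours = passes_per_wall_second(**runs["our agent"])
theirs = passes_per_wall_second(**runs["popular framework"])

# Speed and reliability multiply: the throughput ratio is their product.
wall_ratio = runs["popular framework"]["mean_wall_s"] / runs["our agent"]["mean_wall_s"]
reliability_ratio = runs["our agent"]["pass_rate"] / runs["popular framework"]["pass_rate"]
throughput_ratio = ours / theirs

print(f"our agent:         {ours:.4f} passes per wall-second")
print(f"popular framework: {theirs:.4f} passes per wall-second")
print(f"wall ratio {wall_ratio:.2f} x reliability ratio {reliability_ratio:.2f} "
      f"= throughput ratio {throughput_ratio:.2f}")
```

The last line makes the compounding explicit: the throughput ratio is the wall ratio times the pass-rate ratio, which is why it is wider than the wall ratio alone.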
The width of the git-init gap deserves a note. The task has seven small subactions in sequence: make a directory, initialize the repository, rename the branch, write two files with exact bytes, stage them, commit with an exact message. Our agent runs each as a quick shell command and moves on. The popular framework deliberates before each one because deliberation is its default. Seven deliberations compound. The ratio on a task with one or two actions is a much smaller number than the ratio on a task with seven.
why the gap
Four substrate-level levers ride together in our build. Each one is small. The product of all four is what produces the wall ratios above: 1.8× on the two short tasks, 19× on git-init, and roughly four and a half times in aggregate.
**One. Deliberation off by default for agent calls.** The model has a switch that controls whether it emits a long deliberation block before its actual answer. For chat-style use that switch should stay on — you want to see the reasoning. For an agent that is about to translate one observation into one shell command, the deliberation is overhead on every turn. We documented this single switch and its 5.5× wall reduction in note 21.
**Two. Speculative decoding with a small draft model.** The serving layer runs a 0.6-billion-parameter draft model alongside the main 35-billion-parameter model. The draft proposes the next several tokens, and the main model accepts or corrects them in one shot. When the two agree, the box outputs tokens roughly one-and-a-half times faster than stock decoding. The popular framework's request path does not pass speculative-decoding parameters, so it gets stock decoding.
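The propose-then-verify idea can be shown with a toy. This is a minimal sketch of the greedy variant only, with two plain Python functions standing in for the draft and main models; it is not the serving layer's implementation, and every name in it is hypothetical. A real server scores all proposed positions in a single forward pass of the main model, which is where the time saving comes from; here the count of verify rounds stands in for that.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # toy greedy "model": context -> next token id


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """One draft-then-verify round. Returns the tokens to append."""
    # 1. The draft proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first disagreement: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target adds one more token for free.
    accepted.append(target(context + accepted))
    return accepted


def speculative_generate(target: NextToken, draft: NextToken, prompt: List[int],
                         n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens; also return how many verify rounds were needed."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def greedy_generate(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Stock decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except right after token 7.
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


tokens, rounds = speculative_generate(toy_target, toy_draft, [0], n_tokens=12)
print(tokens, "in", rounds, "verify rounds")
```

The output is identical to stock greedy decoding from the main model; only the number of main-model rounds changes, and it shrinks as the draft agrees more often.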
**Three. A terse parse contract.** Our agent expects exactly one of two response shapes per turn: a thinking tag plus a command tag, or a thinking tag plus a done tag. No verbose narration, no markdown frames, no preambles. Fewer output tokens per turn means less wall time per turn.
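A contract this narrow is easy to enforce with a strict parser. The sketch below assumes the tags are spelled `<think>`, `<command>`, and `<done>`; those spellings, the `Turn` class, and `parse_turn` are all hypothetical, since the actual tag names are not given here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Exactly two shapes are allowed: think + command, or think + done.
_TURN = re.compile(
    r"\A\s*<think>(?P<think>.*?)</think>\s*"
    r"(?:<command>(?P<command>.*?)</command>|<done>(?P<done>.*?)</done>)\s*\Z",
    re.DOTALL,
)


@dataclass(frozen=True)
class Turn:
    thinking: str
    command: Optional[str]  # shell command to run, or None when the agent is done
    done: bool


def parse_turn(text: str) -> Turn:
    """Parse one model response, rejecting anything outside the contract."""
    m = _TURN.match(text)
    if m is None:
        raise ValueError("response does not match either allowed shape")
    command = m.group("command")
    return Turn(
        thinking=m.group("think").strip(),
        command=command.strip() if command is not None else None,
        done=command is None,
    )


turn = parse_turn("<think>repo first</think>\n<command>git init</command>")
print(turn.command)
```

Rejecting preambles and markdown frames outright, rather than trying to salvage them, keeps the loop simple: a malformed turn is an error to retry, not text to interpret.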
**Four. Host-side state with container-side execution.** Our agent's loop runs on the host. It reaches into the container only to run a single command, get the result, and feed that back to the model. The popular framework runs a virtual terminal session inside the container and parses what the terminal emits, which adds latency and a second variance source. Both designs work. The host-side design happens to be faster on this hardware.
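The host-side shape looks roughly like the loop below. It is a sketch under assumptions: the container runtime is taken to be Docker-compatible (the `docker exec <container> sh -c <command>` form), and `agent_loop`, `docker_exec`, and the transcript format are hypothetical names, not our agent's actual code. The model and the executor are passed in as plain callables so the loop can be exercised without a container.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

ExecFn = Callable[[str], Tuple[int, str]]       # command -> (exit code, combined output)
ModelFn = Callable[[List[str]], Optional[str]]  # transcript -> next command, None when done


def docker_exec(container: str, timeout_s: float = 60.0) -> ExecFn:
    """Build an executor that runs one command inside `container` and returns."""
    def run(command: str) -> Tuple[int, str]:
        proc = subprocess.run(
            ["docker", "exec", container, "sh", "-c", command],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode, proc.stdout + proc.stderr
    return run


def agent_loop(next_command: ModelFn, execute: ExecFn,
               task: str, max_turns: int = 20) -> List[str]:
    """All state lives in `transcript` on the host; the container only runs commands."""
    transcript = [f"task: {task}"]
    for _ in range(max_turns):
        command = next_command(transcript)
        if command is None:  # the model signalled it is done
            break
        code, output = execute(command)
        transcript.append(f"$ {command}\n[exit {code}]\n{output}")
    return transcript


# Dry run with a scripted "model" and a fake executor, no container needed.
scripted = iter(["mkdir repo", "git -C repo init", None])
seen: List[str] = []


def fake_exec(command: str) -> Tuple[int, str]:
    seen.append(command)
    return 0, "ok"


log = agent_loop(lambda transcript: next(scripted), fake_exec, task="git-init")
print(len(log) - 1, "commands executed")
```

Against a live container the only change is passing `docker_exec("some-container")` as the executor; each call is one round trip with a definite exit code, with no terminal output to parse.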
Each lever on its own is incremental. Stacked together they are the four-and-a-half-times wall ratio. The model is the same. What changed is how the substrate carries the model's output to the world.
honest read
Four of fourteen trials today were affected by substrate hiccups rather than by either harness's behavior. The serving layer accumulates state during long mixed-load runs and eventually slows down — a routine sub-second call can stretch to two minutes if the box has been pushed hard without a bounce. We documented this once before and it is parked for deeper investigation. Two of the popular framework's trials hit it. One of ours did too. We restart the serving layer between groups of trials to reset it; the restart took perhaps two minutes in the middle of today's run.
All three tasks on the comparison side have n=1 today. That is enough to show direction. It is not enough to pin a tight confidence interval. Fizzbuzz, word-count, and git-init each have a single clean trial in the popular framework's column. A tighter version of this comparison would run five paired trials per task with a serving-layer restart between each, costing perhaps another half hour of wall time. We may publish that follow-on. The direction is unlikely to change; the precise numbers will.
Local serving on commodity hardware has a real variance floor. Even when you hold model, prompt, hardware, and harness completely constant, ninety-one percent of paired calls diverge token-by-token at temperature zero because of the interplay between speculative decoding, mixture-of-experts routing, and the Vulkan compute path. Every comparison in this series sits on top of that variance, not under it.
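One way to put a number on that floor is to send the same request twice and compare the returned token sequences. The helper below is a minimal sketch of that bookkeeping; the function names are hypothetical and the token lists are made-up placeholders, not measurements from this series.

```python
from typing import List, Optional, Sequence, Tuple


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one sequence is a strict prefix of the other
    return None


def divergence_rate(pairs: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Fraction of paired calls whose token sequences differ anywhere."""
    if not pairs:
        raise ValueError("need at least one pair")
    diverged = sum(1 for a, b in pairs if first_divergence(a, b) is not None)
    return diverged / len(pairs)


# Placeholder token ids purely to exercise the functions.
example_pairs: List[Tuple[List[int], List[int]]] = [
    ([1, 2, 3, 4], [1, 2, 3, 4]),  # identical
    ([1, 2, 3, 4], [1, 2, 9, 4]),  # diverges at index 2
    ([1, 2, 3], [1, 2, 3, 4]),     # diverges at index 3 (length mismatch)
    ([5, 6], [7, 6]),              # diverges at index 0
]
print(divergence_rate(example_pairs))
```

Recording where each pair diverges, not just whether it does, helps separate early divergence (different approach) from late divergence (different wording).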
This is one model. Same harness comparison on a different model — a denser model with no built-in deliberation switch, for example — will shift the numbers. The drafter we use is matched to this specific model family. The deliberation-off switch is meaningful because this model has a deliberation mode. Carrying the result over to other model families requires re-measuring.
what a substrate fix looks like
This morning the comparison would not run at all. Every single trial returned the same error: no reward file found. The agent wrote its solution into the container. The verification script inside the container tried to write its judgment to a shared directory. The shared directory stayed empty. The host process looked, found nothing, and marked the trial as a substrate error.
The cause was a small detail in how the container runtime mounts directories on a security-hardened operating system. The container's processes, including its root user, could not write to a directory the host owned, even though the directory's ordinary Unix permissions said they should be able to. The host's mandatory access control was treating the shared directory as a host-only context. The container saw a path it could read but not write, and the failure was easy to miss: a bash redirect to an unwritable path reports the error only on stderr and through a nonzero exit status, so it passes unnoticed unless something checks for them.
The fix was a twelve-line patch to the framework's mount-file generator. One word added to each generated mount entry: relabel-on-mount. The framework's job is to tell the runtime to translate the host's security label into a context the container can write to. Most container hosts don't enforce this kind of label, so most installations of the framework never hit the bug. Ours does enforce it.
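To give a feel for the shape of such a change, here is a sketch under two assumptions that go beyond what is stated above: that the mandatory access control is SELinux, and that mounts are written in the Docker/Podman short volume syntax `host:container[:options]`, where the `z` option asks the runtime to relabel the host directory as shared container content and `Z` as private to one container. The function name is hypothetical and this is not the actual patch.

```python
def with_relabel(mount: str, shared: bool = True) -> str:
    """Add an SELinux relabel option to a `host:container[:options]` volume spec.

    `z` marks the content as shared between containers, `Z` as private to one.
    Assumes neither path contains a colon.
    """
    flag = "z" if shared else "Z"
    parts = mount.split(":")
    if len(parts) == 2:
        return f"{mount}:{flag}"
    if len(parts) == 3:
        options = parts[2].split(",")
        if "z" in options or "Z" in options:
            return mount  # already relabelled, leave untouched
        return f"{parts[0]}:{parts[1]}:{','.join(options + [flag])}"
    raise ValueError(f"unrecognised volume spec: {mount!r}")


for spec in ["/srv/trial/logs:/logs", "/srv/trial/tests:/tests:ro"]:
    print(with_relabel(spec))
```

On hosts that do not enforce labels the extra option is harmless, which is also why its absence goes unnoticed on most installations.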
This is the kind of unglamorous infrastructure work that runs underneath a comparison like the one above. Most of the time you don't think about it. The day you do, you have to actually fix it before you can measure anything. The box we run has to know about itself at the host-OS level, the runtime level, the model-serving level, and the application level. Each of those levels has its own ways of going slightly wrong. The substrate is not a single thing; it is a stack of agreements between layers that has to actually hold.
what we learned
| lesson | why it matters |
|---|---|
| Model + harness is the product, model alone is not | A "local LLM" is not one thing. The same weights in two configs behave very differently on real work. The wrapping is part of the answer. |
| Levers compound; each one looks small alone | The four substrate choices we documented over the past several months each look incremental in isolation. Together they multiply into wall ratios of 1.8× to 19× per task, roughly four and a half times in aggregate. |
| Task shape determines the size of the speedup | On a single-action task the ratio is small. On a seven-action task the ratio is large. Deliberation overhead is paid once per turn. |
| Reliability is wall predictability, not just pass-rate | Twenty-four seconds with low variance is operationally different from "somewhere between twenty and three hundred seconds". The narrow distribution is part of the win. |
| Substrate has to know about itself | Today's comparison required a twelve-line fix to the security-label handling of the container mount generator. The substrate has to actively cooperate with the OS underneath it. That cooperation is its own piece of work. |