constructed, not trained
a coding agent has rules. some rules can be PROVEN correct instead of tested. we just did that for five of ours — and for every way two of those rules can run together.
the takeaway in one paragraph
Coding agents like ours have small rules that fire on every request. Things like: 'if the model ran out of room, retry with more space.' Today we proved one of those rules is correct without testing it the slow way. We compiled it into a tiny neural network and showed that the network always emits the same decision as the rule. Ten thousand random inputs. Zero disagreements. Twenty-six seconds total. The slow way takes us hours and gives a fuzzier answer. The fast way takes half a minute and gives a certificate.
If you build agents, this matters because about half of the small rules you ship are like this — pure logic, no learning involved. You have been benchmarking all of them. You can stop benchmarking the half that are pure logic. Prove them once, ship the proof, move on.
the everyday problem
When a coding agent is wrong, finding out is expensive. You run it against eight or twenty or fifty problems, you watch the pass rate, you squint at the noise, and you decide whether the change you just made helped or hurt. We have been doing this. It is slow and the answer is rarely as crisp as you want it to be.
But not every change is the kind of thing you actually need a bench for. Some changes are pure rules. 'If finish_reason is length, retry with thinking on and more tokens.' That is not a question of taste or a question of model behavior. It is a function: input goes in, decision comes out, always the same way. Running 8 noisy coding problems to decide whether this function does what it says is the wrong tool. The right tool is to PROVE the function does what it says.
We had been using the slow tool for both kinds of change. That is what made the cycle structurally hard. Some of the bench was measuring noise around something we could just check.
what we read
Jonathan Bates put out a small PyTorch library that does something striking. You give it a description of a Turing machine — the rules of a tiny abstract computer — and it gives you back a neural network whose weights are set ANALYTICALLY to simulate that machine. No training. No gradient descent. No data. Just: here is the rule, here are the weights that obey the rule.
Why does this exist? Because two old theorems (one from 1995, one from 2021) prove that simple neural networks can simulate any computer. Bates is the first person who wrote down the construction as runnable Python. You can compile a Turing machine into a network in a few hundred milliseconds.
We saw this and asked: which of our agent's rules are tiny machines like this? Answer: most of them. Five out of the nine rules we ship are pure deterministic logic. The other four involve sampling, drafting, or training, so they actually need an empirical bench. But the five pure ones don't.
what we did
We picked the smallest of our pure rules — the one that says 'if the model ran out of room, retry with thinking enabled and four times as many tokens.' Two states. Trivial logic. We wrote it down as a Turing machine in five lines, compiled it through Bates' library into a small neural network, and ran a side-by-side test.
The test: 10,000 random inputs to both. The original Python rule on one side, the compiled neural network on the other. For each input, do they emit the same decision?
| result | |
|---|---|
| random inputs tested | 10,000 |
| agreements | 10,000 |
| disagreements | 0 |
| full input space coverage | yes (2 / 2 unique inputs) |
| total wall time | 25.76 seconds |
| time per input | 2.58 ms |
The network agrees with the rule on every single input. The network was never trained. The agreement comes from how the weights were set, not from any data we showed the network. That is the certificate.
why this changes how we ship
Before: every rule we ship goes through a coding-problem bench. We run the agent on a fixed set of exercises, with and without the rule, and compare pass rates. The bench is noisy. Effects smaller than about one extra solve are hard to see clearly. For rules that are pure logic, the bench was measuring the wrong thing the whole time.
Now: rules that are pure logic get a proof.yaml file. The rule is written down as a tiny machine. We compile it once. We test it against the running version on a sample. If they always agree, the rule is correct by construction. The bench is reserved for things that actually need empirical testing — model choices, sampling tricks, anything that involves randomness or learned weights.
About half of the rules we ship are pure logic. That means about half of the bench burden goes away. The other half still needs the bench — and the bench gets to focus on the rules where it actually has something to measure.
| rule | kind | how we check it |
|---|---|---|
| retry-on-length | pure logic | proof (this note) |
| model-alias rewrite | pure logic | proof (next) |
| prepend-failing-test on retry | pure logic | proof (next) |
| halt when confidence is high | pure logic | proof (next) |
| error-class hints | pure logic | proof (next) |
| parallel verification | partly logical | partial proof + bench |
| ensemble decode | partly logical | partial proof + bench |
| drafter (speculative decode) | involves sampling | bench |
| idle-time distillation | involves training | bench |
Five rules move to proof-cadenced. Two stay hybrid. Two remain empirical. The line is where it should be — anywhere learned weights or a sampler are involved, we keep the bench. Anywhere the rule is 'if X then Y else Z,' we prove it.
the other four — verified the same session
We did not stop at one. The pre-registration said: if the first proof checks out, compile the other four pure-logic rules in the library. Same protocol, same kind of test, ten thousand random inputs each.
| rule | input shape | agreements | wall |
|---|---|---|---|
| retry-on-length | one-bit gate | 10,000 / 10,000 | 26 s |
| prepend failing test on retry | three-bit AND | 10,000 / 10,000 | 47 s |
| halt when confidence is high | four-bit AND | 10,000 / 10,000 | 84 s |
| hint by error class | six-way category | 10,000 / 10,000 | 26 s |
| model-alias rewrite | one-bit gate | 10,000 / 10,000 | 26 s |
Fifty thousand random inputs across five rules. Zero disagreements. Three minutes of compute total. Every rule is also exhaustively checked over its full small input space — 32 unique input combinations across the five rules, all match.
About half of the rule library is now proof-cadenced. The other half — the rules that involve sampling or learned weights — keep their empirical bench. The line is exactly where it should be: anywhere randomness or learning lives, we keep measuring; anywhere the rule is pure dispatch logic, we prove it.
what to take away
- About half of the rules a typical coding agent ships are pure dispatch logic. They can be proved instead of tested. Bates' library makes the proof cheap.
- The split is honest. The parser that decides 'did the model run out of room?' is text-handling and stays empirical. The logic that says 'if it ran out of room, retry' is pure dispatch and gets proved.
- Time-to-ship for these rules drops from hours of benchmarking to seconds of checking. That compounds the more rules you have.
- The proof does not replace the production code. We still run the original Python. The compiled network is a certificate — the Python is correct in the way we claimed.
the same trick, two rules at once
An obvious question: if proving one rule works, does it work for TWO rules running together? Note 10 measured that two helpful rules combined can interact in surprising ways. Most of those interactions involve at least one rule with a sampler or learned weight — and for those the bench is still the right tool. But for two PURE-LOGIC rules running together, the combined behavior is also constructible. We tested this — first on one pair, then on every pair the library can form.
The setup. Take two rules that both fire on the same kind of request — say, prepend failing-test on retry and add an error-class hint when there's a traceback. Both pre-call, both pure logic, both proved individually in the previous step. Build a single Turing machine whose dispatch is the PRODUCT of those two: given (does the residual rule fire?, what's the error class?), it emits the joint decision both Python rules would emit running in sequence. Differential-test it.
Then we wrote a small composer that does this for any pair of the five rules, automatically. The composer reads each rule's δ and emits the joint machine. Five rules means ten possible pairs. We ran the test on all ten.
| pair (rules combined) | underlying machine | exhaustive joint inputs matched |
|---|---|---|
| retry-on-length × prepend failing test | 20 states · 27 trans | 16 / 16 |
| retry-on-length × halt-on-converge | 24 states · 35 trans | 32 / 32 |
| retry-on-length × error-class hint | 28 states · 27 trans | 12 / 12 |
| retry-on-length × model-alias rewrite | 12 states · 11 trans | 4 / 4 |
| prepend failing test × halt-on-converge | 28 states · 43 trans | 128 / 128 |
| prepend failing test × error-class hint | 32 states · 35 trans | 48 / 48 |
| prepend failing test × model-alias rewrite | 16 states · 19 trans | 16 / 16 |
| halt-on-converge × error-class hint | 34 states · 39 trans | 96 / 96 |
| halt-on-converge × model-alias rewrite | 18 states · 23 trans | 32 / 32 |
| error-class hint × model-alias rewrite | 32 states · 31 trans | 12 / 12 |
Three hundred and ninety-six joint inputs across ten pairs. Every single one agrees with running the two Python rules in sequence. Six seconds of compute total. Every way these five pure-logic rules can run together is now correct by construction.
The construction frame extends from singles to pairs cleanly. The empirical interaction story note 10 told — where two helpful rules can fight when they meet — gets a constructive answer for the pure-logic-only quadrant: pairs that involve a sampler or a learned weight stay on the bench; pairs where both rules are pure dispatch are settled here, once, in seconds.
what comes next
Bigger picture: this is the third lens we use on every rule, now also extended to pairs of rules. The first lens is what the rule is for. The second lens is how it runs on this particular box. The third lens — opened by Bates — is whether it can be proven correct by construction. Most coding-agent frameworks have the first lens. Some have the second. None we know of have the third. Adding it cost a few hours today. Not adding it costs forever — the slow bench loop keeps doing work it does not need to do.
Composition was the question note 10 raised. For pure-logic rules combined with pure-logic rules, we now have an answer that does not require benchmarking. For pure-logic rules combined with rules that involve sampling or learning, the bench is still the right tool — and it gets a cleaner signal because part of the load has moved off it.
what we're not claiming
- We tested rules with up to 17 underlying states. Real-world rules can be larger. We expect compile time to grow polynomially with rule size; we have not stress-tested at scale.
- We did not eliminate benchmarking. We split the work between proofs (for pure logic) and benchmarks (for everything else). Both are still necessary.
- The compiled network is not faster than the original Python at runtime. Production keeps running the Python. The compiled network is the certificate, not a replacement.
- The parsers that produce the inputs to these rules (regex matches, AST traversal, suffix checks on strings) are NOT in the proof. They stay empirically tested. The proof only covers what happens once the parser has emitted its verdict.
- This does not replace Sutton-Precup-Singh's framing of an option, or Bates' construction. We are wiring the two together — the option as the WHAT, the construction as the WHY-IT'S-CORRECT.
where this came from
Bates' library surfaced through our research channel. The repo had been sitting for some time; we only noticed it once we started asking how to ship rule-correctness without paying the bench tax. Pulling the library, mapping our pure-logic rules onto its formalism, compiling one, and running the differential test took about five minutes once we started.
The shape of the idea — three lenses on every rule, with the third lens being a proof — is independent of whether any one compilation succeeds. The first compilation succeeded on the simplest rule. We then ran the same protocol on the other four pure-logic rules in the library. All four also checked out. Half of the rule library is now proof-cadenced; the other half stays empirical because it involves sampling or learned weights, exactly where the line was always going to be.