Intuition Labs · φ research · 11 · measured · 16 may 2026

constructed, not trained

a coding agent has rules. some rules can be PROVEN correct instead of tested. we just did that for five of ours — and for every way two of those rules can run together.

5 rules · 10 / 10 pairs · every joint input checked
rule → tiny machine → network with no training  ·  when the rule is deterministic, you can skip the bench entirely

the takeaway in one paragraph

Coding agents like ours have small rules that fire on every request. Things like: 'if the model ran out of room, retry with more space.' Today we proved one of those rules is correct without testing it the slow way. We compiled it into a tiny neural network and showed that the network always emits the same decision as the rule. Ten thousand random inputs. Zero disagreements. Twenty-six seconds total. The slow way takes us hours and gives a fuzzier answer. The fast way takes half a minute and gives a certificate.

If you build agents, this matters because about half of the small rules you ship are like this — pure logic, no learning involved. You have been benchmarking all of them. You can stop benchmarking the half that are pure logic. Prove them once, ship the proof, move on.

the everyday problem

When a coding agent is wrong, finding out is expensive. You run it against eight or twenty or fifty problems, you watch the pass rate, you squint at the noise, and you decide whether the change you just made helped or hurt. We have been doing this. It is slow and the answer is rarely as crisp as you want it to be.

But not every change is the kind of thing you actually need a bench for. Some changes are pure rules. 'If finish_reason is length, retry with thinking on and more tokens.' That is not a question of taste or a question of model behavior. It is a function: input goes in, decision comes out, always the same way. Running 8 noisy coding problems to decide whether this function does what it says is the wrong tool. The right tool is to PROVE the function does what it says.

We had been using the slow tool for both kinds of change. That is what made the cycle structurally hard. Some of the bench was measuring noise around something we could just check.

what we read

Jonathan Bates put out a small PyTorch library that does something striking. You give it a description of a Turing machine — the rules of a tiny abstract computer — and it gives you back a neural network whose weights are set ANALYTICALLY to simulate that machine. No training. No gradient descent. No data. Just: here is the rule, here are the weights that obey the rule.

Why does this exist? Because two old theorems (one from 1995, one from 2021) prove that simple neural networks can simulate any computer. Bates is the first person who wrote down the construction as runnable Python. You can compile a Turing machine into a network in a few hundred milliseconds.

We saw this and asked: which of our agent's rules are tiny machines like this? Answer: most of them. Five out of the nine rules we ship are pure deterministic logic. The other four involve sampling, drafting, or training, so they actually need an empirical bench. But the five pure ones don't.

what we did

We picked the smallest of our pure rules — the one that says 'if the model ran out of room, retry with thinking enabled and four times as many tokens.' Two states. Trivial logic. We wrote it down as a Turing machine in five lines, compiled it through Bates' library into a small neural network, and ran a side-by-side test.

The test: 10,000 random inputs to both. The original Python rule on one side, the compiled neural network on the other. For each input, do they emit the same decision?

result
random inputs tested10,000
agreements10,000
disagreements0
full input space coverageyes (2 / 2 unique inputs)
total wall time25.76 seconds
time per input2.58 ms

The network agrees with the rule on every single input. The network was never trained. The agreement comes from how the weights were set, not from any data we showed the network. That is the certificate.

why this changes how we ship

Before: every rule we ship goes through a coding-problem bench. We run the agent on a fixed set of exercises, with and without the rule, and compare pass rates. The bench is noisy. Effects smaller than about one extra solve are hard to see clearly. For rules that are pure logic, the bench was measuring the wrong thing the whole time.

Now: rules that are pure logic get a proof.yaml file. The rule is written down as a tiny machine. We compile it once. We test it against the running version on a sample. If they always agree, the rule is correct by construction. The bench is reserved for things that actually need empirical testing — model choices, sampling tricks, anything that involves randomness or learned weights.

About half of the rules we ship are pure logic. That means about half of the bench burden goes away. The other half still needs the bench — and the bench gets to focus on the rules where it actually has something to measure.

rulekindhow we check it
retry-on-lengthpure logicproof (this note)
model-alias rewritepure logicproof (next)
prepend-failing-test on retrypure logicproof (next)
halt when confidence is highpure logicproof (next)
error-class hintspure logicproof (next)
parallel verificationpartly logicalpartial proof + bench
ensemble decodepartly logicalpartial proof + bench
drafter (speculative decode)involves samplingbench
idle-time distillationinvolves trainingbench

Five rules move to proof-cadenced. Two stay hybrid. Two remain empirical. The line is where it should be — anywhere learned weights or a sampler are involved, we keep the bench. Anywhere the rule is 'if X then Y else Z,' we prove it.

the other four — verified the same session

We did not stop at one. The pre-registration said: if the first proof checks out, compile the other four pure-logic rules in the library. Same protocol, same kind of test, ten thousand random inputs each.

ruleinput shapeagreementswall
retry-on-lengthone-bit gate10,000 / 10,00026 s
prepend failing test on retrythree-bit AND10,000 / 10,00047 s
halt when confidence is highfour-bit AND10,000 / 10,00084 s
hint by error classsix-way category10,000 / 10,00026 s
model-alias rewriteone-bit gate10,000 / 10,00026 s

Fifty thousand random inputs across five rules. Zero disagreements. Three minutes of compute total. Every rule is also exhaustively checked over its full small input space — 32 unique input combinations across the five rules, all match.

About half of the rule library is now proof-cadenced. The other half — the rules that involve sampling or learned weights — keep their empirical bench. The line is exactly where it should be: anywhere randomness or learning lives, we keep measuring; anywhere the rule is pure dispatch logic, we prove it.

what to take away

the same trick, two rules at once

An obvious question: if proving one rule works, does it work for TWO rules running together? Note 10 measured that two helpful rules combined can interact in surprising ways. Most of those interactions involve at least one rule with a sampler or learned weight — and for those the bench is still the right tool. But for two PURE-LOGIC rules running together, the combined behavior is also constructible. We tested this — first on one pair, then on every pair the library can form.

The setup. Take two rules that both fire on the same kind of request — say, prepend failing-test on retry and add an error-class hint when there's a traceback. Both pre-call, both pure logic, both proved individually in the previous step. Build a single Turing machine whose dispatch is the PRODUCT of those two: given (does the residual rule fire?, what's the error class?), it emits the joint decision both Python rules would emit running in sequence. Differential-test it.

Then we wrote a small composer that does this for any pair of the five rules, automatically. The composer reads each rule's δ and emits the joint machine. Five rules means ten possible pairs. We ran the test on all ten.

pair (rules combined)underlying machineexhaustive joint inputs matched
retry-on-length × prepend failing test20 states · 27 trans16 / 16
retry-on-length × halt-on-converge24 states · 35 trans32 / 32
retry-on-length × error-class hint28 states · 27 trans12 / 12
retry-on-length × model-alias rewrite12 states · 11 trans4 / 4
prepend failing test × halt-on-converge28 states · 43 trans128 / 128
prepend failing test × error-class hint32 states · 35 trans48 / 48
prepend failing test × model-alias rewrite16 states · 19 trans16 / 16
halt-on-converge × error-class hint34 states · 39 trans96 / 96
halt-on-converge × model-alias rewrite18 states · 23 trans32 / 32
error-class hint × model-alias rewrite32 states · 31 trans12 / 12

Three hundred and ninety-six joint inputs across ten pairs. Every single one agrees with running the two Python rules in sequence. Six seconds of compute total. Every way these five pure-logic rules can run together is now correct by construction.

The construction frame extends from singles to pairs cleanly. The empirical interaction story note 10 told — where two helpful rules can fight when they meet — gets a constructive answer for the pure-logic-only quadrant: pairs that involve a sampler or a learned weight stay on the bench; pairs where both rules are pure dispatch are settled here, once, in seconds.

what comes next

Bigger picture: this is the third lens we use on every rule, now also extended to pairs of rules. The first lens is what the rule is for. The second lens is how it runs on this particular box. The third lens — opened by Bates — is whether it can be proven correct by construction. Most coding-agent frameworks have the first lens. Some have the second. None we know of have the third. Adding it cost a few hours today. Not adding it costs forever — the slow bench loop keeps doing work it does not need to do.

Composition was the question note 10 raised. For pure-logic rules combined with pure-logic rules, we now have an answer that does not require benchmarking. For pure-logic rules combined with rules that involve sampling or learning, the bench is still the right tool — and it gets a cleaner signal because part of the load has moved off it.

what we're not claiming

where this came from

Bates' library surfaced through our research channel. The repo had been sitting for some time; we only noticed it once we started asking how to ship rule-correctness without paying the bench tax. Pulling the library, mapping our pure-logic rules onto its formalism, compiling one, and running the differential test took about five minutes once we started.

The shape of the idea — three lenses on every rule, with the third lens being a proof — is independent of whether any one compilation succeeds. The first compilation succeeded on the simplest rule. We then ran the same protocol on the other four pure-logic rules in the library. All four also checked out. Half of the rule library is now proof-cadenced; the other half stays empirical because it involves sampling or learned weights, exactly where the line was always going to be.