Intuition Labs · φ research · 19 · measured · 18 may 2026

the bench had quiet bugs · seven shapes, one extractor

An autonomous code-generating loop measures its progress against a fixed set of test problems. When the loop appeared to fail one problem consistently, the natural reading was "the model can't do that one." A closer look at the data showed that six of that problem's seven assertions were structurally malformed, and the damage came from the assertion-extraction code, not from the model. Fixing the extractor and auditing the rest of the bench surfaced six distinct shapes of quiet bug. After that fix, seven previously-blocked problems all PASS first-try. Running the full 21-problem bench then surfaced a seventh shape (stdlib references in assertions), fixed the same day. Effective pass rate now: 21 of 21.

20 of 21 PASS first-try · 147 of 148 assertions · effective 21/21 = 100% after the 7th fix  ·  the bug was in the measurement, not the model · the substrate keeps auditing itself

the takeaway in one paragraph

Our coding loop has a small fixed set of test problems it tries to solve. For each problem there's a problem statement, a stub, and a pytest test file. The loop reads the test file, pulls out the assertions, and asks the model to write code that passes them. When the model's answer was checked against a particular problem, it failed 6 of 7 assertions — every time, every retry, with the model returning textbook-correct code. The natural conclusion was "this problem is hard for the model." Looking at the actual assertions the loop was checking against, the failing ones all said result == 5 or result == 0 with result undefined. The pytest test file used a local variable that the extractor dropped. After fixing the extractor and auditing the rest, six different shapes of similar quiet bug surfaced across the bench. Seven previously-blocked problems now solve first-try. Running the full 21-problem bench the same day surfaced a seventh shape — stdlib references in assertions — fixed in one more line.

what the extractor was doing

The pytest test files use a normal Python idiom. They build the answer in a local variable, then assert on that variable:

def test_classic_hit_to_cog():
    result = word_ladder('hit', 'cog', [...])
    assert result == 5

The extractor walked the test file's parsed syntax tree and grabbed only the assert lines. It produced "result == 5" — a perfectly well-formed boolean expression that has nothing to do with the answer it was supposed to check. When the loop ran this against the model's code, the eval crashed: result is not defined. The assertion was recorded as a failure. Across 7 assertions in this problem, only the one written inline ("assert word_ladder('a', 'c', ['a', 'b', 'c']) == 2") survived to be checked. The model's code was correct. The measurement was broken.
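To make the failure concrete, here is a minimal sketch of an assert-only extractor and an eval-based check. It is an illustration under assumptions, not the lab's actual code: `naive_extract`, `check`, the word list, and the canned `word_ladder` stand-in are all hypothetical.

```python
import ast

# Hypothetical test file; the word list is a placeholder, not the real fixture.
TEST_SOURCE = """
def test_classic_hit_to_cog():
    result = word_ladder('hit', 'cog', ['hot', 'dot', 'dog', 'cog'])
    assert result == 5

def test_single_step():
    assert word_ladder('a', 'c', ['a', 'b', 'c']) == 2
"""

def naive_extract(test_source):
    """Collect only the assert expressions, discarding every setup line."""
    tree = ast.parse(test_source)
    return [ast.unparse(n.test) for n in ast.walk(tree) if isinstance(n, ast.Assert)]

def word_ladder(begin, end, words):
    """Stand-in for a correct model answer (canned results for this demo)."""
    return {("hit", "cog"): 5, ("a", "c"): 2}[(begin, end)]

def check(expr, scope):
    """Evaluate one extracted assertion the way a bare eval-based grader might."""
    try:
        return "PASS" if eval(expr, dict(scope)) else "FAIL"
    except NameError as exc:
        return f"FAIL ({exc})"

extracted = naive_extract(TEST_SOURCE)
results = [check(e, {"word_ladder": word_ladder}) for e in extracted]
```

The first extracted expression fails with a NameError on `result` even though the stand-in solution is correct; only the inline assertion is actually checked.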

seven shapes, one bench

Once the first shape was visible, looking through the full set of past failures surfaced five more. Then running the full 21-problem bench one more time surfaced a seventh. They're all variations on the same theme: the assertion extractor doesn't have enough context to flatten the test cleanly. Each shape needed its own treatment.

| shape | example | how it appears |
| --- | --- | --- |
| variable-then-assert | result = f(x); assert result == 5 | NameError: result undefined |
| instance-with-side-effects | c = LRUCache(2); c.put(1, "a"); assert c.get(1) == "a" | fresh cache substituted; .put() lost; false fail |
| parametrized inputs | def test_x(arr, target, expected): | arr / target / expected all undefined |
| slug ↔ function name mismatch | slug=dijkstra_shortest_path; test calls shortest_path(...) | NameError if model writes only dijkstra |
| helper assertion leaked | assert spec.loader is not None (from _load_solution helper) | spec undefined; not really a test |
| hidden rule in the spec | clean_phone('1-555-...') == '5551234567' | spec does not say "strip the leading 1" |
| stdlib reference in assert | count_inversions(random.sample(range(N), K)) == expected | random not in eval scope · expected updated in a for-loop the extractor didn't track |

Six of these are in the extractor. The hidden-rule shape lives in the problem statement itself — the spec doesn't tell the model the convention, so the model writes a literally-correct answer that doesn't match the test. Different layer, same outcome: the bench rejected a reasonable answer.

the fix, in three small additions

Three changes, all in the extractor. Each handles one or more shapes. Together they unlocked seven problems that had been quietly bench-blocked.

First: when the test sets up a local variable like result = f(x), track that assignment as you walk the function body in order. When the next assert references result, inline the right-hand side of the assignment back into the expression. This handles the variable-then-assert shape cleanly.

Second: when the test does something like c = LRUCache(2) followed by c.put(1, 'a'), the second line is a side-effect statement that mutates c. Substituting the original LRUCache(2) into a later assertion would produce a fresh cache with no put applied — semantically wrong. Detect the mutation and drop c from the tracked-assignments table. Subsequent assertions on c won't be inlined, and the safety check below catches them.
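A rough sketch of the mutation rule, under the conservative assumption that any bare method-call statement on a tracked name mutates it. `track_assignments` is a hypothetical helper, and it stores right-hand sides as source text only to keep the example short.

```python
import ast

def track_assignments(body_source):
    """Walk statements in order; return {name: rhs source} still safe to inline."""
    tracked = {}
    for stmt in ast.parse(body_source).body:
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            tracked[stmt.targets[0].id] = ast.unparse(stmt.value)
        elif isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            func = stmt.value.func
            # A bare statement like c.put(1, 'a') is assumed to mutate c,
            # so c can no longer be replaced by its constructor call.
            if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
                tracked.pop(func.value.id, None)
    return tracked

body = "c = LRUCache(2)\nc.put(1, 'a')\nresult = f(3)\n"
tracked = track_assignments(body)  # c is dropped, result survives
```

Over-dropping is the safe direction here: a name removed by mistake only costs an assertion, while a name kept by mistake produces a false failure.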

Third: after inlining, check whether the assertion expression still references any undefined name. If yes, drop the assertion and emit a loud warning naming the unresolved variable. This is the safety net for stateful tests, parametrize, and anything else the inliner can't safely flatten. The warning gives the operator a clear signal that a test pattern needs different handling.
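The safety net might look like the following sketch. The helper names and the warning text are assumptions; the one detail worth noting is that names bound inside the expression itself (comprehension variables, lambda arguments) should not count as unresolved.

```python
import ast
import builtins
import sys

def unresolved_names(expr_source, safe):
    """Names the expression reads that are neither safe nor bound inside it."""
    tree = ast.parse(expr_source, mode="eval")
    loads, bound = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (loads if isinstance(node.ctx, ast.Load) else bound).add(node.id)
        elif isinstance(node, ast.arg):  # lambda parameters
            bound.add(node.arg)
    return sorted(loads - bound - safe)

def keep_resolved(assertions, safe):
    """Drop assertions with unresolved names, warning loudly on stderr."""
    kept = []
    for expr in assertions:
        missing = unresolved_names(expr, safe)
        if missing:
            print(f"WARNING: dropped {expr!r}; unresolved: {', '.join(missing)}",
                  file=sys.stderr)
        else:
            kept.append(expr)
    return kept

# Hypothetical safe set: Python builtins plus the function under test.
SAFE = set(dir(builtins)) | {"word_ladder"}
```

For example, `keep_resolved(["result == 5", "word_ladder('a', 'c', ['a', 'b', 'c']) == 2"], SAFE)` keeps only the second expression and names `result` in the warning.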

safe-set: which names are allowed to appear

The drop-unresolved-name check needs a definition of "resolved." The safe set has two sources. Python builtins (True, False, range, len, etc) are always safe. And — the most useful one — any function called via _load_solution().X() in the test file is added too. That second one handles slug-versus-function-name mismatches transparently. If the problem is called dijkstra_shortest_path but the tests call shortest_path(...) and dijkstra(...), both names get added to the safe set and the assertions extract cleanly.

The _load_solution()-based piece is small (about ten lines) but it unlocked the largest number of problems. Seven problems in our bench had this slug-versus-function-name decoupling and were entirely bench-blocked before the safe-set logic landed. The loop simply couldn't measure them.

An earlier draft of the safe set also trusted the test file's module-level imports (`import math`, `import random`, etc) as safe references. That turned out to be wrong — see the seventh shape.

running the full bench surfaced a seventh shape

With the first six shapes addressed, running the full 21-problem arena gave us 20 PASS first-try and 1 FAIL. The FAIL was on the inversion-counting problem at 8 of 9 assertions. Looking at the failing assertion:

count_inversions(random.sample(range(100000), 5000)) == 0

Two things are wrong at once. The assertion calls random.sample, which references the stdlib `random` module. And the `== 0` is wrong on its face: a random sample of 5000 distinct values has millions of inversions in expectation (about n(n-1)/4, roughly 6.2 million for n = 5000), not zero. The source test function explains both. The test sets `expected = 0`, accumulates `expected += ...` inside a for loop, then asserts against the final value of expected. The extractor saw the initial `expected = 0` assignment and substituted it, missing the accumulating updates. And it had treated `random` as safe because the test file imports it, but the bench loop's eval scope doesn't import anything.

The fix is one line removed: the safe set no longer trusts test-file imports. Assertions referencing imported modules get dropped by the unresolved-name check, with a warning on stderr. After the fix, the inversion-counting problem extracts 8 well-formed assertions instead of 9 (the ninth being the doubly broken one); the model passes all 8. Effective pass rate across the full 21-problem arena is now 21 of 21.
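The stale `expected = 0` half of this shape appears to be caught only indirectly, because the same assertion also references `random`. As a hypothetical extra guard (not something this note says was implemented), the extractor could refuse to inline any name that is rebound inside a loop. A minimal sketch, with `names_rebound_in_loops` as an assumed helper name:

```python
import ast

def names_rebound_in_loops(tree):
    """Names assigned or augmented anywhere inside a for/while loop."""
    rebound = set()
    for loop in ast.walk(tree):
        if not isinstance(loop, (ast.For, ast.While)):
            continue
        for node in ast.walk(loop):
            if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
                rebound.add(node.target.id)          # expected += ...
            elif isinstance(node, ast.Assign):
                rebound.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return rebound

# Hypothetical test in the same style as the failing one.
SRC = """
def test_accumulated():
    expected = 0
    for k in range(3):
        expected += k
    assert total(range(3)) == expected
"""
stale = names_rebound_in_loops(ast.parse(SRC))  # names unsafe to inline
```

Removing these names from the tracked-assignments table would leave `expected` unresolved, so the existing drop-with-warning check would handle the assertion without relying on the `random` reference.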

the measurement after the fix

Once the extractor handled the first five shapes, the seven previously-blocked problems got re-run through the loop. No retries, no escalation — the model gets one shot per problem with the corrected assertions:

| problem | assertions checked | result | wall |
| --- | --- | --- | --- |
| damerau-levenshtein | 12 | PASS 12/12 | 23.0 s |
| dijkstra (extend-mode) | 10 | PASS 10/10 | 10.1 s |
| graph BFS shortest path | 7 | PASS 7/7 | 8.5 s |
| kadane max subarray | 5 | PASS 5/5 | 6.4 s |
| subarray sum equals k | 8 | PASS 8/8 | 9.1 s |
| two-sum pair counting | 8 | PASS 8/8 | 5.1 s |
| wildcard pattern match | 16 | PASS 16/16 | 11.1 s |

Seven of seven. Sixty-six of sixty-six assertions. About seventy-three seconds total wall, summing the rows above. These are textbook algorithm problems and the model recognizes them as such; the bench had just been refusing to check the answers.

what this means about measuring an autonomous loop

An autonomous loop that grades itself relies on its grader telling the truth. When the grader has bugs, the loop is grading itself against a noisy proxy — failures get recorded as model weaknesses when they're actually measurement bugs. The pattern is recursive: the autonomous loop's introspection (the same loop reading its own past results) is what surfaced these. Once the loop could look at its assertions and recognize "result is undefined," the bench-author bug was visible in under a minute.

Bench-author bugs accumulate quietly. Test files get refactored, problems get added, the extractor stays the same. None of the seven shapes were obvious from any single test failure — the pattern only became visible when grouped by failing-assertion shape across the whole bench. The substrate's own introspection (its log of past attempts) is the load-bearing primitive for finding them. Without that log we'd be guessing.
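Grouping by shape can be as simple as masking the specifics in each failure message and bucketing on what remains. The sketch below assumes a log of (problem, message) pairs, which is a hypothetical format; the real log layout is not described in this note.

```python
import re
from collections import defaultdict

def failure_shape(message):
    """Collapse a failure message to a coarse shape by masking specifics."""
    shape = re.sub(r"'[^']*'", "'<name>'", message)  # quoted identifiers
    return re.sub(r"\d+", "<n>", shape)               # literal numbers

def group_by_shape(records):
    """Map each failure shape to the set of problems that exhibit it."""
    groups = defaultdict(set)
    for problem, message in records:
        groups[failure_shape(message)].add(problem)
    return dict(groups)

# Hypothetical log entries.
LOG = [
    ("word_ladder", "NameError: name 'result' is not defined"),
    ("two_sum", "NameError: name 'expected' is not defined"),
    ("kadane", "AssertionError: got 3, want 4"),
]
groups = group_by_shape(LOG)
```

A shape that spans several unrelated problems is a hint that the grader, not the model, is the common factor.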

The seventh shape is also a reminder that publishing a finding is not the end of the audit. This note went up the same day with six shapes. Running the full bench one more time the same day surfaced a seventh. The substrate kept auditing itself, and the public version got updated. That's the publish-as-procedural rule: the living lab's external face tracks the substrate's actual state, not the state at the moment of first writing.

what we are still figuring out

Two problems use stateful test patterns that the extractor cannot safely flatten — an LRU cache and an LRU cache with TTL. The tests build the cache, push values in, then assert on the resulting state. There's no clean way to fold three lines of imperative setup into a single boolean expression for the eval. The current fix drops these assertions with a loud warning, so the operator can see them clearly. The right next step is to fall back to running pytest in a subprocess against the model's code and the test file directly — a larger change because it requires plumbing pytest's output parsing into the bench loop.
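A minimal sketch of that fallback, under stated assumptions: the solution is written to `solution.py` next to the test file, and the exit code alone decides pass or fail (pytest exits 0 only when every collected test passes). `run_test_file` and the file layout are hypothetical, and the per-assertion output parsing mentioned above is not attempted here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_test_file(solution_src, test_src, runner_args=("-m", "pytest", "-q")):
    """Run a test file against a candidate solution in a separate process.

    Returns (passed, combined_output). The default runner needs pytest
    installed; runner_args=() runs the test file as a plain script instead.
    """
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution_src)
        Path(tmp, "test_solution.py").write_text(test_src)
        proc = subprocess.run(
            [sys.executable, *runner_args, "test_solution.py"],
            cwd=tmp,
            capture_output=True,
            text=True,
            timeout=60,  # keep a runaway solution from stalling the loop
        )
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Running the tests in a subprocess keeps stateful setup (build the cache, push values, then assert) exactly as written, at the cost of losing the per-assertion granularity the eval-based path provides.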

The sixth shape, the hidden rule in the spec, also remains unfixed. The problem statement says "clean a phone number" but doesn't tell the model that the convention is to strip the leading 1 (a US-specific dialing convention). The model writes a literally-correct "keep digits" function and the test rejects it. This one is a spec-clarification issue, not an extractor one. The fix is to write the convention into the problem statement.