twenty pull requests and one bench bug

Desai, Tej

Intuition Labs · φ research · 14 · measured · 17 may 2026

an evening sprint to find the box's capability ceiling on canonical-pattern coding. the lever held through every tier we built. the only failure caught the bench writer making a mistake. that is itself the finding.

8 bench iters · 6 difficulty tiers · 20/21 box-only pass · 1 CI red caught the bench writer

box_pass_rate (canonical-shape, T0) ≈ 1.000 · bench_correctness_rate (this writer) < 1.000 · the lower bound is the bench writer, not the box

the takeaway in one paragraph

We ran twenty canonical-shape coding problems through the box's autonomous loop, twenty-one attempts in all. The box wrote nineteen first-try CI-green PRs and one CI-red PR. On inspection, the red PR exposed a bug in our test specification: the box wrote the canonical algorithm correctly, and the test asserted a value from a different algorithm. We fixed the test, the box retried, and CI went green. Across the entire arc the box never made a capability mistake we could measure. The thing that finally failed was the bench, not the box.

the arc

Over one evening we built up a difficulty fan in the arena and ran the autonomous loop against each tier in sequence. Each tier added problems that probed a different shape of canonical reasoning.

tier	count	problems	result
easy	5	fib · count_vowels · reverse_string · is_palindrome · sum_digits	5/5 ✓
medium	3	longest_common_prefix · balanced_brackets · kadane_max_subarray	3/3 ✓
hard	3	lru_cache · edit_distance · graph_bfs_shortest_path	3/3 ✓
novel-shape	3	n_queens_count · word_ladder · wildcard_regex_match	3/3 ✓
perf-gated	3	count_inversions · subarray_sum_equals_k · two_sum_count_pairs	3/3 ✓
extend-mode	3	dijkstra_shortest_path · lru_with_ttl · damerau_levenshtein	2/3 ✓ + 1 bench bug
retry (iter73)	1	damerau_levenshtein after spec fix	1/1 ✓
cumulative	21	—	20/21 = 0.952
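The cumulative row mixes two counts (twenty distinct problems, twenty-one PRs), so a short tally may help. This is a sketch that only re-adds the numbers in the table above; the variable names are illustrative and not part of the arena.

```python
# (tier, attempts, CI-green) copied from the table above.
rows = [
    ("easy", 5, 5),
    ("medium", 3, 3),
    ("hard", 3, 3),
    ("novel-shape", 3, 3),
    ("perf-gated", 3, 3),
    ("extend-mode", 3, 2),    # the one CI red traced to the bench bug
    ("retry (iter73)", 1, 1), # same problem, rerun after the spec fix
]

attempts = sum(a for _, a, _ in rows)
green = sum(g for _, _, g in rows)
distinct_problems = attempts - 1  # the retry reuses an existing problem
pass_rate = green / attempts

print(f"{distinct_problems} problems, {attempts} PRs, "
      f"{green}/{attempts} = {pass_rate:.3f}")
```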

Every successful PR used the canonical Python idiom for its shape: OrderedDict for LRU, two-variable Kadane for max-subarray, modified merge-sort for inversion counting, star-recurrence DP for wildcard matching, column-plus-two-diagonal-sets backtracking for n-queens, BFS-with-one-letter-mutation for word ladder, heapq for Dijkstra. No retries, apart from the single rerun after the spec fix. No thinking-on. No human in the loop. The fast model with the drafter on chose the textbook answer every time.
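For readers who have not seen these idioms, here is a minimal sketch of two of them: two-variable Kadane and an OrderedDict-backed LRU. This is not the box's output; the names kadane_max_subarray and LRUCache and their signatures are assumptions chosen for illustration.

```python
from collections import OrderedDict


def kadane_max_subarray(nums):
    """Two-variable Kadane: best sum ending here, best sum seen so far."""
    if not nums:
        raise ValueError("nums must be non-empty")
    best_here = best_overall = nums[0]
    for x in nums[1:]:
        # Either extend the current run or start a new one at x.
        best_here = max(x, best_here + x)
        best_overall = max(best_overall, best_here)
    return best_overall


class LRUCache:
    """LRU on OrderedDict: the most recently used key sits at the end."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used
```

Both fit in a few lines because the standard library already carries the bookkeeping, which is plausibly why they are the textbook answers for their shapes.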

the one failure

On the extend-mode tier, we gave the box an existing edit-distance implementation and asked it to add a damerau_distance function — the variant that counts adjacent-pair transpositions as one edit. The problem.md specified the canonical 2D-DP recurrence by name: Optimal String Alignment (OSA), which fires the transposition case only when the literal characters at adjacent positions swap. The box wrote the OSA recurrence exactly.
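A minimal sketch of the OSA recurrence, assuming the standard textbook formulation. It is not the box's PR; the function name osa_distance is hypothetical (the arena function is damerau_distance).

```python
def osa_distance(a, b):
    """Optimal String Alignment distance via the usual 2D DP table."""
    m, n = len(a), len(b)
    # d[i][j] is the distance between a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition fires only when the two adjacent characters
            # are literally swapped between a and b.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

Under this sketch a plain swap such as 'ab' to 'ba' costs one edit, while 'ca' to 'abc' costs three because the swap condition never holds.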

One of our tests asserted that damerau_distance('ca', 'abc') should equal 2. The intuitive reasoning is 'transpose ca→ac, insert b, that's two edits'. That answer is correct under true (unrestricted) Damerau-Levenshtein, which tracks the last position of each character so that a transposed pair can have other edits land between its two characters. But the canonical 2D DP (the one in every algorithms course, the one our problem.md specified) is OSA, and OSA never edits the same substring twice: it cannot transpose ca→ac and then insert b between the swapped characters. Concretely, the transposition condition does not hold at any (i,j) for these strings, so OSA falls back to plain Levenshtein and returns 3.
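For contrast, here is a sketch of the unrestricted Damerau-Levenshtein variant, following the well-known formulation with a per-character last-seen table. It is an illustration only, not code from the arena, and the function name is hypothetical. On ('ca', 'abc') it yields the 2 that the original test expected.

```python
def damerau_levenshtein_unrestricted(a, b):
    """Unrestricted Damerau-Levenshtein with a last-seen-row table."""
    m, n = len(a), len(b)
    max_dist = m + n
    last_row = {}  # character -> last 1-based row of a where it appeared
    # d carries one extra border row and column filled with max_dist;
    # d[i + 1][j + 1] is the distance between a[:i] and b[:j].
    d = [[max_dist] * (n + 2) for _ in range(m + 2)]
    for i in range(m + 1):
        d[i + 1][1] = i
    for j in range(n + 1):
        d[1][j + 1] = j
    for i in range(1, m + 1):
        last_col = 0  # last column in this row where a[i-1] matched b
        for j in range(1, n + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,     # match or substitution
                d[i + 1][j] + 1,    # insertion
                d[i][j + 1] + 1,    # deletion
                # Transposition, with edits allowed between the pair.
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),
            )
        last_row[a[i - 1]] = i
    return d[m + 1][n + 1]


print(damerau_levenshtein_unrestricted("ca", "abc"))
```

The two variants agree on most everyday inputs and diverge only on cases like this one, which is how a value from the wrong algorithm can slip into a test file unnoticed.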

So our spec was internally inconsistent: problem.md said 'use OSA' and one test asserted a non-OSA value. CI flagged the inconsistency. The box's algorithm was the canonical answer the spec asked for. The bench writer (the author of this note) was the source of the error. After we corrected the test value from 2 to 3, the box retried the same problem and produced the same algorithm; CI went green.

what this measures (and what it does not)

We have not found the box's capability ceiling. What we have found is a strong lower bound: on any well-specified problem whose canonical solution exists in pre-training, the box reliably reaches for the right idiom — including under performance gates that punish brute force and under extend-mode prompts that require reading existing code. The bench is what is exhausted, not the box. Every problem we picked has been written and posted thousands of times in some form; the model has seen them all.

The meta-finding worth carrying forward: the bench has reached the level where the bench writer is the first failure point. To find a real capability ceiling on canonical-algorithm work, the bench would have to be written by someone with deeper canonical knowledge than the box — and the box has the entire literature in pre-training. That is structurally hard. The more productive next pivots are away from canonical-algorithm benches: toward multi-file integration tasks, tasks that require reading non-canonical existing codebases, or tasks that genuinely have no canonical answer in pre-training.

The meta-meta-finding worth carrying forward: the autonomous loop is self-correcting in this regime. A bench bug surfaces as a CI red. The operator investigates the red. The spec gets fixed. The slug becomes dedup-free. The box re-attempts with the same canonical algorithm. CI green resumes. The loop never asked us for anything; it just produced evidence of where the spec was wrong, and we followed.

what a buyer is getting

The product proposition this sprint clarified: the box is a reliable layer for canonical-shape coding work, paced by GitHub's rate limit and verified by GitHub's CI. A buyer routing a problem through this platform gets the textbook answer for that shape, written in idiomatic Python, with a PR they can review and a CI verdict they can trust. The box does not improvise; it retrieves. That is the bound, not the floor.

For coding tasks that DO have a textbook answer (the long tail of "we need an X but writing it from scratch is tedious"), the box compresses the time from spec to a verified, mergeable PR to roughly 8 to 16 seconds of compute plus the rate-limit wall. For coding tasks that DO NOT have a textbook answer (novel domain logic, deeply integrated work, problems where the right algorithm is not in pre-training), the box has not been measured, and this note does not claim to know. That is the next experiment.

The arena is the live demonstration: twenty-one box-authored PRs, twenty CI-green, one closed-without-merge after we fixed the spec. The MCP observer is the buyer's read interface: drop it into any AI assistant and ask about the rate live. CONTRIBUTING.md documents the submission path for a non-owner. The product surface is real; the math note that frames it is research note 03.