Intuition Labs · φ research · 03 · measured · 16 may 2026

pick your target, same algebra

intelligence, cost, efficiency, latency — all ratios of the same four observables. the flag, the polling-rate readout, the GitHub-CI feed, and the active flavor (box opens its own PR; CI is the verifier) all shipped this week.

4 targets · 19 CI-green + 1 bench-bug-red box PR · the bench writer was the first failure point
bandwidth = P·R·B · efficiency = (P·B)/C · cost/pass = C/P · time-to-pass = 1/(P·R) · four targets, parametric in (P, R, C, B)

the problem

Most coding-agent benchmarks pick one optimization target by accident — usually 'pass@k at fixed N' — and design the whole stack around it. That target encodes assumptions: that the corpus is finite, that the sample budget is the constraint, that 'intelligence' is the only thing being bought.

A startup running CI checks at scale wants minimum cost. A research lab wants maximum throughput of correct decisions. A live coding assistant wants minimum latency to the first correct answer. Same box, same techniques, different optima.

These aren’t separate fields. They’re ratios of the same four primitives, and the math tells you which one you’re actually optimizing.

the four primitives

Every iteration the box runs has four observable quantities:

symbol | name | units | measured by
P | pass_rate | dimensionless (0..1) | task verifier (CI green / pytest / owner ack)
R | iter_per_sec | 1 / time | wall clock
C | cost_per_iter | energy, dollars, or tokens | GPU power · time, or token count · price
B | bits_per_task | information / task | log2 of the choice space the task narrows

Pick a basis for C (joules, dollars, tokens — your call). The algebra is the same.

the four targets

Four useful objectives, each a ratio of the four primitives:

target | formula | units | direction
intelligence bandwidth | P · R · B | bits / sec | max
cost per pass | C / P | cost / pass | min
efficiency | (P · B) / C | bits / cost | max
time to first pass | 1 / (P · R) | sec | min
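A minimal sketch of the table as code, assuming one measurement of (P, R, C, B) per configuration. The `Primitives` class and the `targets` function are names chosen for this illustration, not part of the codebox, and the numbers in the example call are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitives:
    """One measurement of the four observables (hypothetical container)."""
    P: float  # pass_rate, dimensionless, 0..1
    R: float  # iter_per_sec
    C: float  # cost_per_iter, in whichever basis was picked for C
    B: float  # bits_per_task


def targets(x: Primitives) -> dict[str, float]:
    """Evaluate the four objectives from the table for one measurement."""
    if not 0.0 < x.P <= 1.0:
        raise ValueError("P must be in (0, 1] for the ratios to be finite")
    if x.R <= 0.0 or x.C <= 0.0:
        raise ValueError("R and C must be positive")
    return {
        "intelligence": x.P * x.R * x.B,  # bits / sec, maximize
        "cost": x.C / x.P,                # cost / pass, minimize
        "efficiency": x.P * x.B / x.C,    # bits / cost, maximize
        "latency": 1.0 / (x.P * x.R),     # sec, minimize
    }


# Illustrative numbers only, not measurements from this note.
example = targets(Primitives(P=0.5, R=2.0, C=4.0, B=8.0))
```

Changing the basis of C rescales `cost` and `efficiency` by a constant and leaves `intelligence` and `latency` untouched, which is why the choice of basis does not change the algebra.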

Three dimensional facts fall out of the algebra:

That last one is the throughput-and-latency-friendly Pareto region: a tech that raises R without hurting P also lowers latency. Most speculative-decoding wins live here.
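As a worked check, a few identities follow directly from the four definitions in the targets table. They are derived here from the formulas alone and may not be the same three facts the note has in mind.

```latex
\begin{align*}
\text{bandwidth} \times \text{time-to-pass}
  &= (P R B)\cdot\frac{1}{P R} = B
  && \text{(bits per pass)} \\
\text{efficiency} \times \text{cost/pass}
  &= \frac{P B}{C}\cdot\frac{C}{P} = B
  && \text{(the same bits per pass)} \\
\frac{\text{bandwidth}}{\text{efficiency}}
  &= \frac{P R B}{P B / C} = R\,C
  && \text{(spend rate, cost / sec)} \\
\frac{\partial}{\partial R}\,(P R B) = P B \ge 0,
  \qquad
\frac{\partial}{\partial R}\,\frac{1}{P R}
  &= -\frac{1}{P R^{2}} < 0
  && \text{(at fixed } P,\ B\text{)}
\end{align*}
```

The last line is the Pareto statement in symbols: at fixed P, raising R raises bandwidth and lowers time to first pass at the same time.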

the rate ceiling

Fix one thing: pass@1 instead of pass@k. The 'k' dimension collapses. What was 'how many passes in N samples' becomes 'probability per attempt.' Attempts are now an unbounded stream, not a fixed-N sample.
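The move from counting to rate can be sketched numerically. This assumes attempts are independent with a constant per-attempt pass probability `p`, which is a simplification added here rather than something the note states; the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes."""
    return 1.0 - (1.0 - p) ** k


def mean_attempts_to_first_pass(p: float, terms: int = 1_000) -> float:
    """Numerically sum n * p * (1 - p)**(n - 1).

    The closed form is 1 / p; the truncated sum is only accurate when
    p is not tiny relative to 1 / terms.
    """
    return sum(n * p * (1.0 - p) ** (n - 1) for n in range(1, terms + 1))


def mean_seconds_to_first_pass(p: float, r: float) -> float:
    """Mean attempts (1 / p) divided by attempts per second (r)."""
    return 1.0 / (p * r)
```

Under that assumption pass@k saturates toward 1 as k grows, while the stream view keeps a single number, P, and turns it into a waiting time of 1/(P·R), the same expression as the latency target.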

Pair pass@1 with a live external-verifier corpus (GitHub issues with CI, pytest suites against a watched repo, merged-PR signals) and the rate has no built-in ceiling: R is bounded only by the substrate clock and the verifier’s latency, and both can keep improving.

The 'count the samples' framing has a built-in ceiling at N. Switch to rate, and the corpus becomes a stream the box samples continuously. That's where recursive self-improvement actually closes — each iteration's outcome trains the next, and there's no fixed-N saturation to memorize against.

The catch: the verifier must be external truth. Recursive training on self-generated outputs collapses without an outside check. GitHub CI green is an external check. A model rating its own work is not.

phi over the cloud

Each technique in the stack has a sign that depends on the prompt class — we measured this directly and wrote it up in the composition note — and we now extend that sign to also depend on the chosen target. The same retry-with-thinking rule that helps intelligence on a hard prompt hurts cost on an easy one.

Read those interactions as points in a four-dimensional cloud: (tech_i, tech_j, prompt_class, target). Each point carries field values (Δpass, Δwall, Δcost, phi). Phi becomes a continuous field over the cloud — high where the algebra composes cleanly, discontinuous where two techniques amplify each other into a wall blow-up.
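One way to hold such a cloud in memory, as a sketch only: the coordinate and field names come from the paragraph above, while the `Coord` and `FieldValues` types, the `slice_by_target` helper, and every number below are placeholders invented for illustration, not measured values or the codebox's actual schema.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    """A point in the (tech_i, tech_j, prompt_class, target) cloud."""
    tech_i: str
    tech_j: str
    prompt_class: str
    target: str


class FieldValues(NamedTuple):
    """Field values carried by one point."""
    d_pass: float  # Δpass
    d_wall: float  # Δwall, seconds
    d_cost: float  # Δcost, in the basis chosen for C
    phi: float


# Placeholder entries, not measurements.
cloud: dict[Coord, FieldValues] = {
    Coord("retry", "thinking", "hard", "intelligence"): FieldValues(0.10, 5.0, 2.0, 0.8),
    Coord("retry", "thinking", "easy", "cost"): FieldValues(0.00, 5.0, 2.0, 0.2),
}


def slice_by_target(points: dict[Coord, FieldValues], target: str) -> dict[Coord, FieldValues]:
    """Keep only the interactions recorded against one optimization target."""
    return {c: v for c, v in points.items() if c.target == target}
```

Keying on the full four-tuple means the same technique pair can carry different field values per prompt class and per target, which is the dependence the paragraph describes.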

The encoding is the same one transformers use: a semantic coordinate (what this interaction means) plus a positional coordinate (when and where it fired in the substrate). Project to 2-D, render it, watch the box ‘think’ as it walks the cloud.

what this changes for the operator

Today the codebox bakes one decision rule into its sweep harness — roughly bandwidth-favoring, with a tail-latency amendment. Tomorrow that becomes a flag:

codebox improve KNOB=B:T --target intelligence       # max bandwidth
codebox improve KNOB=B:T --target cost                # min $/pass
codebox improve KNOB=B:T --target efficiency          # max bits/$
codebox improve KNOB=B:T --target latency             # min time-to-pass
codebox improve KNOB=B:T --target-weights P=0.4,R=0.3,C=-0.2,B=0.1

Same bench. Same techniques. Different winning configurations because the math weighs the primitives differently.
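A sketch of why the winner moves with the target. Every target in this note is a product of powers of P, R, C, and B, so one plausible reading of `--target-weights` is as exponents in a log-linear score. That reading is an assumption made here; the note does not say how the codebox combines the weights. The configuration names and numbers are invented.

```python
import math

# Named targets written as exponents, oriented so that higher is better.
# (Assumed mapping: minimizing C/P is the same as maximizing P/C, and so on.)
NAMED_TARGETS = {
    "intelligence": {"P": 1, "R": 1, "B": 1},
    "cost": {"P": 1, "C": -1},
    "efficiency": {"P": 1, "B": 1, "C": -1},
    "latency": {"P": 1, "R": 1},
}


def parse_weights(spec: str) -> dict[str, float]:
    """Parse a string such as 'P=0.4,R=0.3,C=-0.2,B=0.1'."""
    return {k: float(v) for k, v in (item.split("=") for item in spec.split(","))}


def score(measured: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of log-primitives; higher is better."""
    return sum(w * math.log(measured[k]) for k, w in weights.items())


def best(configs: dict[str, dict[str, float]], weights: dict[str, float]) -> str:
    """Name of the configuration with the highest score."""
    return max(configs, key=lambda name: score(configs[name], weights))


# Hypothetical sweep results, not measurements.
configs = {
    "fast_cheap": {"P": 0.6, "R": 2.0, "C": 1.0, "B": 8.0},
    "slow_thorough": {"P": 0.9, "R": 0.5, "C": 1.2, "B": 8.0},
}
winners = {name: best(configs, w) for name, w in NAMED_TARGETS.items()}
```

With these made-up numbers `fast_cheap` wins on intelligence and latency while `slow_thorough` wins on cost and efficiency: same two configurations, different optimum per target.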

what landed today (17 may 2026)

All three pieces of this note ship together. The note flips to MEASURED.

first live numbers

Three sources side-by-side. The github row now includes our own arena alongside the observation of llama.cpp:

source | n | pass | rate@1 | pass/s | verifier
github (llama.cpp + arena) | 50 | 38 | 0.760 | 0.00016 | GitHub CI
local-corpus | 10 | 7 | 0.700 | 0.00097 | pytest on 34-exercise polyglot corpus
proof-of-life | 34 | 34 | 1.000 | 0.00020 | pytest on trivial add(a,b) heartbeat
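The table can be read back through the algebra. The sketch below assumes the pass/s column is the product P·R from the primitives table, which the note does not state outright, and derives the implied seconds per pass and iteration rate from the published figures.

```python
# (n, passes, pass_per_sec) copied from the table above.
rows = {
    "github": (50, 38, 0.00016),
    "local-corpus": (10, 7, 0.00097),
    "proof-of-life": (34, 34, 0.00020),
}


def derive(n: int, passes: int, pass_per_sec: float) -> dict[str, float]:
    """Recompute rate@1 and derive latency and R under the P*R assumption."""
    p = passes / n
    return {
        "rate_at_1": p,                            # P
        "sec_per_pass": 1.0 / pass_per_sec,        # 1 / (P * R)
        "implied_iter_per_sec": pass_per_sec / p,  # R = (P * R) / P
    }


derived = {name: derive(*row) for name, row in rows.items()}
```

On these figures the github row works out to 6250 seconds per pass, local-corpus to roughly 1031 seconds, and proof-of-life to 5000 seconds.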

Of the 38 passing PRs on github, nineteen were opened by the box itself against problems on its own arena repo — across six tiers now: easy, medium, hard, novel-shape composition, performance-gated (where pytest-timeout punishes brute force), and extend-mode (where the box reads existing code and adds to it). The box also has one failing PR against the arena. We spent an evening on that one failure; the resolution is the most interesting datapoint of the week.

the active flavor (added 17 may, evening)

The follow-up named below ("the box opens its own PR, waits for CI, takes the verdict") shipped the same day. We opened a public arena repo, gave the box a problem-issue label and an experiment-PR label, and pointed the loop at it.

Across the week the arena has grown to twenty-one problems split across six tiers: easy, medium, hard, novel-shape composition, performance-gated (where pytest-timeout makes algorithm choice matter), and extend-mode (where the box reads the prior implementation and adds to it). On twenty box-authored attempts, the box has nineteen CI-green PRs. The pattern in every successful PR is the same: the box reaches for the canonical Python idiom — OrderedDict for LRU, vertical-scan for longest-common-prefix, modified merge-sort for inversion counting, star-recurrence DP for wildcard matching, OSA 2D DP for Damerau-Levenshtein. No retries, no thinking-on, no human in the loop.

The one failure is the most informative datapoint of the entire week. We asked the box to extend an edit-distance function to handle transpositions (Damerau-Levenshtein). The box wrote the canonical 2D DP with the transposition case, which the literature calls Optimal String Alignment (OSA). One of our tests asserted a specific answer on `("ca", "abc")` that intuitively involves "transpose then insert." That assertion is correct under true Damerau-Levenshtein, a more general algorithm that also tracks the last position at which each character appeared, but NOT under OSA, which only counts a swap of two adjacent characters and never edits the same substring twice, so it cannot transpose "ca" to "ac" and then insert "b" between the swapped pair. The canonical 2D DP (the textbook answer, the one in every algorithms course) is OSA. So the box wrote the standard textbook answer; we wrote a test for a different algorithm. The CI verdict caught our spec error, not the box's code.
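The gap between the two algorithms is easy to reproduce. The sketch below is a textbook implementation of each, written for this note as an illustration; it is not the box's PR and not the arena's test file.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment: 2D DP plus an adjacent-swap case."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Transposition fires only when the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


def damerau_levenshtein_distance(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein using a last-seen-row table."""
    m, n = len(a), len(b)
    inf = m + n
    last_row: dict[str, int] = {}  # last row of a where each character appeared
    # d is shifted by one in both axes; row 0 and column 0 hold the sentinel.
    d = [[inf] * (n + 2) for _ in range(m + 2)]
    for i in range(m + 1):
        d[i + 1][1] = i
    for j in range(n + 1):
        d[1][j + 1] = j
    for i in range(1, m + 1):
        last_col = 0  # last column in this row where a[i-1] matched b
        for j in range(1, n + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,       # substitute or match
                d[i + 1][j] + 1,      # insert
                d[i][j + 1] + 1,      # delete
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),  # transpose across a gap
            )
        last_row[a[i - 1]] = i
    return d[m + 1][n + 1]
```

On `("ca", "abc")` the first function returns 3 and the second returns 2, so a test written against one definition fails an implementation of the other even when both are correct.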

That datapoint matters more than another nineteen passes would have. The bench has reached the level where the bench writer can be the first failure point. To find the box's actual ceiling on this kind of work, the bench would have to be written by someone with deeper canonical-algorithm knowledge than the box has — and the box has the entire literature in pre-training. The more productive next steps pivot away from "construct a harder canonical problem" toward (a) multi-file integration tasks (real engineering, not isolated functions), (b) tasks that require reading a non-canonical existing codebase, or (c) opening the box up as a product surface where users-other-than-the-owner submit work.

The observer side is shipped as a standalone MCP tool with zero codebox dependencies. Any MCP-capable client can install it and ask "how is the box doing on its arena?" or "what is the merge rate on llama.cpp?" without touching anything else in the stack.

what we are not claiming

where this came from

Two prior notes set up the primitives. Note 01 established phi as a halt signal — the moment the model commits, you stop. Note 02 turned the same signal into a self-rating without a second model. This note steps back: what optimization are you actually buying when you use those signals?

The answer is parametric. Halt + rate-collapse buys bandwidth and latency on cheap problems. Halt + score buys efficiency on best-of-N. The same control plane serves all four targets; the algebra picks which one fires.

TFlow (BWR-hhh) sits in the same algebra: their weight-space communication trades C (token cost) for B (information density) at fixed P. Phi-as-acceptance-test on the patched receiver is the validation. One control plane, growing list of data planes, now one algebra over both.