pick your target, same algebra
intelligence, cost, efficiency, latency — all ratios of the same four observables. the flag, the polling-rate readout, the GitHub-CI feed, and the active flavor (box opens its own PR; CI is the verifier) all shipped this week.
the problem
Most coding-agent benchmarks pick one optimization target by accident — usually 'pass@k at fixed N' — and design the whole stack around it. That target encodes assumptions: that the corpus is finite, that the sample budget is the constraint, that 'intelligence' is the only thing being bought.
A startup running CI checks at scale wants minimum cost. A research lab wants maximum throughput of correct decisions. A live coding assistant wants minimum latency to the first correct answer. Same box, same techniques, different optima.
These aren’t separate fields. They’re ratios of the same four primitives, and the math tells you which one you’re actually optimizing.
the four primitives
Every iteration the box runs has four observable quantities:
| symbol | name | units | measured by |
|---|---|---|---|
| P | pass_rate | dimensionless (0..1) | task verifier (CI green / pytest / owner ack) |
| R | iter_per_sec | 1 / time | wall clock |
| C | cost_per_iter | energy, dollars, or tokens | GPU power · time, or token count · price |
| B | bits_per_task | information / task | log2 of the choice space the task narrows |
Pick a basis for C (joules, dollars, tokens — your call). The algebra is the same.
the four targets
Four useful objectives, each a product or quotient of the four primitives:
| target | formula | units | direction |
|---|---|---|---|
| intelligence bandwidth | P · R · B | bits / sec | max |
| cost per pass | C / P | cost / pass | min |
| efficiency | (P · B) / C | bits / cost | max |
| time to first pass | 1 / (P · R) | sec | min |
Three dimensional facts fall out of the algebra:
- Efficiency is rate-invariant. Written as bandwidth over spend rate, (P · R · B) / (C · R), R cancels and leaves (P · B) / C. You can chase bits-per-dollar at any throughput.
- Cost-per-pass is rate-invariant. Same reason: (C · R) / (P · R) = C / P. Pure per-iteration economics.
- Bandwidth and latency are both rate-coupled, and they move together: higher R raises bandwidth and lowers time to first pass.
That last one is the throughput-and-latency-friendly Pareto region: a tech that raises R without hurting P also lowers latency. Most speculative-decoding wins live here.
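The table and the three facts above can be checked numerically. The following is a minimal sketch, assuming illustrative field names and made-up primitive values (none of them come from the codebox itself); it evaluates the four formulas as written and compares two configurations that differ only in R.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitives:
    """The four observables from the table (field names are illustrative)."""
    pass_rate: float      # P, dimensionless, must be > 0 here
    iter_per_sec: float   # R, iterations per wall-second
    cost_per_iter: float  # C, in whichever basis you picked
    bits_per_task: float  # B, bits of choice space narrowed per task


def targets(x: Primitives) -> dict[str, float]:
    """Evaluate the four target formulas exactly as the table states them."""
    return {
        "intelligence_bandwidth": x.pass_rate * x.iter_per_sec * x.bits_per_task,
        "cost_per_pass": x.cost_per_iter / x.pass_rate,
        "efficiency": x.pass_rate * x.bits_per_task / x.cost_per_iter,
        "time_to_first_pass": 1.0 / (x.pass_rate * x.iter_per_sec),
    }


# Two hypothetical configurations that differ only in R.
base = Primitives(pass_rate=0.5, iter_per_sec=2.0, cost_per_iter=0.01, bits_per_task=8.0)
fast = Primitives(pass_rate=0.5, iter_per_sec=4.0, cost_per_iter=0.01, bits_per_task=8.0)

t_base, t_fast = targets(base), targets(fast)
for name in t_base:
    print(f"{name:24s} base={t_base[name]:<10.4g} fast={t_fast[name]:.4g}")
```

Doubling R doubles bandwidth and halves time to first pass, while cost per pass and efficiency are unchanged, which is the rate-invariance claim in concrete form.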
the rate ceiling
Fix one thing: pass@1 instead of pass@k. The 'k' dimension collapses. What was 'how many passes in N samples' becomes 'probability per attempt.' Attempts are now an unbounded stream, not a fixed-N sample.
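Reading pass@1 as a probability per attempt also recovers the latency formula from the targets table. A short derivation, under the assumption (not stated in the note) that attempts are independent, share the same pass probability P, and each take 1/R seconds, with K the number of attempts up to and including the first pass and T the wall time to that pass:

```latex
\begin{aligned}
\Pr[K = k] &= (1 - P)^{k-1}\,P, \qquad k = 1, 2, \dots \\
\mathbb{E}[K] &= \sum_{k \ge 1} k\,(1 - P)^{k-1}\,P = \frac{1}{P} \\
\mathbb{E}[T] &= \frac{\mathbb{E}[K]}{R} = \frac{1}{P \cdot R}
\end{aligned}
```

If P drifts between attempts, as it would when each outcome trains the next, the closed form no longer holds exactly and 1 / (P · R) is only an estimate at the current operating point.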
Pair pass@1 with a live external-verifier corpus — GitHub issues with CI; pytest suites against a watched repo; merged-PR signals — and the rate has no ceiling. R is bounded only by the substrate clock and the verifier’s latency. Both are improvable indefinitely.
The 'count the samples' framing has a built-in ceiling at N. Switch to rate, and the corpus becomes a stream the box samples continuously. That's where recursive self-improvement actually closes — each iteration's outcome trains the next, and there's no fixed-N saturation to memorize against.
The catch: the verifier must be external truth. Recursive training on self-generated outputs collapses without an outside check. GitHub CI green is an external check. A model rating its own work is not.
phi over the cloud
Each technique in the stack has a sign that depends on the prompt class — we measured this directly and wrote it up in the composition note — and we now extend that sign to also depend on the chosen target. The same retry-with-thinking rule that helps intelligence on a hard prompt hurts cost on an easy one.
Read those interactions as points in a four-dimensional cloud: (tech_i, tech_j, prompt_class, target). Each point carries field values (Δpass, Δwall, Δcost, phi). Phi becomes a continuous field over the cloud — high where the algebra composes cleanly, discontinuous where two techniques amplify each other into a wall blow-up.
The encoding is the same one transformers use: a semantic coordinate (what this interaction means) plus a positional coordinate (when and where it fired in the substrate). Project to 2-D, render it, watch the box ‘think’ as it walks the cloud.
what this changes for the operator
Today the codebox bakes one decision rule into its sweep harness — roughly bandwidth-favoring, with a tail-latency amendment. Tomorrow that becomes a flag:
codebox improve KNOB=B:T --target intelligence # max bandwidth
codebox improve KNOB=B:T --target cost # min $/pass
codebox improve KNOB=B:T --target efficiency # max bits/$
codebox improve KNOB=B:T --target latency # min time-to-pass
codebox improve KNOB=B:T --target-weights P=0.4,R=0.3,C=-0.2,B=0.1
Same bench. Same techniques. Different winning configurations because the math weighs the primitives differently.
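One way to read the weighted form, offered as an assumption since the note does not define how the weights are combined: every named target is a monomial in P, R, C, and B, so in log space each becomes a linear score with small integer exponents, and a weight vector generalizes that to arbitrary exponents. The sketch below uses hypothetical names (`TARGET_EXPONENTS`, `log_score`, `pick`) and two made-up configurations; it is not the codebox's actual score function.

```python
import math

# Exponents of (P, R, C, B) for each target formula. The two "min" targets
# are inverted so that a higher score is always better.
TARGET_EXPONENTS = {
    "intelligence": {"P": 1, "R": 1, "C": 0, "B": 1},   # max P*R*B
    "cost":         {"P": 1, "R": 0, "C": -1, "B": 0},  # min C/P == max P/C
    "efficiency":   {"P": 1, "R": 0, "C": -1, "B": 1},  # max P*B/C
    "latency":      {"P": 1, "R": 1, "C": 0, "B": 0},   # min 1/(P*R) == max P*R
}


def log_score(primitives: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of log-primitives; requires every used primitive > 0."""
    return sum(w * math.log(primitives[k]) for k, w in weights.items() if w != 0)


def pick(configs: dict[str, dict[str, float]], weights: dict[str, float]) -> str:
    """Name of the configuration with the highest score under these weights."""
    return max(configs, key=lambda name: log_score(configs[name], weights))


# Made-up measurements for two hypothetical configurations.
CONFIGS = {
    "fast":     {"P": 0.6, "R": 2.0, "C": 0.01, "B": 8.0},
    "thinking": {"P": 0.9, "R": 1.5, "C": 0.04, "B": 8.0},
}

for target, exponents in TARGET_EXPONENTS.items():
    print(target, "->", pick(CONFIGS, exponents))

custom = {"P": 0.4, "R": 0.3, "C": -0.2, "B": 0.1}
print("custom ->", pick(CONFIGS, custom))
```

With these numbers the intelligence and latency targets prefer `thinking`, while cost and efficiency prefer `fast`: same measurements, different winner.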
what landed today (17 may 2026)
All three pieces of this note ship together. The note flips to MEASURED.
- codebox improve --target {intelligence-bandwidth, cost-per-pass, efficiency, latency, sku} selects the decision rule per the algebra above. Each target maps to a score function; the same hard pass_rate-non-regression floor applies to all four. Default --target sku reads the active SKU's target from ~/.box/active.json — the identity layer chooses the optimization frame; the substrate just measures and obeys.
- box show <SKU> prints the target alongside orchestrator, termination, and the pairwise composition matrix. Codebox@1.0.0 declares target=intelligence-bandwidth (the foundational Hamiltonian: max P·R·B). DeerFlow@0.1.0 declares the same. Future SKUs (CI-bot, live-assistant) will pick cost-per-pass and latency respectively.
- codebox livecorpus rate emits rate@1 (pass/total) and pass/s (passes per wall-second) over the trailing N entries from logs/livecorpus.jsonl and logs/proof-of-life.jsonl, bucketed by source. The ceiling is unbounded — improving substrate clock or verifier latency raises pass/s indefinitely.
- codebox livecorpus-github (the third piece) pulls real merged PRs from a configured GitHub repo via the gh CLI and appends them to livecorpus.jsonl with passed=(no FAILURE check conclusions). The verifier is GitHub's CI — external, not the box rating itself. The stream is unbounded as new PRs land continuously across the platform.
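The two readouts described above reduce to a few lines of arithmetic. A minimal sketch, assuming a hypothetical record shape (`source`, `passed`, `ts` in seconds) and assuming wall-seconds means the span between the first and last entry in a bucket; the actual livecorpus.jsonl schema and the way codebox measures the window are not given in this note.

```python
from collections import defaultdict


def passed(check_conclusions: list[str]) -> bool:
    """A PR passes when none of its check conclusions is FAILURE.

    Note that under this rule a PR with no checks at all also counts as passed.
    """
    return "FAILURE" not in check_conclusions


def rates(entries: list[dict], n: int) -> dict[str, dict[str, float]]:
    """rate@1 and pass/s per source over the trailing n entries."""
    buckets = defaultdict(list)
    for entry in entries[-n:]:
        buckets[entry["source"]].append(entry)

    summary = {}
    for source, rows in buckets.items():
        total = len(rows)
        passes = sum(1 for row in rows if row["passed"])
        # Assumed definition of the window: first to last timestamp in the bucket.
        span = max(row["ts"] for row in rows) - min(row["ts"] for row in rows)
        summary[source] = {
            "n": total,
            "pass": passes,
            "rate@1": passes / total,
            "pass/s": passes / span if span > 0 else float("nan"),
        }
    return summary


# Made-up entries for illustration only.
entries = [
    {"source": "github", "passed": passed(["SUCCESS", "SUCCESS"]), "ts": 0.0},
    {"source": "github", "passed": passed(["SUCCESS", "SKIPPED"]), "ts": 100.0},
    {"source": "github", "passed": passed(["SUCCESS", "FAILURE"]), "ts": 200.0},
    {"source": "github", "passed": passed(["SUCCESS"]), "ts": 400.0},
]
summary = rates(entries, n=4)
print(summary["github"])
```

Because pass/s divides by wall time rather than by a fixed sample count, shortening verifier latency raises it even when rate@1 stays flat.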
first live numbers
Three sources side-by-side. The github row now includes our own arena alongside the observation of llama.cpp:
| source | n | pass | rate@1 | pass/s | verifier |
|---|---|---|---|---|---|
| github (llama.cpp + arena) | 50 | 38 | 0.760 | 0.00016 | GitHub CI |
| local-corpus | 10 | 7 | 0.700 | 0.00097 | pytest on 34-exercise polyglot corpus |
| proof-of-life | 34 | 34 | 1.000 | 0.00020 | pytest on trivial add(a,b) heartbeat |
Of the 38 passing PRs on github, nineteen were opened by the box itself against problems on its own arena repo — across six tiers now: easy, medium, hard, novel-shape composition, performance-gated (where pytest-timeout punishes brute force), and extend-mode (where the box reads existing code and adds to it). The box also has one failing PR against the arena. We spent an evening on that one failure; the resolution is the most interesting datapoint of the week.
the active flavor (added 17 may, evening)
The follow-up named below ("the box opens its own PR, waits for CI, takes the verdict") shipped the same day. We opened a public arena repo, gave the box a problem-issue label and an experiment-PR label, and pointed the loop at it.
- codebox-arena · public github repo with nine problems split between easy (fib, count_vowels, reverse_string, is_palindrome, sum_digits, add) and medium (kadane_max_subarray, balanced_brackets, longest_common_prefix). Each has a markdown spec, a stub, and pytest tests. Issues are labeled codebox-problem.
- codebox attempt · reads a labeled issue, calls codebox solve, writes the candidate to problems/<slug>/solution.py, opens a PR labeled codebox-experiment. Rate-limited to one PR per five minutes per repo. Slug-deduped — it skips any slug that already has an open experiment PR.
- codebox loop --attempt-repo OWNER/NAME [--attempt-only] [--cycle-sleep N] · the active-flavor step inside the recursive self-improvement cycle. With --attempt-only the loop runs proof-of-life and the PR attempt in about eight seconds and skips the heavy local cycle. With --cycle-sleep 305 the loop self-paces against the per-repo five-minute rate-limit — that combination is the autonomous daemon mode.
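The interaction between slug dedupe, the five-minute rate limit, and `--cycle-sleep 305` can be sketched against a fake clock. All function names here are hypothetical and the logic is a simplified model of the behavior described above, not the codebox implementation; only the 300-second limit and the 305-second sleep come from the note.

```python
RATE_LIMIT_SECONDS = 300   # one PR per five minutes per repo
CYCLE_SLEEP_SECONDS = 305  # the --cycle-sleep value quoted above


def next_slug(problem_slugs, open_experiment_slugs):
    """First problem slug without an open experiment PR, or None."""
    for slug in problem_slugs:
        if slug not in open_experiment_slugs:
            return slug
    return None


def may_open_pr(now, last_pr_time):
    """True when the per-repo rate limit allows another PR."""
    return last_pr_time is None or now - last_pr_time >= RATE_LIMIT_SECONDS


def simulate(problem_slugs, cycles, cycle_sleep=CYCLE_SLEEP_SECONDS):
    """Run the attempt loop on a simulated clock; returns [(time, slug), ...]."""
    now, last_pr_time = 0.0, None
    open_slugs, opened = set(), []
    for _ in range(cycles):
        slug = next_slug(problem_slugs, open_slugs)
        if slug is not None and may_open_pr(now, last_pr_time):
            opened.append((now, slug))
            open_slugs.add(slug)  # dedupe: never reopen an open slug
            last_pr_time = now
        now += cycle_sleep
    return opened


paced = simulate(["fib", "add", "kadane_max_subarray"], cycles=5)
unpaced = simulate(["fib", "add"], cycles=10, cycle_sleep=8)
print(paced)
print(unpaced)
```

With a 305-second sleep every cycle clears the limit and opens one PR until the slugs run out; with an 8-second cycle and no sleep, nine of the ten cycles are blocked by the rate limit.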
Across the week the arena has grown to twenty-one problems split across six tiers: easy, medium, hard, novel-shape composition, performance-gated (where pytest-timeout makes algorithm choice matter), and extend-mode (where the box reads the prior implementation and adds to it). On twenty box-authored attempts, the box has nineteen CI-green PRs. The pattern in every successful PR is the same: the box reaches for the canonical Python idiom — OrderedDict for LRU, vertical-scan for longest-common-prefix, modified merge-sort for inversion counting, star-recurrence DP for wildcard matching, OSA 2D DP for Damerau-Levenshtein. No retries, no thinking-on, no human in the loop.
The one failure is the most informative datapoint of the entire week. We asked the box to extend an edit-distance function to handle transpositions (Damerau-Levenshtein). The box wrote the canonical 2D DP with the transposition case, which the literature calls Optimal String Alignment (OSA). One of our tests asserted a specific answer on `("ca", "abc")` that intuitively involves "transpose then insert." That assertion is correct under true Damerau-Levenshtein (a more general algorithm that also tracks where each character last appeared) but NOT under OSA (which never edits the same substring twice, so it cannot insert between two characters it has just transposed). The canonical 2D DP, the textbook answer, the one in every algorithms course, is OSA. So the box wrote the standard textbook answer; we wrote a test for a different algorithm. The CI verdict caught our spec error, not the box's code.
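The disagreement is easy to reproduce. The sketch below implements both algorithms from their standard textbook formulations; it is not the box's submitted code or the arena's test, and the function names are illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment: 2D DP plus an adjacent-swap case."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete
                          d[i][j - 1] + 1,          # insert
                          d[i - 1][j - 1] + cost)   # substitute or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[len(a)][len(b)]


def damerau_levenshtein(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein using a last-occurrence table."""
    max_dist = len(a) + len(b)
    # Offset by one: d[i + 1][j + 1] is the distance between a[:i] and b[:j].
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    last_row = {}  # character -> last 1-based row where it appeared in a
    for i in range(1, len(a) + 1):
        last_col = 0  # last 1-based column in this row where b matched a[i - 1]
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                           # substitute or match
                d[i + 1][j] + 1,                          # insert
                d[i][j + 1] + 1,                          # delete
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),  # transpose across a gap
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(osa_distance("ca", "abc"))         # 3
print(damerau_levenshtein("ca", "abc"))  # 2
```

True Damerau-Levenshtein reaches 2 via ca → ac → abc, while OSA needs 3 because it cannot insert `b` between the two characters it just swapped. On a plain adjacent swap such as `("ab", "ba")` both return 1, which is why the two are easy to conflate when writing a spec.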
That datapoint matters more than another nineteen passes would have. The bench has reached the level where the bench writer can be the first failure point. To find the box's actual ceiling on this kind of work, the bench would have to be written by someone with deeper canonical-algorithm knowledge than the box has — and the box has the entire literature in pre-training. The more productive next steps pivot away from "construct a harder canonical problem" toward (a) multi-file integration tasks (real engineering, not isolated functions), (b) tasks that require reading a non-canonical existing codebase, or (c) opening the box up as a product surface where users-other-than-the-owner submit work.
The observer side is shipped as a standalone MCP tool with zero codebox dependencies. Any MCP-capable client can install it and ask "how is the box doing on its arena?" or "what is the merge rate on llama.cpp?" without touching anything else in the stack.
what we are not claiming
- That any one target is the right one. Pick yours; the algebra tells you which knobs matter.
- That all four targets can be maximized simultaneously. They share primitives but trade against each other (efficiency vs latency, for instance, can pull in opposite directions on the same knob).
- That the twenty-one current problems on the arena are representative of real-world coding work. They are isolated-function exercises by design; the value is closing the loop, not measuring capability. Widening the task fan toward multi-file and non-canonical work is the next milestone, not this one.
- That the active-flavor PR rate is a stable metric yet. Twenty box-authored attempts is a small sample; we have not yet run long enough to bound the box's pass@1 against its own arena to anything tight. The infrastructure is the deliverable today; the bench follows.
where this came from
Two prior notes set up the primitives. Note 01 established phi as a halt signal — the moment the model commits, you stop. Note 02 turned the same signal into a self-rating without a second model. This note steps back: what optimization are you actually buying when you use those signals?
The answer is parametric. Halt + rate-collapse buys bandwidth and latency on cheap problems. Halt + score buys efficiency on best-of-N. The same control plane serves all four targets; the algebra picks which one fires.
TFlow (BWR-hhh) sits in the same algebra: their weight-space communication trades C (token cost) for B (information density) at fixed P. Phi-as-acceptance-test on the patched receiver is the validation. One control plane, growing list of data planes, now one algebra over both.