the substrate verified itself
a methodology note: when the substrate's output IS the corpus, using the substrate is verifying it. one of the rows we verified is a bug the same session caused and fixed.
the takeaway in one paragraph
We had inert verification harnesses for a research claim: that the box's dispatch rules are correct around an opaque LLM in the middle. The claim needed empirical evidence: a sample of (request, response) pairs from a real LLM, run through the chain and checked against the contract. We didn't have a separate corpus. Then we noticed that the production code's own log writes one row per LLM call, with finish_reason captured. The session we were running, the very session that built the harnesses, was generating the verification corpus. We pointed the harnesses at that log. Of the 42 rows logged, 38 had finish_reason populated, and every one of those 38 passed both R-contract and chain-diff. One of the 38 was a bug from this same session that we'd already found and fixed. The substrate's own logs verified the substrate's own dispatch correctness.
the setup
Note 13 names four shapes of composition in our agent's tech library: parallel, matmul (sequential with adapter), gated, and bookend (the LLM-in-the-middle case). The first three close by construction — every joint input checks against the running Python reference. The fourth can't, because the construction frame doesn't describe what the LLM does. The proof has to be different: prove the bookends individually, prove the type signature of the chain, and verify that the LLM (the relation R between request and response) respects its documented contract.
We shipped two harnesses for the bookend case. The chain-diff harness takes (request, recorded-R-response) pairs and verifies that the chain's joint decision matches what running each bookend in Python around the recorded R produces. The R-contract sampler bins finish_reason across captured responses and checks each is in the documented set. Both default to a small in-source fixture so they self-validate; both accept an env-var pointer to a real fixture file.
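The R-contract sampler described above can be sketched roughly as follows. This is a minimal illustration, not the shipped harness: the contents of the documented finish_reason set, the fixture row shape, and the `R_CONTRACT_FIXTURE` variable name are all assumptions made for the example.

```python
import json
import os
from collections import Counter

# Assumption: the documented finish_reason set. The note does not list it;
# these two values are placeholders for illustration.
DOCUMENTED_FINISH_REASONS = frozenset({"stop", "length"})

# Small in-source fixture so the check self-validates with no external file.
DEFAULT_FIXTURE = [
    {"response": {"finish_reason": "stop"}},
    {"response": {"finish_reason": "length"}},
]


def load_fixture(env_var="R_CONTRACT_FIXTURE"):
    """Return rows from the JSONL file named by env_var, else the default fixture."""
    path = os.environ.get(env_var)  # hypothetical variable name
    if not path:
        return list(DEFAULT_FIXTURE)
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def check_r_contract(rows, allowed=DOCUMENTED_FINISH_REASONS):
    """Bin finish_reason across rows and collect values outside the allowed set."""
    bins = Counter(row["response"].get("finish_reason") for row in rows)
    violations = {reason: n for reason, n in bins.items() if reason not in allowed}
    return bins, violations
```

A run passes when `violations` is empty; the `bins` counter is the per-reason tally of the kind reported in the results table.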
What we needed next was the real fixture. Capturing it required running phi-proxy in production against representative work, then dumping (request, response) tuples in the harness's expected JSONL shape. That sounded like a separate step.
the realization
codebox-solve, the verb the agent uses for every code-generation call, is itself a wrapper around phi-proxy. Each codebox-solve invocation writes one row to logs/solve_history.jsonl with the model name, the prompt-tokens count, the completion-tokens count, the finish_reason, and a snippet preview. The finish_reason field is the same R-contract output the path-A harnesses want to read.
Across this evening's session — which built the harnesses, ran a long sequence of bench iters, and went through a substrate-state regression that needed diagnosis — the log accumulated 42 rows. 38 of them have finish_reason populated. 37 are stop, 1 is length. That's a real R-sample from real work, not a synthetic distribution.
We wrote a small extractor that converts solve_history.jsonl rows into the harness's expected fixture shape — synthesizing a minimal request from (model, max_tokens) and a minimal response from (finish_reason, usage). Pointed the harnesses at the extracted file. Both ran in milliseconds.
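A sketch of what such an extractor might look like is below. The field names (`model`, `max_tokens`, `prompt_tokens`, `completion_tokens`, `finish_reason`) and the nested request/response layout are assumptions for illustration; the real log and fixture schemas may differ.

```python
import json


def row_to_fixture(row):
    """Map one solve_history row to a (request, response) fixture entry.

    Returns None when finish_reason is missing, since such a row carries
    no R-contract output to check.
    """
    finish_reason = row.get("finish_reason")
    if finish_reason is None:
        return None
    return {
        # Minimal synthesized request: only model and output budget.
        "request": {
            "model": row.get("model"),
            "max_tokens": row.get("max_tokens"),
        },
        # Minimal synthesized response: finish_reason plus token usage.
        "response": {
            "finish_reason": finish_reason,
            "usage": {
                "prompt_tokens": row.get("prompt_tokens"),
                "completion_tokens": row.get("completion_tokens"),
            },
        },
    }


def extract(log_lines):
    """Convert JSONL log lines into fixture entries, skipping unusable rows."""
    entries = []
    for line in log_lines:
        if not line.strip():
            continue
        entry = row_to_fixture(json.loads(line))
        if entry is not None:
            entries.append(entry)
    return entries
```

Dropping rows without finish_reason is what would turn a 42-row log into a 38-row fixture in a setup like this one.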
the result
| harness | fixture | n | result |
|---|---|---|---|
| path_a_chain_diff | live capture from solve_history.jsonl | 38 | 38/38 expected joint decisions match · type-check verified |
| r_contract_verify | same 38 rows | 38 | 38/38 R-contract pass · 37 stop · 1 length |
The composition algebra's fourth row — bookend for LLM-in-the-middle pairs — is now empirically measured. All four proof shapes have empirical backing. The summary in one sentence: parallel + matmul + gated by construction across 17 pair-compositions and 170,000 fuzz samples, bookend by construction for the individual techs plus 38 live captures all satisfying contract.
the detail worth marking
The single `length` capture in the 38-row sample is the iter77 dijkstra-extend-mode incident from earlier in the same evening. The box was asked to extend an existing BFS implementation into a Dijkstra implementation. The extend-mode prompt was longer than usual because it included the existing code, and the model's response ran over the configured 512-token output budget and was truncated. The production stack auto-escalated and still ran over at 4096, and the codebox-solve receipt logged finish_reason=length. Two iters later, we shipped a fix (raising the codebox-solve default to 2048) and retried the same problem; the retry passed cleanly, and its row is also among the 38.
When the path-A harnesses ran against this 38-row corpus, they verified that for the bug row (a response with finish_reason=length under the FA-LE chain) the joint decision is retry, which is exactly what happened in production. The session caused the bug, the session fixed it, and the session verified that the dispatch logic still treated the bug row correctly. We didn't have to mock that; we just ran on the log.
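The chain-diff check on such a row can be sketched as a comparison between a reference decision and whatever the chain under test returns. Only the length-to-retry mapping comes from this note; the `accept` decision for other finish reasons and the `chain_decide` callable are hypothetical stand-ins.

```python
def reference_decision(response):
    """Reference: the decision the bookends predict for a recorded response.

    Only length -> retry is taken from the note; "accept" for every other
    finish_reason is an assumption made for this sketch.
    """
    return "retry" if response.get("finish_reason") == "length" else "accept"


def chain_diff(pairs, chain_decide):
    """Return (index, expected, actual) for each pair where the chain disagrees."""
    mismatches = []
    for i, (request, response) in enumerate(pairs):
        expected = reference_decision(response)
        actual = chain_decide(request, response)  # chain under test
        if actual != expected:
            mismatches.append((i, expected, actual))
    return mismatches
```

An empty mismatch list corresponds to an n/n row in the results table. A chain that accepted a truncated response would surface as a single `(index, "retry", "accept")` entry.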
the general principle
When the substrate's output is the same shape as the substrate's verifier-input, using the substrate is verifying it. Note 03 framed this at the metric level: pass@1 on a live external corpus is the structural escape from fixed-N benches. Note 15 closes one level deeper: the corpus doesn't have to be external. It can be the substrate's own log, as long as the verifier's input shape matches. The act of running work through the box produces the rows the box's verifier needs to read.
This is recursive, but not vacuous. The verifier isn't checking that the substrate did its work correctly in some self-referential sense; it's checking that the dispatch logic around the opaque LLM-call is correct given whatever the LLM produced. Those are separable concerns. The dispatch logic is constructible; the LLM is empirical. The harness sits between them, on rows the substrate happens to produce as a side effect of normal operation.
For buyers: this means the dispatch correctness claim doesn't depend on a special verification run. Every routine codebox-solve call writes one row. At roughly 30 to 50 rows per session, N sessions give the verifier on the order of 30N to 50N rows. The empirical contract sample only grows. Hardware drift, model swaps, and prompt-shape changes all show up immediately as either contract violations (R behaves outside the documented set) or chain-diff mismatches (the dispatch logic produced a different decision than the bookends predict). The verifier is alive as long as the substrate is.