Intuition Labs · φ research · 24 · measured · 19 may 2026

the gate that needed a gate

We built a verification gate that scores any (prompt, asserts) pair on two axes — does the substrate give the same output, and does it give the right output. The four quadrants map directly to four training prescriptions. Then stress-testing the gate revealed the gate is itself a load test: running it on a degraded substrate produces false-LOW verdicts on trivially-correct prompts. The fix is recursive: the verification gate needs a substrate-health precondition. The meta-lesson generalizes: any measurement of substrate behavior must control for substrate state.

convergence × correctness · 4 quadrants · the verification gate
verdict = quadrant(convergence, correctness) · valid iff substrate_state == healthy  ·  The KPI is the quadrant; the precondition is the substrate is unstressed when measured

the takeaway in one paragraph

Any new problem or domain we want our system to handle reliably gets graded on two axes: how often does the substrate produce the same output for the same input (convergence), and how often is that output correct (correctness). At a high-confidence threshold (80% on both axes), these two yes/no signals give four quadrants. Each quadrant tells you exactly what to do next — ship as-is, retrain, accept the diversity, or redesign the prompt. The grader itself is a small primitive that runs through our existing solve pipeline. But the grader is also a load test on the substrate: running fifteen trials per mode can degrade the very thing being measured. The fix is a substrate-state precondition that refuses to run the grader when the substrate is already slow. We added it. The meta-lesson — substrate measurements must control for substrate state — is the bigger artifact than the grader itself.

the four-quadrant grid

convergencecorrectnessquadranttraining prescription
HIGH (≥80%)HIGH (≥80%)substrate solved itno training · ship known-mode
HIGH (≥80%)LOWconfidently wrong (dangerous)HIGHEST PRIORITY training target
LOWHIGH (≥80%)bounded valid attractorsno training · accept the diversity
LOWLOWconfused domaintraining + prompt redesign

Three of the four quadrants tell you what to do directly. The fourth — HIGH convergence with LOW correctness — is the one to fear. The substrate confidently produces the same wrong answer repeatedly. From outside it looks reliable; from inside it's worst-case. This is the failure mode that motivates careful verification gates in the first place.

what each quadrant means operationally

HIGH/HIGH: the substrate has internalized this operation. Same input → same correct output, every time. You can demand zero variance per call and the substrate complies. Build downstream systems that depend on this guarantee.

HIGH/LOW: the substrate has a wrong load-bearing prior on this prompt class. The model is biased toward an incorrect attractor and goes there reliably. This is where targeted training pays off most: you have a concrete failure mode that's stable enough to debug. Detection of this quadrant is the hardest part — pass-rate must be measured against ground truth, not against itself.

LOW/HIGH: the substrate has multiple valid attractors for this prompt class. A code-generation prompt where iterative, recursive, and memoized implementations are all correct will live here. Trying to train this quadrant down to a single attractor would COLLAPSE the model's solution diversity for that domain. The right response is to embrace the variance and route this prompt class through the exploratory operating mode.

LOW/LOW: the substrate is uncertain AND wrong. Could be a domain coverage gap, could be prompt ambiguity, could be both. Both training and prompt redesign help. If correctness improves while convergence stays low, you've found a multi-solution domain; if convergence improves while correctness stays low, you've trained the model into a single wrong attractor (the worst possible outcome of training). Watch the trajectory, not just the endpoint.

the worked example

We ran the gate on a trivially-deterministic prompt (write a Python add function with three strict-equality asserts) at fifteen trials per mode on a freshly-restarted substrate. The known-mode cell came back HIGH/HIGH: one distinct response across all fifteen trials, all asserts passed, sub-second per trial. The substrate confidently and correctly produced the same canonical implementation every time. Ship it.

We then ran the same prompt on a substrate that had just absorbed a heavy stress-test run. Same fifteen trials, same prompt, same asserts. The verdict came back LOW/LOW: three distinct responses, only 60% pass rate, 71 seconds per trial. On a TRIVIALLY-CORRECT prompt.

That's not the prompt's fault. That's not the model's fault. That's the substrate being in a degraded state from prior load. The grader was telling the truth about what it observed; what it observed was a sick stack.

the gate that needed a gate

The grader is itself a load test. Running N=15 trials in sequence accumulates substrate state. The act of measuring contaminates what's being measured. Without a precondition check, the grader can produce false-LOW verdicts on prompts the substrate could easily handle when fresh.

The fix is a substrate-state precondition. Before the grader runs its trial loop, it issues a single trivial round-trip (the pulse check from our earlier work) and times it. If the wall is over five seconds, refuse to start — print restart guidance, exit. Between two and five seconds, warn but proceed. Under two seconds, run silently. The pattern is recursive: a verification gate that itself sits behind a substrate-health gate.

This is one short helper plus one new flag. Five additional lines of operational code on top of the grader. The substrate is unstressed when measured; the measurements are reliable.

what we learned

lessonwhy it matters
Convergence alone is not the KPI; convergence × correctness isHigh convergence on a wrong answer is worst-case — looks reliable, isn't. You need both axes to detect it.
The dangerous quadrant is the highest-priority training targetTraining to fix confidently-wrong outputs has the biggest leverage. Most other quadrants don't need training.
Substrate measurements must control for substrate stateThe act of measuring degrades the thing measured. Without preconditions, the measurement loses meaning. Generalizes to every substrate-health KPI.
The grader's precondition IS another substrate-health gateWe have the pulse check; we have the substrate-staleness check; the grader joins them. The pattern compounds.

where this generalizes

links