two decisions, same softmax

Desai, Tej

Intuition Labs · φ research · 02 · mixed · 15 may 2026

yesterday φ halted the stream. today the same signal rates it.

−50% scoring compute

AUC(φ) over answer-phase · band ∈ { high, medium, low, none } · self-rating

the setup

Day 1 introduced phi as an inside-the-loop sensor that stops filler tokens once the answer has landed. The story was: read the model's confidence; act on it.

This week we ran 154 prompts through our codebox stack and found a 98.7% pass-at-1 ceiling at thirty tokens per second. Of the failures, exactly zero hit the length cap — every miss returned at finish=stop with a wrong answer, fully formed and confident-looking.

Our existing escalator (retry on a higher tier when the model truncates) catches the truncation tail. It does not catch this tail. The model committed; it just committed to the wrong thing.

We need a second decision from the same signal.

the signal

Phi rises when the model commits and dips when it hedges. Average it over the answer phase of a single response and you get the length-normalized area under the curve: a one-number rating the model gave its own output.

High area: the model stayed confident throughout. Low area: it wavered. The signal is computed during normal decoding; no second forward pass.
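
Written out, the rating is a plain mean over the answer phase. The symbols below are introduced here for illustration and are not the note's own notation: φ_t is assumed to be the per-token confidence at answer-phase position t, taken as a value in [0, 1], and n is the number of answer-phase tokens.

```latex
\[
\mathrm{AUC}(\varphi) \;=\; \frac{1}{n}\sum_{t=1}^{n} \varphi_t ,
\qquad \varphi_t \in [0, 1], \quad n \ge 1
\]
```

Because this is a mean rather than a sum, a 20-token answer and a 200-token answer land on the same scale. For n = 0 the quantity is undefined, which is the case the "none" band covers.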

In phi-controlled inference, this is the second decision phi makes — after halting. Same softmax. Different action.

the projection

Today: 1× forward pass through the policy + 1× forward pass through the grader = 2× compute per scored output.

With phi-AUC: 1× forward pass through the policy. The rating is read from the same softmax that produced the answer.

approach	forward passes	extra model	scoring compute (relative)
today (grader / verifier / reward)	2×	yes, ≈ policy size	100%
with phi-AUC	1×	no	50%

−50% scoring compute. The arithmetic is exact when a grader pass costs about as much as a policy pass; a smaller grader shrinks the saving. The open question is whether the rating tracks downstream quality on your task. We expect it to on structured outputs (code, JSON, factual answers) and are unsure about creative writing.
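
The 50% figure depends on the grader costing about as much per pass as the policy. A minimal sketch of that arithmetic follows; the function name and the `grader_to_policy_ratio` parameter are illustrative assumptions, and per-pass cost is treated as a single number.

```python
def relative_scoring_compute(grader_to_policy_ratio: float) -> float:
    """Compute of policy-only scoring relative to policy + grader scoring.

    grader_to_policy_ratio is an assumed quantity: the cost of one grader
    pass divided by the cost of one policy pass (1.0 = grader ~ policy size).
    """
    if grader_to_policy_ratio < 0:
        raise ValueError("ratio must be non-negative")
    baseline = 1.0 + grader_to_policy_ratio  # policy pass + grader pass
    with_phi_auc = 1.0                       # policy pass only
    return with_phi_auc / baseline


if __name__ == "__main__":
    for ratio in (1.0, 0.5, 0.25):
        saving = 1.0 - relative_scoring_compute(ratio)
        print(f"grader at {ratio:.2f}x policy cost: {saving:.0%} less scoring compute")
```

With a ratio of 1.0 this reproduces the table's 50%. If your grader is much smaller than the policy, the saving is correspondingly smaller.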

how it works

Stream-level, no model changes:

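history = []
for token in answer_phase:                # answer-phase tokens only
    history.append(confidence_at(token))  # placeholder for however the stream exposes confidence

if not history:                           # check first: the mean of nothing is undefined
    auc, quality = None, "none"
else:
    auc = sum(history) / len(history)     # mean confidence over the answer phase
    if auc >= 0.95:
        quality = "high"
    elif auc >= 0.85:
        quality = "medium"
    else:
        # the bands list low as auc >= 0.50 and leave auc < 0.50 unassigned;
        # folding it into low is an assumption
        quality = "low"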

The band gives you a four-level rating with no calibration. The raw AUC gives you a continuous score for ranking — best-of-N selection becomes max(candidates, key=auc). For correctness-suspect escalation, route quality in {low, none} to a higher tier, the same router shape we use today on finish=length.
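
Banding, best-of-N selection, and escalation can be sketched together as below. The thresholds come from the note. The `Candidate` class and the function names are hypothetical, and treating AUC below 0.50 as "low" is an assumption, since the note does not assign that range a band.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional, Sequence


@dataclass
class Candidate:
    """Hypothetical container: one response plus its answer-phase confidences."""
    text: str
    answer_phi: Sequence[float]

    @property
    def auc(self) -> Optional[float]:
        # Mean confidence over the answer phase; None when there is no answer phase.
        return fmean(self.answer_phi) if self.answer_phi else None


def band(auc: Optional[float]) -> str:
    """Map an AUC to a band using the note's 0.95 / 0.85 thresholds."""
    if auc is None:
        return "none"
    if auc >= 0.95:
        return "high"
    if auc >= 0.85:
        return "medium"
    # Assumption: the note leaves auc < 0.50 unassigned; it is folded into "low".
    return "low"


def best_of_n(candidates: Sequence[Candidate]) -> Candidate:
    # Candidates with no answer phase rank below every rated candidate.
    return max(candidates, key=lambda c: -1.0 if c.auc is None else c.auc)


def should_escalate(candidate: Candidate) -> bool:
    # Same router shape as finish=length: low or unrated outputs go up a tier.
    return band(candidate.auc) in {"low", "none"}
```

A caller would build one `Candidate` per sampled response, take `best_of_n(...)`, and retry on a higher tier when `should_escalate(...)` is true for the winner.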

reproduction

Every phi-proxy stream emits phi_summary on its final SSE chunk:

{
  "phi_summary": {
    "auc": 0.94,
    "quality": "medium",
    "n_answer": 190,
    "n_reasoning": 312,
    "min_answer_phi": 0.81,
    "max_answer_phi": 0.99
  }
}

Drop this into your existing scoring pipeline. For best-of-N, replace the grader call with max(candidates, key=lambda c: c.phi_auc).
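
A minimal sketch of pulling phi_summary off a stream, using only the standard library. The framing is assumed, not confirmed by this note: each event is taken to be a single `data: <json>` line, and a `data: [DONE]` sentinel may follow the final chunk. `read_phi_summary` is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional


def read_phi_summary(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the last phi_summary seen in a stream of SSE lines, or None."""
    summary = None
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue  # assumed end-of-stream sentinel, not a JSON chunk
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # tolerate payloads that are not JSON
        if isinstance(chunk, dict) and "phi_summary" in chunk:
            summary = chunk["phi_summary"]
    return summary
```

The helper keeps the last summary it sees rather than stopping at the first, so it still works if the summary chunk is followed by a sentinel line.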

where it doesn't help

Confidently wrong failure modes: an answer where phi stayed high throughout rates well by construction, so the model needs an outside check, not its own rating.
Creative writing: high confidence often means cliché, not quality.
Best-of-N when all candidates score similarly: needs a tiebreaker (AUC variance, or fall back to a small grader).
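
The near-tie case in the last item can be sketched as a guard around the AUC ranking. Everything named here is illustrative: `epsilon` is an arbitrary margin, and `fallback` stands in for whatever tiebreaker you use (a small grader, AUC variance, or anything else).

```python
from typing import Callable, Sequence, Tuple

Scored = Tuple[str, float]  # (candidate text, phi-AUC)


def pick_with_tiebreak(
    scored: Sequence[Scored],
    fallback: Callable[[Sequence[str]], str],
    epsilon: float = 0.01,
) -> str:
    """Pick by AUC unless the top two are within epsilon, then defer to fallback."""
    if not scored:
        raise ValueError("no candidates")
    ranked = sorted(scored, key=lambda s: s[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] > epsilon:
        return ranked[0][0]  # clear winner: no extra model call needed
    # Near-tie: hand every candidate within epsilon of the best to the fallback.
    tied = [text for text, auc in ranked if ranked[0][1] - auc <= epsilon]
    return fallback(tied)
```

Only the near-tie path pays for the fallback, so the extra cost scales with how often candidates cluster rather than with every scored output.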

what we are not claiming

We are not claiming phi-AUC will deliver the same lift the length-escalator does. They catch different failure modes — length catches truncations, phi-AUC catches hedges.

We also want to set expectations on the length-escalator's own number. When we re-measured it this week on a less curated set (9 escalations across 60 trials), the recovery rate was 67%, the average retry took about 5× the original wall-clock time, and net pass-at-1 was *lower* than running with no escalator at all. The +29pp lift from the original bench held for a recovery-positive selection of prompts, not for the broader set. The phi-AUC recovery rate is the next bench, not this post.

also in the air this month

Day 1 connected phi to TFlow's weight-space communication — phi as the trigger for firing a patch and the acceptance test on the patch. Today's mechanism is exactly that acceptance test: post-patch AUC compared to pre-patch AUC tells you whether the patch helped.
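
As a rough sketch of that acceptance test, assuming you have phi-AUC for the same prompts before and after a patch: `patch_helped` and `min_gain` are hypothetical names, and comparing the mean paired gain to a margin is one simple choice, not a description of how TFlow does it.

```python
from statistics import fmean
from typing import Sequence


def patch_helped(
    pre_auc: Sequence[float],
    post_auc: Sequence[float],
    min_gain: float = 0.0,
) -> bool:
    """Accept a patch when mean phi-AUC on the same prompts rises by more than min_gain.

    Assumes pre_auc[i] and post_auc[i] were measured on the same prompt.
    """
    if len(pre_auc) != len(post_auc) or not pre_auc:
        raise ValueError("need paired, non-empty AUC lists")
    gains = [post - pre for pre, post in zip(pre_auc, post_auc)]
    return fmean(gains) > min_gain
```

Pairing by prompt keeps prompt difficulty out of the comparison; a nonzero `min_gain` guards against accepting a patch on noise alone.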

Same signal, three uses so far: halt (note 01), score (note 02), validate weight-space communication (note 01 + TFlow). One control plane, growing list of data planes.

Phi is one move inside a larger algebra. The platform that schedules these moves — and the lever set they compose into — lives in the algebra note and ships under codebox. The signal is the smallest piece; the product is what does the routing.