stop once the answer lands

Desai, Tej

Intuition Labs · φ research · 01 · measured · 14 may 2026

the model keeps talking after it has answered. read its confidence — and stop.

2.6× faster

halt = answer-phase ∧ φ ≥ 0.85 ∧ |Δφ| ≤ 0.005 · halt condition

the problem

Most language models keep talking after they've actually answered. We measured ten seconds of wall-time spent on filler — rephrasing, examples, polite codas — after a coding answer landed at second six. The user has to read past it. The GPU has to generate it. The bill has to pay for it.

the signal

Phi (φ) is one number read off a distribution every language model already computes: how confident the model is in its next-token prediction, on a 0–1 scale. It rises as the model commits to an answer.

When phi has plateaued at a high value, and the model is past its thinking trace, the next tokens carry essentially no new information. The model has, in effect, already answered.
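To make the signal concrete, here is a minimal sketch of one way to turn a step's logits into a 0–1 confidence. It assumes phi can be approximated by the top-token softmax probability; the exact formula is not given here, and the function name `top_token_confidence` and the example logits are hypothetical.

```python
import math


def top_token_confidence(logits):
    """Return the largest softmax probability for one decoding step.

    Assumption: phi is approximated by the top-token probability.
    The exact definition of phi is not specified in this note.
    """
    peak = max(logits)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    return max(exps) / sum(exps)


# Hypothetical logits: an undecided step and a committed step.
undecided = top_token_confidence([1.0, 1.0, 1.0, 1.0])   # uniform over 4 tokens
committed = top_token_confidence([10.0, 0.0, 0.0, 0.0])  # one token dominates
```

When the server streams log-probabilities instead of logits, the same quantity would be `math.exp(logprob)` of the sampled or top token.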

the measurement

Open-source 35-billion-parameter MoE model (Qwen3.6-35B-A3B). AMD Strix Halo workstation (Radeon 8060S, 128 GB unified memory). One coding prompt: reverse a string.

condition	seconds	words	finish reason
today (cut at cap)	16.5	512	hit the cap
with phi	6.3	190	answer landed

From 16.5 s to 6.3 s: 2.6× faster, with 63% fewer words (512 down to 190). Five of five correct on a coding suite.

how it works

The halt fires when three conditions hold simultaneously: the model is past its thinking phase; phi has crossed a high-water mark (φ ≥ 0.85); and the slope of phi over the last four tokens is near zero (|Δφ| ≤ 0.005).

for each generated token:
    confidence = read from the model's own softmax
    if past_thinking_phase
       and confidence >= 0.85
       and slope_is_flat:
        finish_reason = "answer landed"
        close stream

The phase gate prevents the halt from firing inside the model's thinking trace (where confidence is naturally low and oscillating). The slope check prevents firing on a transient spike. No model fine-tuning. No new training data. Just a stream-level read of a signal the model already computes.
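The three conditions can be sketched as a small streaming detector. This is an illustration, not the proxy's actual code: the class name `HaltDetector`, the helper `first_halt`, and the example trace are hypothetical, and "flat slope" is assumed to mean that no token-to-token change inside the four-token window exceeds 0.005.

```python
from collections import deque


class HaltDetector:
    """Streaming check: answer phase, phi >= threshold, flat recent slope."""

    def __init__(self, threshold=0.85, flat_tol=0.005, window=4):
        self.threshold = threshold
        self.flat_tol = flat_tol
        self.recent = deque(maxlen=window)

    def update(self, phi, past_thinking_phase):
        """Feed one token's confidence; return True when the halt should fire."""
        if not past_thinking_phase:
            # Phase gate: drop thinking-trace values so they never
            # count toward the slope once the answer phase begins.
            self.recent.clear()
            return False
        self.recent.append(phi)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough answer-phase tokens yet
        values = list(self.recent)
        # Assumption: slope is flat when every step in the window is small.
        largest_step = max(abs(b - a) for a, b in zip(values, values[1:]))
        return phi >= self.threshold and largest_step <= self.flat_tol


def first_halt(trace):
    """Return the index of the first token that triggers the halt, or None."""
    detector = HaltDetector()
    for index, (phi, past_thinking) in enumerate(trace):
        if detector.update(phi, past_thinking):
            return index
    return None


# Hypothetical trace of (phi, past_thinking_phase) pairs.
example_trace = [
    (0.40, False), (0.62, False),              # thinking trace: gated out
    (0.70, True), (0.88, True),                # answer phase, still climbing
    (0.930, True), (0.932, True), (0.933, True), (0.934, True),  # plateau
]
halted_at = first_halt(example_trace)  # fires on the last token (index 7)

# A single spike after low values does not fire: the window is not flat.
spike_trace = [(0.30, True), (0.30, True), (0.30, True), (0.95, True)]
spike_halt = first_halt(spike_trace)   # None
```

Clearing the window whenever the phase gate is closed is a design choice in this sketch: it keeps low, oscillating thinking-phase values from leaking into the first answer-phase slope estimate.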

reproduction

The halt logic lives in a stream-rewriting proxy in front of llama-server. Toggle it on and off via an environment variable:

# with phi
PHI_PROXY_HALT_PHI=0.85 mind-phi-proxy-up
python strix-mind/bench/phi_bench.py

# baseline
PHI_PROXY_HALT_PHI=0 mind-phi-proxy-up
python strix-mind/bench/phi_bench.py
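For readers wiring up a similar toggle, a minimal sketch of how a proxy might interpret the variable follows. Only the name `PHI_PROXY_HALT_PHI` and the values 0.85 and 0 come from the commands above; the function `halt_threshold_from_env`, the treatment of an unset variable as disabled, and the range check are assumptions.

```python
import os


def halt_threshold_from_env(env=None):
    """Return the phi threshold, or None when the halt is disabled.

    Assumptions: 0 or an unset variable means baseline (never halt),
    and valid thresholds lie in [0, 1].
    """
    env = os.environ if env is None else env
    value = float(env.get("PHI_PROXY_HALT_PHI", "0"))
    if not 0.0 <= value <= 1.0:
        raise ValueError("PHI_PROXY_HALT_PHI must be between 0 and 1")
    return value if value > 0.0 else None


with_phi = halt_threshold_from_env({"PHI_PROXY_HALT_PHI": "0.85"})  # 0.85
baseline = halt_threshold_from_env({"PHI_PROXY_HALT_PHI": "0"})     # None
```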

where it doesn't help

Open-ended creative writing — the user wants the full response.
Very short answers — the model already emits EOS naturally before phi plateaus.
Tool-using agent flows where logprobs are not on the stream.

also in the air this month

TFlow (BWR-hhh) cuts inter-agent token traffic by 83% and gives a 4.6× wall-clock speedup on GSM8K — by sending hidden-state trajectories between Qwen3-4B agents through transient LoRA patches instead of natural language.

Different surface, same lever. Early-halt skips filler at the end of one model’s answer by reading its own confidence. TFlow skips token-encoding between two models by reading hidden state directly. Both bets: tokens are a lossy bottleneck.

A natural next step combines them. Phi as the control plane for TFlow’s data plane: fire a weight-space patch only when the receiver’s confidence drops; accept the patch only when post-patch phi rises; broadcast only when the sender’s own confidence has crystallized.