stop once the answer lands
the model keeps talking after it has answered. read its confidence — and stop.
the problem
Most language models keep talking after they've actually answered. In our measurement, about ten seconds of wall-clock time went to filler (rephrasing, examples, polite codas) after a coding answer had landed at second six. The user has to read past it. The GPU has to generate it. The bill has to pay for it.
the signal
Phi is one number every language model already computes: how confident the model is in its next-token distribution, on a 0 – 1 scale. It rises as the model commits to an answer.
When phi has plateaued at a high value, and the model is past its thinking trace, the next tokens carry essentially no new information. The model has, in effect, already answered.
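The post does not spell out how phi is derived from the next-token distribution. One plausible reading, assumed here purely for illustration, is the probability of the most likely next token, recovered from the log-probabilities a server can attach to each streamed token. The function name `token_confidence` is hypothetical, and the real definition may differ (an entropy-based score, for example).

```python
import math


def token_confidence(top_logprobs: list[float]) -> float:
    """Return the largest next-token probability, a value between 0 and 1.

    Assumption: phi is approximated by the top-1 probability. This is an
    illustrative stand-in, not the definition used in the post.
    """
    if not top_logprobs:
        raise ValueError("need at least one log-probability")
    # Log-probabilities are <= 0, so exp() maps them back to probabilities.
    return math.exp(max(top_logprobs))


# A peaked distribution scores high; a flatter one scores lower.
peaked = [math.log(0.97), math.log(0.02), math.log(0.01)]
flat = [math.log(0.40), math.log(0.35), math.log(0.25)]
print(round(token_confidence(peaked), 2))
print(round(token_confidence(flat), 2))
```

Under this reading, a score near 1 means the model sees essentially one candidate for the next token, which matches the "commits to an answer" behaviour described above.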
the measurement
Open-source 35-billion-parameter MoE model (Qwen3.6-35B-A3B). AMD Strix Halo workstation (Radeon 8060S, 128 GB unified memory). One coding prompt: reverse a string.
| condition | seconds | words | finish reason |
|---|---|---|---|
| today (cut at cap) | 16.5 | 512 | hit the cap |
| with phi | 6.3 | 190 | answer landed |
2.6× faster. Five of five correct on a coding suite.
how it works
The halt fires when three conditions hold simultaneously: the model is past its thinking phase; phi has crossed a high-water mark; the slope of phi over the last four tokens is near zero.

    for each generated token:
        confidence = read from the model's own softmax
        if past_thinking_phase
           and confidence >= 0.85
           and slope_is_flat:
            finish_reason = "answer landed"
            close stream

The phase gate prevents the halt from firing inside the model's thinking trace (where confidence is naturally low and oscillating). The slope check prevents firing on a transient spike. No model fine-tuning. No new training data. Just a stream-level read of a signal the model already computes.
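The three conditions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the proxy's actual code: the class name `HaltDetector`, the `</think>` marker used as the phase gate, the least-squares slope, and the `max_slope` tolerance of 0.01 are all hypothetical choices. Only the 0.85 threshold and the four-token window come from the post.

```python
from collections import deque


class HaltDetector:
    """Signals a halt once confidence is high and flat after the thinking phase."""

    def __init__(self, threshold=0.85, window=4, max_slope=0.01,
                 end_of_thinking="</think>"):
        if window < 2:
            raise ValueError("window must be at least 2 to measure a slope")
        self.threshold = threshold              # high-water mark from the post
        self.max_slope = max_slope              # assumed tolerance for "near zero"
        self.end_of_thinking = end_of_thinking  # assumed phase marker
        self.recent = deque(maxlen=window)      # last `window` confidences
        self.past_thinking = False

    def _slope(self):
        # Least-squares slope of confidence against token index.
        n = len(self.recent)
        mean_x = (n - 1) / 2
        mean_y = sum(self.recent) / n
        num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(self.recent))
        den = sum((i - mean_x) ** 2 for i in range(n))
        return num / den

    def update(self, token, confidence):
        """Feed one generated token; return True when the stream should close."""
        if not self.past_thinking:
            # Phase gate: ignore everything until the thinking trace ends.
            if token == self.end_of_thinking:
                self.past_thinking = True
            return False
        self.recent.append(confidence)
        if len(self.recent) < self.recent.maxlen:
            return False  # too few post-thinking tokens to judge the slope
        return (confidence >= self.threshold
                and abs(self._slope()) <= self.max_slope)


# Made-up (token, confidence) pairs, for illustration only.
detector = HaltDetector()
stream = [
    ("<think>", 0.30), ("hmm", 0.95), ("</think>", 0.40),  # thinking trace
    ("s[::-1]", 0.60), (".", 0.78),                        # answer forming
    ("This", 0.90), ("works", 0.91), ("because", 0.91),    # plateau begins
    ("slicing", 0.91),
    ("walks", 0.91), ("backwards", 0.91),                  # never reached
]
emitted = []
finish_reason = "hit the cap"
for token, confidence in stream:
    emitted.append(token)
    if detector.update(token, confidence):
        finish_reason = "answer landed"
        break
print(finish_reason, len(emitted))
```

Keeping the window in a fixed-length `deque` means each token costs a constant amount of work, so a check like this can sit in a streaming path without buffering the response.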
reproduction
The halt logic lives in a stream-rewriting proxy in front of llama-server. Toggle it on and off via an environment variable:

    # with phi
    PHI_PROXY_HALT_PHI=0.85 mind-phi-proxy-up
    python strix-mind/bench/phi_bench.py

    # baseline
    PHI_PROXY_HALT_PHI=0 mind-phi-proxy-up
    python strix-mind/bench/phi_bench.py

where it doesn't help
- Open-ended creative writing — the user wants the full response.
- Very short answers — the model already emits EOS naturally before phi plateaus.
- Tool-using agent flows where logprobs are not on the stream.
also in the air this month
TFlow (BWR-hhh) cuts inter-agent token traffic by 83% and gives a 4.6× wall-clock speedup on GSM8K — by sending hidden-state trajectories between Qwen3-4B agents through transient LoRA patches instead of natural language.
Different surface, same lever. Early-halt skips filler at the end of one model’s answer by reading its own confidence. TFlow skips token-encoding between two models by reading hidden state directly. Both bets: tokens are a lossy bottleneck.
A natural next step combines them. Phi as the control plane for TFlow’s data plane: fire a weight-space patch only when the receiver’s confidence drops; accept the patch only when post-patch phi rises; broadcast only when the sender’s own confidence has crystallized.
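The three gating rules in that proposal can be written down as plain predicates. The sketch below is hypothetical throughout: `PatchGate`, its method names, and the `drop` margin are invented for illustration, "crystallized" is read as "high and flat" by analogy with the halt rule, and nothing here touches TFlow's actual patching mechanism.

```python
from dataclasses import dataclass


@dataclass
class PatchGate:
    """Confidence-based control rules for an inter-agent patch (illustrative only)."""

    drop: float = 0.10          # assumed: fall in receiver phi that triggers a patch
    crystallized: float = 0.85  # assumed: reuses the halt threshold from the post

    def should_fire(self, receiver_before: float, receiver_now: float) -> bool:
        # Fire a patch only when the receiver's confidence has dropped.
        return receiver_before - receiver_now >= self.drop

    def should_accept(self, phi_before_patch: float, phi_after_patch: float) -> bool:
        # Keep the patch only if confidence rose after applying it.
        return phi_after_patch > phi_before_patch

    def should_broadcast(self, sender_phi: float, sender_slope_is_flat: bool) -> bool:
        # Broadcast only when the sender's confidence is high and stable.
        return sender_phi >= self.crystallized and sender_slope_is_flat


gate = PatchGate()
print(gate.should_fire(0.80, 0.60))
print(gate.should_accept(0.60, 0.75))
print(gate.should_broadcast(0.90, True))
```

Separating the three decisions keeps the control plane independent of how a patch is produced or applied, which is the division of labour the paragraph above suggests.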