what the confidence number actually measures

Desai, Tej

Intuition Labs · φ research · 17 · measured · 18 may 2026

we publish a confidence number per response. we audited where that number actually came from. it samples roughly one in five tokens — specifically the hardest ones — and that turns out to be the right thing to measure, just not what the name suggested.

about 81% of emitted tokens were never measured · the 19% that were are the part that matters

confidence ≈ avg φ over hard tokens · drafter_acceptance = 1 − classified / completion · two numbers, both honest, both now first-class fields of every response

the takeaway in one paragraph

Our coding agent publishes a confidence number with every answer. We assumed it was a clean average of how sure the model was at each token. Auditing four hundred and thirty responses, we found the number is actually computed from about nineteen percent of the tokens, and the tokens it covers are specifically the hardest ones: the points where a small drafter model failed to predict and the main model had to think fresh. That's the signal that matters most for halt decisions (uncertainty at the hard points is the give-up signal), but the name was misleading us about what we were measuring. We fixed the framing and surfaced the missing eighty-one percent as a separate number, the drafter acceptance rate, so every downstream consumer can see both halves of the story.

the setup

Our local inference stack sits behind a small proxy. The proxy intercepts every streamed token, computes a confidence number for it from the model's own probability distribution, and attaches a summary to every response. The summary has fields like `n_answer` (count of answer-phase tokens seen) and `phi_auc` (mean confidence across those tokens). Until last week we used the headline `phi_auc` number in three places: an early-stop circuit-breaker that halts the stream when confidence sags, a downstream policy that decides whether to auto-merge a PR, and a research note explaining the whole thing to friends.

A small audit started because of one anomalous response: a 12,210-token generation that finished cleanly but reported confidence of zero throughout. We initially read that as the model giving up at low confidence. When we pulled the actual response, the picture changed. The model had emitted 1,851 high-confidence reasoning tokens and zero answer-phase tokens. The confidence-of-zero metric was reporting on an empty bucket: there were no answer tokens to measure. The model wasn't uncertain; it just never finished thinking out loud. "Low confidence" and "nothing to measure" are two failure modes, not one, and the zero was hiding the difference.
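One way to keep those two cases apart is to refuse to report a mean over an empty bucket. This is a minimal sketch, not the proxy's code: `mean_phi` and `describe` are hypothetical helpers, per-token confidence values are assumed to be plain floats in [0, 1], and the 0.5 threshold is purely illustrative.

```python
from typing import Optional, Sequence


def mean_phi(answer_phis: Sequence[float]) -> Optional[float]:
    """Mean confidence over answer-phase tokens, or None if there were none.

    Returning None (rather than 0.0) keeps "nothing was measured" distinct
    from "the model was measured and was unsure".
    """
    if not answer_phis:
        return None
    return sum(answer_phis) / len(answer_phis)


def describe(answer_phis: Sequence[float], low: float = 0.5) -> str:
    """Label a response. The `low` threshold is an illustrative assumption."""
    phi = mean_phi(answer_phis)
    if phi is None:
        return "no answer tokens measured"
    return "low confidence" if phi < low else "confident"
```

With this shape, a response with zero answer-phase tokens can no longer be mistaken for a response the model was unsure about.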

the audit

We walked the proxy's request log — 430 responses going back to mid-May. For each response we computed the simplest possible metric: of the total completion tokens the model emitted, what fraction did the proxy actually annotate with a confidence number? The expectation was something close to one hundred percent. Tokens are tokens; the proxy sees the stream; the count should match.
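The per-response metric is a single division. A minimal sketch, assuming the two counts are already available per response; the function name and argument names are ours for illustration, not fields from the request log.

```python
def annotated_fraction(n_annotated: int, n_completion: int) -> float:
    """Fraction of completion tokens that carry a confidence annotation."""
    if n_completion <= 0:
        raise ValueError("n_completion must be positive")
    if not 0 <= n_annotated <= n_completion:
        raise ValueError("n_annotated must lie between 0 and n_completion")
    return n_annotated / n_completion


# Example: a short response with 3 annotated tokens out of 79 emitted.
example = annotated_fraction(3, 79)  # roughly 0.038
```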

fraction annotated	count of responses	share
less than 5%	17	4.0%
5–25%	254	59.1%
25–50%	24	5.6%
50–75%	0	0.0%
75–95%	0	0.0%
95–100%	135	31.4%

The shape is unmistakable. About one-third of responses are nearly fully annotated; about two-thirds are barely annotated; the middle is completely empty. There is no "sometimes the proxy gets seventy percent" — it is bimodal. Either almost everything is measured or almost nothing is.
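The table can be rebuilt from the per-response fractions with a small bucketing helper. This is a sketch under one stated assumption: buckets are half-open, [lo, hi), with exactly 1.0 folded into the last bucket. The table itself does not say how edge values were assigned, and the names `BUCKETS`, `bucket_label`, and `shares` are illustrative.

```python
from typing import Dict

# (label, lower edge, upper edge) as fractions of completion tokens.
# Assumed half-open [lo, hi); a fraction of exactly 1.0 goes in the last bucket.
BUCKETS = [
    ("less than 5%", 0.00, 0.05),
    ("5–25%", 0.05, 0.25),
    ("25–50%", 0.25, 0.50),
    ("50–75%", 0.50, 0.75),
    ("75–95%", 0.75, 0.95),
    ("95–100%", 0.95, 1.00),
]

# Counts copied from the table above.
COUNTS: Dict[str, int] = {
    "less than 5%": 17,
    "5–25%": 254,
    "25–50%": 24,
    "50–75%": 0,
    "75–95%": 0,
    "95–100%": 135,
}


def bucket_label(fraction: float) -> str:
    """Map an annotated fraction in [0, 1] to its table row."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    for label, lo, hi in BUCKETS:
        if lo <= fraction < hi:
            return label
    return BUCKETS[-1][0]  # only reached when fraction == 1.0


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage share of each bucket, rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

Running `shares(COUNTS)` reproduces the share column, and the three low buckets together hold 295 of the 430 responses.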

the natural experiment

Sorting by date told the rest of the story. Every response before 16 May falls into the high-annotation bucket. Every response on or after that date falls into the low-annotation bucket. The split is razor-sharp. Not "mostly low after" — literally 135 of 135 high responses are from before, and 295 of 295 low responses are from on/after. The middle buckets stay empty in both periods.
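A check like this is easy to automate. The sketch below assumes each log record reduces to a (date, annotated fraction) pair, treats 0.95 as the boundary of the high bucket (the lower edge of the table's top row), and takes the year from this note's dateline. `split_is_clean` is a hypothetical name.

```python
from datetime import date
from typing import Iterable, Tuple

CUTOVER = date(2026, 5, 16)  # date the split falls on; year from the dateline
HIGH = 0.95                  # lower edge of the table's top bucket


def split_is_clean(records: Iterable[Tuple[date, float]]) -> bool:
    """True if every record before CUTOVER is high and every later one is low."""
    for day, fraction in records:
        is_high = fraction >= HIGH
        if is_high != (day < CUTOVER):
            return False
    return True
```

A single high-annotation response on or after the cutover, or a single low one before it, makes the check return False.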

What happened on 16 May? We turned on a speculative-decoding drafter as the default. A small, fast model proposes the next several tokens; the big model checks them in parallel and accepts the ones it agrees with. Accepted tokens get emitted directly without the big model having to sample them. This was a 12 percent throughput win, which is why it became the default. We didn't realize at the time that speculative decoding, as our stack runs it, has a side effect on logprob reporting: accepted draft tokens arrive in the stream without a top-k probability distribution attached. The big model never sampled them, so there's no distribution to report. The proxy's per-token annotator silently skips them.
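The skip is easy to picture in code. The sketch below is not the proxy's annotator. It assumes a hypothetical event shape (a dict whose `logprob` is None when no distribution was attached) and assumes, for illustration only, that per-token confidence is the chosen token's probability. What it does differently from the behavior described above is count the skipped tokens instead of dropping them silently.

```python
import math
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: {"token": str, "logprob": float or None}.
# "logprob" is None when the token arrived without a distribution attached.
TokenEvent = Dict[str, object]


def annotate(stream: Iterable[TokenEvent]) -> Tuple[List[float], int]:
    """Return (confidence values for annotated tokens, count of skipped tokens)."""
    phis: List[float] = []
    skipped = 0
    for event in stream:
        logprob = event.get("logprob")
        if logprob is None:
            skipped += 1  # nothing to measure; count it rather than lose it
            continue
        # Assumption: confidence is the chosen token's probability.
        phis.append(math.exp(float(logprob)))
    return phis, skipped
```

Keeping the `skipped` count is what makes the later summary fields possible: the annotated values feed the confidence number, and the count feeds the acceptance rate.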

About 81 percent of all tokens we emit are now drafter-accepted. The proxy is measuring confidence on the 19 percent that aren't.

the reframe

We assumed the confidence number was a clean average of how sure the model was at every token. It isn't. It's an average over the tokens where the drafter failed and the big model had to sample fresh. By construction those are the harder tokens — the points in the response where the easy-to-predict patterns break down and the model has to think.

This is actually a stronger signal for what we care about. The whole point of measuring confidence per response is to catch "the model is unsure here" early enough to halt or defer. The drafter handles the obvious stuff cleanly; what's left for the proxy to measure is exactly the stretches where the model could plausibly be unsure. Easy tokens drop out of the average — but easy tokens were not where the give-up signal was going to come from anyway. The metric is biased toward exactly the tokens that matter for halt decisions.

The honest framing isn't "average confidence over the response" — it's "average confidence at the hard points in the response." Different names, different stories, same number. The first claim is wrong; the second is what we are actually measuring.

the fix

We didn't rename the original number. Receipts going back to mid-May still report `phi_auc`, and we kept that name to preserve continuity. But we added two new fields to every response summary: `n_unclassified` (count of drafter-accepted tokens the proxy could not annotate) and `drafter_acceptance` (that count as a fraction of the total completion tokens). Now both halves of the story ride on the API: the hard-tokens confidence and the share of tokens that bypassed measurement.

field	value on a typical fib(n) response	meaning
n_answer	3	answer-phase tokens the proxy annotated
n_unclassified	76	drafter-accepted tokens · invisible to the proxy
drafter_acceptance	0.962	fraction of completion that bypassed measurement
phi_auc	0.984	average confidence at the 3 measured tokens

A consumer reading this response now sees the metric for what it is: 96 percent of the tokens were drafter-accepted (the easy stuff), 4 percent were the hard points where the big model had to sample fresh, and the model was very sure at those hard points. That is more information than "phi_auc = 0.984" was carrying on its own, and it costs nothing — both numbers were computable from data the substrate already had.
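The arithmetic behind the table row can be sketched as a small record type. The field names match the summary; the class itself is illustrative. It assumes the completion count is `n_answer + n_unclassified`, which holds for the example above but would need a third term for responses that also carry annotated reasoning tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponseSummary:
    n_answer: int              # answer-phase tokens the proxy annotated
    n_unclassified: int        # drafter-accepted tokens with no annotation
    phi_auc: Optional[float]   # mean confidence over annotated tokens, if any

    @property
    def drafter_acceptance(self) -> float:
        """Fraction of the completion that bypassed measurement."""
        # Assumption: completion = n_answer + n_unclassified.
        total = self.n_answer + self.n_unclassified
        return self.n_unclassified / total if total else 0.0


fib = ResponseSummary(n_answer=3, n_unclassified=76, phi_auc=0.984)
# fib.drafter_acceptance is 76 / 79, which rounds to 0.962
```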

why we are writing this down

Every coding agent that ships a confidence score is in the same situation we were: at some point the score is going to come from a different distribution than the user assumes. Drafter behavior is one source of bias; sampling temperature is another; partial logprobs from quantized models are a third. The substrate that publishes the number knows the answer. The user reading the number rarely does.

We think the right move is to publish the second number alongside the first. "Confidence 0.98, computed over the 4% of tokens we could measure" is more honest than "confidence 0.98." It costs one extra field and gives the reader enough to know whether to trust the figure. The proxy was already doing the work; we just hadn't been writing the result down.
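Rendering the paired figure takes a few lines. A minimal sketch, with a hypothetical helper name and wording. It takes the measured count and the total completion count, and declines to print a confidence at all when nothing was measured.

```python
def confidence_label(phi_auc: float, n_measured: int, n_total: int) -> str:
    """Format a confidence figure together with the share of tokens behind it."""
    if n_total <= 0 or not 0 <= n_measured <= n_total:
        raise ValueError("need 0 <= n_measured <= n_total and n_total > 0")
    if n_measured == 0:
        return "confidence unavailable, no tokens could be measured"
    share = n_measured / n_total
    return (
        f"confidence {phi_auc:.2f}, "
        f"computed over the {share:.0%} of tokens we could measure"
    )


label = confidence_label(0.984, 3, 79)
```

For the fib(n) example (3 measured tokens out of 79), `label` reads "confidence 0.98, computed over the 4% of tokens we could measure".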

More broadly: the longer you measure a system, the better-defined its metrics become. The headline confidence number we shipped originally was a guess about what an obvious quantity meant; the audit converted that guess into a calibrated story about what the number is actually measuring. Three weeks of letting the loop run produced the data; one afternoon of looking produced the correction. That ratio is approximately the bargain on offer for any continuously measured system — most of the work is just running it long enough for the patterns to show up.