the 81% was an average · your prompt regime matters
in the last note we said the proxy can measure about 19% of tokens on a typical response. that number was the median across mixed traffic. broken down by prompt regime, individual streams span from 2% measured (extend an existing function) to 26% measured (think out loud about a hard problem). knowing which regime you're in matters when you read the confidence score.
the takeaway in one paragraph
Note 17 reported that our confidence number is computed from about 19% of completion tokens: the drafter-miss tokens, the points where a small fast model failed to predict and the big model had to sample fresh. That 19% was the median across mixed traffic. Broken out by prompt shape, the same data shows a large spread: extending an existing function gives us 2% measured (the drafter predicts almost everything), while asking the model to think out loud on a hard novel problem gives us 26%. That is a 24-point spread. The confidence score is still the right signal to look at, but knowing how much of the stream we could actually see changes how to read it.
four regimes, four samples
We pulled four representative responses, one from each kind of prompt our coding agent typically sees. For each we recorded the same thing: how much of the completion the drafter handled (the part we cannot measure) vs how much the main model had to sample (the part we can measure).
| regime | example prompt | drafter handled | we measured |
|---|---|---|---|
| extend-mode | extend a partial implementation to also pass new tests | 98% | 2% |
| thinking-off · small fn | write fib(n), reply with only the function | 94% | 6% |
| thinking-off · new code | implement Strassen 2x2 matrix multiply | 94% | 6% |
| thinking-on · hard | implement bellman-ford with negative-cycle detection | 74% | 26% |
The pattern is intuitive once you see it. The drafter is good at predicting tokens it has structural context for. Extend-mode is the easiest: the existing code is in the prompt, the new code follows similar patterns, and the drafter rides on the existing structure cleanly. Asking for a known small function gives the drafter strong priors and most predictions hit. Asking the model to think out loud through a hard novel algorithm produces the most original token-by-token reasoning, which the drafter struggles with and the main model has to actually produce fresh — so we see more of it.
why this matters for reading the number
A confidence score of 0.96 on an extend-mode response was computed from 2% of the response. That's two or three samples. The number is technically correct but statistically noisy: comparing 0.96 to 0.94 on averages of two or three samples is mostly comparing noise. A confidence score of 0.96 on a thinking-on hard problem was computed from 26% of the response. That's 50–300 samples. Same number, very different precision.
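To make the precision gap concrete, here is a minimal sketch of the standard error of a mean over n per-token confidence values. It assumes the measured tokens are independent and share a per-token standard deviation of 0.05. Both are illustrative assumptions, not properties of the proxy, and `standard_error` is a hypothetical helper.

```python
import math

def standard_error(per_token_std: float, n_samples: int) -> float:
    """Standard error of a mean over n independent samples."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return per_token_std / math.sqrt(n_samples)

ASSUMED_STD = 0.05  # hypothetical per-token spread, for illustration only

# Sample counts taken from the ranges quoted in the text above.
for label, n in [("extend-mode", 3),
                 ("thinking-on, short", 50),
                 ("thinking-on, long", 300)]:
    se = standard_error(ASSUMED_STD, n)
    print(f"{label:>20}: n={n:>3}  standard error ~ {se:.4f}")
```

Under these assumptions the three-sample standard error (about 0.029) is larger than the 0.02 gap between 0.96 and 0.94, while the 300-sample standard error is ten times smaller.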
For halt decisions this works out fine. The proxy halts on sustained low-confidence runs, not on single-token comparisons; sample size feeds into how long the run needs to be before the halt fires. For reading a score on a single response, the precision varies by regime. Our records publish both numbers — the confidence score and the drafter acceptance rate — so a consumer can know which precision regime they're looking at.
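One way to picture a run-based halt is the sketch below. The threshold, the minimum run length, and the function name `should_halt` are all hypothetical: this note only says the proxy halts on sustained low-confidence runs, not how a run is defined. The input is the sequence of confidence values for measured tokens only.

```python
from typing import Iterable

def should_halt(confidences: Iterable[float],
                threshold: float = 0.5,     # hypothetical cutoff
                min_run: int = 5) -> bool:  # hypothetical run length
    """Return True once min_run consecutive measured tokens fall below threshold."""
    run = 0
    for c in confidences:
        # Extend the run on a low-confidence token, reset it otherwise.
        run = run + 1 if c < threshold else 0
        if run >= min_run:
            return True
    return False

# A single low token does not trigger a halt; a sustained run does.
isolated_dip = [0.9, 0.2, 0.9, 0.95, 0.9]
sustained_dip = [0.9, 0.3, 0.2, 0.4, 0.1, 0.3]
print(should_halt(isolated_dip))   # False
print(should_halt(sustained_dip))  # True
```

In a design like this, `min_run` is where a dependence on sample size would enter: a regime with few measured tokens would need a different run length than one with hundreds.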
what this also tells us about the drafter
The acceptance rate itself is a measurement of the drafter's job. A drafter that predicts 98% of tokens correctly on extend-mode is doing extend-mode well. A drafter that predicts only 74% of tokens correctly on hard-novel work is — by design — handing more decisions to the big model in exactly the cases where the big model's judgment matters. That's not a failure mode; that's the speculative-decoding system working.
The acceptance rate is also a free latency signal. Higher acceptance = more tokens per main-model forward pass = faster wall-time. The 98% extend-mode case is roughly 4× faster end-to-end than the 74% thinking-on case at the same total token count. We hadn't been thinking of acceptance rate this way; now that we publish it on every response, downstream consumers can read latency expectations directly from the previous response's acceptance number.
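The link between acceptance and latency can be sketched with a simple cost model. It assumes each main-model forward pass ends in exactly one freshly sampled (measured) token, so the number of passes is roughly the measured fraction times the token count, and it charges each drafted token a fixed fraction of a main-model pass. The function names and the `draft_cost_ratio` parameter are hypothetical, and the model ignores draft-length caps and batching effects.

```python
def estimated_cost(total_tokens: int,
                   acceptance_rate: float,
                   draft_cost_ratio: float) -> float:
    """Rough cost in units of one main-model forward pass.

    acceptance_rate: fraction of tokens the drafter handled (0..1).
    draft_cost_ratio: assumed cost of one drafted token relative to
        one main-model pass (hypothetical; depends on the models used).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    drafted = total_tokens * acceptance_rate
    main_passes = total_tokens - drafted  # assumption: one pass per measured token
    return main_passes + drafted * draft_cost_ratio

def relative_speedup(fast_rate: float, slow_rate: float,
                     draft_cost_ratio: float,
                     total_tokens: int = 1000) -> float:
    """How many times cheaper the high-acceptance response is."""
    return (estimated_cost(total_tokens, slow_rate, draft_cost_ratio)
            / estimated_cost(total_tokens, fast_rate, draft_cost_ratio))

# With a free drafter the ratio is just 26% / 2% = 13x; any real
# drafter cost pulls it down from there.
for ratio in (0.0, 0.05, 0.1):
    print(ratio, round(relative_speedup(0.98, 0.74, ratio), 2))
```

The result is sensitive to the assumed drafter cost, which this note does not state, so treat the output as the shape of the relationship rather than a reproduction of the roughly 4× figure.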
what we are still figuring out
Four data points are a sketch, not a calibration. With more responses we should see a bimodal split inside each regime: there will be hard extend-mode prompts (forking an algorithm subtly) and easy thinking-on prompts (recap a known proof), and the regime label will not be a clean predictor. The pattern we're showing here is the first cut: one representative response per category. The really interesting structure is probably what falls inside each category.
We also don't yet have an uncertainty estimate for the drafter_acceptance number itself. "Extend-mode is around 98%" rests on one observation; the answer might be "between 95% and 99%" once more data lands. That kind of precision question matters less than the qualitative finding (there is a real 24-point spread across regimes), but it matters if anyone tries to use the number as a fine-grained predictor.