the line where tuning ends

Desai, Tej

Intuition Labs · φ research · 26 · measured · 19 may 2026

Even at temperature zero, with the substrate freshly restarted, the same prompt does not always produce byte-identical code. For short, uniquely determined answers it does. For outputs the length of a Fibonacci function, the substrate fragments into bounded attractors that no amount of tuning eliminates. The variance comes from the hardware-software stack itself: floating-point non-associativity, speculative-decode chain coupling, state accumulation. Below a threshold, tuning suffices. Above it, you need a separate proof-driven verifier that maps substrate output to either a member of the acceptable answer set or an explicit reject. This article states the line, the law, and what it asks of any system that depends on the substrate for high-stakes operations.

substrate variance is irreducible above ~30 tokens · proof layer required

cost-of-wrong > cost-of-verification ⇒ proof layer required · the threshold is empirical and measurable; the requirement is structural

the takeaway in one paragraph

Some variance the substrate produces survives every knob we have: temperature, model tuning, prompt engineering. At temperature zero, with a fresh server, the same input doesn't always give the same bytes out once the output grows past a few tens of tokens. That residual variance lives in the floating-point math and the speculative-decoding chain, downstream of the model's training. For any operation where being wrong is more expensive than verifying right, the substrate alone is insufficient. You need a separate layer: a total function from substrate output to (acceptable answer OR explicit reject), proven correct relative to the spec, observable in the trace, and cheaper than the cost of being wrong. Below the line, tuning suffices. Above it, proof is mandatory. This is the law our work has been pointing at.

the empirical line

Across our recent experiments — short uniquely-answered prompts run fifteen times each at temperature zero — eight out of ten cells came back byte-identical on every trial. One distinct response across fifteen runs. The substrate IS deterministic when the answer is uniquely specified and the output is short.

The two cells that didn't were both code-generation prompts (write a Fibonacci function). Three distinct responses across fifteen trials. These three were bounded attractors, each a valid algorithm, with the modal one capturing 70-80% of trials. The variance was structural to the substrate, not random and not a failure of the model's competence.

Where the line sits is operationally measurable. Roughly: as soon as the model has more than one valid way to write a non-trivial output, the substrate's micro-arithmetic differences begin to bias the choice. Output longer than about thirty tokens with non-trivial branching is enough. Output shorter than that — single numbers, single words, two-line scripts — stays under the line.
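The measurement described above can be sketched in a few lines: collect the raw bytes from repeated runs of one prompt, count the distinct responses, and compute the share taken by the most common one. The function name `summarize_trials` and the sample data are illustrative assumptions, not the harness or the results from our experiments.

```python
from collections import Counter

def summarize_trials(responses):
    """Count distinct byte strings and the share of the most common one."""
    counts = Counter(responses)
    _, modal_count = counts.most_common(1)[0]
    return {
        "trials": len(responses),
        "distinct": len(counts),
        "modal_share": modal_count / len(responses),
    }

# Made-up data shaped like one code-generation cell: 15 trials, 3 attractors.
code_cell = [b"iterative"] * 11 + [b"recursive"] * 3 + [b"memoized"] * 1
# Made-up data shaped like a short, uniquely answered cell.
short_cell = [b"56"] * 15

print(summarize_trials(code_cell))
print(summarize_trials(short_cell))
```

A cell with `distinct == 1` sits under the line for that prompt; anything higher is a candidate for the proof layer.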

why this variance is structural

Three sources, each independent of temperature or model. First, floating-point math is not associative: summing the same terms in a different order produces a slightly different logit, which can flip an argmax. Different runs interleave kernels differently, so the order of summation is not fixed from run to run. Second, speculative decoding has chain coupling: the drafter's accepted tokens depend on transient KV-cache layout, so microscopic state differences propagate forward. Third, substrate state accumulates: KV-cache layout, drafter state, allocator fragmentation. Each request marginally changes what the next request sees.
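The first source can be reproduced on a CPU with plain Python floats. The sketch below is a toy, not a model of any real kernel: the values are chosen by hand so the effect is large, and the token names and the competing logit of 0.5 are assumptions for illustration.

```python
def accumulate(terms):
    """Sum left to right in plain float arithmetic.

    An explicit loop is used because the built-in sum() may apply
    compensated summation on recent Python versions.
    """
    total = 0.0
    for t in terms:
        total += t
    return total

def argmax(logits):
    """Return the key with the largest value."""
    return max(logits, key=logits.get)

# Hypothetical contributions to the logit of token "a", in two orders.
order_1 = [1e16, 1.0, -1e16]   # 1.0 is absorbed by 1e16, then cancelled away
order_2 = [1e16, -1e16, 1.0]   # cancellation happens first, so 1.0 survives

for order in (order_1, order_2):
    logits = {"a": accumulate(order), "b": 0.5}
    print(logits, "->", argmax(logits))
```

The same three numbers give a logit of 0.0 in one order and 1.0 in the other, which is enough to move the argmax from "b" to "a". Real logits differ by far smaller amounts, but the mechanism is the same.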

None of these are bugs. They're properties of high-throughput batched inference. They're the price of getting tokens out fast on a real GPU. Pretending they don't exist gives you the illusion of determinism; acknowledging them gives you the line where it actually breaks.

the law

We state it cleanly:

For any operation O whose acceptable answer set A is bounded and whose cost-of-wrong C exceeds the cost-of-verification V, the substrate alone is insufficient. A proof-driven correspondence layer is required.

The correspondence layer must be four things:

property	what it means
Total function from substrate output to A ∪ {REJECT}	For any substrate output, the layer produces either a member of A or an explicit reject. No silent passes.
Proven correct relative to A	The layer's decision procedure has a correctness argument against the spec — not just empirical pass-rate.
Observable (trace-able)	The layer's decision is recorded so downstream systems can audit. Failure to be observable means the law is satisfied in form, not substance.
Cheaper than wrong	Runtime cost of the layer must be less than C. Otherwise the law's antecedent fails and you should reconsider whether O is worth running at all.

Below the line — operations where C is small or A is large and forgiving — substrate tuning + the standard verification gate suffices. Above the line, the correspondence layer is mandatory, not optional.
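A minimal sketch of the layer's shape, under stated assumptions: the names `Decision`, `correspond`, and `normalize` are hypothetical, the acceptable set A is small enough to enumerate, and the trace is an in-memory list standing in for a real audit log. The correctness argument here is by inspection (the only non-reject return is guarded by membership in A), which is far weaker than what a real domain would need.

```python
from dataclasses import dataclass

REJECT = "REJECT"

@dataclass(frozen=True)
class Decision:
    verdict: str   # a member of A, or REJECT
    reason: str

def correspond(substrate_output, acceptable, normalize, trace):
    """Map any substrate output to a member of `acceptable` or REJECT."""
    try:
        candidate = normalize(substrate_output)
        is_member = candidate in acceptable
    except Exception as exc:
        # Totality: a failure while checking is a reject, never a crash.
        decision = Decision(REJECT, f"check failed: {exc!r}")
    else:
        if is_member:
            decision = Decision(candidate, "member of A")
        else:
            decision = Decision(REJECT, "not in A")
    # Observability: every decision is recorded, accepts and rejects alike.
    trace.append({"input": substrate_output,
                  "verdict": decision.verdict,
                  "reason": decision.reason})
    return decision

def strip_answer(text):
    """Hypothetical normalizer: trim whitespace and a trailing period."""
    return text.strip().rstrip(".")

trace = []
A = {"56"}
print(correspond("56.\n", A, strip_answer, trace).verdict)
print(correspond("fifty-six", A, strip_answer, trace).verdict)
print(correspond(None, A, strip_answer, trace).verdict)
```

The third call passes something that is not even a string; the layer still returns an explicit reject and still writes a trace entry, which is what "no silent passes" asks for.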

examples on each side of the line

Below the line: the substrate computing 7 × 8 and returning '56.' The acceptable answer set has one element. Cost of wrong is small (a single bad math answer). Substrate at temperature zero is byte-deterministic for this; no proof layer needed beyond the obvious '56?' check.

Above the line: the substrate generating SQL for the request 'delete the user_id row.' Cost of wrong is catastrophic: the statement could delete the wrong rows, or the whole table. A is finite (the small set of SQL statements that target exactly the right rows), but the substrate has many shapes it could emit, some of which would do the wrong thing. Substrate tuning alone is not sufficient. A proof layer (schema validation, dry-run, explicit confirmation) is mandatory.
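As a rough illustration of the SQL case, here is a sketch using Python's standard `sqlite3` module. The table `users`, the single allowed statement shape, and the function names are assumptions made for the example; the explicit-confirmation step is left to the caller and is not shown.

```python
import re
import sqlite3

REJECT = "REJECT"

# Hypothetical acceptable set A: one statement shape, keyed on user_id.
ALLOWED_SHAPE = re.compile(
    r"\s*DELETE\s+FROM\s+users\s+WHERE\s+user_id\s*=\s*(\d+)\s*;?\s*",
    re.IGNORECASE,
)

def check_delete(conn, candidate_sql, target_id):
    """Return candidate_sql if it provably removes exactly the target row,
    else REJECT. The database is left unchanged either way."""
    if not isinstance(candidate_sql, str):
        return REJECT
    match = ALLOWED_SHAPE.fullmatch(candidate_sql)
    if match is None or int(match.group(1)) != target_id:
        return REJECT                      # wrong shape or wrong row
    try:
        before = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        affected = conn.execute(candidate_sql).rowcount   # dry run
        after = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    except sqlite3.Error:
        return REJECT
    finally:
        conn.rollback()                    # undo the dry run
    if affected == 1 and before - after == 1:
        return candidate_sql
    return REJECT

def make_demo_db():
    """Build a three-row in-memory table for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])
    conn.commit()
    return conn

conn = make_demo_db()
print(check_delete(conn, "DELETE FROM users WHERE user_id = 2", 2))
print(check_delete(conn, "DELETE FROM users", 2))
```

The first candidate is accepted; the second, which would empty the table, is rejected before it ever runs. Only a statement the layer returns should go on to confirmation and execution.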

The verification gate's HIGH/HIGH verdict in the mode-switching framework marks 'below the line.' The MIXED ATTRACTORS verdict marks 'above the line, tuning has failed, proof required.' The gate itself is a runtime classifier for which side of the line a given operation falls on.

what the law asks of an honest system

First, classify every operation. Where is C relative to V? If you cannot estimate C, the operation is above the line by default (the precautionary stance). If you cannot estimate V, you don't yet have a system.
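The classification step reduces to a small decision rule. The sketch below is one possible reading, with hypothetical names; costs are assumed to be expressed in the same unit, and an unknown cost is represented as `None`.

```python
from typing import Optional

def classify(cost_of_wrong: Optional[float],
             cost_of_verification: Optional[float]) -> str:
    """Return "above" or "below" the line for one operation."""
    if cost_of_verification is None:
        # Without an estimate of V there is nothing to compare against.
        raise ValueError("V is unknown: no system to classify yet")
    if cost_of_wrong is None:
        # Precautionary stance: unknown C is treated as above the line.
        return "above"
    return "above" if cost_of_wrong > cost_of_verification else "below"

print(classify(0.01, 1.0))     # cheap mistake, e.g. one bad arithmetic answer
print(classify(1000.0, 1.0))   # expensive mistake, e.g. a destructive action
print(classify(None, 1.0))     # C not estimable
```

The comparison is strict because the law is stated for C exceeding V; an operation where the two are equal is treated as below the line in this sketch.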

Second, for operations above the line, build the correspondence layer. The shape depends on the domain — type systems for structured outputs, schema validators for data, CI gates for code, dry-run-and-confirm for destructive actions, formal verifiers for safety-critical logic. The substrate's job becomes 'produce a candidate'; the layer's job is 'accept or reject with proof.'

Third, make the layer observable. Receipts. Logs. Audit trails. If the layer's decision is not in the trace, you cannot tell it from the substrate's decision, and the law's third property fails.

Fourth, design the operation envelope so cost-of-wrong stays in a regime where the cheapest layer that satisfies the law is also affordable. If every layer that satisfies the law costs more than the wrong outcome it guards against, the right answer is not to run O.

what we learned

lesson	why it matters
Substrate variance is structural at non-trivial output length	No tuning eliminates it. Pretending otherwise produces silent flip-flops in production.
The proof-layer requirement is a function of cost, not of preference	C vs V is the decision rule. Above the line is mandatory, not aspirational.
The verification gate locates the line at runtime	HIGH/HIGH is below; MIXED ATTRACTORS is above. The gate is the classifier.
Below-the-line tuning compounds; above-the-line tuning regresses	Below the line, more tuning increases byte-determinism. Above it, more tuning can collapse the model into one attractor — including the wrong one.
