byte-identical at zero

Desai, Tej

Intuition Labs · φ research · 23 · measured · 19 may 2026


A common claim about local LLMs is that they're inherently non-deterministic: same prompt, same temperature, different output every time. We tested five prompts at temperature zero, fifteen trials each, in two operating modes, giving ten prompt-mode cells. Four prompts have short, uniquely determined answers; the fifth is a code-generation task. All eight short-answer cells returned byte-identical responses on all fifteen trials. In this deployment, the substrate is deterministic for short answers at temperature zero. Variance appeared only when the answer space widened: on the code-generation prompt it became visible but stayed bounded.

8 of 10 cells · 1 distinct response in 15 trials

Observed P(byte-identical | T=0, short-answer) = 1.0 (120 of 120 trials) · The non-determinism we associate with LLMs lives in answer-space width, not the substrate itself

the takeaway in one paragraph

If you ask a local model to reply with 'OK', or with 'the number 42', or with the result of 7 times 8, or with the capital of France, and you do that fifteen times at temperature zero, you get back the same bytes every time. We tested this directly. The substrate is byte-deterministic when the answer is uniquely determined. The non-determinism we attribute to LLMs lives in the WIDTH of the answer space, not in the substrate. A prompt that asks for the Fibonacci function (where many valid implementations exist) does show variance, but it stays bounded: three distinct responses across fifteen trials in each mode, with the most common one accounting for 73% of trials in single-pass mode and 40% in exploratory mode.

the test

Five prompts. Four are 'short and uniquely determined' — the answer is one word or one number. The fifth is a code-generation task where many valid implementations exist. We ran each prompt through our local serving stack fifteen times at temperature zero. We did this in two modes: a single-pass mode (the model answers directly, no deliberation, no retry budget) and an exploratory mode (the model can deliberate first, with retry budget allowed). Total: 150 calls. We hashed each response and counted distinct hashes per prompt-mode cell.
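The counting step described above can be sketched as follows. This is a minimal sketch of the procedure, not the actual harness: `generate` is a hypothetical callable standing in for whatever client talks to the serving stack, SHA-256 is an assumed hash choice (the note does not say which hash was used), and the mode is passed as a plain string.

```python
import hashlib
from collections import Counter
from typing import Callable


def count_distinct(
    generate: Callable[[str, str], str],  # hypothetical: (prompt, mode) -> response text
    prompt: str,
    mode: str,
    trials: int = 15,
) -> Counter:
    """Call the model `trials` times and tally responses by hash of their bytes."""
    tally: Counter = Counter()
    for _ in range(trials):
        response = generate(prompt, mode)
        # Hash the raw UTF-8 bytes so that any difference, including
        # whitespace, counts as a distinct response.
        digest = hashlib.sha256(response.encode("utf-8")).hexdigest()
        tally[digest] += 1
    return tally


def fake_generate(prompt: str, mode: str) -> str:
    """Stand-in for a real model call; always returns the same bytes."""
    return "OK"


tally = count_distinct(fake_generate, "Reply with: OK", "single-pass")
distinct = len(tally)  # number of distinct hashes in this prompt-mode cell
```

A cell counts as byte-identical when `len(tally)` is 1. Swapping `fake_generate` for a real client is the only change needed to run this against a model.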

prompt	mode	distinct responses (of 15)	modal convergence	avg wall-clock time
Reply with: OK	single-pass	1	100%	0.2s
Reply with: OK	exploratory	1	100%	3.2s
Reply with the number 42	single-pass	1	100%	0.2s
Reply with the number 42	exploratory	1	100%	4.2s
Compute 7 × 8	single-pass	1	100%	0.3s
Compute 7 × 8	exploratory	1	100%	4.1s
Capital of France	single-pass	1	100%	0.2s
Capital of France	exploratory	1	100%	5.1s
Write a Fibonacci function	single-pass	3	73%	2.3s
Write a Fibonacci function	exploratory	3	40%	23.0s

Eight of ten cells returned byte-identical responses on every single trial: one distinct response out of fifteen, and a modal-convergence rate of one hundred percent. Only the two Fibonacci cells diverged.
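For readers who want the two table metrics spelled out, here is a small sketch of how they can be computed from a list of responses. Modal convergence is taken to mean the share of trials that match the most common response, which is consistent with the table (11 of 15 is 73%, 6 of 15 is 40%). The example split of the non-modal responses is illustrative only; the per-response counts are not reported in this note.

```python
from collections import Counter


def modal_convergence(responses: list[str]) -> tuple[int, float]:
    """Return (distinct response count, share of trials matching the mode)."""
    if not responses:
        raise ValueError("need at least one response")
    tally = Counter(responses)
    # most_common(1) gives [(response, count)] for the most frequent response.
    modal_count = tally.most_common(1)[0][1]
    return len(tally), modal_count / len(responses)


# Illustrative data: 11 + 3 + 1 is one of several splits that yields
# three distinct responses with a 73% mode. The 3/1 split is assumed.
example = ["A"] * 11 + ["B"] * 3 + ["C"] * 1
distinct, share = modal_convergence(example)
print(distinct, round(share * 100))  # 3 73
```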

what the variance looks like

The Fibonacci cell is the interesting one. There are many valid Python functions that compute Fibonacci correctly — recursive, iterative-with-two-variables, iterative-with-list, memoized. The model knows several. At temperature zero you'd naively expect it to always pick the same one, but it doesn't. Three distinct responses out of fifteen. Most cluster on the canonical iterative form, but a few drift to other shapes.

The exploratory mode shows MORE variance than single-pass on this prompt: modal convergence drops from 73% to 40%. Both modes run at temperature zero, but the longer deliberation in exploratory mode lets small differences in the reasoning trajectory propagate into different final code. This is consistent with what you'd expect: more tokens generated before the final answer, more chances to fork.

And — this matters — the variance stays BOUNDED. Three distinct shapes, not fifteen. The substrate is choosing among a small number of valid attractors. The choice is influenced by upstream computation, not by sampling noise. 'Random' is the wrong word for this regime.

why this matters for the system

Two implications. First, when you give your agent a strict-equality assertion ('answer must equal 55'), the substrate has the structural ability to comply byte-perfectly. In our runs, single-pass mode was byte-deterministic on every short-answer prompt. Subject to the caveats below, you can build downstream systems that depend on this property.

Second, when the answer space is wider — code generation, free-form text, multi-step reasoning — even at temperature zero you should expect bounded variance. Not random output, but choice among a small number of valid attractors. This is the regime where retry budgets, property-tests, and exploration-mode allowances pay off.

The mode-switch tells the substrate which regime you're in. The substrate then operates accordingly: tight single-pass when you've specified an exact answer; deliberate-then-respond with retries when you've specified a property. We make this choice explicit per call.
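One way to picture the per-call choice is the sketch below. It is an illustration under stated assumptions, not the system's actual interface: `Exact`, `Property`, `run`, and the `generate(prompt, mode)` callable are all hypothetical names, and the default retry count of 3 is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Exact:
    """The caller knows the exact bytes it expects back."""
    expected: str


@dataclass
class Property:
    """The caller can only state a property the answer must satisfy."""
    check: Callable[[str], bool]
    max_retries: int = 3  # arbitrary default for this sketch


def run(
    generate: Callable[[str, str], str],  # hypothetical: (prompt, mode) -> response
    prompt: str,
    spec: Union[Exact, Property],
) -> Optional[str]:
    """Pick the mode from the spec; return the accepted answer or None."""
    if isinstance(spec, Exact):
        # Narrow answer space: one direct call, strict byte equality.
        answer = generate(prompt, "single-pass")
        return answer if answer == spec.expected else None
    # Wide answer space: deliberate, then retry until the property holds.
    for _ in range(spec.max_retries):
        answer = generate(prompt, "exploratory")
        if spec.check(answer):
            return answer
    return None
```

A call such as `run(generate, "Compute 7 × 8", Exact("56"))` takes the single-pass path, while a code prompt would pass a `Property` whose `check` runs a property test on the returned function.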

caveats

This is a single local model running through our specific serving stack at temperature zero. Other models with different sampling defaults, different attention implementations, or different KV-cache strategies might behave differently. We're not claiming a universal property of LLMs — we're claiming that for THIS deployment, at THIS sampling configuration, the substrate is byte-deterministic for short answers.

Substrate state can also degrade over long sessions, which we observed independently (see note 22). When the substrate is in degraded state, latency rises and we'd expect convergence to deteriorate. Our experiment ran on a freshly-restarted substrate.

Finally, we did not test against other models. A planned follow-up runs the same prompts through different local models to test whether byte-convergence holds across models — that's the harder bar.

what we learned

lesson	why it matters
Determinism is the default for narrow answer spaces	When the answer is uniquely specified, the substrate hits it byte-perfect at temperature zero. The non-determinism narrative oversells the actual behavior.
Variance scales with answer-space width, not substrate noise	Short prompts converge perfectly; code prompts diverge into a small number of attractors. The variance source is the search space, not random sampling.
Mode-switching has a measurable effect on convergence	On the code prompt, modal convergence is 73% in single-pass mode and 40% in exploratory mode. The mode-switch is doing work, not theater.
The unfair-advantage frame for known operations is real	If you can characterize an operation as "uniquely specified," you can demand byte-identical output and the substrate can deliver. That guarantee is structurally available; you just have to design for it.
