two models, one answer

Desai, Tej

Intuition Labs · φ research · 27 · measured · 19 may 2026

Our prior work showed that for short, uniquely-answered prompts at temperature zero, a single model produces byte-identical output across trials. The harder bar is cross-model: do two different model architectures, served on the same substrate, agree on the exact bytes? We ran four prompts through two models, qwen3.6 35B-A3B and gemma4 26B-A4B, both MoE architectures, both at temperature zero, ten trials per prompt. Within each model, all ten trials of every prompt produced one canonical answer (forty trials per model, eighty in total). All four cross-model pairs were byte-identical. The bar is empirically cleared: for uniquely-determined answers, the variance is neither model-specific nor substrate-specific. In this sample, it is not there.

4 of 4 prompts byte-identical across two distinct architectures

P(byte-identical | T=0, narrow-answer, cross-model) = 1.0 · observed across 80 trials, two distinct MoE architectures, four short prompts

the takeaway in one paragraph

When the question has one right answer, two different models agree on the bytes. We tested four such prompts ('reply with OK,' 'reply with 42,' 'compute 7 times 8,' 'capital of France?') across two completely different local models: the 35-billion-parameter Qwen3.6 mixture-of-experts and the 26-billion-parameter Gemma 4 mixture-of-experts. Both at temperature zero. Ten trials per prompt. All forty trials within each model, eighty in total, converged to one answer per prompt. All four prompts produced the SAME answer cross-model. The byte-level variance you'd expect from two different architectures running on the same GPU? Zero, for this class of prompt.

why this is the harder bar

Our earlier work showed intra-model byte-identity at temperature zero for short prompts — same model, multiple runs, identical bytes. That's the precondition: each model is internally deterministic for narrow questions. The next level up is cross-model byte-identity: do two ARCHITECTURALLY DIFFERENT models agree on the exact output bytes?

The naive prior says no. Different weights, different attention patterns, different chat templates, different layer counts, different KV-cache layouts, different routing decisions. Surely they should at least format the answer differently — 'OK' vs 'Ok' vs 'OK.' vs 'The answer is OK.' One of them should preface, one should add a period, one should capitalize differently.

They don't. Across our four prompts, both models reply with exactly: 'OK', '42', '56', 'Paris'. No prefacing, no trailing period, no formatting variance. Eight cells (4 prompts × 2 models), four matching pairs: for every prompt, the two models' outputs are the same bytes. For this prompt set, the byte-level agreement across architectures is total.
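The check described above can be sketched in a few lines. This is a minimal illustration, not the harness used for the measurements: `generate` is a hypothetical stand-in for a real temperature-zero inference call, and here it only returns canned strings so the example runs on its own. The model names are used purely as labels.

```python
from typing import Callable, List, Optional

PROMPTS: List[str] = [
    "Reply with: OK",
    "Reply with the number 42",
    "Compute 7 × 8",
    "Capital of France?",
]
MODELS: List[str] = ["qwen3.6", "gemma4"]  # labels only
TRIALS = 10

# Hypothetical stub: a real harness would call the served model at T=0.
_CANNED = dict(zip(PROMPTS, ["OK", "42", "56", "Paris"]))


def generate(model: str, prompt: str) -> str:
    return _CANNED[prompt]


def canonical_bytes(
    model: str,
    prompt: str,
    gen: Callable[[str, str], str] = generate,
    trials: int = TRIALS,
) -> Optional[bytes]:
    """Return the one byte-string all trials agree on, or None if they differ."""
    outputs = {gen(model, prompt).encode("utf-8") for _ in range(trials)}
    return next(iter(outputs)) if len(outputs) == 1 else None


def cross_model_identical(
    prompt: str,
    models: List[str] = MODELS,
    gen: Callable[[str, str], str] = generate,
) -> bool:
    """True only if every model converges intra-model AND all models match."""
    cells = [canonical_bytes(m, prompt, gen) for m in models]
    return None not in cells and len(set(cells)) == 1


results = {p: cross_model_identical(p) for p in PROMPTS}
```

Comparing encoded bytes rather than normalized strings is deliberate: 'OK' and 'OK.' must count as a disagreement, which is the whole point of the byte-identity bar.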

the result

prompt	qwen3.6 (35B-A3B)	gemma4 (26B-A4B)	cross-model
Reply with: OK	'OK'	'OK'	✓ identical
Reply with the number 42	'42'	'42'	✓ identical
Compute 7 × 8	'56'	'56'	✓ identical
Capital of France?	'Paris'	'Paris'	✓ identical

Intra-model convergence was 10 of 10 trials byte-identical for every cell. Cross-model agreement was 4 of 4 prompts. Wall-time differed substantially: gemma4 averaged 2.6 seconds per call, qwen3.6-fast averaged 0.21 seconds. The substrate cost differs by a factor of roughly twelve across the two models for the same question. The byte-output is identical. Compute path and final answer are decoupled at this scale of question.
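The "twelve times" figure follows directly from the two reported averages. The short calculation below uses only those two numbers; the per-model totals assume the averages hold across all forty calls per model, which the text does not state explicitly.

```python
# Average wall-time per call, as reported in the text.
gemma4_seconds = 2.6
qwen_fast_seconds = 0.21

# Ratio of per-call cost between the two models.
ratio = gemma4_seconds / qwen_fast_seconds

# Assumption: 4 prompts x 10 trials = 40 calls per model at the average rate.
calls_per_model = 4 * 10
gemma4_total = calls_per_model * gemma4_seconds
qwen_fast_total = calls_per_model * qwen_fast_seconds

print(f"wall-time ratio: {ratio:.1f}x")
print(f"estimated totals: gemma4 {gemma4_total:.1f}s, qwen3.6-fast {qwen_fast_total:.1f}s")
```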

what this proves and what it does not

Proven, within our sample: for uniquely-determined narrow-answer questions, two different MoE architectures running on the same substrate at temperature zero produce byte-identical output. Substrate variance is not a structural barrier to cross-model byte-identity at this scale of question. Neither is architectural variance.

Not proven: that this holds for longer outputs. We already know from our code-generation experiments that variance enters once the answer space widens. The Fibonacci-function prompt produced bounded but non-trivial variance INTRA-model. Cross-model on that prompt would almost certainly DIVERGE because different models have different stylistic priors for code, different formatting habits, different control-flow defaults. Cross-model byte-identity is a property of the question class, not a universal.

Not proven: that arbitrarily many models would all agree. We tested two. A third or fourth might disagree on one of these prompts, and even gemma4 might prefer 'fifty-six' to '56' if the question were phrased differently. The result is N=2 architectures for the cross-model comparison, N=4 prompts, and N=10 trials per cell.
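To put the sample sizes in perspective, here is a rough sketch of how much a zero-failure result constrains the underlying disagreement rate. It uses the standard exact one-sided bound for a binomial with zero observed failures, and it assumes independent trials. That assumption is questionable for temperature-zero decoding, where repeated trials of the same prompt are close to deterministic, so treat the numbers as illustrative rather than as part of the measured result.

```python
def zero_failure_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper bound on the failure rate p after n trials with zero failures.

    Solves (1 - p) ** n = 1 - confidence for p. Assumes independent trials.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# 80 trials with no intra-model divergence.
bound_trials = zero_failure_upper_bound(80)

# 4 prompts with no cross-model disagreement.
bound_prompts = zero_failure_upper_bound(4)

print(f"80 trials: failure rate below {bound_trials:.3f} at 95% confidence")
print(f"4 prompts: failure rate below {bound_prompts:.3f} at 95% confidence")
```

The second bound is the one to keep in mind: four agreeing prompts leave a wide range of possible disagreement rates across the broader class of narrow-answer prompts.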

Not proven: that the agreement is causal in any deep sense. Both models were trained on overlapping web corpora that contained the canonical answers to these questions. The agreement might be a property of the training data more than a property of the substrate. We can't disentangle that here.

why this matters for the line

Our prior work named a threshold — the line where tuning ends — beyond which substrate-induced variance breaks byte-identity even at temperature zero. Below the line, tuning suffices. Above the line, a proof-driven correspondence layer is required.

This experiment locates one region below the line empirically. For prompts whose acceptable answer set has cardinality one, that is, questions where exactly one byte-string is correct, multiple architectures converge to that byte-string. The substrate's job is structurally completable. Below the line means: even ACROSS model architectures, the substrate hits the unique answer.

This bounds the regime where the verification gate's HIGH/HIGH verdict is structurally well-defined. If two architectures agree byte-perfect, the substrate has internalized the question — the variance we'd otherwise need a proof layer to handle simply isn't there. The Correspondence Law's antecedent (cost-of-wrong > cost-of-verification) is moot in this regime because there's nothing to verify against.

what we learned

lesson	why it matters
Cross-model byte-identity at T=0 exists for narrow-answer prompts	Substrate variance is not the structural barrier to cross-architecture agreement that intuition suggests.
This class of prompt is locatable below the line	Operations whose acceptable answer set has cardinality one sit deeply below the line; multiple architectures will hit the unique answer.
Compute cost and output bytes decouple at the narrow end	Gemma4 costs roughly 12x as much wall time as Qwen3.6-fast but produces the same byte-string for these prompts.
The variance you fear is mostly in the answer-space width, not the substrate or the architecture	When you can NARROW the answer set, you cross the line. Operations engineering can do this by tightening prompts and asserts.
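One way to read "tightening prompts and asserts" is as a gate that only accepts an output when the acceptable answer set has exactly one member and the output matches it byte for byte. The sketch below is an assumption about what such an assert could look like; `assert_narrow` is a hypothetical helper, not something named in our tooling.

```python
from typing import FrozenSet


def assert_narrow(output: str, acceptable: FrozenSet[str]) -> str:
    """Accept `output` only if the answer set is a singleton and the bytes match."""
    if len(acceptable) != 1:
        # A wider answer set means the prompt has not been narrowed enough.
        raise ValueError(f"answer set has cardinality {len(acceptable)}, expected 1")
    (expected,) = acceptable
    if output.encode("utf-8") != expected.encode("utf-8"):
        raise AssertionError(f"expected {expected!r}, got {output!r}")
    return output


# Usage: the 'Compute 7 × 8' prompt has exactly one acceptable byte-string.
checked = assert_narrow("56", frozenset({"56"}))
```

The gate rejects 'fifty-six' and '56.' alike, which forces the narrowing to happen in the prompt instead of in post-processing.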
