two models, one answer
Our prior work showed that for short, uniquely answered prompts at temperature zero, a single model produces byte-identical output across trials. The harder bar is cross-model: do two different model architectures, served on the same substrate, agree on the exact bytes? We ran four prompts through two models, qwen3.6 35B-A3B and gemma4 26B-A4B, both MoE architectures, both at temperature zero, ten trials per prompt. All forty trials per model converged to one canonical answer per prompt. All four cross-model pairs were byte-identical. The bar is empirically cleared: for uniquely determined answers, the variance is neither model-specific nor substrate-specific. It is not there.
the takeaway in one paragraph
When the question has one right answer, two different models agree on the bytes. We tested four such prompts ('reply with OK,' 'reply with 42,' 'compute 7 times 8,' 'capital of France?') across two completely different local models: the 35-billion-parameter Qwen3.6 mixture-of-experts and the 26-billion-parameter Gemma 4 mixture-of-experts. Both ran at temperature zero, ten trials per prompt. All forty trials within each model (eighty in total) converged to one answer per prompt. All four prompts produced the SAME answer cross-model. The byte-level variance you'd expect from two different architectures running on the same GPU? Zero, for this class of prompt.
why this is the harder bar
Our earlier work showed intra-model byte-identity at temperature zero for short prompts — same model, multiple runs, identical bytes. That's the precondition: each model is internally deterministic for narrow questions. The next level up is cross-model byte-identity: do two ARCHITECTURALLY DIFFERENT models agree on the exact output bytes?
The naive prior says no. Different weights, different attention patterns, different chat templates, different layer counts, different KV-cache layouts, different routing decisions. Surely they should at least format the answer differently — 'OK' vs 'Ok' vs 'OK.' vs 'The answer is OK.' One of them should preface, one should add a period, one should capitalize differently.
They don't. Across our four prompts, both models reply with exactly: 'OK', '42', '56', 'Paris'. No prefacing, no trailing period, no formatting variance. Eight cells (4 prompts × 2 models), and for every prompt the two models' cells match byte for byte. For these prompts, the agreement across architectures is total.
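The two checks described above, intra-model convergence first and cross-model agreement second, can be sketched as follows. This is a minimal illustration, not the harness used for the experiment: the `Generate` callable, `canonical_bytes`, `cross_model_agreement`, and the two stub "models" are all hypothetical names standing in for whatever serving stack is actually in use.

```python
from typing import Callable, Dict, List

# Hypothetical interface: takes a prompt, returns the model's raw reply text.
Generate = Callable[[str], str]


def canonical_bytes(generate: Generate, prompt: str, trials: int = 10) -> bytes:
    """Run `trials` calls and return the one byte-string they all agree on.

    Raises ValueError if any two trials differ at the byte level.
    """
    outputs = {generate(prompt).encode("utf-8") for _ in range(trials)}
    if len(outputs) != 1:
        raise ValueError(f"intra-model divergence on {prompt!r}: {sorted(outputs)}")
    return outputs.pop()


def cross_model_agreement(
    models: Dict[str, Generate], prompts: List[str], trials: int = 10
) -> Dict[str, bool]:
    """Map each prompt to True when every model's canonical bytes are equal."""
    report = {}
    for prompt in prompts:
        canon = {name: canonical_bytes(gen, prompt, trials) for name, gen in models.items()}
        report[prompt] = len(set(canon.values())) == 1
    return report


# Stand-in "models" so the sketch runs without a serving stack (assumption).
def stub_a(prompt: str) -> str:
    return {"Reply with: OK": "OK", "Compute 7 times 8": "56"}[prompt]


def stub_b(prompt: str) -> str:
    # Adds a trailing period on one prompt to show what a mismatch looks like.
    return {"Reply with: OK": "OK", "Compute 7 times 8": "56."}[prompt]


report = cross_model_agreement(
    {"a": stub_a, "b": stub_b}, ["Reply with: OK", "Compute 7 times 8"]
)
print(report)
```

Comparing encoded bytes rather than normalized strings is deliberate: a trailing period or a case change counts as a disagreement, which is the bar this post is about.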
the result
| prompt | qwen3.6 (35B-A3B) | gemma4 (26B-A4B) | cross-model |
|---|---|---|---|
| Reply with: OK | 'OK' | 'OK' | ✓ identical |
| Reply with the number 42 | '42' | '42' | ✓ identical |
| Compute 7 × 8 | '56' | '56' | ✓ identical |
| Capital of France? | 'Paris' | 'Paris' | ✓ identical |
Intra-model convergence was 10 of 10 trials byte-identical for every cell. Cross-model agreement was 4 of 4 prompts. Wall time differed substantially: gemma4 averaged 2.6 seconds per call, qwen3.6-fast averaged 0.21 seconds. The substrate cost differs by roughly a factor of twelve across the two models for the same question. The byte output is identical. Compute path and final answer are decoupled at this scale of question.
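The factor of twelve follows directly from the two averages quoted above. A quick check, using only those two figures (the variable names are ours):

```python
# Mean wall time per call, as reported in the text.
gemma4_seconds = 2.6
qwen_seconds = 0.21

ratio = gemma4_seconds / qwen_seconds
print(f"{ratio:.1f}x")  # → 12.4x
```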
what this proves and what it does not
Proven: for the four uniquely determined, narrow-answer questions we tested, two different MoE architectures running on the same substrate at temperature zero produce byte-identical output. Substrate variance is not a structural barrier to cross-model byte-identity at this scale of question. Architectural variance is not either.
Not proven: that this holds for longer outputs. We already know from our code-generation experiments that variance enters once the answer space widens. The Fibonacci-function prompt produced bounded but non-trivial variance INTRA-model. Cross-model on that prompt would almost certainly DIVERGE because different models have different stylistic priors for code, different formatting habits, different control-flow defaults. Cross-model byte-identity is a property of the question class, not a universal.
Not proven: that arbitrarily many models would all agree. We tested two. A third or fourth might disagree on one of these prompts — gemma4 might prefer 'fifty-six' instead of '56' if you phrase the question differently. The result is N=2 for cross-model architectures, N=10 for trials per cell.
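Extending the comparison past two models means pairwise equality is no longer enough; one useful view is to partition the models by the exact bytes they returned. The sketch below does that. The `partition_by_output` helper and the third model, `model_c`, are hypothetical: `model_c` and its 'fifty-six' reply are invented purely to illustrate a dissenting model and are not a measured result.

```python
from collections import defaultdict
from typing import Dict, List


def partition_by_output(replies: Dict[str, str]) -> List[List[str]]:
    """Group model names by the exact bytes of their reply, largest group first."""
    groups = defaultdict(list)
    for name, text in replies.items():
        groups[text.encode("utf-8")].append(name)
    # Sort names inside each group, then order groups by size (ties by name).
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Illustrative replies only; "model_c" is a made-up third model (assumption).
replies = {"qwen3.6": "56", "gemma4": "56", "model_c": "fifty-six"}
groups = partition_by_output(replies)
print(groups)  # → [['gemma4', 'qwen3.6'], ['model_c']]
```

A single group means full cross-model byte-identity for that prompt; more than one group shows exactly which models broke ranks.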
Not proven: that the agreement is causal in any deep sense. Both models were trained on overlapping web corpora that contained the canonical answers to these questions. The agreement might be a property of the training data more than a property of the substrate. We can't disentangle that here.
why this matters for the line
Our prior work named a threshold — the line where tuning ends — beyond which substrate-induced variance breaks byte-identity even at temperature zero. Below the line, tuning suffices. Above the line, a proof-driven correspondence layer is required.
This experiment places one class of prompt empirically below the line. For prompts whose acceptable answer set has cardinality one (questions where exactly one byte-string is correct), multiple architectures converge to that byte-string. The substrate's job is structurally completable. Below the line means: even ACROSS model architectures, the substrate hits the unique answer.
This bounds the regime where the verification gate's HIGH/HIGH verdict is structurally well-defined. If two architectures agree byte-perfect, the substrate has internalized the question — the variance we'd otherwise need a proof layer to handle simply isn't there. The Correspondence Law's antecedent (cost-of-wrong > cost-of-verification) is moot in this regime because there's nothing to verify against.
what we learned
| lesson | why it matters |
|---|---|
| Cross-model byte-identity at T=0 exists for narrow-answer prompts | Substrate variance is not the structural barrier to cross-architecture agreement that intuition suggests. |
| This class of prompt sits below the line | Operations whose acceptable answer set has cardinality one sit deeply below the line; multiple architectures will hit it. |
| Compute cost and output bytes decouple at the narrow end | Gemma4 costs about 12x as much wall time as Qwen3.6-fast but produces the same byte-string for these prompts. |
| The variance you fear is mostly in the answer-space width, not the substrate or the architecture | When you can NARROW the answer set, you move below the line. Operations engineering can do this by tightening prompts and asserts. |
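
The last lesson, narrowing the answer set with tighter prompts and asserts, might look like the sketch below. Both helpers are hypothetical: `tighten` shows one possible wording for a constraining suffix (not a wording we tested), and `assert_unique_answer` is a plain byte-equality gate for operations whose acceptable answer set has cardinality one.

```python
def tighten(prompt: str) -> str:
    """Hypothetical prompt wrapper that asks for the bare answer only."""
    return f"{prompt}\nReply with the answer only: no preamble, no punctuation."


def assert_unique_answer(reply: str, expected: str) -> str:
    """Pass the reply through only if it matches the single acceptable byte-string."""
    if reply.encode("utf-8") != expected.encode("utf-8"):
        raise AssertionError(f"expected {expected!r}, got {reply!r}")
    return reply


# Usage with a canned reply standing in for a model call (assumption).
prompt = tighten("Compute 7 times 8")
checked = assert_unique_answer("56", "56")
```

The gate does no normalization on purpose: if '56.' should be acceptable, the answer set has cardinality greater than one and the operation belongs to a different regime.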