each model picks its own canonical

Desai, Tej

Intuition Labs · φ research · 28 · measured · 19 may 2026

Our prior cross-model test showed two different architectures produce byte-identical output for short narrow-answer prompts. The companion question: what happens on a wider-answer prompt where multiple valid implementations exist? We ran the same Fibonacci-function prompt through Qwen3.6 35B-A3B and Gemma 4 26B-A4B at temperature zero, ten trials each. Cross-model byte-identity was zero out of ten trials. But the within-model behavior was unexpected: Gemma 4 converged to a single canonical attractor across all ten trials, perfectly deterministic, while Qwen3.6-fast produced three bounded cosmetic variants. The variance is between models, not within them, and each model picks a DIFFERENT valid algorithm as its canonical.

cross-model bytes diverge but each model converges internally on a different valid algorithm

cross-model byte-overlap = 0 · intra-model K varies · algorithm-class is partially shared · each architecture has its own attractor preference for code-gen at T=0

the takeaway in one paragraph

On a Fibonacci-function prompt at temperature zero, two local MoE architectures never produced byte-identical output. The story below the headline matters more. Gemma 4 picked one canonical implementation and produced it in ten out of ten trials with perfect byte-identity. Qwen3.6 in its single-pass mode produced three bounded variants of a different canonical implementation, with one dominant variant occurring eight out of ten times. Both canonicals are correct algorithms. Each architecture has its own attractor for this question. The variance lives between models, not within them. The substrate is doing the same disciplined thing each time; the models just disagree on which disciplined thing to do.

the result

model	K (distinct attractors)	modal attractor	modal-cluster size	avg wall-clock time
qwen3.6-fast (35B-A3B)	3	Algorithm A bare (if/elif/return b, no fences)	8/10	1.4s
gemma4 (26B-A4B)	1	Algorithm B fenced (range(n)/return a)	10/10	31.9s

Gemma 4 hit one attractor and stayed there for all ten trials. Qwen3.6-fast had three attractors, but they were all cosmetic variants of one algorithm: same logic, varying only in whether the function appeared in a markdown code fence and whether the base-case branch used else: or elif. The modal qwen attractor occurred 8 of 10 times.
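The K and modal-cluster columns can be computed from raw completions with an exact-match tally. The following is a minimal sketch, not our harness: `attractor_stats` is a hypothetical helper name, and the trial lists are placeholder strings shaped like the table (8/1/1 and 10/0), not the recorded outputs.

```python
from collections import Counter

def attractor_stats(outputs):
    """Return (K, modal_output, modal_count) for a list of raw completions.

    Two completions count as the same attractor only if the strings are
    exactly equal, so a stray fence or newline creates a new attractor.
    """
    counts = Counter(outputs)
    modal_output, modal_count = counts.most_common(1)[0]
    return len(counts), modal_output, modal_count

# Placeholder data (assumption): stand-ins for ten completions per model.
qwen_trials = ["variant-1"] * 8 + ["variant-2", "variant-3"]
gemma_trials = ["variant-B"] * 10

print(attractor_stats(qwen_trials))   # (3, 'variant-1', 8)
print(attractor_stats(gemma_trials))  # (1, 'variant-B', 10)
```

Exact string equality is the strictest possible grouping, which is why cosmetic differences show up as separate attractors in the K column.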

two different valid algorithms

The modal qwen output:

def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

The gemma output (all ten trials identical):

```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Both pass the standard Fibonacci asserts (fib(0)=0, fib(1)=1, fib(10)=55). They differ in three ways. Qwen handles the base cases with explicit if/elif branches, while gemma lets the loop cover them. Qwen iterates over range(2, n + 1) and returns b, while gemma iterates over range(n) and returns a, which holds fib(n) after n shifts. Qwen omits the markdown code fence; gemma always includes it. Both are common Fibonacci-function shapes you'd see in a tutorial.
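The agreement between the two shapes can be checked directly. In this sketch the functions are copied from the two outputs above and renamed `fib_qwen` and `fib_gemma` (hypothetical names, used only so both can live in one file); the upper bound of 200 is an arbitrary choice, not part of the original test.

```python
def fib_qwen(n):
    # Modal qwen shape: explicit base cases, loop from 2 to n, return b.
    if n == 0:
        return 0
    elif n == 1:
        return 1
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

def fib_gemma(n):
    # Gemma shape: no base cases, n shifts, return a.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# The standard asserts named in the text.
for f in (fib_qwen, fib_gemma):
    assert f(0) == 0 and f(1) == 1 and f(10) == 55

# Agreement on every non-negative input up to an arbitrary bound.
assert all(fib_qwen(n) == fib_gemma(n) for n in range(200))
print("both implementations agree on 0..199")
```

One caveat the asserts do not cover: for negative n the two shapes diverge (the qwen shape returns 1, the gemma shape returns 0), so the equivalence holds only on the non-negative inputs a Fibonacci prompt normally implies.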

The remarkable thing is the WITHIN-model behavior. Both models are internally consistent at temperature zero: gemma byte-for-byte, qwen up to cosmetic variants of one algorithm. Each picks a canonical and sticks to it, but the canonical differs across architectures. The substrate's job (produce A Fibonacci function) is completable for both; each just landed on a different element of the acceptable answer set.

comparing this to the narrow-answer result

Our prior cross-model test (note 27) showed byte-identical agreement on four short prompts: 'reply with OK,' 'compute 7×8,' etc. The acceptable answer set for those has cardinality one. Multiple architectures converged on it perfectly.

This test is the wide-answer companion. The acceptable answer set for 'write a Fibonacci function' is large — any correct implementation is in. The cross-model byte-agreement drops to zero. But each model still picks ONE attractor from that set with high internal consistency. The variance migrates from 'within model' (which is what naive intuition expects at code-gen length) to 'between models.'

This is the line we've been characterizing in our prior work. Below the line (narrow answer set), cross-architecture byte-identity. Above the line (wide answer set), each architecture has its own preferred attractor and the bytes don't match — but the algorithm-class often does. Both fib implementations above are correct. Both are in the acceptable answer set. Both architectures sit just above the line for this prompt class, on different preferred points.

why this matters

Three operational consequences. First, when you need byte-determinism on code, choose one model and stay with it. Cross-model code-gen DOES NOT byte-agree, even with the same prompt at temperature zero. Multi-model voting on raw bytes is structurally broken at this output length.

Second, when you need ALGORITHMIC determinism on code, the within-model behavior is the dominant signal. Gemma 4 was PERFECTLY deterministic in this test — one attractor, ten of ten trials. If you want one canonical Fibonacci, gemma will give it to you every time. Qwen3.6-fast was bounded but not single-attractor; its variance is cosmetic but still byte-variance.
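Fence-only byte-variance can be collapsed before comparing outputs. The sketch below is one possible normalization, not something we ran in this test: `strip_fence` and `canonical_form` are hypothetical helpers, and the fingerprint is Python's `ast.dump`, which ignores layout and comments. Whether a branch-structure variant collapses depends on how it parses, so this only guarantees that fenced and bare copies of the same code compare equal.

```python
import ast
import re

TICKS = "`" * 3  # markdown fence marker, built indirectly to keep this example fence-safe
FENCE = re.compile(r"^" + TICKS + r"[\w-]*\n(.*?)\n" + TICKS + r"$", re.DOTALL)

def strip_fence(text):
    """Drop one surrounding markdown fence if present; otherwise return trimmed text."""
    text = text.strip()
    match = FENCE.match(text)
    return match.group(1) if match else text

def canonical_form(text):
    """Layout-independent fingerprint: the AST dump of the fence-stripped code."""
    return ast.dump(ast.parse(strip_fence(text)))

bare = (
    "def fib(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
)
fenced = TICKS + "python\n" + bare + TICKS + "\n"
other = (
    "def fib(n):\n"
    "    if n == 0:\n"
    "        return 0\n"
    "    elif n == 1:\n"
    "        return 1\n"
    "    a, b = 0, 1\n"
    "    for _ in range(2, n + 1):\n"
    "        a, b = b, a + b\n"
    "    return b\n"
)

print(bare == fenced)                                   # False: the bytes differ
print(canonical_form(bare) == canonical_form(fenced))   # True: same code once the fence is gone
print(canonical_form(bare) == canonical_form(other))    # False: different algorithm shape
```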

Third, multi-model voting at the SEMANTIC level (do both models produce CORRECT code, even if different shapes?) is structurally well-defined. Both architectures landed on valid Fibonacci implementations. An aggregator that checks 'does this code pass fib(0)=0 and fib(10)=55?' gets two correct candidates that happen to differ in style.
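A correctness-level aggregator of that kind might look like the following. This is a hedged sketch under stated assumptions: `passes_fib_checks` and `semantic_vote` are hypothetical names, candidates are assumed to define a function called `fib`, and `exec` on model output is unsafe outside a sandbox (it also will not stop an infinite loop).

```python
TICKS = "`" * 3  # markdown fence marker, built indirectly to keep this example fence-safe

def strip_fence(text):
    """Drop one surrounding markdown fence if present."""
    lines = text.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith(TICKS) and lines[-1].strip() == TICKS:
        lines = lines[1:-1]
    return "\n".join(lines)

def passes_fib_checks(source):
    """Run the candidate and apply the asserts named in the text."""
    namespace = {}
    try:
        exec(strip_fence(source), namespace)  # assumption: trusted or sandboxed input
        fib = namespace["fib"]
        return fib(0) == 0 and fib(1) == 1 and fib(10) == 55
    except Exception:
        return False

def semantic_vote(candidates):
    """Keep every candidate that passes, regardless of its byte shape."""
    return [c for c in candidates if passes_fib_checks(c)]

qwen_style = (
    "def fib(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n"
    "    a, b = 0, 1\n    for _ in range(2, n + 1):\n        a, b = b, a + b\n    return b\n"
)
gemma_style = (
    TICKS + "python\ndef fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n"
    "        a, b = b, a + b\n    return a\n" + TICKS + "\n"
)
wrong = "def fib(n):\n    return n\n"  # illustrative failing candidate

print(len(semantic_vote([qwen_style, gemma_style, wrong])))  # 2
```

Byte-level voting over the same three candidates would find no majority at all; the correctness check keeps both valid shapes and drops only the wrong one.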

what we learned

lesson	why it matters
Cross-model byte-divergence at code-gen length is structural, not noise	Different architectures have different preferred attractors. The disagreement is repeatable, not random.
Within-model determinism can be perfect even at code-gen length	Gemma 4 hit K=1 on Fibonacci across ten trials. Substrate variance is not inevitable at this length; it depends on the model.
The line shifts across architectures	Qwen3.6-fast had K=3 cosmetic variance; gemma had K=1. The "tuning vs proof" line is at slightly different positions for different models.
Algorithm-class is a more useful equivalence than byte-identity	Both architectures produce correct Fibonacci. Voting on bytes loses this; voting on algorithm-class or correctness keeps it.
