the rest of the layer
Note 33 ran one block of a real model on the little always-on chip. This is the rest of a layer — attention, a second kind of mixer, and the two stitched into a whole layer — plus an honest look at where the chip's 16-bit math does and does not hold up.
the takeaway in one paragraph
Note 33 put one block — the feed-forward — of a real language model on the NPU, the small always-on chip. A full layer has more: a mixer that decides which words look at which, and (in this model) a second, lighter mixer. Both now run on the NPU, and a whole layer stitches together on it. The honest part: the chip's 16-bit number format wobbled on a worst-case stress test built from random numbers, then held cleanly once we used the model's real trained weights — for the feed-forward and, after we added the real attention machinery, for attention too. We show both the wobble and the recovery rather than skip past either.
what attention is
Attention is the part that lets each word look at the others. For every word it scores how much to attend to each other word (a big multiply of 'queries' against 'keys'), softens those scores into weights that sum to one (the softmax), and mixes the words' 'values' by those weights.
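The three steps above (score, soften, mix) can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the code that runs on the chip: the function names, the 1/sqrt(dim) scaling, and the absence of masking are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head attention over lists of vectors (no masking, for illustration)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)             # conventional scaling, assumed here
    out = []
    for q in queries:
        # Score this word's query against every word's key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        # Soften the scores into weights that sum to one.
        weights = softmax(scores)
        # Mix the values by those weights, one output channel at a time.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)
    return out
```

When every key scores the same, the weights are equal and the output is the plain average of the values; when one key matches the query strongly, its value dominates.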
This model uses a cost-saving variant, grouped-query attention. Thirty-two 'query' heads share eight 'key/value' heads, so the key/value half of the work is a quarter the size. Real models lean on tricks like this; the chip has to handle them as they are.
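The sharing comes down to a small piece of index arithmetic. The sketch below uses the head counts from this note; the assumption that consecutive query heads share a key/value head is the common layout, but the note does not state the model's actual ordering, and `kv_head_for` is a name made up for the example.

```python
N_QUERY_HEADS = 32                        # from the note
N_KV_HEADS = 8                            # from the note
GROUP = N_QUERY_HEADS // N_KV_HEADS       # 4 query heads per key/value head

def kv_head_for(query_head):
    """Index of the key/value head that a given query head reads from.

    Assumes consecutive query heads share one key/value head.
    """
    if not 0 <= query_head < N_QUERY_HEADS:
        raise ValueError("query head out of range")
    return query_head // GROUP
```

Under this layout, query heads 0 to 3 read key/value head 0, heads 4 to 7 read head 1, and so on up to query head 31 reading key/value head 7.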
attention on the chip
We ran the whole thing on the NPU: the three projections that produce queries, keys, and values; the per-head score-and-mix across all thirty-two heads; and the output projection that recombines them. The softmax in the middle runs on the CPU for now, since it is a tiny step, while the heavy multiplies run on the chip. Checked against the same computation on the CPU, the chip's output points in the same direction.
the other mixer
Here is something many people don't expect: this model is a hybrid. Most of its layers use no attention at all. They use a small convolution — a three-tap filter that slides along the words, one filter per channel — which is cheaper than attention and good at local patterns. Ten of the sixteen layers are this kind; six are attention. We built the convolution mixer on the NPU as well (its two big projections on the chip, the little filter on the CPU). So both kinds of layer can now be assembled.
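The little filter itself is simple enough to write out. Below is a minimal sketch of a three-tap, one-filter-per-channel convolution along the words. It assumes the filter is causal (each word sees itself and the two words before it, with zeros before the start), which the note does not state, and it leaves out the two big projections around it. The name `depthwise_conv3` is hypothetical.

```python
def depthwise_conv3(x, taps):
    """Three-tap filter sliding along the words, one filter per channel.

    x:    list of words, each a list of channel values.
    taps: one [t0, t1, t2] filter per channel.
    Assumed causal: output t mixes words t-2, t-1 and t.
    """
    n_channels = len(taps)
    zero = [0.0] * n_channels
    padded = [zero, zero] + x                # pad the past with zeros
    out = []
    for t in range(len(x)):
        window = padded[t:t + 3]             # words t-2, t-1, t
        out.append([sum(taps[c][i] * window[i][c] for i in range(3))
                    for c in range(n_channels)])
    return out
```

Each channel is filtered on its own, so the cost grows with the number of words times the number of channels, with no word-against-word score table as in attention.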
a whole layer
A layer is a fixed recipe: normalize, mix (attention or convolution), add the input back (a 'residual' connection), normalize again, feed-forward, add again. We stitched the attention mixer together with the feed-forward from note 33 into exactly that, and ran the whole layer on the NPU. The normalizes and the adds are tiny CPU steps between the chip's heavy multiplies.
h = x + attention( norm(x) ) # mix, then add the input back
out = h + feed_forward( norm(h) ) # feed-forward, then add again
the honest number
The chip does its multiplies in bf16, a 16-bit format that trades a little precision for speed. We stress-tested a full layer with random weights, and the answer drifted: its agreement with the exact 32-bit result fell to about 93%, when we want ~99.9%. We tracked the cause down to a long chain of multiplies — each rounding to 16 bits — with the sharp softmax magnifying small score errors.
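To get a feel for the rounding, bf16 can be imitated on any machine by keeping only the top 16 bits of a 32-bit float. The sketch below does that and adds a cosine measure for comparing a rounded result with an exact one. Two assumptions to flag: the note does not say how "agreement" was measured, so cosine similarity is a guess, and rounding after every multiply and every add is deliberately pessimistic (real hardware often accumulates in a wider format). All function names are made up for the example.

```python
import math
import struct

def to_bf16(x):
    """Round a float to the nearest bf16 value (ties to even), returned as a float.

    NaN and infinity handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                    # round to nearest, ties to even
    bits &= 0xFFFF0000                                     # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_dot(a, b):
    """Dot product that rounds to bf16 after every multiply and every add."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_bf16(to_bf16(x) * to_bf16(y))
        acc = to_bf16(acc + prod)
    return acc

def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)
```

Building an output vector once with `bf16_dot` and once with an ordinary dot product, then passing both to `cosine`, gives a rough picture of how much a chain of 16-bit roundings bends the answer.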
That was alarming until we swapped in the model's REAL trained weights. Random weights are the worst possible case: the scores are noise and the softmax magnifies the noise. Real weights are smoother — the feed-forward matched the exact math to about 99%. We then checked the harder part, attention, the same way: with the full attention machinery (the rotary position encoding, the per-head normalizers, the causal masking) on the real weights, the 16-bit attention also matched the exact math to about 99%. So 16 bits are enough for this model's layer; the stress test had simply found the worst case. Two deeper checks remain — matching the reference implementation exactly, and confirming it holds across all sixteen layers stacked — and if 16 bits ever fall short there, the fix is known (keep the sensitive step in 32-bit).
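The way softmax magnifies rounding error can be put in numbers. bf16 keeps 8 significant bits, so the gap between neighbouring values grows with the size of the number, and because softmax exponentiates, a score that is off by d shifts that word's weight relative to the others by a factor of e^d. The sketch below works that out for a given score size. It is a back-of-the-envelope bound with hypothetical function names, not a measurement from the chip.

```python
import math

def bf16_spacing(x):
    """Gap between adjacent bf16 values near x (8 significant bits)."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - 7)

def worst_weight_ratio_error(score):
    """Largest factor by which one rounding of `score` can skew a softmax weight
    relative to the others: exp(half the bf16 gap at that magnitude)."""
    half_gap = bf16_spacing(score) / 2
    return math.exp(half_gap)
```

Near 1.0 the gap is 2^-7 and the skew is well under one percent. Near 64 the gap is 0.5, so a single rounding can skew a weight by a factor of e^0.25, roughly 1.28. Large, noisy scores are therefore the hard case, and the softmax is the natural step to keep in 32-bit.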
what's next
Two things. Load the real model's trained weights for all sixteen layers and confirm, layer by layer, that the chip reproduces the reference implementation. Then wire the always-on chip to the voice: the small chip holds the conversation and the state, and the big GPU is summoned only when the thought gets heavy.
That seam — a light, ever-present voice in front of heavy reasoning behind — is the thing we are building toward. This note is the layer; the next ones are the whole model and the voice.
the substrate
AMD Strix Halo workstation: a Ryzen AI Max+ 395 with an XDNA2 NPU and 128 GB of unified memory. The model is LFM2.5-Audio. Kernels built with the open MLIR-AIE / Peano toolchain — no proprietary blobs — and everything we wrote is AGPL.