the chip that stays awake
The big GPU does the heavy thinking and then sleeps. The little neural chip beside it can stay awake all the time — so we taught it to run a real language model's math and checked the answer. This is the first piece of an always-on brain that lives on the box.
the takeaway in one paragraph
A modern mini-PC has two AI chips. The GPU is the big muscle: fast, power-hungry, and idle between jobs. The NPU is a small low-power chip that can stay on all the time. If you want a box that feels alive (always present, always listening), the always-on chip is the natural home for its brain stem. So we asked the basic question: can the little chip actually run a real model's math? It can. One feed-forward block of a real language model now runs on the NPU, matches a CPU reference, holds itself ready in a resident process, and computes in about thirty milliseconds (about fifty end to end), comfortably real-time. It is one block of many, and the rest is the road ahead.
two chips, one brain
The heavy lifting in language models is matrix multiplication, and the GPU is built for it. But a GPU draws a lot of power and is meant to be summoned, do its job, and stop. That is the wrong shape for presence. A digital companion that is supposed to always be there cannot spin the big engine every second.
The NPU — the neural processing unit — is the other chip. It is small, sips power, and is designed to stay on. Phones use it for always-listening wake-words. Our bet for the box: the NPU is the brain stem (always live, light, ready) and the GPU is the cortex (summoned for heavy thought). The first thing the brain stem has to prove is that it can run the model at all.
what a block actually is
A language model is mostly the same block stacked many times. Each block has two parts. One is attention (which words look at which). The other is the feed-forward — the part that does most of the arithmetic. We started with the feed-forward, because it is the simplest piece that is still real.
A feed-forward block is three big multiplications with one small step between them. Take the running thought (a list of numbers). Multiply it two ways at once — a 'gate' and an 'up'. Pass the gate through a soft on/off curve and combine the two. Multiply the result back down to the original size. That is the whole block:
gate = x · Wg # big multiply
up = x · Wu # big multiply
h = silu(gate) * up # soft gate, then combine element by element
y = h · Wd # big multiply, back down
The model we targeted is LFM2, the on-device model behind a small speech assistant. Its feed-forward turns a 2048-number thought into an 8192-number scratch space and back. We built each of those multiplications to run on the NPU, then chained them together.
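The four lines of the block can be sketched in plain Python to make the shapes and the one element-wise step explicit. This is a minimal reference sketch, not the NPU kernel: the helper names `matvec` and `feed_forward` are ours, the matrices are nested lists, and the sizes in the usage note are toy values rather than the 2048 and 8192 of the real model.

```python
import math

def silu(v):
    """The soft on/off curve: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def feed_forward(x, Wg, Wu, Wd):
    gate = matvec(x, Wg)                          # big multiply
    up = matvec(x, Wu)                            # big multiply
    h = [silu(g) * u for g, u in zip(gate, up)]   # element-wise gate and combine
    return matvec(h, Wd)                          # big multiply, back down
```

With `x` of length 2, `Wg` and `Wu` of shape 2x3, and `Wd` of shape 3x2, the output has length 2 again: the block widens the vector, gates it, and narrows it back.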
does it get the right answer?
We ran the whole block on the NPU and compared its output to the same math done carefully on the CPU. The two line up: the direction of the answer matches to about ninety-nine percent.
Why direction and not exact equality? The NPU works in a 16-bit number format (bf16) that trades a little precision for speed, so we never expect bit-for-bit agreement; we expect the answer to point the same way. The honest check for hardware like this is alignment. We learned that the hard way: a fixed-tolerance closeness check briefly reported a false success, because the model's output values are tiny and the tolerance was larger than the values themselves. Switching to a direction check caught it.
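The failure mode is easy to reproduce. The sketch below assumes that "direction" means cosine similarity between the two output vectors; the vectors and the tolerance of 1e-3 are made-up illustrative values, not measurements from the NPU.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 unrelated, -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def close_with_fixed_tolerance(a, b, atol):
    """The check that fooled us: every element within an absolute tolerance."""
    return all(abs(x - y) <= atol for x, y in zip(a, b))

reference = [1e-4, -2e-4, 3e-4, -1e-4]          # tiny values, like the block's output
sign_flipped = [-v for v in reference]           # a completely wrong answer
slightly_rounded = [1.01e-4, -1.98e-4, 3.03e-4, -0.99e-4]

print(close_with_fixed_tolerance(reference, sign_flipped, atol=1e-3))  # passes, wrongly
print(cosine_similarity(reference, sign_flipped))                      # close to -1
print(cosine_similarity(reference, slightly_rounded))                  # close to +1
```

The fixed tolerance accepts the sign-flipped vector because every value is smaller than the tolerance itself, while the direction check separates the wrong answer from the merely rounded one.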
To be sure the small differences came from the number format and not a wiring mistake, we measured one multiply on its own: it agrees with the CPU to within about six percent. Four of them chained give a difference of about seventeen percent, which is in line with the budget you would predict from stacking four low-precision steps. Right answer, expected rounding, no bug.
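A back-of-the-envelope version of that budget, as a sketch under stated assumptions: the six and seventeen percent figures come from the text, but the two combination rules (errors adding in quadrature if independent, linearly if fully aligned) and the assumption that the error is roughly orthogonal to the reference vector are ours.

```python
import math

per_step_error = 0.06   # one multiply versus the CPU (from the text)
steps = 4               # multiplies chained (from the text)
observed_error = 0.17   # whole chain versus the CPU (from the text)

# Assumption: independent errors add in quadrature, fully aligned errors add linearly.
quadrature_bound = math.sqrt(steps) * per_step_error   # 0.12
linear_bound = steps * per_step_error                  # 0.24

# Assumption: the error vector is roughly orthogonal to the reference,
# so a relative error e corresponds to a cosine of 1 / sqrt(1 + e^2).
implied_cosine = 1.0 / math.sqrt(1.0 + observed_error ** 2)

print(quadrature_bound, linear_bound, round(implied_cosine, 3))
```

Under these assumptions the observed figure sits between the two bounds, and a seventeen percent relative error corresponds to a direction match of roughly ninety-nine percent.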
a precision lesson worth keeping
One of the three multiplications failed at first — off by thirteen percent, far beyond rounding. The cause is specific and general at once. The chip keeps its running total in high precision, then writes it back to the low-precision format after each chunk of the sum. The model's last multiply sums over a very long dimension (8192 terms), so it writes back many times, and the roundings pile up.
The fix is to cut the long sum into shorter pieces, run each as its own multiply, and add the pieces in high precision. Split it in two and the excess error vanishes. The rule we wrote down: on low-precision hardware, keep each accumulation short. What costs you accuracy is the number of write-backs, not the size of the matrix.
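The effect can be reproduced with a toy simulation. This is not the NPU's actual accumulation scheme: the chunk size of 8, the length of 4096, and the all-ones inputs are assumptions chosen so the result can be checked by hand, and `to_bf16` emulates bf16 rounding by keeping the top 16 bits of a float32.

```python
import struct

def to_bf16(x):
    """Round a float to the nearest bf16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_with_writeback(a, b, chunk):
    """Sum each chunk in high precision, then write the running total back as bf16."""
    total = 0.0
    for start in range(0, len(a), chunk):
        partial = sum(x * y for x, y in zip(a[start:start + chunk], b[start:start + chunk]))
        total = to_bf16(total + partial)   # one write-back per chunk
    return total

def dot_split(a, b, chunk, pieces):
    """Run shorter dot products separately, then add the pieces in high precision."""
    step = len(a) // pieces
    return sum(dot_with_writeback(a[i:i + step], b[i:i + step], chunk)
               for i in range(0, len(a), step))

a = [1.0] * 4096
b = [1.0] * 4096
print(dot_with_writeback(a, b, chunk=8))   # 2048.0: the total stalls once 8 is below bf16 spacing
print(dot_split(a, b, chunk=8, pieces=2))  # 4096.0: each half stays in the exact range
```

In this toy case the long sum stalls at 2048 because bf16 values above that are spaced 16 apart, so adding 8 rounds back down. Each half of the split sum stops before it reaches that range, which is why the shorter accumulations come out exact.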
always live, not always computing
Proof of life is not a chip that spins constantly. It is a process that is always up, holding the model and ready to answer the instant it is asked. So the brain stem is a small resident program: it loads the model onto the NPU once, then waits on a socket, using essentially no power until a request arrives.
load the model onto the NPU once # the weights live on the chip
wait for a request # idle — no spinning, no power
on "run": compute, reply # answer on demand
Waking it from idle costs about four milliseconds: a blink, far below anything a conversation would notice. We checked: it held the model and its state for the better part of an hour at zero compute, answered instantly each time it was poked, and ran alongside a second resident process without contention. Always alive means the state is always held and ready; the chip itself can rest between requests.
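The resident loop above can be sketched with Python's standard `socket` module. Everything specific here is an assumption for illustration: the line-based text protocol, the Unix-domain socket, and the `load_model` and `handle` helpers, whose "compute" is a placeholder multiplication rather than a call into the NPU.

```python
import socket

def load_model():
    """Hypothetical stand-in for the one-time upload of weights to the NPU."""
    return {"scale": 2.0}

def handle(model, request):
    """Answer one request line. Only the 'run' command is understood."""
    parts = request.split()
    if parts and parts[0] == "run":
        try:
            inputs = [float(p) for p in parts[1:]]
        except ValueError:
            return "error: bad number"
        # Placeholder compute: a real service would invoke the NPU kernels here.
        return " ".join(str(model["scale"] * v) for v in inputs)
    return "error: unknown command"

def serve_connection(conn, model):
    """Block on one connection; the thread sleeps in the kernel between requests."""
    with conn, conn.makefile("rw") as stream:
        for line in stream:
            stream.write(handle(model, line.strip()) + "\n")
            stream.flush()

def serve_forever(path):
    model = load_model()   # load once, then keep the state resident
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(path)
        server.listen()
        while True:
            conn, _ = server.accept()   # idle here until a client connects
            serve_connection(conn, model)
```

Calling `serve_forever` with a socket path of your choosing starts the loop. The point of the design is that `accept` and the read loop block without polling, so the process holds its state but consumes no compute until a request arrives.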
fast enough to feel alive
With the weights living on the chip, one feed-forward block runs in about thirty milliseconds of actual NPU time, and about fifty milliseconds end to end including the housekeeping around it. A conversational turn has roughly a hundred to three hundred milliseconds to feel instant. One block sits comfortably inside that.
| what | time | note |
|---|---|---|
| NPU compute · one block | ~31 ms | the three big multiplications |
| end to end · one block | ~50 ms | including the soft-gate step on the CPU |
| conversational budget | 100 – 300 ms | the bar to feel instant |
An earlier version looked ten times slower, until we noticed almost all of that time was copying the model's weights to the chip on every single call. Loading them once, at startup, was the fix. The headline number is the chip's real cost; the rest was our own overhead, now removed.
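One way to spot this kind of overhead is to time the upload and the compute separately instead of timing the call as a whole. The sketch below is a generic harness, not our benchmark code: `timed` and `profile_calls` are hypothetical helpers, and `upload` and `compute` are whatever callables stand in for the weight transfer and the kernel launch.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, totals):
    """Add the wall-clock time spent inside the block to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] = totals.get(label, 0.0) + (time.perf_counter() - start)

def profile_calls(upload, compute, calls, upload_once):
    """Run `calls` requests, uploading weights either once or on every call."""
    totals = {}
    if upload_once:
        with timed("upload", totals):
            weights = upload()
    for _ in range(calls):
        if not upload_once:
            with timed("upload", totals):
                weights = upload()
        with timed("compute", totals):
            compute(weights)
    return totals
```

Comparing `totals["upload"]` with `totals["compute"]` for both settings of `upload_once` shows directly how much of each call is transfer rather than arithmetic.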
what is not done yet
This is one block — the feed-forward — of one layer. A full layer also needs attention, and the full model is sixteen layers. We have started attention: its multiplication parts already run and check out on the NPU; the part that decides which words attend to which is in progress.
We also ran the block on placeholder weights to prove the math, not the trained weights of the real model. Loading the real weights and checking the output against the reference implementation, end to end, is the next gate. Only after that gate clears can we say the trained model itself is running on the chip — today we have proven its shape and its arithmetic.
The destination is the whole model living on the always-on chip: the box's persistent brain, present and ready, with the big GPU summoned only when the thought gets heavy. This note is the first block of that brain, measured and verified. The next notes are the rest of the layer, the real weights, and the seam between the always-on voice and the heavy reasoning behind it.
the substrate
AMD Strix Halo workstation: a Ryzen AI Max+ 395 with an XDNA2 NPU and 128 GB of unified memory. The model is LFM2.5-Audio. The kernels are built with the open MLIR-AIE / Peano toolchain — no proprietary blobs — and everything we wrote is AGPL. Running on the open path is a requirement here, not a preference: the brain has to be ours all the way down.