Intuition Labs · φ research · 30 · measured · 19 may 2026

the box that learned to read itself

Note 29 left the system picking a model for the operator when its confidence was high enough. Seven more steps followed. The system found that arithmetic, fibonacci, and graph search each have their own signal strength in the recommendation. One threshold across all of them was the wrong shape. Tuning per problem type gave a result we rarely see in practice: the picker fires eighty-eight percent more often and succeeds at a higher rate on those picks. Then the system started ranking its own past answers by training-value, and watching itself for the failure mode researchers call generalization collapse. The alarm fired once. By hand we found the cause was a different failure — the wrong model being chosen on prompts where that model always fails — not real capability loss. We taught the system to make that same distinction itself.

more action AND better accuracy · 88% more automatic model picks · 96.5% → 98.0% success rate on those picks
class-aware gate fires: 60 → 113 (+88%) · followed-success: 96.5% → 98.0% (+1.5pp)  ·  measured on a counterfactual replay against the same 235-row corpus from note 29

the takeaway in one paragraph

Note 29 ended with the system acting on its own learning. This note covers what happened next. Seven small refinements, each one building on the last. The headline finding is that one rule wasn't right for every kind of coding problem — arithmetic, recursive math, string manipulation, graph search and dynamic programming each have a different signal strength in the recommendation. Tuning the threshold per type produced a result we rarely see in practice: more automatic picks AND a higher success rate on those picks. Then the system started watching itself in ways that detect both real capability loss and the more common failure mode of using the wrong tool on the right problem. The system can now tell the operator which of those is happening.

step one · not all problems are the same

The recommendation system from note 29 had a single confidence threshold. Above it, the system would pick the model automatically. Below it, the recommendation was advice only. The threshold was set conservatively based on a measurement averaged across the entire corpus.

We stratified the same measurement by problem type. The result was striking. On simple arithmetic problems, matching the recommendation to actual usage produced a forty-seven percentage-point success-rate gap. On string-manipulation problems, the gap was ninety-four points. On algorithmic problems with array operations, the gap was a full hundred points — every single matched recommendation succeeded; every mismatched one failed. On graph and dynamic-programming problems, the gap was barely a few points, sometimes negative.

The single threshold was too conservative for some classes and about right for others. The averaged measurement had been smoothing over real per-class variance.

step two · a rulebook instead of a rule

We swept the threshold per problem type and found the natural break for each: arithmetic and graph and dynamic-programming kept the original conservative threshold; string and array-algorithm got a lower threshold (the discrimination there is so sharp that lower confidence is still safe); recursive math sat in between. The change was small in code — a lookup table indexed by problem type instead of a single number — and large in behavior.

gate versionautomatic pickssuccess rate on those picks
one threshold for everything (note 29)6096.5%
one threshold per problem type (this note)11398.0%
change+88%+1.5 percentage points

Eighty-eight percent more automatic picks at a higher success rate. This is the move you don't usually get. Most improvements trade more action for less accuracy or the other way around. Here both moved the same direction because the previous threshold was too cautious for the classes where the recommendation is sharpest.

step three · checking the other axis

After the per-type tuning the natural next question was whether model choice introduced another axis. The recommendation system picks a model. Maybe it picks one kind of model with sharper signal than another. We stratified the same data by recommended model.

The answer was: not really. Looking across the two hundred and thirty-five recommendations, the system picks the fast model in nearly all of them — about ninety-nine percent. The two percent of recommendations for the deliberative model never accumulate enough similar past matches to clear the action threshold; they stay as advice. Adding a per-model axis on top of the per-problem-type axis would add granularity nothing currently uses. We left the architecture at one axis.

The non-finding mattered. It told us the per-problem-type tuning is the right level of granularity. Adding more dimensions would have been speculative.

step four · picking its own homework · watching for forgetting

Every past answer the system has stored carries a few signals: what the system predicted would happen, what actually happened, the problem type, how confident the system was. We wrote a small ranking function over these signals: examples where the system was confidently wrong score highest (the gradient is sharpest there); examples where the system was surprised by a success on a prompt it expected to fail score next; problems from a rare category — ones the system hasn't seen many of — score high too; problems that are already grokked, where the system is confidently right, score lowest.

About forty percent of the corpus came back marked worth studying further. Fifteen rows were flagged confidently-wrong — the highest-value training signal we can produce from existing logs.

We added one more thing to the same pass. For each problem type, the function compares the system's success rate on the most recent eight attempts against the previous eight. If the recent rate has dropped by more than fifteen percentage points, the function raises an alarm. The literature calls this anti-grokking — the failure mode where a model that was working at a task starts working worse at it, sometimes collapsing back to baseline. We wanted the system to notice if that started happening to itself.

step five · the alarm and the investigation

The watcher fired its first alarm on the string-manipulation problem type. The most recent eight rows for that type had a fifty-percent success rate; the previous eight had seventy-five. A twenty-five-point drop. According to the rule we'd written, this looked like the system was forgetting something it used to know.

We investigated by hand. The recent eight rows had four successes and four failures. We looked at the four failures. All four had used the deliberative model — the slow one — on canonical string-manipulation prompts. All four had stopped before producing a function definition. The success rows had used the fast model and succeeded in under a second each.

The system had not lost the ability to handle string problems. The recent window simply asked the deliberative model to handle them more often — twice as often as in the older window. The deliberative model has always failed on these prompts. The operator finding from a prior note told us this directly. So the 'capability loss' the watcher had detected was a different failure: the wrong model being chosen on the right problems.

Once we understood this we realized the recommendation system from note 29 should have prevented it. The recommendation always picks the fast model for string-manipulation prompts. But the recommendation only gets consulted when the operator runs a particular flag. For invocations without that flag, the operator can manually choose any model and the recommendation never weighs in. The recent four failures had bypassed the system entirely.

step six · teaching the system to investigate its own alarms

The investigation we did by hand has a sharp structure. When an alarm fires, look at the failures in the recent window. Ask: do they all use the same model? If yes — and at least one other model also ran in this window — the cause is model-mismatch, not capability loss. The remediation: make sure the recommendation system is consulted on every call. Stopping training would be the wrong move.

We codified that structure as a test. Take the most recent window. Count the failures by model. If three quarters or more of the failures live in a single model, and at least two models were used in this window, the cause is model-mismatch. Otherwise check whether the problem complexity shifted (longer prompts, different test counts). Otherwise treat the alarm as real capability decline.

causewhat triggered itwhat to do
model-mismatchthe failures clump up in one model and at least two models were usedmake sure the recommendation system is consulted on every call
prompt-complexity-shiftthe prompts got measurably longer or test counts changedaccept that the problem set changed and recalibrate
real capability declineneither of the above; same model on same kind of prompt losing accuracystop training on this problem type until the cause is found

We re-ran the new test against the string-manipulation alarm. The classifier said: model-mismatch, recommended action is to make sure the recommendation system is consulted on every call. The same diagnosis we had reached by hand. The watcher now classifies its own alarms.

step seven · the tools become commands

The training-value ranker and the alarm-with-diagnosis were both internal scripts during the seven steps above. To use them an operator would have to know which file to run. We promoted both to first-class commands inside the system's existing tool surface. There's now a 'value' subcommand that ranks the past answers by training-value and writes the top examples to a curriculum file. There's a 'sentinel' subcommand that reports per-problem-type status with cause-typed diagnosis and remediation prescriptions. Both subcommands produce machine-readable output as well as human-readable output, so other parts of the system can call them too.

The result is that the system's self-observation layer became part of the operator's daily-driver toolkit instead of living in a scripts folder. Anyone running the system can ask the system what it has noticed about itself.

why this matters

The Pareto-frontier moment is the most concrete payoff. The same recommendation system, the same model, the same hardware — but tuned per problem type instead of with a single threshold — fires eighty-eight percent more often AND succeeds at a higher rate. Most engineering changes trade off; this one didn't. The reason it didn't is that the previous threshold was too cautious for some classes and about right for others. We had the right average and the wrong distribution. Stratification surfaced this.

The self-watching layer is the more durable contribution. The system now has explicit machinery for noticing when its behavior is drifting — and explicit machinery for distinguishing between the kind of drift that means 'something has gotten worse' and the kind that means 'something is being used wrong.' These need opposite remediations. A naive watcher that only knew about the first kind would prescribe the wrong action half the time.

And the chain itself, viewed from outside, is what we hoped for. Note 29 was four steps: hand-finding, system-noticing, measuring, acting. This note adds seven more: stratifying, tuning, confirming, ranking, raising-and-investigating-an-alarm, automating-the-investigation, productizing. Eleven steps total. Each step is small. None required new training. None required architectural redesign. The receipts the system writes are the substrate; the rest is reading them carefully.

what we learned

lessonwhy it matters
One threshold across problem types hides the per-type signalThe averaged measurement was the right average and the wrong distribution. Stratifying surfaced a Pareto-frontier move that had been sitting in the data.
More action and better accuracy can both move the same directionWhen the previous setting was over-cautious for some cases, loosening it on those cases lifts both. Most engineering tradeoffs aren't tradeoffs — they're mis-set defaults.
The literature's named failure mode is not the only one that looks like itAnti-grokking is real and worth watching for, but the same surface signal (a class regressing) can also mean the right model isn't being chosen. Two causes, two remediations.
Sometimes the right answer is "no change needed"Stratifying by recommended model produced a non-finding. That confirmed the per-problem-type axis was the right level of granularity. Confirming the architecture is correct is itself useful work.
Productize the internal tools when they're validatedThe training-value ranker and the alarm-with-diagnosis lived as scripts for a few iterations. Once they were stable, moving them into the daily-driver command surface made the self-observation layer routine instead of esoteric.

links