the box that built itself

Desai, Tej

Intuition Labs · φ research · 29 · measured · 19 may 2026

Months ago we found, by hand, that for canonical coding prompts the fast mode of our model works better than the deliberative one. This week the system reached the same answer by itself, by looking at its own past answers, and cross-validation then confirmed that it was right. Leave-one-out cross-validation on two hundred and thirty-five prior solves shows that when the system's preferred model matched what the operator actually ran, the success rate was eighty-six percent. When it didn't match, it was forty-one percent: roughly twice the success rate when the recommendation lines up with usage. The system now makes that choice on its own when its confidence is high enough.

when the system's recommendation matched what was used: 86% success · when it didn't match: 41%

discrimination = P(success | recommendation_match) − P(success | mismatch) = 0.86 − 0.41 = +0.45 · measured over 235 leave-one-out predictions against the existing solve corpus
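A minimal sketch of that subtraction, assuming each leave-one-out prediction is stored as a row with the recommended model, the model actually used, and the pass/fail outcome. The `Prediction` class, its field names, and the toy rows are illustrative assumptions, not the real receipt schema or corpus.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Prediction:
    # Hypothetical row shape; the real receipts may carry more fields.
    recommended_model: str  # what the lookup would have picked
    used_model: str         # what the operator actually ran
    passed: bool            # whether the asserts passed


def success_rate(rows: List[Prediction]) -> float:
    if not rows:
        raise ValueError("cannot compute a success rate over zero rows")
    return sum(r.passed for r in rows) / len(rows)


def discrimination(predictions: Iterable[Prediction]) -> float:
    """P(success | recommendation match) minus P(success | mismatch)."""
    rows = list(predictions)
    matched = [p for p in rows if p.recommended_model == p.used_model]
    mismatched = [p for p in rows if p.recommended_model != p.used_model]
    return success_rate(matched) - success_rate(mismatched)


# Toy data, invented for illustration: 3 of 4 matched rows pass,
# 1 of 4 mismatched rows pass.
toy = (
    [Prediction("fast", "fast", True)] * 3
    + [Prediction("fast", "fast", False)]
    + [Prediction("fast", "deliberative", True)]
    + [Prediction("fast", "deliberative", False)] * 3
)
print(discrimination(toy))  # 0.75 - 0.25 = 0.5
```

The function raises if either group is empty, since the difference is undefined without both a matched and a mismatched population.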

the takeaway in one paragraph

We found a thing by hand. Then the system found the same thing by itself. Then we measured that the system was right. Then we let the system act on what it knew. Four steps. Each step is a small idea — write down past answers, look them up by similarity, count what worked, automate the lookup. Together they form a loop that closes: the system's own track record becomes the input to the system's next decision. The empirical bite is the calibration: when the system says 'use this model' with high confidence, that model is the one that actually succeeds, at roughly the rate the system claimed. Not always. But more often than the baseline by enough that flipping the choice automatically is worth doing.

step one · the finding by hand

Months ago we ran a small experiment. We had two ways to invoke the same model: the fast path, which produces a direct answer immediately, and the deliberative path, which first thinks through the problem out loud and then answers. We ran both on a small batch of canonical coding prompts — write a Fibonacci function, count vowels in a string, that kind of thing. The fast path won. Not by a small margin. The fast path solved the prompts more often and faster. The deliberative path was occasionally producing pages of reasoning that ended without a code block at all.

We wrote that down in our internal logs and we made fast-path the default for these problem classes. That was the operator finding. It came from the operator running a careful comparison and noticing.

step two · the system noticed by itself

Every time the system answers a coding prompt, it writes a small receipt. The prompt, which model handled it, how long it took, whether the asserts passed. Two hundred and thirty-five of these had accumulated. We built a small lookup tool: take a new prompt, find the ten most similar past prompts, and report what happened with those. Did they pass? Which model handled them? What was the success rate?
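A rough sketch of the receipt and the nearest-prompt lookup described above. The text does not say how similarity is computed, so this uses Jaccard overlap of word tokens as a stand-in. The `Receipt` field names, the `nearest` function, and the sample prompts are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Receipt:
    prompt: str     # the coding prompt
    model: str      # which model handled it
    seconds: float  # how long it took
    passed: bool    # whether the asserts passed


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    # Assumed similarity measure: shared tokens over all tokens.
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def nearest(prompt: str, receipts: List[Receipt], k: int = 10) -> List[Receipt]:
    """Return the k past receipts whose prompts are most similar."""
    ranked = sorted(receipts, key=lambda r: jaccard(prompt, r.prompt), reverse=True)
    return ranked[:k]


# Invented sample receipts, not taken from the real corpus.
receipts = [
    Receipt("write a fibonacci function", "fast", 1.2, True),
    Receipt("count vowels in a string", "fast", 0.9, True),
    Receipt("write a function for fibonacci numbers", "deliberative", 8.0, False),
]
top = nearest("write a fibonacci function in python", receipts, k=2)
print([r.prompt for r in top])
```

Any similarity measure that ranks past prompts would slot into `nearest` the same way; token overlap is simply the smallest thing that runs without extra dependencies.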

We pointed this lookup tool at the same kind of prompt we'd used in the original by-hand experiment: a Fibonacci-function-style prompt. The lookup pulled the ten nearest past solves. Five of them had used the fast path. All five had passed. Two of them had used the deliberative path. The remaining three had errored out before producing code at all. The lookup tool's recommendation: use the fast path.

We hadn't told the lookup tool which model was faster or slower or what 'fast path' even meant. It found this by looking at past outcomes alone. The structure that the operator had figured out by hand was sitting in the receipts, and the lookup found it.
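The tally in this step can be sketched as a count of outcomes per model among the nearest receipts. The selection rule here (highest pass rate, ties broken by sample size) is an assumption, as are the function name and the toy neighbors. The text only says the recommendation came from past outcomes.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def recommend(neighbors: Iterable[Tuple[str, bool]]) -> Optional[Tuple[str, float, int]]:
    """neighbors: (model, passed) pairs from the nearest past receipts.

    Returns (model, pass_rate, n) for the best-scoring model, or None
    when there are no neighbors to learn from.
    """
    stats = defaultdict(lambda: [0, 0])  # model -> [passes, total]
    for model, passed in neighbors:
        stats[model][0] += int(passed)
        stats[model][1] += 1
    if not stats:
        return None
    # Assumed rule: prefer the higher pass rate, then the larger sample.
    best = max(stats, key=lambda m: (stats[m][0] / stats[m][1], stats[m][1]))
    passes, total = stats[best]
    return best, passes / total, total


# Invented neighbors: the tool never sees what "fast" means,
# only which label went with which outcome.
toy_neighbors = (
    [("fast", True)] * 5
    + [("deliberative", True)] * 2
    + [("deliberative", False)] * 3
)
print(recommend(toy_neighbors))  # ('fast', 1.0, 5)
```

A raw pass rate over one or two neighbors can look perfect by chance, which is the small-sample overconfidence the later calibration step is meant to catch.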

step three · we measured that the system was right

Anyone can build a lookup tool. The harder question is: is it actually right? We tested this by leave-one-out cross-validation. Take each of the two hundred and thirty-five past solves, hide it, ask the lookup tool to predict that solve's outcome using the other two hundred and thirty-four, and compare the prediction against what actually happened.
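The hide-one, predict-from-the-rest loop can be sketched generically. The predictor below is a deliberately simple placeholder (the pass rate of the same model among the remaining receipts) and is not the lookup tool itself. All names and the toy receipts are assumptions.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

R = TypeVar("R")
P = TypeVar("P")


def leave_one_out(
    receipts: Sequence[R], predict: Callable[[R, List[R]], P]
) -> List[Tuple[R, P]]:
    """Pair each receipt with a prediction made without seeing it."""
    pairs = []
    for i, held_out in enumerate(receipts):
        rest = [r for j, r in enumerate(receipts) if j != i]
        pairs.append((held_out, predict(held_out, rest)))
    return pairs


def pass_rate_predictor(held_out: Tuple[str, bool], rest: List[Tuple[str, bool]]) -> float:
    # Placeholder predictor: success probability = pass rate of the
    # same model in the remaining receipts, or 0.5 with no evidence.
    same_model = [passed for model, passed in rest if model == held_out[0]]
    if not same_model:
        return 0.5
    return sum(same_model) / len(same_model)


# Invented (model, passed) receipts.
toy_receipts = [("fast", True), ("fast", True), ("fast", False), ("deliberative", False)]
pairs = leave_one_out(toy_receipts, pass_rate_predictor)
print([prediction for _, prediction in pairs])  # [0.5, 0.5, 1.0, 0.5]
```

Each prediction is made from the other receipts only, so comparing it against the held-out outcome is a fair test of the lookup rather than a replay of what it already saw.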

measurement	value	reading
predicted success vs actual	±0.03 average gap	in the middle and high probability buckets, the system's predicted success rate is within three percentage points of the actual success rate
recommendation match → success	86.2%	when the system's preferred model was the one the operator ran, success rate was eighty-six percent
recommendation mismatch → success	40.6%	when the operator ran a different model than the system would have picked, success rate was forty-one percent
rate gap	+45 percentage points	matching the recommendation roughly doubled the success rate
squared-error improvement over baseline	+12%	the system's probability estimates reduce mean squared error by twelve percent relative to a constant-prediction baseline

The middle and high probability buckets are where the system is well-calibrated. The very-small-sample tail of recommendations — those drawn from one or two prior matches — is overconfident, exactly as we'd expected. The system's own confidence label correctly flags those as VERY-LOW or LOW; the operator sees the warning before deciding. This is the system being honest about what it doesn't know yet, while still being useful where it does.
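For readers who want to reproduce the two calibration rows on their own data, here is a hedged sketch. It assumes the squared-error measure is the mean squared error of predicted probabilities (the Brier score) and that the constant baseline predicts the overall pass rate. Both are assumptions, as are the bucket edges and the toy numbers.

```python
from typing import List, Sequence, Tuple


def brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def improvement_over_constant(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Relative reduction in squared error versus always predicting the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    baseline = brier([base_rate] * len(outcomes), outcomes)
    if baseline == 0:
        raise ValueError("baseline error is zero; all outcomes are identical")
    return 1 - brier(probs, outcomes) / baseline


def calibration_table(
    probs: Sequence[float], outcomes: Sequence[int], edges: Sequence[float] = (0.0, 0.5, 1.0)
) -> List[Tuple[float, float, int, float, float]]:
    """Rows of (low, high, n, mean predicted, observed rate) per bucket."""
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        is_last = hi == edges[-1]  # the top bucket includes its upper edge
        idx = [i for i, p in enumerate(probs) if lo <= p < hi or (is_last and p == hi)]
        if not idx:
            continue
        mean_p = sum(probs[i] for i in idx) / len(idx)
        observed = sum(outcomes[i] for i in idx) / len(idx)
        rows.append((lo, hi, len(idx), mean_p, observed))
    return rows


# Invented predictions and outcomes.
toy_probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
toy_outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
print(improvement_over_constant(toy_probs, toy_outcomes))
for row in calibration_table(toy_probs, toy_outcomes):
    print(row)
```

The gap between the last two columns of each row is the per-bucket calibration error. A bucket with very few rows is exactly where that gap is least trustworthy.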

step four · the system acts

The leave-one-out test cleared the gate. We added an automatic rule: when the operator runs a prompt and asks for a recommendation, and the recommendation has at least six similar past matches with confidence above seventy percent, the system picks that model on its own. The operator sees a one-line note explaining what happened — 'auto-override → model X (n=8, confidence 0.81)' — and the receipt records the decision so we can audit it later. If the operator passed an explicit model on the command line, the system defers; operator authority always wins.

The threshold (six prior matches, seventy percent confidence) is not a guess. It came from the cross-validation: that's exactly the range where the system's predictions stop being overconfident and start matching reality. Below that, the recommendation is still surfaced — but as advice, not as action.
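The gate itself is small enough to sketch. The thresholds and the override note format come from the text above. The `Recommendation` class, the `choose_model` function, the `default_model` parameter, and the wording of the advice note are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MIN_MATCHES = 6        # at least six similar past matches
MIN_CONFIDENCE = 0.70  # confidence must be above seventy percent


@dataclass(frozen=True)
class Recommendation:
    model: str
    n: int             # number of similar past matches
    confidence: float  # predicted success probability


def choose_model(
    rec: Optional[Recommendation],
    explicit_model: Optional[str] = None,
    default_model: str = "fast",  # assumed default, for illustration
) -> Tuple[str, Optional[str]]:
    """Return (model to run, one-line note for the operator or None)."""
    # Operator authority always wins.
    if explicit_model is not None:
        return explicit_model, None
    if rec is None:
        return default_model, None
    if rec.n >= MIN_MATCHES and rec.confidence > MIN_CONFIDENCE:
        note = f"auto-override → model {rec.model} (n={rec.n}, confidence {rec.confidence:.2f})"
        return rec.model, note
    # Below the gate the recommendation is surfaced as advice only.
    note = f"advice only: model {rec.model} (n={rec.n}, confidence {rec.confidence:.2f})"
    return default_model, note


print(choose_model(Recommendation("X", 8, 0.81)))
print(choose_model(Recommendation("X", 8, 0.81), explicit_model="Y"))
print(choose_model(Recommendation("X", 2, 0.95)))
```

Checking the explicit choice first keeps the deferral rule unconditional: no combination of sample size and confidence can override an operator who named a model.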

why this matters

The four steps form a complete loop. The operator finds a thing by trial. The system records the trials. The system later notices the pattern. We measure that the noticing is calibrated. The system starts acting on what it noticed. The loop closes. Future findings — the ones we haven't made by hand yet — should flow through this same path automatically, provided the receipts accumulate and the patterns are real.

This is recursive self-improvement at a small but operational scale. The system is not training new weights. It is not generating its own next training set. It is doing something simpler and arguably more durable: it is reading its own track record, comparing new requests to past ones, and adjusting its next move based on what worked. The same machinery generalizes: any choice the system makes that produces a receipt becomes feedback for the next time that choice arises.

The reason we wrote this down is that the moment is, in our experience, structurally meaningful and easy to miss. Most software systems do not learn from their own outputs. The ones that do tend to require carefully engineered feedback loops and significant training infrastructure. The version we built is small — a similarity search over text receipts, with a couple of statistical safety guards — and it captured a finding we had originally made by hand. Operators reading their own systems' logs should know that this kind of loop is approachable without much machinery.

what we learned

lesson	why it matters
Receipts that the operator writes for one purpose become training data for another	The receipts were originally written for forensic debugging. They turned out to be the right corpus for learning model preferences.
Calibrated honesty about uncertainty is more useful than confident point estimates	The system's confidence label flagged small-sample tails as VERY-LOW or LOW. The cross-validation later confirmed that those tails were overconfident. The label was right.
Discrimination is the headline; calibration is the gate	Two times the success rate when recommendation matches usage is the value statement. The calibration measurements are how we know the threshold to act on.
Operator authority over substrate authority	When the operator explicitly chooses a model, the system defers. Automatic overrides only fire when the operator didn't express a preference. The substrate adds without subtracting.
