four rungs of recovery · how the loop learned to retry

Desai, Tej

Intuition Labs · φ research · 20 · measured · 19 may 2026

four rungs of recovery · how the loop learned to retry

An autonomous code-writing loop writes a function, runs the tests, and gets a verdict. When the verdict is FAIL, what should happen? Over a stretch of work, the substrate accumulated four distinct recovery mechanisms — each catching a different kind of failure at a different layer. Three are reactive: they trigger after the model emits a complete response. The fourth is proactive: the model can call a verification tool mid-generation and rewrite if needed. All four are now shipped and default-on. This note explains what each rung catches, why, and what it cost.

three reactive layers + one proactive · all four shipped, default-on

3 reactive rungs (proxy · solver · cascade) + 1 proactive rung (tool-use) = 4 · each rung catches a different shape of failure

the takeaway in one paragraph

A code-writing model fails in several distinct ways. Sometimes it gives up mid-generation (writes a long preamble, never gets to the answer). Sometimes it writes wrong code that the test catches. Sometimes the writing-style itself is the problem (thinking-off-the-cuff misses a subtle case that thinking-out-loud would have caught). And sometimes the model would benefit from running its own tests before finalizing. Each of these has a different fix. We built four — one for each — at different layers of the stack. Three react after the model speaks; the fourth lets the model verify and self-correct while it's writing. All four are now in place and default-on.

rung 1 · the proxy notices the model gave up

Our control plane (the proxy that sits between the user and the model) watches the model's confidence token-by-token as the response streams. When the model writes prose for a long time without getting to the actual answer, the proxy halts the response and emits a structured signal: finish_reason=no_answer. The signal is a named-halt, not an error: it tells the caller "the model never started answering."

The proxy then quietly retries the same prompt with the opposite generation mode flipped and a hint suppressing prose. The model writes the answer this time about half of the cases. The user sees a longer wall-time but a clean PASS. The retry is invisible to the user except for a small badge in the response telling them the substrate escalated.

This is rung 1: the lowest layer can catch "the model wandered into preamble." It happens entirely inside the proxy, before the verifier gets a chance to run.

rung 2 · the solver retries when the verifier fails

The model emits a complete response. The verifier runs the code against the test asserts. The answer is wrong. What now?

Rung 2 sits at the solver layer. When the verifier returns FAIL, the solver re-prompts the model with the same problem AND the previous attempt AND the assertion failure details. The model sees "you wrote X; assertion expected Y; rewrite." Usually the model self-corrects on the second pass — it had the right idea but missed an edge case. This rung fires up to twice (default N=2) before giving up.

Rung 2 is the most-used recovery in practice. About 10% of solve attempts trigger it (often on subtle off-by-one or edge-case misses), and roughly a third of those recover within the retry budget.

rung 3 · the solver escalates the mode

Rung 2 sometimes hits its retry budget without recovering. The two retries gave the model the failing assertion, but it still couldn't fix the code. This is usually because the model was in a fast-write mode (thinking-off) and the problem really needed thinking-out-loud.

Rung 3 fires when rung 2 exhausts on a thinking-off model. It makes one more attempt with thinking-on enabled — same prompt, same retry context, but the model takes more wall time to reason explicitly. About half of the rung-3 attempts recover problems that rung 2 missed.

Rung 3's cost is real: one thinking-on call is ~60 seconds vs ~10 seconds for thinking-off. We only fire it when the cheaper paths have already failed, which keeps the average wall manageable.

rung 4 · the model verifies its own work

Rungs 1, 2, and 3 are all reactive — they fire AFTER the model emits a complete response. The fourth rung is different: it puts verification IN the model's generation loop.

The model gets one tool: run_tests(code). When the model has a candidate solution, it calls this tool with the full code as an argument. The tool runs pytest against the actual test files and returns pass/fail counts plus the names of failing tests. The model sees the result and can rewrite if needed, then call the tool again, and so on. When the tool returns PASS, the loop exits — the substrate has direct verification that the code works.

Rung 4 is HEADROOM. Our current problem set is canonical algorithm problems (BFS, Dijkstra, LRU cache, dynamic programming, etc.). The model writes correct code on the first try about 95% of the time without using rung 4 at all. The value of rung 4 surfaces when problems get harder — multi-function modules, performance-gated novel algorithms, or problems where the model's intuition would benefit from quick verification before committing.

the four-rung table

rung	trigger	response	layer
1	writer says no_answer mid-generation	retry with inverse mode + prose-suppress	proxy
2	verifier says FAIL post-generation	retry with feedback up to N=2	solver
3	rung 2 exhausted on thinking-off	one more attempt with thinking-on	solver-cascade
4	model decides mid-generation	call run_tests, see result, optionally rewrite	generation-time

Each rung catches a different shape of failure. They compose: a single solve attempt might invoke rung 1 (model wandered, retry inverse), then if the answer is still FAIL, rung 2 (retry with feedback), then if still FAIL, rung 3 (thinking-on). If the caller requested rung 4 explicitly, the model can call run_tests at any point during its own generation.

why this matters

A naive code-writing loop has two outcomes: PASS or FAIL. When it FAILs, the operator has to step in or the loop produces a useless record. Four rungs of recovery turn FAIL into a structured cascade — each step is logged, named, and gated. Operators can read the receipt and see WHERE the recovery happened: did the proxy catch it, the solver retry catch it, the mode escalation catch it, or did the model self-correct via the tool?

The pattern generalizes. When you're building an autonomous loop, recovery is a layered problem. You want fast catches close to the source (rung 1 in the proxy), then progressively more expensive recoveries (rung 2 retries, rung 3 mode escalation), and at the top, the model itself can verify its own work (rung 4). Each rung handles failures the rungs above couldn't catch. The substrate becomes more legible — failure modes have names, recoveries have shapes, the cascade is observable.

what the rungs cost

Rung 1 adds ~10s wall-time when it fires (a single retry). Rung 2 adds 10-30s per retry (one or two extra solver calls). Rung 3 adds ~60s (a thinking-on attempt is much slower). Rung 4's cost varies by how many tool calls the model makes — typically 1 call (~8-15s including pytest), occasionally 2-3 if the model wants to verify multiple times.

All four are default-on. Operators don't need to opt in; the substrate fires them automatically based on the conditions above. Receipts capture which rungs fired and what they recovered. The aggregate wall increases only on the cases that needed recovery; clean cases pass through without overhead.

what we are still figuring out

The four-rung ladder is shipped, but its empirical value depends on the problem distribution. On our current canonical-algorithm bench, the model writes correct code first-try about 95% of the time, so the rungs rarely fire. The rungs WILL matter as problems get harder — multi-function modules, performance-gated novel algorithms, problems where the model's intuition would benefit from quick verification. We're queuing a harder problem set to measure each rung's contribution under genuine difficulty.

Rung 4 specifically: the model has to be told the tool exists, what to pass, and what NOT to pass. Our first attempt had the model interpret run_tests as a code-execution sandbox; it called the tool with a file-listing script instead of the solution. Three small fixes (tighter description, test content in the prompt, helpful error message for empty code) addressed it. Tool descriptions are a real design space: what to pass, what NOT to pass, explicit state semantics. The model's intuition about tools is informed by training data, and disambiguating the contract matters.