Intuition Labs · φ research · 16 · measured · 18 may 2026

five PRs the autopilot wanted to merge

we built a small code reviewer to triage twenty open PRs. it disagreed with the auto-merge policy on five of them. two of those five had local-verdict-FAIL stamped into the PR body while CI showed green.

5 of 20 disagreed · 2 had local-FAIL written in their own bodies
auto_merge ⇐ CI ∧ model_read ∧ size ∧ label  ·  four signals must all agree before action · adding the model signal eliminated the 25% false-positive rate

the takeaway in one paragraph

We built a small command-line tool that reviews open pull requests. About two seconds per PR on a local model. It reads the diff, the CI status, and the PR description, and decides: merge, request changes, close, or kick to a human. Alongside the model we keep a YAML file with declarative merge rules — the first version said auto-merge anything with green CI, the right label, and a small diff. We pointed both at twenty open PRs an agent had been generating. They agreed on fifteen. They disagreed on five. Two of those five had "local verdict: FAIL" written into the PR body itself, with CI showing green anyway. The naive policy would have shipped the bugs. We changed the rule to require the model's read to agree before any auto-action.

what we built

The command-line tool is one verb of our local agent stack: `codebox review <pr#>`. It pulls the pull request from GitHub — title, body, status checks, diff — and asks the local model one structured question: based on everything you can see, what should we do? Four allowed answers. Merge if the code is correct and CI is green. Request changes if you spotted issues. Close if the PR shouldn't ship at all. Defer to a human if the call is too close to make automatically. The model also returns a confidence number from 0 to 1 and a one-sentence reason. The whole decision goes into a JSONL receipt so it can be audited later.

Alongside the model we keep a small YAML file: the policy. The policy is a list of rules with simple conditions — CI status, change size, file count, labels, age, author. The first rule we wrote said: auto-merge any pull request where CI is green, the change is under 200 lines, the file count is at most five, and the label is `codebox-experiment`. Four conditions, all structural metadata. No reading required. This is roughly how every CI bot we've used historically works. Looks great on a dashboard.

what we ran it on

We had twenty open PRs sitting at our codebox-arena repo. An autonomous agent had been working through a list of canonical coding problems — reverse a string, balanced brackets, longest common prefix, Kadane's algorithm, N-Queens count, Dijkstra, edit distance — and pushing each solution as a labeled PR. Every PR had a green check. Every PR was small (most under 50 lines). Every PR had the right label. By the naive policy, every one of them merges.

Total wall time to review all twenty: about forty-five seconds. Mean model confidence: 0.97. The local model is fast enough that the cost of asking it the question is essentially free in human terms. The cost of NOT asking it the question, as we were about to learn, was not free.

what came back

agreementPRswhat the policy would have donewhat the model said
both say merge15mergemerge
model says request changes5merge (wrong!)request changes

Three of the five disagreements were the kind of thing a careful code reviewer would catch and CI wouldn't. An LRU cache implementation with a TTL feature that loses thread safety under load and walks the entire dictionary on every size() call. A function named `shortest_path` that turns out to be BFS while the PR title says Dijkstra. An implementation of a different edit-distance variant than the one the problem statement asked for. None of these failed CI. All of them are real defects the model spotted by reading the actual code rather than trusting the test runner.

the two surprises

The other two disagreements were stranger. We opened the PR bodies — the descriptions the agent itself had written when creating each pull request — and found, in two of them, the literal text: **"local verdict: FAIL (8/9)"** on one, **"FAIL (1/7)"** on the other. The pull request's own description, written by the same system that wrote the code, was admitting that the local bench had failed. The CI in GitHub showed green anyway.

Either the local bench and the CI test suite were measuring different things, or one had stopped reporting accurately. Either way, the PR was self-incriminating. The naive policy didn't care — labels matched, CI was green, change was small, merge. The model read the body and immediately said: this is FAIL; do not merge. Two minutes of additional reading caught two pull requests that were telling on themselves in plain English.

This is the unsettling part. Both of these PRs had been sitting in the open queue for days. Neither human nor autopilot had read the descriptions carefully enough to notice the contradiction. The local model — running locally, in two seconds, with no context beyond what GitHub serves over its API — caught it on the first pass.

what we changed

We added one condition to the auto-merge rule: the model also has to think the PR is a merge, with confidence at least 0.90. CI being green is no longer enough on its own. The model reading the diff has to agree.

Mathematically this is a parallel composition: two independent signals must both fire before any action. The state where "CI green but model says no" no longer flows to merge — it falls to "defer to a human." When we re-ran the policy against the same twenty PRs under the new rule, the five disagreement cases all correctly deferred. The fifteen agreement cases still flowed through cleanly. The naive policy's 25% false-positive rate dropped to zero on this batch.

The model can also be wrong, of course. Requiring both signals to agree is not a guarantee of correctness; it is a guarantee that one signal alone cannot trigger a merge. The decision shifts from "trust CI" to "trust the intersection of CI and a second independent read." Intersections are smaller than either input. That's the point.

what this is for

The actual value of a code-reviewer agent isn't in the cases where every signal agrees. It's in the cases where two independent signals disagree. CI green tells you the tests pass. The model tells you the code matches what was claimed. When both fire, ship. When they don't, somebody has to look.

In a busy review queue this turns "twenty PRs sitting open" into "five of them need eyes, fifteen are ready to ship." The five come with the model's reasoning on why each one is worth a second look. The fifteen come with a green light from two independent reads. That's triage the operator didn't have before — and the cost of producing it is roughly one minute of local compute time across the whole queue.

The autonomy ladder still has gates above this. Today this tool only suggests; it never merges, closes, or comments on PRs. The auto-action path is gated behind a documented policy file, a reversal CLI, and an audit of at least ten operator-reviewed decisions. We are two of three gates closed; the third is the operator engaging with the disagreement queue and recording their actual calls so the model's calibration is measured against ground truth before any action runs unsupervised.

also in the air

Disagreement-detection as a structural pattern is older than we are. Test ensembles vote. Two-out-of-three failover protects safety-critical systems. Redundant sensors on a spacecraft are scored against each other before either is trusted. The shape is always the same: two independent signals are stronger evidence than one stronger signal, because their failure modes are usually independent.

We're using it for PR review because that's where our own loop was getting stuck. The substrate had been opening pull requests for days; the human had been opening none of them. Adding a second read of the diff broke the deadlock. The same structure should help anywhere a single "passing" signal is being asked to make a high-stakes call without a second opinion.