composition is not additive
two rules that each help the agent alone can fight each other when they fire together. we measured it and the fight depends on the prompt.
the takeaway in one paragraph
If you build coding agents, you have rules that fire on every request. You probably tested each rule on its own and confirmed it helps. The bad news: when two rules that each help alone fire together on the same kind of prompt, they can fight. We measured a case where adding a second helpful rule roughly tripled wall time on the prompts where both rules fired. The good news: the fight depends on what the prompt is. Same pair of rules, different prompt, no fight at all. That is not a bug to fix. It is a fact about how rules combine, and it changes how you should ship them.
The action: stop assuming your rules add up. Measure them in pairs, not just individually. Especially measure them on the prompts where you expect both to fire.
the everyday problem
Imagine you have two safety nets for your agent. The first: if the model runs out of room, retry with more space and more thinking. The second: have a small companion model speed up the main one by predicting easy tokens ahead. Each one, tested alone, helps.
Turn them both on. On easy prompts, no problem. On hard prompts where BOTH fire, the agent slows to a crawl. Each retry inherits the companion-model overhead, and the retries that should have rescued the hard cases now cost five times the wall time of doing nothing.
This is not a unique pathology of our stack. It is the structural shape of combining anything that triggers conditionally. The trigger sets overlap. Where they overlap, the effects compound. The cost-of-compounding is the part nobody measures because measuring rules in isolation is so much cheaper.
what we measured
We ran our agent on a small batch of coding problems twice — once with just the retry rule, once with retry plus the companion model. Same problems, same seeds.
| condition | problems solved | median wall | mean wall |
|---|---|---|---|
| retry only | 4 of 8 | 142 s | 534 s |
| retry plus companion model | 5 of 8 | 149 s | 1090 s |
One extra problem solved. Median wall barely moved. Mean wall doubled. The +1 solve is real — one problem went from FAIL to PASS — but on the four problems where both rules fired together, the wall went up about three times. On the problems where only one rule fired, things were fine.
Two rules. Each helpful alone. Together, on the prompts where they meet, they cost much more than the sum of the parts. The cost-of-compounding has a number. It is large.
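The ratios behind "barely moved" and "doubled" can be checked directly from the table. This is a minimal sketch; the dictionary and variable names are ours, and the numbers are copied from the two table rows.

```python
# Figures copied from the table above (wall times in seconds).
retry_only = {"solved": 4, "median_wall": 142, "mean_wall": 534}
retry_plus_companion = {"solved": 5, "median_wall": 149, "mean_wall": 1090}

# Ratio of the combined condition to the retry-only condition.
median_ratio = retry_plus_companion["median_wall"] / retry_only["median_wall"]
mean_ratio = retry_plus_companion["mean_wall"] / retry_only["mean_wall"]

print(f"median wall ratio: {median_ratio:.2f}")  # 1.05
print(f"mean wall ratio:   {mean_ratio:.2f}")    # 2.04
```

A median that moves about 5% while the mean doubles means the extra cost sits in a few slow problems, not across the whole batch. That is consistent with the blow-up being confined to the problems where both rules fired.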
the rule itself
Most code-agent design implicitly assumes that the effect of two rules together is the sum of their individual effects. That mental model is wrong. The right model has a third term, the interaction, and the third term is the one that matters when both rules are firing.
Three things the interaction term can be, and we have seen all three in practice:
- Zero. Rules truly compose. Most pairs, most prompts. Real but boring.
- Negative (synergy). The pair beats the sum. Rarer. Worth shipping when found.
- Positive (wall blow-up). The pair costs more than the sum of the parts. The hidden tax we measured.
And the sign of the interaction depends on the prompt. The same two rules can be wall-blow-up on a hard prompt and zero on an easy one. There is no global verdict for a pair of rules; there is a verdict per prompt class.
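One way to make the third term concrete is the standard interaction contrast of a 2x2 on/off design. The sketch below assumes you also have a baseline run with neither rule on, which the table in this note does not include. The function names and every wall time are hypothetical, chosen only to show the sign changing with prompt class.

```python
def interaction(cost_none, cost_a, cost_b, cost_ab):
    """Interaction term of a 2x2 on/off design, in cost units.

    Additive prediction = baseline + effect of A alone + effect of B alone.
    Interaction = observed cost with both rules on, minus that prediction.
    """
    effect_a = cost_a - cost_none
    effect_b = cost_b - cost_none
    additive_prediction = cost_none + effect_a + effect_b
    return cost_ab - additive_prediction


def classify(term, tolerance=0.0):
    """Map an interaction term onto the three cases listed above."""
    if term > tolerance:
        return "positive (wall blow-up)"
    if term < -tolerance:
        return "negative (synergy)"
    return "zero (additive)"


# Hypothetical wall times in seconds, one 2x2 design per prompt class.
walls = {
    "easy": dict(cost_none=100, cost_a=90, cost_b=80, cost_ab=70),
    "hard": dict(cost_none=400, cost_a=300, cost_b=350, cost_ab=900),
}

for prompt_class, cells in walls.items():
    term = interaction(**cells)
    print(prompt_class, term, classify(term))
# easy 0 zero (additive)
# hard 650 positive (wall blow-up)
```

Computing the term per prompt class, not once over the whole batch, is what keeps a large positive term on hard prompts from being averaged away by many zero terms on easy ones.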
why this is hard to find on your own
We tried building a tool to find good rule combinations automatically. The first version looked at every request the agent had ever served, clustered them by 'which rules fired,' and asked: 'which rule, if you removed it, would lower the average cost?'
The first answer it gave us: 'remove the rule that prepends failing-test output on retry.' We benched it. The pass rate dropped from 7-of-8 to 5-of-8 and the wall went up almost four times. The tool's suggestion was directionally wrong.
Root cause: the rule it suggested removing fires preferentially on RETRIES. Retries are slow even when the rule helps, because retries are slow on hard prompts by nature. The tool saw the correlation 'rule present and slow' and inferred 'rule causes slow.' That is selection bias, not causation. The rule was riding the same prompts the slowness was riding.
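The mechanism is easy to reproduce on toy data. The call log below is invented for illustration and is not our data: a rule fires only on hard prompts, helps there, and still looks slow in a per-call average.

```python
from statistics import mean

# Hypothetical call log: (prompt difficulty, rule fired?, wall seconds).
calls = [
    ("easy", False, 10), ("easy", False, 12), ("easy", False, 11),
    ("hard", True, 200), ("hard", True, 220),
    ("hard", False, 400),  # a hard prompt served without the rule
]


def mean_wall(rows):
    """Mean wall time over a list of call records."""
    return mean(wall for _, _, wall in rows)


# Naive view: average every call with the rule against every call without it.
with_rule = [c for c in calls if c[1]]
without_rule = [c for c in calls if not c[1]]
naive_gap = mean_wall(with_rule) - mean_wall(without_rule)

# Like-for-like view: compare only within the hard prompts.
hard = [c for c in calls if c[0] == "hard"]
hard_gap = mean_wall([c for c in hard if c[1]]) - mean_wall(
    [c for c in hard if not c[1]]
)

print(naive_gap)  # 101.75 -> the rule "looks" slow
print(hard_gap)   # -190   -> on hard prompts the rule is faster
```

The naive gap is positive only because the rule's calls are all hard prompts while the comparison group is mostly easy ones. Holding difficulty fixed flips the sign in this toy case.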
Two more versions of the tool fixed this. First we compared rule-on and rule-off on the SAME exercise; same prompt on both sides cancels the prompt-difficulty noise. Same bad suggestion still surfaced. Then we changed the cost function from per-call cost to per-exercise outcome — count the cost if the exercise PASSED, count a fixed penalty if it FAILED. The suggestion vanished. Same data, different bias model, different answer.
Mining rule combinations is only as honest as the bias model in the cost function. We did not know which biases were load-bearing until we wrote them down and watched the suggestion change.
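The effect of switching cost functions can be sketched on paired runs. Everything here is an assumption for illustration: the penalty value, the function names, and the two exercises. `wall_only_cost` stands in for a cost that ignores the outcome, and `outcome_cost` is the "cost if PASSED, fixed penalty if FAILED" shape described above.

```python
from statistics import mean

FAIL_PENALTY_S = 3600  # hypothetical fixed penalty for a failed exercise


def wall_only_cost(passed, wall_s):
    """Cost that ignores the outcome: only wall time counts."""
    return wall_s


def outcome_cost(passed, wall_s, fail_penalty_s=FAIL_PENALTY_S):
    """Wall time if the exercise passed, a fixed penalty if it failed."""
    return wall_s if passed else fail_penalty_s


# Hypothetical paired results: the same exercise with the rule on and off.
# Each value is (passed, wall seconds).
paired_runs = {
    "ex1": {"on": (True, 300), "off": (False, 120)},  # off fails fast
    "ex2": {"on": (True, 150), "off": (True, 140)},
}


def removal_delta(runs, cost_fn):
    """Mean of cost(rule off) - cost(rule on) over the same exercises.

    Negative: removing the rule looks cheaper. Positive: removing it costs more.
    """
    return mean(cost_fn(*r["off"]) - cost_fn(*r["on"]) for r in runs.values())


wall_delta = removal_delta(paired_runs, wall_only_cost)
outcome_delta = removal_delta(paired_runs, outcome_cost)
print(wall_delta)     # -95  -> "remove the rule"
print(outcome_delta)  # 1645 -> "keep the rule"
```

In this toy case the wall-only cost rewards a fast failure, so removal looks like a win. Pricing the failure reverses the suggestion on the same data.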
what to take away
- Test rules in pairs, not just alone. The interaction is real and sometimes large.
- Measure on the prompts where you EXPECT both rules to fire. That is where the cost-of-compounding lives.
- If you build a tool to suggest removing rules from your stack, it must compare like-for-like on the SAME prompts. Per-call averages will lie to you because rules ride the prompts they were designed for.
- The cost function in your tool is itself a model of bias. The first cost function you write is the wrong one. Change it and see if the suggestion changes — if it does, the new cost function is honest about a bias the old one was hiding.
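A starting point for the first two items is to enumerate rule pairs and list the prompts where both are expected to fire. This is a minimal sketch: the rule names, prompt ids, and the `fires_on` mapping are all hypothetical, and in practice the mapping would come from your own trigger logs.

```python
from itertools import combinations

# Hypothetical: the set of prompt ids each rule is expected to fire on.
fires_on = {
    "retry": {"p3", "p5", "p7", "p8"},
    "companion_model": {"p1", "p2", "p3", "p5", "p7", "p8"},
    "prepend_failing_tests": {"p5", "p8"},
}


def overlap_plan(fires_on):
    """For every pair of rules, list the prompts where both fire.

    Pairs with an empty overlap are dropped: there is nothing to measure.
    """
    plan = {}
    for rule_a, rule_b in combinations(sorted(fires_on), 2):
        both = fires_on[rule_a] & fires_on[rule_b]
        if both:
            plan[(rule_a, rule_b)] = sorted(both)
    return plan


for pair, prompts in overlap_plan(fires_on).items():
    print(pair, prompts)
# ('companion_model', 'prepend_failing_tests') ['p5', 'p8']
# ('companion_model', 'retry') ['p3', 'p5', 'p7', 'p8']
# ('prepend_failing_tests', 'retry') ['p5', 'p8']
```

Each entry is a benchmark to run: the listed prompts, with both rules on, compared against each rule alone on those same prompts.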
what we're not claiming
- That the wall blow-up is always 3×. Eight problems is a small sample; the direction is consistent across several runs, the exact magnitude is not certified.
- That two-rule interactions are always harmful. We have seen the opposite: a different pair of rules cut wall by 13× on a different batch. The sign is regime-dependent.
- That this displaces existing options-theory framings. The original framing of 'an option' (initiation, policy, termination) is correct and we use it. We are adding a layer on top — composition — where the existing field's framing assumed simplicity and the data says otherwise.
- That the automatic-mining tool is finished. The data set needs more variety before it can find combinations our hand-authored set has not already covered. That work is in flight.
where this came from
We ran a routine before-and-after benchmark. The 'after' had been microbenchmarked to be faster. On the full benchmark, the 'after' showed twice the mean wall time. We spent half a week chasing what the surprise meant. The composition-is-not-additive shape was the answer that fit all the data. It also explained the failed mining tool's bad suggestion: the tool could not see the interaction layer either.
Most public discussions of 'tool use' or 'skill libraries' or 'agent frameworks' implicitly assume the rules compose additively. They do not. That is the gap this note names. It is named because the box is starting to measure where the gap lives.