The Subtraction Principle: Why Removing Agent Tools Often Improves Performance
Research shows verifiers and multi-candidate search can hurt agent performance. Learn when to remove harness components as models improve.
When Adding More Breaks What Already Works
There’s a persistent assumption in AI agent design: if your system isn’t performing well enough, add something. A verifier to catch mistakes. A multi-candidate search to pick the best output. Another tool, another pipeline stage, another layer of checking. More infrastructure means more reliability — right?
Often, no. And the research backing this up is more than a footnote.
The subtraction principle — the idea that removing components from an agent harness can improve performance — is one of the most practically useful and consistently underappreciated ideas in multi-agent optimization. As foundation models grow stronger, the scaffolding built around them can actively work against them. Verifiers introduce new failure modes. Multi-candidate search requires selection logic that may be worse than just trusting a single good answer. Tools that once filled capability gaps now create decision overhead.
This post unpacks why that happens, what the research shows, and how to identify which parts of your agent setup are helping versus quietly dragging performance down.
The “More Infrastructure” Assumption and Where It Comes From
The instinct to add components isn’t irrational — it came from a real problem.
Early large language models were brittle. They hallucinated frequently, struggled with multi-step reasoning, and couldn’t reliably self-correct. So researchers and engineers built harnesses around them: verifiers to reject bad outputs, sampling strategies to generate many candidates and pick the best, external tools to handle tasks the model couldn’t do reliably on its own.
For weaker models, this worked. If your base model succeeds 40% of the time on a task, generating five candidates and selecting the best one via majority vote or a verifier can push accuracy significantly higher. The math is straightforward: ensemble methods reduce variance when individual predictions are noisy and relatively independent.
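To make that math concrete, here is a minimal sketch in Python. The numbers are illustrative assumptions, and the selector is an idealized "oracle" that always recognizes a correct candidate when one exists:

```python
# Best-of-N under an idealized oracle selector that always recognizes a
# correct candidate when one exists. p = per-sample success rate.
# Numbers are illustrative assumptions, not benchmarks.
def oracle_best_of_n(p: float, n: int) -> float:
    """P(at least one of n independent samples is correct)."""
    return 1 - (1 - p) ** n

for p in (0.40, 0.90):
    print(f"p={p:.2f}: single-shot={p:.2f}, best-of-5={oracle_best_of_n(p, 5):.3f}")
# p=0.40: single-shot=0.40, best-of-5=0.922  -> large headroom for a weak model
# p=0.90: single-shot=0.90, best-of-5=1.000  -> almost no headroom left to buy
```

Real selectors are not oracles, which is exactly where the trouble starts.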
But the conditions that made those scaffolds useful have changed. Foundation models are dramatically more capable than they were two or three years ago. And when you apply infrastructure designed for a weak model to a strong one, you often get worse results — not better.
What Research Shows About Verifiers
Outcome reward models (ORMs) and process reward models (PRMs) were developed to solve a real problem: models that reach correct answers through incorrect reasoning, or that produce plausible-sounding but wrong outputs. The idea is sound in principle — add a checker that evaluates whether the answer is actually correct.
The problem is that verifiers are imperfect. They’re trained models themselves, which means they have their own error rates, biases, and blind spots.
When you add a verifier to a strong base model, a few things can happen:
- The verifier rejects correct answers. If the base model produces a valid solution that looks unusual or takes a non-standard path, a verifier trained on more conventional patterns may mark it wrong. You’ve just discarded a good answer.
- The verifier accepts confident-sounding wrong answers. Verifiers tend to be susceptible to confident, fluent outputs. A model that states an incorrect answer with high apparent certainty may pass verification more reliably than a correct answer expressed with appropriate uncertainty.
- Verification errors compound. Each additional decision point is an additional place where an error can be introduced. For a model that’s already quite accurate, adding a moderately reliable verifier doesn’t raise accuracy to near-perfect — it creates a new category of failures at the verification stage.
Research on test-time compute scaling has shown this clearly. When you scale up computation through verifier-guided search, performance improvements plateau faster than expected — and in some configurations, degrade. The assumption that “more checking = more accuracy” breaks down once the base model is strong enough that verifier errors become the dominant source of failure.
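A back-of-envelope breakdown makes the shift visible. This is a toy model, not a measurement: it assumes a rejected answer simply isn't delivered (no retry), and the verifier error rates are made up for illustration.

```python
# Toy model: with a verifier gate in the loop, what share of end-to-end
# failures comes from the verifier throwing away correct answers?
# fr = false-rejection rate, fa = false-acceptance rate (assumed values).
def verifier_share_of_failures(base_acc: float, fr: float, fa: float) -> float:
    discarded_correct = base_acc * fr            # correct answer rejected
    delivered_wrong = (1 - base_acc) * fa        # wrong answer passes the gate
    rejected_wrong = (1 - base_acc) * (1 - fa)   # wrong answer caught, nothing delivered
    total_failures = discarded_correct + delivered_wrong + rejected_wrong
    return discarded_correct / total_failures

print(verifier_share_of_failures(base_acc=0.60, fr=0.08, fa=0.10))  # ~0.11
print(verifier_share_of_failures(base_acc=0.97, fr=0.08, fa=0.10))  # ~0.72
```

In this toy setup, roughly one failure in ten traces back to the gate for the weak model, versus roughly three in four for the strong one. Same verifier, very different verdict on whether it belongs in the loop.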
This doesn’t mean verifiers are never useful. For tasks with clear ground-truth signals (unit tests, formal verification, math proofs with checkable steps), outcome verification is reliable and genuinely helpful. The problem is when verifiers are used as a general-purpose fix for model uncertainty, applied broadly to tasks where “correct” is harder to verify than it is to generate.
The Multi-Candidate Search Problem
Generate N outputs, pick the best one. This strategy is called best-of-N sampling, and it’s intuitively appealing: if one attempt might be wrong, surely ten attempts gives you a better shot at including a correct one.
But best-of-N only works if you have a reliable way to identify which of the N candidates is actually best. That’s the part that often fails quietly.
The Selection Bottleneck
Your selection mechanism — whether it’s a verifier, a reward model, majority vote, or some other heuristic — becomes the constraint on your system. If that mechanism has a 10% error rate, running best-of-10 doesn’t guarantee you pick the best answer: you’re trusting one imperfect judgment to distinguish among outputs that differ in subtle ways, not obviously right answers from obviously wrong ones.
For strong models, the N outputs are often all plausible. The difference between the best and second-best candidate may be small and hard to detect. Your selection mechanism has to be very good to add value in this scenario — and typically, it isn’t better than the model it’s checking.
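The same arithmetic as before, but with an imperfect selector, shows where the crossover happens. Assume candidates are independent with per-sample accuracy p and the selector picks a correct candidate (when one exists) with reliability s; both numbers are invented for illustration.

```python
# Best-of-N with an imperfect selector: accuracy = P(some candidate is
# correct) * P(selector finds it). Compare against trusting a single sample.
def best_of_n(p: float, n: int, selector_reliability: float) -> float:
    return (1 - (1 - p) ** n) * selector_reliability

for p in (0.40, 0.95):
    print(f"p={p:.2f}  single-shot={p:.2f}  "
          f"best-of-10={best_of_n(p, n=10, selector_reliability=0.90):.2f}")
# p=0.40: single-shot=0.40  best-of-10=0.89  -> the scaffolding is a clear win
# p=0.95: single-shot=0.95  best-of-10=0.90  -> the selector is now the bottleneck
```

Once the base model is more reliable than the selector, the extra candidates stop paying for themselves.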
The Cost Structure Changes
There’s also a practical cost problem. Running best-of-10 means 10x the inference cost. For weaker models, that cost is justified if accuracy improves significantly. For stronger models, where the marginal gain from additional candidates shrinks, you may be paying 10x the compute for a 2% accuracy improvement — or actively degrading performance if your selection logic makes poor choices.
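A quick cost-per-correct-answer calculation makes the trade-off explicit. It assumes inference cost scales linearly with candidate count, which is a simplification; real pricing and latency vary by provider.

```python
# Rough cost per correct answer: total calls divided by the accuracy you get.
def cost_per_correct(accuracy: float, candidates: int, cost_per_call: float = 1.0) -> float:
    return candidates * cost_per_call / accuracy

print(cost_per_correct(accuracy=0.95, candidates=1))   # ~1.05 units per correct answer
print(cost_per_correct(accuracy=0.97, candidates=10))  # ~10.31 units for ~2 extra points
```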
Teams optimizing multi-agent pipelines often find that reducing the candidate count — or eliminating best-of-N entirely — reduces latency and cost while maintaining or improving output quality. The selection overhead was a liability all along; it just wasn’t obvious until the base model improved.
Scaffolding Debt: The Infrastructure You Built for a Different Model
The term “technical debt” applies to agent systems in a specific way. Call it scaffolding debt: the accumulation of pipeline components, workarounds, and error-handling logic built around a model’s limitations that remain in place after those limitations no longer exist.
This is one of the most common sources of silent performance degradation in agent systems.
How It Accumulates
A team built an agent 18 months ago around a model that struggled with tool selection. They added explicit routing logic, multiple fallback steps, and a validation layer to catch tool misuse. At the time, this worked reasonably well.
Now they upgrade to a newer model. The new model is much better at tool selection — it rarely makes the mistakes the old scaffolding was designed to catch. But the scaffolding is still there. Instead of catching model errors, it now intercepts correct decisions and second-guesses them. Latency increases. Requests that would have resolved cleanly in one step bounce through unnecessary validation loops.
The team notices the new model doesn’t feel as much of an upgrade as expected. They write it off as unavoidable infrastructure overhead, without realizing the infrastructure itself is the problem.
What Scaffolding Debt Looks Like in Practice
- Routing logic that sends requests to specialized sub-agents for tasks the main model now handles competently on its own
- Retry logic with aggressive thresholds that triggers unnecessary re-runs on valid outputs
- Prompt engineering patches (added instructions to correct specific failure modes) that are no longer needed but now add noise to context
- Tool lists bloated with options added to compensate for a model’s previous inability to reason through problems
Each of these was justified at some point. Each may now be a liability.
When Tool Overload Hurts Agent Reasoning
There’s a related problem specific to tool-using agents: giving them too many tools.
Research on function calling and tool use in language models shows consistent performance degradation as tool count grows beyond a certain threshold. An agent with access to 50 tools faces a harder selection problem than an agent with 10 well-chosen ones. The descriptions of those tools take up space in the context window, the model has to reason through more options at each step, and the chance of selecting a suboptimal or irrelevant tool increases.
This is counterintuitive to anyone who thinks more capability options = more capable agent. But tool selection is itself a reasoning task. Harder selection problems increase the cognitive load on the model, which reduces the quality of the work it does after making the selection.
The Practical Signal
If you’re building an agent that underperforms expectations, look at the tool list before assuming the model needs to be stronger or the task needs more steps. Ask:
- Which tools has this agent actually used in the last 100 runs?
- Which tools are being called in error or producing outputs the model ignores?
- Could the same task be done with half the tools?
Often, pruning the tool list to the 5–8 most relevant options produces a measurable improvement. The agent isn’t getting “less capable” — it’s getting a harder selection problem removed from its path.
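One way to answer those questions is to pull them straight from your run logs. The sketch below is a minimal example; the log structure and field names (`tool_calls`, `status`, `output_used`) are assumptions about what your harness records, not any specific framework's schema.

```python
from collections import Counter

# Audit which registered tools are actually earning their place:
# never called, frequently erroring, or producing outputs the model ignores.
def audit_tool_usage(runs: list[dict], registered_tools: list[str]) -> dict:
    calls, problems = Counter(), Counter()
    for run in runs:
        for call in run.get("tool_calls", []):
            calls[call["tool"]] += 1
            if call.get("status") == "error" or call.get("output_used") is False:
                problems[call["tool"]] += 1
    return {
        "never_used": [t for t in registered_tools if calls[t] == 0],
        "problem_rate": {t: problems[t] / calls[t] for t in calls},
        "call_counts": dict(calls),
    }

# report = audit_tool_usage(last_100_runs, registered_tools=agent_tool_names)
# Tools in report["never_used"] are the first candidates for pruning.
```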
How to Know What to Remove
The subtraction principle doesn’t mean strip everything. It means audit components against the current model’s actual capabilities, not the capabilities you assumed when you built the system.
A Practical Audit Framework
Step 1: Baseline without the component. Run your benchmark with the component removed. If performance stays the same or improves, the component was providing no value — or was a liability. (A minimal harness sketch follows the steps below.)
Step 2: Measure component-level error rates. If you have a verifier, measure how often it rejects correct outputs versus incorrect ones. If it’s rejecting correct outputs at a rate above 5–10%, it’s introducing more noise than it’s filtering.
Step 3: Check latency and compute against quality gains. If best-of-N is adding 3x the latency for a 1% quality improvement on your specific task distribution, that’s a signal the selection mechanism isn’t working well enough to justify the cost.
Step 4: Review routing logic against recent outputs. Are the sub-agents your router sends to actually outperforming the main model on their designated tasks? Run a comparison. Routing logic that made sense 12 months ago may not be accurate today.
Step 5: Audit tool usage logs. Which tools are never or rarely called? Which are called incorrectly? Start there for pruning.
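For Step 1, the harness can be as small as the sketch below. The `run_pipeline` and benchmark interfaces are placeholders for whatever your system actually exposes; the point is that the comparison is a short loop, not a project.

```python
import time

# Ablation test: run the same benchmark with a component toggled on and off,
# then compare accuracy and latency. `run_pipeline` is a stand-in for your
# agent's entry point; `benchmark` is a list of {"input", "expected"} pairs.
def run_ablation(benchmark, run_pipeline, component_enabled: bool) -> dict:
    correct, latencies = 0, []
    for example in benchmark:
        start = time.perf_counter()
        output = run_pipeline(example["input"], verifier=component_enabled)
        latencies.append(time.perf_counter() - start)
        correct += int(output == example["expected"])
    return {
        "accuracy": correct / len(benchmark),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

# with_component = run_ablation(benchmark, run_pipeline, component_enabled=True)
# without_component = run_ablation(benchmark, run_pipeline, component_enabled=False)
# If "without" matches or beats "with", the component is overhead, not value.
```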
What to Keep
Not everything is scaffolding debt. Some components remain valuable regardless of model capability:
- Hard constraints and safety filters — These aren’t performance optimizations; they’re policy enforcement.
- Deterministic verification — Unit tests, API contract checks, database query validation. These don’t have error rates the way learned verifiers do.
- Retrieval and memory — Giving a model access to relevant context is fundamentally different from adding a layer that second-guesses its outputs.
- Orchestration logic — Managing parallel tasks, handling failures gracefully, and routing between genuinely specialized systems is structural, not compensatory.
The distinction is: does this component compensate for a model limitation that may have since been resolved, or does it provide structural value that exists independently of the model’s capability level?
How MindStudio Approaches Agent Configuration
One practical challenge with applying the subtraction principle is that most agent frameworks make it easy to add components and hard to audit what they’re actually doing.
MindStudio’s visual builder addresses this directly. When you build an agent in MindStudio, every step in the workflow is explicit and inspectable. You can see exactly which tools are active, which steps are in sequence, and what’s happening at each stage of execution — without having to parse code or read logs.
This makes it straightforward to run the kind of subtraction audit described above. You can disable a step, run a test batch, and compare outputs directly, without touching code. For teams managing agents built on top of different underlying models — or updating an agent as they switch to a newer, more capable model — being able to quickly modify and re-test the harness is genuinely useful.
MindStudio also supports building multi-step AI workflows across 200+ models, which means you can test whether a single stronger model with minimal scaffolding outperforms a more complex pipeline built around an older one. That kind of A/B comparison is often where teams first see the subtraction principle in action — not as a theory, but as a measurable improvement in their results.
You can try it free at mindstudio.ai.
Frequently Asked Questions
Does removing tools always improve agent performance?
No. Removing tools helps when the tool set is creating decision overhead, being misused, or compensating for limitations the current model no longer has. Tools that provide genuinely useful capabilities — access to current data, external APIs, specialized computations — continue to add value regardless of model strength. The question to ask is whether each tool is being used correctly and whether the agent is selecting among tools effectively.
Why do verifiers sometimes make agent performance worse?
Verifiers are trained models with their own error rates. When a base model is strong enough that its primary error rate is low, a verifier that rejects even a small percentage of correct outputs becomes the dominant source of failure rather than a correction mechanism. The key issue is that verification is asymmetric: a false rejection (rejecting a correct answer) is harder to detect and correct than a false acceptance.
How does multi-candidate search hurt performance?
Best-of-N sampling requires a selection mechanism to pick the best candidate. If that selection mechanism — whether a verifier, reward model, or majority vote — is imperfect, you may systematically pick worse candidates than you’d get by just taking the model’s top output. This is most likely when candidates are all high-quality and subtle differences between them are hard to evaluate, which is exactly the situation with strong modern models.
What is scaffolding debt in AI agent systems?
Scaffolding debt refers to pipeline components, workarounds, and error-handling logic built to compensate for a model’s past limitations that remain in place after those limitations no longer exist. As foundation models improve, the patches built around older, weaker versions can actively interfere with correct outputs from newer ones. Auditing and removing this accumulated infrastructure is a key part of maintaining agent performance as you upgrade underlying models.
When should you add components to an agent rather than subtract?
Add components when: (1) there’s a clear capability gap the current model cannot fill — access to real-time data, deterministic computation, or specialized domain tools; (2) you have a reliable verification signal like unit tests or formal checks; or (3) you’re genuinely parallelizing work across specialized agents rather than adding redundant oversight. The test is whether the component provides independent value or is compensating for a model weakness that may have been resolved.
How do you measure whether a pipeline component is helping or hurting?
The simplest method is ablation testing: run the same benchmark with and without the component and compare accuracy, latency, and cost. For verifiers specifically, measure false rejection rates on a labeled set of known-correct outputs. For tool configurations, analyze usage logs to identify tools that are never called, called incorrectly, or whose outputs the model ignores. These are the clearest signals that a component is overhead rather than value.
Key Takeaways
- Scaffolding built for weaker models often becomes a liability as models improve. Components that once compensated for real limitations now introduce new failure modes.
- Verifiers fail by rejecting correct answers. When a base model is already strong, a verifier’s false rejection rate can exceed its benefit, making it a net negative.
- Multi-candidate search requires reliable selection to work. Best-of-N only improves performance if your mechanism for identifying the best candidate is more accurate than just trusting the top output.
- Too many tools create harder selection problems. Pruning tool lists to the most relevant options consistently improves task performance in tool-using agents.
- Audit against current model capabilities, not the assumptions you made when you built. Model capability changes fast; pipeline architecture tends not to.
The instinct to add infrastructure is understandable. But the highest-leverage optimization move is often a careful look at what’s already there — and what can come out.