
The Subtraction Principle: Why Removing Agent Tools Often Improves Performance

Research shows adding more tools to AI agents can hurt results. Learn the subtraction principle and how to audit your agent harness for better outputs.

MindStudio Team

More Tools, Worse Results: The Case for Subtracting from Your Agent Harness

There’s a persistent assumption in AI agent design: more tools equals more capability. If your agent can search the web, query a database, send emails, call an API, run code, and summarize documents — it should be able to handle more situations, right?

In practice, the opposite is often true. Research on AI agent performance consistently shows that expanding an agent’s tool set beyond what it genuinely needs degrades output quality, increases error rates, and makes debugging significantly harder. This is the core idea behind the subtraction principle — the counterintuitive approach of removing agent tools to improve agent performance.

This article breaks down why tool overload happens, what the research says, and how to run a practical audit of your agent harness to strip it down to what actually works.


Why AI Agents Struggle with Too Many Tools

To understand why subtraction helps, you need to understand how agents actually select and use tools at inference time.

When an AI agent receives a task, it reasons through which tool (or sequence of tools) to call. This isn’t lookup — the model is making a probabilistic judgment based on the tools available, their descriptions, the task at hand, and prior context. Add more tools to the harness, and you add more options the model has to weigh, more surface area for misselection, and more ambiguity about which tool is appropriate.

The Tool Selection Problem


Research on function-calling in large language models shows a consistent pattern: as the number of available tools increases, tool selection accuracy decreases — especially when tools have overlapping functionality or vague descriptions.

A 2024 study examining tool-augmented LLMs found that agents with access to large tool sets (20+ tools) made significantly more tool selection errors than agents with focused sets of 5–8 tools, even when the larger set technically contained everything needed to complete the task. The agent’s attention was effectively diluted.

This makes sense when you consider how the model processes tool availability. Each tool gets described in the system prompt or context window. More tools mean more tokens consumed before the agent even begins reasoning about the task. That’s context window space that could be used for the actual problem.
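
To make that overhead concrete, here is a minimal sketch of tool definitions in the common JSON-schema function-calling format. The tool names and the four-characters-per-token heuristic are illustrative assumptions, not measurements from any specific provider.

```python
import json

# Illustrative tool schemas in a JSON-schema function-calling format.
# Every schema is serialized into the model's context before it sees the task.
TOOLS = [
    {
        "name": "search_web",
        "description": "Search the public web and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "query_crm_for_contact_history",
        "description": "Return past interactions for a contact, looked up by email.",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
    # ...each additional tool adds another block like the ones above
]

def rough_token_cost(tools: list[dict]) -> int:
    # Rough heuristic: ~4 characters per token for English-heavy JSON.
    return len(json.dumps(tools)) // 4

print(f"{len(TOOLS)} tools consume roughly {rough_token_cost(TOOLS)} context tokens before reasoning starts")
```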

Ambiguity Compounds Errors

The problem gets worse when tools have similar names, overlapping functions, or descriptions that don’t clearly delineate their purpose. An agent choosing between search_web, fetch_url, browse_page, and retrieve_content is going to make inconsistent choices — sometimes using the right tool, sometimes not.

This inconsistency is particularly damaging in multi-step workflows. A wrong tool choice early cascades into compounding errors downstream. The agent builds on a flawed intermediate result, and by the time you see the output, the original mistake is buried several steps back.

Latency and Cost Overhead

Beyond accuracy, there’s a practical performance hit. Every tool call adds latency. Agents with bloated harnesses often make redundant tool calls — fetching information that’s already in context, running checks that aren’t needed, or calling a tool and then calling a second tool to verify the first one’s output.

In production, this translates directly to slower response times and higher inference costs. An agent that makes six tool calls to complete what could be done in two isn’t more capable — it’s less efficient.


What the Research Actually Shows

The subtraction principle isn’t just intuition — it’s backed by a growing body of work on agent design and evaluation.

Tool Count vs. Task Success Rate

Multiple benchmarks for tool-using agents have found an inverted-U relationship between tool count and task success rate. Performance improves as you add the first few relevant tools, then plateaus, then declines as tool count continues to grow. The inflection point varies by model and task type, but it consistently exists.

The ToolBench benchmark, which evaluates LLMs across diverse tool-use scenarios, found that models frequently failed not because they lacked the right tool, but because they selected the wrong one from a crowded harness. Improving tool descriptions and reducing redundancy had a larger performance impact than switching to a more capable base model.

Instruction Following Degrades with Complexity

There’s also a connection to instruction following research. Models are better at following instructions when those instructions are clear and constrained. A system prompt that describes 3 tools concisely is easier for a model to internalize than one that describes 15 tools with qualifications and edge cases.


This is particularly relevant for smaller or more cost-efficient models. If you’re running agents on GPT-4o Mini, Claude Haiku, or Gemini Flash to control costs, you’re working with models that are more sensitive to prompt complexity. A leaner tool harness makes these models punch significantly above their weight.

The Cognitive Load Parallel

There’s an analogy to human cognitive load theory here. Humans make worse decisions when presented with too many choices — a well-documented effect called decision fatigue or choice overload. AI agents exhibit an analogous pattern: given too many options, their selection behavior becomes less reliable.

The fix in both cases is the same: constrain the choice set to what’s actually relevant and make each option clearly distinct.


Running a Tool Audit: The Subtraction Framework

The subtraction principle isn’t about building minimalist agents for its own sake. It’s about removing tools that don’t contribute to task completion — and being rigorous about what “contribute” actually means.

Here’s a practical framework for auditing your agent’s tool harness.

Step 1: Log Every Tool Call in Production

You can’t audit what you can’t see. Before cutting anything, instrument your agent to log every tool call with:

  • The tool name
  • The input passed to it
  • The output returned
  • Whether the final task was completed successfully

Run this for a representative sample of tasks — at least 50–100 real interactions if possible. You’ll quickly see patterns: which tools are called frequently and correctly, which are called infrequently, and which are called but produce outputs that get ignored.
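
As a starting point, here is a minimal logging sketch. The file path, record fields, and dispatch helper are assumptions to adapt to whatever agent framework and storage you already use.

```python
import json
from datetime import datetime, timezone

LOG_PATH = "tool_calls.jsonl"  # hypothetical destination; swap in your own sink

def log_tool_call(interaction_id: str, tool_name: str, tool_input: dict,
                  tool_output: str, task_succeeded: bool | None = None) -> None:
    # Append one tool-call record as a JSON line. task_succeeded can be
    # back-filled once the final outcome of the interaction is known.
    record = {
        "interaction_id": interaction_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool_name,
        "input": tool_input,
        "output": tool_output,
        "task_succeeded": task_succeeded,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def logged_call(interaction_id: str, tool_name: str, tool_input: dict, dispatch):
    # Wrap your framework's tool executor so every call is recorded.
    output = dispatch(tool_name, tool_input)
    log_tool_call(interaction_id, tool_name, tool_input, output)
    return output
```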

Step 2: Identify Low-Utility Tools

From your logs, flag any tool that meets one or more of these criteria:

  • Called in fewer than 10% of interactions
  • Called but its output is rarely used in the final response
  • Frequently called in combination with another tool that does similar work
  • Produces outputs that require another tool call to verify or clean up

These are candidates for removal or consolidation.
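
A short analysis pass over the logs from Step 1 can surface these candidates automatically. The thresholds below mirror the criteria above and are assumptions to tune for your own workload.

```python
import json
from collections import Counter, defaultdict

CALL_RATE_FLOOR = 0.10   # "called in fewer than 10% of interactions"
SUCCESS_FLOOR = 0.50     # calls that rarely end in a completed task

records = [json.loads(line) for line in open("tool_calls.jsonl")]
interactions = {r["interaction_id"] for r in records}
calls = Counter(r["tool"] for r in records)
successes = defaultdict(int)
for r in records:
    if r.get("task_succeeded"):
        successes[r["tool"]] += 1

for tool, n in calls.items():
    call_rate = n / len(interactions)
    success_rate = successes[tool] / n
    if call_rate < CALL_RATE_FLOOR or success_rate < SUCCESS_FLOOR:
        print(f"candidate: {tool} (call rate {call_rate:.0%}, success rate {success_rate:.0%})")
```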

Step 3: Test Tool Descriptions Before Tools Themselves

Before removing a tool, try rewriting its description first. Many “underperforming” tools aren’t actually bad — they’re just described ambiguously. A tool called get_data with a vague description will be underused or misused. The same tool renamed query_crm_for_contact_history with a clear description of when to use it may perform completely differently.

Run A/B comparisons on tool descriptions. It takes less than an hour and often yields significant gains without touching the underlying functionality.
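
A description rewrite might look like the sketch below. Both schemas are illustrative; only the name, description, and parameter documentation change, not the underlying function.

```python
# Before: vague name and description, so the model guesses when to use it.
vague_tool = {
    "name": "get_data",
    "description": "Gets data.",
    "parameters": {
        "type": "object",
        "properties": {"id": {"type": "string"}},
    },
}

# After: same underlying function, renamed and scoped so selection is unambiguous.
clear_tool = {
    "name": "query_crm_for_contact_history",
    "description": (
        "Look up a single contact's past tickets and emails in the CRM. "
        "Use this when the user asks about a specific customer's history. "
        "Do not use this for web searches or product documentation."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "contact_email": {
                "type": "string",
                "description": "Email address of the contact to look up.",
            },
        },
        "required": ["contact_email"],
    },
}
```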

Step 4: Consolidate Overlapping Tools

Look for tools with overlapping functionality and consolidate them where possible. If you have three different tools that all retrieve information from different parts of your database, ask whether one well-designed tool with parameters could replace all three.

Consolidation reduces choice paralysis without reducing actual capability. The agent now has one clear tool to reach for instead of three similar ones.
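
A consolidation of that kind might look like the sketch below, where three hypothetical lookup tools collapse into one tool with an explicit table parameter.

```python
# Before: three near-duplicate tools the agent must choose between, e.g.
#   lookup_order(order_id), lookup_customer(email), lookup_invoice(invoice_id)
# After: one tool with an enum parameter, so there is only one thing to reach for.
consolidated_tool = {
    "name": "query_database",
    "description": "Retrieve a single record from the internal database.",
    "parameters": {
        "type": "object",
        "properties": {
            "table": {
                "type": "string",
                "enum": ["orders", "customers", "invoices"],
                "description": "Which table to query.",
            },
            "record_id": {
                "type": "string",
                "description": "Primary key (or email, for customers) of the record.",
            },
        },
        "required": ["table", "record_id"],
    },
}
```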

Step 5: Remove and Validate

Remove low-utility tools one at a time and run your evaluation suite after each removal. Don’t batch removals — you want to know the individual impact of each change.

A tool that looks low-utility by call frequency might actually be critical in edge cases. Removing it may not show up in aggregate metrics but will cause failures in specific scenarios. Incremental removal lets you catch these cases before they hit production.
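
One way to structure this is a small removal loop around whatever evaluation suite you already run. Everything here is a sketch: run_eval_suite and its score and new_failures fields are stand-ins for your own harness.

```python
def audit_removals(full_harness: list[dict], candidates: list[str], run_eval_suite):
    # Remove one candidate at a time and re-run the eval suite after each removal.
    # run_eval_suite is assumed to return an object with an aggregate .score and a
    # .new_failures list, so edge-case regressions are visible, not just the average.
    baseline = run_eval_suite(full_harness)
    harness = list(full_harness)
    for tool_name in candidates:
        trial = [t for t in harness if t["name"] != tool_name]
        result = run_eval_suite(trial)
        if result.score >= baseline.score and not result.new_failures:
            harness = trial  # keep the removal
            print(f"removed {tool_name}: no regression")
        else:
            print(f"kept {tool_name}: regressions in {result.new_failures}")
    return harness
```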


Step 6: Establish a Tool Addition Protocol

The final step of a tool audit is preventing the problem from recurring. Establish a rule: no new tool gets added to the harness without a documented use case, a clear description, and a test scenario that demonstrates it’s needed.

This keeps harness bloat from creeping back in as agents evolve.
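
That protocol can be as lightweight as a short checklist enforced in review. A hypothetical sketch:

```python
from dataclasses import dataclass

@dataclass
class ToolAdditionRequest:
    # Illustrative gate: a tool only enters the harness when every field is
    # filled in and the test scenario passes in the evaluation suite.
    tool_name: str
    documented_use_case: str   # the real task that requires this tool
    description: str           # the exact description the model will see
    test_scenario: str         # an eval case demonstrating the tool is needed

def approve(request: ToolAdditionRequest) -> bool:
    return all(field.strip() for field in (
        request.documented_use_case,
        request.description,
        request.test_scenario,
    ))
```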


Common Mistakes That Lead to Tool Overload

Most bloated agent harnesses didn’t get that way through negligence. They grew organically, through reasonable decisions made in isolation. Here are the most common patterns.

"Just in Case" Tool Addition

The most common cause of tool overload is adding tools because they might be useful rather than because they’re definitely needed. This is the agentic equivalent of feature creep. Every tool added “just in case” adds noise without adding signal.

The fix is a strict policy: tools get added when there’s a documented task that requires them, not before.

Copy-Paste Harness Design

Many agent builders start from templates or copy configurations from other projects without auditing whether every tool is relevant to the new use case. A research agent harness might get copy-pasted as the starting point for a customer support agent — and suddenly your support agent has tools for academic database searches and citation formatting that will never be used.

Always start a tool audit when adapting an existing agent to a new use case.

Fear of Capability Loss

Removing tools feels risky. What if someone needs that capability later? What if a user runs into a task that requires the removed tool?

The practical answer: track what your agent is actually asked to do. If removed tools were genuinely needed, you’ll see task failures that can be traced back to missing capability. Add them back then, with better descriptions and clearer use cases. Don’t keep tools speculatively.

Poor Tool Descriptions That Require Redundancy

Sometimes teams add multiple similar tools because the existing tool's description isn't clear enough for the agent to use it reliably. The agent keeps picking the wrong tool, so a second one gets added as a workaround.

Address the root cause — the description — rather than adding more tools to compensate.


The Subtraction Principle in Multi-Agent Systems

In single-agent setups, tool overload is a manageable problem. In multi-agent systems, it compounds dramatically.

When you have orchestrator agents routing tasks to specialist agents, each specialist’s tool harness needs to be even more focused. An orchestrator that hands off a research task to a sub-agent expects that sub-agent to execute reliably. If the sub-agent has a cluttered harness, its reliability drops — and errors propagate back to the orchestrator, which then has to handle failure states it may not be equipped for.

Specialization Over Generalism

The most robust multi-agent architectures use highly specialized sub-agents with narrow tool sets rather than generalist agents with broad ones. A sub-agent whose entire job is to query a specific API and return structured data should have exactly the tools needed to do that — and nothing else.

This makes individual agents easier to test, debug, and improve in isolation. It also makes the overall system more predictable, because each agent’s behavior is constrained by a limited tool set.

Tool Boundaries as Agent Boundaries

In multi-agent workflow design, tool boundaries and agent boundaries often align naturally. If you find yourself building an agent with tools that seem to belong to two completely different domains, that’s usually a signal to split the agent into two.

One agent with 15 tools might be better designed as three agents with 5 tools each, coordinated by an orchestrator. The individual agents become more reliable, and the orchestrator’s job becomes simpler because it’s working with components that fail less often.
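
As a rough sketch, that split might look like this: a trivial router exposes each specialist's narrow tool set instead of one 15-tool generalist. The domain names and tool names are illustrative and not tied to any specific framework.

```python
# Three specialists with ~3 tools each instead of one generalist with 15.
SPECIALISTS = {
    "research":  {"tools": ["search_web", "fetch_url", "summarize_page"]},
    "crm":       {"tools": ["query_crm_for_contact_history", "update_contact", "log_interaction"]},
    "reporting": {"tools": ["query_database", "render_chart", "send_email"]},
}

def route(task_domain: str) -> dict:
    # The orchestrator's only job here: hand the task to the specialist whose
    # narrow tool set matches the domain, so no agent ever sees all 15 tools.
    return SPECIALISTS[task_domain]

sub_agent = route("crm")  # this sub-agent sees 3 tools, not 15
```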


How MindStudio Approaches Agent Tool Configuration

MindStudio’s visual workflow builder makes it easy to both add and remove tools from an agent harness — which is important, because the ease of removal matters as much as the ease of addition.

In many code-based agent frameworks, removing a tool requires modifying function definitions, updating system prompts, and re-testing across multiple files. In MindStudio, tools are added and removed in the visual builder with immediate visibility into what the agent can access. This makes iterative subtraction — the kind of audit process described above — practical to run regularly rather than as a one-time cleanup.

More usefully, MindStudio supports building purpose-specific agents and connecting them in multi-step agentic workflows. Instead of loading a single agent with every capability it might ever need, you can build a lean orchestrator that routes to specialist agents — each with a focused tool set — using MindStudio’s 1,000+ pre-built integrations.

The platform also gives you access to 200+ models, which matters for the subtraction principle in a specific way: when you reduce tool complexity, smaller and faster models can often handle the task reliably. This means a well-audited agent running on a cost-efficient model frequently outperforms a bloated agent running on a premium model — at a fraction of the cost.

You can try this approach directly at mindstudio.ai — the average agent build takes 15 minutes to an hour, and you can run tool audits iteratively as you go.


Frequently Asked Questions

How many tools should an AI agent have?

There’s no universal number, but most practitioners find that 5–10 well-defined, clearly distinct tools hit the sweet spot for most agent use cases. Fewer than 3 and you may be limiting genuine capability. More than 15 and you’re almost certainly introducing noise. Start with the minimum needed to complete the task, then add only when there’s a documented, tested case for it.

Does the subtraction principle apply to all AI models equally?

Not equally, but broadly. Larger, more capable models (GPT-4o, Claude Sonnet, Gemini Pro) are more resilient to tool overload than smaller models — but they’re not immune. The performance degradation with bloated harnesses is a consistent pattern across model sizes. Smaller models are simply affected more severely, which means tool hygiene matters even more when running cost-efficient agents.

What’s the difference between tool removal and tool redesign?


Removal means eliminating a tool entirely. Redesign means changing a tool’s description, scope, or structure without removing it. Often, redesign is the right first step — a tool that’s being misused or underused may just need clearer documentation. Only remove after you’ve confirmed that better descriptions don’t solve the problem.

How do I know which tools are actually causing problems?

Logging is the most direct way. Track which tools get called, how often they’re called correctly, and whether their outputs contribute to successful task completion. Tools with low call rates, high error rates on selection, or outputs that frequently get discarded are the best candidates for removal or redesign.

Does the subtraction principle conflict with building capable agents?

No — it reframes what “capable” means. An agent that reliably completes 90% of its target tasks with 6 tools is more capable than one that attempts 100% of tasks with 20 tools and succeeds on 60%. Reliability and precision are legitimate capabilities. A focused agent that handles a narrow domain well is genuinely more useful than a broad agent that handles everything poorly.

Should I apply the subtraction principle before or after optimizing prompts?

Both matter, and they interact. Generally, fix tool descriptions and system prompts first, then audit tool count. Prompt improvements often reveal that some tools are being misused because they’re described poorly — solving the prompt issue might eliminate the need for removal. But if tools remain low-utility after prompt optimization, remove them.


Key Takeaways

  • Adding more tools to an AI agent typically hurts performance past a certain point, due to tool selection errors, context bloat, and ambiguity between similar tools.
  • A practical tool audit — logging, identifying low-utility tools, improving descriptions, consolidating overlaps, and removing incrementally — is the most reliable way to improve agent reliability.
  • In multi-agent systems, focused specialist agents with narrow tool sets are more robust than generalist agents with broad ones.
  • Rewriting tool descriptions often fixes tool selection problems without requiring removal — always try description improvements first.
  • A well-audited agent running on a smaller model frequently outperforms a bloated agent running on a larger one, with better cost efficiency and lower latency.

The best-performing agents aren’t the ones with the most tools. They’re the ones where every tool earns its place — and where the builder was willing to remove what wasn’t working. Start with what you need, cut what you don’t, and build from there.

Presented by MindStudio
