How to Reduce Token Usage in AI Agents: 10 MCP Optimization Techniques
MCP servers can burn through your context window fast. These 10 techniques—from code execution to TOON encoding—can cut token usage by up to 98%.
Why MCP Servers Eat Your Context Window Alive
If you’ve built AI agents that call MCP (Model Context Protocol) servers, you’ve probably noticed your token counts climbing fast. A single tool call that returns a JSON blob with 50 fields — when you only needed 3 — can consume thousands of tokens in one shot. Multiply that across a multi-step workflow and you’re burning context window before the agent has done anything useful.
Token usage in AI agents isn’t just a cost problem. It’s a performance problem. Bloated context slows inference, increases error rates, and hits limits on models with smaller windows. For MCP optimization specifically, the problem is structural: most MCP servers are built to return complete, accurate data — not minimal data. That’s fine for humans reading output, but it’s wasteful when an LLM only needs a fraction of what comes back.
These 10 techniques address that gap directly. Some are architectural choices. Some are encoding tricks. A few, like TOON notation, can cut token usage by up to 98% on certain payloads. All of them are practical and applicable whether you’re building from scratch or tuning an existing multi-agent workflow.
Understanding the Token Waste Problem in MCP Workflows
Before the fixes, it helps to know exactly where the waste happens.
In a typical MCP interaction, the agent sends a tool call, the MCP server fetches or processes data, and the result lands back in the context window. That result gets read by the model and informs the next step. The problem is that “the result” is often far larger than necessary.
Common sources of token bloat in MCP workflows:
- Verbose JSON responses — APIs return full objects with dozens of keys when you needed one value
- Over-specified tool schemas — Tool descriptions and parameter definitions add hundreds of tokens before any data arrives
- Redundant context passing — Prior tool results stay in context even when they’re no longer relevant
- Paginated data loaded all at once — Full datasets fetched when a subset would do
- Uncompressed structured data — Standard JSON is readable but not token-efficient
The fix isn’t to use fewer tools. It’s to use them more precisely.
The 10 MCP Optimization Techniques
1. Filter Tool Outputs at the Server Level
The single highest-leverage change you can make: return only the fields the agent actually needs.
Most MCP servers pass through the full API response. A GitHub MCP call that returns a pull request might include author info, timestamps, branch names, reviewer lists, CI statuses, labels, and more — when the agent only needed the PR title and diff.
The solution is field filtering at the MCP server layer. Before the response hits the context window, strip it down to the fields your agent workflow requires. This can be as simple as a projection function that whitelists specific keys.
For a response that typically returns 50 fields, filtering to 3–5 relevant ones can reduce payload tokens by 80–90%.
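As a rough sketch, a projection step like the following (Python, with illustrative field names rather than a real API schema) is often all it takes:

# Minimal sketch of field filtering at the MCP server layer.
# Field names are illustrative, not taken from any real API.
ALLOWED_FIELDS = {"title", "state", "diff_url"}

def project(response: dict, allowed: set = ALLOWED_FIELDS) -> dict:
    """Return only the keys this agent workflow actually needs."""
    return {k: v for k, v in response.items() if k in allowed}

full_response = {
    "title": "Fix login redirect",
    "state": "open",
    "diff_url": "https://example.com/pull/42.diff",
    "author": "sarah-chen",
    "labels": ["bug", "auth"],
    # ...dozens more fields in a real response
}
print(project(full_response))  # only title, state, and diff_url reach the model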
2. Run Computation Server-Side Instead of in Context
This is one of the most underused MCP optimization techniques: move computation to the server, not the model.
A common anti-pattern looks like this: the MCP server returns a large dataset, the agent processes it in context (sorting, filtering, aggregating), and produces a result. You’ve paid for every token of that dataset.
The better pattern: expose aggregation tools in your MCP server that do the processing before anything enters the context window. Instead of get_all_sales_records, expose get_sales_summary_by_region. Instead of list_all_issues, expose count_open_issues_by_priority.
This is the principle behind many high-efficiency MCP designs. Code runs in milliseconds server-side and costs nothing in tokens. The same operation in context costs hundreds or thousands of tokens.
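A minimal sketch of that pattern, assuming a hypothetical sales dataset and tool name:

# Server-side aggregation sketch: raw records never enter the context window.
from collections import defaultdict

def get_sales_summary_by_region(records: list) -> dict:
    """Aggregate raw sales records into per-region totals before returning."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)

records = [
    {"region": "EMEA", "amount": 1200.0},
    {"region": "APAC", "amount": 800.0},
    {"region": "EMEA", "amount": 450.0},
]
print(get_sales_summary_by_region(records))  # {'EMEA': 1650.0, 'APAC': 800.0}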
3. Compress Tool Descriptions Without Losing Function
Every tool you expose to an agent carries a description and a parameter schema. On a complex MCP server with 20 tools, the combined schema can add 2,000–4,000 tokens to every request — even for tools the agent never calls.
A few approaches to reduce this overhead:
- Trim verbose descriptions — Tool descriptions should be precise, not exhaustive. “Returns user profile” beats a four-sentence explanation.
- Use shorter parameter names — q instead of search_query_string saves tokens at scale
- Remove optional parameters from default schemas — Only expose the full parameter set when the tool is actually being invoked
- Use tool selection layers — For large tool catalogs, use a router that presents only the relevant subset of tools based on the current task
Tool schema compression alone can reduce per-request overhead by 30–60% in agents with large MCP tool catalogs.
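Here is an illustrative before-and-after, using a hypothetical tool rather than any real MCP server's schema:

# Verbose schema: long description, long parameter names, optional knobs exposed.
verbose_tool = {
    "name": "search_customer_records",
    "description": "Searches the customer database for records matching the "
                   "provided search query string and returns a list of matching "
                   "customer records, including all associated metadata fields.",
    "parameters": {
        "search_query_string": {"type": "string", "description": "Full text query to run."},
        "include_archived_records": {"type": "boolean", "description": "Whether to include archived records."},
        "sort_order_preference": {"type": "string", "description": "Preferred sort order for results."},
    },
}

# Trimmed schema: same function, a fraction of the per-request overhead.
compact_tool = {
    "name": "search_customers",
    "description": "Search customers by text query.",
    "parameters": {
        "q": {"type": "string"},
    },
}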
4. Implement Intelligent Pagination
Many MCP tools default to returning all available data. That’s a reasonable default for a generic API, but it’s a token drain for agents.
The fix is to build pagination awareness into your MCP layer and into your agent prompting:
- Default to small page sizes (10–20 items) rather than returning everything
- Expose has_more flags so the agent can decide whether to fetch the next page
- Let the agent request additional data only when it determines it needs it — not preemptively
An agent looking for a specific customer record in a CRM doesn’t need all 5,000 records dumped into context. It needs the ability to search, get a small result set, and stop when it finds a match.
This requires the agent’s reasoning to include awareness of when “enough data” has arrived — but that’s a solvable prompt engineering problem, and the token savings are significant.
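A minimal Python sketch of a pagination-aware tool, with hypothetical names:

def list_customers(all_records: list, page: int = 0, page_size: int = 20) -> dict:
    """Return one small page plus a has_more flag the agent can use to decide
    whether another fetch is actually needed."""
    start = page * page_size
    items = all_records[start:start + page_size]
    return {
        "items": items,
        "page": page,
        "has_more": start + page_size < len(all_records),
    }

records = [{"id": i, "name": f"customer_{i}"} for i in range(5000)]
first_page = list_customers(records)
print(len(first_page["items"]), first_page["has_more"])  # 20 True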
5. Cache Tool Results Within a Session
Redundant tool calls are a silent token drain. An agent that queries the same Notion database three times in one workflow pays full token cost each time — both for the call overhead and the repeated response content.
Session-level caching at the MCP layer solves this. When a tool result hasn’t changed and the same query is made within a session, return the cached result without re-fetching and, crucially, without reinserting the full response into context.
More advanced implementations use a context cache reference — instead of re-inserting the full tool response, inject a compact reference like [Cached: customer_record_id_4821] and have the agent resolve it only if needed.
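A rough sketch of that pattern, assuming a simple in-memory cache rather than any specific MCP SDK (resolving a reference back to the full record is left as a separate tool):

import json

class SessionCache:
    """Session-level cache keyed by tool name plus normalized arguments."""
    def __init__(self):
        self._cache = {}

    def call(self, tool_name: str, args: dict, fetch):
        key = (tool_name, json.dumps(args, sort_keys=True))
        if key in self._cache:
            # Repeat call: return a compact reference instead of the full payload.
            return {"cached_ref": f"{tool_name}:{key[1]}"}
        result = fetch(**args)
        self._cache[key] = result
        return result

cache = SessionCache()
lookup = lambda record_id: {"record_id": record_id, "status": "active"}
print(cache.call("get_customer", {"record_id": 4821}, lookup))  # full result
print(cache.call("get_customer", {"record_id": 4821}, lookup))  # compact reference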
Some model providers (like Anthropic’s prompt caching feature) offer native support for caching repeated context segments, which pairs well with MCP result caching.
6. Use TOON Encoding for Structured Data
TOON (Token-Optimized Object Notation) is a compact alternative to standard JSON for passing structured data through agent contexts. It’s not a universal standard — different implementations vary — but the core idea is consistent: represent structured data in formats that use fewer tokens without losing semantic content.
Standard JSON is human-readable but token-heavy. Consider this comparison:
Standard JSON:
{
  "customer_id": "cust_4821",
  "first_name": "Sarah",
  "last_name": "Chen",
  "account_status": "active",
  "plan": "enterprise"
}
TOON-style compact encoding:
cust_4821|Sarah|Chen|active|enterprise
With a schema definition the agent already has in context, the compact version conveys identical information at a fraction of the token cost. For large datasets — think 100 customer records or 500 log entries — the savings compound dramatically. Some benchmarks show 90–98% token reduction on highly repetitive structured data when moving from verbose JSON to compact schema-aligned encodings.
The tradeoff is that the agent needs the schema to interpret the compact format. This works best for predictable, repeated data shapes — not one-off responses.
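A minimal encoder and decoder sketch, assuming the pipe-delimited format above and a schema both sides already hold:

SCHEMA = ["customer_id", "first_name", "last_name", "account_status", "plan"]

def encode(records: list, schema: list = SCHEMA) -> str:
    """Emit one pipe-delimited line per record, in schema order."""
    return "\n".join("|".join(str(r[field]) for field in schema) for r in records)

def decode(blob: str, schema: list = SCHEMA) -> list:
    """Rebuild dicts from compact lines using the shared schema."""
    return [dict(zip(schema, line.split("|"))) for line in blob.splitlines()]

records = [
    {"customer_id": "cust_4821", "first_name": "Sarah", "last_name": "Chen",
     "account_status": "active", "plan": "enterprise"},
]
compact = encode(records)
print(compact)          # cust_4821|Sarah|Chen|active|enterprise
print(decode(compact))  # round-trips back to dicts (values come back as strings)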
7. Summarize Large Responses Before Injection
When an MCP tool needs to return genuinely large content — a long document, a lengthy email thread, a full code file — don’t inject the raw content. Inject a summary instead.
This can work in two ways:
Pre-summarization at the MCP layer: The MCP server runs a lightweight summarization step before returning the result. For document retrieval tools, this means the agent gets a 200-token summary instead of a 4,000-token document.
On-demand detail retrieval: The agent receives a summary plus a reference ID. If it determines it needs the full content, it calls a separate get_full_content(id) tool. Most of the time, the summary is enough — and you’ve avoided the token cost on every call.
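A small sketch of both halves, with a hypothetical document store and a naive truncation standing in for a real summarization step:

DOCUMENT_STORE = {
    "doc_42": "A very long research document... " * 500,
}

def get_document_summary(doc_id: str) -> dict:
    """Return a short summary plus a reference ID for later retrieval."""
    text = DOCUMENT_STORE[doc_id]
    summary = text[:200]  # placeholder for an actual summarization call
    return {"doc_id": doc_id, "summary": summary}

def get_full_content(doc_id: str) -> str:
    """Only called when the agent decides the summary is not enough."""
    return DOCUMENT_STORE[doc_id]

print(get_document_summary("doc_42"))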
This pattern works especially well for AI agents that process documents or research content, where the agent often needs to triage many sources before deciding which ones merit full attention.
8. Batch Tool Calls Where Possible
Each MCP tool invocation carries overhead: the tool call itself, its parameters, and the response — all in context. For agents that make many sequential calls, this overhead accumulates quickly.
Batching reduces this by combining multiple operations into one:
- Instead of get_user(id_1), get_user(id_2), get_user(id_3) — expose get_users([id_1, id_2, id_3])
- Instead of separate write calls for multiple records — expose a bulk write operation
This reduces not just the per-call overhead but also the number of reasoning steps the agent needs to take, which itself reduces context length.
When designing MCP servers for multi-agent workflows, building batch variants of common tools from the start is worth the extra implementation time.
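A minimal sketch of a batch variant, with hypothetical tool names:

USERS = {
    1: {"id": 1, "name": "Sarah"},
    2: {"id": 2, "name": "Dev"},
    3: {"id": 3, "name": "Ana"},
}

def get_user(user_id: int) -> dict:
    return USERS[user_id]

def get_users(user_ids: list) -> list:
    """One tool call and one response block instead of three round-trips."""
    return [get_user(uid) for uid in user_ids]

print(get_users([1, 2, 3]))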
9. Use Selective Context Injection by Agent Step
In multi-step agents, the entire conversation history — including all prior tool results — sits in context for every subsequent step. Early steps may have fetched data that’s irrelevant to later steps, but it still costs tokens.
Selective context injection addresses this by pruning stale tool results from the active context as the agent progresses:
- Tag tool results with a relevance scope (e.g., “needed for steps 2–3 only”)
- At each step, strip results that fall outside the current scope
- Maintain a summary of what was done in prior steps rather than the full detail
This is more complex to implement but has a large payoff in long-running agents. A 10-step workflow that accumulates context without pruning can end up with 15,000+ tokens in context before the final step. With selective injection, you might hold that to 3,000–5,000 tokens throughout.
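One way to sketch the pruning step, assuming each stored result is tagged with the last step that still needs it:

def prune_context(tool_results: list, current_step: int) -> list:
    """Keep results still in scope; replace stale ones with a one-line summary."""
    kept = []
    for result in tool_results:
        if current_step <= result["needed_through_step"]:
            kept.append(result)
        else:
            kept.append({"summary": f"step {result['step']}: {result['tool']} completed"})
    return kept

history = [
    {"step": 1, "tool": "search_docs", "needed_through_step": 3, "data": "...large payload..."},
    {"step": 2, "tool": "get_customer", "needed_through_step": 8, "data": "...small payload..."},
]
print(prune_context(history, current_step=5))  # first result collapses to a summary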
10. Deduplicate and Normalize Tool Call Patterns
This one is about workflow design rather than server architecture.
Poorly designed agent prompts often lead to redundant or overlapping tool calls. The agent fetches the same data multiple times, asks similar questions in different ways, or retrieves large contexts when targeted queries would work.
Deduplication strategies:
- Track what’s been fetched — Maintain a simple log in the agent’s working memory of what data has already been retrieved
- Normalize query patterns — If the agent tends to ask “what is the user’s plan?” and “is this user on enterprise?” separately, consolidate these into one structured lookup
- Use structured output for tool selection — Rather than letting the agent reason freely about which tool to call, provide a structured decision step that selects the minimum necessary tools for the current task
These prompt-level and workflow-level changes don’t require touching the MCP server at all, but they can meaningfully reduce redundant calls — and with them, redundant token costs.
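As an illustration, a fetch log plus a simple query normalizer might look like this (keys and phrasing are hypothetical, and nothing here touches the MCP server):

FETCHED = {}

def normalize(question: str) -> str:
    """Map differently phrased questions about the same fact to one lookup key."""
    if "plan" in question or "enterprise" in question:
        return "user_plan"
    return question

def lookup(question: str, fetch):
    key = normalize(question)
    if key not in FETCHED:
        FETCHED[key] = fetch(key)  # only the first phrasing triggers a real call
    return FETCHED[key]

fetch_fn = lambda key: {"user_plan": "enterprise"}[key]
lookup("what is the user's plan?", fetch_fn)     # one real fetch
lookup("is this user on enterprise?", fetch_fn)  # answered from the log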
How MindStudio Helps You Implement These Techniques
Building token-efficient MCP workflows from scratch requires managing a lot of moving parts: tool schemas, response filtering, caching logic, context pruning. That’s substantial engineering overhead before you’ve solved your actual business problem.
MindStudio’s no-code agent builder lets you configure multi-agent workflows visually, with built-in control over what data flows between steps. You can define exactly which fields pass from one step to the next — no code required — which directly applies techniques 1, 9, and 10 above.
MindStudio also supports custom JavaScript and Python functions within workflows. That means you can implement server-side computation (technique 2), TOON-style compact encoding (technique 6), and response summarization (technique 7) as custom steps in your pipeline without standing up separate infrastructure.
For teams building agents that connect to external APIs and data sources — the exact scenario where MCP token bloat becomes a problem — MindStudio’s 1,000+ pre-built integrations handle the connection layer. You focus on optimizing what data flows through the agent, not on managing authentication and API plumbing.
You can also expose your MindStudio workflows as MCP servers, making them callable by other AI systems like Claude or custom agents built with LangChain or CrewAI — a useful option for teams building layered multi-agent architectures where token efficiency across the full stack matters.
Try it free at mindstudio.ai.
Measuring the Impact: What to Track
Applying these techniques is only useful if you’re measuring the right things. Token optimization without monitoring is guesswork.
Key metrics to track for MCP token efficiency:
- Tokens per tool call — Average input + output tokens for each MCP tool in your workflow
- Context window utilization — What percentage of the available context window is in use at each agent step
- Tool call count per session — Are redundant calls happening?
- Cost per workflow run — The bottom-line measure of whether optimization is working
Most major model providers expose token counts in API responses. For production multi-agent systems, instrument your MCP layer to log token counts per tool call so you can identify which tools are the biggest contributors to context bloat.
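A basic accumulator is enough to start; the usage fields below follow the common input/output token shape, but check them against your provider's actual response format:

from collections import defaultdict

totals = defaultdict(lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0})

def log_tool_call(tool_name: str, usage: dict):
    """Accumulate token usage per tool so the biggest offenders stand out."""
    entry = totals[tool_name]
    entry["calls"] += 1
    entry["input_tokens"] += usage.get("input_tokens", 0)
    entry["output_tokens"] += usage.get("output_tokens", 0)

log_tool_call("get_customer", {"input_tokens": 180, "output_tokens": 2400})
log_tool_call("search_docs", {"input_tokens": 150, "output_tokens": 350})
worst = max(totals.items(), key=lambda kv: kv[1]["output_tokens"])
print(worst[0])  # the tool contributing the most output tokens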
Even basic logging will usually reveal that 2–3 tools account for the majority of your token usage — and those are the ones worth attacking first.
Common Mistakes to Avoid
Even with good intentions, a few implementation patterns tend to undermine MCP optimization efforts.
Over-filtering responses — Stripping too many fields can leave the agent without information it needs, causing extra round-trips or reasoning errors. Always profile what the agent actually uses before deciding what to cut.
Caching stale data — Session caches are only useful if the underlying data doesn’t change during the session. For real-time data sources, caching can introduce errors. Build in staleness checks or use caching only for reference data.
Compressing once, forgetting forever — As your agent’s capabilities evolve, tool schemas and response structures change. Token optimization is maintenance work, not a one-time fix. Re-audit when you add tools or modify workflows.
Ignoring prompt-side bloat — Most MCP optimization discussions focus on tool responses. But system prompts and few-shot examples can also consume thousands of tokens. Treat prompt compression as part of the same budget.
Frequently Asked Questions
What is MCP in the context of AI agents?
MCP stands for Model Context Protocol, an open standard that defines how AI models interact with external tools and data sources. It provides a structured way for agents to call tools — like searching the web, querying a database, or reading a file — and receive results. MCP servers expose these tools; AI agents consume them.
Why do MCP tool calls use so many tokens?
Token usage in MCP comes from multiple sources: the tool schema definitions sent with each request, the tool call parameters, and — most significantly — the response content returned by the tool. If a tool returns a full API response with dozens of fields when only a few are needed, all of those extra fields still consume context window space.
Does reducing token usage affect agent quality?
Done correctly, no. The goal is to remove redundant or irrelevant tokens, not useful information. An agent that receives a clean, filtered 200-token response will generally reason more accurately than one that has to parse through 3,000 tokens of noise to find the 200 tokens of signal it actually needs. Focused context often improves performance.
What is TOON encoding and is it a standard?
TOON (Token-Optimized Object Notation) is a compact data encoding approach used in agent and MCP contexts to reduce token usage on structured data. It’s not a single universal standard — implementations vary — but the core idea is representing structured records in delimited or abbreviated formats rather than verbose JSON, when the agent has access to a schema that defines the structure. It can achieve very high compression ratios on repetitive structured data.
How do I know which MCP tools are consuming the most tokens?
Instrument your MCP layer to log input and output token counts for each tool call. Most model provider APIs return token usage metadata in responses, which you can capture and aggregate. Look for tools with high average response sizes or high call frequency — those are your best optimization targets.
Can these techniques be applied to non-MCP agent architectures?
Yes. Many of these techniques — selective context injection, response filtering, server-side computation, caching — apply to any tool-using agent architecture, whether it uses MCP, function calling, ReAct patterns, or custom tool implementations. The principles are general; MCP is just the delivery mechanism.
Key Takeaways
- MCP token bloat is structural — MCP servers are built for completeness, not efficiency. Optimization requires deliberate design choices.
- Filter at the source — Returning fewer fields from tool responses is the single highest-leverage optimization.
- Move computation server-side — Code execution is free; tokens aren’t. Aggregate and process before data enters context.
- Compact encoding works for repetitive data — TOON-style representations can reduce token usage by 90%+ on structured datasets.
- Monitor and iterate — Token optimization is an ongoing process. Instrument your workflows, identify the biggest cost drivers, and target them specifically.
Implementing even 3–4 of these techniques in a production multi-agent workflow typically cuts token usage by 50–70%. The full stack of techniques can push that much further — and the savings compound with every agent run. MindStudio gives you the tooling to apply many of these without building the infrastructure layer from scratch.