How to Optimize MCP Server Token Usage: Code Execution, Tool Search, and TOON
MCP servers can burn 50% of your context window before a single message. Learn 10 techniques—including code execution and TOON—to cut usage by up to 98%.
The Hidden Token Tax Hiding in Your MCP Servers
Every time an AI agent fires up a Model Context Protocol (MCP) server, it pays a toll — in tokens. Tool schemas load into the context window before a single message is processed. Tool outputs pile in after each call. And in complex agentic setups with dozens of tools, that toll can eat 30–50% of your available context before the real work even starts.
MCP server token usage isn’t just a cost problem. It degrades model performance, increases latency, and can push your context window over its limit mid-task. The good news: it’s highly optimizable.
This guide covers 10 concrete techniques — including code execution, tool search, and TOON (Tool Output Optimization Notation) — that can cut token consumption by up to 98% in the right scenarios, without sacrificing capability.
Why MCP Servers Burn So Many Tokens
Before fixing the problem, it helps to understand exactly where the tokens go.
The Tool Schema Problem
Every tool exposed by an MCP server needs a description the model can understand. That means a name, a description, and a full JSON schema for the input parameters. A single well-documented tool might consume 200–500 tokens. Load 50 tools — which is common in enterprise setups — and you’ve spent 10,000–25,000 tokens just on definitions, before any tool has been called.
Some popular MCP server packages (for GitHub, Notion, Salesforce, etc.) expose 40–80 tools by default. Most agents only use a fraction of those in any given session.
The Output Verbosity Problem
Tool outputs are often designed for humans, not models. A REST API response might return a full JSON object with 60 fields when the model only needs 3. A database query might return full row objects when only IDs are required. These bloated outputs compound fast across multi-step workflows.
The Round-Trip Problem
When an agent needs to perform 10 sequential operations, it makes 10 tool calls. Each call means a new completion request with the full context window — including all previous tool calls and outputs — attached. The token cost compounds: the same growing history is re-sent and re-billed on every request.
Understanding these three sources of waste points directly to the solutions.
Technique 1: Use Code Execution to Batch Operations
One of the most powerful optimizations is replacing multiple discrete tool calls with a single code execution call.
How It Works
Instead of giving an agent 10 separate tools (one for each operation), give it one code interpreter tool and let it write a script that performs all 10 operations in a single call. The model reasons once, writes the code, executes it, and gets a single result back.
Without code execution:
- 10 tool calls × 500 tokens per output = 5,000 tokens in outputs
- Plus 10 intermediate completions with growing context
With code execution:
- 1 tool call
- 1 output (the final result of the script)
- Token savings: often 80–90%
When to Use It
Code execution is most effective when:
- Operations are sequential and deterministic (query → filter → transform → summarize)
- You need to process or aggregate data from multiple sources
- The logic is clear enough for a capable model to script reliably
It’s less appropriate when operations require the model to reason mid-task — where it needs to see intermediate results before deciding what to do next.
Implementation Tips
Provide the model with a sandboxed Python or JavaScript runtime. Keep the tool schema simple: one parameter for the code string, one return field for the output. Enforce output formatting in the system prompt so the model returns structured, compact results rather than prose.
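As a minimal sketch, the whole tool surface can be a single function that takes a code string and returns a string. Here a subprocess with a timeout stands in for a real sandbox, and the helper name run_code is illustrative:

import json
import subprocess
import sys

def run_code(code: str, timeout: float = 10.0) -> str:
    """Execute a model-written script in a separate Python process.

    One tool, one string parameter, one string result: the schema stays
    tiny while the script can batch any number of operations. A
    subprocess with a timeout is only a stand-in for a real sandbox
    (container, gVisor, etc.); do not run untrusted code this way in
    production.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return json.dumps({"error": "execution timed out"})
    if result.returncode != 0:
        return json.dumps({"error": result.stderr.strip()[-500:]})
    return result.stdout.strip()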
Technique 2: Tool Search Instead of Tool Loading
The single highest-leverage optimization for most MCP setups is not loading all tools upfront. Instead, give the model a search tool that retrieves tool definitions on demand.
The Core Idea
Rather than injecting 50 tool schemas into the context at session start, inject one tool: search_tools(query: string). The model calls this with a natural language description of what it needs, and your server returns the 2–5 most relevant tool schemas dynamically.
This alone can reduce schema-related token consumption by 80–95%.
Building Tool Search
The simplest approach is embedding your tool descriptions and using cosine similarity to retrieve the most relevant ones at query time. Tools like FAISS or any vector database work well here. Pre-compute embeddings for all tool descriptions; at runtime, embed the search query and return the top-k matches.
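A minimal sketch of that retrieval loop, assuming an embed() callable for whatever embedding model you use and a tool list of schema dicts with name and description fields:

import numpy as np

def build_index(tools: list[dict], embed) -> np.ndarray:
    """Pre-compute one unit-normalized embedding per tool description."""
    texts = [f"{t['name']}: {t['description']}" for t in tools]
    vectors = np.stack([embed(text) for text in texts])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search_tools(query: str, tools: list[dict], index: np.ndarray,
                 embed, k: int = 5) -> list[dict]:
    """Return the k tool schemas most relevant to the query."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = index @ q  # cosine similarity against every tool
    top = np.argsort(scores)[::-1][:k]
    return [tools[i] for i in top]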
A more sophisticated version uses BM25 or hybrid search (keyword + semantic) to handle cases where exact tool names matter.
Practical Setup
Keep a registry of all available tools with their full schemas stored server-side. The search_tools function returns full schemas — not just names — so the model can immediately use whatever it finds without a second lookup.
Add a fallback: if the model tries to call a tool that wasn’t returned by search, return the schema automatically and log the miss. This helps you identify tools that need better descriptions for retrieval.
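A sketch of that fallback, assuming two dicts keyed by tool name: exposed (schemas already returned by search) and registry (everything stored server-side):

import logging

logger = logging.getLogger("tool_search")

def resolve_tool_call(name: str, exposed: dict, registry: dict) -> dict:
    """Resolve a tool call, falling back to the full registry on a miss."""
    if name in exposed:
        return exposed[name]
    if name in registry:
        # Retrieval miss: the tool exists but search never surfaced it.
        # Log it so the tool's description can be improved.
        logger.warning("retrieval miss: %s called but never retrieved", name)
        return registry[name]
    raise KeyError(f"unknown tool: {name}")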
Caveats
Tool search adds latency for the first call. And there’s a risk of retrieval misses — the model searches for “send notification” but the relevant tool is called “push_alert.” Fix this with synonym expansion in your tool descriptions, or by fine-tuning the retrieval on your specific tool vocabulary.
Technique 3: TOON — Tool Output Optimization Notation
TOON (Tool Output Optimization Notation) is a structured approach to formatting MCP tool responses that prioritizes token efficiency without losing information the model actually needs.
The core insight: most tool outputs are formatted for human readability or API compatibility, not for model consumption. TOON reframes outputs around what the model needs to proceed.
TOON Principles
1. Return only requested fields
Instead of returning the full API response object, define output schemas that match what downstream tasks need. If the model is checking whether a user exists, return {"exists": true, "user_id": "abc123"} — not the full user record.
2. Use compact formats over verbose ones
Prefer structured key-value pairs over nested JSON where the structure is implied. Return arrays of values instead of arrays of objects when field names are fixed. A list of 100 email addresses as a JSON array of strings costs a fraction of a list of 100 user objects.
3. Replace prose with codes
For status fields, use short codes instead of full strings. "status": "OPN" vs "status": "The ticket is currently open and awaiting review". Define a code legend in the system prompt once, not in every output.
4. Summarize, don’t dump
For large outputs (documents, logs, datasets), have the tool perform a first-pass summarization server-side rather than returning raw content. The model gets the summary; if it needs more detail, it can request specific sections.
5. Use null compression
Omit null and empty fields entirely. A response with 40 fields where 30 are null wastes 30 field names worth of tokens. Return only populated fields.
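Several of these principles are mechanical enough to apply as a generic post-processing pass on the server. A minimal sketch of null compression, applied recursively to any tool response:

def compress_nulls(value):
    """Recursively drop null and empty fields from a tool response.

    {"a": 1, "b": None, "c": [], "d": {}} -> {"a": 1}
    """
    if isinstance(value, dict):
        cleaned = {k: compress_nulls(v) for k, v in value.items()}
        return {k: v for k, v in cleaned.items() if v not in (None, "", [], {})}
    if isinstance(value, list):
        return [compress_nulls(v) for v in value]
    return value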
TOON in Practice
Here’s a before/after for a calendar event lookup:
Before (standard API response, ~180 tokens):
{
"event_id": "evt_123",
"title": "Q3 Planning Meeting",
"description": "Quarterly planning session with leadership",
"start_time": "2025-08-15T14:00:00Z",
"end_time": "2025-08-15T16:00:00Z",
"attendees": [{"name": "Alice", "email": "alice@co.com", "rsvp": "accepted"}, ...],
"location": "Conference Room B",
"organizer": {"name": "Bob", "email": "bob@co.com"},
"recurrence": null,
"attachments": [],
"reminders": []
}
After (TOON-optimized, ~60 tokens):
{
"id": "evt_123",
"title": "Q3 Planning Meeting",
"time": "2025-08-15 14:00-16:00 UTC",
"location": "Conference Room B",
"attendees": ["alice@co.com", "bob@co.com"]
}
Same information for the task at hand — two-thirds fewer tokens.
Technique 4: Selective Tool Exposure by Context
Not every tool should be available for every task. Structure your MCP server to expose tool subsets based on task context.
Tool Groups and Namespaces
Organize tools into logical groups: crm_tools, email_tools, calendar_tools, data_tools. At session start, the orchestrator determines which groups are relevant and only loads those. A customer service agent never needs the data analysis tools; a reporting agent never needs the email tools.
This is particularly powerful in multi-agent systems where different agents have different roles. Each agent loads only its role-relevant tools.
Dynamic Tool Registration
Go further by making tool exposure conditional on conversation state. After the model has gathered the information it needs from a read operation, unregister the read tools and register the write tools. This constrains the model’s action space and reduces schema tokens simultaneously.
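A sketch of that idea, with illustrative tool names and a two-phase session: read tools are exposed while gathering, then swapped for write tools.

TOOL_GROUPS = {
    "gather": ["get_user", "get_orders", "search_tickets"],  # read tools
    "act": ["send_email", "update_ticket", "create_refund"],  # write tools
}

class Session:
    def __init__(self, registry: dict):
        self.registry = registry  # tool name -> full schema
        self.phase = "gather"

    def list_tools(self) -> list[dict]:
        """Expose only the schemas for the current phase."""
        return [self.registry[name] for name in TOOL_GROUPS[self.phase]]

    def advance(self) -> None:
        """Swap read tools for write tools once gathering is done."""
        self.phase = "act"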
Technique 5: Compress and Cache Repeated Outputs
Many agents call the same tools multiple times within a session, or across sessions, with identical inputs.
Session-Level Caching
Implement a simple request-scoped cache keyed on tool name + serialized inputs. The first call executes; subsequent identical calls return the cached result without re-executing. The model still “sees” the output, but you avoid repeated API calls and — more importantly — repeated large outputs appended to context.
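A minimal sketch of such a cache, assuming JSON-serializable tool inputs and an execute callable that performs the real call:

import json

class SessionCache:
    """Request-scoped cache keyed on tool name + serialized inputs."""

    def __init__(self):
        self._store = {}

    def call(self, name: str, args: dict, execute):
        """Run execute(name, args) once per unique (name, args) pair."""
        key = (name, json.dumps(args, sort_keys=True))
        if key not in self._store:
            self._store[key] = execute(name, args)
        return self._store[key]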
Output Deduplication
When a tool returns the same data multiple times in a conversation (e.g., a user profile fetched in step 2 and again in step 7), replace the second occurrence with a reference: "[See user profile from step 2]". The model understands this; it reduces context bloat by the size of the full output.
Cross-Session Caching
For data that doesn’t change frequently (product catalogs, configuration data, reference tables), cache outputs with a TTL and serve them without round-tripping to the source. This reduces latency on every repeat fetch, and if you store the cached form pre-compressed (TOON-style), it trims token cost as well.
Technique 6: Paginate and Stream Large Results
When a tool might return large datasets, don’t return everything at once.
Pagination
Design tools to accept limit and offset parameters. Default to small page sizes (10–20 items). The model can request additional pages if needed, but most tasks are resolved with the first page.
The key is making this behavior explicit in the tool schema description so the model knows to expect paginated results and how to request more.
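For instance, a paginated tool definition might look like this; the tool and field names are illustrative, and the description tells the model both that results are paginated and how to page:

LIST_ORDERS = {
    "name": "list_orders",
    "description": (
        "List a customer's orders, newest first. Results are paginated "
        "and the first page is usually sufficient; to fetch more, "
        "increase offset by limit. The response includes has_more."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "limit": {"type": "integer", "default": 10, "maximum": 50},
            "offset": {"type": "integer", "default": 0},
        },
        "required": ["customer_id"],
    },
}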
Streaming with Early Termination
For tools that stream results (web searches, file reads, long API calls), allow the model to request early termination once it has enough information. This requires the tool to support interruption, but it’s worth implementing for tools that regularly return far more than needed.
Technique 7: Optimize Tool Schemas Themselves
Tool schemas are fixed costs — they load every time a tool is available. Shrinking them directly reduces baseline token consumption.
Schema Writing Principles
- Name tools concisely but unambiguously. Prefer get_user_by_email over retrieves_user_information_based_on_provided_email_address.
- Write descriptions for the model, not for a human developer. Focus on when to use the tool, not what it does internally.
- Use enums aggressively. If a parameter accepts a fixed set of values, define them explicitly — it prevents hallucination and often reduces the description length needed.
- Remove optional parameters that are rarely used. If a parameter is used <5% of the time, consider removing it from the default schema and offering it through a separate “advanced” tool variant.
Schema Versioning
Maintain a “minimal” and “full” schema version for each tool. Load minimal schemas by default; only load full schemas when the task requires the advanced parameters.
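As an illustrative sketch (tool and parameter names are hypothetical), the two variants can share a base definition, with the full version adding the rarely used knobs:

SEND_EMAIL_MINIMAL = {
    "name": "send_email",
    "description": "Send an email. Use for any outbound message.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}

# The full variant reuses the minimal one and adds the advanced parameters.
SEND_EMAIL_FULL = {
    **SEND_EMAIL_MINIMAL,
    "inputSchema": {
        "type": "object",
        "properties": {
            **SEND_EMAIL_MINIMAL["inputSchema"]["properties"],
            "cc": {"type": "array", "items": {"type": "string"}},
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["to", "subject", "body"],
    },
}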
Technique 8: Use Structured Output Contracts
Instead of leaving tool output format implicit, define explicit contracts between the tool and the model.
Tell the model in the system prompt exactly what format tool outputs will use. This lets you strip field names from repeated outputs — when the model knows the format, positional data works. A list of [id, name, status] tuples costs far fewer tokens than a list of {"id": ..., "name": ..., "status": ...} objects.
This is a form of shared context: establish once what the structure means, then exploit that shared understanding to compress repetitive structured data.
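As a hypothetical example, with the contract “user rows are [id, name, status] triples” declared once in the system prompt, a listing shrinks from:

[
  {"id": "u1", "name": "Alice", "status": "active"},
  {"id": "u2", "name": "Bob", "status": "inactive"}
]

to:

[["u1", "Alice", "active"], ["u2", "Bob", "inactive"]]

The saving grows with row count: the field names are paid for once in the contract instead of once per row.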
Technique 9: Summarize Tool History
In long agentic tasks, the tool call history grows until it dominates the context window.
Rolling Summarization
After every N tool calls (e.g., every 10), generate a summary of what has been accomplished and what was learned. Replace the detailed tool call history with this summary in the context. The model retains the high-level understanding while the token count resets.
This requires careful prompt engineering to ensure the summary captures what the model actually needs to continue the task — missing a key detail in summarization can break downstream reasoning. Test this extensively before deploying.
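A minimal sketch of the rolling pass, assuming a simple role-tagged message list and a summarize callable you supply:

def maybe_compact_history(messages: list[dict], summarize,
                          every: int = 10) -> list[dict]:
    """Collapse tool messages into one recap once there are `every` of them.

    `summarize` is whatever LLM call you use to write the recap; the
    recap must preserve IDs, decisions, and open questions, or
    downstream reasoning will break.
    """
    tool_msgs = [m for m in messages if m.get("role") == "tool"]
    if len(tool_msgs) < every:
        return messages
    recap = summarize(tool_msgs)
    kept = [m for m in messages if m.get("role") != "tool"]
    kept.append({
        "role": "user",
        "content": f"[Summary of {len(tool_msgs)} earlier tool calls: {recap}]",
    })
    return kept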
Checkpointing
For very long tasks, implement explicit checkpoints where state is serialized and the conversation is reset with a clean context containing only the checkpoint state. This prevents context windows from filling indefinitely and is essential for tasks that run over hours or days.
Technique 10: Monitor Token Usage Per Tool
You can’t optimize what you don’t measure. Track token consumption at the tool level.
What to Measure
- Average output token count per tool call
- Frequency of calls per tool per session
- Which tools appear in context most (schema tokens × sessions)
- Token cost of failed tool calls
Using the Data
Sort tools by (avg_output_tokens × call_frequency) to identify your biggest consumers. These are where TOON and output compression will have the most impact. Sort tool schemas by size to find candidates for schema minimization.
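A minimal sketch of that accounting, assuming you can count output tokens per call. Note that average output tokens times call frequency is simply total output tokens per tool, which is what the ranking sorts on:

from collections import defaultdict

stats = defaultdict(lambda: {"calls": 0, "output_tokens": 0})

def record(tool: str, output_tokens: int) -> None:
    """Accumulate per-tool call counts and output token totals."""
    stats[tool]["calls"] += 1
    stats[tool]["output_tokens"] += output_tokens

def biggest_consumers(n: int = 10) -> list:
    """Rank tools by total output tokens (avg per call x call count)."""
    return sorted(stats.items(),
                  key=lambda kv: kv[1]["output_tokens"],
                  reverse=True)[:n]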
Set up alerts when total tool token consumption exceeds thresholds — it’s often the first signal that an agent has gotten into a loop or is making unnecessary calls.
How MindStudio Helps With MCP Token Efficiency
MindStudio supports building agentic MCP servers that expose your AI workflows to other systems — including Claude, Cursor, and other MCP-compatible clients. But where it connects most directly to this article is in how it handles the infrastructure that causes token bloat in the first place.
When you build an agent in MindStudio, you’re working with a visual workflow builder that handles tool orchestration without requiring you to manage raw prompt construction. The platform’s 1,000+ pre-built integrations return clean, structured outputs rather than raw API responses — which means you’re naturally working with more compact data from the start.
For developers building custom MCP setups, MindStudio’s Agent Skills Plugin (@mindstudio-ai/agent) handles the infrastructure layer — rate limiting, retries, auth — so you’re not adding boilerplate to your tool schemas or outputs just to handle operational concerns. Methods like agent.searchGoogle() or agent.runWorkflow() return typed, predictable results rather than raw API payloads.
If you’re building or managing agentic systems and want cleaner tool outputs without reinventing the plumbing, it’s worth exploring. You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is MCP server token usage and why does it matter?
MCP (Model Context Protocol) server token usage refers to the number of tokens consumed by tool definitions, tool call inputs, and tool outputs within an AI agent’s context window. It matters because context windows are finite and expensive — every token spent on tool overhead is a token not available for reasoning, user messages, or task content. In setups with many tools or verbose outputs, this overhead can represent 30–50% of total token consumption.
What is TOON in the context of MCP optimization?
TOON (Tool Output Optimization Notation) is an approach to structuring MCP tool responses to minimize token consumption. Rather than returning full API responses or human-readable prose, TOON outputs are compact, stripped of null fields, and formatted around what the model actually needs to proceed with its task. Applying TOON to tool outputs consistently can cut output token costs by 50–70% without losing task-critical information.
How does code execution reduce MCP token usage?
Code execution consolidates multiple sequential tool calls into a single call. Instead of 10 discrete tool calls — each appending inputs and outputs to the growing context — the model writes a script, executes it once, and receives a single aggregated result. This collapses what might be 5,000+ tokens of intermediate tool history into a single compact output, often reducing total token consumption by 80–90% for data processing tasks.
What is tool search and how does it reduce token consumption?
Tool search is a pattern where, instead of loading all tool schemas upfront, the agent has access to a single search_tools function. It queries this function with a description of what it needs and receives only the relevant tool schemas in response. Since most agents use a small fraction of available tools in any given session, this eliminates the token cost of irrelevant schema loading — often 80–95% of schema tokens in setups with 40+ tools.
How do I know which MCP optimization technique to prioritize?
Start by measuring. Track which tools consume the most output tokens (output volume × call frequency) and which tool schemas are largest. High-output tools are candidates for TOON compression and pagination. Setups with many tools benefit most from tool search. Tasks involving sequential data operations benefit most from code execution. Most production MCP setups benefit from combining several of these techniques rather than relying on one.
Can MCP token optimizations hurt model performance?
Yes, if done incorrectly. Compressing outputs too aggressively can strip information the model needs, leading to incorrect reasoning or repeated tool calls. Rolling summarization of tool history can lose critical details. The risk is manageable through careful testing: measure task completion rates before and after optimization, and monitor for increased error rates or tool call loops, which often signal that the model is missing context. Start with conservative compression and tighten incrementally.
Key Takeaways
- MCP server token consumption comes from three main sources: tool schemas, verbose outputs, and multi-turn tool call history — all of which are optimizable.
- Tool search is typically the highest-leverage single change, capable of cutting schema token consumption by 80–95% in setups with many tools.
- Code execution collapses multi-step sequential operations into single calls, often reducing intermediate output tokens by 80–90%.
- TOON provides a systematic framework for compressing tool outputs — omitting nulls, using compact formats, and returning only task-relevant fields.
- Monitoring token usage per tool is essential for knowing where to focus optimization effort.
- These techniques combine: a production MCP setup using tool search, code execution, TOON outputs, and caching can realistically reduce total token consumption by 90%+ compared to a naive setup.
If you’re building agents or MCP servers and want to start with infrastructure that handles the fundamentals cleanly, MindStudio is worth a look.