MCP Servers Use 35x More Tokens Than CLI Tools — And Reliability Drops to 72% on Hard Tasks
A direct benchmark shows MCP uses 35x more tokens than CLI on the same task, with reliability falling from 100% to 72% as complexity grows. Use CLIs instead.
MCP Burned 35x More Tokens Than CLI on the Same Task
Someone ran the same agentic task through an MCP server and a CLI tool and measured what came back. MCP used 35 times more tokens. Reliability on harder tasks dropped from 100% to 72%. That benchmark should change how you architect agent tooling, and most builders haven’t seen it yet.
The finding is specific enough to act on: MCP servers, as currently implemented, carry enormous token overhead compared to equivalent CLI tools. On simple tasks, both approaches work. As task complexity grows, the CLI holds at 100% reliability while MCP degrades. You’re paying 35x more for a worse outcome on the tasks that matter most.
This isn’t a knock on the MCP protocol as a concept. It’s a measurement of what happens in practice when you route tool calls through a protocol layer that wasn’t designed with token economy as a first-order constraint.
Why This Surprised People Who Should Have Seen It Coming
The appeal of MCP is obvious. You get a standardized interface for connecting models to tools. You write the server once, and any MCP-compatible client can call it. The protocol handles discovery, schema negotiation, and invocation. That’s genuinely useful infrastructure.
The problem is that “standardized interface” has a cost. Every MCP call involves schema transmission, capability negotiation, and structured response wrapping. The protocol is verbose by design — it needs to be self-describing so clients can discover what tools exist and how to call them. That verbosity is the feature. It’s also the tax.
CLI tools don’t carry that overhead. You call a binary. It returns output. The model sees the result. There’s no schema preamble, no capability manifest, no structured envelope. The information density per token is much higher.
This is the same tradeoff you see in token management for Claude Code sessions — every layer of abstraction you add between the model and the actual work costs tokens. Sometimes that cost buys you something real. Sometimes it buys you a protocol handshake the model didn’t need.
The 35x figure is the gap between those two worlds, measured empirically.
What the Numbers Actually Show
The benchmark compared MCP servers against CLI tools on identical tasks, measuring token consumption and task completion rate across a range of complexity levels.
On simple tasks, both approaches completed reliably. The token gap was already present — MCP was more expensive — but both got the job done. This is the regime where most demos live. You show the agent calling a tool, the tool returns a result, the agent uses it. It works. You ship.
The divergence shows up when tasks get harder. CLI tools held at 100% completion. MCP dropped to 72%. That 28-percentage-point reliability gap is not a rounding error. If your workflow chains several hard tool-backed steps, a 72% success rate at each one compounds into a workflow that fails most of the time.
The token cost compounds too. At 35x overhead per call, a workflow that makes ten tool calls through MCP costs the equivalent of 350 CLI calls. That’s not a theoretical concern — it’s the difference between a workflow that fits in a context window and one that doesn’t, between a cost-effective agent and one that burns through budget on protocol overhead.
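To make the compounding concrete, here is a back-of-envelope sketch using the benchmark's headline numbers (the 35x overhead and 72% hard-task success rate come from the result above; the per-call token count and step counts are illustrative assumptions, not measurements).

```python
# Back-of-envelope: how the benchmark's headline numbers compound.
# 35x overhead and 72% success are from the benchmark; the base token
# cost per CLI call and the step counts are illustrative assumptions.

CLI_TOKENS_PER_CALL = 200          # assumed cost of one thin CLI call
MCP_OVERHEAD = 35                  # measured token multiplier
HARD_STEP_SUCCESS = 0.72           # measured completion rate on hard tasks

for steps in (5, 10, 20):
    cli_tokens = steps * CLI_TOKENS_PER_CALL
    mcp_tokens = cli_tokens * MCP_OVERHEAD
    workflow_success = HARD_STEP_SUCCESS ** steps
    print(f"{steps:>2} steps: CLI ~{cli_tokens:,} tok, "
          f"MCP ~{mcp_tokens:,} tok, "
          f"chance every hard step succeeds: {workflow_success:.0%}")

# At 10 steps: ~2,000 tokens via CLI vs ~70,000 via MCP,
# and roughly a 4% chance that all ten hard steps succeed at 72% each.
```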
The model routing recommendation that’s been circulating in the agent builder community makes this concrete: use a local Gemma-class model for cheap background classification, GPT-5.5 or Codex for hard implementation work, Claude API for high-judgment architectural decisions, and cheaper hosted models for bulk summarization. That routing logic only makes sense if you’re also controlling the tool call overhead at each step. Routing to a cheap model and then paying MCP overhead on every tool call defeats the purpose.
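A minimal sketch of that routing heuristic is below. The model tiers mirror the recommendation as described in this article; the task labels, dictionary, and function are hypothetical illustrations, not a real routing API.

```python
# Sketch of the routing heuristic described above. Tier names follow the
# article's recommendation; task labels and the lookup are hypothetical.

ROUTES = {
    "background_classification": "local-gemma",      # cheap, runs locally
    "bulk_summarization":        "cheap-hosted",      # cheap, hosted
    "implementation":            "gpt-5.5-or-codex",  # hard coding work
    "architecture_decision":     "claude-api",        # high-judgment calls
}

def pick_model(task_kind: str) -> str:
    """Route a task to the cheapest tier that can handle it."""
    return ROUTES.get(task_kind, "claude-api")  # default to the high-judgment tier

# The routing only pays off if tool-call overhead is controlled at each step:
# a local classifier that triggers 35x-overhead MCP calls is not a cheap step.
```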
The Mechanism Behind the Gap
Understanding why MCP is more expensive helps you decide when the cost is worth paying.
An MCP server exposes tools through a structured protocol. When a model wants to call a tool, it first needs to know what tools are available. The server sends a capability manifest. The model reads it. Then the model constructs a structured invocation. The server processes it and returns a structured response. The model parses the response.
Every one of those steps involves tokens. The capability manifest alone can be substantial if you have many tools with rich descriptions. The structured invocation format adds tokens compared to a raw command. The structured response wraps the actual output in metadata the model has to process.
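The difference in per-call envelopes is easy to see side by side. MCP does use JSON-RPC messages such as tools/call, but the payloads below are illustrative, and the whitespace-split "token" count is only a crude stand-in for a real tokenizer, so treat this as a sketch of the shape of the overhead rather than a measurement.

```python
import json

# Illustrative per-call envelopes: an MCP tools/call exchange vs. a raw CLI
# exchange. Payload contents are made up; token counting is a crude proxy.

mcp_request = {
    "jsonrpc": "2.0", "id": 7, "method": "tools/call",
    "params": {"name": "search_logs",
               "arguments": {"query": "timeout", "limit": 20}},
}
mcp_response = {
    "jsonrpc": "2.0", "id": 7,
    "result": {"content": [{"type": "text",
                            "text": "3 matching lines ..."}],
               "isError": False},
}

cli_command = "grep -m 20 'timeout' app.log"
cli_output = "3 matching lines ..."

def rough_tokens(s: str) -> int:
    return len(s.split())  # crude proxy, not a real tokenizer

mcp_cost = rough_tokens(json.dumps(mcp_request)) + rough_tokens(json.dumps(mcp_response))
cli_cost = rough_tokens(cli_command) + rough_tokens(cli_output)
print(mcp_cost, cli_cost)
# The structured envelope alone is several times the CLI exchange, before
# counting the capability manifest the model reads up front.
```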
CLI tools skip most of this. The model knows the command syntax because it’s in the system prompt or CLAUDE.md. It constructs a command string. The tool runs. The output comes back as text. The model reads the text. That’s it.
This is why the WAT framework for Claude Code projects — which separates Workflows, Agents, and Tools into distinct layers — recommends keeping tool interfaces as thin as possible. The tool layer should do one thing and return clean output. The more you add to the tool interface, the more tokens every invocation costs.
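Here is what a "thin" tool looks like in practice: one job, plain-text output, no envelope. The specific tool (listing matching lines in a repo) is a hypothetical example, not part of the WAT framework itself.

```python
#!/usr/bin/env python3
# Minimal sketch of a thin CLI tool: one job, clean text out, no wrapper.
# The tool itself (grep for a pattern in a git repo) is a hypothetical example.

import subprocess
import sys

def main() -> int:
    pattern = sys.argv[1] if len(sys.argv) > 1 else "TODO"
    # git grep exits 1 when nothing matches; treat that as empty output, not an error.
    proc = subprocess.run(["git", "grep", "-n", pattern],
                          capture_output=True, text=True)
    if proc.returncode not in (0, 1):
        print(proc.stderr.strip(), file=sys.stderr)
        return proc.returncode
    print(proc.stdout.strip())   # raw "path:line:match" lines, nothing else
    return 0

if __name__ == "__main__":
    sys.exit(main())
```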
The reliability gap is related but distinct. MCP’s structured invocation format gives the model more ways to get the call wrong. A malformed JSON invocation fails. A missing required field fails. A type mismatch fails. CLI tools are more forgiving — the model generates a command string, and if it’s close enough, the tool runs. The failure modes are different, and on harder tasks where the model is already working near the edge of its capability, the stricter failure modes of MCP compound the problem.
When MCP Is Still the Right Choice
The benchmark doesn’t mean you should rip out every MCP integration. It means you should be deliberate about when you use it.
MCP makes sense when tool discovery is genuinely dynamic — when you don’t know at build time what tools will be available, and the model needs to discover and adapt at runtime. It makes sense when you’re building a client that needs to work with many different tool providers without custom integration code for each. It makes sense when the standardization benefit outweighs the token cost, which is more likely in interactive use cases than in high-volume automated workflows.
For automated workflows running at scale, the calculus usually goes the other way. You know exactly what tools you need. You can define their interfaces precisely in a system prompt. You can write thin CLI wrappers that return clean text output. The model calls them directly. You pay no protocol overhead.
This is the same reasoning behind Andrej Karpathy’s LLM wiki approach, which cuts token use by up to 95% on small knowledge bases compared to RAG. The more you can front-load information into the context in a dense, directly usable form, the less overhead you pay per operation. MCP is the RAG of tool calling — powerful and general, but expensive when you could use something more targeted.
The practical heuristic: if you’re building a tool integration that will be called hundreds or thousands of times in automated workflows, write a CLI wrapper. If you’re building a general-purpose agent that needs to discover and use tools it hasn’t seen before, MCP is appropriate. Most production workflows fall into the first category.
Designing for Token Economy From the Start
The benchmark result is a forcing function for a design question you should be asking earlier: what is the token cost of each component in my workflow, and is that cost justified by what it buys?
This question applies beyond tool calling. It applies to memory retrieval, context construction, and model selection at each step. The model routing recommendation — local Gemma for classification, GPT-5.5 for implementation, Claude API for architectural judgment — is an expression of the same principle. Match the cost of the reasoning to the value of the output.
Memory architecture is part of this too. The OpenBrain project’s memory provenance labels — observed from source, confirmed by user, inferred by model, imported from transcript — exist partly for trust reasons, but also for retrieval efficiency. If you know a memory was directly observed from a source, you can retrieve it with high confidence and use it without re-verification. If it was inferred by a model, you might want to re-check it before acting on it. That distinction affects how much context you need to carry and how many tokens you spend on verification.
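A sketch of how those provenance labels could gate verification is below. The four labels mirror the ones listed above; the verification policy and function are assumptions for illustration, not OpenBrain's actual implementation.

```python
from enum import Enum

# Sketch of provenance-gated verification. The four labels follow the
# article's list; the recheck policy is an assumed illustration.

class Provenance(Enum):
    OBSERVED_FROM_SOURCE = "observed"
    CONFIRMED_BY_USER = "confirmed"
    INFERRED_BY_MODEL = "inferred"
    IMPORTED_FROM_TRANSCRIPT = "imported"

# Only weakly grounded memories pay the extra verification tokens.
NEEDS_RECHECK = {Provenance.INFERRED_BY_MODEL, Provenance.IMPORTED_FROM_TRANSCRIPT}

def use_memory(text: str, provenance: Provenance) -> str:
    if provenance in NEEDS_RECHECK:
        return f"[verify before acting] {text}"
    return text
```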
Platforms like MindStudio handle this orchestration layer — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which means the routing and tool-call decisions are configurable without rewriting integration code every time you want to swap a component.
The broader principle is that token economy is an architectural concern, not an optimization you add later. The 35x overhead of MCP versus CLI isn’t something you tune away with prompt engineering. It’s baked into the protocol choice. You have to make the right choice at design time.
The Reliability Problem Is Harder Than the Cost Problem
The 72% reliability figure deserves more attention than it usually gets, because it’s harder to fix than the token cost.
You can reduce token costs by switching from MCP to CLI. You can’t easily fix a 28-point reliability gap without understanding why it exists. And on harder tasks, that gap is the difference between a workflow that works and one that doesn’t.
The reliability drop likely comes from a combination of factors. Harder tasks push the model closer to its capability limits, which means it makes more mistakes in general. MCP’s structured invocation format provides more opportunities for those mistakes to manifest as hard failures rather than recoverable errors. And the higher token cost of MCP means harder tasks are more likely to approach context limits, which degrades model performance further.
This is why the Claude Code agentic workflow patterns that hold up in production tend to have explicit retry logic, clear failure modes, and checkpointing. A workflow that assumes 100% tool call success will fail in production. A workflow designed around 72% success needs to handle failures gracefully, which adds complexity and more tokens.
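A minimal sketch of that pattern, retries plus a checkpoint file so a partially failed run can resume, is shown below. The run_tool callable and step names are hypothetical placeholders; the structure is the point.

```python
import json
import time
from pathlib import Path

# Minimal retry-and-checkpoint sketch for a tool-calling workflow step.
# run_tool() and the step names are hypothetical placeholders.

CHECKPOINT = Path("workflow_state.json")

def load_done() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def run_step(name: str, run_tool, max_retries: int = 3) -> str:
    done = load_done()
    if name in done:                       # completed on a previous run; skip
        return done[name]
    for attempt in range(1, max_retries + 1):
        try:
            result = run_tool()
            done[name] = result
            CHECKPOINT.write_text(json.dumps(done))   # checkpoint after each success
            return result
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)       # back off before retrying
    raise RuntimeError("unreachable")
```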
The cleaner solution is to use tool interfaces that don’t introduce unnecessary failure modes. CLI tools with well-defined output formats, called through a system prompt that gives the model clear examples, tend to be more reliable than MCP invocations on the same underlying functionality. You’re reducing the surface area for failure.
If you’re building full-stack tooling around these workflows, the same principle applies at the application layer. Remy takes a spec-driven approach — you write annotated markdown describing what the application should do, and it compiles that into a complete TypeScript backend, database, auth, and deployment. The spec is the source of truth; the generated code is derived output. That’s a different kind of token economy: you’re spending tokens on the spec once, not on re-describing the system to the model on every invocation.
What to Do This Week
If you have production workflows using MCP servers, run the same benchmark on your own tasks. Measure token consumption per tool call and completion rate across simple and complex tasks. The 35x figure is an average — your specific tools and tasks may be better or worse.
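A rough harness for that measurement is sketched below. The run_task callable stands in for your own agent runner and is assumed to report total tool-call tokens and success per task; wire in your real tokenizer and task set.

```python
from statistics import mean

# Rough measurement harness for the comparison suggested above.
# run_task() is a placeholder for your agent runner; it is assumed to
# return (tool_call_tokens, succeeded) for each task.

def benchmark(tasks, run_task, label):
    results = [run_task(t) for t in tasks]
    tokens = [tok for tok, _ in results]
    completed = [ok for _, ok in results]
    print(f"{label}: avg tool-call tokens {mean(tokens):,.0f}, "
          f"completion rate {sum(completed) / len(completed):.0%}")

# benchmark(hard_tasks, run_with_mcp, "MCP")
# benchmark(hard_tasks, run_with_cli, "CLI")
```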
For any tool that gets called more than a few times per workflow run, ask whether a CLI wrapper would serve the same purpose with lower overhead. In most cases, it will. Write the wrapper, update your system prompt with the command syntax and expected output format, and measure the difference.
For new workflows, default to CLI tools unless you have a specific reason to need MCP’s dynamic discovery capabilities. The protocol overhead is a cost you pay on every call, and on hard tasks, it compounds into reliability problems you don’t want to debug in production.
The model routing recommendation — local Gemma for classification, GPT-5.5 for implementation, Claude API for judgment, cheap hosted models for bulk work — only delivers its cost savings if you’re also controlling tool call overhead at each step. A cheap classification model calling expensive MCP servers is still an expensive classification step.
Token economy is the constraint that shapes everything else in agent architecture. The benchmark just made that constraint visible.