Build a Custom CLI That Compresses 132,000 Tokens to 2,000 in Your Claude Context — In 10 Minutes
A School.com CLI built in 10 minutes compressed 132,000 tokens of API data to ~2,000 tokens in Claude's context — a 66x reduction. Here's how to replicate it.
A School.com CLI built in roughly 10 minutes compressed 132,000 tokens of API response data down to approximately 2,000 tokens inside Claude’s context window. That’s a 66x reduction. The agent sent about 260 tokens to School, got back 132,000 tokens of raw data, and none of that payload hit the context window — because the CLI handled the routing and returned only a clean summary.
That number is worth sitting with. If you’re paying per token or managing session limits in Claude Code, a 66x reduction on a single tool call is not a marginal improvement. It’s a different category of efficiency.
The tool that made this possible is Printing Press — a CLI factory plus a library of 50+ pre-built CLIs. The idea is straightforward: instead of hitting an API endpoint and dumping raw JSON into your context, you build a CLI that pre-formats the output, mirrors data locally in SQLite, and returns only what the agent actually needs.
This post is about why that architecture works, how to replicate it, and what it implies for how you should be thinking about tool calls in any Claude-based workflow.
The Problem With How Agents Talk to Tools Right Now
The default mental model for connecting an AI agent to external data is: API call → JSON response → agent reads JSON. This works fine when you’re writing code that processes the response programmatically. It does not work well when the agent is the consumer, because agents pay per token and JSON is verbose.
Consider what a typical API response looks like. You hit a paginated endpoint, you get back a deeply nested JSON body with fields you don’t need, metadata you didn’t ask for, and formatting that was designed for a developer’s eyes, not a language model’s context window. The agent has to read all of it to find the three things it actually cares about.
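The fix a CLI applies is simple to sketch: parse the verbose payload once, keep the three fields that matter, and emit plain text. The snippet below is a minimal illustration, not Printing Press's actual code — the JSON schema and field names (`data.posts`, `title`, `likes`) are hypothetical stand-ins for whatever the real API returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// post holds only the fields the agent actually needs.
// Field names are hypothetical — adapt to the real schema.
type post struct {
	Title string `json:"title"`
	URL   string `json:"url"`
	Likes int    `json:"likes"`
}

// summarize turns a verbose API body into a compact plain-text list,
// discarding metadata, pagination wrappers, and unused fields.
func summarize(raw []byte) (string, error) {
	var body struct {
		Data struct {
			Posts []post `json:"posts"`
		} `json:"data"`
	}
	if err := json.Unmarshal(raw, &body); err != nil {
		return "", err
	}
	out := ""
	for _, p := range body.Data.Posts {
		out += fmt.Sprintf("- %s (%d likes) %s\n", p.Title, p.Likes, p.URL)
	}
	return out, nil
}

func main() {
	raw := []byte(`{"data":{"posts":[{"title":"Launch day","url":"https://example.com/1","likes":42}]}}`)
	s, err := summarize(raw)
	if err != nil {
		panic(err)
	}
	fmt.Print(s)
}
```

The key design point: this formatting decision is baked into the binary, so the agent never sees the raw payload at all.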
MCP servers were supposed to fix this. And they did fix discovery — instead of knowing the exact endpoint, the agent can browse 50 available tools from a server. But that flexibility has a cost: you’re loading descriptions and schemas for all 50 tools whether you use them or not. The benchmark from Printing Press’s own testing puts it plainly: MCP used 35 times more tokens than a CLI on the same task, and reliability dropped from 100% with the CLI to 72% with MCP as task complexity increased.
That reliability gap is the part that doesn’t get discussed enough. It’s not just tokens — it’s that MCP’s overhead creates more surface area for things to go wrong.
Why CLIs Are the Right Abstraction for Agents
A CLI — command line interface — is not a new concept. Git has one. GitHub has one. The Google Cloud SDK has one. What’s new is building CLIs specifically optimized for how language models consume output.
The properties that make CLIs good for agents are:
Lazy discovery. The agent doesn’t load tool schemas upfront. It invokes a command when it needs it. No passive token burn.
Pre-formatted output. Instead of raw JSON, the CLI returns 200 tokens of clean text. The formatting decision is made once, at build time, not at inference time.
Local SQLite mirror. No round trips for repeated queries. No rate limit exposure on every agent turn.
Composability. You can chain CLI commands together the same way you’d pipe Unix commands. The agent can build workflows out of discrete, reliable primitives.
Auth is solved once. The CLI holds the token. The agent doesn’t need to manage credentials in-context.
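A minimal sketch of what an agent-facing CLI looks like, assuming nothing about Printing Press's internals: one binary, subcommand dispatch, and pre-formatted plain-text output. A real CLI would back `fetchGames` with a SQLite mirror and live data; it is hard-coded here so the example is self-contained.

```go
package main

import (
	"fmt"
	"os"
)

type game struct{ Home, Away, Tip string }

// fetchGames would normally query the data source or a local SQLite
// mirror; hard-coded sample data keeps this sketch runnable anywhere.
func fetchGames() []game {
	return []game{
		{"Knicks", "Sixers", "7:00pm ET"},
		{"Spurs", "Timberwolves", "9:30pm ET"},
	}
}

// render emits the compact plain-text summary the agent will read —
// tens of tokens instead of a raw HTML or JSON payload.
func render(games []game) string {
	out := fmt.Sprintf("%d NBA games tonight:\n", len(games))
	for _, g := range games {
		out += fmt.Sprintf("  %s vs %s at %s\n", g.Home, g.Away, g.Tip)
	}
	return out
}

func main() {
	// Lazy discovery in practice: nothing loads until the agent
	// actually invokes a subcommand.
	if len(os.Args) > 1 && os.Args[1] == "games" {
		fmt.Print(render(fetchGames()))
		return
	}
	fmt.Println("usage: espn games")
}
```

Because output is plain text with a stable shape, the agent can also pipe it into other commands — the composability property comes for free from the Unix model.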
The School CLI example illustrates all five. School doesn’t have a public API. The CLI reverse-engineered the endpoints, built a local mirror, and now when Claude asks “grab me the 10 most recent posts in my community,” it gets back a clean summary — not 132,000 tokens of raw HTML and JSON.
If you’re thinking about how to manage context rot in Claude Code, this is the upstream fix. Compacting at 60% capacity is good hygiene, but if individual tool calls are injecting 132,000 tokens into your session, you’re fighting a losing battle. The right move is to prevent the bloat from entering in the first place.
Building the School CLI: What Actually Happened
The build took about 10 minutes using Claude Code with Printing Press installed. The process was roughly:
- Drop three URLs into Claude Code: the Printing Press library, the factory documentation, and the School community page.
- Say: “This is a new tool that has pre-loaded CLIs and helps you build CLIs. Install everything I need.”
- Claude Code installs the factory (which requires Go — a free, open-source language from Google that takes about a minute to set up).
- Ask Claude Code to build a School CLI. Since School has no public API, Claude Code did deep discovery — analyzed the site, reverse-engineered the endpoints, and generated a Go CLI that wraps them.
The output: a working CLI that can pull posts by category, filter by engagement, and return structured summaries. When Claude Code later asked “grab me the three strongest wins from my community,” it used the CLI, got back a clean list with links, and the agent verified each link was real.
The token accounting was explicit: 260 tokens out, 132,000 tokens returned by School’s backend, ~2,000 tokens into Claude’s context. The CLI absorbed the difference.
This is the pattern. You’re not reducing the data School sends — you’re deciding what fraction of it the agent needs to see.
The Catalog and What’s Already Built
Printing Press ships with a library of 50+ pre-built CLIs. The starter pack includes ESPN, Flight Goat, Movie Goat, and Recipe Goat. There are also CLIs for Amazon, Craigslist, eBay, TikTok Shops, Shopify, Airbnb, and a contact lookup tool that cross-references LinkedIn and email verification services.
The ESPN CLI is a good example of why this matters. ESPN has no public API for real-time game data. The CLI handles the scraping, formats the output, and returns something like: “Two NBA games tonight: Knicks vs Sixers at 7pm ET, Spurs vs Timberwolves at 9:30pm ET.” That’s maybe 30 tokens. The raw HTML of ESPN’s game page would be thousands.
The Hacker News CLI demo from the source material shows the build process for a site that does have a public API. Even there, the CLI is worth building: it returns the top 10 stories with 100+ points from the last 24 hours, ranked by score, in a format the agent can immediately use. No pagination handling, no JSON parsing, no schema negotiation. The agent just asks “which sites are dominating Hacker News today” and gets a clean answer.
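The filter-and-rank step that CLI performs is a few lines of Go. This is a sketch of the idea, not the generated CLI's code: it runs on sample data so it works offline, where the real tool would first fetch stories from Hacker News's public API.

```go
package main

import (
	"fmt"
	"sort"
)

// story holds the fields a Hacker News item exposes that the agent
// cares about; sample data below stands in for live API responses.
type story struct {
	Title string
	Score int
	URL   string
}

// topStories keeps stories with at least minScore points and returns
// up to limit of them ranked by score, highest first — the filtering
// the CLI does so the agent never parses raw JSON or pagination.
func topStories(all []story, minScore, limit int) []story {
	var kept []story
	for _, s := range all {
		if s.Score >= minScore {
			kept = append(kept, s)
		}
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	if len(kept) > limit {
		kept = kept[:limit]
	}
	return kept
}

func main() {
	sample := []story{
		{"Show HN: my side project", 85, "https://example.com/a"},
		{"A new systems language", 412, "https://example.com/b"},
		{"Postgres tuning notes", 190, "https://example.com/c"},
	}
	for i, s := range topStories(sample, 100, 10) {
		fmt.Printf("%d. %s (%d points) %s\n", i+1, s.Title, s.Score, s.URL)
	}
}
```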
For anyone building agent workflows that touch multiple data sources, the mental model should be: CLI first, API second, MCP third. If a CLI exists, use it. If an API exists, build a CLI around it. MCP is the fallback when you need dynamic tool discovery and you’ve accepted the token overhead.
The Broader Architecture Implication
Here’s the thing that’s easy to miss: this isn’t just about saving tokens on individual calls. It’s about what becomes possible when tool calls are cheap.
If a single School API call costs 132,000 tokens, you can afford maybe one or two per session before you’re in context trouble. If it costs 2,000 tokens, you can afford dozens. That changes what kinds of workflows are viable. Agents that need to cross-reference multiple data sources, run iterative queries, or maintain state across many tool calls become practical instead of prohibitively expensive.
The self-improving agent architectures — like what Hermes Agent does with its skill system, or what you’d build with Claude Code’s AutoResearch loop for self-improving skills — depend on cheap, reliable tool calls. If every external data fetch is burning 100k+ tokens, the agent spends most of its context budget on data retrieval rather than reasoning.
This is also where the Claude Code skills architecture becomes relevant. A skill that wraps a CLI is a fundamentally different artifact than a skill that wraps an API call. The CLI skill is fast, predictable, and token-efficient. You write it once, the output format is stable, and the agent can invoke it confidently across sessions.
Platforms like MindStudio handle this kind of orchestration at a higher level — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — but the underlying principle is the same: the interface between your agent and your data sources determines how much of your context budget goes to retrieval versus reasoning.
Security and Sharing
One detail worth being explicit about: CLIs should not embed credentials. The same rule that applies to API keys in agent workflows applies here — store tokens in environment variables or a secrets manager, not in the CLI script itself.
The Tally CLI example from the source material shows the right pattern: build the CLI, wrap it in a skill, push the skill to a private GitHub repo, invite team members as contributors, and have each person supply their own API key in their local environment. The CLI logic is shared; the credentials are not.
This also means CLIs are composable across teams in a way that raw API integrations often aren’t. You can publish a CLI to the Printing Press library and other people can pull it in and use it with their own auth. The interface is standardized; the credentials are personal.
For teams building internal tooling, this is a meaningful workflow. If you’re thinking about how to build a skill-based content system in Claude Code, wrapping your data sources in CLIs before building skills on top of them gives you a much cleaner foundation. The skill doesn’t need to know anything about pagination, rate limits, or response schemas — it just calls the CLI and gets clean output.
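Under the hood, a skill wrapping a CLI reduces to: invoke the binary, capture stdout, hand the clean text to the agent. The sketch below uses `echo` as a stand-in for a real CLI binary so it runs anywhere; the commented call shows the hypothetical shape of a real invocation.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runCLI invokes a binary with arguments and returns its trimmed
// stdout — the entire interface a skill needs. No pagination, rate
// limits, or response schemas leak through.
func runCLI(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).Output()
	if err != nil {
		return "", fmt.Errorf("%s failed: %w", name, err)
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// A real skill would call something like (hypothetical command):
	//   runCLI("school", "posts", "--limit", "10")
	// `echo` stands in here so the sketch is runnable without the binary.
	out, err := runCLI("echo", "2 NBA games tonight")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```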
What to Build First
If you want to replicate the School CLI result, the fastest path is:
- Install Go (one command, Claude Code will handle it if you ask).
- Install the Printing Press starter pack from printingpress.dev.
- Install the factory (the component that lets you build custom CLIs).
- Pick one data source you use regularly that has a messy API or no API at all.
- Drop the URL into Claude Code and say: “Use the Printing Press CLI factory to build me a CLI for this. Do deep discovery first, then generate the Go CLI, then verify it works.”
The build will take 10-30 minutes depending on complexity. The result is a binary that lives locally, returns clean output, and can be wrapped in a Claude Code skill for natural language invocation.
The token math compounds quickly. If you have three data sources each returning 50,000+ tokens per call, and you convert all three to CLIs that return 2,000 tokens each, you’ve freed up roughly 144,000 tokens per session. That’s the difference between a session that hits context limits after a few tool calls and one that can run complex multi-step workflows without compaction.
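The arithmetic behind that claim, spelled out — three sources, raw payloads of 50,000 tokens each, CLI summaries of 2,000 tokens each:

```go
package main

import "fmt"

// tokensFreed computes per-session savings when `sources` data
// sources that each return rawTokens are wrapped in CLIs that each
// return cliTokens instead.
func tokensFreed(sources, rawTokens, cliTokens int) int {
	return sources * (rawTokens - cliTokens)
}

func main() {
	// Three sources: 3 * (50,000 - 2,000) = 144,000 tokens freed.
	fmt.Println(tokensFreed(3, 50000, 2000))
}
```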
Tools like Remy take a related approach at the application layer — you write an annotated markdown spec and it compiles into a complete TypeScript backend, SQLite database, auth, and deployment. The spec is the source of truth; the generated code is derived output. The abstraction is different, but the underlying logic is the same: the interface you write should be optimized for human intent, and the system should handle the translation to whatever the underlying infrastructure needs.
The School CLI is a small example of a large principle. Every token your agent spends reading raw API responses is a token it’s not spending on reasoning. The right architecture makes data retrieval cheap so that reasoning can be rich. A 66x compression ratio on a single tool call is not a trick — it’s what happens when you design the interface correctly.
One More Thing About Reliability
The 100% vs 72% reliability gap between CLI and MCP on harder tasks deserves more attention than it usually gets.
The intuition is that as tasks get more complex, the agent needs to make more decisions about which tools to call and how to interpret their outputs. MCP’s overhead — the token cost of loading tool schemas, the ambiguity of dynamic discovery — introduces more failure modes. The CLI’s simplicity is a feature, not a limitation.
When you’re building agent workflows that need to run reliably — scheduled crons, automated reporting, anything that runs without human supervision — reliability is the primary constraint. A workflow that works 72% of the time is not a workflow you can trust. A workflow that works 100% of the time is infrastructure.
The Claude Code context management practices that help you extend session limits are valuable, but they’re downstream of this. If your tool calls are reliable and token-efficient, you have more headroom for everything else. If they’re not, you’re optimizing the wrong layer.
Build the CLI. The 10 minutes is worth it.