How to Use Prompt Caching and Token Management in Claude Code Dynamic Workflows

Why Dynamic Workflows Burn Through Tokens (and What to Do About It)

Claude-powered dynamic workflows are genuinely powerful — especially when you chain multiple agents together to handle complex, multi-step tasks. But there’s a catch: token costs in these workflows don’t scale linearly. They compound. Each agent call carries context, each sub-agent processes input, and if you’re not deliberate about how you structure prompts and model selection, you can watch your API bill climb fast.

This guide covers the practical mechanics of token management in Claude workflows — specifically prompt caching, model tiering with Haiku, and bounding scope before it gets away from you. Whether you’re building with Claude Code directly or orchestrating agents through a platform, these techniques apply.

Understanding How Token Costs Compound in Multi-Agent Setups

In a simple single-turn Claude interaction, token math is straightforward: input tokens + output tokens = cost. Dynamic workflows break that math.

When you have an orchestrator agent delegating tasks to sub-agents, each sub-agent call typically includes:

The original system prompt (or a version of it)
Relevant context from prior steps
Tool definitions
The actual task instruction

If your system prompt is 2,000 tokens and you run 10 sub-agent calls, that’s 20,000 tokens just for context that doesn’t change between calls. Add in tool schemas — which can be verbose — and you’re looking at significant overhead on every single invocation.

Where the Waste Actually Happens

The most common sources of unnecessary token spend in dynamic workflows:

Redundant context passing — sending the full conversation history to every sub-agent when only a slice is relevant
Repeated system prompts — re-sending identical instructions on every call without caching
Model mismatch — using Claude Sonnet or Opus for tasks that don’t require that level of reasoning
Unbounded output length — not setting max_tokens explicitly and letting models run long on simple tasks
Bloated tool definitions — including every possible tool in every agent call instead of scoping to what’s needed

Remy doesn't build the plumbing. It inherits it.

Other agents wire up auth, databases, models, and integrations from scratch every time you ask them to build something.

WHAT REMY DOESN'T HAVE TO BUILD

200+

AI MODELS

GPT · Claude · Gemini · Llama

✓

1,000+

INTEGRATIONS

Slack · Stripe · Notion · HubSpot

✓

MANAGED DB

AUTH

PAYMENTS

CRONS

Remy ships with all of it from MindStudio — so every cycle goes into the app you actually want.

How Prompt Caching Works in Claude

Prompt caching is one of the most underused cost-reduction tools in Claude’s API. The concept is simple: if you’re sending the same large block of content repeatedly (a system prompt, a document, a set of tool definitions), Claude can cache that content and skip re-processing it on subsequent calls.

The Mechanics

You enable caching by adding a cache_control parameter with "type": "ephemeral" to specific content blocks in your API request. Claude will cache everything up to that breakpoint.

Here’s what the structure looks like:

{
  "system": [
    {
      "type": "text",
      "text": "You are a code review assistant...[long system prompt]",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [...]
}

On the first call, Claude writes the cache. On subsequent calls within the cache window, it reads from cache instead of reprocessing.

What It Actually Costs

Cache writes cost slightly more than standard input tokens — typically around 25% more. But cache reads cost significantly less — roughly 90% less than standard input token pricing. So the economics work like this: pay a small premium once to write the cache, then save substantially on every subsequent read.

The cache TTL is 5 minutes by default. You can extend this — Anthropic supports a 1-hour cache duration — which matters a lot for workflows that run across longer time windows.

Minimum Token Requirements

Not every prompt qualifies for caching. Claude requires a minimum number of tokens before it will cache a block:

Claude 3.5 Sonnet and Claude 3.5 Haiku: 1,024 tokens minimum
Claude 3 Opus: 2,048 tokens minimum

If your system prompt is short, caching won’t kick in. This is actually a nudge toward being more thorough in your system prompts when you know they’ll be reused — consolidating instructions, tool context, and relevant background into a single cacheable block pays off.

Where to Place Cache Breakpoints

You can use up to four cache breakpoints per request. Strategically, you want to place them at natural boundaries where content is stable across calls:

End of the system prompt — almost always worth caching if it’s long
After tool definitions — tool schemas can be hundreds of tokens each and rarely change mid-workflow
After large static documents — if every agent call needs to reference the same knowledge base or codebase snippet, cache it
After conversation history that’s “settled” — in long-running sessions, earlier turns in the conversation don’t change

Using Claude Haiku for Sub-Agents

Model selection is the single biggest lever for controlling costs in multi-agent workflows. Not every task in a workflow requires the same reasoning capability, and pricing differences between models are significant.

Claude 3.5 Haiku is notably cheaper per token than Claude 3.5 Sonnet, which is itself cheaper than Claude 3 Opus. For many sub-agent tasks, Haiku is more than capable — and using it where appropriate can cut your workflow costs substantially.

Tasks Where Haiku Performs Well

Classification and routing — determining which agent or path to invoke next
Simple extraction — pulling structured data from text, converting formats
Summarization of sub-results — condensing an agent’s output before passing it upstream
Tool call formatting — constructing the exact JSON payload a tool expects
Validation steps — checking whether output meets a simple condition

Other agents start typing. Remy starts asking.

YOU SAID "Build me a sales CRM."

REMY ASKS

01 DESIGN Should it feel like Linear, or Salesforce?

02 UX How do reps move deals — drag, or dropdown?

03 ARCH Single team, or multi-org with permissions?

Scoping, trade-offs, edge cases — the real work. Before a line of code.

Tasks Where You Actually Need Sonnet or Opus

Complex multi-step reasoning — tasks where getting the logic wrong cascades into downstream errors
Code generation requiring correctness — writing logic that will actually run, not just look plausible
Nuanced judgment calls — situations where context, ambiguity, and trade-offs matter
Orchestration decisions — the top-level agent deciding how to decompose a complex task

How to Build a Tiered Model Architecture

The practical approach is to define a model tier per agent role in your workflow:

Orchestrator → Claude Sonnet (reasoning, decomposition)
  ├── Router Agent → Claude Haiku (classify, delegate)
  ├── Extraction Agent → Claude Haiku (parse, format)
  ├── Analysis Agent → Claude Sonnet (interpret results)
  └── Summary Agent → Claude Haiku (compress, consolidate)

This isn’t a theoretical exercise. In a workflow with 15 sub-agent calls, if 10 of those are Haiku-appropriate tasks, you’re running the cheaper model for two-thirds of your volume.

Bounding Scope Before Costs Spiral

Caching and model tiering help, but they’re optimizations on top of a baseline. The baseline itself needs to be sound. Unbounded workflows — where context accumulates without limits, outputs run unconstrained, and tool sets are maximalist — will overwhelm any optimization you layer on top.

Set Explicit Output Limits

Every Claude API call should have an explicit max_tokens value. Don’t leave it at the model default. For sub-agents doing narrow tasks:

Routing decisions: 50–100 tokens
Extraction tasks: 200–500 tokens
Analysis summaries: 500–1,000 tokens
Full generation tasks: set a ceiling based on what you actually need

This is especially important for Haiku-tier sub-agents doing simple work. A routing agent that returns “TASK_TYPE: extraction” doesn’t need 4,096 tokens of headroom.

Trim Context Before Passing It

Sub-agents rarely need the full conversation history. Before invoking a sub-agent, extract only the relevant slice of context it needs. A few patterns that work well:

Summary compression — run a cheap Haiku call to summarize prior steps into 200 tokens before passing to the next agent
Structured handoffs — pass a structured object (task description, relevant prior outputs, required format) rather than raw message history
Sliding window — keep only the last N exchanges in the context passed to sub-agents

Scope Tool Definitions Per Agent

Tool schemas can be surprisingly token-heavy, especially when you’re using multiple tools with detailed descriptions and parameter schemas. Don’t include the full tool set in every agent call. Instead:

Define tool subsets per agent role
Pass only the tools an agent is authorized to use for its specific task
Cache tool definitions alongside system prompts when they’re stable

Use Structured Outputs to Reduce Noise

When you need a specific format, ask for it explicitly and enforce it with JSON mode or structured output schemas. This reduces the likelihood of a model producing verbose reasoning when you just need a clean result, which keeps output tokens down and makes downstream parsing cheaper and more reliable.

Practical Caching Patterns for Common Workflow Types

Different workflow architectures have different caching opportunities. Here are the patterns that pay off most in practice.

Document Processing Pipelines

One coffee. One working app.

You bring the idea. Remy manages the project.

WHILE YOU WERE AWAY

✓Designed the data model

✓Picked an auth scheme — sessions + RBAC

✓Wired up Stripe checkout

✓Deployed to production

Live at yourapp.msagent.ai

If your workflow processes a large document (a contract, a codebase, a research paper) through multiple analysis steps, cache the document content once and run multiple passes against the cache.

Structure it so the document appears before the cache breakpoint, and each agent call changes only the task instruction, not the document itself. This is the scenario where caching delivers the most dramatic cost reduction.

Recurring Scheduled Workflows

Workflows that run on a schedule — daily reports, monitoring agents, regular data processing — often have nearly identical system prompts and tool definitions across runs. If your workflow runs every hour, set your cache TTL to 1 hour and structure your calls so the stable content lands before a cache breakpoint. You’ll pay for one cache write per window and get multiple reads.

Long-Running Interactive Sessions

For workflows where a user interacts over multiple turns, the conversation history grows with each turn. As history accumulates, add a cache breakpoint at the end of the “settled” history — the turns that won’t change. New turns get processed fresh; the history gets read from cache.

This is especially useful when conversation history is long but the recent turns are what actually drives the current task.

Where MindStudio Fits Into This Picture

If you’re building Claude-powered workflows and don’t want to manage all of this infrastructure manually, MindStudio handles a lot of it at the platform level.

MindStudio’s visual workflow builder lets you route tasks to different models — you can assign Claude Haiku to simple sub-agent steps and Claude Sonnet to the reasoning-heavy ones without writing orchestration code. The model selection is per-step, not global, which maps directly to the tiered model architecture described above.

For teams building multi-agent systems with 200+ available models (including the full Claude lineup), MindStudio abstracts away the API key management, rate limiting, and retry logic that would otherwise sit between your workflow logic and the models. That’s the kind of infrastructure overhead that, when you’re managing it yourself, tends to introduce complexity that makes optimization harder.

You can also use the Agent Skills Plugin — an npm SDK — to call MindStudio capabilities directly from Claude Code or other agent frameworks. This means your Claude Code agent can delegate specific tasks (like sending emails, running searches, or triggering other workflows) without those tasks expanding your Claude token footprint unnecessarily.

If you’re iterating on multi-agent Claude workflows and want to test model tiering quickly, MindStudio is free to start at mindstudio.ai.

Common Mistakes That Inflate Costs

Even with caching and tiering in place, a few patterns tend to undo the savings.

Over-Fetching Context at Every Step

If every agent call retrieves the full knowledge base “just in case,” you’re burning tokens on context that 80% of calls don’t use. Be explicit about what each step needs and fetch or pass only that.

Ignoring Cache Invalidation

Adding a cache breakpoint isn’t permanent — if the content before the breakpoint changes (even slightly), the cache misses and you write a new one. Dynamic content, timestamps, or user-specific data injected before a cache breakpoint will kill your cache hit rate. Keep static content before breakpoints; keep dynamic content after.

Not Monitoring Token Usage

Most teams only notice token cost problems after the bill arrives. Build logging into your workflow that tracks input tokens, output tokens, and cache hits/misses per step. This makes it obvious which steps are expensive and whether caching is working as expected. Anthropic’s API responses include usage metadata — use it.

Treating All Agents as Equal

Some tasks genuinely require more reasoning. But “more reasoning” shouldn’t mean “pass more context.” Sometimes the answer is a better prompt, not more tokens. Write system prompts that give agents clear decision criteria so they don’t need to reason through ambiguity from scratch on every call.

Frequently Asked Questions

How does Claude prompt caching actually reduce costs?

Prompt caching works by storing the KV (key-value) computation for a section of your prompt, so Claude doesn’t need to reprocess those tokens on subsequent calls. Cache reads are charged at roughly 10% of the normal input token rate — significantly cheaper than re-sending the same content. The tradeoff is a small cache write premium (around 125% of normal input token cost), which typically pays off after just a few reads.

When should I use Claude Haiku vs. Sonnet in a multi-agent workflow?

Use Haiku for tasks that are narrow and well-defined: classification, extraction, formatting, validation, and simple summarization. Use Sonnet when a task requires genuine reasoning, handling ambiguity, generating complex code, or making decisions that significantly affect downstream steps. The rule of thumb: if you can write a test that checks whether the output is correct or not, Haiku can probably handle it.

What’s the minimum prompt size for caching to kick in?

For Claude 3.5 Sonnet and Claude 3.5 Haiku, the minimum cacheable block is 1,024 tokens. For Claude 3 Opus, it’s 2,048 tokens. Content blocks shorter than these thresholds won’t be cached — you’ll be charged standard input token rates. If your system prompt is currently shorter than the minimum, consider whether adding more structured context, detailed instructions, or tool definitions to it would both improve performance and qualify it for caching.

How long does a prompt cache last?

The default cache TTL in Claude’s API is 5 minutes. Anthropic supports an extended 1-hour cache duration as well. For workflows that run on a schedule or process multiple requests in a short window, aligning your cache TTL with your call frequency is important. If calls are spread out over hours, you’ll need to re-write the cache each time, which is still cheaper than not caching at all if you have multiple calls within a session.

Can I use prompt caching with tool definitions?

Yes, and this is often one of the highest-value places to apply caching. Tool schemas — especially for complex tools with many parameters — can be several hundred tokens each. If your workflow uses a stable set of tools across multiple calls, placing a cache breakpoint after tool definitions means you pay to process those schemas once and read from cache on subsequent calls.

REMY IS NOT

✕a coding agent
✕no-code
✕vibe coding
✕a faster Cursor

IT IS

✓a general contractor for software

The one that tells the coding agents what to build.

What’s the best way to reduce token costs without caching?

Model tiering (using Haiku for sub-agents), setting explicit max_tokens limits per call, scoping tool definitions per agent, and compressing context before passing it to sub-agents all reduce costs independently of caching. Structured outputs help too — when you enforce a specific output schema, models produce less extraneous text. Combining these with caching gives you the best overall cost profile.

Key Takeaways

Token costs in dynamic workflows compound — context overhead, tool definitions, and sub-agent calls add up fast if you’re not deliberate.
Prompt caching with cache_control breakpoints dramatically reduces costs for stable, repeated content — system prompts, tool definitions, and large documents are the best candidates.
Model tiering — using Claude Haiku for narrow, well-defined sub-agent tasks and Sonnet for complex reasoning — is the biggest single lever for managing costs.
Bounded scope — explicit max_tokens, trimmed context, and per-role tool sets — keeps your baseline efficient before optimizations kick in.
Monitor token usage per step so you can see where spend is concentrated and whether caching is hitting as expected.

Building multi-agent Claude workflows efficiently is mostly about discipline: knowing what each agent actually needs, not what it might possibly use. Start there, layer in caching and model tiering, and costs become predictable rather than surprising. If you want a platform that handles the infrastructure layer so you can focus on the workflow logic itself, MindStudio is worth a look.