What Is Prompt Caching in Claude Code? How to Save Millions of Tokens

Prompt caching lets Claude reuse expensive context across sessions. Learn how it works, when to use it, and how to extend your session limits significantly.

MindStudio Team

The Hidden Cost Problem in Long Claude Sessions

If you’ve ever worked through an extended Claude Code session on a large codebase, you’ve probably noticed your token usage climbing fast. Long system prompts, thousands of lines of code loaded into context, tool definitions, few-shot examples: it all adds up. Without caching, every new message makes Claude process all of it again from scratch.

That’s the problem prompt caching in Claude solves. It lets Claude store a snapshot of expensive context and reuse it across multiple requests, instead of re-reading and re-processing the same tokens each time. The result is faster responses, dramatically lower token costs, and significantly extended session capacity.

This guide explains exactly how prompt caching works, when it activates, how to structure your prompts to make the most of it, and how it can help you stretch sessions that would otherwise hit limits far too early.


How Prompt Caching Actually Works

Claude’s prompt caching is a server-side feature in Anthropic’s API. When you mark a portion of your prompt for caching, Anthropic stores a computed representation of that content on their infrastructure. On subsequent requests that include the same cached prefix, Claude can skip the work of re-processing that content entirely — it reads from cache instead.

There are two types of token operations when caching is involved:

  • Cache write: The first time a cacheable block is processed and stored. This costs slightly more than a standard input token (typically 25% more).
  • Cache read: Every subsequent request that hits the cached content. This costs significantly less — typically 90% less than the base input token price.

The math makes caching worthwhile quickly. If you have a 10,000-token system prompt and you’re making 20 requests in a session, without caching you’re paying for 200,000 input tokens. With caching, you pay the cache-write price once and cache-read prices 19 times after that.

Cache Lifetime and Refreshes

The default cache lifetime for Claude prompt caching is 5 minutes. If you send a request that hits a cached block within that window, the cache timer resets to another 5 minutes. Active sessions naturally keep their caches warm.

For certain use cases, particularly long-running agentic tasks or automated workflows, Anthropic also offers a longer cache lifetime (a 1-hour option, billed at a higher cache-write rate than the 5-minute default). With the default lifetime, the key to maintaining a warm cache is consistent request patterns: if requests stop for more than 5 minutes, the cache expires and the next request pays cache-write prices again.
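
To see how the refresh rule plays out, here is a minimal sketch that labels each request in a session as a cache write or a cache read. It is a local simulation only: the function name and the timestamps are made up for illustration, and the exact behaviour at the expiry boundary is an assumption.

```python
CACHE_TTL_SECONDS = 5 * 60  # the default lifetime described above


def classify_requests(request_times, ttl=CACHE_TTL_SECONDS):
    """Label each request as a cache "write" or "read".

    request_times: ascending timestamps in seconds.
    Assumption: a request arriving exactly at expiry counts as a miss.
    """
    labels = []
    expires_at = None  # no cache entry exists before the first request
    for t in request_times:
        if expires_at is not None and t < expires_at:
            labels.append("read")
        else:
            labels.append("write")
        # Both a fresh write and a hit restart the lifetime timer.
        expires_at = t + ttl
    return labels


# Requests at 0s, 2 min, 6 min, then a 7-minute pause before the last one.
times = [0, 120, 360, 780]
print(classify_requests(times))  # ['write', 'read', 'read', 'write']
```

The third request still hits the cache because the second one reset the timer. Only the 7-minute pause forces a new write.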


What Claude Code Caches (and What It Doesn’t)

Claude Code can cache several types of content, but not everything qualifies. Understanding the distinction helps you structure your workflow to get the most benefit.

Content That Can Be Cached

System prompts are the most obvious target. If you’re giving Claude a detailed set of instructions, a role definition, or a project-specific context block at the start of every session, that’s exactly what caching was built for.

Tool definitions — the schemas that describe what tools Claude can call — are another strong candidate. These are often verbose and largely static within a session.

Large document or codebase chunks loaded into context are high-value cache targets. If you’re working with a library, a long README, or a full file loaded into the prompt, caching prevents Claude from re-reading all of it on every turn.

Few-shot examples included in system or user messages can also be cached when they appear at a stable position in the prompt.

Content That Cannot Be Cached

The trailing, dynamic portion of your conversation can’t be cached in the same way. Cache blocks must appear as a prefix — content that stays consistent from request to request. Anything after a cache boundary that changes between requests won’t benefit from caching.

This is why prompt structure matters so much. You want your static, expensive content at the top, and your dynamic, changing content (like the latest user message) at the bottom.


Token Cost Breakdown: The Real Savings

To make this concrete, here’s how the economics work using Claude’s typical pricing structure.

Token Type              Relative Cost
Standard input tokens   1× (baseline)
Cache write tokens      ~1.25×
Cache read tokens       ~0.1×
Output tokens           ~5× (varies by model)

Say you’re running a coding assistant with a 20,000-token system prompt (detailed instructions, code style guidelines, project context). You run 50 requests in a session.

Without caching:

  • 50 × 20,000 = 1,000,000 input tokens

With caching:

  • 1 cache write: 20,000 × 1.25 = 25,000 token-cost units
  • 49 cache reads: 49 × 20,000 × 0.1 = 98,000 token-cost units
  • Total: ~123,000 token-cost units

That’s roughly an 88% reduction in the cost of processing your system prompt: about 877,000 token-cost units saved in this one session. Across a long project, the savings grow with every additional request that hits the cache.
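
The arithmetic above can be reproduced with a few lines of Python. This is a sketch using the approximate multipliers from the table, and it assumes every request after the first arrives while the cache is still warm. The function names are illustrative.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # approximate, from the table above
CACHE_READ_MULTIPLIER = 0.10   # approximate, from the table above


def uncached_cost(prefix_tokens, requests):
    """Token-cost units when the prefix is reprocessed on every request."""
    return prefix_tokens * requests


def cached_cost(prefix_tokens, requests):
    """Token-cost units with one cache write followed by cache reads."""
    if requests <= 0:
        return 0.0
    write = prefix_tokens * CACHE_WRITE_MULTIPLIER
    reads = (requests - 1) * prefix_tokens * CACHE_READ_MULTIPLIER
    return write + reads


def savings_fraction(prefix_tokens, requests):
    """Share of the uncached cost that caching removes."""
    base = uncached_cost(prefix_tokens, requests)
    return 1 - cached_cost(prefix_tokens, requests) / base


print(f"{uncached_cost(20_000, 50):,}")      # 1,000,000
print(f"{cached_cost(20_000, 50):,.0f}")     # 123,000
print(f"{savings_fraction(20_000, 50):.1%}") # 87.7%
```

Only the cached prefix is counted here. The dynamic part of each request and the output tokens are billed the same either way.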


How to Enable Prompt Caching in Claude Code

Prompt caching is controlled via the cache_control parameter in the Anthropic API. You add it to the specific content blocks you want cached.

Marking Content for Caching

In the Messages API, you attach a cache_control object with "type": "ephemeral" to the content block you want cached:

{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Here is the full codebase context: [10,000 tokens of code]",
      "cache_control": {"type": "ephemeral"}
    },
    {
      "type": "text",
      "text": "Now answer this question: what does the auth module do?"
    }
  ]
}

The cache boundary is placed at the end of the block marked with cache_control. Everything up to and including that block is the cached prefix. Everything after it is the dynamic suffix.

Caching the System Prompt

For system-level caching, you can mark the system prompt block directly:

{
  "system": [
    {
      "type": "text",
      "text": "You are a senior software engineer working on Project X. [detailed context...]",
      "cache_control": {"type": "ephemeral"}
    }
  ]
}

This is often the highest-leverage change you can make, especially if your system prompt is long and consistent across requests.
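
As a rough sketch, the same request body can be assembled in Python. The helper name, the default max_tokens value, and the model ID placeholder are assumptions for illustration. With Anthropic's official Python SDK, a dict like this would typically be passed as keyword arguments to client.messages.create, but no network call is made here so the example runs offline.

```python
import json


def build_request(system_text, user_text, model, max_tokens=1024):
    """Build a Messages API request body with a cached system prompt."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The user message changes every turn, so it carries no marker.
        "messages": [{"role": "user", "content": user_text}],
    }


payload = build_request(
    system_text="You are a senior software engineer working on Project X.",
    user_text="What does the auth module do?",
    model="your-model-id",  # placeholder, substitute a real model ID
)
print(json.dumps(payload, indent=2))
```

Keeping the system text in one place and building every request through the same helper makes it harder to alter the cached prefix by accident.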

Multiple Cache Breakpoints

The API supports up to four cache breakpoints per request. This lets you cache several static blocks independently, so a change to a later block doesn’t invalidate the cache for the blocks before it. The API assembles the cacheable prefix in a fixed order: tool definitions, then the system prompt, then messages.

For example:

  1. Cache breakpoint after tool definitions
  2. Cache breakpoint after the system prompt
  3. Cache breakpoint after a large document you’ve loaded

Dynamic conversation history and the current message follow the last breakpoint and are not cached.
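
A layout like this could be sketched as follows. The request is laid out in the API's prefix order (tools, then system, then messages). The tool definition, the placeholder texts, and the helper names are all hypothetical, and the local check only counts markers. It does not contact the API.

```python
MAX_BREAKPOINTS = 4  # per-request limit stated above


def ephemeral():
    """Return a fresh cache marker so blocks never share one dict."""
    return {"type": "ephemeral"}


def count_breakpoints(request):
    """Count cache_control markers across tools, system and messages."""
    count = sum(1 for tool in request.get("tools", []) if "cache_control" in tool)
    system = request.get("system", [])
    if isinstance(system, list):
        count += sum(1 for block in system if "cache_control" in block)
    for message in request.get("messages", []):
        content = message["content"]
        if isinstance(content, list):
            count += sum(1 for block in content if "cache_control" in block)
    return count


def check_breakpoints(request):
    """Raise early if a request carries more markers than allowed."""
    used = count_breakpoints(request)
    if used > MAX_BREAKPOINTS:
        raise ValueError(f"{used} cache breakpoints, limit is {MAX_BREAKPOINTS}")
    return used


request = {
    "tools": [
        {
            "name": "read_file",  # hypothetical tool
            "description": "Read a file from the repository.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
            "cache_control": ephemeral(),  # breakpoint after tools
        }
    ],
    "system": [
        {"type": "text", "text": "<project instructions>",
         "cache_control": ephemeral()}  # breakpoint after system prompt
    ],
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "<large document>",
             "cache_control": ephemeral()},  # breakpoint after document
            {"type": "text", "text": "What does the auth module do?"},
        ]}
    ],
}

print(check_breakpoints(request))  # 3
```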

Strategies for Maximum Cache Efficiency

Getting prompt caching to work is straightforward. Getting the most out of it requires some thought about how you structure your prompts.

Put Static Content First, Always

The single most important rule: everything you want cached must come before anything that changes. This sounds obvious, but it’s easy to accidentally break by including session-specific details early in a prompt, or by using dynamic timestamps in the system prompt.

Move any dynamic elements — user names, session IDs, current dates — to the end of the prompt or into the trailing, non-cached message.
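
A toy model makes the effect visible. The sketch below treats a prompt as a list of segments and measures how much of the front two consecutive requests share. It is a simplification: real matching happens on Anthropic's servers, and the segment strings are placeholders.

```python
def shared_prefix_length(previous, current):
    """Number of leading segments two requests have in common."""
    length = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        length += 1
    return length


STATIC = ["<system prompt>", "<tool definitions>", "<style guide>"]

# Timestamp first: the very first segment differs between requests.
bad_1 = ["time=09:41"] + STATIC + ["question 1"]
bad_2 = ["time=09:42"] + STATIC + ["question 2"]

# Timestamp last: the static block stays an identical prefix.
good_1 = STATIC + ["time=09:41", "question 1"]
good_2 = STATIC + ["time=09:42", "question 2"]

print(shared_prefix_length(bad_1, bad_2))    # 0
print(shared_prefix_length(good_1, good_2))  # 3
```

With the timestamp at the front, nothing is reusable even though most of the prompt is unchanged. Moving it to the end leaves all three static segments available as a cached prefix.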

Keep Your System Prompt Stable

Frequent edits to the system prompt reset the cache. If you’re iterating on system prompt language during development, each version creates a new cache entry. Once you’ve settled on a working prompt for production or a session, stabilize it.

Load Documents Once, Cache Them

If you’re working with Claude Code on a project and need to load file contents into context, load them at the start of the session with a cache marker. Don’t reload or reformat them between requests — that breaks the cache hit.

Keep Tool Definitions Consistent

In agentic workflows where Claude has access to tools, your tool schema should be identical across requests. Even minor formatting changes will cause a cache miss. Version your tool definitions and treat them as fixed artifacts during a session.
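
One way to guard against accidental drift is to normalize tool definitions before sending them and to compare a fingerprint between requests. The sketch below uses only the standard library. The tool shown is hypothetical, and the SHA-256 fingerprint is a local check of our own, not the cache key Anthropic uses.

```python
import hashlib
import json


def canonical_json(obj):
    """Serialize with sorted keys and fixed separators for stable output."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def normalized(obj):
    """Rebuild the object so its keys are always in the same order."""
    return json.loads(canonical_json(obj))


def fingerprint(obj):
    """Short stable identifier for logging and drift detection."""
    return hashlib.sha256(canonical_json(obj).encode("utf-8")).hexdigest()[:12]


# The same hypothetical tool, generated with different key ordering.
tool_a = {
    "name": "read_file",
    "description": "Read a file.",
    "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}},
}
tool_b = {
    "input_schema": {"properties": {"path": {"type": "string"}}, "type": "object"},
    "description": "Read a file.",
    "name": "read_file",
}

print(json.dumps(tool_a) == json.dumps(tool_b))                          # False
print(json.dumps(normalized(tool_a)) == json.dumps(normalized(tool_b)))  # True
print(fingerprint(tool_a) == fingerprint(tool_b))                        # True
```

Logging the fingerprint alongside each request gives a quick way to spot the moment a schema changed mid-session.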

Use Conversation History Efficiently

As a conversation grows, the older portions of the history become increasingly static. Tools built on the API, Claude Code included, can cache the conversation history up to a certain point and process only the newest messages fresh. This doesn’t enlarge the context window, but it does extend how long a session can run before cost and rate limits become the constraint.
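
For direct API use, one common way to do this is to move a single cache marker to the end of the history on each turn. The sketch below is a hypothetical helper, not Claude Code's own implementation. It assumes that one moving marker is sufficient and that older markers can be dropped to stay within the breakpoint limit.

```python
import copy


def mark_history_breakpoint(messages):
    """Return a copy of messages with one marker on the last block.

    Everything up to that block becomes the cacheable prefix for the
    next request. The input list is left unmodified.
    """
    marked = copy.deepcopy(messages)
    for message in marked:
        # Normalize plain-string content into a list of text blocks.
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
        # Drop markers left over from earlier turns.
        for block in message["content"]:
            block.pop("cache_control", None)
    if marked and marked[-1]["content"]:
        marked[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": [
        {"type": "text", "text": "Explain the auth module.",
         "cache_control": {"type": "ephemeral"}},  # marker from an earlier turn
    ]},
    {"role": "assistant", "content": "It validates session tokens."},
    {"role": "user", "content": "Where are tokens refreshed?"},
]

marked = mark_history_breakpoint(history)
markers = [
    i for i, m in enumerate(marked)
    if any("cache_control" in b for b in m["content"])
]
print(markers)  # [2]
```

On the following turn, the messages up to index 2 are unchanged, so they can be served from cache while only the new messages are processed fresh.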


Extending Session Limits with Prompt Caching

One of the less obvious benefits of prompt caching is how it interacts with rate limits and session capacity.

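Anthropic’s rate limits are typically measured in requests per minute (RPM) and in input and output tokens per minute. Cache reads weigh far less against the input-token limit than uncached input does: according to Anthropic’s rate-limit documentation, on most current models cache-read tokens don’t count toward that limit at all. The treatment varies by model, so check the documentation for the one you use.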

This means:

  • You can pack more context into each request without burning through your TPM limit as fast
  • Long sessions with large context windows become viable for extended work
  • Agentic loops that would previously exhaust token budgets can now run many more iterations

In practice, a well-cached workflow can make 5–10× more effective use of the same rate limit headroom compared to uncached requests with the same context size.

For Claude Code specifically, this matters a lot when working on large codebases where you need to maintain substantial project context across dozens of back-and-forth turns.


Common Mistakes That Kill Cache Hits

Even with caching set up correctly, a few common errors will cause consistent cache misses.

Dynamic content before the cache boundary. If you include anything that changes between requests — a timestamp, a session variable, a random seed — before your cache_control marker, every request generates a new cache write rather than a read.

Inconsistent whitespace or formatting. The cache key is sensitive to the exact content of the text. Even a trailing space or a different newline character will create a new cache entry.

Tool schema drift. If you’re generating tool definitions programmatically and the output changes slightly between requests (different key ordering in JSON, for example), you’ll get constant cache misses on your tool definitions.

Short sessions without enough requests. If the cached content is only ever sent once, you pay the 25% cache-write premium and never collect a read. By the multipliers above, a second request within the cache lifetime already comes out ahead (1.25 + 0.1 = 1.35 units versus 2 uncached), and the savings grow with every further request.

Forgetting to include cache_control in every request. The cache breakpoint must be present in each request for Claude to know where to look for cached content. Omitting it on a request means it won’t benefit from the cache, even if the content itself hasn’t changed.
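
The break-even point for short sessions can be checked directly. This sketch uses the article's approximate multipliers as defaults and assumes every request after the first is a cache hit. The function name and the alternative 2× write multiplier are illustrative.

```python
def break_even_requests(write_multiplier=1.25, read_multiplier=0.10,
                        max_requests=1000):
    """Smallest request count at which caching is cheaper than not caching.

    Costs are relative to sending the prefix once at the standard rate.
    Returns None if caching never wins within max_requests.
    """
    for n in range(1, max_requests + 1):
        cached = write_multiplier + (n - 1) * read_multiplier
        uncached = float(n)
        if cached < uncached:
            return n
    return None


print(break_even_requests())                      # 2
print(break_even_requests(write_multiplier=2.0))  # 3
```

A single request only pays the write premium. From the second hit onward the cached path is cheaper, and a pricier write pushes the break-even out by one more request.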


How MindStudio Helps with Token-Efficient Claude Workflows

If you’re building workflows that use Claude at scale — not just one-off queries but structured, repeating pipelines — managing context, caching, and token efficiency becomes an infrastructure problem, not just a prompt design problem.

This is where MindStudio fits naturally. MindStudio’s visual workflow builder lets you design multi-step AI agent workflows with Claude (and 200+ other models) without writing the infrastructure layer yourself. You control what context flows into each step, when to load documents, and how to structure the conversation — the kind of decisions that directly affect cache hit rates.

For example, you can build a Claude-powered code review agent in MindStudio where:

  • The system prompt and code style guidelines are defined once at the workflow level
  • Each run loads only the specific file being reviewed
  • The static context stays consistent, maximizing cache reuse across all runs

Because MindStudio handles retries, model routing, and API management, you’re not rebuilding that plumbing every time. You get to focus on prompt structure — which is where the cache efficiency actually lives.

MindStudio also gives you access to Claude alongside every other major model on the same platform, so you can mix and match based on what each step needs. For token-heavy tasks where caching matters, Claude Sonnet with proper cache setup is often the right call. For lightweight steps, you might route to a smaller model entirely.

You can start building on MindStudio for free — no API keys, no separate accounts, just the workflow.


Frequently Asked Questions

Does Claude Code use prompt caching automatically?

Yes. Claude Code applies prompt caching to its own requests by default, including conversation history, so there is nothing to switch on. Explicit cache_control markers matter when you call the Anthropic API yourself, for example from your own agent or pipeline. There, large system prompts and loaded documents are typically cached only where you mark them, and leaving the markers out means giving up most of the token savings.

How long does a cached prompt last?

The default cache lifetime is 5 minutes. Each request that hits the cached content resets the timer for another 5 minutes. In active sessions with frequent requests, the cache stays warm indefinitely. If your session goes idle for more than 5 minutes, the cache expires and the next request will pay cache-write prices again.

Does caching affect Claude’s response quality?

No. Prompt caching is purely an optimization for token processing efficiency. Claude’s access to the cached content is identical to reading it fresh — the model sees the same information either way. Response quality, reasoning, and accuracy are unaffected.

Can you cache conversation history?

Yes, but with constraints. You can mark a point in the conversation history as a cache boundary, and everything up to that point will be cached. The typical pattern is to cache everything except the most recent few turns, treating the growing-but-stable history as a cacheable prefix and the newest messages as the dynamic suffix.

What’s the difference between prompt caching and context windows?

These are separate concepts. The context window is the maximum amount of text Claude can process and “see” in a single request. Prompt caching doesn’t change the context window size — it just changes the cost and speed of processing content within that window. Caching makes it more affordable to use more of your context window consistently.

Is prompt caching worth it for short sessions?

Only if the same context is sent more than once. The cache-write cost (25% more than standard input) is pure overhead on a single-shot query, so skip caching there. With the multipliers above, a second request that hits the cache already costs less in total than two uncached requests, and the savings become substantial after 5 or more. For iterative coding sessions, research workflows, or any task with 10+ back-and-forth turns, caching is almost always worth enabling.


Key Takeaways

  • Prompt caching stores computed representations of your prompt content server-side, so Claude can skip re-processing the same tokens on every request.
  • Cache reads cost roughly 90% less than standard input tokens, making long sessions with large context dramatically more affordable.
  • Static content (system prompts, tool definitions, loaded documents) must appear before dynamic content to form a valid cache prefix.
  • The cache lifetime is 5 minutes by default, automatically refreshed by active requests.
  • Common cache-busting mistakes include dynamic content before the cache boundary, inconsistent formatting, and varying tool schemas between requests.
  • Properly cached workflows can extend effective session capacity by 5–10× compared to uncached equivalents with the same context size.
  • Tools like MindStudio can help you structure and run Claude workflows where prompt consistency — and therefore cache efficiency — is maintained by design rather than by hand.

Presented by MindStudio
