What Is Prompt Caching in Claude Code? How to Save Millions of Tokens
Prompt caching cuts Claude input token costs by up to 90% for repeated context. Learn how cache TTL works, what breaks the cache, and three habits that maximize savings.
The Token Bill Nobody Saw Coming
If you’ve been using Claude Code for serious development work — loading large codebases, attaching documentation, running multi-turn conversations — your token costs are probably higher than they need to be. In some cases, much higher.
Prompt caching in Claude is one of the most practical cost-reduction tools available. It can cut input token costs by up to 90% for repeated context. But most developers either don’t know it exists or aren’t using it correctly.
This guide explains exactly how prompt caching works, what the cache TTL means in practice, what silently breaks the cache (and costs you full price), and three concrete habits that will minimize your Claude token spend.
What Prompt Caching Actually Is
Every time you send a message to Claude, the model processes all the tokens in your request — system prompt, conversation history, attached files, tool definitions, everything. Even if you sent the exact same context two minutes ago.
Prompt caching changes that. When caching is enabled, Anthropic stores a processed version of your context on their servers. On subsequent requests that include that same prefix, Claude reads from the cache instead of re-processing from scratch.
The result: cached input tokens are billed at roughly 10% of the normal input token rate. That’s a 90% reduction.
This matters most when you have large, stable context that appears in request after request — like a long system prompt, a full codebase dump, or an extensive documentation block. Without caching, you pay full price every time. With caching, you pay full price once (plus a small cache write fee), then a fraction of that for every subsequent hit.
What Gets Cached
Anthropic’s prompt caching documentation specifies that the cache stores a “prefix” of your request — everything from the beginning of your input up to the cache breakpoint you define.
This includes:
- System prompts
- Tool and function definitions
- Long conversation histories
- Attached documents or file contents
- Any other large, repeated context
What matters is that the content before the breakpoint stays identical across requests. The moment anything changes in that prefix, the cache is invalidated.
How Cache TTL and Pricing Work
Cache Lifetime
The default cache TTL (time to live) is 5 minutes from the last time the cache was accessed. Each successful cache hit resets the clock. So if you’re actively working and hitting the cache regularly, it stays alive.
Anthropic has also introduced extended cache TTL options — up to 1 hour for certain use cases — which is useful for longer development sessions where you might step away between requests.
If you’re seeing cache misses in a workflow that should be hitting the cache, the first thing to check is whether more than 5 minutes passed between requests.
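To see how the sliding TTL plays out, here is a minimal sketch that labels a sequence of request times as cache writes or cache hits. It assumes the behavior described above (a 5-minute window refreshed by every request); the function name and the timestamps are hypothetical, and it is a local model, not a call to Anthropic’s API.

```python
CACHE_TTL_SECONDS = 5 * 60  # default TTL described in this section


def classify_requests(timestamps, ttl=CACHE_TTL_SECONDS):
    """Label each request time (in seconds) as a cache 'write' or 'hit'.

    Assumption: every request, whether it writes or reads the cache,
    pushes the expiry forward by one full TTL.
    """
    results = []
    expires_at = None  # no cache entry exists before the first request
    for t in timestamps:
        if expires_at is not None and t <= expires_at:
            results.append("hit")
        else:
            results.append("write")
        expires_at = t + ttl
    return results


# Requests at 0s, 2min, 6min40s, and 13min20s.
print(classify_requests([0, 120, 400, 800]))
```

The last request arrives more than 5 minutes after the previous one, so it is classified as a fresh write rather than a hit.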
Token Minimums
Not every request qualifies for caching. There’s a minimum threshold:
- Claude 3.5 Sonnet, Claude 3 Opus: 1,024 tokens minimum
- Claude 3.5 Haiku: 2,048 tokens minimum
If your cacheable prefix is shorter than the threshold, caching won’t activate. For most Claude Code workflows with loaded context, this isn’t an issue — but it’s worth knowing if you’re experimenting with shorter system prompts.
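A quick pre-flight check can catch prefixes that are too short to cache. The sketch below hard-codes the thresholds listed above; the lookup table and function name are hypothetical, and the token count is assumed to come from whatever tokenizer or token-counting call you already use.

```python
# Minimum cacheable prefix sizes, copied from the list above.
MIN_CACHEABLE_TOKENS = {
    "claude-3-5-sonnet": 1024,
    "claude-3-opus": 1024,
    "claude-3-5-haiku": 2048,
}


def meets_cache_minimum(model, prefix_tokens):
    """Return True if a prefix of this size is long enough to be cached."""
    for family, minimum in MIN_CACHEABLE_TOKENS.items():
        if model.startswith(family):
            return prefix_tokens >= minimum
    raise ValueError(f"No known caching threshold for model {model!r}")


print(meets_cache_minimum("claude-3-5-sonnet-20241022", 1500))
print(meets_cache_minimum("claude-3-5-haiku-20241022", 1500))
```

Thresholds differ by model and may change, so treat the table as a snapshot and confirm it against the current documentation.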
What Caching Costs
Cache writes (the first time Claude processes and stores the context) cost 25% more than standard input tokens. Cache reads (every subsequent hit) cost about 10% of the standard input token price.
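So the break-even math is simple: the first request costs 1.25× the normal input rate, and every hit after that costs about 0.10×. Two requests with caching cost roughly 1.35× a single uncached request, versus 2× without caching, so caching pays for itself on the first cache hit. By the tenth request, the savings are substantial.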
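The multipliers above are enough to check the break-even point directly. This is a small sketch in relative units (1.0 equals one uncached pass over the context); the function names are hypothetical, and it ignores the uncached tokens that follow the breakpoint.

```python
WRITE_MULTIPLIER = 1.25  # cache write, relative to normal input price
READ_MULTIPLIER = 0.10   # cache read, relative to normal input price


def relative_cost(n_requests, cached):
    """Cost of sending the same context n times, in units of one uncached pass."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    if not cached:
        return float(n_requests)
    # One write, then a read for every later request.
    return WRITE_MULTIPLIER + READ_MULTIPLIER * (n_requests - 1)


def break_even_requests(limit=100):
    """Smallest request count at which caching is cheaper, or None."""
    for n in range(1, limit + 1):
        if relative_cost(n, cached=True) < relative_cost(n, cached=False):
            return n
    return None


print("break-even at request:", break_even_requests())
for n in (1, 2, 3, 10):
    print(n, relative_cost(n, cached=False), round(relative_cost(n, cached=True), 2))
```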
For Claude Code sessions involving large codebases, the savings can reach millions of tokens per day — which translates to real dollars fast.
How to Enable Prompt Caching in Claude Code
Prompt caching uses a cache_control parameter attached to specific content blocks in your API request. You mark the point in your input where you want Claude to cache everything before it.
Here’s a basic structure using the Messages API:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a senior software engineer...",
        },
        {
            "type": "text",
            "text": "<entire codebase or documentation here>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Refactor the authentication module."}
    ]
)
The cache_control: {"type": "ephemeral"} marker tells Claude: cache everything up to and including this block.
Cache Breakpoints
You can set up to 4 cache breakpoints in a single request. This is useful when you have multiple distinct sections of stable context — for example, a system prompt, a set of tool definitions, and a document corpus. Each breakpoint acts as a potential cache hit point.
Place breakpoints at the end of each major stable section, not in the middle. Claude caches the prefix up to the breakpoint, so anything after the last breakpoint is always re-processed.
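As a rough illustration of breakpoints at section ends, the sketch below builds a request payload with one breakpoint on the last tool definition and one on the last system block. The helper names, the example tool, and the placeholder text are all hypothetical; only the cache_control marker and the limit of 4 come from the text above.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}
MAX_BREAKPOINTS = 4  # limit stated above


def mark_last(blocks):
    """Return a copy of blocks with a cache breakpoint on the final block."""
    marked = copy.deepcopy(blocks)
    if marked:
        marked[-1]["cache_control"] = dict(EPHEMERAL)
    return marked


def count_breakpoints(request):
    """Count cache_control markers across tool and system blocks."""
    sections = request.get("tools", []) + request.get("system", [])
    return sum(1 for block in sections if "cache_control" in block)


def build_request(tools, system_blocks, question,
                  model="claude-3-5-sonnet-20241022"):
    """Assemble keyword arguments for a Messages API call."""
    request = {
        "model": model,
        "max_tokens": 1024,
        "tools": mark_last(tools),
        "system": mark_last(system_blocks),
        "messages": [{"role": "user", "content": question}],
    }
    if count_breakpoints(request) > MAX_BREAKPOINTS:
        raise ValueError("Too many cache breakpoints in one request")
    return request


tools = [{
    "name": "read_file",  # hypothetical tool, for illustration only
    "description": "Read a file from the repository.",
    "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}},
}]
system_blocks = [
    {"type": "text", "text": "You are a senior software engineer..."},
    {"type": "text", "text": "<entire codebase or documentation here>"},
]
request = build_request(tools, system_blocks, "Refactor the authentication module.")
print(count_breakpoints(request))
```

The resulting dictionary could then be passed along as client.messages.create(**request). Everything after the last marked block, here the user question, is processed fresh on every call.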
What Breaks the Cache
This is where developers lose money without realizing it. Several common behaviors silently invalidate the cache.
1. Changing Anything Before the Breakpoint
The cache stores an exact prefix. If a single token changes before the cache breakpoint — a different date injected into the system prompt, a logging statement that adds dynamic content, a timestamp — the entire cache is invalidated.
Dynamic content is the most common cause of unexpected cache misses. If your system prompt includes something like “Current date: {date}”, that changes every day and breaks the cache.
Fix: Push all dynamic content (current date, user-specific info, session context) to the section after your last cache breakpoint. Keep everything before the breakpoint completely static.
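One way to apply this fix is to keep the static text in a block that ends with the breakpoint and put the date in a separate block after it. The sketch below assumes a hypothetical build_system helper and placeholder context; the dates are arbitrary examples.

```python
from datetime import date

# Placeholder for the large, unchanging context.
STATIC_CONTEXT = (
    "You are a senior software engineer...\n"
    "<entire codebase or documentation here>"
)


def build_system(today):
    """Static block first (cached), dynamic date block last (not cached)."""
    return [
        {
            "type": "text",
            "text": STATIC_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # breakpoint ends the stable prefix
        },
        {"type": "text", "text": f"Current date: {today.isoformat()}"},
    ]


day_one = build_system(date(2025, 1, 6))
day_two = build_system(date(2025, 1, 7))

print(day_one[0] == day_two[0])  # cached prefix is identical across days
print(day_one[1] == day_two[1])  # only the trailing block differs
```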
2. Reordering Content
The cache is order-sensitive. If you rearrange the blocks in your system message, even with identical content, the cache won’t recognize it.
Fix: Lock in the order of your static content blocks and don’t change it across sessions.
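A local fingerprint of the static blocks can flag accidental reordering before a request is sent. This is only an illustration of order sensitivity: the hashing scheme and the function name are assumptions and say nothing about how Anthropic matches prefixes internally.

```python
import hashlib
import json


def prefix_fingerprint(blocks):
    """Hash a list of content blocks; any change in order or text changes the hash."""
    payload = json.dumps(blocks, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


persona = {"type": "text", "text": "You are a senior software engineer..."}
codebase = {"type": "text", "text": "<entire codebase or documentation here>"}

original = prefix_fingerprint([persona, codebase])
reordered = prefix_fingerprint([codebase, persona])

print(original == prefix_fingerprint([persona, codebase]))  # same order, same hash
print(original == reordered)                                # swapped order, different hash
```

Logging the fingerprint alongside each request makes it easy to spot the exact call where the prefix drifted.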
3. Switching Models
Cache entries are model-specific. A cache built with claude-3-5-sonnet-20241022 won’t be hit by a request using claude-3-opus-20240229, even if the content is identical.
Fix: Standardize on one model per workflow. Switching models for experimentation is fine, just know that each model version maintains its own cache.
4. Different API Keys or Organizations
Cache is scoped to your API key and organization. If your team is running requests under different API keys, each key maintains a separate cache — there’s no sharing.
Fix: Route your Claude Code requests through a consistent API key for shared caching benefits.
5. Letting the Cache Expire
If you don’t make a request within the TTL window, the cache clears and the next request is a full cache write (charged at the higher write rate).
Fix: For long-running workflows, send a request at least every 5 minutes where possible, or use the extended 1-hour TTL if it is available for your model and account.
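For reference, a block that opts into a longer TTL might be built as in the sketch below. It assumes the cache_control object accepts an optional ttl field with the values "5m" or "1h", which is how Anthropic’s documentation described the extended TTL at the time of writing; the helper name is hypothetical, and availability and pricing for the longer TTL should be confirmed against the current docs.

```python
ALLOWED_TTLS = ("5m", "1h")  # assumption: values accepted by the API


def cached_text_block(text, ttl=None):
    """Build a text block ending in a cache breakpoint, optionally with a TTL."""
    block = {
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral"},
    }
    if ttl is not None:
        if ttl not in ALLOWED_TTLS:
            raise ValueError(f"Unsupported ttl {ttl!r}; expected one of {ALLOWED_TTLS}")
        block["cache_control"]["ttl"] = ttl
    return block


print(cached_text_block("<entire codebase or documentation here>", ttl="1h"))
```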
Three Habits That Maximize Prompt Cache Savings
These aren’t theoretical optimizations. They’re practical habits that directly affect your monthly token bill.
Habit 1: Front-Load Stable Context, Backload Dynamic Context
Structure your prompts so that everything stable comes first, followed by everything that changes.
A well-structured Claude Code system message might look like:
- Static role/persona (never changes)
- Codebase dump or documentation (changes rarely — cache this)
- Tool definitions (stable per session — cache this)
- [Cache breakpoint here]
- Current task or dynamic instructions (changes per request — not cached)
This pattern maximizes cache reuse because the large, expensive sections are always identical, while the small, dynamic sections are cheap to re-process.
Habit 2: Batch Your Context, Not Your Requests
A common mistake is breaking up context across multiple smaller requests instead of loading it all once. Each new request has to re-establish context from scratch unless you’re using caching — and even with caching, fragmenting context across requests can break the cache pattern.
Instead, load your full context (the entire relevant portion of your codebase, the complete documentation block) in a single well-structured initial message, then let subsequent turns hit the cache.
Think of the initial context-loading request as a fixed cost you pay once per session. Everything after that should be a cache hit.
Habit 3: Monitor Cache Hit Rates in Your Usage Metadata
Claude’s API returns cache metadata in every response. The usage object includes:
- cache_creation_input_tokens: tokens written to cache (costs 25% more)
- cache_read_input_tokens: tokens read from cache (costs 90% less)
- input_tokens: tokens processed normally
If you’re not logging this data, you’re flying blind. Set up logging to track your cache hit rate across a session. If you’re seeing high cache_creation_input_tokens and low cache_read_input_tokens, you have a cache invalidation problem — something is breaking the cache before it can pay off.
A healthy pattern looks like: one cache write at session start, then cache reads for every subsequent request in that session.
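A minimal sketch of that logging follows, assuming each response’s usage data has been collected as a plain dictionary with the three fields named above. The function name and the sample numbers are hypothetical.

```python
def cache_hit_rate(usages):
    """Share of all input tokens in a session that were served from cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    fresh = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + fresh
    return read / total if total else 0.0


# Hypothetical session: one cache write, then two cache reads.
session = [
    {"cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0, "input_tokens": 20},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000, "input_tokens": 20},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000, "input_tokens": 20},
]
print(f"{cache_hit_rate(session):.1%}")
```

A rate that stays near zero after the first few requests is the signal to go looking for something dynamic in the prefix.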
Prompt Caching in Agentic Workflows
Prompt caching becomes even more valuable in agentic workflows, where an AI agent might make dozens or hundreds of API calls in a single run. Long tool definitions, large knowledge bases, and extended conversation histories all compound the savings.
For agents doing code review, documentation generation, or codebase analysis, the pattern is especially clear: the agent loads the relevant files once, caches them, then asks multiple questions against that cached context. Without caching, each question re-processes thousands of tokens. With caching, only the new question is processed fresh.
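The load-once, ask-many pattern can be sketched as a loop that reuses one cached system block for every question. The function name ask_many is hypothetical, and the client is passed in as a parameter so the sketch stays independent of credentials; it is assumed to be an anthropic.Anthropic() instance like the one in the earlier example.

```python
def ask_many(client, context, questions, model="claude-3-5-sonnet-20241022"):
    """Ask several questions against the same cached context."""
    # Built once so the prefix is byte-for-byte identical on every call.
    system = [
        {
            "type": "text",
            "text": context,
            "cache_control": {"type": "ephemeral"},
        }
    ]
    responses = []
    for question in questions:
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": question}],
        )
        responses.append(response)
    return responses
```

If the prefix is long enough and the calls stay inside the TTL, the first call should report a cache write in its usage data and the later calls should report cache reads.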
This is where MindStudio becomes directly relevant.
Managing Claude Workflows Without Token Overhead
MindStudio is a no-code platform for building and deploying AI agents, and it supports Claude alongside 200+ other models out of the box. When you build a Claude-powered workflow in MindStudio, the platform handles the infrastructure layer — including how context is structured and passed across steps in multi-turn workflows.
For teams who want the benefits of prompt caching without managing the API plumbing manually, MindStudio’s visual workflow builder lets you design your context structure, define stable vs. dynamic sections, and run Claude agents that follow best practices by default. You don’t need to manually instrument cache_control blocks across your codebase.
You can also use MindStudio’s Agent Skills Plugin to let external agents — including Claude Code — call MindStudio-powered capabilities as simple method calls, which can be a cleaner way to offload complex, context-heavy tasks to pre-built, optimized workflows.
Try MindStudio free at mindstudio.ai.
Real-World Numbers: What the Savings Look Like
To make this concrete, here’s a rough calculation for a Claude Code session doing codebase analysis.
Assume:
- System prompt + codebase context: 50,000 tokens
- 20 requests per session, each with a new question
- Claude 3.5 Sonnet pricing: $3 per million input tokens
Without caching:
- 50,000 tokens × 20 requests = 1,000,000 tokens = $3.00 per session
With caching:
- Cache write (once): 50,000 tokens × 1.25 = 62,500 tokens × $3/M = $0.19
- Cache reads (19 times): 50,000 tokens × 0.10 × 19 = 95,000 tokens × $3/M = $0.29
- Total: $0.48 per session
That’s an 84% reduction for one session. Scale that across a team running multiple sessions per day, and the savings become significant quickly.
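The same arithmetic can be reproduced in a few lines, which makes it easy to plug in your own context size, request count, and price. The function name is hypothetical, and the sketch counts only the repeated context, not the per-request questions or output tokens.

```python
def session_cost(context_tokens, requests, price_per_million,
                 write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached, cached) dollar cost of the repeated context."""
    per_token = price_per_million / 1_000_000
    uncached = context_tokens * requests * per_token
    write = context_tokens * write_multiplier * per_token
    reads = context_tokens * read_multiplier * (requests - 1) * per_token
    return uncached, write + reads


uncached, cached = session_cost(50_000, 20, 3.00)
print(f"without caching: ${uncached:.2f}")
print(f"with caching:    ${cached:.2f}")
print(f"reduction:       {1 - cached / uncached:.0%}")
```

Computed without intermediate rounding, the cached total comes out about a cent lower than the sum of the rounded line items above, and the reduction still rounds to 84%.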
FAQ
Does prompt caching work with all Claude models?
Prompt caching is supported on Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, and Claude 3 Haiku. It’s not available on older model versions. Always check the Anthropic model documentation for the latest compatibility list, since new models are released regularly.
How long does a prompt cache last?
The default cache TTL is 5 minutes from the last access. Each cache hit resets the 5-minute timer. Anthropic has introduced extended TTL options (up to 1 hour) for certain workflows. If you’re seeing cache misses in an active session, check whether your requests are spaced more than 5 minutes apart.
What is the minimum number of tokens needed to use prompt caching?
The minimum cacheable prefix is 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2,048 tokens for Claude 3.5 Haiku. Requests with a prefix shorter than these thresholds won’t activate caching, so they’ll be billed at normal input token rates.
Does prompt caching reduce output token costs?
No. Prompt caching only applies to input tokens — specifically, the tokens in your request that match the cached prefix. Output tokens (what Claude generates in response) are always billed at the standard output token rate.
Can I share a cache across multiple users or API keys?
No. Cache entries are scoped to a specific API key and organization. Different API keys maintain separate caches, even if the content is identical. For team environments where you want shared caching benefits, route requests through a consistent API key.
What happens when the cache expires mid-session?
If the cache expires (after 5 minutes of inactivity), the next request is treated as a cache miss. Claude re-processes and re-caches the full prefix, billed at the higher cache write rate (1.25× standard). Your session then continues with a fresh 5-minute TTL. This is a one-time cost per expiration event — not a compounding issue unless your workflow has long gaps between requests.
Key Takeaways
- Prompt caching stores your request prefix server-side and charges 90% less for cached input tokens on subsequent requests.
- Cache TTL is 5 minutes by default, reset on each hit. Cache writes cost 1.25× normal; cache reads cost 0.10× normal.
- The most common cache breakers are dynamic content injected before the breakpoint, content reordering, and model switches.
- The three habits that maximize savings: front-load stable context before cache breakpoints, batch context in one initial load per session, and monitor cache hit/miss rates in API response metadata.
- In agentic workflows with dozens of API calls, caching can reduce token costs by 80%+ per session.
If you’re building Claude-powered workflows and want to handle context management without writing low-level API code, MindStudio’s visual workflow builder gives you a faster path to production. Start free at mindstudio.ai.