Prompt Caching in Claude Code: How to Save Millions of Tokens and Extend Session Limits
Learn how Claude Code's prompt caching works, what breaks the cache, and three habits that save millions of tokens and extend your session limits.
Why Your Claude Code Sessions Cost More Than They Should
If you’ve spent any real time working with Claude Code, you’ve probably noticed your context filling up faster than expected — and your token costs climbing with it. A long session involving a large codebase can burn through hundreds of thousands of tokens before you’ve even solved the core problem.
Prompt caching in Claude Code is the mechanism that prevents you from paying full price (and burning context) on content Claude has already seen. When it works well, it’s one of the highest-leverage optimizations available. When it breaks, you’re re-sending the same massive system prompts and file contents over and over, spending tokens you don’t need to spend.
This guide explains how Claude Code’s prompt caching works at a practical level, exactly what causes cache misses, and three concrete habits that can save millions of tokens across a project’s lifetime — while keeping your sessions alive longer.
How Prompt Caching Actually Works
Claude’s prompt caching is built on a straightforward idea: if the beginning of a request is identical to something Claude processed recently, skip re-processing that part and serve it from a stored cache instead.
Seven tools to build an app. Or just Remy.
Editor, preview, AI agents, deploy — all in one tab. Nothing to install.
Under the hood, when Claude processes a prompt, it generates a key-value (KV) cache from its attention layers. That cache represents Claude’s “understanding” of everything it’s read so far. If the next request starts with the exact same sequence of tokens, Anthropic’s infrastructure can skip recomputing that understanding and pick up from the cache checkpoint.
What Gets Cached in Claude Code
Claude Code sends a structured request every time you interact with it. That request typically includes:
- The system prompt (Claude Code’s instructions, plus anything you’ve added via
CLAUDE.md) - File contents Claude has read into context
- Tool definitions (bash, file read/write, etc.)
- The conversation history so far
The cache applies to the prefix of this request — anything that appears at the start and stays unchanged between turns. The more stable content you have at the top of your context, the more cache hits you get.
The Pricing Difference Is Significant
Anthropic documents prompt caching pricing in detail. Cache read tokens cost about 10% of what full input tokens cost. Cache write tokens (when the cache is being created) cost slightly more than normal — around 125% — but subsequent reads make up for it quickly.
If Claude Code is re-processing a 50,000-token context on every message, and you send 20 messages in a session, that’s 1 million input tokens at full price. With a warm cache and stable context, most of those tokens become cache reads at 10% cost. The math adds up fast on a large codebase.
The Cache Has Rules You Need to Know
Caching isn’t automatic or guaranteed. There are constraints that determine whether Claude can use a cached version of your context.
Minimum Cache Size
Claude’s prompt caching requires a minimum threshold before it will attempt to cache a block. For Claude 3.5 Sonnet and Claude 3 Opus, that threshold is 1,024 tokens. Shorter prefixes aren’t cached at all. In practice, Claude Code’s system prompt alone usually clears this bar, but it’s worth knowing for edge cases.
Cache TTL
The cached KV state doesn’t live forever. Anthropic maintains cached prompts for 5 minutes of inactivity by default. If you stop a session and come back 30 minutes later, the cache is cold. Claude Code will need to re-process your full context from scratch on the next message.
This is why long breaks within a session can spike your token usage unexpectedly. The cache was warm, you grabbed coffee, and now you’re paying full input price again.
Cache Is Prefix-Sensitive
This is the most important rule and the one that causes the most accidental cache misses. The cache applies to an exact token sequence from the start of the prompt. If anything changes at position N, everything from position N onward is a cache miss — even if the content after that point is identical.
Think of it like a longest common prefix check. Claude compares the current request against the cached version, character by character from the start, and the moment they diverge, caching stops.
What Breaks the Cache (And Why It Matters)
Most unnecessary token spending in Claude Code sessions comes from unintentional cache invalidation. Here are the most common culprits.
Editing CLAUDE.md Mid-Session
Day one: idea. Day one: app.
Not a sprint plan. Not a quarterly OKR. A finished product by end of day.
CLAUDE.md is loaded as part of the system prompt — one of the very first things in your request. If you edit it during a session, the system prompt changes, and the cache is fully invalidated. Every subsequent message will re-process the entire context from scratch until a new cache builds.
This is one of the most common mistakes. People tweak their project instructions mid-session and don’t realize they’ve just nuked their cache.
Reading Files in a Different Order
When Claude Code reads files into context, the order matters. If Claude reads auth.py, then models.py, then routes.py, the cache is keyed on that sequence. If a new conversation reads models.py first, the prefix diverges immediately and the cache doesn’t help.
This matters most when you use slash commands like /read manually or when you start a new session and the tool discovery order varies.
Modifying Files That Were Read Early
If Claude read a file at the start of the session and you’ve since edited that file, the next time Claude checks its content, the tokens are different. The cache breaks at exactly that point.
This is unavoidable in active development — you’re supposed to edit files. But understanding it helps you think about which files you open first. Files you’re going to edit a lot should ideally be loaded later in the session, not early.
Restarting the Session
Any fresh claude invocation starts with a cold cache. There’s no persistent cache across sessions. If you regularly close and reopen Claude Code on the same project, you’re rebuilding the cache from zero each time.
The /continue flag in Claude Code lets you resume a prior conversation, but the cache still needs to warm up on the first message of the new session.
Adding or Changing Tool Definitions
Claude Code sends tool definitions (file operations, bash, etc.) as part of the structured request. If the tool list changes — which can happen between Claude Code versions or with certain configurations — the cache breaks at that point in the prompt.
Three Habits That Save Millions of Tokens
Knowing what breaks the cache is half the battle. Here are three concrete practices that preserve it.
Habit 1: Finalize CLAUDE.md Before Starting Work
Treat CLAUDE.md like a config file you only touch between sessions, not during them. Before you start coding, make sure your project instructions are where you want them. Add your conventions, style preferences, important context about the codebase — then leave it alone.
If you realize mid-session that you need to update your instructions, note it and do it after you finish. The token savings from a stable system prompt across a long session are substantial.
For teams, this means establishing a review process for CLAUDE.md changes — similar to how you’d handle changes to a CI config. Unreviewed mid-session edits to shared CLAUDE.md files are a hidden source of token waste.
Habit 2: Establish a Consistent File Loading Order
At the start of each session, load your most stable, least-likely-to-change files first. Think: README.md, schema definitions, interface files, constants — anything that describes the architecture without being something you’ll edit in this session.
Then load the files you’ll actually be working on. This way, if you edit a working file, the cache only breaks at that point. All the stable architecture files above it remain cached.
You can use a short startup script or a custom CLAUDE.md instruction that tells Claude to read files in a specific order when beginning a session. Consistency here compounds over time.
Habit 3: Use /compact Strategically, Not Reactively
Claude Code’s /compact command summarizes the conversation history to reclaim context space. It’s useful, but timing matters.
If you use /compact reactively — waiting until you’re nearly out of context — you’ve already spent tokens on a long, uncached conversation. The better approach is to use /compact proactively, before the conversation grows so large that the cache is getting stale or the session is at risk.
After a /compact, Claude rewrites the conversation history with a summary. This creates a new, shorter prefix that can be cached efficiently. The next several messages will be much cheaper.
Think of it as resetting to a clean slate while preserving the important context. Done right, it extends how long you can stay productive in a single session without running out of context or burning tokens unnecessarily.
Extending Session Limits With Caching
Claude Code’s session limits are ultimately about context window capacity — not wall-clock time. The 200K token context window on Claude 3.5 Sonnet and Claude 3 Opus means you have a finite amount of space for conversation history, file contents, and outputs.
Caching doesn’t increase the context window, but it does reduce how quickly you consume billable tokens. And by using /compact and careful file management, you can keep the conversation history shorter without losing useful context, extending the practical length of a session.
The Session Length Math
Say you’re working on a large codebase and each exchange adds about 5,000 tokens to your conversation history. Without compaction, you’ll hit the context limit in roughly 30–35 exchanges. With proactive /compact calls every 10–15 exchanges, you can keep working indefinitely — as long as the summary retains the context you need.
Combined with a warm cache, your cost per exchange drops significantly. The fixed cost of the stable system prompt and file contents is amortized across many cache reads instead of paid in full each time.
What Happens When You Hit the Limit
When Claude Code’s context fills up, it will either refuse to continue or force a compaction. At that point, you have less control over what gets summarized and what gets dropped. Getting ahead of it gives you better control over the session’s memory.
How MindStudio Fits Into AI Workflow Optimization
If you’re thinking about token efficiency for Claude Code, you’re probably also thinking about how to build systems around Claude — not just use it interactively. This is where MindStudio becomes relevant.
Other agents start typing. Remy starts asking.
Scoping, trade-offs, edge cases — the real work. Before a line of code.
MindStudio is a no-code platform for building AI agents and automated workflows. It supports 200+ models out of the box, including Claude 3.5 Sonnet, and lets you design multi-step workflows that reason across tools, APIs, and data sources. Crucially, when you build a workflow in MindStudio, you control the structure of the prompt that gets sent on each run — which means you can design your workflows with caching in mind from the start.
For teams using Claude in production pipelines (code review bots, documentation agents, QA workflows), MindStudio gives you a clean abstraction layer. You define stable system prompts, consistent tool configurations, and structured context windows — exactly the conditions that maximize cache hit rates.
The Agent Skills Plugin is also worth knowing about if you’re building agents that call out to external tools. It’s an npm SDK that lets any Claude-powered agent call 120+ typed capabilities (send email, search the web, run a workflow) without building the infrastructure yourself. That means less glue code in your prompts and tighter, more cacheable context.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
How do I know if prompt caching is working in Claude Code?
Claude Code doesn’t expose cache hit/miss status directly in the UI. But you can infer it from the Anthropic API response metadata if you’re running Claude Code in a context where you can see raw API responses. The response includes cache_read_input_tokens and cache_creation_input_tokens fields that tell you exactly how many tokens were read from cache versus freshly processed. If you’re using Claude Code through the standard CLI, the most practical signal is your token usage over a session — stable early-session costs followed by lower per-message costs suggests caching is working.
Does prompt caching work across different Claude Code sessions?
No. The cache is session-specific and has a TTL of about 5 minutes. When you close Claude Code and start a new session, you start with a cold cache. The first message of any session will always be a full cache write. If you want to warm the cache quickly, send an initial message that loads your standard files and system context, then proceed with your actual work.
What’s the minimum context size for prompt caching to apply?
Claude’s prompt caching has a minimum threshold of 1,024 tokens for the cached prefix. In practice, Claude Code’s default system prompt is well above this limit, so caching applies to most sessions. The threshold matters more if you’re building custom Claude integrations with very short system prompts.
Can I force Claude to cache specific parts of my prompt?
If you’re using the Anthropic API directly, you can add cache_control breakpoints to your prompt structure to tell Claude where to create cache checkpoints. Claude Code manages this automatically based on the prompt structure it sends. If you’re building your own Claude-powered tools, explicitly placing cache breakpoints after your stable system prompt and before the conversation history gives you fine-grained control.
Why does my token usage spike suddenly mid-session?
Most mid-session spikes are caused by cache invalidation — usually from editing CLAUDE.md, changing which files are in context, or taking a long break (letting the 5-minute TTL expire). If you see a sudden jump in token usage after a stable period, check whether you made any changes to the system prompt or the order of file reads in that conversation turn.
Remy doesn't build the plumbing. It inherits it.
Other agents wire up auth, databases, models, and integrations from scratch every time you ask them to build something.
Remy ships with all of it from MindStudio — so every cycle goes into the app you actually want.
Does using /compact save money or just extend context?
Both. /compact reduces the conversation history to a summary, which shrinks the size of the prefix sent on subsequent messages. Shorter prompts mean lower absolute token counts, and a shorter stable prefix is easier for caching to cover efficiently. The combination of reduced history length and effective caching after a /compact typically produces meaningfully lower per-message costs for the rest of the session.
Key Takeaways
- Prompt caching works on the prefix of your request — anything that changes early in the prompt invalidates the cache for everything after it.
- Cache hits cost roughly 10% of full input token prices; a warm cache across a long session can save the majority of your input token costs.
- The three biggest cache killers are editing
CLAUDE.mdmid-session, inconsistent file loading order, and session restarts. - Finalize your
CLAUDE.mdbefore starting, load stable files first, and use/compactproactively — not reactively. - Cache TTL is ~5 minutes. Long breaks kill the cache; plan accordingly.
- Tools like MindStudio let you build Claude-powered workflows with cache-friendly prompt structures baked in from the start — worth exploring if you’re running Claude in production pipelines.