What Is Prompt Caching in Claude Code? How to Save Millions of Tokens

The Token Bill Nobody Saw Coming

If you’ve been using Claude Code for serious development work — loading large codebases, attaching documentation, running multi-turn conversations — your token costs are probably higher than they need to be. In some cases, much higher.

Prompt caching in Claude is one of the most practical cost-reduction tools available. It can cut input token costs by up to 90% for repeated context. But most developers either don’t know it exists or aren’t using it correctly.

This guide explains exactly how prompt caching works, what the cache TTL means in practice, what silently breaks the cache (and costs you full price), and three concrete habits that will minimize your Claude token spend.

What Prompt Caching Actually Is

Every time you send a message to Claude, the model processes all the tokens in your request — system prompt, conversation history, attached files, tool definitions, everything. Even if you sent the exact same context two minutes ago.

Prompt caching changes that. When caching is enabled, Anthropic stores a processed version of your context on their servers. On subsequent requests that include that same prefix, Claude reads from the cache instead of re-processing from scratch.

The result: cached input tokens are billed at roughly 10% of the normal input token rate. That’s a 90% reduction.

This matters most when you have large, stable context that appears in request after request — like a long system prompt, a full codebase dump, or an extensive documentation block. Without caching, you pay full price every time. With caching, you pay full price once (plus a small cache write fee), then a fraction of that for every subsequent hit.

What Gets Cached

Anthropic’s prompt caching documentation specifies that the cache stores a “prefix” of your request — everything from the beginning of your input up to the cache breakpoint you define.

This includes:

System prompts
Tool and function definitions
Long conversation histories
Attached documents or file contents
Any other large, repeated context

What matters is that the content before the breakpoint stays identical across requests. The moment anything changes in that prefix, the cache is invalidated.

How Cache TTL and Pricing Work

Cache Lifetime

The default cache TTL (time to live) is 5 minutes from the last time the cache was accessed. Each successful cache hit resets the clock. So if you’re actively working and hitting the cache regularly, it stays alive.

Anthropic also offers an extended 1-hour cache TTL, which is useful for longer development sessions where you might step away between requests. The trade-off is a more expensive write: 1-hour cache writes are billed at 2× the base input price rather than the default 1.25×.

If you’re seeing cache misses in a workflow that should be hitting the cache, the first thing to check is whether more than 5 minutes passed between requests.
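To make the reset-on-hit behavior concrete, here is a small local simulation. It is only a sketch: `classify_requests` is a hypothetical helper, it never calls the API, and it assumes a request arriving exactly at the TTL boundary still counts as a hit.

```python
def classify_requests(timestamps, ttl_seconds=300):
    """Label each request 'write' or 'read' under a TTL that resets on every use.

    timestamps: request times in seconds, in ascending order.
    """
    labels = []
    last_use = None  # time the cache entry was last written or read
    for t in timestamps:
        if last_use is not None and t - last_use <= ttl_seconds:
            labels.append("read")   # entry still alive: cache hit
        else:
            labels.append("write")  # no entry, or it expired: pay the write rate
        last_use = t  # both writes and reads restart the clock
    return labels


# Requests at 0 s, 2 min, 6 min, and 12 min.
# The 6-minute request is a hit because the 2-minute request reset the timer;
# the 12-minute request misses because 6 idle minutes exceed the 5-minute TTL.
print(classify_requests([0, 120, 360, 720]))
# ['write', 'read', 'read', 'write']
```

Feeding your own request timestamps through a function like this is a quick way to see whether idle gaps explain a run of unexpected cache writes.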

Token Minimums

Not every request qualifies for caching. There’s a minimum threshold:

Claude 3.5 Sonnet, Claude 3 Opus: 1,024 tokens minimum
Claude 3.5 Haiku: 2,048 tokens minimum

If your cacheable prefix is shorter than the threshold, caching won’t activate. For most Claude Code workflows with loaded context, this isn’t an issue — but it’s worth knowing if you’re experimenting with shorter system prompts.
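A quick pre-flight check against these thresholds might look like the sketch below. The dictionary keys are illustrative labels rather than official model IDs, the numbers are the ones quoted above, and the token count is assumed to come from your own tokenizer or token-counting call.

```python
# Minimum cacheable prefix sizes as quoted in this article.
# Keys are hypothetical family labels, not official model identifiers.
MIN_CACHEABLE_TOKENS = {
    "claude-3-5-sonnet": 1024,
    "claude-3-opus": 1024,
    "claude-3-5-haiku": 2048,
}


def meets_cache_minimum(model_family, prefix_tokens):
    """Return True if the prefix is long enough for caching to activate."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_family]


print(meets_cache_minimum("claude-3-5-sonnet", 1500))  # True
print(meets_cache_minimum("claude-3-5-haiku", 1500))   # False
```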

What Caching Costs

Cache writes (the first time Claude processes and stores the context) cost 25% more than standard input tokens. Cache reads (every subsequent hit) cost about 10% of the standard input token price.

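So the break-even math is simple: the first request costs 1.25× the normal rate, and every repeat costs 0.10×. Two requests cost 1.35× with caching versus 2× without, so caching is ahead as soon as the prefix is reused once. By the third request, you’re meaningfully ahead. By the tenth, the savings are substantial.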
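The cumulative cost under those multipliers can be tabulated in a few lines. This is a sketch in relative units (1.0 = one uncached pass over the prefix), assuming the 1.25× write and 0.10× read rates quoted above and no expiry between requests.

```python
WRITE_MULT = 1.25  # first request: cache write
READ_MULT = 0.10   # each later request: cache read


def relative_cost(n_requests, cached):
    """Cumulative prefix cost for n_requests, in units of one uncached pass."""
    if n_requests <= 0:
        return 0.0
    if not cached:
        return float(n_requests)
    return WRITE_MULT + READ_MULT * (n_requests - 1)


for n in (1, 2, 3, 10):
    print(n, relative_cost(n, cached=False), round(relative_cost(n, cached=True), 2))
# 1 1.0 1.25
# 2 2.0 1.35
# 3 3.0 1.45
# 10 10.0 2.15
```

A single request is the only case where caching costs more; from the second request onward the cached total is lower.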

For Claude Code sessions involving large codebases, the savings can reach millions of tokens per day — which translates to real dollars fast.

How to Enable Prompt Caching in Claude Code

Prompt caching uses a cache_control parameter attached to specific content blocks in your API request. You mark the point in your input where you want Claude to cache everything before it.

Here’s a basic structure using the Messages API:

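import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            # Short, static role text; it sits inside the cached prefix too.
            "type": "text",
            "text": "You are a senior software engineer...",
        },
        {
            # Large, stable context. The marker below is the cache breakpoint.
            "type": "text",
            "text": "<entire codebase or documentation here>",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        # Content after the breakpoint is processed at the normal rate.
        {"role": "user", "content": "Refactor the authentication module."}
    ]
)

# Cache writes and reads for this call are reported on the usage object.
print(response.usage)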

The cache_control: {"type": "ephemeral"} marker tells Claude: cache everything up to and including this block.

Cache Breakpoints

You can set up to 4 cache breakpoints in a single request. This is useful when you have multiple distinct sections of stable context — for example, a system prompt, a set of tool definitions, and a document corpus. Each breakpoint acts as a potential cache hit point.

Place breakpoints at the end of each major stable section, not in the middle. Claude caches the prefix up to the breakpoint, so anything after the last breakpoint is always re-processed.
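As a sketch of multiple breakpoints, the payload below marks the end of the tool definitions and the end of each system block, then checks the total against the limit of 4. The `run_tests` tool and the `count_breakpoints` helper are hypothetical and exist only for illustration; nothing here calls the API.

```python
MAX_BREAKPOINTS = 4


def count_breakpoints(node):
    """Recursively count 'cache_control' markers in a request payload."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        nested = sum(
            count_breakpoints(value)
            for key, value in node.items()
            if key != "cache_control"
        )
        return own + nested
    if isinstance(node, list):
        return sum(count_breakpoints(item) for item in node)
    return 0


request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "run_tests",  # hypothetical tool, for illustration only
            "description": "Run the project's test suite.",
            "input_schema": {"type": "object", "properties": {}},
            "cache_control": {"type": "ephemeral"},  # breakpoint 1: end of tools
        }
    ],
    "system": [
        {
            "type": "text",
            "text": "You are a senior software engineer...",
            "cache_control": {"type": "ephemeral"},  # breakpoint 2: end of role text
        },
        {
            "type": "text",
            "text": "<entire codebase or documentation here>",
            "cache_control": {"type": "ephemeral"},  # breakpoint 3: end of corpus
        },
    ],
    "messages": [
        {"role": "user", "content": "Refactor the authentication module."}
    ],
}

used = count_breakpoints(request)
assert used <= MAX_BREAKPOINTS, f"too many cache breakpoints: {used}"
print(used)  # 3
```

A payload shaped like this could then be passed to the SDK as `client.messages.create(**request)`.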

What Breaks the Cache

This is where developers lose money without realizing it. Several common behaviors silently invalidate the cache.

1. Changing Anything Before the Breakpoint

The cache stores an exact prefix. If a single token changes before the cache breakpoint — a different date injected into the system prompt, a logging statement that adds dynamic content, a timestamp — the entire cache is invalidated.

Dynamic content is the most common cause of unexpected cache misses. If your system prompt includes something like “Current date: {date}”, that changes every day and breaks the cache.

Fix: Push all dynamic content (current date, user-specific info, session context) to the section after your last cache breakpoint. Keep everything before the breakpoint completely static.
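One way to apply that fix is to keep the system blocks as a constant and put the date into the user message, which sits after the breakpoint. This is a minimal sketch; `build_request` is a hypothetical helper and the placeholder text mirrors the earlier example.

```python
from datetime import date

# Static prefix: identical on every request, so it can be cached.
STATIC_SYSTEM = [
    {"type": "text", "text": "You are a senior software engineer..."},
    {
        "type": "text",
        "text": "<entire codebase or documentation here>",
        "cache_control": {"type": "ephemeral"},  # last cache breakpoint
    },
]


def build_request(question, today):
    """Put dynamic values in the user turn, after the cached prefix."""
    return {
        "system": STATIC_SYSTEM,
        "messages": [
            {
                "role": "user",
                "content": f"Current date: {today.isoformat()}\n\n{question}",
            }
        ],
    }


monday = build_request("Refactor the authentication module.", date(2025, 1, 6))
tuesday = build_request("Refactor the authentication module.", date(2025, 1, 7))
print(monday["system"] == tuesday["system"])      # True: prefix unchanged
print(monday["messages"] == tuesday["messages"])  # False: only the tail differs
```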

2. Reordering Content

The cache is order-sensitive. If you rearrange the blocks in your system message, even with identical content, the cache won’t recognize it.

Fix: Lock in the order of your static content blocks and don’t change it across sessions.
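A local fingerprint of the prefix can catch accidental reordering or edits before they cost you a cache write. The sketch below is a debugging aid only: `prefix_fingerprint` is a hypothetical helper, and the hash is not how Anthropic identifies cache entries.

```python
import hashlib
import json


def prefix_fingerprint(blocks):
    """Hash the static blocks in order; any edit or reorder changes the digest."""
    # sort_keys normalizes dict key order but leaves list (block) order intact.
    canonical = json.dumps(blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


role = {"type": "text", "text": "You are a senior software engineer..."}
corpus = {"type": "text", "text": "<entire codebase or documentation here>"}

print(prefix_fingerprint([role, corpus]) == prefix_fingerprint([role, corpus]))  # True
print(prefix_fingerprint([role, corpus]) == prefix_fingerprint([corpus, role]))  # False
```

Logging this value alongside each request makes it easy to spot the moment the prefix drifted.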

3. Switching Models

Cache entries are model-specific. A cache built with claude-3-5-sonnet-20241022 won’t be hit by a request using claude-3-opus-20240229, even if the content is identical.

Fix: Standardize on one model per workflow. Switching models for experimentation is fine, just know that each model version maintains its own cache.

4. Different API Keys or Organizations

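Anthropic’s documentation describes caches as isolated between organizations: requests from a different organization never hit your cache, even with identical content. If your team’s requests are split across separate organizations or accounts, each one builds and pays for its own cache.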

Fix: Route your Claude Code requests through a consistent API key for shared caching benefits.

5. Letting the Cache Expire

If you don’t make a request within the TTL window, the cache clears and the next request is a full cache write (charged at the higher write rate).

Fix: For long-running workflows, send a request at least every 5 minutes where possible, or use the extended 1-hour TTL where it is available.
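Requesting the longer lifetime is a one-field change on the cache marker. The sketch below assumes the `ttl` field accepts "5m" and "1h", as described in Anthropic's extended-TTL documentation; `cache_marker` is a hypothetical helper, and availability may depend on your model and SDK version, so check the current docs before relying on it.

```python
ALLOWED_TTLS = ("5m", "1h")  # assumed values; verify against current docs


def cache_marker(ttl=None):
    """Build a cache_control value, optionally requesting a specific TTL."""
    marker = {"type": "ephemeral"}
    if ttl is not None:
        if ttl not in ALLOWED_TTLS:
            raise ValueError(f"unsupported ttl {ttl!r}; expected one of {ALLOWED_TTLS}")
        marker["ttl"] = ttl
    return marker


block = {
    "type": "text",
    "text": "<entire codebase or documentation here>",
    "cache_control": cache_marker("1h"),  # survives longer idle gaps
}
print(block["cache_control"])  # {'type': 'ephemeral', 'ttl': '1h'}
```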

Three Habits That Maximize Prompt Cache Savings

These aren’t theoretical optimizations. They’re practical habits that directly affect your monthly token bill.

Habit 1: Front-Load Stable Context, Backload Dynamic Context

Structure your prompts so that everything stable comes first, followed by everything that changes.

A well-structured Claude Code request might look like:

Static role/persona (never changes)
Codebase dump or documentation (changes rarely, so cache this)
Tool definitions (stable per session, so cache this; note that the API places tools ahead of the system prompt when it builds the cached prefix)
[Cache breakpoint here]
Current task or dynamic instructions (changes per request, not cached)

This pattern maximizes cache reuse because the large, expensive sections are always identical, while the small, dynamic sections are cheap to re-process.

Habit 2: Batch Your Context, Not Your Requests

A common mistake is breaking up context across multiple smaller requests instead of loading it all once. Each new request has to re-establish context from scratch unless you’re using caching — and even with caching, fragmenting context across requests can break the cache pattern.

Instead, load your full context (the entire relevant portion of your codebase, the complete documentation block) in a single well-structured initial message, then let subsequent turns hit the cache.

Think of the initial context-loading request as a fixed cost you pay once per session. Everything after that should be a cache hit.

Habit 3: Monitor Cache Hit Rates in Your Usage Metadata

Claude’s API returns cache metadata in every response. The usage object includes:

cache_creation_input_tokens: tokens written to cache (costs 25% more)
cache_read_input_tokens: tokens read from cache (costs 90% less)
input_tokens: tokens after the last cache breakpoint, processed at the normal rate

If you’re not logging this data, you’re flying blind. Set up logging to track your cache hit rate across a session. If you’re seeing high cache_creation_input_tokens and low cache_read_input_tokens, you have a cache invalidation problem — something is breaking the cache before it can pay off.

A healthy pattern looks like: one cache write at session start, then cache reads for every subsequent request in that session.
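A session-level hit rate can be computed from those three fields. The sketch below works on plain dictionaries so it runs without an API call; with the Python SDK you would convert each `response.usage` first (for example with `model_dump()`, assuming a Pydantic v2-based SDK version). `cache_hit_rate` is a hypothetical helper.

```python
def cache_hit_rate(usages):
    """Fraction of all input tokens in a session that were served from cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Illustrative numbers: one cache write, then three cache reads.
session = [
    {"input_tokens": 40, "cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0},
    {"input_tokens": 35, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000},
    {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000},
    {"input_tokens": 45, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000},
]
print(f"{cache_hit_rate(session):.1%}")
```

A rate that stays near zero across a long session is the signal that something in the prefix is changing between requests.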

Prompt Caching in Agentic Workflows

Prompt caching becomes even more valuable in agentic workflows, where an AI agent might make dozens or hundreds of API calls in a single run. Long tool definitions, large knowledge bases, and extended conversation histories all compound the savings.

For agents doing code review, documentation generation, or codebase analysis, the pattern is especially clear: the agent loads the relevant files once, caches them, then asks multiple questions against that cached context. Without caching, each question re-processes thousands of tokens. With caching, only the new question is processed fresh.

This is where MindStudio becomes directly relevant.

Managing Claude Workflows Without Token Overhead

MindStudio is a no-code platform for building and deploying AI agents, and it supports Claude alongside 200+ other models out of the box. When you build a Claude-powered workflow in MindStudio, the platform handles the infrastructure layer — including how context is structured and passed across steps in multi-turn workflows.

For teams who want the benefits of prompt caching without managing the API plumbing manually, MindStudio’s visual workflow builder lets you design your context structure, define stable vs. dynamic sections, and run Claude agents that follow best practices by default. You don’t need to manually instrument cache_control blocks across your codebase.

You can also use MindStudio’s Agent Skills Plugin to let external agents — including Claude Code — call MindStudio-powered capabilities as simple method calls, which can be a cleaner way to offload complex, context-heavy tasks to pre-built, optimized workflows.

Try MindStudio free at mindstudio.ai.

Real-World Numbers: What the Savings Look Like

To make this concrete, here’s a rough calculation for a Claude Code session doing codebase analysis.

Assume:

System prompt + codebase context: 50,000 tokens
20 requests per session, each with a new question
Claude 3.5 Sonnet pricing: $3 per million input tokens

Without caching:

50,000 tokens × 20 requests = 1,000,000 tokens = $3.00 per session

With caching:

Cache write (once): 50,000 tokens × 1.25 = 62,500 token-equivalents, which at $3/M is about $0.19
Cache reads (19 times): 50,000 tokens × 0.10 × 19 = 95,000 token-equivalents, which at $3/M is about $0.29
Total: about $0.47 per session ($0.4725 before rounding)

That’s an 84% reduction for one session. Scale that across a team running multiple sessions per day, and the savings become significant quickly.
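The same arithmetic can be wrapped in a function to try other session sizes. This sketch makes the same simplifications as the calculation above: it prices only the shared prefix, ignores the per-request question and output tokens, and assumes the cache never expires mid-session. `session_cost_usd` is a hypothetical helper.

```python
def session_cost_usd(prefix_tokens, requests, price_per_mtok,
                     write_mult=1.25, read_mult=0.10):
    """Return (uncached_cost, cached_cost) in dollars for the shared prefix."""
    if requests < 1:
        return 0.0, 0.0
    per_token = price_per_mtok / 1_000_000
    uncached = prefix_tokens * requests * per_token
    # One write at the premium rate, then (requests - 1) discounted reads.
    cached = prefix_tokens * per_token * (write_mult + read_mult * (requests - 1))
    return uncached, cached


uncached, cached = session_cost_usd(50_000, 20, 3.00)
print(f"${uncached:.2f} vs ${cached:.4f} ({1 - cached / uncached:.0%} saved)")
# $3.00 vs $0.4725 (84% saved)
```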

FAQ

Does prompt caching work with all Claude models?

Prompt caching is supported on Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, and Claude 3 Haiku. It’s not available on older or unsupported model versions. Always check the Anthropic model documentation for the latest compatibility list, since new models are released regularly.

How long does a prompt cache last?

The default cache TTL is 5 minutes from the last access. Each cache hit resets the 5-minute timer. Anthropic has introduced extended TTL options (up to 1 hour) for certain workflows. If you’re seeing cache misses in an active session, check whether your requests are spaced more than 5 minutes apart.

What is the minimum number of tokens needed to use prompt caching?

The minimum cacheable prefix is 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2,048 tokens for Claude 3.5 Haiku. Requests with a prefix shorter than these thresholds won’t activate caching, so they’ll be billed at normal input token rates.

Does prompt caching reduce output token costs?

No. Prompt caching only applies to input tokens — specifically, the tokens in your request that match the cached prefix. Output tokens (what Claude generates in response) are always billed at the standard output token rate.

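Can I share a cache across multiple users or API keys?

No, not across organizations. Anthropic’s documentation describes caches as isolated between organizations, so different organizations never share a cache even when the content is identical. For team environments where you want shared caching benefits, keep requests inside one organization and route them through a consistent API key.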

What happens when the cache expires mid-session?

If the cache expires (after 5 minutes of inactivity), the next request is treated as a cache miss. Claude re-processes and re-caches the full prefix, billed at the higher cache write rate (1.25× standard). Your session then continues with a fresh 5-minute TTL. This is a one-time cost per expiration event — not a compounding issue unless your workflow has long gaps between requests.

Key Takeaways

Prompt caching stores your request prefix server-side and charges 90% less for cached input tokens on subsequent requests.
Cache TTL is 5 minutes by default, reset on each hit. Cache writes cost 1.25× normal; cache reads cost 0.10× normal.
The most common cache breakers are dynamic content injected before the breakpoint, content reordering, and model switches.
The three habits that maximize savings: front-load stable context before cache breakpoints, batch context in one initial load per session, and monitor cache hit/miss rates in API response metadata.
In agentic workflows with dozens of API calls, caching can reduce token costs by 80%+ per session.

If you’re building Claude-powered workflows and want to handle context management without writing low-level API code, MindStudio’s visual workflow builder gives you a faster path to production. Start free at mindstudio.ai.
