If you are making repeated Claude API calls with the same system prompt, document, or few-shot examples, you are paying full input token prices every single time. Prompt caching changes that. Mark a block of content as cacheable and Claude stores it server-side — subsequent calls that reuse it pay a fraction of the original price.
For workloads with long static context, the savings are significant: up to 90% on the cached portion. This guide covers exactly how to use it.
What Is Claude Prompt Caching?
Prompt caching is a feature in the Claude Messages API that lets you mark portions of your input for server-side reuse. When Claude processes a request with a cache-eligible block, it stores that block. If the next request contains the same cached block, Claude skips re-processing it and charges you at the cache read rate instead of the full input rate.
The result: dramatically lower token costs on anything that stays constant across calls — system prompts, reference documents, long instructions, or few-shot examples.
How the Pricing Works
Claude prompt caching uses three price tiers:
| Token Type | What It Is | Cost |
|---|---|---|
| Cache write | First time you send a cacheable block | 25% higher than base input price |
| Cache read | Subsequent calls that hit the cache | 90% lower than base input price |
| Regular input | Non-cached input tokens | Standard rate |
Cache writes cost slightly more upfront, because you are paying Claude to store the context. Cache reads are dramatically cheaper. The math works in your favor from the second call onward, provided that call arrives before the cache expires.
Example: a 10,000-token system prompt at $3/M input tokens costs $0.03 per call without caching. With caching, the first call costs $0.0375 (write cost), then every subsequent call costs $0.003 (read cost). After two calls you have paid $0.0405 instead of $0.06, so you are already ahead. After 100 calls you have paid about $0.33 instead of $3.00.
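To check the arithmetic for your own numbers, here is a minimal sketch of the cumulative cost. The function name `cumulative_cost` is made up for this illustration, and it assumes every call after the first arrives before the cache expires, so there is exactly one write followed by reads.

```python
def cumulative_cost(calls: int, cached_tokens: int, base_price_per_mtok: float,
                    use_cache: bool) -> float:
    """Total input cost in dollars for the static block over `calls` requests.

    Assumes one cache write on the first call and cache reads afterwards,
    i.e. no call arrives after the cache has expired.
    """
    per_call = cached_tokens / 1_000_000 * base_price_per_mtok
    if not use_cache or calls == 0:
        return per_call * calls
    write = per_call * 1.25  # cache write: 25% above the base input price
    read = per_call * 0.10   # cache read: 90% below the base input price
    return write + read * (calls - 1)


for n in (1, 2, 100):
    plain = cumulative_cost(n, 10_000, 3.0, use_cache=False)
    cached = cumulative_cost(n, 10_000, 3.0, use_cache=True)
    print(f"{n:>3} calls: ${plain:.4f} uncached vs ${cached:.4f} cached")
```

Only the static block is counted here; per-call dynamic input and output tokens cost the same with or without caching.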
The Cache TTL
Cached blocks persist for 5 minutes from the last use. Each cache hit resets the 5-minute timer. For pipelines that process one document at a time, this is usually sufficient. For workflows with long pauses between calls, you may need to warm the cache at the start of each batch.
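If your pipeline has irregular gaps, it can help to track on the client side whether the cache is probably still alive. The sketch below is only an estimate: `CacheClock` is a hypothetical helper, not part of the SDK, and the server remains the source of truth (the usage data in each response tells you what actually happened).

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute TTL described above


class CacheClock:
    """Client-side guess at whether a cached block is still alive."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS, now=time.monotonic):
        self._ttl = ttl
        self._now = now          # injectable clock, handy for testing
        self._last_use = None    # no request sent yet

    def touch(self) -> None:
        """Record a request that wrote or read the cache; this restarts the timer."""
        self._last_use = self._now()

    def probably_warm(self) -> bool:
        """True if the last recorded use was less than one TTL ago."""
        if self._last_use is None:
            return False
        return self._now() - self._last_use < self._ttl
```

Call `touch()` after every request that includes the cached block, and check `probably_warm()` before a latency-sensitive call to decide whether a warmup request is worth sending.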
How to Use cache_control
Enable caching by passing the system parameter as an array of content blocks instead of a plain string, and adding cache_control to the block you want cached:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a contract analysis assistant. Extract key terms, obligations, and risk factors. Always respond in structured JSON.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        # contract_text holds the contract to analyze, loaded elsewhere
        {"role": "user", "content": contract_text}
    ]
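)
The only cache type currently supported is "ephemeral", which defaults to the 5-minute TTL described above. Anthropic's documentation also describes an optional 1-hour TTL at a higher cache write price, so check the current docs if you need longer-lived entries.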
What You Can Cache
Any content block in your API request can be marked for caching. The most common use cases:
1. System Prompts
The highest-value target. Long system prompts stay constant across all calls in a pipeline — cache them.
system=[
    {
        "type": "text",
        "text": long_system_prompt,  # 2000+ tokens of instructions
        "cache_control": {"type": "ephemeral"}
    }
]
2. Reference Documents
If every call in a session queries the same document (a contract, a codebase, a spec), cache the document in the messages array:
messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Reference document:\n\n{large_document}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": user_question  # varies per call
            }
        ]
    }
]
3. Few-Shot Examples
Few-shot prompts can be expensive — 20 examples at 100 tokens each adds 2,000 tokens to every call. Cache them:
system=[
    {
        "type": "text",
        "text": persona_instructions,
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": few_shot_examples,  # 2000 tokens of examples
        "cache_control": {"type": "ephemeral"}
    }
]
You can mark multiple blocks for caching. Each marker acts as a cache breakpoint: it caches the prompt prefix up to and including that block.
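If you build system blocks in several places, a small helper keeps the markers consistent. This is a sketch, not SDK functionality: `cacheable_system` is a hypothetical name, and the limit of four breakpoints per request is an assumption based on Anthropic's documentation at the time of writing, so verify it before relying on it.

```python
MAX_CACHE_BREAKPOINTS = 4  # assumption: per-request limit at the time of writing


def cacheable_system(*texts: str) -> list[dict]:
    """Turn each text into a system content block marked for caching."""
    if len(texts) > MAX_CACHE_BREAKPOINTS:
        raise ValueError(
            f"{len(texts)} cache markers requested, "
            f"but at most {MAX_CACHE_BREAKPOINTS} are assumed to be allowed"
        )
    return [
        {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        for text in texts
    ]
```

For example, `system=cacheable_system(persona_instructions, few_shot_examples)` produces the same two blocks as the snippet above.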
TypeScript Example
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a senior code reviewer specializing in security.
[... 1500 tokens of detailed instructions ...]`;

async function reviewCode(code: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [
      {
        role: "user",
        content: `Review this code for security issues:\n\n${code}`
      }
    ]
  });
  // content is a union of block types, so narrow to a text block before reading .text
  const first = response.content[0];
  return first.type === "text" ? first.text : "";
}

// Call this 100 times per hour: each cache hit resets the 5-minute timer,
// so only the first call (or the first after a gap longer than 5 minutes)
// pays the cache write price. All others pay 90% less on the cached block.
How to Verify the Cache Is Working
The API response includes usage data that shows whether your cache hit or missed. Check usage.cache_read_input_tokens and usage.cache_creation_input_tokens:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{"role": "user", "content": question}]
)

usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Output tokens: {usage.output_tokens}")

# First call output:
# Input tokens: 150
# Cache write tokens: 2000
# Cache read tokens: 0

# Second call output (within 5 min):
# Input tokens: 150
# Cache write tokens: 0
# Cache read tokens: 2000  <-- cache hit!
Caching in Multi-turn Conversations
For chatbots and multi-turn sessions, the conversation history grows with each turn. You can cache the growing conversation to avoid re-processing early messages:
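import copy

def chat(messages: list, system: str) -> str:
    # Work on a copy so cache markers do not pile up in the caller's history
    messages = copy.deepcopy(messages)
    # Cache everything except the last user message by marking the
    # final content block of the second-to-last message as cacheable
    if len(messages) > 2:
        prior = messages[-2]
        # cache_control belongs on a content block, not on the message itself
        if isinstance(prior["content"], str):
            prior["content"] = [{"type": "text", "text": prior["content"]}]
        prior["content"][-1]["cache_control"] = {"type": "ephemeral"}
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}],
        messages=messages
    )
    return response.content[0].text
This pattern is especially valuable for long support conversations or coding sessions where the history carries important context.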
Minimum Cache Size
Claude only caches blocks that meet a minimum token threshold. As of current API specs:
- claude-sonnet-4-6 / claude-opus-4-6: minimum 1,024 tokens to cache
- claude-haiku-4-5: minimum 2,048 tokens to cache
Short system prompts below the minimum will not be cached even if you pass cache_control. The API silently ignores the cache directive — no error is returned, and the usage response will show zero cache tokens. Check your usage.cache_creation_input_tokens to confirm caching is active.
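Because the directive is ignored silently, it is worth checking both sides: whether a block is large enough before you send it, and what the usage data says afterwards. The sketch below uses the thresholds listed above (verify them against the current documentation). `meets_minimum` and `caching_status` are hypothetical helper names, and `SimpleNamespace` stands in for a real `response.usage` object.

```python
from types import SimpleNamespace

# Minimum cacheable sizes as listed above; confirm against current docs.
MIN_CACHE_TOKENS = {
    "claude-sonnet-4-6": 1024,
    "claude-opus-4-6": 1024,
    "claude-haiku-4-5": 2048,
}


def meets_minimum(model: str, block_tokens: int) -> bool:
    """True if a block of `block_tokens` tokens is large enough to cache."""
    return block_tokens >= MIN_CACHE_TOKENS[model]


def caching_status(usage) -> str:
    """Classify a usage object as 'hit', 'write', or 'none'.

    A response that both reads and writes cache tokens counts as a hit.
    """
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"
    if written > 0:
        return "write"
    return "none"  # directive ignored, or no cache_control was sent


# Stand-ins for response.usage on a first and a second call
first = SimpleNamespace(cache_creation_input_tokens=2000, cache_read_input_tokens=0)
second = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=2000)
print(caching_status(first), caching_status(second))
```

The token count passed to `meets_minimum` has to come from somewhere; the SDK's token counting endpoint is one option, or you can read it off the usage data of an earlier uncached call.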
Best Practices
Put Static Content First
Caching works by prefix: a cache hit requires the request to match the cached one exactly, from the start of the prompt up to the marked block. Put your static system prompt and reference documents before any dynamic content, because any change ahead of a cache marker causes a miss. Dynamic content (user question, runtime variables) goes last.
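A toy model makes the ordering rule concrete. This is only an illustration of prefix matching, not how the server implements it; `shared_prefix_len` and the block contents are made up for the example.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading blocks that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Static content first: both static blocks are shared between the two calls.
static_first = (
    ["SYSTEM PROMPT", "REFERENCE DOC", "question A"],
    ["SYSTEM PROMPT", "REFERENCE DOC", "question B"],
)
# Dynamic content first: the very first block differs, so nothing is shared.
dynamic_first = (
    ["question A", "SYSTEM PROMPT", "REFERENCE DOC"],
    ["question B", "SYSTEM PROMPT", "REFERENCE DOC"],
)

print(shared_prefix_len(*static_first))   # 2
print(shared_prefix_len(*dynamic_first))  # 0
```

The second layout contains exactly the same static text, yet none of it can be reused, because the part that changes comes first.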
Warm the Cache Intentionally
The first call in a batch pays cache write price. For latency-sensitive pipelines, send a warmup request at the start of each batch, and again after any idle gap longer than 5 minutes, so that subsequent calls hit the cache immediately.
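A warmup request can be as small as you like on the output side. The sketch below assumes the same system block structure used throughout this guide. `warm_cache` is a hypothetical helper, and the `"ping"` user message is an arbitrary placeholder.

```python
def warm_cache(client, model: str, system_prompt: str):
    """Send a minimal request whose only purpose is to write the cache.

    `client` is expected to be an anthropic.Anthropic() instance.
    Returns the usage object so the caller can confirm the cache write.
    """
    response = client.messages.create(
        model=model,
        max_tokens=1,  # keep output cost negligible
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "ping"}],
    )
    return response.usage
```

The system block must be identical to the one your real calls send, otherwise the warmup writes a cache entry that nothing will ever read.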
Combine with Batch API for Maximum Savings
Claude also offers a Batch API that processes requests asynchronously at 50% off standard prices. Combining batch processing with prompt caching stacks the discounts — ideal for nightly document processing pipelines that do not require real-time responses.
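Here is a rough sketch of how the two features might be combined: every batch entry carries the same cacheable system block. `build_batch_requests` is a hypothetical helper, and the `custom_id`/`params` entry shape is my understanding of what the Python SDK's `client.messages.batches.create(requests=...)` expects, so check it against the current Batch API reference.

```python
def build_batch_requests(model: str, system_prompt: str,
                         documents: dict[str, str]) -> list[dict]:
    """One batch entry per document, all sharing the same cacheable system block."""
    shared_system = [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]
    return [
        {
            "custom_id": doc_id,  # lets you match results back to documents
            "params": {
                "model": model,
                "max_tokens": 1024,
                "system": shared_system,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]
```

Batch requests are processed asynchronously and not necessarily in order, so treat cache hits inside a batch as likely rather than guaranteed, and confirm them in the usage data of the results.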
Monitor Cache Hit Rate
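Log cache_read_input_tokens per call. A low hit rate usually means your cached blocks are expiring before subsequent calls arrive, or that something ahead of the cache marker changes between calls. Increase call frequency, restructure your pipeline to batch calls closer together, or move the changing content after the cached block.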
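A small accumulator is enough to get a hit rate out of those logs. This is a sketch under the assumption that you pass it `response.usage` after each call; `CacheStats` is a hypothetical name, and a call counts as a hit if it read any cached tokens.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals of cache activity across API calls."""
    calls: int = 0
    hits: int = 0
    read_tokens: int = 0
    write_tokens: int = 0

    def record(self, usage) -> None:
        """Add one response's usage data to the totals."""
        read = getattr(usage, "cache_read_input_tokens", 0) or 0
        written = getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.calls += 1
        if read > 0:
            self.hits += 1
        self.read_tokens += read
        self.write_tokens += written

    @property
    def hit_rate(self) -> float:
        """Fraction of recorded calls that read from the cache."""
        return self.hits / self.calls if self.calls else 0.0
```

Call `stats.record(response.usage)` after each request and export `stats.hit_rate` to whatever monitoring you already use. In a steady pipeline the rate should approach 1.0; a rate that stays low points at one of the causes above.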
When Not to Use Prompt Caching
- Highly dynamic prompts: if every call has a different system prompt, there is nothing to cache. Caching only saves money on static or semi-static content.
- Very short prompts: prompts under the minimum token threshold will not cache regardless. Short system prompts are cheap anyway.
- One-off calls: if you make a single API call, you pay the cache write cost and never recoup it. Caching only pays off once the cached block is reused.
- Extended thinking enabled: prompt caching interacts with extended thinking, so test carefully and verify cache hits in usage data before optimizing in production.
Cost Estimation Worksheet
Use this to estimate your savings before implementing:
# Inputs
static_tokens = 5000      # tokens in your cached block
calls_per_hour = 100      # how many API calls per hour
base_input_price = 3.0    # $/M tokens (sonnet-4-6 example)
writes_per_hour = 12      # worst case: the cache expires every 5 minutes

# Without caching
cost_per_call_no_cache = (static_tokens / 1_000_000) * base_input_price
hourly_cost_no_cache = cost_per_call_no_cache * calls_per_hour

# With caching: every call that is not a write is a read.
# If calls arrive less than 5 minutes apart, each hit refreshes the TTL
# and the real number of writes drops toward one in total.
cache_write_price = base_input_price * 1.25
cache_read_price = base_input_price * 0.10
reads_per_hour = calls_per_hour - writes_per_hour
cost_per_hour_cache = (
    (static_tokens / 1_000_000) * cache_write_price * writes_per_hour
    + (static_tokens / 1_000_000) * cache_read_price * reads_per_hour
)

savings_pct = (1 - cost_per_hour_cache / hourly_cost_no_cache) * 100
print(f"Without caching: ${hourly_cost_no_cache:.4f}/hr")
print(f"With caching: ${cost_per_hour_cache:.4f}/hr")
print(f"Savings: {savings_pct:.1f}%")
The Bottom Line
Prompt caching is one of the highest-leverage API optimizations available. If your system has any static content in its prompts — and most do — implement caching before you scale. The cost difference between cached and uncached workloads compounds rapidly at volume.
Next steps: review your current system prompts with our system prompt guide, then add cache_control to any block over 1,024 tokens that stays constant across calls. Check your usage data after the first day to confirm cache hits and calculate actual savings.