How much does Claude prompt caching cost?

Prompt caching has two prices relative to standard input tokens: cache writes cost 25% more than the base input rate (you pay to store the content), and cache reads cost 90% less than the base input rate. Caching pays for itself from the second call that reuses the same cached block. At 100 calls per hour on a 5,000-token system prompt, it cuts that portion of your bill by about 76% even if the cache is rewritten every 5 minutes, and by close to 89% if the cache stays warm for the whole hour.

How long does Claude's prompt cache last?

Claude prompt cache entries persist for 5 minutes from the last use. Each cache hit resets the timer. For high-frequency pipelines this is sufficient — the cache stays warm continuously. For workflows with longer pauses, send a periodic warmup request or structure your batches to complete within 5-minute windows.

How do I know if Claude prompt caching is working?

Check the usage object in the API response. A successful cache hit shows a non-zero value in usage.cache_read_input_tokens. A cache write shows a non-zero value in usage.cache_creation_input_tokens. If both are zero on a call where you expected caching, your block may be below the minimum token threshold.

Can I cache the conversation history in a multi-turn chat?

Yes. Add cache_control to a content block in an earlier turn of the messages array to avoid re-processing the history on each new turn. A common pattern is to mark the last content block of the second-to-last message, keeping only the most recent user message dynamic. This is especially valuable for long conversations where history carries important context.

Can I use prompt caching with Claude extended thinking?

Extended thinking and prompt caching can be used together, but they interact in ways that may affect cache hit rates. Test your specific setup and verify cache hits in the usage data before optimizing in production. The extended thinking tokens themselves are not cacheable — only the static input blocks are.

What content should I cache in Claude API calls?

Cache any content that stays constant across multiple calls: system prompts with detailed instructions, reference documents processed repeatedly, few-shot examples, and persona definitions. Do not try to cache dynamic content like user questions, runtime variables, or unique per-call context — it will never hit the cache.

Is Claude prompt caching available on all models?

Prompt caching is available on Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude Sonnet 4.6, Claude Opus 4.6, and Claude Haiku 4.5. The minimum token threshold differs: 1,024 tokens for Opus/Sonnet models, 2,048 tokens for Haiku models. Check the Anthropic documentation for the latest supported model list.

Claude Prompt Caching: Cut API Costs by 90%

What is the minimum token size to use Claude prompt caching?

For claude-sonnet-4-6 and claude-opus-4-6, the minimum cacheable block is 1,024 tokens. For claude-haiku-4-5, the minimum is 2,048 tokens. Blocks below the minimum will not be cached even if you pass cache_control — the API silently ignores the directive. Check usage.cache_creation_input_tokens in the response to confirm caching is active.

Claude prompt caching slashes API bills by up to 90% on repeated context. Learn how cache_control works, what to cache, and how to apply it with real code examples.

If you are making repeated Claude API calls with the same system prompt, document, or few-shot examples, you are paying full input token prices every single time. Prompt caching changes that. Mark a block of content as cacheable and Claude stores it server-side — subsequent calls that reuse it pay a fraction of the original price.

For workloads with long static context, the savings are significant: up to 90% on the cached portion. This guide covers exactly how to use it.

What Is Claude Prompt Caching?

Prompt caching is a feature in the Claude Messages API that lets you mark portions of your input for server-side reuse. When Claude processes a request with a cache-eligible block, it stores that block. If the next request contains the same cached block, Claude skips re-processing it and charges you at the cache read rate instead of the full input rate.

The result: dramatically lower token costs on anything that stays constant across calls — system prompts, reference documents, long instructions, or few-shot examples.

How the Pricing Works

Claude prompt caching uses three price tiers:

Token Type	What It Is	Cost
Cache write	First time you send a cacheable block	25% higher than base input price
Cache read	Subsequent calls that hit the cache	90% lower than base input price
Regular input	Non-cached input tokens	Standard rate

Cache writes cost slightly more upfront because you are paying Claude to store the context. Cache reads are dramatically cheaper. The math works in your favor from the second call with the same cached block: one write plus one read costs 1.35x the base price, against 2x for two uncached calls.

Example: a 10,000-token system prompt at $3/M input tokens costs $0.03 per call without caching. With caching, the first call costs $0.0375 (write cost), then every subsequent call costs $0.003 (read cost). By call two you have already saved money ($0.0405 against $0.06). By call 100 you have paid about $0.33 instead of $3.00.
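The arithmetic in this example can be reproduced with a short sketch. The function name cumulative_cost is illustrative, the 1.25x and 0.10x multipliers come from the pricing table above, and the sketch assumes the cache never expires between calls.

```python
def cumulative_cost(calls: int, tokens: int, price_per_m: float, cached: bool) -> float:
    """Total input cost in dollars for `calls` requests sharing one static block."""
    base = tokens / 1_000_000 * price_per_m
    if not cached or calls == 0:
        return base * calls
    # One cache write (25% premium), then cache reads (90% discount).
    return base * 1.25 + base * 0.10 * (calls - 1)


for n in (1, 2, 100):
    plain = cumulative_cost(n, 10_000, 3.0, cached=False)
    cached = cumulative_cost(n, 10_000, 3.0, cached=True)
    print(f"{n:>3} calls: ${plain:.4f} uncached vs ${cached:.4f} cached")
```

A single call is more expensive with caching; the cached total drops below the uncached total at the second call.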

The Cache TTL

Cached blocks persist for 5 minutes from the last use. Each cache hit resets the 5-minute timer. For pipelines that call the API more often than once every 5 minutes, the cache stays warm. For workflows with longer pauses between calls, you may need to warm the cache at the start of each batch.
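To see how call spacing interacts with the TTL, here is a small offline simulation. It is a sketch of the rule as described above (5-minute lifetime, timer restarted on every use), not a model of the service's actual internals; count_writes_and_reads is a hypothetical helper name.

```python
def count_writes_and_reads(call_times: list[float], ttl: float = 300.0) -> tuple[int, int]:
    """Classify calls (timestamps in seconds, ascending) as cache writes or reads."""
    writes = reads = 0
    expires_at = None  # no cache entry exists yet
    for t in call_times:
        if expires_at is not None and t < expires_at:
            reads += 1
        else:
            writes += 1
        expires_at = t + ttl  # a write and a hit both restart the timer
    return writes, reads


steady = [i * 36.0 for i in range(100)]  # 100 calls in an hour, evenly spaced
sparse = [i * 600.0 for i in range(6)]   # one call every 10 minutes
print(count_writes_and_reads(steady))
print(count_writes_and_reads(sparse))
```

Under these assumptions the evenly spaced workload pays for one write and 99 reads, while the sparse one pays the write price on every call.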

How to Use cache_control

Enable caching by passing the system parameter as an array of content blocks instead of a plain string, and adding cache_control to the block you want cached:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
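        {
            "type": "text",
            # Placeholder text: a real prompt must meet the model's minimum
            # cacheable length (see Minimum Cache Size) or it will not be cached.
            "text": "You are a contract analysis assistant. Extract key terms, obligations, and risk factors. Always respond in structured JSON.",
            "cache_control": {"type": "ephemeral"}
        }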
    ],
    messages=[
        {"role": "user", "content": contract_text}
    ]
)

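The only cache type currently supported is "ephemeral", which has a 5-minute TTL by default. Anthropic's documentation also describes an optional 1-hour TTL at a higher cache write price; check the current documentation for availability and pricing.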

What You Can Cache

Any content block in your API request can be marked for caching. The most common use cases:

1. System Prompts

The highest-value target. Long system prompts stay constant across all calls in a pipeline — cache them.

system=[
    {
        "type": "text",
        "text": long_system_prompt,  # 2000+ tokens of instructions
        "cache_control": {"type": "ephemeral"}
    }
]

2. Reference Documents

If every call in a session queries the same document (a contract, a codebase, a spec), cache the document in the messages array:

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Reference document:\n\n{large_document}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": user_question  # varies per call
            }
        ]
    }
]

3. Few-Shot Examples

Few-shot prompts can be expensive: 20 examples at 100 tokens each add 2,000 tokens to every call. Cache them:

system=[
    {
        "type": "text",
        "text": persona_instructions,
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": few_shot_examples,  # 2000 tokens of examples
        "cache_control": {"type": "ephemeral"}
    }
]

You can mark multiple blocks for caching. Each marker acts as a separate cache breakpoint, and a request can carry up to four of them.
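When several blocks carry cache_control, it can help to count the markers before sending. The sketch below assumes the limit of four breakpoints per request described in Anthropic's documentation (verify the current number), and it only inspects system and messages, not tool definitions. count_breakpoints is a hypothetical helper, not part of the SDK.

```python
MAX_BREAKPOINTS = 4  # assumption: per-request limit from Anthropic's docs; verify


def count_breakpoints(system: list[dict], messages: list[dict]) -> int:
    """Count content blocks that carry a cache_control marker."""
    count = sum(1 for block in system if "cache_control" in block)
    for message in messages:
        content = message["content"]
        if isinstance(content, list):  # plain-string content cannot carry a marker
            count += sum(1 for block in content if "cache_control" in block)
    return count


system = [
    {"type": "text", "text": "persona", "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "examples", "cache_control": {"type": "ephemeral"}},
]
messages = [{"role": "user", "content": "What does clause 4 say?"}]
print(count_breakpoints(system, messages) <= MAX_BREAKPOINTS)
```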

TypeScript Example

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a senior code reviewer specializing in security.
[... 1500 tokens of detailed instructions ...]`;

async function reviewCode(code: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [
      {
        role: "user",
        content: `Review this code for security issues:\n\n${code}`
      }
    ]
  });

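  // content is a union of block types, so narrow to a text block before reading .text
  const first = response.content[0];
  return first && first.type === "text" ? first.text : "";
}

// Call this 100 times per hour: only calls that find the cache cold (the first
// call, or the first after a gap of more than 5 minutes) pay the cache write
// price. All others pay 90% less on the cached portion.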

How to Verify the Cache Is Working

The API response includes usage data that shows whether your cache hit or missed. Check usage.cache_read_input_tokens and usage.cache_creation_input_tokens:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{"role": "user", "content": question}]
)

usage = response.usage
print(f"Input tokens:        {usage.input_tokens}")
print(f"Cache write tokens:  {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:   {usage.cache_read_input_tokens}")
print(f"Output tokens:       {usage.output_tokens}")

# First call output:
# Input tokens:        150
# Cache write tokens:  2000
# Cache read tokens:   0

# Second call output (within 5 min):
# Input tokens:        150
# Cache write tokens:  0
# Cache read tokens:   2000  <-- cache hit!
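For logging, the two usage fields can be folded into a single label. This is a minimal sketch: cache_status is a hypothetical helper, and the SimpleNamespace objects stand in for real response.usage values so the example runs without an API call.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Return 'hit', 'write', or 'none' for a Messages API usage object."""
    # Fall back to 0 if a field is missing or None.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"  # a request can also write new tokens; reads take priority here
    if written > 0:
        return "write"
    return "none"


first_call = SimpleNamespace(cache_creation_input_tokens=2000, cache_read_input_tokens=0)
second_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=2000)
print(cache_status(first_call), cache_status(second_call))
```

A result of "none" on a call where you expected caching is the signal to check block size and prompt ordering.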

Caching in Multi-turn Conversations

For chatbots and multi-turn sessions, the conversation history grows with each turn. You can cache the growing conversation to avoid re-processing early messages:

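import copy

def chat(messages: list, system: str) -> str:
    # Work on a copy so cache_control markers do not pile up on the caller's
    # history across turns.
    messages = copy.deepcopy(messages)

    # Cache everything except the last user message. cache_control belongs on
    # a content block, so convert string content to a text block first, then
    # mark the last block of the second-to-last message.
    if len(messages) > 2:
        prior = messages[-2]
        if isinstance(prior["content"], str):
            prior["content"] = [{"type": "text", "text": prior["content"]}]
        prior["content"][-1]["cache_control"] = {"type": "ephemeral"}

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}],
        messages=messages
    )
    return response.content[0].text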

This pattern is especially valuable for long support conversations or coding sessions where the history carries important context.

Minimum Cache Size

Claude only caches blocks that meet a minimum token threshold. As of current API specs:

claude-sonnet-4-6 / claude-opus-4-6: minimum 1,024 tokens to cache
claude-haiku-4-5: minimum 2,048 tokens to cache

Short system prompts below the minimum will not be cached even if you pass cache_control. The API silently ignores the cache directive — no error is returned, and the usage response will show zero cache tokens. Check your usage.cache_creation_input_tokens to confirm caching is active.
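A rough pre-flight check can flag prompts that are probably too short to cache. The sketch below uses the thresholds listed above and a crude heuristic of about four characters per token, which is an assumption and not a real tokenizer; treat the result as a hint and confirm with the usage fields in an actual response. likely_cacheable is a hypothetical helper name.

```python
# Thresholds copied from the list above; verify against current documentation.
MIN_CACHEABLE_TOKENS = {
    "claude-sonnet-4-6": 1024,
    "claude-opus-4-6": 1024,
    "claude-haiku-4-5": 2048,
}


def likely_cacheable(text: str, model: str, chars_per_token: float = 4.0) -> bool:
    """Estimate whether `text` meets the model's minimum cacheable size."""
    estimated_tokens = len(text) / chars_per_token  # heuristic estimate only
    return estimated_tokens >= MIN_CACHEABLE_TOKENS[model]


prompt = "x" * 5000  # stands in for a system prompt of roughly 1,250 tokens
print(likely_cacheable(prompt, "claude-sonnet-4-6"))
print(likely_cacheable(prompt, "claude-haiku-4-5"))
```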

Best Practices

Put Static Content First

Caching works by prefix: a cache_control marker caches everything in the request up to and including the marked block, and any change earlier in the request invalidates the cache from that point on. Put your static system prompt and reference documents before any dynamic content. Dynamic content (user question, runtime variables) goes last.
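One way to enforce this ordering is to assemble every request through a single helper. The sketch below is illustrative (build_request is a hypothetical name): it places all static text in the system array, marks only the last static block so that one breakpoint covers the whole static prefix, and keeps the user question in the final, unmarked position.

```python
def build_request(static_blocks: list[str], question: str,
                  model: str = "claude-sonnet-4-6") -> dict:
    """Assemble keyword arguments for client.messages.create, static text first."""
    system = [{"type": "text", "text": text} for text in static_blocks]
    if system:
        # One marker on the last static block covers everything before it.
        system[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": model,
        "max_tokens": 1024,
        "system": system,
        "messages": [{"role": "user", "content": question}],  # dynamic, goes last
    }


request = build_request(["Instructions...", "Reference document..."], "What changed in v2?")
print([("cache_control" in block) for block in request["system"]])
```

The resulting dictionary can be passed as client.messages.create(**request).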

Warm the Cache Intentionally

The first call in a batch pays cache write price. For latency-sensitive pipelines, send a warmup request at the start of each 5-minute window to ensure subsequent calls hit the cache immediately.
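A warmup request can be as small as a one-token completion over the static prefix. The sketch below assumes a client created with anthropic.Anthropic() and system blocks that already carry cache_control; warm_cache is a hypothetical helper, and the "ping" message is an arbitrary placeholder.

```python
def warm_cache(client, model: str, system_blocks: list[dict]) -> int:
    """Send a minimal request so the static prefix is written to the cache.

    Returns the number of tokens written; 0 means nothing new was cached
    (for example, the prefix was already cached or is below the minimum size).
    """
    response = client.messages.create(
        model=model,
        max_tokens=1,  # only the input side matters for warming
        system=system_blocks,
        messages=[{"role": "user", "content": "ping"}],
    )
    return response.usage.cache_creation_input_tokens or 0
```

Calling it once before a batch starts means the real requests in that batch should see cache reads, provided they send exactly the same system blocks.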

Combine with Batch API for Maximum Savings

Claude also offers a Batch API that processes requests asynchronously at 50% off standard prices. Combining batch processing with prompt caching stacks the discounts — ideal for nightly document processing pipelines that do not require real-time responses.

Monitor Cache Hit Rate

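Log cache_read_input_tokens per call. A low hit rate usually means your cached blocks are expiring before subsequent calls arrive, or that something inside the cached prefix changes between calls. Increase call frequency, restructure your pipeline to batch calls closer together, or move the changing content after the cached blocks.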
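A hit rate can be derived from logged usage records. This sketch assumes each record is a plain dictionary holding the two cache fields from the API response, and it defines hit rate as the share of cacheable tokens served from cache, which is one reasonable definition among several.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of cacheable input tokens that were read from cache."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = read + written
    return read / total if total else 0.0


log = [
    {"cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
]
print(f"{cache_hit_rate(log):.0%}")
```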

When Not to Use Prompt Caching

Highly dynamic prompts: if every call has a different system prompt, there is nothing to cache. Caching only saves money on static or semi-static content.
Very short prompts: prompts under the minimum token threshold will not cache regardless. Short system prompts are cheap anyway.
One-off calls: if you make a single API call, you pay the cache write cost and never recoup it. Caching only pays off when the same block is reused.
Extended thinking enabled: prompt caching interacts with extended thinking; test carefully and verify cache hits in usage data before optimizing in production.

Cost Estimation Worksheet

Use this to estimate your savings before implementing:

# Inputs
static_tokens = 5000       # tokens in your cached block
calls_per_hour = 100       # how many API calls per hour
base_input_price = 3.0     # $/M tokens (sonnet-4-6 example)

# Without caching
cost_per_call_no_cache = (static_tokens / 1_000_000) * base_input_price
hourly_cost_no_cache = cost_per_call_no_cache * calls_per_hour

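# With caching (conservative: assume the cache expires and is rewritten once
# per 5-minute window, so up to 12 writes/hour; every other call is a read)
cache_write_price = base_input_price * 1.25
cache_read_price = base_input_price * 0.10
writes_per_hour = min(12, calls_per_hour)
reads_per_hour = calls_per_hour - writes_per_hour
cost_per_hour_cache = (
    (static_tokens / 1_000_000) * cache_write_price * writes_per_hour +
    (static_tokens / 1_000_000) * cache_read_price * reads_per_hour
)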

savings_pct = (1 - cost_per_hour_cache / hourly_cost_no_cache) * 100
print(f"Without caching: ${hourly_cost_no_cache:.4f}/hr")
print(f"With caching:    ${cost_per_hour_cache:.4f}/hr")
print(f"Savings:         {savings_pct:.1f}%")

The Bottom Line

Prompt caching is one of the highest-leverage API optimizations available. If your system has any static content in its prompts — and most do — implement caching before you scale. The cost difference between cached and uncached workloads compounds rapidly at volume.

Next steps: review your current system prompts with our system prompt guide, then add cache_control to any block that stays constant across calls and meets your model's minimum size (1,024 tokens for Sonnet and Opus, 2,048 for Haiku). Check your usage data after the first day to confirm cache hits and calculate actual savings.