← Back to Blog

Claude Prompt Caching: Cut API Costs by 90%

Claude prompt caching slashes API bills by up to 90% on repeated context. Learn how cache_control works, what to cache, and real code examples.


If you are making repeated Claude API calls with the same system prompt, document, or few-shot examples, you are paying full input token prices every single time. Prompt caching changes that. Mark a block of content as cacheable and Claude stores it server-side — subsequent calls that reuse it pay a fraction of the original price.

For workloads with long static context, the savings are significant: up to 90% on the cached portion. This guide covers exactly how to use it.

What Is Claude Prompt Caching?

Prompt caching is a feature in the Claude Messages API that lets you mark portions of your input for server-side reuse. When Claude processes a request with a cache-eligible block, it stores that block. If the next request contains the same cached block, Claude skips re-processing it and charges you at the cache read rate instead of the full input rate.

The result: dramatically lower token costs on anything that stays constant across calls — system prompts, reference documents, long instructions, or few-shot examples.

How the Pricing Works

Claude prompt caching uses three price tiers:

Token TypeWhat It IsCost
Cache writeFirst time you send a cacheable block25% higher than base input price
Cache readSubsequent calls that hit the cache90% lower than base input price
Regular inputNon-cached input tokensStandard rate

Cache writes cost slightly more upfront — you are paying Claude to store the context. Cache reads are dramatically cheaper. The math works in your favor as soon as you make more than two calls with the same cached block.

Example: a 10,000-token system prompt at $3/M input tokens costs $0.03 per call without caching. With caching, the first call costs $0.0375 (write cost), then every subsequent call costs $0.003 (read cost). By call four you have already saved money. By call 100 you have paid $0.33 instead of $3.00.

The Cache TTL

Cached blocks persist for 5 minutes from the last use. Each cache hit resets the 5-minute timer. For pipelines that process one document at a time, this is usually sufficient. For workflows with long pauses between calls, you may need to warm the cache at the start of each batch.

How to Use cache_control

Enable caching by passing the system parameter as an array of content blocks instead of a plain string, and adding cache_control to the block you want cached:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a contract analysis assistant. Extract key terms, obligations, and risk factors. Always respond in structured JSON.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": contract_text}
    ]
)

The only cache type currently supported is "ephemeral" — 5-minute TTL. Anthropic may introduce longer-lived cache types in the future.

What You Can Cache

Any content block in your API request can be marked for caching. The most common use cases:

1. System Prompts

The highest-value target. Long system prompts stay constant across all calls in a pipeline — cache them.

system=[
    {
        "type": "text",
        "text": long_system_prompt,  # 2000+ tokens of instructions
        "cache_control": {"type": "ephemeral"}
    }
]

2. Reference Documents

If every call in a session queries the same document (a contract, a codebase, a spec), cache the document in the messages array:

messages=[
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Reference document:\n\n{large_document}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": user_question  # varies per call
            }
        ]
    }
]

3. Few-Shot Examples

Few-shot prompts can be expensive — 20 examples at 100 tokens each adds 2,000 tokens to every call. Cache them:

system=[
    {
        "type": "text",
        "text": persona_instructions,
        "cache_control": {"type": "ephemeral"}
    },
    {
        "type": "text",
        "text": few_shot_examples,  # 2000 tokens of examples
        "cache_control": {"type": "ephemeral"}
    }
]

You can mark multiple blocks for caching. Each gets its own cache entry.

TypeScript Example

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a senior code reviewer specializing in security.
[... 1500 tokens of detailed instructions ...]`;

async function reviewCode(code: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [
      {
        role: "user",
        content: `Review this code for security issues:\n\n${code}`
      }
    ]
  });

  return response.content[0].text;
}

// Call this 100 times per hour — only the first call per 5-minute window
// pays cache write price. All others pay 90% less.

How to Verify the Cache Is Working

The API response includes usage data that shows whether your cache hit or missed. Check usage.cache_read_input_tokens and usage.cache_creation_input_tokens:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
    ],
    messages=[{"role": "user", "content": question}]
)

usage = response.usage
print(f"Input tokens:        {usage.input_tokens}")
print(f"Cache write tokens:  {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:   {usage.cache_read_input_tokens}")
print(f"Output tokens:       {usage.output_tokens}")

# First call output:
# Input tokens:        150
# Cache write tokens:  2000
# Cache read tokens:   0

# Second call output (within 5 min):
# Input tokens:        150
# Cache write tokens:  0
# Cache read tokens:   2000  <-- cache hit!

Caching in Multi-turn Conversations

For chatbots and multi-turn sessions, the conversation history grows with each turn. You can cache the growing conversation to avoid re-processing early messages:

def chat(messages: list, system: str) -> str:
    # Cache everything except the last user message
    # by marking the second-to-last message as cacheable
    if len(messages) > 2:
        messages[-2]["cache_control"] = {"type": "ephemeral"}

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[{"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}],
        messages=messages
    )
    return response.content[0].text

This pattern is especially valuable for long support conversations or coding sessions where the history carries important context.

Minimum Cache Size

Claude only caches blocks that meet a minimum token threshold. As of current API specs:

  • claude-sonnet-4-6 / claude-opus-4-6: minimum 1,024 tokens to cache
  • claude-haiku-4-5: minimum 2,048 tokens to cache

Short system prompts below the minimum will not be cached even if you pass cache_control. The API silently ignores the cache directive — no error is returned, and the usage response will show zero cache tokens. Check your usage.cache_creation_input_tokens to confirm caching is active.

Best Practices

Put Static Content First

Cache works by prefix — the cached block must appear at the beginning of the content it represents. Put your static system prompt and reference documents before any dynamic content. Dynamic content (user question, runtime variables) goes last.

Warm the Cache Intentionally

The first call in a batch pays cache write price. For latency-sensitive pipelines, send a warmup request at the start of each 5-minute window to ensure subsequent calls hit the cache immediately.

Combine with Batch API for Maximum Savings

Claude also offers a Batch API that processes requests asynchronously at 50% off standard prices. Combining batch processing with prompt caching stacks the discounts — ideal for nightly document processing pipelines that do not require real-time responses.

Monitor Cache Hit Rate

Log cache_read_input_tokens per call. A low hit rate means your cached blocks are expiring before subsequent calls arrive — either increase call frequency or restructure your pipeline to batch calls closer together.

When Not to Use Prompt Caching

  • Highly dynamic prompts — if every call has a different system prompt, there is nothing to cache. Caching only saves money on static or semi-static content.
  • Very short prompts — prompts under the minimum token threshold will not cache regardless. Short system prompts are cheap anyway.
  • One-off calls — if you make a single API call, you pay cache write cost and never recoup it. Only enables savings at scale.
  • Extended thinking enabled — prompt caching interacts with extended thinking; test carefully and verify cache hits in usage data before optimizing in production.

Cost Estimation Worksheet

Use this to estimate your savings before implementing:

# Inputs
static_tokens = 5000       # tokens in your cached block
calls_per_hour = 100       # how many API calls per hour
base_input_price = 3.0     # $/M tokens (sonnet-4-6 example)

# Without caching
cost_per_call_no_cache = (static_tokens / 1_000_000) * base_input_price
hourly_cost_no_cache = cost_per_call_no_cache * calls_per_hour

# With caching (1 write + 99 reads per 5-min window)
cache_write_price = base_input_price * 1.25
cache_read_price = base_input_price * 0.10
cost_per_hour_cache = (
    (static_tokens / 1_000_000) * cache_write_price * 12 +   # 12 writes/hour
    (static_tokens / 1_000_000) * cache_read_price * 88      # 88 reads/hour
)

savings_pct = (1 - cost_per_hour_cache / hourly_cost_no_cache) * 100
print(f"Without caching: ${hourly_cost_no_cache:.4f}/hr")
print(f"With caching:    ${cost_per_hour_cache:.4f}/hr")
print(f"Savings:         {savings_pct:.1f}%")

The Bottom Line

Prompt caching is one of the highest-leverage API optimizations available. If your system has any static content in its prompts — and most do — implement caching before you scale. The cost difference between cached and uncached workloads compounds rapidly at volume.

Next steps: review your current system prompts with our system prompt guide, then add cache_control to any block over 1,024 tokens that stays constant across calls. Check your usage data after the first day to confirm cache hits and calculate actual savings.


Free: Claude custom instructions template pack

Eight copy-paste templates — developer, writer, analyst, CLAUDE.md starter, and more. Plus new guides in your inbox. No spam, unsubscribe anytime.

Or grab the templates directly — no email needed

Keep learning — for free

50+ AI courses. 590+ lessons. No paywall for starters.

Need help building this?

We build MCP servers, Claude workflows, and AI agents for teams. Strategy calls start at $150/hr.