Claude API Prompt Caching: When It Saves Money and When It Doesn't

March 23, 2026 · 8 min read · claude-api, prompt-caching, llm-cost-optimization

I have an agent that reads a 15,000 token knowledge base on every turn. Multi-turn conversation, roughly 40 calls per user session. Without caching, every turn repays the full input token cost for a context window that never changes. With Claude API prompt caching, the knowledge base gets written once at 1.25x input price, then every subsequent read costs 0.1x. After the second call, it is already cheaper than paying full input rate. That math is the whole reason the feature exists.

Use prompt caching when you have a large, stable prefix that repeats across calls: RAG pipelines, multi-turn agents with fixed system prompts, tool-heavy workflows where the tool definitions alone are thousands of tokens. Skip it for single-shot prompts, cold calls spaced more than five minutes apart, or prompts under 1,024 tokens on Sonnet (2,048 on Haiku). The break-even is about two calls (one write plus one hit) within the TTL window. Below that, the 25% write premium eats your savings.

How Claude implements caching

Anthropic exposes caching through a cache_control field on individual content blocks. You mark a block as cacheable with {"type": "ephemeral"} and the API hashes that block plus everything that precedes it in the prompt. On the next call, if the hash matches an entry still within the 5-minute TTL, you pay cache read rates instead of full input rates.

Two details matter. First, the cache key is the full prefix up to and including the marked block, not just the block itself. Change a single character earlier in the system prompt and the cache misses. Second, the 5-minute window is a sliding TTL: each hit refreshes it. A busy agent holding a cached knowledge base in flight keeps that cache warm indefinitely, but a quiet agent loses it between user sessions.

Cache breakpoints can be placed on up to four blocks per request. The common pattern is one breakpoint on the system prompt and one on the tool definitions, which lets the tools cache independently if your system prompt changes but tools stay fixed.
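As a sketch of that two-breakpoint pattern (the tool name and schema here are hypothetical), the request body looks like this; it would be passed straight to client.messages.create. Tools come first in the prompt order, so the tools cache survives a system prompt change:

```typescript
const params = {
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: [
    // ... other tool definitions ...
    {
      name: "lookup_order", // hypothetical tool
      description: "Look up an order by ID.",
      input_schema: { type: "object", properties: { id: { type: "string" } } },
      cache_control: { type: "ephemeral" }, // breakpoint 1: caches the whole tools array
    },
  ],
  system: [
    {
      type: "text",
      text: "...long system prompt...",
      cache_control: { type: "ephemeral" }, // breakpoint 2: caches tools + system
    },
  ],
  messages: [{ role: "user", content: "Where is order 8812?" }],
};
```

The breakpoint on the last tool is what caches the entire tools array, since the cache key covers everything up to and including the marked block.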

Which blocks are cacheable

Three block types accept cache_control:

  • System prompt blocks. The most common target. Long instructions, persona definitions, style guides.
  • Tool definitions: the tools array. If you have 20 tools with detailed JSON schemas, this alone can be 3,000 to 5,000 tokens.
  • Message content blocks, both user and assistant. Useful for RAG patterns where you inject retrieved documents into the user turn.

The caveat on message-level caching: it only helps if the same document block appears in subsequent requests in the same position. For most RAG pipelines, retrieved chunks vary per query, so message caching rarely pays. System and tool caching are where the real savings live.

Minimum token counts before a block becomes cache-eligible:

  • Claude Sonnet 4.6: 1,024 tokens
  • Claude Haiku 4.5: 2,048 tokens
  • Claude Opus 4.7: 1,024 tokens

Blocks under the minimum are silently ignored (the cache_control field is accepted, no error, but nothing is cached). Always verify with a usage check, covered below.
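A rough pre-flight check can catch obviously-too-small blocks before they ship. The 4-characters-per-token estimate below is a crude heuristic, not the real tokenizer; use the API's token counting endpoint when you need exact numbers:

```typescript
// Minimums from the list above, keyed by model ID.
const MIN_CACHEABLE_TOKENS: Record<string, number> = {
  "claude-sonnet-4-6": 1024,
  "claude-haiku-4-5": 2048,
  "claude-opus-4-7": 1024,
};

function likelyCacheable(model: string, blockText: string): boolean {
  // ~4 chars per token is an approximation, good enough for a sanity check.
  const approxTokens = Math.ceil(blockText.length / 4);
  return approxTokens >= (MIN_CACHEABLE_TOKENS[model] ?? 1024);
}
```

A 5,000-character block (~1,250 tokens) passes on Sonnet but fails on Haiku, which is exactly the kind of silent no-op this catches.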

The price math

Two rates matter:

  • Cache write: 1.25x the base input price (a 25% premium on the first call that creates the entry)
  • Cache read: 0.1x the base input price (a 90% discount on every hit)

Base Sonnet 4.6 input runs $3 per million tokens at time of writing. So cached reads come out to $0.30 per million. Writes cost $3.75 per million.

Break-even is simple. If your cached prefix is N tokens:

  • First call cost: N * 1.25 (write)
  • Each subsequent hit: N * 0.1 (read)
  • Without caching, each call: N * 1.0

After one hit, you have paid 1.25N + 0.1N = 1.35N for two calls versus 2.0N without caching. Already ahead. After ten hits: 1.25N + 10 × 0.1N = 2.25N for eleven calls versus 11.0N. The bigger the prefix and the more hits, the more the discount compounds.
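That arithmetic is small enough to encode directly. A throwaway helper, with costs in units of "base input price per token" so it works at any rate:

```typescript
// One write at 1.25x, then (calls - 1) reads at 0.1x, for an n-token prefix.
function cachedCost(n: number, calls: number): number {
  return n * 1.25 + n * 0.1 * (calls - 1);
}

// Every call pays full input rate on the prefix.
function uncachedCost(n: number, calls: number): number {
  return n * calls;
}
```

cachedCost(n, 2) gives the 1.35N figure above, and the gap versus uncachedCost widens with every additional call.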

TypeScript example

Here is the pattern I use: large system prompt cached, per-request user message uncached.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You are a customer support agent for a SaaS product.

[... 2,500 tokens of product documentation, tone guide,
policies, escalation rules, FAQ content ...]
`;

async function answerQuestion(userMessage: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        content: userMessage,
      },
    ],
  });

  const usage = response.usage;
  console.log({
    cache_creation_input_tokens: usage.cache_creation_input_tokens,
    cache_read_input_tokens: usage.cache_read_input_tokens,
    input_tokens: usage.input_tokens,
    output_tokens: usage.output_tokens,
  });

  return response.content;
}

// First call: cache_creation_input_tokens ~= 2500, cache_read_input_tokens = 0
await answerQuestion("How do I reset my password?");

// Second call within 5 minutes: cache_creation = 0, cache_read_input_tokens ~= 2500
await answerQuestion("Can I export my data as CSV?");

Note the system field takes an array of content blocks, not a plain string, once you want to apply cache_control. The API also accepts a string for the non-cached case, which is why most starter snippets use that shorthand.

For prompt caching inside a tool-use workflow, the same pattern applies to the Claude API tool use flow: cache your tool definitions once, reuse across turns.

Verifying cache hits

The usage object on every response tells you exactly what happened:

  • input_tokens: uncached input tokens charged at standard rate
  • cache_creation_input_tokens: tokens written to cache this call (charged at 1.25x)
  • cache_read_input_tokens: tokens read from cache this call (charged at 0.1x)
  • output_tokens: standard output rate

If cache_read_input_tokens is zero on what you expected to be a cached call, something invalidated the cache. Common causes: the 5-minute TTL expired, a whitespace or character change in the cached prefix, or the prefix is under the minimum token count.

A quick sanity check: log the usage object on every production call and alert if cache hit ratio drops below an expected threshold. I run this check in the same structured output validation pattern I use for Claude API structured output, since both share the same “trust but verify” principle around Anthropic responses.
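A minimal version of that monitor, assuming only the usage field names shown above (the threshold check is a sketch; wire it into whatever alerting you already run):

```typescript
// Shape of the usage fields this relies on.
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

class CacheHitMonitor {
  private hits = 0;
  private calls = 0;

  record(usage: CacheUsage): void {
    this.calls += 1;
    // Any nonzero cache read counts as a hit.
    if (usage.cache_read_input_tokens > 0) this.hits += 1;
  }

  ratio(): number {
    return this.calls === 0 ? 0 : this.hits / this.calls;
  }

  isBelow(threshold: number): boolean {
    return this.calls > 0 && this.ratio() < threshold;
  }
}
```

For a healthy 40-turn agent session, the expected ratio is 39/40; a sustained drop well below that means the prefix is churning or the TTL keeps expiring.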

Real-world scenario: 15k token knowledge base agent

Back to the opening scenario. Suppose an agent answers questions against a 15,000 token knowledge base loaded into the system prompt. Each user turn is roughly 200 input tokens plus 500 output tokens. Sonnet 4.6 rates: $3 input, $15 output per million.

Without caching, per call:

  • Input: 15,200 tokens * $3/M = $0.0456
  • Output: 500 * $15/M = $0.0075
  • Total: $0.0531

With caching, first call (cache write):

  • Cache write: 15,000 * $3.75/M = $0.0563
  • Uncached input: 200 * $3/M = $0.0006
  • Output: $0.0075
  • Total: $0.0644

With caching, subsequent calls (cache hit):

  • Cache read: 15,000 * $0.30/M = $0.0045
  • Uncached input: 200 * $3/M = $0.0006
  • Output: $0.0075
  • Total: $0.0126

Compare total cost at different session lengths (calls per session, assuming all land inside the 5-minute window):

Calls | No caching | With caching | Savings
------|------------|--------------|--------
1     | $0.053     | $0.064       | worse
2     | $0.106     | $0.077       | 27%
5     | $0.266     | $0.115       | 57%
10    | $0.531     | $0.178       | 67%
40    | $2.124     | $0.556       | 74%

One-shot calls lose money. By the second call you are already ahead, and anything above five calls is a clear win. For a 40-turn agent session, caching cuts the bill by roughly three quarters. That is the shape of the curve, and it holds for any prefix size once you clear the minimum.
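The table is easy to reproduce. A sketch with the scenario's numbers hard-coded (Sonnet rates as above):

```typescript
const INPUT_RATE = 3 / 1_000_000;   // $ per input token
const OUTPUT_RATE = 15 / 1_000_000; // $ per output token
const PREFIX = 15_000;              // cached knowledge base tokens
const USER = 200;                   // uncached input per turn
const OUT = 500;                    // output tokens per turn

function sessionCost(calls: number, cached: boolean): number {
  const outCost = OUT * OUTPUT_RATE;
  if (!cached) {
    return calls * ((PREFIX + USER) * INPUT_RATE + outCost);
  }
  const writeCall = PREFIX * INPUT_RATE * 1.25 + USER * INPUT_RATE + outCost; // first call
  const hitCall = PREFIX * INPUT_RATE * 0.1 + USER * INPUT_RATE + outCost;    // each later call
  return writeCall + (calls - 1) * hitCall;
}
```

Swap in your own prefix size and per-turn numbers to get the curve for your workload before turning caching on.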

Gotchas that burn practitioners

Whitespace sensitivity. The cache key hashes the exact byte content. A trailing newline added by your template engine, a space after a variable interpolation, a different indent level in a multi-line prompt: all of these miss the cache. Build your system prompt the same way every call, ideally by loading it from a single canonical source.
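A debugging helper I find useful here (my own, not part of the SDK): fingerprint the exact bytes of the cached prefix on every call, so an unexpected miss can be traced to an accidental byte change rather than a TTL expiry:

```typescript
import { createHash } from "node:crypto";

let lastFingerprint: string | null = null;

// Returns true when the prefix bytes differ from the previous call's.
function prefixChanged(prefix: string): boolean {
  const hash = createHash("sha256").update(prefix, "utf8").digest("hex");
  const changed = lastFingerprint !== null && lastFingerprint !== hash;
  lastFingerprint = hash;
  return changed;
}
```

A single trailing newline flips the fingerprint, which is exactly the class of invisible change that burns people.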

Tool definitions invalidate everything downstream. If you cache both tools and system prompt, and you modify a single tool description, the tools block hash changes, and every block after it in the prefix order (including system prompt if it comes later) also misses. Order your cacheable blocks from most stable to least stable. Tools first if they rarely change, system prompt next, then anything more dynamic.

Cache within a single message vs across messages. Caching happens across API calls, not within a single call. You cannot cache a block and reuse it inside the same request. Each request is scored against the cache independently.

Model version changes. Switching from claude-sonnet-4-6 to claude-opus-4-7 is a cache miss. The cache is keyed per model. This matters when you run experiments or do A/B routing between models.

The 5-minute TTL is short. For background jobs that run every ten minutes, caching buys you nothing. For a chatty agent, it keeps compounding. If you need longer retention, Anthropic has shipped extended TTL options for some tiers, but the default is five minutes and you should plan around that.

Prefill and caching interaction. The assistant-prefill trick I use for structured output works fine with caching as long as the prefill content itself is stable. If the prefill varies per request, don’t cache it.

For agent patterns where caching really earns its keep, the Claude Code SDK agent patterns writeup covers multi-turn workflows end to end, including how repeated tool definitions dominate the token budget.

When prompt caching is worth the complexity

Turn on prompt caching when:

  • You have a stable prefix over 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku)
  • You expect two or more calls against that prefix within five minutes
  • You are running multi-turn conversations, RAG pipelines, or tool-heavy agents
  • Your prefix-to-variable ratio is high (big system prompt, short user messages)

Skip it when:

  • Calls are single-shot or spaced hours apart
  • The prefix changes every call (dynamic RAG with unique documents per query)
  • Your prompt is under the minimum token threshold
  • You are prototyping and the instrumentation overhead isn’t worth the pennies

The decision is not about whether caching is “better.” It is about whether your workload has the shape that lets the 10x read discount outweigh the 1.25x write premium. Do the math on your actual call pattern, log cache_read_input_tokens on every response, and set an alert when the hit ratio drops. That is how you know caching is actually doing what you think it is.
