Anthropic Prompt Caching — Best Practices (May 2026)

Date: 2026-05-31 Type: Research Status: Tier-S synthesis — official docs + 2026 community guides Sources: anthropic-prompt-caching-best-practices-2026-05-31.sources.json

TL;DR

Use prompt caching whenever you reuse ≥1k tokens of prefix (system prompt, tool definitions, retrieved documents). Cache reads cost 10% of base input; writes cost 125% (5 min TTL) or 200% (1 h TTL).
Structure prompts as stable prefix → volatile suffix: tools → system → messages. Cache breakpoints sit on the last block you want included in the cached prefix.
Two modes:
Automatic (single top-level cache_control) — best for multi-turn chat; breakpoint auto-advances as conversation grows.
Explicit (up to 4 cache_control markers on individual blocks) — fine-grained control for agents with layered static context (tools, system, docs, history).
Default to 5 min TTL. Switch to 1 h only when telemetry says ≥4 reads per write — break-even point.
Measure with usage.cache_read_input_tokens, cache_creation_input_tokens, input_tokens. Steady-state success = reads dominate, writes near zero.
The killers: dynamic content (timestamps, UUIDs, session IDs) in cached blocks, tool definition edits, tool_choice changes, image add/remove, model switches, non-deterministic JSON key ordering in tool_use blocks.

1. How it works (mental model)

The cache is a prefix hash, not semantic memory. The API hashes everything from the start of the request through the last cache_control block — tools → system → messages in that order. On the next request, it checks whether that exact hash exists in your workspace's cache. Hit → 10% read cost. Miss → write at 125%/200% of base input.

Two consequences:

The prefix must be byte-identical across requests.
The breakpoint defines what gets cached, not where caching begins. Caching always begins at the start of the request.

Caches live ~5 min or ~1 h (TTL choice). Every successful read refreshes the TTL, so hot caches stay alive indefinitely.

2. Pricing snapshot (per million tokens)

Model	Base input	5 m write	1 h write	Cache read
Claude Opus 4.8 / 4.7 / 4.6 / 4.5	$5	$6.25	$10	$0.50
Claude Opus 4.1 / 4 (dep.)	$15	$18.75	$30	$1.50
Claude Sonnet 4.6 / 4.5	$3	$3.75	$6	$0.30
Claude Haiku 4.5	$1	$1.25	$2	$0.10

Source: Anthropic prompt caching docs, May 2026.

Math the team needs to know: - 5 min TTL break-even = 3 reads per write (write surcharge 25%, savings 90% → 0.25/0.9 ≈ 0.28). - 1 h TTL break-even vs 5 min = ~4 additional reads per write (extra write surcharge 100% over base / 90% read savings).

3. Minimum cacheable prefix

Model	Minimum cacheable tokens
Claude Opus 4.7 / 4.6 / 4.5, Mythos Preview	4,096
Claude Opus 4.8, Opus 4.1, Sonnet 4.6 / 4.5	1,024
Claude Haiku 4.5	4,096
Claude Haiku 3.5 (Vertex only)	2,048

Below the threshold, cache_control is silently ignored — you pay the overhead with zero benefit. Always verify cache_creation_input_tokens > 0 on the first call.

4. What you actually cache

The hierarchy tools → system → messages is also the invalidation cascade: a change at level N invalidates N and everything after.

Change	Tools	System	Messages
Tool definitions (name / description / schema)	✘	✘	✘
Web-search tool toggle	✓	✘	✘
System prompt edit	✓	✘	✘
Messages content edit	✓	✓	✘
`tool_choice` change	✘	✘	✘
Adding / removing any image	✘	✘	✘
Extended-thinking budget change	✓	✓	✘ (alters thinking blocks)
Model swap (e.g. Sonnet → Opus)	✘	✘	✘

Reading from this table is the fastest way to spot why your hit rate dropped.

Special cases

Thinking blocks: cannot be marked with cache_control directly, but they ride along inside cached assistant turns and count as input tokens when read. Opus 4.5+ / Sonnet 4.6+ preserve them by default across tool-result turns.
Citations: cache the parent document block, not the sub-block.
Empty text blocks: not cacheable.

5. Best-practice patterns

5.1 Default placement — single explicit breakpoint on system

Smallest useful diff vs no caching:

{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "<long stable system prompt>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [{ "role": "user", "content": "..." }]
}

Gives ~85–90% cost reduction on the system prompt as soon as it crosses the model's minimum prefix.

5.2 Agent / Claude Code-style aggressive caching

Real Claude Code sessions hit ~96% read rate vs the naive ~90% because they place breakpoints on multiple layers, not just system:

Tools (3–8k tokens of MCP schemas).
System prompt + persona.
Loaded documents / CLAUDE.md / project context.
Conversation history up to the last assistant turn.

That's all 4 breakpoints. Use them.

{
  "tools": [ ... last tool: { ..., "cache_control": {"type":"ephemeral"} } ],
  "system": [ ..., { "type":"text", "text":"<persona>", "cache_control":{"type":"ephemeral"} } ],
  "messages": [
    { "role":"user", "content":[ { "type":"text", "text":"<doc context>", "cache_control":{"type":"ephemeral"} } ] },
    ... history ...,
    { "role":"assistant", "content":[ { "type":"text", "text":"<last assistant>", "cache_control":{"type":"ephemeral"} } ] },
    { "role":"user", "content":"<new turn>" }
  ]
}

5.3 Mixed-TTL pattern (long-running agents)

For agents with a static persona + tools but bursty user activity:

1 h TTL on the most static prefix (tools + persona).
5 min TTL on conversation-history breakpoints.

Hard rule from Anthropic: 1 h breakpoints must appear before 5 min breakpoints in the request. Reversing the order is an error.

5.4 Automatic caching for chat

If your only goal is "make multi-turn chat cheaper", drop a single top-level cache_control and forget it:

{ "cache_control": { "type": "ephemeral" }, "system": [...], "messages": [...] }

The system advances the breakpoint to the last cacheable block each turn. Treat this as the smoke test; graduate to explicit breakpoints when you need control over tools / docs / TTL mixing.

5.5 Pre-warming with `max_tokens: 0`

Fire a zero-output request to write the cache before users arrive:

client.messages.create(
    model="claude-opus-4-8",
    max_tokens=0,
    system=[{"type":"text","text":"<persona>","cache_control":{"type":"ephemeral"}}],
    messages=[{"role":"user","content":"warmup"}]
)

Charged as one cache write, zero output tokens. The breakpoint must be on the shared block (system / tools), not the placeholder message, or the warmup won't carry over.

For 5 min caches, re-warm every <5 min. For longer idle windows, use 1 h TTL instead of poll-warming.

6. Cache-killing anti-patterns

Dynamic system prompt — f"Current time is {now}. ..." makes every request a cache miss + a fresh write. Move the timestamp into the user message, after the breakpoint.
Trimming / summarising old turns mid-conversation — the prefix changes, cache dies. Append, don't rewrite.
Per-user variables in shared system prompts — push user identity into the message layer.
Non-deterministic JSON key order in tool_use content (Swift, Go default to map iteration order). Force a stable serializer.
Toggling tool_choice between auto and a specific tool — full cache wipe even if tools is identical.
Image add/remove anywhere in the prompt — full wipe.
Workspace mixing — as of Feb 5 2026 caches are workspace-isolated on Claude API / AWS / Foundry. Dev and prod workspaces don't share. (Bedrock + Vertex still org-level.)
Model switching — Opus and Sonnet do not share caches. Treat a model swap as starting over.
Breakpoint on a volatile block — writes happen at the breakpoint. If the block changes every request, you pay write-price every turn and never read.
20-block lookback overflow — long conversations push the last write out of the 20-block lookback window. Add a second breakpoint closer to the head once history grows.

7. Measuring it

resp = client.messages.create(...)
u = resp.usage
read = u.cache_read_input_tokens
write = u.cache_creation_input_tokens
miss = u.input_tokens                # tokens AFTER the last breakpoint
hit_rate = read / max(read + write + miss, 1)

What "healthy" looks like in steady state: - cache_read_input_tokens dominates. - cache_creation_input_tokens ≈ 0 (occasional refresh write is fine). - input_tokens ≈ the size of the last user turn.

Anti-signals: - cache_creation_input_tokens > 0 on every call → prefix not actually stable. - cache_read_input_tokens == 0 after first call → breakpoint placement wrong, or TTL expired, or workspace/model boundary crossed.

Use Anthropic's cache-diagnostics endpoint to diff consecutive requests and locate the divergence byte.

8. Decision flow

Prefix ≥ model min?           ── no ──► don't bother
        │ yes
Prefix reused ≥3 times per write? ── no ──► single-call work, skip cache
        │ yes
Idle window > 5 min?          ── no ──► 5 min TTL
        │ yes
≥4 reads per write expected?  ── no ──► 5 min + pre-warm loop
        │ yes
                              ── yes ─► 1 h TTL

How many static layers (tools, system, docs, history)?
  1 → automatic caching is enough
  2-4 → explicit breakpoints, one per layer

Counterpoints

OpenAI semantic / automatic caching is cheaper if your prefix is unstable. OpenAI hashes prefixes automatically with no developer intervention and no write surcharge — Anthropic's 25–100% write premium punishes apps that recompute prefixes often. If your workload is "lots of short, varied prompts to a small system", caching may net out negative on Anthropic.
Workspace isolation can backfire. Teams that built around org-level caching pre-Feb 2026 saw hit rates collapse when prod/staging split into separate workspaces. Audit your workspace topology before assuming caching still works post-migration.
1 h TTL is rarely worth it. Multiple 2026 community write-ups (mager.co, claudebuddy.art) flag teams paying 100% write premiums for caches that never see the 4-read break-even, especially on long-tail user sessions.
Cache invalidation is brittle. Any innocuous change — re-ordering tools, editing a tool description, toggling web search — silently wipes the entire prefix. Teams without alerting on cache_creation_input_tokens discover this only when the bill arrives.
Thinking blocks complicate cost math. Cached thinking blocks count as input tokens on read, so extended-thinking workloads can show "high hit rate, similar bill" because the cached payload is bigger, not because savings dropped.

Recommendation

Default stance for a production Anthropic SDK app:

Single explicit breakpoint on the system block, 5 min TTL — baseline.
Add tool-definitions breakpoint when tool schemas cross ~1k tokens.
Add a third breakpoint at the head of conversation history once turns > 5.
Switch the most-static layer (tools or tools+system) to 1 h TTL only after telemetry shows ≥4 reads per write at idle.
Wire a dashboard on cache_read / (cache_read + cache_creation + input_tokens) per route. Alert if hit rate drops > 10% week-over-week — that's how you catch a silent prefix-breaking change.
Treat workspace, model, tool_choice, and image presence as cache boundaries during code review.

Sources captured in anthropic-prompt-caching-best-practices-2026-05-31.sources.json.