
Anthropic Prompt Caching — Best Practices (May 2026)

Date: 2026-05-31
Type: Research
Status: Tier-S synthesis of official docs + 2026 community guides
Sources: anthropic-prompt-caching-best-practices-2026-05-31.sources.json


TL;DR


1. How it works (mental model)

The cache is a prefix hash, not semantic memory. The API hashes everything from the start of the request through the last cache_control block, in the order tools → system → messages. On the next request, it checks whether that exact hash exists in your workspace's cache. A hit is billed at 10% of base input; a miss triggers a write at 125% (5 min TTL) or 200% (1 h TTL) of base input.

Two consequences:

  1. The prefix must be byte-identical across requests.
  2. The breakpoint defines what gets cached, not where caching begins. Caching always begins at the start of the request.
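As a mental model only, the two consequences above can be sketched with a local hash. The function name `prefix_key` and the use of SHA-256 over canonical JSON are assumptions made for illustration; the real key is computed server-side and its construction is not described in this note.

```python
import hashlib
import json


def prefix_key(tools, system, messages):
    """Hash the prefix in the order tools -> system -> messages.

    Illustrative stand-in for the server-side cache key, not the real one.
    """
    canonical = json.dumps(
        [tools, system, messages],  # a list keeps the three layers in order
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


system = [{"type": "text", "text": "You are a support agent."}]
same_a = prefix_key([], system, [])
same_b = prefix_key([], system, [])

# A one-character edit anywhere in the prefix yields a different key.
edited = prefix_key([], [{"type": "text", "text": "You are a support agent!"}], [])
```

Here `same_a` and `same_b` match while `edited` does not, which is the sense in which the prefix has to be byte-identical.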

Caches live ~5 min or ~1 h (TTL choice). Every successful read refreshes the TTL, so hot caches stay alive indefinitely.


2. Pricing snapshot (per million tokens)

Model                               Base input   5 m write   1 h write   Cache read
Claude Opus 4.8 / 4.7 / 4.6 / 4.5   $5           $6.25       $10         $0.50
Claude Opus 4.1 / 4 (deprecated)    $15          $18.75      $30         $1.50
Claude Sonnet 4.6 / 4.5             $3           $3.75       $6          $0.30
Claude Haiku 4.5                    $1           $1.25       $2          $0.10

Source: Anthropic prompt caching docs, May 2026.

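Math the team needs to know:

  - 5 min TTL: the write surcharge is 25% of base and each read saves 90% of base, so break-even is 0.25/0.9 ≈ 0.28 reads per write. The first cache hit already recovers the surcharge; the "≥3 reads per write" figure used in the decision flow is best read as a conservative rule of thumb, not the arithmetic minimum.
  - 1 h TTL: the write surcharge is 100% of base, so break-even against no caching is 1.0/0.9 ≈ 1.1 reads per write. The "≥4 reads per write" threshold used for choosing 1 h over 5 min is likewise a safety margin on top of that figure.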
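The break-even ratio can be checked with a few lines of arithmetic. This is a minimal sketch that uses the Sonnet 4.6 / 4.5 row of the pricing table; the function names are made up for illustration, and it ignores output tokens and any uncached suffix.

```python
# Prices per million tokens, copied from the Sonnet 4.6 / 4.5 row above.
BASE, WRITE_5M, WRITE_1H, READ = 3.00, 3.75, 6.00, 0.30


def cost_uncached(reads: int) -> float:
    """First call plus `reads` follow-ups, all billed at base input price."""
    return BASE * (1 + reads)


def cost_cached(reads: int, write_price: float) -> float:
    """One cache write followed by `reads` cache hits."""
    return write_price + READ * reads


def break_even_reads(write_price: float) -> float:
    """Reads needed before the write surcharge is recovered."""
    return (write_price - BASE) / (BASE - READ)
```

With these prices, `break_even_reads(WRITE_5M)` reproduces the 0.25/0.9 ratio, so a single follow-up read is already cheaper with a 5 min cache than without one.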


3. Minimum cacheable prefix

Model                                         Minimum cacheable tokens
Claude Opus 4.7 / 4.6 / 4.5, Mythos Preview   4,096
Claude Opus 4.8, Opus 4.1, Sonnet 4.6 / 4.5   1,024
Claude Haiku 4.5                              4,096
Claude Haiku 3.5 (Vertex only)                2,048

Below the threshold, cache_control is silently ignored — you pay the overhead with zero benefit. Always verify cache_creation_input_tokens > 0 on the first call.
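A pre-flight check against the table above might look like the following sketch. The dictionary keys are illustrative labels and not official model IDs, and both helper names are hypothetical; the thresholds are copied from the table.

```python
# Minimum cacheable prefix sizes from the table above.
# Keys are illustrative labels, NOT official model identifiers.
MIN_CACHEABLE_TOKENS = {
    "opus-4.7": 4096, "opus-4.6": 4096, "opus-4.5": 4096, "mythos-preview": 4096,
    "opus-4.8": 1024, "opus-4.1": 1024, "sonnet-4.6": 1024, "sonnet-4.5": 1024,
    "haiku-4.5": 4096,
    "haiku-3.5": 2048,
}


def meets_minimum(model_label: str, prefix_tokens: int) -> bool:
    """True if the prefix is long enough for cache_control to take effect."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_label]


def first_call_wrote_cache(cache_creation_input_tokens) -> bool:
    """Check the usage field from the first response; None counts as zero."""
    return (cache_creation_input_tokens or 0) > 0
```

A 1,500-token prefix passes for the Sonnet labels but not for `haiku-4.5`, which is the case where cache_control would be silently ignored.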


4. What you actually cache

The hierarchy tools → system → messages is also the invalidation cascade: a change at level N invalidates N and everything after.

Change Tools System Messages
Tool definitions (name / description / schema)
Web-search tool toggle
System prompt edit
Messages content edit
tool_choice change
Adding / removing any image
Extended-thinking budget change ✘ (alters thinking blocks)
Model swap (e.g. Sonnet → Opus)

Reading from this table is the fastest way to spot why your hit rate dropped.
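The plain cascade rule (a change at level N invalidates N and everything after) can be written down directly. This sketch covers only that rule; it does not encode the per-row cells of the table, and the function name is hypothetical.

```python
# Prefix order used by the cache: earlier levels sit closer to the start.
LEVELS = ("tools", "system", "messages")


def invalidated_levels(changed_level: str) -> tuple:
    """Return the changed level and every level after it."""
    start = LEVELS.index(changed_level)  # raises ValueError for unknown levels
    return LEVELS[start:]
```

For example, `invalidated_levels("system")` returns the system and messages levels and leaves the tools level cached.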

Special cases


5. Best-practice patterns

5.1 Default placement — single explicit breakpoint on system

Smallest useful diff vs no caching:

{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "<long stable system prompt>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [{ "role": "user", "content": "..." }]
}

Gives ~85–90% cost reduction on the system prompt as soon as it crosses the model's minimum prefix.

5.2 Agent / Claude Code-style aggressive caching

Real Claude Code sessions hit ~96% read rate vs the naive ~90% because they place breakpoints on multiple layers, not just system:

  1. Tools (3–8k tokens of MCP schemas).
  2. System prompt + persona.
  3. Loaded documents / CLAUDE.md / project context.
  4. Conversation history up to the last assistant turn.

That's all 4 breakpoints. Use them.

{
  "tools": [ ... last tool: { ..., "cache_control": {"type":"ephemeral"} } ],
  "system": [ ..., { "type":"text", "text":"<persona>", "cache_control":{"type":"ephemeral"} } ],
  "messages": [
    { "role":"user", "content":[ { "type":"text", "text":"<doc context>", "cache_control":{"type":"ephemeral"} } ] },
    ... history ...,
    { "role":"assistant", "content":[ { "type":"text", "text":"<last assistant>", "cache_control":{"type":"ephemeral"} } ] },
    { "role":"user", "content":"<new turn>" }
  ]
}
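One way to assemble that layout in code is sketched below. `build_request` and its parameters are hypothetical names; the sketch assumes every history message carries its content as a list of blocks and that the history ends on an assistant turn. A real call would also need fields such as max_tokens, which the layout above omits.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def build_request(model, tools, system_blocks, doc_context, history, new_turn):
    """Return a request body with up to four cache breakpoints.

    Inputs are deep-copied so the caller's objects are never mutated.
    """
    tools = copy.deepcopy(tools)
    system_blocks = copy.deepcopy(system_blocks)
    history = copy.deepcopy(history)

    if tools:
        tools[-1]["cache_control"] = dict(EPHEMERAL)          # breakpoint 1
    if system_blocks:
        system_blocks[-1]["cache_control"] = dict(EPHEMERAL)  # breakpoint 2

    doc_message = {
        "role": "user",
        "content": [
            {"type": "text", "text": doc_context,
             "cache_control": dict(EPHEMERAL)}                # breakpoint 3
        ],
    }
    if history:
        # Mark the last block of the last (assistant) turn.
        history[-1]["content"][-1]["cache_control"] = dict(EPHEMERAL)  # breakpoint 4

    return {
        "model": model,
        "tools": tools,
        "system": system_blocks,
        "messages": [doc_message, *history, {"role": "user", "content": new_turn}],
    }
```

The new user turn is deliberately left without a breakpoint, so the only uncached input is the text that actually changes per request.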

5.3 Mixed-TTL pattern (long-running agents)

For agents with a static persona + tools but bursty user activity:

Hard rule from Anthropic: 1 h breakpoints must appear before 5 min breakpoints in the request. Reversing the order is an error.
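A mixed-TTL request and a local check of the ordering rule could look like the sketch below. It assumes the `ttl` field takes the values "1h" and "5m", that an omitted `ttl` means 5 min, and that `system` is given as a list of blocks; the helper names and the `lookup` tool are hypothetical.

```python
ONE_HOUR = {"type": "ephemeral", "ttl": "1h"}
FIVE_MIN = {"type": "ephemeral", "ttl": "5m"}


def iter_cache_controls(request):
    """Yield cache_control dicts in prefix order: tools, system, messages."""
    for tool in request.get("tools", []):
        if "cache_control" in tool:
            yield tool["cache_control"]
    for block in request.get("system", []):
        if "cache_control" in block:
            yield block["cache_control"]
    for message in request.get("messages", []):
        content = message["content"]
        if isinstance(content, list):  # string content cannot carry a breakpoint
            for block in content:
                if "cache_control" in block:
                    yield block["cache_control"]


def ttl_order_is_valid(request) -> bool:
    """True if no 1 h breakpoint appears after a 5 min breakpoint."""
    seen_five_min = False
    for control in iter_cache_controls(request):
        ttl = control.get("ttl", "5m")  # assumption: omitted ttl means 5 min
        if ttl == "5m":
            seen_five_min = True
        elif ttl == "1h" and seen_five_min:
            return False
    return True


mixed_request = {
    "tools": [{"name": "lookup", "description": "hypothetical tool",
               "input_schema": {"type": "object"}, "cache_control": ONE_HOUR}],
    "system": [{"type": "text", "text": "<persona>", "cache_control": ONE_HOUR}],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "<history>", "cache_control": FIVE_MIN}]},
    ],
}
```

Running `ttl_order_is_valid` in a unit test catches a reversed ordering before it reaches the API.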

5.4 Automatic caching for chat

If your only goal is "make multi-turn chat cheaper", drop a single top-level cache_control and forget it:

{ "cache_control": { "type": "ephemeral" }, "system": [...], "messages": [...] }

The system advances the breakpoint to the last cacheable block each turn. Treat this as the smoke test; graduate to explicit breakpoints when you need control over tools / docs / TTL mixing.

5.5 Pre-warming with max_tokens: 0

Fire a zero-output request to write the cache before users arrive:

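import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Zero-output warmup: the breakpoint sits on the shared system block.
client.messages.create(
    model="claude-opus-4-8",
    max_tokens=0,
    system=[{"type":"text","text":"<persona>","cache_control":{"type":"ephemeral"}}],
    messages=[{"role":"user","content":"warmup"}]
)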

Charged as one cache write, zero output tokens. The breakpoint must be on the shared block (system / tools), not the placeholder message, or the warmup won't carry over.

For 5 min caches, re-warm every <5 min. For longer idle windows, use 1 h TTL instead of poll-warming.


6. Cache-killing anti-patterns

  1. Dynamic system prompt: f"Current time is {now}. ..." makes every request a cache miss plus a fresh write. Move the timestamp into the user message, after the breakpoint.
  2. Trimming / summarising old turns mid-conversation — the prefix changes, cache dies. Append, don't rewrite.
  3. Per-user variables in shared system prompts — push user identity into the message layer.
  4. Non-deterministic JSON key order in tool_use content (Swift, Go default to map iteration order). Force a stable serializer.
  5. Toggling tool_choice between auto and a specific tool — full cache wipe even if tools is identical.
  6. Image add/remove anywhere in the prompt — full wipe.
  7. Workspace mixing — as of Feb 5 2026 caches are workspace-isolated on Claude API / AWS / Foundry. Dev and prod workspaces don't share. (Bedrock + Vertex still org-level.)
  8. Model switching — Opus and Sonnet do not share caches. Treat a model swap as starting over.
  9. Breakpoint on a volatile block — writes happen at the breakpoint. If the block changes every request, you pay write-price every turn and never read.
  10. 20-block lookback overflow — long conversations push the last write out of the 20-block lookback window. Add a second breakpoint closer to the head once history grows.
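Anti-patterns 1 and 3 share one fix: keep the cached system block constant and carry per-request values in the user turn. A minimal sketch, with `build_turn` and `STATIC_SYSTEM` as hypothetical names:

```python
from datetime import datetime, timezone

# Shared, never-changing system block with the breakpoint on it.
STATIC_SYSTEM = [
    {
        "type": "text",
        "text": "<long stable system prompt>",
        "cache_control": {"type": "ephemeral"},
    }
]


def build_turn(user_id: str, user_text: str, now: datetime) -> dict:
    """Put the timestamp and user identity after the breakpoint."""
    preamble = f"Current time is {now.isoformat()}. User: {user_id}."
    return {
        "system": STATIC_SYSTEM,
        "messages": [{"role": "user", "content": f"{preamble}\n\n{user_text}"}],
    }


first = build_turn("u-1", "hello", datetime(2026, 5, 31, 9, 0, tzinfo=timezone.utc))
second = build_turn("u-2", "hi", datetime(2026, 5, 31, 9, 5, tzinfo=timezone.utc))
```

The two requests differ only in the message layer, so both can read the same cached system prefix.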

7. Measuring it

resp = client.messages.create(...)
u = resp.usage
read = u.cache_read_input_tokens or 0        # served from cache (may be None)
write = u.cache_creation_input_tokens or 0   # written to cache on this call
miss = u.input_tokens                        # tokens AFTER the last breakpoint
hit_rate = read / max(read + write + miss, 1)

What "healthy" looks like in steady state:

  - cache_read_input_tokens dominates.
  - cache_creation_input_tokens ≈ 0 (occasional refresh write is fine).
  - input_tokens ≈ the size of the last user turn.

Anti-signals:

  - cache_creation_input_tokens > 0 on every call → prefix not actually stable.
  - cache_read_input_tokens == 0 after first call → breakpoint placement wrong, or TTL expired, or workspace/model boundary crossed.
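The healthy signals and anti-signals can be turned into a small check over a window of follow-up calls. In this sketch `Usage` is a hypothetical stand-in for the usage object on a response, and the returned strings are arbitrary labels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Usage:
    """Test double mirroring the three usage fields discussed above."""
    input_tokens: int
    cache_read_input_tokens: Optional[int] = 0
    cache_creation_input_tokens: Optional[int] = 0


def hit_rate(u: Usage) -> float:
    read = u.cache_read_input_tokens or 0
    write = u.cache_creation_input_tokens or 0
    return read / max(read + write + u.input_tokens, 1)


def diagnose(followups: List[Usage]) -> str:
    """Classify a window of calls made AFTER the first (cache-writing) call."""
    if not followups:
        return "no data"
    if all((u.cache_read_input_tokens or 0) == 0 for u in followups):
        return "no reads: check breakpoint placement, TTL, workspace/model"
    if all((u.cache_creation_input_tokens or 0) > 0 for u in followups):
        return "write on every call: prefix not stable"
    return "healthy"
```

Feeding it the per-route usage records from the last few minutes gives a quick first answer before digging into request diffs.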

Use Anthropic's cache-diagnostics endpoint to diff consecutive requests and locate the divergence byte.
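Independently of any server-side tooling, the divergence point can also be located locally by comparing the serialized bytes of two consecutive requests. A minimal sketch, with `first_divergence` as a hypothetical helper; it should be fed the bytes actually sent, because re-serializing with sorted keys would hide key-order drift.

```python
import json
from typing import Optional


def first_divergence(a: bytes, b: bytes) -> Optional[int]:
    """Index of the first differing byte, or None if the inputs are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one input is a strict prefix of the other
    return None


# Two request bodies that differ only by an embedded timestamp.
prev = json.dumps({"system": "Current time is 09:00. <persona>"}).encode("utf-8")
curr = json.dumps({"system": "Current time is 09:05. <persona>"}).encode("utf-8")
offset = first_divergence(prev, curr)
context = prev[max(offset - 20, 0):offset + 10]  # bytes around the divergence
```

Printing `context` shows the text surrounding the first changed byte, which is usually enough to name the volatile field.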


8. Decision flow

Prefix ≥ model min?                ── no ──► don't bother
        │ yes
Prefix reused ≥3 times per write?  ── no ──► single-call work, skip cache
        │ yes
Idle window > 5 min?               ── no ──► 5 min TTL
        │ yes
≥4 reads per write expected?       ── no ──► 5 min + pre-warm loop
        │ yes
        └──────────────────────────────────► 1 h TTL

How many static layers (tools, system, docs, history)?
  1 → automatic caching is enough
  2-4 → explicit breakpoints, one per layer
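The flow above translates into two small functions. This is a sketch of the decision flow as written, with hypothetical function names and return labels; the thresholds (3 reads, 5 minutes, 4 reads, at most 4 breakpoints) are taken from the flow itself.

```python
def choose_ttl(prefix_tokens: int, model_min_tokens: int,
               reads_per_write: float, idle_window_minutes: float) -> str:
    """Walk the decision flow top to bottom and return a strategy label."""
    if prefix_tokens < model_min_tokens:
        return "no caching"          # below the model minimum
    if reads_per_write < 3:
        return "no caching"          # single-call work
    if idle_window_minutes <= 5:
        return "5m"
    if reads_per_write >= 4:
        return "1h"
    return "5m + pre-warm"


def choose_breakpoints(static_layers: int) -> tuple:
    """Return (mode, number of explicit breakpoints)."""
    if static_layers <= 1:
        return ("automatic", 0)
    return ("explicit", min(static_layers, 4))  # one per layer, capped at 4
```

For instance, a 6,000-token prefix on a model with a 4,096-token minimum, read 10 times per write with 30-minute idle gaps, lands on the 1 h branch.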

Counterpoints


Recommendation

Default stance for a production Anthropic SDK app:

  1. Single explicit breakpoint on the system block, 5 min TTL — baseline.
  2. Add tool-definitions breakpoint when tool schemas cross ~1k tokens.
  3. Add a third breakpoint at the head of conversation history once turns > 5.
  4. Switch the most-static layer (tools or tools+system) to 1 h TTL only after telemetry shows ≥4 reads per write at idle.
  5. Wire a dashboard on cache_read / (cache_read + cache_creation + input_tokens) per route. Alert if hit rate drops > 10% week-over-week — that's how you catch a silent prefix-breaking change.
  6. Treat workspace, model, tool_choice, and image presence as cache boundaries during code review.

Sources captured in anthropic-prompt-caching-best-practices-2026-05-31.sources.json.