Date: 2026-05-31
Type: Research
Status: Tier-S synthesis (official docs + 2026 community guides)
Sources: anthropic-prompt-caching-best-practices-2026-05-31.sources.json
- The cached prefix is built in a fixed order: tools → system → messages. Cache breakpoints sit on the last block you want included in the cached prefix.
- Automatic caching (a single top-level cache_control): best for multi-turn chat; the breakpoint auto-advances as the conversation grows.
- Explicit breakpoints (up to 4 cache_control markers on individual blocks): fine-grained control for agents with layered static context (tools, system, docs, history).
- Monitor usage.cache_read_input_tokens, cache_creation_input_tokens, and input_tokens. Steady-state success = reads dominate, writes near zero.
- Silent invalidators: tool_choice changes, image add/remove, model switches, and non-deterministic JSON key ordering in tool_use blocks.

The cache is a prefix hash, not semantic memory. The API hashes everything from the start of the request through the last cache_control block, in the order tools → system → messages. On the next request, it checks whether that exact hash exists in your workspace's cache. A hit is billed at 10% of base input. A miss triggers a write at 125% (5 min TTL) or 200% (1 h TTL) of base input.
Two consequences:
Caches live ~5 min or ~1 h (TTL choice). Every successful read refreshes the TTL, so hot caches stay alive indefinitely.
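To make the prefix-hash behaviour concrete, here is a toy model. It is an illustration only: the real cache key is computed server-side, and `prefix_key`, the list layout, and the SHA-256 choice are assumptions made for this sketch, not Anthropic's implementation.

```python
import hashlib
import json

def prefix_key(tools, system, messages_up_to_breakpoint):
    """Toy cache key: a hash of everything up to the breakpoint, in API order."""
    # A list keeps the hierarchy order: tools, then system, then messages.
    prefix = [tools, system, messages_up_to_breakpoint]
    # sort_keys makes the serialization deterministic for this toy hash.
    blob = json.dumps(prefix, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

tools = [{"name": "search", "description": "Search the docs"}]
system = [{"type": "text", "text": "You are a support agent."}]
history = [{"role": "user", "content": "Hi"}]

base = prefix_key(tools, system, history)
same = prefix_key(tools, system, history)
# A one-character edit early in the prefix produces a different key: a miss.
edited = prefix_key(tools, [{"type": "text", "text": "You are a support agent!"}], history)
```

In this model `base == same` (a hit) while `base != edited` (a miss plus a fresh write), which is why even cosmetic edits to early blocks are expensive.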
| Model | Base input | 5 m write | 1 h write | Cache read |
|---|---|---|---|---|
| Claude Opus 4.8 / 4.7 / 4.6 / 4.5 | $5 | $6.25 | $10 | $0.50 |
| Claude Opus 4.1 / 4 (dep.) | $15 | $18.75 | $30 | $1.50 |
| Claude Sonnet 4.6 / 4.5 | $3 | $3.75 | $6 | $0.30 |
| Claude Haiku 4.5 | $1 | $1.25 | $2 | $0.10 |
Source: Anthropic prompt caching docs, May 2026. Prices are in USD per million input tokens.
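Math the team needs to know:
- 5 min TTL: the write surcharge is 25% of base input and each read saves 90%, so strict break-even is 0.25/0.9 ≈ 0.28 reads per write. The first cache hit already repays the write, so the rule of thumb of 3 reads per write carries a wide safety margin.
- 1 h TTL: the write surcharge is 100% of base input, so break-even against no caching is 1.0/0.9 ≈ 1.1 reads per write. The extra surcharge over a 5 min write is 75% of base (2.0 vs 1.25), so the rule of thumb of ~4 additional reads per write is likewise conservative.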
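The break-even ratio quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the multipliers from the pricing table (1.25× and 2× for writes, 0.1× for reads); `break_even_reads` is a hypothetical helper name.

```python
def break_even_reads(write_multiplier, read_multiplier=0.10):
    """Reads per write at which caching costs the same as never caching.

    Uncached: one call plus n repeats costs (1 + n) * base.
    Cached:   one write plus n reads costs write_multiplier * base + n * read_multiplier * base.
    Equating the two gives n = (write_multiplier - 1) / (1 - read_multiplier).
    """
    return (write_multiplier - 1.0) / (1.0 - read_multiplier)

five_min = break_even_reads(1.25)  # 0.25 / 0.9
one_hour = break_even_reads(2.00)  # 1.00 / 0.9
print(round(five_min, 2), round(one_hour, 2))
```

Any expected reuse above these ratios means the cache write pays for itself; the multipliers are parameters so the same function works if pricing changes.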
| Model | Minimum cacheable tokens |
|---|---|
| Claude Opus 4.7 / 4.6 / 4.5, Mythos Preview | 4,096 |
| Claude Opus 4.8, Opus 4.1, Sonnet 4.6 / 4.5 | 1,024 |
| Claude Haiku 4.5 | 4,096 |
| Claude Haiku 3.5 (Vertex only) | 2,048 |
Below the threshold, cache_control is silently ignored: the request succeeds, nothing is cached, and every call pays full input price. Always verify cache_creation_input_tokens > 0 on the first call.
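A small guard can catch a silently ignored breakpoint early. The sketch below is hedged: the thresholds are copied from the table above, the model ID strings are assumed to follow the `claude-<family>-<version>` pattern used elsewhere in this note, and both helper names are hypothetical.

```python
from types import SimpleNamespace

# Minimum cacheable prefix sizes, copied from the table above (model IDs are assumed).
MIN_CACHEABLE_TOKENS = {
    "claude-opus-4-8": 1024,
    "claude-sonnet-4-6": 1024,
    "claude-haiku-4-5": 4096,
}

def is_cacheable(model, prefix_tokens):
    """True if the prefix is long enough for the model to cache it at all."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model]  # KeyError on unknown model

def cache_write_confirmed(usage):
    """True if the first call actually wrote a cache entry."""
    # The field may be missing or None, so fall back to 0.
    return (getattr(usage, "cache_creation_input_tokens", 0) or 0) > 0

# Stand-in for resp.usage from a first call whose breakpoint was ignored.
ignored = SimpleNamespace(cache_creation_input_tokens=0, input_tokens=900)
print(is_cacheable("claude-haiku-4-5", 900), cache_write_confirmed(ignored))
```

Running `cache_write_confirmed` against the first response of each route in a smoke test is a cheap way to detect prefixes that fall under the minimum.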
The hierarchy tools → system → messages is also the invalidation cascade: a change at level N invalidates N and everything after.
| Change | Tools | System | Messages |
|---|---|---|---|
| Tool definitions (name / description / schema) | ✘ | ✘ | ✘ |
| Web-search tool toggle | ✓ | ✘ | ✘ |
| System prompt edit | ✓ | ✘ | ✘ |
| Messages content edit | ✓ | ✓ | ✘ |
| tool_choice change | ✘ | ✘ | ✘ |
| Adding / removing any image | ✘ | ✘ | ✘ |
| Extended-thinking budget change | ✓ | ✓ | ✘ (alters thinking blocks) |
| Model swap (e.g. Sonnet → Opus) | ✘ | ✘ | ✘ |
Reading this table (✓ = that level's cache survives, ✘ = it is invalidated) is the fastest way to spot why your hit rate dropped.
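The cascade rule (a change at level N invalidates N and everything after) is simple enough to encode directly. A minimal sketch, with hypothetical helper names; it models only the positional rule, not the special cases listed in the table.

```python
LEVELS = ("tools", "system", "messages")

def invalidated_levels(changed_level):
    """Return the changed level plus every level after it in the hierarchy."""
    start = LEVELS.index(changed_level)  # ValueError on an unknown level
    return list(LEVELS[start:])

def surviving_levels(changed_level):
    """Return the levels before the change, whose cache entries stay valid."""
    start = LEVELS.index(changed_level)
    return list(LEVELS[:start])

print(invalidated_levels("system"), surviving_levels("system"))
```

This is why the most stable content belongs earliest in the request: an edit to tools leaves nothing to reuse, while an edit to messages keeps both earlier levels.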
Thinking blocks cannot be marked with cache_control directly, but they ride along inside cached assistant turns and count as input tokens when read. Opus 4.5+ / Sonnet 4.6+ preserve them by default across tool-result turns.

Smallest useful diff vs no caching:
```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "<long stable system prompt>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [{ "role": "user", "content": "..." }]
}
```
Gives ~85–90% cost reduction on the system prompt as soon as it crosses the model's minimum prefix.
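Real Claude Code sessions hit ~96% read rate vs the naive ~90% because they place breakpoints on multiple layers, not just system:
- the last tool definition
- the last system block
- the static document context in the first user message
- the last assistant turn in the history

That's all 4 breakpoints. Use them.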
```
{
  "tools": [ ... last tool: { ..., "cache_control": {"type":"ephemeral"} } ],
  "system": [ ..., { "type":"text", "text":"<persona>", "cache_control":{"type":"ephemeral"} } ],
  "messages": [
    { "role":"user", "content":[ { "type":"text", "text":"<doc context>", "cache_control":{"type":"ephemeral"} } ] },
    ... history ...,
    { "role":"assistant", "content":[ { "type":"text", "text":"<last assistant>", "cache_control":{"type":"ephemeral"} } ] },
    { "role":"user", "content":"<new turn>" }
  ]
}
```
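For agents with a static persona + tools but bursty user activity, mix TTLs: 1 h breakpoints on the static layers (tools, system) and 5 min breakpoints on the conversation.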
Hard rule from Anthropic: 1 h breakpoints must appear before 5 min breakpoints in the request. Reversing the order is an error.
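The ordering rule can be checked locally before a request is sent. The sketch below assumes a mixed-TTL layout with 1 h on tools and system and the default 5 min on the conversation; `breakpoint_ttls` and `ttl_order_ok` are hypothetical helpers, while the `ttl` field on `cache_control` is the API's own.

```python
TTL_SECONDS = {"5m": 300, "1h": 3600}

def breakpoint_ttls(params):
    """Collect the TTL of every cache_control marker in request order."""
    blocks = list(params.get("tools", [])) + list(params.get("system", []))
    for message in params.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):  # plain-string content carries no markers
            blocks.extend(content)
    # A marker without an explicit ttl uses the 5 min default.
    return [b["cache_control"].get("ttl", "5m") for b in blocks if b.get("cache_control")]

def ttl_order_ok(params):
    """True if no 1 h breakpoint appears after a 5 min breakpoint."""
    seconds = [TTL_SECONDS[t] for t in breakpoint_ttls(params)]
    return all(a >= b for a, b in zip(seconds, seconds[1:]))

params = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "tools": [{"name": "search", "description": "Search docs",
               "input_schema": {"type": "object", "properties": {}},
               "cache_control": {"type": "ephemeral", "ttl": "1h"}}],
    "system": [{"type": "text", "text": "<persona>",
                "cache_control": {"type": "ephemeral", "ttl": "1h"}}],
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "<new turn>",
         "cache_control": {"type": "ephemeral"}}]}],
}
print(breakpoint_ttls(params), ttl_order_ok(params))
```

Because tools and system always precede messages in the prefix, putting the long TTL on the static layers satisfies the rule automatically; the check mainly guards against a 1 h marker slipping into the message history.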
If your only goal is "make multi-turn chat cheaper", drop a single top-level cache_control and forget it:
```
{ "cache_control": { "type": "ephemeral" }, "system": [...], "messages": [...] }
```
The system advances the breakpoint to the last cacheable block each turn. Treat this as the smoke test; graduate to explicit breakpoints when you need control over tools / docs / TTL mixing.
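Pre-warming with max_tokens: 0. Fire a zero-output request to write the cache before users arrive:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Zero-output warmup: writes the cache entry for the shared system block.
client.messages.create(
    model="claude-opus-4-8",
    max_tokens=0,
    system=[{"type": "text", "text": "<persona>", "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "warmup"}],
)
```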
Charged as one cache write, zero output tokens. The breakpoint must be on the shared block (system / tools), not the placeholder message, or the warmup won't carry over.
For 5 min caches, re-warm every <5 min. For longer idle windows, use 1 h TTL instead of poll-warming.
f"Current time is {now}. ..." makes every request a cache miss + a fresh write. Move the timestamp into the user message, after the breakpoint.tool_use content (Swift, Go default to map iteration order). Force a stable serializer.tool_choice between auto and a specific tool — full cache wipe even if tools is identical.resp = client.messages.create(...)
u = resp.usage
read = u.cache_read_input_tokens
write = u.cache_creation_input_tokens
miss = u.input_tokens # tokens AFTER the last breakpoint
hit_rate = read / max(read + write + miss, 1)
What "healthy" looks like in steady state:
- cache_read_input_tokens dominates.
- cache_creation_input_tokens ≈ 0 (occasional refresh write is fine).
- input_tokens ≈ the size of the last user turn.
Anti-signals:
- cache_creation_input_tokens > 0 on every call → prefix not actually stable.
- cache_read_input_tokens == 0 after first call → breakpoint placement wrong, or TTL expired, or workspace/model boundary crossed.
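The healthy and anti-signal patterns above can be turned into a coarse classifier over consecutive calls. This is a sketch under assumptions: `diagnose` and its return strings are hypothetical, and `SimpleNamespace` objects stand in for the SDK's `resp.usage`.

```python
from types import SimpleNamespace

def _tokens(usage, field):
    # Usage fields can be missing or None; treat both as 0.
    return getattr(usage, field, 0) or 0

def diagnose(usages):
    """Classify usage objects from consecutive calls; usages[0] is the cold first call."""
    later = usages[1:]
    if not later:
        return "need at least two calls"
    reads = [_tokens(u, "cache_read_input_tokens") for u in later]
    writes = [_tokens(u, "cache_creation_input_tokens") for u in later]
    if all(r == 0 for r in reads):
        # Breakpoint placement, expired TTL, or a workspace/model boundary.
        return "no cache reads"
    if all(w > 0 for w in writes):
        # Something in the prefix changes on every request.
        return "write on every call"
    return "healthy"

cold = SimpleNamespace(cache_read_input_tokens=0, cache_creation_input_tokens=5000, input_tokens=40)
warm = SimpleNamespace(cache_read_input_tokens=5000, cache_creation_input_tokens=0, input_tokens=40)
print(diagnose([cold, warm, warm]))  # healthy
```

The "write on every call" branch is deliberately strict: in multi-turn chat with an advancing breakpoint, small incremental writes are expected, so a production check would likely compare write size against prefix size rather than against zero.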
Use Anthropic's cache-diagnostics endpoint to diff consecutive requests and locate the divergence byte.
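When only local request logs are available, the same idea can be approximated by diffing two serialized requests. This sketch is not the diagnostics endpoint: `canonical_prefix` and `first_divergence` are hypothetical helpers, and the offset is a character position in a local JSON serialization, not in whatever the server hashes.

```python
import json

def canonical_prefix(request):
    """Serialize a request in prefix order (tools, system, messages) with sorted keys."""
    ordered = [request.get("tools", []), request.get("system", []), request.get("messages", [])]
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

def first_divergence(prev_request, next_request):
    """Return the first character offset where the serialized prefixes differ, or None."""
    a, b = canonical_prefix(prev_request), canonical_prefix(next_request)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a strict prefix of the other, or they are identical.
    return None if len(a) == len(b) else min(len(a), len(b))

prev = {"system": [{"type": "text", "text": "Current time is 10:00"}],
        "messages": [{"role": "user", "content": "Hi"}]}
nxt = {"system": [{"type": "text", "text": "Current time is 10:01"}],
       "messages": [{"role": "user", "content": "Hi"}]}
offset = first_divergence(prev, nxt)
print(canonical_prefix(prev)[:offset])
```

Printing the text up to the offset shows the last shared content, which here ends inside the system prompt's timestamp. Sorting keys also sidesteps the key-ordering pitfall when comparing logs produced by different serializers.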
```
Prefix ≥ model min? ── no ──► don't bother
        │ yes
Prefix reused ≥3 times per write? ── no ──► single-call work, skip cache
        │ yes
Idle window > 5 min? ── no ──► 5 min TTL
        │ yes
≥4 reads per write expected? ── no ──► 5 min + pre-warm loop
        │ yes
        ▼
1 h TTL

How many static layers (tools, system, docs, history)?
  1   → automatic caching is enough
  2-4 → explicit breakpoints, one per layer
```
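The decision tree above translates directly into a function, which can be handy in design reviews. A hedged sketch: `choose_strategy`, its parameter names, and the returned dictionary shape are all hypothetical, and capping breakpoints at 4 for more than four layers is an assumption the tree does not state.

```python
def choose_strategy(prefix_tokens, model_min_tokens, reads_per_write, idle_minutes, static_layers):
    """Walk the caching decision tree and return the suggested configuration."""
    if prefix_tokens < model_min_tokens:
        return {"cache": False, "reason": "prefix below model minimum"}
    if reads_per_write < 3:
        return {"cache": False, "reason": "too little reuse"}

    if idle_minutes <= 5:
        ttl, prewarm = "5m", False
    elif reads_per_write >= 4:
        ttl, prewarm = "1h", False
    else:
        ttl, prewarm = "5m", True  # keep the short TTL alive with a pre-warm loop

    if static_layers <= 1:
        mode, breakpoints = "automatic", 0
    else:
        # Assumption: cap at 4 breakpoints when there are more layers than markers.
        mode, breakpoints = "explicit", min(static_layers, 4)

    return {"cache": True, "ttl": ttl, "prewarm": prewarm, "mode": mode, "breakpoints": breakpoints}

print(choose_strategy(8000, 1024, reads_per_write=10, idle_minutes=30, static_layers=3))
```

Keeping the thresholds as literals mirrors the tree; pulling them into named constants would make them easier to tune per route.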
Teams that do not monitor cache_creation_input_tokens discover this only when the bill arrives.

Default stance for a production Anthropic SDK app:
- Track cache_read / (cache_read + cache_creation + input_tokens) per route. Alert if hit rate drops > 10% week-over-week; that's how you catch a silent prefix-breaking change.
- Treat model, tool_choice, and image presence as cache boundaries during code review.

Sources captured in anthropic-prompt-caching-best-practices-2026-05-31.sources.json.