Date: 2026-05-17
Type: Research
Status: Comprehensive guide to prompt caching best practices, strategies, and production patterns for the Anthropic Claude API
Sources: prompt-caching-anthropic-api-best-practices-2026-05-17.sources.json
Prompt caching is the single highest-impact cost optimization available on the Anthropic Claude API. When used correctly, it reduces input token costs by up to 90% and latency by up to 85%. When used incorrectly, it can increase costs by 25-30%. The difference comes down to understanding five things the docs don't put in big letters: prefix matching is byte-exact, the hierarchy is unforgiving, minimum token thresholds silently skip caching, the 20-block lookback window can lose long conversations, and the ROI math requires consistent reuse within the TTL window.
This guide synthesizes the official Anthropic documentation with production experience reports from teams running prompt caching at scale.
Prompt caching stores the prefix of your request — everything from the beginning up to a "breakpoint" you define with cache_control. On subsequent requests, if the prefix is byte-for-byte identical, Claude reads from cache instead of reprocessing those tokens.
Key properties:
- Prefix-only: You can only cache the start of a prompt. Middle sections can't be cached independently if the beginning changed.
- Exact match: The cache key is computed from the literal token sequence, not the semantic meaning. A single extra space or different JSON key ordering breaks the cache.
- TTL: Default 5 minutes, refreshes on every cache hit. A 1-hour TTL is available at 2× the base input price.
The API processes blocks in a fixed order that forms the cache prefix:
TOOLS → SYSTEM → MESSAGES
A change at any level invalidates that level and every level below it. If you reorder a tool definition, the entire system prompt and all messages become a cache miss. If you add a character to the system prompt, all messages become a cache miss.
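The cascade can be sketched in a few lines. This is a minimal illustration of the rule as stated above, not an Anthropic API; `CACHE_LEVELS` and `invalidated_levels` are hypothetical names introduced here.

```python
# Order in which the cache prefix is assembled, per the text above.
CACHE_LEVELS = ["tools", "system", "messages"]


def invalidated_levels(changed_level: str) -> list[str]:
    """Return the changed level plus every level that follows it in the prefix."""
    start = CACHE_LEVELS.index(changed_level)  # raises ValueError for an unknown level
    return CACHE_LEVELS[start:]


print(invalidated_levels("tools"))     # ['tools', 'system', 'messages']
print(invalidated_levels("system"))    # ['system', 'messages']
print(invalidated_levels("messages"))  # ['messages']
```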
| Mode | How | Best for |
|---|---|---|
| Automatic | cache_control: {"type": "ephemeral"} at top level of request | Multi-turn conversations where the growing message history should cache automatically |
| Explicit | cache_control on individual content blocks | Fine-grained control over what gets cached, multi-tenant scenarios, stable system prompts with volatile messages |
| Model | Base Input | 5m Cache Write | 1h Cache Write | Cache Read |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $6.25 (1.25×) | $10.00 (2×) | $0.50 (0.1×) |
| Claude Sonnet 4.6 | $3.00 | $3.75 (1.25×) | $6.00 (2×) | $0.30 (0.1×) |
| Claude Haiku 4.5 | $1.00 | $1.25 (1.25×) | $2.00 (2×) | $0.10 (0.1×) |
The cache write surcharge means caching only pays off if you read the same prefix enough times within the TTL:
| TTL | Write cost | Reads to break even vs. uncached |
|---|---|---|
| 5 minutes | 1.25× base | 1 read (1.25 + 0.1 = 1.35 < 2.0 for two uncached requests) |
| 1 hour | 2.0× base | 2 reads (2.0 + 2 × 0.1 = 2.2 < 3.0 for three uncached requests; a single read costs 2.1 > 2.0) |
⚠️ If your call pattern is "one-shot, cold cache every time," prompt caching makes you slightly worse off — you pay the 1.25× write surcharge and never get a read.
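The break-even arithmetic can be checked directly from the multipliers in the pricing table. The sketch below assumes one cache write followed by `n_reads` cache reads of the same prefix, with all costs expressed as multiples of the base input price; the function names are illustrative.

```python
def request_cost(n_reads: int, write_mult: float, read_mult: float = 0.1) -> tuple[float, float]:
    """Cost of one cache write plus n_reads cache reads, vs. n_reads + 1 uncached calls.

    Both values are multiples of the base input price for the cached prefix.
    """
    cached = write_mult + n_reads * read_mult
    uncached = float(n_reads + 1)
    return cached, uncached


def break_even_reads(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of cache reads at which caching is strictly cheaper."""
    if read_mult >= 1.0:
        raise ValueError("caching never pays off if reads cost as much as base input")
    n = 0
    while True:
        cached, uncached = request_cost(n, write_mult, read_mult)
        if cached < uncached:
            return n
        n += 1


print(break_even_reads(1.25))  # 5-minute TTL -> 1
print(break_even_reads(2.0))   # 1-hour TTL   -> 2
print(request_cost(0, 1.25))   # one-shot call: (1.25, 1.0), caching is a net loss
```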
Caching silently does nothing if your prefix is below the model-specific minimum. The API accepts cache_control without error — it just returns cache_creation_input_tokens: 0 and cache_read_input_tokens: 0.
| Model | Minimum cacheable prefix |
|---|---|
| Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 | 4,096 tokens |
| Sonnet 4.6, Sonnet 4.5, Opus 4.1 | 1,024 tokens |
| Haiku 3.5 (retired, except Vertex AI) | 2,048 tokens |
Gotcha: If you built a workflow on Sonnet with a 2,500-token cached prefix and then upgraded to Opus 4.7 (4,096 minimum), your cache hit rate drops to zero with no error message. Use POST /v1/messages/count_tokens to verify your prefix length.
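A pre-deployment guard along these lines can catch the silent skip before a model upgrade. It is a sketch: the thresholds are copied from the table above, the dictionary keys are illustrative labels rather than official model IDs, and treating the minimum as inclusive is an assumption. The token count itself would come from the count_tokens endpoint.

```python
# Minimum cacheable prefix per model, copied from the table above.
# Keys are illustrative labels, NOT official API model identifiers.
MIN_CACHEABLE_TOKENS = {
    "opus-4.7": 4096,
    "opus-4.6": 4096,
    "opus-4.5": 4096,
    "haiku-4.5": 4096,
    "sonnet-4.6": 1024,
    "sonnet-4.5": 1024,
    "opus-4.1": 1024,
    "haiku-3.5": 2048,
}


def will_cache(model_label: str, prefix_tokens: int) -> bool:
    """True if the prefix meets the model's minimum (assumed inclusive)."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_label]


# The upgrade scenario from the text: same 2,500-token prefix, different model.
print(will_cache("sonnet-4.6", 2500))  # True
print(will_cache("opus-4.7", 2500))    # False
```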
Place static content at the beginning, dynamic content at the end:
┌─────────────────────────────┐
│ TOOL DEFINITIONS            │ ← Most stable. Rarely changes.
│ (cache_control here)        │
├─────────────────────────────┤
│ SYSTEM PROMPT               │ ← Slow-moving. Instructions, persona.
│ (cache_control here)        │
├─────────────────────────────┤
│ FEW-SHOT EXAMPLES / RAG     │ ← Semi-static. Changes per-session or per-day.
│ (cache_control here)        │
├─────────────────────────────┤
│ USER MESSAGE                │ ← Volatile. Never cached.
└─────────────────────────────┘
Rules:
- Never put timestamps, user IDs, or per-request variables in the cached prefix
- Never interpolate dynamic values like {{CURRENT_TIME}} into the system prompt
- Sort tool definitions deterministically (by name) so JSON ordering is stable
- Use the same separator characters consistently (\n\n vs \n changes the cache key)
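One way to enforce the ordering rules is to canonicalize tool definitions before every request and log a fingerprint of the prefix so drift shows up in monitoring. The sketch below uses only the standard library. `canonicalize`, `canonical_tools`, and `prefix_fingerprint` are hypothetical helpers, and the SHA-256 fingerprint is a local drift detector, not the key the API actually computes.

```python
import hashlib
import json


def canonicalize(value):
    """Recursively rebuild dicts with sorted keys so serialization order is stable."""
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [canonicalize(item) for item in value]
    return value


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Sort tools by name and normalize key order inside each definition."""
    return [canonicalize(tool) for tool in sorted(tools, key=lambda t: t["name"])]


def prefix_fingerprint(tools: list[dict], system_prompt: str) -> str:
    """Hash the prefix exactly as ordered, so any reordering changes the result."""
    payload = json.dumps(
        {"tools": tools, "system": system_prompt},
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = [{"name": "search", "description": "s"}, {"name": "read_file", "description": "r"}]
b = [{"description": "r", "name": "read_file"}, {"description": "s", "name": "search"}]
print(prefix_fingerprint(a, "x") == prefix_fingerprint(b, "x"))  # False: order differs
print(prefix_fingerprint(canonical_tools(a), "x") == prefix_fingerprint(canonical_tools(b), "x"))  # True
```

Comparing the fingerprint across deploys is a cheap way to notice that a separator or tool ordering changed before the hit rate drops.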
| ✅ Put in the cached block | ❌ Keep out of the cached block |
|---|---|
| System prompt (instructions, persona) | The user's actual question |
| Tool definitions (sorted, deterministic) | Per-user IDs, session IDs, request IDs |
| Knowledge base / RAG context | Today's date or Date.now() |
| Few-shot examples | Per-customer variables (name, account, plan) |
| Style guides, format specs | Anything that changes between runs |
The #1 mistake is placing cache_control on a block that changes every request.
❌ WRONG:
[system prompt] [knowledge base] [timestamp: 2026-05-17T14:32:01Z] ← cache_control here
→ The timestamp changes every call. No cache hit ever occurs.
→ The lookback does NOT find the stable content behind it.
✅ RIGHT:
[system prompt] [knowledge base] ← cache_control here
[timestamp: 2026-05-17T14:32:01Z] [user question]
→ Stable prefix cached. Timestamp is outside the cached block.
Why this matters: Cache reads look backward from the breakpoint for entries that prior requests already wrote at their own breakpoints. The lookback does NOT scan for stable content behind the breakpoint and cache it. If you never wrote a cache entry at the position of the stable content, the lookback finds nothing.
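The system checks at most 20 positions per breakpoint when looking for a prior cache write. In a growing conversation, once the current breakpoint sits more than 20 blocks past the most recent cache write, the lookback no longer reaches that entry and the request is processed as a full cache miss.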
Fix: Add a second breakpoint partway through the conversation so there's always a cache write within 20 blocks of the current breakpoint.
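The window can be modeled with block indices to decide when a second breakpoint is needed. This is a simplified sketch: it assumes the breakpoint's own position counts as the first of the 20 checked positions, which the text above does not specify, and `lookback_finds` is a hypothetical helper.

```python
LOOKBACK_WINDOW = 20  # positions checked per breakpoint, per the text above


def lookback_finds(current_breakpoint: int, prior_write: int,
                   window: int = LOOKBACK_WINDOW) -> bool:
    """True if a cache entry written at block index prior_write is still reachable.

    Assumes the checked positions are current_breakpoint, current_breakpoint - 1,
    ..., current_breakpoint - (window - 1).
    """
    distance = current_breakpoint - prior_write
    return 0 <= distance < window


print(lookback_finds(30, 15))  # True: the prior write is 15 blocks back
print(lookback_finds(40, 15))  # False: 25 blocks back, outside the window
```

When the check returns False for the next turn, that is the point to place an additional breakpoint closer to the end of the conversation.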
For multi-turn chat, use automatic caching (top-level cache_control):
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Automatic caching
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "What's my name?"},  # Cache moves here
    ],
)
The cache point moves forward automatically as the conversation grows. No need to update breakpoints.
For multi-tenant applications, use multiple independent cache segments:
system = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT,  # ~1200 tokens, identical for ALL tenants
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
    {
        "type": "text",
        "text": tenant_context,  # ~600 tokens, per-tenant
        "cache_control": {"type": "ephemeral"},
    },
]
Why two segments?
- Segment 1 (system prompt) caches globally across all tenants → hit rate near 100%
- Segment 2 (tenant context) caches per-tenant → changes for tenant A don't invalidate tenant B's cache
- If you combine them, tenant A's context change invalidates the system prompt cache for everyone
Order matters: The more-static segment must come first. If per-tenant content comes first, the system prompt's cache key includes it — making the most cacheable segment uncacheable across tenants.
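The effect of segment order can be seen by hashing the cumulative prefix at each breakpoint. The sketch below uses SHA-256 purely as a stand-in, since the source does not describe how the real cache key is computed, and `cumulative_keys` is a hypothetical helper.

```python
import hashlib


def cumulative_keys(segments: list[str]) -> list[str]:
    """Return one stand-in cache key per segment, each covering the prefix so far."""
    keys = []
    hasher = hashlib.sha256()
    for segment in segments:
        hasher.update(segment.encode("utf-8"))
        hasher.update(b"\x00")  # boundary marker so segment splits affect the key
        keys.append(hasher.copy().hexdigest())
    return keys


shared = "You are a support assistant for ExampleCo."  # illustrative shared prompt
tenant_a = cumulative_keys([shared, "Tenant A context"])
tenant_b = cumulative_keys([shared, "Tenant B context"])
print(tenant_a[0] == tenant_b[0])  # True: shared segment is reusable across tenants
print(tenant_a[1] == tenant_b[1])  # False: tenant segment is per-tenant

# Reversed order: the tenant text is now part of every key, so nothing is shared.
reversed_a = cumulative_keys(["Tenant A context", shared])
reversed_b = cumulative_keys(["Tenant B context", shared])
print(reversed_a[1] == reversed_b[1])  # False
```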
| Scenario | Recommended TTL | Why |
|---|---|---|
| Active chat (user typing) | 5 minutes | Cache stays warm from traffic |
| Agent loops with pauses | 5 minutes | Most tool calls happen within 5 minutes |
| Batch/eval harnesses | 1 hour | Same prefix hit 100+ times across a wave |
| Daily batch jobs | 1 hour | Prevents cold-start between runs |
| Cross-tenant shared prefix | 1 hour | Maximizes reuse window |
When mixing TTLs: The longer-TTL block must appear before the shorter-TTL block in the hierarchy:
tools (1h cache) → system (1h cache) → messages (5m cache)
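A small validator can check this ordering before a request is sent. It is a sketch under the assumption that TTLs are expressed with the "5m" and "1h" strings used in this guide; `ttl_order_is_valid` is a hypothetical helper, not part of the SDK.

```python
TTL_SECONDS = {"5m": 300, "1h": 3600}


def ttl_order_is_valid(ttls_in_prefix_order: list[str]) -> bool:
    """True if no shorter-TTL breakpoint appears before a longer-TTL one."""
    seconds = [TTL_SECONDS[ttl] for ttl in ttls_in_prefix_order]
    return all(earlier >= later for earlier, later in zip(seconds, seconds[1:]))


print(ttl_order_is_valid(["1h", "1h", "5m"]))  # True: tools, system, messages
print(ttl_order_is_valid(["5m", "1h"]))        # False: shorter TTL comes first
```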
Before real traffic, seed the cache with a warmup request:
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1,  # Minimal output; only the cache write matters
    system=shared_system,  # Your cached prefix (blocks carrying cache_control)
    messages=[{"role": "user", "content": "warmup"}],
)
Then subsequent real requests hit a warm cache immediately.
The tension: agent loops want many tools (more capability), but caching wants a stable tool catalog (changes invalidate the prefix).
Solution: Keep a small, stable core tool catalog in the cached prefix, and load additional tools on demand:
# Core tools — always present, always cached
core_tools = [read_tool, write_tool, exec_tool, search_tool, status_tool, find_tool]
# When agent needs a specialized tool:
# Call find_tool → returns a tool_reference appended to messages (not tools array)
# Cache on core tools stays warm; specialized tools don't break the prefix
Each on-demand tool adds ~150-300 tokens to the message stream, but saves the cache miss from a 60-tool catalog.
from datetime import datetime

# ❌ WRONG: changes every call, kills the cache
system = f"You are a helpful assistant. Current time: {datetime.now()}"
# ✅ RIGHT: dynamic content goes in the user message
system = "You are a helpful assistant."  # Stable, cached
messages = [{"role": "user", "content": f"[Time: {datetime.now()}] My question..."}]
Adding a single tool invalidates the entire prefix (tools → system → messages). Use on-demand tool loading or the tool_reference pattern instead.
Caches are model-specific. A cache built for Sonnet 4.6 is not portable to Opus 4.7 (different tokenizers). If your agent fails over between models, every fallback is a cold start.
Every edit to your JSON output schema or tool definitions invalidates every existing cache entry that includes them. Batch schema changes into stabilization sweeps rather than continuous tweaks.
Since February 2026, caches are isolated per workspace, not per organization. Dev and prod workspaces don't share cache even with identical prompts.
A cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent ones.
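The "warm first, then fan out" pattern might look like the following asyncio sketch. The `send` coroutine is a hypothetical stand-in for whatever function issues one API request. Waiting for the first response to finish completely is a conservative simplification, since the text above only requires that it has begun.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

Req = TypeVar("Req")
Resp = TypeVar("Resp")


async def warm_then_fan_out(
    send: Callable[[Req], Awaitable[Resp]],
    requests: Sequence[Req],
) -> list[Resp]:
    """Send the first request alone, then run the remaining ones concurrently."""
    if not requests:
        return []
    first = await send(requests[0])  # this call is the one that writes the cache
    rest = await asyncio.gather(*(send(req) for req in requests[1:]))
    return [first, *rest]


async def _demo_send(prompt: str) -> str:
    """Placeholder sender used only for this demonstration."""
    await asyncio.sleep(0)
    return prompt.upper()


print(asyncio.run(warm_then_fan_out(_demo_send, ["a", "b", "c"])))  # ['A', 'B', 'C']
```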
Languages like Go and Swift may randomize JSON key order during serialization. Use sorted/ordered maps for tool_use blocks to ensure consistent cache keys.
Changing the tool_choice parameter between calls invalidates the message cache even if tools and system prompt are identical.
Track these response fields on every call:
usage = response.usage
cache_read = usage.cache_read_input_tokens # Tokens read from cache (cheapest)
cache_write = usage.cache_creation_input_tokens # Tokens written to cache (premium)
uncached = usage.input_tokens # Tokens after last breakpoint (base price)
total_input = cache_read + cache_write + uncached
Key metrics:
- Hit rate = cache_read / total_input — target >70% for a healthy cache
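- Read-to-write ratio (cache_read / cache_write): must exceed about 0.28 for the 5m TTL and about 1.11 for the 1h TTL to break even, since each read token saves 0.9× base against a write premium of 0.25× or 1.0×
- If both cache fields are 0, nothing was cached; the most common cause is a prefix below the minimum threshold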
Treat hit rate as a product metric, not a vanity metric. Display it in the same dashboard as latency and error rate. Below 70% deserves investigation.
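These metrics reduce to a few lines of arithmetic on the three usage fields. The helper below is a sketch with a hypothetical name. It takes plain integers so it can be fed from `response.usage`, and the default multipliers are the 5-minute write and read rates from the pricing table.

```python
def cache_metrics(cache_read: int, cache_write: int, uncached: int,
                  write_mult: float = 1.25, read_mult: float = 0.1) -> dict:
    """Compute hit rate and billed cost relative to a fully uncached request."""
    total = cache_read + cache_write + uncached
    if total == 0:
        return {"hit_rate": 0.0, "cost_ratio": 0.0}
    billed = cache_read * read_mult + cache_write * write_mult + uncached
    return {
        "hit_rate": cache_read / total,  # share of input tokens served from cache
        "cost_ratio": billed / total,    # below 1.0 means caching saved money
    }


warm = cache_metrics(cache_read=9000, cache_write=0, uncached=1000)
cold = cache_metrics(cache_read=0, cache_write=9000, uncached=1000)
print(round(warm["hit_rate"], 2), round(warm["cost_ratio"], 3))
print(round(cold["hit_rate"], 2), round(cold["cost_ratio"], 3))
```

A cost ratio above 1.0 on most requests indicates the workload is paying write surcharges without earning reads.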
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # That's it: automatic caching
    system="You are an expert assistant with deep knowledge of...",
    messages=[
        {"role": "user", "content": "Explain quantum computing"},
    ],
)
print(response.usage.model_dump_json())
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # ~2000 tokens
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            "text": knowledge_base,  # ~8000 tokens
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": f"[Context: {user_context}]\n\n{user_question}"},
    ],
)
tools = [
    {"name": "read_file", "description": "...", "input_schema": {...}},
    {"name": "search", "description": "...", "input_schema": {...}},
    # Last tool gets the cache breakpoint
    {"name": "execute", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},
]
Caching can increase costs for low-frequency workloads. If your API calls are infrequent (>5 min apart) or one-shot, the 1.25× write surcharge with zero reads makes caching a net loss. Not every workload benefits — instrument before you ship.
The 5-minute TTL is too short for many real-world patterns. Anthropic shifted the default from 60 minutes to 5 minutes in March 2026, which was a 30-60% cost increase for agent workloads with human-in-the-loop pauses. The 1-hour TTL helps, but its 2× write cost only pays off once the prefix is read at least twice within the hour (2.0 + 2 × 0.1 = 2.2 versus 3.0 uncached).
The prefix-only design limits flexibility. You cannot cache a frequently-used middle section of your prompt if the beginning varies. This forces architectural compromises where all "shared" content must be hoisted to the top of every request.
Multi-tenant applications require careful segmentation. The two-segment trick works but adds complexity. Each additional breakpoint is one more thing that can break, and the 4-breakpoint limit constrains how many independent cache segments you can have.
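A pre-flight check for the breakpoint limit could be sketched as follows. It assumes the request shapes used in this guide's examples (tools as a list of dicts, system as a string or a list of blocks, message content as a string or a list of blocks); `count_breakpoints` is a hypothetical helper.

```python
MAX_BREAKPOINTS = 4  # the limit stated above


def count_breakpoints(tools: list, system, messages: list) -> int:
    """Count explicit cache_control markers across tools, system, and messages."""
    def marked(blocks) -> int:
        return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)

    count = marked(tools)
    if isinstance(system, list):  # a plain string system prompt carries no marker
        count += marked(system)
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            count += marked(content)
    return count


tools = [{"name": "search"}, {"name": "execute", "cache_control": {"type": "ephemeral"}}]
system = [
    {"type": "text", "text": "shared", "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    {"type": "text", "text": "tenant", "cache_control": {"type": "ephemeral"}},
]
messages = [{"role": "user", "content": "Hello"}]
used = count_breakpoints(tools, system, messages)
print(used, used <= MAX_BREAKPOINTS)  # 3 True
```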
Cache isolation is per-workspace, not per-org. Teams that share prompts across dev/staging/prod workspaces don't share cache, reducing the effective hit rate in environments with lower traffic.