Best Practices for Prompt Caching with the Anthropic API

Date: 2026-05-17
Type: Research
Status: Comprehensive guide to prompt caching best practices, strategies, and production patterns for the Anthropic Claude API
Sources: prompt-caching-anthropic-api-best-practices-2026-05-17.sources.json

Executive Summary

Prompt caching is the single highest-impact cost optimization available on the Anthropic Claude API. When used correctly, it reduces input token costs by up to 90% and latency by up to 85%. When used incorrectly, it increases costs: a prefix that is written but never read costs 25% more than uncached input with the 5-minute TTL and 100% more with the 1-hour TTL. The difference comes down to understanding five things the docs don't put in big letters: prefix matching is byte-exact, the hierarchy is unforgiving, minimum token thresholds silently skip caching, the 20-block lookback window can lose long conversations, and the ROI math requires consistent reuse within the TTL window.

This guide synthesizes the official Anthropic documentation with production experience reports from teams running prompt caching at scale.

1. How Prompt Caching Works

Core Mechanism

Prompt caching stores the prefix of your request — everything from the beginning up to a "breakpoint" you define with cache_control. On subsequent requests, if the prefix is byte-for-byte identical, Claude reads from cache instead of reprocessing those tokens.

Key properties:
- Prefix-only: You can only cache the start of a prompt. Middle sections can't be cached independently if the beginning changed.
- Exact match: The cache key is computed from the literal token sequence, not the semantic meaning. A single extra space or different JSON key ordering breaks the cache.
- TTL: Default 5 minutes, refreshes on every cache hit. A 1-hour TTL is available, with cache writes billed at 2× the base input price.
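To make the exact-match property concrete, the sketch below hashes the serialized bytes of payloads that mean the same thing but are written differently. The real cache key is internal to the API and is not exposed; `prefix_fingerprint` is a hypothetical helper used only to illustrate byte-level comparison.

```python
import hashlib
import json


def prefix_fingerprint(serialized: str) -> str:
    """Hash the exact bytes of a serialized prefix (illustration only)."""
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


tool = {"name": "search", "description": "Search the docs"}
reordered = {"description": "Search the docs", "name": "search"}

# Same data, different key order: the serialized bytes differ.
a = json.dumps(tool)
b = json.dumps(reordered)

# Same text plus one trailing space: the bytes differ again.
c = a + " "

# Canonical serialization (sorted keys, fixed separators) removes the
# key-order difference, so both payloads produce identical bytes.
canon_a = json.dumps(tool, sort_keys=True, separators=(",", ":"))
canon_b = json.dumps(reordered, sort_keys=True, separators=(",", ":"))
```

A fingerprint like this can be logged next to each request so that unexpected prefix drift shows up before the bill does.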

The Hierarchy

The API processes blocks in a fixed order that forms the cache prefix:

TOOLS → SYSTEM → MESSAGES

A change at any level invalidates that level and every level below it. If you reorder a tool definition, the entire system prompt and all messages become a cache miss. If you add a character to the system prompt, all messages become a cache miss.
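The invalidation rule can be written down as a small lookup. This is a sketch of the rule as stated in this section, not code from any SDK; `CACHE_LEVELS` and `invalidated_levels` are hypothetical names.

```python
CACHE_LEVELS = ("tools", "system", "messages")


def invalidated_levels(changed_level: str) -> list[str]:
    """Return the changed level plus every level after it in the prefix."""
    index = CACHE_LEVELS.index(changed_level)  # ValueError on an unknown level
    return list(CACHE_LEVELS[index:])
```

For example, `invalidated_levels("system")` lists the system prompt and the messages, while the tool definitions remain reusable.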

Two Modes of Caching

Mode	How	Best for
Automatic	`cache_control: {"type": "ephemeral"}` at top level of request	Multi-turn conversations where the growing message history should cache automatically
Explicit	`cache_control` on individual content blocks	Fine-grained control over what gets cached, multi-tenant scenarios, stable system prompts with volatile messages

2. Pricing and ROI Math

Pricing by Model (per million tokens)

Model	Base Input	5m Cache Write	1h Cache Write	Cache Read
Claude Opus 4.7	$5.00	$6.25 (1.25×)	$10.00 (2×)	$0.50 (0.1×)
Claude Sonnet 4.6	$3.00	$3.75 (1.25×)	$6.00 (2×)	$0.30 (0.1×)
Claude Haiku 4.5	$1.00	$1.25 (1.25×)	$2.00 (2×)	$0.10 (0.1×)

Breakeven Analysis

The cache write surcharge means caching only pays off if you read the same prefix enough times within the TTL:

TTL	Write cost	Reads to break even vs. uncached
5 minutes	1.25× base	1 read (1.25 + 0.1 = 1.35 vs. 2.0 for two uncached requests)
1 hour	2.0× base	2 reads (2.0 + 0.2 = 2.2 vs. 3.0 for three uncached requests)

⚠️ If your call pattern is "one-shot, cold cache every time," prompt caching makes you slightly worse off — you pay the 1.25× write surcharge and never get a read.
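The breakeven point follows directly from the multipliers in the pricing table. The sketch below assumes a simple cost model: one cache write followed by some number of cache reads, compared against sending the same prefix uncached every time. `reads_to_break_even` is a hypothetical helper, and costs are expressed as multiples of the base input price.

```python
def reads_to_break_even(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of cache reads after which caching is cheaper.

    Cached cost   = write_multiplier + reads * read_multiplier
    Uncached cost = 1 + reads   (one base-price request per call)
    """
    if read_multiplier >= 1:
        raise ValueError("reads must be cheaper than base input to break even")
    reads = 0
    while write_multiplier + reads * read_multiplier >= 1 + reads:
        reads += 1
    return reads


five_minute = reads_to_break_even(1.25)  # 5-minute TTL write surcharge
one_hour = reads_to_break_even(2.0)      # 1-hour TTL write surcharge
```

The model ignores TTL expiry, so it only holds if every read lands while the entry is still warm.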

3. Minimum Token Thresholds

Caching silently does nothing if your prefix is below the model-specific minimum. The API accepts cache_control without error — it just returns cache_creation_input_tokens: 0 and cache_read_input_tokens: 0.

Model	Minimum cacheable prefix
Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5	4,096 tokens
Sonnet 4.6, Sonnet 4.5, Opus 4.1	1,024 tokens
Haiku 3.5 (retired, except Vertex AI)	2,048 tokens

Gotcha: If you built a workflow on Sonnet with a 2,500-token cached prefix and then upgraded to Opus 4.7 (4,096 minimum), your cache hit rate drops to zero with no error message. Use POST /v1/messages/count_tokens to verify your prefix length.
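A guard like the following can catch the silent skip before deployment. It is a sketch that copies the minimums from the table above; the dictionary keys are display names rather than API model IDs, and `prefix_tokens` is assumed to come from a token count you obtained separately (for example from the count_tokens endpoint).

```python
# Minimums copied from the table in this section. Keys are display names,
# not API model IDs; adjust both if the table changes.
MIN_CACHEABLE_TOKENS = {
    "Opus 4.7": 4096,
    "Opus 4.6": 4096,
    "Opus 4.5": 4096,
    "Haiku 4.5": 4096,
    "Sonnet 4.6": 1024,
    "Sonnet 4.5": 1024,
    "Opus 4.1": 1024,
    "Haiku 3.5": 2048,
}


def prefix_is_cacheable(model_name: str, prefix_tokens: int) -> bool:
    """True if the prefix meets the model's minimum cacheable length."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_name]
```

Running this check in CI for every model you might fail over to makes a model upgrade that drops below the threshold visible as a test failure.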

4. Best Practices

4.1 Structure Your Prompt for Stability

Place static content at the beginning, dynamic content at the end:

┌─────────────────────────────┐
│  TOOL DEFINITIONS           │  ← Most stable. Rarely changes.
│  (cache_control here)       │
├─────────────────────────────┤
│  SYSTEM PROMPT              │  ← Slow-moving. Instructions, persona.
│  (cache_control here)       │
├─────────────────────────────┤
│  FEW-SHOT EXAMPLES / RAG    │  ← Semi-static. Changes per-session or per-day.
│  (cache_control here)       │
├─────────────────────────────┤
│  USER MESSAGE               │  ← Volatile. Never cached.
└─────────────────────────────┘

Rules:
- Never put timestamps, user IDs, or per-request variables in the cached prefix
- Never interpolate dynamic values like {{CURRENT_TIME}} into the system prompt
- Sort tool definitions deterministically (by name) so JSON ordering is stable
- Use the same separator characters consistently (\n\n vs \n changes the cache key)
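One way to apply the sorting rule is to normalize the tool list in a single place before every request. The sketch below is an assumption about how you might structure that; `stable_tools` is a hypothetical helper, and the tool dictionaries follow the shape used elsewhere in this guide.

```python
import copy


def stable_tools(tools: list[dict]) -> list[dict]:
    """Return copies of the tools sorted by name, breakpoint on the last one."""
    ordered = sorted((copy.deepcopy(t) for t in tools), key=lambda t: t["name"])
    for tool in ordered:
        # Drop any stray breakpoints so exactly one remains.
        tool.pop("cache_control", None)
    if ordered:
        ordered[-1]["cache_control"] = {"type": "ephemeral"}
    return ordered
```

Because the input is copied, callers can register tools in any order without affecting the serialized prefix.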

4.2 What to Cache vs. What Not to Cache

✅ Put in the cached block	❌ Keep out of the cached block
System prompt (instructions, persona)	The user's actual question
Tool definitions (sorted, deterministic)	Per-user IDs, session IDs, request IDs
Knowledge base / RAG context	Today's date or `Date.now()`
Few-shot examples	Per-customer variables (name, account, plan)
Style guides, format specs	Anything that changes between runs

4.3 Place Breakpoints Correctly

The #1 mistake is placing cache_control on a block that changes every request.

❌ WRONG:
  [system prompt] [knowledge base] [timestamp: 2026-05-17T14:32:01Z] ← cache_control here
  → The timestamp changes every call. No cache hit ever occurs.
  → The lookback does NOT find the stable content behind it.

✅ RIGHT:
  [system prompt] [knowledge base] ← cache_control here
  [timestamp: 2026-05-17T14:32:01Z] [user question]
  → Stable prefix cached. Timestamp is outside the cached block.

Why this matters: Cache reads look backward from the breakpoint for entries that prior requests already wrote at their own breakpoints. The lookback does NOT scan for stable content behind the breakpoint and cache it. If you never wrote a cache entry at the position of the stable content, the lookback finds nothing.
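The correct layout can be sketched as a request builder that keeps the breakpoint on the last stable block and pushes the timestamp into the user message. `build_request` is a hypothetical helper; its return value is meant to be merged with `model` and `max_tokens` and passed to `client.messages.create`.

```python
from datetime import datetime, timezone


def build_request(system_prompt, knowledge_base, question, now=None):
    """Build system/messages kwargs with the breakpoint on stable content."""
    now = now or datetime.now(timezone.utc)
    return {
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": knowledge_base,
                # Breakpoint sits on the last block that never changes.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            # Volatile values live after the breakpoint.
            {"role": "user", "content": f"[Time: {now.isoformat()}]\n\n{question}"},
        ],
    }
```

Two calls made at different times produce identical `system` lists, which is the property the cache depends on.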

4.4 The 20-Block Lookback Window

The system checks at most 20 positions per breakpoint when looking for a prior cache write. In a growing conversation:

Turn 1: 10 blocks, breakpoint at block 10 → write
Turn 2: 15 blocks, breakpoint at 15 → lookback finds block 10 → hit
Turn 3: 35 blocks, breakpoint at 35 → lookback checks blocks 35→16 → nothing found (block 10 is outside the window)

Fix: Add a second breakpoint partway through the conversation so there's always a cache write within 20 blocks of the current breakpoint.
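The turn-by-turn behavior above can be reproduced with a small simulation. This is a simplified model of the lookback as described in this section, not the API's actual implementation; `find_cache_hit` is a hypothetical function and block positions are 1-based.

```python
def find_cache_hit(breakpoint: int, written_positions, window: int = 20):
    """Return the nearest prior write within the lookback window, else None.

    The window covers `window` positions ending at the breakpoint,
    e.g. blocks 16..35 for a breakpoint at block 35.
    """
    lowest = breakpoint - window + 1
    candidates = [p for p in written_positions if lowest <= p <= breakpoint]
    return max(candidates, default=None)


turn_2 = find_cache_hit(15, {10})          # block 10 is inside the window
turn_3 = find_cache_hit(35, {10, 15})      # both writes are too far back
fixed = find_cache_hit(35, {10, 15, 25})   # an extra breakpoint at 25 helps
```

Under this model, any gap of more than 20 blocks between consecutive writes produces a miss, which is why the intermediate breakpoint is needed.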

4.5 Multi-Turn Conversations

For multi-turn chat, use automatic caching (top-level cache_control):

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # Automatic caching
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "What's my name?"},  # Cache moves here
    ],
)

The cache point moves forward automatically as the conversation grows. No need to update breakpoints.

4.6 The Two-Segment Trick (Multi-Tenant)

For multi-tenant applications, use multiple independent cache segments:

system = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT,  # ~1200 tokens, identical for ALL tenants
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
    {
        "type": "text",
        "text": tenant_context,  # ~600 tokens, per-tenant
        "cache_control": {"type": "ephemeral"},
    },
]

Why two segments?
- Segment 1 (system prompt) caches globally across all tenants, so its hit rate approaches 100%
- Segment 2 (tenant context) caches per tenant, so changes for tenant A don't invalidate tenant B's cache
- If you combine them into one block, every tenant writes its own copy of the system prompt and nothing is shared across tenants

Order matters: The more-static segment must come first. If per-tenant content comes first, the system prompt's cache key includes it — making the most cacheable segment uncacheable across tenants.

4.7 TTL Strategy

Scenario	Recommended TTL	Why
Active chat (user typing)	5 minutes	Cache stays warm from traffic
Agent loops with pauses	5 minutes	Most tool calls happen within 5 minutes
Batch/eval harnesses	1 hour	Same prefix hit 100+ times across a wave
Daily batch jobs	1 hour	Prevents cold-start between runs
Cross-tenant shared prefix	1 hour	Maximizes reuse window

When mixing TTLs: The longer-TTL block must appear before the shorter-TTL block in the hierarchy:

tools (1h cache) → system (1h cache) → messages (5m cache)
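The ordering constraint is easy to check mechanically before a request goes out. The sketch below assumes blocks are passed in request order (tools, then system, then messages) and that a missing `ttl` means the 5-minute default; `ttl_order_is_valid` is a hypothetical helper.

```python
TTL_SECONDS = {"5m": 300, "1h": 3600}


def ttl_order_is_valid(blocks: list[dict]) -> bool:
    """True if no longer-TTL breakpoint follows a shorter-TTL one."""
    previous = None
    for block in blocks:
        control = block.get("cache_control")
        if control is None:
            continue  # Blocks without a breakpoint don't affect ordering.
        ttl = TTL_SECONDS[control.get("ttl", "5m")]
        if previous is not None and ttl > previous:
            return False
        previous = ttl
    return True
```

Calling it on the concatenation of your tool, system, and message blocks gives a cheap pre-flight check for mixed-TTL requests.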

4.8 Pre-Warming the Cache

Before real traffic, seed the cache with a warmup request:

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1,  # Keep output minimal; only the prefix needs processing
    system=shared_system,  # Your cached prefix (its blocks must carry cache_control)
    messages=[{"role": "user", "content": "warmup"}],
)

Then subsequent real requests hit a warm cache immediately.

4.9 On-Demand Tool Loading (Agent Loops)

The tension: agent loops want many tools (more capability), but caching wants a stable tool catalog (changes invalidate the prefix).

Solution: Keep a small, stable core tool catalog in the cached prefix, and load additional tools on demand:

# Core tools — always present, always cached
core_tools = [read_tool, write_tool, exec_tool, search_tool, status_tool, find_tool]

# When agent needs a specialized tool:
# Call find_tool → returns a tool_reference appended to messages (not tools array)
# Cache on core tools stays warm; specialized tools don't break the prefix

Each on-demand tool adds ~150-300 tokens to the message stream, but saves the cache miss from a 60-tool catalog.

5. Common Pitfalls That Kill ROI

Pitfall 1: Timestamps in the System Prompt

from datetime import datetime

# ❌ WRONG — changes every call, kills the cache
system = f"You are a helpful assistant. Current time: {datetime.now()}"

# ✅ RIGHT — dynamic content goes in the user message
system = "You are a helpful assistant."  # Stable, cached
messages = [{"role": "user", "content": f"[Time: {datetime.now()}] My question..."}]

Pitfall 2: Adding/Removing Tools Mid-Session

Adding a single tool invalidates the entire prefix (tools → system → messages). Use on-demand tool loading or the tool_reference pattern instead.

Pitfall 3: Switching Models Mid-Conversation

Caches are model-specific. A cache built for Sonnet 4.6 is not portable to Opus 4.7 (different tokenizers). If your agent fails over between models, every fallback is a cold start.

Pitfall 4: Schema Churn During Development

Every edit to your JSON output schema or tool definitions invalidates every cache write. Batch schema changes into stabilization sweeps rather than continuous tweaks.

Pitfall 5: Workspace Boundaries

Since February 2026, caches are isolated per workspace, not per organization. Dev and prod workspaces don't share cache even with identical prompts.

Pitfall 6: Parallel Requests Before Cache Is Written

A cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent ones.
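A fan-out helper can enforce this ordering. The sketch below waits for the first request to finish completely, which is more conservative than waiting for the response to begin, and then sends the rest concurrently. `warm_then_fan_out` is a hypothetical helper, and `send` stands in for whatever async function issues your API call.

```python
import asyncio


async def warm_then_fan_out(send, requests):
    """Send one request to write the cache, then run the rest in parallel."""
    if not requests:
        return []
    # First request pays the cache write; nothing else starts until it returns.
    first = await send(requests[0])
    # Remaining requests share the same prefix and can now read the cache.
    rest = await asyncio.gather(*(send(r) for r in requests[1:]))
    return [first, *rest]
```

With streaming responses you could release the remaining requests as soon as the first event arrives, at the cost of a more involved implementation.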

Pitfall 7: Unstable JSON Key Ordering

Languages like Go and Swift may randomize JSON key order during serialization. Use sorted/ordered maps for tool_use blocks to ensure consistent cache keys.

Pitfall 8: Tool Choice Changes

Changing the tool_choice parameter between calls invalidates the message cache even if tools and system prompt are identical.

6. Monitoring Cache Performance

Track these response fields on every call:

usage = response.usage
cache_read = usage.cache_read_input_tokens      # Tokens read from cache (cheapest)
cache_write = usage.cache_creation_input_tokens  # Tokens written to cache (premium)
uncached = usage.input_tokens                    # Tokens after last breakpoint (base price)
total_input = cache_read + cache_write + uncached

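Key metrics:
- Hit rate = cache_read / total_input; target >70% for a healthy cache
- Read-to-write ratio (cache_read / cache_write tokens): must exceed about 0.28 for the 5m TTL and about 1.11 for the 1h TTL to break even, because each written token costs an extra 0.25× or 1.0× base and each read token saves 0.9× base
- If both cache fields are 0, nothing was cached; the most common cause is a prefix below the minimum threshold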

Treat hit rate as a product metric, not a vanity metric. Display it in the same dashboard as latency and error rate. Below 70% deserves investigation.
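The metrics above can be computed per response with a small function. This sketch takes a plain dictionary, for example the result of `response.usage.model_dump()` in the Python SDK; `cache_metrics` is a hypothetical helper and the 70% target is the one suggested in this guide.

```python
def cache_metrics(usage: dict) -> dict:
    """Compute hit rate and read-to-write ratio from usage token counts."""
    # Fields may be missing or None when caching is not in play.
    read = usage.get("cache_read_input_tokens") or 0
    write = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + write + uncached
    return {
        "total_input": total,
        "hit_rate": read / total if total else 0.0,
        # None means no write happened on this call, so the ratio is undefined.
        "read_to_write": read / write if write else None,
        "nothing_cached": read == 0 and write == 0,
    }
```

Per-call ratios are noisy, so aggregating the raw token counts over a time window before dividing gives a more useful dashboard number.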

7. Quick-Start Code Examples

Automatic Caching (Simplest)

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # That's it — automatic caching
    system="You are an expert assistant with deep knowledge of...",
    messages=[
        {"role": "user", "content": "Explain quantum computing"},
    ],
)
print(response.usage.model_dump_json())

Explicit Breakpoints (Production)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # ~2000 tokens
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            "text": knowledge_base,  # ~8000 tokens
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": f"[Context: {user_context}]\n\n{user_question}"},
    ],
)

Caching Tool Definitions

tools = [
    {"name": "read_file", "description": "...", "input_schema": {...}},
    {"name": "search", "description": "...", "input_schema": {...}},
    # Last tool gets the cache breakpoint
    {"name": "execute", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},
]

Counterpoints

Caching can increase costs for low-frequency workloads. If your API calls are infrequent (>5 min apart) or one-shot, the 1.25× write surcharge with zero reads makes caching a net loss. Not every workload benefits — instrument before you ship.
The 5-minute TTL is too short for many real-world patterns. Anthropic shifted the default from 60 minutes to 5 minutes in March 2026, which was a 30-60% cost increase for agent workloads with human-in-the-loop pauses. The 1-hour TTL helps but at 2× write cost, so it needs two reads within the hour to break even.
The prefix-only design limits flexibility. You cannot cache a frequently-used middle section of your prompt if the beginning varies. This forces architectural compromises where all "shared" content must be hoisted to the top of every request.
Multi-tenant applications require careful segmentation. The two-segment trick works but adds complexity. Each additional breakpoint is one more thing that can break, and the 4-breakpoint limit constrains how many independent cache segments you can have.
Cache isolation is per-workspace, not per-org. Teams that share prompts across dev/staging/prod workspaces don't share cache, reducing the effective hit rate in environments with lower traffic.

Sources

Anthropic Official Docs: Prompt Caching — Canonical reference for all specifications
Tanay Shah — What I Learned About Anthropic's Prompt Cache From Running an Agent Loop in Production — Production ROI math, five gotchas, on-demand tool loading pattern
Tariq Osmani — Prompt Caching with Claude: How I Cut Our Automation Bills by 70% — n8n workflow optimization, what to cache vs. not cache
Culprit — Anthropic Prompt Caching Cut Our RCA Cost by 90% — Two-segment trick for multi-tenant caching, production cost numbers
Denis Sergeevitch — Agents Best Practices: Prompt Caching and Cost — Community reference for agent caching strategies