Claude Lab
API & SDK · 2026-06-23 · Advanced

When Claude API Prompt Caching Quietly Stops Hitting in Production — Field Notes on TTL and Measured Savings

Prompt caching works beautifully the day you ship it, then quietly stops hitting in production. The five things that break the prefix, how to choose between 5-minute and 1-hour TTL, and how to measure real savings from usage instead of guessing.



The dashboard on day one of prompt caching usually looks great. cache_read_input_tokens climbs cleanly, input cost drops. The trouble starts the following week. Traffic shifts shape a little, you slip a dynamic value into the system prompt, a deploy reorders your tool definitions — each harmless on its own, yet the cache hit rate slides down quietly. And because few people watch cache_read_input_tokens on every request, the way you find out costs went back up is the end-of-month invoice.

I lean on this feature hard across the automated operation of four technical blogs I run as an indie developer under Dolice Labs, and I have been burned more than once by "I assumed it was working, but only half the requests were actually hitting." This piece is not a feature tour. It is about what to monitor to keep the hit rate up, where to switch TTL, and how to measure the savings instead of assuming them.

Caching only fires on a byte-for-byte identical prefix

The first thing to internalize: prompt caching matches on an exact, byte-for-byte identical prefix, not on semantic similarity. The prefix is assembled from the front in the order tools, then system, then messages, and as long as everything up to the cache breakpoint is character-for-character the same, that span is billed at the cache-read rate (about 10% of normal input). The flip side is unforgiving: change a single character near the front and everything from that character onward is a cache miss, including every breakpoint that comes after it.
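To put rough numbers on the gap between a hit and a miss, here is a minimal sketch of the arithmetic. The function name and the token counts are hypothetical. The 10% read rate comes from the paragraph above, while the 1.25x write premium for a 5-minute cache entry is an assumption on my part that should be checked against current pricing.

```python
def input_cost_units(
    prefix_tokens: int,
    fresh_tokens: int,
    hit: bool,
    read_multiplier: float = 0.10,   # from the text: reads cost about 10% of normal input
    write_multiplier: float = 1.25,  # ASSUMPTION: premium for writing a 5-minute cache entry
) -> float:
    """Input cost of one request, in units of one normally priced input token."""
    # On a hit the prefix is read from cache; on a miss it is written again.
    prefix_rate = read_multiplier if hit else write_multiplier
    return prefix_tokens * prefix_rate + fresh_tokens


# Hypothetical request: a 10,000-token cached prefix plus 200 fresh tokens.
hit_cost = input_cost_units(10_000, 200, hit=True)
miss_cost = input_cost_units(10_000, 200, hit=False)
print(f"hit: {hit_cost:.0f} units, miss: {miss_cost:.0f} units")
```

Under these assumptions a miss costs roughly ten times what a hit does for the same request, which is why a hit rate that slides from most requests to half of them shows up so clearly on the invoice.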

That "prefix from the front" property explains almost every production hit-rate problem. Something that varies per request has crept into the prefix.

from datetime import datetime

# ❌ A classic miss: a dynamic value lands inside the system prompt
system = [
    {
        "type": "text",
        # Embedding the current time means the prefix changes every call — it can never hit
        "text": f"You are a support AI. The current time is {datetime.now()}.\n\n## Knowledge base\n...",
        "cache_control": {"type": "ephemeral"},
    }
]
 
# ✅ Push dynamic values into messages, behind the breakpoint
system = [
    {
        "type": "text",
        "text": "You are a support AI.\n\n## Knowledge base\n...",  # stable part only
        "cache_control": {"type": "ephemeral"},
    }
]
messages = [
    {"role": "user", "content": f"(current time: {datetime.now()}) What's the latest incident status?"},
]

Timestamps are the obvious case, but the real culprits are sneakier: a user name or session ID injected into the system prompt, an A/B test whose branch sits at the front of the prefix, a knowledge base assembled dynamically whose array order isn't stable. None of these read as "this changes" when you skim the code. You only see them once you log.
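One way to make that kind of drift visible in logs is to fingerprint the cacheable part of each request before sending it. The sketch below is only a proxy: it hashes a local JSON serialization of `tools` and `system`, which is not the cache key the API actually uses, and the function name `prefix_fingerprint` is hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(tools: list[dict], system: list[dict]) -> str:
    """Return a short hash of the request parts that are meant to stay stable."""
    # No key sorting and no list reordering here: if the order drifts,
    # the fingerprint should drift with it so the change shows up in logs.
    payload = json.dumps(
        {"tools": tools, "system": system},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical usage: log the fingerprint next to each request ID.
stable = [{"type": "text", "text": "You are a support AI."}]
leaky = [{"type": "text", "text": "You are a support AI. Session: abc123"}]
print(prefix_fingerprint([], stable), prefix_fingerprint([], leaky))
```

If the fingerprint takes many distinct values over an hour of traffic, something per-request is leaking into the prefix. Ideally it changes only when you deploy.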

Five forces that silently break the prefix

Here is a tidy list of the prefix-breakers I have either tripped over myself or found while debugging someone else's bill.

| Force | What happens | How easy to miss |
| --- | --- | --- |
| Dynamic values (time, IDs, random) | Miss every time. Cache never fires | High |
| Tool definition order drift | Building from a dict in a loop reorders and misses | High |
| Non-deterministic knowledge concatenation | set/dict iteration order changes the tail | High |
| TTL expiry (left idle) | Gaps over 5 min force a rewrite each time | Medium |
| Model string or beta header swaps | Cache namespace changes, forcing a rewrite | Medium |
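All five forces leave the same trace in the response's usage block: cache writes keep happening where reads should be. A minimal sketch of classifying each response follows. It assumes the usage block has already been converted to a plain dict carrying the `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` fields. The function names and outcome labels are hypothetical.

```python
def classify_cache_outcome(usage: dict) -> str:
    """Label one response by what happened to its cacheable prefix."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    if read > 0 and written == 0:
        return "hit"       # whole cached prefix was reused
    if read > 0:
        return "partial"   # part reused, part written (e.g. a growing conversation)
    if written > 0:
        return "write"     # nothing reused; the prefix was written again
    return "uncached"      # no cache activity at all


def cache_read_share(usages: list[dict]) -> float:
    """Fraction of all input tokens that were served from cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    fresh = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + fresh
    return read / total if total else 0.0
```

A steady stream of "write" outcomes on an endpoint that should be hitting is the signal to look for. A falling `cache_read_share` over a week is the slower version of the same symptom.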

Order drift is especially nasty. Python dicts preserve insertion order, so problems are rare, but the moment you build a list from a set, or merge several sources into a dict before serializing to JSON, the ordering can shift subtly between runs. A prefix needs more than "same content" — it needs "same order." Wherever you assemble cacheable material, sort it explicitly.

# Anything cacheable — tool defs, knowledge fragments — should be order-stable
def build_cacheable_tools(tool_specs: dict[str, dict]) -> list[dict]:
    # Sort by key so the ordering is identical on every run
    return [tool_specs[name] for name in sorted(tool_specs)]
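The same idea extends to knowledge fragments collected in a set. Python randomizes string hashes per process by default, so the iteration order of a set of strings can differ between runs. The sketch below sorts before joining and serializes tool definitions with sorted keys. Both function names are hypothetical, and whether key order inside a single tool definition affects the cache is an assumption I have not verified. Sorting keys simply removes the question.

```python
import json


def build_knowledge_block(fragments: set[str]) -> str:
    """Join knowledge fragments in a fixed order, regardless of set iteration order."""
    return "\n\n".join(sorted(fragments))


def canonical_tools_json(tool_specs: dict[str, dict]) -> str:
    """Serialize tool definitions so that equal content always yields equal text."""
    ordered = [tool_specs[name] for name in sorted(tool_specs)]
    # sort_keys also fixes the key order inside each definition.
    return json.dumps(ordered, sort_keys=True, ensure_ascii=False)


# The same two tools, registered in a different order, serialize identically.
first = {"search": {"name": "search"}, "fetch": {"name": "fetch"}}
second = {"fetch": {"name": "fetch"}, "search": {"name": "search"}}
print(canonical_tools_json(first) == canonical_tools_json(second))
```

Comparing the serialized text across two deploys in a unit test is a cheap way to catch order drift before it reaches production.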
