When Claude API Prompt Caching Quietly Stops Hitting in Production — Field Notes on TTL and Measured Savings
Prompt caching works beautifully the day you ship it, then quietly stops hitting in production. These notes cover the five things that break the prefix, how to choose between a 5-minute and a 1-hour TTL, and how to measure real savings from usage instead of guessing.
The dashboard on day one of prompt caching usually looks great. cache_read_input_tokens climbs cleanly, input cost drops. The trouble starts the following week. Traffic shifts shape a little, you slip a dynamic value into the system prompt, a deploy reorders your tool definitions — each harmless on its own, yet the cache hit rate slides down quietly. And because few people watch cache_read_input_tokens on every request, the way you find out costs went back up is the end-of-month invoice.
I lean on this feature hard across the automated operation of four technical blogs I run as an indie developer under Dolice Labs, and I have been burned more than once by "I assumed it was working, but only half the requests were actually hitting." This piece is not a feature tour. It is about what to monitor to keep the hit rate up, where to switch TTL, and how to measure the savings instead of assuming them.
Caching only fires on a byte-for-byte identical prefix
The first thing to internalize: prompt caching matches on an exact, byte-for-byte identical prefix, not on semantic similarity. The prefix is assembled in a fixed order (tools, then system, then messages), and as long as everything up to a cache breakpoint is character-for-character the same, that span is billed at the cache-read rate (about 10% of normal input). The flip side is unforgiving: change a single character anywhere before a breakpoint, and that breakpoint and every one after it become cache misses.
That "prefix from the front" property explains almost every production hit-rate problem. Something that varies per request has crept into the prefix.
    from datetime import datetime

    # ❌ A classic miss: a dynamic value lands inside the system prompt
    system = [
        {
            "type": "text",
            # Embedding the current time means the prefix changes every call, so it can never hit
            "text": f"You are a support AI. The current time is {datetime.now()}.\n\n## Knowledge base\n...",
            "cache_control": {"type": "ephemeral"},
        }
    ]

    # ✅ Push dynamic values into messages, behind the breakpoint
    system = [
        {
            "type": "text",
            "text": "You are a support AI.\n\n## Knowledge base\n...",  # stable part only
            "cache_control": {"type": "ephemeral"},
        }
    ]
    messages = [
        {"role": "user", "content": f"(current time: {datetime.now()}) What's the latest incident status?"},
    ]
Timestamps are the obvious case, but the real culprits are sneakier: a user name or session ID injected into the system prompt, an A/B test whose branch sits at the front of the prefix, a knowledge base assembled dynamically whose array order isn't stable. None of these read as "this changes" when you skim the code. You only see them once you log.
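One way to make that logging concrete is to hash the cacheable part of each request and log the fingerprint. The sketch below is an assumption of mine, not something from the API: `prefix_fingerprint` and the logger name are hypothetical, and the hash only approximates the bytes the API actually compares. If the fingerprint changes between two requests that were supposed to share a prefix, something dynamic has crept in.

```python
import hashlib
import json
import logging

logger = logging.getLogger("prompt_cache")  # hypothetical logger name


def prefix_fingerprint(tools: list[dict], system: list[dict]) -> str:
    """Return a short, stable hash of the cacheable part of a request."""
    # Serialize in the order the structures were built (no key sorting),
    # so a reordering shows up as a different fingerprint.
    payload = json.dumps({"tools": tools, "system": system}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def log_prefix(tools: list[dict], system: list[dict]) -> str:
    """Log the fingerprint next to each request so drift is visible in logs."""
    fingerprint = prefix_fingerprint(tools, system)
    logger.info("cache_prefix_fingerprint=%s", fingerprint)
    return fingerprint
```

Grouping logs by fingerprint over a day shows how many distinct prefixes a supposedly stable endpoint is really sending.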
Five forces that silently break the prefix
Here is a tidy list of the prefix-breakers I have either tripped over myself or found while debugging someone else's bill.
| Force | What happens | How easy to miss |
|---|---|---|
| Dynamic values (time, IDs, random) | Miss every time. Cache never fires | High |
| Tool definition order drift | Building from a dict in a loop reorders and misses | High |
| Non-deterministic knowledge concatenation | set/dict iteration order changes the tail | High |
| TTL expiry (left idle) | Gaps over 5 min force a rewrite each time | Medium |
| Model string or beta header swaps | Cache namespace changes, forcing a rewrite | Medium |
Order drift is especially nasty. Python dicts preserve insertion order, so problems are rare, but the moment you build a list from a set, or merge several sources into a dict before serializing to JSON, the ordering can shift subtly between runs. A prefix needs more than "same content" — it needs "same order." Wherever you assemble cacheable material, sort it explicitly.
    # Anything cacheable (tool defs, knowledge fragments) should be order-stable
    def build_cacheable_tools(tool_specs: dict[str, dict]) -> list[dict]:
        # Sort by key so the ordering is identical on every run
        return [tool_specs[name] for name in sorted(tool_specs)]
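The same idea applies to knowledge fragments that get concatenated into the system prompt. A minimal sketch, assuming fragments are keyed by a stable identifier; `build_knowledge_block` and the `###` heading layout are hypothetical choices for illustration:

```python
def build_knowledge_block(fragments: dict[str, str]) -> str:
    """Join knowledge fragments in a fixed order, independent of how the dict was built."""
    # Sorting by key removes any dependence on insertion or merge order.
    parts = [f"### {key}\n{fragments[key].strip()}" for key in sorted(fragments)]
    return "\n\n".join(parts)


# Two dicts with the same content but different insertion order
first = {"billing": "Refunds take 5 days.", "auth": "Tokens expire hourly."}
second = {"auth": "Tokens expire hourly.", "billing": "Refunds take 5 days."}
same_bytes = build_knowledge_block(first) == build_knowledge_block(second)
```

Here `same_bytes` is `True`, which is the property the cache needs: identical content always serializes to identical text.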
The cache expires. The standard ephemeral cache lapses roughly 5 minutes after the last access, and each access within that window resets the clock. So as long as the next request keeps arriving within 5 minutes, the cache stays warm; let the gap stretch and it expires, and the next request pays the write cost again (about 1.25x normal input).
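To see how the sliding expiry plays out for a given traffic pattern, it can be simulated offline. This is a rough sketch under stated assumptions: every request shares one identical prefix, the TTL resets on each access, and a gap exactly equal to the TTL still counts as warm. `simulate_cache` is a hypothetical helper, not part of any SDK.

```python
def simulate_cache(request_times_s: list[float], ttl_s: float = 300.0) -> dict[str, int]:
    """Count cache writes and reads for requests that share one identical prefix."""
    writes = 0
    reads = 0
    last_access = None
    for t in sorted(request_times_s):
        if last_access is not None and t - last_access <= ttl_s:
            reads += 1   # still warm; this access also refreshes the TTL
        else:
            writes += 1  # cold or expired: the prefix is written again
        last_access = t
    return {"writes": writes, "reads": reads}


steady = simulate_cache([0, 240, 480, 720])   # a request every 4 minutes
sparse = simulate_cache([0, 600, 1200])       # a request every 10 minutes
```

With the default 300-second TTL, `steady` comes out as one write and three reads, while `sparse` is three writes and no reads. Feeding real request timestamps from logs into the same function shows how much of the traffic a given TTL can actually serve.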
In June 2026 the Developer Platform added prompt caching of up to one hour, letting you hold a longer TTL. That helps workloads where requests arrive sporadically but reuse the same large prefix. The catch: writes to the 1-hour cache are billed at a higher rate than writes to the 5-minute cache (2x base input rather than 1.25x at the time of writing), so "make everything 1 hour" is the wrong reflex. Decide by request spacing.
| Request spacing | Recommended TTL | Why |
|---|---|---|
| Seconds to minutes (chat, tight loops) | 5 minutes | Each access resets the clock; you pay the write only once |
| 5 minutes to an hour (sporadic API, low-frequency jobs) | 1 hour | A 5-minute cache would expire between requests; 1 hour means fewer rewrites |
| Over an hour (daily batches) | No cache, or batch it | Any TTL expires; resubmit in a tight window instead |
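The table can be backed by simple steady-state arithmetic. The sketch below expresses the cost of the cached prefix as a multiple of the normal input price, for requests arriving at a fixed gap. The 1.25x write and 0.1x read multipliers come from the figures above; the 2x multiplier for 1-hour writes is an assumption to verify against current pricing, and `compare` is a hypothetical helper.

```python
WRITE_5M = 1.25  # 5-minute cache write, relative to normal input
WRITE_1H = 2.0   # ASSUMPTION: 1-hour cache write billed at 2x normal input
READ = 0.10      # cache read, relative to normal input


def steady_state_multiplier(gap_minutes: float, ttl_minutes: float, write_multiplier: float) -> float:
    """Per-request prefix cost once traffic has settled into a fixed gap."""
    # Within the TTL every request after the first is a read;
    # beyond it every request pays the write again.
    return READ if gap_minutes <= ttl_minutes else write_multiplier


def compare(gap_minutes: float) -> dict[str, float]:
    """Prefix cost per request for each option, as a multiple of normal input."""
    return {
        "no_cache": 1.0,
        "ttl_5m": steady_state_multiplier(gap_minutes, 5, WRITE_5M),
        "ttl_1h": steady_state_multiplier(gap_minutes, 60, WRITE_1H),
    }
```

Under these assumptions, `compare(20)` gives 1.25 for the 5-minute cache and 0.1 for the 1-hour cache, while `compare(180)` gives 2.0 for the 1-hour cache against 1.0 for no caching at all, which is why the last row of the table recommends skipping the cache. For gaps under 5 minutes both TTLs settle at 0.1, so the cheaper first write decides in favor of 5 minutes. As far as I know, the longer TTL is requested per breakpoint with `"cache_control": {"type": "ephemeral", "ttl": "1h"}`; check the current API reference before relying on that.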
In my own setup, a chat endpoint that runs all day keeps hitting fine on 5 minutes. A low-frequency job — an article-integrity check that only fires every hour or two — was meaningless on the 5-minute cache because it rewrote from scratch every run. Switching that one to the 1-hour cache visibly cut the rewrites of the same prefix. Splitting TTL by "always on versus occasional" has been the simplest guideline that holds up.
Turning "it should be working" into a measurement
This is the part I most want to land. The dangerous thing about prompt caching is that breaking it throws no exception. A miss still succeeds — it just quietly gets billed at full price. So you have to replace the assumption "it should be working" with a real number from usage, and you have to instrument that from the start.
The response usage carries three figures: cache_creation_input_tokens (writes), cache_read_input_tokens (reads, i.e. hits), and input_tokens (uncached normal input). Record those three on every request and you can compute hit rate and savings accurately after the fact.
    import anthropic

    # Approximate Sonnet 4.6 unit prices (USD / 1M tokens). Keep current in production.
    PRICE_INPUT = 3.00        # normal input
    PRICE_CACHE_WRITE = 3.75  # cache write (1.25x)
    PRICE_CACHE_READ = 0.30   # cache read (0.1x)


    def record_usage(usage) -> dict:
        """Convert one request's usage into actual cost and a no-cache baseline."""
        read = getattr(usage, "cache_read_input_tokens", 0) or 0
        write = getattr(usage, "cache_creation_input_tokens", 0) or 0
        normal = usage.input_tokens

        # What you actually paid
        actual = (
            normal * PRICE_INPUT
            + write * PRICE_CACHE_WRITE
            + read * PRICE_CACHE_READ
        ) / 1_000_000

        # Hypothetical cost with no caching (read/write billed at normal rate too)
        baseline = (normal + write + read) * PRICE_INPUT / 1_000_000

        cached_tokens = read + write
        hit_rate = read / cached_tokens if cached_tokens else 0.0

        return {
            "actual_usd": round(actual, 6),
            "baseline_usd": round(baseline, 6),
            "saved_usd": round(baseline - actual, 6),
            "cache_hit_rate": round(hit_rate, 3),
        }
Run record_usage on every request and stream saved_usd and cache_hit_rate as a time series, and shifts become visible — "hit rate dropped from 0.9 to 0.4 last Tuesday." Adding this is exactly how I discovered that one deploy had reordered my tool-definition assembly and started missing. No exception, no warning; without the instrumentation I wouldn't have known until the invoice.
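For the time-series view, one option is a token-weighted hit rate per day, which avoids letting many tiny requests dominate the average. This is a sketch only: the record shape (`ts`, `read`, `write`) is a hypothetical log format, not something the API returns in that form.

```python
from collections import defaultdict
from datetime import datetime


def daily_hit_rate(records: list[dict]) -> dict[str, float]:
    """Token-weighted cache hit rate per calendar day.

    Each record is assumed to look like
    {"ts": datetime, "read": int, "write": int}.
    """
    totals = defaultdict(lambda: [0, 0])  # day -> [read tokens, read + write tokens]
    for record in records:
        day = record["ts"].date().isoformat()
        totals[day][0] += record["read"]
        totals[day][1] += record["read"] + record["write"]
    return {
        day: round(read / cached, 3) if cached else 0.0
        for day, (read, cached) in sorted(totals.items())
    }


sample = [
    {"ts": datetime(2026, 6, 15, 9, 0), "read": 900, "write": 100},
    {"ts": datetime(2026, 6, 16, 9, 0), "read": 400, "write": 600},
]
by_day = daily_hit_rate(sample)
```

For the made-up sample above, `by_day` is `{"2026-06-15": 0.9, "2026-06-16": 0.4}`, the kind of step change that is easy to spot on a chart and easy to line up against a deploy log.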
Wire a threshold into an alert and operations get noticeably calmer.
    def check_cache_health(metrics: list[dict], min_hit_rate: float = 0.6) -> str | None:
        """Detect a hit-rate drop from recent metrics. Return a reason string if there's a problem."""
        recent = metrics[-50:]
        cached = [m for m in recent if m["cache_hit_rate"] > 0 or m["actual_usd"] > 0]
        if not cached:
            return None

        avg_hit = sum(m["cache_hit_rate"] for m in cached) / len(cached)
        if avg_hit < min_hit_rate:
            return f"Cache hit rate degraded: recent avg {avg_hit:.1%} (threshold {min_hit_rate:.0%})"
        return None
A "normal" hit rate depends on the workload, so the practical move is to learn your system's baseline over the first week or two and then set a loose threshold that fires only on a clear drop. Set it tight from day one and natural traffic jitter will keep it ringing until everyone tunes it out.
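One way to turn that baseline period into a number is to place the threshold a few standard deviations below the observed mean, with a minimum margin so a very steady baseline does not produce a hair-trigger alert. The helper name and the defaults (`k=3`, a 0.1 margin, a 0.1 floor) are assumptions to tune per workload, not recommendations from any documentation.

```python
from statistics import mean, pstdev


def derive_threshold(
    baseline_hit_rates: list[float],
    k: float = 3.0,
    min_margin: float = 0.1,
    floor: float = 0.1,
) -> float:
    """Pick an alert threshold from hit rates observed during a baseline period."""
    if not baseline_hit_rates:
        raise ValueError("need at least one baseline observation")
    mu = mean(baseline_hit_rates)
    sigma = pstdev(baseline_hit_rates)
    # Allow for natural jitter, but always leave at least min_margin of slack.
    drop = max(k * sigma, min_margin)
    return round(max(floor, mu - drop), 3)
```

The result can be passed as `min_hit_rate` to an alert check, so the threshold follows what the system actually did rather than a guess made on day one.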
Place breakpoints along a "staircase of change frequency"
A real lever for lifting the hit rate is ordering breakpoints by how often each part changes. You get up to four breakpoints, but order matters more than count: put the rarely changing things first and the frequently changing things last, in a staircase.
    system = [
        {
            "type": "text",
            "text": BASE_INSTRUCTIONS,  # nearly immutable; touched maybe twice a year
            "cache_control": {"type": "ephemeral"},  # BP1
        },
        {
            "type": "text",
            "text": weekly_knowledge_base,  # updated weekly; changes more than the base
            "cache_control": {"type": "ephemeral"},  # BP2
        },
    ]

    # tools come before system in the cached prefix and are stable week to week,
    # so put a third breakpoint at their tail
    tools[-1]["cache_control"] = {"type": "ephemeral"}  # BP3
This way, updating the weekly knowledge base invalidates BP2 but keeps BP1 (the base instructions) and BP3 (the tools, which sit ahead of system in the prefix) hitting. Collapse everything into one block instead, and a small edit at the tail forces the whole thing to rewrite, taking the stable part down with it. Splitting blocks "by the granularity at which things change" is what keeps caching effective over time.
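The staircase effect can be checked offline by hashing the cumulative prefix at each breakpoint and comparing two versions of a prompt. This is a sketch under the assumption that the segments are listed in prefix order (tools, then base instructions, then the weekly knowledge base); both helper names are hypothetical, and the hashes only model the matching, they are not what the API computes.

```python
import hashlib


def breakpoint_hashes(segments: list[str]) -> list[str]:
    """Hash the cumulative prefix at the end of each segment (one breakpoint per segment)."""
    running = hashlib.sha256()
    hashes = []
    for segment in segments:
        running.update(segment.encode("utf-8"))
        running.update(b"\x00")  # separator so segment boundaries matter
        hashes.append(running.copy().hexdigest())
    return hashes


def surviving_breakpoints(old: list[str], new: list[str]) -> int:
    """Number of leading breakpoints whose cumulative prefix is unchanged."""
    count = 0
    for before, after in zip(breakpoint_hashes(old), breakpoint_hashes(new)):
        if before != after:
            break
        count += 1
    return count


last_week = ["<tool definitions>", "<base instructions>", "<knowledge v1>"]
this_week = ["<tool definitions>", "<base instructions>", "<knowledge v2>"]
kept = surviving_breakpoints(last_week, this_week)
```

Here `kept` is 2: the tools and base-instruction breakpoints still match and only the knowledge segment is rewritten. Changing the first segment instead drops the count to 0, which is the case the staircase ordering is designed to make rare.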
One more note: if you combine a tool-side optimization flag such as Token-Efficient Tool Use, changing the header or beta spec can shift the cache namespace and force a rewrite. Do that kind of flag experimentation outside the hours when your production cache is warm.
A minimal pre-ship check
Finally, the items I verify every time I start using prompt caching on a new endpoint. Are there any dynamic values inside the cached span (system, tools, the front of messages)? Is the assembly of cacheable material order-stable? Does the TTL choice match the request spacing? And are the three usage figures flowing to logs so you can see cache_hit_rate and saved_usd? With those four in place, when the hit rate drops you notice it on a dashboard rather than on an invoice.
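The first item on that list can be partly automated with a crude scan of the cacheable text for values that look dynamic. The patterns below are heuristics I chose for illustration (ISO-style timestamps, UUIDs, 10-digit Unix epochs); they will miss some cases and flag some false positives, so treat a hit as a prompt to look, not as proof.

```python
import re

# Hypothetical heuristics; extend with whatever IDs your own system injects.
DYNAMIC_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "unix_epoch": re.compile(r"\b1[0-9]{9}\b"),
}


def find_dynamic_values(cacheable_text: str) -> list[str]:
    """Return the names of patterns that match somewhere in the cacheable text."""
    return [
        name
        for name, pattern in DYNAMIC_PATTERNS.items()
        if pattern.search(cacheable_text)
    ]
```

Running this over the rendered system prompt and tool descriptions in a unit test gives a cheap guard against a timestamp or session ID slipping into the prefix during a later refactor.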
Prompt caching is a feature where keeping the savings weeks later is far harder than getting them on day one. Build the monitoring in alongside it. Having leaked cost in exactly this spot myself, that is the part I now consider most worth doing. Thank you for reading.