●CONFERENCE — Code w/ Claude, the annual developer conference, kicked off June 22 with keynotes, sessions, and workshops●LIMITS — Claude Code rate limits doubled and Opus API limits rose, making it easier to build reliably at scale●DESIGN — Claude Design updates add design-system alignment, tighter Claude Code sync, and direct canvas editing●SANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers●MODEL — Claude Fable 5 offers a 1M-token context, always-on adaptive thinking, and 128K output●LINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task●CONFERENCE — Code w/ Claude, the annual developer conference, kicked off June 22 with keynotes, sessions, and workshops●LIMITS — Claude Code rate limits doubled and Opus API limits rose, making it easier to build reliably at scale●DESIGN — Claude Design updates add design-system alignment, tighter Claude Code sync, and direct canvas editing●SANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers●MODEL — Claude Fable 5 offers a 1M-token context, always-on adaptive thinking, and 128K output●LINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task
I Edited One Line of a Tool Description and the Whole Prompt Cache Rebuilt — Where to Place cache_control Breakpoints
Hit rate suddenly flatlined at zero because a volatile block sat upstream of stable ones. This walks through how prefix-cache cascade invalidation works, how to reorder blocks from stable to volatile, and where to spend your four cache_control breakpoints — with code and decision tables.
One morning, going over the API costs of the automated publishing pipeline I run across four sites, I stopped. The request volume and input token counts had barely moved, yet cache_read_input_tokens was pinned near zero. The week before, most of the prefix had been served from cache.
Only one change came to mind. I had tweaked a single line in a tool's description — just smoothing out the wording. That one line had dragged the system prompt and the few-shot examples below it into a full cache rebuild. The cost numbers reminded me of something obvious: prompt caching does not work per "block." It works on the contiguous match from the front — the prefix.
This article is about why that cascade invalidation happens, and how ordering your blocks and placing your breakpoints can keep the cache on the stable parts alive even when you edit the volatile ones.
A prefix cache matches "from the front," not per block
This is the easiest part to misread. Marking a block with cache_control does not cache that block in isolation. What gets cached is the entire contiguous prefix from the start of the request up to and including the breakpoint.
Requests are concatenated in a fixed order: tools → system → messages. On every call, matching runs from the front of this concatenated sequence, and the cache serves up to the longest prefix that exactly matches the previous request. One differing character anywhere, and everything from that point onward stops matching and is rewritten (cache creation).
That is exactly what bit me. tools sits at the very top. Editing one line of a tool definition broke the match partway through tools, so the system and messages that followed became "non-matching prefix" too. I had changed nothing downstream, yet the upstream edit took everything down with it. That is cascade invalidation.
The flip side is encouraging: just by putting volatile things downstream and stable things upstream, you can prevent most of this collateral damage.
Read the four numbers in usage correctly
To judge whether a reordering helped, you have to read usage well. These are the four I check every time.
Field
Meaning
Rough billing feel
cache_creation_input_tokens
Tokens newly written to cache
Pricier than base input (about 1.25x for 5-min TTL, ~2x for 1-hour TTL)
cache_read_input_tokens
Tokens served from cache
Cheap — about 0.1x of base input
input_tokens
Uncached raw input
Base price
output_tokens
Generated output
Output price
In a healthy state, the second request onward should show a large cache_read_input_tokens and a small cache_creation_input_tokens (just the changed tail). During my incident it was reversed: cache_creation_input_tokens ballooned every call and cache_read_input_tokens was near zero. I treat "creation keeps showing up every call" as a sign that something upstream changes every call — that is the first thing I suspect.
def cache_health(usage) -> dict: created = usage.cache_creation_input_tokens or 0 read = usage.cache_read_input_tokens or 0 fresh = usage.input_tokens or 0 cacheable = created + read # of the cacheable portion, how much was actually read read_ratio = read / cacheable if cacheable else 0.0 return { "read_ratio": round(read_ratio, 3), "created": created, "read": read, "uncached": fresh, }
If read_ratio stays low even on later calls, it is either TTL expiry (a timing problem) or upstream volatility (an ordering problem). If it improves when you tighten the interval, blame the TTL. If it stays low regardless of interval, suspect the ordering.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦Understand cascade invalidation — why changing one upstream block invalidates every cached block below it — alongside how to read cache_creation_input_tokens vs cache_read_input_tokens in usage
✦Reorder blocks from stable to volatile (tools → system → large constants → dynamic reference → conversation) and take home a decision table for where to spend your four breakpoints
✦Drop in a minimal guard that flags a hit-rate regression (cache_read ratio threshold) so an unattended pipeline notices when an innocent edit breaks caching
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
The whole design comes down to this: the less often something changes, the further upstream it goes. In my pipeline it settled into roughly this order.
Position
Block
Change frequency
1 (top)
tools (tool definitions)
Nearly never (only when adding capability)
2
system (role, output rules, permanent policy)
Low
3
Large constant context (style guide, fixed reference tables)
Low
4
Dynamic reference data (today's news, mutable facts)
High (e.g. daily)
5 (bottom)
Conversation history, this turn's user input
Every call
My original mistake was baking position 4 — the day's dynamic reference data — plus a current timestamp into position 2, the system prompt. The system prompt should be a very stable block, but I was injecting a daily-changing string into it, so everything from system downward died every day.
The fix was simple. I evicted every mutable value from system and passed the dynamic reference data as a late user message instead. Tools and system became fixed strings, with the first breakpoint at the boundary between them. That alone brought read_ratio back from near zero to the 0.8 range.
You only get four breakpoints — where to spend them
cache_control is capped at four per request. Rather than scattering them, place them at the steps in stability — the points where "everything up to here should match the previous request."
Breakpoint
Where
What it protects
1st
End of tools
Tool definitions (the most reused)
2nd
End of system
The prefix including permanent rules
3rd
End of the large constant context
A long prefix that includes the style guide, etc.
4th
End of the "settled" conversation history
History that grows across a multi-turn run
With several breakpoints, Claude automatically reads the longest prefix that matches the previous request. So the earlier breakpoints act as insurance for when a deeper cache expires. If the third (just before the dynamic reference data) is invalidated by today's update, the hits up to the first and second (tools and system) still survive. It is graded protection, not all-or-nothing.
One caveat: there is a minimum cacheable prefix length, and prefixes that are too short are not cached at all (roughly 1,024 tokens for most models, longer for Haiku). When your tool definitions are small, the first breakpoint may not take effect, so confirm whether cache_creation_input_tokens was ever recorded there — that tells you the spot is actually being cached.
Implementation: build messages in "stability layers"
Translating the design into code looks like this. The trick is to push mutable values out to the outermost layer as function arguments and never touch the stable layers.
import anthropicclient = anthropic.Anthropic(api_key="YOUR_API_KEY")# ── Stable layer (do not change daily) ──────────────TOOLS = [ { "name": "publish_article", "description": "Registers an article into the target repository.", # editing this invalidates everything "input_schema": { "type": "object", "properties": {"slug": {"type": "string"}, "locale": {"type": "string"}}, "required": ["slug", "locale"], }, # breakpoint at the end of tools (1st) "cache_control": {"type": "ephemeral"}, }]SYSTEM = [ { "type": "text", "text": "You are a technical blog editor. The output rules and permanent policy are ... (fixed string only)", # breakpoint at the end of system (2nd) "cache_control": {"type": "ephemeral", "ttl": "1h"}, }]def build_messages(daily_reference: str, user_input: str) -> list: # ── Volatile layer (changes every call / every day) — placed last, in user return [ { "role": "user", "content": [ {"type": "text", "text": f"Today's reference data:\n{daily_reference}"}, {"type": "text", "text": user_input}, ], } ]def run(daily_reference: str, user_input: str): resp = client.messages.create( model="claude-sonnet-4-6", max_tokens=2048, tools=TOOLS, system=SYSTEM, messages=build_messages(daily_reference, user_input), ) return resp
The point is to not put daily_reference (the daily-changing string, like the day's news) inside SYSTEM. Keep SYSTEM and TOOLS as fixed strings and route mutable values through the build_messages arguments into the trailing user message only. Then you can swap the reference data every day and the tools and system caches survive.
When reordering still does not help
Sometimes read_ratio will not recover even after you fix the order. Here are the ones I have hit.
First, leaving a timestamp or UUID inside system. A single line like "Last updated: 2026-06-24 19:45" makes system a different object every time. As a rule, evict anything that looks dynamic from system.
Second, rebuilding tool definitions from a dict per request with unstable key order. If the JSON serialization order wobbles, the content is identical but the string differs, and the match breaks. Build tool definitions once as a module-level constant.
Third, switching models. Caches are separate per model, so a pipeline that bounces between claude-sonnet-4-6 and Haiku warms each cache independently. If your design changes models on fallback, accept that read_ratio will dip temporarily.
Make a hit-rate regression something you'll notice
Even after a good reordering, a casual one-line edit later can put you right back into cascade invalidation. Since I run this unattended, I keep a tiny watchdog that logs whenever the hit rate drops below a threshold.
def assert_cache_healthy(usage, *, min_read_ratio=0.5, warmed=True): h = cache_health(usage) # on later (warmed) calls, falling below this ratio suggests an ordering regression if warmed and h["read_ratio"] < min_read_ratio: print( f"⚠️ cache read_ratio={h['read_ratio']} " f"(created={h['created']}, read={h['read']}) — check upstream volatility" ) return False return True
Watching the number lets you catch the moment a "just smoothing the wording" edit shows up in the bill. As an indie developer running this without a babysitter, it took me a few days to notice the first incident — once this watchdog was in place, I could touch the system prompt and tool definitions without holding my breath.
For your next step, log cache_creation_input_tokens and cache_read_input_tokens from one of your own API responses. If creation is still large on the second request, a volatile string is hiding somewhere upstream. Move it to the tail, and the cost numbers answer back honestly.
Share
Thank You for Reading
Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.