API & SDK / 2026-06-24 / Advanced

I Edited One Line of a Tool Description and the Whole Prompt Cache Rebuilt — Where to Place cache_control Breakpoints

Hit rate suddenly flatlined at zero because a volatile block sat upstream of stable ones. This walks through how prefix-cache cascade invalidation works, how to reorder blocks from stable to volatile, and where to spend your four cache_control breakpoints — with code and decision tables.

One morning, going over the API costs of the automated publishing pipeline I run across four sites, I stopped. The request volume and input token counts had barely moved, yet cache_read_input_tokens was pinned near zero. The week before, most of the prefix had been served from cache.

Only one change came to mind. I had tweaked a single line in a tool's description — just smoothing out the wording. That one line had dragged the system prompt and the few-shot examples below it into a full cache rebuild. The cost numbers reminded me of something obvious: prompt caching does not work per "block." It works on the contiguous match from the front — the prefix.

This article is about why that cascade invalidation happens, and how ordering your blocks and placing your breakpoints can keep the cache on the stable parts alive even when you edit the volatile ones.

A prefix cache matches "from the front," not per block

This is the easiest part to misread. Marking a block with cache_control does not cache that block in isolation. What gets cached is the entire contiguous prefix from the start of the request up to and including the breakpoint.

Requests are concatenated in a fixed order: tools → system → messages. On every call, matching runs from the front of this concatenated sequence, and the cache serves up to the longest prefix that exactly matches the previous request. One differing character anywhere, and everything from that point onward stops matching and is rewritten (cache creation).

That is exactly what bit me. tools sits at the very top. Editing one line of a tool definition broke the match partway through tools, so the system and messages that followed became "non-matching prefix" too. I had changed nothing downstream, yet the upstream edit took everything down with it. That is cascade invalidation.
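The cascade is easier to see in a toy model. The sketch below is not how the API is implemented: it assumes a request can be reduced to an ordered list of (label, text) blocks, and the function name split_by_prefix_match is mine. Real matching works on the serialized content rather than on whole blocks, but the outcome for downstream blocks is the same.

```python
def split_by_prefix_match(previous, current):
    """Return (still_matching, invalidated) block labels for `current`.

    Toy model: each request is a list of (label, text) tuples in the
    order tools -> system -> messages. Matching stops at the first
    block that differs; every later block is invalidated with it.
    """
    matching = []
    for i, block in enumerate(current):
        if i < len(previous) and previous[i] == block:
            matching.append(block[0])
        else:
            break
    invalidated = [label for label, _ in current[len(matching):]]
    return matching, invalidated


yesterday = [
    ("tools", "Registers an article."),
    ("system", "You are a technical blog editor."),
    ("messages", "few-shot examples"),
]
# Only the tool description changed; system and messages are untouched.
today = [
    ("tools", "Registers an article into the repository."),
    ("system", "You are a technical blog editor."),
    ("messages", "few-shot examples"),
]

matching, invalidated = split_by_prefix_match(yesterday, today)
print(matching)     # []
print(invalidated)  # ['tools', 'system', 'messages']
```

Moving the edited block to the end of the list instead leaves the first two labels in the matching half, which is the whole argument for ordering by stability.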

The flip side is encouraging: just by putting volatile things downstream and stable things upstream, you can prevent most of this collateral damage.

Read the four numbers in usage correctly

To judge whether a reordering helped, you have to read usage well. These are the four I check every time.

Field	Meaning	Approximate billing
cache_creation_input_tokens	Tokens newly written to cache	Pricier than base input (about 1.25x for 5-min TTL, ~2x for 1-hour TTL)
cache_read_input_tokens	Tokens served from cache	Cheap — about 0.1x of base input
input_tokens	Uncached raw input	Base price
output_tokens	Generated output	Output price

In a healthy state, the second request onward should show a large cache_read_input_tokens and a small cache_creation_input_tokens (just the changed tail). During my incident it was reversed: cache_creation_input_tokens ballooned every call and cache_read_input_tokens was near zero. I treat "creation keeps showing up every call" as a sign that something upstream changes every call — that is the first thing I suspect.
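To put rough numbers on the difference, here is a small arithmetic sketch using the approximate multipliers from the table (1.25x for a 5-minute cache write, 0.1x for a cache read). The function name and the 10,000-token prefix are hypothetical; the result is in "base input token equivalents," not currency.

```python
def input_cost_units(created, read, uncached,
                     write_multiplier=1.25, read_multiplier=0.1):
    """Input-side cost expressed in base-input-token equivalents."""
    return created * write_multiplier + read * read_multiplier + uncached


# Hypothetical request: a 10,000-token stable prefix plus 500 uncached tokens.
healthy = input_cost_units(created=0, read=10_000, uncached=500)
broken = input_cost_units(created=10_000, read=0, uncached=500)

print(f"healthy: about {healthy:,.0f} units")
print(f"broken:  about {broken:,.0f} units")
```

Under these assumed numbers the broken state costs several times more per call on the input side, even though the request itself looks identical.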

def cache_health(usage) -> dict:
    created = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    fresh = usage.input_tokens or 0
    cacheable = created + read
    # of the cacheable portion, how much was actually read
    read_ratio = read / cacheable if cacheable else 0.0
    return {
        "read_ratio": round(read_ratio, 3),
        "created": created,
        "read": read,
        "uncached": fresh,
    }

If read_ratio stays low even on later calls, the cause is either TTL expiry (a timing problem) or upstream volatility (an ordering problem). If it improves when you shorten the interval between calls, blame the TTL. If it stays low regardless of interval, suspect the ordering.
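That rule of thumb can be written down as a rough heuristic. This is a sketch under my own assumptions: samples are (seconds since the previous call, read_ratio) pairs that you log yourself, and the 300-second TTL and 0.5 threshold are placeholder defaults, not values from the API.

```python
def diagnose_low_reads(samples, ttl_seconds=300, min_read_ratio=0.5):
    """Classify a low hit rate as "ordering", "ttl", or "healthy".

    `samples` is a list of (gap_seconds, read_ratio) pairs, where
    gap_seconds is the time since the previous call with the same prefix.
    """
    close = [ratio for gap, ratio in samples if gap < ttl_seconds]
    far = [ratio for gap, ratio in samples if gap >= ttl_seconds]
    if close and sum(close) / len(close) < min_read_ratio:
        # Calls were close together and still missed: the prefix itself changes.
        return "ordering"
    if far and sum(far) / len(far) < min_read_ratio:
        # Only the widely spaced calls miss: the cache expired in between.
        return "ttl"
    return "healthy"


print(diagnose_low_reads([(60, 0.02), (90, 0.0)]))   # ordering
print(diagnose_low_reads([(60, 0.9), (900, 0.0)]))   # ttl
print(diagnose_low_reads([(60, 0.9), (120, 0.85)]))  # healthy
```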

Reorder blocks by stability

The whole design comes down to this: the less often something changes, the further upstream it goes. In my pipeline it settled into roughly this order.

Position	Block	Change frequency
1 (top)	tools (tool definitions)	Nearly never (only when adding capability)
2	system (role, output rules, permanent policy)	Low
3	Large constant context (style guide, fixed reference tables)	Low
4	Dynamic reference data (today's news, mutable facts)	High (e.g. daily)
5 (bottom)	Conversation history, this turn's user input	Every call

My original mistake was baking position 4 — the day's dynamic reference data — plus a current timestamp into position 2, the system prompt. The system prompt should be a very stable block, but I was injecting a daily-changing string into it, so everything from system downward died every day.

The fix was simple. I evicted every mutable value from system and passed the dynamic reference data as a late user message instead. Tools and system became fixed strings, with the first breakpoint at the boundary between them. That alone brought read_ratio back from near zero to the 0.8 range.

You only get four breakpoints — where to spend them

cache_control is capped at four breakpoints per request. Rather than scattering them, place them where stability steps down: the points where "everything up to here should match the previous request."

Breakpoint	Where	What it protects
1st	End of tools	Tool definitions (the most reused)
2nd	End of system	The prefix including permanent rules
3rd	End of the large constant context	A long prefix that includes the style guide, etc.
4th	End of the "settled" conversation history	History that grows across a multi-turn run

With several breakpoints, the API reads from the longest cached prefix that still matches the previous request. So the earlier breakpoints act as insurance for when a deeper one stops matching. If an edit to the style guide invalidates the third (the end of the large constant context), the hits up to the first and second (tools and system) still survive. Today's update to the dynamic reference data sits after the third, so it breaks only the fourth. It is graded protection, not all-or-nothing.
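Because the cap is easy to exceed once markers are spread over tools, system, and messages, a small pre-flight count can help. This is a minimal sketch: count_breakpoints and MAX_BREAKPOINTS are my own names, and it only looks at top-level blocks, not at anything nested inside them.

```python
MAX_BREAKPOINTS = 4  # the per-request cap described above


def count_breakpoints(tools, system, messages):
    """Count cache_control markers on tools, system blocks, and message blocks."""
    count = sum(1 for tool in tools if "cache_control" in tool)
    if not isinstance(system, str):  # a plain-string system prompt carries no marker
        count += sum(1 for block in system if "cache_control" in block)
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain-string content carries no marker either
        count += sum(1 for block in content if "cache_control" in block)
    return count


example_tools = [{"name": "publish_article", "cache_control": {"type": "ephemeral"}}]
example_system = [{"type": "text", "text": "fixed rules",
                   "cache_control": {"type": "ephemeral"}}]
example_messages = [{"role": "user", "content": "hello"}]

used = count_breakpoints(example_tools, example_system, example_messages)
print(used, "of", MAX_BREAKPOINTS)  # 2 of 4
```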

One caveat: there is a minimum cacheable prefix length (roughly 1,024 tokens for most models, longer for Haiku), and prefixes shorter than that are not cached at all. When your tool definitions are small, the first breakpoint may not take effect, so confirm whether cache_creation_input_tokens was ever recorded there. That tells you the spot is actually being cached.

Implementation: build messages in "stability layers"

Translating the design into code looks like this. The trick is to push mutable values out to the outermost layer as function arguments and never touch the stable layers.

import anthropic
 
client = anthropic.Anthropic(api_key="YOUR_API_KEY")
 
# ── Stable layer (do not change daily) ──────────────
TOOLS = [
    {
        "name": "publish_article",
        "description": "Registers an article into the target repository.",  # editing this invalidates everything
        "input_schema": {
            "type": "object",
            "properties": {"slug": {"type": "string"}, "locale": {"type": "string"}},
            "required": ["slug", "locale"],
        },
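        # breakpoint at the end of tools (1st); uses the same 1-hour TTL as system,
        # because a longer-TTL breakpoint may not come after a shorter-TTL one
        "cache_control": {"type": "ephemeral", "ttl": "1h"},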
    }
]
 
SYSTEM = [
    {
        "type": "text",
        "text": "You are a technical blog editor. The output rules and permanent policy are ... (fixed string only)",
        # breakpoint at the end of system (2nd)
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }
]
 
 
def build_messages(daily_reference: str, user_input: str) -> list:
    # ── Volatile layer (changes every call / every day) — placed last, in user
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Today's reference data:\n{daily_reference}"},
                {"type": "text", "text": user_input},
            ],
        }
    ]
 
 
def run(daily_reference: str, user_input: str):
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=TOOLS,
        system=SYSTEM,
        messages=build_messages(daily_reference, user_input),
    )
    return resp

The point is to not put daily_reference (the daily-changing string, like the day's news) inside SYSTEM. Keep SYSTEM and TOOLS as fixed strings and route mutable values through the build_messages arguments into the trailing user message only. Then you can swap the reference data every day and the tools and system caches survive.
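The code above covers a single-turn call, so the fourth breakpoint from the table (the end of the settled conversation history) never appears. One way to add it in a multi-turn run is sketched below. The helper name with_history_breakpoint is hypothetical, and it assumes every message's content is a list of blocks rather than a plain string.

```python
import copy


def with_history_breakpoint(settled_history, current_turn):
    """Return messages with one cache_control marker on the last settled block.

    `settled_history` holds the earlier messages and `current_turn` is the
    new user message. Neither input is mutated.
    """
    messages = copy.deepcopy(settled_history)
    # Drop markers left over from earlier turns so only one is spent on history.
    for message in messages:
        for block in message["content"]:
            block.pop("cache_control", None)
    if messages:
        messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    messages.append(copy.deepcopy(current_turn))
    return messages


history = [
    {"role": "user", "content": [{"type": "text", "text": "Draft the intro."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Here is a draft."}]},
]
turn = {"role": "user", "content": [{"type": "text", "text": "Shorten it."}]}

messages = with_history_breakpoint(history, turn)
print(messages[1]["content"][-1]["cache_control"])  # {'type': 'ephemeral'}
```

The marker moves forward by one exchange on every call, so each request can read the history that the previous request wrote.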

When reordering still does not help

Sometimes read_ratio will not recover even after you fix the order. Here are the causes I have hit.

First, leaving a timestamp or UUID inside system. A single line like "Last updated: 2026-06-24 19:45" makes system a different object every time. As a rule, evict anything that looks dynamic from system.

Second, rebuilding tool definitions from a dict per request with unstable key order. If the JSON serialization order wobbles, the content is identical but the string differs, and the match breaks. Build tool definitions once as a module-level constant.

Third, switching models. Caches are separate per model, so a pipeline that bounces between claude-sonnet-4-6 and Haiku warms each cache independently. If your design changes models on fallback, accept that read_ratio will dip temporarily.
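The first two causes can be caught before a request is ever sent. The sketch below is one possible check, not a complete one: the regular expressions only cover ISO-style dates, clock times, and UUIDs, and both function names are my own. The fingerprint deliberately keeps dict insertion order, on the assumption that the serialized request follows it, so a wobble in key order changes the hash just as a wording edit does.

```python
import hashlib
import json
import re

VOLATILE_PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_volatile_strings(text):
    """Return the sorted names of volatile-looking patterns found in `text`."""
    return sorted(
        name for name, pattern in VOLATILE_PATTERNS.items() if pattern.search(text)
    )


def prefix_fingerprint(tools, system):
    """Hash tools and system exactly as built, preserving key order."""
    serialized = json.dumps({"tools": tools, "system": system}, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


print(find_volatile_strings("Last updated: 2026-06-24 19:45"))    # ['date', 'time']
print(find_volatile_strings("You are a technical blog editor."))  # []

tool_a = {"name": "publish_article", "description": "Registers an article."}
tool_b = {"description": "Registers an article.", "name": "publish_article"}
same_content_different_order = (
    prefix_fingerprint([tool_a], []) != prefix_fingerprint([tool_b], [])
)
print(same_content_different_order)  # True
```

Logging the fingerprint next to usage on every call makes it easy to line up a drop in read_ratio with the exact request where the stable prefix changed.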

Make a hit-rate regression something you'll notice

Even after a good reordering, a casual one-line edit later can put you right back into cascade invalidation. Since I run this unattended, I keep a tiny watchdog that logs whenever the hit rate drops below a threshold.

def assert_cache_healthy(usage, *, min_read_ratio=0.5, warmed=True):
    h = cache_health(usage)
    # on later (warmed) calls, falling below this ratio suggests an ordering regression
    if warmed and h["read_ratio"] < min_read_ratio:
        print(
            f"⚠️ cache read_ratio={h['read_ratio']} "
            f"(created={h['created']}, read={h['read']}) — check upstream volatility"
        )
        return False
    return True

Watching the number lets you catch the moment a "just smoothing the wording" edit shows up in the bill. As an indie developer running this without a babysitter, it took me a few days to notice the first incident — once this watchdog was in place, I could touch the system prompt and tool definitions without holding my breath.
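Since caches are separate per model, the warmed flag passed to assert_cache_healthy has to be tracked per model as well, or the first call after a fallback will raise a false alarm. A minimal sketch of that bookkeeping follows; WarmTracker is a hypothetical helper, and it ignores TTL expiry, so a long idle gap can still produce one spurious warning.

```python
class WarmTracker:
    """Remember which models have already made at least one call."""

    def __init__(self):
        self._seen = set()

    def is_warmed(self, model):
        """Return True if `model` was seen before, then record it as seen."""
        warmed = model in self._seen
        self._seen.add(model)
        return warmed


tracker = WarmTracker()
first = tracker.is_warmed("claude-sonnet-4-6")   # first call for this model
second = tracker.is_warmed("claude-sonnet-4-6")  # later call, cache should be warm
print(first, second)  # False True
```

In the pipeline this would be wired in as assert_cache_healthy(resp.usage, warmed=tracker.is_warmed(model)), with one tracker shared across calls.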

For your next step, log cache_creation_input_tokens and cache_read_input_tokens from one of your own API responses. If creation is still large on the second request, a volatile string is hiding somewhere upstream. Move it to the tail, and the cost numbers answer back honestly.
