CLAUDE LABJP
CONFERENCE — Code w/ Claude, the annual developer conference, kicked off June 22 with keynotes, sessions, and workshopsLIMITS — Claude Code rate limits doubled and Opus API limits rose, making it easier to build reliably at scaleDESIGN — Claude Design updates add design-system alignment, tighter Claude Code sync, and direct canvas editingSANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP serversMODEL — Claude Fable 5 offers a 1M-token context, always-on adaptive thinking, and 128K outputLINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per taskCONFERENCE — Code w/ Claude, the annual developer conference, kicked off June 22 with keynotes, sessions, and workshopsLIMITS — Claude Code rate limits doubled and Opus API limits rose, making it easier to build reliably at scaleDESIGN — Claude Design updates add design-system alignment, tighter Claude Code sync, and direct canvas editingSANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP serversMODEL — Claude Fable 5 offers a 1M-token context, always-on adaptive thinking, and 128K outputLINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task
Articles/API & SDK
API & SDK/2026-06-24Advanced

I Edited One Line of a Tool Description and the Whole Prompt Cache Rebuilt — Where to Place cache_control Breakpoints

Hit rate suddenly flatlined at zero because a volatile block sat upstream of stable ones. This walks through how prefix-cache cascade invalidation works, how to reorder blocks from stable to volatile, and where to spend your four cache_control breakpoints — with code and decision tables.

claude-api71prompt-caching10cost-optimization23context-management4performance4

Premium Article

One morning, going over the API costs of the automated publishing pipeline I run across four sites, I stopped. The request volume and input token counts had barely moved, yet cache_read_input_tokens was pinned near zero. The week before, most of the prefix had been served from cache.

Only one change came to mind. I had tweaked a single line in a tool's description — just smoothing out the wording. That one line had dragged the system prompt and the few-shot examples below it into a full cache rebuild. The cost numbers reminded me of something obvious: prompt caching does not work per "block." It works on the contiguous match from the front — the prefix.

This article is about why that cascade invalidation happens, and how ordering your blocks and placing your breakpoints can keep the cache on the stable parts alive even when you edit the volatile ones.

A prefix cache matches "from the front," not per block

This is the easiest part to misread. Marking a block with cache_control does not cache that block in isolation. What gets cached is the entire contiguous prefix from the start of the request up to and including the breakpoint.

Requests are concatenated in a fixed order: toolssystemmessages. On every call, matching runs from the front of this concatenated sequence, and the cache serves up to the longest prefix that exactly matches the previous request. One differing character anywhere, and everything from that point onward stops matching and is rewritten (cache creation).

That is exactly what bit me. tools sits at the very top. Editing one line of a tool definition broke the match partway through tools, so the system and messages that followed became "non-matching prefix" too. I had changed nothing downstream, yet the upstream edit took everything down with it. That is cascade invalidation.

The flip side is encouraging: just by putting volatile things downstream and stable things upstream, you can prevent most of this collateral damage.

Read the four numbers in usage correctly

To judge whether a reordering helped, you have to read usage well. These are the four I check every time.

FieldMeaningRough billing feel
cache_creation_input_tokensTokens newly written to cachePricier than base input (about 1.25x for 5-min TTL, ~2x for 1-hour TTL)
cache_read_input_tokensTokens served from cacheCheap — about 0.1x of base input
input_tokensUncached raw inputBase price
output_tokensGenerated outputOutput price

In a healthy state, the second request onward should show a large cache_read_input_tokens and a small cache_creation_input_tokens (just the changed tail). During my incident it was reversed: cache_creation_input_tokens ballooned every call and cache_read_input_tokens was near zero. I treat "creation keeps showing up every call" as a sign that something upstream changes every call — that is the first thing I suspect.

def cache_health(usage) -> dict:
    created = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    fresh = usage.input_tokens or 0
    cacheable = created + read
    # of the cacheable portion, how much was actually read
    read_ratio = read / cacheable if cacheable else 0.0
    return {
        "read_ratio": round(read_ratio, 3),
        "created": created,
        "read": read,
        "uncached": fresh,
    }

If read_ratio stays low even on later calls, it is either TTL expiry (a timing problem) or upstream volatility (an ordering problem). If it improves when you tighten the interval, blame the TTL. If it stays low regardless of interval, suspect the ordering.

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN
Understand cascade invalidation — why changing one upstream block invalidates every cached block below it — alongside how to read cache_creation_input_tokens vs cache_read_input_tokens in usage
Reorder blocks from stable to volatile (tools → system → large constants → dynamic reference → conversation) and take home a decision table for where to spend your four breakpoints
Drop in a minimal guard that flags a hit-rate regression (cache_read ratio threshold) so an unattended pipeline notices when an innocent edit breaks caching
Secure payment via Stripe · Cancel anytime

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

or
Unlock all articles with Membership →
Share

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.

  • Copy-paste ready implementation code
  • New advanced guides published daily
  • $5/mo or $10 for lifetime access
View Membership →

Related Articles

API & SDK2026-06-24
My Morning Batch Was Missing the Prompt Cache Every Time — Warming Cadence and the Break-Even Math for the 1-Hour TTL
Jobs that run a few hours apart cold-miss the prompt cache even with a 1-hour TTL. Here is how to back out the right warming interval from the TTL, and how to write the break-even formula that decides whether warming pays off — with numbers from a four-site daily generation pipeline.
API & SDK2026-06-21
Don't Carry Search Results Twice: Trimming Consumed Blocks with response_inclusion
When an agent runs dynamic filtering, output tokens balloon because the raw search-result blocks a code execution call already consumed get echoed back into the response. Here is when response_inclusion: excluded is safe to use, when you must keep full, with implementation and a decision table.
API & SDK2026-05-29
Splitting Claude API prompt cache into 5m and 1h tiers — separate TTLs cut cost and stabilize ops
Anthropic's cache_control supports two TTLs: 5 minutes and 1 hour. Splitting them into a two-tier layout — 1h for static system/tools, 5m for variable few-shot — meaningfully changed both my costs and my on-call life. Here's the design with the numbers I observed.
📚RECOMMENDED BOOKS
Build a Large Language Model (From Scratch)
Sebastian Raschka
LLM Dev
Prompt Engineering for LLMs
Berryman & Ziegler
Prompting
AI Engineering
Chip Huyen
AI Eng
* Contains affiliate links
See all →