CLAUDE LAB
API & SDK · 2026-07-02 · Advanced

Your Cache Hit Rate Resets to Zero the Morning You Switch Models — Prompt Cache Rewarm Design for the Opus 4.8 to Sonnet 5 Cutover

Prompt caches are scoped per model, so day one of a model migration starts at a 0% hit rate. Why percentage-based rollouts break cache economics twice over, and how cohort cutover by task family preserves them — with working measurement code.

Tags: prompt-caching, claude-sonnet-5, claude-opus-4-8, model-migration, cost-optimization


On the morning of July 1st, I read the announcement of Claude Sonnet 5's introductory pricing — $2 per million input tokens, $10 per million output — and started planning a gradual move of my nightly batch jobs from Opus 4.8.

As an indie developer, I run scheduled pipelines for several sites, all built on prompt caching. I pointed a single job family at Sonnet 5 as a trial. That night's logs showed cache_read_input_tokens collapsing to zero, replaced by cache_creation_input_tokens on every single run. On the very evening my unit price supposedly dropped by 60%, that family's bill went up. That inversion on day one is where this article starts.

Prompt caches live in separate worlds per model. If your migration plan does not account for that, the switch you made to save money will cost you more for a while. Let me walk through it.

What happens on cutover morning — caches are scoped per model

Anthropic's prompt caching keys cached prefixes to the model. A prefix warmed on claude-opus-4-8 is invisible to requests hitting claude-sonnet-5. A hit requires the organization, the model, and the prefix content to all match.
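
You can see the isolation directly in the usage block of each response. Here is a minimal sketch of a per-model hit rate, assuming you log the model name next to the three token counters; RunRecord, hitRateByModel, and the sample numbers are hypothetical names and values for illustration, not part of any SDK.

```typescript
// Cache-related fields of the `usage` block returned with each response.
interface CacheUsage {
  input_tokens: number;                // input tokens not covered by the cache
  cache_read_input_tokens: number;     // prefix tokens served from an existing cache entry
  cache_creation_input_tokens: number; // prefix tokens written as a new cache entry
}

// Hypothetical log record: one per run, as a nightly job might store it.
interface RunRecord {
  model: string;
  usage: CacheUsage;
}

// Token-weighted hit rate per model: read / (read + written).
function hitRateByModel(records: RunRecord[]): Record<string, number> {
  const read: Record<string, number> = {};
  const written: Record<string, number> = {};
  for (const r of records) {
    read[r.model] = (read[r.model] ?? 0) + r.usage.cache_read_input_tokens;
    written[r.model] = (written[r.model] ?? 0) + r.usage.cache_creation_input_tokens;
  }
  const rates: Record<string, number> = {};
  for (const model of Object.keys(read)) {
    const cacheable = read[model] + written[model];
    rates[model] = cacheable === 0 ? 0 : read[model] / cacheable;
  }
  return rates;
}

// Illustrative numbers only: the same 8,000-token prefix before and after the switch.
const sampleRuns: RunRecord[] = [
  { model: "claude-opus-4-8", usage: { input_tokens: 120, cache_read_input_tokens: 8000, cache_creation_input_tokens: 0 } },
  { model: "claude-sonnet-5", usage: { input_tokens: 120, cache_read_input_tokens: 0, cache_creation_input_tokens: 8000 } },
];

const sampleRates = hitRateByModel(sampleRuns);
console.log(sampleRates);
```

Grouping by model is the point of the sketch: an aggregate hit rate across both models hides the fact that the new model starts from nothing.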

It helps to recall the pricing structure. A cache write costs 1.25x the base input rate (for the 5-minute TTL; the 1-hour TTL costs 2x), and a cache read costs 0.1x. So on cutover day, every prefix in every family flips from "read at 0.1x" to "rewritten at 1.25x." Looking at the cached portion alone, that is a 12.5x jump (1.25 / 0.1) over a warm read at the same base rate.

Concretely, for a family sharing an 8,000-token prefix on Sonnet 5's introductory pricing:

  • Warm run — reads 8,000 tokens at $0.2/MTok, about $0.0016
  • Cold run — writes 8,000 tokens at $2.5/MTok, about $0.02
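
The arithmetic behind those two bullets can be checked with a few lines. This is a sketch using only the multipliers and the $2/MTok introductory input price quoted above; cachedPortionCost and the constant names are my own labels for illustration.

```typescript
// Multipliers relative to the base input rate, as described in the text.
const CACHE_WRITE_5M = 1.25; // write with the 5-minute TTL
const CACHE_READ = 0.1;      // read of an existing entry

// USD cost of pushing `tokens` through the cache at a given multiplier.
function cachedPortionCost(tokens: number, baseInputPerMTok: number, multiplier: number): number {
  return (tokens / 1e6) * baseInputPerMTok * multiplier;
}

const SONNET_5_INTRO_INPUT = 2; // USD per million input tokens (introductory price)
const PREFIX_TOKENS = 8000;

const warmCost = cachedPortionCost(PREFIX_TOKENS, SONNET_5_INTRO_INPUT, CACHE_READ);
const coldCost = cachedPortionCost(PREFIX_TOKENS, SONNET_5_INTRO_INPUT, CACHE_WRITE_5M);
const coldToWarmRatio = coldCost / warmCost;

console.log(warmCost.toFixed(4), coldCost.toFixed(4), coldToWarmRatio.toFixed(1));
// prints: 0.0016 0.0200 12.5
```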

Per run the difference looks tiny. With ten families and hundreds of daily runs, though, your migration strategy decides whether that difference lasts one morning or several weeks. A cold write should, in principle, happen once per family. How many times you actually pay it is a property of the rollout plan.

Why percentage-based rollouts break cache economics twice

The standard playbook for model migration is a request-level percentage split: 10% first, then 50%, then 100%. As quality risk management, that is reasonable. From a prompt caching perspective, it is the worst possible partition. It breaks things in two distinct ways.

First, double cold writes. With a percentage split, every prefix flows to both models. Twelve prefix families means twenty-four cold writes, not twelve — and every TTL expiry triggers rewarming on both sides, indefinitely.

Second, and more damaging: TTL starvation. The 5-minute TTL is refreshed on each use, but only on the model that served the request. Consider a monitoring task that fires every 4 minutes. At 100% on one model, every run after the first lands inside the TTL and hits. Split it 50/50 with strict alternation and each model sees the task only every 8 minutes, past the 5-minute TTL every time, so every run on both sides becomes a cold write. With random per-request routing the outcome is less extreme but still severe: a run hits only when the previous run happened to land on the same model, so the expected hit rate is about 50% at an even split, and the minority side of a 10/90 split hits only about 10% of the time. Either way, a family that sat near a 100% hit rate loses a large share of its hits the moment you split it.
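
To put numbers on the starvation effect, here is a small simulation sketch. It assumes evenly spaced runs, a cache entry whose TTL restarts on every hit or write, and no latency or other traffic; all names in it are hypothetical. The 8-minute gap corresponds to strict alternation between the two models, so the sketch also includes a seeded random split for comparison, where a run hits only if the previous run landed on the same model.

```typescript
type ModelIndex = 0 | 1;
type Router = (runIndex: number) => ModelIndex;

interface SplitResult {
  overall: number;            // hits / runs across both models
  perModel: [number, number]; // hit rate seen by each model (0 if it served nothing)
}

// Replays evenly spaced runs of one prefix family through a two-model router.
function simulateSplit(runs: number, intervalMin: number, ttlMin: number, route: Router): SplitResult {
  const lastUse: Array<number | null> = [null, null];
  const hits = [0, 0];
  const served = [0, 0];
  for (let i = 0; i < runs; i++) {
    const now = i * intervalMin;
    const m = route(i);
    const prev = lastUse[m];
    if (prev !== null && now - prev <= ttlMin) hits[m]++;
    served[m]++;
    lastUse[m] = now; // a hit refreshes the TTL, a miss writes a fresh entry
  }
  const rate = (m: number) => (served[m] === 0 ? 0 : hits[m] / served[m]);
  return { overall: (hits[0] + hits[1]) / runs, perModel: [rate(0), rate(1)] };
}

// Seeded linear congruential generator so the random split is reproducible.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

const singleModel: Router = () => 0;
const alternate: Router = (i) => (i % 2 === 0 ? 0 : 1);
function randomSplit(shareToNew: number, seed: number): Router {
  const rand = seededRandom(seed);
  return () => (rand() < shareToNew ? 1 : 0);
}

const RUNS = 20000;
const EVERY_MIN = 4;
const TTL_MIN = 5;
const pinned = simulateSplit(RUNS, EVERY_MIN, TTL_MIN, singleModel);
const strict = simulateSplit(RUNS, EVERY_MIN, TTL_MIN, alternate);
const random5050 = simulateSplit(RUNS, EVERY_MIN, TTL_MIN, randomSplit(0.5, 42));
const random1090 = simulateSplit(RUNS, EVERY_MIN, TTL_MIN, randomSplit(0.1, 42));

console.log(pinned.overall, strict.overall, random5050.overall, random1090.perModel[1]);
```

Pinned to one model the family stays essentially fully warm, strict alternation never hits, and the random splits land near one half overall and near one tenth on the minority side. The exact random figures depend on the seed, so treat them as approximate.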

Here is the comparison in one view.

| Migration strategy | Cold writes | Hit rate impact | Quality validation |
| --- | --- | --- | --- |
| Big-bang switch | One per family | Brief dip right after cutover | Bets every family at once |
| Percentage split (per request) | Every family times both models, recurring on TTL expiry | Sparse families fall toward 0% on both sides | Finely controllable |
| Cohort cutover (per task family) | One per migrated family | Preserved for both migrated and pending families | Staged, per family |

Cohort cutover keeps the one thing worth keeping from percentage rollouts — staged validation — while avoiding both failure modes.
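
In code, cohort cutover comes down to routing by task family instead of by request. The following is a minimal sketch under my own assumptions, not a definitive implementation: CutoverPlan, modelForFamily, promoteFamily, and the family names are all hypothetical.

```typescript
interface CutoverPlan {
  oldModel: string;
  newModel: string;
  migratedFamilies: readonly string[]; // families already cut over, in order
}

// Every request in a family goes to exactly one model, so that model sees the
// family's full cadence and its TTL keeps being refreshed.
function modelForFamily(plan: CutoverPlan, family: string): string {
  return plan.migratedFamilies.indexOf(family) !== -1 ? plan.newModel : plan.oldModel;
}

// Moves one more family to the new model; returns a new plan and leaves the input untouched.
function promoteFamily(plan: CutoverPlan, family: string): CutoverPlan {
  if (plan.migratedFamilies.indexOf(family) !== -1) return plan;
  return { ...plan, migratedFamilies: [...plan.migratedFamilies, family] };
}

// Family names here are placeholders.
const day1: CutoverPlan = {
  oldModel: "claude-opus-4-8",
  newModel: "claude-sonnet-5",
  migratedFamilies: ["uptime-monitor"],
};
const day2 = promoteFamily(day1, "nightly-digest");

console.log(modelForFamily(day1, "nightly-digest")); // claude-opus-4-8
console.log(modelForFamily(day2, "nightly-digest")); // claude-sonnet-5
```

Because a family never straddles two models, promoting it costs one cold write on the new model, and families that have not moved yet keep their warm cache on the old one. Validation stays staged: promote a family, check its output quality and hit rate, then promote the next.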
