API & SDK / 2026-07-02 / Advanced

Your Cache Hit Rate Resets to Zero the Morning You Switch Models — Prompt Cache Rewarm Design for the Opus 4.8 to Sonnet 5 Cutover

Prompt caches are scoped per model, so day one of a model migration starts at a 0% hit rate. Why percentage-based rollouts break cache economics twice over, and how cohort cutover by task family preserves them — with working measurement code.

On the morning of July 1st, I read the announcement of Claude Sonnet 5's introductory pricing — $2 per million input tokens, $10 per million output — and started planning a gradual move of my nightly batch jobs from Opus 4.8.

As an indie developer, I run scheduled pipelines for several sites, all built on prompt caching. I pointed a single job family at Sonnet 5 as a trial. That night's logs showed cache_read_input_tokens collapsing to zero, replaced by cache_creation_input_tokens on every single run. On the very evening my unit price supposedly dropped by 60%, that family's bill went up. That inversion on day one is where this article starts.

Prompt caches live in separate worlds per model. If your migration plan does not account for that, the switch you made to save money will cost you more for a while. Let me walk through it.

What happens on cutover morning — caches are scoped per model

Anthropic's prompt caching keys cached prefixes to the model. A prefix warmed on claude-opus-4-8 is invisible to requests hitting claude-sonnet-5. A hit requires the organization, the model, and the prefix content to all match.

It helps to recall the pricing structure. A cache write costs 1.25x the base input rate (for the 5-minute TTL; the 1-hour TTL costs 2x), and a cache read costs 0.1x. So on cutover day, every prefix in every family flips from "read at 0.1x" to "rewritten at 1.25x." Looking at the cached portion alone, your effective unit price jumps 12.5x at that moment.

Concretely, for a family sharing an 8,000-token prefix on Sonnet 5's introductory pricing:

Warm run — reads 8,000 tokens at $0.2/MTok, about $0.0016
Cold run — writes 8,000 tokens at $2.5/MTok, about $0.02
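The two figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the multipliers and the $2 introductory input rate quoted in this article; the helper name `prefixCost` is hypothetical.

```typescript
// Multipliers quoted in the text, expressed relative to the base input price.
const CACHE_READ_MULT = 0.1; // cache read
const CACHE_WRITE_5M_MULT = 1.25; // cache write with the 5-minute TTL

// Dollar cost of processing `prefixTokens` of shared prefix once.
// `prefixCost` is an illustrative helper, not part of any SDK.
function prefixCost(
  prefixTokens: number,
  baseInPerMTok: number,
  multiplier: number,
): number {
  return (prefixTokens / 1e6) * baseInPerMTok * multiplier;
}

const warm = prefixCost(8000, 2, CACHE_READ_MULT); // read on a warm cache
const cold = prefixCost(8000, 2, CACHE_WRITE_5M_MULT); // rewrite on a cold cache

console.log(
  `warm $${warm.toFixed(4)}, cold $${cold.toFixed(4)}, ratio ${(cold / warm).toFixed(1)}x`,
);
```

The ratio between the two is the same 12.5x jump described earlier, independent of prefix size or base price.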

Per run the difference looks tiny. With ten families and hundreds of daily runs, though, your migration strategy decides whether that difference lasts one morning or several weeks. A cold write should, in principle, happen once per family. How many times you actually pay it is a property of the rollout plan.

Why percentage-based rollouts break cache economics twice

The standard playbook for model migration is a request-level percentage split: 10% first, then 50%, then 100%. As quality risk management, that is reasonable. From a prompt caching perspective, it is the worst possible partition. It breaks things in two distinct ways.

First, double cold writes. With a percentage split, every prefix flows to both models. Ten prefix families means twenty cold writes, not ten, and every TTL expiry triggers rewarming on both sides for as long as the split lasts.

Second, and more damaging: TTL starvation. The 5-minute TTL is refreshed on each use, but only on the model that served the request. Consider a monitoring task that fires every 4 minutes. At 100% on one model, every run after the first lands inside the TTL and hits. Split it 50/50 and the average inter-arrival gap seen by each model stretches to 8 minutes, past the 5-minute TTL. If the router strictly alternates, every gap is exactly 8 minutes and every run on both sides becomes a cold write: a family that sat near a 100% hit rate drops to roughly 0% on both models the moment you split it. With random routing, a run hits only when the previous run happened to land on the same model, so the hit rate falls to roughly half rather than zero. Either way, the split destroys hits that a single home model would have kept.
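TTL starvation is easy to check with a toy simulation. The sketch below assumes evenly spaced runs and a TTL that is refreshed by every served run on the model that served it; `simulateHitRate` and the `route` callback are hypothetical names introduced for illustration.

```typescript
// Hit rate of one prefix family when its runs are routed across models.
// A run hits if the same model served this prefix within the last `ttlMin`
// minutes; every served run (hit or miss) refreshes that model's TTL.
function simulateHitRate(
  runs: number,
  intervalMin: number,
  ttlMin: number,
  route: (runIndex: number) => string, // hypothetical router: model id per run
): number {
  const lastServed = new Map<string, number>(); // model -> minute of its last run
  let hits = 0;
  for (let i = 0; i < runs; i++) {
    const t = i * intervalMin;
    const model = route(i);
    const prev = lastServed.get(model);
    if (prev !== undefined && t - prev <= ttlMin) hits++;
    lastServed.set(model, t);
  }
  return hits / runs;
}

// A task firing every 4 minutes against a 5-minute TTL.
const single = simulateHitRate(1000, 4, 5, () => "claude-opus-4-8");
const alternating = simulateHitRate(1000, 4, 5, (i) =>
  i % 2 === 0 ? "claude-opus-4-8" : "claude-sonnet-5",
);
const random = simulateHitRate(1000, 4, 5, () =>
  Math.random() < 0.5 ? "claude-opus-4-8" : "claude-sonnet-5",
);

console.log({ single, alternating, random });
```

On a single model only the first run misses. Strict alternation misses every time, and random 50/50 routing loses about half the hits, because a run only hits when its predecessor went to the same model.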

Here is the comparison in one view.

Migration strategy	Cold writes	Hit rate impact	Quality validation
Big-bang switch	One per family	Brief dip right after cutover	Bets every family at once
Percentage split (per request)	Every family times both models, recurring on TTL expiry	Sparse families fall toward 0% on both sides	Finely controllable
Cohort cutover (per task family)	One per migrated family	Preserved for both migrated and pending families	Staged, per family

Cohort cutover keeps the one thing worth keeping from percentage rollouts — staged validation — while avoiding both failure modes.

Cutting over by task family — the cohort design

There is exactly one principle: never split a group of runs that shares a prefix across two models. The unit of migration is the task family, not the request.

The configuration needs nothing more than this:

// cutover.ts — resolve the model per task family
export type ModelId = "claude-opus-4-8" | "claude-sonnet-5";
 
export interface TaskFamily {
  id: string;              // e.g. "nightly-digest", "log-triage"
  prefixTokens: number;    // approximate shared prefix size
  runsPerDay: number;
  avgIntervalMin: number;  // used for TTL reasoning
  model: ModelId;          // current home model
  cutoverAt?: string;      // ISO date; newModel applies from here on
  newModel?: ModelId;
}
 
export function resolveModel(f: TaskFamily, now = new Date()): ModelId {
  if (f.cutoverAt && f.newModel && now >= new Date(f.cutoverAt)) {
    return f.newModel;
  }
  return f.model;
}
 
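// Guard: mechanically reject configs that could split a family
export function assertNoSplit(families: TaskFamily[]): void {
  const seen = new Set<string>();
  for (const f of families) {
    // Duplicate ids could route one shared prefix to two models
    if (seen.has(f.id)) throw new Error(`${f.id}: duplicate family id`);
    seen.add(f.id);
    if (f.cutoverAt && !f.newModel) {
      throw new Error(`${f.id}: cutoverAt is set but newModel is missing`);
    }
    // An unparseable date never satisfies now >= cutoverAt, so the cutover would silently not happen
    if (f.cutoverAt && Number.isNaN(new Date(f.cutoverAt).getTime())) {
      throw new Error(`${f.id}: cutoverAt is not a valid date`);
    }
  }
}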

Callers simply pass model: resolveModel(family). The point is that each family carries its own cutover date, which turns migration order into a design variable.

For ordering, I prefer these criteria:

Migrate low-reuse families first. A family that runs a handful of times per day gets little from the cache anyway, so the cold-write loss is negligible. It makes an ideal observation deck for output quality.
Families whose run interval already exceeds the TTL go early. If every run is cold anyway, migration loses nothing.
High-reuse backbone families go last, once several days of observations have accumulated and you can move them with confidence.
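These ordering criteria can be encoded as a sort. The sketch below is one possible reading, not the only one: families whose interval already exceeds the TTL come first, and within each group the least frequently run family goes earlier. `FamilyStats`, `planCutoverOrder`, and the sample families are hypothetical.

```typescript
// Minimal stats needed for ordering; a subset of the TaskFamily fields.
interface FamilyStats {
  id: string;
  runsPerDay: number;
  avgIntervalMin: number;
}

// Returns family ids in suggested migration order.
// Group 0: run interval already exceeds the TTL (every run is cold anyway).
// Group 1: everything else. Within a group, fewer runs per day goes first,
// so high-reuse backbone families end up last.
function planCutoverOrder(families: FamilyStats[], ttlMin = 5): string[] {
  return [...families]
    .sort((a, b) => {
      const aGroup = a.avgIntervalMin > ttlMin ? 0 : 1;
      const bGroup = b.avgIntervalMin > ttlMin ? 0 : 1;
      if (aGroup !== bGroup) return aGroup - bGroup;
      return a.runsPerDay - b.runsPerDay;
    })
    .map((f) => f.id);
}

// Made-up sample data for illustration only.
const sampleFamilies: FamilyStats[] = [
  { id: "status-poll", runsPerDay: 480, avgIntervalMin: 3 },
  { id: "nightly-digest", runsPerDay: 1, avgIntervalMin: 1440 },
  { id: "log-triage", runsPerDay: 360, avgIntervalMin: 4 },
  { id: "hourly-report", runsPerDay: 24, avgIntervalMin: 60 },
];

console.log(planCutoverOrder(sampleFamilies));
```

The resulting list can then drive the `cutoverAt` dates, spaced a few days apart so each cohort gets its own observation window.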

One more trick that pays off on cutover morning: a prewarm request. Just before the batch window, send a single minimal request to the new model to write the prefix, so the first real run does not eat the cold-write latency.

// prewarm.ts — run once before the first batch on cutover day
import Anthropic from "@anthropic-ai/sdk";
 
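const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
 
// Same union as in cutover.ts, repeated here so this file compiles on its own
type ModelId = "claude-opus-4-8" | "claude-sonnet-5";
 
export async function prewarmPrefix(model: ModelId, systemPrompt: string) {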
  await client.messages.create({
    model,
    max_tokens: 1,
    system: [
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: "ok" }],
  });
}

Note that families below the minimum cacheable prefix length (1,024 tokens on the Sonnet and Opus lines) sit outside this whole discussion. Exclude them from the family list before you plan the order.

Measuring hit rate per family from the usage block

Migration decisions should rest on the usage block of real responses, not on intuition. Three fields matter:

// cache-metrics.ts — hit rate aggregated by family and model
interface RunUsage {
  familyId: string;
  model: string;
  inputTokens: number;               // usage.input_tokens
  cacheRead: number;                 // usage.cache_read_input_tokens
  cacheWrite: number;                // usage.cache_creation_input_tokens
}
 
export function hitRate(runs: RunUsage[]): Map<string, number> {
  const acc = new Map<string, { read: number; total: number }>();
  for (const r of runs) {
    const key = `${r.familyId}/${r.model}`;
    const cur = acc.get(key) ?? { read: 0, total: 0 };
    cur.read += r.cacheRead;
    cur.total += r.cacheRead + r.cacheWrite;
    acc.set(key, cur);
  }
  return new Map(
    [...acc].map(([k, v]) => [k, v.total === 0 ? 0 : v.read / v.total])
  );
}

In my own operation this aggregation lands in a daily log, and I watch for duplicated family-model rows. If the same familyId shows rows for two models on the same day, a split has crept in — usually a request built somewhere that bypasses resolveModel with a hard-coded model string. This table catches it. I wait until the hit rate returns to its pre-migration level (around 90% in my measurements) for two or three days before moving the next cohort.
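The duplicate-row check described above can be automated. This is a minimal sketch that assumes each logged run also carries a `day` string; that field, the `DailyRun` type, and `findSplitFamilies` are hypothetical additions, not part of the `RunUsage` shape shown earlier.

```typescript
// One logged run; `day` is an assumed extra field such as "2026-07-02".
interface DailyRun {
  day: string;
  familyId: string;
  model: string;
}

// Returns "day/familyId" keys that were served by more than one model on the
// same day, which is the signature of an accidental split.
function findSplitFamilies(runs: DailyRun[]): string[] {
  const modelsByKey = new Map<string, Set<string>>();
  for (const r of runs) {
    const key = `${r.day}/${r.familyId}`;
    const models = modelsByKey.get(key) ?? new Set<string>();
    models.add(r.model);
    modelsByKey.set(key, models);
  }
  return Array.from(modelsByKey.entries())
    .filter(([, models]) => models.size > 1)
    .map(([key]) => key);
}

// Made-up sample log for illustration only.
const sampleLog: DailyRun[] = [
  { day: "2026-07-02", familyId: "log-triage", model: "claude-opus-4-8" },
  { day: "2026-07-02", familyId: "log-triage", model: "claude-sonnet-5" },
  { day: "2026-07-02", familyId: "nightly-digest", model: "claude-sonnet-5" },
];

console.log(findSplitFamilies(sampleLog));
```

A family that is cut over in the middle of a day will legitimately appear once on that day, so the check is most useful on the days before and after each scheduled cutover.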

A migration cost simulator — working backward from August 31st

Sonnet 5's introductory pricing runs through 2026-08-31, after which the rate rises to $3 / $15 per million input / output tokens. Since both the cache write multiplier (1.25x) and the read multiplier (0.1x) apply to the base input rate, the migration validation itself is roughly 33% cheaper if done before that date ($2 versus $3 on input, $10 versus $15 on output). One less reason to postpone, the way I see it.

Comparing strategies by hand gets tedious, so a small simulator does the job:

// simulate.ts — rough daily cost per migration strategy
interface Pricing { inPerMTok: number; outPerMTok: number }
const SONNET5_INTRO: Pricing = { inPerMTok: 2, outPerMTok: 10 };
const OPUS48: Pricing = { inPerMTok: 5, outPerMTok: 25 };
 
interface Family { prefix: number; fresh: number; out: number; runs: number; warmRatio: number }
 
function dailyCost(f: Family, p: Pricing, coldWritesPerDay: number): number {
  const warmRuns = f.runs * f.warmRatio;
  const read = (warmRuns * f.prefix / 1e6) * p.inPerMTok * 0.1;
  const write = (coldWritesPerDay * f.prefix / 1e6) * p.inPerMTok * 1.25;
  const fresh = (f.runs * f.fresh / 1e6) * p.inPerMTok;
  const out = (f.runs * f.out / 1e6) * p.outPerMTok;
  return read + write + fresh + out;
}
 
// Example: 8k prefix, 2k fresh input, 1.5k output, 40 runs/day, 90% steady-state hit rate
const fam: Family = { prefix: 8000, fresh: 2000, out: 1500, runs: 40, warmRatio: 0.9 };
 
console.log("Stay on Opus 4.8   :", dailyCost(fam, OPUS48, 4).toFixed(3));
console.log("Sonnet 5, cohort   :", dailyCost(fam, SONNET5_INTRO, 4).toFixed(3));
// Percentage split: warmRatio drops on both models, cold writes rise on both
console.log(
  "50/50 request split:",
  (
    dailyCost({ ...fam, runs: 20, warmRatio: 0.3 }, OPUS48, 14) +
    dailyCost({ ...fam, runs: 20, warmRatio: 0.3 }, SONNET5_INTRO, 14)
  ).toFixed(3)
);

Running this against my own families, staying on Opus 4.8 came out around $1.90 per day, a cohort cutover to Sonnet 5 around $0.76, and the 50/50 request split stalled at about $1.44. In other words, the split sends half the traffic to the cheaper model yet saves only about 24% versus staying put, while diluting quality observation across both sides. (The example family hard-coded above prints different figures: roughly $2.24, $0.90, and $2.34, so for that family the split costs more than not migrating at all.) Once the numbers are in front of you, there is little reason left to choose it.

Operational lessons the multiplier table does not tell you

A few details that only surfaced after actually switching.

Consider the 1-hour TTL for cutover week only. The 1h TTL write costs 2x the base rate versus 1.25x for the 5-minute TTL, so it looks worse on paper. But for families running at 5-to-60-minute intervals, the 5-minute TTL means a cold write every single run (1.25x each time), while the 1-hour TTL means one 2x write followed by 0.1x reads. If a family runs at least twice per hour, the 1h TTL wins. It is especially useful for families you have deliberately slowed down during the observation period.
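The break-even can be checked in units of "base input price times prefix size". This is a rough sketch assuming evenly spaced runs and that a hit refreshes the TTL for the 1-hour tier as well as the 5-minute one; the function names are hypothetical.

```typescript
const READ_MULT = 0.1; // cache read, relative to base input price
const WRITE_5M_MULT = 1.25; // write with the 5-minute TTL
const WRITE_1H_MULT = 2; // write with the 1-hour TTL

// Cost of the cached prefix over `runs` evenly spaced runs, in multiples of
// (base input price x prefix size). Assumes each hit refreshes the TTL.
function cachedPrefixUnits(
  runs: number,
  intervalMin: number,
  ttlMin: number,
  writeMult: number,
): number {
  if (runs <= 0) return 0;
  // Gap longer than the TTL: the entry has expired by the next run, so every run rewrites.
  if (intervalMin > ttlMin) return runs * writeMult;
  // Otherwise one cold write, then reads for the remaining runs.
  return writeMult + (runs - 1) * READ_MULT;
}

// True when the 1-hour TTL is cheaper than the 5-minute TTL for this pattern.
function oneHourTtlWins(runs: number, intervalMin: number): boolean {
  return (
    cachedPrefixUnits(runs, intervalMin, 60, WRITE_1H_MULT) <
    cachedPrefixUnits(runs, intervalMin, 5, WRITE_5M_MULT)
  );
}

console.log(oneHourTtlWins(1, 30)); // one run: 2 units vs 1.25
console.log(oneHourTtlWins(2, 30)); // two runs 30 min apart: 2.1 units vs 2.5
console.log(oneHourTtlWins(10, 4)); // runs 4 min apart: 5-minute TTL already stays warm
```

The third case is the reminder that families already running inside the 5-minute window gain nothing from the more expensive write.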

Rollback is cheap — do not let it distort your judgment. Minutes after cutover, the old model's cache has already expired, so the feeling that "rolling back wastes the warm cache" is an illusion. The cost of rolling back is one cold write per family. The rollback trigger should be output-quality regression (against your golden dataset, for instance), never cost.

Never ride the platform default — always pin the model explicitly. Since June 30th the platform default has been moving to Sonnet 5. Any request path that omits the model will split your families independently of your plan. Even at indie scale, I recommend concentrating model IDs in one config file (the TaskFamily definition above) and banning model string literals from request-building code.
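One low-tech way to enforce the ban on model string literals is a text scan over request-building code. The sketch below is a crude, hypothetical check rather than a real lint rule: it assumes model ids are quoted strings starting with `claude-`, and it will also flag such strings inside comments or tests.

```typescript
// Matches a quoted string literal starting with "claude-", for example
// "claude-sonnet-5". The backreference \1 requires the same closing quote.
const MODEL_LITERAL = /(["'`])claude-[a-z0-9.-]+\1/g;

// Returns every model-like literal found in `source`, quotes included.
// Intended for files other than the single config that owns model ids.
function findHardcodedModels(source: string): string[] {
  return source.match(MODEL_LITERAL) ?? [];
}

// Made-up snippets for illustration only.
const clean = "client.messages.create({ model: resolveModel(family), max_tokens: 512 })";
const dirty = 'client.messages.create({ model: "claude-sonnet-5", max_tokens: 512 })';

console.log(findHardcodedModels(clean));
console.log(findHardcodedModels(dirty));
```

Run over everything except the config file in CI, a non-empty result is a reasonable signal that some path bypasses `resolveModel`.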

As a next step, aggregate cache_read_input_tokens and cache_creation_input_tokens per family from your own logs, and pick one family whose run interval already exceeds the TTL as your first cohort. Starting where there is no cache left to lose turns the observation period into pure upside.

I hope this helps you plan your own cutover. Thank you for reading.
