My Morning Batch Was Missing the Prompt Cache Every Time — Warming Cadence and the Break-Even Math for the 1-Hour TTL
Jobs that run a few hours apart cold-miss the prompt cache even with a 1-hour TTL. Here is how to back out the right warming interval from the TTL, and how to write the break-even formula that decides whether warming pays off — with numbers from a four-site daily generation pipeline.
Most advice on prompt caching is about raising the hit rate. The thing I missed for longer than I would like to admit sat one step earlier: the lifetime of the cache and the interval between my jobs simply did not line up. The pipeline that generates articles across the four Dolice Labs sites runs each site staggered by a few hours. The shared reference data and policy prompt at the front add up to more than ten thousand tokens, so caching that prefix should have been an easy win. But when I read the billing breakdown, cache_read_input_tokens was barely moving. I was rewriting the whole prefix on almost every run.
The reason was mundane: the gap between runs was longer than an hour. A 1-hour TTL does nothing for a job that fires every six hours. Even with an identical prefix, any run that lands after the TTL expires is billed as a fresh write. This article records how I worked out what warming interval bridges that time gap, and whether warming actually pays off, by turning my own numbers into a formula.
Note: As a baseline, if uncached input is 1.0x, a 5-minute cache write is roughly 1.25x, a 1-hour cache write is roughly 2.0x, and a cache read is roughly 0.1x. The estimates here use those multipliers. Real prices vary by model and region, so always reconcile against your own bill.
The cache lifetime and the run interval were never aligned
A prompt cache extends its TTL every time a request hits it. Put the other way around: if no access arrives within the TTL window, the cache quietly disappears. This is where interactive apps and scheduled jobs behave very differently.
An interactive chat gets a follow-up every few seconds or minutes, so even a 5-minute TTL keeps getting refreshed on its own. A scheduled job is the opposite: if the interval between runs is longer than the TTL, every run is treated as the first one. Laid out side by side, the relationship looks like this.
| Pattern | Typical interval | 5-minute TTL | 1-hour TTL |
| --- | --- | --- | --- |
| Interactive chat | seconds to minutes | almost always hits | always hits |
| Frequent batch | 1-3 minutes | mostly hits | always hits |
| Hourly batch | 60 minutes | mostly misses | coin flip at the boundary |
| Staggered schedule | several hours | always misses | always misses |
I was living in the bottom row. I had assumed a 1-hour TTL would be safe, but an hour of life never reaches a job that runs every six hours. Noticing that "run interval > TTL" relationship was the real starting point.
Measure first — log the write-to-read ratio on every run
Before reaching for a fix, pin down how much your jobs are actually rewriting. The usage object splits write and read tokens apart, so logging them on every run is enough to see reality.
import json
import time


def log_cache_usage(resp, job_name: str) -> dict:
    u = resp.usage
    created = getattr(u, "cache_creation_input_tokens", 0) or 0
    read = getattr(u, "cache_read_input_tokens", 0) or 0
    fresh = getattr(u, "input_tokens", 0) or 0
    total_prefix = created + read
    hit_rate = (read / total_prefix) if total_prefix else 0.0
    record = {
        "ts": time.time(),
        "job": job_name,
        "cache_creation": created,  # the expensive write
        "cache_read": read,  # the cheap read
        "uncached_input": fresh,
        "prefix_hit_rate": round(hit_rate, 3),
    }
    print(json.dumps(record, ensure_ascii=False))
    return record
If prefix_hit_rate sits near zero day after day, the cache has never been kept alive. In my case, aggregating a week of generation logs across the four sites put the average hit rate at 0.04. The ten-thousand-token prefix I was loading was being written from scratch on nearly every call. That was the moment it clicked that this was a lifetime problem, not a hit-rate problem.
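A weekly figure like the one above can be computed directly from the logged records. The following is a minimal sketch, assuming each record is a dict shaped like the output of log_cache_usage(); the summarize_hit_rate name, the "_all" key, and the sample records are illustrative and not part of the original pipeline.

```python
from collections import defaultdict


def summarize_hit_rate(records: list[dict]) -> dict[str, float]:
    """Token-weighted prefix hit rate per job, plus an "_all" total."""
    created = defaultdict(int)
    read = defaultdict(int)
    for r in records:
        # Count every record once for its own job and once for the total.
        for key in (r["job"], "_all"):
            created[key] += r["cache_creation"]
            read[key] += r["cache_read"]
    summary = {}
    for key in created:
        total = created[key] + read[key]
        summary[key] = round(read[key] / total, 3) if total else 0.0
    return summary


# Hypothetical records for illustration only (not measured data).
sample = [
    {"job": "site-a", "cache_creation": 12000, "cache_read": 0},
    {"job": "site-a", "cache_creation": 0, "cache_read": 12000},
    {"job": "site-b", "cache_creation": 12000, "cache_read": 0},
]
print(summarize_hit_rate(sample))
```

Weighting by tokens rather than averaging per-run rates keeps a few small requests from skewing the figure; either choice would work as long as it is applied consistently.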
What warming is — extend the TTL without touching the content
Warming means slipping in a minimal, content-free request that reuses the same cached prefix before the TTL expires. A cache resets its TTL every time it is read, so a single cheap read rewinds the clock. You do not need any output, so you set max_tokens to the minimum and keep the message body tiny.
def warm_cache(client, model: str, cached_prefix: list):
    """cached_prefix must be byte-for-byte identical to production,
    including the cache_control on the trailing block."""
    return client.messages.create(
        model=model,
        max_tokens=1,  # throw the output away
        system=cached_prefix,  # not one byte different
        messages=[{"role": "user", "content": "ok"}],
    )
The critical part is that the prefix you pass while warming is an exact match for the production request. A single differing character makes it a different cache, and instead of extending anything you just pay for another write. I share one build_prefix() function between the production code and the warming code so there is no room for a difference to creep in.
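The article does not show build_prefix() itself, so the following is only a sketch of what such a shared helper might look like. The argument names, the two-block layout, and the prefix_fingerprint helper are assumptions for illustration; the cache_control value with a "1h" TTL should be checked against the current API documentation for your model.

```python
import hashlib
import json


def build_prefix(policy_text: str, reference_text: str) -> list[dict]:
    """Return the system blocks shared by production and warming calls."""
    return [
        {"type": "text", "text": policy_text},
        {
            "type": "text",
            "text": reference_text,
            # Assumption: the 1-hour TTL is requested on the trailing block.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ]


def prefix_fingerprint(prefix: list[dict]) -> str:
    """Stable hash of the prefix, useful for logging and drift checks."""
    canonical = json.dumps(prefix, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


production = build_prefix("policy v1", "reference data")
warming = build_prefix("policy v1", "reference data")
drifted = build_prefix("policy v1", "reference data 2024-01-01")

print(prefix_fingerprint(production) == prefix_fingerprint(warming))
print(prefix_fingerprint(production) == prefix_fingerprint(drifted))
```

Logging the fingerprint next to each run makes a drifting prefix visible before it shows up on the bill. The hash only covers what is passed in, so anything else that precedes the system blocks in the request, such as tool definitions, would need to be included as well.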
Back the warming interval out of the TTL
The interval you need for warming is the TTL minus a safety margin. I decide it like this.
1. Confirm the cache TTL (60 minutes for a 1-hour TTL).
2. Allow for scheduler jitter, API latency, and timezone drift by applying a 0.8 safety factor (60 x 0.8 = 48 minutes).
3. Make sure at least one access lands within that 48 minutes by setting the warming interval to 48 minutes or less.
4. The production job is itself an access, so skip warming in time windows where a real run fires within 48 minutes.
def warming_interval_minutes(ttl_minutes: int, safety: float = 0.8) -> int:
    return max(1, int(ttl_minutes * safety))


# 1-hour TTL -> refresh every 48 min to bridge even a six-hour gap
print(warming_interval_minutes(60))  # -> 48
This back-of-the-envelope shows that bridging a six-hour gap takes about seven refreshes at 48-minute spacing. Doing the same with a 5-minute TTL would need roughly 90 refreshes at 4-minute spacing, which is not realistic. So the first conclusion from the math is that warming is only on the table with a 1-hour TTL.
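The refresh count can also be derived from an actual schedule instead of by hand. This is a rough sketch, assuming production runs are given as minutes after midnight and that the cache should stay alive around the clock; plan_warm_touches is a hypothetical helper, not code from the pipeline described here.

```python
def plan_warm_touches(
    run_minutes: list[int],
    ttl_minutes: int = 60,
    safety: float = 0.8,
    day_minutes: int = 24 * 60,
) -> list[int]:
    """Minutes after midnight at which a warming request should fire."""
    if not run_minutes:
        return []
    interval = max(1, int(ttl_minutes * safety))
    runs = sorted(run_minutes)
    touches = []
    for i, start in enumerate(runs):
        # The gap after the last run wraps around to tomorrow's first run.
        end = runs[i + 1] if i + 1 < len(runs) else runs[0] + day_minutes
        t = start + interval
        while t < end:  # a production run at `end` is itself the refresh
            touches.append(t % day_minutes)
            t += interval
    return sorted(touches)


# Four runs, six hours apart: seven refreshes fit inside each gap.
touches = plan_warm_touches([0, 360, 720, 1080])
print(touches[:7])
print(len(touches))  # 28
```

Because the clock restarts at every production run, the planner skips any window where a real run already lands inside the safe interval.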
Break-even — write the formula that says whether warming pays off
Warming costs read tokens too. Whether it is worth it comes down to comparing what you pay in warming reads against the write premium you avoid on cold misses. With a prefix of P tokens, one cold miss costs you the difference between a write (2.0x on the 1-hour tier) and a read (0.1x), about 1.9 x P extra. One warming touch is a read, 0.1 x P.
def warming_economics(
    prefix_tokens: int,
    runs_per_day: int,
    gap_minutes: int,
    ttl_minutes: int = 60,
    write_mult: float = 2.0,  # 1h write
    read_mult: float = 0.1,  # read
):
    # Do nothing: if the interval exceeds the TTL, every run writes cold
    cold_per_run = write_mult if gap_minutes > ttl_minutes else read_mult
    cost_cold = runs_per_day * cold_per_run * prefix_tokens

    # Warm: write once, then every real run and refresh is a read
    interval = max(1, int(ttl_minutes * 0.8))
    warm_touches = max(0, (24 * 60) // interval - runs_per_day)
    cost_warm = (
        write_mult * prefix_tokens  # one initial write
        + read_mult * prefix_tokens * (runs_per_day - 1)  # production hits
        + read_mult * prefix_tokens * warm_touches  # refresh hits
    )
    saving = cost_cold - cost_warm
    return {
        "cost_cold_units": round(cost_cold),
        "cost_warm_units": round(cost_warm),
        "warm_touches_per_day": warm_touches,
        "should_warm": saving > 0,
        "saving_ratio": round(saving / cost_cold, 3) if cost_cold else 0,
    }


print(warming_economics(prefix_tokens=12000, runs_per_day=4, gap_minutes=360))
With these parameters (a twelve-thousand-token prefix, four runs a day, six hours apart), staying cold costs 96,000 units while warming costs about 58,800 (one write, three production reads, and 26 refresh reads), roughly a 39% cut. Even as the number of refreshes grows, a read is a twentieth of a write, so the break-even has plenty of headroom. On paper, you stay in the black as long as the refreshes per real run do not exceed "avoided write premium / read price", that is 1.9 / 0.1 = 19.
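The per-run threshold can be checked against the number of refreshes a given gap actually needs. This is a steady-state sketch that ignores the once-a-day initial write; the function names are hypothetical, and the multipliers are the baseline ones quoted earlier rather than prices for any specific model.

```python
import math


def max_touches_per_run(write_mult: float = 2.0, read_mult: float = 0.1) -> float:
    """Refreshes per real run at which warming stops paying off."""
    return (write_mult - read_mult) / read_mult


def touches_per_run(gap_minutes: int, ttl_minutes: int = 60, safety: float = 0.8) -> int:
    """Refreshes needed to keep the cache alive across one gap between runs."""
    interval = max(1, int(ttl_minutes * safety))
    return max(0, math.ceil(gap_minutes / interval) - 1)


def warming_pays_off(gap_minutes: int, ttl_minutes: int, write_mult: float,
                     read_mult: float = 0.1) -> bool:
    needed = touches_per_run(gap_minutes, ttl_minutes)
    return needed < max_touches_per_run(write_mult, read_mult)


# Six-hour gap on the 1-hour tier (2.0x write) and the 5-minute tier (1.25x write).
print(touches_per_run(360, 60), warming_pays_off(360, 60, write_mult=2.0))
print(touches_per_run(360, 5), warming_pays_off(360, 5, write_mult=1.25))
```

Under these assumptions the 1-hour tier needs 7 refreshes per gap, well under the threshold, while the 5-minute tier needs 89, far beyond it.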
What I settled on across four sites, and when I deliberately do not warm
The numbers were in the black, but what I ultimately chose was not warming — it was clustering the runs. I pulled the four staggered generations into nearby time windows so they fall inside one TTL. Then the second run onward is itself a refresh, and a separate warming request is barely needed. Warming became the insurance for "when I genuinely cannot move the schedule."
My priority order ends up like this.
1. First, can the runs be pulled inside one TTL window? That refreshes the cache for free.
2. If there is a reason they cannot move (load spreading, external timing), check the break-even with the function above.
3. If it is in the black and the prefix is stable, add 48-minute refreshes.
4. If the prefix changes often, drop warming and narrow the cached region to the static layer only.
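The priority order above can be written down as a small decision helper. This is one possible encoding, not the author's code: the function names and return labels are made up for illustration, and the final "stay-cold" branch covers a case the list does not address (stable prefix, warming in the red), so treat it as an assumption.

```python
def fits_one_ttl_window(run_minutes: list[int], ttl_minutes: int = 60,
                        safety: float = 0.8) -> bool:
    """True if every consecutive pair of runs is within the safe interval."""
    interval = max(1, int(ttl_minutes * safety))
    runs = sorted(run_minutes)
    return all(later - earlier <= interval for earlier, later in zip(runs, runs[1:]))


def choose_strategy(run_minutes: list[int], can_reschedule: bool,
                    warming_in_black: bool, prefix_stable: bool) -> str:
    # 1. Clustering comes first: runs inside one window refresh each other.
    if fits_one_ttl_window(run_minutes) or can_reschedule:
        return "cluster"
    # 4. A prefix that keeps changing cannot be kept warm.
    if not prefix_stable:
        return "narrow-cache"
    # 2-3. The schedule is fixed, so the break-even result decides.
    if warming_in_black:
        return "warm"
    return "stay-cold"  # assumption: not covered by the list above


print(choose_strategy([0, 360, 720, 1080], can_reschedule=False,
                      warming_in_black=True, prefix_stable=True))
print(choose_strategy([0, 40, 80, 120], can_reschedule=False,
                      warming_in_black=True, prefix_stable=True))
```

The warming_in_black flag is meant to be fed from the should_warm field of the break-even calculation.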
Running four sites alone as an indie developer, I would rather not add an always-on warming process to a problem that a small schedule nudge already solves. Trying the cheapest option first turns out to be the most robust, which is the honest takeaway.
Three ways warming quietly costs more
To close, the traps I hit in production. All of them are "I thought I was extending the cache but was actually adding writes."
The first is prefix drift. Mix even one variable value such as a date or run ID into the front, and every call becomes a different cache. Warming requests then mass-produce fresh writes and the cost spikes. The rule is to place variable values after the cache breakpoint.
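One way to keep variable values out of the cached region is to build the request so that only static text sits in the system block carrying the breakpoint. The sketch below assumes a single static policy block and a hypothetical build_request helper; the field layout follows the Messages API shape used elsewhere in this article, and the "1h" TTL value should be verified against current documentation.

```python
def build_request(static_policy: str, run_date: str, run_id: str, task: str) -> dict:
    """Static text goes before the breakpoint, per-run values go after it."""
    system = [
        {
            "type": "text",
            "text": static_policy,  # must not contain dates, IDs, or counters
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ]
    # Per-run values live in the user message, after the cached prefix.
    user_text = f"Run date: {run_date}\nRun ID: {run_id}\n\n{task}"
    return {"system": system, "messages": [{"role": "user", "content": user_text}]}


monday = build_request("You are an editor.", "2025-01-06", "run-001", "Write the article.")
tuesday = build_request("You are an editor.", "2025-01-07", "run-002", "Write the article.")

print(monday["system"] == tuesday["system"])
print(monday["messages"] == tuesday["messages"])
```

The returned dict is intended to be unpacked into the client call alongside model and max_tokens. Because the system blocks are identical across days, both runs address the same cache entry.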
The second is a misjudged safety factor. Underestimate scheduler jitter and set the interval right up against the TTL, and an occasional delay crosses the boundary, turning that run into a full write. Refreshes only count if they reliably land before the cache expires.
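If the scheduler's actual delays are logged, the interval can be sized from them rather than from a fixed factor alone. This is a sketch under assumptions: interval_from_jitter is a hypothetical helper, the delays are example values rather than measurements, and the 2-minute margin is an arbitrary placeholder.

```python
def interval_from_jitter(
    ttl_minutes: float,
    observed_delays_minutes: list[float],
    safety: float = 0.8,
    margin_minutes: float = 2.0,
) -> float:
    """Pick the stricter of the safety-factor interval and the jitter-based one."""
    worst = max(observed_delays_minutes, default=0.0)
    by_factor = ttl_minutes * safety
    by_jitter = ttl_minutes - worst - margin_minutes
    return max(1.0, min(by_factor, by_jitter))


# Small delays: the 0.8 factor is the binding constraint.
print(interval_from_jitter(60, [1.0, 2.0]))
# One 15-minute delay in the history: the interval has to shrink.
print(interval_from_jitter(60, [1.0, 3.0, 15.0]))
```

Taking the minimum of the two means a single bad delay in the history tightens the interval, while a quiet history never loosens it beyond the 0.8 factor.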
The third is exceeding the breakpoint limit. There are at most four cache breakpoints per request. Slicing blocks too finely to warm more of them hits the ceiling, and the block you expected may not be written — or may be written more than once. Keep warming to the static layer that genuinely pays off.
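A quick pre-flight check can catch a request that carries too many breakpoints before it is sent. This is a minimal sketch that assumes the request is a plain dict with optional tools, system, and messages keys; count_breakpoints is a hypothetical helper, and the limit of four is the one stated above.

```python
MAX_BREAKPOINTS = 4  # limit stated in the article


def _as_blocks(value) -> list:
    """Plain strings carry no cache_control, so only lists are inspected."""
    return value if isinstance(value, list) else []


def count_breakpoints(request: dict) -> int:
    """Count blocks marked with cache_control across tools, system, and messages."""
    blocks = list(_as_blocks(request.get("tools")))
    blocks += _as_blocks(request.get("system"))
    for message in _as_blocks(request.get("messages")):
        blocks += _as_blocks(message.get("content"))
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)


marked = {"type": "text", "text": "static", "cache_control": {"type": "ephemeral"}}
request = {
    "system": [marked, marked],
    "messages": [{"role": "user", "content": "ok"}],
}
used = count_breakpoints(request)
print(used, used <= MAX_BREAKPOINTS)
```

Running the same check on both the production request and the warming request also confirms that they mark the same blocks.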
Start by dropping log_cache_usage() into your jobs for a few days and checking whether prefix_hit_rate sits near zero. That measurement lets you decide from numbers, rather than a guess, whether to cluster your runs or to warm.