API & SDK / 2026-06-24 / Advanced

My Morning Batch Was Missing the Prompt Cache Every Time — Warming Cadence and the Break-Even Math for the 1-Hour TTL

Jobs that run a few hours apart cold-miss the prompt cache even with a 1-hour TTL. Here is how to back out the right warming interval from the TTL, and how to write the break-even formula that decides whether warming pays off — with numbers from a four-site daily generation pipeline.

Most advice on prompt caching is about raising the hit rate. The thing I missed for longer than I would like to admit sat one step earlier: the lifetime of the cache and the interval between my jobs simply did not line up. The pipeline that generates articles across the four Dolice Labs sites runs each site staggered by a few hours. The shared reference data and policy prompt at the front add up to more than ten thousand tokens, so caching that prefix should have been an easy win. But when I read the billing breakdown, cache_read_input_tokens was barely moving. I was rewriting the whole prefix on almost every run.

The reason was mundane: the gap between runs was longer than an hour. A 1-hour TTL does nothing for a job that fires every six hours. Even with an identical prefix, any run that lands after the TTL expires is billed as a fresh write. This article records how I worked out what warming interval bridges that time gap, and whether warming actually pays off, by turning my own numbers into a formula.

Note: As a baseline, if uncached input is 1.0x, a 5-minute cache write is roughly 1.25x, a 1-hour cache write is roughly 2.0x, and a cache read is roughly 0.1x. The estimates here use those multipliers. Real prices vary by model and region, so always reconcile against your own bill.

The cache lifetime and the run interval were never aligned

A prompt cache extends its TTL every time a request hits it. Put the other way around: if no access arrives within the TTL window, the cache quietly disappears. This is where interactive apps and scheduled jobs behave very differently.

An interactive chat gets a follow-up every few seconds or minutes, so even a 5-minute TTL keeps getting refreshed on its own. A scheduled job is the opposite: if the interval between runs is longer than the TTL, every run is treated as the first one. Laid out side by side, the relationship looks like this.

Pattern	Typical interval	5-minute TTL	1-hour TTL
Interactive chat	seconds to minutes	almost always hits	always hits
Frequent batch	1-3 minutes	mostly hits	always hits
Hourly batch	60 minutes	always misses	coin flip at the boundary
Staggered schedule	several hours	always misses	always misses

I was living in the bottom row. I had assumed a 1-hour TTL would be safe, but an hour of life never reaches a job that runs every six hours. Noticing that "run interval > TTL" relationship was the real starting point.
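To check a whole schedule against that relationship, a small sketch can count how many runs per day start cold. The function name `cold_runs_per_day` and both example schedules are illustrative assumptions, not the actual pipeline. It assumes the schedule repeats daily and ignores jitter, so a gap exactly equal to the TTL is counted as a hit even though in practice it is a coin flip.

```python
def cold_runs_per_day(run_minutes, ttl_minutes=60):
    """Count runs whose gap from the previous run exceeds the TTL.

    run_minutes: start times as minutes after midnight (hypothetical input).
    The first run of the day is compared with the last run of the previous day.
    """
    runs = sorted(run_minutes)
    day = 24 * 60
    cold = 0
    for i, t in enumerate(runs):
        prev = runs[i - 1]               # i == 0 wraps around to the last run
        gap = (t - prev) % day
        if len(runs) == 1:
            gap = day                    # a single daily run is a full day apart
        if gap > ttl_minutes:
            cold += 1
    return cold

# Four runs staggered six hours apart: every run starts cold.
print(cold_runs_per_day([0, 360, 720, 1080]))    # -> 4
# The same four runs clustered 30 minutes apart: only the first is cold.
print(cold_runs_per_day([540, 570, 600, 630]))   # -> 1
```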

Measure first — log the write-to-read ratio on every run

Before reaching for a fix, pin down how much your jobs are actually rewriting. The usage object splits write and read tokens apart, so logging them on every run is enough to see reality.

import json
import time
 
def log_cache_usage(resp, job_name: str) -> dict:
    u = resp.usage
    created = getattr(u, "cache_creation_input_tokens", 0) or 0
    read = getattr(u, "cache_read_input_tokens", 0) or 0
    fresh = getattr(u, "input_tokens", 0) or 0
    total_prefix = created + read
    hit_rate = (read / total_prefix) if total_prefix else 0.0
    record = {
        "ts": time.time(),
        "job": job_name,
        "cache_creation": created,   # the expensive write
        "cache_read": read,          # the cheap read
        "uncached_input": fresh,
        "prefix_hit_rate": round(hit_rate, 3),
    }
    print(json.dumps(record, ensure_ascii=False))
    return record

If prefix_hit_rate sits near zero day after day, the cache has never been kept alive. In my case, aggregating a week of generation logs across the four sites put the average hit rate at 0.04. The ten-thousand-token prefix I was loading was being written from scratch on nearly every call. That was the moment it clicked that this was a lifetime problem, not a hit-rate problem.
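A weekly figure like that can be rolled up from the JSON lines that log_cache_usage() prints. The sketch below is one way to do it, assuming the logs were captured as one JSON record per line. The helper name `aggregate_hit_rate` and the sample records are made up for illustration. It weights by tokens rather than averaging the per-run rates, so a large prefix counts for more than a small one.

```python
import json

def aggregate_hit_rate(log_lines):
    """Token-weighted prefix hit rate over many log_cache_usage() records."""
    created = 0
    read = 0
    for line in log_lines:
        rec = json.loads(line)
        created += rec["cache_creation"]
        read += rec["cache_read"]
    total = created + read
    return read / total if total else 0.0

# Invented sample: two cold writes and one cache hit on a 12,000-token prefix.
sample = [
    '{"job": "site-a", "cache_creation": 12000, "cache_read": 0}',
    '{"job": "site-b", "cache_creation": 12000, "cache_read": 0}',
    '{"job": "site-c", "cache_creation": 0, "cache_read": 12000}',
]
print(round(aggregate_hit_rate(sample), 3))  # -> 0.333
```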

What warming is — extend the TTL without touching the content

Warming means slipping in a minimal, content-free request that reuses the same cached prefix before the TTL expires. A cache resets its TTL every time it is read, so a single cheap read rewinds the clock. You do not need any output, so you set max_tokens to the minimum and keep the message body tiny.

def warm_cache(client, model: str, cached_prefix: list):
    """cached_prefix must be byte-for-byte identical to production,
    including the cache_control on the trailing block."""
    return client.messages.create(
        model=model,
        max_tokens=1,                       # throw the output away
        system=cached_prefix,               # not one byte different
        messages=[{"role": "user", "content": "ok"}],
    )

The critical part is that the prefix you pass while warming is an exact match for the production request. A single differing character makes it a different cache, and instead of extending anything you just pay for another write. I share one build_prefix() function between the production code and the warming code so there is no room for a difference to creep in.
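A shared builder along those lines might look like the sketch below. Only the name build_prefix() comes from the article. The constants `REFERENCE_DATA` and `POLICY_PROMPT` are placeholders. The `cache_control` shape with `"ttl": "1h"` reflects my reading of the prompt caching documentation, so confirm it against the current API reference for your model before relying on it.

```python
import json

# Placeholder content. A real prefix must be long enough to meet the
# model's minimum cacheable length, which these stubs are not.
REFERENCE_DATA = "shared reference data for all four sites"
POLICY_PROMPT = "editorial policy prompt"

def build_prefix():
    """Single source of truth for the cached system blocks.

    Both the production call and warm_cache() should call this function
    so the two requests cannot diverge.
    """
    return [
        {"type": "text", "text": REFERENCE_DATA},
        {
            "type": "text",
            "text": POLICY_PROMPT,
            # Assumed field shape for the 1-hour tier; verify in the docs.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
    ]

# Two independent calls serialize to the same bytes.
print(json.dumps(build_prefix()) == json.dumps(build_prefix()))  # -> True
```

Anything that varies per run, such as a date or a site name, stays out of this function and goes into the user message instead.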

Back the warming interval out of the TTL

The interval you need for warming is the TTL minus a safety margin. I decide it like this.

1. Confirm the cache TTL (60 minutes for a 1-hour TTL).
2. Allow for scheduler jitter, API latency, and clock drift by applying a 0.8 safety factor (60 x 0.8 = 48 minutes).
3. Make sure at least one access lands within that 48 minutes by setting the warming interval to 48 minutes or less.
4. The production job is itself an access, so skip warming in time windows where a real run fires within 48 minutes.

def warming_interval_minutes(ttl_minutes: int, safety: float = 0.8) -> int:
    return max(1, int(ttl_minutes * safety))
 
# 1-hour TTL -> refresh every 48 min to bridge even a six-hour gap
print(warming_interval_minutes(60))  # -> 48

This back-of-the-envelope shows that bridging a six-hour gap takes about seven refreshes at 48-minute spacing. Doing the same with a 5-minute TTL would need roughly 90 refreshes at 4-minute spacing, which is not realistic. So the first conclusion from the math is that warming is only on the table with a 1-hour TTL.
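The refresh counts can be checked by listing the warming times between two production runs. This is a minimal sketch. The helper name `warm_times_between` is hypothetical, and it assumes refreshes fire exactly on schedule, with the next production run counting as an access of its own.

```python
def warm_times_between(prev_run_min, next_run_min, interval_min=48):
    """Minutes at which a warming request should fire between two real runs."""
    times = []
    t = prev_run_min + interval_min
    while t < next_run_min:          # the next real run refreshes the cache itself
        times.append(t)
        t += interval_min
    return times

six_hours = warm_times_between(0, 360, interval_min=48)
print(six_hours)        # -> [48, 96, 144, 192, 240, 288, 336]
print(len(six_hours))   # -> 7

# The same gap on a 5-minute TTL with 4-minute spacing.
print(len(warm_times_between(0, 360, interval_min=4)))  # -> 89
```

If the next real run is 48 minutes away or less, the list is empty, which matches the rule of skipping warming when a production run already lands inside the window.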

Break-even — write the formula that says whether warming pays off

Warming costs read tokens too. Whether it is worth it comes down to comparing what you pay in warming reads against the write premium you avoid on cold misses. With a prefix of P tokens, one cold miss costs you the difference between a write (2.0x on the 1-hour tier) and a read (0.1x), about 1.9 x P extra. One warming touch is a read, 0.1 x P.

def warming_economics(
    prefix_tokens: int,
    runs_per_day: int,
    gap_minutes: int,
    ttl_minutes: int = 60,
    write_mult: float = 2.0,   # 1h write
    read_mult: float = 0.1,    # read
):
    # Do nothing: if the interval exceeds the TTL, every run writes cold
    cold_per_run = write_mult if gap_minutes > ttl_minutes else read_mult
    cost_cold = runs_per_day * cold_per_run * prefix_tokens
 
    # Warm: write once, then every real run and refresh is a read
    interval = max(1, int(ttl_minutes * 0.8))
    warm_touches = max(0, (24 * 60) // interval - runs_per_day)
    cost_warm = (
        write_mult * prefix_tokens                       # one initial write
        + read_mult * prefix_tokens * (runs_per_day - 1) # production hits
        + read_mult * prefix_tokens * warm_touches       # refresh hits
    )
    saving = cost_cold - cost_warm
    return {
        "cost_cold_units": round(cost_cold),
        "cost_warm_units": round(cost_warm),
        "warm_touches_per_day": warm_touches,
        "should_warm": saving > 0,
        "saving_ratio": round(saving / cost_cold, 3) if cost_cold else 0,
    }
 
print(warming_economics(prefix_tokens=12000, runs_per_day=4, gap_minutes=360))

With these parameters (a twelve-thousand-token prefix, four runs a day, six hours apart), staying cold costs 96,000 units while warming costs 58,800, roughly a 39% cut. Even as the number of refreshes grows, a read is a twentieth of a 1-hour write, so the break-even has plenty of headroom. On paper, you stay in the black as long as the refreshes per real run do not exceed "avoided write premium / read price", that is 1.9 / 0.1 = 19.
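The 19-refresh ceiling can also be read as a limit on how long a gap is worth bridging. The sketch below applies the per-run comparison directly. The function names are hypothetical, and it is a steady-state view that ignores the initial write, so it is a companion to warming_economics() rather than a replacement for it.

```python
import math

def refreshes_per_gap(gap_minutes, interval_minutes=48):
    """Warming requests needed to keep the cache alive across one gap."""
    return max(0, math.ceil(gap_minutes / interval_minutes) - 1)

def warming_pays_per_run(gap_minutes, interval_minutes=48,
                         write_mult=2.0, read_mult=0.1):
    """True if the refresh reads cost less than the write premium they avoid."""
    refresh_cost = refreshes_per_gap(gap_minutes, interval_minutes) * read_mult
    avoided_premium = write_mult - read_mult   # cold write versus warm read
    return refresh_cost < avoided_premium

print(refreshes_per_gap(360))        # -> 7
print(warming_pays_per_run(360))     # six-hour gap  -> True
print(warming_pays_per_run(900))     # 15-hour gap   -> True
print(warming_pays_per_run(1440))    # once a day    -> False
```

With these multipliers, a job that runs once a day needs 29 refreshes to bridge the gap, which costs more than the single cold write it would replace.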

What I settled on across four sites, and when I deliberately do not warm

The numbers were in the black, but what I ultimately chose was not warming — it was clustering the runs. I pulled the four staggered generations into nearby time windows so they fall inside one TTL. Then the second run onward is itself a refresh, and a separate warming request is barely needed. Warming became the insurance for "when I genuinely cannot move the schedule."
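The same multipliers show why clustering wins when it is available. This is a rough sketch, assuming the cache expires overnight so the first run of each day is a cold write and the other three land inside the TTL. The helper name `daily_prefix_cost` is hypothetical.

```python
def daily_prefix_cost(prefix_tokens, cold_runs, warm_runs,
                      write_mult=2.0, read_mult=0.1):
    """Daily prefix cost in token units for a mix of cold writes and cache reads."""
    return (cold_runs * write_mult + warm_runs * read_mult) * prefix_tokens

staggered = daily_prefix_cost(12000, cold_runs=4, warm_runs=0)
clustered = daily_prefix_cost(12000, cold_runs=1, warm_runs=3)

print(round(staggered))   # -> 96000
print(round(clustered))   # -> 27600
```

Clustering pays for one write and three reads a day and sends no warming requests at all, so there is no background process to run or monitor.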

My priority order ends up like this.

1. First, can the runs be pulled inside one TTL window? That refreshes the cache for free.
2. If there is a reason they cannot move (load spreading, external timing), check the break-even with the function above.
3. If it is in the black and the prefix is stable, add 48-minute refreshes.
4. If the prefix changes often, drop warming and narrow the cached region to the static layer only.

Running four sites alone as an indie developer, I would rather not add an always-on warming process to a problem that a small schedule nudge already solves. Trying the cheapest option first turns out to be the most robust, which is the honest takeaway.

Three ways warming quietly costs more

To close, the traps I hit in production. All of them are "I thought I was extending the cache but was actually adding writes."

The first is prefix drift. Mix even one variable value such as a date or run ID into the front, and every call becomes a different cache. Warming requests then mass-produce fresh writes and the cost spikes. The rule is to place variable values after the cache breakpoint.
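One way to catch drift early is to log a fingerprint of the prefix on every request and alert when it changes between runs. The sketch below is an assumption-laden illustration: the helper name `prefix_fingerprint` and the sample blocks are invented, and the hash only detects changes in what your code sends, not how the API keys its cache internally.

```python
import hashlib
import json

def prefix_fingerprint(prefix):
    """Short, stable hash of the system blocks that are meant to be cached."""
    canonical = json.dumps(prefix, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

stable = [{"type": "text", "text": "policy prompt"}]
# A date mixed into the front of the prompt changes the prefix every day.
drift_day_1 = [{"type": "text", "text": "Date: 2026-06-23\npolicy prompt"}]
drift_day_2 = [{"type": "text", "text": "Date: 2026-06-24\npolicy prompt"}]

print(prefix_fingerprint(stable) == prefix_fingerprint(stable))            # -> True
print(prefix_fingerprint(drift_day_1) == prefix_fingerprint(drift_day_2))  # -> False
```

Adding the fingerprint to the record from log_cache_usage() makes it easy to tell a drift-induced write from an expiry-induced one.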

The second is a misjudged safety factor. Underestimate scheduler jitter and set the interval right up against the TTL, and an occasional delay crosses the boundary, turning that run into a full write. Refreshes only count if they reliably land before the cache expires.
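If you have measured how late your scheduler can fire, the interval can be derived from that figure instead of a fixed factor alone. This is a hypothetical refinement, not something the pipeline above uses. It assumes each refresh is delayed by between zero and `worst_delay_minutes`, and takes the tighter of the 0.8 rule and the TTL minus that delay.

```python
def safe_interval_minutes(ttl_minutes, worst_delay_minutes, safety=0.8):
    """Warming interval that survives the worst observed scheduler delay."""
    by_factor = int(ttl_minutes * safety)
    # interval + worst delay must stay strictly below the TTL
    by_delay = ttl_minutes - worst_delay_minutes - 1
    return max(1, min(by_factor, by_delay))

print(safe_interval_minutes(60, worst_delay_minutes=5))    # -> 48
print(safe_interval_minutes(60, worst_delay_minutes=20))   # -> 39
```

With small delays the 0.8 factor is the binding limit. Once the worst delay passes about 11 minutes on a 1-hour TTL, the delay term takes over.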

The third is exceeding the breakpoint limit. There are at most four cache breakpoints per request. Slicing blocks too finely to warm more of them hits the ceiling, and the block you expected may not be written — or may be written more than once. Keep warming to the static layer that genuinely pays off.

Start by dropping log_cache_usage() into your jobs for a few days and checking whether prefix_hit_rate sits near zero. Knowing whether the gap between your run interval and the TTL is being bridged lets you decide, from numbers rather than a guess, whether to cluster your runs or to warm.
