CLAUDE LABJP
CONFERENCE — Code w/ Claude, the annual developer conference, kicked off June 22 with keynotes, sessions, and workshopsLIMITS — Claude Code rate limits doubled and Opus API limits rose, making it easier to build reliably at scaleDESIGN — Claude Design updates add design-system alignment, tighter Claude Code sync, and direct canvas editingSANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP serversMODEL — Claude Fable 5 offers a 1M-token context, always-on adaptive thinking, and 128K outputLINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per taskCONFERENCE — Code w/ Claude, the annual developer conference, kicked off June 22 with keynotes, sessions, and workshopsLIMITS — Claude Code rate limits doubled and Opus API limits rose, making it easier to build reliably at scaleDESIGN — Claude Design updates add design-system alignment, tighter Claude Code sync, and direct canvas editingSANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP serversMODEL — Claude Fable 5 offers a 1M-token context, always-on adaptive thinking, and 128K outputLINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task
Articles/API & SDK
API & SDK/2026-06-24Advanced

My Morning Batch Was Missing the Prompt Cache Every Time — Warming Cadence and the Break-Even Math for the 1-Hour TTL

Jobs that run a few hours apart cold-miss the prompt cache even with a 1-hour TTL. Here is how to back out the right warming interval from the TTL, and how to write the break-even formula that decides whether warming pays off — with numbers from a four-site daily generation pipeline.

claude-api69prompt-caching9cache-control2scheduled-jobs2cost-optimization22

Premium Article

import { Callout } from '@/components/ui/callout';

Most advice on prompt caching is about raising the hit rate. The thing I missed for longer than I would like to admit sat one step earlier: the lifetime of the cache and the interval between my jobs simply did not line up. The pipeline that generates articles across the four Dolice Labs sites runs each site staggered by a few hours. The shared reference data and policy prompt at the front add up to more than ten thousand tokens, so caching that prefix should have been an easy win. But when I read the billing breakdown, cache_read_input_tokens was barely moving. I was rewriting the whole prefix on almost every run.

The reason was mundane: the gap between runs was longer than an hour. A 1-hour TTL does nothing for a job that fires every six hours. Even with an identical prefix, any run that lands after the TTL expires is billed as a fresh write. This article records how I worked out what warming interval bridges that time gap, and whether warming actually pays off, by turning my own numbers into a formula.

ℹ️
As a baseline, if uncached input is 1.0x, a 5-minute cache write is roughly 1.25x, a 1-hour cache write is roughly 2.0x, and a cache read is roughly 0.1x. The estimates here use those multipliers. Real prices vary by model and region, so always reconcile against your own bill.

The cache lifetime and the run interval were never aligned

A prompt cache extends its TTL every time a request hits it. Put the other way around: if no access arrives within the TTL window, the cache quietly disappears. This is where interactive apps and scheduled jobs behave very differently.

An interactive chat gets a follow-up every few seconds or minutes, so even a 5-minute TTL keeps getting refreshed on its own. A scheduled job is the opposite: if the interval between runs is longer than the TTL, every run is treated as the first one. Laid out side by side, the relationship looks like this.

PatternTypical interval5-minute TTL1-hour TTL
Interactive chatseconds to minutesalmost always hitsalways hits
Frequent batch1-3 minutesmostly hitsalways hits
Hourly batch60 minutesmostly missescoin flip at the boundary
Staggered scheduleseveral hoursalways missesalways misses

I was living in the bottom row. I had assumed a 1-hour TTL would be safe, but an hour of life never reaches a job that runs every six hours. Noticing that "run interval > TTL" relationship was the real starting point.

Measure first — log the write-to-read ratio on every run

Before reaching for a fix, pin down how much your jobs are actually rewriting. The usage object splits write and read tokens apart, so logging them on every run is enough to see reality.

import json
import time
 
def log_cache_usage(resp, job_name: str) -> dict:
    u = resp.usage
    created = getattr(u, "cache_creation_input_tokens", 0) or 0
    read = getattr(u, "cache_read_input_tokens", 0) or 0
    fresh = getattr(u, "input_tokens", 0) or 0
    total_prefix = created + read
    hit_rate = (read / total_prefix) if total_prefix else 0.0
    record = {
        "ts": time.time(),
        "job": job_name,
        "cache_creation": created,   # the expensive write
        "cache_read": read,          # the cheap read
        "uncached_input": fresh,
        "prefix_hit_rate": round(hit_rate, 3),
    }
    print(json.dumps(record, ensure_ascii=False))
    return record

If prefix_hit_rate sits near zero day after day, the cache has never been kept alive. In my case, aggregating a week of generation logs across the four sites put the average hit rate at 0.04. The ten-thousand-token prefix I was loading was being written from scratch on nearly every call. That was the moment it clicked that this was a lifetime problem, not a hit-rate problem.

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN
Understand why scheduled jobs that run hours apart structurally cold-miss the cache even on a 1-hour TTL, framed as a gap between run interval and TTL
Get the procedure and code to back out the minimum warming interval from the TTL without changing the cached prefix
Get a break-even formula that folds in the read cost of warming, plus a function that decides whether to warm or to cluster your runs instead
Secure payment via Stripe · Cancel anytime

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

or
Unlock all articles with Membership →
Share

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.

  • Copy-paste ready implementation code
  • New advanced guides published daily
  • $5/mo or $10 for lifetime access
View Membership →

Related Articles

API & SDK2026-05-29
Splitting Claude API prompt cache into 5m and 1h tiers — separate TTLs cut cost and stabilize ops
Anthropic's cache_control supports two TTLs: 5 minutes and 1 hour. Splitting them into a two-tier layout — 1h for static system/tools, 5m for variable few-shot — meaningfully changed both my costs and my on-call life. Here's the design with the numbers I observed.
API & SDK2026-04-28
Diagnosing Claude API Prompt Cache Misses — How to Read the usage Field
If your Claude API prompt cache isn't reducing your bill, the usage field is where to start. This guide walks through the five most common reasons cache_read_input_tokens stays at zero and how to fix each one.
API & SDK2026-04-26
How I Cut My Claude API Bill in Half With Prompt Caching
Done right, Anthropic's prompt caching can roughly halve your monthly API spend on workloads with long, repeated system prompts. Here is the design playbook I use after six months of running it in production.
📚RECOMMENDED BOOKS
Build a Large Language Model (From Scratch)
Sebastian Raschka
LLM Dev
Prompt Engineering for LLMs
Berryman & Ziegler
Prompting
AI Engineering
Chip Huyen
AI Eng
* Contains affiliate links
See all →