API & SDK · 2026-06-23 · Advanced

When Claude API Prompt Caching Quietly Stops Hitting in Production — Field Notes on TTL and Measured Savings

Prompt caching works beautifully the day you ship it, then quietly stops hitting in production. The five things that break the prefix, how to choose between 5-minute and 1-hour TTL, and how to measure real savings from usage instead of guessing.

Tags: prompt-caching, cost-optimization, api, production, observability

The dashboard on day one of prompt caching usually looks great. cache_read_input_tokens climbs cleanly, input cost drops. The trouble starts the following week. Traffic shifts shape a little, you slip a dynamic value into the system prompt, a deploy reorders your tool definitions — each harmless on its own, yet the cache hit rate slides down quietly. And because few people watch cache_read_input_tokens on every request, the way you find out costs went back up is the end-of-month invoice.

I lean on this feature hard across the automated operation of four technical blogs I run as an indie developer under Dolice Labs, and I have been burned more than once by "I assumed it was working, but only half the requests were actually hitting." This piece is not a feature tour. It is about what to monitor to keep the hit rate up, where to switch TTL, and how to measure the savings instead of assuming them.

Caching only fires on a byte-for-byte identical prefix

The first thing to internalize: prompt caching matches on an exact, byte-for-byte identical prefix, not on semantic similarity. The prefix is assembled in the order tools, then system, then messages, and as long as everything up to the cache breakpoint is character-for-character the same, that span is billed at the cache-read rate (about 10% of normal input). The flip side is unforgiving: change a single character near the front and everything from that point onward is a cache miss.

That "prefix from the front" property explains almost every production hit-rate problem. Something that varies per request has crept into the prefix.

from datetime import datetime

# ❌ A classic miss: a dynamic value lands inside the system prompt
system = [
    {
        "type": "text",
        # Embedding the current time means the prefix changes every call — it can never hit
        "text": f"You are a support AI. The current time is {datetime.now()}.\n\n## Knowledge base\n...",
        "cache_control": {"type": "ephemeral"},
    }
]
 
# ✅ Push dynamic values into messages, behind the breakpoint
system = [
    {
        "type": "text",
        "text": "You are a support AI.\n\n## Knowledge base\n...",  # stable part only
        "cache_control": {"type": "ephemeral"},
    }
]
messages = [
    {"role": "user", "content": f"(current time: {datetime.now()}) What's the latest incident status?"},
]

Timestamps are the obvious case, but the real culprits are sneakier: a user name or session ID injected into the system prompt, an A/B test whose branch sits at the front of the prefix, a knowledge base assembled dynamically whose array order isn't stable. None of these read as "this changes" when you skim the code. You only see them once you log.
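
One way to make that drift visible is to log a fingerprint of the cacheable material on every request. The sketch below is a local approximation only: `prefix_fingerprint` is a hypothetical helper, and the hash is not the key the API uses internally. If the fingerprint changes between two requests that you expected to share a cache, something dynamic has crept into the prefix.

```python
import hashlib
import json


def prefix_fingerprint(system: list[dict], tools: list[dict]) -> str:
    """Return a short, stable hash of the material intended to be cached."""
    # No sort_keys here: a change in key order or list order should change
    # the fingerprint, just as it changes the bytes of the request.
    payload = json.dumps({"tools": tools, "system": system}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


stable = [{"type": "text", "text": "You are a support AI."}]
drifting = [{"type": "text", "text": "You are a support AI. Session: abc123"}]

print(prefix_fingerprint(stable, []) == prefix_fingerprint(stable, []))    # True
print(prefix_fingerprint(stable, []) == prefix_fingerprint(drifting, []))  # False
```

Logging this value next to the usage figures lets you line up a hit-rate drop with the exact request where the prefix changed.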

Five forces that silently break the prefix

Here is a tidy list of the prefix-breakers I have either tripped over myself or found while debugging someone else's bill.

Force	What happens	How easy to miss
Dynamic values (time, IDs, random)	Miss every time. Cache never fires	High
Tool definition order drift	Building from a dict in a loop reorders and misses	High
Non-deterministic knowledge concatenation	set/dict iteration order changes the tail	High
TTL expiry (left idle)	Gaps over 5 min force a rewrite each time	Medium
Model string or beta header swaps	Cache namespace changes, forcing a rewrite	Medium

Order drift is especially nasty. Python dicts preserve insertion order, so problems are rare, but the moment you build a list from a set, or merge several sources into a dict before serializing to JSON, the ordering can shift subtly between runs. A prefix needs more than "same content" — it needs "same order." Wherever you assemble cacheable material, sort it explicitly.

# Anything cacheable — tool defs, knowledge fragments — should be order-stable
def build_cacheable_tools(tool_specs: dict[str, dict]) -> list[dict]:
    # Sort by key so the ordering is identical on every run
    return [tool_specs[name] for name in sorted(tool_specs)]
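
The same idea applies to knowledge fragments. A minimal sketch, assuming the fragments arrive as a dict keyed by a fragment ID (`build_knowledge_block` and the sample keys are illustrative, not from any library):

```python
def build_knowledge_block(fragments: dict[str, str]) -> str:
    """Join knowledge fragments in a fixed order (sorted by fragment ID)."""
    return "\n\n".join(f"## {key}\n{fragments[key]}" for key in sorted(fragments))


# Two sources that deliver the same fragments in a different order
a = {"refunds": "Refunds take 5 days.", "billing": "Invoices go out monthly."}
b = {"billing": "Invoices go out monthly.", "refunds": "Refunds take 5 days."}

# Sorting makes the output identical regardless of arrival order
assert build_knowledge_block(a) == build_knowledge_block(b)
```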

Choosing between 5-minute and 1-hour TTL

The cache expires. The standard ephemeral cache lapses roughly 5 minutes after the last access, and each access within that window resets the clock. So as long as the next request keeps arriving within 5 minutes, the cache stays warm; let the gap stretch and it expires, and the next request pays the write cost again (about 1.25x normal input).

In June 2026 the Developer Platform added a prompt cache of up to one hour, letting you hold a longer TTL. That helps workloads where requests arrive sporadically but reuse the same large prefix. The catch: the 1-hour cache carries a higher write cost than the 5-minute cache, so "make everything 1 hour" is the wrong reflex. Decide by request spacing.

Request spacing	Recommended TTL	Why
Seconds to minutes (chat, tight loops)	5 minutes	Each access resets the clock; you pay the write only once
5 minutes to an hour (sporadic API, low-frequency jobs)	1 hour	5 minutes would expire; fewer write rewrites
Over an hour (daily batches)	No cache, or batch it	Any TTL expires; resubmit in a tight window instead

In my own setup, a chat endpoint that runs all day keeps hitting fine on 5 minutes. A low-frequency job — an article-integrity check that only fires every hour or two — was meaningless on the 5-minute cache because it rewrote from scratch every run. Switching that one to the 1-hour cache visibly cut the rewrites of the same prefix. Splitting TTL by "always on versus occasional" has been the simplest guideline that holds up.
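
The trade-off can be put into rough numbers. The sketch below is a simplified model, not a billing calculator: it assumes evenly spaced requests, counts only the cached prefix, and uses a 2x write multiplier for the 1-hour cache, which is an assumption to check against current pricing (the 1.25x write and 0.1x read figures are the ones quoted above). Costs are expressed in units of "one uncached read of the prefix".

```python
# Multipliers relative to the base input price.
# WRITE_1H = 2.0 is an assumption; verify it against current pricing.
WRITE_5M, WRITE_1H, READ = 1.25, 2.0, 0.1
TTL_MINUTES = {"5m": 5, "1h": 60}


def prefix_cost(ttl: str, gap_minutes: float, requests: int) -> float:
    """Cost of the cached prefix over `requests` evenly spaced calls."""
    write = WRITE_5M if ttl == "5m" else WRITE_1H
    if gap_minutes >= TTL_MINUTES[ttl]:
        # The cache expires between calls, so every request pays the write again
        return requests * write
    # The first call writes, every later call reads
    return write + (requests - 1) * READ


for gap in (1, 20, 90):
    costs = {ttl: round(prefix_cost(ttl, gap, 10), 2) for ttl in TTL_MINUTES}
    print(f"gap {gap:>2} min, 10 requests: {costs} (no cache: 10.0)")
```

Under these assumptions, the 5-minute cache wins at a 1-minute gap, the 1-hour cache wins at a 20-minute gap, and at a 90-minute gap both are more expensive than not caching at all, which matches the three rows of the table.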

Turning "it should be working" into a measurement

This is the part I most want to land. The dangerous thing about prompt caching is that breaking it throws no exception. A miss still succeeds — it just quietly gets billed at full price. So you have to replace the assumption "it should be working" with a real number from usage, and you have to instrument that from the start.

The response usage carries three figures: cache_creation_input_tokens (writes), cache_read_input_tokens (reads, i.e. hits), and input_tokens (uncached normal input). Record those three on every request and you can compute hit rate and savings accurately after the fact.

import anthropic
 
# Approximate Sonnet 4.6 unit prices (USD / 1M tokens). Keep current in production.
PRICE_INPUT = 3.00          # normal input
PRICE_CACHE_WRITE = 3.75    # cache write (1.25x)
PRICE_CACHE_READ = 0.30     # cache read (0.1x)
 
 
def record_usage(usage) -> dict:
    """Convert one request's usage into actual cost and a no-cache baseline."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    normal = usage.input_tokens
 
    # What you actually paid
    actual = (
        normal * PRICE_INPUT
        + write * PRICE_CACHE_WRITE
        + read * PRICE_CACHE_READ
    ) / 1_000_000
 
    # Hypothetical cost with no caching (read/write billed at normal rate too)
    baseline = (normal + write + read) * PRICE_INPUT / 1_000_000
 
    cached_tokens = read + write
    hit_rate = read / cached_tokens if cached_tokens else 0.0
 
    return {
        "actual_usd": round(actual, 6),
        "baseline_usd": round(baseline, 6),
        "saved_usd": round(baseline - actual, 6),
        "cache_hit_rate": round(hit_rate, 3),
    }

Run record_usage on every request and stream saved_usd and cache_hit_rate as a time series, and shifts become visible — "hit rate dropped from 0.9 to 0.4 last Tuesday." Adding this is exactly how I discovered that one deploy had reordered my tool-definition assembly and started missing. No exception, no warning; without the instrumentation I wouldn't have known until the invoice.

Wire a threshold into an alert and operations get noticeably calmer.

def check_cache_health(metrics: list[dict], min_hit_rate: float = 0.6) -> str | None:
    """Detect a hit-rate drop from recent metrics. Return a reason string if there's a problem."""
    recent = metrics[-50:]  # only the 50 most recent requests
    # Drop empty records (zero hit rate and zero cost). Every billed request
    # stays in, including full misses, so misses pull the average down.
    billed = [m for m in recent if m["cache_hit_rate"] > 0 or m["actual_usd"] > 0]
    if not billed:
        return None
    # Unweighted mean of per-request hit rates
    avg_hit = sum(m["cache_hit_rate"] for m in billed) / len(billed)
    if avg_hit < min_hit_rate:
        return f"Cache hit rate degraded: recent avg {avg_hit:.1%} (threshold {min_hit_rate:.0%})"
    return None

A "normal" hit rate depends on the workload, so the practical move is to learn your system's baseline over the first week or two and then set a loose threshold that fires only on a clear drop. Set it tight from day one and natural traffic jitter will keep it ringing until everyone tunes it out.
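
One possible way to derive that loose threshold from the observation period is sketched here. `learn_threshold`, the 25% slack, and the sample numbers are all illustrative assumptions; tune them to your own traffic.

```python
import statistics


def learn_threshold(baseline_hit_rates: list[float], slack: float = 0.25) -> float:
    """Derive a loose alert threshold from an observation period.

    `slack` is the fraction below the baseline median that still counts as normal.
    """
    if not baseline_hit_rates:
        raise ValueError("need at least one observation")
    median = statistics.median(baseline_hit_rates)
    return round(median * (1 - slack), 3)


# Made-up daily hit rates from a first observation week
first_week = [0.91, 0.88, 0.93, 0.86, 0.90, 0.89, 0.92]
print(learn_threshold(first_week))  # 0.675
```

The median is used rather than the mean so that a single bad day during the observation period does not drag the threshold down.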

Place breakpoints along a "staircase of change frequency"

A real lever for lifting the hit rate is ordering breakpoints by how often each part changes. You get up to four breakpoints, but order matters more than count: put the rarely changing things first and the frequently changing things last, in a staircase.

system = [
    {
        "type": "text",
        "text": BASE_INSTRUCTIONS,   # nearly immutable; touched maybe twice a year
        "cache_control": {"type": "ephemeral"},   # BP1
    },
    {
        "type": "text",
        "text": weekly_knowledge_base,   # updated weekly; changes more than the base
        "cache_control": {"type": "ephemeral"},   # BP2
    },
]
# tools precede system in the cached prefix and are stable week to week,
# so they get their own breakpoint at the tail of the tools list
tools[-1]["cache_control"] = {"type": "ephemeral"}   # BP3 (earliest in prefix order)

This way, updating the weekly knowledge base invalidates from BP2 onward but keeps BP1 (the base instructions) hitting. Collapse everything into one block instead, and a small edit at the tail forces the whole thing to rewrite, taking the stable part down with it. Splitting blocks "by the granularity at which things change" is what keeps caching effective over time.
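
If the segments come from several places, the staircase can be enforced in code rather than by convention. A minimal sketch, assuming you maintain a rough changes-per-year estimate for each segment yourself (`build_system_blocks` and that estimate are hypothetical, not part of the SDK):

```python
MAX_BREAKPOINTS = 4  # the limit mentioned above


def build_system_blocks(segments: list[tuple[str, int]]) -> list[dict]:
    """Order segments from least to most frequently changed and mark each as cacheable.

    Each segment is (text, expected_changes_per_year).
    """
    if len(segments) > MAX_BREAKPOINTS:
        raise ValueError(f"at most {MAX_BREAKPOINTS} breakpoints are available")
    # sorted() is stable, so segments with equal estimates keep their input order
    ordered = sorted(segments, key=lambda seg: seg[1])
    return [
        {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        for text, _ in ordered
    ]


blocks = build_system_blocks([("weekly knowledge base", 52), ("base instructions", 2)])
print([b["text"] for b in blocks])  # ['base instructions', 'weekly knowledge base']
```

Note that the limit check here counts only system blocks; a breakpoint placed on tools draws from the same budget of four.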

One more note: if you combine a tool-side optimization flag such as Token-Efficient Tool Use, changing the header or beta spec can shift the cache namespace and force a rewrite. Do that kind of flag experimentation outside the hours when your production cache is warm.

A minimal pre-ship check

Finally, the items I verify every time I start using prompt caching on a new endpoint. Are there any dynamic values inside the cached span (system, tools, the front of messages)? Is the assembly of cacheable material order-stable? Does the TTL choice match the request spacing? And are the three usage figures flowing to logs so you can see cache_hit_rate and saved_usd? With those four in place, when the hit rate drops you notice it on a dashboard rather than on an invoice.
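
The first check can be partly automated with a crude scan of the cached text for values that look per-request. This is a heuristic sketch only: the two patterns are examples I chose for illustration, they will miss plenty of dynamic values, and a static date in your knowledge base would be a false positive.

```python
import re

# Example patterns for values that usually vary per request
VOLATILE_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}"),
    "uuid": re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),
}


def find_volatile_values(cached_text: str) -> list[str]:
    """Return the names of patterns that match somewhere in the cached text."""
    return [name for name, pattern in VOLATILE_PATTERNS.items() if pattern.search(cached_text)]


print(find_volatile_values("You are a support AI."))
# []
print(find_volatile_values("You are a support AI. The current time is 2026-06-23 10:15:30."))
# ['iso_timestamp']
```

Running a scan like this in CI against the assembled system prompt catches the obvious cases before they reach production; the usage metrics remain the real safety net.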

Prompt caching is a feature where keeping the savings weeks later is far harder than getting them on day one. Build the monitoring in alongside it — from the experience of leaking cost in exactly the same spot, that is the part I now consider most worth doing. Thank you for reading.
