API & SDK · 2026-06-22 · Advanced

When Your Claude API Cost Math Doesn't Match the Bill: Accounting for the Four Token Buckets

Turn on prompt caching and your homegrown cost tally drifts from the console bill. Here is how to weight the four token buckets the usage object returns and build a ledger you can reconcile.

As an indie developer, I run a daily digest across several of my own apps on Claude, and one month my own cost tally was off from the console bill by roughly ten percent.

I traced the logs. Request counts were right. Token counts were right. It still didn't add up.

There was exactly one cause. I was computing cost from input_tokens + output_tokens only. The moment I enabled prompt caching, that simple formula broke silently.

Cached tokens are not in input_tokens

This was my first wrong assumption.

The usage object splits input tokens by role. Tokens served from cache are not included in input_tokens. They land in separate fields.

# What usage actually looks like with caching on
usage = {
    "input_tokens": 412,                  # only uncached, regular input
    "cache_creation_input_tokens": 18500, # writes to cache (expensive)
    "cache_read_input_tokens": 17800,     # reads from cache (cheap)
    "output_tokens": 1240,
}

So if you cache a long system prompt, its body never shows up in input_tokens at all. Pricing off input_tokens alone misses the tens of thousands of tokens sitting in the cache.

In my case the shared digest prompt is about 18,000 tokens. It gets written as cache_creation on every cold start, then read back as cache_read on later calls. The naive formula ignored both.
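To size the blind spot, here is a minimal sketch that counts how many input-side tokens a tally based on input_tokens alone never sees. It reuses the usage values shown above; the helper name total_input_tokens is hypothetical. The sum is for counting only, since the buckets bill at different rates and should not be priced as one number.

```python
# Usage values copied from the example above.
usage = {
    "input_tokens": 412,
    "cache_creation_input_tokens": 18500,
    "cache_read_input_tokens": 17800,
    "output_tokens": 1240,
}


def total_input_tokens(usage: dict) -> int:
    """All input-side tokens: uncached + cache writes + cache reads.

    For sizing only; do not multiply this sum by a single rate.
    """
    return (
        usage.get("input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
        + usage.get("cache_read_input_tokens", 0)
    )


naive_input = usage["input_tokens"]
actual_input = total_input_tokens(usage)
missed = actual_input - naive_input

print(f"naive tally sees : {naive_input} input tokens")
print(f"actually sent    : {actual_input} input tokens")
print(f"never counted    : {missed} ({missed / actual_input:.1%})")
```

With these numbers the naive tally sees 412 of 36,712 input-side tokens.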

Each bucket bills at a different rate

The key to matching the books is understanding that the four buckets do not share one rate.

Cache reads and writes bill as a multiple of the base input rate. The multipliers are stable; even when prices change, the ratios rarely do.

Bucket	usage field	Multiplier on base input rate
Regular input	input_tokens	1.0×
Cache write (5-min TTL)	cache_creation_input_tokens	1.25×
Cache write (1-hour TTL)	cache_creation_input_tokens	2.0×
Cache read	cache_read_input_tokens	0.1×
Output	output_tokens	output rate (separate)

Reads are a tenth of base input. Writes are 1.25× to 2×. Treat them uniformly and the calls where caching is working drift the most. Over-price the reads and you over-count; price the writes at base and you under-count.
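As a quick illustration of the table, the sketch below turns the multipliers into effective per-million-token prices. The $3.00 base input price is an illustrative assumption, not a quoted rate, and effective_rates is a hypothetical helper name.

```python
# Illustrative base input price in USD per million tokens (an assumption).
BASE_INPUT_PER_MTOK = 3.00

# Multipliers from the table above, relative to the base input rate.
MULTIPLIERS = {
    "regular input": 1.0,
    "cache write (5-min TTL)": 1.25,
    "cache write (1-hour TTL)": 2.0,
    "cache read": 0.1,
}


def effective_rates(base: float) -> dict[str, float]:
    """Return the effective USD-per-million-token price of each input bucket."""
    return {bucket: base * mult for bucket, mult in MULTIPLIERS.items()}


for bucket, rate in effective_rates(BASE_INPUT_PER_MTOK).items():
    print(f"{bucket:<26} ${rate:.2f} / MTok")
```

At a $3.00 base, a cache read works out to $0.30 per million tokens and a 1-hour write to $6.00, a twentyfold spread within what is nominally all "input".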

A function that turns usage into cost

First, put the rate table in one place. Always confirm the actual rates on the current pricing page and put them here. The numbers below are illustrative, to show the structure.

from dataclasses import dataclass
 
# Price per million tokens (USD) — replace with current official pricing
RATES = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4-5":  {"input": 0.80, "output": 4.00},
}
 
# Cache multipliers (ratio to base input rate, fairly stable)
CACHE_WRITE_5M = 1.25
CACHE_WRITE_1H = 2.00
CACHE_READ     = 0.10
 
 
@dataclass
class CostBreakdown:
    input: float
    cache_write: float
    cache_read: float
    output: float
 
    @property
    def total(self) -> float:
        return self.input + self.cache_write + self.cache_read + self.output
 
 
def cost_from_usage(usage: dict, model: str, cache_ttl: str = "5m") -> CostBreakdown:
    """Return per-bucket cost in USD for a single response's usage."""
    if model not in RATES:
        raise ValueError(f"Unregistered model: {model} (add its rate to RATES)")
 
    in_rate = RATES[model]["input"] / 1_000_000
    out_rate = RATES[model]["output"] / 1_000_000
    write_mult = CACHE_WRITE_1H if cache_ttl == "1h" else CACHE_WRITE_5M
 
    return CostBreakdown(
        input=usage.get("input_tokens", 0) * in_rate,
        cache_write=usage.get("cache_creation_input_tokens", 0) * in_rate * write_mult,
        cache_read=usage.get("cache_read_input_tokens", 0) * in_rate * CACHE_READ,
        output=usage.get("output_tokens", 0) * out_rate,
    )

The point here is that the four buckets are summed at independent rates. input_tokens and cache_creation_input_tokens are both "input," but they bill differently, so you must never add them together before multiplying.

Run it on the usage from earlier:

usage = {
    "input_tokens": 412,
    "cache_creation_input_tokens": 18500,
    "cache_read_input_tokens": 17800,
    "output_tokens": 1240,
}
 
b = cost_from_usage(usage, "claude-sonnet-4-6")
print(f"regular input : ${b.input:.5f}")
print(f"cache write   : ${b.cache_write:.5f}")
print(f"cache read    : ${b.cache_read:.5f}")
print(f"output        : ${b.output:.5f}")
print(f"total         : ${b.total:.5f}")

A naive formula (counting only 412 + 1240) drops the 18,500 cache-write tokens entirely, along with the 17,800 cache-read tokens. Stacked across every cold start, the missing writes become a real gap by month-end. That single omission was almost my whole discrepancy.
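To put a number on the gap for this one call, the following self-contained sketch prices the same usage both ways. The rates are the illustrative Sonnet figures from the table above, the 5-minute write multiplier is assumed, and naive_cost and four_bucket_cost are hypothetical names for this comparison only.

```python
# Illustrative rates in USD per token (assumed, not quoted prices).
IN_RATE = 3.00 / 1_000_000
OUT_RATE = 15.00 / 1_000_000
CACHE_WRITE_5M = 1.25
CACHE_READ = 0.10

usage = {
    "input_tokens": 412,
    "cache_creation_input_tokens": 18500,
    "cache_read_input_tokens": 17800,
    "output_tokens": 1240,
}


def naive_cost(usage: dict) -> float:
    """input_tokens + output_tokens only: the formula that drifts."""
    return usage["input_tokens"] * IN_RATE + usage["output_tokens"] * OUT_RATE


def four_bucket_cost(usage: dict) -> float:
    """Each bucket priced at its own rate, then summed."""
    return (
        usage.get("input_tokens", 0) * IN_RATE
        + usage.get("cache_creation_input_tokens", 0) * IN_RATE * CACHE_WRITE_5M
        + usage.get("cache_read_input_tokens", 0) * IN_RATE * CACHE_READ
        + usage.get("output_tokens", 0) * OUT_RATE
    )


naive = naive_cost(usage)
full = four_bucket_cost(usage)
print(f"naive       : ${naive:.6f}")
print(f"four-bucket : ${full:.6f}")
print(f"naive captures {naive / full:.0%} of the real cost")
```

Under these assumed rates the naive figure is about $0.0198 against a four-bucket total of about $0.0946 for a cold-start call like this one.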

Handle 1-hour TTL and the cache_creation breakdown

If you mix in a 1-hour TTL, the write multiplier changes. On newer API responses, usage may also carry a cache_creation object that breaks the written tokens down by TTL.

# usage when the breakdown is present
usage = {
    "input_tokens": 412,
    "cache_creation_input_tokens": 18500,  # flat total (sum of the breakdown)
    "cache_creation": {
        "ephemeral_5m_input_tokens": 12000,
        "ephemeral_1h_input_tokens": 6500,
    },
    "cache_read_input_tokens": 17800,
    "output_tokens": 1240,
}
 
 
def cache_write_cost(usage: dict, in_rate: float) -> float:
    """Account for 5-min and 1-hour writes at their own multipliers."""
    detail = usage.get("cache_creation")
    if isinstance(detail, dict):
        five_m = detail.get("ephemeral_5m_input_tokens", 0)
        one_h = detail.get("ephemeral_1h_input_tokens", 0)
        return five_m * in_rate * CACHE_WRITE_5M + one_h * in_rate * CACHE_WRITE_1H
    # No breakdown: apply a single multiplier to the flat total
    flat = usage.get("cache_creation_input_tokens", 0)
    return flat * in_rate * CACHE_WRITE_5M

When the breakdown is there, split by TTL and add each at its own multiplier; when it isn't, fall back to the flat total as before. This two-path design survives mixed response shapes. Pin cache_creation at a flat 1.25× and you under-price the 1-hour portion by 37.5 percent (1.25 / 2.0 = 0.625).
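The size of that under-pricing can be checked directly. The sketch below prices the breakdown from the example both ways, split by TTL and pinned at the 5-minute multiplier. The base rate is the illustrative $3.00 per million tokens, and split_cost and flat_cost are hypothetical helper names.

```python
# Illustrative base input price (USD per token) and the write multipliers.
IN_RATE = 3.00 / 1_000_000
CACHE_WRITE_5M = 1.25
CACHE_WRITE_1H = 2.00

breakdown = {
    "ephemeral_5m_input_tokens": 12000,
    "ephemeral_1h_input_tokens": 6500,
}


def split_cost(detail: dict) -> float:
    """Price 5-minute and 1-hour writes at their own multipliers."""
    five_m = detail.get("ephemeral_5m_input_tokens", 0)
    one_h = detail.get("ephemeral_1h_input_tokens", 0)
    return five_m * IN_RATE * CACHE_WRITE_5M + one_h * IN_RATE * CACHE_WRITE_1H


def flat_cost(detail: dict) -> float:
    """Price every written token at the 5-minute multiplier (the shortcut)."""
    total = sum(detail.values())
    return total * IN_RATE * CACHE_WRITE_5M


split = split_cost(breakdown)
flat = flat_cost(breakdown)
shortfall_1h = 1 - CACHE_WRITE_5M / CACHE_WRITE_1H  # share of 1h cost missed

print(f"split by TTL : ${split:.6f}")
print(f"flat 1.25x   : ${flat:.6f}")
print(f"1-hour portion under-priced by {shortfall_1h:.1%}")
```

For this call the flat shortcut comes to about $0.0694 against $0.0840 when split, with the whole difference coming from the 6,500 one-hour tokens.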

Log one ledger row per call

Once cost is accurate, keep it in a form you can reconcile. I append one row per response.

import json
import time
 
def log_cost_row(path: str, request_id: str, model: str,
                 usage: dict, breakdown: CostBreakdown,
                 feature: str) -> None:
    row = {
        "ts": time.time(),
        "request_id": request_id,   # put response.id here
        "model": model,
        "feature": feature,         # which feature made the call (for per-feature rollups)
        "tokens": {
            "input": usage.get("input_tokens", 0),
            "cache_write": usage.get("cache_creation_input_tokens", 0),
            "cache_read": usage.get("cache_read_input_tokens", 0),
            "output": usage.get("output_tokens", 0),
        },
        "usd": round(breakdown.total, 6),
    }
    with open(path, "a") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

Putting response.id in request_id is the load-bearing detail. When you reconcile against the console or other logs, you can trace any row to a single call. Adding feature lets you answer "which feature ate most of the cost" at month-end. For me, this rollup was the first time I saw that the digest's summarization stage was seventy percent of the total.
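A per-feature rollup over that ledger can be as small as the sketch below. It assumes the row shape written by log_cost_row (a "feature" string and a "usd" number per JSON line); rollup_by_feature is a hypothetical name, and the sample rows are made-up values for illustration.

```python
import json
from collections import defaultdict
from typing import Iterable


def rollup_by_feature(lines: Iterable[str]) -> dict[str, float]:
    """Sum the usd column per feature from JSON-lines ledger rows."""
    totals: dict[str, float] = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines in the ledger
            continue
        row = json.loads(line)
        totals[row.get("feature", "unknown")] += row.get("usd", 0.0)
    return dict(totals)


# Made-up sample rows; in practice pass an open ledger file instead.
sample = [
    '{"feature": "summarize", "usd": 0.70}',
    '{"feature": "classify", "usd": 0.20}',
    '{"feature": "summarize", "usd": 0.35}',
]

totals = rollup_by_feature(sample)
grand_total = sum(totals.values())
for feature, usd in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature:<12} ${usd:.4f}  ({usd / grand_total:.0%})")
```

Because a file object iterates line by line, the same function works as rollup_by_feature(open(path)) on the real ledger.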

Monthly reconciliation is simple: sum the ledger's usd, line it up against the console bill. Within a few percent means your accounting is sound. A large gap is usually a swapped cache multiplier or a missing rate for an unregistered model.
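The comparison itself can be sketched as a small check. The 3 percent tolerance is an assumption standing in for "a few percent", the billed figure is whatever you read off the console by hand, and reconcile is a hypothetical name.

```python
def reconcile(ledger_total: float, billed_total: float,
              tolerance: float = 0.03) -> tuple[float, bool]:
    """Return (relative gap, within tolerance?) for one month.

    A negative gap means the ledger under-counts relative to the bill.
    """
    if billed_total <= 0:
        raise ValueError("billed_total must be positive")
    gap = (ledger_total - billed_total) / billed_total
    return gap, abs(gap) <= tolerance


# Made-up figures: a ledger that misses roughly ten percent of the bill.
gap, ok = reconcile(ledger_total=45.10, billed_total=50.00)
print(f"gap: {gap:+.1%}  within tolerance: {ok}")
```

A consistently negative gap points at under-counting, which is the direction a missing cache-write bucket pushes the tally.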

What the pricing page doesn't tell you

A few things that are easy to miss until you run this in production.

First, cache writes cost you for what you wrote, hit or miss. If you keep changing what you cache, the write cost can exceed the read savings. Cache only stable system prompts as a baseline.

Second, when input_tokens is tiny but cost won't drop, cache_creation is the culprit. That's a sign of frequent cold starts, so revisit your TTL choice.

Third, if you mix models, having cost_from_usage raise ValueError for a model missing from RATES is a safety net. It stops the classic accident of migrating to a new model while still costing it at the old rate.
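The first point, write cost versus read savings, can be made concrete with a break-even count. This is a sketch under simplifying assumptions: the prefix is written once, every later call reads it while the entry is still alive, and the alternative is re-sending the same prefix uncached at 1.0× each time. reads_to_break_even is a hypothetical name.

```python
def reads_to_break_even(write_mult: float, read_mult: float = 0.10) -> int:
    """Smallest number of cache reads at which caching beats not caching.

    Costs are in units of "one uncached send of the prefix":
      without cache: (1 + n) sends at 1.0 each
      with cache   : one write at write_mult, then n reads at read_mult
    """
    if read_mult >= 1.0:
        raise ValueError("reads must be cheaper than uncached input")
    n = 0
    while write_mult + n * read_mult >= 1 + n:
        n += 1
    return n


print("5-min TTL :", reads_to_break_even(1.25), "read(s)")
print("1-hour TTL:", reads_to_break_even(2.00), "read(s)")
```

Under these assumptions a 5-minute write pays for itself on the first read and a 1-hour write on the second. Content that changes before it is read back never reaches either point.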

Recommendations by situation

If your calls are essentially one feature and you don't cache, the naive input_tokens + output_tokens formula is plenty. Don't build a ledger you don't need.

The moment you enable caching, switch to four-bucket accounting. That's the dividing line.

If several features or apps share one API key, add the feature-tagged ledger from day one. Reconstructing per-feature spend after the fact is nearly impossible.

Start by pulling one recent response and printing the four usage fields. If cache_read_input_tokens carries a number that your tally never reflects, that's where your gap begins.
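A minimal sketch of that first check follows. Depending on the client, usage may arrive as a plain dict or as an object with attributes, so this assumes either shape and reads both. The helper name usage_buckets is hypothetical, and the SimpleNamespace below is a stand-in for a real response's usage, not an SDK type.

```python
from types import SimpleNamespace

FIELDS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)


def usage_buckets(usage) -> dict[str, int]:
    """Extract the four buckets from a dict or attribute-style usage object.

    Missing or None fields are reported as 0.
    """
    buckets = {}
    for name in FIELDS:
        if isinstance(usage, dict):
            value = usage.get(name)
        else:
            value = getattr(usage, name, None)
        buckets[name] = value or 0
    return buckets


# Stand-in for response.usage from a recent call (values from the example).
recent = SimpleNamespace(
    input_tokens=412,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=17800,
    output_tokens=1240,
)

for name, count in usage_buckets(recent).items():
    print(f"{name:<30} {count}")
```

The returned dict has the same keys cost_from_usage reads with usage.get, so it can be passed straight through.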

If your own tally keeps drifting from the bill, I hope this gives you a concrete first step toward reconciling it.
