●MODEL — Claude Opus 4.8 improves coding, agentic, and professional work, with consistency for long-running tasks●PLATFORM — The Developer Platform adds code execution, an MCP connector, a Files API, and prompt caching up to one hour●SANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers (Cloudflare/Daytona/Modal/Vercel)●MODEL — Fable 5 (1M-token context, always-on adaptive thinking) was suspended on June 12 under a US export-control directive●LINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task●MCP — Enterprise-managed MCP connectors (Okta) enable zero-touch access (Team/Enterprise beta)●MODEL — Claude Opus 4.8 improves coding, agentic, and professional work, with consistency for long-running tasks●PLATFORM — The Developer Platform adds code execution, an MCP connector, a Files API, and prompt caching up to one hour●SANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers (Cloudflare/Daytona/Modal/Vercel)●MODEL — Fable 5 (1M-token context, always-on adaptive thinking) was suspended on June 12 under a US export-control directive●LINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task●MCP — Enterprise-managed MCP connectors (Okta) enable zero-touch access (Team/Enterprise beta)
When Your Claude API Cost Math Doesn't Match the Bill: Accounting for the Four Token Buckets
Turn on prompt caching and your homegrown cost tally drifts from the console bill. Here is how to weight the four token buckets the usage object returns and build a ledger you can reconcile.
As an indie developer, I run a daily digest across several of my own apps on Claude, and one month my own cost tally was off from the console bill by roughly ten percent.
I traced the logs. Request counts were right. Token counts were right. It still didn't add up.
There was exactly one cause. I was computing cost from input_tokens + output_tokens only. The moment I enabled prompt caching, that simple formula broke silently.
Cached tokens are not in input_tokens
This was my first wrong assumption.
The usage object splits input tokens by role. Tokens served from cache are not included in input_tokens. They land in separate fields.
# What usage actually looks like with caching onusage = { "input_tokens": 412, # only uncached, regular input "cache_creation_input_tokens": 18500, # writes to cache (expensive) "cache_read_input_tokens": 17800, # reads from cache (cheap) "output_tokens": 1240,}
So if you cache a long system prompt, its body never shows up in input_tokens at all. Pricing off input_tokens alone misses the tens of thousands of tokens sitting in the cache.
In my case the shared digest prompt is about 18,000 tokens. It gets written as cache_creation on every cold start, then read back as cache_read on later calls. The naive formula ignored both.
Each bucket bills at a different rate
The key to matching the books is understanding that the four buckets do not share one rate.
Cache reads and writes bill as a multiple of the base input rate. The multipliers are stable; even when prices change, the ratios rarely do.
Bucket
usage field
Multiplier on base input rate
Regular input
input_tokens
1.0×
Cache write (5-min TTL)
cache_creation_input_tokens
1.25×
Cache write (1-hour TTL)
cache_creation_input_tokens
2.0×
Cache read
cache_read_input_tokens
0.1×
Output
output_tokens
output rate (separate)
Reads are a tenth of base input. Writes are 1.25× to 2×. Treat them uniformly and the calls where caching is working drift the most. Over-price the reads and you over-count; price the writes at base and you under-count.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦Why cached tokens never appear in input_tokens, and how that quietly breaks naive cost math
✦A Python implementation that accounts for cache writes and reads at their correct multipliers
✦Logging one ledger row per call and reconciling against the console bill each month
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
First, put the rate table in one place. Always confirm the actual rates on the current pricing page and put them here. The numbers below are illustrative, to show the structure.
from dataclasses import dataclass# Price per million tokens (USD) — replace with current official pricingRATES = { "claude-sonnet-4-6": {"input": 3.00, "output": 15.00}, "claude-haiku-4-5": {"input": 0.80, "output": 4.00},}# Cache multipliers (ratio to base input rate, fairly stable)CACHE_WRITE_5M = 1.25CACHE_WRITE_1H = 2.00CACHE_READ = 0.10@dataclassclass CostBreakdown: input: float cache_write: float cache_read: float output: float @property def total(self) -> float: return self.input + self.cache_write + self.cache_read + self.outputdef cost_from_usage(usage: dict, model: str, cache_ttl: str = "5m") -> CostBreakdown: """Return per-bucket cost in USD for a single response's usage.""" if model not in RATES: raise ValueError(f"Unregistered model: {model} (add its rate to RATES)") in_rate = RATES[model]["input"] / 1_000_000 out_rate = RATES[model]["output"] / 1_000_000 write_mult = CACHE_WRITE_1H if cache_ttl == "1h" else CACHE_WRITE_5M return CostBreakdown( input=usage.get("input_tokens", 0) * in_rate, cache_write=usage.get("cache_creation_input_tokens", 0) * in_rate * write_mult, cache_read=usage.get("cache_read_input_tokens", 0) * in_rate * CACHE_READ, output=usage.get("output_tokens", 0) * out_rate, )
The point here is that the four buckets are summed at independent rates. input_tokens and cache_creation_input_tokens are both "input," but they bill differently, so you must never add them together before multiplying.
A naive formula (counting only 412 + 1240) drops the ~18,500 cache-write tokens entirely. Stacked across every cold start, that becomes a real gap by month-end. That single omission was almost my whole discrepancy.
Handle 1-hour TTL and the cache_creation breakdown
If you mix in a 1-hour TTL, the write multiplier changes. On newer API responses, cache_creation may come back with a breakdown.
# usage when the breakdown is presentusage = { "input_tokens": 412, "cache_creation": { "ephemeral_5m_input_tokens": 12000, "ephemeral_1h_input_tokens": 6500, }, "cache_read_input_tokens": 17800, "output_tokens": 1240,}def cache_write_cost(usage: dict, in_rate: float) -> float: """Account for 5-min and 1-hour writes at their own multipliers.""" detail = usage.get("cache_creation") if isinstance(detail, dict): five_m = detail.get("ephemeral_5m_input_tokens", 0) one_h = detail.get("ephemeral_1h_input_tokens", 0) return five_m * in_rate * CACHE_WRITE_5M + one_h * in_rate * CACHE_WRITE_1H # No breakdown: apply a single multiplier to the flat total flat = usage.get("cache_creation_input_tokens", 0) return flat * in_rate * CACHE_WRITE_5M
When the breakdown is there, split by TTL and add each at its own multiplier; when it isn't, fall back to the flat total as before. This two-path design survives mixed response shapes. Pin cache_creation at a flat 1.25× and you under-price the 1-hour portion by about forty percent.
Log one ledger row per call
Once cost is accurate, keep it in a form you can reconcile. I append one row per response.
import json, timedef log_cost_row(path: str, request_id: str, model: str, usage: dict, breakdown: CostBreakdown, feature: str) -> None: row = { "ts": time.time(), "request_id": request_id, # put response.id here "model": model, "feature": feature, # which feature made the call (for per-feature rollups) "tokens": { "input": usage.get("input_tokens", 0), "cache_write": usage.get("cache_creation_input_tokens", 0), "cache_read": usage.get("cache_read_input_tokens", 0), "output": usage.get("output_tokens", 0), }, "usd": round(breakdown.total, 6), } with open(path, "a") as f: f.write(json.dumps(row, ensure_ascii=False) + "\n")
Putting response.id in request_id is the load-bearing detail. When you reconcile against the console or other logs, you can trace any row to a single call. Adding feature lets you answer "which feature ate most of the cost" at month-end. For me, this rollup was the first time I saw that the digest's summarization stage was seventy percent of the total.
Monthly reconciliation is simple: sum the ledger's usd, line it up against the console bill. Within a few percent means your accounting is sound. A large gap is usually a swapped cache multiplier or a missing rate for an unregistered model.
What the pricing page doesn't tell you
A few things that are easy to miss until you run this in production.
First, cache writes cost you for what you wrote, hit or miss. If you keep changing what you cache, the write cost can exceed the read savings. Cache only stable system prompts as a baseline.
Second, when input_tokens is tiny but cost won't drop, cache_creation is the culprit. That's a sign of frequent cold starts, so revisit your TTL choice.
Third, if you mix models, having RATES raise ValueError on an unregistered model is a safety net. It stops the classic accident of migrating to a new model while still costing it at the old rate.
Recommendations by situation
If your calls are essentially one feature and you don't cache, the naive input_tokens + output_tokens formula is plenty. Don't build a ledger you don't need.
The moment you enable caching, switch to four-bucket accounting. That's the dividing line.
If several features or apps share one API key, add the feature-tagged ledger from day one. Reconstructing per-feature spend after the fact is nearly impossible.
Start by pulling one recent response and printing the four usage fields. If cache_read_input_tokens carries a number that your tally never reflects, that's where your gap begins.
If your own tally keeps drifting from the bill, I hope this gives you a concrete first step toward reconciling it.
Share
Thank You for Reading
Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.