Claude Code / 2026-06-16 / Intermediate

Two Days Into the Billing Change: How Far My Headless Costs Drifted From the Estimate

On June 15 the billing change moved headless execution onto separate monthly credits. Two days in, I broke down cost per pipeline stage and found one stage running at roughly double my estimate. Here is how I measured it, and why I moved one stage back to my subscription.


On the morning of June 15, I opened the pipeline logs as usual, and the first thing I checked was the credit balance.

The billing change had taken effect that day. Headless claude -p runs, the Agent SDK, and GitHub Actions had moved off my subscription limits and onto a separate pool of monthly credits with no rollover. Until the night before, I had trusted my own estimate: I had eyeballed a rough cost for each stage and assumed the month would fit inside the budget.

Then I tallied the numbers on the second morning, and one stage alone was burning through roughly twice what I had estimated.

What bothered me was not the number itself, but why I had misread it. So here is a plain record of how I measured cost per stage, and the decision that followed: moving exactly one stage back to my subscription. If you are also rethinking the cost of an automated setup, I hope this gives you something to go on.

You cannot measure anything until you split the stages

I run the four sites of Dolice Labs on an automated pipeline. Article generation looks like a single script, but inside it splits into a few independent stages.

  • Topic selection (read reference data, avoid overlap with existing posts, decide a theme)
  • Body generation (two drafts, Japanese and English)
  • Quality gates (template detection, integrity checks)
  • Push and cleanup (commit, log the run)

Before the billing change, I treated all of this as "X per article." As long as it ran inside my subscription, there was no reason to know the breakdown.

A separate pool of monthly credits changes that. Spending the budget without knowing which stage is heavy is like watching only your total electricity bill while having no idea which appliance is eating it. I needed to split the stages first, then attribute credits to each one.

A small wrapper that attributes cost per stage

I did not build a heavy measurement system. I simply wrapped each place that calls claude -p with a thin shell wrapper. It takes a stage name before the call and, when the call finishes, appends one line with the elapsed time and the token counts. The cost estimate is computed later from those counts.

#!/usr/bin/env bash
# run_stage.sh — run claude -p with a stage label and log usage in one line
# usage: ./run_stage.sh "topic-select" "prompt text"
set -euo pipefail
 
STAGE="$1"
PROMPT="$2"
LOG="${HOME}/pipeline_cost/$(TZ=Asia/Tokyo date +%Y-%m-%d).tsv"
mkdir -p "$(dirname "$LOG")"
 
start=$(date +%s)
 
# --output-format json returns usage so we can separate it from the result
result=$(claude -p "$PROMPT" --output-format json)
 
end=$(date +%s)
elapsed=$((end - start))
 
# pull token counts out of usage (confirm the field names in your setup)
in_tok=$(echo "$result"  | jq -r '.usage.input_tokens  // 0')
out_tok=$(echo "$result" | jq -r '.usage.output_tokens // 0')
cache_read=$(echo "$result" | jq -r '.usage.cache_read_input_tokens // 0')
 
printf '%s\t%s\t%ss\t%s\t%s\t%s\n' \
  "$(TZ=Asia/Tokyo date +%H:%M:%S)" "$STAGE" "$elapsed" \
  "$in_tok" "$out_tok" "$cache_read" >> "$LOG"
 
# pass only the real result downstream
echo "$result" | jq -r '.result'

Two things matter here.

First, --output-format json lets you separate the generated result from usage. Without it, you are left estimating token counts after the fact.

Second, keep cache_read_input_tokens in its own column. Reads from the prompt cache are priced very differently, so mixing them into the input tokens makes it impossible to tell later what actually drove the cost. The cache reads were exactly where my estimate went wrong.
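To make the point about separate columns concrete, here is a minimal sketch of the arithmetic. It reuses the example rates that appear in the aggregation script in this article, which are placeholders rather than confirmed pricing, and the token counts are hypothetical.

```python
# Example rates in USD per token. These are the article's example numbers,
# not confirmed pricing for any plan.
PRICE_IN = 3.00 / 1_000_000
PRICE_CACHE_READ = 0.30 / 1_000_000


def input_cost(fresh_tokens: int, cache_read_tokens: int) -> float:
    """Input-side cost with cache reads priced at their own rate."""
    return fresh_tokens * PRICE_IN + cache_read_tokens * PRICE_CACHE_READ


def input_cost_if_mixed(fresh_tokens: int, cache_read_tokens: int) -> float:
    """Cost the log would imply if both counts were summed into one input column."""
    return (fresh_tokens + cache_read_tokens) * PRICE_IN


if __name__ == "__main__":
    # Hypothetical run: 2,000 fresh input tokens and 40,000 tokens read from cache.
    fresh, cached = 2_000, 40_000
    print(f"separate columns: ${input_cost(fresh, cached):.4f}")
    print(f"mixed column:     ${input_cost_if_mixed(fresh, cached):.4f}")
```

With one merged column there is no way to recover which of the two figures is closer to the truth, which is why the wrapper logs the cache reads separately.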

The measurement itself had a small trap, too. The usage from --output-format json can come back with the field you expected empty, depending on the stage. At first I silently counted empty values as zero, so my first-day tally looked lighter than reality. Now I check for the field first with jq -r '.usage // empty', and any missing line is logged explicitly as "usage missing" rather than patched over with an average. Before you trust the total, suspect the holes in your measurement first. It looks like a detour, but it was the shortest path.
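The same check can be sketched in Python for anyone post-processing the JSON outside the shell. This is only an illustration: the helper names `usage_row` and `log_fields` are hypothetical, and the field names are the ones the wrapper above reads, so confirm them against your own output.

```python
import json
from typing import Optional


def usage_row(raw_json: str) -> Optional[tuple[int, int, int]]:
    """Return (input, output, cache_read) token counts, or None if any is absent.

    A missing field is reported as None instead of being defaulted to zero,
    so gaps in the measurement stay visible.
    """
    data = json.loads(raw_json)
    usage = data.get("usage")
    if not isinstance(usage, dict):
        return None
    fields = ("input_tokens", "output_tokens", "cache_read_input_tokens")
    if any(not isinstance(usage.get(name), int) for name in fields):
        return None
    return (
        usage["input_tokens"],
        usage["output_tokens"],
        usage["cache_read_input_tokens"],
    )


def log_fields(raw_json: str) -> list[str]:
    """Build the three token columns for the TSV log line."""
    row = usage_row(raw_json)
    if row is None:
        # Explicit marker; never patched over with zero or an average.
        return ["usage missing", "", ""]
    return [str(n) for n in row]
```

A row marked this way can then be counted and reported on its own when the logs are aggregated, rather than quietly lowering the total.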

Folding two days of logs by stage

Once a day of logs piles up, sum it by stage. A short Python script was enough.

# aggregate_cost.py: sum tokens per stage and print an estimated cost
# usage: python aggregate_cost.py ~/pipeline_cost/*.tsv
import csv
import sys
from collections import defaultdict

# Example rates in USD per token; always confirm against your own plan's pricing.
PRICE_IN = 3.00 / 1_000_000          # input ($3.00 per million tokens)
PRICE_OUT = 15.00 / 1_000_000        # output ($15.00 per million tokens)
PRICE_CACHE_READ = 0.30 / 1_000_000  # cache reads, an order of magnitude cheaper than input

agg = defaultdict(lambda: {"in": 0, "out": 0, "cache": 0, "runs": 0})
skipped = 0  # rows that could not be parsed, e.g. lines logged as "usage missing"

for path in sys.argv[1:]:
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            try:
                _time, stage, _elapsed, in_tok, out_tok, cache = row
                counts = (int(in_tok), int(out_tok), int(cache))
            except ValueError:
                skipped += 1  # report these instead of counting them as zero
                continue
            a = agg[stage]
            a["in"] += counts[0]
            a["out"] += counts[1]
            a["cache"] += counts[2]
            a["runs"] += 1

print(f"{'stage':24} {'runs':>5} {'in':>10} {'out':>10} {'cache':>12} {'$est':>8}")
# sorted by input tokens, heaviest first
for stage, a in sorted(agg.items(), key=lambda kv: -kv[1]["in"]):
    cost = a["in"] * PRICE_IN + a["out"] * PRICE_OUT + a["cache"] * PRICE_CACHE_READ
    print(f"{stage:24} {a['runs']:5d} {a['in']:10d} {a['out']:10d} {a['cache']:12d} {cost:8.2f}")
if skipped:
    print(f"skipped {skipped} unparseable row(s)", file=sys.stderr)

Feeding it two days of TSV gives you actual consumption per stage. I lined that up next to the estimate I had written the night before the change, and looked at the two together.
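Lining the two up can be as simple as a ratio per stage. The sketch below assumes both the estimate and the measured cost are available as dictionaries keyed by stage name; the function name `compare` and all figures in the example are made up for illustration and are not the numbers from my logs.

```python
def compare(
    estimate: dict[str, float], actual: dict[str, float]
) -> list[tuple[str, float, float, float]]:
    """Return (stage, estimated, actual, ratio) rows, largest overrun first.

    A stage that has no estimate gets a ratio of infinity so that it
    sorts to the top and is not overlooked.
    """
    rows = []
    for stage in sorted(set(estimate) | set(actual)):
        est = estimate.get(stage, 0.0)
        act = actual.get(stage, 0.0)
        ratio = act / est if est > 0 else float("inf")
        rows.append((stage, est, act, ratio))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows


if __name__ == "__main__":
    # Hypothetical USD figures for two days of runs.
    estimate = {"topic-select": 1.00, "body-gen": 4.00, "quality-gate": 0.50}
    actual = {"topic-select": 2.00, "body-gen": 4.10, "quality-gate": 0.45}
    for stage, est, act, ratio in compare(estimate, actual):
        print(f"{stage:16} est {est:6.2f}  actual {act:6.2f}  x{ratio:.2f}")
```

Sorting by ratio rather than by absolute cost is a deliberate choice here: a stage can be cheap in total and still be the one where the estimate was most wrong.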

The biggest gap was in topic selection

The surprise was that the heaviest stage was not body generation but topic selection.

I had expected body generation to cost the most because of its output volume, and that part matched the estimate. Where I had misread things was the input side.

The topic selection stage reads reference data, the list of existing article titles, and the reader persona on every run. I had assumed this was cheap "because the cache covers it." But the logs showed the cache only helped during back-to-back runs. Because I stagger the four sites across the day, the calls that read the same reference data are spaced far apart, past the cache lifetime, so each one paid full input price again.
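The schedule effect can be checked on paper before touching the pipeline. The sketch below is a simplified model, not a description of how the cache is actually implemented: it assumes a single cached prefix, a fixed lifetime passed in as `ttl_minutes` (an assumed value you should confirm against the current documentation), and that every call restarts the lifetime.

```python
def count_cache_hits(call_minutes: list[float], ttl_minutes: float) -> int:
    """Count calls that arrive while the previous call's cache entry is still alive.

    Simplified model: one shared prefix, and each call (hit or miss)
    restarts the lifetime from its own start time.
    """
    hits = 0
    last = None
    for t in sorted(call_minutes):
        if last is not None and t - last < ttl_minutes:
            hits += 1
        last = t
    return hits


if __name__ == "__main__":
    ttl = 5.0  # assumed lifetime in minutes; confirm for your setup
    staggered = [0, 360, 720, 1080]  # four sites spread across the day
    batched = [0, 2, 4, 6]           # the same four calls run back to back
    print("staggered hits:", count_cache_hits(staggered, ttl))
    print("batched hits:  ", count_cache_hits(batched, ttl))
```

Under these assumptions the staggered schedule never reuses the cache, while the batched one pays full input price only on the first call.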

In other words, what pushed costs up was not output volume but the mismatch between my staggered schedule and the cache lifetime. That is the kind of gap you can never spot from the total alone. It feels a lot like breaking AdMob revenue down by country and time of day before the real bottleneck finally shows itself.

I moved exactly one stage back to my subscription

The fix was simple. I pulled topic selection out of headless and ran it inside a normal subscription session instead.

I based the decision on a few questions I worked out after reading through how the monthly credits behave.

  • Does it have to run fully unattended? If yes, keep it on headless (monthly credits).
  • Is its input heavy and poorly served by the cache? If yes, it is a candidate to move back to the subscription.
  • Is output the main act, with real value in running unattended? If yes, keep it on headless.
  • Can I run it while reviewing in one batch in the morning? If yes, the subscription is plenty.

Topic selection matched two of these: heavy input that the cache barely helped, and something I can review in a morning batch. Body generation and the quality gates, by contrast, are worth completing unattended overnight, so they stay on headless.
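The four questions can be written down as a small rule so the same reasoning is applied to every stage. This is one possible encoding, not a fixed policy: the `Stage` fields, the `placement` function, and the order in which the questions are checked are my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    must_run_unattended: bool
    input_heavy_poor_cache: bool
    output_is_main_work: bool
    reviewable_in_morning_batch: bool


def placement(stage: Stage) -> str:
    """Return "headless" or "subscription" for a stage.

    The check order is an assumption: a hard need for unattended runs
    wins first, and the default is to leave the stage where it is.
    """
    if stage.must_run_unattended:
        return "headless"
    if stage.input_heavy_poor_cache and stage.reviewable_in_morning_batch:
        return "subscription"
    if stage.output_is_main_work:
        return "headless"
    if stage.reviewable_in_morning_batch:
        return "subscription"
    return "headless"


if __name__ == "__main__":
    stages = [
        Stage("topic-select", False, True, False, True),
        Stage("body-gen", True, False, True, False),
        Stage("quality-gate", True, False, False, False),
    ]
    for s in stages:
        print(f"{s.name:14} -> {placement(s)}")
```

Keeping the default at "headless" matches the one-change-at-a-time approach: a stage moves only when it clearly meets the criteria.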

The important part, I think, is not to move everything at once. Move one stage, then confirm it again with the next day's logs. In indie operations, the more strictly you hold to "change one thing and measure," the easier it is to trace causes later.

What I will confirm over the next few days

Two days of measurement is only an early reading. The budget looks generous at the start of the month, so I cannot judge whether the burn rate is right until a few more days of logs accumulate.

Three things are on my list:

  • Whether the headless credit burn lands on the estimated line now that topic selection is back on the subscription.
  • How much input cost drops if I tighten the schedule with the cache lifetime in mind.
  • How closely the consumption shown on Anthropic's dashboard matches my own local tally.
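Two of these checks reduce to simple arithmetic. The sketch below assumes a straight-line projection over the billing period and a relative gap between two totals; the function names, the 30-day period, and the dollar figures are all hypothetical.

```python
def projected_period_spend(spent_so_far: float, days_elapsed: int, days_in_period: int) -> float:
    """Linear projection of spend over the whole billing period.

    Assumes the daily burn stays constant, which two days of data
    cannot confirm; treat the result as an early reading only.
    """
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spent_so_far / days_elapsed * days_in_period


def tally_gap(dashboard_usd: float, local_usd: float) -> float:
    """Relative gap of the local tally against the dashboard figure.

    Negative means the local log undercounts, for example because of
    rows logged as "usage missing".
    """
    if dashboard_usd == 0:
        return 0.0 if local_usd == 0 else float("inf")
    return (local_usd - dashboard_usd) / dashboard_usd


if __name__ == "__main__":
    # Hypothetical figures: $8.00 spent in 2 days of an assumed 30-day period.
    print(f"projected: ${projected_period_spend(8.0, 2, 30):.2f}")
    print(f"gap vs dashboard: {tally_gap(10.0, 9.0):+.1%}")
```

If the projection sits well above the credit amount, or the gap stays persistently negative, that is the cue to look at the stage table again before changing anything else.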

Once the numbers settle, I plan to redraw the stage table. Now that the billing model has changed, looking at cost by stage rather than by total is the small extra step where the leverage is. That is my honest takeaway on day two.

Thank you for reading this far. If you are also wrestling with how to design around monthly credits, I hope this offers a small foothold.
