Two Days Into the Billing Change: How Far My Headless Costs Drifted From the Estimate

On the morning of June 15, I opened the pipeline logs as usual, and the first thing I checked was the credit balance.

The day before, the billing change had taken effect. Headless claude -p runs, the Agent SDK, and GitHub Actions had moved off my subscription limits and onto a separate pool of monthly credits, with no rollover. Up to the night before, I had trusted my own estimate. I had eyeballed a rough cost for each stage and assumed the month would fit inside the budget.

Then I tallied the numbers on the second morning, and one stage alone was burning through roughly twice what I had estimated.

What bothered me was not the number itself, but why I had misread it. So here is a plain record of how I measured cost per stage, and the decision that followed: moving exactly one stage back to my subscription. If you are also rethinking the cost of an automated setup, I hope this gives you something to go on.

You cannot measure anything until you split the stages

I run the four sites of Dolice Labs on an automated pipeline. Article generation looks like a single script, but inside it splits into a few independent stages.

Topic selection (read reference data, avoid overlap with existing posts, decide a theme)
Body generation (two drafts, Japanese and English)
Quality gates (template detection, integrity checks)
Push and cleanup (commit, log the run)

Before the billing change, I treated all of this as "X per article." As long as it ran inside my subscription, there was no reason to know the breakdown.

A separate pool of monthly credits changes that. Spending the budget without knowing which stage is heavy is like watching only your total electricity bill while having no idea which appliance is eating it. I needed to split the stages first, then attribute credits to each one.

A small wrapper that attributes cost per stage

I did not build a heavy measurement system. I simply wrapped each place that calls claude -p with a thin shell wrapper. It takes a stage name before the call and appends one line with elapsed time and an estimated cost when the call finishes.

#!/usr/bin/env bash
# run_stage.sh — run claude -p with a stage label and log usage in one line
# usage: ./run_stage.sh "topic-select" "prompt text"
set -euo pipefail
 
STAGE="$1"
PROMPT="$2"
LOG="${HOME}/pipeline_cost/$(TZ=Asia/Tokyo date +%Y-%m-%d).tsv"
mkdir -p "$(dirname "$LOG")"
 
start=$(date +%s)
 
# --output-format json returns usage so we can separate it from the result
result=$(claude -p "$PROMPT" --output-format json)
 
end=$(date +%s)
elapsed=$((end - start))
 
# pull token counts out of usage (confirm the field names in your setup)
in_tok=$(echo "$result"  | jq -r '.usage.input_tokens  // 0')
out_tok=$(echo "$result" | jq -r '.usage.output_tokens // 0')
cache_read=$(echo "$result" | jq -r '.usage.cache_read_input_tokens // 0')
 
printf '%s\t%s\t%ss\t%s\t%s\t%s\n' \
  "$(TZ=Asia/Tokyo date +%H:%M:%S)" "$STAGE" "$elapsed" \
  "$in_tok" "$out_tok" "$cache_read" >> "$LOG"
 
# pass only the real result downstream
echo "$result" | jq -r '.result'

Two things matter here.

First, --output-format json lets you separate the generated result from usage. Without it, you are left estimating token counts after the fact.

Second, keep cache_read_input_tokens in its own column. Reads from the prompt cache are priced very differently, so mixing them into the input tokens makes it impossible to tell later what actually drove the cost. The cache reads were exactly where my estimate went wrong.

The measurement itself had a small trap, too. The usage from --output-format json can come back with the field you expected empty, depending on the stage. At first I silently counted empty values as zero, so my first-day tally looked lighter than reality. Now I check for the field first with jq -r '.usage // empty', and any missing line is logged explicitly as "usage missing" rather than patched over with an average. Before you trust the total, suspect the holes in your measurement first. It looks like a detour, but it was the shortest path.

Folding two days of logs by stage

Once a day of logs piles up, sum it by stage. A short Python script was enough.

# aggregate_cost.py — sum tokens per stage and print an estimated cost
import csv, sys
from collections import defaultdict
 
# always confirm rates against your own plan's pricing (these are example numbers)
PRICE_IN  = 3.00 / 1_000_000   # input, per million tokens
PRICE_OUT = 15.00 / 1_000_000  # output, per million tokens
PRICE_CACHE_READ = 0.30 / 1_000_000  # cache reads are an order of magnitude cheaper
 
agg = defaultdict(lambda: {"in": 0, "out": 0, "cache": 0, "runs": 0})
 
for path in sys.argv[1:]:
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            _, stage, _elapsed, in_tok, out_tok, cache = row
            a = agg[stage]
            a["in"]    += int(in_tok)
            a["out"]   += int(out_tok)
            a["cache"] += int(cache)
            a["runs"]  += 1
 
print(f"{'stage':24} {'runs':>5} {'in':>10} {'out':>10} {'cache':>12} {'$est':>8}")
for stage, a in sorted(agg.items(), key=lambda kv: -(kv[1]['in'])):
    cost = a["in"]*PRICE_IN + a["out"]*PRICE_OUT + a["cache"]*PRICE_CACHE_READ
    print(f"{stage:24} {a['runs']:5d} {a['in']:10d} {a['out']:10d} {a['cache']:12d} {cost:8.2f}")

Feeding it two days of TSV gives you actual consumption per stage. I lined that up next to the estimate I had written the night before the change, and looked at the two together.

The biggest gap was in topic selection

The surprise was that the heaviest stage was not body generation but topic selection.

I had expected body generation to cost the most because of its output volume, and that part matched the estimate. Where I had misread things was the input side.

The topic selection stage reads reference data, the list of existing article titles, and the reader persona on every run. I had assumed this was cheap "because the cache covers it." But the logs showed the cache only helped during back-to-back runs. Because I stagger the four sites across the day, the calls that read the same reference data are spaced far apart, past the cache lifetime, so each one paid full input price again.

In other words, what pushed costs up was not output volume but the mismatch between my staggered schedule and the cache lifetime. That is the kind of gap you can never spot from the total alone. It feels a lot like breaking AdMob revenue down by country and time of day before the real bottleneck finally shows itself.

I moved exactly one stage back to my subscription

The fix was simple. I pulled topic selection out of headless and ran it inside a normal subscription session instead.

I based the decision on a few questions I worked out after reading through how the monthly credits behave.

Does it have to run fully unattended? If yes, keep it on headless (monthly credits).
Is its input heavy and poorly served by the cache? If yes, it is a candidate to move back to the subscription.
Is output the main act, with real value in running unattended? If yes, keep it on headless.
Can I run it while reviewing in one batch in the morning? If yes, the subscription is plenty.

Topic selection matched two of these: heavy input that the cache barely helped, and something I can review in a morning batch. Body generation and the quality gates, by contrast, are worth completing unattended overnight, so they stay on headless.

The important part, I think, is not to move everything at once. Move one stage, then confirm it again with the next day's logs. In indie operations, the more strictly you hold to "change one thing and measure," the easier it is to trace causes later.

What I will confirm over the next few days

Two days of measurement is only an early reading. The budget looks generous at the start of the month, so I cannot judge whether the burn rate is right until a few more days of logs accumulate.

Three things are on my list. After moving topic selection back to the subscription, whether the headless credit burn lands on the estimated line. How much input cost drops if I tighten the schedule with cache lifetime in mind. And how closely Anthropic's dashboard consumption matches my own local tally.

Once the numbers settle, I plan to redraw the stage table. Looking at it by stage rather than by total — now that the billing model has changed, that small extra step is where the leverage is. That is my honest takeaway on day two.

Thank you for reading this far. If you are also wrestling with how to design around monthly credits, I hope this offers a small foothold.