Measuring My Automation's Real Cost on Day Two of the Billing Change

The morning after the June 15 billing change took effect, the first thing I thought when I opened the dashboard was: this is harder to read than I expected. Agent SDK, headless claude -p, GitHub Actions, and third-party agents had moved off the old subscription cap and onto separate monthly credits, billed at full API rate with no rollover. I had read the announcement several times, but seeing the actual numbers lined up changed the picture.

I build apps as an indie developer, and alongside that I run several sites under Dolice Labs on an automated publishing pipeline. Generation, quality gates, and the push step are all separate stages in a loop. Until now I had assembled the whole thing on the loose assumption that everything sat inside my subscription. On day two, with the first full day of real numbers finally in, I want to leave a plain record of how I redrew the line. If you have also been "just running it all headless," maybe this helps you decide.

Why I measured on day two, not day one

The short version: I decided not to trust day-one numbers. On the day the change took effect, credit accounting had not settled, and even in my own environment the per-run figures looked different in the morning than in the afternoon. Right after a billing model switches over, that kind of transitional noise is always mixed in.

Instead, I used the morning of day two — after a full 0:00–24:00 JST cycle of scheduled runs had completed once — as my first trustworthy reference point. Scheduled jobs vary by day of week and time of day, so ideally I would watch a full week. But waiting that long would push the line-drawing too far back. So I made "one full day-two cycle" my provisional yardstick, with a plan to correct it again over the weekend.

Color-coding each stage: credit spend vs. inside subscription

The first thing I did was lay every pipeline stage out in a single table and color-code which ones consume the new monthly credits. I thought I already knew this in my head, but writing it out told a different story. The biggest blind spot: a stage I thought of as "just generate one article" was actually spawning several subagents, and all of that landed on the credit side.

I kept the sorting rule simple.

Stages that run headless and non-interactive (the scheduled job itself, generation via claude -p, checks on GitHub Actions) → credit spend
Stages I run in my own interactive session (touch-up edits, design discussions, manuscript review) → subscription

Splitting it into those two columns alone made it clear where cutting would actually matter. In my case, the heaviest credit consumer was the cluster of research subagents I ran ahead of the quality gate. The automated research around generation was heavier than generation itself.

The measurement is just "have the log emit usage"

Measuring real cost did not require special instrumentation. I only appended one line of usage on exit to the wrapper script around my headless runs. This is the whole addition.

#!/usr/bin/env bash
# Publishing wrapper: record one stage's usage to a log
# Usage: ./run_step.sh <stage_name> -- claude -p "..."
 
STEP_NAME="$1"; shift
[ "$1" = "--" ] && shift
 
LOG="$HOME/cost_log/$(date +%Y-%m-%d).tsv"
mkdir -p "$(dirname "$LOG")"
 
START=$(date +%s)
# Run the real command (stdout flows downstream unchanged)
"$@"
STATUS=$?
END=$(date +%s)
 
# Append one tab-separated line: stage, time, duration, exit code
printf '%s\t%s\t%ds\t%d\n' \
  "$STEP_NAME" "$(TZ=Asia/Tokyo date +%H:%M)" "$((END-START))" "$STATUS" \
  >> "$LOG"
 
exit $STATUS

A full day collects into cost_log/2026-06-16.tsv, one line per run with stage, time, duration, and result. Aggregate by stage name and you can see which stages ran how many times and how much time they ate. I read the actual credit amount on the dashboard side, then cross-reference this log to find where the spend is skewed. Duration is not cost itself, but heavier stages tend to take longer, so as a first-pass signal it was enough.

The aggregation fits in a one-liner.

# Run count and total seconds per stage, descending
awk -F'\t' '{c[$1]++; s[$1]+=$3+0} END{for(k in c) printf "%-28s %3dx %5ds\n", k, c[k], s[k]}' \
  "$HOME/cost_log/2026-06-16.tsv" | sort -k3 -rn

How I redrew the line

After looking at the numbers, I decided on only three changes.

First, I switched the research subagents from "always on" to "only when needed." I used to run research before every generation, but on days when I already had the reference data on hand, I skipped that whole stage. This is where the credit-spend peak dropped the most.

Second, I moved stages that interactive work handles fine back into the subscription. For final article review, reading and fixing in my own session is faster and produces better quality anyway. Forcing it into headless was the assumption that "automating it is more virtuous" — and it paid off on neither cost nor quality. Automation is a tool, not a goal; the numbers reminded me of that.

Third, I narrowed GitHub Actions checks from "everything on every push" to "only what changed." That was less about the billing change itself and more a side effect of treating credits as finite. Constraints make waste easier to notice.

This way of thinking about cost design follows directly from notes I wrote earlier on practical Anthropic API cost optimization and on how to forecast monthly token cost. This billing change simply forced me to take inventory of those assumptions again. For the GitHub Actions side, I narrowed down the same setup I covered in wiring AI checks into CI/CD.

What I'll do from here

If you want one concrete first step, write your own pipeline stages into two columns: "runs headless" versus "runs interactively." That alone surfaces which stages are eating the new credits. I plan to re-measure over a full week this weekend and turn my provisional line into a settled one.

I tense up every time a billing model changes, but this one turned out to be a good chance to see, in actual numbers, which parts of my pipeline were heavy for the first time. Constraints hand you a reason to revisit things. If you are living through the same transition, I would love to compare notes.