Keeping Large Claude Code Refactors Revertible One Commit at a Time — Field Notes on Checkpoints and Rollback Detection
Hand a big refactor to Claude Code and the speed hides a real cost: oversized diffs that no one can realistically review. Here are the field notes I actually run: declaring checkpoints in a manifest, enforcing commit granularity with a pre-push hook, and tying rollback calls to observability.
You ask Claude Code to "rewrite this whole directory into a different structure," and forty seconds later a 2,000-line diff lands and your hands stop moving. I have lived that moment several times, on personal apps and on client work alike. Generation is fast, but in my experience the focused attention a review demands grows much faster than the diff itself.
For a while I powered through reviews on willpower. Once I started running several projects in parallel as an indie developer, that approach plainly broke down. Now I invert the order. Before I let Claude Code generate anything, I design where I can safely roll back to, and I make Claude Code honor that granularity. These are the field notes for the manifest, hook, and rollback detection I actually use — written so you can copy them.
Estimate the layer that breaks even when tests are green
Claude Code is smart, so most refactors come back as working code. The trouble is that the gap between "it runs" and "it runs correctly" widens with the size of the diff.
The failures I have actually hit lived in a layer unit tests cannot reach. A database connection's initialization order was off by one line, and connections only exhausted during idle periods after deploy. On another project, existing code quietly relied on "swallow the exception and return a default," and a clean rewrite that simply threw instead took down an entire nightly batch. In both cases a new diff broke a contract the old code held implicitly — and because the tests never expressed that contract, they stayed green.
There is one takeaway: refactor size and reviewability have to be designed as separate things. Rewriting big is fine. The problem is being handed it all at once and forced to verify it all at once.
Declare checkpoints as a manifest, up front
When I start a refactor, the first thing I do is not generate code — it is mark the points I can return to. Rather than leaving that in comments or my head, I put it in a YAML manifest committed to the repo, so the later hooks and reviews can read it.
# refactor.checkpoints.yml: fixed before the refactor begins
target: Move OrderService toward a structure with clearer boundaries
rollback_signals:
  p95_latency_ms: 450   # exceed this -> revert to the previous CP
  error_rate_pct: 1.0
checkpoints:
  - id: CP1
    intent: Add interfaces/ and usecases/. Do not change a single line of OrderService
    invariant: No calls from existing code occur (pure addition only)
  - id: CP2
    intent: Add an adapter so the new UseCase calls the existing OrderService
    invariant: The entry Controller supports both paths via a flag defaulting to the old route
  - id: CP3
    intent: Port tests to the UseCase side. Keep old tests. Flag defaults to false
    invariant: Behavior with the flag false exactly matches CP2
  - id: CP4
    intent: Flip the flag to true for part of production traffic; if clean, invert the default
    invariant: The flag can be returned to false instantly at any time
  - id: CP5
    intent: Delete the old OrderService and the flag branch
    invariant: Predicated on CP4 being stable in production
The part that earns its keep is invariant — the condition that must hold once each checkpoint is done. Writing down not just "what to do" but "what is provably still intact when it finishes" naturally shifts the request to Claude Code from "rewrite the whole thing" to "produce only the diff that satisfies CP1's invariant." The reason a giant diff comes back is not Claude Code; it is that I never defined the granularity. Realizing that was the start of this whole practice.
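Because everything downstream reads the manifest, it can help to check it mechanically before any code is generated. The following is a minimal sketch, not something from my actual setup: validate_manifest and its rules are hypothetical, and it assumes the manifest has already been parsed into a dict (for example with yaml.safe_load).

```python
# Sketch: sanity-check a parsed refactor.checkpoints.yml.
# validate_manifest is an illustrative helper, not part of any tool.

REQUIRED_KEYS = ("id", "intent", "invariant")


def validate_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the manifest is usable."""
    problems = []
    checkpoints = manifest.get("checkpoints") or []
    if not checkpoints:
        problems.append("no checkpoints declared")
    seen = set()
    for index, cp in enumerate(checkpoints, start=1):
        label = cp.get("id") or f"checkpoint #{index}"
        # Every checkpoint needs an id, an intent, and an invariant.
        for key in REQUIRED_KEYS:
            if not str(cp.get(key) or "").strip():
                problems.append(f"{label}: missing '{key}'")
        if cp.get("id") and cp["id"] in seen:
            problems.append(f"{label}: duplicate id")
        seen.add(cp.get("id"))
    if not manifest.get("rollback_signals"):
        problems.append("no rollback_signals declared")
    return problems


if __name__ == "__main__":
    manifest = {
        "rollback_signals": {"p95_latency_ms": 450, "error_rate_pct": 1.0},
        "checkpoints": [
            {"id": "CP1", "intent": "Add interfaces/ and usecases/",
             "invariant": "No calls from existing code occur"},
            {"id": "CP2", "intent": "Add an adapter"},  # invariant forgotten
        ],
    }
    for problem in validate_manifest(manifest):
        print(problem)
```

A checkpoint without an invariant is the case worth catching early, since the invariant is what later turns "rewrite the whole thing" into a bounded request.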
Once the manifest is set, I hand Claude Code exactly one checkpoint's worth of scope. My template explicitly asks it to "stop at a proposal rather than implement" when the work threatens to spill over.
You will refactor this repository following refactor.checkpoints.yml.
This task is the scope of CP1 only.

Constraints:
1. Produce a diff that satisfies only CP1's intent and invariant. Do not change a single line of the existing OrderService.
2. If you judge a change beyond CP1's scope is needed, do not implement it. Write "Proposing this as CP2" with a 2-3 line reason.
3. Return your output as this JSON:
   {
     "diff_summary": "summary of changes",
     "invariant_check": "evidence CP1's invariant is not violated",
     "almost_broke": "where you nearly reached out of scope but stopped (at least one)"
   }
Forcing at least one almost_broke entry is the point. Claude Code likes to fold in adjacent improvements, and because the code looks cleaner you are tempted to accept them. Allow that and a single commit's meaning swells, and reversibility erodes. Do not reward out-of-scope improvements; route them to another checkpoint. Across client projects, the ones where I held that line finished the refactor faster overall — because when something went wrong, git revert worked one commit at a time.
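The JSON contract in the template can be enforced rather than trusted. Here is a minimal sketch that rejects a reply whose fields are missing or empty, including almost_broke. The check_reply helper is hypothetical, and it assumes the raw reply text has already been captured from the session as a string.

```python
import json

# Field names come from the prompt template; check_reply itself is illustrative.
REQUIRED_FIELDS = ("diff_summary", "invariant_check", "almost_broke")


def check_reply(raw: str) -> dict:
    """Parse the JSON reply and raise ValueError if the contract is not met."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    for field in REQUIRED_FIELDS:
        value = reply.get(field)
        # Accept a non-empty string or any other non-empty value (e.g. a list).
        empty = not value.strip() if isinstance(value, str) else not value
        if empty:
            raise ValueError(f"'{field}' is missing or empty")
    return reply


if __name__ == "__main__":
    raw = ('{"diff_summary": "added interfaces/", '
           '"invariant_check": "OrderService untouched", '
           '"almost_broke": ""}')
    try:
        check_reply(raw)
    except ValueError as err:
        print(f"rejected: {err}")
```

An empty almost_broke is treated as a failed reply on purpose: the field exists to make out-of-scope temptation visible, so "nothing to report" gets sent back for another pass.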
Stop oversized diffs mechanically with a pre-push hook
Asking nicely in the prompt fails the moment a human waves one through with "eh, just this once." So I make oversized diffs and checkpoint-less commits physically un-pushable. This hook lives in .git/hooks/pre-push and has to be executable (chmod +x .git/hooks/pre-push).
#!/usr/bin/env bash
# .git/hooks/pre-push: enforce refactor granularity mechanically
set -euo pipefail

MAX_LINES=300
# Assumes the refactor branch is compared against origin/main.
range="origin/main..HEAD"

# 1) Does every commit reference a checkpoint ID?
#    `|| true` keeps set -e from aborting when grep has nothing to report.
bad_msg=$(git log --format='%H %s' "$range" | grep -viE 'CP[0-9]+' || true)
if [ -n "$bad_msg" ]; then
  echo "x Commits without a checkpoint ID (e.g. CP1):"
  echo "$bad_msg"
  exit 1
fi

# 2) Does any single commit exceed the line ceiling?
#    --numstat prints "added<TAB>deleted<TAB>path" per file. awk sums both
#    columns and prints 0 for an empty diff, so the pipeline never fails
#    under pipefail (binary files show "-" and count as 0).
while read -r sha; do
  lines=$(git show --numstat --format='' "$sha" | awk '{ n += $1 + $2 } END { print n + 0 }')
  if [ "$lines" -gt "$MAX_LINES" ]; then
    echo "x Commit ${sha:0:8} is ${lines} lines (ceiling ${MAX_LINES}). Split it."
    exit 1
  fi
done < <(git rev-list "$range")

# 3) Does each commit build? (optional; push to CI if heavy)

echo "v granularity check passed"
The 300-line ceiling is a rule of thumb. In my experience the diff I can read without losing focus tops out around 300-500 lines, and erring toward the low end keeps me on the safe side. The "CP number required in the commit message" constraint quietly pulls weight too: slip in work that is not in the manifest and the push stops. Heavy build verification belongs in CI; locally, a light check on line count and checkpoint alignment proved the most practical.
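The hook only checks that a commit subject contains something shaped like a checkpoint ID. For the CI side, a stricter variant could also confirm the ID is actually declared in the manifest. The sketch below is one way to do that under stated assumptions: check_commit and changed_lines are hypothetical helpers, the numstat text is assumed to come from git show --numstat --format= for one commit, and declared_ids is assumed to be read from refactor.checkpoints.yml elsewhere.

```python
import re

CP_PATTERN = re.compile(r"\bCP[0-9]+\b")


def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git show --numstat --format=` output."""
    total = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # blank or unexpected line
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-"; count them as zero lines.
        total += int(added) if added.isdigit() else 0
        total += int(deleted) if deleted.isdigit() else 0
    return total


def check_commit(subject: str, numstat: str, declared_ids: set[str],
                 max_lines: int = 300) -> list[str]:
    """Return the reasons a commit should be rejected; empty list means OK."""
    problems = []
    ids = set(CP_PATTERN.findall(subject))
    if not ids:
        problems.append("no checkpoint ID in subject")
    for cp_id in sorted(ids - declared_ids):
        problems.append(f"{cp_id} is not declared in the manifest")
    lines = changed_lines(numstat)
    if lines > max_lines:
        problems.append(f"{lines} changed lines exceeds ceiling {max_lines}")
    return problems


if __name__ == "__main__":
    declared = {"CP1", "CP2"}
    for problem in check_commit("CP9: tidy helpers", "200\t150\tsrc/x.py\n", declared):
        print(problem)
```

Keeping the parsing in pure functions means the rules can be unit-tested without a repository, and the git calls stay in a thin wrapper.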
When a giant diff appears, ask for a split
Even so, diffs over 300 lines happen routinely. Then I ask Claude Code to split — and the constraint that earns its keep here is "each commit must build and start on its own."
This diff is about XXX lines and hard to review. Propose a split with these rules:
- Commit 1: type definitions and empty implementations only (no behavior change)
- Commit 2: move old logic onto the new types, but do not change callers
- Commit 3: switch callers to the new abstraction
- Commit 4: delete the old implementation
Required: applying each commit on its own leaves the app able to build and start.
Do not output a split you cannot guarantee this for.
That one line makes Claude Code choose, on its own, a design that temporarily lets old and new implementations coexist via a "bridge." The commit that later deletes the bridge is short and clear, so both review and rollback get easier. The ideal split rarely arrives in one shot, so I check the first proposal against my manifest and negotiate — "I want one more step between CP2 and CP3." It feels less like a one-shot oracle and more like a refactor partner.
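The "each commit must build on its own" rule can be verified rather than taken on trust. The following is a rough sketch under assumptions: first_broken_commit and builds_at are hypothetical helpers, the build command (for example ["make", "build"]) is a placeholder for whatever your project uses, and the commit list is assumed to come from something like git rev-list --reverse origin/main..HEAD. Since builds_at checks out each commit, it should run in a clean clone or a separate worktree, not in a directory with uncommitted work.

```python
import subprocess
from typing import Callable, Iterable, Optional


def builds_at(sha: str, build_cmd: list[str]) -> bool:
    """Check out one commit (detached HEAD) and run the build command there."""
    subprocess.run(["git", "checkout", "--quiet", sha], check=True)
    return subprocess.run(build_cmd).returncode == 0


def first_broken_commit(shas: Iterable[str],
                        builds: Callable[[str], bool]) -> Optional[str]:
    """Walk commits oldest-first; return the first that fails, or None."""
    for sha in shas:
        if not builds(sha):
            return sha
    return None


# Example wiring (build command is a placeholder for your project's own):
#   shas = subprocess.run(
#       ["git", "rev-list", "--reverse", "origin/main..HEAD"],
#       check=True, capture_output=True, text=True,
#   ).stdout.split()
#   broken = first_broken_commit(shas, lambda s: builds_at(s, ["make", "build"]))
```

Passing the build check in as a callable keeps the loop trivial to test and lets the same function run a "does it start" smoke check as a second pass.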
Push self-review into the contract
Even with commit granularity in order, I make Claude Code surface its own weak spots to raise review density. In the same session I follow up:
Self-review this change on three points:
1. Contracts the existing code implicitly relied on that this change may break (e.g. exception propagation, log format, init order, null handling)
2. Paths with no tests that would hurt if they break in production
3. Among 1 and 2, the spots you are not confident about (no zero answers; at least one)
Point 3 — "at least one thing you are unsure about" — is the key. Claude Code is good at replying "no problems," which is useless for review. Forcing out the spots it suspects surfaces exactly where I should read closely. I paste that output into the commit message body so a future me, or another reviewer, can tell that "the author themselves flagged this as uncertain."
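Pasting the self-review into the commit body can be made repeatable. This is a small sketch, assuming the uncertain spots have already been extracted from the reply as a list of strings. The commit_message helper and the "Author-flagged uncertainty" heading are my own illustrative choices, not a convention of git or Claude Code.

```python
def commit_message(cp_id: str, summary: str, uncertain: list[str]) -> str:
    """Build a commit message: checkpoint ID in the subject, doubts in the body."""
    if not uncertain:
        # Mirrors the prompt rule: a self-review with zero doubts is not accepted.
        raise ValueError("self-review must flag at least one uncertain spot")
    body = "\n".join(f"- {item}" for item in uncertain)
    return f"{cp_id}: {summary}\n\nAuthor-flagged uncertainty:\n{body}\n"


if __name__ == "__main__":
    print(commit_message(
        "CP2",
        "add adapter from UseCase to OrderService",
        ["init order of the DB pool may change", "callers may rely on swallowed exceptions"],
    ))
```

Putting the checkpoint ID first in the subject also satisfies the pre-push check on commit messages.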
Tie the rollback decision to observability
The last piece is field verification. Instead of testing in big batches, I build the habit of verifying one commit at a time. I check three things: the tests are green, the main path works when I walk it once with my own eyes in a staging-equivalent environment, and the shape of the metric series has not visibly changed between before and after.
The third catches the unease tests cannot. I judge it semi-automatically using the manifest's rollback_signals.
# rollback_check.py: decide whether to revert using the manifest thresholds
import statistics
import sys

import yaml  # PyYAML

with open("refactor.checkpoints.yml") as f:
    cp = yaml.safe_load(f)
sig = cp["rollback_signals"]


def p95(series):
    """Rank-based 95th percentile of a non-empty list of samples."""
    s = sorted(series)
    return s[min(len(s) - 1, int(len(s) * 0.95))]


# before/after are same-window samples pulled from your monitoring stack.
# The [...] and ... below are placeholders: replace them with real data.
before_lat = [...]  # response time before deploy (ms)
after_lat = [...]   # response time after deploy (ms)
after_err = ...     # error rate after deploy (%)

reasons = []
if p95(after_lat) > sig["p95_latency_ms"]:
    reasons.append(f"p95={p95(after_lat)}ms > {sig['p95_latency_ms']}ms")
if after_err > sig["error_rate_pct"]:
    reasons.append(f"error_rate={after_err}% > {sig['error_rate_pct']}%")

# Watch the shape too: warn if the median shifts right by 20% or more
if statistics.median(after_lat) > statistics.median(before_lat) * 1.2:
    reasons.append("median worse by >=20% (verify even if tests are green)")

if reasons:
    print("<- candidate to revert to the previous checkpoint:")
    print("\n".join(reasons))
    sys.exit(1)
print("v no rollback needed")
Set thresholds (like p95_latency_ms) from the measured distribution of the prior week or two, not from a number in an article. I place the value at roughly 1.3x the normal p95 and, when it is exceeded, revert to the previous commit without hesitation. I do not treat reverting as a cost. The whole reason I kept the granularity tight is so the safety valve fires exactly as designed at this moment.
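Deriving the threshold from a baseline is simple arithmetic. Here is a minimal sketch, assuming baseline_ms holds latency samples (in ms) exported from your monitoring stack for the prior week or two. The suggest_threshold helper is hypothetical, and the 1.3 factor is just the rule of thumb above exposed as a parameter.

```python
def p95(samples):
    """Rank-based 95th percentile, the same rule used in rollback_check.py."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(len(ordered) * 0.95))]


def suggest_threshold(baseline_ms, factor=1.3):
    """Return a p95_latency_ms candidate: baseline p95 times a safety factor."""
    if not baseline_ms:
        raise ValueError("need at least one baseline sample")
    # Round to a whole millisecond so the value is easy to paste into the manifest.
    return round(p95(baseline_ms) * factor)


if __name__ == "__main__":
    # Synthetic baseline for illustration only: 90 fast requests, 10 slow ones.
    baseline = [100] * 90 + [200] * 10
    print(suggest_threshold(baseline))  # 200 ms p95 * 1.3 -> 260
```

Recomputing the number after each stable checkpoint keeps the threshold tied to what the system actually does, which matters once CP4 starts shifting real traffic onto the new path.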
The speed pays off only when the boring loop runs fast
Hand a large refactor to Claude Code and you are tempted to expect "magic that rewrites it in one shot." But across my own indie projects and client work alike, what I keep feeling is that its real strength is running the boring loop fast. Declare checkpoints in a manifest, enforce granularity with a pre-push hook, split giant diffs through a coexistence bridge, force out weak spots with self-review, and decide rollbacks calmly from observability numbers. None of it is special technique.
Next time you start a refactor, write just five lines of refactor.checkpoints.yml before you let it write code. From there the way you ask Claude Code changes — and the nights you lie awake afraid production broke over the weekend quietly grow fewer. I hope it helps anyone working on the same problem.