Keeping Large Claude Code Refactors Revertible One Commit at a Time — Field Notes on Checkpoints and Rollback Detection

Hand a big refactor to Claude Code and the speed hides a real cost: oversized diffs that resist review. Here are the field notes I actually run: declaring checkpoints in a manifest, enforcing commit granularity with a pre-push hook, and tying rollback calls to observability.

You ask Claude Code to "rewrite this whole directory into a different structure," and forty seconds later a 2,000-line diff lands and you freeze. I have lived that moment several times, on personal apps and on client work alike. Generation is fast, but in my experience the focused attention a review needs grows much faster than the diff itself.

For a while I powered through reviews on willpower. Once I started running several projects in parallel as an indie developer, that approach plainly broke down. Now I invert the order. Before I let Claude Code generate anything, I design where I can safely roll back to, and I make Claude Code honor that granularity. These are the field notes for the manifest, hook, and rollback detection I actually use — written so you can copy them.

Estimate the layer that breaks even when tests are green

Claude Code is smart, so most refactors come back as working code. The trouble is that the gap between "it runs" and "it runs correctly" widens with the size of the diff.

The failures I have actually hit lived in a layer unit tests cannot reach. On one project, a database connection's initialization order was off by one line, and connections were exhausted only during idle periods after deploy. On another, existing code quietly relied on "swallow the exception and return a default," and a clean rewrite that simply threw instead took down an entire nightly batch. In both cases a new diff broke a contract the old code held implicitly, and because the tests never expressed that contract, they stayed green.
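To make the second failure concrete, here is a minimal sketch of an implicit "swallow and return a default" contract, plus a characterization check that pins it down before any rewrite. Every name in it is hypothetical and invented for illustration; it is not code from the projects described above.

```python
# Hypothetical illustration; none of these names come from a real project.

def load_discount_old(fetch):
    """Old behavior: swallow any failure and fall back to a default of 0."""
    try:
        return fetch()
    except Exception:
        return 0  # implicit contract: callers never see an exception


def load_discount_clean(fetch):
    """A 'cleaner' rewrite that lets the exception propagate to the caller."""
    return fetch()


def failing_fetch():
    """Stand-in for an upstream call that fails."""
    raise TimeoutError("upstream timed out")


def keeps_default_contract(impl):
    """Characterization check: a failing fetch must still yield the default."""
    try:
        return impl(failing_fetch) == 0
    except Exception:
        return False


print(keeps_default_contract(load_discount_old))
print(keeps_default_contract(load_discount_clean))
```

A check like this turns the implicit contract into an explicit one, so a rewrite that changes it fails a test instead of a nightly batch.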

There is one takeaway: refactor size and reviewability have to be designed as separate things. Rewriting big is fine. The problem is being handed it all at once and forced to verify it all at once.

Declare checkpoints as a manifest, up front

When I start a refactor, the first thing I do is not generate code but mark the points I can return to. Rather than leaving that in comments or my head, I put it in a YAML manifest committed to the repo, so the later hooks and reviews can read it.

# refactor.checkpoints.yml — fixed before the refactor begins
target: Move OrderService toward a structure with clearer boundaries
rollback_signals:
  p95_latency_ms: 450      # exceed this -> revert to the previous CP
  error_rate_pct: 1.0
checkpoints:
  - id: CP1
    intent: Add interfaces/ and usecases/. Do not change a single line of OrderService
    invariant: No calls from existing code occur (pure addition only)
  - id: CP2
    intent: Add an adapter so the new UseCase calls the existing OrderService
    invariant: The entry Controller supports both paths via a flag defaulting to the old route
  - id: CP3
    intent: Port tests to the UseCase side. Keep old tests. Flag defaults to false
    invariant: Behavior with the flag false exactly matches CP2
  - id: CP4
    intent: Flip the flag to true for part of production traffic; if clean, invert the default
    invariant: The flag can be returned to false instantly at any time
  - id: CP5
    intent: Delete the old OrderService and the flag branch
    invariant: Predicated on CP4 being stable in production

The part that earns its keep is invariant — the condition that must hold once each checkpoint is done. Writing down not just "what to do" but "what is provably still intact when it finishes" naturally shifts the request to Claude Code from "rewrite the whole thing" to "produce only the diff that satisfies CP1's invariant." The reason a giant diff comes back is not Claude Code; it is that I never defined the granularity. Realizing that was the start of this whole practice.
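Since everything downstream leans on the manifest, it can help to check its shape before starting. The following is a minimal sketch, assuming the manifest has already been parsed into a dict (for example with yaml.safe_load). The function name and the validation rules are my own assumptions, not part of any tool.

```python
# Sketch only: the rules below are assumptions about what a usable manifest needs.

REQUIRED_FIELDS = ("id", "intent", "invariant")


def manifest_problems(manifest):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    checkpoints = manifest.get("checkpoints") or []
    if not checkpoints:
        problems.append("no checkpoints declared")
    seen = set()
    for index, cp in enumerate(checkpoints, start=1):
        for field in REQUIRED_FIELDS:
            value = cp.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"checkpoint #{index} is missing '{field}'")
        cp_id = cp.get("id")
        if cp_id is not None:
            if cp_id in seen:
                problems.append(f"duplicate checkpoint id: {cp_id}")
            seen.add(cp_id)
    if not manifest.get("rollback_signals"):
        problems.append("no rollback_signals declared")
    return problems


example = {
    "rollback_signals": {"p95_latency_ms": 450, "error_rate_pct": 1.0},
    "checkpoints": [
        {"id": "CP1", "intent": "Add interfaces/ and usecases/",
         "invariant": "pure addition only"},
        {"id": "CP2", "intent": "Add an adapter"},  # invariant left out on purpose
    ],
}
print(manifest_problems(example))
```

A checkpoint without an invariant is the case worth catching: it describes what to do but not what must still hold afterward.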

Embed the granularity contract in the prompt

Once the manifest is set, I hand Claude Code exactly one checkpoint's worth of scope. My template explicitly asks it to "stop at a proposal rather than implement" when the work threatens to spill over.

You will refactor this repository following refactor.checkpoints.yml.
This task is the scope of CP1 only.
 
Constraints:
1. Produce a diff that satisfies only CP1's intent and invariant.
   Do not change a single line of the existing OrderService.
2. If you judge a change beyond CP1's scope is needed, do not implement it.
   Write "Proposing this as CP2" with a 2-3 line reason.
3. Return your output as this JSON:
   {
     "diff_summary": "summary of changes",
     "invariant_check": "evidence CP1's invariant is not violated",
     "almost_broke": "where you nearly reached out of scope but stopped (at least one)"
   }

Forcing at least one almost_broke entry is the point. Claude Code likes to fold in adjacent improvements, and because the code looks cleaner you are tempted to accept them. Allow that and a single commit's meaning swells, and reversibility erodes. Do not reward out-of-scope improvements; route them to another checkpoint. Across client projects, the ones where I held that line finished the refactor faster overall — because when something went wrong, git revert worked one commit at a time.
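The output contract is only useful if something checks it. Below is a minimal sketch of such a check, assuming the reply has already been captured as a JSON string; how you capture it is outside the scope of this sketch. The function name and the "non-empty string or list" rule are assumptions of mine.

```python
import json

# Keys required by the prompt's output contract.
REQUIRED_KEYS = ("diff_summary", "invariant_check", "almost_broke")


def reply_problems(raw_reply):
    """Check a JSON reply against the contract; an empty list means it passes."""
    try:
        reply = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return [f"reply is not valid JSON: {exc.msg}"]
    if not isinstance(reply, dict):
        return ["reply must be a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = reply.get(key)
        # Accept a non-empty string or a non-empty list of entries.
        if isinstance(value, str):
            empty = not value.strip()
        elif isinstance(value, list):
            empty = len(value) == 0
        else:
            empty = True
        if empty:
            problems.append(f"'{key}' is missing or empty")
    return problems


sample = '{"diff_summary": "added interfaces/", "invariant_check": "no callers", "almost_broke": ""}'
print(reply_problems(sample))
```

An empty almost_broke is treated as a failure on purpose, because that field is the one the prompt insists on.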

Stop oversized diffs mechanically with a pre-push hook

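Asking nicely in the prompt fails the moment a human waves one through with "eh, just this once." So I make oversized diffs and checkpoint-less commits un-pushable short of a deliberate git push --no-verify. This hook lives in .git/hooks/pre-push. Because .git/hooks is not version-controlled, each clone needs its own copy, or a core.hooksPath setting that points at a tracked directory.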

#!/usr/bin/env bash
# .git/hooks/pre-push: enforce refactor granularity mechanically
set -euo pipefail

MAX_LINES=300
# Assumes the remote is "origin" and the base branch is "main"; adjust to your setup.
range="origin/main..HEAD"

# 1) Does every commit reference a checkpoint ID?
#    -w requires CP<number> to be a whole word, so "scp2" does not pass.
bad_msg=$(git log --format='%H %s' "$range" | grep -viwE 'CP[0-9]+' || true)
if [ -n "$bad_msg" ]; then
  echo "x Commits without a checkpoint ID (e.g. CP1):" >&2
  echo "$bad_msg" >&2
  exit 1
fi

# 2) Does any single commit exceed the line ceiling?
#    --numstat prints "added<TAB>deleted<TAB>path"; binary files show "-" and are skipped.
#    awk prints 0 for empty or merge commits, so pipefail does not abort the hook on them.
while read -r sha; do
  lines=$(git show --numstat --format='' "$sha" \
          | awk '$1 != "-" { total += $1 + $2 } END { print total + 0 }')
  if [ "$lines" -gt "$MAX_LINES" ]; then
    echo "x Commit ${sha:0:8} is ${lines} lines (ceiling ${MAX_LINES}). Split it." >&2
    exit 1
  fi
done < <(git rev-list "$range")

# 3) Does each commit build? (optional; push to CI if heavy)
echo "v granularity check passed"

The 300-line ceiling is a rule of thumb. In my experience the diff I can read without losing focus tops out around 300-500 lines, and erring toward the low end keeps me on the safe side. The "CP number required in the commit message" constraint quietly pulls weight too: a commit that names no checkpoint stops the push, so unplanned work cannot slip in unlabeled. Heavy build verification belongs in CI; locally, a light check on line count and checkpoint alignment proved the most practical.
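The hook only checks that a commit message contains something shaped like CP<number>; it does not confirm that the ID exists in the manifest. If you want that stricter check, the following is one possible sketch. The helper names are hypothetical, and it assumes the manifest IDs and the text output of git show --numstat (with the commit header suppressed) have already been collected elsewhere.

```python
import re

# A checkpoint ID is "CP" followed by digits, matched as a whole word.
CP_PATTERN = re.compile(r"\bCP[0-9]+\b")


def unknown_checkpoint_ids(subject, manifest_ids):
    """Return checkpoint IDs in a commit subject that the manifest lacks."""
    return [cp for cp in CP_PATTERN.findall(subject) if cp not in manifest_ids]


def changed_lines(numstat_output):
    """Sum insertions and deletions from `git show --numstat` text output."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or unexpected line
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) + int(deleted)
    return total


manifest_ids = {"CP1", "CP2", "CP3", "CP4", "CP5"}
print(unknown_checkpoint_ids("CP9: tidy unrelated logging", manifest_ids))
print(changed_lines("12\t3\tsrc/a.py\n-\t-\tlogo.png\n0\t40\tsrc/b.py\n"))
```

Keeping both helpers as pure functions over text makes them easy to test without a repository.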

When a giant diff appears, ask for a split

Even so, diffs over 300 lines happen routinely. Then I ask Claude Code to split — and the constraint that earns its keep here is "each commit must build and start on its own."

This diff is about XXX lines and hard to review. Propose a split with these rules:
 
- Commit 1: type definitions and empty implementations only (no behavior change)
- Commit 2: move old logic onto the new types, but do not change callers
- Commit 3: switch callers to the new abstraction
- Commit 4: delete the old implementation
 
Required: applying each commit on its own leaves the app able to build and start.
Do not output a split you cannot guarantee this for.

That one line makes Claude Code choose, on its own, a design that temporarily lets old and new implementations coexist via a "bridge." The commit that later deletes the bridge is short and clear, so both review and rollback get easier. The ideal split rarely arrives in one shot, so I check the first proposal against my manifest and negotiate — "I want one more step between CP2 and CP3." It feels less like a one-shot oracle and more like a refactor partner.
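For readers who have not built one, here is a minimal sketch of what such a coexistence bridge can look like, loosely following the CP2 shape in the manifest. The class names, method names, and the use_new_path flag are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of a coexistence bridge; not taken from a real codebase.

class LegacyOrderService:
    """Stands in for the existing implementation, left untouched."""

    def place(self, order_id):
        return f"placed:{order_id}"


class PlaceOrderUseCase:
    """New abstraction. For now it only delegates to the legacy service."""

    def __init__(self, legacy):
        self._legacy = legacy

    def execute(self, order_id):
        return self._legacy.place(order_id)  # the bridge


class OrderController:
    """Entry point supporting both paths; the flag defaults to the old route."""

    def __init__(self, legacy, use_case, use_new_path=False):
        self._legacy = legacy
        self._use_case = use_case
        self._use_new_path = use_new_path

    def handle(self, order_id):
        if self._use_new_path:
            return self._use_case.execute(order_id)
        return self._legacy.place(order_id)


legacy = LegacyOrderService()
use_case = PlaceOrderUseCase(legacy)
old_route = OrderController(legacy, use_case)
new_route = OrderController(legacy, use_case, use_new_path=True)
print(old_route.handle("42") == new_route.handle("42"))
```

Because both routes produce the same result at this stage, the commit that introduces the bridge changes no behavior, and the later commit that removes it only deletes code.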

Push self-review into the contract

Even with commit granularity in order, I make Claude Code surface its own weak spots to raise review density. In the same session I follow up:

Self-review this change on three points:
1. Contracts the existing code implicitly relied on that this change may break
   (e.g. exception propagation, log format, init order, null handling)
2. Paths with no tests that would hurt if they break in production
3. Among 1 and 2, the spots you are not confident about (no zero answers; at least one)

Point 3 — "at least one thing you are unsure about" — is the key. Claude Code is good at replying "no problems," which is useless for review. Forcing out the spots it suspects surfaces exactly where I should read closely. I paste that output into the commit message body so a future me, or another reviewer, can tell that "the author themselves flagged this as uncertain."

Tie the rollback decision to observability

The last piece is field verification. Instead of testing in big batches, I build the habit of verifying one commit at a time. I check three things: tests are green, I walk the main path once with my own eyes in a staging-equivalent environment, and the shape of the metric series has not visibly changed before versus after.

The third catches the unease tests cannot. I judge it semi-automatically using the manifest's rollback_signals.

# rollback_check.py: decide whether to revert using the manifest thresholds
import statistics
import sys

import yaml  # PyYAML

with open("refactor.checkpoints.yml") as f:
    sig = yaml.safe_load(f)["rollback_signals"]

def p95(series):
    # Nearest-rank 95th percentile: the smallest sample with at least
    # 95% of the series at or below it (integer math avoids float rounding).
    s = sorted(series)
    return s[(95 * len(s) + 99) // 100 - 1]

# before/after are same-window samples pulled from your monitoring stack.
# The three values below are placeholders; replace them before running.
before_lat = [...]   # response time before deploy (ms)
after_lat  = [...]   # response time after deploy (ms)
after_err  = ...     # error rate after deploy (%)

reasons = []
if p95(after_lat) > sig["p95_latency_ms"]:
    reasons.append(f"p95={p95(after_lat)}ms > {sig['p95_latency_ms']}ms")
if after_err > sig["error_rate_pct"]:
    reasons.append(f"error_rate={after_err}% > {sig['error_rate_pct']}%")

# Watch the shape too: warn if the median shifts right by more than 20%
if statistics.median(after_lat) > statistics.median(before_lat) * 1.2:
    reasons.append("median worse by >20% (verify even if tests are green)")

if reasons:
    print("<- candidate to revert to the previous checkpoint:")
    print("\n".join(reasons))
    sys.exit(1)  # non-zero exit lets a wrapper script or CI job act on it
print("v no rollback needed")

Set thresholds (like p95_latency_ms) from the measured distribution of the prior week or two, not from a number in an article. I place the value at roughly 1.3x the normal p95 and, when it is exceeded, revert to the previous commit without hesitation. I do not treat reverting as a cost. The whole reason I kept the granularity tight is so the safety valve fires exactly as designed at this moment.
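As a rough sketch of that calculation, the snippet below derives a threshold from baseline samples using a nearest-rank p95 and the 1.3x headroom mentioned above. The function names are hypothetical and the baseline is made-up data for illustration only, not a measurement.

```python
def nearest_rank_p95(samples):
    """Nearest-rank 95th percentile; raises ValueError on an empty series."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def rollback_threshold_ms(baseline_latencies_ms, headroom=1.3):
    """Threshold = headroom times the baseline p95, rounded to whole ms."""
    return round(nearest_rank_p95(baseline_latencies_ms) * headroom)


# Made-up baseline purely for illustration (100, 101, ..., 199 ms).
baseline = [100 + i for i in range(100)]
print(nearest_rank_p95(baseline))
print(rollback_threshold_ms(baseline))
```

The result is the number that would go into p95_latency_ms in the manifest; recomputing it from fresh data before each refactor keeps the threshold tied to how the service actually behaves.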

The speed pays off only when the boring loop runs fast

Hand a large refactor to Claude Code and you are tempted to expect "magic that rewrites it in one shot." But across my own indie projects and client work alike, what I keep feeling is that its real strength is running the boring loop fast. Declare checkpoints in a manifest, enforce granularity with a pre-push hook, split giant diffs through a coexistence bridge, force out weak spots with self-review, and decide rollbacks calmly from observability numbers. None of it is special technique.

Next time you start a refactor, write just five lines of refactor.checkpoints.yml before you let it write code. From there the way you ask Claude Code changes — and the nights you lie awake afraid production broke over the weekend quietly grow fewer. I hope it helps anyone working on the same problem.

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.