When a Claude Code Refactor Passes Every Test but Behaves Differently in Production — Catching Silent Contract Drift with a Behavior Diff Harness
Hand Claude Code a large refactor and your tests can stay green while production behavior quietly shifts. Here is how I record exception channels, log shape, init order, and return values as a signature, then diff them per commit to catch contract drift before it ships.
You ask Claude Code to reshape a module into a new architecture, you review the diff, the tests are green, you ship with confidence — and a few days later someone on the operations side quietly asks, "did the behavior change?" That nagging feeling, the one that slips through the net of unit tests, is the scariest part of a large refactor for me.
As an indie developer, I run personal apps and the automated publishing backends for several technical blogs, and I walked straight into this while rewriting one of those pipelines end to end. Every test passed, yet one branch of the publish path silently became a no-op, and I didn't notice for days. The cause: surrounding code depended on a "swallow the exception and return a default" contract, and the refactor had changed the function to raise instead. No test verified that contract, so the suite stayed green while the behavior was broken.
This article describes the tooling I now use to catch "works but broken" in a layer separate from tests. At the center is the idea of a contract snapshot: recording the observable behavior of code as a signature and diffing it before and after a change, plus a harness you run per commit.
Why a big diff hides "works but broken"
Claude Code is capable, so most refactors come back as working code. The problem is that the larger the diff, the wider the gap grows between "it runs" and "it runs under the same contract as before."
By contract I don't only mean explicit things like type signatures. The nastier ones are implicit contracts — assumptions that function as prerequisites without being written anywhere.
| Implicit contract | What breaks when it changes |
| --- | --- |
| Throws vs. returns a default | The caller's swallow-and-continue assumption collapses; a batch halts midway |
| Log line structure (key set, ordering) | A monitoring regex silently stops matching and alerts go quiet |
| Order of init / connection setup | Load-dependent failures, e.g. connection exhaustion only while idle |
| null vs. empty, rounding direction | Aggregates shift slightly and a downstream threshold flips |
Unit tests are good at checking equality of return values, but they rarely cover these behavior signatures. So we slot in a layer, separate from tests, that records the signature itself before and after the refactor and compares them. That is the contract snapshot.
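To make the gap concrete, here is a minimal sketch of two versions of a lookup that both pass the same equality test but come back on different channels for a failure case. The function names (`price_v1`, `price_v2`, `channel_of`) and the catalog are hypothetical, chosen only for illustration.

```python
# Hypothetical example: names and data are illustrative, not from a real codebase.

def price_v1(sku: str, catalog: dict) -> int:
    """Before the refactor: swallow the lookup failure and return a default."""
    try:
        return catalog[sku]
    except KeyError:
        return 0


def price_v2(sku: str, catalog: dict) -> int:
    """After the refactor: the lookup failure now propagates to the caller."""
    return catalog[sku]


def channel_of(fn, *args):
    """Report which channel the result came back on, plus the exception type."""
    try:
        fn(*args)
        return ("return", None)
    except Exception as e:
        return ("raise", type(e).__name__)


catalog = {"A": 100}

# The only existing unit test checks the happy path, so both versions pass.
assert price_v1("A", catalog) == 100
assert price_v2("A", catalog) == 100

# The behavior signature differs on the failure path that no test covers.
print(channel_of(price_v1, "GHOST", catalog))  # ('return', None)
print(channel_of(price_v2, "GHOST", catalog))  # ('raise', 'KeyError')
```

The equality assertions stay green for both versions; only the channel comparison shows that the contract moved.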
Contract snapshots — what to sign
A contract snapshot hits the target code path with a set of representative scenarios (probes) and folds the observable behavior of each run into a structured record. The key is to include not just the return value but which channel the result came back through.
For each probe, I record at least this:
The result value (normalized so it is comparable)
The channel the result returned on (a normal return, or an exception — and if an exception, the type name)
The "shape" of the log lines emitted during the probe (not values, but the set of keys and the ordering of levels)
The order of side effects (a labeled sequence of DB connects, external calls, and so on)
Recording shape rather than value is the point. Timestamps and IDs in log bodies change every run, so comparing them directly floods you with noise. By reducing to the skeleton — key set and level ordering — you only raise a diff when the structure your monitoring depends on actually changes.
First, at the commit before you start the refactor, record the baseline signature for the target code path. Probes are not about coverage; aim for the boundaries where contracts concentrate and pick a dozen or so. My rule of thumb is "two or three boundary and failure cases per one happy-path case," because implicit contracts show up most densely in the failure cases.
A plain data structure is enough for a probe.
    # probes.py: enumerate representative scenarios where contracts concentrate
    from dataclasses import dataclass
    from typing import Any, Callable


    @dataclass
    class Probe:
        name: str
        run: Callable[[], Any]  # hit the target code path once; value may be un-normalized


    def build_probes(order_service) -> list[Probe]:
        return [
            Probe("normal_single_item",
                  lambda: order_service.total(items=[{"sku": "A", "qty": 1, "price": 100}])),
            # boundary: zero quantity. rounding and empty-set handling is the contract
            Probe("boundary_zero_qty",
                  lambda: order_service.total(items=[{"sku": "A", "qty": 0, "price": 100}])),
            # failure: missing inventory. throw vs. default is the heart of the contract
            Probe("missing_inventory",
                  lambda: order_service.total(items=[{"sku": "GHOST", "qty": 1, "price": 100}])),
            # failure: malformed input. verifies the caller's swallow assumption
            Probe("malformed_input",
                  lambda: order_service.total(items=[{"sku": "A"}])),
        ]
What matters is that you write these probes yourself rather than letting Claude Code do it. The person who has operated the domain understands its implicit contracts best. You will hand the structural changes to Claude Code, but you keep hold of the definition of what must not break.
A harness that detects contract drift
Run the probes once, normalize each record into a signature, and save it as the baseline JSON. After each commit of the refactor, do the same and compare signatures. Here is a minimal harness you can run per commit as is.
    # contract_harness.py: capture contract snapshots and diff them
    import json
    import logging
    import os
    import sys
    from contextlib import contextmanager


    def _normalize_value(v):
        """Into a comparable canonical form. Fix rounding and ordering on purpose."""
        if isinstance(v, float):
            # rounding direction is a contract, so sign it at a fixed precision
            return round(v, 4)
        if isinstance(v, dict):
            return {k: _normalize_value(v[k]) for k in sorted(v)}
        if isinstance(v, (list, tuple)):
            return [_normalize_value(x) for x in v]
        return v


    @contextmanager
    def _capture_logs():
        """Collect only the 'shape' of emitted log lines (logger, level, key set)."""
        buf = []
        handler = logging.Handler()
        # extra_keys is not a standard LogRecord attribute; this assumes callers
        # log with extra={"extra_keys": [...]}. Records below the effective
        # logger level never reach this handler.
        handler.emit = lambda r: buf.append(
            (r.name, r.levelname, tuple(sorted((r.__dict__.get("extra_keys") or []))))
        )
        root = logging.getLogger()
        root.addHandler(handler)
        try:
            yield buf
        finally:
            root.removeHandler(handler)


    def take_record(probe):
        """Run one probe; sign value, channel, and log shape into a record."""
        with _capture_logs() as logs:
            try:
                value = probe.run()
                channel = "return"
                exc_type = None
            except Exception as e:  # noqa: BLE001 (detecting channel drift is the goal)
                value = None
                channel = "raise"
                exc_type = type(e).__name__
        return {
            "probe": probe.name,
            "channel": channel,    # return / raise: this itself is a contract
            "exc_type": exc_type,  # a changed exception type is contract drift too
            "value": _normalize_value(value),
            # sign the skeleton, not the values. sorted(set(...)) also drops
            # emission order and repeat counts; keep list(logs) if order matters
            "log_shape": sorted(set(logs)),
        }


    def take_snapshot(probes):
        snapshot = {r["probe"]: r for r in (take_record(p) for p in probes)}
        # Round-trip through JSON so the in-memory snapshot has the same types as
        # a baseline loaded from disk (tuples become lists, default=str applies).
        # Without this, any non-empty log_shape always compares as drifted.
        return json.loads(json.dumps(snapshot, default=str))


    def diff_snapshots(baseline: dict, current: dict) -> list[dict]:
        drifts = []
        for name, base in baseline.items():
            cur = current.get(name)
            if cur is None:
                drifts.append({"probe": name, "kind": "probe_missing"})
                continue
            if base["channel"] != cur["channel"]:
                drifts.append({"probe": name, "kind": "channel_drift",
                               "from": base["channel"], "to": cur["channel"]})
            if base["exc_type"] != cur["exc_type"]:
                drifts.append({"probe": name, "kind": "exception_drift",
                               "from": base["exc_type"], "to": cur["exc_type"]})
            if base["value"] != cur["value"]:
                drifts.append({"probe": name, "kind": "value_drift",
                               "from": base["value"], "to": cur["value"]})
            if base["log_shape"] != cur["log_shape"]:
                drifts.append({"probe": name, "kind": "log_shape_drift"})
        return drifts


    def main():
        from probes import build_probes
        from app import build_order_service  # your app's assembly entry point

        probes = build_probes(build_order_service())
        snapshot = take_snapshot(probes)
        mode = sys.argv[1] if len(sys.argv) > 1 else "check"
        path = ".contract/baseline.json"

        if mode == "record":
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                json.dump(snapshot, f, ensure_ascii=False, indent=2, default=str)
            print(f"baseline recorded: {len(snapshot)} probes")
            return 0

        with open(path) as f:
            baseline = json.load(f)
        drifts = diff_snapshots(baseline, snapshot)
        if drifts:
            print("contract drift detected:")
            print(json.dumps(drifts, ensure_ascii=False, indent=2, default=str))
            return 1
        print("no contract drift")
        return 0


    if __name__ == "__main__":
        raise SystemExit(main())
What makes this fundamentally different from a test is that it does not judge whether a value is correct. It only judges whether the signature changed from before. So an intended change surfaces as drift too — and that is fine. Judging intent is the human's job; the harness's role is to put the fact that something changed on the table, without omission.
Running it per commit — hooks and a CI gate
Contract snapshots lose their edge if you only run them after a commit grows large. Keep the granularity at one commit equals one reversible change, and diff right after each commit — that is the most efficient rhythm.
Before touching the refactor, record the baseline once and commit it into the repo.
    # run once at the commit that was definitely working, before the refactor
    mkdir -p .contract
    python contract_harness.py record
    git add .contract/baseline.json
    git commit -m "chore: pin the contract snapshot baseline"

Then always diff before push. Run as a pre-push hook, the check reduces the chance that drift reaches a shared branch without anyone having looked at it.
    #!/usr/bin/env bash
    # .git/hooks/pre-push: stop the push if contract drift appears
    # (the shebang must be the first line of the hook file)
    set -euo pipefail

    if ! python contract_harness.py check; then
      echo ""
      echo "Contract drift detected. If the change is intended, update the baseline:"
      echo "  python contract_harness.py record && git add .contract/baseline.json"
      exit 1
    fi
In CI, make this a required check on the refactor branch. Have jobs that hit drift keep the diff JSON as an artifact, so reviewers can see at a glance which contract moved and in which direction. Handing that diff JSON to Claude Code for a first pass — "is this channel_drift intentional or an accident?" — is useful, but you make the final call.
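One way to produce that artifact is sketched below. It assumes the drift list has the shape `diff_snapshots()` returns; the helper name `write_drift_artifact` and the artifact path are hypothetical, and the step that uploads the file depends on your CI system, so it is not shown.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def write_drift_artifact(drifts: list[dict], path: Path) -> str:
    """Save the full drift list as JSON and return a one-line summary by kind."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(drifts, ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    counts = Counter(d["kind"] for d in drifts)
    return ", ".join(f"{kind}={n}" for kind, n in sorted(counts.items()))


# Illustrative drift entries in the same shape diff_snapshots() returns.
drifts = [
    {"probe": "missing_inventory", "kind": "channel_drift", "from": "return", "to": "raise"},
    {"probe": "boundary_zero_qty", "kind": "value_drift", "from": 0, "to": None},
    {"probe": "normal_single_item", "kind": "value_drift", "from": 100, "to": 100.5},
]

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "artifacts" / "contract_drift.json"  # hypothetical path
    summary = write_drift_artifact(drifts, artifact)
    saved = json.loads(artifact.read_text(encoding="utf-8"))

print(summary)  # channel_drift=1, value_drift=2
```

The one-line summary suits the job log, while the JSON file carries the "from" and "to" detail a reviewer needs.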
Signing log-shape and init-order drift
Value drift is relatively easy to see, but what quietly breaks operations is usually the non-functional side, and the way you sign it needs a little care.
For log shape, don't record the body; reduce it to the "key set and level" of structured logs. The harness above uses a simple extra_keys grab, but in practice pulling the set of key names from a structured (say, JSON) logger record is more reliable. Monitoring regexes and dashboard queries depend on this skeleton, so raising a diff only when the skeleton changes catches the essence without noise.
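As a rough sketch of that more reliable variant, the handler below assumes each log message is a JSON object and keeps only the logger name, the level, and the sorted key names. The class name `ShapeHandler`, the logger name `publish`, and the log fields are all hypothetical.

```python
import json
import logging


class ShapeHandler(logging.Handler):
    """Collect (logger name, level, sorted key names) for each JSON log line."""

    def __init__(self):
        super().__init__()
        self.shapes = []

    def emit(self, record):
        try:
            payload = json.loads(record.getMessage())
            keys = tuple(sorted(payload)) if isinstance(payload, dict) else ()
        except ValueError:
            # Not JSON at all: sign it as unstructured rather than dropping it.
            keys = ("<unstructured>",)
        self.shapes.append((record.name, record.levelname, keys))


logger = logging.getLogger("publish")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ShapeHandler()
logger.addHandler(handler)

# Values differ between runs; the key set and level do not.
logger.info(json.dumps({"ts": "2024-01-01T00:00:00Z", "post_id": 1, "status": "ok"}))
logger.info(json.dumps({"ts": "2024-01-02T09:30:00Z", "post_id": 2, "status": "ok"}))
logger.warning(json.dumps({"ts": "2024-01-02T09:31:00Z", "post_id": 3, "error": "timeout"}))

for shape in handler.shapes:
    print(shape)
```

The first two lines reduce to the same shape despite different timestamps and IDs, so only a renamed or dropped key, or a changed level, would show up as a diff.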
For init order, sign it as a labeled sequence of side effects. With a thin test hook, record the order of labels like "db_connect," "cache_warm," and "external_ping," then compare the sequence. Order can change while the happy-path value stays the same, so it never surfaces in value drift. That is precisely why it is worth holding as an independent signature. The idle-time connection exhaustion I hit in the past was exactly a case where this init order was off by one step; with an order signature, I would have caught it at that single commit.
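A thin hook of that kind could look like the sketch below. The `mark` and `record_effects` helpers and the two `init_*` functions are hypothetical stand-ins for real setup code, and the module-level list is not thread-safe.

```python
from contextlib import contextmanager

_effects: list[str] = []  # module-level trace; a sketch, not thread-safe


def mark(label: str) -> None:
    """Call from init code at each side effect worth signing."""
    _effects.append(label)


@contextmanager
def record_effects():
    """Yield a list that holds the emitted labels once the block has exited."""
    _effects.clear()
    trace: list[str] = []
    try:
        yield trace
    finally:
        trace.extend(_effects)
        _effects.clear()


def init_before():
    mark("db_connect")
    mark("cache_warm")
    mark("external_ping")


def init_after():
    # Same work, but the refactor swapped the first two steps.
    mark("cache_warm")
    mark("db_connect")
    mark("external_ping")


with record_effects() as before:
    init_before()
with record_effects() as after:
    init_after()

print(before)
print(after)
print("order drift:", before != after)
```

Both versions perform the same three steps, so a value comparison would see nothing; the sequence comparison is what reports the swap.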
When drift appears — revert or update the contract
Once the harness puts drift on the table, the decision converges to two choices: if the contract should have held, revert; if changing the contract is correct, update it explicitly. The tool's greatest benefit is that it does not let you move on while that question is still ambiguous.
Some guidance for the call:
channel_drift (return↔raise) and exception_drift: as a rule, revert first. They are likely to break the caller's swallow assumption and the blast radius is hard to read. If you do change the contract, fix every caller path in the same commit, then update the baseline.
value_drift: confirm intent by the shape of the series. Walk the main flow once in staging and check that the response distribution and error-category distribution in observability haven't changed. If the shape is unchanged, accept it as an intended value change and update the baseline.
log_shape_drift: this is a contract with your monitoring, so if you update, make the monitoring regexes and queries follow in the same change. Updating only one side to go green is the quietest way to break.
Always make the baseline update an independent commit, and leave "why the contract changed" in the message. Then a future you or another reviewer can immediately tell that the signature change was intent, not an accident. The one thing to avoid is silently re-recording the baseline to make drift disappear — that is removing the safety valve with your own hands.
Wiring the harness into your Claude Code requests
Finally, here is how I slot this into the Claude Code workflow. When I ask for a large refactor, I always put this constraint at the top of the prompt.
    We are doing a large refactor in this repo. Please follow these constraints.
    1. Split changes into one-commit-equals-one-reversible-unit, and each commit must build and start on its own.
    2. After each commit I will run: python contract_harness.py check. If drift appears in channel / exception / value / log_shape, resolve it within that commit, or stop and state why it cannot be resolved.
    3. Do not modify .contract/baseline.json or probes.py. I manage the definition of the contract.
Claude Code's generation speed is real. But to use that speed with confidence in production, you need to treat "generated fast" and "runs under the same contract as before" as separate guarantees. A contract snapshot is a thin layer that takes on only the latter. Keep the two roles distinct: tests watch "correctness," drift detection watches "sameness with before." Then, even at the scale of a large rewrite, far fewer changes slide into production without ever landing on the table.
The next time you hand a big refactor to Claude Code, try starting — before you write a single line of code — by writing just three failure-case probes and recording the baseline. Those three will be what protects you from the quiet outage a few days later.