Claude Code · 2026-07-03 · Advanced

When a Claude Code Refactor Passes Every Test but Behaves Differently in Production — Catching Silent Contract Drift with a Behavior Diff Harness

Hand Claude Code a large refactor and your tests can stay green while production behavior quietly shifts. Here is how I record exception channels, log shape, init order, and return values as a signature, then diff them per commit to catch contract drift before it ships.

You ask Claude Code to reshape a module into a new architecture, you review the diff, the tests are green, and you ship with confidence. A few days later someone on the operations side quietly asks, "did the behavior change?" That kind of change, the one that slips through the net of unit tests, is the scariest part of a large refactor for me.

As an indie developer, I run personal apps and the automated publishing backends for several technical blogs, and I stepped straight into this while rewriting one of those pipelines end to end. Every test passed, yet one branch of the publish path silently no-opped, and I didn't notice for days. The cause: surrounding code depended on a "swallow the exception and return a default" contract, and the refactor had made it throw honestly instead. No test verified that contract, so it stayed green while broken.

This article describes the tooling I now use to catch "works but broken" in a layer separate from tests. At its center is the contract snapshot: record the observable behavior of code as a signature, diff it before and after the change, and run that comparison from a harness on every commit.

Why a big diff hides "works but broken"

Claude Code is capable, so most refactors come back as working code. The problem is that the larger the diff, the wider the gap grows between "it runs" and "it runs under the same contract as before."

By contract I don't only mean explicit things like type signatures. The nastier ones are implicit contracts — assumptions that function as prerequisites without being written anywhere.

Implicit contract	What breaks when it changes
Throws vs. returns a default	The caller's swallow-and-continue assumption collapses; a batch halts midway
Log line structure (key set, ordering)	A monitoring regex silently stops matching and alerts go quiet
Order of init / connection setup	Load-dependent failures, e.g. connection exhaustion only while idle
null vs. empty, rounding direction	Aggregates shift slightly and a downstream threshold flips

Unit tests are good at checking equality of return values, but they rarely cover these behavior signatures. So we slot in a layer, separate from tests, that records the signature itself before and after the refactor and compares them. That is the contract snapshot.
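To make the first row of the table concrete, here is a minimal sketch. The names (CATALOG, price_before, price_after, channel) are hypothetical and are not taken from the pipeline described above. Both versions satisfy the same happy-path assertion, yet they answer an unknown SKU through different channels.

```python
# Hypothetical catalog and functions, for illustration only.
CATALOG = {"A": 100}

def price_before(sku):
    """Original contract: swallow the lookup failure and return a default."""
    try:
        return CATALOG[sku]
    except KeyError:
        return 0

def price_after(sku):
    """Refactored version: the lookup failure now propagates to the caller."""
    return CATALOG[sku]

def channel(fn, sku):
    """Report which channel the result came back through."""
    try:
        fn(sku)
        return "return"
    except Exception as e:
        return f"raise:{type(e).__name__}"

# A typical happy-path unit test is green for both versions.
assert price_before("A") == price_after("A") == 100

# The failure case is where the contract differs.
print(channel(price_before, "GHOST"))  # return
print(channel(price_after, "GHOST"))   # raise:KeyError
```

A caller that loops over SKUs and relies on the default keeps going with the first version and halts midway with the second, which is the kind of difference a return-value assertion on valid input never sees.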

Contract snapshots — what to sign

A contract snapshot hits the target code path with a set of representative scenarios (probes) and folds the observable behavior of each run into a structured record. The key is to include not just the return value but which channel the result came back through.

For each probe, I record at least this:

The result value (normalized so it is comparable)
The channel the result returned on (a normal return, or an exception — and if an exception, the type name)
The "shape" of the log lines emitted during the probe (not values, but the set of keys and the ordering of levels)
The order of side effects (a labeled sequence of DB connects, external calls, and so on)

Recording shape rather than value is the point. Timestamps and IDs in log bodies change every run, so comparing them directly floods you with noise. By reducing to the skeleton — key set and level ordering — you only raise a diff when the structure your monitoring depends on actually changes.
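As a rough illustration of shape versus value, the sketch below assumes logs are emitted as JSON lines with a level key. That format, the sample lines, and the helper name log_shape are assumptions made for this example only.

```python
import json

def log_shape(line: str) -> tuple:
    """Reduce one JSON log line to (level, sorted key names); values are dropped."""
    record = json.loads(line)
    return (record.get("level"), tuple(sorted(record)))

# Two runs of the same code path: timestamps and IDs differ, the skeleton does not.
run_1 = '{"level": "INFO", "ts": "2026-07-01T10:00:00Z", "order_id": "a1", "msg": "published"}'
run_2 = '{"level": "INFO", "ts": "2026-07-02T23:15:42Z", "order_id": "zz9", "msg": "published"}'

# After a refactor that renamed a key that a monitoring query might depend on.
renamed = '{"level": "INFO", "ts": "2026-07-02T23:15:42Z", "orderId": "zz9", "msg": "published"}'

assert log_shape(run_1) == log_shape(run_2)    # values differ, shape is the same
assert log_shape(run_1) != log_shape(renamed)  # a renamed key changes the shape
```

Comparing the raw lines would flag every run as different; comparing the shapes flags only the rename.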

Designing baseline probes

First, at the last commit before the refactor starts, record the baseline signature for the target code path. Probes are not about coverage; pick a dozen or so that sit on the boundaries where contracts concentrate. My rule of thumb is "two or three boundary and failure cases for every happy-path case," because implicit contracts show up most densely in the failure cases.

A plain data structure is enough for a probe.

# probes.py — enumerate representative scenarios where contracts concentrate
from dataclasses import dataclass
from typing import Any, Callable
 
@dataclass
class Probe:
    name: str
    run: Callable[[], Any]   # hit the target code path once; value may be un-normalized
 
def build_probes(order_service) -> list[Probe]:
    return [
        Probe("normal_single_item",
              lambda: order_service.total(items=[{"sku": "A", "qty": 1, "price": 100}])),
        # boundary: zero quantity. rounding and empty-set handling is the contract
        Probe("boundary_zero_qty",
              lambda: order_service.total(items=[{"sku": "A", "qty": 0, "price": 100}])),
        # failure: missing inventory. throw vs. default is the heart of the contract
        Probe("missing_inventory",
              lambda: order_service.total(items=[{"sku": "GHOST", "qty": 1, "price": 100}])),
        # failure: malformed input. verifies the caller's swallow assumption
        Probe("malformed_input",
              lambda: order_service.total(items=[{"sku": "A"}])),
    ]

What matters is that you write these probes yourself rather than letting Claude Code do it. The person who has operated the domain understands its implicit contracts best. You will hand the structural changes to Claude Code, but you keep hold of the definition of what must not break.

A harness that detects contract drift

Run the probes once, normalize each record into a signature, and save it as the baseline JSON. After each commit of the refactor, do the same and compare signatures. Here is a minimal harness you can run per commit as is.

# contract_harness.py — capture contract snapshots and diff them
import json
import logging
import sys
from contextlib import contextmanager
 
def _normalize_value(v):
    """Into a comparable canonical form. Fix rounding and ordering on purpose."""
    if isinstance(v, float):
        # rounding direction is a contract, so sign it at a fixed precision
        return round(v, 4)
    if isinstance(v, dict):
        return {k: _normalize_value(v[k]) for k in sorted(v)}
    if isinstance(v, (list, tuple)):
        return [_normalize_value(x) for x in v]
    return v
 
@contextmanager
def _capture_logs():
    """Collect only the 'shape' of emitted log lines (logger, level, key set)."""
    buf = []
    handler = logging.Handler()
    handler.emit = lambda r: buf.append(
        (r.name, r.levelname, tuple(sorted((r.__dict__.get("extra_keys") or []))))
    )
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        yield buf
    finally:
        root.removeHandler(handler)
 
def take_record(probe):
    """Run one probe; sign value, channel, and log shape into a record."""
    with _capture_logs() as logs:
        try:
            value = probe.run()
            channel = "return"
            exc_type = None
        except Exception as e:          # noqa: BLE001 — detecting channel drift is the goal
            value = None
            channel = "raise"
            exc_type = type(e).__name__
    return {
        "probe": probe.name,
        "channel": channel,             # return / raise — this itself is a contract
        "exc_type": exc_type,           # a changed exception type is contract drift too
        "value": _normalize_value(value),
        "log_shape": sorted(set(logs)), # deduplicated skeleton, not values; emission order is not kept
    }
 
def take_snapshot(probes):
    return {r["probe"]: r for r in (take_record(p) for p in probes)}
 
def diff_snapshots(baseline: dict, current: dict) -> list[dict]:
    drifts = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None:
            drifts.append({"probe": name, "kind": "probe_missing"})
            continue
        if base["channel"] != cur["channel"]:
            drifts.append({"probe": name, "kind": "channel_drift",
                           "from": base["channel"], "to": cur["channel"]})
        if base["exc_type"] != cur["exc_type"]:
            drifts.append({"probe": name, "kind": "exception_drift",
                           "from": base["exc_type"], "to": cur["exc_type"]})
        if base["value"] != cur["value"]:
            drifts.append({"probe": name, "kind": "value_drift",
                           "from": base["value"], "to": cur["value"]})
        if base["log_shape"] != cur["log_shape"]:
            drifts.append({"probe": name, "kind": "log_shape_drift"})
    return drifts
 
def main():
    from probes import build_probes
    from app import build_order_service     # your app's assembly entry point
    probes = build_probes(build_order_service())
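    # Round-trip through JSON so tuples and non-JSON values take the same form
    # as the stored baseline; otherwise log_shape tuples never equal the lists
    # loaded from disk and every check with captured logs reports false drift.
    snapshot = json.loads(json.dumps(take_snapshot(probes), default=str))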
 
    mode = sys.argv[1] if len(sys.argv) > 1 else "check"
    path = ".contract/baseline.json"
    if mode == "record":
        with open(path, "w") as f:
            json.dump(snapshot, f, ensure_ascii=False, indent=2, default=str)
        print(f"baseline recorded: {len(snapshot)} probes")
        return 0
 
    with open(path) as f:
        baseline = json.load(f)
    drifts = diff_snapshots(baseline, snapshot)
    if drifts:
        print("contract drift detected:")
        print(json.dumps(drifts, ensure_ascii=False, indent=2, default=str))
        return 1
    print("no contract drift")
    return 0
 
if __name__ == "__main__":
    raise SystemExit(main())

What makes this fundamentally different from a test is that it does not judge whether a value is correct. It only judges whether the signature changed from before. So an intended change surfaces as drift too — and that is fine. Judging intent is the human's job; the harness's role is to put the fact that something changed on the table, without omission.

Running it per commit — hooks and a CI gate

Contract snapshots lose their edge if you only run them after a commit grows large. Keep the granularity at one commit equals one reversible change, and diff right after each commit — that is the most efficient rhythm.

Before touching the refactor, record the baseline once and commit it into the repo.

# run once at the commit that was definitely working, before the refactor
mkdir -p .contract
python contract_harness.py record
git add .contract/baseline.json
git commit -m "chore: pin the contract snapshot baseline"

Then always diff before you push. Run as a pre-push hook, the check lowers the chance that drift reaches a shared branch before anyone has looked at it.

#!/usr/bin/env bash
# .git/hooks/pre-push: stop the push if contract drift appears
# (the shebang must be the first line; make the file executable with chmod +x)
set -euo pipefail
if ! python contract_harness.py check; then
  echo ""
  echo "Contract drift detected. If the change is intended, update the baseline:"
  echo "  python contract_harness.py record && git add .contract/baseline.json"
  exit 1
fi

In CI, make this a required check on the refactor branch. Have jobs that hit drift keep the diff JSON as an artifact, so reviewers can see at a glance which contract moved and in which direction. Handing that diff JSON to Claude Code for a first pass — "is this channel_drift intentional or an accident?" — is useful, but you make the final call.
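One way to keep the diff JSON around is to write it to a file that the CI job then uploads. The sketch below is only that file-writing step. The path .contract/drift.json and the helper names summarize_drifts and write_drift_artifact are hypothetical, and the upload itself depends on your CI system and is not shown.

```python
import json
from collections import Counter
from pathlib import Path

def summarize_drifts(drifts):
    """Count drift entries per kind, e.g. {'channel_drift': 1}."""
    return dict(Counter(d["kind"] for d in drifts))

def write_drift_artifact(drifts, path=".contract/drift.json"):
    """Write the drift list plus a per-kind summary; return the file path.

    The default path is an assumption for this sketch, not something the
    harness above defines.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = {"summary": summarize_drifts(drifts), "drifts": drifts}
    target.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    return target
```

Calling write_drift_artifact(drifts) in the branch of main() that returns 1 would leave the file behind for the job to pick up. The summary gives reviewers the per-kind counts before they read the individual entries.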

Signing log-shape and init-order drift

Value drift is relatively easy to see, but what quietly breaks operations is usually the non-functional side, and the way you sign it needs a little care.

For log shape, don't record the body; reduce it to the "key set and level" of structured logs. The harness above reads a custom extra_keys attribute from each record, which exists only if your logging calls attach it (for example through extra=), and it only sees records that pass the logger's effective level. In practice, pulling the set of key names from a structured (say, JSON) logger record is more reliable. Monitoring regexes and dashboard queries depend on this skeleton, so raising a diff only when the skeleton changes catches the essence without noise.
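If your code passes structured fields through the standard library's extra= argument, one possible way to recover the key set is to subtract the attributes every LogRecord carries from the attributes of the record at hand. This is a sketch under that assumption; ShapeHandler, record_shape, and the logger name publish are hypothetical names for this example.

```python
import logging

# Attribute names every LogRecord carries. "message" and "asctime" are added
# later by formatters, so they are excluded as well.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}

def record_shape(record: logging.LogRecord) -> tuple:
    """Reduce a record to (logger name, level name, sorted extra key names)."""
    extra_keys = sorted(set(record.__dict__) - _STANDARD_ATTRS)
    return (record.name, record.levelname, tuple(extra_keys))

class ShapeHandler(logging.Handler):
    """Collects shapes only; message bodies and field values are discarded."""

    def __init__(self):
        super().__init__()
        self.shapes = []

    def emit(self, record):
        self.shapes.append(record_shape(record))

logger = logging.getLogger("publish")  # hypothetical logger name
logger.setLevel(logging.INFO)          # INFO is filtered out at the default WARNING level
handler = ShapeHandler()
logger.addHandler(handler)
try:
    logger.info("published", extra={"order_id": "a1", "attempt": 1})
finally:
    logger.removeHandler(handler)

print(handler.shapes)  # [('publish', 'INFO', ('attempt', 'order_id'))]
```

A handler like this could stand in for the lambda-based one in _capture_logs. Whether the emission order of shapes should be part of the signature is a separate choice; this sketch keeps it in the list and leaves that decision to the caller.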

For init order, sign it as a labeled sequence of side effects. With a thin test hook, record the order of labels like "db_connect," "cache_warm," and "external_ping," then compare the sequence. Order can change while the happy-path value stays the same, so it never surfaces in value drift. That is precisely why it is worth holding as an independent signature. The idle-time connection exhaustion I hit in the past was exactly a case where this init order was off by one step; with an order signature, I would have caught it at that single commit.

When drift appears — revert or update the contract

Once the harness puts drift on the table, the decision narrows to two choices: if the contract should have held, revert; if changing the contract is correct, update it explicitly. The tool's greatest benefit is that it does not let you move on while the situation is still ambiguous.

Some guidance for the call:

channel_drift (return↔raise) and exception_drift: as a rule, revert first. They are likely to break the caller's swallow assumption and the blast radius is hard to read. If you do change the contract, fix every caller path in the same commit, then update the baseline.
value_drift: confirm intent by the shape of the observed series rather than by the single changed value. Walk the main flow once in staging and check that the response distribution and the error-category distribution in your observability tooling haven't changed. If the shape is unchanged, accept it as an intended value change and update the baseline.
log_shape_drift: this is a contract with your monitoring, so if you update, make the monitoring regexes and queries follow in the same change. Updating only one side to go green is the quietest way to break.
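The guidance above can be attached to the harness output as a suggested first move per drift kind. This is a sketch; the action labels and the triage helper are hypothetical, and the result is a prompt for the reviewer, not a decision.

```python
# Default first move per drift kind, following the guidance above.
# Kinds not listed (for example probe_missing) fall back to manual review.
DEFAULT_ACTION = {
    "channel_drift": "revert_first",
    "exception_drift": "revert_first",
    "value_drift": "verify_distributions_in_staging",
    "log_shape_drift": "update_monitoring_in_same_change",
}

def triage(drifts):
    """Attach a suggested first move to each drift entry; a human still decides."""
    return [
        {**d, "suggested": DEFAULT_ACTION.get(d["kind"], "manual_review")}
        for d in drifts
    ]

example = [
    {"probe": "missing_inventory", "kind": "channel_drift", "from": "return", "to": "raise"},
    {"probe": "malformed_input", "kind": "probe_missing"},
]
for entry in triage(example):
    print(entry["probe"], "->", entry["suggested"])
# missing_inventory -> revert_first
# malformed_input -> manual_review
```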

Always make the baseline update an independent commit, and leave "why the contract changed" in the message. Then a future you or another reviewer can immediately tell that the signature change was intent, not an accident. The one thing to avoid is silently re-recording the baseline to make drift disappear — that is removing the safety valve with your own hands.
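The independent-commit rule can be checked mechanically. The sketch below assumes the baseline lives at .contract/baseline.json as in the harness above; the helper names are hypothetical. It uses git diff-tree to list the files a commit touched, which does not cover the very first commit of a repository without extra flags.

```python
import subprocess

BASELINE = ".contract/baseline.json"

def is_isolated_baseline_change(changed_files) -> bool:
    """True when the baseline is untouched, or is the only file in the commit."""
    files = set(changed_files)
    return BASELINE not in files or files == {BASELINE}

def changed_files_in(commit: str = "HEAD") -> list:
    """List the paths touched by one commit, using git diff-tree."""
    out = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]

# The rule itself, on plain lists of paths:
assert is_isolated_baseline_change([".contract/baseline.json"])
assert is_isolated_baseline_change(["app.py", "probes.py"])
assert not is_isolated_baseline_change([".contract/baseline.json", "app.py"])
```

Run in CI as is_isolated_baseline_change(changed_files_in("HEAD")), this would flag a commit that re-records the baseline alongside code changes, which is the pattern most likely to hide a silent re-record.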

Wiring the harness into your Claude Code requests

Finally, here is how I slot this into the Claude Code workflow. When I ask for a large refactor, I always put this constraint at the top of the prompt.

We are doing a large refactor in this repo. Please follow these constraints.
 
1. Split changes into one-commit-equals-one-reversible-unit, and each commit
   must build and start on its own.
2. After each commit I will run: python contract_harness.py check.
   If drift appears in channel / exception / value / log_shape, resolve it
   within that commit, or stop and state why it cannot be resolved.
3. Do not modify .contract/baseline.json or probes.py.
   I manage the definition of the contract.

Claude Code's generation speed is real. But to use that speed with confidence in production, you need a mechanism that checks "runs under the same contract as before" separately from "generated fast." A contract snapshot is a thin layer that takes on only the former. Keep the two roles distinct: tests watch "correctness," drift detection watches "sameness with before." Even at the scale of a large rewrite, far fewer changes then slide into production without ever landing on the table.

The next time you hand a big refactor to Claude Code, try starting — before you write a single line of code — by writing just three failure-case probes and recording the baseline. Those three will be what protects you from the quiet outage a few days later.
