●CODE — Claude Code adds Trusted Devices, verifying a machine before remote admin sessions begin●CODE — CPU use drops about 37% during streaming, keeping long always-on automation steadier●CODE — Fullscreen mouse-click controls, voice dictation fixes, and better Linux voice detection land●AUTH — Static API keys can now be replaced with short-lived, scoped WIF credentials●TEAM — You can tag Claude directly in Slack and delegate tasks while you focus elsewhere●WORKFLOW — Dynamic workflows arrive in research preview, breaking complex work into steps on their own●CODE — Claude Code adds Trusted Devices, verifying a machine before remote admin sessions begin●CODE — CPU use drops about 37% during streaming, keeping long always-on automation steadier●CODE — Fullscreen mouse-click controls, voice dictation fixes, and better Linux voice detection land●AUTH — Static API keys can now be replaced with short-lived, scoped WIF credentials●TEAM — You can tag Claude directly in Slack and delegate tasks while you focus elsewhere●WORKFLOW — Dynamic workflows arrive in research preview, breaking complex work into steps on their own
Every Tool Call Succeeds, Yet Nothing Moves Forward: Detecting Stagnation in Unattended Agents
No errors, yet the agent keeps replaying the same move while your budget quietly drains. Here is how to detect a success-but-no-progress loop using a progress oracle and action fingerprints, with a working Python implementation that halts safely.
The article-generation pipeline I run overnight was still going in the morning.
I opened the logs. Not a single tool call had failed. read_file returned, edit_file returned, run_tests returned, then back to read_file. Everything was a clean 200, not one error line. And yet the output was barely different from six hours earlier.
We tend to guard against loops that fail: retry limits, circuit breakers, give-up budgets. But a loop that keeps succeeding while making no progress gives you nothing to trip on. No exception, no non-zero exit code.
As an indie developer running several sites unattended, this is the worst kind of failure I deal with. A failure you can notice. A loop that quietly stops advancing while draining the budget costs far more.
This article lays out how to detect that "succeeds but never stops" loop structurally, and how to halt it safely while leaving diagnostics behind.
Error budgets cannot catch stagnation
First, why the usual guardrails slip past this.
Almost every safety device in a typical agent loop counts failures.
Guardrail
What it counts
Fires on a succeeding-but-stuck loop?
Retry limit
Consecutive exceptions
No (no exceptions thrown)
Circuit breaker
Error rate
No (error rate is 0)
Give-up budget
Self-repair attempts
No (no error to repair)
Turn limit
Total turns
Late (only after the whole budget is spent)
Only the turn limit eventually kicks in, but that is not detection of stagnation; it is exhaustion of the entire budget. With a 50-turn cap, you waste 49 turns before stopping. Far too late to call it cost control.
Stagnation has to be framed not as the absence of errors but as the absence of progress. And progress cannot be measured unless you define it explicitly. That is where the design starts.
"Stagnation" does not exist without a definition of progress
Stagnation means "no progress for a stretch of steps." Put the other way around: if you have not defined progress, you cannot define stagnation. The reason so many agent implementations cannot detect it is, I think, that they never had a progress oracle in the first place.
A progress oracle is a function that returns a roughly monotone number tied to the task goal. As long as the value keeps improving, the agent is moving forward.
from typing import Protocolclass ProgressOracle(Protocol): def score(self) -> float: """Higher means closer to the goal. Observes current state, no side effects.""" ...# Example: an oracle for a code-fixing taskclass TestPassProgress: def __init__(self, run_tests): self._run_tests = run_tests # () -> (passed:int, total:int) def score(self) -> float: passed, total = self._run_tests() if total == 0: return 0.0 # Pass ratio as the main component, with a strong bonus for all-green return passed / total + (1.0 if passed == total else 0.0)
The natural metric for progress changes by task.
Task type
Candidate progress oracle
Code fixing
Tests passing / drop in remaining linter violations
Research gathering
Sub-questions answered / new primary sources obtained
File organizing
Unsorted files remaining (a decrease is progress)
Article writing
Quality-gate items satisfied
Some tasks resist a progress oracle. For those, use behavioral novelty, described below, as a proxy for progress. A proxy is a weaker signal than the real thing, but it catches stagnation far sooner than nothing.
One thing I learned in practice: the progress oracle must be cheap and side-effect-free to call. If you run the full test suite every turn, the cost of detecting stagnation outgrows the work itself. Use a cached test result, or just peek at a counter for research. Keep the observation cost flat.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦A concrete implementation that catches loops where every tool call succeeds but nothing advances, using a progress oracle plus action fingerprints
✦Logic that separates three distinct stagnation patterns: exact repeats, oscillation (A-B-A-B), and a drop in novelty
✦How to wire a stagnation budget into your agent loop so it halts safely and leaves structured diagnostics behind
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
On top of detecting stalled progress, looking at what the agent keeps repeating lets you classify the stagnation. To do that, reduce each step to a normalized hash: the action fingerprint.
The key is to strip volatile elements before comparing. If you leave timestamps or request IDs in, every hash differs and you can never spot a repeat.
import hashlib, jsonfrom dataclasses import dataclass, fieldfrom collections import dequeVOLATILE_KEYS = {"timestamp", "request_id", "trace_id", "elapsed_ms", "nonce"}def _strip_volatile(obj): if isinstance(obj, dict): return {k: _strip_volatile(v) for k, v in sorted(obj.items()) if k not in VOLATILE_KEYS} if isinstance(obj, list): return [_strip_volatile(v) for v in obj] return objdef action_fingerprint(tool_name: str, tool_input: dict) -> str: canonical = json.dumps( {"tool": tool_name, "input": _strip_volatile(tool_input)}, sort_keys=True, ensure_ascii=False, ) return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]def result_digest(tool_result: dict) -> str: # Keep only the "meaningful change" in the result; hash a summary if the body is long salient = _strip_volatile(tool_result) blob = json.dumps(salient, sort_keys=True, ensure_ascii=False) return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
If steps with the same action_fingerprint line up over and over, the agent is literally repeating the same operation. If the result_digest matches too, that operation is changing nothing in the world. Harmless for reads, but for writes an unchanging result digest is a strong sign of "edits that do not take effect" being thrown repeatedly.
Separating exact repeats, oscillation, and dropping novelty
In practice, stagnation wears several faces. Each calls for a different response, so splitting the verdict up makes diagnosis easier.
@dataclassclass Step: fp: str # action_fingerprint digest: str # result_digest score: float # progress oracle value at this point@dataclassclass StagnationGuard: window: int = 8 # how many recent steps to inspect max_exact_repeats: int = 3 # allowed count of identical (action + result) novelty_floor: float = 0.34 # lower bound on the ratio of new fingerprints progress_patience: int = 6 # steps allowed without progress improving _hist: deque = field(default_factory=lambda: deque(maxlen=64)) _best_score: float = float("-inf") _since_improved: int = 0 def observe(self, step: Step) -> None: self._hist.append(step) if step.score > self._best_score + 1e-9: self._best_score = step.score self._since_improved = 0 else: self._since_improved += 1 def _exact_repeats(self) -> int: if not self._hist: return 0 last = self._hist[-1] key = (last.fp, last.digest) return sum(1 for s in self._hist if (s.fp, s.digest) == key) def _novelty_ratio(self) -> float: recent = list(self._hist)[-self.window:] if len(recent) < self.window: return 1.0 # do not judge while samples are insufficient uniq = len({s.fp for s in recent}) return uniq / len(recent) def _oscillating(self) -> bool: # Detect short-period cycles such as A-B-A-B recent = [s.fp for s in list(self._hist)[-self.window:]] for period in (2, 3): if len(recent) >= period * 2: tail = recent[-period * 2:] if tail[:period] == tail[period:]: return True return False def verdict(self) -> str | None: if self._exact_repeats() >= self.max_exact_repeats: return "exact_repeat" if self._oscillating(): return "oscillation" if self._novelty_ratio() < self.novelty_floor: return "low_novelty" if self._since_improved >= self.progress_patience: return "no_progress" return None
Here is what the four verdicts mean.
Verdict
Situation
Typical cause
exact_repeat
Identical action and result exceeded the allowed count
Re-throwing an edit that does not take, swallowing a fixed error
oscillation
Bouncing on a short 2-3 step cycle
Swinging between two fixes, caught between conflicting constraints
low_novelty
Few distinct actions within the window
Exploration narrowed to reworking the same few moves
no_progress
Progress oracle flat for the patience window
Acting in variety but not actually advancing
no_progress is the primary signal when you have an oracle. exact_repeat, oscillation, and low_novelty are structural proxies that work even on tasks where you cannot write one. Using both together visibly reduces missed detections.
When the stagnation budget runs out, halt and leave diagnostics
Detecting it is not enough; just throwing an exception leaves you stuck in unattended operation. Record the evidence in structured form so you can trace why you stopped, after the fact.
class StagnationHalt(Exception): def __init__(self, reason: str, evidence: dict): super().__init__(f"stagnation halt: {reason}") self.reason = reason self.evidence = evidencedef build_evidence(guard: StagnationGuard, reason: str) -> dict: recent = list(guard._hist)[-guard.window:] return { "reason": reason, "best_score": guard._best_score, "steps_since_improved": guard._since_improved, "novelty_ratio": round(guard._novelty_ratio(), 3), "recent_fingerprints": [s.fp for s in recent], "recent_scores": [round(s.score, 3) for s in recent], }
What I keep in mind here is treating a halt not as a "failure" but as an "observed conclusion." With an evidence log, a glance at the morning logs tells you whether it stopped on oscillation or on flat progress. Once you know the cause, you can move straight to tuning the window or progress_patience, or revisiting the constraints on the prompt side.
Wiring it into the agent loop
Now drop these parts into a real tool-execution loop. Observing right after tool execution and judging at the end of the turn is the easiest shape to work with.
def run_agent_loop(client, messages, tools, oracle, guard: StagnationGuard, max_turns: int = 50): for turn in range(max_turns): resp = client.run_turn(messages, tools) # model call (simplified) messages.append(resp.assistant_message) if not resp.tool_calls: # no tool call means completion return {"status": "done", "turns": turn + 1} for call in resp.tool_calls: result = call.execute() messages.append(result.as_tool_message()) guard.observe(Step( fp=action_fingerprint(call.name, call.input), digest=result_digest(result.payload), score=oracle.score(), )) reason = guard.verdict() if reason: evidence = build_evidence(guard, reason) raise StagnationHalt(reason, evidence) raise StagnationHalt("turn_limit", {"turns": max_turns})
The caller can handle a stagnation halt distinctly from an ordinary exception.
try: out = run_agent_loop(client, msgs, tools, oracle, StagnationGuard())except StagnationHalt as halt: log.warning("agent halted: %s", halt.evidence) # Unattended: leave the evidence and treat this job as a clean failure. # Record it as a target for revisiting window / patience / prompt constraints next run. record_failure(reason=halt.reason, evidence=halt.evidence)
After adopting this, the "spinning until morning" problem effectively vanished from my pipeline. Jobs that used to waste all 50 turns now trip the stagnation verdict somewhere around turn 6-10, and the tokens I had been burning for nothing dropped noticeably. Stopping is itself a failure, but the value of failing early is enormous in unattended operation.
Operational lessons the official guides do not cover
Running this for real surfaced a few things the design theory in the docs does not fill in.
Normalizing volatile fields needs more care than you would expect. Beyond timestamp and request_id, tool returns such as "elapsed time," "cursor position," and "temp file name" also change every time while carrying no meaning. Leave them in and novelty always looks high, so the low-novelty verdict never fires. Start with a conservative exclusion list and grow it as you watch the false-positive tendencies.
Tune the window and progress_patience to the "weight of one move" in the task. For tasks where one move shifts the state a lot (a large refactor), keep patience short; for tasks built from small incremental moves (research gathering), make it longer. Too short and you mistake legitimate trial-and-error for stagnation; too long and detection lags. For a new task type I start with patience 6 and window 8, then nudge after watching two or three runs.
Also, an unchanging result_digest on a write tool is the sign to suspect first. Repeated reads are often harmless, but throwing edits while the result digest never moves strongly suggests the agent is emitting "the same patch that does not take." Separate from the verdict, emitting a warning log on this single point alone speeds up root-cause isolation.
Finally, the stagnation guard is a safety net, not a mechanism that reduces stagnation itself. If you halt on oscillation often, the right move is to suspect an upstream design problem: the prompt imposing conflicting constraints at once. The guard tells you where it got stuck; why the design gets stuck is for a human to read.
A first step for your situation
If you are unsure where to start, use this as a guide.
Your situation
What to add first
Task with quantifiable progress (tests, checklists)
Start with a progress oracle + the no_progress verdict
Free-form task that is hard to quantify
Start with action fingerprints + exact_repeat / oscillation
Already running on a turn limit alone
Add the low-novelty verdict first to detect earlier
Heavy use of write tools
Add the result_digest-unchanged warning log first
The longer you run things unattended, the more "being able to detect failure" matters as much as "not failing." Define stagnation as the absence of progress, and back it up with structural proxy signals. That two-layer setup is what protects the budget that would otherwise drain in silence.
I hope this helps anyone wrestling with their own all-night pipelines.
Share
Thank You for Reading
Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.