⬡ API & SDK/2026-06-28Advanced

Every Tool Call Succeeds, Yet Nothing Moves Forward: Detecting Stagnation in Unattended Agents

No errors, yet the agent keeps replaying the same move while your budget quietly drains. Here is how to detect a success-but-no-progress loop using a progress oracle and action fingerprints, with a working Python implementation that halts safely.

Claude Agent SDK¹¹ Agent design Unattended operation Reliability³ Cost optimization²

✦ Premium Article

The article-generation pipeline I run overnight was still going in the morning.

I opened the logs. Not a single tool call had failed. read_file returned, edit_file returned, run_tests returned, then back to read_file. Everything was a clean 200, not one error line. And yet the output was barely different from six hours earlier.

We tend to guard against loops that fail: retry limits, circuit breakers, give-up budgets. But a loop that keeps succeeding while making no progress gives you nothing to trip on. No exception, no non-zero exit code.

As an indie developer running several sites unattended, this is the worst kind of failure I deal with. A failure you can notice. A loop that quietly stops advancing while draining the budget costs far more.

This article lays out how to detect that "succeeds but never stops" loop structurally, and how to halt it safely while leaving diagnostics behind.

Error budgets cannot catch stagnation

First, why the usual guardrails slip past this.

Almost every safety device in a typical agent loop counts failures.

Guardrail	What it counts	Fires on a succeeding-but-stuck loop?
Retry limit	Consecutive exceptions	No (no exceptions thrown)
Circuit breaker	Error rate	No (error rate is 0)
Give-up budget	Self-repair attempts	No (no error to repair)
Turn limit	Total turns	Late (only after the whole budget is spent)

Only the turn limit eventually kicks in, but that is not detection of stagnation; it is exhaustion of the entire budget. With a 50-turn cap, you waste 49 turns before stopping. Far too late to call it cost control.

Stagnation has to be framed not as the absence of errors but as the absence of progress. And progress cannot be measured unless you define it explicitly. That is where the design starts.

"Stagnation" does not exist without a definition of progress

Stagnation means "no progress for a stretch of steps." Put the other way around: if you have not defined progress, you cannot define stagnation. The reason so many agent implementations cannot detect it is, I think, that they never had a progress oracle in the first place.

A progress oracle is a function that returns a roughly monotone number tied to the task goal. As long as the value keeps improving, the agent is moving forward.

from typing import Protocol
 
class ProgressOracle(Protocol):
    def score(self) -> float:
        """Higher means closer to the goal. Observes current state, no side effects."""
        ...
 
# Example: an oracle for a code-fixing task
class TestPassProgress:
    def __init__(self, run_tests):
        self._run_tests = run_tests  # () -> (passed:int, total:int)
 
    def score(self) -> float:
        passed, total = self._run_tests()
        if total == 0:
            return 0.0
        # Pass ratio as the main component, with a strong bonus for all-green
        return passed / total + (1.0 if passed == total else 0.0)

The natural metric for progress changes by task.

Task type	Candidate progress oracle
Code fixing	Tests passing / drop in remaining linter violations
Research gathering	Sub-questions answered / new primary sources obtained
File organizing	Unsorted files remaining (a decrease is progress)
Article writing	Quality-gate items satisfied

Some tasks resist a progress oracle. For those, use behavioral novelty, described below, as a proxy for progress. A proxy is a weaker signal than the real thing, but it catches stagnation far sooner than nothing.

One thing I learned in practice: the progress oracle must be cheap and side-effect-free to call. If you run the full test suite every turn, the cost of detecting stagnation outgrows the work itself. Use a cached test result, or just peek at a counter for research. Keep the observation cost flat.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦A concrete implementation that catches loops where every tool call succeeds but nothing advances, using a progress oracle plus action fingerprints

✦Logic that separates three distinct stagnation patterns: exact repeats, oscillation (A-B-A-B), and a drop in novelty

✦How to wire a stagnation budget into your agent loop so it halts safely and leaves structured diagnostics behind

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

Action fingerprints capture "the same move"

On top of detecting stalled progress, looking at what the agent keeps repeating lets you classify the stagnation. To do that, reduce each step to a normalized hash: the action fingerprint.

The key is to strip volatile elements before comparing. If you leave timestamps or request IDs in, every hash differs and you can never spot a repeat.

import hashlib, json
from dataclasses import dataclass, field
from collections import deque
 
VOLATILE_KEYS = {"timestamp", "request_id", "trace_id", "elapsed_ms", "nonce"}
 
def _strip_volatile(obj):
    if isinstance(obj, dict):
        return {k: _strip_volatile(v) for k, v in sorted(obj.items())
                if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [_strip_volatile(v) for v in obj]
    return obj
 
def action_fingerprint(tool_name: str, tool_input: dict) -> str:
    canonical = json.dumps(
        {"tool": tool_name, "input": _strip_volatile(tool_input)},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
 
def result_digest(tool_result: dict) -> str:
    # Keep only the "meaningful change" in the result; hash a summary if the body is long
    salient = _strip_volatile(tool_result)
    blob = json.dumps(salient, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]

If steps with the same action_fingerprint line up over and over, the agent is literally repeating the same operation. If the result_digest matches too, that operation is changing nothing in the world. Harmless for reads, but for writes an unchanging result digest is a strong sign of "edits that do not take effect" being thrown repeatedly.

Separating exact repeats, oscillation, and dropping novelty

In practice, stagnation wears several faces. Each calls for a different response, so splitting the verdict up makes diagnosis easier.

@dataclass
class Step:
    fp: str          # action_fingerprint
    digest: str      # result_digest
    score: float     # progress oracle value at this point
 
@dataclass
class StagnationGuard:
    window: int = 8                 # how many recent steps to inspect
    max_exact_repeats: int = 3      # allowed count of identical (action + result)
    novelty_floor: float = 0.34     # lower bound on the ratio of new fingerprints
    progress_patience: int = 6      # steps allowed without progress improving
    _hist: deque = field(default_factory=lambda: deque(maxlen=64))
    _best_score: float = float("-inf")
    _since_improved: int = 0
 
    def observe(self, step: Step) -> None:
        self._hist.append(step)
        if step.score > self._best_score + 1e-9:
            self._best_score = step.score
            self._since_improved = 0
        else:
            self._since_improved += 1
 
    def _exact_repeats(self) -> int:
        if not self._hist:
            return 0
        last = self._hist[-1]
        key = (last.fp, last.digest)
        return sum(1 for s in self._hist if (s.fp, s.digest) == key)
 
    def _novelty_ratio(self) -> float:
        recent = list(self._hist)[-self.window:]
        if len(recent) < self.window:
            return 1.0  # do not judge while samples are insufficient
        uniq = len({s.fp for s in recent})
        return uniq / len(recent)
 
    def _oscillating(self) -> bool:
        # Detect short-period cycles such as A-B-A-B
        recent = [s.fp for s in list(self._hist)[-self.window:]]
        for period in (2, 3):
            if len(recent) >= period * 2:
                tail = recent[-period * 2:]
                if tail[:period] == tail[period:]:
                    return True
        return False
 
    def verdict(self) -> str | None:
        if self._exact_repeats() >= self.max_exact_repeats:
            return "exact_repeat"
        if self._oscillating():
            return "oscillation"
        if self._novelty_ratio() < self.novelty_floor:
            return "low_novelty"
        if self._since_improved >= self.progress_patience:
            return "no_progress"
        return None

Here is what the four verdicts mean.

Verdict	Situation	Typical cause
exact_repeat	Identical action and result exceeded the allowed count	Re-throwing an edit that does not take, swallowing a fixed error
oscillation	Bouncing on a short 2-3 step cycle	Swinging between two fixes, caught between conflicting constraints
low_novelty	Few distinct actions within the window	Exploration narrowed to reworking the same few moves
no_progress	Progress oracle flat for the patience window	Acting in variety but not actually advancing

no_progress is the primary signal when you have an oracle. exact_repeat, oscillation, and low_novelty are structural proxies that work even on tasks where you cannot write one. Using both together visibly reduces missed detections.

When the stagnation budget runs out, halt and leave diagnostics

Detecting it is not enough; just throwing an exception leaves you stuck in unattended operation. Record the evidence in structured form so you can trace why you stopped, after the fact.

class StagnationHalt(Exception):
    def __init__(self, reason: str, evidence: dict):
        super().__init__(f"stagnation halt: {reason}")
        self.reason = reason
        self.evidence = evidence
 
def build_evidence(guard: StagnationGuard, reason: str) -> dict:
    recent = list(guard._hist)[-guard.window:]
    return {
        "reason": reason,
        "best_score": guard._best_score,
        "steps_since_improved": guard._since_improved,
        "novelty_ratio": round(guard._novelty_ratio(), 3),
        "recent_fingerprints": [s.fp for s in recent],
        "recent_scores": [round(s.score, 3) for s in recent],
    }

What I keep in mind here is treating a halt not as a "failure" but as an "observed conclusion." With an evidence log, a glance at the morning logs tells you whether it stopped on oscillation or on flat progress. Once you know the cause, you can move straight to tuning the window or progress_patience, or revisiting the constraints on the prompt side.

Wiring it into the agent loop

Now drop these parts into a real tool-execution loop. Observing right after tool execution and judging at the end of the turn is the easiest shape to work with.

def run_agent_loop(client, messages, tools, oracle,
                   guard: StagnationGuard, max_turns: int = 50):
    for turn in range(max_turns):
        resp = client.run_turn(messages, tools)   # model call (simplified)
        messages.append(resp.assistant_message)
 
        if not resp.tool_calls:                    # no tool call means completion
            return {"status": "done", "turns": turn + 1}
 
        for call in resp.tool_calls:
            result = call.execute()
            messages.append(result.as_tool_message())
            guard.observe(Step(
                fp=action_fingerprint(call.name, call.input),
                digest=result_digest(result.payload),
                score=oracle.score(),
            ))
 
        reason = guard.verdict()
        if reason:
            evidence = build_evidence(guard, reason)
            raise StagnationHalt(reason, evidence)
 
    raise StagnationHalt("turn_limit", {"turns": max_turns})

The caller can handle a stagnation halt distinctly from an ordinary exception.

try:
    out = run_agent_loop(client, msgs, tools, oracle, StagnationGuard())
except StagnationHalt as halt:
    log.warning("agent halted: %s", halt.evidence)
    # Unattended: leave the evidence and treat this job as a clean failure.
    # Record it as a target for revisiting window / patience / prompt constraints next run.
    record_failure(reason=halt.reason, evidence=halt.evidence)

After adopting this, the "spinning until morning" problem effectively vanished from my pipeline. Jobs that used to waste all 50 turns now trip the stagnation verdict somewhere around turn 6-10, and the tokens I had been burning for nothing dropped noticeably. Stopping is itself a failure, but the value of failing early is enormous in unattended operation.

Operational lessons the official guides do not cover

Running this for real surfaced a few things the design theory in the docs does not fill in.

Normalizing volatile fields needs more care than you would expect. Beyond timestamp and request_id, tool returns such as "elapsed time," "cursor position," and "temp file name" also change every time while carrying no meaning. Leave them in and novelty always looks high, so the low-novelty verdict never fires. Start with a conservative exclusion list and grow it as you watch the false-positive tendencies.

Tune the window and progress_patience to the "weight of one move" in the task. For tasks where one move shifts the state a lot (a large refactor), keep patience short; for tasks built from small incremental moves (research gathering), make it longer. Too short and you mistake legitimate trial-and-error for stagnation; too long and detection lags. For a new task type I start with patience 6 and window 8, then nudge after watching two or three runs.

Also, an unchanging result_digest on a write tool is the sign to suspect first. Repeated reads are often harmless, but throwing edits while the result digest never moves strongly suggests the agent is emitting "the same patch that does not take." Separate from the verdict, emitting a warning log on this single point alone speeds up root-cause isolation.

Finally, the stagnation guard is a safety net, not a mechanism that reduces stagnation itself. If you halt on oscillation often, the right move is to suspect an upstream design problem: the prompt imposing conflicting constraints at once. The guard tells you where it got stuck; why the design gets stuck is for a human to read.

A first step for your situation

If you are unsure where to start, use this as a guide.

Your situation	What to add first
Task with quantifiable progress (tests, checklists)	Start with a progress oracle + the no_progress verdict
Free-form task that is hard to quantify	Start with action fingerprints + exact_repeat / oscillation
Already running on a turn limit alone	Add the low-novelty verdict first to detect earlier
Heavy use of write tools	Add the result_digest-unchanged warning log first

The longer you run things unattended, the more "being able to detect failure" matters as much as "not failing." Define stagnation as the absence of progress, and back it up with structural proxy signals. That two-layer setup is what protects the budget that would otherwise drain in silence.

I hope this helps anyone wrestling with their own all-night pipelines.

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.