The Day a Non-Responding MCP Call Swallowed an Entire Unattended Run — Owning the Stop With Your Own Deadline

When a remote MCP tool call stops responding, an unattended scheduled run just keeps waiting. Instead of leaving the cutoff entirely to the platform, here is how I designed my own deadline and a per-connector circuit breaker to own the stop — with working code.


In a scheduled run's log, a step that normally finishes in tens of seconds shows only "started," and then nothing after it. Re-run it and it passes fine. At first I let it go as a transient hiccup, but once it started landing on the same spot once or twice a week, I couldn't ignore it. The cause was a tool call sent to a remote MCP server that never returned a response — and our side kept waiting for it.

If you're working interactively, you step in the moment something feels stuck. But an unattended scheduled run has no one to step in. The run burns its time budget waiting for a response and ends without ever reaching the later steps. The worst part was that it wasn't even recorded as a failure.

The June 25 update fixed the behavior where a remote MCP tool call would wait indefinitely with no response; it now interrupts with an error after a set time instead of waiting forever. That's a real step forward. But what this fix gives you is a guarantee that "a failure will eventually come back," not a decision about what your unattended run does with that failure. This post is my record of redesigning exactly that part, based on how I actually run things as an indie developer.

"Hanging" and "a failure returns" are two completely different operational modes

A state that hangs with no response and a state that returns an explicit error differ in operational character far more than the code suggests.

A hang robs you of the chance to actively decide anything. No exception and no return value arrive, so neither your try/except nor any downstream branch fires. The process just waits, and until the run's outer timeout arrives and kills it from the outside, the log holds only "started." Looking back afterward, there's barely a trace of where, or for how many minutes, it was waiting.

When a failure returns, on the other hand, you can decide. Retry, give up on this connector for now, or push the later steps forward anyway. Now that the platform interrupts a non-response, you finally stand at the entrance to that decision.

That said, the grace period before the platform cuts off can be long for an unattended run. When the overall run budget is short, cutting the call off yourself, well inside the outer limit, leaves more time for later steps. Rather than delegating the cutoff wholesale to the platform, the starting point is to set your own deadline further inside.

Decide up front which layer stops the call

The most accident-prone part of timeout design is having several layers each hold a different limit, leaving it ambiguous who stops first. To keep this straight, narrow the layer that owns the stop down to one.

In my setup, I line them up from the inside out. First, each individual MCP tool call gets a call deadline, kept clearly shorter than the overall run budget. Next comes a step deadline for a step that bundles several calls. Finally there's a run deadline watching the whole scheduled run, and I make that one match the platform's limit.

The crucial part is keeping inner limits strictly shorter than outer ones. If the call deadline is longer than the step deadline, your own cutoff never fires and you end up swallowed by the outermost layer after all. When unsure of the numbers, work backward from what fraction of the outer budget you can spend on a single call. To leave time for the remaining steps, holding one call to at most roughly a third of the total budget felt realistic in my experience.
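The layering above can be sketched as a small config object that refuses to be built when an inner limit is not strictly shorter than the outer one. This is a minimal sketch, not the code I run: the names Deadlines and derive_deadlines are hypothetical, the one-third call fraction is the rule of thumb from this section, and the one-half step fraction is purely an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deadlines:
    """Nested limits in seconds, innermost first (hypothetical helper)."""

    call_s: float
    step_s: float
    run_s: float

    def __post_init__(self) -> None:
        # An inner limit that is not strictly shorter than the outer one
        # can never fire first, so reject that configuration up front.
        if not (0 < self.call_s < self.step_s < self.run_s):
            raise ValueError(
                "expected 0 < call < step < run, got "
                f"{self.call_s} / {self.step_s} / {self.run_s}"
            )


def derive_deadlines(
    run_budget_s: float,
    call_fraction: float = 1 / 3,  # the "roughly a third" rule of thumb
    step_fraction: float = 1 / 2,  # illustrative assumption, tune per step
) -> Deadlines:
    """Work backward from the outer budget to the inner limits."""
    return Deadlines(
        call_s=run_budget_s * call_fraction,
        step_s=run_budget_s * step_fraction,
        run_s=run_budget_s,
    )


limits = derive_deadlines(600.0)  # e.g. a 10-minute run budget
```

Failing at construction time is a deliberate choice here: a misordered pair of limits is a configuration mistake, and it is cheaper to find at startup than in the middle of a scheduled run.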

An asyncio wrapper that wraps the call in your own deadline

The implementation starts by wrapping the MCP tool call in your own deadline and sorting the result into three buckets: it hung and you cut it off, the server returned an explicit error, or a response came back normally. This distinction is what supports the retry decision later.

import asyncio
import enum
from dataclasses import dataclass
from typing import Any, Awaitable, Callable
 
 
class Outcome(enum.Enum):
    OK = "ok"
    DEADLINE = "deadline"   # cut off by our own deadline (hang-equivalent)
    ERROR = "error"         # server returned an explicit failure
 
 
@dataclass
class CallResult:
    outcome: Outcome
    value: Any = None
    detail: str = ""
    elapsed: float = 0.0
 
 
async def call_with_deadline(
    fn: Callable[[], Awaitable[Any]],
    deadline_s: float,
) -> CallResult:
    """Wrap an MCP tool call in our own deadline, returning a 3-way result.
 
    Keep deadline_s inside the platform's default cutoff.
    """
    loop = asyncio.get_running_loop()  # preferred inside a coroutine
    start = loop.time()
    try:
        value = await asyncio.wait_for(fn(), timeout=deadline_s)
        return CallResult(Outcome.OK, value=value, elapsed=loop.time() - start)
    except asyncio.TimeoutError:
        # reached our own deadline with no response
        return CallResult(
            Outcome.DEADLINE,
            detail=f"no response within {deadline_s}s",
            elapsed=loop.time() - start,
        )
    except Exception as e:  # noqa: BLE001 — caller decides on the failure kind
        return CallResult(
            Outcome.ERROR,
            detail=f"{type(e).__name__}: {e}",
            elapsed=loop.time() - start,
        )

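Here asyncio.wait_for cancels the awaited coroutine by raising CancelledError inside it. That only ends the wait on our side; nothing guarantees that the remote server stops processing. wait_for also waits until the cancellation has actually completed, so the total wait can slightly exceed the deadline if the coroutine is slow to unwind. That's exactly why it matters not to fire the next call at the same connector right after cutting off. Fire again and it tends to collide with the prior call that hasn't fully stopped, leaving the next one non-responsive as well.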
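To make the cancellation behavior concrete, here is a small standalone demonstration. The coroutine slow_tool_call is a hypothetical stand-in for a remote MCP call that stays silent; it only shows what happens locally when the deadline fires and says nothing about what a real server does.

```python
import asyncio
from typing import List


async def slow_tool_call(events: List[str]) -> str:
    """Stand-in for a remote call that never answers in time (hypothetical)."""
    try:
        await asyncio.sleep(10)  # pretend the server stays silent
        return "response"
    except asyncio.CancelledError:
        # wait_for cancelled us. Only the local wait ends here; a real
        # remote server may still be working on the request.
        events.append("local wait cancelled")
        raise  # re-raise so the cancellation can complete


async def demo() -> List[str]:
    events: List[str] = []
    try:
        await asyncio.wait_for(slow_tool_call(events), timeout=0.05)
    except asyncio.TimeoutError:
        events.append("deadline reached")
    return events


events = asyncio.run(demo())
print(events)  # ['local wait cancelled', 'deadline reached']
```

The order of the two entries is the point: the coroutine sees the cancellation first, and only after it has unwound does the caller receive the timeout.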

A hang and a normal error have different premises for retrying

I separated DEADLINE from ERROR because the meaning of a retry differs.

An ERROR the server returned usually leaves you information: an expired credential, an invalid argument, a transient upstream fault. For some kinds a retry is meaningful; for others, throwing it again gets the same result every time.

A DEADLINE, by contrast, is a cutoff made without knowing the other side's current state. A retry here can pile more load onto a server that is already struggling. So I keep DEADLINE retries to wider intervals and a more modest count. And I mix jitter into each retry so that several scheduled runs don't re-enter at the same instant and build a wave.

import random
 
 
async def call_with_retry(
    fn: Callable[[], Awaitable[Any]],
    deadline_s: float,
    max_attempts: int = 3,
    base_backoff_s: float = 2.0,
) -> CallResult:
    """Retry according to the 3-way result. Use only for idempotent calls."""
    last = CallResult(Outcome.ERROR, detail="not attempted")
    for attempt in range(1, max_attempts + 1):
        last = await call_with_deadline(fn, deadline_s)
        if last.outcome is Outcome.OK:
            return last
        if attempt == max_attempts:
            break
        # DEADLINE: other side's state is unknown, so widen the interval
        factor = 2.0 if last.outcome is Outcome.DEADLINE else 1.0
        backoff = base_backoff_s * (2 ** (attempt - 1)) * factor
        backoff += random.uniform(0, backoff)  # additive jitter: sleep lands in [backoff, 2 * backoff]
        await asyncio.sleep(backoff)
    return last
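Before relying on the retry loop above, it helps to know how much of the run budget it can consume in the worst case. The helper below is a hypothetical sketch that mirrors the loop's arithmetic, assuming every attempt runs all the way to the deadline and every jitter draw comes out at its maximum.

```python
def worst_case_seconds(
    deadline_s: float,
    max_attempts: int = 3,
    base_backoff_s: float = 2.0,
    deadline_factor: float = 2.0,  # the widening applied after a DEADLINE
) -> float:
    """Upper bound on the time one retried call can take if every attempt hangs."""
    total = max_attempts * deadline_s
    # The loop sleeps between attempts, so there is no sleep after the last one.
    for attempt in range(1, max_attempts):
        backoff = base_backoff_s * (2 ** (attempt - 1)) * deadline_factor
        total += 2 * backoff  # jitter adds at most another `backoff`
    return total


print(worst_case_seconds(20.0))  # 84.0
```

With a 20-second deadline and the defaults above, the bound is 3 × 20 seconds of waiting plus at most 8 + 16 seconds of backoff, so 84 seconds. That whole figure, not the single-call deadline, is what has to fit inside the step deadline.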

It's worth stressing that retries are limited to idempotent calls. A call you cut off on DEADLINE may well have completed on the other side. Retrying a write tool unconditionally risks applying it twice. For writes, attach an idempotency key or leave them out of the retry set entirely. That line is continuous with the "what to allow" design I covered in MCP policy enforcement and allowlist design for unattended agents.
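One way to attach an idempotency key is to mint it once, outside the retry loop, so every attempt carries the same key. The sketch below assumes a server that deduplicates on an argument named idempotency_key. That argument name, FakeWriteServer, and with_idempotency_key are all hypothetical, and whether a real MCP tool honors such a key depends entirely on the server.

```python
import asyncio
import uuid
from typing import Any, Awaitable, Callable, Dict, Tuple


class FakeWriteServer:
    """In-memory stand-in for a server that deduplicates by key (hypothetical)."""

    def __init__(self) -> None:
        self.applied: Dict[str, Dict[str, Any]] = {}
        self.calls = 0

    async def write(self, args: Dict[str, Any]) -> str:
        self.calls += 1
        key = args["idempotency_key"]
        if key in self.applied:
            return "duplicate ignored"
        self.applied[key] = args
        return "applied"


def with_idempotency_key(
    send: Callable[[Dict[str, Any]], Awaitable[Any]],
    args: Dict[str, Any],
) -> Callable[[], Awaitable[Any]]:
    """Build a zero-argument fn whose every invocation reuses one key."""
    keyed = {**args, "idempotency_key": str(uuid.uuid4())}  # minted once

    async def fn() -> Any:
        return await send(keyed)

    return fn


async def demo() -> Tuple[str, str, int, int]:
    server = FakeWriteServer()
    fn = with_idempotency_key(server.write, {"title": "weekly report"})
    first = await fn()
    second = await fn()  # simulates a retry after a DEADLINE cutoff
    return first, second, server.calls, len(server.applied)


result = asyncio.run(demo())
print(result)  # ('applied', 'duplicate ignored', 2, 1)
```

The returned fn has the same zero-argument shape that call_with_retry expects, so the key stays fixed across attempts without the retry loop knowing anything about it.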

Don't let a broken connector drag the rest of the run with it

Even once you can bound a single call, if the same connector returns DEADLINE repeatedly, every added retry eats more of the run's budget. So insert a per-connector circuit breaker. After a set number of consecutive failures, the breaker opens for that connector, and calls to it are skipped without being attempted until the cooldown has passed. In the code below, one failure means a call that still failed after its retries ran out.

import time
 
 
class ConnectorBreaker:
    """A minimal per-connector circuit breaker.
 
    Open when consecutive failures hit the threshold; skip until cooldown passes.
    """
 
    def __init__(self, fail_threshold: int = 2, cooldown_s: float = 120.0):
        self.fail_threshold = fail_threshold
        self.cooldown_s = cooldown_s
        self._fails = 0
        self._opened_at = 0.0
 
    def is_open(self) -> bool:
        if self._fails < self.fail_threshold:
            return False
        if time.monotonic() - self._opened_at >= self.cooldown_s:
            self._fails = self.fail_threshold - 1  # half-open: one more failure re-opens
            return False
        return True
 
    def record(self, result: CallResult) -> None:
        if result.outcome is Outcome.OK:
            self._fails = 0
            return
        self._fails += 1
        if self._fails == self.fail_threshold:
            self._opened_at = time.monotonic()
 
 
async def guarded_call(
    breaker: ConnectorBreaker,
    fn: Callable[[], Awaitable[Any]],
    deadline_s: float,
) -> CallResult:
    if breaker.is_open():
        return CallResult(Outcome.ERROR, detail="breaker open: skipped")
    result = await call_with_retry(fn, deadline_s)
    breaker.record(result)
    return result

Whether an open breaker makes the step a whole-run failure or just lets later steps proceed depends on the step's nature. In my runs, steps like fetching reference data — where the rest still holds even if it's missing — get skipped and continue, while steps tied to finalizing output give up for that cycle. Either way, I log the fact that the breaker opened. Skip it silently and you can't reach the cause next time the same thing happens.
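The skip-or-give-up decision can be declared per step instead of being scattered through the run. This is a minimal sketch under my own naming: OnOpen, Step, CycleAborted, and handle_open_breaker are hypothetical, and the logger name is arbitrary.

```python
import enum
import logging
from dataclasses import dataclass

logger = logging.getLogger("scheduled_run")


class OnOpen(enum.Enum):
    SKIP_AND_CONTINUE = "skip"  # e.g. optional reference data
    ABORT_CYCLE = "abort"       # e.g. steps that finalize output


@dataclass(frozen=True)
class Step:
    name: str
    connector: str
    on_open: OnOpen


class CycleAborted(Exception):
    """Raised when a required step cannot run in this cycle."""


def handle_open_breaker(step: Step) -> bool:
    """Log the open breaker, then apply the step's policy.

    Returns True when the run should move on to later steps;
    raises CycleAborted when the step is required.
    """
    # Always record the skip so the cause can be traced next time.
    logger.warning(
        "breaker open: connector=%s step=%s policy=%s",
        step.connector, step.name, step.on_open.value,
    )
    if step.on_open is OnOpen.ABORT_CYCLE:
        raise CycleAborted(f"{step.name}: connector {step.connector} unavailable")
    return True
```

Logging before the policy branch means both outcomes leave the same trace, which keeps skipped steps from disappearing silently.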

Leave the fact that you waited in the log

What made hangs nasty was the thin trace. So when I switched to cutting off on my own, I made sure to also write out "how many seconds it waited and why it cut off."

For each call, I flatten the result bucket, elapsed seconds, attempt count, and breaker state into one line. Whether DEADLINE skews toward a particular connector, or toward a time of day, only becomes visible once these lines accumulate. Back when things were killed while still frozen, I had no material even to ask the question.
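As a sketch of what one such line could look like, the helper below flattens the four fields into a single JSON line. The JSON format, the field names, and format_call_log itself are assumptions for illustration; the attempt count is assumed to be tracked by the caller, since CallResult as defined earlier doesn't carry it.

```python
import json


def format_call_log(
    connector: str,
    outcome: str,        # "ok", "deadline", or "error"
    elapsed_s: float,
    attempts: int,
    breaker_open: bool,
    detail: str = "",
) -> str:
    """Flatten one call's result into a single greppable line."""
    record = {
        "connector": connector,
        "outcome": outcome,
        "elapsed_s": round(elapsed_s, 3),
        "attempts": attempts,
        "breaker_open": breaker_open,
        "detail": detail,
    }
    return json.dumps(record, sort_keys=True)


line = format_call_log(
    "docs-search", "deadline", 20.0042, 3, True, "no response within 20.0s"
)
print(line)
```

Keeping every call on one line with stable keys is what makes it possible to count DEADLINE outcomes per connector or per hour later with ordinary text tools.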

I treat the question of log granularity as separate from the interactive triage procedure. The steps a person follows when chasing a connector that suddenly stopped working are covered in the diagnosis guide for when an MCP server suddenly stops in Claude Code; here I narrow the scope to the minimum an unattended run needs to explain itself afterward.

Where it tends to trip you up

Set your own deadline longer than the outer one and the cutoff never fires, leaving you swallowed by the outermost layer after all. Keeping the inner one shorter is the prerequisite.

Re-firing the same connector right after a DEADLINE cutoff is also a source of accidents. If the prior call is still alive on the other side, it collides and the next one tends to go non-responsive too. Insert a backoff before retrying and break the wave with jitter.

Unconditionally retrying write calls is dangerous. Because DEADLINE includes the possibility of completion on the other side, attach an idempotency key or leave writes out of the retry set.

Set the circuit breaker's cooldown longer than the run budget and, once open, it never tries again in this cycle. That's fine if intended, but it is often an accident, so choose the cooldown relative to the run's length.

The next step

First, check one value: the cutoff you rely on at the outermost layer, whether that is the platform limit or the run timeout. Then add just one deadline, clearly shorter than that, to your innermost MCP call. Once you've slotted call_with_deadline into one place and the result lands in the log as three buckets, the step that used to vanish while frozen finally starts to tell you "how many seconds it waited before being cut off." From there, you have solid ground for adding retries and the breaker.

How you give up on a counterpart that won't respond matters more the more unattended the machinery is. I'm still mid-tuning at Dolice, but just deciding not to leave the stop to the platform made my scheduled-run logs noticeably quieter. I hope it helps anyone building unattended operations the same way.
