API & SDK · 2026-06-28 · Advanced

Did That Post Actually Go Through? Safely Retrying an Interrupted MCP Write Without Double-Executing

When an MCP write tool call is interrupted by a dropped connection, you can't tell whether the server ran it. Here's why naive retries cause double-execution, and a working wrapper that uses idempotency keys and a reconcile read to retry safely — with examples from an unattended pipeline.



One of my unattended publishing jobs once got its connection cut mid-request while posting to X. All I got back was a timeout error, with no way to know whether the post had landed. The log recorded a "failure" — yet a few minutes later the same post was sitting on the timeline. The server had succeeded; only the result never reached me.

If I had naively decided "it failed, so retry," two identical posts would have ended up side by side. For an indie developer automating announcements across several sites, that is an incident that has nothing to do with content quality. To a reader, the same post appearing twice just signals sloppy operations. If you are going to run write-type tool calls unattended, you have to design for this "did it go through?" state head-on.

"Failed" and "uncertain" are not the same thing

There are two kinds of errors. One is a clear rejection — the server says "I will not accept your request." The other is unknown — the connection dropped before any response came back. The first is safe to retry, because you know the server did nothing.

The troublesome one is the second. The request may have reached the server and been processed, or it may have been cut off before arrival. From your side, you cannot tell. This is the well-known problem in distributed systems: the sender can never be certain the receiver executed. Timeouts, connection resets, and mid-stream disconnects all belong in this "uncertain" bucket.

The June 27, 2026 update improved MCP resilience in Claude Code so that partial responses are preserved even when a stream is cut mid-flight. The receiving side is genuinely more robust now. Even so, the uncertainty that remains the moment a write tool call is interrupted — "did the server execute it?" — is not something a more resilient receiver alone can remove. That part lives in your application.

A common implementation trap is error classification. An HTTP 5xx can mean "the server failed to process" or "it processed but only the response was lost," so pushing it straight to failed is dangerous. I treat every ambiguous error as uncertain. Because uncertain operations are settled by reconciliation during recovery, over-classifying as uncertain never causes double-execution, whereas mislabeling a truly uncertain call as failed leads straight to a blind resend and a duplicate. When in doubt, fall to the safe side: uncertain.
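
As a rough illustration of that rule, here is a minimal classifier sketch. The names classify and CLEAR_REJECTIONS are hypothetical, and the set of status codes counted as clear rejections is an assumption you should adjust to what your server actually guarantees. How an HTTP status reaches your code depends on the MCP client, so it is passed in as a plain integer here.

```python
from typing import Optional

COMMITTED, FAILED, UNCERTAIN = "committed", "failed", "uncertain"

# Assumption: only these codes mean the server refused before acting.
# Codes such as 408 and 409 are left out on purpose and fall through to uncertain.
CLEAR_REJECTIONS = {400, 401, 403, 404, 422}

def classify(exc: Optional[BaseException],
             http_status: Optional[int] = None) -> str:
    """Map the outcome of a write call to one of the three ledger states."""
    if exc is None and http_status is not None and 200 <= http_status < 300:
        return COMMITTED
    if http_status in CLEAR_REJECTIONS:
        return FAILED
    # Timeouts, resets, 5xx, and anything unrecognized: execution is unknown
    return UNCERTAIN

print(classify(None, 200))          # committed
print(classify(None, 403))          # failed
print(classify(None, 503))          # uncertain
print(classify(TimeoutError()))     # uncertain
```

The default branch is the important part: anything the function does not positively recognize lands in uncertain, never in failed.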

Why naive retries cause double-execution

When people write retry logic, they usually think in two states: success or failure, and retry on failure. That design is the breeding ground for double-execution.

If you collapse "uncertain" into "failed," you re-run even the cases that actually succeeded on the server. A post becomes a double post, a charge becomes a double charge, an email becomes a second copy. That is exactly the trap I fell into first: I wrapped my retry logic in a sloppy except Exception: and unconditionally resent inside it. It never reproduced in testing, and the first double post showed up on a night when the production connection got flaky.

The correct approach splits state into three: committed, failed, and uncertain. Only failed is safe to retry directly. For uncertain, you always insert one extra step — "check before you redo."
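
The difference is easy to reproduce with a toy. The sketch below uses a made-up FlakyServer that performs the write and then loses the first response, and compares a two-state retry with one that checks before redoing. Everything here is illustrative, and the check is done inline only to keep the example short.

```python
class FlakyServer:
    """Toy server: performs the write, then loses the first response."""
    def __init__(self) -> None:
        self.posts = []
        self._drop_next = True

    def post(self, text: str) -> str:
        self.posts.append(text)                  # the write succeeds on the server
        if self._drop_next:
            self._drop_next = False
            raise TimeoutError("response lost")  # but the client never hears back
        return "ok"

def naive_send(server: FlakyServer, text: str) -> None:
    try:
        server.post(text)
    except Exception:
        server.post(text)                        # "it failed, so retry"

def careful_send(server: FlakyServer, text: str) -> None:
    try:
        server.post(text)
    except TimeoutError:
        if text not in server.posts:             # uncertain: check before redoing
            server.post(text)

naive, careful = FlakyServer(), FlakyServer()
naive_send(naive, "release v1.2")
careful_send(careful, "release v1.2")
print(len(naive.posts), len(careful.posts))      # prints: 2 1
```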

Two pillars: idempotency keys and reconcile reads

There are two ways to resolve uncertain safely.

Idempotency key: attach a unique key to each request and have the server guarantee "the same key executes only once." Stripe's Idempotency-Key header is the canonical example. If the server supports it, a request resent with the same key takes effect only once, no matter how many times you send it.
Reconcile read: before retrying, ask the server whether the effect already exists. For a post, search for "a post containing my correlation token" — skip if found, resend if not. This does not depend on server-side idempotency.
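
To make the first pillar concrete, here is what "the same key executes only once" looks like from the server's side. DedupingServer is a toy written for this article, not any real service's implementation; real servers also have to persist the key table and decide how long keys stay valid.

```python
class DedupingServer:
    """Toy server that remembers the result of each idempotency key."""
    def __init__(self) -> None:
        self.charges = []
        self._seen = {}

    def charge(self, amount: int, idempotency_key: str) -> dict:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]   # replay the stored result
        self.charges.append(amount)
        result = {"charge_no": len(self.charges), "amount": amount}
        self._seen[idempotency_key] = result
        return result

server = DedupingServer()
first = server.charge(500, idempotency_key="op-123")
again = server.charge(500, idempotency_key="op-123")   # resend after a timeout
print(first == again, len(server.charges))             # prints: True 1
```

The client never needs to know whether the first attempt landed. It resends with the same key and gets the original result back.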

Idempotency keys are ideal, but most real MCP tools still don't accept one. So you build a two-tier defense: attach a key where the tool supports it, and protect the rest with reconcile reads.

Attach an idempotency key where the tool supports it

First, generate a stable operation_id per logical operation and record it to a local ledger before the call. This ordering is critical: record before, not after. If you record after the call, a crash mid-call leaves an operation that was "sent but never recorded," and you lose the ability to track it during recovery.

import json, os, time, uuid
 
LEDGER = os.path.expanduser("~/.cache/mcp_ledger.jsonl")
 
def _append(rec: dict) -> None:
    # Persist with fsync — crash recovery depends on durability
    os.makedirs(os.path.dirname(LEDGER), exist_ok=True)
    with open(LEDGER, "a", encoding="utf-8") as f:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
        f.flush()
        os.fsync(f.fileno())
 
def begin_operation(action: str, args: dict) -> str:
    op_id = str(uuid.uuid4())
    # Write the "pending" record before the call
    _append({"op_id": op_id, "action": action, "args": args,
             "status": "pending", "ts": time.time()})
    return op_id
 
def mark(op_id: str, status: str) -> None:
    _append({"op_id": op_id, "status": status, "ts": time.time()})

On the call side, pass operation_id as the idempotency key in the tool arguments. If the MCP server interprets the key, this alone makes resends safe.

TRANSIENT = (TimeoutError, ConnectionError, ConnectionResetError)
 
def call_with_idempotency(client, action: str, args: dict):
    op_id = begin_operation(action, args)
    payload = {**args, "idempotency_key": op_id}
    try:
        result = client.call_tool(action, payload)   # MCP tool call
        mark(op_id, "committed")
        return result
    except TRANSIENT:
        # Connection dropped — execution unknown. Do NOT resend here
        mark(op_id, "uncertain")
        raise
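    except Exception:
        # Treated here as a clear rejection. Narrow this to your client's
        # explicit rejection errors: anything ambiguous (e.g. an HTTP 5xx)
        # belongs in "uncertain", not "failed"
        mark(op_id, "failed")
        raise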

The point is that the spot where you catch uncertain must not resend. The decision to retry is funneled entirely into the recovery phase described below. If you catch and immediately retry in the same place, you end up "resending while still uncertain," which defeats the purpose of splitting into three states.

Protect tools without idempotency keys via a reconcile read

When an MCP tool ignores idempotency keys, you have no choice but to verify the effect yourself. To do that, embed a correlation token in the very content you write — mix an identifier invisibly into the post body, put op_id in an external-id field, or attach it as metadata. As long as you can later search by that identifier, a reconcile read becomes possible.
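
One way to do the embedding is a small helper applied to the arguments before the first send, so the token is already in the content when a reconcile read goes looking for it. This is a sketch under assumptions: with_correlation_token is a hypothetical name, the field names (text, external_id, metadata) depend entirely on your tool's schema, and the tag here is visible rather than hidden.

```python
import uuid

def with_correlation_token(action: str, args: dict, op_id: str) -> dict:
    """Return a copy of args that carries op_id somewhere searchable later."""
    if action == "post_status":
        # Hypothetical convention: a visible reference tag at the end of the body
        return {**args, "text": f'{args["text"]} [ref:{op_id}]'}
    if action == "create_issue":
        return {**args, "external_id": op_id}
    # Fallback: a metadata field, useful only if the tool stores and indexes it
    return {**args, "metadata": {**args.get("metadata", {}), "op_id": op_id}}

op_id = str(uuid.uuid4())
original = {"text": "v1.2 is out"}
tagged = with_correlation_token("post_status", original, op_id)
print(tagged["text"])
```

The helper returns a copy, so the ledger can keep the untagged arguments while the wire payload carries the token.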

def reconcile(client, op_id: str, action: str, args: dict) -> bool:
    """Check whether this operation's effect exists on the server.
    True  -> it exists (no resend), False -> resend is safe."""
    if action == "post_status":
        # Search your own posts for the correlation token
        found = client.call_tool("search_own_posts", {"contains": op_id})
        return len(found.get("items", [])) > 0
    if action == "create_issue":
        found = client.call_tool("search_issues", {"external_id": op_id})
        return len(found.get("items", [])) > 0
    # Actions you can't reconcile must surface "can't verify" to the caller
    raise LookupError(f"no reconcile strategy for action={action}")
 
def recover_uncertain(client) -> None:
    """Inspect operations that ended 'uncertain' last run and settle them."""
    for op_id, action, args in load_uncertain(LEDGER):
        # The original write must already have carried op_id as its correlation
        # token; otherwise the reconcile search below can never find it
        args = {**args, "correlation_token": op_id}
        try:
            if reconcile(client, op_id, action, args):
                mark(op_id, "committed")           # it had gone through
            else:
                client.call_tool(action, args)     # it hadn't -> resend
                mark(op_id, "committed")
        except LookupError:
            # Don't auto-resend unreconcilable actions; route to human review
            mark(op_id, "needs_review")
        except TRANSIENT:
            # The reconcile read or the resend was itself cut off; the op stays
            # "uncertain" in the ledger and is picked up again on the next run
            continue

Run this recovery step right at the start of every unattended pipeline run. If an operation ended uncertain the previous night, the next startup always inspects it and settles it without double-execution. In my Dolice Labs publishing flow, after moving the post-push social announcements onto this scheme, the duplicate posts stopped recurring.
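
The recovery loop depends on a load_uncertain helper that the listings don't show. A minimal sketch, assuming the JSONL ledger format written by _append: fold the records per op_id so the latest status wins, then yield the operations that are still unsettled. Treating "pending" as unsettled is my own assumption here. It covers a process that died mid-call, and it is only safe at startup, when nothing else is in flight.

```python
import json
import os
import tempfile
from typing import Iterator, Tuple

def load_uncertain(path: str) -> Iterator[Tuple[str, str, dict]]:
    """Yield (op_id, action, args) for operations whose latest status is unsettled."""
    if not os.path.exists(path):
        return
    ops = {}                          # op_id -> merged view of its records
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue              # a torn final line from a crash; skip it
            ops.setdefault(rec["op_id"], {}).update(rec)
    for op_id, op in ops.items():
        if op.get("status") in ("uncertain", "pending") and "action" in op:
            yield op_id, op["action"], op.get("args", {})

# Demo on a throwaway ledger: "a" ended uncertain, "b" was committed
demo_path = os.path.join(tempfile.mkdtemp(), "ledger.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    for rec in [
        {"op_id": "a", "action": "post_status", "args": {"text": "hi"}, "status": "pending"},
        {"op_id": "a", "status": "uncertain"},
        {"op_id": "b", "action": "post_status", "args": {}, "status": "pending"},
        {"op_id": "b", "status": "committed"},
    ]:
        f.write(json.dumps(rec) + "\n")
unsettled = list(load_uncertain(demo_path))
print(unsettled)   # [('a', 'post_status', {'text': 'hi'})]
```

The file is read completely before anything is yielded, so the mark calls made during recovery can append to the same ledger without disturbing the iteration.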

What to do with actions you can't reconcile

Not every action is reconcilable. For something like sending email, where you can't search for what you sent, there is no way to verify the effect without an idempotency key. Here you decide by which is worse: the harm of sending twice, or the harm of never sending at all.

Strategy	Requires	Good for	Risk
Idempotency key	Server dedupes by key	Charges, inventory — heavy side effects	Depends on server implementation
Reconcile read	Effect is searchable later	Posts, issue creation, record inserts	Useless if you forget the correlation token
Resend, tolerate dups	Duplicates are cheap or removable	Idempotent aggregates, overwrites	Not OK if the recipient sees the dup
Don't resend	A miss is acceptable	Unverifiable email sends	Occasionally nothing gets sent

Personally, I make reconcile reads the first choice for actions whose side effects are visible to readers (posts, notifications), and I default to not auto-resending anything I can't reconcile. The discomfort of a duplicate arriving does more operational damage than the rare miss, in my judgment. Conversely, for idempotent updates that overwrite on the server side, I resend without hesitation.
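
That preference order can be written down as a small decision function. This is only a sketch of my own defaults: choose_strategy is a hypothetical name, and dup_cost and miss_cost are unitless scores you assign yourself, not anything measured.

```python
def choose_strategy(supports_key: bool, searchable: bool,
                    dup_cost: float, miss_cost: float) -> str:
    """Pick a retry strategy for one write action."""
    if supports_key:
        return "idempotency_key"
    if searchable:
        return "reconcile_read"
    # Nothing can verify the effect, so weigh the two harms directly.
    # Ties go to "dont_resend", matching the default described above.
    return "resend_tolerate_dups" if dup_cost < miss_cost else "dont_resend"

print(choose_strategy(False, True, dup_cost=5, miss_cost=1))     # reconcile_read
print(choose_strategy(False, False, dup_cost=5, miss_cost=1))    # dont_resend
print(choose_strategy(False, False, dup_cost=0.1, miss_cost=5))  # resend_tolerate_dups
```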

Where to start

Pick one write-type MCP tool you have and check whether its effect is searchable afterward. If it is, embed op_id as a correlation token in the written content and add a single recover_uncertain call at startup — that alone prevents most double-executions from dropped connections. If the tool turns out not to be searchable, the starting point is simply to articulate whether it is "an action that is safe to resend."
