API & SDK / 2026-06-27 / Advanced

When Claude API Streaming Stops Without an Error: Detecting Silent Stalls and Resuming Mid-Stream

How to catch the 'silent stall' where Claude API streaming stops with no exception at all, using a content-level watchdog that times the gap between tokens, plus a resume path that carries received text forward as an assistant prefill, and a four-layer timeout budget for long-running automation.



The hard streaming failures are not the disconnects that raise an exception. They are the stops that say nothing at all: deltas simply stop arriving. Your try/except catches nothing. There is no stack trace in the logs. The process is alive, the socket is open — and yet the response ends partway through. I missed this "silent stall" for a while.

While running unattended article generation across several of my sites as an indie developer, I kept finding a handful of outputs saved with the last few paragraphs missing. No alert ever fired. What made the cause hard to find was that the code looked completely normal: I was iterating stream.text_stream to the end with a for loop, but the loop was exiting early. This piece describes the setup I settled on for detecting that silent stall and continuing from where it stopped. I'll take the usual advice about longer timeouts and retries as given, and focus on what was needed beyond it.

Why ReadTimeout doesn't fire on a silent stall

Most write-ups stop at "increase the read timeout." But on a silent stall, that timeout often never fires — because the SDK's read timeout measures the interval between bytes arriving on the socket, not the interval between meaningful content.

A Server-Sent Events path frequently has a reverse proxy or load balancer emitting comment lines (harmless lines like : ping) at fixed intervals, and Anthropic's own ping events may keep arriving too. The result: not a single content delta has come, yet something keeps landing on the socket. The socket looks alive, so the read timeout keeps resetting and never fires. The connection is healthy; the content is dead. That is what a silent stall actually is.

So the granularity you watch is wrong by default. What you need to guard is not "are bytes arriving" but "is meaningful content (a text_delta) making progress." That is what the watchdog below measures.
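To make the distinction concrete, here is a minimal sketch that decides whether one SSE event counts as content progress. The function name is_content_progress and the sample lines are illustrative assumptions; the event and delta type names (ping, content_block_delta, text_delta) are the ones the streaming Messages format uses. The SDK normally does this parsing for you, so treat it as an illustration of what "meaningful" means rather than code you need to ship.

```python
import json

def is_content_progress(sse_lines: list[str]) -> bool:
    """Return True only if one SSE event (given as its lines) carries a text_delta.

    Comment lines (starting with ':') and ping events keep the socket busy,
    which resets a read timeout, but they carry no content.
    """
    data_parts = []
    for line in sse_lines:
        if line.startswith(":"):              # SSE comment, e.g. a proxy keep-alive
            continue
        if line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
    if not data_parts:
        return False
    try:
        payload = json.loads("\n".join(data_parts))
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or payload.get("type") != "content_block_delta":
        return False
    delta = payload.get("delta")
    return (
        isinstance(delta, dict)
        and delta.get("type") == "text_delta"
        and bool(delta.get("text"))
    )
```

A proxy comment and a ping event both return False here, while a content_block_delta carrying text returns True. Only the last kind should reset the watchdog's clock.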

An inter-token watchdog

The idea is simple: record the timestamp of the last text_delta, and if the gap exceeds a threshold, abort the stream yourself. Think of it as a content-layer guard added alongside the SDK's socket-layer timeout, not a replacement for it.

import time
import threading
import anthropic

class StreamStalled(Exception):
    """No meaningful delta arrived within the window (silent stall)."""

def stream_with_watchdog(
    client: anthropic.Anthropic,
    messages: list,
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 8192,
    stall_seconds: float = 25.0,         # tolerated gap between text deltas
    first_token_seconds: float = 45.0,   # longer grace period before the first delta
):
    """
    Watch the arrival gap between text deltas; raise StreamStalled past the limit.
    Buffer received text so the caller can carry it forward on a stall.
    """
    buf: list[str] = []
    last_delta = {"t": time.monotonic()}
    stop = threading.Event()      # set when the stream ends for any reason
    stalled = threading.Event()   # set only when the watchdog aborts the stream
    closer = {"fn": lambda: None}

    def watchdog():
        while not stop.wait(1.0):
            # First-token budget until any delta has arrived; inter-token budget after.
            limit = stall_seconds if buf else first_token_seconds
            if time.monotonic() - last_delta["t"] > limit:
                stalled.set()
                closer["fn"]()   # close the underlying connection to break the for-loop
                return

    wd = threading.Thread(target=watchdog, daemon=True)

    try:
        with client.messages.stream(
            model=model, max_tokens=max_tokens, messages=messages,
        ) as stream:
            closer["fn"] = stream.close   # let the watchdog close the connection
            last_delta["t"] = time.monotonic()
            wd.start()
            for text in stream.text_stream:
                last_delta["t"] = time.monotonic()
                buf.append(text)
                yield text
    except Exception:
        # Closing the connection from the watchdog thread may surface as a
        # transport error instead of a clean loop exit; treat that as the stall.
        if not stalled.is_set():
            raise
    finally:
        stop.set()

    received = "".join(buf)
    if stalled.is_set():
        raise StreamStalled(received)

Two implementation points matter. First, the watchdog must be able to call stream.close(). The for loop is blocking, so unless you close the connection from the outside, it will sit there waiting on a silence forever. Second, always keep received up to the moment of the stop — without it, "continue from where it stopped" is impossible.

Tune stall_seconds to your path. In my environment the first token can take around 20 seconds (especially with models that think longer), so I keep a separate, longer grace period for the first token (first_token_seconds, the first-token layer of the timeout budget described later) and apply the 25-second cap only once text has started flowing. If the body has been moving and then goes silent for 25 seconds, I treat that as a path-side stop and abort, because recovering is faster than waiting.


Carry the received text forward and continue "mid-stream"

Once you can detect a silent stall, the next problem is resuming. The naive fix — resend the same request — makes the model regenerate the entire response from scratch. For a long article or report, that wastes both tokens and time.

Instead, use a prefill: pass the text you already received as the final assistant message. Claude continues from the prior assistant turn, so it picks up roughly where it stopped.

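def resume_stream(client, base_messages, received_text, **kw):
    """Pass received_text as an assistant prefill and generate the continuation."""
    if not received_text:
        # Stalled before the first delta: nothing to carry forward, so retry as-is.
        yield from stream_with_watchdog(client, base_messages, **kw)
        return
    messages = base_messages + [{"role": "assistant", "content": received_text}]
    yield from stream_with_watchdog(client, messages, **kw)


def robust_generate(client, base_messages, max_resumes=3, **kw):
    full = ""
    for attempt in range(max_resumes + 1):
        gen = (resume_stream(client, base_messages, full, **kw)
               if attempt else stream_with_watchdog(client, base_messages, **kw))
        cont = ""   # text received during this attempt only
        try:
            for text in gen:
                cont += text
            return stitch(full, cont)   # completed normally
        except StreamStalled:
            # cont already holds everything this attempt yielded (the same text
            # carried in the exception), so join it on exactly once, at the seam.
            # Trailing whitespace is dropped because the API rejects a prefill
            # that ends in whitespace; the model re-emits it when it continues.
            full = stitch(full, cont).rstrip()
    return full   # resumes exhausted: still partial, so check before saving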

One caveat: a prefill resume does not guarantee a seamless join. If the stop landed mid-word or mid-symbol, the start of the continuation may slightly overlap the previous text or fail to connect. In practice, add a step that trims the overlap at the seam.

def stitch(prev: str, cont: str, max_overlap: int = 80) -> str:
    """Detect and drop a duplicated boundary between prev's tail and cont's head."""
    window = prev[-max_overlap:]
    for size in range(min(len(window), len(cont)), 0, -1):
        if window.endswith(cont[:size]):
            return prev + cont[size:]
    return prev + cont

It is not a perfect splice, but it is far better than "the last few paragraphs are silently missing." For me, eliminating the loss itself was worth far more than the slight roughness at the seam. I still run the resumed text through a single read-through check (the verification step below) before saving.
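One thing to watch with overlap trimming is that a very short match can be a coincidence rather than a real duplicate, for example when the stop lands inside a word with a doubled letter. The sketch below is a hypothetical variant (stitch_min and its min_overlap parameter are illustrative additions, not part of the setup above) that only trims when the shared run is at least a few characters long.

```python
def stitch_min(prev: str, cont: str, max_overlap: int = 80, min_overlap: int = 4) -> str:
    """Drop a duplicated boundary only when it is at least min_overlap characters."""
    window = prev[-max_overlap:] if max_overlap > 0 else ""
    # Try the longest candidate overlap first and stop at min_overlap.
    for size in range(min(len(window), len(cont)), max(min_overlap, 1) - 1, -1):
        if window.endswith(cont[:size]):
            return prev + cont[size:]
    return prev + cont

# A stop in the middle of "password": the shared "s" is not a duplicate.
print(stitch_min("reset your pas", "sword now"))    # -> reset your password now
# A real repeat of the last word is still trimmed.
print(stitch_min("the quick brown", " brown fox"))  # -> the quick brown fox
```

The trade-off is that a genuine overlap shorter than min_overlap is left duplicated. Which error is cheaper depends on your content, so the value 4 is only a starting point.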

Design timeouts as a budget, split into four layers

Once you face a silent stall, a timeout stops being one number and becomes several layers with different jobs. I manage four:

Layer	What it watches	Rough target	What an overshoot means
Connect	TCP/TLS establishment	10s	Reachability / DNS / proxy issue
First token	Time to the first text_delta	30–45s	Long model thinking / queueing
Inter-token	Gap between deltas	20–25s	Silent stall (likely path-side)
Total	Whole-request wall time	Use-case (e.g. 8 min)	Runaway / unexpected length

The connect and total layers are easy to guard with the SDK or asyncio.wait_for. The first-token and inter-token layers can share one mechanism: branch on whether last_delta["t"] has been updated yet.

def deadline_for(received_any: bool, first_token_s: float, inter_token_s: float) -> float:
    """First-token budget until any delta has arrived; inter-token budget after."""
    return first_token_s if not received_any else inter_token_s
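For the total layer in a synchronous generator pipeline, one option is to check elapsed wall time each time a chunk arrives. This is a minimal sketch under that assumption; with_total_budget and TotalBudgetExceeded are hypothetical names. It can only fire when a chunk comes through, so it relies on the inter-token watchdog to unblock a read that has gone silent.

```python
import time
from typing import Callable, Iterable, Iterator

class TotalBudgetExceeded(Exception):
    """The whole request ran longer than its wall-time budget."""

def with_total_budget(
    chunks: Iterable[str],
    total_seconds: float,
    clock: Callable[[], float] = time.monotonic,   # injectable for testing
) -> Iterator[str]:
    """Re-yield chunks until the total wall-time budget is spent."""
    start = clock()
    for chunk in chunks:
        if clock() - start > total_seconds:
            raise TotalBudgetExceeded(f"exceeded {total_seconds:.0f}s total budget")
        yield chunk
```

Wrapping the watchdog generator in it, for example with_total_budget(stream_with_watchdog(client, messages), 480.0), would give the eight-minute cap from the table.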

The important part is to set thresholds from measured distributions, not by feel. Emit the elapsed time for each layer as a metric, watch p50 / p95 / p99 for about a week, then draw the lines. In my automation the first-token p95 was about 18 seconds, so I set the first-token budget to 45 (more than double p95); the inter-token p99 was around 6 seconds, so I capped it at 25. Drawing the line clear of the healthy p99 is how you avoid resuming needlessly on a false positive.
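As a rough sketch of turning a week of samples into a threshold, the helpers below compute a nearest-rank percentile and multiply it by a headroom factor. The names percentile and suggest_budget, and the idea of a single multiplicative headroom, are assumptions for illustration; the 2.5 factor in the usage line is just one value that reproduces "more than double p95".

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # ceil(pct * n / 100), 1-based
    return ordered[rank - 1]

def suggest_budget(samples: list[float], pct: int, headroom: float) -> float:
    """Budget = healthy percentile times a headroom factor (the factor is a judgment call)."""
    return percentile(samples, pct) * headroom

# With a first-token p95 of 18 seconds and 2.5x headroom, the budget is 45 seconds.
print(suggest_budget([18.0] * 10, 95, 2.5))   # -> 45.0
```

Nearest-rank is used here because it always returns a value that was actually observed, which keeps the numbers easy to sanity-check against raw logs.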

import logging
import time

logger = logging.getLogger("claude.stream")

def timed_stream(client, messages, **kw):
    """Yield text from stream_with_watchdog and log per-request timing metrics."""
    t0 = time.monotonic()
    first_t = None   # seconds until the first delta
    gaps = []        # seconds between consecutive deltas
    last = t0
    try:
        for text in stream_with_watchdog(client, messages, **kw):
            now = time.monotonic()
            if first_t is None:
                first_t = now - t0
            else:
                gaps.append(now - last)
            last = now
            yield text
    finally:
        # Log even when the stream stalls or fails; those runs matter most.
        p99 = sorted(gaps)[int(len(gaps) * 0.99)] if gaps else 0.0
        logger.info("first_token=%.2fs total=%.2fs max_gap=%.2fs p99_gap=%.2fs",
                    -1 if first_t is None else first_t, time.monotonic() - t0,
                    max(gaps) if gaps else 0.0, p99)

Once that one log line accumulates, you can review both the validity of your thresholds and any degradation of the path in numbers. "It stalls sometimes" becomes "first-token p95 is 6 seconds higher than last week," and the precision of your response changes entirely.
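To review those numbers week over week, the log lines have to be read back into values. A minimal sketch, assuming the message format from the logger call above; parse_timing is a hypothetical helper, and any prefix your logging handler adds (timestamp, level) is simply skipped by the search.

```python
import re
from typing import Optional

# Matches the message emitted as
# "first_token=%.2fs total=%.2fs max_gap=%.2fs p99_gap=%.2fs".
TIMING_RE = re.compile(
    r"first_token=(?P<first_token>-?\d+(?:\.\d+)?)s "
    r"total=(?P<total>\d+(?:\.\d+)?)s "
    r"max_gap=(?P<max_gap>\d+(?:\.\d+)?)s "
    r"p99_gap=(?P<p99_gap>\d+(?:\.\d+)?)s"
)

def parse_timing(line: str) -> Optional[dict[str, float]]:
    """Extract the four timing fields from one log line, or None if it does not match."""
    match = TIMING_RE.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}
```

A first_token of -1 marks a request that never produced a delta, so filter those out before computing first-token percentiles and count them separately.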

Reduce silence at the path (proxy buffering)

The app-side watchdog notices a stall and continues; settings that avoid generating silence in the first place help too. The most common culprit is proxy buffering. If Nginx sits in the path, leaving response buffering on batches the deltas, so the inter-token gaps look artificially spiky and tempt the watchdog into false positives.

location /api/chat {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
 
    proxy_buffering off;          # without this, deltas arrive batched and delayed
    proxy_cache off;
    chunked_transfer_encoding on;
 
    proxy_connect_timeout 10s;
    proxy_read_timeout 600s;      # socket-layer backstop; silent stalls are guarded in the app
    proxy_send_timeout 600s;
}

If your app relays SSE itself, add X-Accel-Buffering: no to the response headers to disable Nginx buffering explicitly. Not buffering at the path and measuring content progress in the app are two separate defenses; you need both to be resilient to silent stalls.
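For an app that relays SSE itself, the header goes on the response your app sends back through Nginx. Here is a minimal WSGI sketch of that; make_sse_app and the chunk source are hypothetical, and a real relay would pull chunks from the watchdog generator instead of a fixed list.

```python
from typing import Callable, Iterable, Iterator

SSE_HEADERS = [
    ("Content-Type", "text/event-stream"),
    ("Cache-Control", "no-cache"),
    ("X-Accel-Buffering", "no"),   # ask Nginx not to buffer this response
]

def make_sse_app(chunks: Callable[[], Iterable[str]]):
    """Build a WSGI app that relays text chunks as SSE data events."""
    def app(environ, start_response) -> Iterator[bytes]:
        start_response("200 OK", list(SSE_HEADERS))
        for chunk in chunks():
            # One event per chunk; every line of the chunk gets its own data: field.
            body = "".join(f"data: {line}\n" for line in chunk.split("\n"))
            yield (body + "\n").encode("utf-8")
    return app
```

The header only helps if the WSGI server in front of the app also flushes each yielded chunk promptly, so check that layer too.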

Add one final check before saving

Once resume is in place, you want to confirm the resumed response actually finished. I put a small gate that inspects stop_reason right before saving.

def assert_complete(final_message) -> None:
    sr = final_message.stop_reason
    if sr == "max_tokens":
        raise RuntimeError("Hit max_tokens; a continuation may remain.")
    if sr not in ("end_turn", "stop_sequence"):
        raise RuntimeError(f"Incomplete stop reason: {sr}")

After a prefill resume, inspect the stop_reason of the final message from the last attempt, since that is the one that ended the response. If it is end_turn, the model at least considers itself done. The thing that hurt most about silent stalls was that the absence of an error meant I never noticed the output was incomplete. This one check closes that blind spot.
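One way to wire that gate into the save step is sketched below, using stand-in objects. save_if_complete is a hypothetical helper. In the real pipeline the final message could come from stream.get_final_message(), called inside the with block once the loop has finished, which assumes the watchdog generator is extended to hand that message back to the caller.

```python
from types import SimpleNamespace
from typing import Callable

COMPLETE_REASONS = ("end_turn", "stop_sequence")

def save_if_complete(text: str, final_message, save: Callable[[str], None]) -> bool:
    """Call save(text) only when the last attempt reported a completed stop_reason."""
    if getattr(final_message, "stop_reason", None) not in COMPLETE_REASONS:
        return False      # partial output: keep it out of the saved results
    save(text)
    return True

# Stand-ins for real message objects, which expose the same stop_reason attribute.
saved: list[str] = []
print(save_if_complete("full article", SimpleNamespace(stop_reason="end_turn"), saved.append))  # -> True
print(save_if_complete("cut off", SimpleNamespace(stop_reason="max_tokens"), saved.append))     # -> False
```

Returning a boolean instead of raising lets an unattended job route incomplete outputs to a retry queue rather than crash the whole run.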

Looking back

None of these pieces is flashy on its own: a watchdog that measures content progress, a prefill resume of received text, a four-layer timeout budget, disabling buffering at the path, and a stop_reason check before saving. Each is a small move. Together, though, they nearly eliminated the most easily-missed failure — the response that silently loses its tail.

The longer something runs unattended, the more I value "it doesn't quietly break partway" over any flashy feature. A bug that raises no error is worse than one that does. That is exactly why, for me, starting by making the silence observable looked like the long way around but turned out to be the shortest path. If you recognize the same missing-tail symptom, I hope this gives you a thread to pull on.
