●MODEL — Claude Opus 4.8 improves coding, agentic, and professional work, with consistency for long-running tasks●PLATFORM — The Developer Platform adds code execution, an MCP connector, a Files API, and prompt caching up to one hour●SANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers (Cloudflare/Daytona/Modal/Vercel)●MODEL — Fable 5 (1M-token context, always-on adaptive thinking) was suspended on June 12 under a US export-control directive●LINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task●MCP — Enterprise-managed MCP connectors (Okta) enable zero-touch access (Team/Enterprise beta)●MODEL — Claude Opus 4.8 improves coding, agentic, and professional work, with consistency for long-running tasks●PLATFORM — The Developer Platform adds code execution, an MCP connector, a Files API, and prompt caching up to one hour●SANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers (Cloudflare/Daytona/Modal/Vercel)●MODEL — Fable 5 (1M-token context, always-on adaptive thinking) was suspended on June 12 under a US export-control directive●LINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task●MCP — Enterprise-managed MCP connectors (Okta) enable zero-touch access (Team/Enterprise beta)
Claude API Streaming Breaks the "Everything Arrives" Assumption — Field Notes on Recovering from Partial Failure
Once concurrency climbs, Claude API streams disconnect mid-response, replay events, and emit half-finished tool arguments. Treating partial failure as the norm rather than an anomaly, here is how I rebuilt the implementation and monitoring to recover quietly.
When I first wired Claude API into a chat feature as an indie developer, everything worked fine. It started falling apart once more than a dozen people used it at the same time. Responses stalled halfway, the same sentence appeared twice, a tool call somehow fired with only half its arguments. None of this showed up in the SDK samples, and none of it reproduced locally. For a while I was convinced I had a bug somewhere in my own code, and I kept hunting for it. The conclusion I finally reached was far more deflating: in production, streaming partially fails as a matter of course.
This article is a set of field notes on how I rebuilt the implementation once I accepted that premise. The goal is not a stream that never breaks, but one that quietly recovers before the user notices anything went wrong. I will walk through five layers — state, deduplication, a safety valve for tool arguments, separated backoff, and monitoring — alongside the marks each one left while I trimmed it down in a live service.
The difference from a single response is that it half-succeeds
With a normal request/response, failure arrives in a clean shape. A timeout returns nothing; an error fails the whole thing. You retry, and the state is binary — success or failure.
What makes streaming awkward is how routinely the middle ground happens. Because the server and client stay connected for tens of seconds, all of the following land on you as things that "can happen within normal operation." An intermediate proxy, disliking the silence, cuts the connection. A load balancer drops existing connections on every deploy. A browser tab moves to the background and falls behind on reading the buffer. An HTTP/2 retry causes the server to send the same event again.
These are not bugs; they are the environment. So tightening things in the "make it never happen" direction never catches up, because the combinations of runtime conditions are endless. Switch the mindset from "an implementation that doesn't break" to "an implementation that remembers where it broke and continues from there." That alone changes the shape of the code at its root. Concretely, it becomes a design that always holds, in hand, which block of the stream you have received up to which point.
Remember which event you were cut off in
Claude API streaming arrives as Server-Sent Events, and the main events are message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop, plus the ping keep-alive and error. The decisive detail for recovery is that content_block_delta carries a per-block index. Even if you are cut off mid-stream, as long as you hold which block and how many characters into it you have reached, you can reconstruct the continuation.
So instead of iterating the raw stream, wrap it in a state object that holds the partial response. The code below is a minimal structure that accumulates per-block text and tool-argument fragments while carrying along information that becomes a hint for reconnection.
from dataclasses import dataclass, fieldfrom typing import Optional@dataclassclass StreamState: """Holds the partial response during streaming; consulted to rebuild on reconnect.""" message_id: Optional[str] = None model: Optional[str] = None blocks: list[dict] = field(default_factory=list) current_index: int = -1 stop_reason: Optional[str] = None def apply_delta(self, index: int, delta: dict) -> None: # Grow the container up to the received block count, then add the delta while len(self.blocks) <= index: self.blocks.append({"type": "text", "text": "", "partial_json": ""}) block = self.blocks[index] kind = delta.get("type") if kind == "text_delta": block["text"] += delta.get("text", "") elif kind == "input_json_delta": block["partial_json"] += delta.get("partial_json", "") def assistant_text(self) -> str: return "".join(b["text"] for b in self.blocks if b.get("type") == "text")
Inserting this single object lines up every downstream recovery step into the same plain shape: "look at the current state and decide." Keeping the state from scattering is the foundation that makes streaming stable.
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦A stateful reader that distinguishes disconnects, duplicates, and mid-stream errors and recovers each by its own rule
✦A safety valve that stops half-finished tool_use arguments at the execution boundary, plus a dedup key that prevents double-concatenated text
✦A four-signal monitoring setup — completion rate, reconnects, first-token latency, duplicate rate — to notice degradation before users do
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
When cut off, hand back "this far" — not "resume" — and let it write on
When the connection dropped, the first thing I considered was whether I could resume the same request. But Claude API has no per-event replay mechanism, so reconnecting to the same call gives no guarantee about where it picks up. You might get duplicates, or it might start from the top. That makes deterministic reassembly impossible.
What is reliable is to push the partial response you have received so far into the conversation history as an assistant message and have it generate the continuation. The request count goes up, but because each round is determined, you can treat it idempotently. Below is a thin wrapper that catches only connection-class errors and reconnects automatically. The deliberate choice not to handle rate limits here matters, for reasons I cover later.
import asyncio, loggingimport anthropiclog = logging.getLogger(__name__)class ResilientStream: def __init__(self, client: anthropic.AsyncAnthropic, max_retries: int = 3): self.client = client self.max_retries = max_retries self.state = StreamState() async def run(self, **kwargs): attempt = 0 while True: try: async with self.client.messages.stream(**kwargs) as stream: async for event in stream: self._track(event) yield event return # reached message_stop except (anthropic.APIConnectionError, anthropic.APITimeoutError) as e: attempt += 1 if attempt > self.max_retries: log.error("stream gave up after %d attempts: %s", attempt, e) raise await asyncio.sleep(min(2 ** attempt, 20)) kwargs = self._resume_kwargs(kwargs) # push partial response into history def _track(self, event) -> None: t = event.type if t == "message_start": self.state.message_id = event.message.id self.state.model = event.message.model elif t == "content_block_start": self.state.current_index = event.index elif t == "content_block_delta": self.state.apply_delta(event.index, event.delta.model_dump()) elif t == "message_delta" and event.delta.stop_reason: self.state.stop_reason = event.delta.stop_reason def _resume_kwargs(self, original: dict) -> dict: partial = self.state.assistant_text() if not partial: return original resumed = dict(original) resumed["messages"] = list(original.get("messages", [])) + [ {"role": "assistant", "content": partial} ] return resumed
From the caller's side, no matter how many times it was cut off, it looks like a complete response flowed through. Because all the recovery logic hides inside this boundary, the rendering and business logic need to know nothing about it. How you draw that boundary pays off later.
Don't double the text when the same delta arrives twice
Duplication fails loudly. The moment text is concatenated twice, the user notices on sight. Depending on HTTP/2 retries and how reconnection is implemented, the same content_block_delta flowing again is perfectly ordinary, so dedup cannot be skipped.
The crucial part here is being explicit about what counts as identical. Timestamps can change on resend, so they cannot be the key. Where I settled was combining three things: the message ID, the block index, and the cumulative receive position within that block. Including the position also prevents the accident of mistaking a different delta of the same block for one another.
from hashlib import sha256class DedupWindow: """Remember the last `window` events and drop anything seen a second time.""" def __init__(self, window: int = 512): self.seen: set[str] = set() self.order: list[str] = [] self.window = window def is_duplicate(self, message_id: str, index: int, position: int) -> bool: key = sha256(f"{message_id}:{index}:{position}".encode()).hexdigest() if key in self.seen: return True self.seen.add(key) self.order.append(key) if len(self.order) > self.window: self.seen.discard(self.order.pop(0)) return False
To keep memory from growing without bound, I cap how many keys to remember. A single response holds at most a few hundred deltas, so 512 entries amply cover one round. Without that cap, the set creeps upward on a server left running for a long time.
Don't let half-finished tool arguments touch production data
The one that froze my blood was not garbled text but a tool call. Tool arguments arrive as JSON string fragments via input_json_delta. If cut off mid-way, that fragment stays as broken JSON. Trying to parse it and getting an exception is the lucky case; the worst is when it happens to be valid JSON up to the cut point, and the tool runs with half its arguments.
It actually happened once that a write tool ran with a parameter missing from a mid-stream disconnect, and nearly applied an unintended update to a production record. I caught it just in time, but ever since, I have replaced the very condition for executing a tool with "the arguments parse completely."
import jsonfrom typing import Optionaldef parse_tool_input(block: dict) -> Optional[dict]: """Safely read a tool_use block's partial_json. Return None if incomplete.""" raw = block.get("partial_json", "").strip() if not raw: return None try: return json.loads(raw) except json.JSONDecodeError as e: log.warning("incomplete tool input, skipping: %s (head=%r)", e, raw[:120]) return None# Always pass through this at the execution boundaryargs = parse_tool_input(block)if args is None: # Do not start the tool while arguments are incomplete returnrun_tool(block["name"], args)
It is only a few lines, but placing this one check at the application boundary nearly closes the path by which a mid-stream interruption turns into a side effect. A safety valve works even when it is unglamorous.
Rest connection errors and rate limits on separate rhythms
There was a reason the wrapper deliberately did not handle rate limits. If you run connection-error retries and rate-limit retries through the same machinery, they interfere and retries pile up like a storm. Re-establishing the connection at short intervals while 429s are coming back simply invites more 429s.
So vary how you rest by error type. For rate limits, respect the wait time the server returns and top up the shortfall with jitter. For connection-class errors, use a shorter exponential backoff. The jitter is there to scatter the phenomenon of many clients trying to recover at once and forming a fresh peak; shifting things by just a few seconds visibly smooths the simultaneous-recovery spike.
import randomasync def backoff_for(attempt: int, error: Exception) -> None: if isinstance(error, anthropic.RateLimitError): base = getattr(error, "retry_after", None) or 30 delay = base + random.uniform(0, 5) # server hint + scatter elif isinstance(error, (anthropic.APIConnectionError, anthropic.APITimeoutError)): delay = min(2 ** attempt, 20) + random.uniform(0, 1) else: delay = min(2 ** attempt, 60) log.info("backoff %.1fs (attempt=%d, %s)", delay, attempt, type(error).__name__) await asyncio.sleep(delay)
Once I separated the types, retries stopped cascading into a self-inflicted choke even during high-load windows. A retry only becomes a recovery strategy once its interval is part of the design.
Don't miss the error event that arrives as a 200
There is one more step you can trip over when handling HTTP yourself. Claude API's error event arrives embedded in the SSE body while the HTTP status stays 200. The official SDK converts it to an exception, but if you write the low layer yourself, the error slips through disguised as a normal response unless you check event.type == "error" on your own.
if event.type == "error": err = event.error if err.type == "overloaded_error": raise anthropic.APIStatusError(err.message) # capacity issues are worth a retry raise RuntimeError(f"fatal stream error: {err.type}: {err.message}")
Here only overloaded_error has a realistic chance of recovery, so route it to a retry and treat everything else as failure. Without that distinction, a defect on the prompt side quietly builds a loop that retries forever.
Catch the signs of breakage before your users do
Hardening the implementation means little if you cannot notice degradation. The worst state is the one where the user is the first to notice. In my own service, I watch these four.
Signal
Definition
Warning line
Completion rate
Share of streams that reach message_stop
Below 99.5%
Avg reconnects
Reconnections per stream
Above 0.1
First-token latency P95
P95 of time until the first token appears
Above 6 seconds
Duplicate rate
Share of deltas that hit the dedup check
Above 1%
Completion rate is the overall health score, reconnects reflect path instability, first-token latency reflects perceived speed, and the duplicate rate reflects an anomaly in the delivery path. The moment any single one crosses its line, you can act before a user asks "isn't this slow lately?" Monitoring is the last layer of recovery.
Roll it in three steps, in order
Trying to fold in everything at once makes it hard to start. I introduced it in this order: first replace one existing stream with the stateful wrapper, then add dedup, and finally slot the tool-argument safety valve into the execution boundary. Monitoring can come after that.
There is a reason for the order: once the stateful wrapper is in place, both dedup and the safety valve can be written by just "looking at the current state." Building from the foundation makes the latter two come out in surprisingly short code. Rather than aiming never to break, remember where you broke and continue quietly. Just re-framing it that way genuinely shifts the felt stability by a notch under the same traffic. The stretch of time during which users stop noticing the glitches is, I think, the return this design pays back.
Share
Thank You for Reading
Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.