CLAUDE LABJP
MODEL — Claude Opus 4.8 improves coding, agentic, and professional work, with consistency for long-running tasksPLATFORM — The Developer Platform adds code execution, an MCP connector, a Files API, and prompt caching up to one hourSANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers (Cloudflare/Daytona/Modal/Vercel)MODEL — Fable 5 (1M-token context, always-on adaptive thinking) was suspended on June 12 under a US export-control directiveLINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per taskMCP — Enterprise-managed MCP connectors (Okta) enable zero-touch access (Team/Enterprise beta)MODEL — Claude Opus 4.8 improves coding, agentic, and professional work, with consistency for long-running tasksPLATFORM — The Developer Platform adds code execution, an MCP connector, a Files API, and prompt caching up to one hourSANDBOX — Claude Managed Agents now run in your own sandbox and connect to private MCP servers (Cloudflare/Daytona/Modal/Vercel)MODEL — Fable 5 (1M-token context, always-on adaptive thinking) was suspended on June 12 under a US export-control directiveLINEUP — Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per taskMCP — Enterprise-managed MCP connectors (Okta) enable zero-touch access (Team/Enterprise beta)
Articles/API & SDK
API & SDK/2026-06-22Advanced

Claude API Streaming Breaks the "Everything Arrives" Assumption — Field Notes on Recovering from Partial Failure

Once concurrency climbs, Claude API streams disconnect mid-response, replay events, and emit half-finished tool arguments. Treating partial failure as the norm rather than an anomaly, here is how I rebuilt the implementation and monitoring to recover quietly.

claude-api68streaming17production101resilience9monitoring6

Premium Article

When I first wired Claude API into a chat feature as an indie developer, everything worked fine. It started falling apart once more than a dozen people used it at the same time. Responses stalled halfway, the same sentence appeared twice, a tool call somehow fired with only half its arguments. None of this showed up in the SDK samples, and none of it reproduced locally. For a while I was convinced I had a bug somewhere in my own code, and I kept hunting for it. The conclusion I finally reached was far more deflating: in production, streaming partially fails as a matter of course.

This article is a set of field notes on how I rebuilt the implementation once I accepted that premise. The goal is not a stream that never breaks, but one that quietly recovers before the user notices anything went wrong. I will walk through five layers — state, deduplication, a safety valve for tool arguments, separated backoff, and monitoring — alongside the marks each one left while I trimmed it down in a live service.

The difference from a single response is that it half-succeeds

With a normal request/response, failure arrives in a clean shape. A timeout returns nothing; an error fails the whole thing. You retry, and the state is binary — success or failure.

What makes streaming awkward is how routinely the middle ground happens. Because the server and client stay connected for tens of seconds, all of the following land on you as things that "can happen within normal operation." An intermediate proxy, disliking the silence, cuts the connection. A load balancer drops existing connections on every deploy. A browser tab moves to the background and falls behind on reading the buffer. An HTTP/2 retry causes the server to send the same event again.

These are not bugs; they are the environment. So tightening things in the "make it never happen" direction never catches up, because the combinations of runtime conditions are endless. Switch the mindset from "an implementation that doesn't break" to "an implementation that remembers where it broke and continues from there." That alone changes the shape of the code at its root. Concretely, it becomes a design that always holds, in hand, which block of the stream you have received up to which point.

Remember which event you were cut off in

Claude API streaming arrives as Server-Sent Events, and the main events are message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop, plus the ping keep-alive and error. The decisive detail for recovery is that content_block_delta carries a per-block index. Even if you are cut off mid-stream, as long as you hold which block and how many characters into it you have reached, you can reconstruct the continuation.

So instead of iterating the raw stream, wrap it in a state object that holds the partial response. The code below is a minimal structure that accumulates per-block text and tool-argument fragments while carrying along information that becomes a hint for reconnection.

from dataclasses import dataclass, field
from typing import Optional
 
@dataclass
class StreamState:
    """Holds the partial response during streaming; consulted to rebuild on reconnect."""
    message_id: Optional[str] = None
    model: Optional[str] = None
    blocks: list[dict] = field(default_factory=list)
    current_index: int = -1
    stop_reason: Optional[str] = None
 
    def apply_delta(self, index: int, delta: dict) -> None:
        # Grow the container up to the received block count, then add the delta
        while len(self.blocks) <= index:
            self.blocks.append({"type": "text", "text": "", "partial_json": ""})
        block = self.blocks[index]
        kind = delta.get("type")
        if kind == "text_delta":
            block["text"] += delta.get("text", "")
        elif kind == "input_json_delta":
            block["partial_json"] += delta.get("partial_json", "")
 
    def assistant_text(self) -> str:
        return "".join(b["text"] for b in self.blocks if b.get("type") == "text")

Inserting this single object lines up every downstream recovery step into the same plain shape: "look at the current state and decide." Keeping the state from scattering is the foundation that makes streaming stable.

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN
A stateful reader that distinguishes disconnects, duplicates, and mid-stream errors and recovers each by its own rule
A safety valve that stops half-finished tool_use arguments at the execution boundary, plus a dedup key that prevents double-concatenated text
A four-signal monitoring setup — completion rate, reconnects, first-token latency, duplicate rate — to notice degradation before users do
Secure payment via Stripe · Cancel anytime

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

or
Unlock all articles with Membership →
Share

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.

  • Copy-paste ready implementation code
  • New advanced guides published daily
  • $5/mo or $10 for lifetime access
View Membership →

Related Articles

API & SDK2026-06-15
When a Model Disappears Without Warning: A State Machine for Retirement, Withdrawal, and Overload
A model can become unusable in hours for reasons that have nothing to do with a technical outage. This guide models three distinct flavors of 'unavailable'—retirement, withdrawal, and transient overload—as one availability state machine, with a router that keeps automated pipelines running. Working TypeScript and Python included.
API & SDK2026-05-26
Designing Graceful Degradation for the Claude API — A Four-Tier Fallback Architecture That Keeps AI Features Quietly Alive
Once Claude API features hit real production traffic, model-level fallback alone stops being enough. This article walks through an SLI-driven four-tier degradation design, with Python and TypeScript code, SLO burn-rate alerting, and the operational trade-offs an indie developer actually runs into.
API & SDK2026-05-02
Building a Budget Circuit Breaker for Claude API in Production — Auto-Halt When Daily Token Spend Exceeds Your Cap
A practical guide to enforcing daily and monthly Claude API budget caps in production. Includes copy-paste Cloudflare Workers + KV / Durable Objects code, three response strategies (halt, degrade, alert), and the operational habits that keep the breaker honest.
📚RECOMMENDED BOOKS
Build a Large Language Model (From Scratch)
Sebastian Raschka
LLM Dev
Prompt Engineering for LLMs
Berryman & Ziegler
Prompting
AI Engineering
Chip Huyen
AI Eng
* Contains affiliate links
See all →