CLAUDE LAB
API & SDK / 2026-06-23 / Advanced

When Thinking Is Always On, Prefill Quietly Stops Working — Fixing Streaming and Token Budgets for Fable 5

Fable 5 thinks by default. Prefill no longer applies, the first streamed block isn't text, and max_tokens has to leave room for reasoning. Here is how I fixed those three broken assumptions in my own automated publishing pipeline.

Tags: Claude API, Fable 5, extended thinking, streaming, migration


The morning after I swapped one stage of my publishing pipeline over to Fable 5, a step that had always returned JSON started returning an empty string. No error. stop_reason was end_turn. The body was simply empty. The cause was mundane: the assistant prefill I had relied on for years was being ignored on a model that thinks by default.

Fable 5 became generally available on June 9, and its defining trait is always-on adaptive thinking. It assumes reasoning happens first, and that quietly invalidates code written when thinking was something you opted into. Running four sites' worth of generation by myself as an indie developer, I found the silent failure far worse than a loud one — a stage that keeps running and returns nothing is harder to catch than one that throws. This article records the three broken assumptions I hit during that migration, and how I fixed each one in code.

What actually changed — three broken assumptions

When thinking is always on, both the shape of the output and its accounting change. In the order I fixed them:

Broken assumption | Old behavior | With thinking on
Assistant prefill | Continue from a seed to pin the output shape | Prefill can't be combined; shape isn't pinned
First streamed block | First content block = text | First block is thinking; text comes later
Meaning of max_tokens | Roughly the body limit | Combined limit for thinking + body; can run out first

None of these surface as exceptions. They show up as output that is thin, empty, or occasionally truncated — which is exactly why they are easy to miss. The logs look fine.
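
Since none of these failures raise, one option is to inspect every finished response before handing it to the next stage. The sketch below is a minimal illustration, not part of the pipeline described here: `classify_response` is a hypothetical helper, and it assumes content blocks are plain dicts with a "type" key, mirroring the raw JSON response body (SDK objects expose the same fields as attributes).

```python
from typing import Any


def classify_response(content: list[dict[str, Any]], stop_reason: str) -> str:
    """Label a finished response as 'ok', 'empty', or 'truncated'.

    Assumption: each block is a dict with a "type" key. Thinking blocks
    never count as the answer; only non-blank text or a tool_use block does.
    """
    has_body = False
    for block in content:
        block_type = block.get("type")
        if block_type == "text" and block.get("text", "").strip():
            has_body = True
        elif block_type == "tool_use":
            has_body = True

    if stop_reason == "max_tokens":
        # The budget ran out, so any body that exists may be cut off.
        return "truncated"
    if not has_body:
        # Finished "normally" but produced no answer: the silent failure.
        return "empty"
    return "ok"
```

A stage can then fail loudly on anything other than "ok", for example by raising when `classify_response(blocks, stop_reason)` returns "empty" for a response whose only block is a thinking block.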

Broken assumption 1: prefill no longer applies

To force valid JSON, I used to seed the assistant turn with { and let the model continue. On a thinking-on model that doesn't work. The model produces a thinking block before any output, so there is no "assistant continuation" point to seed.

Force the two together and the API either rejects the request or silently drops the prefill. In my stage it was the latter, which is why I got empty strings.
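
Because the prefill may be dropped without an error, it can help to catch it before the request is sent. The following is a rough sketch under stated assumptions: `has_prefill` and `require_no_prefill` are hypothetical helpers, and a "prefill" is taken to mean a message list whose last entry has the assistant role, which is how the old seeding pattern looks in the Messages API.

```python
# The old pattern: the final assistant turn seeds the output with "{".
OLD_STYLE_MESSAGES = [
    {"role": "user", "content": "Return the article metadata as JSON."},
    {"role": "assistant", "content": "{"},
]


def has_prefill(messages: list[dict]) -> bool:
    """Return True when the message list ends with an assistant turn."""
    return bool(messages) and messages[-1].get("role") == "assistant"


def require_no_prefill(messages: list[dict]) -> list[dict]:
    """Raise instead of letting a prefill be silently ignored downstream."""
    if has_prefill(messages):
        raise ValueError(
            "Trailing assistant message (prefill) found; "
            "use a forced tool call to pin the output shape instead."
        )
    return messages
```

Calling `require_no_prefill(messages)` just before the request turns the quiet empty-string failure into an immediate exception at the call site.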

The fix is to constrain the output shape with a forced tool call rather than prefill. Make a specific tool mandatory with tool_choice, and use its input schema as your output schema. Thinking still runs; only the final output is structurally guaranteed.

import anthropic
 
client = anthropic.Anthropic()
 
# Confirm the actual model ID in the official release notes
MODEL = "claude-fable-5"
 
# Define the structure you want as a "tool input schema"
EXTRACT_TOOL = {
    "name": "emit_article_meta",
    "description": "Return article metadata in a structured form",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
            "is_premium": {"type": "boolean"},
        },
        "required": ["title", "tags", "is_premium"],
    },
}
 
def extract_meta(source_text: str) -> dict:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=8000,  # leave room for thinking (see below)
        tools=[EXTRACT_TOOL],
        # require this tool = pin the shape, the way prefill used to
        tool_choice={"type": "tool", "name": "emit_article_meta"},
        messages=[{"role": "user", "content": source_text}],
    )
    # Skip thinking blocks; grab only the tool_use block
    for block in msg.content:
        if block.type == "tool_use":
            return block.input  # already a schema-conformant dict
    raise RuntimeError("No tool_use block found")

The key is tool_choice set to {"type": "tool", "name": ...}. The model is then required to call that tool, and its input follows the schema you declared. Instead of pinning the first character the way prefill does, you guarantee the output structure itself, so a thinking block in front of it doesn't break the result. If you are unwinding a prefill-based design, the layered defenses described in "A four-layer defense for always-valid JSON with Claude prefill" are worth rereading, because the assumptions they build on change here.

Note that with thinking on you can't set a custom temperature (it uses the default). If you relied on a low temperature for determinism, move that guarantee off temperature and onto your tool schema plus a validation loop.
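
A validation loop of the kind mentioned above could look roughly like this. It is a sketch, not the pipeline's actual code: `validate_meta` and `generate_validated` are hypothetical names, the field checks simply mirror the emit_article_meta schema, and the generator is passed in as a callable so the loop can be exercised without calling the API.

```python
from typing import Any, Callable

# Expected fields and Python types, mirroring the emit_article_meta schema.
REQUIRED_FIELDS = {"title": str, "tags": list, "is_premium": bool}


def validate_meta(meta: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the dict is valid."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in meta:
            errors.append(f"missing field: {name}")
        elif not isinstance(meta[name], expected):
            errors.append(f"{name} should be {expected.__name__}")
    tags = meta.get("tags")
    if isinstance(tags, list) and not all(isinstance(t, str) for t in tags):
        errors.append("tags should contain only strings")
    return errors


def generate_validated(
    generate: Callable[[], dict[str, Any]], max_attempts: int = 3
) -> dict[str, Any]:
    """Call generate() until its result validates, up to max_attempts."""
    last_errors: list[str] = []
    for _ in range(max_attempts):
        meta = generate()
        last_errors = validate_meta(meta)
        if not last_errors:
            return meta
    raise ValueError(
        f"No valid output after {max_attempts} attempts: {last_errors}"
    )
```

In the pipeline this would wrap the extraction call, for example `generate_validated(lambda: extract_meta(source_text))`, so that a malformed result triggers a retry rather than flowing downstream.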

WHAT YOU'LL LEARN
You'll understand why prefill is silently ignored, and you'll be able to keep structured output reliable on a thinking-on model using forced tool calls instead
You'll have a streaming handler that routes content blocks by type, so a leading thinking block never corrupts what you render
You'll know how to size max_tokens with reasoning headroom, and how to detect and retry when a turn is cut off before the answer
