● MODEL: Claude Fable 5 reached general availability on June 9 with a 1M-token context, always-on adaptive thinking, and 128K output
● PLATFORM: The Developer Platform adds code execution, an MCP connector, a Files API, and prompt caching up to one hour
● MCP: Admins can provision MCP connectors org-wide via Okta, giving users zero-touch access on first login
● SANDBOX: Claude Managed Agents now run in your own sandbox and connect to private MCP servers
● CODING: Opus 4.8 scores 72.5% on SWE-bench and 43.2% on Terminal-bench, excelling at long-running work
● LINEUP: Opus 4.8, Sonnet 4.6, and Haiku 4.5 lead the lineup; pick the right one per task
When Thinking Is Always On, Prefill Quietly Stops Working — Fixing Streaming and Token Budgets for Fable 5
Fable 5 thinks by default. Prefill no longer applies, the first streamed block isn't text, and max_tokens has to leave room for reasoning. Here is how I fixed those three broken assumptions in my own automated publishing pipeline.
The morning after I swapped one stage of my publishing pipeline over to Fable 5, a step that had always returned JSON started returning an empty string. No error. stop_reason was end_turn. The body was simply empty. The cause was mundane: the assistant prefill I had relied on for years was being ignored on a model that thinks by default.
Fable 5 became generally available on June 9, and its defining trait is always-on adaptive thinking. It assumes reasoning happens first, and that quietly invalidates code written when thinking was something you opted into. Running four sites' worth of generation by myself as an indie developer, I found the silent failure far worse than a loud one — a stage that keeps running and returns nothing is harder to catch than one that throws. This article records the three broken assumptions I hit during that migration, and how I fixed each one in code.
What actually changed — three broken assumptions
When thinking is always on, both the shape of the output and its accounting change. In the order I fixed them:
| Broken assumption | Old behavior | With thinking on |
| --- | --- | --- |
| Assistant prefill | Continue from a seed to pin the output shape | Prefill can't be combined; shape isn't pinned |
| First streamed block | First content block = text | First block is thinking; text comes later |
| Meaning of max_tokens | Roughly the body limit | Combined limit for thinking + body; can run out first |
None of these surface as exceptions. They show up as output that is thin, empty, or occasionally truncated — which is exactly why they are easy to miss. The logs look fine.
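Because none of these symptoms raises an exception, a check on the response itself is what catches them. Here is a minimal sketch, assuming the finished turn has been reduced to a `stop_reason` string and a list of content blocks as plain dicts. The name `classify_response` and the 200-character `min_chars` threshold are illustrative choices of mine, not part of any SDK.

```python
def classify_response(stop_reason: str, blocks: list[dict], min_chars: int = 200) -> str:
    """Label a finished turn as "ok", "empty", "thin", or "truncated".

    `blocks` is a list of plain dicts such as {"type": "text", "text": "..."}.
    The default threshold is an arbitrary illustration; tune it per stage.
    """
    # Only text blocks count as body; thinking blocks are ignored
    body = "".join(b.get("text", "") for b in blocks if b.get("type") == "text").strip()
    if not body:
        return "empty"
    if stop_reason == "max_tokens":
        return "truncated"
    if len(body) < min_chars:
        return "thin"
    return "ok"
```

Logging this label next to each request makes a stage that "keeps running and returns nothing" visible in the same place you already look for errors.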
Broken assumption 1: prefill no longer applies
To force valid JSON, I used to seed the assistant turn with { and let the model continue. On a thinking-on model that doesn't work. The model produces a thinking block before any output, so there is no "assistant continuation" point to seed.
Force the two together and the API either rejects the request or silently drops the prefill. In my stage it was the latter, which is why I got empty strings.
The fix is to constrain the output shape with a forced tool call rather than prefill. Make a specific tool mandatory with tool_choice, and use its input schema as your output schema. Thinking still runs; only the final output is structurally guaranteed.
import anthropic

client = anthropic.Anthropic()

# Confirm the actual model ID in the official release notes
MODEL = "claude-fable-5"

# Define the structure you want as a "tool input schema"
EXTRACT_TOOL = {
    "name": "emit_article_meta",
    "description": "Return article metadata in a structured form",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
            "is_premium": {"type": "boolean"},
        },
        "required": ["title", "tags", "is_premium"],
    },
}


def extract_meta(source_text: str) -> dict:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=8000,  # leave room for thinking (see below)
        tools=[EXTRACT_TOOL],
        # require this tool = pin the shape, the way prefill used to
        tool_choice={"type": "tool", "name": "emit_article_meta"},
        messages=[{"role": "user", "content": source_text}],
    )
    # Skip thinking blocks; grab only the tool_use block
    for block in msg.content:
        if block.type == "tool_use":
            return block.input  # already a schema-conformant dict
    raise RuntimeError("No tool_use block found")
The key is tool_choice set to {"type": "tool", "name": ...}. The model is then required to call that tool, and its input follows the schema you declared. Instead of pinning the first character the way prefill does, you guarantee the output structure itself, so a thinking block in front of it doesn't break the result. If you are unwinding a prefill-based design, the layered defenses in a four-layer defense for always-valid JSON with Claude prefill are worth rereading, because the assumptions they build on change here.
Note that with thinking on you can't set a custom temperature (it uses the default). If you relied on a low temperature for determinism, move that guarantee off temperature and onto your tool schema plus a validation loop.
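The validation loop mentioned above can be sketched as follows. This is one possible shape, not a definitive implementation: it assumes the generation step is passed in as a callable (for example the `extract_meta` function shown earlier) and it checks the three schema fields by hand rather than with a schema library. The names `validate_meta` and `extract_with_validation` and the default of three attempts are hypothetical.

```python
from typing import Callable


def validate_meta(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict is acceptable."""
    problems = []
    if not isinstance(meta.get("title"), str) or not meta["title"].strip():
        problems.append("title must be a non-empty string")
    tags = meta.get("tags")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        problems.append("tags must be a list of strings")
    if not isinstance(meta.get("is_premium"), bool):
        problems.append("is_premium must be a boolean")
    return problems


def extract_with_validation(generate: Callable[[str], dict], source_text: str,
                            max_attempts: int = 3) -> dict:
    """Call `generate` until its result validates or the attempts run out."""
    problems: list[str] = []
    for _ in range(max_attempts):
        meta = generate(source_text)
        problems = validate_meta(meta)
        if not problems:
            return meta
    raise ValueError(f"Invalid metadata after {max_attempts} attempts: {problems}")
```

Calling `extract_with_validation(extract_meta, source_text)` keeps the guarantee in code you control, so it does not depend on sampling settings.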
Broken assumption 2: the first streamed block isn't text
In the stage that renders incrementally to a UI, I treated the first streamed content_block as text. With thinking on, the first block is a thinking block. Render the head as body and you either show fragments of reasoning or stall on a type mismatch.
Route by the block type instead. Keep thinking_delta out of the body and show only text_delta. A signed signature_delta also streams through; it is not for display.
def stream_answer(prompt: str):
    answer_parts = []
    current_type = None
    with client.messages.stream(
        model=MODEL,
        max_tokens=12000,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for event in stream:
            if event.type == "content_block_start":
                # only here do you learn the block type (thinking or text)
                current_type = event.content_block.type
            elif event.type == "content_block_delta":
                delta = event.delta
                if delta.type == "text_delta":
                    answer_parts.append(delta.text)
                    yield delta.text  # the only thing readers see
                elif delta.type == "thinking_delta":
                    pass  # record only; never display
                elif delta.type == "signature_delta":
                    pass  # internal use
            elif event.type == "content_block_stop":
                current_type = None
    return "".join(answer_parts)
The expected behavior is that only text_delta reaches the reader and no reasoning fragments are ever shown. I once appended thinking deltas straight into the buffer and leaked the model's internal monologue into an article preview. Routing by type is the only reliable guard.
There is a second catch for agents that also use tools. When you return a tool result and continue the conversation, you must put the thinking block from the previous assistant turn back into history, signature and all. Drop the thinking block and return only the tool_result and you break reasoning continuity, sometimes with an error. I had been trimming history down to "text only," which tripped me here too. Alongside a guide to handling Claude's stop_reason, review your history-reconstruction logic before you ship.
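Rebuilding history with the thinking block intact might look like the sketch below. It assumes content blocks are handled as plain dicts in the shape the Messages API uses (`thinking` blocks carrying a `signature`, `tool_use` blocks carrying an `id`). The helper name `build_followup_messages` and the `tool_outputs` mapping are hypothetical.

```python
def build_followup_messages(history: list[dict], assistant_blocks: list[dict],
                            tool_outputs: dict[str, str]) -> list[dict]:
    """Append the full assistant turn and the matching tool results to history.

    `assistant_blocks` is the previous assistant content as plain dicts;
    `tool_outputs` maps each tool_use id to the string result of running it.
    """
    results = [
        {"type": "tool_result", "tool_use_id": block["id"],
         "content": tool_outputs[block["id"]]}
        for block in assistant_blocks
        if block["type"] == "tool_use"
    ]
    return history + [
        # Pass every block back unmodified, thinking and signature included
        {"role": "assistant", "content": list(assistant_blocks)},
        {"role": "user", "content": results},
    ]
```

The point of the design is that nothing filters `assistant_blocks`: a "text only" trim is exactly what removes the thinking block.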
Broken assumption 3: max_tokens is a tug-of-war between thinking and body
Set max_tokens small, as if it were the body limit, and on a turn where thinking runs long the body has no room left. stop_reason comes back as max_tokens and the body is short or empty. Because adaptive thinking scales with how hard the problem is, the same prompt produces a thin body only sometimes — a low-reproducibility symptom.
Two countermeasures. First, explicitly budget headroom for thinking in max_tokens. Second, detect stop_reason == "max_tokens", raise the budget, and retry exactly once.
def answer_with_budget(prompt: str, want_answer_tokens: int = 2000) -> str:
    # Stack several times the expected body for thinking (harder = longer)
    budget = want_answer_tokens * 4
    for attempt in range(2):
        msg = client.messages.create(
            model=MODEL,
            max_tokens=budget,
            messages=[{"role": "user", "content": prompt}],
        )
        text = "".join(b.text for b in msg.content if b.type == "text")
        # cut off mid-thinking, before any body
        if msg.stop_reason == "max_tokens" and not text.strip():
            budget *= 2  # raise the budget and retry once
            continue
        return text
    raise RuntimeError("No body even after raising the budget")
Look at usage and you'll see thinking tokens count toward output billing. When I measured my own pipeline for a week, output tokens per request rose noticeably while body length stayed the same. Thinking is not free preprocessing; it is a cost to account for. For designing around always-on thinking from the cost side, a production cost analysis of Claude extended thinking goes deeper, and for the trade-off with long single-pass generation, notes on generating long content in one pass with Fable 5's 128k output are useful context.
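One way to make that measurement repeatable is to log `usage.output_tokens` next to the visible body for each request and compare the ratio before and after the switch. The sketch below assumes you already collect those pairs. The helper name `output_tokens_per_body_char` is hypothetical, and character count is only a crude proxy for body length.

```python
from statistics import mean


def output_tokens_per_body_char(samples: list[tuple[int, str]]) -> float:
    """Average billed output tokens per character of visible body text.

    Each sample is (usage.output_tokens, body_text) from one request.
    Requests with an empty body are skipped to avoid dividing by zero.
    """
    ratios = [tokens / len(body) for tokens, body in samples if body]
    if not ratios:
        raise ValueError("no samples with a non-empty body")
    return mean(ratios)
```

If the ratio climbs while the bodies stay the same length, the difference is reasoning you are paying for.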
Why I dropped prefill and leaned on tools
Dropping prefill felt like a loss. It was short, reliable, and a habit of many years. I leaned on tools anyway because on a thinking-on model another block always precedes the output, so "continue from a seed" simply no longer holds.
If you want a guaranteed shape, constrain the structure, not the next characters. That looked like the long way around, but it was more robust precisely because it doesn't depend on whether the model thinks. The biggest lesson from the migration wasn't a technique — it was raising the level of what I guarantee.
Three things worth checking before you migrate
If you have a stage moving to a thinking-on model, verifying these three things on real calls before deploy will spare you the quiet failures I hit.
First, find any stage that depends on prefill or a custom temperature; replace it with a forced tool call and a validation loop. Second, confirm your streaming code routes by the content_block type; if it assumes the head is text, fix it. Third, check that max_tokens leaves headroom for thinking and that you retry on stop_reason == "max_tokens".
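The first and third checks can be partly automated by inspecting request parameters before they are sent. This is a rough sketch, assuming each request is available as a plain dict of Messages API parameters. The function name `preflight_issues` is hypothetical, and the default `headroom_factor` of 4 simply mirrors the multiplier used in `answer_with_budget`. The streaming check still needs a real call.

```python
def preflight_issues(request: dict, expected_body_tokens: int,
                     headroom_factor: int = 4) -> list[str]:
    """Flag request settings that tend to fail quietly on a thinking-on model."""
    issues = []
    messages = request.get("messages", [])
    # A trailing assistant message is a prefill
    if messages and messages[-1].get("role") == "assistant":
        issues.append("last message is an assistant prefill")
    if "temperature" in request:
        issues.append("custom temperature is set")
    if request.get("max_tokens", 0) < expected_body_tokens * headroom_factor:
        issues.append("max_tokens leaves little headroom for thinking")
    return issues
```

An empty list does not prove a stage is safe; it only means none of these three settings is the obvious culprit.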
Swap a single stage to the thinking-on model first, confirm it produces none of the three symptoms (empty, thin, truncated), and only then widen it across your pipeline. If this spares one person the same quiet failure, I'll be glad.