API & SDK / 2026-06-23 / Advanced

When Thinking Is Always On, Prefill Quietly Stops Working — Fixing Streaming and Token Budgets for Fable 5

Fable 5 thinks by default. Prefill no longer applies, the first streamed block isn't text, and max_tokens has to leave room for reasoning. Here is how I fixed those three broken assumptions in my own automated publishing pipeline.



The morning after I swapped one stage of my publishing pipeline over to Fable 5, a step that had always returned JSON started returning an empty string. No error. stop_reason was end_turn. The body was simply empty. The cause was mundane: the assistant prefill I had relied on for years was being ignored on a model that thinks by default.

Fable 5 became generally available on June 9, and its defining trait is always-on adaptive thinking. It assumes reasoning happens first, and that quietly invalidates code written when thinking was something you opted into. Running four sites' worth of generation by myself as an indie developer, I found the silent failure far worse than a loud one — a stage that keeps running and returns nothing is harder to catch than one that throws. This article records the three broken assumptions I hit during that migration, and how I fixed each one in code.

What actually changed — three broken assumptions

When thinking is always on, both the shape of the output and its accounting change. In the order I fixed them:

Broken assumption	Old behavior	With thinking on
Assistant prefill	Continue from a seed to pin the output shape	Prefill can't be combined; shape isn't pinned
First streamed block	First content block = text	First block is thinking; text comes later
Meaning of max_tokens	Roughly the body limit	Combined limit for thinking + body; can run out first

None of these surface as exceptions. They show up as output that is thin, empty, or occasionally truncated — which is exactly why they are easy to miss. The logs look fine.
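Since none of these raise on their own, one option is to turn the quiet symptoms into loud ones at the stage boundary. What follows is a minimal sketch under stated assumptions, not a definitive implementation: `SilentFailure`, `check_response`, and the `min_chars` threshold are hypothetical names and values, and the response is assumed to expose `stop_reason` plus a list of content blocks with `type` and `text`.

```python
from types import SimpleNamespace


class SilentFailure(RuntimeError):
    """Raised when a response looks fine in the logs but is unusable."""


def check_response(msg, min_chars: int = 20) -> str:
    """Return the body text, or raise if it is empty, truncated, or thin.

    Assumes `msg` has `stop_reason` and `content` (blocks with `type`, `text`).
    `min_chars` is an arbitrary illustrative threshold; tune it per stage.
    """
    # Only text blocks count as body; thinking blocks are skipped by the filter
    text = "".join(b.text for b in msg.content if b.type == "text").strip()
    if not text:
        raise SilentFailure(f"empty body (stop_reason={msg.stop_reason})")
    if msg.stop_reason == "max_tokens":
        raise SilentFailure("body truncated by max_tokens")
    if len(text) < min_chars:
        raise SilentFailure(f"thin body: {len(text)} chars")
    return text


# Offline stand-in for a response whose only block is a thinking block
empty = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="thinking", thinking="...")],
)
```

Calling `check_response(empty)` raises `SilentFailure` instead of handing an empty string to the next stage, which is the failure mode that `end_turn` alone hides.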

Broken assumption 1: prefill no longer applies

To force valid JSON, I used to seed the assistant turn with { and let the model continue. On a thinking-on model that doesn't work. The model produces a thinking block before any output, so there is no "assistant continuation" point to seed.

Force the two together and the API either rejects the request or silently drops the prefill. In my stage it was the latter, which is why I got empty strings.

The fix is to constrain the output shape with a forced tool call rather than prefill. Make a specific tool mandatory with tool_choice, and use its input schema as your output schema. Thinking still runs; only the final output is structurally guaranteed.

```python
import anthropic

client = anthropic.Anthropic()

# Confirm the actual model ID in the official release notes
MODEL = "claude-fable-5"

# Define the structure you want as a "tool input schema"
EXTRACT_TOOL = {
    "name": "emit_article_meta",
    "description": "Return article metadata in a structured form",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "tags": {"type": "array", "items": {"type": "string"}},
            "is_premium": {"type": "boolean"},
        },
        "required": ["title", "tags", "is_premium"],
    },
}


def extract_meta(source_text: str) -> dict:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=8000,  # leave room for thinking (see below)
        tools=[EXTRACT_TOOL],
        # require this tool = pin the shape, the way prefill used to
        tool_choice={"type": "tool", "name": "emit_article_meta"},
        messages=[{"role": "user", "content": source_text}],
    )
    # Skip thinking blocks; grab only the tool_use block
    for block in msg.content:
        if block.type == "tool_use":
            return block.input  # already a schema-conformant dict
    raise RuntimeError("No tool_use block found")
```

The key is tool_choice set to {"type": "tool", "name": ...}. The model is then required to call that tool, and its input follows the schema you declared. Instead of pinning the first character the way prefill does, you guarantee the output structure itself, so a thinking block in front of it doesn't break the result. If you are unwinding a prefill-based design, the article on a four-layer defense for always-valid JSON with Claude prefill is worth rereading, because the assumptions its layers build on change here.

Note that with thinking on you can't set a custom temperature (it uses the default). If you relied on a low temperature for determinism, move that guarantee off temperature and onto your tool schema plus a validation loop.
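The "validation loop" can be as small as a checker plus a bounded retry. The sketch below is one possible shape, not the pipeline's actual code: `validate_meta` and `call_with_validation` are hypothetical helpers, the checks mirror the emit_article_meta schema from the example above, and the limit of three attempts is an arbitrary assumption.

```python
from typing import Callable


def validate_meta(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict is usable."""
    problems = []
    if not isinstance(data.get("title"), str) or not data["title"].strip():
        problems.append("title must be a non-empty string")
    tags = data.get("tags")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        problems.append("tags must be a list of strings")
    if not isinstance(data.get("is_premium"), bool):
        problems.append("is_premium must be a boolean")
    return problems


def call_with_validation(call: Callable[[], dict], max_attempts: int = 3) -> dict:
    """Call `call()` until its result validates, up to `max_attempts` times."""
    problems: list[str] = []
    for _ in range(max_attempts):
        data = call()
        problems = validate_meta(data)
        if not problems:
            return data
    # Fail loudly rather than pass a bad dict downstream
    raise ValueError(f"still invalid after {max_attempts} attempts: {problems}")
```

In a stage like the one above, the call would look like `call_with_validation(lambda: extract_meta(source_text))`. Taking a zero-argument callable keeps the loop independent of the API client, so it can be exercised offline with canned results.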

Broken assumption 2: the first streamed block isn't text

In the stage that renders incrementally to a UI, I treated the first streamed content_block as text. With thinking on, the first block is a thinking block. Render the head as body and you either show fragments of reasoning or stall on a type mismatch.

Route by the block type instead. Keep thinking_delta out of the body and show only text_delta. A signature_delta, which carries the signature for the thinking block, also streams through; it is not for display.

```python
def stream_answer(prompt: str):
    """Yield only body text; thinking and signature deltas are never displayed."""
    answer_parts = []
    current_type = None  # type of the block currently streaming

    with client.messages.stream(
        model=MODEL,
        max_tokens=12000,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for event in stream:
            if event.type == "content_block_start":
                # only here do you learn the block type (thinking or text)
                current_type = event.content_block.type
            elif event.type == "content_block_delta":
                delta = event.delta
                if delta.type == "text_delta":
                    answer_parts.append(delta.text)
                    yield delta.text          # the only thing readers see
                elif delta.type == "thinking_delta":
                    pass                       # skipped here; log elsewhere if needed, never display
                elif delta.type == "signature_delta":
                    pass                       # internal use
            elif event.type == "content_block_stop":
                current_type = None

    # In a generator this value travels on StopIteration.value; a plain for loop never sees it
    return "".join(answer_parts)
```

The expected behavior is that only text_delta reaches the reader and no reasoning fragments are ever shown. I once appended thinking deltas straight into the buffer and leaked the model's internal monologue into an article preview. Routing by type is the only reliable guard.

There is a second catch for agents that also use tools. When you return a tool result and continue the conversation, you must put the thinking block from the previous assistant turn back into history, signature and all. If you drop the thinking block and return only the tool_result, you break reasoning continuity and sometimes get an error. I had been trimming history down to "text only," which tripped me here too. Review your history-reconstruction logic before you ship, alongside the guide to handling Claude's stop_reason.
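The history rule can be sketched as a small pure function. This is an illustration under assumptions, not the pipeline's real code: `continue_after_tool`, the `lookup` tool, and the placeholder IDs and signature are hypothetical, and the block shapes follow the Messages API content-block layout as I understand it.

```python
def continue_after_tool(history: list, assistant_content: list,
                        tool_use_id: str, result: str) -> list:
    """Return a new message list that keeps the assistant turn intact.

    `assistant_content` is the full content list from the previous response,
    thinking blocks (with their signatures) included, passed back unmodified.
    """
    return history + [
        {"role": "assistant", "content": assistant_content},
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_use_id,
                    "content": result,
                }
            ],
        },
    ]


# Placeholder blocks standing in for a real response's content list
previous = [
    {"type": "thinking", "thinking": "(reasoning)", "signature": "(opaque)"},
    {"type": "tool_use", "id": "toolu_example", "name": "lookup", "input": {"q": "x"}},
]
messages = continue_after_tool(
    [{"role": "user", "content": "question"}], previous, "toolu_example", "42"
)
```

The point of the design is that nothing filters `assistant_content`: a "text only" trim would have removed the first block, which is exactly the mistake described above.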

Broken assumption 3: max_tokens is a tug-of-war between thinking and body

Set max_tokens small, as if it were the body limit, and on a turn where thinking runs long the body has no room left. stop_reason comes back as max_tokens and the body is short or empty. Because adaptive thinking scales with how hard the problem is, the same prompt produces a thin body only sometimes — a low-reproducibility symptom.

Two countermeasures. First, explicitly budget headroom for thinking in max_tokens. Second, detect stop_reason == "max_tokens", raise the budget, and retry exactly once.

```python
def answer_with_budget(prompt: str, want_answer_tokens: int = 2000) -> str:
    # Stack several times the expected body for thinking (harder = longer)
    budget = want_answer_tokens * 4

    for attempt in range(2):
        msg = client.messages.create(
            model=MODEL,
            max_tokens=budget,
            messages=[{"role": "user", "content": prompt}],
        )
        text = "".join(b.text for b in msg.content if b.type == "text")

        # cut off mid-thinking, before any body
        if msg.stop_reason == "max_tokens" and not text.strip():
            budget *= 2          # raise the budget and retry once
            continue
        return text

    raise RuntimeError("No body even after raising the budget")
```

Look at usage and you'll see thinking tokens count toward output billing. When I measured my own pipeline for a week, output tokens per request rose noticeably while body length stayed the same. Thinking is not free preprocessing; it is a cost to account for. For designing around always-on thinking from the cost side, a production cost analysis of Claude extended thinking goes deeper, and for the trade-off with long single-pass generation, the notes on generating long content in one pass with Fable 5's 128k output are useful context.
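One way to make that measurement repeatable is to log output tokens next to body length for every request. The sketch below is hypothetical (`OutputLog` is not from the pipeline) and assumes the response's `usage.output_tokens` includes thinking tokens, as described above.

```python
from dataclasses import dataclass, field


@dataclass
class OutputLog:
    """Collect (output_tokens, body_chars) pairs, one per request."""
    rows: list = field(default_factory=list)

    def record(self, output_tokens: int, body_text: str) -> None:
        self.rows.append((output_tokens, len(body_text)))

    def tokens_per_body_char(self) -> float:
        """Total output tokens divided by total body characters."""
        tokens = sum(t for t, _ in self.rows)
        chars = sum(c for _, c in self.rows)
        return tokens / chars if chars else float("inf")
```

After each call you would record something like `log.record(msg.usage.output_tokens, text)`. If the ratio climbs while body length holds steady, the extra tokens are going to thinking rather than to the answer.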

Why I dropped prefill and leaned on tools

Dropping prefill felt like a loss. It was short, reliable, and a habit of many years. I leaned on tools anyway because on a thinking-on model another block always precedes the output, so "continue from a seed" simply no longer holds.

If you want a guaranteed shape, constrain the structure, not the next characters. That looked like the long way around, but it was more robust precisely because it doesn't depend on whether the model thinks. The biggest lesson from the migration wasn't a technique — it was raising the level of what I guarantee.

Three things worth checking before you migrate

If you have a stage moving to a thinking-on model, verifying these three things on real calls before deploy will spare you the quiet failures I hit.

First, find any stage that depends on prefill or a custom temperature; replace it with a forced tool call and a validation loop. Second, confirm your streaming code routes by the content_block type — if it assumes the head is text, fix it. Third, check that max_tokens leaves headroom for thinking and that you retry on stop_reason == "max_tokens".
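The first and third checks can be partly automated by inspecting the keyword arguments each stage passes to the API. This is a rough sketch: `audit_request` is a hypothetical helper, and the 4000-token floor is an arbitrary assumption rather than a recommended value. The streaming check is not covered, because it depends on handler code, not on request arguments.

```python
def audit_request(kwargs: dict, min_max_tokens: int = 4000) -> list[str]:
    """Flag request arguments that rely on pre-thinking assumptions."""
    findings = []
    messages = kwargs.get("messages", [])
    # A trailing assistant message is how prefill is expressed
    if messages and messages[-1].get("role") == "assistant":
        findings.append("prefill: last message is an assistant turn")
    if "temperature" in kwargs:
        findings.append("temperature: custom value is set")
    if kwargs.get("max_tokens", 0) < min_max_tokens:
        findings.append("max_tokens: little or no headroom for thinking")
    return findings


# A request written for the old assumptions trips all three checks
old_style = {
    "model": "example-model",
    "max_tokens": 1024,
    "temperature": 0,
    "messages": [
        {"role": "user", "content": "Return JSON"},
        {"role": "assistant", "content": "{"},
    ],
}
findings = audit_request(old_style)
```

Running this over every stage's arguments before the switch gives a list of stages to rewrite, instead of discovering them one empty string at a time.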

Swap a single stage to the thinking-on model first, confirm it produces none of the three symptoms (empty, thin, truncated), and only then widen it across your pipeline. If this spares one person the same quiet failure, I'll be glad.
