API & SDK · 2026-06-13 · Intermediate

Retiring the 'Please Continue' Prompt — Single-Pass Long-Form Generation with Claude Fable 5's 128k Output

Every month my report generation hit the output cap and I had to ask Claude to 'continue from here.' Claude Fable 5's 128k output let me retire that workflow: a streaming implementation, a resume-after-disconnect pattern, and a measured cost comparison against chunked generation.

At the start of every month, I generate operations reports for the four apps I run as an indie developer — crash trends, store review replies, AdMob revenue movement. Stitched into one document, a month's report runs close to 40,000 Japanese characters.

At that length, the recurring headache was the output cap. Generation would stop mid-document, I would send a "please continue from here" follow-up, then read across the seam and tidy it up. A heading level would drift, a caveat paragraph would appear twice, the tone would shift slightly in the back half. That stitching work cost me about twenty minutes every month.

Claude Fable 5, released on June 9, supports up to 128,000 output tokens. It is also included at no extra cost on the major plans until June 22, so this seemed like the right week to find out whether my report generation could become a single pass. The short version: the seam-fixing work went to zero, and the measured cost came in about 26% lower than the chunked approach. The longer version involves a streaming-first implementation and a few traps worth knowing about, which is what this article covers.

What Chunked Generation Was Actually Costing Me

Let me be honest about the old setup first. I used to set max_tokens to 16,384 and generate the report in four passes. From the second request onward, I would include everything generated so far and ask Claude to continue.

Three problems kept surfacing:

Seam quality was unreliable. Asking for a continuation sometimes produced a preamble — "Understood, continuing the report" — or a partial restatement of the previous paragraph before the actual continuation. Not every time, but across four stitches per month, something almost always leaked in.
Structure drifted. In later requests, Claude only sees the document as pasted input rather than as something it is currently writing. Granularity that was ### in the first half would sometimes come back as ## in the second.
Input tokens accumulated. Pass two re-sends pass one's output; pass three re-sends passes one and two. The more chunks, the more times the same text gets billed as input.
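The input accumulation in the third point is easy to see with a little arithmetic. The sketch below is illustrative only: `chunked_input_tokens` is a hypothetical helper name, the token counts are round numbers rather than measurements, and it assumes each pass re-sends the original prompt plus all prior output while ignoring the few tokens of the follow-up message itself.

```python
def chunked_input_tokens(prompt_tokens: int, chunk_outputs: list[int]) -> list[int]:
    """Approximate the input tokens billed on each pass of a chunked run.

    Assumption: pass N re-sends the prompt plus every earlier pass's output.
    The short "please continue" message is ignored.
    """
    inputs = []
    resent = 0  # output tokens from earlier passes, re-sent as input
    for out in chunk_outputs:
        inputs.append(prompt_tokens + resent)
        resent += out
    return inputs


# Illustrative numbers: a 6,000-token prompt and four 16,000-token chunks
per_pass = chunked_input_tokens(6_000, [16_000] * 4)
print(per_pass)       # [6000, 22000, 38000, 54000]
print(sum(per_pass))  # 120000
```

With one pass the billed input is just the prompt; with four passes the same prompt ends up costing twenty times as much input in this example.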

None of this was fatal. But spending twenty minutes a month inspecting seams in an automated pipeline felt like the automation had quietly stopped paying for itself.

Fable 5's Output Spec — 128k Tokens and Always-On Adaptive Thinking

Claude Fable 5 is the generally available model in the new Mythos class positioned above Opus: a 1M-token context window and up to 128,000 output tokens. The API model string is claude-fable-5, priced at $10 per million input tokens and $50 per million output tokens. If you remember the 1M context window as a Sonnet beta that was later retired — I wrote about that in Migrating from the Claude 1M Context Window Beta: Everything You Need to Do Before April 30, 2026 — Fable 5 brings it back as a standard capability.

Two spec details matter before you write any code.

Adaptive thinking shows up in output_tokens

Fable 5 thinks adaptively on every request, scaling its reasoning to task difficulty. In my logs, usage.output_tokens consistently came in 10–20% higher than what I estimated from the saved document alone; I read that gap as the thinking overhead landing on the output side. The practical consequence: estimate costs from measured usage, not from the visible text length.
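Until measured usage exists for a job, one way to budget is to pad the visible-text estimate by that overhead band. This is a rough sketch, not a pricing tool: `output_cost_range` is a hypothetical helper, the 10–20% band is only what showed up in the logs described above and may not transfer to other workloads, and the price is the $50 per million output tokens quoted earlier.

```python
PRICE_OUT = 50.0  # USD per 1M output tokens (Fable 5 price quoted above)


def output_cost_range(
    visible_tokens: int, low: float = 0.10, high: float = 0.20
) -> tuple[float, float]:
    """Return (low, high) output cost in USD after padding for thinking overhead.

    Assumption: thinking adds between `low` and `high` on top of the tokens
    you would count from the saved document alone.
    """
    lo_cost = visible_tokens * (1 + low) / 1e6 * PRICE_OUT
    hi_cost = visible_tokens * (1 + high) / 1e6 * PRICE_OUT
    return round(lo_cost, 2), round(hi_cost, 2)


print(output_cost_range(50_000))  # (2.75, 3.0)
```

Once a real run has produced a usage record, that record should replace the padded estimate.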

Long outputs assume streaming

A 128k-class output takes several minutes to generate. If you architect around a single non-streaming request, you will collide with HTTP timeouts, and the Anthropic SDK itself nudges long-running requests toward streaming. Designing for streaming from the start is the realistic path.

The Basic Implementation: Streaming the Whole Document

What this code solves, in one line: receive a long document in a single request, writing to disk as chunks arrive.

import anthropic
 
client = anthropic.Anthropic()  # ANTHROPIC_API_KEY via environment variable
 
PROMPT = """You are the editor of an app-operations report.
Using the source material below, write the monthly operations report in Markdown.
Structure: Overview / Crashes / Review Replies / Revenue / Next Month's Tasks.
 
<source>
{source_digest}
</source>
"""
 
def generate_single_pass(source_digest: str, out_path: str):
    """Use the 128k window to receive a long report in one request."""
    with client.messages.stream(
        model="claude-fable-5",
        max_tokens=128000,  # reserve the full output window
        messages=[{
            "role": "user",
            "content": PROMPT.format(source_digest=source_digest),
        }],
    ) as stream:
        with open(out_path, "w", encoding="utf-8") as f:
            for text in stream.text_stream:
                f.write(text)  # persist as we receive, in case of disconnects
                f.flush()
        final = stream.get_final_message()
 
    print("stop_reason:", final.stop_reason)
    print("usage:", final.usage)
    return final

Running it against May's data produced:

stop_reason: end_turn
usage: Usage(input_tokens=6120, output_tokens=63807, ...)

Two deliberate choices here. First, max_tokens=128000 is not a commitment to use 128k — it reserves headroom so a 64k report never gets clipped. Second, every received chunk is written and flushed immediately. A multi-minute stream should be designed on the assumption that it can die mid-flight; persisting as you go makes the resume logic downstream much simpler.

I also recommend making a habit of checking stop_reason. end_turn means Claude decided it was finished; max_tokens means the output cap cut it off. In the chunked era, hitting max_tokens was the expected behavior, so I never looked at it. In single-pass operation it becomes the single most important quality signal.

Surviving Disconnects: Partial Saves Plus Assistant Prefill

A several-minute stream will occasionally drop. During my testing, a Wi-Fi handoff killed one run partway through. This is the design that avoids starting over.

def build_resume_messages(original_prompt: str, partial_path: str) -> list:
    """Build messages that resume generation from a partial save."""
    with open(partial_path, encoding="utf-8") as f:
        partial = f.read()
 
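    # Drop the trailing half-written paragraph; resume from a paragraph boundary.
    # rstrip() guards against trailing whitespace, which the Messages API
    # rejects at the end of a final assistant turn.
    cut = partial.rfind("\n\n")
    done = (partial[:cut] if cut > 0 else partial).rstrip()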
 
    # Passing the partial document as an assistant message makes Claude
    # continue from exactly where it left off
    return [
        {"role": "user", "content": original_prompt},
        {"role": "assistant", "content": done},
    ]

The key move: instead of asking "please continue" in a user message, you hand the generated-so-far document back as an assistant message. When a conversation ends on an assistant turn, Claude continues generating from the literal end of that text. Greetings and restatements have no place to sneak in, which eliminates the seam contamination that plagued the chunked workflow.

The second, smaller move is trimming the partial save back to a paragraph boundary (\n\n) before resuming. Continuing from a mid-sentence fragment invites grammatical accidents at the joint; continuing from a clean paragraph break almost never does, at least in my runs.
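One detail the trimming implies: the file on disk still contains the half-written paragraph, so it needs the same cut before the resumed text is appended. A minimal sketch of that step follows, under stated assumptions: `trim_to_paragraph` and `resume_into_file` are hypothetical names, and `chunks` stands in for `stream.text_stream` from a resumed request so the file handling can be shown without a network call.

```python
from typing import Iterable


def trim_to_paragraph(text: str) -> str:
    """Cut at the last paragraph break, dropping a half-written paragraph."""
    cut = text.rfind("\n\n")
    return (text[:cut] if cut > 0 else text).rstrip()


def resume_into_file(partial_path: str, chunks: Iterable[str]) -> None:
    """Rewrite the partial file up to the last paragraph break, then append."""
    with open(partial_path, encoding="utf-8") as f:
        done = trim_to_paragraph(f.read())

    # Overwrite so the dropped fragment is not left in the final document
    with open(partial_path, "w", encoding="utf-8") as f:
        f.write(done)
        for text in chunks:  # stand-in for stream.text_stream
            f.write(text)
            f.flush()  # keep persisting as we receive, as in the first pass
```

Whether the resumed text opens with its own blank line is worth checking in your own runs; this sketch writes exactly what it receives and adds no separator.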

Measuring the Cost: Chunked vs. Single-Pass

For the cost comparison I ran both approaches on the same model, with the same source material and the same instructions: Fable 5 with max_tokens=128000 in one pass, versus Fable 5 with max_tokens=16384 in four passes. I wanted to isolate the method, not compare models.

PRICE_IN = 10.0   # USD per 1M input tokens (Fable 5)
PRICE_OUT = 50.0  # USD per 1M output tokens
 
def cost_usd(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT
 
# Measured usage from May's report
runs = {
    "single pass (128k window, 1 request)": [(6_120, 63_807)],
    "chunked (16k window, 4 requests)": [
        (6_118, 16_214),
        (22_337, 16_402),  # later passes re-send prior output as input
        (38_512, 16_087),
        (54_805, 14_869),
    ],
}
 
for label, batches in runs.items():
    cin = sum(i for i, _ in batches)
    cout = sum(o for _, o in batches)
    print(f"{label}: input {cin:,} / output {cout:,} / ${cost_usd(cin, cout):.2f}")

The output:

single pass (128k window, 1 request): input 6,120 / output 63,807 / $3.25
chunked (16k window, 4 requests): input 121,772 / output 63,572 / $4.40

Output tokens are nearly identical, but the chunked run's input balloons to about 122k tokens, because every re-send of prior output is billed again as input. The difference is roughly $1.15 per report ($4.40 versus $3.25), which makes the single pass about 26% cheaper.

I want to be straightforward about the scale here: $1.15 a month is not a business case. Prompt caching would shrink the re-send cost further, so I would not argue for migration on cost alone. What decided it for me was the twenty minutes of seam inspection going away and the structural drift disappearing. The cost measurement mattered as a confirmation that the quality win did not come with a price regression.

What Changed in Quality — and What Did Not

I regenerated three months of reports (March through May) and inspected the results.

What changed: Heading-level drift vanished. The document is one generation, so there is no second half to reinterpret the structure. The "Understood, continuing" preambles also went to zero.
What did not: The prose itself. Paragraph by paragraph, I could not point to a density or accuracy difference between chunked and single-pass output. That matches my broader impression of the model from the claude.ai side, which I wrote up in Three Days of Running Claude Fable 5 Side by Side with Opus 4.8 — Settling My Model Split Before June 15 — on isolated tasks, the differences are subtle.

I do not rely on eyeballing alone; a minimal verifier runs on every generated report.

import re
 
def verify_report(path: str, stop_reason: str) -> list:
    """Minimum viable inspection for a single-pass report."""
    problems = []
    if stop_reason != "end_turn":
        problems.append(f"stop_reason was {stop_reason} (output may be truncated)")
 
    with open(path, encoding="utf-8") as f:
        text = f.read()
 
    # Detect heading-level jumps (## followed directly by ####)
    levels = [len(m) for m in re.findall(r"^(#{2,4})\s", text, re.MULTILINE)]
    for a, b in zip(levels, levels[1:]):
        if b - a >= 2:
            problems.append("heading level jump detected")
            break
 
    # Duplicate H2s — the classic chunked-generation accident
    h2 = re.findall(r"^##\s+(.+)$", text, re.MULTILINE)
    if len(h2) != len(set(h2)):
        problems.append("duplicate H2 headings found")
    return problems

When I wrote this checker in the chunked era, the duplicate-H2 rule fired about once a month. Across three single-pass runs it has fired zero times. The checker stays in the pipeline anyway — in generation workflows, I prefer keeping guards in place even after the problem they catch seems solved.
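The verifier above does not look for the continuation preambles described earlier. If you still run a chunked job somewhere, a check along these lines could sit next to it. This is a sketch under assumptions: `find_leaked_preambles` and the pattern list are hypothetical, the English phrases are placeholders, and a Japanese report would need its own phrases.

```python
import re

# Hypothetical phrases; tune these to whatever your own runs leak
PREAMBLE_PATTERNS = [
    r"^Understood, continuing",
    r"^Continuing (the|from)",
    r"^Here is the (rest|continuation)",
]


def find_leaked_preambles(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like continuation preambles."""
    compiled = [re.compile(p, re.IGNORECASE) for p in PREAMBLE_PATTERNS]
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if any(rx.match(stripped) for rx in compiled):
            hits.append((lineno, stripped))
    return hits


sample = "## Crashes\nUnderstood, continuing the report.\nCrash rate fell.\n"
print(find_leaked_preambles(sample))
# [(2, 'Understood, continuing the report.')]
```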

Where You Can Get Burned — stop_reason, Timeouts, and Credit Math

The missteps and near-misses from my own testing, so you can skip them.

Forgetting to widen max_tokens

Right after switching to single-pass, I ran one generation with max_tokens still at 32,000 and got stop_reason: max_tokens. Old habits said "just ask it to continue" — but in a single-pass design the correct move is to stop, widen the window, and rerun. My pipeline now refuses to pass any report downstream unless stop_reason is end_turn.
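One way to express that refusal is a small gate that raises instead of returning a report. The names here (`IncompleteGenerationError`, `require_complete`) are hypothetical and only illustrate the idea; the one thing taken from the article is that end_turn is the only stop_reason allowed through.

```python
class IncompleteGenerationError(RuntimeError):
    """Raised when a generation did not finish on its own."""


def require_complete(stop_reason: str) -> None:
    """Block the pipeline unless the model ended the turn itself."""
    if stop_reason != "end_turn":
        raise IncompleteGenerationError(
            f"stop_reason was {stop_reason!r}; raise max_tokens and rerun "
            "instead of asking for a continuation"
        )


require_complete("end_turn")  # passes silently
# require_complete("max_tokens") would raise IncompleteGenerationError
```

Raising, as opposed to logging a warning, means an unattended run cannot quietly ship a clipped report.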

Waiting on a non-streaming request

On day one I tried a plain messages.create call for a 64k output and met the SDK's long-request warning head-on. The longer the output, the longer a connection has to stay open; if you plan to use the 128k window at all, treat streaming as the only mode worth building.

Misjudging usage-credit burn after June 23

Fable 5 is bundled with the major plans until June 22; from June 23 it draws from usage credits. At $50 per million output tokens, running it with Sonnet-era instincts will drain a balance faster than you expect. My allocation: only long-form, low-frequency work — like this monthly report — goes to Fable 5, while daily lightweight steps stay on the previous models. The reasoning behind that split is in Reallocating My Automation Pipeline Ahead of the June 15 Billing Change.

The high-risk fallback

Fable 5 ships with a safety design that falls back to Opus 4.8 in high-risk domains such as cybersecurity (reportedly under 5% of sessions). Across three months of regenerated operations reports I never saw it trigger. Still, if you wire Fable 5 into unattended runs, logging the responding model alongside each output is cheap insurance — it lets you explain later why one month's report reads differently.
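A sketch of that logging, with loud caveats: `log_run` is a hypothetical helper, the JSONL layout is an arbitrary choice, and it assumes the `model` field on the response reflects the model that actually answered. Whether a fallback is visible there is not something this article confirms. With the streaming code above, the values would come from `final.model`, `final.stop_reason`, and `final.usage`.

```python
import json
from datetime import datetime, timezone


def log_run(
    log_path: str,
    requested_model: str,
    responding_model: str,
    stop_reason: str,
    input_tokens: int,
    output_tokens: int,
) -> dict:
    """Append one JSON line describing a generation run and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requested_model": requested_model,
        "responding_model": responding_model,  # assumed to come from the response
        "stop_reason": stop_reason,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Both model strings are stored as-is without comparing them, since an alias and a dated model name could differ without any fallback having happened.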

What to Try Before the Free Window Closes on June 22

The first concrete step I would suggest: pick the one chunked-generation job that causes you the most seam pain, switch it to a streamed call with max_tokens=128000, and capture the measured usage. During the bundled period, the experiment itself costs nothing extra on the major plans. Once you have real input/output token counts, projecting your post-June-23 credit burn is a few lines of arithmetic — the cost_usd helper above is all I use.
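As a sketch of that arithmetic, the block below repeats the `cost_usd` helper so it runs on its own and multiplies measured usage by runs per month. `monthly_burn` is a hypothetical name, the "weekly digest" workload is invented purely for illustration, and the prices are the Fable 5 figures quoted in this article.

```python
PRICE_IN = 10.0   # USD per 1M input tokens (Fable 5, as quoted in this article)
PRICE_OUT = 50.0  # USD per 1M output tokens


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT


def monthly_burn(workloads: dict[str, tuple[int, int, int]]) -> dict[str, float]:
    """Map each workload to its monthly cost in USD.

    Each value is (input_tokens, output_tokens, runs_per_month).
    """
    return {
        name: round(cost_usd(i, o) * runs, 2)
        for name, (i, o, runs) in workloads.items()
    }


workloads = {
    "monthly report": (6_120, 63_807, 1),  # measured usage from this article
    "weekly digest": (2_000, 10_000, 4),   # hypothetical second job
}
burn = monthly_burn(workloads)
print(burn)                          # {'monthly report': 3.25, 'weekly digest': 2.08}
print(round(sum(burn.values()), 2))  # 5.33
```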

I have not moved every workload to Fable 5, and I do not plan to. But for the specific job of writing a long document in one breath, I no longer see a reason to go back to stitching. If your months also include twenty minutes of seam repair, I hope this saves you that time.
