API & SDK / 2026-05-30 / Advanced

Continuing past max_tokens in the Claude API without duplicated text or broken code fences

Detect stop_reason: max_tokens, continue the generation with an assistant prefill, and stitch the parts back together without duplicated seams or broken code fences. A production-tested continuation pattern in TypeScript.



While batch-generating release notes in several languages for my six wallpaper apps, I noticed the output ended mid-sentence — yet the script happily moved on to the next step as if nothing was wrong. The cause was almost embarrassingly simple: I was concatenating content[0].text without ever checking stop_reason. I have been shipping apps solo since 2014, and tuning store pages for AdMob revenue is something I have done hundreds of times, but a generation that "cuts off in the middle" behaves differently from an ordinary bug. No error, no exception — it just breaks quietly. This article is about how to detect that silent failure and stitch the continuation back together safely, including the traps I actually hit in production.

What is happening the moment a long generation gets cut off

Every Claude API response carries a stop_reason field. When generation finishes naturally it is end_turn; when it pauses for a tool call it is tool_use; and when the model hits the max_tokens ceiling it is max_tokens. Ask for something long and the model will often stop mid-thought, because it ran into that ceiling while it still had more to write.

The trap is that the content you get back still contains "everything written so far." The response comes back as a clean 200 with real text in it, so unless you inspect stop_reason there is no way to tell a finished answer from a truncated one. That is exactly what bit me in release-note generation: English and Japanese fit fine, but the third language hit max_tokens partway through a heading, and the half-written string flowed straight into my published store copy.
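A minimal sketch of that check, assuming you only need to tell a finished answer from a truncated one. The names Completion and classifyStop are hypothetical and not part of the SDK; the function only looks at the stop_reason string, so it works on any response shape.

```typescript
// What the caller should do with a response, based only on stop_reason.
type Completion = "complete" | "truncated" | "needs_tool" | "other";

// Hypothetical helper: maps the three stop_reason values discussed above.
// Any value not covered here falls through to "other" instead of being
// silently treated as complete.
function classifyStop(stopReason: string | null): Completion {
  switch (stopReason) {
    case "end_turn":
      return "complete";
    case "max_tokens":
      return "truncated";
    case "tool_use":
      return "needs_tool";
    default:
      return "other";
  }
}
```

Anything other than "complete" should keep the text out of the publish step until it has been handled.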

You might think raising max_tokens fixes it. That is only half true. Each model has an output ceiling, and some documents need more than even the highest setting allows. Worse, when I maxed out max_tokens across the board, short answers got longer and slower too (the measurements are in a later section). The practical answer is a continuation design: generate with a sensible max_tokens, and when the response stops on max_tokens, generate the rest and stitch the parts together.

Three ways it breaks when you ignore stop_reason

A sloppy continuation makes the output worse, not better. The three failure modes I actually ran into were these.

The first is leaving truncation in place. If you treat a single response as complete without checking stop_reason, text that ended mid-sentence or mid-code flows downstream. A mechanical pipeline never notices an unnaturally clipped ending.

The second is duplicated seams. When you ask the model to "continue from before," it tends to be polite and re-summarizes the previous paragraph or rewrites the same heading. Concatenate that naively and the same sentence appears twice.

The third is broken structure. If max_tokens lands inside a code block, the opening code fence never closes. When the continuation starts with a fresh fence, the Markdown parser miscounts the nesting and swallows the entire body into a code block. I only discovered this after seeing a config example in my release notes vanish into one giant grey box.

These are not independent problems — a single continuation design prevents all three in a chain. Let me build it up step by step.

What you'll learn

✦ A minimal continuation loop (about 40 lines of TypeScript) that detects stop_reason: max_tokens and keeps going

✦ An overlap-detection trimming function that removes duplicated seams using a 200-character window

✦ A fence-balance check that stops an unbalanced code fence from collapsing your whole document into a code block

✦ Guard rails (a round cap and an estimated-USD budget gate) to stop runaway loops and cost

A minimal continuation loop

First, here is the broken naive version, then the fix. The code below never inspects stop_reason, so it returns the incomplete text from the moment max_tokens hit.

// ❌ Naive version that never notices truncation
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
 
async function generateNaive(system: string, prompt: string) {
  const resp = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 4096,
    system,
    messages: [{ role: "user", content: prompt }],
  });
  // stop_reason is not checked → a max_tokens cutoff is treated as complete
  return resp.content[0].type === "text" ? resp.content[0].text : "";
}

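The fix turns this into a loop: while stop_reason is max_tokens, push the previous response back as an assistant turn and let the model keep writing. When the final message in the request has the assistant role, Claude continues that same turn instead of starting a new one. This technique is called assistant prefill, and because the continuation is treated as "more of the same text" rather than "a new topic," the seams line up far better. Two caveats: the API rejects a prefill that ends in whitespace, so trim the tail before sending it back, and prefill support can differ between models, so confirm it for the model ID you call.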

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
 
interface LongResult {
  text: string;
  rounds: number;
  inputTokens: number;
  outputTokens: number;
}
 
async function generateLong(
  system: string,
  prompt: string,
  opts: { maxRounds?: number; maxTokens?: number } = {},
): Promise<LongResult> {
  const maxRounds = opts.maxRounds ?? 6;
  const maxTokens = opts.maxTokens ?? 4096;
 
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: prompt },
  ];
  const parts: string[] = [];
  let inputTokens = 0;
  let outputTokens = 0;
  let rounds = 0;
 
  while (rounds < maxRounds) {
    rounds++;
    const resp = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: maxTokens,
      system,
      messages,
    });
    inputTokens += resp.usage.input_tokens;
    outputTokens += resp.usage.output_tokens;
 
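    const text = resp.content
      .filter((b): b is Anthropic.TextBlock => b.type === "text")
      .map((b) => b.text)
      .join("");

    if (resp.stop_reason !== "max_tokens") {
      parts.push(text); // final round: keep the text exactly as returned
      break;
    }

    // The API rejects a prefill that ends in whitespace, so trim the tail.
    // The model normally re-emits the whitespace it needs when it resumes.
    parts.push(text.trimEnd());
    const soFar = parts.join("");
    if (!soFar) break; // nothing written yet, so there is nothing to prefill

    // Send everything written so far back as ONE assistant turn and let it continue
    messages.length = 1; // keep only the original user message
    messages.push({ role: "assistant", content: soFar });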
  }
 
  return { text: stitch(parts), rounds, inputTokens, outputTokens };
}

The crucial detail is that we only push { role: "assistant", content: text } and call create() again. We add no new user instruction, so Claude resumes the previous turn directly. Showing continuation through the role structure is more stable than asking for it in words. In my case, the explicit-request approach mixed in a re-summary of the previous paragraph about 20% of the time; after switching to prefill, that contamination nearly disappeared.
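One gap worth closing: when the round cap is reached, the loop returns the same shape as a run that finished, so the caller cannot tell the two apart. The sketch below shows one way to carry a truncated flag. It replays recorded rounds instead of calling the API so it can run in a unit test; Round, Replay, and replayRounds are hypothetical names, and the live loop would set the flag from resp.stop_reason in the same way.

```typescript
// One recorded round: the text it produced and the stop_reason it ended on.
interface Round {
  text: string;
  stopReason: string | null;
}

interface Replay {
  text: string;
  rounds: number;
  truncated: boolean; // true when the last consumed round still hit max_tokens
}

// Consumes recorded rounds the way the continuation loop consumes live ones:
// stop at the first round that did not end on max_tokens, or at the cap.
function replayRounds(recorded: Round[], maxRounds: number): Replay {
  let text = "";
  let rounds = 0;
  let truncated = false;
  for (const round of recorded.slice(0, maxRounds)) {
    rounds++;
    text += round.text;
    truncated = round.stopReason === "max_tokens";
    if (!truncated) break;
  }
  return { text, rounds, truncated };
}
```

With a flag like this, a result that ran out of rounds can be rejected before it reaches a store page, instead of being inferred from rounds === maxRounds.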

Removing duplicated seams

Even with prefill, text can repeat at the boundary between rounds when the continuation restates the last phrase or sentence of the previous part. So at join time, find the largest overlap between the "previous tail" and the "next head" and trim it. The stitch function below looks for the longest exact match, from 200 characters down to 16, between the end of the accumulated text and the start of the next part, and drops the repeated span before concatenating.

function stitch(parts: string[]): string {
  return parts.reduce((acc, next) => {
    if (!acc) return next;
    const maxOverlap = Math.min(acc.length, next.length, 200);
    // Try the longest overlap first, join at the first match
    for (let n = maxOverlap; n >= 16; n--) {
      if (acc.slice(-n) === next.slice(0, n)) {
        return acc + next.slice(n);
      }
    }
    return acc + next;
  }, ""); // initial value so an empty parts array returns "" instead of throwing
}

The lower bound of 16 characters avoids mistrimming on a too-short match. If you treat common fragments like a period or "the " as an overlap, you would delete parts of genuinely different sentences. In production, somewhere between 16 and 24 characters proved stable. When no overlap is found it falls back to plain concatenation, so the worst case is "fails to remove a duplicate" — it never over-trims your body text. The default is to not delete.
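To make the lower bound concrete, here is a small worked example. It is a sketch that isolates the overlap search into a hypothetical overlapLength helper (not a function from the code above) with the same 16 and 200 character bounds, and the sample strings are invented for illustration.

```typescript
// Length of the longest exact overlap between the end of prev and the start
// of next, or 0 when no match reaches minOverlap characters.
function overlapLength(prev: string, next: string, minOverlap = 16, window = 200): number {
  const max = Math.min(prev.length, next.length, window);
  for (let n = max; n >= minOverlap; n--) {
    if (prev.slice(-n) === next.slice(0, n)) return n;
  }
  return 0;
}

// The continuation restates the last 27 characters of the previous part.
const prevPart = "The update adds twelve new wallpapers and a dark theme";
const nextPart = "wallpapers and a dark theme for the settings screen.";
const overlap = overlapLength(prevPart, nextPart);
const joined = prevPart + nextPart.slice(overlap);

// A 4-character match ("The ") is below the lower bound, so nothing is trimmed.
const shortMatch = overlapLength("Fixed a crash. The ", "The app now starts faster.");
```

In the first case the repeated phrase is removed once; in the second the parts are concatenated untouched, which is the "default is to not delete" behavior described above.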

Handling broken code fences and Markdown structure

Technical articles and release notes contain code blocks. When max_tokens lands inside one, the opening code fence is left unclosed. If the next part starts with a fresh fence, the Markdown renderer mismatches the open/close pairing.

So first, detect whether each part ends with an unclosed fence.

function isFenceOpen(s: string): boolean {
  // If the count of triple-backticks is odd, the last fence is unclosed
  const fence = "`".repeat(3);          // three backticks
  const count = s.split(fence).length - 1;
  return count % 2 === 1;
}

When a part ends with an open fence, you must not let the continuation start with another fresh code fence. Prefill rarely causes this split, but occasionally the model writes a fence again to "tidy up." As a safeguard, verify after joining whether the text ends with an odd number of fences and, if it is still open, either append a closing fence or regenerate just that span.

function repairFences(text: string): string {
  if (isFenceOpen(text)) {
    // Ends while still open → close it to prevent display breakage
    return text.trimEnd() + "\n" + "`".repeat(3) + "\n";
  }
  return text;
}

From experience, in production I prefer regenerating the part that was cut inside a fence — with max_tokens temporarily doubled — over auto-appending a closing fence. Auto-closing prevents the display breakage, but it does not change the fact that the code itself ends mid-line. A truncated config example on a store page is useless. Display consistency and content completeness are separate problems, and treating them separately turns out to cause the fewest incidents.
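Counting every triple backtick can misfire on inline backticks or on a four-backtick fence that wraps a three-backtick one. As a hedged alternative, the sketch below tracks fences line by line, following the CommonMark rule that a closing fence must be at least as long as the opening one and carry no info string. endsInsideFence is a hypothetical name, and tilde fences are deliberately ignored.

```typescript
const TICKS = "`".repeat(3); // built this way so the sample can sit inside a fence

// Returns true when the text ends inside a backtick-fenced code block.
// Only lines that begin (after at most three spaces) with three or more
// backticks are treated as fence lines.
function endsInsideFence(text: string): boolean {
  let open: number | null = null; // backtick count of the currently open fence
  for (const line of text.split(/\r?\n/)) {
    const m = /^ {0,3}(`{3,})(.*)$/.exec(line);
    if (!m) continue;
    const ticks = m[1].length;
    const rest = m[2];
    if (open === null) {
      // An opening fence may carry an info string, but it cannot contain backticks
      if (!rest.includes("`")) open = ticks;
    } else if (ticks >= open && rest.trim() === "") {
      // A closing fence is at least as long as the opener and has no info string
      open = null;
    }
  }
  return open !== null;
}

// A part that was cut off in the middle of a JSON example
const cutMidBlock = "Intro\n" + TICKS + "json\n{ \"a\": 1";
const stillOpen = endsInsideFence(cutMidBlock);
```

A check like this can decide whether to regenerate the span or append a closing fence, using the same policy as above.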

Guard rails against runaway loops and cost

The scariest thing about a continuation loop is a max_tokens chain that never ends. With prompts that demand extremely long output, or topics where the model tends to be verbose, the rounds run away. At minimum, a round cap (maxRounds above) is mandatory. On top of that, being able to stop on estimated cost is reassuring.

const PRICE = { inPerMTok: 3, outPerMTok: 15 }; // USD, rough Sonnet figures
 
function estimateUSD(inTok: number, outTok: number): number {
  return (inTok / 1e6) * PRICE.inPerMTok + (outTok / 1e6) * PRICE.outPerMTok;
}
 
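async function generateLongGuarded(
  system: string,
  prompt: string,
  budgetUSD: number,
): Promise<LongResult> {
  // The round cap bounds the worst case. For a per-round stop, move this same
  // check into generateLong's loop, right after the usage counters are updated.
  const result = await generateLong(system, prompt, { maxRounds: 6 });
  const spent = estimateUSD(result.inputTokens, result.outputTokens);
  if (spent > budgetUSD) {
    throw new Error(`budget exceeded: $${spent.toFixed(3)} > $${budgetUSD}`);
  }
  return result;
}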

I set max_tokens per round to 4096, cap rounds at 6, and gate a single long generation at roughly $0.50. That caps a single generation at 24,576 output tokens (6 × 4096), which is plenty for one release note or explainer. Keep in mind that each continuation round re-sends everything written so far as input, so input tokens grow with every round. If continuation does not finish within six rounds, I take that as a signal the prompt is "too big to write in one generation" and switch to a section-by-section design. Continuation is not a cure-all; it only helps when a single coherent document needs to exceed what one response can hold.
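To sanity-check a budget against those settings, the worst case can be worked out ahead of time. The sketch below assumes every round hits the ceiling and every later round re-sends all earlier output as input. It reuses the rough prices from the PRICE constant above, ignores prompt caching, and treats promptTokens (system prompt plus user prompt) as a number you supply. worstCase is a hypothetical helper.

```typescript
// Rough per-million-token prices in USD, copied from the PRICE constant above.
const ROUGH_PRICE = { inPerMTok: 3, outPerMTok: 15 };

interface WorstCase {
  inputTokens: number;
  outputTokens: number;
  usd: number;
}

// Worst case for a continuation loop: each round emits maxTokens, and round k
// (counting from 0) sends the prompt plus k earlier rounds of output as input.
function worstCase(promptTokens: number, maxTokens: number, maxRounds: number): WorstCase {
  let inputTokens = 0;
  let outputTokens = 0;
  for (let round = 0; round < maxRounds; round++) {
    inputTokens += promptTokens + round * maxTokens;
    outputTokens += maxTokens;
  }
  const usd =
    (inputTokens / 1e6) * ROUGH_PRICE.inPerMTok +
    (outputTokens / 1e6) * ROUGH_PRICE.outPerMTok;
  return { inputTokens, outputTokens, usd };
}

// Example: a 1,000-token prompt, 4096 tokens per round, 6 rounds
const bound = worstCase(1000, 4096, 6);
// bound.inputTokens is 67,440 and bound.outputTokens is 24,576, about $0.57 in total
```

Under these assumptions the full six rounds land slightly above a $0.50 gate, so a per-round budget check would stop the loop before the round cap does.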

Prefill versus an explicit "keep going" turn

There are two broad ways to continue. One is the prefill approach above: just push the prior response as an assistant turn. The other adds a user turn that explicitly asks "continue from before without repeating yourself."

The difference is a trade-off between seam smoothness and control. Prefill writes as the same document, so duplicated headings and re-summarized paragraphs almost never happen — but you cannot nudge the direction of the continuation. The explicit approach lets you steer ("from here, just the conclusion, briefly"), at the cost of a higher chance the model politely retraces the prior text.

// Prefill: push only the assistant turn
messages.push({ role: "assistant", content: text });
 
// Explicit: follow the assistant turn with a user request to continue
messages.push({ role: "assistant", content: text });
messages.push({
  role: "user",
  content: "Continue the previous text without repeating what you already wrote.",
});

My call: for a single coherent piece like a release note or explainer, I recommend prefill. For something like folding a long processing log into a running summary, the explicit approach — where you can inject an instruction each round — is easier to work with. Choose by use case.

Combining streaming with continuation

A chat UI that displays text incrementally needs streaming and continuation at once. Stream each round, but keep the display unbroken across rounds. Since overlap trimming only matters at round boundaries, stream as-is during a round and only reconcile the head of the next round against the tail of the previous one.

async function* streamLong(system: string, prompt: string, maxRounds = 6) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: prompt }];
  let tail = "";
  for (let r = 0; r < maxRounds; r++) {
    const stream = client.messages.stream({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      system,
      messages,
    });
    let roundText = "";
    for await (const ev of stream) {
      if (ev.type === "content_block_delta" && ev.delta.type === "text_delta") {
        roundText += ev.delta.text;
        yield ev.delta.text; // pass straight to the UI
      }
    }
    const final = await stream.finalMessage();
    if (final.stop_reason !== "max_tokens") break;
    tail = roundText.slice(-200); // keep the tail to reconcile next round
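    // The API rejects a prefill that ends in whitespace, so trim before sending it back.
    // That whitespace was already displayed, so the model may repeat it at the seam.
    messages.push({ role: "assistant", content: roundText.trimEnd() });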
  }
}

In streaming you cannot easily un-display characters you already showed, so it pairs well with the low-duplication prefill approach. A few boundary characters can still overlap, so reconcile only the first text_delta of the next round against tail and suppress that much output if it overlaps. Stopping before you print is gentler on UX than reaching back to delete.
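A sketch of that reconciliation, under one change from the description above: because the first text_delta can be only a few characters long, this version buffers the start of the round until it has at least as many characters as the saved tail, then decides once. The cost is a short delay before the first characters of each continuation round appear. createSeamFilter is a hypothetical helper and the 16-character lower bound mirrors stitch.

```typescript
// Wraps the deltas of one continuation round and withholds any prefix that
// repeats the end of the previous round (tail).
function createSeamFilter(tail: string, minOverlap = 16) {
  let buffer = "";
  let decided = tail.length < minOverlap; // too little tail to reconcile against

  // Decide once: strip the longest overlap found, or release the buffer as is.
  function resolve(): string {
    decided = true;
    const max = Math.min(tail.length, buffer.length);
    for (let n = max; n >= minOverlap; n--) {
      if (tail.slice(-n) === buffer.slice(0, n)) return buffer.slice(n);
    }
    return buffer;
  }

  return {
    // Feed one text delta; returns the text that is safe to display now.
    push(delta: string): string {
      if (decided) return delta;
      buffer += delta;
      // Hold output until the buffer could contain the longest possible overlap
      if (buffer.length < tail.length) return "";
      return resolve();
    },
    // Call when the round ends so a short round is not swallowed by the buffer.
    flush(): string {
      return decided ? "" : resolve();
    },
  };
}
```

Inside streamLong, each delta of a continuation round would pass through push before being yielded, and flush would be yielded once after the round's stream ends.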

Testing continuation by forcing a cutoff

The awkward part of continuation logic is that max_tokens cutoffs only happen "sometimes." You cannot wait for a lucky truncation on every test run. So in tests, deliberately set max_tokens very small (say 64) to force multiple rounds and verify behavior.

// Test: a tiny max_tokens forces the continuation loop to fire
const res = await generateLong(system, "Explain the numbers 1 to 30 as a numbered list", {
  maxTokens: 64,
  maxRounds: 10,
});
// Things to check
//  1) rounds >= 2 (continuation fired)
//  2) no number appears twice (stitch removed duplicates)
//  3) fences do not end on an odd count (isFenceOpen is false)
console.assert(res.rounds >= 2, "continuation did not fire");

Shrinking the ceiling to force the path is a standard trick for routinely exercising a code path that fires only occasionally in production. I run this in my release-note tests and tuned the stitch lower bound until mistrimming hit zero. Continuation is the kind of logic that breaks quietly, so owning a test that can reproduce the breakage on demand maps directly to peace of mind in production.
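Check 2 in the list above ("no number appears twice") can be automated. The sketch below is a hypothetical test helper, duplicateListNumbers, which assumes list items start a line with a number followed by "." or ")" and a space. It returns the numbers that occur more than once.

```typescript
// Returns list numbers that start more than one line, in ascending order.
// A non-empty result suggests a seam repeated an item.
function duplicateListNumbers(text: string): number[] {
  const seen = new Set<number>();
  const dupes = new Set<number>();
  for (const line of text.split(/\r?\n/)) {
    const m = /^\s*(\d+)[.)]\s/.exec(line);
    if (!m) continue;
    const n = Number(m[1]);
    if (seen.has(n)) dupes.add(n);
    seen.add(n);
  }
  return Array.from(dupes).sort((a, b) => a - b);
}
```

In the forced-cutoff test, an assertion that duplicateListNumbers(res.text) is empty would fail whenever stitch lets a repeated item through.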

Why I do not just max out max_tokens

"Continuation is a hassle, so just set max_tokens to the ceiling" is a thought everyone has once. I did too. But once I measured it in production, I found it pays a hidden cost.

First, max_tokens is "the maximum output length," not "the billed length." Billing is on the tokens actually generated, so 4096 versus 16384 costs the same if only 500 tokens come out. Most people know this. The blind spot is latency. In my release-note generation, raising max_tokens from 4096 to 16384 and changing nothing else coincided with outputs padded by unneeded asides: average output swelled by about 1.4x and per-note generation time visibly grew. It looked as if a larger ceiling nudged the model toward "I can keep going," though I have only my own measurements to back that up, so measure your own workload before relying on it.

// If most generations are short, keep max_tokens modest and grow only via continuation
const SHORT = { maxTokens: 2048, maxRounds: 1 }; // FAQ answers, etc.
const LONG  = { maxTokens: 4096, maxRounds: 6 }; // explainer docs, etc.

My policy is to keep the default max_tokens modest and grow only the routes that truly need length, through the continuation loop. That avoids sacrificing latency on short generations while still handling long ones without a ceiling. Raising max_tokens is tuning "how much to write per round," not a way to increase "how much can be written" — that is the accurate framing.

Notes on continuing JSON or structured output

Everything so far assumed plain prose, but continuing JSON or structured output has a different difficulty. Prose that is cut off is merely "mid-sentence," but JSON with one missing bracket is unparseable.

Continuation actually pairs best with a design that treats structure as line-delimited records (JSON Lines) rather than one big JSON object. With one record per line, a max_tokens cutoff just means "drop the last incomplete line" and the rest are still valid records. Then tell the continuation round to "continue from the last complete line" and you recover the missing record.

// Call this only on a round that stopped on max_tokens
function dropPartialLastLine(s: string): string {
  const i = s.lastIndexOf("\n");
  // No newline at all means the only line is itself incomplete, so keep nothing
  return i >= 0 ? s.slice(0, i) : "";
}
// Safely accumulate each round's output as JSON Lines
const safe = dropPartialLastLine(roundText);
const records = safe.split("\n").filter(Boolean).map((l) => JSON.parse(l));

If you absolutely must continue a single large JSON object, keep the string up to the cut point, continue the next round via assistant prefill, and after each round check bracket balance and attempt JSON.parse, repeating until the whole string parses. That said, in my experience "assembling one giant JSON via continuation" is fragile, and routing to either JSON Lines or section splitting is more stable in production. The more your output needs structural completeness, the safer it is to split up front rather than stitch a continuation later.
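A variation on the JSON Lines approach, sketched under the assumption that the next round continues the cut-off line (as prefill does) instead of restarting it: parse every complete line now and carry the unparsed tail into the next round. JsonlChunk and takeCompleteRecords are hypothetical names.

```typescript
interface JsonlChunk {
  records: unknown[]; // every line that was complete and parsed
  remainder: string;  // text after the last newline, possibly a partial record
}

// Parses the complete lines of a JSON Lines string and returns the tail
// so the caller can prepend it to the next round's text.
function takeCompleteRecords(text: string): JsonlChunk {
  const cut = text.lastIndexOf("\n");
  const complete = cut >= 0 ? text.slice(0, cut) : "";
  const remainder = cut >= 0 ? text.slice(cut + 1) : text;
  const records = complete
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as unknown);
  return { records, remainder };
}

// Round 1 was cut in the middle of the third record; round 2 finishes it.
const round1 = '{"id":1}\n{"id":2}\n{"id":';
const round2 = '3}\n{"id":4}\n';
const first = takeCompleteRecords(round1);
const second = takeCompleteRecords(first.remainder + round2);
const allRecords = [...first.records, ...second.records];
```

After the final round, a non-empty remainder is either one last record without a trailing newline or a genuinely broken line, so try JSON.parse on it once and log the failure instead of discarding it silently.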

Wiring it into release-note generation

Finally, here is how I actually wire it in. When generating release notes for six apps across multiple languages, I split into one independent generateLong call per language and always log rounds and outputTokens from each return value. If rounds is consistently 1, the content is short enough that continuation is unnecessary, so I lower max_tokens to win latency; if rounds keeps pinning to the cap, I raise per-round max_tokens or split the template.

const results = await Promise.all(
  locales.map((loc) =>
    generateLong(systemFor(loc), releaseNotePrompt(loc), { maxRounds: 6 })
  )
);
 
for (const [i, r] of results.entries()) {
  const md = repairFences(r.text);
  console.log(`[${locales[i]}] rounds=${r.rounds} out=${r.outputTokens}tok est=$${estimateUSD(r.inputTokens, r.outputTokens).toFixed(3)}`);
  await writeReleaseNote(locales[i], md);
}

That rounds log turns continuation from "magic that runs silently" into "behavior you can observe." When you run apps that have crossed 50 million downloads as a one-person operation, the worst enemy is something breaking quietly and going unnoticed. Do not swallow stop_reason, observe the seams, and stop on a cost ceiling. Hold those three and long-form generation becomes safe to automate.
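The tuning rule described above (lower max_tokens when rounds is always 1, raise it or split the template when rounds keeps hitting the cap) can be written down as a small function over the logged values. This is a sketch: suggestTuning is a hypothetical name, and the 50% threshold for "keeps pinning to the cap" is my assumption, not a measured value.

```typescript
type Tuning = "lower_max_tokens" | "raise_or_split" | "keep";

// Suggests a tuning direction from the rounds values logged per generation.
function suggestTuning(roundsLog: number[], maxRounds: number, pinnedShare = 0.5): Tuning {
  if (roundsLog.length === 0) return "keep";
  // Continuation never fired: the ceiling is larger than the content needs
  if (roundsLog.every((r) => r === 1)) return "lower_max_tokens";
  // Assumed threshold: at least half of the runs used every allowed round
  const pinned = roundsLog.filter((r) => r >= maxRounds).length;
  if (pinned / roundsLog.length >= pinnedShare) return "raise_or_split";
  return "keep";
}
```

Feeding it the last few dozen rounds values per locale gives a quick signal without reading the raw log.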

For a next step, add a single if (resp.stop_reason === "max_tokens") branch to your existing long-generation script and just log how often truncation actually happens. Once you know the real cutoff frequency, the right values for max_tokens and the round cap come into focus. I hope this helps anyone wrestling with the same quiet truncation.
