API & SDK · 2026-06-27 · Advanced

Designing the Give-Up Condition in Self-Repair Loops: Four Error Classes, Four Retry Budgets

LLM self-repair loops break on the fantasy that 'if you keep fixing, it eventually passes.' Classify errors into four classes, give each its own retry budget. Working TypeScript and real cost numbers included.

Late at night, the cost of an unattended generation pipeline spiked to three times the previous day's.

Nothing had crashed. The opposite, in fact: every job was logged as "finally succeeded." Tracing it, one job had rebuilt the same output 27 times. The loop had dutifully thrown "try again" at an output that could never pass validation.

LLM self-repair loops have a quiet trap. Write the loop on the assumption that "fixing makes it pass," and it will try to fix errors that are unfixable, forever. What matters in production is not how cleverly you repair, but when you decide to stop.

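This piece covers a design that classifies errors into four classes and assigns a retry budget per class. The code is TypeScript and is meant to be copied and run; it assumes a callClaude helper that wraps your API client and a log object for warnings.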

Why naive retry breaks in production

The loop most of us write first looks like this.

// Anti-pattern: keep fixing until it passes
async function generateUntilValid(prompt: string) {
  while (true) {
    const out = await callClaude(prompt);
    if (validate(out).ok) return out;
    prompt = `${prompt}\n\nThe previous output failed validation. Please fix it.`;
  }
}

This is fine when a human is watching interactively. They give up after a few rounds and fix it by hand.

It breaks under unattended operation. A bug in the validator, a constraint that can't be satisfied, a transient model hiccup — none of these are solved by "please fix it." Yet the loop keeps spinning and keeps burning tokens.

The root issue is treating all retries uniformly. A transient rate limit or overload (429 or 529) and a structurally unsatisfiable constraint demand completely different actions. The former heals if you wait; the latter never heals no matter how many times you resend.

Classify errors into four classes

What actually drives the retry decision in production is this four-way split. The trick is to divide by "how should I respond," not by where the error originates.

Class	Typical examples	Correct response	Suggested retry budget
transient	429 / 529 / timeout / network drop	Wait with exponential backoff and resend. Do not change the prompt	5–7 (including backoff)
repairable	Broken JSON / schema mismatch / missing required field	Rebuild once or twice with the error attached	Up to 2
semantic-invalid	Factual error / constraint violation / quality-gate failure	Do not repeat the same request. Change the approach itself	1 (with a different strategy)
hard-fail	401 / 400 (bad input) / model not found / unsatisfiable constraint	Abort immediately. Hand to a human or fallback	0

The value of this split is that the give-up condition falls out naturally per class. hard-fail is zero; semantic-invalid is one, with a different strategy. Plain retry of the same request only ever makes sense for transient.

The distinction between repairable and semantic-invalid is the most important one in the design. repairable means "the shape is broken," so showing the error fixes it. semantic-invalid means "the content does not meet the requirement," so repeating the same ask just goes in circles. Those 27 attempts at the top were a textbook case of mistaking semantic-invalid for repairable.
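
The table above can also be held as data, so the response policy lives in one place. This is a minimal sketch: the names POLICY and mayRetry are illustrative assumptions, and the numbers are simply the suggested budgets from the table.

```typescript
// Sketch only: encodes the four-class table as data. Names are illustrative.
type ErrorClass = "transient" | "repairable" | "semantic-invalid" | "hard-fail";

interface Policy {
  maxRetries: number;    // upper bound on retries for this class
  changePrompt: boolean; // whether a retry may alter the prompt
  action: "backoff-resend" | "rebuild-with-error" | "switch-strategy" | "abort";
}

const POLICY: Record<ErrorClass, Policy> = {
  "transient":        { maxRetries: 6, changePrompt: false, action: "backoff-resend" },
  "repairable":       { maxRetries: 2, changePrompt: true,  action: "rebuild-with-error" },
  "semantic-invalid": { maxRetries: 1, changePrompt: true,  action: "switch-strategy" },
  "hard-fail":        { maxRetries: 0, changePrompt: false, action: "abort" },
};

// True when another attempt is allowed after `used` retries of this class.
function mayRetry(cls: ErrorClass, used: number): boolean {
  return used < POLICY[cls].maxRetries;
}
```

With this shape, hard-fail never retries because its budget is zero, which is the give-up condition falling out of the classification rather than being coded as a special case.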

Implement the error classifier

First, build a classifier that maps exceptions and responses to a class. Centralizing this in one place means a change in response policy lands in one spot.

type ErrorClass = "transient" | "repairable" | "semantic-invalid" | "hard-fail";
 
interface Classified {
  cls: ErrorClass;
  reason: string;
}
 
// Map API exceptions / validation results to the four classes
function classify(err: unknown, validation?: { ok: boolean; kind?: string }): Classified {
  // 1) Decide validation failures first
  if (validation && !validation.ok) {
    if (validation.kind === "schema" || validation.kind === "json")
      return { cls: "repairable", reason: `malformed: ${validation.kind}` };
    // Factual error, quality-gate failure: a content problem, not a shape problem
    return { cls: "semantic-invalid", reason: "content does not meet requirement" };
  }
 
  // 2) Classify by HTTP status
  const status = (err as { status?: number })?.status;
  if (status === 429 || status === 529) return { cls: "transient", reason: `overload ${status}` };
  if (status === 408 || status === 500 || status === 503)
    return { cls: "transient", reason: `temporary ${status}` };
  if (status === 401 || status === 403) return { cls: "hard-fail", reason: `auth ${status}` };
  if (status === 400) return { cls: "hard-fail", reason: "bad input (400)" };
  if (status === 404) return { cls: "hard-fail", reason: "model/resource missing (404)" };
 
  // 3) Network-level exceptions
  const code = (err as { code?: string })?.code;
  if (code === "ETIMEDOUT" || code === "ECONNRESET" || code === "ENOTFOUND")
    return { cls: "transient", reason: `network ${code}` };
 
  // 4) Anything we can't tell: fail safe (abort)
  return { cls: "hard-fail", reason: "unclassified, aborting on the safe side" };
}

That last line — "unclassified is hard-fail" — quietly matters. If you treat the unknown as transient, an unknown permanent error makes you resend endlessly. The safe side is the side that stops.
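
The classifier branches on validation.kind, but it only works if the validator reports a kind. A minimal sketch of such a validator follows. The JSON output with title and body fields, the minimum-length quality gate, and the name validateArticle are all hypothetical and stand in for whatever your pipeline checks.

```typescript
// Hypothetical validator: the field names (title, body) and the length rule
// are assumptions for illustration, not part of the original pipeline.
type ValidationKind = "json" | "schema" | "quality";

interface Validation { ok: boolean; kind?: ValidationKind; detail?: string; }

function validateArticle(out: string, minBodyChars = 200): Validation {
  // 1) Shape: is it JSON at all?
  let parsed: unknown;
  try {
    parsed = JSON.parse(out);
  } catch (e) {
    return { ok: false, kind: "json", detail: (e as Error).message };
  }

  // 2) Shape: is it an object with the required string fields?
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, kind: "schema", detail: "top-level value is not an object" };
  }
  const rec = parsed as Record<string, unknown>;
  for (const field of ["title", "body"]) {
    if (typeof rec[field] !== "string") {
      return { ok: false, kind: "schema", detail: `missing or non-string field: ${field}` };
    }
  }

  // 3) Content: well-formed, but below the quality gate
  const body = rec.body as string;
  if (body.length < minBodyChars) {
    return { ok: false, kind: "quality", detail: `body has ${body.length} chars, need ${minBodyChars}` };
  }
  return { ok: true };
}
```

Fed into classify(), the "json" and "schema" kinds land in repairable, while "quality" falls through to semantic-invalid, so shape problems and content problems take different paths.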

A loop that assigns a budget per class

Next, a loop that changes behavior by class. The key is to hold an independent remaining budget per class. Not "how many total," but "how many transient, how many repairable," counted separately.

interface Budget {
  transient: number;       // e.g. 6
  repairable: number;      // e.g. 2
  semanticInvalid: number; // e.g. 1 (with a different strategy)
  hardCostCeilingUsd: number; // e.g. 0.50 (this task's cost ceiling)
}
 
interface Attempt { n: number; cls?: ErrorClass; reason: string; costUsd: number; }
 
async function repairLoop(
  task: { build: (hint?: string) => string; altBuild?: () => string },
  validateFn: (out: string) => { ok: boolean; kind?: string; detail?: string },
  budget: Budget
): Promise<{ ok: boolean; out?: string; attempts: Attempt[] }> {
  const left = { ...budget };
  const attempts: Attempt[] = [];
  let spent = 0;
  let prompt = task.build();
  let usedAlt = false;
 
  for (let n = 1; ; n++) {
    // The cost ceiling is the final stopper shared across all classes
    if (spent >= budget.hardCostCeilingUsd) {
      attempts.push({ n, reason: `cost ceiling $${budget.hardCostCeilingUsd} reached`, costUsd: 0 });
      return { ok: false, attempts };
    }
 
    let out: string, cost: number;
    try {
      const res = await callClaude(prompt);     // assumed to return { text, costUsd }
      out = res.text; cost = res.costUsd; spent += cost;
    } catch (e) {
      const c = classify(e);
      attempts.push({ n, cls: c.cls, reason: c.reason, costUsd: 0 });
      if (c.cls === "transient" && left.transient-- > 0) {
        await sleep(backoff(n)); continue;       // wait, resend the same prompt
      }
      return { ok: false, attempts };            // anything but transient: abort now
    }
 
    const v = validateFn(out);
    if (v.ok) { attempts.push({ n, reason: "success", costUsd: cost }); return { ok: true, out, attempts }; }
 
    const c = classify(undefined, v);
    attempts.push({ n, cls: c.cls, reason: `${c.reason}: ${v.detail ?? ""}`, costUsd: cost });
 
    if (c.cls === "repairable" && left.repairable-- > 0) {
      prompt = task.build(`Last time it failed on ${v.kind} (${v.detail ?? "no detail"}). Fix only that.`);
      continue;
    }
    if (c.cls === "semantic-invalid" && left.semanticInvalid-- > 0 && task.altBuild && !usedAlt) {
      usedAlt = true; prompt = task.altBuild();  // change strategy instead of rephrasing
      continue;
    }
    return { ok: false, attempts };              // budget spent, no alt strategy -> give up
  }
}
 
const backoff = (n: number) => Math.min(1000 * 2 ** (n - 1), 30_000) + Math.random() * 250;
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

This loop has no infinite path. Each class budget only decreases, and the cost ceiling is a separate global stopper. Whatever branch you take, you reach ok: false or ok: true in a finite number of steps.
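
The finiteness claim can be turned into a number. As a rough sketch (maxAttempts is a hypothetical helper, not part of the loop itself), every continue in the loop spends one unit of some class budget, so the worst case is the initial attempt plus the sum of the budgets:

```typescript
// Sketch: worst-case number of attempts implied by the count budgets.
// Mirrors the loop: every `continue` spends one unit of some class budget.
interface CountBudget { transient: number; repairable: number; semanticInvalid: number; }

function maxAttempts(b: CountBudget, hasAltStrategy: boolean): number {
  // The semantic-invalid branch fires at most once (usedAlt flag) and only
  // when an alternate strategy exists.
  const semantic = hasAltStrategy ? Math.min(b.semanticInvalid, 1) : 0;
  return 1 + b.transient + b.repairable + semantic; // 1 = the initial attempt
}

const worstCase = maxAttempts({ transient: 6, repairable: 2, semanticInvalid: 1 }, true);
console.log(worstCase); // 10
```

With the example budgets, a job can make at most 10 attempts before it returns, compared with the 27 rebuilds in the opening incident.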

Notice the altBuild() call on the semantic-invalid branch. Rather than rephrasing the same prompt, it switches to a different way of building the task (split the output, add more references, draft with a smaller model then polish). "Giving up" only becomes practical when it is paired with "play a different move."
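
To make the build / altBuild split concrete, here is one possible task definition. It is only a sketch: makeSummaryTask, the prompts, and the "list the main points first" alternate strategy are illustrative assumptions, not the prompts used in the pipeline described here.

```typescript
// Hypothetical task definition: the prompts and the alternate strategy are
// illustrative assumptions.
interface Task {
  build: (hint?: string) => string;
  altBuild?: () => string;
}

function makeSummaryTask(source: string): Task {
  const base = `Summarize the text below as JSON with "title" and "body".\n\n${source}`;
  return {
    // Primary strategy: one-shot request, optionally with a repair hint appended.
    build: (hint) => (hint ? `${base}\n\n${hint}` : base),
    // Alternate strategy: a differently structured request, not a rephrasing.
    altBuild: () =>
      `First list the three main points of the text below, then write the JSON ` +
      `summary ("title", "body") from those points only.\n\n${source}`,
  };
}
```

The repair hint only ever reaches build(); altBuild() deliberately ignores it, because on the semantic-invalid path the previous error message is not what needs fixing.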

Why hold a cost ceiling separately from class budgets

Class budgets alone make the count finite. We still hold a separate cost ceiling because the cost per attempt is not constant.

For jobs with long inputs or extended thinking, the cost of a single attempt can vary by more than an order of magnitude. Even bounded by count, a run of unusually expensive attempts can blow past your assumptions. The count budget bounds how many steps a runaway job can take; the cost ceiling bounds total dollars. They play different roles, so hold both.
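
A small worked example shows how far per-attempt cost can swing. The prices below are placeholders, not quoted rates, and attemptCostUsd and wouldExceed are hypothetical helpers; substitute the current pricing for whichever model you call.

```typescript
// Sketch with placeholder prices: substitute the current rates for your model.
interface Pricing { inputUsdPerMTok: number; outputUsdPerMTok: number; }

const PLACEHOLDER_PRICING: Pricing = { inputUsdPerMTok: 3, outputUsdPerMTok: 15 };

function attemptCostUsd(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputUsdPerMTok + outputTokens * p.outputUsdPerMTok) / 1_000_000;
}

// Pre-flight check: would an attempt of the estimated size cross the ceiling?
function wouldExceed(spentUsd: number, estimateUsd: number, ceilingUsd: number): boolean {
  return spentUsd + estimateUsd > ceilingUsd;
}

const shortRun = attemptCostUsd(2_000, 500, PLACEHOLDER_PRICING);    // small prompt
const longRun = attemptCostUsd(150_000, 4_000, PLACEHOLDER_PRICING); // long-context prompt
console.log(shortRun.toFixed(4), longRun.toFixed(4), (longRun / shortRun).toFixed(1));
```

With these placeholder numbers the long attempt costs roughly 38 times the short one. Note also that repairLoop checks the ceiling only before each call, so the final attempt can overshoot it; a pre-flight estimate like wouldExceed is one way to tighten that, if you can estimate token counts in advance.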

In practice, tying the cost ceiling to the value of the task keeps the ceiling meaningful. In the article generation I run as an indie developer, I set each article's ceiling to a fraction of the value that article produces; if the ceiling is exceeded, the job quietly gives up and rolls over to the next run. Folding gracefully and moving on produced a more stable daily output than grinding until the budget was exhausted.

Always keep structured attempt logs

The scariest state in unattended operation is the one at the top: "logged as success, actually broken." Whether you can detect it comes down to the quality of your attempt logs.

For each attempt, keep at minimum the following.

Field	Use
Attempt count and per-class breakdown	Detect "how many repairable showed up in one job"
Reason per class	Post-hoc analysis of which validation failed
Cumulative cost	Whether the ceiling was hit; early detection of cost anomalies
Final result and give-up reason	Distinguish "budget spent" / "cost ceiling" / "hard-fail"
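
The fields above could be collected into one record per job. The sketch below reuses the Attempt shape from the loop; summarize, JobRecord, and the give-up reason labels are hypothetical names, and the cost-ceiling detection relies on the "cost ceiling ..." reason string that repairLoop writes.

```typescript
// Sketch: one JSON-serializable record per job. Field names are illustrative.
type ErrorClass = "transient" | "repairable" | "semantic-invalid" | "hard-fail";
interface Attempt { n: number; cls?: ErrorClass; reason: string; costUsd: number; }

type GiveUpReason = "cost-ceiling" | "hard-fail" | "budget-spent";

interface JobRecord {
  ok: boolean;
  attemptCount: number;
  byClass: Record<ErrorClass, number>;
  totalCostUsd: number;
  giveUpReason?: GiveUpReason; // only set when ok is false
}

function summarize(ok: boolean, attempts: Attempt[]): JobRecord {
  const byClass: Record<ErrorClass, number> = {
    "transient": 0, "repairable": 0, "semantic-invalid": 0, "hard-fail": 0,
  };
  for (const a of attempts) if (a.cls) byClass[a.cls] += 1;

  const record: JobRecord = {
    ok,
    attemptCount: attempts.length,
    byClass,
    totalCostUsd: attempts.reduce((sum, a) => sum + a.costUsd, 0),
  };
  if (!ok) {
    const last = attempts[attempts.length - 1];
    // Relies on the "cost ceiling ..." reason string written by the loop.
    if (last?.reason.startsWith("cost ceiling")) record.giveUpReason = "cost-ceiling";
    else if (last?.cls === "hard-fail") record.giveUpReason = "hard-fail";
    else record.giveUpReason = "budget-spent";
  }
  return record;
}
```

Emitting this record as a single JSON line per job is usually enough to aggregate on afterwards.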

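What earns its keep here is an aggregate that warns on "success but too many attempts." If repairable shows up three or more times in one job, there's a structural problem in the validator or the prompt. Keep the threshold in step with the budget: with a repairable budget of 2, a job that ends in success can log at most two repairable attempts, so a threshold of three only fires on failed jobs unless the budget is raised or the threshold lowered. Even on success, repeated repairs are a sign it "happened to pass."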

// Warn even on success when the class breakdown looks off (catch silent decay)
function inspect(attempts: Attempt[]) {
  const count = (c: ErrorClass) => attempts.filter((a) => a.cls === c).length;
  const total = attempts.reduce((s, a) => s + a.costUsd, 0);
  if (count("repairable") >= 3)
    log.warn("too many repairable: suspect structural issue in prompt/validator", { attempts });
  if (count("semantic-invalid") >= 1)
    log.warn("semantic-invalid occurred: reconsider topic or approach", { attempts });
  return { total, repaired: count("repairable") };
}

If you watch only the success/failure binary, anomalies at this layer stay invisible forever. Keeping the per-class breakdown is exactly what lets you catch "succeeding but not healthy."

Provide fallback tiers

Do nothing after giving up, and the unattended pipeline quietly springs a leak. Decide the "next move" per class in advance.

For transient with budget spent, rolling over to the next scheduled run is the natural choice. It heals with time, so don't force the issue right there.

For semantic-invalid where the alternate strategy is also exhausted, switch to a lower-difficulty substitute deliverable. In my case, when generating a new article fails, a tier kicks in that automatically switches to enriching an existing article. Reassigning the run to other valuable work beats ending with zero output.

For hard-fail, escalate to a human as a rule. 401 and 400 are config or input problems, so the right answer is notification, not retry. Mix this into automatic retry and the root cause stays hidden while the incident drags on.
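
The three tiers above can be sketched as a small routing function. The name nextMove and the tier labels are illustrative. Treating a cost-ceiling stop as a reschedule follows the rollover behavior described earlier, while escalating an exhausted repairable budget is an assumption, since no tier is prescribed for that case.

```typescript
// Sketch: route a failed job to its next move. Tier names are illustrative.
type ErrorClass = "transient" | "repairable" | "semantic-invalid" | "hard-fail";
type NextMove = "reschedule" | "substitute-task" | "escalate-to-human";

function nextMove(lastClass: ErrorClass | undefined, hitCostCeiling = false): NextMove {
  // Cost-ceiling stops roll over to the next scheduled run.
  if (hitCostCeiling) return "reschedule";
  switch (lastClass) {
    case "transient":
      return "reschedule";        // heals with time: do not force it now
    case "semantic-invalid":
      return "substitute-task";   // alternate strategy exhausted: easier deliverable
    case "hard-fail":
      return "escalate-to-human"; // config or input problem: notify, do not retry
    default:
      // Assumption: an exhausted repairable budget (or an unknown state) points
      // at the prompt or validator, so a human should look at it.
      return "escalate-to-human";
  }
}
```

Calling this once on the final attempt of a failed job keeps the "what happens after giving up" decision in one place, next to the classifier.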

The first move to make

If you have an existing naive retry loop, the first step is introducing the classifier. Placing classify() in one spot and routing 429/529, validation failure, and 4xx down separate paths stops most infinite loops by itself.

Then audit whether you're mistaking semantic-invalid for repairable. If there's a path that keeps throwing "please fix it," it is highly likely misreading semantic-invalid as repairable. Stop repeating the same request; fall over to a different strategy or to giving up. That is the surest single move to prevent the three-in-the-morning cost-times-three incident.

I hope this gives some footing to anyone wrestling with the same unattended operation.
