API & SDK / 2026-06-15 / Advanced

When a Model Disappears Without Warning: A State Machine for Retirement, Withdrawal, and Overload

A model can become unusable in hours for reasons that have nothing to do with a technical outage. This guide models three distinct flavors of 'unavailable'—retirement, withdrawal, and transient overload—as one availability state machine, with a router that keeps automated pipelines running. Working TypeScript and Python included.

One morning, before kicking off the automated publishing across my four sites, I opened the changelog and the official announcements as usual. One of the models I had been using for smoke tests the day before had been suspended within hours for reasons that were not a technical outage—withdrawn for all foreign-national users. There was no retirement email. The API had not started returning 429. The model name that answered yesterday simply stopped being accepted today.

What I run is a content-generation pipeline. No lives or payments depend on it. Even so, the fact that a headless job running overnight assumed a specific model name suddenly felt heavy. I did have a fallback, but that branch only anticipated 429 and 529—transient overload. Permanent retirement and a sudden, externally driven withdrawal were lumped together inside the same fallback().

What I learned that day is that "unavailable" wears several different faces, and pouring them all into one exception handler leads you to make the wrong recovery decision. Transient overload returns in minutes. Hammering a withdrawn model every few minutes only piles up wasted failures. A retired model never comes back no matter how long you wait. This article designs an availability state machine that holds these three as distinct states, and a router built on top of it that changes behavior per state.

"Unavailable" Has Three Faces

Let me classify, by nature, the kinds of "unavailable" that stop automation. I settled on three.

The first is retirement. The vendor announces it in advance, and past a certain date the model is permanently no longer accepted. Today the older claude-sonnet-4 and claude-opus-4 were retired from the API. This kind is predictable, and the successor is known. Waiting won't help, so the right move is to switch to the successor the moment you detect it.

The second is withdrawal: not a technical outage, but a short-notice suspension driven by external factors such as policy, legal, or security decisions. There is no announced date, and recovery carries the uncertainty of "maybe it will come back eventually." The case I opened with belongs here. Unlike retirement, no successor is waiting, so you need to decide whether to shift sideways to another logical role or temporarily scale that job down.

The third is overload. Transient unavailability that recovers on its own within minutes to hours, like 429 Too Many Requests or 529 Overloaded. If you perform a permanent switch here, you stay needlessly downgraded from the higher model you actually wanted. The correct response is exponential backoff, then return to the original once it recovers.

Handle all three in a single catch, and you get misjudgments: applying overload-style backoff endlessly to a withdrawn model, or permanently downgrading an overloaded model as if it were retired. That is why the states are held separately.
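
The three faces can be written down as data before any state machine exists. What follows is a minimal sketch: the names `UnavailableKind`, `RecoveryPolicy`, `RECOVERY`, `shouldRetrySameModel`, and `backoffMs` are hypothetical, and the backoff base and cap are illustrative assumptions rather than measured values.

```typescript
// Hypothetical names for illustration; the values restate the prose above.
export type UnavailableKind = "retired" | "withdrawn" | "overloaded";

interface RecoveryPolicy {
  selfHeals: boolean; // does it recover on its own within minutes to hours?
  action: string;     // what the pipeline should do on detection
}

export const RECOVERY: Record<UnavailableKind, RecoveryPolicy> = {
  retired: { selfHeals: false, action: "switch to the successor" },
  withdrawn: { selfHeals: false, action: "shift sideways or scale the job down" },
  overloaded: { selfHeals: true, action: "back off, then return to the original" },
};

// Retrying the same model only makes sense for a kind that heals by itself.
export function shouldRetrySameModel(kind: UnavailableKind): boolean {
  return RECOVERY[kind].selfHeals;
}

// Exponential backoff without jitter: base * 2^attempt, capped.
// The 1 s base and 60 s cap are assumed values.
export function backoffMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Keeping the policy in a table makes the misjudgments described above visible at a glance: any code path that retries a kind whose `selfHeals` is false is a bug.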

Decouple Logical Roles From Physical Model IDs

Before designing the state machine, I set up the registry that everything rests on. As long as business code knows model strings directly, every withdrawal or retirement means rewriting dozens of call sites. I learned this the hard way early on as an indie developer.

So the only thing the code references is a logical role (fast / balanced / deep), and the mapping from role to physical model ID, along with each model's availability state, is consolidated in one place.

// model-registry.ts
export type Role = "fast" | "balanced" | "deep";
export type Availability = "available" | "overloaded" | "withdrawn" | "retired";
 
interface ModelEntry {
  id: string;            // physical model ID
  inputPer1M: number;    // input token price (USD / 1M tokens)
  outputPer1M: number;   // output token price
  availability: Availability;
  understudy?: string;   // in-role fallback ID for withdrawal/retirement
}
 
// Each role holds a priority-ordered candidate list; the head is first choice.
export const REGISTRY: Record<Role, ModelEntry[]> = {
  deep: [
    { id: "claude-opus-4-8", inputPer1M: 5.0, outputPer1M: 25.0, availability: "available" },
    { id: "claude-sonnet-4-6", inputPer1M: 3.0, outputPer1M: 15.0, availability: "available" },
  ],
  balanced: [
    { id: "claude-sonnet-4-6", inputPer1M: 3.0, outputPer1M: 15.0, availability: "available" },
    { id: "claude-haiku-4-5", inputPer1M: 1.0, outputPer1M: 5.0, availability: "available" },
  ],
  fast: [
    { id: "claude-haiku-4-5", inputPer1M: 1.0, outputPer1M: 5.0, availability: "available" },
  ],
};

Business code requests a model by logical role, like route("deep"), and never knows the physical ID. To remove a withdrawn model from every job, you rewrite the availability of one entry in REGISTRY. That single-file containment is the core of the design.
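
One detail worth noting: the same physical ID can sit under more than one role (claude-sonnet-4-6 appears under both deep and balanced), so "one entry" in practice means "one ID". A small helper can flip every occurrence at once. This is a sketch under assumptions: `setAvailability` is a hypothetical name, the demo registry is trimmed to the two fields the helper needs, and it presumes model selection actually consults the `availability` field.

```typescript
type Availability = "available" | "overloaded" | "withdrawn" | "retired";

interface Entry {
  id: string;
  availability: Availability;
}

// Flip every entry that carries the given physical ID, in every role,
// and return how many entries were changed.
export function setAvailability(
  registry: Record<string, Entry[]>,
  id: string,
  next: Availability,
): number {
  let changed = 0;
  for (const role of Object.keys(registry)) {
    for (const entry of registry[role]) {
      if (entry.id === id && entry.availability !== next) {
        entry.availability = next;
        changed += 1;
      }
    }
  }
  return changed;
}

// Trimmed-down registry with the same role layout as REGISTRY.
export const demo: Record<string, Entry[]> = {
  deep: [
    { id: "claude-opus-4-8", availability: "available" },
    { id: "claude-sonnet-4-6", availability: "available" },
  ],
  balanced: [
    { id: "claude-sonnet-4-6", availability: "available" },
    { id: "claude-haiku-4-5", availability: "available" },
  ],
};

// Marks the model as withdrawn under both roles in one call.
export const changedCount = setAvailability(demo, "claude-sonnet-4-6", "withdrawn");
```

Returning the number of changed entries gives the operator a quick sanity check: zero means the ID was mistyped or already in that state.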

Define the State Machine Transitions

Next, make explicit how each model's availability transitions. Leave it vague and you end up deciding "when do we revert?" by gut feel during operations, losing reproducibility.

I defined the transitions as follows.

available → overloaded: on three consecutive 429 or 529 responses. Record a timestamp; the model becomes a candidate to auto-return to available after a cooldown.
overloaded → available: after the cooldown (default 10 minutes) the model is offered again, and the first successful response confirms the return.
available → retired: past the announced retirement date, or on a permanent not-found error. Never auto-reverted.
available → withdrawn: on behavior indicating withdrawal (a model that succeeded until moments ago starts getting rejected en masse with validation or permission errors). With no successor defined, scale down to the next in-role candidate. Reverted only by an explicit operator action.

The key is that retired and withdrawn are not auto-recovered. Only overload naturally heals over time; the other two won't return unless the outside world changes. If a machine reverts on its own, it keeps slamming a withdrawn model and mass-produces failures.

// availability-machine.ts
const OVERLOAD_COOLDOWN_MS = 10 * 60 * 1000; // 10 minutes
 
interface Health {
  state: Availability;
  overloadedAt?: number;
  consecutive429: number;
}
 
const health = new Map<string, Health>(); // key: model ID
 
export function noteResult(
  id: string,
  outcome: "ok" | "overloaded" | "retired" | "withdrawn",
) {
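  const h: Health = health.get(id) ?? { state: "available", consecutive429: 0 };
  // retired and withdrawn are sticky: no automatic result may clear them,
  // only an explicit operator action.
  if (h.state === "retired" || h.state === "withdrawn") return;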
  if (outcome === "ok") {
    h.state = "available";
    h.consecutive429 = 0;
    h.overloadedAt = undefined;
  } else if (outcome === "overloaded") {
    h.consecutive429 += 1;
    if (h.consecutive429 >= 3) {
      h.state = "overloaded";
      h.overloadedAt = Date.now();
    }
  } else if (outcome === "retired") {
    h.state = "retired"; // no auto-recovery
  } else if (outcome === "withdrawn") {
    h.state = "withdrawn"; // operator action only
  }
  health.set(id, h);
}
 
export function effectiveState(id: string): Availability {
  const h = health.get(id);
  if (!h) return "available";
  // Only overload heals naturally over time.
  if (h.state === "overloaded" && h.overloadedAt &&
      Date.now() - h.overloadedAt > OVERLOAD_COOLDOWN_MS) {
    return "available"; // cooldown over; confirmed by the next success
  }
  return h.state;
}
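
The transition rules listed earlier can also be held as an explicit table, so that "who is allowed to revert what" is checked rather than remembered. A minimal sketch, assuming hypothetical names (`Actor`, `ALLOWED`, `canTransition`); only the four transitions from the list plus the operator-only revert of withdrawn are included.

```typescript
export type Availability = "available" | "overloaded" | "withdrawn" | "retired";
export type Actor = "machine" | "operator";

// Who may perform each transition. Anything not listed is rejected.
const ALLOWED: Record<string, Actor[]> = {
  "available->overloaded": ["machine", "operator"],
  "overloaded->available": ["machine", "operator"],
  "available->retired": ["machine", "operator"],
  "available->withdrawn": ["machine", "operator"],
  "withdrawn->available": ["operator"], // explicit operator action only
};

export function canTransition(from: Availability, to: Availability, actor: Actor): boolean {
  const actors = ALLOWED[`${from}->${to}`] ?? [];
  return actors.some((a) => a === actor);
}
```

With a table like this, an automatic revert of a withdrawn model is simply rejected, which is the property the prose above insists on.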

Implement the Router

With the registry and health in place, the router becomes a thin layer that takes a logical role and returns the first usable candidate. The distinctions among withdrawal, retirement, and overload are absorbed into the availability state, so model selection itself needs no per-state branching; the only classification left in the router is mapping an error to a state.

// router.ts
import Anthropic from "@anthropic-ai/sdk";
import { REGISTRY, Role } from "./model-registry";
import { noteResult, effectiveState } from "./availability-machine";
 
const client = new Anthropic(); // ANTHROPIC_API_KEY from env
 
// Returns the first candidate that is usable per both the registry (operator-set)
// and the runtime health state, skipping models already tried in this call.
function pickModel(role: Role, tried: ReadonlySet<string>): string | undefined {
  return REGISTRY[role].find(
    (m) =>
      !tried.has(m.id) &&
      m.availability === "available" &&
      effectiveState(m.id) === "available",
  )?.id;
}
 
function classifyError(err: unknown): "overloaded" | "retired" | "withdrawn" | "other" {
  // The SDK error carries the HTTP status and the parsed JSON body; the body
  // shape { error: { type } } is assumed from the API's documented error format.
  const e = err as { status?: number; error?: { error?: { type?: string } } };
  const type = e.error?.error?.type;
  if (e.status === 429 || e.status === 529) return "overloaded";
  // A permanent not-found is treated as retirement.
  if (e.status === 404 || type === "not_found_error") return "retired";
  // A model that succeeded until moments ago, now rejected on permission, hints at withdrawal.
  if (e.status === 403 || type === "permission_error") return "withdrawn";
  return "other";
}
 
export async function route(
  role: Role,
  // Non-streaming params, so the result is a plain Message.
  params: Omit<Anthropic.MessageCreateParamsNonStreaming, "model">,
) {
  let lastErr: unknown;
  const tried = new Set<string>();
  for (;;) {
    const model = pickModel(role, tried);
    if (!model) break; // every candidate is unusable or already tried
    tried.add(model);
    try {
      const res = await client.messages.create({ ...params, model });
      noteResult(model, "ok");
      return { res, model };
    } catch (err) {
      lastErr = err;
      const kind = classifyError(err);
      if (kind === "other") throw err; // unexpected: rethrow
      noteResult(model, kind);
      // overloaded may recover after cooldown, but this call falls
      // through to the next candidate to return a response now.
    }
  }
  throw lastErr ?? new Error(`no available model for role ${role}`);
}

What pays off here is classifying the error kind up front with classifyError. A 403 or permission error can come from a misconfiguration, but when it starts suddenly on a model that succeeded until moments ago, it is worth treating as a withdrawal signal. If false positives worry you, add a guard: only flag withdrawn for a model that has a successful call within the last N minutes.
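
The guard mentioned above could look roughly like this. It is a sketch, not part of the router as written: `noteSuccess`, `guardWithdrawn`, and the 15-minute window are hypothetical, and the `now` parameter exists only to make the logic easy to test.

```typescript
// Assumption: N = 15 minutes. Tune it to how often each model is actually called.
const RECENT_SUCCESS_WINDOW_MS = 15 * 60 * 1000;

const lastSuccessAt = new Map<string, number>(); // key: model ID

// Call after every successful response.
export function noteSuccess(id: string, now: number = Date.now()): void {
  lastSuccessAt.set(id, now);
}

// A permission error counts as a withdrawal signal only when the same model
// answered successfully within the window. Otherwise it is more likely a
// misconfiguration, so report "other" and let the caller rethrow.
export function guardWithdrawn(id: string, now: number = Date.now()): "withdrawn" | "other" {
  const last = lastSuccessAt.get(id);
  if (last === undefined) return "other";
  return now - last <= RECENT_SUCCESS_WINDOW_MS ? "withdrawn" : "other";
}
```

One trade-off to keep in mind: a job that runs once a night will rarely have a success inside a short window, so a pipeline like that needs either a much longer window or the human confirmation step.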

Detect Withdrawal Early With a Daily Preflight

Before the nightly batch runs, place a preflight that fires a single tiny call to each role's head candidate to catch withdrawal and retirement early. It is far cheaper and safer than discovering the problem mid-way through a heavy job.

# preflight.py — one lightweight probe per role
import anthropic
 
client = anthropic.Anthropic()  # ANTHROPIC_API_KEY from env
 
PROBE_BY_ROLE = {
    "deep": "claude-opus-4-8",
    "balanced": "claude-sonnet-4-6",
    "fast": "claude-haiku-4-5",
}
 
def probe(model_id: str) -> str:
    try:
        client.messages.create(
            model=model_id,
            max_tokens=1,
            messages=[{"role": "user", "content": "ok"}],
        )
        return "available"
    except anthropic.APIStatusError as e:
        if e.status_code in (429, 529):
            return "overloaded"   # transient; batch may proceed
        if e.status_code == 404:
            return "retired"      # switch to successor required
        if e.status_code == 403:
            return "withdrawn"    # scale-down decision required
        raise
 
if __name__ == "__main__":
    blocking = []
    for role, model_id in PROBE_BY_ROLE.items():
        state = probe(model_id)
        print(f"{role:9s} {model_id:20s} -> {state}")
        if state in ("retired", "withdrawn"):
            blocking.append((role, model_id, state))
    if blocking:
        for role, model_id, state in blocking:
            print(f"::alert:: role={role} model={model_id} state={state}")

A max_tokens=1 probe costs a trivial amount per call. In my own operation, one run a day across three roles came to a few yen a month. Slipping it in right before the nightly batch nearly eliminated the "discover the withdrawal mid-way through a heavy job" accident.

Don't Let Fallback Corrupt Your Cost Records

An easy thing to miss is cost accounting when a fallback changes the model. When the deep role downgrades from claude-opus-4-8 to claude-sonnet-4-6, the per-token price changes for the same token count. Keep charging at the first-choice price without recording the downgrade and your end-of-month totals quietly drift.

The router returns the model ID that actually succeeded, so always account using that return value.

// cost.ts
import { REGISTRY } from "./model-registry";
 
export function recordCost(
  usedModelId: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const entry = Object.values(REGISTRY).flat().find((m) => m.id === usedModelId);
  if (!entry) throw new Error(`unknown model: ${usedModelId}`);
  const usd =
    (inputTokens / 1_000_000) * entry.inputPer1M +
    (outputTokens / 1_000_000) * entry.outputPer1M;
  // Price by the model actually used, not the first choice.
  return usd;
}

Downgrades can be cheaper, but a configuration where the fast role climbs to balanced under overload can also raise the unit price. Anchor accounting on the actually-used model so re-pricing is correct in either direction. In a period like today, when the billing scheme itself is under review, this "account by actual use" becomes your direct way of observing real cost.
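
To make the drift concrete, here is a small worked example. The prices are the ones from the registry in this article; the token counts for the nightly job are made up for illustration, and `costUsd` is a hypothetical stand-in with the same formula as `recordCost`.

```typescript
interface Price {
  inputPer1M: number;  // USD per 1M input tokens
  outputPer1M: number; // USD per 1M output tokens
}

// Prices copied from the registry in this article.
const PRICES: Record<string, Price> = {
  "claude-opus-4-8": { inputPer1M: 5.0, outputPer1M: 25.0 },
  "claude-sonnet-4-6": { inputPer1M: 3.0, outputPer1M: 15.0 },
};

export function costUsd(modelId: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[modelId];
  if (!p) throw new Error(`unknown model: ${modelId}`);
  return (inputTokens / 1_000_000) * p.inputPer1M + (outputTokens / 1_000_000) * p.outputPer1M;
}

// Hypothetical nightly job: 2M input tokens and 400k output tokens.
// Priced as if the first choice had served it:
export const assumedUsd = costUsd("claude-opus-4-8", 2_000_000, 400_000);
// Priced by the model that actually served it after fallback:
export const actualUsd = costUsd("claude-sonnet-4-6", 2_000_000, 400_000);
// The amount by which the books would be off for this single run:
export const driftUsd = assumedUsd - actualUsd;
```

For this one run the first-choice price comes to about 20 USD and the actual price to about 12 USD, so a ledger that ignores the fallback overstates cost by roughly 8 USD per night.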

Lessons and Pitfalls From Production

After running this setup for a few weeks, here is what I saw, candidly.

First, treating automatic withdrawal detection as advisory only turned out to be the practical stance. Declaring a 403 to be a withdrawal leaves too much room for false positives. I eventually settled on: when the preflight probe reports withdrawn, notify a human, and let the operator confirm the registry state change. The only things I let the machine do automatically are overload scale-down and recovery. The line between what a machine may decide in production and what a person should decide is best drawn by the cost of a false positive.

Second, always keep one in-role alternative that is not from the same lineage. In an event like today's withdrawal, where a whole family becomes unusable at once, escaping to another model in the same family is no guarantee of safety. Place a generation that is lower in capability but reliably present as the deep role's scale-down target, so the worst case is still "scale down and finish."

Third, set the overloaded cooldown too short and you get oscillation—repeatedly reverting to the first choice during a peak only to be rejected again. I started at 10 minutes and tuned it against the measured duration of 529 spells. I recommend measuring how long overload actually lasts before settling on a value, rather than hard-coding a guess.
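
Measuring overload duration can be as simple as grouping the timestamps of logged 429/529 errors into spells. This is a sketch under assumptions: `overloadSpells` and `suggestCooldownMs` are hypothetical helpers, the timestamps are expected in ascending order, and taking the longest observed spell is just one conservative choice.

```typescript
// Group overload-error timestamps (ms, ascending) into spells. A new spell
// starts when the gap since the previous error exceeds maxGapMs.
// Returns the duration of each spell in ms (a lone error has duration 0).
export function overloadSpells(timestamps: number[], maxGapMs: number): number[] {
  const durations: number[] = [];
  if (timestamps.length === 0) return durations;
  let start = timestamps[0];
  let prev = timestamps[0];
  for (let i = 1; i < timestamps.length; i++) {
    const t = timestamps[i];
    if (t - prev > maxGapMs) {
      durations.push(prev - start); // close the current spell
      start = t;                    // and open a new one
    }
    prev = t;
  }
  durations.push(prev - start); // close the final spell
  return durations;
}

// Use the longest observed spell as the cooldown, but never go below floorMs.
export function suggestCooldownMs(durations: number[], floorMs: number): number {
  return Math.max(floorMs, ...durations);
}
```

Feeding it a week of error logs gives a cooldown grounded in observed behavior; a percentile instead of the maximum would be less sensitive to a single unusually long spell.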

Finally, the essence of this design comes down to clearly holding "things that return if you wait" and "things that won't return until the outside world changes" as distinct states in code. When a day like today arrives, being able to change one line in the registry—rather than scrambling to rewrite code—is what keeps automation running. For someone operating several systems alone, that quiet assurance is worth a great deal.

As a next step, count how many places in your own pipeline hard-code the model string you use today, with a quick grep. That number is exactly how many edits a generation change or a withdrawal will cost you.
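
If grep is not at hand, the same count can be taken in code. A minimal sketch, assuming that model IDs are quoted string literals starting with `claude-`; `countModelLiterals` is a hypothetical helper that takes a file's contents as a string.

```typescript
// Count quoted string literals that look like a physical model ID.
// Assumption: IDs start with "claude-" and contain only a-z, 0-9, "." and "-".
export function countModelLiterals(source: string): number {
  const matches = source.match(/["'`]claude-[a-z0-9.-]+["'`]/g);
  return matches ? matches.length : 0;
}
```

Mentions in comments without quotes are deliberately not counted, since only literals passed to the API need editing.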
