Claude Code · 2026-07-04 · Advanced

A 1M Context Window Is the New Default — So I Built an Admission Policy Instead of Filling It

Sonnet 5 is now the Claude Code default and native 1M context is standard. The hard errors disappeared, but a quieter kind of degradation took their place. Here is how I made it visible with a probe, plus an admission policy and an effective-token-cost view — with working code and my own measurements.


I was scanning my nightly batch logs when a strange quiet caught my attention. A few weeks earlier, my heaviest article-generation job had been dropping sessions two or three times a day with "context window exceeded". That had simply stopped.

It happened right after the Claude Code default switched to Sonnet 5 and a native 1M-token context became standard. My first reaction was relief: no more shaving off parts of a repository just to fit the window. But after a few days of running it, I realized that sessions no longer crashing did not mean everything was fine. There were no errors, yet output quality wobbled now and then, and my input-token bill was quietly climbing.

This article is a record of how I made that silent degradation visible and what I did about it — with a working probe, an admission policy, and the actual numbers I measured on my own pipeline.

From "it crashes" to "it smudges": the failure mode flipped

When the window was narrow, context problems always showed up as exceptions. Exceed the limit and the session stops. It hurts, but at least you notice. The "context window exceeded" line sits in the log, and trimming the overstuffed input is an obvious, unambiguous fix you can make the same day.

Once 1M became the default, that safety valve came off. Even my heavy job — which hands over repository excerpts from four sites, reference data, and the day's logs all at once — measures around 150K–200K tokens. Nowhere near 1M. So it never crashes again. So far, so good.

The problem was that the absence of crashes quietly shifted my judgment toward "if it fits, keep it." Material I used to trim to fit the window now went in whole. That produced two kinds of smudge: facts buried mid-context were dropped or paraphrased wrong, and input cost crept up.

| Failure type | When the window was narrow | After 1M became default |
| --- | --- | --- |
| Detectability | You notice instantly via the exception | No error. You only notice by reading the output |
| Symptom | Session stops | Facts buried mid-context get dropped or paraphrased wrong |
| Cost | Wasted on failures, but total stays restrained | Input tokens creep up and the bill inflates |
| Urgency to fix | Fixed the same day | Easy to leave alone until it becomes the norm |
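
The cost row can be made concrete with a little arithmetic. The sketch below is only an illustration: `inputCostUsd` and `monthlyInputCostUsd` are hypothetical helper names, and the price per million tokens, the runs per day, and the "trimmed" token count are assumptions you would replace with your own figures.

```typescript
// Rough input-cost arithmetic. All names are hypothetical, and the price is an
// assumption passed in by the caller, not a quoted rate.
function inputCostUsd(inputTokens: number, usdPerMillionTokens: number): number {
  return (inputTokens / 1_000_000) * usdPerMillionTokens;
}

// Scale one run's input cost up to a month of unattended runs.
function monthlyInputCostUsd(
  tokensPerRun: number,
  runsPerDay: number,
  usdPerMillionTokens: number,
  days = 30,
): number {
  return inputCostUsd(tokensPerRun, usdPerMillionTokens) * runsPerDay * days;
}

// Compare a trimmed run with a "keep everything" run.
// 60_000 tokens, 10 runs per day, and $2 per million are illustrative inputs only.
const trimmedMonthly = monthlyInputCostUsd(60_000, 10, 2);
const wholeMonthly = monthlyInputCostUsd(200_000, 10, 2);
console.log(`trimmed: $${trimmedMonthly.toFixed(2)}/mo, whole: $${wholeMonthly.toFixed(2)}/mo`);
```

Nothing fails in either case, which is the point: the difference only shows up if you compute it.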

A wide window means there is no longer a penalty for overstuffing. When the penalty disappears, the only discipline left is the one you bring yourself. The first thing I needed was to make the degradation visible enough to see whether it was actually happening.

First, measure it: turn fill rate versus recall into a probe

Saying "quality feels worse" gives you nothing to act on. What I did was embed known facts into the context, raise the fill rate step by step, and measure whether the model could recall those facts accurately. It is the needle-in-a-haystack idea familiar from long-context testing, run against my own material.

Three things matter. First, the facts you embed must be verifiable: unique numbers or strings, such as the concrete values in your runtime preconditions. Second, scatter them across shallow, middle, and tail positions; the middle ones tend to fall first. Third, raise the fill rate gradually so you can trace the curve and see where recall starts to break.
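
The first point can be checked mechanically before any tokens are spent. The following is a minimal sketch, not part of the original pipeline: `validateNeedles` is a hypothetical helper, and the `Needle` shape simply mirrors the one used by the probe. It flags answers that are missing from their own fact, duplicated across needles, or already present in the filler, since any of those would make a "hit" ambiguous.

```typescript
interface Needle {
  id: string;
  fact: string;   // the sentence embedded in context
  answer: string; // the value the model is expected to return
}

// Returns a list of problems; an empty list means the needle set looks usable.
function validateNeedles(needles: Needle[], filler: string): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const n of needles) {
    if (!n.fact.includes(n.answer)) {
      problems.push(`${n.id}: answer is not present in its own fact`);
    }
    if (seen.has(n.answer)) {
      problems.push(`${n.id}: answer "${n.answer}" duplicates another needle`);
    }
    seen.add(n.answer);
    if (filler.includes(n.answer)) {
      problems.push(`${n.id}: answer "${n.answer}" also appears in the filler`);
    }
  }
  return problems;
}

// Usage: a very short answer such as "4" is easy to collide with filler text.
const sample: Needle[] = [
  { id: "n01", fact: "The deploy identifier DEPLOY_VERSION is v9137.", answer: "v9137" },
  { id: "n02", fact: "The nightly batch concurrency cap MAX_CONCURRENCY is 4.", answer: "4" },
];
console.log(validateNeedles(sample, "Section 4 describes the retry policy."));
```

Short answers are not wrong, but they are the ones most likely to need id-based matching rather than substring matching when scoring.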

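// context-probe.ts: embed known facts, measure recall at each fill rate
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const MODEL = "claude-sonnet-5";
const CHARS_PER_TOKEN = 3.5; // rough token estimate, not a tokenizer

// Verifiable "needles". In production I use each line of RUNTIME_ASSUMPTIONS.
interface Needle {
  id: string;
  fact: string;   // the unique fact embedded in context
  answer: string; // ground truth (matched exactly)
}

const NEEDLES: Needle[] = [
  { id: "n01", fact: "The deploy identifier DEPLOY_VERSION is v9137.", answer: "v9137" },
  { id: "n02", fact: "The nightly batch concurrency cap MAX_CONCURRENCY is 4.", answer: "4" },
  { id: "n03", fact: "The idempotency-key prefix for the outermost retry is job-.", answer: "job-" },
  // … 12 needles in id order in production
];

// Placeholder so the file runs as-is; swap in prose that resembles your real context.
const FILLER_PARAGRAPH =
  "Replace this paragraph with an excerpt from the documents you actually feed into context.";

// dilute to the target fill with meaningful but harmless prose
function buildContext(needles: Needle[], filler: string, targetTokens: number): string {
  if (filler.length === 0) throw new Error("filler must be non-empty"); // otherwise the loops never end
  // label each fact with its needle id so the answer can be matched per needle
  const facts = needles.map((n) => `[fact ${n.id}] ${n.fact}`);
  // spread across shallow / middle / tail (put the most fragile in the middle)
  const third = Math.ceil(facts.length / 3);
  const head = facts.slice(0, third);
  const mid = facts.slice(third, third * 2);
  const tail = facts.slice(third * 2);
  const chunks: string[] = [...head];
  let approx = head.join("\n").length / CHARS_PER_TOKEN;
  while (approx < targetTokens * 0.5) { chunks.push(filler); approx += filler.length / CHARS_PER_TOKEN; }
  chunks.push(...mid);
  approx += mid.join("\n").length / CHARS_PER_TOKEN;
  while (approx < targetTokens * 0.9) { chunks.push(filler); approx += filler.length / CHARS_PER_TOKEN; }
  chunks.push(...tail);
  return chunks.join("\n\n");
}

// Parse the JSON array in the reply and compare each value against its own needle.
// A bare substring check would count "4" as recalled whenever any 4 appears in the reply.
function scoreRecall(text: string, needles: Needle[]): number {
  const start = text.indexOf("[");
  const end = text.lastIndexOf("]");
  if (start < 0 || end <= start) return 0;
  let rows: unknown;
  try {
    rows = JSON.parse(text.slice(start, end + 1));
  } catch {
    return 0; // unparseable output counts as zero recall
  }
  if (!Array.isArray(rows)) return 0;
  const got = new Map<string, string>();
  for (const r of rows) {
    if (r && typeof r === "object") got.set(String(r.id), String(r.value).trim());
  }
  let recalled = 0;
  for (const n of needles) {
    if (got.get(n.id) === n.answer) recalled++;
  }
  return recalled / needles.length;
}

async function probeRecall(fillTokens: number): Promise<number> {
  const context = buildContext(NEEDLES, FILLER_PARAGRAPH, fillTokens);
  const question =
    "From the material below, return only the value written in each fact, as a JSON array of " +
    '{"id": "<fact id>", "value": "<value>"} objects. ' +
    "No guessing. If a value is not in the material, use null.\n\n" + context;

  const res = await client.messages.create({
    model: MODEL,
    max_tokens: 1024,
    messages: [{ role: "user", content: question }],
  });

  // the type guard narrows the content block so .text type-checks
  const text = res.content.find((b): b is Anthropic.TextBlock => b.type === "text")?.text ?? "";
  return scoreRecall(text, NEEDLES);
}

// walk the fill rate up to trace the curve
async function runCurve() {
  for (const fill of [30_000, 120_000, 300_000, 600_000, 900_000]) {
    const acc = await probeRecall(fill);
    console.log(`fill≈${fill}\trecall=${(acc * 100).toFixed(0)}%`);
  }
}

runCurve().catch((err) => {
  console.error(err);
  process.exitCode = 1;
});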
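
Because the middle needles tend to fall first, a single recall percentage can hide where the loss happens. As a sketch under stated assumptions, the helpers below break recall down by position. `NeedleResult`, `positionOf`, and `recallByPosition` are hypothetical names, and the thirds split is chosen to match the slicing in `buildContext`.

```typescript
type Position = "head" | "mid" | "tail";

interface NeedleResult {
  id: string;
  position: Position;
  recalled: boolean;
}

// Same thirds split as buildContext: first third is head, second is mid, the rest is tail.
function positionOf(index: number, total: number): Position {
  const third = Math.ceil(total / 3);
  if (index < third) return "head";
  if (index < third * 2) return "mid";
  return "tail";
}

// Recall per position; null means no needle landed in that position.
function recallByPosition(results: NeedleResult[]): Record<Position, number | null> {
  const out: Record<Position, number | null> = { head: null, mid: null, tail: null };
  for (const pos of ["head", "mid", "tail"] as const) {
    const group = results.filter((r) => r.position === pos);
    if (group.length > 0) {
      out[pos] = group.filter((r) => r.recalled).length / group.length;
    }
  }
  return out;
}

// Usage with made-up outcomes for three needles (illustrative, not measured data).
const outcomes = [true, false, true];
const results: NeedleResult[] = outcomes.map((recalled, i) => ({
  id: `n0${i + 1}`,
  position: positionOf(i, outcomes.length),
  recalled,
}));
console.log(recallByPosition(results));
```

Logging this next to the overall figure at each fill level makes it easier to tell a uniform drop from one concentrated in the middle.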

The key is to make FILLER_PARAGRAPH prose that resembles what you actually feed into context (document excerpts and the like). Diluting with random strings erases the difference between "meaningless noise" and "meaningful but currently irrelevant context," which makes the model look better than it will in production.
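
One way to produce that kind of filler is to cycle through excerpts of your own documents until a rough token target is reached. This is a minimal sketch: `buildFiller` is a hypothetical helper, and the 3.5 characters-per-token figure is the same rough estimate the probe uses, not a real tokenizer.

```typescript
const CHARS_PER_TOKEN = 3.5; // rough estimate only; assumption carried over from the probe

// Cycle through real document excerpts until the estimated token target is reached.
function buildFiller(excerpts: string[], targetTokens: number): string {
  const usable = excerpts.map((e) => e.trim()).filter((e) => e.length > 0);
  if (usable.length === 0) throw new Error("need at least one non-empty excerpt");
  const parts: string[] = [];
  let approxTokens = 0;
  for (let i = 0; approxTokens < targetTokens; i++) {
    const piece = usable[i % usable.length];
    parts.push(piece);
    approxTokens += piece.length / CHARS_PER_TOKEN;
  }
  return parts.join("\n\n");
}

// Usage: the excerpts here are stand-ins; in practice they would come from your own files.
const filler = buildFiller(
  ["Excerpt from a design note about retries.", "Excerpt from yesterday's batch log summary."],
  50,
);
console.log(filler.length);
```

A block of a few thousand tokens built this way could stand in for FILLER_PARAGRAPH, since the probe already repeats the filler until it reaches the target fill.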

