● MODEL: Claude Sonnet 5 is now the default model across all plans, the most agentic Sonnet yet
● PRICE: Sonnet 5 launches at $2/$10 per million tokens, available through August 31
● CODE: Claude Code adopts Sonnet 5 by default with a native 1M-token context window
● GATEWAY: A self-hosted Claude apps gateway arrives for Amazon Bedrock and Google Cloud (SSO, policy, cost)
● CHROME: Claude in Chrome is now generally available with background notifications and draft PR handoff
● ENTERPRISE: Enterprise gains richer admin analytics, model-level entitlements, and spend alerts
A 1M Context Window Is the New Default — So I Built an Admission Policy Instead of Filling It
Sonnet 5 is now the Claude Code default and native 1M context is standard. The hard errors disappeared, but a quieter kind of degradation took their place. Here is how I made it visible with a probe, plus an admission policy and an effective-token-cost view — with working code and my own measurements.
I was scanning my nightly batch logs when a strange quiet caught my attention. A few weeks earlier, my heaviest article-generation job had been dropping sessions two or three times a day with "context window exceeded". That had simply stopped.
It happened right after the Claude Code default switched to Sonnet 5, with a native 1M-token context becoming standard. My first reaction was relief: no more shaving off parts of a repository just to fit the window. But after a few days of running it, I realized that sessions no longer crashing did not mean everything was fine. There were no errors, yet output quality wobbled now and then, and my input-token bill was quietly climbing.
This article is a record of how I made that silent degradation visible and what I did about it — with a working probe, an admission policy, and the actual numbers I measured on my own pipeline.
From "it crashes" to "it smudges": the failure mode flipped
When the window was narrow, context problems always showed up as exceptions. Exceed the limit and the session stops. It hurts, but at least you notice: "context window exceeded" sits in the log, and trimming the overstuffed input is an obvious, unambiguous fix you can make the same day.
Once 1M became the default, that safety valve came off. Even my heavy job — which hands over repository excerpts from four sites, reference data, and the day's logs all at once — measures around 150K–200K tokens. Nowhere near 1M. So it never crashes again. So far, so good.
The problem was that, with the crashes gone, my judgment quietly shifted toward "if it fits, keep it." Material I used to trim to fit the window now went in whole. That produced two kinds of smudge: facts getting lost in the context, and a bill that crept upward.
| Aspect | When the window was narrow | After 1M became default |
|---|---|---|
| Detectability | You notice instantly via the exception | No error. You only notice by reading the output |
| Symptom | Session stops | Facts buried mid-context get dropped or paraphrased wrong |
| Cost | Wasted on failures, but total stays restrained | Input tokens creep up and the bill inflates |
| Urgency to fix | Fixed the same day | Easy to leave alone until it becomes the norm |
A wide window means there is no longer a penalty for overstuffing. When the penalty disappears, the only discipline left is the one you bring yourself. The first thing I needed was to make the degradation visible enough to see whether it was actually happening.
First, measure it: turn fill rate versus recall into a probe
Saying "quality feels worse" gives you nothing to act on. What I did was embed known facts into the context, raise the fill rate step by step, and measure whether the model could recall those facts accurately. It is the needle-in-a-haystack idea familiar from long-context testing, run against my own material.
Three things matter. First, the facts you embed must be verifiable — unique numbers or strings, like preconditions. Second, scatter them across shallow, middle, and tail positions; the middle ones tend to fall first. Third, raise the fill rate gradually so you can trace the curve of where recall starts to break.
// context-probe.ts: embed known facts, measure recall at each fill rate
import { readFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const MODEL = "claude-sonnet-5";

// Verifiable "needles". In production I use each line of RUNTIME_ASSUMPTIONS.
interface Needle {
  id: string;
  fact: string;   // the unique fact embedded in context
  answer: string; // ground truth (matched exactly)
}

const NEEDLES: Needle[] = [
  { id: "n01", fact: "The deploy identifier DEPLOY_VERSION is v9137.", answer: "v9137" },
  { id: "n02", fact: "The nightly batch concurrency cap MAX_CONCURRENCY is 4.", answer: "4" },
  { id: "n03", fact: "The idempotency-key prefix for the outermost retry is job-.", answer: "job-" },
  // … 12 needles in id order in production
];

// Assumption: filler.txt is a placeholder file holding prose like your real context material.
const FILLER_PARAGRAPH = readFileSync("filler.txt", "utf8");

// dilute to the target fill with meaningful but harmless prose
function buildContext(needles: Needle[], filler: string, targetTokens: number): string {
  if (filler.length === 0) throw new Error("filler must be non-empty");
  // label each fact with its needle id so the model can report answers by id
  const facts = needles.map((n) => `[fact ${n.id}] ${n.fact}`);
  // spread across shallow / middle / tail (put the most fragile in the middle)
  const third = Math.ceil(facts.length / 3);
  const head = facts.slice(0, third);
  const mid = facts.slice(third, third * 2);
  const tail = facts.slice(third * 2);

  const chunks: string[] = [...head];
  let approx = head.join("\n").length / 3.5; // rough token estimate
  while (approx < targetTokens * 0.5) {
    chunks.push(filler);
    approx += filler.length / 3.5;
  }
  chunks.push(...mid);
  while (approx < targetTokens * 0.9) {
    chunks.push(filler);
    approx += filler.length / 3.5;
  }
  chunks.push(...tail);
  return chunks.join("\n\n");
}

async function probeRecall(fillTokens: number): Promise<number> {
  const context = buildContext(NEEDLES, FILLER_PARAGRAPH, fillTokens);
  const question =
    "From the material below, return only the value written in each fact, as a JSON array with its id, " +
    'in the form [{"id": "...", "value": "..."}]. ' +
    "No guessing. If it is not in the material, return null.\n\n" +
    context;

  const res = await client.messages.create({
    model: MODEL,
    max_tokens: 1024,
    messages: [{ role: "user", content: question }],
  });
  const block = res.content.find((b) => b.type === "text");
  const text = block?.type === "text" ? block.text : "";

  // Parse the JSON array and compare by id. A bare substring check would count
  // a short answer like "4" as recalled whenever any "4" appears in the reply.
  let rows: { id?: string; value?: unknown }[] = [];
  try {
    const parsed: unknown = JSON.parse(text.slice(text.indexOf("["), text.lastIndexOf("]") + 1));
    if (Array.isArray(parsed)) rows = parsed;
  } catch {
    // unparseable reply: every needle counts as missed
  }

  let recalled = 0;
  for (const n of NEEDLES) {
    const row = rows.find((r) => r?.id === n.id);
    if (row && String(row.value).trim() === n.answer) recalled++;
  }
  return recalled / NEEDLES.length;
}

// walk the fill rate up to trace the curve
async function runCurve() {
  for (const fill of [30_000, 120_000, 300_000, 600_000, 900_000]) {
    const acc = await probeRecall(fill);
    console.log(`fill≈${fill}\trecall=${(acc * 100).toFixed(0)}%`);
  }
}

runCurve().catch(console.error);
The key is to make FILLER_PARAGRAPH prose that resembles what you actually feed into context (document excerpts and the like). Diluting with random strings erases the difference between "meaningless noise" and "meaningful but currently irrelevant context," which makes the model look better than it will in production.
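One way to assemble that kind of filler, as a minimal sketch: cycle through excerpts of material you actually feed in until a rough token target is reached. The names `buildFiller` and `excerpts` are illustrative assumptions, and the 3.5 characters-per-token ratio is only the same rough estimate the probe uses.

```typescript
// Hypothetical helper: build filler by cycling through real excerpts
// until the text reaches roughly `targetTokens`.
const CHARS_PER_TOKEN = 3.5; // rough estimate, not a tokenizer

function buildFiller(excerpts: string[], targetTokens: number): string {
  const usable = excerpts.filter((e) => e.trim().length > 0);
  if (usable.length === 0) throw new Error("need at least one non-empty excerpt");
  const targetChars = targetTokens * CHARS_PER_TOKEN;
  const parts: string[] = [];
  let chars = 0;
  for (let i = 0; chars < targetChars; i++) {
    const piece = usable[i % usable.length];
    parts.push(piece);
    chars += piece.length + 2; // +2 for the paragraph separator
  }
  return parts.join("\n\n");
}

// Usage: stand-in excerpts; in practice these would come from your own docs and logs.
const sample = buildFiller(
  ["Release notes for the billing service.", "Runbook: restarting the nightly batch."],
  50
);
console.log(sample.length);
```

Cycling real excerpts keeps the distractors plausible, which is the property random strings lack.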
Here is the result of running that probe with the material my article-generation job normally feeds in as the filler. Twelve preconditions as needles, Sonnet 5, median of three runs at each fill rate.
| Fill rate (approx.) | Input tokens (approx.) | Accurate recall | Notes |
|---|---|---|---|
| 15% | ~150K | 12 / 12 | nothing dropped |
| 40% | ~400K | 12 / 12 | stable |
| 60% | ~600K | 11 / 12 | one mid fact paraphrased into ambiguity |
| 80% | ~800K | 9 / 12 | more middle dropouts |
| 92% | ~920K | 7 / 12 | two "recalled but wrong value" |
It was not a sharp cliff but a gentle downhill — which is exactly what makes it tricky. Up to around 50%, you cannot feel it. Past 60%, facts placed in the middle start dropping one by one. And the scary part is the 92% row: not just forgetting, but "recalled the fact but got the value wrong." A null would still be workable; a plausible wrong answer quietly breaks everything downstream.
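Because a wrong value is worse than a miss, it can help to score the two separately. A minimal sketch, assuming the model's reply has already been parsed into rows of id and value; `classifyRecall` and the `Outcome` labels are illustrative names, not part of the probe itself.

```typescript
type Outcome = "correct" | "missing" | "wrong";

interface Expected { id: string; answer: string }
interface Reported { id: string; value: string | null }

// Separate "forgot the fact" (missing) from "returned a different value" (wrong).
function classifyRecall(expected: Expected[], reported: Reported[]): Record<Outcome, string[]> {
  const byId = new Map(reported.map((r) => [r.id, r.value] as const));
  const result: Record<Outcome, string[]> = { correct: [], missing: [], wrong: [] };
  for (const e of expected) {
    const value = byId.get(e.id);
    if (value === undefined || value === null) result.missing.push(e.id);
    else if (value.trim() === e.answer) result.correct.push(e.id);
    else result.wrong.push(e.id);
  }
  return result;
}

// Usage with made-up replies: n02 came back with a plausible but wrong value.
const outcome = classifyRecall(
  [
    { id: "n01", answer: "v9137" },
    { id: "n02", answer: "4" },
    { id: "n03", answer: "job-" },
  ],
  [
    { id: "n01", value: "v9137" },
    { id: "n02", value: "8" },
    { id: "n03", value: null },
  ]
);
console.log(outcome); // correct: n01, missing: n03, wrong: n02
```

Tracking the `wrong` bucket on its own makes the dangerous case visible even when overall recall still looks acceptable.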
The other thing I measured was cost. Once I no longer had to fit the window, I had carelessly removed the step that trimmed material. Across that change, mean input tokens per task had gone from about 48K to about 118K. Recall had not improved; only the bill had changed, more than doubling.
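To put that in money terms, here is a quick worked calculation. It assumes the intro price of $2 per million input tokens and counts input tokens only; output tokens and any caching discounts are left out of this sketch.

```typescript
// Input cost per task at a given price per million input tokens.
function inputCostUsd(inputTokens: number, pricePerMTok: number): number {
  return (inputTokens / 1_000_000) * pricePerMTok;
}

const INTRO_INPUT_PRICE = 2; // USD per million input tokens (intro price)
const trimmed = inputCostUsd(48_000, INTRO_INPUT_PRICE);    // about $0.096 per task
const untrimmed = inputCostUsd(118_000, INTRO_INPUT_PRICE); // about $0.236 per task
console.log(`ratio: ${(untrimmed / trimmed).toFixed(2)}x`); // ratio: 2.46x
```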
Those two measurements settled the direction. Do not fill just because the window allows it. Instead, put an explicit layer that vets "is this allowed in?" ahead of context construction.
An admission policy that vets each slot
The mechanism is simple. Gather the material you want in context as candidates, then decide what gets in based on priority and an effective fill cap. I set the cap just short of where recall began to break in the curve above — around 55% effective fill for my material. The point is to treat the range where recall holds, not the whole window, as your "effective window."
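Picking the cap from the measured curve can be done mechanically. A minimal sketch, assuming the curve is a list of fill-rate and recall points; `pickFillCap`, the 0.05 margin, and the full-recall threshold are illustrative choices, not fixed rules.

```typescript
interface CurvePoint { fill: number; recall: number } // both in 0..1

// Return a cap slightly below the first fill rate where recall drops under the threshold.
function pickFillCap(curve: CurvePoint[], minRecall = 1.0, margin = 0.05): number {
  const sorted = [...curve].sort((a, b) => a.fill - b.fill);
  if (sorted.length === 0) throw new Error("curve is empty");
  const firstBreak = sorted.find((p) => p.recall < minRecall);
  // If recall never breaks, the largest measured fill is the only evidence available.
  const cap = firstBreak ? firstBreak.fill - margin : sorted[sorted.length - 1].fill;
  return Math.max(0, Math.round(cap * 100) / 100);
}

// The curve from the probe results: 12/12 up to 40%, 11/12 at 60%.
const measured: CurvePoint[] = [
  { fill: 0.15, recall: 12 / 12 },
  { fill: 0.4, recall: 12 / 12 },
  { fill: 0.6, recall: 11 / 12 },
  { fill: 0.8, recall: 9 / 12 },
  { fill: 0.92, recall: 7 / 12 },
];
console.log(pickFillCap(measured)); // 0.55
```

With these points the first break is at 60%, so a 0.05 margin lands on the 0.55 used in this article. A coarser curve would justify a larger margin.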
First, the old pass-through approach.
// Before: put in everything that fits (the window is wide, so why not)
function buildPromptBefore(candidates: Candidate[]): string {
  return candidates.map((c) => c.text).join("\n\n");
}
Replace it with a priority-aware admission gate.
// admission-policy.ts: vet slots before assembling context
interface Candidate {
  id: string;
  text: string;
  priority: number;  // higher is more important (see the table below)
  estTokens: number; // pre-estimate (approx. via tiktoken etc.)
  pinned?: boolean;  // always include, e.g. preconditions
}

interface AdmissionResult {
  admitted: Candidate[];
  rejected: Candidate[];
  usedTokens: number;
  effectiveFill: number;
}

function admit(
  candidates: Candidate[],
  opts: { modelWindow: number; effectiveFillCap: number }
): AdmissionResult {
  const budget = Math.floor(opts.modelWindow * opts.effectiveFillCap);

  // reserve pinned first, then take the rest by priority then token efficiency
  const pinned = candidates.filter((c) => c.pinned);
  const rest = candidates
    .filter((c) => !c.pinned)
    .sort((a, b) => b.priority - a.priority || a.estTokens - b.estTokens);

  const admitted: Candidate[] = [];
  const rejected: Candidate[] = [];
  let used = 0;
  for (const c of [...pinned, ...rest]) {
    if (used + c.estTokens <= budget || c.pinned) {
      admitted.push(c);
      used += c.estTokens;
    } else {
      rejected.push(c);
    }
  }

  return {
    admitted,
    rejected,
    usedTokens: used,
    effectiveFill: used / opts.modelWindow,
  };
}

// After: build context only from what passed the gate
function buildPromptAfter(candidates: Candidate[]): string {
  const r = admit(candidates, { modelWindow: 1_000_000, effectiveFillCap: 0.55 });
  if (r.rejected.length > 0) {
    console.warn(
      `admission: rejected ${r.rejected.length} ` +
        `(effective ${(r.effectiveFill * 100).toFixed(0)}% / ${r.usedTokens} tok)`
    );
  }
  return r.admitted.map((c) => c.text).join("\n\n");
}
effectiveFillCap is not a magic number. It should sit just short of the fill rate where your own probe shows recall breaking. The best value shifts with the nature of your material — code-heavy versus prose-heavy, shallow needles versus deep ones. For me 0.55 was the safe side; for your material it will differ. That is exactly why the probe from the first half of this article comes first.
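One edge worth guarding: the gate admits pinned candidates even when they do not fit, so pinned material alone can push past the cap. The check below is a small sketch using a hypothetical `pinnedOverflow` helper and a trimmed-down `Slot` shape. It is an addition for illustration, not part of the gate above.

```typescript
interface Slot { id: string; estTokens: number; pinned?: boolean }

// Tokens by which pinned slots alone exceed the effective budget (0 if they fit).
function pinnedOverflow(slots: Slot[], modelWindow: number, effectiveFillCap: number): number {
  const budget = Math.floor(modelWindow * effectiveFillCap);
  const pinnedTokens = slots
    .filter((s) => s.pinned)
    .reduce((sum, s) => sum + s.estTokens, 0);
  return Math.max(0, pinnedTokens - budget);
}

// Usage with made-up sizes: 600K pinned tokens against a 550K budget.
const over = pinnedOverflow(
  [
    { id: "preconditions", estTokens: 20_000, pinned: true },
    { id: "pinned-dump", estTokens: 580_000, pinned: true },
    { id: "logs", estTokens: 30_000 },
  ],
  1_000_000,
  0.55
);
if (over > 0) console.warn(`pinned slots exceed the cap by ${over} tokens`);
```

Running this before the gate keeps "pinned" from becoming a loophole for the overstuffing the cap is meant to prevent.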
What to drop and what to keep
An admission gate is only as good as how you assign priority. The ordering I use across the four-site pipeline I run as an indie developer is roughly the following. Treat the numbers as a relative ranking, not absolute values.
| Kind | priority | pinned | Why |
|---|---|---|---|
| Preconditions (DEPLOY_VERSION, caps, etc.) | 100 | yes | getting these wrong silently breaks downstream. Pin near the top |
| Today's logs / recent state | 80 | no | drives freshness of judgment; usually short |
| The actual target files | 70 | no | the work itself, but narrow to the relevant parts |
| Reference doc excerpts | 40 | no | a summary is often enough; avoid full text |
| "Just in case" surrounding context | 10 | no | first to drop; this is where bloat breeds |
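The ranking above translates fairly directly into data. A sketch, assuming a hypothetical `Kind` label attached to each piece of material. The numbers are the relative ranking from the table, and `toCandidate` is an illustrative helper that produces the shape the gate expects.

```typescript
type Kind = "precondition" | "recentState" | "targetFile" | "referenceDoc" | "justInCase";

interface Candidate {
  id: string;
  text: string;
  priority: number;
  estTokens: number;
  pinned?: boolean;
}

// Relative ranking from the table; only preconditions are pinned.
const POLICY: Record<Kind, { priority: number; pinned: boolean }> = {
  precondition: { priority: 100, pinned: true },
  recentState: { priority: 80, pinned: false },
  targetFile: { priority: 70, pinned: false },
  referenceDoc: { priority: 40, pinned: false },
  justInCase: { priority: 10, pinned: false },
};

function toCandidate(id: string, kind: Kind, text: string): Candidate {
  const { priority, pinned } = POLICY[kind];
  // Rough token estimate (about 3.5 characters per token), as in the probe.
  return { id, text, priority, pinned, estTokens: Math.ceil(text.length / 3.5) };
}

const deploy = toCandidate("deploy", "precondition", "The deploy identifier DEPLOY_VERSION is v9137.");
const extra = toCandidate("misc", "justInCase", "Neighbouring module notes that might be relevant.");
console.log(deploy.priority, deploy.pinned, extra.priority, extra.pinned);
```

Keeping the ranking in one record means a change of policy is a one-line edit rather than a hunt through call sites.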
What actually helped was being able to drop that bottom "just in case" row mechanically. People are bad at discarding "this might be relevant." Give the gate an ordering and a cap, and it makes that call the same way every time. On handling preconditions as state in long-running jobs, this connects to my notes on context budget and compaction for long-running agents.
Stop quiet cost creep with an effective-token-cost view
Overstuffing shows up not only in quality but in the bill. Here an "effective token cost" view helps. Instead of total input tokens, look at cost per token that actually contributed to the result. Using the probe's recall as a proxy, you can show at a glance that overstuffing is worsening your unit cost.
// effective-cost.ts: "effective token cost" using recall as a proxy
interface RunStat {
  inputTokens: number;
  recall: number;            // 0..1 (probe recall)
  inputPricePerMTok: number; // 2 at intro price, 3 at standard
}

function effectiveCostPer1kUseful(s: RunStat): number {
  const inputCost = (s.inputTokens / 1_000_000) * s.inputPricePerMTok;
  // "useful" input tokens = total tokens × recall
  const usefulTokens = s.inputTokens * s.recall;
  return (inputCost / usefulTokens) * 1000;
}

// Before (pass-through) versus After (admission)
const before: RunStat = { inputTokens: 118_000, recall: 0.75, inputPricePerMTok: 2 };
const after: RunStat = { inputTokens: 62_000, recall: 1.0, inputPricePerMTok: 2 };

console.log("before", effectiveCostPer1kUseful(before).toFixed(5));
console.log("after ", effectiveCostPer1kUseful(after).toFixed(5));
After adding the admission policy, input tokens per task fell back from about 118K to about 62K, and probe recall recovered to 12/12. Even by raw input cost alone, that is roughly a 1.9x difference per task at the intro price (about $0.236 versus $0.124). "The window is wide, so keep it in" turned out to be a doubly bad call: it lowered quality while raising the bill.
Recording effective token cost over time lets you notice from the cost side when growing material starts slipping past the gate. Recall degradation and cost inflation share a cause (overstuffing), so watching one picks up the early signs of the other. For how to build the token accounting itself, see my Claude Code context budget optimization guide.
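A minimal sketch of that kind of record-keeping, assuming one numeric sample per task (input tokens here, though the same check works on effective token cost values). `detectCreep`, the window size, and the 25% threshold are hypothetical values to tune, not recommendations.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Flag creep when the mean of the most recent samples exceeds
// the mean of the earliest samples by more than `threshold`.
function detectCreep(series: number[], window = 5, threshold = 0.25): boolean {
  if (series.length < window * 2) return false; // not enough history yet
  const baseline = mean(series.slice(0, window));
  const recent = mean(series.slice(-window));
  return recent > baseline * (1 + threshold);
}

// Usage with made-up per-task input token counts.
const steady = [61_000, 63_000, 62_000, 60_000, 64_000, 62_000, 61_500, 63_500, 62_500, 60_500];
const creeping = [61_000, 63_000, 62_000, 60_000, 64_000, 70_000, 78_000, 85_000, 92_000, 99_000];
console.log(detectCreep(steady));   // false
console.log(detectCreep(creeping)); // true
```

Comparing against the earliest window rather than the previous day is a deliberate choice in this sketch: slow creep never looks large from one day to the next.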
Three things to do once, after the default changes
When a model default switches quietly, the danger is that everything looks like it is still working. Without hard errors, we do not notice the change. In a case like this one, with 1M becoming the default, running these three checks once will let you rest easier.
First, run the probe on your own real material and take the degradation curve. The fill rate where recall breaks differs per material, so you need your numbers, not someone else's. Second, put an admission gate capped just short of that point ahead of context construction. Design around the effective window where recall holds, not the width of the window. Third, keep effective token cost in your records so you can catch the early signs of quiet cost inflation.
The wider window is unmistakably progress. But progress did not, by itself, become a reason to relax. If anything, with the safety valve gone, the judgment of how much to pack came back to my own design. I let my guard down when things stopped crashing, and missed the silent degradation for a few days. I hope this helps you avoid the same stumble. Thank you for reading.
Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.