●FABLE 5 — Claude Fable 5 is available again to users worldwide from July 1 after US export controls were lifted●SCIENCE — Claude Science, a workbench for researchers, is in beta; the AI for Science credit program is open through July 15●CODE — Claude Code adds dynamic workflows (research preview) and raises weekly usage limits by 50% through July 13●MODEL — Claude Sonnet 5 is the default across all plans at $2/$10 per million tokens through August 31●GATEWAY — A self-hosted Claude apps gateway arrives for Amazon Bedrock and Google Cloud (SSO, policy, cost control)●SECURITY — A new cybersecurity classifier ships alongside the Fable 5 redeployment
When to Use Claude Code's Native 1M Context — and When Not To: A Cost-Based Rule
With Sonnet 5 as the default, Claude Code now handles a native 1M-token context. A big window is convenient, but every token you park in it is billed again each turn. Should you load the whole repo, or feed slices? Here is an estimable token model and a decision rule that gives a concrete answer per situation, with working code and the traps to avoid.
I handed a large repository to Claude Code, felt reassured that it would "read all of it," and half an hour later opened the usage screen. I froze.
That single session had consumed several times my usual tokens. The work itself finished correctly. But that job did not need that window size. Feed it only the slices it needed, and the same result would have cost far less.
On June 30, 2026, Claude Sonnet 5 became the default across all plans, and Claude Code gained a native one-million-token context. The old 1M was a beta limited to specific models. Now it is within reach by default. Which is exactly why the instinct to "load everything because the window is wide" quietly melts money.
This article turns "when to use a big window and when not to" into something you decide by arithmetic rather than by feel, with code you can run against your own price sheet and repo.
A big window changes cost, not speed
Let me clear up one misconception first. Widening the context does not make the model smarter, nor necessarily faster. What changes is what you can show it at once and what you pay every turn to do so.
As a conversation progresses, Claude Code repeatedly resends the prior exchange as input. Whatever sits in the window is billed as input tokens on every response. That is the crux. Keep a 10,000-token file resident across 20 round trips, and (outside of any cache) those 10,000 tokens can be billed roughly 20 times.
So the cost of a big window scales as "amount loaded × times you touch it." For a one-shot read it is noise; for long exploration or iterative refactoring, that multiplication is what bites.
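To make the multiplication concrete, here is a minimal sketch of the resend arithmetic for the 10,000-token example above. The $2 per million input price is the Sonnet 5 introductory rate quoted in this article; the function name is made up for illustration, and caching and surcharges are deliberately ignored.

```python
def resend_cost(tokens_resident: int, round_trips: int,
                price_per_mtok: float) -> tuple[int, float]:
    """Return (billed input tokens, dollars) for content resent every turn.

    Assumes no cache hits and no long-context surcharge.
    """
    billed = tokens_resident * round_trips
    return billed, billed * price_per_mtok / 1_000_000


# 10,000-token file kept resident across 20 round trips at $2 per 1M input tokens
billed, dollars = resend_cost(10_000, 20, price_per_mtok=2.0)
print(f"{billed:,} input tokens billed, about ${dollars:.2f}")
```

A single file is cheap even when resent 20 times. The same multiplication applied to a whole repository is where the bill grows.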
The estimate: "everything resident" vs "sliced"
Let us put the decision in symbols.
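Symbol         Meaning
T_ctx          Tokens kept resident in the window (e.g. the whole repo)
N              Number of round trips (turns) in the session
T_q            Tokens in the question or instruction sent each turn
T_out          Output tokens generated each turn
p_in, p_out    Input and output unit prices
s_in, s_out    Input and output surcharge multipliers (1.0 below the long-context tier)
c              Fraction of the resident tokens served from cache (0 to 1)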
When you keep everything resident, the input cost is dominated by resending T_ctx every turn. The cached fraction c is billed at the cheaper cache-read rate, so the effective input cost is roughly:
input_cost(resident) ≈ N × T_ctx × p_in × s_in × (1 - c + c × r_cache)
Here r_cache is the ratio of cache-read price to normal input price (around 0.1 in many setups, i.e. about one tenth). The formula makes it visible: the higher c is, the cheaper resident becomes.
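A quick way to see how strongly c moves the resident cost is to tabulate the bracketed factor on its own. This is a sketch that assumes r_cache = 0.1 as in the text; the helper name is illustrative.

```python
def resident_factor(c: float, r_cache: float = 0.1) -> float:
    """Effective price multiplier on resident tokens: (1 - c) + c * r_cache."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("cache hit fraction c must be between 0 and 1")
    return (1 - c) + c * r_cache


for c in (0.0, 0.5, 0.8, 1.0):
    print(f"c = {c:.1f} -> resident tokens billed at {resident_factor(c):.2f}x")
```

Note that at an 80% hit rate the resident block is billed at roughly 0.28x, not 0.1x: the uncached 20% still accounts for most of what remains.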
When you slice and feed only what each turn needs, you do not resend T_ctx; on turn i you send only the relevant file fragment T_slice(i):
input_cost(sliced) ≈ Σ_{i=1..N} T_slice(i) × p_in × s_in'
s_in' can be the no-surcharge multiplier (1.0) if the total you load stays under the long-context tier. That is where slicing earns its keep. Pricing commonly changes the surcharge based on whether you cross the 200K-token tier, so slicing under the tier lowers the multiplier itself.
Written out it looks obvious, but one point matters in practice: resident cost spikes precisely when N is large, c is low, and T_ctx crosses the tier.
Here is the formula as a function. Always confirm unit prices and surcharges against the current price sheet and pass them as arguments (below uses Sonnet 5's introductory price as an example, with the long-context surcharge parameterized as an upper-tier estimate — verify the actual multipliers on the official pricing page).
from dataclasses import dataclass


@dataclass
class Pricing:
    p_in: float          # input $/1M tokens
    p_out: float         # output $/1M tokens
    long_threshold: int  # above this token count, the long-context tier applies
    s_in_long: float     # input surcharge in the long tier
    s_out_long: float    # output surcharge in the long tier
    r_cache: float       # cache-read price / normal input price


def _mult(total_ctx_tokens: int, pr: Pricing):
    """Return surcharge multipliers if resident amount crosses the long tier."""
    if total_ctx_tokens > pr.long_threshold:
        return pr.s_in_long, pr.s_out_long
    return 1.0, 1.0


def cost_resident(T_ctx, T_q, T_out, N, pr: Pricing, cache_hit=0.0):
    """Approx cost ($) of keeping T_ctx resident across N round trips."""
    s_in, s_out = _mult(T_ctx + T_q, pr)
    resident_factor = (1 - cache_hit) + cache_hit * pr.r_cache
    in_resident = N * T_ctx * pr.p_in * s_in * resident_factor
    in_query = N * T_q * pr.p_in * s_in
    out_cost = N * T_out * pr.p_out * s_out
    return (in_resident + in_query + out_cost) / 1_000_000


def cost_sliced(slices, T_q, T_out, N, pr: Pricing):
    """Approx cost ($) of sending only slices[i] each round trip.

    slices is a list of length N (relevant tokens sent that turn)."""
    total = 0.0
    for t_slice in slices:
        s_in, s_out = _mult(t_slice + T_q, pr)
        total += (t_slice + T_q) * pr.p_in * s_in
        total += T_out * pr.p_out * s_out
    return total / 1_000_000


# --- Example using Sonnet 5 intro pricing (always verify current rates) ---
pr = Pricing(p_in=2.0, p_out=10.0, long_threshold=200_000,
             s_in_long=2.0, s_out_long=1.5,  # upper-tier estimate; verify
             r_cache=0.1)

T_ctx = 260_000         # whole repo resident in the window
slices = [30_000] * 12  # sliced: send only 30K relevant tokens each turn
args = dict(T_q=1_500, T_out=1_200, N=12)

resident = cost_resident(T_ctx, pr=pr, cache_hit=0.0, **args)
sliced = cost_sliced(slices, pr=pr, **args)
print(f"resident (no cache): ${resident:.2f}")
print(f"sliced: ${sliced:.2f}")

resident_cached = cost_resident(T_ctx, pr=pr, cache_hit=0.8, **args)
print(f"resident (80% cache): ${resident_cached:.2f}")
Run it locally and the relationship shows up in numbers. With the example inputs above, resident with no cache comes to about $12.77 against about $0.90 for slicing, and an 80% cache hit rate brings resident down to about $3.78: far more reasonable, though still above slicing for this particular job. The estimate teaches you that "should I use 1M" is nearly synonymous with "will the cache hit."
Deciding mechanically whether to use it
Once you can estimate, make the decision a function too. Running a full cost comparison every time is heavy, so in practice a two-stage approach — route with cheap conditions first, and only estimate precisely at the boundary — stays light.
def should_use_large_context(T_ctx, N, cache_expected, pr: Pricing,
                             slice_ratio=0.12) -> tuple[bool, str]:
    """Decide whether to keep a 1M-scale context resident, and why."""
    # 1) If it never crosses the tier, resident is fine (no surcharge)
    if T_ctx <= pr.long_threshold:
        return True, "under the tier, resident is safe (no surcharge)"
    # 2) A one-shot read is simplest and cheapest resident
    if N <= 2:
        return True, "few round trips, resend cost is small"
    # 3) Many revisits with low cache -> slice
    if N >= 6 and cache_expected < 0.5:
        return False, "many revisits + low cache make resends expensive"
    # 4) If the relevant slice is a small fraction, slicing wins
    if slice_ratio <= 0.2 and cache_expected < 0.8:
        return False, "narrow per-turn relevance; slicing stays under the tier"
    # 5) Otherwise it depends on cache. Measure the boundary.
    return (cache_expected >= 0.6,
            "boundary case; decide resident/slice by measured cache")


for scenario in [
    (260_000, 12, 0.0),
    (260_000, 12, 0.85),
    (260_000, 2, 0.0),
    (150_000, 20, 0.0),
]:
    ok, why = should_use_large_context(*scenario, pr=pr)
    print(scenario, "->", "resident" if ok else "slice", "/", why)
The thresholds have reasons. At two round trips or fewer, the resend multiplication does not bite (condition 2). Conversely, at six or more round trips with under 50% cache, resident resends tend to exceed slicing (condition 3). If the relevant range is a fifth of the repo or less and the expected cache hit rate is below 80%, slicing typically keeps each turn under the long-context tier (condition 4). Tune the numbers against your own measurements. What matters is moving your practice from "widen it, roughly" to "decide by condition."
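For the boundary cases in condition 5, one way to do the second-stage estimate is to solve the resident formula for the cache hit rate at which it matches the sliced input cost. The sketch below does that on input cost only (output cost and the per-turn question tokens are ignored, as in the formulas earlier). The function name, the default surcharge of 2.0 (the article's upper-tier estimate), and the example figures are illustrative assumptions, not measurements.

```python
def break_even_cache_hit(T_ctx, N, slice_tokens_total,
                         s_in_resident=2.0, s_in_sliced=1.0, r_cache=0.1):
    """Cache hit fraction c at which resident input cost equals sliced input cost.

    Solves N*T_ctx*s_in*(1 - c + c*r_cache) = slice_tokens_total*s_in' for c.
    The input price p_in appears on both sides and cancels out.
    A result above 1.0 means no hit rate makes resident cheaper in this model;
    a result at or below 0.0 means resident is cheaper even with no cache.
    """
    resident_uncached = N * T_ctx * s_in_resident
    sliced = slice_tokens_total * s_in_sliced
    return (1 - sliced / resident_uncached) / (1 - r_cache)


# Hypothetical job: 260K resident for 12 turns vs. 100K-token slices each turn.
c_star = break_even_cache_hit(T_ctx=260_000, N=12,
                              slice_tokens_total=12 * 100_000)
print(f"resident is cheaper only above c = {c_star:.2f}")
```

With much narrower slices (for example 30K tokens per turn against the same 260K base) the result exceeds 1.0, which in this simplified model means slicing stays cheaper at any hit rate. Treat that as a prompt to measure, not as a verdict.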
Ordering to make the cache hit
Because the decision so often lands on "it depends on cache," it is worth knowing how to earn cache hits. Prompt caching only applies to a contiguous prefix that matches from the start. So put what does not change (the stable base of the repo, conventions, type definitions) first, and what changes (the current question, the latest diff) last.
If the ordering shifts every turn, the cache is effectively disabled. When you give Claude Code a large context, pin the "immovable block" at the front. That alone moves the cache_hit in the estimate above from 0 toward 0.7 or 0.8, and brings the resident cost into a realistic range.
Concretely: load the base early in the session before starting work, do not reshuffle the order in which you present file groups mid-session, and do not wedge large unrelated outputs in between.
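To illustrate why ordering matters, the sketch below models a prompt as an ordered list of (name, token count) blocks and counts how many tokens two consecutive turns share as a contiguous prefix. It is a toy model of prefix matching, not a description of how Claude Code or the API computes cache hits; the block names and sizes are invented.

```python
Block = tuple[str, int]  # (block name, token count); names and sizes are invented


def shared_prefix_tokens(prev: list[Block], curr: list[Block]) -> int:
    """Tokens in the leading run of blocks that is identical in both turns."""
    total = 0
    for a, b in zip(prev, curr):
        if a != b:
            break  # the first mismatch ends the reusable prefix
        total += a[1]
    return total


def prefix_hit_fraction(prev: list[Block], curr: list[Block]) -> float:
    """Share of the current prompt covered by the shared prefix."""
    size = sum(tokens for _, tokens in curr)
    return shared_prefix_tokens(prev, curr) / size if size else 0.0


base = [("conventions", 5_000), ("types", 15_000), ("core_modules", 180_000)]

# Stable base first, changing material last: the base is reusable next turn.
turn1 = base + [("question_1", 1_500)]
turn2 = base + [("question_2", 1_500)]
print(f"stable-first:   {prefix_hit_fraction(turn1, turn2):.2f}")

# Changing material first: the very first block differs, so nothing matches.
turn1_bad = [("question_1", 1_500)] + base
turn2_bad = [("question_2", 1_500)] + base
print(f"volatile-first: {prefix_hit_fraction(turn1_bad, turn2_bad):.2f}")
```

The same 200K tokens of base material are present in both layouts; only their position decides whether they can be reused.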
The typical "no help, more cost" failures
When a large window fails to deliver as hoped, the failures have shapes.
First, the window is wide but the relevant range is scattered, so the model re-hunts for the fragments it needs every turn. Putting something in the window does not mean it gets read. Point explicitly at the relevant files and hand it the search targets; that stabilizes things.
Second, the cache breaks every turn, and the resident context is resent at full price. If usage grows cleanly in proportion to round trips, suspect this. Pinning the order stops it.
Third, reaching for 1M on a one-shot read. For a single summary or a single-file edit, widening the window buys almost nothing. Note the asymmetry: the decision function returns "resident is fine" mostly for small jobs that do not cross the tier at all.
For all three, decomposing the usage screen into "round trips" and "input tokens per round trip" makes the cause quick to spot. As an indie developer running several sites' updates unattended, I started logging just those two axes after each session, and the gap between my estimates and reality shrank considerably.
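The two-axis log can be as small as a list of per-turn records. The sketch below assumes you can obtain, for each round trip, the uncached input tokens and the cache-read tokens; the record fields are hypothetical and should be mapped to whatever your usage screen or API response actually reports.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    # Hypothetical fields; map them to what your usage data actually exposes.
    uncached_input_tokens: int
    cache_read_tokens: int


def summarize(turns: list[Turn]) -> dict:
    """Reduce a session to round trips, input tokens per trip, and cache hit."""
    n = len(turns)
    uncached = sum(t.uncached_input_tokens for t in turns)
    cached = sum(t.cache_read_tokens for t in turns)
    total = uncached + cached
    return {
        "round_trips": n,
        "input_tokens_per_trip": total / n if n else 0.0,
        "measured_cache_hit": cached / total if total else 0.0,
    }


# Invented session: the first turn loads everything uncached, later turns mostly hit.
session = [Turn(261_500, 0)] + [Turn(53_500, 208_000) for _ in range(11)]
print(summarize(session))
```

The measured_cache_hit value is the number to pass as cache_hit to cost_resident, or as cache_expected to should_use_large_context, in place of a guess.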
Next step
First, list three jobs you run often, and for each write down T_ctx, N, and your expected cache on paper. Feeding just those into the decision function above sorts them mechanically into "resident" and "slice."
Then, for only the jobs that land on the boundary, measure the real cache_hit from actual usage. Once the decision lives in a formula, a model change or a price revision only means swapping the unit prices. Intro pricing has an expiry, and long-context surcharges can change. That is exactly why keeping the numbers as arguments — and the decision itself immovable — is the easiest preparation for running things over the long haul.
I hope this helps with your own setup. Thank you for reading.