Claude Code · 2026-07-05 · Advanced

When to Use Claude Code's Native 1M Context — and When Not To: A Cost-Based Rule

With Sonnet 5 as the default, Claude Code now handles a native 1M-token context. A big window is convenient, but every token you park in it is billed again each turn. Should you load the whole repo, or feed slices? Here is a token-cost model you can estimate yourself and a decision rule that gives a concrete answer per situation, with working code and the traps to avoid.

I handed a large repository to Claude Code, felt reassured that it would "read all of it," and half an hour later opened the usage screen. I froze.

That single session had consumed several times my usual token count. The work itself finished correctly, but the job never needed a window that size. Had I fed it only the slices it needed, the same result would have cost far less.

On June 30, 2026, Claude Sonnet 5 became the default across all plans, and Claude Code gained a native one-million-token context. The old 1M was a beta limited to specific models. Now it is within reach by default. Which is exactly why the instinct to "load everything because the window is wide" quietly burns money.

This article turns "when to use a big window and when not to" into something you decide by arithmetic rather than by feel, with code you can run against your own price sheet and repo.

A big window changes cost, not speed

Let me clear up one misconception first. Widening the context does not make the model smarter, nor necessarily faster. What changes is what you can show it at once and what you pay every turn to do so.

As a conversation progresses, Claude Code repeatedly resends the prior exchange as input. Whatever sits in the window is billed as input tokens on every response. That is the crux. Keep a 10,000-token file resident across 20 round trips, and (outside of any cache) those 10,000 tokens can be billed roughly 20 times.

So the cost of a big window scales as "amount loaded × times you touch it." For a one-shot read it is noise; for long exploration or iterative refactoring, that multiplication is what bites.
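
To make that multiplication concrete, here is a minimal sketch of the resend arithmetic from the 10,000-token example. The function name is illustrative, and the sketch assumes no prompt cache and ignores the growing conversation history.

```python
def billed_input_tokens(resident_tokens: int, round_trips: int) -> int:
    """Input tokens billed for a block that stays resident and is resent
    on every round trip (assumption: no cache, history growth ignored)."""
    if resident_tokens < 0 or round_trips < 0:
        raise ValueError("token and round-trip counts must be non-negative")
    return resident_tokens * round_trips


for trips in (1, 5, 20):
    billed = billed_input_tokens(10_000, trips)
    print(f"{trips:>2} round trips -> {billed:,} input tokens billed")
```

The same file costs twenty times as much at 20 round trips as it does for a one-shot read, which is why the number of revisits belongs in the decision.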

The estimate: "everything resident" vs "sliced"

Let us put the decision in symbols.

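Symbol	Meaning
T_ctx	Tokens kept resident in the window (e.g. the whole repo)
T_slice(i)	Tokens of the file fragment sent on turn i when slicing
T_q	Tokens per instruction / question
T_out	Tokens per response
N	Round trips in the session
p_in / p_out	Input / output unit price (USD per million tokens)
s_in / s_out	Long-context surcharge multipliers (e.g. input ×2)
s_in'	Input multiplier that applies to a sliced turn (1.0 under the tier)
c	Prompt cache hit rate (0 to 1)
r_cache	Cache-read price divided by normal input price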

When you keep everything resident, the input cost is dominated by resending T_ctx every turn. The cached fraction c is billed at the cheaper cache-read rate, so the effective input cost is roughly:

input_cost(resident) ≈ N × T_ctx × p_in × s_in × (1 - c + c × r_cache)

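Here r_cache is the ratio of the cache-read price to the normal input price (often around 0.1; check your own price sheet). The formula makes the effect visible: as c rises from 0 to 1, the multiplier on the resident tokens falls from 1 to r_cache, so the higher c is, the cheaper resident becomes.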
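
As a quick illustration of the bracketed term in that formula, the sketch below tabulates it for a few hit rates. The function name is made up for this example, and r_cache = 0.1 is an assumed placeholder, not a quoted price.

```python
def resident_factor(c: float, r_cache: float = 0.1) -> float:
    """Multiplier on resident input tokens: the uncached share (1 - c)
    at full price plus the cached share c at the cache-read ratio."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must be between 0 and 1")
    return (1.0 - c) + c * r_cache


for c in (0.0, 0.5, 0.8, 1.0):
    print(f"c = {c:.1f} -> resident tokens billed at {resident_factor(c):.2f}x input price")
```

Under that assumption the factor runs 1.00, 0.55, 0.28, 0.10, so even a perfect cache leaves a floor of r_cache on every resident token.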

When you slice and feed only what each turn needs, you do not resend T_ctx; you send only the file fragment T_slice(i):

input_cost(sliced) ≈ Σ_i ( T_slice(i) × p_in × s_in' )

Here s_in' is the multiplier that applies to a sliced turn. It stays at the no-surcharge value (1.0) as long as the total you load stays under the long-context tier, and that is where slicing earns its keep. Pricing commonly changes the surcharge based on whether you cross the 200K-token tier, so slicing under the tier lowers the multiplier itself, not just the token count.

Written out it looks obvious, but one point matters in practice: resident cost spikes precisely when N is large, c is low, and T_ctx crosses the tier.

Turning the estimate into working code

Here is the formula as a function. Always confirm unit prices and surcharges against the current price sheet and pass them as arguments. The example below uses Sonnet 5's introductory price, and the long-context surcharge is a parameterized upper-tier estimate, so verify the actual multipliers on the official pricing page.

from dataclasses import dataclass
 
@dataclass
class Pricing:
    p_in: float          # input $/1M tokens
    p_out: float         # output $/1M tokens
    long_threshold: int  # above this token count, the long-context tier applies
    s_in_long: float     # input surcharge in the long tier
    s_out_long: float    # output surcharge in the long tier
    r_cache: float       # cache-read price / normal input price
 
def _mult(total_ctx_tokens: int, pr: Pricing):
    """Return surcharge multipliers if resident amount crosses the long tier."""
    if total_ctx_tokens > pr.long_threshold:
        return pr.s_in_long, pr.s_out_long
    return 1.0, 1.0
 
def cost_resident(T_ctx, T_q, T_out, N, pr: Pricing, cache_hit=0.0):
    """Approx cost ($) of keeping T_ctx resident across N round trips."""
    s_in, s_out = _mult(T_ctx + T_q, pr)
    resident_factor = (1 - cache_hit) + cache_hit * pr.r_cache
    in_resident = N * T_ctx * pr.p_in * s_in * resident_factor
    in_query    = N * T_q   * pr.p_in * s_in
    out_cost    = N * T_out * pr.p_out * s_out
    return (in_resident + in_query + out_cost) / 1_000_000
 
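def cost_sliced(slices, T_q, T_out, N, pr: Pricing):
    """Approx cost ($) of sending only slices[i] each round trip.
    slices is a list of length N (relevant tokens sent that turn)."""
    # Guard against a slice list that does not match the round-trip count.
    if len(slices) != N:
        raise ValueError(f"expected {N} slices, got {len(slices)}")
    total = 0.0
    for t_slice in slices:
        # The tier is checked per turn, so small slices avoid the surcharge.
        s_in, s_out = _mult(t_slice + T_q, pr)
        total += (t_slice + T_q) * pr.p_in * s_in
        total += T_out * pr.p_out * s_out
    return total / 1_000_000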
 
# --- Example using Sonnet 5 intro pricing (always verify current rates) ---
pr = Pricing(p_in=2.0, p_out=10.0,
             long_threshold=200_000,
             s_in_long=2.0, s_out_long=1.5,  # upper-tier estimate; verify
             r_cache=0.1)
 
T_ctx = 260_000       # whole repo resident in the window
slices = [30_000] * 12   # sliced: send only 30K relevant tokens each turn
args = dict(T_q=1_500, T_out=1_200, N=12)
 
resident = cost_resident(T_ctx, pr=pr, cache_hit=0.0, **args)
sliced   = cost_sliced(slices, pr=pr, **args)
print(f"resident (no cache): ${resident:.2f}")
print(f"sliced:              ${sliced:.2f}")
 
resident_cached = cost_resident(T_ctx, pr=pr, cache_hit=0.8, **args)
print(f"resident (80% cache): ${resident_cached:.2f}")

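Run it locally and the relationship shows up in numbers. With the placeholder prices above it prints about $12.77 for resident with no cache, $0.90 for sliced, and $3.78 for resident at an 80% cache hit rate. Resident with no cache is roughly fourteen times the sliced cost, and an 80% hit rate cuts the resident bill by about 70%, though in this particular example it still sits above slicing because the whole window is billed in the surcharged tier. The estimate teaches you that "should I use 1M" is nearly synonymous with "will the cache hit."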
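
One way to push the cache question further is to solve the resident formula for the hit rate at which it matches a given sliced cost. The sketch below does that under the same simplified model. The function name and argument order are illustrative, and all prices are the same placeholders as in the example above.

```python
from typing import Optional


def break_even_cache_hit(T_ctx: int, T_q: int, T_out: int, N: int,
                         sliced_cost_usd: float,
                         p_in: float, p_out: float,
                         s_in: float, s_out: float,
                         r_cache: float) -> Optional[float]:
    """Cache hit rate c at which resident cost equals sliced_cost_usd.

    Returns 0.0 if resident is already no more expensive without a cache,
    and None if even a 100% hit rate cannot reach break-even.
    """
    if T_ctx <= 0 or N <= 0 or not 0.0 <= r_cache < 1.0:
        raise ValueError("need T_ctx > 0, N > 0 and 0 <= r_cache < 1")
    # Resident input cost at c = 0, and the part of the bill c cannot reduce.
    resident_full = N * T_ctx * p_in * s_in / 1_000_000
    fixed = (N * T_q * p_in * s_in + N * T_out * p_out * s_out) / 1_000_000
    # Solve fixed + resident_full * (1 - c * (1 - r_cache)) = sliced_cost_usd.
    c = (resident_full + fixed - sliced_cost_usd) / (resident_full * (1 - r_cache))
    if c <= 0.0:
        return 0.0
    return c if c <= 1.0 else None


# 260K resident above the tier (surcharged) vs. a $0.90 sliced run
print(break_even_cache_hit(260_000, 1_500, 1_200, 12, 0.90,
                           2.0, 10.0, 2.0, 1.5, 0.1))
# 150K resident below the tier (no surcharge) vs. the same sliced run
print(break_even_cache_hit(150_000, 1_500, 1_200, 12, 0.90,
                           2.0, 10.0, 1.0, 1.0, 0.1))
```

With these placeholder numbers the first call returns None: above the tier, even a fully cached 260K window costs more than twelve 30K slices under this model. The second call lands near 0.89. Treat both as outputs of the assumptions, not as measured thresholds.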

Deciding mechanically whether to use it

Once you can estimate, make the decision a function too. Running a full cost comparison every time is heavy, so in practice a two-stage approach — route with cheap conditions first, and only estimate precisely at the boundary — stays light.

def should_use_large_context(T_ctx, N, cache_expected, pr: Pricing,
                             slice_ratio=0.12) -> tuple[bool, str]:
    """Decide whether to keep a 1M-scale context resident, and why."""
    # 1) If it never crosses the tier, resident is fine (no surcharge)
    if T_ctx <= pr.long_threshold:
        return True, "under the tier, resident is safe (no surcharge)"
    # 2) With one or two round trips, resident is simplest and resends are cheap
    if N <= 2:
        return True, "few round trips, resend cost is small"
    # 3) Many revisits with low cache -> slice
    if N >= 6 and cache_expected < 0.5:
        return False, "many revisits + low cache make resends expensive"
    # 4) If the relevant slice is a small fraction, slicing wins
    if slice_ratio <= 0.2 and cache_expected < 0.8:
        return False, "narrow per-turn relevance; slicing stays under the tier"
    # 5) Otherwise it depends on cache. Measure the boundary.
    return (cache_expected >= 0.6,
            "boundary case; decide resident/slice by measured cache")
 
for scenario in [
    (260_000, 12, 0.0),
    (260_000, 12, 0.85),
    (260_000, 2, 0.0),
    (150_000, 20, 0.0),
]:
    ok, why = should_use_large_context(*scenario, pr=pr)
    print(scenario, "->", "resident" if ok else "slice", "/", why)

The thresholds have reasons. At two round trips or fewer, the resend multiplication does not bite (condition 2). Conversely, at six or more round trips with under 50% cache, resident resends tend to exceed slicing (condition 3). If the relevant range is a fifth of the repo or less and the expected cache hit rate is under 80%, slicing wins and can avoid the long-context tier entirely (condition 4). Anything left over is decided on expected cache alone, with 0.6 as the cutoff (condition 5). Tune the numbers against your own measurements. What matters is moving your practice from "widen it, roughly" to "decide by condition."
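
The two-stage idea (cheap rules first, a precise estimate only at the boundary) can be sketched as a single router. This is a simplified variant, not a drop-in replacement for the function above: it keeps the first three rules, swaps the remaining conditions for an input-cost comparison, ignores T_q and output tokens, and uses placeholder constants that are assumptions.

```python
LONG_THRESHOLD = 200_000   # assumed long-context tier boundary (tokens)
P_IN = 2.0                 # assumed input price, $ per 1M tokens
S_IN_LONG = 2.0            # assumed input surcharge above the tier
R_CACHE = 0.1              # assumed cache-read price / normal input price


def _input_cost(tokens_per_turn: int, n: int, factor: float = 1.0) -> float:
    """Input cost ($) of sending tokens_per_turn on each of n turns."""
    mult = S_IN_LONG if tokens_per_turn > LONG_THRESHOLD else 1.0
    return n * tokens_per_turn * P_IN * mult * factor / 1_000_000


def route(T_ctx: int, N: int, cache_expected: float, T_slice: int):
    """Stage 1: cheap rules. Stage 2: estimate only when stage 1 is undecided."""
    if T_ctx <= LONG_THRESHOLD:
        return "resident", "rule: under the tier"
    if N <= 2:
        return "resident", "rule: few round trips"
    if N >= 6 and cache_expected < 0.5:
        return "slice", "rule: many revisits, low cache"
    # Boundary case: compare the estimated input costs directly.
    factor = (1 - cache_expected) + cache_expected * R_CACHE
    resident = _input_cost(T_ctx, N, factor)
    sliced = _input_cost(T_slice, N)
    choice = "resident" if resident <= sliced else "slice"
    return choice, f"estimate: resident ${resident:.2f} vs sliced ${sliced:.2f}"


print(route(260_000, 12, 0.9, 120_000))
print(route(260_000, 12, 0.6, 120_000))
```

With 120K relevant tokens per turn, the 90% case routes to resident and the 60% case to slice, which shows how sensitive the boundary is to the cache figure you feed in.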

Ordering to make the cache hit

Because the decision so often lands on "it depends on cache," it is worth knowing how to earn cache hits. Prompt caching only applies to a contiguous prefix that matches from the start. So put what does not change (the stable base of the repo, conventions, type definitions) first, and what changes (the current question, the latest diff) last.

If the ordering shifts every turn, the cache is effectively disabled. When you give Claude Code a large context, pin the "immovable block" at the front. That alone can move the cache_hit in the estimate above from 0 toward 0.7 or 0.8, and bring the resident cost into a realistic range.

Concretely: load the base early in the session before starting work, do not reshuffle the order in which you present file groups mid-session, and do not wedge large unrelated outputs in between.
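
The prefix rule can be illustrated with a block-level approximation. The sketch below treats each turn's prompt as an ordered list of labeled blocks and counts how many tokens sit in the leading run that is identical to the previous turn. The block labels and sizes are hypothetical, and real caching works on token prefixes with provider-specific rules, so read this as a rough model only.

```python
from typing import List, Tuple

Block = Tuple[str, int]  # (hypothetical block label, token count)


def cacheable_fraction(prev: List[Block], curr: List[Block]) -> float:
    """Share of the current turn's tokens that lie in the leading run of
    blocks identical, in order, to the previous turn's blocks."""
    matched = 0
    for before, now in zip(prev, curr):
        if before != now:
            break  # the first mismatch ends the reusable prefix
        matched += now[1]
    total = sum(tokens for _, tokens in curr)
    return matched / total if total else 0.0


stable = [("conventions", 5_000), ("types", 15_000), ("repo_base", 200_000)]

# Stable blocks first, changing question last: almost everything is reusable.
stable_first = cacheable_fraction(stable + [("question_1", 1_500)],
                                  stable + [("question_2", 1_500)])

# Changing question first: the prefix breaks at block zero.
question_first = cacheable_fraction([("question_1", 1_500)] + stable,
                                    [("question_2", 1_500)] + stable)

print(f"stable first:   {stable_first:.3f}")
print(f"question first: {question_first:.3f}")
```

The content is the same in both layouts. Only the order differs, and that is what decides whether the large block is eligible for reuse.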

The typical "no help, more cost" failures

When a large window fails to deliver as hoped, the failures have shapes.

First, the window is wide but the relevant range is scattered, so the model re-hunts for the fragments it needs every turn. Putting something in the window does not mean it gets read. Point explicitly at the relevant files and hand it the search targets; that stabilizes things.

Second, the cache breaks every turn, and the resident context is resent at full price. If usage grows cleanly in proportion to round trips, suspect this. Pinning the order stops it.

Third, reaching for 1M on a one-shot read. For a single summary or a single-file edit, widening the window buys almost nothing. Note the asymmetry: the decision function returns "resident is fine" for small jobs that do not cross the tier and for one or two round trips, and in both cases the reason is that resending is cheap, not that the wide window helps.

For all three, decomposing the usage screen into "round trips" and "input tokens per round trip" makes the cause quick to spot. As an indie developer running several sites' updates unattended, I started logging just those two axes after each session, and the gap between my estimates and reality shrank considerably.
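
A minimal sketch of that two-axis log might look like the following. It assumes you can read per-turn input token counts off the usage screen; the function names and the sample session are made up for illustration.

```python
from typing import Dict, List


def decompose_usage(input_tokens_by_turn: List[int]) -> Dict[str, float]:
    """Split a session's input usage into the two axes: round trips and
    input tokens per round trip."""
    trips = len(input_tokens_by_turn)
    total = sum(input_tokens_by_turn)
    return {
        "round_trips": trips,
        "total_input_tokens": total,
        "input_tokens_per_trip": total / trips if trips else 0.0,
    }


def gap_vs_estimate(measured_total: int, T_ctx: int, T_q: int, N: int) -> float:
    """Ratio of measured input tokens to the resident estimate N * (T_ctx + T_q)."""
    estimate = N * (T_ctx + T_q)
    return measured_total / estimate if estimate else 0.0


# Hypothetical session: 12 turns, each resending the full window.
session = [261_500] * 12
summary = decompose_usage(session)
print(summary)
print(gap_vs_estimate(summary["total_input_tokens"], 260_000, 1_500, 12))
```

A ratio near 1.0 means the session behaved like the fully resident, uncached estimate. A ratio well above it suggests more round trips or more per-turn input than you planned for.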

Next step

First, list three jobs you run often, and for each write down T_ctx, N, and your expected cache hit rate on paper. Feeding just those into the decision function above sorts them mechanically into "resident" and "slice."

Then, for only the jobs that land on the boundary, measure the real cache_hit from actual usage. Once the decision lives in a formula, a model change or a price revision only means swapping the unit prices. Intro pricing has an expiry, and long-context surcharges can change. That is exactly why keeping the numbers as arguments — and the decision itself immovable — is the easiest preparation for running things over the long haul.
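
For the measurement step, one simple approach is to take the share of input tokens that were served from cache. The sketch below assumes you can obtain, per round trip, the uncached input tokens and the cache-read tokens. The tuple layout and sample numbers are hypothetical, and the result only approximates the c used in the estimate.

```python
from typing import Iterable, Tuple


def measured_cache_hit(turns: Iterable[Tuple[int, int]]) -> float:
    """turns: (uncached_input_tokens, cache_read_tokens) per round trip.
    Returns the share of all input tokens that were read from cache."""
    uncached = cached = 0
    for uncached_tokens, cache_read_tokens in turns:
        uncached += uncached_tokens
        cached += cache_read_tokens
    total = uncached + cached
    return cached / total if total else 0.0


# Hypothetical session: the first turn pays full price for the window,
# the next eleven read the 260K base from cache and add a 1.5K question.
turns = [(261_500, 0)] + [(1_500, 260_000)] * 11
hit = measured_cache_hit(turns)
print(f"measured cache hit rate: {hit:.2f}")
```

Feed the measured value back in as cache_hit or cache_expected, and the boundary cases resolve against your own numbers instead of a guess.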

I hope this helps with your own setup. Thank you for reading.
