A Million-Token Window Made Me Stop Chunking My Prompts — Notes From Rebuilding My Preprocessing

When I handed a crash log from one of my wallpaper apps to Claude, the part that ate my time was never the analysis itself. It was deciding which fragments to include. The stack trace, the relevant Activity code, the most recent diff. None of it fit in the old context window all at once, so I chopped it into sections and pasted only the parts I guessed were important.

On June 9th, Claude Fable 5 reached general availability, and a one-million-token context window became something I could simply use from the API. The first thing I tried was not flashy. It was a quiet test of one question: can I stop chunking? The short answer is that my preprocessing got noticeably lighter. But it was not the "stuff everything in and it gets smarter" story I half expected, and it took me a few detours to understand why. These are those notes.

The Invisible Time I Spent on "Splitting"

Splitting sounds tidy, but what I was really doing was deciding what to throw away. If the cause of a crash lived outside the fragment I pasted, Claude's answer naturally could not reach it. So I would guess "probably around here," paste again, miss, and paste again: a loop of small retries.

The awkward part is that this loop never shows up in any log. A few minutes each time feels like nothing, but a few round trips per analysis, every day, quietly add up. As an indie developer, I have no one to ask "can you look at this part for me," so the work of choosing which fragment to hand over carries more weight than it seems. What I first hoped a larger window would buy me was not raw intelligence — it was the end of that loop.

Sending Everything Did Not Make It Smarter

In my first test I pasted the whole module and the full log as they were. The "missing fragment" failure did indeed disappear. But a different dullness crept in. Even on simple crashes with an obvious cause, the answers came back slightly blurrier than before, drifting toward the safe "there are a few possibilities here."

In hindsight this was obvious. Hand over unrelated code alongside the relevant part, and the model's attention is diluted across all of it. A large window is not a device that decides what matters for you. What I got to let go of was the work of picking fragments; what I could not let go of was the responsibility of signaling what is important.

That was the most useful realization for me. The real value of a million tokens is not "read it all and be brilliant." It is folding away the time you spend agonizing over what to include. So now I hand over context in one piece, while writing the question itself more pointedly than before about where I want the model to look.

How I Actually Rebuilt My Preprocessing

Concretely, I settled on putting the large, stable context (the whole module or the full log) at the front as a fixed block, and letting only the trailing question change. I also place a cache boundary at the end of that large block. Because I often ask several questions of the same log from different angles, turning the later calls into cache reads makes the cost visibly settle down.

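from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # the key is read from the ANTHROPIC_API_KEY environment variable


# read_module / read_log are plain helpers that return the target file as a string;
# these minimal stand-ins just read the file from disk
def read_module(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def read_log(path: str) -> str:
    # logs can contain stray bytes, so replace anything that does not decode
    return Path(path).read_text(encoding="utf-8", errors="replace")


# large, stable context: the whole module plus the full log
stable_context = read_module("WallpaperListActivity.kt") + "\n\n" + read_log("crash_20260623.txt")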
 
resp = client.messages.create(
    # confirm the exact model identifier against the latest API docs
    model="claude-fable-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You help analyze crashes in an Android app."},
        {
            "type": "text",
            "text": stable_context,
            "cache_control": {"type": "ephemeral"},  # cache up to this boundary
        },
    ],
    messages=[
        # only the trailing question changes; the large context is reused
        {"role": "user", "content": "For the IndexOutOfBoundsException in this log, which line and which broken assumption causes it? Show only the relevant spot."}
    ],
)
 
print(resp.content[0].text)
print(resp.usage)  # first call: cache_creation_input_tokens is set; repeat calls: cache_read_input_tokens grows and full-rate input_tokens shrinks

The key is that the question is narrowed to "show only the relevant spot" rather than "read it all and think." The more context you provide, the more that one focusing sentence in the question earns its keep. Watching resp.usage on each call, you can confirm the pattern: the first call writes the cache and reports the large block under cache_creation_input_tokens, and later calls report it under cache_read_input_tokens while the full-rate input_tokens shrinks. I wrote more about dividing that budget in "how to allocate the Claude API context budget."
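To make the resp.usage check less of an eyeball exercise, here is a minimal sketch that turns the usage counters into a single number. The function name cache_read_share is hypothetical, and the token counts in the example are made up for illustration. The three field names are the ones the usage object reports.

```python
def cache_read_share(usage: dict) -> float:
    """Return the fraction of this call's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    fresh = usage.get("input_tokens") or 0  # tokens billed at the full input rate
    total = read + written + fresh
    return read / total if total else 0.0


# Made-up numbers: the first call writes the cache, the second call reads it.
first_call = {"input_tokens": 40, "cache_creation_input_tokens": 90_000, "cache_read_input_tokens": 0}
second_call = {"input_tokens": 40, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 90_000}

print(f"first call:  {cache_read_share(first_call):.3f}")
print(f"second call: {cache_read_share(second_call):.3f}")
```

With the Python SDK, resp.usage.model_dump() should give a dict in this shape, though that is worth confirming against the SDK version in use. A share that stays near zero on repeat calls suggests the prefix is changing between calls and the cache is never hit.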

From "Cram It In" to "Cache It"

A wider window tempts you to think "fit in as much as possible." But everything you add is billed at full rate every time — unless it lands in cache. The pivot I made in real operation was from "how much can I cram in" to "what can I cache."
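The arithmetic behind that pivot can be sketched in a few lines. This is a rough model, not a billing calculator: costs are in units of "one full-rate input token," the function name relative_input_cost is hypothetical, and the default multipliers (2.0 for a one-hour cache write, 0.1 for a cache read) are assumptions based on my reading of the published prompt caching pricing, so check them against the current price list. It also assumes every repeat call arrives while the cache is still alive.

```python
def relative_input_cost(context_tokens, question_tokens, calls,
                        write_mult=2.0, read_mult=0.1):
    """Return (uncached, cached) input cost in units of one full-rate input token.

    write_mult and read_mult are assumed pricing multipliers, not values
    fetched from the API.
    """
    if calls <= 0:
        return 0.0, 0.0
    # Without caching, the whole context is billed at full rate on every call.
    uncached = calls * (context_tokens + question_tokens)
    # With caching, the first call pays the write premium on the context...
    first = context_tokens * write_mult + question_tokens
    # ...and each later call pays the read rate on it plus the new question.
    later = (calls - 1) * (context_tokens * read_mult + question_tokens)
    return float(uncached), float(first + later)


uncached, cached = relative_input_cost(100_000, 100, calls=5)
print(f"5 calls, uncached: {uncached:,.0f}  cached: {cached:,.0f}")

uncached_1, cached_1 = relative_input_cost(100_000, 100, calls=1)
print(f"1 call,  uncached: {uncached_1:,.0f}  cached: {cached_1:,.0f}")
```

Under these assumptions a single call is more expensive with caching than without, because of the write premium. The saving only appears once the same block is reused, which is why "what can I cache" is really a question about how often the prefix repeats.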

Keep a stable, large context in a one-hour cache, and you can throw many angled questions at it while paying full rate for little more than the trailing question. I detailed the design of splitting cache lifetimes between five minutes and one hour in "designing Claude API prompt caching in two tiers, 5m and 1h." The wider window quietly rewrote the formula in my head from "tokens times intelligence" into "how well can I reuse the part that does not change."
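For the one-hour variant, the only change to the request is on the cache boundary itself. The sketch below builds the system blocks with a ttl field on cache_control. The helper name build_system_blocks is hypothetical, and the "ttl" values "5m" and "1h" reflect my understanding of the prompt caching documentation. Older SDK or API versions may need a beta header for the one-hour lifetime, so confirm this before relying on it.

```python
def build_system_blocks(instructions: str, stable_context: str, ttl: str = "1h") -> list[dict]:
    """Build system blocks with a cache boundary after the large, stable context."""
    if ttl not in ("5m", "1h"):
        raise ValueError(f"unsupported cache ttl: {ttl!r}")
    return [
        # Short instructions come first and sit inside the cached prefix.
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": stable_context,
            # Everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral", "ttl": ttl},
        },
    ]


blocks = build_system_blocks(
    "You help analyze crashes in an Android app.",
    "<module source and full log go here>",
)
print(blocks[1]["cache_control"])
```

The returned list can be passed as the system argument of client.messages.create. Keeping the lifetime as a parameter makes it easy to use the shorter one for quick back-and-forth and the longer one for a log that gets revisited over an afternoon.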

Where I Still Keep Chunking

I did not abolish splitting entirely. For work where the context changes a lot from call to call, caching barely helps, so the advantage of sending it all thins out. There, the old approach of passing only the needed range was simply cleaner.

The other case is when I want the model to search. If I need it to find one anomaly buried in a huge log, narrowing the relevant range myself before sending it tends to sharpen the answer. A large window is a convenient tool, but it will not take over the job of deciding what to look at. Even in solo development, that part still comes down to my own judgment. For the mechanics of the 1M window itself, the one-million-token context migration notes are worth a look.
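Narrowing a huge log before sending it can be as simple as keeping a window of lines around each match. The following is a minimal sketch of that idea. The function name narrow_log and the window sizes are hypothetical choices for illustration, and the sample log is synthetic.

```python
import re


def narrow_log(log_text: str, pattern: str, before: int = 20, after: int = 40) -> str:
    """Keep only the lines near each regex match; return '' when nothing matches."""
    lines = log_text.splitlines()
    regex = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - before)
            stop = min(len(lines), i + after + 1)
            keep.update(range(start, stop))
    # Overlapping windows merge naturally because indices live in a set.
    return "\n".join(lines[i] for i in sorted(keep))


# Synthetic log: 100 filler lines with one exception in the middle.
sample = [f"line {i}" for i in range(100)]
sample[50] = "java.lang.IndexOutOfBoundsException: Index 3 out of bounds"
excerpt = narrow_log("\n".join(sample), r"IndexOutOfBoundsException", before=2, after=1)
print(excerpt)
```

This version does not mark the gaps between separate windows, so if several matches are far apart, adding a separator line between them would help the reader of the excerpt. Choosing the pattern and the window size is still the human judgment call the paragraph above describes.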

If You Want to Try It Next

Pick just one task you always split before sending, and rebuild it into the shape of "stable context at the front with a cache boundary, question narrowed at the end." If cache_read_input_tokens in resp.usage starts climbing, that task is a good candidate for getting real benefit from the larger window. I have not migrated all of my preprocessing yet, and I am still verifying one task at a time, but the shrinking of those round trips is real.