Claude.ai · 2026-06-15 · Advanced

When a Long-Running Agent's Context Quietly Decays — Budgeting and Compaction

An agent that runs all night gets sloppier by morning. The cause is dilution from accumulated context. Here is how to treat context as a budget, measure its decay, and keep it healthy with compaction — with working code and field notes.



By Morning, the Judgment Had Gotten Sloppy

The first odd thing I noticed about an agent I run overnight was that its commit messages suddenly turned curt by morning. At midnight the output traced the context carefully; at six in the morning, on the same kind of task, it returned terse replies that seemed to ignore the first half of the instructions. I hadn't changed the model, and I hadn't changed the prompt. The only thing that had changed was the volume of context piled up in the session.

This is not unusual. A prompt that works perfectly in a one-off conversation stops behaving as expected when you let it run for a long time. Most of the time the cause is not the model's capability but the design of the context. In agents that call tools dozens of times in particular, past tool results and conversation history quietly accumulate and dilute the instructions you actually want to land.

This article is about keeping a long-running agent's context healthy. We'll treat context as a budget to allocate, measure its decay, and fire compaction at a threshold — three things, with working code. It is context engineering in the broad sense, but the focus is narrow: the specific way context rots in agents that keep running.

Why Context Rots — Accumulation, Dilution, Position Effects

What happens in a long session breaks down roughly into three things.

The first is accumulation. Every time the agent calls a tool, the input and output stay in the history. A single file read might be a few thousand tokens, a search result ten thousand — over dozens of turns, the share occupied by your actual instructions keeps shrinking.

The second is dilution. The context window may be wide, but the model's effective attention (what it can strongly reference at once) is not infinite. When an important constraint is buried under a mass of intermediate logs, its relative weight drops. In my own experience running several agents autonomously as an indie developer, responses that broke constraints stated at the top became visibly more frequent once history crossed roughly 150K tokens, even with the same system prompt.

The third is the position effect. In long contexts, information placed in the middle tends to be overlooked compared with information near the start or the end. This is the "lost in the middle" phenomenon, and an important fact tucked into the center of a long tool log carries far less weight than you would expect.

The conclusion is simple: more context is not always better. Context needs to be treated as a finite resource whose allocation you design.
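The accumulation effect is easy to make concrete with a little arithmetic. The sketch below uses hypothetical numbers (2,000 tokens of instructions, 4,000 tokens added per tool call) that are not measurements from the article; the function name is also just illustrative.

```python
def instruction_share(instruction_tokens: int, tokens_per_turn: int, turns: int) -> float:
    """Fraction of the context occupied by the instructions after `turns` tool calls."""
    total = instruction_tokens + tokens_per_turn * turns
    return instruction_tokens / total


# Hypothetical sizes, chosen only to show the trend.
for turns in (0, 10, 40):
    print(turns, f"{instruction_share(2_000, 4_000, turns):.1%}")
# 0 100.0%
# 10 4.8%
# 40 1.2%
```

Even at a modest per-turn size, the instructions fall to a few percent of the context within ten tool calls.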

Allocating Context as a Budget

First, split context into role-based "layers" and assign each a token ceiling. The crux of this design is to decide the allocation rather than the total.

Layer	Role	Target share	Compactable
Fixed (system)	Role, constraints, output rules	5–10%	No (always full)
Knowledge (retrieved)	RAG, reference docs	20–30%	Yes (swapped per turn)
History (conversation, tools)	Past exchanges	40–50%	Yes (summarize / drop)
Working (recent)	Current task and last few turns	15–25%	No (freshness first)

The key is to separate the compactable layers from the non-compactable ones up front. The system prompt and the most recent working context live and die by freshness and completeness, so don't trim them. The knowledge and history layers, on the other hand, are designed to be swapped and summarized.

Expressed as a minimal class, budget management looks like this.

from dataclasses import dataclass
 
@dataclass
class ContextBudget:
    total: int = 200_000          # the model's window
    reserve_output: int = 16_000  # held back for output
    # allocate the usable input ceiling across layers
    def allocation(self) -> dict[str, int]:
        usable = self.total - self.reserve_output
        return {
            "system":  int(usable * 0.08),
            "work":    int(usable * 0.22),
            "knowledge": int(usable * 0.30),
            "history": int(usable * 0.40),
        }
 
def count_tokens(client, model, blocks) -> int:
    # measure with the official token-count API, not an estimate
    return client.messages.count_tokens(model=model, messages=blocks).input_tokens

Don't settle for an estimate like "characters ÷ 4." Token efficiency varies a lot between languages and between code and prose, so base budget decisions on values measured with count_tokens. Run on estimates, and sooner or later a request will fail on a window overflow.
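Once you have an allocation and a counter, comparing the two per layer is a few lines. This is a minimal sketch under assumptions: `layer_usage`, `over_budget`, and `fake_count` are hypothetical names, and `fake_count` is only a stand-in so the example runs offline. In real use the counter passed in would wrap the measured count_tokens call.

```python
from typing import Callable


def layer_usage(layers: dict[str, list], count: Callable[[list], int]) -> dict[str, int]:
    """Token count per layer, using whatever counter is passed in."""
    return {name: (count(blocks) if blocks else 0) for name, blocks in layers.items()}


def over_budget(usage: dict[str, int], alloc: dict[str, int]) -> dict[str, int]:
    """Layers that exceed their ceiling, mapped to the size of the overshoot."""
    return {
        name: used - alloc.get(name, 0)
        for name, used in usage.items()
        if used > alloc.get(name, 0)
    }


def fake_count(blocks: list) -> int:
    # Stand-in counter for the demo only: counts whitespace-separated words.
    return sum(len(str(b).split()) for b in blocks)


layers = {"system": ["follow the rules"], "history": ["a b c d e f"], "work": []}
usage = layer_usage(layers, fake_count)
excess = over_budget(usage, {"system": 5, "history": 4, "work": 5})
print(usage, excess)
```

Keeping the counter as a parameter means the same breakdown code works with the real API in production and with a stub in tests.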

Measure the Decay — Don't Compact Without Observing

When to fire compaction is decided by numbers, not instinct. Here are the three I watch constantly.

History occupancy is what fraction of its allotted budget the history layer is using. The closer it gets to 1.0, the less room there is for new tool results, and old information starts crowding out new instructions. Once it exceeds 0.8, I start considering compaction.

System-constraint distance is the actual token distance from the end of the system prompt to the current generation position. The longer the distance, the weaker the effect of the constraints at the top. If the distance exceeds 150K tokens, consider re-injecting the constraints into the working layer.

Retrieval hit rate is the share of documents placed in the knowledge layer that were actually referenced in the response. If this stays low while you keep thickening the knowledge layer, that is wasted budget. It's the cue to steer from stuffing toward narrowing.

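def context_health(usage: dict, alloc: dict) -> dict:
    # occupancy of the history layer relative to its ceiling
    history_ratio = usage["history"] / alloc["history"]
    # tokens sitting between the end of the system prompt and the generation point
    system_distance = usage["history"] + usage["knowledge"] + usage["work"]
    return {
        "history_ratio": round(history_ratio, 2),
        "system_distance": system_distance,
        # either signal alone is enough to schedule compaction
        "compact_now": history_ratio > 0.8 or system_distance > 150_000,
    }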

When compact_now turns true, compact the history layer before the next turn. You tune the thresholds as you operate, but in long-running operation the stance that pays off is "get ahead of the signs of decay," not "deal with it after it crashes."
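The retrieval hit rate needs some way to tell which documents a response actually used. One possible approach, sketched here as an assumption rather than the author's method, is to ask the agent to cite documents with a marker such as `[doc:ID]` and count the cited IDs. The marker format and the function name are hypothetical.

```python
import re

# Hypothetical citation convention: the agent is instructed to cite as [doc:ID].
CITATION = re.compile(r"\[doc:([\w-]+)\]")


def retrieval_hit_rate(supplied_ids: list[str], response_text: str) -> float:
    """Share of supplied documents that the response cites at least once."""
    supplied = set(supplied_ids)
    if not supplied:
        return 0.0
    cited = set(CITATION.findall(response_text))
    # IDs cited but never supplied are ignored rather than counted as hits.
    return len(supplied & cited) / len(supplied)


reply = "Per [doc:a1] and [doc:c3], the limit applies. See also [doc:zz]."
print(retrieval_hit_rate(["a1", "b2", "c3", "d4"], reply))  # 0.5
```

Citation counting undercounts documents the model used without citing, so treat the number as a lower bound and watch its trend rather than its absolute value.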

Implementing Compaction — When, What, and How to Drop

There are two main techniques for compaction: summarization and selective dropping.

Summarization replaces a batch of old conversation turns with one concise note. It keeps the rationale behind decisions while folding away verbose exchanges. The most recent turns are not summarized, preserving freshness.

SUMMARIZE_SYSTEM = (
    "Below is an agent's past log. Keep only the decisions, "
    "constraints, and open issues needed for later judgment, "
    "as concise bullet points. You may omit completed, verbose exchanges."
)
 
def _render(turns):
    # Assumed helper (the original leaves it undefined): flatten turns into
    # plain "role: content" lines for the summarizer.
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns)
 
def compact_history(client, model, old_turns, keep_recent=6):
    # keep_recent must be > 0; with 0 the slices below would archive nothing
    archived = old_turns[:-keep_recent]      # only the old part is summarized
    recent = old_turns[-keep_recent:]        # leave the recent turns alone
    if not archived:
        return old_turns
    summary = client.messages.create(
        model=model,
        max_tokens=1024,
        system=SUMMARIZE_SYSTEM,
        messages=[{"role": "user",
                   "content": _render(archived)}],
    ).content[0].text
    memo = {"role": "user",
            "content": f"[Summary so far]\n{summary}"}
    return [memo] + recent
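One thing to watch with a fixed keep_recent slice is that the kept tail can begin on a tool result whose matching tool call was just summarized away, and the API rejects a tool_result without its tool_use. A minimal sketch of one way to avoid that, assuming turns are Messages-style dicts whose content may be a list of typed blocks; `has_tool_result` and `safe_split` are hypothetical helpers.

```python
def has_tool_result(turn: dict) -> bool:
    """True if the turn's content list contains a tool_result block."""
    content = turn.get("content")
    if not isinstance(content, list):
        return False
    return any(isinstance(b, dict) and b.get("type") == "tool_result" for b in content)


def safe_split(turns: list[dict], keep_recent: int = 6) -> int:
    """Index where the kept tail starts, moved earlier so it never starts on a tool_result."""
    split = max(len(turns) - keep_recent, 0)
    while 0 < split < len(turns) and has_tool_result(turns[split]):
        split -= 1  # pull the preceding tool_use turn into the kept tail
    return split


turns = [
    {"role": "user", "content": "read the config"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t1", "name": "read", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1", "content": "..."}]},
    {"role": "assistant", "content": "done"},
    {"role": "user", "content": "next task"},
]
print(safe_split(turns, keep_recent=3))  # 1
```

With the index in hand, `turns[:split]` is the part to summarize and `turns[split:]` is the part to keep verbatim.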

Selective dropping removes tool results you will no longer reference from the context. With the Context Editing feature of the Claude API, you can clear old tool results automatically. For agents that repeatedly read large files or run searches, this is what helps most.

resp = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=conversation,
    tools=tools,
    betas=["context-management-2025-06-27"],
    context_management={
        "edits": [{
            # versioned edit type and field names as in the context-editing
            # beta docs; check the current reference before relying on them
            "type": "clear_tool_uses_20250919",
            # keep the latest tool results, clear from the oldest
            "keep": {"type": "tool_uses", "value": 8},
            "clear_at_least": {"type": "input_tokens", "value": 20_000},
        }]
    },
)

Keep in mind that summarization and dropping have different jobs. Summarization is for keeping the "context of judgment"; dropping is for removing "raw data you no longer need." Drop large raw data like search results and file bodies, and keep the trail of reasoning through summarization. That division is the shortest path to a lighter context that stays smart.
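If you manage the history yourself rather than through the API feature, selective dropping can also be done on the client. The following is a rough sketch, not the Context Editing implementation: it stubs out the content of every tool result except the newest few, leaving the tool_use and tool_result pairing intact. `drop_old_tool_results` and `PLACEHOLDER` are hypothetical names.

```python
import copy

PLACEHOLDER = "[tool result cleared]"


def drop_old_tool_results(turns: list[dict], keep: int = 8) -> list[dict]:
    """Return a copy of turns with all but the newest `keep` tool results stubbed out."""
    out = copy.deepcopy(turns)  # never mutate the caller's history
    results = [
        block
        for turn in out
        if isinstance(turn.get("content"), list)
        for block in turn["content"]
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    stale = results[:-keep] if keep > 0 else results
    for block in stale:
        block["content"] = PLACEHOLDER
    return out


def _result(tool_id: str, body: str) -> dict:
    # Demo helper: one user turn carrying a single tool result.
    return {"role": "user",
            "content": [{"type": "tool_result", "tool_use_id": tool_id, "content": body}]}


history = [_result("t1", "AAA"), _result("t2", "BBB"), _result("t3", "CCC")]
trimmed = drop_old_tool_results(history, keep=1)
print([t["content"][0]["content"] for t in trimmed])
```

Because the blocks themselves stay in place, the message sequence remains valid while the bulky payloads disappear.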

Retrieval Is Narrowing, Not Stuffing

A common failure in the knowledge layer is dumping in the top 20 documents that look relevant. It eats tokens without raising the hit rate. The low retrieval hit rate from earlier is usually caused by exactly this.

What works is a two-stage approach: gather broadly with search, then rerank down to the few you actually need.

def retrieve(query, store, k_search=40, k_final=5):
    candidates = store.search(query, top_k=k_search)   # gather broadly first
    reranked = rerank(query, candidates)               # reorder by relevance
    picked = reranked[:k_final]                         # narrow to a few
    # fit within the knowledge-layer budget; drop from the tail if it overflows
    return trim_to_budget(picked, budget="knowledge")

The essence is discarding the "just in case" inclusions. Everything you add increases the risk of dilution and position effects, so responses are more stable when you leave out documents you aren't confident about. In my own operation, going from a naive top-20 dump to five after reranking reduced token consumption and, if anything, improved how precisely the references landed.
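To make the two-stage idea runnable end to end, here is a self-contained sketch with toy stand-ins for the pieces the snippet above leaves undefined. The word-overlap scorer is only a placeholder for a real reranker, the word-count `count` default is a placeholder for measured tokens, and every signature here is an assumption made for illustration.

```python
from typing import Callable


def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: number of distinct query words that appear in the document."""
    words = set(doc.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def rerank(query: str, docs: list[str]) -> list[str]:
    # sorted() is stable, so ties keep their original search order
    return sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)


def trim_to_budget(docs: list[str], budget: int, count: Callable[[str], int]) -> list[str]:
    """Keep documents in order until the next one would exceed the token budget."""
    kept, used = [], 0
    for doc in docs:
        cost = count(doc)
        if used + cost > budget:
            break  # drop this document and everything after it
        kept.append(doc)
        used += cost
    return kept


def retrieve(query: str, candidates: list[str], k_final: int = 5, budget: int = 1_000,
             count: Callable[[str], int] = lambda d: len(d.split())) -> list[str]:
    return trim_to_budget(rerank(query, candidates)[:k_final], budget, count)


candidates = ["cache breakpoints explained", "compaction threshold tuning for agents",
              "cooking pasta", "agents and compaction"]
print(retrieve("compaction agents", candidates, k_final=2, budget=6))
```

With a budget of six words, only the first reranked document fits, which is the "drop from the tail" behavior described above.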

Prompt Cache and the Budget

In long-running agents, prompt-cache design affects both budget and cost. The key is to put what doesn't change in front, and what changes behind.

system = [
    {"type": "text", "text": ROLE_AND_RULES,
     "cache_control": {"type": "ephemeral"}},   # cache the fixed layer
]
# knowledge and history go behind the cache; swapping them won't invalidate it

Note where compaction design collides with cache design. Rewriting history through summarization invalidates the cache beyond that point. That's why it's better to compact in a batch when a threshold is crossed, rather than every turn — it preserves the cache hit rate. Get ahead with the decay signals, but keep the frequency down: that balance is where operational skill shows.
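One way to keep compaction infrequent is a pair of watermarks: do nothing until occupancy crosses a high mark, then compact far enough in a single batch to land on a low mark. This is a sketch of that idea under assumptions; the 0.8 high mark comes from the text above, while the 0.4 low mark and the `CompactionGate` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CompactionGate:
    high: float = 0.8  # fire when history occupancy exceeds this
    low: float = 0.4   # hypothetical target occupancy after one batch

    def tokens_to_free(self, used: int, allotted: int) -> int:
        """0 while at or under the high watermark; otherwise enough to reach the low one."""
        if allotted <= 0 or used / allotted <= self.high:
            return 0
        return used - int(allotted * self.low)


gate = CompactionGate()
print(gate.tokens_to_free(70_000, 100_000))  # 0
print(gate.tokens_to_free(85_000, 100_000))  # 45000
```

The wider the gap between the two marks, the more turns pass between cache invalidations, at the price of a larger summary job each time.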

Cost is also impossible to ignore in 2026. The billing change that took effect on 2026-06-15 moved the Agent SDK and headless execution out from under the subscription cap and onto monthly credits billed at the actual API rate. In other words, needlessly bloated context comes straight back at you as token charges. Keeping context light has become a matter of operating cost, not just quality. Measure your real cost, and tune the compaction thresholds while watching the cost figures too.
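A back-of-the-envelope comparison shows how context size turns into per-turn cost. The prices below are hypothetical placeholders, not actual rates, and the function is only an illustrative sketch; substitute the figures from your own billing page.

```python
def turn_cost_usd(cached_tokens: int, fresh_tokens: int, output_tokens: int, *,
                  input_per_mtok: float, cached_read_per_mtok: float,
                  output_per_mtok: float) -> float:
    """Cost of one turn, with all prices given in USD per million tokens."""
    total = (cached_tokens * cached_read_per_mtok
             + fresh_tokens * input_per_mtok
             + output_tokens * output_per_mtok)
    return total / 1_000_000


# Hypothetical prices, for illustration only.
prices = dict(input_per_mtok=3.0, cached_read_per_mtok=0.3, output_per_mtok=15.0)

bloated = turn_cost_usd(5_000, 150_000, 1_000, **prices)  # uncompacted history
lean = turn_cost_usd(5_000, 40_000, 1_000, **prices)      # after compaction
print(f"bloated={bloated:.4f} lean={lean:.4f}")
```

Multiplying the per-turn difference by the number of turns in an overnight run gives a quick sense of what a compaction threshold is worth.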

The Next Step

If you're running long-lived agents, start by measuring one session's token breakdown with count_tokens. Once you can see which layer eats the budget and where history occupancy crosses 0.8, where to insert compaction decides itself. Compaction without observation is guesswork, but once the breakdown is visible, the design gets a lot more concrete.

I'm still tuning this myself, but ever since I began treating context as a finite budget, my morning agent works with the same care it shows at night. I hope it helps anyone wrestling with the same long-running-operation problem.
