API & SDK · 2026-06-16 · Advanced

Taming Token Bloat in Long-Running Agents with Context Editing and the Memory Tool

For long-running agents whose input tokens balloon as tool results pile up, here is how to pair context editing with the memory tool and measure the savings with count_tokens, including a working backend implementation.



As an indie developer, I once let a research agent in my automated publishing pipeline run uninterrupted, and somewhere around the twentieth tool call the input tokens crossed 70,000 and the per-turn cost stopped being something I could ignore. When I looked inside, most of the weight came from finished web_search results sitting untouched in the conversation history. The longer an agent runs, the more those stale tool results crowd the context. That is not unique to my setup; it is a common tax on any tool-heavy workflow.

The Claude API gives you two complementary levers for this. Context editing removes old tool results on the server side, and the memory tool offloads what you want to keep into files. Because their jobs are opposite, the practical move is to use them together rather than picking one. This article builds both up from a minimal setup to a production-ready one, and ends by measuring the savings with count_tokens before any real spend, with working backend code along the way.

Context editing "removes," the memory tool "retains" — opposite jobs

The first thing worth untangling is that these two features solve genuinely different problems.

Context editing (clear_tool_uses_20250919) automatically deletes old tool results on the server side once the history crosses a threshold. Each cleared block is replaced with a placeholder, so Claude knows a result used to be there and is now gone. In other words, search results or file reads you will not revisit get dropped without you having to trim them by hand.

The memory tool (memory_20250818), by contrast, is a client-side tool that Claude uses to write and read files itself. It saves what it learns into a /memories directory and can read it back on demand, even after context editing has cleared the history or a new session has started.

That contrast is the whole point. With context editing alone, an important finding buried in a cleared tool result disappears along with it. So you write just the "I will need this later" facts into memory, and you get both: a light history and persistent knowledge. Both are enabled with the beta header context-management-2025-06-27.

Enabling clear_tool_uses in its simplest form

Start with context editing on its own, in the plainest shape. You pass a single edit in context_management.

import anthropic
 
client = anthropic.Anthropic()
 
response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    messages=[{"role": "user", "content": "Research recent work on AI agents and summarize it"}],
    tools=[{"type": "web_search_20250305", "name": "web_search"}],
    betas=["context-management-2025-06-27"],
    context_management={"edits": [{"type": "clear_tool_uses_20250919"}]},
)

Whether clearing actually ran is reported in context_management.applied_edits. If you skip checking this and just assume "it must be working," you can easily miss that it never fired at all because the threshold was never reached.

applied = getattr(response, "context_management", None)
if applied and applied.applied_edits:
    for edit in applied.applied_edits:
        print(f"cleared tool uses: {edit.cleared_tool_uses} / "
              f"tokens saved: {edit.cleared_input_tokens}")
else:
    print("nothing cleared this turn (below threshold)")

The minimal setup omits trigger, so it runs at the default threshold. On short conversations it never fires, and that is correct behavior — not firing is not a bug. Watch the counts in your logs, then move on to tuning.
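To make "watch the counts in your logs" concrete, here is a minimal sketch of a per-turn summary. The helper name summarize_applied_edits is my own, and the SimpleNamespace objects are stand-ins for a real SDK response; the only thing assumed about the response is the applied_edits shape shown above.

```python
from types import SimpleNamespace

def summarize_applied_edits(response) -> dict:
    """Total up what context editing cleared on one response.

    Works on any object shaped like the response above:
    response.context_management.applied_edits, where each item carries
    cleared_tool_uses and cleared_input_tokens.
    """
    cm = getattr(response, "context_management", None)
    edits = getattr(cm, "applied_edits", None) or []
    return {
        "fired": bool(edits),
        "cleared_tool_uses": sum(getattr(e, "cleared_tool_uses", 0) or 0 for e in edits),
        "cleared_input_tokens": sum(getattr(e, "cleared_input_tokens", 0) or 0 for e in edits),
    }

# Stand-in objects for illustration only; a real response comes from the SDK.
quiet = SimpleNamespace(context_management=SimpleNamespace(applied_edits=[]))
busy = SimpleNamespace(context_management=SimpleNamespace(applied_edits=[
    SimpleNamespace(type="clear_tool_uses_20250919",
                    cleared_tool_uses=4, cleared_input_tokens=18000),
]))

print(summarize_applied_edits(quiet))
print(summarize_applied_edits(busy))
```

Logging the returned dict once per turn is enough to see whether clearing fires at all, and how much it removes when it does.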


How to choose trigger, keep, clear_at_least, and exclude_tools

In production, pinning the behavior explicitly is more stable than trusting the defaults. Here is the configuration I settled on.

context_management={
    "edits": [
        {
            "type": "clear_tool_uses_20250919",
            # start clearing once input tokens exceed this
            "trigger": {"type": "input_tokens", "value": 30000},
            # keep the 3 most recent tool results
            "keep": {"type": "tool_uses", "value": 3},
            # clear at least this many tokens when it runs
            "clear_at_least": {"type": "input_tokens", "value": 5000},
            # never clear results from these tools
            "exclude_tools": ["web_search"],
        }
    ]
}

A note on each, in the order they actually bit me.

trigger is "when to start clearing." Set it too low and clearing runs every turn, which keeps invalidating your prompt cache and ends up costing more. I aim for roughly 20–30% of the context limit.

keep is "how many recent tool results to retain." If your agent decides its next move from the immediately preceding result, setting this to 1 or 2 can break its train of thought. Around 3 was the safe choice.

clear_at_least is a floor that ensures a meaningful chunk goes when clearing does happen. Without it, you can sit right at the threshold and repeatedly "clear a few hundred tokens, immediately exceed again," firing in tiny increments that keep wrecking the cache.

exclude_tools works the other way around: it names the tools whose results must never be cleared. Use it to keep search results that keep influencing later decisions while still dropping, say, a throwaway directory listing.
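One way to keep these four knobs consistent across environments is to build the edit from a small function instead of hand-writing the dict. The sketch below is only that; the function name and the sanity checks are my own local heuristics, not rules enforced by the API.

```python
def build_clear_tool_uses_edit(trigger_tokens, keep_tool_uses=3,
                               clear_at_least_tokens=None, exclude_tools=()):
    """Assemble one clear_tool_uses_20250919 edit with basic local checks.

    The checks are assumptions about what is sensible, not API validation.
    """
    if trigger_tokens <= 0:
        raise ValueError("trigger_tokens must be positive")
    if keep_tool_uses < 0:
        raise ValueError("keep_tool_uses must not be negative")
    edit = {
        "type": "clear_tool_uses_20250919",
        "trigger": {"type": "input_tokens", "value": trigger_tokens},
        "keep": {"type": "tool_uses", "value": keep_tool_uses},
    }
    if clear_at_least_tokens is not None:
        # A floor at or above the trigger could never be satisfied sensibly.
        if not 0 < clear_at_least_tokens < trigger_tokens:
            raise ValueError("clear_at_least_tokens must be between 0 and trigger_tokens")
        edit["clear_at_least"] = {"type": "input_tokens", "value": clear_at_least_tokens}
    if exclude_tools:
        edit["exclude_tools"] = list(exclude_tools)
    return edit

context_management = {
    "edits": [build_clear_tool_uses_edit(
        30000, keep_tool_uses=3,
        clear_at_least_tokens=5000, exclude_tools=["web_search"],
    )]
}
print(context_management)
```

The resulting dict matches the configuration shown earlier, so it can be passed as context_management unchanged.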

Implementing the memory tool backend yourself

The memory tool is client-side, so declaring it is not enough to make it work. You have to handle each command — view, create, str_replace, insert, delete, rename — on your side. The single most important part here is path-traversal protection.

The official SDKs ship helpers such as BetaAbstractMemoryTool (Python), but when you want full control over where data lives, a thin backend makes the behavior easy to reason about. First, look at the dangerous version that is tempting to write.

# Before (unsafe): joins the received path directly
from pathlib import Path
 
ROOT = Path("./memories")
 
def view(path: str) -> str:
    target = ROOT / path.lstrip("/")          # cannot stop "../../etc/passwd"
    return target.read_text(encoding="utf-8")

If path arrives as something like /memories/../../etc/passwd, you can read and write outside /memories. Even without Claude acting maliciously, you must not leave a route by which a malfunction or unexpected input can touch files outside the sandbox. Here is the protected version.

# After (protected): resolve to canonical form, then verify it stays under /memories
from pathlib import Path
 
ROOT = Path("./memories").resolve()
 
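def _safe(path: str) -> Path:
    # accept only /memories or /memories/...; reject anything else up front
    if path != "/memories" and not path.startswith("/memories/"):
        raise PermissionError(f"path must start with /memories: {path}")
    # strip the virtual prefix only (str.replace would also match a later "/memories")
    rel = path[len("/memories"):].lstrip("/")
    target = (ROOT / rel).resolve()
    try:
        target.relative_to(ROOT)             # raises ValueError if outside
    except ValueError:
        raise PermissionError(f"path points outside /memories: {path}") from None
    return target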
 
def handle_memory(cmd: dict) -> str:
    c = cmd["command"]
    if c == "view":
        p = _safe(cmd["path"])
        if p.is_dir():
            items = "\n".join(f"{f.stat().st_size}\t/memories/{f.relative_to(ROOT)}"
                              for f in sorted(p.rglob("*")) if f.is_file())
            return f"Files under /memories:\n{items}"
        lines = p.read_text(encoding="utf-8").splitlines()
        return "\n".join(f"{i+1:>6}\t{ln}" for i, ln in enumerate(lines))
    if c == "create":
        p = _safe(cmd["path"])
        if p.exists():
            return f"Error: File {cmd['path']} already exists"
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text(cmd["file_text"], encoding="utf-8")
        return f"File created successfully at: {cmd['path']}"
    if c == "str_replace":
        p = _safe(cmd["path"])
        body = p.read_text(encoding="utf-8")
        if body.count(cmd["old_str"]) != 1:
            return "No replacement was performed; old_str must appear exactly once."
        p.write_text(body.replace(cmd["old_str"], cmd["new_str"]), encoding="utf-8")
        return "The memory file has been edited."
    if c == "delete":
        p = _safe(cmd["path"])
        p.unlink(missing_ok=True)
        return f"Successfully deleted {cmd['path']}"
    return f"Error: unsupported command {c}"

The core is resolving .. away with Path.resolve() and then confirming containment with relative_to(ROOT). Any path that fails that check is rejected with PermissionError. I left out insert and rename for space, but routing them through the same _safe protects them identically. Matching the return strings to the shapes the docs expect makes it easier for Claude to interpret the results.

Wiring both into the agent loop

Context editing and the memory tool coexist in the same create call. You add memory to the tool list and let context_management do the clearing.

def run_turn(messages):
    response = client.beta.messages.create(
        model="claude-opus-4-8",
        max_tokens=4096,
        messages=messages,
        tools=[
            {"type": "memory_20250818", "name": "memory"},
            {"type": "web_search_20250305", "name": "web_search"},
        ],
        betas=["context-management-2025-06-27"],
        context_management={
            "edits": [{
                "type": "clear_tool_uses_20250919",
                "trigger": {"type": "input_tokens", "value": 30000},
                "keep": {"type": "tool_uses", "value": 3},
            }]
        },
    )
    return response
 
# when a memory command arrives, run it through the backend and return a tool_result
def dispatch_tools(response, messages):
    # the assistant turn holding the tool_use blocks must precede their tool_result
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use" and block.name == "memory":
            out = handle_memory(block.input)
            results.append({"type": "tool_result",
                            "tool_use_id": block.id, "content": out})
    if results:
        messages.append({"role": "user", "content": results})
    return messages
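To show how the two pieces fit into a loop, here is a minimal driver sketch. It takes the API call and the memory backend as plain callables (create_turn and handle_memory are parameter names I chose), so it can be exercised with scripted stand-in responses instead of a live client. It only handles memory tool calls and ignores other stop reasons, which a production loop would need to cover.

```python
from types import SimpleNamespace

def run_agent(create_turn, handle_memory, messages, max_turns=20):
    """Call the model until a turn contains no memory tool_use blocks."""
    for _ in range(max_turns):
        response = create_turn(messages)
        # Record the assistant turn first so each tool_result has its tool_use.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type != "tool_use" or block.name != "memory":
                continue
            try:
                out, failed = handle_memory(block.input), False
            except (OSError, KeyError) as exc:   # PermissionError is an OSError
                out, failed = f"Error: {exc}", True
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": out, "is_error": failed})
        if not results:
            return response
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")

# Scripted stand-ins for two model turns; real turns come from the API call.
script = iter([
    SimpleNamespace(content=[SimpleNamespace(
        type="tool_use", name="memory", id="toolu_1",
        input={"command": "view", "path": "/memories"})]),
    SimpleNamespace(content=[SimpleNamespace(type="text", text="done")]),
])
history = [{"role": "user", "content": "Research recent work on AI agents"}]
final = run_agent(lambda msgs: next(script),
                  lambda cmd: "Files under /memories:", history)
print(len(history), final.content[0].text)
```

Catching backend exceptions and returning them as an error result keeps a rejected path from crashing the loop while still telling Claude the command failed.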

The design trick is to steer Claude to write only the findings it would hurt to lose. Enabling the memory tool auto-injects a system instruction to always check the memory directory before working, but left alone Claude tends to spawn lots of tiny notes. Adding a single line to the system prompt, such as "only record information relevant to <topic> in memory" with <topic> replaced by your task, cut the file clutter noticeably for me.

Measuring the savings with count_tokens before you ship

This is the part that paid off most in practice. You can estimate the effect of context editing before going live, with count_tokens. Pass the same context_management, and it returns the token counts before and after clearing.

counted = client.beta.messages.count_tokens(
    model="claude-opus-4-8",
    messages=long_messages,                       # your actual bloated history
    betas=["context-management-2025-06-27"],
    context_management={
        "edits": [{
            "type": "clear_tool_uses_20250919",
            "trigger": {"type": "input_tokens", "value": 30000},
            # same keep as the production config, so the estimate matches what will run
            "keep": {"type": "tool_uses", "value": 3},
        }]
    },
)
 
original = counted.context_management.original_input_tokens
after = counted.input_tokens
print(f"before: {original} / after: {after} / saved: {original - after}")

Measured against my research agent's history, roughly 70,000 tokens before clearing dropped to about 25,000 after. The size of the reduction depends heavily on how you use tools, so rather than taking that number at face value, always measure on your own typical history before adjusting trigger and keep. Because count_tokens is not billed, it is a dependable way to probe before you commit real production spend.
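The arithmetic behind those numbers is simple enough to wrap in a helper when you compare several trigger and keep settings. This is a sketch with names of my own; the optional price argument is a placeholder you supply yourself, not a real rate.

```python
def savings_report(original_tokens, after_tokens, price_per_mtok=None):
    """Summarize a before/after token count from count_tokens.

    price_per_mtok is an optional, caller-supplied input price per million
    tokens; no real pricing is assumed here.
    """
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    saved = original_tokens - after_tokens
    report = {
        "saved_tokens": saved,
        "saved_percent": round(saved / original_tokens * 100, 1),
    }
    if price_per_mtok is not None:
        report["saved_per_turn"] = saved * price_per_mtok / 1_000_000
    return report

# The rounded figures from my research agent's history.
print(savings_report(70_000, 25_000))
# → {'saved_tokens': 45000, 'saved_percent': 64.3}
```

Running it over a handful of recorded histories gives a spread rather than a single number, which is more useful for choosing trigger and keep.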

Pitfalls that are easy to hit in operation

To close, three things I only noticed after integrating.

First, the interaction with prompt caching. When context editing runs, the cached prefix changes, so the turn where clearing happens does not benefit from the cache. Set trigger too low and every-turn clearing keeps breaking the cache, turning an intended saving into a net cost. Keep clearing occasional rather than constant.
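A cheap guard against constant clearing is to track how often it fires over a sliding window of turns and flag when the rate gets high. The sketch below assumes you already know per turn whether applied_edits was non-empty; the class name, window size, and 30% threshold are arbitrary choices of mine to illustrate the idea.

```python
from collections import deque

class ClearingMonitor:
    """Track how often context editing fired over the last `window` turns."""

    def __init__(self, window=10, max_rate=0.3):
        self._fired = deque(maxlen=window)   # oldest entries fall off automatically
        self.max_rate = max_rate

    def record(self, fired: bool) -> None:
        self._fired.append(bool(fired))

    @property
    def rate(self) -> float:
        return sum(self._fired) / len(self._fired) if self._fired else 0.0

    def too_frequent(self) -> bool:
        # Only judge once the window is full, to avoid noise on early turns.
        return len(self._fired) == self._fired.maxlen and self.rate > self.max_rate

monitor = ClearingMonitor(window=4, max_rate=0.3)
for fired in [True, False, True, True]:
    monitor.record(fired)
print(monitor.rate, monitor.too_frequent())
```

When too_frequent() returns True, that is a hint to raise trigger or clear_at_least so clearing happens in fewer, larger steps.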

Second, memory file bloat. Claude will keep adding notes if you let it, so it is safer to add operational guards on your side: cap the characters per file, limit how much view returns and let Claude paginate, and periodically delete files that have not been accessed for a long time.
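Those three guards can be sketched in a few small functions. Everything here is an assumption to adapt: the limits, the function names, the start_line paging parameter, and the use of modification time as a stand-in for "last accessed" (access times are often unreliable on real filesystems).

```python
import os
import tempfile
import time
from pathlib import Path

MAX_FILE_CHARS = 8_000   # assumed cap per memory file
MAX_VIEW_LINES = 200     # assumed page size for view
MAX_IDLE_DAYS = 30       # assumed retention window

def check_size(file_text: str):
    """Return an error string if the text exceeds the cap, else None."""
    if len(file_text) > MAX_FILE_CHARS:
        return f"Error: file exceeds {MAX_FILE_CHARS} characters; split or summarize it"
    return None

def view_page(text: str, start_line: int = 1, max_lines: int = MAX_VIEW_LINES) -> str:
    """Return one numbered page of a file plus a hint when more lines remain."""
    lines = text.splitlines()
    first = max(start_line, 1) - 1
    chunk = lines[first:first + max_lines]
    out = [f"{first + i + 1:>6}\t{ln}" for i, ln in enumerate(chunk)]
    remaining = len(lines) - (first + len(chunk))
    if remaining > 0:
        out.append(f"[{remaining} more lines; continue from line {first + len(chunk) + 1}]")
    return "\n".join(out)

def prune_stale(root: Path, max_idle_days: int = MAX_IDLE_DAYS, now=None):
    """Delete files not modified within the window; return their names."""
    cutoff = (time.time() if now is None else now) - max_idle_days * 86400
    removed = []
    for f in sorted(root.rglob("*")):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f.name)
    return removed

# Demo against a temporary directory with one fresh and one aged file.
root = Path(tempfile.mkdtemp())
fresh, old = root / "fresh.md", root / "old.md"
fresh.write_text("current findings\n", encoding="utf-8")
old.write_text("stale notes\n", encoding="utf-8")
aged = time.time() - 90 * 86400
os.utime(old, (aged, aged))                  # pretend it was last touched 90 days ago

removed = prune_stale(root)
page = view_page("a\nb\nc\nd\ne", max_lines=2)
print(removed)
print(page)
```

check_size belongs in the create and str_replace handlers before writing, view_page in the view handler, and prune_stale in a scheduled job outside the agent loop.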

Third, edit ordering. If you also clear thinking blocks (clear_thinking_20251015), that edit must come first in the edits array. Reverse the order and it will not take effect as intended.
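Since the ordering rule is easy to forget when edits are assembled from several places, a tiny normalizer helps. This sketch only reorders the list so any clear_thinking_20251015 edit comes first; the function name is mine, and the edits are shown with their type only to avoid guessing at other parameters.

```python
def order_edits(edits):
    """Return edits with clear_thinking_20251015 entries first.

    sorted() is stable, so the relative order of all other edits is kept.
    """
    return sorted(edits, key=lambda e: e.get("type") != "clear_thinking_20251015")

edits = [
    {"type": "clear_tool_uses_20250919",
     "trigger": {"type": "input_tokens", "value": 30000}},
    {"type": "clear_thinking_20251015"},
]
ordered = order_edits(edits)
print([e["type"] for e in ordered])
# → ['clear_thinking_20251015', 'clear_tool_uses_20250919']
```

Passing {"edits": order_edits(edits)} as context_management keeps the order correct no matter how the list was built.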

If you want long-running agents to behave in production, start by pushing one of your bloated histories through count_tokens and measuring the before/after difference. Only then do the right values for trigger and keep in your particular workflow show up as numbers.

Reference: Context editing (Claude API docs) and Memory tool (Claude API docs)
