When Context Editing Made My Agent Re-run the Same Search — Field Notes on Clear Boundaries and Cache Invalidation
After turning on Context Editing to auto-clear tool results, the agent forgot what it had just read, re-ran the same tool, and the cache rebuilt every turn so costs went up. Field notes on instrumenting the silent regression and setting trigger, keep, and clear_at_least from measured data.
I had been fighting context bloat in a long-running agent, so I added a single line of Context Editing — clear_tool_uses_20250919. The token graph dropped, just as expected. But responses didn't feel faster, and the end-of-month token bill had actually gone up.
Lining up the logs, I saw the cause: the agent kept forgetting content it had searched for and read moments earlier, then calling the same web_search again. Only the placeholder for the cleared result remained, so Claude concluded "something was here but it's gone" and went back to fetch the missing information. The tokens I had trimmed by clearing were being clawed right back by fresh tool results.
As an indie developer running content generation and monitoring across several sites unattended, this "the numbers improved but the real outcome got worse" state is the worst kind to debug. Nothing throws an error. Cost and quality just quietly erode. These notes are about dragging that silent regression into view with instrumentation, and tuning Context Editing into a setting that doesn't cost you money to enable.
Suspect First That the Clear Boundary Doesn't Match the Meaning Boundary
clear_tool_uses_20250919 clears the oldest tool results first. The trouble is that the line between "results Claude no longer needs" and "results still informing its next decision" can't be measured in tokens. If keep (the number of tool uses retained) is too small, results you still want to reference get wiped.
You can tell whether you're in a re-run loop just by plotting two series over time: the count of duplicate calls with the same tool and arguments, and the cache hit rate before and after each clear event. If duplicates rise and the cache is being rebuilt frequently, your clearing is too aggressive.
```python
# Pull the actual clear and cache numbers out of a Context-Editing response
# Goal: observe "how much the clear saved" and "what the broken cache cost" on the same line
import hashlib  # used by the duplicate-call counter further down
import json

import anthropic

client = anthropic.Anthropic(api_key="YOUR_API_KEY")


def call_with_context_editing(messages, tools):
    resp = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=messages,
        tools=tools,
        betas=["context-management-2025-06-27"],
        context_management={
            "edits": [{
                "type": "clear_tool_uses_20250919",
                "trigger": {"type": "input_tokens", "value": 30000},
                "keep": {"type": "tool_uses", "value": 3},
                "clear_at_least": {"type": "input_tokens", "value": 5000},
            }]
        },
    )
    u = resp.usage
    # Applied edits are reported on the response itself; fall back to usage
    # in case an SDK version exposes them there instead.
    applied = getattr(resp, "context_management", None) or getattr(u, "context_management", None)
    print(json.dumps({
        "input_tokens": u.input_tokens,
        "cache_read": getattr(u, "cache_read_input_tokens", 0),
        "cache_write": getattr(u, "cache_creation_input_tokens", 0),
        "context_edits": str(applied),
    }))
    return resp
```
If cache_read_input_tokens stays near zero while cache_creation_input_tokens keeps climbing, the cache prefix is breaking on every clear. That is exactly what "lighter but more expensive" looks like.
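To make that check mechanical, here is a minimal sketch that turns the logged usage lines into a per-turn cache hit rate. The function name `cache_health` and the 0.5 threshold are my own assumptions, not part of the API; it also assumes that `input_tokens` counts only the uncached part of the prompt, so the three fields sum to the full input.

```python
def cache_health(turns, min_hit_rate=0.5):
    """Summarise per-turn usage records and flag a cache that keeps being rebuilt.

    `turns` is a list of dicts with the keys logged above:
    {"input_tokens": int, "cache_read": int, "cache_write": int}.
    `min_hit_rate` is an illustrative threshold, not a documented value.
    """
    report = []
    for i, t in enumerate(turns):
        read = t.get("cache_read") or 0
        write = t.get("cache_write") or 0
        uncached = t.get("input_tokens") or 0
        total = read + write + uncached
        hit_rate = read / total if total else 0.0
        report.append({"turn": i, "hit_rate": round(hit_rate, 3), "cache_write": write})
    low = [r for r in report if r["hit_rate"] < min_hit_rate]
    # The first turn is always a cold cache, so only flag when most turns are low.
    return {
        "turns": report,
        "low_hit_turns": len(low),
        "rebuilding": len(low) > len(report) / 2,
    }
```

Feed it the parsed JSON lines from one run. If `rebuilding` comes back true, most turns are paying for cache writes instead of reads.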
Settle it with counts, not vibes. Bucket tool calls by a hash of their arguments and count how many times the same call appears.
```python
# From one agent run's transcript, count how often it redid the same investigation
# Goal: compare duplicate counts before/after clearing to detect over-aggressive settings
import hashlib
import json
from collections import Counter


def count_redundant_tool_calls(transcript_blocks):
    seen = Counter()
    redundant = 0
    for b in transcript_blocks:
        if b.get("type") != "tool_use":
            continue
        # Same tool name + same arguments hash to the same key
        key = hashlib.sha1(
            (b["name"] + json.dumps(b["input"], sort_keys=True)).encode()
        ).hexdigest()
        seen[key] += 1
        if seen[key] > 1:
            redundant += 1
    total = sum(seen.values())
    rate = redundant / total if total else 0.0
    return {"total_tool_calls": total, "redundant": redundant, "redundant_rate": round(rate, 3)}

# Rule of thumb: if redundant_rate jumps 0.1+ above the no-clearing baseline, raise keep
```
In my own runs, raising keep from 3 to 6 dropped the redundant rate from 0.28 to 0.05 on one occasion. Tokens rise slightly, but because the re-run results no longer pile up, total input tokens stay flat and response quality clearly recovers. What you want to drop is "old results," not "results still doing work" — an obvious point that only landed once I had the numbers.
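The rule of thumb can be applied across several measured keep values at once. This is a small sketch under my own assumptions: `smallest_safe_keep` is a hypothetical helper, and the 0.03 baseline in the usage note is made up for illustration (only the 0.28 and 0.05 rates come from my runs).

```python
def smallest_safe_keep(rate_by_keep, baseline_rate, max_jump=0.1):
    """Return the smallest keep whose redundant_rate stays within max_jump of baseline.

    rate_by_keep: {keep_value: redundant_rate measured with that keep}
    baseline_rate: redundant_rate measured with clearing disabled
    Returns None when none of the measured keep values is safe.
    """
    for keep in sorted(rate_by_keep):
        if rate_by_keep[keep] - baseline_rate < max_jump:
            return keep
    return None
```

With `rate_by_keep={3: 0.28, 6: 0.05}` and an assumed baseline of 0.03, this picks 6, because keep=3 sits well over the 0.1 jump.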
Set trigger and clear_at_least From Measured Token Distributions
Set trigger too low and clearing fires while you still have headroom, so cache invalidation happens first with little benefit. Too high and you fail to contain the bloat. Don't guess here; plot the real per-turn input token curve once, then decide.
The order of decisions:
1. With clearing disabled, run a few representative workloads and record input_tokens per turn.
2. Look at the value just before tokens plateau, and use 80–90% of it as your initial trigger.
3. Set clear_at_least above the measured average cache_creation_input_tokens, so that a single clear removes more tokens than the cache rebuild has to write.
4. Add long-lived search and spec-reference tools to exclude_tools to protect them.
If clear_at_least falls below the cache write volume, you endlessly "clear a little, then rebuild the cache," which is the most expensive failure mode of all. This is the one knob to keep conservatively large. The token measurement basics are in measuring Claude API costs with token counting.
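The measurement steps above can be sketched as a helper that turns recorded numbers into starting values. Everything here is an assumption for illustration: the name `suggest_settings`, using the observed peak as a stand-in for the plateau, and the 1.2 safety margin on the average cache write. If prompt caching is on, add the cached token counts to input_tokens so the series reflects the full context size.

```python
import math
from statistics import mean


def suggest_settings(input_tokens_per_turn, cache_writes, trigger_ratio=0.8, margin=1.2):
    """Derive starting trigger / clear_at_least values from measured usage.

    input_tokens_per_turn: per-turn context size recorded with clearing disabled
    cache_writes: cache_creation_input_tokens samples from the same runs
    trigger_ratio: fraction of the plateau to use (the notes suggest 0.8 to 0.9)
    margin: hypothetical safety factor so clear_at_least sits above the average write
    """
    if not input_tokens_per_turn or not cache_writes:
        raise ValueError("need at least one measurement of each kind")
    plateau = max(input_tokens_per_turn)  # assumption: peak approximates the plateau
    trigger = int(plateau * trigger_ratio)
    clear_at_least = math.ceil(mean(cache_writes) * margin)
    return {
        "trigger": {"type": "input_tokens", "value": trigger},
        "clear_at_least": {"type": "input_tokens", "value": clear_at_least},
    }
```

The returned dicts have the same shape as the `trigger` and `clear_at_least` fields in the edit config, so they can be merged into it directly. Treat the output as a starting point to re-measure, not a final setting.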
The Extra Trap When Combining Extended Thinking
While you're focused on clearing tool results, it's easy to miss the Extended Thinking side. If Thinking is enabled and you don't specify clear_thinking_20251015, only the most recent turn's thinking is retained by default. For an agent that needs multi-step reasoning, that becomes the reason it "forgets the chain of thought it was halfway through and starts reasoning from scratch."
```python
# Manage tool results and thinking blocks with separate retention policies
# Goal: make the asymmetry explicit: keep investigation shallow, keep reasoning deep
# conversation_messages and my_tools are assumed to be defined by the caller
resp = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    temperature=1,  # required when Extended Thinking is on
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=conversation_messages,
    tools=my_tools,
    betas=["context-management-2025-06-27"],
    context_management={
        "edits": [
            # clear_thinking goes first: the API expects it ahead of other edits
            {
                "type": "clear_thinking_20251015",
                "keep": {"type": "thinking_turns", "value": 2},
            },
            {
                "type": "clear_tool_uses_20250919",
                "trigger": {"type": "input_tokens", "value": 50000},
                "keep": {"type": "tool_uses", "value": 6},
                "clear_at_least": {"type": "input_tokens", "value": 8000},
                "exclude_tools": ["web_search"],
            },
        ]
    },
)
# Effect: tool results trimmed to the last 6, while reasoning continuity is kept for 2 turns
```
Tool results and thinking blocks are retained for different reasons. Tool results are an "inventory of facts," so older ones are usually safe to drop; thinking blocks are "continuity of judgment," so dropping them resets the reasoning. Apply the same keep instinct to both and you'll over-trim the Thinking side.
Reject Cache-Hostile Settings Before Shipping
Finally, estimate the trade-off once before you enable it. Roughly, compare the tokens a single clear saves (felt on later turns) against the cache rebuild tokens incurred at that moment.
| Observed state | Meaning | Action |
| --- | --- | --- |
| cache_read stays high and stable | Clear boundary is stable; prefix is preserved | Hold. Lower trigger gradually to find more savings |
| cache_creation keeps rising | Cache breaks on every clear | Raise clear_at_least; reduce clear frequency |
| redundant_rate rises | Still-useful results are being wiped | Raise keep; add protected tools to exclude_tools |
| tokens drop but quality falls | Thinking over-trimmed | Raise clear_thinking keep, or set it to all |
The core judgment comes down to one thing. Context Editing isn't a "reduce context" feature; it's a "control when, what, and how much to reduce" feature. The moment reduction itself becomes the goal, re-runs and cache destruction will turn it against you. For broader production cost design, see tightening Claude API cost optimization in production.
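The trade-off estimate from this section can be put into rough numbers. This is a deliberately simplified model, not an exact bill: it assumes cache writes cost 1.25x and cache reads 0.1x the base input price, that cleared tokens would otherwise have been re-read from cache on every later turn, and that the rebuilt prefix is written once. The name `clear_net_saving` and all parameters are hypothetical.

```python
CACHE_WRITE_MULT = 1.25  # assumption: cache write priced at 1.25x base input
CACHE_READ_MULT = 0.10   # assumption: cache read priced at 0.1x base input


def clear_net_saving(cleared_tokens, rebuilt_tokens, turns_remaining):
    """Estimate the net effect of one clear, in base-input-token equivalents.

    cleared_tokens: tokens removed by the clear
    rebuilt_tokens: tokens that must be re-written to the cache afterwards
    turns_remaining: later turns that would otherwise have re-read the cleared tokens
    A positive result means the clear pays for itself under this model.
    """
    # Saving: the cleared tokens are no longer read from cache on later turns.
    saved = cleared_tokens * CACHE_READ_MULT * turns_remaining
    # Cost: the surviving prefix is written once instead of being read once.
    rebuild_cost = rebuilt_tokens * (CACHE_WRITE_MULT - CACHE_READ_MULT)
    return saved - rebuild_cost
```

Under these assumptions, a small clear that forces a large rebuild comes out negative, which is the "clear a little, then rebuild the cache" case. Reject such a setting before shipping, or raise clear_at_least until the estimate turns positive for your typical run length.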
Next Step
First, with clearing disabled, run three representative workloads and put the input_tokens curve and the cache_read / cache_creation ratio into a single log. Then set trigger to 80% of the token plateau, clear_at_least above the average cache write, and start keep at the smallest value where the redundant rate doesn't spike. Enable it in that order and you'll avoid the "lighter but more expensive" trap. I hope it saves you from sinking time into the same pitfall.