API & SDK / 2026-06-21 / Advanced

Don't Carry Search Results Twice: Trimming Consumed Blocks with response_inclusion

When an agent runs dynamic filtering, output tokens balloon because the raw search-result blocks a code execution call already consumed get echoed back into the response. Here is when response_inclusion: excluded is safe to use, when you must keep full, with implementation and a decision table.


I had been running a search-driven agent around the clock, and one morning, lining up the usage logs side by side, I stopped. The web_search_requests count had barely moved, yet output_tokens had nearly doubled from the week before.

The cause did not click right away. I had dynamic filtering enabled, so I assumed the search results were narrowed down by code the model wrote and only the relevant slice ever reached the context window. But when I counted the blocks in the response content one by one, the raw web_search_tool_result blocks — the ones already used for filtering — were still riding along as output. The model held the filtered information, and yet the unfiltered originals were being carried a second time, this time as output tokens.

The response_inclusion parameter, introduced with web_search_20260318 (and web_fetch_20260318) in March 2026, exists precisely to cut this "carry twice" pattern. It is a quiet addition, but for anyone running a search-driven agent continuously in production it directly lowers output cost. Starting from the pitfall I hit while running the auto-posting pipeline for my sites, let me lay out where excluded is safe, with implementation and a decision table.

Search results cost tokens on both the input and the output side

First, let me pin down exactly where search results consume tokens. If that stays fuzzy, you can apply a fix and never see whether it worked.

Web search is billed at $10 per 1,000 searches, plus the tokens for the retrieved content. The official docs state plainly that search results are counted as input tokens "in search iterations executed during a single turn and in subsequent conversation turns." So retrieved content can be counted twice:

once as input, for the model to read in the turn it was fetched, and
once as output, riding in that turn's response content as web_search_tool_result blocks.

Dynamic filtering is the mechanism that trims the first, the input side. The model writes code inside code execution and selects results before they land in the context window, so the input side genuinely gets lighter.

The second, the output side — the raw result blocks echoed into the response — does not disappear from dynamic filtering alone. The originals, already spent on filtering, linger in the response you send back to the client. For a workflow like my automation, where I only need the final summary and never show the user raw search content, that output was pure waste.
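To make the two-sided accounting concrete, here is a rough sketch that splits one turn's cost into the search fee, the input tokens, and the output tokens. The $10 per 1,000 searches figure comes from the text above; the per-token prices are placeholders the caller supplies, not real rates, and `TurnUsage` and `estimate_turn_cost` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

SEARCH_FEE_PER_1000 = 10.0  # dollars per 1,000 searches, as stated in the article


@dataclass
class TurnUsage:
    """Hypothetical container mirroring the counters discussed in the article."""
    web_search_requests: int
    input_tokens: int
    output_tokens: int


def estimate_turn_cost(usage: TurnUsage,
                       input_price_per_mtok: float,
                       output_price_per_mtok: float) -> dict:
    """Split one turn's cost into search fee, input tokens, and output tokens.

    The per-million-token prices are caller-supplied placeholders,
    not actual published rates.
    """
    search_fee = usage.web_search_requests * SEARCH_FEE_PER_1000 / 1000
    input_cost = usage.input_tokens * input_price_per_mtok / 1_000_000
    output_cost = usage.output_tokens * output_price_per_mtok / 1_000_000
    return {
        "search_fee": search_fee,
        "input": input_cost,
        "output": output_cost,
        "total": search_fee + input_cost + output_cost,
    }
```

Running it on the same task with and without the echoed blocks shows which of the three components moves: the search fee stays fixed, and only the output line changes.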

response_inclusion only drops results a completed code execution consumed

The default for response_inclusion is "full". Set it to "excluded" and the API drops, entirely, the nested server_tool_use and result-block pairs for search results that a completed code execution call consumed within the same turn.

Two conditions do the heavy lifting here. They are the parts that turn into incidents if you skim them, so let me be explicit.

First, only results consumed by a completed code execution are dropped. Dynamic filtering runs through code execution, so the search results are consumed by that code. If the execution ran to completion within the turn, the model already holds the filtered information — the raw blocks become nothing but an echo in the response, and so they are safe to drop. That is the reasoning.

Second, results from a direct call (a plain search that does not go through dynamic filtering) and results from a code execution that paused with pause_turn are always returned in full, even if you set excluded. These need to be sent back on the next turn for citations or continuation, so the API protects them from being dropped. It helps to read response_inclusion as "drop only the blocks that are definitively no longer needed on the next turn."
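The two conditions can be written down as a small mental model. This is not API code: `SearchResult` and `is_returned` are hypothetical names, and the function only encodes the rule as described in this article, so treat it as a sketch for reasoning about which blocks you should expect to see.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical description of how one search result was produced."""
    via_code_execution: bool   # False for a direct call (no dynamic filtering)
    execution_completed: bool  # False if the code execution paused with pause_turn


def is_returned(result: SearchResult, response_inclusion: str = "full") -> bool:
    """Return True if the raw result block is expected in the response."""
    if response_inclusion not in ("full", "excluded"):
        raise ValueError(f"unknown response_inclusion: {response_inclusion!r}")
    if response_inclusion == "full":
        return True  # default: everything is echoed back
    # With "excluded", only results consumed by a completed code execution drop.
    consumed = result.via_code_execution and result.execution_completed
    return not consumed
```

Under this model, a direct-call result and a result from a paused execution both come back even with excluded, which matches the protection described above.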

One more thing: the citation fields cited_text, title, and url are not counted as input or output tokens in the first place. What excluded drops is the heavy web_search_tool_result block on the source-content side; the citations themselves remain on the text blocks. Do not jump to the conclusion that "excluded removes my citations."
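If you want to check that citations survive, a minimal sketch is to walk the text blocks and collect the three fields named above. It assumes each text block exposes a `citations` list whose entries carry `url`, `title`, and `cited_text`; the helper names are hypothetical, and the code accepts either plain dicts or SDK-style objects.

```python
def _get(obj, name):
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def collect_citations(content) -> list:
    """Gather url, title, and cited_text from citations on text blocks."""
    found = []
    for block in content:
        if _get(block, "type") != "text":
            continue  # result blocks and tool-use blocks carry no inline citations
        for citation in _get(block, "citations") or []:
            found.append({
                "url": _get(citation, "url"),
                "title": _get(citation, "title"),
                "cited_text": _get(citation, "cited_text"),
            })
    return found
```

Calling `collect_citations(resp.content)` on a full response and on an excluded response should give the same list if the citations do stay on the text blocks.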


The minimal setup to enable excluded

The implementation is a single line added to the tool definition. Dynamic filtering presupposes that code execution is enabled, so you pass both. A nice operational detail: when used alongside search or fetch, the code execution calls are billed at no extra charge.

import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": (
            "Look up the latest inference prices for the major clouds and "
            "return only a table of the relative cost per 1M tokens. "
            "You don't need to return the raw page content."
        ),
    }],
    tools=[
        # Dynamic filtering runs inside code execution, so it must be enabled too.
        {"type": "code_execution_20250825", "name": "code_execution"},
        {
            "type": "web_search_20260318",
            "name": "web_search",
            "max_uses": 8,
            # Drop raw result blocks that a completed code execution consumed.
            "response_inclusion": "excluded",
        },
    ],
)

# output_tokens and server_tool_use show whether the setting took effect.
print(resp.usage)

In this request, the model iterates its searches, narrows the results inside code execution, and returns only the relative-cost table at the end. Because excluded is set, the search originals consumed during filtering drop out of the response, and output_tokens carries only the final table and the reasoning that led to it.

Whether the reduction is working becomes obvious the moment you count what is in the response content. The helper below prints the number of echoed raw result blocks alongside usage.

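def summarize(resp) -> None:
    """Print output tokens, search count, and the number of echoed raw result blocks."""
    # Raw search-result blocks still present in the response content.
    echoed = sum(
        1 for b in resp.content
        if b.type == "web_search_tool_result"
    )
    # server_tool_use can be None when no server tool ran in this turn.
    sts = resp.usage.server_tool_use
    print(f"output_tokens        = {resp.usage.output_tokens}")
    print(f"web_search_requests  = {sts.web_search_requests if sts else 0}")
    print(f"echoed_result_blocks = {echoed}")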

Run the same prompt once with full and once with excluded, and you see web_search_requests stay identical while echoed_result_blocks drops to 0, with output_tokens falling in step. The same number of searches, a lighter output — that shape is your proof the setting is taking effect.

When it is safe to drop, and when to keep full

excluded is not a universal switch. Depending on how the response is used, there are cases where you must keep full. Here is the line I draw in production.

Situation	Recommended	Why
Search results a completed code execution consumed in the same turn	Can be excluded	The model already holds the filtered information; the raw blocks only echo into output and aren't needed next turn
Direct-call results not routed through dynamic filtering	Always full (cannot be dropped)	The API retains them to send back for citations and continuation
Results from a code execution that paused with pause_turn	Always full (cannot be dropped)	They must be sent back as-is when the turn resumes
Showing or auditing raw search content for end users	Keep full	The raw blocks are needed to display sources and trace the evidence
A server-side agent passing only a summary, score, or verdict downstream	Prefer excluded	No value in returning originals; a clean reduction in output tokens

The deciding axis is simple: will you use that raw block again after receiving this turn's response? If yes, full; if no, excluded. The research agent in my auto-posting pipeline hands only structured findings to the next stage, so I set almost all of its searches to excluded. The one agent that shows readers source links is the only one I leave on full.
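One way to keep that decision in a single place is to build the tool definition from a per-agent flag. The sketch below reuses the tool type and fields from the request shown earlier; `web_search_tool` and its `reuses_raw_blocks` parameter are hypothetical names, not part of any SDK.

```python
def web_search_tool(reuses_raw_blocks: bool, max_uses: int = 8) -> dict:
    """Build a web_search tool definition for one agent.

    reuses_raw_blocks: True if anything downstream displays, audits, or
    otherwise reads the raw result blocks after this turn.
    """
    return {
        "type": "web_search_20260318",
        "name": "web_search",
        "max_uses": max_uses,
        # Keep originals only when something will actually use them again.
        "response_inclusion": "full" if reuses_raw_blocks else "excluded",
    }


# Research agent: passes only structured findings downstream.
research_tool = web_search_tool(reuses_raw_blocks=False)

# Reader-facing agent: shows source links, so it keeps the originals.
reader_tool = web_search_tool(reuses_raw_blocks=True)
```

Keeping the flag explicit per agent makes the one full path a visible, deliberate choice rather than a default someone forgot to change.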

How it differs from clear_tool_uses and client-side compression

There are other levers for the problem of tool returns bloating context. To avoid stacking two fixes in the same place, it is worth pinning down that they act in different places.

Technique	Tokens it cuts	When it applies	Main target
response_inclusion: excluded	Output (echoed, already-consumed result blocks)	Resolves within a single turn	web_search / web_fetch with dynamic filtering
clear_tool_uses (context editing)	Input (tool_use / tool_result accumulated over past turns)	Against buildup across many turns	History of any tool returns
Client-side compression (schema projection, reference handles)	Input (huge returns from your own tools)	At the moment a tool returns	DB rows, scraped HTML, and other custom-tool output

These three do not compete. response_inclusion lightens this turn's output. Context editing's clear_tool_uses removes past tool history that accumulated across turns from the input. Client-side compression of your own tools narrows the return of tools you wrote yourself — not server tools — at the moment they return.

In my own setup I keep them in three layers: response_inclusion for server tools (search and fetch), client-side compression for my custom MCP tools, and context editing for the loop as a whole when it runs long. Try to cover everything with just one of them and something always slips through.
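For the client-side layer, schema projection can be as small as the sketch below: keep only the fields the model needs and cap the row count before the tool returns. This illustrates the general idea rather than my exact implementation; `project_rows` and its parameters are hypothetical.

```python
import json


def project_rows(rows, keep, max_rows=20) -> str:
    """Shrink a custom tool's return value before it reaches the model.

    rows: list of dicts produced by your own tool (DB rows, scraped records).
    keep: field names the model actually needs.
    max_rows: cap on how many rows are returned.
    """
    projected = [
        {key: row[key] for key in keep if key in row}
        for row in rows[:max_rows]
    ]
    return json.dumps({
        "rows": projected,
        "total_rows": len(rows),           # lets the model know more exist
        "truncated": len(rows) > max_rows,
    })
```

This runs at the moment your own tool returns, so it touches input tokens only and does not overlap with response_inclusion, which acts on server-tool blocks in the output.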

Confirm the output reduction with usage before you ship

Finally, the check I run before putting the setting into production. Numbers, not gut feel.

With the same prompt and the same max_uses, run full and excluded once each and record the difference in output_tokens. The heavier the search load, the larger the gap. In one of my research agents, on tasks with around six web_search_requests, switching to excluded cut output_tokens by roughly 30–40%. On shallow single-shot tasks the gap was small — the deeper the loop and the more searches, the more it pays, which is the intuitive trend.

def compare(make_request) -> None:
    """make_request(inclusion) returns a response"""
    for mode in ("full", "excluded"):
        r = make_request(mode)
        echoed = sum(1 for b in r.content if b.type == "web_search_tool_result")
        print(f"[{mode:8}] out={r.usage.output_tokens:>6}  echoed_blocks={echoed}")

I confirm two things before shipping: that the difference shows up as expected, and that with excluded the substance of the final response (summary, table, verdict) has not degraded. A lighter output is worthless if the conclusion drifts. The one path where I show sources to users stays on full through this comparison by deliberate choice.
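The comparison helper expects a `make_request(inclusion)` callable. A minimal sketch of one, assuming the same request shape as the earlier example, is a small factory that fixes the prompt, model, and max_uses and varies only response_inclusion. `build_make_request` is a hypothetical name, and the client is passed in so the function can be exercised with a stub.

```python
def build_make_request(client, model: str, prompt: str, max_uses: int = 8):
    """Return make_request(inclusion), varying only response_inclusion."""

    def make_request(inclusion: str):
        return client.messages.create(
            model=model,
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
            tools=[
                # Code execution is required for dynamic filtering.
                {"type": "code_execution_20250825", "name": "code_execution"},
                {
                    "type": "web_search_20260318",
                    "name": "web_search",
                    "max_uses": max_uses,
                    "response_inclusion": inclusion,
                },
            ],
        )

    return make_request
```

With a real client this would be wired up as `compare(build_make_request(anthropic.Anthropic(), "claude-opus-4-8", prompt))`. Search results vary between runs, so a single pair of calls is a rough signal rather than a measurement.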

Where it settles for solo automation

response_inclusion is the kind of feature whose value only clicks once you stare at the billing and realize search charges you on both input and output. If you stop at trimming the input with dynamic filtering, you miss the originals carried twice on the output side.

I run pipelines for several sites alone as an indie developer, and small leaks like this pile up a little each day. I have settled on excluded by default for agents whose purpose is research and filtering and that never pass originals downstream, and on keeping full only on the path that shows readers their sources.

Start with your most search-heavy agent: run full and excluded once each and look at the gap in output_tokens. If a gap appears, that is the cost you were quietly paying on every run. Thank you for reading.
