CLAUDE LAB
API & SDK / 2026-06-21 / Advanced

Don't Carry Search Results Twice: Trimming Consumed Blocks with response_inclusion

When an agent uses dynamic filtering, output tokens can balloon because the raw search-result blocks that a code execution call has already consumed are echoed back in the response. This article covers when response_inclusion: "excluded" is safe to use and when you must keep "full", with implementation and a decision table.

Tags: claude-api, web-search, tool-use, context-management, cost-optimization

I had been running a search-driven agent around the clock, and one morning, lining up the usage logs side by side, I stopped. The web_search_requests count had barely moved, yet output_tokens had nearly doubled from the week before.

The cause did not click right away. I had dynamic filtering enabled, so I assumed the search results were narrowed down by code the model wrote and only the relevant slice ever reached the context window. But when I counted the blocks in the response content one by one, the raw web_search_tool_result blocks — the ones already used for filtering — were still riding along as output. The model held the filtered information, and yet the unfiltered originals were being carried a second time, this time as output tokens.
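
The block count described above can be reproduced in a few lines. This is a minimal sketch, assuming the response content has already been converted to a list of plain dicts with a "type" key. The sample list is hand-written for illustration and is not real API output.

```python
from collections import Counter


def count_blocks_by_type(content):
    """Tally response content blocks by their "type" field."""
    return Counter(block.get("type", "unknown") for block in content)


# Hypothetical, hand-written content list (not real API output).
sample_content = [
    {"type": "server_tool_use", "name": "code_execution"},
    {"type": "web_search_tool_result", "content": ["..."]},
    {"type": "web_search_tool_result", "content": ["..."]},
    {"type": "text", "text": "Final summary."},
]

counts = count_blocks_by_type(sample_content)
print(counts["web_search_tool_result"])
```

If the web_search_tool_result count stays high in turns where filtering ran to completion, the echo is likely what is inflating output_tokens.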

The response_inclusion parameter, introduced with web_search_20260318 (and web_fetch_20260318) in March 2026, exists precisely to cut this "carry twice" pattern. It is a quiet addition, but for anyone running a search-driven agent continuously in production it directly reduces output cost. Starting from the pitfall I actually hit while running the auto-posting pipeline for my sites, let me lay out exactly where excluded is safe, with implementation and a decision table.

Search results cost tokens on both the input and the output side

First, let me pin down exactly where search results consume tokens. If that stays fuzzy, you can apply a fix and never see its effect.

Web search is billed at ten dollars per 1,000 searches, plus token charges for the retrieved content. The official docs state plainly that search results are counted as input tokens "in search iterations executed during a single turn and in subsequent conversation turns." So retrieved content can be counted:

  • once as input, for the model to read in the turn it was fetched, and
  • once as output, riding in that turn's response content as web_search_tool_result blocks.
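
As rough arithmetic for the two sides listed above, the sketch below adds the per-search fee quoted in the text to token charges. The per-million-token prices are placeholder parameters, not real rates. Substitute the prices for the model you actually run.

```python
SEARCH_FEE_PER_1000 = 10.0  # USD, the per-search rate quoted in the text


def estimate_turn_cost(searches, input_tokens, output_tokens,
                       input_price_per_mtok, output_price_per_mtok):
    """Estimate the USD cost of one turn: search fee plus token charges.

    The two *_price_per_mtok arguments are caller-supplied assumptions.
    """
    search_cost = searches * SEARCH_FEE_PER_1000 / 1000
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return search_cost + input_cost + output_cost


# Placeholder prices (1.0 and 5.0 USD per million tokens), illustration only.
with_echo = estimate_turn_cost(3, 20_000, 12_000, 1.0, 5.0)
without_echo = estimate_turn_cost(3, 20_000, 2_000, 1.0, 5.0)
print(round(with_echo - without_echo, 4))
```

The search fee and input charge are identical in both calls, so the difference isolates what the echoed output blocks cost under those placeholder prices.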

Dynamic filtering is the mechanism that trims the first, the input side. The model writes code inside code execution and selects results before they land in the context window, so the input side genuinely gets lighter.

The second, the output side — the raw result blocks echoed into the response — does not disappear from dynamic filtering alone. The originals, already spent on filtering, linger in the response you send back to the client. For a workflow like my automation, where I only need the final summary and never show the user raw search content, that output was pure waste.

response_inclusion only drops results a completed code execution consumed

The default for response_inclusion is "full". Set it to "excluded" and the API drops, entirely, the nested server_tool_use and result-block pairs for search results that a completed code execution call consumed within the same turn.
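
A minimal sketch of how the setting might appear in a request. It assumes response_inclusion is a top-level field on the tool definition and that the tool is named web_search. Neither detail is confirmed by the text above, so check the official tool reference before relying on it. The helper only builds and validates a dict and makes no network call.

```python
VALID_INCLUSION = ("full", "excluded")


def build_web_search_tool(response_inclusion="full"):
    """Build a web search tool definition as a plain dict.

    Assumption: response_inclusion sits at the top level of the tool
    definition. The field placement is illustrative, not confirmed.
    """
    if response_inclusion not in VALID_INCLUSION:
        raise ValueError(
            f"response_inclusion must be one of {VALID_INCLUSION}"
        )
    return {
        "type": "web_search_20260318",
        "name": "web_search",  # assumed tool name
        "response_inclusion": response_inclusion,
    }


tool = build_web_search_tool("excluded")
print(tool)
```

Validating the value client-side keeps a typo such as "exclude" from silently falling back to whatever the server does with an unknown value.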

Two conditions do the heavy lifting here. They are the parts that turn into incidents if you skim them, so let me be explicit.

First, only results consumed by a completed code execution are dropped. Dynamic filtering runs through code execution, so the search results are consumed by that code. If the execution ran to completion within the turn, the model already holds the filtered information — the raw blocks become nothing but an echo in the response, and so they are safe to drop. That is the reasoning.

Second, results from a direct call (a plain search that does not go through dynamic filtering) and results from a code execution that paused with pause_turn are always returned in full, even if you set excluded. These need to be sent back on the next turn for citations or continuation, so the API protects them from being dropped. It helps to read response_inclusion as "drop only the blocks that are definitively no longer needed on the next turn."
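
The two conditions can be written down as a small predicate. This is a sketch of the rules as described in this section, not of the API's actual implementation. SearchResult and is_dropped are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical record describing how one search result was produced."""
    via_code_execution: bool   # False means a direct search call
    execution_completed: bool  # False if the turn paused (pause_turn)


def is_dropped(result, response_inclusion="full"):
    """Return True if the raw result block would be omitted from the response."""
    if response_inclusion != "excluded":
        return False  # "full" always returns everything
    # Dropped only when a code execution consumed it and ran to completion.
    return result.via_code_execution and result.execution_completed


cases = {
    "filtered, completed": SearchResult(True, True),
    "filtered, paused": SearchResult(True, False),
    "direct call": SearchResult(False, False),
}
for label, result in cases.items():
    print(label, is_dropped(result, "excluded"))
```

Only the first case evaluates to dropped. The paused and direct-call cases stay in the response regardless of the setting, which matches the protection described above.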

One more thing: the citation fields cited_text, title, and url are not counted as input or output tokens in the first place. What excluded drops is the heavy web_search_tool_result block on the source-content side; the citations themselves remain on the text blocks. Do not jump to the conclusion that "excluded removes my citations."
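
To check that citations survive, you can read them off the text blocks directly. This is a minimal sketch, assuming each text block carries a "citations" list whose entries hold the cited_text, title, and url fields named above. The sample content is hand-written and hypothetical.

```python
def collect_citations(content):
    """Gather (title, url) pairs from the citations attached to text blocks."""
    found = []
    for block in content:
        if block.get("type") != "text":
            continue
        for citation in block.get("citations") or []:
            found.append((citation.get("title"), citation.get("url")))
    return found


# Hypothetical response content with no raw result blocks present.
trimmed_content = [
    {
        "type": "text",
        "text": "Summary sentence.",
        "citations": [
            {
                "cited_text": "quoted passage",
                "title": "Example page",
                "url": "https://example.com/page",
            },
        ],
    },
]
print(collect_citations(trimmed_content))
```

The function never looks at web_search_tool_result blocks, so its output should not change when those blocks are absent from the response.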
