API & SDK · 2026-06-29 · Advanced

Let Claude Actually See the Images Your Tools Return — Use Image Blocks in tool_result and Cut Tokens by Roughly 10x

Stuffing a base64 string into a tool_result makes the same image cost roughly 10–20x more tokens. Here is how to return it as an image content block instead, with SDK code, a token-cost estimate, and the gotchas I hit in production.

There is a trap in tool implementations that is surprisingly easy to miss: your tool returns an image, but Claude never actually sees it. I ran into this myself when I wrote a tool that lets an agent judge wallpaper thumbnails. The response came back fine, but the judgments were oddly vague. When I dug in, Claude was reading a long base64 string as text, not looking at the picture.

The annoying part is that a tool_result accepts almost anything, so the wrong shape still runs. It works, but it costs you. This article walks through how to return images so Claude genuinely sees them, with the actual numbers attached.

When your tool returns an image but Claude isn't looking

When you answer a tool_use, most implementations put a string into the content of a tool_result. For tools that return text, that is exactly right. But when you want to return an image, it is tempting to write this:

# Anti-pattern: stuffing the image base64 in as a "string"
tool_result = {
    "type": "tool_result",
    "tool_use_id": tool_use_id,
    "content": f"image data: {base64_png}",  # treated as text
}

The API will not raise an error here. Claude receives the base64 as text, and on the surface processing continues. But Claude never looks at the pixels, so it cannot make any judgment based on the image. Worse, those tens of thousands of base64 characters are billed as input tokens.

Reports in the official SDK repositories and community threads describe exactly this: tool-result images that are not converted into native image blocks and instead get sent as text, consuming around 15,000–25,000 tokens per image. The same image attached directly as a user message costs about 1,600 tokens, so the gap is roughly 10–20x. It is the classic case of paying ten times more for something that appears to work.

The correct shape is an image content block inside tool_result

The content of a tool_result accepts not just a string but an array of content blocks. Put an image block there and Claude recognizes it as an image and reads the pixels with its vision capabilities.

# Correct shape: make content an array and include an image block
tool_result = {
    "type": "tool_result",
    "tool_use_id": tool_use_id,
    "content": [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64_png,
            },
        },
        {"type": "text", "text": "Here is the current thumbnail candidate. Rate its legibility."},
    ],
}

Two things matter: make content an array, and pass the image as an {"type": "image", ...} block. If you want to add a text note, just place a text block alongside it in the same array. On Claude's side, this image is handled exactly like an image a user attached, which means it is also billed at image rates rather than as tens of thousands of text tokens.
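
The two rules above can be wrapped in a small helper so that no call site ever falls back to the string form. This is a minimal sketch: `image_tool_result` and its parameters are hypothetical names chosen for illustration, not part of the SDK.

```python
import base64
from typing import Optional


def image_tool_result(
    tool_use_id: str,
    image_bytes: bytes,
    media_type: str = "image/png",
    note: Optional[str] = None,
) -> dict:
    """Build a tool_result whose content is an array holding an image block.

    Hypothetical helper: the name and signature are illustrative only.
    """
    content = [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                # The API expects base64 text, so encode the raw bytes here.
                "data": base64.standard_b64encode(image_bytes).decode("ascii"),
            },
        }
    ]
    if note:
        # An optional text block sits alongside the image in the same array.
        content.append({"type": "text", "text": note})
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
```

Taking raw bytes rather than a pre-encoded string keeps the encoding step in one place, which makes it harder to accidentally interpolate base64 into a text field.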

Pin down the before/after cost with real numbers

To make the difference concrete, here is an estimate for passing the same 1024×1024 PNG two ways.

| How you pass it | What Claude sees | Approx. input tokens | Can judge image content |
| --- | --- | --- | --- |
| base64 as a string in `content` | Text | ~15,000–25,000 | No |
| image block in `content` | Image | ~1,400–1,600 | Yes |

Image token count is roughly width × height ÷ 750, with both dimensions in pixels. For 1024×1024 that is 1024 × 1024 / 750 ≈ 1,398 tokens. Knowing this formula lets you decide, before sending, whether a tool's image needs downscaling. By the same formula a 2048×2048 image comes to about 5,600 tokens, so if 1024 is enough for the judgment, downscaling before you send cuts that to a quarter. The API may also downscale very large images on its own, so for oversized inputs treat the formula as an upper estimate.

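Conversely, the string anti-pattern makes this estimate useless. Text is billed per token with no relation to the image's dimensions, and base64 inflates the original bytes by about a third (four output characters for every three input bytes). The cost then tracks file size rather than pixel count, which is how a single image ends up at 20,000 tokens.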
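
The arithmetic above is easy to turn into a pre-send check. The sketch below assumes the width × height ÷ 750 approximation quoted in this article and the fixed 4-for-3 expansion of unwrapped base64; the function names are hypothetical, and it deliberately does not try to predict how many text tokens a base64 string costs, since that depends on the tokenizer.

```python
import math


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate tokens for an image block: width * height / 750."""
    return round(width * height / 750)


def base64_length(num_bytes: int) -> int:
    """Characters produced by unwrapped base64: 4 per 3 input bytes, padded."""
    return 4 * math.ceil(num_bytes / 3)


def max_square_side(token_budget: int) -> int:
    """Largest square side (in pixels) whose estimate stays within the budget."""
    return math.isqrt(token_budget * 750)


print(estimate_image_tokens(1024, 1024))  # 1398
print(estimate_image_tokens(2048, 2048))  # 5592
print(base64_length(300_000))             # 400000
```

Calling `estimate_image_tokens` inside the tool, before encoding, gives you a single place to decide whether to downscale.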

Wire it into the agent loop

A real loop looks like this: Claude returns a tool_use, your code runs the tool and gets an image, you return that image as an image block in the tool_result, and Claude looks at the image to make its next decision.

import base64

import anthropic
 
client = anthropic.Anthropic()
 
def render_thumbnail(candidate_id: str) -> bytes:
    # In practice, generate or fetch the thumbnail. Returns PNG bytes here.
    ...
 
def run_review(candidate_id: str):
    messages = [{
        "role": "user",
        "content": f"Generate the thumbnail for candidate {candidate_id} and rate its legibility from 1 to 5.",
    }]
 
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=[{
                "name": "render_thumbnail",
                "description": "Generate and return the thumbnail PNG for a candidate ID",
                "input_schema": {
                    "type": "object",
                    "properties": {"candidate_id": {"type": "string"}},
                    "required": ["candidate_id"],
                },
            }],
            messages=messages,
        )
 
        if resp.stop_reason != "tool_use":
            return resp  # the final rating came back
 
        # Handle the tool_use blocks
        tool_results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            png = render_thumbnail(block.input["candidate_id"])
            b64 = base64.standard_b64encode(png).decode("ascii")
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": b64},
                }],
            })
 
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": tool_results})

If you forget messages.append({"role": "assistant", "content": resp.content}), the API rejects the next request with an error saying, in effect, that a tool_result block has no corresponding tool_use block. The required order is: append the assistant's tool_use turn back into the history, then return the tool_result in the very next user turn.
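
A cheap way to catch pairing mistakes before they reach the API is to compare IDs locally. This is a minimal sketch with a hypothetical helper name; it accepts either plain dicts or objects with attributes, on the assumption that SDK content blocks expose `type` and `id` as attributes the way the loop above uses them.

```python
def unanswered_tool_use_ids(assistant_content: list, tool_results: list) -> list:
    """Return the tool_use ids in an assistant turn that have no tool_result."""

    def get(block, key):
        # Support both dict-shaped blocks and attribute-style objects.
        if isinstance(block, dict):
            return block.get(key)
        return getattr(block, key, None)

    wanted = [get(b, "id") for b in assistant_content if get(b, "type") == "tool_use"]
    answered = {get(r, "tool_use_id") for r in tool_results}
    return [tool_id for tool_id in wanted if tool_id not in answered]
```

In the loop above, asserting that `unanswered_tool_use_ids(resp.content, tool_results)` is empty just before the two `messages.append` calls would turn a remote API error into a local one.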

The shape is identical in the TypeScript SDK

The structure does not change with the language. As long as you make content an array and include an image block, it works.

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
 
const toolResult = {
  type: "tool_result" as const,
  tool_use_id: toolUseId,
  content: [
    {
      type: "image" as const,
      source: { type: "base64" as const, media_type: "image/png" as const, data: b64 },
    },
    { type: "text" as const, text: "This is the latest App Store screenshot." },
  ],
};

After I swap out App Store screenshots, I show the new image to Claude through a tool and have it do a first-pass check for clipped text or content spilling outside the device frame. Showing the actual image gets far more specific feedback than describing it in words. Back when I passed base64 as a string here, I was wasting tens of thousands of tokens on every check.

URL and the Files API as alternatives

Embedding base64 in every message makes requests heavy. If you reference the same image repeatedly, you can change the source type.

| source type | Good for | Watch out for |
| --- | --- | --- |
| base64 | A one-off, single-use image | Bloats the request; repeated sends waste bandwidth |
| url | An image reachable at a public URL | Anthropic's servers must be able to fetch it |
| file (a file_id from the Files API) | Referencing the same image many times | Requires uploading first; check the beta header |

In a pipeline that shows the same figure over and over, uploading once to the Files API and referencing by file_id avoids resending base64 each time. For a loop like mine that inspects thumbnails in a morning batch, I upload the fixed reference image up front and only send the changing candidate image as base64.
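
The three options differ only in the `source` object, so a single builder can cover them. This is a sketch under assumptions: `image_source` is a hypothetical helper, and the `url` and `file` shapes follow my reading of the API documentation, so verify the field names (and any required beta header for the Files API) against the current docs before relying on them.

```python
def image_source(kind: str, value: str, media_type: str = "image/png") -> dict:
    """Build the `source` object of an image block.

    kind: "base64" (value is base64 text), "url" (value is a public URL),
          or "file" (value is a Files API file_id).
    The url/file field names are assumptions taken from the API docs.
    """
    if kind == "base64":
        return {"type": "base64", "media_type": media_type, "data": value}
    if kind == "url":
        return {"type": "url", "url": value}
    if kind == "file":
        return {"type": "file", "file_id": value}
    raise ValueError(f"unknown image source kind: {kind!r}")


def image_block(kind: str, value: str, media_type: str = "image/png") -> dict:
    """Wrap a source in an image content block for a tool_result array."""
    return {"type": "image", "source": image_source(kind, value, media_type)}
```

With this split, a fixed reference image and a changing candidate can go into the same `content` array as `image_block("file", reference_id)` followed by `image_block("base64", candidate_b64)`, where both variable names are placeholders.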

Pitfall checklist

A few things that trip people up mid-implementation.

Get the media type right: use exactly one of image/png, image/jpeg, image/gif, or image/webp, matching the real bytes. Declaring a PNG as image/jpeg can get the request rejected or leave the image undecodable.
Downscale before sending: an oversized image costs more tokens and gets downscaled internally anyway. Dropping it to the resolution you actually need keeps both cost and reproducibility stable.
Not auto-shown to users: an image returned as a tool result is visible to Claude, but it is not automatically rendered inline in your app's final answer. To show it in a chat UI, you have to draw it explicitly in your own frontend.
Order and pairing: keep the tool_use (assistant) → tool_result (user) correspondence intact. If you call multiple tools in parallel, return a result for every tool_use_id.
Per-image size limit: the API caps the size of a single image (the documented limit has been 5 MB per image; check the current vision docs for the exact figure). Rather than throwing the full-resolution original at it, design your tool to return a downscaled inspection copy.
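
The first two items on the checklist can be enforced in code. The following is a minimal, standard-library-only sketch with hypothetical helper names: one function infers the media type from the file's leading bytes instead of trusting an extension, and the other computes target dimensions for a downscale (the resize itself would be done with whatever imaging library you already use).

```python
from typing import Optional, Tuple


def sniff_media_type(data: bytes) -> Optional[str]:
    """Guess the media type from magic bytes; None if it is not a supported format."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Returning `None` for unknown bytes lets the tool fail loudly with its own error message rather than sending a block with a guessed media type.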

The one line to change first

If you already have a tool returning images, the smallest improvement you can ship today is clear. Find where you put base64 into content as a string, make content an array, and replace it with a {"type": "image", "source": {...}} block. That alone cuts the token cost of the same image to roughly a tenth, and for the first time lets Claude actually see the content and judge it.

I hope this helps with your own implementation. I am still finding the sweet spot as I run this in practice, but when you write a tool that returns images, just checking one thing first — "am I making Claude read this as text, or letting it see it as an image?" — already goes a long way.
