Claude Vision API in Production — Implementation Patterns for Image Analysis, PDF Processing, and OCR
Implementation patterns for taking Claude's vision capabilities to production: choosing between Base64, URL, and the Files API, native PDF processing, schema-enforced extraction with Tool Use, batch cost reduction, and error recovery — all with working code.
The Three Places a "Working" Vision Integration Breaks in Production
Encode an image to Base64, pass it to messages.create, and Claude describes it on the spot. That part takes thirty minutes.
The trouble starts afterward. Building image-analysis pipelines as an indie developer, I ran into three walls that never showed up during prototyping.
The first is cost. Images consume far more tokens than text. Stream high-resolution photos through without resizing and your invoice lands at several times the estimate.
The second is output instability. Asking for JSON in the prompt works nine times out of ten. The tenth time, a preamble sneaks in, json.loads throws, and your overnight batch dies at 3 a.m.
The third is PDF handling. If you carry over the old convert-pages-to-images approach, you throw away the text layer entirely — and both accuracy and cost suffer for it.
This article walks through those three walls in order. Every code sample is complete Python you can run as-is.
Three Input Methods — Decide by Reuse, Not Habit
There are three ways to hand Claude an image: inline Base64, a URL reference, or the Files API. The right choice comes down to two questions: how many times will you analyze this image, and can it be public?
| Method | Best for | Watch out for |
|------|----------|------|
| Base64 | One-shot analysis, private images | Request size inflation |
| URL | Already-public assets on a CDN | Useless for private images |
| Files API | Repeated analysis of the same image | One extra upload step |
Inline Base64 — the default starting point
For a private image you analyze once, Base64 is the most direct route.
```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def encode_image(path: str) -> tuple[str, str]:
    """Base64-encode an image and return it with its media type."""
    p = Path(path)
    media_type = MEDIA_TYPES.get(p.suffix.lower(), "image/jpeg")
    data = base64.standard_b64encode(p.read_bytes()).decode("utf-8")
    return data, media_type


def analyze_image(path: str, prompt: str) -> str:
    data, media_type = encode_image(path)
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return message.content[0].text


print(analyze_image("screenshot.png", "Extract every error message visible on this screen."))
```
One thing to keep in mind: the total request size limit is 32MB, and Base64 inflates files by roughly 1.33x. Bundle several 20MB images into one request and you sail past the limit. If your design involves multiple images, always resize first (covered below).
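A quick pre-flight check makes that limit concrete. The sketch below estimates the Base64 size of each file (4 output bytes for every 3 input bytes, rounded up) and compares the total with the 32MB ceiling. The function names are hypothetical, and reading 32MB as 32 × 1024 × 1024 bytes is an assumption. The estimate ignores the JSON envelope and prompt text, so leave headroom.

```python
import math

REQUEST_LIMIT_BYTES = 32 * 1024 * 1024  # assumption: the 32MB limit read as MiB


def base64_size(raw_bytes: int) -> int:
    """Size in bytes of the padded Base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_in_one_request(raw_sizes: list[int], limit: int = REQUEST_LIMIT_BYTES) -> bool:
    """True if the encoded images alone stay within the request limit."""
    return sum(base64_size(n) for n in raw_sizes) <= limit


twenty_mb = 20 * 1024 * 1024
print(base64_size(twenty_mb))                        # encoded size of one 20MB image
print(fits_in_one_request([twenty_mb]))              # a single image fits
print(fits_in_one_request([twenty_mb, twenty_mb]))   # two together do not
```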
URL references — for assets you already serve
If the image is already on a CDN, just pass the URL. Requests get lighter and the encoding step disappears.
```python
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "url", "url": "https://example.com/assets/diagram.png"}},
            {"type": "text", "text": "Describe the processing flow in this diagram as a bullet list."},
        ],
    }],
)
```
The URL must be reachable from Anthropic's servers. Intranet-only URLs and unsigned links to authenticated storage will fail with an invalid_request_error. If you adopt the URL approach, wire that error to a Base64 fallback and the pipeline stays stable.
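One way to wire that fallback, as a minimal sketch rather than a finished client. Here `send` is a hypothetical wrapper around your own `client.messages.create` call that takes one image block and returns text, `fetch_bytes` is a hypothetical callable that loads the image from your side of the network, and `bad_request` is whatever exception your client raises for a 400 (with the Anthropic SDK that would be `anthropic.BadRequestError`).

```python
import base64
from typing import Callable


def url_block(url: str) -> dict:
    """Image block that points at a publicly reachable URL."""
    return {"type": "image", "source": {"type": "url", "url": url}}


def base64_block(raw: bytes, media_type: str) -> dict:
    """Image block that carries the image inline as Base64."""
    data = base64.standard_b64encode(raw).decode("utf-8")
    return {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}}


def analyze_with_fallback(
    send: Callable[[dict], str],
    url: str,
    fetch_bytes: Callable[[], bytes],
    media_type: str,
    bad_request: type[Exception],
) -> str:
    """Try the URL source first; if the request is rejected, resend inline."""
    try:
        return send(url_block(url))
    except bad_request:
        # The URL was not usable from the API side, so load the bytes
        # ourselves and send them as Base64 instead.
        return send(base64_block(fetch_bytes(), media_type))
```

Passing the exception type in keeps the helper independent of any one SDK. A 400 can also mean a corrupt image, in which case the Base64 attempt fails too and should go to quarantine, not back into the retry loop.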
Files API — when the same image gets analyzed repeatedly
When your design sends multiple requests against the same image — classify first, then deep-analyze, then extract metadata — re-sending Base64 every time is wasteful. Upload once with the Files API and reference by file_id.
```python
# Upload once (the context manager closes the file handle afterwards)
with open("design.png", "rb") as f:
    uploaded = client.beta.files.upload(
        file=("design.png", f, "image/png"),
    )

# Reference by file_id from then on
message = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    betas=["files-api-2025-04-14"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "file", "file_id": uploaded.id}},
            {"type": "text", "text": "List the color palette used in this UI design."},
        ],
    }],
)
```
My personal rule: two or more reuses means Files API, already public means URL, everything else is Base64. Start with Base64 and migrate when transfer volume starts to bother you — that ordering works in practice.
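That rule is small enough to write down as a function. This is only a sketch: the function name and return labels are hypothetical, and the precedence (reuse is checked before public availability) is an assumption based on the order in which the rule is stated.

```python
def choose_input_method(expected_uses: int, is_public: bool) -> str:
    """Pick an image input method from reuse count and visibility."""
    if expected_uses >= 2:
        return "files_api"   # upload once, reference by file_id
    if is_public:
        return "url"         # already served from a CDN
    return "base64"          # one-shot, private image


print(choose_input_method(expected_uses=1, is_public=False))
print(choose_input_method(expected_uses=3, is_public=False))
```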
The biggest cost lever in vision work is not model choice. It is resizing. Token consumption is roughly:
tokens ≈ (width × height) ÷ 750
A 1920×1080 screenshot costs about 2,765 tokens. A 4032×3024 phone photo costs about 16,257 if sent raw. The API automatically scales down images whose long edge exceeds 1568px, but resizing client-side is still worth it — you save transfer volume and keep control of the aspect ratio.
```python
import base64
import io

from PIL import Image


def resize_for_claude(path: str, max_edge: int = 1568, quality: int = 85) -> tuple[str, str]:
    """Fit the long edge to max_edge, compress to JPEG, return Base64."""
    img = Image.open(path)
    # JPEG cannot store alpha or palette data; normalizing every non-RGB
    # mode (RGBA, P, LA, ...) avoids a save error on unusual inputs.
    if img.mode != "RGB":
        img = img.convert("RGB")
    ratio = max_edge / max(img.size)
    if ratio < 1:  # only ever shrink, never upscale
        img = img.resize(
            (int(img.width * ratio), int(img.height * ratio)),
            Image.LANCZOS,
        )
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    data = base64.standard_b64encode(buf.getvalue()).decode("utf-8")
    return data, "image/jpeg"


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token cost of an image from its pixel dimensions."""
    return (width * height) // 750
```
In my testing, text-reading tasks — OCR, table extraction — hold their accuracy best at the full 1568px long edge. Classification tasks ("what is in this image?") degrade very little down to 1092px, which nearly halves the token count. Varying the resize target by task type is a small habit with a direct line to your monthly bill.
For scale: a 1024×1024 image is about 1,398 tokens. At $3 per million input tokens that is roughly $0.0042 per image, or about $42 for ten thousand images — and the Batch API described below halves it again.
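The same arithmetic as a small estimator, useful for sanity-checking a job before it runs. This is a sketch under the figures quoted above: the width × height ÷ 750 approximation, $3 per million input tokens, and a 50% batch discount. The function name and parameters are hypothetical, and output tokens are not counted.

```python
def estimate_input_cost_usd(
    width: int,
    height: int,
    images: int,
    usd_per_mtok: float = 3.0,   # input price per million tokens, from the text
    batch: bool = False,         # apply the 50% Batch API discount
) -> float:
    """Rough input-token cost for a set of same-sized images."""
    tokens_per_image = (width * height) // 750
    cost = tokens_per_image * images * usd_per_mtok / 1_000_000
    return cost / 2 if batch else cost


print(round(estimate_input_cost_usd(1024, 1024, 1), 4))
print(round(estimate_input_cost_usd(1024, 1024, 10_000), 2))
print(round(estimate_input_cost_usd(1024, 1024, 10_000, batch=True), 2))
```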
Let the Native document Block Handle PDFs
The once-standard pdf2image conversion pipeline no longer has a reason to exist. Claude accepts PDFs directly.
```python
import base64
from pathlib import Path

pdf_data = base64.standard_b64encode(Path("report.pdf").read_bytes()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data}},
            {"type": "text", "text": "Extract revenue, operating profit, and year-over-year change from this report as a table."},
        ],
    }],
)
print(message.content[0].text)
```
Native handling wins on two fronts. Each page is interpreted as both its text layer and its rendered image, so typed text is read precisely while charts and figures are understood visually — the image-conversion approach silently discarded that text layer. And the API takes over page splitting, resolution handling, and ordering, which means the pdf2image-plus-poppler dependency disappears from your project entirely.
The limits are 100 pages and 32MB. Beyond that, split first:
```python
import io

from pypdf import PdfReader, PdfWriter


def split_pdf(path: str, pages_per_chunk: int = 90) -> list[bytes]:
    """Split a PDF to stay under the 100-page limit."""
    reader = PdfReader(path)
    chunks = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for page in reader.pages[start:start + pages_per_chunk]:
            writer.add_page(page)
        buf = io.BytesIO()
        writer.write(buf)
        chunks.append(buf.getvalue())
    return chunks
```
Enable citations to get "which page says so"
Put document analysis in front of a business team and the first question is always: where did that number come from? Enable citations and every part of the answer carries its source location.
```python
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
             "citations": {"enabled": True}},
            {"type": "text", "text": "Summarize the key points of the termination clause."},
        ],
    }],
)

for block in message.content:
    if block.type == "text":
        print(block.text)
        # end_page_number is exclusive, hence the -1 when printing a range
        for c in getattr(block, "citations", None) or []:
            print(f"  └ source: pages {c.start_page_number}–{c.end_page_number - 1}")
```
Citations turn verification from re-reading the document into checking a page reference, so for internal document-summary pipelines I recommend turning them on by default.
OCR and Table Extraction — Stop Asking for JSON
This is the section I most want you to take away.
The pattern of writing "respond in JSON format" in the prompt and calling json.loads(message.content[0].text) will break in production. Guaranteed. A preamble like "Here are the extracted results" appears, or the JSON arrives wrapped in a code fence, or a few responses per thousand are simply malformed.
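The fix is Tool Use. Define the extraction structure as a tool's input schema and force the tool with tool_choice. The answer then arrives as a tool_use block whose input is already a parsed object shaped by your schema, so there are no preambles, no code fences, and nothing to run through json.loads. Forcing a tool is not a formal validation step, so keep a light check on required fields wherever downstream code depends on them.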
```python
TABLE_TOOL = {
    "name": "record_tables",
    "description": "Records table data extracted from an image",
    "input_schema": {
        "type": "object",
        "properties": {
            "tables": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string", "description": "Table title or heading"},
                        "headers": {"type": "array", "items": {"type": "string"}},
                        "rows": {
                            "type": "array",
                            "items": {"type": "array", "items": {"type": "string"}},
                        },
                        "notes": {"type": "string", "description": "Notes on missing or illegible cells"},
                    },
                    "required": ["headers", "rows"],
                },
            }
        },
        "required": ["tables"],
    },
}


def extract_tables(image_path: str) -> dict:
    data, media_type = resize_for_claude(image_path)  # keep 1568px for OCR-type tasks
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=[TABLE_TOOL],
        tool_choice={"type": "tool", "name": "record_tables"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
                {"type": "text", "text": "Extract every table in this image. "
                                         "Use an empty string for illegible cells and record their locations in notes."},
            ],
        }],
    )
    for block in message.content:
        if block.type == "tool_use":
            return block.input  # already a dict, shaped by TABLE_TOOL's input_schema
    return {"tables": []}  # defensive default if no tool_use block came back
```
After switching to this pattern, parse-related failures in my own pipeline effectively dropped to zero. The try-except around json.loads went away. The regex that stripped code fences went away. Less code, more reliability — improvements like that don't come along often.
Extracted tables feed straight into pandas:
```python
import pandas as pd

result = extract_tables("financial_report.png")
for i, table in enumerate(result["tables"]):
    df = pd.DataFrame(table["rows"], columns=table["headers"])
    df.to_csv(f"table_{i}.csv", index=False)
    if table.get("notes"):
        print(f"table_{i} needs review: {table['notes']}")
```
Small techniques that lift handwriting and low-quality scans
Three adjustments make a visible difference on handwritten documents and old scans:
1. Give the model an escape hatch. Instruct it to record illegible characters as [unreadable]. Without one, the model guesses, and confident guesses are the most dangerous failure mode.
2. Provide domain vocabulary up front. For invoices, a candidate list of account names and vendor names in the prompt sharply improves resolution of ambiguous characters.
3. Ask for self-reported confidence. Add a confidence field (high/medium/low) to the schema and route only the lows to human review. Compared with eyeballing every document, the review load drops dramatically.
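The escape hatch and the confidence field can live in the same schema. The sketch below shows one possible shape for a per-field schema and a routing helper. The field names (`name`, `value`, `confidence`) and both helper functions are hypothetical. Sending `[unreadable]` values to review alongside the lows, and treating a missing confidence as low, are assumptions of this sketch.

```python
FIELD_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Field label, e.g. vendor or total"},
        "value": {"type": "string", "description": "Use [unreadable] for illegible text"},
        "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
    },
    "required": ["name", "value", "confidence"],
}


def needs_review(field: dict) -> bool:
    """Flag low-confidence fields and anything marked [unreadable]."""
    low = field.get("confidence", "low") == "low"  # missing confidence counts as low
    return low or "[unreadable]" in field.get("value", "")


def split_for_review(fields: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (accepted, review_queue) without dropping any field."""
    accepted = [f for f in fields if not needs_review(f)]
    review = [f for f in fields if needs_review(f)]
    return accepted, review


fields = [
    {"name": "vendor", "value": "Acme Supply", "confidence": "high"},
    {"name": "total", "value": "1,2[unreadable]0", "confidence": "medium"},
    {"name": "date", "value": "2024-03-07", "confidence": "low"},
]
accepted, review = split_for_review(fields)
print([f["name"] for f in accepted], [f["name"] for f in review])
```

`FIELD_SCHEMA` would sit inside a tool's `input_schema` as the item type of an array, in the same way the table schema does above.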
Screenshot Understanding as a UI Review Step
Vision is not only for documents. Wired into a development workflow, screenshots become reviewable artifacts.
My own habit is to run pre-release screenshot sets through a check for truncated text, insufficient contrast, and tap targets under 44×44pt. It does not replace human review — it clears the mechanical findings first, so the human pass can focus on judgment calls.
One caution: do not ask for coordinates. "Return this button's position in pixels" produces errors too large for automation. Take locations as descriptions ("the save button at top right") and leave precise targeting to another layer — the accessibility tree or the DOM. That division of labor is the realistic one.
Processing Images at Scale — Parallel Calls vs. the Batch API
Past a few hundred images you have two options: parallel execution when you need results now, and the Message Batches API when tomorrow morning is fine.
There is only one decision rule: do you need the results immediately? The Batch API discounts every request by 50% in exchange for completion within 24 hours (usually under one). Running overnight-tolerant workloads through ThreadPoolExecutor is simply paying double for the same answers.
Immediate: parallel execution under rate limits
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import anthropic

# Uses `client` and resize_for_claude() from the earlier samples.


def analyze_one(path: str, prompt: str) -> tuple[str, str]:
    data, media_type = resize_for_claude(path, max_edge=1092)  # classification tolerates smaller images
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Haiku is enough for classification
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return path, message.content[0].text


def analyze_parallel(paths: list[str], prompt: str, workers: int = 4) -> dict:
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(analyze_one, p, prompt): p for p in paths}
        for future in as_completed(futures):
            try:
                path, text = future.result()
                results[path] = text
            except anthropic.RateLimitError:
                errors[futures[future]] = "rate_limit"
            except Exception as e:
                errors[futures[future]] = str(e)
    return {"results": results, "errors": errors}
```
Start with modest concurrency. Vision requests carry heavy input tokens, and treating them like text requests at 20 workers hits the tokens-per-minute ceiling almost immediately. I start at 4 and raise it only while 429s stay absent.
A batch holds up to 100,000 requests or 256MB. With images you hit the megabytes before the count, so 500–1,000 resized images per batch is a comfortable working size.
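A sketch of the deferred path under those limits. `build_batch_requests` is a hypothetical helper that turns already-resized images into the request list. The `custom_id` format and the default chunk size of 500 are assumptions taken from the working size above. `submit_batch` takes an `anthropic.Anthropic` client and calls the SDK's `client.messages.batches.create`.

```python
def build_batch_requests(
    images: list[tuple[str, str]],          # (base64_data, media_type) per image
    prompt: str,
    model: str = "claude-haiku-4-5-20251001",
    max_tokens: int = 256,
) -> list[dict]:
    """Build one batch entry per image, keyed by a stable custom_id."""
    requests = []
    for i, (data, media_type) in enumerate(images):
        requests.append({
            "custom_id": f"img-{i:05d}",    # used later to match results to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content": [
                        {"type": "image",
                         "source": {"type": "base64", "media_type": media_type, "data": data}},
                        {"type": "text", "text": prompt},
                    ],
                }],
            },
        })
    return requests


def chunked(items: list, size: int = 500) -> list[list]:
    """Split a request list into batches of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def submit_batch(client, requests: list[dict]) -> str:
    """Submit one batch and return its id for later polling."""
    batch = client.messages.batches.create(requests=requests)
    return batch.id
```

Results are not returned in submission order, so keep a mapping from `custom_id` back to the source file when you build the list.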
Model selection multiplies the savings. Recognition tasks — what is this, which category — run fine on Haiku. Haiku plus the Batch API lands at under a tenth of the per-image cost of immediate Sonnet calls. Before running everything through Sonnet, validate your accuracy floor on Haiku first. I mean that as a strong recommendation.
Prompt Caching for Repeated Analysis of the Same Document
In interactive workloads — same image or PDF, changing questions — prompt caching pays off. Attach cache_control to the image or document block and subsequent reads cost one-tenth of the input price.
The cache lives five minutes by default and refreshes on every hit. In a chat UI where a user loads a 100-page document and asks question after question, every question past the first pays 10% for the document portion. Cache writes carry a 25% premium, so a one-shot analysis actually loses money — but from the second question on, you are almost certainly ahead.
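In code, caching is one extra key on the content block. The sketch below shows that block plus the break-even arithmetic from the paragraph above. `cached_document_block` and `relative_document_cost` are hypothetical helpers, and the cost model assumes every follow-up question arrives while the cache entry is still alive.

```python
def cached_document_block(pdf_b64: str) -> dict:
    """Document block marked as a cache breakpoint."""
    return {
        "type": "document",
        "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64},
        "cache_control": {"type": "ephemeral"},
    }


def relative_document_cost(questions: int, cached: bool) -> float:
    """Cost of the document portion, in units of one uncached read.

    Cached: the first question pays the 25% write premium (1.25),
    each later question pays a 10% cache read (0.10).
    """
    if questions <= 0:
        return 0.0
    if not cached:
        return float(questions)
    return 1.25 + 0.10 * (questions - 1)


for n in (1, 2, 5):
    print(n, relative_document_cost(n, cached=False), round(relative_document_cost(n, cached=True), 2))
```

With one question, caching costs 1.25 against 1.0. With two it is already ahead, at 1.35 against 2.0.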
A local result cache is worth running alongside it. Storing results keyed by a hash of image bytes plus prompt prevents duplicate billing on retries and re-runs:
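A minimal file-based version of such a cache might look like this. It is a sketch, not the author's implementation: `cache_key`, `cached_analyze`, and the one-JSON-file-per-key layout are all assumptions, and `analyze` stands for any function of yours that takes image bytes and a prompt and returns text.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(image_bytes: bytes, prompt: str) -> str:
    """SHA-256 over image bytes plus prompt."""
    h = hashlib.sha256()
    h.update(image_bytes)
    h.update(b"\x00")  # separator so (b"a", "b") and (b"ab", "") hash differently
    h.update(prompt.encode("utf-8"))
    return h.hexdigest()


def cached_analyze(
    image_bytes: bytes,
    prompt: str,
    analyze: Callable[[bytes, str], str],
    cache_dir: Path,
) -> str:
    """Return a stored result if one exists; otherwise call analyze and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{cache_key(image_bytes, prompt)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["result"]
    result = analyze(image_bytes, prompt)
    entry.write_text(json.dumps({"result": result}), encoding="utf-8")
    return result
```

Because the prompt is part of the key, changing the wording of a prompt invalidates old entries automatically. If you switch models, add the model name to the key as well.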
Production Error Handling — Classify Before You Retry
The errors a vision pipeline actually encounters fall into four classes, each with a different correct response:
| Error | Typical cause | Response |
|------|----------|------|
| 400 invalid_request_error | Corrupt image, unsupported format, size limit | Retrying is pointless. Validate and quarantine |
| 413 request_too_large | Total request over 32MB | Resize or split, then resend |
| 429 rate_limit_error | Tokens-per-minute ceiling | Exponential backoff |
| 529 overloaded_error | Transient API overload | Wait and resend |
The crucial habit is excluding 400s from retries. A corrupt image fails identically on every attempt. Let 400s into the retry loop and the failure queue clogs while wasted requests burn through your rate limit.
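The table translates into a small dispatch function. The sketch below classifies by HTTP status code (with the Anthropic SDK, `anthropic.APIStatusError` exposes it as `status_code`) and computes a jittered exponential delay for the retryable classes. The action labels, the delay parameters, and the choice to give 429 and 529 the same backoff are assumptions of this sketch.

```python
import random


def classify_error(status_code: int) -> str:
    """Map a status code to the pipeline's next action."""
    if status_code == 400:
        return "quarantine"   # retrying cannot help; set the file aside
    if status_code == 413:
        return "resize"       # shrink or split, then resend
    if status_code in (429, 529):
        return "retry"        # transient; back off and try again
    return "fail"             # anything else: surface it to the operator


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter, in seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


for code in (400, 413, 429, 529):
    print(code, classify_error(code))
```

Keeping classification separate from the retry loop makes it easy to count quarantined files per day, which is usually the first sign of a broken upstream export.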
One more easily missed habit: log token usage on every call. Record message.usage and watch the daily average input tokens per image. A regression in your resize step — someone changed max_edge, raw images slipped into the queue — shows up first as a jump in that number. Finding out from the invoice is far too late.
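A check like the following is enough to catch that regression. It assumes you have collected `message.usage.input_tokens` for each single-image call into a list. The function name and the 25% tolerance are hypothetical and should be tuned to how much your image sizes normally vary.

```python
from statistics import mean


def input_tokens_regressed(
    todays_input_tokens: list[int],
    baseline_avg: float,
    tolerance: float = 0.25,
) -> bool:
    """True if today's average input tokens per image exceeds the baseline
    by more than `tolerance` (0.25 means 25% above baseline)."""
    if not todays_input_tokens:
        return False  # no calls today, nothing to compare
    return mean(todays_input_tokens) > baseline_avg * (1 + tolerance)


print(input_tokens_regressed([1400, 1500, 1450], baseline_avg=1400.0))
print(input_tokens_regressed([1400, 16000, 1450], baseline_avg=1400.0))
```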
Comparing Two Documents — Diff Detection
Contract revisions, design comps against implemented screens — wanting "the differences between two images" comes up more often than you might expect. The basic shape: multiple image blocks in one message, each preceded by a text label.
```python
DIFF_TOOL = {
    "name": "record_differences",
    "description": "Records differences between two documents",
    "input_schema": {
        "type": "object",
        "properties": {
            "differences": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "section": {"type": "string", "description": "Where the difference is"},
                        "document_a": {"type": "string", "description": "Content in the first document"},
                        "document_b": {"type": "string", "description": "Content in the second document"},
                        "significance": {"type": "string", "enum": ["substantive", "formatting"]},
                    },
                    "required": ["section", "document_a", "document_b", "significance"],
                },
            },
            "identical": {"type": "boolean"},
        },
        "required": ["differences", "identical"],
    },
}


def compare_documents(path_a: str, path_b: str) -> dict:
    data_a, type_a = resize_for_claude(path_a)
    data_b, type_b = resize_for_claude(path_b)
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=[DIFF_TOOL],
        tool_choice={"type": "tool", "name": "record_differences"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "First document (old version):"},
                {"type": "image", "source": {"type": "base64", "media_type": type_a, "data": data_a}},
                {"type": "text", "text": "Second document (new version):"},
                {"type": "image", "source": {"type": "base64", "media_type": type_b, "data": data_b}},
                {"type": "text", "text": "Compare the two documents and record every difference. "
                                         "Distinguish substantive wording changes from layout-only changes."},
            ],
        }],
    )
    for block in message.content:
        if block.type == "tool_use":
            return block.input
    return {"differences": [], "identical": False}
```
The labels before each image matter. Without an explicit "first" and "second," answers drift into ambiguous references — "the other image" — that downstream code cannot work with.
The significance field separating substantive from formatting changes comes from experience: a font swap once buried a review under dozens of cosmetic diffs. Classifying noise at the schema level changes the load on everything downstream.
What Vision Is Bad At — Avoid These at Design Time
Finally, the tasks I have learned not to assign to vision. Knowing them late means redesigning the feature.
Precise counting. Past roughly twenty small elements in frame, miscounts become common. Inventory-photo counting belongs to a detection model (YOLO-family) with Claude classifying and describing, not counting.
Pixel coordinates. As noted earlier, pixel positions are not accurate enough for automation. Combine with accessibility APIs or DOM data when you need targets.
Strict color matching. "Are these two colors identical?" or "does this match brand color #FF6B35?" cannot be trusted through JPEG compression and resizing. Read pixel values directly with PIL — more reliable, and free.
Identifying people. The API is designed not to identify individuals from faces. This is a deliberate property rather than a limitation; if your requirements include it, you need a different approach altogether.
Inverted, this list is reassuring: outside those zones — reading documents, extracting structure, describing content, flagging quality issues — the accuracy is stable enough to replace dedicated OCR engines. Deciding the division of labor is the designer's job.
A First Step — Replace Table Extraction with Tool Use
You do not need to adopt everything here at once. Ranked by impact, my experience says:
1. Replace output parsing with schema-enforced Tool Use (a step change in reliability)
2. Move non-urgent batches to the Message Batches API (the same work at half price)
The first item touches existing code the least — swap the "respond in JSON" prompt and its parsing for a tool definition, and half a day covers the migration. Worth doing before tonight's batch dies on a json.loads exception.
Thank you for reading. If you are taking vision processing to production too, I hope something here saves you a wall or two.