Vertex AI × Claude Enterprise Integration Guide: Prompt Caching, Multimodal, and Agent Design
A practical guide to enterprise-grade Claude integrations on Google Cloud Vertex AI. Covers prompt caching, BigQuery logging, multimodal processing, agent design, RAG, and production-ready patterns.
Getting Claude running on Vertex AI is straightforward — as covered in the setup guide. But running a stable, cost-efficient production service at scale requires significantly more than a basic API call.
This guide goes deep into the following areas, with working code for each:
Prompt caching: Dramatically reduce costs by reusing repeated context
BigQuery logging: Compliance, quality monitoring, and cost analysis
Multimodal processing: Handling images, PDFs, and complex document inputs
Agent design: Tool calling and multi-agent orchestration
RAG integration: Connecting Claude to your internal knowledge base
Production patterns: Retry logic, circuit breakers, and cost controls
1. Prompt Caching: Cut API Costs by Up to 90%
How Prompt Caching Works
Claude supports prompt caching, which lets you cache long system prompts or context blocks after the first request. Subsequent requests that hit the cache are billed at roughly 10% of the standard input token price for the cached tokens, while the request that writes the cache pays a premium of about 25% on those tokens (for the default 5-minute cache lifetime). For applications that reuse the same context repeatedly, that adds up to a large cost reduction.
Prompt caching is especially valuable when:
Your system prompt runs to thousands of tokens (persona definitions, rules, knowledge)
Multiple questions are asked against the same document in a RAG system
A code assistant repeatedly references the same large codebase
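To see where the "up to 90%" figure comes from, the arithmetic can be sketched as below. This is a rough estimate, not a billing calculator: the function name, the $3 per million token base price, and the 1.25x write and 0.10x read multipliers are assumptions for illustration, and it assumes every request lands inside the cache lifetime.

```python
def input_cost_usd(prefix_tokens, suffix_tokens, requests,
                   base_price_per_mtok=3.00,   # assumed base input price
                   write_multiplier=1.25,      # assumed cache-write premium
                   read_multiplier=0.10,       # assumed cache-read discount
                   use_cache=True):
    """Estimate total input cost for `requests` calls that share one prefix."""
    per_token = base_price_per_mtok / 1_000_000
    if not use_cache:
        return (prefix_tokens + suffix_tokens) * requests * per_token
    # The first call writes the prefix to the cache; later calls read it.
    write = prefix_tokens * write_multiplier
    reads = prefix_tokens * read_multiplier * (requests - 1)
    # The per-request suffix (the user question) is never cached.
    suffix = suffix_tokens * requests
    return (write + reads + suffix) * per_token


# 100 questions against a 2,500-token system prompt, ~300 tokens per question
without_cache = input_cost_usd(2500, 300, 100, use_cache=False)
with_cache = input_cost_usd(2500, 300, 100, use_cache=True)
print(f"without cache: ${without_cache:.4f}")
print(f"with cache:    ${with_cache:.4f}")
print(f"saved:         {1 - with_cache / without_cache:.0%}")
```

The saving only approaches 90% when the cached prefix dominates the request; the uncached suffix is always billed at the full rate.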
Implementation: Caching System Prompts
from anthropic import AnthropicVertex

client = AnthropicVertex(project_id="your-project", region="asia-southeast1")

system_prompt = """You are a customer support agent for Acme Corp.
Follow these guidelines at all times.

[Product Catalog: 2,500 Products]
Product ID: P001 - SmartWatch Pro X
Price: $299
Specs: Heart rate monitor, GPS, 5ATM waterproof, 7-day battery
...
[This section may span thousands of tokens, which is exactly where caching shines]

[Support Policy]
1. Returns accepted within 30 days of purchase
2. Repair support available Monday–Friday, 9AM–6PM
3. Escalate urgent cases to senior support
...
"""

# Mark the system prompt as a cacheable prefix. The block must be identical
# across requests for the cache to hit.
cached_system = [
    {
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }
]

# First request: cache is created
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=cached_system,
    messages=[{"role": "user", "content": "Tell me about product P001"}],
)
print("Cache stats:", response1.usage)
# Expect cache_creation_input_tokens > 0 and cache_read_input_tokens == 0

# Second request: cache is hit
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=cached_system,
    messages=[{"role": "user", "content": "What's the return policy?"}],
)
print("Cache stats (2nd request):", response2.usage)
# Expect cache_creation_input_tokens == 0 and cache_read_input_tokens > 0:
# the system prompt tokens are now served from cache.
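To check that caching is actually working in production, it can help to turn the usage fields into a single hit ratio. The helper below is a minimal sketch: `cache_hit_ratio` is a hypothetical name, and the `SimpleNamespace` object stands in for a real `response.usage` so the example runs offline.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Share of input tokens served from cache for one response."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    # input_tokens counts only the tokens that were neither read from
    # nor written to the cache, so the three fields sum to the total input.
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + created + uncached
    return read / total if total else 0.0


# Stand-in for response2.usage (illustrative numbers, not measured)
sample_usage = SimpleNamespace(
    input_tokens=300,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=2700,
)
ratio = cache_hit_ratio(sample_usage)
print(f"cache hit ratio: {ratio:.0%}")
```

A ratio that stays near zero on repeated requests usually means the cached prefix is changing between calls or the cache has expired.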
2. BigQuery Logging for Compliance and Monitoring
Architecture Overview
Enterprise AI deployments increasingly require detailed input/output logging for compliance, quality audits, and cost tracking. Vertex AI supports request/response logging directly to BigQuery — but only on regional endpoints (not the global endpoint).
Important: Logging requires region="asia-southeast1" or another regional endpoint. The global endpoint does not support this feature.
Custom BigQuery Logging Client
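For more granular control, you can implement your own logging layer that writes one row per request to a BigQuery table. Once request time, model, and token counts are recorded, a query like the following gives a daily cost breakdown by model: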
-- Daily cost breakdown by model.
-- The rates below assume $3 (input) and $15 (output) per million tokens;
-- adjust them if the table mixes models with different pricing.
SELECT
  DATE(request_time) AS date,
  model,
  COUNT(*) AS request_count,
  SUM(input_tokens) AS total_input_tokens,
  SUM(output_tokens) AS total_output_tokens,
  ROUND(SUM(input_tokens) * 3 / 1000000, 2) AS input_cost_usd,
  ROUND(SUM(output_tokens) * 15 / 1000000, 2) AS output_cost_usd
FROM `your-project.claude_logs.messages`
WHERE DATE(request_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY date, model
ORDER BY date DESC;
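A custom logging layer that feeds this table could look roughly like the sketch below. The function names are hypothetical, and the row schema is inferred from the columns the query reads. The `insert_rows` argument is any callable that takes a table ID and a list of rows and returns a list of errors, which is the shape of `google.cloud.bigquery.Client.insert_rows_json`; a fake is used here so the example runs without credentials.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def build_log_row(model, usage, request_time=None):
    """Build one row matching the columns used by the cost query."""
    ts = request_time or datetime.now(timezone.utc)
    return {
        "request_time": ts.isoformat(),
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
    }


def log_request(insert_rows, table_id, model, usage):
    """Write one row; `insert_rows` returns a list of errors (empty on success)."""
    row = build_log_row(model, usage)
    errors = insert_rows(table_id, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return row


# Offline dry run with a fake inserter and a stand-in for response.usage
sent = []

def fake_insert(table_id, rows):
    sent.append((table_id, rows))
    return []  # no errors

usage = SimpleNamespace(input_tokens=1200, output_tokens=350)
row = log_request(fake_insert, "your-project.claude_logs.messages",
                  "claude-sonnet-4-6", usage)
print(row)
```

In a real service you would pass `bigquery.Client().insert_rows_json` as `insert_rows` and decide separately whether prompt and response text may be stored at all under your compliance rules.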
3. Multimodal Processing: Images, PDFs, and Documents
Image Analysis
Claude can understand images directly. Here's how to implement it on Vertex AI:
import base64
from anthropic import AnthropicVertex

client = AnthropicVertex(project_id="your-project", region="global")


def analyze_image(image_path: str, question: str) -> str:
    # Read the file and base64-encode it for the API
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    # Only PNG and JPEG are distinguished here; anything else is sent as JPEG
    media_type = (
        "image/png" if image_path.lower().endswith(".png") else "image/jpeg"
    )

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.content[0].text


# Example: analyze a sales chart
result = analyze_image(
    "monthly_report.png",
    "Identify the key trends in this sales chart and highlight areas needing attention.",
)
print(result)
Parallel Image Batch Processing
from concurrent.futures import ThreadPoolExecutor, as_completed


def batch_analyze_images(client, tasks, max_workers=5):
    """Process multiple images in parallel.

    Each task is a dict with "id", "image_path", and "question" keys.
    Note: analyze_image uses the module-level client, so the `client`
    argument is currently unused.
    """
    results = []

    def process_single(task):
        # Capture failures per task so one bad image does not abort the batch
        try:
            result = analyze_image(task["image_path"], task["question"])
            return {"id": task["id"], "result": result, "status": "success"}
        except Exception as e:
            return {"id": task["id"], "error": str(e), "status": "error"}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_single, t): t for t in tasks}
        # Results arrive in completion order, not submission order
        for future in as_completed(futures):
            results.append(future.result())

    return results
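PDFs can be sent in a similar way to images, using a document content block instead of an image block. The sketch below only builds the message payload; `build_pdf_message` is a hypothetical helper, and it assumes PDF input is enabled for your model and region on Vertex AI.

```python
import base64


def build_pdf_message(pdf_bytes: bytes, question: str) -> list:
    """Build a messages list containing one PDF document and one question."""
    data = base64.standard_b64encode(pdf_bytes).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": data,
                },
            },
            {"type": "text", "text": question},
        ],
    }]


# Placeholder bytes; in practice read them with open(path, "rb")
pdf_messages = build_pdf_message(b"%PDF-1.7 sample", "Summarize this contract.")
print(pdf_messages[0]["content"][1]["text"])
```

The resulting list can be passed as `messages=` to `client.messages.create`, with the same `model` and `max_tokens` arguments used for images.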
4. Tool Calling and Agent Design
Building a Tool-Using Agent
Claude's tool calling (function calling) lets you connect it to external APIs, databases, and internal systems.
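A minimal tool-using loop might look like the sketch below. The `get_order_status` tool, its handler, and the `run_agent` function are hypothetical examples, not part of any library; the client is passed in, so an `AnthropicVertex` instance can be supplied in a real service.

```python
import json

# Tool definition sent to the model: name, description, JSON Schema for inputs
TOOLS = [{
    "name": "get_order_status",
    "description": "Look up the shipping status of an order by its ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]


def get_order_status(order_id: str) -> dict:
    # Hypothetical stand-in for a call to an internal order system
    return {"order_id": order_id, "status": "shipped"}


HANDLERS = {"get_order_status": get_order_status}


def run_agent(client, user_text, model="claude-sonnet-4-6", max_turns=5):
    """Call the model, run any requested tools, and repeat until it answers."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_turns):
        response = client.messages.create(
            model=model, max_tokens=1024, tools=TOOLS, messages=messages
        )
        if response.stop_reason != "tool_use":
            # Final answer: join the text blocks
            return "".join(b.text for b in response.content if b.type == "text")

        # Echo the assistant turn, then answer every tool_use block in it
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = HANDLERS[block.name](**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(output),
                })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("Agent did not finish within max_turns")
```

The `max_turns` cap is a design choice to keep a misbehaving tool loop from running up cost; a production agent would also validate `block.input` and catch handler errors before returning them as tool results.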
5. RAG Integration: Connecting Claude to Your Internal Knowledge Base
A retrieval-augmented pipeline searches an internal data store first, then passes the top results to Claude as context:
from anthropic import AnthropicVertex
from google.cloud import discoveryengine_v1alpha as discoveryengine


class RAGPipeline:
    def __init__(self, claude_project, search_project, data_store_id):
        self.claude = AnthropicVertex(project_id=claude_project, region="asia-southeast1")
        self.search = discoveryengine.SearchServiceClient()
        self.search_project = search_project
        self.data_store_id = data_store_id

    def _search(self, question: str, top_k: int) -> list:
        # Not shown in the original listing. It is expected to query the data
        # store and return a list of dicts with "title", "snippet", and "link".
        raise NotImplementedError

    def answer(self, question: str, top_k: int = 5) -> dict:
        # 1. Retrieve relevant documents
        docs = self._search(question, top_k)
        if not docs:
            return {"answer": "No relevant documents found.", "sources": []}

        # 2. Build context
        context = "\n\n".join([
            f"[{d['title']}]\n{d['snippet']}" for d in docs
        ])

        # 3. Generate answer with Claude
        response = self.claude.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system="Answer questions accurately based on the provided context. "
                   "If the context doesn't contain the answer, say so clearly.",
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }],
        )

        return {
            "answer": response.content[0].text,
            "sources": [{"title": d["title"], "link": d["link"]} for d in docs],
        }
6. Production Reliability Patterns
Circuit Breaker
from enum import Enum
from datetime import datetime, timedelta
import threading


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # calls are rejected
    HALF_OPEN = "half_open"  # trial calls are allowed through


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        # Number of successful trial calls needed to close the circuit again
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
        self._lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == CircuitState.OPEN:
                elapsed = datetime.now() - self.last_failure_time
                if elapsed > timedelta(seconds=self.recovery_timeout):
                    # Recovery window has passed: allow trial calls
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    raise Exception("Circuit breaker is OPEN. Retry later.")

        # The lock is released while func runs so calls are not serialized
        try:
            result = func(*args, **kwargs)
            with self._lock:
                if self.state == CircuitState.HALF_OPEN:
                    self.half_open_calls += 1
                    if self.half_open_calls >= self.half_open_max_calls:
                        self.state = CircuitState.CLOSED
                        self.failure_count = 0
                else:
                    self.failure_count = 0
            return result
        except Exception:
            with self._lock:
                self.failure_count += 1
                self.last_failure_time = datetime.now()
                if self.failure_count >= self.failure_threshold:
                    self.state = CircuitState.OPEN
            raise
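The circuit breaker handles sustained outages; transient errors are usually handled one level down with retries. The sketch below shows exponential backoff with full jitter. `call_with_retries` and its parameters are hypothetical names, and the default `retry_on=(Exception,)` is deliberately broad for illustration; in practice you would likely narrow it to errors such as `anthropic.RateLimitError` or `anthropic.APIConnectionError`.

```python
import random
import time


def call_with_retries(func, *, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Backoff ceiling doubles each attempt: 0.5s, 1s, 2s, ... capped
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time up to the ceiling so that
            # many clients do not retry in lockstep
            sleep(random.uniform(0, ceiling))
```

One way to combine the two is to wrap the retrying call in the breaker, for example `breaker.call(call_with_retries, make_request)`, so that one exhausted retry sequence counts as a single failure. The Anthropic SDK also retries some errors on its own via the client's `max_retries` option, so it is worth checking that the two layers do not multiply each other.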
Token Budget Manager
from datetime import datetime


class TokenBudgetManager:
    """Per-user daily token budget enforcement (in-memory, single process)."""

    def __init__(self, daily_budget_per_user: int = 100_000):
        self.daily_budget = daily_budget_per_user
        self.usage = {}  # {user_id: {"YYYY-MM-DD": tokens_used}}

    def check_and_deduct(self, user_id: str, estimated_tokens: int):
        today = datetime.now().date().isoformat()
        if user_id not in self.usage:
            self.usage[user_id] = {}

        current = self.usage[user_id].get(today, 0)
        # Reject the request before it is sent if it would exceed the budget
        if current + estimated_tokens > self.daily_budget:
            remaining = self.daily_budget - current
            raise Exception(
                f"Daily token budget exceeded. Remaining: {remaining:,} tokens."
            )
        self.usage[user_id][today] = current + estimated_tokens
Enterprise Deployment Roadmap
A phased approach reduces risk and delivers early wins:
Phase 1 (Weeks 1–2): Foundation
Enable Vertex AI API and design IAM policies
Enable models in Model Garden
Implement basic API calls with error handling
Phase 2 (Month 1): Cost Optimization
Introduce prompt caching for your highest-volume use cases
Implement token budget management
Set up BigQuery cost dashboards
Phase 3 (Month 2): Advanced Features
Multimodal processing (image and document analysis)
Tool calling and agent orchestration
RAG with internal knowledge bases
Phase 4 (Ongoing): Scale and Reliability
Circuit breakers and reliability patterns
Evaluate Provisioned Throughput for peak loads
Model version management and A/B testing
By combining Google Cloud's enterprise infrastructure with Claude's advanced reasoning capabilities, your team can build production-grade AI applications with the security, compliance, and cost controls your organization requires.