API & SDK · 2026-04-07 · Advanced

Vertex AI × Claude Enterprise Integration Guide: Prompt Caching, Multimodal, and Agent Design

A practical guide to enterprise-grade Claude integrations on Google Cloud Vertex AI. Covers prompt caching, BigQuery logging, multimodal processing, agent design, RAG, and production-ready patterns.

Why You Need an Enterprise Architecture

Getting Claude running on Vertex AI is straightforward, as covered in the setup guide. Running a stable, cost-efficient production service at scale takes considerably more than a basic API call.

This guide goes deep into the following areas, with working code for each:

Prompt caching: Dramatically reduce costs by reusing repeated context
BigQuery logging: Compliance, quality monitoring, and cost analysis
Multimodal processing: Handling images, PDFs, and complex document inputs
Agent design: Tool calling and multi-agent orchestration
RAG integration: Connecting Claude to your internal knowledge base
Production patterns: Retry logic, circuit breakers, and cost controls

1. Prompt Caching: Cut API Costs by Up to 90%

How Prompt Caching Works

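Claude supports prompt caching, which stores a long system prompt or context block once a request marks it with cache_control. Writing to the cache costs somewhat more than normal input, but later requests that hit the cache are billed at roughly 10% of the standard input token price for the cached portion, which is where the "up to 90%" figure comes from. The saving applies only to the cached prefix: new user messages and output tokens are still billed at the normal rates. A cache entry also expires after a short idle period (about five minutes by default), so the benefit depends on reusing the same context frequently.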

Prompt caching is especially valuable when:

Your system prompt runs to thousands of tokens (persona definitions, rules, knowledge)
Multiple questions are asked against the same document in a RAG system
A code assistant repeatedly references the same large codebase
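
To see how the per-token discount translates into an overall bill, the arithmetic can be sketched as below. This is a rough estimate under stated assumptions, not official pricing: the function name and parameters are hypothetical, the base price of 3 USD per million input tokens mirrors the rate used in the cost query later in this guide, and the 1.25x write and 0.10x read multipliers are assumed values you should check against your current price list. It also assumes every request after the first one hits the cache.

```python
def estimate_input_cost(
    cached_tokens: int,
    fresh_tokens_per_request: int,
    requests: int,
    base_price_per_mtok: float = 3.0,  # assumed USD per million input tokens
    write_multiplier: float = 1.25,    # assumed surcharge for the first (cache write) request
    read_multiplier: float = 0.10,     # assumed discount for cache reads
) -> dict:
    """Compare input-token cost with and without caching a shared prefix."""
    if requests < 1:
        raise ValueError("requests must be at least 1")

    per_token = base_price_per_mtok / 1_000_000
    # Tokens outside the cached prefix are always billed at the base rate
    fresh_cost = requests * fresh_tokens_per_request * per_token

    without_cache = requests * cached_tokens * per_token + fresh_cost
    with_cache = (
        cached_tokens * per_token * write_multiplier                     # one cache write
        + (requests - 1) * cached_tokens * per_token * read_multiplier   # remaining cache reads
        + fresh_cost
    )
    savings = 1 - with_cache / without_cache if without_cache else 0.0
    return {
        "without_cache_usd": without_cache,
        "with_cache_usd": with_cache,
        "savings_fraction": savings,
    }


if __name__ == "__main__":
    print(estimate_input_cost(cached_tokens=2500, fresh_tokens_per_request=300, requests=100))
```

With a 2,500-token cached prefix, 300 uncached tokens per request, and 100 requests, the estimated saving on input cost is roughly 79% rather than 90%, because the uncached part of every request is still billed at the full rate.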

Implementation: Caching System Prompts

from anthropic import AnthropicVertex
 
client = AnthropicVertex(project_id="your-project", region="asia-southeast1")
 
system_prompt = """You are a customer support agent for Acme Corp.
Follow these guidelines at all times.
 
[Product Catalog — 2,500 Products]
Product ID: P001 — SmartWatch Pro X
Price: $299
Specs: Heart rate monitor, GPS, 5ATM waterproof, 7-day battery
...
[This section may span thousands of tokens — exactly where caching shines]
 
[Support Policy]
1. Returns accepted within 30 days of purchase
2. Repair support available Monday–Friday, 9AM–6PM
3. Escalate urgent cases to senior support
...
"""
 
def ask(question: str):
    """Send one question, marking the system prompt as cacheable."""
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": question}]
    )
 
# First request: the system prompt is written to the cache
response1 = ask("Tell me about product P001")
print("Cache stats:", response1.usage)
# Expect cache_creation_input_tokens > 0 and cache_read_input_tokens == 0.
# A prompt shorter than the model's minimum cacheable length is processed
# without caching, and both counters stay at 0.
 
# Second request (sent while the cache entry is still alive): cache is hit
response2 = ask("What's the return policy?")
print("Cache stats (2nd request):", response2.usage)
# Expect cache_creation_input_tokens == 0 and cache_read_input_tokens > 0.
# → System prompt tokens are now served from cache!

Multi-turn Conversation with Caching

class CachedConversationManager:
    """Multi-turn conversation manager with prompt caching."""
 
    def __init__(self, client: AnthropicVertex, system_prompt: str):
        self.client = client
        self.system_prompt = system_prompt
        self.conversation_history = []
        self.total_cached_tokens = 0  # cumulative tokens read from the cache
 
    def chat(self, user_message: str) -> str:
        # Build the candidate history first so a failed call leaves state untouched
        messages = self.conversation_history + [
            {"role": "user", "content": user_message}
        ]
 
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=[
                {
                    "type": "text",
                    "text": self.system_prompt,
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            messages=messages
        )
 
        # Join all text blocks rather than assuming the first block is text
        assistant_message = "".join(
            block.text for block in response.content if block.type == "text"
        )
        self.conversation_history = messages + [
            {"role": "assistant", "content": assistant_message}
        ]
 
        # The counter may be None or 0 when nothing was read from the cache
        cache_read = response.usage.cache_read_input_tokens or 0
        if cache_read:
            self.total_cached_tokens += cache_read
            print(f"💰 Cache hit: {cache_read} tokens read at the discounted cache rate")
 
        return assistant_message
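
The manager above caches only the system prompt, so the growing conversation history is still billed at the full input rate on every turn. One common approach is to also place a cache breakpoint on the last content block of the most recent message, so that the next turn can reuse everything up to that point. The helper below is a minimal sketch of that idea; the name with_cache_breakpoint is hypothetical, and it assumes messages are plain dicts whose content is either a string or a list of dict blocks.

```python
import copy


def with_cache_breakpoint(messages: list) -> list:
    """Return a copy of messages with cache_control on the final content block.

    The input list is left unmodified, so old breakpoints never accumulate
    in the stored history (the API allows only a few breakpoints per request).
    """
    if not messages:
        return []

    marked = copy.deepcopy(messages)
    last = marked[-1]
    content = last["content"]

    # Normalize a plain string into the block form that accepts cache_control
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]

    if content:
        content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
    last["content"] = content
    return marked
```

To try it, wrap the list that is passed as the messages argument of the API call, for example messages=with_cache_breakpoint(history), and keep storing the unmarked history. Whether this pays off depends on how long your conversations run and how quickly turns follow each other.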

2. BigQuery Logging for Compliance and Monitoring

Architecture Overview

Enterprise AI deployments increasingly require detailed input/output logging for compliance, quality audits, and cost tracking. Vertex AI supports request/response logging directly to BigQuery — but only on regional endpoints (not the global endpoint).

Important: Logging requires region="asia-southeast1" or another regional endpoint. The global endpoint does not support this feature.

Custom BigQuery Logging Client

For more granular control, you can implement your own logging layer directly:

from anthropic import AnthropicVertex
from google.cloud import bigquery
from datetime import datetime, timezone
import json
import uuid
 
class LoggedClaudeClient:
    def __init__(self, project_id: str, region: str, bq_dataset: str, bq_table: str):
        self.client = AnthropicVertex(project_id=project_id, region=region)
        self.bq_client = bigquery.Client(project=project_id)
        self.table_ref = f"{project_id}.{bq_dataset}.{bq_table}"
 
    def create_message(self, messages, model="claude-sonnet-4-6",
                       max_tokens=1024, user_id=None, session_id=None, **kwargs):
        # Fields shared by the success and error rows
        row = {
            "request_id": str(uuid.uuid4()),
            "request_time": datetime.now(timezone.utc).isoformat(),
            "model": model, "user_id": user_id, "session_id": session_id,
            # default=str keeps logging from failing on non-JSON content blocks
            "messages_json": json.dumps(messages, default=str),
        }
 
        try:
            response = self.client.messages.create(
                model=model, max_tokens=max_tokens,
                messages=messages, **kwargs
            )
        except Exception as e:
            self._log({
                **row,
                "response_time": datetime.now(timezone.utc).isoformat(),
                "input_tokens": 0, "output_tokens": 0, "status": "error",
                "response_text": None, "error_message": str(e)
            })
            raise
 
        # Join text blocks; the first block is not guaranteed to be text
        response_text = "".join(
            b.text for b in response.content if b.type == "text"
        )
        self._log({
            **row,
            "response_time": datetime.now(timezone.utc).isoformat(),
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "status": "success",
            "response_text": response_text,
            "error_message": None
        })
        return response
 
    def _log(self, row: dict):
        # Logging failures are reported but never block the API response
        try:
            errors = self.bq_client.insert_rows_json(self.table_ref, [row])
            if errors:
                print(f"BigQuery write error: {errors}")
        except Exception as e:
            print(f"BigQuery logging failed (non-blocking): {e}")
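
The client above writes to a table that must already exist with matching columns. As a starting point, the schema can be sketched as below. The column names come from the rows logged above, but the types and modes are assumptions, as are the names LOG_SCHEMA, schema_json, and missing_fields. The output uses the name/type/mode JSON layout that BigQuery schema files use.

```python
import json

# Assumed column types for the log rows written by the logging client
LOG_SCHEMA = [
    {"name": "request_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "request_time", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "response_time", "type": "TIMESTAMP", "mode": "NULLABLE"},
    {"name": "model", "type": "STRING", "mode": "REQUIRED"},
    {"name": "user_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "session_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "input_tokens", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "output_tokens", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "status", "type": "STRING", "mode": "REQUIRED"},
    {"name": "messages_json", "type": "STRING", "mode": "NULLABLE"},
    {"name": "response_text", "type": "STRING", "mode": "NULLABLE"},
    {"name": "error_message", "type": "STRING", "mode": "NULLABLE"},
]


def schema_json() -> str:
    """Serialize the schema so it can be saved as a schema file."""
    return json.dumps(LOG_SCHEMA, indent=2)


def missing_fields(row: dict) -> set:
    """Return schema columns that a log row does not provide."""
    return {field["name"] for field in LOG_SCHEMA} - set(row)


if __name__ == "__main__":
    print(schema_json())
```

Running missing_fields against a sample row in a unit test is a cheap way to notice when the logging code and the table definition drift apart.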

BigQuery Analysis Queries

-- Daily cost breakdown by model
-- The multipliers 3 and 15 are USD per million input and output tokens for a
-- single price tier. Because the query groups by model, replace them with
-- per-model rates before relying on the totals.
SELECT
  DATE(request_time) as date,
  model,
  COUNT(*) as request_count,
  SUM(input_tokens) as total_input_tokens,
  SUM(output_tokens) as total_output_tokens,
  ROUND(SUM(input_tokens) * 3 / 1000000, 2) as input_cost_usd,
  ROUND(SUM(output_tokens) * 15 / 1000000, 2) as output_cost_usd
FROM `your-project.claude_logs.messages`
WHERE DATE(request_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY date, model
ORDER BY date DESC;

3. Multimodal Processing: Images, PDFs, and Documents

Image Analysis

Claude can understand images directly. Here's how to implement it on Vertex AI:

import base64
from pathlib import Path
from anthropic import AnthropicVertex
 
client = AnthropicVertex(project_id="your-project", region="global")
 
# File extensions mapped to the image media types Claude accepts
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}
 
def analyze_image(image_path: str, question: str) -> str:
    # Match the extension case-insensitively and fail early on unknown types
    suffix = Path(image_path).suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image type: {image_path}")
    media_type = MEDIA_TYPES[suffix]
 
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
 
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data
                    }
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text
 
# Example: analyze a sales chart
result = analyze_image(
    "monthly_report.png",
    "Identify the key trends in this sales chart and highlight areas needing attention."
)
print(result)

Parallel Image Batch Processing

from concurrent.futures import ThreadPoolExecutor, as_completed
 
def batch_analyze_images(client, tasks, max_workers=5):
    """Process multiple images in parallel.
 
    Each task is a dict with "id", "image_path", and "question" keys.
    Results are returned in completion order, not input order. The client
    argument is currently unused because analyze_image relies on the
    module-level client.
    """
    results = []
 
    def process_single(task):
        # Capture per-task failures so one bad image does not abort the batch
        try:
            result = analyze_image(task["image_path"], task["question"])
            return {"id": task["id"], "result": result, "status": "success"}
        except Exception as e:
            return {"id": task["id"], "error": str(e), "status": "error"}
 
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single, t) for t in tasks]
        for future in as_completed(futures):
            results.append(future.result())
 
    return results

4. Tool Calling and Agent Design

Building a Tool-Using Agent

Claude's tool calling (function calling) lets you connect it to external APIs, databases, and internal systems.

tools = [
    {
        "name": "get_product_info",
        "description": "Retrieve product details (price, stock, specs) by product ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string", "description": "Product ID (e.g., P001)"}
            },
            "required": ["product_id"]
        }
    },
    {
        "name": "check_inventory",
        "description": "Check current stock count for a product",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string"},
                "warehouse": {
                    "type": "string",
                    "enum": ["us-east", "us-west", "eu-central"]
                }
            },
            "required": ["product_id"]
        }
    },
    {
        "name": "create_order",
        "description": "Create a purchase order. Always verify stock first.",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string"},
                "quantity": {"type": "integer"},
                "customer_id": {"type": "string"}
            },
            "required": ["product_id", "quantity", "customer_id"]
        }
    }
]
 
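import json
 
def run_agent(user_message: str, max_turns: int = 10) -> str:
    """Execute the agent loop, stopping after max_turns model calls."""
    messages = [{"role": "user", "content": user_message}]
 
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=tools,
            messages=messages
        )
 
        if response.stop_reason != "tool_use":
            # end_turn, max_tokens, and so on: return whatever text was produced
            return "".join(
                b.text for b in response.content if b.type == "text"
            )
 
        messages.append({"role": "assistant", "content": response.content})
 
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                # execute_tool is your own dispatcher from tool name to function
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    # Non-string results are serialized before being sent back
                    "content": result if isinstance(result, str) else json.dumps(result)
                })
 
        messages.append({"role": "user", "content": tool_results})
 
    raise RuntimeError(f"Agent did not finish within {max_turns} turns")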
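
The agent loop calls execute_tool, which this guide does not define. A minimal sketch of such a dispatcher might look like the following. Everything here is illustrative: the handler functions, the in-memory data (including the stock figure), and the TOOL_HANDLERS table are placeholders standing in for your real product, inventory, and order systems.

```python
import json

# Placeholder in-memory data standing in for real backends (assumption)
_CATALOG = {"P001": {"name": "SmartWatch Pro X", "price_usd": 299}}
_STOCK = {("P001", "us-east"): 12}


def get_product_info(product_id: str) -> dict:
    return _CATALOG[product_id]  # raises KeyError for unknown products


def check_inventory(product_id: str, warehouse: str = "us-east") -> dict:
    stock = _STOCK.get((product_id, warehouse), 0)
    return {"product_id": product_id, "warehouse": warehouse, "stock": stock}


def create_order(product_id: str, quantity: int, customer_id: str) -> dict:
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return {"status": "created", "product_id": product_id,
            "quantity": quantity, "customer_id": customer_id}


# Map tool names from the tool definitions to local handler functions
TOOL_HANDLERS = {
    "get_product_info": get_product_info,
    "check_inventory": check_inventory,
    "create_order": create_order,
}


def execute_tool(name: str, tool_input: dict) -> str:
    """Run one tool call and always return a JSON string for the tool_result."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        return json.dumps(handler(**tool_input))
    except Exception as exc:
        # Report the failure to the model instead of crashing the agent loop
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})
```

Returning errors as data lets the model see what went wrong and decide whether to retry with different arguments or explain the problem to the user. For side-effecting tools such as order creation, you would likely want validation and authorization checks in the handler rather than relying on the model to verify stock first.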

5. RAG Integration with Vertex AI Search

from anthropic import AnthropicVertex
from google.cloud import discoveryengine_v1alpha as discoveryengine
 
class RAGPipeline:
    def __init__(self, claude_project, search_project, data_store_id):
        self.claude = AnthropicVertex(project_id=claude_project, region="asia-southeast1")
        self.search = discoveryengine.SearchServiceClient()
        self.search_project = search_project
        self.data_store_id = data_store_id
 
    def answer(self, question: str, top_k: int = 5) -> dict:
        # 1. Retrieve relevant documents.
        # _search is your own wrapper around self.search; it is expected to
        # return a list of dicts with "title", "snippet", and "link" keys.
        docs = self._search(question, top_k)
        if not docs:
            return {"answer": "No relevant documents found.", "sources": []}
 
        # 2. Build context
        context = "\n\n".join([
            f"[{d['title']}]\n{d['snippet']}" for d in docs
        ])
 
        # 3. Generate answer with Claude
        response = self.claude.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system="Answer questions accurately based on the provided context. "
                   "If the context doesn't contain the answer, say so clearly.",
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }]
        )
 
        return {
            "answer": response.content[0].text,
            "sources": [{"title": d["title"], "link": d["link"]} for d in docs]
        }
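
The context-building step in the pipeline above joins every retrieved snippet, which can produce very large prompts when snippets are long. One possible refinement, sketched below, numbers the snippets and stops adding them once a size budget is reached. The function name build_context and the max_chars budget are hypothetical, and a character count is only a rough stand-in for tokens.

```python
def build_context(docs: list, max_chars: int = 8000) -> tuple:
    """Join numbered snippets until max_chars is reached.

    Returns the context string and the list of documents actually used,
    so the caller can cite only the sources that reached the model.
    """
    parts, used, total = [], [], 0
    for index, doc in enumerate(docs, start=1):
        block = f"[{index}] {doc['title']}\n{doc['snippet']}"
        separator = 2 if parts else 0  # length of the "\n\n" joiner
        if total + separator + len(block) > max_chars:
            break
        parts.append(block)
        used.append(doc)
        total += separator + len(block)
    return "\n\n".join(parts), used
```

Because results are assumed to arrive ranked by relevance, truncating from the end drops the weakest matches first. Numbering the snippets also lets you ask the model to cite sources by number.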

6. Production Reliability Patterns

Circuit Breaker

from enum import Enum
from datetime import datetime, timedelta
import threading
 
class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"
 
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""
 
class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
        self._lock = threading.Lock()
 
    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == CircuitState.OPEN:
                elapsed = datetime.now() - self.last_failure_time
                if elapsed > timedelta(seconds=self.recovery_timeout):
                    # Recovery window has passed: allow trial calls through
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    raise CircuitOpenError("Circuit breaker is OPEN. Retry later.")
 
        try:
            result = func(*args, **kwargs)
        except Exception:
            with self._lock:
                self.failure_count += 1
                self.last_failure_time = datetime.now()
                # A failed trial call, or too many failures while closed, opens the circuit
                if (self.state == CircuitState.HALF_OPEN
                        or self.failure_count >= self.failure_threshold):
                    self.state = CircuitState.OPEN
            raise
 
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                # Close again only after enough consecutive successful trial calls
                self.half_open_calls += 1
                if self.half_open_calls >= self.half_open_max_calls:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0
        return result
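
A circuit breaker stops calls during a sustained outage, while retry logic handles short transient failures such as rate limits. The sketch below shows one common form, exponential backoff with full jitter. The function name and parameters are hypothetical, and the injectable sleep argument exists only to make the helper easy to test. Note that the Anthropic SDK can also retry some errors on its own through its max_retries client option, so check that the two layers do not multiply each other.

```python
import random
import time


def retry_with_backoff(func, *, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is exhausted.

    Waits a random time between 0 and an exponentially growing cap
    (base_delay, 2x, 4x, ... limited by max_delay) before each retry.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

In practice you would narrow retry_on to errors that are worth retrying, for example the SDK's rate-limit and connection error classes, and wrap the call as retry_with_backoff(lambda: breaker.call(make_request)), where breaker and make_request are your own objects. Errors such as invalid requests should fail immediately rather than be retried.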

Token Budget Manager

import threading
from datetime import datetime
 
class TokenBudgetManager:
    """Per-user daily token budget enforcement (in-memory, single process)."""
 
    def __init__(self, daily_budget_per_user: int = 100_000):
        self.daily_budget = daily_budget_per_user
        self.usage = {}  # user_id -> {iso_date: tokens_used}
        self._lock = threading.Lock()
 
    def check_and_deduct(self, user_id: str, estimated_tokens: int):
        today = datetime.now().date().isoformat()
        # The lock makes the check-then-update step atomic across threads
        with self._lock:
            current = self.usage.get(user_id, {}).get(today, 0)
            if current + estimated_tokens > self.daily_budget:
                remaining = self.daily_budget - current
                raise Exception(
                    f"Daily token budget exceeded. Remaining: {remaining:,} tokens."
                )
            # Keep only today's counter so old days do not accumulate in memory
            self.usage[user_id] = {today: current + estimated_tokens}

Enterprise Deployment Roadmap

A phased approach reduces risk and delivers early wins:

Phase 1 (Weeks 1–2): Foundation

Enable Vertex AI API and design IAM policies
Enable models in Model Garden
Implement basic API calls with error handling

Phase 2 (Month 1): Cost Optimization

Introduce prompt caching for your highest-volume use cases
Implement token budget management
Set up BigQuery cost dashboards

Phase 3 (Month 2): Advanced Features

Multimodal processing (image and document analysis)
Tool calling and agent orchestration
RAG with internal knowledge bases

Phase 4 (Ongoing): Scale and Reliability

Circuit breakers and reliability patterns
Evaluate Provisioned Throughput for peak loads
Model version management and A/B testing

By combining Google Cloud's enterprise infrastructure with Claude's advanced reasoning capabilities, your team can build production-grade AI applications with the security, compliance, and cost controls your organization requires.
