CLAUDE LAB
API & SDK / 2026-04-07 / Advanced

Vertex AI × Claude Enterprise Integration Guide: Prompt Caching, Multimodal, and Agent Design

A practical guide to enterprise-grade Claude integrations on Google Cloud Vertex AI. Covers prompt caching, BigQuery logging, multimodal processing, agent design, RAG, and production-ready patterns.

Tags: Vertex AI, Google Cloud, Enterprise, Prompt Caching, Multi-agent, BigQuery


Why You Need an Enterprise Architecture

Getting Claude running on Vertex AI is straightforward, as covered in the setup guide. Running a stable, cost-efficient production service at scale takes considerably more than a basic API call.

This guide goes deep into the following areas, with working code for each:

  • Prompt caching: Dramatically reduce costs by reusing repeated context
  • BigQuery logging: Compliance, quality monitoring, and cost analysis
  • Multimodal processing: Handling images, PDFs, and complex document inputs
  • Agent design: Tool calling and multi-agent orchestration
  • RAG integration: Connecting Claude to your internal knowledge base
  • Production patterns: Retry logic, circuit breakers, and cost controls

1. Prompt Caching: Cut API Costs by Up to 90%

How Prompt Caching Works

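Claude supports prompt caching: you mark a long system prompt or context block with cache_control, the first request writes that prefix to the cache, and later requests that send the identical prefix read it back instead of reprocessing it. Cache reads are billed at roughly 10% of the standard input token price, which is where the "up to 90%" figure comes from. The first request costs slightly more than normal because cache writes carry a premium (about 25% for the default 5-minute cache), and an entry expires if it is not reused within its lifetime, so the savings depend on the same context being reused often.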
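
To make the pricing concrete, here is a minimal sketch of the arithmetic. The function name, the default multipliers (1.25x for a cache write, 0.10x for a cache read), and the price used in the example are assumptions for illustration; check the current price list for your model before relying on the numbers. It also assumes every request after the first arrives while the cache entry is still alive.

```python
def estimate_input_cost(
    cached_tokens: int,
    uncached_tokens: int,
    requests: int,
    base_price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed premium for writing the cache
    read_multiplier: float = 0.10,   # assumed discount for reading the cache
) -> dict:
    """Compare input-token cost with and without caching for a shared prefix."""
    per_token = base_price_per_mtok / 1_000_000

    # Without caching, every request pays full price for the whole prompt.
    without_cache = requests * (cached_tokens + uncached_tokens) * per_token

    # With caching, the first request writes the prefix and the rest read it.
    write_cost = cached_tokens * per_token * write_multiplier
    read_cost = (requests - 1) * cached_tokens * per_token * read_multiplier
    uncached_cost = requests * uncached_tokens * per_token
    with_cache = write_cost + read_cost + uncached_cost

    return {
        "without_cache": without_cache,
        "with_cache": with_cache,
        "savings_pct": 100 * (1 - with_cache / without_cache),
    }


if __name__ == "__main__":
    # Hypothetical workload: 2,500 cached tokens, 300 fresh tokens per request,
    # 100 requests, and a placeholder price of 3.00 per million input tokens.
    print(estimate_input_cost(2500, 300, 100, 3.0))
```

With these placeholder inputs the saving on input tokens comes out at about 79%, not 90%, because the uncached part of each request and the one-time write premium are still billed in full. A single request with no reuse ends up slightly more expensive than not caching at all.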

Prompt caching is especially valuable when:

  • Your system prompt runs to thousands of tokens (persona definitions, rules, knowledge)
  • Multiple questions are asked against the same document in a RAG system
  • A code assistant repeatedly references the same large codebase
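
The RAG case in the list above can be sketched as a small request builder. This is an illustration under assumptions: build_document_qa_request and the instruction text are hypothetical, and the returned dictionary is meant to be unpacked into a messages.create call alongside model and max_tokens. The point it shows is ordering: the shared document sits in a cached prefix, and only the question changes between requests.

```python
def build_document_qa_request(document_text: str, question: str) -> dict:
    """Build keyword arguments that place a shared document in a cached prefix."""
    return {
        "system": [
            {"type": "text", "text": "Answer using only the provided document."},
            {
                "type": "text",
                "text": document_text,
                # Cache breakpoint: the prefix up to and including this block
                # is what gets written to, and later read from, the cache.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # The varying part comes after the breakpoint so it never invalidates
        # the cached prefix.
        "messages": [{"role": "user", "content": question}],
    }


if __name__ == "__main__":
    doc = "Returns are accepted within 30 days of purchase."
    first = build_document_qa_request(doc, "What is the return window?")
    second = build_document_qa_request(doc, "Can I return after 45 days?")
    # Identical prefixes are what make the second request a cache hit.
    print(first["system"] == second["system"])
```

Usage would look like client.messages.create(model=..., max_tokens=..., **build_document_qa_request(doc, question)). Any change to the document text, even whitespace, produces a different prefix and therefore a cache miss.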

Implementation: Caching System Prompts

from anthropic import AnthropicVertex
 
client = AnthropicVertex(project_id="your-project", region="asia-southeast1")
 
system_prompt = """You are a customer support agent for Acme Corp.
Follow these guidelines at all times.
 
[Product Catalog — 2,500 Products]
Product ID: P001 — SmartWatch Pro X
Price: $299
Specs: Heart rate monitor, GPS, 5ATM waterproof, 7-day battery
...
[This section may span thousands of tokens — exactly where caching shines]
 
[Support Policy]
1. Returns accepted within 30 days of purchase
2. Repair support available Monday–Friday, 9AM–6PM
3. Escalate urgent cases to senior support
...
"""
 
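# First request: the cache is created.
# Prompt caching is generally available, so cache_control works on the standard
# messages endpoint; the beta namespace and beta flag are not required.
response1 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,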
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Tell me about product P001"}]
)
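print("Cache stats:", response1.usage)
# Illustrative first-call usage: cache_creation_input_tokens ≈ 2500, cache_read_input_tokens = 0.
# input_tokens counts only the uncached tokens after the cache breakpoint (here, the user message).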
 
# Second request: the cache is hit (same system prompt, sent within the cache lifetime)
response2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}]
)
print("Cache stats (2nd request):", response2.usage)
# Illustrative second-call usage: cache_creation_input_tokens = 0, cache_read_input_tokens ≈ 2500.
# The system prompt tokens are now served from the cache.
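
Reading raw usage objects gets tedious once there are many requests. The helper below is a small sketch for summarizing them; cache_summary is a hypothetical name, and the SimpleNamespace object only stands in for a real response.usage so the example runs without an API call. It relies on the three usage fields shown above, which together add up to the total input tokens for a request.

```python
from types import SimpleNamespace


def cache_summary(usage) -> dict:
    """Summarize how much of a request's input was served from the cache."""
    # Fields may be missing or None on some responses, so default to 0.
    read = getattr(usage, "cache_read_input_tokens", None) or 0
    written = getattr(usage, "cache_creation_input_tokens", None) or 0
    uncached = getattr(usage, "input_tokens", None) or 0

    total = read + written + uncached
    return {
        "total_input_tokens": total,
        "cache_read": read,
        "cache_written": written,
        "hit_ratio": read / total if total else 0.0,
    }


if __name__ == "__main__":
    # Stand-in for response2.usage, using the illustrative numbers from above.
    fake_usage = SimpleNamespace(
        input_tokens=300,
        cache_creation_input_tokens=0,
        cache_read_input_tokens=2500,
    )
    print(cache_summary(fake_usage))
```

A hit_ratio that stays near zero across repeated requests usually means the prefix is changing between calls or the requests are spaced further apart than the cache lifetime.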

Multi-turn Conversation with Caching

from anthropic import AnthropicVertex


class CachedConversationManager:
    """Multi-turn conversation manager that caches the system prompt."""

    def __init__(self, client: AnthropicVertex, system_prompt: str):
        self.client = client
        self.system_prompt = system_prompt
        self.conversation_history = []
        # Running total of tokens read from the cache (a token count, not a hit count).
        self.total_cache_hits = 0

    def chat(self, user_message: str) -> str:
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })

        # Prompt caching is generally available, so the standard messages
        # endpoint accepts cache_control without a beta flag.
        response = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=[
                {
                    "type": "text",
                    "text": self.system_prompt,
                    # Only the system prompt is cached; the history is resent at full price.
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            messages=self.conversation_history
        )

        # Assumes a plain text reply (no tool use), so the first block is text.
        assistant_message = response.content[0].text
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })

        # cache_read_input_tokens can be None or 0 when nothing was read from the cache.
        cache_read = response.usage.cache_read_input_tokens
        if cache_read:
            self.total_cache_hits += cache_read
            print(f"💰 Cache hit: {cache_read} tokens saved at ~90% discount")

        return assistant_message
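
The manager above caches only the system prompt, so a long conversation still resends its whole history at full price on every turn. One common extension is to add a second cache breakpoint on the most recent message so that earlier turns are read from the cache as well. The sketch below is an assumption about how that could be wired in, not part of the class above: with_history_breakpoint is a hypothetical helper whose result would be passed as messages in place of the raw history.

```python
import copy


def with_history_breakpoint(history: list[dict]) -> list[dict]:
    """Return a copy of the history with a cache breakpoint on the last message."""
    # Work on a copy so the stored history never accumulates old breakpoints.
    messages = copy.deepcopy(history)
    if not messages:
        return messages

    last = messages[-1]
    content = last["content"]
    if isinstance(content, str):
        # Expand the shorthand string form into an explicit text block,
        # because cache_control is attached to content blocks.
        content = [{"type": "text", "text": content}]
    if content:
        content[-1]["cache_control"] = {"type": "ephemeral"}
    last["content"] = content
    return messages


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Tell me about product P001"},
        {"role": "assistant", "content": "P001 is the SmartWatch Pro X."},
        {"role": "user", "content": "What's the return policy?"},
    ]
    for message in with_history_breakpoint(history):
        print(message)
```

Because the breakpoint is applied to a fresh copy on each turn, a request carries two breakpoints at most (system prompt plus latest message), which stays within the limit of four cache breakpoints per request.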
