API & SDK / 2026-06-13 / Advanced

Claude API Python Advanced Cookbook: 20 Production Patterns You'll Actually Use

20 battle-tested Python patterns for the Claude API—retry logic, parallel processing, cost optimization, testing, and monitoring. Copy-paste ready code recipes.

The first night I shipped a script that ran flawlessly on my laptop, a cascade of 429s stalled the whole pipeline at 2 AM. The culprit wasn't the rate limit itself—it was my own design, which had no retry logic. Run the Claude API in production with Python and you'll meet at least one of these "invisible-on-localhost" walls.

This cookbook is the set of patterns I've written down each time I hit one of those walls, organized into twenty recipes. It's less something to read front-to-back and more something to copy from when you need it.

One thing worth saying up front: you don't need all twenty from day one. Production resilience isn't a "kitchen sink"; it's built by defusing the specific landmines your app actually steps on, in order. As an indie developer running several services, I didn't start with all of them either; I added each one after an incident forced my hand. To make that mapping clear, I've grouped the patterns into five clusters.

Patterns 1–4: Resilient API Foundation

Pattern 1: Exponential Backoff with Jitter

Distinguishes between rate limits (429) and server errors (5xx), backing off appropriately for each.

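import anthropic
import time
import random
from typing import TypeVar, Callable

T = TypeVar('T')

def _backoff(attempt: int, base_delay: float, max_delay: float, jitter: bool) -> float:
    """Capped exponential delay, optionally scaled by a random factor in [0.5, 1.0)."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay * (0.5 + random.random() * 0.5) if jitter else delay

def with_retry(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    jitter: bool = True
) -> T:
    """Exponential backoff + jitter for API calls."""
    for attempt in range(max_attempts):
        try:
            return func()
        except anthropic.RateLimitError as e:
            # Must come before APIStatusError: RateLimitError is a subclass of it.
            if attempt == max_attempts - 1:
                raise
            # The server's hint arrives as a Retry-After header on the 429 response.
            header = e.response.headers.get("retry-after")
            try:
                delay = float(header) if header else None
            except ValueError:
                delay = None  # header was not a plain number of seconds
            if delay is None:
                # Jitter only the computed backoff, never the server's own hint.
                delay = _backoff(attempt, base_delay, max_delay, jitter)
            print(f"Rate limited. Retrying in {delay:.1f}s ({attempt+1}/{max_attempts})")
            time.sleep(delay)
        except anthropic.APIStatusError as e:
            # 4xx errors other than 429 will not succeed on retry, so re-raise them.
            if e.status_code < 500 or attempt == max_attempts - 1:
                raise
            time.sleep(_backoff(attempt, base_delay, max_delay, jitter))
    raise ValueError("max_attempts must be at least 1")

# The SDK retries on its own by default; max_retries=0 makes this wrapper the only retry layer.
client = anthropic.Anthropic(max_retries=0)
response = with_retry(
    lambda: client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello"}]
    )
)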
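
To get a feel for how long a caller can end up waiting, it helps to print the delay schedule on its own. The sketch below is a standalone illustration, not part of the recipe: `backoff_delays` is a hypothetical helper that reproduces the capped exponential delay and the 0.5 to 1.0 jitter factor used for rate-limit retries above, with a seedable random generator so the output is repeatable.

```python
import random

def backoff_delays(max_attempts=5, base_delay=1.0, max_delay=60.0, rng=None):
    """Return the sleep before each retry: capped exponential, scaled by jitter."""
    rng = rng or random.Random()
    delays = []
    # No sleep follows the final attempt, so there are max_attempts - 1 delays.
    for attempt in range(max_attempts - 1):
        capped = min(base_delay * (2 ** attempt), max_delay)
        # Jitter factor in [0.5, 1.0): spreads retries out without exceeding the cap.
        delays.append(capped * (0.5 + rng.random() * 0.5))
    return delays

delays = backoff_delays(rng=random.Random(0))
print([round(d, 2) for d in delays])
print("upper bound on total wait:", sum([1.0, 2.0, 4.0, 8.0]), "seconds")
```

With the defaults, the un-jittered delays are 1, 2, 4 and 8 seconds, so five attempts never sleep more than 15 seconds in total unless the server asks for a longer wait.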

Pattern 2: Circuit Breaker

When failures cascade, stop hitting the API entirely and wait for recovery.

import time
from enum import Enum
from threading import Lock
 
class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"
 
class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        success_threshold: int = 2
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0.0
        self._lock = Lock()
 
    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == CircuitState.OPEN:
                elapsed = time.time() - self.last_failure_time
                if elapsed >= self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise Exception(f"Circuit OPEN. Retry in {self.recovery_timeout - elapsed:.0f}s")
        try:
            result = func(*args, **kwargs)
            with self._lock:
                if self.state == CircuitState.HALF_OPEN:
                    self.success_count += 1
                    if self.success_count >= self.success_threshold:
                        self.state = CircuitState.CLOSED
                        self.failure_count = 0
                else:
                    self.failure_count = 0
            return result
        except Exception:
            with self._lock:
                self.failure_count += 1
                self.last_failure_time = time.time()
                if self.failure_count >= self.failure_threshold:
                    self.state = CircuitState.OPEN
            raise
 
claude_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30.0)

Pattern 3: Streaming with Timeout

Streaming can hang indefinitely on network issues. Always set a deadline.

import asyncio
import anthropic
 
async def stream_with_timeout(prompt: str, timeout_seconds: float = 30.0) -> str:
    client = anthropic.AsyncAnthropic()
    collected_text = []
 
    async def _stream():
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        ) as stream:
            async for text in stream.text_stream:
                collected_text.append(text)
        return "".join(collected_text)
 
    try:
        return await asyncio.wait_for(_stream(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        partial = "".join(collected_text)
        raise TimeoutError(
            f"Stream timed out after {timeout_seconds}s. "
            f"Got {len(partial)} chars before cutoff."
        )
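
A single overall deadline also cuts off a stream that is long but healthy. One complementary option is an idle timeout that only fires when no chunk arrives for a while. The following is a minimal sketch under stated assumptions: `iter_with_idle_timeout` and `fake_stream` are hypothetical names, and the fake stream stands in for a real text stream so the example runs offline. Whether a given SDK stream tolerates being cancelled mid-read is something to verify before relying on this.

```python
import asyncio

async def iter_with_idle_timeout(chunks, idle_timeout):
    """Yield items from an async iterator; raise TimeoutError if the gap between items is too long."""
    it = chunks.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so only silence triggers it.
            item = await asyncio.wait_for(it.__anext__(), timeout=idle_timeout)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError:
            raise TimeoutError(f"no data for {idle_timeout}s") from None
        yield item

async def fake_stream(stall_after):
    """Hypothetical stand-in for a text stream: emits a few chunks, then hangs."""
    for i in range(stall_after):
        yield f"chunk{i} "
    await asyncio.sleep(3600)  # simulate a stalled connection

async def demo():
    got = []
    try:
        async for text in iter_with_idle_timeout(fake_stream(3), idle_timeout=0.05):
            got.append(text)
    except TimeoutError:
        pass  # partial output stays available in `got`
    return got

collected = asyncio.run(demo())
print(collected)
```

The two guards combine naturally: keep the overall deadline as a hard ceiling and use the idle timeout to detect a dead connection early.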

Pattern 4: Composite Guard (Retry + Breaker + Timeout)

All three combined into a single production-ready wrapper.

import anthropic

# Reuses with_retry (Pattern 1) and CircuitBreaker (Pattern 2) defined above.
class RobustClaudeClient:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.breaker = CircuitBreaker()
    
    def complete(self, prompt: str, **kwargs) -> str:
        def _call():
            response = self.client.messages.create(
                model=kwargs.get("model", "claude-sonnet-4-6"),
                max_tokens=kwargs.get("max_tokens", 1024),
                messages=[{"role": "user", "content": prompt}],
                timeout=kwargs.get("timeout", 30.0)
            )
            return response.content[0].text
        
        return self.breaker.call(lambda: with_retry(_call, max_attempts=3))

These first four aren't alternatives—they stack. The circuit breaker is the one people skip, but hammering a temporarily degraded endpoint with retries only slows its recovery and burns your own rate budget. Giving up on what's down and quietly resuming a minute later does more for overall availability than any single retry tweak.
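
The effect of stacking is easy to see in a small offline simulation. The sketch below is an illustration only: `MiniBreaker` is a trimmed-down, hypothetical stand-in for the Pattern 2 breaker (no recovery timer), `retry` is a sleep-free retry loop, and `dead_endpoint` simulates an upstream that is fully down. It counts how many times the endpoint gets hit with and without a breaker wrapped around the retries.

```python
class MiniBreaker:
    """Trimmed-down breaker: opens after `threshold` consecutive failures."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, func):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open")  # fail fast, endpoint untouched
        try:
            result = func()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result

def retry(func, attempts=3):
    """Plain retry loop with no sleep, so the simulation runs instantly."""
    for i in range(attempts):
        try:
            return func()
        except ConnectionError:
            if i == attempts - 1:
                raise

def simulate(n_requests, use_breaker):
    hits = 0

    def dead_endpoint():
        nonlocal hits
        hits += 1
        raise ConnectionError("upstream down")

    breaker = MiniBreaker(threshold=3)
    for _ in range(n_requests):
        try:
            if use_breaker:
                breaker.call(lambda: retry(dead_endpoint))
            else:
                retry(dead_endpoint)
        except Exception:
            pass  # a real caller would log or surface this
    return hits

hits_retry_only = simulate(20, use_breaker=False)
hits_with_breaker = simulate(20, use_breaker=True)
print(hits_retry_only, hits_with_breaker)  # 60 9
```

Twenty requests with three attempts each produce 60 hits on the dead endpoint. With the breaker outside the retry loop, the first three requests fail (9 hits) and the remaining 17 are rejected locally.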

Patterns 5–8: Async and Parallel Processing

Pattern 5: Parallel Requests with Semaphore

Run multiple requests concurrently without blowing past rate limits.

import asyncio
import anthropic
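from typing import List, Union

async def parallel_completions(
    prompts: List[str],
    max_concurrent: int = 5,
    model: str = "claude-haiku-4-5-20251001"
) -> List[Union[str, BaseException]]:
    """Results come back in prompt order; a failed request yields its exception object instead of text."""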
    client = anthropic.AsyncAnthropic()
    semaphore = asyncio.Semaphore(max_concurrent)
 
    async def _process(prompt: str) -> str:
        async with semaphore:
            response = await client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text
 
    return await asyncio.gather(
        *[_process(p) for p in prompts],
        return_exceptions=True
    )

Pattern 6: Batching Queue

Buffer requests and flush them in parallel bursts, maximizing throughput.

import asyncio
from collections import deque
from dataclasses import dataclass
from typing import Optional
import anthropic
 
@dataclass
class BatchItem:
    prompt: str
    future: asyncio.Future
    model: str = "claude-haiku-4-5-20251001"
    max_tokens: int = 512
 
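class BatchQueue:
    def __init__(self, flush_interval=0.1, max_batch_size=10, max_concurrent=5):
        self.queue: deque[BatchItem] = deque()
        self.flush_interval = flush_interval
        self.max_batch_size = max_batch_size
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.client = anthropic.AsyncAnthropic()
        self._flush_task: Optional[asyncio.Task] = None

    def start(self) -> None:
        """Start the background flush loop; must be called from inside a running event loop."""
        if self._flush_task is None:
            self._flush_task = asyncio.create_task(self._flush_loop())

    async def submit(self, prompt: str, **kwargs) -> str:
        self.start()  # without the flush loop, the future below would never resolve
        future = asyncio.get_running_loop().create_future()
        self.queue.append(BatchItem(prompt=prompt, future=future, **kwargs))
        return await future

    async def _flush_loop(self):
        while True:
            await asyncio.sleep(self.flush_interval)
            batch = []
            while self.queue and len(batch) < self.max_batch_size:
                batch.append(self.queue.popleft())
            if batch:
                # The next batch is not collected until this one has finished.
                await asyncio.gather(*[self._process(item) for item in batch])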
 
    async def _process(self, item: BatchItem):
        async with self.semaphore:
            try:
                resp = await self.client.messages.create(
                    model=item.model,
                    max_tokens=item.max_tokens,
                    messages=[{"role": "user", "content": item.prompt}]
                )
                item.future.set_result(resp.content[0].text)
            except Exception as e:
                item.future.set_exception(e)

Pattern 7: Streaming with Stats

Stream output while tracking tokens and speed in real time.

import anthropic, time
 
def stream_with_progress(prompt: str, show_stats: bool = True):
    client = anthropic.Anthropic()
    full_text, start_time = [], time.time()
 
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_text.append(text)
        message = stream.get_final_message()
    
    if show_stats:
        usage = message.usage
        tps = usage.output_tokens / (time.time() - start_time)
        print(f"\n\n[in: {usage.input_tokens} / out: {usage.output_tokens} / {tps:.1f} tok/s]")
    
    return "".join(full_text)

Pattern 8: Auto-Compressing Conversation

Summarize old turns automatically when the context gets long.

from typing import List
import anthropic
 
class CompressibleConversation:
    def __init__(self, model="claude-sonnet-4-6", max_tokens_before_compression=80_000):
        self.client = anthropic.Anthropic()
        self.model = model
        self.max_tokens = max_tokens_before_compression
        self.messages: List[dict] = []
        self.compressed_summary = ""
 
    def _estimate_tokens(self) -> int:
        return sum(len(m["content"]) // 4 for m in self.messages)
 
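    def _compress(self):
        # The last 4 messages stay verbatim; with 4 or fewer there is nothing older to summarize.
        if len(self.messages) <= 4:
            return
        to_compress, recent = self.messages[:-4], self.messages[-4:]
        conv_text = "\n".join(f"{m['role']}: {m['content']}" for m in to_compress)
        if self.compressed_summary:
            # Carry the earlier summary forward so repeated compressions do not drop it.
            conv_text = f"Earlier summary: {self.compressed_summary}\n{conv_text}"
        resp = self.client.messages.create(
            model="claude-haiku-4-5-20251001", max_tokens=512,
            messages=[{"role": "user", "content": f"Summarize in 100 words:\n{conv_text}"}]
        )
        self.compressed_summary = resp.content[0].text
        self.messages = recent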
 
    def chat(self, user_message: str) -> str:
        if self._estimate_tokens() > self.max_tokens:
            self._compress()
        system = f"[Summary of prior conversation]: {self.compressed_summary}\n\n" if self.compressed_summary else ""
        self.messages.append({"role": "user", "content": user_message})
        response = self.client.messages.create(
            model=self.model, max_tokens=2048,
            system=system + "You are a helpful assistant.",
            messages=self.messages
        )
        reply = response.content[0].text
        self.messages.append({"role": "assistant", "content": reply})
        return reply

Parallelism and cost live next door to each other. Crank up concurrency to chase throughput and you hit rate limits sooner, trigger more retries, and end up slower. Tune the semaphore not by "bigger is faster" but by measuring and settling on the highest value that doesn't produce 429s—the long way round is usually the fast one here.
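
The "measure, then settle" step can be sketched as a small probe. Everything here is an assumption made for illustration: `FakeAPI` is a hypothetical endpoint that rejects calls once too many are in flight (a crude stand-in for a real rate limiter), and `highest_clean_level` simply walks a list of candidate semaphore sizes and keeps the largest one that produced no rejections. Against a real API you would count actual 429 responses over a longer window.

```python
import asyncio

class FakeRateLimited(Exception):
    """Stand-in for a 429 response."""

class FakeAPI:
    """Hypothetical endpoint that rejects calls once more than `limit` are in flight."""
    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0

    async def call(self):
        self.in_flight += 1
        try:
            if self.in_flight > self.limit:
                raise FakeRateLimited()
            await asyncio.sleep(0.01)  # pretend to do work
        finally:
            self.in_flight -= 1

async def count_rejections(api, concurrency, n_requests=40):
    """Fire n_requests through a semaphore of the given size; count rejections."""
    sem = asyncio.Semaphore(concurrency)

    async def one():
        async with sem:
            try:
                await api.call()
                return 0
            except FakeRateLimited:
                return 1

    return sum(await asyncio.gather(*[one() for _ in range(n_requests)]))

async def highest_clean_level(api, candidates):
    """Return the largest candidate concurrency that produced zero rejections."""
    best = None
    for level in sorted(candidates):
        if await count_rejections(api, level) == 0:
            best = level
        else:
            break  # higher levels will only be worse
    return best

best = asyncio.run(highest_clean_level(FakeAPI(limit=8), [2, 4, 8, 16, 32]))
print(best)  # 8
```

In production it is usually worth settling a step below the measured ceiling, since other traffic on the same key shares the budget.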

Patterns 9–12: Cost Optimization

Pattern 9: Intelligent Model Routing

Pick the cheapest model that can handle the task.

import re, anthropic
 
def route_model(prompt: str) -> str:
    tokens = len(prompt) // 4
    is_complex = any([
        tokens > 2000,
        bool(re.search(r'(math|proof|optimize|architect|design|analyze)', prompt, re.I)),
        bool(re.search(r'(code|implement|debug|refactor|algorithm)', prompt, re.I)),
        prompt.count('?') >= 3
    ])
    is_simple = tokens < 200 and not is_complex
    
    if is_simple:
        return "claude-haiku-4-5-20251001"   # cheapest tier, roughly a third of Sonnet's per-token price
    if is_complex:
        return "claude-opus-4-8"             # reserve the top tier for genuinely hard work
    return "claude-sonnet-4-6"

In practice, the routing logic stays more reliable when you resist over-engineering it. Every extra regex is another chance to fling an unexpected input at a pricier model. Three tiers were enough for a long time: drop only short, formulaic calls to Haiku, send the rest to Sonnet, and escalate genuinely hard work to Opus. Opus 4.8 is excellent but costs more per token, so it only pays off once you've narrowed it to a few percent of requests. You can also let Haiku classify the request itself; just remember that the extra call's latency and cost ride on top of the real work.
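
One way to check whether the top tier really is down to "a few percent" is to replay a sample of real prompts through the router and look at the split. The sketch below assumes nothing about the API: `tier_shares` works with any router function, and `simple_router` is a hypothetical length-only router included just to make the example runnable. The sample prompts are synthetic.

```python
from collections import Counter

def tier_shares(prompts, router):
    """Fraction of prompts that the router sends to each model."""
    counts = Counter(router(p) for p in prompts)
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}

def simple_router(prompt):
    """Hypothetical three-tier router: length only, no keyword regexes."""
    tokens = len(prompt) // 4  # same rough estimate as route_model
    if tokens < 200:
        return "haiku"
    if tokens > 2000:
        return "opus"
    return "sonnet"

# Synthetic sample: 70 short prompts, 27 medium, 3 very long.
sample = ["hi"] * 70 + ["x" * 2000] * 27 + ["y" * 12000] * 3
shares = tier_shares(sample, simple_router)
print(shares)
```

Running the same measurement with `route_model` on logged production prompts shows quickly whether a keyword such as "code" or "design" is sending more traffic to the expensive tier than intended.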

Pattern 10: Prompt Cache Maximization

Cache large system prompts and save up to 90% on the cached input tokens of repeated calls.

import anthropic
 
def create_cached_client(system_prompt: str):
    client = anthropic.Anthropic()
    cached_system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"}
    }]
    
    def complete(user_message: str, **kwargs) -> tuple[str, dict]:
        response = client.messages.create(
            model=kwargs.get("model", "claude-sonnet-4-6"),
            max_tokens=kwargs.get("max_tokens", 1024),
            system=cached_system,
            messages=[{"role": "user", "content": user_message}]
        )
        return response.content[0].text, {
            "cache_created": getattr(response.usage, "cache_creation_input_tokens", 0),
            "cache_read": getattr(response.usage, "cache_read_input_tokens", 0)
        }
    
    return complete
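
The "up to 90%" figure is easier to reason about with the arithmetic written out. This is a back-of-the-envelope sketch under stated assumptions: `input_cost_usd` is a hypothetical helper, the $3 per million tokens base price is the Sonnet input price used elsewhere in this article, and the multipliers (1.25x for a cache write, 0.10x for a cache read) are assumptions to check against current pricing. It also assumes the cache never expires between calls.

```python
def input_cost_usd(prefix_tokens, calls, base_per_mtok=3.0, cached=True,
                   write_mult=1.25, read_mult=0.10):
    """Input cost of sending the same prefix `calls` times, with or without caching."""
    if calls <= 0:
        return 0.0
    per_token = base_per_mtok / 1_000_000
    if not cached:
        return prefix_tokens * calls * per_token
    # First call writes the cache; every later call reads it.
    return prefix_tokens * per_token * (write_mult + read_mult * (calls - 1))

uncached = input_cost_usd(10_000, 100, cached=False)
cached = input_cost_usd(10_000, 100, cached=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, "
      f"saved {1 - cached / uncached:.1%}")
```

Under these assumptions a single call is slightly more expensive with caching than without, and the saving only appears from the second call onward, so the pattern pays off for prompts that are actually reused within the cache lifetime.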

Pattern 11: Pre-flight Token Estimation

Check cost before sending to prevent budget overruns.

import anthropic

def safe_complete(prompt: str, max_cost_usd: float = 0.05, **kwargs) -> str:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": prompt}]
    max_tokens = kwargs.get("max_tokens", 1024)

    # count_tokens takes the same model and messages as create(), but no max_tokens.
    estimate = client.messages.count_tokens(
        model="claude-sonnet-4-6", messages=messages
    )
    input_cost = estimate.input_tokens * 3 / 1_000_000  # $3/M for Sonnet input
    output_cost = max_tokens * 15 / 1_000_000           # worst case at $15/M for Sonnet output
    cost = input_cost + output_cost

    if cost > max_cost_usd:
        raise ValueError(
            f"Estimated cost ${cost:.4f} exceeds limit ${max_cost_usd:.4f} "
            f"({estimate.input_tokens} input tokens, up to {max_tokens} output tokens)"
        )

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=max_tokens,
        messages=messages
    )
    return response.content[0].text

Pattern 12: Batch API for Bulk Processing

50% cost reduction on non-urgent batch jobs (up to 24-hour turnaround).

import anthropic, time
from typing import List, Dict
 
def batch_process(prompts: List[str], poll_interval: int = 60) -> Dict[str, str]:
    client = anthropic.Anthropic()
    requests = [
        {"custom_id": f"item_{i}", "params": {
            "model": "claude-sonnet-4-6", "max_tokens": 512,
            "messages": [{"role": "user", "content": p}]
        }} for i, p in enumerate(prompts)
    ]
    
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch created: {batch.id} ({len(prompts)} items)")

    # Results are only available once the batch status reaches "ended".
    while batch.processing_status != "ended":
        time.sleep(poll_interval)
        batch = client.messages.batches.retrieve(batch.id)
        c = batch.request_counts
        print(f"Processing: {c.processing}/{c.processing + c.succeeded + c.errored}")

    results = {}
    for r in client.messages.batches.results(batch.id):
        if r.result.type == "succeeded":
            results[r.custom_id] = r.result.message.content[0].text
        elif r.result.type == "errored":
            results[r.custom_id] = f"ERROR: {r.result.error}"
        else:
            # Canceled or expired requests carry neither a message nor an error payload.
            results[r.custom_id] = f"ERROR: {r.result.type}"
    return results

The biggest cost win turned out to be unglamorous: making fewer calls you didn't need. Before reaching for caching or routing, check whether you're sending the same query twice, or firing something synchronously that the Batch API could handle later. That audit alone moves the invoice.
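
Catching the "same query twice" case can be as simple as a lookup keyed by a hash of the request. The sketch below is one possible shape, not a drop-in component: `DedupCache` is a hypothetical in-memory wrapper, and `fake_call` replaces the real API call so the example runs offline. It suits requests where reusing an earlier answer is acceptable; it has no expiry and no size limit.

```python
import hashlib
import json

class DedupCache:
    """In-memory cache keyed by a hash of the request; identical requests hit the API once."""
    def __init__(self, call):
        self._call = call  # any function taking (model, prompt) and returning text
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, prompt):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = self._call(model, prompt)
        return self._store[key]

def fake_call(model, prompt):
    """Stand-in for a real API call."""
    return f"[{model}] echo: {prompt}"

cache = DedupCache(fake_call)
for q in ["refund policy?", "shipping time?", "refund policy?"]:
    cache.complete("claude-haiku-4-5-20251001", q)
print(cache.hits, cache.misses)  # 1 2
```

The hit and miss counters double as the audit: a high hit rate on day one means the application was paying for duplicate calls.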

Patterns 13–15: Testing

Pattern 13: Mock Factory for Deterministic Tests

Stop flaky tests that depend on real LLM output.

from unittest.mock import MagicMock
 
def create_mock_response(text: str, model: str = "claude-sonnet-4-6") -> MagicMock:
    mock = MagicMock()
    mock.content = [MagicMock(text=text, type="text")]
    mock.usage = MagicMock(input_tokens=100, output_tokens=50)
    mock.model = model
    mock.stop_reason = "end_turn"
    return mock
 
class MockAnthropicClient:
    def __init__(self, responses: dict):
        self.responses = responses
        self.calls = []
 
    @property
    def messages(self):
        return self
 
    def create(self, **kwargs) -> MagicMock:
        prompt = kwargs["messages"][-1]["content"]
        self.calls.append({"prompt": prompt})
        for keyword, text in self.responses.items():
            if keyword in prompt:
                return create_mock_response(text)
        return create_mock_response("Default response")

Pattern 14: Golden File Tests

Catch quality regressions by comparing against saved "golden" outputs.

import json, hashlib
from pathlib import Path
 
class GoldenTester:
    def __init__(self, golden_dir="tests/golden"):
        self.golden_dir = Path(golden_dir)
        self.golden_dir.mkdir(parents=True, exist_ok=True)
 
    def assert_similar(self, prompt: str, actual: str, update=False, min_overlap=0.6):
        key = hashlib.md5(prompt.encode()).hexdigest()[:8]
        path = self.golden_dir / f"{key}.json"
        
        if update or not path.exists():
            path.write_text(json.dumps({"prompt": prompt[:100], "output": actual}, indent=2))
            return
        
        golden = json.loads(path.read_text())["output"]
        actual_words, golden_words = set(actual.split()), set(golden.split())
        overlap = len(actual_words & golden_words) / len(golden_words) if golden_words else 1.0
        
        assert overlap >= min_overlap, (
            f"Quality regression detected. Overlap: {overlap:.2%} (min: {min_overlap:.2%})\n"
            f"Golden: {golden[:80]}...\nActual: {actual[:80]}..."
        )

Pattern 15: Pydantic-Validated Structured Output

Type-safe extraction using tool use and Pydantic v2.

from pydantic import BaseModel, field_validator
from typing import Literal
import anthropic
 
class ClassifiedMessage(BaseModel):
    intent: Literal["question", "request", "complaint", "other"]
    sentiment: Literal["positive", "neutral", "negative"]
    priority: int
    summary: str
 
    @field_validator("priority")
    @classmethod
    def check_range(cls, v: int) -> int:
        # Raise instead of assert: asserts are stripped when Python runs with -O.
        if not 1 <= v <= 5:
            raise ValueError(f"Priority must be 1-5, got {v}")
        return v
 
def classify_message(text: str) -> ClassifiedMessage:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=256,
        tools=[{"name": "classify", "description": "Classify message",
                "input_schema": ClassifiedMessage.model_json_schema()}],
        tool_choice={"type": "tool", "name": "classify"},
        messages=[{"role": "user", "content": f"Classify: {text}"}]
    )
    tool_use = next(b for b in response.content if b.type == "tool_use")
    return ClassifiedMessage(**tool_use.input)

Patterns 16–20: Monitoring

Pattern 16: Structured Request Logger

Essential for debugging production issues after the fact.

import logging, time, json, uuid
from functools import wraps
 
logger = logging.getLogger("claude_api")
 
def logged_completion(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        req_id = str(uuid.uuid4())[:8]
        start = time.time()
        log_data = {"request_id": req_id, "model": kwargs.get("model", "unknown")}
        try:
            result = func(*args, **kwargs)
            log_data.update({
                "status": "success",
                "elapsed_ms": round((time.time() - start) * 1000),
                "input_tokens": getattr(result.usage, "input_tokens", 0),
                "output_tokens": getattr(result.usage, "output_tokens", 0)
            })
            logger.info(json.dumps(log_data))
            return result
        except Exception as e:
            log_data.update({"status": "error", "error": str(e)[:200]})
            logger.error(json.dumps(log_data))
            raise
    return wrapper

Patterns 17–20: Usage Tracking, Health Check, Cost Alerting, Graceful Shutdown

from collections import defaultdict
from datetime import date
import anthropic, time
 
# Pattern 17: Daily usage tracker
class UsageTracker:
    def __init__(self):
        self.daily = defaultdict(lambda: {"input": 0, "output": 0, "requests": 0, "errors": 0})
    
    def record(self, input_tokens: int, output_tokens: int, error: bool = False):
        d = self.daily[date.today()]
        d["input"] += input_tokens; d["output"] += output_tokens
        d["requests"] += 1
        if error: d["errors"] += 1
    
    def cost_today_usd(self) -> float:
        # Assumes every call is billed at Sonnet rates ($3/M input, $15/M output).
        d = self.daily[date.today()]
        return (d["input"] * 3 + d["output"] * 15) / 1_000_000
 
# Pattern 18: Health check
async def health_check() -> dict:
    client = anthropic.AsyncAnthropic()
    start = time.time()
    try:
        await client.messages.create(
            model="claude-haiku-4-5-20251001", max_tokens=5,
            messages=[{"role": "user", "content": "OK"}]
        )
        return {"status": "healthy", "latency_ms": round((time.time() - start) * 1000)}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}
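
The heading for this group also lists cost alerting and graceful shutdown, which the listing above does not cover. The following is one possible sketch of each, not the original implementation: `CostAlert` and `GracefulShutdown` are hypothetical names, the thresholds are arbitrary, and the jobs are short sleeps standing in for API calls.

```python
import asyncio

class CostAlert:
    """Calls `notify` once for each threshold the first time spend reaches it."""
    def __init__(self, thresholds_usd, notify):
        self.pending = sorted(thresholds_usd)
        self.notify = notify

    def check(self, spend_usd):
        while self.pending and spend_usd >= self.pending[0]:
            self.notify(self.pending.pop(0), spend_usd)

class GracefulShutdown:
    """Tracks in-flight tasks so shutdown can wait for them instead of dropping them."""
    def __init__(self):
        self.accepting = True
        self._tasks = set()

    def submit(self, coro):
        if not self.accepting:
            coro.close()  # avoid a "never awaited" warning
            raise RuntimeError("shutting down; not accepting new work")
        task = asyncio.ensure_future(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def drain(self, timeout=30.0):
        """Stop accepting work, wait up to `timeout` seconds, then cancel stragglers."""
        self.accepting = False
        if not self._tasks:
            return 0
        done, pending = await asyncio.wait(self._tasks, timeout=timeout)
        for task in pending:
            task.cancel()
        return len(pending)

async def demo():
    alerts = []
    alert = CostAlert([1.0, 5.0], lambda limit, spend: alerts.append(limit))
    for spend in (0.4, 1.2, 1.3, 6.0):
        alert.check(spend)

    shutdown = GracefulShutdown()
    finished = []

    async def job(i):
        await asyncio.sleep(0.01)  # stand-in for an API call
        finished.append(i)

    for i in range(3):
        shutdown.submit(job(i))
    cancelled = await shutdown.drain(timeout=1.0)
    return alerts, sorted(finished), cancelled

alerts, finished, cancelled = asyncio.run(demo())
print(alerts, finished, cancelled)  # [1.0, 5.0] [0, 1, 2] 0
```

In a real service, `alert.check` could be fed from `UsageTracker.cost_today_usd()` after each request, and `drain` could be triggered from a SIGTERM handler.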

Where to Start

Twenty patterns is a lot to face at once, but on day one of production these three do the heavy lifting: exponential backoff retry (Pattern 1), prompt caching (Pattern 10), and the structured request logger (Pattern 16). They prevent the surprises, hold down cost, and cut your investigation time when something does break.

Everything else is a solution to a specific problem—reach for it when you actually hit that problem, not before. Debt from premature optimization is its own kind of production incident.

For a concrete first step: wrap just the Pattern 16 logger around your existing client. After a single day of logs, the pattern your app truly needs—rate limits, cost, or error rate—shows up as a number. Let that number tell you which pattern to add next.
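
Turning a day of logs into those numbers takes only a few lines. The sketch below assumes the log lines are the JSON objects emitted by the Pattern 16 logger, prices everything at the Sonnet rates used earlier ($3/M input, $15/M output), and treats any error message containing "429" as a rate limit, which is an assumption about the error text. `summarize_logs` and the sample lines are hypothetical.

```python
import json
from collections import Counter

def summarize_logs(lines):
    """Aggregate JSON log lines shaped like the Pattern 16 logger's output."""
    stats = Counter()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines from other loggers
        stats["requests"] += 1
        if rec.get("status") == "error":
            stats["errors"] += 1
            # Assumption: rate-limit errors mention 429 in their message text.
            if "429" in rec.get("error", ""):
                stats["rate_limited"] += 1
        stats["input_tokens"] += rec.get("input_tokens", 0)
        stats["output_tokens"] += rec.get("output_tokens", 0)
    total = stats["requests"] or 1
    return {
        "requests": stats["requests"],
        "error_rate": stats["errors"] / total,
        "rate_limited": stats["rate_limited"],
        "est_cost_usd": (stats["input_tokens"] * 3 + stats["output_tokens"] * 15) / 1_000_000,
    }

sample = [
    '{"request_id": "a1", "model": "claude-sonnet-4-6", "status": "success", '
    '"elapsed_ms": 900, "input_tokens": 1000, "output_tokens": 200}',
    '{"request_id": "a2", "model": "claude-sonnet-4-6", "status": "error", '
    '"error": "Error code: 429 - rate_limit_error"}',
    'not json',
    '{"request_id": "a3", "model": "claude-sonnet-4-6", "status": "success", '
    '"elapsed_ms": 700, "input_tokens": 3000, "output_tokens": 400}',
]
report = summarize_logs(sample)
print(report)
```

A high `rate_limited` count points toward Patterns 1 and 5, a high estimated cost toward Patterns 9 to 12, and a high error rate with few 429s toward the circuit breaker and health check.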
