Claude.ai · 2026-04-03 · Intermediate

Claude Sonnet 4.6 — 1M Tokens, Computer Use & Extended Thinking in Production

Claude Sonnet 4.6 production guide: 1M tokens, Computer Use 72.5, Extended Thinking, Opus vs Sonnet cost comparison, and Prompt Caching optimization with code.


Why Developers Prefer Sonnet 4.6 Over Opus 4.5

On February 17, 2026, Anthropic launched Claude Sonnet 4.6. Developers who gained early access consistently reported preferring it over Opus 4.5, Anthropic's previous flagship model, for the majority of real-world tasks. That fits the model's design focus on the tasks that matter most in practice: coding, computer use, long-context reasoning, agentic planning, and knowledge work.

The signal was clear when Anthropic made Sonnet 4.6 the default model across claude.ai and Claude Cowork. This move effectively said: "For most of what you need to accomplish, Sonnet 4.6 is the right tool."

But what makes Sonnet 4.6 genuinely different, and how do you unlock its full potential in production systems? This guide answers those questions with technical depth, working code, and practical decision frameworks you can apply immediately.

Key Specifications and Performance Benchmarks

Context Window

Claude Sonnet 4.6 supports a 1,000,000-token (1M token) context window. To put this in perspective, that's approximately 750,000 words in English — equivalent to around 2,500 pages of text. This isn't just a headline number; it fundamentally changes how you can architect AI applications.

One important note: the long-context beta that lets Claude Sonnet 4.5 and Claude Sonnet 4 exceed their standard 200K window is being retired on April 30, 2026. After that date, requests to those models that exceed the standard window will return errors. Now is the time to migrate to Sonnet 4.6's native 1M support.

Computer Use Performance

Sonnet 4.6 scored 72.5 on the OSWorld-Verified benchmark for computer use. For context, Sonnet 3.7 scored 28.0 on a comparable benchmark roughly a year earlier. That is roughly a 2.6x improvement (72.5 / 28.0 ≈ 2.59) in one year, and it represents the difference between a curiosity and a genuinely useful automation tool.

A 72.5% success rate means that in roughly three out of four attempts, Sonnet 4.6 will correctly complete a computer interaction task. That level of reliability opens the door to real-world workflow automation at scale.

Extended Thinking

Sonnet 4.6 supports Extended Thinking, allowing the model to work through complex problems systematically before delivering its response. This dramatically improves accuracy on tasks involving multi-step reasoning, mathematical derivations, system design, and nuanced judgment calls.

Pricing and Rate Limits

Sonnet 4.6 maintains the same pricing as Sonnet 4.5:

Input tokens: $3 per 1M tokens
Output tokens: $15 per 1M tokens
Prompt Caching (read): $0.30 per 1M tokens (90% discount)
Prompt Caching (write): $3.75 per 1M tokens

Additionally, the Messages Batches API max_tokens cap has been raised to 300,000 for Sonnet 4.6, enabling longer outputs for long-form content, large code generation tasks, and structured data extraction at scale.


Opus 4.6 vs. Sonnet 4.6 — A Cost-Effectiveness Decision Framework

Why Developers Choose Sonnet 4.6

The preference for Sonnet 4.6 over Opus 4.5 stems from targeted improvements in the exact domains where developers spend most of their time. Meanwhile, Opus 4.6 — at roughly 5x the price (input: $15/1M, output: $75/1M) — retains advantages for tasks demanding the deepest reasoning, but those tasks are less common than most teams assume.

Model Selection Matrix

Use this framework to decide which model fits your use case:

Choose Sonnet 4.6 when:

Building coding assistants, reviewers, or debuggers (battle-tested as the Claude Code default)
Implementing Computer Use workflows (a 72.5 OSWorld-Verified score is sufficient for most automation)
Processing large documents at scale (1M context maximizes efficiency)
Running customer-facing chatbots or support systems (speed and cost matter)
Operating high-volume APIs where cost efficiency compounds daily
Building Cowork skills or scheduled automation tasks (the platform default)

Choose Opus 4.6 when:

Solving advanced mathematical proofs or deeply nested logical puzzles
Conducting research that requires extended hypothesis generation and verification
Making high-stakes professional judgments (legal, medical, financial) where every percentage of accuracy matters
Running long Extended Thinking sessions on genuinely novel problems

Cost Simulation

Assume a production application that processes 1 million input tokens and 1 million output tokens per day, billed over a 30-day month:

Opus 4.6: $15 (input) + $75 (output) = $90/day ≈ $2,700/month
Sonnet 4.6: $3 (input) + $15 (output) = $18/day ≈ $540/month
Monthly savings: $2,160 (~80% reduction)

Add Prompt Caching for frequently reused system prompts, and the gap widens further. For the vast majority of production workloads, Sonnet 4.6 + Prompt Caching is the optimal combination.
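The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the per-million-token prices quoted in this article; the `PRICES` table and the function names are illustrative, not part of any SDK.

```python
# Per-1M-token list prices (USD) as quoted in this article.
PRICES = {
    "claude-opus-4-6": {"input": 15.00, "output": 75.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
}


def daily_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one day's traffic on the given model."""
    price = PRICES[model]
    return (
        input_tokens / 1_000_000 * price["input"]
        + output_tokens / 1_000_000 * price["output"]
    )


def monthly_cost(model: str, input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Daily cost multiplied by the number of billing days."""
    return daily_cost(model, input_tokens, output_tokens) * days


def savings_ratio(cheaper: str, pricier: str, input_tokens: int, output_tokens: int) -> float:
    """Fraction saved by running the same traffic on the cheaper model."""
    high = daily_cost(pricier, input_tokens, output_tokens)
    low = daily_cost(cheaper, input_tokens, output_tokens)
    return (high - low) / high


if __name__ == "__main__":
    for name in PRICES:
        print(name, monthly_cost(name, 1_000_000, 1_000_000))
    print(savings_ratio("claude-sonnet-4-6", "claude-opus-4-6", 1_000_000, 1_000_000))
```

Plugging in your own daily input and output volumes is usually more informative than the symmetric 1M/1M example, because output tokens cost five times as much as input tokens at these rates.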

1M Token Context Window — Practical Applications

A 1M token context window isn't just about sending bigger prompts. It's an architectural shift that enables entirely new categories of applications.

Use Case 1: Full Codebase Analysis

import anthropic
 
client = anthropic.Anthropic()
 
def analyze_codebase(file_paths: list[str]) -> str:
    """
    Load an entire codebase into context for holistic analysis.
    Works well for repositories up to ~800K tokens (leave buffer).
    """
    files_content = ""
    for path in file_paths:
        with open(path, "r") as f:
            files_content += f"\n\n=== {path} ===\n{f.read()}"
 
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8192,
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this entire codebase and provide:
1. Architectural issues and improvement opportunities
2. Security vulnerabilities and risk assessment
3. Performance bottlenecks
4. Code quality observations
 
Codebase:
{files_content}
"""
            }
        ]
    )
    return message.content[0].text
 
# Expected output: Comprehensive architecture report with specific file/line references
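Before sending a whole repository, it helps to check that it actually fits under the ~800K-token buffer mentioned in the docstring. The sketch below is one way to do that, assuming the rough heuristic of about four characters per token (the same estimate used later in this article, not the real tokenizer). `estimate_tokens` and `select_files_within_budget` are hypothetical helper names; for exact counts, the API's token counting endpoint is the better source.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic only)."""
    return len(text) // 4 + 1


def select_files_within_budget(
    files: dict[str, str], budget_tokens: int = 800_000
) -> tuple[list[str], list[str], int]:
    """
    Greedily include files (in the order given) until the budget is reached.
    Returns (included_paths, skipped_paths, estimated_total_tokens).
    """
    included: list[str] = []
    skipped: list[str] = []
    total = 0
    for path, content in files.items():
        # Count the same header format that analyze_codebase prepends.
        cost = estimate_tokens(f"\n\n=== {path} ===\n{content}")
        if total + cost <= budget_tokens:
            included.append(path)
            total += cost
        else:
            skipped.append(path)
    return included, skipped, total


if __name__ == "__main__":
    demo = {"a.py": "x" * 400, "b.py": "y" * 4000, "c.py": "z" * 40}
    print(select_files_within_budget(demo, budget_tokens=150))
```

Files that do not fit are reported rather than silently dropped, so the caller can decide whether to split the analysis into several requests.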

Use Case 2: Long-Running Conversation State

Previously, long conversations hit token limits and required complex summarization strategies. With 1M tokens, you can keep hundreds of conversation turns in context without truncating or summarizing them:

import Anthropic from "@anthropic-ai/sdk";
 
interface Message {
  role: "user" | "assistant";
  content: string;
}
 
class PersistentSession {
  private history: Message[] = [];
  private client: Anthropic;
 
  constructor() {
    this.client = new Anthropic();
  }
 
  async chat(userMessage: string): Promise<string> {
    this.history.push({ role: "user", content: userMessage });
 
    const response = await this.client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      system:
        "You are a dedicated engineering assistant with full context of this project's development history.",
      messages: this.history, // Full history — no truncation needed up to ~900K tokens
    });
 
    const reply = response.content[0].type === "text"
      ? response.content[0].text
      : "";
    this.history.push({ role: "assistant", content: reply });
    return reply;
  }
 
  estimateTokensUsed(): number {
    // Rough heuristic (~4 characters per token), rounded up to a whole number
    return Math.ceil(
      this.history.reduce((sum, m) => sum + m.content.length / 4, 0)
    );
  }
}

Use Case 3: Multi-Document Structured Extraction

def extract_from_reports(documents: list[str]) -> str:
    """
    Extract structured data from multiple business reports in a single call.
    Much faster and cheaper than processing documents individually.
    """
    combined = "\n\n".join([
        f"=== Document {i+1} ===\n{doc}"
        for i, doc in enumerate(documents)
    ])
 
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16384,
        messages=[{
            "role": "user",
            "content": f"""Extract structured data from all documents below.
Return valid JSON with this schema for each document:
{{
  "documents": [
    {{
      "revenue": "string",
      "profit_margin": "string",
      "key_metrics": ["string"],
      "risk_factors": ["string"],
      "summary": "string"
    }}
  ]
}}
 
Documents:
{combined}"""
        }]
    )
    return response.content[0].text
# Expected output: Valid JSON with extracted fields from each document

For architectural patterns around large context usage, see our guide on Claude 200K Context Window Production Mastery, where the design principles apply equally to the 1M window.

Extended Thinking — Implementation and Activation Patterns

Extended Thinking gives Sonnet 4.6 the ability to reason through problems before committing to an answer. This is particularly powerful for engineering design decisions and complex analysis.

Basic Implementation

import anthropic
 
client = anthropic.Anthropic()
 
def think_deeply(problem: str, budget_tokens: int = 10000, output_tokens: int = 6000) -> dict:
    """
    Use Extended Thinking for problems that benefit from deep reasoning.
 
    budget_tokens: How many tokens the model can use for internal thinking.
                   Minimum 1,024; must be less than max_tokens. Higher = more thorough reasoning.
    output_tokens: Headroom reserved for the final answer on top of the thinking budget.
    """
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=budget_tokens + output_tokens,  # Thinking counts against max_tokens
        thinking={
            "type": "enabled",
            "budget_tokens": budget_tokens,
        },
        messages=[{"role": "user", "content": problem}],
    )
 
    result = {"thinking": None, "answer": None}
    for block in response.content:
        if block.type == "thinking":
            result["thinking"] = block.thinking  # Internal reasoning trace
        elif block.type == "text":
            result["answer"] = block.text  # Final response
 
    return result
 
# Example: System architecture decision with trade-off analysis
problem = """
We're migrating a Python Flask monolith (PostgreSQL, 100k daily users)
to microservices. Target: 99.99% SLA, horizontal scaling, team-independent deploys.
What's the optimal migration strategy, and which service should we extract first?
Justify your reasoning with specific risk and benefit analysis.
"""
 
result = think_deeply(problem, budget_tokens=15000)
print("Reasoning trace:", (result["thinking"] or "")[:300], "...")  # thinking may be None
print("\nFinal answer:", result["answer"])
# Expected output: Detailed migration strategy with prioritized service extraction plan and risk assessment

When to Use Extended Thinking

High-value scenarios:

Mathematical proofs and algorithm optimization
Complex system design with multiple interdependent trade-offs
Legal and compliance reasoning where precision is critical
Multi-variable optimization (database schema normalization, API versioning strategy)

Skip Extended Thinking for:

Standard Q&A and information retrieval
Simple code completion and syntax fixes
Data format transformation
Latency-sensitive API endpoints (Extended Thinking adds processing time)

For a detailed comparison of Extended Thinking between Sonnet 4.6 and Opus 4.6, see Claude Opus 4.6 Extended Thinking Production Patterns.

Computer Use — Production Implementation Guide

At 72.5 on OSWorld-Verified, Sonnet 4.6's computer use has crossed the threshold from "impressive demo" to "viable production automation."

Core Implementation

import anthropic
import base64
 
client = anthropic.Anthropic()
 
def execute_computer_task(task: str, screenshot_b64: str) -> dict:
    """
    Given a screenshot and task description, return the actions Sonnet 4.6
    recommends to accomplish the task.
    """
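    # Computer use is a beta tool: the request goes through client.beta.messages
    # with the beta flag that matches the tool version below. Check the current
    # docs for the tool version / flag pair your model supports.
    response = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        betas=["computer-use-2025-01-24"],
        tools=[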
            {
                "type": "computer_20250124",
                "name": "computer",
                "display_width_px": 1920,
                "display_height_px": 1080,
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": f"Look at the current screen and complete this task: {task}"
                    }
                ],
            }
        ],
    )
 
    actions = []
    for block in response.content:
        if block.type == "tool_use" and block.name == "computer":
            actions.append({
                "action": block.input.get("action"),
                "coordinate": block.input.get("coordinate"),
                "text": block.input.get("text"),
            })
 
    return {
        "actions": actions,
        "reasoning": next(
            (b.text for b in response.content if b.type == "text"), ""
        ),
    }
 
# Expected output: {"actions": [{"action": "click", "coordinate": [960, 540]}], "reasoning": "..."}

Production Safety Patterns

1. Always run in sandboxed environments

Never connect Computer Use directly to production systems. Use Docker containers or virtual machines as intermediaries to limit the blast radius of unexpected actions.

2. Implement human-in-the-loop for high-risk actions

HIGH_RISK_KEYWORDS = ["delete", "submit", "purchase", "send", "transfer"]
 
def is_high_risk(action: dict) -> bool:
    action_str = str(action).lower()
    return any(keyword in action_str for keyword in HIGH_RISK_KEYWORDS)
 
def safe_execute(action: dict) -> bool:
    if is_high_risk(action):
        # In production: send Slack notification, await approval webhook
        print(f"⚠️ High-risk action detected: {action}")
        approval = input("Approve? (yes/no): ")
        return approval.strip().lower() == "yes"
    return True

3. Log every screenshot and action

Maintain a full audit trail — before-state screenshot, recommended action, after-state screenshot — for debugging and compliance.
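A minimal sketch of such an audit trail, assuming screenshots are available as PNG bytes and that a local directory with a JSON Lines index is an acceptable store. `log_step`, the file naming scheme, and `audit.jsonl` are illustrative choices, not part of the Anthropic SDK.

```python
import json
import time
from pathlib import Path


def log_step(log_dir: str, step: int, action: dict,
             before_png: bytes, after_png: bytes) -> dict:
    """Persist before/after screenshots and append one audit record."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)

    before_path = directory / f"step-{step:04d}-before.png"
    after_path = directory / f"step-{step:04d}-after.png"
    before_path.write_bytes(before_png)
    after_path.write_bytes(after_png)

    entry = {
        "step": step,
        "timestamp": time.time(),
        "action": action,              # the action the model recommended
        "before": str(before_path),    # screen state before executing it
        "after": str(after_path),      # screen state after executing it
    }
    # One JSON object per line keeps the log appendable and easy to grep.
    with open(directory / "audit.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

In a real deployment the same record would more likely go to object storage and a log pipeline, but the shape of the entry (step, action, before, after) stays the same.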

For macOS-specific setup and configuration details, refer to Claude Computer Use macOS Complete Guide.

Cost Optimization: Prompt Caching × Batch API

Prompt Caching Implementation

System prompts and reference documents that appear in every request are ideal candidates for caching. Once cached, re-reading them costs just $0.30 per 1M tokens — a 90% reduction from the standard $3.

import anthropic
 
client = anthropic.Anthropic()
 
SYSTEM_CONTEXT = """
You are the AI assistant for Acme Corp.
[Include extensive company knowledge, guidelines, product docs here...]
The more text here, the greater the caching savings per request.
"""
 
def cached_query(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": SYSTEM_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # Enable caching
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )
 
    # Monitor cache efficiency
    usage = response.usage
    # These fields can be missing or None depending on SDK version, so default to 0
    cached_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
    new_tokens = getattr(usage, "cache_creation_input_tokens", 0) or 0
    print(f"Input: {usage.input_tokens} | Cached reads: {cached_tokens} | New writes: {new_tokens}")
 
    return response.content[0].text
 
# First call: Cache write at $3.75/1M tokens
first = cached_query("What are our Q1 priorities?")
 
# Subsequent calls: Cache read at $0.30/1M tokens — 90% cheaper
second = cached_query("Summarize the product roadmap.")
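Because a cache write costs more than a plain input read, caching only pays off once the prefix is reused. The sketch below works through that break-even using the prices quoted in this article; it assumes every request after the first lands inside the cache lifetime (one write, then only reads). `prefix_cost` is an illustrative helper, not an SDK function.

```python
# Per-1M-token prices (USD) as quoted in this article.
INPUT_PRICE = 3.00
CACHE_WRITE_PRICE = 3.75
CACHE_READ_PRICE = 0.30


def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of sending the same prompt prefix `requests` times."""
    millions = prefix_tokens / 1_000_000
    if requests <= 0:
        return 0.0
    if not cached:
        return requests * millions * INPUT_PRICE
    # Assumption: first request writes the cache, all later ones read it.
    return millions * CACHE_WRITE_PRICE + (requests - 1) * millions * CACHE_READ_PRICE


if __name__ == "__main__":
    for n in (1, 2, 100):
        plain = prefix_cost(100_000, n, cached=False)
        with_cache = prefix_cost(100_000, n, cached=True)
        print(f"{n:>3} requests: uncached ${plain:.3f} | cached ${with_cache:.3f}")
```

For a 100K-token prefix, a single request is slightly more expensive with caching ($0.375 against $0.30), and caching is ahead from the second request onward.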

Messages Batches API for Async Workloads

For processing tasks where real-time response isn't required, Batches API delivers an additional 50% cost reduction. With Sonnet 4.6's 300K max_tokens cap, you can generate substantial content in each batch request.

import anthropic
import time
 
client = anthropic.Anthropic()
 
def batch_generate(tasks: list[dict]) -> list[str]:
    """
    Process multiple generation tasks asynchronously.
    Cost: 50% off standard pricing.
    Turnaround: within 24 hours.
    """
    requests = [
        {
            "custom_id": f"task-{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 8192,
                "messages": [{"role": "user", "content": task["prompt"]}],
            },
        }
        for i, task in enumerate(tasks)
    ]
 
    batch = client.messages.batches.create(requests=requests)
    print(f"Batch {batch.id} created with {len(requests)} requests")
 
    # Poll for completion
    while True:
        status = client.messages.batches.retrieve(batch.id)
        if status.processing_status == "ended":
            break
        print(f"Processing... {status.request_counts.processing} remaining")
        time.sleep(30)
 
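    # Collect results keyed by custom_id: results can arrive in any order,
    # and failed or expired requests are skipped rather than raising.
    texts = {
        result.custom_id: result.result.message.content[0].text
        for result in client.messages.batches.results(batch.id)
        if result.result.type == "succeeded"
    }
    return [texts[f"task-{i}"] for i in range(len(requests)) if f"task-{i}" in texts]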
 
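# Example: Generate 100 product descriptions in one batch
# product_list is a placeholder here; in practice it would hold your own product names
product_list = ["Product A", "Product B"]
tasks = [
    {"prompt": f"Write a compelling 150-word product description for: {product}"}
    for product in product_list
]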
descriptions = batch_generate(tasks)

For a complete production cost optimization playbook, see Claude API Cost Optimization Production Guide.

Streaming Implementation

Streaming is essential for conversational UIs and long-form generation where users need to see progress immediately.

import anthropic
 
client = anthropic.Anthropic()
 
def stream_to_console(prompt: str) -> str:
    """Stream response tokens as they arrive."""
    full_text = ""
 
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_text += text
 
        usage = stream.get_final_message().usage
        print(f"\n\n[Input: {usage.input_tokens} | Output: {usage.output_tokens}]")
 
    return full_text

// Next.js App Router streaming with the Anthropic SDK
// app/api/chat/route.ts
import Anthropic from "@anthropic-ai/sdk";
 
const client = new Anthropic();
 
export async function POST(req: Request) {
  const { messages } = await req.json();
 
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      const response = await client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 4096,
        stream: true,
        messages,
      });
 
      for await (const event of response) {
        if (
          event.type === "content_block_delta" &&
          event.delta.type === "text_delta"
        ) {
          controller.enqueue(encoder.encode(event.delta.text));
        }
      }
      controller.close();
    },
  });
 
  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

For complete streaming patterns in production, see Claude API Streaming × Real-Time Chat UI Production Guide.

Production Monitoring and Observability

Deploying Sonnet 4.6 in production without observability is flying blind. Here's a practical monitoring layer that captures the metrics that actually matter.

Tracking Token Usage and Cache Efficiency

import anthropic
from dataclasses import dataclass, field
from datetime import datetime
import json
 
client = anthropic.Anthropic()
 
@dataclass
class RequestMetrics:
    timestamp: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    latency_ms: float
    model: str
    cost_usd: float
 
    def to_dict(self):
        return {
            "timestamp": self.timestamp,
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "cache_read_tokens": self.cache_read_tokens,
            "cache_write_tokens": self.cache_write_tokens,
            "latency_ms": self.latency_ms,
            "model": self.model,
            "cost_usd": self.cost_usd,
        }
 
def compute_cost(usage, model: str = "claude-sonnet-4-6") -> float:
    """Compute actual cost in USD based on token usage."""
    # Sonnet 4.6 pricing per 1M tokens
    INPUT_RATE = 3.00 / 1_000_000
    OUTPUT_RATE = 15.00 / 1_000_000
    CACHE_READ_RATE = 0.30 / 1_000_000
    CACHE_WRITE_RATE = 3.75 / 1_000_000
 
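    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    # usage.input_tokens already excludes cached tokens; cache reads and writes
    # are reported in their own fields, so nothing is subtracted here.
    standard_input = usage.input_tokens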
 
    return (
        standard_input * INPUT_RATE
        + usage.output_tokens * OUTPUT_RATE
        + cache_read * CACHE_READ_RATE
        + cache_write * CACHE_WRITE_RATE
    )
 
metrics_log: list[RequestMetrics] = []
 
def monitored_call(prompt: str, system: str = None) -> str:
    """Wrap any Sonnet 4.6 call with automatic metrics collection."""
    import time
 
    kwargs = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
 
    if system:
        kwargs["system"] = [
            {
                "type": "text",
                "text": system,
                "cache_control": {"type": "ephemeral"},
            }
        ]
 
    start = time.time()
    response = client.messages.create(**kwargs)
    latency = (time.time() - start) * 1000
 
    usage = response.usage
    metrics = RequestMetrics(
        timestamp=datetime.utcnow().isoformat(),
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
        cache_read_tokens=getattr(usage, "cache_read_input_tokens", 0) or 0,
        cache_write_tokens=getattr(usage, "cache_creation_input_tokens", 0) or 0,
        latency_ms=round(latency, 2),
        model="claude-sonnet-4-6",
        cost_usd=round(compute_cost(usage), 6),
    )
    metrics_log.append(metrics)
 
    # Log to your observability stack (Datadog, Grafana, CloudWatch, etc.)
    print(json.dumps(metrics.to_dict()))
 
    return response.content[0].text
 
def print_usage_summary() -> None:
    """Print aggregated stats across all requests in this session."""
    if not metrics_log:
        return
    total_cost = sum(m.cost_usd for m in metrics_log)
    total_cache_reads = sum(m.cache_read_tokens for m in metrics_log)
    total_cache_writes = sum(m.cache_write_tokens for m in metrics_log)
    total_output = sum(m.output_tokens for m in metrics_log)
    # input_tokens excludes cached tokens, so the full prompt size is the sum of all three
    total_input = (
        sum(m.input_tokens for m in metrics_log)
        + total_cache_reads
        + total_cache_writes
    )
    avg_latency = sum(m.latency_ms for m in metrics_log) / len(metrics_log)
    cache_hit_rate = (
        total_cache_reads / total_input * 100 if total_input > 0 else 0
    )
 
    print("\n=== Session Summary ===")
    print(f"Requests:      {len(metrics_log)}")
    print(f"Total cost:    ${total_cost:.4f}")
    print(f"Avg latency:   {avg_latency:.0f}ms")
    print(f"Cache hit rate:{cache_hit_rate:.1f}%")
    print(f"Total tokens:  {total_input + total_output:,}")

Key Metrics to Alert On

Set up alerts for these thresholds to catch issues before they affect users:

Cache hit rate drops below 60%: Indicates your system prompt isn't being cached correctly, and costs are rising
P95 latency exceeds 5s: Suggests the model may be processing an oversized context or Extended Thinking is active unexpectedly
Error rate (429/529) exceeds 1%: A 429 means you're hitting rate limits, so implement request queuing or upgrade your tier; a 529 means the API is overloaded, so retry with backoff
Cost per request doubles: Usually means the context window is growing unchecked; audit your history management logic
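The first three thresholds can be checked mechanically over a window of recent requests. The sketch below is one possible shape, assuming you already collect latencies, token counts, and error counts per window. `check_alerts` and `p95` are hypothetical helpers, and the nearest-rank percentile is a simplification of what a metrics backend would compute.

```python
def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method (integer math)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def check_alerts(
    latencies_ms: list[float],
    cache_read_tokens: int,
    total_input_tokens: int,
    error_count: int,
    request_count: int,
) -> list[str]:
    """Return the names of the thresholds from the list above that are breached."""
    alerts = []
    if total_input_tokens and cache_read_tokens / total_input_tokens < 0.60:
        alerts.append("cache_hit_rate")
    if latencies_ms and p95(latencies_ms) > 5000:
        alerts.append("p95_latency")
    if request_count and error_count / request_count > 0.01:
        alerts.append("error_rate")
    return alerts


if __name__ == "__main__":
    print(check_alerts([1000] * 19 + [9000], 500, 1000, 0, 20))
    print(check_alerts([6000] * 10, 900, 1000, 1, 10))
```

Here `total_input_tokens` is meant to include cached tokens, so the ratio stays between 0 and 1.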

Structuring Logs for Debugging

Log both the input context and the model's reasoning when diagnosing quality issues in production. A structured log entry gives you everything you need to reproduce a problem:

import hashlib
import json
 
def debug_log(prompt: str, response_text: str, thinking: str = None) -> None:
    """Write a structured log entry for quality auditing."""
    entry = {
        "request_hash": hashlib.sha256(prompt.encode()).hexdigest()[:8],
        "prompt_length": len(prompt),
        "response_length": len(response_text),
        "thinking_length": len(thinking) if thinking else 0,
        "has_thinking": thinking is not None,
        # Truncate for log size management
        "prompt_preview": prompt[:200],
        "response_preview": response_text[:200],
    }
    print(json.dumps(entry))

Integrating Sonnet 4.6 into Existing Applications

Migration from Sonnet 4.5

The API interface is identical. In most cases, updating the model string is sufficient:

# Before
response = client.messages.create(
    model="claude-sonnet-4-5",
    ...
)
 
# After — no other changes required in the vast majority of cases
response = client.messages.create(
    model="claude-sonnet-4-6",
    ...
)

Run both models in parallel for one to two weeks before fully cutting over. Log responses from both and compare quality on your specific tasks. Even if you expect Sonnet 4.6 to come out ahead, the comparison is what surfaces prompt-specific regressions before your users do.

Handling the Long-Context Beta Deprecation

If you currently rely on the long-context beta that extends Sonnet 4.5 or Sonnet 4 past the standard 200K context window, you need to migrate before April 30, 2026. After that date, requests exceeding the standard context window will return errors.

The migration path is straightforward: switch to "claude-sonnet-4-6" and remove any beta headers related to extended context. Sonnet 4.6 provides 1M tokens natively without beta flags.

# Remove any long-context beta header, for example:
# client = anthropic.Anthropic(
#     default_headers={"anthropic-beta": "context-1m-2025-08-07"}
# )
 
# Sonnet 4.6 doesn't need them — 1M context is built in
client = anthropic.Anthropic()

What the docs don't tell you — notes from running it in production

A few behaviors only surface once you've shipped. Here is what I noticed as an indie developer after moving a support agent for one of my own apps over to Sonnet 4.6.

"Can hold 1M tokens" and "should send 1M tokens" are different claims

The 1M context is genuinely powerful, but once input crosses roughly 200K tokens, time-to-first-token grows noticeably. In my own measurements, the same question reached its first token in about 1.2s with a 30K input, but stretched to about 4.8s when I padded the input to 450K.

So 1M is the ceiling you can fill, not the amount you should send every call. Rather than keeping the entire history verbatim, I settled on passing the most recent 20–30 turns plus a summarized long-term memory separately. Both latency and cost stayed predictable.
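A minimal sketch of that windowing, assuming the history is a list of alternating user/assistant messages and that a summary string is produced elsewhere (how it is generated is out of scope here). `build_context` and `BASE_SYSTEM` are illustrative names.

```python
BASE_SYSTEM = "You are a dedicated engineering assistant for this project."


def build_context(history: list[dict], summary: str, keep_turns: int = 25):
    """
    Keep the most recent `keep_turns` user/assistant exchanges verbatim and
    pass older context as a summary inside the system prompt.
    Returns (system_prompt, messages).
    """
    recent = history[-2 * keep_turns:]
    # The messages array has to start with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]

    system = BASE_SYSTEM
    if summary:
        system += "\n\nLong-term memory (summarized):\n" + summary
    return system, recent


if __name__ == "__main__":
    history = []
    for i in range(50):
        history.append({"role": "user", "content": f"question {i}"})
        history.append({"role": "assistant", "content": f"answer {i}"})
    system, messages = build_context(history, "User prefers concise answers.", keep_turns=2)
    print(len(messages), messages[0])
```

One trade-off to keep in mind: a summary that changes on every call also changes the system prompt, which works against Prompt Caching, so it may be worth refreshing the summary only every N turns.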

Prompt Caching effectiveness depends on ordering

The docs say "put the stable parts first," but in practice a single variable element slipped between the system prompt and the tool definitions invalidates the entire cache after it.

I had originally appended the user's timezone to the end of the system prompt, and my cache hit rate came in at less than half of what I expected. Moving every variable element into the messages array lifted the post-write hit rate from 0.41 to 0.88. "Never place a variable element ahead of the cache breakpoint" is an easy-to-miss rule that maps directly to your bill.
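The fix described here can be sketched as a request builder that keeps the tools and system prompt byte-for-byte stable and pushes per-user values into the user message. `build_request` is a hypothetical helper, and the bracketed timezone prefix is just one way to carry the variable; the returned dict is meant to be passed as keyword arguments to `client.messages.create`.

```python
def build_request(stable_system: str, tools: list[dict],
                  user_message: str, user_timezone: str) -> dict:
    """Build request kwargs whose cached prefix never contains per-user data."""
    request = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 4096,
        # Stable prefix: identical across users and calls, so it can be cached.
        "system": [
            {
                "type": "text",
                "text": stable_system,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable part: lives after the cache breakpoint, in the messages array.
        "messages": [
            {
                "role": "user",
                "content": f"[user timezone: {user_timezone}]\n{user_message}",
            }
        ],
    }
    if tools:
        request["tools"] = tools
    return request


if __name__ == "__main__":
    a = build_request("You are the support agent.", [], "Where is my order?", "Asia/Tokyo")
    b = build_request("You are the support agent.", [], "Where is my order?", "Europe/Paris")
    print(a["system"] == b["system"], a["messages"] == b["messages"])
```

Two requests from users in different timezones now share an identical system block and differ only in the messages, which is the condition for the second request to read from the cache.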

With Computer Use, the real design is how you absorb the failures

72.5 on OSWorld-Verified also means it misses more than one attempt in four. A demo runs clean; production quality is decided by the recovery path when it fails.

The approach that worked for me was inserting a verification step before each Computer Use action — having the model state the current screen state in one sentence. That let it correct course just before a misstep, and task completion improved markedly. Designing for failure mattered more than the headline accuracy number.

Common Errors and Fixes

Error 1: `context_window_exceeded`

anthropic.BadRequestError: 400 {
  "error": {
    "message": "prompt is too long: 1050000 tokens > 1000000 maximum"
  }
}

Fix: Implement graceful history truncation.

def smart_truncate(history: list, max_tokens: int = 900_000) -> list:
    """Remove oldest turns when approaching the context limit."""
    estimated = sum(len(m["content"]) // 4 for m in history)
    while estimated > max_tokens and len(history) > 2:
        history.pop(0)  # Remove oldest user turn
        history.pop(0)  # Remove oldest assistant turn
        estimated = sum(len(m["content"]) // 4 for m in history)
    return history

Error 2: `overloaded_error` (HTTP 529)

Fix: Exponential backoff with jitter.

import time
import random
 
def resilient_call(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
        except anthropic.APIStatusError as e:
            if e.status_code == 529 and attempt < max_retries - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"API overloaded. Retrying in {wait:.1f}s ({attempt+1}/{max_retries})")
                time.sleep(wait)
            else:
                raise

Error 3: Empty response with Extended Thinking

Fix: Ensure max_tokens > budget_tokens + expected output.

# Wrong: max_tokens too low
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=5000,    # Too low: max_tokens must exceed budget_tokens (10000) and leave room for the answer
    thinking={"type": "enabled", "budget_tokens": 10000},
    ...
)
 
# Correct: max_tokens = thinking budget + output headroom
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=18000,   # 10000 thinking + 8000 output
    thinking={"type": "enabled", "budget_tokens": 10000},
    ...
)
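To avoid getting this relationship wrong by hand, the two values can be derived together. The helper below is a small sketch; `thinking_params` is a hypothetical name, the 1,024 minimum follows the range mentioned earlier in this article, and the default headroom of 8,000 simply mirrors the example above.

```python
MIN_THINKING_BUDGET = 1024


def thinking_params(budget_tokens: int, output_headroom: int = 8000) -> dict:
    """Return max_tokens and thinking settings that are consistent with each other."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if output_headroom <= 0:
        raise ValueError("output_headroom must be positive")
    return {
        # The thinking budget counts against max_tokens, so add room for the answer.
        "max_tokens": budget_tokens + output_headroom,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


if __name__ == "__main__":
    print(thinking_params(10000))
```

The result can be spread into the call, for example `client.messages.create(model="claude-sonnet-4-6", messages=[...], **thinking_params(10000))`.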

Closing thoughts — where to start

The value of Sonnet 4.6 lies less in its raw spec sheet than in the cost efficiency you get from matching the right model to the right task. For the large majority of workloads Sonnet 4.6 is the better default, with Opus 4.6 reserved for genuinely hard reasoning. Drawing that line keeps quality high while costs settle down.

Start by switching your model parameter to "claude-sonnet-4-6" and adding cache_control: {"type": "ephemeral"} to the parts of your system prompt and tool definitions that never change. Keep every variable element after that cache breakpoint, in the messages array. Hold to that one rule and you should see the difference on your very first invoice.

If you are tackling the same problem, I hope this saves you a few of the detours I took.
