API & SDK / 2026-03-24 / Intermediate

Claude API Token Counting Guide — How to Estimate Token Usage and Optimize Costs Before Sending Requests

Learn how to use the Claude API Token Counting endpoint to estimate token usage before sending messages. Covers cost management, context window optimization, and production implementation patterns.


Why Token Counting Matters for Your API Workflow

If you've ever been surprised by unexpected costs after sending large multimodal requests to Claude, you're not alone. Images, PDFs, tool definitions, and lengthy system prompts can consume far more tokens than you'd expect — and by the time you check your usage dashboard, the bill is already there.

Anthropic provides a dedicated Token Counting endpoint that addresses this problem. It returns the number of input tokens a request will consume before you send it, giving you control over costs and context window management. Treat the result as a close estimate: the count billed for the actual message can differ by a small amount.

Token Counting Endpoint Basics

What It Does

The Token Counting endpoint accepts the same request structure as the Messages API and returns the total number of input tokens — without actually generating a response.

Key features:

  • Free to use: There's no charge for token counting requests
  • Works with all models: Claude Opus 4.6, Sonnet 4.6, Haiku 4.5, and all other active models
  • Multimodal support: Counts tokens for text, images, PDFs, and tool definitions
  • Independent rate limits: Token counting has its own rate limits, separate from the Messages API, so counting won't eat into your message quota
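For readers who want to see what travels over the wire, here is a minimal sketch of the underlying HTTP request. It only assembles the request and does not send it. The helper name `build_count_tokens_request` and the placeholder API key are assumptions made for illustration; the path and headers follow the public REST conventions of the Messages API.

```python
import json
import os

# REST path for token counting; it sits next to the Messages API path.
API_URL = "https://api.anthropic.com/v1/messages/count_tokens"


def build_count_tokens_request(model: str, messages: list, api_key: str) -> dict:
    """Assemble (but do not send) the HTTP request for a token count.

    Hypothetical helper for illustration; the SDK does this for you.
    """
    return {
        "url": API_URL,
        "headers": {
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        # Same shape as a Messages request, but no max_tokens is needed
        # because nothing is generated.
        "body": json.dumps({"model": model, "messages": messages}),
    }


request = build_count_tokens_request(
    model="claude-sonnet-4-6-20260320",
    messages=[{"role": "user", "content": "Hello"}],
    api_key=os.environ.get("ANTHROPIC_API_KEY", "sk-placeholder"),
)
print(request["url"])
```

Any HTTP client can POST this body to the URL; the response is a small JSON object with an `input_tokens` field.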

Basic Usage

Here's the simplest way to count tokens using the Python SDK:

import anthropic
 
client = anthropic.Anthropic()
 
# Count tokens for a message
response = client.messages.count_tokens(
    model="claude-sonnet-4-6-20260320",
    messages=[
        {
            "role": "user",
            "content": "Explain how token counting works in the Claude API."
        }
    ]
)
 
print(f"Input tokens: {response.input_tokens}")
# Output: Input tokens: 22

And the same thing in TypeScript:

import Anthropic from "@anthropic-ai/sdk";
 
const client = new Anthropic();
 
const response = await client.messages.countTokens({
  model: "claude-sonnet-4-6-20260320",
  messages: [
    {
      role: "user",
      content: "Explain how token counting works in the Claude API.",
    },
  ],
});
 
console.log(`Input tokens: ${response.input_tokens}`);
// Output: Input tokens: 22

Counting Tokens with System Prompts and Tools

In real-world applications, system prompts and tool definitions often make up a significant portion of your input tokens. The Token Counting endpoint accounts for these as well.

import anthropic
 
client = anthropic.Anthropic()
 
# Count tokens including system prompt + tool definitions + messages
response = client.messages.count_tokens(
    model="claude-sonnet-4-6-20260320",
    system="You are a helpful weather assistant. Provide accurate weather information based on user queries.",
    tools=[
        {
            "name": "get_weather",
            "description": "Get the current weather for a specified city",
            "input_schema": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to check the weather for"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "What's the weather like in Tokyo today?"
        }
    ]
)
 
print(f"Input tokens (system + tools + messages): {response.input_tokens}")
# Output: Input tokens (system + tools + messages): 372

For applications with many tool definitions, tokens can add up to hundreds or even thousands. Knowing this upfront helps you decide which tools to include in each request.
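One way to see which tools are expensive is to count each tool on its own and subtract a no-tools baseline. The sketch below is an assumption about how you might structure that, not an SDK feature: `per_tool_token_cost` and `count_fn` are hypothetical names, and `count_fn` stands in for a wrapper around `client.messages.count_tokens` so the logic can run without network access.

```python
from typing import Callable


def per_tool_token_cost(
    tools: list[dict],
    count_fn: Callable[[list[dict]], int],
) -> dict[str, int]:
    """Return the extra input tokens each tool adds over a no-tools baseline.

    count_fn(tools_subset) must return the input token count for a fixed
    request that includes exactly those tools. With the real SDK it could
    wrap client.messages.count_tokens(...), leaving out the tools argument
    when the subset is empty.
    """
    baseline = count_fn([])
    return {tool["name"]: count_fn([tool]) - baseline for tool in tools}


# Stand-in counter so the example runs offline (NOT real token counts):
# 10 "tokens" for the message plus one per character of tool description.
def fake_count(tools_subset: list[dict]) -> int:
    return 10 + sum(len(t["description"]) for t in tools_subset)


tools = [
    {"name": "get_weather", "description": "Get the current weather"},
    {"name": "get_time", "description": "Get the time"},
]
costs = per_tool_token_cost(tools, fake_count)
print(costs)
```

Each per-tool figure includes any fixed overhead the API adds once tools are present, so the numbers are best read as a ranking rather than as exact marginal costs.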

Counting Tokens for Images and PDFs

Multimodal requests are where token counting becomes especially valuable, since image and PDF token consumption is hard to predict without measuring it.

import anthropic
import base64
 
client = anthropic.Anthropic()
 
# Read and encode an image file
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")
 
response = client.messages.count_tokens(
    model="claude-sonnet-4-6-20260320",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what you see in this image."
                }
            ]
        }
    ]
)
 
print(f"Input tokens (with image): {response.input_tokens}")
# Output: Input tokens (with image): 1584

Image token counts vary significantly based on resolution — a high-resolution screenshot might consume several thousand tokens. Checking this before sending prevents unpleasant surprises on your bill.
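For a quick local estimate before calling the endpoint, Anthropic's vision documentation gives a rule of thumb of roughly width × height / 750 tokens, applied after large images are scaled down. The sketch below encodes that heuristic. The function name and the 1568-pixel long-edge limit are assumptions taken from that guidance and may not hold for every model, so treat the endpoint's count as the source of truth.

```python
def estimate_image_tokens(width: int, height: int, max_long_edge: int = 1568) -> int:
    """Rough token estimate for an image of the given pixel dimensions.

    Heuristic only: tokens ~= (width * height) / 750, after scaling the
    image so its longest edge is at most max_long_edge pixels.
    """
    # Scale factor is 1.0 for images already within the limit.
    scale = min(1.0, max_long_edge / max(width, height))
    scaled_width = width * scale
    scaled_height = height * scale
    return round(scaled_width * scaled_height / 750)


print(estimate_image_tokens(1092, 1092))  # 1590
```

If the estimate and the endpoint's count disagree noticeably, trust the endpoint and adjust the heuristic for your model.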

Production Patterns

Pattern 1: Context Window Management

For chatbots that maintain long conversations, you need to trim old messages before hitting the context window limit. Token Counting makes this precise rather than guesswork.

import anthropic
 
client = anthropic.Anthropic()
 
MODEL = "claude-sonnet-4-6-20260320"
MAX_INPUT_TOKENS = 180_000  # Leave a safety margin
SYSTEM_PROMPT = "You are a helpful customer support assistant."
 
def manage_conversation(messages: list, new_message: dict) -> list:
    """Manage conversation history to stay within the context window."""
    candidate = messages + [new_message]
 
    # Count current tokens
    count_response = client.messages.count_tokens(
        model=MODEL,
        system=SYSTEM_PROMPT,
        messages=candidate
    )
 
    # Remove the oldest user-assistant pairs until we're under the limit,
    # always keeping at least the newest message so the request stays valid
    while count_response.input_tokens > MAX_INPUT_TOKENS and len(candidate) > 2:
        candidate = candidate[2:]  # Remove oldest user-assistant pair
        count_response = client.messages.count_tokens(
            model=MODEL,
            system=SYSTEM_PROMPT,
            messages=candidate
        )
        print(f"Trimmed conversation: {count_response.input_tokens} tokens")
 
    return candidate
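The loop above makes one counting request per removed pair, which can add up for very long histories. As an alternative (a different technique from the linear loop, not the article's method), the sketch below binary-searches the number of pairs to drop. It assumes the token count only shrinks as messages are removed. `trim_to_budget` and `count_fn` are hypothetical names; `count_fn` stands in for a wrapper around `client.messages.count_tokens`.

```python
from typing import Callable


def trim_to_budget(
    messages: list[dict],
    count_fn: Callable[[list[dict]], int],
    max_tokens: int,
) -> list[dict]:
    """Drop the fewest leading user-assistant pairs needed to fit max_tokens.

    Always keeps at least one message. If even the shortest allowed
    history is over budget, that shortest history is returned.
    """
    if count_fn(messages) <= max_tokens:
        return messages

    max_pairs = (len(messages) - 1) // 2  # pairs we may drop, keeping >= 1 message
    low, high = 1, max_pairs
    best = max_pairs  # fallback: drop as much as allowed
    while low <= high:
        mid = (low + high) // 2
        if count_fn(messages[2 * mid:]) <= max_tokens:
            best = mid        # fits; try dropping fewer pairs
            high = mid - 1
        else:
            low = mid + 1     # still too big; drop more pairs
    return messages[2 * best:]


# Offline stand-in: pretend every message costs 10 tokens.
history = [{"role": "user" if i % 2 == 0 else "assistant", "content": f"m{i}"}
           for i in range(7)]
trimmed = trim_to_budget(history, lambda msgs: 10 * len(msgs), max_tokens=35)
print(len(trimmed))  # 3
```

For a history of n pairs this needs about log2(n) counting requests instead of up to n, at the cost of relying on the monotonicity assumption.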

Pattern 2: Batch Cost Estimation Dashboard

Before kicking off a large batch job, estimate total costs upfront.

import anthropic
 
client = anthropic.Anthropic()
 
# Input token pricing per 1M tokens (USD).
# Illustrative values: confirm against the current pricing page before relying on them.
PRICING = {
    "claude-opus-4-6-20260205": 15.0,
    "claude-sonnet-4-6-20260320": 3.0,
    "claude-haiku-4-5-20251001": 0.80,
}
 
def estimate_batch_cost(
    model: str,
    tasks: list[dict],
    system: str = ""
) -> dict:
    """Estimate costs for a batch of tasks before execution."""
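    if not tasks:
        # Avoid a division by zero when computing the average below
        raise ValueError("tasks must contain at least one item")

    total_tokens = 0
    # Only send a system prompt when one was actually provided
    extra = {"system": system} if system else {}

    for task in tasks:
        response = client.messages.count_tokens(
            model=model,
            messages=[{"role": "user", "content": task["content"]}],
            **extra
        )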
        total_tokens += response.input_tokens
 
    price_per_token = PRICING.get(model, 3.0) / 1_000_000
    estimated_cost = total_tokens * price_per_token
 
    return {
        "total_input_tokens": total_tokens,
        "estimated_input_cost_usd": round(estimated_cost, 4),
        "average_tokens_per_task": total_tokens // len(tasks),
        "task_count": len(tasks)
    }
 
# Usage example
tasks = [
    {"content": "Analyze the sentiment of this product review: ..."},
    {"content": "Summarize the following article: ..."},
    {"content": "Find the bug in this code snippet: ..."},
]
 
estimate = estimate_batch_cost("claude-sonnet-4-6-20260320", tasks)
print(f"Total input tokens: {estimate['total_input_tokens']:,}")
print(f"Estimated input cost: ${estimate['estimated_input_cost_usd']}")
# Example output (actual values depend on the task content):
# Total input tokens: 1,245
# Estimated input cost: $0.0037
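The arithmetic behind the estimate is simple enough to check by hand: 1,245 tokens × $3.00 / 1,000,000 = $0.003735, which rounds to the $0.0037 shown above. A minimal sketch of that calculation, plus a budget gate you might put in front of a batch run, follows. The function names and the budget figures are hypothetical, and the price is the illustrative Sonnet value from the table above.

```python
def input_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input cost in USD for a token count at a per-million-token price."""
    return input_tokens * price_per_million_usd / 1_000_000


def within_budget(input_tokens: int, price_per_million_usd: float, budget_usd: float) -> bool:
    """True if the estimated input cost does not exceed the budget."""
    return input_cost_usd(input_tokens, price_per_million_usd) <= budget_usd


cost = input_cost_usd(1_245, 3.0)
print(f"${cost:.6f}")  # $0.003735
print(within_budget(1_245, 3.0, budget_usd=0.01))
```

This covers input tokens only; output tokens are billed separately and cannot be known before the model responds.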

Pattern 3: Combining with Prompt Caching

If you use [prompt caching](/articles/api-sdk/prompt-caching), Token Counting helps you optimize cache breakpoint placement. By knowing exactly how many tokens each section of your prompt consumes, you can ensure cache-eligible portions meet the minimum threshold.

import anthropic
 
client = anthropic.Anthropic()
 
# Check if a system prompt qualifies for caching
large_system_prompt = "..." * 1000  # Large system prompt
 
response = client.messages.count_tokens(
    model="claude-sonnet-4-6-20260320",
    system=large_system_prompt,
    messages=[{"role": "user", "content": "test"}]
)
 
print(f"Input tokens with system prompt: {response.input_tokens}")
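# Verify it meets the minimum for prompt caching (1,024 tokens here; the minimum varies by model).
# Note: this count also includes the short user message, not just the system prompt.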
if response.input_tokens >= 1024:
    print("✅ Eligible for prompt caching")
else:
    print("⚠️ Below the 1,024-token minimum for prompt caching")
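Once you have measured each prompt section separately, choosing a breakpoint comes down to a running total. The sketch below finds the first section at which the cumulative prefix reaches the caching minimum. `first_cacheable_section` is a hypothetical helper, the section counts are made-up numbers, and 1,024 is only the default used in this article; check the minimum for your model.

```python
from itertools import accumulate
from typing import Optional


def first_cacheable_section(section_tokens: list[int], minimum: int = 1024) -> Optional[int]:
    """Index of the first section where the cumulative prefix meets `minimum`.

    section_tokens holds measured token counts in prompt order
    (for example: tools, system prompt, few-shot examples).
    Returns None if the whole prompt stays below the minimum.
    """
    for index, running_total in enumerate(accumulate(section_tokens)):
        if running_total >= minimum:
            return index
    return None


# Made-up measurements: tools, system prompt, few-shot examples, user turn
sections = [372, 500, 300, 90]
print(first_cacheable_section(sections))  # 2
```

A breakpoint placed at or after the returned section covers a prefix long enough to qualify; anything earlier would fall below the minimum.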

Wrapping Up

The Token Counting endpoint is one of those tools that's easy to overlook but incredibly valuable once you start using it. Here's where it makes the biggest difference:

  • Pre-flight cost checks for batch processing: Know what you'll spend before you commit to processing thousands of requests
  • Smart conversation management: Keep chatbots running smoothly by monitoring context window usage in real time
  • Multimodal optimization: Understand exactly how much images and PDFs cost in tokens, and adjust resolutions or content accordingly

Since it's free to use, there's really no reason not to integrate it into your workflow. Start by identifying which requests in your application consume the most tokens — that's where you'll find the best optimization opportunities.

For a broader introduction to the Claude API, check out the [API Quickstart guide](/articles/api-sdk/api-quickstart). For managing rate limits effectively, see [Rate Limits Best Practices](/articles/api-sdk/rate-limits-best-practices).
