API & SDK · 2026-06-20 · Advanced

Routing the effort Parameter Per Stage to Balance Claude's Output Cost and Latency

Claude's effort parameter governs all output tokens — thinking, prose, and tool calls. This guide replaces a blanket high setting with per-stage tiers and a dynamic router, grounded in measurements from a solo developer's automation pipeline.

Running four blogs on autopilot as an indie developer means calling the Claude API dozens of times a day. One evening I was looking at a cost breakdown and something caught my eye. The call that merely "decides which category an article belongs to" and the call that "reviews a full draft and surfaces contradictions" were running at roughly the same token economics. The former should finish in a blink; the latter deserves careful thought. Yet I was sending both at the default effort level, high, which spends tokens generously no matter how simple the task is.

The key to stopping this "maximum effort for everything" habit is the effort parameter. Without switching models, you can dial how eagerly a single call spends tokens. This article first pins down what effort actually controls, then builds a per-stage routing scheme and a dynamic router that picks a level based on the input — with honest measurements from my own indie-developer automation along the way.

effort controls more than "thinking"

The most common misconception is that effort is a switch for extended-thinking depth. That's half right, and the missing half matters.

The official definition is broader: effort affects every token in the response. Concretely, three kinds:

Target	Behavior at lower effort
Prose and explanations	Skips preamble, answers concisely
Tool calls and arguments	Fewer calls, combines operations
Extended thinking (when enabled)	May skip thinking on simple problems

What makes this important is that effort works even on requests where thinking isn't enabled. In a tool-heavy agent, lowering effort shifts behavior from "explain the plan at length, then act" toward "act quietly and report briefly." Treat effort as a knob on token spend itself — independent of whether thinking is on — and your design stays coherent.
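To see which of the three kinds of output a given response actually contains, one option is to tally its content blocks by type. The sketch below assumes the response's content is a list of objects with a `type` attribute such as "text", "tool_use", or "thinking"; `count_block_types` is a helper name made up for this example, and the fake blocks stand in for a real response.

```python
from collections import Counter
from types import SimpleNamespace


def count_block_types(content) -> Counter:
    """Tally content blocks by their `type` attribute."""
    return Counter(block.type for block in content)


# Stand-in for resp.content from a real call (hypothetical data).
fake_content = [
    SimpleNamespace(type="thinking"),
    SimpleNamespace(type="text"),
    SimpleNamespace(type="tool_use"),
    SimpleNamespace(type="tool_use"),
]

counts = count_block_types(fake_content)
print(dict(counts))
```

Running the same prompt at two effort levels and comparing these tallies is a quick way to check whether a lower level really produced fewer tool calls or shorter prose in your own workload.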

Five levels, with high as the default

There are five levels. high is the default and behaves exactly the same as omitting effort entirely.

Level	Character	Fitting stage
max	Maximum capability, no token constraints	Hardest problems needing deepest reasoning
xhigh	Extended capability for long-horizon work	Coding/agent tasks over 30 minutes
high (default)	High capability; same as unset	Complex reasoning, hard implementation
medium	Balance of speed, cost, quality	Balanced agentic work
low	Most efficient; slight capability drop	Classification, quick lookups, high volume

Note that effort is a behavioral signal, not a strict token cap. Even at low, the model still thinks on genuinely hard problems — it just thinks less than it would at a higher level for the same problem. max and xhigh are available on a limited set of models, so confirm yours supports them first (Opus 4.8 supports both).
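Because the upper levels are not available everywhere, it can help to resolve a requested level against a support table before sending the request. This is a minimal sketch: the `SUPPORTED` table is hypothetical apart from the Opus 4.8 entry stated above, "example-older-model" is a placeholder rather than a real model ID, and `resolve_effort` is a name invented here.

```python
EFFORT_ORDER = ["low", "medium", "high", "xhigh", "max"]

# Hypothetical support table: fill it in from the documentation for the
# models you actually use. Only the Opus 4.8 row comes from the article.
SUPPORTED = {
    "claude-opus-4-8": {"low", "medium", "high", "xhigh", "max"},
    "example-older-model": {"low", "medium", "high"},  # placeholder entry
}


def resolve_effort(model: str, requested: str) -> str:
    """Return the requested level, or the nearest lower level the model supports."""
    if requested not in EFFORT_ORDER:
        raise ValueError(f"unknown effort level: {requested!r}")
    supported = SUPPORTED[model]  # KeyError for models missing from the table
    idx = EFFORT_ORDER.index(requested)
    # Walk downward from the requested level until a supported one is found.
    for level in reversed(EFFORT_ORDER[: idx + 1]):
        if level in supported:
            return level
    raise ValueError(f"{model} supports no level at or below {requested!r}")


print(resolve_effort("example-older-model", "xhigh"))
```

Falling back downward rather than failing is a design choice; raising an error instead may be preferable if a silently lowered level would be hard to notice.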

Before / After: drop the blanket high, route per stage

Here is how I used to call the API. Every stage went through the same helper.

import anthropic
 
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
 
def call_claude(prompt: str) -> str:
    # every stage flows through this = everything at the default high
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

The flaw isn't visible by reading the code. It only shows up in the bill and the latency: even a lightweight category decision ran at the think-hard default every time.

Here is the version that lets each stage pass its own effort. The change is a single extra argument, and the model stays the same.

import anthropic
from typing import Literal
 
client = anthropic.Anthropic()
 
Effort = Literal["low", "medium", "high", "xhigh", "max"]
 
def call_claude(prompt: str, effort: Effort = "high", max_tokens: int = 4096) -> str:
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
        output_config={"effort": effort},  # the efficiency control itself
    )
    return resp.content[0].text
 
# call with intent per stage
category = call_claude(classify_prompt, effort="low", max_tokens=256)
draft    = call_claude(draft_prompt,    effort="medium")
review   = call_claude(review_prompt,   effort="high")

output_config={"effort": ...} is the official way to pass the parameter. No beta header is required; supported models accept it directly. Because effort="high" is synonymous with leaving it unset, you can migrate safely — lowering only the stages you choose without changing the behavior of the rest.

How to classify stages and assign effort

Designing the routing starts by ordering stages by how much quality matters. In my pipeline it settled roughly like this:

Stage	Nature	Assignment
Category / level decision	Bounded classification	low
Tag extraction / summary	Near-formulaic transform	low–medium
Body draft	Volume, but shallow exploration	medium
Fact / internal-link integrity check	Misses are fatal	high
Full restructure / contradiction hunt	Needs deep reasoning	high–xhigh

The axis is simple: "if I drop quality one notch here, how much rework or reader harm follows?" Misclassification is cheap to redo, so it goes to low. A miss in the pre-publish integrity check, by contrast, undermines the article's trustworthiness, so I keep quality over cost and place it at high. Since Opus 4.7, the model scopes its work more tightly to what was asked at lower effort, so casually lowering a complex stage yields shallow output. The rule is: lower it only after you measure.

On Opus 4.8, combine with adaptive thinking

If you use Opus 4.8, note that the way you control thinking has changed. Opus 4.8 uses adaptive thinking — the model decides how much to think on its own — and does not accept a manual budget_tokens.

# ❌ returns a 400 error on Opus 4.8
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4000},  # unsupported
    messages=[{"role": "user", "content": prompt}],
)
 
# ✅ enable adaptive thinking, steer depth with effort
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=8192,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    messages=[{"role": "user", "content": prompt}],
)

Thinking only turns on once you pass thinking={"type": "adaptive"}; without it the call runs with no thinking. From there, high, xhigh, and max almost always think deeply, while lower levels skip thinking on simple problems. A frequent migration stumble is writing as if budget_tokens still applies and discovering the 400 the hard way.
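When many call sites still carry the old shape, a small rewrite step in front of the client can catch them before the API does. The sketch below only manipulates the keyword-argument dictionary and never calls the API; `migrate_thinking` is a hypothetical helper, and it deliberately does not try to translate a budget_tokens number into an effort level, since no such mapping is given here.

```python
from typing import Any


def migrate_thinking(kwargs: dict[str, Any], effort: str = "high") -> dict[str, Any]:
    """Return a copy of request kwargs with a manual thinking budget replaced
    by adaptive thinking plus an effort level."""
    migrated = dict(kwargs)
    thinking = migrated.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "enabled":
        # Drop budget_tokens entirely; depth is steered by effort instead.
        migrated["thinking"] = {"type": "adaptive"}
        output_config = dict(migrated.get("output_config") or {})
        output_config.setdefault("effort", effort)  # keep an explicit effort if set
        migrated["output_config"] = output_config
    return migrated


legacy = {
    "model": "claude-opus-4-8",
    "max_tokens": 8192,
    "thinking": {"type": "enabled", "budget_tokens": 4000},
    "messages": [{"role": "user", "content": "Review this draft."}],
}
migrated = migrate_thinking(legacy)
print(migrated["thinking"], migrated["output_config"])
```

The result can be passed as `client.messages.create(**migrated)`. The original dictionary is left untouched, which makes it easy to log both versions during a migration.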

How it lands in tool-using agents

effort moves both the number of tool calls and how talkative they are. Since that ties directly to agent cost, this is where routing pays off most visibly.

At lower effort, the model combines operations into fewer calls, acts without preamble, and reports tersely. At higher effort, it explains the plan, calls tools more granularly, and summarizes changes carefully. A routine agent task like "read the logs and identify one root cause" is often well served by medium; pushing it to xhigh can add exploration that only inflates cost. Conversely, exploratory work that crawls an unfamiliar codebase suits xhigh. When you run at xhigh or max, give max_tokens plenty of room (start around 64k) so the longer thinking and tool calls don't hit a ceiling mid-task.
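The headroom advice can be encoded next to the effort choice so the two never drift apart. A minimal sketch, assuming 64,000 as the "around 64k" starting point and 4096 as the default used earlier; `max_tokens_for` and both constants are names made up for this example.

```python
DEFAULT_MAX_TOKENS = 4096
ROOMY_MAX_TOKENS = 64_000  # starting point for xhigh/max; tune to your tasks


def max_tokens_for(effort: str, default: int = DEFAULT_MAX_TOKENS) -> int:
    """Give the top effort levels a larger ceiling; leave others at the default."""
    if effort in ("xhigh", "max"):
        # Never shrink a caller-supplied default that is already larger.
        return max(default, ROOMY_MAX_TOKENS)
    return default


for level in ("low", "high", "xhigh"):
    print(level, max_tokens_for(level))
```

With ceilings this large, a single response can run for a long time, so streaming the response is worth considering for those stages.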

A dynamic router that picks a level from the input

If stages are fixed, constants suffice. But when difficulty varies within the same stage, a cheap front step that chooses effort cuts waste. I slot in a thin router like this:

from dataclasses import dataclass
 
@dataclass
class Task:
    kind: str          # "classify" | "draft" | "review" | "redesign"
    input_len: int     # rough input length
    high_stakes: bool  # does failure ripple downstream?
 
def pick_effort(task: Task) -> str:
    # 1) base level from the stage's nature
    base = {
        "classify": "low",
        "draft": "medium",
        "review": "high",
        "redesign": "xhigh",
    }.get(task.kind, "high")
 
    # 2) bump high-stakes stages up one notch
    if task.high_stakes and base in ("low", "medium"):
        base = "medium" if base == "low" else "high"
 
    # 3) for very short inputs, step a heavy stage down to test the water,
    #    but never for high-stakes tasks (that would undo the bump in step 2)
    if task.input_len < 400 and not task.high_stakes and base in ("high", "xhigh"):
        base = "high" if base == "xhigh" else "medium"
 
    return base
 
# usage
effort = pick_effort(Task(kind="review", input_len=120, high_stakes=False))
result = call_claude(review_prompt, effort=effort)

The point is that the router itself must not call Claude. Spending another API call to decide would eat the very cost you set out to save. Decide the base from cheap signals — length, stage name — and move it at most one notch. This "decide cheaply, adjust only when confident" discipline is what holds up in production.
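One way to make the "at most one notch" rule hard to violate is to express every adjustment as a clamped step along an ordered list of levels. This is a sketch rather than part of the router above; `shift_effort` and `EFFORT_ORDER` are names invented here.

```python
EFFORT_ORDER = ["low", "medium", "high", "xhigh", "max"]


def shift_effort(level: str, delta: int) -> str:
    """Move one notch up or down at most, staying inside the valid range."""
    step = max(-1, min(1, delta))  # larger requests are clamped to one notch
    idx = EFFORT_ORDER.index(level) + step
    idx = max(0, min(idx, len(EFFORT_ORDER) - 1))
    return EFFORT_ORDER[idx]


print(shift_effort("low", +1))    # one notch up
print(shift_effort("xhigh", -1))  # one notch down
print(shift_effort("medium", +3)) # still only one notch up
```

Because the clamp lives in one function, a new routing rule cannot accidentally jump from low to xhigh on the strength of a single cheap signal.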

Measure before you lower

Base the decision to lower effort on your own data, not impressions. Run the same prompt set across effort levels and look side by side at output tokens, wall-clock time, and quality (pass rate against your own bar).
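A comparison like this fits in a short harness. The sketch below takes the API call as an injected function so the measuring logic can be exercised without network access; `measure`, `Caller`, and `fake_call` are hypothetical names, and the fake caller only imitates the shape of a real one.

```python
import time
from statistics import mean
from typing import Callable

# (prompt, effort) -> (response text, output token count)
Caller = Callable[[str, str], tuple[str, int]]


def measure(
    call: Caller,
    prompts: list[str],
    levels: list[str],
    passes: Callable[[str, str], bool],
) -> dict[str, dict[str, float]]:
    """Run every prompt at every level; report tokens, latency, and pass rate."""
    report: dict[str, dict[str, float]] = {}
    for level in levels:
        tokens, seconds, ok = [], [], 0
        for prompt in prompts:
            start = time.perf_counter()
            text, out_tokens = call(prompt, level)
            seconds.append(time.perf_counter() - start)
            tokens.append(out_tokens)
            ok += bool(passes(prompt, text))  # your own quality bar
        report[level] = {
            "avg_output_tokens": mean(tokens),
            "avg_seconds": mean(seconds),
            "pass_rate": ok / len(prompts),
        }
    return report


def fake_call(prompt: str, effort: str) -> tuple[str, int]:
    """Stand-in for a real API call: longer output at higher effort."""
    words = {"low": 5, "high": 20}[effort]
    return ("word " * words).strip(), words


report = measure(
    fake_call,
    prompts=["a", "b"],
    levels=["low", "high"],
    passes=lambda prompt, text: len(text.split()) >= 5,
)
print(report["low"]["avg_output_tokens"], report["high"]["avg_output_tokens"])
```

For real runs, the caller would wrap `client.messages.create(...)` and return the response text together with `resp.usage.output_tokens`. The prompt list must be non-empty, since the averages are undefined otherwise.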

Honestly, in my small pipeline: when I dropped lightweight stages like classification and tag extraction from high to low, their output tokens visibly fell and latency shortened. But when I dropped the integrity check to medium, misses ticked up slightly, so I returned it to high. The figures depend heavily on my topics, prompts, and article style, so I won't generalize them into "X percent saved." What matters is having a procedure to verify, per stage, whether lowering was worth it on your own bar.
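The decision described above, keep the lower level only if quality holds, can be written down as a rule so it is applied the same way to every stage. A minimal sketch: `lowest_passing_level` is a hypothetical helper, the tolerance parameter is an assumption about how much quality loss you accept, and the pass rates in the example are made-up illustrative numbers, not measurements from this pipeline.

```python
EFFORT_ORDER = ["low", "medium", "high", "xhigh", "max"]


def lowest_passing_level(
    pass_rates: dict[str, float],
    baseline: str = "high",
    tolerance: float = 0.0,
) -> str:
    """Return the lowest measured level whose pass rate stays within
    `tolerance` of the baseline; fall back to the baseline otherwise."""
    bar = pass_rates[baseline] - tolerance
    # Only levels below the baseline are candidates, checked lowest first.
    for level in EFFORT_ORDER[: EFFORT_ORDER.index(baseline)]:
        if level in pass_rates and pass_rates[level] >= bar:
            return level
    return baseline


# Illustrative numbers only.
classification = {"low": 0.98, "medium": 0.99, "high": 0.98}
integrity_check = {"low": 0.80, "medium": 0.93, "high": 0.97}

print(lowest_passing_level(classification))
print(lowest_passing_level(integrity_check))
```

With a tolerance of zero the rule is strict: any measurable drop sends the stage back to the baseline, which matches the treatment of the integrity check above.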

The first step can be tiny. Pick the single highest-frequency, lightest stage and add only output_config={"effort": "low"} there. Watch that stage's cost and latency move without touching the model or the prompt — and from there, the staged allocation that fits your pipeline starts to take shape.

I hope this helps anyone wrestling with the cost of their own automation.
