●MCP — Enterprise-managed MCP connectors arrive: admins provision once, users get zero-touch access on first login (Okta, Team/Enterprise beta)●LEGAL — 20+ legal MCP connectors and 12 practice-area plugins ship for research, contracts, and matter management●AGENTS — Code w/ Claude unveils Managed Agents: plan the work, fan out to hundreds of subagents, verify before returning●LIMIT — The 5-hour Claude Code rate window is doubled for Pro, Max, Team, and seat-based Enterprise●BILLING — The June 15 Agent SDK credit split was paused; this usage stays within your subscription limits●FIX — Claude Code stability fixes continue: stuck spinners, subagent transcripts, and remote task status●MCP — Enterprise-managed MCP connectors arrive: admins provision once, users get zero-touch access on first login (Okta, Team/Enterprise beta)●LEGAL — 20+ legal MCP connectors and 12 practice-area plugins ship for research, contracts, and matter management●AGENTS — Code w/ Claude unveils Managed Agents: plan the work, fan out to hundreds of subagents, verify before returning●LIMIT — The 5-hour Claude Code rate window is doubled for Pro, Max, Team, and seat-based Enterprise●BILLING — The June 15 Agent SDK credit split was paused; this usage stays within your subscription limits●FIX — Claude Code stability fixes continue: stuck spinners, subagent transcripts, and remote task status
Routing the effort Parameter Per Stage to Balance Claude's Output Cost and Latency
Claude's effort parameter governs all output tokens — thinking, prose, and tool calls. This guide replaces a blanket high setting with per-stage tiers and a dynamic router, grounded in measurements from a solo developer's automation pipeline.
Running four blogs on autopilot as an indie developer means calling the Claude API dozens of times a day. One evening I was looking at a cost breakdown and something caught my eye. The call that merely "decides which category an article belongs to" and the call that "reviews a full draft and surfaces contradictions" were running at roughly the same token economics. The former should finish in a blink; the latter deserves careful thought. Yet I was sending both at the default — the setting that thinks as hard as possible.
The key to stopping this "maximum effort for everything" habit is the effort parameter. Without switching models, you can dial how eagerly a single call spends tokens. This article first pins down what effort actually controls, then builds a per-stage routing scheme and a dynamic router that picks a level based on the input — with honest measurements from my own indie-developer automation along the way.
effort controls more than "thinking"
The most common misconception is that effort is a switch for extended-thinking depth. That's half right, and the missing half matters.
The official definition is broader: effort affects every token in the response. Concretely, three kinds:
Target
Behavior at lower effort
Prose and explanations
Skips preamble, answers concisely
Tool calls and arguments
Fewer calls, combines operations
Extended thinking (when enabled)
May skip thinking on simple problems
What makes this important is that effort works even on requests where thinking isn't enabled. In a tool-heavy agent, lowering effort shifts behavior from "explain the plan at length, then act" toward "act quietly and report briefly." Treat effort as a knob on token spend itself — independent of whether thinking is on — and your design stays coherent.
Five levels, with high as the default
There are five levels. high is the default and behaves exactly the same as omitting effort entirely.
Level
Character
Fitting stage
max
Maximum capability, no token constraints
Hardest problems needing deepest reasoning
xhigh
Extended capability for long-horizon work
Coding/agent tasks over 30 minutes
high (default)
High capability; same as unset
Complex reasoning, hard implementation
medium
Balance of speed, cost, quality
Balanced agentic work
low
Most efficient; slight capability drop
Classification, quick lookups, high volume
Note that effort is a behavioral signal, not a strict token cap. Even at low, the model still thinks on genuinely hard problems — it just thinks less than it would at a higher level for the same problem. max and xhigh are available on a limited set of models, so confirm yours supports them first (Opus 4.8 supports both).
✦
Thank you for reading this far.
Continue Reading
What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.
WHAT YOU'LL LEARN
✦Why effort controls the entire output token budget — not just thinking, but prose and tool calls — and how that reframes your design
✦A working router that assigns low / medium / high / xhigh per stage (classify, draft, review), with a Before/After
✦How effort relates to Opus 4.8 adaptive thinking, the budget_tokens 400 pitfall, and a measure-before-you-lower workflow
Secure payment via Stripe · Cancel anytime
✦
Unlock This Article
Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.
Before / After: drop the blanket high, route per stage
Here is how I used to call the API. Every stage went through the same helper.
import anthropicclient = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from the environmentdef call_claude(prompt: str) -> str: # every stage flows through this = everything at the default high resp = client.messages.create( model="claude-opus-4-8", max_tokens=4096, messages=[{"role": "user", "content": prompt}], ) return resp.content[0].text
The flaw isn't visible by reading the code. It only shows up in the bill and the latency: even a lightweight category decision ran at the think-hard default every time.
Here is the version that lets each stage pass its own effort. The change is a single extra argument, and the model stays the same.
output_config={"effort": ...} is the official way to pass the parameter. No beta header is required; supported models accept it directly. Because effort="high" is synonymous with leaving it unset, you can migrate safely — lowering only the stages you choose without changing the behavior of the rest.
How to classify stages and assign effort
Designing the routing starts by ordering stages by how much quality matters. In my pipeline it settled roughly like this:
Stage
Nature
Assignment
Category / level decision
Bounded classification
low
Tag extraction / summary
Near-formulaic transform
low–medium
Body draft
Volume, but shallow exploration
medium
Fact / internal-link integrity check
Misses are fatal
high
Full restructure / contradiction hunt
Needs deep reasoning
high–xhigh
The axis is simple: "if I drop quality one notch here, how much rework or reader harm follows?" Misclassification is cheap to redo, so it goes to low. A miss in the pre-publish integrity check, by contrast, undermines the article's trustworthiness, so I keep quality over cost and place it at high. Since Opus 4.7, the model scopes its work more tightly to what was asked at lower effort, so casually lowering a complex stage yields shallow output. The rule is: lower it only after you measure.
On Opus 4.8, combine with adaptive thinking
If you use Opus 4.8, note that the way you control thinking has changed. Opus 4.8 uses adaptive thinking — the model decides how much to think on its own — and does not accept a manual budget_tokens.
Thinking only turns on once you pass thinking={"type": "adaptive"}; without it the call runs with no thinking. From there, high, xhigh, and max almost always think deeply, while lower levels skip thinking on simple problems. A frequent migration stumble is writing as if budget_tokens still applies and discovering the 400 the hard way.
How it lands in tool-using agents
effort moves both the number of tool calls and how talkative they are. Since that ties directly to agent cost, this is where routing pays off most visibly.
At lower effort, the model combines operations into fewer calls, acts without preamble, and reports tersely. At higher effort, it explains the plan, calls tools more granularly, and summarizes changes carefully. A routine agent task like "read the logs and identify one root cause" is often well served by medium; pushing it to xhigh can add exploration that only inflates cost. Conversely, exploratory work that crawls an unfamiliar codebase suits xhigh. When you run at xhigh or max, give max_tokens plenty of room (start around 64k) so the longer thinking and tool calls don't hit a ceiling mid-task.
A dynamic router that picks a level from the input
If stages are fixed, constants suffice. But when difficulty varies within the same stage, a cheap front step that chooses effort cuts waste. I slot in a thin router like this:
from dataclasses import dataclass@dataclassclass Task: kind: str # "classify" | "draft" | "review" | "redesign" input_len: int # rough input length high_stakes: bool # does failure ripple downstream?def pick_effort(task: Task) -> str: # 1) base level from the stage's nature base = { "classify": "low", "draft": "medium", "review": "high", "redesign": "xhigh", }.get(task.kind, "high") # 2) bump high-stakes stages up one notch if task.high_stakes and base in ("low", "medium"): base = "medium" if base == "low" else "high" # 3) for very short inputs, step a heavy stage down to test the water if task.input_len < 400 and base in ("high", "xhigh"): base = "high" if base == "xhigh" else "medium" return base# usageeffort = pick_effort(Task(kind="review", input_len=120, high_stakes=False))result = call_claude(review_prompt, effort=effort)
The point is that the router itself must not call Claude. Spending another API call to decide would eat the very cost you set out to save. Decide the base from cheap signals — length, stage name — and move it at most one notch. This "decide cheaply, adjust only when confident" discipline is what holds up in production.
Measure before you lower
Base the decision to lower effort on your own data, not impressions. Run the same prompt set across effort levels and look side by side at output tokens, wall-clock time, and quality (pass rate against your own bar).
Honestly, in my small pipeline: when I dropped lightweight stages like classification and tag extraction from high to low, their output tokens visibly fell and latency shortened. But when I dropped the integrity check to medium, misses ticked up slightly, so I returned it to high. The figures depend heavily on my topics, prompts, and article style, so I won't generalize them into "X percent saved." What matters is having a procedure to verify, per stage, whether lowering was worth it on your own bar.
The first step can be tiny. Pick the single highest-frequency, lightest stage and add only output_config={"effort": "low"} there. Watch that stage's cost and latency move without touching the model or the prompt — and from there, the staged allocation that fits your pipeline starts to take shape.
I hope this helps anyone wrestling with the cost of their own automation.
Share
Thank You for Reading
Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.