CLAUDE LABJP
MCP — Enterprise-managed MCP connectors arrive: admins provision once, users get zero-touch access on first login (Okta, Team/Enterprise beta)LEGAL — 20+ legal MCP connectors and 12 practice-area plugins ship for research, contracts, and matter managementAGENTS — Code w/ Claude unveils Managed Agents: plan the work, fan out to hundreds of subagents, verify before returningLIMIT — The 5-hour Claude Code rate window is doubled for Pro, Max, Team, and seat-based EnterpriseBILLING — The June 15 Agent SDK credit split was paused; this usage stays within your subscription limitsFIX — Claude Code stability fixes continue: stuck spinners, subagent transcripts, and remote task statusMCP — Enterprise-managed MCP connectors arrive: admins provision once, users get zero-touch access on first login (Okta, Team/Enterprise beta)LEGAL — 20+ legal MCP connectors and 12 practice-area plugins ship for research, contracts, and matter managementAGENTS — Code w/ Claude unveils Managed Agents: plan the work, fan out to hundreds of subagents, verify before returningLIMIT — The 5-hour Claude Code rate window is doubled for Pro, Max, Team, and seat-based EnterpriseBILLING — The June 15 Agent SDK credit split was paused; this usage stays within your subscription limitsFIX — Claude Code stability fixes continue: stuck spinners, subagent transcripts, and remote task status
Articles/API & SDK
API & SDK/2026-06-20Advanced

Routing the effort Parameter Per Stage to Balance Claude's Output Cost and Latency

Claude's effort parameter governs all output tokens — thinking, prose, and tool calls. This guide replaces a blanket high setting with per-stage tiers and a dynamic router, grounded in measurements from a solo developer's automation pipeline.

Claude API80effortCost optimizationOpus 4.8Agent2

Premium Article

Running four blogs on autopilot as an indie developer means calling the Claude API dozens of times a day. One evening I was looking at a cost breakdown and something caught my eye. The call that merely "decides which category an article belongs to" and the call that "reviews a full draft and surfaces contradictions" were running at roughly the same token economics. The former should finish in a blink; the latter deserves careful thought. Yet I was sending both at the default — the setting that thinks as hard as possible.

The key to stopping this "maximum effort for everything" habit is the effort parameter. Without switching models, you can dial how eagerly a single call spends tokens. This article first pins down what effort actually controls, then builds a per-stage routing scheme and a dynamic router that picks a level based on the input — with honest measurements from my own indie-developer automation along the way.

effort controls more than "thinking"

The most common misconception is that effort is a switch for extended-thinking depth. That's half right, and the missing half matters.

The official definition is broader: effort affects every token in the response. Concretely, three kinds:

TargetBehavior at lower effort
Prose and explanationsSkips preamble, answers concisely
Tool calls and argumentsFewer calls, combines operations
Extended thinking (when enabled)May skip thinking on simple problems

What makes this important is that effort works even on requests where thinking isn't enabled. In a tool-heavy agent, lowering effort shifts behavior from "explain the plan at length, then act" toward "act quietly and report briefly." Treat effort as a knob on token spend itself — independent of whether thinking is on — and your design stays coherent.

Five levels, with high as the default

There are five levels. high is the default and behaves exactly the same as omitting effort entirely.

LevelCharacterFitting stage
maxMaximum capability, no token constraintsHardest problems needing deepest reasoning
xhighExtended capability for long-horizon workCoding/agent tasks over 30 minutes
high (default)High capability; same as unsetComplex reasoning, hard implementation
mediumBalance of speed, cost, qualityBalanced agentic work
lowMost efficient; slight capability dropClassification, quick lookups, high volume

Note that effort is a behavioral signal, not a strict token cap. Even at low, the model still thinks on genuinely hard problems — it just thinks less than it would at a higher level for the same problem. max and xhigh are available on a limited set of models, so confirm yours supports them first (Opus 4.8 supports both).

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN
Why effort controls the entire output token budget — not just thinking, but prose and tool calls — and how that reframes your design
A working router that assigns low / medium / high / xhigh per stage (classify, draft, review), with a Before/After
How effort relates to Opus 4.8 adaptive thinking, the budget_tokens 400 pitfall, and a measure-before-you-lower workflow
Secure payment via Stripe · Cancel anytime

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

or
Unlock all articles with Membership →
Share

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.

  • Copy-paste ready implementation code
  • New advanced guides published daily
  • $5/mo or $10 for lifetime access
View Membership →

Related Articles

API & SDK2026-05-06
Building an Autonomous Research Agent with Claude API: Web Search, Summarization, and Knowledge Management
A complete guide to designing and implementing an autonomous research agent using Claude API and web search tools. Covers budget control, quality assurance, and knowledge base storage for production use.
API & SDK2026-06-20
Running Subagents in Parallel Without One Failure Sinking the Whole Run
A fan-out / fan-in design for running several subagents in parallel, covering token budgeting, a result contract, and partial-failure handling. Includes an implementation where one branch can fail without stopping the rest, plus measured numbers.
API & SDK2026-06-20
Putting Cloudflare AI Gateway in Front of Claude Made the Numbers I Needed Disappear — Field Notes on Instrumentation
After putting Cloudflare AI Gateway in front of Claude API, here is where I actually got stung — cost attribution, semantic-cache false hits, fallback quietly lowering quality, and budgets that don't really stop anything — with the code I used to fix each.
📚RECOMMENDED BOOKS
Build a Large Language Model (From Scratch)
Sebastian Raschka
LLM Dev
Prompt Engineering for LLMs
Berryman & Ziegler
Prompting
AI Engineering
Chip Huyen
AI Eng
* Contains affiliate links
See all →