CLAUDE LAB
Claude Code / 2026-06-28 / Advanced

When You Fan Out Streaming Sessions, Your Laptop's CPU Gives Out First — An Adaptive Throttle That Caps Concurrency by Measured Load

Even with lighter streaming, fanning out many sessions on one machine saturates the host CPU before anything else. Here is why a fixed semaphore fails, plus a working adaptive gate that raises and lowers concurrency from measured CPU.

Tags: Claude Code, Concurrency, Streaming, Performance, Unattended Automation


A recent Claude Code update cut CPU usage during streaming by roughly 37%. For anything that runs for hours, that is a welcome lift. But after running several sites' publishing jobs in parallel on a single machine, I keep noticing that this kind of improvement only reduces the per-session cost. It does nothing about the separate question of what gives out first when you stack many sessions at once.

On my own setup, the moment I ran each site's generation concurrently, the fan spun up and p95 latency climbed to levels I had never seen with serial runs. Memory was nowhere near full. This article works through that bottleneck, where the limiting resource turns out to be host CPU, and replaces a guessed, fixed concurrency number with a gate that throttles on measured CPU, using working code and numbers from my own runs.

Why CPU Gives Out Before Memory

A streaming response is a steady loop of receiving server-sent events one at a time, parsing incremental JSON, and stitching text together. For one session this is trivial. Run the same loop across ten or twenty sessions and the event loop faces a relentless pile of small parse-and-callback work, at which point CPU becomes the limiting factor.

What matters here is that each session is a busy coroutine, not a mostly-sleeping one. When the time spent handling arriving chunks outweighs the time spent waiting on the network, the usual I/O-concurrency intuition ("lots of waiting, so stack many") breaks down. Memory grows roughly linearly with session count and is easy to predict, while CPU hits a cliff once it saturates. That is exactly why you need a layer that decides concurrency by watching CPU, separate from any memory watchdog.
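The cliff can be made concrete with a back-of-the-envelope model. The sketch below is mine, not a measurement: it assumes all sessions share one event-loop thread, that each session delivers one chunk per fixed interval, and that each chunk costs a fixed amount of CPU to handle. The function names and the 5 ms / 40 ms figures are hypothetical.

```python
def loop_utilization(sessions: int, handle_ms: float, interval_ms: float) -> float:
    """Fraction of one event-loop thread spent handling chunks.

    Assumes every session delivers one chunk per `interval_ms` and each chunk
    costs `handle_ms` of CPU on the single thread that runs the event loop.
    A result above 1.0 means chunks arrive faster than they can be handled.
    """
    if sessions < 0 or handle_ms < 0 or interval_ms <= 0:
        raise ValueError("sessions and handle_ms must be >= 0, interval_ms > 0")
    return sessions * handle_ms / interval_ms


def saturation_point(handle_ms: float, interval_ms: float) -> int:
    """Largest session count that keeps utilization at or below 1.0."""
    if handle_ms <= 0 or interval_ms <= 0:
        raise ValueError("handle_ms and interval_ms must be > 0")
    return int(interval_ms // handle_ms)


if __name__ == "__main__":
    # Hypothetical numbers: 5 ms of parsing per chunk, one chunk every 40 ms.
    for n in (1, 4, 8, 12):
        print(n, "sessions ->", loop_utilization(n, handle_ms=5.0, interval_ms=40.0))
    print("saturates above", saturation_point(5.0, 40.0), "sessions")
```

Under this model utilization grows linearly until it reaches 1.0, and past that point every additional session only adds queueing delay. The handling cost differs from machine to machine, so the saturation point does too.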

Stop Guessing "How Many at Once"

Most batch jobs cap concurrency with a fixed semaphore like this:

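import asyncio

from anthropic import AsyncAnthropic

# A guessed constant. Comfortable on the dev machine, but...
sem = asyncio.Semaphore(12)

async def run_one(site, client: AsyncAnthropic):
    # build_prompt(site) is the article's own helper; it is not shown in this excerpt.
    async with sem:
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            messages=[{"role": "user", "content": build_prompt(site)}],
        ) as stream:
            # Drain the text stream; the per-chunk handling is where the CPU goes.
            async for _ in stream.text_stream:
                pass
            # Read the final message while the stream context is still open.
            return await stream.get_final_message()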

The trouble is that 12 is tuned for one specific machine at one specific moment. On the faster Mac I develop on, twelve sessions were fine. The instant I moved the same script to a scheduled run on an older mini PC, CPU pinned at around 96% and each stream took about 2.4x longer than it did alone. On a large, otherwise idle machine, the same twelve leaves capacity unused. A constant fits neither the fast side nor the slow side.
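One way to replace the constant is to let the limit follow a smoothed CPU reading. The following is a minimal sketch under my own assumptions, not the article's implementation: the class name `CpuLimitController`, the 85%/60% thresholds, and the smoothing factor of 0.3 are all illustrative. It only decides the limit; readings have to come from somewhere else, for example a periodic call to `psutil.cpu_percent(interval=None)` if psutil is installed.

```python
class CpuLimitController:
    """Moves a concurrency limit one slot at a time from smoothed CPU readings."""

    def __init__(self, start=4, min_limit=1, max_limit=16,
                 high=85.0, low=60.0, alpha=0.3):
        if not (1 <= min_limit <= start <= max_limit):
            raise ValueError("need 1 <= min_limit <= start <= max_limit")
        if not (0 < alpha <= 1):
            raise ValueError("alpha must be in (0, 1]")
        if low >= high:
            raise ValueError("low must be below high")
        self.limit = start
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.high = high      # shed a slot above this smoothed CPU percentage
        self.low = low        # add a slot below this smoothed CPU percentage
        self.alpha = alpha    # weight given to the newest sample
        self.ewma = None      # no reading yet

    def observe(self, cpu_percent: float) -> int:
        """Fold one CPU sample into the EWMA and return the updated limit."""
        if self.ewma is None:
            self.ewma = cpu_percent  # the first sample seeds the average
        else:
            self.ewma = self.alpha * cpu_percent + (1 - self.alpha) * self.ewma
        if self.ewma > self.high and self.limit > self.min_limit:
            self.limit -= 1
        elif self.ewma < self.low and self.limit < self.max_limit:
            self.limit += 1
        # Between low and high the limit is left alone to avoid oscillation.
        return self.limit


if __name__ == "__main__":
    ctl = CpuLimitController(start=4, max_limit=6)
    for sample in (95.0, 95.0, 40.0, 40.0, 40.0):
        print(sample, "->", ctl.observe(sample))
```

Moving by a single slot per sample keeps the reaction slow on purpose: a one-second CPU spike should not halve throughput, and the gap between the two thresholds gives the limit a range where it simply holds.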


WHAT YOU'LL LEARN
Why a fixed semaphore is silently tuned for your fastest machine and saturates the CPU on a weaker host
A working adaptive gate that samples host CPU with an EWMA and adjusts the concurrency limit one slot at a time
Backpressure that pauses new work while guaranteeing at least one in-flight session, and how to share one script across machines
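The backpressure idea in the last point can be sketched with an `asyncio.Condition`. This is an illustration under my own assumptions rather than the article's code: `ResizableGate` and `demo` are hypothetical names. New work waits while the in-flight count is at the limit, sessions that are already running are never cancelled, and the limit is clamped to 1 so the pipeline always makes progress.

```python
import asyncio


class ResizableGate:
    """Admission gate whose limit can change while sessions are in flight.

    Create it inside a running event loop so the Condition binds to that loop.
    """

    def __init__(self, limit: int):
        self._limit = max(1, limit)   # never allow fewer than one slot
        self._in_flight = 0
        self._cond = asyncio.Condition()

    @property
    def in_flight(self) -> int:
        return self._in_flight

    async def set_limit(self, limit: int) -> None:
        async with self._cond:
            self._limit = max(1, limit)
            self._cond.notify_all()   # a raised limit may unblock waiters

    async def __aenter__(self):
        async with self._cond:
            # Backpressure: park here until a slot is free under the current limit.
            await self._cond.wait_for(lambda: self._in_flight < self._limit)
            self._in_flight += 1
        return self

    async def __aexit__(self, exc_type, exc, tb):
        async with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()


async def demo(requested_limit: int, jobs: int = 5) -> int:
    """Run `jobs` short tasks through the gate and return the peak in-flight count."""
    gate = ResizableGate(limit=3)
    await gate.set_limit(requested_limit)
    peak = 0

    async def job():
        nonlocal peak
        async with gate:
            peak = max(peak, gate.in_flight)
            await asyncio.sleep(0.01)   # stands in for one streaming session

    await asyncio.gather(*(job() for _ in range(jobs)))
    return peak


if __name__ == "__main__":
    print("peak with requested limit 0:", asyncio.run(demo(0)))
    print("peak with requested limit 2:", asyncio.run(demo(2)))
```

Because the gate only reads its limit at admission time, lowering the limit drains concurrency gradually as running sessions finish. A controller on any machine can call `set_limit` with whatever its own CPU readings suggest, which is what lets one script run unchanged across hosts.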
