When You Fan Out Streaming Sessions, Your Laptop's CPU Gives Out First — An Adaptive Throttle That Caps Concurrency by Measured Load

Even with lighter streaming, fanning out many sessions on one machine saturates the host CPU before anything else. Here is why a fixed semaphore fails, plus a working adaptive gate that raises and lowers concurrency from measured CPU.


A recent Claude Code update cut CPU usage during streaming by roughly 37%. For anything that runs for hours, that is a welcome improvement. But while running several sites' publishing jobs in parallel on a single machine, I keep noticing that this kind of improvement only lowers the per-session cost. It does not answer the separate question of what gives out first when you stack many sessions at once.

On my own setup, the moment I ran each site's generation concurrently, the fan started spinning and a latency I had never seen with serial runs showed up at p95. Memory was nowhere near full. This article tackles that bottleneck, the case where the limiting resource turns out to be host CPU, by replacing a guessed, fixed concurrency number with a gate that throttles on measured CPU. It includes working code and numbers from my own runs.

Why CPU Gives Out Before Memory

A streaming response is a steady loop of receiving server-sent events one at a time, parsing incremental JSON, and stitching text together. For one session this is trivial. Run the same loop across ten or twenty sessions, and the event loop sees a relentless pile of small parse-and-callback work, and CPU becomes the limiting factor.

What matters here is that each session is a busy coroutine, not a mostly-sleeping one. When the time spent handling arriving chunks outweighs the time spent waiting on the network, the usual I/O-concurrency intuition ("lots of waiting, so stack many") breaks down. Memory grows roughly linearly with session count and is easy to predict, while CPU hits a cliff once it saturates. That is exactly why you need a layer that decides concurrency by watching CPU, separate from any memory watchdog.
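A back-of-envelope way to see where that cliff sits, offered as a sketch rather than a measurement: if each session spends some share of its wall time handling chunks on the event loop, a single event-loop thread runs out of headroom at roughly the reciprocal of that share. The function name and the millisecond figures below are illustrative assumptions, not numbers from the runs in this article.

```python
def sessions_until_saturation(handle_ms: float, wait_ms: float) -> float:
    """Rough number of sessions one event-loop thread can carry.

    handle_ms: CPU time spent parsing and dispatching one chunk (assumed).
    wait_ms:   idle time waiting on the network for the next chunk (assumed).
    """
    # Each session keeps the loop busy for handle_ms out of every
    # (handle_ms + wait_ms), so the loop is full at the reciprocal of that share.
    return (handle_ms + wait_ms) / handle_ms

# Mostly-sleeping session: 1 ms of work per 99 ms of waiting.
print(sessions_until_saturation(1.0, 99.0))   # 100.0
# Busy session: 10 ms of work per 40 ms of waiting.
print(sessions_until_saturation(10.0, 40.0))  # 5.0
```

The point of the model is the shape, not the exact figure: as the handling share grows, the number of sessions you can stack falls quickly, which is why "lots of waiting, so stack many" stops being a safe assumption.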

Stop Guessing "How Many at Once"

Most batch jobs cap concurrency with a fixed semaphore like this:

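import asyncio

# A guessed constant. Comfortable on the dev machine, but...
sem = asyncio.Semaphore(12)

async def run_one(site, client):
    # build_prompt(site) is assumed to be defined elsewhere in the batch script.
    async with sem:
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            messages=[{"role": "user", "content": build_prompt(site)}],
        ) as stream:
            # Drain the text deltas; this per-chunk work is what costs CPU.
            async for _ in stream.text_stream:
                pass
            # Read the final message while the stream context is still open.
            return await stream.get_final_message()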

The trouble is that 12 is optimized for one specific machine at one specific moment. On the faster Mac I develop on, twelve sessions were fine. The instant I moved the same script to a scheduled run on an older mini PC, CPU pinned at around 96% and each stream took about 2.4x longer than it did alone. Conversely, on a large, otherwise idle machine, twelve leaves capacity unused. A constant fits neither the fast side nor the slow side.

Measure and Smooth the Host CPU

The first step, before touching concurrency, is to observe host CPU continuously. Instantaneous readings jump around, so smooth them with an exponential moving average (EWMA).

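import psutil

class CpuSampler:
    """Return host CPU usage smoothed with an EWMA."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.value = None  # no real reading yet
        # With interval=None the first call returns a meaningless 0.0; discard it.
        psutil.cpu_percent(interval=None)

    def sample(self) -> float:
        # Non-blocking: reports CPU usage since the previous call.
        raw = psutil.cpu_percent(interval=None)
        if self.value is None:
            # Seed with the first real reading so the average does not start near 0.
            self.value = raw
        else:
            self.value = self.alpha * raw + (1 - self.alpha) * self.value
        return self.value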

A larger alpha reacts faster but follows noise; a smaller one is steadier but lags. Around 0.3 was my sweet spot: it ignores momentary spikes yet catches a real trend shift within a few seconds.
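To make that trade-off concrete, here is a small worked sketch that applies the same update rule to hand-picked readings. The helper names `ewma` and `samples_to_cover`, the `start` parameter, and the sample values are assumptions for illustration only.

```python
def ewma(samples, alpha, start):
    """Apply the EWMA update rule to a list of raw readings."""
    value = start
    out = []
    for raw in samples:
        value = alpha * raw + (1 - alpha) * value
        out.append(value)
    return out

def samples_to_cover(alpha, fraction):
    """Count updates until the EWMA has closed `fraction` of a step change."""
    remaining = 1.0  # share of the gap still unclosed
    n = 0
    while remaining > 1.0 - fraction:
        remaining *= 1 - alpha
        n += 1
    return n

# One 100% spike on a 40% baseline moves the smoothed value only to 58%.
print(round(ewma([100.0], alpha=0.3, start=40.0)[0], 1))  # 58.0
# A sustained shift is half absorbed after 2 samples, 90% after 7.
print(samples_to_cover(0.3, 0.5), samples_to_cover(0.3, 0.9))  # 2 7
```

With a 2-second sampling period, two samples means a sustained change is half registered after about 4 seconds, while a single noisy reading is mostly ignored.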

The Adaptive Gate — Move One Slot at a Time

Once you have measured CPU, adjust the number of concurrent sessions toward a target (say 70%). Drop one slot when overloaded, add one when there is headroom. Move gently — large jumps oscillate.

Because asyncio.Semaphore is awkward to resize on the fly, this small gate tracks the current allowance and in-flight count with a Condition.

import asyncio
from contextlib import asynccontextmanager

class AdaptiveCpuGate:
    """A gate that raises and lowers concurrency from measured host CPU."""

    def __init__(self, target_cpu=70.0, min_slots=1, max_slots=16, alpha=0.3):
        if not 1 <= min_slots <= max_slots:
            raise ValueError("need 1 <= min_slots <= max_slots")
        self.target_cpu = target_cpu
        self.min_slots = min_slots
        self.max_slots = max_slots
        self._slots = min_slots          # currently allowed concurrency
        self._in_flight = 0              # sessions running now
        self._sampler = CpuSampler(alpha)
        self._cond = asyncio.Condition()

    async def governor(self, stop: asyncio.Event, period: float = 2.0):
        """A resident loop that revises the limit periodically."""
        while not stop.is_set():
            cpu = self._sampler.sample()
            async with self._cond:
                if cpu > self.target_cpu:
                    # over target: shed one slot, never below min_slots
                    self._slots = max(self.min_slots, self._slots - 1)
                elif cpu < self.target_cpu * 0.8:
                    # clear headroom: add one slot, never above max_slots
                    self._slots = min(self.max_slots, self._slots + 1)
                self._cond.notify_all()   # wake waiters: a slot may have opened
            await asyncio.sleep(period)

    @asynccontextmanager
    async def slot(self):
        async with self._cond:
            # wait until in-flight drops below the allowance (backpressure)
            await self._cond.wait_for(lambda: self._in_flight < self._slots)
            self._in_flight += 1
        try:
            yield
        finally:
            async with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

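Three things matter. First, the adjustment is asymmetric: drop one slot the moment smoothed CPU exceeds the target, but add one only once it falls below 80% of the target. Between those two thresholds the limit holds steady, and that dead band is what keeps the gate from flapping. Fast to shrink, cautious to grow: that keeps CPU off the ceiling. Second, wait_for naturally holds new sessions until a slot frees up, which is backpressure for free. Lowering the limit does not cancel running sessions; it only stops new ones from starting until the in-flight count drops below the new allowance. Third, as long as min_slots is at least one (the default), work never stops entirely, no matter how strained CPU gets.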
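The adjustment rule is easy to reason about when pulled out as a pure function and replayed against a list of readings. The sketch below does that; `next_slots`, its `grow_ratio` parameter, and the CPU sequence are hypothetical names and values chosen for illustration, mirroring the thresholds used in the gate.

```python
def next_slots(cpu, slots, target_cpu=70.0, min_slots=1, max_slots=16, grow_ratio=0.8):
    """Return the next concurrency limit for one governor tick."""
    if cpu > target_cpu:
        return max(min_slots, slots - 1)   # over target: shed one slot
    if cpu < target_cpu * grow_ratio:
        return min(max_slots, slots + 1)   # clear headroom: add one slot
    return slots                           # dead band: hold steady

# Replay a hypothetical sequence of smoothed CPU readings, starting at 6 slots.
slots = 6
history = []
for cpu in [92.0, 85.0, 74.0, 66.0, 60.0, 50.0, 48.0]:
    slots = next_slots(cpu, slots)
    history.append(slots)
print(history)  # [5, 4, 3, 3, 3, 4, 5]
```

Readings of 66% and 60% sit inside the dead band, so the limit stays at 3 until CPU falls clearly below 56%. Keeping the rule free of I/O like this also makes it straightforward to unit test without touching psutil.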

Throttle and Keep Going, Don't Halt

The danger in a naive overload guard is that detecting overload tends to slide into halting everything. For unattended automation, running thin but steady gives a higher batch completion rate than stopping. Because AdaptiveCpuGate never drops below min_slots, and min_slots is at least one, the worst case degrades to serial execution rather than dropping to zero.

The caller starts the gate alongside its resident governor:

async def run_one(site, gate, client):
    # Waits here while the gate is at its current limit (backpressure).
    async with gate.slot():
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            messages=[{"role": "user", "content": build_prompt(site)}],
        ) as stream:
            async for _ in stream.text_stream:
                pass
            # Read the final message while the stream context is still open.
            return await stream.get_final_message()

async def main(sites, client):
    # min_slots=2 keeps two sessions running even under sustained overload.
    gate = AdaptiveCpuGate(target_cpu=70.0, min_slots=2, max_slots=12)
    stop = asyncio.Event()
    gov = asyncio.create_task(gate.governor(stop))
    try:
        await asyncio.gather(*(run_one(s, gate, client) for s in sites))
    finally:
        # Signal the governor to exit, then wait for its last sleep to finish.
        stop.set()
        await gov

Here client is an async Anthropic client. Pass the API key through an environment variable (ANTHROPIC_API_KEY); never hardcode it.
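One way to enforce that before any session starts is a small fail-fast check. This is a sketch: `require_api_key` is a hypothetical helper, not part of any SDK. To my knowledge the Anthropic Python SDK's async client reads ANTHROPIC_API_KEY on its own when no key is passed, so the check only exists to surface a missing variable early in an unattended run.

```python
import os

def require_api_key(env=os.environ, name="ANTHROPIC_API_KEY"):
    """Return the key from the environment, or fail before any session starts."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the batch")
    return key

# Example with an explicit mapping instead of the real environment.
print(require_api_key({"ANTHROPIC_API_KEY": "placeholder-value"}))  # placeholder-value
```

Calling it at the top of the batch script means a misconfigured scheduled job stops immediately instead of failing once per site.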

Before / After on My Own Box

Here is one comparison from my setup — a batch that runs article generation for several sites at once — using an older mini PC as the runner. These numbers are from my specific environment and will not be identical on every machine; read them as a directional guide.

Metric	Fixed semaphore (12)	Adaptive gate (target 70%)
Steady-state CPU	Pinned around 96%	Stable at ~70 ± 8%
Settled effective concurrency	12 (fixed)	Converged to 6–7
Per-session wall time (p95)	~2.4x vs alone	~1.3x vs alone
Total batch time	Baseline	~12% longer
Thermals (fan/heat)	Thermal throttling	None

The striking part: even after the gate cut concurrency nearly in half, total batch time grew only 12%. With CPU saturated, adding more sessions only made each one slower, so the extra concurrency bought almost no throughput. Pacing to a target CPU beat cramming sessions in: it held down heat and latency at the cost of a modest increase in finish time.
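A rough piece of arithmetic shows why the finish time barely moved. If throughput is taken as concurrency divided by the per-session slowdown, the two columns of the table land close together. This is a simplification under stated assumptions: it uses the p95 slowdown as a stand-in for the average, and `effective_throughput` is a name introduced here for illustration.

```python
def effective_throughput(concurrency: float, slowdown: float) -> float:
    """Sessions' worth of work finished per unit of solo-session time."""
    return concurrency / slowdown

fixed = effective_throughput(12, 2.4)          # fixed semaphore, 2.4x slower each
adaptive_low = effective_throughput(6, 1.3)    # gate settled at 6 slots
adaptive_high = effective_throughput(7, 1.3)   # gate settled at 7 slots
print(round(fixed, 1), round(adaptive_low, 1), round(adaptive_high, 1))  # 5.0 4.6 5.4
```

Twelve sessions at 2.4x each deliver about the same work per unit time as six or seven sessions at 1.3x, which is in line with a batch that finishes only slightly later.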

Sharing One Script Across Machines

As an indie developer running this across several machines, the reason I rely on it is that the same script lets each machine pick its own concurrency to fit its own capacity. With a fixed constant, I had to revisit the number per machine; sharing a single policy — a target CPU percentage — lets each host drift toward its own optimum from live measurement. It is the kind of small operational lever I keep reaching for across the Dolice Labs sites.

A few field notes for putting this into production:

Aim low. Target 70% and real peaks still swing to about 80%. On a machine shared with other scheduled jobs, starting around 60% errs on the safe side.
Reconcile max_slots with your API rate limits. Spare CPU is useless if the server returns 429, so keep the ceiling consistent with your rate design.
Governor period of 2–5 seconds. Too short oscillates; too long delays recovery from overload.
Log the observations. Recording effective concurrency and CPU per line lets you later explain "why was that batch slow?" with a CPU plot.
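For the logging note, a minimal sketch of one line per governor tick might look like the following. The function name, the field names, and the key=value layout are assumptions; any format that records a timestamp, smoothed CPU, and effective concurrency together would serve the same purpose.

```python
import time

def format_gate_log(cpu, slots, in_flight, now=None):
    """Build one key=value log line per governor tick."""
    ts = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts))
    return f"{stamp} cpu_ewma={cpu:.1f} slots={slots} in_flight={in_flight}"

# Fixed timestamp so the example output is reproducible.
print(format_gate_log(71.46, 6, 6, now=0))
# 1970-01-01T00:00:00Z cpu_ewma=71.5 slots=6 in_flight=6
```

Emitting this from the governor right after it adjusts the limit gives you a time series you can plot next to batch duration when a run turns out slow.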

Precisely because streaming itself is lighter now, the ceiling on "how many you can stack on one box" is set by your local host before the server. Next time you write a parallel batch, replace the fixed concurrency number with a single target CPU. You quietly stop re-tuning a constant every time you switch machines.

Thanks for reading. If you also stack many jobs solo, I hope this gives you a foothold for your own operational design.
