How Many Concurrent Claude API Requests Can You Actually Hold? Sizing Production Infrastructure with Little's Law and Measured Memory
Concurrency, queue depth, and memory are numbers you can derive, not guess. A working method for sizing Claude API production deployments with Little's Law, a memory probe, and a 30-minute load check — learned the hard way from an OOM crash.
In the last week of June, I was consolidating the nightly batch jobs for my four sites — article generation, link audits, that sort of thing — into a single process. Impatient to finish the queue faster, I raised the Claude API concurrency from 8 to 24. The container promptly died. Not from a 429, not from a timeout — from memory. I had double-checked my rate-limit budget several times, yet I had never once measured how much of my own process a single streaming connection actually occupies.
When people plan the infrastructure requirements for a Claude deployment, the reflex is to start from server specs. But the model runs on Anthropic's side (or on Bedrock, Vertex AI, or Microsoft Foundry) — not on your infrastructure. What you are sizing is the resource that waits. As an indie developer running both unattended batch pipelines and a small public-facing service, I've settled on a procedure for deriving concurrency, queue depth, and memory from numbers rather than instinct. Here it is, end to end.
You Are Not Sizing the Model — You Are Sizing the Waiting
Strip a Claude-backed application down to its integration layer, and the resources it consumes in production reduce to four numbers.
| Resource | The number you set | What it derives from |
| --- | --- | --- |
| Concurrency | Maximum simultaneously open API connections | Arrival rate λ and mean stream duration W (Little's Law) |
| Rate-limit budget | Projected RPM / input TPM / output TPM | Effective arrival rate × average token counts |
| Memory | Measured RSS delta per connection × concurrency | Direct measurement (probe below) |
| Queue depth | How many accepted requests may wait before work starts | Tolerable wait time × effective arrival rate |
CPU almost never matters here. Streaming receipt is dominated by I/O waits; in my environment, 24 parallel streams kept CPU at roughly 15% of one core. If your deployment falls over, it will be rate limits, memory, or a queue you never bounded.
Required concurrency falls straight out of queueing theory's most forgiving formula.
Concurrency L = arrival rate λ (requests/second) × mean time in system W (seconds)
The crucial subtlety: W is not time-to-first-byte. It is the full lifetime of the stream, open to close. For long-form generation, TTFB may be one second while the stream stays open for forty. The connection is occupied the entire time.
Here are the numbers for two workloads I actually run.
| Workload | Arrival rate λ | Mean stream duration W | Required concurrency L | With 1.5× headroom |
| --- | --- | --- | --- | --- |
| Nightly batch (90 tasks in 30 minutes) | 0.05 /s | 42 s (long-form) | 2.1 | 4 |
| Chat UI (peak) | 2.5 /s | 12 s | 30 | 49 |
| Post-push-notification spike | 8 /s (3 min) | 12 s | 96 | 145 |
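The batch and chat rows can be reproduced with a few lines of arithmetic. The sketch below is an illustration rather than the article's own code: `requiredConcurrency`, `Workload`, and `ceilSafe` are hypothetical names, and the `retryRate` parameter reflects an inference that the chat row's 49 already folds in the 8% retry rate discussed later in the article (2.5 × 1.08 × 12 × 1.5 ≈ 48.6).

```typescript
// Little's Law sizing: L = λ × W, then scaled by a headroom factor.
// All names here are illustrative assumptions, not part of any SDK.
interface Workload {
  lambdaPerSec: number; // arrival rate λ, requests per second
  meanStreamSec: number; // W: full stream lifetime, open to close (not TTFB)
  retryRate?: number; // fraction of requests retried (assumed 0 if omitted)
  headroom?: number; // safety multiplier (assumed 1.5 if omitted)
}

// Round away floating-point noise before taking the ceiling.
const ceilSafe = (x: number): number => Math.ceil(Number(x.toFixed(6)));

function requiredConcurrency(w: Workload): { raw: number; sized: number } {
  const effectiveLambda = w.lambdaPerSec * (1 + (w.retryRate ?? 0));
  const raw = w.lambdaPerSec * w.meanStreamSec; // L without retries or headroom
  const sized = ceilSafe(effectiveLambda * w.meanStreamSec * (w.headroom ?? 1.5));
  return { raw, sized };
}

// Nightly batch: 90 tasks in 30 minutes, 42 s streams.
const batch = requiredConcurrency({ lambdaPerSec: 90 / (30 * 60), meanStreamSec: 42 });
// Chat UI at peak: 2.5 requests/s, 12 s streams, 8% retries (inferred).
const chat = requiredConcurrency({ lambdaPerSec: 2.5, meanStreamSec: 12, retryRate: 0.08 });

console.log(batch.sized, chat.sized); // 4 49
```

Keeping `raw` and `sized` separate makes it easy to see how much of the final number is demand and how much is margin.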
The batch answer — four connections suffice — surprised me. Before the consolidation work I assumed that ninety queued tasks justified high parallelism, cranked it to 24, and earned the OOM in the opening paragraph. When the arrival rate is low, extra parallelism barely shortens the wall clock; it only multiplies memory. The chat UI cuts the other way: modest by RPM standards, yet demanding 49 simultaneous connections. That divergence between RPM and concurrency is the subject of the next section.
Rate-Limit Budget and Concurrency Are Separate Axes — Check Both
Rate limits (RPM / ITPM / OTPM) and concurrency will each happily blindside you while you watch the other. Long streams occupy connections while barely denting RPM; short high-frequency calls return connections instantly while devouring RPM. The safe pattern is a single calculator that audits both at once.
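A minimal sketch of such a calculator follows. The names `sizeCapacity`, `peakFactor`, `concurrent`, and `queueDepth` are the ones this article uses elsewhere; every other field name is hypothetical. The token averages and the three limit values are placeholders chosen only so the example runs, so substitute the ceilings of your own tier. The 90-second tolerable wait is an inference from the queue depth of 243 used in the gate example later on (243 ÷ 2.7 requests/s).

```typescript
// Capacity calculator sketch: audits rate limits and concurrency together.
// Field names other than peakFactor/concurrent/queueDepth are assumptions.
interface CapacityInput {
  lambdaPerSec: number; // user-initiated arrival rate λ
  retryRate: number; // fraction of requests that are retried
  meanStreamSec: number; // W, full stream lifetime
  avgInputTokens: number; // placeholder average per request
  avgOutputTokens: number; // placeholder average per request
  peakFactor: number; // headroom multiplier
  tolerableWaitSec: number; // how long an accepted request may queue
  limits: { rpm: number; itpm: number; otpm: number }; // your tier's ceilings
}

// Round away floating-point noise before taking the ceiling.
const ceilSafe = (x: number): number => Math.ceil(Number(x.toFixed(6)));

function sizeCapacity(i: CapacityInput) {
  const effectiveLambda = i.lambdaPerSec * (1 + i.retryRate);
  const rpm = effectiveLambda * 60;
  return {
    concurrent: ceilSafe(effectiveLambda * i.meanStreamSec * i.peakFactor),
    queueDepth: ceilSafe(effectiveLambda * i.tolerableWaitSec),
    rpmUse: rpm / i.limits.rpm, // fraction of each ceiling consumed
    itpmUse: (rpm * i.avgInputTokens) / i.limits.itpm,
    otpmUse: (rpm * i.avgOutputTokens) / i.limits.otpm,
  };
}

const chat = sizeCapacity({
  lambdaPerSec: 2.5,
  retryRate: 0.08,
  meanStreamSec: 12,
  avgInputTokens: 800, // assumed
  avgOutputTokens: 400, // assumed
  peakFactor: 1.5,
  tolerableWaitSec: 90, // inferred, see lead-in
  limits: { rpm: 2500, itpm: 1_000_000, otpm: 200_000 }, // placeholders
});

const pct = (x: number): string => `${Math.round(x * 100)}%`;
console.log(`concurrent=${chat.concurrent} queueDepth=${chat.queueDepth}`);
console.log(`RPM ${pct(chat.rpmUse)} ITPM ${pct(chat.itpmUse)} OTPM ${pct(chat.otpmUse)}`);
```

Printing all four figures side by side is the point of the design: whichever one is closest to its ceiling is the one that will bite first.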
Note the shape of that result: RPM consumption sits at 6% of the ceiling while concurrency demands 49 open connections. Judge capacity from the rate-limit dashboard alone and the connection and memory estimates silently vanish from your plan. Flip the workload to long-prompt RAG and the opposite happens — concurrency is fine but ITPM exhausts first. Which ceiling you hit first is a property of your workload's shape, so print both, always.
One 2026 footnote: Claude Sonnet 5 became the default model in early July, with introductory pricing (2 dollars per million input tokens, 10 per million output, through August 31, 2026) making parallelism cheaper than ever to buy. Cheaper tokens tempt you to raise concurrency — but the rate-limit and connection ceilings survive every price cut. Keep them in the estimate.
Measure What One Stream Costs in Memory — OOM Arrives Before 429
This is where my June incident lived. Retain streaming chunks in an array — "I'll need the full text later" — and run 24 streams of long-form output in parallel, and RSS climbs far faster than intuition suggests. Rather than argue from theory, here is the small probe I now keep in the repository.
```typescript
// stream-memory-probe.ts: measure RSS delta for N concurrent streams
// Run: PROBE_STREAMS=24 PROBE_RETAIN=1 npx tsx stream-memory-probe.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const N = Number(process.env.PROBE_STREAMS ?? 8);
const RETAIN = process.env.PROBE_RETAIN === "1"; // 1: keep chunks (the bad pattern)
const retained: string[][] = [];

async function oneStream(): Promise<void> {
  const chunks: string[] = [];
  const stream = client.messages.stream({
    model: "claude-sonnet-5",
    max_tokens: 8000,
    // a prompt that elicits long output (my runs generated ~60k characters)
    messages: [{ role: "user", content: "YOUR_LONG_FORM_PROMPT" }],
  });
  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      if (RETAIN) chunks.push(event.delta.text); // bad: every chunk pinned on the heap
      else void event.delta.text.length; // good: process and drop the reference
    }
  }
  if (RETAIN) retained.push(chunks);
}

const before = process.memoryUsage().rss;
const timer = setInterval(() => {
  const mb = (process.memoryUsage().rss - before) / 1024 / 1024;
  console.log(`rss delta: ${mb.toFixed(1)} MB`);
}, 2000);

await Promise.all(Array.from({ length: N }, () => oneStream()));
clearInterval(timer);

const finalMb = (process.memoryUsage().rss - before) / 1024 / 1024;
console.log(`final delta: ${finalMb.toFixed(1)} MB / ${N} streams`);
// Expected output (example, RETAIN=1 / N=24): final delta: 166.3 MB / 24 streams
```
Results from my environment (Node 22, 512 MB container, ~60k characters of output per stream). Read the ratio, not the absolute numbers — retention policy moves the cost by an order of magnitude.
| Retention policy | Parallel N | Measured RSS delta | Per stream |
| --- | --- | --- | --- |
| Retain chunk arrays (RETAIN=1) | 8 | +54 MB | ~6.8 MB |
| Retain chunk arrays (RETAIN=1) | 24 | +166 MB | ~6.9 MB |
| Process and drop (RETAIN=0) | 8 | +6 MB | ~0.8 MB |
| Process and drop (RETAIN=0) | 24 | +19 MB | ~0.8 MB |
With a Node baseline near 90 MB and the batch's own working set at a bit over 100 MB, RETAIN=1 at 24 streams (+166 MB) reached the 512 MB ceiling — which matched the dashboard from the night of the crash. Three lessons: flush chunks as they arrive and keep no references; keep whole-transcript copies (a stray JSON.stringify of the full text, for instance) out of the hot path; and treat "measured per-stream cost × concurrency + baseline" crossing 70% of the container limit as a design smell, not a margin.
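The 70% rule can be turned into a small check that runs alongside the concurrency estimate. This is a sketch under stated assumptions: `projectMemory` and its field names are hypothetical, and the 105 MB working set is an assumed stand-in for "a bit over 100 MB". The per-stream costs are the measured values from the table above.

```typescript
// Memory budget check: baseline + working set + per-stream cost × concurrency,
// compared against a fraction of the container limit. Names are illustrative.
interface MemoryBudget {
  limitMb: number; // container memory limit
  baselineMb: number; // runtime baseline with no streams open
  workingSetMb: number; // the application's own working set
  perStreamMb: number; // measured RSS delta per concurrent stream
  threshold?: number; // fraction of the limit treated as the budget (default 0.7)
}

function projectMemory(b: MemoryBudget, concurrency: number) {
  const fixedMb = b.baselineMb + b.workingSetMb;
  const projectedMb = fixedMb + b.perStreamMb * concurrency;
  const budgetMb = b.limitMb * (b.threshold ?? 0.7);
  return {
    projectedMb,
    ratio: projectedMb / b.limitMb,
    overBudget: projectedMb > budgetMb,
    // Largest concurrency that still fits inside the budget.
    maxStreams: Math.max(0, Math.floor((budgetMb - fixedMb) / b.perStreamMb)),
  };
}

// 105 MB is an assumed value for "a bit over 100 MB".
const base = { limitMb: 512, baselineMb: 90, workingSetMb: 105 };
const retain = projectMemory({ ...base, perStreamMb: 6.9 }, 24);
const drop = projectMemory({ ...base, perStreamMb: 0.8 }, 24);

console.log(retain.overBudget, retain.maxStreams); // true 23
console.log(drop.overBudget, drop.maxStreams); // false 204
```

The projection covers steady state only. Transient allocations sit on top of it, which is one reason to treat the 70% line as a limit rather than as spare room.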
Fold Retry and Timeout Amplification into Your Headroom
The λ in Little's Law includes retries, not just user-initiated requests. At an 8% retry rate your effective arrival rate is 1.08×. Timeouts, meanwhile, define the worst-case W: set a 300-second timeout and a hung connection occupies a slot for up to 300 seconds. Your concurrency estimate therefore needs allowance not just for the average stream but for the few that squat until the timeout reaps them.
My rule of thumb: the larger the ratio of timeout to mean W, the fatter the headroom factor. Within 5× the mean, I use ×1.5; beyond 10×, ×2. Fold it into the calculator's peakFactor and no new code is required. Retry ownership itself (which layer retries, how Retry-After is honored) is beyond this article's scope; the backpressure section of "Designing Priority and Fairness into a Claude API Job Queue" covers it properly.
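The two adjustments above, retry-inflated λ and timeout-driven headroom, fit in a pair of small functions. This is a sketch with hypothetical names. One gap needs flagging: the rule of thumb gives ×1.5 up to 5× and ×2 beyond 10×, but says nothing about the range in between, so the linear interpolation below is an assumption of this example and not the author's rule.

```typescript
// Retry-adjusted arrival rate: an 8% retry rate turns λ into 1.08 × λ.
function effectiveLambda(lambdaPerSec: number, retryRate: number): number {
  return lambdaPerSec * (1 + retryRate);
}

// Headroom factor from the ratio of timeout to mean stream duration W.
function headroomFactor(timeoutSec: number, meanStreamSec: number): number {
  const ratio = timeoutSec / meanStreamSec;
  if (ratio <= 5) return 1.5; // rule of thumb: within 5× the mean
  if (ratio >= 10) return 2; // rule of thumb: beyond 10× the mean
  // ASSUMPTION: no value is given between 5× and 10×; interpolate linearly.
  return 1.5 + ((ratio - 5) / 5) * 0.5;
}

// Chat workload from the table: W = 12 s.
console.log(headroomFactor(60, 12)); // 1.5  (timeout is 5× the mean)
console.log(headroomFactor(90, 12)); // 1.75 (7.5×, interpolated)
console.log(headroomFactor(300, 12)); // 2    (25×)
```

The return value is what would be passed as `peakFactor`; a 300-second timeout on a 12-second mean stream lands firmly in ×2 territory.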
Enforce the Numbers with a Bounded Semaphore and a Bounded Queue
A number nobody enforces is a wish. At small scale you don't need a library — thirty lines of bounded semaphore with a capped wait queue does the job.
```typescript
// bounded-gate.ts: enforce the concurrency and queue depth you decided on
export class BoundedGate {
  private running = 0;
  private waiting: Array<() => void> = [];

  constructor(
    private maxConcurrent: number, // pass sizeCapacity().concurrent directly
    private maxQueue: number, // and queueDepth
  ) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.running >= this.maxConcurrent) {
      if (this.waiting.length >= this.maxQueue) {
        // refuse at the door rather than let hope pile up (load shedding)
        throw new Error("shed: queue is full");
      }
      // wait for a slot; the releaser wakes exactly one waiter, so no stampede
      await new Promise<void>((resolve) => this.waiting.push(resolve));
      // the finishing task handed its slot over, so `running` already counts us
    } else {
      this.running++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      // hand the slot to the head of the queue without releasing it, so a
      // newcomer cannot slip in between the release and the waiter resuming
      if (next) next();
      else this.running--;
    }
  }

  // observability gauge: log it periodically and compare against the estimate
  get gauge() {
    return { running: this.running, queued: this.waiting.length };
  }
}

// Usage:
// const gate = new BoundedGate(49, 243);
// const msg = await gate.run(() => client.messages.create({ /* ... */ }));
// setInterval(() => console.log(gate.gauge), 5000);
// Expected output at peak: { running: 49, queued: 87 } (capped, by design)
```
The essential choice is to refuse rather than to wait indefinitely. An unbounded queue answers no one while accumulating memory and expectation, then floods you with stale requests the moment things recover. What the rejected caller does next — immediate error, delayed requeue — is a product decision; having exactly one place in the integration layer that says no is the engineering decision that comes first.
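One way to keep that single point of refusal visible to callers is to translate the shed into an explicit outcome instead of a generic failure. The sketch below assumes only that the gate throws an `Error` whose message starts with `shed:`, as the listing above does. The `Outcome` type, the `Gate` interface, the 503 status, and the `retryAfterSec` default are all hypothetical choices for illustration.

```typescript
// Turning a shed into an explicit caller-facing outcome.
// Outcome, Gate, the 503 status, and retryAfterSec are illustrative assumptions.
type Outcome<T> =
  | { ok: true; value: T }
  | { ok: false; status: 503; retryAfterSec: number };

// Minimal shape of the gate this helper needs.
interface Gate {
  run<T>(task: () => Promise<T>): Promise<T>;
}

// Matches the "shed: queue is full" error thrown by the gate.
function isShed(err: unknown): boolean {
  return err instanceof Error && err.message.startsWith("shed:");
}

async function runOrShed<T>(
  gate: Gate,
  task: () => Promise<T>,
  retryAfterSec = 30, // assumed default; a product decision in practice
): Promise<Outcome<T>> {
  try {
    return { ok: true, value: await gate.run(task) };
  } catch (err) {
    if (isShed(err)) return { ok: false, status: 503, retryAfterSec };
    throw err; // task failures are not load shedding; let them propagate
  }
}
```

Matching on a message prefix is fragile; a dedicated error subclass thrown by the gate would be sturdier if you control both sides.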
A 30-Minute Load Check That Validates the Estimate
Once the arithmetic and the measurements agree, I spend thirty minutes validating before anything ships. The same five steps every time:
1. Drive 1.5× the projected peak arrival rate with production-shaped token counts for ten minutes (a light modification of the probe above suffices).
2. Record four signals every five seconds: RSS, event-loop lag, gate.gauge, and 429 count.
3. Check that p95 full-stream duration lands within 1.5× your estimated W. If it doesn't, your W was optimistic; feed the measured value back into the calculator.
4. If peak RSS crosses 70% of the container limit, lower concurrency or raise memory. A run that merely survived is not a pass.
5. Deliberately overflow the queue and confirm the shed fires and callers behave as designed.
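The pass criteria in the steps above can be written down as a pure function over the recorded samples, so the verdict does not depend on eyeballing a dashboard. This is a sketch with hypothetical names (`LoadCheckRun`, `evaluateLoadCheck`); it covers the duration, memory, and shed checks, and assumes the samples were collected separately. It uses a nearest-rank percentile, which is one of several common definitions.

```typescript
// Load-check verdict from recorded samples. Names are illustrative assumptions.
interface LoadCheckRun {
  streamDurationsSec: number[]; // full stream lifetimes observed during the run
  rssSamplesMb: number[]; // RSS sampled every five seconds
  estimatedWSec: number; // the W that was fed into the calculator
  containerLimitMb: number;
  shedObserved: boolean; // did the deliberate overflow trigger the shed?
}

// Nearest-rank percentile; p in (0, 1], values must be non-empty.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

function evaluateLoadCheck(run: LoadCheckRun) {
  const p95Sec = percentile(run.streamDurationsSec, 0.95);
  const peakRssMb = Math.max(...run.rssSamplesMb);
  const durationOk = p95Sec <= 1.5 * run.estimatedWSec; // step 3
  const memoryOk = peakRssMb <= 0.7 * run.containerLimitMb; // step 4
  return {
    p95Sec,
    peakRssMb,
    durationOk,
    memoryOk,
    pass: durationOk && memoryOk && run.shedObserved, // step 5 included
  };
}
```

When `durationOk` is false, the useful output is `p95Sec` itself: it indicates how far the measured stream lifetime has drifted from the W used in the estimate.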
Most of what this catches is not code defects but wrong inputs to the estimate. W is the usual offender: the value measured against short development prompts routinely differs from production prompts by 3× or more.
Your Next Step — Pull λ and W from Last Night's Logs
Start with a workload you already run. Extract two numbers from the last 24 hours of logs — arrival rate λ and mean stream duration W — and compute L = λW by hand. Comparing that against your currently configured parallelism takes five minutes and reveals, in numbers, either a latent OOM (oversized) or wasted wall-clock time (undersized). I skipped that habit for years at Dolice Labs and paid for it with the crash that opened this piece; I'd be glad if this saves you the same detour.
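If your logs record when each stream opened and closed, the by-hand calculation is a short reduction. The sketch below assumes a hypothetical `StreamLog` record with millisecond timestamps; adapt the field names to whatever your logging actually emits. It reports the ratio of configured parallelism to required L without judging it, since what counts as oversized depends on your memory budget.

```typescript
// Derive λ, W, and L = λ × W from stream open/close timestamps.
// StreamLog and littleFromLogs are hypothetical names for illustration.
interface StreamLog {
  startMs: number; // when the stream was opened
  endMs: number; // when the stream was closed
}

function littleFromLogs(logs: StreamLog[], windowSec: number, configuredParallelism: number) {
  if (logs.length === 0 || windowSec <= 0) {
    throw new Error("need at least one log record and a positive window");
  }
  const lambdaPerSec = logs.length / windowSec;
  const totalMs = logs.reduce((sum, l) => sum + (l.endMs - l.startMs), 0);
  const meanStreamSec = totalMs / logs.length / 1000;
  const requiredL = lambdaPerSec * meanStreamSec;
  return {
    lambdaPerSec,
    meanStreamSec,
    requiredL,
    configuredOverRequired: configuredParallelism / requiredL,
  };
}

// Synthetic data shaped like the nightly batch: 90 streams of 42 s in 30 minutes.
const logs: StreamLog[] = Array.from({ length: 90 }, (_, i) => ({
  startMs: i * 20_000,
  endMs: i * 20_000 + 42_000,
}));
const report = littleFromLogs(logs, 30 * 60, 24);
console.log(report.requiredL.toFixed(1), report.configuredOverRequired.toFixed(1)); // 2.1 11.4
```

A ratio far above the headroom factor you chose suggests memory is being spent on parallelism the arrival rate cannot use; a ratio below 1 suggests requests are queueing for connections.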