API & SDK / 2026-07-03 / Advanced

How Many Concurrent Claude API Requests Can You Actually Hold? Sizing Production Infrastructure with Little's Law and Measured Memory

Concurrency, queue depth, and memory are numbers you can derive, not guess. A working method for sizing Claude API production deployments with Little's Law, a memory probe, and a 30-minute load check — learned the hard way from an OOM crash.


In the last week of June, I was consolidating the nightly batch jobs for my four sites — article generation, link audits, that sort of thing — into a single process. Impatient to finish the queue faster, I raised the Claude API concurrency from 8 to 24. The container promptly died. Not from a 429, not from a timeout — from memory. I had double-checked my rate-limit budget several times, yet I had never once measured how much of my own process a single streaming connection actually occupies.

When people plan the infrastructure requirements for a Claude deployment, the reflex is to start from server specs. But the model runs on Anthropic's side (or on Bedrock, Vertex AI, or Microsoft Foundry) — not on your infrastructure. What you are sizing is the resource that waits. As an indie developer running both unattended batch pipelines and a small public-facing service, I've settled on a procedure for deriving concurrency, queue depth, and memory from numbers rather than instinct. Here it is, end to end.

You Are Not Sizing the Model — You Are Sizing the Waiting

Strip a Claude-backed application down to its integration layer, and the resources it consumes in production reduce to four numbers.

Resource	The number you set	What it derives from
Concurrency	Maximum simultaneously open API connections	Arrival rate λ and mean stream duration W (Little's Law)
Rate-limit budget	Projected RPM / input TPM / output TPM	Effective arrival rate × average token counts
Memory	Measured RSS delta per connection × concurrency	Direct measurement (probe below)
Queue depth	How many accepted requests may wait before work starts	Tolerable wait time × effective arrival rate

CPU almost never matters here. Streaming receipt is dominated by I/O waits; in my environment, 24 parallel streams kept CPU at roughly 15% of one core. If your deployment falls over, it will be rate limits, memory, or a queue you never bounded.

The upstream decisions (traffic tiers, SLA targets, data residency) are covered in "The infrastructure requirements you should settle before shipping Claude API to production." This piece picks up where that one stops: turning an agreed scale into concrete settings.

Deriving Required Concurrency with Little's Law

Required concurrency falls straight out of queueing theory's most forgiving formula.

Concurrency L = arrival rate λ (requests/second) × mean time in system W (seconds)

The crucial subtlety: W is not time-to-first-byte. It is the full lifetime of the stream, open to close. For long-form generation, TTFB may be one second while the stream stays open for forty. The connection is occupied the entire time.
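To make the gap between first response and full lifetime concrete, here is a minimal sketch that times any async iterable from open to close. It is not tied to a specific SDK: `timeStream` and `fakeStream` are hypothetical names, the fake stream stands in for a real one, and "first chunk" is used as an approximation of TTFB.

```typescript
// Time a stream from open to close. W for Little's Law is totalMs, not ttfbMs.
async function timeStream<T>(
  stream: AsyncIterable<T>,
): Promise<{ ttfbMs: number; totalMs: number; chunks: number }> {
  const start = Date.now();
  let firstAt: number | null = null;
  let chunks = 0;
  for await (const _chunk of stream) {
    if (firstAt === null) firstAt = Date.now(); // first chunk arrived
    chunks++;
  }
  const end = Date.now();
  return { ttfbMs: (firstAt ?? end) - start, totalMs: end - start, chunks };
}

// Stand-in stream for demonstration only: a quick first chunk, then slower ones.
async function* fakeStream(): AsyncGenerator<string> {
  await new Promise((resolve) => setTimeout(resolve, 20));
  yield "first";
  for (let i = 0; i < 3; i++) {
    await new Promise((resolve) => setTimeout(resolve, 40));
    yield `chunk ${i}`;
  }
}

timeStream(fakeStream()).then((t) =>
  console.log(`ttfb=${t.ttfbMs}ms total=${t.totalMs}ms chunks=${t.chunks}`),
);
```

Logging `totalMs` per request in production gives you the W that the table below depends on.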

Here are the numbers for the two workloads I actually run, plus the spike that follows a push notification.

Workload	Arrival rate λ	Mean stream duration W	Required concurrency L	With 8% retries and 1.5× headroom (rounded up)
Nightly batch (90 tasks in 30 minutes)	0.05 /s	42 s (long-form)	2.1	4
Chat UI (peak)	2.5 /s	12 s	30	49
Post-push-notification spike	8 /s (3 min)	12 s	96	156

The batch answer — four connections suffice — surprised me. Before the consolidation work I assumed that ninety queued tasks justified high parallelism, cranked it to 24, and earned the OOM in the opening paragraph. When the arrival rate is low, extra parallelism barely shortens the wall clock; it only multiplies memory. The chat UI cuts the other way: modest by RPM standards, yet demanding 49 simultaneous connections. That divergence between RPM and concurrency is the subject of the next section.

Everything that shrinks W (region selection, connection pooling, prompt caching) is collected in "Four infrastructure moves that cut Claude API latency." Halve W and your required concurrency halves with it.
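The arithmetic behind the L column fits in a few lines. The sketch below recomputes the raw concurrency for the three workloads and shows the effect of halving W for the chat UI; the function names are illustrative, not taken from any library.

```typescript
// Little's Law: L = λ × W, with λ in requests/second and W in seconds.
function requiredConcurrency(lambdaPerSec: number, meanStreamSec: number): number {
  return lambdaPerSec * meanStreamSec;
}

// Arrival rate from a task count spread over a window, e.g. 90 tasks in 30 minutes.
function arrivalRate(tasks: number, windowSec: number): number {
  return tasks / windowSec;
}

const batchL = requiredConcurrency(arrivalRate(90, 30 * 60), 42); // 0.05/s × 42 s
const chatL = requiredConcurrency(2.5, 12);
const spikeL = requiredConcurrency(8, 12);
const chatHalvedW = requiredConcurrency(2.5, 6); // same traffic, W cut from 12 s to 6 s

console.log(batchL.toFixed(1), chatL, spikeL, chatHalvedW);
// prints: 2.1 30 96 15
```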


Rate-Limit Budget and Concurrency Are Separate Axes — Check Both

Rate limits (RPM / ITPM / OTPM) and concurrency will each happily blindside you while you watch the other. Long streams occupy connections while barely denting RPM; short high-frequency calls return connections instantly while devouring RPM. The safe pattern is a single calculator that audits both at once.

// capacity.ts — one calculator for required concurrency and rate budget
type CapacityInput = {
  qps: number;            // peak arrival rate (requests/second)
  meanStreamSec: number;  // mean stream lifetime W (seconds)
  retryRate: number;      // retry fraction r (0.08 = 8%)
  peakFactor: number;     // burstiness multiplier (I use 1.5–2)
  avgInputTokens: number;
  avgOutputTokens: number;
  rpmLimit: number;       // your org's RPM ceiling
  itpmLimit: number;      // input TPM ceiling
  otpmLimit: number;      // output TPM ceiling
};
 
export function sizeCapacity(c: CapacityInput) {
  // fold retries and burstiness into an "effective arrival rate"
  const effectiveQps = c.qps * (1 + c.retryRate) * c.peakFactor;
 
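  // Math.ceil on a raw float product is fragile: for the example below,
  // effectiveQps * 60 lands a hair above 243 and a bare ceil reports 244.
  // Trim the float noise to six decimals before rounding up.
  const ceilStable = (x: number) => Math.ceil(Number(x.toFixed(6)));
 
  // Little's Law: concurrency L = arrival rate λ × time in system W
  const concurrent = ceilStable(effectiveQps * c.meanStreamSec);
 
  // bound the queue at 60 seconds of effective arrivals
  // (tolerable wait × effective arrival rate); shed beyond that
  const queueDepth = ceilStable(effectiveQps * 60);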
 
  const rpm = effectiveQps * 60;
  const itpm = rpm * c.avgInputTokens;
  const otpm = rpm * c.avgOutputTokens;
 
  return {
    concurrent,
    queueDepth,
    budget: {
      rpm:  { used: Math.round(rpm),  limit: c.rpmLimit,  ok: rpm  <= c.rpmLimit },
      itpm: { used: Math.round(itpm), limit: c.itpmLimit, ok: itpm <= c.itpmLimit },
      otpm: { used: Math.round(otpm), limit: c.otpmLimit, ok: otpm <= c.otpmLimit },
    },
  };
}
 
// Example: chat UI (peak 2.5 QPS, 12 s mean stream, 8% retries)
console.log(sizeCapacity({
  qps: 2.5, meanStreamSec: 12, retryRate: 0.08, peakFactor: 1.5,
  avgInputTokens: 2200, avgOutputTokens: 900,
  rpmLimit: 4000, itpmLimit: 2000000, otpmLimit: 400000,
}));
// Expected output:
// {
//   concurrent: 49,
//   queueDepth: 243,
//   budget: {
//     rpm:  { used: 243,    limit: 4000,    ok: true },
//     itpm: { used: 534600, limit: 2000000, ok: true },
//     otpm: { used: 218700, limit: 400000,  ok: true }
//   }
// }

Note the shape of that result: RPM consumption sits at 6% of the ceiling while concurrency demands 49 open connections. Judge capacity from the rate-limit dashboard alone and the connection and memory estimates silently vanish from your plan. Flip the workload to long-prompt RAG and the opposite happens — concurrency is fine but ITPM exhausts first. Which ceiling you hit first is a property of your workload's shape, so print both, always.
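To see which rate ceiling binds first for a given workload shape, it helps to rank the three utilizations. The sketch below does that for the chat UI figures above and for a long-prompt RAG workload whose numbers (2 requests/second, 20,000 input tokens, 300 output tokens) are invented purely for illustration; `rankCeilings` is a hypothetical helper.

```typescript
type WorkloadShape = {
  effectiveQps: number;   // arrival rate after retries and burst factor
  avgInputTokens: number;
  avgOutputTokens: number;
};
type RateLimits = { rpm: number; itpm: number; otpm: number };

// Returns each ceiling's utilization (1.0 = exactly at the limit), worst first.
function rankCeilings(w: WorkloadShape, limits: RateLimits) {
  const rpm = w.effectiveQps * 60;
  const usage = [
    { ceiling: "rpm", utilization: rpm / limits.rpm },
    { ceiling: "itpm", utilization: (rpm * w.avgInputTokens) / limits.itpm },
    { ceiling: "otpm", utilization: (rpm * w.avgOutputTokens) / limits.otpm },
  ];
  return usage.sort((a, b) => b.utilization - a.utilization);
}

const orgLimits: RateLimits = { rpm: 4000, itpm: 2_000_000, otpm: 400_000 };
const chatRank = rankCeilings(
  { effectiveQps: 4.05, avgInputTokens: 2200, avgOutputTokens: 900 },
  orgLimits,
);
// Hypothetical long-prompt RAG shape: few requests, very large inputs.
const ragRank = rankCeilings(
  { effectiveQps: 2, avgInputTokens: 20_000, avgOutputTokens: 300 },
  orgLimits,
);

console.log(chatRank[0].ceiling, ragRank[0].ceiling, ragRank[0].utilization);
// prints: otpm itpm 1.2
```

For the chat shape the tightest rate ceiling is output TPM at roughly 55%, while the invented RAG shape is already 20% over its input TPM limit at a fraction of the request rate.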

One 2026 footnote: Claude Sonnet 5 became the default model in early July, with introductory pricing (2 dollars per million input tokens, 10 per million output, through August 31, 2026) making parallelism cheaper than ever to buy. Cheaper tokens tempt you to raise concurrency — but the rate-limit and connection ceilings survive every price cut. Keep them in the estimate.

Measure What One Stream Costs in Memory — OOM Arrives Before 429

This is where my June incident lived. Retain streaming chunks in an array — "I'll need the full text later" — and run 24 streams of long-form output in parallel, and RSS climbs far faster than intuition suggests. Rather than argue from theory, here is the small probe I now keep in the repository.

// stream-memory-probe.ts — measure RSS delta for N concurrent streams
// Run: PROBE_STREAMS=24 PROBE_RETAIN=1 npx tsx stream-memory-probe.ts
import Anthropic from "@anthropic-ai/sdk";
 
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const N = Number(process.env.PROBE_STREAMS ?? 8);
const RETAIN = process.env.PROBE_RETAIN === "1"; // 1: keep chunks (the bad pattern)
 
const retained: string[][] = [];
 
async function oneStream(): Promise<void> {
  const chunks: string[] = [];
  const stream = client.messages.stream({
    model: "claude-sonnet-5",
    max_tokens: 8000,
    // a prompt that elicits long output (my runs generated ~60k characters)
    messages: [{ role: "user", content: "YOUR_LONG_FORM_PROMPT" }],
  });
  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      if (RETAIN) chunks.push(event.delta.text); // bad: every chunk pinned on the heap
      else void event.delta.text.length;          // good: process and drop the reference
    }
  }
  if (RETAIN) retained.push(chunks);
}
 
const before = process.memoryUsage().rss;
const timer = setInterval(() => {
  const mb = (process.memoryUsage().rss - before) / 1024 / 1024;
  console.log(`rss delta: ${mb.toFixed(1)} MB`);
}, 2000);
 
await Promise.all(Array.from({ length: N }, () => oneStream()));
clearInterval(timer);
const finalMb = (process.memoryUsage().rss - before) / 1024 / 1024;
console.log(`final delta: ${finalMb.toFixed(1)} MB / ${N} streams`);
// Expected output (example, RETAIN=1 / N=24): final delta: 166.3 MB / 24 streams

Results from my environment (Node 22, 512 MB container, ~60k characters of output per stream). Read the ratio, not the absolute numbers — retention policy moves the cost by an order of magnitude.

Retention policy	Parallel N	Measured RSS delta	Per stream
Retain chunk arrays (RETAIN=1)	8	+54 MB	~6.8 MB
Retain chunk arrays (RETAIN=1)	24	+166 MB	~6.9 MB
Process and drop (RETAIN=0)	8	+6 MB	~0.8 MB
Process and drop (RETAIN=0)	24	+19 MB	~0.8 MB

With a Node baseline near 90 MB and the batch's own working set at a bit over 100 MB, RETAIN=1 at 24 streams (+166 MB) puts the steady-state total at roughly 356 MB, about 70% of the 512 MB limit before any transient allocation is counted. The dashboard from the night of the crash showed the container climbing to that ceiling. Three lessons: flush chunks as they arrive and keep no references; keep whole-transcript copies (a stray JSON.stringify of the full text, for instance) out of the hot path; and treat "measured per-stream cost × concurrency + baseline" crossing 70% of the container limit as a design smell, not a margin.
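The 70% rule is easy to turn into a check. As a sketch, the function below projects memory for a given concurrency and reports the largest concurrency that stays under the threshold. The example combines the measured per-stream costs from the table with the chat UI's 49 connections, a 90 MB baseline, and a 512 MB container; that combination is an assumption made for illustration, not a configuration measured in the article.

```typescript
type MemoryBudget = {
  projectedMb: number;    // per-stream cost × concurrency + baseline
  share: number;          // projected / container limit
  smell: boolean;         // true when the projection crosses the threshold
  maxConcurrency: number; // largest concurrency that stays under the threshold
};

function memoryBudget(
  perStreamMb: number,
  concurrency: number,
  baselineMb: number,
  limitMb: number,
  threshold = 0.7,
): MemoryBudget {
  const projectedMb = perStreamMb * concurrency + baselineMb;
  const share = projectedMb / limitMb;
  const maxConcurrency = Math.max(
    0,
    Math.floor((limitMb * threshold - baselineMb) / perStreamMb),
  );
  return { projectedMb, share, smell: share > threshold, maxConcurrency };
}

const retainedPlan = memoryBudget(6.9, 49, 90, 512); // chunks retained
const droppedPlan = memoryBudget(0.8, 49, 90, 512);  // process and drop

console.log(retainedPlan.smell, retainedPlan.maxConcurrency, Math.round(retainedPlan.share * 100));
// prints: true 38 84
console.log(droppedPlan.smell, droppedPlan.maxConcurrency, Math.round(droppedPlan.share * 100));
// prints: false 335 25
```

Under these assumed inputs, retaining chunks caps the container at 38 connections, below the 49 the concurrency estimate asks for, while process-and-drop leaves memory far from being the binding constraint.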

Fold Retry and Timeout Amplification into Your Headroom

The λ in Little's Law includes retries, not just user-initiated requests. At an 8% retry rate your effective arrival rate is 1.08×. Timeouts, meanwhile, define the worst-case W: set a 300-second timeout and a hung connection occupies a slot for up to 300 seconds. Your concurrency estimate therefore needs allowance not just for the average stream but for the few that squat until the timeout reaps them.

My rule of thumb: the larger the ratio of timeout to mean W, the fatter the headroom factor. Within 5× the mean, I use ×1.5; beyond 10×, ×2. Fold it into the calculator's peakFactor and no new code is required. Retry ownership itself (which layer retries, how Retry-After is honored) is beyond this article's scope; the backpressure section of "Designing priority and fairness into a Claude API job queue" covers it properly.
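If you would rather compute the headroom factor than remember it, the rule of thumb can be written as a small function. Note one assumption: the text gives values only for ratios up to 5× and beyond 10×, so the linear interpolation between those two points is a guess of this sketch, not part of the rule. `headroomFactor` is a hypothetical name.

```typescript
// Headroom factor from the ratio of timeout to mean stream duration W.
// <= 5× the mean: 1.5. >= 10×: 2.
// Between 5× and 10×: linear interpolation (an assumption of this sketch).
function headroomFactor(timeoutSec: number, meanStreamSec: number): number {
  const ratio = timeoutSec / meanStreamSec;
  if (ratio <= 5) return 1.5;
  if (ratio >= 10) return 2;
  return 1.5 + ((ratio - 5) / 5) * 0.5;
}

console.log(
  headroomFactor(60, 12),  // ratio 5
  headroomFactor(90, 12),  // ratio 7.5
  headroomFactor(300, 12), // ratio 25
);
// prints: 1.5 1.75 2
```

The result is what you would pass as peakFactor, so a 300-second timeout against a 12-second mean stream argues for ×2 rather than the ×1.5 used in the chat UI example.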

Enforce the Numbers with a Bounded Semaphore and a Bounded Queue

A number nobody enforces is a wish. At small scale you don't need a library — thirty lines of bounded semaphore with a capped wait queue does the job.

// bounded-gate.ts — enforce the concurrency and queue depth you decided on
export class BoundedGate {
  private running = 0;
  private waiting: Array<() => void> = [];
 
  constructor(
    private maxConcurrent: number, // pass sizeCapacity().concurrent directly
    private maxQueue: number,      // and queueDepth
  ) {}
 
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.running >= this.maxConcurrent) {
      if (this.waiting.length >= this.maxQueue) {
        // refuse at the door rather than let hope pile up (load shedding)
        throw new Error("shed: queue is full");
      }
      // wait for a slot; the releaser hands its own slot to exactly one waiter,
      // so `running` already counts us when we wake (no stampede, no barging)
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.running++;
    }
    try {
      return await task();
    } finally {
      // Hand the slot straight to the head of the queue. Decrementing first and
      // letting the waiter re-increment would let a new caller slip in between
      // and push `running` past maxConcurrent.
      const next = this.waiting.shift();
      if (next) next();
      else this.running--;
    }
  }
 
  // observability gauge — log it periodically and compare against the estimate
  get gauge() {
    return { running: this.running, queued: this.waiting.length };
  }
}
 
// Usage:
// const gate = new BoundedGate(49, 243);
// const msg = await gate.run(() => client.messages.create({ /* ... */ }));
// setInterval(() => console.log(gate.gauge), 5000);
// Expected output at peak: { running: 49, queued: 87 } — capped, by design

The essential choice is to refuse rather than to wait indefinitely. An unbounded queue answers no one while accumulating memory and expectation, then floods you with stale requests the moment things recover. What the rejected caller does next — immediate error, delayed requeue — is a product decision; having exactly one place in the integration layer that says no is the engineering decision that comes first.
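One way to keep that single "no" explicit at the call site is to convert the shed error into a typed outcome. The sketch below assumes only that the gate throws an Error whose message starts with "shed:", as the gate above does; `Gate`, `runOrShed`, and the `retryAfterSec` hint are hypothetical names, and a dedicated error class would be sturdier than matching on a message prefix.

```typescript
// Minimal shape of the gate this helper needs.
interface Gate {
  run<T>(task: () => Promise<T>): Promise<T>;
}

type Outcome<T> =
  | { kind: "ok"; value: T }
  | { kind: "shed"; retryAfterSec: number };

// Runs the task through the gate. A shed becomes a value the caller must handle
// (for example an HTTP 503 with a Retry-After header); other errors still throw.
async function runOrShed<T>(
  gate: Gate,
  task: () => Promise<T>,
  retryAfterSec = 5,
): Promise<Outcome<T>> {
  try {
    return { kind: "ok", value: await gate.run(task) };
  } catch (err) {
    if (err instanceof Error && err.message.startsWith("shed:")) {
      return { kind: "shed", retryAfterSec };
    }
    throw err;
  }
}

// Stand-in gate that is always full, for demonstration only.
const fullGate: Gate = {
  run: async () => {
    throw new Error("shed: queue is full");
  },
};

runOrShed(fullGate, async () => "never runs").then((r) => console.log(r.kind));
// prints: shed
```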

A 30-Minute Load Check That Validates the Estimate

Once the arithmetic and the measurements agree, I spend thirty minutes validating before anything ships. The same five steps every time:

1. Drive 1.5× the projected peak arrival rate with production-shaped token counts for ten minutes (a light modification of the probe above suffices).
2. Record four signals every five seconds: RSS, event-loop lag, gate.gauge, and 429 count.
3. Check that p95 full-stream duration lands within 1.5× your estimated W. If it doesn't, your W was optimistic; feed the measured value back into the calculator.
4. If peak RSS crosses 70% of the container limit, lower concurrency or raise memory. A run that merely survived is not a pass.
5. Deliberately overflow the queue and confirm the shed fires and callers behave as designed.

Most of what this catches is not code defects but wrong inputs to the estimate. W is the usual offender: the value measured against short development prompts routinely differs from production prompts by 3× or more.
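The pass/fail part of the check can be scripted so that "it survived" never counts as a pass. The sketch below evaluates three of the criteria from the steps above (p95 duration, peak RSS, shed behavior) against recorded values. The type and function names are hypothetical, p95 uses the nearest-rank method, and the sample numbers are made up for illustration.

```typescript
type LoadCheckInput = {
  streamDurationsSec: number[]; // open-to-close duration of every stream in the run
  estimatedW: number;           // the W used in the capacity estimate (seconds)
  peakRssMb: number;
  containerLimitMb: number;
  shedFired: boolean;           // did the deliberate overflow trigger a shed?
};

// Nearest-rank 95th percentile.
function p95(values: number[]): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length);
  return sorted[rank - 1]!;
}

// Returns a list of failed criteria; an empty list means the run passed.
function evaluateLoadCheck(r: LoadCheckInput): string[] {
  const failures: string[] = [];
  const measured = p95(r.streamDurationsSec);
  if (measured > 1.5 * r.estimatedW) {
    failures.push(`p95 duration ${measured}s exceeds 1.5 × W: re-estimate with the measured W`);
  }
  if (r.peakRssMb > 0.7 * r.containerLimitMb) {
    failures.push("peak RSS crossed 70% of the container limit");
  }
  if (!r.shedFired) {
    failures.push("queue overflow did not trigger a shed");
  }
  return failures;
}

// Made-up sample run: one slow outlier pushes p95 to 25 s against W = 12 s.
const failures = evaluateLoadCheck({
  streamDurationsSec: [10, 11, 12, 12, 13, 13, 14, 15, 16, 25],
  estimatedW: 12,
  peakRssMb: 300,
  containerLimitMb: 512,
  shedFired: true,
});
console.log(failures.length);
// prints: 1
```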

Your Next Step — Pull λ and W from Last Night's Logs

Start with a workload you already run. Extract two numbers from the last 24 hours of logs — arrival rate λ and mean stream duration W — and compute L = λW by hand. Comparing that against your currently configured parallelism takes five minutes and reveals, in numbers, either a latent OOM (oversized) or wasted wall-clock time (undersized). I skipped that habit for years at Dolice Labs and paid for it with the crash that opened this piece; I'd be glad if this saves you the same detour.
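If your logs record when each stream opened and closed, the hand calculation reduces to a few lines. This is a sketch under assumptions: the `StreamLog` shape and its field names are hypothetical, so adapt them to whatever your logger actually emits.

```typescript
// One record per completed stream; field names are assumptions of this sketch.
type StreamLog = { startedAtMs: number; endedAtMs: number };

// Derive λ (requests/second), mean W (seconds), and L = λ × W from a log window.
function littleFromLogs(logs: StreamLog[], windowSec: number) {
  if (logs.length === 0) throw new Error("no log records in the window");
  const lambda = logs.length / windowSec;
  const totalSec = logs.reduce(
    (sum, l) => sum + (l.endedAtMs - l.startedAtMs) / 1000,
    0,
  );
  const meanW = totalSec / logs.length;
  return { lambda, meanW, concurrency: lambda * meanW };
}

// Toy window: three streams of 10 s, 20 s, and 30 s observed over 60 seconds.
const estimate = littleFromLogs(
  [
    { startedAtMs: 0, endedAtMs: 10_000 },
    { startedAtMs: 5_000, endedAtMs: 25_000 },
    { startedAtMs: 20_000, endedAtMs: 50_000 },
  ],
  60,
);
console.log(estimate.lambda.toFixed(2), estimate.meanW, estimate.concurrency.toFixed(2));
// prints: 0.05 20 1.00
```

For a real run, pass the last 24 hours of records with `windowSec` set to 86,400, or restrict both to the peak hour if you are sizing for peak rather than average load.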
