API & SDK · 2026-07-02 · Advanced

When 41 of 20,000 Message Batches Requests Quietly Vanished — Field Notes on Reconciling and Requeuing Partial Failures

processing_status: ended does not mean every request succeeded. How errored and expired results hide inside a finished batch, and how a custom_id ledger catches every gap and requeues safely — with real cost and timing numbers.

I watched the batch flip to ended, closed the laptop, and slept well. The problem didn't surface until three days later.

Of the 20,000 requests I had submitted to the Message Batches API for a tag-reclassification job, only 19,959 rows had landed in my aggregation table. Forty-one were simply gone — no exception, no log line, nothing. As an indie developer running several technical blogs in parallel, I lean on batches constantly for bulk metadata work, and this was the first time I had met a failure this quiet.

The cause turned out to be my own assumption, not an API bug. processing_status: "ended" only means the box finished processing. Whether each item inside the box succeeded is a separate question the API expects you to ask.

These notes cover where the counts leak, a custom_id ledger that makes every gap visible, and a requeue design that never double-processes — with the numbers I measured along the way.

"ended" Is Not "All Succeeded" — the Four Ways a Request Finishes

Every request in a batch carries its own result.type. There are four outcomes.

result.type	Meaning	Typical trigger	Correct response
succeeded	Completed normally; contains a message	—	Aggregate as usual
errored	Individual request failed; carries an error object	invalid_request_error (bad params), api_error, overloaded_error	Branch on error type (below)
canceled	Caught in a batch cancellation	Unprocessed items at manual cancel	Requeue
expired	Missed the 24-hour processing window	Large batches in busy periods	Requeue

The crucial part: a batch reaches ended normally even when errored and expired results are mixed in. Nothing throws. The SDK doesn't warn you. If your consumer loop only picks up successes, the failures disappear without a witness.
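A minimal sketch of the cheapest possible check, assuming each result exposes result.type as described above. tally_result_types and fake_result are hypothetical names, and the stand-in objects only mimic the shape of a results entry so the example runs without a network call.

```python
from collections import Counter
from types import SimpleNamespace


def tally_result_types(results) -> Counter:
    """Count results by result.type; accepts any iterable of result-like objects."""
    return Counter(r.result.type for r in results)


def fake_result(custom_id: str, rtype: str) -> SimpleNamespace:
    # Stand-in for one entry of the results stream (illustrative only).
    return SimpleNamespace(custom_id=custom_id, result=SimpleNamespace(type=rtype))


sample = [
    fake_result("article-0001__a1", "succeeded"),
    fake_result("article-0002__a1", "errored"),
    fake_result("article-0003__a1", "expired"),
    fake_result("article-0004__a1", "succeeded"),
]

tally = tally_result_types(sample)
# Anything that is not "succeeded" needs a decision, not silence.
not_succeeded = sum(n for rtype, n in tally.items() if rtype != "succeeded")
print(dict(tally), "not succeeded:", not_succeeded)
```

In a real consumer the iterable would be the results stream for the finished batch. If not_succeeded is above zero and your loop has no branch for it, that is where rows go missing.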

My 41 missing items broke down as 28 errored (19 overload-class, 9 invalid_request_error) and 13 expired. The nine invalid_request_error failures were empty article bodies I had passed through unfiltered, and no retry will ever fix those. Requeue everything indiscriminately and those nine fail forever, on your budget.

Build the Ledger First — Designing a custom_id Manifest

Reconciliation requires a record of what you sent that lives outside the API. Without it there is nothing to diff against.

At submission time, write each custom_id and a hash of its input to a JSONL manifest.

import anthropic
import hashlib
import json
from pathlib import Path
 
client = anthropic.Anthropic()  # ANTHROPIC_API_KEY from the environment
 
MANIFEST = Path("batch_manifest.jsonl")
 
def build_requests(items: list[dict]) -> list[dict]:
    """items: [{"id": "article-0001", "text": "..."}]"""
    requests = []
    with MANIFEST.open("a", encoding="utf-8") as mf:
        for item in items:
            # attempt=1 embedded in the custom_id (bumped to 2, 3 on requeue)
            custom_id = f"{item['id']}__a1"
            requests.append({
                "custom_id": custom_id,
                "params": {
                    "model": "claude-sonnet-5",
                    "max_tokens": 300,
                    "messages": [{
                        "role": "user",
                        "content": f"Return three fitting tags for this article as a JSON array:\n\n{item['text']}"
                    }],
                },
            })
            mf.write(json.dumps({
                "custom_id": custom_id,
                "source_id": item["id"],
                "input_sha": hashlib.sha256(item["text"].encode()).hexdigest()[:16],
                "attempt": 1,
            }) + "\n")
    return requests
 
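# items: your own list of {"id": ..., "text": ...} dicts
requests = build_requests(items)
batch = client.messages.batches.create(requests=requests)
# Report what was actually submitted, which can be fewer than len(items) once inputs are filtered
print(f"batch_id={batch.id} submitted={len(requests)}")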

The __a1 suffix is the load-bearing detail. Anthropic rejects duplicate custom_ids within one batch, but does nothing about duplicates across batches. Resubmit a failed item under its original ID and you can no longer tell which response belongs to which attempt when you aggregate. With the attempt number in the ID, the ledger always resolves "which attempt is current for this source_id" unambiguously.
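Here is one way the "which attempt is current" lookup could work, as a sketch that assumes the manifest record shape shown above (custom_id, source_id, attempt). parse_custom_id and current_attempts are hypothetical helpers, not part of any SDK.

```python
import json


def parse_custom_id(custom_id: str) -> tuple[str, int]:
    """Split 'article-0001__a2' into ('article-0001', 2)."""
    source_id, _, attempt = custom_id.rpartition("__a")
    return source_id, int(attempt)


def current_attempts(manifest_lines) -> dict[str, str]:
    """Map each source_id to the custom_id of its highest-numbered attempt."""
    latest: dict[str, tuple[int, str]] = {}
    for line in manifest_lines:
        rec = json.loads(line)
        known = latest.get(rec["source_id"])
        if known is None or rec["attempt"] > known[0]:
            latest[rec["source_id"]] = (rec["attempt"], rec["custom_id"])
    return {sid: cid for sid, (_, cid) in latest.items()}


# Three ledger lines: one item retried once, one item never retried.
lines = [
    json.dumps({"custom_id": "article-0001__a1", "source_id": "article-0001", "attempt": 1}),
    json.dumps({"custom_id": "article-0002__a1", "source_id": "article-0002", "attempt": 1}),
    json.dumps({"custom_id": "article-0001__a2", "source_id": "article-0001", "attempt": 2}),
]
current = current_attempts(lines)
print(current)
```

Using rpartition keeps the parse safe even if a source ID happens to contain the separator itself, because only the last occurrence is treated as the attempt marker.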

After being burned by those empty bodies, I also added a gate at the top of build_requests — if not item["text"].strip(): continue — and log what it rejects. The cheapest place to fix an invalid_request is before it enters the batch.
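A sketch of that gate pulled out into its own function, assuming items shaped like the ones passed to build_requests. split_submittable and the logger name are hypothetical; the point is only that rejected inputs get logged and counted instead of silently skipped.

```python
import logging

logger = logging.getLogger("batch_gate")


def split_submittable(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate items with a non-empty body from those that would fail validation."""
    accepted, rejected = [], []
    for item in items:
        text = item.get("text") or ""
        if text.strip():
            accepted.append(item)
        else:
            # Log the rejection so the gap is explained before the batch is even sent.
            logger.warning("skipping %s: empty body", item.get("id"))
            rejected.append(item)
    return accepted, rejected


items = [
    {"id": "article-0001", "text": "A real article body."},
    {"id": "article-0002", "text": "   \n"},
    {"id": "article-0003", "text": ""},
]
accepted, rejected = split_submittable(items)
print(f"accepted={len(accepted)} rejected={len(rejected)}")
```

Returning the rejected list, not just dropping it, means the submitted count plus the rejected count always equals the input count, which is the same accounting idea the ledger applies after the batch ends.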

Poll with Exponential Backoff — the 24-Hour Window in Practice

It's tempting to poll every five seconds. For large batches that's wasted traffic. In my logs, the same 20,000-request batch (≈900 input tokens each) finished in 62 minutes on a quiet day and 4 hours 11 minutes on a busy Monday afternoon. Treat completion time as unknowable and widen the interval as you wait.

import time
 
def wait_for_batch(batch_id: str, base: float = 30.0, cap: float = 600.0) -> None:
    interval = base
    while True:
        status = client.messages.batches.retrieve(batch_id)
        counts = status.request_counts
        print(f"processing={counts.processing} succeeded={counts.succeeded} "
              f"errored={counts.errored} expired={counts.expired}")
        if status.processing_status == "ended":
            return
        time.sleep(interval)
        interval = min(interval * 1.5, cap)  # 30s → 45s → ... → 10 min max

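One caveat about request_counts: the API reference describes succeeded, errored, canceled, and expired as zero until processing of the whole batch has ended, so until the final poll the printed line mostly shows the processing count. Read the counts the moment the batch ends and you still catch the leak the same day, not three days later. I page myself when errored exceeds 1% of submitted.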
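To put a number on the wasted traffic, here is a small simulation of the polling schedule. It is a sketch under one assumption: the batch is treated as ending at a known instant, which is never true in practice but is enough to compare schedules. count_polls is a hypothetical helper that mirrors the base, factor, and cap used in wait_for_batch.

```python
def count_polls(duration_s: float, base: float = 30.0,
                factor: float = 1.5, cap: float = 600.0) -> tuple[int, float]:
    """Return (retrieve calls made, seconds between batch end and the poll that noticed)."""
    t, interval, polls = 0.0, base, 0
    while True:
        polls += 1                 # one retrieve() call at time t
        if t >= duration_s:        # this poll is the first to see "ended"
            return polls, t - duration_s
        t += interval              # sleep, then widen the interval up to the cap
        interval = min(interval * factor, cap)


quiet_day = 62 * 60  # the 62-minute run, in seconds
backoff_polls, backoff_lag = count_polls(quiet_day)
fixed_polls, fixed_lag = count_polls(quiet_day, base=5.0, factor=1.0, cap=5.0)
print(f"backoff: {backoff_polls} calls, noticed {backoff_lag:.0f}s late")
print(f"fixed 5s: {fixed_polls} calls, noticed {fixed_lag:.0f}s late")
```

For the 62-minute run this works out to 13 retrieve calls with backoff against 745 at a fixed five seconds. The trade is that completion can be noticed up to one cap interval (10 minutes) late, which rarely matters for a job measured in hours.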

Reconcile — Diff the Ledger Against the Results Stream

Once the batch has ended, classify every result as succeeded, retryable, or permanent, then diff the results against the manifest to find anything missing.

from collections import defaultdict
 
def reconcile(batch_id: str) -> dict:
    # 1. What we believe we submitted
    expected = {}
    with MANIFEST.open(encoding="utf-8") as mf:
        for line in mf:
            rec = json.loads(line)
            expected[rec["custom_id"]] = rec
 
    buckets = defaultdict(list)
    seen = set()
 
    # 2. Classify the results stream
    for result in client.messages.batches.results(batch_id):
        seen.add(result.custom_id)
        rtype = result.result.type
        if rtype == "succeeded":
            buckets["succeeded"].append(result)
        elif rtype == "errored":
            etype = result.result.error.error.type
            key = "retryable" if etype in (
                "overloaded_error", "api_error", "rate_limit_error"
            ) else "permanent"
            buckets[key].append(result)
        else:  # canceled / expired
            buckets["retryable"].append(result)
 
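    # 3. IDs in the ledger that never appeared in results (should be zero; verify it).
    #    Assumes the manifest holds only this batch's entries. If the file is shared
    #    across batches or attempts, filter `expected` to this batch's IDs first.
    missing = [cid for cid in expected if cid not in seen]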
 
    print(f"succeeded={len(buckets['succeeded'])} "
          f"retryable={len(buckets['retryable'])} "
          f"permanent={len(buckets['permanent'])} missing={len(missing)}")
    return {"buckets": buckets, "missing": missing, "expected": expected}

The dividing line is the error type. overloaded_error, api_error, and rate_limit_error heal with time; invalid_request_error means the request itself is broken and will fail identically every time. Treating both as one generic "error" was my original mistake.
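One thing the reconcile function leaves open: the manifest is append-only, so after a requeue it holds IDs from more than one batch, and a naive diff would report every older attempt as missing. A possible way to scope the diff, sketched under the assumption that you write a second small index file right after each submission; record_batch, ids_for_batch, find_missing, the file name, and the batch IDs below are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def record_batch(index_path: Path, batch_id: str, custom_ids: list[str]) -> None:
    """Append one line linking a batch ID to the custom_ids submitted in it."""
    with index_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"batch_id": batch_id, "custom_ids": custom_ids}) + "\n")


def ids_for_batch(index_path: Path, batch_id: str) -> set[str]:
    """Collect the custom_ids recorded for one batch."""
    ids: set[str] = set()
    with index_path.open(encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["batch_id"] == batch_id:
                ids.update(rec["custom_ids"])
    return ids


def find_missing(expected_ids: set[str], seen_ids: set[str]) -> list[str]:
    """IDs we submitted in this batch that never showed up in its results."""
    return sorted(expected_ids - seen_ids)


with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "batch_index.jsonl"
    record_batch(index, "batch_A", ["article-0001__a1", "article-0002__a1"])
    record_batch(index, "batch_B", ["article-0002__a2"])  # the requeue batch
    # Batch B returned its only request; batch A returned one of two.
    missing_b = find_missing(ids_for_batch(index, "batch_B"), {"article-0002__a2"})
    missing_a = find_missing(ids_for_batch(index, "batch_A"), {"article-0001__a1"})

print("batch_A missing:", missing_a, "| batch_B missing:", missing_b)
```

With the index in place, the missing count for a requeue batch stays at zero unless something from that specific batch truly failed to come back.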

One more trap worth naming: the results stream is not guaranteed to preserve submission order. Join results to inputs by index and your data corrupts silently the first time the order shifts. Always join on custom_id. Also note that results remain downloadable for 29 days from batch creation — relevant if your aggregation runs monthly.
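The difference between the two joins is easy to demonstrate. This sketch uses plain dicts in place of real inputs and results, and reverses the result order to stand in for any reordering; join_by_custom_id is a hypothetical helper.

```python
from typing import Optional


def join_by_custom_id(inputs: list[dict], results: list[dict]) -> list[tuple[dict, Optional[dict]]]:
    """Pair each input with the result that carries the same custom_id (None if absent)."""
    by_id = {r["custom_id"]: r for r in results}
    return [(inp, by_id.get(inp["custom_id"])) for inp in inputs]


inputs = [{"custom_id": f"article-{i:04d}__a1", "title": f"Post {i}"} for i in range(1, 6)]
results = [{"custom_id": inp["custom_id"], "tags": [f"tag-for-{inp['title']}"]} for inp in inputs]
results.reverse()  # simulate a results stream that arrives in a different order

# Index-based join: silently pairs inputs with the wrong results.
index_mismatches = sum(inp["custom_id"] != res["custom_id"] for inp, res in zip(inputs, results))

# Key-based join: order no longer matters.
joined = join_by_custom_id(inputs, results)
key_mismatches = sum(res is None or res["custom_id"] != inp["custom_id"] for inp, res in joined)

print(f"index join mismatches={index_mismatches} key join mismatches={key_mismatches}")
```

The index join raises no error at all, which is exactly why it is dangerous: the rows look complete and the tags are simply attached to the wrong articles.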

Requeue — Bump the Attempt, Cap the Attempts

Only the retryable bucket goes back out, in a fresh batch, under attempt-bumped IDs.

MAX_ATTEMPTS = 3
 
def requeue(recon: dict, items_by_id: dict) -> list[dict]:
    retry_requests = []
    with MANIFEST.open("a", encoding="utf-8") as mf:
        for result in recon["buckets"]["retryable"]:
            rec = recon["expected"][result.custom_id]
            next_attempt = rec["attempt"] + 1
            if next_attempt > MAX_ATTEMPTS:
                print(f"give up: {rec['source_id']} (attempt {rec['attempt']})")
                continue
            new_cid = f"{rec['source_id']}__a{next_attempt}"
            src = items_by_id[rec["source_id"]]
            retry_requests.append({
                "custom_id": new_cid,
                "params": {  # rebuild params from source data, not from the result object
                    "model": "claude-sonnet-5",
                    "max_tokens": 300,
                    "messages": [{"role": "user",
                                  "content": f"Return three fitting tags for this article as a JSON array:\n\n{src['text']}"}],
                },
            })
            mf.write(json.dumps({"custom_id": new_cid, "source_id": rec["source_id"],
                                 "input_sha": rec["input_sha"],
                                 "attempt": next_attempt}) + "\n")
    return retry_requests

Always set a ceiling (three here). An uncapped auto-requeue plus one misclassified permanent failure equals a machine that converts your budget into identical errors. Items that exhaust their attempts stay in the ledger for a human to read the next morning. In the 20,000-request run, 31 of 32 retryable items succeeded on the second attempt; one needed a third.

Protect the aggregation side too. Because multiple attempts of the same source_id can theoretically both succeed, my aggregation query keeps only the highest-attempt success per source_id — double-counting becomes structurally impossible.
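My version of this lives in a query, but the same rule can be sketched in Python. The row shape here (source_id, attempt, result_type, tags) is an assumption for illustration, and latest_success_per_source is a hypothetical name.

```python
def latest_success_per_source(rows: list[dict]) -> dict[str, dict]:
    """Keep one row per source_id: the succeeded row with the highest attempt number."""
    best: dict[str, dict] = {}
    for row in rows:
        if row["result_type"] != "succeeded":
            continue  # failures never reach the aggregate
        current = best.get(row["source_id"])
        if current is None or row["attempt"] > current["attempt"]:
            best[row["source_id"]] = row
    return best


rows = [
    {"source_id": "article-0001", "attempt": 1, "result_type": "succeeded", "tags": ["a"]},
    {"source_id": "article-0002", "attempt": 1, "result_type": "errored", "tags": None},
    {"source_id": "article-0002", "attempt": 2, "result_type": "succeeded", "tags": ["b"]},
    # The same source succeeding twice must still count once.
    {"source_id": "article-0003", "attempt": 1, "result_type": "succeeded", "tags": ["old"]},
    {"source_id": "article-0003", "attempt": 2, "result_type": "succeeded", "tags": ["new"]},
]
final = latest_success_per_source(rows)
print({sid: row["tags"] for sid, row in final.items()})
```

Because the output is keyed by source_id, the row count of the aggregate can never exceed the number of distinct sources, no matter how many attempts were made.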

Measured Costs — Sonnet 5 Introductory Pricing × the 50% Batch Discount

Claude Sonnet 5, released June 30, 2026, carries introductory pricing of $2 input / $10 output per million tokens through August 31, 2026 ($3 / $15 after). Batches take a further 50% off, so during the introductory window the effective rate is $1 input / $5 output.

For the 20,000-request run (≈900 input, ≈250 output tokens each):

Path	Input 18M tok	Output 5M tok	Total
Real-time API (intro pricing)	$36.00	$50.00	$86.00
Batch (intro × 50%)	$18.00	$25.00	$43.00
Requeue overhead (32 items)	$0.03	$0.04	$0.07

The whole reconcile-and-requeue apparatus costs rounding error. The hour I spent manually investigating 41 missing rows cost far more. Since moving to submit-at-night, read-one-report-in-the-morning, batches stopped being a submit-and-pray exercise.
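The table is simple arithmetic, sketched here so you can plug in your own token counts. The prices and the 50% discount are the figures quoted above; batch_cost is a hypothetical helper and assumes every request uses the same average token counts.

```python
def batch_cost(requests: int, in_tok: int, out_tok: int,
               in_price: float, out_price: float, discount: float = 0.0) -> dict:
    """Cost in dollars; prices are per million tokens, discount is a fraction (0.5 = 50% off)."""
    input_cost = requests * in_tok / 1_000_000 * in_price * (1 - discount)
    output_cost = requests * out_tok / 1_000_000 * out_price * (1 - discount)
    return {
        "input": round(input_cost, 2),
        "output": round(output_cost, 2),
        "total": round(input_cost + output_cost, 2),
    }


# Introductory pricing: $2 input / $10 output per million tokens.
realtime = batch_cost(20_000, 900, 250, in_price=2.0, out_price=10.0)
batched = batch_cost(20_000, 900, 250, in_price=2.0, out_price=10.0, discount=0.5)
requeue = batch_cost(32, 900, 250, in_price=2.0, out_price=10.0, discount=0.5)
# Same batch at the post-introductory $3 / $15 rates.
batched_later = batch_cost(20_000, 900, 250, in_price=3.0, out_price=15.0, discount=0.5)

for label, cost in [("real-time", realtime), ("batch", batched),
                    ("requeue", requeue), ("batch after intro", batched_later)]:
    print(f"{label}: {cost}")
```

The last call is the one worth rerunning before the introductory window closes, since the same job gets noticeably more expensive at the regular rates.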

A batch tops out at 100,000 requests or 256MB, but I split jobs at 20,000–30,000 anyway. Expirations cluster in the tail of large batches, and smaller batches keep both the blast radius and the requeue small.
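Splitting is a few lines of slicing. This sketch caps only the request count; it does not measure payload size, so the 256MB limit would still need its own check for unusually long inputs. split_into_batches is a hypothetical helper.

```python
def split_into_batches(requests: list[dict], max_requests: int = 20_000) -> list[list[dict]]:
    """Slice a request list into consecutive chunks of at most max_requests items."""
    if max_requests < 1:
        raise ValueError("max_requests must be at least 1")
    return [requests[i:i + max_requests] for i in range(0, len(requests), max_requests)]


# 45,000 placeholder requests; only the custom_id matters for this demonstration.
all_requests = [{"custom_id": f"article-{i:05d}__a1"} for i in range(45_000)]
chunks = split_into_batches(all_requests)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be submitted as its own batch and reconciled on its own, which keeps a bad afternoon from affecting more than one slice of the job.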

Make Reconciliation a Road You Always Travel

A safety net you can skip is a safety net you will skip. My cron pipeline runs reconciliation unconditionally as its final step and logs one summary line.

Submission appends to the manifest (submission and ledger write treated as one transaction)
After ended, reconcile and always log the four counts: succeeded / retryable / permanent / missing
Alert a human if missing > 0 or permanent > 0.5% of submitted
Auto-requeue retryables with the attempt cap at 3

With those four lines of process, discovering a gap went from "a vague unease three days later" to "one log line the same day."
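The alert rule from the list above fits in one function. A sketch, assuming the counts come from the reconciliation summary; alert_reasons is a hypothetical name and the 0.5% threshold is the one stated in the list.

```python
def alert_reasons(submitted: int, permanent: int, missing: int,
                  permanent_threshold: float = 0.005) -> list[str]:
    """Return the reasons a human should look at this run; an empty list means stay quiet."""
    reasons = []
    if missing > 0:
        reasons.append(f"{missing} custom_id(s) in the ledger never appeared in results")
    if submitted > 0 and permanent / submitted > permanent_threshold:
        reasons.append(f"permanent failures at {permanent / submitted:.2%} of submitted")
    return reasons


# The run from this article: 9 permanent failures out of 20,000, nothing missing.
quiet = alert_reasons(submitted=20_000, permanent=9, missing=0)
# A run that should wake someone up.
noisy = alert_reasons(submitted=20_000, permanent=150, missing=2)

print("quiet run:", quiet)
print("noisy run:", noisy)
```

Returning reasons instead of a bare boolean means the page itself says what went wrong, which matters when it arrives at night.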

The Next Step

Open your existing batch consumer and look for the spot where it picks up successes without branching on result.type. That is the single most valuable thing to fix today. The ledger and reconciliation above drop into most pipelines in under an hour.

If you run unattended batch jobs like I do, I hope these notes save you the three confusing days they cost me.
