API & SDK · 2026-07-01 · Advanced

When Claude API Document Extraction Is Confidently Wrong — Field Notes on Catching Silent Errors with Invariants

In structured extraction from invoices and contracts, the real danger isn't a crash — it's a value that's silently wrong while the schema validates and confidence reads high. Field notes on invariants, two-pass extraction, and tracking field-level error rates.

Let me start with the moment that scared me most while running an invoice pipeline. Validation passed. The Zod schema parsed cleanly, and the model reported a confidence of 0.95. Yet total was off by an order of magnitude. No crash, no exception. I only caught it at aggregation time, when one vendor's monthly total looked suspiciously small, and traced it back from there.

The real enemy in structured extraction isn't the failure that stops. A failure that stops lands in your logs and gets picked up on retry. What's dangerous is the extraction where everything is formally correct and only the value is quietly wrong. Here are the design choices that worked when I ran document extraction on Claude API in production — with working code for catching the silent errors.

Why schema validation isn't enough

Schema validation only guarantees shape. It confirms that total is a number and issueDate is a string, but it says nothing about whether those values match the document.

Silent errors tend to fall into three patterns. First, misread digits and commas: reading 1,250,000 as 1250.00, or letting a ¥ sign or a decimal point shift the magnitude. Second, swapped fields: putting dueDate into the issueDate slot, or picking up subtotal and total in reverse. Third, "plausible completion": filling in a tax rate that the document never stated with a generic value.

All three come back with high confidence. A model's self-reported confidence reflects "how sure I am about my own reading," not "agreement with the truth." Conflate the two and you get the worst kind of incident: the highest-confidence field is the most dangerous one.

Invariants — make the document check its own arithmetic

Even with no external ground truth, a document carries relationships you can use to check itself. On an invoice, the line items sum to the subtotal, and the subtotal plus tax equals the total. For dates, the issue date precedes the due date. Encode these as invariants and verify them mechanically right after extraction.

Zod's superRefine lets you fold type checks and invariant checks into a single schema.

// src/schema/invoice.ts
import { z } from "zod";

const Money = z.number().finite().nonnegative();

const LineItem = z.object({
  description: z.string().min(1),
  // nullish = undefined or null: the extraction prompt asks for null when a value is absent
  quantity: z.number().positive().nullish(),
  unitPrice: Money.nullish(),
  amount: Money,
});

// Tolerance: allow rounding drift up to 1 currency unit
// (sized for JPY; tighten it for currencies with minor units)
const EPS = 1;

export const InvoiceSchema = z
  .object({
    invoiceNumber: z.string().nullish(),
    issueDate: z.string().nullish(),   // expect ISO 8601
    dueDate: z.string().nullish(),
    currency: z.string().default("JPY"),
    lineItems: z.array(LineItem).min(1),
    subtotal: Money.nullish(),
    tax: Money.nullish(),
    total: Money,
  })
  .superRefine((inv, ctx) => {
    // Invariant 1: sum(lineItems) ~= subtotal
    if (inv.subtotal != null) {
      const sum = inv.lineItems.reduce((s, li) => s + li.amount, 0);
      if (Math.abs(sum - inv.subtotal) > EPS) {
        ctx.addIssue({
          code: z.ZodIssueCode.custom,
          path: ["subtotal"],
          message: `lineItems sum (${sum}) != subtotal (${inv.subtotal})`,
        });
      }
    }

    // Invariant 2: subtotal + tax ~= total
    if (inv.subtotal != null && inv.tax != null) {
      const expected = inv.subtotal + inv.tax;
      if (Math.abs(expected - inv.total) > EPS) {
        ctx.addIssue({
          code: z.ZodIssueCode.custom,
          path: ["total"],
          message: `subtotal+tax (${expected}) != total (${inv.total})`,
        });
      }
    }

    // Invariant 3: issueDate <= dueDate
    if (inv.issueDate && inv.dueDate) {
      const issued = Date.parse(inv.issueDate);
      const due = Date.parse(inv.dueDate);
      // An unparseable date yields NaN, and NaN comparisons are always false,
      // so it would pass silently unless it is flagged explicitly
      if (Number.isNaN(issued) || Number.isNaN(due) || issued > due) {
        ctx.addIssue({
          code: z.ZodIssueCode.custom,
          path: ["dueDate"],
          message: "issueDate is later than dueDate, or a date is unparseable",
        });
      }
    }

    // Invariant 4: each line's quantity * unitPrice ~= amount
    inv.lineItems.forEach((li, i) => {
      if (li.quantity != null && li.unitPrice != null) {
        const expected = li.quantity * li.unitPrice;
        if (Math.abs(expected - li.amount) > EPS) {
          ctx.addIssue({
            code: z.ZodIssueCode.custom,
            path: ["lineItems", i, "amount"],
            message: `quantity*unitPrice (${expected}) != amount (${li.amount})`,
          });
        }
      }
    });
  });

export type Invoice = z.infer<typeof InvoiceSchema>;

The key is not to swallow invariant violations in an exception, but to keep a structured record of which field (path) contradicted what. Keep the path so you can feed it straight into downstream routing and re-extraction. Take results with safeParse and carry error.issues forward.

// src/validate.ts
import { InvoiceSchema, type Invoice } from "./schema/invoice";
 
export type FieldFault = { path: string; message: string };
 
export function validateInvoice(raw: unknown): {
  ok: boolean;
  data?: Invoice;
  faults: FieldFault[];
} {
  const result = InvoiceSchema.safeParse(raw);
  if (result.success) return { ok: true, data: result.data, faults: [] };
 
  const faults: FieldFault[] = result.error.issues.map((iss) => ({
    path: iss.path.join("."),
    message: iss.message,
  }));
  return { ok: false, faults };
}

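At this point, most misread digits and swapped fields surface as "the shape is fine but the arithmetic doesn't add up." If total is off by an order of magnitude, the subtotal-plus-tax check will almost always catch it, provided the document states a subtotal and a tax for the check to run against. When either is missing, invariant 2 is skipped and total passes unchecked.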
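
When the total check fires, it can help to know whether the mismatch looks like a power-of-ten slip (the "1,250,000 read as 1250.00" pattern) or an ordinary disagreement. The following is a minimal sketch, not part of the pipeline above: the helper name magnitudeShift and the 1% relative tolerance are assumptions made for illustration.

```typescript
// Hypothetical helper: classify a failed invariant as a likely power-of-ten slip.
// Returns the exponent k when actual is close to expected * 10^k (k != 0), else null.
export function magnitudeShift(
  expected: number,
  actual: number,
  relTol = 0.01 // assumption: 1% relative tolerance absorbs rounding drift
): number | null {
  if (!(expected > 0) || !(actual > 0)) return null;
  const k = Math.round(Math.log10(actual / expected));
  if (k === 0) return null; // same magnitude: not a digit or comma slip
  const rescaled = actual / Math.pow(10, k);
  return Math.abs(rescaled - expected) <= relTol * expected ? k : null;
}
```

With subtotal + tax = 1,375,000 and an extracted total of 1375, magnitudeShift(1375000, 1375) returns -3, which is a strong hint that separators or a decimal point were misread rather than that the wrong field was picked up.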

Calibrate model confidence against the invariant-failure rate

Asking the model to return a confidence during extraction is useful. But don't trust that number as-is. Run a calibration pass once.

Against a small labeled sample (a few dozen documents is enough), bucket the model's self-reported confidence into bins (say 0.9–1.0, 0.8–0.9, …) and count how often each bin actually produces an invariant violation or a mismatch with the ground truth. In my data, even the 0.9-and-above bin failed invariants a few percent of the time. So "0.9 means safe" becomes "even at 0.9, a few percent of fields are wrong and still need to be defended by arithmetic checks."

// src/calibration.ts
type Sample = { reported: number; failed: boolean }; // failed = invariant violation or mismatch

export function calibrate(samples: Sample[], bins = 5) {
  const table = Array.from({ length: bins }, (_, b) => ({
    range: [b / bins, (b + 1) / bins] as [number, number],
    n: 0,
    failed: 0,
  }));
  for (const s of samples) {
    if (!Number.isFinite(s.reported)) continue; // skip missing or NaN confidences
    // clamp so 1.0 lands in the top bin and out-of-range values cannot index outside the table
    const idx = Math.max(0, Math.min(bins - 1, Math.floor(s.reported * bins)));
    table[idx].n++;
    if (s.failed) table[idx].failed++;
  }
  // return the empirical error rate per bin, with n so thinly populated bins stay visible
  return table.map((t) => ({
    reportedRange: t.range,
    n: t.n,
    empiricalErrorRate: t.n ? t.failed / t.n : null,
  }));
}

Use this corrected error rate for thresholding. Instead of the model's raw 0.95, decide whether to send a document to human review based on the empirical error rate of its bin. Re-running calibration once a quarter, or whenever you swap the model, is enough. The curve does shift on model updates — I redrew it when I moved to claude-opus-4-8.
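
Thresholding on the corrected rate means mapping each raw confidence back to its bin. Here is a minimal sketch of that lookup, assuming equal-width bins over 0 to 1 as produced by calibrate. The helper name empiricalErrorFor and the BinRow type are illustrative, not part of the original code.

```typescript
// Shape of one row returned by calibrate(); extra properties such as n are fine.
type BinRow = { reportedRange: [number, number]; empiricalErrorRate: number | null };

// Hypothetical helper: map a raw self-reported confidence to its bin's empirical error rate.
// Returns null when the bin had no samples; treat that as "unknown", not as "safe".
export function empiricalErrorFor(table: BinRow[], reported: number): number | null {
  if (table.length === 0 || !Number.isFinite(reported)) return null;
  const idx = Math.max(
    0,
    Math.min(table.length - 1, Math.floor(reported * table.length))
  );
  return table[idx].empiricalErrorRate;
}
```

A call such as empiricalErrorFor(calibrate(samples), 0.95) gives the number to compare against the review threshold, in place of the raw 0.95.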

Two-pass extraction — re-pull only the failed fields

Extracting every document with a top-tier model from the start doesn't pencil out. What worked in practice was a two-pass design: extract everything with a cheap model, then re-extract only the fields that failed invariants with a stronger model.

The first pass uses claude-haiku-4-5 or claude-sonnet-4-6. The second pass re-extracts only the fields named in faults, with claude-opus-4-8. Focusing the prompt on the contradicted fields, rather than the whole document, keeps both tokens and decision drift down.

// src/extract.ts
import Anthropic from "@anthropic-ai/sdk";
import { validateInvoice, type FieldFault } from "./validate";

const client = new Anthropic();

// Field list spelled out for the model; keep it in line with InvoiceSchema
const SCHEMA_HINT =
  "Fields: invoiceNumber, issueDate, dueDate (ISO 8601 strings), currency, " +
  "lineItems [{ description, quantity, unitPrice, amount }], subtotal, tax, total (numbers).";

// Pass 1: extract everything (cheap model)
async function extractAll(docText: string): Promise<unknown> {
  const res = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system:
      `Extract the invoice into this schema and return JSON only. ${SCHEMA_HINT} ` +
      "Use null for values not present. Do not guess.",
    messages: [{ role: "user", content: docText }],
  });
  return JSON.parse(textOf(res));
}

// Each re-read field comes back as a value plus the snippet that justifies it
type Requoted = { value: unknown; quote: string | null };

// Pass 2: re-pull only the failed fields by name (stronger model)
async function reextractFields(
  docText: string,
  faults: FieldFault[]
): Promise<Record<string, Requoted>> {
  const fields = [...new Set(faults.map((f) => f.path.split(".")[0]))];
  const res = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    system:
      "Re-read ONLY these fields strictly from the invoice. For each value, also return the source snippet as a quote. " +
      'Return JSON only, shaped as { "<field>": { "value": ..., "quote": "<source snippet>" } }. Use null if not present.',
    messages: [
      {
        role: "user",
        content:
          `Target fields: ${fields.join(", ")}\n` +
          `Contradictions found by arithmetic checks:\n` +
          faults.map((f) => `- ${f.path}: ${f.message}`).join("\n") +
          `\n\n--- SOURCE ---\n${docText}`,
      },
    ],
  });
  return JSON.parse(textOf(res)) as Record<string, Requoted>;
}

export async function extractInvoice(docText: string) {
  const first = await extractAll(docText);
  let v = validateInvoice(first);
  if (v.ok) return { data: v.data, passes: 1, faults: [], quotes: {} };

  // patch only the failed fields with the stronger model, then merge
  const patch = await reextractFields(docText, v.faults);
  // unwrap { value, quote }: merging the wrapper object itself would put an
  // object where the schema expects a number or string
  const values: Record<string, unknown> = {};
  const quotes: Record<string, string | null> = {};
  for (const [field, r] of Object.entries(patch)) {
    values[field] = r?.value;
    quotes[field] = r?.quote ?? null;
  }
  const merged = { ...(first as object), ...values };
  v = validateInvoice(merged);

  return {
    data: v.ok ? v.data : undefined,
    passes: 2,
    faults: v.faults, // still failing after pass 2 -> human
    quotes,           // provenance for the reviewer
  };
}

function textOf(res: Anthropic.Message): string {
  // take the first text block rather than assuming it is content[0]
  const block = res.content.find((b) => b.type === "text");
  const t = block && block.type === "text" ? block.text : "";
  const m = t.match(/```json\s*([\s\S]*?)\s*```/) || t.match(/\{[\s\S]*\}/);
  return m ? (m[1] ?? m[0]) : t;
}
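
One caveat about re-pulling "only the failed fields": a fault path names just one side of a contradiction. If subtotal was misread, it is the total check that fires, and re-reading total alone will not resolve it. A possible refinement, sketched under the assumption that the peer map below mirrors the four invariants in InvoiceSchema, is to widen the target list to every field that takes part in the failed invariant. The names INVARIANT_PEERS and fieldsToReextract are hypothetical.

```typescript
// Assumption: this map mirrors the four invariants in InvoiceSchema and is kept in sync by hand.
const INVARIANT_PEERS = new Map<string, string[]>([
  ["subtotal", ["lineItems", "subtotal"]],     // invariant 1: sum(lineItems) ~= subtotal
  ["total", ["subtotal", "tax", "total"]],     // invariant 2: subtotal + tax ~= total
  ["dueDate", ["issueDate", "dueDate"]],       // invariant 3: issueDate <= dueDate
  ["lineItems", ["lineItems"]],                // invariant 4: quantity * unitPrice ~= amount
]);

// Expand fault paths into the de-duplicated list of top-level fields to re-read.
export function fieldsToReextract(faults: { path: string }[]): string[] {
  const out = new Set<string>();
  for (const f of faults) {
    const head = f.path.split(".")[0];
    // plain type errors (no invariant involved) fall back to the field itself
    for (const peer of INVARIANT_PEERS.get(head) ?? [head]) out.add(peer);
  }
  return Array.from(out);
}
```

The trade-off is a few more tokens in pass 2 in exchange for not re-reading the one field that happened to be correct.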

Having the second pass return a quote — the source snippet that justifies each value — pays off. Attaching provenance to a value, not just the value itself, makes a human review far faster, and if the quote doesn't actually appear in the document you can reject it as a likely hallucination.
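
The quote check itself can be very small. The sketch below assumes that whitespace and letter case are the only differences worth forgiving between the quote and the source text. quoteIsGrounded is an illustrative name, not something from the pipeline above.

```typescript
// Collapse whitespace and case so line breaks in extracted text don't cause false rejections.
const normalize = (s: string): string => s.replace(/\s+/g, " ").trim().toLowerCase();

// Hypothetical helper: true when the quote occurs in the source, modulo whitespace and case.
export function quoteIsGrounded(
  quote: string | null | undefined,
  docText: string
): boolean {
  if (!quote) return false;
  const q = normalize(quote);
  return q.length > 0 && normalize(docText).includes(q);
}
```

A grounded quote only shows that the snippet exists in the document, not that the value was read from it correctly, so this complements the invariants rather than replacing them.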

When to send a document to human review

Documents that still carry faults after the second pass, and documents whose amounts exceed a threshold, go to a human regardless of the machine's confidence. Keep the criteria modest and tighten them with numbers as you operate. In my setup, a document is routed to review if any of these hold: (1) invariants still fail after two-pass extraction, (2) it lands in a bin whose corrected error rate exceeds a threshold, or (3) total exceeds an internal amount limit.
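
Those three criteria can be written as one small function that returns the reasons a document was routed, which makes the review rate easier to break down later. This is a sketch only: the type names, the reason strings, and the choice to send "unknown" values (an empty calibration bin, a missing total) to review are assumptions, and both limits are placeholders to tune from your own data.

```typescript
type ReviewInput = {
  faultsAfterPass2: number;    // invariant violations left after the second pass
  binErrorRate: number | null; // corrected error rate of the confidence bin, null if unknown
  total: number | null;        // extracted invoice total
};

// Assumption: placeholder limits, not recommended values.
type ReviewPolicy = { maxBinErrorRate: number; maxAutoTotal: number };

// Returns every reason that applies; an empty array means "no review needed".
export function reviewReasons(x: ReviewInput, p: ReviewPolicy): string[] {
  const reasons: string[] = [];
  if (x.faultsAfterPass2 > 0) reasons.push("invariants-still-failing");
  // an empty calibration bin is unknown risk, so it is routed rather than waved through
  if (x.binErrorRate === null || x.binErrorRate > p.maxBinErrorRate) {
    reasons.push("bin-error-rate");
  }
  if (x.total === null || x.total > p.maxAutoTotal) reasons.push("amount-threshold");
  return reasons;
}
```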

Sending something to review is not a failure. The goal is to stay in a state where mistakes are caught automatically, not to push everything through unattended. As long as the review rate stays in the low single-digit percent, operations run just fine.

Measurement — track field-level error rate

This was the biggest shift in approach. At first I watched "document-level success rate," which is far too coarse. A document with one wrong field out of ten gets rounded down to "failure," and you lose all visibility into which field is weak.

I switched to field-level error rate: per field (total, issueDate, each line item's amount), measure the invariant-violation rate and (where samples exist) the ground-truth mismatch rate. Suddenly you can see things like "dueDate swaps dominate" or "scanned PDFs lose lineItems more often" at a grain where you can actually act.

// src/metrics.ts
type Run = { faults: { path: string }[]; fields: string[] };

export function fieldErrorRates(runs: Run[]) {
  const total: Record<string, number> = {};
  const bad: Record<string, number> = {};
  for (const r of runs) {
    for (const f of r.fields) total[f] = (total[f] ?? 0) + 1;
    // count each field at most once per run, so several faults under
    // lineItems in one document cannot push its rate above 1
    const heads = new Set(r.faults.map((fault) => fault.path.split(".")[0]));
    for (const head of heads) bad[head] = (bad[head] ?? 0) + 1;
  }
  return Object.keys(total)
    .map((f) => ({ field: f, rate: (bad[f] ?? 0) / total[f] }))
    .sort((a, b) => b.rate - a.rate);
}

Looking at this table weekly tells you whether the fix is a prompt tweak, another invariant, or input normalization (the rasterization quality of scanned PDFs). Document-level success rate would never let you make that distinction.
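
fieldErrorRates groups faults by top-level field, so every line-item problem lands under lineItems. To see the finer grain (amounts versus quantities, say), fault paths can be normalized so that array indices collapse into one key. A minimal sketch follows. The "[]" notation and the names fieldKey and countByFieldKey are assumptions for illustration.

```typescript
// "lineItems.3.amount" -> "lineItems[].amount"; "total" -> "total"
export function fieldKey(path: string): string {
  return path
    .split(".")
    .map((seg) => (/^\d+$/.test(seg) ? "[]" : seg)) // array index -> placeholder
    .join(".")
    .replace(/\.\[\]/g, "[]"); // attach the placeholder to its parent segment
}

// Tally faults per normalized key across any number of runs.
export function countByFieldKey(faults: { path: string }[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const f of faults) {
    const key = fieldKey(f.path);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

These are raw fault counts rather than rates. Dividing by the number of line items seen would be the natural next step if line-item volume varies a lot between vendors.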

As an indie developer who runs an app business and, alongside it, a mostly hands-off auto-publishing setup for several technical blogs, I've learned the same lesson more than once. When reconciling AdMob revenue reports against the billing side at month-end, for instance, I used to settle for "the grand total matches." Then one month a single currency row drifted quietly, and at the granularity of the total I never caught it. The more a whole system looks like it's "roughly working," the easier it is to miss that one point is quietly broken when you only watch a coarse success rate. Extraction pipelines are no different: only after splitting down to the field level could I say, with numbers in hand, where a fix would do the most good.

Wrapping up

As a next step, add a single invariant to the pipeline you're running now. For invoices, even just the "subtotal + tax = total" check will surface most order-of-magnitude misreads. From there, grow into re-extracting only the failed fields and measuring at the field level, and you'll steadily move a "confidently wrong" extractor toward an operation that quietly refuses to let mistakes through.
