API & SDK · 2026-06-17 · Advanced

When Claude API Extracts the Wrong Value With Full Confidence — Designing the Verification Layer

When you extract invoices or contracts with Claude API, the scariest failure isn't an exception — it's plausible-but-wrong JSON. Here is how I build a verification layer that catches silent extraction errors with schema checks, arithmetic reconciliation, and dual-extraction agreement, in TypeScript.

A few weeks into running an automated invoice intake, someone asked me why the accounting numbers didn't add up even though the error logs were spotless. The cause was easy to find. On one low-resolution scan, Claude had misread a digit in the total amount, reported it as confidence: 0.96, and sailed through without throwing anything.

As an indie developer running several services in parallel, I assumed at first that exception handling alone could keep this class of error out. The most dangerous thing in structured extraction isn't the API going down or the JSON breaking. It is a plausible but wrong value flowing downstream without raising a single exception. JSON.parse succeeds, Zod passes, the logs stay green, and only the ledger quietly drifts. This article is a field note on how to build the verification layer that catches those silent errors before they reach production. It isn't about assembling the extraction pipeline itself — it's about the judgment that comes after extraction: whether the result can be trusted at all.

Don't treat confidence as a signal

The first assumption to drop is that the model's confidence is a usable quality metric. In my experience, an LLM's self-reported confidence is a poor predictor of whether the output is correct: a badly misread invoice can come back with just as high a number as a clean one. Confidence is the model's self-assessment of its own output, not the result of checking against external truth.

So I never use confidence as a gate — only for prioritization. If a value comes back low, it goes near the front of the human review queue; that's all. The accept/reject decision is made by verification that lives outside the model. There are three sources of verification: schema (is it structurally valid?), arithmetic reconciliation (do the numbers agree internally?), and dual-extraction agreement (does an independent pass produce the same result?). Let's go through them.
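
As a minimal sketch of "prioritization, not gating": the review queue can simply be ordered by confidence, lowest first. The `ReviewItem` shape and the `prioritizeReview` name are hypothetical, introduced only for illustration.

```typescript
// Hypothetical shape of a document waiting for human review.
type ReviewItem = { docId: string; confidence: number; reasons: string[] };

// Confidence only orders the queue. It never decides accept vs. review.
function prioritizeReview(items: ReviewItem[]): ReviewItem[] {
  // Copy before sorting so the caller's array is left untouched.
  return [...items].sort((x, y) => x.confidence - y.confidence);
}

const reviewQueue = prioritizeReview([
  { docId: "inv-001", confidence: 0.96, reasons: ["arithmetic mismatch total"] },
  { docId: "inv-002", confidence: 0.41, reasons: ["disagrees with second pass: vendor.name"] },
]);
// inv-002 is looked at first, but inv-001 stays in the queue despite its 0.96.
```

Note that both items reached the queue because an external check flagged them. The confidence value changes the order a person sees them in, nothing else.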

Stage one: schema guards shape, nothing more

Zod validation is the first wall, but you have to size what it can defend accurately. A schema can only reject structural anomalies — wrong type, missing required field, a value outside an enum. It can guarantee that total is a number, but it can say nothing about whether that number is right.

Even so, weaving extraction-specific constraints into the schema drops a fair share of errors at the first stage. Force dates into ISO format, make amounts non-negative, restrict currency to a three-letter ISO 4217 code. With every "impossible shape" dropped here, the arithmetic stage downstream can focus purely on substantive errors.

// src/schema.ts
import { z } from "zod";
 
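const isoDate = z
  .string()
  .regex(/^\d{4}-\d{2}-\d{2}$/, "Extract as YYYY-MM-DD")
  // Round-trip through Date so impossible days such as 2024-02-30 are rejected
  // even on engines whose Date.parse rolls them over instead of returning NaN.
  .refine((s) => {
    const d = new Date(`${s}T00:00:00Z`);
    return !Number.isNaN(d.getTime()) && d.toISOString().startsWith(s);
  }, "Must be a real date");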
 
const money = z.number().finite().nonnegative();
 
export const InvoiceSchema = z.object({
  invoiceNumber: z.string().min(1).optional(),
  issueDate: isoDate.optional(),
  dueDate: isoDate.optional(),
  vendor: z.object({ name: z.string().min(1), taxId: z.string().optional() }),
  lineItems: z
    .array(
      z.object({
        description: z.string().min(1),
        quantity: z.number().positive().optional(),
        unitPrice: money.optional(),
        amount: money,
      })
    )
    .min(1, "An invoice with zero line items is treated as a failed extraction"),
  subtotal: money.optional(),
  tax: money.optional(),
  total: money,
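  // Shape of an ISO 4217 code; swap in z.enum([...]) to whitelist the currencies you accept
  currency: z.string().regex(/^[A-Z]{3}$/, "Use an upper-case ISO 4217 code"),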
});
 
export type Invoice = z.infer<typeof InvoiceSchema>;

The quiet win is .min(1) on lineItems. An invoice with no line items essentially never exists, so if zero comes back you can declare the extraction failed. When you design the schema as a "detector of business-impossible shapes" rather than a "data-correctness check," the first net tightens considerably.

Stage two: hit numeric documents with arithmetic

Documents dominated by numbers — invoices, receipts, quotes — carry redundancy inside themselves. Sum the line amounts and you get the subtotal; add tax to the subtotal and you get the total. This internal arithmetic relationship is a remarkably powerful way to detect errors without holding any external ground truth.

// src/arithmetic.ts
import type { Invoice } from "./schema";
 
export type Discrepancy = { field: string; expected: number; got: number; diff: number };
 
// Per-currency tolerance for minor-unit rounding. JPY has no fractional unit; USD is 1 cent.
const TOLERANCE: Record<string, number> = { JPY: 0, USD: 0.01, EUR: 0.01 };
 
export function auditInvoiceMath(inv: Invoice): Discrepancy[] {
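  // Small epsilon so binary floating-point noise doesn't push an exact one-cent drift over the limit.
  const tol = (TOLERANCE[inv.currency] ?? 0.01) + 1e-9;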
  const issues: Discrepancy[] = [];
 
  const lineSum = inv.lineItems.reduce((s, li) => s + li.amount, 0);
 
  // line total vs subtotal
  if (inv.subtotal !== undefined && Math.abs(lineSum - inv.subtotal) > tol) {
    issues.push({ field: "subtotal", expected: lineSum, got: inv.subtotal, diff: inv.subtotal - lineSum });
  }
 
  // subtotal + tax = total (fall back to line sum if no subtotal)
  const base = inv.subtotal ?? lineSum;
  const expectedTotal = base + (inv.tax ?? 0);
  if (Math.abs(expectedTotal - inv.total) > tol) {
    issues.push({ field: "total", expected: expectedTotal, got: inv.total, diff: inv.total - expectedTotal });
  }
 
  // quantity x unitPrice = amount per line (only where both exist)
  for (const [i, li] of inv.lineItems.entries()) {
    if (li.quantity !== undefined && li.unitPrice !== undefined) {
      const expected = li.quantity * li.unitPrice;
      if (Math.abs(expected - li.amount) > tol) {
        issues.push({ field: `lineItems[${i}].amount`, expected, got: li.amount, diff: li.amount - expected });
      }
    }
  }
  return issues;
}

The digit error I opened with is stopped exactly here. If the total is the one value that doesn't reconcile with the rest, the diff exceeds tolerance and gets flagged. The key point is that you can name the error from the document's internal contradiction alone, holding no ground truth whatsoever. In practice, the sign and magnitude of the detected diff often tell you whether it was a misread digit, a wrong tax rate, or a dropped line item.
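
To make "the magnitude of the diff tells you something" concrete, here is one heuristic as a sketch: if the expected and extracted totals have the same number of digits and differ in exactly one position, a misread digit is a likely cause. The helper name is hypothetical and it assumes both values are integers in the currency's minor unit.

```typescript
// Hypothetical helper: both values are integers in the currency's minor unit.
function looksLikeSingleDigitMisread(expected: number, got: number): boolean {
  const a = String(Math.round(expected));
  const b = String(Math.round(got));
  // A dropped or added digit changes the length and is a different failure.
  if (a.length !== b.length) return false;
  let differing = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) differing++;
  }
  return differing === 1;
}

const misread = looksLikeSingleDigitMisread(2200, 2800); // true: only the hundreds digit differs
const somethingElse = looksLikeSingleDigitMisread(2200, 3100); // false: two digits differ
```

This is a hint to attach to the review reason, not a verdict. A wrong tax rate or a missing line item can also happen to change a single digit.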

The per-currency tolerance exists so that tiny rounding drifts don't flood human review. JPY has no fractional unit, so its tolerance is zero; only currencies with decimals get an allowance of one minor unit. Fix the tolerance at zero for every currency and you'll bounce correct invoices forever.
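
One way to see why a zero tolerance on decimal currencies backfires is to compare a float sum with the same sum in integer minor units. This is a sketch, not part of the audit code above; the `MINOR_UNIT_DIGITS` table and `toMinorUnits` are assumptions added for illustration.

```typescript
// Assumed minor-unit exponents for the three currencies in the tolerance table.
const MINOR_UNIT_DIGITS: Record<string, number> = { JPY: 0, USD: 2, EUR: 2 };

// Convert a decimal amount to an integer count of minor units (cents, yen).
function toMinorUnits(amount: number, currency: string): number {
  const digits = MINOR_UNIT_DIGITS[currency] ?? 2;
  return Math.round(amount * 10 ** digits);
}

// false: binary floats carry rounding noise, so an exact comparison fails.
const floatSumMatches = 0.1 + 0.2 === 0.3;

// true: 10 + 20 === 30 once everything is an integer number of cents.
const minorSumMatches =
  toMinorUnits(0.1, "USD") + toMinorUnits(0.2, "USD") === toMinorUnits(0.3, "USD");
```

If you would rather not carry a tolerance at all, converting to minor units before comparing is an alternative design. It still won't absorb genuine rounding differences on the invoice itself, such as tax rounded per line, which is what the one-minor-unit allowance is for.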

Stage three: gate on dual-extraction agreement

Arithmetic only defends numeric consistency. Free-text fields like vendor name, contract period, and address carry no internal redundancy. What works here is redundancy of a different kind: extract the same document again under different conditions and check whether the results agree. Think of it as bringing the ensemble idea into extraction.

The two calls become more independent if you vary the temperature and the framing. I run a first pass with claude-sonnet-4-6, a verification pass with the same Sonnet but a different prompt framing, and only arbitrate the fields that still disagree with claude-opus-4-8. Pulling everything twice through Opus is too extravagant, and Sonnet-versus-Sonnet disagreement precisely tells you "where the document is genuinely hard to read," which becomes a map of where to concentrate cost.
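
The arbitration step can be kept narrow by building its request from the disagreeing fields only. The following is a sketch under assumptions: `buildArbitrationRequest` and `getPath` are hypothetical helpers, the prompt wording is mine, and the returned object is just the parameter shape you would hand to `client.messages.create`.

```typescript
// Read a dotted path such as "vendor.name" from a plain object.
function getPath(obj: unknown, path: string): unknown {
  return path.split(".").reduce<unknown>(
    (cur, key) =>
      cur !== null && typeof cur === "object" ? (cur as Record<string, unknown>)[key] : undefined,
    obj,
  );
}

// Hypothetical helper: request parameters for the arbitration call, limited
// to the fields on which the two Sonnet passes disagreed.
function buildArbitrationRequest(
  sourceText: string,
  first: unknown,
  second: unknown,
  mismatches: string[],
) {
  const conflicts = mismatches.map((field) => ({
    field,
    firstPass: getPath(first, field) ?? null,
    secondPass: getPath(second, field) ?? null,
  }));
  return {
    model: "claude-opus-4-8", // model id as used in this article
    max_tokens: 1024,
    system:
      "Two extraction passes disagree on the fields listed. For each field, read the document " +
      "and return the value that actually appears, or null if it is unreadable. Return valid JSON only.",
    messages: [
      {
        role: "user" as const,
        content: `Conflicts:\n${JSON.stringify(conflicts, null, 2)}\n\nDocument:\n\n${sourceText}`,
      },
    ],
  };
}
```

Fields the two passes agree on never appear in the arbitration prompt, which keeps the expensive call focused on the conflict.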

// src/agreement.ts
import Anthropic from "@anthropic-ai/sdk";
import { InvoiceSchema, type Invoice } from "./schema";
 
const client = new Anthropic();
 
// Normalize before comparing (absorb full-width spaces, trimming, case)
const norm = (v: unknown) =>
  typeof v === "string" ? v.normalize("NFKC").trim().toLowerCase() : v;
 
export function fieldAgreement(a: Invoice, b: Invoice): string[] {
  const mismatches: string[] = [];
  const keys: (keyof Invoice)[] = ["invoiceNumber", "issueDate", "dueDate", "total", "currency"];
  for (const k of keys) {
    if (norm(a[k]) !== norm(b[k])) mismatches.push(String(k));
  }
  if (norm(a.vendor.name) !== norm(b.vendor.name)) mismatches.push("vendor.name");
  return mismatches;
}
 
// Second pass with a different framing, kept independent from the first
export async function secondPass(content: string): Promise<Invoice> {
  const res = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system:
      "You are an auditor. Read the invoice character by character and transcribe only the values that appear, as JSON. " +
      "No guessing, no completion, no rounding. Use null for anything unreadable. Return valid JSON only.",
    messages: [{ role: "user", content: `Transcribe the following invoice:\n\n${content}` }],
  });
  const text = res.content[0].type === "text" ? res.content[0].text : "";
  const json = text.match(/\{[\s\S]*\}/)?.[0] ?? text;
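  // The prompt asks for null on unreadable values, but the schema's optional fields accept
  // only undefined. A reviver that returns undefined drops those keys before validation.
  return InvoiceSchema.parse(JSON.parse(json, (_key, value) => (value === null ? undefined : value)));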
}

Running NFKC normalization before the string comparison turned out to be mandatory in the field. If a difference in full-width versus half-width characters or leading whitespace alone counts as a "mismatch," you'll trigger endless re-extraction over values that are effectively identical. An agreement rate without normalization measures notation drift, not real errors. If you're going to treat the agreement rate as an operational metric, the pre-processing of the comparison is worth designing carefully.
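
A small illustration of what the normalization absorbs, using the same three steps as the comparison code (NFKC, trim, lower-case). The sample vendor strings are made up.

```typescript
// Same steps as the comparison helper: NFKC, trim, lower-case.
const normText = (v: string) => v.normalize("NFKC").trim().toLowerCase();

const firstPassVendor = "ＡＣＭＥ　Ｃｏ．"; // full-width letters, space, and period
const secondPassVendor = " acme co.";   // half-width, with a leading space

const rawEqual = firstPassVendor === secondPassVendor;                         // false
const normalizedEqual = normText(firstPassVendor) === normText(secondPassVendor); // true
```

Without the normalization step these two values would count as a disagreement and trigger a re-extraction or a review, even though a person would read them as the same vendor.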

Fold the layer into a single gate

Combine the three stages into one decision that returns a verdict and the reason for it. Downstream should only handle three values — accept / review / reject — and the reason should always be attached.

// src/verify.ts
import { InvoiceSchema, type Invoice } from "./schema";
import { auditInvoiceMath } from "./arithmetic";
import { fieldAgreement, secondPass } from "./agreement";
 
export type Verdict =
  | { status: "accept"; data: Invoice }
  | { status: "review"; data: Invoice; reasons: string[] }
  | { status: "reject"; reasons: string[] };
 
export async function verifyInvoice(raw: unknown, sourceText: string): Promise<Verdict> {
  // 1. Schema: structurally broken => reject immediately
  const parsed = InvoiceSchema.safeParse(raw);
  if (!parsed.success) {
    return { status: "reject", reasons: parsed.error.issues.map((i) => `${i.path.join(".")}: ${i.message}`) };
  }
  const inv = parsed.data;
  const reasons: string[] = [];
 
  // 2. Arithmetic: internal contradictions => review (never auto-accept)
  for (const d of auditInvoiceMath(inv)) {
    reasons.push(`arithmetic mismatch ${d.field}: expected ${d.expected} / got ${d.got} (diff ${d.diff})`);
  }
 
  // 3. Dual extraction: disagreement on key fields => review
  try {
    const second = await secondPass(sourceText);
    const diff = fieldAgreement(inv, second);
    if (diff.length > 0) reasons.push(`disagrees with second pass: ${diff.join(", ")}`);
  } catch {
    reasons.push("second pass failed (no independent verification — needs a look)");
  }
 
  if (reasons.length === 0) return { status: "accept", data: inv };
  return { status: "review", data: inv, reasons };
}

What I'm careful about here is to never blur the three-way split: auto-accept / hand to a person / discard. If you let anything flagged by arithmetic or dual extraction fall into "close enough, accept," you've defeated the purpose of building the layer. Anything suspect must drop to review. Reserve reject for structural breakage that can't even pass the schema.
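
On the consuming side, an exhaustive switch helps keep the split from blurring: if someone later adds a fourth status, the compiler complains instead of silently falling through. This is a sketch with a simplified verdict type and hypothetical destination names.

```typescript
// Simplified stand-in for the Verdict type; destinations are hypothetical.
type SketchVerdict =
  | { status: "accept"; data: { total: number } }
  | { status: "review"; data: { total: number }; reasons: string[] }
  | { status: "reject"; reasons: string[] };

function routeVerdict(v: SketchVerdict): string {
  switch (v.status) {
    case "accept":
      return "ledger";
    case "review":
      return "review-queue";
    case "reject":
      return "re-extract";
    default: {
      // Compile-time guard: this assignment fails if a status is left unhandled.
      const unreachable: never = v;
      return unreachable;
    }
  }
}
```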

Where to draw the human-review line

Once the verification layer is in place, a certain fraction will always pile up in review. Looking at all of it by hand dilutes the value of automation, so treat the review volume itself as a design target. Three rules of thumb I use:

First, scrutinize high-value documents more strictly. The same diff is worth a different amount of review on a few-dollar receipt versus a six-figure invoice. Scale the threshold with the amount: tolerate small drifts on petty sums, call a human for the slightest mismatch on large ones.

Second, learn per vendor. If dual-extraction mismatches keep happening on one vendor's layout, that template has a specific hard-to-read quality, and it's safer to pin it to mandatory review.

Third, feed review results back into the extraction prompt. Whatever a person corrects becomes few-shot material, reducing the same class of error next time.
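
The first rule of thumb, scaling the threshold with the amount, might look like the sketch below. The tier boundaries and allowances are hypothetical numbers in major currency units; they are placeholders to tune against your own documents, not values I'm recommending.

```typescript
// Hypothetical tiers in major currency units; tune to your own documents.
function reviewThreshold(total: number): number {
  if (total >= 100_000) return 0; // six figures: any mismatch goes to a person
  if (total >= 1_000) return 0.01; // mid-range: allow one minor unit
  return 1; // petty sums: tolerate drift up to one unit
}

// True when the arithmetic diff is too large to wave through for this amount.
function needsHumanReview(total: number, diff: number): boolean {
  return Math.abs(diff) > reviewThreshold(total);
}

const pettyReceipt = needsHumanReview(12.5, 0.5); // false: within the petty-sum allowance
const largeInvoice = needsHumanReview(250_000, 0.01); // true: slightest mismatch escalates
```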

When you see the verification layer not as "a device to eliminate humans" but as "a device to concentrate humans where value is highest," the design stops wobbling. Neither trusting everything nor doubting everything: the machine names the places worth doubting, and people look only there. Once that division of labor is running, you can scale throughput without it falling apart.

Balancing cost against accuracy

The verification layer increases API calls, so it has to be designed together with cost. Naively pulling everything two or three times inflates the bill, so apply cost in stages. Keep the first and second passes on Sonnet and call Opus only to arbitrate disagreeing fields. Let prompt caching cover the long, unchanging system prompt, and route latency-tolerant batches through Message Batches.

Stage	Model / method	Intent
First pass	claude-sonnet-4-6	Enough for most documents. The speed-and-cost workhorse
Second pass (independent)	claude-sonnet-4-6 (reframed)	Independence comes from temperature and framing, not from upgrading the model
Arbitrating disagreeing fields	claude-opus-4-8	Concentrate strong reasoning only where the passes conflict; avoid Opus on everything
Standing system prompt	Prompt caching	Extraction instructions and schema notes don't change, so cache them
Latency-tolerant bulk runs	Message Batches	Push nightly intake and similar paths where latency is acceptable
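
The last two rows can be sketched as plain request parameters. This assumes the system prompt is passed as a text block carrying `cache_control: { type: "ephemeral" }` and that batch entries take the `{ custom_id, params }` shape; `EXTRACTION_INSTRUCTIONS` and both builder functions are hypothetical names standing in for your own.

```typescript
// Assumption: stands in for the long, unchanging extraction instructions and schema notes.
const EXTRACTION_INSTRUCTIONS =
  "You extract invoices into JSON with the fields invoiceNumber, issueDate, dueDate, vendor, " +
  "lineItems, subtotal, tax, total, and currency. Return valid JSON only.";

// Only the user message varies per document. The system block is identical across
// calls, which is what allows a cache breakpoint on it to be reused.
function buildFirstPassParams(documentText: string) {
  return {
    model: "claude-sonnet-4-6", // model id as used in this article
    max_tokens: 2048,
    system: [
      {
        type: "text" as const,
        text: EXTRACTION_INSTRUCTIONS,
        cache_control: { type: "ephemeral" as const },
      },
    ],
    messages: [
      { role: "user" as const, content: `Extract the following invoice:\n\n${documentText}` },
    ],
  };
}

// Entries for a batch submission: one custom_id per document so results can be matched back.
function buildBatchRequests(docs: { id: string; text: string }[]) {
  return docs.map((d) => ({ custom_id: d.id, params: buildFirstPassParams(d.text) }));
}
```

In the SDK these would be handed to `client.messages.create(params)` for interactive paths and to `client.messages.batches.create({ requests })` for nightly runs. Caching only applies to prompts above a minimum length, so a real system prompt would need to be considerably longer than this placeholder; check the current documentation for the limits.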

The dominant cost driver is "pulling everything through the top model multiple times." When you want verification but fear the cost, the lever isn't toggling verification on and off — it's reallocating verification to be wide on cheap models and narrow on expensive ones. Pass the agreeing majority cheaply; spend strong reasoning only on the small slice that conflicts. Whether you can build that gradient determines whether the verification layer survives realistic operation.
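
A back-of-the-envelope model makes the gradient visible. The per-call costs below are hypothetical relative units (Opus assumed to be 5x Sonnet) and the 10% disagreement rate is an illustrative figure, not a measurement.

```typescript
// Hypothetical relative costs per call; substitute your own measured numbers.
type CostInputs = { docs: number; sonnetCost: number; opusCost: number; disagreementRate: number };

// Two Sonnet passes for every document, one Opus call only for the disagreeing share.
function stagedCost({ docs, sonnetCost, opusCost, disagreementRate }: CostInputs): number {
  return docs * (2 * sonnetCost + disagreementRate * opusCost);
}

// Everything through the top model twice.
function naiveCost({ docs, opusCost }: CostInputs): number {
  return docs * 2 * opusCost;
}

const scenario: CostInputs = { docs: 1000, sonnetCost: 1, opusCost: 5, disagreementRate: 0.1 };
const staged = stagedCost(scenario); // 2500 units
const naive = naiveCost(scenario); // 10000 units
```

Under these assumptions the staged design costs a quarter of the naive one. The ratio depends on the disagreement rate, which is one more reason to track it as an operational metric.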

Signals I watch in operation

Finally, a few metrics I keep observing rather than setting and forgetting: daily ratios of accept / review / reject, the count of arithmetic mismatches, the dual-extraction disagreement rate, and the share of human reviews where a correction was actually made. That last one matters most. If it's low, meaning most documents sent to review needed no correction, your thresholds are too strict and you're wasting reviewers' time; conversely, if errors keep surfacing after the fact inside accept, verification is too loose. I nudge the balance between those two failure modes gradually, tuned to the value of the documents in play.
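
These ratios are cheap to compute from a day's verdicts. A minimal sketch, assuming each outcome records its status and, for reviewed documents, whether the reviewer changed anything; the `Outcome` shape and `dailySignals` are hypothetical.

```typescript
// Hypothetical record of one processed document.
type Outcome = { status: "accept" | "review" | "reject"; corrected?: boolean };

function dailySignals(outcomes: Outcome[]) {
  const count = (s: Outcome["status"]) => outcomes.filter((o) => o.status === s).length;
  const total = outcomes.length || 1; // avoid division by zero on empty days
  const reviewed = outcomes.filter((o) => o.status === "review");
  const corrected = reviewed.filter((o) => o.corrected === true).length;
  return {
    acceptRate: count("accept") / total,
    reviewRate: count("review") / total,
    rejectRate: count("reject") / total,
    // Share of reviewed documents where a person actually changed a value.
    correctionRate: reviewed.length === 0 ? 0 : corrected / reviewed.length,
  };
}
```

A persistently low `correctionRate` suggests the gate is sending too much to people. It says nothing about errors hiding inside accept, so that side still needs periodic sampling.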

Rather than chasing the accuracy of the extractor itself, in production it pays more to build a mechanism that keeps "the probability of a wrong value passing through" at an operable level. Bake the premise that the model can be confidently wrong into the design from the start. That is the plainest, most reliable foundation for running automated intake with peace of mind.

I hope this helps anyone wrestling with the same kind of quiet error.
