Don't Accept an Agent's Numbers and Citations As-Is — A Verification Gate Built on a Dedicated Auditor Subagent
A design that verifies every number and citation in an agent-generated summary using a separate subagent before accepting it — with working TypeScript for deterministic recomputation and fail-closed source matching.
When Claude Science was announced, the part that stayed with me wasn't the number of new skills. It was the multi-stage shape: a coordinating agent calls specialist agents, and then a dedicated agent verifies the citations and calculations. Treating verification as an independent role, separate from generation, felt like the important idea.
As an indie developer, I run automated jobs for several sites on my own. One morning a generated metrics summary said "+18% week over week," but when I added up the raw numbers by hand, the real figure was +8%. Models produce plausible numbers and plausible-looking citations with unsettling fluency, and when a summary quietly drifts from the underlying data, no one notices as long as they read only the summary. Since that morning I have distrusted any setup in which an artifact's numbers and citations are trusted inside the same flow that produced them.
This article builds the missing piece with working code: a pre-acceptance gate that separates generation from verification, recomputes numbers deterministically, matches citations against the source text, and rejects the entire artifact if even one claim fails.
Why you must not verify "as a continuation of generation"
The failure happens when you ask the generating model itself, "Is this correct?" Self-checking inside the same context and the same train of thought leads the model to treat the numbers it just produced as a correct premise, so it overlooks the drift. It's like proofreading your own draft, alone, right after writing it.
The value of an independent verifier comes down to three things.
First, context isolation: the verifier takes only the artifact plus the primary data and sources as input, and it inherits none of the generation-time reasoning. Second, deterministic judgment: numbers are recomputed from raw data by a function, not re-confirmed by a language model. Third, fail-closed behavior: a claim that cannot be verified is treated as failing, not as "probably fine," and a single failed claim stops the whole artifact.
Extract claims at the right granularity (a claim ledger)
You cannot verify a free-form summary directly. First, extract the smallest verifiable units — numeric claims and citation claims — as structured data. The generating agent must emit this "claim ledger" alongside the summary, every time.
```ts
// claims.ts: types for verifiable claims
export type NumericClaim = {
  id: string;
  kind: "number";
  statement: string;  // human-readable claim (the spot in the summary)
  metric: string;     // metric key used for recomputation
  value: number;      // the value the model claimed
  tolerance?: number; // relative tolerance (default if omitted)
};

export type CitationClaim = {
  id: string;
  kind: "citation";
  statement: string;
  quote: string;    // string that must exist in the source
  sourceId: string; // identifier of the source text to match against
};

export type Claim = NumericClaim | CitationClaim;

export type Artifact = {
  summary: string;
  claims: Claim[];
};
```
The key point is that value is stored as "the value the model claimed." The verifier does not trust this value; it later compares it against a figure it computes itself. The ledger isn't an appendix to the artifact — it is the input to verification.
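Because the ledger arrives as model output, the TypeScript types above guarantee nothing at runtime. The following is a minimal sketch of a runtime check, assuming the ledger is delivered as a JSON array string; `parseLedger` and its error format are hypothetical names introduced here, not part of the original code. It applies the same fail-closed rule to the ledger itself: one malformed entry rejects the whole ledger.

```typescript
// parse-ledger.ts: hypothetical helper, not part of the article's original code.
// The claim types mirror claims.ts and are repeated so the sketch is self-contained.
type NumericClaim = {
  id: string; kind: "number"; statement: string;
  metric: string; value: number; tolerance?: number;
};
type CitationClaim = {
  id: string; kind: "citation"; statement: string;
  quote: string; sourceId: string;
};
type Claim = NumericClaim | CitationClaim;

type LedgerResult =
  | { ok: true; claims: Claim[] }
  | { ok: false; errors: string[] };

const isStr = (v: unknown): v is string => typeof v === "string" && v.length > 0;
const isNum = (v: unknown): v is number => typeof v === "number" && Number.isFinite(v);

// Returns the claims only if every entry is well formed; otherwise a list of errors.
function parseLedger(json: string): LedgerResult {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch {
    return { ok: false, errors: ["ledger is not valid JSON"] };
  }
  if (!Array.isArray(raw)) return { ok: false, errors: ["ledger is not an array"] };

  const errors: string[] = [];
  const claims: Claim[] = [];
  const seen = new Set<string>();
  raw.forEach((item: unknown, i: number) => {
    const c = (typeof item === "object" && item !== null ? item : {}) as Record<string, unknown>;
    const { id, statement, kind, tolerance } = c;
    if (!isStr(id) || !isStr(statement)) {
      errors.push(`#${i}: missing id or statement`);
      return;
    }
    if (seen.has(id)) {
      errors.push(`#${i}: duplicate id ${id}`);
      return;
    }
    seen.add(id);
    const toleranceOk = tolerance === undefined || (isNum(tolerance) && tolerance >= 0);
    if (kind === "number" && isStr(c.metric) && isNum(c.value) && toleranceOk) {
      claims.push(c as unknown as NumericClaim);
    } else if (kind === "citation" && isStr(c.quote) && isStr(c.sourceId)) {
      claims.push(c as unknown as CitationClaim);
    } else {
      errors.push(`#${i}: malformed claim ${id}`);
    }
  });
  return errors.length === 0 ? { ok: true, claims } : { ok: false, errors };
}
```

Rejecting duplicate ids is a design choice of this sketch: it keeps every entry in a later failure report traceable to exactly one claim.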
Recompute numbers with a function, never ask the model
The whole point of numeric verification is to route it around the language model entirely. Register, per metric, a function that derives the answer uniquely from raw data, and compare it against the model's claim.
```ts
// verify-number.ts
import type { NumericClaim } from "./claims";

export type Dataset = Record<string, number[]>;

const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);

// metric key -> function that determines the value from raw data
export const metricFns: Record<string, (d: Dataset) => number> = {
  "clicks.total": (d) => sum(d.clicks),
  "clicks.avgPerDay": (d) => sum(d.clicks) / d.clicks.length,
  "wowChangePct": (d) => {
    const prev = sum(d.clicksPrevWeek);
    const cur = sum(d.clicksThisWeek);
    return prev === 0 ? NaN : ((cur - prev) / prev) * 100;
  },
};

const DEFAULT_TOLERANCE = 0.005; // relative 0.5%

export function verifyNumber(
  claim: NumericClaim,
  data: Dataset
): { ok: boolean; reason: string; expected?: number } {
  const fn = metricFns[claim.metric];
  if (!fn) {
    // an unknown metric is "unverifiable" = fail (fail-closed)
    return { ok: false, reason: `unregistered metric: ${claim.metric}` };
  }
  const expected = fn(data);
  if (!Number.isFinite(expected)) {
    return { ok: false, reason: "recomputed value is not finite", expected };
  }
  const tol = claim.tolerance ?? DEFAULT_TOLERANCE;
  const denom = Math.abs(expected) || 1;
  const relErr = Math.abs(claim.value - expected) / denom;
  return relErr <= tol
    ? { ok: true, reason: "match", expected }
    : {
        ok: false,
        reason: `claimed ${claim.value} vs recomputed ${expected.toFixed(2)} differ by ${(relErr * 100).toFixed(2)}%`,
        expected,
      };
}
```
The "+18% that was really +8%" from the opening fails instantly here, because recomputing wowChangePct yields 8 and the claimed 18 is far outside the tolerance. The relative tolerance exists so that small rounding and display-digit differences (a claimed 8.0 against a recomputed 8.02, say) don't fail, but the default is deliberately narrow at 0.5%, so any meaningful drift is caught.
Returning ok: false for an unregistered metric is the heart of fail-closed. You lean toward "cannot verify = do not pass," not "no verifier function = safe."
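To make the tolerance concrete, here is a small worked example that isolates the same relative-error rule used in verifyNumber. The weekly totals (700 and 756 clicks) are made-up illustration data, and `relativeError` is a name introduced only for this sketch.

```typescript
// Same relative-error rule as verifyNumber, isolated for a worked example.
// The weekly click totals below are hypothetical illustration data.
const DEFAULT_TOLERANCE = 0.005; // relative 0.5%

function relativeError(claimed: number, expected: number): number {
  const denom = Math.abs(expected) || 1; // avoid dividing by zero when expected is 0
  return Math.abs(claimed - expected) / denom;
}

const prevWeek = 700; // hypothetical total clicks, previous week
const thisWeek = 756; // hypothetical total clicks, this week
const expected = ((thisWeek - prevWeek) / prevWeek) * 100; // about 8

const drifted = relativeError(18, expected); // |18 - 8| / 8, about 1.25 (125%)
const rounded = relativeError(8.0, expected); // essentially 0
const coarse = relativeError(8, 8.4); // 0.4 / 8.4, about 0.048 (4.8%)

console.log(drifted <= DEFAULT_TOLERANCE); // false: the +18% claim is rejected
console.log(rounded <= DEFAULT_TOLERANCE); // true: the +8% claim is accepted
console.log(coarse <= DEFAULT_TOLERANCE); // false: 8.4 rounded to 8 is rejected too
```

The third case is worth noticing: a figure rounded to a whole percent can exceed the 0.5% default, so either the generator should report enough digits or the claim needs a wider per-claim tolerance. Since tolerance is read from the ledger, which the generator writes, it may also be worth capping it on the verifier side.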
Match citations by checking the quote exists in the source
Citation verification, too, must not ask the model "is this quote correct?" Instead, confirm — after normalization — that the quoted string actually exists inside the designated source snippet.
```ts
// verify-citation.ts
import type { CitationClaim } from "./claims";

export type Sources = Record<string, string>; // sourceId -> source snippet

// absorb width, whitespace, and punctuation variance
function normalize(s: string): string {
  return s
    .normalize("NFKC")
    .replace(/\s+/g, "")
    .replace(/[「」『』(),.、。]/g, "")
    .toLowerCase();
}

export function verifyCitation(
  claim: CitationClaim,
  sources: Sources
): { ok: boolean; reason: string } {
  const src = sources[claim.sourceId];
  if (!src) {
    return { ok: false, reason: `source not found: ${claim.sourceId}` };
  }
  const q = normalize(claim.quote);
  if (q.length < 8) {
    // a too-short quote matches by chance, so fail it
    return { ok: false, reason: "quote too short to match" };
  }
  return normalize(src).includes(q)
    ? { ok: true, reason: "matches source" }
    : { ok: false, reason: "quote does not exist in source" };
}
```
Normalization is there because models tend to alter punctuation and bracket widths slightly when quoting. A quote shorter than 8 characters after normalization, on the other hand, can match a source by coincidence, so it fails on purpose. Again: "cannot match = do not pass."
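A short demonstration of what this normalization absorbs, and of one thing it absorbs that you may not want. The normalize function below behaves the same as the one in verify-citation.ts and is repeated only so the example is self-contained; the source sentence and quotes are made up for illustration.

```typescript
// Behaves the same as normalize() in verify-citation.ts; repeated to keep this runnable alone.
function normalize(s: string): string {
  return s
    .normalize("NFKC")                 // fold full-width forms into their ASCII equivalents
    .replace(/\s+/g, "")               // drop all whitespace
    .replace(/[「」『』(),.、。]/g, "") // drop brackets and sentence punctuation
    .toLowerCase();
}

// Hypothetical source snippet and a quote with width and punctuation drift.
const source = "Weekly report: Clicks rose 8 percent, driven by (organic) search.";
const variant = "Ｃｌｉｃｋｓ　rose 8 percent、driven by（organic）search。";

const matches = normalize(source).includes(normalize(variant));
console.log(matches); // true: width, spacing, and punctuation differences are absorbed

// Caveat: removing "." also removes decimal points, so these two quotes collide.
const collides = normalize("grew 1.5% overall") === normalize("grew 15% overall");
console.log(collides); // true
```

If quotes in your sources carry numbers, one option is to keep a period that sits between two digits, or to require that any figure inside a quote is also registered as a numeric claim. Either is a judgment call that depends on how your sources are written.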
Use the auditor subagent only where determinism can't reach
Numbers and citations can be settled deterministically as above, but "did the summary quietly slip in a number or source that isn't in the ledger?" is hard to answer with string matching alone. Hand only that part to a dedicated auditor subagent. Narrowing its role minimizes the surface you leave to model judgment.
In Claude Code, place a dedicated auditor subagent under .claude/agents/ and invoke it in its own context, so that it carries none of the generation context.
File: .claude/agents/claim-auditor.md

```markdown
---
name: claim-auditor
description: Judges only whether a generated summary introduced numbers/citations absent from its claim ledger
tools: []
---
You are a dedicated auditor. You receive only the "summary text" and the "claim ledger" (a list of ids and statements).
List every number or specific citation appearing in the summary that maps to no claim in the ledger.
Judge conservatively: if even one number/citation lacks ledger support, report it as unverified.
Output only JSON of shape {"unlisted": string[]}. Do not infer or fill in the generator's intent.
```
If you wire this from the SDK instead, the key is to restrict the input to just the summary and the ledger — never pass the raw data or the generation-time prompt.
```ts
// audit-summary.ts (excerpt)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

export async function auditSummary(
  summary: string,
  claimStatements: string[]
): Promise<{ unlisted: string[] }> {
  const msg = await client.messages.create({
    model: "claude-haiku-4-5-20251001", // a light model is enough for auditing
    max_tokens: 512,
    system:
      'List numbers/citations in the summary that map to none of the given claims. Output only JSON {"unlisted": string[]}.',
    messages: [
      {
        role: "user",
        content: `# Summary\n${summary}\n\n# Claims\n${claimStatements.join("\n")}`,
      },
    ],
  });
  const block = msg.content.find((b) => b.type === "text");
  if (!block || !("text" in block)) {
    // fail-closed: no text output means the audit did not happen, not that it passed
    throw new Error("auditor returned no text output");
  }
  // JSON.parse throws on malformed output, which also stops the caller
  return JSON.parse(block.text);
}
```
A light model suffices because this audit isn't a fresh judgment — it's a diff against the ledger.
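The auditor's reply is still untrusted text, so its shape deserves the same fail-closed treatment as the claims. Below is a minimal sketch of a strict parser, assuming the reply is expected to be exactly the JSON object described in the prompt; `parseAuditOutput` and `AuditResult` are hypothetical names, not part of the original code.

```typescript
// parse-audit.ts: hypothetical helper, not part of the article's original code.
type AuditResult =
  | { ok: true; unlisted: string[] }
  | { ok: false; reason: string };

// Accepts only {"unlisted": string[]}; anything else counts as a failed audit.
function parseAuditOutput(text: string): AuditResult {
  let raw: unknown;
  try {
    raw = JSON.parse(text.trim());
  } catch {
    return { ok: false, reason: "auditor output is not valid JSON" };
  }
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, reason: "auditor output is not an object" };
  }
  const unlisted = (raw as Record<string, unknown>).unlisted;
  if (!Array.isArray(unlisted) || !unlisted.every((u) => typeof u === "string")) {
    return { ok: false, reason: 'auditor output lacks "unlisted": string[]' };
  }
  return { ok: true, unlisted: unlisted as string[] };
}
```

With a parser like this, a chatty reply such as "Sure, here is the JSON" turns into an explicit failure with a reason you can log, instead of an exception or an empty list that reads as a pass.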
Bundle everything into one gate (Before / After)
Finally, combine number checks, citation checks, and the leakage audit into a function that rejects the whole artifact if any one of them fails. First, the tempting "publish as-is" version.
```ts
// ❌ Before: accept the artifact's summary and publish it directly
const artifact = await generateReport(data);
await publish(artifact.summary); // no one verified the numbers or citations
```
Change it so that only artifacts that pass verification ever reach publish.
```ts
// ✅ After: always go through a pre-acceptance gate
import type { Artifact } from "./claims";
import { verifyNumber, type Dataset } from "./verify-number";
import { verifyCitation, type Sources } from "./verify-citation";
import { auditSummary } from "./audit-summary";

export async function gate(
  artifact: Artifact,
  data: Dataset,
  sources: Sources
): Promise<{ ok: boolean; failures: string[] }> {
  const failures: string[] = [];

  // deterministic checks: every claim in the ledger
  for (const c of artifact.claims) {
    if (c.kind === "number") {
      const r = verifyNumber(c, data);
      if (!r.ok) failures.push(`[${c.id}] ${r.reason}`);
    } else {
      const r = verifyCitation(c, sources);
      if (!r.ok) failures.push(`[${c.id}] ${r.reason}`);
    }
  }

  // model-based check: numbers/citations in the summary that the ledger does not cover
  try {
    const audit = await auditSummary(
      artifact.summary,
      artifact.claims.map((c) => `${c.id}: ${c.statement}`)
    );
    for (const u of audit.unlisted) {
      failures.push(`[unlisted] not in ledger: ${u}`);
    }
  } catch (err) {
    // an audit that errors out is a failure, not a pass (fail-closed)
    failures.push(`[audit] audit did not complete: ${String(err)}`);
  }

  return { ok: failures.length === 0, failures };
}

// caller
const artifact = await generateReport(data);
const result = await gate(artifact, data, sources);
if (!result.ok) {
  console.error("rejected:", result.failures);
  // do not publish. log the reasons and route to regeneration or human review
} else {
  await publish(artifact.summary);
}
```
The difference between Before and After isn't a single added line; it's the inversion of the premise: "an artifact is not trusted by default." publish is only ever called for an artifact that passed the gate.
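The rejected branch mentions routing to regeneration or human review. One way that could look is a bounded retry loop, sketched here under the assumption that generation, gating, and publishing are injected as functions; `runWithGate`, `maxAttempts`, and the idea of handing the previous failures back to the generator are all choices of this sketch, not something the original code prescribes.

```typescript
// run-with-gate.ts: hypothetical wrapper; generate, gate, and publish are injected.
type GateResult = { ok: boolean; failures: string[] };

async function runWithGate<A>(opts: {
  generate: (previousFailures: string[]) => Promise<A>;
  gate: (artifact: A) => Promise<GateResult>;
  publish: (artifact: A) => Promise<void>;
  maxAttempts?: number;
}): Promise<{ published: boolean; attempts: number; failures: string[] }> {
  const maxAttempts = opts.maxAttempts ?? 2;
  let failures: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const artifact = await opts.generate(failures);
    let result: GateResult;
    try {
      result = await opts.gate(artifact);
    } catch (err) {
      // A crashing gate counts as a rejection, never as a pass.
      result = { ok: false, failures: [`gate error: ${String(err)}`] };
    }
    if (result.ok) {
      await opts.publish(artifact);
      return { published: true, attempts: attempt, failures: [] };
    }
    failures = result.failures;
    console.error(`attempt ${attempt} rejected:`, failures);
  }
  // Out of attempts: stop and leave the last failures for human review.
  return { published: false, attempts: maxAttempts, failures };
}
```

The cap matters more than the retry. Without it, an unattended job that keeps failing the gate would regenerate indefinitely, and the failure reasons that should reach a person would never surface.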
Small judgments that paid off in practice
After running this shape for a few months, what helped wasn't the flashy machinery but the small decisions in the details.
| Decision point | Policy taken | Reason |
| --- | --- | --- |
| Unregistered metric / missing source | Fail it | Reading "unverifiable" as "safe" lets incidents slip through |
| Numeric tolerance | Narrow, relative 0.5% | Forgive rounding, always catch meaningful drift |
| Short quotes | Under 8 chars fails | Don't let a chance match read as "source exists" |
| Audit model | Run on a light model | Diff detection needs no heavy reasoning |
| On rejection | Log the reasons and stop | To read failure trends later; never discard silently |
The rejection log, in particular, later tells you which metrics drift most often. In my case, difference metrics like week-over-week were the most error-prone, and tightening the tolerance just there let me catch nearly all of the quiet drift.
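Reading those trends takes very little code once each rejection is stored as a record. The sketch below assumes a log entry shape (`Rejection`) that the original does not define, so treat the field names as placeholders for whatever your log actually holds.

```typescript
// rejection-stats.ts: hypothetical log shape; the article only says reasons are logged.
type Rejection = {
  at: string;      // ISO timestamp of the rejected run
  claimId: string; // ledger id, or "unlisted" for audit findings
  metric?: string; // present only for numeric claims
  reason: string;
};

// Count rejections per metric, most frequent first.
// Entries without a metric (citations, unlisted items) are grouped together.
function rejectionsByMetric(log: Rejection[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const r of log) {
    const key = r.metric ?? "(no metric)";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

A metric that stays at the top of this list is a candidate for a tighter per-metric tolerance or for a closer look at how the generator derives it.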
Speed and fluency of generation guarantee nothing about correctness. That's exactly why you place, somewhere apart from generation, a plain checkpoint: recompute the numbers, go back to the source for citations, and pass nothing that lacks support. It isn't glamorous, but the longer I run things unattended, the more this single gate has saved me. I hope it gives fellow builders handling agent outputs a foothold for their own design.