API & SDK · 2026-03-27 · Advanced

Building LLM Evaluation Pipelines with Claude API — Claude-as-Judge, Prompt A/B Testing, and Quality Scoring Patterns

Learn how to design and implement LLM evaluation pipelines using Claude API. Covers Claude-as-Judge patterns, prompt A/B testing frameworks, quality scoring systems, and regression testing for production applications.

Tags: evaluation, claude-api, testing, prompt-engineering, production, quality-assurance

Setup and context — Why You Need an LLM Evaluation Pipeline

Once your Claude API-powered application hits production, you'll inevitably face a critical question: how do you guarantee output quality at scale?

A small tweak to your prompt can produce unexpected outputs. A model version upgrade might break existing workflows. Manual spot-checking misses hundreds of edge cases. Without a systematic approach, you're essentially flying blind.

This article walks you through building a comprehensive evaluation pipeline for Claude-powered applications. We'll cover Claude-as-Judge for automated evaluation, prompt A/B testing with statistical rigor, quality scoring systems, and regression testing — everything you need to ship with confidence.

Who This Article Is For

  • Developers running Claude API applications in production
  • Teams struggling with prompt quality management
  • Engineers looking to integrate evaluation into their CI/CD pipelines

Prerequisites

  • Python 3.11+
  • anthropic Python SDK (latest version)
  • scipy and numpy (used by the A/B testing statistics)
  • Node.js with @anthropic-ai/sdk, if you want to run the TypeScript example

The Claude-as-Judge Pattern — Automated Output Evaluation

Core Concept

Claude-as-Judge uses Claude itself as an evaluator to assess the quality of outputs from other Claude calls (or any LLM output). With a specific rubric, its scores can track human ratings closely, and it can process thousands of evaluations in minutes. Check that agreement against a small human-labeled sample before relying on it for your own task.

Designing Evaluation Rubrics

Clear evaluation criteria are essential. Vague rubrics produce inconsistent results, so you need specific scoring guidelines.

# evaluation_rubric.py — Evaluation rubric definition
from dataclasses import dataclass, field
from typing import Optional
 
@dataclass
class EvalCriterion:
    """A single evaluation criterion"""
    name: str
    description: str
    scoring_guide: dict[int, str]  # score → description
    weight: float = 1.0
 
@dataclass
class EvalRubric:
    """Complete evaluation rubric"""
    name: str
    criteria: list[EvalCriterion] = field(default_factory=list)
 
    def to_prompt(self) -> str:
        """Convert rubric to prompt string"""
        lines = [f"# Evaluation Rubric: {self.name}\n"]
        for c in self.criteria:
            lines.append(f"## {c.name} (weight: {c.weight})")
            lines.append(f"{c.description}\n")
            for score, desc in sorted(c.scoring_guide.items()):
                lines.append(f"- **{score}**: {desc}")
            lines.append("")
        return "\n".join(lines)
 
# Example: Customer support response quality rubric
support_rubric = EvalRubric(
    name="Customer Support Response Quality",
    criteria=[
        EvalCriterion(
            name="Accuracy",
            description="Is the response factually correct with no misinformation?",
            scoring_guide={
                1: "Contains significant factual errors",
                2: "Minor inaccuracies present",
                3: "Mostly accurate but some vague areas",
                4: "Accurate with specific information provided",
                5: "Completely accurate with relevant supplementary details",
            },
            weight=2.0,
        ),
        EvalCriterion(
            name="Completeness",
            description="Does the response address all parts of the question?",
            scoring_guide={
                1: "Fails to address the main question",
                2: "Only partially addresses the question",
                3: "Addresses the main question but lacks supplementary info",
                4: "Fully addresses the question with next steps",
                5: "Comprehensive answer with proactive related information",
            },
            weight=1.5,
        ),
        EvalCriterion(
            name="Tone",
            description="Is the tone appropriate and empathetic?",
            scoring_guide={
                1: "Cold or overly formal",
                2: "Somewhat mechanical",
                3: "Standard, no issues",
                4: "Friendly and polite",
                5: "Empathetic and reassuring",
            },
            weight=1.0,
        ),
    ],
)

Implementing Claude-as-Judge

Here's the core function that uses the rubric to have Claude evaluate outputs.

# claude_judge.py — Claude-as-Judge evaluation engine
import anthropic
import json
from typing import Any

from evaluation_rubric import EvalRubric  # rubric classes from the previous listing

# Async client, so the awaited call below does not block the event loop.
# Uses the ANTHROPIC_API_KEY env var.
client = anthropic.AsyncAnthropic()

async def evaluate_with_claude(
    input_text: str,
    output_text: str,
    rubric: EvalRubric,
    reference_answer: str | None = None,
    model: str = "claude-sonnet-4-6",
) -> dict[str, Any]:
    """
    Evaluate an output using Claude-as-Judge

    Args:
        input_text: Original input (user question, etc.)
        output_text: Output to evaluate
        rubric: Evaluation rubric
        reference_answer: Reference answer (optional)
        model: Model to use for evaluation

    Returns:
        Dictionary with per-criterion scores and overall score
    """
    system_prompt = """You are an expert evaluator of LLM outputs.
Follow the given rubric strictly and score each criterion.
Evaluate objectively and consistently.

Always respond in the following JSON format:
{
  "scores": {
    "criterion_name": {"score": number, "reasoning": "explanation"},
    ...
  },
  "weighted_total": weighted_total_score,
  "max_possible": maximum_possible_score,
  "overall_feedback": "overall feedback"
}"""

    user_prompt = f"""Evaluate the following output against the input.

{rubric.to_prompt()}

## Input
{input_text}

## Output to Evaluate
{output_text}
"""
    if reference_answer:
        user_prompt += f"\n## Reference Answer (ideal response)\n{reference_answer}\n"

    response = await client.messages.create(
        model=model,
        max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )

    # Extract and parse JSON (the model may wrap the object in extra text)
    text = response.content[0].text
    json_start = text.find("{")
    json_end = text.rfind("}") + 1
    if json_start == -1 or json_end == 0:
        raise ValueError(f"No JSON object found in judge response: {text[:200]!r}")
    result = json.loads(text[json_start:json_end])

    return result
 
# Usage example
# result = await evaluate_with_claude(
#     input_text="How do I return an item?",
#     output_text="You can return items within 30 days...",
#     rubric=support_rubric,
# )
# print(f"Overall score: {result['weighted_total']}/{result['max_possible']}")

Example output (the judge's scores can vary between runs):

{
  "scores": {
    "Accuracy": {"score": 4, "reasoning": "Return policy and steps are accurately described"},
    "Completeness": {"score": 3, "reasoning": "Covers basic steps but missing shipping cost details"},
    "Tone": {"score": 4, "reasoning": "Friendly and polite tone throughout"}
  },
  "weighted_total": 16.5,
  "max_possible": 22.5,
  "overall_feedback": "Accurate and polite response, but could include shipping cost information"
}
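
Because the judge is asked to do its own arithmetic, it can be worth recomputing the totals from the per-criterion scores rather than trusting the returned `weighted_total`. The following is a minimal sketch, assuming the rubric weights are available as a plain dict. The `recompute_totals` helper and the `WEIGHTS` and `MAX_SCORE` names are illustrative and not part of the article's code.

```python
# Recompute weighted totals from per-criterion scores (illustrative helper).
WEIGHTS = {"Accuracy": 2.0, "Completeness": 1.5, "Tone": 1.0}  # weights from support_rubric
MAX_SCORE = 5  # top of the 1-5 scale used in the rubric


def recompute_totals(
    scores: dict[str, dict], weights: dict[str, float]
) -> tuple[float, float]:
    """Return (weighted_total, max_possible) computed locally."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"Judge omitted criteria: {sorted(missing)}")
    total = sum(scores[name]["score"] * w for name, w in weights.items())
    max_possible = MAX_SCORE * sum(weights.values())
    return total, max_possible


# Per-criterion scores copied from the example output above.
judge_scores = {
    "Accuracy": {"score": 4, "reasoning": "..."},
    "Completeness": {"score": 3, "reasoning": "..."},
    "Tone": {"score": 4, "reasoning": "..."},
}
total, max_possible = recompute_totals(judge_scores, WEIGHTS)
print(total, max_possible)  # 16.5 22.5
```

Here 4 × 2.0 + 3 × 1.5 + 4 × 1.0 = 16.5 and 5 × (2.0 + 1.5 + 1.0) = 22.5, matching the example output.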

Techniques for Improving Evaluation Consistency

To increase the accuracy of Claude-as-Judge evaluations, consider these approaches:

  1. Few-shot examples: Include concrete response examples at each score level alongside the rubric
  2. Pairwise comparison: Comparing two outputs side-by-side is often more consistent than absolute scoring
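  3. Median of multiple runs: Run the same evaluation 3 times and take the median to reduce variance

import statistics

async def evaluate_with_consistency(
    input_text: str,
    output_text: str,
    rubric: EvalRubric,
    n_runs: int = 3,
) -> dict:
    """Evaluate multiple times and take the median"""
    results = []
    for _ in range(n_runs):
        result = await evaluate_with_claude(input_text, output_text, rubric)
        results.append(result)

    # Calculate median score for each criterion
    median_scores = {}
    for criterion in rubric.criteria:
        scores = [r["scores"][criterion.name]["score"] for r in results]
        median_scores[criterion.name] = statistics.median(scores)

    return {
        "median_scores": median_scores,
        "all_runs": results,
        "consistency": _calculate_consistency(results, rubric),
    }

def _calculate_consistency(results: list[dict], rubric: EvalRubric) -> float:
    """Share of criteria on which every run gave the same score (1.0 = full agreement).

    Assumption: this is one simple definition of consistency; swap in a
    variance-based measure if you need something finer-grained.
    """
    if not rubric.criteria:
        return 1.0
    agreed = sum(
        1
        for c in rubric.criteria
        if len({r["scores"][c.name]["score"] for r in results}) == 1
    )
    return agreed / len(rubric.criteria)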
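
The pairwise comparison technique listed above has no code in this article, so here is a rough sketch of one way to set it up. Judges can favor whichever response appears first, so a common precaution is to judge both orderings and accept a winner only when the two verdicts agree. The `build_pairwise_prompt` and `combine_verdicts` helpers and the "A"/"B"/"tie" reply format are assumptions for illustration, and the judge call itself is left out.

```python
# Pairwise judging sketch: build both orderings, then reconcile the verdicts.
# The judge call is omitted; the verdict strings stand in for the judge's
# "A" / "B" / "tie" replies.

def build_pairwise_prompt(input_text: str, first: str, second: str) -> str:
    """Prompt asking the judge to pick the better of two responses."""
    return (
        "Compare the two responses to the input and answer with exactly "
        "one of: A, B, tie.\n\n"
        f"## Input\n{input_text}\n\n"
        f"## Response A\n{first}\n\n"
        f"## Response B\n{second}\n"
    )


def combine_verdicts(verdict_original: str, verdict_swapped: str) -> str:
    """Reconcile verdicts from the original and swapped orderings.

    In the swapped run the candidates trade places, so its "A" refers to the
    candidate shown as "B" in the original run. A winner is declared only
    when both runs agree; otherwise the result is a tie.
    """
    unswap = {"A": "B", "B": "A", "tie": "tie"}
    second_opinion = unswap[verdict_swapped]
    return verdict_original if verdict_original == second_opinion else "tie"


question = "How do I return an item?"
prompt_original = build_pairwise_prompt(question, "candidate 1 text", "candidate 2 text")
prompt_swapped = build_pairwise_prompt(question, "candidate 2 text", "candidate 1 text")

print(combine_verdicts("A", "B"))  # both orderings prefer candidate 1
print(combine_verdicts("A", "A"))  # judge picked the first slot both times
```

The second call returns a tie because picking slot A in both orderings means the judge preferred a different candidate each time, which is the signature of position bias.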

Prompt A/B Testing Framework

Why A/B Test Your Prompts?

"This prompt feels better" is a dangerous basis for production decisions. Prompt A/B testing runs two or more prompt variants against the same test cases and determines whether there's a statistically significant difference.

Framework Design

# prompt_ab_test.py — Prompt A/B testing framework
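import anthropic
import asyncio
from dataclasses import dataclass
from datetime import datetime
from scipy import stats
import numpy as np

# Pieces defined in the earlier listings
from evaluation_rubric import EvalRubric
from claude_judge import evaluate_with_claude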
 
client = anthropic.Anthropic()
 
@dataclass
class PromptVariant:
    """A prompt variant"""
    name: str
    system_prompt: str
    model: str = "claude-sonnet-4-6"
    temperature: float = 0.0
    max_tokens: int = 4096
 
@dataclass
class TestCase:
    """A test case"""
    id: str
    input_text: str
    reference_answer: str | None = None
    metadata: dict | None = None
 
@dataclass
class ABTestResult:
    """A/B test results"""
    variant_a_scores: list[float]
    variant_b_scores: list[float]
    t_statistic: float
    p_value: float
    is_significant: bool
    winner: str | None
    effect_size: float
 
async def run_ab_test(
    variant_a: PromptVariant,
    variant_b: PromptVariant,
    test_cases: list[TestCase],
    rubric: EvalRubric,
    significance_level: float = 0.05,
) -> ABTestResult:
    """
    A/B test two prompt variants
 
    Args:
        variant_a: Variant A (baseline)
        variant_b: Variant B (challenger)
        test_cases: Test case set
        rubric: Evaluation rubric
        significance_level: Significance level (default 5%)
    """
    scores_a, scores_b = [], []
 
    for tc in test_cases:
        # Generate outputs from both variants
        output_a = await _generate(variant_a, tc.input_text)
        output_b = await _generate(variant_b, tc.input_text)
 
        # Evaluate with Claude-as-Judge
        eval_a = await evaluate_with_claude(
            tc.input_text, output_a, rubric, tc.reference_answer
        )
        eval_b = await evaluate_with_claude(
            tc.input_text, output_b, rubric, tc.reference_answer
        )
 
        scores_a.append(eval_a["weighted_total"] / eval_a["max_possible"])
        scores_b.append(eval_b["weighted_total"] / eval_b["max_possible"])
 
    # Paired t-test (same test cases compared)
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    is_significant = p_value < significance_level
 
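    # Effect size (Cohen's d for paired samples, using the sample standard deviation).
    # diff is A minus B, so a negative value means variant B scored higher.
    diff = np.array(scores_a) - np.array(scores_b)
    sd = np.std(diff, ddof=1)
    effect_size = float(np.mean(diff) / sd) if sd > 0 else 0.0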
 
    winner = None
    if is_significant:
        winner = variant_a.name if np.mean(scores_a) > np.mean(scores_b) else variant_b.name
 
    return ABTestResult(
        variant_a_scores=scores_a,
        variant_b_scores=scores_b,
        t_statistic=t_stat,
        p_value=p_value,
        is_significant=is_significant,
        winner=winner,
        effect_size=effect_size,
    )
 
async def _generate(variant: PromptVariant, input_text: str) -> str:
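    """Generate output from a prompt variant"""
    # `client` is the synchronous SDK client, so run the blocking call in a
    # worker thread to keep the event loop free.
    response = await asyncio.to_thread(
        client.messages.create,
        model=variant.model,
        max_tokens=variant.max_tokens,
        temperature=variant.temperature,
        system=variant.system_prompt,
        messages=[{"role": "user", "content": input_text}],
    )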
    return response.content[0].text
 
# Expected output example:
# ABTestResult(
#     variant_a_scores=[0.82, 0.78, 0.85, ...],
#     variant_b_scores=[0.88, 0.91, 0.87, ...],
#     t_statistic=-3.42,
#     p_value=0.002,
#     is_significant=True,
#     winner="variant_b_detailed_prompt",
#     effect_size=-0.65  # negative, like t: scores are A minus B and B scored higher
# )
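
To make the link between the t-statistic and the effect size concrete, the sketch below works through both by hand using only the standard library. The scores are made up for illustration. It assumes the same convention as the framework, differences taken as A minus B, so negative values mean variant B scored higher.

```python
# Paired t statistic and paired-samples Cohen's d computed by hand.
# The scores below are invented for illustration only.
import math
import statistics

scores_a = [0.82, 0.78, 0.85, 0.80, 0.76, 0.88]
scores_b = [0.88, 0.91, 0.87, 0.84, 0.83, 0.90]

# Per-test-case differences (A minus B)
diff = [a - b for a, b in zip(scores_a, scores_b)]
n = len(diff)
mean_diff = statistics.mean(diff)
sd_diff = statistics.stdev(diff)  # sample standard deviation (divides by n - 1)

# Effect size for paired data: mean difference in units of its standard deviation
d_z = mean_diff / sd_diff

# Paired t statistic: mean difference divided by its standard error
t_stat = mean_diff / (sd_diff / math.sqrt(n))

print(f"mean diff = {mean_diff:.3f}, d_z = {d_z:.2f}, t = {t_stat:.2f}")
```

The two quantities differ only by a factor of the square root of n, which is why a modest effect size can still be statistically significant once the test set is large enough.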

Test Case Design Strategy

The reliability of your A/B test depends heavily on test case quality and quantity.

  • Minimum 30 cases: Use at least 30 test cases as a rule of thumb for the paired t-test; detecting small differences between variants takes more
  • Edge case coverage: Include ambiguous inputs, long inputs, and multilingual inputs beyond just standard cases
  • Category balance: Distribute test cases evenly across use case categories
  • Golden dataset: Prepare ideal reference answers written by humans

# Building a test case set
test_cases = [
    TestCase(
        id="support-001",
        input_text="My order hasn't arrived. Order number is ORD-12345.",
        reference_answer="I'm sorry for the inconvenience. For order ORD-12345...",
        metadata={"category": "shipping", "difficulty": "standard"},
    ),
    TestCase(
        id="support-002",
        input_text="I was charged twice on last month's bill",
        reference_answer=None,  # No reference → rubric evaluation only
        metadata={"category": "billing", "difficulty": "complex"},
    ),
    # ... 30+ cases
]
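
The size and balance guidelines above are easy to check automatically before spending API calls on a test run. Here is a minimal sketch, assuming each case carries a "category" entry in its metadata as in the examples. The `check_test_set` helper, the `MIN_CASES` constant, and the `max_imbalance` ratio are hypothetical names and defaults chosen for illustration.

```python
# Sanity-check a test set before running an A/B test (illustrative helper).
from collections import Counter

MIN_CASES = 30  # rule-of-thumb floor from the guidance above


def check_test_set(cases: list[dict], max_imbalance: float = 2.0) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found.

    Each case is a dict with an optional "metadata" dict holding a "category"
    key, mirroring the TestCase.metadata field.
    """
    problems = []
    if len(cases) < MIN_CASES:
        problems.append(f"only {len(cases)} cases, want at least {MIN_CASES}")

    counts = Counter(
        (c.get("metadata") or {}).get("category", "uncategorized") for c in cases
    )
    # Flag the set when the largest category is more than max_imbalance times
    # the size of the smallest one.
    if counts and max(counts.values()) > max_imbalance * min(counts.values()):
        problems.append(f"unbalanced categories: {dict(counts)}")
    return problems


sample = (
    [{"id": f"ship-{i}", "metadata": {"category": "shipping"}} for i in range(20)]
    + [{"id": f"bill-{i}", "metadata": {"category": "billing"}} for i in range(5)]
)
for problem in check_test_set(sample):
    print(problem)
```

The sample set trips both checks: it has 25 cases, and shipping outnumbers billing four to one.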

Regression Testing — Protecting Quality Across Prompt Changes

Why Regression Testing Matters

Prompt improvements and model version changes can cause unintended quality drops. Regression testing automatically verifies that your current version maintains quality at or above baseline levels.

CI/CD Integration

# regression_test.py — Prompt regression testing
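import json
from pathlib import Path

# Pieces defined in the earlier listings
from evaluation_rubric import EvalRubric
from claude_judge import evaluate_with_claude
from prompt_ab_test import PromptVariant, TestCase, _generate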
 
BASELINE_PATH = Path("eval/baselines/current.json")
THRESHOLD = 0.95  # Must maintain 95% of baseline
 
async def run_regression_test(
    variant: PromptVariant,
    test_cases: list[TestCase],
    rubric: EvalRubric,
) -> dict:
    """
    Run regression test and return comparison against baseline
    """
    # Load current baseline
    baseline = json.loads(BASELINE_PATH.read_text())
 
    # Score the new variant
    new_scores = []
    for tc in test_cases:
        output = await _generate(variant, tc.input_text)
        eval_result = await evaluate_with_claude(
            tc.input_text, output, rubric
        )
        normalized = eval_result["weighted_total"] / eval_result["max_possible"]
        new_scores.append({"id": tc.id, "score": normalized})
 
    # Compare against baseline
    baseline_avg = sum(b["score"] for b in baseline["scores"]) / len(baseline["scores"])
    new_avg = sum(s["score"] for s in new_scores) / len(new_scores)
    ratio = new_avg / baseline_avg if baseline_avg > 0 else 0
 
    passed = ratio >= THRESHOLD
    regressions = _find_regressions(baseline["scores"], new_scores)
 
    return {
        "passed": passed,
        "baseline_avg": round(baseline_avg, 4),
        "new_avg": round(new_avg, 4),
        "ratio": round(ratio, 4),
        "threshold": THRESHOLD,
        "regressions": regressions,  # Cases with significant score drops
        "improvements": _find_improvements(baseline["scores"], new_scores),
    }
 
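def _find_regressions(baseline_scores, new_scores, drop_threshold=0.15):
    """Detect cases with significant score drops from baseline"""
    regressions = []
    baseline_map = {s["id"]: s["score"] for s in baseline_scores}
    for ns in new_scores:
        if ns["id"] in baseline_map:
            drop = baseline_map[ns["id"]] - ns["score"]
            if drop > drop_threshold:
                regressions.append({
                    "id": ns["id"],
                    "baseline_score": baseline_map[ns["id"]],
                    "new_score": ns["score"],
                    "drop": round(drop, 4),
                })
    return regressions

def _find_improvements(baseline_scores, new_scores, gain_threshold=0.15):
    """Detect cases with significant score gains over baseline (mirror of the above)"""
    improvements = []
    baseline_map = {s["id"]: s["score"] for s in baseline_scores}
    for ns in new_scores:
        if ns["id"] in baseline_map:
            gain = ns["score"] - baseline_map[ns["id"]]
            if gain > gain_threshold:
                improvements.append({
                    "id": ns["id"],
                    "baseline_score": baseline_map[ns["id"]],
                    "new_score": ns["score"],
                    "gain": round(gain, 4),
                })
    return improvements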
 
# Usage in GitHub Actions / CI:
# python -m pytest eval/test_regression.py -v
# → Failures block the PR

Example output (values depend on your baseline and test set):

{
  "passed": true,
  "baseline_avg": 0.8234,
  "new_avg": 0.8512,
  "ratio": 1.0338,
  "threshold": 0.95,
  "regressions": [],
  "improvements": [
    {"id": "support-015", "baseline_score": 0.72, "new_score": 0.89, "gain": 0.17}
  ]
}
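
The regression test reads a baseline file, but the article does not show how that file gets written. Judging from how `run_regression_test` reads it, the file holds a "scores" list of id and score pairs. The sketch below writes and reads that shape. The `save_baseline` and `load_baseline_avg` helpers are hypothetical, and the example uses a temporary directory in place of the real `eval/baselines/current.json`.

```python
# Write a baseline file in the shape the regression test reads (sketch).
import json
import tempfile
from pathlib import Path


def save_baseline(scores: list[dict], path: Path) -> None:
    """Persist per-case scores as {"scores": [{"id": ..., "score": ...}, ...]}."""
    for s in scores:
        # Scores are normalized (weighted_total / max_possible), so expect 0..1.
        if not 0.0 <= s["score"] <= 1.0:
            raise ValueError(f"score out of range for {s['id']}: {s['score']}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"scores": scores}, indent=2))


def load_baseline_avg(path: Path) -> float:
    """Average baseline score, computed the same way as in the regression test."""
    baseline = json.loads(path.read_text())
    return sum(b["score"] for b in baseline["scores"]) / len(baseline["scores"])


with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "eval" / "baselines" / "current.json"
    save_baseline(
        [{"id": "support-001", "score": 0.8}, {"id": "support-002", "score": 0.9}],
        baseline_path,
    )
    baseline_avg = load_baseline_avg(baseline_path)

print(round(baseline_avg, 4))  # 0.85
```

One reasonable policy is to refresh the baseline only when a change is merged deliberately, so a gradual decline cannot ratchet the bar downward unnoticed.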

Production Quality Monitoring Dashboard

Continuous Monitoring in Production

Your evaluation pipeline shouldn't stop at development time. Here's a pattern for sampling production requests and monitoring quality continuously.

# quality_monitor.py — Production quality monitoring
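import random
import logging
from datetime import datetime, timezone

# Pieces defined in the earlier listings
from evaluation_rubric import EvalRubric
from claude_judge import evaluate_with_claude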
 
logger = logging.getLogger(__name__)
 
class QualityMonitor:
    """Production quality monitoring"""
 
    def __init__(
        self,
        rubric: EvalRubric,
        sample_rate: float = 0.05,  # 5% sampling
        alert_threshold: float = 0.6,  # Alert below 60%
    ):
        self.rubric = rubric
        self.sample_rate = sample_rate
        self.alert_threshold = alert_threshold
        self._scores_buffer: list[dict] = []
 
    async def maybe_evaluate(
        self, input_text: str, output_text: str
    ) -> dict | None:
        """Evaluate based on sampling rate"""
        if random.random() > self.sample_rate:
            return None  # Skip
 
        result = await evaluate_with_claude(
            input_text, output_text, self.rubric
        )
        normalized = result["weighted_total"] / result["max_possible"]
 
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "score": normalized,
            "details": result,
        }
        self._scores_buffer.append(record)
 
        # Alert check
        if normalized < self.alert_threshold:
            logger.warning(
                f"Quality alert: score {normalized:.2f} "
                f"(threshold: {self.alert_threshold})"
            )
            await self._send_alert(record)
 
        return record
 
    async def get_daily_report(self) -> dict:
        """Generate daily quality report"""
        if not self._scores_buffer:
            return {"status": "no_data"}
 
        scores = [r["score"] for r in self._scores_buffer]
        return {
            "date": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
            "total_evaluated": len(scores),
            "avg_score": round(sum(scores) / len(scores), 4),
            "min_score": round(min(scores), 4),
            "max_score": round(max(scores), 4),
            "below_threshold": sum(1 for s in scores if s < self.alert_threshold),
            "p50": round(sorted(scores)[len(scores) // 2], 4),
            "p90": round(sorted(scores)[int(len(scores) * 0.9)], 4),
        }
 
    async def _send_alert(self, record: dict):
        """Send quality alert (connect to Slack/PagerDuty/etc.)"""
        # Implementation: send to Slack webhook
        pass
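
The p50 and p90 values in the daily report come from direct index lookups into the sorted scores, which for some sample sizes land one element higher than a conventional percentile (with 10 scores, the p90 index is the maximum). If that matters for your alerting, an explicit nearest-rank helper is one option. The `percentile` function below is an illustrative sketch, not part of the monitor class, and the scores are made up.

```python
# Nearest-rank percentile helper (illustrative; not part of QualityMonitor).
import math


def percentile(values: list[float], pct: float) -> float:
    """Smallest value such that at least pct percent of the data is at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


scores = [0.55, 0.62, 0.70, 0.74, 0.78, 0.81, 0.84, 0.88, 0.91, 0.95]
print(percentile(scores, 50), percentile(scores, 90))  # 0.78 0.91
```

For small daily samples the difference between methods can be a whole data point, so it helps to pick one definition and keep it fixed across reports.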

TypeScript Implementation

Here's the same pattern using the TypeScript SDK.

// eval-pipeline.ts — TypeScript evaluation pipeline
import Anthropic from "@anthropic-ai/sdk";
 
const client = new Anthropic();
 
interface EvalResult {
  scores: Record<string, { score: number; reasoning: string }>;
  weightedTotal: number;
  maxPossible: number;
  overallFeedback: string;
}
 
async function evaluateOutput(
  input: string,
  output: string,
  rubricPrompt: string,
  model: string = "claude-sonnet-4-6"
): Promise<EvalResult> {
  const response = await client.messages.create({
    model,
    max_tokens: 2000,
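    // Spell out the JSON keys so the reply matches the EvalResult interface.
    system: `You are an expert LLM output evaluator. Respond only with JSON of the form {"scores": {"<criterion name>": {"score": number, "reasoning": string}}, "weightedTotal": number, "maxPossible": number, "overallFeedback": string}.`,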
    messages: [
      {
        role: "user",
        content: `${rubricPrompt}\n\n## Input\n${input}\n\n## Output\n${output}`,
      },
    ],
  });
 
  const text =
    response.content[0].type === "text" ? response.content[0].text : "";
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) throw new Error("JSON parse failed");
 
  return JSON.parse(jsonMatch[0]) as EvalResult;
}
 
// Evaluate many cases concurrently.
// Note: Promise.all sends ordinary Messages API requests in parallel. It does
// not use the Message Batches API, so it gets no batch discount, and a large
// `cases` array may run into rate limits.
async function batchEvaluate(
  cases: Array<{ input: string; output: string }>,
  rubricPrompt: string
): Promise<EvalResult[]> {
  const results = await Promise.all(
    cases.map((c) => evaluateOutput(c.input, c.output, rubricPrompt))
  );
  return results;
}
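
The `batchEvaluate` function above runs ordinary requests concurrently. For large offline evaluation runs, Anthropic's Message Batches API is the asynchronous alternative. Below is a rough Python sketch of preparing the request list, with each entry pairing a `custom_id` with the usual message parameters. The `build_batch_requests` helper is hypothetical, nothing is sent, and the submission call in the closing comment is an assumption that should be checked against the current SDK documentation.

```python
# Build request entries for a batch of judge calls (stdlib only; nothing is sent).

def build_batch_requests(
    cases: list[dict],
    rubric_prompt: str,
    model: str = "claude-sonnet-4-6",
) -> list[dict]:
    """One entry per case, keyed by custom_id so results can be matched back."""
    requests = []
    for i, case in enumerate(cases):
        content = (
            f"{rubric_prompt}\n\n## Input\n{case['input']}"
            f"\n\n## Output\n{case['output']}"
        )
        requests.append({
            "custom_id": f"eval-{i:04d}",
            "params": {
                "model": model,
                "max_tokens": 2000,
                "system": "You are an expert LLM output evaluator. Respond in JSON format.",
                "messages": [{"role": "user", "content": content}],
            },
        })
    return requests


batch_requests = build_batch_requests(
    [
        {"input": "How do I return an item?", "output": "You can return items within 30 days..."},
        {"input": "I was charged twice", "output": "I'm sorry about the duplicate charge..."},
    ],
    rubric_prompt="# Evaluation Rubric: Customer Support Response Quality",
)
print(len(batch_requests), batch_requests[1]["custom_id"])  # 2 eval-0001

# Assumed submission call (verify against the SDK docs before use):
#   client.messages.batches.create(requests=batch_requests)
```

Batch results arrive later rather than in the response, so this route suits nightly regression runs better than interactive checks.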

Summary

In this article, we covered the design and implementation of LLM evaluation pipelines using the Claude API.

The Claude-as-Judge pattern, combined with rubric design and consistency techniques, automates quality assessment and, once checked against human ratings, can stand in for much of the manual review. The prompt A/B testing framework enables data-driven decision-making based on statistical significance, moving prompt management from "gut feel improvements" to systematic optimization. Regression testing and quality monitoring complete the picture for production quality assurance.

Together, these components form the foundation for systematically managing the quality of Claude API-powered applications. We recommend starting with a small test suite and gradually expanding coverage as your confidence grows.
