Building LLM Evaluation Pipelines with Claude API — Claude-as-Judge, Prompt A/B Testing, and Quality Scoring Patterns

Once your Claude API-powered application hits production, you'll inevitably face a critical question: how do you guarantee output quality at scale?

A small tweak to your prompt can produce unexpected outputs. A model version upgrade might break existing workflows. Manual spot-checking misses hundreds of edge cases. Without a systematic approach, you're essentially flying blind.

What follows is an evaluation pipeline for Claude-powered applications built piece by piece: Claude-as-Judge for automated grading, prompt A/B testing with statistical rigor, quality scoring, and regression testing.

Who This Article Is For

Developers running Claude API applications in production
Teams struggling with prompt quality management
Engineers looking to integrate evaluation into their CI/CD pipelines

Prerequisites

Python 3.11+
anthropic Python SDK (latest version)
scipy and numpy (used by the A/B testing framework)
@anthropic-ai/sdk (only for the TypeScript example)

The Claude-as-Judge Pattern — Automated Output Evaluation

Core Concept

Claude-as-Judge uses Claude itself as an evaluator to assess the quality of outputs from other Claude calls (or any LLM output). Given a specific rubric, its scores can track human judgments closely while processing thousands of evaluations in minutes. Check that agreement on a small human-labeled sample before relying on it.

Designing Evaluation Rubrics

Clear evaluation criteria are essential. Vague rubrics produce inconsistent results, so you need specific scoring guidelines.

# evaluation_rubric.py — Evaluation rubric definition
from dataclasses import dataclass, field
from typing import Optional
 
@dataclass
class EvalCriterion:
    """A single evaluation criterion"""
    name: str
    description: str
    scoring_guide: dict[int, str]  # score → description
    weight: float = 1.0
 
@dataclass
class EvalRubric:
    """Complete evaluation rubric"""
    name: str
    criteria: list[EvalCriterion] = field(default_factory=list)
 
    def to_prompt(self) -> str:
        """Convert rubric to prompt string"""
        lines = [f"# Evaluation Rubric: {self.name}\n"]
        for c in self.criteria:
            lines.append(f"## {c.name} (weight: {c.weight})")
            lines.append(f"{c.description}\n")
            for score, desc in sorted(c.scoring_guide.items()):
                lines.append(f"- **{score}**: {desc}")
            lines.append("")
        return "\n".join(lines)
 
# Example: Customer support response quality rubric
support_rubric = EvalRubric(
    name="Customer Support Response Quality",
    criteria=[
        EvalCriterion(
            name="Accuracy",
            description="Is the response factually correct with no misinformation?",
            scoring_guide={
                1: "Contains significant factual errors",
                2: "Minor inaccuracies present",
                3: "Mostly accurate but some vague areas",
                4: "Accurate with specific information provided",
                5: "Completely accurate with relevant supplementary details",
            },
            weight=2.0,
        ),
        EvalCriterion(
            name="Completeness",
            description="Does the response address all parts of the question?",
            scoring_guide={
                1: "Fails to address the main question",
                2: "Only partially addresses the question",
                3: "Addresses the main question but lacks supplementary info",
                4: "Fully addresses the question with next steps",
                5: "Comprehensive answer with proactive related information",
            },
            weight=1.5,
        ),
        EvalCriterion(
            name="Tone",
            description="Is the tone appropriate and empathetic?",
            scoring_guide={
                1: "Cold or overly formal",
                2: "Somewhat mechanical",
                3: "Standard, no issues",
                4: "Friendly and polite",
                5: "Empathetic and reassuring",
            },
            weight=1.0,
        ),
    ],
)

Implementing Claude-as-Judge

Here's the core function that uses the rubric to have Claude evaluate outputs.

# claude_judge.py — Claude-as-Judge evaluation engine
import json
from typing import Any

import anthropic

# Assumes the rubric listing above is saved as evaluation_rubric.py
from evaluation_rubric import EvalRubric

# Async client, so the awaited call below does not block the event loop
client = anthropic.AsyncAnthropic()  # Uses ANTHROPIC_API_KEY env var

async def evaluate_with_claude(
    input_text: str,
    output_text: str,
    rubric: EvalRubric,
    reference_answer: str | None = None,
    model: str = "claude-sonnet-4-6",
) -> dict[str, Any]:
    """
    Evaluate an output using Claude-as-Judge

    Args:
        input_text: Original input (user question, etc.)
        output_text: Output to evaluate
        rubric: Evaluation rubric
        reference_answer: Reference answer (optional)
        model: Model to use for evaluation

    Returns:
        Dictionary with per-criterion scores and overall score
    """
    system_prompt = """You are an expert evaluator of LLM outputs.
Follow the given rubric strictly and score each criterion.
Evaluate objectively and consistently.

Always respond in the following JSON format:
{
  "scores": {
    "criterion_name": {"score": number, "reasoning": "explanation"},
    ...
  },
  "weighted_total": weighted_total_score,
  "max_possible": maximum_possible_score,
  "overall_feedback": "overall feedback"
}"""

    user_prompt = f"""Evaluate the following output against the input.

{rubric.to_prompt()}

## Input
{input_text}

## Output to Evaluate
{output_text}
"""
    if reference_answer:
        user_prompt += f"\n## Reference Answer (ideal response)\n{reference_answer}\n"

    response = await client.messages.create(
        model=model,
        max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )

    # Extract the outermost JSON object; the model may wrap it in prose
    text = response.content[0].text
    json_start = text.find("{")
    json_end = text.rfind("}") + 1
    if json_start == -1 or json_end == 0:
        raise ValueError("Judge response did not contain a JSON object")
    result = json.loads(text[json_start:json_end])

    return result
 
# Usage example
# result = await evaluate_with_claude(
#     input_text="How do I return an item?",
#     output_text="You can return items within 30 days...",
#     rubric=support_rubric,
# )
# print(f"Overall score: {result['weighted_total']}/{result['max_possible']}")

Example output (scores vary between runs):

{
  "scores": {
    "Accuracy": {"score": 4, "reasoning": "Return policy and steps are accurately described"},
    "Completeness": {"score": 3, "reasoning": "Covers basic steps but missing shipping cost details"},
    "Tone": {"score": 4, "reasoning": "Friendly and polite tone throughout"}
  },
  "weighted_total": 16.5,
  "max_possible": 22.5,
  "overall_feedback": "Accurate and polite response, but could include shipping cost information"
}
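The totals in this sample follow directly from the rubric weights: 4 × 2.0 + 3 × 1.5 + 4 × 1.0 = 16.5, out of 5 × (2.0 + 1.5 + 1.0) = 22.5. Because the judge is asked to do that arithmetic itself, one cheap safeguard is to recompute both numbers in code from the per-criterion scores. The sketch below is an illustration, not part of the original listing; `recompute_totals` and its `max_score` parameter are hypothetical names, and it assumes every criterion uses the same 1 to 5 scale.

```python
def recompute_totals(
    scores: dict[str, dict],
    weights: dict[str, float],
    max_score: int = 5,
) -> tuple[float, float]:
    """Return (weighted_total, max_possible) from per-criterion scores.

    Raises KeyError if the judge omitted a criterion, which is usually
    what you want: a missing score should not silently count as zero.
    """
    weighted_total = sum(weights[name] * scores[name]["score"] for name in weights)
    max_possible = max_score * sum(weights.values())
    return weighted_total, max_possible

# Weights copied from support_rubric; scores copied from the sample judge output
weights = {"Accuracy": 2.0, "Completeness": 1.5, "Tone": 1.0}
scores = {
    "Accuracy": {"score": 4},
    "Completeness": {"score": 3},
    "Tone": {"score": 4},
}
total, max_possible = recompute_totals(scores, weights)
print(total, max_possible)  # 16.5 22.5
```

Using the recomputed values instead of the judge's `weighted_total` and `max_possible` keeps downstream normalization independent of the model's arithmetic.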

Techniques for Improving Evaluation Consistency

To increase the accuracy of Claude-as-Judge evaluations, consider these approaches:

Few-shot examples: Include concrete response examples at each score level alongside the rubric
Pairwise comparison: Comparing two outputs side by side tends to be more consistent than absolute scoring
Median of multiple runs: Run the same evaluation several times (3 by default in the code below) and take the median to reduce variance
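The pairwise idea from the list above can be sketched as follows. This is a minimal illustration rather than the article's implementation: the judge is passed in as a plain callable so the logic can be tried without an API call, and `build_pairwise_prompt`, `pairwise_verdict`, and the two stub judges are hypothetical names. Each pair is judged twice with the positions swapped, and a winner is declared only when both orderings agree, which guards against a judge that favors whichever response it sees first.

```python
from typing import Callable

def build_pairwise_prompt(input_text: str, response_1: str, response_2: str) -> str:
    """Prompt asking the judge to reply with exactly '1', '2', or 'tie'."""
    return (
        "Which response answers the input better?\n"
        "Reply with exactly one token: 1, 2, or tie.\n\n"
        f"## Input\n{input_text}\n\n"
        f"## Response 1\n{response_1}\n\n"
        f"## Response 2\n{response_2}\n"
    )

def pairwise_verdict(
    judge: Callable[[str], str],
    input_text: str,
    output_a: str,
    output_b: str,
) -> str:
    """Return 'A', 'B', or 'tie' after judging both presentation orders."""
    first = judge(build_pairwise_prompt(input_text, output_a, output_b)).strip().lower()
    second = judge(build_pairwise_prompt(input_text, output_b, output_a)).strip().lower()
    if first == "1" and second == "2":
        return "A"  # A won in both positions
    if first == "2" and second == "1":
        return "B"  # B won in both positions
    return "tie"    # disagreement between orders, or an explicit tie

# Stub judges for illustration only; a real judge would call the Claude API
def always_first(prompt: str) -> str:
    return "1"  # position-biased: always picks the response shown first

def prefers_refund_mention(prompt: str) -> str:
    shown_first = prompt.split("## Response 1\n")[1].split("\n\n## Response 2\n")[0]
    return "1" if "refund" in shown_first else "2"

biased = pairwise_verdict(always_first, "How do I return an item?", "x", "y")
fair = pairwise_verdict(
    prefers_refund_mention,
    "How do I return an item?",
    "Please contact support.",
    "You can get a refund within 30 days.",
)
print(biased, fair)  # tie B
```

The position-biased stub ends up as a tie because its two answers contradict each other, which is the behavior you want from the swap check.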

async def evaluate_with_consistency(
    input_text: str,
    output_text: str,
    rubric: EvalRubric,
    n_runs: int = 3,
) -> dict:
    """Evaluate multiple times and take the median"""
    import statistics
 
    results = []
    for _ in range(n_runs):
        result = await evaluate_with_claude(input_text, output_text, rubric)
        results.append(result)
 
    # Calculate median score for each criterion
    median_scores = {}
    for criterion in rubric.criteria:
        scores = [r["scores"][criterion.name]["score"] for r in results]
        median_scores[criterion.name] = statistics.median(scores)
 
    return {
        "median_scores": median_scores,
        "all_runs": results,
        "consistency": _calculate_consistency(results, rubric),
    }

Prompt A/B Testing Framework

Why A/B Test Your Prompts?

"This prompt feels better" is a dangerous basis for production decisions. Prompt A/B testing runs two or more prompt variants against the same test cases and determines whether there's a statistically significant difference.

Framework Design

# prompt_ab_test.py — Prompt A/B testing framework
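import anthropic
import asyncio
from dataclasses import dataclass
from datetime import datetime
from scipy import stats
import numpy as np

# Assumes the earlier listings are saved under these file names
from evaluation_rubric import EvalRubric
from claude_judge import evaluate_with_claude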
 
client = anthropic.Anthropic()
 
@dataclass
class PromptVariant:
    """A prompt variant"""
    name: str
    system_prompt: str
    model: str = "claude-sonnet-4-6"
    temperature: float = 0.0
    max_tokens: int = 4096
 
@dataclass
class TestCase:
    """A test case"""
    id: str
    input_text: str
    reference_answer: str | None = None
    metadata: dict | None = None
 
@dataclass
class ABTestResult:
    """A/B test results"""
    variant_a_scores: list[float]
    variant_b_scores: list[float]
    t_statistic: float
    p_value: float
    is_significant: bool
    winner: str | None
    effect_size: float
 
async def run_ab_test(
    variant_a: PromptVariant,
    variant_b: PromptVariant,
    test_cases: list[TestCase],
    rubric: EvalRubric,
    significance_level: float = 0.05,
) -> ABTestResult:
    """
    A/B test two prompt variants
 
    Args:
        variant_a: Variant A (baseline)
        variant_b: Variant B (challenger)
        test_cases: Test case set
        rubric: Evaluation rubric
        significance_level: Significance level (default 5%)
    """
    scores_a, scores_b = [], []
 
    for tc in test_cases:
        # Generate outputs from both variants
        output_a = await _generate(variant_a, tc.input_text)
        output_b = await _generate(variant_b, tc.input_text)
 
        # Evaluate with Claude-as-Judge
        eval_a = await evaluate_with_claude(
            tc.input_text, output_a, rubric, tc.reference_answer
        )
        eval_b = await evaluate_with_claude(
            tc.input_text, output_b, rubric, tc.reference_answer
        )
 
        scores_a.append(eval_a["weighted_total"] / eval_a["max_possible"])
        scores_b.append(eval_b["weighted_total"] / eval_b["max_possible"])
 
    # Paired t-test (same test cases compared)
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    is_significant = p_value < significance_level
 
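    # Effect size: Cohen's d for paired samples (mean difference / sample SD of differences)
    diff = np.array(scores_a) - np.array(scores_b)
    sd_diff = np.std(diff, ddof=1)  # ddof=1 matches the sample SD used by the paired t-test
    effect_size = float(np.mean(diff) / sd_diff) if sd_diff > 0 else 0.0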
 
    winner = None
    if is_significant:
        winner = variant_a.name if np.mean(scores_a) > np.mean(scores_b) else variant_b.name
 
    return ABTestResult(
        variant_a_scores=scores_a,
        variant_b_scores=scores_b,
        t_statistic=t_stat,
        p_value=p_value,
        is_significant=is_significant,
        winner=winner,
        effect_size=effect_size,
    )
 
async def _generate(variant: PromptVariant, input_text: str) -> str:
    """Generate output from a prompt variant"""
    # The module-level client is synchronous, so run the blocking call in a
    # worker thread to keep the event loop free.
    response = await asyncio.to_thread(
        client.messages.create,
        model=variant.model,
        max_tokens=variant.max_tokens,
        temperature=variant.temperature,
        system=variant.system_prompt,
        messages=[{"role": "user", "content": input_text}],
    )
    return response.content[0].text
 
# Expected output example:
# ABTestResult(
#     variant_a_scores=[0.82, 0.78, 0.85, ...],
#     variant_b_scores=[0.88, 0.91, 0.87, ...],
#     t_statistic=-3.42,
#     p_value=0.002,
#     is_significant=True,
#     winner="variant_b_detailed_prompt",
#     effect_size=-0.65  # negative because the difference is A minus B and B scored higher
# )
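To make the statistics in `run_ab_test` less of a black box, here is the same paired calculation written out with only the standard library. It is a sketch for intuition, not a replacement for `scipy.stats.ttest_rel`: it computes the t statistic and the paired effect size but not the p-value, and `paired_stats` plus the five sample scores are made up for illustration. It assumes at least two cases and a non-zero spread in the differences.

```python
import math
import statistics

def paired_stats(scores_a: list[float], scores_b: list[float]) -> dict:
    """Paired t statistic and Cohen's d for paired samples (A minus B)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample SD, n - 1 in the denominator
    d_z = mean_diff / sd_diff                  # effect size
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return {"n": n, "mean_diff": mean_diff, "d_z": d_z, "t": t_stat}

# Hypothetical normalized scores for five shared test cases
scores_a = [0.80, 0.70, 0.90, 0.60, 0.75]
scores_b = [0.85, 0.80, 0.90, 0.70, 0.85]
result = paired_stats(scores_a, scores_b)
print(round(result["mean_diff"], 3), round(result["t"], 3))  # -0.07 -3.5
```

Two things are worth noticing. The sign of both numbers follows the A minus B convention, so a negative value means B scored higher. And the t statistic is just the effect size multiplied by the square root of the number of cases, which is why a small but consistent improvement becomes significant once the test set is large enough.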

Test Case Design Strategy

The reliability of your A/B test depends heavily on test case quality and quantity.

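Minimum 30 cases: Treat 30 test cases as a floor, not a guarantee. A paired test on about 30 cases can reliably detect only medium-to-large differences; smaller effects need more cases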
Edge case coverage: Include ambiguous inputs, long inputs, and multilingual inputs beyond just standard cases
Category balance: Distribute test cases evenly across use case categories
Golden dataset: Prepare ideal reference answers written by humans

# Building a test case set
test_cases = [
    TestCase(
        id="support-001",
        input_text="My order hasn't arrived. Order number is ORD-12345.",
        reference_answer="I'm sorry for the inconvenience. For order ORD-12345...",
        metadata={"category": "shipping", "difficulty": "standard"},
    ),
    TestCase(
        id="support-002",
        input_text="I was charged twice on last month's bill",
        reference_answer=None,  # No reference → rubric evaluation only
        metadata={"category": "billing", "difficulty": "complex"},
    ),
    # ... 30+ cases
]
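How many cases are enough depends on how small a difference you want to detect. As a rough guide, and assuming a two-sided paired test with a normal approximation (which slightly underestimates the count an exact t-based calculation would give), the required number of cases can be estimated from the expected paired effect size. `required_cases` is a hypothetical helper, not part of the framework above.

```python
import math
from statistics import NormalDist

def required_cases(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate paired test cases needed to detect `effect_size` (Cohen's d)."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, required_cases(d))
# 0.2 197
# 0.5 32
# 0.8 13
```

Under these assumptions the 30-case minimum corresponds to detecting a medium effect (d around 0.5) with 80% power. Detecting a small effect takes closer to 200 cases.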

Regression Testing — Protecting Quality Across Prompt Changes

Why Regression Testing Matters

Prompt improvements and model version changes can cause unintended quality drops. Regression testing automatically verifies that your current version maintains quality at or above baseline levels.
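A baseline in this setup is simply a stored set of per-case scores from a version you trust. The regression test in the next section reads a JSON file with a top-level `scores` list of `id`/`score` pairs, but the article does not show how that file is produced. A minimal sketch of writing and reading it, with `save_baseline` and `load_baseline` as hypothetical helper names:

```python
import json
import tempfile
from pathlib import Path

def save_baseline(scores: list[dict], path: Path) -> None:
    """Write per-case scores as {"scores": [{"id": ..., "score": ...}, ...]}."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"scores": [{"id": s["id"], "score": s["score"]} for s in scores]}
    path.write_text(json.dumps(payload, indent=2))

def load_baseline(path: Path) -> dict:
    """Read a baseline file written by save_baseline."""
    return json.loads(path.read_text())

# Round-trip demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "eval" / "baselines" / "current.json"
    save_baseline(
        [{"id": "support-001", "score": 0.84}, {"id": "support-002", "score": 0.78}],
        demo_path,
    )
    loaded = load_baseline(demo_path)

print(loaded["scores"][0])  # {'id': 'support-001', 'score': 0.84}
```

Committing the baseline file to the repository means a baseline update shows up in code review like any other change.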

CI/CD Integration

# regression_test.py — Prompt regression testing
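import json
from pathlib import Path

# Assumes the earlier listings are saved under these file names
from evaluation_rubric import EvalRubric
from claude_judge import evaluate_with_claude
from prompt_ab_test import PromptVariant, TestCase, _generate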
 
BASELINE_PATH = Path("eval/baselines/current.json")
THRESHOLD = 0.95  # Must maintain 95% of baseline
 
async def run_regression_test(
    variant: PromptVariant,
    test_cases: list[TestCase],
    rubric: EvalRubric,
) -> dict:
    """
    Run regression test and return comparison against baseline
    """
    # Load current baseline
    baseline = json.loads(BASELINE_PATH.read_text())
 
    # Score the new variant
    new_scores = []
    for tc in test_cases:
        output = await _generate(variant, tc.input_text)
        eval_result = await evaluate_with_claude(
            tc.input_text, output, rubric, tc.reference_answer
        )
        normalized = eval_result["weighted_total"] / eval_result["max_possible"]
        new_scores.append({"id": tc.id, "score": normalized})
 
    # Compare against baseline
    baseline_avg = sum(b["score"] for b in baseline["scores"]) / len(baseline["scores"])
    new_avg = sum(s["score"] for s in new_scores) / len(new_scores)
    ratio = new_avg / baseline_avg if baseline_avg > 0 else 0
 
    passed = ratio >= THRESHOLD
    regressions = _find_regressions(baseline["scores"], new_scores)
 
    return {
        "passed": passed,
        "baseline_avg": round(baseline_avg, 4),
        "new_avg": round(new_avg, 4),
        "ratio": round(ratio, 4),
        "threshold": THRESHOLD,
        "regressions": regressions,  # Cases with significant score drops
        "improvements": _find_improvements(baseline["scores"], new_scores),
    }
 
def _find_regressions(baseline_scores, new_scores, drop_threshold=0.15):
    """Detect cases with significant score drops from baseline"""
    regressions = []
    baseline_map = {s["id"]: s["score"] for s in baseline_scores}
    for ns in new_scores:
        if ns["id"] in baseline_map:
            drop = baseline_map[ns["id"]] - ns["score"]
            if drop > drop_threshold:
                regressions.append({
                    "id": ns["id"],
                    "baseline_score": baseline_map[ns["id"]],
                    "new_score": ns["score"],
                    "drop": round(drop, 4),
                })
    return regressions
 
# Usage in GitHub Actions / CI:
# python -m pytest eval/test_regression.py -v
# → Failures block the PR

Example output:

{
  "passed": true,
  "baseline_avg": 0.8234,
  "new_avg": 0.8512,
  "ratio": 1.0338,
  "threshold": 0.95,
  "regressions": [],
  "improvements": [
    {"id": "support-015", "baseline_score": 0.72, "new_score": 0.89, "gain": 0.17}
  ]
}

Production Quality Monitoring Dashboard

Continuous Monitoring in Production

Your evaluation pipeline shouldn't stop at development time. Here's a pattern for sampling production requests and monitoring quality continuously.

# quality_monitor.py — Production quality monitoring
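import random
import logging
from datetime import datetime, timezone

# Assumes the earlier listings are saved under these file names
from evaluation_rubric import EvalRubric
from claude_judge import evaluate_with_claude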
 
logger = logging.getLogger(__name__)
 
class QualityMonitor:
    """Production quality monitoring"""
 
    def __init__(
        self,
        rubric: EvalRubric,
        sample_rate: float = 0.05,  # 5% sampling
        alert_threshold: float = 0.6,  # Alert below 60%
    ):
        self.rubric = rubric
        self.sample_rate = sample_rate
        self.alert_threshold = alert_threshold
        self._scores_buffer: list[dict] = []
 
    async def maybe_evaluate(
        self, input_text: str, output_text: str
    ) -> dict | None:
        """Evaluate based on sampling rate"""
        if random.random() > self.sample_rate:
            return None  # Skip
 
        result = await evaluate_with_claude(
            input_text, output_text, self.rubric
        )
        normalized = result["weighted_total"] / result["max_possible"]
 
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "score": normalized,
            "details": result,
        }
        self._scores_buffer.append(record)
 
        # Alert check
        if normalized < self.alert_threshold:
            logger.warning(
                f"Quality alert: score {normalized:.2f} "
                f"(threshold: {self.alert_threshold})"
            )
            await self._send_alert(record)
 
        return record
 
    async def get_daily_report(self) -> dict:
        """Generate daily quality report"""
        if not self._scores_buffer:
            return {"status": "no_data"}
 
        scores = [r["score"] for r in self._scores_buffer]
        return {
            "date": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
            "total_evaluated": len(scores),
            "avg_score": round(sum(scores) / len(scores), 4),
            "min_score": round(min(scores), 4),
            "max_score": round(max(scores), 4),
            "below_threshold": sum(1 for s in scores if s < self.alert_threshold),
            "p50": round(sorted(scores)[len(scores) // 2], 4),
            "p90": round(sorted(scores)[int(len(scores) * 0.9)], 4),
        }
 
    async def _send_alert(self, record: dict):
        """Send quality alert (connect to Slack/PagerDuty/etc.)"""
        # Implementation: send to Slack webhook
        pass
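The `_send_alert` stub can be filled in many ways. As one possibility, the sketch below builds a Slack message from a monitor record and posts it to an incoming webhook using only the standard library. `build_alert_payload` and `post_alert` are hypothetical names, and the webhook URL is something you would supply yourself. Slack incoming webhooks accept a JSON body with a `text` field, which is all this relies on.

```python
import json
import urllib.request

def build_alert_payload(record: dict, threshold: float) -> dict:
    """Turn a QualityMonitor record into a Slack incoming-webhook payload."""
    feedback = record.get("details", {}).get("overall_feedback", "n/a")
    return {
        "text": (
            f"Quality alert: score {record['score']:.2f} "
            f"(threshold {threshold:.2f}) at {record['timestamp']}\n"
            f"{feedback}"
        )
    }

def post_alert(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code (blocking call)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Payload demo with a made-up record; nothing is sent here
demo_record = {
    "timestamp": "2025-01-01T00:00:00+00:00",
    "score": 0.42,
    "details": {"overall_feedback": "Missed the refund question"},
}
payload = build_alert_payload(demo_record, threshold=0.6)
print(payload["text"].splitlines()[0])
# Quality alert: score 0.42 (threshold 0.60) at 2025-01-01T00:00:00+00:00
```

Since `post_alert` blocks, calling it from the async `_send_alert` would be best done through `asyncio.to_thread` so a slow webhook does not stall request handling.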

TypeScript Implementation

Here's the same pattern using the TypeScript SDK.

// eval-pipeline.ts — TypeScript evaluation pipeline
import Anthropic from "@anthropic-ai/sdk";
 
const client = new Anthropic();
 
interface EvalResult {
  scores: Record<string, { score: number; reasoning: string }>;
  weightedTotal: number;
  maxPossible: number;
  overallFeedback: string;
}
 
async function evaluateOutput(
  input: string,
  output: string,
  rubricPrompt: string,
  model: string = "claude-sonnet-4-6"
): Promise<EvalResult> {
  const response = await client.messages.create({
    model,
    max_tokens: 2000,
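    system: `You are an expert LLM output evaluator. Respond with a single JSON object with keys "scores" (criterion name mapped to {"score": number, "reasoning": string}), "weightedTotal", "maxPossible", and "overallFeedback".`,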
    messages: [
      {
        role: "user",
        content: `${rubricPrompt}\n\n## Input\n${input}\n\n## Output\n${output}`,
      },
    ],
  });
 
  const text =
    response.content[0].type === "text" ? response.content[0].text : "";
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) throw new Error("JSON parse failed");
 
  return JSON.parse(jsonMatch[0]) as EvalResult;
}
 
// Evaluate many cases concurrently
async function batchEvaluate(
  cases: Array<{ input: string; output: string }>,
  rubricPrompt: string
): Promise<EvalResult[]> {
  // Concurrent requests via Promise.all. This is not the Message Batches API,
  // so the batch discount does not apply and normal rate limits still do.
  const results = await Promise.all(
    cases.map((c) => evaluateOutput(c.input, c.output, rubricPrompt))
  );
  return results;
}
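For large offline evaluation runs where results are not needed immediately, the Message Batches API is the route to discounted pricing. The sketch below is in Python to match the rest of the listings and should be read as an outline under assumptions: `build_batch_requests` and `submit_batch` are hypothetical helpers, the request shape (a `custom_id` plus a `params` object holding ordinary Messages API parameters) follows the SDK's `client.messages.batches.create(requests=...)` call, and the system prompt is copied from the TypeScript example. Only the request builder is exercised here; submitting requires an API key.

```python
def build_batch_requests(
    cases: list[dict],
    rubric_prompt: str,
    model: str = "claude-sonnet-4-6",
    id_prefix: str = "eval",
) -> list[dict]:
    """Build one batch request per case; custom_id lets you match results later."""
    requests = []
    for index, case in enumerate(cases):
        requests.append({
            "custom_id": f"{id_prefix}-{index}",
            "params": {
                "model": model,
                "max_tokens": 2000,
                "system": "You are an expert LLM output evaluator. Respond in JSON format.",
                "messages": [{
                    "role": "user",
                    "content": (
                        f"{rubric_prompt}\n\n## Input\n{case['input']}"
                        f"\n\n## Output\n{case['output']}"
                    ),
                }],
            },
        })
    return requests

def submit_batch(requests: list[dict]) -> str:
    """Submit the batch and return its id (needs ANTHROPIC_API_KEY; not run here)."""
    import anthropic  # imported lazily so the builder works without the SDK

    client = anthropic.Anthropic()
    batch = client.messages.batches.create(requests=requests)
    return batch.id

demo_requests = build_batch_requests(
    [{"input": "How do I return an item?", "output": "Within 30 days..."}],
    rubric_prompt="# Evaluation Rubric: Customer Support Response Quality",
)
print(demo_requests[0]["custom_id"])  # eval-0
```

Batches are processed asynchronously, so the caller polls the batch until processing has ended and then reads the results, matching each one back to its case through `custom_id`. Check the current SDK documentation for the exact polling and result-retrieval calls.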

Summary

Design and implementation of an LLM evaluation pipeline on the Claude API come down to a few interlocking parts.

The Claude-as-Judge pattern, combined with rubric design and consistency techniques, automates quality assessment at a level that can approach human evaluators once you have checked it against human labels. The prompt A/B testing framework enables data-driven decision-making based on statistical significance, moving prompt management from "gut feel improvements" to systematic optimization. Regression testing and quality monitoring complete the picture for production quality assurance.

Together, these components form the foundation for systematically managing the quality of Claude API-powered applications. Start with ten cases you already know the right answer to, wire them into CI, and expand only once that loop runs without you.