Setup and context — Why You Need an LLM Evaluation Pipeline
Once your Claude API-powered application hits production, you'll inevitably face a critical question: how do you guarantee output quality at scale?
A small tweak to your prompt can produce unexpected outputs. A model version upgrade might break existing workflows. Manual spot-checking misses hundreds of edge cases. Without a systematic approach, you're essentially flying blind.
This article walks you through building a comprehensive evaluation pipeline for Claude-powered applications. We'll cover Claude-as-Judge for automated evaluation, prompt A/B testing with statistical rigor, quality scoring systems, and regression testing — everything you need to ship with confidence.
Who This Article Is For
- Developers running Claude API applications in production
- Teams struggling with prompt quality management
- Engineers looking to integrate evaluation into their CI/CD pipelines
Prerequisites
- Python 3.11+
- anthropic Python SDK (latest version)
- TypeScript examples are also included
The Claude-as-Judge Pattern — Automated Output Evaluation
Core Concept
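Claude-as-Judge uses Claude itself as an evaluator to assess the quality of outputs from other Claude calls (or any LLM output). The key advantage is scale: a well-calibrated judge can correlate closely with human evaluators while processing thousands of evaluations in minutes. How close that correlation is depends on the rubric and the task, so check it against a human-labeled sample before relying on it.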
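One way to check how well a judge tracks human evaluators is to score a small sample both ways and compare. The sketch below is an illustration rather than part of the pipeline: `judge_agreement` is a hypothetical helper, and it assumes both lists hold integer scores on the same 1 to 5 scale for the same items.

```python
import math


def judge_agreement(human: list[int], judge: list[int]) -> dict:
    """Compare human and judge scores for the same items (hypothetical helper)."""
    if len(human) != len(judge) or len(human) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    n = len(human)
    # Share of items where the judge matches the human exactly / within 1 point
    exact = sum(h == j for h, j in zip(human, judge)) / n
    within_one = sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / n
    # Pearson correlation, computed by hand to stay dependency-free
    mean_h, mean_j = sum(human) / n, sum(judge) / n
    cov = sum((h - mean_h) * (j - mean_j) for h, j in zip(human, judge))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_j = sum((j - mean_j) ** 2 for j in judge)
    if var_h > 0 and var_j > 0:
        pearson = cov / math.sqrt(var_h * var_j)
    else:
        pearson = float("nan")  # undefined when either side never varies
    return {"exact": exact, "within_one": within_one, "pearson": pearson}


report = judge_agreement(human=[5, 4, 3, 2, 1], judge=[4, 4, 3, 3, 1])
print(report)
```

Low exact agreement with high within-one agreement usually points at a rubric whose adjacent score levels are hard to tell apart, which is a cue to tighten the scoring guide.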
Designing Evaluation Rubrics
Clear evaluation criteria are essential. Vague rubrics produce inconsistent results, so you need specific scoring guidelines.
# evaluation_rubric.py: Evaluation rubric definition
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCriterion:
    """A single evaluation criterion"""
    name: str
    description: str
    scoring_guide: dict[int, str]  # score → description
    weight: float = 1.0


@dataclass
class EvalRubric:
    """Complete evaluation rubric"""
    name: str
    criteria: list[EvalCriterion] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Convert rubric to prompt string"""
        lines = [f"# Evaluation Rubric: {self.name}\n"]
        for c in self.criteria:
            lines.append(f"## {c.name} (weight: {c.weight})")
            lines.append(f"{c.description}\n")
            for score, desc in sorted(c.scoring_guide.items()):
                lines.append(f"- **{score}**: {desc}")
            lines.append("")
        return "\n".join(lines)


# Example: Customer support response quality rubric
support_rubric = EvalRubric(
    name="Customer Support Response Quality",
    criteria=[
        EvalCriterion(
            name="Accuracy",
            description="Is the response factually correct with no misinformation?",
            scoring_guide={
                1: "Contains significant factual errors",
                2: "Minor inaccuracies present",
                3: "Mostly accurate but some vague areas",
                4: "Accurate with specific information provided",
                5: "Completely accurate with relevant supplementary details",
            },
            weight=2.0,
        ),
        EvalCriterion(
            name="Completeness",
            description="Does the response address all parts of the question?",
            scoring_guide={
                1: "Fails to address the main question",
                2: "Only partially addresses the question",
                3: "Addresses the main question but lacks supplementary info",
                4: "Fully addresses the question with next steps",
                5: "Comprehensive answer with proactive related information",
            },
            weight=1.5,
        ),
        EvalCriterion(
            name="Tone",
            description="Is the tone appropriate and empathetic?",
            scoring_guide={
                1: "Cold or overly formal",
                2: "Somewhat mechanical",
                3: "Standard, no issues",
                4: "Friendly and polite",
                5: "Empathetic and reassuring",
            },
            weight=1.0,
        ),
    ],
)

Implementing Claude-as-Judge
Here's the core function that uses the rubric to have Claude evaluate outputs.
# claude_judge.py: Claude-as-Judge evaluation engine
import json
from typing import Any

import anthropic

from evaluation_rubric import EvalRubric

# Async client, so the awaited call below does not block the event loop.
# Uses the ANTHROPIC_API_KEY env var.
client = anthropic.AsyncAnthropic()


async def evaluate_with_claude(
    input_text: str,
    output_text: str,
    rubric: EvalRubric,
    reference_answer: str | None = None,
    model: str = "claude-sonnet-4-6",
) -> dict[str, Any]:
    """
    Evaluate an output using Claude-as-Judge

    Args:
        input_text: Original input (user question, etc.)
        output_text: Output to evaluate
        rubric: Evaluation rubric
        reference_answer: Reference answer (optional)
        model: Model to use for evaluation

    Returns:
        Dictionary with per-criterion scores and overall score
    """
    system_prompt = """You are an expert evaluator of LLM outputs.
Follow the given rubric strictly and score each criterion.
Evaluate objectively and consistently.
Always respond in the following JSON format:
{
  "scores": {
    "criterion_name": {"score": number, "reasoning": "explanation"},
    ...
  },
  "weighted_total": weighted_total_score,
  "max_possible": maximum_possible_score,
  "overall_feedback": "overall feedback"
}"""

    user_prompt = f"""Evaluate the following output against the input.

{rubric.to_prompt()}

## Input
{input_text}

## Output to Evaluate
{output_text}
"""
    if reference_answer:
        user_prompt += f"\n## Reference Answer (ideal response)\n{reference_answer}\n"

    response = await client.messages.create(
        model=model,
        max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )

    # Extract the outermost JSON object from the reply and parse it
    text = response.content[0].text
    json_start = text.find("{")
    json_end = text.rfind("}") + 1
    if json_start == -1 or json_end == 0:
        raise ValueError("Judge response did not contain a JSON object")
    return json.loads(text[json_start:json_end])


# Usage example (inside an async function)
# result = await evaluate_with_claude(
#     input_text="How do I return an item?",
#     output_text="You can return items within 30 days...",
#     rubric=support_rubric,
# )
# print(f"Overall score: {result['weighted_total']}/{result['max_possible']}")

Expected output:
{
  "scores": {
    "Accuracy": {"score": 4, "reasoning": "Return policy and steps are accurately described"},
    "Completeness": {"score": 3, "reasoning": "Covers basic steps but missing shipping cost details"},
    "Tone": {"score": 4, "reasoning": "Friendly and polite tone throughout"}
  },
  "weighted_total": 16.5,
  "max_possible": 22.5,
  "overall_feedback": "Accurate and polite response, but could include shipping cost information"
}

Techniques for Improving Evaluation Consistency
To increase the accuracy of Claude-as-Judge evaluations, consider these approaches:
- Few-shot examples: Include concrete response examples at each score level alongside the rubric
- Pairwise comparison: Asking the judge which of two outputs is better is often more consistent than scoring each output on an absolute scale
- Median of multiple runs: Run the same evaluation 3 times and take the median to reduce variance
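The pairwise approach from the list above can be sketched as follows. This is a minimal illustration under stated assumptions: `ask_judge` is a hypothetical callable that sends a prompt to the judge model and returns its text reply, and the prompt wording and `VERDICT:` line format are made up for the example. Each pair is judged twice with the order swapped, and a winner is reported only when both orderings agree, which is a common guard against the judge favoring whichever response appears first.

```python
import re
from typing import Callable

# Hypothetical prompt template; the wording is an assumption for illustration.
PAIRWISE_PROMPT = (
    "Compare two responses to the same question.\n"
    "Question:\n{question}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Reply with exactly one line: VERDICT: A, VERDICT: B, or VERDICT: TIE."
)


def parse_verdict(reply: str) -> str:
    """Return 'A', 'B', or 'TIE'; unparseable replies are treated as 'TIE'."""
    match = re.search(r"VERDICT:\s*(A|B|TIE)\b", reply.upper())
    return match.group(1) if match else "TIE"


def pairwise_compare(
    question: str,
    output_1: str,
    output_2: str,
    ask_judge: Callable[[str], str],
) -> str:
    """Judge both orderings; return 'output_1', 'output_2', or 'tie'."""
    # Round 1: output_1 is shown as Response A
    first = parse_verdict(
        ask_judge(PAIRWISE_PROMPT.format(question=question, a=output_1, b=output_2))
    )
    # Round 2: same pair with the positions swapped
    swapped = parse_verdict(
        ask_judge(PAIRWISE_PROMPT.format(question=question, a=output_2, b=output_1))
    )
    # Translate the swapped verdict back to the round-1 labelling
    second = {"A": "B", "B": "A", "TIE": "TIE"}[swapped]
    if first == second == "A":
        return "output_1"
    if first == second == "B":
        return "output_2"
    return "tie"  # disagreement between orderings is not counted as a win
```

In a real run, `ask_judge` would wrap the same Claude call used for rubric scoring; passing it in as a parameter keeps the comparison logic testable without network access.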
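import statistics

from claude_judge import evaluate_with_claude
from evaluation_rubric import EvalRubric


def _calculate_consistency(results: list[dict], rubric: EvalRubric) -> float:
    """
    Assumed helper (not defined in the original article): the fraction of
    criteria on which every run returned the same score (1.0 = full agreement).
    """
    if not rubric.criteria:
        return 1.0
    agreed = sum(
        1
        for c in rubric.criteria
        if len({r["scores"][c.name]["score"] for r in results}) == 1
    )
    return agreed / len(rubric.criteria)


async def evaluate_with_consistency(
    input_text: str,
    output_text: str,
    rubric: EvalRubric,
    n_runs: int = 3,
) -> dict:
    """Evaluate multiple times and take the median"""
    results = []
    for _ in range(n_runs):
        result = await evaluate_with_claude(input_text, output_text, rubric)
        results.append(result)

    # Calculate median score for each criterion
    median_scores = {}
    for criterion in rubric.criteria:
        scores = [r["scores"][criterion.name]["score"] for r in results]
        median_scores[criterion.name] = statistics.median(scores)

    return {
        "median_scores": median_scores,
        "all_runs": results,
        "consistency": _calculate_consistency(results, rubric),
    }

Prompt A/B Testing Framework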
Why A/B Test Your Prompts?
"This prompt feels better" is a dangerous basis for production decisions. Prompt A/B testing runs two or more prompt variants against the same test cases and determines whether there's a statistically significant difference.
Framework Design
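The framework in this section delegates the significance test to `scipy.stats.ttest_rel`. As a sanity check on what a paired test measures, the sketch below computes the t statistic and the paired effect size by hand with the standard library. The function name and the sample scores are made up for illustration; the p-value is left to SciPy because the standard library has no t-distribution.

```python
import math
from statistics import mean, stdev


def paired_t_and_effect_size(scores_a: list[float], scores_b: list[float]):
    """Return (t statistic, paired Cohen's d) for per-case scores A vs. B."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 cases")
    # The paired test works on per-case differences, not on the two means
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    effect_size = mean_diff / sd_diff
    return t_stat, effect_size


# Illustrative scores only: variant B is slightly better on every case
a = [0.80, 0.75, 0.90, 0.70, 0.85]
b = [0.85, 0.80, 0.92, 0.78, 0.88]
t_stat, d = paired_t_and_effect_size(a, b)
print(f"t = {t_stat:.3f}, d = {d:.3f}")
```

Because the differences are taken as A minus B, both values come out negative when B scores higher. The two quantities are linked by t = d × √n, so a consistent small gain becomes significant as the number of test cases grows.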
# prompt_ab_test.py: Prompt A/B testing framework
from dataclasses import dataclass

import anthropic
import numpy as np
from scipy import stats

from claude_judge import evaluate_with_claude
from evaluation_rubric import EvalRubric

# Async client, so generation calls can be awaited without blocking
client = anthropic.AsyncAnthropic()


@dataclass
class PromptVariant:
    """A prompt variant"""
    name: str
    system_prompt: str
    model: str = "claude-sonnet-4-6"
    temperature: float = 0.0
    max_tokens: int = 4096


@dataclass
class TestCase:
    """A test case"""
    id: str
    input_text: str
    reference_answer: str | None = None
    metadata: dict | None = None


@dataclass
class ABTestResult:
    """A/B test results"""
    variant_a_scores: list[float]
    variant_b_scores: list[float]
    t_statistic: float
    p_value: float
    is_significant: bool
    winner: str | None
    effect_size: float


async def run_ab_test(
    variant_a: PromptVariant,
    variant_b: PromptVariant,
    test_cases: list[TestCase],
    rubric: EvalRubric,
    significance_level: float = 0.05,
) -> ABTestResult:
    """
    A/B test two prompt variants

    Args:
        variant_a: Variant A (baseline)
        variant_b: Variant B (challenger)
        test_cases: Test case set
        rubric: Evaluation rubric
        significance_level: Significance level (default 5%)
    """
    scores_a, scores_b = [], []
    for tc in test_cases:
        # Generate outputs from both variants
        output_a = await _generate(variant_a, tc.input_text)
        output_b = await _generate(variant_b, tc.input_text)

        # Evaluate with Claude-as-Judge
        eval_a = await evaluate_with_claude(
            tc.input_text, output_a, rubric, tc.reference_answer
        )
        eval_b = await evaluate_with_claude(
            tc.input_text, output_b, rubric, tc.reference_answer
        )
        scores_a.append(eval_a["weighted_total"] / eval_a["max_possible"])
        scores_b.append(eval_b["weighted_total"] / eval_b["max_possible"])

    # Paired t-test (same test cases compared)
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    is_significant = bool(p_value < significance_level)

    # Effect size: Cohen's d for paired samples, using the sample standard
    # deviation (ddof=1). Differences are A minus B, so a negative value
    # means B scored higher.
    diff = np.array(scores_a) - np.array(scores_b)
    sd_diff = np.std(diff, ddof=1)
    effect_size = float(np.mean(diff) / sd_diff) if sd_diff > 0 else 0.0

    winner = None
    if is_significant:
        winner = variant_a.name if np.mean(scores_a) > np.mean(scores_b) else variant_b.name

    return ABTestResult(
        variant_a_scores=scores_a,
        variant_b_scores=scores_b,
        t_statistic=float(t_stat),
        p_value=float(p_value),
        is_significant=is_significant,
        winner=winner,
        effect_size=effect_size,
    )


async def _generate(variant: PromptVariant, input_text: str) -> str:
    """Generate output from a prompt variant"""
    response = await client.messages.create(
        model=variant.model,
        max_tokens=variant.max_tokens,
        temperature=variant.temperature,
        system=variant.system_prompt,
        messages=[{"role": "user", "content": input_text}],
    )
    return response.content[0].text


# Expected output example (t and effect size are negative because
# differences are A minus B and B scored higher):
# ABTestResult(
#     variant_a_scores=[0.82, 0.78, 0.85, ...],
#     variant_b_scores=[0.88, 0.91, 0.87, ...],
#     t_statistic=-3.42,
#     p_value=0.002,
#     is_significant=True,
#     winner="variant_b_detailed_prompt",
#     effect_size=-0.65
# )

Test Case Design Strategy
The reliability of your A/B test depends heavily on test case quality and quantity.
- Minimum 30 cases: Treat 30 test cases as a floor, not a target; detecting small differences between prompts with adequate statistical power takes more
- Edge case coverage: Include ambiguous inputs, long inputs, and multilingual inputs beyond just standard cases
- Category balance: Distribute test cases evenly across use case categories
- Golden dataset: Prepare ideal reference answers written by humans
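The 30-case figure above is a rule of thumb. A rough way to see how many cases a given difference needs is the normal-approximation sample-size formula for a paired test, sketched below. `paired_sample_size` is a hypothetical helper; the effect size is the mean score difference divided by the standard deviation of the differences, and the approximation slightly underestimates what an exact t-based calculation would give.

```python
import math
from statistics import NormalDist


def paired_sample_size(
    effect_size: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate number of test cases for a two-sided paired test."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    # z-values for the two-sided significance level and the target power
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)


for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {paired_sample_size(d)} cases")
```

Under these assumptions a medium effect (0.5) needs about 32 cases, which is where the 30-case guideline roughly lands, while a small effect (0.2) needs close to 200.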
# Building a test case set
test_cases = [
    TestCase(
        id="support-001",
        input_text="My order hasn't arrived. Order number is ORD-12345.",
        reference_answer="I'm sorry for the inconvenience. For order ORD-12345...",
        metadata={"category": "shipping", "difficulty": "standard"},
    ),
    TestCase(
        id="support-002",
        input_text="I was charged twice on last month's bill",
        reference_answer=None,  # No reference → rubric evaluation only
        metadata={"category": "billing", "difficulty": "complex"},
    ),
    # ... 30+ cases
]

Regression Testing: Protecting Quality Across Prompt Changes
Why Regression Testing Matters
Prompt improvements and model version changes can cause unintended quality drops. Regression testing automatically verifies that your current version maintains quality at or above baseline levels.
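The regression check in the next section loads a baseline from `eval/baselines/current.json` and reads a `scores` list of `{"id", "score"}` entries, but the article does not show how that file is produced. A minimal sketch, assuming that layout: `build_baseline` and `save_baseline` are hypothetical helpers, and the `prompt_version` field is an added assumption for traceability.

```python
import json
from pathlib import Path


def build_baseline(scores: list[dict], prompt_version: str = "unversioned") -> dict:
    """Validate per-case scores and wrap them in the baseline layout."""
    ids = [s["id"] for s in scores]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate test case ids in baseline")
    for s in scores:
        # Scores are expected to be normalized (weighted_total / max_possible)
        if not 0.0 <= s["score"] <= 1.0:
            raise ValueError(f"score out of range for {s['id']}")
    return {
        "prompt_version": prompt_version,  # assumed field, not in the article
        "scores": [{"id": s["id"], "score": s["score"]} for s in scores],
    }


def save_baseline(payload: dict, path: Path) -> None:
    """Write the baseline where the regression test expects to find it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2))


baseline = build_baseline(
    [{"id": "support-001", "score": 0.82}, {"id": "support-002", "score": 0.78}],
    prompt_version="v1",
)
# save_baseline(baseline, Path("eval/baselines/current.json"))
```

Committing the baseline file alongside the prompt makes each baseline change visible in code review.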
CI/CD Integration
# regression_test.py: Prompt regression testing
import json
from pathlib import Path

from claude_judge import evaluate_with_claude
from evaluation_rubric import EvalRubric
from prompt_ab_test import PromptVariant, TestCase, _generate

BASELINE_PATH = Path("eval/baselines/current.json")
THRESHOLD = 0.95  # Must maintain 95% of baseline


async def run_regression_test(
    variant: PromptVariant,
    test_cases: list[TestCase],
    rubric: EvalRubric,
) -> dict:
    """
    Run regression test and return comparison against baseline
    """
    # Load current baseline
    baseline = json.loads(BASELINE_PATH.read_text())

    # Score the new variant
    new_scores = []
    for tc in test_cases:
        output = await _generate(variant, tc.input_text)
        eval_result = await evaluate_with_claude(
            tc.input_text, output, rubric
        )
        normalized = eval_result["weighted_total"] / eval_result["max_possible"]
        new_scores.append({"id": tc.id, "score": normalized})

    # Compare against baseline
    baseline_avg = sum(b["score"] for b in baseline["scores"]) / len(baseline["scores"])
    new_avg = sum(s["score"] for s in new_scores) / len(new_scores)
    ratio = new_avg / baseline_avg if baseline_avg > 0 else 0
    passed = ratio >= THRESHOLD
    regressions = _find_regressions(baseline["scores"], new_scores)

    return {
        "passed": passed,
        "baseline_avg": round(baseline_avg, 4),
        "new_avg": round(new_avg, 4),
        "ratio": round(ratio, 4),
        "threshold": THRESHOLD,
        "regressions": regressions,  # Cases with significant score drops
        "improvements": _find_improvements(baseline["scores"], new_scores),
    }


def _find_regressions(baseline_scores, new_scores, drop_threshold=0.15):
    """Detect cases with significant score drops from baseline"""
    regressions = []
    baseline_map = {s["id"]: s["score"] for s in baseline_scores}
    for ns in new_scores:
        if ns["id"] in baseline_map:
            drop = baseline_map[ns["id"]] - ns["score"]
            if drop > drop_threshold:
                regressions.append({
                    "id": ns["id"],
                    "baseline_score": baseline_map[ns["id"]],
                    "new_score": ns["score"],
                    "drop": round(drop, 4),
                })
    return regressions


def _find_improvements(baseline_scores, new_scores, gain_threshold=0.15):
    """
    Assumed mirror of _find_regressions (not defined in the original
    article): detect cases with significant score gains over baseline.
    """
    improvements = []
    baseline_map = {s["id"]: s["score"] for s in baseline_scores}
    for ns in new_scores:
        if ns["id"] in baseline_map:
            gain = ns["score"] - baseline_map[ns["id"]]
            if gain > gain_threshold:
                improvements.append({
                    "id": ns["id"],
                    "baseline_score": baseline_map[ns["id"]],
                    "new_score": ns["score"],
                    "gain": round(gain, 4),
                })
    return improvements


# Usage in GitHub Actions / CI:
# python -m pytest eval/test_regression.py -v
# → Failures block the PR

Expected output:
{
  "passed": true,
  "baseline_avg": 0.8234,
  "new_avg": 0.8512,
  "ratio": 1.0338,
  "threshold": 0.95,
  "regressions": [],
  "improvements": [
    {"id": "support-015", "baseline_score": 0.72, "new_score": 0.89, "gain": 0.17}
  ]
}

Production Quality Monitoring Dashboard
Continuous Monitoring in Production
Your evaluation pipeline shouldn't stop at development time. Here's a pattern for sampling production requests and monitoring quality continuously.
# quality_monitor.py: Production quality monitoring
import logging
import random
from datetime import datetime, timezone

from claude_judge import evaluate_with_claude
from evaluation_rubric import EvalRubric

logger = logging.getLogger(__name__)


class QualityMonitor:
    """Production quality monitoring"""

    def __init__(
        self,
        rubric: EvalRubric,
        sample_rate: float = 0.05,  # 5% sampling
        alert_threshold: float = 0.6,  # Alert below 60%
    ):
        self.rubric = rubric
        self.sample_rate = sample_rate
        self.alert_threshold = alert_threshold
        self._scores_buffer: list[dict] = []

    async def maybe_evaluate(
        self, input_text: str, output_text: str
    ) -> dict | None:
        """Evaluate based on sampling rate"""
        # >= so that sample_rate=0.0 never samples (random() can return 0.0)
        if random.random() >= self.sample_rate:
            return None  # Skip

        result = await evaluate_with_claude(
            input_text, output_text, self.rubric
        )
        normalized = result["weighted_total"] / result["max_possible"]
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "score": normalized,
            "details": result,
        }
        self._scores_buffer.append(record)

        # Alert check
        if normalized < self.alert_threshold:
            logger.warning(
                f"Quality alert: score {normalized:.2f} "
                f"(threshold: {self.alert_threshold})"
            )
            await self._send_alert(record)
        return record

    async def get_daily_report(self) -> dict:
        """Generate daily quality report"""
        if not self._scores_buffer:
            return {"status": "no_data"}

        # Sort once; percentiles below are simple index lookups
        scores = sorted(r["score"] for r in self._scores_buffer)
        return {
            "date": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
            "total_evaluated": len(scores),
            "avg_score": round(sum(scores) / len(scores), 4),
            "min_score": round(scores[0], 4),
            "max_score": round(scores[-1], 4),
            "below_threshold": sum(1 for s in scores if s < self.alert_threshold),
            "p50": round(scores[len(scores) // 2], 4),
            "p90": round(scores[int(len(scores) * 0.9)], 4),
        }

    async def _send_alert(self, record: dict):
        """Send quality alert (connect to Slack/PagerDuty/etc.)"""
        # Implementation: send to Slack webhook
        pass

TypeScript Implementation
Here's the same pattern using the TypeScript SDK.
// eval-pipeline.ts: TypeScript evaluation pipeline
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface EvalResult {
  scores: Record<string, { score: number; reasoning: string }>;
  weightedTotal: number;
  maxPossible: number;
  overallFeedback: string;
}

async function evaluateOutput(
  input: string,
  output: string,
  rubricPrompt: string,
  model: string = "claude-sonnet-4-6"
): Promise<EvalResult> {
  const response = await client.messages.create({
    model,
    max_tokens: 2000,
    // Spell out the keys so the reply matches the EvalResult interface
    system: `You are an expert LLM output evaluator. Respond only with JSON shaped as:
{"scores": {"<criterion>": {"score": number, "reasoning": string}}, "weightedTotal": number, "maxPossible": number, "overallFeedback": string}`,
    messages: [
      {
        role: "user",
        content: `${rubricPrompt}\n\n## Input\n${input}\n\n## Output\n${output}`,
      },
    ],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "";
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) throw new Error("JSON parse failed");
  return JSON.parse(jsonMatch[0]) as EvalResult;
}

// Evaluate many cases concurrently with standard Messages API requests.
// Note: this does not use the Batch API. The 50% cost reduction applies to
// the asynchronous Message Batches API, not to these parallel calls.
async function batchEvaluate(
  cases: Array<{ input: string; output: string }>,
  rubricPrompt: string
): Promise<EvalResult[]> {
  // Promise.all fires every request at once; for large case sets,
  // cap the concurrency to stay within rate limits.
  const results = await Promise.all(
    cases.map((c) => evaluateOutput(c.input, c.output, rubricPrompt))
  );
  return results;
}

Summary
In this article, we covered the design and implementation of LLM evaluation pipelines using the Claude API.
The Claude-as-Judge pattern, combined with rubric design and consistency techniques, automates quality assessment and, once checked against human labels, can approach human-evaluator agreement. The prompt A/B testing framework enables data-driven decision-making based on statistical significance, moving prompt management from "gut feel improvements" to systematic optimization. Regression testing and quality monitoring complete the picture for production quality assurance.
Together, these components form the foundation for systematically managing the quality of Claude API-powered applications. We recommend starting with a small test suite and gradually expanding coverage as your confidence grows.