When you hand routine work to a subagent, nine times out of ten it comes back clean—but that one stray result that broke a rule you agreed on is the real problem. As an indie developer running several sites (Dolice Labs), I split article drafting out to a separate agent, and violations like "a banned word slipped in" or "too few headings" tend to get merged unnoticed on the busiest days.
Reviewing everything by hand would catch them, but then delegation buys you nothing. So I used a SubagentStop hook to build a check that grades the work the instant a subagent finishes and sends it back for a redo when it falls short.
SubagentStop sits right on the subagent's "submit" button
Claude Code hooks fire on several events, and the one dedicated to subagents is SubagentStop. It is distinct from Stop, which reacts to the parent agent halting; SubagentStop fires only the moment a subagent finishes returning its response. That is exactly the inspection line that sits before you accept the delivery.
What matters is that this hook can steer behavior through its exit code and JSON output. Printing {"decision": "block", "reason": "..."} to stdout tells Claude Code not to stop the subagent but to keep going with reason injected as an instruction. In other words, you can feed back "this part is wrong, fix it." Reusing the grading result directly as the rejection note is the heart of this design.
Pin the criteria to a JSON rubric, not prose
A vague bar like "write a good article" is unsuited to machine grading. Break it down until each violation can be judged unambiguously. The rubric I use for the drafting subagent looks like this.
{
  "min_h2": 4,
  "max_chars": 12000,
  "min_chars": 2500,
  "banned_words": ["sensational", "the best", "blazing", "godlike", "complete guide", "definitive"],
  "forbidden_openers": ["In this article", "This article will", "How was that"],
  "require_code_block": true
}
The key is that every item is either countable or decidable by string match. Subjective judgments (is it interesting, is it readable) do not belong here. Keep only "rule violations a machine can reject with certainty," and leave creative quality to the human and the main prompt. Drawing that boundary is what keeps the hook from looping on false positives.
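Because the grader compares against rubric values directly, a misspelled numeric key such as min_h2 becomes undefined, and in JavaScript a comparison against undefined is always false, which would quietly switch that rule off. Below is a minimal sketch of a sanity check to run before grading. It assumes the rubric keys shown above, and validateRubric is a hypothetical helper, not part of the original scripts.

```javascript
// Hypothetical helper (an assumption, not from the original scripts):
// report structural problems in a parsed rubric object.
function validateRubric(rubric) {
  const problems = [];
  const isCount = v => Number.isInteger(v) && v >= 0;
  const isStringList = v =>
    Array.isArray(v) && v.every(s => typeof s === "string" && s.length > 0);

  for (const key of ["min_h2", "min_chars", "max_chars"])
    if (!isCount(rubric[key])) problems.push(`${key} must be a non-negative integer`);
  for (const key of ["banned_words", "forbidden_openers"])
    if (!isStringList(rubric[key])) problems.push(`${key} must be a list of non-empty strings`);
  if (typeof rubric.require_code_block !== "boolean")
    problems.push("require_code_block must be true or false");
  // A range that cannot be satisfied would make the hook reject every draft
  if (isCount(rubric.min_chars) && isCount(rubric.max_chars) &&
      rubric.min_chars > rubric.max_chars)
    problems.push("min_chars is larger than max_chars, so no text could pass");
  return problems;
}

// Usage: a well-formed rubric yields an empty list
console.log(validateRubric({
  min_h2: 4, max_chars: 12000, min_chars: 2500,
  banned_words: ["sensational"], forbidden_openers: ["In this article"],
  require_code_block: true,
}));
// Usage: a misspelled key (banned_word) is reported instead of being ignored
console.log(validateRubric({
  min_h2: 4, max_chars: 12000, min_chars: 2500,
  banned_word: ["sensational"], forbidden_openers: [],
  require_code_block: true,
}));
```

Whether a malformed rubric should block or pass is a design choice. Treating it as a loud configuration error, rather than a rejection sent to the subagent, keeps the redo loop reserved for problems the subagent can actually fix.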
The hook body: open the transcript from the stdin JSON
A SubagentStop hook receives session info as JSON on stdin. Inside it, transcript_path points to the subagent's conversation log (JSONL). The last assistant message is the deliverable, so we pull it out and pass it to the grader.
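#!/usr/bin/env bash
# .claude/hooks/grade-subagent.sh
set -euo pipefail
INPUT="$(cat)" # the hook receives JSON on stdin
# Pull transcript_path out of the hook input (empty string if absent)
TRANSCRIPT="$(printf '%s' "$INPUT" | node -e \
  'let d="";process.stdin.on("data",c=>d+=c).on("end",()=>{
    console.log(JSON.parse(d).transcript_path || "")})')"
if [ -z "$TRANSCRIPT" ] || [ ! -f "$TRANSCRIPT" ]; then
  exit 0 # nothing to grade, pass through
fi
# Hand the transcript to the grader; its stdout becomes the hook's output
node "$(dirname "$0")/grade.mjs" "$TRANSCRIPT"
The set -euo pipefail line makes the script abort with a non-zero status as soon as any step fails, so a crash in the JSON parsing or in the grader surfaces as a hook error instead of looking like a clean pass. Note that Claude Code treats exit codes other than 0 and 2 as non-blocking errors, so a crash is reported but does not by itself reject the deliverable. A broken inspection line that quietly approves everything is the scariest failure mode for a quality gate.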
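The "last assistant message is the deliverable" step can be isolated as a pure function, which makes it easy to try against a hand-written transcript before wiring up the hook. This is a minimal sketch: the event shape (a message object with role and content, where content is a string or an array of typed blocks) is an assumption based on how the grader in this article reads the log, and lastAssistantText is a hypothetical name.

```javascript
// Sketch: extract the last non-empty assistant text from a JSONL transcript string.
// Assumed event shape: { message: { role, content } }, where content is either a
// plain string or an array of blocks such as { type: "text", text: "..." }.
function lastAssistantText(jsonl) {
  const lines = jsonl.split("\n").filter(l => l.trim() !== "");
  for (let i = lines.length - 1; i >= 0; i--) {
    let ev;
    try {
      ev = JSON.parse(lines[i]);
    } catch {
      continue; // skip a truncated or malformed line instead of crashing
    }
    if (ev?.message?.role !== "assistant") continue;
    const content = ev.message.content;
    const text = typeof content === "string"
      ? content
      : (Array.isArray(content) ? content : [])
          .filter(b => b?.type === "text" && typeof b.text === "string")
          .map(b => b.text)
          .join("\n");
    if (text) return text; // assistant turns without text (tool calls) are skipped
  }
  return "";
}

// Usage with a hand-written transcript
const sample = [
  JSON.stringify({ message: { role: "user", content: "Write the draft" } }),
  JSON.stringify({ message: { role: "assistant", content: [{ type: "text", text: "## Draft" }] } }),
  JSON.stringify({ message: { role: "assistant", content: [{ type: "tool_use", name: "Read" }] } }),
  "{ truncated",
].join("\n");
console.log(lastAssistantText(sample)); // prints "## Draft"
```

Skipping unreadable lines is a judgment call. It keeps a half-written final line from crashing the hook, at the cost of possibly grading an older message, so a stricter setup might prefer to fail loudly instead.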
The grader: read the tail of the transcript and match the rubric
Keep the grader deterministic (same input, same result). It calls no external API, so it is fast, free, and unaffected by network outages.
// .claude/hooks/grade.mjs
import { readFileSync } from "node:fs";
// Load the rubric that sits next to this script
const RUBRIC = JSON.parse(
  readFileSync(new URL("./rubric.json", import.meta.url), "utf8")
);
const transcriptPath = process.argv[2];
const lines = readFileSync(transcriptPath, "utf8").trim().split("\n");
// Walk the JSONL from the end to find the last assistant text
let text = "";
for (let i = lines.length - 1; i >= 0; i--) {
  if (!lines[i].trim()) continue; // tolerate blank lines (and an empty file)
  const ev = JSON.parse(lines[i]);
  if (ev.message?.role !== "assistant") continue;
  // content may be a plain string or an array of typed blocks
  const content = ev.message.content ?? [];
  text = typeof content === "string"
    ? content
    : content.filter(b => b.type === "text").map(b => b.text).join("\n");
  if (text) break;
}
const fail = [];
// Count Markdown H2 lines ("## " at the start of a line)
const h2 = (text.match(/^##\s+/gm) ?? []).length;
if (h2 < RUBRIC.min_h2) fail.push(`${h2} H2 headings (need at least ${RUBRIC.min_h2})`);
// Spread to count code points rather than UTF-16 units
const chars = [...text].length;
if (chars < RUBRIC.min_chars) fail.push(`${chars} chars (need at least ${RUBRIC.min_chars})`);
if (chars > RUBRIC.max_chars) fail.push(`${chars} chars (over the ${RUBRIC.max_chars} cap)`);
// Both phrase lists are matched case-sensitively, anywhere in the text
for (const w of RUBRIC.banned_words)
  if (text.includes(w)) fail.push(`contains banned word "${w}"`);
for (const o of RUBRIC.forbidden_openers)
  if (text.includes(o)) fail.push(`contains boilerplate phrase "${o}"`);
if (RUBRIC.require_code_block && !text.includes("```"))
  fail.push("no code block");
if (fail.length === 0) process.exit(0); // pass: exit with no output
// fail: return a block decision as JSON; the reason becomes the redo instruction
console.log(JSON.stringify({
  decision: "block",
  reason:
    "The deliverable does not meet the quality rubric. Fix these and resubmit:\n" +
    fail.map(f => `- ${f}`).join("\n"),
}));
process.exit(0);
The trick is returning decision: block while still exiting with code 0. Exit code 2 also blocks, with stderr fed back as the message, but the JSON form keeps the decision and the reason together in one structured payload, and the reason reaches the subagent as its next instruction, so it knows what to fix. Making the rejection note a bullet list visibly improved the accuracy of the regeneration.
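Since the grader is deterministic, its checks can also be exercised without a transcript or a running hook. Below is a minimal sketch that restates the same rules as a pure function plus a small formatter. The names grade and toHookOutput and the tiny sample rubric are assumptions for illustration, not part of the original scripts.

```javascript
// Sketch: the rubric checks as a pure function (text in, list of failures out).
function grade(text, rubric) {
  const fence = "`".repeat(3); // the Markdown code-fence marker
  const fail = [];
  const h2 = (text.match(/^##\s+/gm) ?? []).length;
  if (h2 < rubric.min_h2) fail.push(`${h2} H2 headings (need at least ${rubric.min_h2})`);
  const chars = [...text].length;
  if (chars < rubric.min_chars) fail.push(`${chars} chars (need at least ${rubric.min_chars})`);
  if (chars > rubric.max_chars) fail.push(`${chars} chars (over the ${rubric.max_chars} cap)`);
  for (const w of rubric.banned_words)
    if (text.includes(w)) fail.push(`contains banned word "${w}"`);
  for (const o of rubric.forbidden_openers)
    if (text.includes(o)) fail.push(`contains boilerplate phrase "${o}"`);
  if (rubric.require_code_block && !text.includes(fence)) fail.push("no code block");
  return fail;
}

// Turn failures into the hook's stdout payload; null means "print nothing, allow the stop".
function toHookOutput(fail) {
  if (fail.length === 0) return null;
  return {
    decision: "block",
    reason:
      "The deliverable does not meet the quality rubric. Fix these and resubmit:\n" +
      fail.map(f => `- ${f}`).join("\n"),
  };
}

// Usage: a deliberately small rubric and a draft that breaks four rules
const demoRubric = {
  min_h2: 2, min_chars: 10, max_chars: 500,
  banned_words: ["blazing"], forbidden_openers: ["In this article"],
  require_code_block: true,
};
const draft = "## Intro\nIn this article we cover blazing hooks.\n";
console.log(toHookOutput(grade(draft, demoRubric)).reason);
```

Keeping the checks free of file and process access means a new rule can be verified against a few sample drafts before it ever gets the chance to reject real work.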
Always add a guard against infinite loops
The first thing this setup got wrong was a subagent retrying forever on a violation it could not fix. If the rubric is too strict to satisfy structurally, block-and-regenerate never stops.
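As a guard, count how many prior rejections appear in the transcript, and once the count reaches a limit, stop blocking and escalate to a human. The snippet belongs in grade.mjs after the checks and before the block decision is printed, and it assumes earlier rejections are recorded in the transcript.
// Count earlier rejections; the optional backslashes also match the JSON
// when it is stored escaped inside another JSON string
const BLOCK_RE = /\\?"decision\\?":\s*\\?"block\\?"/;
const blockCount = lines.filter(l => BLOCK_RE.test(l)).length;
if (fail.length && blockCount >= 2) {
  // if two redos still do not pass, stop and log it for a human
  console.error("[grade] retry limit. Escalate to manual review: " + fail.join(" / "));
  process.exit(0); // do NOT block here = break the loop
}
For a grading gate, "not blocking forever" turned out to matter more in practice than "blocking automatically." A gate that never stops is like an alert that never stops: eventually everyone ignores it.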
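A second guard can be built on the hook input itself. As I understand the Claude Code hook documentation, the stdin JSON for Stop and SubagentStop carries a stop_hook_active flag that is true when the agent is already continuing because an earlier stop hook blocked it. The sketch below assumes the raw stdin JSON is forwarded to the grader, which the wrapper script in this article does not do (it passes only the transcript path), and shouldSkipGrading is a hypothetical helper.

```javascript
// Sketch: decide whether to skip grading based on the hook input.
// Assumption: rawHookInput is the unmodified JSON string the hook got on stdin.
function shouldSkipGrading(rawHookInput) {
  let input;
  try {
    input = JSON.parse(rawHookInput);
  } catch {
    return false; // unreadable input: fall through to the normal grading path
  }
  // true only when a previous stop hook already forced a continuation
  return input?.stop_hook_active === true;
}

// Usage
console.log(shouldSkipGrading('{"stop_hook_active": true}'));  // true
console.log(shouldSkipGrading('{"stop_hook_active": false}')); // false
```

Skipping whenever the flag is set caps the loop at a single redo, which is stricter than the two-redo limit above. It can serve as a backstop in case the transcript count ever fails to detect earlier rejections.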
Three practical lessons from running it
First, I pulled the rubric out into a JSON file instead of hard-coding it. Criteria always change in operation. If you have to touch the script every time a new banned word appears, you will not keep it up.
Second, I kept the grader deterministic. The temptation to let a model grade is strong, but if the same deliverable passes or fails at random, the subagent cannot learn anything except "bad luck." Splitting the layers—mechanical violations in code, subjective quality in the main prompt and the human—proved far more stable.
Third, I always made the rejection reason concrete. Returning "3 H2 headings (need 4)" with numbers, instead of "insufficient quality," raises the odds the redo passes on the first try. The granularity of feedback directly sets how fast the self-correction loop converges.
As a next step, wire up a single SubagentStop hook with a minimal rubric of just banned_words and min_h2. Once the inspection line is running, adding criteria later is easy. I hope this helps anyone wrestling with the quality of delegated work.