Context Budgets for Nested Subagents: Designing Contracts So 5-Level Delegation Doesn't Lose Quality

Once subagents could nest, deeper delegation made summaries thinner and reruns more frequent. Here is how I rebuilt quality by adding four contracts between layers: token budgets, a handoff schema, failure isolation, and an independent grader.

Claude Code²⁰⁴ subagents⁶ architecture¹⁰ automation⁹⁸

✦ Premium Article

One night a scheduled run kept going forty minutes longer than usual. Tracing the logs, the subagent I had handed article generation to had spawned a grandchild for verification, which spawned a great-grandchild, and by the fourth level they were repeating the same consistency check in a loop. Nobody held the role of stopping.

As soon as subagent nesting shipped, I rebuilt the blog automation for the four sites I run as an indie developer — moving from a single fan-out of children to a tree of delegation. Going deep was genuinely convenient. But the moment I went deep, my quality metrics got worse. Summaries thinned out one level at a time, reruns climbed, and token spend swelled.

The cause was not the extra depth itself. It was the absence of any contract between the layers. These are the four contracts I added over three weeks at Dolice Labs — token budgets, a handoff schema, failure isolation, and an independent grader — and the numbers behind them.

"Can go deep" and "should go deep" are different questions

Tree delegation carries two kinds of decay that flat parallelism never had.

The first is summary decay. When a child returns to a parent, it summarizes its work log. The grandchild summarizes to the child, and the child summarizes that summary again on the way up. Information compresses at every hop, and stacked four deep, what reached the parent was a single line — "completed successfully." That is an accident, not a report.

The second is loss of control. The parent does not watch its grandchildren directly. When a middle layer runs away, all the parent receives is "in progress," with nothing to decide on. The forty-minute runaway above was exactly this.

So the design question is not "how many levels can I go," but "what contract does each layer sign with the ones above and below it." Depth is a result, not a goal.

Hand each layer a context budget: the token contract

The first thing I added was token-budget allocation. The parent holds the total budget, and every time it delegates, it explicitly passes the child the token amount it is allowed to spend. The child then carves a share for its own grandchildren.

Even allocation did not work. The leaves do the most concrete work yet run out of budget — a tapering problem. I decay the share with depth while guaranteeing a floor at the leaf.

def allocate_budget(total: int, depth: int, max_depth: int = 3,
                    leaf_floor: int = 8000) -> dict:
    """Split the parent's remaining `total` between this layer and its children.
    decay: tighter the deeper you go, capping any single layer's runaway.
    leaf_floor: guarantees the leaf can finish its concrete work.
    """
    if depth >= max_depth:
        return {"self": max(total, leaf_floor), "children_pool": 0}
 
    decay = 0.55 ** depth          # depth0=1.0, depth1=0.55, depth2=0.30
    self_share = int(total * 0.35 * decay)
    children_pool = total - self_share
 
    # If what's left for children dips under the leaf floor, do not go deeper
    if children_pool < leaf_floor:
        return {"self": total, "children_pool": 0, "force_leaf": True}
 
    return {"self": self_share, "children_pool": children_pool}

When force_leaf comes back, that layer stops delegating and finishes the work itself. This became "depth limiting by budget." Even in an environment that permits five levels, my article workflow hits the leaf floor at three, so it effectively caps at three. Letting the budget form the ceiling, rather than hardcoding a depth constant, flexes with how heavy the task is — which I find far easier to live with.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦A budget-allocation algorithm that hands each layer its own token allowance, plus code that degrades summaries when a layer runs out

✦Measured results from three-level delegation: rerun rate down from 23% to 7%, summary fidelity up from 0.62 to 0.88

✦A rubric for an independent grader placed at the leaf, so a layer never scores its own output

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

Protect summary fidelity with a handoff schema

My fix for summary decay was changing the return format from free text to a structured schema. Ask for "a nice summary" and the vocabulary abstracts away at every hop. Decide the fields up front and each layer only fills slots, so the facts that must survive do survive compression.

{
  "status": "ok | degraded | failed",
  "artifact_refs": ["content/articles/ja/claude-code/xxx.mdx"],
  "metrics": { "gate_violations": 0, "chars": 6200, "h2": 7 },
  "decisions": ["picked category claude-code", "excluded old slug to avoid dup"],
  "unresolved": ["EN code comments not yet translated"],
  "tokens_used": 14200
}

The crux is the unresolved field. Keep one field that must never come back empty and every layer gets into the habit of checking "is anything still open?" The small drops a child used to swallow — an untranslated English passage, an early sign of a count mismatch — now surface explicitly all the way to the parent.

Making status three-valued mattered in practice too. With a middle degraded state, the parent receives output that is "not a clean pass but not worth discarding," and can do a light repair instead of a full rerun. With only binary success and failure, I threw away near-miss output and reran everything, burning tokens.

Keep failures inside the layer: circuit breakers and degradation

So the runaway never happens again, each layer got a breaker that detects repetition of the same work. If a child repeats the same kind of delegation past a limit, that layer cuts it off and returns degraded.

class LayerBreaker:
    def __init__(self, max_retries=2, max_same_task=3):
        self.retries = 0
        self.task_counts = {}
 
    def allow(self, task_signature: str) -> bool:
        self.task_counts[task_signature] = self.task_counts.get(task_signature, 0) + 1
        if self.task_counts[task_signature] > 3:
            return False                 # repeated identical work -> cut here
        return True
 
    def on_child_fail(self) -> str:
        self.retries += 1
        if self.retries > 2:
            return "degrade"             # report degraded upward, do not re-delegate
        return "retry"

The point is to not throw raw failures up to the parent. A failure at layer N is first absorbed by layer N. If it cannot be absorbed, it is converted into a structured degraded or failed report. What the parent receives is a decidable state, not a raw exception.

In production the real win was choosing how to degrade. When a verification grandchild dies, the generated article itself is usually fine, so the child returns "article exists, verification incomplete." The parent can then rerun only the verification on a separate path. Having the option to not roll everything back drove the rerun rate down directly.

Put an independent grader at the leaf, so a layer never scores itself

The last contract is separating quality judgment from the generation lineage. Ask the layer that did the writing whether its own output passes and it almost always says yes — it is steeped in its own context. So, off the generation tree, I call an independent grader at the leaf given only the rubric.

GRADER_RUBRIC = {
    "intro_is_concrete": "opens from a concrete scene, not a template intro",
    "has_working_code": "has at least one complete, copy-paste-runnable example",
    "has_own_insight": "includes operational knowledge not in the official docs",
    "no_filler": "no padded summaries that restate the body",
    "premium_signals": "has three or more practical-value signals",
}
# The grader carries no generation context; it gets only the body and the rubric.
# Each item returns pass/fail with a one-line reason. Any fail -> status=degraded.

I check the independent grader against the mechanical article_gate.py I run afterward. Each week I count how often the grader said pass but the machine gate blocked, and reword the rubric on items where they disagree most. Keeping the role that stands in for human eyes out of the generation context was the whole point.

The numbers: how summary decay and rerun rate changed at three levels

I ran the same workflow for two weeks each, before and after the four contracts.

The rerun rate fell from 23% to 7%. Partial reruns enabled by degraded degradation were the largest factor; before, a verification failure alone meant rebuilding the whole article.

Summary fidelity — a hand-scored metric of whether the leaf's key facts survived in what the parent received — rose from 0.62 to 0.88. The decisions and unresolved fields of the structured schema carried it.

Token spend, on the other hand, rose 1.4x. That is the cost of calling the grader independently and of filling the schema faithfully. I will be honest about it: it roughly offset the savings from fewer reruns, so total cost was about flat — quality did not get cheaper, it got steadier.

Abnormal terminations from runaways hit zero in the two weeks after the breaker shipped. A forty-minute accident like the opening one is now physically impossible to cause, thanks to force_leaf and LayerBreaker working as two gates.

Order of adoption: where to add contracts first

I did not need to add all four at once. The order I recommend by effect-versus-cost is: failure isolation, then the handoff schema, then token budgets, then the independent grader.

I add the breaker first because stopping runaways kills the most expensive failure. The code is small and slots into the existing delegation without restructuring it. Adding the schema next stops summary decay and makes the parent's log readable again. That far alone improves operational stability a lot.

Token budgets start paying off once concurrency rises — running several sites at once, say. While concurrency is low the budget sits unused, so deferring it does little harm. The independent grader costs the most, so I add it once quality variance starts reaching paying readers.

Deep delegation, made into a goal in itself, always thins out. Place one contract at a time between the layers and the depth settles where it matches its own weight. Landing on that kind of design is what I see as my job right now.

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.