API & SDK / 2026-04-22 / Advanced

Implementing Progressive Delivery with the Claude Agent SDK: Canary, Feature Flags, and Automatic Rollback Patterns for Production

Production-grade patterns for safely rolling out AI agents built with the Claude Agent SDK. Combines canary traffic splitting, feature flags, and SLO-driven automatic rollback with runnable TypeScript/Hono implementation code.

I tweaked a prompt, went to bed, and woke up to a support agent whose wrong-answer rate had tripled. The eval set had been green. Production's long-tail requests were the problem — cases the eval fixture never captured. I've lived that incident more than once, and every time the same thought surfaces: traditional CI/CD is designed to deploy code, not to change how an agent behaves. The CI pipeline will happily merge a prompt tweak that rewrites how the model reasons about a whole class of inputs, then hand it to production as if it were a typo fix.

This article walks through a production-tested approach to applying progressive delivery to agents built with the Claude Agent SDK. We'll combine canary rollouts, feature flags, and SLO-driven automatic rollback into a pipeline that replaces "deploy and pray" with an observable, self-healing loop. Everything below is distilled from setups I actually run in production, and is structured so you can take the first real step today. By the end you should know exactly where to start in your own stack, which pieces you can defer, and what the common traps look like so you can skip them on the first attempt rather than the third.

Why traditional CI/CD falls short for agents

For most web apps, Blue/Green or a simple canary is enough. Agents have extra properties that break this assumption:

Probabilistic output: the same input can yield different responses, so A/B comparisons need distributional thinking — averages alone will mislead you.
Multi-dimensional failure modes: latency and 5xx aren't enough. You also care about hallucinations, wrong tool selection, tone drift — quality signals that only show up in production.
Eval sets don't cover the production distribution: eval suites skew toward "representative" cases, and the long tail is where incidents happen. "95% eval pass, 3x more complaints in prod" is depressingly common.
Tight coupling with external state: when tools create tickets or send email, rolling back code doesn't roll back side effects.
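
To make the first point concrete, here is a minimal sketch that compares two latency samples by percentile instead of by mean. The numbers are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not something from the SDK.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1]!;
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Made-up latencies in milliseconds. Both versions have the same mean (1000 ms).
const stableMs = [900, 950, 1000, 1000, 1000, 1000, 1000, 1050, 1050, 1050];
const canaryMs = [500, 500, 500, 500, 500, 500, 500, 500, 500, 5500];

const comparison = {
  mean: { stable: mean(stableMs), canary: mean(canaryMs) },
  p95: { stable: percentile(stableMs, 95), canary: percentile(canaryMs, 95) },
};
```

With these inputs the means are identical, while the canary's p95 is 5500 ms against 1050 ms for stable. That tail is exactly what an average hides.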

That's why agent rollouts need to validate "does this hold up under real production traffic?" in graduated steps, not just "does the code compile." Progressive delivery is the natural answer. Equally important is minimizing rollback cost: agent releases tend to change behavior rather than add features, so if you can't cut back to the old behavior in minutes, the business side quickly loses trust in shipping anything.

There's a cultural dimension to this too. When a product manager has watched one "minor prompt tweak" generate a bad weekend, they start blocking anything prompt-related out of self-defense. That blockage is far more expensive than the occasional bad release — it strangles the product's ability to improve at all. Progressive delivery gives the organization a trustable "we can try this safely" mechanism, and that's often the more valuable output than any specific incident avoided. The goal isn't zero incidents; the goal is making the cost of each incident small enough that iteration stays possible.

Architecture overview

The pipeline we'll build has six moving parts:

Feature flag service (LaunchDarkly, Unleash, or a homegrown KV) — decides which requests hit the new version.
Routing layer (Hono or Express) — splits traffic to old vs. new based on the flag.
Agent SDK layer (@anthropic-ai/agent-sdk) — holds system prompts, tool definitions, and model choice keyed by version.
Observability (OpenTelemetry + your favorite backend) — emits per-request metrics tagged with version.
SLO engine (custom or Prometheus Alertmanager) — rolls the flag back on breach.
Promotion controller (cron or Workers Cron Triggers) — advances the canary stage automatically when SLOs hold.

The critical property is that switching to a new version is a config change, not a code deploy. If you have to rebuild and redeploy to roll back, you've already lost the race against an incident. The difference between "switch in seconds" and "redeploy in minutes" is the difference between a quiet night and a scrambled on-call.

Another way to frame this: progressive delivery closes the loop of observe → decide → act. Observability without control still leaves 3 a.m. incidents on your plate. Control without observability risks rolling back the right version for the wrong reason. Real progressive delivery automates the whole loop.

One more architectural choice worth naming up front: the version granularity. I keep "version" coarse — one version per agent per release — rather than finer-grained per-prompt or per-tool toggles. Coarse versions mean one rollback button, one dashboard row, one integration per tool. Finer granularity sounds flexible, but when you are staring at an incident at 3 a.m. you want a single lever to pull, not a matrix of knobs. If you truly need experiment-style toggles, add them on top of the version system — don't replace it.

Step 0: Offline evaluation before the canary

Before any traffic hits the new version, run a minimal quality gate. The eval set doesn't need perfect coverage — it needs to include cases that have burned you before.

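// eval/offline.ts (run in CI)
import { runAgent } from "../agent/versions";
import Anthropic from "@anthropic-ai/sdk";
// Cases harvested from prior incidents. Importing JSON requires resolveJsonModule in tsconfig.
import testset from "./fixtures/regression.json";
 
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });
 
async function runEval(version: "v2026.04.15" | "v2026.04.22") {
  // An empty fixture would make the pass rate NaN, and NaN < 0.9 is false,
  // so the gate would pass silently. Fail loudly instead.
  if (testset.length === 0) throw new Error("Regression fixture is empty");
  const results: Array<{ id: string; ok: boolean; reason?: string }> = [];
  for (const tc of testset) {
    try {
      const out = await runAgent(version, tc.input, client);
      const ok = tc.must_contain.every((s) => out.text.includes(s));
      results.push({ id: tc.id, ok, reason: ok ? undefined : "missing_required_phrase" });
    } catch (err) {
      // A thrown error counts as a failed case instead of aborting the whole run.
      results.push({ id: tc.id, ok: false, reason: `error:${(err as Error).name}` });
    }
  }
  const pass = results.filter((r) => r.ok).length / results.length;
  if (pass < 0.9) throw new Error(`Eval pass rate ${(pass * 100).toFixed(1)}% < 90%`);
  return { pass, results };
}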

Intent: this gate exists to stop the new version from tripping on known landmines. The 90% threshold in the code is only a coarse floor. If you chase the pass rate itself, you'll overfit the eval set; the stricter and more useful build-failure condition is "any case that was green before must stay green." That simple regression lens keeps the bar honest without turning into a number-chasing game.

I recommend keeping this eval set small (dozens, not thousands) and curated by hand from real incidents. Large auto-generated eval sets feel reassuring but often measure the wrong thing. A handful of painful real-world cases, each labeled with the incident it came from, is more useful than ten thousand synthetic ones. Let the eval set grow by accretion: every postmortem adds one or two cases, so the gate gets tighter over time based on what has actually burned you.
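
The "green must stay green" rule can be sketched as a diff between two eval runs. This is a minimal illustration, assuming you persist per-case results from the last stable run; `CaseResult`, `findRegressions`, and the case IDs are hypothetical names.

```typescript
type CaseResult = { id: string; ok: boolean };

// Returns the IDs of cases that passed on the baseline but do not pass on the candidate.
// Cases that were already failing are ignored, and so are cases that only exist in the
// candidate run. A baseline-green case missing from the candidate counts as a regression.
function findRegressions(baseline: CaseResult[], candidate: CaseResult[]): string[] {
  const candidateById = new Map(candidate.map((r) => [r.id, r.ok] as const));
  return baseline
    .filter((b) => b.ok && candidateById.get(b.id) !== true)
    .map((b) => b.id);
}

// Both runs pass 2 of 3 cases, so the pass rate alone would not show the problem.
const baselineRun: CaseResult[] = [
  { id: "refund-policy", ok: true },
  { id: "sso-reset", ok: true },
  { id: "legacy-plan", ok: false },
];
const candidateRun: CaseResult[] = [
  { id: "refund-policy", ok: true },
  { id: "sso-reset", ok: false },
  { id: "legacy-plan", ok: true },
];

const regressions = findRegressions(baselineRun, candidateRun);
```

Here `regressions` contains only `sso-reset`, even though the aggregate pass rate is unchanged. Failing the build on a non-empty list is one way to apply the regression lens.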

Step 1: Treat versions as values, not code branches

Rewrite your agent invocation as a pure function that takes a version argument. This is the foundation for everything else.

// agent/versions.ts
import type Anthropic from "@anthropic-ai/sdk";
import { agent } from "@anthropic-ai/agent-sdk";
 
export const AGENT_VERSIONS = {
  "v2026.04.15": {
    model: "claude-sonnet-4-6" as const,
    systemPrompt:
      "You are a customer support agent who answers strictly from official docs. For anything uncertain, reply 'Let me check on that'.",
    tools: ["search_docs", "create_ticket"],
  },
  "v2026.04.22": {
    model: "claude-sonnet-4-6" as const,
    systemPrompt:
      "You are a customer support agent. Favor official documentation, reference prior interactions when useful, and respond with empathy. Reply 'Let me check on that' for anything unverified.",
    tools: ["search_docs", "search_history", "create_ticket"],
  },
} as const;
 
export type AgentVersion = keyof typeof AGENT_VERSIONS;
 
export async function runAgent(
  version: AgentVersion,
  input: string,
  client: Anthropic,
) {
  const cfg = AGENT_VERSIONS[version];
  const session = agent({
    client,
    model: cfg.model,
    system: cfg.systemPrompt,
    allowed_tools: cfg.tools,
  });
  // Let errors bubble up so they show up in the observability layer.
  return await session.run({ input });
}

Key property: with versions as string keys, you can pass the feature flag value straight through. The moment you branch with switch, this architecture collapses.
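
Passing the flag value straight through assumes it is actually a key of AGENT_VERSIONS. A plain cast would hide a typo in the flag store until the lookup returns undefined at runtime. The sketch below validates the string first; `isAgentVersion` and `toAgentVersion` are hypothetical helpers, and the trimmed-down AGENT_VERSIONS is a stand-in for the real table.

```typescript
// Stand-in for the real version table; only the keys matter for this sketch.
const AGENT_VERSIONS = {
  "v2026.04.15": { model: "claude-sonnet-4-6" },
  "v2026.04.22": { model: "claude-sonnet-4-6" },
} as const;

type AgentVersion = keyof typeof AGENT_VERSIONS;

// Type guard: true only for strings that are own keys of AGENT_VERSIONS.
function isAgentVersion(value: string): value is AgentVersion {
  return Object.prototype.hasOwnProperty.call(AGENT_VERSIONS, value);
}

// Falls back to a known-good version when the flag store holds an unknown value.
function toAgentVersion(value: string, fallback: AgentVersion): AgentVersion {
  return isAgentVersion(value) ? value : fallback;
}

const resolved = toAgentVersion("v2026.04.22", "v2026.04.15");
const guarded = toAgentVersion("v2026.04.2", "v2026.04.15"); // typo in the flag value
```

Whether to fall back silently or to fail the request is a policy choice. Either way, emitting a metric when the fallback fires keeps a bad flag value from going unnoticed.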

Second rule: never swallow errors. A quietly-swallowed error makes SLOs look artificially healthy, and the canary will keep getting promoted into a failure. Wire every error into spans and metrics.

What about versioning tool implementations? Tools (search_history, create_ticket, etc.) should themselves be versioned, but separately from the agent version. A practical pattern: each tool has its own version string, and the agent's AGENT_VERSIONS entry references tool versions by ID. That way a tool change can be rolled out independently of a prompt change, and you don't conflate "is the new prompt broken?" with "is the new tool broken?" Keeping concerns separable is the single biggest lever for debuggability in this architecture.
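
One possible shape for "agent versions reference tool versions by ID" is sketched below. Everything here is an assumption for illustration: the `name@version` ID format, the `TOOL_REGISTRY` and `AGENT_TOOLSETS` tables, and the stub implementations.

```typescript
type ToolImpl = (args: Record<string, unknown>) => Promise<unknown>;

// Hypothetical registry: each tool implementation is keyed by "name@version".
const TOOL_REGISTRY: Record<string, ToolImpl> = {
  "search_docs@1": async () => ({ hits: [] }),
  "search_history@1": async () => ({ tickets: [] }),
  "search_history@2": async () => ({ tickets: [], truncated: false }),
  "create_ticket@1": async () => ({ ticketId: "stub" }),
};

// Agent versions list tool versions by ID instead of embedding tool code.
const AGENT_TOOLSETS = {
  "v2026.04.15": ["search_docs@1", "create_ticket@1"],
  "v2026.04.22": ["search_docs@1", "search_history@2", "create_ticket@1"],
} as const;

// Resolves an agent version to a map from bare tool name to implementation.
function resolveTools(agentVersion: keyof typeof AGENT_TOOLSETS): Map<string, ToolImpl> {
  const tools = new Map<string, ToolImpl>();
  for (const id of AGENT_TOOLSETS[agentVersion]) {
    const impl = TOOL_REGISTRY[id];
    if (!impl) throw new Error(`Unknown tool version: ${id}`);
    const name = id.split("@")[0]!;
    tools.set(name, impl); // the agent only ever sees the bare tool name
  }
  return tools;
}

const canaryTools = resolveTools("v2026.04.22");
```

With this layout, rolling `search_history` from `@2` back to `@1` is a one-line change to the toolset entry, and the prompt stays untouched.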

Dependency pinning: pin the Anthropic SDK version precisely. Agent behavior can shift when the SDK updates its default message formatting, retry policy, or streaming semantics. Treating the SDK version as part of the agent version (include it in the version string, e.g. v2026.04.22-sdk-1.14.0) prevents mysterious regressions when someone bumps the SDK in an unrelated PR.

Step 2: Canary traffic via feature flags

Here's the layer that decides the version for each request. The example uses a homegrown flag store, but the shape maps cleanly onto LaunchDarkly or Unleash.

// flags/resolver.ts
import { createHash } from "crypto";
 
// Example KV payload:
// {
//   "agent.customer_support": {
//     "stable": "v2026.04.15",
//     "canary": "v2026.04.22",
//     "canary_percentage": 5,           // route 5% to canary
//     "canary_allowlist": ["tenant_a"]  // force these tenants onto canary
//   }
// }
 
export type FlagConfig = {
  stable: string;
  canary: string;
  canary_percentage: number;
  canary_allowlist: string[];
};
 
// Hash-based bucketing so the same user always lands in the same version.
function bucketOf(userId: string, flagKey: string): number {
  const h = createHash("sha256").update(`${flagKey}:${userId}`).digest();
  return h.readUInt32BE(0) % 100;
}
 
export function resolveVersion(
  flag: FlagConfig,
  ctx: { userId: string; tenantId: string; flagKey: string },
): { version: string; channel: "stable" | "canary" } {
  if (flag.canary_allowlist.includes(ctx.tenantId)) {
    return { version: flag.canary, channel: "canary" };
  }
  const bucket = bucketOf(ctx.userId, ctx.flagKey);
  if (bucket < flag.canary_percentage) {
    return { version: flag.canary, channel: "canary" };
  }
  return { version: flag.stable, channel: "stable" };
}

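Why hash, not random? With pure randomness, the same user drifts between versions and their experience breaks. Deterministic hashing guarantees "same user, same version," and because a user's bucket never changes, anyone already on canary stays on canary as the percentage grows. It also makes incident repro straightforward: given the flag config that was live at the time, you can recompute which version a specific user saw without searching logs.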

Ramp canary_percentage in stages: 1% → 5% → 25% → 50% → 100%, with SLO gates between each. The canary_allowlist exists for internal dogfooding tenants and opt-in beta customers — a real safety net, because you always want internal surfaces to see a new version before external customers do.
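
How long each stage needs to soak depends on traffic. The arithmetic below is a rough sketch: `hoursToCollect` is a hypothetical helper, and the 2,000 requests per hour and 500-sample minimum are illustrative numbers, not recommendations.

```typescript
// Hours needed for the canary to collect `minSamples` requests at a given percentage,
// assuming traffic is spread evenly. Real traffic is burstier, so treat this as a floor.
function hoursToCollect(
  requestsPerHour: number,
  canaryPercentage: number,
  minSamples: number,
): number {
  const canaryPerHour = requestsPerHour * (canaryPercentage / 100);
  if (canaryPerHour <= 0) return Infinity;
  return minSamples / canaryPerHour;
}

const RAMP_STAGES = [1, 5, 25, 50, 100];

// Illustrative traffic: 2,000 requests/hour, 500 canary samples wanted per stage.
const soakHours = RAMP_STAGES.map((pct) => ({
  pct,
  hours: hoursToCollect(2000, pct, 500),
}));
// 1% -> 25 h, 5% -> 5 h, 25% -> 1 h, 50% -> 0.5 h, 100% -> 0.25 h
```

The 1% stage dominates the schedule: at this traffic level it takes about a day to gather the same evidence the 25% stage collects in an hour.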

Step 3: Routing and observability hooks

This Hono handler is where all the pieces meet. The single most important discipline here is: every metric must carry the version tag. Without it, you cannot debug an incident by version.

// routes/support.ts
import { Hono } from "hono";
import Anthropic from "@anthropic-ai/sdk";
import { trace, metrics, SpanStatusCode } from "@opentelemetry/api";
import { runAgent, type AgentVersion } from "../agent/versions";
import { resolveVersion } from "../flags/resolver";
import { getFlag } from "../flags/store";
 
const app = new Hono();
const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });
const meter = metrics.getMeter("agent.support");
 
const latency = meter.createHistogram("agent.latency", { unit: "ms" });
const errors = meter.createCounter("agent.errors");
const thumbsDown = meter.createCounter("agent.thumbs_down");
 
app.post("/chat", async (c) => {
  const body = await c.req.json<{ userId: string; tenantId: string; input: string }>();
  const flag = await getFlag("agent.customer_support");
  const { version, channel } = resolveVersion(flag, {
    userId: body.userId,
    tenantId: body.tenantId,
    flagKey: "agent.customer_support",
  });
 
  const span = trace.getTracer("agent").startSpan("agent.run", {
    attributes: { "agent.version": version, "agent.channel": channel },
  });
 
  const start = performance.now();
  try {
    const result = await runAgent(version as AgentVersion, body.input, client);
    latency.record(performance.now() - start, { version, channel, status: "ok" });
    return c.json({ version, channel, result });
  } catch (err) {
    // You could fail over canary errors to stable, but it pollutes the SLO signal.
    // We attribute canary failures to canary intentionally.
    errors.add(1, { version, channel, error_type: (err as Error).name });
    latency.record(performance.now() - start, { version, channel, status: "error" });
    // Mark the span as failed too, so traces and metrics agree on what went wrong.
    span.recordException(err as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();
  }
});
 
app.post("/feedback", async (c) => {
  const { version, channel, rating } = await c.req.json();
  if (rating === "down") thumbsDown.add(1, { version, channel });
  return c.json({ ok: true });
});
 
export default app;

Gotcha: user feedback (thumbs down) is the highest-signal quality metric, but it arrives late. Split SLOs into fast signals (error rate, latency) used for immediate rollback, and slow signals (satisfaction) used to gate promotion.

Bonus: including version in the response body helps post-incident forensics. If you don't want to leak that to end users, gate it behind a debug flag.

Step 4: SLO-driven automatic rollback

This is where teams usually stop. Don't — without automated rollback, graduated rollout loses most of its value. Run two tiers:

Fast SLO (60s window): error rate and p95 latency — triggers rollback.
Slow SLO (15m window): satisfaction ratio and tool-selection correctness — freezes promotion and alerts.

// slo/watchdog.ts — runs every 60 seconds via cron
import { queryMetrics } from "./prometheus";
import { setFlag, getFlag } from "../flags/store";
import { notifySlack } from "./slack";
 
type Sample = { version: string; errors: number; total: number; p95_ms: number };
 
const IMMEDIATE_BUDGET = {
  error_rate_max: 0.02, // 2%
  p95_ms_max: 6000,
};
 
export async function checkAndRollback(flagKey: string) {
  const flag = await getFlag(flagKey);
  if (flag.canary_percentage === 0) return;
 
  const stable: Sample = await queryMetrics(flag.stable, "60s");
  const canary: Sample = await queryMetrics(flag.canary, "60s");
 
  if (canary.total < 50) return; // too few samples to trust
 
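  const canaryErrRate = canary.errors / canary.total;
  const stableErrRate = stable.total > 0 ? stable.errors / stable.total : 0;
 
  // If stable is breaching the same absolute budget, the problem is global
  // (for example a flapping dependency) and pulling canary traffic will not help.
  const stableTrusted = stable.total >= 50;
  const stableErrBreach = stableTrusted && stableErrRate > IMMEDIATE_BUDGET.error_rate_max;
  const stableLatBreach = stableTrusted && stable.p95_ms > IMMEDIATE_BUDGET.p95_ms_max;
 
  const tooManyErrors = canaryErrRate > IMMEDIATE_BUDGET.error_rate_max && !stableErrBreach;
  const tooSlow = canary.p95_ms > IMMEDIATE_BUDGET.p95_ms_max && !stableLatBreach;
  const regressedVsStable =
    canaryErrRate > stableErrRate * 2 && canary.total >= 200;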
 
  if (tooManyErrors || tooSlow || regressedVsStable) {
    await setFlag(flagKey, { ...flag, canary_percentage: 0 });
    await notifySlack(
      `🚨 ${flagKey} canary=${flag.canary} rolled back: ` +
        `errRate=${(canaryErrRate * 100).toFixed(2)}%, p95=${canary.p95_ms}ms`,
    );
  }
}

Why compare against stable? If a downstream dependency flaps, both stable and canary degrade. Absolute thresholds alone would roll back the canary during a global issue — and removing canary traffic in that moment makes nothing better. The relative check isolates "is the world bad?" from "is only canary bad?"

The regressedVsStable check uses "error rate more than 2x stable, with ≥200 samples." Calibrate that to your business risk: a mission-critical support agent might use 1.5x; an experimental assistant 3x. Once tuned, keep thresholds in config/slo.yaml so you're not redeploying to change a number.

A subtle but important detail: rollbacks should be idempotent. If your watchdog runs every 60 seconds and a regression persists for five minutes, you don't want five Slack alerts and five redundant flag writes. Gate the rollback action on canary_percentage > 0 and include deduplication at the notification layer. Operators lose trust in a watchdog that pages them every minute about the same thing; they trust one that fires once, clearly, with the relevant numbers attached.
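
Deduplication at the notification layer can be as small as a per-key cooldown. The sketch below is an in-memory version under stated assumptions: `createDedupedNotifier` is a hypothetical helper, the `send` callback stands in for your Slack client, and the state is lost on restart and not shared across watchdog replicas.

```typescript
// Suppresses repeat alerts for the same key within a cooldown window.
// `now` is injectable so the behavior can be tested without waiting.
function createDedupedNotifier(
  send: (message: string) => void,
  cooldownMs: number,
  now: () => number = Date.now,
) {
  const lastSentAt = new Map<string, number>();
  return function notify(key: string, message: string): boolean {
    const last = lastSentAt.get(key);
    const t = now();
    if (last !== undefined && t - last < cooldownMs) return false; // suppressed
    lastSentAt.set(key, t);
    send(message);
    return true;
  };
}

// Usage with a fake clock: three watchdog ticks, ten-minute cooldown.
let fakeClockMs = 0;
const sentMessages: string[] = [];
const notify = createDedupedNotifier((m) => { sentMessages.push(m); }, 10 * 60_000, () => fakeClockMs);

const first = notify("agent.customer_support:rollback", "rolled back");
fakeClockMs = 60_000;
const second = notify("agent.customer_support:rollback", "rolled back");
fakeClockMs = 10 * 60_000;
const third = notify("agent.customer_support:rollback", "rolled back");
```

The second call returns false and sends nothing, and the third goes through once the cooldown has elapsed. Keying on flag plus action keeps unrelated alerts from suppressing each other.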

Finally, treat the watchdog itself as a piece of production code. It deserves tests. A broken watchdog silently fails open — the worst mode — letting a bad canary promote all the way to 100%. I run a unit test that feeds synthetic metric payloads through checkAndRollback and asserts the expected flag mutations. It's a tiny investment with outsized payoff.
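
A test like that gets simpler if the decision is pulled out of checkAndRollback into a pure function. The sketch below assumes such a refactor; `decideRollback`, `Budget`, and the case table are hypothetical names. The rule that an absolute breach is ignored when stable is breaching too follows the reasoning in this section and is not a confirmed implementation.

```typescript
type Sample = { errors: number; total: number; p95_ms: number };
type Budget = {
  error_rate_max: number;
  p95_ms_max: number;
  relative_factor: number; // canary error rate must exceed stable's by this factor
  min_samples: number; // below this, never act
  min_relative_samples: number; // samples required before trusting the relative check
};
type Decision =
  | { rollback: false }
  | { rollback: true; reason: "errors" | "latency" | "regressed_vs_stable" };

const errorRate = (s: Sample): number => (s.total > 0 ? s.errors / s.total : 0);

function decideRollback(stable: Sample, canary: Sample, b: Budget): Decision {
  if (canary.total < b.min_samples) return { rollback: false };
  const canaryRate = errorRate(canary);
  const stableRate = errorRate(stable);
  // Absolute breaches only count when stable is within budget on the same metric.
  if (canaryRate > b.error_rate_max && stableRate <= b.error_rate_max) {
    return { rollback: true, reason: "errors" };
  }
  if (canary.p95_ms > b.p95_ms_max && stable.p95_ms <= b.p95_ms_max) {
    return { rollback: true, reason: "latency" };
  }
  if (canary.total >= b.min_relative_samples && canaryRate > stableRate * b.relative_factor) {
    return { rollback: true, reason: "regressed_vs_stable" };
  }
  return { rollback: false };
}

const budget: Budget = {
  error_rate_max: 0.02, p95_ms_max: 6000, relative_factor: 2,
  min_samples: 50, min_relative_samples: 200,
};
const outcome = (d: Decision): string => (d.rollback ? d.reason : "none");

// Synthetic payloads: healthy, degraded, global incident, and too-small canaries.
const cases: Array<{ name: string; stable: Sample; canary: Sample; expect: string }> = [
  { name: "healthy", stable: { errors: 2, total: 1000, p95_ms: 3000 }, canary: { errors: 0, total: 100, p95_ms: 3100 }, expect: "none" },
  { name: "canary errors", stable: { errors: 2, total: 1000, p95_ms: 3000 }, canary: { errors: 10, total: 100, p95_ms: 3000 }, expect: "errors" },
  { name: "global slowdown", stable: { errors: 2, total: 1000, p95_ms: 9000 }, canary: { errors: 0, total: 100, p95_ms: 9500 }, expect: "none" },
  { name: "too few samples", stable: { errors: 0, total: 1000, p95_ms: 3000 }, canary: { errors: 10, total: 10, p95_ms: 3000 }, expect: "none" },
  { name: "relative regression", stable: { errors: 5, total: 1000, p95_ms: 3000 }, canary: { errors: 3, total: 200, p95_ms: 3000 }, expect: "regressed_vs_stable" },
];
const failures = cases.filter((c) => outcome(decideRollback(c.stable, c.canary, budget)) !== c.expect);
```

An empty `failures` array means every synthetic payload produced the expected decision. Passing the thresholds in as data also makes it easy to load them from config later.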

Step 5: Progressive promotion

Add automation that advances the canary stage when SLOs hold. This turns a recurring manual check-in into a background process. The only rule: wait for slow signals before promoting, not just fast ones.

// slo/promoter.ts — runs every 15 minutes
import { queryMetrics, querySatisfaction } from "./prometheus";
import { setFlag, getFlag } from "../flags/store";
import { notifySlack } from "./slack";
 
const STAGES = [1, 5, 25, 50, 100]; // %
 
export async function tryPromote(flagKey: string) {
  const flag = await getFlag(flagKey);
  if (flag.canary_percentage === 0 || flag.canary_percentage === 100) return;
 
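  const canary = await queryMetrics(flag.canary, "15m");
  if (canary.total < 500) return; // too few samples to judge either way
 
  const sat = await querySatisfaction(flag.canary, "15m");
  if (sat.thumbs_down_rate > 0.05) {
    await notifySlack(
      `⏸ ${flagKey} promotion paused: thumbs_down=${sat.thumbs_down_rate.toFixed(3)}`,
    );
    return;
  }
 
  // Pick the next stage above the current percentage. indexOf() would return -1 for a
  // hand-set value such as 10, and STAGES[0] would then "promote" the canary down to 1%.
  const next = STAGES.find((stage) => stage > flag.canary_percentage);
  if (next === undefined) return;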
  await setFlag(flagKey, { ...flag, canary_percentage: next });
  await notifySlack(`✅ ${flagKey} promoted to ${next}%`);
}

Once 100% is reached, the next release cycle flips stable to the new version and clears the canary slot. After that you can delete the old version from the code entirely — and you should. Keeping old versions around out of sentimentality adds coupling to your next release, which is exactly what this machinery is designed to avoid.
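
The flip itself can be a small pure function over the flag config. This is a sketch: `finalizeRelease` is a hypothetical helper, and representing an "empty" canary slot by pointing canary at the new stable with 0% is an assumption, since the flag shape has no explicit empty state.

```typescript
type FlagConfig = {
  stable: string;
  canary: string;
  canary_percentage: number;
  canary_allowlist: string[];
};

// Once the canary has reached 100%, make it the new stable and reset the canary slot.
// Refuses to run earlier so a half-promoted canary is never declared stable by accident.
function finalizeRelease(flag: FlagConfig): FlagConfig {
  if (flag.canary_percentage !== 100) {
    throw new Error(`Cannot finalize at ${flag.canary_percentage}%: canary has not reached 100%`);
  }
  return {
    stable: flag.canary,
    canary: flag.canary, // placeholder until the next release sets a real canary
    canary_percentage: 0,
    canary_allowlist: [],
  };
}

const finalized = finalizeRelease({
  stable: "v2026.04.15",
  canary: "v2026.04.22",
  canary_percentage: 100,
  canary_allowlist: ["tenant_a"],
});
```

After this, `finalized.stable` is `v2026.04.22` and the percentage is back to 0, so both the watchdog and the promoter go idle until the next canary is configured.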

Step 6: Shadow execution for "breaking" changes

For risky changes — a new model, a redesigned tool — run the new version in shadow mode: serve users from stable while also invoking canary, and record only the diff.

// agent/shadow.ts
import { runAgent, type AgentVersion } from "./versions";
import type Anthropic from "@anthropic-ai/sdk";
import { recordDiff } from "./shadow_store";
 
export async function runWithShadow(
  primary: AgentVersion,
  shadow: AgentVersion,
  input: string,
  client: Anthropic,
) {
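  const primaryRun = runAgent(primary, input, client);
  const shadowRun = runAgent(shadow, input, client);
 
  // Compare in the background so the user-facing response never waits on the shadow
  // call or on recordDiff. On Workers-style runtimes, hand this promise to waitUntil()
  // so the runtime keeps it alive after the response is sent.
  void Promise.allSettled([primaryRun, shadowRun])
    .then(async ([primaryResult, shadowResult]) => {
      if (
        primaryResult.status === "fulfilled" &&
        shadowResult.status === "fulfilled"
      ) {
        await recordDiff({
          input,
          primary: primaryResult.value,
          shadow: shadowResult.value,
        });
      }
    })
    .catch(() => {
      // A failed diff write must not affect the primary response.
    });
 
  // Rejects if the primary run fails; shadow failures never reach the caller.
  return primaryRun;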
}

Watch out: shadowing effectively doubles your API cost. Sample it — 10% of traffic is often plenty. Tools with side effects need a dry_run: true path so the shadow version can't write to external systems. Building a dedicated dry-run hook into your tool definitions keeps this clean.
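
A dry-run hook can be expressed as a wrapper around each side-effecting tool. The sketch below is one way to do it, not an SDK feature: `ToolContext`, `withDryRun`, and the ticket example are hypothetical, and how the `dryRun` flag reaches your tools depends on how you wire tool execution.

```typescript
type ToolContext = { dryRun: boolean };
type Tool<A, R> = (args: A, ctx: ToolContext) => Promise<R>;

// Wraps a side-effecting tool so that dry runs return a simulated result
// and never call the real implementation.
function withDryRun<A, R>(real: Tool<A, R>, simulated: (args: A) => R): Tool<A, R> {
  return async (args, ctx) => (ctx.dryRun ? simulated(args) : real(args, ctx));
}

// Example: the "real" tool records writes in an array that stands in for the ticket system.
const ticketSystemWrites: string[] = [];
const createTicket = withDryRun<{ title: string }, { ticketId: string }>(
  async (args) => {
    ticketSystemWrites.push(args.title);
    return { ticketId: `T-${ticketSystemWrites.length}` };
  },
  () => ({ ticketId: "dry-run" }),
);
```

Calling `createTicket({ title: "x" }, { dryRun: true })` resolves to the simulated ticket and leaves `ticketSystemWrites` empty. The shadow path would always pass `dryRun: true`, while the primary path passes `false`.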

Step 7: Dashboards and team rituals

Automation is necessary, not sufficient. Pair it with three dashboards and a weekly retro:

Release Status — which version is at what percentage, time until next promotion, count of recent SLO breaches.
Quality Diff — stable vs. canary error rate, p95, thumbs-down, tool-use distribution.
Cost Diff — per-request token counts and dollar cost per version, so a model swap doesn't blow the budget unnoticed.

In the weekly retro, pick one success or failure from the dashboards and do a focused post-mortem. Over time this tunes your eval set, thresholds, and stage durations to your actual product. The pattern to preserve: humans make decisions from numbers; machines handle routing and rollback. That division of labor is what keeps the system humane at 3 a.m.

Testing the rollout machinery itself

A rollout system you can't trust is worse than no rollout system — it hides failures under a veneer of automation. A short checklist I use before letting progressive delivery make unattended production decisions:

Unit tests for resolveVersion: feed in synthetic flag configs and verify both deterministic bucketing and allowlist precedence. A bug here means users bounce between versions unpredictably.
Integration test for the watchdog: feed in synthetic metric payloads representing healthy, degraded, and ambiguous canaries; assert that only degraded states trigger a rollback call.
Chaos test for the flag store: kill the KV and verify the agent still serves requests (from local cache) rather than erroring. An agent that dies during a KV outage is a progressive-delivery regression.
Rollback drill: once a month, manually set a bad canary config and watch the watchdog roll it back. Automation you never exercise is automation you cannot rely on.

These tests don't need to be elaborate. What matters is that you have touched each safety mechanism at least once and seen it work. Confidence compounds when you've actually observed the system self-heal, not merely written the code for it.
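
The resolver checks from the list above can be sketched without a test framework. The resolver logic is repeated from Step 2 so the block runs on its own; `resolveChannel` and `checkResolver` are hypothetical names, and the 22% to 28% tolerance band for a 25% canary is an arbitrary choice.

```typescript
import { createHash } from "node:crypto";

type FlagConfig = { stable: string; canary: string; canary_percentage: number; canary_allowlist: string[] };
type Ctx = { userId: string; tenantId: string; flagKey: string };

// Same bucketing and precedence as the Step 2 resolver, returning only the channel.
function resolveChannel(flag: FlagConfig, ctx: Ctx): "stable" | "canary" {
  if (flag.canary_allowlist.includes(ctx.tenantId)) return "canary";
  const digest = createHash("sha256").update(`${ctx.flagKey}:${ctx.userId}`).digest();
  return digest.readUInt32BE(0) % 100 < flag.canary_percentage ? "canary" : "stable";
}

// Returns a list of failed checks; an empty list means the resolver behaved as expected.
function checkResolver(): string[] {
  const failed: string[] = [];
  const base: FlagConfig = {
    stable: "v2026.04.15", canary: "v2026.04.22", canary_percentage: 25, canary_allowlist: ["tenant_a"],
  };
  const ctx = (userId: string, tenantId = "tenant_z"): Ctx =>
    ({ userId, tenantId, flagKey: "agent.customer_support" });

  let canaryCount = 0;
  for (let i = 0; i < 10000; i++) {
    const user = `user_${i}`;
    const channel = resolveChannel(base, ctx(user));
    // Determinism: resolving the same user twice gives the same channel.
    if (channel !== resolveChannel(base, ctx(user))) failed.push(`non-deterministic: ${user}`);
    // Boundaries: 0% sends nobody to canary, 100% sends everybody.
    if (resolveChannel({ ...base, canary_percentage: 0 }, ctx(user)) !== "stable") failed.push(`0% leaked: ${user}`);
    if (resolveChannel({ ...base, canary_percentage: 100 }, ctx(user)) !== "canary") failed.push(`100% missed: ${user}`);
    if (channel === "canary") canaryCount++;
  }
  // Rough distribution: a 25% canary should receive roughly a quarter of users.
  const share = canaryCount / 10000;
  if (share < 0.22 || share > 0.28) failed.push(`canary share out of range: ${share}`);
  // Allowlist precedence: an allowlisted tenant is on canary even at 0%.
  if (resolveChannel({ ...base, canary_percentage: 0 }, ctx("user_1", "tenant_a")) !== "canary") {
    failed.push("allowlist ignored at 0%");
  }
  return failed;
}
```

In CI this would run as a regular unit test that fails when `checkResolver()` returns a non-empty list.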

Cost management across versions

Progressive delivery also helps you catch cost regressions. A new prompt might be 20% longer. A new tool might cause an extra model call per turn. A model swap from Haiku to Sonnet might multiply cost per request by 5x without you noticing. All of these show up in per-version token counts — provided you're recording them.

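// routes/support.ts: instruments (names are illustrative), declared next to the other meters
const tokensIn = meter.createHistogram("agent.tokens.input");
const tokensOut = meter.createHistogram("agent.tokens.output");
 
// Inside the /chat handler, after runAgent returns:
const usage = result.usage; // { input_tokens, output_tokens, cache_*: ... }
tokensIn.record(usage.input_tokens, { version, channel });
tokensOut.record(usage.output_tokens, { version, channel });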

From there, compute dollar cost per version in your observability backend using the model's list price, and surface it on the Cost Diff dashboard. The single number I track most closely is cost per resolved conversation: raw token counts don't matter if the new version resolves issues in two turns instead of four. A version that doubles token usage per turn but halves the number of turns costs about the same per conversation and resolves it faster, which is still a win.

The progressive delivery machinery is the perfect place to catch a cost regression early. If you notice cost has jumped 3x in canary, you can evaluate whether the quality gain justifies it before rolling out to 100%. Without staged rollout, cost regressions tend to be discovered via the monthly bill — which is a very expensive way to learn.
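
Cost per resolved conversation is simple arithmetic once the per-version token counts exist. In the sketch below the prices and usage figures are placeholders, not real list prices or measurements, and `costPerResolved` is a hypothetical helper.

```typescript
type VersionUsage = { inputTokens: number; outputTokens: number; resolvedConversations: number };
type Price = { inputPerMTok: number; outputPerMTok: number }; // USD per million tokens

// Total dollar cost for the period divided by conversations that ended resolved.
function costPerResolved(u: VersionUsage, p: Price): number {
  const dollars =
    (u.inputTokens / 1e6) * p.inputPerMTok + (u.outputTokens / 1e6) * p.outputPerMTok;
  return u.resolvedConversations > 0 ? dollars / u.resolvedConversations : Infinity;
}

// Placeholder price and made-up usage; substitute your model's current list price.
const price: Price = { inputPerMTok: 3, outputPerMTok: 15 };
const stableUsage: VersionUsage = { inputTokens: 40e6, outputTokens: 4e6, resolvedConversations: 10000 };
const canaryUsage: VersionUsage = { inputTokens: 3e6, outputTokens: 0.3e6, resolvedConversations: 1000 };

const costDiff = {
  stable: costPerResolved(stableUsage, price), // (120 + 60) / 10000 = $0.018
  canary: costPerResolved(canaryUsage, price), // (9 + 4.5) / 1000 = $0.0135
};
```

Dividing by resolved conversations instead of requests is what lets a version with more tokens per turn still come out cheaper when it needs fewer turns.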

Multi-tenant considerations

B2B SaaS products often have per-tenant contract terms that demand tenant-scoped control. Extend the flag shape with a tenant_overrides field so you can hold specific tenants back — or push them forward — independently of the global percentage.

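// flags/resolver.ts (add at the top of resolveVersion, before the allowlist check,
// so a pinned tenant is never swept into the canary by the allowlist or the percentage)
// FlagConfig gains: tenant_overrides?: Record<string, { version: string; channel: "stable" | "canary" }>
if (flag.tenant_overrides?.[ctx.tenantId]) {
  const override = flag.tenant_overrides[ctx.tenantId];
  return { version: override.version, channel: override.channel };
}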

An enterprise contract clause like "we'll be notified before any new version" can be honored technically by pinning the tenant to the previous stable via tenant_overrides. Sales teams relax noticeably once they can see that contractual promises are enforced by code.

Tagging metrics with tenant_id in addition to version also makes it instant to answer "is only this one tenant regressing?" That diagnostic is worth a lot during escalations.

Common pitfalls

Traps I've personally hit:

Prompt cache stops hitting: changing the system prompt invalidates the cache; cost and latency both spike temporarily in canary. Either loosen SLO thresholds briefly or accept a warm-up burst by design.
Tool side effects don't roll back: adding tools that create tickets or send email leaves residue even after rollback. Gate new tools behind their own flag and go through a dry-run stage before production-enabling them.
Eval/prod distribution drift: the long-tail requests that never show up in eval can dominate production. Bake "harvest anonymized prod samples into the eval set" into every release cycle.
Canary too small to be statistically significant: a 1% canary can take days to surface rare bugs. For low-risk changes (minor prompt tweaks) start at 25% — match the starting percentage to the change's risk profile.
Flag store as a single point of failure: hitting KV on every request means a KV outage stops the agent. Always add a local cache (≈30s TTL) with a last-known-good fallback.
Unreadable version IDs: v1234 is useless during an incident. Name versions like v2026.04.22-shorten-tone — date plus a change summary — and debugging time drops noticeably.
Forgetting to clean up canary data: metrics, dashboards, and eval fixtures tend to accumulate per-version entries that nobody ever deletes. Add a scheduled job that prunes metric series older than N weeks. A dashboard filled with dead versions eventually stops getting looked at, which is the worst possible failure mode for this system.
Over-automating for small teams: if you are a team of two, you don't need a separate promoter cron job — manual promotion after checking the dashboard is fine. What you must automate is rollback. Rollback is where seconds matter; promotion is where you can afford to be deliberate.
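
The local cache with a last-known-good fallback, from the flag-store pitfall above, can be sketched as a wrapper around whatever function reads the KV. `createCachedFlag` and its parameters are hypothetical names, and the sketch does not coalesce concurrent refreshes, so several requests may hit the KV at once when the TTL expires.

```typescript
type Loader<T> = () => Promise<T>;

// Wraps a remote flag lookup with a TTL cache and a last-known-good fallback.
// `now` is injectable so expiry can be tested without waiting.
function createCachedFlag<T>(load: Loader<T>, ttlMs: number, now: () => number = Date.now) {
  let cached: { value: T; fetchedAt: number } | undefined;
  return async function get(): Promise<T> {
    if (cached && now() - cached.fetchedAt < ttlMs) return cached.value; // still fresh
    try {
      const value = await load();
      cached = { value, fetchedAt: now() };
      return value;
    } catch (err) {
      if (cached) return cached.value; // stale, but last-known-good
      throw err; // nothing cached yet: surface the outage
    }
  };
}
```

One consequence to keep in mind: during a KV outage a rollback written by the watchdog will not reach instances that are serving a stale value, so it is worth alerting on how long the fallback has been in use.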

A real production schedule

Here's the cadence I run for a production support agent:

Mon to Wed: ship a new canary at 5% on Monday and let it soak through Wednesday night while SLOs observe.
Thu: promote to 25% if SLOs hold; otherwise roll back and do a root-cause analysis.
Weekend: hold at 50% to cover off-hours traffic and non-JP regions.
Next Monday: promote to 100%, flip stable, delete old version code.

The result: I no longer brace myself before deploying. Rollbacks are automatic, promotions are data-driven, and humans focus on the weekly retro. Once the first agent is wired up, adding others is mostly copy-and-adapt.

Multi-language products should slice SLOs by locale too. I've seen releases where everything looked fine in aggregate while Japanese quality had quietly collapsed — adding a locale tag to metrics makes that instantly visible.
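
A small made-up example shows how an aggregate can hide a per-locale collapse. The feedback counts are invented, `thumbsDownByLocale` is a hypothetical helper, and the 5% figure mirrors the promoter's thumbs-down threshold.

```typescript
type Feedback = { locale: string; down: boolean };
type Rate = { total: number; down: number; rate: number };

// Groups feedback events by locale and computes the thumbs-down rate for each.
function thumbsDownByLocale(events: Feedback[]): Map<string, Rate> {
  const out = new Map<string, Rate>();
  for (const e of events) {
    const r = out.get(e.locale) ?? { total: 0, down: 0, rate: 0 };
    r.total += 1;
    if (e.down) r.down += 1;
    r.rate = r.down / r.total;
    out.set(e.locale, r);
  }
  return out;
}

// Invented feedback: 900 English ratings with 18 thumbs-down, 100 Japanese with 22.
const feedbackEvents: Feedback[] = [
  ...Array.from({ length: 900 }, (_, i) => ({ locale: "en", down: i < 18 })),
  ...Array.from({ length: 100 }, (_, i) => ({ locale: "ja", down: i < 22 })),
];

const overallRate = feedbackEvents.filter((e) => e.down).length / feedbackEvents.length;
const perLocale = thumbsDownByLocale(feedbackEvents);
```

The overall rate is 4%, which would pass a 5% gate, while the Japanese slice sits at 22%. Gating promotion on the worst locale with enough samples, and not only on the aggregate, is one way to act on this.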

One more practical note on cadence: avoid Friday late-night promotions to 100%. Even with automated rollback, debugging a weekend incident is measurably harder than debugging one during business hours. I reserve "full-send" promotions for Monday or Tuesday mornings, giving the new version the full work week to get real traffic before the quieter weekend window. This isn't a rule the system enforces; it's a humane operating norm that keeps the humans running it sustainable.

For anyone operating at larger scale, consider regional staging: promote to one AWS region or Cloudflare colo group at a time. Progressive delivery composes nicely with geographic rollout — you treat each region as its own flag percentage, and an issue surfaces in one region rather than globally. The code changes are minimal once you've built the version-as-value foundation; you're just adding another dimension to the flag resolver.

A walked-through rollout scenario

To make this concrete, here's a release I ran recently and what happened at each stage. The change: swap the support agent's system prompt to include a "reference past tickets from this customer before answering" instruction, and enable a new search_history tool.

T+0 (Monday morning): Offline eval ran in CI against the regression fixture. 58/60 cases passed, 2 borderline cases recovered with a small prompt tweak. Merge.
T+10 minutes: Canary rolled out at 1%. The watchdog immediately saw elevated p95 latency on canary, but stable was elevated too (the Anthropic API was having a minor slow period). Because the slowdown was global rather than canary-specific, the comparison against stable correctly held off on a rollback.
T+2 hours: Promoter cron lifted canary to 5%. Thumbs-down rate at 2.1%, still under the 5% threshold. Cost per request up 18% (the tool call added an extra round trip), within an acceptable range for the quality improvement.
T+5 hours: Canary saw a spike of tool_error on search_history — a malformed query format for one tenant. The tenant was already in canary_allowlist, so the impact was contained. Fixed the tool, bumped the version to v2026.04.22-search-history-fix, redeployed.
T+1 day: Canary at 25%, holding. Moved to 50% on Thursday.
T+7 days (the following Monday): Promoted to 100%, flipped stable, removed the old version from the code.

None of this required my being awake at unusual hours. The watchdog handled the one rollback candidate correctly by comparing to stable. The promoter advanced only while thumbs-down stayed under the threshold. The postmortem entry for the tool bug joined the regression fixture. This is what progressive delivery looks like when it works: mostly boring, occasionally informative, rarely scary.

Common misconceptions

A few things I frequently hear that are worth addressing directly:

"We have evals, we don't need canary": Evals test known cases. Canary tests unknown ones. They're complementary, not substitutes. The eval gate stops obvious regressions; the canary catches the ones you didn't think to test.
"Feature flags add complexity we don't need": At small scale, yes — you can skip the flag store and hard-code the version. What you can't skip is the version-as-value discipline in the agent layer itself. Keep the flag store for when you're ready, but write the agent code so it's trivial to add later.
"Rollback is the same as redeploy": Only if your deploy takes seconds. For most teams, rollback via flag is 10-100x faster than rollback via CI. During an active incident that difference is enormous.
"We'll add this later when we hit scale": Adding progressive delivery retroactively is painful because your agent code is full of switches and your metrics don't carry version tags. Adding it early is cheap because you're building around the primitives from day one.
"We can manually approve each promotion": You can, until you're running five agents and each one needs daily attention. Automation isn't about avoiding work — it's about preserving your capacity to make the decisions that actually require judgment. If your promoter cron frees you from thinking about obvious-green releases, you have more bandwidth left for the ambiguous ones.

Your next step

Start by rewriting one agent's runAgent() to accept a version argument. The feature flag can come later. Lifting the version out of an in-code switch and into config is the foundation everything else rests on. This change alone usually takes under 30 minutes.

Once that foundation exists, canary routing, SLO watchdogs, and automatic rollback can be added incrementally. You don't need to build it all at once. The day your agent can switch versions by config is the day your relationship to production deploys starts to change. Most teams I've worked with find that after a week or two of running the version-as-value foundation alongside their existing flow, they start wanting to add the flag store, because they can feel how much easier it would make the next experiment. That's the right order: let the need pull the automation in, rather than building all the machinery speculatively.

If the barrier is organizational buy-in, start with an internal-only tool. Letting your teammates experience "what I relied on yesterday might behave differently today" makes the case for consumer-facing adoption far faster than any slide deck.

Another pragmatic approach: ship the machinery as a no-op first. Wire up the version resolver, the metrics, and the watchdog, but leave canary_percentage at 0 until you've watched the pipeline run for a week against stable traffic. This catches integration bugs — misconfigured Prometheus queries, missing metric tags, flag-store authentication issues — before any real user is exposed to the canary. The same approach works for the shadow-execution layer: deploy it with sampling at 0, turn it to 1% after a week of silent operation, and ramp from there.

If you want to go deeper on observability or multi-agent design, pair this with the Claude API OpenTelemetry AI observability guide and the Claude Agent SDK production multi-agent system guide. They cover the signals and orchestration layer this pipeline depends on.
