API & SDK · 2026-06-19 · Advanced

When Your Claude × Playwright Browser Agent Fails While Reporting Success — Verifying Actions and Catching UI Drift

A vision-driven Claude + Playwright browser agent fails quietly: it reports success while nothing actually changed. Here is how to stop trusting self-reports, verify each action against the goal, and detect UI drift before it breaks you.


It reported "added to cart" — but the cart was empty

I opened the overnight logs of a collection agent one morning and found a wall of completed: true. Yet only half the expected output existed. Tracing the steps, the agent confidently reported on one site that it had "added the product to the cart and reached checkout" — while the cart stayed empty the whole way through. The click succeeded in terms of coordinates, the screenshot showed something that looked like a pressed "Add" button, and still, the one thing that mattered — the state — had not changed at all.

This is a different kind of failure from the old, familiar broken Playwright selector. An imperative script, when it breaks, throws and stops, so at least you find out. The nasty part of a vision agent — one that looks at a screenshot and lets Claude decide the next move — is that it keeps going while believing it succeeded. No exception, logs stay green, only the result is wrong. In production, this quiet failure is the one to fear most.

This article starts from the premise that you should never trust the agent's self-report. It lays out how to verify every action against its goal and how to notice UI changes before they silently break you, with working code throughout. It is aimed at engineers who have used Playwright and the Claude API in TypeScript.

Why vision agents end up "thinking they did it"

An imperative Playwright script and an agent that hands screenshots to Claude fail in fundamentally different ways.

Aspect	Imperative script	Vision agent
On UI change	Selector mismatch → exception → stop	Reasons from appearance and continues
Failure detection	Immediate, via exception	Quietly advances into a wrong state
Basis for "success"	Explicit waitForSelector	The model's self-reported "achieved"
The danger	Stops, but you notice	Doesn't stop, but is wrong

In my experience, false successes arise through roughly three paths. First, the click lands at the right coordinates, but the button was disabled or covered by a modal, so no side effect occurred. (A coordinate click such as page.mouse.click skips the actionability checks that a selector-based click performs, so nothing throws.) Second, the screenshot is captured before an async update lands, so the agent judges from a stale screen and assumes the change went through. Third, the model optimistically declares the goal met: Claude is helpful, and when it sees a screen close to the goal, it tends to lean toward "I think I achieved it."

The shared root is one thing: treating "I performed the action" as identical to "the state changed as intended." Separating those two is the starting point of the whole design.


WHAT YOU'LL LEARN

✦Why vision agents produce silent false successes, and how to design a verification gate that ignores the model's self-report

✦Goal-bound assertions and DOM-grounded double-checking, with working TypeScript code

✦Catching UI drift early with canaries, then metering the false-success rate and halting with a circuit breaker


Stop trusting self-reports — split action from verification

The flow to fix is the one where the model says goalAchieved: true and you believe it. Limit the agent's reply to a proposal ("I'd like to do this next"), and let a separate verifier decide whether that action actually took effect, based on observable facts rather than the model's word.

// The model proposes; a verifier decides. Keep the roles distinct.
interface ActionProposal {
  type: 'click' | 'fill' | 'navigate';
  selector?: string;
  value?: string;
  url?: string;
  // The post-condition the model declares this action should produce
  expectedEffect: string;
}
 
interface VerifiedStep {
  proposal: ActionProposal;
  reasoning: string;
  verified: boolean;       // the verifier's verdict, not the model's self-report
  evidence: string;        // the observed fact behind the verdict
}

The key is making the model declare expectedEffect itself. Not just "press the add button," but "after pressing, the cart count should increase by 1." With that prediction on record, the verifier can check whether it came true. If it didn't, the action is treated as a failure even if the click succeeded in coordinates.
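A free-text expectedEffect is easy for a person to read but hard for a verifier to check mechanically. One possible way to close that gap, sketched here under assumptions, is to have the model also emit a small structured expectation that is compared against state captured before and after the action. The names StateExpectation, StateSnapshot, and checkExpectation are hypothetical and are not part of the code above or of any library.

```typescript
// Hypothetical structured companion to the free-text expectedEffect.
type StateExpectation =
  | { kind: 'count_increases_by'; key: string; amount: number }
  | { kind: 'url_includes'; fragment: string }
  | { kind: 'exists'; key: string };

// Observed state, collected by your own code before and after the action.
interface StateSnapshot {
  url: string;
  counts: Record<string, number>; // e.g. { 'cart-count': 2 }
  present: string[];              // keys of elements found in the DOM
}

interface Verdict {
  verified: boolean;
  evidence: string;
}

// Compares two snapshots against the declared expectation.
// The verdict depends only on observed state, never on the model's claim.
function checkExpectation(
  before: StateSnapshot,
  after: StateSnapshot,
  exp: StateExpectation
): Verdict {
  switch (exp.kind) {
    case 'count_increases_by': {
      const b = before.counts[exp.key];
      const a = after.counts[exp.key];
      if (b === undefined || a === undefined) {
        return { verified: false, evidence: `${exp.key} was not observable` };
      }
      return {
        verified: a - b === exp.amount,
        evidence: `${exp.key}: ${b} -> ${a} (expected +${exp.amount})`,
      };
    }
    case 'url_includes':
      return { verified: after.url.includes(exp.fragment), evidence: `url=${after.url}` };
    case 'exists': {
      const found = after.present.includes(exp.key);
      return { verified: found, evidence: `${exp.key} ${found ? 'found' : 'not found'}` };
    }
  }
}
```

Comparing a before/after delta, rather than an absolute value, keeps a cart that already held items from passing as a fresh success.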

Bind verification to the goal (goal-bound assertion)

The worst thing to do in verification is to look for generic "success-ness." A vague check like "did .success-message appear?" gets fooled by a success banner shown for some other reason, producing another false success. Verification must be tied to a state-bearing fact specific to that step's goal.

// src/agent/verify.ts
import type { Page } from 'playwright';

// Exported because src/agent/step.ts imports this type
export interface VerificationResult {
  verified: boolean;
  evidence: string;
}
 
// Verify against a goal-specific, observable fact
export async function verifyCartAdded(
  page: Page,
  expectedCount: number
): Promise<VerificationResult> {
  // 1. Target a stateful element, not UI text
  const badge = page.locator('[data-testid="cart-count"]');
  // 2. Wait for the badge to exist at all
  try {
    await badge.waitFor({ state: 'visible', timeout: 5000 });
  } catch {
    return { verified: false, evidence: 'cart-count badge never appeared' };
  }
  // 3. Absorb the timing gap right here: the badge may already be visible with the
  //    OLD count, so re-read until the expected state lands or the deadline passes
  const deadline = Date.now() + 5000;
  let text = '';
  let actual = Number.NaN;
  do {
    text = (await badge.textContent())?.trim() ?? '';
    actual = Number.parseInt(text, 10);
    if (actual >= expectedCount) break;
    await page.waitForTimeout(200);
  } while (Date.now() < deadline);
  // 4. Verdict comes from comparing against the expected state
  if (Number.isNaN(actual)) {
    return { verified: false, evidence: `cart-count is not numeric: "${text}"` };
  }
  return actual >= expectedCount
    ? { verified: true, evidence: `cart-count=${actual} (expected >= ${expectedCount})` }
    : { verified: false, evidence: `cart-count=${actual} (below expected ${expectedCount})` };
}

This verifier looks at no screenshot at all. That's deliberate. Use the screenshot as input to a decision, but never as the basis for a verdict — vision misjudges, so verdicts come from facts that can't be faked: DOM state, API responses, and the like. On sites with no stable attribute like data-testid, choose observation points tied as closely as possible to meaning — a count, the URL, the presence of a specific element, an aria attribute — rather than display wording.
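For the "API responses" option mentioned above, a minimal sketch might look like the following. It assumes the site issues a network request when the action takes effect. The function name verifyByResponse and the endpoint fragment passed to it (for example '/api/cart') are hypothetical. PageLike and ResponseLike are small local types that Playwright's Page and Response are expected to satisfy structurally, which keeps the sketch free of imports.

```typescript
// Minimal structural types; Playwright's Page and Response are expected to satisfy them.
interface ResponseLike {
  url(): string;
  status(): number;
}
interface PageLike {
  waitForResponse(
    predicate: (r: ResponseLike) => boolean,
    options?: { timeout?: number }
  ): Promise<ResponseLike>;
}

interface VerificationResult {
  verified: boolean;
  evidence: string;
}

// Runs `action` and uses the matching network response as the basis for the verdict.
async function verifyByResponse(
  page: PageLike,
  urlFragment: string, // hypothetical, e.g. '/api/cart'
  action: () => Promise<void>,
  timeoutMs = 5000
): Promise<VerificationResult> {
  // Start listening BEFORE acting, otherwise a fast response can be missed.
  // Both outcomes are mapped to values so the promise never rejects unobserved.
  const pending = page
    .waitForResponse((r) => r.url().includes(urlFragment), { timeout: timeoutMs })
    .then(
      (r) => ({ ok: true as const, r }),
      (e: unknown) => ({ ok: false as const, e })
    );

  try {
    await action();
  } catch (e) {
    return { verified: false, evidence: `action threw: ${String(e)}` };
  }

  const outcome = await pending;
  if (!outcome.ok) {
    return {
      verified: false,
      evidence: `no response matching "${urlFragment}" within ${timeoutMs}ms`,
    };
  }
  const status = outcome.r.status();
  return {
    verified: status >= 200 && status < 300,
    evidence: `${outcome.r.url()} -> HTTP ${status}`,
  };
}
```

A 2xx status only shows that the server accepted the request, so for high-stakes steps it is probably worth pairing this with a DOM check such as verifyCartAdded.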

Across the four sites I run under Dolice Labs, I drive collection and integrity-check automation with Claude and Playwright, and this is exactly where I got burned first. I trusted the "looks successful" screenshot and skipped verification; the job ran all night and came up half-empty by morning. Since then I attach one "post-condition measurable as state" to every action, and draw a line: any action whose effect can't be observed as state is one I don't delegate to the agent at all. Vision is an excellent navigator but a poor inspector — that's the lesson I took from running it in production.

Wire the verification gate into the loop

Bundle action, verification, and retry into a single gate. When verification fails, feed that failure back to the model as fact and prompt a different approach — that's where the vision agent's strength pays off, in a way an imperative script can't match.

// src/agent/step.ts
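import Anthropic from '@anthropic-ai/sdk';
import type { Page } from 'playwright';
import type { VerificationResult } from './verify';
// ActionProposal and VerifiedStep are the interfaces defined earlier; import them
// from wherever you keep them (the './types' path is an assumption)
import type { ActionProposal, VerifiedStep } from './types';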
 
type Verifier = (page: Page) => Promise<VerificationResult>;
 
export async function runVerifiedStep(
  client: Anthropic,
  page: Page,
  proposal: ActionProposal,
  verify: Verifier,
  history: Anthropic.MessageParam[]
): Promise<VerifiedStep> {
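  // 1. Execute the proposed action, then 2. run the goal-specific verification
  //    (not the model's self-report). A thrown error, such as a click timing out
  //    on a covered button, is recorded as a failed verification instead of crashing.
  let result: VerificationResult;
  try {
    await applyAction(page, proposal);
    result = await verify(page);
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    result = { verified: false, evidence: `action or verifier threw: ${msg}` };
  }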
 
  // 3. On failure, hand the observed fact back to the model
  if (!result.verified) {
    history.push({
      role: 'user',
      content:
        `The previous action (${proposal.type} ${proposal.selector ?? ''}) did not ` +
        `produce the expected effect "${proposal.expectedEffect}". ` +
        `Observed fact: ${result.evidence}. Propose a different approach.`,
    });
  }
 
  return {
    proposal,
    reasoning: proposal.expectedEffect,
    verified: result.verified,
    evidence: result.evidence,
  };
}
 
async function applyAction(page: Page, a: ActionProposal): Promise<void> {
  if (a.type === 'click' && a.selector) await page.click(a.selector, { timeout: 8000 });
  else if (a.type === 'fill' && a.selector) await page.fill(a.selector, a.value ?? '');
  else if (a.type === 'navigate' && a.url) await page.goto(a.url, { waitUntil: 'networkidle' });
  // A malformed proposal must not pass as a silent no-op
  else throw new Error(`Malformed proposal: ${JSON.stringify(a)}`);
}
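The gate above runs a single attempt and leaves the retry to the caller. A bounded retry loop around it could be sketched as follows. The function runWithRetries and its two callbacks are hypothetical names: propose stands in for the call that asks the model for the next action given the failures so far, and attempt stands in for executing that action and running the goal-bound verifier. Injecting both keeps the loop independent of the browser and the SDK.

```typescript
interface StepOutcome {
  verified: boolean;
  evidence: string;
}

interface RetryReport {
  verified: boolean;
  attempts: number;
  failures: string[]; // observed evidence from each failed attempt, in order
}

// Asks for a proposal, tries it, and feeds observed failures into the next proposal.
// Stops at the first verified attempt or after `maxAttempts`.
async function runWithRetries<P>(
  propose: (failures: string[]) => Promise<P>,
  attempt: (proposal: P) => Promise<StepOutcome>,
  maxAttempts = 3
): Promise<RetryReport> {
  const failures: string[] = [];
  for (let i = 1; i <= maxAttempts; i++) {
    const proposal = await propose(failures);
    let outcome: StepOutcome;
    try {
      outcome = await attempt(proposal);
    } catch (err) {
      // A thrown error counts as a failed attempt with the message as evidence
      const msg = err instanceof Error ? err.message : String(err);
      outcome = { verified: false, evidence: `threw: ${msg}` };
    }
    if (outcome.verified) return { verified: true, attempts: i, failures };
    failures.push(outcome.evidence);
  }
  return { verified: false, attempts: maxAttempts, failures };
}
```

Capping the attempts matters as much as retrying: an unverified step that loops forever is just a slower quiet failure.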

Judge overall completion with the same philosophy: make the completion condition "did the final step verify true," and do not use whether the model said goalAchieved. Read the self-report as part of a proposal at most; let verification hold the final truth. Keep that asymmetry strict.
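That asymmetry can be written down as a small function. This is a sketch with hypothetical names (StepRecord, judgeCompletion): the model's claim is accepted as an input only so that disagreements can be logged, and it never contributes to the verdict.

```typescript
interface StepRecord {
  verified: boolean;
  evidence: string;
}

interface CompletionVerdict {
  complete: boolean;
  // True when the model claimed success but verification disagreed,
  // i.e. a false success that was caught rather than believed.
  selfReportMismatch: boolean;
}

// Completion is decided by the last step's verification alone.
function judgeCompletion(steps: StepRecord[], modelSaysDone: boolean): CompletionVerdict {
  const last = steps[steps.length - 1];
  const complete = last !== undefined && last.verified;
  return { complete, selfReportMismatch: modelSaysDone && !complete };
}
```

Counting selfReportMismatch over time gives a direct measure of how often the self-report would have misled you.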

Use structured tool calls to make verification points explicit

Beyond screenshots, handing the page structure to the model as tools improves both action precision and verification confidence. What pays off here is adding a "tool that observes post-conditions" alongside the action tool, and letting the model pick its own verification points.

const tools: Anthropic.Tool[] = [
  {
    name: 'execute_action',
    description: 'Run a browser action. Always write a state-measurable post-condition in expectedEffect.',
    input_schema: {
      type: 'object',
      properties: {
        type: { type: 'string', enum: ['click', 'fill', 'navigate'] },
        selector: { type: 'string' },
        value: { type: 'string' },
        url: { type: 'string', description: 'Required when type is "navigate"' },
        expectedEffect: {
          type: 'string',
          description: 'e.g. "cart-count increases by 1" (write it as state, not appearance)',
        },
      },
      required: ['type', 'expectedEffect'],
    },
  },
  {
    name: 'observe_state',
    description: 'Observation used to confirm a post-condition (element count, text, existence, current URL)',
    input_schema: {
      type: 'object',
      properties: {
        selector: { type: 'string' },
        property: { type: 'string', enum: ['count', 'text', 'exists', 'url'] },
      },
      required: ['property'],
    },
  },
];
 
const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1536,
  tools,
  messages: history,
});

With observe_state as its own tool, verification stops being hard-coded and instead runs on observation points the model chooses per screen. A practical split is to use a hard-coded function like verifyCartAdded for stable sites, and dynamic verification via observe_state for sites whose structure is hard to read.
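On the application side, observe_state needs a handler that performs the observation and sends the result back. A minimal sketch, assuming the tool input matches the schema above: observeState and handleObserveState are hypothetical names, and PageLike and LocatorLike are local types that Playwright's Page and Locator are expected to satisfy structurally. The returned object follows the shape of a Messages API tool_result content block.

```typescript
// Minimal structural types; Playwright's Page and Locator are expected to satisfy them.
interface LocatorLike {
  count(): Promise<number>;
  first(): LocatorLike;
  textContent(): Promise<string | null>;
}
interface PageLike {
  url(): string;
  locator(selector: string): LocatorLike;
}

interface ObserveInput {
  selector?: string;
  property: 'count' | 'text' | 'exists' | 'url';
}

// Shape of a tool_result content block in the Messages API.
interface ToolResultBlock {
  type: 'tool_result';
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

// Performs the requested observation and returns it as plain text.
async function observeState(page: PageLike, input: ObserveInput): Promise<string> {
  if (input.property === 'url') return page.url();
  if (!input.selector) throw new Error(`"${input.property}" requires a selector`);
  const loc = page.locator(input.selector);
  const n = await loc.count();
  if (input.property === 'count') return String(n);
  if (input.property === 'exists') return String(n > 0);
  // 'text': read the first match only; an absent element is reported as such
  if (n === 0) return '(no matching element)';
  return ((await loc.first().textContent()) ?? '').trim();
}

// Wraps the observation (or the error) in a tool_result for the given tool_use id.
async function handleObserveState(
  page: PageLike,
  toolUseId: string,
  input: ObserveInput
): Promise<ToolResultBlock> {
  try {
    const content = await observeState(page, input);
    return { type: 'tool_result', tool_use_id: toolUseId, content };
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    return { type: 'tool_result', tool_use_id: toolUseId, content: msg, is_error: true };
  }
}
```

The block goes into the content array of the next user message. Every tool_use in an assistant turn has to be answered by a tool_result with the same id, so when the proposal itself arrives as an execute_action tool call, the verifier's evidence belongs in a tool_result as well rather than in a bare text message.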

Cache the system prompt to leave room for verification

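Re-sending the same rules every step is wasteful, and doubling up on verification adds tokens. Cache the long system prompt with cache_control and spend the savings on verification round-trips. Note that the API only caches a prefix once it exceeds a model-specific minimum length, so the short prompt below stands in for the longer rule set you would actually send.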

const system: Anthropic.TextBlockParam[] = [
  {
    type: 'text',
    text:
      'You are an agent that operates a browser. ' +
      'Attach a state-measurable expectedEffect to every action, ' +
      'and when verification fails, try a different path rather than repeating the same action. ' +
      'Do not assert goal completion by self-report; defer to observed facts.',
    cache_control: { type: 'ephemeral' }, // current SDK takes this directly on messages.create
  },
];
 
const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1536,
  system,
  messages: recentHistory,
});

Many older articles use client.beta.promptCaching.messages.create, but the current SDK lets you put cache_control directly on a normal messages.create. If an old branch lingers in your code, clearing it out here makes maintenance easier.
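Whether the cache is actually being hit can be read from the usage object on each response, which reports cache_creation_input_tokens and cache_read_input_tokens alongside input_tokens. The helper below is a small sketch for logging that. The name cacheSummary and the UsageLike type are hypothetical, and UsageLike covers only the fields used here.

```typescript
// Subset of the `usage` object on a Messages API response.
interface UsageLike {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

interface CacheSummary {
  status: 'hit' | 'written' | 'none';
  cachedShare: number; // fraction of all input tokens that were read from cache
}

// Classifies one response: was the cached prefix read, written, or not involved?
function cacheSummary(usage: UsageLike): CacheSummary {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  // input_tokens counts only the uncached part, so the three values add up
  const total = usage.input_tokens + read + written;
  const status = read > 0 ? 'hit' : written > 0 ? 'written' : 'none';
  return { status, cachedShare: total === 0 ? 0 : read / total };
}
```

If every step logs 'none', the cached prefix is probably too short or is changing between calls, and the savings this section counts on are not materializing.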

Notice UI drift before it breaks

The breeding ground for false success is UI change. So rather than discovering it for the first time inside a production job, observe the change itself on a schedule and get ahead of it. What I keep running is a lightweight canary that simply confirms whether the verification points on key screens still exist.

// src/monitor/drift.ts
import { chromium } from 'playwright';
 
interface Anchor { url: string; selector: string; label: string; }
 
// Register the observation points you verify against as anchors
const ANCHORS: Anchor[] = [
  { url: 'https://example.com/product/1', selector: '[data-testid="add-to-cart"]', label: 'Add button' },
  { url: 'https://example.com/cart', selector: '[data-testid="cart-count"]', label: 'Cart count badge' },
];
 
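export async function checkDrift(): Promise<{ ok: boolean; missing: string[] }> {
  const browser = await chromium.launch({ headless: true });
  const missing: string[] = [];
  try {
    const page = await browser.newPage();
    for (const a of ANCHORS) {
      try {
        await page.goto(a.url, { waitUntil: 'domcontentloaded' });
        // Give client-rendered anchors a moment to attach before declaring them missing
        await page.locator(a.selector).first().waitFor({ state: 'attached', timeout: 5000 });
      } catch {
        // A failed page load is reported the same way: the anchor could not be confirmed
        missing.push(`${a.label} (${a.selector})`);
      }
    }
  } finally {
    await browser.close(); // always release the browser, even if something throws
  }
  return { ok: missing.length === 0, missing };
}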

Run this canary a few times a day, separate from the production loop, and alert before launching the collection job if any anchor has disappeared. A missing verification point means the exact conditions for false success are now in place. Rather than delegating to the agent and letting it fail quietly, a human updating a selector in five minutes is, in the end, cheaper — that's my call.
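When several sites share one job, the canary result can be used to hold back only the site that drifted. The sketch below assumes a per-anchor result shape (AnchorCheck), which is hypothetical and slightly richer than the missing list returned by checkDrift. The function name sitesToHold is hypothetical as well.

```typescript
interface AnchorCheck {
  url: string;   // page the anchor was checked on
  label: string; // human-readable name of the anchor
  found: boolean;
}

// Groups missing anchors by hostname so only the affected sites are held back.
function sitesToHold(checks: AnchorCheck[]): Map<string, string[]> {
  const hold = new Map<string, string[]>();
  for (const c of checks) {
    if (c.found) continue;
    const host = new URL(c.url).hostname;
    const labels = hold.get(host) ?? [];
    labels.push(c.label);
    hold.set(host, labels);
  }
  return hold;
}
```

Before launching the collection job, skip any site whose hostname appears in the returned map and include the listed labels in the alert, so whoever fixes the selector knows where to look.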

Meter the false-success rate and halt with a circuit breaker

The last safeguard is operational. Once you run actions through the verification gate, you have, per step, the share of "proposed but failed verification." Monitor that as a proxy for the false-success rate, and add a circuit breaker that halts the job once it crosses a threshold.

Metric	Meaning	Suggested threshold
Verification failure rate	Share of proposals not meeting their post-condition	Warn above 20% over the last 50 steps
Post-retry failure rate	Share still unachieved after the feedback round	Halt the job above 10%
Canary misses	Number of anchors that disappeared	Halt that site at 1 or more

// src/agent/breaker.ts
export class VerificationBreaker {
  private window: boolean[] = [];
  constructor(private size = 50, private maxFailRate = 0.2) {}
 
  record(verified: boolean): void {
    this.window.push(verified);
    if (this.window.length > this.size) this.window.shift();
  }
 
  shouldHalt(): boolean {
    if (this.window.length < this.size) return false;
    const fails = this.window.filter((v) => !v).length;
    return fails / this.window.length > this.maxFailRate;
  }
}

Lining up green logs isn't the goal. Not running while wrong is the goal. When the verification failure rate climbs, it usually isn't the agent's fault — it's a sign the target UI changed — so stop, fix the observation point, and run again. Being able to turn that cycle is, I think, the minimum condition for handing a job to an agent around the clock.
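The table above separates a warning threshold (first-attempt verification failures) from a halt threshold (failures that persist after the feedback round), while VerificationBreaker tracks a single rate. One way to track both is sketched here. The class TwoTierBreaker is a hypothetical variant, and minSamples is an added assumption so that short jobs can still trip the breaker before the window fills.

```typescript
type BreakerState = 'ok' | 'warn' | 'halt';

class TwoTierBreaker {
  private firstTry: boolean[] = [];   // did the first attempt verify?
  private afterRetry: boolean[] = []; // did the step verify after feedback and retries?
  private readonly size: number;
  private readonly warnRate: number;
  private readonly haltRate: number;
  private readonly minSamples: number;

  constructor(size = 50, warnRate = 0.2, haltRate = 0.1, minSamples = 10) {
    this.size = size;
    this.warnRate = warnRate;
    this.haltRate = haltRate;
    this.minSamples = minSamples;
  }

  // Record one step: its first-attempt verdict and its final verdict.
  record(firstAttemptVerified: boolean, finallyVerified: boolean): void {
    this.firstTry.push(firstAttemptVerified);
    this.afterRetry.push(finallyVerified);
    if (this.firstTry.length > this.size) {
      this.firstTry.shift();
      this.afterRetry.shift();
    }
  }

  // 'halt' takes precedence over 'warn'; too few samples always reads as 'ok'.
  state(): BreakerState {
    if (this.firstTry.length < this.minSamples) return 'ok';
    const failRate = (xs: boolean[]): number => xs.filter((v) => !v).length / xs.length;
    if (failRate(this.afterRetry) > this.haltRate) return 'halt';
    if (failRate(this.firstTry) > this.warnRate) return 'warn';
    return 'ok';
  }
}
```

Call record once per step and check state() before starting the next one: 'warn' is a signal to look at the canary, and 'halt' stops the job.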

Where to start

If your existing browser agent decides completion from a goalAchieved self-report, start by adding one state-measurable post-verification to your most irreversible action (payment, submission, deletion). Take the verdict from a DOM fact, not a screenshot — change just that one place, and the meaning of your morning logs shifts from "probably succeeded" to "verified success." From there, add observation points and extend into canaries and the circuit breaker. That's the manageable order.
