API & SDK · 2026-06-25 · Advanced

When Your Support AI Is Confidently Wrong in Production — Notes on Refusing Outside Its Grounding and Routing to Humans

A field-tested approach for Claude API support agents that demo perfectly yet state non-existent facts in production. Covers deciding "don't answer" at retrieval time, grounded generation, measuring the confident-wrong rate, and tuning escalation precision.

It demoed perfectly, yet states things that don't exist

Most teams remember the first internal demo of a support AI vividly. It answers every prepared question fluently, the tone is polite, and everyone feels that first-line support can be handed over. The trouble starts after you trust that feeling and ship.

As an indie developer at Dolice running semi-automated inquiry handling across several blogs and apps, the case that chilled me most was an agent that confidently quoted the terms of a campaign that did not exist — complete with numbers. The user acted on it, then wrote back: "I did exactly what it said." Technically nothing failed. The response was a 200, the prose was smooth, the honorifics were correct. Only the content was wrong.

This quiet kind of error is not solved by a smarter model. Claude is plenty smart and it still happens. The cause is not a lack of intelligence; it is the absence of a mechanism that makes the system say "I don't know" when it doesn't. Below I split that mechanism into four parts — retrieval, generation, measurement, and escalation — in the shape that actually held up in production.

Why it can't say "I don't know" — the grounding misunderstanding

People often tell me they have RAG in place yet confident errors persist. Dig in and the structure is usually identical. You pull related documents from a knowledge base, stuff them into the prompt, and instruct "answer based on the resources below." That part is fine. But even when the retrieved documents barely relate to the question, generation does not stop.

The model tries to use whatever context you hand it. Given only weakly relevant documents, it assembles a plausible-sounding answer anyway. So the weak point of grounding is not generation — it sits one step earlier. The real hole is failing to judge, before generation, whether there is genuinely enough grounding to be worth answering at all.

I moved that judgment into retrieval as an explicit "answer or hand off" branch. Generation becomes the final stage, reached only by questions that pass that branch.

Decide "don't answer" at the retrieval stage

First, score retrieval confidence and refuse to proceed to generation once it drops below a threshold. This single step before calling Claude cut confident errors the most.

import os
from anthropic import Anthropic
from dataclasses import dataclass
 
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
 
@dataclass
class Retrieval:
    chunks: list[dict]   # {"text": ..., "score": float, "source": ...}
    top_score: float
    margin: float        # gap between the 1st and 2nd scores
 
def assess(retrieval: Retrieval) -> str:
    """Decide answer / clarify / escalate from grounding quality alone."""
    # Set thresholds from production logs. Start strict.
    if retrieval.top_score < 0.62:
        return "escalate"        # no relevant document at all
    if retrieval.margin < 0.05 and retrieval.top_score < 0.75:
        return "clarify"         # near-tied candidates = vague question
    return "answer"

Two things matter here. One is looking not only at the absolute top score but at the margin between first and second place. Even with a high top score, near-tied candidates mean the question is vague and the system can't pick a document. Routing to a clarifying question instead of an answer cuts mix-ups.
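As a minimal sketch of where top_score and margin could come from, the helper below derives both from a list of scored chunks. It assumes higher scores mean more relevant; build_retrieval and its handling of the single-candidate case are illustrative assumptions, not part of the original pipeline.

```python
from dataclasses import dataclass


@dataclass
class Retrieval:
    chunks: list[dict]   # {"text": ..., "score": float, "source": ...}
    top_score: float
    margin: float        # gap between the 1st and 2nd scores


def build_retrieval(chunks: list[dict]) -> Retrieval:
    """Sort chunks by score and derive top_score and margin (hypothetical helper)."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    if not ranked:
        # Nothing retrieved: a zero top score falls below any sensible threshold.
        return Retrieval(chunks=[], top_score=0.0, margin=0.0)
    top = ranked[0]["score"]
    # Assumption: with a single candidate there is no runner-up to confuse it
    # with, so the whole top score counts as the margin.
    second = ranked[1]["score"] if len(ranked) > 1 else 0.0
    return Retrieval(chunks=ranked, top_score=top, margin=top - second)
```

An empty result yields a top score of 0.0, so it falls through to the hand-off path rather than raising an error.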

The other is not choosing thresholds by gut. For the first two weeks I logged every assess decision against the eventual correct/incorrect verdict, looked at the score distribution of cases that wrongly reached answer, and raised the threshold. Going from 0.55 to 0.62 alone made confident errors feel noticeably rarer.
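One way to read that logged distribution is to replay candidate thresholds against the reviewed cases. The sketch below assumes each log row carries a top_score and a verdict of "correct" or "incorrect"; the field names and the sweep_thresholds helper are hypothetical.

```python
def sweep_thresholds(logs: list[dict], candidates: list[float]) -> list[dict]:
    """For each candidate threshold, report how many reviewed cases would be
    answered and what share of those answers were judged incorrect."""
    # Rows without a verdict carry no signal about correctness, so skip them.
    reviewed = [x for x in logs if x.get("verdict") in ("correct", "incorrect")]
    rows = []
    for t in candidates:
        answered = [x for x in reviewed if x["top_score"] >= t]
        wrong = sum(1 for x in answered if x["verdict"] == "incorrect")
        rows.append({
            "threshold": t,
            "answered": len(answered),
            "wrong_rate": wrong / len(answered) if answered else 0.0,
        })
    return rows
```

Reading answered and wrong_rate side by side shows what each step up in the threshold costs in coverage and buys in accuracy.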

Don't let generation step outside the grounding

Only questions that assess routed to answer go on to generation. The aim here is to stop the model from adding information outside the documents you handed it. Bind the grounding scope in the system prompt and make the exit for "can't answer" explicit.

SYSTEM = """You are a customer support agent. Follow these rules strictly.
 
- Ground every answer only in the text inside <resources>. Never supply facts,
  numbers, or conditions that are not in the resources.
- If the resources do not confirm it, do not guess. Reply: "Let me connect you
  with a team member to confirm."
- For each claim, cite the source you grounded it in, like [S1], in the body.
- Answer in the user's language, even if the resources are in another language.
"""
 
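def generate(question: str, chunks: list[dict], user_lang: str) -> dict:
    # Number each document so the [S1], [S2] citations the prompt asks for
    # map to a concrete source.
    resources = "\n\n".join(
        f"<doc id=\"S{i}\" source=\"{c['source']}\">{c['text']}</doc>"
        for i, c in enumerate(chunks, start=1)
    )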
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=800,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": f"<resources>\n{resources}\n</resources>\n\n"
                       f"Question ({user_lang}): {question}",
        }],
    )
    text = resp.content[0].text
    # An answer with no citation tag is suspected of fabricating outside resources
    grounded = "[S" in text or "connect you with a team member" in text
    return {"text": text, "grounded": grounded}

The last three lines quietly earn their keep. An answer that contains neither a [S...] citation nor the hand-off phrase was likely written without consulting the resources, so keep it off the automated path and route it to a human. It is not a complete detector, but it is a final sieve for fabricated answers that slipped through.
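The substring check can be tightened a little. The sketch below assumes the documents are presented to the model numbered S1 to Sn in the order they were passed, and accepts an answer only when every cited number points at a document that was actually supplied. citations_valid is a hypothetical helper, and like the original check it is a sieve, not proof that the cited text supports the claim.

```python
import re

CITATION = re.compile(r"\[S(\d+)\]")


def citations_valid(text: str, num_chunks: int) -> bool:
    """True only if the text cites at least one source and every cited
    number refers to a document that was actually supplied."""
    cited = [int(n) for n in CITATION.findall(text)]
    return bool(cited) and all(1 <= n <= num_chunks for n in cited)
```

A citation such as [S7] when only three documents were passed is a strong hint that the model invented its grounding.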

For response language, declaring the detected language in the prompt is the reliable fix. When the resources are in Japanese and the question is in English, without an instruction the model can be pulled toward the resource language and reply in Japanese. I saw this a few times in production, and it stopped once I started passing user_lang explicitly every time.
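How the language gets detected is not specified above. As a deliberately crude sketch for a Japanese/English setup only, the presence of kana can serve as the signal; detect_user_lang is a hypothetical helper, and a question written entirely in kanji, or in any third language, would fall through to the default.

```python
import re

# Hiragana and katakana (U+3040 to U+30FF) are specific to Japanese, so a
# single kana character is a strong signal.
KANA = re.compile(r"[\u3040-\u30ff]")


def detect_user_lang(question: str, default: str = "English") -> str:
    """Crude two-language detector: Japanese if any kana appears, else default."""
    return "Japanese" if KANA.search(question) else default
```

The returned value is what would be passed as user_lang to generate. A real deployment with more languages would likely need a proper language-identification library instead.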

Measure false confidence — three signals to watch

"It feels more accurate" doesn't keep an operation running. I watch only these three:

Signal	Definition	Operating guide
Deflection rate	Resolved without a human, with no repeat inquiry	Too high suggests it is forcing answers
Confident-wrong rate	Asserted with a citation yet the content was wrong	Most important. Hold under 1%
Escalation precision	Of cases sent to a human, the share that truly needed one	Low means over-routing tires the team

Of these, the confident-wrong rate matters most by a wide margin. You can lower deflection at will, but an AI with a high confident-wrong rate damages trust itself. I track it weekly and use it to decide threshold and grounding-scope changes. Measuring it isn't hard.

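def confident_wrong_rate(logs: list[dict]) -> float:
    """Share of reviewed confident answers (cited, not escalated) that were wrong."""
    # Count only cases that received a verdict; with sampled review, unreviewed
    # rows in the denominator would make the rate look lower than it is.
    confident = [x for x in logs
                 if x["grounded"] and not x["escalated"] and x.get("verdict")]
    if not confident:
        return 0.0
    wrong = [x for x in confident if x["verdict"] == "incorrect"]
    return len(wrong) / len(confident)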

You fill in verdict after the fact from repeat inquiries, low ratings, and human correction history. You don't need a human to review every case; sampling is enough. Early on I was content watching only the deflection rate, and I was slow to notice confident errors happening below the surface. The better a metric looks, the more it can hide the dangerous one — that was the biggest lesson from this work.
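Because the rate comes from a sample, a small sample can look clean by luck. One common way to express that uncertainty is a Wilson score interval. The sketch below is an addition for illustration, not something the measurements above relied on.

```python
import math


def wilson_interval(wrong: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if n == 0:
        return (0.0, 1.0)   # no reviewed cases: nothing can be ruled out
    p = wrong / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

With zero wrong answers in a sample of 200, the upper bound is still about 1.9%, so that sample alone would not demonstrate a rate under 1%.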

Escalation: tuned between "too few" and "too many"

The hand-off decision fails whether it's too strict or too loose. Too loose and confident errors rise; too strict and everything gets routed, the team burns out, and the point of the AI fades. I treat this as a tug-of-war between precision and recall and collapse it into one number with F1.

def escalation_scores(logs: list[dict]) -> dict:
    # Did we correctly route cases that needed a human?
    tp = sum(1 for x in logs if x["needed_human"] and x["escalated"])
    fp = sum(1 for x in logs if not x["needed_human"] and x["escalated"])
    fn = sum(1 for x in logs if x["needed_human"] and not x["escalated"])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": round(precision, 3),
            "recall": round(recall, 3), "f1": round(f1, 3)}

In practice, when to favor recall versus precision shifts by topic. Inquiries about money or contracts carry a large cost for false negatives, so lean toward recall even at the cost of some over-routing. Areas like how-to questions, where being wrong does little harm, lean toward precision and route less. I split the assess thresholds per category and hand the dangerous categories to a human earlier. Not trying to handle every category with one uniform threshold was the key to balancing team load against safety.
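A minimal sketch of per-category thresholds might look like the following. The category names and every number except the default row are made up for illustration; the default row mirrors the values used in assess, and assess_for is a hypothetical variant of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    escalate_below: float   # top_score under this goes to a human
    clarify_margin: float   # margin under this suggests a vague question
    clarify_below: float    # ...but only while top_score is under this


# Illustrative values only; tune each row from that category's own logs.
THRESHOLDS = {
    "billing": Thresholds(0.75, 0.08, 0.85),   # favor recall: hand off early
    "how_to": Thresholds(0.58, 0.05, 0.75),    # favor precision: route less
    "default": Thresholds(0.62, 0.05, 0.75),
}


def assess_for(category: str, top_score: float, margin: float) -> str:
    """Same three-way decision as assess, with thresholds chosen per category."""
    t = THRESHOLDS.get(category, THRESHOLDS["default"])
    if top_score < t.escalate_below:
        return "escalate"
    if margin < t.clarify_margin and top_score < t.clarify_below:
        return "clarify"
    return "answer"
```

The same retrieval score of 0.70 is then answered for a how-to question but handed off for a billing question.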

I also keep an emotional trigger separate. Inquiries that clearly read as strong dissatisfaction get routed to a human first, even when the content is answerable — because sometimes people need to be heard before they need the correct answer.

Small implementation choices that quietly helped

Beyond the big design, a few small things earned their place. One is masking PII before generation. Passing the raw inquiry straight into retrieval and prompts circulates email addresses and order numbers needlessly. Replacing them with placeholders via regex up front and restoring only what's needed right before replying lowers the exposure surface a notch.
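The mask-and-restore step can be sketched roughly as below. The email pattern is intentionally simple, and the order-number format ("ORD-" followed by digits) is an assumption; real identifiers will need their own patterns. mask_pii and restore are hypothetical names.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumption: order numbers look like "ORD-" followed by digits.
ORDER = re.compile(r"\bORD-\d+\b")


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace emails and order numbers with placeholders; return the mapping."""
    mapping: dict[str, str] = {}

    def substitute(pattern: re.Pattern, label: str, s: str) -> str:
        def repl(m: re.Match) -> str:
            key = f"<{label}_{len(mapping) + 1}>"
            mapping[key] = m.group(0)
            return key
        return pattern.sub(repl, s)

    masked = substitute(EMAIL, "EMAIL", text)
    masked = substitute(ORDER, "ORDER", masked)
    return masked, mapping


def restore(text: str, mapping: dict[str, str], allow: set[str]) -> str:
    """Put back only the placeholders named in allow."""
    for key, value in mapping.items():
        if key in allow:
            text = text.replace(key, value)
    return text
```

Only the masked text goes to retrieval and the prompt. The mapping stays on your side, and restore is called right before replying with just the placeholders the reply needs.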

Another is reusing the fixed portion with prompt caching. The system prompt and the common knowledge base don't change per request, so caching them clearly eases cost as volume grows. Once monthly inquiries reach the tens of thousands, the difference is not negligible.
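With the Anthropic Messages API, caching is requested by passing the system prompt as a list of text blocks and marking the end of the fixed prefix with cache_control. The sketch below only builds the request arguments; build_cached_request is a hypothetical helper, and it assumes the common knowledge base fits in a single text block.

```python
def build_cached_request(system_text: str, kb_text: str, question: str) -> dict:
    """Build kwargs for client.messages.create with the fixed prefix marked cacheable."""
    return {
        "model": "claude-sonnet-4-6",   # same model name as the generate example
        "max_tokens": 800,
        "system": [
            {"type": "text", "text": system_text},
            {
                "type": "text",
                "text": kb_text,
                # Marks the prefix up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Per-request content goes after the cached prefix so it does not
        # invalidate the cache.
        "messages": [{"role": "user", "content": question}],
    }
```

It would be used as client.messages.create(**build_cached_request(SYSTEM, kb_text, question)). The response's usage field reports cache_creation_input_tokens and cache_read_input_tokens, which is the easiest way to confirm the cache is being hit. Prefixes below a model-specific minimum length are not cached, so a short system prompt alone may see no benefit.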

Last, always store the grounding source structurally in the answer log. When you later trace confident errors, keeping only the body won't lead you back to the offending document. Saving {source, score, verdict} in a structured form lets you trace all the way back to the retrieval pattern behind a mistake. I skimped on this step early and burned extra time on root-cause analysis.
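One possible shape for such a record is sketched below. The field names beyond source, score, and verdict are assumptions chosen to match the log fields the metric functions read (grounded, escalated); AnswerLog and to_jsonl are hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class AnswerLog:
    question_id: str
    decision: str                  # "answer" / "clarify" / "escalate"
    sources: list[dict] = field(default_factory=list)  # [{"source": ..., "score": ...}]
    grounded: bool = False
    escalated: bool = False
    verdict: Optional[str] = None  # filled in later: "correct" / "incorrect"


def to_jsonl(entry: AnswerLog) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(entry), ensure_ascii=False)
```

Writing one JSON line per answer keeps the log appendable, and the verdict can be filled in later by rewriting the row or joining on question_id.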

If you change one thing, start here

If you're wrestling with confident errors right now, the first place to touch is the "don't answer" branch in retrieval. No amount of polishing the generation prompt removes confident errors while the structure that answers ungrounded questions remains. Measure retrieval confidence against a single threshold, and hand off to a human when it drops below that threshold. Add only that today, and log for a week. Adjusting the threshold against your production correct/incorrect distribution follows naturally from there.

I hope it helps in your own build. Thank you for reading.
