When Your Support AI Is Confidently Wrong in Production — Notes on Refusing Outside Its Grounding and Routing to Humans
A field-tested approach to the Claude API support agent that demos perfectly yet states non-existent facts in production. Covers deciding 'don't answer' at retrieval time, grounded generation, measuring confident-wrong rate, and tuning escalation precision.
It demoed perfectly, yet states things that don't exist
Most teams remember the first internal demo of a support AI vividly. It answers every prepared question fluently, the tone is polite, and everyone feels that first-line support can be handed over to it. The trouble starts after you trust that feeling and ship.
I run semi-automated inquiry handling across several blogs and apps as an indie developer at Dolice. The case that chilled me most was an agent that confidently quoted the terms of a campaign that did not exist, complete with numbers. The user acted on it, then wrote back: "I did exactly what it said." Technically nothing failed. The response was a 200, the prose was smooth, the honorifics were correct. Only the content was wrong.
This quiet kind of error is not solved by a smarter model. Claude is plenty smart and it still happens. The cause is not a lack of intelligence; it is the absence of a mechanism that makes the system say "I don't know" when it doesn't. Below I split that mechanism into four parts — retrieval, generation, measurement, and escalation — in the shape that actually held up in production.
Why it can't say "I don't know" — the grounding misunderstanding
People often tell me they have RAG in place yet confident errors persist. Dig in and the structure is usually identical. You pull related documents from a knowledge base, stuff them into the prompt, and instruct "answer based on the resources below." That part is fine. But even when the retrieved documents barely relate to the question, generation does not stop.
The model tries to use whatever context you hand it. Given only weakly relevant documents, it assembles a plausible-sounding answer anyway. So the weak point of grounding is not generation — it sits one step earlier. The real hole is failing to judge, before generation, whether there is genuinely enough grounding to be worth answering at all.
I moved that judgment into retrieval as an explicit "answer or hand off" branch. Generation is just the final stage, reached only by requests that pass that branch.
First, score retrieval confidence and refuse to proceed to generation once it drops below a threshold. This single step before calling Claude cut confident errors the most.
```python
import os
from dataclasses import dataclass

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


@dataclass
class Retrieval:
    chunks: list[dict]  # {"text": ..., "score": float, "source": ...}
    top_score: float
    margin: float  # gap between the 1st and 2nd scores


def assess(retrieval: Retrieval) -> str:
    """Decide answer / clarify / escalate from grounding quality alone."""
    # Set thresholds from production logs. Start strict.
    if retrieval.top_score < 0.62:
        return "escalate"  # no relevant document at all
    if retrieval.margin < 0.05 and retrieval.top_score < 0.75:
        return "clarify"  # near-tied candidates = vague question
    return "answer"
```
Two things matter here. One is looking not only at the absolute top score but at the margin between first and second place. Even with a high top score, near-tied candidates mean the question is vague and the system can't pick a document. Routing to a clarifying question instead of an answer cuts mix-ups.
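As a minimal sketch of how top_score and margin could be derived from raw retrieval scores: the helper name `summarize_scores` is hypothetical, and treating a single result as having a margin equal to its own score is an assumption, not something the original setup specifies.

```python
def summarize_scores(scores: list[float]) -> tuple[float, float]:
    """Return (top_score, margin) for a list of similarity scores.

    margin is the gap between the best and second-best score.
    Assumption: with only one candidate there is no competitor,
    so the margin is taken to be the top score itself.
    """
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return 0.0, 0.0  # nothing retrieved at all
    top = ranked[0]
    margin = top - ranked[1] if len(ranked) > 1 else top
    return top, margin
```

With scores of 0.81 and 0.79 the margin is about 0.02, which is the near-tied situation described above: a strong top score, but no clear winner.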
The other is not choosing thresholds by gut. For the first two weeks I logged every assess decision against the eventual correct/incorrect verdict, looked at the score distribution of cases that wrongly reached answer, and raised the threshold. Raising it from 0.55 to 0.62 alone noticeably reduced confident errors.
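The threshold review described above can be sketched roughly as follows. This is an illustration under assumptions: the log fields `top_score` and `verdict` and the function name are hypothetical, and the sketch looks only at the top-score cutoff, ignoring the margin rule.

```python
def wrong_rate_by_threshold(logs: list[dict], candidates: list[float]) -> dict:
    """For each candidate cutoff, count the logged cases that would have
    reached 'answer' and the share of those judged incorrect."""
    report = {}
    for threshold in candidates:
        answered = [x for x in logs if x["top_score"] >= threshold]
        wrong = [x for x in answered if x["verdict"] == "incorrect"]
        rate = len(wrong) / len(answered) if answered else 0.0
        report[threshold] = {
            "answered": len(answered),
            "wrong_rate": round(rate, 3),
        }
    return report
```

Comparing the rows for two cutoffs shows the trade: a higher threshold answers fewer questions, and the review is about whether the drop in wrong answers justifies the extra hand-offs.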
Don't let generation step outside the grounding
Only requests that reached answer go on to generation. The aim here is to stop the model from adding information outside the documents you handed it. Bind the grounding scope in the system prompt and make the exit for "can't answer" explicit.
```python
SYSTEM = """You are a customer support agent. Follow these rules strictly.
- Ground every answer only in the text inside <resources>. Never supply facts, numbers, or conditions that are not in the resources.
- If the resources do not confirm it, do not guess. Reply: "Let me connect you with a team member to confirm."
- For each claim, cite the id of the <doc> you grounded it in, like [S1], in the body.
- Answer in the user's language, even if the resources are in another language."""


def generate(question: str, chunks: list[dict], user_lang: str) -> dict:
    # Number each document S1, S2, ... so the [S1]-style citations the
    # system prompt asks for point at something concrete.
    resources = "\n\n".join(
        f"<doc id=\"S{i}\" source=\"{c['source']}\">{c['text']}</doc>"
        for i, c in enumerate(chunks, start=1)
    )
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=800,
        system=SYSTEM,
        messages=[{
            "role": "user",
            "content": f"<resources>\n{resources}\n</resources>\n\n"
                       f"Question ({user_lang}): {question}",
        }],
    )
    text = resp.content[0].text
    # An answer with no citation tag is suspected of fabricating outside resources
    grounded = "[S" in text or "connect you with a team member" in text
    return {"text": text, "grounded": grounded}
```
The last three lines quietly earn their keep. An answer that contains neither a [S...] citation nor the hand-off phrase is likely written without consulting the resources, so don't put it on the automated path — route it to a human. It is not a complete detector, but it's a final sieve that catches fabricated answers that slipped through.
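If a slightly finer sieve is wanted, one option is to check that every cited id actually exists among the documents passed in. The sketch below assumes the resources are numbered S1 to Sn in the order they were sent; `check_citations` is a hypothetical helper, not part of the Anthropic SDK.

```python
import re

CITATION = re.compile(r"\[S(\d+)\]")
HANDOFF_PHRASE = "connect you with a team member"


def check_citations(text: str, n_chunks: int) -> dict:
    """Flag answers that cite nothing, or cite a document that was never sent."""
    cited = {int(num) for num in CITATION.findall(text)}
    unknown = sorted(i for i in cited if not 1 <= i <= n_chunks)
    handed_off = HANDOFF_PHRASE in text
    # Acceptable: an explicit hand-off, or at least one citation and none unknown.
    ok = handed_off or (bool(cited) and not unknown)
    return {"ok": ok, "cited": sorted(cited), "unknown": unknown}
```

Like the substring check, this only confirms that a citation points at a real document, not that the document supports the claim, so it stays a sieve rather than a verifier.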
For response language, declaring the detected language in the prompt is the reliable fix. When the resources are in Japanese and the question is in English, without an instruction the model can be pulled toward the resource language and reply in Japanese. I saw this a few times in production, and it stopped once I started passing user_lang explicitly every time.
Measure false confidence — three signals to watch
"It feels more accurate" doesn't keep an operation running. I watch only these three:
| Signal | Definition | Operating guide |
|---|---|---|
| Deflection rate | Resolved without a human, with no repeat inquiry | Too high suggests it is forcing answers |
| Confident-wrong rate | Asserted with a citation yet the content was wrong | Most important. Hold under 1% |
| Escalation precision | Of cases sent to a human, the share that truly needed one | Low means over-routing tires the team |
Of these, confident-wrong rate matters by a wide margin. You can lower deflection at will, but an AI with a high confident-wrong rate damages trust itself. I track it weekly and use it to decide threshold and grounding-scope changes. Measuring it isn't hard.
```python
def confident_wrong_rate(logs: list[dict]) -> float:
    """Share that asserted (cited, not escalated) and was wrong."""
    confident = [x for x in logs if x["grounded"] and not x["escalated"]]
    if not confident:
        return 0.0
    wrong = [x for x in confident if x.get("verdict") == "incorrect"]
    return len(wrong) / len(confident)
```
You fill in verdict after the fact from repeat inquiries, low ratings, and human correction history. You don't need a human to review every case; sampling is enough. Early on I was content watching only the deflection rate, and I was slow to notice confident errors happening below the surface. The better a metric looks, the more it can hide the dangerous one — that was the biggest lesson from this work.
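Drawing the review sample could look roughly like this. It is a sketch assuming the same log fields as the function above, with `verdict` missing or None until someone has judged the case; `sample_for_review` is a hypothetical name.

```python
import random


def sample_for_review(logs: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Pick up to k confident answers that have no verdict yet."""
    pending = [
        x for x in logs
        if x["grounded"] and not x["escalated"] and x.get("verdict") is None
    ]
    rng = random.Random(seed)  # fixed seed keeps a given week's sample reproducible
    return rng.sample(pending, min(k, len(pending)))
```

Sampling only from confident, non-escalated answers keeps reviewer time on the population that the confident-wrong rate is computed over.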
Escalation: tuned between "too few" and "too many"
The hand-off decision fails whether it's too strict or too loose. Too loose and confident errors rise; too strict and everything gets routed, the team burns out, and the point of the AI fades. I treat this as a tug-of-war between precision and recall and collapse it into one number with F1.
```python
def escalation_scores(logs: list[dict]) -> dict:
    # Did we correctly route cases that needed a human?
    tp = sum(1 for x in logs if x["needed_human"] and x["escalated"])
    fp = sum(1 for x in logs if not x["needed_human"] and x["escalated"])
    fn = sum(1 for x in logs if x["needed_human"] and not x["escalated"])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(f1, 3),
    }
```
In practice, when to favor recall versus precision shifts by topic. Inquiries about money or contracts carry a large cost for false negatives, so lean toward recall even at the cost of some over-routing. Areas like how-to questions, where being wrong does little harm, lean toward precision and route less. I split the assess thresholds per category and hand the dangerous categories to a human earlier. Not trying to handle every category with one uniform threshold was the key to balancing team load against safety.
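Per-category thresholds might be organized along these lines. The category names and every number other than the 0.62 / 0.05 / 0.75 defaults are made-up placeholders for illustration; real values would come from each category's own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    escalate_below: float  # top score under this goes to a human
    clarify_margin: float  # a margin under this counts as near-tied
    clarify_below: float   # near-ties only trigger clarify under this top score


DEFAULT = Thresholds(0.62, 0.05, 0.75)

# Hypothetical values: stricter for money matters, looser for how-to.
BY_CATEGORY = {
    "billing": Thresholds(0.75, 0.08, 0.85),
    "how_to": Thresholds(0.58, 0.05, 0.72),
}


def assess_for(category: str, top_score: float, margin: float) -> str:
    """Same three-way branch as assess, with per-category cutoffs."""
    t = BY_CATEGORY.get(category, DEFAULT)
    if top_score < t.escalate_below:
        return "escalate"
    if margin < t.clarify_margin and top_score < t.clarify_below:
        return "clarify"
    return "answer"
```

The same retrieval result (top score 0.70, clear margin) is answered for a how-to question but escalated for billing, which is the intended asymmetry.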
I also keep an emotional trigger separate. Inquiries that clearly read as strong dissatisfaction get routed to a human first, even when the content is answerable — because sometimes people need to be heard before they need the correct answer.
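How the dissatisfaction signal is detected is not specified here, so the following is only a placeholder sketch. It uses a hypothetical keyword list; a classifier or a model-based check could stand in its place. What it illustrates is the ordering: the emotional trigger is evaluated before, and independently of, the grounding decision.

```python
# Hypothetical marker list; a real deployment would tune or replace this.
DISSATISFACTION_MARKERS = (
    "unacceptable",
    "furious",
    "worst",
    "complaint",
    "never again",
)


def route(question: str, grounding_decision: str) -> str:
    """Escalate clearly upset inquiries first, whatever retrieval says."""
    lowered = question.lower()
    if any(marker in lowered for marker in DISSATISFACTION_MARKERS):
        return "escalate"
    return grounding_decision  # otherwise defer to answer / clarify / escalate
```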
Small implementation choices that quietly helped
Beyond the big design, a few small things earned their place. One is masking PII before generation. Passing the raw inquiry straight into retrieval and prompts circulates email addresses and order numbers needlessly. Replacing them with placeholders via regex up front and restoring only what's needed right before replying lowers the exposure surface a notch.
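A minimal sketch of the mask-then-restore idea follows. The email pattern is deliberately simple and the `ORD-` order-number format is hypothetical; both would need adapting to real data.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ORDER = re.compile(r"\bORD-\d{6,}\b")  # hypothetical order-number format


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace emails and order numbers with placeholders; return the mapping."""
    mapping: dict[str, str] = {}

    def replacer(kind: str):
        def repl(match: re.Match) -> str:
            value = match.group(0)
            for key, seen in mapping.items():
                if seen == value:  # reuse the placeholder for a repeated value
                    return key
            key = f"<{kind}_{len(mapping) + 1}>"
            mapping[key] = value
            return key
        return repl

    text = EMAIL.sub(replacer("EMAIL"), text)
    text = ORDER.sub(replacer("ORDER"), text)
    return text, mapping


def restore(text: str, mapping: dict[str, str], allowed: set[str]) -> str:
    """Put back only the placeholders the reply actually needs."""
    for key in allowed:
        if key in mapping:
            text = text.replace(key, mapping[key])
    return text
```

Only the masked text goes to retrieval and the prompt. The mapping stays on the application side, and `restore` is called with just the placeholders that must appear in the final reply.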
Another is reusing the fixed portion with prompt caching. The system prompt and the common knowledge base don't change per request, so caching them clearly eases cost as volume grows. Once monthly inquiries reach the tens of thousands, the difference is not negligible.
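With the Anthropic Messages API, caching the fixed portion is done by marking a content block with `cache_control`. The sketch below only builds the request arguments, so it can be inspected without a network call; `build_request` and the idea of passing the shared knowledge base as a second system block are assumptions for illustration. A minimum cacheable prompt length applies, so check the current prompt caching documentation for your model.

```python
def build_request(system_prompt: str, knowledge_base: str, question: str,
                  model: str = "claude-sonnet-4-6") -> dict:
    """Build kwargs for client.messages.create with the fixed prefix cached."""
    return {
        "model": model,
        "max_tokens": 800,
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": knowledge_base,
                # Marks the end of the reusable prefix: everything up to and
                # including this block is eligible for caching.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # The per-request part stays outside the cached prefix.
        "messages": [{"role": "user", "content": question}],
    }
```

Usage would be `client.messages.create(**build_request(SYSTEM, kb_text, question))`. The response's usage fields report cache reads and writes, which is one way to confirm the cache is being hit.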
Last, always store the grounding source structurally in the answer log. When you later trace confident errors, keeping only the body won't lead you back to the offending document. Saving {source, score, verdict} in a structured form lets you trace all the way back to the retrieval pattern behind a mistake. I skimped on this step early and burned extra time on root-cause analysis.
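One possible shape for such a log record, written as one JSON line per answer. Field names beyond source, score, and verdict (such as `question_id` and `decision`) are hypothetical additions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AnswerLog:
    question_id: str
    decision: str        # "answer" / "clarify" / "escalate"
    sources: list[dict]  # [{"source": ..., "score": ...}] in ranked order
    grounded: bool
    escalated: bool
    verdict: Optional[str] = None  # filled in later: "correct" / "incorrect"


def to_json_line(entry: AnswerLog) -> str:
    """Serialize one record for an append-only JSONL log."""
    return json.dumps(asdict(entry), ensure_ascii=False)
```

Because each record carries the ranked sources and their scores, a later "incorrect" verdict can be traced to the document and score pattern that let the answer through.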
If you change one thing, start here
If you're wrestling with confident errors right now, the first place to touch is the "don't answer" branch in retrieval. No amount of polishing the generation prompt removes confident errors while the structure that answers ungrounded questions remains. Measure retrieval confidence with a single threshold, and hand off to a human when it drops below. Add only that today, and log for a week. Adjusting the threshold against your production correct/incorrect distribution follows naturally from there.
I hope it helps in your own build. Thank you for reading.