Putting Cloudflare AI Gateway in Front of Claude Made the Numbers I Needed Disappear — Field Notes on Instrumentation
After putting Cloudflare AI Gateway in front of the Claude API, here is where I actually got stung: cost attribution, semantic-cache false hits, fallback quietly lowering quality, and budgets that don't really stop anything. Each comes with the code I used to fix it.
The week after I added the gateway, my cost breakdown stopped making sense
The reasons for putting Cloudflare AI Gateway in front of the Claude API usually collapse into four: make requests observable, throttle yourself before you hit the provider's rate limit, cut duplicate calls with caching, and route around a model outage. All are legitimate, and the gateway genuinely handles them in a single managed layer.
Yet the week after I, as an indie developer, placed it in front of the content pipeline for my four Dolice Labs sites, I ran into a paradox: the breakdown was harder to read than before. The total-cost and latency graphs came out clean. But "which feature, which batch, used how much" had fallen out of the dashboard. The gateway only brokers traffic, so unless you hand it your own context, every request is recorded as one indistinguishable blob.
This is not a setup guide. It's a record of the four places where, after installing the gateway, I realized it sees and does less than I assumed — cost attribution, cache false hits, fallback's quiet quality drop, and budget enforcement — along with the code and the calls I made.
First, bind instrumentation context to every request
Whether you can filter the gateway logs later is decided at the moment you send the request: did you attach metadata? Whatever you pass in the cf-aig-metadata header lands in the logs, so put every axis you'll want to slice by in there. For me that was three: which site, which generation type, which batch run.
```typescript
// src/lib/claude-gateway.ts
import Anthropic from "@anthropic-ai/sdk";

const GATEWAY_BASE_URL = process.env.CLOUDFLARE_GATEWAY_URL;
const ANTHROPIC_API_KEY = process.env.ANTHROPIC_API_KEY;

if (!GATEWAY_BASE_URL || !ANTHROPIC_API_KEY) {
  throw new Error("Both CLOUDFLARE_GATEWAY_URL and ANTHROPIC_API_KEY are required");
}

// Just swap baseURL to the gateway; existing SDK calls go through unchanged
const client = new Anthropic({
  apiKey: ANTHROPIC_API_KEY,
  baseURL: GATEWAY_BASE_URL,
});

export interface CallContext {
  site: string;     // e.g. "claudelab"
  workload: string; // e.g. "recovery" / "brushup" / "daily"
  runId: string;    // per-batch identifier
}

export async function gatewayChat(
  params: Anthropic.MessageCreateParamsNonStreaming,
  ctx: CallContext
) {
  return client.messages.create(params, {
    headers: {
      // Whatever you set here lands in the AI Gateway logs, sliceable later
      "cf-aig-metadata": JSON.stringify({
        site: ctx.site,
        workload: ctx.workload,
        runId: ctx.runId,
        ts: new Date().toISOString(),
      }),
    },
  });
}
```
The reason a baseURL swap is enough: the Anthropic TypeScript SDK takes the HTTP destination from baseURL and reuses its auth headers and retry logic as-is. The gateway proxies Anthropic-compatible paths, so your app logic doesn't change by a line. The flip side is that any request where you forget the metadata is recorded as part of the anonymous blob. My logs right after install were exactly that.
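Since a call that forgets the metadata lands silently in the anonymous blob, one option is to build the header through a single function that fails loudly instead. The following is a minimal sketch, not the gateway's own API: `buildMetadataHeader` is a hypothetical helper, and the five-entry cap with primitive-only values reflects my understanding of the documented limits for `cf-aig-metadata`, which you should verify against the current docs.

```typescript
type MetaValue = string | number | boolean;

// Assumed limit; check the current AI Gateway docs before relying on it.
const MAX_METADATA_ENTRIES = 5;

// Hypothetical helper: returns the header object, or throws if the
// metadata would leave the request effectively unattributed.
function buildMetadataHeader(
  meta: Record<string, MetaValue>
): Record<string, string> {
  const entries = Object.entries(meta);
  if (entries.length === 0) {
    throw new Error("metadata is empty: this call would be unattributed");
  }
  if (entries.length > MAX_METADATA_ENTRIES) {
    throw new Error(`too many metadata entries: ${entries.length}`);
  }
  for (const [key, value] of entries) {
    if (typeof value === "string" && value.trim() === "") {
      throw new Error(`metadata field "${key}" is blank`);
    }
  }
  return { "cf-aig-metadata": JSON.stringify(meta) };
}
```

Spreading the result into the `headers` option of each SDK call keeps the check in one place, so a missing `site` or `workload` surfaces as an exception at send time rather than as an unexplained row in the logs.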
To aggregate attribution, hit the Logs API and group by metadata. Only here does "which workload is driving spend" become a number.
If the "unattributed" bucket is swelling, that's a sign calls are still going out without metadata. I treat driving that number toward zero as my measure of whether instrumentation is finished.
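The grouping step described above can be sketched as a pure function over log entries you have already fetched. This is a sketch under stated assumptions: the `GatewayLogEntry` shape (a `costUsd` number plus the parsed metadata) is hypothetical, and the real Logs API response will need mapping into it.

```typescript
// Hypothetical shape of a log entry after fetching and parsing;
// the real Logs API response fields may differ.
interface GatewayLogEntry {
  costUsd: number;
  metadata?: { site?: string; workload?: string; runId?: string };
}

interface WorkloadTotal {
  requests: number;
  costUsd: number;
}

function aggregateByWorkload(
  logs: GatewayLogEntry[]
): Map<string, WorkloadTotal> {
  const totals = new Map<string, WorkloadTotal>();
  for (const entry of logs) {
    // Requests sent without a workload tag fall into one visible bucket.
    const key = entry.metadata?.workload ?? "unattributed";
    const current = totals.get(key) ?? { requests: 0, costUsd: 0 };
    totals.set(key, {
      requests: current.requests + 1,
      costUsd: current.costUsd + entry.costUsd,
    });
  }
  return totals;
}

// Share of requests with no workload tag: the number to drive toward zero.
function unattributedShare(totals: Map<string, WorkloadTotal>): number {
  let all = 0;
  totals.forEach((t) => {
    all += t.requests;
  });
  const missing = totals.get("unattributed")?.requests ?? 0;
  return all === 0 ? 0 : missing / all;
}
```

Tracking `unattributedShare` over time gives a single figure for how much of the traffic still lacks context.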
Semantic cache trips on "similar but not the same"
The semantic cache reuses past responses based on the semantic distance between requests. In stable Q&A areas like FAQs it works dramatically well. The catch is questions that are close in meaning but should get different answers, say "what's the return policy?" versus "what's the return policy for international orders?", which embedding distance alone can conflate. When the cached answer for the former is served to the latter, a response that isn't outright false but is wrong for that question ships quietly.
My fix is two layers. First, partition what may share a cache with cache-key namespaces. Second, explicitly opt user-specific or freshness-sensitive calls out of the cache.
```typescript
// Shareable FAQ-type call: a namespace prevents conflation
const faq = await gatewayChat(
  {
    model: "claude-haiku-4-5",
    max_tokens: 512,
    system: "Answer concisely as a customer support agent.",
    messages: [{ role: "user", content: "What is the return policy for domestic orders?" }],
  },
  ctx
);
// Attach these headers on the call (extend gatewayChat to pass them):
//   "cf-aig-cache-ttl": "3600"
//   "cf-aig-cache-key": "faq:returns:domestic"  ← pin the topic by namespace

// High-individuality / freshness-required calls skip the cache
const summary = await gatewayChat(
  {
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: `Summarize the status of order ${orderId}` }],
  },
  ctx
);
//   "cf-aig-skip-cache": "true"
```
What helps here is not leaving the cache key to embedding distance, but pinning it to a unit of meaning — a topic — that you define. Keep faq:returns:domestic and faq:returns:international in separate namespaces and even near-identical wording is treated as distinct, so no conflation. Rather than handing everything to the cleverness of embeddings, draw one boundary you already know as a human. That was the single most effective operational call.
A high hit rate is not automatically good. As an early warning for false hits, I track "low-rating rate on cached responses" separately and read it next to the hit rate. Chase hit rate alone and you can drift into the worst optimization there is: costs fall on false hits while quality silently erodes.
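Reading the two numbers side by side can be sketched as follows. This is a rough illustration, assuming you can join each response to a cache-hit flag and to your own rating signal; the `RatedResponse` shape and both field names are hypothetical.

```typescript
// Hypothetical record joined from gateway logs and your own rating data.
interface RatedResponse {
  cacheHit: boolean;
  lowRated: boolean; // e.g. a thumbs-down or a failed review
}

interface CacheQuality {
  hitRate: number;
  lowRatingRateCached: number;
  lowRatingRateFresh: number;
}

function cacheQuality(records: RatedResponse[]): CacheQuality {
  const cached = records.filter((r) => r.cacheHit);
  const fresh = records.filter((r) => !r.cacheHit);
  // Fraction of low-rated responses in a group; 0 for an empty group.
  const lowRate = (rs: RatedResponse[]): number =>
    rs.length === 0 ? 0 : rs.filter((r) => r.lowRated).length / rs.length;
  return {
    hitRate: records.length === 0 ? 0 : cached.length / records.length,
    lowRatingRateCached: lowRate(cached),
    lowRatingRateFresh: lowRate(fresh),
  };
}
```

A cached low-rating rate that sits well above the fresh one is the signal worth alerting on, whatever the hit rate says.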
| Nature of the call | Cache policy | Key design |
| --- | --- | --- |
| Canned FAQ / help replies | On (longer TTL) | Pin by topic namespace |
| Document summaries (shared docs) | On (short TTL) | Include doc hash in the key |
| User-specific lookups | Skip | n/a |
| Generative / creative | Skip | n/a |
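The policy table above can be encoded so that each call site names its policy instead of hand-writing headers. This is a minimal sketch: `cacheHeaders` and the `CachePolicy` type are hypothetical names, the TTL values are illustrative assumptions, and the header names are the ones used earlier in this article.

```typescript
type CachePolicy =
  | { kind: "faq"; topic: string }           // canned FAQ / help replies
  | { kind: "doc-summary"; docHash: string } // shared document summaries
  | { kind: "skip" };                        // user-specific or creative

// TTL values are illustrative assumptions, not recommendations.
const FAQ_TTL_SECONDS = 86400;
const DOC_TTL_SECONDS = 600;

function cacheHeaders(policy: CachePolicy): Record<string, string> {
  switch (policy.kind) {
    case "faq":
      return {
        "cf-aig-cache-ttl": String(FAQ_TTL_SECONDS),
        // The topic namespace, not embedding distance, decides what is shared.
        "cf-aig-cache-key": `faq:${policy.topic}`,
      };
    case "doc-summary":
      return {
        "cf-aig-cache-ttl": String(DOC_TTL_SECONDS),
        // A changed document yields a new hash, hence a new cache entry.
        "cf-aig-cache-key": `doc:${policy.docHash}`,
      };
    case "skip":
      return { "cf-aig-skip-cache": "true" };
  }
}
```

For example, `cacheHeaders({ kind: "faq", topic: "returns:domestic" })` yields the key `faq:returns:domestic`, which keeps it apart from `faq:returns:international` no matter how similar the wording is.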
Fallback "stays up" but "quietly lowers quality"
Fallback automatically reroutes to another model when the primary returns an error. Availability does rise. What's easy to miss is the obvious fact that the destination isn't necessarily of equal quality. A job that expected Opus 4.8 dropping to Haiku 4.5 during an outage and returning as a "success" leaves the system looking healthy while the output quietly thins out.
So I made sure to always record whether a fallback happened, and to keep which model actually responded on the app side too. The response carries the model that served it, so I push that into the same log stream as the attribution metadata.
```typescript
// Detect on the caller side that a fallback occurred and record it
// (assumes this sits next to `client` and `CallContext` in claude-gateway.ts)
export async function chatWithFallbackTrace(
  params: Anthropic.MessageCreateParamsNonStreaming,
  ctx: CallContext,
  expectedModel: string
) {
  const res = await client.messages.create(params, {
    headers: {
      "cf-aig-metadata": JSON.stringify({ ...ctx, expectedModel }),
    },
  });

  // The model that actually answered comes back in res.model
  const served = (res as { model?: string }).model ?? "unknown";

  // The API can report a dated snapshot ID when you request an alias, so
  // match by prefix; strict equality would flag every aliased call
  const degraded = !served.startsWith(expectedModel);

  if (degraded) {
    // Leave the silent quality drop as an "event" (tally frequency later)
    console.warn(
      JSON.stringify({
        kind: "fallback_served",
        runId: ctx.runId,
        expectedModel,
        served,
        ts: new Date().toISOString(),
      })
    );
  }
  return { res, served, degraded };
}
```
As an operational call, don't make the fallback target "just the cheapest model." For higher-importance workloads I cap the drop at one quality tier below the primary. Throwing quality to the floor for the sake of availability doesn't pencil out for articles I deliver to paying readers. Staying up matters less than knowing you fell.
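The one-tier cap can be written down as a small rule. This is a sketch under assumptions: the tier labels and their ordering are placeholders to be mapped onto the model IDs you actually deploy, and `allowedFallbacks` is a hypothetical helper, not a gateway feature.

```typescript
// Ordered from highest to lowest quality. The labels are placeholders;
// substitute the model IDs you actually deploy.
const TIERS = ["opus", "sonnet", "haiku"] as const;
type Tier = (typeof TIERS)[number];
type Importance = "high" | "normal";

function allowedFallbacks(primary: Tier, importance: Importance): Tier[] {
  const start = TIERS.indexOf(primary);
  const lower = TIERS.slice(start + 1);
  // High-importance work may drop at most one tier; the rest may drop freely.
  return importance === "high" ? lower.slice(0, 1) : [...lower];
}
```

An empty result means there is no acceptable fallback, in which case failing the job is the honest outcome rather than returning a degraded "success".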
A budget isn't stopped by "logs." Put a real gate elsewhere
The gateway's rate limits and budget displays are excellent as observation, but "actually stop requests once you hit the cap" was more reliably guaranteed by a gate on my side. Even when the display tells me the budget is blown, the batch keeps running in the meantime. So I wedge a light budget check right before the call.
```typescript
// src/lib/budget-gate.ts
// Check today's cost before the call and stop if the cap would be exceeded
interface BudgetState {
  date: string;
  spentUsd: number;
}

const DAILY_CAP_USD = 5; // per-workload cap

export async function withinBudget(
  read: () => Promise<BudgetState>,
  write: (s: BudgetState) => Promise<void>,
  estimateUsd: number
): Promise<boolean> {
  const today = new Date().toISOString().slice(0, 10); // UTC date, YYYY-MM-DD
  const state = await read();
  // A stored state from a previous day resets the counter
  const base = state.date === today ? state : { date: today, spentUsd: 0 };

  if (base.spentUsd + estimateUsd > DAILY_CAP_USD) {
    return false; // over cap → don't call
  }
  // Read-then-write is not atomic: concurrent calls can overshoot slightly
  await write({ date: today, spentUsd: base.spentUsd + estimateUsd });
  return true;
}

// Usage (assuming Cloudflare Workers KV for read/write)
// const ok = await withinBudget(readKV, writeKV, estimateForRequest(params));
// if (!ok) { return new Response("budget exceeded", { status: 429 }); }
```
Here estimateUsd is figured up front from a rough input-token count and the model price. You don't need perfect precision. You need one thing: to reliably stop a running batch at the cap. Once I split the roles — display after the fact, gate before the fact — unexpected billing spikes stopped happening.
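The up-front estimate described above could be computed like this. It is deliberately crude: the four-characters-per-token ratio is an assumed heuristic for English text, the `ModelPrice` shape is hypothetical, and the prices themselves must come from the provider's current pricing page.

```typescript
// Hypothetical price entry in USD per million tokens; fill in real
// prices from the provider's pricing page.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Rough heuristic, assumed here: about 4 characters per token for English.
const CHARS_PER_TOKEN = 4;

function estimateUsd(
  promptText: string,
  maxOutputTokens: number,
  price: ModelPrice
): number {
  const inputTokens = Math.ceil(promptText.length / CHARS_PER_TOKEN);
  const inputCost = (inputTokens / 1e6) * price.inputPerMTok;
  // Charge the full max_tokens so the estimate errs on the high side.
  const outputCost = (maxOutputTokens / 1e6) * price.outputPerMTok;
  return inputCost + outputCost;
}
```

Erring high is intentional: a gate that refuses slightly early is cheaper than one that lets a batch run past the cap.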
Pull it onto Cloudflare Workers and it all sits inside one boundary
My pipeline runs on Cloudflare Workers, so the gateway, the budget gate, and the attribution logs all ended up inside the same edge boundary. The client only needs to know the Worker, the Claude API key stays closed inside the Worker's env, and gateway config and cost aggregation complete in one place.
```typescript
// worker.ts
// One pass from budget check to generation to attribution at the entry.
// Assumes withinBudget and chatWithFallbackTrace are imported from your own
// modules, and that readBudget/writeBudget are KV helpers (not shown here).
interface Env {
  KV: KVNamespace;
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const ctx = { site: "claudelab", workload: "daily", runId: crypto.randomUUID() };

    const ok = await withinBudget(
      () => readBudget(env.KV, ctx.workload),
      (s) => writeBudget(env.KV, ctx.workload, s),
      0.02 // rough cost of this call
    );
    if (!ok) return new Response("budget exceeded", { status: 429 });

    const { res, degraded } = await chatWithFallbackTrace(
      {
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        messages: [{ role: "user", content: await req.text() }],
      },
      ctx,
      "claude-sonnet-4-6"
    );

    return Response.json({ degraded, content: res.content });
  },
};
```
In this shape, the quality drop during an outage, a budget refusal, and the cost breakdown can all be strung on the same runId. Being able to trace "what happened in that batch on that day" along a single line is what mattered most for running several sites alone. When you operate solo, the total observation effort you can afford is finite. That's exactly why binding context once at the entry and tying everything to that context is, in the end, what lasts longest.
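Tracing one batch along a single line might look like the following. This is a sketch with assumed inputs: the `RunEvent` shapes are hypothetical (only the `fallback_served` kind mirrors an event logged earlier in this article), and collecting the events from your log stream is left out.

```typescript
// Hypothetical event shapes; the exact fields are assumptions.
type RunEvent =
  | { kind: "fallback_served"; runId: string; ts: string; served: string }
  | { kind: "budget_refused"; runId: string; ts: string }
  | { kind: "cost"; runId: string; ts: string; usd: number };

interface RunSummary {
  events: RunEvent[];
  totalUsd: number;
  degraded: boolean;
  refused: boolean;
}

function summarizeRun(all: RunEvent[], runId: string): RunSummary {
  const events = all
    .filter((e) => e.runId === runId)
    // ISO 8601 UTC timestamps of equal length sort correctly as strings.
    .sort((a, b) => (a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0));
  let totalUsd = 0;
  for (const e of events) {
    if (e.kind === "cost") totalUsd += e.usd;
  }
  return {
    events,
    totalUsd,
    degraded: events.some((e) => e.kind === "fallback_served"),
    refused: events.some((e) => e.kind === "budget_refused"),
  };
}
```

Because every event carries the same `runId`, one filter answers all three questions for a batch: what it cost, whether it degraded, and whether the gate refused it.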
Next step
If you're about to add the gateway, make your first task "metadata design." Cost attribution, cache keys, fallback detection, and the budget gate all ride on the one context you bind at the entry. Turning on the visualization and savings features can wait until after that. Try to bolt context on later and you'll spend a week like mine — unable to read the breakdown the week after you installed it.
Thanks for reading. I hope it gives some footing to others running AI at the edge.