●MODEL — Claude Sonnet 5 is now the default model across all plans, the most agentic Sonnet yet
●PRICE — Sonnet 5 launches at $2/$10 per million tokens, available through August 31
●CODE — Claude Code adopts Sonnet 5 by default with a native 1M-token context window
●GATEWAY — A self-hosted Claude apps gateway arrives for Amazon Bedrock and Google Cloud (SSO, policy, cost)
●CHROME — Claude in Chrome is now generally available with background notifications and draft PR handoff
●ENTERPRISE — Enterprise gains richer admin analytics, model-level entitlements, and spend alerts
Reading the Claude apps gateway Announcement, I Rebuilt My Indie-Scale Control Plane
The self-hosted Claude apps gateway is a control-plane/data-plane separation you can scale down. Per-app cost attribution, model allowlists, and fail-closed spend caps, implemented as a small Cloudflare Workers proxy.
At the start of the month, I opened my Claude API bill and stopped. I could see the total. I could not see the breakdown.
In my setup, three things shared a single API key: an automation that drafts multilingual replies to App Store reviews, the content pipeline behind my blogs, and an experimental script I had written on a whim. Which one spent what? The billing page could not tell me.
Around the same time, Anthropic announced a self-hosted Claude apps gateway for Amazon Bedrock and Google Cloud. SSO, centralized policy enforcement, role-based access, per-user cost attribution, spend limits. The vocabulary is enterprise, but the underlying problem was exactly mine on that morning.
Rather than filing the announcement under "enterprise features that don't concern me," I tried reading it as a design document. Then I scaled the idea down to indie size and built a small proxy. This is the record of that implementation and migration.
What the gateway actually centralizes
Unpacking the official description, the gateway provides four things:
Centralized authorization — deciding who (which app) may call a model, at the entrance of the call path
Centralized policy — enforcing which models and features are allowed on the path itself, not in each app's code
Cost attribution — recording usage per user (or per app), so the bill becomes decomposable
Spend limits — capping attributed cost, and refusing requests before they go through once the cap is hit
The interesting part is that every one of these depends on a single move: funneling all model calls through one path. As long as each app calls api.anthropic.com directly, none of this is possible. Once the path is unified, authorization, measurement, and policy can all ride on that one point.
Borrowing networking vocabulary, this is a separation of control plane and data plane. The app focuses on the data plane — inference in, output out — while keys, policy, and metering move to the path. The gateway, as a product, is that separation in purchasable form. That is how I read it.
What to keep and what to drop at indie scale
There is no need to import an enterprise control plane wholesale. At my scale — one person, three products — the triage looked like this.
Safe to drop: SSO and role-based access. When the only operator is me, putting human authentication on the path adds little.
Worth keeping: three things.
Per-app tokens — apps never see the real key; each calls with its own internal token. This becomes the unit of attribution, and if a token leaks, you revoke that app alone
Model allowlists — pin which models each app may use, on the path. This catches typos in model strings and expensive models left in experimental code
Spend caps — a monthly ceiling per app, enforced by refusing requests. As discussed below, this should be fail-closed
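The three controls above fit in one small record per app. Here is a minimal sketch of how such policies could be stored as JSON and validated before use. The token strings and cap amounts are hypothetical placeholders, and `parsePolicies` is a helper name introduced for illustration.

```typescript
// One record per app: its internal token, its model allowlist, its monthly cap.
interface AppPolicy {
  token: string;
  allowedModels: string[];
  monthlyCapUsd: number;
}

// Hypothetical example value for a JSON policy variable.
// Tokens and cap amounts are placeholders, not real settings.
const EXAMPLE_APP_POLICIES = JSON.stringify([
  { token: "tok-review-replies", allowedModels: ["claude-haiku-4-5"], monthlyCapUsd: 15 },
  { token: "tok-content-pipeline", allowedModels: ["claude-sonnet-5"], monthlyCapUsd: 40 },
]);

// Parse and validate. Anything malformed throws, so a bad config
// refuses traffic instead of silently allowing it.
function parsePolicies(raw: string): AppPolicy[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) throw new Error("policies must be a JSON array");
  const seen = new Set<string>();
  return parsed.map((p, i) => {
    const ok =
      typeof p === "object" && p !== null &&
      typeof p.token === "string" && p.token.length > 0 &&
      Array.isArray(p.allowedModels) &&
      p.allowedModels.every((m: unknown) => typeof m === "string") &&
      typeof p.monthlyCapUsd === "number" && p.monthlyCapUsd > 0;
    if (!ok) throw new Error(`invalid policy at index ${i}`);
    if (seen.has(p.token)) throw new Error(`duplicate token at index ${i}`);
    seen.add(p.token);
    return { token: p.token, allowedModels: p.allowedModels, monthlyCapUsd: p.monthlyCapUsd };
  });
}

const policies = parsePolicies(EXAMPLE_APP_POLICIES);
console.log(policies.map((p) => p.allowedModels.join(",")));
```

Rejecting duplicate tokens matters because the token doubles as the unit of attribution: two apps sharing one would merge their spend again.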
Here is the migration order I followed:
1. Inventory every product using the API key (they hide in surprising places; I had forgotten one in a cron job)
2. Deploy the proxy; the real key lives only in the proxy's secrets
3. Issue an internal token per app and repoint each app's base URL at the proxy
4. Only after confirming every app has switched, rotate the real key (any stragglers still calling the API directly get flushed out here)
Putting step 4 last matters. Rotate first and any app you missed dies suddenly.
I built this on Cloudflare Workers with Durable Objects. Workers because I already run other things there, so no new operational surface; Durable Objects because a monthly counter needs serialized writes.
// gateway.ts: a minimal control-plane proxy

interface AppPolicy {
  token: string;            // per-app internal token
  allowedModels: string[];  // model allowlist
  monthlyCapUsd: number;    // monthly spend cap
}

// Bindings the Worker expects (types from @cloudflare/workers-types)
interface Env {
  APP_POLICIES: string;            // JSON array of AppPolicy
  ANTHROPIC_API_KEY: string;       // Workers secret
  LEDGER: DurableObjectNamespace;  // binding to the SpendLedger class
}

// Policies come from the APP_POLICIES (JSON) env var, not hardcoded

// Price table (USD per 1M tokens). Update here when pricing changes
const PRICING: Record<string, { in: number; out: number }> = {
  "claude-sonnet-5": { in: 2, out: 10 }, // intro pricing through 2026-08-31
  "claude-opus-4-8": { in: 5, out: 25 },
  "claude-haiku-4-5": { in: 1, out: 5 },
};

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const policies: AppPolicy[] = JSON.parse(env.APP_POLICIES);
    const token = req.headers.get("x-app-token") ?? "";
    const app = policies.find((p) => p.token === token);
    if (!app) return new Response("unknown app", { status: 401 });

    const body = await req.json<{ model: string }>();
    if (!app.allowedModels.includes(body.model)) {
      return new Response("model not allowed for this app", { status: 403 });
    }

    // Spend cap check (fail-closed: if the ledger is unreadable, refuse)
    const ledger = env.LEDGER.get(env.LEDGER.idFromName(ledgerKey(app.token)));
    let spent: number;
    try {
      spent = await ledger.fetch("https://ledger/get").then((r) => r.json<number>());
    } catch {
      return new Response("ledger unavailable", { status: 503 });
    }
    if (spent >= app.monthlyCapUsd * 0.9) {
      return new Response("monthly cap reached", { status: 429 });
    }

    // The real key is attached here and only here
    const upstream = await fetch("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "x-api-key": env.ANTHROPIC_API_KEY, // Workers secret
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify(body),
    });

    // Post-hoc accounting: compute cost from usage and add it to the ledger
    const json = await upstream.json<any>();
    if (json.usage && PRICING[body.model]) {
      const p = PRICING[body.model];
      const cost =
        (json.usage.input_tokens / 1e6) * p.in +
        (json.usage.output_tokens / 1e6) * p.out;
      await ledger.fetch("https://ledger/add", {
        method: "POST",
        body: JSON.stringify({ cost, requestId: json.id }),
      });
    }

    return new Response(JSON.stringify(json), {
      status: upstream.status,
      headers: { "content-type": "application/json" },
    });
  },
};

// One ledger per app per calendar month (UTC)
function ledgerKey(token: string): string {
  const m = new Date().toISOString().slice(0, 7); // YYYY-MM
  return `${token}:${m}`;
}
The ledger side is a small Durable Object whose only jobs are serializing additions and staying idempotent.
// ledger.ts: monthly ledger (Durable Object)

export class SpendLedger {
  constructor(private state: DurableObjectState) {}

  async fetch(req: Request): Promise<Response> {
    const url = new URL(req.url);

    if (url.pathname === "/get") {
      const v = (await this.state.storage.get<number>("spent")) ?? 0;
      return Response.json(v);
    }

    if (url.pathname === "/add") {
      const { cost, requestId } = await req.json<any>();
      // Dedupe by request id so retries never double-count a response
      const seen = await this.state.storage.get<boolean>(`seen:${requestId}`);
      if (!seen) {
        const v = (await this.state.storage.get<number>("spent")) ?? 0;
        await this.state.storage.put("spent", v + cost);
        await this.state.storage.put(`seen:${requestId}`, true);
      }
      return Response.json(true);
    }

    return new Response("not found", { status: 404 });
  }
}
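To check the dedupe behavior without deploying anything, the ledger's add logic can be modeled in memory. This is only a sketch: `MemoryLedger` is a hypothetical test double that uses a `Set` in place of Durable Object storage, and it does not reproduce the serialization guarantees the real object provides.

```typescript
// Hypothetical in-memory stand-in for the ledger's add/get logic.
class MemoryLedger {
  private spent = 0;
  private seen = new Set<string>();

  // Add a cost once per request id; repeated ids are ignored.
  // Returns true if the cost was counted, false if it was a duplicate.
  add(cost: number, requestId: string): boolean {
    if (this.seen.has(requestId)) return false;
    this.seen.add(requestId);
    this.spent += cost;
    return true;
  }

  get(): number {
    return this.spent;
  }
}

const ledger = new MemoryLedger();
ledger.add(0.25, "msg_a");
ledger.add(0.25, "msg_a"); // retry of the same response: not counted again
ledger.add(0.5, "msg_b");
console.log(ledger.get()); // 0.75
```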
On the app side, the change is only the base URL and swapping the x-api-key header for x-app-token. If you use the SDK, a custom baseURL and header get you there — application logic stays untouched.
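A minimal sketch of that app-side change, written as plain `fetch` arguments rather than SDK configuration. The proxy URL and token are placeholders, and `buildProxyCall` is a helper introduced for illustration. The request body is the usual Messages API payload; only the destination and the auth header differ.

```typescript
interface ProxyCall {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build fetch() arguments that target the proxy instead of api.anthropic.com.
// The app sends its own x-app-token and never holds the real x-api-key.
function buildProxyCall(proxyBaseUrl: string, appToken: string, payload: object): ProxyCall {
  return {
    url: `${proxyBaseUrl.replace(/\/+$/, "")}/v1/messages`,
    init: {
      method: "POST",
      headers: { "x-app-token": appToken, "content-type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// Placeholder URL and token, for illustration only.
const call = buildProxyCall("https://gateway.example.workers.dev/", "tok-content-pipeline", {
  model: "claude-sonnet-5",
  max_tokens: 512,
  messages: [{ role: "user", content: "Draft a reply to this review." }],
});
console.log(call.url);
// In the app: const res = await fetch(call.url, call.init);
```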
Why the cap is fail-closed, and what that costs
The design decision I weighed longest was the character of the spend cap.
Cap checks have a structural limit: token usage is unknown until the response returns, so strict pre-authorization is impossible. What you get is an approximation — a pre-check against the running total, plus post-hoc accounting of the returned usage. Under concurrent requests, some spend will slip past the cap.
That is exactly why the check fires at 90% of the cap. Budget for the slippage and lean toward stopping reliably rather than stopping precisely. The goal is not to enforce the cap to the cent; it is to make "I woke up to a three-digit bill" structurally impossible.
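The 90% threshold can be read as a slippage budget. Here is a rough sketch of the arithmetic, assuming a hypothetical worst-case cost per request and a hypothetical number of in-flight requests; neither figure comes from the measurements in this article.

```typescript
// Pre-check: refuse once the running total reaches the threshold
// fraction of the cap.
function shouldRefuse(spentUsd: number, capUsd: number, threshold = 0.9): boolean {
  return spentUsd >= capUsd * threshold;
}

// Requests that passed the pre-check just under the threshold can all still
// complete, so the worst-case final total is the threshold amount plus the
// cost of everything in flight.
function worstCaseTotal(
  capUsd: number,
  inFlight: number,
  maxCostPerRequestUsd: number,
  threshold = 0.9,
): number {
  return capUsd * threshold + inFlight * maxCostPerRequestUsd;
}

// Hypothetical numbers: $40 cap, 5 concurrent requests, at most $0.50 each.
const capUsd = 40;
console.log(shouldRefuse(35.99, capUsd));    // false: threshold is $36
console.log(shouldRefuse(36, capUsd));       // true
console.log(worstCaseTotal(capUsd, 5, 0.5)); // 38.5, still under the $40 cap
```

If the worst-case total exceeds the cap for your own concurrency and request sizes, lower the threshold rather than hoping the race never happens.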
The second decision: when the ledger is unreadable, do you let requests through or refuse them? I refuse (fail-closed). Fail-open here means that during an outage, your most expensive path is the one left wide open. Comparing a content pipeline pausing for a few hours against a runaway retry loop running all night, the former is cheaper — the same instinct I described when narrowing the blast radius of API keys.
The trade-off deserves honesty too: this design does not handle streaming. With SSE, usage is only final at the last event, and post-hoc accounting in a proxy becomes fiddly. My workloads (batch generation and non-interactive auto-replies) are non-streaming, so I accepted the limitation. If you front an interactive UI, you will need to read input tokens from the message_start event and the final output token count from the trailing message_delta event.
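For readers who do need streaming, here is a sketch of the usage extraction step alone. It assumes the event shapes of the Messages streaming format, where `message_start` carries `message.usage.input_tokens` and each `message_delta` carries a cumulative `usage.output_tokens`. `extractUsageFromSse` is a hypothetical helper, and it works on a complete transcript; a real proxy would have to tee the stream while forwarding it.

```typescript
interface StreamUsage {
  inputTokens: number;
  outputTokens: number;
}

// Scan a complete SSE transcript and pull token usage from it.
// Returns null when usage cannot be determined.
function extractUsageFromSse(sseText: string): StreamUsage | null {
  let inputTokens: number | null = null;
  let outputTokens: number | null = null;
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue;
    let evt: any;
    try {
      evt = JSON.parse(line.slice(5).trim());
    } catch {
      continue; // ignore data lines that are not JSON
    }
    if (evt?.type === "message_start") {
      inputTokens = evt.message?.usage?.input_tokens ?? inputTokens;
      outputTokens = evt.message?.usage?.output_tokens ?? outputTokens;
    } else if (evt?.type === "message_delta") {
      // Cumulative count, so the last message_delta wins.
      outputTokens = evt.usage?.output_tokens ?? outputTokens;
    }
  }
  if (inputTokens === null || outputTokens === null) return null;
  return { inputTokens, outputTokens };
}

// Hand-written sample transcript, for illustration only.
const sample = [
  'event: message_start',
  'data: {"type":"message_start","message":{"id":"msg_x","usage":{"input_tokens":25,"output_tokens":1}}}',
  '',
  'event: content_block_delta',
  'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hi"}}',
  '',
  'event: message_delta',
  'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":15}}',
  '',
].join("\n");

console.log(extractUsageFromSse(sample)); // { inputTokens: 25, outputTokens: 15 }
```

A `null` result means the stream ended without usable usage data. To stay consistent with the fail-closed stance, that case needs an explicit policy, such as charging a pessimistic estimate.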
What the first month of attribution revealed
Breaking down June through this ledger: $41.72 total, of which the content pipeline was $27.9, review-reply automation $8.6, and the experimental script $5.2.
More useful than the numbers were two things that only became visible once the bill decomposed.
First, the experimental script still had claude-opus-4-8 in its model string. A throwaway that had quietly stayed in cron, spending a bit over $5 a month. Its allowlist is now Haiku-only, so this class of neglect surfaces immediately as a 403.
Second, the pipeline — 67% of spend — currently rides Sonnet 5's intro pricing ($2/$10 through the end of August). That mix shifts when the intro price expires, so I combined the ledger with effective-dated price forecasting and pre-computed September's breakdown. With attribution in place, a price change stops being "the total will probably grow" and becomes "this app grows by this much."
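One way to sketch effective-dated pricing is a sorted list of price rows, each with the date it takes effect. Everything numeric below except the $2/$10 intro rate is a placeholder: the post-intro rate, the start date of the first row, and the monthly token volumes are hypothetical values chosen only to show the mechanism.

```typescript
interface PriceEntry {
  effectiveFrom: string; // ISO date (YYYY-MM-DD), inclusive
  inUsd: number;         // USD per 1M input tokens
  outUsd: number;        // USD per 1M output tokens
}

// Rows sorted by effectiveFrom.
const SONNET_5_PRICES: PriceEntry[] = [
  // Intro price from the article; the start date is a placeholder.
  { effectiveFrom: "2000-01-01", inUsd: 2, outUsd: 10 },
  // HYPOTHETICAL post-intro price. Replace with the published rate.
  { effectiveFrom: "2026-09-01", inUsd: 3, outUsd: 15 },
];

// Pick the latest row whose effectiveFrom is on or before the given date.
// ISO dates compare correctly as plain strings.
function priceOn(prices: PriceEntry[], isoDate: string): PriceEntry {
  const active = prices.filter((p) => p.effectiveFrom <= isoDate);
  if (active.length === 0) throw new Error(`no price effective on ${isoDate}`);
  return active[active.length - 1];
}

function forecastCostUsd(
  prices: PriceEntry[],
  isoDate: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = priceOn(prices, isoDate);
  return (inputTokens / 1e6) * p.inUsd + (outputTokens / 1e6) * p.outUsd;
}

// Hypothetical monthly volume: 4M input tokens, 2M output tokens.
const august = forecastCostUsd(SONNET_5_PRICES, "2026-08-15", 4e6, 2e6);
const september = forecastCostUsd(SONNET_5_PRICES, "2026-09-15", 4e6, 2e6);
console.log(august, september); // 28 42
```

Feeding each app's recorded token counts through the same function is what turns "the total will grow" into a per-app figure.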
Choosing between self-built, LiteLLM, and the official gateway
Three tools address the same problem. Here is the comparison I actually used.
| Option | Fits when | Watch out for |
| --- | --- | --- |
| Self-built proxy (this article) | One to a few people, a handful of products, and the controls you need are attribution, allowlists, and caps | You maintain it; features like streaming support are yours to build |
| OSS gateways such as LiteLLM | Multiple providers in play, and you want fallbacks and key management bundled | A long-running process to operate; configuration has a real learning curve |
| Official Claude apps gateway | Team operation on Bedrock or Google Cloud where SSO and RBAC are requirements | A self-hosted control plane, which is more than an individual needs |
The axis is control granularity. If attribution and caps are enough, 150 lines of proxy suffice. The moment you want cross-provider abstraction, look at a gateway like LiteLLM. When human identity and permissions enter the picture, look at the official control plane. Conversely, adopting the heavy option before you need SSO leaves you with nothing but maintenance.
Closing: start by unifying the path
"Control plane" sounds grand, but the substance is one move: unify the call path, then load attribution, policy, and caps onto it. Enterprises buy that as a product; an individual can write it in 150 lines. The scale differs; the skeleton is the same.
Start with a question: can you explain your bill's breakdown? If not, a proxy and a ledger are the first step. Caps and allowlists can be added any time once the path is single.
This small control plane has already cut the time I spend squinting at invoices. If it lightens someone else's start-of-month routine as well, I will be glad I wrote it up.