API & SDK · 2026-06-17 · Advanced

Making the Numbers Add Up in a Multi-Tenant Claude API SaaS — Field Notes on Isolation and Cost Attribution

The first thing that breaks when you make a Claude API SaaS multi-tenant is the month-end reconciliation. Here are field notes on a single metering chokepoint, atomic counters, reconciling against Anthropic's bill, and proving tenant isolation with adversarial tests — with production TypeScript.

If you run several products at once, there's always a moment when a Claude API SaaS graduates from "it works" to "I can bill customers for it." The first thing that breaks at that moment isn't a feature or latency. It's the month-end math.

Your Anthropic invoice says $312, but the per-tenant amounts you've accumulated sum to $270. You can no longer tell whose usage the missing $42 was. This "the totals don't match" state quietly undermines your entire pricing model: if you don't know your cost of goods, you can't know the price that turns a profit.

These are field notes on implementing tenant isolation and cost attribution at the level of "reconcilable," not "roughly right" — with production TypeScript and the operational calls you'll be forced to make along the way. I assume Next.js + PostgreSQL + Redis (or Cloudflare KV), but the reasoning is stack-agnostic.

Why "roughly right" eventually collapses

For an ordinary web API, cost behaves like a fixed server bill, and a request log is enough to see who hit what. The Claude API is different. Cost is driven by tokens, not request counts, and the unit cost of a single request swings from cents to dollars depending on prompt length, model, and how well caching lands.

Approximate that variability as "request count × average price" and you will systematically underbill your heavy users. The tenant sending long prompts has the most expensive requests, and averaging erases exactly that. You end up undercounting the customer you should be charging most.

So the starting point is simple: take the usage block that's in every Claude API response and attribute it, on the spot, to the tenant that caused the request. Never fill gaps with estimates. Record only the real numbers the response gives you. Whether you can hold that line determines whether month-end reconciliation is even possible.
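
To make the underbilling concrete, here is a small sketch that compares attribution from the usage block with the "request count × average price" shortcut. The token counts and the $3 / $15 per-million rates are illustrative assumptions, and `costFromUsage` is a hypothetical helper rather than anything the SDK provides.

```typescript
// Illustrative only: rates and token counts are assumptions, not real billing data.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}

// Assumed USD price per 1M tokens for a single model.
const RATE = { input: 3.0, output: 15.0 };

// Hypothetical helper: cost of one request computed from its usage block.
function costFromUsage(u: Usage): number {
  return (u.input_tokens / 1e6) * RATE.input + (u.output_tokens / 1e6) * RATE.output;
}

// Two tenants, ten requests each: one sends short prompts, one sends long ones.
const requests: { tenantId: string; usage: Usage }[] = [];
for (let i = 0; i < 10; i++) {
  requests.push({ tenantId: 'light', usage: { input_tokens: 500, output_tokens: 200 } });
  requests.push({ tenantId: 'heavy', usage: { input_tokens: 60_000, output_tokens: 2_000 } });
}

// Attribution from real usage, summed per tenant.
const actual: Record<string, number> = {};
for (const r of requests) {
  actual[r.tenantId] = (actual[r.tenantId] ?? 0) + costFromUsage(r.usage);
}

// "Request count x average price" hands both tenants the same bill.
const total = Object.values(actual).reduce((a, b) => a + b, 0);
const averagedPerTenant = (total / requests.length) * 10;

console.log({ actual, averagedPerTenant });
```

Under these assumed numbers the heavy tenant really costs about $2.10 and the light one about $0.05, while averaging bills each of them about $1.07. The error lands on exactly the tenant that matters most.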

Make metering inseparable from the API call

The single biggest reason the books stop balancing is forgotten metering. If every new feature calls anthropic.messages.create directly, each one needs metering bolted on, and eventually one won't get it. The unmetered path's cost vanishes silently and reappears as a discrepancy at month-end.

The fix is to funnel every path to Claude through one function. Build a single function that takes a tenant context, and make it the only way anything in the app can reach Claude.

// lib/claude-client.ts
import Anthropic from '@anthropic-ai/sdk';
import type { TenantContext } from '@/types/tenant';
import { checkRateLimit } from '@/lib/rate-limiter';
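import { recordUsage } from '@/lib/usage-tracker';
// Assumed location of the alerting helper used below; adjust to wherever yours lives.
import { reportCriticalError } from '@/lib/alerts';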
 
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });
 
export interface ClaudeRequestParams {
  model: string;
  messages: Anthropic.MessageParam[];
  system?: string;
  maxTokens?: number;
}
 
export async function callClaudeForTenant(
  tenant: TenantContext,
  params: ClaudeRequestParams,
): Promise<Anthropic.Message> {
  // 1. Rate limit first. Reject here and you spend zero tokens.
  if (!(await checkRateLimit(tenant.tenantId, tenant.requestsPerMinuteLimit))) {
    throw new Error(`RATE_LIMIT_EXCEEDED:${tenant.tenantId}`);
  }
 
  const message = await anthropic.messages.create({
    model: params.model,
    max_tokens: params.maxTokens ?? 4096,
    system: params.system,
    messages: params.messages,
  });
 
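  // 2. Don't await metering. A failure must not block the response, but must not be swallowed.
  //    On serverless runtimes, hand this promise to the platform's waitUntil-style hook so it
  //    isn't cut off when the response returns; a dropped write is a metering gap.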
  recordUsage(tenant.tenantId, {
    inputTokens: message.usage.input_tokens,
    outputTokens: message.usage.output_tokens,
    cacheReadTokens: message.usage.cache_read_input_tokens ?? 0,
    cacheWriteTokens: message.usage.cache_creation_input_tokens ?? 0,
    model: params.model,
    requestId: message.id,
  }).catch((err) => {
    // End this with console.error alone and nobody will notice the metering gap.
    reportCriticalError('usage_tracking_failed', { tenantId: tenant.tenantId, err });
  });
 
  return message;
}

Two things matter here. First, pull cache_read_input_tokens and cache_creation_input_tokens out of usage. Cache reads are billed at a fraction of the normal input rate, while cache writes carry a premium over it (roughly 25% for the default 5-minute cache). Drop the distinction and you'll systematically misprice tenants who lean on caching.

Second, never let a metering failure end at console.error. Running four sites on autopilot as an indie developer, I've lived through a gap that "was in the logs" but that nobody was watching. Metering failures are quieter than feature bugs, and they map directly to money. They belong on a channel a human can't miss — Sentry, Slack, whatever you actually read.

Counters must be atomic, and reconciled at month-start

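When you accumulate tokens per tenant, the naive "read, add, write back" loses updates under concurrency. Use Redis's server-side increment commands (HINCRBY, HINCRBYFLOAT), each of which is atomic on its own, and batch them in a pipeline to save round trips. A pipeline is not a transaction: if the fields must change all-or-nothing, wrap the same commands in MULTI/EXEC instead.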

// lib/usage-tracker.ts
import { redis } from '@/lib/redis';
 
// USD per 1M tokens (as of June 2026 — always refresh from the official source)
const MODEL_PRICING: Record<string, { input: number; output: number; cacheRead: number }> = {
  'claude-opus-4-8':           { input: 15.0, output: 75.0, cacheRead: 1.5 },
  'claude-sonnet-4-6':         { input: 3.0,  output: 15.0, cacheRead: 0.3 },
  'claude-haiku-4-5-20251001': { input: 0.8,  output: 4.0,  cacheRead: 0.08 },
};
 
interface UsageRecord {
  inputTokens: number; outputTokens: number;
  cacheReadTokens: number; cacheWriteTokens: number;
  model: string; requestId: string;
}
 
export async function recordUsage(tenantId: string, r: UsageRecord): Promise<void> {
  const p = MODEL_PRICING[r.model] ?? { input: 3.0, output: 15.0, cacheRead: 0.3 };
  // Approximate cache writes at 1.25x input, the 5-minute cache rate. Longer-lived cache
  // writes are billed at a higher multiplier, so refine this where it matters.
  const costUsd =
    (r.inputTokens     / 1e6) * p.input +
    (r.outputTokens    / 1e6) * p.output +
    (r.cacheReadTokens / 1e6) * p.cacheRead +
    (r.cacheWriteTokens/ 1e6) * p.input * 1.25;
 
  const ym = new Date().toISOString().slice(0, 7); // YYYY-MM
  const key = `saas:v1:usage:monthly:${tenantId}:${ym}`;
 
  const pipe = redis.pipeline();
  pipe.hincrbyfloat(key, 'costUsd', costUsd);
  pipe.hincrby(key, 'inputTokens', r.inputTokens);
  pipe.hincrby(key, 'outputTokens', r.outputTokens);
  pipe.hincrby(key, 'requestCount', 1);
  pipe.expire(key, 60 * 60 * 24 * 120); // 120 days — gone once reconciliation is done
  await pipe.exec();
}

This gets you per-tenant accumulation, but nothing yet guarantees it's correct. You verify correctness by reconciling. At the start of each month, pull Anthropic's previous-month total (from the Usage/Cost page or the organization billing API) and compare it against the sum of your per-tenant figures.

I treat anything over a 3% gap as worth investigating. Within 3% is the tolerance you'd expect from rounding and approximate cache pricing. Beyond it means a call path is bypassing the metering chokepoint, or recordUsage is failing silently. Reconciliation isn't "confirming the numbers match" — it's the only tool that finds the holes in your metering. If you want to push invoice matching further, my notes on reconciling Claude API cost go deeper.
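
The comparison itself is only a few lines. The sketch below uses a hypothetical `reconcile` helper with the 3% tolerance from the text and the $312 versus $270 figures from the opening example; the split across three tenants is made up for illustration.

```typescript
// Hypothetical reconciliation helper; the 3% default is the threshold from the text.
interface ReconciliationResult {
  invoiceUsd: number;
  attributedUsd: number;
  gapUsd: number; // positive means cost that no tenant was charged for
  gapRatio: number; // absolute gap as a fraction of the invoice
  needsInvestigation: boolean;
}

function reconcile(
  invoiceUsd: number,
  perTenantUsd: Record<string, number>,
  tolerance = 0.03,
): ReconciliationResult {
  const attributedUsd = Object.values(perTenantUsd).reduce((sum, v) => sum + v, 0);
  const gapUsd = invoiceUsd - attributedUsd;
  // An empty invoice with attributed usage is always suspicious.
  const gapRatio =
    invoiceUsd === 0 ? (attributedUsd === 0 ? 0 : Infinity) : Math.abs(gapUsd) / invoiceUsd;
  return { invoiceUsd, attributedUsd, gapUsd, gapRatio, needsInvestigation: gapRatio > tolerance };
}

// The opening example: a $312 invoice against $270 of attributed usage.
const result = reconcile(312, { 'tenant-a': 150, 'tenant-b': 90, 'tenant-c': 30 });
console.log(result.gapUsd, `${(result.gapRatio * 100).toFixed(1)}%`, result.needsInvestigation);
```

The $42 gap is about 13.5% of the invoice, well past the tolerance, so this month would be flagged for a hunt through unmetered call paths.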

Stop the noisy neighbor — per-tenant rate limits

Anthropic's rate limits are shared across your whole account (they are set at the organization level, with optional lower limits per workspace), not split per tenant. Serve 100 tenants from one key and the moment one tenant pushes near the ceiling, the other 99 catch 429s (my production retry notes for 429s cover the recovery side). Prevent it by enforcing a per-tenant budget in your app layer, before the request reaches Anthropic.

A sliding window in a Lua script is the dependable approach. Write "evict old entries → check count → add" as separate commands and concurrent arrivals will double-spend the window.

// lib/rate-limiter.ts
import { redis } from '@/lib/redis';
 
const SLIDING_WINDOW = `
  local key = KEYS[1]
  local now, win_start, limit = tonumber(ARGV[1]), tonumber(ARGV[2]), tonumber(ARGV[3])
  redis.call('ZREMRANGEBYSCORE', key, '-inf', win_start)
  if redis.call('ZCARD', key) >= limit then return 0 end
  -- ARGV[4] is a caller-supplied unique id. math.random inside a script is seeded
  -- identically on every run in some Redis versions, so two requests in the same
  -- millisecond could share a member and be counted once.
  redis.call('ZADD', key, now, ARGV[4])
  redis.call('EXPIRE', key, 120)
  return 1
`;
 
export async function checkRateLimit(tenantId: string, rpm: number): Promise<boolean> {
  const now = Date.now();
  const res = await redis.eval(
    SLIDING_WINDOW, [`saas:v1:ratelimit:${tenantId}`],
    [String(now), String(now - 60_000), String(rpm), crypto.randomUUID()],
  );
  return res === 1;
}
 
export const PLAN_LIMITS = {
  starter:    { requestsPerMinute: 10,  monthlyBudgetUsd: 5 },
  pro:        { requestsPerMinute: 60,  monthlyBudgetUsd: 50 },
  enterprise: { requestsPerMinute: 300, monthlyBudgetUsd: 500 },
} as const;

Keep the sum of all per-tenant limits below your real Anthropic ceiling. I cap the whole SaaS at 70% of the actual limit. The remaining 30% is headroom for unmetered batch jobs and unexpected spikes. If summing your per-tenant limits already exceeds Anthropic's ceiling, your app-layer rate limiting is just decoration.
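
A check like this is cheap to run whenever a tenant changes plan. The sketch below is one possible shape: `checkCapacity` is a hypothetical helper, the 1,000 RPM ceiling is an assumed figure (substitute your account's real limit), and the plan limits are copied from PLAN_LIMITS so the block runs on its own.

```typescript
// Hypothetical capacity check: do the per-tenant limits fit under a share of the real ceiling?
type Plan = 'starter' | 'pro' | 'enterprise';

// Same per-minute limits as PLAN_LIMITS, repeated so the sketch is self-contained.
const RPM_BY_PLAN: Record<Plan, number> = { starter: 10, pro: 60, enterprise: 300 };

function checkCapacity(
  tenantPlans: Plan[],
  anthropicRpmCeiling: number, // assumed: your account's actual requests-per-minute limit
  usableShare = 0.7, // the 70% cap from the text
): { allocatedRpm: number; usableRpm: number; fits: boolean } {
  const allocatedRpm = tenantPlans.reduce((sum, plan) => sum + RPM_BY_PLAN[plan], 0);
  const usableRpm = anthropicRpmCeiling * usableShare;
  return { allocatedRpm, usableRpm, fits: allocatedRpm <= usableRpm };
}

// Example with an assumed ceiling of 1,000 RPM: 20 starter, 5 pro and 1 enterprise tenant.
const plans: Plan[] = [
  ...Array<Plan>(20).fill('starter'),
  ...Array<Plan>(5).fill('pro'),
  'enterprise',
];
const capacity = checkCapacity(plans, 1000);
console.log(capacity);
```

Here the plans add up to 800 RPM against 700 usable, so `fits` is false: the next signup should trigger either a limit increase request or a change to the plan table.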

How to handle the budget-check vs. API-call race

When you want to stop a tenant at a monthly budget, you'll always hit a race. Between checking the budget and calling the API, another request can blow through it.

// Wrong: there's a gap between the check and the call
const { allowed } = await checkMonthlyBudget(tenantId, budget);
if (!allowed) throw new Error('BUDGET_EXCEEDED');
const msg = await anthropic.messages.create({ /* ... */ }); // can exceed here

Chasing "never a cent over" makes the implementation heavy. The decision rule is this: is an overage a loss you have to absorb, or a cost you can still recover even when a tenant goes slightly over?

For most solo SaaS, it's the latter. In that case, stopping softly at 90% of budget is plenty. The remaining 10% absorbs whatever leaks through the race.

// Right: a soft limit whose 10% headroom absorbs the race
const { allowed, remainingUsd } = await checkMonthlyBudget(tenantId, budget * 0.9);

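Strictness is only warranted on high-margin-risk plans where an overage immediately means a loss. There, and only there, reserve the request's worst-case cost atomically before the call with Redis DECRBY on an integer balance (micro-dollars, for example, since DECRBY does not accept fractions). If the balance goes negative, reject the request and add the reservation back; otherwise refund the unused part once the real usage is known. Imposing strict accounting on every tenant isn't worth the code complexity relative to the accuracy you gain. Which one you pick should be a decision driven by a number: your cost ratio.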
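
For the strict case, the reserve-then-settle flow might look like the sketch below. Everything named here is an assumption for illustration: `CounterStore` is a hypothetical interface shaped like Redis DECRBY/INCRBY, `MemoryStore` is an in-memory stand-in so the block runs without Redis, and amounts are integer micro-dollars.

```typescript
// Hypothetical interface shaped like Redis DECRBY/INCRBY; both return the new value.
interface CounterStore {
  decrby(key: string, amount: number): Promise<number>;
  incrby(key: string, amount: number): Promise<number>;
}

// In-memory stand-in. Each call mutates the map in one synchronous step,
// which mimics Redis executing one command at a time.
class MemoryStore implements CounterStore {
  private values = new Map<string, number>();
  set(key: string, value: number): void {
    this.values.set(key, value);
  }
  get(key: string): number {
    return this.values.get(key) ?? 0;
  }
  async decrby(key: string, amount: number): Promise<number> {
    return this.incrby(key, -amount);
  }
  async incrby(key: string, amount: number): Promise<number> {
    const next = (this.values.get(key) ?? 0) + amount;
    this.values.set(key, next);
    return next;
  }
}

async function withReservedBudget<T>(
  store: CounterStore,
  key: string,
  worstCaseMicroUsd: number, // upper bound for this call, e.g. derived from max_tokens
  call: () => Promise<{ result: T; actualMicroUsd: number }>,
): Promise<T> {
  // Reserve first. The decrement returns the new balance in the same atomic step,
  // so two concurrent requests cannot both claim the last of the budget.
  const remaining = await store.decrby(key, worstCaseMicroUsd);
  if (remaining < 0) {
    await store.incrby(key, worstCaseMicroUsd); // hand the reservation back
    throw new Error('BUDGET_EXCEEDED');
  }

  let outcome: { result: T; actualMicroUsd: number };
  try {
    outcome = await call();
  } catch (err) {
    // Assumes a failed call is not billed, so the whole reservation is returned.
    await store.incrby(key, worstCaseMicroUsd);
    throw err;
  }

  // Settle: refund whatever part of the reservation was not spent.
  await store.incrby(key, worstCaseMicroUsd - outcome.actualMicroUsd);
  return outcome.result;
}
```

The cost of this strictness is visible in the code: three round trips instead of one, a worst-case estimate that temporarily locks out budget other requests could have used, and a refund path that must not fail silently. That overhead is why it belongs only on the plans where an overage is a direct loss.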

Isolation is something you prove with a test

If you store conversation history yourself, isolation between tenants is your responsibility. "Claude API is external, so it's isolated" is a misconception — if anything leaks, it's your own database. In PostgreSQL, enforce it with Row Level Security.

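ALTER TABLE messages ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS unless it is forced, so either force it or
-- make sure the application connects as a non-owner role.
ALTER TABLE messages FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON messages
  USING (tenant_id = current_setting('app.current_tenant_id', true));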

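// lib/tenant-db.ts
// Set the tenant on a transaction-scoped variable before querying.
import type { PoolClient } from 'pg';
import { pool } from '@/lib/db'; // assumed location of your shared pg Pool

export async function withTenant<T>(tenantId: string, fn: (c: PoolClient) => Promise<T>): Promise<T> {
  const client = await pool.connect();
  try {
    // The third arg `true` makes the setting transaction-local. Without an explicit
    // BEGIN, each query is its own transaction and the setting is gone before fn runs.
    await client.query('BEGIN');
    await client.query(`SELECT set_config('app.current_tenant_id', $1, true)`, [tenantId]);
    const result = await fn(client);
    await client.query('COMMIT'); // the setting is discarded here, so no manual reset is needed
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}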

Merely implementing it gives you no assurance isolation holds. One wrong line in an RLS policy still passes every happy-path test. You prove isolation with an adversarial test, and I wire it into CI as a gate.

// __tests__/tenant-isolation.test.ts
it('tenant A can never read tenant B\'s conversation', async () => {
  const convB = await withTenant('tenant-B', (c) =>
    c.query(`INSERT INTO messages (tenant_id, content) VALUES ('tenant-B', 'secret') RETURNING id`),
  );
  const rows = await withTenant('tenant-A', (c) =>
    c.query(`SELECT content FROM messages WHERE id = $1`, [convB.rows[0].id]),
  );
  expect(rows.rowCount).toBe(0); // With RLS on, a direct ID still returns 0 rows
});
 
it('confirms a leak when RLS is dropped, validating the test itself', async () => {
  // Deliberately verify RLS dependence. If nothing leaks, the test above may be guarding nothing.
});

The second test is the important one. Unless you also confirm that "dropping RLS leaks," you can't tell whether the first test is actually verifying isolation. A bug that always returns zero rows would pass the first test too. A guard test is only trustworthy once you've deliberately broken it.

What happens when Redis or KV goes down

The moment you put rate limiting and budgets on Redis, Redis becomes a single point of failure. You must decide, at design time: when Redis is down, do you let requests through or block them?

Letting them through (fail-open) protects availability, but during the outage your rate limits and budget guards are off, and cost can run unbounded. Blocking (fail-closed) protects cost, but a brief Redis blip takes the whole service down.

I run rate limiting fail-open and budgets fail-closed. Stopping every tenant on each blip is excessive, but a monetary wall like a budget is safer kept shut precisely when you can't measure. That asymmetry is the result of saying out loud, per guard, whether it exists to protect availability or cost.

export async function checkRateLimit(tenantId: string, rpm: number): Promise<boolean> {
  try {
    return (await redis.eval(/* ... */)) === 1;
  } catch {
    reportDegraded('ratelimit_failopen', { tenantId });
    return true; // fail-open: availability first
  }
}
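
One way to keep that asymmetry visible in code is to make the policy an explicit argument instead of something buried in each catch block. The sketch below is an assumption about how that could look: `guarded` and `storeIsDown` are hypothetical names, and the failing check stands in for a Redis outage.

```typescript
// Hypothetical wrapper that makes the outage behaviour of each guard explicit.
type FailurePolicy = 'fail-open' | 'fail-closed';

async function guarded(
  name: string,
  policy: FailurePolicy,
  check: () => Promise<boolean>,
  onDegraded: (name: string, policy: FailurePolicy, err: unknown) => void = () => {},
): Promise<boolean> {
  try {
    return await check();
  } catch (err) {
    onDegraded(name, policy, err); // surface the outage instead of hiding it
    return policy === 'fail-open'; // open lets the request through, closed rejects it
  }
}

// Stand-in for a check whose backing store is unreachable.
const storeIsDown = async (): Promise<boolean> => {
  throw new Error('ECONNREFUSED');
};

async function demo(): Promise<{ rateLimitAllowed: boolean; budgetAllowed: boolean }> {
  const rateLimitAllowed = await guarded('ratelimit', 'fail-open', storeIsDown);
  const budgetAllowed = await guarded('budget', 'fail-closed', storeIsDown);
  return { rateLimitAllowed, budgetAllowed };
}

demo().then((r) => console.log(r)); // { rateLimitAllowed: true, budgetAllowed: false }
```

With the policy spelled out at every call site, a reviewer can see at a glance which guards protect availability and which protect cost, and a new guard cannot be added without making that choice.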

Your next move

Start by grepping your SaaS for Claude calls and counting whether anthropic.messages.create is ever invoked outside the metering chokepoint. Even one call on the outside is the source of your month-end gap. Funnel everything through one entry point, accumulate atomically, and reconcile against Anthropic's invoice at the start of each month. Only when those three are in place do you have the ground to design pricing with real numbers.
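
If you want that grep to keep running after today, it can become a CI check. The sketch below is a minimal version under stated assumptions: `findBypasses` is a hypothetical helper, the file paths are invented examples, and it takes file contents as a map so it runs without touching the filesystem.

```typescript
// Hypothetical CI check: report files that call messages.create outside the chokepoint.
const DIRECT_CALL = /\.messages\s*\.\s*create\s*\(/;

function findBypasses(
  files: Record<string, string>, // path -> source text
  allowedPath = 'lib/claude-client.ts', // the chokepoint file from earlier in the article
): string[] {
  return Object.entries(files)
    .filter(([path, source]) => path !== allowedPath && DIRECT_CALL.test(source))
    .map(([path]) => path)
    .sort();
}

const offenders = findBypasses({
  'lib/claude-client.ts': 'const message = await anthropic.messages.create({ /* ... */ });',
  'app/api/summarize/route.ts': 'const res = await client.messages.create({ model });',
  'app/api/chat/route.ts': 'const msg = await callClaudeForTenant(tenant, params);',
});
console.log(offenders); // [ 'app/api/summarize/route.ts' ]
```

A pattern match like this only catches the obvious spelling. Streaming and batch calls go through other SDK methods and would need their own patterns, so treat a clean result as a floor, with the monthly reconciliation as the real backstop.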

Multi-tenancy is often treated as something you can add later, but data isolation is the one piece that gets sharply more expensive to retrofit, because every existing row has to be migrated. If there's one thing worth building in from the start, I've come to feel it's this.
