Production Voice Agents with Claude API: Lessons from Running 6 Indie Apps
Whisper/Deepgram, Claude API, and TTS engines orchestrated for a production voice agent — written by an indie developer running this stack on Cloudflare Workers and Cloud Run with real latency budgets, cost breakdowns, and fallback strategies.
We're witnessing a fundamental shift in how users interact with AI. Text-based interfaces are giving way to voice-native applications where conversations feel natural and intuitive. Companies building voice agents can deliver more engaging user experiences while capturing entirely new use cases.
Claude API excels at understanding nuanced natural language, but it's designed purely for text. Voice agents require orchestration: speech-to-text (STT) for input, Claude for reasoning, and text-to-speech (TTS) for output. The magic isn't in any single component—it's in how they work together seamlessly.
This guide walks you through building a production-grade voice agent system that handles real-world challenges: streaming audio, maintaining conversation context, recovering from failures, scaling to thousands of users, and optimizing costs. We'll use TypeScript/Node.js throughout, with immediately applicable code patterns.
Voice Agent Architecture Overview
A production voice agent system spans multiple integrated layers: speech-to-text for input, Claude for reasoning, and text-to-speech for output.
Let's build each piece methodically, starting with speech recognition.
What You'll Learn
✦ How I split a 1500ms voice-to-voice latency budget into STT 600ms / Claude Haiku 400ms / TTS 400ms, and compressed real-world p50 to 910ms with Deepgram Streaming
✦ Reduced per-session cost from $0.024 to $0.011 across 4 specific decisions, with the Sonnet routing rule that finally worked after Sonnet-judges-itself failed
✦ Why Cloudflare Workers cannot host the inference layer, and the Durable Objects + Cloud Run split I run in production with signed JWT session tokens
Voice agents need brief, action-oriented system prompts. Users can't easily re-read long responses:
```typescript
const VOICE_AGENT_SYSTEM_PROMPT = `You are a helpful, conversational voice assistant.

Guidelines:
- Respond naturally, as if speaking to someone. Keep sentences short (under 20 words when possible).
- Use conversational language. Avoid jargon unless the user introduced it first.
- If unsure, admit it. Don't speculate.
- Break complex information into bullet points (max 3 items per response).
- No emojis. No markdown formatting. Speak like a real person.
- Be warm and encouraging while remaining professional.`;
```
Text-to-Speech Implementation and Optimization
Multi-Provider TTS Adapter Pattern
Production systems need failover. Implement a provider-agnostic interface:
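A minimal sketch of what such an interface could look like, assuming each provider is wrapped behind a single `synthesize` method. The names `TtsProvider`, `TtsResult`, and `synthesizeWithFailover` are hypothetical and do not come from any vendor SDK; real adapters would call the provider APIs inside `synthesize`.

```typescript
// Hypothetical provider-agnostic TTS interface (illustrative names only).
interface TtsProvider {
  readonly name: string;
  synthesize(text: string): Promise<Uint8Array>;
}

interface TtsResult {
  provider: string;   // which provider produced the audio
  audio: Uint8Array;  // encoded audio bytes
  errors: string[];   // failures from providers tried before the winner
}

// Try each provider in order and return the first successful result.
async function synthesizeWithFailover(
  providers: TtsProvider[],
  text: string,
): Promise<TtsResult> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const audio = await provider.synthesize(text);
      return { provider: provider.name, audio, errors };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${message}`);
    }
  }
  throw new Error(`All TTS providers failed: ${errors.join('; ')}`);
}
```

Returning the collected errors alongside the audio lets the caller log which providers failed even when the request ultimately succeeded.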
```typescript
// src/monitoring/metrics.ts
import prom from 'prom-client';

export const voiceAgentMetrics = {
  totalSessions: new prom.Counter({
    name: 'voice_agent_total_sessions',
    help: 'Total number of sessions',
    labelNames: ['status'] // success, failed, timeout
  }),
  apiCallsTotal: new prom.Counter({
    name: 'voice_agent_api_calls_total',
    help: 'Total API calls by service',
    labelNames: ['service'] // claude, whisper, tts
  }),
  sessionDurationSeconds: new prom.Histogram({
    name: 'voice_agent_session_duration_seconds',
    help: 'Session duration in seconds',
    buckets: [10, 30, 60, 300, 600]
  }),
  apiLatencyMs: new prom.Histogram({
    name: 'voice_agent_api_latency_ms',
    help: 'API latency in milliseconds',
    labelNames: ['service'],
    buckets: [50, 100, 200, 500, 1000, 2000]
  }),
  activeSessions: new prom.Gauge({
    name: 'voice_agent_active_sessions',
    help: 'Number of currently active sessions'
  })
};

// Record the outcome and duration of a finished session.
export function recordSessionMetric(durationSeconds: number, success: boolean): void {
  voiceAgentMetrics.totalSessions.inc({ status: success ? 'success' : 'failed' });
  voiceAgentMetrics.sessionDurationSeconds.observe(durationSeconds);
}
```
Six lessons that aren't in the official docs
The sections above are the design story. Below are six things I only learned by running this stack in production for Dolice Labs across 4 sites, and across 6 indie iOS/Android apps I have shipped since 2014 (about 50 million cumulative downloads, monetized largely through AdMob). I am Masaki Hirokawa, the indie developer and artist behind Dolice Labs.
1. Measure the latency budget in three layers (1500ms total)
End-to-end voice-to-voice latency above 1500ms breaks the conversational rhythm. Users start asking "did it cut out?" and re-speak before the agent can answer. That number is the threshold I keep coming back to across every voice product I have shipped.
My production budget split, measured on Cloudflare from Tokyo to us-east-1:
| Layer | Budget | p50 | p95 |
| --- | --- | --- | --- |
| STT (Deepgram Streaming) | 600ms | 280ms | 480ms |
| Claude Haiku response | 400ms | 320ms | 620ms |
| TTS (ElevenLabs Flash v2) | 400ms | 240ms | 410ms |
| Network round-trip | 100ms | 70ms | 130ms |
| Total | 1500ms | 910ms | 1640ms |
If you call Whisper REST naively, inference only fires after the full audio clip lands, which adds 600 to 900ms after the last syllable. Deepgram Streaming uses VAD to predict the endpoint, which cuts perceived latency roughly in half. I started with Whisper REST for simplicity, watched p95 exceed 1800ms, and migrated to Deepgram Streaming three weeks later.
```typescript
// Production budget checker: warn on any over-budget session
// (console.warn here; route the warning to Sentry in production).
interface LatencyBudget {
  stt: { budget: 600; actual?: number };
  llm: { budget: 400; actual?: number };
  tts: { budget: 400; actual?: number };
  network: { budget: 100; actual?: number };
}

export function assertBudget(b: LatencyBudget, sessionId: string): boolean {
  // Layers that have not reported a measurement count as 0ms.
  const total =
    (b.stt.actual ?? 0) + (b.llm.actual ?? 0) + (b.tts.actual ?? 0) + (b.network.actual ?? 0);
  if (total > 1500) {
    console.warn(`[budget-exceeded] session=${sessionId} total=${total}ms`, b);
  }
  return total <= 1500;
}
```
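The p50 and p95 columns in the budget table have to come from recorded per-layer samples. The article does not say how they were computed, so the following is only a sketch that assumes the nearest-rank method over an in-memory array of millisecond samples. The helper names `percentile` and `summarize` are hypothetical.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
// Assumption: the source does not state which percentile method it uses.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Summarize one layer (STT, LLM, TTS, or network) for a dashboard row.
function summarize(samples: number[]): { p50: number; p95: number } {
  return { p50: percentile(samples, 50), p95: percentile(samples, 95) };
}
```

Note that per-layer p95 values do not add up to the p95 of the end-to-end latency, so the total is best measured directly on whole sessions as well.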
2. How I cut per-session cost from $0.024 to $0.011
Running AdMob revenue at scale gave me a habit of pricing every feature in dollars-per-session before I write the first line of code. The first build (Whisper + Sonnet + ElevenLabs Multilingual) ran roughly $0.024 per 3-minute session. Over 9 weeks I brought it to $0.011 with four decisions.
| Decision | Before | After | Reduction |
| --- | --- | --- | --- |
| STT: Whisper → Deepgram Nova-2 | $0.006/min | $0.0043/min | -28% |
| LLM first-pass: Sonnet → Haiku | $3/1M tok | $0.25/1M tok | -92% |
| Sonnet only for "complex" queries | 100% Sonnet | 18% Sonnet | -82% |
| TTS: ElevenLabs Multilingual → Flash v2 | $0.30/1k chars | $0.10/1k chars | -67% |
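The Sonnet-share row measures traffic, not dollars. As a worked example, the blended LLM input price can be estimated from the table's own figures, under the assumption (not stated in the article) that Sonnet-routed and Haiku-routed queries use similar token counts. The function names are hypothetical.

```typescript
// Blended input price per 1M tokens when only a share of queries goes to Sonnet.
// Assumption: both query classes consume roughly the same number of tokens.
function blendedPricePerMTok(
  sonnetShare: number,
  sonnetPrice: number,
  haikuPrice: number,
): number {
  if (sonnetShare < 0 || sonnetShare > 1) {
    throw new Error('share must be between 0 and 1');
  }
  return sonnetShare * sonnetPrice + (1 - sonnetShare) * haikuPrice;
}

// Fractional reduction relative to a baseline price.
function reduction(before: number, after: number): number {
  return (before - after) / before;
}

// Table figures: 18% Sonnet at $3 per 1M tokens, the rest on Haiku at $0.25.
const blended = blendedPricePerMTok(0.18, 3, 0.25);
const savedVersusAllSonnet = reduction(3, blended);
```

With those inputs the blended price works out to about $0.745 per 1M input tokens, roughly 75% below an all-Sonnet baseline. Complex queries are probably longer than simple ones, so the real saving is likely somewhat smaller.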
The Sonnet routing rule deserves its own warning. My first attempt was to let Claude itself judge complexity, but the judge ran on Sonnet too, so the savings vanished. The rule that actually works in production is dumb on purpose: input over 80 characters, OR a technical term in the last 3 turns, OR the system prompt explicitly escalated. Simple rules beat clever judges.
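The three-condition rule can be sketched as a pure function. This is an illustration under assumptions: the term list is hypothetical (the article does not publish one), `escalated` is a stand-in flag for "the system prompt explicitly escalated", and "last 3 turns" is taken to include the current utterance.

```typescript
type Model = 'haiku' | 'sonnet';

interface RoutingInput {
  userText: string;   // current transcribed utterance
  turns: string[];    // conversation turns, oldest first, including the current one
  escalated: boolean; // hypothetical flag: the system prompt explicitly escalated
}

// Hypothetical term list, for illustration only.
const TECHNICAL_TERMS = ['api', 'webhook', 'oauth', 'latency', 'database'];

// Whole-word match so that "rapid" does not count as "api".
function mentionsTechnicalTerm(text: string): boolean {
  const words = new Set(text.toLowerCase().split(/[^a-z0-9]+/));
  return TECHNICAL_TERMS.some((term) => words.has(term));
}

function chooseModel(input: RoutingInput): Model {
  const longInput = input.userText.length > 80;
  const technical = input.turns.slice(-3).some(mentionsTechnicalTerm);
  return longInput || technical || input.escalated ? 'sonnet' : 'haiku';
}
```

Because the function makes no model call, routing adds no latency or token cost, which is the property the Sonnet-as-judge approach lacked.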
3. Cloudflare Workers cannot be the inference layer
Dolice Labs runs all 4 sites on Cloudflare Workers + OpenNext, and my first instinct was "let me put the voice agent there too." It does not work, for three concrete reasons.
30-second CPU limit: every conversation turn burns 1–2s of CPU, and long sessions exceed the limit. Durable Objects share the same quota.
WebSocket constraints: Workers WebSockets cap individual frames at 32KB and disconnect at 16 hours. Bidirectional audio streaming requires Durable Objects + Hibernation API, which is much more design overhead than people assume.
Audio library bundling: ffmpeg WASM builds usually push you past the 10MB Worker bundle limit.
What I run today: Cloudflare Workers for signaling and session management on Durable Objects, and Google Cloud Run for the audio processing + Anthropic API path. The Cloudflare side still owns membership gating (premium_token cookie), and only paid users get a signed JWT for the Cloud Run WebSocket.
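```typescript
// Workers issues a short-lived JWT; Cloud Run verifies it on WebSocket upgrade
import { SignJWT } from 'jose';

// Minimal Env shape assumed for this snippet: the shared HS256 signing secret.
interface Env {
  VOICE_SESSION_SECRET: string;
}

export async function issueVoiceSessionToken(env: Env, userId: string): Promise<string> {
  const secret = new TextEncoder().encode(env.VOICE_SESSION_SECRET);
  return await new SignJWT({ sub: userId, scope: 'voice-session' })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime('15m') // token expires 15 minutes after issue
    .sign(secret);
}
```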
4. A three-tier fallback so I never get paged at 2am
Both of my grandfathers were temple carpenters in Japan. Their rule was "fix what you can fix before you go home, even in the rain." I apply the same rule to production: assume every dependency will fail, and have three layers ready.
Claude failure: Sonnet → Haiku → pre-recorded "Sorry, could you say that again" TTS (cost near zero)
TTS failure: ElevenLabs → OpenAI TTS → Browser Web Speech API (audio quality hit, still usable)
Whether to tell the user about the degradation is a product decision. For free assistants I stay silent; for membership users I display "running in simplified mode" so the trust signal stays intact. Honesty is the foundation of paid membership.
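One way to make that disclosure policy explicit is to record which tier of each chain actually served the turn and derive the notice from it. This is a sketch under assumptions: the chain arrays mirror the two lists above, and `ServedBy`, `describeServed`, and `degradationNotice` are hypothetical names.

```typescript
type MembershipTier = 'free' | 'premium';

// Fallback chains as listed above; index 0 is the primary option.
const LLM_CHAIN = ['Sonnet', 'Haiku', 'pre-recorded apology'] as const;
const TTS_CHAIN = ['ElevenLabs', 'OpenAI TTS', 'Browser Web Speech API'] as const;

interface ServedBy {
  llm: number; // index into LLM_CHAIN that produced the reply
  tts: number; // index into TTS_CHAIN that produced the audio
}

function isDegraded(served: ServedBy): boolean {
  return served.llm > 0 || served.tts > 0;
}

// Human-readable label for logs and dashboards.
function describeServed(served: ServedBy): string {
  return `${LLM_CHAIN[served.llm]} / ${TTS_CHAIN[served.tts]}`;
}

// Returns the notice to display, or null to stay silent.
function degradationNotice(tier: MembershipTier, served: ServedBy): string | null {
  if (!isDegraded(served)) return null;
  return tier === 'premium' ? 'Running in simplified mode' : null;
}
```

Keeping the policy in one function means the product decision can change without touching the fallback code itself.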
5. Hold conversation history as a graph, not a flat array
Voice users cannot consciously segment context the way chat users do. "Wait, not that one, the earlier one" happens constantly. My first version stored a flat array and dumped it into the Claude context window. Even when I stayed under 200k tokens, answer quality drifted because recent turns started leaking influence into older turns.
I now hold the conversation as three node types:
Question node: the user's intent plus the answer the agent gave
Topic node: an intermediate node that groups question nodes about the same subject
State node: an explicit state transition like "booking → awaiting confirmation → canceled"
Each Claude call receives the last 6 turns in full + the current topic-node summary + the state node only. Effective context fits in 4k–8k tokens and my eval set shows 30–40% accuracy improvement on multi-topic conversations.
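The selection step can be sketched as follows. The type and field names are hypothetical, and the real graph presumably carries more metadata; the point is only to show "last 6 turns + current topic summary + state node" being assembled into one context string.

```typescript
interface Turn { role: 'user' | 'assistant'; text: string }
interface QuestionNode { intent: string; answer: string }
interface TopicNode { subject: string; summary: string; questions: QuestionNode[] }
interface StateNode { current: string } // e.g. "awaiting confirmation"

interface ConversationGraph {
  turns: Turn[];        // full transcript, oldest first
  topics: TopicNode[];
  currentTopic: number; // index into topics, or -1 when there is none
  state: StateNode | null;
}

const RECENT_TURNS = 6;

// Build the compact context sent with each Claude call.
function buildContext(graph: ConversationGraph): string {
  const parts: string[] = [];
  if (graph.state) parts.push(`State: ${graph.state.current}`);
  const topic = graph.topics[graph.currentTopic];
  if (topic) parts.push(`Topic (${topic.subject}): ${topic.summary}`);
  for (const turn of graph.turns.slice(-RECENT_TURNS)) {
    parts.push(`${turn.role}: ${turn.text}`);
  }
  return parts.join('\n');
}
```

Older turns stay in the graph for later retrieval (for example, when the user says "the earlier one") but are left out of the prompt by default.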
6. Split UX and cost dashboards in Grafana
Prometheus + Grafana is the obvious choice, but the production lesson is: never put UX metrics and cost metrics on the same board. When p95 latency spikes, you need to ask "should I roll back the Haiku-first routing for cost?" without the cost number pulling your eye.
Cost board: avg cost per session, STT/LLM/TTS share, Sonnet ratio, free vs. premium unit-cost gap
A Grafana variable for tier=free|premium makes it fast to ask "does cost-cutting hurt the paying users?" — which, running a Stripe membership across 4 sites, is the question I check first every morning.
What I rely on when I translate the design into code
Numbers and code matter, but the question I lead with is "what am I trading my own time for?" My budget — p95 1500ms, $0.015 per session — exists so I can decide in 30 seconds whether a 2am alert needs me out of bed.
Voice agents demand more "humanness" than text chat. Being fast, cheap, and reliable at the same time is fundamentally hard, but layering budgets, holding three tiers of fallback, and splitting the dashboards keeps the on-call rotation survivable.
Thank you for reading this far. If you are building voice agents on a similar stack, I hope these numbers and decisions save you a few weekends.