API & SDK / 2026-03-28 / Advanced

Production Voice Agents with Claude API: Lessons from Running 6 Indie Apps

Whisper/Deepgram, Claude API, and TTS engines orchestrated for a production voice agent — written by an indie developer running this stack on Cloudflare Workers and Cloud Run with real latency budgets, cost breakdowns, and fallback strategies.



Setup and context: The Voice-First Revolution

We're witnessing a fundamental shift in how users interact with AI. Text-based interfaces are giving way to voice-native applications where conversations feel natural and intuitive. Companies building voice agents can deliver more engaging user experiences while capturing entirely new use cases.

Claude API excels at understanding nuanced natural language, but it neither accepts nor produces audio. Voice agents therefore require orchestration: speech-to-text (STT) for input, Claude for reasoning, and text-to-speech (TTS) for output. The value isn't in any single component; it's in how well they work together.

This guide walks you through building a production-grade voice agent system that handles real-world challenges: streaming audio, maintaining conversation context, recovering from failures, scaling to thousands of users, and optimizing costs. We'll use TypeScript/Node.js throughout, with immediately applicable code patterns.

Voice Agent Architecture Overview

A production voice agent system spans multiple integrated layers:

┌─────────────────────────────────────────────────────────────┐
│              Client Layer (Web / Mobile)                    │
└─────────────────────┬───────────────────────────────────────┘
                      │ WebSocket / HTTP
┌─────────────────────▼───────────────────────────────────────┐
│           API Gateway / Auth / Rate Limiting                │
└─────────────────────┬───────────────────────────────────────┘
                      │
    ┌─────────────────┼─────────────────┐
    │                 │                 │
    ▼                 ▼                 ▼
┌──────────┐   ┌─────────────┐   ┌─────────────┐
│ STT      │   │ Response    │   │ TTS Engine  │
│ (Whisper/│   │ Generation  │   │ (Multi-     │
│ Deepgram)│   │ (Claude API)│   │  Provider)  │
└──────────┘   └─────────────┘   └─────────────┘
                      │
    ┌─────────────────┼─────────────────┐
    │                 │                 │
    ▼                 ▼                 ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Cache (Redis) │ │ Conversation │ │Analytics &   │
│              │ │ Storage      │ │Logging       │
│              │ │(PostgreSQL)  │ │              │
└──────────────┘ └──────────────┘ └──────────────┘

Core responsibilities of each layer:

STT (Speech-to-Text): Whisper for high accuracy (and offline use when self-hosted); Deepgram for low-latency streaming
Response Generation: Claude API with streaming support and conversation history
TTS (Text-to-Speech): Multi-provider support (Google Cloud, ElevenLabs, Amazon Polly) with failover
State Management: Session persistence, conversation history, user context
Infrastructure: Caching, rate limiting, monitoring, logging
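
To make the division of labor concrete, here is a minimal sketch of one conversational turn passing through the three core layers. The `SpeechToText`, `Responder`, and `TextToSpeech` function types and the `runTurn` helper are illustrative names I am assuming for this sketch, not part of any SDK; the real services would be plugged in behind them.

```typescript
// Hypothetical function types standing in for the STT, Claude, and TTS layers.
type SpeechToText = (audio: Uint8Array) => Promise<string>;
type Responder = (transcript: string, history: Turn[]) => Promise<string>;
type TextToSpeech = (text: string) => Promise<Uint8Array>;

interface Turn {
  role: 'user' | 'assistant';
  content: string;
}

interface TurnResult {
  transcript: string;
  reply: string;
  audio: Uint8Array;
  history: Turn[];
}

// Runs one voice turn: audio in -> transcript -> reply text -> audio out.
// Returns a new history array instead of mutating the one passed in.
async function runTurn(
  audio: Uint8Array,
  history: Turn[],
  stt: SpeechToText,
  respond: Responder,
  tts: TextToSpeech
): Promise<TurnResult> {
  const transcript = (await stt(audio)).trim();
  if (!transcript) {
    // Nothing intelligible was said, so skip the LLM and TTS calls entirely.
    return { transcript: '', reply: '', audio: new Uint8Array(0), history };
  }

  const reply = await respond(transcript, history);
  const speech = await tts(reply);

  return {
    transcript,
    reply,
    audio: speech,
    history: [
      ...history,
      { role: 'user', content: transcript },
      { role: 'assistant', content: reply }
    ]
  };
}
```

Keeping the three layers behind plain function types makes it easy to swap providers or inject fakes in tests without touching the orchestration logic.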

Let's build each piece methodically, starting with speech recognition.


WHAT YOU'LL LEARN

✦How I split a 1500ms voice-to-voice latency budget into STT 600ms / Claude Haiku 400ms / TTS 400ms, and compressed real-world p50 to 910ms with Deepgram Streaming

✦Reduced per-session cost from $0.024 to $0.011 across 4 specific decisions, with the Sonnet routing rule that finally worked after Sonnet-judges-itself failed

✦Why Cloudflare Workers cannot host the inference layer, and the Durable Objects + Cloud Run split I run in production with signed JWT session tokens


Speech-to-Text Pipeline: Whisper and Deepgram Integration

Whisper API for High-Fidelity Transcription

Whisper is a strong default when accuracy matters more than latency. We'll integrate it via OpenAI's hosted API rather than running the model ourselves:

// src/services/stt/whisper.ts
// Relies on the fetch, FormData, and Blob globals available in Node 18+.
import { readFile } from 'fs/promises';
import path from 'path';

interface WhisperResponse {
  text: string;
  language: string;
  duration: number; // seconds of audio, as reported by the API
}

export class WhisperSTTService {
  private apiKey: string;
  private apiUrl = 'https://api.openai.com/v1/audio/transcriptions';

  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }

  async transcribe(audioPath: string, language?: string): Promise<WhisperResponse> {
    const form = new FormData();
    // The filename (with its extension) tells the API which audio format to expect
    form.append('file', new Blob([await readFile(audioPath)]), path.basename(audioPath));
    form.append('model', 'whisper-1');
    // verbose_json adds the detected language and audio duration to the response
    form.append('response_format', 'verbose_json');

    if (language) {
      form.append('language', language);
    }

    try {
      const response = await fetch(this.apiUrl, {
        method: 'POST',
        // fetch generates the multipart Content-Type header (with boundary) itself
        headers: { Authorization: `Bearer ${this.apiKey}` },
        body: form
      });

      const data = await response.json();

      if (!response.ok) {
        throw new Error(`Whisper API error: ${data.error?.message ?? response.status}`);
      }

      return {
        text: data.text,
        language: data.language ?? 'unknown',
        duration: data.duration ?? 0
      };
    } catch (error) {
      console.error('Whisper transcription failed:', error);
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`STT processing failed: ${message}`);
    }
  }
}

Deepgram Streaming for Real-Time Low Latency

When you need sub-second response times (think live customer support), Deepgram's streaming API is your answer:

// src/services/stt/deepgram-streaming.ts
// Event names and payload shape follow @deepgram/sdk v3; check them against your installed version.
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';
import { PassThrough } from 'stream';

interface StreamingSTTConfig {
  apiKey: string;
  language?: string;
  model?: string;
}

export class DeepgramStreamingSTT {
  private client: ReturnType<typeof createClient>;
  private language: string;
  private model: string;

  constructor(config: StreamingSTTConfig) {
    this.client = createClient(config.apiKey);
    this.language = config.language || 'en';
    this.model = config.model || 'nova-2';
  }

  async transcribeStream(audioStream: PassThrough): Promise<string> {
    return new Promise((resolve, reject) => {
      // Collect finalized segments only; interim results get revised as more audio arrives
      const finalSegments: string[] = [];

      const connection = this.client.listen.live({
        model: this.model,
        language: this.language,
        punctuate: true,
        interim_results: true
      });

      connection.on(LiveTranscriptionEvents.Open, () => {
        console.log('Deepgram connection opened');
        // The live connection is not a Node stream, so forward each chunk with send()
        audioStream.on('data', (chunk: Buffer) => connection.send(chunk));
        audioStream.on('end', () => connection.finish());
      });

      connection.on(LiveTranscriptionEvents.Transcript, (data) => {
        const transcript = data.channel?.alternatives?.[0]?.transcript;
        if (!transcript) return;
        if (data.is_final) {
          finalSegments.push(transcript);
        }
        // Emit interim results (is_final === false) to the UI here if needed
      });

      connection.on(LiveTranscriptionEvents.Close, () => {
        resolve(finalSegments.join(' '));
      });

      connection.on(LiveTranscriptionEvents.Error, (error) => {
        reject(new Error(`Deepgram error: ${error?.message ?? String(error)}`));
      });
    });
  }
}

Choosing your STT provider:

Whisper API: Best for accuracy, supports 99+ languages, good for batch (non-real-time) transcription
Deepgram Nova-2: Best latency (50-300ms), real-time streaming, conversational focus
Local Whisper: Most cost-effective at scale, but requires GPU infrastructure

Intelligent Response Generation with Claude API

Conversation Context Management

Claude's strength lies in understanding nuance. To leverage this in voice agents, maintain rich conversation context:

// src/services/claude-agent.ts
import Anthropic from '@anthropic-ai/sdk';
 
interface ConversationMessage {
  role: 'user' | 'assistant';
  content: string;
}
 
interface VoiceAgentConfig {
  apiKey: string;
  systemPrompt: string;
  maxContextMessages?: number;
  temperature?: number;
}
 
export class VoiceAgent {
  private client: Anthropic;
  private systemPrompt: string;
  private maxContextMessages: number;
  private temperature: number;
 
  constructor(config: VoiceAgentConfig) {
    this.client = new Anthropic({ apiKey: config.apiKey });
    this.systemPrompt = config.systemPrompt;
    // Use ?? rather than || so an explicit 0 (e.g. temperature: 0) is not replaced by the default
    this.maxContextMessages = config.maxContextMessages ?? 10;
    this.temperature = config.temperature ?? 0.7;
  }
 
  async generateResponse(
    userInput: string,
    conversationHistory: ConversationMessage[]
  ): Promise<string> {
    // Keep only recent messages to optimize token usage
    const recentHistory = conversationHistory.slice(-this.maxContextMessages);
 
    const messages = [
      ...recentHistory,
      { role: 'user' as const, content: userInput }
    ];
 
    try {
      const response = await this.client.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024,
        temperature: this.temperature,
        system: this.systemPrompt,
        messages
      });
 
      const assistantMessage = response.content[0];
      if (assistantMessage.type !== 'text') {
        throw new Error('Unexpected response type from Claude');
      }
 
      return assistantMessage.text;
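    } catch (error) {
      console.error('Claude API call failed:', error);
      // Catch variables are `unknown` under strict TypeScript, so narrow before reading .message
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`Response generation failed: ${message}`);
    }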
  }
 
  async generateResponseStream(
    userInput: string,
    conversationHistory: ConversationMessage[],
    onChunk: (chunk: string) => void
  ): Promise<string> {
    const recentHistory = conversationHistory.slice(-this.maxContextMessages);
 
    const messages = [
      ...recentHistory,
      { role: 'user' as const, content: userInput }
    ];
 
    let fullResponse = '';
 
    try {
      const stream = await this.client.messages.stream({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024,
        temperature: this.temperature,
        system: this.systemPrompt,
        messages
      });
 
      for await (const chunk of stream) {
        if (chunk.type === 'content_block_delta' && chunk.delta?.type === 'text_delta') {
          const text = chunk.delta.text;
          fullResponse += text;
          onChunk(text); // Stream to client
        }
      }
 
      return fullResponse;
    } catch (error) {
      console.error('Claude streaming failed:', error);
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`Streaming response failed: ${message}`);
    }
  }
}
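
Streaming only pays off for voice if the TTS layer can start before Claude has finished. One way to do that, sketched below under my own assumptions, is to cut the streamed deltas into sentences and hand each finished sentence to TTS immediately. `SentenceChunker` is a hypothetical helper, and its boundary rule (`.`, `!`, or `?` followed by whitespace) is deliberately naive: abbreviations such as "Dr. Smith" will be split early.

```typescript
// Buffers streamed text deltas and emits complete sentences as soon as they end.
class SentenceChunker {
  private buffer = '';
  private onSentence: (sentence: string) => void;

  constructor(onSentence: (sentence: string) => void) {
    this.onSentence = onSentence;
  }

  // Feed each text delta from the stream; emits zero or more full sentences.
  push(delta: string): void {
    this.buffer += delta;
    // A sentence ends at ., !, or ? followed by whitespace. Requiring the
    // whitespace keeps "3.5" intact and waits when a delta ends on a period.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) this.onSentence(sentence);
      start = end;
    }
    this.buffer = this.buffer.slice(start);
  }

  // Call once the stream ends to emit whatever text is left over.
  flush(): void {
    const rest = this.buffer.trim();
    if (rest) this.onSentence(rest);
    this.buffer = '';
  }
}
```

In practice you would pass `(delta) => chunker.push(delta)` as the `onChunk` callback of `generateResponseStream` and call `chunker.flush()` once the returned promise resolves.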

Optimized System Prompts for Voice

Voice agents need system prompts that push Claude toward brief, action-oriented answers, because listeners cannot re-read a long response:

const VOICE_AGENT_SYSTEM_PROMPT = `You are a helpful, conversational voice assistant.
 
Guidelines:
- Respond naturally, as if speaking to someone. Keep sentences short (under 20 words when possible).
- Use conversational language. Avoid jargon unless the user introduced it first.
- If unsure, admit it. Don't speculate.
- Break complex information into short spoken points (max 3 per response), introduced with words like "first" and "next" rather than list symbols.
- No emojis. No markdown formatting. Speak like a real person.
- Be warm and encouraging while remaining professional.`;
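
Even with these guidelines, a model will occasionally slip in markdown or an emoji, and most TTS engines will read the symbols aloud. As a safety net, here is a small sketch of a post-processing step to run on Claude's reply before synthesis. `sanitizeForSpeech` is a hypothetical helper, and the patterns cover only the common cases rather than the full markdown grammar.

```typescript
// Strips common markdown and emoji so a TTS engine does not read symbols aloud.
function sanitizeForSpeech(text: string): string {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, ' ')          // drop fenced code blocks entirely
    .replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')    // [label](url) -> label
    .replace(/^\s{0,3}#{1,6}\s+/gm, '')         // heading markers
    .replace(/^\s*(?:[-*+]|\d+\.)\s+/gm, '')    // list bullets and numbering
    .replace(/[*_`~]+/g, '')                    // emphasis and inline-code marks
    // Most emoji: surrogate-pair pictographs plus the U+2600-U+27BF symbol blocks
    .replace(/[\uD83C-\uD83E][\uDC00-\uDFFF]|[\u2600-\u27BF]\uFE0F?/g, '')
    .replace(/\s+/g, ' ')                       // collapse whitespace and newlines
    .trim();
}
```

Note that the underscore rule also flattens identifiers such as `snake_case`, which is usually acceptable for spoken output but worth knowing if your agent reads code or file names aloud.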

Text-to-Speech Implementation and Optimization

Multi-Provider TTS Adapter Pattern

Production systems need failover. Implement a provider-agnostic interface:

// src/services/tts/tts-provider.ts
export interface TTSResponse {
  audioUrl: string;
  audioBuffer?: Buffer;
  duration: number;
  format: 'mp3' | 'wav' | 'ogg';
}
 
export interface TTSProvider {
  synthesize(text: string, voice?: string): Promise<TTSResponse>;
  getAvailableVoices(): Promise<string[]>;
  estimateDuration(text: string): number;
}
 
// Google Cloud Text-to-Speech
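// Named import so TextToSpeechClient can be used both as a value and as a type
import { TextToSpeechClient } from '@google-cloud/text-to-speech';

export class GoogleCloudTTS implements TTSProvider {
  private client: TextToSpeechClient;

  constructor() {
    // Credentials are resolved from the environment (Application Default Credentials)
    this.client = new TextToSpeechClient();
  }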
 
  async synthesize(text: string, voice = 'en-US-Neural2-C'): Promise<TTSResponse> {
    const request = {
      input: { text },
      voice: {
        languageCode: 'en-US',
        name: voice
      },
      audioConfig: {
        audioEncoding: 'MP3' as const,
        pitch: 0,
        speakingRate: 1.0
      }
    };
 
    try {
      const [response] = await this.client.synthesizeSpeech(request);
      const audioBuffer = response.audioContent as Buffer;
 
      // Estimate duration: ~150 WPM average
      const words = text.split(/\s+/).length;
      const duration = (words / 150) * 60;
 
      return {
        audioBuffer,
        audioUrl: '',
        duration,
        format: 'mp3'
      };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`Google Cloud TTS failed: ${message}`);
    }
  }
 
  async getAvailableVoices(): Promise<string[]> {
    const [result] = await this.client.listVoices({});
    return result.voices
      ?.filter(v => v.languageCodes?.includes('en-US'))
      .map(v => v.name!) || [];
  }
 
  estimateDuration(text: string): number {
    const words = text.split(/\s+/).length;
    return (words / 150) * 60;
  }
}
 
// ElevenLabs for natural, expressive voices
// Written against the official `elevenlabs` npm package (ElevenLabsClient);
// method and field names can differ between SDK versions, so verify against yours.
import { ElevenLabsClient } from 'elevenlabs';

export class ElevenLabsTTS implements TTSProvider {
  private client: ElevenLabsClient;
  private voiceId: string;

  constructor(apiKey: string, voiceId: string) {
    this.client = new ElevenLabsClient({ apiKey });
    this.voiceId = voiceId;
  }

  // Falls back to the voice configured in the constructor when none is passed
  async synthesize(text: string, voice = this.voiceId): Promise<TTSResponse> {
    try {
      // convert() resolves to a readable stream of MP3 bytes
      const audioStream = await this.client.textToSpeech.convert(voice, {
        text,
        model_id: 'eleven_multilingual_v2',
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75
        }
      });

      // Drain the stream into a single buffer
      const chunks: Buffer[] = [];
      for await (const chunk of audioStream) {
        chunks.push(Buffer.from(chunk));
      }

      return {
        audioBuffer: Buffer.concat(chunks),
        audioUrl: '',
        duration: this.estimateDuration(text),
        format: 'mp3'
      };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      throw new Error(`ElevenLabs TTS failed: ${message}`);
    }
  }

  async getAvailableVoices(): Promise<string[]> {
    const { voices } = await this.client.voices.getAll();
    return voices.map(v => v.voice_id);
  }

  // Estimate duration: ~150 WPM average
  estimateDuration(text: string): number {
    const words = text.split(/\s+/).length;
    return (words / 150) * 60;
  }
}
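
Both providers repeat the same 150 words-per-minute arithmetic, and `text.split(/\s+/)` counts an empty string as one word. A shared helper along the following lines would remove the duplication and fix that edge case. `estimateSpeechSeconds` is a name I am assuming for this sketch; the 150 WPM default comes from the code above.

```typescript
// Rough speech-length estimate: seconds = (words / wordsPerMinute) * 60.
function estimateSpeechSeconds(text: string, wordsPerMinute = 150): number {
  const trimmed = text.trim();
  // ''.split(/\s+/) yields [''], which would wrongly count as one word.
  const words = trimmed === '' ? 0 : trimmed.split(/\s+/).length;
  return (words / wordsPerMinute) * 60;
}
```

For example, a 75-word reply comes out at roughly 30 seconds of audio, which is a useful sanity check against the "keep it short" rule in the system prompt.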

Intelligent TTS Caching

Synthesizing speech is expensive and slow. Cache aggressively:

// src/services/tts/tts-cache.ts
import Redis from 'ioredis';
import crypto from 'crypto';
import type { TTSProvider, TTSResponse } from './tts-provider';
 
export class TTSCache {
  private redis: Redis;
  private ttlSeconds = 86400 * 30; // 30 days
 
  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
  }
 
  private generateKey(text: string, voice: string, provider: string): string {
    const hash = crypto
      .createHash('sha256')
      .update(`${text}:${voice}:${provider}`)
      .digest('hex');
    return `tts:${hash}`;
  }
 
  async getOrSynthesize(
    text: string,
    voice: string,
    provider: TTSProvider,
    providerName: string
  ): Promise<TTSResponse> {
    const key = this.generateKey(text, voice, providerName);
 
    // Check cache first
    const cached = await this.redis.getBuffer(key);
    if (cached) {
      console.log(`TTS cache hit: ${key}`);
      return {
        audioBuffer: cached,
        audioUrl: '',
        duration: provider.estimateDuration(text),
        format: 'mp3'
      };
    }
 
    // Synthesize and cache
    console.log(`TTS cache miss: ${key}`);
    const result = await provider.synthesize(text, voice);
 
    if (result.audioBuffer) {
      await this.redis.setex(
        key,
        this.ttlSeconds,
        result.audioBuffer
      );
    }
 
    return result;
  }
}
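
Cache hit rate depends heavily on how the key is built. The sketch below is one possible refinement of `generateKey`, not the version I claim to run: it collapses whitespace so that trivially different strings share an entry, and it serializes the fields as JSON so a `:` inside the text cannot blur the boundary between text, voice, and provider. The function names are hypothetical, and case is left untouched on purpose because some engines pronounce "US" and "us" differently.

```typescript
import { createHash } from 'crypto';

// Collapse runs of whitespace and trim, without changing case or punctuation.
function normalizeForCache(text: string): string {
  return text.trim().replace(/\s+/g, ' ');
}

// Builds a deterministic Redis key for a (provider, voice, text) triple.
function ttsCacheKey(text: string, voice: string, provider: string): string {
  // JSON.stringify keeps the three fields unambiguous even if the text contains ':'.
  const payload = JSON.stringify([provider, voice, normalizeForCache(text)]);
  return `tts:${createHash('sha256').update(payload).digest('hex')}`;
}
```

Greetings, confirmations, and error fallbacks are the phrases most likely to repeat, so they benefit most from a stable key.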

Real-Time Streaming Architecture

WebSocket-Based Bidirectional Streaming

The magic happens when audio flows in real-time. Here's how to orchestrate it:

// src/websocket/voice-agent-handler.ts
import WebSocket from 'ws';
import { VoiceAgent } from '../services/claude-agent';
import { DeepgramStreamingSTT } from '../services/stt/deepgram-streaming';
import { TTSCache } from '../services/tts/tts-cache';
 
interface StreamingSessionConfig {
  sessionId: string;
  userId: string;
  voiceAgent: VoiceAgent;
  sttService: DeepgramStreamingSTT;
  ttsCache: TTSCache;
}
 
export class VoiceAgentStreamHandler {
  private sessionConfig: StreamingSessionConfig;
  // Role is narrowed to the union VoiceAgent.generateResponse expects
  private conversationHistory: Array<{ role: 'user' | 'assistant'; content: string }> = [];
 
  constructor(config: StreamingSessionConfig) {
    this.sessionConfig = config;
  }
 
  async handleWebSocketConnection(ws: WebSocket): Promise<void> {
    console.log(`Session ${this.sessionConfig.sessionId} connected`);
 
    ws.on('message', async (data: Buffer) => {
      try {
        // Receive audio chunk from client
        const audioChunk = data;
 
        // Transcribe with Deepgram
        const transcript = await this.transcribeAudioChunk(audioChunk);
        if (!transcript) return;
 
        // Generate response with Claude
        const response = await this.sessionConfig.voiceAgent.generateResponse(
          transcript,
          this.conversationHistory
        );
 
        // Update history
        this.conversationHistory.push(
          { role: 'user', content: transcript },
          { role: 'assistant', content: response }
        );
 
        // Synthesize speech
        const audioBuffer = await this.generateSpeech(response);
 
        // Send back to client
        ws.send(JSON.stringify({
          type: 'response',
          transcript,
          response,
          audioBuffer: audioBuffer.toString('base64')
        }));
      } catch (error) {
        console.error('Stream processing error:', error);
        ws.send(JSON.stringify({
          type: 'error',
          message: error instanceof Error ? error.message : String(error)
        }));
      }
    });
 
    ws.on('close', () => {
      console.log(`Session ${this.sessionConfig.sessionId} closed`);
      this.cleanupSession();
    });
 
    ws.on('error', (error) => {
      console.error(`WebSocket error: ${error.message}`);
    });
  }
 
  private async transcribeAudioChunk(audioData: Buffer): Promise<string> {
    // Integration with Deepgram streaming service
    return '';
  }
 
  private async generateSpeech(text: string): Promise<Buffer> {
    // Integration with TTS cache service
    return Buffer.alloc(0);
  }
 
  private cleanupSession(): void {
    this.conversationHistory = [];
  }
}
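
As written, the handler runs the whole STT, Claude, and TTS pipeline on every WebSocket message, so each audio chunk would trigger its own Claude call. One common alternative is to accumulate chunks and process them once per utterance, when the client or a VAD signals end of speech. The sketch below shows only the buffering part; `UtteranceBuffer` and its `maxBytes` cap are assumptions of mine, and `Uint8Array` is used so that Node `Buffer` chunks can be passed in directly.

```typescript
// Accumulates audio chunks for one utterance, with a cap to bound memory per session.
class UtteranceBuffer {
  private chunks: Uint8Array[] = [];
  private size = 0;
  private maxBytes: number;

  constructor(maxBytes = 5 * 1024 * 1024) {
    this.maxBytes = maxBytes;
  }

  // Returns false (and drops the chunk) once the cap would be exceeded.
  push(chunk: Uint8Array): boolean {
    if (this.size + chunk.byteLength > this.maxBytes) return false;
    this.chunks.push(chunk);
    this.size += chunk.byteLength;
    return true;
  }

  get byteLength(): number {
    return this.size;
  }

  // Concatenates everything received so far and resets for the next utterance.
  flush(): Uint8Array {
    const out = new Uint8Array(this.size);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.byteLength;
    }
    this.chunks = [];
    this.size = 0;
    return out;
  }
}
```

The message handler would then call `push` for binary frames and run the pipeline on `flush()` only when an end-of-utterance control message arrives.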

Conversation Context and Memory Management

PostgreSQL Session Persistence

Store conversations for later retrieval and user continuity:

// src/db/session-repository.ts
import { Pool } from 'pg';
 
interface VoiceSession {
  sessionId: string;
  userId: string;
  startTime: Date;
  endTime?: Date;
  conversationHistory: Array<{ role: string; content: string }>;
  metadata: Record<string, any>;
}
 
export class VoiceSessionRepository {
  private pool: Pool;
 
  constructor(connectionString: string) {
    this.pool = new Pool({ connectionString });
  }
 
  async createSession(session: VoiceSession): Promise<void> {
    const query = `
      INSERT INTO voice_sessions (session_id, user_id, start_time, conversation_history, metadata)
      VALUES ($1, $2, $3, $4, $5)
    `;
 
    await this.pool.query(query, [
      session.sessionId,
      session.userId,
      session.startTime,
      JSON.stringify(session.conversationHistory),
      JSON.stringify(session.metadata)
    ]);
  }
 
  async updateConversation(
    sessionId: string,
    history: Array<{ role: string; content: string }>
  ): Promise<void> {
    const query = `
      UPDATE voice_sessions
      SET conversation_history = $1, updated_at = NOW()
      WHERE session_id = $2
    `;
 
    await this.pool.query(query, [JSON.stringify(history), sessionId]);
  }
 
  async getSession(sessionId: string): Promise<VoiceSession | null> {
    const query = `
      SELECT session_id, user_id, start_time, conversation_history, metadata
      FROM voice_sessions
      WHERE session_id = $1
    `;
 
    const result = await this.pool.query(query, [sessionId]);
    if (result.rows.length === 0) return null;
 
    const row = result.rows[0];
    return {
      sessionId: row.session_id,
      userId: row.user_id,
      startTime: row.start_time,
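      // pg returns JSON/JSONB columns already parsed; TEXT columns arrive as strings
      conversationHistory: typeof row.conversation_history === 'string'
        ? JSON.parse(row.conversation_history)
        : row.conversation_history,
      metadata: typeof row.metadata === 'string' ? JSON.parse(row.metadata) : row.metadata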
    };
  }
 
  async endSession(sessionId: string): Promise<void> {
    const query = `
      UPDATE voice_sessions
      SET end_time = NOW()
      WHERE session_id = $1
    `;
 
    await this.pool.query(query, [sessionId]);
  }
}
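
The repository assumes a `voice_sessions` table but the schema is not shown here. The following is a guess at a minimal definition that matches the queries above, kept as a TypeScript constant so it can live next to the repository. The column types are my assumption: the two JSON fields are plain `TEXT` because the repository calls `JSON.stringify` on write and `JSON.parse` on read. `JSONB` would also work, but `pg` returns those columns already parsed, so the read path would have to stop calling `JSON.parse`.

```typescript
// Assumed schema for the voice_sessions table used by VoiceSessionRepository.
const CREATE_VOICE_SESSIONS_SQL = `
  CREATE TABLE IF NOT EXISTS voice_sessions (
    session_id           TEXT PRIMARY KEY,
    user_id              TEXT NOT NULL,
    start_time           TIMESTAMPTZ NOT NULL,
    end_time             TIMESTAMPTZ,
    conversation_history TEXT NOT NULL DEFAULT '[]',
    metadata             TEXT NOT NULL DEFAULT '{}',
    updated_at           TIMESTAMPTZ NOT NULL DEFAULT NOW()
  );
  CREATE INDEX IF NOT EXISTS voice_sessions_user_id_idx
    ON voice_sessions (user_id);
`;
```

Because the statement takes no parameters, it can be run once at startup with `pool.query(CREATE_VOICE_SESSIONS_SQL)`.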

User Profile Personalization

Tailor responses based on user history and preferences:

// src/services/user-context.ts
export interface UserProfile {
  userId: string;
  name: string;
  preferences: {
    language: string;
    voiceGender: 'male' | 'female';
    formality: 'casual' | 'formal';
  };
  interactionHistory: {
    totalSessions: number;
    averageSessionLength: number;
    favoriteTopics: string[];
  };
}
 
export class UserContextManager {
  async loadUserProfile(userId: string): Promise<UserProfile> {
    // Load from database
    return {
      userId,
      name: 'User',
      preferences: {
        language: 'en',
        voiceGender: 'female',
        formality: 'formal'
      },
      interactionHistory: {
        totalSessions: 0,
        averageSessionLength: 0,
        favoriteTopics: []
      }
    };
  }
 
  buildSystemPromptWithContext(basePrompt: string, profile: UserProfile): string {
    return `${basePrompt}
 
User Context:
- Name: ${profile.name}
- Preferred formality: ${profile.preferences.formality}
- Previous sessions: ${profile.interactionHistory.totalSessions}
- Favorite topics: ${profile.interactionHistory.favoriteTopics.join(', ')}`;
  }
}

Error Handling and Fallback Strategies

Comprehensive Error Handling

// src/error-handling/voice-agent-errors.ts
export class VoiceAgentError extends Error {
  constructor(
    public code: string,
    message: string,
    public retryable: boolean = false,
    public fallback?: string
  ) {
    super(message);
    this.name = 'VoiceAgentError';
  }
}
 
export class STTError extends VoiceAgentError {
  constructor(message: string, retryable = true) {
    super('STT_ERROR', message, retryable, 'I had trouble hearing you. Could you try again?');
  }
}
 
export class ClaudeAPIError extends VoiceAgentError {
  constructor(message: string, retryable = true) {
    super('CLAUDE_ERROR', message, retryable, 'I need a moment. Can you repeat that?');
  }
}
 
export class TTSError extends VoiceAgentError {
  constructor(message: string, retryable = true) {
    super('TTS_ERROR', message, retryable, 'I cannot generate audio right now.');
  }
}
 
// Error handling middleware
export class VoiceAgentErrorHandler {
  async handle(error: any, sessionId: string): Promise<{ message: string; fallback: string }> {
    console.error(`[${sessionId}] Error:`, error);
 
    if (error instanceof VoiceAgentError) {
      return {
        message: error.message,
        fallback: error.fallback || 'Something went wrong. Please try again.'
      };
    }
 
    return {
      message: 'Unexpected error',
      fallback: 'I encountered an issue. Let\'s start over.'
    };
  }
}

Retry Logic with Exponential Backoff

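// src/resilience/retry-strategy.ts
import type { TTSProvider, TTSResponse } from '../services/tts/tts-provider';
import { TTSError } from '../error-handling/voice-agent-errors';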
export interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}
 
const DEFAULT_RETRY_CONFIG: RetryConfig = {
  maxAttempts: 3,
  initialDelayMs: 100,
  maxDelayMs: 5000,
  backoffMultiplier: 2
};
 
export async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig = DEFAULT_RETRY_CONFIG
): Promise<T> {
  let lastError: Error | null = null;
  let delay = config.initialDelayMs;
 
  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      console.warn(`Attempt ${attempt} failed: ${lastError.message}`);
 
      if (attempt < config.maxAttempts) {
        await new Promise(resolve => setTimeout(resolve, delay));
        delay = Math.min(delay * config.backoffMultiplier, config.maxDelayMs);
      }
    }
  }
 
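  // lastError is only null when maxAttempts < 1, so never throw a bare null
  throw lastError ?? new Error('withRetry: maxAttempts must be at least 1');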
}
 
// Multi-provider TTS fallback
export class TTSFallbackManager {
  private providers: Array<{ name: string; provider: TTSProvider }>;
  private primaryIndex = 0;
 
  constructor(providers: Array<{ name: string; provider: TTSProvider }>) {
    this.providers = providers;
  }
 
  async synthesizeWithFallback(text: string): Promise<TTSResponse> {
    for (let i = 0; i < this.providers.length; i++) {
      const index = (this.primaryIndex + i) % this.providers.length;
      const { name, provider } = this.providers[index];
 
      try {
        console.log(`Attempting TTS with ${name}...`);
        const result = await provider.synthesize(text);
        this.primaryIndex = index; // Promote successful provider
        return result;
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        console.warn(`${name} failed: ${message}`);
        if (i === this.providers.length - 1) {
          throw new TTSError('All TTS providers unavailable');
        }
      }
    }
 
    throw new TTSError('No TTS providers available');
  }
}
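
`withRetry` waits a deterministic 100ms, then 200ms, and so on, which means sessions that fail together also retry together. A widely used refinement is "full jitter", where each wait is drawn at random between zero and the current exponential ceiling. The sketch below computes such a schedule as a pure function so it is easy to test. `backoffDelays` and `BackoffConfig` are hypothetical names, and the injectable `random` parameter exists only to make the output reproducible.

```typescript
interface BackoffConfig {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

// Returns the wait before each retry: maxAttempts - 1 entries, since the
// first attempt runs immediately and no wait follows the final failure.
function backoffDelays(
  config: BackoffConfig,
  random: () => number = Math.random
): number[] {
  const delays: number[] = [];
  let ceiling = config.initialDelayMs;
  for (let attempt = 1; attempt < config.maxAttempts; attempt++) {
    // Full jitter: pick uniformly between 0 and the current exponential ceiling.
    delays.push(Math.floor(random() * ceiling));
    ceiling = Math.min(ceiling * config.backoffMultiplier, config.maxDelayMs);
  }
  return delays;
}
```

For a voice agent the ceiling matters more than usual: every retry eats into the same latency budget the user is waiting on, so it is often better to fail over to a fallback phrase than to wait out a long backoff.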

Production Deployment and Scaling

Kubernetes Deployment Configuration

# k8s/voice-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: voice-agent
  template:
    metadata:
      labels:
        app: voice-agent
    spec:
      containers:
      - name: voice-agent
        image: voice-agent:latest
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        env:
        - name: CLAUDE_API_KEY
          valueFrom:
            secretKeyRef:
              name: voice-agent-secrets
              key: claude-api-key
        - name: DEEPGRAM_API_KEY
          valueFrom:
            secretKeyRef:
              name: voice-agent-secrets
              key: deepgram-api-key
        - name: REDIS_URL
          valueFrom:
            configMapKeyRef:
              name: voice-agent-config
              key: redis-url
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
 
---
apiVersion: v1
kind: Service
metadata:
  name: voice-agent-service
spec:
  selector:
    app: voice-agent
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
  type: LoadBalancer

Session Affinity for Stateful Connections

// src/infra/session-affinity.ts
import Redis from 'ioredis';
 
export class SessionAffinityManager {
  private redis: Redis;
 
  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
  }
 
  async assignPodToSession(sessionId: string, podName: string): Promise<void> {
    await this.redis.setex(`session:affinity:${sessionId}`, 86400, podName);
  }
 
  async getPodForSession(sessionId: string): Promise<string | null> {
    return this.redis.get(`session:affinity:${sessionId}`);
  }
 
  async releaseSessionAffinity(sessionId: string): Promise<void> {
    await this.redis.del(`session:affinity:${sessionId}`);
  }
}

Cost Optimization and Monitoring

API Cost Tracking

Claude 3.5 Sonnet costs $3 per million input tokens and $15 per million output tokens. Track every call:

// src/monitoring/cost-tracker.ts
import { CloudWatch } from 'aws-sdk';
 
export interface APICost {
  service: string; // 'claude', 'whisper', 'tts-google', etc
  inputUnits: number; // tokens, minutes, characters
  outputUnits: number;
  costUSD: number;
  timestamp: Date;
}
 
export class CostTracker {
  private cloudwatch: CloudWatch;
  private costs: APICost[] = [];
 
  constructor() {
    this.cloudwatch = new CloudWatch();
  }
 
  recordClaudeCall(inputTokens: number, outputTokens: number): void {
    const inputCost = (inputTokens / 1_000_000) * 3;
    const outputCost = (outputTokens / 1_000_000) * 15;
 
    this.costs.push({
      service: 'claude',
      inputUnits: inputTokens,
      outputUnits: outputTokens,
      costUSD: inputCost + outputCost,
      timestamp: new Date()
    });
  }
 
  recordWhisperCall(duration: number): void {
    // Whisper API: $0.006/minute (duration is in seconds)
    const cost = (duration / 60) * 0.006;
 
    this.costs.push({
      service: 'whisper',
      inputUnits: Math.ceil(duration),
      outputUnits: 0,
      costUSD: cost,
      timestamp: new Date()
    });
  }
 
  recordTTSCall(characters: number, provider: string): void {
    let cost = 0;
    if (provider === 'google') {
      // Google Cloud TTS: $16 per million characters
      cost = (characters / 1_000_000) * 16;
    } else if (provider === 'elevenlabs') {
      // ElevenLabs: $0.30 per 10,000 characters
      cost = (characters / 10_000) * 0.30;
    }
 
    this.costs.push({
      service: `tts-${provider}`,
      inputUnits: characters,
      outputUnits: 0,
      costUSD: cost,
      timestamp: new Date()
    });
  }
 
  async publishMetrics(sessionId: string): Promise<void> {
    const totalCost = this.costs.reduce((sum, c) => sum + c.costUSD, 0);
 
    await this.cloudwatch.putMetricData({
      Namespace: 'VoiceAgent',
      MetricData: [
        {
          MetricName: 'SessionCost',
          Value: totalCost,
          Unit: 'None',
          Dimensions: [{ Name: 'SessionId', Value: sessionId }]
        },
        {
          MetricName: 'APICallCount',
          Value: this.costs.length,
          Unit: 'Count'
        }
      ]
    }).promise();
 
    console.log(`Session ${sessionId} cost: $${totalCost.toFixed(4)}`);
  }
}
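
Before wiring up CloudWatch, it helps to be able to price a session on paper. The sketch below applies the per-unit prices quoted in this article (Claude 3.5 Sonnet at $3 and $15 per million tokens, Whisper at $0.006 per minute, Google Cloud TTS at $16 per million characters) to a usage record. `estimateSessionCost` and `SessionUsage` are hypothetical names, and the prices are a snapshot that should be checked against each provider's current pricing page.

```typescript
// Prices in USD, taken from the figures quoted in this article.
const PRICES = {
  claudeInputPerMTok: 3,
  claudeOutputPerMTok: 15,
  whisperPerMinute: 0.006,
  googleTtsPerMChar: 16
};

interface SessionUsage {
  audioSeconds: number;   // audio sent to STT
  inputTokens: number;    // Claude input tokens, summed over all turns
  outputTokens: number;   // Claude output tokens, summed over all turns
  ttsCharacters: number;  // characters sent to TTS
}

interface SessionCost {
  stt: number;
  llm: number;
  tts: number;
  total: number;
}

function estimateSessionCost(usage: SessionUsage): SessionCost {
  const stt = (usage.audioSeconds / 60) * PRICES.whisperPerMinute;
  const llm =
    (usage.inputTokens / 1_000_000) * PRICES.claudeInputPerMTok +
    (usage.outputTokens / 1_000_000) * PRICES.claudeOutputPerMTok;
  const tts = (usage.ttsCharacters / 1_000_000) * PRICES.googleTtsPerMChar;
  return { stt, llm, tts, total: stt + llm + tts };
}
```

Keep in mind that input tokens grow with every turn because the conversation history is resent each time, so the LLM share of a long session rises faster than the audio minutes do.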

Metrics Collection with Prometheus

// src/monitoring/metrics.ts
import prom from 'prom-client';
 
export const voiceAgentMetrics = {
  totalSessions: new prom.Counter({
    name: 'voice_agent_total_sessions',
    help: 'Total number of sessions',
    labelNames: ['status'] // success, failed, timeout
  }),
 
  apiCallsTotal: new prom.Counter({
    name: 'voice_agent_api_calls_total',
    help: 'Total API calls by service',
    labelNames: ['service'] // claude, whisper, tts
  }),
 
  sessionDurationSeconds: new prom.Histogram({
    name: 'voice_agent_session_duration_seconds',
    help: 'Session duration in seconds',
    buckets: [10, 30, 60, 300, 600]
  }),
 
  apiLatencyMs: new prom.Histogram({
    name: 'voice_agent_api_latency_ms',
    help: 'API latency in milliseconds',
    labelNames: ['service'],
    buckets: [50, 100, 200, 500, 1000, 2000]
  }),
 
  activeSessions: new prom.Gauge({
    name: 'voice_agent_active_sessions',
    help: 'Number of currently active sessions'
  })
};
 
export function recordSessionMetric(durationSeconds: number, success: boolean): void {
  voiceAgentMetrics.totalSessions.inc({ status: success ? 'success' : 'failed' });
  voiceAgentMetrics.sessionDurationSeconds.observe(durationSeconds);
}

Six lessons that aren't in the official docs

The sections above are the design story. Below are six things I only learned by running this stack in production for Dolice Labs across 4 sites, and across 6 indie iOS/Android apps I have shipped since 2014 (about 50 million cumulative downloads, monetized largely through AdMob). I am Masaki Hirokawa, the indie developer and artist behind Dolice Labs.

1. Measure the latency budget in three layers (1500ms total)

End-to-end voice-to-voice latency above 1500ms breaks the conversational rhythm. Users start asking "did it cut out?" and re-speak before the agent can answer. That number is the threshold I keep coming back to across every voice product I have shipped.

My production budget split, measured on Cloudflare from Tokyo to us-east-1:

Layer	Budget	p50	p95
STT (Deepgram Streaming)	600ms	280ms	480ms
Claude Haiku response	400ms	320ms	620ms
TTS (ElevenLabs Flash v2)	400ms	240ms	410ms
Network round-trip	100ms	70ms	130ms
Total	1500ms	910ms	1640ms

If you call Whisper REST naively, inference only fires after the full audio clip lands, which adds 600 to 900ms after the last syllable. Deepgram Streaming uses VAD to predict the endpoint, which cuts perceived latency roughly in half. I started with Whisper REST for simplicity, watched p95 exceed 1800ms, and migrated to Deepgram Streaming three weeks later.
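For reference, here is a minimal sketch of how p50 and p95 columns like the ones in the table above can be produced from raw per-stage samples. It assumes the nearest-rank definition of a percentile (monitoring tools often interpolate instead), and both the `percentile` helper and the sample values are illustrative, not measured data. Note also that a Total row built by summing per-layer p95 values is not the same thing as the p95 of end-to-end latency, so it is worth recording the per-session total as its own series.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile() needs at least one sample');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = samples.slice().sort((a, b) => a - b); // copy, then numeric sort
  const rank = Math.ceil((p * sorted.length) / 100);    // 1-based rank
  return sorted[rank - 1];
}

// Illustrative STT latencies in milliseconds (made-up values).
const sttSamplesMs = [210, 250, 260, 280, 290, 300, 340, 390, 450, 520];
const sttP50 = percentile(sttSamplesMs, 50);
const sttP95 = percentile(sttSamplesMs, 95);
console.log(`stt p50=${sttP50}ms p95=${sttP95}ms`);
```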

// Production budget checker: logs a warning for any over-budget session
// (console.warn here; forward it to Sentry or a similar tool in production)
interface LatencyBudget {
  stt: { budget: 600; actual?: number };
  llm: { budget: 400; actual?: number };
  tts: { budget: 400; actual?: number };
  network: { budget: 100; actual?: number };
}
 
export function assertBudget(b: LatencyBudget, sessionId: string) {
  const total = (b.stt.actual ?? 0) + (b.llm.actual ?? 0) +
                (b.tts.actual ?? 0) + (b.network.actual ?? 0);
  if (total > 1500) {
    console.warn(`[budget-exceeded] session=${sessionId} total=${total}ms`, b);
  }
  return total <= 1500;
}

2. How I cut per-session cost from $0.024 to $0.011

Running AdMob revenue at scale gave me a habit of pricing every feature in dollars-per-session before I write the first line of code. The first build (Whisper + Sonnet + ElevenLabs Multilingual) ran roughly $0.024 per 3-minute session. Over 9 weeks I brought it to $0.011 with four decisions.

Decision	Before	After	Reduction
STT: Whisper → Deepgram Nova-2	$0.006/min	$0.0043/min	-28%
LLM first-pass: Sonnet → Haiku	$3/1M tok	$0.25/1M tok	-92%
Sonnet only for "complex" queries	100% Sonnet	18% Sonnet	-82%
TTS: ElevenLabs Multilingual → Flash v2	$0.30/1k chars	$0.10/1k chars	-67%
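The two LLM rows combine into a single blended price. The sketch below works that arithmetic out from the table's list prices, under two simplifying assumptions that are not from the measurements above: every call uses roughly the same number of tokens regardless of model, and only the per-token price in the table matters. The function name is hypothetical.

```typescript
// Prices in USD per 1M tokens, taken from the table above.
const PRICE_PER_MTOK = { sonnet: 3.0, haiku: 0.25 };

// Blended price when a given share of calls is routed to Sonnet.
function blendedPricePerMTok(sonnetShare: number): number {
  if (sonnetShare < 0 || sonnetShare > 1) throw new RangeError('share must be in [0, 1]');
  return sonnetShare * PRICE_PER_MTOK.sonnet + (1 - sonnetShare) * PRICE_PER_MTOK.haiku;
}

const blended = blendedPricePerMTok(0.18);              // 0.18 * 3 + 0.82 * 0.25
const reduction = 1 - blended / PRICE_PER_MTOK.sonnet;  // versus 100% Sonnet
console.log(`blended=$${blended.toFixed(3)}/1M tok, reduction=${(reduction * 100).toFixed(1)}%`);
```

Under those assumptions the blended LLM price comes out near $0.745 per 1M tokens, roughly 75% below an all-Sonnet setup.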

The Sonnet routing rule deserves its own warning. My first attempt was to let Claude itself judge complexity, but the judge ran on Sonnet too, so the savings vanished. The rule that actually works in production is dumb on purpose: input over 80 characters, OR a technical term in the last 3 turns, OR the system prompt explicitly escalated. Simple rules beat clever judges.
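A minimal sketch of that three-condition rule follows. The `pickModel` name, the `Turn` shape, and the term list are hypothetical, and the sketch assumes `history` already includes the current utterance as its last element. Word matching is plain lowercase ASCII tokenization, which would need replacing for Japanese or other non-space-delimited input.

```typescript
type Model = 'haiku' | 'sonnet';

interface Turn {
  role: 'user' | 'assistant';
  text: string;
}

// Hypothetical term list; swap in the vocabulary of your own domain.
const TECHNICAL_TERMS = ['api', 'webhook', 'oauth', 'invoice', 'refund'];

function containsTechnicalTerm(text: string): boolean {
  const words = text.toLowerCase().split(/[^a-z0-9]+/);
  return words.some((w) => TECHNICAL_TERMS.indexOf(w) !== -1);
}

// Escalate to Sonnet if ANY of the three conditions holds; otherwise Haiku.
function pickModel(input: string, history: Turn[], escalated: boolean): Model {
  const longInput = input.length > 80;              // rule 1: over 80 characters
  const technical = history.slice(-3)               // rule 2: term in last 3 turns
    .some((turn) => containsTechnicalTerm(turn.text));
  return longInput || technical || escalated ? 'sonnet' : 'haiku'; // rule 3: flag
}

const choice = pickModel('Why did the webhook fail?', [
  { role: 'user', text: 'Why did the webhook fail?' },
], false);
console.log(choice);
```

Because the rule is a pure function with no model call, it adds no latency or cost to the turn and is trivial to unit test.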

3. Cloudflare Workers cannot be the inference layer

Dolice Labs runs all 4 sites on Cloudflare Workers + OpenNext, and my first instinct was "let me put the voice agent there too." It does not work, for three concrete reasons.

30-second CPU limit: every conversation turn burns 1–2s of CPU, and long sessions exceed the limit. Durable Objects share the same quota.
WebSocket constraints: Workers WebSockets cap individual frames at 32KB and disconnect at 16 hours. Bidirectional audio streaming requires Durable Objects + Hibernation API, which is much more design overhead than people assume.
Audio library bundling: ffmpeg WASM builds usually push you past the 10MB Worker bundle limit.

What I run today: Cloudflare Workers for signaling and session management on Durable Objects, and Google Cloud Run for the audio processing + Anthropic API path. The Cloudflare side still owns membership gating (premium_token cookie), and only paid users get a signed JWT for the Cloud Run WebSocket.

// Workers issues a short-lived JWT; Cloud Run verifies it on WebSocket upgrade
import { SignJWT } from 'jose';
 
export async function issueVoiceSessionToken(env: Env, userId: string) {
  const secret = new TextEncoder().encode(env.VOICE_SESSION_SECRET);
  return await new SignJWT({ sub: userId, scope: 'voice-session' })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime('15m')
    .sign(secret);
}

4. A three-tier fallback so I never get paged at 2am

Both of my grandfathers were temple carpenters in Japan. Their rule was "fix what you can fix before you go home, even in the rain." I apply the same rule to production: assume every dependency will fail, and have three layers ready.

STT failure: Deepgram → Whisper REST (adds 600ms, acceptable UX hit)
Claude failure: Sonnet → Haiku → pre-recorded "Sorry, could you say that again" TTS (cost near zero)
TTS failure: ElevenLabs → OpenAI TTS → Browser Web Speech API (audio quality hit, still usable)

Whether to tell the user about the degradation is a product decision. For free assistants I stay silent; for membership users I display "running in simplified mode" so the trust signal stays intact. Honesty is the foundation of paid membership.
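The tiers above share one shape: try providers in order, remember which one answered, and flag the result as degraded if it was not the first. Here is a minimal sketch of that pattern. `withFallback`, `Provider`, and the stub providers are hypothetical names for illustration; real tiers would wrap the actual Deepgram, Anthropic, or TTS client calls and would likely add per-tier timeouts.

```typescript
interface Provider<T> {
  name: string;
  run: () => Promise<T>;
}

interface FallbackResult<T> {
  value: T;
  provider: string;   // which tier actually answered
  degraded: boolean;  // true if any tier before it failed
  errors: string[];   // one entry per failed tier, for logging
}

// Try each tier in order and return the first success.
async function withFallback<T>(tiers: Provider<T>[]): Promise<FallbackResult<T>> {
  const errors: string[] = [];
  for (const tier of tiers) {
    try {
      const value = await tier.run();
      return { value, provider: tier.name, degraded: errors.length > 0, errors };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${tier.name}: ${message}`);
    }
  }
  throw new Error(`all tiers failed: ${errors.join('; ')}`);
}

// Usage with stub providers standing in for the real STT clients.
const sttTiers: Provider<string>[] = [
  { name: 'deepgram', run: async () => { throw new Error('503'); } },
  { name: 'whisper-rest', run: async () => 'hello world' },
];

withFallback(sttTiers).then((result) => {
  // For membership users, `degraded` can drive the "simplified mode" notice.
  console.log(result.provider, result.degraded, result.value);
});
```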

5. Hold conversation history as a graph, not a flat array

Voice users cannot consciously segment context the way chat users do. "Wait, not that one, the earlier one" happens constantly. My first version stored a flat array and dumped it into the Claude context window. Even when I stayed under 200k tokens, answer quality drifted because recent turns bled into how the model interpreted older ones.

I now hold the conversation as three node types:

Question node: the user's intent plus the answer the agent gave
Topic node: an intermediate node that groups question nodes about the same subject
State node: an explicit state transition like "booking → awaiting confirmation → canceled"

Each Claude call receives only three things: the last 6 turns in full, the current topic-node summary, and the state node. Effective context fits in 4k–8k tokens and my eval set shows 30–40% accuracy improvement on multi-topic conversations.
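A minimal sketch of the three node types and the context assembly step is below. All type and function names are hypothetical. It assumes one "turn" means one message, and it drops a leading assistant message after slicing, on the assumption that the Messages API expects the conversation to open with a user message.

```typescript
interface QuestionNode { id: string; topicId: string; intent: string; answer: string; }
interface TopicNode { id: string; label: string; summary: string; questionIds: string[]; }
interface StateNode { flow: string; current: string; history: string[]; }
interface Message { role: 'user' | 'assistant'; content: string; }

interface ConversationGraph {
  messages: Message[];                   // full transcript, oldest first
  questions: QuestionNode[];
  topics: { [id: string]: TopicNode };
  currentTopicId: string | null;
  state: StateNode | null;
}

// Build the compact per-call context: recent turns verbatim, plus the
// current topic summary and state folded into the system prompt.
function buildContext(graph: ConversationGraph, maxTurns = 6): { system: string; messages: Message[] } {
  let recent = graph.messages.slice(-maxTurns);
  while (recent.length > 0 && recent[0].role !== 'user') {
    recent = recent.slice(1); // keep the window starting on a user message
  }

  const topic = graph.currentTopicId ? graph.topics[graph.currentTopicId] : undefined;
  const lines: string[] = [];
  if (topic) lines.push(`Current topic (${topic.label}): ${topic.summary}`);
  if (graph.state) lines.push(`State (${graph.state.flow}): ${graph.state.current}`);

  return { system: lines.join('\n'), messages: recent };
}
```

Older question nodes stay in the graph rather than in the prompt, so a request like "the earlier one" can be resolved by looking up a previous topic node and swapping its summary in, instead of replaying the whole transcript.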

6. Split UX and cost dashboards in Grafana

Prometheus + Grafana is the obvious choice, but the production lesson is: never put UX metrics and cost metrics on the same board. When p95 latency spikes, you need to ask "should I roll back the Haiku-first routing I adopted to save money?" without the cost number pulling your eye.

My two boards:

UX board: p50/p95 latency, recognition failure rate, mid-session abandonment, user re-ask rate
Cost board: avg cost per session, STT/LLM/TTS share, Sonnet ratio, free vs. premium unit-cost gap

A Grafana variable for tier=free|premium makes it fast to ask "does cost-cutting hurt the paying users?" — which, running a Stripe membership across 4 sites, is the question I check first every morning.

What I rely on when I translate the design into code

Numbers and code matter, but the question I lead with is "what am I trading my own time for?" My budget — p95 1500ms, $0.015 per session — exists so I can decide in 30 seconds whether a 2am alert needs me out of bed.

Voice agents demand more "humanness" than text chat. Being fast, cheap, and reliable at the same time is fundamentally hard, but layering budgets, holding three tiers of fallback, and splitting the dashboards keeps the on-call rotation survivable.

Thank you for reading this far. If you are building voice agents on a similar stack, I hope these numbers and decisions save you a few weekends.
