Building an AI Chatbot with Claude API — Streaming, Conversation History & Cost Optimization

Why Build Your Own Claude-Powered Chatbot?

Off-the-shelf chatbot tools are convenient, but they get in the way once you need behavior tailored to your own use case. By calling the Claude API directly, you control the system prompt, integrate your own data, decide exactly where tokens are spent, and can embed the model anywhere in your stack.

ℹ️

All code in this article requires **Python 3.10+**. For API key setup, see [Claude API Quickstart](/articles/api-sdk/api-quickstart).

STEP 1: The Minimal Chatbot

Start with the simplest possible implementation.

import anthropic
 
client = anthropic.Anthropic(api_key="YOUR_API_KEY")
 
def chat(user_message: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": user_message}
        ]
    )
    return message.content[0].text
 
# Run it
response = chat("Write a Python function to check if a number is prime")
print(response)

This works, but there's a critical flaw: conversation context isn't preserved. Each message starts fresh with no memory of previous turns.

STEP 2: Managing Conversation History

The Claude API is stateless, so you must manage conversation history on the client side.

import anthropic
from typing import List
 
client = anthropic.Anthropic(api_key="YOUR_API_KEY")
 
class ChatSession:
    def __init__(self, system_prompt: str = ""):
        self.history: List[dict] = []
        self.system_prompt = system_prompt
 
    def send(self, user_message: str) -> str:
        # Add user message to history
        self.history.append({
            "role": "user",
            "content": user_message
        })
 
        # Call API with full history
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=self.system_prompt,
            messages=self.history
        )
 
        assistant_message = response.content[0].text
 
        # Add assistant response to history
        self.history.append({
            "role": "assistant",
            "content": assistant_message
        })
 
        return assistant_message
 
    def clear(self):
        """Reset conversation"""
        self.history = []
 
# Usage example
session = ChatSession(
    system_prompt="You are a Python code reviewer. "
                  "Identify issues and suggest improvements with clear explanations."
)
 
print(session.send("Review this code:\ndef add(a, b): return a+b"))
print(session.send("Now add type hints to the improved version"))  # Remembers context

ℹ️

Pass `system` as a separate parameter rather than including it in history. This ensures consistent behavior regardless of conversation length.

STEP 3: Implementing Streaming Responses

Long responses create a poor UX when users wait for the full reply. Stream tokens as they're generated.

import anthropic
 
client = anthropic.Anthropic(api_key="YOUR_API_KEY")
 
class StreamingChatSession:
    def __init__(self, system_prompt: str = ""):
        self.history = []
        self.system_prompt = system_prompt
 
    def send_stream(self, user_message: str):
        """Yield text chunks as they arrive"""
        self.history.append({"role": "user", "content": user_message})
 
        full_response = ""
 
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=self.system_prompt,
            messages=self.history
        ) as stream:
            for text in stream.text_stream:
                full_response += text
                yield text  # Return text incrementally
 
        # Save complete response to history
        self.history.append({"role": "assistant", "content": full_response})
 
# CLI example
session = StreamingChatSession(system_prompt="You are a helpful assistant.")
 
while True:
    user_input = input("\nYou: ").strip()
    if user_input.lower() in ["quit", "exit"]:
        break
 
    print("Claude: ", end="", flush=True)
    for chunk in session.send_stream(user_input):
        print(chunk, end="", flush=True)
    print()  # newline

STEP 4: Exposing as a Web API with FastAPI

Go beyond the CLI and build an HTTP API your frontend can consume.

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import anthropic
import json
from typing import Optional
 
app = FastAPI()
client = anthropic.Anthropic(api_key="YOUR_API_KEY")
 
# Simple in-memory session store (use Redis in production)
sessions: dict[str, list] = {}
 
class ChatRequest(BaseModel):
    session_id: str
    message: str
    system_prompt: Optional[str] = "You are a helpful assistant."
 
class ChatResponse(BaseModel):
    response: str
    session_id: str
 
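@app.post("/chat")
def chat(request: ChatRequest):
    # Plain `def` on purpose: FastAPI runs sync endpoints in a threadpool,
    # so the blocking SDK call below does not stall the event loop.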
    if request.session_id not in sessions:
        sessions[request.session_id] = []
 
    history = sessions[request.session_id]
    history.append({"role": "user", "content": request.message})
 
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=request.system_prompt,
            messages=history
        )
 
        assistant_message = response.content[0].text
        history.append({"role": "assistant", "content": assistant_message})
 
        return ChatResponse(response=assistant_message, session_id=request.session_id)
 
    except anthropic.APIError as e:
        # Drop the unanswered user turn so a client retry does not duplicate it
        history.pop()
        raise HTTPException(status_code=500, detail=str(e))
 
@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    """Server-Sent Events streaming endpoint"""
    if request.session_id not in sessions:
        sessions[request.session_id] = []
 
    history = sessions[request.session_id]
    history.append({"role": "user", "content": request.message})
 
    def generate():
        full_response = ""
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=request.system_prompt,
            messages=history
        ) as stream:
            for text in stream.text_stream:
                full_response += text
                yield f"data: {json.dumps({'text': text})}\n\n"
 
        history.append({"role": "assistant", "content": full_response})
        yield f"data: {json.dumps({'done': True})}\n\n"
 
    return StreamingResponse(generate(), media_type="text/event-stream")
 
@app.delete("/session/{session_id}")
async def clear_session(session_id: str):
    sessions.pop(session_id, None)
    return {"status": "cleared"}

Run it:

pip install fastapi uvicorn anthropic
uvicorn main:app --reload
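
On the consuming side, the /chat/stream endpoint above emits lines of the form `data: {...}` separated by blank lines. The following is a minimal parsing sketch for a Python client. `iter_sse_text` is a hypothetical helper name, and it assumes you already have an iterable of decoded text lines (for example from your HTTP client's line iterator).

```python
import json
from typing import Iterable, Iterator

def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text chunks from the `data: {...}` events sent by /chat/stream.

    Stops as soon as the {"done": true} event arrives. Hypothetical helper.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = json.loads(line[len("data:"):].strip())
        if payload.get("done"):
            return
        if "text" in payload:
            yield payload["text"]

# Simulated stream, shaped like the server output above
sample = [
    'data: {"text": "Hel"}', "",
    'data: {"text": "lo"}', "",
    'data: {"done": true}', "",
]
print("".join(iter_sse_text(sample)))  # -> Hello
```

A browser frontend would do the same thing with a streaming fetch reader; the parsing logic is identical.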

STEP 5: Cost Optimization

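As conversations grow, token consumption grows faster than the conversation itself: the full history is re-sent as input on every turn, so each new exchange costs more than the one before. Here are the key optimization techniques.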
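
To make that growth concrete, here is a back-of-the-envelope sketch. It assumes every message (user or assistant) is the same size and that the whole history is re-sent each turn, as the ChatSession in STEP 2 does. The 200-token figure is an illustrative assumption, not a measurement, and the system prompt is ignored.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int = 200) -> int:
    """Total input tokens sent over `turns` exchanges when the full
    history is re-sent every time."""
    total = 0
    history_messages = 0
    for _ in range(turns):
        history_messages += 1   # the new user message
        total += history_messages * tokens_per_message
        history_messages += 1   # the assistant reply joins the history
    return total

for n in (5, 10, 20, 40):
    print(n, cumulative_input_tokens(n))
# 5 5000
# 10 20000
# 20 80000
# 40 320000
```

Under these assumptions the total is quadratic in the number of turns: doubling the conversation length quadruples the input tokens you pay for.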

5-1: Summarize and Trim History

import json

import anthropic

def summarize_and_trim(
    client: anthropic.Anthropic,
    history: list,
    max_turns: int = 10
) -> list:
    if len(history) <= max_turns * 2:
        return history
 
    old_history = history[:-max_turns * 2]
    recent_history = history[-max_turns * 2:]
 
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use cheap model for summarization
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 3 sentences:\n\n{json.dumps(old_history)}"
        }]
    )
 
    summary = summary_response.content[0].text
 
    return [
        {"role": "user", "content": f"[Conversation summary] {summary}"},
        {"role": "assistant", "content": "Understood. I have context from our previous conversation."},
        *recent_history
    ]

5-2: Prompt Caching for Long System Prompts

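If your system prompt is long (internal docs, specs), prompt caching can cut the cost of those repeated input tokens by up to 90%, because cache reads are billed at a fraction of the normal input price. Very short prompts fall below the minimum cacheable length and are simply processed without caching.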

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your very long system prompt here (internal docs, product specs, etc.)...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=history
)
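
A rough worked example shows where the savings come from. The multipliers below (a cache write at 1.25x the base input price, a cache read at 0.1x) are assumptions based on the published pricing for the default short-lived cache at the time of writing, so check the current pricing page before relying on them. The token counts are illustrative.

```python
def system_prompt_cost_units(
    prompt_tokens: int,
    requests: int,
    cached: bool,
    write_multiplier: float = 1.25,  # assumed cache-write price vs. base input
    read_multiplier: float = 0.10,   # assumed cache-read price vs. base input
) -> float:
    """Cost of the system prompt alone, in 'base input token' units."""
    if not cached:
        return float(prompt_tokens * requests)
    # The first request writes the cache; later requests that arrive while
    # the cache entry is still alive read from it.
    return (prompt_tokens * write_multiplier
            + prompt_tokens * read_multiplier * (requests - 1))

plain = system_prompt_cost_units(10_000, 20, cached=False)
cached = system_prompt_cost_units(10_000, 20, cached=True)
print(plain, cached, f"{1 - cached / plain:.0%} saved")
```

With these assumed numbers, 20 requests against a 10,000-token prompt cost roughly 31,500 token-units instead of 200,000. The saving approaches 90% only as the number of cache hits grows, and a single request is actually more expensive with caching than without.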

5-3: Choose the Right Model for Each Task

| Use Case | Recommended Model | Reason |
| --- | --- | --- |
| Simple Q&A | claude-haiku-4-5-20251001 | Fast, cheap |
| Coding assistance | claude-sonnet-4-6 | Best balance |
| Complex reasoning | claude-opus-4-6 | Highest accuracy |
| Summarization/classification | claude-haiku-4-5-20251001 | Cost optimal |

STEP 6: Error Handling and Retries

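Production systems need robust error handling. Keep in mind that the SDK already retries certain failures on its own (two retries by default), so a wrapper like the one below stacks on top of that. Pass max_retries=0 when constructing the client if you want this loop to be the only retry policy.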

import anthropic
import time
from typing import Optional
 
client = anthropic.Anthropic(api_key="YOUR_API_KEY")
 
def safe_chat(
    messages: list,
    system: str = "",
    max_retries: int = 3,
    retry_delay: float = 1.0
) -> Optional[str]:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=2048,
                system=system,
                messages=messages
            )
            return response.content[0].text
 
        except anthropic.RateLimitError:
            wait_time = retry_delay * (2 ** attempt)
            print(f"Rate limited. Retrying in {wait_time}s...")
            time.sleep(wait_time)
 
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                print(f"Server error ({e.status_code}). Retrying...")
                time.sleep(retry_delay)
            else:
                print(f"Client error: {e.message}")
                return None
 
        except anthropic.APIConnectionError:
            print("Connection error. Retrying...")
            time.sleep(retry_delay)
 
    print("Max retries reached")
    return None

STEP 7: The Complete Production Class

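Here the pieces from the previous steps come together in a single class. It covers history, streaming, and token tracking; the retries from STEP 6 and the summarization from STEP 5 are left out to keep it short.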

import anthropic
import uuid
from typing import Generator
 
class ProductionChatBot:
    def __init__(
        self,
        api_key: str,
        system_prompt: str = "You are a helpful assistant.",
        model: str = "claude-sonnet-4-6",
        max_tokens: int = 2048,
        max_history_turns: int = 20,
    ):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.system_prompt = system_prompt
        self.model = model
        self.max_tokens = max_tokens
        self.max_history_turns = max_history_turns
        self.history = []
        self.session_id = str(uuid.uuid4())
        self.total_tokens_used = 0
 
    def send(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})
 
        response = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            system=self.system_prompt,
            messages=self.history
        )
 
        assistant_text = response.content[0].text
        self.history.append({"role": "assistant", "content": assistant_text})
        self.total_tokens_used += response.usage.input_tokens + response.usage.output_tokens
 
        if len(self.history) > self.max_history_turns * 2:
            self.history = self.history[-(self.max_history_turns * 2):]
 
        return assistant_text
 
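    def stream(self, message: str) -> Generator[str, None, None]:
        self.history.append({"role": "user", "content": message})
        full_response = ""

        with self.client.messages.stream(
            model=self.model,
            max_tokens=self.max_tokens,
            system=self.system_prompt,
            messages=self.history
        ) as stream:
            for text in stream.text_stream:
                full_response += text
                yield text
            # Read usage from the final message so streamed turns are counted too
            usage = stream.get_final_message().usage

        self.history.append({"role": "assistant", "content": full_response})
        self.total_tokens_used += usage.input_tokens + usage.output_tokens

        # Apply the same history cap as send()
        if len(self.history) > self.max_history_turns * 2:
            self.history = self.history[-(self.max_history_turns * 2):]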
 
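    def get_stats(self) -> dict:
        return {
            "session_id": self.session_id,
            "turns": len(self.history) // 2,
            "total_tokens": self.total_tokens_used,
            # Rough figure only: applies one flat rate of $3 per million tokens
            # to input and output alike. Output tokens are usually priced higher,
            # so treat this as a lower bound and check current pricing.
            "estimated_cost_usd": self.total_tokens_used / 1_000_000 * 3.0
        }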
 
    def reset(self):
        self.history = []
        self.total_tokens_used = 0
 
# Example usage
bot = ProductionChatBot(
    api_key="YOUR_API_KEY",
    system_prompt="""You are a senior Python engineer with deep expertise
    in clean code, performance optimization, and production best practices."""
)
 
print("Claude: ", end="")
for chunk in bot.stream("What are the best practices for building a FastAPI CRUD app?"):
    print(chunk, end="", flush=True)
print()
 
print(f"\n📊 Stats: {bot.get_stats()}")

STEP 8: Connecting to LINE, Telegram, and Discord — Platform-Specific Differences

So far we've run the chatbot from a terminal and our own web API. To put it in front of real users, you'll often want it on a messaging platform like LINE or Telegram.

When I first deployed a chatbot to LINE as an indie developer, the thing that caught me off guard was how short-lived the reply token is.

LINE and Telegram share the same skeleton: the platform POSTs to your server via a webhook, you verify the request, you call Claude, and you reply. Discord works differently and is covered last. The differences live in the details, and that's where people get stuck.

LINE: the reply token has a short life

The first thing that trips people up with the LINE Messaging API is the reply token. A reply token can be used only once, and it expires quickly (about a minute after it's issued). A few seconds of Claude latency is fine, but if you add long generations or tool calls, the token can expire before you reply.

A stable pattern is a two-step approach: show a loading animation right away so the user sees progress, then send the answer as a push message once it's ready. Push messages are addressed by user ID, so they don't depend on the reply token at all.

Signature verification is mandatory. If you don't verify the X-Line-Signature header, anyone can send forged events to your endpoint.

import hmac
import hashlib
import base64
import os
import requests
from fastapi import FastAPI, Request, HTTPException
 
app = FastAPI()
LINE_CHANNEL_SECRET = os.environ["LINE_CHANNEL_SECRET"]
LINE_ACCESS_TOKEN = os.environ["LINE_CHANNEL_ACCESS_TOKEN"]
 
def verify_line_signature(body: bytes, signature: str) -> bool:
    """Verify X-Line-Signature with HMAC-SHA256."""
    digest = hmac.new(
        LINE_CHANNEL_SECRET.encode("utf-8"), body, hashlib.sha256
    ).digest()
    expected = base64.b64encode(digest).decode("utf-8")
    return hmac.compare_digest(expected, signature)
 
def line_push(user_id: str, text: str) -> None:
    """Send via push message, which doesn't depend on a reply token."""
    requests.post(
        "https://api.line.me/v2/bot/message/push",
        headers={"Authorization": f"Bearer {LINE_ACCESS_TOKEN}"},
        json={"to": user_id, "messages": [{"type": "text", "text": text}]},
        timeout=10,
    )
 
@app.post("/line/webhook")
async def line_webhook(request: Request):
    body = await request.body()
    signature = request.headers.get("X-Line-Signature", "")
    if not verify_line_signature(body, signature):
        raise HTTPException(status_code=403, detail="signature mismatch")
 
    events = (await request.json()).get("events", [])
    for event in events:
        if event.get("type") == "message" and event["message"]["type"] == "text":
            user_id = event["source"]["userId"]
            user_text = event["message"]["text"]
            # Keep history per user (reuse ProductionChatBot from STEP 7)
            reply = get_bot_for(user_id).send(user_text)
            line_push(user_id, reply)
    return {"ok": True}

get_bot_for(user_id) is meant to look up a ProductionChatBot instance keyed by user ID. An in-process dictionary is wiped on restart, so in production it's safer to move conversation history to an external store like Redis.
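
Since get_bot_for is not defined anywhere above, here is one minimal in-process sketch. It assumes a factory callable that builds a fresh bot per user; in this article's setting that would be something like `lambda: ProductionChatBot(api_key=...)`. `BotRegistry` is a hypothetical name, and the stand-in `EchoBot` exists only so the snippet runs without an API key.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")

class BotRegistry(Generic[T]):
    """Create one bot per user on first use and return the same one afterwards."""

    def __init__(self, factory: Callable[[], T]):
        self._factory = factory
        self._bots: Dict[str, T] = {}

    def get(self, user_id: str) -> T:
        if user_id not in self._bots:
            self._bots[user_id] = self._factory()
        return self._bots[user_id]

class EchoBot:
    """Stand-in for ProductionChatBot, used only for this demo."""

    def __init__(self):
        self.history = []

    def send(self, message: str) -> str:
        self.history.append(message)
        return f"echo: {message}"

registry = BotRegistry(EchoBot)
get_bot_for = registry.get  # same call shape as in the webhook handlers

print(get_bot_for("alice").send("hi"))               # -> echo: hi
print(get_bot_for("alice") is get_bot_for("alice"))  # -> True
```

Like the session dictionary in STEP 4, this loses everything on restart and is not shared between worker processes, which is the reason to move history to an external store later.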

Telegram: choose webhook or long polling

Telegram offers two ways to receive messages: long polling, where you loop on getUpdates, and webhook, where the platform POSTs to a public endpoint.

Polling is far easier during development — no public URL, no HTTPS certificate, it just runs locally. For a server that stays up in production, a webhook is leaner. When you use a webhook, register it with a secret_token and verify the X-Telegram-Bot-Api-Secret-Token header on incoming requests.

Since Claude takes a moment to respond, calling sendChatAction to show "typing…" makes the experience feel noticeably better.

import os
import requests
 
TG_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
API = f"https://api.telegram.org/bot{TG_TOKEN}"
 
def tg_typing(chat_id: int) -> None:
    requests.post(f"{API}/sendChatAction",
                  json={"chat_id": chat_id, "action": "typing"}, timeout=10)
 
def tg_send(chat_id: int, text: str) -> None:
    requests.post(f"{API}/sendMessage",
                  json={"chat_id": chat_id, "text": text}, timeout=10)
 
def run_polling():
    """Development: receive via long polling."""
    offset = None
    while True:
        resp = requests.get(f"{API}/getUpdates",
                            params={"timeout": 30, "offset": offset}, timeout=40)
        for update in resp.json().get("result", []):
            offset = update["update_id"] + 1
            msg = update.get("message")
            if not msg or "text" not in msg:
                continue
            chat_id = msg["chat"]["id"]
            tg_typing(chat_id)
            reply = get_bot_for(str(chat_id)).send(msg["text"])
            tg_send(chat_id, reply)
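
For the webhook route described earlier, the check is a plain comparison between the header Telegram sends and the secret you registered. A minimal sketch follows; `is_valid_telegram_request` is a hypothetical helper, it takes the request headers as a plain mapping, and the secret would normally come from an environment variable rather than a literal.

```python
import hmac
from typing import Mapping

SECRET_HEADER = "X-Telegram-Bot-Api-Secret-Token"

def is_valid_telegram_request(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Return True only if the request carries the registered secret_token."""
    # Header names are case-insensitive, so normalize before looking up
    lowered = {k.lower(): v for k, v in headers.items()}
    received = lowered.get(SECRET_HEADER.lower(), "")
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(
        received.encode("utf-8"), expected_secret.encode("utf-8")
    )

print(is_valid_telegram_request({SECRET_HEADER: "s3cret"}, "s3cret"))  # -> True
print(is_valid_telegram_request({}, "s3cret"))                          # -> False
```

In a FastAPI handler you would call this with request.headers and return 403 when it is False, mirroring the LINE handler above.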

Discord: You Have Three Seconds to Say "Got It"

LINE and Telegram can both deliver events over webhooks. I started writing the Discord integration with the same assumptions, and every slash command I built came back with nothing at all.

With a Gateway library, events don't arrive over a webhook. They come through a persistent WebSocket connection (the Gateway), so there's no signature to verify. The code below is written against Pycord, a discord.py fork that is also imported as discord; discord.py itself exposes slash commands through a different API (app_commands). Discord swaps the signature requirement for a different one: a slash command must acknowledge within three seconds, or the interaction is dead.

Claude takes a few seconds to generate. Write the obvious version — receive the question, call Claude, reply — and you will blow through that window almost every time.

The fix is ctx.defer(). Call it first and Discord shows a thinking state, which frees you to send the real answer later via followup.send(). It's a close cousin of setting aside the LINE reply token and switching to a push message.

The second constraint is the 2,000-character limit per message. Claude will sail past that if you let it, so nudge it in the system prompt and truncate on the way out anyway.

import asyncio

import discord
from discord import option
from discord.ext import commands

SYSTEM_PROMPT = """You are an AI assistant living in a Discord server.
Keep answers concise and mind the 2,000-character limit per message.
Always wrap code in code blocks."""

class ClaudeCog(commands.Cog):
    def __init__(self, bot: commands.Bot):
        self.bot = bot

    @discord.slash_command(name="ask", description="Ask Claude a question")
    @option("question", description="Your question")
    async def ask(self, ctx: discord.ApplicationContext, question: str):
        # Dodge the three-second deadline: acknowledge first.
        await ctx.defer()

        # Reuse the ProductionChatBot from STEP 7, keyed per user.
        # send() is blocking, so run it in a worker thread to keep the
        # Gateway heartbeat alive while Claude generates.
        chat_bot = get_bot_for(str(ctx.author.id))
        reply = await asyncio.to_thread(chat_bot.send, question)
 
        # Stay under the 2,000-character ceiling.
        if len(reply) > 1900:
            reply = reply[:1900] + "\n\n…(truncated for length)"
 
        await ctx.followup.send(reply)
 
def setup(bot: commands.Bot):
    bot.add_cog(ClaudeCog(bot))
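
Truncation throws away the tail of a long answer. An alternative sketch is to split the reply into several messages that each fit under the limit, preferring to break on a newline. `split_for_discord` is a hypothetical helper, and it makes no attempt to keep a Markdown code block intact across a chunk boundary.

```python
def split_for_discord(text: str, limit: int = 2000) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking on the
    last newline inside the window when there is one."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)
        if cut <= 0:
            cut = limit  # no usable newline in the window: hard cut
        chunks.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        chunks.append(text)
    return chunks

parts = split_for_discord("line one\n" * 500)
print(len(parts), max(len(p) for p in parts))
```

Inside the cog, the single followup call would become a loop along the lines of `for part in split_for_discord(reply): await ctx.followup.send(part)`.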

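On the startup side, enable the Message Content Intent explicitly if the bot reads message text, as the ! prefix commands configured below do (slash commands alone don't need it). Your code and the Developer Portal both have to agree, or the bot throws on boot and never comes up.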

intents = discord.Intents.default()
intents.message_content = True  # Turn this on in the Developer Portal too.
 
bot = commands.Bot(command_prefix="!", intents=intents)

Three Discord-specific snags worth writing down, all of which I hit myself:

| Symptom | Cause | Fix |
| --- | --- | --- |
| `PrivilegedIntentsRequired` on startup | Message Content Intent is off in the Developer Portal | Bot tab → Privileged Gateway Intents → enable Message Content Intent |
| Slash commands never appear in the picker | Invite URL is missing the `applications.commands` scope | Add the scope in the OAuth2 URL Generator and re-invite the bot |
| Command edits don't show up | Global commands can take up to an hour to propagate | Register guild commands during development for instant updates |

That last one is unglamorous, and it cost me more time than the other two combined. I was convinced the code was broken and rewrote it twice, when all I was actually doing was waiting for propagation. Register guild-scoped commands while you develop. It changes the rhythm of the whole feedback loop.

The Three Platforms, Side by Side

Lined up together, the differences collapse into three things: how messages arrive, what your deadline is, and how you're expected to reply.

| | LINE | Telegram | Discord |
| --- | --- | --- | --- |
| Transport | Webhook | Webhook or long polling | Gateway (persistent WebSocket) |
| Inbound verification | `X-Line-Signature`, HMAC-SHA256 | `secret_token` header comparison | None needed (Gateway is authenticated) |
| Reply deadline | Reply token expires in ~1 minute | Effectively none | 3 seconds for slash commands |
| How you beat the deadline | Show loading, then push message | n/a | `ctx.defer()`, then `followup.send()` |
| Typing indicator | Loading animation API | `sendChatAction` | `channel.typing()` |
| Per-message character cap | 5,000 | 4,096 | 2,000 |

What clicked for me, staring at that table, is that all three are asking for the same thing in different accents: prove your server is alive, and prove it fast. Try to wait for Claude to finish before you say anything, and you'll meet the same wall wearing a different name.

Acknowledge first, deliver the body afterward. Build it that way from the start and porting to the next platform stops being a rewrite.
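
That shared shape can be written once and reused per platform. Here is a minimal asyncio sketch with three hypothetical callbacks: `acknowledge` stands for the loading animation, typing action, or defer; `generate` stands for the Claude call; `deliver` stands for the push message, sendMessage, or followup. None of these names come from a real SDK.

```python
import asyncio
from typing import Awaitable, Callable

async def handle_event(
    prompt: str,
    acknowledge: Callable[[], Awaitable[None]],
    generate: Callable[[str], Awaitable[str]],
    deliver: Callable[[str], Awaitable[None]],
) -> None:
    """Acknowledge immediately, then generate and deliver the real answer."""
    await acknowledge()             # fast: tells the platform we are alive
    reply = await generate(prompt)  # slow: the model call
    await deliver(reply)            # sent through a channel with no deadline

async def demo() -> list[str]:
    log: list[str] = []

    async def acknowledge() -> None:
        log.append("ack")

    async def generate(prompt: str) -> str:
        await asyncio.sleep(0.01)   # stands in for model latency
        return f"answer to: {prompt}"

    async def deliver(text: str) -> None:
        log.append(text)

    await handle_event("hello", acknowledge, generate, deliver)
    return log

print(asyncio.run(demo()))  # -> ['ack', 'answer to: hello']
```

Porting to another platform then means writing two small adapters (acknowledge and deliver) while the handler itself stays untouched.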

STEP 9: Don't Send Raw Personal Data to Claude — PII Masking

When you route internal data or customer messages through a chatbot, personal information (PII) like email addresses and phone numbers flows straight into the API. Keeping raw values out of your request logs and model inputs is a common production requirement.

A practical approach is reversible masking: replace PII with placeholders before sending, then restore it after the response comes back. Claude only ever sees sanitized strings like {{EMAIL_1}}, while the user gets a natural reply with the real values restored.

import re
 
class PIIMasker:
    """Reversibly mask PII before sending and restore it after."""
 
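    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "PHONE": re.compile(r"\b\d{3}[-.]?\d{3,4}[-.]?\d{4}\b"),
        # 13 to 16 digits, starting and ending on a digit so that a trailing
        # space or hyphen is not swallowed into the placeholder
        "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    }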
 
    def mask(self, text: str):
        mapping = {}
        counters = {}
        def replace(kind, m):
            counters[kind] = counters.get(kind, 0) + 1
            token = f"{{{{{kind}_{counters[kind]}}}}}"
            mapping[token] = m.group(0)
            return token
        for kind, pattern in self.PATTERNS.items():
            text = pattern.sub(lambda m, k=kind: replace(k, m), text)
        return text, mapping
 
    def unmask(self, text: str, mapping: dict) -> str:
        for token, original in mapping.items():
            text = text.replace(token, original)
        return text
 
# Usage
masker = PIIMasker()
user_text = "Send the invoice to tanaka@example.com and also call 555-123-4567"
masked, mapping = masker.mask(user_text)
print(masked)
# -> Send the invoice to {{EMAIL_1}} and also call {{PHONE_1}}
 
reply = bot.send(masked)              # Claude only sees the masked text
restored = masker.unmask(reply, mapping)  # the user gets the real values back

One honest caveat: regex-based masking is not a silver bullet. Fixed-format values like emails and phone numbers are easy to catch, but context-dependent PII such as names and addresses will slip through unless you add named-entity recognition (NER) or a dictionary.

If there's information that absolutely must not leave your systems, the most reliable design is to never send that field to Claude in the first place. Treat masking as one layer that reduces accidental leakage, and separate the data you truly need to protect at the input stage. With that two-layer mindset, you'll have a foundation you can explain even when an audit comes around.
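
Separately from what the regexes catch, the round trip also assumes the model returns each placeholder unchanged. A small sketch of a sanity check you could run around unmask is shown here. `PLACEHOLDER` and both helper names are hypothetical, and the pattern assumes the `{{KIND_N}}` token format that PIIMasker produces.

```python
import re

PLACEHOLDER = re.compile(r"\{\{[A-Z]+_\d+\}\}")

def unknown_placeholders(reply: str, mapping: dict) -> list[str]:
    """Placeholders in the masked reply that mask() never issued."""
    return [t for t in PLACEHOLDER.findall(reply) if t not in mapping]

def leftover_placeholders(restored: str) -> list[str]:
    """Placeholders still visible after unmask(); these would reach the user."""
    return PLACEHOLDER.findall(restored)

mapping = {"{{EMAIL_1}}": "tanaka@example.com"}
reply = "I will email {{EMAIL_1}} and call {{PHONE_2}}."
print(unknown_placeholders(reply, mapping))  # -> ['{{PHONE_2}}']
```

If either list is non-empty, logging the event and falling back to a generic reply is usually safer than showing the user a half-restored message.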

Summary and Next Steps

This article covered AI chatbot implementation with the Claude API from zero to production:

STEP 1–2: Basic implementation and conversation history
STEP 3: Streaming for better UX
STEP 4: FastAPI web API
STEP 5: Cost optimization strategies
STEP 6: Production-grade error handling
STEP 7: The complete production class
STEP 8: Connecting to LINE, Telegram, and Discord, with inbound verification and reply deadlines
STEP 9: Protecting personal data with PII masking

Continue your journey with:

Tool Use Complete Guide — Add search and compute to your chatbot
Multimodal Input Guide — Vision-capable chatbots
Prompt Caching Deep Dive — Scaling to production

ℹ️

**Built something cool?** Share it with **#ClaudeLab** on X — we'd love to see what you make!
