API & SDK · 2026-04-16 · Advanced

Claude API with Go in Production: Anthropic Go SDK, Concurrency, Tool Use & Microservice Integration

A practical guide to using the Claude API with Go in production. Covers streaming with goroutines, concurrent Tool Use, rate limiting with channels, Gin/Echo integration, graceful shutdown, and Kubernetes deployment, with working code examples.

go golang api-sdk streaming tool-use microservices production concurrency

When you first try to call the Claude API from Go, you run into friction that Python and TypeScript developers never encounter.

Streaming responses through a goroutine leads to panics when the context gets cancelled early. Parallel Tool Use triggers rate limit errors. Wiring the Gin handler to SSE requires flusher configuration that isn't documented anywhere obvious.

These are Go-specific patterns. Articles about the Python or TypeScript SDK won't help you here. I've integrated the Claude API into several production Go services, and this guide documents the walls I hit along the way, with working code that solves each one.

The official Anthropic Go SDK (anthropic-sdk-go) was released in late 2024 and is still actively developed. Because it's newer than the Python SDK, there's significantly less community content available, which makes Go backend engineers especially likely to get stuck.

Setting Up the Anthropic Go SDK

Let's start with the basics: adding the dependency and writing a first call that actually works.

go get github.com/anthropics/anthropic-sdk-go

// main.go
package main
 
import (
    "context"
    "fmt"
    "log"
    "os"
 
    "github.com/anthropics/anthropic-sdk-go"
    "github.com/anthropics/anthropic-sdk-go/option"
)
 
func main() {
    // Always load the API key from environment — never hardcode it
    client := anthropic.NewClient(
        option.WithAPIKey(os.Getenv("ANTHROPIC_API_KEY")),
    )
 
    ctx := context.Background()
    msg, err := client.Messages.New(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
        MaxTokens: anthropic.F(int64(1024)),
        Messages: anthropic.F([]anthropic.MessageParam{
            anthropic.NewUserMessage(anthropic.NewTextBlock("Describe Go's concurrency model in three sentences.")),
        }),
    })
    if err != nil {
        log.Fatalf("API call failed: %v", err)
    }
 
    for _, block := range msg.Content {
        if block.Type == anthropic.ContentBlockTypeText {
            fmt.Println(block.Text)
        }
    }
}

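Key insight: The anthropic.F() helper wraps a value in the SDK's generic field type, which records whether the field was set at all. That lets the SDK tell an explicit zero value or null apart from "not set" when it serializes the request. It looks verbose at first, but it removes the ambiguity that plain struct fields would have. Note that this wrapper belongs to the pre-1.0 releases of the SDK; later releases changed the parameter style, so pin the SDK version in go.mod if you copy these examples.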

Managing System Prompts and Conversation History

Real applications need to maintain conversation history. In Go, a struct is the natural way to hold this state.

// conversation.go
package claude
 
import (
    "context"
    "fmt"
 
    anthropic "github.com/anthropics/anthropic-sdk-go"
)
 
// ConversationSession manages the state of a single conversation
type ConversationSession struct {
    client     *anthropic.Client
    systemText string
    history    []anthropic.MessageParam
    model      anthropic.Model
}
 
func NewConversationSession(client *anthropic.Client, systemPrompt string) *ConversationSession {
    return &ConversationSession{
        client:     client,
        systemText: systemPrompt,
        history:    make([]anthropic.MessageParam, 0),
        model:      anthropic.ModelClaude_Sonnet_4_6,
    }
}
 
// Send submits a user message and returns the assistant's reply
func (s *ConversationSession) Send(ctx context.Context, userMsg string) (string, error) {
    s.history = append(s.history, anthropic.NewUserMessage(
        anthropic.NewTextBlock(userMsg),
    ))
 
    params := anthropic.MessageNewParams{
        Model:     anthropic.F(s.model),
        MaxTokens: anthropic.F(int64(2048)),
        Messages:  anthropic.F(s.history),
    }
 
    if s.systemText != "" {
        params.System = anthropic.F([]anthropic.TextBlockParam{
            anthropic.NewTextBlock(s.systemText),
        })
    }
 
    resp, err := s.client.Messages.New(ctx, params)
    if err != nil {
        // Roll back the user message on failure
        // Without this, the next call will fail: "messages must alternate user/assistant"
        s.history = s.history[:len(s.history)-1]
        return "", fmt.Errorf("API call failed: %w", err)
    }
 
    var result string
    for _, block := range resp.Content {
        if block.Type == anthropic.ContentBlockTypeText {
            result += block.Text
        }
    }
 
    s.history = append(s.history, anthropic.NewAssistantMessage(
        anthropic.NewTextBlock(result),
    ))
 
    return result, nil
}

The history rollback on error is easy to overlook, but it causes real problems. If a user message lands in history without a corresponding assistant response, the next API call fails with a message-ordering validation error.

Streaming: The Right Way

Streaming is where most Go developers get into trouble. Here are the concrete mistakes and their fixes.

The Goroutine Leak Pattern

// BAD: This leaks a goroutine
func badStreaming(client *anthropic.Client) {
    ctx := context.Background()
    stream := client.Messages.NewStreaming(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
        MaxTokens: anthropic.F(int64(1024)),
        Messages: anthropic.F([]anthropic.MessageParam{
            anthropic.NewUserMessage(anthropic.NewTextBlock("Hello")),
        }),
    })
 
    go func() {
        for stream.Next() {
            event := stream.Current()
            _ = event
        }
        // stream.Close() is never called — goroutine leaks
    }()
    // Function returns, goroutine is now orphaned
}

// GOOD: Proper streaming with context-aware channel output
func streamToChannel(ctx context.Context, client *anthropic.Client, userMsg string, output chan<- string) error {
    stream := client.Messages.NewStreaming(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
        MaxTokens: anthropic.F(int64(2048)),
        Messages: anthropic.F([]anthropic.MessageParam{
            anthropic.NewUserMessage(anthropic.NewTextBlock(userMsg)),
        }),
    })
    defer stream.Close() // Always defer Close
 
    for stream.Next() {
        // Respect context cancellation between tokens
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }
 
        event := stream.Current()
        switch ev := event.AsUnion().(type) {
        case anthropic.ContentBlockDeltaEvent:
            if delta, ok := ev.Delta.AsUnion().(anthropic.TextDelta); ok {
                // Send with context awareness — don't block forever
                select {
                case output <- delta.Text:
                case <-ctx.Done():
                    return ctx.Err()
                }
            }
        }
    }
 
    return stream.Err()
}

Two things matter here: defer stream.Close() to prevent resource leaks, and checking ctx.Done() on every channel send so the goroutine exits cleanly when the client disconnects.

SSE Streaming Endpoint with Gin

// handler/stream.go
package handler
 
import (
    "fmt"
    "net/http"
 
    "github.com/gin-gonic/gin"
)
 
func (h *Handler) StreamChat(c *gin.Context) {
    var req struct {
        Message string `json:"message" binding:"required"`
    }
    if err := c.ShouldBindJSON(&req); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
        return
    }
 
    // SSE headers
    c.Header("Content-Type", "text/event-stream")
    c.Header("Cache-Control", "no-cache")
    c.Header("Connection", "keep-alive")
    c.Header("X-Accel-Buffering", "no") // Critical: disables nginx buffering
 
    flusher, ok := c.Writer.(http.Flusher)
    if !ok {
        c.JSON(http.StatusInternalServerError, gin.H{"error": "streaming not supported"})
        return
    }
 
    ctx := c.Request.Context()
    tokenCh := make(chan string, 10) // Buffered to absorb backpressure
    errCh := make(chan error, 1)
 
    go func() {
        defer close(tokenCh)
        err := h.claude.StreamMessage(ctx, req.Message, tokenCh)
        errCh <- err
    }()
 
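    for {
        select {
        case token, ok := <-tokenCh:
            if !ok {
                // tokenCh is closed only after the producer has sent its result
                // to errCh, so this receive never blocks. Reading the error here,
                // rather than in its own select case, guarantees that buffered
                // tokens are flushed before the stream is finished.
                if err := <-errCh; err != nil {
                    fmt.Fprintf(c.Writer, "event: error\ndata: %s\n\n", err.Error())
                } else {
                    fmt.Fprintf(c.Writer, "data: [DONE]\n\n")
                }
                flusher.Flush()
                return
            }
            fmt.Fprintf(c.Writer, "data: %s\n\n", token)
            flusher.Flush()
 
        case <-ctx.Done():
            // Client disconnected: the goroutine exits via ctx cancellation
            return
        }
    }
}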
}

The X-Accel-Buffering: no header is easy to miss, but without it nginx buffers the response and your streaming looks broken from the client's perspective.


Concurrent Tool Use

Claude can request multiple tools simultaneously. Go's concurrency model is a natural fit here — but the implementation details matter.

Tool Engine with Parallel Execution

// tools/engine.go
package tools
 
import (
    "context"
    "encoding/json"
    "fmt"
    "sync"
 
    anthropic "github.com/anthropics/anthropic-sdk-go"
    "golang.org/x/sync/errgroup"
)
 
type ToolFunc func(ctx context.Context, input json.RawMessage) (string, error)
 
type ToolEngine struct {
    tools map[string]ToolFunc
    defs  []anthropic.ToolParam
    mu    sync.RWMutex
}
 
func NewToolEngine() *ToolEngine {
    return &ToolEngine{
        tools: make(map[string]ToolFunc),
        defs:  make([]anthropic.ToolParam, 0),
    }
}
 
func (e *ToolEngine) Register(name, description string, inputSchema interface{}, fn ToolFunc) {
    e.mu.Lock()
    defer e.mu.Unlock()
 
    schemaBytes, _ := json.Marshal(inputSchema)
    e.tools[name] = fn
    e.defs = append(e.defs, anthropic.ToolParam{
        Name:        anthropic.F(name),
        Description: anthropic.F(description),
        InputSchema: anthropic.F(anthropic.ToolInputSchemaParam{
            Type:       anthropic.F(anthropic.ToolInputSchemaTypeObject),
            Properties: anthropic.Raw[interface{}](schemaBytes),
        }),
    })
}
 
func (e *ToolEngine) Definitions() []anthropic.ToolParam {
    e.mu.RLock()
    defer e.mu.RUnlock()
    return e.defs
}
 
// ExecuteParallel runs all tool calls concurrently using errgroup
func (e *ToolEngine) ExecuteParallel(ctx context.Context, calls []anthropic.ToolUseBlock) ([]anthropic.ToolResultBlockParam, error) {
    e.mu.RLock()
    defer e.mu.RUnlock()
 
    results := make([]anthropic.ToolResultBlockParam, len(calls))
    eg, ctx := errgroup.WithContext(ctx)
 
    for i, call := range calls {
        i, call := i, call // capture loop variables (required before Go 1.22)
        eg.Go(func() error {
            fn, ok := e.tools[call.Name]
            if !ok {
                // Return the error to Claude, don't fail the whole batch
                results[i] = anthropic.NewToolResultBlock(
                    call.ID,
                    fmt.Sprintf("tool '%s' not registered", call.Name),
                    true,
                )
                return nil
            }
 
            output, err := fn(ctx, call.Input)
            if err != nil {
                results[i] = anthropic.NewToolResultBlock(call.ID, err.Error(), true)
                return nil
            }
            results[i] = anthropic.NewToolResultBlock(call.ID, output, false)
            return nil
        })
    }
 
    if err := eg.Wait(); err != nil {
        return nil, err
    }
    return results, nil
}

The key design decision: tool execution failures are returned to Claude via isError: true, not propagated as Go errors. This way, one failing tool doesn't abort the results from the others.

Complete Agent Loop

// agent/loop.go
package agent
 
import (
    "context"
    "fmt"
 
    anthropic "github.com/anthropics/anthropic-sdk-go"
    "your-module/tools"
)
 
func Run(ctx context.Context, client *anthropic.Client, engine *tools.ToolEngine, userMsg string) (string, error) {
    messages := []anthropic.MessageParam{
        anthropic.NewUserMessage(anthropic.NewTextBlock(userMsg)),
    }
 
    const maxIterations = 10 // Always set a ceiling — runaway loops are expensive
    for i := 0; i < maxIterations; i++ {
        resp, err := client.Messages.New(ctx, anthropic.MessageNewParams{
            Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
            MaxTokens: anthropic.F(int64(4096)),
            Tools:     anthropic.F(engine.Definitions()),
            Messages:  anthropic.F(messages),
        })
        if err != nil {
            return "", fmt.Errorf("iteration %d: API error: %w", i+1, err)
        }
 
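        // resp.Content holds response blocks, not request params, so convert
        // the whole message with ToParam before appending it to the history
        messages = append(messages, resp.ToParam())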
 
        if resp.StopReason == anthropic.StopReasonEndTurn {
            for _, block := range resp.Content {
                if block.Type == anthropic.ContentBlockTypeText {
                    return block.Text, nil
                }
            }
            return "", nil
        }
 
        if resp.StopReason != anthropic.StopReasonToolUse {
            return "", fmt.Errorf("unexpected stop_reason: %s", resp.StopReason)
        }
 
        var toolCalls []anthropic.ToolUseBlock
        for _, block := range resp.Content {
            if block.Type == anthropic.ContentBlockTypeToolUse {
                toolCalls = append(toolCalls, block.AsToolUseBlock())
            }
        }
 
        toolResults, err := engine.ExecuteParallel(ctx, toolCalls)
        if err != nil {
            return "", fmt.Errorf("tool execution error: %w", err)
        }
 
        resultBlocks := make([]anthropic.ContentBlockParamUnion, len(toolResults))
        for j, r := range toolResults {
            resultBlocks[j] = r
        }
        messages = append(messages, anthropic.NewUserMessage(resultBlocks...))
    }
 
    return "", fmt.Errorf("reached max iterations (%d)", maxIterations)
}

Rate Limiting for Concurrent Services

When multiple users hit your Go service simultaneously, every request competes for Claude API quota. Here's a production-ready rate limiter using golang.org/x/time/rate.

// ratelimit/limiter.go
package ratelimit
 
import (
    "context"
    "fmt"
    "time"
 
    "golang.org/x/time/rate"
)
 
type ClaudeLimiter struct {
    reqLimiter   *rate.Limiter
    tokenLimiter *rate.Limiter
}
 
// NewClaudeLimiter creates a dual limiter for request count and token budget
// reqPerMin: max requests per minute, tokensPerMin: max tokens per minute
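func NewClaudeLimiter(reqPerMin, tokensPerMin int) *ClaudeLimiter {
    // Burst must be at least 1: with a burst of 0 every Wait call fails
    reqBurst := reqPerMin / 10
    if reqBurst < 1 {
        reqBurst = 1
    }
    tokenBurst := tokensPerMin / 10
    if tokenBurst < 1 {
        tokenBurst = 1
    }
    return &ClaudeLimiter{
        reqLimiter:   rate.NewLimiter(rate.Every(time.Minute/time.Duration(reqPerMin)), reqBurst),
        tokenLimiter: rate.NewLimiter(rate.Limit(float64(tokensPerMin))/60, tokenBurst),
    }
}
 
// Wait blocks until the rate limits allow proceeding, or ctx is cancelled
func (l *ClaudeLimiter) Wait(ctx context.Context, estimatedTokens int) error {
    if err := l.reqLimiter.Wait(ctx); err != nil {
        return fmt.Errorf("request limit wait failed: %w", err)
    }
    // WaitN returns an error immediately when n exceeds the burst size,
    // so clamp large estimates instead of rejecting the request outright
    if burst := l.tokenLimiter.Burst(); estimatedTokens > burst {
        estimatedTokens = burst
    }
    if err := l.tokenLimiter.WaitN(ctx, estimatedTokens); err != nil {
        return fmt.Errorf("token limit wait failed: %w", err)
    }
    return nil
}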

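Rate limits depend on your account's usage tier, so check your actual limits rather than assuming a fixed number; the examples here use 40,000 tokens/minute as a placeholder. When estimating tokens for the limiter, a conservative heuristic is input character count × 1.5, which deliberately overestimates to leave room for both the input and the expected output.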
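As a small sketch of that heuristic, the hypothetical estimateTokens helper below turns a prompt into the number you would pass to the limiter's Wait method. It counts runes rather than bytes so multi-byte text isn't inflated further, and it is a budget estimate, not a tokenizer.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateTokens applies the article's heuristic: character count x 1.5,
// rounded up. It is a budget estimate for the limiter, not a tokenizer.
func estimateTokens(input string) int {
	chars := utf8.RuneCountInString(input)
	return (chars*3 + 1) / 2 // ceil(chars * 1.5) in integer arithmetic
}

func main() {
	prompt := "Describe Go's concurrency model in three sentences."
	fmt.Println(estimateTokens(prompt))
}
```

Once a response comes back, its reported usage is a better signal than any estimate, so it is worth comparing the two in your logs and adjusting the multiplier.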

Common Pitfalls and Fixes

Pitfall 1: Context Timeout Too Short for Streaming

// BAD: 5s times out before streaming completes
func badTimeout(client *anthropic.Client, msg string) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    stream := client.Messages.NewStreaming(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
        MaxTokens: anthropic.F(int64(2048)),
        Messages: anthropic.F([]anthropic.MessageParam{
            anthropic.NewUserMessage(anthropic.NewTextBlock(msg)),
        }),
    })
    defer stream.Close()
    // Most non-trivial responses take longer than 5s to stream fully
}
 
// GOOD: Give streaming enough runway
func goodTimeout(client *anthropic.Client, msg string) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()
 
    stream := client.Messages.NewStreaming(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
        MaxTokens: anthropic.F(int64(2048)),
        Messages: anthropic.F([]anthropic.MessageParam{
            anthropic.NewUserMessage(anthropic.NewTextBlock(msg)),
        }),
    })
    defer stream.Close()
    // ...
}

Pitfall 2: Type Assertion Panics on Union Types

// BAD: Panics if the block is a ToolUseBlock, not a TextBlock
block := resp.Content[0]
text := block.AsTextBlock() // panic if block.Type != ContentBlockTypeText
 
// GOOD: Always switch on block.Type before reading type-specific fields
for _, block := range resp.Content {
    switch block.Type {
    case anthropic.ContentBlockTypeText:
        fmt.Println(block.Text)
    case anthropic.ContentBlockTypeToolUse:
        toolBlock := block.AsToolUseBlock()
        fmt.Printf("Tool: %s, ID: %s\n", toolBlock.Name, toolBlock.ID)
    default:
        // Handle future block types gracefully
        fmt.Printf("unknown block type: %s\n", block.Type)
    }
}

Pitfall 3: Nginx Upstream Timeout Cutting Streaming Short

# nginx.conf — dedicated streaming endpoint configuration
location /api/stream {
    proxy_pass http://backend;
    proxy_read_timeout 300s;    # Extend from the 60s default
    proxy_send_timeout 300s;
    proxy_buffering off;        # Required for SSE
    proxy_cache off;
    chunked_transfer_encoding on;
}

Pitfall 4: Loop Variable Capture (Go < 1.22)

// BAD: All goroutines reference the last value of i and call
for i, call := range calls {
    go func() {
        process(i, call) // wrong values after loop ends
    }()
}
 
// GOOD: Shadow the loop variables inside the loop body
for i, call := range calls {
    i, call := i, call
    go func() {
        process(i, call) // correct
    }()
}

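Go 1.22 changed for loops so that each iteration gets its own copy of the loop variables, but only for modules whose go.mod declares go 1.22 or later. Many production services still build with 1.21 or earlier, where the shadowing line is required.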
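If you'd rather not depend on the Go version at all, passing the loop variables as goroutine arguments works everywhere. A minimal sketch, where squareAll is just an illustrative stand-in for whatever per-item work you run:

```go
package main

import (
	"fmt"
	"sync"
)

// squareAll computes each square in its own goroutine. The index and value are
// passed as arguments, so every goroutine gets its own copy on any Go version.
func squareAll(nums []int) []int {
	out := make([]int, len(nums))
	var wg sync.WaitGroup
	for i, n := range nums {
		wg.Add(1)
		go func(i, n int) {
			defer wg.Done()
			out[i] = n * n
		}(i, n)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(squareAll([]int{1, 2, 3})) // prints [1 4 9]
}
```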

Graceful Shutdown for Kubernetes

When Kubernetes rolls out a new Pod version, running Streaming requests need time to finish.

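// main.go
package main
 
import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)
 
// setupRouter is assumed to be defined elsewhere and to return the Gin or Echo handler.
 
func main() {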
    router := setupRouter()
    srv := &http.Server{
        Addr:         ":8080",
        Handler:      router,
        ReadTimeout:  30 * time.Second,
        WriteTimeout: 10 * time.Minute, // Long enough for streaming responses
        IdleTimeout:  120 * time.Second,
    }
 
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
 
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("server error: %v", err)
        }
    }()
 
    log.Println("Server started on :8080")
    <-quit
    log.Println("Shutting down — waiting for in-flight requests to complete")
 
    shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 90*time.Second)
    defer shutdownCancel()
 
    if err := srv.Shutdown(shutdownCtx); err != nil {
        log.Printf("forced shutdown: %v", err)
    }
    log.Println("Shutdown complete")
}

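Set terminationGracePeriodSeconds in your Kubernetes deployment slightly higher than the shutdown timeout, so the kubelet does not kill the Pod while requests are still draining: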

spec:
  terminationGracePeriodSeconds: 120  # shutdown timeout (90s) + buffer
  containers:
    - name: claude-service
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10

Testing with Interfaces

Make your Claude client testable by defining it as an interface.

// claude/interface.go
package claude
 
import (
    "context"
    anthropic "github.com/anthropics/anthropic-sdk-go"
)
 
type MessagesService interface {
    New(ctx context.Context, params anthropic.MessageNewParams) (*anthropic.Message, error)
}
 
// MockMessagesService returns preset responses for testing
type MockMessagesService struct {
    Responses []string
    Errors    []error
    callCount int
}
 
func (m *MockMessagesService) New(ctx context.Context, params anthropic.MessageNewParams) (*anthropic.Message, error) {
    idx := m.callCount
    m.callCount++
 
    if idx < len(m.Errors) && m.Errors[idx] != nil {
        return nil, m.Errors[idx]
    }
 
    response := "default response"
    if idx < len(m.Responses) {
        response = m.Responses[idx]
    }
 
    return &anthropic.Message{
        Content: []anthropic.ContentBlock{
            {Type: anthropic.ContentBlockTypeText, Text: response},
        },
        StopReason: anthropic.StopReasonEndTurn,
    }, nil
}

Tag integration tests separately so CI doesn't require an API key:

# Unit tests (no API key needed)
go test ./...
 
# Integration tests (requires ANTHROPIC_API_KEY)
go test -tags=integration ./...

Where to Go From Here

Start by adding the Anthropic SDK to an existing Go service and implementing ConversationSession. Once you have Streaming and Tool Use working together in an agent loop, the fundamentals are solid.

From there, adding OpenTelemetry-based AI observability gives you the production monitoring layer that makes it safe to scale.

Go's channel and context model turns out to be a genuinely good fit for LLM streaming — the concurrency primitives that Go developers already know map naturally onto the async, token-by-token nature of Claude's responses. Once the patterns click, integrating Claude into a Go microservice feels surprisingly clean.

Echo Framework Integration

If you're using Echo instead of Gin, the streaming setup is slightly different. Echo's Response() writer implements http.Flusher directly.

// server/echo.go
package server
 
import (
    "fmt"
    "net/http"
 
    "github.com/labstack/echo/v4"
    "github.com/labstack/echo/v4/middleware"
)
 
func SetupEchoServer(h *Handler) *echo.Echo {
    e := echo.New()
    e.HideBanner = true
 
    e.Use(middleware.Logger())
    e.Use(middleware.Recover())
    e.Use(middleware.CORS())
    e.Use(middleware.RateLimiter(middleware.NewRateLimiterMemoryStore(20)))
 
    api := e.Group("/api/v1")
    api.POST("/chat", h.Chat)
    api.POST("/chat/stream", h.ChatStream)
    api.POST("/agent", h.RunAgent)
 
    // Health check for Kubernetes readiness probes
    e.GET("/health", func(c echo.Context) error {
        return c.JSON(http.StatusOK, map[string]string{"status": "ok"})
    })
 
    return e
}
 
// ChatStream — Echo variant
func (h *Handler) ChatStream(c echo.Context) error {
    var req ChatRequest
    if err := c.Bind(&req); err != nil {
        return echo.NewHTTPError(http.StatusBadRequest, err.Error())
    }
 
    c.Response().Header().Set("Content-Type", "text/event-stream")
    c.Response().Header().Set("Cache-Control", "no-cache")
    c.Response().Header().Set("X-Accel-Buffering", "no")
    c.Response().WriteHeader(http.StatusOK)
 
    ctx := c.Request().Context()
    tokenCh := make(chan string, 20)
 
    go func() {
        defer close(tokenCh)
        _ = h.claude.StreamMessage(ctx, req.Message, tokenCh)
    }()
 
    for {
        select {
        case token, ok := <-tokenCh:
            if !ok {
                fmt.Fprintf(c.Response(), "data: [DONE]\n\n")
                c.Response().Flush()
                return nil
            }
            fmt.Fprintf(c.Response(), "data: %s\n\n", token)
            c.Response().Flush()
        case <-ctx.Done():
            return nil
        }
    }
}

The Echo version calls c.Response().Flush() instead of a separate flusher variable — Echo's response writer wraps http.Flusher under the hood, so no type assertion is needed.

Configuration Management

Production services need externalized configuration. Hardcoding model names or token limits is a maintenance problem — every change requires a recompile.

// config/config.go
package config
 
import (
    "fmt"
    "os"
    "strconv"
    "time"
)
 
type Config struct {
    APIKey         string
    Model          string
    MaxTokens      int
    RequestTimeout time.Duration
    MaxRetries     int
    ReqPerMin      int
    TokensPerMin   int
    Port           int
}
 
func Load() (*Config, error) {
    apiKey := os.Getenv("ANTHROPIC_API_KEY")
    if apiKey == "" {
        return nil, fmt.Errorf("ANTHROPIC_API_KEY is not set")
    }
 
    maxTokens, _ := strconv.Atoi(getEnv("CLAUDE_MAX_TOKENS", "2048"))
    timeoutSec, _ := strconv.Atoi(getEnv("CLAUDE_TIMEOUT_SEC", "300"))
    reqPerMin, _ := strconv.Atoi(getEnv("CLAUDE_REQ_PER_MIN", "50"))
    tokensPerMin, _ := strconv.Atoi(getEnv("CLAUDE_TOKENS_PER_MIN", "40000"))
    port, _ := strconv.Atoi(getEnv("PORT", "8080"))
 
    return &Config{
        APIKey:         apiKey,
        Model:          getEnv("CLAUDE_MODEL", "claude-sonnet-4-6"),
        MaxTokens:      maxTokens,
        RequestTimeout: time.Duration(timeoutSec) * time.Second,
        MaxRetries:     3,
        ReqPerMin:      reqPerMin,
        TokensPerMin:   tokensPerMin,
        Port:           port,
    }, nil
}
 
func getEnv(key, defaultVal string) string {
    if val := os.Getenv(key); val != "" {
        return val
    }
    return defaultVal
}

For Kubernetes deployments, these values map cleanly to ConfigMap and Secret resources:

# k8s/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: claude-service-config
data:
  CLAUDE_MODEL: "claude-sonnet-4-6"
  CLAUDE_MAX_TOKENS: "2048"
  CLAUDE_TIMEOUT_SEC: "300"
  CLAUDE_REQ_PER_MIN: "50"
  PORT: "8080"
 
---
# k8s/secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: claude-service-secrets
type: Opaque
stringData:
  ANTHROPIC_API_KEY: "your-api-key-here"  # Use a secrets manager in production

Never put the API key in a ConfigMap. It belongs in a Secret, or better, pulled from a secrets manager like AWS Secrets Manager or HashiCorp Vault at startup.

Prompt Caching Integration

If your service sends the same long system prompt on every request, prompt caching can cut costs and latency significantly. The Go SDK supports it via the cache_control field.

// cache/cached_client.go
package cache
 
import (
    "context"
    "fmt"
 
    anthropic "github.com/anthropics/anthropic-sdk-go"
)
 
// CachedConversation wraps a fixed system prompt with cache_control enabled
type CachedConversation struct {
    client     *anthropic.Client
    systemText string // Long, expensive system prompt to be cached
}
 
func NewCachedConversation(client *anthropic.Client, systemPrompt string) *CachedConversation {
    return &CachedConversation{
        client:     client,
        systemText: systemPrompt,
    }
}
 
func (c *CachedConversation) Ask(ctx context.Context, userMsg string) (string, error) {
    resp, err := c.client.Messages.New(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
        MaxTokens: anthropic.F(int64(2048)),
        System: anthropic.F([]anthropic.TextBlockParam{
            {
                Type: anthropic.F(anthropic.TextBlockParamTypeText),
                Text: anthropic.F(c.systemText),
                CacheControl: anthropic.F(anthropic.CacheControlEphemeralParam{
                    Type: anthropic.F(anthropic.CacheControlEphemeralTypeEphemeral),
                }),
            },
        }),
        Messages: anthropic.F([]anthropic.MessageParam{
            anthropic.NewUserMessage(anthropic.NewTextBlock(userMsg)),
        }),
    })
    if err != nil {
        return "", fmt.Errorf("cached request failed: %w", err)
    }
 
    var result string
    for _, block := range resp.Content {
        if block.Type == anthropic.ContentBlockTypeText {
            result += block.Text
        }
    }
 
    // Log cache hits for monitoring (a miss reports zero cache-read tokens)
    if resp.Usage.CacheReadInputTokens > 0 {
        fmt.Printf("Cache hit: %d tokens served from cache\n", resp.Usage.CacheReadInputTokens)
    }
 
    return result, nil
}

Prompt caching only applies once the cached prefix reaches a minimum length (1,024 tokens for Sonnet-class models at the time of writing), so it pays off for long system prompts. For a document-analysis service where you prepend a large reference document to every request, Anthropic reports cost reductions of up to 90% and latency reductions of up to 85%.
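The "up to" matters, because the first request pays extra to write the cache. Here is a rough sketch of the arithmetic, assuming cache writes cost 1.25× and cache reads 0.10× the normal input price. Both multipliers are assumptions for illustration, so check current pricing, and the function name is hypothetical.

```go
package main

import "fmt"

// Assumed price multipliers relative to normal input tokens. These are
// assumptions for illustration; check current pricing before relying on them.
const (
	cacheWriteMultiplier = 1.25 // first request writes the cache
	cacheReadMultiplier  = 0.10 // later requests read from it
)

// cachedPromptSavings returns the fraction of system-prompt input cost saved
// over n requests, assuming every request after the first is a cache hit.
func cachedPromptSavings(n int) float64 {
	if n < 1 {
		return 0
	}
	uncached := float64(n)
	cached := cacheWriteMultiplier + float64(n-1)*cacheReadMultiplier
	return 1 - cached/uncached
}

func main() {
	for _, n := range []int{1, 2, 10, 100} {
		fmt.Printf("%3d requests: %.1f%% saved\n", n, 100*cachedPromptSavings(n))
	}
}
```

Under these assumptions a single request costs more with caching than without, and the savings only approach 90% as the hit count grows. The cache also expires after a short idle period, so low-traffic endpoints may see far fewer hits than this model assumes.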

Observability: Logging and Metrics

A production service needs visibility into API usage, error rates, and latency. Here's a minimal but useful instrumentation setup using structured logging and Prometheus metrics.

// observability/metrics.go
package observability
 
import (
    "time"
 
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)
 
var (
    APIRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "claude_api_requests_total",
            Help: "Total number of Claude API requests",
        },
        []string{"model", "status"}, // status: "success" | "error" | "rate_limited"
    )
 
    APIRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "claude_api_request_duration_seconds",
            Help:    "Claude API request duration in seconds",
            Buckets: []float64{0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0},
        },
        []string{"model"},
    )
 
    TokensUsed = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "claude_tokens_used_total",
            Help: "Total tokens consumed",
        },
        []string{"model", "type"}, // type: "input" | "output" | "cache_read"
    )
)
 
// RecordRequest wraps an API call with metric collection
func RecordRequest(model string, fn func() error) error {
    start := time.Now()
    err := fn()
    duration := time.Since(start).Seconds()
 
    status := "success"
    if err != nil {
        status = "error"
    }
 
    APIRequestsTotal.WithLabelValues(model, status).Inc()
    APIRequestDuration.WithLabelValues(model).Observe(duration)
 
    return err
}

Expose the /metrics endpoint and scrape it with Prometheus. The token usage metric is particularly useful for cost forecasting and quota planning.

Dockerfile for Production

# Build stage
FROM golang:1.22-alpine AS builder
 
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
 
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o claude-service ./cmd/server
 
# Runtime stage — minimal image
FROM alpine:3.19
 
RUN apk --no-cache add ca-certificates tzdata
# Use a directory the non-root user can read; /root is not accessible to appuser
WORKDIR /app
 
COPY --from=builder /app/claude-service .
 
# Never run as root in production
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser
 
EXPOSE 8080
 
CMD ["./claude-service"]

The multi-stage build keeps the final image small (typically under 20MB). Running as a non-root user is a Kubernetes security requirement in many organizations.

Putting It All Together

The patterns in this guide work well independently, but they're designed to compose. A production Claude service in Go typically looks like this:

config.Load() reads all settings from environment variables at startup
NewClaudeLimiter() enforces rate limits before every API call
ToolEngine registers domain-specific functions once at startup
ConversationSession holds per-request state
Gin or Echo routes expose HTTP/SSE endpoints
Metrics middleware wraps every API call for Prometheus scraping
Graceful shutdown gives in-flight streaming requests 90 seconds to complete

The best starting point is ConversationSession. Add it to an existing service, confirm that conversations work correctly, then layer in streaming, then Tool Use. Trying to add everything at once makes debugging much harder.

Once the integration is stable, the OpenTelemetry observability guide covers distributed tracing across multiple services — useful when your Claude service is one component in a larger system.

Go's concurrency model turns out to be a natural fit for LLM APIs. The token-by-token stream maps cleanly onto a channel, context cancellation propagates across goroutines, and errgroup handles parallel tool execution without much ceremony. The rough edges are real, but once you've hit each pitfall once, the patterns become second nature.

Exponential Backoff and Retry Logic

Rate limit errors (HTTP 429) and transient server-side errors (HTTP 503, plus Anthropic's 529 "overloaded" status) happen in production. Rather than letting them surface as user-facing failures, implement retry logic with exponential backoff.

// retry/retry.go
package retry
 
import (
    "context"
    "errors"
    "fmt"
    "net/http"
    "time"
 
    anthropic "github.com/anthropics/anthropic-sdk-go"
)
 
// isRetryable returns true for errors that warrant a retry
func isRetryable(err error) bool {
    var apiErr *anthropic.Error
    if errors.As(err, &apiErr) {
        switch apiErr.StatusCode {
        case http.StatusTooManyRequests,    // 429: rate limited
            http.StatusServiceUnavailable,  // 503: temporary outage
            529:                            // Claude-specific overload code
            return true
        }
    }
    return false
}
 
// WithRetry wraps an API call with retries and exponential backoff
func WithRetry(ctx context.Context, maxAttempts int, fn func() (*anthropic.Message, error)) (*anthropic.Message, error) {
    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        msg, err := fn()
        if err == nil {
            return msg, nil
        }
 
        lastErr = err
        if !isRetryable(err) {
            return nil, err // Non-retryable: fail immediately
        }
 
        if attempt == maxAttempts-1 {
            break // Last attempt failed
        }
 
        // Exponential backoff: 1s, 2s, 4s, ...
        backoff := time.Duration(1<<uint(attempt)) * time.Second
        select {
        case <-time.After(backoff):
        case <-ctx.Done():
            return nil, fmt.Errorf("retry cancelled: %w", ctx.Err())
        }
    }
 
    return nil, fmt.Errorf("all %d attempts failed, last error: %w", maxAttempts, lastErr)
}

Usage in your service layer:

msg, err := retry.WithRetry(ctx, 3, func() (*anthropic.Message, error) {
    return client.Messages.New(ctx, params)
})

Three attempts (the initial call plus two retries, waiting 1s and then 2s) handle the vast majority of transient failures, and successful calls pay no extra latency.
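One refinement worth considering: when many instances get rate limited at the same moment, a fixed 1s, 2s, 4s schedule makes them all retry in lockstep. Adding random jitter spreads those retries out. The sketch below uses the "full jitter" variant, which is a swapped-in technique rather than something WithRetry does today. BackoffWithJitter, DefaultBase, and DefaultMaxDelay are hypothetical names and values chosen for illustration.

```go
package main

import (
	"math/rand"
	"time"
)

// Illustrative defaults (ASSUMPTIONS, tune for your own rate limits).
const (
	DefaultBase     = time.Second
	DefaultMaxDelay = 30 * time.Second
)

// BackoffWithJitter returns a random delay in the range
// [0, min(maxDelay, base * 2^attempt)]. attempt is zero-based.
func BackoffWithJitter(attempt int, base, maxDelay time.Duration) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	// Double the ceiling once per attempt, stopping early once it reaches
	// the cap so large attempt counts cannot overflow.
	ceiling := base
	for i := 0; i < attempt && ceiling < maxDelay; i++ {
		ceiling *= 2
	}
	if ceiling > maxDelay {
		ceiling = maxDelay
	}
	if ceiling <= 0 {
		return 0
	}
	// Pick uniformly between 0 and the ceiling, inclusive.
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}
```

To use it, replace the backoff calculation in WithRetry with backoff := BackoffWithJitter(attempt, DefaultBase, DefaultMaxDelay). The trade-off is that individual waits become less predictable, which is usually fine for background work but worth measuring on latency-sensitive paths.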

Structured Output Parsing

When you need Claude to return structured data rather than free-form text, combining a JSON schema instruction in the system prompt with Go's encoding/json gives reliable results.

// structured/parser.go
package structured
 
import (
    "context"
    "encoding/json"
    "fmt"
 
    anthropic "github.com/anthropics/anthropic-sdk-go"
)
 
// ExtractJSON sends a prompt instructing Claude to return valid JSON,
// then unmarshals the response into the target struct
func ExtractJSON[T any](ctx context.Context, client *anthropic.Client, userPrompt string) (*T, error) {
    systemPrompt := `You are a data extraction assistant. Always respond with valid JSON only.
Do not include markdown code fences, explanations, or any text outside the JSON object.`
 
    resp, err := client.Messages.New(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(anthropic.ModelClaude_Sonnet_4_6),
        MaxTokens: anthropic.F(int64(1024)),
        System: anthropic.F([]anthropic.TextBlockParam{
            anthropic.NewTextBlock(systemPrompt),
        }),
        Messages: anthropic.F([]anthropic.MessageParam{
            anthropic.NewUserMessage(anthropic.NewTextBlock(userPrompt)),
        }),
    })
    if err != nil {
        return nil, fmt.Errorf("API call failed: %w", err)
    }
 
    var rawJSON string
    for _, block := range resp.Content {
        if block.Type == anthropic.ContentBlockTypeText {
            rawJSON = block.Text
            break
        }
    }
 
    var result T
    if err := json.Unmarshal([]byte(rawJSON), &result); err != nil {
        return nil, fmt.Errorf("JSON unmarshal failed (response: %q): %w", rawJSON, err)
    }
 
    return &result, nil
}
 
// Example usage:
type ProductReview struct {
    Sentiment string   `json:"sentiment"` // "positive" | "neutral" | "negative"
    Score     int      `json:"score"`     // 1-10
    Keywords  []string `json:"keywords"`
    Summary   string   `json:"summary"`
}
 
func AnalyzeReview(ctx context.Context, client *anthropic.Client, reviewText string) (*ProductReview, error) {
    prompt := fmt.Sprintf(`Analyze the following product review and extract structured data:
 
%s
 
Return a JSON object with these fields:
- sentiment: "positive", "neutral", or "negative"
- score: integer 1-10 (10 = most positive)
- keywords: array of key terms from the review
- summary: one sentence summary`, reviewText)
 
    return ExtractJSON[ProductReview](ctx, client, prompt)
}

The generic ExtractJSON[T] function works for any struct that can be represented as JSON. In practice, I add a validation step after unmarshaling to check required fields, because Claude occasionally returns truncated JSON (for example when the response hits MaxTokens) or omits fields when the schema is ambiguous.
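The validation step can be quite small. Here is a minimal sketch for the ProductReview type, assuming the field rules stated in the prompt (three allowed sentiments, score from 1 to 10, non-empty summary). Validate and ParseReview are hypothetical helpers, not part of the SDK, and the struct is repeated so the example is self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProductReview mirrors the struct used with ExtractJSON.
type ProductReview struct {
	Sentiment string   `json:"sentiment"`
	Score     int      `json:"score"`
	Keywords  []string `json:"keywords"`
	Summary   string   `json:"summary"`
}

// Validate checks the fields that json.Unmarshal cannot enforce on its own:
// a missing field silently becomes the zero value, so it must be rejected here.
func (r ProductReview) Validate() error {
	switch r.Sentiment {
	case "positive", "neutral", "negative":
	default:
		return fmt.Errorf("invalid sentiment %q", r.Sentiment)
	}
	if r.Score < 1 || r.Score > 10 {
		return fmt.Errorf("score %d out of range 1-10", r.Score)
	}
	if r.Summary == "" {
		return fmt.Errorf("summary is empty")
	}
	return nil
}

// ParseReview unmarshals the model's raw text and validates the result.
func ParseReview(raw string) (*ProductReview, error) {
	var r ProductReview
	if err := json.Unmarshal([]byte(raw), &r); err != nil {
		return nil, fmt.Errorf("unmarshal failed: %w", err)
	}
	if err := r.Validate(); err != nil {
		return nil, fmt.Errorf("validation failed: %w", err)
	}
	return &r, nil
}
```

A validation failure is a reasonable trigger for a single retry of the request, since a second attempt with the same prompt often returns a complete object.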
