API & SDK / 2026-03-29 / Advanced

Claude Long-Term Memory with MCP — Production Implementation Guide

A production-grade walkthrough of long-term memory with MCP — vector DB metrics, scale-based DB selection, and the embedding-model pitfalls the official docs don't mention.

"Pick up where we left off" is the first wall you hit when you try to embed Claude seriously into an app. I've been shipping iOS and Android apps as a solo developer since 2014 — the apps have crossed 50 million downloads in total — and the inability to remember user preferences has consistently held back experience quality. Long-term memory built on MCP (Model Context Protocol) is the first design I've used that actually lifts that ceiling.

Once you put it into production, though, the official docs leave several pitfalls untouched. This article shares the metrics I've measured running vector DBs in production, scale-based recommendations for vector DB selection, and the design judgment calls I've made running four Stripe-billed sites (Claude Lab, Gemini Lab, Antigravity Lab, Rork Lab) side by side.

Long-Term Memory Architecture

Integrating MCP with Persistent Storage

Long-term memory in AI systems requires a three-layer architecture:

┌─────────────────────────────────────────┐
│        Claude API                       │
│   (Message + Context Window)            │
└──────────────────┬──────────────────────┘
                   │
        ┌──────────▼──────────┐
        │   MCP Protocol      │
        │  (Tool Definitions) │
        └──────────┬──────────┘
                   │
    ┌──────────────┼──────────────┐
    │              │              │
┌───▼───┐  ┌──────▼──────┐ ┌────▼─────┐
│ Memory│  │ Vector DB   │ │ User DB  │
│ Store │  │ (Pinecone)  │ │(PostgreSQL)
└───────┘  └─────────────┘ └──────────┘

API Layer: Handles messages and token management
MCP Layer: Defines tools and memory operations
Persistence Layer: Vector embeddings, user metadata, audit logs

Complete Memory Lifecycle

// lib/memory-lifecycle.ts
// MemoryEntry, VectorDatabase, UserDatabase and CacheService are application
// types defined elsewhere in the project.
interface MemoryLifecycle {
  // Returns the ID of the stored memory so callers can log or reference it
  create: (userId: string, memory: MemoryEntry) => Promise<string>;
  retrieve: (userId: string, query: string) => Promise<MemoryEntry[]>;
  update: (
    memoryId: string,
    changes: { content: string; importance?: number }
  ) => Promise<void>;
  delete: (memoryId: string) => Promise<void>;
  expire: (memoryId: string) => Promise<void>;
}
 
class MemoryManager implements MemoryLifecycle {
  private vectorDb: VectorDatabase;
  private userDb: UserDatabase;
  private cache: CacheService;
 
  async create(userId: string, memory: MemoryEntry): Promise<string> {
    // 1. Convert text to embeddings
    const embedding = await this.embedText(memory.content);
    const now = Date.now();
 
    // 2. Store in vector database. Timestamps are epoch milliseconds because
    //    metadata filters such as $gt compare numbers, not Date objects.
    const vectorId = await this.vectorDb.insert({
      userId,
      embedding,
      metadata: {
        // Encrypted copy so retrieve() can return the text; encryptContent
        // is the assumed counterpart of decryptContent used below
        content: this.encryptContent(memory.content),
        createdAt: now,
        importance: memory.importance,
        category: memory.category,
        expiresAt: this.calculateExpiration(memory.ttl).getTime(),
      },
    });
 
    // 3. Store reference in user database
    await this.userDb.insertMemoryRef({
      userId,
      vectorId,
      originalContent: memory.content,
      category: memory.category,
      createdAt: new Date(now),
    });
 
    // 4. Invalidate cache. Cached keys look like `memories:${userId}:${query}`,
    //    so invalidate() must drop every key under this prefix.
    this.cache.invalidate(`memories:${userId}`);
 
    return vectorId;
  }
 
  async retrieve(userId: string, query: string): Promise<MemoryEntry[]> {
    // 1. Check cache first
    const cached = this.cache.get(`memories:${userId}:${query}`);
    if (cached) return cached;
 
    // 2. Convert query to embedding
    const queryEmbedding = await this.embedText(query);
 
    // 3. Vector similarity search
    const matches = await this.vectorDb.search({
      embedding: queryEmbedding,
      userId,
      limit: 5,
      filter: {
        expiresAt: { $gt: Date.now() }, // Only non-expired
      },
    });
 
    // 4. Return memories ranked by relevance
    const memories = matches.map((m) => ({
      id: m.id,
      content: this.decryptContent(m.metadata.content),
      importance: m.metadata.importance,
      relevanceScore: m.similarity,
      category: m.metadata.category,
    }));
 
    // 5. Cache result (5-minute TTL)
    this.cache.set(`memories:${userId}:${query}`, memories, 5 * 60);
 
    return memories;
  }
 
  // update(), delete(), expire(), getMemoryOwner() and getSummary() follow
  // the same pattern and are omitted here for brevity.
 
  private calculateExpiration(ttl?: number): Date {
    // Fall back to 30 days when the TTL is missing or not positive
    const days = ttl !== undefined && ttl > 0 ? ttl : 30;
    const expiry = new Date();
    expiry.setDate(expiry.getDate() + days);
    return expiry;
  }
}

MCP Tool Definition

Memory Operations

Claude interacts with long-term memory through MCP tools:

// mcp-tools/memory-tools.ts
const memoryTools = {
  save_memory: {
    name: "save_memory",
    description: "Save important information to long-term memory",
    inputSchema: {
      type: "object",
      properties: {
        content: {
          type: "string",
          description: "Memory content (max 1000 chars)",
        },
        category: {
          type: "string",
          enum: [
            "user_preference",
            "project_context",
            "skill_profile",
            "past_conversation",
            "decision_log",
          ],
        },
        importance: {
          type: "number",
          enum: [1, 2, 3, 4, 5],
          description: "Importance level",
        },
        ttl_days: {
          type: "number",
          description: "Time to live in days (default: 30)",
        },
      },
      required: ["content", "category", "importance"],
    },
  },
 
  retrieve_memory: {
    name: "retrieve_memory",
    description: "Search and retrieve relevant memories",
    inputSchema: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "Search query",
        },
        category: {
          type: "string",
          description: "Optional category filter",
        },
        limit: {
          type: "number",
          description: "Max results (default: 5)",
        },
      },
      required: ["query"],
    },
  },
 
  update_memory: {
    name: "update_memory",
    description: "Update an existing memory",
    inputSchema: {
      type: "object",
      properties: {
        memory_id: {
          type: "string",
        },
        content: {
          type: "string",
        },
        importance: {
          type: "number",
          enum: [1, 2, 3, 4, 5],
        },
      },
      required: ["memory_id", "content"],
    },
  },
 
  delete_memory: {
    name: "delete_memory",
    description: "Delete a memory entry",
    inputSchema: {
      type: "object",
      properties: {
        memory_id: {
          type: "string",
        },
      },
      required: ["memory_id"],
    },
  },
 
  get_memory_summary: {
    name: "get_memory_summary",
    description: "Get summary of all memories",
    inputSchema: {
      type: "object",
      properties: {
        category: {
          type: "string",
          description: "Optional category filter",
        },
      },
    },
  },
};

MCP Handler Implementation

// mcp-handlers/memory-handler.ts
class MemoryMCPHandler {
  private memoryManager: MemoryManager;
  private auditLog: AuditLogger;
 
  async handleToolCall(
    userId: string,
    toolName: string,
    toolInput: Record<string, any>
  ): Promise<string> {
    switch (toolName) {
      case "save_memory":
        return this.handleSaveMemory(userId, toolInput);
 
      case "retrieve_memory":
        return this.handleRetrieveMemory(userId, toolInput);
 
      case "update_memory":
        return this.handleUpdateMemory(userId, toolInput);
 
      case "delete_memory":
        return this.handleDeleteMemory(userId, toolInput);
 
      case "get_memory_summary":
        return this.handleGetSummary(userId, toolInput);
 
      default:
        throw new Error(`Unknown tool: ${toolName}`);
    }
  }
 
  private async handleSaveMemory(
    userId: string,
    input: Record<string, any>
  ): Promise<string> {
    const { content, category, importance, ttl_days } = input;
 
    // Validation
    if (!content || content.length > 1000) {
      throw new Error("Content must be 1-1000 characters");
    }
 
    const validCategories = [
      "user_preference",
      "project_context",
      "skill_profile",
      "past_conversation",
      "decision_log",
    ];
 
    if (!validCategories.includes(category)) {
      throw new Error("Invalid category");
    }
 
    // Save memory
    const memoryId = await this.memoryManager.create(userId, {
      content,
      category,
      importance,
      ttl: ttl_days,
    });
 
    // Audit log
    await this.auditLog.log({
      userId,
      action: "save_memory",
      memoryId,
      timestamp: new Date(),
    });
 
    return JSON.stringify({
      success: true,
      memoryId,
      message: `Memory saved (importance: ${importance}/5)`,
    });
  }
 
  private async handleRetrieveMemory(
    userId: string,
    input: Record<string, any>
  ): Promise<string> {
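    const { query, category, limit = 5 } = input;
 
    // retrieve() returns at most 5 matches, so `limit` can only narrow the result
    const memories = (await this.memoryManager.retrieve(userId, query))
      .filter((m) => !category || m.category === category)
      .slice(0, limit);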
 
    // Audit search (without logging query content)
    await this.auditLog.log({
      userId,
      action: "retrieve_memory",
      resultCount: memories.length,
      timestamp: new Date(),
    });
 
    return JSON.stringify({
      count: memories.length,
      memories: memories.map((m) => ({
        id: m.id,
        content: m.content,
        importance: m.importance,
        relevanceScore: m.relevanceScore,
      })),
    });
  }
 
  private async handleUpdateMemory(
    userId: string,
    input: Record<string, any>
  ): Promise<string> {
    const { memory_id, content, importance } = input;
 
    // Permission check
    const owner = await this.memoryManager.getMemoryOwner(memory_id);
    if (owner !== userId) {
      throw new Error("Unauthorized");
    }
 
    await this.memoryManager.update(memory_id, { content, importance });
 
    return JSON.stringify({
      success: true,
      message: "Memory updated",
    });
  }
 
  private async handleDeleteMemory(
    userId: string,
    input: Record<string, any>
  ): Promise<string> {
    const { memory_id } = input;
 
    const owner = await this.memoryManager.getMemoryOwner(memory_id);
    if (owner !== userId) {
      throw new Error("Unauthorized");
    }
 
    await this.memoryManager.delete(memory_id);
 
    return JSON.stringify({
      success: true,
      message: "Memory deleted",
    });
  }
 
  private async handleGetSummary(
    userId: string,
    input: Record<string, any>
  ): Promise<string> {
    const summary = await this.memoryManager.getSummary(userId);
 
    return JSON.stringify({
      totalMemories: summary.totalCount,
      byCategory: summary.byCategory,
      summary: summary.contentSummary,
    });
  }
}
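
The handler above checks content and category but passes importance and ttl_days through unchecked, even though the tool schema restricts importance to 1-5. A minimal validation sketch is shown below; validateSaveMemoryInput and its error messages are hypothetical names for illustration, not part of any SDK.

```typescript
// Hypothetical helper: validates raw tool input against the save_memory schema.
const CATEGORIES = [
  "user_preference",
  "project_context",
  "skill_profile",
  "past_conversation",
  "decision_log",
] as const;

type Category = (typeof CATEGORIES)[number];

interface SaveMemoryInput {
  content: string;
  category: Category;
  importance: number;
  ttl_days?: number;
}

function validateSaveMemoryInput(input: Record<string, unknown>): SaveMemoryInput {
  const { content, category, importance, ttl_days } = input;

  if (typeof content !== "string" || content.length === 0 || content.length > 1000) {
    throw new Error("content must be a string of 1-1000 characters");
  }
  if (typeof category !== "string" || !(CATEGORIES as readonly string[]).includes(category)) {
    throw new Error("category is not one of the allowed values");
  }
  if (
    typeof importance !== "number" ||
    !Number.isInteger(importance) ||
    importance < 1 ||
    importance > 5
  ) {
    throw new Error("importance must be an integer from 1 to 5");
  }
  if (
    ttl_days !== undefined &&
    (typeof ttl_days !== "number" || !Number.isFinite(ttl_days) || ttl_days <= 0)
  ) {
    throw new Error("ttl_days must be a positive number when provided");
  }

  return {
    content: content as string,
    category: category as Category,
    importance: importance as number,
    ttl_days: ttl_days as number | undefined,
  };
}
```

Calling this at the top of handleSaveMemory would reject malformed tool calls before anything is embedded or billed.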


Vector Database Integration

Pinecone Adapter

// lib/vector-db/pinecone-adapter.ts
import { Pinecone } from "@pinecone-database/pinecone";
 
class PineconeVectorStore {
  private client: Pinecone;
  private indexName: string;
 
  constructor(apiKey: string, indexName: string) {
    this.client = new Pinecone({ apiKey });
    this.indexName = indexName;
  }
 
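  async insert(memoryData: {
    userId: string;
    embedding: number[];
    metadata: Record<string, any>;
  }): Promise<string> {
    const index = this.client.Index(this.indexName);
    const vectorId = this.generateVectorId();
 
    await index.upsert([
      {
        id: vectorId,
        values: memoryData.embedding,
        metadata: this.normalizeDates({
          userId: memoryData.userId,
          createdAt: Date.now(),
          ...memoryData.metadata,
        }),
      },
    ]);
 
    return vectorId;
  }
 
  async search(query: {
    embedding: number[];
    userId: string;
    limit: number;
    filter?: Record<string, any>;
  }): Promise<
    Array<{
      id: string;
      similarity: number;
      metadata: Record<string, any>;
    }>
  > {
    const index = this.client.Index(this.indexName);
 
    const results = await index.query({
      vector: query.embedding,
      topK: query.limit,
      // Apply the caller's filter, then force the user scope last so it
      // can never be overridden
      filter: {
        ...this.normalizeDates(query.filter ?? {}),
        userId: { $eq: query.userId },
      },
      includeMetadata: true,
    });
 
    return results.matches.map((match) => ({
      id: match.id,
      similarity: match.score ?? 0,
      metadata: match.metadata || {},
    }));
  }
 
  async deleteByUserId(userId: string): Promise<void> {
    const index = this.client.Index(this.indexName);
    // deleteMany takes either an array of IDs or a single filter object.
    // Delete-by-filter is not available on every index type; check yours.
    await index.deleteMany({ userId: { $eq: userId } });
  }
 
  // Metadata and filter values must be strings, numbers, booleans or string
  // lists, so convert Date objects (including nested ones) to epoch ms.
  private normalizeDates(value: Record<string, any>): Record<string, any> {
    return Object.fromEntries(
      Object.entries(value).map(([key, v]) => [
        key,
        v instanceof Date
          ? v.getTime()
          : v !== null && typeof v === "object" && !Array.isArray(v)
            ? this.normalizeDates(v)
            : v,
      ])
    );
  }
 
  private generateVectorId(): string {
    // slice() replaces the deprecated substr()
    return `vec_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
  }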
}

Text Embedding Service

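// lib/embeddings/embedding-service.ts
// Claude's Messages API generates text, not embedding vectors, so this service
// calls a dedicated embedding model instead. The example uses Voyage AI's REST
// endpoint with voyage-3-large (the model behind the cost figures later in
// this article); confirm the endpoint and response shape against current docs.
const VOYAGE_URL = "https://api.voyageai.com/v1/embeddings";
const EMBEDDING_MODEL = "voyage-3-large";
 
class EmbeddingService {
  // Keyed by normalized text so equivalent inputs share one entry
  private cache = new Map<string, number[]>();
 
  constructor(private apiKey: string) {}
 
  async embedText(text: string): Promise<number[]> {
    const [embedding] = await this.embedBatch([text]);
    return embedding;
  }
 
  async embedBatch(texts: string[]): Promise<number[][]> {
    const normalized = texts.map((t) => this.normalizeText(t));
    const missing = Array.from(
      new Set(normalized.filter((t) => !this.cache.has(t)))
    );
    const batchSize = 100;
 
    // One request per batch instead of one request per text
    for (let i = 0; i < missing.length; i += batchSize) {
      const batch = missing.slice(i, i + batchSize);
      const vectors = await this.requestEmbeddings(batch);
      batch.forEach((t, j) => this.cache.set(t, vectors[j]));
    }
 
    return normalized.map((t) => this.cache.get(t)!);
  }
 
  private async requestEmbeddings(inputs: string[]): Promise<number[][]> {
    const res = await fetch(VOYAGE_URL, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({ input: inputs, model: EMBEDDING_MODEL }),
    });
 
    // Fail loudly: a zero-vector fallback would turn cosine similarity into NaN
    if (!res.ok) {
      throw new Error(`Embedding request failed with status ${res.status}`);
    }
 
    const json = (await res.json()) as {
      data: Array<{ embedding: number[]; index: number }>;
    };
 
    // Restore input order before returning
    return json.data
      .sort((a, b) => a.index - b.index)
      .map((d) => d.embedding);
  }
 
  private normalizeText(text: string): string {
    return text
      .toLowerCase()
      .trim()
      .replace(/\s+/g, " ")
      .substring(0, 2000);
  }
 
  clearCache(): void {
    this.cache.clear();
  }
}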

Privacy and Security

Encryption and Access Control

// lib/security/privacy-engine.ts
import crypto from "crypto";
 
class PrivacyEngine {
  private encryptionKey: string;
 
  async encryptMemory(
    memoryId: string,
    content: string
  ): Promise<{ iv: string; encryptedContent: string }> {
    const iv = crypto.randomBytes(16);
    const cipher = crypto.createCipheriv(
      "aes-256-cbc",
      Buffer.from(this.encryptionKey),
      iv
    );
 
    let encrypted = cipher.update(content, "utf-8", "hex");
    encrypted += cipher.final("hex");
 
    return {
      iv: iv.toString("hex"),
      encryptedContent: encrypted,
    };
  }
 
  async decryptMemory(
    iv: string,
    encryptedContent: string
  ): Promise<string> {
    const decipher = crypto.createDecipheriv(
      "aes-256-cbc",
      Buffer.from(this.encryptionKey),
      Buffer.from(iv, "hex")
    );
 
    let decrypted = decipher.update(encryptedContent, "hex", "utf-8");
    decrypted += decipher.final("utf-8");
 
    return decrypted;
  }
 
  // GDPR compliance: Right to be forgotten
  async deleteUserData(userId: string): Promise<void> {
    await this.vectorDb.deleteByUserId(userId);
    await this.userDb.deleteUser(userId);
 
    await this.auditLog.log({
      userId,
      action: "full_data_deletion",
      timestamp: new Date(),
    });
  }
}
 
class AccessControlManager {
  async checkAccess(userId: string): Promise<boolean> {
    const subscription = await this.db.getSubscription(userId);
    return subscription.tier !== "free" || subscription.memoryCount < 10;
  }
 
  async getAccessLevel(userId: string) {
    const subscription = await this.db.getSubscription(userId);
 
    return {
      canSaveMemory:
        subscription.tier !== "free" || subscription.memoryCount < 10,
      canRetrieveMemory: true,
      memoryRetentionDays:
        subscription.tier === "premium"
          ? 365
          : subscription.tier === "pro"
            ? 90
            : 30,
      maxMemorySize:
        subscription.tier === "premium" ? 5000 : 1000,
    };
  }
}

Memory Quality Management

Relevance Scoring

// lib/memory-quality/relevance-engine.ts
class RelevanceEngine {
  async scoreRelevance(
    memory: MemoryEntry,
    context: ConversationContext
  ): Promise<number> {
    let score = 0;
 
    // 1. Temporal decay (older memories less relevant)
    const ageInDays =
      (Date.now() - memory.createdAt.getTime()) / (1000 * 60 * 60 * 24);
    const timeDecay = Math.exp(-ageInDays / 30);
    score += timeDecay * 0.3;
 
    // 2. Explicit importance
    score += (memory.importance / 5) * 0.3;
 
    // 3. Semantic similarity (skipped when there is no topic, e.g. during pruning)
    if (context.currentTopic) {
      const similarity = await this.calculateSimilarity(
        memory.content,
        context.currentTopic
      );
      score += similarity * 0.4;
    }
 
    return Math.min(score, 1.0);
  }
 
  async pruneMemories(userId: string): Promise<void> {
    const allMemories = await this.db.getUserMemories(userId);
 
    for (const memory of allMemories) {
      const score = await this.scoreRelevance(memory, {});
 
      if (score < 0.1) {
        await this.deleteMemory(memory.id);
      }
    }
  }
 
  private async calculateSimilarity(
    text1: string,
    text2: string
  ): Promise<number> {
    const emb1 = await this.embedText(text1);
    const emb2 = await this.embedText(text2);
 
    return this.cosineSimilarity(emb1, emb2);
  }
 
  private cosineSimilarity(a: number[], b: number[]): number {
    const dotProduct = a.reduce((sum, x, i) => sum + x * b[i], 0);
    const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
    const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
 
    // A zero vector has no direction; return 0 instead of dividing by zero (NaN)
    if (magA === 0 || magB === 0) return 0;
 
    return dotProduct / (magA * magB);
  }
}

Memory Consolidation

// lib/memory-quality/consolidation.ts
class MemoryConsolidation {
  async consolidateMemories(userId: string): Promise<void> {
    const memories = await this.db.getUserMemories(userId);
 
    // Find and merge duplicates
    const duplicates = await this.findDuplicates(memories);
 
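    for (const group of duplicates) {
      // Keep the most important entry; sort a copy so the input is not mutated
      const [primary, ...rest] = [...group].sort(
        (a, b) => b.importance - a.importance
      );
 
      for (const duplicate of rest) {
        primary.content = await this.mergeContent(
          primary.content,
          duplicate.content
        );
        await this.db.deleteMemory(duplicate.id);
      }
 
      // Persist the merged text; without this the merge is lost when the
      // function returns. The stored vector should be re-embedded as well.
      await this.db.updateMemory(primary.id, primary.content);
    }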
 
    // Find and resolve contradictions
    const contradictions = await this.findContradictions(memories);
 
    for (const pair of contradictions) {
      if (pair.new.createdAt > pair.old.createdAt) {
        await this.db.updateMemory(pair.old.id, pair.new.content);
      }
    }
  }
 
  private async mergeContent(
    content1: string,
    content2: string
  ): Promise<string> {
    const response = await this.client.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 500,
      messages: [
        {
          role: "user",
          content: `Merge these memories into one coherent summary:\n\n1: ${content1}\n\n2: ${content2}`,
        },
      ],
    });
 
    return response.content[0].type === "text" ? response.content[0].text : "";
  }
}

Production Operations

Hybrid Search (Keyword + Semantic)

// lib/memory-search/hybrid-search.ts
class HybridMemorySearch {
  async search(
    userId: string,
    query: string,
    options?: SearchOptions
  ): Promise<SearchResult[]> {
    // Keyword search
    const keywordResults = await this.keywordSearch(userId, query);
 
    // Semantic search
    const semanticResults = await this.semanticSearch(userId, query);
 
    // Merge and rank
    const merged = this.mergeResults(
      keywordResults,
      semanticResults,
      {
        keywordWeight: 0.3,
        semanticWeight: 0.7,
      }
    );
 
    return merged
      .sort((a, b) => b.score - a.score)
      .slice(0, options?.limit || 5);
  }
 
  private async keywordSearch(
    userId: string,
    query: string
  ): Promise<SearchResult[]> {
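    // content_vec is the tsvector column; match and rank against the same
    // column so Postgres can use its full-text index
    return this.db.query(
      `
        SELECT id, content, importance,
               ts_rank(content_vec, plainto_tsquery('english', $1)) AS rank
        FROM memories
        WHERE user_id = $2
          AND content_vec @@ plainto_tsquery('english', $1)
        ORDER BY rank DESC
        LIMIT 10
      `,
      [query, userId]
    );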
  }
 
  private async semanticSearch(
    userId: string,
    query: string
  ): Promise<SearchResult[]> {
    const embedding = await this.embedText(query);
    return this.vectorDb.search({
      embedding,
      userId,
      limit: 10,
    });
  }
}
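
The listing above calls mergeResults without showing it. One way it could work is sketched below, assuming both result lists have been mapped to a common shape with a raw score (ts_rank for keyword hits, cosine similarity for semantic hits). The ScoredMemory and MergeWeights types and the divide-by-max normalization are assumptions for illustration, not the author's implementation.

```typescript
interface ScoredMemory {
  id: string;
  content: string;
  score: number; // raw score from one retriever
}

interface MergeWeights {
  keywordWeight: number;
  semanticWeight: number;
}

// Scale scores into [0, 1] by dividing by the list maximum, so that
// ts_rank values and cosine similarities become comparable.
function normalizeScores(results: ScoredMemory[]): Map<string, ScoredMemory> {
  const max = Math.max(0, ...results.map((r) => r.score));
  const byId = new Map<string, ScoredMemory>();
  for (const r of results) {
    byId.set(r.id, { ...r, score: max > 0 ? Math.max(0, r.score) / max : 0 });
  }
  return byId;
}

function mergeResults(
  keyword: ScoredMemory[],
  semantic: ScoredMemory[],
  weights: MergeWeights
): ScoredMemory[] {
  const kw = normalizeScores(keyword);
  const sem = normalizeScores(semantic);
  const ids = Array.from(
    new Set([...Array.from(kw.keys()), ...Array.from(sem.keys())])
  );

  const merged = ids.map((id) => {
    const k = kw.get(id);
    const s = sem.get(id);
    return {
      id,
      content: (k ?? s)!.content,
      // A memory found by only one retriever contributes 0 for the other
      score:
        weights.keywordWeight * (k?.score ?? 0) +
        weights.semanticWeight * (s?.score ?? 0),
    };
  });

  return merged.sort((a, b) => b.score - a.score);
}
```

Memories found by both retrievers rise to the top, which is usually the point of hybrid search. Reciprocal rank fusion is a common alternative when raw scores are too noisy to normalize.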

Cost Optimization

// lib/cost-optimization/cost-manager.ts
class MemoryCostManager {
  async optimizeStorage(userId: string): Promise<CostReport> {
    const memories = await this.db.getUserMemories(userId);
    const DAY_MS = 1000 * 60 * 60 * 24;
 
    // Typed explicitly so `optimizations` is not inferred as never[]
    const report: CostReport = {
      totalMemories: memories.length,
      estimatedCost: 0,
      optimizations: [],
    };
 
    // Delete low-priority memories older than 60 days
    const deletedIds = new Set<string>();
    for (const memory of memories) {
      const ageInDays = (Date.now() - memory.createdAt.getTime()) / DAY_MS;
 
      if (memory.importance <= 2 && ageInDays > 60) {
        await this.db.deleteMemory(memory.id);
        deletedIds.add(memory.id);
        report.optimizations.push({
          type: "deleted_low_priority",
          memoryId: memory.id,
        });
      }
    }
 
    // Compress large memories, skipping the ones that were just deleted
    const remaining = memories.filter((m) => !deletedIds.has(m.id));
    for (const memory of remaining.filter((m) => m.content.length > 500)) {
      const compressed = await this.compressMemory(memory);
      await this.db.updateMemory(memory.id, compressed);
    }
 
    // Count only the memories that still exist
    report.estimatedCost = remaining.length * 0.001;
 
    return report;
  }
 
  private async compressMemory(memory: MemoryEntry): Promise<string> {
    const response = await this.client.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 200,
      messages: [
        {
          role: "user",
          content: `Summarize this memory preserving key information:\n\n${memory.content}`,
        },
      ],
    });
 
    return response.content[0].type === "text"
      ? response.content[0].text
      : memory.content;
  }
}

Complete Example

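// services/claude-with-memory.ts
import Anthropic from "@anthropic-ai/sdk";
 
class ClaudeWithMemory {
  private client: Anthropic;
  private memoryManager: MemoryManager;
 
  async chat(
    userId: string,
    userMessage: string
  ): Promise<{ response: string; memoriesSaved: number }> {
    // Retrieve relevant memories
    const relatedMemories = await this.memoryManager.retrieve(
      userId,
      userMessage
    );
 
    // Build context
    const context = this.buildContext(relatedMemories);
 
    // The Messages API expects an array of tools with snake_case input_schema
    const tools: Anthropic.Tool[] = Object.values(memoryTools).map((t) => ({
      name: t.name,
      description: t.description,
      input_schema: t.inputSchema as Anthropic.Tool.InputSchema,
    }));
 
    const messages: Anthropic.MessageParam[] = [
      { role: "user", content: userMessage },
    ];
 
    let fullResponse = "";
    let memoriesSaved = 0;
 
    // Keep going until Claude stops requesting tools (capped at 5 rounds)
    for (let turn = 0; turn < 5; turn++) {
      const response = await this.client.messages.create({
        model: "claude-3-5-sonnet-20241022", // use a model ID current for your account
        max_tokens: 2048,
        system: context || undefined,
        tools,
        messages,
      });
 
      const toolResults: Anthropic.ToolResultBlockParam[] = [];
 
      for (const block of response.content) {
        if (block.type === "text") {
          fullResponse += block.text;
        } else if (block.type === "tool_use") {
          try {
            const result = await this.handleMemoryTool(
              userId,
              block.name,
              block.input as Record<string, any>
            );
            if (block.name === "save_memory") memoriesSaved += 1;
            toolResults.push({
              type: "tool_result",
              tool_use_id: block.id,
              content: result,
            });
          } catch (err) {
            // Report the failure to Claude instead of aborting the chat
            toolResults.push({
              type: "tool_result",
              tool_use_id: block.id,
              content: String(err),
              is_error: true,
            });
          }
        }
      }
 
      if (response.stop_reason !== "tool_use") break;
 
      // Send the tool results back so Claude can finish its answer
      messages.push({ role: "assistant", content: response.content });
      messages.push({ role: "user", content: toolResults });
    }
 
    return {
      response: fullResponse,
      memoriesSaved,
    };
  }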
 
  private buildContext(memories: MemoryEntry[]): string {
    if (memories.length === 0) return "";
 
    let context = "## Relevant Context:\n\n";
    memories.forEach((memory, index) => {
      context += `${index + 1}. [${memory.category}] ${memory.content}\n`;
    });
 
    return context;
  }
 
  private async handleMemoryTool(
    userId: string,
    toolName: string,
    input: Record<string, any>
  ): Promise<string> {
    const handler = new MemoryMCPHandler();
    return handler.handleToolCall(userId, toolName, input);
  }
}

Operational Pitfalls That Aren't in the Official Docs

After running vector-based long-term memory in production for six months to a year, you hit problems the documentation never warns you about. Here are the landmines I actually stepped on and the workarounds I've settled into.

1. Embedding Model Version Drift

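OpenAI, Voyage, and Cohere all push embedding model updates roughly every 6–12 months. Vectors from different embedding models live in different vector spaces, often with different dimensions, so a fresh query vector from the new model cannot be meaningfully compared against vectors stored by the old one, even for semantically identical text. Re-embedding everything solves that, but the new model ranks memories differently, so your retrieval results shift.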

Numbers I've observed in production:

Migrating from text-embedding-3-small to text-embedding-3-large shuffles roughly 23% of the top-5 hit set
Minor version updates within the same model family produce 3–7% drift — not catastrophic, but still meaningful for retrieval quality
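
A number like the 23% above can be tracked with a simple churn metric: for each test query, the fraction of the old top-k memory IDs that no longer appear in the new top-k. The sketch below is one plausible way to compute it. The article does not say exactly how the figure was measured, so topKChurn and meanChurn are hypothetical helpers.

```typescript
// Fraction of the old top-k IDs that are missing from the new top-k.
function topKChurn(oldIds: string[], newIds: string[], k = 5): number {
  const oldTop = oldIds.slice(0, k);
  if (oldTop.length === 0) return 0;

  const newTop = new Set(newIds.slice(0, k));
  const dropped = oldTop.filter((id) => !newTop.has(id)).length;
  return dropped / oldTop.length;
}

// Average churn over a set of representative queries.
function meanChurn(
  pairs: Array<{ oldIds: string[]; newIds: string[] }>,
  k = 5
): number {
  if (pairs.length === 0) return 0;
  const total = pairs.reduce(
    (sum, p) => sum + topKChurn(p.oldIds, p.newIds, k),
    0
  );
  return total / pairs.length;
}

// One of five results replaced: churn of 0.2 for this query
const churn = topKChurn(
  ["m1", "m2", "m3", "m4", "m5"],
  ["m1", "m2", "m3", "m9", "m5"]
);
```

Running this over a fixed set of real queries before and after a model change gives a concrete number to compare against your own tolerance before you cut over.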

The procedure I've settled into:

Always persist embedding_model and embedding_version in the memory metadata
During migration, keep both old and new vectors in parallel for a fixed window (I use 14 days)
On the query side, prefer hits from the new model; fall back to the old model only when new-model hits are sparse
After 14 days, batch-delete the legacy vectors to reclaim storage

// lib/embedding-migration.ts
interface EmbeddingMetadata {
  embeddingModel: string;       // e.g. "voyage-3-large"
  embeddingVersion: string;     // e.g. "2025-09"
  embeddedAt: Date;
}
 
async function searchWithFallback(userId: string, query: string) {
  // Parameter names follow the search() adapter shown earlier (embedding, limit)
  const newVec = await embedWithCurrent(query);
  const newHits = await vectorDb.search({
    userId, embedding: newVec, limit: 5,
    filter: { embeddingModel: CURRENT_MODEL }
  });
 
  if (newHits.length >= 3) return newHits;
 
  // Sparse results: top up with legacy-model hits. Scores from different
  // models are not comparable, so new-model hits keep priority in the list.
  const oldVec = await embedWithLegacy(query);
  const oldHits = await vectorDb.search({
    userId, embedding: oldVec, limit: 5,
    filter: { embeddingModel: LEGACY_MODEL }
  });
  return [...newHits, ...oldHits].slice(0, 5);
}

2. The "Importance Score" Decay Problem

Asking Claude to assign a numeric importance score (1–5 in the tool schema above, 1–10 in many tutorials) is a common pattern, but it falls apart after six months in production. Model upgrades shift the score distribution for the same text, and your retrieval ranking quietly degrades.

What I've moved to is an enum-based design:

ephemeral: auto-deletes after 30 days (one-off questions, tasks, short-term preferences)
standard: retained for 1 year (project context, ongoing preferences)
critical: deleted only when the user explicitly removes it (self-introduction, must-not-forget facts)

Three tiers is enough — model updates barely affect it, and as a bonus, users can pick the retention level themselves in the UI.
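
The three tiers map directly onto a small lookup table. The sketch below shows one way to derive an expiry date from a tier; RetentionTier, RETENTION_DAYS, expiresAt and isExpired are hypothetical names, and the day counts are the ones listed above.

```typescript
type RetentionTier = "ephemeral" | "standard" | "critical";

// null means the memory is only removed when the user deletes it
const RETENTION_DAYS: Record<RetentionTier, number | null> = {
  ephemeral: 30,
  standard: 365,
  critical: null,
};

const DAY_MS = 24 * 60 * 60 * 1000;

function expiresAt(tier: RetentionTier, createdAt: Date): Date | null {
  const days = RETENTION_DAYS[tier];
  if (days === null) return null;
  return new Date(createdAt.getTime() + days * DAY_MS);
}

function isExpired(
  tier: RetentionTier,
  createdAt: Date,
  now: Date = new Date()
): boolean {
  const expiry = expiresAt(tier, createdAt);
  return expiry !== null && expiry.getTime() <= now.getTime();
}
```

Because the tier is a fixed label rather than a model-generated number, a model upgrade cannot silently change when a memory is deleted.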

3. Recency Bias to Prevent Useful Memories from Getting Buried

If you rank purely by vector similarity, the user's recent preferences get out-competed by older memories. When I aggregated six months of logs from a personal project, an average of 3.7 of the 5 results returned at topK=5 were older than three months.

// Simple but effective recency boost
interface VectorMatch {
  score: number;
  metadata: { createdAt: number };
}
 
function rerankWithRecency(matches: VectorMatch[]): VectorMatch[] {
  const now = Date.now();
  return matches
    .map(m => {
      const ageDays = (now - m.metadata.createdAt) / (1000 * 60 * 60 * 24);
      // ~0.85 at 30 days, ~0.61 at 90 days, floored at 0.4 from ~165 days onward
      const recencyFactor = Math.max(0.4, Math.exp(-ageDays / 180));
      return { ...m, score: m.score * recencyFactor };
    })
    .sort((a, b) => b.score - a.score);
}

After adding this rerank, hit rate for current user preferences improved by about 28%. Tune the decay constant (180 above) to your service: 90 for daily-diary-style memory, 365 for long-running project memory.

4. The Real Cost Profile of Memory Writes

Every new memory you persist calls the embedding API, so your monthly bill scales with new memories × tokens per memory × unit price. My numbers (Voyage AI / voyage-3-large, as of May 2026):

Average 47 new memories per user per month
Average 240 tokens per memory
For 1,000 MAU: roughly 11.3M tokens per month
Voyage AI rates: about $1.5 per month

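At scale, storage isn't the cost center; the embedding API is. If hybrid retrieval (covered earlier) and an embedding cache for repeated text halve the volume you embed, the embedding bill halves with it. Anthropic's prompt caching helps on a different line item: it lowers the cost of re-sending the same context to Claude, not the cost of embedding.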
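
The arithmetic behind those figures is easy to reproduce and rerun with your own numbers. In the sketch below, estimateMonthlyEmbeddingCost is a hypothetical helper, and the $0.133 per million tokens is simply the rate implied by the two figures above ($1.5 for about 11.3M tokens), not a quoted price. Check your provider's current price list before relying on it.

```typescript
interface EmbeddingCostInput {
  monthlyActiveUsers: number;
  newMemoriesPerUser: number; // per month
  tokensPerMemory: number;
  usdPerMillionTokens: number; // assumption: take this from your provider's pricing
}

function estimateMonthlyEmbeddingCost(input: EmbeddingCostInput): {
  tokens: number;
  usd: number;
} {
  const tokens =
    input.monthlyActiveUsers * input.newMemoriesPerUser * input.tokensPerMemory;
  const usd = (tokens / 1e6) * input.usdPerMillionTokens;
  return { tokens, usd };
}

// 1,000 MAU x 47 memories x 240 tokens = 11,280,000 tokens per month
const writeCost = estimateMonthlyEmbeddingCost({
  monthlyActiveUsers: 1000,
  newMemoriesPerUser: 47,
  tokensPerMemory: 240,
  usdPerMillionTokens: 0.133,
});
```

This covers memory writes only. Every retrieval also embeds the query text, so a read-heavy product should add a second term for queries per user.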

5. Latency Cannot Be Measured at a Single Layer

Pinecone Serverless advertises p50 of 30–60ms, but what users actually feel is the sum across layers. My production measurements (Tokyo region, 1,000 MAU scale):

Layer	p50	p95
Embedding API	80ms	220ms
Vector search	45ms	110ms
Postgres metadata join	12ms	35ms
Total	~140ms	~370ms

If you're targeting sub-200ms felt latency, the embedding API dominates. Caching embeddings for common queries ("my preferences", "current project") brings p50 down to around 60ms.
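
Caching query embeddings can be as small as a wrapper around whatever embedding function you already have. The sketch below is a minimal in-memory version with a TTL and a size cap. withEmbeddingCache, the 10-minute TTL and the 1,000-entry limit are illustrative assumptions, and a multi-instance deployment would need a shared cache such as Redis instead.

```typescript
type Embedder = (text: string) => Promise<number[]>;

function withEmbeddingCache(
  embed: Embedder,
  ttlMs: number = 10 * 60 * 1000,
  maxEntries: number = 1000
): Embedder {
  const cache = new Map<string, { vector: number[]; expiresAt: number }>();

  return async (text: string): Promise<number[]> => {
    // Normalize so "My preferences" and " my preferences " share one entry
    const key = text.trim().toLowerCase();

    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.vector;

    const vector = await embed(text);

    // Drop any stale entry, then evict the oldest one if the cache is full
    // (Map iterates in insertion order)
    cache.delete(key);
    if (cache.size >= maxEntries) {
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }

    cache.set(key, { vector, expiresAt: Date.now() + ttlMs });
    return vector;
  };
}
```

On a cache hit the embedding layer drops out of the request path entirely, which is why the remaining p50 is roughly the vector search plus the Postgres join from the table above.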

Vector DB Selection by Scale — Production-Grounded Recommendations

The "Pinecone or pgvector" debate is everywhere online, but I want to share what I've actually learned at different scales. Since 2014 I've been running a solo app business (50M+ total downloads, with peak months over ¥1M from AdMob), and I currently operate four Stripe-billed sites in parallel. From that vantage point, here's how I think about the three common scale tiers.

Up to 1,000 MAU: pgvector (Supabase / Neon)

Just add a vector column to Postgres
Zero operational overhead since you don't introduce a new service
Monthly cost: $0 to $25
Weakness: once you cross ~500k vectors, you'll need to tune the ANN index (HNSW)

This is where I always start new products. Run on pgvector, and only consider a dedicated vector DB once you cross 500k vectors. This pacing keeps you from over-engineering early.

1,000 to 50,000 MAU: Pinecone Serverless

Stable throughput and latency
Monthly cost: $50 to $700 depending on vector count
Signals it's time to adopt: pgvector p95 above 500ms, or vector count above 5M
Watch out: metadata filtering is powerful, but over-restrictive filters return empty result sets unless you increase topK

50,000+ MAU: Self-Hosted Weaviate/Qdrant or Pinecone Pods

This is where you need dedicated infrastructure
Monthly cost: $1,000+
Assume a dedicated DBA or SRE function — operational design needs separate planning at this scale
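
The thresholds in the three tiers above can be condensed into a small decision function, which is handy as a periodic check in a metrics job. This is only a sketch of the rules of thumb stated in this section: recommendVectorDb and ScaleSignals are hypothetical names, and the cut-offs are the article's, not vendor guidance.

```typescript
type VectorDbChoice = "pgvector" | "pinecone-serverless" | "self-hosted-or-pods";

interface ScaleSignals {
  monthlyActiveUsers: number;
  vectorCount: number;
  pgvectorP95Ms?: number; // measured search latency, if you are on pgvector
}

function recommendVectorDb(s: ScaleSignals): VectorDbChoice {
  // 50,000+ MAU: dedicated infrastructure
  if (s.monthlyActiveUsers >= 50000) return "self-hosted-or-pods";

  // Adoption signals from the text: more than 1,000 MAU,
  // more than 5M vectors, or pgvector p95 above 500 ms
  const outgrownPgvector =
    s.monthlyActiveUsers > 1000 ||
    s.vectorCount > 5000000 ||
    (s.pgvectorP95Ms ?? 0) > 500;

  return outgrownPgvector ? "pinecone-serverless" : "pgvector";
}
```

Treat the output as a prompt to investigate rather than an automatic migration trigger; a single slow p95 reading is often an index-tuning problem rather than a reason to switch databases.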

A Practical Note for Solo Developers

Speaking as someone who has run apps to 50M+ downloads, gating long-term memory to active users — or even better, paying users — gives you a much better cost-to-value ratio than enabling it for everyone. With Dolice Labs' Stripe memberships, I'm designing long-term memory as a Premium-tier-only feature, because it's a credible lever for both conversion and retention.

Memory is pure cost when it's universal, but when it's clearly framed as a paid-tier differentiator, it becomes a feature that justifies the monthly subscription.
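In code, the gate can be a single predicate checked before any memory read or write. This is only a sketch: the `User` fields, the plan names `"free"` and `"premium"`, and the 30-day activity window are assumptions for illustration, not details of my actual billing setup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class User:
    user_id: str
    plan: str              # hypothetical plan names: "free" or "premium"
    last_active: datetime  # timezone-aware timestamp of the last session


def memory_enabled(
    user: User,
    now: Optional[datetime] = None,
    premium_only: bool = True,
    active_window: timedelta = timedelta(days=30),
) -> bool:
    """Decide whether long-term memory should run for this user."""
    if premium_only:
        # Strict gate: only paying users get memory.
        return user.plan == "premium"
    # Looser gate: any user seen within the activity window.
    now = now or datetime.now(timezone.utc)
    return now - user.last_active <= active_window
```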

Twelve Years of Indie Development: A Decision Framework for Long-Term Memory

Before any of the technical decisions, there's a prior question: should you even be persisting memory? Here's how I think about it after twelve years of running apps.

1. Privacy Is About User Consent, Not Just Implementation

Twelve years of running apps has taught me that what users actually worry about isn't "my data getting stolen" — it's "I didn't know you were remembering that." Even with end-to-end encryption in place, if the UI doesn't explicitly say "I'm remembering this," the anxiety doesn't go away.

Three things to decide before you write any encryption code:

When does the UI tell the user "I remembered this"?
Where can the user always view a full list of what's been remembered?
Is there a prominently placed "Forget everything" button?

Technical encryption is a downstream concern.

2. Forgetting Is a Feature, Not a Limitation

Both of my grandfathers were temple carpenters, and I grew up in an atmosphere where "working with your hands is a form of devotion." Building something carefully has always meant deciding what to keep and what to let go. Long-term memory is the same — "remember everything" is a design failure.

The three-tier retention model I've settled into:

ephemeral (auto-deletes in 30 days): questions, tasks, short-term preferences
standard (1 year): project info, ongoing preferences
critical (only manual deletion): self-introduction, must-not-forget context

These three tiers handle almost every situation. More importantly, when "when does this get forgotten" is part of the design, users feel safe handing the AI new information.
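A minimal sketch of the three tiers as data, assuming each memory row stores its creation time and tier. The names `Tier`, `TTL`, `expires_at`, and `is_expired` are illustrative; "1 year" is approximated as 365 days.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Dict, Optional


class Tier(Enum):
    EPHEMERAL = "ephemeral"  # questions, tasks, short-term preferences
    STANDARD = "standard"    # project info, ongoing preferences
    CRITICAL = "critical"    # self-introduction, must-not-forget context


TTL: Dict[Tier, Optional[timedelta]] = {
    Tier.EPHEMERAL: timedelta(days=30),
    Tier.STANDARD: timedelta(days=365),
    Tier.CRITICAL: None,  # never auto-deleted; manual deletion only
}


def expires_at(created_at: datetime, tier: Tier) -> Optional[datetime]:
    """Return the auto-deletion time, or None if the tier never expires."""
    ttl = TTL[tier]
    return None if ttl is None else created_at + ttl


def is_expired(created_at: datetime, tier: Tier, now: datetime) -> bool:
    """True when a cleanup job should delete this memory."""
    deadline = expires_at(created_at, tier)
    return deadline is not None and now >= deadline
```

Storing the computed expiry alongside each memory also gives the UI something concrete to show, which supports the "when does this get forgotten" promise.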

3. Measure Quality, Not Quantity, of Memories

Watching apps grow to 50M+ downloads taught me that the right KPI isn't "memory count" — it's "what changes in user behavior when a memory hits."

The metrics I actually watch:

Memory hit rate (fraction of queries that return at least one relevant memory)
Continuation rate after a hit (probability the user sends a next message)
Per-user memory deletion rate (high values are a precision warning)

If you only grow the memory count, users start to feel that "you're remembering things I didn't expect" and churn. Bake quality metrics into your instrumentation from day one.
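The three metrics above can be computed from a simple event log. This sketch assumes one `QueryEvent` per retrieval, with a count of relevant memories and a flag for whether the user sent a follow-up message; how you decide a memory is "relevant" (for example, a similarity threshold) is left open here, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryEvent:
    user_id: str
    relevant_memories: int  # relevant memories returned for this query
    user_continued: bool    # did the user send a next message?


def hit_rate(events: List[QueryEvent]) -> float:
    """Fraction of queries that returned at least one relevant memory."""
    if not events:
        return 0.0
    return sum(e.relevant_memories > 0 for e in events) / len(events)


def continuation_rate_after_hit(events: List[QueryEvent]) -> float:
    """Among queries with a hit, the fraction followed by another message."""
    hits = [e for e in events if e.relevant_memories > 0]
    if not hits:
        return 0.0
    return sum(e.user_continued for e in hits) / len(hits)


def deletion_rate_by_user(created: Dict[str, int], deleted: Dict[str, int]) -> Dict[str, float]:
    """Per-user share of created memories that the user later deleted."""
    return {user: deleted.get(user, 0) / n for user, n in created.items() if n > 0}
```

Comparing the continuation rate after a hit against the rate after a miss is what shows whether memory is actually changing behavior.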

4. Build It as If Your Children Will Inherit It

It may seem out of place to bring this up in a technical article, but at the root of why I keep building independently is the wish to leave something I won't be ashamed of to my children, who live separately from me. Anything I'm building today, I design as if my own kids might use the same tool someday. Long-term memory — which touches user identity — especially deserves this lens.

In concrete terms:

No long-term memory for accounts belonging to users under 18
Build a dashboard parents can use to inspect and delete memory
Persist memory in an exportable, forward-compatible format (JSON Lines plus an encryption key pair) so it's retrievable years from now even if the AI itself changes
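As a sketch of the exportable format, here is a JSON Lines round trip with an explicit schema version so future readers can migrate old files. The record fields and `SCHEMA_VERSION` are assumptions for illustration, and the encryption step is deliberately left out; in practice you would encrypt the resulting text with the user's key before storing it.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable, List

SCHEMA_VERSION = 1  # bump whenever the record shape changes


def to_jsonl(memories: Iterable[Dict]) -> str:
    """Serialize memories to JSON Lines, one self-describing record per line.

    Each memory needs "id", "tier", "content", and a timezone-aware
    "created_at" datetime.
    """
    lines = []
    for m in memories:
        record = {
            "schema_version": SCHEMA_VERSION,
            "id": m["id"],
            "tier": m["tier"],
            "content": m["content"],
            "created_at": m["created_at"].astimezone(timezone.utc).isoformat(),
        }
        lines.append(json.dumps(record, ensure_ascii=False, sort_keys=True))
    return "\n".join(lines) + ("\n" if lines else "")


def from_jsonl(text: str) -> List[Dict]:
    """Parse an export back into records, skipping blank lines."""
    # Split on "\n" only: json.dumps escapes newlines inside strings, so each
    # record is guaranteed to occupy exactly one line.
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Plain text with one record per line stays readable with nothing more than a text editor, which is the point of choosing it for long-lived data.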

More than any technical detail, asking "would I be proud of this design five years from now" is the most important judgment call when handling something as sensitive as long-term memory.

Next Steps

If you're adding long-term memory to an existing app, here's the order I'd recommend:

Start with pgvector and a single-user schema (this takes about half a day to get running)
Add the "I remembered this" UI and a "Forget" button
Layer in the recency boost
Migrate to Pinecone Serverless once you cross 500k vectors
Instrument hit rate, continuation rate, and deletion rate

I hope this article fills in some of the gaps that the official docs leave open. If you're working on the same problem, I'd be glad to know it was useful.
