⬡ API & SDK/2026-06-14Advanced

Wiring Claude's Dreaming Into Memory Hygiene for Long-Running Agents

Managed Agents' Dreaming reviews past sessions and rewrites memory as a self-improvement loop. We unpack the published Harvey and Wisedocs numbers, build a complete self-hosted consolidation loop with the Anthropic SDK, and cover the production pitfalls.

managed-agents⁵ memory⁷ agent-sdk³ production⁸⁹

✦ Premium Article

When you let autonomous agents run four technical blogs as an indie developer, the memory files those agents write balloon to dozens of entries within a few months. At first it was a gift: the agent remembered "we tried this fix before and it failed." But as the count grew, old facts and new facts started colliding quietly. A note from six months ago said "this flag was disabled," while a newer note said "the same flag was re-enabled." The agent had no way to know which to trust, and its judgment dulled.

The Dreaming feature (research preview) for Managed Agents, announced at the June 2026 Code w/ Claude developer conference, tackles exactly this problem. It periodically reviews past sessions and memory, extracts patterns, and updates the agent itself. It folds the "memory inventory" I had been doing by hand into the agent's own operating loop. This article unpacks the thinking behind Dreaming, then shows a self-hosted consolidation loop you can build today with the Anthropic SDK, complete with working code.

Why agents with more memory often make worse decisions

Left alone, a long-running agent's memory grows monotonically. Append "what I learned" every session and you reach dozens of entries in half a year, sometimes past a hundred in a year.

The problem isn't the count itself; it's that facts of wildly different freshness sit side by side. In my case, three notes about the same setting existed across different points in time. Only the newest was correct; the other two were already wrong. The agent loaded all three as equally valid "past learnings" and got dragged back toward the old instructions, re-introducing a bug I had already fixed.

A human team periodically reviews its docs and deletes outdated lines. An agent's memory, however, tends to become an append-only log that nobody prunes. Dreaming makes more sense once you see it as a mechanism for handing that pruning to the agent itself.

Dreaming folds session records into knowledge you can use

The core of Dreaming is that it does not hoard raw session history. Instead it extracts recurring patterns and folds them into more general knowledge.

Suppose several sessions in a row read "the API timed out, so I retried with exponential backoff." A naive memory adds one record each time: "On date X, resolved a timeout with a retry." Dreaming looks across them and rewrites them into one level-up policy: "This workload sees intermittent timeouts, so exponential backoff should be the default from the start."

This is the same thing I was attempting by hand: delete the individual events, keep only the judgment they taught. The difference is that Dreaming runs that loop autonomously on a schedule.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦How Dreaming folds past sessions into reusable knowledge, and how to read the published Harvey (~6x task completion) and Wisedocs (50% review-time cut) numbers

✦A complete Python consolidation loop built on the Anthropic SDK: duplicate detection, staleness flags, and index regeneration

✦Which workloads Dreaming helps versus hurts, plus three production pitfalls and their fixes

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

How to read the published numbers — Harvey 6x, Wisedocs 50%

Anthropic reports that with Dreaming, Harvey raised task completion roughly sixfold and Wisedocs cut review time by 50%. Those are striking figures, but you should discount them somewhat when carrying them into an indie-developer context.

First, these are workloads where accumulated memory maps directly to outcomes. Legal and medical document review treats consistency with past judgments as the quality itself, so memory hygiene flows straight into the success metric. For tasks that are nearly independent each time (a one-off translation, say), folding memory barely moves completion rates.

Second, "6x" assumes the baseline is an unmaintained memory. In my own experience, an agent with dirty memory degrades visibly, so the improvement from that state looks large. A big upside is, flipped around, a measure of how deeply things rot when you neglect it.

So I read these numbers not as "add Dreaming and go 6x faster," but as a warning: "neglect memory freshness and outcomes alone can drop to a fraction."

Building a self-hosted consolidation loop before Dreaming ships

Dreaming is still a research preview and isn't available everywhere yet. But its essence is a simple loop: periodically read memory, judge duplicates and staleness, fold, and write back. You can reproduce that with the Anthropic SDK alone.

The complete script below loads memory accumulated as JSON Lines, has Claude consolidate it, and writes it back. Each memory carries id, created, text, and status, and the result comes back as structured output.

import json
import os
from datetime import datetime, timezone
from pathlib import Path
from anthropic import Anthropic
 
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MEMORY_FILE = Path("memory.jsonl")
MODEL = "claude-sonnet-4-6"
 
def load_memories() -> list[dict]:
    if not MEMORY_FILE.exists():
        return []
    rows = []
    for line in MEMORY_FILE.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    # only consolidate entries that are still active
    return [m for m in rows if m.get("status", "active") == "active"]
 
CONSOLIDATION_PROMPT = """\
You manage memory for a long-running agent.
Below is a list of operational notes accumulated over time (oldest first).
Consolidate them by the following rules and return JSON only.
 
1. If new and old notes about the same target conflict, treat the newest as
   authoritative and mark the older ones stale.
2. If the same lesson appears across 3+ separate events, fold them into one
   generalized policy.
3. Mark one-off, non-reproducible records as archive.
4. Do not alter the underlying facts of notes you keep.
 
Return format:
{
  "keep": [{"id": "...", "text": "...", "merged_from": ["id1","id2"]}],
  "stale": ["id3", "id4"],
  "archive": ["id5"]
}
 
Notes:
"""
 
def consolidate(memories: list[dict]) -> dict:
    listing = "\n".join(
        f'- id={m["id"]} created={m["created"]} : {m["text"]}'
        for m in memories
    )
    resp = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": CONSOLIDATION_PROMPT + listing}],
    )
    raw = resp.content[0].text
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start : end + 1])
 
def apply_plan(memories: list[dict], plan: dict) -> list[dict]:
    by_id = {m["id"]: m for m in memories}
    now = datetime.now(timezone.utc).isoformat()
    result = []
    demoted = set(plan.get("stale", [])) | set(plan.get("archive", []))
    for item in plan.get("keep", []):
        base = by_id.get(item["id"], {})
        result.append({
            "id": item["id"],
            "created": base.get("created", now),
            "updated": now,
            "text": item["text"],
            "status": "active",
            "merged_from": item.get("merged_from", []),
        })
    for mid in demoted:
        if mid in by_id:
            row = dict(by_id[mid])
            row["status"] = "stale" if mid in plan.get("stale", []) else "archived"
            row["updated"] = now
            result.append(row)
    return result
 
def main():
    memories = load_memories()
    if len(memories) < 10:
        print(f"active memories = {len(memories)}: no consolidation needed")
        return
    plan = consolidate(memories)
    merged = apply_plan(memories, plan)
    MEMORY_FILE.write_text(
        "\n".join(json.dumps(m, ensure_ascii=False) for m in merged) + "\n",
        encoding="utf-8",
    )
    kept = sum(1 for m in merged if m["status"] == "active")
    print(f"consolidated: active {len(memories)} -> {kept} entries")
 
if __name__ == "__main__":
    main()

Three things matter here. First, carry a three-valued status (active, stale, archived) and never physically delete. Dreaming similarly keeps history, which is the safe design: a wrong call is reversible. Second, restrict the consolidation input to active only so stale notes are never re-read. Third, keep merged_from so you can trace why a policy emerged later.

Designing the consolidation prompt — what to keep, what to drop

The loop's quality is almost entirely down to the prompt. After several failures, here is the rubric I settled on.

keep: reproducible judgment rules, facts confirmed multiple times,
      constraints still in force
mark stale: old facts that conflict with a newer note
mark archive: one-off events, records meaningful only in a single session
do not touch: the factual core of notes you keep (you may tidy wording,
      but never change the content)

Early on, I only told the model to "summarize and shorten." It helpfully rewrote the facts themselves: "disabled" got softened into "adjusted," and the crucial decision turned vague. Stating that folding (generalize) and summarizing are different made the difference. Folding derives a policy from many facts; summarizing shortens a single fact — and you must not do the latter to memory.

Frequency matters too. I run this not every session but only once active crosses ten entries. Running it every time forces folding before enough of a pattern exists, which loses information instead.

When it helps and when it backfires

Neither Dreaming nor a self-hosted loop is universal. In my operation, the workloads split cleanly.

It helps on long tasks where consistency with past judgment maps to quality: continuously touching the same codebase, handling the same customer, operating the same set of sites. Memory freshness directly governs the outcome. Harvey and Wisedocs fall into this shape.

It tends to backfire when tasks are mutually independent. For a workload that processes a different user's one-off request each time, "patterns" extracted from past sessions act as biases that don't fit the task in front of you. When I added a consolidation loop to a translation agent, it over-generalized past stylistic habits and started ignoring the new request's instructions. I fixed it by narrowing the consolidation scope to operational constraints only and keeping content-specific judgments out of memory entirely.

The rule of thumb in one sentence: do you want to reference past judgment on the next task? If yes, folding is worth it. If no, it's safer not to keep that in memory at all.

Three pitfalls I hit in production

Wiring a self-hosted consolidation loop into the Dolice Labs operation, I hit three reproducible pitfalls.

First, a race where another process reads memory mid-consolidation. If the agent reads memory while the write-back is in progress, it grabs a half-updated state. Writing to a temp file and swapping atomically with os.replace() fixed it. Overwriting directly with write_text opens that window.

Second, runaway folding. Over repeated passes, the model decided it could "make this cleaner" and merged two policies that should have stayed independent, losing one context. As a guard, I discard the consolidation and notify a human if keep drops by 30% or more versus the previous run. A sharp shrink is almost always a sign of over-folding.

Third, misjudged staleness. The newest note isn't always correct. A note from a temporary experiment, treated as "latest," once marked a correct older policy stale. I prevented this by labeling notes as "confirmed" or "experimental" beyond just created, and forbidding experimental notes from marking others stale.

While Dreaming is still a research preview, building these guards into your own loop means that when the official version lands, you already have your own sense of where to draw the line between what to delegate and what to keep in hand.

Your next step

Start by dumping your current agent's memory once and counting the active entries and how many conflicting pairs sit inside. If you're past ten entries with even one conflict, it's already dulling your agent's judgment. Run just the consolidate function above in a read-only mode with no write-back, and you'll see which notes Claude considers stale. Write back only once you agree with that call.

Since I made memory-freshness management an explicit part of operations, I've hit the same bug twice far less often. The longer you run an agent, the more tending its memory beats growing it.

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.