API & SDK · 2026-06-22 · Advanced

Your Claude Files API Storage Is Quietly Filling Up — Dedup With a Content-Hash Ledger and Reap the Orphans

Use the Files API in an automated pipeline and the same file gets uploaded again and again while orphaned files pile up unnoticed. Here is a content-hash dedup ledger plus an orphan GC design, with working code.


About two weeks after wiring the Files API into an automated pipeline, I ran GET /v1/files on a whim and went a little pale: dozens of copies of the same reference data, distinguished only by their dates. The upload code was working correctly. The problem was that neither a "don't upload what I already uploaded" mechanism nor a "clean up what I no longer use" mechanism existed anywhere in my code.

The Files API exists so you can "upload once and reference many times", but the "once" is the caller's responsibility, not the API's. Used occasionally by hand, nothing accumulates. In an automated workflow that re-uploads the same reference data every day, orphaned files and storage charges pile up silently. This article describes how to stop that pileup with two layers, a content-hash ledger and an orphan GC, based on what I built as an indie developer into the Dolice Labs auto-publishing pipeline.

Basic upload steps are covered in the Claude Files API basics guide, so here I focus only on not letting files accumulate.

Why the Files API Fills Up Silently

Files API uploads are not idempotent. POST /v1/files the same PDF twice and you get two distinct entities with different file_id values. To the API they are separate objects, so there is no error and no warning.

This bites hard in automation. Take a daily job that uploads reference data every morning and generates articles from the resulting file_id. Even when the content is byte-for-byte identical to yesterday's, every run creates a new file_id. That is thirty per month for one site, and a hundred and twenty across four sites. Each file is small, but the unreferenced ones keep sitting in storage.
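The arithmetic behind those figures is easy to sanity-check. This is a minimal sketch, assuming one upload per site per day and a 30-day month. The average file size is a made-up placeholder used only to show the growth, not a measured value.

```python
def accumulated_files(sites: int, uploads_per_day: int, days: int) -> int:
    """Number of distinct file_ids created when nothing is deduplicated."""
    return sites * uploads_per_day * days


def retained_bytes(file_count: int, avg_file_bytes: int) -> int:
    """Bytes left in storage if none of those files is ever deleted."""
    return file_count * avg_file_bytes


per_site = accumulated_files(sites=1, uploads_per_day=1, days=30)
all_sites = accumulated_files(sites=4, uploads_per_day=1, days=30)

# Hypothetical average size, used only to illustrate the growth.
ASSUMED_AVG_BYTES = 200_000

print(per_site, all_sites)                           # 30 120
print(retained_bytes(all_sites, ASSUMED_AVG_BYTES))  # 24000000
```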

Files API storage is billed on the bytes you retain. So "files you no longer use but never deleted" remain a daily billing line item while doing nothing useful. By the time I noticed, reference data that should have needed four entries had ballooned to nearly eighty.

It helps to split the trap in two. One is duplicate uploads (uploading the same content twice); the other is orphaned files (still present though nothing references them anymore). You stop the former before uploading and reap the latter on a schedule.
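The two failure modes can be separated mechanically once a file listing is in hand. The sketch below is an illustration under stated assumptions, not the pipeline's actual code. It works on plain dicts shaped like the metadata a listing returns (`id`, `filename`, `size_bytes`, `created_at`), and `referenced_ids` is a hypothetical set of file_ids your jobs still use. Because a listing carries no content hash, "same name and same size" is only a hint of duplication, not proof.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def likely_duplicates(files: list[dict]) -> list[list[str]]:
    """Group file ids that share filename and size (a heuristic, not proof)."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for f in files:
        groups[(f["filename"], f["size_bytes"])].append(f["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def orphans(files, referenced_ids, now, min_age=timedelta(days=1)):
    """Ids that nothing references and that are at least min_age old.

    The age guard keeps a file uploaded moments ago, but not yet recorded
    as referenced, from being flagged by mistake.
    """
    return [
        f["id"]
        for f in files
        if f["id"] not in referenced_ids and now - f["created_at"] >= min_age
    ]


now = datetime(2026, 6, 22, tzinfo=timezone.utc)
listing = [
    {"id": "file_a", "filename": "ref.pdf", "size_bytes": 1000,
     "created_at": now - timedelta(days=3)},
    {"id": "file_b", "filename": "ref.pdf", "size_bytes": 1000,
     "created_at": now - timedelta(days=2)},
    {"id": "file_c", "filename": "ref.pdf", "size_bytes": 1000,
     "created_at": now - timedelta(hours=1)},
]

print(likely_duplicates(listing))         # [['file_a', 'file_b', 'file_c']]
print(orphans(listing, {"file_b"}, now))  # ['file_a']
```

Here `file_c` is unreferenced but too young to flag, which is the kind of case a scheduled reaper has to leave alone.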

The Content-Hash Ledger

The most straightforward way to stop duplicates is to "check, before uploading, whether you have uploaded this content before." The key to that check is a hash of the content.

Compute SHA-256 over the file's bytes and keep a hash → file_id map in a ledger on your side. If the hash of the content you want to upload is in the ledger, reuse the existing file_id. If not, upload for the first time and write the result back into the ledger. That alone guarantees "never upload the same content twice."

Where the ledger lives depends on your workload. I started with a local JSON file and moved to a KV store once multiple processes began running concurrently. The requirement is simple: look up a file_id by hash, by whatever means.
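As one illustration of a ledger that tolerates more than one process, here is a minimal sketch backed by SQLite from the Python standard library. It is an assumed stand-in, not the KV store mentioned above, and the `SqliteLedger` class and its table layout are hypothetical. The primary key on the hash lets the database, rather than application code, refuse a second entry for the same content.

```python
import sqlite3
from typing import Optional


class SqliteLedger:
    """hash -> file_id ledger; the table layout is an illustrative assumption."""

    def __init__(self, path: str = "file_ledger.db") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS ledger ("
            "digest TEXT PRIMARY KEY, file_id TEXT NOT NULL, name TEXT)"
        )
        self.conn.commit()

    def get(self, digest: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT file_id FROM ledger WHERE digest = ?", (digest,)
        ).fetchone()
        return row[0] if row else None

    def put(self, digest: str, file_id: str, name: str) -> Optional[str]:
        """Record a mapping; if the digest already exists, keep the first one.

        Returns the file_id the ledger holds after the call.
        """
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR IGNORE INTO ledger (digest, file_id, name) "
                "VALUES (?, ?, ?)",
                (digest, file_id, name),
            )
        return self.get(digest)


ledger = SqliteLedger(":memory:")
print(ledger.get("abc"))                       # None
print(ledger.put("abc", "file_1", "ref.pdf"))  # file_1
print(ledger.put("abc", "file_2", "ref.pdf"))  # file_1 (first writer wins)
```

In this design, if two processes race and both upload the same content, the later `put` returns the earlier file_id, and the later upload becomes an orphan for the GC to collect.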

I chose SHA-256 because the chance of a collision is effectively zero in practice. The odds of two different files happening to share a hash and getting mixed up are astronomically small, not worth considering at the scale of reference data. I avoided older hashes like MD5 precisely so that, if I ever came to suspect a collision, I would not waste time ruling it out. A ledger is a long-lived mechanism, and the longer a mechanism lives, the more it pays to put its foundation on the safe side.
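To put a number on "astronomically small", the birthday bound gives an upper estimate: among n files hashed to b bits, the probability that any two collide by accident is at most n(n-1)/2 divided by 2^b. A quick check, assuming a deliberately generous one billion files:

```python
from fractions import Fraction


def collision_upper_bound(n_files: int, hash_bits: int) -> Fraction:
    """Birthday (union) bound on the chance that any two of n_files collide."""
    pairs = n_files * (n_files - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


n = 10 ** 9  # far more files than a reference-data ledger would hold
print(f"SHA-256: {float(collision_upper_bound(n, 256)):.1e}")
print(f"128-bit hash: {float(collision_upper_bound(n, 128)):.1e}")
```

These bounds cover accidental collisions only. MD5's practical weakness is that collisions can be constructed deliberately, which is a separate reason to avoid it.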

import hashlib
import json
import mimetypes
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
LEDGER_PATH = Path("file_ledger.json")
BETA = "files-api-2025-04-14"  # the Files API needs a beta header


def load_ledger() -> dict:
    """Load the hash -> entry map, or start empty on the first run."""
    if LEDGER_PATH.exists():
        return json.loads(LEDGER_PATH.read_text(encoding="utf-8"))
    return {}


def save_ledger(ledger: dict) -> None:
    # write to a temp file then swap, to avoid corruption mid-write
    tmp = LEDGER_PATH.with_suffix(".tmp")
    tmp.write_text(
        json.dumps(ledger, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    tmp.replace(LEDGER_PATH)


def content_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes; the filename plays no part."""
    return hashlib.sha256(data).hexdigest()


def get_or_upload(path: str) -> str:
    """Return an existing file_id for identical content; upload if unseen."""
    data = Path(path).read_bytes()
    digest = content_hash(data)

    ledger = load_ledger()
    if digest in ledger:
        return ledger[digest]["file_id"]  # reuse, no upload

    # Assumption: a specific MIME type (e.g. application/pdf) is more useful
    # downstream than the generic one, so guess from the extension first.
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    uploaded = client.beta.files.upload(
        file=(Path(path).name, data, mime),
        betas=[BETA],
    )
    ledger[digest] = {"file_id": uploaded.id, "name": Path(path).name}
    save_ledger(ledger)
    return uploaded.id

The crucial point is that the hash is computed over the bytes, not the filename. Judge by filename and you miss the case where the name stays the same but the content changes. Judge by content and, conversely, files with different names but identical content collapse into one.
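Both cases are easy to confirm locally without touching the API. This is a small sketch using throwaway files. The filenames and CSV contents are invented for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "reference_2026-06-21.csv"
    b = Path(tmp) / "reference_2026-06-22.csv"
    a.write_bytes(b"id,value\n1,10\n")
    b.write_bytes(b"id,value\n1,10\n")  # different name, same bytes

    same_content = content_hash(a.read_bytes()) == content_hash(b.read_bytes())

    before = content_hash(b.read_bytes())
    b.write_bytes(b"id,value\n1,11\n")  # same name, different bytes
    changed = content_hash(b.read_bytes()) != before

print(same_content)  # True
print(changed)       # True
```

The first result is why dated copies of unchanged reference data collapse to one ledger entry. The second is why an edited file under an old name still triggers a fresh upload.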
