API & SDK / 2026-06-22 / Advanced

Your Claude Files API Storage Is Quietly Filling Up — Dedup With a Content-Hash Ledger and Reap the Orphans

Use the Files API in an automated pipeline and the same file gets uploaded again and again while orphaned files pile up unnoticed. Here is a content-hash dedup ledger plus an orphan GC design, with working code.

About two weeks after wiring the Files API into an automated pipeline, I ran GET /v1/files on a whim and went a little pale: dozens of copies of the same reference data, distinguished only by their dates. The upload code was working correctly. The problem was that neither a "don't upload what I already uploaded" mechanism nor a "clean up what I no longer use" mechanism existed anywhere in my code.

The Files API exists so you can "upload once and reference many times" — but the once is the caller's responsibility, not the API's. Used occasionally by hand, nothing accumulates. In an automated workflow that re-uploads the same reference data every day, orphaned files and storage charges pile up silently. This article describes how to stop that pileup with two layers — a content-hash ledger and an orphan GC — based on what I actually built into the Dolice Labs auto-publishing pipeline as an indie developer.

Basic upload steps are covered in the Claude Files API basics guide, so here I focus only on not letting files accumulate.

Why the Files API Fills Up Silently

Files API uploads are not idempotent. POST /v1/files the same PDF twice and you get two distinct entities with different file_id values. To the API they are separate objects, so there is no error and no warning.

This bites hard in automation. Run a job that "uploads reference data every morning and generates articles from that file_id" daily, and even when the content is byte-for-byte identical to yesterday, a new file_id is born each time. Thirty per month; a hundred and twenty across four sites. Each is small, but unreferenced entities keep sitting in storage.

Uploaded files persist until you explicitly delete them, so they keep counting against your storage quota and against any storage billing that applies to your account. "Files you no longer use but never deleted" therefore stay on the books every day while doing nothing useful. By the time I noticed, reference data that should have needed four entries had ballooned to nearly eighty.

It helps to split the trap in two. One is duplicate uploads (uploading the same content twice); the other is orphaned files (still present though nothing references them anymore). You stop the former before uploading and reap the latter on a schedule.

The Content-Hash Ledger

The most straightforward way to stop duplicates is to "check, before uploading, whether you have uploaded this content before." The key to that check is a hash of the content.

Compute SHA-256 over the file's bytes and keep a hash → file_id map in a ledger on your side. If the hash of the content you want to upload is in the ledger, reuse the existing file_id. If not, upload for the first time and write the result back into the ledger. Within a single process, that alone is enough to never upload the same content twice; concurrent writers additionally need a ledger that handles simultaneous updates safely.

Where the ledger lives depends on your workload. I started with a local JSON file and moved to a KV store once multiple processes began running concurrently. The requirement is simple: look up a file_id by hash, by whatever means.
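For the concurrent case, one option that needs no separate service is SQLite from the standard library. The sketch below is an assumption about what such a ledger could look like, not the KV store used in the pipeline; the table name `ledger` and the helpers `open_ledger`, `lookup`, and `record` are hypothetical.

```python
import sqlite3


def open_ledger(db_path: str = "file_ledger.db") -> sqlite3.Connection:
    """Open (or create) a digest -> file_id ledger backed by SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        " digest TEXT PRIMARY KEY,"
        " file_id TEXT NOT NULL,"
        " name TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, digest: str):
    """Return the file_id recorded for this digest, or None if unseen."""
    row = conn.execute(
        "SELECT file_id FROM ledger WHERE digest = ?", (digest,)
    ).fetchone()
    return row[0] if row else None


def record(conn: sqlite3.Connection, digest: str, file_id: str, name: str) -> str:
    """Insert the mapping unless one already exists; return the stored file_id."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO ledger (digest, file_id, name) VALUES (?, ?, ?)",
            (digest, file_id, name),
        )
    return lookup(conn, digest)
```

Note what this does and does not give you: two processes can still both upload the same content at the same moment, but only one mapping wins. If `record` returns a file_id other than the one you just uploaded, your copy is the duplicate and can be deleted right away.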

I chose SHA-256 because the chance of a collision is effectively zero in practice. The odds of two different files sharing a hash and getting mixed up are astronomically small, and not worth considering at the scale of reference data. I avoided older hashes like MD5 precisely so that, if I ever suspected a collision, I would not waste time ruling it out. A ledger is a long-lived mechanism, and the longer something lives, the more it pays to build its foundation on the safe side.
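A minimal sketch of the hashing step, assuming some files may be too large to read into memory in one go. Reading in chunks yields the same digest as hashing the whole byte string at once. The name `content_hash_stream` and the 1 MiB default for `chunk_size` are illustrative choices, not part of the original pipeline.

```python
import hashlib
from pathlib import Path


def content_hash_stream(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-256 hex digest of a file's bytes, read in fixed-size chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # read() returns b"" at end of file, which ends the loop
        while chunk := fh.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

For small reference data, hashing the bytes in memory is simpler and just as correct; the chunked form only matters once file sizes grow.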

import hashlib
import json
from pathlib import Path
from anthropic import Anthropic
 
client = Anthropic()  # ANTHROPIC_API_KEY from the environment
LEDGER_PATH = Path("file_ledger.json")
BETA = "files-api-2025-04-14"  # the Files API needs a beta header
 
 
def load_ledger() -> dict:
    if LEDGER_PATH.exists():
        return json.loads(LEDGER_PATH.read_text())
    return {}
 
 
def save_ledger(ledger: dict) -> None:
    # write to a temp file then swap, to avoid corruption mid-write
    tmp = LEDGER_PATH.with_suffix(".tmp")
    tmp.write_text(json.dumps(ledger, ensure_ascii=False, indent=2))
    tmp.replace(LEDGER_PATH)
 
 
def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
 
 
def get_or_upload(path: str) -> str:
    """Return an existing file_id for identical content; upload if unseen."""
    data = Path(path).read_bytes()
    digest = content_hash(data)
 
    ledger = load_ledger()
    if digest in ledger:
        return ledger[digest]["file_id"]  # reuse — no upload
 
    uploaded = client.beta.files.upload(
        file=(Path(path).name, data, "application/octet-stream"),
        betas=[BETA],
    )
    ledger[digest] = {"file_id": uploaded.id, "name": Path(path).name}
    save_ledger(ledger)
    return uploaded.id

The crucial point is that the hash is computed over the bytes, not the filename. Judge by filename and you miss the case where the name stays the same but the content changes. Judge by content and, conversely, files with different names but identical content collapse into one.
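Both cases are easy to see with in-memory data. The (name, bytes) pairs below are made-up stand-ins for files on disk.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical (filename, content) pairs.
monday = ("reference.pdf", b"prices v1")
tuesday = ("reference.pdf", b"prices v2")            # same name, new content
renamed = ("reference-2026-06-22.pdf", b"prices v1")  # new name, same content

# Same filename, but the ledger sees different content and uploads again.
same_name_changed = content_hash(monday[1]) != content_hash(tuesday[1])

# Different filename, but the ledger sees one piece of content and reuses it.
renamed_duplicate = content_hash(monday[1]) == content_hash(renamed[1])
```

Both `same_name_changed` and `renamed_duplicate` evaluate to `True`, which is exactly the behavior a filename-keyed ledger cannot provide.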

Give the Ledger a "Last Used" Date

A little extra effort here makes the later GC much easier: record, in each ledger entry, "the last day this file_id was used."

import datetime
 
 
def touch(digest: str, ledger: dict) -> None:
    """Stamp the ledger to mark that we referenced this entry."""
    if digest in ledger:
        ledger[digest]["last_used"] = datetime.date.today().isoformat()
        save_ledger(ledger)

Update last_used inside get_or_upload both when you reuse and when you upload. With that in place, you can mechanically pick out "file_ids not referenced recently" from the ledger, which becomes the basis for the GC decision.
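One way to stamp both paths is sketched below. To keep it runnable without network access, the upload is passed in as a callable; in the pipeline that callable would wrap the real upload call, and the caller would persist the ledger afterwards. The name `get_or_upload_stamped` and the `upload` parameter are hypothetical.

```python
import datetime
import hashlib
from typing import Callable


def get_or_upload_stamped(
    name: str,
    data: bytes,
    ledger: dict,
    upload: Callable[[str, bytes], str],
) -> str:
    """Reuse or upload, stamping last_used on both paths."""
    digest = hashlib.sha256(data).hexdigest()
    today = datetime.date.today().isoformat()

    entry = ledger.get(digest)
    if entry is None:
        # First sighting of this content: upload and record it.
        entry = {"file_id": upload(name, data), "name": name}
        ledger[digest] = entry

    # Stamped whether we reused an entry or just created one.
    entry["last_used"] = today
    return entry["file_id"]
```

Stamping on upload matters as much as stamping on reuse: an entry with no last_used looks inactive to any check that filters on that field.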

Before I tracked last_used, I once deleted a file on a hunch that it was no longer needed, and the next morning's job fell over because the file_id was gone. Keep the justification for deletion on the ledger side and you no longer have to delete on a hunch.

Orphan GC — Wire list and delete Carefully

With the ledger in place, move on to reaping orphans. GET /v1/files lists the files visible to your API key, one page at a time (the SDK's list iterator follows the pages for you). Cross-reference that listing against your ledger and you find files "present on the API but not acknowledged as active by your ledger": the orphan candidates.

The most important rule here: do not delete a file just because it is absent from the ledger. Sweeping up a file uploaded by another process or tool, or one uploaded moments ago and not yet written to the ledger, will break production jobs. I always run the following double check.

First, the deletion target must not be in the set of file_ids the ledger acknowledges as active. Second, enough time must have passed since creation (to avoid mix-ups right after an upload, I use 24 hours as the floor).

import datetime
 
KEEP_DAYS = 14          # referenced within this many days counts as active
MIN_AGE_HOURS = 24      # never touch files created very recently
 
 
def active_file_ids(ledger: dict) -> set:
    cutoff = datetime.date.today() - datetime.timedelta(days=KEEP_DAYS)
    ids = set()
    for entry in ledger.values():
        last = entry.get("last_used")
        if last and datetime.date.fromisoformat(last) >= cutoff:
            ids.add(entry["file_id"])
    return ids
 
 
def collect_orphans(dry_run: bool = True) -> list:
    """Return orphan file_ids; delete them only when dry_run is False."""
    ledger = load_ledger()
    keep = active_file_ids(ledger)
    now = datetime.datetime.now(datetime.timezone.utc)
    min_age = datetime.timedelta(hours=MIN_AGE_HOURS)

    # Finish listing before deleting, so deletions cannot disturb pagination.
    candidates = []
    for f in client.beta.files.list(betas=[BETA]):
        if f.id in keep:
            continue  # active, protected
        if now - f.created_at < min_age:
            continue  # too new, protected
        candidates.append(f.id)

    if not dry_run:
        for file_id in candidates:
            client.beta.files.delete(file_id, betas=[BETA])
    return candidates

Defaulting dry_run=True is deliberate. For the first several runs, always print only the candidate list and confirm with your own eyes that only deletable files appear, then flip to dry_run=False. I do not recommend running auto-deletion unattended from day one. Exactly once, I ran delete without verification and swept up files I actually needed.
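The double check is easier to trust when it can be exercised without touching the API. The sketch below pulls the decision into a pure function. `FileInfo` is a hypothetical stand-in for the two fields the GC reads from a file listing, and `select_candidates` is an illustrative name.

```python
import datetime
from dataclasses import dataclass


@dataclass
class FileInfo:
    """Minimal stand-in for a listed file: only the fields the GC reads."""
    id: str
    created_at: datetime.datetime


def select_candidates(files, keep_ids, now, min_age_hours=24):
    """Apply both checks: not active in the ledger, and old enough."""
    min_age = datetime.timedelta(hours=min_age_hours)
    return [
        f.id
        for f in files
        if f.id not in keep_ids and (now - f.created_at) >= min_age
    ]
```

With the decision isolated like this, a unit test can pin down the edge cases (a file created an hour ago, a file the ledger still marks active) before any real delete is ever issued.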

Fixing Drift Between Ledger and Reality

Keep operating and the opposite drift appears too: present in the ledger, but no entity on the API side. Typical causes: someone deleted a file by hand, a TTL expired it, or a ledger from another environment got mixed in.

Leave that drift alone and get_or_upload decides "it is in the ledger, so reuse" and returns a dead file_id, and the Messages request that references it then fails with a not-found error. Adding a path that lightly confirms the entity exists before reuse keeps you on the safe side.

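from anthropic import NotFoundError


def resolve(path: str) -> str:
    """Confirm the entity exists before reuse; re-upload if it vanished."""
    data = Path(path).read_bytes()
    digest = content_hash(data)
    ledger = load_ledger()

    if digest in ledger:
        fid = ledger[digest]["file_id"]
        try:
            client.beta.files.retrieve_metadata(fid, betas=[BETA])
            touch(digest, ledger)
            return fid
        except NotFoundError:
            # Only a confirmed 404 drops the entry; timeouts and rate
            # limits propagate instead of triggering a duplicate upload.
            del ledger[digest]
            save_ledger(ledger)

    fid = get_or_upload(path)
    touch(digest, load_ledger())  # stamp fresh uploads so the GC sees them as active
    return fid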

Inserting retrieve_metadata every time adds one round trip of latency, so I run it as "reconcile once at the head of the job, then trust the ledger from there." Where you pay the existence-check cost is a decision to fit to the shape of your job.
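The "reconcile once" step can be sketched as a pure function over the ledger and a set of live ids. `reconcile` is a hypothetical helper; the entry shape (`{"file_id": ...}`) is the one the ledger in this article uses.

```python
def reconcile(ledger: dict, live_ids: set) -> list:
    """Drop ledger entries whose file_id is no longer live.

    Mutates `ledger` in place and returns the digests that were dropped.
    """
    dead = [
        digest
        for digest, entry in ledger.items()
        if entry["file_id"] not in live_ids
    ]
    for digest in dead:
        del ledger[digest]
    return dead
```

At the head of a job, `live_ids` would be built from a single listing, for example `{f.id for f in client.beta.files.list(betas=[BETA])}`, and the ledger saved afterwards. That is one list call per job instead of one metadata call per file.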

A Small Habit: Watch Storage at a Fixed Point

Finally, one habit that helped more than the design itself, mundane as it is: I added a weekly job that does nothing but log the total file count and total bytes.

def storage_snapshot() -> dict:
    files = list(client.beta.files.list(betas=[BETA]))
    total_bytes = sum(f.size_bytes for f in files)
    return {"count": len(files), "total_mb": round(total_bytes / 1024 / 1024, 2)}

If the numbers keep climbing in a staircase, that is a sign the ledger or the GC has stopped working somewhere. Since adding this fixed-point check, I have kept a file count that once swelled toward eighty stable at four to six. The absolute cost is small, but for a workflow you intend to run for a long time, knowing you are no longer paying for things you do not use is also a weight off your mind.
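Spotting the staircase does not have to be done by eye. A minimal sketch, assuming the weekly counts are kept in a list with the oldest first; `is_staircase` and the default of three consecutive rises are illustrative choices, not a tuned threshold.

```python
def is_staircase(counts: list, steps: int = 3) -> bool:
    """True if each of the last `steps` snapshots rose above the one before."""
    if len(counts) < steps + 1:
        return False  # not enough history to judge
    recent = counts[-(steps + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A flat or fluctuating series returns `False`; only an unbroken run of increases trips the check, which is the pattern a stalled ledger or GC tends to produce.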

As a first step, run GET /v1/files once against the pipeline you are running now. If the number that appears is larger than you expected, building the ledger and GC from this article is worth your time.
