API & SDK / 2026-06-12 / Intermediate

Handing Monthly Revenue CSVs to the Claude API Code Execution Tool — Files API Wiring, Container Reuse, and Billing Traps

How I moved monthly revenue-CSV reconciliation for four apps into the Claude API Code Execution tool — Files API integration, container reuse, the 5-minute billing minimum, and the file-preload charge that surprises almost everyone.



At the end of every month, my downloads folder fills up with CSVs. There is the financial report from App Store Connect, the estimated earnings export from AdMob, and then the reports from the three networks riding in my mediation stack — Unity Ads, Liftoff, and InMobi. The currencies do not match, the reporting periods do not match, and the column names certainly do not match. Reconciling them all into one trustworthy monthly number has been a fixture of my month-end for years.

I started building apps as an indie developer in 2014, beginning with a wallpaper app, and today I run four apps centered on wallpapers and relaxation content. Cumulative downloads passed 50 million along the way, which I am grateful for — but every bit of that growth added another report to the month-end pile. For a long time I handled the reconciliation with local pandas scripts, and for just as long I kept performing the same small repair: some network renames a column or shuffles its export layout, the script falls over on a missing key, I patch the column mapping, and the cycle repeats a month or two later. None of the individual fixes were hard. It was the recurrence that wore me down.

Earlier this year I moved this monthly closing work onto the Claude API Code Execution tool. The shape of the change matters: instead of pasting CSV contents into the prompt as tokens, the files travel into a sandboxed container as files, and pandas runs on the other side. Claude writes the code, executes it next to the data, and sends back only the conclusions. Actually wiring this up revealed a few gaps between the minimal examples in the documentation and what production use requires. Billing that starts the moment a file is attached — even if the tool never runs — and the pause_turn stop reason that surfaces mid-task are the two big ones. What follows is the working code, together with how to step over each of those gaps.

Why I Stopped Maintaining the Local Aggregation Scripts

To be clear, I did not replace everything, and I want to draw the boundary precisely because the boundary is the useful part. My daily KPI check still runs on the local pipeline I described in "A Daily Revenue Pipeline for 4 Wallpaper Apps: 8 Weeks Running App Store Connect API + AdMob With Claude Code", and it runs reliably. When the input format is stable and the question is identical every morning (yesterday's revenue, by app, by network), a plain script is faster, cheaper, and easier to reason about. Nothing about the Code Execution tool changes that calculus.

What kept breaking was the monthly close, and the reason was structural rather than accidental. After I added Unity Ads, Liftoff, and InMobi to my mediation setup last year — and finished the payment profiles and W-8BEN paperwork that made their payouts real — the variety of incoming reports jumped from two formats to five. Revenue arrives in a mix of USD and JPY. One network cuts its reporting periods on calendar days, another on Pacific time. The same concept is called revenue in one file and estimated_earnings in another, and a column that existed in March quietly disappears in April. Code that absorbs this kind of naming drift case-by-case accumulates conditionals, and it becomes less readable with every patch. I knew the script had become a liability the day I hesitated to open it.

So I redrew the boundary. Stable, repetitive aggregation stays local; format-drifting, exploratory reconciliation goes to the sandbox. The reasoning is almost mechanical: the maintenance cost of a script is proportional to how often its input format changes. For inputs that never change, that cost rounds to zero and a script wins. For inputs that change every few months in unannounced ways, you are effectively rewriting the parser on a schedule — and letting Claude absorb the column-name drift at read time is simply cheaper than editing code every time a network ships a new export layout.

The second reason is context economy, and at this file size it is decisive. Paste a CSV into the message body and every row becomes billable input tokens that also crowd the context window. With the Code Execution tool plus the Files API, files load directly into the container and Claude reads them with pandas. The only things that land in your token budget are the code Claude writes and the printed output it chooses to show. For a monthly job that touches several megabytes of reports, this is not an optimization — it is the difference between the approach working and not working at all. I will put numbers on that in the billing section.

Where the Tool Stands Today — Old Tutorials Will Mislead You

The tool has been revised several times since its first public beta, and the gap between early write-ups and the current shape is now wide enough to cause real confusion. I lost a morning to this myself, so here is the sorted version.

code_execution_20250522 (legacy): the original Python-only release. Many tutorials still assume it, along with its code-execution-2025-05-22 beta header and its old single response-block format
code_execution_20250825 (current standard): adds Bash commands and file operations on top of Python, and is available on every supported model. The response format changed too — results now arrive as separate bash and text-editor block families
code_execution_20260120 (newest): adds REPL state persistence and programmatic tool calling from inside the sandbox, but only on newer generations such as Opus 4.5+ and Sonnet 4.5+. Haiku 4.5 stops at 20250825, which matters if you route cheap aggregation jobs to smaller models

The runtime, as of this writing, is a Linux container running Python 3.11 with 5GiB of memory, 5GiB of disk, and one CPU. The libraries that matter for reconciliation work are preinstalled — pandas, numpy, scipy, statsmodels for the analysis itself; matplotlib and seaborn for charts; openpyxl, xlsxwriter, and pyarrow for the file formats finance teams love. One constraint shapes everything else: the container has no internet access whatsoever. No pip install, no calling out to an exchange-rate API, no fetching a reference table from a URL. Whatever facts the analysis needs from the outside world must arrive in the prompt or in the uploaded files. The full specification lives in the official documentation.

One more practical note before the code: the current minimal examples run code execution without any beta header. The only beta flag you need is files-api-2025-04-14, and only when you bring the Files API into the picture. Update the anthropic package before you start — older SDK versions do not know the newer response types, and the resulting attribute errors look far more mysterious than they are.


Start Minimal — Learn the Response Blocks First

The first step worth taking is a file-free request, purely to understand the response shape. Skipping this and jumping straight to Files API integration is how people end up lost in response parsing, because the response of a code-execution request looks unlike anything the standard tool-use loop has taught you to expect.

import anthropic
 
client = anthropic.Anthropic()  # ANTHROPIC_API_KEY comes from the environment
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Calculate the mean and standard deviation of monthly revenue [412000, 389000, 455000, 401000, 478000] JPY",
    }],
    tools=[{"type": "code_execution_20250825", "name": "code_execution"}],
)
 
# The response is a sequence of content blocks
for block in response.content:
    if block.type == "server_tool_use":
        # The command or code Claude ran in the sandbox
        print("Executed:", block.input)
    elif block.type == "bash_code_execution_tool_result":
        result = block.content
        print("stdout:", result.stdout)
        print("return_code:", result.return_code)
    elif block.type == "text":
        print("Claude's explanation:", block.text)

Run it and you will see the anatomy clearly: a server_tool_use block holding the code Claude wrote, a bash_code_execution_tool_result block whose stdout contains the mean of 427,000 JPY along with the standard deviation, and a closing text block where Claude explains the result in plain language.

The point to internalize is that execution completes entirely server-side. Unlike client tools, you never assemble a tool_result yourself and there is no second round trip; the server_tool_use and *_tool_result pairs simply appear, already resolved, inside a single response. In the current version those result blocks come in two families. Shell runs come back as bash_code_execution_tool_result carrying stdout, stderr, and return_code. File operations come back as text_editor_code_execution_tool_result, whose shape varies by operation — a view returns the file content with line counts, a create reports whether it overwrote an existing file, and a str_replace edit returns a small diff. Your parser should expect both families. And make a habit of checking stderr and return_code on every result: it is the difference between catching a silently failed aggregation and shipping it to your accounting folder.
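
As a minimal sketch of that habit, the helper below walks a list of content blocks and reports anything that looks like a failed run. The function name `audit_tool_results` is hypothetical, and the stand-in objects only imitate the attributes discussed above (`stdout`, `stderr`, `return_code`, plus an assumed `error_code` on error results); a real run would pass `response.content` instead.

```python
from types import SimpleNamespace


def audit_tool_results(content_blocks):
    """Return human-readable problems found in code-execution result blocks."""
    problems = []
    for block in content_blocks:
        block_type = getattr(block, "type", "")
        if not block_type.endswith("_tool_result"):
            continue  # text and server_tool_use blocks are not results
        result = getattr(block, "content", None)
        result_type = getattr(result, "type", "")
        if result_type.endswith("_error"):
            # The tool itself failed (assumed shape: an error_code attribute)
            code = getattr(result, "error_code", "unknown")
            problems.append(f"{block_type}: tool error {code}")
        elif block_type == "bash_code_execution_tool_result":
            return_code = getattr(result, "return_code", None)
            stderr = (getattr(result, "stderr", "") or "").strip()
            if return_code != 0:
                problems.append(f"bash exited with {return_code}: {stderr}")
            elif stderr:
                problems.append(f"bash wrote to stderr: {stderr}")
    return problems


# Stand-in blocks for illustration only
ok = SimpleNamespace(
    type="bash_code_execution_tool_result",
    content=SimpleNamespace(type="bash_code_execution_result",
                            stdout="427000.0\n", stderr="", return_code=0),
)
failed = SimpleNamespace(
    type="bash_code_execution_tool_result",
    content=SimpleNamespace(type="bash_code_execution_result",
                            stdout="", stderr="KeyError: 'revenue'\n", return_code=1),
)
edit = SimpleNamespace(
    type="text_editor_code_execution_tool_result",
    content=SimpleNamespace(type="text_editor_code_execution_create_result"),
)

print(audit_tool_results([ok, failed, edit]))
```

Text-editor results pass through untouched here unless they carry an error type. That is a deliberate simplification, since their shape varies by operation.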

Sending Revenue CSVs Into the Container via the Files API

With the response shape understood, the real work starts. Upload the monthly reports through the Files API, then reference them in the message with container_upload blocks — one per file. The files are placed into the container before the model starts working, so Claude can list, inspect, and read them with ordinary Python.

import anthropic
from pathlib import Path
 
client = anthropic.Anthropic()
 
# The monthly report set (App Store Connect / AdMob / three mediation networks)
report_files = [
    "asc_financial_2026_05.csv",
    "admob_2026_05.csv",
    "unity_ads_2026_05.csv",
    "liftoff_2026_05.csv",
    "inmobi_2026_05.csv",
]
 
uploaded = []
for path in report_files:
    # Open each report in a with-block so the handle is closed after the upload
    with Path(path).open("rb") as fh:
        f = client.beta.files.upload(file=fh)
    uploaded.append({"type": "container_upload", "file_id": f.id})
 
AGGREGATION_RULES = """You are performing a monthly revenue aggregation for mobile apps. Follow these rules strictly.
 
- Convert all amounts to JPY. For USD rows, use the exchange_rate column if present; otherwise use 1 USD = 155 JPY
- Align the period to calendar days from 2026-05-01 through 2026-05-31. Exclude rows outside the range
- Identify apps by whichever of bundle_id / package_name / app_id exists, and build a mapping table for naming variants first
- Write the aggregated result to summary_2026_05.csv, broken down by network and by app
- Save a stacked bar chart (network x app) to revenue_2026_05.png
- Never fill in missing numbers by guessing. Write any row you are unsure about to excluded_rows.csv with a reason
"""
 
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    betas=["files-api-2025-04-14"],
    max_tokens=8192,
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": AGGREGATION_RULES}] + uploaded,
    }],
    tools=[{"type": "code_execution_20250825", "name": "code_execution"}],
)

The heart of this code is not the upload loop — it is AGGREGATION_RULES. Exchange rates, period definitions, exclusion criteria: these are accounting decisions, and they belong to the human no matter how capable the sandbox is. This is also why the prompt asks for a naming-variant mapping table before any aggregation: when the same app appears as a bundle ID in one report and a marketing name in another, I want that reconciliation made explicit and inspectable rather than implied inside a join.

The line that earns its keep is the last one. Instructing Claude to write every questionable row to excluded_rows.csv with a reason means nothing ever drops out of the aggregation untraceably. I added that rule after watching a run come close to quietly imputing averages into rows with missing values — exactly the failure mode you never notice until a quarter later. Since then, every aggregation prompt I write includes an exclusion log, and I would recommend the habit for any LLM-driven data work, not just this tool.

The exchange rate is passed in the prompt because of the constraint mentioned earlier: the container cannot reach the network. In my first month I overlooked that and watched Claude attempt to call a currency API and fail, politely, several times. Deciding up front which information must be supplied from outside, versus computed inside, removes a whole class of confusion — I now keep a short written list of externally-sourced facts that the monthly prompt must carry.
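
One way to keep that written list honest is to hold the externally sourced facts in a small structure and render them into the prompt, so nothing is retyped by hand each month. This is only a sketch: `ExternalFacts` and `render_external_facts` are hypothetical names, and the two facts shown are the ones from the rules above.

```python
import calendar
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalFacts:
    """Facts the offline sandbox cannot look up for itself."""
    year: int
    month: int
    usd_to_jpy: float  # fallback rate, decided by a human before the run


def render_external_facts(facts: ExternalFacts) -> str:
    """Render the facts as prompt lines in the style of AGGREGATION_RULES."""
    last_day = calendar.monthrange(facts.year, facts.month)[1]
    start = f"{facts.year:04d}-{facts.month:02d}-01"
    end = f"{facts.year:04d}-{facts.month:02d}-{last_day:02d}"
    return "\n".join([
        f"- Reporting period: {start} through {end} (calendar days)",
        f"- Fallback exchange rate: 1 USD = {facts.usd_to_jpy:g} JPY",
    ])


may_facts = ExternalFacts(year=2026, month=5, usd_to_jpy=155)
print(render_external_facts(may_facts))
```

The rendered lines can then be appended to the rule text, which leaves one obvious place to update when the month or the rate changes.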

Retrieving the Generated Charts and CSVs

The files created inside the sandbox — summary_2026_05.csv, revenue_2026_05.png, excluded_rows.csv — come back as file IDs referenced inside the response, which you then download through the Files API. Two details are worth knowing: the IDs are nested fairly deep inside the result blocks, and the official sample walks only the bash result family. I scan all result blocks defensively instead.

def extract_generated_file_ids(response) -> list[str]:
    """Collect file_ids of generated files from every tool-result block."""
    file_ids = []
    for block in response.content:
        if not block.type.endswith("_tool_result"):
            continue
        content = getattr(block, "content", None)
        if content is None:
            continue
        # Output-file references arrive in the result's content list
        for item in getattr(content, "content", None) or []:
            file_id = getattr(item, "file_id", None)
            if file_id:
                file_ids.append(file_id)
    return file_ids
 
 
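from pathlib import Path

for file_id in extract_generated_file_ids(response):
    meta = client.beta.files.retrieve_metadata(file_id)
    # Keep only the base name so a generated path cannot escape the working directory
    local_name = Path(meta.filename).name
    client.beta.files.download(file_id).write_to_file(local_name)
    print(f"Downloaded: {local_name}")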

The getattr-based defensive style is deliberate. The result block types have already changed once between tool versions, and for an API that is still evolving, encoding the intent — collect anything that carries a file_id — has proven more durable than pinning exact types with isinstance. When the SDK adds a new result variant, this function keeps working; a type-pinned version would need another patch, which is the exact failure pattern I left local scripts to escape.

One lifecycle note: files generated by code execution persist on the Files API side until you explicitly delete them, while the container's own working data expires on the 30-day schedule. If you produce a chart every month and never clean up, the file list grows quietly — worth a periodic sweep.
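
A periodic sweep could look roughly like the sketch below. It assumes the SDK's `client.beta.files.list()` and `client.beta.files.delete()` calls and the `downloadable` and `created_at` fields on file metadata. The function name and the 60-day threshold are my own placeholders, and treating `downloadable` as a proxy for "generated by code execution" is an assumption worth checking against your own file list before deleting anything.

```python
from datetime import datetime, timedelta, timezone


def sweep_generated_files(client, max_age_days=60, now=None):
    """Delete old downloadable files and return the filenames that were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    # Collect first, delete second, so deletion never disturbs pagination
    stale = [
        meta for meta in client.beta.files.list()
        if getattr(meta, "downloadable", False) and meta.created_at < cutoff
    ]
    for meta in stale:
        client.beta.files.delete(meta.id)
    return [meta.filename for meta in stale]
```

Uploaded source CSVs are skipped by the `downloadable` filter in this sketch, so they would need their own cleanup pass if you want them gone as well.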

Of everything that comes back, the file I open every single month is excluded_rows.csv. Reading the reasons attached to dropped rows is how I noticed that Liftoff's export had test-campaign rows mixed in, and that a timezone notation had changed in one date column. The aggregated answer is useful, but the rows that fell out of the answer carry more operational signal — that is an honest takeaway from twelve years of monthly closes.
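
For that monthly read, a quick tally of the reasons is often enough to spot something new. The sketch below uses only the standard library. The `reason` column name and the sample rows are assumptions for illustration, since the actual layout of excluded_rows.csv is whatever Claude wrote during the run.

```python
import csv
import io
from collections import Counter


def tally_exclusion_reasons(csv_text, reason_column="reason"):
    """Count excluded rows per reason so unfamiliar reasons stand out."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        (row.get(reason_column) or "(no reason given)").strip()
        for row in reader
    )


# Made-up sample; a real run would read the downloaded excluded_rows.csv
sample = """network,app_id,amount,reason
liftoff,com.example.wallpaper,12.5,test campaign row
liftoff,com.example.wallpaper,3.1,test campaign row
inmobi,com.example.relax,,missing amount
"""

for reason, count in tally_exclusion_reasons(sample).most_common():
    print(count, reason)
```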

Container Reuse Turns the Close Into a Conversation

The part I have come to appreciate most is that the close no longer has to finish in a single round trip. Every response from a code-execution request carries a container reference; pass that ID into the next request and you get the same sandbox back, working files included.

container_id = response.container.id
 
followup = client.beta.messages.create(
    container=container_id,  # Reuse the same sandbox, working files included
    model="claude-sonnet-4-6",
    betas=["files-api-2025-04-14"],
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Re-read summary_2026_05.csv and list the three app x network pairs with the largest month-over-month decline",
    }],
    tools=[{"type": "code_execution_20250825", "name": "code_execution"}],
)

The conversation is brand new — no prior messages carried over — but summary_2026_05.csv is still sitting in the container, so follow-up questions need no re-upload and no re-aggregation. "Is that drop coming from eCPM or from impressions?" "Show me this network by day." "Recompute with last month's exchange rate so I can see how much is currency movement." The closing numbers become something you can interrogate interactively, which is precisely what a static script never gave me.

The caveat: containers expire 30 days after creation. Persist a container_id in your database and reuse it next month, and you will meet the container_expired error instead of your data. My rule is simple — reuse within the same working day, recreate every month from the source CSVs, which I keep anyway. The newer code_execution_20260120 version also persists REPL variable state between requests, not just files, which suits longer analytical sessions; but for a monthly close, the file persistence in 20250825 has been entirely sufficient for me.
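
The same-day rule can be encoded as a small guard in front of any stored ID. This is a sketch under my own assumptions: `reusable_container_id` is a hypothetical helper, and the 12-hour window is just a stand-in for "same working day", chosen to sit far inside the 30-day expiry.

```python
from datetime import datetime, timedelta, timezone

REUSE_WINDOW = timedelta(hours=12)  # hypothetical stand-in for "same working day"


def reusable_container_id(stored_id, created_at, now=None):
    """Return stored_id while it is inside the reuse window, otherwise None."""
    if stored_id is None or created_at is None:
        return None
    now = now or datetime.now(timezone.utc)
    return stored_id if now - created_at <= REUSE_WINDOW else None


created = datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc)
same_day = datetime(2026, 6, 1, 15, 0, tzinfo=timezone.utc)
next_month = datetime(2026, 7, 1, 9, 0, tzinfo=timezone.utc)

print(reusable_container_id("container_abc", created, now=same_day))
print(reusable_container_id("container_abc", created, now=next_month))
```

When the helper returns None, the simplest path is to omit the container argument from the request entirely and let a fresh sandbox be created from the source CSVs.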

Understand the Billing Before You Ship — the 5-Minute Minimum and the Free Tier

Code execution is billed by execution time, tracked separately from tokens, and the published terms have a few corners worth knowing before you wire this into anything recurring.

Each billed session counts a minimum of 5 minutes, even if the actual run took forty seconds
Every organization gets 1,550 free hours per month; beyond that it is $0.05 per hour, per container
Model token charges (input and output) apply as usual, on top of execution time
When web_search_20260209 or web_fetch_20260209 is in the same request, code execution itself carries no extra charge beyond tokens
The response reports tool activity under usage.server_tool_use.code_execution_requests, which is the number to watch in monitoring

And then there is the clause that catches nearly everyone. If a request includes files, execution time is billed even when the tool is never invoked, because the files are preloaded onto a container regardless of whether Claude decides to run code. While you are still iterating on prompt wording, leave the files out and tune the text alone; attach the files only once the instructions have settled. That ordering alone prevents a pile of pointless 5-minute charges, and it is not a hypothetical — my first week of experimentation was mostly prompt tuning with files attached, which in hindsight was paying for preloads I never used.
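
To make the arithmetic concrete, here is a rough estimator built only from the figures quoted above: the 5-minute minimum, 1,550 free hours, and $0.05 per hour. It assumes a session longer than the minimum is billed at its actual duration, which the terms quoted here do not spell out, so treat the output as an approximation.

```python
MINIMUM_BILLED_MINUTES = 5
FREE_HOURS_PER_MONTH = 1550
USD_PER_HOUR = 0.05


def billed_minutes(session_minutes):
    """Apply the 5-minute minimum to every session, including preload-only ones."""
    return sum(max(MINIMUM_BILLED_MINUTES, minutes) for minutes in session_minutes)


def execution_cost_usd(session_minutes, free_hours=FREE_HOURS_PER_MONTH):
    """Charge only the hours that exceed the monthly free allowance."""
    hours = billed_minutes(session_minutes) / 60
    return max(0.0, hours - free_hours) * USD_PER_HOUR


# Ten prompt-tuning requests with files attached where the tool never ran,
# followed by one real 12-minute close
tuning = [0] * 10
monthly_close = [12]

print(billed_minutes(tuning))
print(billed_minutes(tuning + monthly_close))
print(execution_cost_usd(tuning + monthly_close))
```

The ten idle requests account for fifty of the sixty-two billed minutes. At this scale the money is still zero, but the proportion shows why iterating file-free is the better default.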

For scale, here is what the real operation looks like. My monthly close runs one container for 5 to 15 minutes, follow-ups included, and never approaches one hour per month. Execution time therefore stays entirely inside the free tier, and the actual spend is the token side — a close costs somewhere in the tens of thousands of tokens including output, which lands under a dollar. For contrast, pushing five files totaling roughly 8MB through the prompt as raw text would be on the order of two million tokens. That does not fit a standard context window at all; even split across chunked requests it would cost several dollars in input tokens alone, before you account for the accuracy loss of making a language model eyeball-parse half a million CSV rows instead of running pandas over them. Files go to the container; the prompt carries the judgment rules. That split has been the right one on both cost and precision, and I expect it generalizes well beyond revenue reports.
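
The two-million-token figure is a back-of-envelope estimate, and the sketch below reproduces it. Both constants are assumptions rather than measured values: roughly four bytes of CSV text per token, and a placeholder input price that you should replace with the current rate for your model.

```python
BYTES_PER_TOKEN = 4  # rough assumption for CSV text; real tokenization varies
HYPOTHETICAL_USD_PER_MILLION_INPUT_TOKENS = 3.0  # placeholder, not a quoted price


def estimated_tokens(total_bytes):
    """Approximate how many input tokens raw text of this size would consume."""
    return total_bytes // BYTES_PER_TOKEN


def input_cost_usd(tokens, usd_per_million_tokens):
    """Convert a token count into an input-side cost."""
    return tokens / 1_000_000 * usd_per_million_tokens


csv_bytes = 8 * 1024 * 1024  # five reports totaling roughly 8MB
tokens = estimated_tokens(csv_bytes)
cost = input_cost_usd(tokens, HYPOTHETICAL_USD_PER_MILLION_INPUT_TOKENS)

print(f"{tokens:,} tokens, about ${cost:.2f} of input at the placeholder price")
```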

Common Mistakes and Pitfalls

These are the ones I hit, or came close to hitting, in real operation.

1. Mixing in legacy sample code. Search results still surface tutorials from the Python-only code_execution_20250522 era. Copy response-parsing code from them and you will silently miss the current bash_code_execution_tool_result / text_editor_code_execution_tool_result blocks — your loop simply never matches, and the symptom looks like "Claude ran nothing." Standardize on 20250825 and parse both result families.

2. Assuming "no tool call, no charge." As above: attaching files starts the execution-time meter on its own. Ten prompt-tuning iterations with files attached burn fifty minutes of free tier without a single execution. Iterate file-free, attach when stable.

3. Treating pause_turn as the end of the turn. On long reconciliations across five files, the API may return mid-task with stop_reason set to pause_turn. It is not an error and not a refusal — it means "there is more." Feed the response back as-is and execution continues:

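messages = [{
    "role": "user",
    "content": [{"type": "text", "text": AGGREGATION_RULES}] + uploaded,
}]
request = {
    "model": "claude-sonnet-4-6",
    "betas": ["files-api-2025-04-14"],
    "max_tokens": 8192,
    "tools": [{"type": "code_execution_20250825", "name": "code_execution"}],
}
 
MAX_CONTINUATIONS = 10  # arbitrary safety cap so a stuck run cannot loop forever
for _ in range(MAX_CONTINUATIONS):
    resp = client.beta.messages.create(messages=messages, **request)
    if resp.stop_reason != "pause_turn":
        break
    # Append the paused turn as-is and the run resumes where it stopped
    messages.append({"role": "assistant", "content": resp.content})
    if resp.container is not None:
        # Assumption: reuse the same sandbox so intermediate files survive the continuation
        request["container"] = resp.container.id
else:
    raise RuntimeError("Run was still paused after MAX_CONTINUATIONS continuations")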

Before I understood this, I nearly read a half-finished aggregation as the final result — the partial output looked plausible, which is what makes this trap dangerous. Build the stop_reason loop in from day one.

4. Storing container IDs as if they were permanent. Thirty days after creation, the container is gone, and container_expired is all you get back. Never design for cross-month reuse; recreate monthly from source files.

5. Forgetting the sandbox is offline. No exchange-rate lookups, no pip install, no reference data fetched from a URL. State externally-sourced facts in the prompt, and check the preinstalled library list covers your needs before committing to the design. For aggregation work, pandas, numpy, matplotlib, and openpyxl were all I needed; if your workflow depends on a niche package, this tool may not be the right home for it.

6. Sending sensitive data raw. This feature is not eligible for zero data retention, and container data persists for up to 30 days on Anthropic's side. Revenue reports contain little personal data, but the moment you route user reviews or support logs through the same pipeline, put a masking step in front. The approach I use is in "Before You Send Reviews and Crash Logs to the Claude API: A Reversible PII Masking Design". For managing the uploaded files themselves (listing, lifecycle, deletion), "Claude API Files API — Persist Documents and Slash API Costs" covers the mechanics in more depth.

Start by Handing Over a Single AdMob CSV

There is no need to build the whole pipeline at once, and I would actively advise against trying. The first step I would actually recommend: upload just this month's AdMob report, add a single container_upload block to the minimal example from earlier, and ask for nothing more than "plot estimated daily earnings as a line chart and save it as a png." That one exchange teaches you the response-block anatomy, the file_id retrieval, and the download path — the three things every later iteration builds on. Once you have been around that loop, growing the rule set line by line carries you the rest of the way to a full monthly close.
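
That starter request might be assembled as in the sketch below. `build_starter_request` is a hypothetical helper that only builds the arguments, so the shape can be inspected without spending anything. The model name, beta flag, and tool type are copied from the earlier examples, and the file ID shown is a placeholder rather than a real upload.

```python
def build_starter_request(file_id, model="claude-sonnet-4-6"):
    """Build the arguments for a single-file chart request."""
    return {
        "model": model,
        "betas": ["files-api-2025-04-14"],
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Plot estimated daily earnings as a line chart and save it as a png."},
                # One container_upload block for the single AdMob report
                {"type": "container_upload", "file_id": file_id},
            ],
        }],
        "tools": [{"type": "code_execution_20250825", "name": "code_execution"}],
    }


request = build_starter_request("file_placeholder_id")
print(request["messages"][0]["content"][1])
```

With a real upload in hand, the call would be `client.beta.messages.create(**build_starter_request(f.id))`, followed by the same file_id extraction and download steps shown earlier.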

When month-end stops being about chasing format drift and becomes about writing down judgment rules in plain language, the close itself turns into time spent reviewing how the business actually runs. I am still adding conversion rules every time excluded_rows.csv teaches me something new — and I find I do not mind the loop at all. Thank you for reading along this far.
