Cowork / 2026-06-28 / Advanced

When Your Cowork × GitHub MCP Triage Quietly Drifts Each Run — Field Notes on Idempotency and Label Boundaries

Cowork + GitHub MCP issue triage looks perfect on the first run, then quietly breaks when left unattended: duplicate comments, reclassification churn, rate exhaustion. Field notes on idempotent prompts, label-boundary self-audits, and request budgets that keep weekly triage stable.


Two weeks into running automated triage, the Monday-morning report stopped me. An old issue I had already left a "checking in" comment on the week before had received a second, nearly identical comment. To the person subscribed to that issue, a bot had simply said the same thing twice. A few other issues that had been type: feature last week were now type: improvement — even though nothing about them had changed.

The first, manual run had gone beautifully. It cleared a pile of unlabeled issues and handed me a prioritized backlog. The trouble was that I had designed triage as a one-time cleanup, not as something that runs again every week. Unattended automation has to be judged less on first-run accuracy and more on whether the hundredth run produces the same result. Otherwise it erodes trust without you noticing.

As an indie developer running several apps and four technical blogs solo, my GitHub backlog is permanently the thing I get to last. That is exactly why unattended automation is worth it — but unattended work fails in a specific way. These are field notes on the three sources of that quiet drift and the implementation that closes each one. They assume you already have GitHub MCP connected and want to put triage on a schedule.

The three drift sources that break unattended triage

There is a clear taxonomy to how triage that works by hand falls apart once it runs unattended. If you lump it all under "the accuracy is bad," no amount of prompt polishing will fix it.

Drift source	Symptom	Root cause
Non-idempotent state	Comments pile up on the same issue; labels get reapplied	Each run decides from scratch without reading its own prior output
Label-boundary drift	Classification wobbles week to week with no change in content	The criteria are not written down, so the week's wording decides
Rate cascade	Later issues go unprocessed; the run stops silently mid-way	Combined fetch/update/comment requests hit the hourly ceiling

Each is solved not by issuing stronger instructions but by observing state before acting.

Drift source 1: non-idempotent work double-posts comments

Idempotency — the property that repeating an operation does not change the result — is the foundation of unattended work. By hand, you remember "I already did this one last week." A scheduled session starts blank every time. Unless you record what you did and read it back on every run, double-processing is guaranteed.

The real damage shows up in comments, because they notify people. Relabeling is silent; a comment rings someone's inbox. The fix is to embed a machine-readable marker in the comments you author, and check for it next time.

Before commenting on Issue #<n> in [owner/repo],
fetch its existing comments with get_issue_comments.

Skip condition (if met, add no comment and record "skipped: already-marked"):
- A comment already contains the string "<!-- cowork-triage:stale-check -->"

Only if not met, add this comment with add_issue_comment.
Keep the trailing HTML comment — it is the marker.

---
This issue hasn't been updated in a while, so this is a status check.
If it still reproduces on the current version, let us know. If not, feel free to close.
<!-- cowork-triage:stale-check -->
---

The point is using an invisible HTML comment, <!-- cowork-triage:stale-check -->, as the identifier. GitHub does not render HTML comments, so readers see only the natural sentence while the machine reliably detects "already handled." Do not match on the visible text: change a single word and the dedup breaks, and you double-post again. Always match a fixed marker.
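The skip condition can also be written down as plain code, which makes the rule easy to test outside a scheduled session. This is a minimal sketch, not the article's implementation: it works on a list of comment bodies that has already been fetched, and the function names (`already_marked`, `plan_comment`) are hypothetical.

```python
MARKER = "<!-- cowork-triage:stale-check -->"

STALE_TEXT = (
    "This issue hasn't been updated in a while, so this is a status check.\n"
    "If it still reproduces on the current version, let us know. "
    "If not, feel free to close."
)


def already_marked(comment_bodies, marker=MARKER):
    """Return True if any existing comment body contains the fixed marker."""
    return any(marker in (body or "") for body in comment_bodies)


def build_stale_comment(marker=MARKER):
    """Visible text followed by the invisible marker on the last line."""
    return f"{STALE_TEXT}\n{marker}"


def plan_comment(comment_bodies):
    """Decide what to do for one issue: skip, or post a marked comment."""
    if already_marked(comment_bodies):
        return ("skipped: already-marked", None)
    return ("post", build_stale_comment())


if __name__ == "__main__":
    first_run = plan_comment(["Thanks for the report!"])
    second_run = plan_comment(["Thanks for the report!", first_run[1]])
    print(first_run[0])
    print(second_run[0])
```

Because the check matches only the marker, editing the visible wording later does not reopen the door to a second comment.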

Make label updates idempotent the same way: scope the target with a predicate over current state — "only the unlabeled ones," "only those without status: triage yet." "Reclassify every open issue" looks powerful but touches everything each run, which is the breeding ground for wobble.

In [owner/repo], target only open issues with no label starting with "type:"
(do not touch already-classified ones).
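Expressed as code, the predicate over current state is a small filter. The sketch below assumes each issue is a dict with `number`, `state`, and a list of label names under `labels`. That shape is an assumption made for illustration, not the exact structure returned by any particular tool.

```python
def needs_type_label(issue):
    """True only for open issues that have no label starting with 'type:'."""
    labels = issue.get("labels", [])
    has_type = any(name.startswith("type:") for name in labels)
    return issue.get("state") == "open" and not has_type


def select_targets(issues):
    """Return the issue numbers this run is allowed to classify."""
    return [issue["number"] for issue in issues if needs_type_label(issue)]


if __name__ == "__main__":
    sample = [
        {"number": 1, "state": "open", "labels": []},
        {"number": 2, "state": "open", "labels": ["type: feature"]},
        {"number": 3, "state": "closed", "labels": []},
        {"number": 4, "state": "open", "labels": ["priority: high"]},
    ]
    print(select_targets(sample))
```

Running the filter a second time after labels have been applied yields an empty list, which is the idempotent behavior the prompt is asking for.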

If you have wrestled with the same idempotency problem on the topic-generation side of an unattended pipeline, the predicate-over-state pattern will feel familiar (Designing a duplicate-topic detection gate for unattended Cowork pipelines).


Drift source 2: fuzzy label boundaries make classification wobble

type: feature versus type: improvement. priority: high versus priority: medium. These adjacent buckets are hard even for humans. Hand the decision to a model without written criteria and the call sways with how that week's issue happens to be phrased. When an issue's label flips across weeks with no content change, this is almost always why.

Two fixes. First, freeze the boundary as a decision table and feed the same one every run. Second, never overwrite an auto-applied classification — remove the room for wobble entirely.

A decision table works best when scoped to the confusable pairs only. Drawing one clean line through an error-prone neighbor pair beats trying to enumerate everything.

Confusable pair	The dividing line	Examples
feature / improvement	Adds behavior that does not exist yet = feature. Makes existing behavior faster/easier = improvement	"Add dark mode" = feature / "List view is slow" = improvement
bug / improvement	Not working as specified = bug. Works as specified but inconvenient = improvement	"Save fails" = bug / "Want a confirm dialog on save" = improvement
critical / high	Data loss, billing, or lockout with no workaround = critical. Major flow blocked but workaround exists = high	"Not reflected after payment" = critical / "Layout breaks on one device" = high

Then state "do not overwrite" explicitly. This single sentence stops most of the churn.

Classification is additive only: do not change labels on issues that already
have a type: or priority: label. If you believe the classification should change,
do not overwrite it — list the number separately as "reclassify-candidate" and report it.

Surface only the boundary cases a human should review as candidates, leaving the actual relabel to manual confirmation. That stops auto-classification wobble from landing on production labels. My weekly report now lists a few reclassify-candidate items I fix by eye — across three months, an average of two or three per week. Far lighter than second-guessing every issue.
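The additive-only rule reduces to a three-way decision per issue. The following is a sketch under stated assumptions: labels are plain strings, the proposed type comes from whatever classifier you use, and the function name `decide` and the returned dict shape are hypothetical.

```python
def decide(number, current_labels, proposed_type):
    """Apply a type label only when none exists; otherwise never overwrite."""
    current = next(
        (name for name in current_labels if name.startswith("type:")), None
    )
    if current is None:
        # No classification yet: adding one is safe.
        return {"action": "apply", "number": number, "label": proposed_type}
    if current == proposed_type:
        # Already classified the same way: nothing to do.
        return {"action": "none", "number": number}
    # Disagreement: report it for a human instead of relabeling.
    return {
        "action": "reclassify-candidate",
        "number": number,
        "note": f"#{number} {current}→{proposed_type}",
    }


if __name__ == "__main__":
    print(decide(12, [], "type: feature"))
    print(decide(13, ["type: feature"], "type: improvement"))
```

Only the first branch writes anything. The disagreement branch produces a line for the weekly report, so a wobbling classifier can change what is proposed but not what is stored.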

Drift source 3: rate cascades silently drop the back half

GitHub's authenticated rate limit is 5,000 requests/hour. That sounds generous, but each issue stacks several requests: fetch → update label → check existing comments → add comment. At four requests per issue, a run that comments on every issue hits the ceiling at about 1,250 issues (5,000 ÷ 4), and search_issues or list_issues paging gets you there sooner.

The danger is that when you hit the limit, only the back half quietly disappears. The early issues get handled, the report looks fine, and you notice late. The defense is an explicit budget on how many issues one run touches.

Process at most 25 issues per run.
Use list_issues exactly once; do not paginate.
Limit comments to issues judged priority: critical or high;
medium/low get a label update only (no comment).
When the budget is reached, list the remaining numbers and report them as "deferred."

Three economies act at once: the count cap bounds total requests, the comment condition cuts the heaviest writes, and no paging keeps fetches constant. At 25 issues and about 4 requests each, a run lands around 100 requests — roughly 2% of the 5,000 ceiling. Returning deferred lets the next run pick up where this one stopped, so a tight budget never means dropped work. "A little, reliably, every time" is more stable for unattended work than "everything in one pass."
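The budget arithmetic is simple enough to check in a few lines. This sketch uses the figures from the text (5,000 requests/hour, about 4 requests per issue, a cap of 25). The function names are hypothetical, and the per-issue cost is an estimate rather than a measured value.

```python
HOURLY_LIMIT = 5000       # authenticated requests per hour, as stated above
REQUESTS_PER_ISSUE = 4    # fetch, label update, comment check, comment (estimate)


def plan_run(issue_numbers, cap=25):
    """Split candidates into this run's batch and a deferred remainder."""
    to_process = issue_numbers[:cap]
    deferred = issue_numbers[cap:]
    estimated_requests = len(to_process) * REQUESTS_PER_ISSUE
    return to_process, deferred, estimated_requests


def share_of_limit(requests):
    """Fraction of the hourly ceiling that a request count would consume."""
    return requests / HOURLY_LIMIT


if __name__ == "__main__":
    batch, deferred, cost = plan_run(list(range(1, 41)))
    print(len(batch), len(deferred), cost, f"{share_of_limit(cost):.0%}")
```

With 40 candidates, 25 are processed, 15 are deferred, and the estimated cost is 100 requests, or 2% of the ceiling.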

A scheduled run that "looks successful but actually stopped mid-way" is a weakness of unattended operation in general, not just triage. The habit of having the run self-check its counts at the end is written up separately (Catching silent no-op scheduled tasks with an end-of-run assertion).

Defensive auth — leaning from static PATs toward short-lived credentials

The more something runs unattended, the more your credential posture becomes the origin of incidents. The minimum for a fine-grained PAT is scoping it to what it operates on: Issues and Pull requests as Read and Write, Metadata as Read-only, and the target repositories explicitly enumerated under "Only select repositories." Under an organization, you also need org-side approval of PAT use. That much is standard.

It is worth tracking the 2026 direction too. The 2026-06-28 Claude Code update pointed toward replacing static API keys with WIF (Workload Identity Federation) — pairing with an OIDC-compliant provider to use short-lived, scoped credentials issued at request time. On the GitHub side, the same principle holds: the more you move from a long-lived PAT sitting in config toward credentials issued on demand and auto-expiring, the lower the scariest unattended risk — discovering after the fact that a key had leaked. The same update also improved MCP resilience and remote-MCP stability, which reduces dropped calls in scheduled MCP-driven setups.

You do not have to switch everything at once. The principle is one line: for unattended credentials, keep the exposure small and the lifetime short. Scope minimization and explicit repo selection you can do today; migrating to short-lived credentials belongs on the roadmap. That level of urgency is enough.

A near-final weekly triage prompt

The three defenses above (idempotency, frozen boundaries, request budget) fold into one scheduled prompt, meant to run unattended on Monday mornings.

Run the weekly issue triage for [owner/repo] autonomously to completion, without asking.

[BUDGET] Touch at most 30 issues per run across new classification and status checks.
Call list_issues as few times as possible. If the budget is reached, report the rest as "deferred" (numbers only).

[Step 1: Classify new issues (no overwrite)]
For open issues created in the last 7 days with no type: label, apply type and priority
using these boundaries. Do not touch issues that already have labels.
- feature=adds behavior that does not exist / improvement=makes existing behavior faster/easier
- bug=does not work as specified / question=usage/config inquiry / docs=wrong or missing text
- critical=data loss, billing, or lockout, no workaround / high=major flow blocked (workaround exists)
- medium=nice to have, minor impact / low=if time allows

[Step 2: Detect classification wobble (no relabel)]
Read issues that already have a type: label. If the boundary above suggests a different class,
do NOT change it — list "reclassify-candidate: #num current→proposed" only.

[Step 3: Stale status check (idempotent)]
For up to 10 open issues with no update in 90+ days and no status: triage, add status: triage.
Add a check-in comment only to issues where no existing comment contains
"<!-- cowork-triage:stale-check -->", and end the new comment with that marker.

[Step 4: Review-waiting PRs (no action)]
List open PRs with no reviewer assigned, numbers and titles only.

[End-of-run assertion]
The report MUST include: counts classified (by type) / reclassify-candidate count /
status-checked count / deferred count / approximate API calls used.
If deferred > 0, state "resume from the top next run."

The crux is that all four steps read current state before acting, and every write operation (comment, label) carries a skip condition. For the mechanics of wiring up the schedule itself, the Cowork scheduled-task automation walkthrough has the base pattern.
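If the report is captured in structured form, the end-of-run assertion can be verified mechanically as well. The sketch below is one possible check, assuming the report has been parsed into a dict. The field names (`classified_by_type`, `resume_note`, and so on) are hypothetical stand-ins for the items the prompt requires.

```python
REQUIRED_FIELDS = (
    "classified_by_type",     # e.g. {"bug": 2, "feature": 1}
    "reclassify_candidates",  # count
    "status_checked",         # count
    "deferred",               # count
    "approx_api_calls",       # count
)


def check_report(report, budget=30):
    """Return a list of problems; an empty list means the report passes."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in report]
    if problems:
        return problems
    touched = sum(report["classified_by_type"].values()) + report["status_checked"]
    if touched > budget:
        problems.append(f"budget exceeded: touched {touched} > {budget}")
    if report["deferred"] > 0 and not report.get("resume_note"):
        problems.append("deferred > 0 but no resume note")
    return problems


if __name__ == "__main__":
    report = {
        "classified_by_type": {"bug": 3, "feature": 2},
        "reclassify_candidates": 2,
        "status_checked": 10,
        "deferred": 4,
        "approx_api_calls": 60,
        "resume_note": "resume from the top next run",
    }
    print(check_report(report) or "report OK")
```

A run that stopped mid-way tends to produce a report with missing counts, so the missing-field check is the part most likely to catch a silent failure.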

Three months in: what worked and where it stops

The biggest win was that label wobble stopped. After switching to "no overwrite + candidate list," production labels never again rewrote themselves. Eyeballing two or three reclassify-candidate items each Monday turned out lighter than I expected and easy to keep up. Duplicate comments have not recurred once since the marker approach.

The limits, honestly: auto-classification still trails human judgment on boundary cases. An issue that "looks like a bug but is really a feature request" wobbles even with a decision table. That is precisely why I left relabeling as a candidate list rather than automating it — here, the realistic answer is to design for "a human glances at it," not to eliminate the judgment entirely. And on busy public repositories, automated comments can read as cold. This setup fits solo and small repositories with few contributors.

One last thing. If you have not gone unattended yet, the safest thing to try today is just Step 2 — "detect wobble, do not relabel." Have it read your existing classifications and flag the ones that look off, by number. It writes nothing, so it is safe, and it shows you at a glance how fuzzy your own label criteria were. From there, expand one idempotent write at a time, and you will move steadily toward triage that does not break even when no one is watching.
