Cowork / 2026-06-23 / Advanced

Stopping an Unattended Writer From Publishing the Same Article Twice

When a Cowork scheduled task generates articles every day, the real danger isn't a crash — it's quietly publishing a piece that overlaps with one from a few days ago. Here is a gate that compares slug similarity and the day's log before publishing, built from a near-miss I caught this morning.

This morning my Cowork scheduled task was about to start writing on extending the prompt cache TTL from five minutes to one hour. Had it slipped through and published, it would have produced a near-twin of claude-api-prompt-cache-5m-1h-two-tier-ttl-design, the piece I shipped half a year ago — same substance, different URL.

When you generate articles unattended every day, the scariest failure isn't a crash. A crash stops, lands in the log, and you notice it the next morning. The truly dangerous failure is the one that never stops: calmly publishing pieces that overlap with something from a few days back. Each one reads fine in isolation, so nobody catches it unless a human reviews every article. And to Google, this is the textbook behavior of a site mass-producing thin content.

Running four sites on unattended scheduled tasks for the past six months, I've come to see this "quiet duplication" as the single biggest thing eroding search standing. Today I want to share the countermeasure: a gate that, before publishing, compares slug proximity and the day's log and stops the run if a duplicate looks likely — with the actual code that's running in production.

Why a count check can't catch duplicates

A bilingual auto-publishing pipeline typically confirms, right before push, that the Japanese and English article counts match. That check is essential for avoiding 404s, but it does nothing for duplicate detection: if the new piece overlaps an old one, matching counts only confirm that the duplicate landed in both languages.

Exact title matching doesn't help either. An unattended task phrases each title slightly differently, so "Prompt cache TTL design" and "Cutting cost by extending cache lifetime" never match as strings — they sail right through.

To catch duplicates you have to compare what concept the article is about, not its wording. And fortunately we already hold a short string that summarizes the concept: the slug. A slug is a hyphen-separated list of English words, with the article's subject terms lined up directly. Compare those as sets of tokens and you get duplicate detection that's robust to rewording.

Turn the slug into a token set and measure Jaccard similarity

The idea is simple. Split the candidate slug and each existing slug on hyphens into sets of words. Measure how much the two sets overlap with the Jaccard coefficient (size of intersection ÷ size of union), and flag anything above a threshold as a suspected same-concept piece.

#!/usr/bin/env python3
"""dup_gate.py: check whether a candidate slug overlaps conceptually with existing articles.
Usage:
  python3 dup_gate.py <repo> <category> <candidate-slug>
Exit codes:
  0  no duplicate (safe to publish)
  1  suspected duplicate (re-angle or switch to enrichment)
  2  usage error or missing category directory
"""
import sys
from pathlib import Path

# Reduce a slug to its subject terms. Noise words aren't subjects, so drop them.
STOPWORDS = {
    "claude", "api", "sdk", "cli", "guide", "the", "a", "to", "for",
    "with", "and", "of", "in", "on", "how", "your", "cowork",
}
# Scores at or above this value are flagged as suspected same-concept articles.
THRESHOLD = 0.5

def slug_tokens(slug: str) -> set:
    """Split a slug on hyphens and keep subject terms of two or more characters."""
    parts = [p for p in slug.lower().split("-") if p]
    return {p for p in parts if p not in STOPWORDS and len(p) > 1}

def jaccard(a: set, b: set) -> float:
    """Size of intersection / size of union; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def main() -> None:
    if len(sys.argv) != 4:
        print(__doc__, file=sys.stderr)
        sys.exit(2)
    repo, category, candidate = sys.argv[1:4]
    cand = slug_tokens(candidate)
    ja_dir = Path(repo) / "content" / "articles" / "ja" / category
    if not ja_dir.is_dir():
        # A wrong path would otherwise glob nothing and report "no duplicate".
        print(f"directory not found: {ja_dir}", file=sys.stderr)
        sys.exit(2)
    hits = []
    for mdx in ja_dir.glob("*.mdx"):
        existing = mdx.stem
        if existing == candidate:
            continue  # the candidate's own file is not a duplicate of itself
        score = jaccard(cand, slug_tokens(existing))
        if score >= THRESHOLD:
            hits.append((score, existing))
    hits.sort(reverse=True)  # highest score first
    if hits:
        print(f"❌ suspected duplicate: {candidate}")
        for score, existing in hits[:5]:
            print(f"   {score:.2f}  {existing}")
        sys.exit(1)
    print(f"✅ no duplicate: {candidate}")
    sys.exit(0)

if __name__ == "__main__":
    main()

Run this on this morning's case and the candidate claude-api-prompt-cache-ttl-5m-to-1h-refresh-design trips on the existing claude-api-prompt-cache-5m-1h-two-tier-ttl-design at 0.67. The shared tokens are prompt, cache, 5m, 1h, ttl, design: six words against a union of nine. The wording differs, yet the number makes the conceptual overlap unambiguous.
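
To see which tokens drive a hit, it can help to print the intersection and the leftovers for one pair of slugs. What follows is a minimal sketch that copies the tokenizer and stopword list from dup_gate.py so it runs on its own; the variable names are only illustrative.

```python
# Copied from dup_gate.py so this example is self-contained.
STOPWORDS = {
    "claude", "api", "sdk", "cli", "guide", "the", "a", "to", "for",
    "with", "and", "of", "in", "on", "how", "your", "cowork",
}

def slug_tokens(slug: str) -> set:
    parts = [p for p in slug.lower().split("-") if p]
    return {p for p in parts if p not in STOPWORDS and len(p) > 1}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

candidate = "claude-api-prompt-cache-ttl-5m-to-1h-refresh-design"
existing = "claude-api-prompt-cache-5m-1h-two-tier-ttl-design"

cand_tokens = slug_tokens(candidate)
exist_tokens = slug_tokens(existing)
shared = sorted(cand_tokens & exist_tokens)
union_size = len(cand_tokens | exist_tokens)
score = jaccard(cand_tokens, exist_tokens)

# Show what overlaps and what each slug contributes on its own.
print("shared:           ", shared)
print("only in candidate:", sorted(cand_tokens - exist_tokens))
print("only in existing: ", sorted(exist_tokens - cand_tokens))
print(f"score: {len(shared)}/{union_size} = {score:.2f}")
```

Seeing the leftover tokens next to the shared ones makes it easier to judge whether a hit is a reworded duplicate or a different angle on the same feature.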

Why the threshold sits around 0.5

The threshold design decides whether this gate is usable. Set it too low and, since articles in the same category tend to share subject terms, everything trips and the automation jams. Set it too high and you miss the real duplicates that escaped by rephrasing.

Across roughly 2,500 slugs on my four sites, about 0.5 was the practical boundary. Concretely, the distribution looks like this.

Jaccard score	Typical relationship	Action to take
0.6 and up	Almost the same subject, a reworded duplicate	Block publishing. Change the angle fundamentally
0.45 to under 0.6	Adjacent topic on the same feature, large overlap	Consider enriching the existing article instead
0.3 to under 0.45	A different angle within the same category, healthy	Safe to publish. A good internal-link candidate
Below 0.3	Only loosely related	Safe to publish

The key is to treat the threshold as a band, not a single number. Block 0.6-and-up mechanically, and route the 0.45–0.6 gray zone to a different exit — "don't publish new, enrich the existing article." With that band in place, the gate stops being a mere gatekeeper and becomes traffic control for the whole operation.
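
The banded decision can be expressed as a small routing function. This is a sketch under assumptions, not the code running in my pipeline: the function name route and the action labels are hypothetical, and treating each band's lower edge as inclusive is my reading of the table.

```python
def route(score: float) -> str:
    """Map a Jaccard score to an action, following the bands in the table.

    Lower edges are treated as inclusive (an assumption): exactly 0.6 blocks,
    exactly 0.45 goes to enrichment.
    """
    if score >= 0.6:
        return "block"    # reworded duplicate: re-pick a different subject
    if score >= 0.45:
        return "enrich"   # gray zone: add to the existing article instead
    return "publish"      # 0.3 to 0.45 and below 0.3 are both safe to publish

# Illustrative scores only.
for s in (0.67, 0.5, 0.35, 0.1):
    print(f"{s:.2f} -> {route(s)}")
```

Keeping the two cut-offs in one place means a later adjustment to the band only touches this function.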

Cross-check with the day's log in a second stage

Slug similarity alone still misses one kind of duplicate: two different tasks on the same day (say, premium-tue and daily-content) heading toward the same subject before either has even committed a slug. If the slug doesn't exist as a file yet, the script above has nothing to compare against.

So as a second stage, cross-check the day's update log by subject term. My pipeline appends each task's Title and Slug to _updated_article_log/{site}/YYYY-MM-DD.txt, so grepping it with the candidate's tokens catches an "in-flight duplicate" that hasn't become a file yet.

#!/usr/bin/env bash
# dup_gate.sh: two-stage check, slug similarity (dup_gate.py) plus the day's log
set -euo pipefail
REPO="$1"; CATEGORY="$2"; CANDIDATE="$3"; LOG_DIR="$4"

# Stage 1: slug similarity against existing articles
python3 dup_gate.py "$REPO" "$CATEGORY" "$CANDIDATE" || exit 1

# Stage 2: are the subject terms already in today's log? (always derive the date in JST)
TODAY="$(TZ=Asia/Tokyo date +%Y-%m-%d)"
LOG="${LOG_DIR}/${TODAY}.txt"
if [ -f "$LOG" ]; then
  # Drop the same noise words as dup_gate.py. Tokens of 1-2 characters are also
  # dropped here (stricter than dup_gate.py), since they match too easily as substrings.
  # `|| true` stops pipefail from aborting the script when grep filters out every token.
  TOKENS=$(echo "$CANDIDATE" | tr '-' '\n' \
    | { grep -viE '^(claude|api|sdk|cli|guide|cowork|the|for|with|and|how|your)$' || true; } \
    | awk 'length > 2')
  HITS=0
  for t in $TOKENS; do
    # -F matches the token literally instead of as a regular expression
    if grep -qiF -- "$t" "$LOG"; then HITS=$((HITS+1)); fi
  done
  TOTAL=$(echo "$TOKENS" | grep -c . || true)
  # if 60%+ of the subject terms (rounded up) already appear in today's log, suspect an in-flight duplicate
  if [ "$TOTAL" -gt 0 ] && [ "$HITS" -ge $(( (TOTAL * 6 + 9) / 10 )) ]; then
    echo "❌ ${HITS}/${TOTAL} subject terms already in today's log. Suspected in-flight duplicate"
    exit 1
  fi
fi
echo "✅ two-stage check clear"

One caution on the log check: always derive the date with TZ=Asia/Tokyo date. Bare date uses the machine's own time zone, which on servers and containers is often UTC. Japan is nine hours ahead of UTC, so a task that runs before 09:00 Japan time would read the previous day's log and miss the day's duplicates entirely. That's a mistake I made once, letting two near-identical subjects through on the same day.
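
If part of the pipeline is in Python rather than shell, the same rule applies there. The sketch below is one way to derive the log path in Japan time regardless of the machine's time zone; the helper name todays_log and the "logs" directory are hypothetical, and it uses a fixed +09:00 offset on the grounds that Japan does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Japan has no daylight saving time, so a fixed UTC+9 offset is sufficient.
JST = timezone(timedelta(hours=9), "JST")

def todays_log(log_dir, now=None) -> Path:
    """Return <log_dir>/YYYY-MM-DD.txt with the date taken in Japan time.

    `now` should be a timezone-aware datetime; it defaults to the current
    moment in UTC. It exists mainly so the function can be tested.
    """
    now = now or datetime.now(timezone.utc)
    return Path(log_dir) / f"{now.astimezone(JST):%Y-%m-%d}.txt"

# 16:30 UTC on June 22 is already 01:30 on June 23 in Japan.
early_morning = datetime(2026, 6, 22, 16, 30, tzinfo=timezone.utc)
print(todays_log("logs", early_morning).name)
```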

After a hit, what do you re-pick?

Even when the gate stops it, an unattended task still has to decide "what do I write instead." Bolting on some arbitrary alternative just turns the task into the thin-content generator you were trying to avoid. Here's the branching I settled on in practice.

If the score is 0.6 or higher, judge the subject as already well covered and re-pick a subject with different terms from the day's news reference data. Move to a different feature or a different problem, not a reworded version of the same one.
If the score is in the 0.45–0.6 gray zone, stop publishing new and switch to enriching the existing article that tripped. Adding a single paragraph of recent firsthand experience to an existing piece — without minting a new URL — does more for the site's quality signals.
In both cases, log the re-pick as "duplicate avoided: original subject → new subject." Without it, tomorrow's task heads for the same subject again.
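
The logging step can be sketched as two small helpers: one appends the "duplicate avoided" line, the other reads a log back so a later run can skip subjects that were already rejected. The function names and the example slugs are hypothetical; only the line format comes from the rule above.

```python
from pathlib import Path

PREFIX = "duplicate avoided: "
ARROW = " → "

def record_repick(log_path, original: str, replacement: str) -> str:
    """Append 'duplicate avoided: original → replacement' to the log file."""
    line = f"{PREFIX}{original}{ARROW}{replacement}"
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line

def avoided_subjects(log_path) -> set:
    """Collect the original subjects from every 'duplicate avoided' line."""
    path = Path(log_path)
    if not path.is_file():
        return set()
    subjects = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith(PREFIX) and ARROW in line:
            original, _, _ = line[len(PREFIX):].partition(ARROW)
            subjects.add(original)
    return subjects
```

A task could call avoided_subjects on the previous day's log before choosing a subject and drop any candidate that appears in the returned set.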

That second exit pays off precisely because I run this solo as an indie developer. Rather than minting new URLs forever, thickening what already exists is, in my experience, the more direct route back to search standing. The duplicate gate is also the mechanism that lets an unattended task make that call on its own.

A first step to adopt it

If you generate articles unattended too, start by dropping in a single dup_gate.py and running each candidate slug through it right before push. Begin with a threshold of 0.5, run it across your own slug set for a few days, and adjust the band — toward 0.55 if false positives pile up, toward 0.45 if misses bother you. Even just logging what the gate stopped for the first few days, and looking at it, reveals how often your own pipeline reaches for nearly the same subject.
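
Before wiring the gate into a pipeline, one way to pick a starting band is to score every existing slug against its nearest neighbor and count how many fall into each band. The sketch below assumes the same tokenizer as dup_gate.py; the function names and the third sample slug are hypothetical.

```python
from collections import Counter
from itertools import combinations

STOPWORDS = {
    "claude", "api", "sdk", "cli", "guide", "the", "a", "to", "for",
    "with", "and", "of", "in", "on", "how", "your", "cowork",
}

def slug_tokens(slug: str) -> set:
    parts = [p for p in slug.lower().split("-") if p]
    return {p for p in parts if p not in STOPWORDS and len(p) > 1}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def band(score: float) -> str:
    """Label a score with the band it falls into (lower edges inclusive)."""
    if score >= 0.6:
        return "0.6+"
    if score >= 0.45:
        return "0.45-0.6"
    if score >= 0.3:
        return "0.3-0.45"
    return "<0.3"

def band_histogram(slugs) -> Counter:
    """Count slugs per band, using each slug's highest score against any other slug."""
    tokens = {s: slug_tokens(s) for s in slugs}
    best = {s: 0.0 for s in tokens}
    for a, b in combinations(tokens, 2):
        score = jaccard(tokens[a], tokens[b])
        best[a] = max(best[a], score)
        best[b] = max(best[b], score)
    return Counter(band(v) for v in best.values())

sample = [
    "claude-api-prompt-cache-ttl-5m-to-1h-refresh-design",
    "claude-api-prompt-cache-5m-1h-two-tier-ttl-design",
    "claude-api-batch-processing-cost",  # hypothetical slug for contrast
]
print(band_histogram(sample))
```

In practice the list would come from the article directory, for example the stems of the *.mdx files that dup_gate.py already scans. If a large share of existing slugs sit in the top band, the threshold is probably too low for that category.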

I'm still tuning it myself, but since adding this gate the "wait, there are two similar articles" accidents have clearly dropped. I hope it helps anyone wrestling with the same unattended-operations problem.
