Noticing From the Outside When a Scheduled Job Quietly Did Nothing

exit 0, but zero output. How to catch a silent no-op not from the job's own log but from an external heartbeat ledger and ground truth, written from running several sites on a nightly schedule as an indie developer.

Claude Code¹⁵⁸ automation⁶⁹ scheduled tasks⁵ observability¹³ reliability⁷

✦ Premium Article

One morning I opened the update log and a single line was missing — the line for a generation job that should have run overnight. The error log held nothing. The exit code was 0. So as far as the system was concerned, it had succeeded, yet not a single article had been added. There was no commit at that time in the git history either.

That state — "succeeded, but nothing was produced" — was the hardest to deal with. A crash is easier; at least you notice. This failed in silence, and worse, it believed it had succeeded. When you run several sites in sequence overnight as an indie developer, these silent no-ops slip in now and then. Today I want to write down how to notice them from the outside.

Start from the premise that "success" and "result" are different

We tend to treat the exit code as proof of a result. But exit 0 only guarantees that the last command returned no error. There are plenty of paths where the job finds nothing to generate and exits cleanly without creating anything.

In my case, the cause was usually one of these. A reference-data cat hit the wrong path and returned empty, so the job slid past topic selection in silence. The disk was full and a clone gave up halfway, yet the following steps still went through the motions. A model pause left generation empty-handed, with nothing to push. None of these are individual bugs; they share the same structure — a no-op that still counts as success.

So the goal of monitoring is not "did the job crash." It is "did the expected output actually come into existence." Narrowing to that one question makes the design much cleaner.

Why you cannot trust the job's own log

The first thing you want to reach for is writing a "done" line at the end of the job. I did exactly that. It does not work.

Silent no-ops happen precisely when the run takes a path off the main flow. And the "write the done log" step sits at the tail of that main flow. In other words, the situation that drops your log and the situation that drops your output share the same root. The one time it fails is the one time the line announcing failure never gets written. Self-reporting goes quiet at exactly the moment you need it most.

That realization became the starting point for the design. Move the observer outside the job. Base the verdict not on what the job says about itself, but on facts a third party — independent of the job — can see. Once you decide that, what to record becomes obvious.

✦

Thank you for reading this far.

Continue Reading

What follows includes implementation code, benchmarks, and practical content we hope you'll find useful. This site runs without ads — server and development costs are supported entirely by members like you. If it's been helpful, we'd be truly grateful for your support.

WHAT YOU'LL LEARN

✦Why a job's self-reported log cannot be trusted, and how to record a heartbeat to a JSONL ledger

✦The dead-man's-switch idea — absence as the signal — and the expected ⊆ observed reconciliation

✦A full watchdog that double-checks output against git ground truth instead of self-reports

Secure payment via Stripe · Cancel anytime

✦

Unlock This Article

Get full access to the rest of this article. Buy once, read anytime. This site is ad-free — your support goes directly toward keeping it running.

Unlock all articles with Membership →

Leave a heartbeat in an external ledger

Instead of a self-reported success, ask the job to leave only a heartbeat. A single line that says, "at this time I produced this output, this many of them," appended to a shared ledger. No judgment. Just a timestamped fact.

I made the ledger a JSONL file, one record per line. Append-only, so it is hard to corrupt and easy to follow by machine or by eye later. At the end of a generation job, you add just a few lines.

# heartbeat.sh — call at the tail of a generation job. No judgment; just stamp what happened.
# usage: heartbeat.sh <job_id> <repo_dir> <expected_min_articles>
set -uo pipefail
 
JOB_ID="$1"
REPO_DIR="$2"
EXPECTED_MIN="${3:-1}"
LEDGER="${HEARTBEAT_LEDGER:-$HOME/ops/heartbeat.jsonl}"
mkdir -p "$(dirname "$LEDGER")"
 
# Ground truth: count ja articles added by the latest commit (do not self-count)
cd "$REPO_DIR" || exit 0
ADDED=$(git show --name-only --pretty=format: HEAD 2>/dev/null \
  | grep -c '^content/articles/ja/.*\.mdx$')
HEAD_SHA=$(git rev-parse --short HEAD 2>/dev/null || echo "nohead")
TS=$(TZ=Asia/Tokyo date +%Y-%m-%dT%H:%M:%S%z)
 
printf '{"job":"%s","ts":"%s","sha":"%s","added":%s,"expected_min":%s}\n' \
  "$JOB_ID" "$TS" "$HEAD_SHA" "${ADDED:-0}" "$EXPECTED_MIN" >> "$LEDGER"

There is one deliberate choice here. The added value comes not from a number the job counted in its head, but from the files that actually entered the commit, via git show. The principle of excluding self-reports is enforced from the moment of stamping.

Absence is the signal — the dead-man's-switch idea

Here is the crux. Treat the heartbeat's absence, not its presence, as the signal. That is a dead-man's switch.

It helps to picture it this way. A train driver has to keep a handle held down; if they stop holding it for a set interval, the device brakes automatically. While the handle is held, nothing happens; the instant the hand lets go, it triggers. Our monitoring is the same — while jobs keep leaving heartbeats, it stays quiet, and it speaks up only the moment a beat goes missing.

In implementation, you prepare "the set of jobs that should have stamped today" (the expected set) and "the set that actually stamped" (the observed set), and check only whether expected ⊆ observed still holds.

The reconciliation runs in these steps.

Declare the job IDs scheduled for that weekday as the expected set
Read today's ledger lines and collect job IDs into the observed set
Subtract observed from expected and confirm the difference is empty
If it is not empty, report the missing job IDs as no-ops

Because you ask "did what was scheduled arrive," you also catch the complete silence of a job that never even started. Monitoring that waits for a crash log has its blind spot exactly there.

What the watchdog reads is ground truth, not self-reports

Even when a beat arrives, you cannot relax yet. There is a failure mode where the stamp exists but added is 0 — a hollow beat. So the watchdog cross-checks both the presence of the beat in the ledger and the ground truth in the repository.

Ground truth means: did the .mdx files under that repo's ja directory actually increase that day. Rather than trust the ledger's self-count, the watchdog recounts in its own cloned copy of the repo. When the two numbers disagree, it is the ledger that gets doubted.

The script below is the body of the monitor. Run it once a day, after every generation window has closed.

#!/usr/bin/env python3
# watchdog.py — reconcile expected job heartbeats against repository ground truth
import json, subprocess, sys
from datetime import datetime
from pathlib import Path
from zoneinfo import ZoneInfo
 
JST = ZoneInfo("Asia/Tokyo")
LEDGER = Path.home() / "ops" / "heartbeat.jsonl"
 
# Expected set per weekday (0=Mon .. 6=Sun). Declare rest days here to avoid false alarms.
EXPECTED_BY_WEEKDAY = {
    0: {"claudelab-premium", "gemilab-premium", "antigravitylab-premium", "rorklab-premium"},
}
for wd in range(1, 5):
    EXPECTED_BY_WEEKDAY[wd] = EXPECTED_BY_WEEKDAY[0]
EXPECTED_BY_WEEKDAY[5] = {"weekend-content"}
EXPECTED_BY_WEEKDAY[6] = {"weekend-content"}
 
REPO_OF = {
    "claudelab-premium": Path.home() / "repos" / "claudelab.net",
    "gemilab-premium": Path.home() / "repos" / "gemilab.net",
    "antigravitylab-premium": Path.home() / "repos" / "antigravitylab.net",
    "rorklab-premium": Path.home() / "repos" / "rorklab.net",
}
 
def today_jst():
    return datetime.now(JST).date()
 
def load_today_beats():
    beats = {}
    if not LEDGER.exists():
        return beats
    for line in LEDGER.read_text(encoding="utf-8").splitlines():
        try:
            rec = json.loads(line)
            ts = datetime.fromisoformat(rec["ts"])
        except (json.JSONDecodeError, KeyError, ValueError):
            continue
        if ts.astimezone(JST).date() == today_jst():
            beats[rec["job"]] = rec  # keep the last stamp for a given job
    return beats
 
def ground_truth_added(repo: Path) -> int:
    """The watchdog counts for itself. It does not trust the ledger's self-count."""
    if not (repo / ".git").exists():
        return -1  # no clone == cannot verify. Treat that as an anomaly too.
    since = today_jst().isoformat()
    out = subprocess.run(
        ["git", "-C", str(repo), "log", "--since", since,
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True
    ).stdout
    return sum(
        1 for ln in out.splitlines()
        if ln.startswith("content/articles/ja/") and ln.endswith(".mdx")
    )
 
def main():
    weekday = today_jst().weekday()
    expected = EXPECTED_BY_WEEKDAY.get(weekday, set())
    beats = load_today_beats()
 
    missing, hollow = [], []
    for job in sorted(expected):
        rec = beats.get(job)
        if rec is None:
            missing.append(job)          # no beat == complete silence
            continue
        repo = REPO_OF.get(job)
        truth = ground_truth_added(repo) if repo else int(rec.get("added", 0))
        if truth <= 0:
            hollow.append((job, rec.get("added", 0), truth))  # stamped, but not on the ground
 
    if not missing and not hollow:
        print(f"[OK] {today_jst()} confirmed output for all {len(expected)} expected jobs")
        return 0
 
    if missing:
        print(f"[ALERT] silent no-op (no beat): {', '.join(missing)}")
    for job, claimed, truth in hollow:
        print(f"[ALERT] hollow beat (self={claimed} / measured={truth}): {job}")
    return 1
 
if __name__ == "__main__":
    sys.exit(main())

The reason missing and hollow are kept apart is that they call for different responses. A job with no beat points to the scheduler side (a failed launch, a broken upstream dependency), while a hollow beat points to the generation logic (topic exhaustion, missing reference data). When the symptom narrows the cause, the next morning's investigation gets shorter.

Reducing false positives — how to define the expected set

This monitor lives or dies by the correctness of the expected set. If the expected set drifts, it keeps crying "no-op!" over a rest day's job, and eventually no one reads the alerts. Not turning it into the boy who cried wolf is the condition for keeping the monitor alive.

I try to hold to three things.

Declare the expected set by weekday, and update it in the same commit as the scheduler change. Never fix just one side
Make intentional pauses explicit, as a declaration of the empty set, so they are distinguishable from an implicit absence
On special days like the start of a month, confirm the expected set by hand the day before. Do not let exceptions through quietly

The second point matters most. Once you tell the monitor "nothing is scheduled today," the machine can tell whether an absence is normal or abnormal. You stop reacting to silence itself and start picking up only "silence that contradicts the plan."

The numbers I saw in operation, and where I tripped

Over about three weeks, I caught two silent no-ops and one hollow beat, all by the next morning. Previously I could not notice these until I happened to spot the missing line in the update log myself. Time from detection to root cause settled around 15 minutes on average, helped by the symptom split.

There were stumbles. At first I stamped the ledger date with a bare date, which read as the previous day in UTC and dropped records out of the same-day reconciliation. It only stabilized once I aligned both the stamp and the watchdog to TZ=Asia/Tokyo. Time zones, I was reminded, are the one thing to pin down first in any process whose subject is "when."

One more. If the watchdog itself dies, all monitoring goes silent. This is the eternal nesting problem of monitoring, and it cannot be fully removed. I drew a line: the watchdog alone carries an outward heartbeat (one line to a separate file when it manages to run), and I folded a check into my morning routine so that if it goes quiet for two days I notice by hand. The last rung is left to a person.

If you too run several jobs overnight, start by adding this one heartbeat line to the single job whose silence would hurt most. The expected set can grow later. I expanded mine from that first job, little by little. I hope it gives you a starting point if you are wrestling with the same quiet kind of failure.

Thank You for Reading

Claude Lab is ad-free, supported entirely by members like you. We publish practical guides daily with implementation code, benchmarks, and production-ready patterns. If you've found it useful, we'd love to have you on board.