API & SDK · 2026-06-25 · Advanced

When the Previous Run Hasn't Finished and the Next One Starts: Leases and Fencing Tokens for Scheduled Agents

A scheduled agent that runs on a fixed clock can overtake itself and start twice. From the moment a naive lock breaks to leases, fencing tokens, and bounded catch-up — worked through with the implementation I actually run.

"The previous job is still running, and the next one has already started." I noticed this one morning while watching the system that generates articles for the four Dolice Labs sites every day on a fixed schedule.

The trigger was a rate-limit increase for Claude Code. With more headroom, I tightened an interval I had kept generously wide, down to 45 minutes. Right after, one generation ran long, and while the previous run was still cleaning up, the next cron had already spun up a fresh process. Two articles were about to be pushed, and I froze looking at the git history.

Scheduled execution looks like a simple "run when the clock says so," but it has a weak spot at the exact moment a new run overtakes the previous one. Today I want to walk from where a naive lock breaks to fencing tokens, alongside the implementation I keep in place as an indie developer.

Why the next run overtakes the previous one

A cron entry, or a Cowork schedule, only promises one thing: start at this time. Nowhere does it promise start after the previous run finishes.

Most days, generation is comfortably faster than the interval, so the two never meet. In production, though, a transient API delay, a retry, an unusually large article, a network hiccup — any one of them lets a single run eat through its interval.

When that happens, the next trigger fires without hesitation. Both runs clone the same repository, reach for the same slug, and try to write the same file. One succeeds at git push; the other collides on rebase. On a bad day, two slightly different articles are born.
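To make the overtaking concrete, here is a small sketch that lays fixed triggers on a timeline and reports which runs were still in flight when a later trigger fired. The function name `findOverlaps` and the sample durations are illustrative assumptions, not measurements from the system described here.

```typescript
// Sketch: fixed-interval triggers do not wait for the previous run.
// findOverlaps and the sample durations are hypothetical, for illustration only.
interface Overlap {
  run: number;         // index of the run that was still in flight
  overtakenBy: number; // index of the trigger that fired before it finished
}

function findOverlaps(intervalMin: number, durationsMin: number[]): Overlap[] {
  const overlaps: Overlap[] = [];
  durationsMin.forEach((duration, i) => {
    const end = i * intervalMin + duration;
    // every later trigger that fires before this run ends overlaps with it
    for (let j = i + 1; j < durationsMin.length && j * intervalMin < end; j++) {
      overlaps.push({ run: i, overtakenBy: j });
    }
  });
  return overlaps;
}

// 45-minute interval; the second run takes 52 minutes and is overtaken.
console.log(findOverlaps(45, [20, 52, 18])); // [ { run: 1, overtakenBy: 2 } ]
```

Nothing in the trigger itself changes between the quiet case and the colliding one; only the duration does.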

The heart of the problem is that neither run knows the other exists. So the first thing we need is a way for them to announce themselves to each other.

Start with a naive lock (and watch it break)

The obvious move is to raise a flag at the start and lower it at the end.

// Naive version — this breaks in production
async function runOnce(store, jobId, body) {
  if (await store.read(jobId)) {
    return { ran: false, reason: "locked" };
  }
  await store.write(jobId, { running: true });
  try {
    await body();
    return { ran: true }; // same result shape as the "locked" branch
  } finally {
    await store.write(jobId, null); // release
  }
}

For short jobs this prevents most overlaps, and this is where I started too. But two holes remain.

The first is a lock that never releases. If the process is killed outright, or the VM itself dies, the finally block never runs. The flag stays raised, and from then on every run is rejected as "locked." On an unattended system, you notice days later. I lived through a similar freeze on a job that fetched AdMob reports every morning.

The second is nastier. A run that supposedly holds the lock may already be dead while the OS still thinks it is alive — or a long-paused process may wake up after everyone has given up on it and execute only its final write. The presence of a flag cannot stop this "zombie."


Leases and fencing tokens — stop the expired run before it writes

The classic pair that fills both holes is the lease and the fencing token. It is a well-known combination from distributed locking, and the idea transfers directly to scheduled agents.

A lease is a lock with an expiry. It lends out the job: "for the next ttl milliseconds, this job is yours." When the deadline passes, it expires automatically, with no explicit release required. That closes the first hole. Even after a crash, once the TTL passes, the next run can enter.

A fencing token is a number that always increases each time the lease is taken — by one, every acquisition. On its own it is just an integer; its strength is in how you use it. You give the write target itself the job of rejecting any write whose token is lower than the highest it has already accepted.

That stops the zombie. An old run that writes late, unaware its lease expired, is turned away one step before the write because its token is too low. Whether it believes it holds the lock is irrelevant. The resource itself knows who the current owner is.
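The timeline in the paragraphs above can be played out in a few lines. This is a minimal sketch, assuming an in-memory stand-in for the write target; `FencedResource` is a hypothetical name, and it applies the plain rule of rejecting any token lower than the highest one accepted.

```typescript
// Sketch: a write target that remembers the highest token it has accepted.
// FencedResource is a hypothetical in-memory stand-in for the real target.
class FencedResource {
  private highest = 0;
  readonly accepted: string[] = [];

  write(token: number, payload: string): boolean {
    if (token < this.highest) return false; // stale owner: turn it away
    this.highest = token;
    this.accepted.push(payload);
    return true;
  }
}

const target = new FencedResource();
// Run A acquired the lease with token 1, then stalled past its expiry.
// Run B acquired with token 2 and publishes first.
const bWrote = target.write(2, "article from run B");
// Run A wakes up and attempts its final write with the old token.
const aWrote = target.write(1, "article from run A");
console.log(bWrote, aWrote, target.accepted); // true false [ 'article from run B' ]
```

Run A never learns that its lease expired, and it does not need to: the comparison at the target decides the outcome.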

A minimal lease lock over any store

First, narrow what we ask of the store: just two things, a read and a conditional write (compare-and-set). Cloudflare Durable Objects, Redis, or a single Postgres row held with SELECT ... FOR UPDATE — the same abstraction covers all of them.

// lease.ts — a lease lock over any CAS-capable store
export interface LockRecord {
  owner: string;     // who holds it
  token: number;     // fencing token; always grows on acquire
  expiresAt: number; // expiry (epoch ms)
}
 
export interface LockStore {
  read(jobId: string): Promise<LockRecord | null>;
  // write next only if stored still equals expected; true on success
  cas(jobId: string, expected: LockRecord | null, next: LockRecord): Promise<boolean>;
}
 
export async function acquire(
  store: LockStore, jobId: string, owner: string, ttlMs: number, now = Date.now(),
): Promise<LockRecord | null> {
  const current = await store.read(jobId);
  // someone else holds a still-valid lease → don't run
  if (current && current.expiresAt > now && current.owner !== owner) {
    return null;
  }
  const next: LockRecord = {
    owner,
    token: (current?.token ?? 0) + 1, // increase the token monotonically
    expiresAt: now + ttlMs,
  };
  return (await store.cas(jobId, current, next)) ? next : null;
}

The reason for cas is that two runs trying to acquire at the very same instant must not both win. Even if something interleaves between the read and the write, the write fails when expected no longer matches. Replace this with a plain write and, in that one racing instant, both succeed.
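The racing instant is easy to reproduce with an in-memory store. The sketch below assumes a hypothetical `MemoryStore` whose `cas` compares and swaps without yielding in between; the types and `acquire` repeat the shapes from lease.ts so the block stands alone.

```typescript
// Sketch: an in-memory LockStore plus a two-way race on acquire.
// MemoryStore and raceTwoRuns are hypothetical names used for illustration.
interface LockRecord { owner: string; token: number; expiresAt: number; }
interface LockStore {
  read(jobId: string): Promise<LockRecord | null>;
  cas(jobId: string, expected: LockRecord | null, next: LockRecord): Promise<boolean>;
}

const same = (a: LockRecord | null, b: LockRecord | null): boolean =>
  a === b ||
  (a !== null && b !== null &&
    a.owner === b.owner && a.token === b.token && a.expiresAt === b.expiresAt);

class MemoryStore implements LockStore {
  private records = new Map<string, LockRecord>();
  async read(jobId: string): Promise<LockRecord | null> {
    const r = this.records.get(jobId);
    return r ? { ...r } : null;
  }
  async cas(jobId: string, expected: LockRecord | null, next: LockRecord): Promise<boolean> {
    // no await between compare and set, so nothing can interleave here
    if (!same(this.records.get(jobId) ?? null, expected)) return false;
    this.records.set(jobId, { ...next });
    return true;
  }
}

async function acquire(
  store: LockStore, jobId: string, owner: string, ttlMs: number, now = Date.now(),
): Promise<LockRecord | null> {
  const current = await store.read(jobId);
  if (current && current.expiresAt > now && current.owner !== owner) return null;
  const next: LockRecord = { owner, token: (current?.token ?? 0) + 1, expiresAt: now + ttlMs };
  return (await store.cas(jobId, current, next)) ? next : null;
}

async function raceTwoRuns(): Promise<(LockRecord | null)[]> {
  const store = new MemoryStore();
  const ttlMs = 80 * 60 * 1000;
  // both runs read "no lease" before either writes; cas lets only one through
  return Promise.all([
    acquire(store, "daily-article", "run-A", ttlMs),
    acquire(store, "daily-article", "run-B", ttlMs),
  ]);
}

raceTwoRuns().then((results) => console.log(results.map((r) => r?.owner ?? "skipped")));
```

Swapping `cas` for an unconditional `set` in this sketch makes both calls return a record with token 1, which is exactly the double win the conditional write is there to prevent.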

The lease is renewed periodically so it doesn't expire mid-work.

export async function renew(
  store: LockStore, jobId: string, held: LockRecord, ttlMs: number, now = Date.now(),
): Promise<LockRecord | null> {
  const current = await store.read(jobId);
  // if it is no longer our token, we have lost the lease
  if (!current || current.owner !== held.owner || current.token !== held.token) {
    return null;
  }
  const next = { ...current, expiresAt: now + ttlMs };
  return (await store.cas(jobId, current, next)) ? next : null;
}

For the renewal interval, my default is one-third of the TTL. Even if one renewal is missed, two-thirds of the TTL is still left at that point, so the next attempt lands before expiry.
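One way to run that renewal in the background is a timer that ticks at one-third of the TTL and raises a flag once the lease is gone. This is a sketch under assumptions: `startHeartbeat`, `Heartbeat` and `renewOnce` are hypothetical names, and `renewOnce` is expected to wrap a call to `renew` and resolve to false when it returns null.

```typescript
// Sketch: renew on a timer at one-third of the TTL and report when the lease is lost.
// startHeartbeat and renewOnce are hypothetical; renewOnce would wrap renew().
interface Heartbeat {
  isLost(): boolean; // true once the lease is known or presumed to be gone
  stop(): void;      // call when the job body finishes
}

function startHeartbeat(
  ttlMs: number,
  renewOnce: () => Promise<boolean>, // resolves false when the lease was superseded
): Heartbeat {
  let lost = false;
  let deadline = Date.now() + ttlMs; // local, conservative view of the expiry
  const timer = setInterval(async () => {
    const sentAt = Date.now();
    try {
      if (await renewOnce()) {
        deadline = sentAt + ttlMs; // measured from before the call, never later than the store's value
      } else {
        lost = true;
      }
    } catch {
      // a transient store error counts as a missed renewal; the next tick retries
    }
    if (Date.now() >= deadline) lost = true; // too many misses in a row
    if (lost) clearInterval(timer);
  }, Math.max(1, Math.floor(ttlMs / 3)));
  return { isLost: () => lost, stop: () => clearInterval(timer) };
}
```

The job body would check `isLost()` before each expensive step and stop early when it turns true, then call `stop()` on the way out. The flag is only advisory; the check at the write target remains the one that counts.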

Validate the fencing token right before the side effect

Holding the lease is only half of it. The real value of the fencing token lands right before an irreversible write — pushing an article, finalizing a charge.

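Have the write target remember the highest token it has accepted, and reject any write whose token is not higher. The version here also turns away an equal token, so each acquisition can publish at most once.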

// At the write target (the push destination) — turn away zombies by fact
async function acceptPublish(
  store: LockStore, resourceId: string, token: number, commit: () => Promise<void>,
): Promise<{ accepted: boolean; reason?: string }> {
  const seen = await store.read(`fence:${resourceId}`);
  if (seen && token <= seen.token) {
    return { accepted: false, reason: `stale token ${token} <= ${seen.token}` };
  }
  // advance the token first, then commit. Reverse the order and a crash
  // between commit and record lets the same token slip through twice
  const ok = await store.cas(`fence:${resourceId}`, seen, { owner: "", token, expiresAt: 0 });
  if (!ok) return { accepted: false, reason: "raced" };
  await commit();
  return { accepted: true };
}

On the run side, re-read the lease one more time just before the side effect, confirm your token is still current, and only then call the target.

async function publishGuarded(
  store: LockStore, jobId: string, held: LockRecord, doPublish: (token: number) => Promise<unknown>,
): Promise<void> {
  const current = await store.read(jobId);
  if (!current || current.token !== held.token) {
    throw new Error(`lease superseded: held=${held.token} now=${current?.token}`);
  }
  await doPublish(held.token); // acceptPublish makes the final call at the target
}

What matters here is the division of labor: the run-side check is only an early safety net, and the final fortress is at the target. Judge it on the run side alone and you will still let through a zombie that expired in the sliver of time between re-reading and pushing. Only a gatekeeper on the resource itself closes that sliver.

How to collapse missed runs — bounded catch-up

Once you stop overlaps, the opposite problem shows up. When a VM comes back after a long outage, several missed slots have piled up. Run all of them honestly and you generate a burst of articles right after recovery — and now you damage the quality signal by sheer volume.

So cap the catch-up. Of the missed slots, run only the most recent one (or the last few), and quietly drop the older ones into a log.

import logging

log = logging.getLogger(__name__)

# Bounded backfill for missed slots. Assumes idempotent execution.
def runs_to_execute(schedule, last_success, now, max_backfill=1):
    # schedule.occurrences is assumed to yield slot times in ascending order
    missed = list(schedule.occurrences(after=last_success, until=now))
    if not missed:
        return []
    skipped = max(0, len(missed) - max_backfill)
    if skipped:
        log.warning("catch-up: skipping %d stale slots for %s", skipped, schedule.job_id)
    # slice from the front: missed[-max_backfill:] would return everything when max_backfill is 0
    return missed[skipped:]  # only the most recent max_backfill

For jobs like article generation, where little of the value is tied to a particular slot, max_backfill=1 is my default. Putting out one careful article now serves the site's standing better than reclaiming what was missed. For jobs you cannot take back (finalizing a charge, notifying a customer), raise the cap and process each slot individually with an idempotency key.
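For the per-slot case, one possible shape of the idempotency key is the job id joined with the slot time. This is a sketch with hypothetical names (`slotKey`, `processSlots`); the in-memory `Set` stands in for whatever durable record of processed keys a real system would keep.

```typescript
// Sketch: process each missed slot at most once, keyed by job and slot time.
// slotKey and processSlots are hypothetical; the Set stands in for durable storage.
function slotKey(jobId: string, slot: Date): string {
  return `${jobId}:${slot.toISOString()}`;
}

function processSlots(
  jobId: string,
  slots: Date[],
  processed: Set<string>,
  handle: (slot: Date, key: string) => void,
): number {
  let handled = 0;
  for (const slot of slots) {
    const key = slotKey(jobId, slot);
    if (processed.has(key)) continue; // already done in an earlier attempt
    handle(slot, key);                // pass key downstream as the idempotency key
    processed.add(key);
    handled++;
  }
  return handled;
}

const done = new Set<string>();
const slots = [new Date("2026-06-25T00:00:00Z"), new Date("2026-06-25T00:45:00Z")];
const firstPass = processSlots("notify", slots, done, () => {});
const replay = processSlots("notify", slots, done, () => {}); // same slots after a restart
console.log(firstPass, replay); // 2 0
```

Because the key is recorded only after `handle` returns, a crash in between leads to a retry of that slot, which is why the downstream call needs to honor the same key.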

The numbers I saw, and my current defaults

A few observations, kept as figures. Interval 45 minutes, TTL 80 minutes (about 1.8x the interval), renewal every third of the TTL. With this, runs skipped due to overlap settled at one or two per week — roughly 2% of all runs. The next run is quietly waved off only while the previous one is running long.
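With a TTL longer than the interval, a run that finishes cleanly has to hand the lease back; otherwise the next trigger would still find a valid lease and skip. The lease code earlier in this article does not show that step, so the following is an assumption about one workable shape: a hypothetical `release` that expires the record in place and keeps the token, so the next acquire still increments it.

```typescript
// Sketch: hand the lease back on clean completion by expiring it in place.
// release is a hypothetical helper; the types repeat the lease.ts shapes.
interface LockRecord { owner: string; token: number; expiresAt: number; }
interface LockStore {
  read(jobId: string): Promise<LockRecord | null>;
  cas(jobId: string, expected: LockRecord | null, next: LockRecord): Promise<boolean>;
}

async function release(store: LockStore, jobId: string, held: LockRecord): Promise<boolean> {
  const current = await store.read(jobId);
  if (!current || current.owner !== held.owner || current.token !== held.token) {
    return false; // someone else took over; leave their record alone
  }
  // keep owner and token, zero the expiry: the counter never moves backwards
  return store.cas(jobId, current, { ...current, expiresAt: 0 });
}
```

Calling it from a `finally` block around the job body covers normal completion and thrown errors; a killed process still falls back to TTL expiry.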

A skip is not a failure. I count it on a separate scheduled.skipped_overlap counter, observed apart from real errors. Fold it into "failures" and a healthy wave-off keeps your alerts ringing until no one watches them anymore.

There is judgment in setting the TTL. Too short, and it expires mid-way through a legitimately long run, opening a window for a zombie. Too long, and after a genuine crash the next run waits a long time to enter. Measure the worst-case healthy run time and start at 1.5x to 2x of it, allowing for missed renewals — that is where I land today.

And the biggest payoff was a quieter git history. The hint of double pushes vanished, and I stopped dragging needless collisions into the pre-push gates. It is unglamorous, but on an unattended system, "doesn't jam, doesn't overlap" is itself the value.

The defaults I return to when unsure

As a starting point, here are the values I keep in place today. Adjust them to the nature of your job.

TTL at 1.5x to 2x of the worst-case healthy run

Base it on the longest run you have seen, not the usual one. Set it from the usual value and an occasional long run will expire mid-way, opening a window for a zombie. I recommend anchoring on the worst case.

Renewal interval at one-third of the TTL

Even if one renewal is missed, two grace windows remain before expiry. The speed of crash detection and the scarcity of wasted writes balance neatly here.

Catch-up cap decided by freshness

For thin-freshness jobs like article generation, use 1; for jobs you cannot take back, like charges or notifications, raise the cap with an idempotency key attached. That is the split I use. When in doubt, starting from a small cap is the safe move.
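The three defaults can be derived from a list of observed run times. A minimal sketch, assuming hypothetical names (`deriveDefaults`, `LeaseDefaults`) and made-up sample durations; the cap of 5 for irreversible jobs is a placeholder, not a recommendation.

```typescript
// Sketch: derive lease defaults from observed healthy run times.
// deriveDefaults, its parameters and the sample durations are hypothetical.
interface LeaseDefaults { ttlMs: number; renewEveryMs: number; maxBackfill: number; }

function deriveDefaults(
  healthyRunsMs: number[], // durations of past healthy runs
  ttlFactor = 1.75,        // somewhere in the 1.5x to 2x band
  irreversible = false,    // true for charges, notifications and the like
  backfillCap = 5,         // placeholder cap for irreversible jobs
): LeaseDefaults {
  if (healthyRunsMs.length === 0) throw new Error("need at least one observed run");
  const worst = Math.max(...healthyRunsMs); // anchor on the longest run, not the usual one
  const ttlMs = Math.ceil(worst * ttlFactor);
  return {
    ttlMs,
    renewEveryMs: Math.floor(ttlMs / 3),
    maxBackfill: irreversible ? backfillCap : 1,
  };
}

const minutes = (n: number): number => n * 60 * 1000;
const defaults = deriveDefaults([minutes(12), minutes(19), minutes(40)]);
console.log(defaults); // { ttlMs: 4200000, renewEveryMs: 1400000, maxBackfill: 1 }
```

With a worst case of 40 minutes this yields a 70-minute TTL and a renewal roughly every 23 minutes.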

A next step

First, just confirm whether your store truly offers an atomic compare-and-set. On eventually consistent stores like Cloudflare KV, two writers can each read the same value and overwrite one another, so the token's monotonic increase can break. Back the counter with Durable Objects, or a transactional row. The foundation is decided right there.

If you are growing a system that wakes up at night or early morning too, I hope this gives your design a place to stand. Thank you for reading.
