Two recovery/cleanup gaps left the coordinator's containers growing without
bound and jobs stuck longer than necessary:
- reclaimStaleFactoryLeases: a crashed/partitioned factory is detectably stale
(~90s without a heartbeat) long before its 900s lease TTL expires; the reaper
now reclaims held leases of stale (or vanished) holders within one stale window,
via the same fence + checkpoint-preserving path as the expiry reaper (refactored
into reclaimLeaseJob).
- sweepFleetGarbage: deletes ephemeral coordination state, on by default:
finished (expired/released) leases past a 24h TTL, and factory docs with no
heartbeat for 7d (a live host just re-registers). Terminal-job retention (jobs +
their runs/events/artifacts+blobs) is OPT-IN only via FLEET_GC_RETENTION_DAYS
(default 0 = never delete history). Every delete is best-effort so one failure
can't stall the sweep.
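
For reviewers, the stale-holder rule in the first bullet can be sketched roughly as below. This is an illustration only, not the code in this change: the `HeldLease` and `FactoryHeartbeat` shapes, the `selectReclaimableLeases` name, and the assumption that the stale window equals the ~90s figure are all hypothetical.

```typescript
// Hypothetical shapes; the real lease and factory documents may differ.
interface HeldLease {
  jobId: string;
  holderId: string; // factory that currently holds the lease
}

interface FactoryHeartbeat {
  factoryId: string;
  lastHeartbeatMs: number; // epoch milliseconds of the last heartbeat
}

// Assumed stale window, taken from the ~90s figure in the description.
const STALE_WINDOW_MS = 90_000;

// Return the held leases whose holder is stale or has no factory doc at all.
function selectReclaimableLeases(
  leases: HeldLease[],
  factories: FactoryHeartbeat[],
  nowMs: number,
  staleWindowMs: number = STALE_WINDOW_MS,
): HeldLease[] {
  const lastSeen = new Map<string, number>();
  for (const f of factories) {
    lastSeen.set(f.factoryId, f.lastHeartbeatMs);
  }

  return leases.filter((lease) => {
    const seen = lastSeen.get(lease.holderId);
    if (seen === undefined) {
      return true; // holder vanished: no factory doc left
    }
    return nowMs - seen > staleWindowMs; // holder stopped heartbeating
  });
}
```

In this sketch each selected lease would then go through the shared reclaim path (reclaimLeaseJob in this change), so fencing and checkpoint preservation stay in one place.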
Both are wired into the existing reaper loop: recovery scans run every 30s, the
deletion sweep is throttled to hourly. New repo helpers (listHeldLeases,
listFinishedLeasesOlderThan, deleteLease, listAllFactories, deleteFactory,
listTerminalJobsOlderThan, deleteRun, deleteEvent) back the new coordinator
functions. Covered by cleanup.test.ts + expanded reaper.test.ts.
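
A rough sketch of the two gating decisions described above: whether the hourly sweep is due on a given 30s tick, and whether history deletion is enabled at all. The `SweepThrottle` class and `retentionCutoffMs` helper are hypothetical names, and treating an unparsable FLEET_GC_RETENTION_DAYS as "disabled" is an assumption, not confirmed behavior.

```typescript
const RECOVERY_INTERVAL_MS = 30_000; // recovery scans run on every tick
const SWEEP_INTERVAL_MS = 60 * 60 * 1000; // deletion sweep runs at most hourly

// Hypothetical throttle: remembers when the sweep last ran.
class SweepThrottle {
  private lastSweepMs: number | null = null;
  private readonly intervalMs: number;

  constructor(intervalMs: number = SWEEP_INTERVAL_MS) {
    this.intervalMs = intervalMs;
  }

  // Returns true (and records the run) if a sweep is due at nowMs.
  shouldSweep(nowMs: number): boolean {
    if (this.lastSweepMs !== null && nowMs - this.lastSweepMs < this.intervalMs) {
      return false;
    }
    this.lastSweepMs = nowMs;
    return true;
  }
}

// Turn the raw FLEET_GC_RETENTION_DAYS value into a deletion cutoff.
// Returns null when retention is off: unset, 0, negative, or (assumed) invalid.
function retentionCutoffMs(raw: string | undefined, nowMs: number): number | null {
  const days = Number(raw ?? "0");
  if (!Number.isFinite(days) || days <= 0) {
    return null; // default: never delete history
  }
  return nowMs - days * 24 * 60 * 60 * 1000;
}
```

A loop built this way would run the recovery scans on every tick, call `shouldSweep` before the deletion sweep, and skip terminal-job deletion whenever the cutoff is null.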
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>