Two runner-side reliability improvements (additive, opt-out via env where relevant):
- Lease-renewal retry: a renewal lost to a transient blip (timeout/5xx/proxy)
previously let the lease expire and the coordinator reclaim a still-running job,
wasting the work. fleet_lease_renew now retries a TRANSIENT failure a few times
with a short backoff (AQ_FLEET_RENEW_RETRIES=2, AQ_FLEET_RENEW_BACKOFF_SEC=2);
a 409/412 FENCE is terminal and never retried.
- Graceful drain: a new `drain` command signals a running loop (via a $STATE/draining
flag) to stop taking NEW work (coordinator claim + local inbox), let in-flight
jobs finish, release their leases, and exit cleanly — ideal before a deploy.
Distinct from `stop` (kills workers immediately) and the `--drain`/--once startup
flag (drains the queue to empty). The flag is cleared at run-loop start and on exit.
bash -n clean on both files; ./selftest.sh PASS; `drain` smoke-tested.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>