fix(agent-queue): write ended= after PR/report so --once can't exit early

run_worker marked the job ended (testing) right after moving to testing/, BEFORE
opening/merging the PR and reporting to the coordinator. Once ended= is written,
_meta_active returns false, active_workers drops to 0, and "run --once" could
drain-exit (and callers could observe completion) while the background worker was
still opening the PR — a real race that made the PR-mode selftest flaky and could
free a concurrency slot prematurely in production.

Move the ended= write to the end of the success path (after PR open/merge +
testing/shipped reports). No behavior change on the autoship/ship path. Full
selftest now passes deterministically across repeated runs.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This commit is contained in:
saravanakumardb1 2026-05-31 23:36:51 -07:00
parent 41d8067724
commit fa1f1d1b30

View File

@ -971,7 +971,6 @@ run_worker() {
echo "verify_exit=$vrc" >> "$metaf" echo "verify_exit=$vrc" >> "$metaf"
if [[ $vrc -eq 0 ]]; then if [[ $vrc -eq 0 ]]; then
mv "$review_file" "$TESTING/" 2>/dev/null mv "$review_file" "$TESTING/" 2>/dev/null
_meta_end "$metaf" "testing" "$started"
echo "VERIFY PASSED — promoted to testing (QA): $(date)" >> "$logf" echo "VERIFY PASSED — promoted to testing (QA): $(date)" >> "$logf"
# PR mode (§PR): work passed verify — commit/push the job branch, open a PR, # PR mode (§PR): work passed verify — commit/push the job branch, open a PR,
# record the URL in the meta, and push it onto the coordinator run. # record the URL in the meta, and push it onto the coordinator run.
@ -995,14 +994,22 @@ run_worker() {
# is enabled, the factory's verify gate IS the test phase — advance # is enabled, the factory's verify gate IS the test phase — advance
# testing -> shipped (closing the testing->shipped gap autonomously). Default # testing -> shipped (closing the testing->shipped gap autonomously). Default
# off leaves the job resting at `testing` for the human review gate / ship. # off leaves the job resting at `testing` for the human review gate / ship.
local _ship_done=0
if fleet_enabled && _fleet_is_job "$job"; then if fleet_enabled && _fleet_is_job "$job"; then
fleet_report "$job" testing fleet_report "$job" testing
if [[ "${AQ_FLEET_AUTOSHIP:-0}" == 1 ]] && fleet_report "$job" shipped; then if [[ "${AQ_FLEET_AUTOSHIP:-0}" == 1 ]] && fleet_report "$job" shipped; then
mv "$TESTING/$job.md" "$SHIPPED/" 2>/dev/null mv "$TESTING/$job.md" "$SHIPPED/" 2>/dev/null
_meta_end "$metaf" "shipped" "$started" _meta_end "$metaf" "shipped" "$started"; _ship_done=1
echo "FLEET AUTOSHIP — testing -> shipped: $(date)" >> "$logf" echo "FLEET AUTOSHIP — testing -> shipped: $(date)" >> "$logf"
fi fi
fi fi
# Mark the concurrency slot done LAST — only after the PR open/merge + the
# coordinator reports above. Writing `ended=` here (not right after the
# testing/ move) keeps the worker counted as active until that work is
# complete, so `run --once` cannot drain-exit and a caller cannot observe
# the job as finished before the PR is actually opened/merged. Fixes a
# flaky race where the PR/report steps ran after the slot was freed.
[[ "$_ship_done" == 1 ]] || _meta_end "$metaf" "testing" "$started"
else else
echo "VERIFY FAILED (rc=$vrc): $(date)" >> "$logf" echo "VERIFY FAILED (rc=$vrc): $(date)" >> "$logf"
# verify ran on the review_file; retry policy may requeue it. # verify ran on the review_file; retry policy may requeue it.