12 Commits

Author SHA1 Message Date
saravanakumardb1
2ae1af1930 feat(agent-queue): resilient lease renewal + graceful drain
Two runner-side reliability improvements (additive, opt-out via env where relevant):

- Lease-renewal retry: a renewal lost to a transient blip (timeout/5xx/proxy)
  previously let the lease expire and the coordinator reclaim a still-running job,
  wasting the work. fleet_lease_renew now retries a TRANSIENT failure a few times
  with a short backoff (AQ_FLEET_RENEW_RETRIES=2, AQ_FLEET_RENEW_BACKOFF_SEC=2);
  a 409/412 FENCE is terminal and never retried.

- Graceful drain: a new `drain` command signals a running loop (via a $STATE/draining
  flag) to stop taking NEW work (coordinator claim + local inbox), let in-flight
  jobs finish, release their leases, and exit cleanly — ideal before a deploy.
  Distinct from `stop` (kills workers immediately) and the `--drain`/`--once` startup
  flag (drains the queue to empty). The flag is cleared at run-loop start and on exit.
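The retry rule in the first bullet can be sketched as follows. This is a minimal illustration, not the actual `fleet_lease_renew`: `renew_once` is a hypothetical stand-in that performs one renewal attempt and prints the HTTP status code, and the function name `renew_with_retry` and its return codes are assumptions. Only the two environment variables and the 409/412 fence rule come from the commit message.

```shell
# Defaults taken from the commit message.
AQ_FLEET_RENEW_RETRIES="${AQ_FLEET_RENEW_RETRIES:-2}"
AQ_FLEET_RENEW_BACKOFF_SEC="${AQ_FLEET_RENEW_BACKOFF_SEC:-2}"

# renew_once (hypothetical, supplied by the caller) must print the HTTP
# status of a single renewal attempt, e.g. 200, 409, 503.
renew_with_retry() {
  local attempt=0 code
  while :; do
    code="$(renew_once)" || code=000
    case "$code" in
      2??)     return 0 ;;   # renewed
      409|412) return 2 ;;   # fenced: terminal, never retried
    esac
    # Anything else (timeout, 5xx, proxy error) is treated as transient.
    if [ "$attempt" -ge "$AQ_FLEET_RENEW_RETRIES" ]; then
      return 1               # retries exhausted
    fi
    attempt=$((attempt + 1))
    sleep "$AQ_FLEET_RENEW_BACKOFF_SEC"
  done
}
```

With the defaults this makes at most three attempts (one initial plus two retries), sleeping two seconds between them.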

bash -n clean on both files; ./selftest.sh PASS; `drain` smoke-tested.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 12:24:45 -07:00
saravanakumardb1
14308fc382 fix(agent-queue): explicit-engine availability check, shutdown lease release, cache GC
Three runner-side robustness fixes (behavior-preserving, opt-out where relevant):

- resolve_engine now availability-checks an EXPLICIT engine (mirroring the
  engine-class path): if the requested engine's binary isn't installed it emits
  the no-engine signal so the job is marked no_engine, instead of invoking a
  missing binary and surfacing a generic crash.
- The run-loop INT/TERM trap now best-effort releases leases for in-flight
  building/ jobs (new fleet_release_all_active) so a stopped factory's jobs are
  reclaimable immediately rather than waiting out the ~900s lease TTL.
- _cache_prune GCs cached repo checkouts under $STATE/repos not accessed in
  AQ_FLEET_CACHE_TTL_DAYS days (default 14; 0 disables), run once at run-loop
  startup, to stop unbounded disk growth. Guards against rm on an empty base path.
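The cache GC in the last bullet might look roughly like this. It is a sketch under assumptions: the function name `cache_prune`, its argument order, and the use of `find -atime` are illustrative choices, not the real `_cache_prune`. The default of 14 days, the "0 disables" rule, and the empty-base guard come from the commit message.

```shell
# cache_prune BASE [TTL_DAYS]  (hypothetical signature)
# Removes first-level directories under BASE whose last access is older
# than TTL_DAYS days.
cache_prune() {
  local base="$1" ttl_days="${2:-14}"
  # Guard: never run a delete against an empty or root base path.
  [ -n "$base" ] && [ "$base" != "/" ] || return 1
  # Nothing cached yet.
  [ -d "$base" ] || return 0
  # 0 (or a non-numeric value) disables pruning.
  [ "$ttl_days" -gt 0 ] 2>/dev/null || return 0
  find "$base" -mindepth 1 -maxdepth 1 -type d -atime +"$ttl_days" \
    -exec rm -rf -- {} +
}
```

A caller would run it once at startup, for example `cache_prune "$STATE/repos" "${AQ_FLEET_CACHE_TTL_DAYS:-14}"`. Access times depend on filesystem mount options, so a real implementation may prefer a different freshness signal.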

bash -n passes on both files; ./selftest.sh PASS.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 11:51:56 -07:00
saravanakumardb1
79e6a8db00 feat(agent-queue): honor a job's explicit engine on fleet claim
When materializing a claimed fleet job, write `engine: <pick>` into the job
frontmatter (resolve_engine then runs it). Only a KNOWN engine
(devin/claude/codex/copilot) is honored — never the run's 'unknown'/class
placeholder — so an engineless job still falls back to the factory default
(AGENT_QUEUE_ENGINE). No behavior change for existing jobs.
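The "known engine only" rule can be illustrated with a small filter. The helper name `engine_frontmatter_line` is hypothetical; the four engine names and the `engine: <pick>` frontmatter key are taken from the commit message.

```shell
# Prints the frontmatter line for a known engine and nothing otherwise,
# so an engineless job keeps falling back to the factory default.
engine_frontmatter_line() {
  case "$1" in
    devin|claude|codex|copilot) printf 'engine: %s\n' "$1" ;;
    *) : ;;  # 'unknown', a class placeholder, or empty: write nothing
  esac
}
```

Writing nothing for an unrecognized value is what leaves existing jobs untouched.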

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 02:30:38 -07:00
saravanakumardb1
70c6d47a75 feat(agent-queue): report concrete engine + Devin session id on release
parse_usage now always emits engine=<engine>, and the devin arm extracts the
ATIF export's session_id. fleet-client includes engine + sessionId in the run
insights it reports, so the coordinator/UI can show the real engine (not
'unknown') and a session handle for traceability/recovery (devin --resume <id>).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 02:19:52 -07:00
saravanakumardb1
41d8067724 feat(agent-queue): M0 RU gate — skip the claim when the queue is unchanged
Adds AQ_FLEET_GATE (default OFF): the run loop point-reads the cheap per-product
queue version (GET /fleet/queue-state) and SKIPS the expensive /fleet/claim while
the version is unchanged and it is not mid-drain, with a periodic safety backstop
and fail-open-on-read-error so work is never stranded. Keeps POLL_SECONDS for
local job responsiveness rather than raising it globally. selftest 39b covers the
gate decisions; reconciles the M0 section of the dispatch redesign doc.
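The gate decision described above could be expressed as a pure function along these lines. Everything except `AQ_FLEET_GATE` is an assumption made for illustration: the function name, the argument list, and the skip counter used as the periodic backstop.

```shell
# gate_decide LAST CURRENT READ_OK DRAINING SKIPPED MAX_SKIPS
# Prints "claim" or "skip". All argument names are hypothetical.
gate_decide() {
  local last="$1" current="$2" read_ok="$3" draining="$4" skipped="$5" max_skips="$6"
  [ "${AQ_FLEET_GATE:-0}" = "1" ] || { echo claim; return; }  # gate off (default)
  [ "$read_ok" = "1" ] || { echo claim; return; }             # fail open on read error
  [ "$draining" = "1" ] && { echo claim; return; }            # mid-drain: keep claiming
  [ "$skipped" -ge "$max_skips" ] && { echo claim; return; }  # periodic safety backstop
  [ "$current" != "$last" ] && { echo claim; return; }        # queue version changed
  echo skip
}
```

Every uncertain case resolves to `claim`, so the only way to skip is a successful read that shows an unchanged version.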

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 23:19:01 -07:00
saravanakumardb1
c2dbbaf188 feat(agent-queue): report PR state (open/merged) on the run
2026-05-31 13:56:46 -07:00
saravanakumardb1
b442b95728 feat(agent-queue): per-repo verify + opt-in auto-merge for PR jobs
Claim now carries verify (drives the existing verify gate -> PR opens only if it
passes) and autoMerge (squash-merge via gh pr merge after the PR opens, non-fatal).
selftest covers both.
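A non-fatal auto-merge step might be sketched like this. The function name `maybe_auto_merge` and the `1`/other encoding of the claim's `autoMerge` field are assumptions; `GH_BIN` and the squash merge via `gh pr merge` come from the commit messages in this log.

```shell
GH_BIN="${GH_BIN:-gh}"

# maybe_auto_merge PR_URL AUTO_MERGE  (hypothetical helper)
# Attempts a squash merge when AUTO_MERGE is "1"; a failed merge is
# logged and ignored so it never fails the job.
maybe_auto_merge() {
  local pr_url="$1" auto_merge="$2"
  [ "$auto_merge" = "1" ] || return 0
  if ! "$GH_BIN" pr merge "$pr_url" --squash >/dev/null 2>&1; then
    echo "auto-merge failed for $pr_url (ignored)" >&2
  fi
  return 0
}
```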

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 06:17:28 -07:00
saravanakumardb1
cfbcc2da9d feat(agent-queue): PR mode — open a PR per fleet job (AQ_FLEET_PR)
When AQ_FLEET_PR=1 and a claimed fleet job carries a `repo`, run the agent in an
isolated checkout on branch aq/job/<fleetJobId> (off baseBranch), then on a passing
verify commit/push and `gh pr create`. The PR URL + branch are recorded in the meta
and reported on lease release (-> the coordinator stores them on the run).

- fleet-client: parse repo/baseBranch from the claim, carry them in frontmatter;
  fleet_report_insights now sends prUrl/branch.
- _fleet_pr_prepare (clone/fetch + branch, local-path aware, identity fallback) and
  _fleet_pr_open (commit/push/gh pr create). WIP checkpointing is skipped for PR jobs
  (the pushed branch is the durable artifact).
- New flags: AQ_FLEET_PR, AQ_FLEET_REPOS_DIR, GH_BIN. README documented.
- selftest: +1 case (bare-repo origin + gh stub) — branch pushed, PR opened, prUrl
  reported on release. Full self-test PASS.
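The prepare/open flow above could be approximated as below. This is an illustrative sketch, not the real `_fleet_pr_prepare` / `_fleet_pr_open`: the function names, argument order, and PR body text are assumptions, and the local-path and identity-fallback handling mentioned above is omitted. The branch pattern `aq/job/<fleetJobId>` and `GH_BIN` are from the commit message.

```shell
GH_BIN="${GH_BIN:-gh}"

# Branch name for a fleet job, per the commit message.
fleet_pr_branch() { printf 'aq/job/%s\n' "$1"; }

# pr_prepare DIR BASE JOB_ID  (hypothetical): branch off the base branch
# inside an existing checkout.
pr_prepare() {
  local dir="$1" base="$2" job_id="$3"
  git -C "$dir" fetch origin "$base" &&
    git -C "$dir" checkout -B "$(fleet_pr_branch "$job_id")" "origin/$base"
}

# pr_open DIR BASE JOB_ID TITLE  (hypothetical): commit, push, open the PR.
# gh prints the PR URL on stdout when it succeeds.
pr_open() {
  local dir="$1" base="$2" job_id="$3" title="$4" branch
  branch="$(fleet_pr_branch "$job_id")"
  git -C "$dir" add -A &&
    git -C "$dir" commit -m "$title" &&
    git -C "$dir" push -u origin "$branch" &&
    (cd "$dir" && "$GH_BIN" pr create --base "$base" --head "$branch" \
      --title "$title" --body "Fleet job $job_id")
}
```

A caller would run `pr_open` only after the verify gate passes and record its output as the PR URL.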

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 05:27:41 -07:00
saravanakumardb1
df65b7a245 feat(agent-queue): report testing + optional autoship to the fleet (close testing->shipped)
Previously the factory reported up to `review` and "shipping is always manual",
so a coordinator job never reached a terminal stage autonomously.

- On a passing local verify, always report `testing` to the coordinator so its
  stage reflects that QA passed (was stuck at `review`).
- New AQ_FLEET_AUTOSHIP=1: the factory's verify gate IS the test phase, so advance
  the coordinator job testing -> shipped and land it in shipped/ locally. This
  closes the testing->shipped gap for an autonomous submit -> shipped pipeline.
  Default off keeps the human review gate authoritative (job rests at testing).

selftest: +2 cases (autoship reports testing+shipped + lands in shipped/; autoship
OFF reports testing but withholds shipped). Full self-test PASS.
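The reporting rule can be summarized in a few lines. The helper name `stages_after_verify` and its output format (one stage per line) are assumptions for illustration; `AQ_FLEET_AUTOSHIP` and the stage names come from the commit message.

```shell
# stages_after_verify VERIFY_RC  (hypothetical helper)
# Prints the stages to report to the coordinator, one per line.
stages_after_verify() {
  local verify_rc="$1"
  [ "$verify_rc" -eq 0 ] || return 0   # failed verify: nothing to report here
  echo testing                         # always reported on a passing verify
  if [ "${AQ_FLEET_AUTOSHIP:-0}" = "1" ]; then
    echo shipped                       # opt-in: close testing -> shipped
  fi
}
```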

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 04:21:44 -07:00
saravanakumardb1
57831e3e7a feat(agent-queue): report run insights to the fleet + normalize API base
#1 fleet_report_insights: on a successful fleet run the factory now reports the
parsed cost/token/effort metrics (model, tokensIn/Out/cached, costUsd, turns,
toolCalls) plus the run result onto the coordinator run via POST
.../lease/release (which also frees the lease). parse_usage already extracted
these into the job meta; they were never sent. Engines that do not expose usage
locally (devin) still land result + endedAt.

#2 normalize AQ_FLEET_API: platform-service mounts fleet under /api, so a base
without it silently returned 404 on every call. Strip a trailing slash and
append /api unless already present, so AQ_FLEET_API=http://host:4003 works too.
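The normalization in #2 is small enough to show in full. The function name `normalize_fleet_api` is hypothetical; the behavior (strip one trailing slash, append `/api` unless present) follows the description above.

```shell
# normalize_fleet_api BASE_URL  (hypothetical helper)
normalize_fleet_api() {
  local base="${1%/}"              # strip a single trailing slash
  case "$base" in
    */api) printf '%s\n' "$base" ;;      # already ends in /api
    *)     printf '%s/api\n' "$base" ;;  # append the mount point
  esac
}
```

For example, `normalize_fleet_api http://host:4003` and `normalize_fleet_api http://host:4003/api/` both print `http://host:4003/api`.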

selftest: +2 cases (insights reported via lease/release; API-base normalization).
Full self-test PASS.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 02:27:51 -07:00
saravanakumardb1
fbecbe82b6 feat(agent-queue): fleet feature flags + shadow/dual-run (Phase 2)
Add a safe, reversible path to validate the fleet coordinator against the proven
single-host path BEFORE cutover, via three independently-toggleable flags:
  AQ_FLEET=0          pure offline (zero coordinator calls; offline path unchanged)
  AQ_FLEET_ROUTE=1    route_via_service: coordinator authoritative for claim (default = P2-S3)
  AQ_FLEET_ROUTE=0    local inbox authoritative (coordinator not used to source work)
  AQ_FLEET_SHADOW=1   dual-run (needs AQ_FLEET=1 + ROUTE=0): query coordinator in parallel,
                      record divergence, NEVER act on it
Precedence: SHADOW only when ROUTE=0; if ROUTE=1 + SHADOW=1, ROUTE wins (one-shot warning).
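The precedence rule can be sketched as a single resolver. The function name `fleet_mode` and the mode labels it prints are assumptions, as is `AQ_FLEET_SHADOW` defaulting to 0. This sketch also warns on every call rather than once.

```shell
# Prints the effective mode: offline, route, shadow, or local.
fleet_mode() {
  local fleet="${AQ_FLEET:-0}" route="${AQ_FLEET_ROUTE:-1}" shadow="${AQ_FLEET_SHADOW:-0}"
  if [ "$fleet" != "1" ]; then echo offline; return; fi
  if [ "$route" = "1" ]; then
    # ROUTE wins over SHADOW.
    [ "$shadow" = "1" ] && echo "warning: AQ_FLEET_SHADOW ignored while AQ_FLEET_ROUTE=1" >&2
    echo route; return
  fi
  if [ "$shadow" = "1" ]; then echo shadow; else echo local; fi
}
```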

lib/fleet-client.sh: fleet_route_enabled / fleet_shadow_enabled / fleet_flags_warn_once /
fleet_flags_state; fleet_shadow_claim (read-only — isolated `-shadow` factoryId +
dryRun, releases any real lease, never materializes), fleet_shadow_compare
(AGREE/DIVERGE/COORD_EMPTY/LOCAL_EMPTY → .state/fleet-shadow.log), fleet_shadow_report
(shadow:true, response never acted on), cmd_fleet_shadow_report (counts + agreement rate).
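The four comparison verdicts might be derived as follows. This is a sketch, not the real `fleet_shadow_compare`: the function names, the one-verdict-per-line input to the rate helper, and reporting the both-empty case as AGREE are all assumptions.

```shell
# shadow_compare LOCAL_JOB_ID COORD_JOB_ID  (hypothetical signature)
shadow_compare() {
  local local_id="$1" coord_id="$2"
  if [ -z "$coord_id" ] && [ -n "$local_id" ]; then echo COORD_EMPTY
  elif [ -z "$local_id" ] && [ -n "$coord_id" ]; then echo LOCAL_EMPTY
  elif [ "$local_id" = "$coord_id" ]; then echo AGREE   # includes both-empty (assumption)
  else echo DIVERGE
  fi
}

# Reads one verdict per line on stdin and prints "agreed/total".
shadow_agreement_rate() {
  awk '{ n++; if ($1 == "AGREE") a++ }
       END { if (n) printf "%d/%d\n", a, n; else print "0/0" }'
}
```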

agent-queue.sh: ROUTE-gate claim sourcing (claim only when route_via_service);
shadow hook after the local authoritative decision each iteration (best-effort,
error-swallowed — shadow can never fail a real job); `fleet-shadow-report` subcommand
+ help; resolved flags surfaced in `status`/`fleet-status`. tryClaim/fence/offline
paths unchanged.

Strictly side-effect-free on real job state: shadow never ships, quarantines, or
mutates real jobs. Offline path byte-for-byte unchanged when AQ_FLEET=0.

selftest.sh: +8 checks (shadow AGREE/DIVERGE/COORD_EMPTY, non-fatal 5xx, ROUTE
precedence, ROUTE=0 local-authoritative, fleet-shadow-report summary, shadow_report
unit). 60 prior checks unchanged → 68 total green. README + GIGAFACTORY_ROADMAP
document the flag model + cutover ladder.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 00:22:48 -07:00
saravanakumardb1
a10d4003e6 feat(agent-queue): fleet coordinator client library (lib/fleet-client.sh, P2-S3)
New sourced library implementing the factory side of the Phase-2 `fleet`
coordinator contract — curl-only + POSIX awk, reusing the Slice-4 HTTP/JSON
helper patterns, no new deps. Every function is a no-op unless AQ_FLEET=1.

- fleet_enabled / fleet_api (AQ_FLEET_API_CMD test seam) / _fleet_call
- fleet_detect_caps (reuses detect_capabilities) -> JSON caps array
- fleet_heartbeat (+ _maybe cadence): registration == first heartbeat
- fleet_claim: POST /fleet/claim, parse job id/bodyMd/leaseEpoch, materialize a
  transient local .md (fleet-job-id + fleet-lease-epoch in frontmatter)
- fleet_report: PATCH fenced stage transition {stage, leaseEpoch, checkpoint?};
  returns ok / FENCED(2, stale epoch -> self-abort) / degraded(1, unreachable)
- fleet_lease_renew / fleet_lease_release / fleet_renew_active (fenced)
- fleet_quarantine: park a reclaimed (fenced) job in failed/ for human triage
- cmd_fleet_status: register + print factory identity/caps
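The ok / FENCED(2) / degraded(1) return convention could map from HTTP status roughly as below. The function name `fleet_rc_for_status` is hypothetical, treating 409/412 as the fence response follows a later commit in this log, and treating every other non-2xx status as degraded is an assumption.

```shell
# fleet_rc_for_status HTTP_STATUS  (hypothetical helper)
# Returns 0 = ok, 2 = FENCED (stale epoch, caller should self-abort),
# 1 = degraded (coordinator unreachable or other error).
fleet_rc_for_status() {
  case "$1" in
    2??)     return 0 ;;
    409|412) return 2 ;;
    *)       return 1 ;;
  esac
}
```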

Report payloads carry only stage/epoch/checkpoint — never prompt/bodyMd/token.
2026-05-29 22:45:44 -07:00